April 15, 2025 | 9 min read

The Dataset of Human Behaviour

Observational traces at scale, why depth and telemetry both fall short, and how representing populations closes the structural gap.

Modern digital systems record human behavior continuously.

Every interaction with a product leaves telemetry
Every conversation with a company produces qualitative evidence
Every public discussion leaves traces of opinion across social platforms

Taken together, these traces form one of the largest observational records of human behavior ever created. Internally, we refer to this as the dataset of human behavior.

Despite its scale, most of this data remains structurally underutilized. The difficulty is no longer in collecting behavioral signals, but in representing the populations that generate them.

Section 1

Access without representation

Current systems interact with this dataset in two distinct ways, each with its own limitations.

School of thought 1

Methods that prioritize depth

Interviews, moderated research, and AI-assisted conversations attempt to uncover how individuals interpret products, technologies, and emerging ideas. These approaches provide contextual explanations of behavior, capturing motivations, frustrations, and reasoning in detail.

However, they operate within tightly constrained conditions. Questions are predefined, participants are selected deliberately, and the behavioral space explored is limited by design. The result is a set of explanations tied to specific scenarios rather than a broader account of how behavior is distributed across a population.

School of thought 2

Systems that prioritize scale

Modern products generate continuous streams of telemetry, where every click, navigation path, and interaction is recorded. Analytics platforms aggregate these signals to reconstruct user journeys and measure engagement across large populations.

While these systems offer extensive coverage, they lack intrinsic interpretability. Telemetry captures what occurs, but not the conditions under which it occurs. A drop-off in a workflow may reflect confusion, disagreement, or changing intent, but the data itself does not resolve this ambiguity. Over time, these signals accumulate into large behavioral repositories whose analysis depends on external interpretation.

In both cases, the systems access behavioral data, but do not represent the structure of the population producing it.

Section 2

The structural gap

Both approaches operate on the dataset of human behavior without modeling the system that generates it. Interviews provide narratives from individuals, while telemetry provides event-level observations at scale, but in neither case is the distribution of behavioral dispositions across the population explicitly represented.

The underlying question remains unresolved: what structure within the population produces these behaviors?

Section 3

Reconciling the trade-off

The distinction between accessing behavioral data and representing the population behind it clarifies the limitation of current approaches.

Systems built around scenario-based retrieval provide interpretability, but operate under constrained sampling and predefined questions. Systems built around telemetry and qualitative data provide broad coverage, but require extensive manual interpretation to extract meaning.

The result is a persistent trade-off between depth and scale.

A representational approach addresses this directly by operating at the level of the population itself, retaining the advantages of both without inheriting their constraints.

It allows comprehensive coverage across users and their behaviors without requiring constrained retrieval, while preserving interpretability through an explicit model of underlying structure rather than post hoc analysis.

Section 4

Representing the population

Within the SAPIENS framework, this question is addressed by treating behavioral identity as a structured composition of traits, priors, and activation cues. Individuals are not modeled as isolated data points, but as agents whose responses emerge from underlying dispositions.

We define a synthetic persona P_i as a context-space schema:

$$P_i = (V_{\text{tribe}},\; M_{\text{semantic}},\; \Delta_{\text{behavior}}) \quad (1)$$

Where:

V_tribe: traits of the tribe an individual belongs to, representing shared knowledge and norms, refined in the SGO training loop
M_semantic: the user's semantic memory, a compressed set of synthesized traits and past experiences from previous stimuli, capturing idiosyncrasies that distinguish the individual from the tribe
Δ_behavior: user-specific deviations from the tribal norm, expressed as a dynamic set of activation cues that are learned via the SGO training loop and act as conditional overrides (Tett et al., 2021)
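
As a rough illustration of the schema in (1), the sketch below holds the three components in a small Python data structure. Every name here (Persona, ActivationCue, disposition, the trait and cue labels, the numeric values) is a hypothetical stand-in chosen for this example, not part of SAPIENS, and the additive override rule is an assumption made only to keep the example concrete.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ActivationCue:
    """Hypothetical conditional override that fires when `condition` is in the context."""
    condition: str  # e.g. "discount_shown"
    trait: str      # the trait this override adjusts
    shift: float    # deviation from the tribal norm when the cue is active


@dataclass
class Persona:
    """Illustrative stand-in for P_i = (V_tribe, M_semantic, Delta_behavior)."""
    v_tribe: dict[str, float]                                 # shared tribe-level traits
    m_semantic: list[str] = field(default_factory=list)       # compressed individual memory
    delta_behavior: list[ActivationCue] = field(default_factory=list)

    def disposition(self, trait: str, context: set[str]) -> float:
        """Tribal baseline plus any overrides whose cue is active (assumed additive)."""
        baseline = self.v_tribe.get(trait, 0.0)
        shift = sum(
            cue.shift
            for cue in self.delta_behavior
            if cue.trait == trait and cue.condition in context
        )
        return baseline + shift


# Invented example values, for illustration only.
persona = Persona(
    v_tribe={"price_sensitivity": 0.6},
    m_semantic=["churned after a previous price increase"],
    delta_behavior=[ActivationCue("discount_shown", "price_sensitivity", -0.25)],
)

print(persona.disposition("price_sensitivity", set()))
print(persona.disposition("price_sensitivity", {"discount_shown"}))
```

With no active cue the persona falls back to the tribal baseline; when the cue is present in the context, the individual deviation is applied on top of it.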

This allows analysis to shift from observed actions to the distribution of behavioral tendencies across a population. Instead of asking which users performed a specific action, the system can examine how different segments are predisposed to respond under varying conditions.

Differences in sensitivity to pricing, social signals, or reputational dynamics are represented directly, rather than inferred post hoc.
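
To make the shift from observed actions to distributions of tendencies more concrete, here is a minimal sketch that asks what share of each segment is predisposed to respond as a condition varies. The segment names, the sensitivity scores, the threshold rule, and the function share_predisposed are all assumptions invented for this example; none of them come from SAPIENS or from real data.

```python
from collections import defaultdict

# Hypothetical personas: (segment, price_sensitivity in [0, 1]).
# The numbers are invented for illustration, not measured data.
PERSONAS = [
    ("students", 0.9), ("students", 0.7), ("students", 0.4),
    ("enterprise", 0.2), ("enterprise", 0.6),
]


def share_predisposed(personas, threshold):
    """Per segment, the fraction of personas whose sensitivity meets the threshold."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [predisposed, total]
    for segment, sensitivity in personas:
        if sensitivity >= threshold:
            counts[segment][0] += 1
        counts[segment][1] += 1
    return {segment: hit / total for segment, (hit, total) in counts.items()}


# Vary the condition (here, how strong a disposition the stimulus requires)
# and compare how the segments differ.
for threshold in (0.5, 0.8):
    print(threshold, share_predisposed(PERSONAS, threshold))
```

The query is posed against dispositions rather than logged events, so the same population can be examined under conditions that have not yet occurred in the telemetry.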

These dispositions are not static. Exposure to new information reshapes preferences, communities reorganize around emerging narratives, and behavioral segments evolve over time. Episodic memory captures this temporal dimension, while observational signals, drawn from telemetry, qualitative evidence, and public discourse, ground the model in empirical data.

Behavioral data is treated as canonical evidence of an underlying system whose structure can be modeled, not just a collection of isolated observations.

Section 5

From retrieval to representation

Once populations are represented explicitly, the process of behavioral analysis changes.

Insight no longer depends exclusively on retrieving explanations from samples or reconstructing patterns from logs. Instead, it emerges from examining how behavioral dispositions are distributed and how they interact with stimuli.

This resolves the traditional trade-off between depth and coverage. Context is no longer limited to small samples, and scale no longer comes at the cost of interpretability. Both are integrated within the same representational framework.

Behavioral data, in this setting, functions as a record of what has occurred, while population structure provides an account of the conditions under which those outcomes arise.

Used together in this way, the record and the structure behind it are what make simulation of human behavior operationally useful.

Conclusion

Toward population-level understanding

SAPIENS models identity through structured traits and priors.

Episodic memory captures how those identities evolve over time.

Observational signals anchor the system in real behavioral data.

Together, these components enable the representation of entire populations rather than isolated observations.

Modern organizations already possess a continuous, large-scale record of human behavior. The remaining challenge is to represent the system that produces this record.

Once populations are treated as observable systems, the focus of analysis extends beyond individual actions to the structural conditions that generate them. Behavioral data can then be interpreted not only as a record of events, but as evidence of an underlying system whose dynamics can be modeled and, under appropriate assumptions, simulated.

The objective is not simply to analyze behavior. It is to simulate the system that produces it.
