How to think about human-likeness in the age of autonomy
Limitations of the Waymo Open Sim Agent Challenge and ideas for how we can do better
This post examines a deceptively simple question: how should we measure whether an autonomous driving policy behaves like a human? Concretely, given a trained policy π, can we define an evaluation that compresses its behavior into a single scalar h, a quantity that meaningfully reflects its degree of human-likeness?
As a case study, I analyze the Waymo Open Sim Agent Challenge (WOSAC), a widely used benchmark for evaluating the realism of simulation agents (“sim agents”) in the field. I will go through what the WOSAC score captures and what it does not.
The big picture
Our objective is to quantify how human-like a trained policy π behaves. What do we mean by human-likeness? Multiple interpretations are possible. In the context of autonomous driving, I adopt a pragmatic definition: blending in. A policy is human-like if its behavior is statistically consistent with that of human drivers. The underlying assumption here is that policies that blend in are more likely to transfer reliably to real-world deployment, where they must coordinate seamlessly with humans.1

To ground the discussion, I use the Waymo Open Sim Agent Challenge (WOSAC) as a case study. It is worth noting that this post is not intended as a critique of WOSAC per se. I honestly think that WOSAC did a lot of things right, especially given that it was one of the first realism benchmarks for traffic simulation.2
But three years have passed since its release in 2023. A lot has changed in the field. We have large models capable of learning from diverse data sources. Reinforcement learning is maturing and becoming more reliable. Fast, grounded multi-agent simulators are now widely accessible.
So it’s worth pausing. What assumptions does WOSAC make? Which remain valid, and which have become limiting, or even counterproductive? I think answering these questions is important because if we are not measuring what matters, we waste human effort. And optimizing for a bad metric is worse than not optimizing at all!
I will focus on three questions:
Assumptions and interpretation. What assumptions underlie the WOSAC realism score, and how do they shape its interpretation as a measure of human-like behavior?
Optimizing for the score. Does there exist a meaningful notion of “human-like enough,” or is improvement toward ground truth always desirable? In other words, is a higher score always better?
Designing better evals. Given the identified limitations, what practical steps can we take to build more informative benchmarks for human-like driving behavior?
How WOSAC evaluates realism
Let me begin with a brief overview of how WOSAC evaluates policies. WOSAC defines the distributional realism of a policy π as a weighted linear combination of nine metrics, each computed using the following procedure:
Roll out the policy R = 32 times in simulation, collecting (x, y, heading) for each agent over T = 81 time steps, yielding a tensor of shape (1, R, T) per agent and feature.
Extract trajectory features and flatten across time to obtain (1, R * T) per agent.
Build histograms for each agent using the (1, R * T) simulated features.
Compute the log-likelihood of the ground-truth trajectory features (1, T) under the policy-induced distribution, then exponentiate to recover likelihoods.
Average the resulting likelihoods across time and agents to obtain a single scalar score per scenario.
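To make the procedure concrete, here is a minimal sketch of steps 2–5 for a single scalar feature. The function name, bin count, and eps value are illustrative choices of mine, not WOSAC's exact implementation:

```python
import numpy as np

def scenario_score(sim_feats, gt_feats, bins=20, value_range=(0.0, 1.0), eps=1e-6):
    """Per-scenario score for one scalar feature (illustrative sketch).

    sim_feats: (n_agents, R * T) flattened simulated feature values
    gt_feats:  (n_agents, T) ground-truth feature values
    """
    edges = np.linspace(*value_range, bins + 1)
    scores = []
    for sim, gt in zip(sim_feats, gt_feats):
        # Step 3: histogram over simulated rollouts; eps keeps empty bins
        # from producing -inf log-likelihoods
        counts, _ = np.histogram(sim, bins=edges)
        probs = (counts + eps) / (counts + eps).sum()
        # Step 4: likelihood of each ground-truth value under the histogram
        idx = np.clip(np.digitize(gt, edges) - 1, 0, bins - 1)
        scores.append(probs[idx].mean())
    # Step 5: average across agents (time was averaged via .mean() above)
    return float(np.mean(scores))
```

A policy whose rollouts concentrate mass where the ground truth lies scores near 1; a policy spreading mass uniformly scores near 1/bins.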
Visual illustration

Mathematical formulation
More precisely, we compute a meta-score for each scenario s as a weighted combination of the nine per-metric scores:

$$\text{meta}(s) = \sum_{m=1}^{9} w_m \,\text{score}_m(s), \qquad \sum_{m=1}^{9} w_m = 1,$$

where each per-metric score is the average likelihood of the ground-truth features under the policy-induced histograms:

$$\text{score}_m(s) = \frac{1}{A\,T} \sum_{a=1}^{A} \sum_{t=1}^{T} \hat{q}_{a,m}\!\left(\phi_m\left(x^{\text{GT}}_{a,t}\right)\right),$$

with A the number of evaluated agents, T = 81 timesteps, φ_m the feature extractor for metric m, and q̂_{a,m} the histogram density estimated from agent a's R = 32 simulated rollouts.
What WOSAC gets right
WOSAC models driving as an inherently multi-agent problem. Each agent’s behavior is entangled with the behavior of others, and the evaluation captures these interactions rather than reducing driving to independent trajectories.
By comparing feature distributions, WOSAC captures similarities in driving dynamics between human drivers and learned policies, rather than overfitting to exact trajectory matches.
Kinematic metrics, which account for 20% of the total score, provide a meaningful signal of motion realism and physical feasibility.
Note: For the rest of this section, I focus on evaluating autonomous driving policies rather than predicting human trajectories. The policy operates under high-level intent, like a human driving toward a destination, and is evaluated on how it navigates towards that goal. That being said, my points below apply to any setting where the goal is to assess high-quality, human-like driving.
Where WOSAC breaks down
We start with a controlled analysis on a dataset chosen to admit a perfect ground truth: no collisions and no off-road events. The full dataset does contain labeling noise, which I will address under L2 below. Code for all analyses below is provided in the branch dc/wosac_analysis in PufferDrive.
To establish reference points, consider two extreme baselines with the cleaned dataset. The upper bound is the ground-truth trajectory itself, repeated 32 times to match WOSAC’s rollout procedure. The lower bound is a random policy that represents the opposite extreme of behavior. These baselines provide rough anchors for interpreting the WOSAC score: the best- and worst-case scenarios under its evaluation framework.
Running WOSAC on these baselines yields a few observations:
Upper and lower bounds. On this dataset, the maximum achievable meta-score is 0.820, while the random policy attains a meta-score of 0.454.
Average displacement. As expected, ground-truth trajectories yield an ADE of 0, whereas the random policy produces a large ADE (≈27).


With these reference points in place, we can now turn to the limitations of WOSAC and explore where its evaluation breaks down.
L1. Arbitrary weighting of metrics and collision dominance in the meta-score
The meta-score aggregates nine metrics using fixed weights (exact weighting here). Collisions and off-road events alone contribute 50% of the total score, leading to a misalignment between the score’s stated goal and what it actually measures.
This weighting causes failure events to dominate the evaluation rather than nominal driving behavior. The issue is compounded by the presence of false collisions due to labeling errors in the data, which are nonetheless counted toward the meta-score. As a result, the score becomes sensitive to data artifacts rather than true driving quality. More on that later.
L2. Binary treatment of safety events
Collision and off-road likelihoods are modeled as binary at the rollout level (code entry point). A rollout is penalized if an event occurs at least once, regardless of frequency, duration, or severity. As a result, qualitatively different behaviors, ranging from a single brief contact to repeated collisions, can receive identical scores.
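In code, the reduction amounts to an `any` over the time axis. A sketch with my own variable names:

```python
import numpy as np

def rollout_collision_indicator(collision_per_step):
    """Reduce per-step collision flags to one bit per rollout (sketch).

    collision_per_step: bool array of shape (n_agents, n_rollouts, n_steps)
    """
    # 'any' over time: one brief contact and a pile-up look identical
    return collision_per_step.any(axis=-1)

# One collision vs. a collision at every step: same indicator
one_hit = np.zeros((1, 1, 81), dtype=bool)
one_hit[0, 0, 3] = True
every_step = np.ones((1, 1, 81), dtype=bool)
```

Both inputs reduce to the same `[[True]]` indicator, so the downstream likelihoods cannot distinguish them.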
Sensitivity analysis: To illustrate the implications of L1 & L2, we start from the ground-truth trajectory (green striped line), replicate it 32 times, and then inject artificial collisions before computing the metrics.
We consider two variants. In the first, we inject multiple collisions within a single rollout by adding collisions at multiple timesteps (blue line). In the second, we inject one collision per rollout across multiple rollouts (purple line). We then plot the collision likelihood and the resulting meta-score as the total number of injected collisions increases.
```python
# Shape is (n_agents, n_rollouts, n_steps)
if collisions_to_add_per_rollout > 0:
    # Always add to first step
    sim_collision_per_step[:, :collisions_to_add_per_rollout, 0] = True
if collisions_to_add_per_timestep > 0:
    # Always add to first rollout
    sim_collision_per_step[:, 0, :collisions_to_add_per_timestep] = True
# ... continue metrics computation as usual
```

I hope you would agree that a car colliding once is very different from one colliding ten times in the same period. Yet the scores do not reflect this nuance. Injecting many collisions within a single rollout leaves the collision likelihood, and consequently the meta-score, unchanged after the first event. In contrast, spreading the same number of collisions across multiple rollouts steadily increases the collision likelihood and lowers the meta-score. This means more severe behavior can sometimes be penalized less than milder but more distributed failures, highlighting the insensitivity that comes from treating events as binary.
As shown by the purple curve (left, x=32), introducing a single collision in every rollout reduces the meta-score by 30%, from 0.82 (ground truth) to 0.57 (purple line), even when all other aspects of behavior remain correct. Introducing a collision at every timestep in a single rollout, however, barely affects the meta-score (blue line).

In words, the collision likelihood reduces to the following logic:
If the ground truth collides at least once during a rollout, the policy should also collide at least once. The number of collisions does not matter; adding collisions across time has no effect (blue line).
If the ground truth does not collide during a rollout, the policy should never collide.
The same logic applies to the off-road likelihood metric, which carries the same weight as the collision likelihood: 25% each.
A natural follow-up question is whether this limitation matters in practice. The issue is asymmetric: false positives are more consequential because they force the policy to mimic noise or incorrect behavior. Binary treatment of safety events would be less problematic with perfectly clean data, but real datasets often contain labeling errors.
How much noise exists in the Waymo Open Motion Dataset (WOMD)? To quantify this, I randomly sample 5,000 scenarios from the WOMD training set and count ground-truth trajectories containing one or more collisions or off-road events.
Because this dataset represents nominal driving, collisions are almost certainly labeling errors. The figure below summarizes these false positives. On average, 4.2% of trajectories contain a collision, and 16.3% contain an off-road event.3 These rates are worse than I expected: when optimizing for realism while minimizing collisions, the metric can be misled by labeling noise, causing the policy to mimic incorrect behavior rather than truly human-like driving.

L3. Timing is not taken into account
WOSAC ignores temporal structure across all nine metrics (see original WOSAC code entry point, PufferDrive implementation).
This design can produce counterintuitive behavior. For example, a single mislabeled collision early in a rollout allows a policy that collides at every timestep to score higher than one that avoids collisions entirely, even though the latter is safer and more realistic. The same issue affects kinematic metrics, which compare distributions of linear speed and acceleration over the entire rollout without regard to timing. A policy that repeatedly overshoots or accelerates aggressively early can achieve the same score as one that briefly deviates later, even though the resulting behaviors are qualitatively very different.
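For contrast, a time-aware variant would bin the R rollouts separately at each timestep instead of pooling the whole horizon. A hedged sketch (names and bin settings are mine):

```python
import numpy as np

def time_aware_score(sim_feats, gt_feats, bins=8, value_range=(0.0, 1.0), eps=1e-6):
    """Build one histogram per timestep from the R rollouts, then score
    the ground truth at that same timestep (sketch, not WOSAC's metric).

    sim_feats: (R, T) simulated values for one agent and one feature
    gt_feats:  (T,) ground-truth values
    """
    edges = np.linspace(*value_range, bins + 1)
    R, T = sim_feats.shape
    scores = np.empty(T)
    for t in range(T):
        # Histogram over the R rollouts at this timestep only
        counts, _ = np.histogram(sim_feats[:, t], bins=edges)
        probs = (counts + eps) / (counts + eps).sum()
        idx = int(np.clip(np.digitize(gt_feats[t], edges) - 1, 0, bins - 1))
        scores[t] = probs[idx]
    return float(scores.mean())
```

With this scoring, a policy whose pooled feature distribution matches the ground truth but whose timing is reversed scores near zero, whereas pooled histograms cannot tell the two apart.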
→ The combined implications of L1, L2 and L3: Evaluating a superhuman driving policy
To illustrate the combined effects of L1–L3, we run a simple experiment on 5,000 randomly sampled training scenarios. Consider a superhuman policy: it behaves exactly like a human but never makes a mistake. We compare two policies. The first exactly reproduces the ground-truth trajectories (“π pattern-match”). The second reproduces the same trajectories but never collides or goes off-road (“π superhuman”), implemented by setting all collision and off-road indicators to zero.
Intuitively, π superhuman is perfect and would make an ideal controllable sim agent. Yet WOSAC gives it a meta-score of only 0.72, far lower than a policy that reproduces human errors. To put this in perspective, this score would rank 35th out of 36 on the current leaderboard, placing a perfect policy near the very bottom.4

Lastly, here are some minor issues.
L4. Discontinuities in function space
The histogram bins are not smoothed; the only adjustment is adding a small value to empty bins to avoid infinities. This produces noticeable discontinuities in the likelihoods. I am not certain how this affects the final scores, but it was surprising to see that small changes in the underlying value can sometimes cause large jumps in likelihood.
The figures below illustrate this effect for the ground-truth data. The feature displayed here is the distance to the nearest object.
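The effect is easy to reproduce in a toy example; the bin layout and eps below are my own choices, not WOSAC's exact settings:

```python
import numpy as np

eps = 1e-6
edges = np.linspace(0.0, 1.0, 21)   # 20 bins of width 0.05
sim = np.full(64, 0.52)             # all simulated mass lands in bin [0.50, 0.55)
counts, _ = np.histogram(sim, bins=edges)
probs = (counts + eps) / (counts + eps).sum()

def likelihood(x):
    # Unsmoothed lookup: likelihood is constant within a bin,
    # then jumps discontinuously at every bin edge
    idx = np.clip(np.digitize(x, edges) - 1, 0, len(probs) - 1)
    return probs[idx]

likelihood(0.549)  # inside the populated bin: close to 1
likelihood(0.551)  # 0.002 away, but in an empty bin: collapses to ~eps
```

A 0.002 shift in the feature value moves it across a bin edge and collapses the likelihood by roughly eight orders of magnitude.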
L5. Unclear upper bound and seemingly saturated performance.
WOSAC assumes independence over timesteps, so there is no natural upper bound. The metric is not guaranteed to stay between 0 and 1, which makes it unclear what the highest possible score really is.
We can obtain an empirical upper bound for a batch by replicating the ground-truth trajectory 32 times and computing the features, as in the analyses above. Normalizing scores by this value would make comparisons easier to interpret.
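As a sketch, the normalization itself is trivial; `gt_upper_bound` stands for the meta-score obtained by replaying the ground truth 32 times on the same scenario set:

```python
def normalized_meta_score(policy_score, gt_upper_bound):
    """Rescale so that 1.0 means 'as good as replaying the ground truth'
    on this scenario set (illustrative helper, not part of WOSAC)."""
    return policy_score / gt_upper_bound

# e.g. a raw meta-score of 0.72 against an empirical upper bound
# of 0.82 normalizes to about 0.88
```
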
When comparing meta-scores across different sets of scenarios, noticeable differences are apparent (see the histograms above). On the leaderboard, top entries appear tightly clustered, creating the impression of saturated performance, even though policy behavior can vary significantly across scenarios. This suggests that WOSAC may obscure meaningful differences between policies.

L6. Assumption that there is a single ground-truth
WOSAC treats each agent as having a single ground-truth trajectory. In reality, human driving is inherently multimodal: there is no single “correct” way to drive. This is a tricky issue. While it is true that multiple behaviors can be equally valid, WOSAC penalizes deviations from a single reference, reducing the score for perfectly reasonable variations. The authors note this themselves in the paper; this is why they excluded the ADE from the meta-score and adopted a statistical view.
Conclusion
Circling back to the questions at the start, here are my takeaways:
Assumptions and interpretation. WOSAC makes several hidden assumptions that shape how we interpret its realism score. If your goal is not to blindly mimic the dataset, many of these assumptions are actively harmful.
Optimizing for the score. There is no clear notion of “human-like enough.” Beyond a certain point, optimization merely tracks labeling noise rather than true human behavior. Given the dataset’s noise, it is unclear how much of the leaderboard reflects meaningful human-likeness versus overfitting to errors.
Building better evals. The limitations of WOSAC suggest practical improvements:
i) Separate safety from distributional realism / human-likeness. Treat collisions and off-road events as hard constraints for nominal driving rather than probabilistic outcomes. Superhuman policies that avoid mistakes can score only 0.72 under WOSAC, showing a misalignment in the current benchmark.
ii) Incorporate temporal dynamics explicitly. Evaluate not just where a vehicle goes, but when. Time-sensitive metrics are essential for coordination and planning.
iii) Close the sim-to-real loop. Benchmark metrics should reflect real-world human coordination, such as gap acceptance, yielding, and negotiation, rather than arbitrarily weighted objectives.
iv) Smooth density estimates. Ensure likelihood-based or binned metrics avoid discontinuities and counterintuitive jumps.
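As a minimal sketch of point (i), safety could be reported as a hard pass rate alongside realism rather than folded into one weighted scalar (all names here are illustrative):

```python
import numpy as np

def gated_report(realism_score, collided, offroad):
    """Report safety and realism as separate axes (sketch of suggestion i).

    realism_score: scalar from the distribution-matching metrics
    collided, offroad: bool arrays, one entry per rollout
    """
    # Fraction of rollouts with no safety event at all
    safety_pass_rate = 1.0 - np.mean(collided | offroad)
    return {"safety_pass_rate": float(safety_pass_rate),
            "realism": float(realism_score)}
```

Under such a report, a policy that drives like a human and never crashes gets a perfect safety pass rate together with its full realism score, instead of landing near the bottom of the leaderboard.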
I hope this post illuminates what WOSAC really measures and clarifies some of its limitations. If you’re interested in building better evals and benchmarks in this area, or if you have thoughts or questions about this post, I’d love to hear from you!
Acknowledgements
I am grateful to Eugene Vinitsky for feedback on this post and helpful discussions. I also thank Waël Doulazmi and Zilin Wang for comments on an earlier draft and Julian Hunt for help with initial prototyping of the sensitivity analysis.
For attribution in academic contexts, please cite this work as
@misc{cornelisse2025humanlikeness,
  title={Human-likeness metrics for autonomous agents: are we measuring the right thing?},
  author={Cornelisse, Daphne},
  year={2025},
  howpublished={Substack},
  note={Blog post analyzing the Waymo Open Sim Agent Challenge (WOSAC) realism benchmark}
}

1. Human-likeness in driving has other applications as well. In simulation, for instance, the goal is to create interactive agents that behave like humans and respond meaningfully to a driving policy. While this is not my primary focus, many of the limitations discussed here still apply.
2. The WOSAC authors explicitly frame the benchmark as a distribution-matching problem and do not position it as a deployment-oriented evaluation of driving policies. However, many of the limitations discussed here still apply in the simulation-agent setting. In practice, sim agents are engineering tools, and their usefulness depends not only on distributional realism but also on controllability and interpretability. Metrics that conflate nominal behavior with failure events, or collapse distinct behaviors into identical scores, limit our ability to diagnose failures and debug them efficiently.
3. Here, I include only collision and off-road events for agent IDs listed in the WOMD tracks_to_predict scenario metadata, as suggested by the authors. This subset targets less noisy agents; including the full set of controllable agents would introduce even more false positives.
4. The same holds for the 2024 leaderboard. The 2024 challenge has one notable difference: it does not include a likelihood metric for traffic lights. I was too lazy to count the entries, but in the 2024 leaderboard, a score of 0.72 lands you at the middle/bottom of the leaderboard, roughly at the same position as the diffusion model VBD.