Human behavior modeling in naturalistic driving: Trends and opportunities
Reflections on the '25 human road user modeling workshop in Baden-Baden
Last Tuesday, our workshop on models of human behavior for autonomous vehicle evaluation took place in the quaint town of Baden-Baden. I wanted to share a quick write-up of some thoughts and observations from the day.
Background: Why model human behavior?
To back up, models of human behavior are central to evaluating—and sometimes even training—autonomous vehicles. Depending on the use case, these models serve different purposes:
Human performance benchmarks: For example, in accident analysis, they can help answer whether a collision could reasonably have been avoided.
Simulation agents ("sim agents"): Here, the models act as other road users in a simulated environment. Think of NPCs in a game, except their behaviors must be realistic enough that simulation insights actually transfer to the real world.
Workshop highlights
The talks ranged from mechanistic approaches to AI/ML-based methods. A few themes stood out:
Training or fine-tuning sim agents in closed-loop settings greatly improves controllability.
We’re getting quite good at modeling nominal human driving behavior.
Progress has been made on modeling conflict behavior in small, controlled settings, but scaling to rare, high-stakes scenarios remains difficult.
Likely reasons: (1) conflict data is sparse and often sensitive; (2) human responses in high-stakes situations are highly diverse.
The field has developed quite a clear sense of what aspects of “human-like” behavior in traffic matter, and how to measure them.
The old divide between data-driven and mechanistic approaches seems to be fading, with the two perspectives increasingly coming together.

Opportunities & open questions
Operationalizing interpretability
One thing that struck me is how differently people define "interpretable." Mechanistic modelers often value their approach because they can fully understand the model's inner workings. But does interpretability mean transparency of parameters, or is it, for example, sufficient to understand how rewards shape an RL agent's behavior? If a mechanistic model has 20 interacting parameters, is it still interpretable? Since humans can only hold so many variables in mind at once, maybe interpretability should be defined relative to those limits. It seems valuable to develop criteria for interpretability that are tailored to specific use cases. Once the required type and level of interpretability are clear, choosing the right method may become straightforward.
Combining domain knowledge with scalable methods
The field has accumulated substantial domain expertise in the form of mechanistic driver models, and there is a clear opportunity to combine this knowledge with scalable ML/RL approaches, for instance by using mechanistic models as structured priors that learned components refine.
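As a minimal sketch of what such a combination could look like, consider the Intelligent Driver Model (IDM), a classic mechanistic car-following model, used as an interpretable base policy, with a hypothetical learned residual on top. The `residual_model` argument is an assumption for illustration; in practice it might be a small neural network trained in closed loop.

```python
import math

def idm_accel(v, v_lead, gap,
              v0=30.0,    # desired speed (m/s)
              T=1.5,      # desired time headway (s)
              a_max=1.5,  # maximum acceleration (m/s^2)
              b=2.0,      # comfortable deceleration (m/s^2)
              s0=2.0,     # minimum standstill gap (m)
              delta=4.0): # free-flow acceleration exponent
    """Intelligent Driver Model: mechanistic car-following acceleration.

    v:      ego speed (m/s)
    v_lead: lead-vehicle speed (m/s)
    gap:    bumper-to-bumper distance to the lead vehicle (m)
    """
    dv = v - v_lead  # closing speed
    # Desired dynamic gap: standstill gap + headway term + braking interaction term.
    s_star = s0 + max(0.0, v * T + v * dv / (2.0 * math.sqrt(a_max * b)))
    return a_max * (1.0 - (v / v0) ** delta - (s_star / gap) ** 2)

def hybrid_accel(v, v_lead, gap, residual_model=None):
    """Mechanistic IDM prior plus an optional learned residual (hypothetical)."""
    a = idm_accel(v, v_lead, gap)
    if residual_model is not None:
        a += residual_model(v, v_lead, gap)  # small data-driven correction
    return a
```

The appeal of this kind of hybrid is that the IDM parameters retain their physical meaning (desired headway, comfortable deceleration), while the residual captures behavior the mechanistic structure misses.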
Evaluation: Metrics and datasets
The Waymo Open Sim Agent Challenge (WOSAC) has been, and continues to be, a useful benchmark for measuring realism. However, current metrics leave room for improvement. For example, to what extent is a meta-score of 0.78 an improvement over a meta-score of 0.76? It’s hard to tell. Developing metrics that are both rigorous and interpretable will be key to building better models of human behavior in traffic. Similarly, more benchmarks and datasets focused on long-tail or conflict scenarios are sorely needed.
Overall, I’m excited to see how progress in controllability, interpretability, and measurability will further our understanding of human behavior in naturalistic traffic scenarios and, in turn, continue to improve the safety of autonomous vehicles.



