<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Daphne’s Substack]]></title><description><![CDATA[My personal Substack]]></description><link>https://daphnecornelisse.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!C86p!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab58238f-c763-4fa9-aed5-287273cdce24_144x144.png</url><title>Daphne’s Substack</title><link>https://daphnecornelisse.substack.com</link></image><generator>Substack</generator><lastBuildDate>Wed, 24 Jun 2026 08:13:22 GMT</lastBuildDate><atom:link href="https://daphnecornelisse.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Daphne Cornelisse]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[daphnecornelisse@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[daphnecornelisse@substack.com]]></itunes:email><itunes:name><![CDATA[Daphne Cornelisse]]></itunes:name></itunes:owner><itunes:author><![CDATA[Daphne Cornelisse]]></itunes:author><googleplay:owner><![CDATA[daphnecornelisse@substack.com]]></googleplay:owner><googleplay:email><![CDATA[daphnecornelisse@substack.com]]></googleplay:email><googleplay:author><![CDATA[Daphne Cornelisse]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[How to think about human-likeness in the age of autonomy ]]></title><description><![CDATA[Limitations of the Waymo Open Sim Agent Challenge and ideas for how we can do better]]></description><link>https://daphnecornelisse.substack.com/p/how-to-think-about-human-likeness</link><guid 
isPermaLink="false">https://daphnecornelisse.substack.com/p/how-to-think-about-human-likeness</guid><dc:creator><![CDATA[Daphne Cornelisse]]></dc:creator><pubDate>Sun, 08 Feb 2026 17:22:58 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!B8Ub!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a98874c-b5b9-4e8a-87d4-9d15d0cd48a3_1838x1000.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This post examines a deceptively simple question: <em>how should we measure whether an autonomous driving policy behaves like a human?</em> Concretely, given a trained policy <em>&#960;</em>, can we define an evaluation that compresses its behavior into a single scalar <em>h</em>, a quantity that meaningfully reflects its degree of human-likeness?</p><p>As a case study, I analyze the <a href="https://waymo.com/open/challenges/2025/sim-agents/">Waymo Open Sim Agent Challenge (WOSAC)</a>, a widely used benchmark in the field for evaluating the realism of simulation agents (&#8220;sim agents&#8221;). I will go through what the WOSAC score captures and what it does not. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!B8Ub!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a98874c-b5b9-4e8a-87d4-9d15d0cd48a3_1838x1000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!B8Ub!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a98874c-b5b9-4e8a-87d4-9d15d0cd48a3_1838x1000.png 424w, https://substackcdn.com/image/fetch/$s_!B8Ub!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a98874c-b5b9-4e8a-87d4-9d15d0cd48a3_1838x1000.png 848w, https://substackcdn.com/image/fetch/$s_!B8Ub!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a98874c-b5b9-4e8a-87d4-9d15d0cd48a3_1838x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!B8Ub!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a98874c-b5b9-4e8a-87d4-9d15d0cd48a3_1838x1000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!B8Ub!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a98874c-b5b9-4e8a-87d4-9d15d0cd48a3_1838x1000.png" width="1456" height="792" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a98874c-b5b9-4e8a-87d4-9d15d0cd48a3_1838x1000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:792,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:191553,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://daphnecornelisse.substack.com/i/186339519?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a98874c-b5b9-4e8a-87d4-9d15d0cd48a3_1838x1000.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!B8Ub!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a98874c-b5b9-4e8a-87d4-9d15d0cd48a3_1838x1000.png 424w, https://substackcdn.com/image/fetch/$s_!B8Ub!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a98874c-b5b9-4e8a-87d4-9d15d0cd48a3_1838x1000.png 848w, https://substackcdn.com/image/fetch/$s_!B8Ub!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a98874c-b5b9-4e8a-87d4-9d15d0cd48a3_1838x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!B8Ub!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a98874c-b5b9-4e8a-87d4-9d15d0cd48a3_1838x1000.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Human-likeness metrics for autonomous agents: are we measuring the right thing?</figcaption></figure></div><h3>The big picture</h3><p>Our objective is to quantify how human-like a trained policy &#960; behaves. What do we mean by human-likeness? Multiple interpretations are possible. In the context of autonomous driving, I adopt a pragmatic definition: <em>blending in</em>. A policy is human-like if its behavior is statistically consistent with that of human drivers. 
The underlying assumption here is that policies that blend in are more likely to transfer reliably to real-world deployment, where they must coordinate seamlessly with humans.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qYZv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F641df44b-9040-413d-8712-964bb97429f8_2290x646.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qYZv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F641df44b-9040-413d-8712-964bb97429f8_2290x646.png 424w, https://substackcdn.com/image/fetch/$s_!qYZv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F641df44b-9040-413d-8712-964bb97429f8_2290x646.png 848w, https://substackcdn.com/image/fetch/$s_!qYZv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F641df44b-9040-413d-8712-964bb97429f8_2290x646.png 1272w, https://substackcdn.com/image/fetch/$s_!qYZv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F641df44b-9040-413d-8712-964bb97429f8_2290x646.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qYZv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F641df44b-9040-413d-8712-964bb97429f8_2290x646.png" width="1456" height="411" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/641df44b-9040-413d-8712-964bb97429f8_2290x646.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:411,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:223329,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://daphnecornelisse.substack.com/i/186339519?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F641df44b-9040-413d-8712-964bb97429f8_2290x646.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qYZv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F641df44b-9040-413d-8712-964bb97429f8_2290x646.png 424w, https://substackcdn.com/image/fetch/$s_!qYZv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F641df44b-9040-413d-8712-964bb97429f8_2290x646.png 848w, https://substackcdn.com/image/fetch/$s_!qYZv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F641df44b-9040-413d-8712-964bb97429f8_2290x646.png 1272w, https://substackcdn.com/image/fetch/$s_!qYZv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F641df44b-9040-413d-8712-964bb97429f8_2290x646.png 1456w" sizes="100vw"></picture></div></a><figcaption class="image-caption">We seek an evaluation procedure <em>f(</em>&#960;<em>)</em> that maps a policy &#960; to a scalar value <em>h</em>, intended to measure human-likeness. Ideally, <em>h</em> correlates with deployment performance, such as the policy&#8217;s effectiveness in coordinating with human drivers on the road.</figcaption></figure></div><p>To ground the discussion, I use the Waymo Open Sim Agent Challenge (WOSAC) as a case study. This post is not intended as a critique of WOSAC per se. 
I honestly think that WOSAC got a lot of things right, especially given that it was one of the first realism benchmarks for traffic simulation.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> </p><p>But three years have passed since its <a href="https://arxiv.org/abs/2305.12032">release</a> in 2023. A lot has changed in the field. We have large models capable of learning from diverse data sources. Reinforcement learning has matured and become more reliable. Fast, grounded multi-agent simulators are now widely accessible.</p><p>So it&#8217;s worth pausing. What assumptions does WOSAC make? Which remain valid, and which have become limiting, or even counterproductive? I think answering these questions is important because if we are not measuring what matters, we waste human effort. And optimizing for a bad metric is worse than not optimizing at all!</p><p>I will focus on three questions:</p><ol><li><p><strong>Assumptions and interpretation.</strong> What assumptions underlie the WOSAC realism score, and how do they shape its interpretation as a measure of human-like behavior?</p></li><li><p><strong>Optimizing for the score.</strong> Is there a meaningful notion of &#8220;human-like enough,&#8221; or is improvement toward ground truth always desirable? In other words, is a higher score always better?</p></li><li><p><strong>Designing better evals.</strong> Given the identified limitations, what practical steps can we take to build more informative benchmarks for human-like driving behavior?</p></li></ol><h3>How WOSAC evaluates realism</h3><p>Let me begin with a brief overview of how WOSAC evaluates policies. 
WOSAC defines the <em>distributional realism</em> of a policy &#960; as a <em>weighted linear combination</em> of nine metrics, each computed using the following procedure:</p><ol><li><p>Roll out the policy <em>R = 32</em> times in simulation, collecting (x, y, heading) for each agent over <em>T = 81</em> time steps, yielding a tensor of shape <em>(1, R, T)</em> per agent.</p></li><li><p>Extract trajectory features and flatten across rollouts and time to obtain <em>(1, R * T)</em> per agent.</p></li><li><p>Build histograms for each agent using the <em>(1, R * T)</em> simulated features.</p></li><li><p>Compute the log-likelihood of the ground-truth trajectory features, of shape <em>(1, T)</em>, under the policy-induced distribution.</p></li><li><p>Average the log-likelihoods across time and agents and exponentiate the result, which gives a single scalar likelihood per metric and scenario.</p></li></ol><h3>Visual illustration</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xc24!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67a9867c-fec5-400b-bed5-bcb574740dc4_2120x664.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xc24!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67a9867c-fec5-400b-bed5-bcb574740dc4_2120x664.png 424w, https://substackcdn.com/image/fetch/$s_!xc24!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67a9867c-fec5-400b-bed5-bcb574740dc4_2120x664.png 848w, 
https://substackcdn.com/image/fetch/$s_!xc24!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67a9867c-fec5-400b-bed5-bcb574740dc4_2120x664.png 1272w, https://substackcdn.com/image/fetch/$s_!xc24!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67a9867c-fec5-400b-bed5-bcb574740dc4_2120x664.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xc24!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67a9867c-fec5-400b-bed5-bcb574740dc4_2120x664.png" width="1456" height="456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/67a9867c-fec5-400b-bed5-bcb574740dc4_2120x664.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:274125,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://daphnecornelisse.substack.com/i/186339519?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67a9867c-fec5-400b-bed5-bcb574740dc4_2120x664.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xc24!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67a9867c-fec5-400b-bed5-bcb574740dc4_2120x664.png 424w, 
https://substackcdn.com/image/fetch/$s_!xc24!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67a9867c-fec5-400b-bed5-bcb574740dc4_2120x664.png 848w, https://substackcdn.com/image/fetch/$s_!xc24!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67a9867c-fec5-400b-bed5-bcb574740dc4_2120x664.png 1272w, https://substackcdn.com/image/fetch/$s_!xc24!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67a9867c-fec5-400b-bed5-bcb574740dc4_2120x664.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Overview of likelihood computation for a single scenario (with one agent). We roll out the policy &#960; 32 times in the environment, recording the agent&#8217;s (x, y) position and heading over 81 time steps. These trajectories are then processed to derive relevant quantities, for example, the distance to the nearest object at each time step. Using the resulting 32 &#215; 81 values produced by the policy, we construct an empirical probability distribution by binning the values and computing counts per bin. The (log-)likelihood is then defined as the probability of observing the ground-truth (GT) value under this policy-induced distribution.</figcaption></figure></div><h3>Mathematical formulation</h3><p>More precisely, we compute a meta-score for each scenario <em>s</em> as follows:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;s_\\text{meta-score} = \\sum_{k=1}^{9} w_k \\cdot \\exp\\left(\\frac{1}{N \\cdot T} \\sum_{i=1}^{N} \\sum_{t=1}^{T} \\log p_k(f_{i,t}^{\\text{GT}} \\mid \\mathcal{H}_k^{\\pi})\\right)&quot;,&quot;id&quot;:&quot;CDCAZEZZAN&quot;}" data-component-name="LatexBlockToDOM"></div><p>where</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8CM1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F452bb6d7-11ec-4b80-835a-01036be29f8d_1468x790.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8CM1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F452bb6d7-11ec-4b80-835a-01036be29f8d_1468x790.png 424w, 
https://substackcdn.com/image/fetch/$s_!8CM1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F452bb6d7-11ec-4b80-835a-01036be29f8d_1468x790.png 848w, https://substackcdn.com/image/fetch/$s_!8CM1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F452bb6d7-11ec-4b80-835a-01036be29f8d_1468x790.png 1272w, https://substackcdn.com/image/fetch/$s_!8CM1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F452bb6d7-11ec-4b80-835a-01036be29f8d_1468x790.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8CM1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F452bb6d7-11ec-4b80-835a-01036be29f8d_1468x790.png" width="1456" height="784" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/452bb6d7-11ec-4b80-835a-01036be29f8d_1468x790.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:784,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:259061,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://daphnecornelisse.substack.com/i/186339519?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F452bb6d7-11ec-4b80-835a-01036be29f8d_1468x790.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!8CM1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F452bb6d7-11ec-4b80-835a-01036be29f8d_1468x790.png 424w, https://substackcdn.com/image/fetch/$s_!8CM1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F452bb6d7-11ec-4b80-835a-01036be29f8d_1468x790.png 848w, https://substackcdn.com/image/fetch/$s_!8CM1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F452bb6d7-11ec-4b80-835a-01036be29f8d_1468x790.png 1272w, https://substackcdn.com/image/fetch/$s_!8CM1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F452bb6d7-11ec-4b80-835a-01036be29f8d_1468x790.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>What WOSAC gets right</h3><ul><li><p>WOSAC <strong>models driving as an inherently multi-agent problem</strong>. Each agent&#8217;s behavior is entangled with the behavior of others, and the evaluation captures these interactions rather than reducing driving to independent trajectories.</p></li><li><p>By comparing feature <strong>distributions</strong>, WOSAC captures similarities in driving dynamics between human drivers and learned policies, rather than overfitting to exact trajectory matches.</p></li><li><p>Kinematic metrics, which account for 20% of the total score, provide a meaningful signal of <strong>motion realism and physical feasibility</strong>.</p></li></ul><blockquote><p><strong>Note: </strong>For the rest of this section, I focus on evaluating <em>autonomous driving policies</em> rather than <em>predicting</em> human trajectories. The policy operates under <strong>high-level intent</strong>, like a human driving toward a destination, and is evaluated on how it navigates <em>towards</em> that goal. 
That being said, my points below apply to any setting where the goal is to assess high-quality, human-like driving.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Hx8u!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc02282c-5ac2-44f3-9f06-fa59d5675310_2268x984.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Hx8u!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc02282c-5ac2-44f3-9f06-fa59d5675310_2268x984.png 424w, https://substackcdn.com/image/fetch/$s_!Hx8u!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc02282c-5ac2-44f3-9f06-fa59d5675310_2268x984.png 848w, https://substackcdn.com/image/fetch/$s_!Hx8u!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc02282c-5ac2-44f3-9f06-fa59d5675310_2268x984.png 1272w, https://substackcdn.com/image/fetch/$s_!Hx8u!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc02282c-5ac2-44f3-9f06-fa59d5675310_2268x984.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Hx8u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc02282c-5ac2-44f3-9f06-fa59d5675310_2268x984.png" width="1456" height="632" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cc02282c-5ac2-44f3-9f06-fa59d5675310_2268x984.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:632,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:172846,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://daphnecornelisse.substack.com/i/186339519?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc02282c-5ac2-44f3-9f06-fa59d5675310_2268x984.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Hx8u!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc02282c-5ac2-44f3-9f06-fa59d5675310_2268x984.png 424w, https://substackcdn.com/image/fetch/$s_!Hx8u!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc02282c-5ac2-44f3-9f06-fa59d5675310_2268x984.png 848w, https://substackcdn.com/image/fetch/$s_!Hx8u!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc02282c-5ac2-44f3-9f06-fa59d5675310_2268x984.png 1272w, https://substackcdn.com/image/fetch/$s_!Hx8u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc02282c-5ac2-44f3-9f06-fa59d5675310_2268x984.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Assumption: We evaluate the human-likeness of goal-conditioned driving policies.</figcaption></figure></div><h3>Where WOSAC breaks down</h3><p>We start with a controlled analysis on a <a href="https://huggingface.co/datasets/daphne-cornelisse/pufferdrive_wosac_val_clean">dataset chosen to admit a <strong>perfect ground truth</strong></a><strong>: no collisions and no off-road events.</strong> The full dataset does contain labeling noise, which I will address in Section L2. 
Code for all analyses below is provided in the branch <code>dc/wosac_analysis</code> in <a href="https://github.com/Emerge-Lab/PufferDrive">PufferDrive</a>.</p><p>To establish reference points, consider two extreme baselines with the cleaned dataset. The <strong>upper bound</strong> is the <strong>ground-truth trajectory</strong> itself, repeated 32 times to match WOSAC&#8217;s rollout procedure. The <strong>lower bound</strong> is a <strong>random</strong> policy that represents the opposite extreme of behavior. These baselines provide rough anchors for interpreting the WOSAC score: the best- and worst-case scenarios under its evaluation framework.</p><p>Running WOSAC on these baselines yields a few observations:</p><ul><li><p><strong>Upper and lower bounds.</strong> On this dataset, the maximum achievable meta-score is 0.820, while the random policy attains a meta-score of 0.454.</p></li><li><p><strong>Average displacement.</strong> As expected, ground-truth trajectories yield an ADE of 0, whereas the random policy produces a large ADE (&#8776;27).</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1PeV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0be1237a-cacc-442c-85ab-74f0fa3a591f_4442x1442.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1PeV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0be1237a-cacc-442c-85ab-74f0fa3a591f_4442x1442.png 424w, https://substackcdn.com/image/fetch/$s_!1PeV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0be1237a-cacc-442c-85ab-74f0fa3a591f_4442x1442.png 848w, 
https://substackcdn.com/image/fetch/$s_!1PeV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0be1237a-cacc-442c-85ab-74f0fa3a591f_4442x1442.png 1272w, https://substackcdn.com/image/fetch/$s_!1PeV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0be1237a-cacc-442c-85ab-74f0fa3a591f_4442x1442.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1PeV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0be1237a-cacc-442c-85ab-74f0fa3a591f_4442x1442.png" width="1456" height="473" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0be1237a-cacc-442c-85ab-74f0fa3a591f_4442x1442.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:473,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:373171,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://daphnecornelisse.substack.com/i/186339519?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0be1237a-cacc-442c-85ab-74f0fa3a591f_4442x1442.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1PeV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0be1237a-cacc-442c-85ab-74f0fa3a591f_4442x1442.png 424w, 
https://substackcdn.com/image/fetch/$s_!1PeV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0be1237a-cacc-442c-85ab-74f0fa3a591f_4442x1442.png 848w, https://substackcdn.com/image/fetch/$s_!1PeV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0be1237a-cacc-442c-85ab-74f0fa3a591f_4442x1442.png 1272w, https://substackcdn.com/image/fetch/$s_!1PeV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0be1237a-cacc-442c-85ab-74f0fa3a591f_4442x1442.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Left</em>: Average realism meta-score for the 229 perfect scenarios. GT (green) has a meta-score of ~0.82; a random agent has a score of ~0.42. <em>Center</em>: The meta-score can be broken down into three categories: kinematic, interactive, and map-based. <em>Right</em>: The average displacement error (ADE) is the average displacement per agent across rollouts. The minADE is the minimum displacement across all rollouts. The ADE is not part of the meta-score.</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fHXu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ff66b5c-93f3-4dd9-9eec-229db0b00b81_2942x1141.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fHXu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ff66b5c-93f3-4dd9-9eec-229db0b00b81_2942x1141.png 424w, https://substackcdn.com/image/fetch/$s_!fHXu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ff66b5c-93f3-4dd9-9eec-229db0b00b81_2942x1141.png 848w, https://substackcdn.com/image/fetch/$s_!fHXu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ff66b5c-93f3-4dd9-9eec-229db0b00b81_2942x1141.png 1272w, https://substackcdn.com/image/fetch/$s_!fHXu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ff66b5c-93f3-4dd9-9eec-229db0b00b81_2942x1141.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!fHXu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ff66b5c-93f3-4dd9-9eec-229db0b00b81_2942x1141.png" width="1456" height="565" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6ff66b5c-93f3-4dd9-9eec-229db0b00b81_2942x1141.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:565,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:168937,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://daphnecornelisse.substack.com/i/186339519?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ff66b5c-93f3-4dd9-9eec-229db0b00b81_2942x1141.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fHXu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ff66b5c-93f3-4dd9-9eec-229db0b00b81_2942x1141.png 424w, https://substackcdn.com/image/fetch/$s_!fHXu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ff66b5c-93f3-4dd9-9eec-229db0b00b81_2942x1141.png 848w, https://substackcdn.com/image/fetch/$s_!fHXu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ff66b5c-93f3-4dd9-9eec-229db0b00b81_2942x1141.png 1272w, https://substackcdn.com/image/fetch/$s_!fHXu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ff66b5c-93f3-4dd9-9eec-229db0b00b81_2942x1141.png 
1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Distribution of meta-scores across scenarios for the ground-truth (left) and the random policy (right). The red striped line indicates the mean.</figcaption></figure></div><p>With these reference points in place, we can now turn to the limitations of WOSAC and explore where its evaluation breaks down.</p><h4>L1. 
Arbitrary weighting of metrics and collision dominance in the meta-score</h4><p>The meta-score aggregates nine metrics using fixed weights (<a href="https://github.com/waymo-research/waymo-open-dataset/blob/99a4cb3ff07e2fe06c2ce73da001f850f628e45a/src/waymo_open_dataset/wdl_limited/sim_agents_metrics/challenge_2025_sim_agents_config.textproto">exact weighting here</a>). <strong>Collisions and off-road events</strong> alone <strong>contribute 50% of the total score</strong>, which misaligns the score&#8217;s stated goal, realism, with what it actually measures.</p><p>This weighting lets failure events, rather than nominal driving behavior, dominate the evaluation. The issue is compounded by false collisions caused by labeling errors in the data, which are nonetheless counted toward the meta-score. As a result, the score becomes sensitive to data artifacts rather than true driving quality. I return to this in L2.</p><h4><strong>L2. </strong>Binary treatment of safety events</h4><p>Collision and off-road likelihoods are modeled as <strong>binary at the rollout level</strong> (<a href="https://github.com/waymo-research/waymo-open-dataset/blob/99a4cb3ff07e2fe06c2ce73da001f850f628e45a/src/waymo_open_dataset/wdl_limited/sim_agents_metrics/metrics.py#L139-L145">code entry point</a>). A rollout is penalized if an event occurs at least once, regardless of frequency, duration, or severity. 
As a result, <strong>qualitatively different behaviors</strong>, ranging from a single brief contact to repeated collisions, can <strong>receive identical scores.</strong></p><p><strong>Sensitivity analysis: </strong>To illustrate the implications of L1 and L2, we start from the ground-truth trajectory (green striped line), replicate it 32 times, and then <a href="https://github.com/Emerge-Lab/PufferDrive/blob/f938aca314572dca142b1de8a0d1f591c2b79333/pufferlib/ocean/benchmark/evaluator.py#L325">inject artificial collisions</a> before computing the metrics.</p><p>We consider two variants. In the first, we inject multiple collisions within a single rollout by adding collisions at <strong>multiple timesteps</strong> (blue line). In the second, we inject <strong>one collision per rollout across multiple rollouts</strong> (purple line). We then plot the collision likelihood and the resulting meta-score as the total number of injected collisions increases.</p><pre><code><code># sim_collision_per_step has shape (n_agents, n_rollouts, n_steps)
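if collisions_to_add_per_rollout &gt; 0:
    # Purple curve: flag step 0 as a collision in the first k rollouts,
    # i.e. one collision in each of k different rollouts
    sim_collision_per_step[:, :collisions_to_add_per_rollout, 0] = True

if collisions_to_add_per_timestep &gt; 0:
    # Blue curve: flag the first k steps of rollout 0 as collisions,
    # i.e. k collisions packed into a single rollout
    sim_collision_per_step[:, 0, :collisions_to_add_per_timestep] = True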

# ... continue metrics computation as usual</code></code></pre><p>I hope you agree that a car colliding once is very different from a car colliding ten times in the same period of time. Yet the scores do not reflect such nuances. Injecting many collisions within a single rollout leaves the collision likelihood and, consequently, the meta-score unchanged after the first event. In contrast, spreading the same number of collisions across multiple rollouts steadily increases the collision likelihood and lowers the meta-score. This means that <strong>more severe behavior can sometimes be penalized less</strong> than <strong>milder but more distributed failures</strong>, highlighting the insensitivity that comes from treating events as binary.</p><p>As shown by the purple curve (left, x=32), introducing a single collision in every rollout reduces the meta-score by about 30%, from 0.82 (ground truth) to 0.57 (purple line), even when all other aspects of behavior remain correct. The absolute drop of 0.25 matches the 25% weight of the collision-likelihood component. Introducing a collision at every timestep in a single rollout, however, barely affects the meta-score (blue line).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Y-SQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F761c21a3-7a0c-47b1-9e91-f0abd74bfca0_3542x1441.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Y-SQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F761c21a3-7a0c-47b1-9e91-f0abd74bfca0_3542x1441.png 424w, https://substackcdn.com/image/fetch/$s_!Y-SQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F761c21a3-7a0c-47b1-9e91-f0abd74bfca0_3542x1441.png 848w, 
https://substackcdn.com/image/fetch/$s_!Y-SQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F761c21a3-7a0c-47b1-9e91-f0abd74bfca0_3542x1441.png 1272w, https://substackcdn.com/image/fetch/$s_!Y-SQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F761c21a3-7a0c-47b1-9e91-f0abd74bfca0_3542x1441.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Y-SQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F761c21a3-7a0c-47b1-9e91-f0abd74bfca0_3542x1441.png" width="727.9861450195312" height="295.9943666562929" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/761c21a3-7a0c-47b1-9e91-f0abd74bfca0_3542x1441.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:592,&quot;width&quot;:1456,&quot;resizeWidth&quot;:727.9861450195312,&quot;bytes&quot;:348558,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://daphnecornelisse.substack.com/i/186339519?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F761c21a3-7a0c-47b1-9e91-f0abd74bfca0_3542x1441.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Y-SQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F761c21a3-7a0c-47b1-9e91-f0abd74bfca0_3542x1441.png 424w, 
https://substackcdn.com/image/fetch/$s_!Y-SQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F761c21a3-7a0c-47b1-9e91-f0abd74bfca0_3542x1441.png 848w, https://substackcdn.com/image/fetch/$s_!Y-SQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F761c21a3-7a0c-47b1-9e91-f0abd74bfca0_3542x1441.png 1272w, https://substackcdn.com/image/fetch/$s_!Y-SQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F761c21a3-7a0c-47b1-9e91-f0abd74bfca0_3542x1441.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><strong>Left:</strong> Mean &#177; standard deviation of the realism meta-score over 229 scenarios as a function of the number of injected collisions. The green striped line indicates the empirical upper bound, while the orange line shows the score of a random policy. <strong>Right:</strong> Collision likelihood component of the meta-score (25% weight). The purple curve corresponds to adding collisions across rollouts, while the blue curve corresponds to adding collisions across timesteps within a single rollout.</figcaption></figure></div><p>In words, the collision likelihood reduces to the following logic:</p><ul><li><p>If the ground-truth trajectory contains at least one collision, the policy should also collide at least once. The number of collisions does not matter; adding collisions across time has no effect (blue line).</p></li><li><p>If the ground-truth trajectory contains no collision, the policy should never collide.</p></li></ul><p>The same logic applies to the off-road likelihood metric, which carries the same 25% weight as collision likelihood.</p><p>A natural follow-up question is whether this limitation matters in practice. The issue is asymmetric: false positives, meaning events that appear in the ground truth only because of labeling errors, are more consequential because they force the policy to mimic noise or incorrect behavior. Binary treatment of safety events would be less problematic with perfectly clean data, but real datasets often contain labeling errors.</p><p><em>How much noise exists in the <a href="https://github.com/waymo-research/waymo-open-dataset">Waymo Open Motion Dataset</a> (WOMD)?</em> To quantify this, I randomly sample 5,000 scenarios from the WOMD training set and count ground-truth trajectories containing one or more collisions or off-road events.</p><p>Because this dataset represents nominal driving, collisions are almost certainly labeling errors. 
The figure below summarizes these false positives. On average, <strong>4.2% of trajectories contain a collision</strong>, and <strong>16.3% contain an off-road event</strong>.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> These rates are higher than I expected. When a policy is optimized for realism while also minimizing collisions, labeling noise distorts the metric: it rewards mimicking incorrect behavior rather than truly human-like driving.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iK6e!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36ede863-2282-4c31-bbf7-89cb0bf5d068_3541x1435.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iK6e!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36ede863-2282-4c31-bbf7-89cb0bf5d068_3541x1435.png 424w, https://substackcdn.com/image/fetch/$s_!iK6e!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36ede863-2282-4c31-bbf7-89cb0bf5d068_3541x1435.png 848w, https://substackcdn.com/image/fetch/$s_!iK6e!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36ede863-2282-4c31-bbf7-89cb0bf5d068_3541x1435.png 1272w, https://substackcdn.com/image/fetch/$s_!iK6e!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36ede863-2282-4c31-bbf7-89cb0bf5d068_3541x1435.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!iK6e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36ede863-2282-4c31-bbf7-89cb0bf5d068_3541x1435.png" width="1456" height="590" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/36ede863-2282-4c31-bbf7-89cb0bf5d068_3541x1435.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:590,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:237336,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://daphnecornelisse.substack.com/i/186339519?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36ede863-2282-4c31-bbf7-89cb0bf5d068_3541x1435.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iK6e!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36ede863-2282-4c31-bbf7-89cb0bf5d068_3541x1435.png 424w, https://substackcdn.com/image/fetch/$s_!iK6e!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36ede863-2282-4c31-bbf7-89cb0bf5d068_3541x1435.png 848w, https://substackcdn.com/image/fetch/$s_!iK6e!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36ede863-2282-4c31-bbf7-89cb0bf5d068_3541x1435.png 1272w, https://substackcdn.com/image/fetch/$s_!iK6e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36ede863-2282-4c31-bbf7-89cb0bf5d068_3541x1435.png 
1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Left</em>: Percentage of <strong>scenarios</strong> containing at least one false positive (collisions and off-road events taken separately). <em>Right</em>: Percentage of agent trajectories containing at least one false positive. Analysis based on 5,000 randomly sampled scenarios from the WOMD training set. The total number of evaluated agents (those listed in the tracks to predict) across these scenarios is N=13,960.</figcaption></figure></div><h4><strong>L3. 
</strong>Timing is not taken into account</h4><p>WOSAC ignores temporal structure across all nine metrics (see the <a href="https://github.com/waymo-research/waymo-open-dataset/blob/99a4cb3ff07e2fe06c2ce73da001f850f628e45a/src/waymo_open_dataset/wdl_limited/sim_agents_metrics/estimators.py#L43C21-L43C42">original WOSAC code entry point</a> and the <a href="https://github.com/Emerge-Lab/PufferDrive/blob/c9aaeb2423e0757e1c15e237f7fa9efd5ee570c0/pufferlib/ocean/benchmark/estimators.py#L72">PufferDrive implementation</a>).</p><p>This design can produce counterintuitive behavior. For example, a single mislabeled collision early in the ground-truth trajectory allows a policy that collides at every timestep to score higher than one that avoids collisions entirely, even though the latter is safer and more realistic. The same issue affects kinematic metrics, which compare distributions of linear speed and acceleration over the entire rollout without regard to timing. A policy that repeatedly overshoots or accelerates aggressively early can achieve the same score as one that briefly deviates later, even though the resulting behaviors are qualitatively very different.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yo9E!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1d00a10-458c-4db5-bdb4-a398876fe26d_1870x908.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yo9E!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1d00a10-458c-4db5-bdb4-a398876fe26d_1870x908.png 424w, 
https://substackcdn.com/image/fetch/$s_!yo9E!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1d00a10-458c-4db5-bdb4-a398876fe26d_1870x908.png 848w, https://substackcdn.com/image/fetch/$s_!yo9E!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1d00a10-458c-4db5-bdb4-a398876fe26d_1870x908.png 1272w, https://substackcdn.com/image/fetch/$s_!yo9E!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1d00a10-458c-4db5-bdb4-a398876fe26d_1870x908.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yo9E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1d00a10-458c-4db5-bdb4-a398876fe26d_1870x908.png" width="1456" height="707" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f1d00a10-458c-4db5-bdb4-a398876fe26d_1870x908.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:707,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:355034,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://daphnecornelisse.substack.com/i/186339519?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1d00a10-458c-4db5-bdb4-a398876fe26d_1870x908.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!yo9E!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1d00a10-458c-4db5-bdb4-a398876fe26d_1870x908.png 424w, https://substackcdn.com/image/fetch/$s_!yo9E!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1d00a10-458c-4db5-bdb4-a398876fe26d_1870x908.png 848w, https://substackcdn.com/image/fetch/$s_!yo9E!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1d00a10-458c-4db5-bdb4-a398876fe26d_1870x908.png 1272w, https://substackcdn.com/image/fetch/$s_!yo9E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1d00a10-458c-4db5-bdb4-a398876fe26d_1870x908.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">An illustrative scenario to show the potential outcomes of mislabeled collisions.</figcaption></figure></div><h3>&#8594; The combined implications of L1, L2 and L3: Evaluating a superhuman driving policy </h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TLe8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02626dbd-9053-495d-ac21-decd9f060fbb_1946x764.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TLe8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02626dbd-9053-495d-ac21-decd9f060fbb_1946x764.png 424w, https://substackcdn.com/image/fetch/$s_!TLe8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02626dbd-9053-495d-ac21-decd9f060fbb_1946x764.png 848w, https://substackcdn.com/image/fetch/$s_!TLe8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02626dbd-9053-495d-ac21-decd9f060fbb_1946x764.png 1272w, https://substackcdn.com/image/fetch/$s_!TLe8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02626dbd-9053-495d-ac21-decd9f060fbb_1946x764.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!TLe8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02626dbd-9053-495d-ac21-decd9f060fbb_1946x764.png" width="704" height="276.57142857142856" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/02626dbd-9053-495d-ac21-decd9f060fbb_1946x764.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:572,&quot;width&quot;:1456,&quot;resizeWidth&quot;:704,&quot;bytes&quot;:158528,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://daphnecornelisse.substack.com/i/186339519?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02626dbd-9053-495d-ac21-decd9f060fbb_1946x764.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TLe8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02626dbd-9053-495d-ac21-decd9f060fbb_1946x764.png 424w, https://substackcdn.com/image/fetch/$s_!TLe8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02626dbd-9053-495d-ac21-decd9f060fbb_1946x764.png 848w, https://substackcdn.com/image/fetch/$s_!TLe8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02626dbd-9053-495d-ac21-decd9f060fbb_1946x764.png 1272w, 
https://substackcdn.com/image/fetch/$s_!TLe8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02626dbd-9053-495d-ac21-decd9f060fbb_1946x764.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Which policy will win?</figcaption></figure></div><p>To illustrate the combined effects of L1&#8211;L3, we run a simple experiment on 5,000 randomly sampled training scenarios. Consider a superhuman policy: it <em>behaves exactly like a human</em> but <em>never makes a mistake</em>. We compare two policies. 
The first exactly reproduces the ground-truth trajectories (&#8220;<strong>&#960; pattern-match</strong>&#8221;). The second reproduces the same trajectories but never collides or goes off-road (&#8220;<strong>&#960; superhuman</strong>&#8221;), implemented by setting all collision and off-road indicators to zero.</p><p>Intuitively, &#960; superhuman is perfect and would make an ideal controllable sim agent. Yet <strong>WOSAC gives it a meta-score of only 0.72</strong>, far lower than the score of a policy that reproduces human errors. To put this in perspective, this score would rank <strong>35th out of 36</strong> <a href="https://waymo.com/open/challenges/2025/sim-agents/">on the current leaderboard</a>, <strong>placing a perfect policy near the very bottom</strong><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TryA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F121f0a8d-5e02-4776-a753-85e01739b088_4442x1442.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TryA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F121f0a8d-5e02-4776-a753-85e01739b088_4442x1442.png 424w, https://substackcdn.com/image/fetch/$s_!TryA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F121f0a8d-5e02-4776-a753-85e01739b088_4442x1442.png 848w, 
https://substackcdn.com/image/fetch/$s_!TryA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F121f0a8d-5e02-4776-a753-85e01739b088_4442x1442.png 1272w, https://substackcdn.com/image/fetch/$s_!TryA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F121f0a8d-5e02-4776-a753-85e01739b088_4442x1442.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TryA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F121f0a8d-5e02-4776-a753-85e01739b088_4442x1442.png" width="1456" height="473" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/121f0a8d-5e02-4776-a753-85e01739b088_4442x1442.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:473,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:384377,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://daphnecornelisse.substack.com/i/186339519?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F121f0a8d-5e02-4776-a753-85e01739b088_4442x1442.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TryA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F121f0a8d-5e02-4776-a753-85e01739b088_4442x1442.png 424w, 
https://substackcdn.com/image/fetch/$s_!TryA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F121f0a8d-5e02-4776-a753-85e01739b088_4442x1442.png 848w, https://substackcdn.com/image/fetch/$s_!TryA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F121f0a8d-5e02-4776-a753-85e01739b088_4442x1442.png 1272w, https://substackcdn.com/image/fetch/$s_!TryA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F121f0a8d-5e02-4776-a753-85e01739b088_4442x1442.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Left</em>: The composite WOSAC meta-score for the pattern-match policy (green, exact GT) and a superhuman policy (blue). <em>Center</em>: The group metric averages. <em>Right</em>: The ADE and minADE metrics are 0 for both, as expected.</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iW2j!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0d9d885-d2be-4b09-b4db-24ff2a0ccd79_2943x1141.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iW2j!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0d9d885-d2be-4b09-b4db-24ff2a0ccd79_2943x1141.png 424w, https://substackcdn.com/image/fetch/$s_!iW2j!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0d9d885-d2be-4b09-b4db-24ff2a0ccd79_2943x1141.png 848w, https://substackcdn.com/image/fetch/$s_!iW2j!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0d9d885-d2be-4b09-b4db-24ff2a0ccd79_2943x1141.png 1272w, https://substackcdn.com/image/fetch/$s_!iW2j!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0d9d885-d2be-4b09-b4db-24ff2a0ccd79_2943x1141.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iW2j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0d9d885-d2be-4b09-b4db-24ff2a0ccd79_2943x1141.png" width="1456" height="564" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f0d9d885-d2be-4b09-b4db-24ff2a0ccd79_2943x1141.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:564,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:175444,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://daphnecornelisse.substack.com/i/186339519?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0d9d885-d2be-4b09-b4db-24ff2a0ccd79_2943x1141.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iW2j!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0d9d885-d2be-4b09-b4db-24ff2a0ccd79_2943x1141.png 424w, https://substackcdn.com/image/fetch/$s_!iW2j!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0d9d885-d2be-4b09-b4db-24ff2a0ccd79_2943x1141.png 848w, https://substackcdn.com/image/fetch/$s_!iW2j!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0d9d885-d2be-4b09-b4db-24ff2a0ccd79_2943x1141.png 1272w, https://substackcdn.com/image/fetch/$s_!iW2j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0d9d885-d2be-4b09-b4db-24ff2a0ccd79_2943x1141.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Distributions of WOSAC meta-scores for the GT (left) and superhuman policy (right)</figcaption></figure></div><p>Lastly, here are some minor issues.</p><h4><strong>L4. </strong>Discontinuities in function space</h4><p>The histogram bins are not smoothed; <a href="https://github.com/waymo-research/waymo-open-dataset/blob/99a4cb3ff07e2fe06c2ce73da001f850f628e45a/src/waymo_open_dataset/protos/sim_agents_metrics.proto#L75">the only adjustment is adding a small value to empty bins to avoid infinities</a>. This produces noticeable <strong>discontinuities</strong> in the likelihoods. 
I am not certain how this affects the final scores, but it was surprising to see that small changes in the underlying value can sometimes cause large jumps in likelihood.</p><p>The figures below illustrate this effect for the ground-truth data. The feature displayed here is the distance to the nearest object.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JR-D!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8b64809-3c89-464a-bb6c-e03800f25911_1776x591.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JR-D!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8b64809-3c89-464a-bb6c-e03800f25911_1776x591.png 424w, https://substackcdn.com/image/fetch/$s_!JR-D!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8b64809-3c89-464a-bb6c-e03800f25911_1776x591.png 848w, https://substackcdn.com/image/fetch/$s_!JR-D!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8b64809-3c89-464a-bb6c-e03800f25911_1776x591.png 1272w, https://substackcdn.com/image/fetch/$s_!JR-D!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8b64809-3c89-464a-bb6c-e03800f25911_1776x591.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JR-D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8b64809-3c89-464a-bb6c-e03800f25911_1776x591.png" width="1456" height="485" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e8b64809-3c89-464a-bb6c-e03800f25911_1776x591.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:485,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:82789,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://daphnecornelisse.substack.com/i/186339519?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8b64809-3c89-464a-bb6c-e03800f25911_1776x591.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JR-D!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8b64809-3c89-464a-bb6c-e03800f25911_1776x591.png 424w, https://substackcdn.com/image/fetch/$s_!JR-D!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8b64809-3c89-464a-bb6c-e03800f25911_1776x591.png 848w, https://substackcdn.com/image/fetch/$s_!JR-D!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8b64809-3c89-464a-bb6c-e03800f25911_1776x591.png 1272w, https://substackcdn.com/image/fetch/$s_!JR-D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8b64809-3c89-464a-bb6c-e03800f25911_1776x591.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1vOJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaaf7a8-0076-45a0-ab57-09a45543532d_1797x591.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1vOJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaaf7a8-0076-45a0-ab57-09a45543532d_1797x591.png 424w, 
https://substackcdn.com/image/fetch/$s_!1vOJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaaf7a8-0076-45a0-ab57-09a45543532d_1797x591.png 848w, https://substackcdn.com/image/fetch/$s_!1vOJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaaf7a8-0076-45a0-ab57-09a45543532d_1797x591.png 1272w, https://substackcdn.com/image/fetch/$s_!1vOJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaaf7a8-0076-45a0-ab57-09a45543532d_1797x591.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1vOJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaaf7a8-0076-45a0-ab57-09a45543532d_1797x591.png" width="1456" height="479" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/deaaf7a8-0076-45a0-ab57-09a45543532d_1797x591.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:479,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:96194,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://daphnecornelisse.substack.com/i/186339519?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaaf7a8-0076-45a0-ab57-09a45543532d_1797x591.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!1vOJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaaf7a8-0076-45a0-ab57-09a45543532d_1797x591.png 424w, https://substackcdn.com/image/fetch/$s_!1vOJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaaf7a8-0076-45a0-ab57-09a45543532d_1797x591.png 848w, https://substackcdn.com/image/fetch/$s_!1vOJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaaf7a8-0076-45a0-ab57-09a45543532d_1797x591.png 1272w, https://substackcdn.com/image/fetch/$s_!1vOJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdeaaf7a8-0076-45a0-ab57-09a45543532d_1797x591.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h4><strong>L5. </strong>Unclear upper bound and seemingly saturated performance</h4><p>WOSAC assumes independence over timesteps, so there is no natural upper bound. The metric is not guaranteed to stay between 0 and 1, which makes it unclear what the highest possible score really is.</p><p>We can obtain an empirical upper bound for a batch by replicating the ground-truth trajectory 32 times and computing the features, as in the analyses above. Normalizing scores by this value would make comparisons easier to interpret.</p><p>Meta-scores differ noticeably across sets of scenarios (see the histograms above). On the leaderboard, top entries appear tightly clustered, creating the impression of saturated performance, even though policy behavior can vary significantly across scenarios. 
This suggests that WOSAC may obscure meaningful differences between policies.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XpSg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3708dd74-6766-4a53-86ae-54eafe4a31b7_1924x778.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XpSg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3708dd74-6766-4a53-86ae-54eafe4a31b7_1924x778.png 424w, https://substackcdn.com/image/fetch/$s_!XpSg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3708dd74-6766-4a53-86ae-54eafe4a31b7_1924x778.png 848w, https://substackcdn.com/image/fetch/$s_!XpSg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3708dd74-6766-4a53-86ae-54eafe4a31b7_1924x778.png 1272w, https://substackcdn.com/image/fetch/$s_!XpSg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3708dd74-6766-4a53-86ae-54eafe4a31b7_1924x778.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XpSg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3708dd74-6766-4a53-86ae-54eafe4a31b7_1924x778.png" width="1456" height="589" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3708dd74-6766-4a53-86ae-54eafe4a31b7_1924x778.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:589,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:178162,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://daphnecornelisse.substack.com/i/186339519?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3708dd74-6766-4a53-86ae-54eafe4a31b7_1924x778.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XpSg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3708dd74-6766-4a53-86ae-54eafe4a31b7_1924x778.png 424w, https://substackcdn.com/image/fetch/$s_!XpSg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3708dd74-6766-4a53-86ae-54eafe4a31b7_1924x778.png 848w, https://substackcdn.com/image/fetch/$s_!XpSg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3708dd74-6766-4a53-86ae-54eafe4a31b7_1924x778.png 1272w, https://substackcdn.com/image/fetch/$s_!XpSg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3708dd74-6766-4a53-86ae-54eafe4a31b7_1924x778.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The top entries on the WOSAC 2025 leaderboard are so close that the top 20 are essentially competing on the 3rd or 4th decimal place, all hovering around 0.78.</figcaption></figure></div><h4><strong>L6. </strong>Assumption that there is a single ground truth</h4><p>WOSAC treats each agent as having a <strong>single ground-truth</strong> <strong>trajectory</strong>. In reality, human driving is inherently multimodal: there is no single &#8220;correct&#8221; way to drive. This is a tricky issue. Although multiple behaviors can be equally valid, WOSAC penalizes deviations from a single reference, reducing the score for perfectly reasonable variations. 
The authors note this themselves in the paper, which is why they excluded the average displacement error (ADE) from the meta-score and adopted a statistical view.</p><h3>Conclusion</h3><p>Circling back to the questions at the start, here are my takeaways:</p><ol><li><p><strong>Assumptions and interpretation.</strong> WOSAC makes several hidden assumptions that shape how we interpret its realism score. If your goal is not to blindly mimic the dataset, many of these assumptions are actively harmful.</p></li><li><p><strong>Optimizing for the score.</strong> There is no clear notion of &#8220;human-like enough.&#8221; Beyond a certain point, optimization merely tracks labeling noise rather than true human behavior. Given the dataset&#8217;s noise, it is unclear how much of the leaderboard reflects meaningful human-likeness versus overfitting to errors.</p></li><li><p><strong>Building better evals.</strong> The limitations of WOSAC suggest practical improvements:</p><blockquote><p><em><strong>i</strong></em><strong>)</strong> <strong>Separate safety from distributional realism / human-likeness.</strong> Treat collisions and off-road events as <strong>hard constraints</strong> for nominal driving rather than probabilistic outcomes. Superhuman policies that avoid mistakes can score only 0.72 under WOSAC, showing a misalignment in the current benchmark.</p><p><em><strong>ii</strong></em><strong>) Incorporate temporal dynamics explicitly.</strong> Evaluate not just where a vehicle goes, but when. Time-sensitive metrics are essential for coordination and planning.</p><p><em><strong>iii</strong></em><strong>) Close the sim-to-real loop</strong>. 
Benchmark metrics should reflect real-world human coordination, such as gap acceptance, yielding, and negotiation, rather than arbitrary weighted objectives.</p><p><em><strong>iv</strong></em><strong>) Smooth density estimates.</strong> Ensure likelihood or binned metrics avoid discontinuities and counterintuitive jumps.</p></blockquote></li></ol><p>I hope this post illuminates what WOSAC really measures and clarifies some of its limitations. If you&#8217;re interested in building better evals and benchmarks in this area, or if you have thoughts or questions about this post, I&#8217;d love to hear from you!</p><h4>Acknowledgements</h4><p>I am grateful to <a href="https://www.eugenevinitsky.com/">Eugene Vinitsky</a> for feedback on this post and helpful discussions. I also thank <a href="https://scholar.google.com/citations?user=MOzsfhIAAAAJ&amp;hl=fr">Wa&#235;l Doulazmi</a> and <a href="https://zilinwang.notion.site/me">Zilin Wang</a> for comments on an earlier draft and <a href="https://julian.bearblog.dev/">Julian Hunt</a> for help with initial prototyping of the sensitivity analysis.</p><div><hr></div><p>For attribution in academic contexts, please cite this work as:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">@misc{cornelisse2025humanlikeness,
  title={Human-likeness metrics for autonomous agents: are we measuring the right thing?},
  author={Cornelisse, Daphne},
  year={2025},
  howpublished={Substack},
  note={Blog post analyzing the Waymo Open Sim Agent Challenge (WOSAC) realism benchmark}
}</code></pre></div><p></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Human-likeness in driving has other applications as well. In simulation, for instance, the goal is to create interactive agents that behave like humans and respond meaningfully to a driving policy. While this is not my primary focus, many of the limitations discussed here still apply.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>The WOSAC authors explicitly frame the benchmark as a <strong>distribution-matching problem</strong> and <strong>do not position it as a deployment-oriented evaluation of driving policies</strong>. However, <strong>many of the limitations discussed here still apply in the simulation-agent setting</strong>. In practice, sim agents are engineering tools, and their usefulness depends not only on distributional realism but also on controllability and interpretability. Metrics that conflate nominal behavior with failure events, or collapse distinct behaviors into identical scores, limit our ability to diagnose failures and debug them efficiently. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Here, I include only collision and off-road events for agent IDs listed in the WOMD <code>tracks_to_predict</code> scenario metadata, as suggested by the authors. 
This subset targets less noisy agents; including the full set of controllable agents would introduce even more false positives.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>The same holds for the 2024 leaderboard. The 2024 challenge has one notable difference: it does not include a likelihood metric for traffic lights. I was too lazy to count the entries, but in the <a href="https://waymo.com/open/challenges/2024/sim-agents/">2024 leaderboard</a>, a score of 0.72 lands you at the middle/bottom of the leaderboard, roughly at the same position as the diffusion model <a href="https://waymo.com/open/challenges/sim-agents/results/72289c47-190c/1715318796806000/">VBD</a>.</p></div></div>]]></content:encoded></item><item><title><![CDATA[How to catch subtle RL bugs before they catch you]]></title><description><![CDATA[Tools and habits for reliable, fast RL experimentation and development]]></description><link>https://daphnecornelisse.substack.com/p/how-to-catch-subtle-rl-bugs-before</link><guid isPermaLink="false">https://daphnecornelisse.substack.com/p/how-to-catch-subtle-rl-bugs-before</guid><dc:creator><![CDATA[Daphne Cornelisse]]></dc:creator><pubDate>Mon, 13 Oct 2025 07:28:57 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!v7OX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93aa877b-8953-4c11-a4d9-ceac80fa9dae_1920x1284.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Reinforcement learning (RL) codebases tend to have a lot of moving parts. That&#8217;s why it is important to <strong>be</strong> <strong>systematic</strong> and thoroughly test before merging anything into the main branch. 
The earlier you catch a bug, the better, because once your code is merged, the space of <em>possible </em>root causes expands dramatically. Furthermore, you don&#8217;t want to discover a bug after having done all your experiments or even writing the paper. <strong>You can&#8217;t make scientific progress if part of your analysis is resting on shaky ground.</strong> </p><p>At the same time, you probably want to move fast. After all, part of what makes science exciting is <em>trying out new ideas</em>. We would like to test hypotheses and iterate on them quickly. Writing production-level code and exhaustive tests can feel like a drag on creativity.</p><p>So, as an RL researcher, the question becomes: <strong>how do you move fast without accidentally breaking things?</strong> How do you balance rapid experimentation with the rigor needed to write reliable, reproducible code?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!v7OX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93aa877b-8953-4c11-a4d9-ceac80fa9dae_1920x1284.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!v7OX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93aa877b-8953-4c11-a4d9-ceac80fa9dae_1920x1284.png 424w, https://substackcdn.com/image/fetch/$s_!v7OX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93aa877b-8953-4c11-a4d9-ceac80fa9dae_1920x1284.png 848w, https://substackcdn.com/image/fetch/$s_!v7OX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93aa877b-8953-4c11-a4d9-ceac80fa9dae_1920x1284.png 
1272w, https://substackcdn.com/image/fetch/$s_!v7OX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93aa877b-8953-4c11-a4d9-ceac80fa9dae_1920x1284.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!v7OX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93aa877b-8953-4c11-a4d9-ceac80fa9dae_1920x1284.png" width="1456" height="974" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/93aa877b-8953-4c11-a4d9-ceac80fa9dae_1920x1284.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:974,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:214844,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://daphnecornelisse.substack.com/i/175795434?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93aa877b-8953-4c11-a4d9-ceac80fa9dae_1920x1284.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!v7OX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93aa877b-8953-4c11-a4d9-ceac80fa9dae_1920x1284.png 424w, https://substackcdn.com/image/fetch/$s_!v7OX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93aa877b-8953-4c11-a4d9-ceac80fa9dae_1920x1284.png 848w, 
https://substackcdn.com/image/fetch/$s_!v7OX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93aa877b-8953-4c11-a4d9-ceac80fa9dae_1920x1284.png 1272w, https://substackcdn.com/image/fetch/$s_!v7OX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93aa877b-8953-4c11-a4d9-ceac80fa9dae_1920x1284.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>In this post, I&#8217;ll walk through a recent example from my own research and show how I use <a 
href="https://wandb.ai/home">Weights &amp; Biases</a> (wandb) to catch bugs early, without slowing down iteration speed.</p><h3>Silent errors</h3><p>Let me start by introducing the concept of <em>silent errors</em>. I first came across this concept a few years ago when reading a <a href="https://karpathy.github.io/2019/04/25/recipe/">blog post by Andrej Karpathy</a> (see Section <em>&#8220;2) Neural net training fails silently&#8221;</em>). The gist is that there are two kinds of errors. Some are obvious: your interpreter or compiler throws an error, or your code crashes. For example, if you write an <code>if</code> statement like this:</p><pre><code>if something
    print("hey, I'm here")
else:
    print("oh hello, now I'm over here!")</code></pre><p>Python will immediately complain because you forgot the colon (<code>:</code>).</p><p>Silent errors, on the other hand, are different creatures. They don&#8217;t crash your code or raise an exception. Everything <em>appears</em> to run fine, but under the hood, something is off. Your logic is flawed, your gradients aren&#8217;t flowing, or your metric isn&#8217;t what you think it is. These are the kinds of issues that only reveal themselves through careful monitoring and performance benchmarking, which is the focus of this post.</p><h3>Building in redundancy</h3><p>Beyond just writing clean, logical code, one of the most effective ways to avoid bugs is to <strong>build in redundancy</strong>. The key assumption here is simple: you, as a human, will inevitably make a mistake. It&#8217;s in your best interest to build redundancy into your workflow, where different checks catch different kinds of errors. You can build in redundancy in different ways, for example:</p><ul><li><p><strong>Get code reviews</strong>. If you&#8217;re lucky enough to work with multiple people on a codebase, ask co-developers to review your code. Especially when you&#8217;re new to a codebase, a quick look from someone who built part of it can save hours of debugging. If they&#8217;re busy, highlight the sections you&#8217;re unsure about so they can focus there. If you&#8217;re working on a codebase alone, do a self-review. You can make it a little game where you pretend to be your own adversary, hunting for bugs.</p></li><li><p><strong>Create tests</strong>. Perhaps an obvious one, but create some basic tests. Does the codebase build properly? Are you able to launch a training run? This becomes especially important in shared repositories where new contributors join regularly.</p></li><li><p><strong>Make videos (!)</strong>. I can&#8217;t really stress the importance of making videos enough. 
Reward curves only tell part of the story. If possible, make your renderer fast and occasionally log videos to W&amp;B during training. Visualizing behavior often reveals issues that metrics can&#8217;t. Plus, it&#8217;s oddly satisfying to watch your agents learn (I like to joke that about 20% of my PhD has been spent watching cars bump into each other).</p></li><li><p><strong>Benchmark performance</strong>. This is the main focus of the case study below. With a fast end-to-end setup, you can use <em>training runs themselves</em> as a way to test your code. Within minutes, you can check whether your metrics make sense and spot regressions before they have downstream effects. As you&#8217;ll see in the example, this approach requires some understanding of the metrics you&#8217;re logging. If you&#8217;re using <a href="https://arxiv.org/abs/1707.06347">PPO</a>, for instance, you should know what the clip coefficient does, what entropy loss represents, and what their typical ranges look like. That foundation lets you interpret results quickly and notice when something is off, because you have a sense of <em>what &#8220;healthy&#8221; values should look like</em>. 
Some of this intuition comes with experience, but I think it&#8217;s an important skill for any RL researcher to develop deliberately over time.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AlrZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe44c68a4-022a-4a7f-b238-47bc30220cc7_1916x910.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AlrZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe44c68a4-022a-4a7f-b238-47bc30220cc7_1916x910.png 424w, https://substackcdn.com/image/fetch/$s_!AlrZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe44c68a4-022a-4a7f-b238-47bc30220cc7_1916x910.png 848w, https://substackcdn.com/image/fetch/$s_!AlrZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe44c68a4-022a-4a7f-b238-47bc30220cc7_1916x910.png 1272w, https://substackcdn.com/image/fetch/$s_!AlrZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe44c68a4-022a-4a7f-b238-47bc30220cc7_1916x910.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AlrZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe44c68a4-022a-4a7f-b238-47bc30220cc7_1916x910.png" width="1456" height="692" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e44c68a4-022a-4a7f-b238-47bc30220cc7_1916x910.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:692,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:276627,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://daphnecornelisse.substack.com/i/175795434?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe44c68a4-022a-4a7f-b238-47bc30220cc7_1916x910.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AlrZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe44c68a4-022a-4a7f-b238-47bc30220cc7_1916x910.png 424w, https://substackcdn.com/image/fetch/$s_!AlrZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe44c68a4-022a-4a7f-b238-47bc30220cc7_1916x910.png 848w, https://substackcdn.com/image/fetch/$s_!AlrZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe44c68a4-022a-4a7f-b238-47bc30220cc7_1916x910.png 1272w, https://substackcdn.com/image/fetch/$s_!AlrZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe44c68a4-022a-4a7f-b238-47bc30220cc7_1916x910.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">How I like to think of building in redundancy: We&#8217;re maximizing coverage of the blue areas in <em>bug space</em></figcaption></figure></div><h3>Case study: Adding a human regularization objective</h3><p>To illustrate the value of benchmarking and using metrics, let me share a recent example where I caught a <em>silent error</em> in my own code. I was working on adding a human log-likelihood regularization objective in our <a href="https://github.com/Emerge-Lab/PufferDrive">PufferDrive</a> simulator. 
For context, PufferDrive continues the development of <a href="https://github.com/PufferAI/PufferLib">PufferLib</a>&#8217;s <a href="https://x.com/spenccheng/status/1959665036483350994">drive environment</a>, which is a reimplementation of <a href="https://github.com/Emerge-Lab/gpudrive">GPUDrive</a> that can train end-to-end at 200K FPS on my RTX 4080 (I know, it&#8217;s wild!) and scales linearly with the number of GPUs you have.</p><p>Without getting lost in the details, I essentially augmented the PPO loss with one additional term. Here you can see the key <code>diffs</code>:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ap0M!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d53e5b9-4e3f-406c-ae35-19a599e5aaa2_2294x470.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ap0M!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d53e5b9-4e3f-406c-ae35-19a599e5aaa2_2294x470.png 424w, https://substackcdn.com/image/fetch/$s_!Ap0M!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d53e5b9-4e3f-406c-ae35-19a599e5aaa2_2294x470.png 848w, https://substackcdn.com/image/fetch/$s_!Ap0M!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d53e5b9-4e3f-406c-ae35-19a599e5aaa2_2294x470.png 1272w, https://substackcdn.com/image/fetch/$s_!Ap0M!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d53e5b9-4e3f-406c-ae35-19a599e5aaa2_2294x470.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Ap0M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d53e5b9-4e3f-406c-ae35-19a599e5aaa2_2294x470.png" width="1456" height="298" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1d53e5b9-4e3f-406c-ae35-19a599e5aaa2_2294x470.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:298,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:116671,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://daphnecornelisse.substack.com/i/175795434?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d53e5b9-4e3f-406c-ae35-19a599e5aaa2_2294x470.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ap0M!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d53e5b9-4e3f-406c-ae35-19a599e5aaa2_2294x470.png 424w, https://substackcdn.com/image/fetch/$s_!Ap0M!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d53e5b9-4e3f-406c-ae35-19a599e5aaa2_2294x470.png 848w, https://substackcdn.com/image/fetch/$s_!Ap0M!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d53e5b9-4e3f-406c-ae35-19a599e5aaa2_2294x470.png 1272w, https://substackcdn.com/image/fetch/$s_!Ap0M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d53e5b9-4e3f-406c-ae35-19a599e5aaa2_2294x470.png 1456w" 
sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Implementing this required changes across several files, but on a first pass, it looked fine to me. I ran a sweep over different log-likelihood loss coefficients (<code>human_ll_coef</code>) and saw what I expected: the human log probability approached zero with more experience. Performance metrics also looked good: agents were achieving optimal performance within a couple of minutes. A score of 1.0 indicates a 100% goal-reaching rate and 0% collision / off-road rate.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oeSt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7cfb0f0-da66-4d7a-a0a1-add623f8a7db_2298x550.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oeSt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7cfb0f0-da66-4d7a-a0a1-add623f8a7db_2298x550.png 424w, https://substackcdn.com/image/fetch/$s_!oeSt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7cfb0f0-da66-4d7a-a0a1-add623f8a7db_2298x550.png 848w, https://substackcdn.com/image/fetch/$s_!oeSt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7cfb0f0-da66-4d7a-a0a1-add623f8a7db_2298x550.png 1272w, https://substackcdn.com/image/fetch/$s_!oeSt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7cfb0f0-da66-4d7a-a0a1-add623f8a7db_2298x550.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!oeSt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7cfb0f0-da66-4d7a-a0a1-add623f8a7db_2298x550.png" width="1456" height="348" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a7cfb0f0-da66-4d7a-a0a1-add623f8a7db_2298x550.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:348,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:147345,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://daphnecornelisse.substack.com/i/175795434?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7cfb0f0-da66-4d7a-a0a1-add623f8a7db_2298x550.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oeSt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7cfb0f0-da66-4d7a-a0a1-add623f8a7db_2298x550.png 424w, https://substackcdn.com/image/fetch/$s_!oeSt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7cfb0f0-da66-4d7a-a0a1-add623f8a7db_2298x550.png 848w, https://substackcdn.com/image/fetch/$s_!oeSt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7cfb0f0-da66-4d7a-a0a1-add623f8a7db_2298x550.png 1272w, https://substackcdn.com/image/fetch/$s_!oeSt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7cfb0f0-da66-4d7a-a0a1-add623f8a7db_2298x550.png 1456w" 
sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Performance metrics for different experiments. Each run uses a different value of <code>human_ll_coef</code>.</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0PLR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb338b408-1671-4d92-817f-1ab24bc7d514_596x566.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0PLR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb338b408-1671-4d92-817f-1ab24bc7d514_596x566.png 424w, https://substackcdn.com/image/fetch/$s_!0PLR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb338b408-1671-4d92-817f-1ab24bc7d514_596x566.png 848w, https://substackcdn.com/image/fetch/$s_!0PLR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb338b408-1671-4d92-817f-1ab24bc7d514_596x566.png 1272w, https://substackcdn.com/image/fetch/$s_!0PLR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb338b408-1671-4d92-817f-1ab24bc7d514_596x566.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0PLR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb338b408-1671-4d92-817f-1ab24bc7d514_596x566.png" width="346" height="328.5838926174497" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b338b408-1671-4d92-817f-1ab24bc7d514_596x566.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:566,&quot;width&quot;:596,&quot;resizeWidth&quot;:346,&quot;bytes&quot;:52390,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://daphnecornelisse.substack.com/i/175795434?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb338b408-1671-4d92-817f-1ab24bc7d514_596x566.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0PLR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb338b408-1671-4d92-817f-1ab24bc7d514_596x566.png 424w, https://substackcdn.com/image/fetch/$s_!0PLR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb338b408-1671-4d92-817f-1ab24bc7d514_596x566.png 848w, https://substackcdn.com/image/fetch/$s_!0PLR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb338b408-1671-4d92-817f-1ab24bc7d514_596x566.png 1272w, https://substackcdn.com/image/fetch/$s_!0PLR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb338b408-1671-4d92-817f-1ab24bc7d514_596x566.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The log-likelihood approaches zero with experience for all runs with a non-zero coefficient.</figcaption></figure></div><p>The videos looked good as well. Top-down view (simulator state):</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;5cedfa2c-86dd-45c5-b36a-9b34266fb252&quot;,&quot;duration&quot;:null}"></div><p>View of one of the controlled agents:</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;09e34bba-7e8b-433f-b749-7edc6858ee44&quot;,&quot;duration&quot;:null}"></div><p>Great, I thought, this is probably ready for review. 
Then I realized I hadn&#8217;t actually added a <strong>baseline</strong> (black curve), meaning a run with <code>human_ll_coef</code> set to 0. So I ran one, mostly as a sanity check, and noticed something strange in the entropy loss plot:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YTQO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4804d455-cbab-41e3-97c1-40850c1a3d99_1360x586.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YTQO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4804d455-cbab-41e3-97c1-40850c1a3d99_1360x586.png 424w, https://substackcdn.com/image/fetch/$s_!YTQO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4804d455-cbab-41e3-97c1-40850c1a3d99_1360x586.png 848w, https://substackcdn.com/image/fetch/$s_!YTQO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4804d455-cbab-41e3-97c1-40850c1a3d99_1360x586.png 1272w, https://substackcdn.com/image/fetch/$s_!YTQO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4804d455-cbab-41e3-97c1-40850c1a3d99_1360x586.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YTQO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4804d455-cbab-41e3-97c1-40850c1a3d99_1360x586.png" width="1360" height="586" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4804d455-cbab-41e3-97c1-40850c1a3d99_1360x586.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:586,&quot;width&quot;:1360,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:183948,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://daphnecornelisse.substack.com/i/175795434?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4804d455-cbab-41e3-97c1-40850c1a3d99_1360x586.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YTQO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4804d455-cbab-41e3-97c1-40850c1a3d99_1360x586.png 424w, https://substackcdn.com/image/fetch/$s_!YTQO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4804d455-cbab-41e3-97c1-40850c1a3d99_1360x586.png 848w, https://substackcdn.com/image/fetch/$s_!YTQO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4804d455-cbab-41e3-97c1-40850c1a3d99_1360x586.png 1272w, https://substackcdn.com/image/fetch/$s_!YTQO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4804d455-cbab-41e3-97c1-40850c1a3d99_1360x586.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Can you see it? The entropy curve looks off for the baseline. At least for discrete action spaces, the policy entropy should converge to 0. <em>Why</em> was it approaching 0 in every run where <code>human_ll_coef &gt; 0</code>, but not for the baseline? Going through my code again, I spotted the bug: I had accidentally overwritten the standard policy entropy with the entropy of the policy conditioned on human actions and observations. 
Suddenly, the odd entropy value paired with this bug made perfect sense.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1G4c!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c194e3b-8564-47c3-b06b-71a7e3ff7a2a_2264x713.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1G4c!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c194e3b-8564-47c3-b06b-71a7e3ff7a2a_2264x713.png 424w, https://substackcdn.com/image/fetch/$s_!1G4c!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c194e3b-8564-47c3-b06b-71a7e3ff7a2a_2264x713.png 848w, https://substackcdn.com/image/fetch/$s_!1G4c!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c194e3b-8564-47c3-b06b-71a7e3ff7a2a_2264x713.png 1272w, https://substackcdn.com/image/fetch/$s_!1G4c!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c194e3b-8564-47c3-b06b-71a7e3ff7a2a_2264x713.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1G4c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c194e3b-8564-47c3-b06b-71a7e3ff7a2a_2264x713.png" width="2264" height="713" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0c194e3b-8564-47c3-b06b-71a7e3ff7a2a_2264x713.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:713,&quot;width&quot;:2264,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:227654,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://daphnecornelisse.substack.com/i/175795434?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5686859c-c51a-4b75-8ade-8886879991be_2264x842.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1G4c!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c194e3b-8564-47c3-b06b-71a7e3ff7a2a_2264x713.png 424w, https://substackcdn.com/image/fetch/$s_!1G4c!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c194e3b-8564-47c3-b06b-71a7e3ff7a2a_2264x713.png 848w, https://substackcdn.com/image/fetch/$s_!1G4c!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c194e3b-8564-47c3-b06b-71a7e3ff7a2a_2264x713.png 1272w, https://substackcdn.com/image/fetch/$s_!1G4c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c194e3b-8564-47c3-b06b-71a7e3ff7a2a_2264x713.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>I fixed the bug, reran the experiment, and this time the baseline behaved as expected: the policy entropy converged to 0. It is a nice example of a silent error that could have subtly affected downstream results, had it gone unnoticed.</p><h3>Parting thoughts</h3><p>Hopefully, this post gave you a few useful strategies for reliable and efficient RL experimentation. But it may be worth ending with a reminder: even with all the redundancy and best practices in place, you&#8217;ll still make mistakes sometimes&#8212;and that&#8217;s okay. Just take note, learn from them, and move on. 
Like the RL agents we train, we slowly improve with feedback and experience.</p><div><hr></div><h5><strong>Acknowledgements</strong></h5><p>I&#8217;m grateful to <a href="https://www.eugenevinitsky.com/">Eugene Vinitsky</a> for his feedback on an earlier draft of this post and, more broadly, for always emphasizing the importance of scientific rigor in both RL experimentation and software development.</p>]]></content:encoded></item><item><title><![CDATA[Human behavior modeling in naturalistic driving: Trends and opportunities]]></title><description><![CDATA[Reflections on the '25 human road user modeling workshop in Baden-Baden]]></description><link>https://daphnecornelisse.substack.com/p/human-behavior-modeling-in-naturalistic</link><guid isPermaLink="false">https://daphnecornelisse.substack.com/p/human-behavior-modeling-in-naturalistic</guid><dc:creator><![CDATA[Daphne Cornelisse]]></dc:creator><pubDate>Mon, 06 Oct 2025 07:45:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!E_ca!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2740d56b-cf67-4408-ab8b-17f4490a3353_3526x2739.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Last Tuesday, our <a 
href="https://2025.iavvc.org/program/workshops/workshop-on-computational-models">workshop on models of human behavior</a> for autonomous vehicle evaluation took place in the quaint town of Baden-Baden. I wanted to share a quick write-up of some thoughts and observations from the day.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!E_ca!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2740d56b-cf67-4408-ab8b-17f4490a3353_3526x2739.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!E_ca!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2740d56b-cf67-4408-ab8b-17f4490a3353_3526x2739.jpeg 424w, https://substackcdn.com/image/fetch/$s_!E_ca!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2740d56b-cf67-4408-ab8b-17f4490a3353_3526x2739.jpeg 848w, https://substackcdn.com/image/fetch/$s_!E_ca!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2740d56b-cf67-4408-ab8b-17f4490a3353_3526x2739.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!E_ca!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2740d56b-cf67-4408-ab8b-17f4490a3353_3526x2739.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!E_ca!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2740d56b-cf67-4408-ab8b-17f4490a3353_3526x2739.jpeg" width="1456" height="1131" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2740d56b-cf67-4408-ab8b-17f4490a3353_3526x2739.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1131,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1726445,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://daphnecornelisse.substack.com/i/175263723?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefeee01d-0a0c-44c5-8419-61ff1fea6aa9_4032x3024.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!E_ca!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2740d56b-cf67-4408-ab8b-17f4490a3353_3526x2739.jpeg 424w, https://substackcdn.com/image/fetch/$s_!E_ca!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2740d56b-cf67-4408-ab8b-17f4490a3353_3526x2739.jpeg 848w, https://substackcdn.com/image/fetch/$s_!E_ca!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2740d56b-cf67-4408-ab8b-17f4490a3353_3526x2739.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!E_ca!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2740d56b-cf67-4408-ab8b-17f4490a3353_3526x2739.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Gustav Markkula kicking off the workshop in the morning!</figcaption></figure></div><div><hr></div><h3>Background: Why model human behavior?</h3><p>To back up, models of human behavior are central to evaluating&#8212;and sometimes even training&#8212;autonomous vehicles. Depending on the use case, these models serve different purposes:</p><ul><li><p><strong>Human performance benchmarks:</strong> For example, in accident analysis, they can help answer whether a collision could reasonably have been avoided.</p></li><li><p><strong>Simulation agents (&#8220;sim agents&#8221;):</strong> Here, the models act as other road users in a simulated environment. 
Think of NPCs in a game, except their behaviors must be realistic enough that simulation insights actually transfer to the real world.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ERa9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6511603f-cb24-404d-a518-f05e56f0b39a_1200x675.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ERa9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6511603f-cb24-404d-a518-f05e56f0b39a_1200x675.jpeg 424w, https://substackcdn.com/image/fetch/$s_!ERa9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6511603f-cb24-404d-a518-f05e56f0b39a_1200x675.jpeg 848w, https://substackcdn.com/image/fetch/$s_!ERa9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6511603f-cb24-404d-a518-f05e56f0b39a_1200x675.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!ERa9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6511603f-cb24-404d-a518-f05e56f0b39a_1200x675.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ERa9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6511603f-cb24-404d-a518-f05e56f0b39a_1200x675.jpeg" width="1200" height="675" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6511603f-cb24-404d-a518-f05e56f0b39a_1200x675.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:675,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Nintendo Mario Kart: World - Nintendo Switch 2&quot;,&quot;title&quot;:&quot;Nintendo Mario Kart: World - Nintendo Switch 2&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Nintendo Mario Kart: World - Nintendo Switch 2" title="Nintendo Mario Kart: World - Nintendo Switch 2" srcset="https://substackcdn.com/image/fetch/$s_!ERa9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6511603f-cb24-404d-a518-f05e56f0b39a_1200x675.jpeg 424w, https://substackcdn.com/image/fetch/$s_!ERa9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6511603f-cb24-404d-a518-f05e56f0b39a_1200x675.jpeg 848w, https://substackcdn.com/image/fetch/$s_!ERa9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6511603f-cb24-404d-a518-f05e56f0b39a_1200x675.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!ERa9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6511603f-cb24-404d-a518-f05e56f0b39a_1200x675.jpeg 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h3>Workshop highlights</h3><p>The talks ranged from mechanistic approaches to AI/ML-based methods. 
A few themes stood out:</p><ul><li><p>Training or fine-tuning sim agents in closed-loop settings greatly improves controllability.</p></li><li><p>We&#8217;re getting quite good at modeling <em>nominal</em> human driving behavior.</p></li><li><p>Progress has been made on modeling conflict behavior in small, controlled settings, but scaling to rare, high-stakes scenarios remains difficult.</p><ul><li><p>Likely reasons: (1) sparse and sensitive data; (2) diverse human responses.</p></li></ul></li><li><p>The field has developed quite a clear sense of which aspects of &#8220;human-like&#8221; behavior in traffic matter, and how to measure them.</p></li><li><p>The old divide between data-driven and mechanistic approaches seems to be fading, with the two perspectives increasingly coming together.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!R-DO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e43916-9303-484b-8752-5191714819f8_800x600.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!R-DO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e43916-9303-484b-8752-5191714819f8_800x600.jpeg 424w, https://substackcdn.com/image/fetch/$s_!R-DO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e43916-9303-484b-8752-5191714819f8_800x600.jpeg 848w, https://substackcdn.com/image/fetch/$s_!R-DO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e43916-9303-484b-8752-5191714819f8_800x600.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!R-DO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e43916-9303-484b-8752-5191714819f8_800x600.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!R-DO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e43916-9303-484b-8752-5191714819f8_800x600.jpeg" width="800" height="600" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/83e43916-9303-484b-8752-5191714819f8_800x600.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:800,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:66983,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://daphnecornelisse.substack.com/i/175263723?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e43916-9303-484b-8752-5191714819f8_800x600.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!R-DO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e43916-9303-484b-8752-5191714819f8_800x600.jpeg 424w, https://substackcdn.com/image/fetch/$s_!R-DO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e43916-9303-484b-8752-5191714819f8_800x600.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!R-DO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e43916-9303-484b-8752-5191714819f8_800x600.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!R-DO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83e43916-9303-484b-8752-5191714819f8_800x600.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Closing the workshop with a panel discussion with <a href="https://www.linkedin.com/in/arkadyzgonnikov/">Arkady 
Zgonnikov</a>, <a href="https://www.linkedin.com/in/johan-engstr%C3%B6m-9635261/">Johan Engstr&#246;m</a>, <a href="https://www.linkedin.com/in/carol-flannagan-6399502a5/">Carol Flannagan</a>, and <a href="https://www.linkedin.com/in/maximilian-igl-21116992/">Maximilian Igl</a>.</figcaption></figure></div><h3>Opportunities &amp; open questions</h3><p><strong>Operationalizing interpretability</strong><br>One thing that struck me is how differently people define &#8220;interpretable.&#8221; Mechanistic modelers often value their approach because they can fully <em>understand</em> the model. But does interpretability mean transparency in parameters, or is it, for example, sufficient to understand how rewards shape an RL agent&#8217;s behavior? If a mechanistic model has 20 interacting parameters, is it still interpretable? Since humans can only hold so many variables in mind, maybe interpretability should be defined relative to these limits. It seems valuable to develop criteria for interpretability that are tailored to specific use cases. Once the type and level of interpretability are clear, choosing the right method may become straightforward.</p><p><strong>Combining domain knowledge with scalable methods</strong><br>There appears to be a clear opportunity to combine the existing domain expertise with scalable ML/RL approaches.</p><p><strong>Evaluation: Metrics and datasets</strong><br>The <a href="https://waymo.com/open/challenges/2024/sim-agents/">Waymo Open Sim Agent Challenge (WOSAC)</a> has been, and continues to be, a useful benchmark for measuring realism. However, current metrics leave room for improvement. For example, how much of an improvement is a meta-score of 0.78 over a meta-score of 0.76? It&#8217;s hard to tell. Developing metrics that are both rigorous and interpretable will be key to building better models of human behavior in traffic. 
Similarly, more benchmarks and datasets focused on long-tail or conflict scenarios are sorely needed.</p><div><hr></div><p>Overall, I&#8217;m excited to see how progress in controllability, interpretability, and measurability will further our understanding of human behavior in naturalistic traffic scenarios and, in turn, continue to improve the safety of autonomous vehicles.</p>]]></content:encoded></item></channel></rss>