A Decision Theoretic Framework for Measuring AI Reliance
Humans frequently make decisions with the aid of artificially intelligent (AI) systems. A common pattern is for the AI to recommend an action to the human, who retains control over the final decision. Researchers have identified ensuring that a human has appropriate reliance on an AI as a critical component of achieving complementary performance. We argue that the current definition of appropriate reliance used in such research lacks formal statistical grounding and can lead to contradictions. We propose a formal definition of reliance, based on statistical decision theory, which separates reliance, the probability that the decision-maker follows the AI’s recommendation, from the challenges a human may face in differentiating the signals and forming accurate beliefs about the situation. Our definition gives rise to a framework that can be used to guide the design and interpretation of studies on human-AI complementarity and reliance. Using recent AI-advised decision-making studies from the literature, we demonstrate how our framework can be used to separate the loss due to mis-reliance from the loss due to not accurately differentiating the signals. We evaluate these losses by comparing a baseline and a benchmark for complementary performance, defined by the expected payoff achieved by a rational decision-maker facing the same decision task as the behavioral decision-makers.
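As a rough illustration of the benchmark comparison described above, the sketch below (not the paper’s implementation; the accuracies, payoffs, and behavioral policy are all hypothetical) simulates AI-advised binary decisions, measures reliance as the probability that the decision matches the AI’s recommendation, and compares a behavioral policy’s expected payoff to a rational-agent benchmark and an always-defer baseline.

```python
# Illustrative sketch only: hypothetical parameters, not the authors' setup.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

prior = 0.5        # P(state = 1)
ai_acc = 0.80      # P(AI recommendation matches the true state)
human_acc = 0.65   # P(human's private signal matches the true state)

state = rng.random(n) < prior
ai_rec = np.where(rng.random(n) < ai_acc, state, ~state)
human_sig = np.where(rng.random(n) < human_acc, state, ~state)

# Rational benchmark: with a symmetric prior, symmetric payoffs, and
# conditionally independent signals, following the more accurate source
# maximizes expected payoff.
rational_action = ai_rec if ai_acc >= human_acc else human_sig

# Hypothetical behavioral policy: follow the AI 70% of the time, otherwise
# follow one's own signal, regardless of which source is more reliable.
follow_ai = rng.random(n) < 0.70
behavioral_action = np.where(follow_ai, ai_rec, human_sig)

reliance = np.mean(behavioral_action == ai_rec)   # P(decision follows the AI)
payoff_behavioral = np.mean(behavioral_action == state)
payoff_rational = np.mean(rational_action == state)
payoff_baseline = np.mean(ai_rec == state)        # always defer to the AI

print(f"reliance (P follow AI):   {reliance:.3f}")
print(f"behavioral payoff:        {payoff_behavioral:.3f}")
print(f"baseline (always AI):     {payoff_baseline:.3f}")
print(f"rational benchmark:       {payoff_rational:.3f}")
print(f"total loss vs. benchmark: {payoff_rational - payoff_behavioral:.3f}")
```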
An Interactive Paradigm for Differentially-Private Exploratory Analysis
Differential privacy (DP) has the potential to enable privacy-preserving analysis on sensitive data, but requires analysts to judiciously spend a limited "privacy loss budget" epsilon across queries. Analysts conducting exploratory analyses do not, however, know all queries in advance and seldom have DP expertise, so they are limited in their ability to specify epsilon allotments across queries prior to an analysis. To support analysts in spending epsilon efficiently, we propose a new interactive analysis paradigm, Measure-Observe-Remeasure, where analysts "measure" the database with a limited amount of epsilon, observe estimates and their errors, and remeasure with more epsilon as needed. We instantiate the paradigm in an interactive visualization interface that allows analysts to spend increasing amounts of epsilon under a total budget. To observe how analysts interact with the Measure-Observe-Remeasure paradigm via the interface, we conduct a user study that compares the utility of the epsilon allocations and findings participants make from sensitive data to the allocations and findings expected of a rational agent facing the same decision task. We find that participants are able to use the workflow relatively successfully, including adopting budget allocation strategies that capture over half of the utility available from epsilon allocation. Their loss in performance relative to a rational agent appears to be driven more by difficulty accessing and reporting information than by difficulty allocating epsilon.
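A minimal sketch of the Measure-Observe-Remeasure loop, assuming a simple counting query answered with the Laplace mechanism; the budget, spends, and stopping rule below are hypothetical and not taken from the paper’s interface. Each remeasurement spends additional epsilon, and the noisy estimates are pooled by inverse-variance weighting.

```python
# Illustrative sketch only: hypothetical query, budget, and stopping rule.
import numpy as np

rng = np.random.default_rng(1)

true_count = 1_234     # hypothetical sensitive counting query
sensitivity = 1.0      # a counting query changes by at most 1 per person
total_budget = 1.0
spent = 0.0
estimates, variances = [], []

def measure(eps):
    """Return a Laplace-noised answer and its variance for a spend of eps."""
    scale = sensitivity / eps
    return true_count + rng.laplace(0.0, scale), 2.0 * scale**2

for eps in [0.05, 0.10, 0.20]:        # increasing spends under the budget
    if spent + eps > total_budget:
        break
    est, var = measure(eps)           # "measure"
    spent += eps
    estimates.append(est)
    variances.append(var)

    # "Observe": pool all measurements so far by inverse-variance weighting.
    w = 1.0 / np.array(variances)
    combined = float(np.sum(w * np.array(estimates)) / np.sum(w))
    std_err = float(np.sqrt(1.0 / np.sum(w)))
    print(f"spent eps={spent:.2f}  estimate={combined:.1f}  ±{1.96 * std_err:.1f}")

    if std_err < 10:                  # hypothetical precision target; else remeasure
        break
```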
Causal Quartets
The average causal effect can often be best understood in the context of its variation. We demonstrate this with two sets of four graphs, all of which represent the same average effect but with very different patterns of heterogeneity. As with Anscombe’s famous correlation quartet, these graphs dramatize the way in which real-world variation can be more complex than simple numerical summaries. The graphs also give insight into why the average effect is often much smaller than anticipated.
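A minimal sketch, with entirely hypothetical effect distributions, of how several patterns of individual treatment effects can share approximately the same average effect while differing sharply in heterogeneity, in the spirit of the quartets described above.

```python
# Illustrative sketch only: hypothetical effect distributions, not the paper's data.
import numpy as np

rng = np.random.default_rng(2)
n = 1_000
avg = 0.1   # the shared average treatment effect

quartet = {
    "constant effect":   np.full(n, avg),
    "low variation":     rng.normal(avg, 0.02, n),
    "high variation":    rng.normal(avg, 0.50, n),
    "rare large effect": np.where(rng.random(n) < 0.01, 10.0, 0.0),
}

for name, effects in quartet.items():
    print(f"{name:18s} mean={effects.mean():+.3f}  "
          f"sd={effects.std():.3f}  share>0={np.mean(effects > 0):.2f}")
```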
Musslick, S., L. Bartlett, S. Chandramouli, M. Dubova, F. Gobet, T. Griffiths, J. Hullman, et al. 2025. Automating the practice of science: Opportunities, challenges, and implications. Proceedings of the National Academy of Sciences 122(5): e2401238121.
Kamali, N., K. Nakamura, A. Kumar, A. Chatzimparmpas, J. Hullman, and M. Groh. 2025. Characterizing photorealism and artifacts in diffusion model-generated images. Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 1–26.
Hullman, J., A. Kale, and J. Hartline. 2025. Underspecified human decision experiments considered harmful. Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 1–15.
Hardy, M., S. Zhang, J. Hullman, J. Hofman, and D. Goldstein. 2025. Improving out-of-population prediction: The complementary effects of model assistance and judgmental bootstrapping. International Journal of Forecasting 41(2): 689–701.
Long, S., A. Chatzimparmpas, E. Alexander, M. Kay, and J. Hullman. 2025. Seeing eye to AI? Applying deep-feature-based similarity metrics to information visualization. Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, 1–20.
Guo, Z., A. Kale, M. Kay, and J. Hullman. 2024. VMC: A grammar for visualizing statistical model checks. IEEE Transactions on Visualization and Computer Graphics 31(1): 1–11.
Guo, Z., Y. Wu, J. Hartline, and J. Hullman. 2024. A decision theoretic framework for measuring AI reliance. Proceedings of the ACM Conference on Fairness, Accountability, and Transparency, 221–36.
Subramonyam, H. and J. Hullman. 2024. Are we closing the loop yet? Gaps in the generalizability of VIS4ML research. IEEE Transactions on Visualization and Computer Graphics 30(1): 1–11.
Gelman, A., J. Hullman, and L. Kennedy. 2024. Causal quartets: Different ways to attain the same average treatment effect. The American Statistician 78(3): 267–72.
Zhang, D., J. Hartline, and J. Hullman. 2024. Designing shared information displays for agents of varying strategic sophistication. Proceedings of the ACM on Human-Computer Interaction 8(CSCW1): 1–34.
Nanayakkara, P., H. Kim, Y. Wu, A. Sarvghad, N. Mahyar, G. Miklau, and J. Hullman. 2024. Measure-Observe-Remeasure: An interactive paradigm for differentially-private exploratory analysis. Proceedings of the IEEE Symposium on Security and Privacy (S&P), 231.
Zhang, D., A. Chatzimparmpas, N. Kamali, and J. Hullman. 2024. Evaluating the utility of conformal prediction sets for AI-advised image labeling. Proceedings of the ACM CHI Conference on Human Factors in Computing Systems, 1–19.
Wu, Y., Z. Guo, J. Hartline, and J. Hullman. 2023. The rational agent benchmark for data visualization. IEEE Transactions on Visualization and Computer Graphics 30(1): 338–47.
Nanayakkara, P., J. Bater, X. Hu, J. Hullman, and J. Rogers. 2022. Visualizing privacy-utility trade-offs in differentially private data releases. Proceedings on Privacy Enhancing Technologies (2): 601–18.
Hullman, J. and A. Gelman. 2021. Designing for interactive exploratory data analysis requires theories of graphical inference. Harvard Data Science Review 3(3).