POPPER Framework
An AI agent that automates hypothesis validation with statistical rigor


Authors
Kexin Huang1, Ying Jin2, Ryan Li1, Michael Y. Li1, Emmanuel Candès3,4, Jure Leskovec1
1Department of Computer Science, Stanford University
2Data Science Initiative & Department of Health Care Policy, Harvard University
3Department of Statistics, Stanford University
4Department of Mathematics, Stanford University
How POPPER Works
POPPER is a novel framework for rigorous and automated validation of free-form natural language hypotheses using LLM agents.

Leverages reasoning capabilities and domain knowledge to identify measurable implications of the main hypothesis and design falsification experiments.
Implements the designed experiments through data collection, simulations, statistical analyses, or real-world procedures to produce p-values.
Converts p-values into e-values and aggregates evidence while strictly controlling the Type-I error rate for statistically sound decisions.
Systematically stress-tests the hypothesis by iteratively testing adaptively solicited implications while adhering to rigorous statistical principles.
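The p-value-to-e-value step above can be sketched with a standard calibrator. This is an illustrative sketch only: the calibrator e(p) = κ·p^(κ−1) and the stop-when-evidence-reaches-1/α rule are well-known tools from e-value theory, but the specific calibrator, threshold, and function names here are assumptions, not necessarily what POPPER implements.

```python
def p_to_e(p: float, kappa: float = 0.5) -> float:
    # Standard p-to-e calibrator: e(p) = kappa * p**(kappa - 1),
    # a valid e-value for any kappa in (0, 1).
    # kappa = 0.5 gives e(p) = 1 / (2 * sqrt(p)).
    return kappa * p ** (kappa - 1)

def aggregate(p_values, alpha: float = 0.1):
    """Multiply e-values from successive falsification tests.
    By Ville's inequality, stopping as soon as the running product
    reaches 1/alpha keeps the Type-I error rate at most alpha,
    even though the tests are chosen adaptively."""
    e_running = 1.0
    for p in p_values:
        e_running *= p_to_e(p)
        if e_running >= 1.0 / alpha:
            return True, e_running   # enough evidence: reject the null
    return False, e_running          # evidence insufficient so far

# Hypothetical example: p-values from three falsification experiments.
decision, evidence = aggregate([0.01, 0.02, 0.04], alpha=0.1)
```

In this sketch the procedure stops early once the accumulated evidence crosses 1/α, which is what makes adaptive, sequential testing statistically sound.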
POPPER vs. Human Experts
Our framework matches human performance while dramatically reducing validation time.

POPPER completes hypothesis validation tasks 9.7 times faster than human experts.
POPPER generates 3.6 times more lines of code than human experts, enabling more thorough analysis.
POPPER performs 2.5 times more statistical tests than human experts, providing more robust evidence for or against hypotheses.
See POPPER in Action
Watch our demo showcasing POPPER's capabilities for hypothesis validation in target validation scenarios.
See how POPPER validates hypotheses like "Gene A regulates Phenotype B" with statistical rigor and automated experimentation.
Ready to Accelerate Your Research?
Contact us to learn how POPPER can help validate your hypotheses and accelerate your biomedical discoveries.