# Human vs AI Intelligence Evolution: Structured Research Report

**Date:** March 17, 2026  
**Scope:** Human intelligence evolution, AI benchmark progression, expert consensus, and chart-ready data.

**Charts:** [Interactive HTML](2026-03-17-human-vs-ai-intelligence-charts.html)

**Primary AI data sources:** [Epoch AI](https://epoch.ai/benchmarks) (benchmarks, ECI, FrontierMath, GPQA Diamond, SWE-bench Verified); [METR](https://metr.org) (Time Horizon, RE-Bench, evaluation reports).

---

## 1. Human Intelligence Evolution

![Human intelligence evolution over 6 million years](human-evolution-chart.png)

*Chart data sources: [Australian Museum](https://australian.museum/learn/science/human-evolution/larger-brains), [NCBI](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9250492/), [Wikipedia](https://en.wikipedia.org/wiki/Evolution_of_human_intelligence), [Smithsonian Human Origins](https://humanorigins.si.edu/evidence/human-evolution-timeline-interactive)*

### When Did Human-Level Intelligence Emerge?

Human-level intelligence did not emerge at a single point. It evolved gradually over roughly 6–7 million years, with key transitions:

- **~6 mya:** [Sahelanthropus](https://en.wikipedia.org/wiki/Sahelanthropus), transition to [bipedalism](https://en.wikipedia.org/wiki/Bipedalism) ([Science News Today](https://www.sciencenewstoday.org/the-evolution-of-human-intelligence); [Smithsonian](https://www.smithsonianmag.com/science-nature/essential-timeline-understanding-evolution-homo-sapiens-180976807/); [Nature 2022](https://www.nature.com/articles/s41586-022-04901-z))
- **~2.5 mya:** [Homo habilis](https://en.wikipedia.org/wiki/Homo_habilis), larger brains (~510–650 cc), deliberate tool-making ([Wikipedia](https://en.wikipedia.org/wiki/Evolution_of_human_intelligence); [Australian Museum](https://australian.museum/learn/science/human-evolution/larger-brains))
- **~1.8 mya:** [Homo erectus](https://en.wikipedia.org/wiki/Homo_erectus), brains ~800–1,200 cc, fire control, more complex tools ([Wikipedia](https://en.wikipedia.org/wiki/Homo_erectus))
- **~550–750 kya:** [Homo heidelbergensis](https://en.wikipedia.org/wiki/Homo_heidelbergensis) (~1,250 cc), common ancestor to H. sapiens, Neanderthals, Denisovans ([Australian Museum](https://australian.museum/learn/science/human-evolution/larger-brains))
- **~200 kya:** Anatomically modern [Homo sapiens](https://en.wikipedia.org/wiki/Homo_sapiens) (~1,350–1,500 cc) ([Smithsonian](https://www.smithsonianmag.com/science-nature/essential-timeline-understanding-evolution-homo-sapiens-180976807/))
- **~50 kya:** [Behavioral modernity](https://en.wikipedia.org/wiki/Behavioral_modernity) (symbolic thought, art, complex culture) ([Wikipedia](https://en.wikipedia.org/wiki/Evolution_of_human_intelligence))

The "cognitive revolution" is often placed around 50–70 kya with behavioral modernity, though cognitive capacity likely increased gradually within lineages rather than in sharp jumps ([PNAS 2024](https://www.pnas.org/doi/10.1073/pnas.2409542121); [NCBI cognitive evolution](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3385680/)).

### Timeline (Approximate Dates)

| Period (mya) | Species / Event | Brain Size (cc) |
|--------------|-----------------|-----------------|
| 6 | [Sahelanthropus](https://en.wikipedia.org/wiki/Sahelanthropus), [bipedalism](https://en.wikipedia.org/wiki/Bipedalism) | ~350 |
| 4–3 | [Australopithecus afarensis](https://en.wikipedia.org/wiki/Australopithecus_afarensis) | 430–550 |
| 2.4–2 | [Homo habilis](https://en.wikipedia.org/wiki/Homo_habilis) | 510–650 |
| 1.8–0.14 | [Homo erectus](https://en.wikipedia.org/wiki/Homo_erectus) | 800–1,200 |
| 0.7–0.2 | [Homo heidelbergensis](https://en.wikipedia.org/wiki/Homo_heidelbergensis) | 1,250 |
| 0.2–0.03 | [Homo neanderthalensis](https://en.wikipedia.org/wiki/Neanderthal) | 1,500 |
| 0.2–present | [Homo sapiens](https://en.wikipedia.org/wiki/Homo_sapiens) | 1,350–1,500 |

*Sources: [NCBI Plio-Pleistocene brain size](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9250492/), [Australian Museum](https://australian.museum/learn/science/human-evolution/larger-brains), [Wikipedia Evolution of human intelligence](https://en.wikipedia.org/wiki/Evolution_of_human_intelligence), [Smithsonian Human Origins](https://humanorigins.si.edu/evidence/human-evolution-timeline-interactive)*

### Magnitude of Change

- Brain size increased roughly **3x** from early Australopithecines (~450 cc) to modern humans (~1,350 cc) ([Australian Museum](https://australian.museum/learn/science/human-evolution/larger-brains); [NCBI Plio-Pleistocene](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9250492/)).
- Most expansion occurred within species rather than between them; within-species encephalization accelerated over time ([PNAS 2024](https://www.pnas.org/doi/10.1073/pnas.2409542121)).
- Brain size correlates with cognitive and behavioral traits, though the relationship is not strictly linear ([NCBI cognitive evolution](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3385680/)).

### Key Sources

- [PNAS 2024](https://www.pnas.org/doi/10.1073/pnas.2409542121): "Hominin brain size increase has emerged from within-species encephalization"
- [NCBI](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9250492/): "Pattern and rate in the Plio-Pleistocene evolution of modern human brain size"
- [NCBI](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3385680/): "Hominin cognitive evolution: identifying patterns and processes"
- [Wikipedia](https://en.wikipedia.org/wiki/Evolution_of_human_intelligence): Evolution of human intelligence
- [Smithsonian](https://www.smithsonianmag.com/science-nature/essential-timeline-understanding-evolution-homo-sapiens-180976807/): "An Evolutionary Timeline of Homo Sapiens"
- [Australian Museum](https://australian.museum/learn/science/human-evolution/larger-brains): Larger brains
- [Smithsonian Human Origins Program](https://humanorigins.si.edu/evidence/human-evolution-timeline-interactive): Human Evolution Interactive Timeline
- [Nature 2022](https://www.nature.com/articles/s41586-022-04901-z): "Postcranial evidence of late Miocene hominin bipedalism in Chad" (Sahelanthropus)
- [Science News Today](https://www.sciencenewstoday.org/the-evolution-of-human-intelligence): The Evolution of Human Intelligence

---

## 2. Machine/AI Intelligence Evolution

![AI benchmark performance 2015-2024 vs human baseline](ai-benchmarks-chart.png)

### Timeline (1950s to Present)

| Era | Milestone |
|-----|-----------|
| 1950s | Dartmouth Conference, birth of AI |
| 1997 | Deep Blue defeats Kasparov (chess) |
| 2012 | AlexNet wins ImageNet (15.3% top-5 error) |
| 2015 | ResNet reaches 3.57% error on ImageNet |
| 2016 | AlphaGo defeats Lee Sedol (Go) |
| 2020 | GPT-3, MMLU launch (GPT-3: 43.9%) |
| 2021 | HumanEval launch, Codex 28.8% pass@1 |
| 2023 | GPT-4, new benchmarks (MMMU, GPQA, SWE-bench) |
| 2024 | o1/o3 reasoning models, SWE-bench 71.7%, MMLU ~93% |
| 2026 | Frontier benchmarks shift to GPQA, SWE-bench, real-world evals; MMLU relegated to baseline |

### Epoch AI Benchmarking Hub

[Epoch AI](https://epoch.ai) maintains a public [benchmarking database](https://epoch.ai/benchmarks) (updated Mar. 17, 2026) tracking 40+ benchmarks and leading model performance. It combines internally run evals, benchmark-creator leaderboards, and developer-reported scores. Categories include mathematics, software engineering, agent tasks, world knowledge, science, long context, multimodal, and more ([Epoch AI Benchmarks](https://epoch.ai/benchmarks)).

The [Epoch Capabilities Index (ECI)](https://epoch.ai/benchmarks/eci) stitches 37+ benchmarks into a single capability scale using Item Response Theory, enabling comparisons across models and time even as individual benchmarks saturate. Methodology: [A Rosetta Stone for AI Benchmarks](https://arxiv.org/abs/2512.00193). As of Mar. 9, 2026, GPT-5.4 Pro leads the ECI, narrowly ahead of Gemini 3.1 Pro ([benchmarking update](https://epoch.ai/benchmarks)).
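
To make the Item Response Theory idea concrete, here is a minimal sketch of the textbook two-parameter logistic (2PL) item response function. It is not Epoch AI's implementation; the function name `expected_score` and the capability, difficulty, and slope values are illustrative assumptions.

```python
import math

def expected_score(capability: float, difficulty: float, slope: float = 1.0) -> float:
    """Textbook 2PL item response function: expected score in [0, 1]."""
    return 1.0 / (1.0 + math.exp(-slope * (capability - difficulty)))

# Hypothetical values: one model evaluated on an easy and a hard benchmark.
easy = expected_score(capability=2.0, difficulty=0.0)  # close to saturation
hard = expected_score(capability=2.0, difficulty=4.0)  # far from saturation
print(f"easy benchmark: {easy:.2f}, hard benchmark: {hard:.2f}")
```

In a model of this shape, a saturated benchmark and an unsaturated one both place the same model on one capability axis, which is why a combined index can keep discriminating after individual benchmarks top out.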

### METR (Model Evaluation & Threat Research)

[METR](https://metr.org) is a research nonprofit that evaluates frontier AI capabilities and risks. It works with OpenAI, Anthropic, Amazon, and AISI on third-party evaluations of autonomous capabilities and threat models ([METR](https://metr.org)).

The [METR Time Horizon](https://metr.org/time-horizons) measures the task duration, expressed as human expert completion time, at which an AI agent succeeds with a given reliability. The 50% time horizon is the task length an agent completes with a 50% success rate. Tasks are drawn from [RE-Bench](https://metr.org/blog/2024-11-22-evaluating-r-d-capabilities-of-llms/), [HCAST](https://arxiv.org/html/2503.17354v1), and SWAA, and are primarily software engineering, ML, and cybersecurity tasks. Paper: [Measuring AI Ability to Complete Long Tasks](https://arxiv.org/abs/2503.14499). [Time Horizon 1.1](https://metr.org/blog/2026-1-29-time-horizon-1-1/) (Jan 2026) expanded the suite to 228 tasks. The metric has doubled roughly every 7 months over 6 years, with the doubling time possibly shortening to ~4 months in 2024 ([Epoch AI](https://epoch.ai/data-insights/ai-capabilities-progress-has-sped-up)). METR also publishes evaluation reports ([evaluations.metr.org](https://evaluations.metr.org)) and advises on [Frontier AI Safety Policies](https://metr.org/fsp).
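
As an illustration of how a time horizon can be read off a fitted success curve, the sketch below assumes success probability is a logistic function of log2 task length, which is broadly the approach the METR paper describes. The function name `time_horizon` and the `intercept` and `slope` values are hypothetical, not METR's fitted numbers.

```python
import math

def time_horizon(intercept: float, slope: float, reliability: float = 0.5) -> float:
    """Task length in minutes at which p(success) equals `reliability`.

    Assumes p(success) = sigmoid(intercept - slope * log2(minutes)), with slope > 0.
    """
    logit = math.log(reliability / (1.0 - reliability))
    return 2.0 ** ((intercept - logit) / slope)

# Hypothetical fitted parameters for a single agent.
h50 = time_horizon(intercept=3.0, slope=0.6, reliability=0.5)
h80 = time_horizon(intercept=3.0, slope=0.6, reliability=0.8)
print(f"50% horizon: {h50:.1f} min, 80% horizon: {h80:.1f} min")
```

Under this assumption the 80% horizon is always shorter than the 50% horizon, since demanding higher reliability restricts the agent to shorter tasks.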

### Key Benchmarks and Score Trends

**ImageNet (Computer Vision)**  
- 2012 AlexNet: 15.3% top-5 error (84.7% accuracy)  
- 2015 ResNet: 3.57% error (~96.4% accuracy)  
- ~2020: Models approach human-level (~5% error); Shankar et al. show trained humans ~4% better on fine-grained classes  
- Human baseline: ~5% error (context-dependent; humans can reach ~96% with training)

**MMLU (Massive Multitask Language Understanding)**  
- 2020 GPT-3 175B: 43.9%  
- 2024 frontier models: ~88–93%  
- Human expert: ~89.8% (estimated expert-level accuracy, MMLU paper)  

MMLU is still used but no longer a meaningful frontier benchmark. It is a multiple-choice benchmark with ~16k questions across 57 academic subjects (math, law, medicine, philosophy, etc.); it tests broad knowledge, not reasoning or generation. Historically it became the default "general intelligence" benchmark around 2021–2023.

*Why it is less relevant today:*
- **Saturation:** Frontier models score >90%; the benchmark no longer discriminates between models. Publishing an MMLU score today conveys little information about relative capability.
- **Multiple-choice artifact:** The format allows guessing, exploiting answer patterns, and test-taking strategies; it measures classification accuracy, not real reasoning or generation.
- **Dataset contamination:** Questions are widely available and likely in training corpora; memorization risk makes scores less trustworthy.
- **Misalignment with modern use cases:** Real deployments involve long-context reasoning, coding, tool use, agent workflows, and retrieval-augmented systems; MMLU evaluates none of these.

*What replaced it (roughly):*

| Capability | Example benchmark |
|------------|-------------------|
| Expert reasoning | GPQA, [GPQA Diamond](https://epoch.ai/benchmarks/gpqa-diamond) |
| Coding | [SWE-bench Verified](https://epoch.ai/benchmarks/swe-bench-verified) |
| Math | [FrontierMath](https://epoch.ai/frontiermath), AIME, MATH Level 5 |
| Multimodal | MMMU |
| Long context | LongBench |
| Agent tasks | [APEX-Agents](https://epoch.ai/benchmarks), [ARC-AGI-2](https://epoch.ai/benchmarks/arc-agi-2), tool-use evals |
| Task length / autonomy | [METR Time Horizon](https://metr.org/time-horizons) |
| Saturation-resistant | [Humanity's Last Exam](https://epoch.ai/benchmarks/hle) |

MMLU-Pro attempts to restore difficulty with harder questions and more answer options. In 2026, MMLU serves as a baseline general knowledge test, historical reference, and quick regression check, not a frontier benchmark. Benchmark evolution: GLUE (2019) → MMLU (2022) → GPQA / SWE-bench / real-world evals (2026).

*Practical takeaway for AI engineers:* Do not rely on MMLU alone; use task-specific benchmarks; combine with human or workflow evaluation. Many modern leaderboards exclude MMLU because it is saturated.

**HumanEval (Code Generation)**  
- 2021 Codex: 28.8% pass@1 (70.2% with 100 samples per problem)  
- 2024 o1-mini, Claude 3.5 Sonnet: ~92%  
- Saturated; on harder benchmarks such as BigCodeBench, AI systems reach ~35.5% versus a human baseline of 97%
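
The pass@1 and sampled figures above are instances of the pass@k metric. A minimal sketch of the unbiased estimator introduced with Codex (Chen et al., 2021) follows: with `n` samples per problem of which `c` pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k). The sample counts in the example are hypothetical.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples per problem, c of which pass."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 200 samples drawn, 30 pass the unit tests.
for k in (1, 10, 100):
    print(f"pass@{k}: {pass_at_k(200, 30, k):.3f}")
```

A benchmark-level score is the mean of this estimate over all problems, which is why pass@100 can be far higher than pass@1 for the same model.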

**SWE-bench (Software Engineering)**  
- 2023: best system resolves 4.4% of issues  
- 2024: best system resolves 71.7%  
- [SWE-bench Verified](https://epoch.ai/benchmarks/swe-bench-verified): human-validated 500-sample subset; Epoch AI runs internally ([Epoch AI](https://epoch.ai/benchmarks))  
- Caveat: [METR](https://metr.org/notes/2026-03-10-many-swe-bench-passing-prs-would-not-be-merged-into-main/) found ~50% of test-passing SWE-bench Verified PRs written by AI agents would not be merged by repo maintainers; naive benchmark interpretation may overestimate real-world usefulness

**GPQA (Graduate-Level Q&A)**  
- 2023–2024: top score rose by 48.9 percentage points  
- [GPQA Diamond](https://epoch.ai/benchmarks/gpqa-diamond): 198-question harder subset; PhD experts ~65% ([Epoch AI](https://epoch.ai/benchmarks))

**MMMU (Multimodal Reasoning)**  
- 2023–2024: top score rose by 18.8 percentage points  
- Source: Stanford HAI AI Index 2025

**SQuAD (Reading Comprehension)**  
- 2016: Human 86.8% F1, best system ~51%  
- 2018 SQuAD 2.0: Strong systems ~66%  
- Later: Some ensembles exceed human on SQuAD 2.0

**METR Time Horizon (Task Length)**  
- Measures human-equivalent task duration at which AI agents succeed (50% or 80% reliability). Tasks from RE-Bench, HCAST, SWAA ([METR](https://metr.org/time-horizons)).  
- Exponential increase over 6 years; doubling ~every 7 months, possibly ~4 months in 2024 ([METR](https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/); [Epoch AI](https://epoch.ai/benchmarks/metr-time-horizons)).  
- Extrapolation: month-long tasks in under a decade if trend continues ([METR](https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/)).
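
The extrapolation is simple arithmetic once a constant doubling time is assumed. The sketch below starts from the latest 50% horizon in this report's chart data (718.8 minutes) and assumes that a "month-long" task means 167 working hours. Both the function name `months_until` and that definition of a work month are assumptions for illustration.

```python
import math

def months_until(target_minutes: float, current_minutes: float, doubling_months: float) -> float:
    """Months for the horizon to grow from current to target at a fixed doubling time."""
    return doubling_months * math.log2(target_minutes / current_minutes)

CURRENT_MINUTES = 718.8        # latest 50% horizon listed in this report's chart data
WORK_MONTH_MINUTES = 167 * 60  # assumption: one work month = 167 working hours

for doubling in (7, 4):
    m = months_until(WORK_MONTH_MINUTES, CURRENT_MINUTES, doubling)
    print(f"doubling every {doubling} months -> ~{m:.0f} months to a month-long task")
```

The result is sensitive to the starting point and to whether the trend holds at all, so it is best read as a rough bound rather than a forecast.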

**Games**  
- 1997: Deep Blue beats Kasparov (chess)  
- 2016: AlphaGo beats Lee Sedol (Go)

### Improvement Rates

- **Training compute:** 4–5x per year ([Epoch AI](https://epoch.ai))  
- **AI chip production:** ~3.3x per year (doubling ~7 months) ([Epoch AI](https://epoch.ai))  
- **Epoch Capabilities Index:** ~8 pts/year before Apr 2024, ~15 pts/year after; [90% acceleration in Apr 2024](https://epoch.ai/data-insights/ai-capabilities-progress-has-sped-up), coinciding with reasoning models and RL focus  
- **METR Time Horizon:** doubling ~every 7 months over 6 years; [40% acceleration in Oct 2024](https://epoch.ai/data-insights/ai-capabilities-progress-has-sped-up)  
- **Efficiency:** In 2022 the smallest model scoring above 60% on MMLU was PaLM (540B parameters); by 2024 Phi-3-mini (3.8B) reached the same threshold, a 142x reduction in size
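
The growth rates and doubling times quoted above are two views of the same quantity, and the conversion can be checked directly. This is a small sketch assuming constant exponential growth; `doubling_months` is a helper name introduced here.

```python
import math

def doubling_months(growth_per_year: float) -> float:
    """Doubling time in months implied by a constant annual growth factor."""
    return 12.0 * math.log(2.0) / math.log(growth_per_year)

# AI chip production at ~3.3x per year
print(f"3.3x/year -> {doubling_months(3.3):.1f} months")
# Training compute at 4-5x per year
print(f"4-5x/year -> {doubling_months(5):.1f} to {doubling_months(4):.1f} months")
# ECI slope change from ~8 to ~15 points per year
print(f"ECI speed-up: {(15 / 8 - 1) * 100:.0f}%")
# Parameter-count ratio behind the efficiency bullet
print(f"PaLM 540B vs Phi-3-mini 3.8B: {540 / 3.8:.0f}x")
```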

### Key Sources

- [Epoch AI](https://epoch.ai): [benchmarking database](https://epoch.ai/benchmarks), [Epoch Capabilities Index](https://epoch.ai/benchmarks/eci), [FrontierMath](https://epoch.ai/frontiermath), [METR Time Horizons](https://epoch.ai/benchmarks/metr-time-horizons), training compute, chip production, [AI capabilities progress](https://epoch.ai/data-insights/ai-capabilities-progress-has-sped-up)
- [METR](https://metr.org): [Time Horizon](https://metr.org/time-horizons), [RE-Bench](https://metr.org/blog/2024-11-22-evaluating-r-d-capabilities-of-llms/), [SWE-bench PR merge study](https://metr.org/notes/2026-03-10-many-swe-bench-passing-prs-would-not-be-merged-into-main/), [evaluation reports](https://evaluations.metr.org)
- Stanford HAI AI Index 2024, 2025 (Technical Performance)
- Our World in Data: AI performance, AI timelines
- Kiela et al. (2023): Dynabench, Plotting Progress in AI
- Wikipedia: ImageNet, AlexNet, ResNet, MMLU, HumanEval, Deep Blue

---

## 3. Human vs AI Comparison

### Tasks Where AI Has Surpassed Humans

- **Image classification (ImageNet):** AI at or above human in standard setups; humans still ahead on fine-grained classes (Shankar et al.)
- **Speech recognition (Switchboard):** Surpassed
- **Reading comprehension (SQuAD 1.1, 2.0):** Surpassed by top ensembles
- **Language understanding (GLUE, SuperGLUE, MMLU):** GLUE and SuperGLUE are historical; MMLU is saturated and no longer a meaningful frontier metric. Modern evaluations use GPQA, SWE-bench, and task-specific benchmarks.
- **Code generation (HumanEval):** ~92%+; saturated
- **Chess, Go:** Surpassed (1997, 2016)
- **Short-horizon agent tasks ([RE-Bench](https://metr.org/blog/2024-11-22-evaluating-r-d-capabilities-of-llms/)):** AI ~4x human at 2-hour budget; [METR Time Horizon](https://metr.org/time-horizons) shows exponential growth in task length agents can complete

### Tasks Where AI Still Lags

- **Competition-level mathematics:** e.g. IMO. On [FrontierMath](https://epoch.ai/frontiermath) (350 research-level problems), GPT-5.4 Pro scores 50% on Tiers 1–3 and 38% on Tier 4 (Mar 2026); [FrontierMath: Open Problems](https://epoch.ai/frontiermath/open-problems) tests unsolved research mathematics
- **Visual commonsense reasoning**
- **Planning**
- **Long-horizon tasks:** At 32-hour budget, humans outperform AI ~2:1 ([RE-Bench](https://metr.org/blog/2024-11-22-evaluating-r-d-capabilities-of-llms/)); [METR Time Horizon](https://metr.org/time-horizons) tracks growth toward longer autonomous task completion
- **BigCodeBench:** AI ~35.5% vs human 97%
- **Humanity's Last Exam (HLE):** 2,500 questions across 100+ subjects, graduate-level; created by Center for AI Safety and Scale AI to address saturation; frontier models still far from human ([Epoch AI](https://epoch.ai/benchmarks/hle))
- **Arithmetic and planning beyond training distribution:** Unreliable (Stanford HAI 2025)

### Expert Consensus on General Human-Level AI

![Expert forecasts for human-level AI (50% probability)](expert-forecasts-chart.png)

**Surveys and forecasts:**

- **Grace et al. 2022 (AI Impacts):** Median 50% chance of HLMI by 2061; 90% by 2100
- **Zhang et al. 2022:** Median 50% by 2047 (fixed-probability) or 2070 (fixed-years)
- **AI Impacts 2022 (738 researchers):** 25% chance by early 2030s, 50% by 2047
- **Metaculus:** Median AGI by 2040 (Nov 2022); shortened to ~2031 by 2025
- **Cotra 2020:** 50% transformative AI by 2050; 2022 update: ~10 years earlier (~2040)
- **80,000 Hours 2025:** Company leaders often 2–5 years; researchers median ~2047; Metaculus ~2031
- **Müller & Bostrom 2012–2013:** 50% high-level machine intelligence by 2040–2050; 90% by 2075

**Caveats:** Large disagreement, framing effects, and limited track record of expert forecasts in their own fields (Our World in Data; Tetlock).

### Expectations for Coming Decades

- Many experts assign non-trivial probability to human-level or transformative AI within 20–50 years.
- Timelines have shortened since 2020–2022.
- AGI before 2030 is within the range of expert opinion; 2028 is cited as plausible given reasoning advances.
- Uncertainty is high in both directions.

---

## 4. Dominance and the Intelligence Gap

### Will Humans Cease to Be the Dominant Intelligence?

- Many researchers treat this as a serious possibility, not science fiction (Our World in Data).
- Bostrom defines superintelligence as "an intellect that is much smarter than the best human brains in practically every field" (Superintelligence).
- Expert timelines for HLMI/AGI imply the question could become relevant within decades.

### The "Human Explaining to a Dog" Analogy

- Bostrom uses a primate analogy: humans vs gorillas, with one species dominant and the other marginalized.
- Superintelligence discussions often invoke extreme cognitive gaps: e.g. humans explaining justice to ants (Michael Bass; Azeem Azhar).
- Potential dimensions of a superintelligence gap: processing speed (possibly millions of times faster), memory, reasoning depth, domain breadth, adaptive learning.
- The gap would be qualitative as well as quantitative: modes of thought humans may not be able to emulate or imagine.

### Superintelligence Scenarios

- Bostrom (2014): Superintelligence could follow HLMI within ~30 years.
- Control and alignment are central concerns; "The Unfinished Fable of the Sparrows" illustrates control difficulties.
- Intelligence is not a single linear scale; AI can exceed humans in narrow domains while lacking general capabilities (Nature 2024; arXiv:2602.04986).

---

## 5. CHART_DATA

Structured JSON-ready data for visualization.

### human_evolution

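Capability index: 0–100 scale, with 100 = modern H. sapiens. Brain size is used as a rough proxy, anchored at ~400 cc (baseline) and ~1,400 cc (modern). The index values are illustrative rather than a strict linear normalization: for example, 1,400 cc maps to both 90 (anatomical modernity) and 100 (behavioral modernity), so the final step reflects behavior rather than brain volume. Sources: [Australian Museum](https://australian.museum/learn/science/human-evolution/larger-brains), [NCBI Plio-Pleistocene](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9250492/), [Wikipedia Evolution of human intelligence](https://en.wikipedia.org/wiki/Evolution_of_human_intelligence).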

```json
{
  "human_evolution": [
    {"year_mya": 6, "species": "Sahelanthropus", "brain_cc": 350, "capability_index": 5},
    {"year_mya": 4, "species": "Australopithecus afarensis", "brain_cc": 450, "capability_index": 12},
    {"year_mya": 2.2, "species": "Homo habilis", "brain_cc": 580, "capability_index": 22},
    {"year_mya": 1.5, "species": "Homo erectus (early)", "brain_cc": 850, "capability_index": 40},
    {"year_mya": 0.8, "species": "Homo erectus (late)", "brain_cc": 1100, "capability_index": 60},
    {"year_mya": 0.5, "species": "Homo heidelbergensis", "brain_cc": 1250, "capability_index": 75},
    {"year_mya": 0.2, "species": "Homo sapiens (anatomical)", "brain_cc": 1400, "capability_index": 90},
    {"year_mya": 0.05, "species": "Homo sapiens (behavioral modernity)", "brain_cc": 1400, "capability_index": 100}
  ]
}
```
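
For anyone re-deriving the index, the sketch below compares the listed `capability_index` values with a strictly linear normalization of `brain_cc` between 400 cc and 1,400 cc. The helper `linear_index` and the clipping to 0–100 are assumptions made here. The listed values differ from the strict linear ones, so they are best treated as hand-assigned illustrative values.

```python
def linear_index(brain_cc: float, low: float = 400.0, high: float = 1400.0) -> float:
    """Linear 0-100 normalization of brain volume, clipped to the scale."""
    raw = (brain_cc - low) / (high - low) * 100.0
    return max(0.0, min(100.0, raw))

# (brain_cc, capability_index) pairs copied from the JSON above.
rows = [(350, 5), (450, 12), (580, 22), (850, 40),
        (1100, 60), (1250, 75), (1400, 90), (1400, 100)]

for cc, listed in rows:
    print(f"{cc:>5} cc  listed={listed:>3}  strictly linear={linear_index(cc):5.1f}")
```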

### ai_benchmarks

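Scores as percentages (accuracy or pass rate). Human baseline included where known; `null` means no baseline is listed. The chess and Go entries are notional (100 for the winning system against a baseline of 50), not measured accuracies. MMLU entries are historical; MMLU is now a baseline, not a frontier benchmark. FrontierMath 2026 from [Epoch AI](https://epoch.ai/benchmarks).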

```json
{
  "ai_benchmarks": [
    {"year": 1997, "benchmark_name": "Chess (vs World Champion)", "score": 100, "human_baseline": 50, "model": "Deep Blue"},
    {"year": 2012, "benchmark_name": "ImageNet top-5", "score": 84.7, "human_baseline": 95, "model": "AlexNet"},
    {"year": 2015, "benchmark_name": "ImageNet top-5", "score": 96.4, "human_baseline": 95, "model": "ResNet"},
    {"year": 2016, "benchmark_name": "Go (vs World Champion)", "score": 100, "human_baseline": 50, "model": "AlphaGo"},
    {"year": 2016, "benchmark_name": "SQuAD F1", "score": 51, "human_baseline": 86.8, "model": "baseline"},
    {"year": 2018, "benchmark_name": "SQuAD 2.0 F1", "score": 66, "human_baseline": 89.5, "model": "strong neural"},
    {"year": 2020, "benchmark_name": "MMLU", "score": 43.9, "human_baseline": 89.8, "model": "GPT-3 175B"},
    {"year": 2021, "benchmark_name": "HumanEval pass@1", "score": 28.8, "human_baseline": null, "model": "Codex"},
    {"year": 2022, "benchmark_name": "MMLU", "score": 70, "human_baseline": 89.8, "model": "PaLM"},
    {"year": 2023, "benchmark_name": "MMLU", "score": 86, "human_baseline": 89.8, "model": "GPT-4"},
    {"year": 2023, "benchmark_name": "SWE-bench", "score": 4.4, "human_baseline": null, "model": "2023 best"},
    {"year": 2024, "benchmark_name": "MMLU", "score": 93, "human_baseline": 89.8, "model": "o3"},
    {"year": 2024, "benchmark_name": "HumanEval pass@1", "score": 92.4, "human_baseline": null, "model": "o1-mini"},
    {"year": 2024, "benchmark_name": "SWE-bench", "score": 71.7, "human_baseline": null, "model": "2024 best"},
    {"year": 2024, "benchmark_name": "GPQA", "score": 65, "human_baseline": null, "model": "2024 best"},
    {"year": 2024, "benchmark_name": "MMMU", "score": 85, "human_baseline": null, "model": "2024 best"},
    {"year": 2026, "benchmark_name": "FrontierMath Tiers 1-3", "score": 50, "human_baseline": null, "model": "GPT-5.4 Pro"},
    {"year": 2026, "benchmark_name": "FrontierMath Tier 4", "score": 38, "human_baseline": null, "model": "GPT-5.4 Pro"}
  ]
}
```

### metr_time_horizon

50% time horizon in minutes (human-equivalent task length at which agent succeeds with 50% reliability). Source: [Epoch AI benchmark_data.zip](https://epoch.ai/data/benchmark_data.zip) (`metr_time_horizons_external.csv`), from [METR](https://metr.org/time-horizons).

```json
{
  "metr_time_horizon": [
    {"year": 2019.83, "minutes": 0.05, "model": "GPT-2"},
    {"year": 2020.45, "minutes": 0.15, "model": "GPT-3"},
    {"year": 2023.17, "minutes": 5.4, "model": "GPT-4"},
    {"year": 2023.42, "minutes": 4.0, "model": "GPT-4 (Jun)"},
    {"year": 2023.67, "minutes": 0.6, "model": "GPT-3.5"},
    {"year": 2024.08, "minutes": 6.4, "model": "Claude 3 Opus"},
    {"year": 2024.25, "minutes": 6.6, "model": "GPT-4 Turbo"},
    {"year": 2024.58, "minutes": 7.0, "model": "GPT-4o"},
    {"year": 2024.67, "minutes": 22.2, "model": "o1-preview"},
    {"year": 2024.75, "minutes": 29.6, "model": "Claude 3.5"},
    {"year": 2024.92, "minutes": 39.2, "model": "o1"},
    {"year": 2025.08, "minutes": 60.4, "model": "Claude 3.7"},
    {"year": 2025.25, "minutes": 119.7, "model": "o3"},
    {"year": 2025.58, "minutes": 203.0, "model": "GPT-5"},
    {"year": 2025.83, "minutes": 293.0, "model": "Opus 4.5"},
    {"year": 2025.92, "minutes": 352.2, "model": "GPT-5.2"},
    {"year": 2026.08, "minutes": 718.8, "model": "Opus 4.6"}
  ]
}
```
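
A rough way to check the quoted doubling times against this series is a least-squares fit of log2(minutes) against year. The sketch below uses only the points listed above and the standard library; the function name `fitted_doubling_months` is introduced here. The series mixes frontier and non-frontier models, so the fit is indicative only.

```python
import math

# (decimal year, 50% horizon in minutes) copied from the JSON above.
POINTS = [
    (2019.83, 0.05), (2020.45, 0.15), (2023.17, 5.4), (2023.42, 4.0),
    (2023.67, 0.6), (2024.08, 6.4), (2024.25, 6.6), (2024.58, 7.0),
    (2024.67, 22.2), (2024.75, 29.6), (2024.92, 39.2), (2025.08, 60.4),
    (2025.25, 119.7), (2025.58, 203.0), (2025.83, 293.0), (2025.92, 352.2),
    (2026.08, 718.8),
]

def fitted_doubling_months(points):
    """Least-squares slope of log2(minutes) vs. year, returned as months per doubling."""
    xs = [year for year, _ in points]
    ys = [math.log2(minutes) for _, minutes in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    doublings_per_year = sxy / sxx
    return 12.0 / doublings_per_year

recent = [p for p in POINTS if p[0] >= 2024.0]
print(f"all models:  {fitted_doubling_months(POINTS):.1f} months per doubling")
print(f"2024 onward: {fitted_doubling_months(recent):.1f} months per doubling")
```

Restricting the fit to 2024 onward gives a shorter doubling time than the full series, which is in line with the acceleration noted in Section 2.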

### future_projections

Expert forecasts for human-level or transformative AI (50% probability unless noted).

```json
{
  "future_projections": [
    {"source": "Müller & Bostrom 2012", "year_50pct": 2045, "year_90pct": 2075, "notes": "High-level machine intelligence"},
    {"source": "Grace et al. 2022", "year_50pct": 2061, "year_90pct": 2100, "notes": "HLMI, 356 experts"},
    {"source": "Zhang et al. 2022", "year_50pct": 2047, "year_90pct": 2105, "notes": "Fixed-probability framing"},
    {"source": "AI Impacts 2022", "year_50pct": 2047, "year_25pct": 2033, "notes": "738 ML researchers"},
    {"source": "Metaculus 2022", "year_50pct": 2040, "notes": "AGI devised, tested, announced"},
    {"source": "Metaculus 2025", "year_50pct": 2031, "notes": "Shortened from 2040"},
    {"source": "Cotra 2020", "year_50pct": 2050, "notes": "Transformative AI"},
    {"source": "Cotra 2022", "year_50pct": 2040, "notes": "Updated, 10 years earlier"},
    {"source": "Samotsvety 2023", "year_50pct": 2041, "std_years": 9, "notes": "AGI"},
    {"source": "Company leaders 2025", "year_min": 2027, "year_max": 2030, "notes": "2-5 years from 2025"}
  ]
}
```
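
As a usage sketch for this block, the snippet below loads a trimmed copy of the entries (only the `source` and year fields), keeps those that carry a `year_50pct`, and reports their median. The trimming and the variable names are choices made here. The sources define their targets differently (HLMI, AGI, transformative AI), so the median is a summary for charting, not a forecast.

```python
import json
from statistics import median

# Trimmed copy of the JSON above: source names and year fields only.
DATA = json.loads("""
{"future_projections": [
  {"source": "Müller & Bostrom 2012", "year_50pct": 2045},
  {"source": "Grace et al. 2022", "year_50pct": 2061},
  {"source": "Zhang et al. 2022", "year_50pct": 2047},
  {"source": "AI Impacts 2022", "year_50pct": 2047},
  {"source": "Metaculus 2022", "year_50pct": 2040},
  {"source": "Metaculus 2025", "year_50pct": 2031},
  {"source": "Cotra 2020", "year_50pct": 2050},
  {"source": "Cotra 2022", "year_50pct": 2040},
  {"source": "Samotsvety 2023", "year_50pct": 2041},
  {"source": "Company leaders 2025", "year_min": 2027, "year_max": 2030}
]}
""")

# Keep only entries that state a 50%-probability year.
with_median = [p for p in DATA["future_projections"] if "year_50pct" in p]
median_year = median(p["year_50pct"] for p in with_median)

for p in sorted(with_median, key=lambda p: p["year_50pct"]):
    print(f'{p["year_50pct"]}  {p["source"]}')
print(f"median of listed 50% years: {median_year}")
```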

### improvement_rates

```json
{
  "improvement_rates": [
    {"metric": "Training compute (frontier models)", "rate_per_year": "4-5x", "source": "Epoch AI", "url": "https://epoch.ai"},
    {"metric": "AI chip production capacity", "rate_per_year": "3.3x", "doubling_months": 7, "source": "Epoch AI", "url": "https://epoch.ai"},
    {"metric": "Epoch Capabilities Index (pre-Apr 2024)", "points_per_year": 8, "source": "Epoch AI", "url": "https://epoch.ai/data-insights/ai-capabilities-progress-has-sped-up"},
    {"metric": "Epoch Capabilities Index (post-Apr 2024)", "points_per_year": 15, "source": "Epoch AI", "url": "https://epoch.ai/data-insights/ai-capabilities-progress-has-sped-up"},
    {"metric": "METR Time Horizon", "doubling_months": 7, "acceleration_2024": "~4 months", "source": "METR", "url": "https://metr.org/time-horizons"},
    {"metric": "AI supercomputer performance", "rate_per_year": "2.5x", "doubling_months": 9, "source": "Epoch AI", "url": "https://epoch.ai"}
  ]
}
```

---

## Sources Summary

| Domain | Key Sources |
|--------|-------------|
| Human evolution | [PNAS 2024](https://www.pnas.org/doi/10.1073/pnas.2409542121), [NCBI Plio-Pleistocene](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9250492/), [NCBI cognitive evolution](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3385680/), [Wikipedia](https://en.wikipedia.org/wiki/Evolution_of_human_intelligence), [Smithsonian](https://www.smithsonianmag.com/science-nature/essential-timeline-understanding-evolution-homo-sapiens-180976807/), [Smithsonian Human Origins](https://humanorigins.si.edu/evidence/human-evolution-timeline-interactive), [Australian Museum](https://australian.museum/learn/science/human-evolution/larger-brains), [Nature 2022 Sahelanthropus](https://www.nature.com/articles/s41586-022-04901-z), [Science News Today](https://www.sciencenewstoday.org/the-evolution-of-human-intelligence) |
| AI benchmarks | [Epoch AI](https://epoch.ai/benchmarks) (40+ benchmarks, ECI, FrontierMath, HLE, GPQA Diamond, SWE-bench Verified), [METR](https://metr.org) (Time Horizon, RE-Bench, SWE-bench merge study), Stanford HAI AI Index 2024/2025, Our World in Data, Kiela et al. |
| Expert timelines | Grace et al., Zhang et al., AI Impacts, Metaculus, Cotra, 80,000 Hours |
| Superintelligence | Bostrom, Nature, arxiv |