Human vs AI Intelligence Evolution

Past, present, and projected trends | March 17, 2026

Estimated read time: ~15 minutes

I put this together to compare biological evolution, human tech progress, and AI capability growth over time. The main idea: intelligence development has compressed from millions of years (hominin cognition) to decades (modern AI). Benchmarks are rough proxies, not directly comparable across domains.

Primary AI data sources: Epoch AI; METR.

Executive summary

Human intelligence evolution (millions of years)

The capability index (0–100) is a rough, illustrative proxy based on brain size, normalized from ~400 cc (early hominins) to ~1,400 cc (modern H. sapiens). Treat it as a rough ordering, not a precise score: brain volume correlates with some cognitive traits but is not a direct measure of intelligence, and index values are not directly comparable across domains.
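The normalization behind the capability index can be sketched in a few lines. This is a minimal reconstruction of the min-max scaling described above; the species volumes are approximate figures included only for illustration, not data from the cited sources.

```python
def capability_index(volume_cc: float, lo: float = 400.0, hi: float = 1400.0) -> float:
    """Map endocranial volume (cc) linearly onto a 0-100 index, clamped at both ends."""
    scaled = (volume_cc - lo) / (hi - lo) * 100.0
    return max(0.0, min(100.0, scaled))

# Approximate brain volumes, for illustration only.
approx_volumes = {
    "Australopithecus (~3 Ma)": 450,
    "Homo habilis (~2 Ma)": 600,
    "Homo erectus (~1 Ma)": 950,
    "Homo sapiens (modern)": 1400,
}

for species, cc in approx_volumes.items():
    print(f"{species}: index {capability_index(cc):.0f}")
```

The clamping matters at the edges: anything below 400 cc maps to 0 and anything above 1,400 cc maps to 100, which is one more reason the index is an ordering rather than a measurement.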

Key milestones

Sources: PNAS 2024 "Hominin brain size increase has emerged from within-species encephalization"; NCBI "Pattern and rate in the Plio-Pleistocene evolution of modern human brain size"; NCBI "Hominin cognitive evolution"; Australian Museum "Larger brains"; Smithsonian "An Evolutionary Timeline of Homo Sapiens"; Smithsonian Human Origins; Nature 2022 (Sahelanthropus bipedalism); Science News Today; Wikipedia Evolution of human intelligence.

Selected AI benchmark progress by benchmark family

These charts split out benchmark families that I used to show together. MMLU is a knowledge-heavy exam benchmark with a published human-expert baseline. HumanEval and SWE-bench are both coding-related but test different things. Shared percentage units don't mean a shared scale of general intelligence.

Direct comparison is strongest here because the chart includes a published human-expert baseline for the same benchmark.

These percentages share a 0–100 range but measure different tasks: short code generation versus issue-level software engineering.

Key benchmarks

Sources: Stanford HAI AI Index 2024, 2025 (Technical Performance); Hendrycks et al. "Measuring Massive Multitask Language Understanding"; Chen et al. "Evaluating Large Language Models Trained on Code"; SWE-bench; Epoch AI (benchmarks, training compute, capabilities).

METR Time Horizon: task length AI agents can complete (50% reliability)

The METR Time Horizon measures the human-equivalent task duration at which an AI agent succeeds with 50% probability. Tasks are from RE-Bench, HCAST, and SWAA (software engineering, ML, cybersecurity). The metric has doubled approximately every 7 months over 6 years [4].
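The ~7-month doubling reported by METR implies simple exponential extrapolation. The sketch below shows the arithmetic; the 60-minute baseline is a hypothetical starting point chosen only to illustrate the math, not a figure from METR's data.

```python
DOUBLING_MONTHS = 7.0  # approximate doubling time reported by METR [4]

def horizon_after(months: float, baseline_minutes: float) -> float:
    """Extrapolate a 50%-reliability time horizon forward, assuming
    constant exponential growth with a ~7-month doubling time."""
    return baseline_minutes * 2 ** (months / DOUBLING_MONTHS)

# Hypothetical 60-minute baseline, for illustration only.
for months in (0, 7, 14, 28):
    print(f"+{months:2d} months: ~{horizon_after(months, 60):.0f} minutes")
```

Under this assumption the horizon grows 16x in 28 months; whether the trend actually holds that long is exactly what the scenario section below treats as uncertain.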

[Chart: 50% time horizon by model release date, shown in log-scale and linear-scale views.]

Key points

Sources: Data from Epoch AI benchmark_data.zip (metr_time_horizons_external.csv), sourced from METR; METR Time Horizons; arXiv 2503.14499.

Breadth of current AI performance by domain

This table complements the benchmark trends above. I'm not claiming a single "intelligence" score; it's a summary of where frontier AI looks ahead of, mixed with, or behind humans in selected domains, based on the evidence I've cited.

Domain Status Why this status fits
Chess and Go Ahead AI is decisively above top human play.
ImageNet and speech Ahead AI is at or above strong human baselines in standard benchmark setups.
MMLU-style exams Ahead Frontier models are above the published human-expert baseline on the same benchmark.
Short coding tasks Ahead HumanEval is near saturation on frontier systems [3].
Software engineering Mixed Progress is rapid, but human comparison is still murky and depends heavily on task framing.
Competition math Behind Hard math evaluations still expose major weaknesses.
Long-horizon planning Behind Humans still lead on extended tasks, robustness, and planning depth.
Hard frontier evals Behind Benchmarks like BigCodeBench and Humanity's Last Exam remain difficult.

Important caveat: benchmark scores do not directly measure general intelligence. Many benchmarks become saturated over time as models train on related data or evaluation methods improve. They are useful indicators of progress but should not be interpreted as direct measures of human-level capability.

How to read

Benchmark links: Epoch AI, METR Time Horizon, ImageNet, MMLU, HumanEval, SWE-bench, BigCodeBench, Humanity's Last Exam (lastexam.ai).

Hypothetical future AI growth scenarios (illustrative)

These charts are intentionally hypothetical, but the scenario labels are grounded in real trendlines and bottlenecks from recent research. I use an illustrative frontier capability index (2024 = 100) and extend forward with dashed lines. This isn't a forecast or a claim that AI progress follows one smooth curve. I'm showing how different evidence-backed assumptions could lead to very different futures. The log-scale view keeps all scenarios visible; the linear-scale view makes the divergence more dramatic.
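The scenario curves can be reproduced with a one-line compounding rule. The growth rates below are hypothetical placeholders I chose to illustrate divergence, not values from the cited research; only the construction (index = 100 in 2024, compounded annually) reflects the charts.

```python
# Hypothetical annual growth multipliers, for illustration only.
scenarios = {"slow": 1.10, "steady": 1.30, "rapid": 1.80}

def project(rate: float, years: int, base: float = 100.0) -> float:
    """Compound an illustrative capability index (2024 = 100) forward."""
    return base * rate ** years

for name, rate in scenarios.items():
    # Index values at 2024, 2026, 2028, 2030 under each assumption.
    values = [round(project(rate, y)) for y in range(0, 7, 2)]
    print(name, values)
```

Even modest differences in the assumed rate compound into order-of-magnitude gaps within a decade, which is why the linear-scale view looks so much more dramatic than the log-scale one.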

[Chart: hypothetical scenario projections, shown in log-scale and linear-scale views.]

Research-backed scenario assumptions

Why these assumptions are plausible: Stanford HAI AI Index 2025 notes both saturation on older benchmarks and strong gains on newer hard evaluations. Epoch AI trends and Our World in Data's work on scaling up AI document rapid growth in compute and efficiency. FrontierMath and ARC-AGI-2 show that hard reasoning benchmarks remain far from solved. LEAP's synthesis of Waves 1–3 reports that, for 2030, only 23% of panelists expect reality to look like a rapid-progress scenario, while 28% expect it to look closer to a slow-progress scenario; by 2040 the median expert still expects AI to be comparable to a technology of the century, with a 32% chance of technology-of-the-millennium-level impact. METR's work on long tasks and RE-Bench suggests that agentic task horizons and AI R&D capability are improving fast enough that compounding remains a serious possibility.

Sources: Stanford HAI AI Index 2025, Technical Performance; Stanford HAI AI Index 2025, Economy; Epoch AI Trends in Artificial Intelligence; Our World in Data: Scaling up AI; FrontierMath; ARC-AGI-2; LEAP Wave 1: Headliners; LEAP Insights from Waves 1, 2, and 3; LEAP Wave 4: AI R&D; The Longitudinal Expert AI Panel; METR: Measuring AI Ability to Complete Long Tasks; METR RE-Bench.

Expert forecasts: human-level AI, AGI, and transformative AI

The first chart shows single point estimates. It's useful for seeing disagreement, but the targets aren't identical. Some sources forecast HLMI, some AGI, some transformative AI, and company-leader statements aren't formal elicitation studies. The second chart strips this down to reported interval bars only. Each bar is labeled by the thresholds it connects (e.g. 25%-50% or 50%-90%), so read it as a transition between probability levels, not uncertainty around a midpoint.

Why two charts?

Sources: Metaculus (AGI prediction market); Grace et al. 2022, Zhang et al. 2022, AI Impacts 2022 (researcher surveys); Cotra 2020/2022 (compute-based forecasts); Müller & Bostrom 2012–2013; 80,000 Hours 2025; Our World in Data "AI timelines".

Timeline of major human inventions

Major milestones in human technology from prehistory to the digital era, shown as one continuous timeline on a log scale (years before present) so the compression of invention pace toward recent centuries is visible. Dates remain approximate, and many inventions evolved across regions rather than appearing at one exact moment.
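The "years before present" log positioning can be sketched directly. The milestone dates below are approximate (stone tools and fire follow the Harmand 2015 and Berna 2012 datings cited in the sources; the rest are conventional dates), and the present year is taken as the article date.

```python
import math

PRESENT = 2026  # article date, used as "present" for the timeline

# Approximate milestone years (negative = BCE), for illustration.
milestones = {
    "stone tools": -3_300_000,
    "controlled fire": -1_000_000,
    "agriculture": -10_000,
    "printing press": 1440,
    "transistor": 1947,
    "web": 1989,
}

for name, year in milestones.items():
    ybp = PRESENT - year  # years before present
    print(f"{name}: ~{ybp:,} years BP (log10 = {math.log10(ybp):.1f})")
```

On this axis, stone tools sit near 10^6.5 and the web near 10^1.6, which is the compression the chart is built to show.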

How to read

Sources and citations: Our World in Data, Technology over the long run; Wikipedia, Timeline of historic inventions; Encyclopaedia Britannica, Invention; Smithsonian, Wright Brothers' flight (1903); Computer History Museum, Transistor (1947); Wikipedia, Invention of the integrated circuit (1958); CERN, Birth of the Web; Wikipedia, Discovery of penicillin (1928); Wikipedia, History of the iPhone (2007); PNAS Berna 2012 (fire); Nature Harmand 2015 (stone tools).

Timescale compression: biology, technology, and AI

Durations below are shown on a logarithmic scale. The point is not that these processes are equivalent, but that biologically driven change, civilization-scale invention, and modern AI progress operate on radically different time horizons.
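The spread is easiest to see as orders of magnitude. The durations below are rough assumptions I picked to represent each regime, not figures from the cited sources; the point is only the log-scale gap between them.

```python
import math

# Rough, illustrative durations in years for each regime.
durations = {
    "hominin brain evolution": 6_000_000,
    "agriculture to industrial era": 10_000,
    "deep learning era": 15,
}

for name, years in durations.items():
    print(f"{name}: roughly 10^{math.log10(years):.1f} years")
```

Under these assumptions the three regimes span about six orders of magnitude, which is why a linear axis makes the AI era invisible and a log axis is the only readable choice.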

Takeaway

Why current AI doesn't learn (and what could change)

In arXiv:2603.15381, Dupoux, LeCun, and Malik argue that current AI systems, once deployed, learn essentially nothing. Learning is outsourced to human experts (MLOps): data curation, training recipes, in-context learning, tool use, and adaptation happen offline. Children and animals, by contrast, learn autonomously by flexibly switching between observation, action, communication, and imagination.

Diagnosis and proposed architecture

How this fits the report

Source: Dupoux, LeCun, Malik (2026). "Why AI systems don't learn and what to do about it: Lessons on autonomous learning from cognitive science." arXiv:2603.15381. arxiv.org/html/2603.15381.

Implications

A few takeaways from the data:

References

[1] Hendrycks et al., "Measuring Massive Multitask Language Understanding", 2021
[2] Chen et al., "Evaluating Large Language Models Trained on Code (Codex)", 2021
[3] Anthropic Claude 3.5 Sonnet Model Card, 2024
[4] METR, "Measuring AI Ability to Complete Long Tasks", 2025