Human vs AI Intelligence Evolution

I put this together to compare biological evolution, human tech progress, and AI capability growth over time. The main idea: intelligence development has compressed from millions of years (hominin cognition) to decades (modern AI). Benchmarks are rough proxies, not directly comparable across domains.

Executive summary

Human evolution: ~7 million years; brain size ~3x (450 cc to 1,350 cc); behavioral modernity ~50 kya.
AI evolution: ~70 years; rapid gains on many benchmarks, but benchmark percentages are not directly comparable across domains.
Surpassed? AI is ahead on some standardized tasks, mixed on real-world software work, and still behind on competition math and long-horizon planning.
When human-level? Forecast medians span ~2031–2061, but the underlying targets differ: AGI, HLMI, and transformative AI are often conflated.
Dominance? Many researchers treat superintelligence as a serious possibility. Bostrom uses human–gorilla analogy; extreme gaps (human–ant) discussed.
Dog analogy? Maybe. The idea: we might understand ASI's goals and methods about as well as a smart dog understands why you're paying an invoice online. It can watch, but it can't grasp the concepts, purpose, or reasoning, let alone do it.

Human intelligence evolution (millions of years)

The capability index (0–100) is a rough proxy based on brain size, normalized from ~400 cc (early hominins) to ~1,400 cc (modern H. sapiens). Treat it as a rough ordering, not a precise score. Brain size correlates with some cognitive traits, but it's not a strict linear relationship.

Capability index is a normalized illustrative scale used to visualize relative progress across different systems. Values are not directly comparable across domains.

Brain volume is used here as a rough proxy for cognitive capacity. It is not a direct measure of intelligence.

Key milestones

~6 mya: Bipedalism (Sahelanthropus)
~2.5 mya: Deliberate tool-making (H. habilis)
~1.8 mya: Fire control (H. erectus)
~50 kya: Behavioral modernity (symbolic thought, art)
Brain size ~3x from Australopithecus (~450 cc) to H. sapiens (~1,350 cc)

Sources: PNAS 2024 "Hominin brain size increase has emerged from within-species encephalization"; NCBI "Pattern and rate in the Plio-Pleistocene evolution of modern human brain size"; NCBI "Hominin cognitive evolution"; Australian Museum "Larger brains"; Smithsonian "An Evolutionary Timeline of Homo Sapiens"; Smithsonian Human Origins; Nature 2022 (Sahelanthropus bipedalism); Science News Today; Wikipedia Evolution of human intelligence.

Selected AI benchmark progress by benchmark family

These charts split out benchmark families that I used to show together. MMLU is a knowledge-heavy exam benchmark with a published human-expert baseline. HumanEval and SWE-bench are both coding-related but test different things. Shared percentage units don't mean a shared scale of general intelligence.

Direct comparison is strongest here because the chart includes a published human-expert baseline for the same benchmark.

These percentages share a 0-100 range but measure different tasks: short code generation versus issue-level software engineering.

Key benchmarks

MMLU: 43.9% (GPT-3 2020) [1] → 93% (o3 2024); human expert ~86.8%. Measures performance on a broad set of academic knowledge tasks across many disciplines. MMLU evaluates performance on multiple-choice academic questions. High scores indicate strong exam-style reasoning and factual recall but do not necessarily reflect real-world task performance.
HumanEval: 28.8% (Codex 2021) [2] → 92% (o1-mini 2024). Evaluates the ability of models to write short correct Python functions. HumanEval scores are typically reported as pass@1, meaning the probability that the first generated solution passes all tests. Some papers also report pass@k (e.g. pass@10), which allows multiple attempts and usually yields higher scores.
SWE-bench: 4.4% (2023) → 71.7% (2024). Measures whether models can fix real bugs in open-source GitHub repositories. SWE-bench evaluates the ability of models to fix real GitHub issues in open-source repositories. Scores depend on evaluation setup (agent loops, tool use, retrieval, etc.), so comparisons across papers should be interpreted cautiously.
Note (2026): Some frontier models now exceed ~97% on HumanEval and ~77% on SWE-bench Verified, indicating continued rapid progress in coding benchmarks.
Percentages are benchmark-specific, not a universal intelligence score
Training compute: 4–5x per year (Epoch AI)

Sources: Stanford HAI AI Index 2024, 2025 (Technical Performance); Hendrycks et al. "Measuring Massive Multitask Language Understanding"; Chen et al. "Evaluating Large Language Models Trained on Code"; SWE-bench; Epoch AI (benchmarks, training compute, capabilities).

METR Time Horizon: task length AI agents can complete (50% reliability)

The METR Time Horizon measures the human-equivalent task duration at which an AI agent succeeds with 50% probability. Tasks are from RE-Bench, HCAST, and SWAA (software engineering, ML, cybersecurity). The metric has doubled approximately every 7 months over 6 years [4].

Log scale

Linear scale

Model	Release	50% time horizon

Key points

50% time horizon = task length (in human expert time) at which the agent succeeds half the time
Claude 3.7 Sonnet (Apr 2025): ~1 hour; GPT-5 (Aug 2025): ~2h 17m; Claude Opus 4.5 (Dec 2025): ~4h 49m; Claude Opus 4.6 (Feb 2026): ~14.5h
Extrapolation: month-long tasks in under a decade if trend continues (METR)

Sources: Data from Epoch AI benchmark_data.zip (metr_time_horizons_external.csv), sourced from METR; METR Time Horizons; arXiv 2503.14499.

Breadth of current AI performance by domain

This table complements the benchmark trends above. I'm not claiming a single "intelligence" score. It's a summary of where frontier AI looks ahead, mixed, or behind humans in selected domains, based on the evidence I've cited.

Domain	Status	Why this status fits
Chess and Go	Ahead	AI is decisively above top human play.
ImageNet and speech	Ahead	AI is at or above strong human baselines in standard benchmark setups.
MMLU-style exams	Ahead	Frontier models are above the published human-expert baseline on the same benchmark.
Short coding tasks	Ahead	HumanEval is near saturation on frontier systems [3].
Software engineering	Mixed	Progress is rapid, but human comparison is still murky and depends heavily on task framing.
Competition math	Behind	Hard math evaluations still expose major weakness.
Long-horizon planning	Behind	Humans still lead on extended tasks, robustness, and planning depth.
Hard frontier evals	Behind	Benchmarks like BigCodeBench and Humanity's Last Exam remain difficult.

Important caveat: benchmark scores do not directly measure general intelligence. Many benchmarks become saturated over time as models train on related data or evaluation methods improve. They are useful indicators of progress but should not be interpreted as direct measures of human-level capability.

How to read

Ahead: AI has clearly surpassed strong human baselines in standard benchmark setups
Mixed: Rapid progress, but weak or missing human baselines and strong dependence on task framing
Behind: Humans still lead, especially on robustness, planning, or harder frontier evaluations

Benchmark links: Epoch AI, METR Time Horizon, ImageNet, MMLU, HumanEval, SWE-bench, BigCodeBench, Humanity's Last Exam (lastexam.ai).

Hypothetical future AI growth scenarios (illustrative)

These charts are intentionally hypothetical, but the scenario labels are grounded in real trendlines and bottlenecks from recent research. I use an illustrative frontier capability index (2024 = 100) and extend forward with dashed lines. This isn't a forecast or a claim that AI progress follows one smooth curve. I'm showing how different evidence-backed assumptions could lead to very different futures. The log-scale view keeps all scenarios visible; the linear-scale view makes the divergence more dramatic.

Log scale

Linear scale

Research-backed scenario assumptions

Slowdown: older benchmarks such as MMLU and HumanEval are saturating, while harder evaluations such as FrontierMath and ARC-AGI-2 show large remaining gaps; LEAP also suggests a non-trivial slow-progress path, with experts in the Wave 1-3 synthesis assigning substantial weight to slower-by-2030 trajectories.
Base case: progress continues, but more unevenly than the 2023–2024 jump, because compute, efficiency, and benchmark gains remain strong even as measurement gets noisier on harder tasks; LEAP's cross-wave view is broadly consistent with substantial progress by 2030 and very large impact by 2040 without requiring an immediate discontinuity.
Acceleration: better algorithms, more frontier compute, and AI-assisted software and research work could compound, especially if models increasingly help automate the work of improving future models; LEAP Wave 4 and METR Time Horizon evidence both make this mechanism plausible.
All three lines are dashed after 2024 to mark them as hypothetical scenarios, not predictions.

Why these assumptions are plausible: Stanford HAI AI Index 2025 notes both saturation on older benchmarks and strong gains on newer hard evaluations; Epoch AI trends and Our World in Data on scaling up AI document rapid growth in compute and efficiency; FrontierMath and ARC-AGI-2 show that hard reasoning benchmarks remain far from solved; LEAP's synthesis of Waves 1-3 reports that by 2030 the average expert expects only 23% of panelists to say reality looks like a rapid-progress scenario, while 28% would say it looks closer to a slow-progress scenario; by 2040 the median expert still expects AI to be comparable to a technology of the century, with a 32% chance of technology-of-the-millennium-level impact. METR on long tasks and RE-Bench suggest that agentic task horizons and AI R&D capability are improving fast enough that compounding remains a serious possibility.

Sources: Stanford HAI AI Index 2025, Technical Performance; Stanford HAI AI Index 2025, Economy; Epoch AI Trends in Artificial Intelligence; Our World in Data: Scaling up AI; FrontierMath; ARC-AGI-2; LEAP Wave 1: Headliners; LEAP Insights from Waves 1, 2, and 3; LEAP Wave 4: AI R&D; The Longitudinal Expert AI Panel; METR: Measuring AI Ability to Complete Long Tasks; METR RE-Bench.

Expert forecasts: human-level AI, AGI, and transformative AI

The first chart shows single point estimates. It's useful for seeing disagreement, but the targets aren't identical. Some sources forecast HLMI, some AGI, some transformative AI, and company-leader statements aren't formal elicitation studies. The second chart strips this down to reported interval bars only. Each bar is labeled by the thresholds it connects (e.g. 25%-50% or 50%-90%), so read it as a transition between probability levels, not uncertainty around a midpoint.

Why two charts?

Point estimates are easy to compare but hide definition differences and uncertainty
Reported intervals reveal that several sources span decades between one probability threshold and another
The bars are threshold-to-threshold ranges, not uncertainty around a midpoint

Sources: Metaculus (AGI prediction market); Grace et al. 2022, Zhang et al. 2022, AI Impacts 2022 (researcher surveys); Cotra 2020/2022 (compute-based forecasts); Müller & Bostrom 2012–2013; 80,000 Hours 2025; Our World in Data "AI timelines".

Timeline of major human inventions

Major milestones in human technology from prehistory to the digital era. One continuous timeline on a log scale (years before present), so the compression of invention pace toward recent centuries is visible. Dates remain approximate, and many inventions evolved across regions rather than appearing at one exact moment.

How to read

X-axis is logarithmic: equal horizontal distance represents equal multiplicative change in time
Recent centuries occupy more visual space, reflecting the acceleration of invention
The set is intentionally curated to the highest-signal milestones rather than every possible invention
Stone tools (~3.3 mya) and fire control (~1 mya) anchor the deep prehistoric end of the timeline

Sources and citations: Our World in Data, Technology over the long run; Wikipedia, Timeline of historic inventions; Encyclopaedia Britannica, Invention; Smithsonian, Wright Brothers' flight (1903); Computer History Museum, Transistor (1947); Wikipedia, Invention of the integrated circuit (1958); CERN, Birth of the Web; Wikipedia, Discovery of penicillin (1928); Wikipedia, History of the iPhone (2007); PNAS Berna 2012 (fire); Nature Harmand 2015 (stone tools).

Timescale compression: biology, technology, and AI

Durations below are shown on a logarithmic scale. The point is not that these processes are equivalent, but that biologically driven change, civilization-scale invention, and modern AI progress operate on radically different time horizons.

Takeaway

Human cognitive evolution unfolded across millions of years
Major invention history compressed into roughly ten thousand years
Modern digital and frontier-model progress occurred across decades and then single-digit years

Why current AI doesn't learn (and what could change)

In arXiv:2603.15381, Dupoux, LeCun, and Malik argue that current AI systems, once deployed, learn essentially nothing. Learning is outsourced to human experts (MLOps): data curation, training recipes, in-context learning, tool use, and adaptation happen offline. Children and animals, by contrast, learn autonomously by flexibly switching between observation, action, communication, and imagination.

Diagnosis and proposed architecture

Three missing abilities: selecting their own training data (active learning), switching between learning modes (meta-control), and sensing their own performance (meta-cognition)
System A: learning from observation (self-supervised learning, world models)
System B: learning from action (reinforcement learning, policy learning)
System M: meta-control that routes data and switches modes (input selection, loss modulation, mode switching). In their blueprint, M is fixed; A and B learn during a lifetime

How this fits the report

Learning is central to how we usually define intelligence: humans and animals are considered intelligent in part because they adapt, generalize, and improve from experience. Current AI excels at inference over fixed models but does not learn after deployment, so the gap is conceptual as well as technical
Explains why AI lags on long-horizon planning and robustness: no autonomous learning, domain mismatch, fixed deployment
One reason progress might slow: scaling alone may not yield autonomous learning; an A-B-M architecture would require different research
Research direction: evolutionary-developmental (Evo/Devo) bilevel optimization to bootstrap Systems A, B, and M jointly

Source: Dupoux, LeCun, Malik (2026). "Why AI systems don't learn and what to do about it: Lessons on autonomous learning from cognitive science." arXiv:2603.15381. arxiv.org/html/2603.15381.

Human vs AI Intelligence Evolution

Executive summary

Human intelligence evolution (millions of years)

Key milestones

Selected AI benchmark progress by benchmark family

Key benchmarks

METR Time Horizon: task length AI agents can complete (50% reliability)

Log scale

Linear scale

Key points

Breadth of current AI performance by domain

How to read

Hypothetical future AI growth scenarios (illustrative)

Log scale

Linear scale

Research-backed scenario assumptions

Expert forecasts: human-level AI, AGI, and transformative AI

Why two charts?

Timeline of major human inventions

How to read

Timescale compression: biology, technology, and AI

Takeaway

Why current AI doesn't learn (and what could change)

Diagnosis and proposed architecture

How this fits the report

Implications

References