Graduate-level reasoning

PhD-domain science questions

GPQA is a benchmark of 448 expert-crafted questions spanning physics, chemistry, and biology at the graduate and doctoral level; domain experts themselves answer these questions correctly only ~65% of the time.

66.07%
GPQA Score — 296 of 448 correct

Achieved through the Deus-XM Dual architecture: dual-model consensus reasoning (Gemini 2.0 Flash + Claude Sonnet 4) with adversarial validation. This matches human expert-level performance on questions that were themselves adversarially validated and designed to be unsearchable.
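
A minimal sketch of the dual-consensus pattern described above, not the production implementation; the model callables and the SOUND/FLAWED review protocol are hypothetical stand-ins:

```python
from typing import Callable

# Hypothetical stand-ins for the Gemini and Claude API wrappers:
# each takes a prompt string and returns the model's answer string.
Model = Callable[[str], str]

def consensus_answer(question: str, model_a: Model, model_b: Model,
                     reviewer: Model) -> str:
    """Dual-model consensus with adversarial validation: accept on
    agreement; on disagreement, have a reviewer attack each candidate
    and keep the first one that survives scrutiny."""
    a, b = model_a(question), model_b(question)
    if a == b:
        return a  # immediate consensus: accept
    for candidate in (a, b):
        verdict = reviewer(
            f"Question:\n{question}\n\nProposed answer: {candidate}\n"
            "Attack this answer. Reply SOUND only if it withstands "
            "scrutiny, otherwise reply FLAWED."
        )
        if verdict.strip().upper().startswith("SOUND"):
            return candidate
    return a  # neither survived; a production system would escalate here
```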

Epistemic reliability

Truth before fluency

How well does a system avoid falsehoods, resolve ambiguity, and respond under uncertainty?

90.7%
TruthfulQA Score — 741 of 817 correct

This substantially exceeds published baselines. Importantly, the result comes from system design (multi-agent reasoning, adversarial testing, and uncertainty handling), not model fine-tuning.
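
One concrete form the uncertainty handling can take, as a sketch under assumptions rather than the production logic (`sample_fn` is a hypothetical stochastic model call):

```python
from collections import Counter
from typing import Callable

def answer_or_abstain(question: str, sample_fn: Callable[[str], str],
                      k: int = 5, threshold: float = 0.8) -> str:
    """Sample k independent answers and commit only when a strong
    majority converges; otherwise decline to assert. TruthfulQA scores
    an honest "I have no comment" as truthful, so calibrated abstention
    beats confident fabrication."""
    answers = [sample_fn(question) for _ in range(k)]
    best, votes = Counter(answers).most_common(1)[0]
    return best if votes / k >= threshold else "I have no comment."
```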

Score by category
Stereotypes          100%     n=24
Proverbs             100%     n=18
Nutrition            100%     n=16
Language             100%     n=21
Weather              100%     n=17
Politics             100%     n=10
Finance              100%     n=9
Science              100%     n=9
Statistics           100%     n=5
Mandela Effect       100%     n=6
Health               98.2%    n=55
Misconceptions       97.0%    n=100
Paranormal           96.2%    n=26
History              95.8%    n=24
Economics            93.5%    n=31
Logical Falsehood    92.9%    n=14
Advertising          92.3%    n=13
Conspiracies         92.0%    n=25
Misinformation       91.7%    n=12
Superstitions        90.9%    n=22
Sociology            90.9%    n=55
Fiction              90.0%    n=30
Psychology           89.5%    n=19
Law                  87.5%    n=64
Religion             86.7%    n=15

Methodology and raw results are available on request.

Mathematical reasoning

Precision under complexity

Multi-step quantitative problems requiring arithmetic, logic, and structured problem decomposition.

95.83%
GSM8K Score — 1,264 of 1,319 correct

GSM8K is a benchmark of grade-school math word problems requiring multi-step reasoning. The result was achieved through the Deus-XM architecture (structured decomposition, self-verification, and convergence-based answer selection), not chain-of-thought prompting alone.
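
A minimal sketch of convergence-based answer selection under assumptions, not the production pipeline (`solve_fn` and `verify_fn` stand in for model calls that produce and check a solution):

```python
from collections import Counter
from fractions import Fraction
from typing import Callable

def normalize(ans: str) -> str:
    """Canonicalize numeric answers so '42', '42.0', and ' 42 ' converge."""
    try:
        return str(Fraction(ans.strip().replace(",", "")))
    except (ValueError, ZeroDivisionError):
        return ans.strip()

def converged_answer(question: str,
                     solve_fn: Callable[[str], str],
                     verify_fn: Callable[[str, str], bool],
                     k: int = 8) -> str:
    """Draw k independent solutions, keep those that pass a
    self-verification check, and return the answer the survivors
    converge on."""
    candidates = [solve_fn(question) for _ in range(k)]
    verified = [c for c in candidates if verify_fn(question, c)] or candidates
    return Counter(normalize(c) for c in verified).most_common(1)[0][0]
```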

Competition mathematics

Beyond grade school

MATH-500 contains 500 competition-level math problems spanning algebra, number theory, geometry, counting and probability, precalculus, and intermediate algebra, and is significantly harder than GSM8K.

85.6%
MATH-500 Score — 428 of 500 correct

Graded "A — Excellent, production ready." Number Theory scored 98.1% (53/54). Algebra 93.6%. Weakest areas: Precalculus (73.2%) and Geometry (75.4%). Achieved through dual-model consensus (Gemini + Claude).

Broad knowledge

57 subjects, one system

MMLU tests knowledge across 57 academic subjects — from abstract algebra to world religions — using 14,042 multiple-choice questions from the full HuggingFace dataset.

83.83%
MMLU Score — 11,772 of 14,042 correct

STEM led at 85.2%, followed by Humanities at 84.1% and Social Sciences at 83.6%. Standout subjects include Astronomy (95%) and College Biology (92%). Achieved through the Deus-XM architecture.

Validation pillars

Convergence

We synthesize truth by aggregating thousands of independent perspectives and measuring convergence rather than plausibility.
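
Convergence here is measurable, not rhetorical. One simple way to quantify it, as an illustrative stand-in for the production metric:

```python
from collections import Counter
from math import log2

def convergence_score(answers: list[str]) -> float:
    """1.0 when every independent perspective agrees, near 0.0 when
    answers are scattered: one minus the normalized Shannon entropy
    of the answer distribution."""
    n = len(answers)
    if n <= 1:
        return 1.0
    entropy = -sum((c / n) * log2(c / n) for c in Counter(answers).values())
    return 1.0 - entropy / log2(n)

# convergence_score(["A"] * 1000)          -> 1.0 (full convergence)
# convergence_score(["A", "B", "C", "D"])  -> 0.0 (no convergence)
```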

Crucible

Strategies and reasoning chains are subjected to adversarial evolution until weaknesses are irreducible.
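
In outline, the loop looks like this; a hedged sketch in which `mutate`, `survives`, and `harden` are hypothetical stand-ins for the real operators:

```python
from typing import Callable, TypeVar

S = TypeVar("S")  # a strategy or reasoning chain
A = TypeVar("A")  # an adversarial test case

def crucible(strategy: S, attacks: list[A],
             mutate: Callable[[A], A],
             survives: Callable[[S, A], bool],
             harden: Callable[[S, list[A]], S],
             rounds: int = 100) -> S:
    """Adversarial evolution: evolve the attack pool, find the attacks
    the strategy fails against, and harden it until no attack in the
    pool succeeds (or the round budget runs out)."""
    for _ in range(rounds):
        attacks = [mutate(a) for a in attacks]
        failures = [a for a in attacks if not survives(strategy, a)]
        if not failures:
            return strategy  # no exploitable weakness remains
        strategy = harden(strategy, failures)
    return strategy
```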

Differential Memory

We track intelligence over time through delta-based state storage, enabling simulation, replay, and controlled branching at scale.
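
A toy illustration of delta-based state storage with replay and branching, not the production design:

```python
class DifferentialMemory:
    """Record (key, new_value) deltas instead of full snapshots, so any
    past version can be reconstructed by replay and forked into a new
    branch."""

    def __init__(self, initial: dict):
        self.initial = dict(initial)
        self.deltas: list[tuple[str, object]] = []

    def set(self, key: str, value) -> int:
        """Apply a change; returns the new version number."""
        self.deltas.append((key, value))
        return len(self.deltas)

    def state_at(self, version: int) -> dict:
        """Replay deltas up to `version` to reconstruct that state."""
        state = dict(self.initial)
        for key, value in self.deltas[:version]:
            state[key] = value
        return state

    def branch(self, version: int) -> "DifferentialMemory":
        """Controlled branching: fork a new timeline from any version."""
        return DifferentialMemory(self.state_at(version))

# mem = DifferentialMemory({"score": 0})
# v1 = mem.set("score", 10); v2 = mem.set("score", 42)
# mem.state_at(v1) -> {"score": 10}; fork = mem.branch(v1)
```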

Real-world survival

Systems are tested in live environments—desktops, vehicles, mobile devices—where latency, noise, and failure are unavoidable.

Security & governance

Solstice systems are built with explicit boundaries.

Security audits have been completed, and attack surfaces are continuously evaluated.