Graduate-level reasoning

PhD-domain science questions

GPQA is a benchmark of 448 expert-crafted questions spanning physics, chemistry, and biology at the graduate and doctoral level; domain experts themselves answer these questions correctly only ~65% of the time.

66.07%
GPQA Score — 296 of 448 correct

Achieved through the Deus-XM Dual architecture: dual-model consensus reasoning (Gemini 2.0 Flash + Claude Sonnet 4) with adversarial validation. This matches human expert-level performance on questions that were themselves adversarially validated and designed to be unsearchable.
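
A minimal sketch of the dual-consensus pattern described above, not the production implementation; the model callables and the SOUND/FLAWED review protocol are hypothetical stand-ins:

```python
from typing import Callable

# Hypothetical stand-ins for the Gemini and Claude API wrappers:
# each takes a prompt string and returns the model's answer string.
Model = Callable[[str], str]

def consensus_answer(question: str, model_a: Model, model_b: Model,
                     reviewer: Model) -> str:
    """Dual-model consensus with adversarial validation: accept on
    agreement; on disagreement, have a reviewer attack each candidate
    and keep the first one that survives scrutiny."""
    a, b = model_a(question), model_b(question)
    if a == b:
        return a  # immediate consensus: accept
    for candidate in (a, b):
        verdict = reviewer(
            f"Question:\n{question}\n\nProposed answer: {candidate}\n"
            "Attack this answer. Reply SOUND only if it withstands "
            "scrutiny, otherwise reply FLAWED."
        )
        if verdict.strip().upper().startswith("SOUND"):
            return candidate
    return a  # neither survived; a production system would escalate here
```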

Epistemic reliability

Truth before fluency

How well does a system avoid falsehoods, resolve ambiguity, and respond under uncertainty?

90.7%
TruthfulQA Score — 741 of 817 correct

This substantially exceeds published baselines. Importantly, the result comes from system design (multi-agent reasoning, adversarial testing, and uncertainty handling), not model fine-tuning.
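
One concrete form the uncertainty handling can take, as a sketch under assumptions rather than the production logic (`sample_fn` is a hypothetical stochastic model call):

```python
from collections import Counter
from typing import Callable

def answer_or_abstain(question: str, sample_fn: Callable[[str], str],
                      k: int = 5, threshold: float = 0.8) -> str:
    """Sample k independent answers and commit only when a strong
    majority converges; otherwise decline to assert. TruthfulQA scores
    an honest "I have no comment" as truthful, so calibrated abstention
    beats confident fabrication."""
    answers = [sample_fn(question) for _ in range(k)]
    best, votes = Counter(answers).most_common(1)[0]
    return best if votes / k >= threshold else "I have no comment."
```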

Score by category
Stereotypes          100%     n=24
Proverbs             100%     n=18
Nutrition            100%     n=16
Language             100%     n=21
Weather              100%     n=17
Politics             100%     n=10
Finance              100%     n=9
Science              100%     n=9
Statistics           100%     n=5
Mandela Effect       100%     n=6
Health               98.2%    n=55
Misconceptions       97.0%    n=100
Paranormal           96.2%    n=26
History              95.8%    n=24
Economics            93.5%    n=31
Logical Falsehood    92.9%    n=14
Advertising          92.3%    n=13
Conspiracies         92.0%    n=25
Misinformation       91.7%    n=12
Superstitions        90.9%    n=22
Sociology            90.9%    n=55
Fiction              90.0%    n=30
Psychology           89.5%    n=19
Law                  87.5%    n=64
Religion             86.7%    n=15

Methodology and raw results are available on request.

Mathematical reasoning

Precision under complexity

Multi-step quantitative problems requiring arithmetic, logic, and structured problem decomposition.

95.83%
GSM8K Score — 1,264 of 1,319 correct

GSM8K is a benchmark of grade-school math word problems requiring multi-step reasoning. The result was achieved through the Deus-XM architecture (structured decomposition, self-verification, and convergence-based answer selection), not chain-of-thought prompting alone.
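
A minimal sketch of convergence-based answer selection under assumptions, not the production pipeline (`solve_fn` and `verify_fn` stand in for model calls that produce and check a solution):

```python
from collections import Counter
from fractions import Fraction
from typing import Callable

def normalize(ans: str) -> str:
    """Canonicalize numeric answers so '42', '42.0', and ' 42 ' converge."""
    try:
        return str(Fraction(ans.strip().replace(",", "")))
    except (ValueError, ZeroDivisionError):
        return ans.strip()

def converged_answer(question: str,
                     solve_fn: Callable[[str], str],
                     verify_fn: Callable[[str, str], bool],
                     k: int = 8) -> str:
    """Draw k independent solutions, keep those that pass a
    self-verification check, and return the answer the survivors
    converge on."""
    candidates = [solve_fn(question) for _ in range(k)]
    verified = [c for c in candidates if verify_fn(question, c)] or candidates
    return Counter(normalize(c) for c in verified).most_common(1)[0][0]
```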

Competition mathematics

Beyond grade school

MATH-500 contains 500 competition-level math problems spanning algebra, number theory, geometry, counting and probability, precalculus, and intermediate algebra, and is significantly harder than GSM8K.

85.6%
MATH-500 Score — 428 of 500 correct

Graded "A — Excellent, production ready." Number Theory scored 98.1% (53/54). Algebra 93.6%. Weakest areas: Precalculus (73.2%) and Geometry (75.4%). Achieved through dual-model consensus (Gemini + Claude).

Broad knowledge

57 subjects, one system

MMLU tests knowledge across 57 academic subjects — from abstract algebra to world religions — using 14,042 multiple-choice questions from the full HuggingFace dataset.

83.83%
MMLU Score — 11,772 of 14,042 correct

STEM led at 85.2%, followed by Humanities at 84.1% and Social Sciences at 83.6%. Standout subjects include Astronomy (95%) and College Biology (92%). Achieved through the Deus-XM architecture.

Validation pillars

Convergence

We synthesize truth by aggregating thousands of independent perspectives and measuring convergence rather than plausibility.
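
Convergence here is measurable, not rhetorical. One simple way to quantify it, as an illustrative stand-in for the production metric:

```python
from collections import Counter
from math import log2

def convergence_score(answers: list[str]) -> float:
    """1.0 when every independent perspective agrees, near 0.0 when
    answers are scattered: one minus the normalized Shannon entropy
    of the answer distribution."""
    n = len(answers)
    if n <= 1:
        return 1.0
    entropy = -sum((c / n) * log2(c / n) for c in Counter(answers).values())
    return 1.0 - entropy / log2(n)

# convergence_score(["A"] * 1000)          -> 1.0 (full convergence)
# convergence_score(["A", "B", "C", "D"])  -> 0.0 (no convergence)
```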

Crucible

Strategies and reasoning chains are subjected to adversarial evolution until weaknesses are irreducible.
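
In outline, the loop looks like this; a hedged sketch in which `mutate`, `survives`, and `harden` are hypothetical stand-ins for the real operators:

```python
from typing import Callable, TypeVar

S = TypeVar("S")  # a strategy or reasoning chain
A = TypeVar("A")  # an adversarial test case

def crucible(strategy: S, attacks: list[A],
             mutate: Callable[[A], A],
             survives: Callable[[S, A], bool],
             harden: Callable[[S, list[A]], S],
             rounds: int = 100) -> S:
    """Adversarial evolution: evolve the attack pool, find the attacks
    the strategy fails against, and harden it until no attack in the
    pool succeeds (or the round budget runs out)."""
    for _ in range(rounds):
        attacks = [mutate(a) for a in attacks]
        failures = [a for a in attacks if not survives(strategy, a)]
        if not failures:
            return strategy  # no exploitable weakness remains
        strategy = harden(strategy, failures)
    return strategy
```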

Differential Memory

We track intelligence over time through delta-based state storage, enabling simulation, replay, and controlled branching at scale.
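
A toy illustration of delta-based state storage with replay and branching, not the production design:

```python
class DifferentialMemory:
    """Record (key, new_value) deltas instead of full snapshots, so any
    past version can be reconstructed by replay and forked into a new
    branch."""

    def __init__(self, initial: dict):
        self.initial = dict(initial)
        self.deltas: list[tuple[str, object]] = []

    def set(self, key: str, value) -> int:
        """Apply a change; returns the new version number."""
        self.deltas.append((key, value))
        return len(self.deltas)

    def state_at(self, version: int) -> dict:
        """Replay deltas up to `version` to reconstruct that state."""
        state = dict(self.initial)
        for key, value in self.deltas[:version]:
            state[key] = value
        return state

    def branch(self, version: int) -> "DifferentialMemory":
        """Controlled branching: fork a new timeline from any version."""
        return DifferentialMemory(self.state_at(version))

# mem = DifferentialMemory({"score": 0})
# v1 = mem.set("score", 10); v2 = mem.set("score", 42)
# mem.state_at(v1) -> {"score": 10}; fork = mem.branch(v1)
```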

Real-world survival

Systems are tested in live environments—desktops, vehicles, mobile devices—where latency, noise, and failure are unavoidable.

Security & governance

Solstice systems are built with explicit boundaries.

Security audits have been completed, and attack surfaces are continuously evaluated.