How Investing Arena Works

Why Prospective Testing?

Most AI benchmarks evaluate models on historical data — the model may have seen the answers during training. Investing Arena is different. Every prediction is made before the outcome is known, timestamped, and locked. This is the only rigorous way to measure whether an AI model can forecast real-world events rather than recall them.

Catalyst Types

We focus on investing-relevant events where outcomes are objectively measurable:

  • Earnings — Revenue, EPS, gross margin, segment revenue, guidance
  • Macro Events — Rate decisions, CPI prints, employment data
  • Regulatory — Antitrust decisions, policy changes
  • M&A / IPO — Deal completion, pricing outcomes
  • Biotech / Clinical Trials — Phase 2/3 data readouts
  • FDA Decisions — PDUFA dates, approval/rejection
  • Geopolitical — Elections, central bank leadership changes

How Prompts Are Constructed

Prompts are unprimed — models receive no guidance, no consensus estimates, and no hints about the expected direction. Each model is given only the catalyst description (e.g. "NVDA Q1 FY2027 earnings report") and must independently research and form its own view.

  • Web search (native per provider) — Each model uses its provider's best available search capability. Claude uses Anthropic's native web search (powered by Brave), GPT uses OpenAI's native web search, Gemini uses Google Search grounding, and remaining models (Grok, DeepSeek, Kimi, Qwen) use Exa web search via tool calling. This gives every model real-time access to current data.
  • Max thinking/reasoning mode — Each model runs with its maximum available thinking or reasoning mode enabled (e.g. extended thinking for Claude, reasoning_effort:high for GPT, thinking_config for Gemini, enable_thinking for Qwen). This ensures models can deliberate fully before committing to a prediction.
  • Output format — A strict JSON schema the model must follow, with exact field names and units defined. This ensures all predictions are machine-readable and directly comparable across models.
  • Search limits — Models can perform up to 10 web searches across 4 rounds of tool calling. If the limit is reached, the model must produce its prediction with the data it has gathered.

No model receives privileged information or analyst consensus. The benchmark measures the actual skill of interest: each model's ability to independently research, reason, and commit to a prediction.
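The search-budgeted tool-calling loop described above can be sketched as follows. This is an illustrative reconstruction, not the actual Investing Arena implementation: `call_model` and `run_search` are hypothetical stand-ins for the provider API call and the web-search tool.

```python
# Hypothetical sketch of the research loop: up to 4 rounds of tool calling,
# capped at 10 web searches total, after which the model must answer.
MAX_SEARCHES = 10
MAX_ROUNDS = 4

def run_prediction(model, catalyst_prompt, call_model, run_search):
    """Drive the tool-calling loop until the model returns a prediction."""
    messages = [{"role": "user", "content": catalyst_prompt}]
    searches_used = 0
    for _ in range(MAX_ROUNDS):
        reply = call_model(model, messages)
        queries = reply.get("search_queries", [])
        if not queries:  # model stopped researching and committed to an answer
            return reply["prediction"]
        for q in queries:
            if searches_used >= MAX_SEARCHES:
                break
            messages.append({"role": "tool", "content": run_search(q)})
            searches_used += 1
        messages.append({"role": "assistant", "content": reply.get("text", "")})
    # Budget exhausted: force a final answer from the data gathered so far.
    final = call_model(model, messages + [{
        "role": "user",
        "content": "Search limit reached. Return your final JSON prediction now.",
    }])
    return final["prediction"]
```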

Structured JSON Output

Rather than asking for free-form text, every model is required to return a precise JSON object. This serves three purposes:

  • Automatic scoring — Numeric predictions can be compared directly to reported actuals without human interpretation
  • No hedging — Models must commit to a single point estimate, not a range or a disclaimer
  • Fair comparison — Every model answers exactly the same fields in exactly the same units

Each catalyst type has its own JSON schema tailored to the event. For example:

  • Earnings — revenue, EPS, margins, segment breakdowns, guidance, stock reaction
  • FOMC / Macro — rate decision, dot plot projections, SEP revisions, Powell communication
  • Biotech — trial outcome, p-value, efficacy endpoints, stock reaction

All schemas share common fields: confidence (per-field, 0–100) and rationale (full reasoning).
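To make the shape concrete, here is an invented example of what an earnings-style output could look like. The field names and values are illustrative only, not the exact Investing Arena schema; what matters is the pattern of point estimates plus per-field confidence and a rationale.

```python
import json

# Illustrative (not actual) earnings prediction payload.
example = json.loads("""
{
  "revenue_billions": 19.2,
  "eps": 0.95,
  "gross_margin_pct": 71.5,
  "stock_reaction": "UP",
  "confidence": {
    "revenue_billions": 60,
    "eps": 55,
    "gross_margin_pct": 50,
    "stock_reaction": 45
  },
  "rationale": "Data-center demand trends suggest modest upside to guidance."
}
""")

# Every scored field is a single point estimate (no ranges), and each
# carries its own 0-100 confidence alongside the full rationale.
assert all(0 <= c <= 100 for c in example["confidence"].values())
```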

You can view the exact prompt used for any prediction by clicking the Prompt tab on the predictions table.

Scoring

Once actuals are reported, each prediction is scored field-by-field against the real outcomes. Different field types are scored differently:

  • Categorical fields (e.g. HOLD/CUT/HIKE, TRANSITORY/UPSIDE RISK): 100 if the model's prediction exactly matches the actual outcome, 0 if wrong.
  • Numeric fields: scored using max(0, 100 - |error| × scale_factor), where scale_factor varies by field sensitivity. For example, being off by $1B on revenue costs more than being off by 0.1% on gross margin.

The overall score for a prediction is the average across all scored fields (0–100). A prediction is marked correct if its overall score is 70 or above.
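The scoring rules above can be expressed as a short sketch. The scale factor shown in the worked example is invented for illustration; the real values vary by field sensitivity.

```python
# Field-level scoring: categorical fields are all-or-nothing, numeric
# fields lose points linearly with absolute error.
def score_categorical(pred, actual):
    return 100.0 if pred == actual else 0.0

def score_numeric(pred, actual, scale_factor):
    return max(0.0, 100.0 - abs(pred - actual) * scale_factor)

def score_prediction(field_scores):
    """Overall score is the plain average across all scored fields."""
    return sum(field_scores) / len(field_scores)

# Worked example (illustrative scale_factor=10): revenue off by $2B
# scores 80; a correct HOLD call scores 100; overall = (80 + 100) / 2 = 90,
# so this prediction would be marked correct (70 or above).
scores = [score_numeric(21.0, 23.0, 10), score_categorical("HOLD", "HOLD")]
overall = score_prediction(scores)
```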

Leaderboard Metrics

  • Avg Score — average score per resolved prediction (0–100, the primary ranking metric)
  • Correct — number of predictions scoring 70 or above out of total resolved
  • Resolved — number of predictions that have been scored against actuals
  • Avg Time — average wall-clock time per prediction run
  • Avg Tokens — average tokens consumed per prediction
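The metrics above reduce to simple aggregates over resolved predictions. This sketch assumes a minimal per-prediction record (the dict keys are illustrative, not the actual data model).

```python
# Compute one leaderboard row from a model's resolved predictions.
def leaderboard_row(resolved):
    n = len(resolved)
    return {
        "avg_score": sum(p["score"] for p in resolved) / n,   # primary ranking metric
        "correct": sum(1 for p in resolved if p["score"] >= 70),
        "resolved": n,
        "avg_time_s": sum(p["time_s"] for p in resolved) / n,
        "avg_tokens": sum(p["tokens"] for p in resolved) / n,
    }

row = leaderboard_row([
    {"score": 90, "time_s": 120, "tokens": 40_000},
    {"score": 55, "time_s": 80, "tokens": 25_000},
])
```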

Early Findings

From our first resolved events (MU Q2 FY2026 earnings, FOMC March 2026):

  • All 10 models massively underestimated Micron's earnings — every model predicted ~$19B revenue; actual was $23.86B. The models relied on consensus estimates from web search, which were themselves too low. This suggests that LLMs inherit analyst consensus biases rather than forming independent views.
  • FOMC predictions were much closer — all 10 models correctly predicted HOLD. The Fed is more predictable than earnings surprises, but nuanced fields like dot plot distribution and SEP revisions separated the better models.
  • Search quality matters — models with native search (Claude, GPT, Gemini) had access to better-formatted data, but this didn't always translate to better predictions. The research-to-prediction gap is where model reasoning quality shows.

Models Evaluated

We currently evaluate 10 frontier models from 7 providers. All models are accessed via API under identical conditions:

  • Anthropic — Claude Opus 4.6, Claude Sonnet 4.6
  • DeepSeek — DeepSeek v3.2
  • Google — Gemini 3 Flash, Gemini 3.1 Pro
  • OpenAI — GPT-5.2, GPT-5.4
  • xAI — Grok 4
  • Moonshot — Kimi K2.5
  • Alibaba — Qwen 3.5

Open & Transparent

Every prediction, prompt, score, and ranking is public. You can see exactly what each model predicted, the exact prompt it received, its confidence level, and how it was scored. There are no hidden adjustments or post-hoc changes.