Dr-CiK: A Testbed for Foresight-Driven Agents

Motivation

Forecasting needs context you have to go find

Time-series forecasting in the real world rarely depends on history alone. A traffic forecast hinges on a planned road closure; a demand forecast hinges on an upcoming promotion; a sensor forecast hinges on a maintenance window. That context lives in documents — reports, tickets, notes — scattered across noisy, heterogeneous sources, mixed in with material that looks relevant but is not.

Existing context-aided forecasting benchmarks hand the model the right context up front. That leaves the central question untouched: can an agent identify the right context on its own? Dr-CiK is built to answer it.

CiK

Context is Key

In forecasting, the right external context often matters more than a better model: high-quality context substantially improves forecasts.

Dr

Deep Research

Finding that context in a large corpus — and distilling it into forecast-useful evidence while rejecting distractors — demands genuine deep research.

The benchmark

What an agent has to do

Each task pairs a time series with a corpus of supporting and distractor documents, and the agent works through four steps to produce an evidence-grounded forecast.

1. Retrieve

Search the document space for context relevant to the series being forecast.

2. Filter

Reject distractors: confounders, noise, profile and temporal mismatches, and misleading time-series claims.

3. Distill

Turn the retrieved context into concise, forecast-useful evidence.

4. Forecast

Produce a forecast grounded in that evidence, which is then scored against ground truth.
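The four steps above can be sketched as a small pipeline. This is a minimal illustration, not the benchmark's API: `Document`, `run_pipeline`, and the four callables are hypothetical names, and the toy lambdas stand in for a real retriever, filter, summarizer, and forecaster.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Document:
    """Hypothetical document record; the benchmark's real schema may differ."""
    doc_id: str
    text: str


def run_pipeline(
    history: Sequence[float],
    corpus: Sequence[Document],
    retrieve: Callable[[Sequence[float], Sequence[Document]], List[Document]],
    is_relevant: Callable[[Document], bool],
    distill: Callable[[List[Document]], str],
    forecast: Callable[[Sequence[float], str], List[float]],
) -> List[float]:
    candidates = retrieve(history, corpus)             # 1. Retrieve
    kept = [d for d in candidates if is_relevant(d)]   # 2. Filter
    evidence = distill(kept)                           # 3. Distill
    return forecast(history, evidence)                 # 4. Forecast


# Toy stand-ins for each step (illustrative only).
corpus = [
    Document("a", "Batch job scheduled on server-1 at 02:00."),
    Document("b", "Batch job scheduled on server-9 at 02:00."),  # wrong server
]
prediction = run_pipeline(
    history=[20.0, 20.0, 30.0],  # CPU usage in percent
    corpus=corpus,
    retrieve=lambda h, c: [d for d in c if "batch job" in d.text.lower()],
    is_relevant=lambda d: "server-1 at" in d.text,
    distill=lambda docs: " ".join(d.text for d in docs),
    forecast=lambda h, ev: [h[-1] + (50.0 if "Batch job" in ev else 0.0)] * 2,
)
print(prediction)  # [80.0, 80.0]
```

Passing each step as a callable keeps the sketch agnostic about how an agent implements retrieval or filtering, which is the part the benchmark is designed to stress.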

279 forecasting tasks

10,342 documents

5 distractor types

6,975 distractor documents

Of the 10,342 documents, 3,367 are supporting and 6,975 are distractors — exactly five per distractor type per task. Tasks span synthetic and human-authored sources across domains like infrastructure, healthcare, transportation, and systems observability. Ground-truth evidence and future values are retained for evaluation.
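The corpus counts above are internally consistent, as a quick check shows. The constants are taken from the figures stated here; the per-task average for supporting documents is derived arithmetic, not a number reported by the benchmark.

```python
TASKS = 279
DISTRACTOR_TYPES = 5
DISTRACTORS_PER_TYPE_PER_TASK = 5
TOTAL_DOCUMENTS = 10_342

# Five distractors for each of five types, for every task.
distractors = TASKS * DISTRACTOR_TYPES * DISTRACTORS_PER_TYPE_PER_TASK
supporting = TOTAL_DOCUMENTS - distractors

print(distractors)                   # 6975
print(supporting)                    # 3367
print(round(supporting / TASKS, 2))  # 12.07 supporting documents per task on average
```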

Key findings

Today's agents struggle to find the future's context

~40%

Context pays off. With ground-truth context, the best forecaster cuts scaled CRPS by roughly 40% versus no context — the prize is real.
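For readers unfamiliar with the metric, the sketch below estimates CRPS from forecast samples using the standard estimator mean|x_i - y| - 0.5 * mean|x_i - x_j|, then computes a relative reduction. The `scale` argument is a hypothetical stand-in for whatever normalization the benchmark applies, and the numbers are toy values, not benchmark results.

```python
from typing import Sequence


def crps_from_samples(samples: Sequence[float], observed: float) -> float:
    """Sample-based CRPS estimate for one observation (lower is better)."""
    n = len(samples)
    term_obs = sum(abs(s - observed) for s in samples) / n
    term_spread = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return term_obs - 0.5 * term_spread


def scaled_crps(samples: Sequence[float], observed: float, scale: float) -> float:
    # Assumption: scaling is a simple division by a task-level constant.
    return crps_from_samples(samples, observed) / scale


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` lowers the score relative to `baseline`."""
    return (baseline - improved) / baseline


observed, scale = 2.0, 2.0
no_context = scaled_crps([4.0, 4.0], observed, scale)    # confident but wrong
with_context = scaled_crps([1.0, 3.0], observed, scale)  # centered on the truth
print(no_context, with_context)                          # 1.0 0.25
print(relative_reduction(no_context, with_context))      # 0.75
```

Because CRPS is an error score, a reduction means the forecast improved.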

<5%

But evidence is missed. Most deep-research agents recover under 5% of the ground-truth supporting evidence in a task.

>80%

And distractors win. Agents are frequently misled: a large majority of cited documents are distractors, and retrieved context can leave forecasts worse than the no-context baseline.
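The two retrieval findings can be made concrete with document-level set arithmetic. This is a sketch under assumptions: the benchmark may score evidence at a finer granularity than whole documents, and the function names and document IDs are hypothetical.

```python
from typing import Set


def evidence_recall(cited: Set[str], supporting: Set[str]) -> float:
    """Share of ground-truth supporting documents that the agent cited."""
    return len(cited & supporting) / len(supporting)


def distractor_share(cited: Set[str], distractors: Set[str]) -> float:
    """Share of the agent's citations that are distractors."""
    if not cited:
        return 0.0
    return len(cited & distractors) / len(cited)


# Toy task: four supporting documents, five distractors, five citations.
supporting = {"s1", "s2", "s3", "s4"}
distractors = {"d1", "d2", "d3", "d4", "d5"}
cited = {"s1", "d1", "d2", "d3", "d4"}

print(evidence_recall(cited, supporting))    # 0.25
print(distractor_share(cited, distractors))  # 0.8
```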

Leaderboard

Results

There are two leaderboards: forecasting quality and deep-research quality. The official ranking uses the hidden test set (80 tasks, with labels withheld and scoring done by the benchmark maintainers); the paper's 240-task results are kept as a reference.

Task showcase

Look inside a task

Each task consists of a time series, a forecast target, ground-truth evidence, and a corpus of supporting and distractor documents.
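One way to picture a task is as a small record type. The sketch below is an assumption-laden illustration: the class names, field names, and the five `DistractorType` members are hypothetical labels based on the distractor categories this page lists, not the benchmark's actual schema.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class DistractorType(Enum):
    # Hypothetical labels for the five distractor types.
    CONFOUNDER = "confounder"
    NOISE = "noise"
    PROFILE_MISMATCH = "profile_mismatch"
    TEMPORAL_MISMATCH = "temporal_mismatch"
    MISLEADING_CLAIM = "misleading_time_series_claim"


@dataclass
class Doc:
    doc_id: str
    text: str
    distractor_type: Optional[DistractorType] = None  # None marks a supporting doc


@dataclass
class Task:
    history: List[float]   # observed series given to the agent
    future: List[float]    # ground-truth values, held out for scoring
    evidence: List[str]    # ground-truth evidence, held out for scoring
    corpus: List[Doc] = field(default_factory=list)

    def has_balanced_distractors(self, per_type: int = 5) -> bool:
        """Check that every distractor type appears exactly `per_type` times."""
        counts = Counter(
            d.distractor_type for d in self.corpus if d.distractor_type is not None
        )
        return all(counts[t] == per_type for t in DistractorType)


# Toy task: one supporting document plus five placeholder distractors per type.
docs = [Doc("s0", "Batch job starts at 02:00.")]
docs += [Doc(f"{t.value}-{i}", "placeholder", t) for t in DistractorType for i in range(5)]
task = Task(history=[20.0, 21.0], future=[80.0],
            evidence=["Batch job starts at 02:00."], corpus=docs)
print(len(task.corpus), task.has_balanced_distractors())  # 26 True
```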

CPU usage · 5 minutes

A scheduled load spike

Predict server CPU usage where an upcoming batch job, buried in ops notes, reshapes the trajectory in a way the history alone can't reveal.

Solar irradiance · 1 hour

Cloud cover over a solar site

Forecast global horizontal irradiance where the swing is explained by an incoming weather pattern described in the corpus — amid look-alike distractors.

Sales volume · 1 day

A promotion that moves demand

Predict daily sales when an upcoming promotion and calendar effects — stated in supporting documents — bend the trend away from history.

Citation

Cite Dr-CiK

BibTeX

@article{tang2026dr,
  title={Dr-CiK: A Testbed for Foresight-Driven Agents},
  author={Tang, Yihong and Williams, Andrew Robert and Ashok, Arjun and Zheng, Vincent Zhihao and Sun, Lijun and Drouin, Alexandre and Laradji, Issam H and Marcotte, {\'E}tienne and Zantedeschi, Valentina},
  journal={arXiv preprint arXiv:2605.27904},
  year={2026}
}