
Researcher / model-eval workflow

This workflow is the natural home for the released benchmark, compare, export, and learn path.

Page role

Released user workflow. This page covers only commands that ship today. For current-source conveniences that are not yet released, see the Next release track. For deeper package-level research APIs, see the Developer / integrator workflow.

Use it when you want to run repeatable event-forecasting passes, inspect the exact artifacts produced by a run, compare outcomes, interpret the metrics honestly, and keep the evaluation evidence on disk.

Why this path exists

XRTM becomes especially useful when you care about:

  • local, inspectable forecasting workflows
  • probabilistic scoring and calibration signals
  • historical replay and backtest-oriented evaluation
  • exports you can analyze in notebooks or custom pipelines

1. Prove the product path first

Complete Getting started so you have at least one run directory from the released provider-free demo to inspect.

2. Generate deterministic benchmark evidence

xrtm perf run --scenario provider-free-smoke --iterations 3 --limit 1 --runs-dir runs-perf --output performance.json

Provider-free mode is still the right starting point for research and model-eval because it removes provider noise while you learn the artifact model and gather reproducible evidence.

What this benchmark proves:

  • the workflow is fast and repeatable on the same machine
  • runs-perf/<run-id>/run_summary.json captures the scored summary used by compare/export
  • you now have a control run before trying a different provider or configuration
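If you want a programmatic check that the benchmark really is repeatable, the minimal sketch below reads every run_summary.json under runs-perf and confirms the scored fields agree across iterations. The metric key names brier_score and ece are assumptions about the summary schema, not confirmed released field names, so inspect one file first and substitute the real keys.

import json
from pathlib import Path

# Load the summary written into each runs-perf/<run-id>/ directory.
summaries = []
for summary_path in sorted(Path("runs-perf").glob("*/run_summary.json")):
    summaries.append((summary_path.parent.name, json.loads(summary_path.read_text())))

# On the deterministic provider-free scenario, scored fields should be
# identical across iterations. NOTE: "brier_score" and "ece" are assumed
# key names; replace them with the real keys from your run_summary.json.
for key in ("brier_score", "ece"):
    values = {summary.get(key) for _, summary in summaries}
    status = "stable" if len(values) == 1 else f"varies across runs: {values}"
    print(f"{key}: {status}")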

3. Inspect, compare, and export

xrtm runs list --runs-dir runs
xrtm runs show <run-id> --runs-dir runs
xrtm artifacts inspect runs/<run-id>
xrtm report html runs/<run-id>
xrtm runs compare <run-id-a> <run-id-b> --runs-dir runs
xrtm runs export <run-id> --runs-dir runs --output export.json

Use this stage to review forecast counts, scores, warnings, durations, and the underlying JSON artifacts.
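Once export.json exists, a few lines of Python are enough to start notebook review. The sketch below assumes the export is either a JSON list of forecast rows or a dict wrapping one; the forecasts, question_id, probability, and resolved_outcome names are illustrative guesses about the schema, so print one row and swap in the real keys.

import json

with open("export.json") as f:
    export = json.load(f)

# ASSUMPTION: rows sit either at the top level or under a "forecasts" key.
rows = export.get("forecasts", export) if isinstance(export, dict) else export

print(f"forecast rows: {len(rows)}")
for row in rows[:5]:
    # "question_id", "probability", "resolved_outcome" are guessed keys.
    print(row.get("question_id"), row.get("probability"), row.get("resolved_outcome"))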

4. How to read the released metrics

| Signal | Where to look | How to interpret it |
| --- | --- | --- |
| Brier score | run_summary.json, eval.json, compare output | Lower is better. 0.000 is perfect; around 0.250 is roughly the balanced 50/50 binary baseline. |
| ECE | run_summary.json, compare output | Lower is better. Near 0 means your stated confidence closely matches observed frequency. |
| Warnings / errors | run_summary.json, compare output | Should stay at 0 on a healthy run. |
| Duration / tokens | run_summary.json, compare output | The cost side of a quality change. An improvement that doubles runtime should earn that cost. |
| Exported forecast rows | export.json and downstream analysis | Use exports for notebook/spreadsheet review after compare identifies the run worth keeping. |
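If you want to sanity-check the scores you see in run_summary.json, both metrics are cheap to recompute from exported (probability, outcome) pairs. The Brier definition below is the standard one; for ECE, XRTM's exact binning scheme is not documented on this page, so the ten equal-width bins are an assumption rather than the released implementation.

import statistics

def brier(pairs):
    # Mean squared error between stated probability and the binary outcome.
    # 0.0 is perfect; always answering 0.5 on a balanced set scores 0.25.
    return statistics.fmean((p - o) ** 2 for p, o in pairs)

def ece(pairs, n_bins=10):
    # Expected calibration error: per-bin gap between mean outcome and mean
    # stated probability, weighted by bin size. ASSUMPTION: ten equal-width
    # bins; the released tool's binning may differ.
    bins = [[] for _ in range(n_bins)]
    for p, o in pairs:
        bins[min(int(p * n_bins), n_bins - 1)].append((p, o))
    total = len(pairs)
    return sum(
        len(b) / total
        * abs(statistics.fmean(o for _, o in b) - statistics.fmean(p for p, _ in b))
        for b in bins
        if b
    )

pairs = [(0.9, 1), (0.7, 1), (0.4, 0), (0.2, 0)]
print(brier(pairs), ece(pairs))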

5. One honest improvement workflow

Use this control → candidate → compare loop:

  1. Control: run the released provider-free path first and, if needed, repeat it on the same question set. Repeated mock runs should stay effectively unchanged.
  2. Learn the gate: use compare output to learn which metrics matter before you make any stronger claim.
  3. Introduce one meaningful change: move to an advanced path such as a real local model, a runtime-level prompt/configuration change, or calibration/replay work from the package layer.
  4. Compare on the same question set: only compare runs that are answering the same questions.
  5. Decide and export: keep the candidate only if quality improved enough to justify its runtime/tokens cost, then export it for deeper review.

This loop is honest precisely because the default provider-free path is deterministic: the repeated mock run is your control, not your proof of improvement.
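The same loop can be checked outside the CLI by diffing the two summaries directly. A minimal sketch, assuming both run directories contain the run_summary.json described above and reusing the same assumed metric key names; swap in real run ids from xrtm runs list.

import json
from pathlib import Path

def load_summary(runs_dir, run_id):
    return json.loads(Path(runs_dir, run_id, "run_summary.json").read_text())

# Replace the placeholders with real ids from `xrtm runs list`.
control = load_summary("runs", "<control-run-id>")
candidate = load_summary("runs", "<candidate-run-id>")

# ASSUMPTION: these key names mirror the guesses used earlier on this page.
for key in ("brier_score", "ece", "warnings", "errors", "duration_seconds"):
    print(f"{key}: control={control.get(key)} candidate={candidate.get(key)}")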

6. How to turn compare output into an action

| Compare result | What it means | What to do next |
| --- | --- | --- |
| Mock vs mock is unchanged | Your control is stable, which is what you want from the released default path. | Keep the baseline and try one meaningful change before claiming improvement. |
| Brier/ECE improve and warnings/errors stay clean | The candidate may be genuinely better. | Export the run, review question-level differences, and consider promoting it. |
| Scores improve but runtime/tokens jump sharply | Quality improved, but the cost may not be worth it. | Keep it as an experiment until the cost is acceptable. |
| Scores regress or warnings/errors appear | The change hurt quality or robustness. | Revert, retune, or inspect the per-question deltas before trying again. |
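To make the table mechanical, you can encode it as a small decision rule over the compare deltas. The sketch below is one possible encoding, not a released policy: the inputs and the 1.5x cost budget are assumptions to tune against your own runs.

def next_action(d_brier, d_ece, clean, cost_ratio, cost_limit=1.5):
    # d_brier / d_ece: candidate minus control, so negative means improvement.
    # clean: True when warnings and errors stayed at 0 on the candidate run.
    # cost_ratio: candidate runtime (or tokens) divided by the control's.
    # cost_limit is an illustrative budget, not a released default.
    if not clean or d_brier > 0 or d_ece > 0:
        return "revert, retune, or inspect per-question deltas"
    if d_brier == 0 and d_ece == 0:
        return "stable control: try one meaningful change"
    if cost_ratio > cost_limit:
        return "keep as an experiment until the cost is acceptable"
    return "export the run and consider promoting it"

print(next_action(d_brier=-0.030, d_ece=-0.010, clean=True, cost_ratio=1.2))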

7. Validation status on the released surface

The newer corpus-validation workflow visible in current source is not part of the published xrtm==0.3.1 release. Until a later coordinated release ships it alongside compatible upstream packages, keep release-pinned research docs focused on xrtm perf run, explicit run inspection, comparison, and JSON/CSV export.

8. Move into calibration and replay work

XRTM's package stack includes shipped examples for deeper evaluation work such as calibration demos, trace replay, and evaluation harnesses. See Examples and proof and Packages and architecture.

Those deeper paths are where stronger "improved over time" proofs should live. They involve a real system change rather than repeated runs of the deterministic default baseline.

Shipped surfaces this workflow uses

  • xrtm demo --provider mock --limit 1 --runs-dir runs
  • xrtm perf run --scenario provider-free-smoke --iterations 3 --limit 1 --runs-dir runs-perf --output performance.json
  • xrtm runs list
  • xrtm runs show <run-id> --runs-dir runs
  • xrtm runs compare <run-id-a> <run-id-b> --runs-dir runs
  • xrtm runs export <run-id> --runs-dir runs --output export.json
  • xrtm artifacts inspect runs/<run-id>
  • xrtm profile create my-local --provider mock --limit 2 --runs-dir runs
  • xrtm report html runs/<run-id>
  • WebUI and TUI over local run artifacts

Optional later: local-LLM evaluation

Local-LLM mode is useful once the provider-free path is already working and you specifically want to test a real local model. It is not the default first step.

Use the operator runbook for local-LLM health checks and operational guidance.