A Verification Harness for Model Evaluation and Backtesting Workflows
A technical design note on a harness built to make leakage, stale state, unsupported claims, and fake completion harder to get away with.
This harness exists for one reason: in data analysis, model selection, and backtesting, a polished answer can be much more dangerous than a loud failure. The useful work is therefore not better prose from a model. It is a control layer that blocks known bad patterns, forces state continuity across long sessions, writes canonical evidence artifacts, and loops failure back into remediation instead of allowing the system to narrate success it has not earned. The current benchmark evidence is already good enough to show why this matters: a preserved pilot report scores output highly while giving routing a 0.00, which is exactly the kind of polished-but-wrong run the harness is meant to expose.
1. What broke without mechanical controls
The motivating problem was not a weak model. It was a weak control surface. Parts of the surrounding harness looked strict in text but were not reliably strict in behavior. That is a serious failure in high-cost technical work, because the system can still produce smooth output while silently violating the conditions that make the output worth trusting.
That matters more in model training, time-series analysis, and backtesting than in generic automation. In those settings, one silent mistake can poison the entire result: future data leaks in, the holdout gets burned too early, a continuation forgets what phase the project is in, or a completion claim outruns the evidence because the verifier did not force the loopback.
2. Five time-series failure paths blocked up front
The harness is not organized around generic “best practices.” It is organized around specific ways that time-series and model workflows lie to themselves.
Blocked before Python should proceed:
shuffle=True on time-series data
KFold(...) on time-series problems
.shift(-1) or any negative shift
label='left' on rolling or resampling
.fillna(df[col].mean()) using a global mean on time-series features
Each rule blocks a direct path to fake signal. Shuffled time-series data destroys temporal ordering. K-fold makes future information look ordinary. Negative shifts and left-labelled windows leak future information. Global-mean fills inject information from outside the historical horizon. These are not style concerns. They are direct mechanisms for manufacturing fake performance.
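A minimal sketch of how these stop conditions can be enforced mechanically rather than advisorially. The pattern list mirrors the rules above, but the regexes and the helper name are illustrative, not the harness's actual implementation:

# minimal sketch: pre-execution check for the blocked time-series patterns
# (regexes and function name are illustrative, not the harness's actual rules)
import re

BLOCKED_PATTERNS = [
    (r"shuffle\s*=\s*True", "shuffle=True on time-series data"),
    (r"\bKFold\s*\(", "KFold on a time-series problem"),
    (r"\.shift\(\s*-\d+\s*\)", "negative shift leaks future rows"),
    (r"label\s*=\s*['\"]left['\"]", "left-labelled window leaks future information"),
    (r"\.fillna\(\s*\w+\[[^\]]+\]\.mean\(\)\s*\)", "global-mean fill crosses the historical horizon"),
]

def check_snippet(code: str) -> list[str]:
    """Return the reasons a code snippet should be blocked before execution."""
    return [reason for pattern, reason in BLOCKED_PATTERNS if re.search(pattern, code)]

violations = check_snippet("cross_val_score(model, X, y, cv=KFold(5))")
if violations:
    raise RuntimeError("BLOCKED: " + "; ".join(violations))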
| Failure mode | What it corrupts | Control | Why the control matters |
|---|---|---|---|
| Leakage | Model skill, feature relevance, backtest validity | Blocked patterns + review gate | Prevents fake predictive power from entering later stages. |
| Holdout misuse | Final evaluation credibility | Holdout gate + optimization checklist | Preserves the only honest final test. |
| State drift | Phase position, assumptions, current task state | File-backed continuity + compaction recovery | Makes long sessions resumable instead of theatrical. |
| Unsupported prose | Decision quality and release claims | Verifier loopback + evidence triggers | Reduces the gap between what the system says and what it can prove. |
3. Where execution is allowed to start
The execution model is intentionally plain: orient, inspect, plan, implement, verify, wrap up. The important feature is not the sequence itself. It is that some transitions should block. Substantial work should not start without a plan. Missing state or missing contract files should stop the next action. Failed verification should route work back into remediation instead of allowing a completion claim.
| Transition | Allowed when | Blocked when |
|---|---|---|
| Inspect -> Plan | Relevant repo and current state are identified. | Wrong project or stale state still unresolved. |
| Plan -> Implement | Concrete design exists and required gate artifacts exist. | No plan, missing contract/spec, or wrong phase. |
| Implement -> Complete | Verification artifacts agree with the claim. | Verification fails, times out, or lacks evidence. |
This matters because data and model work accumulates damage quietly. A workflow that skips planning or verification usually compensates later with cleanup, narration, and false confidence. Blocking early is cheaper than explaining later.
// local gate-tools.mjs plugin
// physically rejects bash tool calls that violate prerequisites
const isPlanGateCmd = /python\s+(?:\.\/)?scripts\/(?:build_|run_)\w+\.py/.test(cmd)
if (isPlanGateCmd) {
const hasPlan = await exists(`${worktree}/.pi/plan/architect-plan.md`)
if (!hasPlan) throw new Error("BLOCKED: plan-gate")
}
const isBacktestCmd = /python\s+(?:\.\/)?scripts\/run_\w*(backtest|walkforward|baseline|optuna)/.test(cmd)
if (isBacktestCmd && !hasContractSpec) {
throw new Error("BLOCKED: contract-gate")
}

These gates are local harness controls, not platform-wide guarantees. That distinction matters. The page should claim only what the current local harness can actually do and not pretend the entire runtime has already converged to this behavior everywhere.
| Stage | Required condition | What breaks without it | Control type |
|---|---|---|---|
| Orient / Inspect | Correct repo, current state files, correct task frame | Work starts from stale assumptions or the wrong project | Foundation |
| Plan | A concrete design exists before substantial execution | Architecture gets improvised under pressure | Blocking |
| Gate | Required state, contracts, or phase conditions are present | The system acts as if prerequisites exist when they do not | Blocking |
| Verify | Claims match tests, artifacts, or explicit evidence | Completion becomes narration instead of proof | Loopback |
4. Durable state beats chat memory
Long sessions guarantee compaction. Once that is true, any important project state that lives only in the active conversation is temporary state. The harness is safer when continuity is rebuilt from files rather than from remembered chat context.
// local compaction-context.mjs plugin
// reads durable state before continuation
//   1. docs/STATE_MACHINE.md
//   2. docs/specs/*.md or docs/contracts/*.md
//   3. .pi/pipeline-state.json
//   4. .pi/plan/architect-plan.md
//   5. git branch + recent commits
This changes the risk profile of long sessions completely. A continuation that reconstructs from durable state is still limited, but at least the limitation is honest. A continuation that pretends chat memory is sufficient is fragile and often wrong. The important boundary is that the page is describing a local implemented control surface, not claiming that all session continuity problems are solved in the abstract.
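A minimal sketch of what file-backed continuity can look like on the reconstruction side. The file paths follow the plugin's reading order above; the helper itself is illustrative rather than the harness's actual code:

# minimal sketch: rebuild continuation context from durable files, not chat memory
# (file paths mirror compaction-context.mjs above; the helper is illustrative)
import json
from pathlib import Path

def load_continuation_context(worktree: str) -> dict:
    root = Path(worktree)
    state = root / ".pi" / "pipeline-state.json"
    plan = root / ".pi" / "plan" / "architect-plan.md"
    machine = root / "docs" / "STATE_MACHINE.md"
    return {
        "pipeline_state": json.loads(state.read_text()) if state.exists() else None,
        "plan": plan.read_text() if plan.exists() else None,
        "state_machine": machine.read_text() if machine.exists() else None,
    }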
# runner.py
# strip inherited session variables so standalone runs do not reuse stale session state
import os

keep = {
    "APPDATA", "PATH", "TEMP", "TMP", "USERPROFILE", "USERNAME"
}
env = {key: value for key, value in os.environ.items() if key.upper() in keep}

4.1 Deterministic context and deterministic data access
The same principle applies to data access. A remembered path is still just a claim that a path existed when it was remembered. The harness pushes toward deterministic data discovery and deterministic context injection so that workers are given the exact slice of state and data relevant to the task they are performing.
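One way to make data access deterministic is to resolve every dataset through a manifest and refuse to proceed when the file no longer matches its recorded hash. A minimal sketch, assuming a manifest with per-dataset "path" and "sha256" fields (those field names are assumptions for illustration):

# minimal sketch: deterministic data discovery via a hashed manifest
# (manifest field names "path" and "sha256" are illustrative assumptions)
import hashlib
import json
from pathlib import Path

def resolve_dataset(manifest_path: str, name: str) -> Path:
    manifest = json.loads(Path(manifest_path).read_text())
    entry = manifest[name]                      # fail loudly if the slice is unknown
    path = Path(entry["path"])
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    if digest != entry["sha256"]:
        raise RuntimeError(f"BLOCKED: {name} drifted from its recorded hash")
    return path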
4.2 Hooks, plugins, and skills are the actual control surfaces
The harness does not rely on one monolithic enforcement layer. It relies on several smaller control surfaces that fire at different times: operating rules in the main system prompt, plugins that intercept or enrich tool use, and skills that inject domain-specific verification checklists into the session. That split matters because different failure modes appear at different levels of the workflow.
| Control surface | Example | What it does | Why it matters |
|---|---|---|---|
| Rules | AGENTS.md | Blocks known time-series mistakes and enforces state-machine discipline. | Stops obvious quantitative failure modes before they turn into code or claims. |
| Plugins | gate-tools.mjs, compaction-context.mjs, research-grounding.mjs | Intercept tool calls, rebuild state, and track whether literature-dependent work was grounded. | Makes control logic executable instead of purely advisory. |
| Skills | qi-temporal-integrity, qi-walk-forward, qi-evidence-validator | Inject narrow, domain-specific checklists for leakage, windowing, and evidence completeness. | Keeps specialized quantitative checks from being forgotten in long sessions. |
// research-grounding.mjs
// soft gate for literature-dependent work
if (toolLooksLikeLiteratureSearch && noGroundingUsedYet) {
log.warn("prefer alphaXiv or direct primary sources")
}
// persists alphaXiv usage across compaction
output.context.push("Research Grounding Status")5. Runs need artifacts, not narration
The evaluation harness is useful because it writes canonical artifacts instead of relying on textual self-description. Each run builds pipeline evidence and a manifest/hash chain. That means later review can ask what the run actually did rather than trusting what the run said it did.
# runner.py
pipeline_evidence = _build_pipeline_evidence(result, case)
pipeline_evidence_path = _write_pipeline_evidence(run_dir, pipeline_evidence)
manifest = _build_run_manifest(result, case, cmd)
manifest_path = _write_manifest(run_dir, manifest)
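A minimal sketch of what a hash-linked manifest can contain. The helper below illustrates the idea of chaining artifact digests so later tampering or omission is detectable; it is not the runner's actual _build_run_manifest implementation:

# minimal sketch: hash-link the run's artifacts so provenance is checkable later
# (illustrative only; not the actual _build_run_manifest implementation)
import hashlib
import json
from pathlib import Path

def build_run_manifest(run_dir: str, artifact_names: list[str], cmd: list[str]) -> dict:
    artifacts = []
    prev_hash = ""
    for name in artifact_names:
        data = (Path(run_dir) / name).read_bytes()
        digest = hashlib.sha256(prev_hash.encode() + data).hexdigest()
        artifacts.append({"name": name, "sha256": digest})
        prev_hash = digest                      # chain each entry to the one before it
    return {"cmd": cmd, "artifacts": artifacts}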
The scorer then treats routing, compliance, and output as separate dimensions rather than one opaque overall score. That is a better design for technical workflows because a strong final paragraph should not be able to hide a routing failure or a missed policy block.
# scorer.py
WEIGHTS = {"routing": 0.4, "compliance": 0.4, "output": 0.2}

if pipeline_score is not None:
    score.routing_score = pipeline_score
elif expected_agents:
    score.routing_score = routed
if execution_failed:
    score.overall = 0.0

That weighting choice is important. Routing and compliance together outweigh the surface quality of the answer. In other words, the harness is deliberately biased toward process correctness over rhetorical polish. For high-cost technical work, that is the right bias.
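As a worked example under these weights, a plain weighted sum (an assumption; the snippet above does not show the combination rule) makes the bias concrete: perfect output cannot rescue a failed routing dimension, and the numbers line up with the preserved pilot report's overall of 0.40.

# minimal sketch, assuming overall is a plain weighted sum of the three dimensions
# (weights come from scorer.py above; the combination rule itself is an assumption)
WEIGHTS = {"routing": 0.4, "compliance": 0.4, "output": 0.2}

def overall(routing: float, compliance: float, output: float) -> float:
    return (WEIGHTS["routing"] * routing
            + WEIGHTS["compliance"] * compliance
            + WEIGHTS["output"] * output)

# a polished-but-wrong run: perfect output, partial compliance, failed routing
print(overall(routing=0.0, compliance=0.5, output=1.0))  # 0.4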
# README routing detection notes
1. pipeline-state.json roles/verdicts
2. pipeline-state.resolved.json canonical evidence
3. task tool calls (subagent_type)
# plain text alone is not a reliable routing signal
| Artifact | Why it exists | What it prevents |
|---|---|---|
| pipeline-state.resolved.json | Canonical routing/completion evidence | Fake “done” claims without machine-readable state |
| run_manifest.json | Hash-linked artifact chain and execution metadata | Vague provenance and unverifiable runs |
| Split manifest | Hash-backed train/public/holdout provenance | Quiet dataset drift and holdout confusion |
# pipeline-state.resolved.json
{
"pipelineComplete": false,
"passRatio": 0.0,
"reason": "partial_routing_no_pipeline_state",
"expectedAgents": [...],
"detectedAgents": [...],
"requiredOutputSections": {...},
"execution": {"timedOut": false, "exitCode": 0}
}

5.1 Release gates are explicit and intentionally unforgiving
The benchmark logic is useful because it defines concrete reasons why a final claim should fail. Missing manifest artifacts, missing pipeline evidence, missing human-quality evidence, insufficient reproducibility runs, unverifiable dataset provenance, and absent routing-completion evidence are all explicit release-gate failures.
# benchmark.py release gate summary
if critical_policy_breaches > 0:
    reasons.append("critical_policy_breaches")
if not all(manifest_exists):
    reasons.append("missing_manifest_artifacts")
if not human_ratings:
    reasons.append("human_quality_evidence_missing")
if runs_per_cell < 2:
    reasons.append("reproducibility_runs_missing")
if provenance_split != "private-holdout":
    reasons.append("final_holdout_not_used")

The practical point is that the system does not just say “quality is important.” It names the exact conditions under which quality is considered unproven.
6. Failure only counts if it changes the system
The harness became more believable once failure stopped being a conversational annoyance and became a persistent system input. The self-improvement loop is simple enough to state and useful enough to matter: detect, diagnose, fix, record, prevent. The key move is from fix to record. A correction that does not update durable lessons is only a patch to the current session.
DETECT -> DIAGNOSE -> FIX -> RECORD -> PREVENT
The point is not ceremony. It is that recurring failures such as context bleed, tool misuse, stale state, unsupported claims, and enforcement gaps are turned into explicit classes with explicit prevention mechanisms. That is how the system gets better across sessions instead of only within one.
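A minimal sketch of the RECORD step as an append to a durable file. The lessons-learned.md name matches the persistent output listed in 6.1 below; the helper itself is illustrative:

# minimal sketch: RECORD a failure class into a durable lessons file
# (illustrative helper; lessons-learned.md is the durable output named in 6.1)
from datetime import date
from pathlib import Path

def record_lesson(failure_class: str, symptom: str, prevention: str,
                  path: str = "lessons-learned.md") -> None:
    entry = (f"\n## {date.today().isoformat()} {failure_class}\n"
             f"- Symptom: {symptom}\n"
             f"- Prevention: {prevention}\n")
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(entry)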
6.1 Human review is part of the evidence model, not a decorative final step
The human-rating flow is useful because it treats human review as an admissibility condition rather than a vague preference. The protocol requires a blinded packet, one rating row per benchmark cell, a real output-quality score, and rejection of example/template rating files. That is a stronger design than treating human judgment as an informal afterthought after the benchmark is already written up.
# human_rating_kit.py
prepare_human_rating_kit(...)
    -> rating-packet.jsonl
    -> ratings.blank.jsonl
    -> unblind-map.json
finalize_human_ratings(...)
    -> human-ratings.final.jsonl
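A minimal sketch of the admissibility checks the finalize step implies: one row per benchmark cell, a real output-quality score, and rejection of example/template rows. The row field names follow the finalized-rating example in section 7; the checks themselves are illustrative, not the actual finalize_human_ratings implementation:

# minimal sketch: admissibility checks before accepting human-ratings.final.jsonl
# (illustrative; not the actual finalize_human_ratings implementation)
import json
from pathlib import Path

def validate_ratings(path: str, expected_cells: set[tuple[str, str]]) -> None:
    rows = [json.loads(line) for line in Path(path).read_text().splitlines() if line.strip()]
    cells = {(r["case_id"], r["variant_id"]) for r in rows}
    if len(rows) != len(expected_cells) or cells != expected_cells:
        raise RuntimeError("BLOCKED: need exactly one rating row per benchmark cell")
    for r in rows:
        if not isinstance(r.get("quality_score"), (int, float)):
            raise RuntimeError("BLOCKED: rating row lacks a real output-quality score")
        if r.get("case_id", "").startswith("example"):
            raise RuntimeError("BLOCKED: template/example rating row detected")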
| Failure class | Observable symptom | Persistent output | Prevention mechanism |
|---|---|---|---|
| Context bleed | Wrong project or stale assumptions after continuation | lessons-learned.md | State-file checks |
| Tool misuse | Wrong cwd, wrong path, wrong tool choice | incident-log.json | Checklist and hook updates |
| Unsupported claim | Prose outruns evidence | anti-patterns.md | Evidence gate or prompt rule |
| Enforcement gap | A rule exists in text but does not fire mechanically | lessons-learned.md | Hook or validation implementation |
7. What the current benchmark actually supports
The benchmark and report artifacts are useful, but they need to be read honestly. The harness clearly shows artifact generation, split governance, and separated routing/compliance/output scoring. At the same time, the preserved March report shows routing failure despite strong output scores. That is not embarrassing; it is useful. It shows that the system can expose polished-but-wrong runs instead of hiding them.
# 2026-03-01 pilot report
overall: 0.40
routing: 0.00
compliance: 0.50
output: 1.00
agents spawned: []
The dataset governance layer makes that stronger rather than weaker. The grounded manifest fixes the split ratios, hashes the split outputs, records redaction findings, and stores the seed. That means even the evaluation substrate has provenance and can be challenged mechanically rather than socially.
# grounded manifest highlights
train-dev: 8
public-test: 2
private-holdout: 1
seed: 1337
redactionFindings:
  email: 1
  phone: 1
# finalized human rating row
{
"case_id": "...",
"variant_id": "...",
"model_label": "...",
"quality_score": 0.75,
"blinded_label": "A"
}

| Claim | Current status | Why |
|---|---|---|
| The harness writes real evidence artifacts | Supported | Backed by runner and manifest code. |
| The harness enforces useful blocked patterns | Supported | Backed by explicit operating rules. |
| The harness already demonstrates strong routing quality | Not supported | Sampled report shows routing score 0.00 and no agents spawned. |
| The harness is a useful design for high-cost data and ML workflows | Supported | Backed by control logic plus failure-mode alignment. |
7.1 What still remains unproven
The honest limitation is that the strongest public evidence here is still about control logic, artifact generation, and failure exposure. It is not yet enough to claim mature routing quality or broad empirical dominance. That does not weaken the page. It sharpens it. A system that can show where its current evidence stops is more credible than one that pretends the stop line does not exist.
8. Where this control plane earns its keep
This harness is useful anywhere the cost of silent process failure is high: ML experiments, leakage-prone feature work, backtesting and validation loops, long-running technical sessions that need state continuity, and evidence-sensitive writing or review.
The claim is deliberately narrow. The system is not interesting because it sounds autonomous. It is interesting because it makes stale state, unsupported claims, weak verification, and known time-series mistakes harder to hide.
8.1 Implemented controls versus planned controls
| Mechanism | Status | Public-safe claim |
|---|---|---|
| Blocked time-series rules | Implemented policy | The harness has explicit anti-leakage stop conditions. |
| Artifact generation | Implemented code | Runs emit canonical manifests and pipeline evidence. |
| Benchmark split governance | Implemented workflow | Train/public/holdout provenance is mechanically tracked. |
| Local gate plugins | Implemented locally | Useful local controls, not universal platform guarantees. |
| Full routing excellence | Not yet proven | Current public evidence does not support a strong success claim. |
Notes
- The factual basis for this page comes from the actual operating rules, control-plan documents, orchestration prompt, evaluation code, and benchmark artifacts. The value here is the control logic, not the branding.
- This page avoids role-name marketing on purpose. The hiring signal is the enforcement logic, state handling, artifact generation, and failure discipline.
- The examples and rule blocks are included because they map directly to real ML, backtesting, and time-series failure modes.