A Verification Harness for Model Evaluation and Backtesting Workflows
A technical design note on a harness built to make leakage, stale state, unsupported claims, and fake completion harder to get away with.
This harness exists for one reason: in data analysis, model selection, and backtesting, a polished answer can be much more dangerous than a loud failure. The useful work is therefore not better prose from a model. It is a control layer that blocks known bad patterns, forces state continuity across long sessions, writes canonical evidence artifacts, and loops failure back into remediation instead of allowing the system to narrate success it has not earned. The current benchmark evidence is already good enough to show why this matters: a preserved pilot report scores output highly while giving routing a 0.00, which is exactly the kind of polished-but-wrong run the harness is meant to expose.
1. What broke without mechanical controls
The motivating problem was not a weak model. It was a weak control surface. Parts of the surrounding harness looked strict in text but were not reliably strict in behavior. That is a serious failure in high-cost technical work, because the system can still produce smooth output while silently violating the conditions that make the output worth trusting.
That matters more in model training, time-series analysis, and backtesting than in generic automation. In those settings, one silent mistake can poison the entire result: future data leaks in, the holdout gets burned too early, a continuation forgets what phase the project is in, or a completion claim outruns the evidence because the verifier did not force the loopback.
2. Five time-series failure paths blocked up front
The harness is not organized around generic “best practices.” It is organized around specific ways that time-series and model workflows lie to themselves.
Blocked before Python should proceed:
shuffle=True on time-series data
KFold(...) on time-series problems
.shift(-1) or any negative shift
label='left' on rolling or resampling
.fillna(df[col].mean()) using a global mean on time-series features
Each rule blocks a direct path to fake signal. Shuffled time-series data destroys temporal ordering. K-fold makes future information look ordinary. Negative shifts and left-labelled windows leak future information. Global-mean fills inject information from outside the historical horizon. These are not style concerns. They are direct mechanisms for manufacturing fake performance.
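A minimal sketch of how these stop conditions can be enforced mechanically rather than advisorially. The pattern list mirrors the rules above, but the regexes and the helper name are illustrative, not the harness's actual implementation:

# minimal sketch: pre-execution check for the blocked time-series patterns
# (regexes and function name are illustrative, not the harness's actual rules)
import re

BLOCKED_PATTERNS = [
    (r"shuffle\s*=\s*True", "shuffle=True on time-series data"),
    (r"\bKFold\s*\(", "KFold on a time-series problem"),
    (r"\.shift\(\s*-\d+\s*\)", "negative shift leaks future rows"),
    (r"label\s*=\s*['\"]left['\"]", "left-labelled window leaks future information"),
    (r"\.fillna\(\s*\w+\[[^\]]+\]\.mean\(\)\s*\)", "global-mean fill crosses the historical horizon"),
]

def check_snippet(code: str) -> list[str]:
    """Return the reasons a code snippet should be blocked before execution."""
    return [reason for pattern, reason in BLOCKED_PATTERNS if re.search(pattern, code)]

violations = check_snippet("cross_val_score(model, X, y, cv=KFold(5))")
if violations:
    raise RuntimeError("BLOCKED: " + "; ".join(violations))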
| Failure mode | What it corrupts | Control | Why the control matters |
|---|---|---|---|
| Leakage | Model skill, feature relevance, backtest validity | Blocked patterns + review gate | Prevents fake predictive power from entering later stages. |
| Holdout misuse | Final evaluation credibility | Holdout gate + optimization checklist | Preserves the only honest final test. |
| State drift | Phase position, assumptions, current task state | File-backed continuity + compaction recovery | Makes long sessions resumable instead of theatrical. |
| Unsupported prose | Decision quality and release claims | Verifier loopback + evidence triggers | Reduces the gap between what the system says and what it can prove. |
3. Where execution is allowed to start
The execution model is intentionally plain: orient, inspect, plan, implement, verify, wrap up. The important feature is not the sequence itself. It is that some transitions should block. Substantial work should not start without a plan. Missing state or missing contract files should stop the next action. Failed verification should route work back into remediation instead of allowing a completion claim.
| Transition | Allowed when | Blocked when |
|---|---|---|
| Inspect -> Plan | Relevant repo and current state are identified. | Wrong project or stale state still unresolved. |
| Plan -> Implement | Concrete design exists and required gate artifacts exist. | No plan, missing contract/spec, or wrong phase. |
| Implement -> Complete | Verification artifacts agree with the claim. | Verification fails, times out, or lacks evidence. |
This matters because data and model work accumulates damage quietly. A workflow that skips planning or verification usually compensates later with cleanup, narration, and false confidence. Blocking early is cheaper than explaining later.
// local gate-tools.mjs plugin
// physically rejects bash tool calls that violate prerequisites
const isPlanGateCmd = /python\s+(?:\.\/)?scripts\/(?:build_|run_)\w+\.py/.test(cmd)
if (isPlanGateCmd) {
const hasPlan = await exists(`${worktree}/.pi/plan/architect-plan.md`)
if (!hasPlan) throw new Error("BLOCKED: plan-gate")
}
const isBacktestCmd = /python\s+(?:\.\/)?scripts\/run_\w*(backtest|walkforward|baseline|optuna)/.test(cmd)
if (isBacktestCmd && !hasContractSpec) {
throw new Error("BLOCKED: contract-gate")
}

These gates are local harness controls, not platform-wide guarantees. That distinction matters. The page should claim only what the current local harness can actually do and not pretend the entire runtime has already converged to this behavior everywhere.
| Stage | Required condition | What breaks without it | Control type |
|---|---|---|---|
| Orient / Inspect | Correct repo, current state files, correct task frame | Work starts from stale assumptions or the wrong project | Foundation |
| Plan | A concrete design exists before substantial execution | Architecture gets improvised under pressure | Blocking |
| Gate | Required state, contracts, or phase conditions are present | The system acts as if prerequisites exist when they do not | Blocking |
| Verify | Claims match tests, artifacts, or explicit evidence | Completion becomes narration instead of proof | Loopback |
4. Durable state beats chat memory
Long sessions guarantee compaction. Once that is true, any important project state that lives only in the active conversation is temporary state. The harness is safer when continuity is rebuilt from files rather than from remembered chat context.
// local compaction-context.mjs plugin
// reads durable state before continuation
//   1. docs/STATE_MACHINE.md
//   2. docs/specs/*.md or docs/contracts/*.md
//   3. .pi/pipeline-state.json
//   4. .pi/plan/architect-plan.md
//   5. git branch + recent commits
This changes the risk profile of long sessions completely. A continuation that reconstructs from durable state is still limited, but at least the limitation is honest. A continuation that pretends chat memory is sufficient is fragile and often wrong. The important boundary is that the page is describing a local implemented control surface, not claiming that all session continuity problems are solved in the abstract.
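A minimal sketch of what file-backed continuity can look like on the reconstruction side. The file paths follow the plugin's reading order above; the helper itself is illustrative rather than the harness's actual code:

# minimal sketch: rebuild continuation context from durable files, not chat memory
# (file paths mirror compaction-context.mjs above; the helper is illustrative)
import json
from pathlib import Path

def load_continuation_context(worktree: str) -> dict:
    root = Path(worktree)
    state = root / ".pi" / "pipeline-state.json"
    plan = root / ".pi" / "plan" / "architect-plan.md"
    machine = root / "docs" / "STATE_MACHINE.md"
    return {
        "pipeline_state": json.loads(state.read_text()) if state.exists() else None,
        "plan": plan.read_text() if plan.exists() else None,
        "state_machine": machine.read_text() if machine.exists() else None,
    }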
# runner.py
# strip inherited session variables so standalone runs do not reuse stale session state
import os

keep = {
    "APPDATA", "PATH", "TEMP", "TMP", "USERPROFILE", "USERNAME"
}
env = {key: value for key, value in os.environ.items() if key.upper() in keep}

4.1 Deterministic context and deterministic data access
The same principle applies to data access. A remembered path is still just a claim that a path existed when it was remembered. The harness pushes toward deterministic data discovery and deterministic context injection so that workers are given the exact slice of state and data relevant to the task they are performing.
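One way to make data access deterministic is to resolve every dataset through a manifest and refuse to proceed when the file no longer matches its recorded hash. A minimal sketch, assuming a manifest with per-dataset "path" and "sha256" fields (those field names are assumptions for illustration):

# minimal sketch: deterministic data discovery via a hashed manifest
# (manifest field names "path" and "sha256" are illustrative assumptions)
import hashlib
import json
from pathlib import Path

def resolve_dataset(manifest_path: str, name: str) -> Path:
    manifest = json.loads(Path(manifest_path).read_text())
    entry = manifest[name]                      # fail loudly if the slice is unknown
    path = Path(entry["path"])
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    if digest != entry["sha256"]:
        raise RuntimeError(f"BLOCKED: {name} drifted from its recorded hash")
    return path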
4.2 Hooks, plugins, and skills are the actual control surfaces
The harness does not rely on one monolithic enforcement layer. It relies on several smaller control surfaces that fire at different times: operating rules in the main system prompt, plugins that intercept or enrich tool use, and skills that inject domain-specific verification checklists into the session. That split matters because different failure modes appear at different levels of the workflow.
| Control surface | Example | What it does | Why it matters |
|---|---|---|---|
| Rules | AGENTS.md | Blocks known time-series mistakes and enforces state-machine discipline. | Stops obvious quantitative failure modes before they turn into code or claims. |
| Plugins | gate-tools.mjs, compaction-context.mjs, research-grounding.mjs | Intercept tool calls, rebuild state, and track whether literature-dependent work was grounded. | Makes control logic executable instead of purely advisory. |
| Skills | qi-temporal-integrity, qi-walk-forward, qi-evidence-validator | Inject narrow, domain-specific checklists for leakage, windowing, and evidence completeness. | Keeps specialized quantitative checks from being forgotten in long sessions. |
// research-grounding.mjs
// soft gate for literature-dependent work
if (toolLooksLikeLiteratureSearch && noGroundingUsedYet) {
log.warn("prefer alphaXiv or direct primary sources")
}
// persists alphaXiv usage across compaction
output.context.push("Research Grounding Status")5. Runs need artifacts, not narration
The evaluation harness is useful because it writes canonical artifacts instead of relying on textual self-description. Each run builds pipeline evidence and a manifest/hash chain. That means later review can ask what the run actually did rather than trusting what the run said it did.
# runner.py
pipeline_evidence = _build_pipeline_evidence(result, case)
pipeline_evidence_path = _write_pipeline_evidence(run_dir, pipeline_evidence)
manifest = _build_run_manifest(result, case, cmd)
manifest_path = _write_manifest(run_dir, manifest)
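A minimal sketch of what a hash-linked manifest can contain. The helper below illustrates the idea of chaining artifact digests so later tampering or omission is detectable; it is not the runner's actual _build_run_manifest implementation:

# minimal sketch: hash-link the run's artifacts so provenance is checkable later
# (illustrative only; not the actual _build_run_manifest implementation)
import hashlib
import json
from pathlib import Path

def build_run_manifest(run_dir: str, artifact_names: list[str], cmd: list[str]) -> dict:
    artifacts = []
    prev_hash = ""
    for name in artifact_names:
        data = (Path(run_dir) / name).read_bytes()
        digest = hashlib.sha256(prev_hash.encode() + data).hexdigest()
        artifacts.append({"name": name, "sha256": digest})
        prev_hash = digest                      # chain each entry to the one before it
    return {"cmd": cmd, "artifacts": artifacts}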
The scorer then treats routing, compliance, and output as separate dimensions rather than one opaque overall score. That is a better design for technical workflows because a strong final paragraph should not be able to hide a routing failure or a missed policy block.
# scorer.py
WEIGHTS = {"routing": 0.4, "compliance": 0.4, "output": 0.2}

if pipeline_score is not None:
    score.routing_score = pipeline_score
elif expected_agents:
    score.routing_score = routed
if execution_failed:
    score.overall = 0.0

That weighting choice is important. Routing and compliance together outweigh the surface quality of the answer. In other words, the harness is deliberately biased toward process correctness over rhetorical polish. For high-cost technical work, that is the right bias.
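As a worked example under these weights, a plain weighted sum (an assumption; the snippet above does not show the combination rule) makes the bias concrete: perfect output cannot rescue a failed routing dimension, and the numbers line up with the preserved pilot report's overall of 0.40.

# minimal sketch, assuming overall is a plain weighted sum of the three dimensions
# (weights come from scorer.py above; the combination rule itself is an assumption)
WEIGHTS = {"routing": 0.4, "compliance": 0.4, "output": 0.2}

def overall(routing: float, compliance: float, output: float) -> float:
    return (WEIGHTS["routing"] * routing
            + WEIGHTS["compliance"] * compliance
            + WEIGHTS["output"] * output)

# a polished-but-wrong run: perfect output, partial compliance, failed routing
print(overall(routing=0.0, compliance=0.5, output=1.0))  # 0.4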
# README routing detection notes
1. pipeline-state.json roles/verdicts
2. pipeline-state.resolved.json canonical evidence
3. task tool calls (subagent_type)
# plain text alone is not a reliable routing signal
| Artifact | Why it exists | What it prevents |
|---|---|---|
| pipeline-state.resolved.json | Canonical routing/completion evidence | Fake “done” claims without machine-readable state |
| run_manifest.json | Hash-linked artifact chain and execution metadata | Vague provenance and unverifiable runs |
| Split manifest | Hash-backed train/public/holdout provenance | Quiet dataset drift and holdout confusion |
# pipeline-state.resolved.json
{
"pipelineComplete": false,
"passRatio": 0.0,
"reason": "partial_routing_no_pipeline_state",
"expectedAgents": [...],
"detectedAgents": [...],
"requiredOutputSections": {...},
"execution": {"timedOut": false, "exitCode": 0}
}

5.1 Release gates are explicit and intentionally unforgiving
The benchmark logic is useful because it defines concrete reasons why a final claim should fail. Missing manifest artifacts, missing pipeline evidence, missing human-quality evidence, insufficient reproducibility runs, unverifiable dataset provenance, and absent routing-completion evidence are all explicit release-gate failures.
# benchmark.py release gate summary
if critical_policy_breaches > 0:
    reasons.append("critical_policy_breaches")
if not all(manifest_exists):
    reasons.append("missing_manifest_artifacts")
if not human_ratings:
    reasons.append("human_quality_evidence_missing")
if runs_per_cell < 2:
    reasons.append("reproducibility_runs_missing")
if provenance_split != "private-holdout":
    reasons.append("final_holdout_not_used")

The practical point is that the system does not just say “quality is important.” It names the exact conditions under which quality is considered unproven.
6. Failure only counts if it changes the system
The harness became more believable once failure stopped being a conversational annoyance and became a persistent system input. The self-improvement loop is simple enough to state and useful enough to matter: detect, diagnose, fix, record, prevent. The key move is from fix to record. A correction that does not update durable lessons is only a patch to the current session.
DETECT -> DIAGNOSE -> FIX -> RECORD -> PREVENT
The point is not ceremony. It is that recurring failures such as context bleed, tool misuse, stale state, unsupported claims, and enforcement gaps are turned into explicit classes with explicit prevention mechanisms. That is how the system gets better across sessions instead of only within one.
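A minimal sketch of the RECORD step as an append to a durable file. The lessons-learned.md name matches the persistent output listed in 6.1 below; the helper itself is illustrative:

# minimal sketch: RECORD a failure class into a durable lessons file
# (illustrative helper; lessons-learned.md is the durable output named in 6.1)
from datetime import date
from pathlib import Path

def record_lesson(failure_class: str, symptom: str, prevention: str,
                  path: str = "lessons-learned.md") -> None:
    entry = (f"\n## {date.today().isoformat()} {failure_class}\n"
             f"- Symptom: {symptom}\n"
             f"- Prevention: {prevention}\n")
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(entry)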
6.1 Human review is part of the evidence model, not a decorative final step
The human-rating flow is useful because it treats human review as an admissibility condition rather than a vague preference. The protocol requires a blinded packet, one rating row per benchmark cell, a real output-quality score, and rejection of example/template rating files. That is a stronger design than treating human judgment as an informal afterthought after the benchmark is already written up.
# human_rating_kit.py
prepare_human_rating_kit(...)
    -> rating-packet.jsonl
    -> ratings.blank.jsonl
    -> unblind-map.json
finalize_human_ratings(...)
    -> human-ratings.final.jsonl
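A minimal sketch of the admissibility checks the finalize step implies: one row per benchmark cell, a real output-quality score, and rejection of example/template rows. The row field names follow the finalized-rating example in section 7; the checks themselves are illustrative, not the actual finalize_human_ratings implementation:

# minimal sketch: admissibility checks before accepting human-ratings.final.jsonl
# (illustrative; not the actual finalize_human_ratings implementation)
import json
from pathlib import Path

def validate_ratings(path: str, expected_cells: set[tuple[str, str]]) -> None:
    rows = [json.loads(line) for line in Path(path).read_text().splitlines() if line.strip()]
    cells = {(r["case_id"], r["variant_id"]) for r in rows}
    if len(rows) != len(expected_cells) or cells != expected_cells:
        raise RuntimeError("BLOCKED: need exactly one rating row per benchmark cell")
    for r in rows:
        if not isinstance(r.get("quality_score"), (int, float)):
            raise RuntimeError("BLOCKED: rating row lacks a real output-quality score")
        if r.get("case_id", "").startswith("example"):
            raise RuntimeError("BLOCKED: template/example rating row detected")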
| Failure class | Observable symptom | Persistent output | Prevention mechanism |
|---|---|---|---|
| Context bleed | Wrong project or stale assumptions after continuation | lessons-learned.md | State-file checks |
| Tool misuse | Wrong cwd, wrong path, wrong tool choice | incident-log.json | Checklist and hook updates |
| Unsupported claim | Prose outruns evidence | anti-patterns.md | Evidence gate or prompt rule |
| Enforcement gap | A rule exists in text but does not fire mechanically | lessons-learned.md | Hook or validation implementation |
7. What the current benchmark actually supports
The benchmark and report artifacts are useful, but they need to be read honestly. The harness clearly shows artifact generation, split governance, and separated routing/compliance/output scoring. At the same time, the preserved March report shows routing failure despite strong output scores. That is not embarrassing; it is useful. It shows that the system can expose polished-but-wrong runs instead of hiding them.
# 2026-03-01 pilot report
overall: 0.40
routing: 0.00
compliance: 0.50
output: 1.00
agents spawned: []
The dataset governance layer makes that stronger rather than weaker. The grounded manifest fixes the split ratios, hashes the split outputs, records redaction findings, and stores the seed. That means even the evaluation substrate has provenance and can be challenged mechanically rather than socially.
# grounded manifest highlights
train-dev: 8
public-test: 2
private-holdout: 1
seed: 1337
redactionFindings:
  email: 1
  phone: 1
# finalized human rating row
{
"case_id": "...",
"variant_id": "...",
"model_label": "...",
"quality_score": 0.75,
"blinded_label": "A"
}

| Claim | Current status | Why |
|---|---|---|
| The harness writes real evidence artifacts | Supported | Backed by runner and manifest code. |
| The harness enforces useful blocked patterns | Supported | Backed by explicit operating rules. |
| The harness already demonstrates strong routing quality | Not supported | Sampled report shows routing score 0.00 and no agents spawned. |
| The harness is a useful design for high-cost data and ML workflows | Supported | Backed by control logic plus failure-mode alignment. |
7.1 What still remains unproven
The honest limitation is that the strongest public evidence here is still about control logic, artifact generation, and failure exposure. It is not yet enough to claim mature routing quality or broad empirical dominance. That does not weaken the page. It sharpens it. A system that can show where its current evidence stops is more credible than one that pretends the stop line does not exist.
8. Where this control plane earns its keep
This harness is useful anywhere the cost of silent process failure is high: ML experiments, leakage-prone feature work, backtesting and validation loops, long-running technical sessions that need state continuity, and evidence-sensitive writing or review.
The claim is deliberately narrow. The system is not interesting because it sounds autonomous. It is interesting because it makes stale state, unsupported claims, weak verification, and known time-series mistakes harder to hide.
8.1 Implemented controls versus planned controls
| Mechanism | Status | Public-safe claim |
|---|---|---|
| Blocked time-series rules | Implemented policy | The harness has explicit anti-leakage stop conditions. |
| Artifact generation | Implemented code | Runs emit canonical manifests and pipeline evidence. |
| Benchmark split governance | Implemented workflow | Train/public/holdout provenance is mechanically tracked. |
| Local gate plugins | Implemented locally | Useful local controls, not universal platform guarantees. |
| Full routing excellence | Not yet proven | Current public evidence does not support a strong success claim. |
Notes
- The factual basis for this page comes from the actual operating rules, control-plan documents, orchestration prompt, evaluation code, and benchmark artifacts. The value here is the control logic, not the branding.
- This page avoids role-name marketing on purpose. The hiring signal is the enforcement logic, state handling, artifact generation, and failure discipline.
- The examples and rule blocks are included because they map directly to real ML, backtesting, and time-series failure modes.