A Prediction-Market Thesis That Failed Under Audit
A Polymarket-first research note on what I explored, what failed under audit, what did not port, and which market lessons were still worth carrying forward.
The strongest thesis in this body of work did not survive review. The primary backtest failed on severe leakage, and the Kalshi portability study tested 135 configurations across 107,427 resolved markets and found zero positive edge. The useful output shifted from a trading story to a sharper set of lessons about venue mechanics, fee-adjusted selection, provenance, and what a market claim has to survive before it deserves trust.
1. Why the thesis looked plausible on Polymarket
The project started from a concrete market question, not a methodological one: did Polymarket's combination of cheap contracts, noisy event pricing, and category-specific wording create a favorite/longshot opportunity that could survive a disciplined screen? That question was worth taking seriously because the venue actually exposed the ingredients that make this kind of mistake plausible: cheap opposite-side exposure, broad market surface, and enough wording drift or event duplication to make naive pricing errors believable on first pass.
The useful version of the project therefore became chronological: what was explored first, what seemed promising, what failed when the evidence standard increased, and what parts of the surrounding system still improved future market research.
1.1 What was built, what failed, what survived
| Workstream | Question | What happened | Status |
|---|---|---|---|
| Polymarket exploration | Was there enough visible structure to justify deeper work? | Yes as exploration. No as evidence by itself. | Input |
| Primary backtest | Could the thesis survive formal evaluation? | Audit found complete train/test overlap and severe leakage. | Invalidated |
| Kalshi portability | Did the thesis survive a second venue? | 135 configs across 107,427 resolved markets produced zero positive edge. | Failed to port |
| Project Haven selector | Could same-day candidates be ranked under realistic economics? | Useful as selection infrastructure, not as execution proof. | Survived |
| Matching and provenance | Could cross-venue comparisons be made cheaply and cleanly? | Useful systems work with direct research value. | Survived |
2. Why the headline backtest was invalid
The first strong-looking result died at the only stage that mattered: chronology review. The audit flagged severe data leakage, found 100 percent training/evaluation overlap, and recommended stopping any trading based on the result. Once that happened, the backtest ceased to function as evidence of generalization.
A market workflow that cannot kill its own false edge is not a trustworthy workflow. The important output was not the invalidated backtest. It was the fact that the process had enough friction to invalidate it before the result hardened into a strategy story.
Audit summary
- Severity: severe leakage
- Finding: complete train/test overlap
- Consequence: current backtest is invalid
- Action: stop any trading based on this result
3. Why the thesis died on Kalshi
Kalshi was the portability test, not the center of the story. Once the Polymarket backtest was invalidated, the next sensible question was whether the underlying thesis survived on a different venue. It did not.
The important part is why. The original intuition depended on cheap opposite-side exposure. On Polymarket, low-priced NO contracts can create an appealing payoff profile because the cost of being wrong is small relative to the payout if the event resolves the other way. On Kalshi, the same structure becomes symmetric: a 5-cent YES price implies a 95-cent NO price. The cheap opposite side disappears. So does the original appeal.
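The symmetry argument reduces to simple arithmetic. A minimal sketch (prices are illustrative, not taken from the preserved data):

```python
# Illustrative arithmetic for the payoff-symmetry point. On a venue with
# symmetric binary pricing, a YES price p forces the NO price to 1 - p,
# so the "cheap opposite side" carries no structural discount.

def payoff_ratio(price: float) -> float:
    """Payout per dollar risked if the contract resolves in your favor."""
    return (1.0 - price) / price

yes_price = 0.05            # illustrative 5-cent YES contract
no_price = 1.0 - yes_price  # symmetric venue: NO must cost 95 cents

print(payoff_ratio(yes_price))  # cheap side pays roughly 19-to-1
print(payoff_ratio(no_price))   # opposite side pays roughly 1-to-19
```

The two ratios are exact reciprocals, which is the whole point: once the opposite side is priced at its complement, the asymmetric payoff profile the original thesis depended on no longer exists.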
| Check | Polymarket | Kalshi | Implication |
|---|---|---|---|
| Cheap opposite-side exposure | Part of the original thesis. | Removed by symmetric pricing. | The payoff intuition does not transfer. |
| Configuration sweep | Motivated deeper work. | 135 configs, 0 positive edge. | The thesis failed cleanly under tested mechanics. |
| Category depth | Exploratory breadth existed. | More than 99% of resolved sample was sports/esports. | Portability conclusion is structural, not universal. |
The scale of the failure matters. The portability study covered 107,427 resolved Kalshi markets across 135 tested configurations and produced zero positive edge. That closes the obvious escape hatch. The thesis did not merely degrade. Under the tested venue mechanics, it failed.
3.1 What the opposite portability result would have required
For the cross-venue story to survive, at least three things would have had to be true. First, the second venue would need an equivalent cheap-side structure rather than merely the same market category labels. Second, there would need to be enough category depth to test the original intuition in something closer to the environment where it first appeared. Third, the pricing relationship would need to stay attractive after fees and after realistic fill assumptions. None of those conditions held in the preserved Kalshi evidence.
Portability requirements
1. Same payoff asymmetry.
2. Enough category depth to test the thesis honestly.
3. Fee-adjusted economics still positive.
4. Enough liquidity to make the screen operationally relevant.
4. What survived after the edge died
The surviving value came from the parts of the project that still mattered after the trading story died: fee-aware selection, explicit scope boundaries, cross-venue provenance, and a higher standard for what should count as a candidate in the first place.
4.1 Fee-adjusted selection
The selector logic remained useful because it forced every candidate to answer to explicit economics.
Symbol definitions: P = current contract price, f = fee term, q = estimated probability of the contract resolving in your favor, q* = break-even probability threshold.

f = 0.07 * P * (1 - P)

EV = q(1 - P - f) + (1 - q)(-P - f) = q - P - f

Setting EV to zero gives the break-even threshold:

q* = P + f
The point of this math was not elegance. It changed the question from whether the price looked interesting to whether the price was still interesting after fees, slippage, and concentration controls. If not, it was not a candidate.
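The economics above can be sketched directly. The fee and EV terms follow the formulas in the text; rather than hard-coding a closed-form threshold, this sketch finds the break-even probability numerically as the point where fee-adjusted EV crosses zero:

```python
# Minimal sketch of the fee-adjusted candidate economics. The 0.07
# coefficient is the fee term from the text; everything else is
# illustrative scaffolding, not the project's actual selector code.

def fee(price: float) -> float:
    # f = 0.07 * P * (1 - P)
    return 0.07 * price * (1.0 - price)

def expected_value(q: float, price: float) -> float:
    # EV = q(1 - P - f) + (1 - q)(-P - f)
    f = fee(price)
    return q * (1.0 - price - f) + (1.0 - q) * (-price - f)

def break_even(price: float, steps: int = 100_000) -> float:
    """Smallest q on a fine grid with non-negative fee-adjusted EV."""
    for i in range(steps + 1):
        q = i / steps
        if expected_value(q, price) >= 0.0:
            return q
    return 1.0

p = 0.30
print(f"fee            : {fee(p):.4f}")
print(f"break-even q*  : {break_even(p):.4f}")
print(f"EV at q = 0.40 : {expected_value(0.40, p):.4f}")
```

For a 30-cent contract the fee term is about 1.5 cents, so a candidate has to clear roughly a 31.5 percent win probability before it is worth ranking at all; a price that "looks cheap" below that line is not a candidate.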
The selector also evaluated both sides of the market explicitly. A YES candidate had to clear a favorite-side range and fee-adjusted threshold. A NO candidate had to survive the same economics from the opposite side. That mattered because it prevented the screen from quietly hard-coding one narrative about where value was supposed to live.
Selector logic, simplified
1. Evaluate YES candidate bands.
2. Evaluate NO candidate bands.
3. Compute fee-adjusted EV.
4. Enforce theme and underlying caps.
5. Keep only ranked candidates that survive all filters.
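The pipeline above can be sketched end to end. The band limits, cap values, and field names here are illustrative assumptions, not the project's actual configuration:

```python
# Hypothetical sketch of the selector pipeline. Band limits, theme caps,
# and the Candidate fields are assumptions made for illustration.
from dataclasses import dataclass

@dataclass
class Candidate:
    market: str
    side: str    # "YES" or "NO" -- both sides face the same economics
    price: float # price paid for the chosen side
    q: float     # estimated win probability for that side
    theme: str

def fee(price: float) -> float:
    # fee term from the text: f = 0.07 * P * (1 - P)
    return 0.07 * price * (1.0 - price)

def fee_adjusted_ev(c: Candidate) -> float:
    f = fee(c.price)
    return c.q * (1.0 - c.price - f) + (1.0 - c.q) * (-c.price - f)

def select(candidates, band=(0.60, 0.90), theme_cap=2):
    # Steps 1-2: both YES and NO answer to the same price band.
    in_band = [c for c in candidates if band[0] <= c.price <= band[1]]
    # Step 3: drop anything with non-positive fee-adjusted EV.
    positive = [c for c in in_band if fee_adjusted_ev(c) > 0.0]
    # Step 4: enforce a per-theme concentration cap, best EV first.
    kept, per_theme = [], {}
    for c in sorted(positive, key=fee_adjusted_ev, reverse=True):
        if per_theme.get(c.theme, 0) < theme_cap:
            kept.append(c)
            per_theme[c.theme] = per_theme.get(c.theme, 0) + 1
    # Step 5: return the survivors, already ranked by EV.
    return kept
```

Ranking by fee-adjusted EV rather than raw price is the design point: a contract that cannot clear step 3 never reaches the watchlist, no matter how interesting its price looks.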
4.2 Project Haven as a scope-correct system
Project Haven stayed selection and monitoring only. It used a 24-hour scan window and a tighter action window of four hours or less to build same-day watchlists, rank candidates under explicit fee-aware rules, and preserve an audit trail without pretending to be an execution engine.
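The two-window scheme can be sketched as a simple classifier. The function name and market fields are hypothetical; only the 24-hour and 4-hour windows come from the text:

```python
# Minimal sketch of the scan/action windowing described above.
from datetime import datetime, timedelta, timezone

SCAN_WINDOW = timedelta(hours=24)   # builds the same-day watchlist
ACTION_WINDOW = timedelta(hours=4)  # tighter band that actually gets ranked

def classify(close_time: datetime, now: datetime) -> str:
    """Bucket a market by how close it is to resolution."""
    remaining = close_time - now
    if remaining <= timedelta(0):
        return "closed"
    if remaining <= ACTION_WINDOW:
        return "action"
    if remaining <= SCAN_WINDOW:
        return "watch"
    return "out_of_scope"

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(classify(now + timedelta(hours=2), now))   # action
print(classify(now + timedelta(hours=10), now))  # watch
print(classify(now + timedelta(hours=30), now))  # out_of_scope
```

Everything the classifier emits is a label on a watchlist entry, never an order: the windows bound attention, not positions.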
No trades were placed, no positions were tracked, and no order execution was attempted. That scope boundary was valuable: it kept the surviving system honest about what it is and what it is not, and it prevented the research infrastructure from quietly advertising itself as execution proof.
4.3 Provenance as a research problem
Cross-venue research is meaningless if two contracts do not refer to the same event. That moved matching and provenance from cleanup work into a first-class integrity problem. What survived here is the lesson, not a finished production matcher: event identity has to be treated as part of the research problem, otherwise every later comparison can be corrupted before the analysis starts.
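A toy example shows why naive matching is the floor, not the solution. Everything here is hypothetical illustration, not the project's actual matcher:

```python
# Hypothetical illustration of the provenance problem: two venues can
# describe the same event with different surface strings, and a naive
# token-overlap matcher both catches and misses in ways that corrupt
# any downstream comparison.
import re

def normalize(title: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace."""
    t = re.sub(r"[^a-z0-9 ]", " ", title.lower())
    return " ".join(t.split())

def naive_match(a: str, b: str, threshold: float = 0.7) -> bool:
    """Jaccard similarity between normalized title token sets."""
    ta, tb = set(normalize(a).split()), set(normalize(b).split())
    if not ta or not tb:
        return False
    return len(ta & tb) / len(ta | tb) >= threshold

# Rewording of the same event: caught by token overlap.
print(naive_match("Will the Lakers beat the Celtics on Jan 31?",
                  "Lakers to beat the Celtics on Jan 31"))  # True

# Same event again, but "BTC" vs "Bitcoin" sinks the overlap: missed.
print(naive_match("Will BTC close above $50k on Jan 31?",
                  "Bitcoin to close above 50k on Jan 31"))  # False
```

The second case is the lesson in miniature: a synonym that any human resolves instantly silently breaks the match, and every cross-venue number computed on top of that miss is corrupted before the analysis starts.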
| Carry-forward mechanism | What it changed | Why it still matters |
|---|---|---|
| Fee-adjusted filtering | Stopped price-alone candidates from looking serious. | Reduces false positives before deeper analysis starts. |
| Selection-only scope | Separated ranking from execution claims. | Prevents research infrastructure from pretending to be a deployed system. |
| Provenance and matching | Turned event identity into a first-class research problem. | Any later cross-venue analysis depends on getting this right. |
| Audit-first posture | Made attractive results answer to chronology and realism earlier. | Raises the standard for every later market idea. |
5. Rules the next market thesis must pass
- Venue mechanics outrank broad asset-class labels.
- Category coverage is part of feasibility, not a later caveat.
- Selection logic is where execution realism starts.
- Provenance is part of market research.
- A workflow that cannot kill a false edge is not a trustworthy workflow.
| What looked enough before | What I would require now |
|---|---|
| A strong-looking backtest | Chronology review, leakage check, and explicit invalidation path before belief. |
| Same asset class on a second venue | Proof that the payoff asymmetry and category depth actually transfer. |
| Cheap contract intuition | Fee-adjusted EV, slippage assumptions, and concentration control. |
| A useful selector | Selection-only scope plus honest bug accounting before any execution implication. |
Those lessons are what make the project worth keeping. The next time a result looks strong, the checklist is no longer vague: chronology first, venue structure second, fees and fill assumptions third, and only then any serious discussion of deployment.
This page does not claim a validated tradable edge. It claims that the work made the standard for future market research much sharper.
Notes
- Kalshi is included here as the portability test. It is not the center of gravity of the original research motivation.
- The collector, matching, scheduling, and deployment systems behind this work belong to a separate systems page and are intentionally compressed here.
- All numeric statements in this page are bounded to the audited evidence base; no performance claims survive the final framing.