A Prediction-Market Thesis That Failed Under Audit
A Polymarket-first research note on what I explored, what failed under audit, what did not port, and which market lessons were still worth carrying forward.
The strongest thesis in this body of work did not survive review. The primary backtest failed on severe leakage, and the Kalshi portability study tested 135 configurations across 107,427 resolved markets and found zero positive edge. The useful output shifted from a trading story to a sharper set of lessons about venue mechanics, fee-adjusted selection, provenance, and what a market claim has to survive before it deserves trust.
1. Why the thesis looked plausible on Polymarket
The project started from a concrete market question, not a methodological one: did Polymarket's combination of cheap contracts, noisy event pricing, and category-specific wording create a favorite/longshot opportunity that could survive a disciplined screen? That question was worth taking seriously because the venue actually exposed the ingredients that make this kind of mistake plausible: cheap opposite-side exposure, broad market surface, and enough wording drift or event duplication to make naive pricing errors believable on first pass.
The useful version of the project therefore became chronological: what was explored first, what seemed promising, what failed when the evidence standard increased, and what parts of the surrounding system still improved future market research.
1.1 What was built, what failed, what survived
| Workstream | Question | What happened | Status |
|---|---|---|---|
| Polymarket exploration | Was there enough visible structure to justify deeper work? | Yes as exploration. No as evidence by itself. | Input |
| Primary backtest | Could the thesis survive formal evaluation? | Audit found complete train/test overlap and severe leakage. | Invalidated |
| Kalshi portability | Did the thesis survive a second venue? | 135 configs across 107,427 resolved markets produced zero positive edge. | Failed to port |
| Project Haven selector | Could same-day candidates be ranked under realistic economics? | Useful as selection infrastructure, not as execution proof. | Survived |
| Matching and provenance | Could cross-venue comparisons be made cheaply and cleanly? | Useful systems work with direct research value. | Survived |
2. Why the headline backtest was invalid
The first strong-looking result died at the only stage that mattered: chronology review. The audit flagged severe data leakage, found 100 percent training/evaluation overlap, and recommended stopping any trading based on the result. Once that happened, the backtest ceased to function as evidence of generalization.
A market workflow that cannot kill its own false edge is not a trustworthy workflow. The important output was not the invalidated backtest. It was the fact that the process had enough friction to invalidate it before the result hardened into a strategy story.
Audit summary
- Severity: severe leakage
- Finding: complete train/test overlap
- Consequence: current backtest is invalid
- Action: stop any trading based on this result
3. Why the thesis died on Kalshi
Kalshi was the portability test, not the center of the story. Once the Polymarket backtest was invalidated, the next sensible question was whether the underlying thesis survived on a different venue. It did not.
The important part is why. The original intuition depended on cheap opposite-side exposure. On Polymarket, low-priced NO contracts can create an appealing payoff profile because the cost of being wrong is small relative to the payout if the event resolves the other way. On Kalshi, the same structure becomes symmetric: a 5-cent YES price implies a 95-cent NO price. The cheap opposite side disappears. So does the original appeal.
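The symmetry argument reduces to simple arithmetic. A minimal sketch (prices are illustrative, not taken from the preserved data):

```python
# Illustrative arithmetic for the payoff-symmetry point. On a venue with
# symmetric binary pricing, a YES price p forces the NO price to 1 - p,
# so the "cheap opposite side" carries no structural discount.

def payoff_ratio(price: float) -> float:
    """Payout per dollar risked if the contract resolves in your favor."""
    return (1.0 - price) / price

yes_price = 0.05            # illustrative 5-cent YES contract
no_price = 1.0 - yes_price  # symmetric venue: NO must cost 95 cents

print(payoff_ratio(yes_price))  # cheap side pays roughly 19-to-1
print(payoff_ratio(no_price))   # opposite side pays roughly 1-to-19
```

The two ratios are exact reciprocals, which is the whole point: once the opposite side is priced at its complement, the asymmetric payoff profile the original thesis depended on no longer exists.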
| Check | Polymarket | Kalshi | Implication |
|---|---|---|---|
| Cheap opposite-side exposure | Part of the original thesis. | Removed by symmetric pricing. | The payoff intuition does not transfer. |
| Configuration sweep | Motivated deeper work. | 135 configs, 0 positive edge. | The thesis failed cleanly under tested mechanics. |
| Category depth | Exploratory breadth existed. | More than 99% of resolved sample was sports/esports. | Portability conclusion is structural, not universal. |
The scale of the failure matters. The portability study covered 107,427 resolved Kalshi markets across 135 tested configurations and produced zero positive edge. That closes the obvious escape hatch. The thesis did not merely degrade. Under the tested venue mechanics, it failed.
3.1 What the opposite portability result would have required
For the cross-venue story to survive, at least three things would have had to be true. First, the second venue would need an equivalent cheap-side structure rather than merely the same market category labels. Second, there would need to be enough category depth to test the original intuition in something closer to the environment where it first appeared. Third, the pricing relationship would need to stay attractive after fees and after realistic fill assumptions. None of those conditions held in the preserved Kalshi evidence.
Portability requirements
1. Same payoff asymmetry.
2. Enough category depth to test the thesis honestly.
3. Fee-adjusted economics still positive.
4. Enough liquidity to make the screen operationally relevant.
4. What survived after the edge died
The surviving value came from the parts of the project that still mattered after the trading story died: fee-aware selection, explicit scope boundaries, cross-venue provenance, and a higher standard for what should count as a candidate in the first place.
4.1 Fee-adjusted selection
The selector logic remained useful because it forced every candidate to answer to explicit economics.
Symbol definitions: P = current contract price, f = fee term, q = estimated probability of the contract resolving in your favor, q* = break-even probability threshold.

f = 0.07 * P * (1 - P)

EV = q(1 - P - f) + (1 - q)(-P - f) = q - P - f

Setting EV to zero gives the break-even threshold:

q* = P + f
The point of this math was not elegance. It changed the question from whether the price looked interesting to whether the price was still interesting after fees, slippage, and concentration controls. If not, it was not a candidate.
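The economics above can be sketched directly. The fee and EV terms follow the formulas in the text; rather than hard-coding a closed-form threshold, this sketch finds the break-even probability numerically as the point where fee-adjusted EV crosses zero:

```python
# Minimal sketch of the fee-adjusted candidate economics. The 0.07
# coefficient is the fee term from the text; everything else is
# illustrative scaffolding, not the project's actual selector code.

def fee(price: float) -> float:
    # f = 0.07 * P * (1 - P)
    return 0.07 * price * (1.0 - price)

def expected_value(q: float, price: float) -> float:
    # EV = q(1 - P - f) + (1 - q)(-P - f)
    f = fee(price)
    return q * (1.0 - price - f) + (1.0 - q) * (-price - f)

def break_even(price: float, steps: int = 100_000) -> float:
    """Smallest q on a fine grid with non-negative fee-adjusted EV."""
    for i in range(steps + 1):
        q = i / steps
        if expected_value(q, price) >= 0.0:
            return q
    return 1.0

p = 0.30
print(f"fee            : {fee(p):.4f}")
print(f"break-even q*  : {break_even(p):.4f}")
print(f"EV at q = 0.40 : {expected_value(0.40, p):.4f}")
```

For a 30-cent contract the fee term is about 1.5 cents, so a candidate has to clear roughly a 31.5 percent win probability before it is worth ranking at all; a price that "looks cheap" below that line is not a candidate.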
The selector also evaluated both sides of the market explicitly. A YES candidate had to clear a favorite-side range and fee-adjusted threshold. A NO candidate had to survive the same economics from the opposite side. That mattered because it prevented the screen from quietly hard-coding one narrative about where value was supposed to live.
Selector logic, simplified
1. Evaluate YES candidate bands.
2. Evaluate NO candidate bands.
3. Compute fee-adjusted EV.
4. Enforce theme and underlying caps.
5. Keep only ranked candidates that survive all filters.
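The pipeline above can be sketched end to end. The band limits, cap values, and field names here are illustrative assumptions, not the project's actual configuration:

```python
# Hypothetical sketch of the selector pipeline. Band limits, theme caps,
# and the Candidate fields are assumptions made for illustration.
from dataclasses import dataclass

@dataclass
class Candidate:
    market: str
    side: str    # "YES" or "NO" -- both sides face the same economics
    price: float # price paid for the chosen side
    q: float     # estimated win probability for that side
    theme: str

def fee(price: float) -> float:
    # fee term from the text: f = 0.07 * P * (1 - P)
    return 0.07 * price * (1.0 - price)

def fee_adjusted_ev(c: Candidate) -> float:
    f = fee(c.price)
    return c.q * (1.0 - c.price - f) + (1.0 - c.q) * (-c.price - f)

def select(candidates, band=(0.60, 0.90), theme_cap=2):
    # Steps 1-2: both YES and NO answer to the same price band.
    in_band = [c for c in candidates if band[0] <= c.price <= band[1]]
    # Step 3: drop anything with non-positive fee-adjusted EV.
    positive = [c for c in in_band if fee_adjusted_ev(c) > 0.0]
    # Step 4: enforce a per-theme concentration cap, best EV first.
    kept, per_theme = [], {}
    for c in sorted(positive, key=fee_adjusted_ev, reverse=True):
        if per_theme.get(c.theme, 0) < theme_cap:
            kept.append(c)
            per_theme[c.theme] = per_theme.get(c.theme, 0) + 1
    # Step 5: return the survivors, already ranked by EV.
    return kept
```

Ranking by fee-adjusted EV rather than raw price is the design point: a contract that cannot clear step 3 never reaches the watchlist, no matter how interesting its price looks.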
4.2 Project Haven as a scope-correct system
Project Haven stayed selection and monitoring only. It used a 24-hour scan window and a tighter action window of four hours or less to build same-day watchlists, rank candidates under explicit fee-aware rules, and preserve an audit trail without pretending to be an execution engine.
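The two-window scheme can be sketched as a simple classifier. The function name and market fields are hypothetical; only the 24-hour and 4-hour windows come from the text:

```python
# Minimal sketch of the scan/action windowing described above.
from datetime import datetime, timedelta, timezone

SCAN_WINDOW = timedelta(hours=24)   # builds the same-day watchlist
ACTION_WINDOW = timedelta(hours=4)  # tighter band that actually gets ranked

def classify(close_time: datetime, now: datetime) -> str:
    """Bucket a market by how close it is to resolution."""
    remaining = close_time - now
    if remaining <= timedelta(0):
        return "closed"
    if remaining <= ACTION_WINDOW:
        return "action"
    if remaining <= SCAN_WINDOW:
        return "watch"
    return "out_of_scope"

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(classify(now + timedelta(hours=2), now))   # action
print(classify(now + timedelta(hours=10), now))  # watch
print(classify(now + timedelta(hours=30), now))  # out_of_scope
```

Everything the classifier emits is a label on a watchlist entry, never an order: the windows bound attention, not positions.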
No trades were placed, no positions were tracked, and no order execution was attempted. That scope boundary was valuable: it kept the surviving system honest about what it is and what it is not, and it prevented the research infrastructure from quietly advertising itself as execution proof.
4.3 Provenance as a research problem
Cross-venue research is meaningless if two contracts do not refer to the same event. That moved matching and provenance from cleanup work into a first-class integrity problem. What survived here is the lesson, not a finished production matcher: event identity has to be treated as part of the research problem, otherwise every later comparison can be corrupted before the analysis starts.
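A toy example shows why naive matching is the floor, not the solution. Everything here is hypothetical illustration, not the project's actual matcher:

```python
# Hypothetical illustration of the provenance problem: two venues can
# describe the same event with different surface strings, and a naive
# token-overlap matcher both catches and misses in ways that corrupt
# any downstream comparison.
import re

def normalize(title: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace."""
    t = re.sub(r"[^a-z0-9 ]", " ", title.lower())
    return " ".join(t.split())

def naive_match(a: str, b: str, threshold: float = 0.7) -> bool:
    """Jaccard similarity between normalized title token sets."""
    ta, tb = set(normalize(a).split()), set(normalize(b).split())
    if not ta or not tb:
        return False
    return len(ta & tb) / len(ta | tb) >= threshold

# Rewording of the same event: caught by token overlap.
print(naive_match("Will the Lakers beat the Celtics on Jan 31?",
                  "Lakers to beat the Celtics on Jan 31"))  # True

# Same event again, but "BTC" vs "Bitcoin" sinks the overlap: missed.
print(naive_match("Will BTC close above $50k on Jan 31?",
                  "Bitcoin to close above 50k on Jan 31"))  # False
```

The second case is the lesson in miniature: a synonym that any human resolves instantly silently breaks the match, and every cross-venue number computed on top of that miss is corrupted before the analysis starts.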
| Carry-forward mechanism | What it changed | Why it still matters |
|---|---|---|
| Fee-adjusted filtering | Stopped price-alone candidates from looking serious. | Reduces false positives before deeper analysis starts. |
| Selection-only scope | Separated ranking from execution claims. | Prevents research infrastructure from pretending to be a deployed system. |
| Provenance and matching | Turned event identity into a first-class research problem. | Any later cross-venue analysis depends on getting this right. |
| Audit-first posture | Made attractive results answer to chronology and realism earlier. | Raises the standard for every later market idea. |
5. Rules the next market thesis must pass
- Venue mechanics outrank broad asset-class labels.
- Category coverage is part of feasibility, not a later caveat.
- Selection logic is where execution realism starts.
- Provenance is part of market research.
- A workflow that cannot kill a false edge is not a trustworthy workflow.
| What looked enough before | What I would require now |
|---|---|
| A strong-looking backtest | Chronology review, leakage check, and explicit invalidation path before belief. |
| Same asset class on a second venue | Proof that the payoff asymmetry and category depth actually transfer. |
| Cheap contract intuition | Fee-adjusted EV, slippage assumptions, and concentration control. |
| A useful selector | Selection-only scope plus honest bug accounting before any execution implication. |
Those lessons are what make the project worth keeping. The next time a result looks strong, the checklist is no longer vague: chronology first, venue structure second, fees and fill assumptions third, and only then any serious discussion of deployment.
This page does not claim a validated tradable edge. It claims that the work made the standard for future market research much sharper.
Notes
- Kalshi is included here as the portability test. It is not the center of gravity of the original research motivation.
- The collector, matching, scheduling, and deployment systems behind this work belong to a separate systems page and are intentionally compressed here.
- All numeric statements in this page are bounded to the audited evidence base; no performance claims survive the final framing.