The Problem This Episode Solves

Most traders do not fail because they avoid backtesting. They fail because their backtesting produces false confidence.

Retail backtesting typically breaks down for one of four reasons: scenarios are undefined, outcomes are not binary, variables are undefined, or sample selection is driven by emotion. The first three distort structure. The fourth distorts reality.

If scenarios are loosely defined, you end up measuring multiple different behaviours under one label. If outcomes are vague, probabilities become subjective. If variables are not defined, you cannot isolate what actually caused different outcomes. And if you select backtesting windows based on recent performance or emotional bias, your dataset stops reflecting market behaviour and starts reflecting your psychology.

If you've followed the series correctly, the first three threats have already been neutralised. Anchor events are defined. Outcomes are binary. Variables are structured. The only remaining threat is emotional sample selection. This episode addresses that directly.
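To make the prerequisites concrete, the three neutralised threats can be encoded as a fixed record shape: a defined anchor, a structured set of variables, and a strictly binary outcome. This is a minimal sketch; the field names and labels are illustrative, not part of the framework itself.

```python
from dataclasses import dataclass

# Hypothetical record for one backtested example. Every field is fixed
# in advance; the only thing filled in per example is its value.
@dataclass(frozen=True)
class ScenarioRecord:
    date: str            # trading date of the example
    anchor_event: str    # predefined anchor label
    regime: str          # predefined market-regime label
    variables: tuple     # structured variable values, defined up front
    outcome_valid: bool  # binary outcome: True = valid, False = invalid

record = ScenarioRecord(
    date="2024-03-05",
    anchor_event="weak_displacement_15m",
    regime="lack_of_structure_hourly_and_below",
    variables=("session_london", "initiation_liquidity_confirmed"),
    outcome_valid=True,
)
```

Because the record is frozen and the outcome is a plain boolean, there is no field in which interpretation can leak in during collection.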

Where We Are in the Pipeline

Within the quant-inspired framework, this is the first validation stage. The hypothesis has been constructed. Variables have been defined. Now we test whether the hypothesis holds up against historical data.

The objective is simple: collect data without interpretation.

Manual vs Automated Backtesting

There are two broad approaches. Manual backtesting applies predefined rules by hand, one example at a time. Automated backtesting codes those rules so a computer applies them to historical data.

At the beginning of strategy development, manual backtesting is superior. It forces you to recognise anchor events precisely, identify variables consistently, confirm regime correctly, and build conviction in your structural logic. Automation increases sample size. Manual work increases structural understanding. At this stage, structural clarity matters more than scale.

Selecting an Unbiased Backtesting Window

One of the most overlooked factors in retail backtesting is session selection. Markets behave differently across sessions — liquidity changes, participants change, volatility profiles change. A scenario in the Asian session is not the same environment as the same scenario in London or New York. If you combine sessions indiscriminately, you mix behavioural environments. That corrupts probability.

Backtesting must reflect how you actually trade. The process is straightforward: align the window with your real trading availability, choose liquid and repeatable conditions, and keep the window consistent across every dataset. Once defined, the window does not change. Consistency prevents environmental mixing.
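A locked window can be expressed as a simple time filter that every dataset passes through. This is a sketch; the session boundaries below are hypothetical placeholders, and the actual times depend on your real availability and stay fixed once chosen.

```python
from datetime import datetime, time

# Hypothetical session window (e.g. a London-session slice, broker time).
# Chosen once, before collection begins, and never adjusted afterwards.
SESSION_START = time(8, 0)
SESSION_END = time(12, 0)

def in_window(ts: datetime) -> bool:
    """True only when a candle timestamp falls inside the locked window."""
    return SESSION_START <= ts.time() < SESSION_END

# Every example, in every dataset, is screened by the same filter.
assert in_window(datetime(2024, 3, 5, 9, 30))
assert not in_window(datetime(2024, 3, 5, 14, 0))
```

Applying one filter everywhere is what prevents Asian, London, and New York behaviour from being mixed into a single probability.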

The Manual Backtesting Procedure

Once the window is locked, the procedure is mechanical. First, confirm the scenario before opening charts — anchor, features, variables, and regime definitions must already exist. Second, operate strictly within the defined time window. Third, identify the complete scenario: anchor event, market regime, and features and variables. Only after the full scenario is defined do you assess outcome.

Fourth, record — do not interpret. You do not analyse probabilities yet. You do not speculate about patterns. You do not adjust rules mid-process. You simply classify: valid or invalid. That is all. Interpretation happens later.
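The classification step above reduces to a single question with a binary answer, which can be sketched as one function. Anything richer than this, during collection, is already interpretation.

```python
def classify_outcome(reached_objective_before_invalidation: bool) -> str:
    # The only question asked during collection: did price reach the
    # predefined objective before invalidation? No probabilities, no
    # pattern notes, no mid-process rule adjustments.
    return "valid" if reached_objective_before_invalidation else "invalid"
```

The deliberately narrow signature is the point: there is no parameter through which discretion, context, or hindsight can enter.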

Handling Outliers Impartially

Outliers must be defined before reviewing results. Removing examples after seeing performance is curve fitting. Exclusions must be structural, not emotional — defined in advance. Anything outside predefined criteria remains in the dataset.

In this model, outliers are predefined as: forecasted high-impact news events, abnormal volatility spikes, and US or UK bank holidays.
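Those three exclusion criteria can be written down as code before any results are reviewed, which makes the "structural, not emotional" rule enforceable. The concrete dates and the volatility threshold below are placeholders, not values from the source.

```python
# Exclusion criteria fixed in advance of reviewing any outcomes.
# Dates and threshold are illustrative placeholders.
NEWS_DATES = {"2024-03-08"}       # forecasted high-impact news events
BANK_HOLIDAYS = {"2024-05-27"}    # US or UK bank holidays
VOL_SPIKE_THRESHOLD = 3.0         # abnormal volatility, e.g. 3x normal range

def is_outlier(date: str, vol_ratio: float) -> bool:
    """Structural exclusion: True only when a predefined criterion is met.

    Anything that does not meet a criterion stays in the dataset,
    regardless of how the example performed.
    """
    return (
        date in NEWS_DATES
        or date in BANK_HOLIDAYS
        or vol_ratio >= VOL_SPIKE_THRESHOLD
    )
```

Because the function exists before results do, removing an example after seeing its outcome would require editing the rules, which is exactly the curve fitting the text warns against.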

Example — Applying the Process

An anchor event appears: weak displacement through 15-minute swing liquidity. Market structure is identified across timeframes. In the example, the weekly shows a lack of structure, the daily is structured, and the 4-hour and hourly both show a lack of structure. This classifies as a lack of structure on the hourly and below.

Supplementary confluences are evaluated using the predefined 5-minute validation rule. Initiation liquidity is identified and confirmed at the correct timeframe. Once the scenario is fully classified, the objective is defined. Then price action is replayed. Did price reach the objective before invalidation? Yes. The outcome is recorded as valid. No interpretation — just classification.

For bookkeeping, the date and screenshots are logged so that each example can be audited later if required. The process is repeated for every example.
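The bookkeeping step can be sketched as an append-only log: one row per example, holding only the date, the scenario classification, the binary outcome, and a screenshot reference. Field names and the example values are illustrative assumptions.

```python
audit_log = []

def log_example(date, scenario_label, outcome, screenshot_path):
    # Record only: no probability estimates, no commentary fields.
    # The screenshot reference allows the row to be audited later.
    audit_log.append({
        "date": date,
        "scenario": scenario_label,
        "outcome": outcome,          # "valid" or "invalid"
        "screenshot": screenshot_path,
    })

log_example(
    "2024-03-05",
    "weak_displacement_15m / lack_of_structure_hourly_and_below",
    "valid",
    "screenshots/2024-03-05.png",
)
```

Appending rather than editing keeps the dataset honest: an example, once recorded, is never quietly revised after the fact.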

The Critical Discipline

The strength of this method lies in repetition. Every scenario is identified the same way, classified the same way, and evaluated the same way. No discretionary interpretation is introduced during collection. By separating data collection from analysis, you prevent narrative contamination.

Improper backtesting builds confidence. Proper backtesting builds evidence. Confidence feels good. Evidence holds under scrutiny. This stage ensures that when probabilities are calculated in the next episode, they reflect behaviour — not bias.

Bridge to the Next Episode

The goal of this stage is not to prove the strategy works. The goal is to build a clean dataset. In Episode 07, raw data becomes probability structure. Now that the collection process is defined, the next step is interpretation — converting rows into meaning.