Episode 06

How to Collect Impartial Trading Data (Manual Backtesting Done Right)

Key Takeaways

Most retail backtesting fails because sample selection and interpretation contaminate results before probabilities are even calculated.
If scenarios, outcomes, and variables are already defined, the only remaining threat is emotional data collection.
Backtesting must reflect the environment you actually trade; mixing sessions mixes behaviour and distorts probability.
Manual backtesting builds structural discipline and conviction before scale is introduced.
Data collection is not interpretation — classification comes first, analysis comes later.
Outliers must be excluded using predefined rules, not removed after reviewing results.
Impartial repetition creates stable datasets; stability is required before probability has meaning.

The Problem This Episode Solves

Most traders do not fail because they avoid backtesting.

They fail because their backtesting produces false confidence.

Retail backtesting typically breaks down for one of four reasons:

Scenarios are undefined.
Outcomes are not binary.
Variables are undefined.
Sample selection shifts emotionally.

The first three distort structure.

The fourth distorts reality.

If scenarios are loosely defined, you end up measuring multiple different behaviors under one label. Instead of isolating the behavior of a specific scenario, you measure the average of many unrelated conditions without realizing it.

If outcomes are vague — “it kind of worked” or “it moved a bit” — probabilities become subjective. Clean statistical comparison becomes impossible.

If variables are not defined, you cannot isolate what actually caused different outcomes. Without that clarity, strategy improvement becomes guesswork.

And finally, if you select backtesting windows or examples based on recent performance or emotional bias, your dataset stops reflecting market behavior and starts reflecting your psychology.

If you’ve followed the series correctly, the first three threats have already been neutralized. Anchor events are defined. Outcomes are binary. Variables are structured.

The only remaining threat is emotional sample selection.

This episode addresses that directly.

Where We Are in the Pipeline

Within the quant-inspired framework, this is the first validation stage.

The hypothesis has been constructed.

Variables have been defined.

Now we test whether the hypothesis holds up against historical data.

Within the series progression:

Anchor event selected
Features defined
Market regimes structured
Binary hypothesis built
Variables formalized

Now we move into impartial validation.

The objective is simple:

Collect data without interpretation.

Objectives of This Stage

The purpose of this episode is to establish disciplined manual backtesting.

Specifically:

Select unbiased backtesting windows
Apply fixed scenario and variable definitions
Collect probability-focused data correctly
Avoid common retail backtesting bias
Exclude outliers using rules, not discretion

This is not optimization.

This is not refinement.

This is clean measurement.

Manual vs Automated Backtesting

There are two broad approaches:

Manual backtesting applies predefined rules by hand, one example at a time.

Automated backtesting codes those rules so a computer applies them to historical data.

At the beginning of strategy development, manual backtesting is superior.

It forces you to:

Recognize anchor events precisely
Identify variables consistently
Confirm the regime correctly
Build conviction in your structural logic

Automation increases sample size.

Manual work increases structural understanding.

At this stage, structural clarity matters more than scale.

Selecting an Unbiased Backtesting Window

One of the most overlooked factors in retail backtesting is session selection.

Markets behave differently across sessions.

Liquidity changes.

Participants change.

Volatility profiles change.

A scenario in the Asian session does not occur in the same environment as the same scenario in London or New York.

If you combine sessions indiscriminately, you mix behavioral environments. That corrupts probability.
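
To make the distortion concrete, here is a small sketch with invented counts. The numbers are purely hypothetical and are not taken from any real dataset; they only show how pooling two sessions produces a rate that describes neither one.

```python
def hit_rate(valid: int, total: int) -> float:
    """Share of examples classified as valid."""
    return valid / total

# Hypothetical counts, invented purely for illustration: (valid, total).
london = (60, 100)
asian = (30, 100)

# Pooling the two sessions adds the counts together.
pooled = (london[0] + asian[0], london[1] + asian[1])

print("London:", hit_rate(*london))
print("Asian: ", hit_rate(*asian))
print("Pooled:", hit_rate(*pooled))
```

The pooled figure sits between the two session figures, so a trader who only trades one of those sessions would be relying on a probability that belongs to neither environment.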

Backtesting must reflect how you actually trade.

The process is straightforward:

Align with real availability.
Choose liquid, repeatable conditions.
Maintain consistency across datasets.

For example, my backtesting window runs from one hour before London open to one hour before New York open. This aligns with availability and captures the most liquid and repeatable market behavior.

Once defined, that window does not change.

Consistency prevents environmental mixing.
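
A fixed window like this can be expressed as a simple membership check. The sketch below is one possible way to do it, not a prescribed implementation. The function names, the open times in the example, and the choice of an inclusive start and exclusive end are all assumptions; supply the session opens that match your own definition.

```python
from datetime import datetime, timedelta, timezone

def backtest_window(london_open: datetime, new_york_open: datetime) -> tuple:
    """Window from one hour before London open to one hour before New York open."""
    one_hour = timedelta(hours=1)
    return london_open - one_hour, new_york_open - one_hour

def in_window(ts: datetime, window: tuple) -> bool:
    """True if ts falls inside the window (start inclusive, end exclusive)."""
    start, end = window
    return start <= ts < end

# Hypothetical open times for a single trading day, expressed in UTC.
london_open = datetime(2024, 3, 5, 8, 0, tzinfo=timezone.utc)
new_york_open = datetime(2024, 3, 5, 14, 30, tzinfo=timezone.utc)
window = backtest_window(london_open, new_york_open)

# A candidate inside the window, then one after the window has closed.
print(in_window(datetime(2024, 3, 5, 9, 15, tzinfo=timezone.utc), window))
print(in_window(datetime(2024, 3, 5, 15, 0, tzinfo=timezone.utc), window))
```

Running every candidate example through the same check removes the temptation to include a clean-looking setup that formed outside the window.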

The Manual Backtesting Procedure

Once the window is locked, the procedure is mechanical.

First, confirm the scenario before opening charts. Anchor, features, variables, and regime definitions must already exist.

Second, operate strictly within the defined time window.

Third, identify the complete scenario:

Anchor event
Market regime
Features/Variables

Only after the full scenario is defined do you assess outcome.

Fourth, record — do not interpret.

You do not analyze probabilities yet.

You do not speculate about patterns.

You do not adjust rules mid-process.

You simply classify:

Valid or invalid.

That is all.

Interpretation happens later.
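
The "record, do not interpret" step can be sketched as a record type that holds only the classification. This is an illustrative structure, not the format used in the series. The names `Observation`, `Outcome`, and `classify`, and the placeholder labels in the example, are assumptions.

```python
from dataclasses import dataclass
from enum import Enum

class Outcome(Enum):
    VALID = "valid"
    INVALID = "invalid"

@dataclass(frozen=True)  # frozen: a recorded row cannot be edited afterwards
class Observation:
    anchor_event: str
    regime: str
    variables: tuple
    outcome: Outcome

def classify(anchor_event: str, regime: str, variables: list,
             reached_objective_before_invalidation: bool) -> Observation:
    """Build the full scenario first, then attach the binary outcome."""
    outcome = (Outcome.VALID if reached_objective_before_invalidation
               else Outcome.INVALID)
    return Observation(anchor_event, regime, tuple(variables), outcome)

# Placeholder labels; use the definitions fixed before the charts were opened.
obs = classify("example anchor", "example regime", ["example variable"], True)
print(obs.outcome.value)
```

The record has no field for notes, impressions, or probabilities. Leaving those out is a deliberate design choice in this sketch: anything that is not a classification waits for the analysis stage.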

Handling Outliers Impartially

Outliers must be defined before reviewing results.

Removing examples after seeing performance is curve fitting.

In my model, outliers are predefined as:

Forecasted high-impact news events
Abnormal volatility spikes
US or UK bank holidays

These exclusions are structural, not emotional.

They are defined in advance.

Anything outside predefined criteria remains in the dataset.
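
One way to keep exclusions impartial is to encode them as a function that never sees the trade outcome. The sketch below assumes a hypothetical `SessionContext` record and a hypothetical volatility measure and threshold. The article does not specify how an abnormal spike is measured, so `range_vs_average` and the value `3.0` are placeholders to replace with your own predefined rule.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: a spike is a session range at least this many times the typical range.
VOLATILITY_SPIKE_RATIO = 3.0

@dataclass(frozen=True)
class SessionContext:
    high_impact_news: bool    # forecasted high-impact news event in the window
    bank_holiday: bool        # US or UK bank holiday
    range_vs_average: float   # hypothetical measure: session range / typical range

def exclusion_reason(ctx: SessionContext) -> Optional[str]:
    """Return the predefined reason to exclude this example, or None to keep it."""
    if ctx.high_impact_news:
        return "forecasted high-impact news"
    if ctx.range_vs_average >= VOLATILITY_SPIKE_RATIO:
        return "abnormal volatility spike"
    if ctx.bank_holiday:
        return "US or UK bank holiday"
    return None

print(exclusion_reason(SessionContext(False, False, 1.2)))
print(exclusion_reason(SessionContext(True, False, 1.2)))
```

Because the function takes no outcome argument, an example cannot be dropped for performing badly. Only the criteria fixed in advance can remove it.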

Example: Applying the Process

Within the defined session window, the process is executed identically each time.

An anchor event appears — weak displacement through 15-minute swing liquidity.

Market structure is identified across timeframes.

In the example provided:

Weekly: Lack of structure
Daily: Structured
4-hour: Lack of structure
Hourly: Lack of structure

This classifies the regime as a lack of structure at hourly and below.

Supplementary confluences are evaluated using the predefined 5-minute validation rule.

Initiation liquidity is identified and confirmed at the correct timeframe.

Once the scenario is fully classified, the objective is defined.

Then price action is replayed.

Did price reach the objective before invalidation?

Yes.

The outcome is recorded as valid.

No interpretation.

Just classification.

For bookkeeping, the date and screenshots are logged with each example. This allows later auditing if required.

This is repeated for every example.
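
The bookkeeping can be as simple as one row per example in a CSV file. The following is a minimal sketch using Python's standard `csv` module. The column names, the `log_example` helper, and the screenshot path are assumptions chosen for illustration, and an in-memory buffer stands in for a real file.

```python
import csv
import io

# Hypothetical column set; extend it with your own predefined variables.
FIELDS = ["date", "anchor_event", "regime", "outcome", "screenshot"]

def log_example(stream, row: dict) -> None:
    """Append one classified example, writing the header if the stream is empty."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    if stream.tell() == 0:
        writer.writeheader()
    writer.writerow(row)

# In-memory buffer used here in place of a file on disk.
buffer = io.StringIO()
log_example(buffer, {
    "date": "2024-03-05",
    "anchor_event": "example anchor",
    "regime": "example regime",
    "outcome": "valid",
    "screenshot": "screens/2024-03-05_01.png",
})
print(buffer.getvalue())
```

`csv.DictWriter` raises a `ValueError` by default if a row contains a key outside `FIELDS`, which helps keep ad hoc commentary columns out of the dataset during collection.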

The Critical Discipline

The strength of this method lies in repetition.

Every scenario is:

Identified the same way
Classified the same way
Evaluated the same way

No discretionary interpretation is introduced during collection.

By separating data collection from analysis, you prevent narrative contamination.

Why This Matters

Improper backtesting builds confidence.

Proper backtesting builds evidence.

Confidence feels good.

Evidence holds under scrutiny.

This stage ensures that when probabilities are calculated in the next episode, they reflect behavior — not bias.

Closing Thoughts

By now you should understand:

Why most retail backtesting produces false confidence
How session selection influences probability
Why manual backtesting builds structural discipline
How to collect data impartially
How to exclude outliers without curve fitting

The goal is not to prove the strategy works.

The goal is to build a clean dataset.

In Episode 07, raw data becomes probability structure.

Now that the collection process is defined, the next step is interpretation — converting rows into meaning.
