Production System · Regime-Switching · Algorithmic Trading · WebSocket Streaming · Oracle Cloud

Multi-Asset Regime-Switching Scalper

Dynamic Parameter Selection via Empirical Walk-Forward Analysis

A solo-built, 24/7 production trading system executing mean-reversion scalping across six instruments. The core engineering challenge: no single parameter set is robust across all market regimes. The solution: empirically derive the best configuration per regime, detect regime at runtime, and switch.

6-week solo build
55.74% mean OOS win rate
10× cycle latency reduction via WebSocket buffer
~100 ms further reduction from eliminating CCXT in favor of the direct Binance API
Oracle Always Free — 4 vCPU / 24 GB
12,288 configs screened via distributed brute-force
Problem Space

The Regime Dependency Problem

The actual problem is not "build a trading bot." It is: a mean-reversion scalper designed for sideways markets frequently generates false signals in trending conditions, and vice versa. The naive fix — find one parameter set that works everywhere — is empirically unsound. A single static configuration either over-tunes to the dominant historical regime or averages to mediocrity across all of them.

The design problem therefore became: empirically identify the best-performing configuration per regime, implement a runtime mechanism to detect the current regime and apply the corresponding config, and do this within a latency budget that does not compromise signal integrity.

Hard Constraints

1.5–3s execution budget per 60s cycle
6 instruments × 2 asset classes
Self-managed VM, no vendor risk layer
Solo build, Oracle Always Free budget

Key insight: Stability scores across the full config space range from 0.043–0.350. The system is treated as statistically promising, not statistically proven, until live execution data accumulates.

Technology

Stack & Infrastructure

| Layer | Technology |
|---|---|
| Strategy runtime | Python 3.12, pandas, NumPy, SciPy (OLS) |
| API server | FastAPI + uvicorn (async) |
| Market data | Direct Binance API WebSocket (streaming) + REST; CCXT eliminated |
| Frontend | Next.js with WebSocket real-time signal streaming |
| Database | SQLite (trade journal, candle cache) |
| Infrastructure | Oracle Cloud Japan: 4 vCPU, 24 GB RAM, Always Free |
| Deployment | systemd service, 24/7 |
| Alerting | Telegram Bot API |
| Parameter research | Distributed brute-force across Azure + DigitalOcean + Oracle VMs |
Architecture

System Overview

A unified asyncio.gather entry point co-hosts the trading loop and FastAPI in a single process — enabling zero-IPC state sharing between the executor and the live dashboard.

Internal Topology

Container Diagram

Every container shares the same process. MultiStrategyExecutor is injected as a singleton into the API router — the fix for the always-empty positions production bug.

Signal Layer

Regime Detection & Config Switching

CryptoScalpingStrategyFF wraps the base strategy. Before each signal call it runs _detect_regime_realtime() over the trailing 1000 bars and mutates its own instance attributes to the pre-validated specialist config for that regime.

- STRESS: Realized vol > 90th percentile of the 1000-bar window. Hard USDT cap applied.
- BULL: Price above EMA200, EMA50 above EMA200.
- BEAR: Price below EMA200, EMA50 below EMA200.
- SIDEWAYS: None of the above (low vol, EMA convergence).
brain/strategy/crypto_scalping_strategy_ff.py
```python
def _detect_regime_realtime(self, df: pd.DataFrame) -> str:
    close = df["close"]
    ema200 = close.ewm(span=200).mean()
    ema50 = close.ewm(span=50).mean()

    # STRESS: realized vol > 90th percentile
    log_ret = np.log(close / close.shift(1)).dropna()
    roll_vol = log_ret.rolling(20).std()
    if roll_vol.iloc[-1] > roll_vol.quantile(0.90):
        return "STRESS"

    price = close.iloc[-1]
    if price > ema200.iloc[-1] and ema50.iloc[-1] > ema200.iloc[-1]:
        return "BULL"
    if price < ema200.iloc[-1] and ema50.iloc[-1] < ema200.iloc[-1]:
        return "BEAR"
    return "SIDEWAYS"


def generate_signal(self, df: pd.DataFrame):
    regime = self._detect_regime_realtime(df.tail(1000))
    cfg = REGIME_SPECIALIST_CONFIGS[regime]

    # Mutate to specialist parameters for this regime
    self.lookback = cfg["lookback"]
    self.atr_period = cfg["atr_period"]
    self.vol_period = cfg["vol_period"]
    self.mode = cfg["mode"]

    return super().generate_signal(df)
```
State Machine

Regime Transition Logic

Regime is re-evaluated on every bar. Transitions are stateless — no hysteresis or debounce. STRESS classification runs first as the highest-priority check.

Dual-Strategy Execution

Capital Allocation & Conflict Resolution

Two concurrent strategies run per cycle. If both signal opposing directions on the same instrument, FF takes precedence — preventing simultaneous opposing positions.

CryptoScalpingStrategyFF

70% Capital

Regime-switching wrapper. Applies specialist parameters per regime. The active bet — higher allocation reflects confidence in the regime-aware approach, especially in STRESS.

CryptoScalpingStrategyLR

30% Capital

Static lr_snapshot|lb180 config — the robust baseline. Higher signal volume (465 OOS signals). Hedges against regime misclassification.

brain/execution/multi_strategy_executor.py
```python
def aggregate_signals(self, symbol: str, df: pd.DataFrame):
    ff_signal = self.elite_ff.generate_signal(df)           # 70%
    static_signal = self.static_pivot.generate_signal(df)   # 30%

    if ff_signal and static_signal:
        if ff_signal.direction != static_signal.direction:
            self._log_conflict(symbol, ff_signal, static_signal)
            return self._build_order(ff_signal, allocation=0.70)
        return self._blend_signals(ff_signal, static_signal)

    if ff_signal:
        return self._build_order(ff_signal, allocation=0.70)
    if static_signal:
        return self._build_order(static_signal, allocation=0.30)
    return None
```
Data Flow

Per-Cycle Execution Sequence

The WebSocket kline buffer eliminates per-cycle REST pagination. The data stage drops from network-bound to a constant-time O(1) memory read.

Architecture Upgrade

WebSocket Kline Buffer: REST → Streaming

The dominant latency bottleneck was the per-cycle REST pagination fetch (6,224 ms). The strategy only needs the initial history plus incremental 1-minute updates — a WebSocket-backed buffer was introduced to fix this.

Before — REST Polling

| Stage | Latency |
|---|---|
| OHLCV REST fetch (3 × 1,100 bars) | 6,224 ms |
| Reconcile | 304 ms |
| Signal generation | 686 ms |
| Total cycle | 7,215 ms |

After — WebSocket Buffer

| Stage | Latency |
|---|---|
| Buffer seed (startup only) | 5,873 ms |
| Buffer read (O(1) memory) | 0.4 ms |
| Reconcile | 298 ms |
| Signal generation | 392 ms |
| Total per cycle | 691 ms |

- Data fetch latency: 0.4 ms (15,560× faster)
- Full cycle time: 0.7 s (≈10× faster)
- Sleep budget per cycle: 59.3 s (+6.5 s reclaimed)

brain/data/kline_stream.py
```python
import asyncio
import json
from collections import deque
from typing import List

import pandas as pd
import websockets


class KlineBuffer:
    def __init__(self, symbols: List[str], window: int = 1100):
        self._buffers = {s: deque(maxlen=window) for s in symbols}
        self._locks = {s: asyncio.Lock() for s in symbols}
        self._seeded = {s: False for s in symbols}

    async def seed_from_rest(self, symbol: str, exchange) -> None:
        """One-time REST fetch. Blocks loop until complete."""
        bars = await exchange.fetch_ohlcv(symbol, "1m", limit=1100)
        async with self._locks[symbol]:
            self._buffers[symbol].extend(bars)
            self._seeded[symbol] = True

    async def stream(self) -> None:
        """Long-running WebSocket. Exponential backoff on disconnect."""
        backoff = 1
        while True:
            try:
                async with websockets.connect(self._ws_url()) as ws:
                    backoff = 1
                    async for msg in ws:
                        await self._handle_kline(json.loads(msg))
            except Exception:
                await asyncio.sleep(backoff)
                backoff = min(backoff * 2, 60)

    async def get(self, symbol: str) -> pd.DataFrame:
        """O(1) read under async lock."""
        async with self._locks[symbol]:
            return pd.DataFrame(
                list(self._buffers[symbol]),
                columns=["ts", "open", "high", "low", "close", "vol"],
            )
```
Validation Results

Walk-Forward Analysis

All figures are out-of-sample results across 5–7 years of historical 1-minute data on BTC, ETH, BNB/USDT. 12,288 configurations screened in-sample via a distributed Ray cluster across three cloud VMs. Production config selected exclusively on OOS walk-forward metrics.

How the search cluster worked — Distributed Ray Scalper case study
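The mechanics of a walk-forward split can be sketched with a plain index-based generator: a rolling in-sample window is used only to screen parameters, and each config is scored on the adjacent, never-before-seen out-of-sample window. This is an illustrative helper, not the project's actual implementation; the function name and window sizes are assumptions.

```python
def walk_forward_splits(n_bars, train_len, test_len, step=None):
    """Yield (train, test) index ranges for rolling walk-forward evaluation.

    Hypothetical helper: train windows screen configs in-sample; test
    windows immediately after each train window provide the OOS score.
    """
    step = step or test_len
    start = 0
    while start + train_len + test_len <= n_bars:
        train = range(start, start + train_len)
        test = range(start + train_len, start + train_len + test_len)
        yield train, test
        start += step

# Example: 10,000 one-minute bars, 6,000-bar train / 1,000-bar test windows.
splits = list(walk_forward_splits(10_000, 6_000, 1_000))
```

Only metrics aggregated over the test ranges feed the selection step; the train ranges never contribute to a config's score.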
| Configuration | Role | Mean OOS WR | Robust Score | Stability | OOS Signals | STRESS WR |
|---|---|---|---|---|---|---|
| lr_snapshot\|lb180 | 30% (Static) | 55.74% | 0.537 | 0.350 | 465 | 61.44% |
| no_lr_snapshot | 70% (FF Base) | 48.68% | 0.490 | 0.346 | 4,166 | 59.17% |
| lr_pivot_strict\|lb180\|atr12 | Candidate | 47.30% | 0.448 | 0.291 | 1,659 | 61.46% |

STRESS Regime Caveat

Several configs show 11–60 signals in STRESS. At those sizes, 61–72% WR is not statistically significant. STRESS specialist retained on directional hypothesis grounds. Production capital during STRESS is capped at hard USDT limit, not full Kelly.
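A quick normal-approximation (Wald) confidence interval makes the sample-size problem concrete. Taking the larger end of the STRESS sample range (60 trades at roughly a 61% win rate), the interval is wide enough to include win rates well below breakeven. This is a back-of-envelope sketch, not the project's statistics pipeline.

```python
import math

def wald_ci(win_rate: float, n: int, z: float = 1.96):
    """95% normal-approximation confidence interval for a win rate."""
    half = z * math.sqrt(win_rate * (1 - win_rate) / n)
    return win_rate - half, win_rate + half

low, high = wald_ci(0.61, 60)
# The interval spans roughly 0.49 to 0.73: a 61% observed WR on 60
# trades is consistent with no edge at all, hence the hard USDT cap.
```

At 11 trades the interval is wider still, which is why the STRESS specialist is treated as a directional hypothesis rather than a validated result.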

Selection Methodology

In-sample backtest used only for screening. Production selection based exclusively on walk-forward OOS. Robustness score (0.537) and stability score (0.350) were the primary criteria — not raw win rate.
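The selection rule can be expressed as a two-stage filter over the OOS metrics in the results table above: stability acts as an eligibility floor, and the robustness score, not raw win rate, decides the ranking. The threshold value and dict field names here are illustrative assumptions.

```python
# OOS walk-forward metrics copied from the results table above.
candidates = [
    {"name": "lr_snapshot|lb180", "robust": 0.537, "stability": 0.350, "oos_wr": 0.5574},
    {"name": "no_lr_snapshot", "robust": 0.490, "stability": 0.346, "oos_wr": 0.4868},
    {"name": "lr_pivot_strict|lb180|atr12", "robust": 0.448, "stability": 0.291, "oos_wr": 0.4730},
]

# Stage 1: stability floor (threshold assumed for illustration).
eligible = [c for c in candidates if c["stability"] >= 0.29]

# Stage 2: rank on robustness score, not raw win rate.
best = max(eligible, key=lambda c: c["robust"])
```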

War Stories

Production Bugs — The Real Engineering

Real bugs discovered in deployment. Each reveals a structural decision that created the failure mode.

get_live_positions() always returning empty

high

Root cause: WebSocket router instantiated a new MultiStrategyExecutor per call. New instance, no positions, always empty.

Fix: Pass executor as a singleton to the router at startup. API server and trading loop share one object reference.
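The bug and its fix reduce to an object-lifetime question, sketched here with minimal stubs (class and method names simplified from the source):

```python
class MultiStrategyExecutor:
    def __init__(self):
        self.positions = {}   # mutated by the trading loop


# Buggy pattern: a fresh executor per request never sees live state.
def get_live_positions_buggy():
    return MultiStrategyExecutor().positions   # always {}


# Fixed pattern: the router holds the same object the trading loop mutates.
class PositionsRouter:
    def __init__(self, executor: MultiStrategyExecutor):
        self._executor = executor   # injected once at startup

    def get_live_positions(self):
        return self._executor.positions


executor = MultiStrategyExecutor()   # created once, shared everywhere
router = PositionsRouter(executor)
executor.positions["BTC/USDT"] = {"qty": 0.01}   # trading loop writes...
live = router.get_live_positions()               # ...API reads the same dict
```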

WebSocket broadcast storm

high

Root cause: Each client connection spawned 5 background tasks including the signal loop. Two tabs = 10 concurrent loops hammering exchange rate limits.

Fix: Scope background tasks to the ConnectionManager singleton, not per-connection. Tasks start once on first connection.
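The fix is a lifetime change: background loops belong to the manager singleton, guarded by a started-once flag, so additional connections only register themselves. A minimal sketch, with names and the demo flow assumed:

```python
import asyncio


class ConnectionManager:
    """Background tasks live on the manager, not on each connection."""

    def __init__(self):
        self.connections = []
        self.background_tasks = []
        self._tasks_started = False

    async def connect(self, ws):
        self.connections.append(ws)
        if not self._tasks_started:          # start loops exactly once
            self._tasks_started = True
            self.background_tasks.append(asyncio.create_task(self._signal_loop()))

    async def _signal_loop(self):
        while True:                          # broadcast signals to all clients
            await asyncio.sleep(60)


async def demo():
    mgr = ConnectionManager()
    await mgr.connect("tab-1")
    await mgr.connect("tab-2")               # second tab spawns no extra loop
    n_tasks = len(mgr.background_tasks)
    for t in mgr.background_tasks:
        t.cancel()
    return n_tasks
```

With the per-connection pattern, two tabs meant two signal loops; here the task count stays at one regardless of connection count.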

Silent Telegram spam loop

medium

Root cause: run_health_checks() referenced in trading loop but absent from HealthMonitor. AttributeError silently swallowed; send_alert() called on every exception.

Fix: Method stub + explicit error surfacing. Health check failures now surface as a distinct error class.

brain/risk/__init__.py dangling imports

medium

Root cause: __init__.py imported risk.py and risk_manager.py, both deleted. Any import from brain.risk crashed with ModuleNotFoundError.

Fix: Rewrite __init__.py to export only what exists: max_shares from sizer.py and ATRRiskCalculator from atr_risk_calculator.py.

CryptoScalpingStrategyFF instance mutation is not thread-safe

architectural

Root cause: Regime-switching mutates self.lookback etc. per bar. One shared instance across all symbols → race condition under true parallelism.

Fix: Sequential symbol processing in aggregate_signals() as current mitigation. Requires per-symbol instances if parallelism is introduced.

Design Decisions

Trade-offs & Rejected Alternatives

systemd over Docker

Single always-on VM with no orchestration requirement. Docker adds startup overhead, volume mount complexity, and an extra failure surface for a system that only needs process supervision. Reversal condition: multi-VM distributed deployment.
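Process supervision then reduces to a small unit file. A sketch of what such a unit could look like; the paths, unit name, and entry module are hypothetical:

```ini
# /etc/systemd/system/scalper.service (illustrative; paths hypothetical)
[Unit]
Description=Multi-asset regime-switching scalper
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
WorkingDirectory=/opt/scalper
ExecStart=/opt/scalper/.venv/bin/python production_trader.py
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

`Restart=always` with a short `RestartSec` covers the crash-recovery need that would otherwise justify an orchestrator.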

Manual distribution over Ray

Ray's infrastructure overhead was prohibitive. It required strict head/worker node setups (ports 6379, 8265, 10001 open), exact version pinning across all VMs, and suffered from 5–10 min cold starts. Ray's Global Control Store (GCS) created a SPOF where a head crash killed the entire job. Ultimately, the network serialization overhead for BacktestConfig dataclasses and pandas DataFrames, plus Ray's scheduling overhead, added latency compared to direct threading when processing 276k configs × 12 regimes.

Walk-forward over in-sample for config selection

In-sample used only for screening 12,288 combos. Selection is exclusively OOS. In-sample results for high-frequency parameter searches will overfit; this is a well-documented failure mode, not a hypothesis.

Unified asyncio process over separate processes

Initially production_trader.py and uvicorn were independent processes with no shared state. The dashboard had no live trading state. asyncio.gather enabled zero-IPC state sharing and real-time signal streaming.
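The zero-IPC pattern can be shown with stdlib stubs: both coroutines are gathered in one process and read and write the same object, so the dashboard handler sees the trading loop's state with no serialization. The class and coroutine names here are illustrative stand-ins for the real executor and FastAPI handlers.

```python
import asyncio


class SharedState:
    """In-process state shared by the trading loop and the API; no IPC."""
    def __init__(self):
        self.positions = {}


async def trading_loop(state: SharedState, cycles: int = 3):
    for i in range(cycles):
        state.positions["BTC/USDT"] = {"qty": 0.01, "cycle": i}
        await asyncio.sleep(0)    # yield, as the real loop awaits I/O


async def api_reader(state: SharedState, reads: int = 3):
    # Stand-in for a FastAPI handler served from the same process.
    seen = []
    for _ in range(reads):
        seen.append(dict(state.positions))
        await asyncio.sleep(0)
    return seen


async def main():
    state = SharedState()
    _, seen = await asyncio.gather(trading_loop(state), api_reader(state))
    return state, seen
```

In production the second coroutine would be `uvicorn.Server(...).serve()`; the gather is what lets both sides hold a reference to the same live objects.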

70/30 split over equal allocation

The split reflects relative confidence: the regime-switching approach handles STRESS better where all top configs converge to outperformance, while the static config provides signal volume across BULL/BEAR/SIDEWAYS.

Retrospective

What I'd Do Differently

Extract regime detection into its own testable module

Regime detection is embedded inside the strategy class. A misclassified regime is the highest-impact failure mode and currently has zero dedicated test coverage. It should be a stateless module with an explicit contract.

Replay trade journal on startup

PerformanceStats do not survive process restarts — win rate, Sharpe, and drawdown reset to zero. Fix is straightforward: replay the journal on startup. Was not prioritized before deployment.

Strategy factory over instance mutation

Overwriting self.lookback per regime is a design smell. A factory instantiating four independent, immutable specialists — one per regime — eliminates the thread-safety concern and makes each specialist independently testable.
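The factory alternative can be sketched with frozen dataclasses: one immutable config object per regime, looked up instead of written onto a shared instance. The parameter values below are placeholders, not the walk-forward results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecialistConfig:
    lookback: int
    atr_period: int
    vol_period: int
    mode: str


# Hypothetical values; the real configs come from the walk-forward search.
REGIME_SPECIALISTS = {
    "BULL": SpecialistConfig(lookback=180, atr_period=12, vol_period=20, mode="trend"),
    "SIDEWAYS": SpecialistConfig(lookback=120, atr_period=14, vol_period=20, mode="revert"),
}


def specialist_for(regime: str) -> SpecialistConfig:
    """Return the immutable specialist for a regime; no shared mutable state."""
    return REGIME_SPECIALISTS[regime]
```

Because the configs are frozen, concurrent symbol processing cannot race on parameter writes, and each specialist can be unit-tested in isolation.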

Key Technical Insight

Distributing a brute-force parameter search across heterogeneous cloud VMs is not primarily a compute problem — it is a data consistency and result aggregation problem. Ensuring identical OHLCV source across all VMs, safe merging of partial results, and zero silent skips required more engineering discipline than the search itself.
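One concrete tool for the data-consistency half of that problem is a deterministic fingerprint of each OHLCV window, compared across worker VMs before partial results are merged. A minimal sketch; the function name and candle field order are assumptions.

```python
import hashlib
import json


def candles_fingerprint(candles) -> str:
    """Deterministic digest of an OHLCV window.

    Workers exchange digests before merging partial search results;
    a mismatch means the VMs backtested different data. Sketch only;
    assumes [ts, open, high, low, close, vol] rows.
    """
    payload = json.dumps(candles, separators=(",", ":")).encode()
    return hashlib.sha256(payload).hexdigest()


window = [[1700000000000, 100.0, 101.0, 99.5, 100.5, 12.3]]
digest = candles_fingerprint(window)
```

Any single-tick divergence between VMs changes the digest, turning a silent data drift into a loud merge-time failure.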

Infrastructure Efficiency

The full stack — FastAPI, Next.js, trading loop, SQLite, Telegram alerts — runs on Oracle Always Free within budget. RAM utilization is well below capacity because the in-memory KlineBuffer deque eliminates redundant disk I/O every cycle.

Status

Final testnet validation phase. Walk-forward analysis complete. Live testnet data is being collected to establish a statistically meaningful baseline before full capital deployment.
