Multi-Asset Regime-Switching Scalper
Dynamic Parameter Selection via Empirical Walk-Forward Analysis
A solo-built, 24/7 production trading system executing mean-reversion scalping across six instruments. The core engineering challenge: no single parameter set is robust across all market regimes. The solution: empirically derive the best configuration per regime, detect regime at runtime, and switch.
The Regime Dependency Problem
The actual problem is not "build a trading bot." It is: a mean-reversion scalper designed for sideways markets frequently generates false signals in trending conditions, and vice versa. The naive fix — find one parameter set that works everywhere — is empirically unsound. A single static configuration either over-tunes to the dominant historical regime or averages to mediocrity across all of them.
The design problem therefore became: empirically identify the best-performing configuration per regime, implement a runtime mechanism to detect the current regime and apply the corresponding config, and do this within a latency budget that does not compromise signal integrity.
Hard Constraints
Key insight: Stability scores across the full config space range from 0.043 to 0.350. The system is treated as statistically promising, not statistically proven, until live execution data accumulates.
Stack & Infrastructure
| Layer | Technology |
|---|---|
| Strategy runtime | Python 3.12, pandas, NumPy, SciPy (OLS) |
| API server | FastAPI + uvicorn (async) |
| Market data | Direct Binance API WebSocket (streaming) + REST — CCXT eliminated |
| Frontend | Next.js — WebSocket real-time signal streaming |
| Database | SQLite (trade journal, candle cache) |
| Infrastructure | Oracle Cloud Japan — 4 vCPU, 24 GB RAM, Always Free |
| Deployment | systemd service, 24/7 |
| Alerting | Telegram Bot API |
| Parameter research | Distributed brute-force — Azure + DigitalOcean + Oracle VMs |
System Overview
A unified asyncio.gather entry point co-hosts the trading loop and FastAPI in a single process — enabling zero-IPC state sharing between the executor and the live dashboard.
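The co-hosting pattern can be sketched minimally. Everything named below (SharedState, trading_loop, api_server) is an illustrative stand-in, not one of the project's actual classes; the point is that asyncio.gather runs both long-lived coroutines in one process, so state is shared by object reference with no IPC:

```python
import asyncio


class SharedState:
    """One in-process object; both loops touch the same reference (zero IPC)."""

    def __init__(self) -> None:
        self.last_signal = None


async def trading_loop(state: SharedState, cycles: int = 3) -> None:
    for i in range(cycles):
        state.last_signal = f"signal-{i}"  # executor writes directly
        await asyncio.sleep(0)             # yield control to the event loop


async def api_server(state: SharedState, cycles: int = 3) -> list:
    # In production this coroutine would be uvicorn.Server(config).serve();
    # a reader loop stands in for the FastAPI side here.
    seen = []
    for _ in range(cycles):
        seen.append(state.last_signal)     # dashboard reads the same object
        await asyncio.sleep(0)
    return seen


async def main() -> str:
    state = SharedState()
    # One process, two long-running coroutines, shared state by reference.
    await asyncio.gather(trading_loop(state), api_server(state))
    return state.last_signal
```

The same shape holds when the second coroutine is a real uvicorn server: the router closes over the executor instance instead of instantiating its own.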
Container Diagram
All containers share a single process. MultiStrategyExecutor is injected as a singleton into the API router — the fix for the always-empty positions production bug.
Regime Detection & Config Switching
CryptoScalpingStrategyFF wraps the base strategy. Before each signal call it runs _detect_regime_realtime() over the trailing 1000 bars and mutates its own instance attributes to the pre-validated specialist config for that regime.
```python
def _detect_regime_realtime(self, df: pd.DataFrame) -> str:
    close = df["close"]
    ema200 = close.ewm(span=200).mean()
    ema50 = close.ewm(span=50).mean()

    # STRESS: realized vol > 90th percentile
    log_ret = np.log(close / close.shift(1)).dropna()
    roll_vol = log_ret.rolling(20).std()
    if roll_vol.iloc[-1] > roll_vol.quantile(0.90):
        return "STRESS"

    price = close.iloc[-1]
    if price > ema200.iloc[-1] and ema50.iloc[-1] > ema200.iloc[-1]:
        return "BULL"
    if price < ema200.iloc[-1] and ema50.iloc[-1] < ema200.iloc[-1]:
        return "BEAR"
    return "SIDEWAYS"


def generate_signal(self, df: pd.DataFrame):
    regime = self._detect_regime_realtime(df.tail(1000))
    cfg = REGIME_SPECIALIST_CONFIGS[regime]

    # Mutate to specialist parameters for this regime
    self.lookback = cfg["lookback"]
    self.atr_period = cfg["atr_period"]
    self.vol_period = cfg["vol_period"]
    self.mode = cfg["mode"]

    return super().generate_signal(df)
```
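REGIME_SPECIALIST_CONFIGS is a plain mapping from regime label to parameter set. The shape below matches the four attributes being mutated; the numeric values are placeholders for illustration, not the walk-forward-selected production parameters:

```python
# Shape of the per-regime config table. Values are illustrative placeholders:
# the real numbers come from the OOS walk-forward search, not from this sketch.
REGIME_SPECIALIST_CONFIGS = {
    "BULL":     {"lookback": 120, "atr_period": 14, "vol_period": 20, "mode": "snapshot"},
    "BEAR":     {"lookback": 120, "atr_period": 14, "vol_period": 20, "mode": "snapshot"},
    "SIDEWAYS": {"lookback": 180, "atr_period": 12, "vol_period": 30, "mode": "pivot"},
    "STRESS":   {"lookback": 60,  "atr_period": 10, "vol_period": 10, "mode": "strict"},
}
```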
Regime Transition Logic
Regime is re-evaluated on every bar. Transitions are stateless — no hysteresis or debounce. STRESS classification runs first as the highest-priority check.
Capital Allocation & Conflict Resolution
Two concurrent strategies run per cycle. If both signal opposing directions on the same instrument, FF takes precedence — preventing simultaneous opposing positions.
CryptoScalpingStrategyFF
Regime-switching wrapper. Applies specialist parameters per regime. The active bet — higher allocation reflects confidence in the regime-aware approach, especially in STRESS.
CryptoScalpingStrategyLR
Static lr_snapshot|lb180 config — the robust baseline. Higher signal volume (465 OOS signals). Hedges against regime misclassification.
```python
def aggregate_signals(self, symbol: str, df: pd.DataFrame):
    ff_signal = self.elite_ff.generate_signal(df)          # 70%
    static_signal = self.static_pivot.generate_signal(df)  # 30%

    if ff_signal and static_signal:
        if ff_signal.direction != static_signal.direction:
            self._log_conflict(symbol, ff_signal, static_signal)
            return self._build_order(ff_signal, allocation=0.70)
        return self._blend_signals(ff_signal, static_signal)

    if ff_signal:
        return self._build_order(ff_signal, allocation=0.70)
    if static_signal:
        return self._build_order(static_signal, allocation=0.30)
    return None
```
Per-Cycle Execution Sequence
The WebSocket kline buffer eliminates per-cycle REST pagination. The data stage drops from a network-bound fetch to an O(1) in-memory read.
WebSocket Kline Buffer: REST → Streaming
The dominant latency bottleneck was the per-cycle REST pagination fetch (6,224 ms). The strategy only needs the initial history plus incremental 1-minute updates — a WebSocket-backed buffer was introduced to fix this.
| Metric | Before — REST Polling | After — WebSocket Buffer | Improvement |
|---|---|---|---|
| Data fetch latency | 6,224 ms | 0.4 ms | 15,560× faster |
| Full cycle time | — | 0.7 s | ≈10× faster |
| Sleep budget / cycle | — | 59.3 s | +6.5 s |
```python
import asyncio
import json
from collections import deque
from typing import List

import pandas as pd
import websockets


class KlineBuffer:
    def __init__(self, symbols: List[str], window: int = 1100):
        self._buffers = {s: deque(maxlen=window) for s in symbols}
        self._locks = {s: asyncio.Lock() for s in symbols}
        self._seeded = {s: False for s in symbols}

    async def seed_from_rest(self, symbol: str, exchange) -> None:
        """One-time REST fetch. Blocks loop until complete."""
        bars = await exchange.fetch_ohlcv(symbol, "1m", limit=1100)
        async with self._locks[symbol]:
            self._buffers[symbol].extend(bars)
            self._seeded[symbol] = True

    async def stream(self) -> None:
        """Long-running WebSocket. Exponential backoff on disconnect."""
        backoff = 1
        while True:
            try:
                async with websockets.connect(self._ws_url()) as ws:
                    backoff = 1
                    async for msg in ws:
                        await self._handle_kline(json.loads(msg))
            except Exception:
                await asyncio.sleep(backoff)
                backoff = min(backoff * 2, 60)

    async def get(self, symbol: str) -> pd.DataFrame:
        """O(1) read under async lock."""
        async with self._locks[symbol]:
            return pd.DataFrame(list(self._buffers[symbol]),
                                columns=["ts", "open", "high", "low", "close", "vol"])
```
Walk-Forward Analysis
All figures are out-of-sample results across 5–7 years of historical 1-minute data on BTC/USDT, ETH/USDT, and BNB/USDT. 12,288 configurations were screened in-sample, with the brute-force search distributed manually across three cloud VMs (Azure, DigitalOcean, Oracle). The production config was selected exclusively on OOS walk-forward metrics.
How the search cluster worked — Distributed Ray Scalper case study

| Configuration | Role | Mean OOS WR | Robust Score | Stability | OOS Signals | STRESS WR |
|---|---|---|---|---|---|---|
| lr_snapshot\|lb180 | 30% — Static | 55.74% | 0.537 | 0.350 | 465 | 61.44% |
| no_lr_snapshot | 70% — FF Base | 48.68% | 0.490 | 0.346 | 4,166 | 59.17% |
| lr_pivot_strict\|lb180\|atr12 | Candidate | 47.30% | 0.448 | 0.291 | 1,659 | 61.46% |
STRESS Regime Caveat
Several configs show only 11–60 signals in STRESS. At those sample sizes, a 61–72% WR is not statistically significant. The STRESS specialist is retained on directional-hypothesis grounds, and production capital during STRESS is capped at a hard USDT limit rather than sized by full Kelly.
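The small-sample caveat is easy to quantify. A 95% Wilson score interval around even the best case, 60 STRESS signals at roughly a 61% win rate, still straddles the coin-flip line:

```python
import math


def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple:
    """95% Wilson score interval for a binomial proportion."""
    p = wins / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


lo, hi = wilson_interval(wins=37, n=60)  # ~61.7% observed WR on 60 signals
# The lower bound falls below 0.50: the observed edge is not
# distinguishable from chance at this sample size.
```

At 11 signals the interval is wider still, which is why the 61–72% STRESS win rates are treated as hypothesis, not evidence.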
Selection Methodology
In-sample backtest used only for screening. Production selection based exclusively on walk-forward OOS. Robustness score (0.537) and stability score (0.350) were the primary criteria — not raw win rate.
Production Bugs — The Real Engineering
Real bugs discovered in deployment. Each reveals a structural decision that created the failure mode.
get_live_positions() always returning empty
Root cause: WebSocket router instantiated a new MultiStrategyExecutor per call. New instance, no positions, always empty.
Fix: Pass executor as a singleton to the router at startup. API server and trading loop share one object reference.
WebSocket broadcast storm
Root cause: Each client connection spawned 5 background tasks including the signal loop. Two tabs = 10 concurrent loops hammering exchange rate limits.
Fix: Scope background tasks to the ConnectionManager singleton, not per-connection. Tasks start once on first connection.
Silent Telegram spam loop
Root cause: run_health_checks() referenced in trading loop but absent from HealthMonitor. AttributeError silently swallowed; send_alert() called on every exception.
Fix: Method stub + explicit error surfacing. Health check failures now surface as a distinct error class.
brain/risk/__init__.py dangling imports
Root cause: __init__.py imported risk.py and risk_manager.py, both deleted. Any import from brain.risk crashed with ModuleNotFoundError.
Fix: Rewrite __init__.py to export only what exists: max_shares from sizer.py and ATRRiskCalculator from atr_risk_calculator.py.
CryptoScalpingStrategyFF instance mutation is not thread-safe
Root cause: Regime-switching mutates self.lookback etc. per bar. One shared instance across all symbols → race condition under true parallelism.
Fix: Sequential symbol processing in aggregate_signals() as current mitigation. Requires per-symbol instances if parallelism is introduced.
Trade-offs & Rejected Alternatives
✓ systemd over Docker
Single always-on VM with no orchestration requirement. Docker adds startup overhead, volume mount complexity, and an extra failure surface for a system that only needs process supervision. Reversal condition: multi-VM distributed deployment.
✓ Manual distribution over Ray
Ray's infrastructure overhead was prohibitive. It required strict head/worker node setups (ports 6379, 8265, 10001 open), exact version pinning across all VMs, and suffered from 5–10 min cold starts. Ray's Global Control Store (GCS) created a SPOF where a head crash killed the entire job. Ultimately, the network serialization overhead for BacktestConfig dataclasses and pandas DataFrames, plus Ray's scheduling overhead, added latency compared to direct threading when processing 276k configs × 12 regimes.
✓ Walk-forward over in-sample for config selection
In-sample backtesting was used only to screen the 12,288 combos; selection is exclusively OOS. In-sample results for high-frequency parameter searches reliably overfit — this is a well-documented failure mode, not a hypothesis.
✓ Unified asyncio process over separate processes
Initially production_trader.py and uvicorn were independent processes with no shared state. The dashboard had no live trading state. asyncio.gather enabled zero-IPC state sharing and real-time signal streaming.
✓ 70/30 split over equal allocation
The split reflects relative confidence: the regime-switching approach handles STRESS better where all top configs converge to outperformance, while the static config provides signal volume across BULL/BEAR/SIDEWAYS.
What I'd Do Differently
Extract regime detection into its own testable module
Regime detection is embedded inside the strategy class. A misclassified regime is the highest-impact failure mode and currently has zero dedicated test coverage. It should be a stateless module with an explicit contract.
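A sketch of that extraction: the detection logic becomes a pure function with an explicit contract, which makes the highest-impact failure mode unit-testable against synthetic series (the trend series here is illustrative test data, not market data):

```python
import numpy as np
import pandas as pd


def detect_regime(close: pd.Series) -> str:
    """Stateless contract: a close-price series in, one of four labels out."""
    ema200 = close.ewm(span=200).mean()
    ema50 = close.ewm(span=50).mean()

    log_ret = np.log(close / close.shift(1)).dropna()
    roll_vol = log_ret.rolling(20).std()
    if roll_vol.iloc[-1] > roll_vol.quantile(0.90):
        return "STRESS"

    price = close.iloc[-1]
    if price > ema200.iloc[-1] and ema50.iloc[-1] > ema200.iloc[-1]:
        return "BULL"
    if price < ema200.iloc[-1] and ema50.iloc[-1] < ema200.iloc[-1]:
        return "BEAR"
    return "SIDEWAYS"


# Synthetic fixture: a steadily rising series with calm, shrinking
# log-returns should classify as BULL.
trend = pd.Series(100.0 + np.arange(1000, dtype=float))
```

With the function pulled out of the strategy class, each regime gets its own fixture and assertion instead of zero dedicated coverage.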
Replay trade journal on startup
PerformanceStats do not survive process restarts — win rate, Sharpe, and drawdown reset to zero. The fix is straightforward: replay the journal on startup. It was not prioritized before deployment.
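A minimal replay sketch, assuming a hypothetical journal schema with pnl and closed_at columns (the real journal's tables may differ):

```python
import sqlite3


def replay_stats(db_path: str) -> dict:
    """Rebuild summary stats from the trade journal instead of starting at zero."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT pnl FROM trades ORDER BY closed_at").fetchall()
    conn.close()

    pnls = [r[0] for r in rows]
    wins = sum(1 for p in pnls if p > 0)
    return {
        "trades": len(pnls),
        "win_rate": wins / len(pnls) if pnls else 0.0,
        "total_pnl": sum(pnls),
    }
```

Called once at startup, this seeds the in-memory stats object before the trading loop begins.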
Strategy factory over instance mutation
Overwriting self.lookback per regime is a design smell. A factory instantiating four independent, immutable specialists — one per regime — eliminates the thread-safety concern and makes each specialist independently testable.
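A sketch of that factory with hypothetical parameter values: frozen dataclasses make each specialist immutable, so concurrent symbol loops cannot race on shared attributes:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecialistParams:
    lookback: int
    atr_period: int
    vol_period: int
    mode: str


# Hypothetical values — production parameters come from the OOS walk-forward search.
SPECIALISTS = {
    "BULL":     SpecialistParams(120, 14, 20, "snapshot"),
    "BEAR":     SpecialistParams(120, 14, 20, "snapshot"),
    "SIDEWAYS": SpecialistParams(180, 12, 30, "pivot"),
    "STRESS":   SpecialistParams(60, 10, 10, "strict"),
}


def make_specialist(regime: str) -> SpecialistParams:
    """Factory lookup: four independent, immutable specialists, one per regime."""
    return SPECIALISTS[regime]
```

Any attempt to mutate a specialist raises at the write site, which turns the silent race condition into a loud failure.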
Key Technical Insight
Distributing a brute-force parameter search across heterogeneous cloud VMs is not primarily a compute problem — it is a data consistency and result aggregation problem. Ensuring identical OHLCV source across all VMs, safe merging of partial results, and zero silent skips required more engineering discipline than the search itself.
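The aggregation side reduces to a small invariant: merge the per-VM partial result maps and fail loudly on any config ID that no VM reported. merge_partials and its names are illustrative, not the project's actual helpers:

```python
def merge_partials(expected_ids: set, partials: list) -> dict:
    """Merge per-VM result dicts {config_id: metrics}; refuse silent skips."""
    merged = {}
    for part in partials:
        for cfg_id, metrics in part.items():
            if cfg_id in merged and merged[cfg_id] != metrics:
                # Same config scored differently on two VMs: the OHLCV
                # sources diverged somewhere. Hard failure, not a merge.
                raise ValueError(f"conflicting results for {cfg_id}")
            merged[cfg_id] = metrics

    missing = expected_ids - merged.keys()
    if missing:
        # Zero silent skips: a config nobody ran is a hard failure too.
        raise ValueError(f"{len(missing)} configs never reported")
    return merged
```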
Infrastructure Efficiency
The full stack — FastAPI, Next.js, trading loop, SQLite, Telegram alerts — runs on Oracle Always Free within budget. RAM utilization is well below capacity because the in-memory KlineBuffer deque eliminates redundant disk I/O every cycle.
Status
Final testnet validation phase. Walk-forward analysis complete. Live testnet data is being collected to establish a statistically meaningful baseline before full capital deployment.