In the first three weeks of March 2020, the 2-year U.S. Treasury yield fell by more than 80 basis points. It did so in fits: an emergency 50 bp Fed cut on March 3, a Sunday-evening 100 bp cut on March 15, violent intraday reversals as primary dealers shed inventory, and a sudden stabilization once the Fed announced unlimited QE on March 23. Our ARX-GARCH baseline, trained on five years of pre-crisis data, missed nearly all of it. The cause was not bad inputs; it was the model's architecture.

This piece examines why. The goal is not to relitigate the crisis but to show that the failure is reproducible: it will happen again, in the next stress event, because it reflects a set of mathematical assumptions that GARCH cannot relax. Understanding which part of the curve is most exposed — and why — is the first step toward building a risk system that doesn't break when it's needed most.

The belly and why it's different

The U.S. Treasury curve is conventionally divided into three segments: the front end (overnight to 1Y), the belly (2Y–5Y, sometimes extended to 7Y), and the long end (10Y–30Y). Each segment is priced by a different primary driver.

The front end is anchored almost entirely by the Federal Reserve's policy rate. Its volatility is low in normal times and episodic in crises — but the episodes are well-telegraphed: FOMC meeting dates are public, and the first move of a cutting cycle is typically preceded by weeks of forward guidance. Even emergency cuts occur at known pressure points.

The long end is driven by long-run inflation expectations and the term premium — the compensation investors demand for holding duration. These forces are slow-moving and relatively diffusive. The 30-year typically moves 20–40 bp in a year; its distribution is closer to Gaussian than any other maturity on the curve.

The belly is neither. Two- to five-year yields encode the market's current probability distribution over the entire path of the Federal Funds rate over the next two to five years. That path distribution is inherently discontinuous: the Fed can cut 25 bp, cut 50 bp, hold, or, as in March 2020, cut 150 bp in twelve days through two unscheduled emergency actions. When the market's probability mass over that distribution shifts suddenly, belly yields don't drift; they jump.

Key Insight

The belly's volatility is episodic and clustered in ways that differ qualitatively from both the front end (policy-anchored, sparse shocks) and the long end (diffusive, slowly evolving). This makes it the segment most hostile to GARCH's architectural assumptions.

Three structural vulnerabilities in GARCH

The ARX-GARCH model used as a baseline in our research takes the following form for each yield maturity independently:

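$$y_t = \mu + \sum_{i=1}^{p} \alpha_i \, y_{t-i} + \varepsilon_t$$
$$\varepsilon_t = \sigma_t z_t, \qquad z_t \sim N(0, 1)$$
$$\sigma_t^2 = \omega + \alpha \, \varepsilon_{t-1}^2 + \beta \, \sigma_{t-1}^2$$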

Three structural choices in this specification combine to produce predictable failure in the belly.

1. The Gaussian innovation assumption

GARCH-Normal hardcodes zt ~ N(0,1). The model can allow variance to vary — that's the point of the GARCH term — but it cannot change the shape of the innovation distribution. Under a Gaussian, the probability of a standardized move beyond ±3σ is approximately 0.27%. The empirical frequency of 3σ daily moves in 2Y Treasury yields over a typical five-year estimation window is meaningfully higher, and in stressed regimes it's higher still.

This isn't news, and GARCH-t extensions address it partially. But a Student-t distribution with fixed degrees of freedom is still a static choice: the tail weight doesn't change with the current market state. During the belly's episodic vol clusters, tail weight is itself state-dependent: heavier in stressed regimes, lighter in calm ones. A distribution with fixed shape parameters, however fat-tailed, cannot capture this.
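The 0.27% figure for moves beyond ±3σ can be checked directly, and the same few lines give a way to compare it against an empirical exceedance rate. This is a minimal sketch using only the Python standard library; the 252-trading-day year and the function names are assumptions made for illustration, not part of the original analysis.

```python
import math

def gaussian_two_sided_tail(k):
    """P(|z| > k) for z ~ N(0, 1), via the complementary error function."""
    return math.erfc(k / math.sqrt(2.0))

def exceedance_rate(standardized_moves, k=3.0):
    """Empirical share of standardized moves whose absolute size exceeds k."""
    moves = list(standardized_moves)
    return sum(abs(z) > k for z in moves) / len(moves)

p3 = gaussian_two_sided_tail(3.0)

# Assumption: 252 trading days per year, so a five-year window has 1260 days.
expected_hits = p3 * 5 * 252

print(f"P(|z| > 3) under N(0, 1): {p3:.4%}")
print(f"Expected 3-sigma days in a five-year window: {expected_hits:.1f}")
```

Under the Gaussian assumption, a five-year daily window should contain only about three or four 3σ days. Passing a series of model-standardized yield changes to `exceedance_rate` shows how far the realized frequency sits above that benchmark.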

2. Discrete-time adaptive latency

GARCH updates once per period; in our specification, once per day. This means the variance σt² on day t is a backward-looking, geometrically weighted sum of past squared errors (plus a constant). The recurrence:

$$\sigma_t^2 = \omega + \alpha \, \varepsilon_{t-1}^2 + \beta \, \sigma_{t-1}^2$$

requires a sequence of large ε² values to drive σt upward significantly. This is not a calibration problem or a data quality problem — it is a mathematical property of the recurrence. Call it adaptive latency: the structural delay between when a regime shift occurs and when the model registers it.
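Adaptive latency can be seen by iterating the recurrence by hand. The sketch below feeds a run of identical 30 bp shocks into a GARCH(1,1) variance update. The parameter values (α = 0.08, β = 0.90, a calm σ of 3.5 bp/day) and the shock size are hypothetical, chosen only to be in a plausible range; they are not the fitted values from our baseline.

```python
import math

def garch_sigma_path(shocks, omega, alpha, beta, sigma0):
    """Iterate sigma_t^2 = omega + alpha * eps_{t-1}^2 + beta * sigma_{t-1}^2.

    path[k] is the volatility estimate after k shocks have been observed,
    i.e. the estimate in force on the day the (k+1)-th shock arrives.
    """
    var = sigma0 ** 2
    path = [sigma0]
    for eps in shocks:
        var = omega + alpha * eps ** 2 + beta * var
        path.append(math.sqrt(var))
    return path

# Hypothetical parameters (assumptions for illustration only).
alpha, beta = 0.08, 0.90
calm_sigma = 3.5                                   # bp/day
omega = calm_sigma ** 2 * (1.0 - alpha - beta)     # makes calm_sigma the long-run level

shocks = [30.0] * 7                                # seven consecutive 30 bp surprises
path = garch_sigma_path(shocks, omega, alpha, beta, calm_sigma)

for day, sigma in enumerate(path):
    print(f"after {day} shocks: sigma = {sigma:5.2f} bp")
```

With these assumed parameters, the first 30 bp move is met with the calm-regime σ of 3.5 bp, and the estimate is still below 20 bp after seven consecutive 30 bp shocks. The exact numbers depend on α and β, but the shape of the response follows from the recurrence itself.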

In March 2020, the 2Y yield moved 35 bp on March 3 alone. A GARCH model calibrated through February 2020, with σt near its 5-year average of roughly 3–4 bp/day, would have required roughly 5–7 trading days of large ε² shocks before its variance estimate was substantially revised. By then, the largest moves had already occurred.

Date (approx.) | Event | 2Y move (bp) | GARCH σ state
--- | --- | --- | ---
Mar 3, 2020 | Emergency −50 bp Fed cut | −35 | Pre-crisis (stale)
Mar 9–13 | Liquidity squeeze; dealer inventory shed | −25 to +18 intraday | Slowly revising
Mar 15, 2020 | Emergency −100 bp Fed cut (Sunday) | −30 | Still catching up
Mar 18–23 | Fed unlimited QE announcement | +25 (reversal) | σ peaking at reversal

Approximate magnitudes based on representative FRED U.S. Treasury data. The GARCH σ state reflects the lag implied by the recurrence relation under pre-crisis calibration.

The last row is particularly illustrative: GARCH's variance estimate peaks precisely when yields stabilize — the model's alarm goes off after the fire is out. For a risk manager trying to set position limits or adjust hedges in real time, this is not useful.

3. Parameter instability in the mean equation

The ARX mean equation assumes fixed autoregressive coefficients α1, ..., αp. In normal regimes, yield changes exhibit mild positive autocorrelation over short horizons — momentum effects from order flow and positioning. During the March 2020 dislocation, that structure reversed: intraday, yields would spike dramatically on news, then partially reverse within hours as institutional buyers stepped in. The autocorrelation sign had flipped.

A fixed-coefficient ARX model has no mechanism to accommodate this. It will extrapolate the pre-crisis autocorrelation structure into the crisis, producing forecast trajectories that trend in the wrong direction — compounding the variance miscalibration from point 2.
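The cost of a stale autoregressive coefficient can be illustrated with a toy AR(1). The sketch below simulates a "calm" series with positive persistence and a "stress" series with negative persistence, fits the coefficient on the calm data, and then scores it on the stress data. The coefficients (+0.3 and −0.3), sample sizes, and seed are hypothetical; the simulated series merely stand in for yield changes and are not Treasury data.

```python
import random

def simulate_ar1(phi, n, rng, noise_sd=1.0):
    """Simulate y_t = phi * y_{t-1} + noise, starting from zero."""
    y, out = 0.0, []
    for _ in range(n):
        y = phi * y + rng.gauss(0.0, noise_sd)
        out.append(y)
    return out

def fit_ar1(series):
    """Least-squares slope of y_t on y_{t-1} (no intercept)."""
    num = sum(a * b for a, b in zip(series[1:], series[:-1]))
    den = sum(b * b for b in series[:-1])
    return num / den

def one_step_mse(series, phi):
    """Mean squared one-step-ahead error when forecasting with phi * y_{t-1}."""
    errs = [(a - phi * b) ** 2 for a, b in zip(series[1:], series[:-1])]
    return sum(errs) / len(errs)

rng = random.Random(0)
calm = simulate_ar1(0.3, 2000, rng)       # hypothetical calm regime
stress = simulate_ar1(-0.3, 2000, rng)    # hypothetical stress regime, sign flipped

phi_calm = fit_ar1(calm)
phi_stress = fit_ar1(stress)

mse_stale = one_step_mse(stress, phi_calm)    # pre-crisis coefficient, crisis data
mse_naive = one_step_mse(stress, 0.0)         # no persistence assumed at all
mse_refit = one_step_mse(stress, phi_stress)  # coefficient matched to the regime

print(f"phi_calm={phi_calm:.2f}, phi_stress={phi_stress:.2f}")
print(f"stress MSE: stale={mse_stale:.3f}, naive={mse_naive:.3f}, refit={mse_refit:.3f}")
```

When the sign of the autocorrelation flips, the stale coefficient does worse than assuming no persistence at all, because each forecast leans in the direction the series is about to reverse.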

Why the belly is maximally exposed

Each of these three vulnerabilities exists at every maturity. But they interact most severely in the belly, for a straightforward reason: the belly's volatility regime is the Fed's policy uncertainty, and policy uncertainty has all three properties that GARCH handles worst.

It's fat-tailed. Emergency rate cuts are binary events — they either happen or they don't, on a timeline that isn't well-described by a Gaussian. The 2020 crisis produced two unscheduled cuts within 12 days. A GARCH-Normal trained on any pre-crisis window assigns near-zero probability to a sequence like that.

It's episodic, not diffusive. In calm regimes, 2Y yield daily moves average 3–5 bp. In the height of a Fed-driven dislocation, they average 20–40 bp. This is not a smooth transition — it's a regime switch. The GARCH recurrence produces a gradual upward drift in σt, not the step change the market actually experiences.

Its autocorrelation structure is regime-dependent. The mean dynamics that govern the belly in a calm environment are structurally different from the dynamics during a Fed-driven dislocation. Fixed ARX coefficients produce misspecified point forecasts in exactly the periods where the variance is also misspecified — errors compound.

Structural Implication

The 2Y–5Y maturity range is where institutional interest rate risk concentrates: it's the segment driving most of the duration in intermediate-term bond portfolios, carry trades, and rate swap books. GARCH produces its worst calibration at exactly the segment where the stakes are highest.

The March 2020 result

Our Neural SDE model was trained exclusively on FRED U.S. Treasury yield data from January 2015 through January 2020 — five years of data, stopping two months before the crisis began. No crisis data was included in training. No retraining was performed during the evaluation period. The model was simply deployed and its out-of-sample forecasts were recorded alongside the ARX-GARCH baseline.

The Neural SDE evolves according to:

$$dY(t) = f_\theta(Y(t), t)\,dt + g_\phi(Y(t), t)\,dW(t)$$

The critical term is gφ(Y(t), t) — the diffusion network. Unlike GARCH's σt, which is a scalar recurrence, gφ is a neural network that takes the current state of the yield curve as input and returns a state-dependent volatility tensor. When the 2Y yield starts moving sharply, gφ detects the change in state and widens uncertainty immediately — not over 5–7 trading days, but on the timescale of the differential equation's integration step.

This is the key architectural difference: GARCH's variance is conditioned on past squared errors; the Neural SDE's diffusion is conditioned on current state. In a crisis, past errors are a lagged indicator; current state is not.
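The difference between history-conditioned and state-conditioned uncertainty can be sketched with a one-dimensional Euler-Maruyama simulation. The drift and diffusion below are small hand-written functions standing in for the trained networks fθ and gφ; their functional forms, the 1.5% anchor level, and every constant are hypothetical, and the real model operates on the full yield curve rather than a single scalar.

```python
import math
import random

def softplus(x):
    """Smooth, always-positive activation: log(1 + exp(x))."""
    return math.log1p(math.exp(x))

def toy_drift(y, t):
    """Hypothetical stand-in for f_theta: weak pull toward a 1.5% yield."""
    return 0.5 * (1.5 - y)

def toy_diffusion(y, t):
    """Hypothetical stand-in for g_phi: volatility rises as y leaves 1.5%."""
    return 0.05 + 0.4 * softplus(4.0 * abs(y - 1.5) - 2.0)

def euler_maruyama(y0, drift, diffusion, dt, n_steps, rng):
    """Simulate dY = drift(Y, t) dt + diffusion(Y, t) dW on a fixed time grid."""
    y, t, path = y0, 0.0, [y0]
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(dt))   # Brownian increment over one step
        y = y + drift(y, t) * dt + diffusion(y, t) * dw
        t += dt
        path.append(y)
    return path

rng = random.Random(42)
path = euler_maruyama(1.5, toy_drift, toy_diffusion, dt=1 / 252, n_steps=20, rng=rng)

# The diffusion is a function of where the state is now, not of past errors.
sigma_calm = toy_diffusion(1.5, 0.0)
sigma_dislocated = toy_diffusion(0.7, 0.0)
print(f"sigma at 1.5%: {sigma_calm:.3f}, sigma at 0.7%: {sigma_dislocated:.3f}")
```

In this toy setup the volatility at a dislocated state is several times the calm value, and it takes effect on the very next integration step because the diffusion is evaluated at the current state rather than accumulated through a recurrence.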

ARX-GARCH MSE (March 2020 stress period): 1.943
Neural SDE MSE (same period, zero retraining): 0.692
Change: −64.3%

Mean squared error across all 11 Treasury maturities (1M–30Y) over the March 2020 out-of-sample evaluation window. Neither model was retrained during the crisis period.

The 64.3% reduction is an aggregate across all 11 maturities. The improvement is not uniform: the largest gains come from exactly the maturities predicted by the analysis above — the 2Y, 3Y, and 5Y. The long end (20Y, 30Y) improves meaningfully but less dramatically. The very front end (1M, 3M) also improves less, because Fed policy anchor effects limit the range of dynamics even during a crisis. The belly's improvement is disproportionate, and its mechanism is clear.

Implications for risk management

The practical consequences of this analysis are not academic. Institutional risk management depends on GARCH-calibrated quantities at almost every layer:

Value-at-Risk at 99%. The 99th percentile of a Gaussian is 2.33σ. In a crisis regime where the effective distribution has heavier tails, 2.33σ captures well under 99% of realized outcomes. For a portfolio with significant belly duration, VaR is systematically understated precisely when VaR is being used to set capital buffers before a crisis. The error is not random — it's directional.

Expected Shortfall. ES averages losses in the tail beyond the VaR threshold, so Gaussian underestimation of tail probability compounds: not only is the threshold too low, but the conditional expectation of losses beyond it is also underestimated. For a belly-heavy book, realized tail losses during stress periods can run to multiples of the GARCH-implied ES.

Covariance-based hedging. Duration hedging ratios are typically derived from covariance estimates. In a crisis regime, a covariance matrix built on GARCH variances understates variance and probably misstates correlation as well: the 2Y/5Y correlation during the March 2020 dislocation did not behave like its 5-year rolling average. Hedges constructed on stale covariances will systematically fall short of the exposure they're designed to cover.
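The three quantities above can be put into rough numbers with a few lines of standard-library Python. This is a sketch under simplifying assumptions: losses are treated as Gaussian with a volatility that the model may understate by a factor `vol_ratio`, and the hedge example uses hypothetical calm and stress volatilities and correlations for a 2Y position hedged with a 5Y instrument. None of the inputs are estimates from our data.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def gaussian_var_multiplier(level):
    """One-sided VaR at the given confidence level, in units of sigma."""
    return STD_NORMAL.inv_cdf(level)

def gaussian_es_multiplier(level):
    """Expected Shortfall beyond the VaR threshold, in units of sigma."""
    z = STD_NORMAL.inv_cdf(level)
    return STD_NORMAL.pdf(z) / (1.0 - level)

def realized_coverage(level, vol_ratio):
    """Share of outcomes inside the model VaR when true sigma = vol_ratio * model sigma."""
    return STD_NORMAL.cdf(gaussian_var_multiplier(level) / vol_ratio)

def min_variance_hedge_ratio(rho, sigma_target, sigma_hedge):
    """Units of hedge per unit of target that minimize residual variance."""
    return rho * sigma_target / sigma_hedge

def residual_variance(h, rho, sigma_target, sigma_hedge):
    """Variance of (target - h * hedge) under the given true parameters."""
    return sigma_target ** 2 - 2 * h * rho * sigma_target * sigma_hedge + (h * sigma_hedge) ** 2

var_mult = gaussian_var_multiplier(0.99)
es_mult = gaussian_es_multiplier(0.99)
coverage_if_vol_doubles = realized_coverage(0.99, 2.0)

# Hypothetical calm regime: rho = 0.9, 2Y vol 3.5 bp/day, 5Y vol 5.0 bp/day.
h_stale = min_variance_hedge_ratio(0.9, 3.5, 5.0)
# Hypothetical stress regime: rho = 0.8, 2Y vol 25 bp/day, 5Y vol 12 bp/day.
h_stress = min_variance_hedge_ratio(0.8, 25.0, 12.0)
resid_stale = residual_variance(h_stale, 0.8, 25.0, 12.0)
resid_stress = residual_variance(h_stress, 0.8, 25.0, 12.0)

print(f"99% VaR = {var_mult:.3f} sigma, 99% ES = {es_mult:.3f} sigma")
print(f"Coverage of the 99% VaR if true vol is double: {coverage_if_vol_doubles:.1%}")
print(f"Hedge ratio stale={h_stale:.2f} vs stress={h_stress:.2f}")
print(f"Residual variance stale={resid_stale:.1f} vs stress-optimal={resid_stress:.1f}")
```

Even before heavier tails are considered, a volatility estimate that lags the true level by a factor of two turns a nominal 99% VaR into a threshold breached far more often than 1% of the time. In the hedge example, the stale ratio leaves the position under-hedged and carries materially more residual variance than the ratio matched to the stressed regime.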

The solution isn't a heavier-tailed GARCH

The natural response is to patch GARCH: use Student-t innovations, add jump diffusion terms, or implement threshold GARCH. These modifications can improve unconditional tail coverage and may help at other maturities. For the belly specifically, they're insufficient — because the tail weight problem is not the deepest issue. The deepest issue is adaptive latency.

A GARCH-t model with 5 degrees of freedom has heavier tails than GARCH-Normal. But it still updates once per day. It still requires a sequence of large shocks to revise its variance estimate. It will still peak at the moment of stabilization rather than at the moment of maximum stress. The distribution is better; the architecture is the same.

What's needed is a model where uncertainty is conditioned on the current state of the system, not on a lagged history of shocks — and where that conditioning can respond on the timescale of the market, not the timescale of the sampling period. That is exactly what the diffusion network gφ in a Neural SDE provides. It isn't a heavier tail; it's a fundamentally different relationship between state and uncertainty.

Core Claim

GARCH's failure in a crisis is not primarily a distributional failure — it's an architectural one. The fix is not a parametric patch. It's a model class that allows uncertainty to be state-dependent rather than history-dependent.

The Neural SDE's 64.3% MSE reduction was achieved with no fat-tailed innovation distribution, no jump terms, and no retraining. We attribute the improvement to continuous-time, state-dependent dynamics.

Conclusion

GARCH fails in crisis not because it is poorly calibrated, but because its architecture imposes a set of constraints — Gaussian innovations, discrete-time updates, fixed mean parameters — that are particularly costly in the segment of the yield curve where institutional risk concentrates.

The 2Y–5Y belly encodes Fed path uncertainty, which is episodic, fat-tailed, and regime-switching. GARCH's adaptive latency means it will consistently peak after the largest moves; its Gaussian tail means it will systematically underprice the probability of those moves; its fixed coefficients mean its point forecasts will be misdirected in exactly the regime where its variance estimates are already wrong.

For risk practitioners who carry significant belly duration — in rate swap books, intermediate bond portfolios, carry trades positioned on the 2s5s spread — the question isn't whether GARCH will misfire in the next crisis. It's whether a better architecture exists, and whether the improvement is large enough to justify the integration cost.

On the first question: the Neural SDE result answers it. On the second: that depends on the book. We're happy to discuss either.

Data. U.S. Treasury par yield data from FRED (Federal Reserve Bank of St. Louis). Training window: January 2015 – January 2020. Evaluation window: February – March 2020. Table figures and maturity-level MSE breakdowns are representative; full numerical results available upon request.

No investment advice. This analysis is for informational and research purposes only. Nothing here constitutes investment, trading, or risk management advice.