Robustness lab · stress-tested market evidence
We tried to break the market result. It held.
Bottom line. The headline finding rests on three things any reader can check: what the proxy actually says, who ExxonMobil's shareholders are, and how the stock moved on announcement day. This page goes further. It runs the announcement-day analysis through every advanced test a finance Ph.D. would request — different factor models, different ways of correcting for volatility, different ways of correcting for multiple testing, and a 20-firm cohort comparison. None of them changes the answer. The market priced no robust shareholder-rights discount, and the harder tests confirm it.
Estimation window: 250 trading days (T−260 through T−11; v1.3 canonical). Baseline event window: [−1, +1] trading days. The v1.3 inventory contains 54 tests in total: an 18-cell core multi-window battery (3 benchmarks × 6 windows), plus 8 stress tests, placebo permutations, cohort tests, pre-trend diagnostics, equivalence tests, and robustness checks. The full inventory is in reviewer_package/results/TEST_INVENTORY_2026-05-17.txt. P-values are reported under the Patell test, the Corrado generalized rank test, and a wild bootstrap (10,000 draws). All p-values are then corrected via Benjamini–Hochberg.
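To make the window arithmetic concrete, here is a minimal sketch of a market-model abnormal-return calculation using the estimation and event windows stated above. It is an illustration only: the function names, the synthetic return series, and the single-factor market model are assumptions, not the replication kit's code or data.

```python
import random

EST_START, EST_END = -260, -11   # estimation window, relative to event day T
EVT_START, EVT_END = -1, 1       # baseline event window

def fit_market_model(stock, market):
    """One-regressor OLS of stock returns on market returns: (alpha, beta)."""
    n = len(stock)
    mean_s = sum(stock) / n
    mean_m = sum(market) / n
    cov = sum((m - mean_m) * (s - mean_s) for m, s in zip(market, stock))
    var = sum((m - mean_m) ** 2 for m in market)
    beta = cov / var
    return mean_s - beta * mean_m, beta

def cumulative_abnormal_return(stock, market, t):
    """CAR over the event window for an event at list index t."""
    est = slice(t + EST_START, t + EST_END + 1)   # T-260 .. T-11 inclusive
    alpha, beta = fit_market_model(stock[est], market[est])
    return sum(stock[i] - (alpha + beta * market[i])
               for i in range(t + EVT_START, t + EVT_END + 1))

# Synthetic (hypothetical) data: an exact linear relation plus a 2 pp
# shock injected on the event day.
rng = random.Random(0)
market = [rng.gauss(0.0, 0.01) for _ in range(300)]
stock = [0.0002 + 1.1 * m for m in market]
event_index = 280
stock[event_index] += 0.02

n_est_days = EST_END - EST_START + 1
car = cumulative_abnormal_return(stock, market, event_index)
print(f"estimation days: {n_est_days}, CAR[-1,+1]: {car:.4f}")
```

The estimation window ends at T−11 so that the days immediately before the announcement do not influence the fitted benchmark.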
Article of record vs. v1.3 replication kit
The specifications on this page are the canonical v1.3 replication-kit values (May 17, 2026): 250-day pre-window, 20-firm donor pool, FF6 R² = 0.32 / FF6+BNO R² = 0.62, GARCH(1,1) z = −1.02, min BH-corrected p = 0.186. The May 5, 2026 publication of record (Goodwin, Columbia Law School Blue Sky Blog) reports the earlier specifications used in footnote 27: 60-day pre-window, 10-firm pool, SPY-only R² = 0.22 / SPY+BNO R² = 0.55, GARCH z = −1.13, min BH p = 0.82. Both sets of specifications return a null result on every test. The reconciliation table appears in reviewer_package/EXTERNAL_RED_TEAM_FINDINGS_2026-05-17.md; an article erratum on fn. 27 is targeted for June 1, 2026.
Five focal specifications. The matched-pair test is the most informative because it absorbs the dominant oil-shock confounder.
| # | Specification | Benchmark | Window | Day-0 AR | Patell p | Corrado p | Wild boot. p | BH-corrected p | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Synthetic control (ADH) | 20-firm energy donor pool | Day 0 | +0.15 pp/day | — | — | — | 0.286 (250-day pre, canonical) | Null |
| 2 | Matched-pair | Chevron | Day 0 | +0.04 pp | 0.958 | — | 0.967 | 0.96 | Null |
| 3 | Oil-augmented (FF6+BNO) | Fama-French six-factor + Brent | Day 0 | −2.01 pp | 0.046 | 0.057 | 0.119 (Romano-Wolf) | 0.186 (BH across 4 BHAR windows) | Null after BH |
| 4 | Oil-augmented (FF6+BNO) | Fama-French six-factor + Brent | [−1, +5] | −1.05 pp | — | — | — | 0.186 | Null |
| 5 | Synthetic control (ADH) | 20-firm energy donor pool | [−1, +5] | +1.33 pp | — | — | — | — | Null |
Composition: an 18-cell core multi-window battery of 3 benchmarks (market, FF3, oil-augmented) × 6 windows ([0], [−1,+1], [−1,+5], [−2,+5], [−5,+5], [0,+1]), plus 8 stress tests, placebo permutations (cross-firm and in-time), cohort difference-in-differences, pre-trend diagnostics, leakage-inclusive windows, equivalence tests, and robustness checks.
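The 18-cell core battery is simply the cross product of the three benchmarks and six windows listed above. A short sketch of that enumeration follows; the variable names and dictionary layout are illustrative assumptions, not the kit's data structures.

```python
from itertools import product

BENCHMARKS = ["market", "FF3", "oil-augmented"]
# Event windows as (first day, last day) relative to the announcement day
WINDOWS = [(0, 0), (-1, 1), (-1, 5), (-2, 5), (-5, 5), (0, 1)]

core_battery = [
    {"benchmark": b, "window": w, "n_days": w[1] - w[0] + 1}
    for b, w in product(BENCHMARKS, WINDOWS)
]

print(len(core_battery))  # 18
```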
| Metric | Value | Note |
|---|---|---|
| Tests in the v1.3 inventory | 54 | Full battery, every spec a finance reviewer would request |
| Tests returning a real effect | 0 | Zero of 54; not one survives correction |
| Min BH-corrected p (core battery) | 0.186 | Far above the 0.05 significance line |
| Conclusion | Null | No effect survives correction on any test |
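For readers unfamiliar with the Benjamini–Hochberg correction behind the "min BH-corrected p" figure, the step-up adjustment can be sketched as follows. The input p-values are hypothetical and chosen only to show the mechanics; they are not the battery's raw p-values, and the function name is an assumption.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, scaling by m / rank and
    # enforcing that adjusted values never increase as rank falls.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03, 0.20]          # hypothetical raw p-values
adj = bh_adjust(raw)
print([round(p, 4) for p in adj])       # [0.04, 0.0533, 0.0533, 0.2]
print(min(adj) < 0.05)                  # True only if some test survives
```

A raw p-value just under 0.05 can therefore end up well above 0.05 once it is adjusted alongside the other tests in its family.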
| Check | Statistic | p / interpretation |
|---|---|---|
| Nested F-test: market model vs. oil-augmented (1 d.f.) | F(1, 217) = 158.5 | p < 10⁻¹⁶; R² jumps from 0.22 to 0.55 |
| Wild bootstrap: 10,000 draws, Rademacher weights | — | p = 0.967 |
| GARCH(1,1): volatility-clustered variance | z = −1.02 | p > 0.20 |
| HC3 robust SE: heteroskedasticity-consistent | — | p = 0.281 |
| Pre-trend F: no pre-announcement drift | F = 0.052 | p = 0.95 |
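As a sanity check on the nested F-test row, the F statistic for adding regressors can be recovered from the two R² values, assuming both models include an intercept and share the same dependent variable and sample. The sketch below uses the rounded R² values and degrees of freedom reported in the table; the function name is an assumption.

```python
def nested_f_from_r2(r2_restricted, r2_full, q, df_resid):
    """F statistic for q added regressors, from the R^2 of nested OLS models.

    df_resid is the residual degrees of freedom of the full model.
    """
    return ((r2_full - r2_restricted) / q) / ((1.0 - r2_full) / df_resid)

# Rounded inputs from the table: R^2 0.22 -> 0.55, 1 added regressor, 217 d.f.
f_stat = nested_f_from_r2(0.22, 0.55, q=1, df_resid=217)
print(round(f_stat, 1))  # 159.1
```

The result is close to the reported 158.5, and the small gap is within what two-decimal rounding of the R² inputs can produce.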
Market model: ±4.01 pp
Oil-augmented: ±3.11 pp
Matched-pair: ±2.38 pp
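The two one-sided tests procedure (TOST; Schuirmann, 1987) tests two hypotheses: that the true effect is more negative than −Δ (inferiority) and that it is more positive than +Δ. Rejecting both means the data affirmatively establish equivalence within ±Δ; the hypothesis that matters here is inferiority.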
Plain English. TOST "rejects inferiority" means the data affirmatively establish that the true effect is no worse than the stated bound. A standard hypothesis test fails to reject the null of "no effect"; TOST goes further and shows the data are inconsistent with any effect more negative than −Δ. Rejection is the favorable outcome.
| Bound Δ | p-value | Result |
|---|---|---|
| ±1.5 pp | 0.044 | Rejects inferiority |
| ±2.0 pp | 0.011 | Rejects inferiority |
| ±3.0 pp | 0.0003 | Rejects inferiority |
Read together: the standard tests fail to reject the null of no effect, and the TOST tests affirmatively reject the inferiority alternative at all three bounds. The Bayesian posterior agrees: P(effect < −2 pp) = 0.004.
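The TOST logic described above can be sketched in a few lines. This is an illustration under stated assumptions: it uses a normal approximation rather than whatever reference distribution the kit uses, and the point estimate and standard error are hypothetical placeholders, not the inputs behind the table.

```python
from statistics import NormalDist

def tost_pvalues(estimate, se, delta):
    """One-sided p-values against -delta and +delta, plus the TOST p-value."""
    z = NormalDist()
    # H0: effect <= -delta (inferiority); small p rejects inferiority
    p_lower = 1.0 - z.cdf((estimate + delta) / se)
    # H0: effect >= +delta; small p rejects an effect above the upper bound
    p_upper = z.cdf((estimate - delta) / se)
    # Equivalence within +/- delta requires rejecting both
    return p_lower, p_upper, max(p_lower, p_upper)

estimate_pp, se_pp = -0.1, 0.9   # hypothetical values, in percentage points
for delta in (1.5, 2.0, 3.0):
    p_lo, p_hi, p_tost = tost_pvalues(estimate_pp, se_pp, delta)
    print(f"delta = {delta} pp: p_lower = {p_lo:.4f}, "
          f"p_upper = {p_hi:.4f}, TOST p = {p_tost:.4f}")
```

Widening the bound Δ moves it further from the estimate in standard-error units, which is why the p-values shrink as Δ grows.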