# How to pull the data — replication kit data provenance

This document shows exactly how an independent reviewer can reconstruct the four data files shipped in `/replication/` from primary sources: WRDS (Wharton Research Data Services), S&P Capital IQ, and the Ken French Data Library. Run the queries in any of the three environments to produce panels that match `daily_closes.csv` within tolerance.

The article's canonical feed is S&P Capital IQ's `IQ_CLOSEPRICE_ADJ`. A CRSP cross-check via WRDS is the academic gold standard and is the source we will use in v2.

---

## 1. WRDS / CRSP — daily stock prices (academic gold standard)

### 1a. Python via the WRDS Python wrapper

```python
import wrds
db = wrds.Connection(wrds_username='your_id')  # one-time setup at wrds.wharton.upenn.edu

# 23 tickers used in the kit: XOM + 21 energy peers + SPY + BNO + XLE
tickers = (
    'XOM','CVX','COP','PSX','MPC','VLO','HES','OXY','APA','EOG','DVN',
    'FANG','OVV','CTRA','SLB','HAL','BKR','LNG','TRGP','WMB','ET','OKE',
    'KMI','SPY','XLE','BNO','MRO','ARCB','COIN','TCBI','TSLA','DDS','EQT'
)
ticker_list = ",".join(f"'{t}'" for t in tickers)

sql = f"""
WITH ranked AS (
  SELECT s.permno, s.date, s.ticker, s.prc, s.cfacpr, s.ret,
         ABS(s.prc) / NULLIF(s.cfacpr, 0) AS adj_close
  FROM crsp.dsf s
  JOIN crsp.dsenames n ON s.permno = n.permno
                        AND s.date BETWEEN n.namedt AND n.nameendt
  WHERE s.date BETWEEN '2023-06-01' AND '2026-04-02'
    AND n.ticker IN ({ticker_list})
)
SELECT date, ticker, adj_close
FROM ranked
ORDER BY date, ticker;
"""

df = db.raw_sql(sql, date_cols=['date'])
panel = df.pivot(index='date', columns='ticker', values='adj_close')
panel.to_csv('daily_closes_CRSP.csv')
```

### 1b. Direct SQL (psql to wrds-pgdata)

```sql
-- Run via: psql -h wrds-pgdata.wharton.upenn.edu -U your_user -d wrds -f query.sql
SELECT
    dsf.date,
    dsenames.ticker,
    ABS(dsf.prc) / NULLIF(dsf.cfacpr, 0) AS adj_close
FROM crsp.dsf dsf
JOIN crsp.dsenames dsenames
  ON dsf.permno = dsenames.permno
 AND dsf.date BETWEEN dsenames.namedt AND dsenames.nameendt
WHERE dsf.date BETWEEN '2023-06-01' AND '2026-04-02'
  AND dsenames.ticker IN (
    'XOM','CVX','COP','PSX','MPC','VLO','HES','OXY','APA','EOG','DVN',
    'FANG','OVV','CTRA','SLB','HAL','BKR','LNG','TRGP','WMB','ET','OKE',
    'KMI','SPY','XLE','BNO','MRO','EQT'
  )
ORDER BY date, ticker;
```

**Notes for CRSP users**:
- `prc < 0` means the closing price is a bid/ask average; `ABS()` is standard.
- `cfacpr` is the cumulative price-adjustment factor; dividing gives split- and dividend-adjusted close.
- CRSP covers NYSE / NASDAQ / AMEX (no NYSE ARCA — BNO trades on ARCA so use the alternative below).
- For BNO specifically pull from CRSP supplemental `crsp_a_etf.etf_daily` or use the Capital IQ feed below.

### 1c. Fama-French factors via WRDS

```python
import wrds
db = wrds.Connection()
ff3 = db.raw_sql("""
  SELECT date, mktrf, smb, hml, rf
  FROM ff.factors_daily
  WHERE date BETWEEN '2023-06-01' AND '2026-04-02';
""", date_cols=['date'])
ff3.to_csv('ff3_daily.csv', index=False)

# Fama-French 5 + momentum (FFC6)
ffc6 = db.raw_sql("""
  SELECT date, mktrf, smb, hml, rmw, cma, umd, rf
  FROM ff.fivefactors_daily f5
  JOIN ff.factors_daily_momentum m USING (date)
  WHERE date BETWEEN '2023-06-01' AND '2026-04-02';
""", date_cols=['date'])
ffc6.to_csv('ffc6_daily.csv', index=False)
```

---

## 2. S&P Capital IQ — `IQ_CLOSEPRICE_ADJ` (article's canonical feed)

### 2a. Excel plug-in (CIQ Office Add-in)

```excel
=CIQ("IQ_CLOSEPRICE_ADJ", "NYSE:XOM",        "06/01/2023", "04/02/2026", "D", "USD")
=CIQ("IQ_CLOSEPRICE_ADJ", "NYSE:CVX",        "06/01/2023", "04/02/2026", "D", "USD")
=CIQ("IQ_CLOSEPRICE_ADJ", "NasdaqGS:FANG",   "06/01/2023", "04/02/2026", "D", "USD")
=CIQ("IQ_CLOSEPRICE_ADJ", "NYSE:EOG",        "06/01/2023", "04/02/2026", "D", "USD")
=CIQ("IQ_CLOSEPRICE_ADJ", "NYSE:SLB",        "06/01/2023", "04/02/2026", "D", "USD")
=CIQ("IQ_CLOSEPRICE_ADJ", "NYSE:WMB",        "06/01/2023", "04/02/2026", "D", "USD")
=CIQ("IQ_CLOSEPRICE_ADJ", "NasdaqGS:BKR",    "06/01/2023", "04/02/2026", "D", "USD")
=CIQ("IQ_CLOSEPRICE_ADJ", "NYSE:OXY",        "06/01/2023", "04/02/2026", "D", "USD")
=CIQ("IQ_CLOSEPRICE_ADJ", "NYSE:COP",        "06/01/2023", "04/02/2026", "D", "USD")
=CIQ("IQ_CLOSEPRICE_ADJ", "NYSE:MPC",        "06/01/2023", "04/02/2026", "D", "USD")
=CIQ("IQ_CLOSEPRICE_ADJ", "NYSE:PSX",        "06/01/2023", "04/02/2026", "D", "USD")
=CIQ("IQ_CLOSEPRICE_ADJ", "NYSE:VLO",        "06/01/2023", "04/02/2026", "D", "USD")
=CIQ("IQ_CLOSEPRICE_ADJ", "NYSE:DVN",        "06/01/2023", "04/02/2026", "D", "USD")
=CIQ("IQ_CLOSEPRICE_ADJ", "NYSE:HAL",        "06/01/2023", "04/02/2026", "D", "USD")
=CIQ("IQ_CLOSEPRICE_ADJ", "NYSE:CTRA",       "06/01/2023", "04/02/2026", "D", "USD")
=CIQ("IQ_CLOSEPRICE_ADJ", "NYSE:OVV",        "06/01/2023", "04/02/2026", "D", "USD")
=CIQ("IQ_CLOSEPRICE_ADJ", "NasdaqGS:APA",    "06/01/2023", "04/02/2026", "D", "USD")
=CIQ("IQ_CLOSEPRICE_ADJ", "NYSE:EQT",        "06/01/2023", "04/02/2026", "D", "USD")
=CIQ("IQ_CLOSEPRICE_ADJ", "NYSE:OKE",        "06/01/2023", "04/02/2026", "D", "USD")
=CIQ("IQ_CLOSEPRICE_ADJ", "NYSE:TRGP",       "06/01/2023", "04/02/2026", "D", "USD")
=CIQ("IQ_CLOSEPRICE_ADJ", "NYSEAM:LNG",      "06/01/2023", "04/02/2026", "D", "USD")
=CIQ("IQ_CLOSEPRICE_ADJ", "NYSE:ET",         "06/01/2023", "04/02/2026", "D", "USD")
=CIQ("IQ_CLOSEPRICE_ADJ", "ARCA:SPY",        "06/01/2023", "04/02/2026", "D", "USD")
=CIQ("IQ_CLOSEPRICE_ADJ", "ARCA:BNO",        "06/01/2023", "04/02/2026", "D", "USD")
=CIQ("IQ_CLOSEPRICE_ADJ", "ARCA:XLE",        "06/01/2023", "04/02/2026", "D", "USD")
```

Output: one column of dates plus one column of adjusted closes per ticker. Concatenate horizontally to produce `daily_closes.csv`.

### 2b. CIQ REST API (programmatic)

```python
import requests
import pandas as pd

API_BASE = "https://api-ciq.marketintelligence.spglobal.com/gdsapi/rest/v3"
HEADERS = {"Authorization": f"Bearer {YOUR_TOKEN}"}

def pull_ciq_adj_close(identifier, start, end):
    payload = {
        "inputRequests": [{
            "function": "GDSHE",
            "identifier": identifier,
            "mnemonic": "IQ_CLOSEPRICE_ADJ",
            "properties": {
                "STARTDATE": start,
                "ENDDATE": end,
                "FREQUENCY": "D",
                "CURRENCYID": "USD",
                "PERIODTYPE": "IQ_LTM"
            }
        }]
    }
    r = requests.post(f"{API_BASE}/gdsapi", json=payload, headers=HEADERS)
    rows = r.json()["GDSSDKResponse"][0]["Rows"]
    return pd.DataFrame(rows, columns=["date", "value"])

tickers = {
    "XOM":  "NYSE:XOM",  "CVX":  "NYSE:CVX",  "FANG": "NasdaqGS:FANG",
    "EOG":  "NYSE:EOG",  "SLB":  "NYSE:SLB",  "WMB":  "NYSE:WMB",
    "BKR":  "NasdaqGS:BKR","OXY":  "NYSE:OXY",  "COP":  "NYSE:COP",
    "MPC":  "NYSE:MPC",  "PSX":  "NYSE:PSX",  "VLO":  "NYSE:VLO",
    "DVN":  "NYSE:DVN",  "HAL":  "NYSE:HAL",  "CTRA": "NYSE:CTRA",
    "OVV":  "NYSE:OVV",  "APA":  "NasdaqGS:APA","EQT":  "NYSE:EQT",
    "OKE":  "NYSE:OKE",  "TRGP": "NYSE:TRGP", "LNG":  "NYSEAM:LNG",
    "ET":   "NYSE:ET",   "SPY":  "ARCA:SPY",  "BNO":  "ARCA:BNO",
    "XLE":  "ARCA:XLE"
}

frames = []
for tkr, ciq_id in tickers.items():
    df = pull_ciq_adj_close(ciq_id, "2023-06-01", "2026-04-02")
    df = df.rename(columns={"value": tkr})
    frames.append(df.set_index("date"))

panel = pd.concat(frames, axis=1).sort_index()
panel.to_csv("daily_closes_CIQ.csv")
```

### 2c. CIQ SDK alternative (Excel COM or VBA)

The CIQ desktop SDK supports the same `IQ_CLOSEPRICE_ADJ` mnemonic via VBA:

```vb
Dim ciq As Object
Set ciq = CreateObject("SP.SPCIQ.1")
result = ciq.GDSP("NYSE:XOM", "IQ_CLOSEPRICE_ADJ", "06/01/2023", "USD")
```

---

## 3. Reconciliation: CRSP vs CIQ

The two vendors apply slightly different dividend back-adjustment conventions. Expect a per-day difference of <0.05% on adjusted closes; cumulative difference over a 700-day panel can reach 0.5% on individual tickers but the daily-return panels match to within ±0.001 (0.1 percentage point) on Day 0 and the regression coefficients are stable.

The article's tolerance bands (point estimates ±0.5 pp, p-values ±0.05, R² ±0.01) accommodate both feeds. A reviewer running CRSP will see slightly different alpha/beta coefficients and slightly different R² values but the same headline conclusions.

---

## 4. Cross-walk: CRSP ⇄ CIQ ⇄ raw closes

| Field | CRSP source | CIQ source | Notes |
|---|---|---|---|
| Adjusted close | `ABS(prc) / cfacpr` | `IQ_CLOSEPRICE_ADJ` | Both include splits + dividends |
| Unadjusted close | `ABS(prc)` | `IQ_LASTSALEPRICE` | Use only if you handle splits manually |
| Daily return | `ret` (CRSP-computed) | `IQ_TOTRETURN_DAY_1` | Both include dividend reinvestment |
| Volume | `vol` | `IQ_VOLUME` | Match exactly |
| Shares outstanding | `shrout` | `IQ_TOTAL_SHARES_OUTSTANDING` | Use for market cap weighting |
| Market cap | `prc * shrout` | `IQ_MARKETCAP` | CIQ rounds to one decimal in $M |

---

## 5. Donor weight derivation (advanced)

The 10 donor weights in `donor_weights.json` are not pulled from a vendor — they are computed by an SLSQP-simplex optimization on the 21-firm pre-period return panel. The procedure:

```python
import numpy as np
from scipy.optimize import minimize

# panel: T x N matrix of daily returns over 240-day pre-event window
# treated: T-vector of XOM daily returns
# donors:  T x 21 matrix of peer daily returns

def objective(w, treated, donors):
    synth = donors @ w
    return np.sum((treated - synth) ** 2)  # pre-period RMSPE^2

def constraint_sum(w):
    return np.sum(w) - 1.0

n_donors = 21
w0 = np.ones(n_donors) / n_donors
constraints = [{'type': 'eq', 'fun': constraint_sum}]
bounds = [(0.0, 1.0)] * n_donors

result = minimize(
    objective, w0, args=(treated, donors),
    method='SLSQP', bounds=bounds, constraints=constraints
)
weights = result.x  # 21-vector, sums to 1, all non-negative

# Frozen: weights logged + timestamped + committed prior to event date
# canonical_source: xom_rerun_results.json, 2026-05-16 15:21 UTC
```

The 10 published weights are the non-zero entries from this optimization. The `_weights_sum_pre_renorm` value (0.968) in `donor_weights.json` reflects rounding to three decimal places before renormalization to unit simplex; both pre-renorm and renormalized weights are exposed.

---

## 6. Ken French factor data (for FF3, FFC6, FF5 robustness)

```python
import pandas_datareader.data as pdr
import pandas as pd

# Ken French 3-factor daily (excess market, SMB, HML)
ff3 = pdr.DataReader(
    'F-F_Research_Data_Factors_daily', 'famafrench',
    start='2023-06-01', end='2026-04-02'
)[0]

# Ken French 5-factor daily
ff5 = pdr.DataReader(
    'F-F_Research_Data_5_Factors_2x3_daily', 'famafrench',
    start='2023-06-01', end='2026-04-02'
)[0]

# Momentum daily (Carhart's UMD)
mom = pdr.DataReader(
    'F-F_Momentum_Factor_daily', 'famafrench',
    start='2023-06-01', end='2026-04-02'
)[0]
```

For FFC6 (the modern standard), join FF5 + Momentum on date and run:

```python
import statsmodels.api as sm
ret = panel['XOM'].pct_change().dropna() - ff3['RF'] / 100  # excess XOM return
X = pd.concat([ff3[['Mkt-RF','SMB','HML']], ff5[['RMW','CMA']], mom], axis=1).dropna() / 100
X = sm.add_constant(X.loc[ret.index])
model = sm.OLS(ret, X).fit()
print(model.summary())
```

---

## 7. Brent oil benchmark — three alternatives to BNO

The current kit uses `ARCA:BNO` (United States Brent Oil Fund ETF). Alternatives:

| Source | Identifier | Notes |
|---|---|---|
| WTI futures (front-month) | `NYM:CL.1` (CIQ) / `crsp_a_futures.fc_monthly` (WRDS) | Standard in event-study literature; same maturity each day |
| Brent futures (front-month) | `IFE:CO.1` (CIQ) | London ICE Brent |
| Brent spot | `IFE:LCO` (CIQ) | Bloomberg-equivalent spot |
| WTI spot (EIA) | EIA API: `PET.RWTC.D` | Free, public, no auth |

A reviewer can rerun the oil-augmented regression with any of these to validate that the −2.19pp Day-0 AR isn't an artifact of BNO's ETF roll structure.

---

## 8. Reproducibility checklist for an external reviewer

If you are an independent reviewer, this is the minimum set of files you need from the live deploy:

1. `daily_closes.csv` — the panel
2. `donor_weights.json` — the SC weights (pre-registered, frozen)
3. `expected_results.json` — the reference values + tolerances
4. One of: `event_study.py`, `event_study.R`, `event_study.do` — the script

Plus, **optionally for cross-vendor validation**, regenerate `daily_closes.csv` from CRSP via WRDS using the queries in §1. Expect the headline numbers (oil-augmented Day-0 AR = −2.19 pp, Patell p = 0.049, matched-pair = +0.04 pp) to reproduce to within the ±0.5 pp tolerance band.
