Rather than a list that's ordered the same as the received columns.
Most nontrivial loaders were constructing dicts internally and then
converting back to lists, only to have the engine convert **back again**
into a dict. This cuts out the middleman, and prevents bugs due to
incorrect ordering of the output arrays.
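
A rough illustration of the difference, assuming a hypothetical loader function and hypothetical column names (this is not the engine's actual interface): when results are keyed by column, the caller never depends on output order.

```python
import numpy as np


def load_columns(columns, n_rows):
    """Hypothetical loader: return {column: array} instead of an ordered list."""
    out = {}
    for column in columns:
        # Placeholder data; a real loader would read from its data source.
        out[column] = np.zeros(n_rows)
    return out


requested = ['close', 'volume']
arrays = load_columns(requested, n_rows=3)
# The caller looks results up by column, so ordering no longer matters.
close = arrays['close']
```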
Previously we were using Close, and we calculated returns on the first
day of a window against the Open for that day. We now always look back
an extra day to get the previous day's close.
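
A minimal sketch of the idea with made-up prices (the series and window length are illustrative assumptions, not the actual implementation): fetching one extra day of closes lets the first day of the window be compared against the previous close.

```python
import pandas as pd

# Hypothetical daily closes; the first entry is the extra lookback day.
closes = pd.Series(
    [100.0, 102.0, 101.0, 104.0],
    index=pd.date_range('2014-01-02', periods=4, freq='D'),
)
window_length = 3

# Take window_length + 1 closes so the first return has a prior close.
window = closes.iloc[-(window_length + 1):]
# Close-to-close returns; drop the leading NaN from the extra day.
returns = window.pct_change().iloc[1:]
```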
- Fixes an issue with the Canadian treasury loader where it always
redownloaded, because it can only download data from the last 10 years
and so never had enough data to skip the download.
- Uses module objects directly instead of lazy imports.
- Adds lots of docstrings.
Replaces our custom XML parsing with a single call to `pd.read_csv`
against the Federal Reserve's API. This produces nearly identical
results to the old loader, but it's dramatically simpler and
roughly 10x faster on my machine.
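
A hedged sketch of what a single `pd.read_csv` call of this kind might look like. The inline CSV stands in for the remote response. Its layout, the `ND` missing-value marker, and the percent-to-decimal scaling are assumptions for illustration, not the loader's actual code.

```python
import io

import pandas as pd

# Stand-in for the downloaded CSV; the real response format may differ.
sample = io.StringIO(
    "Time Period,3month,6month\n"
    "1999-10-01,4.98,5.01\n"
    "1999-10-04,ND,5.02\n"
)

frame = pd.read_csv(
    sample,
    index_col='Time Period',
    parse_dates=['Time Period'],
    na_values=['ND'],  # assumed marker for "no data"
)
# Localize the index to UTC and convert percentages to decimal rates.
frame = frame.tz_localize('UTC') / 100.0
```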
The average difference in magnitude between new and old is on the order
of 1e-7 or smaller, and only one entry differs by more than the
precision provided by treasury.gov.
Additionally, the new loader correctly ignores Columbus Day of 2010, for
which the old loader erroneously produced an all-NaN row.
This also changes the interface that treasury modules are
required to implement. Modules must now supply a `get_treasury_data`
function that returns a `DataFrame` with a daily `DatetimeIndex` and a
column for each supported treasury duration.
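
A minimal sketch of a module satisfying this interface. The date range and the NaN placeholder values are assumptions; the duration names are the ones that appear in the comparison output in this message.

```python
import numpy as np
import pandas as pd

# Duration names as they appear in the comparison output.
DURATIONS = [
    '1month', '3month', '6month', '1year', '2year', '3year',
    '5year', '7year', '10year', '20year', '30year',
]


def get_treasury_data():
    """Return a DataFrame with a daily DatetimeIndex and one column per duration."""
    # Placeholder dates and values; a real module would download these.
    index = pd.date_range('2014-01-01', periods=3, freq='D', tz='UTC')
    return pd.DataFrame(np.nan, index=index, columns=DURATIONS)


data = get_treasury_data()
```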
Detailed comparison between results from new and old loader::
import pandas as pd
from zipline.data.treasuries import get_treasury_data
new = get_treasury_data() # New implementation
old = pd.read_csv( # Previously cached data
'/home/ssanderson/.zipline/data/treasury_curves.csv',
parse_dates=[0],
index_col=0,
)
# These columns were unused.
del old['tid']
del old['date']
old = old.tz_localize('UTC')
# old data erroneously contained an all-NaN entry for Columbus Day
# in 2010. Remove before comparing.
old = old.dropna(how='all')
In [25]: len(new) == len(old)
Out[25]: True
In [26]: abs(old - new).max()
Out[26]:
10year 2.000000e-04
1month 6.938894e-18
1year 1.000000e-04
20year 1.000000e-04
2year 2.000000e-04
30year 1.000000e-04
3month 1.000000e-03
3year 1.000000e-04
5year 1.387779e-17
6month 1.000000e-04
7year 1.000000e-04
dtype: float64
In [27]: abs(old - new).mean()
Out[27]:
10year 3.097414e-08
1month 4.396534e-19
1year 1.548707e-08
20year 3.624502e-08
2year 4.646120e-08
30year 1.830496e-08
3month 1.549427e-07
3year 1.548707e-08
5year 1.702619e-18
6month 1.548707e-08
7year 1.548707e-08
dtype: float64
Since www.treasury.gov only reports values to three significant
digits, we should only care about differences of at least 1e-3.
There is exactly one such difference: the entry for the three-month bond
on 1999-10-01::
In [60]: new[(abs(new - old) >= 1e-3).any(axis=1)].T
Out[60]:
Time Period 1999-10-01 00:00:00+00:00
1month NaN
3month 0.0498
6month 0.0501
1year 0.0530
2year 0.0573
3year 0.0583
5year 0.0590
7year 0.0622
10year 0.0600
20year 0.0657
30year 0.0615
In [61]: old[(abs(new - old) >= 1e-3).any(axis=1)].T
Out[61]:
1999-10-01 00:00:00+00:00
10year 0.0600
1month NaN
1year 0.0530
20year 0.0657
2year 0.0573
30year 0.0615
3month 0.0488
3year 0.0583
5year 0.0590
6month 0.0501
7year 0.0622
The US Treasury website (our old source) provides a value of 0.0488 here,
whereas the Federal Reserve site (our new source) provides a value of
0.0498.
Previously we were converting our date to a string, then calling
`searchsorted` on the DatetimeIndex with the string, which would cause
pandas to convert the string back into a date to actually do the lookup.
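
A small sketch of the direct lookup, using a made-up index: passing a `pd.Timestamp` to `searchsorted` avoids the round trip through a string.

```python
import pandas as pd

# Hypothetical index of days; the real index comes from the calendar.
index = pd.date_range('2014-01-01', periods=5, freq='D', tz='UTC')
dt = pd.Timestamp('2014-01-03', tz='UTC')

# Look up the position with the Timestamp itself, not a formatted string.
loc = index.searchsorted(dt)
```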
Previously we were not accounting for cases where we would invoke
next_market_minute() with a time on a trading day *before* the
market open, or previous_market_minute() with a time on a trading
day *after* the market close.
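
The two edge cases can be sketched as follows. These are simplified, hypothetical helpers that take the relevant opens and closes as arguments. They are not the calendar's actual methods, and the session times are made up.

```python
import pandas as pd

ONE_MINUTE = pd.Timedelta(minutes=1)


def next_market_minute(dt, market_open, market_close, next_open):
    """Hypothetical helper: next trading minute after dt on a trading day."""
    if dt < market_open:
        # Before the open on a trading day: the next minute is today's open.
        return market_open
    candidate = dt + ONE_MINUTE
    return candidate if candidate <= market_close else next_open


def previous_market_minute(dt, market_open, market_close, previous_close):
    """Hypothetical helper: previous trading minute before dt on a trading day."""
    if dt > market_close:
        # After the close on a trading day: the previous minute is today's close.
        return market_close
    candidate = dt - ONE_MINUTE
    return candidate if candidate >= market_open else previous_close


# Made-up session times for illustration.
session_open = pd.Timestamp('2014-01-02 14:31', tz='UTC')
session_close = pd.Timestamp('2014-01-02 21:00', tz='UTC')
following_open = pd.Timestamp('2014-01-03 14:31', tz='UTC')
prior_close = pd.Timestamp('2013-12-31 21:00', tz='UTC')

before_open = pd.Timestamp('2014-01-02 09:00', tz='UTC')
after_close = pd.Timestamp('2014-01-02 23:00', tz='UTC')

nxt = next_market_minute(before_open, session_open, session_close, following_open)
prv = previous_market_minute(after_close, session_open, session_close, prior_close)
```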
The price shock occurs on the effective_date. We had changed the
effective_date to be the day before the ex_date in the belief that pipeline
was applying values up until the effective_date, but the lookback windows
apply before the effective_date. Thus, the price shock calculation should
still use the previous day's data but be dated on the ex_date to stay aligned
with split and merger dating.