Write arrays representing corresponding market opens and market closes,
which will eventually replace the `minute_index` field.
The market closes are being added for incoming work on another branch
which will use the market closes to generate a list of non-market
minutes to filter out when returning data from `unadjusted_window`.
Add a method to the minute bar reader which returns the OHLCV data for all
requested fields for a list of assets over the specified start and end
minutes.
It is initially intended for a loader which consumes minute bar
data to resample into daily bars, but may also be used when aggregating
minute data during '1d' history calls in Q2.0.
This iteration does not yet handle early closes.
So that consumers can write empty days' worth of data without needing
to construct a DataFrame of zero data to force a write.
The internal loader uses `last_date_in_output_for_sid` to signify that
retrieval has been attempted for all dates up to that date, so
that when resuming a job, retrieval of data for those dates is not
re-attempted.
This is also used to make the write logic cleaner, by making it only
necessary to create an array large enough for the given df.
Use the preexisting metadata method when instantiating the minute bar
reader.
An internal subclass uses the `_get_metadata` method to set up data for
directories that have not used the new writer/reader interface.
(i.e. allows for reader creation when the metadata.json file does not
exist.)
Fix a bug where creating a sid bcolz file when the containing directory
was already occupied by another sid caused an OSError when attempting to
create the directory, because it already existed.
For example, with two sids, `1` and `2`, the paths would be
`00/00/000001.bcolz` and `00/00/000002.bcolz`, which share the same
directory `00/00`.
Fixed by checking for directory existence before calling `makedirs`.
Add test coverage which exercises writing of sids that are siblings in
the sid directory structure.
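
A minimal sketch of the fix, assuming the path layout shown in the example
above (a six-digit zero-padded sid split into two two-digit directory
levels). The helper names `sid_relative_path` and `ensure_sid_dir` are
hypothetical and are not taken from the writer itself:

```python
import os
import tempfile


def sid_relative_path(sid):
    """Map a sid to a path such as 00/00/000001.bcolz (layout assumed)."""
    padded = format(sid, '06d')
    return os.path.join(padded[0:2], padded[2:4], padded + '.bcolz')


def ensure_sid_dir(rootdir, sid):
    """Create the sid's containing directory only if it does not exist yet."""
    full_path = os.path.join(rootdir, sid_relative_path(sid))
    container = os.path.dirname(full_path)
    if not os.path.exists(container):
        os.makedirs(container)
    return full_path


root = tempfile.mkdtemp()
first = ensure_sid_dir(root, 1)
second = ensure_sid_dir(root, 2)  # Sibling sid: must not raise OSError.
```

Both calls resolve to the same containing directory, which is the sibling
case the new test coverage exercises.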
Implement a writer for minute data into a format composed of multiple
ctables, one for each individual asset, with a common 'index' shared by
all ctables, where a given dt maps to the same array index for all
equities and fields.
This format is pulled from the lazy-mainline/Q2.0 branch, with some
changes to the interface.
Add basic retrieval of values at a given dt to reader. Not yet used by
Zipline simulations, but added to support unit tests.
Also, rename stubbed out us_equity_minutes to minute_bars, since the
writer can be agnostic to asset type.
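
The shared index can be pictured roughly as follows. This is a sketch only:
it assumes a fixed number of minute slots per trading day (390 here, a
figure not stated above) and uses plain lists in place of the bcolz-backed
arrays. `minute_position` is a hypothetical helper, not the writer's API:

```python
from bisect import bisect_right
from datetime import datetime

MINUTES_PER_DAY = 390  # Assumed fixed slot count per trading day.


def minute_position(market_opens, dt):
    """Return the array position of dt, identical for every asset and field."""
    # Find the trading day whose open is at or before dt.
    day_idx = bisect_right(market_opens, dt) - 1
    # Whole minutes elapsed since that day's open.
    offset = int((dt - market_opens[day_idx]).total_seconds() // 60)
    return day_idx * MINUTES_PER_DAY + offset


opens = [datetime(2016, 1, 4, 14, 31), datetime(2016, 1, 5, 14, 31)]
pos = minute_position(opens, datetime(2016, 1, 5, 14, 33))
```

Because the position depends only on the dt, the same value indexes into
every asset's ctable and every field.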
Moved from the `lazy-mainline` branch,
https://github.com/quantopian/zipline/pull/858
The intent of this patch is to provide the basic class and reader
interfaces developed on that branch, so that creating the
object, opening paths, etc. can be tested internally.
Additional changes beyond the lazy-mainline branch are the addition of a
future minute reader and a daily bar reader.
Also allow an argument of the future_daily_reader, though no such reader
yet exists.
It may be that future and equity readers share an interface, and a
further improvement would be providing an abstract base class.
co-author: @jbredeche <jean@quantopian.com>
Return -1 when there is a zero value for a spot price.
Intended for use by the incoming data portal changes. When the data
portal sees a -1 value, it will seek back one trading day at a time
until a non-negative value is returned.
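
A sketch of how a consumer might handle the sentinel. The data portal
changes are not part of this patch, so `spot_value` and `last_known_price`
are hypothetical names and the list-based storage is an assumption:

```python
def spot_value(raw):
    """Return -1 in place of a zero value to mark a missing price."""
    return -1 if raw == 0 else raw


def last_known_price(daily_values, day_idx):
    """Walk back one trading day at a time until a non-negative value is found."""
    while day_idx >= 0:
        value = daily_values[day_idx]
        if value >= 0:
            return value
        day_idx -= 1
    return None  # No earlier price exists.


values = [spot_value(raw) for raw in [10.5, 0, 0]]
price = last_known_price(values, 2)
```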
Volumes were incorrectly having the thousands factor applied; however,
the volume is written as is (without the factor, since volume is an
int, not a float value).
Fix by adding a special case for volume which returns the stored value as is.
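
The special case amounts to something like the following sketch. It assumes
prices are stored as integers scaled up by 1000 and scaled back down on
read; `PRICE_FACTOR` and `decode_value` are hypothetical names:

```python
PRICE_FACTOR = 0.001  # Assumed: stored prices are ints scaled by 1000.


def decode_value(field, raw):
    """Convert a stored integer back to its user-facing value."""
    if field == 'volume':
        # Volume is written unscaled, so return it untouched.
        return raw
    return raw * PRICE_FACTOR


volume = decode_value('volume', 1500)
close = decode_value('close', 10250)
```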
Rather than repeatedly try and fail to download data that's not yet
available, only try to download again if we haven't successfully
downloaded in the last hour.
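
The retry rule can be sketched as below. How the time of the last successful
download is recorded is not described here, so it is passed in directly;
`should_download` is a hypothetical helper:

```python
from datetime import datetime, timedelta


def should_download(last_success, now, min_interval=timedelta(hours=1)):
    """Retry only if there was no successful download within min_interval."""
    if last_success is None:
        return True  # Never downloaded successfully.
    return now - last_success >= min_interval


now = datetime(2016, 1, 5, 12, 0)
recent = should_download(datetime(2016, 1, 5, 11, 30), now)
stale = should_download(datetime(2016, 1, 5, 10, 0), now)
```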
Previously we were using Close, and we calculated returns on the first
day of a window against the Open for that day. We now always look back
an extra day to get the previous day's close.
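
In other words, each day's return is computed close to close, which is why
the window needs one extra leading day. A sketch with a hypothetical
`daily_returns` helper:

```python
def daily_returns(closes):
    """Close-to-close returns; closes[0] is the extra look-back day."""
    return [curr / prev - 1.0 for prev, curr in zip(closes, closes[1:])]


# Three closes produce returns for the last two days only.
returns = daily_returns([100.0, 110.0, 99.0])
```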
- Fixes an issue with the Canadian treasury loader where it would never
have enough data to skip redownloading, because it can only download data
from the last 10 years.
- Uses module objects directly instead of lazy imports.
- Adds lots of docstrings.
Replaces our custom XML parsing with a single call to `pd.read_csv`
against the federal reserve's API. This produces nearly identical
results as compared to the old loader, but it's dramatically simpler and
roughly 10x faster on my machine.
The average difference in magnitude between new and old is approximately
10e-7, and only one entry is different to a degree greater than the
number of significant figures provided by treasury.gov.
Additionally, the new loader correctly ignores Columbus Day of 2010, for
which the old loader erroneously produced an all-NaN row.
This also changes the interface that treasury modules are
required to implement. Modules must now supply a `get_treasury_data`
function that returns a `DataFrame` with a daily `DatetimeIndex` and a
column for each supported treasury duration.
Detailed comparison between results from new and old loader::
import pandas as pd
from zipline.data.treasuries import get_treasury_data
new = get_treasury_data() # New implementation
old = pd.read_csv( # Previously cached data
    '/home/ssanderson/.zipline/data/treasury_curves.csv',
    parse_dates=[0],
    index_col=0,
)
# These columns were unused.
del old['tid']
del old['date']
old = old.tz_localize('UTC')
# old data erroneously contained an all-NaN entry for Columbus Day
# in 2010. Remove before comparing.
old = old.dropna(how='all')
In [25]: len(new) == len(old)
Out[25]: True
In [26]: abs(old - new).max()
Out[26]:
10year 2.000000e-04
1month 6.938894e-18
1year 1.000000e-04
20year 1.000000e-04
2year 2.000000e-04
30year 1.000000e-04
3month 1.000000e-03
3year 1.000000e-04
5year 1.387779e-17
6month 1.000000e-04
7year 1.000000e-04
dtype: float64
In [27]: abs(old - new).mean()
Out[27]:
10year 3.097414e-08
1month 4.396534e-19
1year 1.548707e-08
20year 3.624502e-08
2year 4.646120e-08
30year 1.830496e-08
3month 1.549427e-07
3year 1.548707e-08
5year 1.702619e-18
6month 1.548707e-08
7year 1.548707e-08
dtype: float64
Since www.treasury.gov only reports values up to three significant
digits, we should only care about differences of greater than 1e-3.
There is exactly one such difference: the entry for the three month bond
on 1999-10-01::
In [60]: new[(abs(new - old) >= 1e-3).any(axis=1)].T
Out[60]:
Time Period 1999-10-01 00:00:00+00:00
1month NaN
3month 0.0498
6month 0.0501
1year 0.0530
2year 0.0573
3year 0.0583
5year 0.0590
7year 0.0622
10year 0.0600
20year 0.0657
30year 0.0615
In [61]: old[(abs(new - old) >= 1e-3).any(axis=1)].T
Out[61]:
1999-10-01 00:00:00+00:00
10year 0.0600
1month NaN
1year 0.0530
20year 0.0657
2year 0.0573
30year 0.0615
3month 0.0488
3year 0.0583
5year 0.0590
6month 0.0501
7year 0.0622
The US Treasury website (our old source) provides a value of 0.0488 here,
whereas the Federal Reserve site (our new source) provides a value of
0.0498.
Previously we were converting our date to a string, then calling
`searchsorted` on the DatetimeIndex with the string, which would cause
pandas to convert the string back into a date to actually do the lookup.
The price shock occurs on the effective_date. We had changed the effective_date to
be the day before the ex_date in the belief that pipeline was applying values up
until the effective_date, but the lookback windows apply before the
effective_date. Thus, the price shock calculation should still use the previous
day's data but be dated on the ex_date to stay aligned with split and
merger dating.
When the prev_close is 0 or does not exist, the resulting ratio was either +inf
or nan, respectively.
Create a mask on the non-zero effective dates, where effective date is only
written when the prev close is sufficient for a valid ratio; and use that mask
to filter out the bad rows.
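
A sketch of the filtering idea using plain Python in place of array masks.
The ratio formula `1 - amount / prev_close` and the helper name
`valid_ratios` are assumptions made for illustration:

```python
import math


def valid_ratios(rows):
    """Compute ratios only for rows whose prev close can yield a valid ratio.

    rows is a list of (amount, prev_close) pairs; prev_close may be None.
    """
    ratios = []
    for amount, prev_close in rows:
        usable = (
            prev_close is not None
            and not math.isnan(prev_close)
            and prev_close != 0
        )
        if usable:  # Equivalent to keeping rows where the mask is True.
            ratios.append(1.0 - amount / prev_close)
    return ratios


ratios = valid_ratios([(1.0, 100.0), (1.0, 0.0), (1.0, None)])
```

Only the first row survives; the zero and missing prev closes are dropped
before any division takes place.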
Also, use prev close as the effective date.
To prepare for querying for payouts from SQLite, write the dividend
payouts to a new table `dividend_payouts`.
Change the expected columns of the passed dividend frame to contain the
payout data, and use that data to calculate the ratios (this moves
internal code that was calculating the ratios into Zipline).
The end result is that instead of just a `dividends` table with the
backward looking adjustment ratios, also write a `dividend_payouts`
table and a `stock_dividend_payout` table.
For a pipeline doing simple computations on USEquityPricing data, we
were spending ~60% of `run_pipeline` loading adjustments. Almost all of
that time was spent in calls to `DatetimeIndex.get_loc` to find the
indices of adjustment `eff_date`s.
This optimizes the eff_date lookups by pre-populating a cache of
seconds-since-epoch timestamps that we expect to see, and falling back
to `np.searchsorted` on cache misses.
In testing, this reduces the time to compute a 1-year pipeline with 30
and 90 day moving averages from 3.1 seconds to 0.9 seconds.
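
The lookup strategy can be sketched as below. This version uses the standard
library's `bisect_left` as a stand-in for `np.searchsorted` (both return the
left insertion point by default), and `make_lookup` is a hypothetical name
rather than the function used in the loader:

```python
from bisect import bisect_left


def make_lookup(sorted_seconds):
    """Build a lookup from seconds-since-epoch to index position."""
    # Pre-populate the cache with the timestamps we expect to see.
    cache = {seconds: i for i, seconds in enumerate(sorted_seconds)}

    def lookup(seconds):
        try:
            return cache[seconds]
        except KeyError:
            # Cache miss: fall back to a binary search and remember the result.
            ix = cache[seconds] = bisect_left(sorted_seconds, seconds)
            return ix

    return lookup


DAY = 86400
lookup = make_lookup([0, DAY, 4 * DAY])
hit = lookup(DAY)        # Served from the pre-populated cache.
miss = lookup(2 * DAY)   # Not in the index; resolved by binary search.
```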