Previously, whenever we try to access a missing value on the Positions
dict, we return a default Position and save it to the dict. Instead,
just return the Position
MAINT: add event date col field and filter rows where this field is null
TST: modify tests to filter nulls in event date col
MAINT: calculate value repeats by vectorized computation on separate start and end dates.
MAINT: pass DatetimeIndex instead of list of strings
The BcolzDailyBarReader was optimized for the pipeline case of reading
all assets at once.
Now that the reader is also used to support daily history the case of
reading a data for a small number of assets is more common, particularly
in algorithms that use the history API which have a high rotation of
assets (e.g. an algorithm which pipeline uses to set the active
universe)
Remove the bottleneck in reading a small number of assets by
conditionally reading the slice for each asset from the carray, instead
of reading the data for all equities and then indexing into that full
array. On a certain number of assets, it is still better to read all the
data at once. On the Quantopian dataset, which holds data for 20000
about for the last 10 years of equity data (where not all equities trade
over the full range), stored in 118 blosc blp files per column, the
tipping point where the 'read all' mode wins out between 3000-4000
assets.
That number was tested by trying to exercise a worst case scenario where
the equities were spread out evenly across the blp files, by stepping
along a sorted list of assets that were alive over a query range which
spanned 70 trading days.
```
size = 3000
sids = [assets[i] for i in range(0, len(assets), len(assets) /
size)][:size]
```
Also, add parameter to WithBcolzDailyBarReader fixture which allows the
test to specify what the threshold count for reading all data should be,
so that the test_us_equity_pricing can be forced into either mode to
make sure that both branches in logic are covered by all test cases.
On local dev machine this patch improves the read time of `load_raw_array`
for one asset from 100 ms to 96.5 µs. (10^5 improvement.) With reading
only asset per call a being an observed common case when populating the
non-cached values in USEquityHistoryLoader.
We were trying to use the previous day in before_trading_start because
we were looking for the previous market minute, then normalizing it. That's
no longer the case, as we want to use today's date for fetcher lookups
in before_trading_start.
Also refactored a bit how dataportal determines if a query should be
routed to the fetcher data structures.
Changes BcolzDailyBarWriter to not be an abc, data is passed as an
iterator of (sid, dataframe) pairs to the write method.
Changes the AssetsDBWriter to be a single class which accepts an engine
at construction time and has a `write` method for writing dataframes for
the various tables. We no longer support writing the various other data
types, callers should coerce their data into a dataframe themselves. See
zipline.assets.synthetic for some helpers to do this.
Adds many new fixtures and updates some existing fixtures to use the new
ones:
WithDefaultDateBounds
A fixture that provides the suite a START_DATE and END_DATE. This is
meant to make it easy for other fixtures to synchronize their date
ranges without depending on eachother in strange ways. For example,
WithBcolzMinuteBarReader and WithBcolzDailyBarReader by default should
both have data for the same dates, so they may use depend on
WithDefaultDates without forcing a dependency between them.
WithTmpDir, WithInstanceTmpDir
Provides the suite or individual test case a temporary directory.
WithBcolzDailyBarReader
Provides the suite a BcolzDailyBarReader which reads from bcolz data
written to a temporary directory. The data will be read from
dataframes and then converted to bcolz files with
BcolzDailyBarWriter.write
WithBcolzDailyBarReaderFromCSVs
Provides the suite a BcolzDailyBarReader which reads from bcolz data
written to a temporary directory. The data will be read from a
collection of CSV files and then converted into the bcolz data through
BcolzDailyBarWriter.write_csvs
WithBcolzMinuteBarReader
Provides the suite a BcolzMinuteBarReader which reads from bcolz data
written to a temporary directory. The data will be read from
dataframes and then converted to bcolz files with
BcolzMinuteBarWriter.write
WithAdjustmentReader
Provides the suite a SQLiteAdjustmentReader which reads from an in
memory sqlite database. The data will be read from dataframes and then
converted into sqlite with SQLiteAdjustmentWriter.write
WithDataPortal
Provides each test case a DataPortal object with data from temporary
resources.
Fix a bug where if history were called with assets `[1, 2]` and then
subsequently, `[2, 1]`, the loader would return the cached array in
order for `[1, 2]`.
Instead cache an AdjustedArray for each asset, then when a history
window is requested, check if each asset has a sufficient cache, and if
not then read values for the assets which are missing or need to be
refreshed.
An added benefit of this change is that if a subsequent call to history
has a smaller number of assets than the previous, no new data needs to
be read from disk. e.g. a call with assets `[1, 2, 3]` and then `[1, 2]`
would use the cached values for `1` and `2` from the first call.
Conversely, if the second call has more assets, then only the data for
the new assets needs to be retrieved. e.g. a history with `[1, 2]`, then
`[1, 2, 3]` would only need (assuming `1` and `2` have not expired) to
retrieve data for `3`. Unfortunately, the benefit here is not great
because `load_raw_arrays` is optimized for reading many assets, and
pulls the entire daily bar dataset into memory. This change makes tuning
`load_raw_arrays` so that faster reads (e.g. by slicing from the carray
for each asset, instead of pulling all data into a numpy array), when
only a few assets are requested, more beneficial than it would have been
previously.
Add a cache interface which supports expirable entries with a changeable
backend for the cache into which they are entered.
The default cache is a `dict` but could swapped for
`cachetools.LRUCache` or any other cache which supports `__get__`,
`__set__`, and `__del__`.
So that consumers can change the use of `CachedObjects` stored in a
cache from:
```
self._cache = {}
...
try:
obj = self._cache[key]
try:
return obj.unwrap(dt)
except Expired:
pass
except KeyError:
pass
...
self._cache[key] = CachedObject(value, new_expiration)
```
to:
```
self._cache = ExpiringCache(LRUCache(maxsize=6))
...
try:
return self._cache.get(key, dt)
except KeyError:
# Get fresh value
...
self._cache.set(key, value, new_expiration)
```
The intervals are returned as a set, so order is not guaranteed,
which becomes exposed when reading windows which span multiple years.
The deletion of values from the regular sized minute array assumes that
intervals can be reversed to delete the array from the back.
When the dts and length of cols are mismatched the writer behaves in
unintended ways. e.g. in a case where a consumer passed dts which had
minutes with no trades removed, but regular (market minute for day)
sized arrays for the data with `0`'s on minutes without trades, the non
trade minutes from cols are written to slots in the output where a trade
is intended.
Protect against this misuse by checking that all lengths are equal when
using the `write_cols` method.
Make a separate `_write_cols` method for use by both `write_cols` and
`write`, since the `write` method which takes a DataFrame has the
matched input length enforced by the DataFrame.
BUG: correctly create asset finder
MAINT: rename fixture
STY: fixes for flake8
STY: add space around assignment
MAINT: add var back to constructor
MAINT: remove unused import
MAINT: compare var with None directly
MAINT: fix merge errors
MAINT: remove record date - not needed.
MAINT: restructure dividends dataset.
MAINT: restructure dividends factors.
WIP: update dividends tests.
MAINT: correct the way to get the 'next' event frame.