We were trying to use the previous day in before_trading_start because
we were looking for the previous market minute, then normalizing it. That's
no longer the case, as we want to use today's date for fetcher lookups
in before_trading_start.
Also refactored a bit how dataportal determines if a query should be
routed to the fetcher data structures.
Changes BcolzDailyBarWriter to not be an abc, data is passed as an
iterator of (sid, dataframe) pairs to the write method.
Changes the AssetsDBWriter to be a single class which accepts an engine
at construction time and has a `write` method for writing dataframes for
the various tables. We no longer support writing the various other data
types, callers should coerce their data into a dataframe themselves. See
zipline.assets.synthetic for some helpers to do this.
Adds many new fixtures and updates some existing fixtures to use the new
ones:
WithDefaultDateBounds
A fixture that provides the suite a START_DATE and END_DATE. This is
meant to make it easy for other fixtures to synchronize their date
ranges without depending on eachother in strange ways. For example,
WithBcolzMinuteBarReader and WithBcolzDailyBarReader by default should
both have data for the same dates, so they may use depend on
WithDefaultDates without forcing a dependency between them.
WithTmpDir, WithInstanceTmpDir
Provides the suite or individual test case a temporary directory.
WithBcolzDailyBarReader
Provides the suite a BcolzDailyBarReader which reads from bcolz data
written to a temporary directory. The data will be read from
dataframes and then converted to bcolz files with
BcolzDailyBarWriter.write
WithBcolzDailyBarReaderFromCSVs
Provides the suite a BcolzDailyBarReader which reads from bcolz data
written to a temporary directory. The data will be read from a
collection of CSV files and then converted into the bcolz data through
BcolzDailyBarWriter.write_csvs
WithBcolzMinuteBarReader
Provides the suite a BcolzMinuteBarReader which reads from bcolz data
written to a temporary directory. The data will be read from
dataframes and then converted to bcolz files with
BcolzMinuteBarWriter.write
WithAdjustmentReader
Provides the suite a SQLiteAdjustmentReader which reads from an in
memory sqlite database. The data will be read from dataframes and then
converted into sqlite with SQLiteAdjustmentWriter.write
WithDataPortal
Provides each test case a DataPortal object with data from temporary
resources.
Fix a bug where if history were called with assets `[1, 2]` and then
subsequently, `[2, 1]`, the loader would return the cached array in
order for `[1, 2]`.
Instead cache an AdjustedArray for each asset, then when a history
window is requested, check if each asset has a sufficient cache, and if
not then read values for the assets which are missing or need to be
refreshed.
An added benefit of this change is that if a subsequent call to history
has a smaller number of assets than the previous, no new data needs to
be read from disk. e.g. a call with assets `[1, 2, 3]` and then `[1, 2]`
would use the cached values for `1` and `2` from the first call.
Conversely, if the second call has more assets, then only the data for
the new assets needs to be retrieved. e.g. a history with `[1, 2]`, then
`[1, 2, 3]` would only need (assuming `1` and `2` have not expired) to
retrieve data for `3`. Unfortunately, the benefit here is not great
because `load_raw_arrays` is optimized for reading many assets, and
pulls the entire daily bar dataset into memory. This change makes tuning
`load_raw_arrays` so that faster reads (e.g. by slicing from the carray
for each asset, instead of pulling all data into a numpy array), when
only a few assets are requested, more beneficial than it would have been
previously.
Add a cap of 5 sliding windows (one per OHCLV column) to the history
loader's cache of sliding windos.
This prevents unbounded growth on algorithms that call history with a
highly varied list of equities.
To follow is splitting the cache up by column and by sid, so that the
loader does not re-prefetch sids which have already been read with
sufficient data; however this patch is enough to fix the issue where an
algo with high rotation can add up a megabyte per day of memory on
algorithms which rotate on a 5% dollar volume pipeline. With this cap
those algorithms have more plateaus with regard to memory consumption.
This patch requires new dependency of `cachetools` library.
Add a cache interface which supports expirable entries with a changeable
backend for the cache into which they are entered.
The default cache is a `dict` but could swapped for
`cachetools.LRUCache` or any other cache which supports `__get__`,
`__set__`, and `__del__`.
So that consumers can change the use of `CachedObjects` stored in a
cache from:
```
self._cache = {}
...
try:
obj = self._cache[key]
try:
return obj.unwrap(dt)
except Expired:
pass
except KeyError:
pass
...
self._cache[key] = CachedObject(value, new_expiration)
```
to:
```
self._cache = ExpiringCache(LRUCache(maxsize=6))
...
try:
return self._cache.get(key, dt)
except KeyError:
# Get fresh value
...
self._cache.set(key, value, new_expiration)
```
Instead of using the `remember_last` memoization on all calls to
`_find_position_of_minute`, add an instance local cache which is only
used by the `get_value` call. The `get_value` call is very hot, so any
extra overhead (e.g. creating the WeakArgs on every invocation) becomes
costly. The current usage `get_value` also has the property that it is
called with monotonically increasing, but with a high repeat count on
each value. (A further improvement could making a `get_value` which
supports being used by many sids, for use by the update portfolio
positions.)
The caching is not done at the `_find_position_of_minute_level` because
`unadjusted_window` always uses two positions on the tape (start and end
of range) which would cause the entries and removal into the cache which
would be invalidated both between the calls of start and end, and next
call of the function.
The argument was only needed for mapping the positions which need to be
removed on adjusted windows. The start and end position of each range
can be derived from the early closes' positions and the market open,
respectively.
Remove to reduce moving parts.
The cache in data portal was added before the change to using a
CachedObject to wrap the window_blocks in the USEquityHistoryLoader.
Removing this extra layer saves some cycles.
Does not fix current memory investigaton (since only one sids/dts pair
per column was cached in `_equity_daily_reader_array_data` at a time),
but removing should make it more clear where needed references are being
held.