- Adds a new class, ``LabelArray``, which is a subclass of np.ndarray.
LabelArray is conceptually similar to pandas.Categorical, in that it
stores data with many duplicate values as indices into an array of
unique values. For string data with many duplicates (e.g. time-series
of tickers or or industry classifications), this provides multiple
orders of magnitude of improvement when doing string operations,
especially string comparison/matching operations.
- Adds a new generic object "specialization" for `AdjustedArrayWindow`,
and a corresponding ObjectOverwrite adjustment.
- Adds a new ``postprocess`` method to ``zipline.pipeline.term.Term``.
This method is called on the final result of any pipeline expression
after screen filtering has occurred. The default implementation of
``postprocess`` is identity, but Classifier overrides it to coerce
string columns into pandas.Categoricals before presenting them to the
user.
Instead of letting the cache of carrays grow unbounded, use an LRUCache
to cap the number of equities for any given column.
Tested with the size 1000, on an algo that was using pipeline which was
using over 3000, runtimes were similar, but the memory usage was
successfully capped to around 1.2GB.
Also, tested with an algorithm which bought and hold just one equity and
no major slow down was seen when using the LRUCache vs. a dictionary.
We may want to follow this up with an extension to `carray` which is not
as memory hungry per column; e.g. by not loading repeated/similar
metadata or releasing the last read chunk after a certain amount of
time.
Adds the data bundle concept which makes it easy for users to register
loading functions to build out minute and daily data along with an
assets db and adjustments db. By default we have provided a `quandl`
bundle which pulls from the public domain WIKI dataset. Users may
register new bundles by decorating an ingest function with
`zipline.data.bundles.register(<name>)`. This also provides a
`yahoo_equities` function for creating an ingestion function that will
load a static set of assets from yahoo.
The cli is now structured as a couple of subcommands and has been
changed to `python -m zipline`. The old behavior of `run_algo.py` has
been moved to the `run` subcommand. This is almost entirely the same
except that it now takes the name of the data bundle to use, defaulting
to `quandl`.
The next subcommand is `ingest` which takes the name of
a data bundle to ingest. This will run the loading machinery and write
the data to a specified location that `run` can find.
There is also a `clean` subcommand which deletes the data that was
written with `ingest`.
Extensions have also been added to zipline. This is an experimental
feature where users can provide an extra set of python files to run at
the start of the process. These can be used to configure aspects of
zipline. Right now the only thing that is supported in an extension file
is the registration of a new data bundle.
Updates the BcolzMinuteBarWriter.write api to allow users to pass their
data as a stream instead of requiring that they loop over their data
externally. This matches the API presented by BcolzDailyBarWriter.
Previously, whenever we try to access a missing value on the Positions
dict, we return a default Position and save it to the dict. Instead,
just return the Position
MAINT: add event date col field and filter rows where this field is null
TST: modify tests to filter nulls in event date col
MAINT: calculate value repeats by vectorized computation on separate start and end dates.
MAINT: pass DatetimeIndex instead of list of strings
The BcolzDailyBarReader was optimized for the pipeline case of reading
all assets at once.
Now that the reader is also used to support daily history the case of
reading a data for a small number of assets is more common, particularly
in algorithms that use the history API which have a high rotation of
assets (e.g. an algorithm which pipeline uses to set the active
universe)
Remove the bottleneck in reading a small number of assets by
conditionally reading the slice for each asset from the carray, instead
of reading the data for all equities and then indexing into that full
array. On a certain number of assets, it is still better to read all the
data at once. On the Quantopian dataset, which holds data for 20000
about for the last 10 years of equity data (where not all equities trade
over the full range), stored in 118 blosc blp files per column, the
tipping point where the 'read all' mode wins out between 3000-4000
assets.
That number was tested by trying to exercise a worst case scenario where
the equities were spread out evenly across the blp files, by stepping
along a sorted list of assets that were alive over a query range which
spanned 70 trading days.
```
size = 3000
sids = [assets[i] for i in range(0, len(assets), len(assets) /
size)][:size]
```
Also, add parameter to WithBcolzDailyBarReader fixture which allows the
test to specify what the threshold count for reading all data should be,
so that the test_us_equity_pricing can be forced into either mode to
make sure that both branches in logic are covered by all test cases.
On local dev machine this patch improves the read time of `load_raw_array`
for one asset from 100 ms to 96.5 µs. (10^5 improvement.) With reading
only asset per call a being an observed common case when populating the
non-cached values in USEquityHistoryLoader.