The write_data methods invokes the relevant AssetDBWriter subclass
to write data to the database. update_asset_finder is no longer
a relevant method since the AssetFinder is strictly a reader class.
The AssetDBWriter class and its subclasses will
ultimately be responsible for creating the SQLite
database tables and writing data to these tables.
In the longer term AssetDBWriter and AssetFinder will
be decoupled, sharing only an SQLite connection.
However, for backward compatibility reasons this has
not yet been fully implemented.
Modify tests since AssetFinder no longer has a
metadata_cache attribute.
- Add an `ascending=True` keyword to `rank()`.
- Add `top(N)` and `bottom(N)` methods to Factor. These return Filters
that pass the top and bottom N elements each day.
- Add a slightly faster path for rank(method='ordinal'). I had
originally thought the fast path was 2-3x faster because I had my
benchmark data axes flipped. The actual speedup is only 5-10%, which
means it probably wasn't worth the effort to Cythonize...but we have a
slightly faster version now so we might as well use it.
- Refactor test_filter and test_factor to make it easier to implement
and test transformations on factors. These tests now subclass
BaseFFCTestCase, which provides facilities for passing a dict of terms
and an "initial_workspace", the values for which are used by
SimpleFFCEngine rather than needing to manually manage the inputs and
outputs of each term.
This makes ordering with the returned assets much easier, and there's no
performance degradation for non-broadcasting operations on the Index.
Timings
-------
from random import sample
finder = AssetFinder(create_table=False, assets.db')
assets = load_8000_assets(finder)
AAPL = finder.retrieve_asset(24)
RANDOM_ASSETS = sample(assets, 500)
df = DataFrame(
index=assets,
data=np.random.randn(len(assets), 4),
columns=['a', 'b', 'c', 'd'],
)
df_int = DataFrame(
index=map(int, assets),
data=np.random.randn(len(assets), 4),
columns=['a', 'b', 'c', 'd'],
)
%timeit df.loc[24]
%timeit df_int.loc[24]
10000 loops, best of 3: 45.3 µs per loop
10000 loops, best of 3: 44.7 µs per loop
%timeit df.loc[AAPL]
%timeit df_int.loc[AAPL]
10000 loops, best of 3: 45.1 µs per loop
10000 loops, best of 3: 44.8 µs per loop
%timeit df.loc[RANDOM_ASSETS]
%timeit df_int.loc[RANDOM_ASSETS]
1000 loops, best of 3: 1.53 ms per loop
100 loops, best of 3: 2.18 ms per loop
%timeit df.sum()
%timeit df_int.sum()
10000 loops, best of 3: 56 µs per loop
10000 loops, best of 3: 55.7 µs per loop
%timeit df.index == 3
%timeit df_int.index == 3
1000 loops, best of 3: 253 µs per loop
100000 loops, best of 3: 6.76 µs per loop
%timeit df.iloc[:50]
%timeit df_int.iloc[:50]
10000 loops, best of 3: 44.3 µs per loop
10000 loops, best of 3: 44 µs per loop
If lookup_future_chain was provided with an as_of_date or knowledge date that was pandas.NaT, the query we were forming wasn't what we want. Instead, as_of_date, if not NaT, is used for knowledge_date, and if both are NaT, no date filtering is done in the query.
- Previously it was returning a DataFrame because of how we applied an &
with a DataFrame mask. The error was masked by the fact that
`np.assert_array_equal` coerces inputs to arrays before comparing.
- Added `zp.utils.test_utils.check_arrays`, which checks type equality
before calling `np.assert_array_equal`.
This patch lays the groundwork for a compute engine designed to
facilitate construction of factor-based universe screening and portfolio
allocation. It contains:
A new module, `zipline.modelling`, containing entities that can be used
to express computations as dependency graphs. Each node in such a graph
is an instance of the base `Term` class, defined in
`zipline.modelling.term`. Dependency graphs are executed by instances
of `FFCEngine`, defined in `zipline.modelling.engine`.
A new module, `zipline.data.ffc`, containing loaders and dataset
definitions for inputs to the modelling API.
New `TradingAlgorithm` api methods: `add_factor`, and `add_filter`.
These methods can only be called from `initialize`, and are used to
inform the algorithm that each day it should compute the given terms.
Computed factor results are made available through a new attribute of
the `data` object in `before_trading_start` and `handle_data`. Computed
filter results control which assets are available in the factor matrix
on each day.
The minutely calculation of risk metrics had been removed with a
previous patch, remove vestigial references.
Remove a test which tested the behavior of updating the second minute of
a day.
Remove the logic that changed the datetime index of the risk metrics
depending on emission rate, now only trading_days are needed.
Remove `returns_frequency` parameter since both minute and daily
data frequency always use daily returns.
Attack the startup bottleneck of creating the asset finders caches for a
large universe, which was between 1-2 seconds on development and
production machines.
Instead, allow the AssetFinder to be passed a sqlite3 file that has
already been populated and then hydrate asset objects only when an
equity is referenced for the first time.
To create aforementioned sqlite3, create an AssetFinder with an db_path
and `create_table` set to True. If `create_table` is set to False, the
prepopulated data in the sqlite file found at db_path will be used.
Default behavior is to use an in memory database.
Behavior that changes:
- Fuzzy lookup now only works on one character, that character needs to be
specified at write/metadata consumption time, since the fuzzy lookup key
is created by dropping the character from each symbol.
- Overwriting partially written metadata is no longer
supported. i.e. some unit tests allowed for inserting just the identifier,
and then later updating the symbol, end_date, etc.
Instead of building an upsert behavior at this time, this patch
changes the unit tests so that the data for each asset is only
inserted once.
Other notes:
- populate_cache is now removed, since there is no longer a two step
process of inserting metadata and then realizing that metadata into
assets. _spawn_asset is rolled into insert_metadata, so that a call to
insert_metadata both converts the metadata and makes it available in
the data store.
The lookup of future contract by individual symbol is a constraint on
incoming changes of changing how the asset finder stores data.
(i.e. the asset finder is changing so that there are separate tables for
both futures and equities.)
Since this lookup is not yet fully supported, we can add it back in on
top of the new asset finder.
Since most brokers will cease accepting trades by the notice date, contracts should not be considered valid after the notice date. This commit adjusts the lookup_future_chain method to consider all contracts with notice dates on or following the current date invalid.
The asset finder retrieved from the test environment is empty, so the
test does not end up testing anything, since the test cases loop over
the empty list of sids in the asset finder.
Remove to possibly be added back in and re-implemented after a larger
refactoring of the module.
Removes unused future lookup methods and consolidates everything into lookup_future_chain. Since the FutureChain object will have to hold a root symbol and dates, it should be responsible for cleaning the user input, so this is removed from the lookup method.
Adds knowledge date to future lookups. This makes our definition of valid contracts more flexible. We know about a contract if it starts trading by the knowledge date, and a contract is expired if it expires by the as_of_date.
Also fixes a bug with computing future chains, where contracts were not included in the chain on their expiration date.
This commit modifies the DataFrameSource and DataPanelSource to accept only Int64Indexes on the incoming data and moves the burden of mapping user identifiers to TradingAlgorithm.run().
The identifier cache's usage was nearly identical to using lookup_generic, so this commit removes identifier-keyed caching and modifies anything that uses it.
Instead of using the pandas.Series datetime index for every single
vector, get the index at the beginning of the update loop based on the
dt and then use that index to set the values.
Also, since the dt lookup is no longer needed, store the values as numpy
arrays, which are more lightweight.
Locally, this patch cuts out about 60% of the time spent in the update
method.