`data.loader.ensure_benchmark_data()` was trying to use data after an exception was raised while loading it. The code was logging and swallowing exceptions; this change re-raises them.
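A minimal sketch of the log-and-re-raise pattern described above. The helper name `load_benchmark` and the `fetch` callable are hypothetical stand-ins, not zipline's actual code:

```python
import logging

logger = logging.getLogger(__name__)


def load_benchmark(fetch):
    """Call ``fetch`` and return its result, logging and re-raising on failure.

    ``fetch`` is a hypothetical zero-argument callable standing in for the
    real download step.
    """
    try:
        data = fetch()
    except Exception:
        # Log the traceback, then propagate the error so that callers
        # never go on to use a value that was never assigned.
        logger.exception("Failed to load benchmark data")
        raise
    return data
```

With the bare `raise`, the original exception type and traceback reach the caller unchanged.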
Remove module scope invocations of `get_calendar('NYSE')`, which cuts
zipline import time in half on my machine. This makes the zipline CLI
noticeably more responsive, and it reduces memory consumed at import
time from 130MB to 90MB.
Before:
$ time python -c 'import zipline'
real 0m1.262s
user 0m1.128s
sys 0m0.120s
After:
$ time python -c 'import zipline'
real 0m0.676s
user 0m0.536s
sys 0m0.132s
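One common way to avoid this kind of import-time cost is to build the expensive object on first use and cache it. The sketch below assumes Python 3 (`functools.lru_cache`) and uses a hypothetical `_build_calendar` stand-in; the message does not say how zipline itself defers the call:

```python
import functools

CALLS = []  # records how often the expensive builder actually runs


def _build_calendar(name):
    # Hypothetical stand-in for an expensive constructor such as
    # get_calendar('NYSE').
    CALLS.append(name)
    return {"name": name}


@functools.lru_cache(maxsize=None)
def lazy_calendar(name):
    """Build the calendar on first use and reuse it afterwards."""
    return _build_calendar(name)
```

Importing a module that defines `lazy_calendar` does no calendar work; the cost is paid once, by the first caller.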
Instead of having separate ExchangeCalendar and TradingSchedule objects, we
now just have TradingCalendar. The TradingCalendar keeps track of each
session (defined as a contiguous set of minutes between an open and a close).
It's also responsible for handling the grouping logic of any given minute
to its containing session, or the next/previous session if it's not a market
minute for the given calendar.
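The minute-to-session grouping can be sketched with a binary search over session closes. `ToyCalendar` and its method name are illustrative only and are not the TradingCalendar API:

```python
import bisect
from datetime import datetime


class ToyCalendar(object):
    def __init__(self, sessions):
        # ``sessions`` is a sorted, non-overlapping list of (open, close)
        # datetime pairs, both ends inclusive.
        self.opens = [o for o, _ in sessions]
        self.closes = [c for _, c in sessions]

    def minute_to_session(self, minute, direction="next"):
        """Return the index of the session containing ``minute``.

        If ``minute`` is not a market minute, return the next or the
        previous session, depending on ``direction``.
        """
        # First session whose close is at or after ``minute``.
        i = bisect.bisect_left(self.closes, minute)
        if i < len(self.opens) and self.opens[i] <= minute:
            return i  # ``minute`` falls inside session i
        if direction == "next" and i < len(self.opens):
            return i
        if direction == "previous" and i > 0:
            return i - 1
        raise ValueError("no %s session for %s" % (direction, minute))


calendar = ToyCalendar([
    (datetime(2016, 1, 4, 14, 31), datetime(2016, 1, 4, 21, 0)),
    (datetime(2016, 1, 5, 14, 31), datetime(2016, 1, 5, 21, 0)),
])
```

Here a minute between the two sessions, such as 22:00 on January 4, maps to index 1 with `direction="next"` and to index 0 with `direction="previous"`.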
Adds the data bundle concept, which makes it easy for users to register
loading functions to build out minute and daily data along with an
assets db and adjustments db. By default we have provided a `quandl`
bundle which pulls from the public domain WIKI dataset. Users may
register new bundles by decorating an ingest function with
`zipline.data.bundles.register(<name>)`. This also provides a
`yahoo_equities` function for creating an ingestion function that will
load a static set of assets from Yahoo.
The CLI is now structured as a couple of subcommands and has been
changed to `python -m zipline`. The old behavior of `run_algo.py` has
been moved to the `run` subcommand. This is almost entirely the same
except that it now takes the name of the data bundle to use, defaulting
to `quandl`.
The next subcommand is `ingest` which takes the name of
a data bundle to ingest. This will run the loading machinery and write
the data to a specified location that `run` can find.
There is also a `clean` subcommand which deletes the data that was
written with `ingest`.
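The subcommand layout can be sketched with the standard library's `argparse`. This is not zipline's implementation: the `--bundle` option name, the positional arguments, and the idea that `clean` takes a bundle name are all assumptions made for illustration:

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="python -m zipline")
    subparsers = parser.add_subparsers(dest="command")

    # ``run`` takes the bundle to use, defaulting to quandl.
    run = subparsers.add_parser("run", help="run an algorithm")
    run.add_argument("--bundle", default="quandl")  # option name is assumed

    # ``ingest`` takes the name of the bundle to ingest.
    ingest = subparsers.add_parser("ingest", help="ingest a data bundle")
    ingest.add_argument("bundle")

    # ``clean`` deletes data written by ``ingest``.
    clean = subparsers.add_parser("clean", help="delete ingested data")
    clean.add_argument("bundle")  # assumed; the message does not say
    return parser
```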
Extensions have also been added to zipline. This is an experimental
feature where users can provide an extra set of python files to run at
the start of the process. These can be used to configure aspects of
zipline. Right now the only thing that is supported in an extension file
is the registration of a new data bundle.
Rather than repeatedly try and fail to download data that's not yet
available, only try to download again if we haven't successfully
downloaded in the last hour.
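The retry rule can be sketched as a small predicate. `should_download` and `RETRY_INTERVAL` are hypothetical names, and where the last-success time comes from (for example a file's modification time) is left open here:

```python
from datetime import datetime, timedelta

RETRY_INTERVAL = timedelta(hours=1)


def should_download(last_success, now):
    """Return True unless a download succeeded within the last hour.

    ``last_success`` is the datetime of the last successful download,
    or None if there has never been one.
    """
    if last_success is None:
        return True
    return now - last_success >= RETRY_INTERVAL
```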
Previously we were using Close, and we calculated returns on the first
day of a window against the Open for that day. We now always look back
an extra day to get the previous day's close.
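A small pandas sketch of close-to-close returns with one extra day of lookback. The prices are made up for illustration and the variable names are not taken from the codebase:

```python
import pandas as pd

# Four closes: one lookback day followed by a three-day window.
closes = pd.Series(
    [100.0, 102.0, 101.0, 104.03],
    index=pd.date_range("2016-01-04", periods=4, freq="D"),
)
window_start = closes.index[1]

# pct_change compares each close with the previous close; dropping the
# lookback row leaves a return for every day of the window, including
# the first one.
returns = closes.pct_change().loc[window_start:]
```

Without the extra day, the first return in the window would be NaN, or would have to be computed against that day's open instead.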
- Fixes an issue with the Canadian treasury loader where it would always
redownload: it can only download data from the last 10 years, so it
never had enough data to skip the download.
- Uses module objects directly instead of lazy imports.
- Adds lots of docstrings.
Replaces our custom XML parsing with a single call to `pd.read_csv`
against the Federal Reserve's API. This produces results nearly
identical to those of the old loader, but it's dramatically simpler and
roughly 10x faster on my machine.
The mean absolute difference between new and old is at most about
1.5e-7 in any column, and only one entry differs by an amount visible
at the precision provided by treasury.gov.
Additionally, the new loader correctly ignores Columbus Day of 2010, for
which the old loader erroneously produced an all-NaN row.
This also changes the interface that treasury modules are
required to implement. Modules must now supply a `get_treasury_data`
function that returns a `DataFrame` with a daily `DatetimeIndex` and a
column for each supported treasury duration.
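The expected return shape can be sketched as a small check. `check_treasury_frame` is a hypothetical helper, and the `DURATIONS` list assumes the US durations that appear in the comparison output in this message; other treasury modules may support a different set:

```python
import pandas as pd

# Duration labels as they appear in the comparison output in this message.
DURATIONS = ["1month", "3month", "6month", "1year", "2year", "3year",
             "5year", "7year", "10year", "20year", "30year"]


def check_treasury_frame(frame):
    """Raise ValueError unless ``frame`` has the expected index and columns."""
    if not isinstance(frame.index, pd.DatetimeIndex):
        raise ValueError("index must be a DatetimeIndex")
    missing = [d for d in DURATIONS if d not in frame.columns]
    if missing:
        raise ValueError("missing duration columns: %s" % missing)
    return frame


# Placeholder frame with NaN values, used only to show the layout.
example = pd.DataFrame(
    float("nan"),
    index=pd.date_range("2010-10-08", periods=3, freq="D", tz="UTC"),
    columns=DURATIONS,
)
```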
Detailed comparison between results from new and old loader::
import pandas as pd
from zipline.data.treasuries import get_treasury_data
new = get_treasury_data()  # New implementation
old = pd.read_csv(  # Previously cached data
    '/home/ssanderson/.zipline/data/treasury_curves.csv',
    parse_dates=[0],
    index_col=0,
)
# These columns were unused.
del old['tid']
del old['date']
old = old.tz_localize('UTC')
# old data erroneously contained an all-NaN entry for Columbus Day
# in 2010. Remove before comparing.
old = old.dropna(how='all')
In [25]: len(new) == len(old)
Out[25]: True
In [26]: abs(old - new).max()
Out[26]:
10year 2.000000e-04
1month 6.938894e-18
1year 1.000000e-04
20year 1.000000e-04
2year 2.000000e-04
30year 1.000000e-04
3month 1.000000e-03
3year 1.000000e-04
5year 1.387779e-17
6month 1.000000e-04
7year 1.000000e-04
dtype: float64
In [27]: abs(old - new).mean()
Out[27]:
10year 3.097414e-08
1month 4.396534e-19
1year 1.548707e-08
20year 3.624502e-08
2year 4.646120e-08
30year 1.830496e-08
3month 1.549427e-07
3year 1.548707e-08
5year 1.702619e-18
6month 1.548707e-08
7year 1.548707e-08
dtype: float64
Since www.treasury.gov only reports values up to three significant
digits, we should only care about differences of 1e-3 or greater.
There is exactly one such difference: the entry for the three month bond
on 1999-10-01::
In [60]: new[(abs(new - old) >= 1e-3).any(axis=1)].T
Out[60]:
Time Period 1999-10-01 00:00:00+00:00
1month NaN
3month 0.0498
6month 0.0501
1year 0.0530
2year 0.0573
3year 0.0583
5year 0.0590
7year 0.0622
10year 0.0600
20year 0.0657
30year 0.0615
In [61]: old[(abs(new - old) >= 1e-3).any(axis=1)].T
Out[61]:
1999-10-01 00:00:00+00:00
10year 0.0600
1month NaN
1year 0.0530
20year 0.0657
2year 0.0573
30year 0.0615
3month 0.0488
3year 0.0583
5year 0.0590
6month 0.0501
7year 0.0622
The US Treasury website (our old source) provides a value of 0.0488 here
(as shown above), whereas the Federal Reserve site (our new source)
provides a value of 0.0498.
Previously we were converting our date to a string, then calling
`searchsorted` on the DatetimeIndex with the string, which would cause
pandas to convert the string back into a date to actually do the lookup.
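A short sketch of the direct lookup, using an illustrative index. Passing a `pd.Timestamp` avoids the round trip through a string:

```python
import pandas as pd

index = pd.date_range("2014-01-01", periods=5, freq="D", tz="UTC")
target = pd.Timestamp("2014-01-03", tz="UTC")

# Search with the Timestamp itself rather than with str(target).
position = index.searchsorted(target)
```

Since the index is timezone-aware, the Timestamp is built with a matching timezone.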
This patch lays the groundwork for a compute engine designed to
facilitate construction of factor-based universe screening and portfolio
allocation. It contains:
A new module, `zipline.modelling`, containing entities that can be used
to express computations as dependency graphs. Each node in such a graph
is an instance of the base `Term` class, defined in
`zipline.modelling.term`. Dependency graphs are executed by instances
of `FFCEngine`, defined in `zipline.modelling.engine`.
A new module, `zipline.data.ffc`, containing loaders and dataset
definitions for inputs to the modelling API.
New `TradingAlgorithm` API methods: `add_factor` and `add_filter`.
These methods can only be called from `initialize`, and are used to
inform the algorithm that each day it should compute the given terms.
Computed factor results are made available through a new attribute of
the `data` object in `before_trading_start` and `handle_data`. Computed
filter results control which assets are available in the factor matrix
on each day.
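The idea of expressing computations as a dependency graph can be sketched with a toy term class and a recursive evaluator. `ToyTerm` and `execute` are illustrative names and are not the `Term` or `FFCEngine` implementations:

```python
class ToyTerm(object):
    """A node whose value is computed from the values of its inputs."""

    def __init__(self, name, inputs, compute):
        self.name = name
        self.inputs = tuple(inputs)
        self.compute = compute


def execute(term, cache=None):
    """Evaluate ``term`` after its dependencies, computing each node once."""
    if cache is None:
        cache = {}
    if term.name not in cache:
        args = [execute(dep, cache) for dep in term.inputs]
        cache[term.name] = term.compute(*args)
    return cache[term.name]


# A raw input, a factor-like term derived from it, and a filter-like term
# that depends on both.
close = ToyTerm("close", [], lambda: [10.0, 11.0, 12.0])
mean_close = ToyTerm("mean_close", [close], lambda c: sum(c) / len(c))
above_mean = ToyTerm("above_mean", [close, mean_close],
                     lambda c, m: [x > m for x in c])
```

The shared cache means `close` is computed only once even though two terms depend on it.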
On Ubuntu (and presumably on all POSIX systems), tickers containing a slash character (e.g. "CRD/A" and "BRK/A", both valid tickers with time series accessible through the Yahoo API) lead to a path error in loader.py line 286.
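One way to avoid the path error is to replace the separator before building a filename. `cache_filename` is a hypothetical helper, and the `--` replacement is an arbitrary choice made for this sketch:

```python
import os


def cache_filename(symbol, directory):
    """Build a cache path for ``symbol`` without creating subdirectories."""
    # A slash in the ticker would otherwise be read as a directory
    # separator on POSIX systems.
    safe = symbol.replace("/", "--")
    if os.sep != "/":
        safe = safe.replace(os.sep, "--")
    return os.path.join(directory, safe + ".csv")
```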
Python 2 and 3 throw different exception types when a file does
not exist.
Catch both exception types to trigger the download, so that the
loader works under both Python versions.
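A minimal sketch of catching both exception types, assuming a hypothetical `read_or_download` helper and `download` callable. Opening a missing file raises `IOError` on Python 2 and `FileNotFoundError`, a subclass of `OSError`, on Python 3:

```python
def read_or_download(path, download):
    """Return the contents of ``path``, downloading them if it is missing."""
    try:
        with open(path) as f:
            return f.read()
    except (IOError, OSError):
        # The file could not be read: fetch the data and cache it.
        data = download()
        with open(path, "w") as f:
            f.write(data)
        return data
```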
Compatibility between the two versions was made easier by
letting pandas do the heavy lifting: pass filenames to the
pandas serialization methods instead of doing the file
handling and reading/writing within the data module.
Use the six module to import functions and types that are
consistent between Python 2 and 3, so that one code base can
support both versions.
- Use integer types instead of int and long.
- Use string_types instead of basestring.
- Account for iteritems, itervalues, iterkeys.
- Use six.moves for filter, zip, and reduce.
- Use compatible bytes for md5 hasher.
- Account for xrange and range.
- Use `print()` function for all print calls
- Fix strip and format calls that were on the outside of the
print function for some reason (these were breaking in Python 3
because print returns None).
- Remove commented out print calls.
Check whether the index's timezone is already UTC before
attempting to localize, since calling tz_localize on an already
localized index raises an error.
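A minimal sketch of the guard, assuming a hypothetical `ensure_utc` helper. It localizes only naive indexes and converts indexes that already carry a timezone:

```python
import pandas as pd


def ensure_utc(index):
    """Return ``index`` in UTC without localizing an already aware index."""
    if index.tz is None:
        return index.tz_localize("UTC")
    # Already timezone-aware: tz_localize would raise, so convert instead.
    return index.tz_convert("UTC")
```

Calling `ensure_utc` twice is safe, since the second call takes the `tz_convert` branch.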
Remove the lists of DailyReturn objects in favor of using pd.Series
to store the return values.
This should make it easier to inspect the values when stepping through,
make windowing the data to a certain range simpler, and improve
performance somewhat by removing object creation and member access.
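Windowing a `pd.Series` of returns to a date range reduces to a label slice. The values below are made up for illustration:

```python
import pandas as pd

returns = pd.Series(
    [0.01, -0.02, 0.005, 0.0],
    index=pd.date_range("2013-01-01", periods=4, freq="D"),
)

# Label-based slices on a DatetimeIndex include both endpoints.
window = returns.loc["2013-01-02":"2013-01-03"]
```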
The dump and update of curves were both using the entire history,
so instead of having the update use a different code path, always
use dump and overwrite.
Both unit tests and repeated runs while developing an algorithm
can benefit from having a local copy of the Yahoo data, instead
of doing a network call each time.
Store the web request results as a csv file in a cache directory,
named by symbol and date range.
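A sketch of the caching scheme, assuming a hypothetical `cached_fetch` helper and an injected `fetch` callable in place of the real web request. The filename layout is an assumption; the message only says files are named by symbol and date range:

```python
import os


def cached_fetch(symbol, start, end, cache_dir, fetch):
    """Return CSV text for ``symbol``, calling ``fetch`` only on a cache miss."""
    name = "%s-%s-%s.csv" % (symbol, start, end)
    path = os.path.join(cache_dir, name)
    if os.path.exists(path):
        with open(path) as f:
            return f.read()
    # Cache miss: perform the (possibly slow) request and store the result.
    text = fetch(symbol, start, end)
    with open(path, "w") as f:
        f.write(text)
    return text
```

Passing `fetch` in as an argument lets unit tests substitute a fake request and check that a second call never touches the network.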