Commit Graph

25 Commits

Author SHA1 Message Date
Stewart Douglas 17c20da026 TST: Add test to append minutely data 2016-05-26 09:38:25 -04:00
Andrew Daniels f1cfe1f2db BUG: Fixes bcolz padding to not always pad 390 minutes
If minutes already exist for the last existing day, adjust the number of
minutes padded to account for them. Previously we would always pad 390,
leading to a mismatch in the number of rows.
2016-05-25 14:26:16 -04:00
Stewart Douglas 8217cdb1bd ENH: Allow BcolzMinuteBarWriter to append to most recent day
Minutely data can now be appended to bcolz files even when
minutes in the same day have already been written. For example,
previously attempting to write data for the minute 2016-05-11 16:30
would raise an exception if any OHLCV data for 2016-05-11 had been
written to the same file.

Trying to overwrite existing minutes still raises a
BcolzMinuteOverlappingData exception.

Note that previously all sids' bcolz files ended at the same time.
This is no longer necessarily the case. The last record in each
sid's bcolz file now corresponds to the latest minute for which
OHLCV data is provided to the writer.
2016-05-13 16:24:21 -04:00
Joe Jevnik 55f1548160 BUG: fix inverted splits in quandl data 2016-05-09 14:00:35 -04:00
Joe Jevnik d819721d96 ENH: use more human readable format for bundle ingest directories
We are now using isoformats with ':' replaced with ';'. We cannot use a
normal isoformat because windows does not allow files or directories
with ':' in the name.
2016-05-05 18:22:13 -04:00
Joe Jevnik 89542e33bd ENH: Adds quantopian-quandl bundle as new default.
This data bundle will use the quantopian mirror of the quandl WIKI data
instead of downloading from quandl directly. This dramatically improves
the speed because we do not pay the rate limiting for quandl and we can
send the data in the format zipline expects.
2016-05-05 18:22:13 -04:00
Joe Jevnik 59c8e371a2 ENH: Updates the cli, data bundles and extensions.
Adds the data bundle concept which makes it easy for users to register
loading functions to build out minute and daily data along with an
assets db and adjustments db. By default we have provided a `quandl`
bundle which pulls from the public domain WIKI dataset. Users may
register new bundles by decorating an ingest function with
`zipline.data.bundles.register(<name>)`. This also provides a
`yahoo_equities` function for creating an ingestion function that will
load a static set of assets from yahoo.

The cli is now structured as a couple of subcommands and has been
changed to `python -m zipline`. The old behavior of `run_algo.py` has
been moved to the `run` subcommand. This is almost entirely the same
except that it now takes the name of the data bundle to use, defaulting
to `quandl`.

The next subcommand is `ingest` which takes the name of
a data bundle to ingest. This will run the loading machinery and write
the data to a specified location that `run` can find.

There is also a `clean` subcommand which deletes the data that was
written with `ingest`.

Extensions have also been added to zipline. This is an experimental
feature where users can provide an extra set of python files to run at
the start of the process. These can be used to configure aspects of
zipline. Right now the only thing that is supported in an extension file
is the registration of a new data bundle.
2016-05-03 18:38:24 -04:00
Joe Jevnik efac476976 ENH: make BcolzMinuteBarWriter.write take iterable
Updates the BcolzMinuteBarWriter.write api to allow users to pass their
data as a stream instead of requiring that they loop over their data
externally. This matches the API presented by BcolzDailyBarWriter.
2016-04-29 16:14:48 -04:00
Eddie Hebert 66d05aa563 PERF: Improve read time for smaller num of assets.
The BcolzDailyBarReader was optimized for the pipeline case of reading
all assets at once.

Now that the reader is also used to support daily history the case of
reading a data for a small number of assets is more common, particularly
in algorithms that use the history API which have a high rotation of
assets (e.g. an algorithm which pipeline uses to set the active
universe)

Remove the bottleneck in reading a small number of assets by
conditionally reading the slice for each asset from the carray, instead
of reading the data for all equities and then indexing into that full
array. On a certain number of assets, it is still better to read all the
data at once. On the Quantopian dataset, which holds data for 20000
about for the last 10 years of equity data (where not all equities trade
over the full range), stored in 118 blosc blp files per column, the
tipping point where the 'read all' mode wins out between 3000-4000
assets.

That number was tested by trying to exercise a worst case scenario where
the equities were spread out evenly across the blp files, by stepping
along a sorted list of assets that were alive over a query range which
spanned 70 trading days.
```
size = 3000
sids = [assets[i] for i in range(0, len(assets), len(assets) /
size)][:size]
```

Also, add parameter to WithBcolzDailyBarReader fixture which allows the
test to specify what the threshold count for reading all data should be,
so that the test_us_equity_pricing can be forced into either mode to
make sure that both branches in logic are covered by all test cases.

On local dev machine this patch improves the read time of `load_raw_array`
for one asset from 100 ms to 96.5 µs. (10^5 improvement.) With reading
only asset per call a being an observed common case when populating the
non-cached values in USEquityHistoryLoader.
2016-04-21 20:43:52 -04:00
Joe Jevnik bc0b117dc9 MAINT: make the data loading apis more consistent.
Changes BcolzDailyBarWriter to not be an abc, data is passed as an
iterator of (sid, dataframe) pairs to the write method.

Changes the AssetsDBWriter to be a single class which accepts an engine
at construction time and has a `write` method for writing dataframes for
the various tables. We no longer support writing the various other data
types, callers should coerce their data into a dataframe themselves. See
zipline.assets.synthetic for some helpers to do this.

Adds many new fixtures and updates some existing fixtures to use the new
ones:

WithDefaultDateBounds
  A fixture that provides the suite a START_DATE and END_DATE. This is
  meant to make it easy for other fixtures to synchronize their date
  ranges without depending on eachother in strange ways. For example,
  WithBcolzMinuteBarReader and WithBcolzDailyBarReader by default should
  both have data for the same dates, so they may use depend on
  WithDefaultDates without forcing a dependency between them.

WithTmpDir, WithInstanceTmpDir
  Provides the suite or individual test case a temporary directory.

WithBcolzDailyBarReader
  Provides the suite a BcolzDailyBarReader which reads from bcolz data
  written to a temporary directory. The data will be read from
  dataframes and then converted to bcolz files with
  BcolzDailyBarWriter.write

WithBcolzDailyBarReaderFromCSVs
  Provides the suite a BcolzDailyBarReader which reads from bcolz data
  written to a temporary directory. The data will be read from a
  collection of CSV files and then converted into the bcolz data through
  BcolzDailyBarWriter.write_csvs

WithBcolzMinuteBarReader
  Provides the suite a BcolzMinuteBarReader which reads from bcolz data
  written to a temporary directory. The data will be read from
  dataframes and then converted to bcolz files with
  BcolzMinuteBarWriter.write

WithAdjustmentReader
  Provides the suite a SQLiteAdjustmentReader which reads from an in
  memory sqlite database. The data will be read from dataframes and then
  converted into sqlite with SQLiteAdjustmentWriter.write

WithDataPortal
  Provides each test case a DataPortal object with data from temporary
  resources.
2016-04-15 23:46:10 -04:00
Eddie Hebert d27f85e16b BUG: Enforce sorted order on minutes to delete.
The intervals are returned as a set, so order is not guaranteed,
which becomes exposed when reading windows which span multiple years.

The deletion of values from the regular sized minute array assumes that
intervals can be reversed to delete the array from the back.
2016-04-12 14:16:10 -04:00
Eddie Hebert 0a3a2f3653 BUG: Ensure matched input length to minute writer.
When the dts and length of cols are mismatched the writer behaves in
unintended ways. e.g. in a case where a consumer passed dts which had
minutes with no trades removed, but regular (market minute for day)
sized arrays for the data with `0`'s on minutes without trades, the non
trade minutes from cols are written to slots in the output where a trade
is intended.

Protect against this misuse by checking that all lengths are equal when
using the `write_cols` method.

Make a separate `_write_cols` method for use by both `write_cols` and
`write`, since the `write` method which takes a DataFrame has the
matched input length enforced by the DataFrame.
2016-04-07 13:53:59 -04:00
Eddie Hebert 16fd6681a6 ENH: Rewrite of Zipline to use lazy access pattern
More documentation to follow in release notes.

Based on lazy-mainline branch, see for more details.

Also-By: Jean Bredeche <jean@quantopian.com>
Also-By: Andrew Liang <aliang@quantopian.com>
Also-By: Abhijeet Kalyan <akalyan@quantopian.com>
2016-04-04 16:12:58 -04:00
Eddie Hebert be08a77d76 BUG: Prevent writing int max instead of nan.
np.array.astype can not be relied upon to convert nan's reliably to 0

Fix by calling nan_to_num on the float arrays before converting to
uint32.
2016-03-30 14:35:06 -04:00
Eddie Hebert 75213ac176 MAINT: Write open and closes for minute bar format
Write arrays representing corresponding market opens and market closes,
which will eventually replace the `minute_index` field.

The market closes are being added for incoming work on another branch
which will use the market closes to generate a list of non-market
minutes to filter out when returning data from `unadjusted_window`.
2016-03-24 23:18:42 -04:00
Eddie Hebert 0f14972e08 ENH: Unadjusted window data for minute bars.
Add a method to minute bar reader which returns the OHLCV for all
requested fields for a list assets over the specified start and end
minutes.

Initial usage is intended for use by a loader which consumes minute bar
data to resample into daily bars, but may also be used when aggregating
minute data during '1d' history calls in Q2.0.

This iteration does not include including of early closes.
2016-03-14 21:52:01 -04:00
Joe Jevnik 721dd36116 TST: move test_utils and adds test fixture classes
Renames zipline.utils.test_utils to zipline.testing

Adds zipline.testing.fixtures.ZiplineTestCase to manage setup and
teardown and adds mixins to define fixtures like an asset finder or
trading calendar.
2016-03-10 15:39:52 -05:00
Eddie Hebert 27f94f83fa ENH: Allow passing of numpy arrays to writer.
For faster parsing and writing workflows, do not require a DataFrame.
2016-02-02 14:03:42 -05:00
Eddie Hebert 488721e805 ENH: Add padding method to minute bars writer.
So that consumers can write empty days worth of data, without needing
to construct a DataFrame with zero data force a write.

The internal loader uses `last_date_in_output_for_sid` to signify that
data has been attempted to be retrieved for all dates up until that, so
that when resuming a job those retrieval of data for those dates are not
re-attempted.

Also, used to make the write logic cleaneer, by making it only
necessary to create an array large enough for the given df.
2016-02-01 14:18:22 -05:00
Eddie Hebert 984e934e83 BUG: Fix OSError when creating sids that share dir
Fix a bug where creating a sid bcolz file when the containing directory
was already occupied by a sid caused an OSError on attempt of creating
the directory because it already existed.

e.g. if there were two sids, `1` and `2`. The paths would be
`00/00/000001.bcolz` and `00/00/000002.bcolz` which share the same
directory `00/00`.

Fixed by checking for directory existence before calling `makedirs`.

Add test coverage which exercises writing of sids that are siblings in
the sid directory structure.
2016-01-25 10:37:50 -05:00
Eddie Hebert d5c3b5a15c ENH: Add writer for minute bcolz format.
Implement a writer for minute data into a format comprised of multiple
ctables, one for each individual asset, with a common 'index' shared by
all ctables where a given a dt maps to the same array index for all
equities and fields.

This format is pulled from the lazy-mainline/Q2.0 branch, with some
changes to the interface.

Add basic retrieval of values at a given dt to reader. Not yet used by
Zipline simulations, but added to support unit tests.

Also, rename stubbed out us_equity_minutes to minute_bars, since the
writer can be agnostic to asset type.
2016-01-21 10:54:27 -05:00
Eddie Hebert 5f81acea05 ENH: Return -1 for missing spot prices.
Return -1 when there is a zero value for a spot price.
Intended for use by the incoming data portal changes. When the data
portal will see a -1 value, the portal will seek back a trading day
until a non-negative value is returned.
2015-11-25 11:32:36 -05:00
Eddie Hebert 53dae6320c BUG: Fix volume value returned by daily spot price
Volumes were incorrectly having the thousands factor applied, however
the volume is written as is (without the factor, since it volume is an
int, not float value.)

Fix by adding a special case for volume which returns the price as is.
2015-11-25 10:19:52 -05:00
Eddie Hebert 5338c8e611 ENH: Add spot_price to BcolzDailyBarReader.
Add new method to BcolzDailyBarReader, `spot_price` which returns the
unadjusted price for the specified day and sid.
2015-10-10 07:19:03 -04:00
Eddie Hebert e33f6dcdcd MAINT: Move equity data formats out of loader.
Put the logic for reading and writing the equity price and adjustment
data into a module located in data, making it distinct from the pipeline
loader usage of the formats.

This prepares for both incoming changes of how adjustments are written,
(which includes using the bcolz daily reader as an input), as well as
eventually providing the readers to a DataPortal object.
2015-10-09 17:20:19 -04:00