DOC: blaze loader docstring

This commit is contained in:
llllllllll
2015-10-06 18:07:16 -04:00
parent 2df4dfd1bb
commit 8ec562e2ed
+121
View File
@@ -1,4 +1,125 @@
"""Blaze integration with the Pipeline API.
For an overview of the blaze project, see blaze.pydata.org
The blaze loader for the Pipeline API is designed to allow us to load
data from arbitrary sources as long as we can execute the needed expressions
against the data with blaze.
Data Format
-----------
The blaze Pipline API loader expects that data is formatted in a tabular way.
The only required column in your table is ``asof_date`` where this column
represents the date this data is referencing. For example, one might have a CSV
like:
asof_date,value
2014-01-01,0
2014-01-02,1
2014-01-03,2
This says that the value on 2014-01-01 was 0 and so on.
Optionally, we may provide a ``timestamp`` column to be used to represent
point in time data. This column tells us when the data was known, or became
available to for use. Using our same CSV, we could write this with a timestamp
like:
asof_date,timestamp,value
2014-01-01,2014-01-02,0
2014-01-02,2014-01-03,1
2014-01-03,2014-01-04,2
This says that the value was 0 on 2014-01-01; however, we did not learn this
until 2014-01-02. This is useful for avoiding look-ahead bias in your pipelins.
If this column does not exist, the ``asof_date`` column will be used instead.
If your data references a particular security, you can add a ``sid`` column to
your dataset to represent this. For example:
asof_date,value,sid
2014-01-01,0,10
2014-01-01,1,20
2014-01-02,1,10
2014-01-02,2,20
2014-01-03,2,10
2014-01-03,3,20
This says that on 2014-01-01, the asset with id 10 had a value of 0, and the
asset with id 20 had a value of 1.
One of the key features of the Pipeline API is the handling of adjusments and
restatements. Often our data will be amended after the fact and we would like
to trade on the newest information; however, we do not want to introduce this
knowledge to our model too early. The blaze loader handles this case by
accepting a second ``deltas`` expression that contains all of the restatements
in the original expression.
For example, let's use our table from above:
asof_date,value
2014-01-01,0
2014-01-02,1
2014-01-03,2
Imagine that on the fourth the vendor realized that the calculation was
incorrect and the value on the first was actually -1. Then, on the fifth, they
realized that the value for the third was actually 3. We can construct a
``deltas`` expression to pass to our blaze loader that has the same shape as
our base table but only contains these new values like:
asof_date,timestamp,value
2014-01-01,2014-01-04,-1
2014-01-03,2014-01-05,3
This shows that we learned on the fourth that the value on the first was
actually -1 and that we learned on the fifth that the value on the third was
actually 3. By pulling our data into these two tables and not silently updating
our original table we can run our pipelines using the information we would
have had on that day, and we can prevent lookahead bias in the pipelines.
Conversion from Blaze to the Pipeline API
-----------------------------------------
Now that our data is structured in the way that the blaze loader expects we
are ready to convert our blaze expressions into Pipline API objects.
This module (zipline.pipeline.loaders.blaze) exports a function called
``from_blaze`` which performs this mapping.
The expression that you are trying to convert must either be tabular or
array-like. This means the ``dshape`` must be like:
``Dim * {A: B}`` or ``Dim * A``.
This represents an expression of dimension 1 which may be fixed or variable,
whose measure is either some record or a scalar.
The record case defines the entire table with all of the columns, this maps the
blaze expression into a pipeline DataSet. This dataset will have a column for
each field of the record. Some datashape types cannot be coerced into Pipeline
API compatible types and in that case, a column cannot be constructed.
Currently any numeric type that may be promoted to a float64 is compatible with
the Pipeline API.
The scalar case defines a single column pulled out a table. For example, let
``expr = bz.symbol('s', 'var * {field: int32, asof_date: datetime}')``.
When we pass ``expr.field`` to ``from_blaze``, we will walk back up the
expression tree until we find the table that ``field`` is defined on. We will
then proceed with the record case to construct a dataset; however, before
returning the dataset we will pull out only the column that was passed in.
For full documentation, see ``help(from_blaze)`` or ``from_blaze?`` in jupyter.
Using our Pipeline DataSets and Columns
---------------------------------------
Once we have mapped our blaze expressions into Pipeline API objects, we may
use them just like any other datasets or columns. For more information on how
to run a pipeline or using the Pipeline API, see:
www.quantopian.com/help#pipeline-api
"""
from __future__ import division, absolute_import