Remove pieces that are no longer used now that the simple transforms are
wrappers around history via the SIDData object.
Move window length related pieces into batch_transform, since the rest
of the utils module is no longer used.
Overhaul the core HistoryContainer logic to be more robust to changing
universes.
Major Changes
-------------
* Remove `return_frame` cache. The original purpose of using
return_frames was to avoid having to create new DataFrames on each
iteration of handle_data, but we ended up having to copy the return
frames anyway because user code could mutate the frames in place.
Removing the return_frames reduces unnecessary copying, and reduces
the logic of `get_history` to just forward-filling and concatenating
two DataFrames.
* Use a `MultiIndex`ed DataFrame to represent
`last_known_prior_values`. This makes lookups faster and greatly
simplifies the logic of adding and dropping sids.
* HistoryContainer no longer attempts to determine its universe based on
the contents of its internal buffers. The TradingAlgorithm
controlling the container is now responsible for explicitly calling
`add_sids` or `drop_sids` when securities enter or leave the
algorithm's universe. These methods, along with the internal
`_realign` method, provide a clean interface for changing the universe
of securities managed by the container.
* Refactor index mutation logic in `RollingPanel` into a
`MutableIndexRollingPanel` subclass. Maintenance of the old behavior
is regrettably necessary to support `BatchTransform`.
* Refactor shared logic from `roll` and `get_history` into a single
`aggregate_ohlcv_panel` method that's responsible for collapsing an
OHLCV buffer into a frame.
In support of source that emits a subclass of Event which defines some
fields as properties instead of doubling the value in the
`Event.__dict__`
Use hasattr instead of the overridden __contains__ method of the Event
class, so that when non-algorithm facing code checks for field existence,
properties count.
Intentionally not touching the `__contains__` in Event, to avoid
changing, at the moment, any algo behavior that relies on the
`__contains__` behavior's use of `__dict__`
Fixes a crash in various transforms when providing CUSTOM events whose fields
don't match the fields required for the transform.
This is fixed by requiring all `EventWindow` subclasses to supply a `fields`
property, which returns a list of strings that are required keys for any event
that can be processed by the window. Any CUSTOM events the don't supply the
required fields for a transform window are ignored by that window.
Adding a copy of the Event's dt field as datetime via the
`alias_dt` generator, so that the API was forgiving and allowed
both datetime and dt on a SIDData object, was creating noticeable
overhead, even on an noop algorithms.
Instead of incurring the cost of copying the datetime value and
assigning it to the Event object on every event that is passed
through the system, add a property to SIDData which acts as an
alias `datetime` to `dt`.
Eventually support for `data['foo'].datetime` may be removed,
and could be considered deprecated.
The WrongDataForTransform was referencing a `self.fields` member,
which did not exist.
Add a self.fields member set to `price` and `volume` and use
it to iterate over during the check.
Use six's with_metaclass to have objects that use metaclasses, in
both Python 2 and 3.
Otherwise, in Python 3 the objects were being treated as if they
did not have a metaclass, when the Python 2 syntax is used, leading
to errors because of missing attributes, etc.
Instead of porting these cases of type checking, remove them instead.
Slightly more Python-ic to be more generous in what is allowed, and
the conversion to make these compatible with Python 3 are more trouble
than they are worth.
Use the six module to import functions and types that are
consistent between Python 2 and 3, so that one code base can
support both versions.
- Use integer types instead of int and long.
- Use string_types instead of basestring.
- Account for iteritems, itervalues, iterkeys.
- Use six.moves for filter and zip, reduce
- Use compatible bytes for md5 hasher.
- xrange and range
`for s in data` and methods like `for s in data.keys` were not producing
the same list of active sids
Make the other iteration methods match __iter__ by using the contains
method to check whether or not the sid is active.
For use of data outside of the algoscript context, which needs access
to all data fields use data._data
The underlying RollingPanel in batch_transform was always accumulating
all values to ever appear in data.
However, at any given algo time the desired return value is what the
current active sids are.
Instead, mask down to the sids that are passed in as the data parameter.
So that with minute data, 2.5 orders of magnitude of data can
be cut, allowing for longer window_lenghts, when the daily
values are what are desired for a signal.
Expect the same shape of data for the supplemental data, to make
working and preparing with the supplemental data consistent with
what is passed to the algorithm.
For TALib functions like MACD that have output names, return a
DataFrame that for which the columns are the output names of the
function.
So that when using a TALib function, the algorithm doesn't need
to know the index position of the desired result, in favor of using
the name of the result.
e.g.
```
macd_result['AAPL'][0]
```
becomes,
```
macd_result['AAPL']['macd']
```
and
```
macd_result['AAPL'][1]
```
becomes,
```
macd_result['AAPL']['macdsignal']
```
Also, change return type of functions that return floats from a
dictionary to a Series, so that the function is always returning a
pandas type.
If a stock stops gettign updated values, e.g. if a stock rolls out
of a universe strategy, currently the underlying batch transform
for TALib may have nans (which is another issue that could be addressed),
the nans cause crashes when passed to some TALib function, e.g. Bollinger
Bands are incompatible with all nan values.
So, drop sids that only have nan values for the current data panel.
The deepcopy of events into the EventWindow's ticks was causing
a significant increase in memory consumption, e.g. an algorithm with
almost 200 sids and 14 vwaps removing the deepcopy reduces the amount
of memory consumed by about 40%.
The downside is that if an event's properties are changed, which is
not advised, later on, then the signal derived from vwap etc.
may be changed.
A multi-stock TALib transform was returning the same value for
all stocks, specifically the value for the first stock in the panel.
Index into the datapanel using `sid` instead of using the `[0:]`
index which was used when only supporting one sid.
The TALib transform only supported operating on the first value
of a given batch transform panel row.
Instead of returning the one value, even if an panel with multiple
sids was provided, return a dictionary that maps stock to TALib
result.
Prepare for making the zipline_wrapper operate on multiple sids,
as the needed nested logic will get cramped within the nested function.
Also, should help clearly define the inputs of the zipline_wrapper
function that are needed before it is passed to the BatchTransform
constructor.
Set the `compute_only_full` to False so that the 'is window full' logic
is delegated to the TALib's lookback function.
If the window is not full to the `timeperiod` or other lookback setting,
then TALib returns a `np.nan`.
Also, fix the bars/data_frequency not being passed to the BatchTransform
init.
This further shows need to create a minute test for TALib transforms.
Add a `bars` keyword arg, as is used with BatchTransform.
Also, instead of overwriting the window_length kwarg with timeperiod,
always use the lookback value from the created TALib function,
as timeperiod will be an input into that value if it exists.
Calculate `window_length` in minute mode so that there are enough
days to cover the minutes in the timeperiod.
For the creation of a TALib transform use timeperiod intsead of
window_length, to be more in the style of TALib usage, since all
TALib functions may not ending up using BatchTransform, so start
the practice of adhering to TALib conventions to make porting and
explanation easier.
Python 3 requires using dot syntax for relative imports,
otherwise the import is treated as an absolute import, i.e.
an import of a module from outside of the project.
By using dot syntax now, imports should be compatible with both
Python 2.7 and Python 3.
In the batch_transform we were incrementing the trading_days counter if there
is a new day event. Thus with a window_length of 1 and daily bars you will
update the batch_transform on the first day which is correct. But with minutes
you update with the first minute bar of the day which is not correct.
This is fixed by calculating the market_close explicity and seeing whether the
event.dt is on or past it.
I also added a unittest to test the correct behavior of this.
Before the change to the RollingPanel, window_length
specified the number of days that should be in a window.
The previous commit broke this if data was minute resolution.
By passing bar='minute' to the batch_transform we internally
multiply the window_length by 60*6.5 to have a full day.
Also adds a (still rudamentary) test for batch_transform
with minute data.
When setting timeperiod in the talib function it subtracts by 1. We then used this subtracted value to set the window_length in the batch_transform which was then not passing a big enough panel. Ultimately this caused the talib transforms to always return nans.
This also makes the unittest more stringent by explicitly comparing the output of the wrapped TALib moving average to pandas rolling_mean().
Finally, this also allows passing of window_length instead of timeperiod to allow usage of the same interface as before.