47 Commits

Author SHA1 Message Date
Scott Sanderson 493e18252d MAINT: Temporarily ignore pandas warnings in categoricals.
Pandas 0.18 doesn't like having null-ish values in categoricals.  Fixing
this properly requires re-thinking the semantics for missing_value on
pipeline terms, so we're punting on that until after we've upgraded to
0.18.
2016-09-20 17:12:07 -04:00
Scott Sanderson a9c02935c6 Revert "MAINT: Remove support for custom string Column missing values."
This reverts commit 1b1e842e2339d6d0ee40cdfe34dcd27b4e4a7c0c.
2016-09-20 17:12:07 -04:00
Scott Sanderson ed365dc5fe MAINT: Remove support for custom string Column missing values.
Pandas 0.18 deprecated passing "null-ish" values to pd.categorical.  The
expectation, instead, is that you use categorical's native support for
missing data, which means the user will always get NaNs for missing
entries of the categorical.

A follow-up to this change should probably drop support for custom
missing values entirely and use LabelArray/categorical for integer
data.
2016-09-20 17:12:07 -04:00
Scott Sanderson af5f4be17c MAINT: Fix warnings from numpy on NaT comparison. 2016-09-20 17:12:07 -04:00
Scott Sanderson fc3eac36aa DOC: Update LabelArray docstring. 2016-09-20 16:24:55 -04:00
Scott Sanderson b06ef66f44 DOC: Remove out of date comment. 2016-09-20 16:24:55 -04:00
Scott Sanderson 49bb8264dc ENH: Finish adding groupby to rank/top/bottom.
- Added test coverage for grouped and masked top/bottom.

- Added test coverage for grouped rank on datetime factors.

- Fixed an issue where grouped rank would fail on datetime inputs
  because unary-negative isn't defined for datetimes.  We now instead
  directly invoke a function from rank.pyx that does the normalizations
  as needed.

- Fixed an issue where GroupedRowTransform assumed that it produced the
  same dtype as its input.  This isn't true for rank() of a
  datetime-dtype factor.  GroupedRowTransform now takes a required dtype
  parameter.

- Similarly, fixed an issue where GroupedRowTransform assumed that its
  missing_value was the same as its parent's, which isn't true for
  rank() of a datetime-dtype factor.  GroupedRowTransform now takes a
  required missing_value parameter.

- Fixed an issue where Factor.demean() and Factor.zscore() weren't
  properly cached because their static_identity included a closure that
  was dynamically generated on each invocation.  They both now always
  use a function defined at module scope.
2016-07-26 02:57:35 -04:00
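
The caching fix described in commit 49bb8264dc can be illustrated with a small sketch. This is not the project's code: `identity_key`, `make_demean`, and the plain-dict cache are hypothetical stand-ins, assumed here only to show why a closure generated on each call defeats identity-based caching while a module-scope function does not.

```python
def make_demean():
    # A new function object is created on every call, so two
    # "identical" closures are still distinct objects.
    def demean(row):
        mean = sum(row) / len(row)
        return [x - mean for x in row]
    return demean


def demean(row):
    # Defined once at module scope: every reference is the same object.
    mean = sum(row) / len(row)
    return [x - mean for x in row]


def identity_key(func, inputs):
    # Hypothetical stand-in for a term's identity: a hashable tuple
    # that includes the transform function itself.
    return (func, tuple(inputs))


cache = {}

# Dynamically generated closures produce two distinct keys: no cache hit.
cache[identity_key(make_demean(), ["close"])] = "term-1"
cache[identity_key(make_demean(), ["close"])] = "term-2"
closure_entries = len(cache)

# The module-scope function produces the same key both times.
cache.clear()
cache[identity_key(demean, ["close"])] = "term-1"
cache[identity_key(demean, ["close"])] = "term-2"
module_entries = len(cache)
```

In this sketch the closure-based keys leave two entries in the cache, while the module-scope function leaves one.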
Joe Jevnik 5925107052 TST: fix doctests to actually run 2016-06-21 15:07:03 -04:00
dmichalowicz 1ec0bced6d ENH: Add builtin factors for correlation and regression 2016-05-18 15:11:12 -04:00
Scott Sanderson f7e9281b14 BUG: Fix groupby with string columns.
The previous algorithm assumed that the group labels were integers. It
produced nonsense with LabelArrays (though sadly didn't crash because
numpy promotes None and void to object).
2016-05-10 16:57:59 -04:00
Scott Sanderson 9fd8ec180d BUG: View with specific int dtype.
Just viewing as int is broken on win32.
2016-05-05 02:13:14 -04:00
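
A rough illustration of the problem behind commit 9fd8ec180d, assuming the underlying data is 8 bytes wide (datetime64 is used here only as an example). On platforms where NumPy's default integer is 32 bits, viewing such data as plain `int` reinterprets each element as two 4-byte integers. The sketch uses `np.int32` explicitly to reproduce that effect on any platform.

```python
import numpy as np

dates = np.array(["2016-05-04", "2016-05-05"], dtype="datetime64[ns]")

# Explicit 64-bit view: one integer per datetime on every platform.
as_int64 = dates.view(np.int64)

# What a 32-bit default integer would do: each 8-byte value is split
# into two 4-byte integers, so the array doubles in length.
as_int32 = dates.view(np.int32)

# The 64-bit view preserves the meaning: nanoseconds since the epoch.
one_day_ns = as_int64[1] - as_int64[0]
```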
Scott Sanderson e0aeda4c3e BUG: Fix bytes/unicode issues in py3. 2016-05-05 01:46:35 -04:00
Scott Sanderson 2ceeac1237 BUG: Use compat unicode. 2016-05-04 19:58:55 -04:00
Scott Sanderson bd49647ce0 BUG: Fix failure on pandas >= 0.17. 2016-05-04 19:38:28 -04:00
Scott Sanderson b78501e54a BUG: Fix broken isnull() on string classifiers.
Adds a special case in NullFilter to handle LabelArrays correctly.
2016-05-04 17:26:27 -04:00
Scott Sanderson 620d7648b0 BUG: Tests/bugfixes for LabelArray slicing.
- Fixes a bug where __setitem__ was not called when setting with a slice
  on Python 2 (__setslice__ was called instead), which caused strange
  behavior when setting an empty string.  This is fixed by overriding
  __setslice__ and forwarding to __setitem__.

- Fixes a bug where __getitem__ returned an instance of np.void when
  returning a scalar.  We now correctly return an entry from our
  categoricals.
2016-05-04 15:54:50 -04:00
Scott Sanderson 4dbc7eac56 MAINT: Remove byteswap and newbyteorder from LabelArray. 2016-05-04 15:54:50 -04:00
Scott Sanderson 8de45540f2 ENH: NaN semantics for LabelArray missing values. 2016-05-04 15:54:50 -04:00
Scott Sanderson 2395cbb671 ENH: Use np.void for labelarray storage.
This disables most broken ufuncs.
2016-05-04 15:54:50 -04:00
Scott Sanderson 5cd7d79818 MAINT: Restore support for bytes/unicode AdjustedArrays. 2016-05-04 15:54:50 -04:00
Scott Sanderson 47e9b107ec DOC: Clean up docstring cruft. 2016-05-04 15:54:50 -04:00
Scott Sanderson 23324b4218 DOC: Add docstring for LabelArray. 2016-05-04 15:54:50 -04:00
Scott Sanderson 1a2ed2724b BUG: Pass correct class to super call. 2016-05-04 15:54:50 -04:00
Scott Sanderson c40bbfae03 TEST: More tests for string predicates. 2016-05-04 15:54:50 -04:00
Scott Sanderson 5f190395ad ENH: Add support for strings in Pipeline.
- Adds a new class, ``LabelArray``, which is a subclass of np.ndarray.
  LabelArray is conceptually similar to pandas.Categorical, in that it
  stores data with many duplicate values as indices into an array of
  unique values.  For string data with many duplicates (e.g. time-series
  of tickers or industry classifications), this provides multiple
  orders of magnitude of improvement when doing string operations,
  especially string comparison/matching operations.

- Adds a new generic object "specialization" for `AdjustedArrayWindow`,
  and a corresponding ObjectOverwrite adjustment.

- Adds a new ``postprocess`` method to ``zipline.pipeline.term.Term``.
  This method is called on the final result of any pipeline expression
  after screen filtering has occurred. The default implementation of
  ``postprocess`` is identity, but Classifier overrides it to coerce
  string columns into pandas.Categoricals before presenting them to the
  user.
2016-05-04 15:50:52 -04:00
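
The storage idea described in commit 5f190395ad (indices into an array of unique values) can be sketched with plain NumPy. This is not the LabelArray implementation: the sample tickers and the `label_equals` helper are hypothetical, and the sketch shows only why string comparison becomes cheap once values are encoded as integer codes.

```python
import numpy as np

# Hypothetical sample: a column of tickers with many repeats.
tickers = np.array(["AAPL", "MSFT", "AAPL", "IBM", "MSFT", "AAPL"])

# Store each distinct string once and keep one integer code per element.
categories, codes = np.unique(tickers, return_inverse=True)

# Map each label to its code once, up front.
code_of = {label: i for i, label in enumerate(categories.tolist())}


def label_equals(value):
    # One dict lookup plus a vectorized integer comparison, instead of
    # comparing strings element by element.
    # -1 is never a valid code, so unknown labels match nothing.
    return codes == code_of.get(value, -1)


is_aapl = label_equals("AAPL")

# The original strings are recovered by indexing into the categories.
roundtrip = categories[codes]
```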
Eddie Hebert 16fd6681a6 ENH: Rewrite of Zipline to use lazy access pattern
More documentation to follow in release notes.

Based on lazy-mainline branch, see for more details.

Also-By: Jean Bredeche <jean@quantopian.com>
Also-By: Andrew Liang <aliang@quantopian.com>
Also-By: Abhijeet Kalyan <akalyan@quantopian.com>
2016-04-04 16:12:58 -04:00
Scott Sanderson 872b84e09a ENH: Implement Factor.quantiles. 2016-03-25 15:11:18 -04:00
Scott Sanderson 53d3b0855b ENH: Add support for Classifiers.
Classifiers are computations that represent grouping keys. They can be
used in conjunction with normalization functions like ``zscore`` or
``demean`` to perform normalizations over subsets of a dataset.

Notable changes:

- Added ``demean()`` and ``zscore()`` methods to ``Factor``.

- Added classifier versions of ``Latest`` and ``CustomTermMixin``.
  The .latest attribute of int64 dataset columns now produces a
  classifier by default.

- Added ``Everything``, a classifier that maps all data to the same
  value.

- Added ``zipline.lib.normalize``, which implements a naive, pure-Python
  grouped normalize function.  This will likely be moved to Cython in a
  subsequent PR.
2016-03-19 17:04:28 -04:00
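
A naive grouped normalization of the kind commit 53d3b0855b describes might look like the sketch below. It is not the contents of ``zipline.lib.normalize``: the `grouped_apply` helper and the sample data are assumptions, and it handles a single row of values with no treatment of missing data or single-member groups.

```python
import numpy as np


def grouped_apply(values, groups, func):
    # Apply func separately to the values sharing each group label.
    out = np.empty_like(values, dtype=float)
    for label in np.unique(groups):
        members = groups == label
        out[members] = func(values[members])
    return out


def demean(x):
    return x - x.mean()


def zscore(x):
    return (x - x.mean()) / x.std()


# Hypothetical row of factor values and a classifier assigning each
# asset to group 0 or group 1.
values = np.array([1.0, 2.0, 3.0, 10.0, 20.0, 30.0])
groups = np.array([0, 0, 0, 1, 1, 1])

demeaned = grouped_apply(values, groups, demean)
zscored = grouped_apply(values, groups, zscore)
```

Because each group is normalized independently, the two groups end up with the same z-scores even though their raw scales differ by a factor of ten.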
Scott Sanderson f635a14289 ENH: Add isnull and notnull methods to Factor. 2016-03-07 16:19:08 -05:00
Scott Sanderson 6287987c0b BUG: Work around scipy >= 0.17 changing dtype of rankdata. 2016-02-16 13:43:56 -05:00
Scott Sanderson 0115cdc46c MAINT: Fail fast on unsupported dtypes. 2016-02-12 21:23:47 -05:00
Scott Sanderson c105735574 DEV: Add support for specifying missing_value.
Consequently, enable support for `int`-dtyped Factors and BoundColumns.
2016-02-12 21:23:47 -05:00
Richard Frank 18db1904bc BUG: Need to format message, not ValueError instance 2016-01-06 16:02:18 -05:00
Richard Frank 1499051df7 BUG: TypeError message had only str of numpy.dtype class
We want to use the dtype of the data that was passed in.
2016-01-06 15:29:58 -05:00
Scott Sanderson 67d546f000 MAINT: Use an enum for the AdjustmentKind. 2015-12-10 16:21:46 -05:00
Scott Sanderson 77bce4ec9d MAINT: Refactor next_adj logic into method. 2015-12-10 16:12:51 -05:00
Scott Sanderson 64ce6d26aa BUG: Fix hardcoded type repr in test.
Types repr differently in py2 vs py3.
2015-12-09 15:29:57 -05:00
llllllllll 48536add73 TST: fix doctests 2015-12-09 11:22:13 -05:00
Scott Sanderson 8220d1ee86 ENH: Adds support for different typed adjusted arrays and adds an
EarningsCalendar loader.

- Moves most of AdjustedArray back into Python. The window iterator is
  the only part that's performance-intensive.

- Adds a bootleg templating system for creating specialized versions of
  AdjustedArrayWindow for each concrete type we care about.

- Adds support for differently dtyped terms in pipeline. This allows us
  to use datetime64s which are needed in the EarningsCalendar.

- Adds EarningsCalendar dataset for the next and previous earnings
  announcements in pipeline.

- Adds in memory loader for EarningsCalendar.

- Adds blaze loader for EarningsCalendar.
2015-12-08 20:24:06 -05:00
Scott Sanderson 5d8a915d15 ENH: Add inspect() function to adjusted_array. 2015-11-20 20:15:43 -05:00
llllllllll 3fb91e4d39 MAINT: cleanup doctests 2015-10-19 16:35:03 -04:00
llllllllll 0fff04d9c1 DOC: update doctest 2015-10-19 16:35:03 -04:00
llllllllll 0183d0a914 ENH: Allows Float64Adjustments to act on a range of columns 2015-10-19 16:35:03 -04:00
Scott Sanderson 26fd6fda8b ENH/BUG: Modeling API enhancements.
- Fixes an error where Modeling API data known as of the close of `day
  N` would be shown to algorithms during `before_trading_start` as of
  the close of the same day.  Algorithms should now only receive data
  during `before_trading_start/handle_data` that was known as of the
  simulation time at which the function would be called.

- All Term instances now have a `mask` attribute that must be a `Filter`
  or an instance of `AssetExists()`.  `mask` can be used to specify that
  a Factor should be computed in a manner that ignores the values that
  were not `True` in the mask.

- Changed the interface for `FFCLoader.load_adjusted_array` and
  `Term._compute` from `(columns, mask)`, with mask as a DataFrame, to
  `(columns, dates, assets, mask)`, where mask is a numpy array.  This
  is primarily to avoid having to reconstruct extra DataFrames when
  using masks produced by non `AssetExists` filters.

- Adds `BoundColumn.latest`, which gives the most-recently-known value
  of a column.
2015-09-16 01:47:11 -04:00
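
The `mask` behaviour described in commit 26fd6fda8b, where a computation ignores values that were not `True` in the mask, can be sketched with a numpy boolean array. The data and the choice of a row-wise demean are assumptions made for illustration; this is not the engine's code.

```python
import numpy as np

# Hypothetical (dates x assets) values and a boolean mask of the same shape.
values = np.array([[1.0, 2.0, 3.0, 4.0],
                   [10.0, 20.0, 30.0, 40.0]])
mask = np.array([[True, True, False, True],
                 [False, True, True, True]])

# Replace masked-out entries with NaN so they drop out of the statistics.
masked = np.where(mask, values, np.nan)

# Row-wise mean over the masked-in entries only.
row_means = np.nanmean(masked, axis=1, keepdims=True)

# Masked-out positions stay NaN in the result.
demeaned = masked - row_means
```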
Scott Sanderson 6e8a4b8144 ENH: Improvements to rank().
- Add an `ascending=True` keyword to `rank()`.

- Add `top(N)` and `bottom(N)` methods to Factor.  These return Filters
  that pass the top and bottom N elements each day.

- Add a slightly faster path for rank(method='ordinal').  I had
  originally thought the fast path was 2-3x faster because I had my
  benchmark data axes flipped.  The actual speedup is only 5-10%, which
  means it probably wasn't worth the effort to Cythonize...but we have a
  slightly faster version now so we might as well use it.

- Refactor test_filter and test_factor to make it easier to implement
  and test transformations on factors.  These tests now subclass
  BaseFFCTestCase, which provides facilities for passing a dict of terms
  and an "initial_workspace", the values for which are used by
  SimpleFFCEngine rather than needing to manually manage the inputs and
  outputs of each term.
2015-08-31 00:32:33 -04:00
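
As a sketch of what commit 6e8a4b8144 describes, ordinal ranking and the `top(N)`/`bottom(N)` filters can be expressed for a single day's row as below. These functions are illustrative only: they are not the Cython fast path, they do not handle NaNs, and their signatures are assumptions rather than the Factor API.

```python
import numpy as np


def ordinal_rank(row, ascending=True):
    # Ranks start at 1; ties are broken by position (stable sort).
    keys = row if ascending else -row
    order = np.argsort(keys, kind="mergesort")
    ranks = np.empty(len(row), dtype=np.int64)
    ranks[order] = np.arange(1, len(row) + 1)
    return ranks


def top(row, n):
    # Boolean filter passing the n largest values.
    return ordinal_rank(row, ascending=False) <= n


def bottom(row, n):
    # Boolean filter passing the n smallest values.
    return ordinal_rank(row, ascending=True) <= n


# Hypothetical factor values for four assets on one day.
row = np.array([0.3, -1.2, 2.5, 0.9])

ascending_ranks = ordinal_rank(row)
descending_ranks = ordinal_rank(row, ascending=False)
top_two = top(row, 2)
bottom_one = bottom(row, 1)
```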
Scott Sanderson 41d4133c74 BUG: Use NAN from numpy.
MSVC doesn't define NAN in math.h because they only implement C89.

See http://tdistler.com/2011/03/24/how-to-define-nan-not-a-number-on-windows.
2015-08-21 11:33:20 -04:00
Scott Sanderson ef4f642e62 ENH: Compute engine architecture for FFC API.
This patch lays the groundwork for a compute engine designed to
facilitate construction of factor-based universe screening and portfolio
allocation.  It contains:

A new module, `zipline.modelling`, containing entities that can be used
to express computations as dependency graphs.  Each node in such a graph
is an instance of the base `Term` class, defined in
`zipline.modelling.term`.  Dependency graphs are executed by instances
of `FFCEngine`, defined in `zipline.modelling.engine`.

A new module, `zipline.data.ffc`, containing loaders and dataset
definitions for inputs to the modelling API.

New `TradingAlgorithm` api methods: `add_factor`, and `add_filter`.
These methods can only be called from `initialize`, and are used to
inform the algorithm that each day it should compute the given terms.
Computed factor results are made available through a new attribute of
the `data` object in `before_trading_start` and `handle_data`.  Computed
filter results control which assets are available in the factor matrix
on each day.
2015-07-29 12:30:46 -04:00
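
The dependency-graph idea in commit ef4f642e62 can be sketched in a few lines. The `Term` class and `run_graph` function below are hypothetical stand-ins that borrow only the names from the commit message; they are not the `zipline.modelling` classes or the `FFCEngine` interface. The sketch shows terms being computed once each, dependencies first, starting from a workspace of pre-loaded inputs.

```python
class Term:
    """Hypothetical stand-in for a node in the dependency graph."""

    def __init__(self, name, inputs=(), compute=None):
        self.name = name
        self.inputs = tuple(inputs)
        self.compute = compute


def run_graph(outputs, initial_workspace):
    # The workspace maps term names to computed (or pre-loaded) values.
    workspace = dict(initial_workspace)

    def visit(term):
        # Compute dependencies first; reuse anything already present.
        if term.name not in workspace:
            args = [visit(dep) for dep in term.inputs]
            workspace[term.name] = term.compute(*args)
        return workspace[term.name]

    return {term.name: visit(term) for term in outputs}


# "close" is a raw input supplied by a loader, so it has no compute function.
close = Term("close")
mean_close = Term("mean_close", [close], lambda c: sum(c) / len(c))
above_mean = Term(
    "above_mean",
    [close, mean_close],
    lambda c, m: [x > m for x in c],
)

results = run_graph(
    [mean_close, above_mean],
    {"close": [1.0, 2.0, 6.0]},
)
```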