catalyst

mirror of https://github.com/wassname/catalyst.git synced 2026-07-04 00:48:24 +08:00

Author	SHA1	Message	Date
Scott Sanderson	493e18252d	MAINT: Temporarily ignore pandas warnings in categoricals. Pandas 0.18 doesn't like having null-ish values in categoricals. Fixing this properly requires re-thinking the semantics for missing_value on pipeline terms, so we're punting on that until after we've upgraded to 0.18.	2016-09-20 17:12:07 -04:00
Scott Sanderson	a9c02935c6	Revert "MAINT: Remove support for custom string Column missing values." This reverts commit 1b1e842e2339d6d0ee40cdfe34dcd27b4e4a7c0c.	2016-09-20 17:12:07 -04:00
Scott Sanderson	ed365dc5fe	MAINT: Remove support for custom string Column missing values. Pandas 0.18 deprecated passing "null-ish" values to pd.categorical. The expectation, instead, is that you use categorical's native support for missing data, which means the user will always get NaNs for missing entries of the categorical. A follow-up to this change should probably drop support for custom missing values entirely and use LabelArray/categorical for integer data.	2016-09-20 17:12:07 -04:00
Scott Sanderson	af5f4be17c	MAINT: Fix warnings from numpy on NaT comparison.	2016-09-20 17:12:07 -04:00
Scott Sanderson	fc3eac36aa	DOC: Update LabelArray docstring.	2016-09-20 16:24:55 -04:00
Scott Sanderson	b06ef66f44	DOC: Remove out of date comment.	2016-09-20 16:24:55 -04:00
Scott Sanderson	49bb8264dc	ENH: Finish adding groupby to rank/top/bottom. - Added test coverage for grouped and masked top/bottom. - Added test coverage for grouped rank on datetime factors. - Fixed an issue where grouped rank would fail on datetime inputs because unary-negative isn't defined for datetimes. We now instead directly invoke a function from rank.pyx that does the normalizations as needed. - Fixed an issue where GroupedRowTransform assumed that it produced the same dtype as its input. This isn't true for rank() of a datetime-dtype factor. GroupedRowTransform now takes a required dtype parameter. - Similarly, fixed an issue where GroupedRowTransform assumed that its missing_value was the same as its parent's, which isn't true for rank() of a datetime-dtype factor. GroupedRowTransform now takes a required dtype parameter. - Fixed an issue where Factor.demean() and Factor.zscore() weren't properly cached because their static_identity included a closure that was dynamically generated on each invocation. They both now always use a function defined at module scope.	2016-07-26 02:57:35 -04:00
Joe Jevnik	5925107052	TST: fix doctests to actually run	2016-06-21 15:07:03 -04:00
dmichalowicz	1ec0bced6d	ENH: Add builtin factors for correlation and regression	2016-05-18 15:11:12 -04:00
Scott Sanderson	f7e9281b14	BUG: Fix groupby with string columns. The previous algorithm assumed that the group labels were integers. It produced nonsense with LabelArrays (though sadly didn't crash because numpy promotes None and void to object).	2016-05-10 16:57:59 -04:00
Scott Sanderson	9fd8ec180d	BUG: View with specific int dtype. Just viewing as int is broken on win32.	2016-05-05 02:13:14 -04:00
Scott Sanderson	e0aeda4c3e	BUG: Fix bytes/unicode issues in py3.	2016-05-05 01:46:35 -04:00
Scott Sanderson	2ceeac1237	BUG: Use compat unicode.	2016-05-04 19:58:55 -04:00
Scott Sanderson	bd49647ce0	BUG: Fix failure on pandas >= 0.17.	2016-05-04 19:38:28 -04:00
Scott Sanderson	b78501e54a	BUG: Fix broken isnull() on string classifiers. Adds a special case in NullFilter to handle LabelArrays correctly.	2016-05-04 17:26:27 -04:00
Scott Sanderson	620d7648b0	BUG: Tests/bugfixes for LabelArray slicing. - Fixes a bug where __setitem__ was not called when setting with a slice on Python 2 (__setslice__ was called instead), which caused strange behavior when setting an empty string. This is fixed by overriding __setslice__ and forwarding to __setitem__. - Fixes a bug where __getitem__ returned an instance of np.void when returning a scalar. We now correctly return an entry from our categoricals.	2016-05-04 15:54:50 -04:00
Scott Sanderson	4dbc7eac56	MAINT: Remove byteswap and newbyteorder from LabelArray.	2016-05-04 15:54:50 -04:00
Scott Sanderson	8de45540f2	ENH: NaN semantics for LabelArray missing values.	2016-05-04 15:54:50 -04:00
Scott Sanderson	2395cbb671	ENH: Use np.void for labelarray storage. This disables most broken ufuncs	2016-05-04 15:54:50 -04:00
Scott Sanderson	5cd7d79818	MAINT: Restore support for bytes/unicode AdjustedArrays.	2016-05-04 15:54:50 -04:00
Scott Sanderson	47e9b107ec	DOC: Clean up docstring cruft.	2016-05-04 15:54:50 -04:00
Scott Sanderson	23324b4218	DOC: Add docstring for LabelArray.	2016-05-04 15:54:50 -04:00
Scott Sanderson	1a2ed2724b	BUG: Pass correct class to super call.	2016-05-04 15:54:50 -04:00
Scott Sanderson	c40bbfae03	TEST: More tests for string predicates.	2016-05-04 15:54:50 -04:00
Scott Sanderson	5f190395ad	ENH: Add support for strings in Pipeline. - Adds a new class, ``LabelArray``, which is a subclass of np.ndarray. LabelArray is conceptually similar to pandas.Categorical, in that it stores data with many duplicate values as indices into an array of unique values. For string data with many duplicates (e.g. time-series of tickers or industry classifications), this provides multiple orders of magnitude of improvement when doing string operations, especially string comparison/matching operations. - Adds a new generic object "specialization" for `AdjustedArrayWindow`, and a corresponding ObjectOverwrite adjustment. - Adds a new ``postprocess`` method to ``zipline.pipeline.term.Term``. This method is called on the final result of any pipeline expression after screen filtering has occurred. The default implementation of ``postprocess`` is identity, but Classifier overrides it to coerce string columns into pandas.Categoricals before presenting them to the user.	2016-05-04 15:50:52 -04:00
Eddie Hebert	16fd6681a6	ENH: Rewrite of Zipline to use lazy access pattern. More documentation to follow in release notes. Based on lazy-mainline branch, see for more details. Also-By: Jean Bredeche <jean@quantopian.com> Also-By: Andrew Liang <aliang@quantopian.com> Also-By: Abhijeet Kalyan <akalyan@quantopian.com>	2016-04-04 16:12:58 -04:00
Scott Sanderson	872b84e09a	ENH: Implement Factor.quantiles.	2016-03-25 15:11:18 -04:00
Scott Sanderson	53d3b0855b	ENH: Add support for Classifiers. Classifiers are computations that represent grouping keys. They can be used in conjunction with normalization functions like ``zscore`` or ``demean`` to perform normalizations over subsets of a dataset. Notable changes: - Added ``demean()`` and ``zscore()`` methods to ``Factor``. - Added classifier versions of ``Latest`` and ``CustomTermMixin``. The .latest attribute of int64 dataset columns now produces a classifier by default. - Added ``Everything``, a classifier that maps all data to the same value. - Added ``zipline.lib.normalize``, which implements a naive, pure-Python grouped normalize function. This will likely be moved to Cython in a subsequent PR.	2016-03-19 17:04:28 -04:00
Scott Sanderson	f635a14289	ENH: Add `isnull` and `notnull` methods to Factor.	2016-03-07 16:19:08 -05:00
Scott Sanderson	6287987c0b	BUG: Work around scipy >= 0.17 changing dtype of rankdata.	2016-02-16 13:43:56 -05:00
Scott Sanderson	0115cdc46c	MAINT: Fail fast on unsupported dtypes.	2016-02-12 21:23:47 -05:00
Scott Sanderson	c105735574	DEV: Add support for specifying missing_value. Consequently, enable support for `int`-dtyped Factors and BoundColumns.	2016-02-12 21:23:47 -05:00
Richard Frank	18db1904bc	BUG: Need to format message, not ValueError instance	2016-01-06 16:02:18 -05:00
Richard Frank	1499051df7	BUG: TypeError message had only str of numpy.dtype class. We want to use the dtype of the data that was passed in.	2016-01-06 15:29:58 -05:00
Scott Sanderson	67d546f000	MAINT: Use an enum for the AdjustmentKind.	2015-12-10 16:21:46 -05:00
Scott Sanderson	77bce4ec9d	MAINT: Refactor next_adj logic into method.	2015-12-10 16:12:51 -05:00
Scott Sanderson	64ce6d26aa	BUG: Fix hardcoded type repr in test. Types repr differently in py2 vs py3.	2015-12-09 15:29:57 -05:00
llllllllll	48536add73	TST: fix doctests	2015-12-09 11:22:13 -05:00
Scott Sanderson	8220d1ee86	ENH: Adds support for different typed adjusted arrays and adds an EarningsCalendar loader. - Moves most of AdjustedArray back into Python. The window iterator is the only part that's performance-intensive. - Adds a bootleg templating system for creating specialized versions of AdjustedArrayWindow for each concrete type we care about. - Adds support for differently dtyped terms in pipeline. This allows us to use datetime64s which are needed in the EarningsCalendar. - Adds EarningsCalendar dataset for the next and previous earnings announcements in pipeline. - Adds in memory loader for EarningsCalendar. - Adds blaze loader for EarningsCalendar.	2015-12-08 20:24:06 -05:00
Scott Sanderson	5d8a915d15	ENH: Add inspect() function to adjusted_array.	2015-11-20 20:15:43 -05:00
llllllllll	3fb91e4d39	MAINT: cleanup doctests	2015-10-19 16:35:03 -04:00
llllllllll	0fff04d9c1	DOC: update doctest	2015-10-19 16:35:03 -04:00
llllllllll	0183d0a914	ENH: Allows Float64Adjustments to act on a range of columns	2015-10-19 16:35:03 -04:00
Scott Sanderson	26fd6fda8b	ENH/BUG: Modeling API enhancements. - Fixes an error where Modeling API data known as of the close of `day N` would be shown to algorithms during `before_trading_start` as of the close of the same day. Algorithms should now only receive data during `before_trading_start/handle_data` that was known as of the simulation time at which the function would be called. - All Term instances now have a `mask` attribute that must be a `Filter` or an instance of `AssetExists()`. `mask` can be used to specify that a Factor should be computed in a manner that ignores the values that were not `True` in the mask. - Changed the interface for `FFCLoader.load_adjusted_array` and `Term._compute` from `(columns, mask)`, with mask as a DataFrame, to `(columns, dates, assets, mask)`, where mask is a numpy array. This is primarily to avoid having to reconstruct extra DataFrames when using masks produced by non `AssetExists` filters. - Adds `BoundColumn.latest`, which gives the most-recently-known value of a column.	2015-09-16 01:47:11 -04:00
Scott Sanderson	6e8a4b8144	ENH: Improvements to rank(). - Add an `ascending=True` keyword to `rank()`. - Add `top(N)` and `bottom(N)` methods to Factor. These return Filters that pass the top and bottom N elements each day. - Add a slightly faster path for rank(method='ordinal'). I had originally thought the fast path was 2-3x faster because I had my benchmark data axes flipped. The actual speedup is only 5-10%, which means it probably wasn't worth the effort to Cythonize...but we have a slightly faster version now so we might as well use it. - Refactor test_filter and test_factor to make it easier to implement and test transformations on factors. These tests now subclass BaseFFCTestCase, which provides facilities for passing a dict of terms and an "initial_workspace", the values for which are used by SimpleFFCEngine rather than needing to manually manage the inputs and outputs of each term.	2015-08-31 00:32:33 -04:00
Scott Sanderson	41d4133c74	BUG: Use NAN from numpy. MSVC doesn't define NAN in math.h because they only implement C89. See http://tdistler.com/2011/03/24/how-to-define-nan-not-a-number-on-windows.	2015-08-21 11:33:20 -04:00
Scott Sanderson	ef4f642e62	ENH: Compute engine architecture for FFC API. This patch lays the groundwork for a compute engine designed to facilitate construction of factor-based universe screening and portfolio allocation. It contains: A new module, `zipline.modelling`, containing entities that can be used to express computations as dependency graphs. Each node in such a graph is an instance of the base `Term` class, defined in `zipline.modelling.term`. Dependency graphs are executed by instances of `FFCEngine`, defined in `zipline.modelling.engine`. A new module, `zipline.data.ffc`, containing loaders and dataset definitions for inputs to the modelling API. New `TradingAlgorithm` api methods: `add_factor`, and `add_filter`. These methods can only be called from `initialize`, and are used to inform the algorithm that each day it should compute the given terms. Computed factor results are made available through a new attribute of the `data` object in `before_trading_start` and `handle_data`. Computed filter results control which assets are available in the factor matrix on each day.	2015-07-29 12:30:46 -04:00

47 Commits