Pandas 0.18 doesn't like having null-ish values in categoricals. Fixing
this properly requires re-thinking the semantics for missing_value on
pipeline terms, so we're punting on that until after we've upgraded to
0.18.
Pandas 0.18 deprecated passing "null-ish" values to pd.categorical. The
expectation, instead, is that you use categorical's native support for
missing data, which means the user will always get NaN's for missing
entries of the categorical.
A follow-up to this change should probably drop support for custom
missing values entirely and to use LabelArray/categorical for integer
data.
They're not meaningful, and they cause warnings from numpy.
Implemented in terms of a new preprocessor, `expect_bounded`, which
takes a tuple of `upper_bound` and `lower_bound`.
The daily/session bar reader's `spot_price` took the same parameters and
returned the same kind of output as the minute bar reader's `get_value`.
Standardize on one method to make a common interface, which may be
formally factored out in a later patch; to help enable writing reader
implementations or mixins which can be agnostic to the bar frequency.
Adds a new ``downsample`` method to all computable terms. Computable
terms (Filters, Factors, and Classifiers) can be downsampled to yearly,
quarterly, monthly, or weekly frequency.
The result of ``term.downsample`` is a new term of the same
family (Filter/Factor/Classifier) as ``term``. The downsampled term
computes by delegating to the original term; repeatedly calling its
``compute`` method with length-1 date ranges.
Downsampled terms take advantage of a new ``compute_extra_rows`` Term
method, which allows terms to dynamically request that additional extra
rows of themselves be computed based on the dates for which they're
being computed. This ensures, for example, that a monthly-downsampled
term always computes at the start of a month, even when a
naively-calculated pipeline window would end in the middle of the month.
- Split out extra_rows handling into an `ExecutionPlan` subclass.
`ExecutionPlan` now requires the dates and calendar against which a
set of terms will be computed, and now defers to a term's
`compute_extra_rows` method when deciding how many extra rows are
required to compute for that term. This will allow downsampled terms
to request enough extra rows to guarantee that we can maintain consistent
calculation dates.
As a consequence of the above, `TermGraph` now only deals with logical
dependencies, not with metadata surrounding extra row calculations.
This means that TermGraph can be used to generate dependency
visualizations in interactive contexts where we don't yet have a
calendar or start/end dates.
- Refactored test_{filter,factor,classifier} to use check_terms instead
of run_graph. This makes it easier to make changes to TermGraph,
since the testing interface is now to simply provide a dict of terms.
- Refactored BasePipelineTestCase to use fixtures to create an asset
finder. This fixes a potential leak of the test's asset db, which was
not being explicitly cleaned up.
- Refactored test_technical to use BasePipelineTestCase.
- Added a new special term, `InputDates()`, which can be used to request
date labels for inputs. Like `AssetExists`, `InputDates` is provided
in the initial workspace by default.
- Added a default (failing) `_compute` method to `AssetExists` which
provides a more useful error than AttributeError.
- Don't create unnecessary extra data (requires passing fastd_period=1
to TA-Lib or else it fills the FastK with NaNs even though it must
have already computed them...
- Use random_sample instead of random_integers so that we're not
dependent on integer arithmetic.
- Pass array_decimal to assert_equal so that we do almost equal checking
on results.
When adding fixtures for futures data, there will be a need for multiple
calendars in the fixture ecosystem. e.g. a test that includes both
equities and futures would need an overall calendar which encompasses
both equities and futures; however, the test data for equities should
still still be limited to the bounds set by the NYSE calendar.
Make the fixtures that setup trading calendars and values dervied from
the trading calendar (e.g. trading sessions) accept an iterable of
calendars which need to be created, then populate those values into a
dict keyed by the calendar name.
Change `WithNYSETradingDays` to include sessions in the name,
since we are moving to session as the name for the 'day' unit.
Provide `trading_days` which is really "NYSE trading sessions` on
`WithTradingSessions` for backwards compatibility.
Changes the overlap behavior so that it is an error to write data which
would have two companies holding the same ticker. Other than one test
around which company would win in that case, all the other tests are
passing. That single test has been changed to check the write-time
error.
- Added test coverage for grouped and masked top/bottom.
- Added test coverage for grouped rank on datetime factors.
- Fixed an issue where grouped rank would fail on datetime inputs
because unary-negative isn't defined for datetimes. We now instead
directly invoke a function from rank.pyx that does the normalizations
as neeeded.
- Fixed an issue where GroupedRowTransform assumed that it produced the
same dtype as its input. This isn't true for rank() of a
datetime-dtype factor. GroupedRowTransform now takes a required dtype
parameter.
- Similarly, fixed an issue where GroupedRowTransform assumed that its
missing_value was the same as its parent's, which isn't true for
rank() of a datetime-dtype factor. GroupedRowTransform now takes a
required dtype parameter.
- Fixed an issue where Factor.demean() and Factor.zscore() weren't
properly cached because their static_identity included a closure that
was dynamically generated on each invocation. They both now always
use a function defined at module scope.