autots package¶
Subpackages¶
- autots.datasets package
- autots.evaluator package
- autots.models package
- Submodules
- autots.models.arch module
- autots.models.base module
- autots.models.basics module
- autots.models.cassandra module
- autots.models.dnn module
- autots.models.ensemble module
- autots.models.gluonts module
- autots.models.greykite module
- autots.models.matrix_var module
- autots.models.model_list module
- autots.models.prophet module
- autots.models.pytorch module
- autots.models.sklearn module
- autots.models.statsmodels module
- autots.models.tfp module
- Module contents
- autots.templates package
- autots.tools package
- Submodules
- autots.tools.anomaly_utils module
- autots.tools.calendar module
- autots.tools.cointegration module
- autots.tools.cpu_count module
- autots.tools.fast_kalman module
- autots.tools.hierarchial module
- autots.tools.holiday module
- autots.tools.impute module
- autots.tools.lunar module
- autots.tools.percentile module
- autots.tools.probabilistic module
- autots.tools.profile module
- autots.tools.regressor module
- autots.tools.seasonal module
- autots.tools.shaping module
- autots.tools.thresholding module
- autots.tools.transform module
- autots.tools.window_functions module
- Module contents
Module contents¶
Automated Time Series Model Selection for Python
https://github.com/winedarksea/AutoTS
autots.load_daily(long: bool = True)¶
2020 Covid, Air Pollution, and Economic Data. Sources: Wikimedia Foundation
- Parameters:
long (bool) – if True, return data in long format. Otherwise return wide
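For example, a minimal usage sketch of the dataset loaders (long data uses the package's datetime/series_id/value column convention):

    from autots import load_daily

    # long format: one row per (series_id, datetime, value) observation
    df_long = load_daily(long=True)

    # wide format: DatetimeIndex with one column per series
    df_wide = load_daily(long=False)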
autots.load_monthly(long: bool = True)¶
Federal Reserve of St. Louis monthly economic indicators.
autots.load_yearly(long: bool = True)¶
Federal Reserve of St. Louis annual economic indicators.
autots.load_hourly(long: bool = True)¶
Traffic data from the MN DOT via the UCI data repository.
autots.load_weekly(long: bool = True)¶
Weekly petroleum industry data from the EIA.
autots.load_weekdays(long: bool = False, categorical: bool = True, periods: int = 180)¶
Test edge cases by creating a Series with values as day of week.
- Parameters:
long (bool) – if True, return a df with columns “value” and “datetime” if False, return a Series with dt index
categorical (bool) – if True, return str/object, else return int
periods (int) – number of periods, ie length of data to generate
autots.load_live_daily(long: bool = False, observation_start: str = None, observation_end: str = None, fred_key: str = None, fred_series=['DGS10', 'T5YIE', 'SP500', 'DCOILWTICO', 'DEXUSEU', 'WPU0911'], tickers: list = ['MSFT'], trends_list: list = ['forecasting', 'cycling', 'microsoft'], trends_geo: str = 'US', weather_data_types: list = ['AWND', 'WSF2', 'TAVG'], weather_stations: list = ['USW00094846', 'USW00014925'], weather_years: int = 10, london_air_stations: list = ['CT3', 'SK8'], london_air_species: str = 'PM25', london_air_days: int = 180, earthquake_days: int = 180, earthquake_min_magnitude: int = 5, gsa_key: str = None, gov_domain_list=['nasa.gov'], gov_domain_limit: int = 600, wikipedia_pages: list = ['Microsoft_Office', 'List_of_highest-grossing_films'], wiki_language: str = 'en', weather_event_types=['%28Z%29+Winter+Weather', '%28Z%29+Winter+Storm'], timeout: float = 300.05, sleep_seconds: int = 2)¶
Generates a dataframe of data up to the present day. Requires an active internet connection. Try to be respectful of these free data sources by not calling them too heavily. Pass None instead of a specification list to exclude a data source.
- Parameters:
long (bool) – whether to return in long format or wide
observation_start (str) – %Y-%m-%d earliest day to retrieve, passed to Fred.get_series and yfinance.history. Note that APIs with more restrictions have other default lengths (below) which ignore this.
observation_end (str) – %Y-%m-%d most recent day to retrieve
fred_key (str) – https://fred.stlouisfed.org/docs/api/api_key.html
fred_series (list) – list of FRED series IDs. This requires fredapi package
tickers (list) – list of stock tickers, requires yfinance pypi package
trends_list (list) – list of search keywords, requires pytrends pypi package. None to skip.
weather_data_types (list) – from NCEI NOAA api data types, GHCN Daily Weather Elements PRCP, SNOW, TMAX, TMIN, TAVG, AWND, WSF1, WSF2, WSF5, WSFG
weather_stations (list) – from NCEI NOAA api station ids. Pass empty list to skip.
london_air_stations (list) – londonair.org.uk source station IDs. Pass empty list to skip.
london_air_species (str) – what measurement to pull from London Air. Not all stations have all metrics.
earthquake_min_magnitude (int) – smallest earthquake magnitude to pull from earthquake.usgs.gov. Set None to skip this.
gsa_key (str) – api key from https://open.gsa.gov/api/dap/
gov_domain_list (list) – list of government-run domains to get traffic data for. Can be very slow, so fewer is better. Some examples: ['usps.com', 'ncbi.nlm.nih.gov', 'cdc.gov', 'weather.gov', 'irs.gov', 'usajobs.gov', 'studentaid.gov', 'nasa.gov', 'uk.usembassy.gov', 'tsunami.gov']
gov_domain_limit (int) – max number of records. Smaller will be faster. Max is currently 10000.
wikipedia_pages (list) – list of Wikipedia pages, html encoded if needed (underscore for space)
weather_event_types (list) – list of html encoded severe weather event types https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/Storm-Data-Export-Format.pdf
timeout (float) – used by some queries
sleep_seconds (int) – increasing this may reduce probability of server download failures
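A hedged sketch of a restrained call, passing None to skip every source that needs an API key or an extra package (only yfinance and Wikipedia data are pulled here):

    from autots import load_live_daily

    df = load_live_daily(
        long=False,
        fred_key=None,
        fred_series=None,  # None skips this source
        tickers=["MSFT"],  # requires the yfinance package
        trends_list=None,
        weather_stations=None,
        london_air_stations=None,
        earthquake_min_magnitude=None,
        gsa_key=None,
        gov_domain_list=None,
        wikipedia_pages=["Microsoft_Office"],
        weather_event_types=None,
        sleep_seconds=5,  # be gentle with the free endpoints
    )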
autots.load_linear(long=False, shape=None, start_date: str = '2021-01-01', introduce_nan: float = None, introduce_random: float = None, random_seed: int = 123)¶
Create a dataset of linear series for testing edge cases.
- Parameters:
long (bool) – whether to make long or wide
shape (tuple) – shape of output dataframe
start_date (str) – first date of index
introduce_nan (float) – percent of rows to make null. 0.2 = 20%
introduce_random (float) – shape of gamma distribution
random_seed (int) – seed for random
autots.load_artificial(long=False, date_start=None, date_end=None)¶
Load artificially generated series from random distributions.
- Parameters:
long (bool) – if True long style data, if False, wide style data
date_start – str or datetime.datetime of start date
date_end – str or datetime.datetime of end date
autots.load_sine(long=False, shape=None, start_date: str = '2021-01-01', introduce_random: float = None, random_seed: int = 123)¶
Create a dataset of sine waves for testing edge cases.
class autots.AutoTS(forecast_length: int = 14, frequency: str = 'infer', prediction_interval: float = 0.9, max_generations: int = 10, no_negatives: bool = False, constraint: float = None, ensemble: str = 'auto', initial_template: str = 'General+Random', random_seed: int = 2022, holiday_country: str = 'US', subset: int = None, aggfunc: str = 'first', na_tolerance: float = 1, metric_weighting: dict = {'containment_weighting': 0, 'contour_weighting': 1, 'imle_weighting': 0, 'made_weighting': 0.5, 'mae_weighting': 2, 'mage_weighting': 0, 'mle_weighting': 0, 'oda_weighting': 0.001, 'rmse_weighting': 2, 'runtime_weighting': 0.05, 'smape_weighting': 5, 'spl_weighting': 3}, drop_most_recent: int = 0, drop_data_older_than_periods: int = 100000, model_list: str = 'default', transformer_list: dict = 'auto', transformer_max_depth: int = 6, models_mode: str = 'random', num_validations: int = 'auto', models_to_validate: float = 0.15, max_per_model_class: int = None, validation_method: str = 'backwards', min_allowed_train_percent: float = 0.5, remove_leading_zeroes: bool = False, prefill_na: str = None, introduce_na: bool = None, preclean: dict = None, model_interrupt: bool = True, current_model_file: str = None, verbose: int = 1, n_jobs: int = -2)¶
Bases: object
Automate time series modeling using a genetic algorithm.
- Parameters:
forecast_length (int) – number of periods over which to evaluate forecast. Can be overridden later in .predict().
frequency (str) – ‘infer’ or a specific pandas datetime offset. Can be used to force rollup of data (ie daily input, but frequency ‘M’ will rollup to monthly).
prediction_interval (float) – 0-1, uncertainty range for upper and lower forecasts. Adjust range, but rarely matches actual containment.
max_generations (int) – number of genetic algorithms generations to run. More runs = longer runtime, generally better accuracy. It’s called max because someday there will be an auto early stopping option, but for now this is just the exact number of generations to run.
no_negatives (bool) – if True, all negative predictions are rounded up to 0.
constraint (float) – when not None, use this float value * data st dev above max or below min for constraining forecast values. Also instead accepts a dictionary containing the following key/values:
- constraint_method (str): one of
  stdev_min - threshold is min and max of historic data +/- constraint * st dev of data
  stdev - threshold is the mean of historic data +/- constraint * st dev of data
  absolute - input is array of length series containing the threshold's final value for each
  quantile - constraint is the quantile of historic data to use as threshold
- constraint_regularization (float): 0 to 1, where 0 means no constraint, 1 is a hard threshold cutoff, and in between is a penalty term
- upper_constraint (float): or array, depending on method; None if unused
- lower_constraint (float): or array, depending on method; None if unused
- bounds (bool): if True, apply to upper/lower forecast; if False, applies only to the point forecast
ensemble (str) – None or list or comma-separated string containing: ‘auto’, ‘simple’, ‘distance’, ‘horizontal’, ‘horizontal-min’, ‘horizontal-max’, “mosaic”, “subsample”
initial_template (str) – 'Random' - randomly generates starting template, 'General' uses template included in package, 'General+Random' - both of the previous. Also can be overridden with self.import_template().
random_seed (int) – random seed allows (slightly) more consistent results.
holiday_country (str) – passed through to Holidays package for some models.
subset (int) – maximum number of series to evaluate at once. Useful to speed evaluation when many series are input. takes a new subset of columns on each validation, unless mosaic ensembling, in which case columns are the same in each validation
aggfunc (str) – if data is to be rolled up to a higher frequency (daily -> monthly) or duplicate timestamps are included. Default ‘first’ removes duplicates, for rollup try ‘mean’ or np.sum. Beware numeric aggregations like ‘mean’ will not work with non-numeric inputs.
na_tolerance (float) – 0 to 1. Series are dropped if they have more than this percent NaN. 0.95 here would allow series containing up to 95% NaN values.
metric_weighting (dict) – weights to assign to metrics, affecting how the ranking score is generated.
drop_most_recent (int) – option to drop n most recent data points. Useful, say, for monthly sales data where the current (unfinished) month is included. Occurs after any aggregation is applied, so the units will be whatever is specified by frequency; will drop n frequencies.
drop_data_older_than_periods (int) – take only the n most recent timestamps
model_list (list) – str alias or list of names of model objects to use. Can now also be a dictionary of {"model": prob}, but this only affects starting random templates; the genetic algorithm takes over from there.
transformer_list (list) – list of transformers to use, or dict of transformer:probability. Note this does not apply to initial templates. can accept string aliases: “all”, “fast”, “superfast”
transformer_max_depth (int) – maximum number of sequential transformers to generate for new Random Transformers. Fewer will be faster.
models_mode (str) – option to adjust parameter options for newly generated models. Currently includes: ‘default’, ‘deep’ (searches more params, likely slower), and ‘regressor’ (forces ‘User’ regressor mode in regressor capable models)
num_validations (int) – number of cross validations to perform. 0 for just train/test on best split. Possible confusion: num_validations is the number of validations to perform after the first eval segment, so the total number of eval/validation segments will be this + 1. Also "auto" and "max" aliases available. Max maxes out at 50.
models_to_validate (int) – top n models to pass through to cross validation. Or float in 0 to 1 as % of tried. 0.99 is forced to 100% validation. 1 evaluates just 1 model. If horizontal or mosaic ensemble, then additional min per_series models above the number here are added to validation.
max_per_model_class (int) – of the models_to_validate what is the maximum to pass from any one model class/family.
validation_method (str) – 'even', 'backwards', or 'seasonal n' where n is an integer of seasonal lag.
'backwards' is better for recency and for shorter training sets.
'even' splits the data into equally-sized slices, best for more consistent data; a poetic but less effective strategy than others here.
'seasonal n', for example 'seasonal 364', would test all data on each previous year of the forecast_length that would immediately follow the training data.
'similarity' automatically finds the data sections most similar to the most recent data that will be used for prediction.
'custom' - if used, .fit() needs validation_indexes passed - a list of pd.DatetimeIndex's, the tail of each is used as the test set.
min_allowed_train_percent (float) – percent of forecast length to allow as min training, else raises error. 0.5 with a forecast length of 10 would mean 5 training points are mandated, for a total of 15 points. Useful in (unrecommended) cases where forecast_length > training length.
remove_leading_zeroes (bool) – replace leading zeroes with NaN. Useful in data where initial zeroes mean data collection hasn’t started yet.
prefill_na (str) – value to input to fill all NaNs with. Leaving as None and allowing model interpolation is recommended. None, 0, 'mean', or 'median'. 0 may be useful in, for example, sales cases where all NaN can be assumed equal to zero.
introduce_na (bool) – whether to force last values in one training validation to be NaN. Helps make more robust models. defaults to None, which introduces NaN in last rows of validations if any NaN in tail of training data. Will not introduce NaN to all series if subset is used. if True, will also randomly change 20% of all rows to NaN in the validations
preclean (dict) – if not None, a dictionary of Transformer params to be applied to input data {“fillna”: “median”, “transformations”: {}, “transformation_params”: {}} This will change data used in model inputs for fit and predict, and for accuracy evaluation in cross validation!
model_interrupt (bool) – if False, KeyboardInterrupts quit entire program. if True, KeyboardInterrupts attempt to only quit current model. if True, recommend use in conjunction with verbose > 0 and result_file in the event of accidental complete termination. if “end_generation”, as True and also ends entire generation of run. Note skipped models will not be tried again.
current_model_file (str) – file path to write to disk of current model params (for debugging if computer crashes). .json is appended
verbose (int) – setting to 0 or lower should reduce most output. Higher numbers give more output.
n_jobs (int) – Number of cores available to pass to parallel processing. A joblib context manager can be used instead (pass None in this case). Also ‘auto’.
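A typical end-to-end run, following the pattern from the project README (the model_list alias and generation counts here are illustrative):

    from autots import AutoTS, load_daily

    df_long = load_daily(long=True)

    model = AutoTS(
        forecast_length=21,
        frequency="infer",
        prediction_interval=0.9,
        ensemble=None,
        model_list="superfast",  # small alias list, quick to run
        max_generations=5,
        num_validations=2,
    )
    model = model.fit(
        df_long, date_col="datetime", value_col="value", id_col="series_id"
    )

    prediction = model.predict()
    forecast = prediction.forecast        # point forecasts, wide DataFrame
    upper = prediction.upper_forecast     # upper prediction interval
    lower = prediction.lower_forecast     # lower prediction interval
    print(model.best_model_name, model.best_model_params)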
best_model¶
DataFrame containing template for the best ranked model
- Type: pd.DataFrame
best_model_name¶
model name
- Type: str
best_model_params¶
model params
- Type: dict
best_model_transformation_params¶
transformation parameters
- Type: dict
best_model_ensemble¶
Ensemble type int id
- Type: int
regression_check¶
If True, the best_model uses an input 'User' future_regressor
- Type: bool
df_wide_numeric¶
dataframe containing shaped final data
- Type: pd.DataFrame
initial_results.model_results¶
contains a collection of result metrics
- Type: object
score_per_series¶
generated score of metrics given per input series, if horizontal ensembles
- Type: pd.DataFrame
- fit, predict
- export_template, import_template, import_results
- results, failure_rate
- horizontal_to_df, mosaic_to_df
- plot_horizontal, plot_horizontal_transformers, plot_generation_loss, plot_backforecast
back_forecast(series=None, n_splits: int = 'auto', tail: int = 'auto', verbose: int = 0)¶
Create forecasts for the historical training data, ie backcast or back forecast. OUT OF SAMPLE.
This actually forecasts on historical data; these are not fit model values as are often returned by other packages. As such, this will be slower, but more representative of real-world model performance. There may be jumps in data between chunks.
Args are the same as for model_forecast except:
n_splits (int): how many pieces to split data into. Pass 2 for fastest, or "auto" for best accuracy.
series (str): to run on only one column, pass the column name. Faster than full.
tail (int): df.tail() of the dataset; back_forecast is only run on the n most recent observations (which points at eval_periods of the lower-level back_forecast function).
Returns a standard prediction object (access .forecast, .lower_forecast, .upper_forecast).
export_template(filename=None, models: str = 'best', n: int = 20, max_per_model_class: int = None, include_results: bool = False)¶
Export top results as a reusable template.
- Parameters:
filename (str) – ‘csv’ or ‘json’ (in filename). None to return a dataframe and not write a file.
models (str) – ‘best’ or ‘all’
n (int) – if models = ‘best’, how many n-best to export
max_per_model_class (int) – if models = ‘best’, the max number of each model class to include in template
include_results (bool) – whether to include performance metrics
failure_rate(result_set: str = 'initial')¶
Return the fraction of models that failed with exceptions.
- Parameters:
result_set (str, optional) – ‘validation’ or ‘initial’. Defaults to ‘initial’.
- Returns:
float.
fit(df, date_col: str = None, value_col: str = None, id_col: str = None, future_regressor=None, weights: dict = {}, result_file: str = None, grouping_ids=None, validation_indexes: list = None)¶
Train algorithm given data supplied.
- Parameters:
df (pandas.DataFrame) – Datetime Indexed dataframe of series, or dataframe of three columns as below.
date_col (str) – name of datetime column
value_col (str) – name of column containing the data of series.
id_col (str) – name of column identifying different series.
future_regressor (numpy.Array) – single external regressor matching train.index
weights (dict) – {'colname1': 2, 'colname2': 5} - increase the importance of a series in metric evaluation. Any left blank are assumed to have a weight of 1. Pass the alias 'mean' as a str, ie weights='mean', to automatically use the mean value of a series as its weight. Available aliases: mean, median, min, max.
result_file (str) – results saved on each new generation. Does not include validation rounds. “.csv” save model results table. “.pickle” saves full object, including ensemble information.
grouping_ids (dict) – currently a one-level dict containing series_id:group_id mapping. used in 0.2.x but not 0.3.x+ versions. retained for potential future use
static get_new_params(method='random')¶
horizontal_per_generation()¶
horizontal_to_df()¶
Helper function for plotting.
import_results(filename)¶
Add results from another run on the same data.
Input can be a filename with .csv or .pickle, or can be a DataFrame of model results or a full TemplateEvalObject.
import_template(filename: str, method: str = 'add_on', enforce_model_list: bool = True, include_ensemble: bool = False)¶
Import a previously exported template of model parameters. Must be done before the AutoTS object is .fit().
- Parameters:
filename (str) – file location (or a pd.DataFrame already loaded)
method (str) – ‘add_on’ or ‘only’ - “add_on” keeps initial_template generated in init. “only” uses only this template.
enforce_model_list (bool) – if True, remove model types not in model_list
include_ensemble (bool) – if enforce_model_list is True, this specifies whether to allow ensembles anyway (otherwise they are unpacked and parts kept)
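A sketch of the round trip: export the top models from a finished run, then seed a later run with them (filenames are illustrative):

    # after model.fit(...) has completed
    model.export_template(
        "my_template.csv", models="best", n=15, max_per_model_class=3
    )

    # later, on a fresh instance, before .fit()
    new_model = AutoTS(forecast_length=21, max_generations=5)
    new_model = new_model.import_template(
        "my_template.csv", method="only", enforce_model_list=True
    )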
list_failed_model_types()¶
Return a list of model types (ie ETS, LastValueNaive) that failed. If all had at least one success, then return an empty list.
mosaic_to_df()¶
Helper function to create a readable df of models in mosaic.
plot_back_forecast(**kwargs)¶
plot_backforecast(series=None, n_splits: int = 'auto', start_date='auto', title=None, alpha=0.25, facecolor='black', loc='upper left', **kwargs)¶
Plot the historical data and fit forecast on historic. Out of sample in chunks = forecast_length by default.
- Parameters:
series (str or list) – column names of time series
n_splits (int or str) – “auto”, number > 2, higher more accurate but slower
start_date (datetime.datetime) – or “auto”
title (str) – plot title
**kwargs – passed to pd.DataFrame.plot()
plot_generation_loss(title='Single Model Accuracy Gain Over Generations', **kwargs)¶
Plot improvement in accuracy over generations. Note: this is only "one size fits all" accuracy and doesn't account for the benefits seen for ensembling.
- Parameters:
**kwargs – passed to pd.DataFrame.plot()
plot_horizontal(max_series: int = 20, title='Model Types Chosen by Series', **kwargs)¶
Simple plot to visualize assigned series: models.
Note that for 'mosaic' ensembles, it only plots the type of the most common model_id for that series, or the first if all are equally common.
- Parameters:
max_series (int) – max number of points to plot
**kwargs – passed to pandas.plot()
plot_horizontal_model_count(color_list=None, top_n: int = 20, **kwargs)¶
Plots most common models. Does not factor in models nested in non-horizontal Ensembles.
plot_horizontal_per_generation(title='Horizontal Ensemble Model Accuracy Gain Over Generations', **kwargs)¶
Plot how well the horizontal ensembles would do after each new generation. Slow.
plot_horizontal_transformers(method='transformers', color_list=None, **kwargs)¶
Simple plot to visualize transformers used. Note this doesn't capture transformers nested in simple ensembles.
- Parameters:
method (str) – 'fillna' or 'transformers' - which to plot
color_list (list) – list of colors to sample for bar colors. Can be names or hex.
**kwargs – passed to pandas.plot()
plot_per_series_error(title: str = 'Top Series Contributing Score Error', max_series: int = 10, max_name_chars: int = 25, color: str = '#ff9912', figsize=(12, 4), kind: str = 'bar', **kwargs)¶
Plot which series are contributing most to error (Score) of final model. Avg of validations for best_model.
- Parameters:
title (str) – plot title
max_series (int) – max number of series to show on plot (sorted)
max_name_chars (str) – if horizontal ensemble, will chop series names to this
color (str) – hex or name of color of plot
figsize (tuple) – passed through to plot axis
kind (str) – bar or pie
**kwargs – passed to pandas.plot()
plot_per_series_smape(title: str = None, max_series: int = 10, max_name_chars: int = 25, color: str = '#ff9912', figsize=(12, 4), kind: str = 'bar', **kwargs)¶
Plot which series are contributing most to SMAPE of final model. Avg of validations for best_model.
- Parameters:
title (str) – plot title
max_series (int) – max number of series to show on plot (sorted)
max_name_chars (str) – if horizontal ensemble, will chop series names to this
color (str) – hex or name of color of plot
figsize (tuple) – passed through to plot axis
kind (str) – bar or pie
**kwargs – passed to pandas.plot()
predict(forecast_length: int = 'self', prediction_interval: float = 'self', future_regressor=None, hierarchy=None, just_point_forecast: bool = False, fail_on_forecast_nan: bool = True, verbose: int = 'self')¶
Generate forecast data immediately following dates of index supplied to .fit().
- Parameters:
forecast_length (int) – Number of periods of data to forecast ahead
prediction_interval (float) – interval of upper/lower forecasts. Defaults to 'self', ie the interval specified in __init__(). If prediction_interval is a list, then a dict of forecast objects is returned: {str(interval): prediction_object}
future_regressor (numpy.Array) – additional regressor
hierarchy – Not yet implemented
just_point_forecast (bool) – If True, return a pandas.DataFrame of just point forecasts
fail_on_forecast_nan (bool) – if False, return forecasts even if NaN present, if True, raises error if any nan in forecast
- Returns:
Either a PredictionObject of forecasts and metadata, or if just_point_forecast == True, a dataframe of point forecasts
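For instance, assuming a fitted model as above, the horizon and interval can be overridden at predict time; a list of intervals returns a dict keyed by str(interval):

    prediction = model.predict(forecast_length=30)
    by_interval = model.predict(prediction_interval=[0.9, 0.5])
    forecast_90 = by_interval["0.9"].forecast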
results(result_set: str = 'initial')¶
Convenience function to return the table of tested models.
- Parameters:
result_set (str) – ‘validation’ or ‘initial’
autots.TransformTS¶
Alias of GeneralTransformer.
class autots.GeneralTransformer(fillna: str = None, transformations: dict = {}, transformation_params: dict = {}, grouping: str = None, reconciliation: str = None, grouping_ids=None, random_seed: int = 2020, n_jobs: int = 1)¶
Bases: object
Fill NAs and then apply mathematical transformations.
Expects a chronologically sorted pandas.DataFrame with a DatetimeIndex, only numeric data, and a 'wide' (one column per series) shape.
Warning:
- inverse_transform will not fully return the original data under many conditions
- the primary intention of inverse_transform is to inverse forecast data (immediately following the historical time period) from models, not to return original data
- NAs filled will be returned with the filled value
- Discretization, statsmodels filters, Round, Slice, ClipOutliers cannot be inversed
- RollingMean, PctChange, CumSum, Seasonal Difference, and DifferencedTransformer will only return the original or an immediately following forecast; by default 'forecast' is expected, 'original' can be set in trans_method
- Parameters:
fillNA (str) – method to fill NA, passed through to FillNA():
'ffill' - fill most recent non-na value forward until another non-na value is reached
'zero' - fill with zero. Useful for sales and other data where NA does usually mean $0.
'mean' - fill all missing values with the series' overall average value
'median' - fill all missing values with the series' overall median value
'rolling_mean' - fill with last n (window = 10) values
'rolling_mean_24' - fill with avg of last 24
'ffill_mean_biased' - simple avg of ffill and mean
'fake_date' - shifts forward data over nan, thus values will have incorrect timestamps
'IterativeImputer' - sklearn iterative imputer
most of the interpolate methods from pandas.interpolate
transformations (dict) – transformations to apply {0: "MinMaxScaler", 1: "Detrend", ...}:
'None'
'MinMaxScaler' - Sklearn MinMaxScaler
'PowerTransformer' - Sklearn PowerTransformer
'QuantileTransformer' - Sklearn
'MaxAbsScaler' - Sklearn
'StandardScaler' - Sklearn
'RobustScaler' - Sklearn
'PCA', 'FastICA' - performs sklearn decomposition and returns n-cols worth of n_components
'Detrend' - fit then remove a linear regression from the data
'RollingMeanTransformer' - 10 period rolling average, can receive a custom window by transformation_param if used as second_transformation
'FixedRollingMean' - same as RollingMean, but with inverse_transform disabled, so smoothed forecasts are maintained.
'RollingMean10' - 10 period rolling average (smoothing)
'RollingMean100thN' - Rolling mean of periods of len(train)/100 (minimum 2)
'DifferencedTransformer' - makes each value the difference of that value and the previous value
'PctChangeTransformer' - converts to pct_change, not recommended if lots of zeroes in data
'SinTrend' - removes a sin trend (fitted to each column) from the data
'CumSumTransformer' - makes value sum of all previous
'PositiveShift' - makes all values >= 1
'Log' - log transform (uses PositiveShift first as necessary)
'IntermittentOccurrence' - -1, 1 for non median values
'SeasonalDifference' - remove the last lag values from all values
'SeasonalDifferenceMean' - remove the average lag values from all
'SeasonalDifference7', '12', '28' - non-parameterized versions of SeasonalDifference
'CenterLastValue' - center data around tail of dataset
'Round' - round values on inverse or transform
'Slice' - use only recent records
'ClipOutliers' - remove outliers
'Discretize' - bin or round data into groups
'DatepartRegression' - remove a trend trained on the datetime index
'ScipyFilter' - filter data (lose information but smoother!) from scipy
'HPFilter' - statsmodels hp_filter
'STLFilter' - seasonal decompose and keep just one part of decomposition
'EWMAFilter' - use an exponential weighted moving average to smooth data
'MeanDifference' - joint version of differencing
'Cointegration' - VECM but just the vectors
'BTCD' - Box Tiao decomposition
'AlignLastValue' - align forecast start to end of training data
'AnomalyRemoval' - more tailored anomaly removal options
'HolidayTransformer' - detects holidays and wishes good cheer to all
'LocalLinearTrend' - rolling local trend, using tails for future and past trend
'KalmanSmoothing' - smooth using a state space model
transformation_params (dict) – params of transformers {0: {}, 1: {‘model’: ‘Poisson’}, …} pass through dictionary of empty dictionaries to utilize defaults
random_seed (int) – random state passed through where applicable
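A minimal sketch of a two-step chain on wide data (parameter choices are illustrative):

    from autots import GeneralTransformer, load_daily

    df = load_daily(long=False)  # wide: DatetimeIndex, one column per series

    transformer = GeneralTransformer(
        fillna="ffill",
        transformations={"0": "MinMaxScaler", "1": "DifferencedTransformer"},
        transformation_params={"0": {}, "1": {}},  # empty dicts take defaults
    )
    df_trans = transformer.fit_transform(df)

    # 'original' inverses the training period; the default 'forecast' is
    # meant for data immediately following the training period
    df_back = transformer.inverse_transform(df_trans, trans_method="original")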
fill_na(df, window: int = 10)¶
- Parameters:
df (pandas.DataFrame) – Datetime Indexed
window (int) – passed through to rolling mean fill technique
- Returns:
pandas.DataFrame
fit(df)¶
Apply transformations and return transformer object.
- Parameters:
df (pandas.DataFrame) – Datetime Indexed
fit_transform(df)¶
Directly fit and apply transformations to convert df.
static get_new_params(method='random')¶
inverse_transform(df, trans_method: str = 'forecast', fillzero: bool = False, bounds: bool = False)¶
Undo the madness.
- Parameters:
df (pandas.DataFrame) – Datetime Indexed
trans_method (str) – ‘forecast’ or ‘original’ passed through
fillzero (bool) – if inverse returns NaN, fill with zero
bounds (bool) – currently ignores AlignLastValue transform if True (also used in process_components of Cassandra)
classmethod retrieve_transformer(transformation: str = None, param: dict = {}, df=None, random_seed: int = 2020, n_jobs: int = 1)¶
Retrieves a specific transformer object from a string.
- Parameters:
df (pandas.DataFrame) – Datetime Indexed - required to set params for some transformers
transformation (str) – name of desired method
param (dict) – dict of kwargs to pass (legacy: an actual param)
- Returns:
transformer object
transform(df)¶
Apply transformations to convert df.
autots.RandomTransform(transformer_list: dict = {None: 0.0, 'MinMaxScaler': 0.05, 'PowerTransformer': 0.02, 'QuantileTransformer': 0.05, 'MaxAbsScaler': 0.05, 'StandardScaler': 0.04, 'RobustScaler': 0.05, 'PCA': 0.01, 'FastICA': 0.01, 'Detrend': 0.1, 'RollingMeanTransformer': 0.02, 'RollingMean100thN': 0.01, 'DifferencedTransformer': 0.07, 'SinTrend': 0.01, 'PctChangeTransformer': 0.01, 'CumSumTransformer': 0.02, 'PositiveShift': 0.02, 'Log': 0.01, 'IntermittentOccurrence': 0.01, 'SeasonalDifference': 0.1, 'cffilter': 0.01, 'bkfilter': 0.05, 'convolution_filter': 0.001, 'HPFilter': 0.01, 'DatepartRegression': 0.01, 'ClipOutliers': 0.05, 'Discretize': 0.01, 'CenterLastValue': 0.01, 'Round': 0.02, 'Slice': 0.02, 'ScipyFilter': 0.02, 'STLFilter': 0.01, 'EWMAFilter': 0.02, 'MeanDifference': 0.002, 'BTCD': 0.01, 'Cointegration': 0.01, 'AlignLastValue': 0.1, 'AnomalyRemoval': 0.03, 'HolidayTransformer': 0.01, 'LocalLinearTrend': 0.01, 'KalmanSmoothing': 0.01}, transformer_max_depth: int = 4, na_prob_dict: dict = {'ffill': 0.4, 'fake_date': 0.1, 'rolling_mean': 0.1, 'rolling_mean_24': 0.1, 'IterativeImputer': 0.05, 'mean': 0.06, 'zero': 0.05, 'ffill_mean_biased': 0.1, 'median': 0.03, None: 0.001, 'interpolate': 0.4, 'KNNImputer': 0.05, 'IterativeImputerExtraTrees': 0.0001}, fast_params: bool = None, superfast_params: bool = None, traditional_order: bool = False, transformer_min_depth: int = 1, allow_none: bool = True, no_nan_fill: bool = False)¶
Return a dict of randomly chosen transformation selections.
BTCD is used as a signal that slow parameters are allowed.
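A sketch of sampling a random configuration and feeding it to GeneralTransformer (the returned dict carries fillna, transformations, and transformation_params keys):

    from autots import GeneralTransformer, RandomTransform

    params = RandomTransform(transformer_max_depth=3)
    transformer = GeneralTransformer(**params)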
autots.long_to_wide(df, date_col: str = 'datetime', value_col: str = 'value', id_col: str = 'series_id', aggfunc: str = 'first')¶
Take long data and convert into wide, cleaner data.
- Parameters:
df (pd.DataFrame) –
date_col (str) –
value_col (str) – the name of the column with the values of the time series (ie sales $)
id_col (str) – name of the id column, unique for each time series
aggfunc (str) – passed to pd.pivot_table, determines how to aggregate duplicates for series_id and datetime. Other options include "mean" and other numpy functions; beware data must already be input as numeric type for these to work. If categorical data is provided, aggfunc='first' is recommended.
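A small worked example with toy data:

    import pandas as pd
    from autots import long_to_wide

    df_long = pd.DataFrame({
        "datetime": list(pd.date_range("2023-01-01", periods=6)) * 2,
        "series_id": ["a"] * 6 + ["b"] * 6,
        "value": range(12),
    })
    df_wide = long_to_wide(
        df_long,
        date_col="datetime",
        value_col="value",
        id_col="series_id",
        aggfunc="first",
    )
    # result: DatetimeIndex with columns 'a' and 'b'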
autots.model_forecast(model_name, model_param_dict, model_transform_dict, df_train, forecast_length: int, frequency: str = 'infer', prediction_interval: float = 0.9, no_negatives: bool = False, constraint: float = None, future_regressor_train=None, future_regressor_forecast=None, holiday_country: str = 'US', startTimeStamps=None, grouping_ids=None, fail_on_forecast_nan: bool = True, random_seed: int = 2020, verbose: int = 0, n_jobs: int = 'auto', template_cols: list = ['Model', 'ModelParameters', 'TransformationParameters', 'Ensemble'], horizontal_subset: list = None, return_model: bool = False, current_model_file: str = None, model_count: int = 0, **kwargs)¶
Takes numeric data, returns numeric forecasts.
Only one model (albeit potentially an ensemble)! Horizontal ensembles cannot be nested; other ensemble types can be.
Well, she turned me into a newt. A newt? I got better. -Python
- Parameters:
model_name (str) – a string to be direct to the appropriate model, used in ModelMonster
model_param_dict (dict) – dictionary of parameters to be passed into the model.
model_transform_dict (dict) – a dictionary of fillNA and transformation methods to be used; pass an empty dictionary if no transformations are desired.
df_train (pandas.DataFrame) – numeric training dataset of DatetimeIndex and series as cols
forecast_length (int) – number of periods to forecast
frequency (str) – str representing frequency alias of time series
prediction_interval (float) – width of errors (note: rarely do the intervals accurately match the % asked for…)
no_negatives (bool) – whether to force all forecasts to be > 0
constraint (float) – when not None, use this value * data st dev above max or below min for constraining forecast values.
future_regressor_train (pd.Series) – with datetime index, of known in advance data, section matching train data
future_regressor_forecast (pd.Series) – with datetime index, of known in advance data, section matching test data
holiday_country (str) – passed through to holiday package, used by a few models as 0/1 regressor.
n_jobs (int) – number of CPUs to use when available.
template_cols (list) – column names of columns used as model template
horizontal_subset (list) – columns of df_train to use for forecast, meant for internal use for horizontal ensembling
fail_on_forecast_nan (bool) – if False, return forecasts even if NaN present, if True, raises error if any nan in forecast. True is recommended.
return_model (bool) – if True, forecast will have .model and .transformer attributes set to the model object. Only works for non-ensembles.
current_model_file (str) – file path to write to disk of current model params (for debugging if computer crashes). .json is appended
- Returns:
Prediction from AutoTS model object
- Return type:
PredictionObject (autots.PredictionObject)
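A sketch of running one named model directly, outside the AutoTS search (model and transformer choices are illustrative):

    from autots import load_daily, model_forecast

    df = load_daily(long=False)
    prediction = model_forecast(
        model_name="AverageValueNaive",
        model_param_dict={"method": "Mean"},
        model_transform_dict={
            "fillna": "mean",
            "transformations": {"0": "DifferencedTransformer"},
            "transformation_params": {"0": {}},
        },
        df_train=df,
        forecast_length=12,
    )
    print(prediction.forecast.head())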
autots.create_lagged_regressor(df, forecast_length: int, frequency: str = 'infer', scale: bool = True, summarize: str = None, backfill: str = 'bfill', n_jobs: str = 'auto', fill_na: str = 'ffill')¶
Create a regressor of features lagged by forecast length. Useful for some models that don't otherwise use such information.
It is recommended that the .head(forecast_length) of both regressor_train and the df for training are dropped: df = df.iloc[forecast_length:]
- Parameters:
df (pd.DataFrame) – training data
forecast_length (int) – length of forecasts, to shift data by
frequency (str) – the ever necessary frequency for datetime things. Default ‘infer’
scale (bool) – if True, use the StandardScaler to standardize the features
summarize (str) – options to summarize the features, if large: ‘pca’, ‘median’, ‘mean’, ‘mean+std’, ‘feature_agglomeration’, ‘gaussian_random_projection’, “auto”
backfill (str) – method to deal with the NaNs created by shifting: "bfill" - backfill with last values; "ETS" - backfill with ETS backwards forecast; "DatepartRegression" - backfill with DatepartRegression
fill_na (str) – method to prefill NAs in data, same methods as available elsewhere
- Returns:
regressor_train, regressor_forecast
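A sketch, including the recommended trim of the training head:

    from autots import create_lagged_regressor, load_daily

    df = load_daily(long=False)
    forecast_length = 14
    regressor_train, regressor_forecast = create_lagged_regressor(
        df, forecast_length=forecast_length, summarize="auto", backfill="bfill"
    )
    # drop the first forecast_length rows of both, as recommended
    df = df.iloc[forecast_length:]
    regressor_train = regressor_train.iloc[forecast_length:]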
autots.create_regressor(df, forecast_length, frequency: str = 'infer', holiday_countries: list = ['US'], datepart_method: str = 'simple_binarized', drop_most_recent: int = 0, scale: bool = True, summarize: str = 'auto', backfill: str = 'bfill', n_jobs: str = 'auto', fill_na: str = 'ffill', aggfunc: str = 'first', encode_holiday_type=False, holiday_detector_params={'anomaly_detector_params': {'forecast_params': None, 'method': 'mad', 'method_params': {'alpha': 0.05, 'distribution': 'gamma'}, 'transform_dict': {'fillna': None, 'transformation_params': {'0': {}}, 'transformations': {'0': 'DifferencedTransformer'}}}, 'output': 'univariate', 'splash_threshold': None, 'threshold': 0.8, 'use_dayofmonth_holidays': True, 'use_hebrew_holidays': False, 'use_islamic_holidays': False, 'use_lunar_holidays': False, 'use_lunar_weekday': False, 'use_wkdeom_holidays': False, 'use_wkdom_holidays': True}, holiday_regr_style='flag')¶
Create a regressor from information available in the existing dataset. Components are lagged data, datepart information, and holidays.
All of this info and more is already created by the ~Regression models, but this may help some other models (GLM, WindowRegression).
It is recommended that the .head(forecast_length) of both regressor_train and the df for training are dropped: df = df.iloc[forecast_length:]. If you don't want the lagged features, set summarize="median", which will give only one column of such, which can then be easily dropped.
- Parameters:
df (pd.DataFrame) – WIDE style dataframe (use long_to_wide if the data isn't already); categorical series will be discarded for this, if present
forecast_length (int) – time ahead that will be forecast
frequency (str) – those annoying offset codes you have to always use for time series
holiday_countries (list) – list of countries to pull holidays for. Reqs holidays pkg also can be a dict of {‘country’: “subdiv”} to include subdivision (state)
datepart_method (str) – see date_part from seasonal
scale (bool) – if True, use the StandardScaler to standardize the features
summarize (str) – options to summarize the features, if large: ‘pca’, ‘median’, ‘mean’, ‘mean+std’, ‘feature_agglomeration’, ‘gaussian_random_projection’
backfill (str) – method to deal with the NaNs created by shifting: "bfill" - backfill with last values; "ETS" - backfill with ETS backwards forecast; "DatepartRegression" - backfill with DatepartRegression
fill_na (str) – method to prefill NAs in data, same methods as available elsewhere
aggfunc (str) – str or func, used if frequency is resampled
encode_holiday_type (bool) – if True, returns column per holiday, ONLY for holidays package country holidays (not Detector)
holiday_detector_params (dict) – passed to HolidayDetector, or None
holiday_regr_style (str) – passed to detector’s dates_to_holidays ‘flag’, ‘series_flag’, ‘impact’
- Returns:
regressor_train, regressor_forecast
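A sketch with mostly default settings; summarize="median" keeps the lagged features to a single, easily dropped column:

    from autots import create_regressor, load_daily

    df = load_daily(long=False)
    forecast_length = 14
    regressor_train, regressor_forecast = create_regressor(
        df,
        forecast_length=forecast_length,
        frequency="D",
        holiday_countries=["US"],
        summarize="median",
    )
    df = df.iloc[forecast_length:]
    regressor_train = regressor_train.iloc[forecast_length:]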
class autots.EventRiskForecast(df_train, forecast_length, frequency: str = 'infer', prediction_interval=0.9, lower_limit=0.05, upper_limit=0.95, model_name='UnivariateMotif', model_param_dict={'distance_metric': 'euclidean', 'k': 10, 'pointed_method': 'median', 'return_result_windows': True, 'window': 14}, model_transform_dict={'fillna': 'pchip', 'transformation_params': {'0': {'method': 0.5}, '1': {}, '2': {'fixed': False, 'window': 7}, '3': {}}, 'transformations': {'0': 'Slice', '1': 'DifferencedTransformer', '2': 'RollingMeanTransformer', '3': 'MaxAbsScaler'}}, model_forecast_kwargs={'max_generations': 30, 'n_jobs': 'auto', 'random_seed': 321, 'verbose': 1}, future_regressor_train=None, future_regressor_forecast=None)¶
Bases: object
Generate a risk score (0 to 1, but usually close to 0) for a future event exceeding user-specified upper or lower bounds.
Upper and lower limits can each be one of four types, and may each be different:
1. None (no risk score calculated for this direction)
2. Float in range [0, 1]: the historic quantile of the series (which is historic min and max at the edges) is chosen as the limit.
3. A dictionary of {"model_name": x, "model_param_dict": y, "model_transform_dict": z, "prediction_interval": 0.9} to generate a forecast as the limits. Primarily intended for simple forecasts like SeasonalNaive, but can be used with any AutoTS model.
4. A custom input numpy array of shape (forecast_length, num_series).
This can be used to find the "middle" limit too: flip so upper=lower and lower=upper, then abs(U - (1 - L)). In some cases it may help to drop the results from the first forecast timestep or two.
This functions by generating multiple outcome forecast possibilities in two ways. If a 'Motif' type model is passed, it uses all the k neighbor motifs as outcome paths (recommended). All other AutoTS models will generate the possible outcomes by utilizing multiple prediction_intervals (more intervals = slower but more resolution). The risk score is then the % of outcome forecasts which cross the limit (less than or equal for lower, greater than or equal for upper).
Only accepts wide-style dataframe input. Methods are class methods and can be used standalone. They default to __init__ inputs, but can be overridden. Results are usually a numpy array of shape (forecast_length, num_series).
- Parameters:
df_train (pd.DataFrame) – wide style data, pd.DatetimeIndex for index and one series per column
forecast_length (int) – number of forecast steps to make
frequency (str) – frequency of timesteps
prediction_interval (float) – float or list of floats for probabilistic forecasting; if a list, the first item in the list is the one used for the .fit default
model_forecast_kwargs (dict) – AutoTS kwargs to pass to generate_result_windows, .fit_forecast, and forecast-style limits
model_name, model_param_dict, model_transform_dict – for model_forecast in generate_result_windows
future_regressor_train, future_regressor_forecast – regressor arrays if used
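A sketch using historic-quantile limits (0.95 upper, 0.05 lower):

    from autots import EventRiskForecast, load_daily

    df = load_daily(long=False)
    model = EventRiskForecast(
        df,
        forecast_length=14,
        upper_limit=0.95,  # float limits are historic quantiles
        lower_limit=0.05,
    )
    model.fit()
    upper_risk, lower_risk = model.predict()
    historic_upper, historic_lower = model.predict_historic()
    model.plot(0)  # sample plot of the first series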
fit()¶
predict()¶
predict_historic()¶
generate_result_windows()¶
generate_risk_array()¶
generate_historic_risk_array()¶
set_limit()¶
plot()¶
- result_windows, forecast_df, up_forecast_df, low_forecast_df
- lower_limit_2d, upper_limit_2d, upper_risk_array, lower_risk_array
fit(df_train=None, forecast_length=None, prediction_interval=None, models_mode='event_risk', model_list=['UnivariateMotif', 'MultivariateMotif', 'SectionalMotif', 'ARCH', 'MetricMotif', 'SeasonalityMotif'], ensemble=None, autots_kwargs=None, future_regressor_train=None)¶
Shortcut for generating model params.
args specified are those suggested for an otherwise normal AutoTS run
- Parameters:
df_train (pd.DataFrame) – wide style only
models_mode (str) – event_risk here is used by motif models
model_list (list) – suggesting the use of motif models
ensemble (list) – must be None or empty list to get motif result windows
autots_kwargs (dict) – all other args passed in as kwargs if None, defaults to class model_forecast_kwargs, for blank pass empty dict
static generate_historic_risk_array(df, limit, direction='upper')¶
Given a df and a limit, returns a 0/1 array of whether the limit was equaled or exceeded.
generate_result_windows(df_train=None, forecast_length=None, frequency=None, prediction_interval=None, model_name=None, model_param_dict=None, model_transform_dict=None, model_forecast_kwargs=None, future_regressor_train=None, future_regressor_forecast=None)¶
For event risk forecasting. Params default to class init but can be overridden here.
- Returns:
(num_samples/k, forecast_length, num_series/columns)
- Return type:
result_windows (numpy.array)
static generate_risk_array(result_windows, limit, direction='upper')¶
Given result windows and a limit, returns a 0/1 array of whether the limit was equaled or exceeded.
plot(column_idx=0, grays=['#838996', '#c0c0c0', '#dcdcdc', '#a9a9a9', '#808080', '#989898', '#808080', '#757575', '#696969', '#c9c0bb', '#c8c8c8', '#323232', '#e5e4e2', '#778899', '#4f666a', '#848482', '#414a4c', '#8a7f80', '#c4c3d0', '#bebebe', '#dbd7d2'], up_low_color=['#ff4500', '#ff5349'], bar_color='#6495ED', bar_ylim=[0.0, 0.5], figsize=(14, 8), result_windows=None, lower_limit_2d=None, upper_limit_2d=None, upper_risk_array=None, lower_risk_array=None)¶
Plot a sample of the risk forecast outcomes.
- Parameters:
column_idx (int) – positional index of series to sample for plot
grays (list of str) – list of hex codes for colors for the potential forecasts
up_low_color (list of str) – two hex code colors for lower and upper
bar_color (str) – hex color for bar graph
bar_ylim (list) – passed to ylim of plot, sets scale of axis of barplot
figsize (tuple) – passed to figsize of output figure
plot_eval(df_test, column_idx=0, actuals_color=['#00BFFF'], up_low_color=['#ff4500', '#ff5349'], bar_color='#6495ED', bar_ylim=[0.0, 0.5], figsize=(14, 8), lower_limit_2d=None, upper_limit_2d=None, upper_risk_array=None, lower_risk_array=None)¶
Plot a sample of the risk forecast with known value vs risk score.
- Parameters:
df_test (pd.DataFrame) – dataframe of known values (dt index, series)
column_idx (int) – positional index of series to sample for plot
actuals_color (list of str) – list of one hex code for line of known actuals
up_low_color (list of str) – two hex code colors for lower and upper
bar_color (str) – hex color for bar graph
bar_ylim (list) – passed to ylim of plot, sets scale of axis of barplot
figsize (tuple) – passed to figsize of output figure
predict()¶
Returns forecast upper, lower risk probability arrays for input limits.
predict_historic(upper_limit=None, lower_limit=None, eval_periods=None)¶
Returns upper, lower risk probability arrays for input limits for the historic data. If manual numpy array limits are used, the limits will need to be of appropriate shape (for df_train and eval_periods if used).
- Parameters:
upper_limit – if different than the version passed to init
lower_limit – if different than the version passed to init
eval_periods (int) – only assess the n most recent periods of history
static set_limit(limit, target_shape, df_train, direction='upper', period='forecast', forecast_length=None, eval_periods=None)¶
Handles all limit input styles and returns a numpy array.
- Parameters:
limit – see class overview for input options
target_shape (tuple) – of (forecast_length, num_series)
df_train (pd.DataFrame) – training data
direction (str) – whether it is the “upper” or “lower” limit
period (str) – “forecast” or “historic” only used for limits defined by forecast algorithm params
forecast_length (int) – needed only for historic of forecast algorithm defined limit
eval_periods (int) – only for historic forecast limit, only runs on the tail n (this) of data
class autots.AnomalyDetector(output='multivariate', method='zscore', transform_dict={'transformation_params': {0: {'datepart_method': 'simple_3', 'regression_model': {'model': 'ElasticNet', 'model_params': {}}}}, 'transformations': {0: 'DatepartRegression'}}, forecast_params=None, method_params={}, eval_period=None, n_jobs=1)¶
Bases: object
detect(df)¶
All will return -1 for anomalies.
- Parameters:
df (pd.DataFrame) – pandas wide-style data
- Returns:
pd.DataFrame (classifications, -1 = outlier, 1 = not outlier), pd.DataFrame (scores)
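A sketch of z-score detection on wide data (method choice is illustrative):

    from autots import AnomalyDetector, load_daily

    df = load_daily(long=False)
    detector = AnomalyDetector(output="multivariate", method="zscore")
    anomalies, scores = detector.detect(df)  # -1 = outlier, 1 = not outlier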
fit(df)¶
fit_anomaly_classifier()¶
Fit a model to predict if a score is an anomaly.
static get_new_params(method='random')¶
plot(series_name=None, title=None, plot_kwargs={})¶
score_to_anomaly(scores)¶
A DecisionTree model, used as models are nonstandard (and nonparametric).
class autots.HolidayDetector(anomaly_detector_params={}, threshold=0.8, min_occurrences=2, splash_threshold=0.65, use_dayofmonth_holidays=True, use_wkdom_holidays=True, use_wkdeom_holidays=True, use_lunar_holidays=True, use_lunar_weekday=False, use_islamic_holidays=True, use_hebrew_holidays=True, output: str = 'multivariate', n_jobs: int = 1)¶
Bases: object
dates_to_holidays(dates, style='flag', holiday_impacts=False)¶
Populate date information for a given pd.DatetimeIndex.
- Parameters:
dates (pd.DatetimeIndex) – list of dates
day_holidays (pd.DataFrame) – list of month/day holidays. Pass None if not available.
style (str) – option for how to return information:
"long" - return date, name, series for all holidays in a long style dataframe
"impact" - returns dates, series with values of sum of impacts (if given) or joined string of holiday names
"flag" - return dates, holidays flag (is not 0-1 but rather the sum of input series impacted for that holiday and day)
"prophet" - return format required for prophet. Will need to be filtered on series for the multivariate case.
"series_flag" - dates, series 0/1 for if the holiday occurred in any calendar
holiday_impacts (dict) – a dict passed to .replace containing values for holiday_names, or str 'value' or 'anomaly_score'
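A sketch of detecting holidays from history and flagging them over a future index:

    import pandas as pd
    from autots import HolidayDetector, load_daily

    df = load_daily(long=False)
    detector = HolidayDetector(output="multivariate")
    detector.detect(df)
    future_dates = pd.date_range("2024-01-01", periods=365, freq="D")
    flags = detector.dates_to_holidays(future_dates, style="flag")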
detect(df)¶
Run holiday detection. Input wide-style pandas time series.
fit(df)¶
static get_new_params(method='random')¶
plot(series_name=None, include_anomalies=True, title=None, plot_kwargs={})¶
plot_anomaly(kwargs={})¶