Metadata-Version: 2.4
Name: quantresearch_thd
Version: 0.1.26
Summary: Ensemble framework for detecting outliers in grouped time-series data
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: pandas
Requires-Dist: numpy<2
Requires-Dist: joblib
Requires-Dist: prophet
Requires-Dist: scikit-learn
Requires-Dist: google-cloud-bigquery
Requires-Dist: google-cloud-storage
Requires-Dist: statsmodels
Requires-Dist: plotly
Requires-Dist: pandas-gbq
Requires-Dist: gcsfs
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# Quantresearch_thd

Quantresearch_thd is an ensemble framework for detecting outliers in grouped time-series data. It automates the entire workflow from data cleaning and calendar interpolation to running 8 different detection algorithms and generating visual diagnostic reports.

## Key Capabilities

- Ensemble Scoring: Combines 8 models (statistical + ML) to produce a robust Anomaly_Votes count and a final is_Anomaly consensus.
- Hierarchical Processing: Natively handles grouped data (e.g., detecting anomalies per Region, Product, or Channel).
- Automated Preprocessing: Handles missing dates via linear interpolation and filters out "low-quality" unique_ids automatically.
- Parallel Execution: Leverages joblib for multi-core processing of large datasets.
- Visual Analytics: Generates pie charts, stacked bar plots, and detailed group-level time-series breakdowns.

## Included Models

The pipeline utilizes an ensemble of the following methodologies:

- Statistical: Percentile (5th/95th), Standard Deviation (SD), Median Absolute Deviation (MAD), and Interquartile Range (IQR).

- Time-Series Specific: EWMA (Exponentially Weighted Moving Average) and FB Prophet (Walk-forward validation).

- Machine Learning: Isolation Forest (General & Time-series optimized) and DBSCAN.

## Detailed Functionality

- Robust Input Validation: Clear error messaging for missing parameters or incorrect data types.

- Quality Control: Automatically generates a Success Report and an Exclusion Report summarizing which groups were processed or filtered out.

- Visual Suite: Automated rendering of Pie Charts (Summary), Stacked Bars (Distribution), and Top-5 Anomaly Heatmaps.

## 🚀 Quick Start

```python
# Install first: pip install quantresearch_thd
import pandas as pd
from quantresearch_thd import timeseries_anomaly_detection

# Load your data
df = pd.read_csv("your_data.csv")

# Run the pipeline
anomaly_df, success_report, exclusion_report = timeseries_anomaly_detection(
    master_data=df,
    unique_ids=['category', 'region'],
    variable='sales',
    date_column='timestamp',
    freq='W-MON',
    eval_period=1,  # evaluate the most recent record
)
```
## 📊 Visualizing Results & Deep Dives
If a specific group shows a high anomaly rate, use the evaluation_info tool to render detailed diagnostic plots for that group.

```python
from quantresearch_thd import evaluation_info

# Define the group to inspect (values must match the order of unique_ids)
unique_ids = ['category', 'region']
group_values = ['appliances', 'TX']

# Filter the results for this group
mask = anomaly_df[unique_ids].eq(group_values).all(axis=1)
group_df = anomaly_df[mask]

# Generate detailed diagnostic plots
evaluation_info(group_df,
                unique_ids,
                variable='sales',
                date_column='timestamp',
                eval_period=1)
```

The Evaluation Dashboard provides:

- Model Breakdown: Individual charts for FB Prophet, EWMA, and Isolation Forest with confidence intervals.

- Ensemble View: A summary highlighting where multiple models overlap.

- Statistical Thresholds: Visual markers for IQR, MAD, percentile, and SD limits.

## Input Data

### Mandatory

* **master_data** (pd.DataFrame) : The dataframe containing the inputs to be evaluated for anomalies; must include the variable, date, and unique_id columns.
* **unique_ids** (list[str])     : List of columns used to segment the data, e.g. ['SKU', 'channel', 'store_id'].
* **variable** (str)             : The numerical target column to analyze for anomalies.
* **date_column** (str)          : The datetime column representing the time dimension, e.g. "date", "week", "month".

### Default

* **freq** (str)                : Frequency of the date column. Default: 'W-MON'. Also accepts 'D' or 'MS'.
* **eval_period** (int)         : Number of trailing records (periods) to evaluate for anomalies. Default: 1.
* **max_records** (int)         : Maximum history to consider, counted back from the most recent date. Default: all history.
* **imputation_method** (str)   : Technique to fill missing time units. Default: 'linear'. Acceptable values: 'mean', 'mode', 'zero', 'linear'.
* **mad_threshold** (int)       : MAD parameter; controls Median Absolute Deviation sensitivity. Default: 2.
* **mad_scale_factor** (float)  : MAD parameter; the scaling constant used to normalize the MAD. Default: 0.6745.
* **alpha** (float)             : EWMA parameter; the smoothing factor for the EWMA trend. Default: 0.3.
* **sigma** (float)             : EWMA parameter; the standard deviation multiplier for the upper and lower bounds. Default: 1.5.
* **prophet_CI** (float)        : Prophet parameter; the confidence interval width, in the range 0 to 1. Default: 0.9.
* **contamination** (float)     : Isolation Forest parameter; the expected fraction of outliers (0 to 0.5). Default: 0.03.
* **random_state** (int)        : Seed for model reproducibility. Default: 42.
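As a sketch, a minimal valid `master_data` frame for the mandatory inputs above can be assembled like this (the column names `timestamp`, `category`, `region`, and `sales` are illustrative, matching the Quick Start example):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# 52 weeks per group, aligned to the default 'W-MON' frequency
dates = pd.date_range("2023-01-02", periods=52, freq="W-MON")
frames = []
for category in ["appliances", "toys"]:
    for region in ["TX", "CA"]:
        frames.append(pd.DataFrame({
            "timestamp": dates,
            "category": category,
            "region": region,
            "sales": rng.normal(100, 10, size=len(dates)).round(2),
        }))
master_data = pd.concat(frames, ignore_index=True)
# One row per (category, region, week); 4 groups x 52 weeks = 208 rows
```

This frame can then be passed as `master_data` with `unique_ids=['category', 'region']`, `variable='sales'`, and `date_column='timestamp'`.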

## 📤 RETURNS

`tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]`:

* **final_results**     : The main output; a dataframe that identifies anomalies with `Anomaly_Votes` and `is_Anomaly`.
* **success_report**    : Summary of interpolation %, record counts, and anomaly rates.
* **exclusion_report**  : Groups filtered out during preprocessing (e.g., low-quality unique_ids).

---

## Output columns of final_results

All output values are computed at the "unique_ids" (group) level.

MIN_value
The minimum historical "variable" value. Fixed for train data; varies for test data (computed on history up to t-1).
________________________________________
MAX_value
The maximum historical "variable" value. Fixed for train data; varies for test data (computed on history up to t-1).
________________________________________
Percentile_low / Percentile_high
The 5th and 95th percentile "variable" values, used to detect unusually low or unusually high values. Fixed for train data; varies for test data (computed on history up to t-1).
________________________________________
Percentile_anomaly
Flags based on percentile limits:
• Low → value < Percentile_low
• High → value > Percentile_high
• None → within the range
________________________________________
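The percentile flags above can be sketched in plain pandas (a simplified illustration of the logic, not the package's internal code):

```python
import pandas as pd

history = pd.Series([10, 12, 11, 13, 12, 11, 40])  # last point is suspicious

# Thresholds use history up to t-1 only
p_low = history[:-1].quantile(0.05)   # 5th percentile
p_high = history[:-1].quantile(0.95)  # 95th percentile

value = history.iloc[-1]
if value < p_low:
    flag = "Low"
elif value > p_high:
    flag = "High"
else:
    flag = "None"
# flag is "High": 40 exceeds the 95th percentile of the prior values
```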

Mean / SD (Standard Deviation)
The average of "variable" and its standard deviation based on historical data. Fixed for train data; varies for test data (computed on history up to t-1).
________________________________________
SD2_low / SD2_high
Two-standard-deviation control limits:
• SD2_low = mean − 2×SD (floored at 0)
• SD2_high = mean + 2×SD 
__________________________________
SD_anomaly
Flags based on SD2 limits:
• Low → value < SD2_low
• High → value > SD2_high
• None → within the range
________________________________________
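The SD2 limits and flag can be sketched as follows (illustrative only; the package computes the statistics on history up to t-1):

```python
import pandas as pd

history = pd.Series([100, 105, 98, 102, 101, 99])
mean, sd = history.mean(), history.std()

sd2_low = max(mean - 2 * sd, 0)  # floored at 0
sd2_high = mean + 2 * sd

value = 130
sd_anomaly = "High" if value > sd2_high else ("Low" if value < sd2_low else "None")
```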
Median / MAD (Median Absolute Deviation)
Median of "variable" and the median of absolute deviations from the median. Fixed for train data; varies for test data (computed on history up to t-1).
Used for robust anomaly detection when data contains outliers.
________________________________________
MAD_low / MAD_high
MAD-based limits:
• MAD_low = median − 2 × MAD / 0.6745 (floored at 0)
• MAD_high = median + 2 × MAD / 0.6745 

________________________________________
MAD_anomaly
Flags based on MAD limits:
• Low → value < MAD_low
• High → value > MAD_high
• None → within the range
________________________________________
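The MAD limits can be sketched with the default mad_threshold of 2 and mad_scale_factor of 0.6745 from the parameter list (illustrative only):

```python
import pandas as pd

history = pd.Series([10.0, 11.0, 10.5, 12.0, 11.5, 10.0])
median = history.median()
mad = (history - median).abs().median()

scale = 0.6745  # default mad_scale_factor
mad_low = max(median - 2 * mad / scale, 0)  # floored at 0
mad_high = median + 2 * mad / scale

value = 25.0
mad_anomaly = "High" if value > mad_high else ("Low" if value < mad_low else "None")
```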
Q1 / Q3 / IQR (Interquartile Range)
• Q1: 25th percentile
• Q3: 75th percentile
• IQR = Q3 − Q1
Used to detect unusually low or high "variable" values.
________________________________________
IQR_low / IQR_high
IQR-based limits:
• IQR_low = Q1 − 1.5 × IQR (floored at 0)
• IQR_high = Q3 + 1.5 × IQR 
______________________________________
IQR_anomaly
Flags based on IQR limits:
• Low → value < IQR_low
• High → value > IQR_high
• None → within the range
________________________________________
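The IQR limits follow the standard 1.5 x IQR rule, which can be sketched as (illustrative only):

```python
import pandas as pd

history = pd.Series([20, 22, 21, 23, 22, 21, 24, 23])
q1, q3 = history.quantile(0.25), history.quantile(0.75)
iqr = q3 - q1

iqr_low = max(q1 - 1.5 * iqr, 0)  # floored at 0
iqr_high = q3 + 1.5 * iqr

value = 40
iqr_anomaly = "High" if value > iqr_high else ("Low" if value < iqr_low else "None")
```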
is_Percentile_anomaly / is_SD_anomaly / is_MAD_anomaly / is_IQR_anomaly
Boolean indicators stating whether each method classified the value as an anomaly (low or high).
________________________________________
Alpha
Smoothing factor used in EWMA. Higher values give more weight to recent observations.
________________________________________
EWMA_forecast
Expected value estimated using the EWMA model.
________________________________________
EWMA_STD
Rolling standard deviation of residuals around the EWMA forecast.
________________________________________
EWMA_high
Upper anomaly threshold (EWMA_forecast + sigma × EWMA_STD).
_____________________________________ 
EWMA_low
Lower anomaly threshold (EWMA_forecast − sigma × EWMA_STD).
_____________________________________ 
Is_EWMA_anomaly
Boolean flag indicating whether the observed value falls outside the EWMA bounds.
________________________________________
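Putting the EWMA columns together, a simplified sketch using the default alpha=0.3 and sigma=1.5 (the rolling window here is illustrative, not the package's exact implementation):

```python
import pandas as pd

alpha, sigma = 0.3, 1.5  # default alpha and sigma
series = pd.Series([100.0, 102.0, 99.0, 101.0, 103.0, 100.0, 140.0])

# One-step-ahead forecast: smooth the series, then shift by one period
forecast = series.ewm(alpha=alpha, adjust=False).mean().shift(1)
residual = series - forecast

# Rolling std of PAST residuals only (shifted so the current point
# cannot inflate its own threshold)
ewma_std = residual.shift(1).rolling(window=4, min_periods=2).std()

ewma_high = forecast + sigma * ewma_std
ewma_low = forecast - sigma * ewma_std
is_ewma_anomaly = (series > ewma_high) | (series < ewma_low)
# The final spike to 140 falls outside the bounds
```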
FB_forecast
Expected value estimated using the Prophet model.
________________________________________
FB_low
Lower confidence interval of the Prophet forecast.
________________________________________
FB_high
Upper confidence interval of the Prophet forecast.
_____________________________________ 
FB_residual
Difference between observed value and Prophet forecast.
_____________________________________ 
FB_anomaly
Raw anomaly indicator based on Prophet confidence bounds.
_____________________________________ 
Is_FB_anomaly
Boolean flag indicating a Prophet-detected anomaly.
______________________________________   
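Given Prophet's forecast and interval columns, the residual and flag logic can be sketched as follows (Prophet itself is not run here; the forecast columns are stand-in values):

```python
import pandas as pd

df = pd.DataFrame({
    "value":       [100.0, 150.0, 98.0],
    "FB_forecast": [101.0, 100.0, 99.0],
    "FB_low":      [90.0,  92.0,  91.0],
    "FB_high":     [110.0, 112.0, 108.0],
})

# Residual: observed minus forecast
df["FB_residual"] = df["value"] - df["FB_forecast"]

# Anomaly: observed value outside the Prophet confidence bounds
df["FB_anomaly"] = (df["value"] < df["FB_low"]) | (df["value"] > df["FB_high"])
# Only the middle row (150 > 112) is flagged
```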
isolation_forest_score
Score from the Isolation Forest model indicating anomaly severity. Typical range: –0.5 to +0.5
• Higher scores = more normal
• Lower scores = more anomalous
________________________________________
is_IsoForest_anomaly
Boolean flag based on Isolation Forest model output:
• True → model predicts anomaly (prediction = –1)
• False → model predicts normal (prediction = 1)
______________________________________   
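A minimal sketch of how such scores and flags arise from scikit-learn's IsolationForest (a pipeline dependency), using the documented defaults contamination=0.03 and random_state=42; the single-feature input here is illustrative, not the package's exact feature engineering:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(100, 5, size=(200, 1))
X[-1] = 500.0  # inject one obvious outlier

model = IsolationForest(contamination=0.03, random_state=42)
preds = model.fit_predict(X)         # -1 = anomaly, 1 = normal
scores = model.decision_function(X)  # lower = more anomalous

is_anomaly = preds == -1             # the injected outlier is flagged
```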
dbscan_score
Cluster label or distance score produced by DBSCAN (-1 indicates noise/anomaly).
________________________________________
is_DBSCAN_anomaly
Boolean flag indicating DBSCAN-detected anomaly.
________________________________________
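Similarly, the DBSCAN flag can be sketched with scikit-learn, where label -1 marks noise points (the eps and min_samples values here are illustrative, not the package's):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
X = rng.normal(50, 2, size=(100, 1))
X[-1] = 500.0  # far from the dense cluster

labels = DBSCAN(eps=3.0, min_samples=5).fit_predict(X)
is_dbscan_anomaly = labels == -1  # noise points are treated as anomalies
```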
Anomaly_Votes
Count of anomaly-detection methods that agree a point is anomalous.
Ranges from 0 to 8.
________________________________________
is_Anomaly
Final ensemble decision:
• True → value flagged anomalous by 4 or more methods
• False → fewer than 4 methods indicate anomaly
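The final ensemble decision can be sketched as a simple vote over the eight boolean flags (column names as documented above):

```python
import pandas as pd

flag_cols = [
    "is_Percentile_anomaly", "is_SD_anomaly", "is_MAD_anomaly", "is_IQR_anomaly",
    "Is_EWMA_anomaly", "Is_FB_anomaly", "is_IsoForest_anomaly", "is_DBSCAN_anomaly",
]

# Two example rows: one flagged by 5 methods, one by only 2
df = pd.DataFrame([
    [True,  True, True,  False, True,  True,  False, False],
    [False, True, False, False, False, False, True,  False],
], columns=flag_cols)

df["Anomaly_Votes"] = df[flag_cols].sum(axis=1)  # ranges 0 to 8
df["is_Anomaly"] = df["Anomaly_Votes"] >= 4      # consensus of 4+ methods
```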


