Metadata-Version: 2.4
Name: xelytics-core
Version: 0.3.0
Summary: Pure analytics engine with lazy execution, graph-based transformations, and extensible analysis
Author: Xelytics Team
License: MIT
Project-URL: Homepage, https://xelytics.live
Project-URL: Quick Start Notebook, https://colab.research.google.com/drive/1d1gN5Ip9-p7ojbogxVRRc3QIK6Gq6m9o?usp=sharing
Project-URL: End to End Notebook, https://colab.research.google.com/drive/1zQuBfquU9Zk-UuiX5wE6VmhrX2DqZ1MX?usp=sharing
Project-URL: v0.2 Deep Dive Notebook, https://colab.research.google.com/drive/1pD6Shn-2PCOfX4TE6mSObaOrVt5KjHJI?usp=sharing
Project-URL: v0.3.0 Deep Dive Notebook, https://colab.research.google.com/drive/1nAuOsxDu6SZ0oTHNKzpza0clTe7V4hUa?usp=sharing
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: pandas>=2.1.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: scipy>=1.11.0
Requires-Dist: scikit-learn>=1.3.0
Requires-Dist: statsmodels>=0.14.0
Requires-Dist: pingouin>=0.5.3
Requires-Dist: plotly>=5.17.0
Requires-Dist: matplotlib>=3.7.0
Requires-Dist: jinja2>=3.1.0
Requires-Dist: redis>=5.0.0
Provides-Extra: llm
Requires-Dist: openai>=1.6.0; extra == "llm"
Requires-Dist: groq>=0.4.0; extra == "llm"
Requires-Dist: httpx>=0.25.0; extra == "llm"
Provides-Extra: advanced
Requires-Dist: ruptures>=1.1.8; extra == "advanced"
Requires-Dist: pmdarima>=2.0.4; extra == "advanced"
Provides-Extra: connectors
Requires-Dist: psycopg2-binary>=2.9.0; extra == "connectors"
Requires-Dist: pymysql>=1.1.0; extra == "connectors"
Requires-Dist: snowflake-connector-python>=3.0.0; extra == "connectors"
Requires-Dist: sqlalchemy>=2.0.0; extra == "connectors"
Requires-Dist: openpyxl>=3.1.0; extra == "connectors"
Requires-Dist: pyarrow>=14.0.0; extra == "connectors"
Requires-Dist: boto3>=1.34.0; extra == "connectors"
Requires-Dist: azure-storage-blob>=12.19.0; extra == "connectors"
Requires-Dist: google-cloud-storage>=2.10.0; extra == "connectors"
Requires-Dist: google-cloud-bigquery>=3.11.0; extra == "connectors"
Requires-Dist: pandas-gbq>=0.19.0; extra == "connectors"
Provides-Extra: export
Requires-Dist: jinja2>=3.1.0; extra == "export"
Requires-Dist: weasyprint>=60.0; extra == "export"
Requires-Dist: python-pptx>=0.6.23; extra == "export"
Requires-Dist: nbformat>=5.7.0; extra == "export"
Requires-Dist: kaleido>=0.2.1; extra == "export"
Provides-Extra: large-data
Requires-Dist: dask[dataframe]>=2024.1.0; extra == "large-data"
Provides-Extra: dev
Requires-Dist: pytest>=7.4.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: pytest-benchmark>=4.0.0; extra == "dev"
Requires-Dist: black>=23.11.0; extra == "dev"
Requires-Dist: mypy>=1.7.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"

# Xelytics-Core

Python package for automated analytics with a lazy, graph-aware execution engine.

[![Version](https://img.shields.io/badge/version-0.3.0-blue)](CHANGELOG.md)
[![Python](https://img.shields.io/badge/python-3.9%2B-blue)](pyproject.toml)
[![Status](https://img.shields.io/badge/status-beta-blue)](CHANGELOG.md)

> **Status**: v0.3.0 documentation update in progress | v0.2.x APIs remain supported while the lazy graph execution model becomes the recommended path.

---

## What It Does

Xelytics-Core is a **zero-configuration analytics engine** that analyzes your data and produces professional insights, statistical tests, interactive visualizations, and predictions — all with a single function call.

**One-line analysis:**
```python
from xelytics import analyze
import pandas as pd

df = pd.read_csv("data.csv")
result = analyze(df)  # That's it!

for insight in result.insights:
    print(f"📊 {insight.title}: {insight.description}")
```

**Output includes:**
- ✅ 50+ statistical tests (parametric & non-parametric)
- ✅ Time series decomposition & forecasting (ARIMA, Exponential Smoothing)
- ✅ Anomaly detection & change point detection
- ✅ Clustering analysis (K-Means, DBSCAN, Hierarchical)
- ✅ Interactive Plotly visualizations
- ✅ Human-readable insights (with optional LLM narration)
- ✅ Professional HTML, PDF, PowerPoint, and Jupyter reports
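
These outputs are exposed as attributes of the returned result; a quick sketch using only fields shown elsewhere in this README:

```python
print(result.summary.row_count)           # dataset summary
print(result.metadata.tests_executed)     # number of statistical tests run
print(len(result.visualizations))         # Plotly-compatible visualization specs
result.export_to("analysis.json")         # structured JSON export
```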

---

## What's New in v0.3.0

> Added in v0.3.0

v0.3.0 evolves Xelytics-Core from an eager, mostly linear DataFrame analysis pipeline into a lazy, graph-aware analytics engine. The existing `analyze(df)` workflow is still supported for v0.2.x compatibility, while new projects should prefer the chainable `Xelytics` API when they need lazy data binding, execution planning, SQL pushdown, lineage, or plugin extension points.

| Area | v0.2.x Behavior | v0.3.0 Behavior | Compatibility |
|---|---|---|---|
| Entry point | `analyze(df)` runs the pipeline directly | `Xelytics().dataset(df).analyze().run()` builds then executes a plan | `analyze(df)` remains supported |
| Data model | DataFrame-first, connector results usually materialized | Unified `Dataset` abstraction with materialized and lazy datasets | Existing DataFrame inputs still work |
| Execution | Eager pipeline with optional parallel tasks | Lazy `ExecutionPlan` DAG with scan, transform, analysis nodes | Eager behavior is preserved through legacy API |
| SQL sources | Query first, then analyze returned DataFrame | Filter/project nodes can be pushed into SQL where supported | Connector APIs remain available |
| Transformations | Custom pipeline steps execute before analysis | Transformations can be represented as graph nodes | `Pipeline` remains supported |
| Caching | Result/intermediate cache for analysis stages | Node-level cache support for transformation graph nodes | Existing cache backends remain supported |
| Metadata | Run metadata plus optional sampling/parallel fields | Adds trace, profiling, lineage, cache, and analyzer outputs | Existing result fields remain stable |
| Extensibility | Pipelines, exporters, LLM providers | Registries for analyzers, transformations, and output formats | Existing extension patterns remain valid |

### Legacy API (v0.2.x Compatible)

> Legacy API (still supported)

```python
from xelytics import analyze

result = analyze(df)
```

### Recommended v0.3.0 API

> Added in v0.3.0

```python
from xelytics import Xelytics

result = (
    Xelytics()
      .dataset(df)
      .filter("revenue > 1000")
      .analyze()
      .run()
)
```

`load_dataframe(df)` is also available as an explicit DataFrame-loading method in the current implementation; this documentation uses `dataset(df)` as the recommended v0.3.0 abstraction, as in the sketch below.
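
A minimal sketch of the equivalent explicit binding, assuming the chainable API shown above:

```python
from xelytics import Xelytics

# Explicit DataFrame binding via the load_dataframe alias mentioned above
result = Xelytics().load_dataframe(df).analyze().run()
```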

---

## Optional Extras

These map to the optional dependency groups declared for the package and can be combined at install time (see Installation & Setup below).

| Extra | Provides |
|---|---|
| `advanced` | advanced time series dependencies such as `ruptures` and `pmdarima` |
| `connectors` | database, cloud storage, Excel, and Parquet connector dependencies |
| `export` | PDF, PowerPoint, notebook, and static chart export dependencies |
| `llm` | OpenAI and Groq provider dependencies |
| `large-data` | Dask dataframe support |
| `dev` | test, lint, type-check, and formatting tools |

## Quick Start

### v0.2-Compatible API

Use this for simple one-shot DataFrame analysis.

```python
import pandas as pd
from xelytics import AnalysisConfig, analyze

df = pd.read_csv("sales.csv")

config = AnalysisConfig(
    enable_llm_insights=False,
    generate_visualizations=False,
)

result = analyze(df, config=config)

print(result.summary.row_count)
print(result.metadata.tests_executed)

for insight in result.insights[:5]:
    print(f"{insight.severity.value}: {insight.title}")

result.export_to("analysis.json")
```

### Recommended v0.3.0 API

Use the chainable API when you want to bind data first, record operations, and
execute only when `.run()` is called.

```python
import pandas as pd
from xelytics import AnalysisConfig, Xelytics

df = pd.read_csv("sales.csv")

result = (
    Xelytics(config=AnalysisConfig(enable_llm_insights=False))
      .dataset(df)
      .filter("revenue > 1000")
      .analyze()
      .run()
)

print(result.summary.row_count)
print(result.trace.print_trace() if result.trace else "No trace")
```

`load_dataframe(df)` and `from_dataset(dataset)` are also available aliases for
explicit binding.
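
A sketch of the explicit dataset form, assuming `MaterializedDataset` wraps a DataFrame as in the transformation-graph example later in this README:

```python
from xelytics import Xelytics
from xelytics.dataset import MaterializedDataset

result = (
    Xelytics()
      .from_dataset(MaterializedDataset(df))
      .analyze()
      .run()
)
```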

The full runnable notebook for this release is
[examples/xelytics_core_v0_3_0_complete.ipynb](examples/xelytics_core_v0_3_0_complete.ipynb).
It uses generated data and local files only, so it can be executed without API
keys or database credentials.

## What Changed in v0.3.0

| Area | v0.2.x | v0.3.0 |
|---|---|---|
| Entry point | `analyze(df)` | `analyze(df)` still works; `Xelytics().dataset(df).analyze().run()` is recommended for lazy workflows |
| Data model | DataFrame-first | `Dataset`, `MaterializedDataset`, `LazyDataset`, and `TransformedDataset` |
| Execution | eager pipeline | `ExecutionPlan`, `PlanNode`, `PlanBuilder`, and DAG execution |
| Connectors | mostly materialized DataFrames | database connectors can back lazy datasets |
| SQL behavior | query then analyze | filter/project/limit plan nodes can use SQL pushdown when supported |
| Transformations | eager `Pipeline` preprocessing | `TransformGraph`, graph nodes, node cache, and lineage APIs |
| Analysis outputs | stats, visualizations, insights, time series, clustering | adds `correlation`, `trend_anomaly`, and `segmentation` analyzer outputs |
| Observability | logs and metadata | `TraceCollector` and `ExecutionProfiler` attached to results |
| Extensibility | pipelines/exporters/providers | registries for analyzers, transformations, and output formats |
| Compatibility | v0.2.x public API | no public v0.2.x API removed |

See [MIGRATION_GUIDE_v0.2_to_v0.3.md](MIGRATION_GUIDE_v0.2_to_v0.3.md) for the
full migration guide.

## Implemented v0.3.0 Story Map

The v0.3.0 implementation is organized around the story set in
[aidlc-docs/inception/v0.3.0](aidlc-docs/inception/v0.3.0).

| Epic | Implemented surface | Main modules |
|---|---|---|
| Epic 1: Data Connectivity Engine | source abstraction, schema inference, lazy data binding, connector timeouts/retries/sampling hints | `xelytics.dataset`, `xelytics.schemas.schema`, `xelytics.connectors`, `xelytics.schemas.config` |
| Epic 2: Execution Engine | execution plans, lazy execution, SQL pushdown helpers, chunked planning support | `xelytics.execution`, `xelytics.engine` |
| Epic 3: Transformation Graph Engine | graph nodes, graph execution, node cache, schema hooks, lineage records | `xelytics.graph` |
| Epic 4: Analysis and Insight Engine | profiling, correlation, trend/anomaly, segmentation, ranked and deduplicated insights | `xelytics.analyzers`, `xelytics.insights` |
| Epic 5: Output Layer and Python API | structured JSON, optional visualizations, chainable `Xelytics` API, result export | `xelytics.api`, `xelytics.schemas.outputs`, `xelytics.export` |
| Epic 6: Observability and Debugging | execution logs, trace collection, node profiling, trace/profile serialization | `xelytics.observability`, `xelytics.engine` |
| Epic 7: Extensibility System | custom analyzer, transformation, and output-format registries | `xelytics.extension` |

Regression coverage for these surfaces lives in `tests/test_epic1_connectivity.py`
through `tests/test_epic7.py`, plus compatibility tests for earlier APIs.

## Feature Overview

| Capability | Status |
|---|---|
| Automatic statistical test planning and execution | supported |
| Dataset summaries and column profiling | supported |
| Rule-based insights and ranked insights | supported |
| Plotly-compatible visualization specs | supported |
| Time series detection, decomposition, forecasting, anomalies, and change points | supported through `xelytics.timeseries` and v0.3 analyzer outputs |
| K-Means, DBSCAN, hierarchical clustering, and cluster profiling | supported |
| PostgreSQL, MySQL, SQLite, Snowflake, BigQuery, S3, Azure Blob, GCS, and file connectors | supported through optional extras |
| File and Redis caching | supported |
| Large dataset summary and sample analysis | supported through `analyze_large_dataset()` |
| HTML, PDF, PowerPoint, Jupyter notebook, and JSON export | supported through `xelytics.export` |
| CLI for CSV and Excel analysis | supported through the `xelytics` command |
| Optional LLM provider integrations | OpenAI and Groq dependencies available through `llm` extra |

## Public API

The stable top-level imports are `analyze`, `AnalysisConfig`, and, from v0.3.0, `Xelytics`.

### 1️⃣ Statistical Analysis

Column selection and statistical testing are configured through `AnalysisConfig`:

```python
from xelytics import AnalysisConfig, analyze

# Define which columns to analyze
config = AnalysisConfig(
    include_columns=["age", "income", "purchase_frequency"],
    exclude_columns=["customer_id", "timestamp"],
    categorical_max_categories=50,  # Skip columns with >50 unique values
)

result = analyze(df, config=config)
```

**Statistics Covered:**
- ✅ Descriptive: mean, median, variance, skewness, kurtosis
- ✅ t-tests, ANOVA, Welch's test, Mann-Whitney U, Kruskal-Wallis
- ✅ Correlation: Pearson, Spearman, Kendall Tau
- ✅ Chi-square tests for categorical associations
- ✅ Effect sizes: Cohen's d, Cramér's V, Eta-squared
- ✅ Assumption checks: Normality (Shapiro-Wilk), Homogeneity of variance (Levene)
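
After a run, the executed test count and the resulting findings can be read back directly (both fields are used elsewhere in this README):

```python
result = analyze(df, config=config)

print(result.metadata.tests_executed)  # how many statistical tests ran

# Significant findings surface as ranked insights
for insight in result.insights[:5]:
    print(f"{insight.severity.value}: {insight.title}")
```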

---

### 2️⃣ Time Series Analysis (NEW in v0.2.0)

Complete time series toolkit: detection, decomposition, forecasting, anomalies.

#### Time Series Detection

```python
from xelytics import analyze, AnalysisConfig

# Option 1: Auto-detect time series columns
config = AnalysisConfig(enable_time_series=True)
result = analyze(df, config=config)

# Option 2: Specify datetime column
config = AnalysisConfig(
    enable_time_series=True,
    datetime_column="date",
)
result = analyze(df, config=config)

# Check which columns were detected as time series
for ts in result.time_series_analysis:
    print(f"{ts.column_name}:")
    print(f"  Type: {ts.series_type.value}")
    print(f"  Frequency: {ts.frequency}")
    print(f"  Has trend: {ts.has_trend}")
    print(f"  Has seasonality: {ts.has_seasonality}")
    if ts.has_seasonality:
        print(f"  Seasonal period: {ts.seasonal_period}")
```

#### Time Series Decomposition

```python
# Automatically decompose into trend, seasonal, residual
config = AnalysisConfig(
    enable_time_series=True,
    datetime_column="date",
    decomposition_method="additive",  # or "multiplicative", "stl"
)
result = analyze(df, config=config)

for ts in result.time_series_analysis:
    if ts.decomposition:
        print(f"{ts.column_name} decomposition:")
        print(f"  Trend strength: {ts.decomposition.trend_strength:.3f}")
        print(f"  Seasonal strength: {ts.decomposition.seasonal_strength:.3f}")
```

#### Forecasting

```python
# ARIMA and Exponential Smoothing forecasting
config = AnalysisConfig(
    enable_time_series=True,
    datetime_column="date",
    forecast_periods=30,  # Forecast next 30 periods
    forecast_methods=["arima", "exponential_smoothing"],
)
result = analyze(df, config=config)

for ts in result.time_series_analysis:
    if ts.forecasts:
        print(f"\n{ts.column_name} - Next 30 periods forecast:")
        for forecast in ts.forecasts[:5]:  # Show first 5
            print(f"  Period {forecast.period}: {forecast.value:.2f} "
                  f"(95% CI: {forecast.lower_bound:.2f}-{forecast.upper_bound:.2f})")
```

#### Anomaly Detection

```python
# Multiple detection methods: Z-score, IQR, MAD, Isolation Forest
config = AnalysisConfig(
    enable_time_series=True,
    datetime_column="date",
    anomaly_detection_method="isolation_forest",
    anomaly_sensitivity=0.95,  # 95th percentile threshold
)
result = analyze(df, config=config)

for ts in result.time_series_analysis:
    if ts.anomalies:
        print(f"\n{ts.column_name} - Anomalies detected:")
        for anomaly in ts.anomalies[:3]:
            print(f"  Index {anomaly.index}: {anomaly.value:.2f} "
                  f"(severity: {anomaly.severity}, confidence: {anomaly.confidence:.2f})")
```

#### Change Point Detection

```python
# Detect structural breaks (CUSUM algorithm)
config = AnalysisConfig(
    enable_time_series=True,
    datetime_column="date",
    detect_change_points=True,
    change_point_sensitivity=0.05,
)
result = analyze(df, config=config)

for ts in result.time_series_analysis:
    if ts.change_points:
        print(f"\n{ts.column_name} - Change points:")
        for cp in ts.change_points:
            print(f"  At index {cp.index}: magnitude={cp.magnitude:.2f}, "
                  f"confidence={cp.confidence:.2f}")
```

---

### 3️⃣ Clustering & Segmentation (NEW in v0.2.0)

Unsupervised learning for customer segmentation, market clustering, etc.

#### Basic Clustering

```python
from xelytics import analyze, AnalysisConfig

config = AnalysisConfig(
    enable_clustering=True,
    clustering_algorithm="auto",  # auto, kmeans, dbscan, hierarchical
    max_clusters=8,
    exclude_columns=["customer_id", "name"],
)
result = analyze(df, config=config)

# View clusters
print(f"Algorithm used: {result.clusters[0].algorithm}")
for cluster in result.clusters:
    print(f"\nCluster {cluster.cluster_id}:")
    print(f"  Size: {cluster.size} members ({cluster.size/result.summary.row_count*100:.1f}%)")
    print(f"  Silhouette score: {cluster.silhouette_score:.3f}")
    print(f"  Profile: {cluster.profile}")
```

#### K-Means (with Automatic K Selection)

```python
# K-Means tries multiple K values and picks the best
config = AnalysisConfig(
    enable_clustering=True,
    clustering_algorithm="kmeans",
    max_clusters=10,
    k_selection_method="elbow",  # elbow, silhouette, gap_statistic
)
result = analyze(df, config=config)

# View metrics for each K
for cluster in result.clusters:
    print(f"K={cluster.algorithm_params['n_clusters']}: "
          f"silhouette={cluster.silhouette_score:.3f}")
```

#### DBSCAN (Density-Based)

```python
# DBSCAN finds natural clusters and noise points
config = AnalysisConfig(
    enable_clustering=True,
    clustering_algorithm="dbscan",
    dbscan_eps=0.5,  # Auto-estimated if not provided
    dbscan_min_samples=5,
)
result = analyze(df, config=config)

for cluster in result.clusters:
    noise_label = "Noise" if cluster.cluster_id == -1 else f"Cluster {cluster.cluster_id}"
    print(f"{noise_label}: {cluster.size} points")
```

#### Hierarchical Clustering

```python
# Produces dendrograms and tree-based clusters
config = AnalysisConfig(
    enable_clustering=True,
    clustering_algorithm="hierarchical",
    hierarchical_linkage="ward",  # ward, complete, average, single
    max_clusters=5,
)
result = analyze(df, config=config)

for cluster in result.clusters:
    print(f"Cluster {cluster.cluster_id}: {cluster.size} members")
```

---

### 4️⃣ Data Connectors (NEW in v0.2.0)

Analyze data directly from databases and cloud storage—no manual data export needed.

#### Local Files

A credential-free example that writes a DataFrame to CSV and reads it back through the file connector:

```python
from pathlib import Path
from xelytics.connectors import connect_to_source

output_dir = Path(".cache/xelytics_readme")
output_dir.mkdir(parents=True, exist_ok=True)

csv_path = output_dir / "sales.csv"
df.to_csv(csv_path, index=False)

file_dataset = connect_to_source("file", path=str(csv_path))
print(file_dataset.to_pandas().head())
```

#### PostgreSQL

The same chainable pattern works against a database source:

```python
from xelytics import AnalysisConfig, Xelytics

result = (
    Xelytics(config=AnalysisConfig(enable_llm_insights=False))
      .connect(
          "postgresql",
          host="localhost",
          database="analytics",
          user="reader",
          password="secret",
          query="SELECT * FROM sales",
      )
      .filter("revenue > 1000")
      .analyze()
      .run()
)
```

### Cache APIs

| API | Purpose |
|---|---|
| `Cache(backend="file", **kwargs)` | Direct cache instance |
| `Cache.get(key)` | Read cached value |
| `Cache.set(key, value, ttl=None)` | Store cached value |
| `Cache.delete(key)` | Delete key |
| `Cache.clear(pattern=None)` | Clear backend |
| `Cache.cached(ttl=None)` | Decorator for function caching |
| `get_cache(backend, **kwargs)` | Create/get global cache |
| `clear_cache(pattern=None)` | Clear global cache |
| `NodeCache.get(node_id, input_dfs, func)` | Read transform-node output |
| `NodeCache.set(node_id, input_dfs, func, result)` | Store transform-node output |
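
A short sketch of the cache API above; the `Cache` import path and the `cache_dir` keyword are assumptions (only `FileCache`, `RedisCache`, and `clear_cache` imports appear elsewhere in this README):

```python
from xelytics.cache import Cache, clear_cache  # Cache import path assumed

cache = Cache(backend="file", cache_dir="./cache")  # cache_dir keyword assumed

# Key/value usage from the table above
cache.set("stats:revenue", {"mean": 42.0}, ttl=3600)
print(cache.get("stats:revenue"))
cache.delete("stats:revenue")

# Decorator-based caching of an expensive call
@cache.cached(ttl=600)
def column_summary(column_name):
    return f"summary for {column_name}"

# Clear matching keys in the global cache
clear_cache(pattern="stats:*")
```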

#### BigQuery

```python
connector = connect_to_source(
    source_type="bigquery",
    project_id="my-project",
    credentials_path="/path/to/service-account.json",
)

df = connector.query("""
    SELECT * FROM `my-project.dataset.events`
    WHERE event_date >= '2025-01-01'
    LIMIT 100000
""")
result = analyze(df)
```

#### Snowflake

```python
connector = connect_to_source(
    source_type="snowflake",
    account="xy12345",
    warehouse="COMPUTE",
    database="ANALYTICS",
    schema="PUBLIC",
    user=os.getenv("SNOWFLAKE_USER"),
    password=os.getenv("SNOWFLAKE_PASSWORD"),
)

df = connector.query("SELECT * FROM CUSTOMER_DATA")
result = analyze(df)
```

#### S3 / Cloud Storage

```python
# Amazon S3
connector = connect_to_source(
    source_type="s3",
    bucket="my-analytics-bucket",
    key="data/sales.parquet",
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY"),
    aws_secret_access_key=os.getenv("AWS_SECRET_KEY"),
)
df = connector.query()  # Returns DataFrame
result = analyze(df)

# Azure Blob Storage
connector = connect_to_source(
    source_type="azure_blob",
    container_name="data",
    blob_name="sales.csv",
    connection_string=os.getenv("AZURE_CONN_STRING"),
)
df = connector.query()
result = analyze(df)

# Google Cloud Storage
connector = connect_to_source(
    source_type="gcs",
    bucket="my-bucket",
    key="data/sales.csv",
    credentials_path="/path/to/gcp-key.json",
)
df = connector.query()
result = analyze(df)
```

---

### 5️⃣ Report Generation (NEW in v0.2.0)

Generate professional, interactive reports in multiple formats.

#### Pipeline Preprocessing

Data can be prepared with the step-based `Pipeline` API before a report is generated; an HTML report sketch follows the block below.

```python
from xelytics.pipeline import Pipeline, correlation_analysis, normalize, pca, remove_outliers

pipeline = Pipeline(name="demo")
pipeline.add_step(
    remove_outliers,
    name="remove_outliers",
    inputs=["df"],
    outputs=["df"],
    columns=["revenue"],
    method="iqr",
    threshold=3.0,
)
pipeline.add_step(
    normalize,
    name="normalize",
    inputs=["df"],
    outputs=["normalized"],
    columns=["revenue", "cost"],
    method="minmax",
)

context = pipeline.execute({"df": df})
print(context["normalized"].head())
print(pca(df[["revenue", "cost"]], n_components=2).head())
print(correlation_analysis(df[["revenue", "cost", "profit"]]))
```
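
A report can then be generated from an analysis result. A minimal HTML sketch, reusing the `HTMLReportGenerator` arguments from the end-to-end example later in this README (the import path is assumed):

```python
from xelytics import analyze
from xelytics.export import HTMLReportGenerator  # import path assumed

result = analyze(context["df"])

html_generator = HTMLReportGenerator(theme="light", logo_text="Demo", company_name="Demo Co")
html = html_generator.generate(result, title="Pipeline Demo Report", author="Docs")

with open("pipeline_demo_report.html", "w") as f:
    f.write(html)
```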

### Transformation Graph, Lineage, Trace, and Profiling

```python
from xelytics.dataset import MaterializedDataset
from xelytics.graph.graph import TransformGraph
from xelytics.graph.lineage import LineageTracker
from xelytics.graph.node import DataSourceNode, TransformNode
from xelytics.observability.profiler import ExecutionProfiler
from xelytics.observability.trace import TraceCollector, TraceEntry

graph = TransformGraph()
graph.add_node(DataSourceNode(id="source", dataset=MaterializedDataset(df)))
graph.add_node(
    TransformNode(
        id="filter",
        name="filter",
        func=lambda frame: frame.query("revenue > 1000"),
        inputs=["source"],
    )
)
graph.add_edge("source", "filter")
graph.validate()
graph_df = graph.run()

lineage = LineageTracker()
lineage.record_execution("filter", {"source": "hash-a"}, "hash-b", 12.5)
print(lineage.get_record("filter"))
lineage.clear()

trace = TraceCollector()
trace.add(TraceEntry(step_name="demo", row_count=len(graph_df)))
print(trace.print_trace())

profiler = ExecutionProfiler()
profiler.start("node")
profiler.stop("node", operation="demo", rows_fetched=len(graph_df))
print(profiler.print_profile())
```

#### JSON Export

```python
import json

# For programmatic access or storage
with open("analysis.json", "w") as f:
    json.dump(result.to_dict(), f, indent=2)

# Later, reconstruct from JSON
from xelytics.schemas.outputs import AnalysisResult
with open("analysis.json") as f:
    data = json.load(f)
    result = AnalysisResult(**data)
```

---

### 6️⃣ Custom Pipelines (NEW in v0.2.0)

Pre-process data with custom steps before analysis.

```python
from xelytics.pipeline import Pipeline, normalize, pca, remove_outliers, correlation_analysis
from xelytics import analyze

# Build a custom pipeline
pipeline = Pipeline([
    remove_outliers(method="iqr", threshold=1.5),
    normalize(method="minmax"),
    pca(n_components=10),
    correlation_analysis(threshold=0.7),
])

# Apply before analysis
df_processed = pipeline.fit_transform(df)
result = analyze(df_processed)

# Or use in AnalysisConfig
config = AnalysisConfig(
    run_custom_pipeline=True,
    custom_pipeline=pipeline,
)
result = analyze(df, config=config)
```

---

### 7️⃣ Caching (NEW in v0.2.0)

Speed up repeated analyses on the same data.

#### File-Based Cache

```python
from xelytics import analyze, AnalysisConfig
from xelytics.cache import FileCache

cache = FileCache(cache_dir="./cache")

config = AnalysisConfig(
    enable_caching=True,
    cache_backend=cache,
)

# First run: takes full time
result1 = analyze(df, config=config)

# Subsequent runs on same data: instant
result2 = analyze(df, config=config)  # Retrieved from cache!
```

#### Redis Cache (Distributed)

```python
from xelytics.cache import RedisCache

cache = RedisCache(host="localhost", port=6379, db=0, ttl=3600)

config = AnalysisConfig(
    enable_caching=True,
    cache_backend=cache,
)

result = analyze(df, config=config)
```

#### Clear Cache

```python
from xelytics.cache import clear_cache

# Clear all caches
clear_cache(pattern="*")

# Clear specific patterns
clear_cache(pattern="stats:*")  # Only clear stats caches
```

---

### 8️⃣ CLI (Command-Line Interface)

Analyze without writing Python code.

```bash
# Basic analysis - outputs JSON
xelytics analyze data.csv

# Save to file
xelytics analyze data.csv --output results.json

# Set parameters
xelytics analyze data.csv \
  --format=json \
  --alpha 0.01 \
  --no-llm \
  --max-visualizations 20 \
  --datetime-column "date"

# Time series analysis
xelytics analyze data.csv \
  --enable-time-series \
  --datetime-column "date" \
  --forecast-periods 30

# Clustering
xelytics analyze data.csv \
  --enable-clustering \
  --clustering-algorithm kmeans \
  --max-clusters 5

# Show version
xelytics --version

# Help
xelytics --help
```

---

### 9️⃣ LLM Integration (Optional)

Enhance insights with AI narration.

```python
from xelytics import analyze, AnalysisConfig

config = AnalysisConfig(
    enable_llm_insights=True,
    llm_provider="openai",  # openai, groq, or local
    llm_model="gpt-4",
    llm_api_key=os.getenv("OPENAI_API_KEY"),
)

result = analyze(df, config=config)

# Insights now include AI-generated descriptions
for insight in result.insights:
    print(f"{insight.title}")
    print(f"  📝 {insight.narrative}")  # AI-generated explanation
```

#### Multiple LLM Providers

```python
# OpenAI
config = AnalysisConfig(
    enable_llm_insights=True,
    llm_provider="openai",
    llm_model="gpt-4",
    llm_api_key=os.getenv("OPENAI_API_KEY"),
)

# Groq (fast, open source)
config = AnalysisConfig(
    enable_llm_insights=True,
    llm_provider="groq",
    llm_model="mixtral-8x7b",
    llm_api_key=os.getenv("GROQ_API_KEY"),
)

# Azure OpenAI
config = AnalysisConfig(
    enable_llm_insights=True,
    llm_provider="azure",
    llm_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    llm_api_key=os.getenv("AZURE_OPENAI_KEY"),
)
```

### 🔟 Performance at Scale

```python
from xelytics import analyze, AnalysisConfig

config = AnalysisConfig(
    # Auto-sample if > 1M rows
    sampling_strategy="auto",
    max_rows=1_000_000,

    # Or force sampling instead (uncomment and remove the auto lines above):
    # sampling_strategy="stratified",
    # sample_size=100_000,
    
    # Parallel execution
    parallel_execution=True,
    max_workers=4,
)

result = analyze(df, config=config)
```

#### Chunked Processing for Very Large Files

```python
from xelytics import AnalysisConfig
from xelytics.engine import analyze_large_dataset

# Process 10M row file without loading into memory
result = analyze_large_dataset(
    source="huge_sales_data.csv",
    chunksize=50_000,
    sample_size=100_000,  # Take a sample for full analysis
    config=AnalysisConfig(),
)
```

---

## ⚙️ Configuration Reference

```python
from xelytics import AnalysisConfig

config = AnalysisConfig(
    # General
    significance_level=0.05,
    mode="automated",  # automated or semi-automated
    
    # Columns
    include_columns=None,  # [list] Include only these columns
    exclude_columns=None,  # [list] Exclude these columns
    datetime_column=None,  # [str] Column name for time series
    
    # Time Series
    enable_time_series=False,
    decomposition_method="additive",  # additive, multiplicative, stl
    forecast_periods=0,
    forecast_methods=["arima", "exponential_smoothing"],
    anomaly_detection_method="isolation_forest",
    anomaly_sensitivity=0.95,
    detect_change_points=False,
    
    # Clustering
    enable_clustering=False,
    clustering_algorithm="auto",  # auto, kmeans, dbscan, hierarchical
    max_clusters=10,
    k_selection_method="elbow",
    
    # Performance
    parallel_execution=True,
    max_workers=4,
    sampling_strategy="auto",
    max_rows=1_000_000,
    
    # Caching
    enable_caching=False,
    cache_backend=None,
    
    # Reporting
    max_visualizations=15,
    run_custom_pipeline=False,
    custom_pipeline=None,
    
    # LLM
    enable_llm_insights=False,
    llm_provider="openai",
    llm_model="gpt-4",
    llm_api_key=None,
    
    # Other
    random_seed=42,
    verbose=True,
)
```

## Usage Examples

### Configure Analysis

```python
import os
from datetime import datetime

import pandas as pd

from xelytics import AnalysisConfig, analyze
# Report generator import paths assumed from the xelytics.export module
from xelytics.export import HTMLReportGenerator, generate_pdf_report

# 1. LOAD DATA
df = pd.read_csv("sales.csv")

# 2. CONFIGURE ANALYSIS
config = AnalysisConfig(
    significance_level=0.01,
    enable_time_series=True,
    datetime_column="date",
    forecast_periods=14,
    enable_clustering=True,
    clustering_algorithm="kmeans",
    max_clusters=5,
    parallel_execution=True,
    enable_caching=True,

    # Reporting
    max_visualizations=20,
    enable_llm_insights=True,
    llm_provider="openai",
    llm_api_key=os.getenv("OPENAI_API_KEY"),
)

# 3. RUN ANALYSIS
result = analyze(df, config=config)

# 4. EXPLORE RESULTS
print(f"\n✓ Analysis complete in {result.metadata.execution_time_ms}ms")
print(f"  • Tests: {result.metadata.tests_executed}")
print(f"  • Visualizations: {len(result.visualizations)}")
print(f"  • Insights: {len(result.insights)}")
print(f"  • Time Series Series: {len(result.time_series_analysis)}")
print(f"  • Clusters: {len(result.clusters)}")

print("\n📊 Key Insights:")
for i, insight in enumerate(result.insights[:5], 1):
    print(f"  {i}. {insight.title}")
    if hasattr(insight, 'narrative'):
        print(f"     {insight.narrative[:100]}...")

# 5. GENERATE REPORTS
print("\n📄 Generating reports...")
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

# HTML Report
html_generator = HTMLReportGenerator(
    theme="light",
    logo_text="Sales Analytics",
    company_name="ACME Corp"
)
html = html_generator.generate(
    result,
    title="Sales Analysis Report",
    author="Data Science Team"
)
html_path = f"reports/sales_analysis_{timestamp}.html"
os.makedirs("reports", exist_ok=True)
with open(html_path, "w") as f:
    f.write(html)
print(f"  ✓ HTML: {html_path}")

# PDF Report
pdf_bytes = generate_pdf_report(
    result,
    title="Sales Analysis Report",
    author="Data Science Team"
)
pdf_path = f"reports/sales_analysis_{timestamp}.pdf"
with open(pdf_path, "wb") as f:
    f.write(pdf_bytes)
print(f"  ✓ PDF:  {pdf_path}")

# JSON Export
json_path = f"reports/sales_analysis_{timestamp}.json"
import json
with open(json_path, "w") as f:
    json.dump(result.to_dict(), f, indent=2)
print(f"  ✓ JSON: {json_path}")

print("\n✅ Analysis complete!")
print(f"Reports saved to: {os.path.abspath('reports')}")
```

**Example output:**
```
📁 Loading data...
✓ Loaded 150,432 rows

⚙️  Configuring analysis...

🔍 Running analysis...

✓ Analysis complete in 3421ms
  • Tests: 47
  • Visualizations: 18
  • Insights: 12
  • Time Series: 2
  • Clusters: 5

📊 Key Insights:
  1. Significant correlation detected: total_amount vs. customer_age
  2. Strong seasonality in Q4 sales
  3. Customer segmentation: 5 distinct groups identified
  4. Outliers detected in unit_price column
  5. Increasing trend in repeat customer rate

📄 Generating reports...
  ✓ HTML: reports/sales_analysis_20250307_143021.html
  ✓ PDF:  reports/sales_analysis_20250307_143021.pdf
  ✓ JSON: reports/sales_analysis_20250307_143021.json

✅ Analysis complete!
Reports saved to: /home/user/reports
```

---

## 📈 Performance & Scaling

| Dataset Size | Processing Time | Max Parallel Tasks |
|---|---|---|
| **10K rows** | 1–2 seconds | 3 |
| **100K rows** | 5–10 seconds | 4 |
| **1M rows** | 30–60 seconds | 4 |
| **10M rows** | 3–5 minutes | 4 (chunked) |
| **100M rows** | 10–30 minutes | 4 (chunked + sampled) |

**Optimization Strategies:**
- ✅ Automatic sampling for datasets > 1M rows
- ✅ Parallel execution (4 workers by default)
- ✅ Result caching (file or Redis)
- ✅ Progress callbacks for long-running analyses
- ✅ Memory-aware warnings (logs warning if > 1GB)

---

## 📊 Feature Comparison

| Feature | v0.1.0 | v0.2.0 | 
|---|:---:|:---:|
| **Statistical Analysis** | ✅ | ✅ |
| Automated test selection | ✅ | ✅ |
| Effect size calculation | ✅ | ✅ |
| Assumption checking | ✅ | ✅ |
| **Time Series (NEW)** | — | ✅ |
| Detection & decomposition | — | ✅ |
| ARIMA & ES forecasting | — | ✅ |
| Anomaly detection | — | ✅ |
| Change point detection | — | ✅ |
| **Clustering (NEW)** | — | ✅ |
| K-Means | — | ✅ |
| DBSCAN | — | ✅ |
| Hierarchical | — | ✅ |
| Cluster profiling | — | ✅ |
| **Performance (NEW)** | — | ✅ |
| Parallel execution | — | ✅ |
| Result caching | — | ✅ |
| Sampling strategies | — | ✅ |
| Chunked processing | — | ✅ |
| **Connectors (NEW)** | — | ✅ |
| PostgreSQL | — | ✅ |
| MySQL/MariaDB | — | ✅ |
| SQLite | — | ✅ |
| BigQuery | — | ✅ |
| Snowflake | — | ✅ |
| S3/Azure/GCS | — | ✅ |
| **Export (NEW)** | — | ✅ |
| HTML reports | — | ✅ |
| PDF export | — | ✅ |
| PowerPoint slides | — | ✅ |
| Jupyter notebooks | — | ✅ |
| JSON export | — | ✅ |
| **Other Features** | | |
| Data profiling | ✅ | ✅ |
| Rule-based insights | ✅ | ✅ |
| LLM narration | ✅ | ✅ |
| Custom pipelines | — | ✅ |
| Progress callbacks | — | ✅ |
| CLI interface | — | ✅ |
| Backward compatible | — | ✅ |

---

## 🔧 Installation & Setup

### System Requirements

- **Python:** 3.9, 3.10, 3.11, 3.12
- **OS:** Linux, macOS, Windows
- **RAM:** 2GB minimum; 8GB+ recommended for large datasets

### Basic Installation

```bash
# Minimal (core features only)
pip install -e .

# Development
pip install -e ".[dev]"

# Production (all features)
pip install -e ".[advanced,connectors,export,llm]"

# Everything (including dev tools)
pip install -e ".[advanced,connectors,export,llm,dev]"
```

### Verify Installation

```bash
python -c "from xelytics import analyze; print('✓ Xelytics installed')"

# Check version
python -c "import xelytics; print(xelytics.__version__)"

# Test CLI
xelytics --version
```

---

## 📚 Documentation

Full documentation is available in the `docs/` folder:

| Topic | Location |
|---|---|
| **🚀 Installation** | [docs/installation.md](docs/installation.md) |
| **📖 Quick Start** | [docs/quickstart.md](docs/quickstart.md) |
| **📊 Statistical Analysis** | [docs/guides/01_basic_analysis.md](docs/guides/01_basic_analysis.md) |
| **⏱️ Time Series** | [docs/guides/02_time_series.md](docs/guides/02_time_series.md) |
| **🎯 Clustering** | [docs/guides/03_clustering.md](docs/guides/03_clustering.md) |
| **⚡ Performance** | [docs/guides/04_performance.md](docs/guides/04_performance.md) |
| **🔗 Connectors** | [docs/guides/05_connectors.md](docs/guides/05_connectors.md) |
| **📄 Export & Reports** | [docs/guides/06_export_reports.md](docs/guides/06_export_reports.md) |
| **🛠️ Custom Pipelines** | [docs/guides/07_custom_pipelines.md](docs/guides/07_custom_pipelines.md) |
| **💻 CLI Guide** | [docs/guides/08_cli.md](docs/guides/08_cli.md) |
| **📡 Observability** | [docs/guides/09_observability.md](docs/guides/09_observability.md) |
| **🧩 Extensibility** | [docs/guides/10_extensibility.md](docs/guides/10_extensibility.md) |
| **🔍 API Reference** | [docs/api/](docs/api/) |
| **📋 Examples** | [examples/](examples/) |
| **📜 Migration Guide** | [docs/migration/v01_to_v02.md](docs/migration/v01_to_v02.md) |
| **📜 v0.2 → v0.3 Migration** | [MIGRATION_GUIDE_v0.2_to_v0.3.md](MIGRATION_GUIDE_v0.2_to_v0.3.md) |
| **🏗️ Architecture** | [ARCHITECTURE.md](ARCHITECTURE.md) |
| **📑 API Contract** | [API_CONTRACT.md](API_CONTRACT.md) |
| **📝 Comprehensive Docs** | [COMPREHENSIVE_DOCUMENTATION.md](COMPREHENSIVE_DOCUMENTATION.md) |

---

## 🛠️ Development

### Setup Development Environment

```bash
# Clone repository
git clone https://github.com/xelytics/xelytics-core.git
cd xelytics-core

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # or .venv\Scripts\activate on Windows

# Install dev dependencies
pip install -e ".[dev,advanced,connectors,export]"
```

### Running Tests

```bash
# All tests
pytest tests/ -v

# Specific test file
pytest tests/test_clustering.py -v

# Tests matching pattern
pytest tests/ -k "test_kmeans" -v

# With coverage report
pytest tests/ --cov=xelytics --cov-report=html

# Only unit tests (exclude slow integration tests)
pytest tests/ -m "not integration" -v

# Only fast tests
pytest tests/ -m "not slow" -v
```

### Code Formatting & Linting

```bash
# Format code with Black
black xelytics/ tests/ examples/

# Check formatting
black --check xelytics/ tests/

# Lint with Ruff
ruff check xelytics/ tests/ --fix

# Type checking with mypy
mypy xelytics/
```

### Build & Publish

```bash
# Build package
pip install build
python -m build

# Publish to PyPI (requires credentials)
pip install twine
python -m twine upload dist/*
```

---

## 🧪 Testing & Quality Assurance

**Test Coverage:** 85%+ (307 tests)

**Test Categories:**

| Category | Count | Status |
|---|---|---|
| Unit Tests | 200+ | ✅ Passing |
| Integration Tests | 50+ | ✅ Passing |
| Performance Tests | 20+ | ✅ Passing |
| Backward Compatibility Tests | 8 | ✅ Passing (v0.1.0 code works in v0.2.0) |
| Example Scripts | 5 | ✅ Working |

**Key Test Suites:**
- ✅ `test_core.py` - Data ingestion, profiling, feature detection
- ✅ `test_clustering.py` - K-Means, DBSCAN, Hierarchical
- ✅ `test_timeseries_advanced.py` - Decomposition, forecasting, anomalies
- ✅ `test_stats.py` - Statistical tests, effect sizes, assumptions
- ✅ `test_connectors_integration.py` - Database connectivity
- ✅ `test_export.py` - HTML, PDF, PowerPoint, notebook export
- ✅ `test_caching.py` - File and Redis caching
- ✅ `test_v02_backward_compatibility.py` - v0.1.0 compatibility

**Run Full Test Suite:**

```bash
# Quick run (excludes slow tests)
pytest tests/ -m "not slow" --tb=short

# Full run (includes slow + integration)
pytest tests/ -v --tb=short

# With coverage
pytest tests/ --cov=xelytics --cov-report=term-missing
```

---

## Architecture Evolution (v0.2.x → v0.3.0)

> Added in v0.3.0

The v0.2.x architecture remains valid for simple DataFrame workflows: ingest data, detect schema/features, profile columns, run analysis modules, generate visualizations and insights, then export the result. v0.3.0 adds a planning layer in front of that pipeline rather than replacing it outright.

```text
v0.2.x eager flow:
DataFrame -> ingestion -> profiling -> stats/time series/clustering -> insights -> exports

v0.3.0 lazy flow:
Dataset -> ExecutionPlan -> TransformGraph nodes -> executor -> analysis -> trace/profile/result
```

| Layer | v0.2.x | v0.3.0 |
|---|---|---|
| Public API | `analyze(df)` | `analyze(df)` plus `Xelytics().dataset(...).analyze().run()` |
| Data source | DataFrame or connector-loaded DataFrame | `Dataset`, `MaterializedDataset`, `LazyDataset`, connector-backed sources |
| Pipeline shape | Mostly linear, eager execution | DAG of plan nodes and transform nodes |
| Optimization | Parallel tasks, sampling, result cache | Execution planning, SQL pushdown, chunk-aware execution hooks, node cache |
| Metadata | `RunMetadata` | `RunMetadata` plus trace/profiling/lineage-capable metadata |
| Extensibility | Pipeline steps, exporters, LLM providers | Analyzer, transformation, and output registries |

Compatibility guarantee: the v0.3.0 executor still materializes into the established `AnalysisResult` schema after planning. Existing code that reads `summary`, `statistics`, `visualizations`, `insights`, `metadata`, `time_series_analysis`, or `clusters` can continue to do so.
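
A short sketch of that guarantee: both entry points yield the same result fields (only fields already referenced in this README are read here):

```python
from xelytics import Xelytics, analyze

eager = analyze(df)
lazy = Xelytics().dataset(df).analyze().run()

# Stable fields listed in the compatibility note above
for result in (eager, lazy):
    print(result.summary.row_count, result.metadata.tests_executed, len(result.insights))
```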

## 🏗️ Architecture

### System Design

```
┌─────────────────────────────────┐
│    Public API Layer             │
│  analyze() / AnalysisConfig     │
└──────────────┬──────────────────┘
               │
┌──────────────▼──────────────────┐
│    Data Ingestion Layer         │
│  Connectors, DataFrames, Files  │
└──────────────┬──────────────────┘
               │
┌──────────────▼──────────────────┐
│    Processing Core              │
│  Type Detection, Sampling       │
│  Feature Detection, Profiling   │
└──────────────┬──────────────────┘
               │
       ┌───────┴─────────┬──────────────┐
       │                 │              │
   ┌───▼────┐  ┌────────▼──┐  ┌───────▼──┐
   │ Stats  │  │ TimeSeries│  │Clustering│
   │Engine  │  │ Engine    │  │ Engine   │
   └───┬────┘  └────────┬──┘  └───────┬──┘
       │                │              │
       └────────┬───────┴──────────────┘
                │
      ┌─────────▼──────────┐
      │  Visualization &   │
      │  Insight Generator │
      └─────────┬──────────┘
                │
      ┌─────────▼──────────┐
      │  Export Layer      │
      │  HTML/PDF/PPTX/etc │
      └────────────────────┘
```

### Module Breakdown

```
xelytics-core/
├── xelytics/
│   ├── __init__.py               # Public API
│   ├── engine.py                 # Main analyze() function
│   ├── api.py                    # Chainable Xelytics API (v0.3.0)
│   ├── dataset.py                # Dataset abstraction: materialized/lazy/transformed (v0.3.0)
│   ├── exceptions.py             # Exception hierarchy
│   │
│   ├── core/                     # Data pipeline
│   │   ├── ingestion.py          # Type detection, validation
│   │   ├── profiler.py           # Column statistics
│   │   ├── features.py           # Feature detection
│   │   └── chunked.py            # Large dataset processing
│   │
│   ├── stats/                    # Statistical analysis
│   │   ├── engine.py             # Test selection & execution
│   │   ├── planner.py            # Analysis planning
│   │   └── ...
│   │
│   ├── timeseries/               # Time series (v0.2.0)
│   │   ├── detector.py           # Series detection
│   │   ├── decomposition.py      # Trend/seasonal separation
│   │   ├── forecasting.py        # ARIMA/ExpSmoothing
│   │   ├── anomaly.py            # Anomaly detection
│   │   └── change_points.py      # Change point detection
│   │
│   ├── clustering/               # Clustering (v0.2.0)
│   │   ├── kmeans.py             # K-Means
│   │   ├── dbscan.py             # DBSCAN
│   │   ├── hierarchical.py       # Hierarchical clustering
│   │   └── profiler.py           # Cluster profiling
│   │
│   ├── connectors/               # Data sources (v0.2.0)
│   │   ├── postgres.py           # PostgreSQL
│   │   ├── mysql.py              # MySQL/MariaDB
│   │   ├── database.py           # Base SQL class
│   │   ├── s3.py                 # AWS S3
│   │   ├── cloud.py              # Azure/GCS
│   │   └── ...
│   │
│   ├── export/                   # Report generation (v0.2.0)
│   │   ├── html.py               # HTML reports
│   │   ├── pdf.py                # PDF export
│   │   ├── pptx.py               # PowerPoint slides
│   │   ├── notebook.py           # Jupyter notebooks
│   │   └── ...
│   │
│   ├── cache/                    # Caching (v0.2.0)
│   │   ├── base.py               # Cache interface
│   │   ├── file.py               # File-based cache
│   │   └── redis.py              # Redis cache
│   │
│   ├── pipeline/                 # Custom pipelines (v0.2.0)
│   │   ├── __init__.py           # Pipeline class
│   │   └── steps.py              # Pre-built steps
│   │
│   ├── execution/                # Lazy execution planning (v0.3.0)
│   │   ├── plan.py               # ExecutionPlan and PlanNode
│   │   ├── builder.py            # PlanBuilder
│   │   ├── executor.py           # DAG executor with tracing/profiling
│   │   └── pushdown.py           # SQL pushdown helpers
│   │
│   ├── graph/                    # Transformation DAG (v0.3.0)
│   │   ├── graph.py              # TransformGraph
│   │   ├── node.py               # DataSourceNode, TransformNode, SinkNode
│   │   ├── cache.py              # NodeCache
│   │   └── lineage.py            # LineageTracker
│   │
│   ├── analyzers/                # Modular analyzers (v0.3.0)
│   │   ├── profiling.py          # ProfilingAnalyzer
│   │   ├── correlation.py        # CorrelationAnalyzer
│   │   ├── trend_anomaly.py      # TrendAnomalyAnalyzer
│   │   └── segmentation.py       # SegmentationAnalyzer
│   │
│   ├── observability/            # Tracing and profiling (v0.3.0)
│   │   ├── trace.py              # TraceCollector
│   │   └── profiler.py           # ExecutionProfiler
│   │
│   ├── extension/                # Plugin registries (v0.3.0)
│   │   ├── interfaces.py         # Analyzer, CustomTransform, OutputFormat
│   │   └── registry.py           # register_* decorators
│   │
│   ├── llm/                      # LLM integration
│   │   ├── openai.py             # OpenAI provider
│   │   ├── groq.py               # Groq provider
│   │   └── base.py               # Provider interface
│   │
│   ├── viz/                      # Visualizations
│   │   ├── generator.py          # Plotly spec generation
│   │   └── themes.py             # Color schemes
│   │
│   ├── insights/                 # Insight generation
│   │   ├── rules.py              # Rule-based insights
│   │   └── templates.py          # Insight templates
│   │
│   ├── schemas/                  # Type definitions
│   │   ├── config.py             # AnalysisConfig
│   │   └── outputs.py            # AnalysisResult & schemas
│   │
│   └── cli/                      # Command-line interface
│       └── main.py               # CLI entry point
│
├── tests/                        # 300+ tests
│   ├── test_core.py
│   ├── test_clustering.py
│   ├── test_timeseries_*.py
│   ├── test_connectors_integration.py
│   ├── test_export.py
│   └── ...
│
├── examples/                     # Example scripts
│   ├── quickstart.py
│   ├── forecasting_demo.py
│   └── ...
│
├── docs/                         # Full documentation
│   ├── guides/                   # Step-by-step guides
│   ├── api/                      # API reference
│   └── examples/                 # Example notebooks
│
└── pyproject.toml                # Dependencies & config
```

---

## 📋 API Classes & Functions

### Core Classes

For simple eager analysis, the v0.2-compatible entry point is:

```python
from xelytics import analyze

result = analyze(df)
```

Adopt this when you need lazy data binding, plan inspection, graph transforms,
observability, or extension registries:

```python
from xelytics import Xelytics

result = Xelytics().dataset(df).analyze().run()
```

Migration notes:

- `analyze(df)`, `AnalysisConfig`, `AnalysisResult`, connectors, cache backends,
  exporters, pipelines, time series modules, and clustering modules remain
  supported.
- v0.3.0 adds optional result fields: `correlation`, `trend_anomaly`,
  `segmentation`, `trace`, and `profiling`.
- v0.2.x eager `Pipeline` remains supported for preprocessing.
- Prefer `Dataset.transform()` or `TransformGraph` for transformations that need
  lineage, node caching, or plan visibility.
- Prefer `Xelytics` or `build_plan()` for new lazy and graph-aware workflows.
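
A hedged sketch of checking the new optional result fields named above (whether an unused analyzer leaves its field `None` or empty is an assumption):

```python
from xelytics import Xelytics

result = Xelytics().dataset(df).analyze().run()

# New optional v0.3.0 result fields; availability depends on which analyzers ran
for field in ("correlation", "trend_anomaly", "segmentation", "trace", "profiling"):
    print(field, "present" if getattr(result, field, None) else "not populated")
```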

## Documentation

| Document | Purpose |
|---|---|
| [examples/xelytics_core_v0_3_0_complete.ipynb](examples/xelytics_core_v0_3_0_complete.ipynb) | complete executable v0.3.0 notebook |
| [docs/quickstart.md](docs/quickstart.md) | copy-paste examples |
| [docs/index.md](docs/index.md) | documentation index and feature matrix |
| [docs/api/analyze.md](docs/api/analyze.md) | `analyze()`, `Xelytics`, and large-dataset API |
| [docs/api/config.md](docs/api/config.md) | all `AnalysisConfig` fields |
| [docs/api/result_schema.md](docs/api/result_schema.md) | result dataclasses and serialization |
| [docs/api/execution.md](docs/api/execution.md) | `Dataset`, `ExecutionPlan`, `TransformGraph`, observability |
| [docs/api/extensions.md](docs/api/extensions.md) | extension registry APIs |
| [docs/guides/05_connectors.md](docs/guides/05_connectors.md) | database and cloud source usage |
| [docs/guides/06_export_reports.md](docs/guides/06_export_reports.md) | report export formats |
| [MIGRATION_GUIDE_v0.2_to_v0.3.md](MIGRATION_GUIDE_v0.2_to_v0.3.md) | v0.2.x to v0.3.0 migration |
| [ARCHITECTURE.md](ARCHITECTURE.md) | package architecture |
| [CHANGELOG.md](CHANGELOG.md) | release history |

## Development

```bash
pip install -e ".[dev]"
pytest tests/
```

Focused v0.3.0 verification:

```bash
pytest tests/test_epic1_connectivity.py tests/test_epic2.py tests/test_epic3.py
pytest tests/test_epic4.py tests/test_epic5.py tests/test_epic6.py tests/test_epic7.py
```

The package supports Python 3.9 through 3.12.

## Project Status

Xelytics-Core is beta software. v0.3.0 is compatibility-first: older v0.2.x
DataFrame workflows remain valid while the package moves toward the lazy,
graph-aware engine model. See [CHANGELOG.md](CHANGELOG.md) and
[API_CONTRACT.md](API_CONTRACT.md) for versioning and compatibility policy.

## License

MIT, as declared in [pyproject.toml](pyproject.toml).
