Metadata-Version: 2.4
Name: ml4t-data
Version: 0.1.0b12
Summary: High-performance market data management library with unified multi-provider interface
Project-URL: Homepage, https://github.com/stefan-jansen/ml4t-data
Project-URL: Documentation, https://ml4t-data.readthedocs.io
Project-URL: Repository, https://github.com/stefan-jansen/ml4t-data
Project-URL: Issues, https://github.com/stefan-jansen/ml4t-data/issues
Project-URL: Changelog, https://github.com/stefan-jansen/ml4t-data/blob/main/CHANGELOG.md
Author-email: ML4T Team <info@ml4trading.io>
Maintainer-email: ML4T Contributors <dev@ml4trading.io>
License: MIT
License-File: LICENSE
Keywords: backtesting,binance,cryptocompare,databento,finance,market-data,oanda,parquet,polars,quantitative-finance,trading,yahoo-finance
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Financial and Insurance Industry
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Office/Business :: Financial :: Investment
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Typing :: Typed
Requires-Python: >=3.11
Requires-Dist: aiofiles>=23.0.0
Requires-Dist: click>=8.0.0
Requires-Dist: filelock>=3.19.1
Requires-Dist: html5lib>=1.1
Requires-Dist: httpx>=0.25.0
Requires-Dist: lxml>=6.0.2
Requires-Dist: numpy>=1.24.0
Requires-Dist: openpyxl>=3.1.5
Requires-Dist: pandas-market-calendars>=4.3.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: platformdirs>=4.0.0
Requires-Dist: polars>=0.20.0
Requires-Dist: pyarrow>=14.0.0
Requires-Dist: pybreaker>=1.0.0
Requires-Dist: pydantic-settings>=2.0.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.0.0
Requires-Dist: structlog>=23.0.0
Requires-Dist: tenacity>=8.0.0
Provides-Extra: all
Requires-Dist: cot-reports>=0.1.0; extra == 'all'
Requires-Dist: databento>=0.38.0; extra == 'all'
Requires-Dist: hypothesis>=6.80.0; extra == 'all'
Requires-Dist: ipdb>=0.13.0; extra == 'all'
Requires-Dist: ipython>=8.14.0; extra == 'all'
Requires-Dist: mkdocs-git-revision-date-localized-plugin>=1.2.0; extra == 'all'
Requires-Dist: mkdocs-material>=9.5.0; extra == 'all'
Requires-Dist: mkdocs>=1.6.0; extra == 'all'
Requires-Dist: mkdocstrings[python]>=0.24.0; extra == 'all'
Requires-Dist: mypy>=1.5.0; extra == 'all'
Requires-Dist: oandapyv20>=0.7.0; extra == 'all'
Requires-Dist: pre-commit>=3.3.0; extra == 'all'
Requires-Dist: pytest-asyncio>=0.21.0; extra == 'all'
Requires-Dist: pytest-cov>=4.1.0; extra == 'all'
Requires-Dist: pytest-timeout>=2.1.0; extra == 'all'
Requires-Dist: pytest-xdist>=3.3.0; extra == 'all'
Requires-Dist: pytest>=7.4.0; extra == 'all'
Requires-Dist: ruff>=0.1.0; extra == 'all'
Requires-Dist: xlsxwriter>=3.1.0; extra == 'all'
Requires-Dist: yfinance>=0.2.0; extra == 'all'
Provides-Extra: all-providers
Requires-Dist: cot-reports>=0.1.0; extra == 'all-providers'
Requires-Dist: databento>=0.38.0; extra == 'all-providers'
Requires-Dist: oandapyv20>=0.7.0; extra == 'all-providers'
Requires-Dist: yfinance>=0.2.0; extra == 'all-providers'
Provides-Extra: cot
Requires-Dist: cot-reports>=0.1.0; extra == 'cot'
Provides-Extra: databento
Requires-Dist: databento>=0.38.0; extra == 'databento'
Provides-Extra: dev
Requires-Dist: hypothesis>=6.80.0; extra == 'dev'
Requires-Dist: ipdb>=0.13.0; extra == 'dev'
Requires-Dist: ipython>=8.14.0; extra == 'dev'
Requires-Dist: mypy>=1.5.0; extra == 'dev'
Requires-Dist: oandapyv20>=0.7.0; extra == 'dev'
Requires-Dist: pre-commit>=3.3.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.21.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.1.0; extra == 'dev'
Requires-Dist: pytest-timeout>=2.1.0; extra == 'dev'
Requires-Dist: pytest-xdist>=3.3.0; extra == 'dev'
Requires-Dist: pytest>=7.4.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Requires-Dist: xlsxwriter>=3.1.0; extra == 'dev'
Provides-Extra: docs
Requires-Dist: mkdocs-git-revision-date-localized-plugin>=1.2.0; extra == 'docs'
Requires-Dist: mkdocs-material>=9.5.0; extra == 'docs'
Requires-Dist: mkdocs>=1.6.0; extra == 'docs'
Requires-Dist: mkdocstrings[python]>=0.24.0; extra == 'docs'
Provides-Extra: oanda
Requires-Dist: oandapyv20>=0.7.0; extra == 'oanda'
Provides-Extra: yahoo
Requires-Dist: yfinance>=0.2.0; extra == 'yahoo'
Description-Content-Type: text/markdown

# ml4t-data

[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![PyPI](https://img.shields.io/pypi/v/ml4t-data)](https://pypi.org/project/ml4t-data/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

Unified market data acquisition and storage for quantitative research workflows.

## Part of the ML4T Library Ecosystem

This library is one of six interconnected libraries supporting the machine learning for trading workflow described in [Machine Learning for Trading](https://ml4trading.io):

![ML4T Library Ecosystem](docs/images/ml4t_ecosystem_workflow_color.png)

Together they cover data infrastructure, feature engineering, modeling, signal evaluation, strategy backtesting, and live deployment.

## What This Library Does

Quantitative research requires consistent, reproducible access to market data from multiple sources. ml4t-data provides:

- `DataManager` as the unified interface: fetch, store, update, and query across all providers
- 20+ provider adapters covering equities, crypto, futures, forex, macro, prediction markets, and factors
- Automated storage in Hive-partitioned Parquet format with metadata tracking
- Incremental updates, gap detection, and backfill via CLI
- Built-in data validation (OHLC invariants, deduplication, anomaly detection)
- Futures module for CME/ICE bulk downloads with continuous contract construction
- COT module for CFTC Commitment of Traders weekly reports
- Resilience: rate limiting, retry with exponential backoff, gap detection

The goal is to support an ongoing research workflow rather than one-off downloads. Data is stored locally, tracked for freshness, and queryable with tools like DuckDB or Polars.
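
To make the idea of gap detection concrete, here is a minimal, dependency-free sketch. It assumes daily bars on a Monday-to-Friday calendar and ignores exchange holidays. `find_gaps` is a hypothetical helper for illustration, not a function of ml4t-data, which may rely on exchange calendars instead.

```python
from datetime import date, timedelta


def find_gaps(stored: list[date]) -> list[date]:
    """Return weekdays missing between the first and last stored date."""
    if not stored:
        return []
    have = set(stored)
    missing = []
    day, last = min(stored), max(stored)
    while day <= last:
        # weekday() < 5 keeps Monday..Friday; holidays are NOT handled here
        if day.weekday() < 5 and day not in have:
            missing.append(day)
        day += timedelta(days=1)
    return missing


stored_days = [date(2024, 1, 2), date(2024, 1, 3), date(2024, 1, 5), date(2024, 1, 8)]
gaps = find_gaps(stored_days)
print(gaps)  # [datetime.date(2024, 1, 4)]
```

The missing dates are what a backfill step would then request from the provider.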

![ml4t-data Architecture](docs/images/ml4t_data_architecture_print.jpeg)

## Installation

```bash
pip install ml4t-data
pip install "ml4t-data[yahoo]"  # optional extras: yahoo, databento, oanda, cot, all-providers
```

## Quick Start

### DataManager (Unified Interface)

```python
from ml4t.data import DataManager

dm = DataManager()

# Fetch and store
dm.fetch("AAPL", "2020-01-01", "2024-12-31", provider="yahoo")

# Load from local storage
data = dm.load("AAPL", "2020-01-01", "2024-12-31")

# Batch load multiple symbols
prices = dm.batch_load(["AAPL", "MSFT", "GOOGL"], "2020-01-01", "2024-12-31")

# Incremental update
dm.update("AAPL")

# List what's stored
symbols = dm.list_symbols()
metadata = dm.get_metadata("AAPL")
```

### Direct Provider Access

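Providers can also be used directly, without `DataManager`. Price providers share the same `fetch_ohlcv` call, while FRED exposes `fetch_series` for economic series: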

```python
from ml4t.data.providers import YahooFinanceProvider, CoinGeckoProvider, FREDProvider

# Equities
provider = YahooFinanceProvider()
data = provider.fetch_ohlcv("AAPL", "2020-01-01", "2024-12-31")

# Crypto
crypto = CoinGeckoProvider().fetch_ohlcv("bitcoin", "2024-01-01", "2024-12-31")

# Economic data
fred = FREDProvider().fetch_series("GDP", "2020-01-01", "2024-12-31")
```
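
Because price providers share one call shape, downstream code can be written against that shape rather than a concrete class. The sketch below is an assumption-laden illustration: `OHLCVProvider`, `FakeProvider`, and `first_close` are hypothetical names, the parameter names are guesses based on the positional arguments above, and the real providers return DataFrames rather than lists of dicts.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class OHLCVProvider(Protocol):
    """Hypothetical structural type; not the library's actual base class."""

    def fetch_ohlcv(self, symbol: str, start: str, end: str) -> Any: ...


class FakeProvider:
    """Stand-in provider returning canned rows, handy for offline tests."""

    def fetch_ohlcv(self, symbol: str, start: str, end: str) -> list[dict[str, Any]]:
        return [{"symbol": symbol, "date": start, "close": 100.0}]


def first_close(provider: OHLCVProvider, symbol: str) -> float:
    """Works with any object that offers a compatible fetch_ohlcv method."""
    rows = provider.fetch_ohlcv(symbol, "2024-01-01", "2024-01-31")
    return rows[0]["close"]


print(first_close(FakeProvider(), "AAPL"))  # 100.0
```

A stand-in like this keeps unit tests independent of network access and API quotas.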

## Data Providers

### No API Key Required

| Provider | Coverage |
|----------|----------|
| Yahoo Finance | US/global equities, ETFs, crypto, forex |
| CoinGecko | 10,000+ cryptocurrencies |
| FRED | 850,000 economic series |
| Fama-French | Academic factor data |
| AQR | Research factors (QMJ, BAB, HML Devil, VME, more) |
| Wiki Prices | Frozen US equities history (1962-2018) |
| Kalshi | Prediction market contracts |
| Polymarket | Prediction market history/order book snapshots |
| Binance Public | Bulk crypto data downloads |
| NASDAQ ITCH Sample | Tick-level sample data |

### Authenticated or Metered APIs

| Provider | Coverage |
|----------|----------|
| EODHD | 60+ global exchanges |
| Tiingo | US equities with quality focus |
| Twelve Data | Multi-asset coverage |
| Databento | CME, CBOE, ICE futures/options |
| Polygon | US equities, options, forex, crypto |
| Finnhub | 70+ global exchanges |
| Binance | Crypto exchange data |
| OKX | Crypto perpetuals and funding rates |
| CryptoCompare | Crypto market data |
| OANDA | Forex broker data |

## Specialized Modules

### Futures

Bulk download and continuous contract construction for CME/ICE products:

```python
from ml4t.data.futures import FuturesDownloader, ContinuousContractBuilder

# Bulk download via Databento (parent symbology)
# `config` is a download configuration prepared beforehand (not shown here)
downloader = FuturesDownloader(config)
downloader.download()  # Downloads ES, NQ, CL, GC, etc.

# Build continuous contracts with configurable roll logic
# `contracts_df` holds the individual contract data (not shown here)
builder = ContinuousContractBuilder()
continuous = builder.build(contracts_df, roll_method="volume")
```
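
As a rough illustration of what a volume-based roll can mean, the sketch below picks, for each date, the contract with the highest volume and never rolls back to an earlier expiry. It is not the logic of `ContinuousContractBuilder`: `volume_roll` is a hypothetical helper, the row layout is an assumption, the prices and volumes are made up, and no price adjustment is applied at the roll.

```python
from collections import defaultdict


def volume_roll(rows: list[dict]) -> list[dict]:
    """Pick one contract per date, rolling forward once a later contract trades more.

    Each row needs the keys: date, contract, expiry, close, volume
    (dates and expiries as ISO strings so they sort chronologically).
    """
    by_date = defaultdict(list)
    for row in rows:
        by_date[row["date"]].append(row)

    series = []
    current_expiry = None
    for day in sorted(by_date):
        candidates = by_date[day]
        if current_expiry is not None:
            # Never roll backwards; fall back to all rows if nothing qualifies
            later = [r for r in candidates if r["expiry"] >= current_expiry]
            candidates = later or candidates
        best = max(candidates, key=lambda r: r["volume"])
        current_expiry = best["expiry"]
        series.append({"date": day, "contract": best["contract"], "close": best["close"]})
    return series


# Made-up illustrative rows for two ES contracts
H4, M4 = "2024-03-15", "2024-06-21"
rows = [
    {"date": "2024-03-11", "contract": "ESH4", "expiry": H4, "close": 5100.0, "volume": 900},
    {"date": "2024-03-11", "contract": "ESM4", "expiry": M4, "close": 5160.0, "volume": 400},
    {"date": "2024-03-12", "contract": "ESH4", "expiry": H4, "close": 5110.0, "volume": 500},
    {"date": "2024-03-12", "contract": "ESM4", "expiry": M4, "close": 5170.0, "volume": 800},
    {"date": "2024-03-13", "contract": "ESH4", "expiry": H4, "close": 5105.0, "volume": 950},
    {"date": "2024-03-13", "contract": "ESM4", "expiry": M4, "close": 5165.0, "volume": 700},
]
front = [r["contract"] for r in volume_roll(rows)]
print(front)  # ['ESH4', 'ESM4', 'ESM4']
```

On the last day the older contract has more volume again, but the series stays on the later expiry, which avoids flip-flopping around the roll date.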

Book-focused interface with profiling:

```python
from ml4t.data.futures import FuturesDataManager

fm = FuturesDataManager.from_config("config.yaml")
fm.download_all()
data = fm.load_ohlcv("ES")
profile = fm.generate_profile("ES")
```

### COT (Commitment of Traders)

CFTC weekly positioning data for futures markets:

```python
from ml4t.data.cot import COTFetcher, create_cot_features, combine_cot_ohlcv_pit

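# `config` is a COT configuration prepared beforehand (not shown here)
fetcher = COTFetcher(config)
cot_data = fetcher.fetch_product("ES", start_year=2015, end_year=2024)

# Point-in-time combination with OHLCV (no look-ahead)
# `ohlcv_data` is the matching price history, loaded separately
combined = combine_cot_ohlcv_pit(cot_data, ohlcv_data)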

# Generate features from COT data
features = create_cot_features(cot_data)
```
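
The point-in-time idea can be sketched without the library. COT reports describe positions as of a Tuesday and are normally published on the following Friday, so a bar may only see a report once it has been released. The helper below is hypothetical (`join_point_in_time` is not the implementation of `combine_cot_ohlcv_pit`), the three-day lag is an assumption that ignores holiday-delayed releases, and the values are made up.

```python
from bisect import bisect_right
from datetime import date, timedelta


def join_point_in_time(bars, reports, lag_days=3):
    """Attach to each bar the latest report already released on that bar's date.

    bars:    list of (bar_date, close)
    reports: list of (as_of_date, net_position), as_of_date being the Tuesday snapshot
    """
    released = sorted((as_of + timedelta(days=lag_days), value) for as_of, value in reports)
    release_dates = [d for d, _ in released]
    joined = []
    for bar_date, close in bars:
        # Number of reports whose release date is on or before the bar date
        idx = bisect_right(release_dates, bar_date)
        value = released[idx - 1][1] if idx else None
        joined.append((bar_date, close, value))
    return joined


reports = [(date(2024, 1, 2), 100), (date(2024, 1, 9), 150)]
bars = [
    (date(2024, 1, 4), 10.0),
    (date(2024, 1, 5), 11.0),
    (date(2024, 1, 10), 12.0),
    (date(2024, 1, 12), 13.0),
]
joined = join_point_in_time(bars, reports)
print([value for _, _, value in joined])  # [None, 100, 100, 150]
```

The bar on 2024-01-10 still sees the older report even though a newer Tuesday snapshot exists, because that snapshot is not yet public. A stricter setup could add one more day of lag so that Friday's bar does not use a report released that afternoon.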

### Book Data Managers

Simplified interfaces for the ML4T book workflow:

```python
from ml4t.data.etfs import ETFDataManager
from ml4t.data.crypto import CryptoDataManager

# 50 diversified ETFs via Yahoo Finance
etf_dm = ETFDataManager.from_config("config.yaml")
etf_dm.download_all()
spy = etf_dm.load_ohlcv("SPY")  # any ETF symbol from the configured universe

# Crypto premium index via Binance Public
crypto_dm = CryptoDataManager.from_config("config.yaml")
crypto_dm.download_premium_index()
```

## CLI for Automated Updates

```bash
# Fetch specific symbols
ml4t-data fetch -s AAPL -s MSFT -s GOOGL --provider yahoo --start 2020-01-01

# Incremental update
ml4t-data update --symbol AAPL

# Validate data quality
ml4t-data validate --symbol AAPL --anomalies

# Check storage status
ml4t-data status --detailed

# List available data
ml4t-data list-data

# Export to CSV/JSON/Excel
ml4t-data export --symbol AAPL --format-type csv --output aapl.csv

# Get symbol info
ml4t-data info --symbol AAPL
```

Configuration-driven batch updates:

```yaml
storage:
  path: ~/data/market

datasets:
  sp500_daily:
    provider: yahoo
    symbols_file: symbols/sp500.txt
    frequency: daily
    start_date: 2015-01-01

  crypto:
    provider: coingecko
    symbols: [bitcoin, ethereum, solana]
    frequency: daily
    start_date: 2020-01-01
```

## Storage Format

Data is stored in Hive-partitioned Parquet:

```
~/data/market/
├── yahoo/daily/symbol=AAPL/data.parquet
├── yahoo/daily/symbol=MSFT/data.parquet
└── coingecko/daily/symbol=bitcoin/data.parquet
```
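
The layout above follows a simple `provider/frequency/symbol=<SYMBOL>/data.parquet` pattern. As a small sketch of how such paths can be built and read back, assuming exactly the layout shown (`partition_path` and `parse_partition` are hypothetical helpers, not library functions):

```python
from pathlib import Path


def partition_path(root: Path, provider: str, frequency: str, symbol: str) -> Path:
    """Build the Parquet file location for one symbol."""
    return root / provider / frequency / f"symbol={symbol}" / "data.parquet"


def parse_partition(path: Path) -> dict[str, str]:
    """Recover key=value partition fields from the directory names."""
    return dict(part.split("=", 1) for part in path.parts if "=" in part)


p = partition_path(Path("data/market"), "yahoo", "daily", "AAPL")
print(p.as_posix())        # data/market/yahoo/daily/symbol=AAPL/data.parquet
print(parse_partition(p))  # {'symbol': 'AAPL'}
```

Query engines that understand Hive partitioning read the `symbol=...` directory name as a column, which is why the symbol can be filtered on even though it is encoded in the path.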

Query with DuckDB or Polars:

```python
import duckdb

result = duckdb.execute("""
    SELECT * FROM read_parquet('~/data/market/yahoo/daily/**/*.parquet', hive_partitioning = true)
    WHERE symbol IN ('AAPL', 'MSFT')
    AND date >= '2024-01-01'
""").pl()
```

## Data Validation

```python
from ml4t.data.validation import OHLCVValidator, ValidationReport

# `data` is an OHLCV DataFrame, e.g. the result of dm.load(...) in the Quick Start
validator = OHLCVValidator()
report = validator.validate(data)
# Checks: high >= low, high >= open/close, low <= open/close
# Detects: duplicates, gaps, anomalies
```
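
The invariants listed in the comments are easy to state in code. The following is a minimal sketch in plain Python so that it runs without dependencies. It is not how `OHLCVValidator` is implemented: `ohlc_violations` is a hypothetical helper, and the row layout and sample values are assumptions.

```python
def ohlc_violations(bars: list[dict]) -> list[tuple[int, str]]:
    """Return (row_index, reason) for every bar that breaks an OHLC invariant."""
    problems = []
    seen_dates = set()
    for i, bar in enumerate(bars):
        o, h, l, c = bar["open"], bar["high"], bar["low"], bar["close"]
        if h < l:
            problems.append((i, "high < low"))
        if h < max(o, c):
            problems.append((i, "high below open/close"))
        if l > min(o, c):
            problems.append((i, "low above open/close"))
        if bar["date"] in seen_dates:
            problems.append((i, "duplicate date"))
        seen_dates.add(bar["date"])
    return problems


bars = [
    {"date": "2024-01-02", "open": 10.0, "high": 12.0, "low": 9.0, "close": 11.0},
    {"date": "2024-01-03", "open": 11.0, "high": 10.5, "low": 10.0, "close": 10.2},
    {"date": "2024-01-03", "open": 10.0, "high": 11.0, "low": 9.5, "close": 10.5},
]
problems = ohlc_violations(bars)
print(problems)  # [(1, 'high below open/close'), (2, 'duplicate date')]
```

Reporting the row index together with a reason makes it straightforward to decide per finding whether to drop, repair, or re-download a bar.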

Anomaly detection:

```python
from ml4t.data.anomaly import AnomalyManager, ReturnOutlierDetector, VolumeSpikeDetector

manager = AnomalyManager([
    ReturnOutlierDetector(),
    VolumeSpikeDetector(),
])
report = manager.detect(data)
```
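
For intuition about return outliers, here is one common approach: flag returns whose distance from the median exceeds a multiple of the median absolute deviation (MAD). This is a swapped-in textbook technique, not necessarily what `ReturnOutlierDetector` does. `return_outliers`, its threshold, and the sample prices are assumptions for illustration.

```python
from statistics import median


def return_outliers(closes: list[float], threshold: float = 5.0) -> list[int]:
    """Return indices of closes whose simple return is a robust outlier."""
    returns = [closes[i] / closes[i - 1] - 1.0 for i in range(1, len(closes))]
    if len(returns) < 3:
        return []
    center = median(returns)
    mad = median(abs(r - center) for r in returns)
    if mad == 0:
        return []  # no dispersion to measure against
    # 1.4826 rescales the MAD to be comparable with a standard deviation
    scale = 1.4826 * mad
    # returns[i] belongs to closes[i + 1], hence the index shift
    return [i + 1 for i, r in enumerate(returns) if abs(r - center) / scale > threshold]


closes = [100.0, 101.0, 100.5, 101.5, 150.0, 151.0, 150.5]
flagged = return_outliers(closes)
print(flagged)  # [4]
```

The median-based scale is less affected by the jump itself than a standard deviation would be. A flagged bar like this often points to an unadjusted split or a bad print rather than a genuine move.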

## Documentation

- [Getting Started](docs/user-guide/getting-started.md) — quick start guide
- [Configuration](docs/user-guide/configuration.md) — YAML config reference
- [Storage](docs/user-guide/storage.md) — Hive partitioning and backends
- [Incremental Updates](docs/user-guide/incremental-updates.md) — update strategies and gap detection
- [Data Quality](docs/user-guide/data-quality.md) — validation and anomaly detection
- [CLI Reference](docs/user-guide/cli-reference.md) — command-line interface
- [Provider Selection Guide](docs/provider-selection-guide.md) — choosing providers
- [Creating a Provider](docs/creating_a_provider.md) — extending with new sources

## Technical Characteristics

- **Polars-based**: Native Polars DataFrames throughout
- **Consistent schema**: All providers return the same column structure
- **Async support**: Async providers and batch operations for parallel downloads
- **Metadata tracking**: Last update timestamps, row counts, date ranges
- **Resilience**: Rate limiting, retry with exponential backoff, gap detection
- **Multiple backends**: File system, S3, and in-memory storage
- **Type-safe**: Full type annotations throughout
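
As an illustration of the retry behaviour named above, the sketch below shows exponential backoff in plain Python. It is a generic pattern, not the library's implementation: `retry_with_backoff` and its parameters are hypothetical, and the `sleep` argument is injectable only so the example runs instantly.

```python
import time


def retry_with_backoff(func, attempts=4, base_delay=0.5, factor=2.0, sleep=time.sleep):
    """Call func(); after a failure wait base_delay * factor**n, then try again."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * factor**attempt)


calls = {"n": 0}
waits = []


def flaky():
    """Fails twice, then succeeds, to mimic a transient network error."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = retry_with_backoff(flaky, sleep=waits.append)
print(result, waits)  # ok [0.5, 1.0]
```

Production code would usually add jitter to the delay and retry only on errors known to be transient.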

## Related Libraries

- **ml4t-engineer**: Feature engineering and technical indicators
- **ml4t-diagnostic**: Signal evaluation and statistical validation
- **ml4t-backtest**: Event-driven backtesting
- **ml4t-live**: Live trading with broker integration

## Development

```bash
git clone https://github.com/stefan-jansen/ml4t-data.git
cd ml4t-data
uv sync
uv run pytest tests/ -q
uv run ty check
```

## License

MIT License - see [LICENSE](LICENSE) for details.
