Metadata-Version: 2.4
Name: factordbms
Version: 0.1.0
Summary: A comprehensive factor library management system for quantitative trading research
Project-URL: Homepage, https://github.com/ElenYoung/FactorDBMS
Project-URL: Repository, https://github.com/ElenYoung/FactorDBMS
Project-URL: Issues, https://github.com/ElenYoung/FactorDBMS/issues
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: numpy>=1.20.0
Requires-Dist: pandas>=1.3.0
Requires-Dist: scipy>=1.7.0
Requires-Dist: quantchdb>=0.2.0
Requires-Dist: python-dotenv>=0.19.0
Requires-Dist: PyYAML>=5.0
Requires-Dist: streamlit>=1.28.0
Requires-Dist: plotly>=5.0.0
Requires-Dist: networkx>=2.8.0
Requires-Dist: build>=1.4.0
Requires-Dist: twine>=6.2.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=3.0.0; extra == "dev"

# FactorDBMS

A comprehensive factor library management system for quantitative trading research.

## Features

- **Unified Operator Mapping**: Integrates operators from different factor mining frameworks (gpfactor, masfactorMiner, and the legacy mas1) into a common set of operations
- **Expression Processing**: Parses and calculates factor expressions across frameworks; expressions are normalized automatically (the source framework is auto-detected by default; pass `normalize=False` when your expressions already use the unified names)
- **Factor Evaluation**: 30+ evaluation metrics including IC, monotonicity, returns, and distribution statistics
- **Orthogonality Analysis**: Comprehensive factor redundancy analysis with correlation, clustering, and selection algorithms
- **Database Storage**: ClickHouse-based storage for factor values and metadata
- **Automated Factor Management**: Factor registration, calculation, and lifecycle management

## Installation

```bash
# Clone the repository
git clone https://github.com/ElenYoung/FactorDBMS.git
cd FactorDBMS

# Install dependencies
pip install -r requirements.txt

# Install the package
pip install -e .
```

## Configuration

Create a `.env` file in the project root with your ClickHouse database credentials:

```env
DB_HOST=localhost
DB_PORT=9000
DB_USER=default
DB_PASSWORD=your_password
```

## Quick Start

### 1. Expression Processing

```python
from factordb import ExpressionNormalizer, ExpressionCalculator

# Normalize expressions from different frameworks (optional)
normalizer = ExpressionNormalizer()

# gpfactor (C++ style)
expr = normalizer.normalize_expression('sub(close)(low)', 'gpfactor')
# Result: 'sub(close, low)'

# mas1 (CamelCase style)
expr = normalizer.normalize_expression('Mul(Rank(close), EMA(volume, 5))', 'mas1')
# Result: 'mul(cs_rank(close), ts_ema(volume, 5))'

# Calculate factor values (normalization is enabled by default, pass normalize=False to skip)
calculator = ExpressionCalculator()
factor_values = calculator.calculate('ts_mean(close, 20)', market_data)          # auto-detect framework
factor_values_no_norm = calculator.calculate('ts_mean(close, 20)', market_data, normalize=False)
```

### 2. Parse Mining Results

```python
from factordb.parsers import parse_factor_file

# Auto-detect framework and parse (supports gpfactor and masfactorMiner, plus legacy mas1 mapping)
factors = parse_factor_file('path/to/mining_results.json')

for factor in factors:
    print(f"Expression: {factor.normalized_expression}")
    print(f"IC Mean: {factor.get_metric('ic_mean')}")
```

### 3. Standalone Factor Analysis (for out-of-DB or US-equity data)

If you just want to compute/analyze factor expressions on arbitrary data (CSV/Parquet/ClickHouse) without using the full FactorDB pipeline:

```python
from factor_analysis import FactorAnalysisConfig, FactorAnalysis, FactorAnalyzer

# Load config (YAML) defining data source/columns
cfg = FactorAnalysisConfig.from_yaml('path/to/factor_analysis.yaml')

# Calculate factor values
fa = FactorAnalysis(cfg)
values = fa.calculate('ts_mean(close, 20)')  # returns Series with MultiIndex (code, date)

# Quick single-factor analysis
report = FactorAnalyzer().analyze(values)
print(report)
```

Example `factor_analysis.yaml`:

```yaml
data_source: clickhouse       # csv | parquet | clickhouse
data_path: null               # required when data_source is csv/parquet
date_column: date
code_column: code
clickhouse:
  table: us_market.daily
  asset_type: STOCK
  factor_type: DAILY
  where: "date >= '2024-01-01'"
  limit: 200000
```

### 4. Create Custom Factors

```python
from factordb import ExpressionFactor, CustomFactor

# Method 1: Using expression
factor = ExpressionFactor(
    expression='ts_zscore(close, 20)',
    name='price_zscore',
    explanation='20-day price z-score'
)

# Method 2: Custom calculation
class MomentumFactor(CustomFactor):
    def __init__(self, window=20):
        super().__init__()
        self.window = window
        self._name = f"momentum_{window}d"

    def _compute(self, group_data):
        return group_data['close'].pct_change(self.window)
```

### 5. Factor Evaluation

```python
from factordb.evaluators import FactorEvaluator

evaluator = FactorEvaluator()
metrics = evaluator.evaluate(factor_values, returns_data)

print(f"IC Mean: {metrics['ic_mean']:.4f}")
print(f"IC IR: {metrics['ic_ir']:.4f}")
print(f"Direction: {metrics['direction']}")
```

### 6. Factor Management (with Database)

```python
from factordb import FactorManager, AssetType, FactorType

manager = FactorManager()

# Register a factor
factor_id = manager.register_from_expression(
    expression='ts_zscore(close, 20)',
    name='price_zscore',
    asset_type=AssetType.STOCK,
    factor_type=FactorType.DAILY
)

# Register factors from mining results
factor_ids = manager.register_mined_factors(
    file_path='mining_results.json',
    max_factors=100,
    min_ic=0.03
)

# Calculate and save factor values
manager.calculate_and_save(factor, market_data)

# Evaluate and save results
metrics = manager.evaluate_factor(factor_id, returns_data, save_results=True)

# Search for good factors
good_factors = manager.search_factors(min_ic=0.03, min_icir=0.5)
```

### 7. Orthogonality Analysis

Analyze factor redundancy and select non-redundant factors using a three-phase framework:

```python
from factordb.orthogonality import OrthogonalityAnalyzer

# Initialize analyzer
analyzer = OrthogonalityAnalyzer(
    correlation_threshold=0.7,   # High correlation pair threshold
    vif_threshold=5.0,           # VIF collinearity threshold
    marginal_ic_threshold=0.015  # Minimum marginal IC for selection
)

# Prepare factor matrix (wide format: rows=observations, cols=factors)
factor_matrix = analyzer.prepare_factor_matrix(factor_data)

# Run full analysis
report = analyzer.analyze(
    factor_matrix=factor_matrix,
    returns=returns_data,
    ic_values=ic_dict  # {factor_id: ic_value}
)

# Phase 1: Global Correlation Results
print(f"Total factors: {report.effective_n.total_factors}")
print(f"Effective N (90%): {report.effective_n.effective_n_90}")
print(f"Mean correlation: {report.correlation_stats.mean_correlation:.3f}")
print(f"High correlation pairs: {len(report.correlation_stats.high_correlation_pairs)}")

# Phase 2: Clustering Results
print(f"Number of clusters: {report.clustering_result.n_clusters}")
print(f"Central factors: {report.mst_result.central_factors}")
print(f"Peripheral factors (unique alpha): {report.mst_result.peripheral_factors}")

# Phase 3: Selection Results
print(f"Selected factors: {len(report.final_selected_factors)}")
print(f"Removed factors: {len(report.removed_factors)}")

# Get final non-redundant factor set
selected_factors = report.final_selected_factors
```

#### Quick Analysis (No Returns Required)

For initial exploration without return data:

```python
# Run Phase 1 & 2 only
results = analyzer.quick_analysis(factor_matrix, ic_values)

print(f"Redundancy ratio: {results['summary']['redundancy_ratio_90']:.1%}")
print(f"Suggested clusters: {analyzer.suggest_optimal_cluster_count(results['phase1']['effective_n'])}")
```

## Interactive CLI (main.py)

`main.py` provides an interactive menu for managing the full factor pipeline. It reads parameters from a YAML config file and prompts for additional inputs at runtime.

### Quick Start

```bash
# Run with default config (pipeline_config.yaml)
python main.py

# Run with custom config
python main.py -c path/to/config.yaml
```

### Interactive Menu

```
==================================================
  FactorDB Pipeline (STOCK / DAILY)
==================================================
  1. Upload factors from file
  2. Update factor values (incremental)
  3. Evaluate factors
  4. Show database status
  5. Switch asset type (STOCK <-> ETF)
  0. Exit
==================================================
Select option:
```

#### 1. Upload Factors

Parses factors from a mining results file, registers new factors, calculates values, and evaluates them. Supports incremental execution: if a previous run was interrupted, it detects which factors are already registered/calculated/evaluated and only runs the remaining steps.

Prompts:
- **File path**: Path to mining results JSON file (default from config)
- **Skip evaluation**: Whether to skip the evaluation step after upload

#### 2. Update Factor Values (Incremental)

Incrementally updates all registered factor values to the latest date. Uses the `A3_factor_upgrade` table to determine what data is already present, then only calculates and saves the new portion.

Prompts:
- **Re-evaluate**: Whether to re-evaluate factors that were updated

#### 3. Evaluate Factors

Calculates evaluation metrics (IC, ICIR, monotonicity, returns, etc.) for factors.

Prompts:
- **Re-evaluate ALL**: By default only evaluates factors missing metrics. Choose yes to re-evaluate all factors.

#### 4. Show Database Status

Displays a summary of the factor database: total registered factors, how many have values, how many have evaluation metrics, and the latest/earliest update dates.

Prompts:
- **Detailed list**: Whether to show the full factor list

#### 5. Switch Asset Type

Switch between STOCK and ETF factor databases. The current selection is shown in the menu header. Each asset type has its own separate database:
- STOCK: `stk_factors` database, factor IDs like `F_stk_000001`
- ETF: `etf_factors` database, factor IDs like `F_etf_000001`

Note: HIGH_FREQ (intraday) factors are not yet supported in the interactive CLI due to different evaluation logic.

### Configuration File (pipeline_config.yaml)

```yaml
# Factor classification
asset_type: "STOCK"          # STOCK | ETF
factor_type: "DAILY"         # DAILY | HIGH_FREQ (HIGH_FREQ not yet supported)

# Parallelism
n_jobs: 16

# Market data source
market_data:
  database: "stocks"
  price_table: "daily_adj_tushare"
  basic_table: "daily_basic_tushare"
  start_date: "2000-01-01"
  end_date: null             # null = today

# Upload command defaults
upload:
  file_path: "mined_factors_demo/stocks/new_128.json"
  max_factors: 140
  min_score: 30.0

# Update command defaults
update:
  evaluate_after_update: false

# Evaluate command defaults
evaluate:
  return_column: "pct_chg"
  cap_column: "circ_mv"
  n_jobs: 4                  # Evaluation threads (lower to avoid memory issues)
```

**Note for ETF factors:** If your ETF factors use different market data tables, either:
1. Modify the `market_data` section in `pipeline_config.yaml` before switching to ETF, or
2. Use a separate config file: `python main.py -c etf_config.yaml`

## Web Dashboard (dashboard.py)

A Streamlit-based web interface for viewing factor information.

### Quick Start

```bash
# Activate virtual environment and run (Windows)
.venv\Scripts\python -m streamlit run dashboard.py

# Or using uv (if project has pyproject.toml)
uv run streamlit run dashboard.py

# Or activate venv first, then run
.venv\Scripts\activate
streamlit run dashboard.py
```

The dashboard will open in your browser at `http://localhost:8501`.

### Features

- **Summary Statistics**: Total factors, factors with values, factors with evaluation
- **Factor List**: Sortable table with key metrics (IC, RankIC, ICIR, etc.)
- **Filtering**: Filter by minimum IC, ICIR, or monotonicity
- **Factor Detail**: Detailed view of selected factor with expression and all metrics
- **Asset Type Switch**: Toggle between STOCK and ETF databases

### Displayed Metrics

| Metric | Description |
|--------|-------------|
| IC Mean | Pearson correlation with future returns |
| Rank IC | Spearman correlation (more robust) |
| IC IR | Information ratio (IC mean / IC std) |
| Rank IC IR | Rank IC information ratio |
| Mono (10g) | Monotonicity of 10-group returns |
| Top-Bottom Return | Long-short portfolio return |
| Top-Bottom Sharpe | Long-short Sharpe ratio |

All metrics are shown for Full period, 5-year, and 1-year windows.

## Project Structure

```
FactorDBMS/
├── main.py                          # Interactive CLI entry point
├── dashboard.py                     # Streamlit web dashboard
├── pipeline_config.yaml             # Pipeline configuration
├── src/factordb/
│   ├── core/
│   │   ├── config.py          # Configuration management
│   │   ├── expression.py      # Expression processing
│   │   ├── factor.py          # Factor base classes
│   │   └── factor_manager.py  # Factor lifecycle management
│   ├── evaluators/
│   │   ├── factor_evaluator.py      # Main evaluator
│   │   ├── ic_calculator.py         # IC metrics
│   │   ├── return_calculator.py     # Return metrics
│   │   └── monotonicity_calculator.py
│   ├── orthogonality/               # Factor orthogonality analysis
│   │   ├── orthogonality_analyzer.py  # Main orchestrator
│   │   ├── correlation_analyzer.py    # Phase 1: Correlation analysis
│   │   ├── clustering_analyzer.py     # Phase 2: Clustering & MST
│   │   └── selection_analyzer.py      # Phase 3: VIF & Marginal IC
│   ├── operators/
│   │   └── unified_operators.py     # Operator registry (45+ operators)
│   ├── parsers/
│   │   ├── gpfactor_parser.py       # gpfactor results parser
│   │   ├── masfactor_miner_parser.py # masfactorMiner results parser (mas2 legacy)
│   │   └── parser_factory.py        # Auto-detection & parsing
│   └── storage/
│       ├── clickhouse_storage.py    # Database operations
│       └── schema.py                # Table schemas
├── src/factor_analysis/             # Standalone expression compute & analysis
│   ├── config.py
│   ├── calculator.py
│   └── analyzer.py
├── examples/
│   └── stock_factor_pipeline.py     # Example pipeline script
├── mined_factors_demo/              # Sample mining results
├── requirements.txt
└── README.md
```

## Supported Operators

### Unary Operators
`abs`, `sign`, `log`, `neg`, `inv`, `sqrt`, `square`, `sigmoid`, `tanh`

### Binary Operators
`add`, `sub`, `mul`, `div`, `max`, `min`, `power`

### Time-Series Operators
`ts_mean`, `ts_std`, `ts_var`, `ts_max`, `ts_min`, `ts_sum`, `ts_median`, `ts_delta`, `ts_delay`, `ts_return`, `ts_slope`, `ts_corr`, `ts_cov`, `ts_ema`, `ts_wma`, `ts_skew`, `ts_kurt`, `ts_rank`, `ts_zscore`, `ts_argmax`, `ts_argmin`, `ts_prod`, `ts_quantile`

### Cross-Sectional Operators
`cs_rank`, `cs_zscore`, `cs_demean`, `cs_scale`

### Conditional Operators
`if_else`, `greater`, `less`
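The actual implementations live in `unified_operators.py`; as a rough guide, time-series operators work per-asset over a rolling window, and cross-sectional operators work per-date across assets. A sketch of the likely semantics of two of them (assumed definitions, not the package's actual code):

```python
import pandas as pd

def ts_zscore(s: pd.Series, window: int) -> pd.Series:
    """Rolling z-score: (x - rolling mean) / rolling std over the last `window` bars."""
    roll = s.rolling(window)
    return (s - roll.mean()) / roll.std()

def cs_rank(s: pd.Series, date_level: str = "date") -> pd.Series:
    """Cross-sectional percentile rank within each date of a (date, code)-indexed Series."""
    return s.groupby(level=date_level).rank(pct=True)
```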

## Evaluation Metrics

| Metric | Description |
|--------|-------------|
| IC Mean | Pearson correlation with future returns |
| Rank IC Mean | Spearman correlation (more robust) |
| IC IR | Information ratio (IC mean / IC std) |
| IC t-stat | Statistical significance |
| Direction | Trading direction (1=long high, -1=short high) |
| isMono 5/10/15 | Monotonicity of group returns |
| Top-Bottom Return | Long-short portfolio return |
| Top-Bottom SR | Long-short Sharpe ratio |
| Factor Std/Skew/Kurt | Distribution statistics |

All metrics are calculated for full period, recent 5 years, and recent 1 year.
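For reference, the core IC numbers are conventionally computed as per-date cross-sectional correlations; the `ic_series` helper below is a hypothetical illustration, not the package's `FactorEvaluator` API:

```python
import pandas as pd

def ic_series(factor: pd.Series, fwd_returns: pd.Series, method: str = "pearson") -> pd.Series:
    """Per-date cross-sectional correlation between factor values and forward returns.

    Both inputs share a (date, code) MultiIndex; use method="spearman" for Rank IC.
    """
    df = pd.concat({"factor": factor, "ret": fwd_returns}, axis=1).dropna()
    return df.groupby(level="date").apply(lambda g: g["factor"].corr(g["ret"], method=method))

# IC Mean = ic.mean(); IC IR = ic.mean() / ic.std()
```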

## Orthogonality Analysis

The orthogonality module provides a three-phase framework to analyze factor redundancy and select non-redundant factors:

### Phase 1: Global Correlation Check

| Method | Description |
|--------|-------------|
| Correlation Matrix | Spearman rank correlation between all factor pairs |
| Effective N | PCA-based eigenvalue analysis to measure true dimensionality |

**Key Metrics:**
- `effective_n_90`: Number of factors explaining 90% of variance
- `redundancy_ratio`: 1 - (effective_n / total_factors)
- `high_correlation_pairs`: Factor pairs with |corr| > threshold
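The README doesn't spell out the Effective N computation; a common eigenvalue-based version consistent with the description above (an assumed formula, not necessarily the package's exact one):

```python
import numpy as np

def effective_n(corr: np.ndarray, variance_target: float = 0.90) -> int:
    """Smallest number of principal components explaining `variance_target` of variance."""
    eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]
    explained = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(explained, variance_target) + 1)

# Three factors, two of them near-duplicates -> effective dimensionality of 2
corr = np.array([
    [1.00, 0.95, 0.10],
    [0.95, 1.00, 0.12],
    [0.10, 0.12, 1.00],
])
n_eff = effective_n(corr)                     # 2
redundancy_ratio = 1 - n_eff / corr.shape[0]  # 1 - 2/3
```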

### Phase 2: Clustering & Structure

| Method | Description |
|--------|-------------|
| Hierarchical Clustering | Ward/Average linkage to group similar factors |
| Minimum Spanning Tree | Graph-based structure to find central/peripheral factors |

**Key Outputs:**
- `clusters`: Factor groupings with intra-cluster correlation
- `central_factors`: Proxy factors representing each style
- `peripheral_factors`: Unique alpha factors (most orthogonal)
- `representative_factors`: Best factor per cluster (by IC)

### Phase 3: Selection & Pruning

| Method | Description |
|--------|-------------|
| VIF (Variance Inflation Factor) | Detect multicollinearity within clusters |
| Marginal IC Analysis | Stepwise selection based on incremental IC contribution |

**Selection Logic:**
1. Start with highest IC factor
2. For each remaining factor, orthogonalize against selected set
3. If residual IC > threshold (default 0.015), add to selection
4. Repeat until no significant marginal contribution
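The selection loop above can be sketched with plain least squares; this is an illustrative reimplementation of the stated logic, not the package's `selection_analyzer` code:

```python
import numpy as np

def select_by_marginal_ic(factors: np.ndarray, returns: np.ndarray,
                          names: list[str], ic_threshold: float = 0.015) -> list[str]:
    """Greedy stepwise selection: keep a factor only if the IC of its residual
    (after regressing out the already-selected factors) clears the threshold."""
    def ic(x: np.ndarray, y: np.ndarray) -> float:
        return float(np.corrcoef(x, y)[0, 1])

    # Visit candidates in descending order of standalone |IC|
    order = np.argsort([-abs(ic(factors[:, j], returns)) for j in range(factors.shape[1])])
    selected: list[int] = []
    for j in order:
        if not selected:
            selected.append(j)  # start with the highest-IC factor
            continue
        X = factors[:, selected]
        # Orthogonalize the candidate against the selected set via least squares
        beta, *_ = np.linalg.lstsq(X, factors[:, j], rcond=None)
        residual = factors[:, j] - X @ beta
        if residual.std() > 1e-12 and abs(ic(residual, returns)) > ic_threshold:
            selected.append(j)
    return [names[j] for j in selected]
```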

## Database Schema

### A1_factor_basic
Factor registration and metadata.

| Column | Type | Description |
|--------|------|-------------|
| factor_id | String | Unique ID (e.g., F_stk_000001) |
| name | Nullable(String) | Factor name |
| expression | Nullable(String) | Factor expression |
| explanation | Nullable(String) | Factor explanation |
| register_datetime | DateTime | Registration time |

### A2_factor_evaluate
Factor evaluation metrics.

| Column | Type | Description |
|--------|------|-------------|
| factor_id | String | Factor ID |
| upgrade_datetime | DateTime | Evaluation time |
| ic_mean, rank_ic_mean, ... | Float32 | Evaluation metrics |

### A3_factor_upgrade
Factor value update tracking.

| Column | Type | Description |
|--------|------|-------------|
| factor_id | String | Factor ID |
| latest_date | Date | Latest factor value date |
| upgrade_datetime | DateTime | Last update time |

## License

MIT License
