add.scan()

Comprehensive data profiling and analysis

What does add.scan() do?

The add.scan() function provides comprehensive data profiling including distribution detection, correlation analysis, cardinality analysis, and data quality metrics. It's like getting a complete health check for your dataset.

Common use cases:

📖 Parameters

Parameter              Type          Required  Description
df                     DataFrame     ✅ Yes    DataFrame to analyze (pandas, polars, or cuDF)
preset                 str or None   ❌ No     Analysis preset: "quick", "distributions", "correlations", "full", "minimal"
detect_distributions   bool          ❌ No     Whether to detect distributions (default: True)
detect_correlations    bool          ❌ No     Whether to calculate correlations (default: True)
detect_cardinality     bool          ❌ No     Whether to analyze cardinality (default: True)
correlation_threshold  float         ❌ No     Minimum correlation to report (default: 0.3)
verbose                bool          ❌ No     Whether to print progress messages (default: True)
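
To make correlation_threshold concrete, the sketch below filters pairwise Pearson correlations with plain pandas. It is not additory's implementation: the helper name pairs_above_threshold is hypothetical, and comparing the absolute value of the coefficient against the threshold is an assumption about how "minimum correlation" is interpreted.

```python
import numpy as np
import pandas as pd

def pairs_above_threshold(df, threshold=0.3, method="pearson"):
    """Return (column1, column2, correlation) rows with |r| >= threshold.

    Hypothetical helper for illustration; not part of additory.
    """
    corr = df.select_dtypes(include="number").corr(method=method)
    cols = corr.columns
    rows = []
    # Walk the upper triangle so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                rows.append((cols[i], cols[j], float(r)))
    return pd.DataFrame(rows, columns=["column1", "column2", "correlation"])

rng = np.random.default_rng(0)
x = rng.normal(size=500)
demo = pd.DataFrame({
    "x": x,
    "y": 2 * x + rng.normal(scale=0.5, size=500),  # strongly tied to x
    "noise": rng.normal(size=500),                 # independent of both
})
strong_pairs = pairs_above_threshold(demo, threshold=0.3)
print(strong_pairs)
```

Raising the threshold shortens the report; lowering it surfaces weaker relationships at the cost of more noise.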

🚀 Example 1: Quick Scan (Simplest)

Scenario: You have a new dataset and want to understand it quickly.

Setup: Create sample sales data
import pandas as pd
import numpy as np
import additory as add

# Create sample sales data
np.random.seed(42)
sales_data = pd.DataFrame({
    'product_id': range(1, 1001),
    'price': np.random.normal(50, 20, 1000),
    'quantity_sold': np.random.poisson(10, 1000),
    'category': np.random.choice(['Electronics', 'Clothing', 'Books', 'Home'], 1000),
    'rating': np.random.uniform(1, 5, 1000),
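}).assign(
    # A lambda inside the DataFrame constructor would be stored as an object,
    # not evaluated; assign() calls it with the new DataFrame instead.
    revenue=lambda x: x['price'] * x['quantity_sold']
)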

print("Sample sales data:")
print(sales_data.head())
Quick scan of the data
# Quick scan using the "quick" preset
result = add.scan(sales_data, preset="quick")

# Print summary
print("Scan Results Summary:")
print(result.summary())

# Access specific results
print("\nData Quality Metrics:")
print(result.quality)

print("\nCardinality Analysis:")
print(result.cardinality)
Output
Scan Results Summary:
=== Data Scan Results ===
Dataset Shape: (1000, 6)
Columns: product_id, price, quantity_sold, category, rating, revenue

Quality Metrics:
- Missing values: 0 (0.0%)
- Duplicate rows: 0 (0.0%)
- Data types: 5 numeric, 1 categorical

Cardinality Analysis:
- High cardinality: product_id (1000 unique)
- Medium cardinality: revenue (995 unique)
- Low cardinality: category (4 unique)

Data Quality Metrics:
shape: (6, 6)
┌───────────────┬─────────┬────────────┬──────────────┬──────────────┬─────────────┐
│ column        ┆ dtype   ┆ null_count ┆ null_percent ┆ unique_count ┆ data_type   │
│ ---           ┆ ---     ┆ ---        ┆ ---          ┆ ---          ┆ ---         │
│ str           ┆ str     ┆ i64        ┆ f64          ┆ i64          ┆ str         │
╞═══════════════╪═════════╪════════════╪══════════════╪══════════════╪═════════════╡
│ product_id    ┆ int64   ┆ 0          ┆ 0.0          ┆ 1000         ┆ numeric     │
│ price         ┆ float64 ┆ 0          ┆ 0.0          ┆ 1000         ┆ numeric     │
│ quantity_sold ┆ int64   ┆ 0          ┆ 0.0          ┆ 21           ┆ numeric     │
│ category      ┆ object  ┆ 0          ┆ 0.0          ┆ 4            ┆ categorical │
│ rating        ┆ float64 ┆ 0          ┆ 0.0          ┆ 1000         ┆ numeric     │
│ revenue       ┆ float64 ┆ 0          ┆ 0.0          ┆ 995          ┆ numeric     │
└───────────────┴─────────┴────────────┴──────────────┴──────────────┴─────────────┘

🔍 Example 2: Full Analysis with Custom Settings

Scenario: You want a comprehensive analysis with custom correlation thresholds and distribution detection.

Setup: Customer behavior data
import pandas as pd
import numpy as np
import additory as add

# Create customer behavior data with correlations
np.random.seed(42)
n_customers = 500

# Create correlated data
age = np.random.normal(40, 15, n_customers)
income = age * 1000 + np.random.normal(20000, 10000, n_customers)  # Correlated with age
spending = income * 0.3 + np.random.normal(0, 5000, n_customers)   # Correlated with income

customer_data = pd.DataFrame({
    'customer_id': range(1, n_customers + 1),
    'age': np.clip(age, 18, 80),
    'income': np.clip(income, 20000, 200000),
    'annual_spending': np.clip(spending, 1000, 60000),
    'loyalty_years': np.random.exponential(3, n_customers),
    'satisfaction_score': np.random.uniform(1, 10, n_customers),
    'region': np.random.choice(['North', 'South', 'East', 'West'], n_customers),
    'subscription_type': np.random.choice(['Basic', 'Premium', 'Enterprise'], n_customers, p=[0.5, 0.3, 0.2])
})

print("Customer behavior data:")
print(customer_data.head())
Full analysis with custom settings
# Comprehensive scan with custom settings
result = add.scan(
    customer_data,
    preset="full",
    correlation_threshold=0.2,  # Lower threshold to catch more correlations
    top_n_distributions=5,      # More distribution candidates
    correlation_methods=['pearson', 'spearman'],
    cardinality_top_n=15,       # More top values per column
    verbose=True
)

print("=== COMPREHENSIVE ANALYSIS ===")
print(result.summary())

print("\n=== CORRELATIONS FOUND ===")
if hasattr(result, 'correlations') and result.correlations is not None:
    print(result.correlations)

print("\n=== DISTRIBUTION ANALYSIS ===")
if hasattr(result, 'distributions') and result.distributions is not None:
    for column, dist_info in result.distributions.items():
        if dist_info:
            print(f"\n{column}:")
            for dist in dist_info[:3]:  # Show top 3 distributions
                print(f"  - {dist['distribution']}: score {dist['score']:.3f}")
Output
=== COMPREHENSIVE ANALYSIS ===
=== Data Scan Results ===
Dataset Shape: (500, 8)
Columns: customer_id, age, income, annual_spending, loyalty_years, satisfaction_score, region, subscription_type

Quality Metrics:
- Missing values: 0 (0.0%)
- Duplicate rows: 0 (0.0%)
- Data types: 6 numeric, 2 categorical

Distribution Analysis:
- age: normal (score: 0.892)
- income: normal (score: 0.845)
- annual_spending: normal (score: 0.823)
- loyalty_years: exponential (score: 0.756)

Correlation Analysis:
- Strong correlations found: 3
- age ↔ income: 0.847 (pearson)
- income ↔ annual_spending: 0.923 (pearson)
- age ↔ annual_spending: 0.782 (pearson)

=== CORRELATIONS FOUND ===
shape: (3, 4)
┌─────────┬─────────────────┬─────────────┬─────────┐
│ column1 ┆ column2         ┆ correlation ┆ method  │
│ ---     ┆ ---             ┆ ---         ┆ ---     │
│ str     ┆ str             ┆ f64         ┆ str     │
╞═════════╪═════════════════╪═════════════╪═════════╡
│ age     ┆ income          ┆ 0.847       ┆ pearson │
│ income  ┆ annual_spending ┆ 0.923       ┆ pearson │
│ age     ┆ annual_spending ┆ 0.782       ┆ pearson │
└─────────┴─────────────────┴─────────────┴─────────┘

=== DISTRIBUTION ANALYSIS ===

age:
  - normal: score 0.892
  - uniform: score 0.234
  - exponential: score 0.123

income:
  - normal: score 0.845
  - lognormal: score 0.567
  - uniform: score 0.189
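
How additory computes the distribution scores shown in this output is not documented on this page. As one possible intuition, the sketch below ranks three candidate distributions by 1 - D, where D is the Kolmogorov-Smirnov distance between the sample and a candidate fitted from simple sample statistics. The function name rank_distributions and the scoring formula are assumptions for illustration, so its numbers will not match the output above.

```python
import math
import numpy as np

def rank_distributions(sample):
    """Rank candidate distributions by 1 - KS distance (hypothetical scoring)."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    mean, std = x.mean(), x.std(ddof=1)
    lo, hi = x[0], x[-1]

    # Candidate CDFs evaluated at the sorted sample points, each fitted
    # from simple sample statistics (assumes a non-constant sample).
    fitted_cdfs = {
        "normal": np.array(
            [0.5 * (1 + math.erf((v - mean) / (std * math.sqrt(2)))) for v in x]
        ),
        "uniform": (x - lo) / (hi - lo),
        "exponential": 1 - np.exp(-(x - lo) / (mean - lo)),
    }

    # Empirical CDF just after and just before each sorted point.
    ecdf_after = np.arange(1, n + 1) / n
    ecdf_before = np.arange(0, n) / n

    results = []
    for name, cdf in fitted_cdfs.items():
        d = max(np.max(ecdf_after - cdf), np.max(cdf - ecdf_before))
        results.append({"distribution": name, "score": float(1 - d)})
    return sorted(results, key=lambda r: r["score"], reverse=True)

rng = np.random.default_rng(42)
normal_ranking = rank_distributions(rng.normal(40, 15, 2000))
exponential_ranking = rank_distributions(rng.exponential(3, 2000))

for entry in normal_ranking:
    print(f"{entry['distribution']}: score {entry['score']:.3f}")
```

The returned entries use the same 'distribution' and 'score' keys that the Example 2 loop reads, so the ranking can be printed the same way.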

🎛️ Analysis Presets

quick: Quality + cardinality only (fast)
distributions: Distribution detection only
correlations: Correlation analysis only
full: All analyses enabled (most comprehensive)
minimal: Quality metrics only (fastest)
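
Read against the detect_* parameters, each preset looks like a shorthand for a combination of those flags. The mapping below is inferred from the preset descriptions, not taken from additory's source. PRESET_FLAGS and resolve_flags are hypothetical names, and letting explicit keyword arguments override a preset is an assumption.

```python
# Hypothetical mapping inferred from the preset descriptions. It covers only
# the three detect_* flags; quality metrics are not modeled here.
PRESET_FLAGS = {
    "minimal": {
        "detect_distributions": False,
        "detect_correlations": False,
        "detect_cardinality": False,
    },
    "quick": {
        "detect_distributions": False,
        "detect_correlations": False,
        "detect_cardinality": True,
    },
    "distributions": {
        "detect_distributions": True,
        "detect_correlations": False,
        "detect_cardinality": False,
    },
    "correlations": {
        "detect_distributions": False,
        "detect_correlations": True,
        "detect_cardinality": False,
    },
    "full": {
        "detect_distributions": True,
        "detect_correlations": True,
        "detect_cardinality": True,
    },
}

# Documented defaults when no preset is given: every analysis enabled.
DEFAULT_FLAGS = dict(PRESET_FLAGS["full"])

def resolve_flags(preset=None, **overrides):
    """Combine a preset with explicit overrides (assumed precedence)."""
    if preset is not None and preset not in PRESET_FLAGS:
        raise ValueError(f"Unknown preset: {preset!r}")
    flags = dict(PRESET_FLAGS[preset]) if preset is not None else dict(DEFAULT_FLAGS)
    unknown = set(overrides) - set(flags)
    if unknown:
        raise TypeError(f"Unknown flags: {sorted(unknown)}")
    flags.update(overrides)
    return flags

print(resolve_flags("quick"))
print(resolve_flags("full", detect_correlations=False))
```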

📊 What You Get

Quality Metrics: Missing values, duplicates, data types
Distribution Detection: Identifies normal, exponential, uniform, and other distributions
Correlation Analysis: Pearson and Spearman correlations between numeric columns
Cardinality Analysis: Unique value counts and top values per column
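
For orientation, the quality and cardinality figures listed above correspond to quantities that are easy to compute by hand. The pandas sketch below is a rough equivalent, not additory's code; basic_profile is a hypothetical helper and its output layout is an assumption.

```python
import pandas as pd

def basic_profile(df, top_n=3):
    """Null counts, unique counts, duplicates, and top values (hypothetical helper)."""
    per_column = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "null_count": df.isna().sum(),
        "null_percent": (df.isna().mean() * 100).round(2),
        "unique_count": df.nunique(),  # nulls are not counted as a value
    })
    # Most frequent values per column, nulls excluded.
    top_values = {
        col: df[col].value_counts().head(top_n).to_dict()
        for col in df.columns
    }
    return {
        "duplicate_rows": int(df.duplicated().sum()),
        "per_column": per_column,
        "top_values": top_values,
    }

orders = pd.DataFrame({
    "category": ["Books", "Home", "Books", None, "Books"],
    "price": [10.0, 25.5, 10.0, 7.25, 10.0],
})
profile = basic_profile(orders)
print(profile["per_column"])
print("Duplicate rows:", profile["duplicate_rows"])
print("Top categories:", profile["top_values"]["category"])
```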

⚠️ Important Notes

DataFrame Support: Works with pandas, polars, and cuDF DataFrames.
Performance: On large datasets, use a narrower preset such as "quick" or "minimal" to skip analyses you do not need.
Result Object: Returns a ScanResult object with structured access to all findings.
Memory Efficient: Automatically converts to Polars for processing large datasets.

🎯 Quick Reference

Basic syntax templates
# Scan with default settings
result = add.scan(df)

# Use presets
result = add.scan(df, preset="quick")
result = add.scan(df, preset="full")

# Custom settings
result = add.scan(df, correlation_threshold=0.5, detect_distributions=False)

# Access results
print(result.summary())
print(result.quality)
print(result.correlations)
print(result.distributions)