Comprehensive data profiling and analysis
What does add.scan() do?
The add.scan() function profiles a DataFrame in one call: it detects distributions, analyzes correlations and cardinality, and reports data quality metrics. It's like getting a complete health check for your dataset.
Common use cases:
| Parameter | Type | Required | Description |
|---|---|---|---|
| df | DataFrame | Yes | DataFrame to analyze (pandas, polars, or cuDF) |
| preset | str or None | No | Analysis preset: "quick", "distributions", "correlations", "full", "minimal" |
| detect_distributions | bool | No | Whether to detect distributions (default: True) |
| detect_correlations | bool | No | Whether to calculate correlations (default: True) |
| detect_cardinality | bool | No | Whether to analyze cardinality (default: True) |
| correlation_threshold | float | No | Minimum correlation to report (default: 0.3) |
| verbose | bool | No | Whether to print progress messages (default: True) |
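To make the `correlation_threshold` parameter concrete, here is a minimal plain-Python sketch of threshold filtering. It is not additory's implementation: the helper names `pearson` and `correlated_pairs` are hypothetical, the toy columns are made up, and applying the threshold to the absolute value of the coefficient is an assumption.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=0.3):
    """Return (col_a, col_b, r) for every column pair with |r| >= threshold.

    Assumption: the threshold is compared against the absolute value of r,
    so strong negative correlations are reported as well.
    """
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            pairs.append((a, b, round(r, 3)))
    return pairs


# Hypothetical toy columns: y rises with x, z is close to unrelated.
columns = {
    "x": [1, 2, 3, 4, 5, 6],
    "y": [2, 4, 5, 4, 6, 7],
    "z": [2, 1, 3, 3, 1, 2],
}

reported = correlated_pairs(columns, threshold=0.3)
print(reported)
```

With the default threshold of 0.3 only the x/y pair survives. Lowering the threshold lets weaker pairs through, which is the trade-off the parameter controls.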
Scenario: You have a new dataset and want to understand it quickly.
import pandas as pd
import numpy as np
import additory as add
# Create sample sales data
np.random.seed(42)
sales_data = pd.DataFrame({
    'product_id': range(1, 1001),
    'price': np.random.normal(50, 20, 1000),
    'quantity_sold': np.random.poisson(10, 1000),
    'category': np.random.choice(['Electronics', 'Clothing', 'Books', 'Home'], 1000),
    'rating': np.random.uniform(1, 5, 1000),
# Derive revenue with assign(); a lambda is only evaluated there, not inside the constructor dict
}).assign(revenue=lambda x: x['price'] * x['quantity_sold'])
print("Sample sales data:")
print(sales_data.head())
# Quick scan with default settings
result = add.scan(sales_data, preset="quick")
# Print summary
print("Scan Results Summary:")
print(result.summary())
# Access specific results
print("\nData Quality Metrics:")
print(result.quality)
print("\nCardinality Analysis:")
print(result.cardinality)
Scan Results Summary:
=== Data Scan Results ===
Dataset Shape: (1000, 6)
Columns: product_id, price, quantity_sold, category, rating, revenue
Quality Metrics:
- Missing values: 0 (0.0%)
- Duplicate rows: 0 (0.0%)
- Data types: 4 numeric, 1 categorical, 1 other
Cardinality Analysis:
- High cardinality: product_id (1000 unique)
- Medium cardinality: revenue (995 unique)
- Low cardinality: category (4 unique)
Data Quality Metrics:
shape: (6, 6)
┌───────────────┬─────────┬────────────┬──────────────┬──────────────┬─────────────┐
│ column        │ dtype   │ null_count │ null_percent │ unique_count │ data_type   │
│ ---           │ ---     │ ---        │ ---          │ ---          │ ---         │
│ str           │ str     │ i64        │ f64          │ i64          │ str         │
╞═══════════════╪═════════╪════════════╪══════════════╪══════════════╪═════════════╡
│ product_id    │ int64   │ 0          │ 0.0          │ 1000         │ numeric     │
│ price         │ float64 │ 0          │ 0.0          │ 1000         │ numeric     │
│ quantity_sold │ int64   │ 0          │ 0.0          │ 21           │ numeric     │
│ category      │ object  │ 0          │ 0.0          │ 4            │ categorical │
│ rating        │ float64 │ 0          │ 0.0          │ 1000         │ numeric     │
│ revenue       │ float64 │ 0          │ 0.0          │ 995          │ numeric     │
└───────────────┴─────────┴────────────┴──────────────┴──────────────┴─────────────┘
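The null_count, null_percent, and unique_count columns in the quality table can be illustrated with a small plain-Python sketch. This is only an illustration of what the metrics mean, not additory's code: `profile_column` and the toy table are hypothetical, missing values are assumed to be `None`, and excluding them from the unique count is an assumption.

```python
def profile_column(values):
    """Compute null_count, null_percent and unique_count for one column.

    Assumptions: missing entries are represented as None and are not
    counted as a distinct value.
    """
    total = len(values)
    null_count = sum(1 for v in values if v is None)
    present = [v for v in values if v is not None]
    return {
        "null_count": null_count,
        "null_percent": round(100.0 * null_count / total, 1) if total else 0.0,
        "unique_count": len(set(present)),
    }


# Hypothetical toy table stored as column name -> list of values.
table = {
    "category": ["Books", "Home", None, "Books", "Clothing"],
    "quantity_sold": [10, 12, 10, None, None],
}

profile = {name: profile_column(values) for name, values in table.items()}
for name, metrics in profile.items():
    print(name, metrics)
```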
Scenario: You want a comprehensive analysis with custom correlation thresholds and distribution detection.
import pandas as pd
import numpy as np
import additory as add
# Create customer behavior data with correlations
np.random.seed(42)
n_customers = 500
# Create correlated data
age = np.random.normal(40, 15, n_customers)
income = age * 1000 + np.random.normal(20000, 10000, n_customers) # Correlated with age
spending = income * 0.3 + np.random.normal(0, 5000, n_customers) # Correlated with income
customer_data = pd.DataFrame({
    'customer_id': range(1, n_customers + 1),
    'age': np.clip(age, 18, 80),
    'income': np.clip(income, 20000, 200000),
    'annual_spending': np.clip(spending, 1000, 60000),
    'loyalty_years': np.random.exponential(3, n_customers),
    'satisfaction_score': np.random.uniform(1, 10, n_customers),
    'region': np.random.choice(['North', 'South', 'East', 'West'], n_customers),
    'subscription_type': np.random.choice(['Basic', 'Premium', 'Enterprise'], n_customers, p=[0.5, 0.3, 0.2])
})
print("Customer behavior data:")
print(customer_data.head())
# Comprehensive scan with custom settings
result = add.scan(
    customer_data,
    preset="full",
    correlation_threshold=0.2,  # Lower threshold to catch more correlations
    top_n_distributions=5,  # More distribution candidates
    correlation_methods=['pearson', 'spearman'],
    cardinality_top_n=15,  # More top values per column
    verbose=True
)
print("=== COMPREHENSIVE ANALYSIS ===")
print(result.summary())
print("\n=== CORRELATIONS FOUND ===")
if hasattr(result, 'correlations') and result.correlations is not None:
    print(result.correlations)
print("\n=== DISTRIBUTION ANALYSIS ===")
if hasattr(result, 'distributions') and result.distributions is not None:
    for column, dist_info in result.distributions.items():
        if dist_info:
            print(f"\n{column}:")
            for dist in dist_info[:3]:  # Show top 3 distributions
                print(f" - {dist['distribution']}: score {dist['score']:.3f}")
=== COMPREHENSIVE ANALYSIS ===
=== Data Scan Results ===
Dataset Shape: (500, 8)
Columns: customer_id, age, income, annual_spending, loyalty_years, satisfaction_score, region, subscription_type
Quality Metrics:
- Missing values: 0 (0.0%)
- Duplicate rows: 0 (0.0%)
- Data types: 6 numeric, 2 categorical
Distribution Analysis:
- age: normal (score: 0.892)
- income: normal (score: 0.845)
- annual_spending: normal (score: 0.823)
- loyalty_years: exponential (score: 0.756)
Correlation Analysis:
- Strong correlations found: 3
- age ↔ income: 0.847 (pearson)
- income ↔ annual_spending: 0.923 (pearson)
- age ↔ annual_spending: 0.782 (pearson)
=== CORRELATIONS FOUND ===
shape: (3, 4)
┌─────────┬─────────────────┬─────────────┬─────────┐
│ column1 │ column2         │ correlation │ method  │
│ ---     │ ---             │ ---         │ ---     │
│ str     │ str             │ f64         │ str     │
╞═════════╪═════════════════╪═════════════╪═════════╡
│ age     │ income          │ 0.847       │ pearson │
│ income  │ annual_spending │ 0.923       │ pearson │
│ age     │ annual_spending │ 0.782       │ pearson │
└─────────┴─────────────────┴─────────────┴─────────┘
=== DISTRIBUTION ANALYSIS ===
age:
- normal: score 0.892
- uniform: score 0.234
- exponential: score 0.123
income:
- normal: score 0.845
- lognormal: score 0.567
- uniform: score 0.189
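The scores above rank candidate distributions per column. How additory computes them is not documented here, so the following is only one plausible sketch: it fits a normal distribution to a sample and scores the fit as 1 minus the Kolmogorov-Smirnov distance between the empirical and fitted CDFs. The function names and the score definition are assumptions, and the numbers it produces are not comparable to the output shown.

```python
import random
from statistics import NormalDist, fmean, stdev


def ks_statistic(sample, cdf):
    """Largest gap between the sample's empirical CDF and a candidate CDF."""
    xs = sorted(sample)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        gap = max(gap, abs(f - i / n), abs((i + 1) / n - f))
    return gap


def normal_fit_score(sample):
    """Hypothetical score in [0, 1]: 1 minus the KS distance to a fitted normal."""
    fitted = NormalDist(fmean(sample), stdev(sample))
    return 1.0 - ks_statistic(sample, fitted.cdf)


rng = random.Random(42)
normal_like = [rng.gauss(40, 15) for _ in range(500)]   # shaped like the age column
skewed = [rng.expovariate(1 / 3) for _ in range(500)]   # shaped like loyalty_years

normal_score = normal_fit_score(normal_like)
skewed_score = normal_fit_score(skewed)
print(f"normal-like sample: {normal_score:.3f}")
print(f"skewed sample:      {skewed_score:.3f}")
```

The normal-like sample scores higher than the skewed one, which mirrors how a normal candidate would rank first for age but not for an exponentially distributed column.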
# Quick scan (default)
result = add.scan(df)
# Use presets
result = add.scan(df, preset="quick")
result = add.scan(df, preset="full")
# Custom settings
result = add.scan(df, correlation_threshold=0.5, detect_distributions=False)
# Access results
print(result.summary())
print(result.quality)
print(result.correlations)
print(result.distributions)
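When a detection step is switched off, the matching result attribute may be empty. The earlier example guards against this with `hasattr` and an `is not None` check; the sketch below wraps that pattern in a small helper. `available_sections` is a hypothetical name, the stand-in object only mimics a scan result, and whether additory leaves a disabled section as `None` or omits it is an assumption, which is why both cases are handled.

```python
from types import SimpleNamespace


def available_sections(result, names=("quality", "cardinality", "correlations", "distributions")):
    """Return {name: value} for every attribute that exists and is not None."""
    found = {}
    for name in names:
        value = getattr(result, name, None)  # covers a missing attribute
        if value is not None:                # covers an attribute set to None
            found[name] = value
    return found


# Hypothetical stand-in for a scan result with distribution detection disabled
# and no cardinality attribute at all.
fake_result = SimpleNamespace(
    quality={"rows": 1000, "missing": 0},
    correlations=[("age", "income", 0.847)],
    distributions=None,
)

sections = available_sections(fake_result)
for name, value in sections.items():
    print(f"{name}: {value}")
```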