Metadata-Version: 2.4
Name: sparkwise
Version: 0.1.0
Summary: The automated technical fellow for your Fabric Spark workloads - intelligent configuration analysis and optimization recommendations
Author-email: Santhosh Ravindran <santhoshravindran7@users.noreply.github.com>
License: MIT
Project-URL: Homepage, https://github.com/santhoshravindran7/sparkwise
Project-URL: Repository, https://github.com/santhoshravindran7/sparkwise
Project-URL: Issues, https://github.com/santhoshravindran7/sparkwise/issues
Keywords: spark,fabric,microsoft-fabric,optimization,pyspark,delta,configuration
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pyspark>=3.3.0
Requires-Dist: rich>=10.0.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: typing-extensions>=4.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Dynamic: license-file

# 🔥 sparkwise

> **The automated technical fellow for your Fabric Spark workloads**

[![Python Version](https://img.shields.io/badge/python-3.8%2B-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![PyPI version](https://badge.fury.io/py/sparkwise.svg)](https://badge.fury.io/py/sparkwise)

sparkwise is an intelligent configuration advisor for Apache Spark on Microsoft Fabric. It analyzes your Spark workloads, detects performance issues, and provides actionable optimization recommendations, so you don't have to sift through thousands of configuration options yourself.

## 🎯 Why sparkwise?

As a Spark developer on Microsoft Fabric, you face:
- **Millions of configuration combinations** - impossible to know which ones matter for your workload
- **Runtime mysteries** - jobs fail or run slowly with cryptic error messages
- **Hidden optimizations** - missing out on Native Execution Engine, V-Order, or proper pooling strategies
- **Immutable config traps** - accidentally setting configs that force 3-5 minute cold starts
- **Documentation overload** - OSS Spark, Delta Lake, and Fabric-specific configs scattered everywhere

**sparkwise solves this** by acting as your personal Spark expert that:
- ✅ Analyzes your current session and detects misconfigurations
- ✅ **Warns when you accidentally force Custom Pool usage** (save 3-5 minutes per run!)
- ✅ Explains errors in plain English with remediation steps
- ✅ Recommends configuration tweaks based on your workload characteristics
- ✅ Provides an interactive Q&A interface for 100+ Spark/Delta/Fabric configurations

## 🚀 Quick Start

### Installation

```bash
pip install sparkwise
```

### Usage in Fabric Notebook

```python
from sparkwise import diagnose

# Run comprehensive analysis after your Spark job
diagnose.analyze_last_run()
```

**Output:**
```
🚀 Running sparkwise Analysis...

🔎 --- Native Execution Engine ---
✅ Native Engine ACTIVE: Your query is fully vectorized (Velox detected)

🏊 --- Pooling Strategy ---
🔴 CRITICAL: Session-Immutable Configs Detected
======================================================================
The following configs FORCE Custom Pool usage (3-5min cold-start):

   • spark.executor.memory = 8g
   • spark.dynamicAllocation.maxExecutors = 20

💡 Impact:
   ❌ Cannot use Starter Pool (instant startup)
   ❌ Forced to Custom Pool (3-5 minute cold-start)
   ❌ Additional capacity consumption

✅ Solution:
   1. Remove these spark.conf.set() calls from your notebook
   2. Use Starter Pool defaults (auto-configured by Fabric)
   3. Only set these if you truly need Custom Pool
======================================================================

💾 --- Storage & Delta Optimizations ---
⚠️ Performance Miss: V-Order is DISABLED
   💡 Set 'spark.sql.parquet.vorder.enabled=true' for 3x faster Power BI reads

⚙️ --- Runtime Tuning ---
✅ Adaptive Query Execution (AQE) is Active
✅ Optimal partition sizing for your workload

📊 --- Data Skew Detection ---
⚠️ Data Skew Detected: One task took 145s while median was 32s
   💡 Consider salting your join keys or repartitioning

Done. 🎄 Happy Optimizing!
```

### Interactive Configuration Assistant

```python
from sparkwise import ask

# Ask about any configuration
ask.config("spark.sql.shuffle.partitions")

# Search across 100+ documented configs
ask.search("partition")
```

**Knowledge Base: 100+ Configurations**
- 55+ Core Spark configurations (shuffle, memory, AQE, serialization, etc.)
- 17 Delta Lake configurations (V-Order, deletion vectors, OPTIMIZE, VACUUM, etc.)
- 12 Fabric-specific configurations (Native Engine, Starter Pools, OneLake, etc.)
- **Critical:** Session-immutable configs that force Custom Pool usage

**Output:**
```
📚 spark.sql.shuffle.partitions

Default: 200
Scope: Session-level, can be changed at runtime

What it does:
Controls the number of partitions created during shuffle operations 
(joins, aggregations, etc.). The default 200 is optimized for small 
clusters but may be suboptimal for large-scale workloads.

Recommendations for your workload:
- Small data (<10GB): 50-100 partitions
- Medium data (10-100GB): 200-500 partitions  
- Large data (>100GB): 1000-2000 partitions
- Formula: num_executors * executor_cores * 2-3

Fabric-specific notes:
On Starter Pools with Native Execution, start with 100-200 and let
AQE (Adaptive Query Execution) handle dynamic coalescing.

Related configs:
- spark.sql.adaptive.coalescePartitions.enabled
- spark.sql.files.maxPartitionBytes
```
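
The sizing formula in the sample output above (`num_executors * executor_cores * 2-3`) can be worked through by hand. The sketch below is illustrative only: `suggest_shuffle_partitions` is a hypothetical helper, not part of the sparkwise API, and the executor counts are made-up inputs.

```python
from typing import Tuple


def suggest_shuffle_partitions(num_executors: int, executor_cores: int) -> Tuple[int, int]:
    """Return the (low, high) partition range from the 2x-3x rule of thumb."""
    if num_executors < 1 or executor_cores < 1:
        raise ValueError("num_executors and executor_cores must be positive")
    total_cores = num_executors * executor_cores
    return total_cores * 2, total_cores * 3


# Hypothetical cluster: 4 executors with 8 cores each
low, high = suggest_shuffle_partitions(num_executors=4, executor_cores=8)
print(f"Suggested spark.sql.shuffle.partitions: {low}-{high}")  # 64-96
```

Treat the result as a starting point; with AQE enabled, partition coalescing may adjust the effective count at runtime.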

### Error Diagnosis

```python
from sparkwise import diagnose

# When you hit an error
diagnose.explain_error("org.apache.spark.shuffle.FetchFailedException")
```

## 🎯 Key Features

### 1. **Native Execution Engine Verification**
Checks whether your query actually runs on Fabric's Velox-based Native Execution Engine or silently falls back to slower row-based processing, for example because of UDFs.
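
One way to picture this kind of check is a scan of the physical plan text for operator names. The sketch below is not sparkwise's implementation. The marker patterns (operators ending in `Transformer`, `VeloxColumnarToRow...`, `NativeFileScan...`) and the sample plan are assumptions for illustration, and `summarize_plan` is a hypothetical helper.

```python
import re
from typing import Dict

# Assumption: operator-name markers that suggest native (Velox) execution.
NATIVE_MARKERS = re.compile(r"\w*Transformer\b|VeloxColumnarToRow\w*|NativeFileScan\w*")
# Python UDF operators as they appear in OSS Spark physical plans.
PYTHON_UDF_MARKERS = re.compile(r"\b(?:BatchEvalPython|ArrowEvalPython)\b")


def summarize_plan(plan_text: str) -> Dict[str, object]:
    """List native-looking and Python UDF operators found in a plan string."""
    native = sorted(set(NATIVE_MARKERS.findall(plan_text)))
    udf_ops = sorted(set(PYTHON_UDF_MARKERS.findall(plan_text)))
    return {
        "native_operators": native,
        "python_udf_operators": udf_ops,
        # Heuristic only: native operators present and no Python UDF operators.
        "looks_native": bool(native) and not udf_ops,
    }


# Hypothetical plan text, shortened for illustration
sample_plan = """
VeloxColumnarToRowExec
+- ProjectExecTransformer [category#1, sales#2]
   +- BatchEvalPython [my_udf(sales#2)]
"""
print(summarize_plan(sample_plan))
```

The helper takes the plan as a plain string, such as the text printed by `df.explain()`.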

### 2. **Intelligent Pooling Advisor**
Detects if you're wasting 3-5 minutes spinning up Custom Pools for jobs that could run on Starter Pools.
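
The idea can be sketched as a lookup of user-set keys against a list of session-immutable configs. This is a minimal illustration, not sparkwise's code: the key set below contains only the two configs shown in the sample output earlier and is likely incomplete, and `find_pool_forcing_configs` is a hypothetical helper.

```python
from typing import Dict, List

# Assumption: only the two keys from the sample output; the real list may be longer.
SESSION_IMMUTABLE_KEYS = {
    "spark.executor.memory",
    "spark.dynamicAllocation.maxExecutors",
}


def find_pool_forcing_configs(user_conf: Dict[str, str]) -> List[str]:
    """Return the user-set keys that would force Custom Pool usage, sorted by name."""
    return sorted(key for key in user_conf if key in SESSION_IMMUTABLE_KEYS)


# Hypothetical settings made in a notebook
conf = {"spark.executor.memory": "8g", "spark.sql.shuffle.partitions": "200"}
for key in find_pool_forcing_configs(conf):
    print(f"{key} = {conf[key]} forces Custom Pool usage")
```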

### 3. **Data Skew Detection**
Identifies when one task takes at least 2x as long as the median task, which indicates a skewed data distribution.
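
The max-versus-median comparison can be sketched in a few lines. This is an illustration under stated assumptions, not sparkwise's implementation: `skew_ratio` is a hypothetical helper, and the durations are made up to match the sample output (one task at 145s, median 32s).

```python
from statistics import median
from typing import Sequence


def skew_ratio(task_durations_s: Sequence[float]) -> float:
    """Return the slowest task duration divided by the median task duration."""
    if not task_durations_s:
        raise ValueError("need at least one task duration")
    med = median(task_durations_s)
    if med <= 0:
        raise ValueError("median task duration must be positive")
    return max(task_durations_s) / med


# Hypothetical task durations in seconds
durations = [30.0, 31.0, 32.0, 33.0, 145.0]
ratio = skew_ratio(durations)
verdict = "skewed" if ratio >= 2.0 else "balanced"
print(f"max/median = {ratio:.2f} -> {verdict}")  # max/median = 4.53 -> skewed
```

Comparing against the median rather than the mean keeps a single very slow task from inflating the baseline it is measured against.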

### 4. **Delta & Storage Optimizations**
- V-Order enablement for Power BI/Direct Lake performance
- Deletion Vectors for efficient MERGE operations
- Optimize Write for small file prevention

### 5. **Runtime Tuning Recommendations**
- AQE configuration validation
- Partition sizing analysis
- Scheduler mode recommendations
- Driver vs Executor balance checks

### 6. **Interactive Documentation**
Ask questions about any Spark, Delta, or Fabric configuration and get clear, context-aware explanations.

## 📋 Core Analysis Modules

| Module | What It Checks | Key Metrics |
|--------|---------------|-------------|
| **Native Compliance** | Velox engine usage | Physical plan analysis, fallback detection |
| **Pooling Efficiency** | Starter vs Custom Pool | Node count, startup overhead |
| **Skew Detection** | Task duration variance | Max vs Median task time |
| **Delta Hygiene** | V-Order, Deletion Vectors | Storage format, merge performance |
| **Runtime Tuning** | AQE, partitioning, scheduler | Partition sizes, parallelism |
| **Resource Profile** | Driver/Executor balance | Memory allocation, OOM risks |

## 🛠️ Advanced Usage

### Analyze with DataFrame Context

```python
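from pyspark.sql import functions as F
from sparkwise import diagnose

# Provide a DataFrame for deep plan analysis
# (`spark` is the session that Fabric notebooks predefine)
df = spark.read.parquet("/lakehouse/data/large_table")
# F.sum is Spark's aggregate; Python's built-in sum() would fail here
result = df.groupBy("category").agg(F.sum("sales"))

diagnose.analyze(result)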
```

### Get Configuration Report

```python
from sparkwise import config_report

# Get detailed report of current vs recommended configurations
report = config_report.generate()
print(report.to_markdown())
```
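
To make the "current vs recommended" idea concrete, here is a minimal sketch of such a comparison rendered as a Markdown table. It does not reproduce sparkwise's report format: `render_config_diff` is a hypothetical helper, and the recommended values are placeholders chosen for illustration.

```python
from typing import Mapping


def render_config_diff(current: Mapping[str, str], recommended: Mapping[str, str]) -> str:
    """Render a Markdown table of keys whose current value differs from the recommendation."""
    rows = ["| Config | Current | Recommended |", "|--------|---------|-------------|"]
    for key in sorted(recommended):
        value = current.get(key, "(unset)")
        if value != recommended[key]:
            rows.append(f"| `{key}` | {value} | {recommended[key]} |")
    return "\n".join(rows)


# Hypothetical current settings and placeholder recommendations
current = {"spark.sql.parquet.vorder.enabled": "false", "spark.sql.adaptive.enabled": "true"}
recommended = {"spark.sql.parquet.vorder.enabled": "true", "spark.sql.adaptive.enabled": "true"}
print(render_config_diff(current, recommended))
```

Only keys that differ are listed, so a session that already matches the recommendations yields a header-only table.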

### Export Recommendations

```python
from sparkwise import diagnose

# Save recommendations to a JSON file
diagnose.analyze_last_run(export_path="/lakehouse/reports/optimization_report.json")
```

## 🏗️ Architecture

```
sparkwise/
├── core/
│   ├── advisor.py          # Main diagnostic engine
│   ├── native_check.py     # Velox/Native execution verification
│   ├── pool_check.py       # Pooling strategy analysis
│   ├── skew_check.py       # Data skew detection
│   ├── delta_check.py      # Delta/Storage optimizations
│   └── runtime_check.py    # Runtime configuration tuning
├── knowledge_base/
│   ├── spark_configs.yaml  # OSS Spark configurations
│   ├── delta_configs.yaml  # Delta Lake configurations
│   └── fabric_configs.yaml # Fabric-specific configurations
├── error_diagnosis/
│   └── error_parser.py     # Error explanation engine
├── cli/
│   └── main.py            # Command-line interface
└── utils/
    └── session_utils.py   # SparkSession utilities
```

## 🎓 Examples

Check out the [examples](examples/) directory for:
- Basic analysis workflow
- Error diagnosis patterns
- Configuration Q&A usage
- Integration with existing notebooks

## 🤝 Contributing

Contributions are welcome! Please read our [Contributing Guide](CONTRIBUTING.md) for details.

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

Built with ❤️ for the Microsoft Fabric Spark community. Special thanks to the Fabric Data Engineering team for their work on the Native Execution Engine.

## 📬 Contact

- **Author**: Santhosh Ravindran
- **GitHub**: [@santhoshravindran7](https://github.com/santhoshravindran7)
- **Issues**: [GitHub Issues](https://github.com/santhoshravindran7/sparkwise/issues)

---

**Tagline**: "Before you head off for the holidays, make sure your Fabric jobs aren't burning budget while you sleep. 🎄"
