Project file structure:
=======================
./
    frontend-web/
        SKILL.md
        builder.py
    data-analysis/
        BA_skill.md
        SKILL.md
        dashboard.md
        pipeline.py
        streamlit_da_dashaboard.md


File Contents:
===============


--- FILE: frontend-web/SKILL.md ---

# skill.md: The Web Architect (Cognitive Edition)

## Metadata
```yaml
name: web-architect-prime
description: High-intelligence engine for bespoke web engineering and UI/UX design.
output_format: Single-file HTML (HTML5 + CSS3 + Vanilla JS)
```

## THE PRECEPT
You are not a code assistant; you are a **Principal Creative Engineer**. Your task is to interpret the user's vision—no matter how vague or complex—and translate it into a world-class, fully functional digital experience. You do not use templates. You do not default to "AI styles." You use your internal logic to build what is **right** for the specific request.

---

## PHASE 1: INTENT DECONSTRUCTION (The Brain)
Before writing code, analyze the prompt to determine the **Intent Profile**:

1.  **The High-Fidelity Clone:** (e.g., "Like Apple," "Like TradingView")
    *   *Goal:* Reverse-engineer the DNA.
    *   *Action:* Study the spacing (layout density), the interaction language (how it moves), and the typographic hierarchy. Match the "feel" exactly.
2.  **The Functional Tool:** (e.g., "A dashboard," "A terminal," "A calculator")
    *   *Goal:* Utility and state management.
    *   *Action:* Build a robust JavaScript engine. If it’s a terminal, the data must update. If it’s a dashboard, the charts must respond.
3.  **The Original Concept:** (e.g., "Build a site for a futuristic tea shop," "A minimalist portfolio")
    *   *Goal:* Invent a unique visual language.
    *   *Action:* Synthesize a brand-new aesthetic. Choose a color theory and layout architecture that matches the "soul" of the idea.

---

## PHASE 2: THE "LIVING SITE" STANDARD
A website is failed if it is static. Every build must be **Alive**:
*   **Total Interactivity:** Every button, link, tab, and menu must work.
*   **State-Driven UI:** Use Vanilla JS to manage UI states (e.g., opening modals, switching tabs, filtering lists, updating counters).
*   **Satisfying Feedback:** Elements must respond to the user. Hover states, active presses, and focus rings must be custom-designed and buttery smooth.
*   **Data Simulation:** If the site requires data (like a trading price or a social feed), write a JS function to simulate real-time updates.

---

## PHASE 3: TECHNICAL ARCHITECTURE

### 1. Unified Single-File Delivery
*   Provide all code in one `.html` file.
*   CSS goes in `<style>`, JS goes in `<script>` (at the bottom of the body).
*   **Zero External Dependencies** unless strictly necessary (e.g., Google Fonts or Three.js for 3D). Use Inline SVGs for all icons.

### 2. Expert-Level CSS
*   **Semantic Layout:** Use HTML5 tags (`<main>`, `<section>`, `<nav>`, `<aside>`).
*   **Modern Layouts:** Use CSS Grid for complex structures and Flexbox for components.
*   **Variable-First:** Define a comprehensive `:root` system for colors, font weights, and spacing scales.
*   **Refined Motion:** Use `cubic-bezier` for transitions. Use `IntersectionObserver` for scroll-triggered animations.

### 3. "Kimi" Brain JS
*   **No Spaghetti Code:** Organize JS into logical modules (e.g., `Navigation`, `Animations`, `DataEngine`).
*   **Mobile-First Logic:** Ensure touch events and mobile menus work flawlessly.
*   **Validation:** All forms must have client-side validation and a "Success/Sent" UI state.

---

## PHASE 4: EXECUTION GATEWAY
Before finishing, the "Brain" must check:
1.  **Is it beautiful?** Does the typography and spacing look professional?
2.  **Does it work?** Can I click the buttons? Does the menu open? Do the tabs switch?
3.  **Is it accurate?** If they asked for "Apple," does it feel premium? If they asked for "Scratch," does it feel unique?
4.  **Is it complete?** No "Coming Soon" text. No empty `href="#"`. No broken layouts on mobile.

---

## HOW TO START
1.  **State your interpretation:** (e.g., "I am building a high-density trading terminal with live-updating SVG charts and a sidebar navigation system.")
2.  **Define the aesthetic:** (e.g., "A dark, monochromatic UI using SF Mono and high-contrast accents.")
3.  **Execute:** Provide the full, functional code.

***

**User Prompt Example:** *"Make me a site that looks like a high-end luxury watch brand, but the background should change colors as I scroll, and I want a working shopping cart sidebar."*

**Kimi's Thought Process:** 
*   **Aesthetic:** Luxury Editorial. Large Serif fonts, high-quality image placeholders.
*   **The "Brain" Challenge:** I need a JS `scroll` listener that calculates the scroll percentage and maps it to an HSL color value on the `<body>`. 
*   **Functional Requirement:** A "Shopping Cart" state. I'll create a JS array to hold "items" and a function to render the sidebar whenever the "Add to Cart" button is clicked. 
*   **Result:** A single-file masterpiece where the colors shift smoothly, and the cart actually updates in real-time.

--- FILE: frontend-web/builder.py ---

import os
import json
import re
import questionary
from typing import Dict, Any

from nova_cli.nova_core.ai.api_client import BridgeyeAPIClient
from nova_cli.local.ui import ui
from nova_cli.local.file_manager.commands import handle_ai_commands
from rich.tree import Tree
from rich.panel import Panel

# --- CONSTANTS ---
ARCHITECT_MODEL = "openai/gpt-oss-120b"
ARCHITECT_PROVIDER = "openrouter"
BUILDER_MODEL = "moonshotai/kimi-k2.6"
BUILDER_PROVIDER = "openrouter"

def _extract_json(text: str) -> dict:
    """Robust JSON extraction to handle LLM markdown formatting."""
    try:
        # If it's wrapped in markdown backticks
        match = re.search(r'```(?:json)?\s*(\{.*?\})\s*```', text, re.DOTALL)
        if match:
            return json.loads(match.group(1))
        # Fallback: find the first { and last }
        start = text.find('{')
        end = text.rfind('}')
        if start != -1 and end != -1:
            return json.loads(text[start:end+1])
        return json.loads(text)
    except Exception as e:
        ui.print(f"[red]Failed to parse, Trying Again...[/red]")
        return {}

def load_skill_md() -> str:
    """Loads the SKILL.md rules to inject into the planner."""
    skill_path = os.path.join(os.path.dirname(__file__), "SKILL.md")
    if os.path.exists(skill_path):
        with open(skill_path, "r", encoding="utf-8") as f:
            return f.read()
    return ""

def stage_2_classifier(api: BridgeyeAPIClient, user_prompt: str) -> Dict[str, Any]:
    """Scores the prompt to determine how much context is missing."""
    prompt = f"""
    SYSTEM: You are a strict intent classifier for a web development engine.
    Analyze the user's web build request and score its completeness.
    Output ONLY valid JSON. No markdown backticks, no explanations.

    SCHEMA:
    {{
      "has_business_context": bool,
      "has_visual_direction": bool,
      "has_section_structure": bool,
      "has_tech_preference": bool,
      "completeness_score": int (0-100),
      "missing": [list of strings indicating what is missing]
    }}

    USER REQUEST:
    {user_prompt}
    """
    with ui.create_loader("Analyzing request complexity..."):
        res = api.chat(prompt, context={}, model=ARCHITECT_MODEL, provider=ARCHITECT_PROVIDER)
    return _extract_json(res)

def stage_3_enhancer(api: BridgeyeAPIClient, user_prompt: str) -> Dict[str, Any]:
    """Fills in missing business context for sparse prompts (Score <= 35)."""
    prompt = f"""
    SYSTEM: You are a Web Prompt Enhancer for a web development engine.
    The user provided a very sparse request. Infer and fill in reasonable defaults for the business.
    Do NOT invent visual themes or colors (that is handled later). Focus purely on content/business logic.
    Output ONLY valid JSON. No markdown backticks.

    SCHEMA:
    {{
      "business_name": "string",
      "business_type": "string",
      "core_services": ["string"],
      "target_audience": "string",
      "location": "string",
      "cta_action": "string",
      "must_have_sections": ["string"],
      "tech_preference": "string",
      "special_requirements": ["string"]
    }}

    SPARSE USER REQUEST:
    {user_prompt}
    """
    with ui.create_loader("Running Web Prompt Enhancer..."):
        res = api.chat(prompt, context={}, model=ARCHITECT_MODEL, provider=ARCHITECT_PROVIDER)
    return _extract_json(res)

def stage_4_planner(api: BridgeyeAPIClient, user_prompt: str, classification: dict, enhancer_data: dict, skill_md: str) -> Dict[str, Any]:
    """Combines all context with SKILL.md to generate the definitive Build Plan."""
    
    score = classification.get("completeness_score", 0)
    missing = classification.get("missing", [])

    # Dynamic 3-Tier Planning Logic
    if score <= 35:
        mode_instruction = "FULL GENERATION MODE: The user provided a sparse prompt. Use the ENHANCED BUSINESS CONTEXT below. Auto-select the most appropriate theme from SKILL.md and build a complete, detailed plan filling all visual and structural gaps."
        context_str = f"ENHANCED BUSINESS CONTEXT:\n{json.dumps(enhancer_data, indent=2)}\n\nUSER PROMPT:\n{user_prompt}"
    elif score <= 65:
        mode_instruction = "GAP-FILLING MODE: The user provided partial details. Lock all user-stated preferences. Fill the specific gaps listed in IDENTIFIED GAPS using SKILL.md defaults."
        context_str = f"USER PROMPT:\n{user_prompt}\n\nIDENTIFIED GAPS TO FILL:\n{json.dumps(missing)}"
    else:
        mode_instruction = "STRUCTURE ONLY MODE: The user fully specified the requirements. Lock EVERY user-specified value (colors, fonts, sections). Do NOT reinterpret or override user intent. Use SKILL.md ONLY to fill the minor missing pieces."
        context_str = f"USER PROMPT:\n{user_prompt}\n\nMINOR GAPS TO FILL:\n{json.dumps(missing)}"

    prompt = f"""
    SYSTEM_OVERRIDE: You are the Lead Architect. Your job is to create a definitive JSON Blueprint for a web build.
    You must strictly adhere to the THEME SYSTEM and RULES defined in the SKILL.md document.
    Rule: User-stated details are locked. SKILL.md fills the silence.

    {mode_instruction}

    SKILL.md RULES:
    {skill_md}

    {context_str}

    REQUIRED JSON SCHEMA:
    {{
      "theme": "string (MUST be one of the themes from SKILL.md)",
      "theme_reason": "string (Explain why this theme was chosen or how it adapts the user's request)",
      "palette": {{ "background": "", "primary": "", "accent": "", "text": "", "surface": "" }},
      "fonts": {{ "heading": "", "body": "" }},
      "scope": "single-page or multi-page",
      "tech": "string (e.g. html-css-js, react)",
      "sections": ["string"],
      "copy_outline": {{ "hero_headline": "", "hero_subline": "", "cta_label": "", "section_notes": {{}} }},
      "file_structure": {{
         "pages": ["filename.ext"],
         "shared_components": ["filename.ext"],
         "build_order": ["filename.ext"]
      }} // Only include file_structure if scope is multi-page, otherwise null
    }}
    """
    with ui.create_loader("NOVA is Architecting Blueprint..."):
        res = api.chat(prompt, context={}, model=ARCHITECT_MODEL, provider=ARCHITECT_PROVIDER)
    return _extract_json(res)

def stage_6_build_single(api: BridgeyeAPIClient, plan: dict):
    """Executes a single-page build using Kimi."""
    prompt = f"""
    SYSTEM_OVERRIDE: PHASE: IMPLEMENTATION.
    You are an expert Frontend Developer executing a strict blueprint.
    Output ONLY a [CREATE: index.html] tag followed immediately by a ```html code block containing the FULL, production-ready website.
    Do NOT output any markdown text outside the block.
    Ensure CSS is in <style>, JS is in <script>. Use CSS variables in :root for the palette.
    
    BLUEPRINT:
    {json.dumps(plan, indent=2)}
    """
    ui.print("[cyan]>> NOVA (Kimi k2.6) is generating your single-page website...[/cyan]")
    output = ui.stream_rich_response(api.chat_stream(prompt, context={}, model=BUILDER_MODEL, provider=BUILDER_PROVIDER))
    
    # Save the file locally using the standard CLI engine
    return handle_ai_commands(output)

def stage_6_build_multi(api: BridgeyeAPIClient, plan: dict):
    """Executes a multi-page build iteratively using Kimi."""
    file_struct = plan.get("file_structure", {})
    build_order = file_struct.get("build_order", [])
    
    if not build_order:
        ui.print("[red]Error: Multi-page plan missing build order.[/red]")
        return []

    # Keep a running context of generated files to maintain consistency
    built_context = {}
    all_modified = []
    
    for i, filename in enumerate(build_order, 1):
        ui.print(f"\n[bold cyan]─── Building [{i}/{len(build_order)}]: {filename} ───[/bold cyan]")
        
        prompt = f"""
        SYSTEM_OVERRIDE: PHASE: IMPLEMENTATION.
        You are executing a multi-file web application blueprint.
        Currently building: {filename}.
        Output ONLY a [CREATE: {filename}] tag followed by a code block containing the exact file content.
        Do NOT add conversational text.
        
        GLOBAL BLUEPRINT:
        {json.dumps(plan, indent=2)}
        """
        
        output = ui.stream_rich_response(
            api.chat_stream(
                prompt, 
                context=built_context, 
                model=BUILDER_MODEL, 
                provider=BUILDER_PROVIDER
            )
        )
        
        # Save file to disk
        modified = handle_ai_commands(output)
        
        # Load the newly created file into the context for the next turn
        if modified:
            all_modified.extend(modified)
            for f in modified:
                if os.path.exists(f):
                    with open(f, "r", encoding="utf-8") as file:
                        built_context[os.path.basename(f)] = file.read()
                        
    return all_modified

def run(user_prompt: str):
    """Main Orchestrator Entry Point for the Frontend Web Builder Skill."""
    api = BridgeyeAPIClient()
    skill_md = load_skill_md()
    modified_files = []

    if not skill_md:
        ui.print("[red]Fatal: SKILL.md not found in frontend-web directory.[/red]")
        return modified_files

    # Stage 2: Classifier
    classification = stage_2_classifier(api, user_prompt)
    score = classification.get("completeness_score", 0)

    # Stage 3: Enhancer (Conditional)
    enhancer_data = {}
    if score <= 35:
        enhancer_data = stage_3_enhancer(api, user_prompt)

    # Stage 4: Planner
    plan = stage_4_planner(api, user_prompt, classification, enhancer_data, skill_md)
    if not plan:
        ui.print("[red]Failed to generate build plan. Aborting.[/red]")
        return modified_files

    # Stage 5: Scope Decision & Checkpoint
    scope = plan.get("scope", "single-page")
    theme = plan.get("theme", "MINIMAL")
    palette = plan.get("palette", {})
    fonts = plan.get("fonts", {})
    
    if scope == "multi-page":
        # UX Checkpoint for complex builds
        struct = plan.get("file_structure", {})
        
        ui.print("\n[bold magenta]>> NOVA Architect Plan Ready[/bold magenta]")
        # ui.print(f"[dim]Theme: {theme} | Tech: {plan.get('tech')} | Palette: {palette.get('primary', '')}[/dim]")
        
        # Build Visual Tree
        tree = Tree("[bold cyan]Project File Structure (Build Order)[/bold cyan]")
        for f in struct.get("build_order", []):
            tree.add(f"📄 {f}")
        ui.print(Panel(tree, border_style="cyan"))
        
        confirm = questionary.confirm("Review the plan above. Proceed with build?").ask()
        if not confirm:
            ui.print("[dim]Build cancelled by user.[/dim]")
            return modified_files
            
        ui.display_coding_mode(BUILDER_MODEL)
        modified_files = stage_6_build_multi(api, plan)
        
    else:
        # Single-page immediate execution
        # ui.print(f"\n[green]Using {theme} theme[/green] — {palette.get('background', 'Background')}, {fonts.get('heading', 'Serif')} + {fonts.get('body', 'Sans')}, single HTML file.")
        ui.display_coding_mode(BUILDER_MODEL)
        modified_files = stage_6_build_single(api, plan)
        
    ui.print("\n[bold green]✔ Web Build Complete![/bold green]")
    return modified_files

--- FILE: data-analysis/BA_skill.md ---

# BA_skill.md: The Kimi BI Framework

## Objective
You are a Principal BI Architect. Your task is to build a high-fidelity, comprehensive Decision Support System. You are not reporting on the data's shape; you are reporting on the business's health.

## Visual Identity System (High-Fidelity)
- **CSS Framework**: Tailwind CSS (CDN). Use `dark:` variants for theme switching.
- **Charts**: ApexCharts (CDN). Customize labels, tooltips, and grid colors to match the active theme.
- **Theme Toggle**: You MUST implement a **Light/Dark Mode Toggle** in the header. 
  - **Dark Mode**: Ultra-Dark (#0f172a / #1e293b) with Neon accents (Cyan/Green).
  - **Light Mode**: Clean, high-contrast, professional white/gray theme.
- **Visuals**: Use Glassmorphism effects for cards, consistent border radii, and smooth CSS transitions.

## Data Intelligence Logic
You will receive `pipeline_output.json` (which contains `data_head`) and `domain_mapping.json`.

### KPI Hero Row — Derivation Rules (DO NOT pad with generic counts)
Derive exactly **6–8 KPIs** that a CXO would ask for in a board meeting. Use `domain_mapping.json → vertical` to select the right set:
- **Sales / CRM**: Total Revenue, Avg Deal Size, Win Rate %, Pipeline Value, Leads This Period, Conversion Rate, Avg Sales Cycle (days), Top Channel by Volume
- **Logistics / Ops**: On-Time Delivery %, SLA Breach Count, Avg Transit Days, Throughput (units/day), Pending Orders, Cost per Shipment
- **HR / Workforce**: Headcount, Attrition Rate %, Avg Tenure, Open Roles, Avg Performance Score, Absenteeism Rate
- **Finance**: Total Revenue, Total Cost, Gross Margin %, MoM Growth %, Burn Rate, Outstanding Receivables
- **Marketing**: Total Leads, Cost per Lead, MQL→SQL Rate, Campaign ROI, Top Channel, Avg Engagement Rate
- **Generic fallback**: Use `north_star` from domain mapping as the primary KPI, then derive 5 supporting metrics from the numeric columns with lowest null_pct.
Each KPI card MUST show: current value, delta vs. previous period (if a datetime column exists), and a trend indicator (↑ green / ↓ red / → gray).

### Chart Mandate — Map chart type to data type (no guessing)
| Data pattern | Chart type |
|---|---|
| Datetime column + numeric column | Line/Area chart (ApexCharts `area` type) with date on X-axis |
| Categorical column (≤15 unique) + numeric | Horizontal Bar chart, sorted descending |
| Categorical column share / composition | Donut chart — max 7 slices, remainder as "Other" |
| Two numeric columns with correlation ≥ 0.6 | Scatter plot with regression line annotation |
| Funnel-type flow (from `flow_type: funnel`) | ApexCharts `bar` funnel-style, stages sorted by volume desc |
| Entity leaderboard (from `entities` in domain map) | Ranked horizontal bar, top 10 only, badge for rank 1 |
| Distribution of a numeric column | Histogram (column chart with small bin width) |
Minimum 6 charts across the tabs. Maximum 10. Do not create a chart if the required columns are missing or all-null.

### Filter Rules — Only use categorical columns
Build `<select>` dropdowns **only** from columns where `col_type == "categorical"` in `pipeline_output.json`. Never build a filter from a numeric, datetime, id_numeric, or high_cardinality column. Maximum 4 filters in the sidebar. Each filter must update all KPI cards and all charts simultaneously via a single `applyFilters()` JS function.

### Tabbed Navigation
- **Overview**: KPI Hero Row + North Star trend chart (line/area) + top-level donut for primary categorical split.
- **Performance Analysis**: 2–3 cross-dimensional bar/scatter charts based on the chart mandate above.
- **Leaderboards**: Ranked entity tables (from `entities` in domain map). Include rank badge, value, and delta column.
- **Alert Center**: Actionable flags only — stalled records, SLA breaches, records missing a revenue-critical field. Use severity badges (🔴 Critical / 🟡 Warning / 🟢 OK). Do NOT list technical profiling flags (skewness, dtypes).
- **Data Explorer**: Searchable, paginated table. Badge-code categorical status columns. Show max 15 rows per page.

## Technical Execution
- Deliver a single `index.html`.
- Use Vanilla JS for state management (Filters, Tabs, Theme Toggle).
- Filters MUST instantly update all metric cards and charts via a shared `applyFilters()` function.
- Charts must be initialized after data is loaded; re-render on filter change using `chart.updateSeries()`.
- Do not surface profiling-only quality flags (skewness, outlier_pct, dtypes) in the UI unless they directly represent a revenue loss or operational risk.

## Actionability
Include "Action Buttons" that print specific terminal commands to the console, allowing the user to export filtered results or take the next step.

--- FILE: data-analysis/SKILL.md ---

---
name: data-analysis
description: Data profiling, analysis, cleaning, and visualisation using pipeline.py for automated dataset profiling followed by LLM-generated, NOVA-executed Python code.
origin: NOVA
---

# Data Analysis Skill

End-to-end skill for understanding, analysing, and transforming datasets. Automatically profiles the input file, interprets the profile, and generates runnable Python code to answer the user's request.

## When to Activate

- User provides a file and asks to analyse, explore, or summarise it
- User asks what analyses are possible on their data
- User asks to clean, fix, or prepare their data
- User asks to visualise any aspect of their data
- User asks to find patterns, outliers, correlations, or trends
- User asks a specific question about their dataset (e.g. "which column has the most nulls?")
- User asks to filter, group, aggregate, or transform their data

## Core Skill Flow

### Step 1 — Run the profiler

Always run `pipeline.py` first on the provided file before doing anything else. This gives the LLM accurate, data-specific context rather than assumptions.

```bash
python skills/data-analysis/pipeline.py <file_path>
```

The output is a JSON profile covering shape, column types, nulls, distributions, outliers, correlations, and quality flags.

### Step 2 — Build the LLM prompt

Combine three inputs into one prompt:

```
[SKILL.md instructions]
+
[pipeline.py JSON output]
+
[User's original query]
```

The LLM uses the JSON profile to write code that is accurate for the actual dataset — correct column names, correct dtypes, no hallucinated columns.

### Step 3 — Generate summary + code

The LLM response must always have two parts:

**Part 1 — Data summary** (plain text, always present regardless of query)

**Part 2 — Analysis code** (Python, directly runnable by NOVA)

### Step 4 — NOVA executes the code

NOVA runs the generated code. If it fails, NOVA automatically fixes and reruns until execution succeeds or the retry limit is reached.

---

## Response Format

Every response must follow this exact structure:

```
## Data summary
One paragraph: dataset size, column breakdown by type, notable quality issues from flags[].

## Key observations
- Nulls: list columns with null_pct > 0, sorted by severity
- Distributions: skew, outliers, ranges worth calling out
- Data quality: flags from pipeline output (duplicates, constants, ID columns)
- Correlations: pairs from high_correlations[] if present

## What I'm doing
One sentence: what the code below does, in plain language.

## Code
\`\`\`python
# clean, commented, runnable Python
\`\`\`

## Output interpretation
Brief explanation of how to read the output once the code runs.
```

**Rules:**
- Never skip the data summary, even for simple queries
- Always surface quality flags before analysis — bad data should be acknowledged first
- Code must use only column names that exist in `columns[].name` from the profile
- Code must respect dtypes from the profile (don't treat a string column as numeric)

---

## Code Generation Rules

### Imports

Always include all required imports inside the generated code block. NOVA runs the code in isolation.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
```

### File loading

Always use the same file path the user provided. Use the extension from `file.extension` in the profile to load correctly.

```python
# CSV
df = pd.read_csv("path/to/file.csv")

# Excel
df = pd.read_excel("path/to/file.xlsx", sheet_name=0)

# JSON
df = pd.read_json("path/to/file.json")

# Parquet
df = pd.read_parquet("path/to/file.parquet")
```

### Column name safety

Always reference column names using bracket notation, not dot notation, to handle spaces and special characters.

```python
# Correct
df["column name with spaces"]

# Incorrect — will fail
df.column name with spaces
```

### Dtype handling

Check the `col_type` field from the profile before operating on a column.

```python
# Numeric operations only on col_type: numeric
df["sales"].mean()

# String operations only on col_type: categorical or text
df["category"].value_counts()

# Date operations only on col_type: datetime
df["date"].dt.year
```

### Visualisations

Save plots to file, never use `plt.show()` — NOVA runs headlessly.

```python
plt.figure(figsize=(10, 6))
sns.histplot(df["column"], kde=True)
plt.title("Distribution of column")
plt.tight_layout()
plt.savefig("output_plot.png", dpi=150)
plt.close()
```

### Cleaning operations

When cleaning, always create a copy and report what was changed.

```python
df_clean = df.copy()

# Drop high-null columns (> 50%)
high_null_cols = ["col_a", "col_b"]
df_clean = df_clean.drop(columns=high_null_cols)

# Fill numeric nulls with median
df_clean["revenue"] = df_clean["revenue"].fillna(df_clean["revenue"].median())

# Drop duplicates
before = len(df_clean)
df_clean = df_clean.drop_duplicates()
print(f"Removed {before - len(df_clean)} duplicate rows")

# Save cleaned file
df_clean.to_csv("cleaned_output.csv", index=False)
print(f"Cleaned dataset: {df_clean.shape[0]} rows x {df_clean.shape[1]} cols")
```

---

## Common Patterns

### Exploratory analysis

```python
import pandas as pd

df = pd.read_csv("data.csv")

# Overview
print(f"Shape: {df.shape}")
print(f"\nDtypes:\n{df.dtypes}")
print(f"\nNull counts:\n{df.isnull().sum()}")
print(f"\nDescriptive stats:\n{df.describe()}")
```

### Correlation matrix

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")
numeric_df = df.select_dtypes(include="number")

plt.figure(figsize=(12, 8))
sns.heatmap(numeric_df.corr(), annot=True, fmt=".2f", cmap="coolwarm", center=0)
plt.title("Correlation matrix")
plt.tight_layout()
plt.savefig("correlation_matrix.png", dpi=150)
plt.close()
```

### Outlier detection and removal

```python
import pandas as pd
import numpy as np

df = pd.read_csv("data.csv")

def remove_outliers_iqr(df, col):
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    mask = (df[col] >= q1 - 1.5 * iqr) & (df[col] <= q3 + 1.5 * iqr)
    removed = (~mask).sum()
    print(f"'{col}': removed {removed} outliers")
    return df[mask]

for col in df.select_dtypes(include="number").columns:
    df = remove_outliers_iqr(df, col)
```

### Group-by aggregation

```python
import pandas as pd

df = pd.read_csv("data.csv")

result = (
    df.groupby("category_col")
    .agg(
        count=("value_col", "count"),
        mean=("value_col", "mean"),
        total=("value_col", "sum"),
    )
    .sort_values("total", ascending=False)
    .reset_index()
)

print(result.to_string(index=False))
```

### Missing value summary

```python
import pandas as pd

df = pd.read_csv("data.csv")

null_summary = (
    pd.DataFrame({
        "null_count": df.isnull().sum(),
        "null_pct": (df.isnull().mean() * 100).round(2),
        "dtype": df.dtypes,
    })
    .query("null_count > 0")
    .sort_values("null_pct", ascending=False)
)

print(null_summary.to_string())
```

---

## Reading the Pipeline JSON

Key fields to use when writing code:

| Field | How to use |
|---|---|
| `shape.rows` / `shape.cols` | Report dataset size in summary |
| `columns[].name` | Use exact names in code — never guess |
| `columns[].col_type` | Determines valid operations on that column |
| `columns[].null_pct` | Prioritise in cleaning; mention if > 10% |
| `columns[].skewness` | Flag if abs > 2; suggest log transform |
| `columns[].outlier_pct` | Mention if > 5%; suggest IQR removal |
| `columns[].top_values` | Use for categorical grouping or filtering |
| `columns[].min_date` / `max_date` | Set date range context in summary |
| `high_correlations[]` | Flag multicollinearity; mention in summary |
| `flags[]` | Always surface all flags in key observations |
| `duplicate_pct` | Mention if > 0; recommend drop_duplicates |

---

## Best Practices

### DO

- **Always profile first** — never generate code without running pipeline.py
- **Surface flags first** — quality issues come before any analysis
- **Use exact column names** — copy from `columns[].name` in the profile
- **Save all outputs** — CSVs to file, plots to PNG; NOVA runs headlessly
- **Comment the code** — explain what each block does in one line
- **Handle nulls explicitly** — never assume a column is clean
- **Report changes** — print row counts before and after any filtering or cleaning
- **Respect dtypes** — check col_type before applying numeric or string operations

### DON'T

- **Don't assume column names** — always read them from the profile
- **Don't use plt.show()** — will hang in headless execution
- **Don't skip the summary** — even if the user only asked for code
- **Don't operate on all columns blindly** — filter by col_type first
- **Don't hardcode file paths** — use the path from the user's input
- **Don't generate analysis without data context** — if pipeline fails, ask the user to re-provide the file

---

## Quick Reference

| User asks | What to generate |
|---|---|
| "Analyse this data" | Full summary + descriptive stats code |
| "What can I do with this?" | Summary + list of 5 suggested analyses with code stubs |
| "Clean the data" | Flag-based cleaning code + save to new file |
| "Show me the distribution of X" | Histogram + KDE plot saved to PNG |
| "Find correlations" | Heatmap + high-correlation pairs printed |
| "Find outliers in X" | IQR-based detection + count report |
| "Group by X and sum Y" | groupby aggregation + printed table |
| "How many nulls?" | Null summary table sorted by severity |
| "Remove duplicates" | drop_duplicates + before/after count |
| "What's the date range?" | Min/max from datetime columns in profile |


--- FILE: data-analysis/dashboard.md ---

# HTML/JS Executive Dashboard Builder Skill

You are a Principal Web Architect and Data Visualization Expert. Your task is to build a world-class, **C-level executive dashboard** using a single-file HTML approach (HTML/CSS/JS).

## The Mandate
1. **Intelligent Data Loading (Critical):** You will be provided the file path(s) in the prompt. 
    - **Step A:** Try to load the file(s) automatically using JavaScript `fetch()`. If multiple files are provided, fetch them all and combine the arrays.
    - **Step B (Fallback):** Because local file protocols (`file://`) often block `fetch()` due to CORS, you MUST wrap the fetch in a `try/catch`. 
    - If the fetch fails, automatically reveal a beautifully styled Drag-and-Drop / `<input type="file" multiple>` overlay so the user can manually upload the files to view the dashboard. Use a CDN for `PapaParse` or `SheetJS` to parse files.
2. **C-Suite Multi-Section UI:** Do not write basic HTML. 
    - Import **Tailwind CSS** via CDN for a modern, sleek layout.
    - MUST include navigation tabs or a sidebar to switch between different "Views" (e.g., "Overview", "Granular Analysis", "Distributions"). Use JS to toggle visibility of these sections.
    - Include a top row of dynamically updating KPI Metric Cards.
3. **Comprehensive Visuals & Interactivity:** 
    - Import **Chart.js** or **ApexCharts** via CDN. Generate a highly detailed suite of charts (Lines for trends, Doughnuts for shares, Bars for comparisons).
    - Build JS logic to extract unique values from the dataset and populate `<select>` dropdowns to filter the entire dashboard.

## Code Requirements
- Single file only. All CSS inside `<style>`, all JS inside `<script>`.
- Format the output EXACTLY in a single `[CREATE: index.html]` tag followed by a markdown HTML code block. No conversational text outside the block.

--- FILE: data-analysis/pipeline.py ---

import sys
import json
import os
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np


# ─── File loading ────────────────────────────────────────────────────────────

def detect_ext(path):
    return os.path.splitext(path)[1].lower()


def load_file(path):
    ext = detect_ext(path)
    meta = {}

    if ext == '.csv':
        try:
            df = pd.read_csv(path, encoding='utf-8')
        except UnicodeDecodeError:
            df = pd.read_csv(path, encoding='latin-1')
            meta['encoding'] = 'latin-1'

    elif ext in ['.xlsx', '.xls']:
        xl = pd.ExcelFile(path)
        sheets = xl.sheet_names
        meta['sheets'] = sheets
        meta['active_sheet'] = sheets[0]
        if len(sheets) > 1:
            meta['note'] = f'{len(sheets)} sheets detected — profiling first sheet only'
        df = pd.read_excel(path, sheet_name=sheets[0])

    elif ext == '.json':
        df = pd.read_json(path)

    elif ext == '.parquet':
        df = pd.read_parquet(path)

    else:
        raise ValueError(f"Unsupported file type: {ext}")

    return df, meta


# ─── Column classification ────────────────────────────────────────────────────

def categorical_threshold(n_rows):
    """
    Returns (ratio_threshold, abs_threshold).
    A column is categorical if unique/total < ratio AND unique count < abs.
    Scales with dataset size so large datasets aren't over-flagged.
    """
    if n_rows < 1_000:
        return 0.05, 20
    elif n_rows < 10_000:
        return 0.05, 50
    elif n_rows < 100_000:
        return 0.03, 100
    else:
        return 0.01, 200


def classify_column(series, n_rows):
    ratio_thresh, abs_thresh = categorical_threshold(n_rows)
    n_unique = series.nunique()
    ratio = n_unique / n_rows if n_rows > 0 else 0

    if pd.api.types.is_bool_dtype(series):
        return 'boolean'

    if pd.api.types.is_datetime64_any_dtype(series):
        return 'datetime'

    if pd.api.types.is_numeric_dtype(series):
        if ratio > 0.9:
            return 'id_numeric'
        return 'numeric'

    if series.dtype == 'object':
        if ratio > 0.9:
            return 'high_cardinality'
        if n_unique <= abs_thresh or ratio <= ratio_thresh:
            return 'categorical'
        return 'text'

    return 'other'


# ─── Column profiling ─────────────────────────────────────────────────────────

def profile_column(name, series, col_type, n_rows):
    p = {
        'name': name,
        'dtype': str(series.dtype),
        'col_type': col_type,
        'null_count': int(series.isnull().sum()),
        'null_pct': round(series.isnull().mean() * 100, 2),
        'unique_count': int(series.nunique()),
        'unique_pct': round(series.nunique() / n_rows * 100, 2) if n_rows > 0 else 0,
    }

    if col_type == 'numeric':
        clean = series.dropna()
        q1, q3 = clean.quantile(0.25), clean.quantile(0.75)
        iqr = q3 - q1
        outliers = int(((clean < q1 - 1.5 * iqr) | (clean > q3 + 1.5 * iqr)).sum())
        p.update({
            'min':           _safe_round(clean.min()),
            'max':           _safe_round(clean.max()),
            'mean':          _safe_round(clean.mean()),
            'median':        _safe_round(clean.median()),
            'std':           _safe_round(clean.std()),
            'q25':           _safe_round(q1),
            'q75':           _safe_round(q3),
            'skewness':      _safe_round(clean.skew()),
            'outlier_count': outliers,
            'outlier_pct':   round(outliers / n_rows * 100, 2) if n_rows > 0 else 0,
        })

    elif col_type == 'categorical':
        top = series.value_counts().head(5)
        p['top_values'] = {str(k): int(v) for k, v in top.items()}

    elif col_type == 'datetime':
        p['min_date'] = str(series.min())
        p['max_date'] = str(series.max())
        p['range_days'] = int((series.max() - series.min()).days) if not series.isnull().all() else None

    elif col_type == 'boolean':
        vc = series.value_counts()
        p['value_counts'] = {str(k): int(v) for k, v in vc.items()}

    elif col_type in ['high_cardinality', 'text']:
        p['sample_values'] = [str(v) for v in series.dropna().head(3).tolist()]

    return p


def _safe_round(val, digits=4):
    try:
        return round(float(val), digits) if not pd.isna(val) else None
    except Exception:
        return None


# ─── Correlation ──────────────────────────────────────────────────────────────

def high_correlations(df, threshold=0.8):
    num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    if len(num_cols) < 2:
        return []

    corr = df[num_cols].corr()
    pairs = []
    for i in range(len(corr.columns)):
        for j in range(i + 1, len(corr.columns)):
            val = corr.iloc[i, j]
            if abs(val) >= threshold:
                pairs.append({
                    'col_a': corr.columns[i],
                    'col_b': corr.columns[j],
                    'correlation': round(float(val), 4),
                })
    return pairs


# ─── Quality flags ────────────────────────────────────────────────────────────

def build_flags(columns, duplicate_pct):
    flags = []

    for col in columns:
        name = col['name']

        if col['null_pct'] > 80:
            flags.append(f"Critical nulls in '{name}': {col['null_pct']}% — column likely unusable")

        if col['col_type'] == 'id_numeric':
            flags.append(f"Possible ID column (numeric): '{name}' — {col['unique_pct']}% unique values")

        if col['col_type'] == 'high_cardinality':
            flags.append(f"High cardinality column: '{name}' — {col['unique_pct']}% unique values")

        if col['col_type'] == 'numeric':
            skew = col.get('skewness') or 0
            if abs(skew) > 2:
                flags.append(f"High skewness in '{name}': {skew}")
            if col.get('outlier_pct', 0) > 5:
                flags.append(f"Significant outliers in '{name}': {col['outlier_pct']}% of rows")

        if col['unique_count'] == 1:
            flags.append(f"Constant column '{name}': only one unique value")

    if duplicate_pct > 1:
        flags.append(f"Duplicate rows detected: {duplicate_pct}% of dataset")

    return flags


# ─── Main pipeline ────────────────────────────────────────────────────────────

def run(paths):
    output = {}

    # File metadata and Load
    dfs = []
    files_info = []
    for path in paths:
        size = os.path.getsize(path)
        df, file_meta = load_file(path)
        f_info = {
            'name':       os.path.basename(path),
            'extension':  detect_ext(path),
            'size_kb':    round(size / 1024, 2),
        }
        f_info.update(file_meta)
        files_info.append(f_info)
        dfs.append(df)

    output['files'] = files_info
    
    # Combine all dataframes
    df = pd.concat(dfs, ignore_index=True)

    # Try to parse object columns as datetime
    for col in df.select_dtypes(include='object').columns:
        try:
            df[col] = pd.to_datetime(df[col], infer_datetime_format=True)
        except Exception:
            pass

    n_rows, n_cols = df.shape
    dup_count = int(df.duplicated().sum())
    dup_pct = round(dup_count / n_rows * 100, 2) if n_rows > 0 else 0

    output['shape'] = {'rows': n_rows, 'cols': n_cols}
    output['duplicate_rows'] = dup_count
    output['duplicate_pct'] = dup_pct

    # Column profiles
    columns = []
    for col in df.columns:
        col_type = classify_column(df[col], n_rows)
        columns.append(profile_column(col, df[col], col_type, n_rows))

    output['columns'] = columns

    # Column type breakdown
    type_summary = {}
    for col in columns:
        ct = col['col_type']
        type_summary[ct] = type_summary.get(ct, 0) + 1
    output['column_type_summary'] = type_summary

    # Correlations
    output['high_correlations'] = high_correlations(df)

    # Quality flags
    output['flags'] = build_flags(columns, dup_pct)

    # RAW DATA HEAD: Provides Kimi with actual data samples for domain intuition
    output['data_head'] = df.head(10).to_dict(orient='records')

    return output


# ─── Entry point ──────────────────────────────────────────────────────────────

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print(json.dumps({'error': 'Usage: python pipeline.py <file_path> [file_path_2 ... ]'}))
        sys.exit(1)

    file_paths = sys.argv[1:]
    valid_paths = []

    for fp in file_paths:
        if not os.path.exists(fp):
            print(json.dumps({'error': f'File not found: {fp}'}))
            sys.exit(1)
        valid_paths.append(fp)

    result = run(valid_paths)
    print(json.dumps(result, indent=2, default=str))


--- FILE: data-analysis/streamlit_da_dashaboard.md ---

# Streamlit Executive Dashboard Builder Skill

You are a Principal Data Scientist and Lead UI Engineer. Your task is to build a world-class, **C-level executive dashboard** using Streamlit. 

## The Mandate
1. **Dynamic Multi-File Loading:** You will be provided the exact file path(s) in the prompt. 
    - You MUST use `pd.read_csv()`, `pd.read_excel()`, etc. to load them.
    - If multiple files are provided, load all of them and use `pd.concat(ignore_index=True)` to merge them into a single master DataFrame for analysis.
    - Wrap the loading function in `@st.cache_data` for performance.
2. **C-Suite Standard & Sections:** The UI must be exceptional. 
    - Always include a top row of `st.metric()` KPI cards summarizing the most critical high-level data.
    - MUST USE `st.tabs()` to separate the dashboard into distinct granular sections (e.g., "Executive Summary", "Deep Dive Analytics", "Trend Distributions", "Raw Data Explorer").
3. **Comprehensive Logic:** 
    - Auto-deduce the best metrics based on the `pipeline_output.json` profile. Give the user an incredibly informative, highly granular view of their domain.
4. **Advanced Interactivity:** Always include a sidebar with multiple filters (`st.selectbox`, `st.slider`) that dynamically update the data across ALL tabs.
5. **Visuals:** Rely heavily on `plotly.express` for premium, interactive charts.

## Code Requirements
- Handle missing values (`.fillna()`, `.dropna()`) gracefully.
- Set `st.set_page_config(page_title="Executive Dashboard", layout="wide")`.
- Format the output EXACTLY in a single `[CREATE: dashboard.py]` tag followed by a markdown python code block. No conversational text outside the block.