Metadata-Version: 2.4
Name: ai-critic
Version: 2.0.0
Summary: Fast AI evaluator for scikit-learn models
Author-email: Luiz Seabra <filipedemarco@yahoo.com>
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: numpy
Requires-Dist: scikit-learn

# ai-critic 🧠

## The Quality Gate for Machine Learning Models

**ai-critic** is a specialized **decision-making system** designed to evaluate whether a machine learning model is **safe, reliable, and trustworthy enough** to be deployed in real-world environments.

Unlike traditional ML evaluation tools that focus almost exclusively on *performance metrics*, **ai-critic** operates as a **Quality Gate** — a final checkpoint that actively probes models to uncover **hidden risks** that frequently cause silent failures in production.

> **ai-critic does not ask *“How accurate is this model?”***
> It asks ***“Can this model be trusted in the real world?”***

---

## 🎯 What Problem Does ai-critic Solve?

In production, most ML failures are **not accuracy problems**.

They are caused by:

* Data leakage hidden inside features
* Overfitting disguised as strong validation scores
* Models that collapse under small noise
* Models that rely on a single fragile signal
* Configuration choices that look fine — but are structurally unsafe

These failures usually appear **after deployment**, when it is already expensive or dangerous to fix them.

**ai-critic exists to catch these failures *before* deployment.**

---

## 🚀 Getting Started (The Basics)

This section is intentionally designed for **beginners**, **students**, and **engineers under time pressure**.

If you only want a **fast, conservative verdict**, this is all you need.

---

### Installation

Install directly from PyPI:

```bash
pip install ai-critic
```

Python ≥ 3.9 is required.

---

### The Quick Verdict

With just a few lines of code, you can obtain:

* An **executive-level verdict**
* A **risk classification**
* A **deployment recommendation**

```python
from ai_critic import AICritic
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# 1. Prepare data and model
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    random_state=42
)

model = RandomForestClassifier(
    max_depth=5,
    random_state=42
)

# 2. Initialize the Critic
critic = AICritic(model, X, y)

# 3. Run the audit
report = critic.evaluate(view="executive")

print(f"Verdict: {report['verdict']}")
print(f"Risk Level: {report['risk_level']}")
print(f"Deploy Recommended: {report['deploy_recommended']}")
print(f"Main Reason: {report['main_reason']}")
```

**Example Output:**

```text
Verdict: ⚠️ Risky
Risk Level: medium
Deploy Recommended: False
Main Reason: Structural, robustness, or dependency-related risks detected.
```

This verdict is intentionally **conservative by design**.

> If **ai-critic approves deployment**, it means that **none of its independent heuristics detected a meaningful risk**.

---

## 🧭 How to Read the Verdict

| Field                | Meaning                 |
| -------------------- | ----------------------- |
| `verdict`            | Human-readable summary  |
| `risk_level`         | low / medium / high     |
| `deploy_recommended` | Final gate decision     |
| `main_reason`        | Primary blocking factor |

The goal is clarity, not ambiguity.
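One way to act on these fields is to turn the report into a pass/fail signal for a CI job. The sketch below is only an illustration: `gate_exit_code` is a hypothetical helper, not part of ai-critic, and it assumes nothing beyond a dict carrying the four keys from the table.

```python
import sys

def gate_exit_code(report: dict) -> int:
    """Return 0 when the gate passes, 1 otherwise (hypothetical helper)."""
    if report["deploy_recommended"]:
        return 0
    # Surface the primary blocking factor so the CI log explains the failure.
    print(
        f"Blocked ({report['risk_level']} risk): {report['main_reason']}",
        file=sys.stderr,
    )
    return 1

# Example report shaped like the executive output shown earlier.
example_report = {
    "verdict": "⚠️ Risky",
    "risk_level": "medium",
    "deploy_recommended": False,
    "main_reason": "Structural, robustness, or dependency-related risks detected.",
}
exit_code = gate_exit_code(example_report)
```

In a pipeline script, the returned value could be passed to `sys.exit()` so that a blocked model fails the build.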

---

## 💡 Understanding the Critique (Intermediate Level)

This section is for **data scientists**, **ML engineers**, and **students** who want to understand *why* the model was flagged — and how to improve it.

---

### The Four Pillars of the Audit

**ai-critic** evaluates models across **four independent risk dimensions**.

| Pillar             | What It Detects                  | Why It Matters       |
| ------------------ | -------------------------------- | -------------------- |
| 📊 Data Integrity  | Leakage, correlations, shortcuts | Inflated performance |
| 🧠 Model Structure | Over-complexity, unsafe configs  | Poor generalization  |
| 📈 Performance     | Suspicious CV behavior           | False confidence     |
| 🧪 Robustness      | Noise sensitivity                | Production collapse  |

Each pillar produces **signals**, not binary judgments.

Those signals are later aggregated by the **deployment gate**.

---

## 📊 Data Integrity Analysis

This pillar focuses on **the relationship between features and the target**.

It answers questions like:

* Are some features *too predictive*?
* Are there suspicious correlations?
* Does performance collapse when a single feature is disturbed?

These are classic symptoms of **data leakage** and **shortcut learning**.
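As a rough illustration of the "too predictive" question, the sketch below flags features whose absolute Pearson correlation with the target is close to 1. It is not ai-critic's implementation: the function name `suspicious_features` and the `0.95` threshold are assumptions chosen for the example.

```python
import numpy as np

def suspicious_features(X, y, threshold=0.95):
    """Return indices of features whose |Pearson r| with y exceeds threshold."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Constant columns have zero variance; treat their correlation as 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        r = np.where(denom > 0, (Xc.T @ yc) / denom, 0.0)
    return [int(i) for i in np.flatnonzero(np.abs(r) > threshold)]

# Synthetic demo: column 1 is a near-copy of the target (simulated leakage).
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)
X_demo = rng.normal(size=(200, 3))
X_demo[:, 1] = y_demo + rng.normal(scale=0.01, size=200)

flagged = suspicious_features(X_demo, y_demo)
```

Linear correlation only catches the simplest form of leakage; a feature can leak the target through a non-linear relationship and still pass this check.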

---

## 🧠 Model Structure Analysis

A model can be accurate and still be unsafe.

Structural analysis looks for:

* Excessive depth
* Over-parameterization
* Configuration choices that amplify variance
* Inconsistent bias–variance tradeoffs

This is especially important for:

* Decision trees
* Boosting models
* Neural networks with limited data
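A structural check of this kind can be approximated by inspecting an estimator's hyperparameters through scikit-learn's `get_params()`. The rule set below is a minimal sketch under assumptions: `structural_warnings` and the `max_safe_depth=20` limit are hypothetical and do not describe ai-critic's actual rules.

```python
from sklearn.tree import DecisionTreeClassifier

def structural_warnings(model, n_samples, max_safe_depth=20):
    """Collect heuristic warnings from an estimator's hyperparameters."""
    params = model.get_params()
    warnings = []
    if "max_depth" in params:
        depth = params["max_depth"]
        if depth is None:
            # An unbounded tree can grow until it memorizes the training set.
            warnings.append("max_depth is unbounded")
        elif depth > max_safe_depth:
            warnings.append(f"max_depth={depth} is high for {n_samples} samples")
    return warnings

unbounded = structural_warnings(DecisionTreeClassifier(), n_samples=500)
bounded = structural_warnings(DecisionTreeClassifier(max_depth=5), n_samples=500)
```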

---

## 📈 Performance Sanity Checks

Rather than optimizing metrics, **ai-critic questions them**.

It checks:

* Cross-validation stability
* Variance across folds
* Learning curve consistency
* Performance under perturbations

A strong score that behaves strangely is treated as **a warning, not a success**.
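The first two checks can be sketched with scikit-learn's `cross_val_score`. This is an illustration, not the library's code: the helper `cv_sanity`, the flag names, and the `max_std` and `max_mean` thresholds are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def cv_sanity(model, X, y, cv=5, max_std=0.05, max_mean=0.995):
    """Flag cross-validation results that are unstable or implausibly good."""
    scores = cross_val_score(model, X, y, cv=cv)
    flags = []
    if scores.std() > max_std:
        flags.append("unstable_across_folds")
    if scores.mean() > max_mean:
        flags.append("suspiciously_perfect")
    return {
        "mean": float(scores.mean()),
        "std": float(scores.std()),
        "flags": flags,
    }

X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
result = cv_sanity(LogisticRegression(max_iter=1000), X_demo, y_demo)
```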

---

## 🧪 Robustness Testing (Noise Injection)

Production data is **never clean**.

This test injects controlled noise into inputs and measures degradation.

```python
# Assumes `report` comes from an evaluate() call whose view includes "details"
robustness = report["details"]["robustness"]

print(f"Original CV Score: {robustness['cv_score_original']}")
print(f"Noisy CV Score: {robustness['cv_score_noisy']}")
print(f"Performance Drop: {robustness['performance_drop']}")
print(f"Verdict: {robustness['verdict']}")
```

Possible outcomes:

* `stable` → acceptable degradation
* `fragile` → high sensitivity
* `misleading` → performance likely inflated
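To make the idea concrete, here is a minimal sketch of noise injection using only NumPy and scikit-learn. It is an assumption about how such a test could work, not ai-critic's implementation: `noise_drop` is a hypothetical helper, the noise is Gaussian and scaled per feature, and the dictionary keys simply mirror the snippet above for readability.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def noise_drop(model, X, y, noise_scale=0.1, cv=3, seed=0):
    """Compare the mean CV score on clean inputs and on noise-perturbed inputs."""
    rng = np.random.default_rng(seed)
    # Scale the Gaussian noise by each feature's standard deviation.
    noise = rng.normal(size=X.shape) * X.std(axis=0) * noise_scale
    original = float(cross_val_score(model, X, y, cv=cv).mean())
    noisy = float(cross_val_score(model, X + noise, y, cv=cv).mean())
    return {
        "cv_score_original": original,
        "cv_score_noisy": noisy,
        "performance_drop": original - noisy,
    }

X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
demo_model = RandomForestClassifier(n_estimators=20, max_depth=5, random_state=0)
result = noise_drop(demo_model, X_demo, y_demo)
```

A small positive `performance_drop` would correspond to the `stable` outcome; where the boundary to `fragile` lies is a policy choice that this sketch leaves open.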

---

## 🔍 Explainability & Feature Sensitivity

Accuracy alone hides *why* a model works.

The explainability module performs **feature sensitivity analysis** to detect:

* Feature-level leakage
* Over-reliance on a single signal
* Structural shortcuts

---

### How Explainability Works

For each feature:

1. The feature is randomly permuted.
2. The model is re-evaluated.
3. Performance drop is measured.

Large drops indicate **critical dependency**.

This approach is:

* Model-agnostic
* Lightweight
* Framework-independent
* Interpretable by humans
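The three steps can be sketched in a few lines of NumPy and scikit-learn. This is an illustrative version under assumptions: `permutation_drops` is a hypothetical helper, and it re-scores an already fitted model on held-out data rather than retraining it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def permutation_drops(model, X_test, y_test, seed=0):
    """Return the score drop caused by permuting each feature in turn."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X_test, y_test)
    drops = []
    for j in range(X_test.shape[1]):
        X_perm = X_test.copy()
        # Step 1: randomly permute a single feature column.
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        # Steps 2 and 3: re-evaluate and record the performance drop.
        drops.append(baseline - model.score(X_perm, y_test))
    return np.array(drops)

X_demo, y_demo = make_classification(n_samples=600, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
fitted = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
drops = permutation_drops(fitted, X_te, y_te)
```

scikit-learn ships a more complete version of the same technique as `sklearn.inspection.permutation_importance`, which repeats the permutation several times and reports the mean and standard deviation.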

---

### Explainability Verdicts

| Verdict                | Meaning                  |
| ---------------------- | ------------------------ |
| `stable`               | Balanced feature usage   |
| `feature_dependency`   | Few features dominate    |
| `feature_leakage_risk` | Single feature dominates |

These verdicts **directly affect**:

* Deployment decision
* Confidence score
* Recommendations
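How per-feature performance drops could map onto the three verdicts is sketched below. The thresholds (`single_share`, `top_k`, `top_k_share`) and the function `explainability_verdict` are hypothetical; the source does not state the cut-offs ai-critic uses.

```python
import numpy as np

def explainability_verdict(drops, single_share=0.6, top_k=3, top_k_share=0.9):
    """Map per-feature score drops to a verdict label (illustrative thresholds)."""
    # Negative drops mean permutation did not hurt; treat them as no dependency.
    drops = np.clip(np.asarray(drops, dtype=float), 0.0, None)
    total = drops.sum()
    if total == 0.0:
        return "stable"
    shares = np.sort(drops / total)[::-1]
    if shares[0] >= single_share:
        return "feature_leakage_risk"   # a single feature dominates
    if len(shares) > top_k and shares[:top_k].sum() >= top_k_share:
        return "feature_dependency"     # a few features dominate
    return "stable"                     # balanced feature usage

balanced = explainability_verdict([0.1] * 10)
few = explainability_verdict([0.2, 0.2, 0.2, 0.01, 0.01, 0.01])
single = explainability_verdict([0.5, 0.01, 0.01, 0.01])
```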

---

## 🧠 Recommendations Engine (New)

**ai-critic does not stop at “deploy or not”.**

It generates **actionable recommendations**, such as:

* “Reduce `max_depth`”
* “Increase regularization”
* “Likely feature leakage detected”
* “Model shows structural overfitting”
* “High noise sensitivity — retrain with augmentation”

These recommendations are derived from **rules and measured signals**, not generated by an LLM.

---

## ⚙️ Deployment Gate

The final decision is produced by `deploy_decision()`.

```python
decision = critic.deploy_decision()

print(decision["deploy"])
print(decision["risk_level"])
print(decision["confidence"])
print(decision["blocking_issues"])
```

Conceptually:

* **Hard blockers** → deployment denied
* **Soft blockers** → deployment discouraged
* **Confidence score (0–1)** → heuristic trust
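The aggregation logic can be pictured with a small sketch. Everything here is an assumption for illustration: `deploy_gate` is not the library's `deploy_decision()`, and the linear `soft_penalty` is just one simple way to turn soft blockers into a 0–1 score.

```python
def deploy_gate(hard_blockers, soft_blockers, soft_penalty=0.2):
    """Combine hard and soft blockers into a gate decision (hypothetical rules)."""
    if hard_blockers:
        # Any hard blocker denies deployment outright.
        return {
            "deploy": False,
            "risk_level": "high",
            "confidence": 0.0,
            "blocking_issues": list(hard_blockers),
        }
    # Each soft blocker lowers the heuristic trust score, floored at 0.
    confidence = max(0.0, 1.0 - soft_penalty * len(soft_blockers))
    return {
        "deploy": not soft_blockers,
        "risk_level": "medium" if soft_blockers else "low",
        "confidence": confidence,
        "blocking_issues": list(soft_blockers),
    }

clean = deploy_gate([], [])
discouraged = deploy_gate([], ["noise_sensitivity", "high_variance"])
denied = deploy_gate(["feature_leakage_risk"], ["noise_sensitivity"])
```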

---

## 🔄 Feedback Loop & Learning Critic

**ai-critic can improve over time** by learning from past evaluations.

Each evaluation can be stored as feedback:

* Model config
* Signals
* Final outcome
* Human override (optional)

This enables:

* Meta-learning
* Better future recommendations
* Context-aware criticism
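A feedback record with these four fields could be persisted as one JSON object per line. The sketch below is a guess at a workable storage format, not ai-critic's own: the helpers `record_feedback` and `load_feedback`, the field names, and the `feedback.jsonl` file name are all hypothetical.

```python
import json
import tempfile
from pathlib import Path

def record_feedback(path, model_config, signals, outcome, human_override=None):
    """Append one evaluation record to a JSON Lines file."""
    entry = {
        "model_config": model_config,
        "signals": signals,
        "outcome": outcome,
        "human_override": human_override,  # optional, None when absent
    }
    with Path(path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")

def load_feedback(path):
    """Read all stored records back as a list of dicts."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]

log_path = Path(tempfile.mkdtemp()) / "feedback.jsonl"
record_feedback(log_path, {"max_depth": 5}, {"robustness": "fragile"}, outcome="blocked")
record_feedback(log_path, {"max_depth": 3}, {"robustness": "stable"},
                outcome="approved", human_override=True)
history = load_feedback(log_path)
```

An append-only log keeps every past decision available for later analysis, including the cases where a human overrode the gate.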

---

## 🧪 Session Tracking & Comparison

You can compare models over time:

```python
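# Two candidate configurations of the same estimator (illustrative values)
model_v1 = RandomForestClassifier(max_depth=5, random_state=42)
model_v2 = RandomForestClassifier(max_depth=10, random_state=42)

critic_v1 = AICritic(model_v1, X, y, session="v1")
critic_v1.evaluate()

critic_v2 = AICritic(model_v2, X, y, session="v2")
critic_v2.evaluate()

# Compare the "v2" session against the stored "v1" session
critic_v2.compare_with("v1")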
```

Use cases:

* Regression detection
* Risk drift
* Governance audits
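Risk drift between two audits can be expressed as a comparison of their `risk_level` values. The helper below is a hypothetical sketch that works on plain report dicts; it does not show what `compare_with` returns, which the source does not specify.

```python
# Ordering of the three risk levels used in the executive report.
RISK_ORDER = {"low": 0, "medium": 1, "high": 2}

def risk_drift(previous: dict, current: dict) -> str:
    """Classify the change in risk level between two reports."""
    delta = RISK_ORDER[current["risk_level"]] - RISK_ORDER[previous["risk_level"]]
    if delta > 0:
        return "regressed"
    if delta < 0:
        return "improved"
    return "unchanged"

drift = risk_drift({"risk_level": "low"}, {"risk_level": "medium"})
```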

---

## ⚙️ Multi-Framework Support

The same API works for:

* scikit-learn
* PyTorch
* TensorFlow

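Adapters handle training, evaluation, and probing internally. The package metadata lists only `numpy` and `scikit-learn` as dependencies, so install PyTorch or TensorFlow separately if you need those adapters.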

---

## 🧩 Design Philosophy

**ai-critic is intentionally skeptical.**

It assumes:

* Metrics can lie
* Data is imperfect
* Models fail silently
* Confidence must be earned

This makes it ideal as a **final gate**, not a tuning toy.

---

## 🛡️ What ai-critic Is NOT

* ❌ A hyperparameter optimizer
* ❌ A leaderboard benchmark tool
* ❌ A replacement for domain expertise
* ❌ A magic “approve all” system

---

## 🧠 Final Note

> **ai-critic is not here to make models look good.**
> It exists to **prevent bad models from looking good enough to deploy**.

A failed audit does **not** mean your model is bad.
It means your model is **not yet safe to trust**.

That distinction is everything.
