Metadata-Version: 2.4
Name: LangMet
Version: 0.3.0
Summary: Observability and performance metrics for LLM and RAG systems
Author: Dr Mabrouka Abuhmida
License: MIT
Keywords: llm,rag,ragas,evaluation,analytics,metrics,observability
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: sqlalchemy
Requires-Dist: sqlalchemy>=2.0.23; extra == "sqlalchemy"
Provides-Extra: fastapi
Requires-Dist: fastapi>=0.104.1; extra == "fastapi"
Requires-Dist: pydantic>=2.5.0; extra == "fastapi"
Provides-Extra: embeddings
Requires-Dist: sentence-transformers>=2.2.0; extra == "embeddings"
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: ruff>=0.7.0; extra == "dev"
Requires-Dist: build>=1.2.2; extra == "dev"
Requires-Dist: twine>=5.1.1; extra == "dev"
Dynamic: license-file

# LangMet

![LangMet Logo](https://raw.githubusercontent.com/mabrouka-abuhmida/LangMet/main/LangeMet-Logo.png)

**Observability and drift intelligence for LLM and RAG systems.**

LangMet provides a reusable analytics layer for monitoring operational performance, retrieval quality, and evidence coverage in AI systems.

It separates analytical computation from data access, so teams can compute metrics from any telemetry source: SQL databases, log streams, data warehouses, files, or custom repositories.

**Designed for production AI environments.**

## Why LangMet?

Most LLM metrics pipelines are tightly coupled to infrastructure.

### Benefits of LangMet:
* isolates analytics from storage
* provides percentile-based latency monitoring
* supports windowed drift detection (short-term vs long-term baselines)
* enables evidence coverage analysis for RAG systems
* works with any data source via repository interfaces

**This makes it suitable for:**

* production monitoring
* research evaluation
* safety-critical AI systems
* regulated environments

### Features

- Pure analytics functions for:
  - Operational LLM metrics
  - RAG performance metrics
  - Citation coverage metrics
  - **RAGAS evaluation metrics** (faithfulness, answer relevancy, context precision, context recall, context relevancy, answer correctness, answer similarity)
  - **Cost & token-budget metrics** (per-model pricing, input/output accounting)
- **Pluggable RAGAS scorers** — dependency-free token overlap by default, or embedding-backed via the `[embeddings]` extra
- **Threshold-based alerting** (`evaluate_alerts`) across operational, cost, citation, RAGAS, and drift metrics
- Built-in latency percentiles (`p50`, `p90`, `p95`, `p99`) for SLO monitoring
- Drift detection for numeric and categorical signals, based on PSI (Population Stability Index) and TVD (total variation distance)
- Windowed drift baselines (compare last 1h vs trailing 7d automatically), including **RAGAS quality drift** over time
- Repository interface (`MetricsRepository`) for pluggable data access
- In-memory and SQLAlchemy adapters
- Framework-agnostic service layer
- Ships with type hints (`py.typed`)

## Install

```bash
pip install langmet
```

Or install directly from GitHub with git and pip:

```bash
pip install git+https://github.com/mabrouka-abuhmida/Langmet.git
```

With SQLAlchemy adapter support:

```bash
pip install "langmet[sqlalchemy]"
```

With embedding-backed RAGAS scoring:

```bash
pip install "langmet[embeddings]"
```

## 2-Minute Demo

Most engineers want proof it works before reading internals. A runnable backend + frontend demo is included:

- `examples/two-minute-demo/README.md`

Quick run:

```bash
python -m pip install -e ".[fastapi]"
python -m pip install uvicorn
uvicorn app:app --app-dir examples/two-minute-demo --reload
```

Open `http://127.0.0.1:8000/`.


### Example UI Demo

![Example UI Demo - Overview](https://raw.githubusercontent.com/mabrouka-abuhmida/LangMet/main/examples/two-minute-demo/image/README/1770853195159.png)

![Example UI Demo - Drift](https://raw.githubusercontent.com/mabrouka-abuhmida/LangMet/main/examples/two-minute-demo/image/README/1770853181629.png)

## Quickstart (Pure Functions)

```python
from datetime import datetime
from langmet.models import CompletionEvent
from langmet.analytics import compute_operational_metrics

events = [
    CompletionEvent(
        provider="openai",
        model="gpt-4o-mini",
        latency_ms=320,
        tokens_total=850,
        error_message=None,
        created_at=datetime.utcnow(),
    )
]

metrics = compute_operational_metrics(events)
print(metrics["overview"]["avg_latency_ms"])
```
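
To make the percentile figures (`p50`, `p90`, `p95`, `p99`) concrete, here is a minimal standalone sketch that does not use LangMet. It assumes the nearest-rank method; LangMet's own interpolation rule may differ, and `nearest_rank_percentile` is a hypothetical helper name.

```python
import math


def nearest_rank_percentile(values: list[float], pct: float) -> float:
    """Return the pct-th percentile (0 < pct <= 100) using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    # Nearest rank: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]


# One slow outlier (900 ms) barely moves the median but dominates the tail.
latencies_ms = [120, 130, 115, 125, 900, 140, 135, 128, 122, 131]
summary = {f"p{p}": nearest_rank_percentile(latencies_ms, p) for p in (50, 90, 95, 99)}
print(summary)
```

The gap between `p50` and `p95` in this sample is the reason tail percentiles, rather than averages, are usually the basis for latency SLOs.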

RAGAS quality scoring (per-query, no external dependencies):

```python
from langmet.analytics import (
    score_faithfulness,
    score_answer_relevancy,
    score_context_precision,
    score_context_recall,
    score_context_relevancy,
    score_answer_correctness,
    score_answer_similarity,
)

question = "What is the capital of France?"
answer = "Paris is the capital of France."
contexts = ["Paris is the capital and largest city of France."]
ground_truth = "Paris is the capital of France."

faithfulness     = score_faithfulness(answer, contexts)
ans_relevancy    = score_answer_relevancy(question, answer)
ctx_precision    = score_context_precision(contexts, ground_truth)
ctx_recall       = score_context_recall(contexts, ground_truth)
ctx_relevancy    = score_context_relevancy(question, contexts)
ans_correctness  = score_answer_correctness(answer, ground_truth)
ans_similarity   = score_answer_similarity(answer, ground_truth)
```
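
For intuition about what a dependency-free token-overlap score measures, the sketch below computes the fraction of answer tokens that also occur in the retrieved contexts. It is an illustration under assumed tokenization rules, not LangMet's actual formula; `tokenize` and `overlap_faithfulness` are hypothetical names.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_faithfulness(answer: str, contexts: list[str]) -> float:
    """Fraction of distinct answer tokens that appear somewhere in the contexts."""
    answer_tokens = tokenize(answer)
    if not answer_tokens:
        return 0.0
    context_tokens: set[str] = set()
    for ctx in contexts:
        context_tokens |= tokenize(ctx)
    return len(answer_tokens & context_tokens) / len(answer_tokens)


score = overlap_faithfulness(
    "Paris is the capital of France.",
    ["Paris is the capital and largest city of France."],
)
print(score)
```

Overlap scores of this kind are cheap and deterministic, but they cannot recognize paraphrases, which is the usual motivation for switching to an embedding-backed scorer.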

Aggregate RAGAS scores over many queries:

```python
from datetime import datetime
from langmet.models import RagaEvaluationEvent
from langmet.analytics import compute_raga_metrics

events = [
    RagaEvaluationEvent(
        query_id="q1",
        faithfulness=0.92,
        answer_relevancy=0.88,
        context_precision=0.80,
        context_recall=0.85,
        context_relevancy=0.79,
        answer_correctness=0.83,
        answer_similarity=0.86,
        created_at=datetime.utcnow(),
    ),
    # ... more events
]

raga = compute_raga_metrics(events)
print(raga["overview"]["overall_score"])
print(raga["scores"]["faithfulness"])
```
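
Because every score field is optional, aggregation has to decide what to do with missing values. The sketch below shows one reasonable policy, averaging each metric over only the rows that report it. This is an assumption for illustration; how `compute_raga_metrics` treats missing scores is not specified here, and `mean_scores` is a hypothetical helper.

```python
from statistics import fmean
from typing import Optional


def mean_scores(
    rows: list[dict[str, Optional[float]]], metrics: list[str]
) -> dict[str, Optional[float]]:
    """Average each metric over the rows that actually report it."""
    out: dict[str, Optional[float]] = {}
    for name in metrics:
        present = [row[name] for row in rows if row.get(name) is not None]
        # A metric nobody reported stays None instead of being counted as zero.
        out[name] = fmean(present) if present else None
    return out


rows = [
    {"faithfulness": 0.9, "context_recall": 0.8},
    {"faithfulness": 0.7, "context_recall": None},
]
averages = mean_scores(rows, ["faithfulness", "context_recall", "answer_similarity"])
print(averages)
```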

Score queries directly (token overlap by default, or swap in embeddings):

```python
from langmet.scoring import score_query, TokenOverlapScorer
# from langmet.scoring import EmbeddingScorer  # pip install "langmet[embeddings]"

event = score_query(
    question="What is the capital of France?",
    answer="Paris is the capital of France.",
    contexts=["Paris is the capital and largest city of France."],
    ground_truth="Paris is the capital of France.",
    query_id="q1",
    scorer=TokenOverlapScorer(),  # or EmbeddingScorer()
)
# event is a RagaEvaluationEvent ready to feed into compute_raga_metrics
```

Cost & token-budget metrics:

```python
from langmet.cost import compute_cost_metrics
from langmet.models import CompletionEvent
from datetime import datetime

events = [
    CompletionEvent(
        provider="openai",
        model="gpt-4o",
        latency_ms=300,
        tokens_total=1500,
        error_message=None,
        created_at=datetime.utcnow(),
        prompt_tokens=1000,
        completion_tokens=500,
    )
]

cost = compute_cost_metrics(events)            # uses DEFAULT_PRICE_TABLE
print(cost["overview"]["total_cost_usd"])
# Or pass your own rates (USD per 1,000 tokens):
cost = compute_cost_metrics(events, price_table={"gpt-4o": {"input": 0.0025, "output": 0.01}})
```
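
The arithmetic behind per-1,000-token pricing can be checked by hand. The sketch below prices the event above with the custom rates, assuming input and output tokens are billed separately; `cost_usd` is a hypothetical helper, and the exact figure LangMet reports may be rounded differently.

```python
def cost_usd(prompt_tokens: int, completion_tokens: int, rates: dict[str, float]) -> float:
    """Price a single call; rates are USD per 1,000 tokens."""
    input_cost = (prompt_tokens / 1000) * rates["input"]
    output_cost = (completion_tokens / 1000) * rates["output"]
    return input_cost + output_cost


rates = {"input": 0.0025, "output": 0.01}
# 1000 prompt tokens -> 0.0025 USD, 500 completion tokens -> 0.005 USD.
total = cost_usd(prompt_tokens=1000, completion_tokens=500, rates=rates)
print(round(total, 6))
```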

Threshold-based alerting:

```python
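from langmet.alerts import evaluate_alerts, AlertThresholds

# operational_metrics, cost_metrics, and raga_metrics are the dicts returned by
# compute_operational_metrics, compute_cost_metrics, and compute_raga_metrics.
result = evaluate_alerts(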
    operational=operational_metrics,
    cost=cost_metrics,
    raga=raga_metrics,
    thresholds=AlertThresholds(
        max_error_rate=0.05,
        max_p95_latency_ms=500,
        min_faithfulness=0.7,
        max_total_cost_usd=100.0,
    ),
)
if result["triggered"]:
    for alert in result["alerts"]:
        print(alert["severity"], alert["message"])
```
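
Conceptually, threshold alerting is a comparison of observed values against configured limits. The standalone sketch below mirrors the `triggered` / `alerts` result shape used above, but `Limits`, `check`, the flat input dict, and the fixed `"warning"` severity are all hypothetical simplifications, not LangMet's implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Limits:
    """Hypothetical upper limits; None disables a check."""
    max_error_rate: Optional[float] = None
    max_p95_latency_ms: Optional[float] = None


def check(values: dict[str, float], limits: Limits) -> dict:
    """Compare observed values against upper limits and collect breaches."""
    alerts = []
    pairs = (
        ("error_rate", limits.max_error_rate),
        ("p95_latency_ms", limits.max_p95_latency_ms),
    )
    for name, limit in pairs:
        observed = values.get(name)
        # Skip checks that are disabled or have no observed value.
        if limit is not None and observed is not None and observed > limit:
            alerts.append(
                {"severity": "warning", "message": f"{name}={observed} exceeds {limit}"}
            )
    return {"triggered": bool(alerts), "alerts": alerts}


result = check(
    {"error_rate": 0.08, "p95_latency_ms": 420.0},
    Limits(max_error_rate=0.05, max_p95_latency_ms=500),
)
print(result)
```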

RAGAS quality drift (is faithfulness degrading over time?):

```python
from langmet.analytics import detect_raga_drift

# raga_events is a list of RagaEvaluationEvent objects, as built above.
drift = detect_raga_drift(raga_events, metric="faithfulness")
print(drift["drift_detected"])
```

Drift detection:

```python
from datetime import datetime, timedelta
from langmet.analytics import (
    detect_numeric_drift,
    detect_categorical_drift,
    detect_numeric_drift_windowed,
)

latency_drift = detect_numeric_drift(
    baseline_values=[120, 130, 115, 125],
    current_values=[210, 220, 205, 215],
)

provider_drift = detect_categorical_drift(
    baseline_labels=["openai", "openai", "anthropic"],
    current_labels=["anthropic", "anthropic", "openai"],
)

# Automatic window split: last 1h vs trailing 7d.
ref = datetime.utcnow()
observations = [
    (ref - timedelta(hours=2), 120.0),
    (ref - timedelta(minutes=40), 220.0),
]
windowed_drift = detect_numeric_drift_windowed(
    observations=observations,
    reference_time=ref,
)
```
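
For readers unfamiliar with the two drift statistics, the sketch below computes PSI and TVD by hand for the categorical example above. It uses the textbook definitions; the epsilon floor and the helper names are assumptions, and LangMet's binning and smoothing choices may differ. Numeric signals would first need to be bucketed into bins before the same formulas apply.

```python
import math
from collections import Counter


def distribution(labels: list[str], categories: list[str], eps: float = 1e-6) -> list[float]:
    """Relative frequency of each category, floored at eps to keep logarithms finite."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return [max(counts[c] / len(labels), eps) for c in categories]


def psi(baseline: list[float], current: list[float]) -> float:
    """Population Stability Index: sum of (cur - base) * ln(cur / base)."""
    return sum((c - b) * math.log(c / b) for b, c in zip(baseline, current))


def tvd(baseline: list[float], current: list[float]) -> float:
    """Total variation distance: half the L1 distance between the distributions."""
    return 0.5 * sum(abs(c - b) for b, c in zip(baseline, current))


baseline_labels = ["openai", "openai", "anthropic"]
current_labels = ["anthropic", "anthropic", "openai"]
categories = sorted(set(baseline_labels) | set(current_labels))

base = distribution(baseline_labels, categories)
cur = distribution(current_labels, categories)
psi_value = psi(base, cur)
tvd_value = tvd(base, cur)
print(round(psi_value, 4), round(tvd_value, 4))
```

TVD is bounded in `[0, 1]` and easy to interpret as "the share of traffic that moved", while PSI is unbounded and penalizes shifts in rare categories more heavily.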

## Quickstart (SQL Repository + Service)

```python
from datetime import datetime, timedelta
from langmet.service import AnalyticsService
from langmet.adapters.sqlalchemy_repo import SQLAlchemyMetricsRepository

# db_session is your application's SQLAlchemy Session (creation not shown).
repo = SQLAlchemyMetricsRepository(db_session)
service = AnalyticsService(repo)

start = datetime.utcnow() - timedelta(days=7)
end = datetime.utcnow()

all_operational = service.get_operational_metrics(start, end)
all_rag = service.get_rag_metrics(start, end)
citation = service.get_citation_coverage(start, end)
raga = service.get_raga_metrics(start, end)
cost = service.get_cost_metrics(start, end)
```

For tests, demos, or batch jobs without a database, use the in-memory adapter:

```python
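from langmet.adapters import InMemoryMetricsRepository
from langmet.service import AnalyticsService

# events is a list of CompletionEvent objects, as in the first quickstart.
repo = InMemoryMetricsRepository(completion_events=events)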
service = AnalyticsService(repo)
```

## Production Integration Guide

Use this path when wiring LangMet to a real service.

### 1) Capture telemetry events in your app

For each request or pipeline run, emit these fields:
- Completion events: `provider`, `model`, `latency_ms`, `tokens_total`, `error_message`, `created_at`
- RAG events: `top_k`, `top_n`, `retrieval_scores`, `rerank_scores`, `retrieval_latency_ms`, `rerank_latency_ms`, `created_at`
- Citation events: `message_id`, `evidence_count`, `created_at`
- RAGAS evaluation events: `query_id`, `faithfulness`, `answer_relevancy`, `context_precision`, `context_recall`, `context_relevancy`, `answer_correctness`, `answer_similarity`, `created_at` (all score fields are optional floats in `[0, 1]`)
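
One way to collect the completion-event fields listed above is to wrap the model call in a timing helper. The sketch below is a minimal illustration: `timed_completion` is a hypothetical name, and it assumes the client returns a dict carrying a `tokens_total` key, which will not match every provider SDK.

```python
import time
from datetime import datetime, timezone
from typing import Any, Callable


def timed_completion(call: Callable[[], Any], provider: str, model: str) -> dict:
    """Run `call` and return a dict with the completion-event fields."""
    started = time.perf_counter()
    error_message = None
    tokens_total = None
    try:
        response = call()
        # Assumption: the response is a dict that reports a total token count.
        tokens_total = response.get("tokens_total")
    except Exception as exc:
        # Record the failure so the error rate stays accurate; a real wrapper
        # would usually re-raise after logging the event.
        error_message = str(exc)
    return {
        "provider": provider,
        "model": model,
        "latency_ms": (time.perf_counter() - started) * 1000,
        "tokens_total": tokens_total,
        "error_message": error_message,
        "created_at": datetime.now(timezone.utc),
    }


def failing_call() -> dict:
    raise TimeoutError("upstream timed out")


ok = timed_completion(lambda: {"tokens_total": 850}, "openai", "gpt-4o-mini")
failed = timed_completion(failing_call, "openai", "gpt-4o-mini")
print(ok["error_message"], failed["error_message"])
```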

### 2) Example SQL schema (PostgreSQL)

```sql
CREATE TABLE completion_logs (
  id BIGSERIAL PRIMARY KEY,
  provider TEXT NOT NULL,
  model TEXT,
  latency_ms DOUBLE PRECISION,
  tokens_total INTEGER,
  error_message TEXT,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE rag_logs (
  id BIGSERIAL PRIMARY KEY,
  top_k INTEGER,
  top_n INTEGER,
  retrieval_scores JSONB,
  rerank_scores JSONB,
  retrieval_latency_ms DOUBLE PRECISION,
  rerank_latency_ms DOUBLE PRECISION,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE citation_events (
  id BIGSERIAL PRIMARY KEY,
  message_id TEXT NOT NULL,
  evidence_count INTEGER NOT NULL DEFAULT 0,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE raga_evaluations (
  id BIGSERIAL PRIMARY KEY,
  query_id TEXT NOT NULL,
  faithfulness DOUBLE PRECISION,
  answer_relevancy DOUBLE PRECISION,
  context_precision DOUBLE PRECISION,
  context_recall DOUBLE PRECISION,
  context_relevancy DOUBLE PRECISION,
  answer_correctness DOUBLE PRECISION,
  answer_similarity DOUBLE PRECISION,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_completion_logs_created_at ON completion_logs (created_at);
CREATE INDEX idx_rag_logs_created_at ON rag_logs (created_at);
CREATE INDEX idx_citation_events_created_at ON citation_events (created_at);
CREATE INDEX idx_raga_evaluations_created_at ON raga_evaluations (created_at);
```

### 3) Wire repository and service

```python
from datetime import datetime, timedelta
from sqlalchemy.orm import Session
from langmet.adapters.sqlalchemy_repo import SQLAlchemyMetricsRepository
from langmet.service import AnalyticsService

def get_metrics_payload(db: Session) -> dict:
    repo = SQLAlchemyMetricsRepository(db)
    svc = AnalyticsService(repo)
    start = datetime.utcnow() - timedelta(days=7)
    end = datetime.utcnow()
    return {
        "operational": svc.get_operational_metrics(start, end),
        "rag": svc.get_rag_metrics(start, end),
        "citation_coverage": svc.get_citation_coverage(start, end),
        "raga": svc.get_raga_metrics(start, end),
    }
```

### 4) Expose in API

```python
from fastapi import FastAPI

app = FastAPI()

@app.get("/api/metrics")
def metrics():
    # replace with your Session management
    payload = get_metrics_payload(db_session)
    return payload
```

### 5) Add drift monitoring

```python
from datetime import timedelta
from langmet.analytics import detect_numeric_drift_windowed

drift = detect_numeric_drift_windowed(
    observations=latency_observations,  # list[(timestamp, latency_ms)]
    current_window=timedelta(hours=1),
    baseline_window=timedelta(days=7),
    min_samples_per_window=20,
)
```
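
The windowing step can be pictured as a simple partition of timestamped observations. The sketch below assumes the baseline window ends where the current window begins, so the two do not overlap; whether LangMet's trailing baseline includes the current window is not stated here, and `split_windows` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def split_windows(observations, reference_time, current_window, baseline_window):
    """Partition (timestamp, value) pairs into baseline and current value lists."""
    current_start = reference_time - current_window
    baseline_start = current_start - baseline_window
    baseline, current = [], []
    for ts, value in observations:
        if current_start < ts <= reference_time:
            current.append(value)
        elif baseline_start < ts <= current_start:
            baseline.append(value)
        # Anything older than the baseline window is ignored.
    return baseline, current


ref = datetime(2024, 1, 8, 12, 0, tzinfo=timezone.utc)
observations = [
    (ref - timedelta(days=10), 999.0),    # too old, dropped
    (ref - timedelta(hours=2), 120.0),    # baseline
    (ref - timedelta(minutes=40), 220.0), # current
]
baseline, current = split_windows(
    observations, ref, timedelta(hours=1), timedelta(days=7)
)
# With a minimum of 20 samples per window, this tiny sample would be skipped.
has_enough = min(len(baseline), len(current)) >= 20
print(baseline, current, has_enough)
```

A minimum sample count per window matters because drift statistics computed on a handful of points are mostly noise.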

### 6) Frontend contract

Your UI only needs:

- `GET /api/metrics` for overview cards and tables
- `GET /api/drift` (or drift fields in the same payload) for alerts

Keep response keys stable:

- `operational.overview`
- `rag.overview`
- `citation_coverage`
- `raga.overview`, `raga.scores`, `raga.evaluation_counts`

### 7) Production checklist

- Store timestamps in UTC (`TIMESTAMPTZ`)
- Index `created_at` on telemetry tables
- Add cache TTL for dashboard polling endpoints
- Define alert thresholds for:
  - latency percentiles (`p95`, `p99`)
  - error rate
  - drift (`psi`, `tvd`)
- Add data retention policy (for example 30–90 days hot storage)
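
For the cache TTL item, a small in-process cache is often enough for a single dashboard endpoint. The sketch below is one possible shape, not part of LangMet: `TTLCache` is a hypothetical class, it holds a single payload, and it is not thread-safe.

```python
import time
from typing import Any, Callable


class TTLCache:
    """Cache a single computed payload for ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._expires_at = 0.0
        self._value: Any = None
        self._has_value = False

    def get(self, compute: Callable[[], Any]) -> Any:
        """Return the cached value, recomputing it once the TTL has elapsed."""
        now = self._clock()
        if not self._has_value or now >= self._expires_at:
            self._value = compute()
            self._expires_at = now + self._ttl
            self._has_value = True
        return self._value


calls = []


def build_payload() -> dict:
    # Stand-in for an expensive metrics query.
    calls.append(1)
    return {"operational": {"overview": {}}}


cache = TTLCache(ttl_seconds=30)
first = cache.get(build_payload)
second = cache.get(build_payload)  # served from cache, no second query
print(len(calls))
```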

## Core Concepts

- `langmet.models`: event contracts used by analytics
- `langmet.analytics`: pure computation functions
- `langmet.ports`: repository protocol your project can implement
- `langmet.service`: orchestration facade
- `langmet.adapters`: optional infrastructure adapters
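
The split between ports, adapters, and pure analytics can be illustrated without LangMet itself. In the sketch below, `Event`, `EventSource`, `ListEventSource`, and `average_latency` are all hypothetical stand-ins; the real `MetricsRepository` protocol defines its own method names, so treat this only as a picture of the pattern.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional, Protocol


@dataclass
class Event:
    latency_ms: float
    created_at: datetime


class EventSource(Protocol):
    """Hypothetical port: anything that can list events in a time range."""

    def list_events(self, start: datetime, end: datetime) -> list[Event]: ...


class ListEventSource:
    """Adapter backed by a plain list; a SQL adapter would expose the same method."""

    def __init__(self, events: list[Event]):
        self._events = events

    def list_events(self, start: datetime, end: datetime) -> list[Event]:
        return [e for e in self._events if start <= e.created_at <= end]


def average_latency(source: EventSource, start: datetime, end: datetime) -> Optional[float]:
    """Pure analytics over whatever the source returns; no storage knowledge here."""
    events = source.list_events(start, end)
    if not events:
        return None
    return sum(e.latency_ms for e in events) / len(events)


now = datetime(2024, 1, 8, tzinfo=timezone.utc)
source = ListEventSource([
    Event(100.0, now - timedelta(days=1)),
    Event(300.0, now - timedelta(days=2)),
    Event(9999.0, now - timedelta(days=30)),  # outside the 7-day range
])
avg = average_latency(source, now - timedelta(days=7), now)
print(avg)
```

Because the analytics function depends only on the protocol, the same computation can be tested against an in-memory list and run in production against a database-backed adapter.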

## Development

```bash
pip install -e ".[dev,sqlalchemy]"
ruff check .
pytest
python -m build
twine check dist/*
```

## License

MIT
