Metadata-Version: 2.4
Name: mca-sdk
Version: 0.8.28
Summary: Model Collector Agent SDK for OpenTelemetry monitoring
Author-email: Baptist Health South Florida <ai-ml@baptisthealth.net>
License: Apache-2.0
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: opentelemetry-api>=1.35.0
Requires-Dist: opentelemetry-sdk>=1.35.0
Requires-Dist: opentelemetry-exporter-otlp-proto-http>=1.35.0
Requires-Dist: protobuf<6.0,>=5.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: pybreaker>=1.4.1
Requires-Dist: cryptography>=46.0.5
Requires-Dist: requests>=2.31.0
Provides-Extra: genai
Requires-Dist: litellm>=1.17.0; extra == "genai"
Provides-Extra: vendor
Requires-Dist: flask>=3.0.0; extra == "vendor"
Provides-Extra: prometheus
Requires-Dist: opentelemetry-exporter-prometheus>=0.48b0; extra == "prometheus"
Provides-Extra: gcp-auth
Requires-Dist: google-auth<3.0.0,>=2.27.0; extra == "gcp-auth"
Provides-Extra: gcp-secrets
Requires-Dist: google-cloud-secret-manager<3.0.0,>=2.26.0; extra == "gcp-secrets"
Provides-Extra: gcp-trace
Requires-Dist: google-cloud-trace<2.0.0,>=1.11.0; extra == "gcp-trace"
Requires-Dist: google-auth<3.0.0,>=2.27.0; extra == "gcp-trace"
Provides-Extra: gcp-logging
Requires-Dist: google-cloud-logging<4.0.0,>=3.13.0; extra == "gcp-logging"
Provides-Extra: gcp
Requires-Dist: google-auth<3.0.0,>=2.27.0; extra == "gcp"
Requires-Dist: google-cloud-secret-manager<3.0.0,>=2.26.0; extra == "gcp"
Requires-Dist: google-cloud-logging<4.0.0,>=3.13.0; extra == "gcp"
Requires-Dist: google-cloud-trace<2.0.0,>=1.11.0; extra == "gcp"
Provides-Extra: autolog
Requires-Dist: scikit-learn==1.7.2; extra == "autolog"
Requires-Dist: xgboost>=2.0.0; extra == "autolog"
Requires-Dist: lightgbm>=4.0.0; extra == "autolog"
Requires-Dist: numpy>=1.24.0; extra == "autolog"
Provides-Extra: instrument
Requires-Dist: opentelemetry-instrumentation>=0.48b0; extra == "instrument"
Requires-Dist: opentelemetry-distro>=0.48b0; extra == "instrument"
Provides-Extra: test
Requires-Dist: pytest>=7.4.0; extra == "test"
Requires-Dist: pytest-cov>=4.1.0; extra == "test"
Requires-Dist: pytest-mock>=3.12.0; extra == "test"
Requires-Dist: pytest-asyncio>=0.23.0; extra == "test"
Requires-Dist: opentelemetry-exporter-prometheus>=0.48b0; extra == "test"
Provides-Extra: dev
Requires-Dist: black>=24.0.0; extra == "dev"
Requires-Dist: mypy>=1.8.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Requires-Dist: pytest>=7.4.0; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: pytest-mock>=3.12.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.23.0; extra == "dev"
Requires-Dist: opentelemetry-exporter-prometheus>=0.48b0; extra == "dev"
Provides-Extra: all
Requires-Dist: litellm>=1.17.0; extra == "all"
Requires-Dist: flask>=3.0.0; extra == "all"
Requires-Dist: google-auth>=2.27.0; extra == "all"
Requires-Dist: google-cloud-secret-manager>=2.26.0; extra == "all"
Requires-Dist: google-cloud-logging<4.0.0,>=3.13.0; extra == "all"
Requires-Dist: google-cloud-trace<2.0.0,>=1.11.0; extra == "all"
Requires-Dist: opentelemetry-exporter-prometheus>=0.48b0; extra == "all"
Requires-Dist: scikit-learn==1.7.2; extra == "all"
Requires-Dist: xgboost>=2.0.0; extra == "all"
Requires-Dist: lightgbm>=4.0.0; extra == "all"
Requires-Dist: numpy>=1.24.0; extra == "all"
Requires-Dist: opentelemetry-instrumentation>=0.48b0; extra == "all"
Requires-Dist: opentelemetry-distro>=0.48b0; extra == "all"
Dynamic: license-file
Dynamic: requires-python

# MCA SDK - Model Collector Agent

[![Pipeline Status](https://gitlab.com/bhsf/ai_ml/model-monitoring/sdk/badges/main/pipeline.svg)](https://gitlab.com/bhsf/ai_ml/model-monitoring/sdk/-/pipelines)
[![Security Scan](https://img.shields.io/badge/security-pip--audit-blue)](https://gitlab.com/bhsf/ai_ml/model-monitoring/sdk/-/pipelines)
[![PyPI version](https://img.shields.io/badge/pypi-v0.8.28-blue)](https://pypi.org/project/mca-sdk/)

Production-ready OpenTelemetry SDK for healthcare ML model monitoring. Provides comprehensive instrumentation for Predictive ML, Generative AI, and Agentic AI models with HIPAA-compliant telemetry collection, centralized configuration management, and enterprise security features.

## Architecture

```
┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│ Internal &   │  │  Agentic AI  │  │ Registry     │◄─│  Model Reg.  │
│ GenAI Models │  │  Assistant   │  │ GenAI Model  │  │  (Mock API)  │
│ (Py + SDK)   │  │  (Multi-step)│  │ (Dynamic Cfg)│  └──────────────┘
└──────┬───────┘  └──────┬───────┘  └──────┬───────┘
       │                 │                 │
       │ OTLP/HTTP       │ OTLP/HTTP       │ OTLP/HTTP      ┌──────────────┐
       │ :4318           │ :4318           │ :4318          │  Vendor API  │
       │                 │                 │                │  (FastAPI)   │
       │                 │                 │                └──────┬───────┘
       │                 │                 │                       │ JSON
       │                 │                 │                       ▼
       │                 │                 │              ┌────────────────┐
       │                 │                 │              │ Vendor Bridge  │
       │                 │                 │              │ (JSON→OTLP)    │
       │                 │                 │              └────────┬───────┘
       │                 │                 │                       │ OTLP
       ▼                 ▼                 ▼                       ▼ :4318
    ┌──────────────────────────────────────────────────────────────────────┐
    │                 OpenTelemetry Collector (Port 4318)                  │
    │                                                                      │
    │   ┌──────────┐      ┌────────────────┐      ┌──────────────┐         │
    │   │  Batch   │  →   │  Attributes    │  →   │    Debug     │         │
    │   │Processor │      │  Processor     │      │  Exporter    │         │
    │   │(10s/100) │      │(region, env)   │      │  (stdout)    │         │
    │   └──────────┘      └────────────────┘      └──────────────┘         │
    └──────────────────────────────────────────────────────────────────────┘
                                       │
                                       ▼
                              Docker Logs / Console
                            (Simulates GCP Backend)
```

## What's New in v0.8.28

### `association_id` for Prediction and Error Linking

`record_prediction()` and `record_error()` now accept an optional `association_id` keyword argument that links telemetry events to downstream business records or time-series entities:

```python
client.record_prediction(
    latency=0.15,
    prediction_outcome="positive",
    association_id="encounter-789"
)

client.record_error(error, latency=0.20, association_id="encounter-789")
```

The value is emitted as the `ml.association.id` span attribute and included in structured logs. Existing call sites require no changes.

---

## Using GCP Development Infrastructure

The MCA SDK supports testing against real GCP services (Cloud Logging, Cloud Trace, Google-Managed Prometheus) before deploying to production. The GCP dev environment provides a shared OpenTelemetry Collector at `10.164.76.55` (internal VPC) that exports to GCP project `bhsf-mca-dev`.

### Quick Start

**Docker Compose:**
```bash
cd mca-prototype
docker-compose -f docker-compose.yml -f docker-compose.gcp-dev.yml up internal-model
```

**Local Python:**
```bash
export MCA_COLLECTOR_ENDPOINT=http://10.164.76.55:4318
export MCA_COLLECTOR_PROTOCOL=http
export MCA_ALLOW_INSECURE_COLLECTOR=true
export MCA_SERVICE_NAME=your-model
export MCA_MODEL_ID=your-model-id
export MCA_TEAM_NAME=your-team
python your_model.py
```

### Environment Variables

| Variable | Value for GCP Dev | Description |
|----------|------------------|-------------|
| `MCA_COLLECTOR_ENDPOINT` | `http://10.164.76.55:4318` | GCP dev collector HTTP endpoint |
| `MCA_COLLECTOR_PROTOCOL` | `http` | Protocol (http or grpc) |
| `MCA_ALLOW_INSECURE_COLLECTOR` | `true` | Allow HTTP for internal endpoint |

### Verification

Check your telemetry in GCP Console:
- **Logs**: [https://console.cloud.google.com/logs/query?project=bhsf-mca-dev](https://console.cloud.google.com/logs/query?project=bhsf-mca-dev)
- **Traces**: [https://console.cloud.google.com/traces/list?project=bhsf-mca-dev](https://console.cloud.google.com/traces/list?project=bhsf-mca-dev)

Or use gcloud CLI:
```bash
gcloud logging read "logName=projects/bhsf-mca-dev/logs/emms-model-telemetry" --limit=10
gcloud trace list --project=bhsf-mca-dev --limit=10
```

### Prerequisites

- **VPC Access**: The collector endpoint is internal (10.164.76.55). Use VPN if testing from outside the VPC.
- **Connectivity Test**: `curl http://10.164.76.55:4318/` should succeed

For complete setup instructions, troubleshooting, and Kubernetes deployment, contact your organization for access to internal documentation.

## Installation

### Requirements
- **Python**: 3.10 or higher
- **OpenTelemetry**: 1.35.0 or higher (auto-installed)

### Basic Installation
Install the MCA SDK from PyPI using pip:

```bash
pip install mca-sdk
```

### Installation with Optional Features
The SDK includes optional dependencies for specific use cases:

```bash
# For zero-code auto-instrumentation (FastAPI, requests, etc.)
pip install "mca-sdk[instrument]"
opentelemetry-bootstrap -a install  # Auto-installs specific instrumentors

# For ML framework auto-instrumentation (Scikit-learn, XGBoost, LightGBM)
pip install "mca-sdk[instrument,autolog]"

# For GenAI/LLM monitoring (includes LiteLLM)
pip install "mca-sdk[genai]"

# For vendor integration (includes Flask)
pip install "mca-sdk[vendor]"

# For GCP Cloud Trace and Cloud Logging exporters
pip install "mca-sdk[gcp]"

# For development (includes pytest, black, mypy, etc.)
pip install "mca-sdk[dev]"

# All optional dependencies
pip install "mca-sdk[all]"
```

### Version Pinning (Recommended for Production)
Pin to a specific version for reproducible deployments:

```bash
# Install exact version
pip install mca-sdk==0.8.28

# Install with version constraints
pip install "mca-sdk>=0.6.7,<1.0.0"
```

### Verify Installation
After installation, verify the SDK is correctly installed:

```bash
# Check installed version
pip show mca-sdk

# Test import
python -c "from mca_sdk import MCAClient; print('MCA SDK installed successfully')"
```

### From Source (Development Setup)
If you are developing or testing the SDK locally:

```bash
# Clone the repository
git clone <repository-url>
cd sdk/mca-prototype

# Run the automated setup script to install dependencies, pre-commit hooks, and build containers
make dev-setup

# OR manually:
python -m venv venv
source venv/bin/activate
pip install -e ".[dev]"
```

## Instrumentation Approaches

Once the SDK is installed, you can choose from three approaches to instrument your models:

### Approach 1: AI Assistant Integration (Highly Recommended)
If you are using an agentic coding assistant (like Claude Code, Gemini CLI, Cursor, or ChatGPT), the fastest and most robust way to instrument your application is to have the AI do it for you. The package ships with a `FOR_AI_ASSISTANTS.md` guide.
Simply paste this prompt to your AI assistant:
> "Please review my codebase and instrument my models using the MCA SDK. You can find the instructions on how to do this by reading the guide located at `mca_sdk/FOR_AI_ASSISTANTS.md` within the installed package."

### Approach 2: Zero-Code Auto-Instrumentation (Recommended)
This approach utilizes the `mca-instrument` CLI wrapper to automatically instrument your application's external calls and supported framework inferences without touching your source code.
**Prerequisites**: Install the SDK with instrumentation dependencies: `pip install "mca-sdk[instrument]"` and run `opentelemetry-bootstrap -a install`.

**Usage via CLI Wrapper (`mca-instrument`):**
```bash
export MCA_MODEL_ID="your-model-id"
export MCA_TEAM_NAME="your-team-name"
export MCA_COLLECTOR_ENDPOINT="http://localhost:4318" 

mca-instrument -- python sample_model.py
```

**Usage in Docker / Continuous Deployment (e.g., Vertex AI):**
For models deployed as long-running, live services, modify the `CMD` or `ENTRYPOINT` in your `Dockerfile` to wrap the server startup command with `mca-instrument`:
```dockerfile
# Example: Auto-instrumenting a FastAPI app deployed with Uvicorn
CMD ["mca-instrument", "--", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]
```
Provide configuration via environment variables in your deployment manifest. The background batch processor automatically handles continuous export without blocking prediction latency, and hooks into `SIGTERM` for graceful shutdown flushing.

**Important Notes:**
- **Capturing Logs**: OpenTelemetry's zero-code auto-instrumentation hooks into the standard Python `logging` library. **It does NOT capture `print()` statements.** To ensure outputs appear in GCP Cloud Logging, replace `print()` calls with standard `logging.info()`, `logging.error()`, etc.
- **Short-lived Scripts**: If wrapping a fast, short-lived Python script, add `import time; time.sleep(5)` to the end to allow final network exports to complete.
- **GCP Exporters**: To route telemetry to GCP buckets, you must set `MCA_GCP_LOGGING_ENABLED=true` and `MCA_GCP_TRACE_ENABLED=true` environment variables.
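
To make the first note concrete, here is a stdlib-only sketch. The `ListHandler` below is just a stand-in for the OpenTelemetry logging hook, so you can see which of the two lines it would capture:

```python
import logging

class ListHandler(logging.Handler):
    """Collects log records so we can see what a logging hook observes."""
    def __init__(self):
        super().__init__()
        self.records = []
    def emit(self, record):
        self.records.append(record.getMessage())

log = logging.getLogger("my-model")
log.setLevel(logging.INFO)
handler = ListHandler()
log.addHandler(handler)

log.info("prediction complete score=%.2f", 0.87)  # visible to logging hooks
print("prediction complete")                      # plain stdout: NOT visible

print(handler.records)  # ['prediction complete score=0.87']
```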

### Approach 3: Manual Instrumentation
If you are not using standard HTTP/DB frameworks, **if you are working in a Jupyter Notebook**, or if you need precise programmatic control, use the SDK's Python API directly in your code.

```python
from mca_sdk import MCAClient

client = MCAClient(
    service_name="my-model",
    model_id="mdl-001",
    team_name="data-science"
)
```
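
Building on the client above, the typical pattern is to time the inference yourself and report the outcome. The sketch below uses a hypothetical `StubClient` in place of a real `MCAClient` so it runs without the SDK installed; the `record_prediction`, `record_error`, and `shutdown` calls follow the signatures documented in the changelog above:

```python
import time

class StubClient:
    """Stand-in for MCAClient so this sketch is self-contained; in real
    code, construct the client as shown above and keep the same calls."""
    def __init__(self):
        self.events = []
    def record_prediction(self, latency, prediction_outcome, association_id=None):
        self.events.append(("prediction", prediction_outcome, association_id))
    def record_error(self, error, latency, association_id=None):
        self.events.append(("error", repr(error), association_id))
    def shutdown(self):
        self.events.append(("shutdown", None, None))

client = StubClient()
start = time.perf_counter()
try:
    outcome = "positive"  # your model's inference result goes here
    client.record_prediction(
        latency=time.perf_counter() - start,
        prediction_outcome=outcome,
        association_id="encounter-789",  # optional business-record link
    )
except Exception as err:
    client.record_error(err, latency=time.perf_counter() - start)
    raise
finally:
    client.shutdown()  # flushes any buffered telemetry

print(client.events)
```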

## Troubleshooting

### Import Error After Installation
```bash
# Verify installation
pip show mca-sdk
# Check Python version (requires 3.10+)
python --version
# Test import
python -c "from mca_sdk import MCAClient"
```

### Dependency Conflicts
Use a virtual environment to isolate dependencies:
```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install mca-sdk
```

### OpenTelemetry Version Conflicts
```bash
# Check installed versions
pip list | grep opentelemetry
# Upgrade if needed
pip install --upgrade mca-sdk
```

## Quick Start

### Prerequisites
- Docker and Docker Compose installed
- Python 3.10+ (for running tests and standalone examples)
- No GCP account needed (uses debug exporter)

### Step 1: Start the Stack
```bash
cd mca-prototype
docker-compose up
```

Expected output indicators:
- `mca-otel-collector` container starts and shows collector startup
- `mca-vendor-api` shows FastAPI startup on port 8080
- `mca-vendor-bridge` begins polling and exporting metrics every 30s

### Step 2: Run the Demo Model

#### Option A: Using PyPI Package (Recommended)
In another terminal, install mca-sdk from PyPI and run the standalone demo:
```bash
# Install package
pip install mca-sdk

# Run standalone demo (can be executed from any directory)
python mca-prototype/sdk-examples/internal-model/instrumented_model.py
```

#### Option B: Development Mode
Install dependencies and run example from repository:
```bash
cd mca-prototype
pip install -r sdk-examples/internal-model/requirements.txt
python sdk-examples/internal-model/instrumented_model.py
```

Expected behavior (both options):
- Runs predictions with 1-second intervals
- Prints prediction latency for each iteration
- Sends metrics, logs, and traces to collector
- Flushes all telemetry at completion

### Step 3: Observe the Collector Logs

Look for output in the collector terminal showing received telemetry:

**Metrics from Internal Model**:
```
ResourceMetrics #0
Resource attributes:
     -> service.name: Str(demo-readmission-model)
     -> model.id: Str(mdl-001)
     -> gcp.region: Str(us-central1)        ← Added by collector
     -> environment: Str(prototype)         ← Added by collector
Metric #0
     -> Name: model_predictions_total
     -> Value: 10
```

**Metrics from Vendor API** (appears every 30 seconds):
```
Resource attributes:
     -> service.name: Str(vendor-sepsis-v2)
     -> model.type: Str(vendor)
     -> gcp.region: Str(us-central1)        ← Added by collector
     -> environment: Str(prototype)         ← Added by collector
Metric #0
     -> Name: model.accuracy
     -> Value: 0.89
```

**Traces from Internal Model**:
```
Span #0
     -> Name: model.predict
     -> Attributes:
          -> model.id: Str(mdl-001)
          -> prediction_id: Str(pred-1234)
```

### Step 4: Run E2E Tests
```bash
# Collector must be running from Step 1
cd mca-prototype
pip install -r requirements.txt
pytest tests/integration/test_e2e_flow.py -v -s
```

Expected output:
- Health check passes
- Counter metric test sends value 42, verifies in logs (waits 12s for batch timeout)
- Histogram test sends 5 values, verifies in logs
- Attribute enrichment test confirms `gcp.region` and `environment` added

### Step 5: Run Unit Tests
```bash
pytest tests/integration/test_sdk_integration.py -v
```

Expected: ~20 tests pass, covering provider initialization, metric operations, graceful failure handling, and resource attribute propagation.

## SDK Features

The MCA SDK provides comprehensive instrumentation capabilities that are designed to work together:

- **mca-init** - Project scaffolding builder (creates local mock collector & config)
- **mca-instrument** - Zero-code instrumentation for network/HTTP/DB frameworks
- **autolog()** - Zero-code instrumentation for ML frameworks (scikit-learn, xgboost)
- **@predict() Decorator** - Manual instrumentation for custom logic
- **LiteLLM Integration** - Native callback for GenAI/LLM monitoring
- **Agentic AI Support** - Goal tracking and tool execution monitoring
- **Registry Integration** - Centralized configuration management
- **Security** - Queue encryption, certificate management, GCP authentication
- **Resilience** - Circuit breakers, retry logic, graceful degradation
- **Buffering** - Dead Letter Queue (DLQ) for failed telemetry

*Think of `mca-init` as the toolbox builder, `mca-instrument` as the wide net for network/infrastructure, `autolog` as the specialized net for ML models, and `@predict` as the sniper rifle for custom logic.*

### The Ideal Workflow
1. Run `mca-init my-project` to generate your local `mca.yaml` and mock OpenTelemetry collector.
2. Inside your script, use `autolog()` to automatically track Scikit-learn or XGBoost inferences.
3. Run your script wrapped with `mca-instrument -- python script.py` to seamlessly capture all surrounding HTTP and database calls into the collector.

---

### CLI Wrappers - Zero-Code Instrumentation

The MCA SDK provides two CLI commands that enable instrumentation without modifying your model code. Simply change how you run your script.

#### mca-instrument - Full OpenTelemetry Auto-Instrumentation

Wraps your script with OpenTelemetry auto-instrumentation for comprehensive telemetry including HTTP requests, database calls, and framework-level operations.

**Installation:**
```bash
pip install "mca-sdk[instrument]"
opentelemetry-bootstrap -a install  # Install framework instrumentors
```

**Usage:**
```bash
# Basic usage - use -- to separate wrapper args from script args
mca-instrument --model-id mdl-001 --team clinical-ai -- python my_model.py

# With all options
mca-instrument \
  --model-id mdl-001 \
  --team clinical-ai \
  --service-name my-service \
  --collector-endpoint http://localhost:4318 \
  --protocol grpc \
  -- python my_model.py --debug

# Using environment variables (higher priority than CLI args)
export MCA_MODEL_ID=mdl-001
export MCA_TEAM_NAME=clinical-ai
mca-instrument -- python my_model.py
```

**Features:**
- Automatic instrumentation of HTTP libraries (requests, urllib3, httpx)
- Database query tracking (psycopg2, pymongo, redis)
- Framework instrumentation (Flask, FastAPI, Django)
- No code changes required in your model script
- POSIX-compliant signal handling and exit codes
- Timeout protection prevents hanging on collector failures

**Configuration Options:**
- `--model-id` - Model identifier (required, or set MCA_MODEL_ID)
- `--team` - Team name (required, or set MCA_TEAM_NAME)
- `--service-name` - Service name (defaults to model-id)
- `--collector-endpoint` - OTel Collector URL (default: http://localhost:4318)
- `--protocol` - OTLP protocol: http/protobuf or grpc (default: http/protobuf)
- `--registry-url` - Model Registry API URL (optional)
- `--debug` - Enable debug logging

**Usage in Docker / Continuous Deployment (e.g., Vertex AI)**
For models deployed as long-running, live services (like Vertex AI Endpoints or Kubernetes deployments), zero-code auto-instrumentation is highly robust and operates continuously without blocking prediction latency.
1. **Continuous Export**: You do not need to rely on a "shutdown" event to send data. The SDK uses a background batch processor. While your model is live and handling requests, telemetry data is instantly pushed to an in-memory queue and sent periodically in the background.
2. **Dockerfile Integration**: To use zero-code auto-instrumentation in a container, modify the `CMD` or `ENTRYPOINT` in your `Dockerfile` to wrap the server startup command with `mca-instrument`:
   ```dockerfile
   # Example: Auto-instrumenting a FastAPI app deployed with Uvicorn
   CMD ["mca-instrument", "--", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]
   ```
3. **Configuration via Environment Variables**: Pass the required configuration to the container at deployment time via environment variables. In platforms like Vertex AI or Kubernetes, set these in your deployment manifest.
4. **Graceful Shutdown**: If the platform scales your container down to zero or restarts it, it sends a termination signal (e.g., `SIGTERM`). `mca-instrument` automatically hooks into this shutdown sequence to perform a final flush of any lingering data in the queue before the container terminates, preventing data loss.

**Important Notes:**
- **Capturing Logs**: OpenTelemetry's zero-code auto-instrumentation specifically hooks into the standard Python `logging` library. **It does NOT capture `print()` statements.** To ensure your outputs appear in GCP Cloud Logging, replace `print()` calls in your scripts with standard `logging.info()`, `logging.warning()`, or `logging.error()`.
- **Short-lived Scripts**: If you are wrapping a fast, short-lived Python script, add `import time; time.sleep(5)` to the very end so the final network exports can complete before the process exits.
- Requires `opentelemetry-instrument` on PATH (installed via [instrument] extra)
- Use `--` separator to avoid argument conflicts with your script
- Works with Python scripts, bash scripts, uvicorn, gunicorn, etc.
- Respects virtual environments (no sys.executable assumptions)

#### mca-run - MCA Client Auto-Initialization

Simpler variant that initializes MCAClient without OpenTelemetry auto-instrumentation. Useful when you only need MCA SDK telemetry without framework-level instrumentation.

**Installation:**
```bash
pip install mca-sdk  # No extra dependencies needed
```

**Usage:**
```bash
# Same arguments as mca-instrument
mca-run --model-id mdl-001 --team clinical-ai -- python my_model.py

# Works with any executable
mca-run --model-id mdl-001 --team test -- bash train.sh
```

**Features:**
- Injects MCA_* environment variables
- Auto-initializes MCAClient via PYTHONPATH injection
- 3-second shutdown timeout prevents CI pipeline hangs
- Works with any executable (not just Python)
- No OpenTelemetry dependencies required

**How It Works:**
1. Creates a temporary `sitecustomize.py` that imports `mca_sdk.cli._bootstrap`
2. Injects that directory into `PYTHONPATH`
3. Runs your command exactly as provided
4. `MCAClient` auto-initializes from environment variables
5. Registers an `atexit` handler with a timeout for clean shutdown
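
The injection mechanism itself is standard Python: the interpreter imports `sitecustomize` automatically at startup if it is found on `PYTHONPATH`. The sketch below demonstrates that behavior with a `print()` standing in for `import mca_sdk.cli._bootstrap` (an illustration of the technique, not the SDK's actual code):

```python
import os
import subprocess
import sys
import tempfile

# Drop a sitecustomize.py on PYTHONPATH; the child process imports it
# automatically at startup, before its own code runs.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "sitecustomize.py"), "w") as f:
        f.write("print('bootstrap ran')\n")  # stand-in for the MCA bootstrap

    env = dict(os.environ)
    env["PYTHONPATH"] = tmp + os.pathsep + env.get("PYTHONPATH", "")

    # The child command is unmodified, yet the bootstrap executes first.
    result = subprocess.run(
        [sys.executable, "-c", "print('model code')"],
        env=env, capture_output=True, text=True,
    )

print(result.stdout)  # bootstrap ran, then model code
```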

**Use Cases:**
- Legacy scripts you can't modify
- Quick prototyping without code changes
- Scripts that already use MCA SDK but need environment setup
- Non-Python executables that call Python scripts

#### Working with Multi-File Projects and Directories

The CLI wrapper operates at the **process level**, not the file/directory level. It does NOT automatically scan directories or identify Python files to instrument.

**What Gets Instrumented:**
- The Python process you start
- All modules imported by that process
- HTTP libraries (requests, urllib3, httpx)
- Database drivers (psycopg2, pymongo, redis)
- Web frameworks (Flask, FastAPI, Django)

**What Does NOT Get Instrumented:**
- Files not imported by your entry point
- Standalone scripts not executed by your command
- Files in the directory that aren't run

**Scenario 1: Project with Imports (WORKS AUTOMATICALLY)**

Project structure:
```
myproject/
├── main.py         # Entry point
├── models.py       # Imported by main.py
├── utils.py        # Imported by main.py
└── config.yaml     # Loaded by main.py
```

Command:
```bash
mca-instrument -- python main.py
```

Result: main.py, models.py, utils.py all instrumented via imports. YAML file reading is instrumented if using instrumented libraries.

**Scenario 2: Multiple Standalone Scripts (DOESN'T WORK AUTOMATICALLY)**

Project structure:
```
scripts/
├── train.py        # Standalone script
├── evaluate.py     # Standalone script
└── deploy.py       # Standalone script
```

Problem:
```bash
mca-instrument -- python train.py
# Only train.py is instrumented
```

Solutions:
```bash
# Option 1: Run each separately
mca-instrument -- python train.py
mca-instrument -- python evaluate.py
mca-instrument -- python deploy.py

# Option 2: Create wrapper script (RECOMMENDED)
# run_pipeline.sh:
#   python train.py
#   python evaluate.py
#   python deploy.py

mca-instrument -- bash run_pipeline.sh
# All three scripts instrumented (inherit environment)
```

**Scenario 3: Mixed File Types (NON-PYTHON FILES IGNORED)**

Project structure:
```
project/
├── main.py         # Entry point
├── helper.py       # Imported by main.py
├── config.yaml     # Loaded at runtime
├── data.csv        # Read by pandas
└── README.md       # Documentation
```

Command:
```bash
mca-instrument -- python main.py
```

Result:
- main.py and helper.py instrumented
- config.yaml, data.csv reading instrumented if using instrumented libraries
- README.md ignored (not code)

**Scenario 4: Web Server (ENTIRE APPLICATION INSTRUMENTED)**

Project structure:
```
api/
├── app.py          # FastAPI application
├── routes/
│   ├── users.py    # Imported by app.py
│   └── models.py   # Imported by app.py
└── database.py     # Imported by routes
```

Command:
```bash
mca-instrument -- uvicorn api.app:app
```

Result: Entire application instrumented, including all HTTP requests, database queries, and route handlers.

**Scenario 5: Python Module Execution**

Project structure:
```
mypackage/
├── __main__.py     # Entry point for -m
├── core.py         # Imported by __main__
└── utils.py        # Imported by core
```

Command:
```bash
mca-instrument -- python -m mypackage
```

Result: All modules in the package instrumented via import chain.

**Key Takeaway:**
The CLI wrapper instruments whatever **command** you provide. For multi-file projects, ensure all files are either:
1. Imported by your entry point
2. Executed sequentially in a wrapper script
3. Run as separate CLI wrapper invocations

#### CLI Best Practices

**Argument Separation:**
Always use `--` to separate wrapper arguments from script arguments:
```bash
# Good - explicit separation
mca-instrument --model-id mdl-001 --team test -- python script.py --debug

# Bad - may cause conflicts
mca-instrument --model-id mdl-001 --team test python script.py --debug
```

**Environment Variable Precedence:**
Environment variables take precedence over CLI arguments:
```bash
export MCA_MODEL_ID=from-env
mca-instrument --model-id from-cli --team test -- python script.py
# Uses: from-env (environment variable wins)
```
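
The precedence rule reduces to a one-line lookup. The helper below is hypothetical (the wrapper's real resolver may differ), but mirrors the documented behavior:

```python
import os

def resolve_model_id(cli_value=None, env=os.environ):
    """Hypothetical resolver mirroring the documented precedence:
    the MCA_MODEL_ID environment variable wins over the CLI flag."""
    return env.get("MCA_MODEL_ID") or cli_value

print(resolve_model_id("from-cli", env={"MCA_MODEL_ID": "from-env"}))  # from-env
print(resolve_model_id("from-cli", env={}))                            # from-cli
```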

**Service Name Defaulting:**
If not specified, service name defaults to model ID:
```bash
mca-instrument --model-id mdl-001 --team test -- python script.py
# Results in: MCA_SERVICE_NAME=mdl-001
```
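
The defaulting rule amounts to a simple fallback, sketched here as a hypothetical helper:

```python
def resolve_service_name(model_id, service_name=None):
    """Hypothetical helper mirroring the documented default:
    fall back to the model ID when no service name is given."""
    return service_name or model_id

print(resolve_service_name("mdl-001"))            # mdl-001
print(resolve_service_name("mdl-001", "my-svc"))  # my-svc
```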

**Error Handling:**
Both commands validate required arguments and exit with code 2 if validation fails:
```bash
mca-instrument --team test -- python script.py
# Error: Missing required configuration: model ID
# Exit code: 2
```

**Signal Handling:**
Proper POSIX signal handling with timeout escalation:
- SIGTERM/SIGINT forwarded to child process
- 5-second graceful shutdown timeout
- Escalates to SIGKILL if child doesn't exit
- Returns exit code 128 + signal number for signal termination
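The escalation logic above can be sketched with the stdlib. This is an illustrative reimplementation of the documented behavior, not the SDK's actual code:

```python
import signal
import subprocess
import sys

GRACE_SECONDS = 5  # documented graceful-shutdown window

def run_with_signal_forwarding(cmd):
    """Forward SIGTERM/SIGINT to the child, wait up to GRACE_SECONDS,
    then escalate to SIGKILL; return a POSIX-style exit code."""
    child = subprocess.Popen(cmd)

    def forward(signum, _frame):
        child.send_signal(signum)
        try:
            child.wait(timeout=GRACE_SECONDS)
        except subprocess.TimeoutExpired:
            child.kill()  # escalate when the child ignores the signal

    signal.signal(signal.SIGTERM, forward)
    signal.signal(signal.SIGINT, forward)

    rc = child.wait()
    # Popen.returncode is -N when the child died from signal N,
    # so report the conventional 128 + N in that case.
    return 128 - rc if rc < 0 else rc

rc = run_with_signal_forwarding([sys.executable, "-c", "print('child done')"])
print("exit code:", rc)
```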

#### Troubleshooting CLI Commands

**opentelemetry-instrument not found:**
```bash
pip install "mca-sdk[instrument]"
opentelemetry-bootstrap -a install
which opentelemetry-instrument  # Verify on PATH
```

**Command not executing:**
- Use `--` separator
- Check that command is executable
- Verify command is on PATH or use full path

**Collector unreachable:**
- Verify collector endpoint: `curl http://localhost:4318/`
- Check network connectivity
- Review collector logs for errors
- Commands have 3-5 second timeout protection

**Environment variable conflicts:**
- List current environment: `env | grep MCA_`
- Remember: environment > CLI arguments
- Unset conflicting vars: `unset MCA_MODEL_ID`

### autolog() - Zero-Code Instrumentation

The `autolog()` function provides automatic instrumentation for popular ML frameworks without requiring code changes. Simply call `autolog()` after initializing MCAClient, and all predictions from supported frameworks will automatically emit OpenTelemetry metrics and traces.

**Supported Frameworks:**
- **scikit-learn**: `predict()`, `predict_proba()`, `fit()`
- **XGBoost**: `Booster.predict()`, `XGBClassifier.predict()`, `XGBRegressor.predict()`
- **LightGBM**: `Booster.predict()`, `LGBMClassifier.predict()`, `LGBMRegressor.predict()`

#### Basic Usage

```python
from mca_sdk import MCAClient, autolog
from sklearn.ensemble import RandomForestClassifier

# Initialize client first
client = MCAClient(
    service_name="ml-service",
    model_id="model-v1",
    team_name="ml-team"
)

# Enable autolog (one line!)
autolog()

# Now sklearn predictions are automatically instrumented
model = RandomForestClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)  # Automatically tracked!

client.shutdown()
```

#### Configuration Options

```python
# Enable for specific frameworks only
autolog(frameworks=["sklearn", "xgboost"])

# Exclude specific frameworks
autolog(exclude=["lightgbm"])

# Capture input/output data with size limits (use with caution - PHI risk)
# Payloads larger than max_payload_size will be truncated to prevent memory bloat
autolog(capture_input=True, capture_output=True, max_payload_size=5000)

# Note: Invalid framework names will raise ValueError
# Supported: sklearn, xgboost, lightgbm
```

#### Telemetry Generated

**Metrics:**
- `model.prediction.count`: Counter for prediction calls
- `model.prediction.latency`: Histogram of prediction latencies
- Attributes: `framework`, `model_class`, `method`

**Traces:**
- Span name: `{framework}.{model_class}.{method}`
- Example: `sklearn.RandomForestClassifier.predict`

#### Important Notes

1. **MCAClient must be initialized before autolog()**: The autolog() function requires an active MCAClient instance. If predictions are made without an initialized client, a warning is logged once (to avoid spam) and predictions execute without telemetry.
2. **Framework detection**: Only patches frameworks that are imported in `sys.modules`
3. **PHI considerations**: Be cautious with `capture_input=True`; input data may contain PHI. **The application is responsible for masking/scrubbing PHI/PII** before sending data. Downstream, DLP identifies columns containing PII/PHI and flags any that were not marked as PHI/PII columns during model registration. Use `max_payload_size` to limit captured data size.
4. **Payload size limits**: Large predictions (common in batch ML) are automatically truncated to `max_payload_size` (default: 10KB) to prevent memory bloat and network payload explosion
5. **Thread-safe**: Autolog is thread-safe and prevents double-patching
6. **Error handling**: Exceptions during prediction are properly recorded in spans with ERROR status and re-raised to preserve application behavior
7. **Validation**: Invalid framework names in `frameworks` or `exclude` parameters raise `ValueError` immediately
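
The truncation behavior in note 4 might look like the following sketch (a hypothetical helper, not the SDK's actual code; it assumes the 10KB default applies to the length of the serialized string):

```python
def truncate_payload(payload, max_payload_size=10_240):
    """Serialize a captured payload and cap its size (illustrative sketch)."""
    text = repr(payload)  # assumption: repr-based serialization
    if len(text) <= max_payload_size:
        return text
    # Keep a prefix and mark the cut so consumers know data was dropped
    return text[:max_payload_size] + "...[truncated]"

print(truncate_payload({"score": 0.91}))  # small payloads pass through intact
print(truncate_payload(list(range(10_000)), max_payload_size=100)[-14:])  # '...[truncated]'
```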

See [autolog demo](sdk-examples/autolog-demo/) for complete examples.

### @predict() Decorator

The MCA SDK provides a `predict()` decorator on `MCAClient` for automatic instrumentation of prediction functions. The decorator captures inputs, outputs, latency, and errors without manual metric recording.

#### Basic Usage

```python
from mca_sdk import MCAClient

client = MCAClient(
    service_name="my-ml-service",
    model_id="model-v1",
    team_name="ml-team"
)

@client.predict()
def make_prediction(features: dict) -> dict:
    # Your prediction logic
    score = sum(features.values()) * 0.1
    return {"prediction": "positive" if score > 0.5 else "negative"}

# Decorator automatically tracks:
# - Prediction count (counter)
# - Latency (histogram)
# - Prediction ID (for actuals join)
# - Errors and exceptions
result = make_prediction({"feature1": 2.5, "feature2": 3.8})
```

#### Advanced Configuration

```python
@client.predict(
    span_name="custom_prediction_name",  # Custom trace span name
    capture_input=True,                   # Capture function inputs
    capture_output=True,                  # Capture function outputs
    model_version="2.0",                  # Additional metric attributes
    threshold=0.7
)
def advanced_prediction(data: dict) -> dict:
    return {"result": "ok"}
```

#### Async Function Support

```python
import asyncio

@client.predict()
async def async_prediction(features: dict) -> dict:
    await asyncio.sleep(0.01)  # Async operations
    return {"prediction": "result"}

result = await async_prediction({"feature1": 1.0})
```

**IMPORTANT - Async Environment Best Practices:**

When using the MCA SDK in async environments (FastAPI, Starlette, aiohttp), follow these patterns to avoid blocking the event loop:

**DO:**
- Initialize `MCAClient` during application startup (sync context)
- Enable background refresh: `refresh_interval_secs > 0`
- Use `@predict()` decorator for async functions
- Let the background thread handle config updates

**DON'T:**
- Call `registry_client.fetch_model_config()` from async functions
- Manually refresh config in request handlers
- Access registry directly from FastAPI endpoints

**Example - FastAPI Integration:**
```python
from fastapi import FastAPI
from mca_sdk import MCAClient

app = FastAPI()

# Initialize during startup (sync context)
client = MCAClient(
    service_name="my-api",
    registry_url="https://registry.example.com",
    refresh_interval_secs=600  # Background refresh enabled
)

@app.get("/predict")
@client.predict()  # Decorator handles telemetry
async def predict_endpoint(data: dict):
    # Registry config is cached and refreshed in background
    # No direct registry access - won't block event loop
    result = await some_async_model(data)
    return result
```

The `RegistryClient` uses synchronous locking (`threading.RLock`) and will block the event loop if called from async contexts. Use background refresh to avoid this issue.
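
If a synchronous registry call is truly unavoidable inside an async handler, one mitigation (background refresh remains the recommended pattern) is to offload the call to a worker thread; `fetch_config_sync` below is a stand-in for the blocking call, not an SDK API:

```python
import asyncio
import time

def fetch_config_sync():
    # Stand-in for a blocking call such as registry_client.fetch_model_config()
    time.sleep(0.05)  # simulates network I/O plus lock acquisition
    return {"thresholds": {"latency_warn_ms": 500}}

async def handler():
    # asyncio.to_thread runs the sync call in a worker thread,
    # so other coroutines keep running while it blocks
    config = await asyncio.to_thread(fetch_config_sync)
    return config["thresholds"]["latency_warn_ms"]

print(asyncio.run(handler()))  # 500
```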

#### Multi-threaded Usage

The decorator is thread-safe and generates unique prediction IDs for concurrent calls:

```python
from concurrent.futures import ThreadPoolExecutor

@client.predict()
def threaded_prediction(value: int) -> dict:
    return {"output": value * 2}

with ThreadPoolExecutor(max_workers=10) as executor:
    futures = [executor.submit(threaded_prediction, i) for i in range(100)]
    results = [f.result() for f in futures]
```

#### Performance

The decorator adds minimal overhead (<5ms per prediction) for span creation and metric recording.
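
To sanity-check that claim in your own environment, a rough harness like this can compare a bare function against its decorated counterpart (illustrative only; the `client.predict()` lines are commented out so the sketch runs stand-alone):

```python
import time

def mean_call_ms(fn, n=10_000):
    """Mean wall-clock latency of fn in milliseconds over n calls."""
    start = time.perf_counter()
    for _ in range(n):
        fn()
    return (time.perf_counter() - start) * 1000 / n

def plain(features=None):
    return {"prediction": "ok"}

baseline_ms = mean_call_ms(plain)
# instrumented = client.predict()(plain)                   # with a configured MCAClient
# overhead_ms = mean_call_ms(instrumented) - baseline_ms   # expect < 5 ms
print(f"baseline: {baseline_ms:.4f} ms/call")
```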

## PyPI Package Verification

If you've installed `mca-sdk` from PyPI, you can verify it works correctly without any local repository dependencies.

### Standalone Demo

The standalone demo demonstrates using the PyPI-installed package:

```bash
# Install from PyPI
pip install mca-sdk

# Start collector
cd mca-prototype && docker-compose up otel-collector

# Run demo (works from any directory - no sys.path manipulation)
python mca-prototype/sdk-examples/internal-model/instrumented_model.py
```

**Key Points**:
- No `sys.path.insert()` needed
- Clean imports: `from mca_sdk import MCAClient`
- Works from any directory (not just inside cloned repo)
- All dependencies auto-installed

### PyPI Package Tests

Run verification tests to ensure the package is properly installed:

```bash
# Install package first
pip install mca-sdk

# Run PyPI verification tests
pytest tests/integration/test_pypi_package.py -v
```

These tests verify:
- Package is installed from PyPI (not local)
- Imports work without path manipulation
- Client can be instantiated
- Metrics can be created and recorded
- Dependencies are properly installed
- Optional dependencies (genai, vendor) work if installed
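
A minimal hand-check along the same lines, using only the standard library (illustrative; the bundled tests above are more thorough):

```python
from importlib import metadata, util

def verify_install(module, dist):
    """True if the module is importable and its distribution metadata is present."""
    if util.find_spec(module) is None:
        return False  # not importable at all
    try:
        metadata.version(dist)  # raises if no distribution metadata is found
    except metadata.PackageNotFoundError:
        return False
    return True

print(verify_install("mca_sdk", "mca-sdk"))  # True when installed from PyPI
```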

### Example Requirements Files

All examples have been updated to use the PyPI package:

**internal-model/requirements.txt**:
```txt
mca-sdk>=0.6.7
opentelemetry-semantic-conventions==0.48b0
```

**internal-genai/requirements.txt**:
```txt
mca-sdk[genai]>=0.6.7
tiktoken>=0.5.2
```

**internal-agentic/requirements.txt**:
```txt
mca-sdk>=0.6.7
langchain>=0.1.0
```

**vendor-bridge/requirements.txt**:
```txt
mca-sdk[vendor]>=0.6.7
fastapi>=0.115.6
uvicorn>=0.34.0
```

## Development Setup

For developers contributing to the MCA Prototype, we provide a reproducible development environment with Docker Compose, hot-reload, pre-commit hooks, and IDE configurations.

### Prerequisites
- Docker and Docker Compose
- Python 3.11+
- Git
- Make (optional, but recommended)

### One-Time Setup

```bash
# Clone the repository (if not already done)
git clone <repository-url>
cd sdk  # or the directory you cloned into

# Run the automated setup script
make dev-setup
# OR manually:
./scripts/setup-dev.sh
```

This script will:
1. Create a Python virtual environment
2. Install all Python dependencies (SDK, dev tools, pre-commit)
3. Install pre-commit hooks for automatic code quality checks
4. Build Docker images for all services
5. Create a `.env` file from `.env.example`

**Setup time:** <10 minutes (depending on internet speed)

### Starting Development Services

```bash
# Start all services (collector, examples, vendor API)
make dev-start

# View logs from all services
make logs

# Stop all services
make dev-stop
```

**Available Services:**
- OTel Collector: `http://localhost:4318` (OTLP), `http://localhost:13133` (health)
- Vendor API: `http://localhost:8080`
- Internal Model: Container `mca-internal-model`
- Internal Agentic: Container `mca-internal-agentic`
- GenAI Assistant: Container `mca-genai-assistant`
- Vendor Bridge: Container `mca-vendor-bridge`

### Hot-Reload for Fast Iteration

All services have volume mounts configured for live code reload:
```yaml
volumes:
  - ./mca_sdk:/app/mca_sdk  # Changes to SDK reflected immediately
```

**To test hot-reload:**
1. Start services: `make dev-start`
2. Modify code in `mca_sdk/`
3. Check container logs: `docker logs mca-internal-model -f`
4. Changes are reflected without restarting containers

### Running Tests

```bash
# Run all tests with coverage (requires 85% coverage)
make test

# Run tests manually
pytest tests/ -v --cov=mca-prototype/mca_sdk --cov-fail-under=85
```

### Code Quality & Linting

```bash
# Run all linting checks (Black, isort, pylint, mypy, bandit)
make lint

# Auto-format code (Black + isort)
make format

# Run pre-commit hooks manually
make pre-commit
```

**Pre-commit hooks run automatically** on every `git commit` and check:
- Python formatting (Black)
- Import sorting (isort)
- Python linting (pylint)
- Type checking (mypy)
- Security scanning (Bandit)
- YAML/JSON formatting (Prettier)
- Dockerfile linting (hadolint)

### IDE Setup (VS Code)

VS Code configurations are included in `.vscode/`:
- **settings.json**: Python linting, formatting, testing
- **launch.json**: Debug configurations for tests and examples

**Recommended Extensions:**
- Python (ms-python.python)
- Black Formatter (ms-python.black-formatter)
- Pylance (ms-python.vscode-pylance)
- Docker (ms-azuretools.vscode-docker)
- Prettier (esbenp.prettier-vscode)

**Debug Examples:**
1. Open VS Code
2. Go to Run & Debug (Ctrl+Shift+D)
3. Select debug configuration (e.g., "Python: Debug SDK Internal Model Example")
4. Press F5 to start debugging

### Common Development Tasks

```bash
# Run a specific example locally
source venv/bin/activate
export PYTHONPATH=$(pwd)/mca-prototype
python mca-prototype/sdk-examples/internal-model/instrumented_model.py

# View logs from a specific service
docker logs mca-internal-model -f

# Rebuild a specific service after Dockerfile changes
cd mca-prototype && docker-compose build internal-model

# Clean up build artifacts
make clean

# Full clean including virtual environment
make clean-all
```

### Development Workflow

1. **Create a feature branch**: `git checkout -b feature/your-feature`
2. **Make code changes** in `mca_sdk/` or examples
3. **Run tests locally**: `make test`
4. **Run linting**: `make lint` (or let pre-commit handle it)
5. **Commit changes**: `git commit -m "description"` (pre-commit hooks run automatically)
6. **Push and create PR**: `git push origin feature/your-feature`

### Troubleshooting Development Setup

**Issue: Pre-commit hooks failing**
```bash
# Run hooks manually to see errors
pre-commit run --all-files

# Auto-fix formatting issues
make format

# Update pre-commit hooks
pre-commit autoupdate
```

**Issue: Docker containers not starting**
```bash
# Check Docker daemon is running
docker ps

# Rebuild containers
cd mca-prototype && docker-compose build

# Check logs
docker-compose logs
```

**Issue: Tests failing with import errors**
```bash
# Ensure virtual environment is activated
source venv/bin/activate

# Reinstall dependencies
pip install -r mca-prototype/mca_sdk/requirements.txt
```

**Issue: Port conflicts**
```bash
# Check what's using ports 4318 or 8080
lsof -i :4318
lsof -i :8080

# Stop conflicting services or change ports in docker-compose.yml
```

### Environment Variables

Development environment variables are configured in `.env` (created from `.env.example`):

```bash
# Key development settings
DEBUG_MODE=true
LOG_LEVEL=DEBUG
COLLECTOR_ENDPOINT=http://localhost:4318
REGISTRY_URL=http://localhost:8000  # Mock registry for local dev
```

See `.env.example` for all available configuration options.

## Project Structure

```
.
├── mca-prototype/
│   ├── docker-compose.yml              # Orchestrates local testing services
│   ├── config/
│   │   └── otel-collector-config.yaml  # Collector pipelines: OTLP → Batch → Attributes → Debug
│   ├── mca_sdk/                        # Python SDK source code
│   ├── mca-sdk-nodejs/                 # Node.js SDK source code (WIP)
│   ├── k8s/ & helm/                    # Kubernetes deployment configurations
│   ├── sdk-examples/
│   │   ├── internal-model/             # Demo: Metrics, Logs, Traces instrumentation
│   │   ├── internal-genai/             # GenAI assistant with LiteLLM + MCA SDK
│   │   ├── internal-agentic/           # Medical research agent with multi-step reasoning
│   │   └── vendor-bridge/              # Converts vendor JSON to OTLP metrics
│   └── tests/                          # Comprehensive Pytest suites
│
├── terraform/                          # Terraform modules for GCP infrastructure
│   ├── modules/                        # cloud-logging, cloud-trace, cloud-monitoring, iam, etc.
│   └── main.tf                         # Main infrastructure orchestration
│
├── docs/                               # Additional documentation & integration guides
├── scripts/                            # Utility scripts for development & deployment
└── Makefile                            # Development tasks (lint, test, build, terraform)
```

### Key Components

| Component | Purpose | Port |
|-----------|---------|------|
| **OpenTelemetry Collector** | Receives OTLP data, enriches with metadata, outputs to debug exporter | 4318, 13133 |
| **Internal Model** | Demonstrates full SDK instrumentation (metrics/logs/traces) for predictive ML | - |
| **Internal GenAI** | Demonstrates LLM monitoring with LiteLLM + MCA SDK integration | - |
| **Internal Agentic** | Demonstrates agentic AI with goal tracking, tool execution, and multi-step reasoning | - |
| **Vendor API** | Simulates third-party model API with proprietary JSON format | 8080 |
| **Vendor Bridge** | Converts vendor JSON to OTLP metrics every 30 seconds | - |
| **E2E Tests** | Validates collector receives and processes data | - |
| **Unit Tests** | Tests SDK integration patterns without network | - |

### Data Pipelines

1. **Metrics Pipeline**: `OTLP Receiver` → `Attributes Processor` (adds region/env) → `Batch Processor` (10s/100 metrics) → `Debug Exporter` (stdout)
2. **Logs Pipeline**: Same processors, OTLP logs input
3. **Traces Pipeline**: Same processors, OTLP traces input
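
The flush rule those pipelines share (send when 100 items accumulate or 10 s elapse, whichever comes first) can be sketched as follows (illustrative, not the collector's implementation):

```python
import time

class BatchBuffer:
    """Flush when max_items accumulate or timeout_secs elapse (sketch)."""

    def __init__(self, max_items=100, timeout_secs=10.0):
        self.max_items = max_items
        self.timeout_secs = timeout_secs
        self.items = []
        self.started = time.monotonic()

    def add(self, item):
        """Append an item; return the flushed batch if a limit was hit, else None."""
        self.items.append(item)
        age = time.monotonic() - self.started
        if len(self.items) >= self.max_items or age >= self.timeout_secs:
            return self.flush()
        return None

    def flush(self):
        batch, self.items = self.items, []
        self.started = time.monotonic()
        return batch

buf = BatchBuffer(max_items=3)
assert buf.add("m1") is None and buf.add("m2") is None
print(buf.add("m3"))  # ['m1', 'm2', 'm3'] — size limit reached first
```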

### Enrichment Strategy
- All telemetry signals enriched with `gcp.region: us-central1` and `environment: prototype`
- Demonstrates how to add organizational metadata at collector level
- Resource attributes from application (service name, model ID) preserved
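
In collector terms, this enrichment corresponds to an attributes-processor stanza roughly like the following (a sketch consistent with the pipeline description; the repo's `otel-collector-config.yaml` is authoritative):

```yaml
processors:
  attributes:
    actions:
      - key: gcp.region
        value: us-central1
        action: insert   # insert only sets the key if not already present
      - key: environment
        value: prototype
        action: insert
```

Using `insert` (rather than `upsert`) preserves any attribute the application already set, matching the behavior noted above.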

## Demo Scenarios

### Scenario 1: Internal Model Monitoring
**Use Case**: Hospital's readmission prediction model with full instrumentation

**Steps**:
1. Start collector: `docker-compose up`
2. Run model: `python sdk-examples/internal-model/instrumented_model.py`
3. Show collector logs with metrics, logs, and traces
4. Point out enriched attributes (`gcp.region`, `environment`)

**Key Points**:
- Full observability: metrics (counter/histogram), logs (structured), traces (nested spans)
- Resource attributes identify model, version, team
- Collector adds deployment context automatically

### Scenario 2: Vendor API Integration
**Use Case**: Third-party sepsis model doesn't support OTLP natively

**Steps**:
1. Collector already running from Scenario 1
2. Show vendor API JSON: `curl http://localhost:8080/metrics`
3. Observe bridge logs converting and exporting
4. Show collector receiving vendor metrics with `model.type: vendor` attribute

**Key Points**:
- Bridge pattern for non-OTLP APIs
- Delta calculation for counters (converts 24h rolling count to cumulative)
- Dynamic resource attributes from API response
- Polling every 30 seconds
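
The delta handling above can be sketched as: remember the last observed rolling count and add only positive increases to a cumulative counter, treating a drop as old events falling out of the 24h window (an assumption; the actual bridge logic may differ):

```python
class DeltaCounter:
    """Convert a rolling count into a monotonic cumulative counter (sketch)."""

    def __init__(self):
        self.last_seen = None
        self.cumulative = 0

    def observe(self, rolling_count):
        if self.last_seen is None:
            self.cumulative = rolling_count
        elif rolling_count > self.last_seen:
            # Only the increase since the last poll is new activity
            self.cumulative += rolling_count - self.last_seen
        # A drop means the 24h window rolled off old events; nothing new to count
        self.last_seen = rolling_count
        return self.cumulative

counter = DeltaCounter()
print(counter.observe(100))  # 100
print(counter.observe(150))  # 150 (+50 new predictions)
print(counter.observe(120))  # 150 (window shrank; no new activity)
print(counter.observe(130))  # 160 (+10 new predictions)
```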

### Scenario 3: E2E Validation
**Use Case**: Verify collector pipeline works correctly

**Steps**:
1. Run E2E tests: `pytest tests/integration/test_e2e_flow.py -v -s`
2. Show test sending metrics with known values (42)
3. Show test parsing Docker logs to verify receipt
4. Demonstrate attribute enrichment validation

**Key Points**:
- Tests send real OTLP data to running collector
- Verifies batch processing (12s wait for 10s timeout)
- Log-based verification for manual inspection
- Validates enrichment pipeline

### Scenario 4: GenAI/LLM Monitoring
**Use Case**: Clinical documentation assistant with LLM observability

**Steps**:
1. Services already running from `docker-compose up`
2. Check GenAI logs: `docker logs mca-genai-assistant -f`
3. Observe collector receiving LLM traces with token counts
4. Show custom metrics in collector logs: `docker logs mca-otel-collector | grep genai`

**Key Points**:
- LiteLLM's automatic trace instrumentation for LLM calls
- Token usage tracking (prompt and completion tokens)
- Cost estimation based on token counts
- Latency monitoring for LLM requests
- Mock mode for demo purposes (no API calls)
- Continuous 30-second loop demonstrates ongoing LLM usage patterns

**Expected Telemetry**:
- Metrics: `genai.tokens.prompt`, `genai.tokens.completion`, `genai.request.cost_usd`, `genai.request.latency_seconds`
- Traces: Automatic spans from LiteLLM with model, token counts, and latency
- Resource attributes: `service.name=genai-clinical-assistant`, `model.type=generative`, `llm.provider=openai-mock`

### Scenario 5: Agentic AI with Multi-Step Reasoning
**Use Case**: Medical research assistant agent that uses multiple tools to answer clinical questions

**Steps**:
1. Collector already running from previous scenarios
2. Run agent: `python sdk-examples/internal-agentic/agent_instrumented.py`
3. Watch agent execute multi-step workflow (planning → research → analysis → synthesis)
4. Show agent metrics: `docker logs mca-otel-collector | grep agent`

**Key Points**:
- **Goal Tracking**: Monitors when goals start/complete with success/failure status
- **Tool Execution**: Tracks PubMed searches, drug database queries with latency metrics
- **Multi-Step Reasoning**: Nested spans show planning, research, analysis, synthesis steps
- **Human Intervention**: Tracks when human review is requested
- **Mock Mode**: All tools use predefined responses (no external APIs)

**Expected Telemetry**:
- Metrics:
  - `agent.goals_started_total`, `agent.goals_completed_total` (counters)
  - `agent.tool_calls_total` (counter with tool_name label)
  - `agent.tool_latency_seconds` (histogram per tool)
  - `agent.human_interventions_total` (counter)
  - `agent.reasoning_steps_total` (counter)
- Traces:
  - `agent.goal` (parent span for entire goal)
    - `agent.planning` (search strategy)
    - `agent.tool_execution` (PubMed, drug database)
    - `agent.reasoning` (analysis)
    - `agent.synthesis` (answer creation)
    - `agent.human_intervention` (review request)
- Resource attributes: `service.name=medical-research-agent`, `model.type=agentic`, `team.name=ai-research-team`

## Model Registry Integration

The MCA SDK now supports centralized configuration management through a Model Registry API. This enables:
- Dynamic model metadata and thresholds
- Automatic periodic refresh (default 10 minutes)
- Graceful fallback when registry is unavailable
- Security: HTTPS required for non-localhost, bearer token authentication

### Usage

**With Environment Variables:**
```bash
export MCA_REGISTRY_URL="https://registry.example.com"
export MCA_REGISTRY_TOKEN="your-secret-token"
export MCA_MODEL_ID="mdl-001"
export MCA_MODEL_VERSION="2.0.0"

python your_model.py
```

**With Code:**
```python
from mca_sdk import MCAClient, MCAConfig

config = MCAConfig(
    service_name="readmission-model",
    model_id="mdl-001",
    model_version="2.0.0",
    registry_url="https://registry.example.com",
    registry_token="your-secret-token",
    refresh_interval_secs=600,  # 10 minutes
)

client = MCAClient(config=config)

# Access registry-provided thresholds (latency_ms is measured by your application)
warn_ms = client.thresholds.get("latency_warn_ms")
if warn_ms is not None and latency_ms > warn_ms:
    client.logger.warning("Latency threshold exceeded")

client.shutdown()
```

### Registry API Contract

**Model Config Endpoint:**
```http
GET /models/{model_id}?version=2.0.0
Authorization: Bearer <token>

Response:
{
  "service_name": "readmission-model",
  "model_id": "mdl-001",
  "model_version": "2.0.0",
  "team_name": "clinical-ai",
  "model_type": "internal",
  "thresholds": {
    "latency_warn_ms": 500,
    "error_rate_warn": 0.05
  },
  "extra_resource": {
    "deployment.env": "production"
  }
}
```

**Deployment Config Endpoint (optional):**
```http
GET /deployments/{deployment_id}
Authorization: Bearer <token>

Response:
{
  "deployment_id": "dep-001",
  "environment": "production",
  "region": "us-east-1",
  "resource_overrides": {
    "deployment.zone": "az-1"
  }
}
```
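
Client-side, the model-config response might be parsed into a typed structure along these lines (a sketch; field names follow the contract above, but the SDK's internal types are not shown here):

```python
from dataclasses import dataclass, field

@dataclass
class ModelConfig:
    service_name: str
    model_id: str
    model_version: str
    team_name: str = ""
    model_type: str = "internal"
    thresholds: dict = field(default_factory=dict)
    extra_resource: dict = field(default_factory=dict)

    @classmethod
    def from_response(cls, payload):
        # Tolerate missing optional sections rather than failing the refresh
        return cls(
            service_name=payload["service_name"],
            model_id=payload["model_id"],
            model_version=payload["model_version"],
            team_name=payload.get("team_name", ""),
            model_type=payload.get("model_type", "internal"),
            thresholds=payload.get("thresholds", {}),
            extra_resource=payload.get("extra_resource", {}),
        )

cfg = ModelConfig.from_response({
    "service_name": "readmission-model",
    "model_id": "mdl-001",
    "model_version": "2.0.0",
    "thresholds": {"latency_warn_ms": 500},
})
print(cfg.thresholds["latency_warn_ms"])  # 500
```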

### Features

- **Config Precedence**: kwargs > registry > env > YAML > defaults
- **Background Refresh**: Updates thresholds every 10 minutes (configurable)
- **Identity Immutability**: changes to `service_name` or `model_id` require a restart
- **Resilience**: Telemetry continues if registry is down (uses last-known config)
- **Security**: HTTPS required, token never logged
- **GCP Authentication**: Automatic ID token authentication for Cloud Run APIs (see [GCP Auth Guide](docs/registry-gcp-auth-guide.md))
- **Telemetry**: Self-monitoring metrics for registry operations
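
That precedence rule amounts to a layered dict merge, lowest-priority layer first (a sketch of the idea, not the SDK's resolver; env values are shown pre-coerced for brevity):

```python
def resolve_config(defaults, yaml_cfg, env, registry, kwargs):
    """Merge config layers; later (higher-priority) layers win (sketch)."""
    merged = {}
    for layer in (defaults, yaml_cfg, env, registry, kwargs):
        # Skip unset values so a None in a higher layer can't mask a lower one
        merged.update({k: v for k, v in layer.items() if v is not None})
    return merged

cfg = resolve_config(
    defaults={"refresh_interval_secs": 600, "team_name": None},
    yaml_cfg={"team_name": "ml-team"},
    env={"refresh_interval_secs": 300},
    registry={"team_name": "clinical-ai"},
    kwargs={"refresh_interval_secs": 60},
)
print(cfg)  # {'refresh_interval_secs': 60, 'team_name': 'clinical-ai'}
```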

### Configuration Options

| Environment Variable | Description | Default |
|---------------------|-------------|---------|
| `MCA_REGISTRY_URL` | Registry service URL (HTTPS required) | None |
| `MCA_REGISTRY_TOKEN` | Bearer token for authentication | None |
| `MCA_REFRESH_SECS` | Refresh interval in seconds | 600 |
| `MCA_PREFER_REGISTRY` | Registry overrides local config | True |
| `MCA_DEPLOYMENT_ID` | Optional deployment identifier | None |
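
Reading these variables with their documented defaults might look like this sketch (the boolean parsing rule for `MCA_PREFER_REGISTRY` is an assumption):

```python
import os

def load_registry_settings(environ=os.environ):
    """Read MCA registry settings with their documented defaults (sketch)."""
    return {
        "registry_url": environ.get("MCA_REGISTRY_URL"),
        "registry_token": environ.get("MCA_REGISTRY_TOKEN"),
        "refresh_interval_secs": int(environ.get("MCA_REFRESH_SECS", "600")),
        "prefer_registry": environ.get("MCA_PREFER_REGISTRY", "true").lower()
                           in ("1", "true", "yes"),
        "deployment_id": environ.get("MCA_DEPLOYMENT_ID"),
    }

print(load_registry_settings({"MCA_REFRESH_SECS": "300"}))
```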

## Next Steps / Known Limitations

### Implemented (Phase 1)
- ✅ OTLP HTTP receiver for metrics, logs, traces
- ✅ Batch processing (10s timeout or 100 metrics)
- ✅ Attribute enrichment (region, environment)
- ✅ GCP exporter for prototype validation
- ✅ Vendor API bridge pattern
- ✅ Full SDK instrumentation example
- ✅ GenAI/LLM monitoring with LiteLLM integration
- ✅ Agentic AI instrumentation with goal tracking and tool execution monitoring
- ✅ **Model Registry Integration**: Centralized config management with automatic refresh
- ✅ **GCP Integration**: Cloud Trace and Cloud Logging exporters
- ✅ **GCP Authentication**: Service account and ID token authentication
- ✅ **Security**: Queue encryption, certificate management, HTTPS enforcement
- ✅ **Resilience**: Circuit breakers, retry logic with exponential backoff
- ✅ **Persistent Storage**: Dead Letter Queue (DLQ) for failed telemetry
- ✅ **Kubernetes Deployment**: Helm charts and Kustomize configurations
- ✅ Comprehensive testing (unit + e2e)
- ✅ Docker Compose orchestration
- ✅ Health check endpoint

### Phase 2: Production Readiness (In progress)
- [x] **GCP Cloud Monitoring**
- [ ] **Security Hardening**:
  - mTLS for collector OTLP receiver
  - API key authentication for vendor bridge
- [ ] **High Availability**: Multi-instance collector with load balancing
- [x] **Alerting**: Configure processor for alert generation on metric thresholds (via Terraform)
- [x] **Schema Validation**: Enforce metric naming conventions at collector level
- [x] **Node.js SDK**: Initial implementation of the MCA SDK for Node.js (in `mca-sdk-nodejs/`)
- [ ] **Cost Optimization**: Sampling strategies for high-volume traces

### Phase 3: Scale & Features
- [ ] **Additional Vendors**: More bridge implementations
- [ ] **Real Models**: Production model integrations
- [x] **Dashboards**: GCP console visualizations (via Terraform)
- [ ] **SLO Monitoring**: Track model performance SLIs
- [ ] **Anomaly Detection**: Statistical outlier identification
- [ ] **Data Retention**: Policies for metric aggregation/archival

### Known Limitations
- **Collector Authentication**: OTLP receiver does not require authentication (use network policies in production)
- **Batch Timeout**: Up to 10s delay in data visibility (configurable)
- **Single Instance Collector**: No built-in redundancy or failover (use Kubernetes replication)
- **Metric Descriptor Management**: GCP Cloud Monitoring descriptors created automatically but not pre-configured
- **Manual E2E Verification**: Tests rely on Docker log parsing (consider using OTLP test receiver)

### Security Considerations (For Production)
- **HTTPS Enforcement**: Both `registry_url` and `collector_endpoint` **require HTTPS** for non-localhost endpoints (enforced since v0.4.1)
  - ✅ Allowed: `https://registry.example.com`, `http://localhost:5000`, `http://127.0.0.1:4318`
  - ❌ Blocked: `http://registry.example.com` (raises `ConfigurationError`)
  - **Localhost Exception**: HTTP is allowed for localhost, 127.0.0.0/8 range, and ::1 for development convenience
  - **Security Note**: Prevents credential and telemetry exposure over unencrypted connections
- **Audit Logs**: Implement comprehensive access logging for collector
- **Encryption**: Require TLS for all OTLP communication
- **Access Control**: Implement RBAC for collector configuration
- **Data Residency**: Ensure GCP region meets compliance requirements
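
The HTTPS rule above (HTTP allowed only for localhost, 127.0.0.0/8, and ::1) can be approximated as follows (an illustrative validator; the SDK raises `ConfigurationError`, this sketch uses `ValueError`):

```python
import ipaddress
from urllib.parse import urlparse

def validate_endpoint(url):
    """Reject plain-HTTP non-localhost endpoints (sketch)."""
    parsed = urlparse(url)
    if parsed.scheme == "https":
        return
    if parsed.scheme != "http":
        raise ValueError(f"Unsupported scheme: {parsed.scheme!r}")
    host = parsed.hostname or ""
    if host == "localhost":
        return  # development convenience
    try:
        if ipaddress.ip_address(host).is_loopback:  # covers 127.0.0.0/8 and ::1
            return
    except ValueError:
        pass  # hostname, not an IP literal
    raise ValueError(f"HTTPS required for non-localhost endpoint: {url}")

validate_endpoint("https://registry.example.com")  # ok
validate_endpoint("http://127.0.0.1:4318")         # ok — loopback
try:
    validate_endpoint("http://registry.example.com")
except ValueError as exc:
    print(exc)  # HTTPS required for non-localhost endpoint: ...
```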

## Troubleshooting

### Collector not receiving metrics
**Symptom**: No output in collector logs after running model

**Solutions**:
- Check collector is healthy: `curl http://localhost:13133/`
- Verify port 4318 is accessible: `docker ps`
- Check model completed and flushed: Look for "Flushing metrics" in model output
- Increase batch timeout: Metrics may be waiting for 10s batch window

### Vendor bridge failing to start
**Symptom**: `mca-vendor-bridge` container exits with error

**Solutions**:
- Check vendor-api is healthy: `docker ps` (should show healthy status)
- Verify API is accessible: `curl http://localhost:8080/health`
- Check environment variables in docker-compose.yml
- Review bridge logs: `docker logs mca-vendor-bridge`

### E2E tests skipped
**Symptom**: Tests show "SKIPPED - Collector is not running"

**Solutions**:
- Start collector first: `docker-compose up`
- Wait for health endpoint: May take 10-15 seconds on first start
- Check health manually: `curl http://localhost:13133/`
- Rebuild if config changed: `docker-compose up --build`

### Import errors in tests
**Symptom**: `ImportError: cannot import name 'InMemorySpanExporter'`

**Solutions**:
- Install dependencies: `pip install -r requirements.txt`
- Check Python version: Requires 3.10+
- Virtual environment recommended: `python -m venv venv && source venv/bin/activate`

## Additional Resources

### MCA SDK Documentation
- **AI Assistant Guide**: See `mca_sdk/FOR_AI_ASSISTANTS.md` in your installed package
- **GCP Authentication Guide**: Contact your organization for access to internal documentation
- **GCP Dev Environment Guide**: Contact your organization for access to internal documentation
- **Integration Guides**: See examples in the `sdk-examples/` directory of the source repository

### OpenTelemetry Resources
- **OpenTelemetry Docs**: https://opentelemetry.io/docs/
- **Collector Configuration**: https://opentelemetry.io/docs/collector/configuration/
- **Python SDK**: https://opentelemetry-python.readthedocs.io/
- **OTLP Specification**: https://opentelemetry.io/docs/specs/otlp/


