Metadata-Version: 2.1
Name: smolmodels
Version: 0.1.2
Summary: A framework for building ML models from natural language
Home-page: https://github.com/plexe-ai/smolmodels
License: Apache-2.0
Keywords: custom ai,llm,machine learning model,data generation
Author: marcellodebernardi
Author-email: marcello.debernardi@outlook.com
Requires-Python: >=3.12,<4.0
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Dist: anthropic (==0.42.0)
Requires-Dist: bandit (>=1.8.2,<2.0.0)
Requires-Dist: dataclasses-json (>=0.6.7,<0.7.0)
Requires-Dist: google-generativeai (>=0.8.2,<0.9.0)
Requires-Dist: imbalanced-learn (>=0.12.4,<0.13.0)
Requires-Dist: instructor[anthropic] (>=1.7.2,<2.0.0)
Requires-Dist: joblib (>=1.4.2,<2.0.0)
Requires-Dist: mlxtend (>=0.23.4,<0.24.0)
Requires-Dist: openai (>=1.60.1,<2.0.0)
Requires-Dist: pandas (>=2.2.0,<3.0.0)
Requires-Dist: pydantic (>=2.9.2,<3.0.0)
Requires-Dist: scikit-learn (>=1.5.2,<2.0.0)
Requires-Dist: seaborn (>=0.12.2,<0.13.0)
Requires-Dist: xgboost (>=2.1.3,<3.0.0)
Project-URL: Repository, https://github.com/plexe-ai/smolmodels
Description-Content-Type: text/markdown

<div align="center">

# smolmodels 🤖✨

[![PyPI version](https://img.shields.io/pypi/v/smolmodels.svg)](https://pypi.org/project/smolmodels/)
[![Discord](https://img.shields.io/discord/1300920499886358529?logo=discord&logoColor=white)](https://discord.gg/3czW7BMj)

Build specialized ML models using natural language.

</div>

## What is smolmodels?

smolmodels is a Python library that lets you create machine learning models by describing what you want them to do in
plain English. Instead of wrestling with model architectures and hyperparameters, you simply describe your intent,
define your inputs and outputs, and let smolmodels handle the rest.

```python
import smolmodels as sm

# Create a house price predictor with just a description
model = sm.Model(
    intent="Predict house prices based on property features",
    input_schema={
        "square_feet": float,
        "bedrooms": int,
        "location": str,
        "year_built": int
    },
    output_schema={
        "predicted_price": float
    }
)

# Build the model, using the backend of your choice - optionally generate synthetic training data
model.build("house-prices.csv", generate_samples=1000, provider="openai:gpt-4o-mini")

# Make predictions
price = model.predict({
    "square_feet": 2500,
    "bedrooms": 4,
    "location": "San Francisco",
    "year_built": 1985
})

# Save the model for later use
sm.save_model(model, "house-price-predictor")
```

## How Does It Work?

smolmodels uses a multi-step process for model creation:

1. **Intent Analysis**: The problem description is analyzed to determine the type of model needed, its key
   requirements, and its success criteria.

2. **Data Generation**: smolmodels can generate synthetic data so that a model can be built even when no training
   data is available.

3. **Model Building**: The library:
    - Selects appropriate model architectures
    - Handles feature engineering
    - Manages training and validation
    - Ensures outputs meet the specified constraints

4. **Validation & Refinement**: The model is tested against constraints and refined using directives (like "optimize for
   speed" or "prioritize explainability").

## Key Features

### Natural Language Intent 📝

Models are defined through natural language descriptions and schema specifications, abstracting away architecture
decisions.
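
As an illustration of the schema format, the sketch below checks a record against an `input_schema`-style dictionary before it is passed to a model. The helper `schema_errors` is hypothetical and not part of smolmodels; the library may validate inputs differently. The sketch assumes that an `int` is acceptable where a `float` is declared.

```python
from typing import Any

# Schema in the same shape as the `input_schema` argument shown earlier.
input_schema = {
    "square_feet": float,
    "bedrooms": int,
    "location": str,
    "year_built": int,
}


def schema_errors(record: dict[str, Any], schema: dict[str, type]) -> list[str]:
    """Hypothetical helper (not a smolmodels API): list schema mismatches."""
    errors = []
    for name, expected in schema.items():
        if name not in record:
            errors.append(f"missing field: {name}")
            continue
        value = record[name]
        # Assumption: ints are accepted where a float is declared (e.g. 2500).
        allowed = (int, float) if expected is float else (expected,)
        # bool is a subclass of int, so reject it unless bool was declared.
        is_stray_bool = isinstance(value, bool) and expected is not bool
        if is_stray_bool or not isinstance(value, allowed):
            errors.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    # Report fields the schema does not know about, in a stable order.
    for name in sorted(record.keys() - schema.keys()):
        errors.append(f"unexpected field: {name}")
    return errors
```

An empty list means the record matches the schema; otherwise each entry names one problem.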

### Data Generation 🎲

Built-in synthetic data generation for training and validation.

### Directives for Fine-Grained Control 🎯

Guide the model building process with high-level directives:

```python
from smolmodels import Directive

model.build(directives=[
    Directive("Optimize for inference speed"),
    Directive("Prioritize interpretability")
])
```

### Optional Constraints ✅

Optional declarative constraints for model validation:

```python
from smolmodels import Constraint, Model

# Ensure predictions are always positive
positive_constraint = Constraint(
    lambda inputs, outputs: outputs["predicted_price"] > 0,
    description="Predictions must be positive"
)

model = Model(
    intent="Predict house prices...",
    constraints=[positive_constraint],
    ...
)
```
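
To make the predicate shape concrete, here is a minimal sketch of evaluating an `(inputs, outputs) -> bool` predicate over a few input/output pairs in plain Python. It does not use smolmodels; the `violations` helper and the sample values are hypothetical, and the library's own constraint checking may work differently.

```python
from typing import Any, Callable

Record = dict[str, Any]
Predicate = Callable[[Record, Record], bool]


def positive_price(inputs: Record, outputs: Record) -> bool:
    """Same rule as the constraint above: predictions must be positive."""
    return outputs["predicted_price"] > 0


def violations(predicate: Predicate, pairs: list[tuple[Record, Record]]):
    """Hypothetical helper: return the (inputs, outputs) pairs that fail."""
    return [(i, o) for i, o in pairs if not predicate(i, o)]


# Illustrative values only, not real model output.
samples = [
    ({"square_feet": 2500.0}, {"predicted_price": 950_000.0}),
    ({"square_feet": 10.0}, {"predicted_price": -1_200.0}),
]
failed = violations(positive_price, samples)
```

Here `failed` holds only the second pair, since its predicted price is negative.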

### Multi-Provider Support 🌐

You can use multiple LLM providers as a backend for model generation by specifying the provider name, and optionally
the model too, when calling `build()`:

```python
model.build("house-prices.csv", provider="openai:gpt-4o-mini")
```

Currently supported providers are `openai`, `anthropic`, `google` and `deepseek`. You need to configure the
appropriate API key for each provider you use as an environment variable (see the API Keys section below).

## Installation & Setup

```bash
pip install smolmodels
```

## API Keys

Set required API keys as environment variables. Which API keys are required depends on which provider you are using.

```bash
export OPENAI_API_KEY=<your-API-key>
export ANTHROPIC_API_KEY=<your-API-key>
export GOOGLE_API_KEY=<your-API-key>
export DEEPSEEK_API_KEY=<your-API-key>
```

## Quick Start

1. **Define model**:

```python
import smolmodels as sm

model = sm.Model(
    intent="Classify customer feedback as positive, negative, or neutral",
    input_schema={"text": str},
    output_schema={"sentiment": str}
)
```

2. **Build and save**:

```python
# Build with existing data
model.build(dataset="feedback.csv")

# Or generate synthetic data
model.build(generate_samples=1000)

# Save model for later use
sm.save_model(model, "sentiment_model")
```

3. **Load and use**:

```python
# Load existing model
loaded_model = sm.load_model("sentiment_model")

# Make predictions
result = loaded_model.predict({"text": "Great service, highly recommend!"})
print(result["sentiment"])  # "positive"
```

## Benchmarks

Performance was evaluated on 20 OpenML benchmark datasets and 12 Kaggle competitions. Performance exceeded the baseline
on 12 of the 20 OpenML datasets, and the remaining datasets were within 0.005 of the baseline. Experiments were run on
standard infrastructure (8 vCPUs, 30 GB RAM) with a 1-hour runtime limit per dataset.

Complete code and results are available at [plexe-ai/plexe-results](https://github.com/plexe-ai/plexe-results).

## Documentation

For full documentation, visit [docs.plexe.ai](https://docs.plexe.ai).

## Contributing

We welcome contributions! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

## License

Apache-2.0 License - see [LICENSE](LICENSE) for details.

