Metadata-Version: 2.4
Name: water-conflict-classifier
Version: 0.1.0
Summary: SetFit-based multi-label classifier for water-related conflict events
Author: Baobab Tech
License-Expression: CC-BY-NC-4.0
Project-URL: Homepage, https://github.com/baobabtech/waterconflict
Project-URL: Documentation, https://github.com/baobabtech/waterconflict/blob/main/classifier/README.md
Project-URL: Repository, https://github.com/baobabtech/waterconflict
Keywords: water,conflict,classifier,setfit,machine-learning
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: pandas>=2.0.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: scikit-learn>=1.3.0
Requires-Dist: setfit>=1.0.0
Requires-Dist: sentence-transformers>=2.2.0
Requires-Dist: datasets>=2.14.0
Requires-Dist: transformers>=4.30.0
Requires-Dist: torch>=2.0.0
Requires-Dist: huggingface-hub>=0.19.0
Requires-Dist: tqdm>=4.65.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"

# Water Conflict Classifier

SetFit-based multi-label text classifier for identifying water-related conflict events in news headlines.

**Project:** Experimental research supporting the [Pacific Institute's Water Conflict Chronology](https://www.worldwater.org/water-conflict/)  
**Developer:** Baobab Tech  
**License:** [CC BY-NC 4.0](http://creativecommons.org/licenses/by-nc/4.0/) (Non-Commercial)

## Frugal AI: Training with Limited Data

This classifier demonstrates an intentional approach to building AI systems with **limited data** using [SetFit](https://huggingface.co/docs/setfit/en/index) - a framework for few-shot learning with sentence transformers. Rather than defaulting to massive language models (GPT, Claude, or 100B+ parameter models) for simple classification tasks, we fine-tune a small, efficient model (~33M parameters) on a focused dataset.

**Why this matters:** The industry has normalized using trillion-parameter models to classify headlines, answer simple questions, or categorize text - tasks that don't require world knowledge, reasoning, or generative capabilities. This is computationally wasteful and environmentally costly. A properly fine-tuned small model can achieve comparable or better accuracy while using a fraction of the compute resources.

**Our approach:**
- Train on ~600 examples (few-shot learning with SetFit)
- Deploy a 33M parameter model vs. 100B-1T parameter alternatives
- Achieve specialized task performance without the overhead of general-purpose LLMs
- Reduce inference costs and latency by orders of magnitude

This is not about avoiding large models altogether - they're invaluable for complex reasoning tasks. But for targeted classification problems with labeled data, fine-tuning remains the professional, responsible choice.

## Project Structure

Simple, flat structure with shared modules:

```
classifier/
├── __init__.py                         # Package marker
├── data_prep.py                        # Data loading & preprocessing (shared)
├── training_logic.py                   # Core training logic (shared)
├── evaluation.py                       # Model evaluation & metrics (shared)
├── model_card.py                       # Model card generation (shared)
├── train_setfit_headline_classifier.py # Local training (uses shared modules)
├── train_on_hf.py                      # HF Jobs training (self-contained with UV)
├── upload_datasets.py                  # Upload data to HF Hub
├── transform_prep_negatives.py         # Generate negative examples from ACLED
├── classify_headline.py                # Local inference example
├── classify_headline_hub.py            # HF Hub inference example
└── README.md                           # This file
```

### Package Structure

The classifier is a proper Python package that can be installed via pip/uv.

**Local Training:** `train_setfit_headline_classifier.py` imports from the installed package.

**HF Jobs Training:** `train_on_hf.py` uses UV with the package declared as a dependency, so no code is duplicated between the two training paths.

---

## Training Options

### Option 1: Local Training

Train on your own hardware with local data files:

```bash
cd classifier
python train_setfit_headline_classifier.py
```

**Pros:** Full control, works offline, no HF account needed  
**Cons:** Requires local GPU (or slow on CPU), manual model management

### Option 2: HF Jobs (Cloud Training)

Train on managed GPUs with automatic model upload to HF Hub:

```bash
hf jobs uv run \
  --flavor a10g-large \
  --timeout 2h \
  --env HF_ORGANIZATION=your-org \
  --namespace your-org \
  --secrets HF_TOKEN \
  classifier/train_on_hf.py
```

**Pros:** Fast GPU training (~2-5 min), auto model upload, reproducible  
**Cons:** Requires HF account, data must be on HF Hub

**Note:** The package must be published to PyPI or installed from a Git URL before the job can use it. See `PUBLISHING.md` for complete publishing instructions with UV.

**Learn more:** 
- [Hugging Face Jobs Documentation](https://huggingface.co/docs/huggingface_hub/guides/jobs)
- [Publishing Guide](PUBLISHING.md) - How to publish with UV

---

## Setup

**Important:** This package lives inside a monorepo. Unless a step says otherwise, commands assume you're in the `classifier/` directory.

### For Local Training

1. **Navigate to package directory:**

```bash
cd classifier  # Must be in this directory!
```

2. **Install the package in development mode:**

```bash
uv pip install -e .
# or with regular pip:
pip install -e .
```

This installs the modules (`data_prep`, `training_logic`, etc.) so they can be imported.

3. **Prepare training data:**

Training data should be in `../data/` (one level up from the `classifier/` folder):
- `../data/positives.csv` - Water conflict headlines with labels
- `../data/negatives.csv` - Non-water conflict headlines

Generate negatives from ACLED (if needed):
```bash
# This script is in the parent scripts folder
cd ../scripts
python transform_prep_negatives.py
```

4. **Train:**

```bash
cd ../classifier  # Works from classifier/ or scripts/; training must run inside classifier/
python train_setfit_headline_classifier.py
```

The trained model is saved to `./water-conflict-classifier/`.

---

### For HF Jobs (Cloud Training)

#### Prerequisites

```bash
# Install HF CLI
pip install "huggingface-hub[cli]"  # Quoted so shells like zsh don't expand the brackets

# Authenticate
hf auth login
```

Get your token from: [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)

#### Step 1: Configure HuggingFace Repos

Copy the sample config to create your own:

```bash
cd /path/to/waterconflict
cp config.sample.py config.py
```

Edit `config.py` and set your organization or username:

```python
HF_ORGANIZATION = "my-org-name"  # or "my-username"
```

The `config.py` file is gitignored, so your organization setting stays local and is never committed.

#### Step 2: Upload Training Data

```bash
# Upload script is in the scripts folder at the repo root
cd /path/to/waterconflict/scripts
python upload_datasets.py
```

This creates a dataset repository at `YOUR_ORG/water-conflict-training-data` (or `YOUR_USERNAME/...` if using personal account).

#### Step 3: Publish Package (First Time Only)

Before HF Jobs can use it, publish the package:

```bash
cd /path/to/waterconflict/classifier

# Build and publish to PyPI
uv build
uv publish

# Or use Git URL (see PUBLISHING.md for details)
```

#### Step 4: Run Training Job

```bash
# From mono repo root
hf jobs uv run \
  --flavor a10g-large \
  --timeout 2h \
  --env HF_ORGANIZATION=baobabtech \
  --secrets HF_TOKEN \
  --namespace baobabtech \
  classifier/train_on_hf.py
```

Replace `baobabtech` with your organization name from `config.py`.

**Important:** Package must be published to PyPI or available via Git URL. See `PUBLISHING.md` for details.

**Configuration Options:**

- `--secrets HF_TOKEN`: Authentication (required for private repos/pushing models)
- `--env HF_ORGANIZATION`: Your HF org/username (required - not in git due to .gitignore)
- `--namespace`: Runs job under org account for billing/tracking (optional)
- `--timeout`: Max runtime before auto-termination
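
To illustrate how the `HF_ORGANIZATION` value passed with `--env` might be turned into repository IDs, here is a minimal sketch. The helper `resolve_repo_ids` is hypothetical and is not taken from `train_on_hf.py`; only the two repository names come from this README.

```python
from collections.abc import Mapping

# Repository names mentioned in this README; the helper below is hypothetical.
DATASET_NAME = "water-conflict-training-data"
MODEL_NAME = "water-conflict-classifier"


def resolve_repo_ids(env: Mapping[str, str]) -> tuple[str, str]:
    """Build (dataset_repo, model_repo) from an HF_ORGANIZATION setting."""
    org = env.get("HF_ORGANIZATION", "").strip()
    if not org:
        raise ValueError(
            "HF_ORGANIZATION is not set; pass it with --env HF_ORGANIZATION=your-org"
        )
    return f"{org}/{DATASET_NAME}", f"{org}/{MODEL_NAME}"


# A real script would pass os.environ; a literal dict keeps the demo reproducible.
dataset_repo, model_repo = resolve_repo_ids({"HF_ORGANIZATION": "your-org"})
print(dataset_repo)
print(model_repo)
```

Failing fast when the variable is missing gives a clearer error than a later "Dataset not found" from the Hub.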

**Hardware options:** See [available flavors](https://huggingface.co/docs/trl/main/en/jobs_training#hardware) - recommend `a10g-large` for this task.

**Dependencies:** UV automatically handles all dependencies from inline script declarations.

---

## Monitoring

```bash
# List jobs
hf jobs ps -a --namespace baobabtech

# Stream logs
hf jobs logs <job_id> --namespace baobabtech

# Cancel job
hf jobs cancel <job_id> --namespace baobabtech
```

---

## Training Pipeline

The HF Jobs script (`train_on_hf.py`) follows the same pipeline as the local version, with added HF Hub integration:

1. **Authenticate** with HF Hub (via `HF_TOKEN`)
2. **Load data** from dataset repo (downloads CSVs)
3. **Preprocess** into multi-label format (balances negatives to match positives count)
4. **Split data** (85% train pool / 15% held-out test set)
5. **Sample training data** (600 examples from train pool for efficient few-shot learning)
6. **Train** SetFit model (1 epoch, undersampling strategy)
7. **Evaluate** on held-out test set (F1, accuracy, per-label metrics)
8. **Push to Hub** (model + comprehensive model card with evaluation tables)
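
Steps 3 to 5 can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in `training_logic.py`: the function name `prepare_splits`, the plain random (non-stratified) split, the fixed seed, and the synthetic placeholder headlines are all assumptions made for the example.

```python
import random


def prepare_splits(positives, negatives, test_fraction=0.15, train_sample=600, seed=42):
    """Balance classes, hold out a test set, then sample a small training set."""
    rng = random.Random(seed)

    # Step 3: downsample negatives so they match the number of positives.
    negatives = rng.sample(negatives, k=min(len(negatives), len(positives)))
    examples = positives + negatives
    rng.shuffle(examples)

    # Step 4: 85% train pool / 15% held-out test set.
    n_test = round(len(examples) * test_fraction)
    test_set, train_pool = examples[:n_test], examples[n_test:]

    # Step 5: draw the few-shot training sample from the train pool only.
    train_set = rng.sample(train_pool, k=min(train_sample, len(train_pool)))
    return train_set, test_set


# Synthetic placeholder headlines, used only to exercise the function.
positives = [f"positive headline {i}" for i in range(500)]
negatives = [f"negative headline {i}" for i in range(2000)]

train_set, test_set = prepare_splits(positives, negatives)
print(len(train_set), len(test_set))  # 600 150
```

Sampling the training set from the train pool only keeps the held-out test set untouched, so the evaluation in step 7 is not contaminated by training examples.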

Expected runtime: ~2-5 minutes on A10G GPU
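
The per-label metrics from step 7 can be computed as in the sketch below. The function names are hypothetical, "accuracy" is interpreted here as exact-match accuracy (all three labels correct), which is an assumption, and the toy values are not results from the model.

```python
LABELS = ["Trigger", "Casualty", "Weapon"]


def per_label_f1(y_true, y_pred):
    """Return {label: F1} computed independently for each label column."""
    scores = {}
    for col, name in enumerate(LABELS):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[col] == 1 and p[col] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[col] == 0 and p[col] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[col] == 1 and p[col] == 0)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the label never occurs.
        scores[name] = 2 * tp / denom if denom else 0.0
    return scores


def exact_match_accuracy(y_true, y_pred):
    """Fraction of examples where all three labels are predicted correctly."""
    return sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p)) / len(y_true)


# Toy values for illustration only; these are not results from the model.
y_true = [[1, 1, 0], [1, 0, 0], [0, 0, 0], [1, 0, 1]]
y_pred = [[1, 1, 0], [1, 0, 1], [0, 0, 0], [0, 0, 1]]

print(per_label_f1(y_true, y_pred))
print(exact_match_accuracy(y_true, y_pred))  # 0.5
```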

---

## After Training

Your model will be at: `https://huggingface.co/YOUR_ORG/water-conflict-classifier` (or `YOUR_USERNAME/...` if using personal account)

Use it with the inference script:

```bash
python classify_headline.py
```

Or directly in Python:

```python
from setfit import SetFitModel

model = SetFitModel.from_pretrained("YOUR_ORG/water-conflict-classifier")
predictions = model.predict(["Taliban attack dam workers in Afghanistan"])
# Example output: [[1, 1, 1]], one 0/1 flag per label in the order [Trigger, Casualty, Weapon]
```
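
To make the raw output easier to read, each prediction row can be mapped back to label names. This is a minimal sketch: the `decode` helper is hypothetical, and the hard-coded `predictions` list stands in for the result of `model.predict(...)` so the example runs without downloading a model.

```python
LABELS = ["Trigger", "Casualty", "Weapon"]


def decode(prediction) -> list[str]:
    """Map one multi-hot prediction row to the names of its active labels."""
    return [name for name, flag in zip(LABELS, prediction) if int(flag) == 1]


# Stand-in for model.predict(...) output so the example runs without a model.
predictions = [[1, 1, 1], [0, 0, 0], [1, 0, 0]]
for row in predictions:
    print(decode(row) or ["(no labels predicted)"])
```

Calling `int(flag)` keeps the helper usable whether the rows are plain lists or array-like values.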

---

## Troubleshooting

**"Not authenticated"** → Run `hf auth login`

**"Dataset not found"** → Verify `DATASET_REPO` matches uploaded dataset name

**Out of memory** → Reduce `BATCH_SIZE` in the script or use a larger GPU flavor

**Job timeout** → Increase `--timeout` value

---

## Local Testing of HF Jobs Script

Test the HF Jobs script locally before submitting:

```bash
cd classifier
uv pip install -e .  # Install package locally first
uv run train_on_hf.py
```

Note: This still requires the dataset to be on HF Hub and valid authentication.

---

## Configuration Options

### Private Repositories

Set `private=True` in the upload and push methods (check `upload_datasets.py` and `train_on_hf.py`)

### Different Base Model

Edit the `BASE_MODEL` constant in either training script:

```python
BASE_MODEL = "sentence-transformers/all-MiniLM-L6-v2"  # Smaller/faster
# or
BASE_MODEL = "BAAI/bge-base-en-v1.5"  # Larger/better quality
```

### Additional Secrets

```bash
hf jobs uv run \
  --secrets HF_TOKEN \
  --secrets WANDB_API_KEY \
  --env HF_ORGANIZATION=baobabtech \
  --env WANDB_PROJECT=water-conflict \
  classifier/train_on_hf.py
```

---

## Data Sources

The training data combines:

- **Positive Examples**: Water conflict headlines from [Pacific Institute Water Conflict Chronology](https://www.worldwater.org/water-conflict/)
- **Negative Examples**: Non-water conflict events from [ACLED](https://acleddata.com/)

Both positive and negative examples are labeled for three categories: Trigger, Casualty, and Weapon.
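
As a rough illustration of the three-category format, the sketch below parses a small in-memory CSV and turns each row into a `[Trigger, Casualty, Weapon]` vector. The column names (`headline`, `labels`), the semicolon-separated encoding, and the sample headlines are assumptions made for the example, not the actual schema or contents of `positives.csv`.

```python
import csv
import io

# Label order used throughout this README: [Trigger, Casualty, Weapon]
LABELS = ["Trigger", "Casualty", "Weapon"]

# Hypothetical CSV layout with made-up rows; the real files may differ.
SAMPLE_CSV = """headline,labels
Militants seize dam and cut off water supply,Trigger;Weapon
Villagers killed in clash over irrigation canal,Trigger;Casualty
"""


def to_multi_hot(label_field: str) -> list[int]:
    """Convert a 'Trigger;Weapon'-style field into a three-element 0/1 vector."""
    present = {part.strip() for part in label_field.split(";") if part.strip()}
    return [1 if name in present else 0 for name in LABELS]


def load_rows(csv_text: str) -> list[tuple[str, list[int]]]:
    """Parse CSV text into (headline, multi-hot labels) pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["headline"], to_multi_hot(row["labels"])) for row in reader]


rows = load_rows(SAMPLE_CSV)
for headline, vector in rows:
    print(vector, headline)
```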

## Resources

- [HF Jobs Guide](https://huggingface.co/docs/huggingface_hub/guides/jobs)
- [UV Script Format](https://docs.astral.sh/uv/guides/scripts/) (used in `train_on_hf.py`)
- [SetFit Documentation](https://huggingface.co/docs/setfit)
- [Pacific Institute Water Conflict Chronology](https://www.worldwater.org/water-conflict/)
- [ACLED Data](https://acleddata.com/)

---

## License

Copyright © 2025 Baobab Tech

This project is licensed under the [Creative Commons Attribution-NonCommercial 4.0 International License](http://creativecommons.org/licenses/by-nc/4.0/).

You are free to use, share, and adapt this work for non-commercial purposes with appropriate attribution to Baobab Tech. For commercial licensing inquiries, please contact Baobab Tech.
