Metadata-Version: 2.4
Name: income_predict_d100_d400
Version: 0.1.1
Summary: My installable package for income prediction
License: MIT
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: ucimlrepo>=0.0.1
Requires-Dist: numpy>=1.26
Requires-Dist: pandas>=2.2
Requires-Dist: pyarrow>=19.0
Requires-Dist: matplotlib>=3.10
Requires-Dist: seaborn>=0.13
Requires-Dist: scikit-learn>=1.6
Requires-Dist: lightgbm>=4.0
Requires-Dist: scipy>=1.0
Requires-Dist: joblib>=1.0
Provides-Extra: dev
Requires-Dist: jupyter>=1.0; extra == "dev"
Requires-Dist: polars>=1.30; extra == "dev"
Requires-Dist: pre-commit>=3.4; extra == "dev"

# d100_d400_income_predict

## Overview

This repository provides a reproducible Docker environment pre-configured with everything needed to run the GLM and LGBM models for predicting high income based on the 1994 US Census dataset.

The main analysis can be found at: `src/notebooks/final_report.ipynb`

There are other sub-analysis files:
- `src/tests/benchmark_pandas_polars.py`: a script that highlights the performance differences between Polars and Pandas when loading and cleaning the dataset.
- `src/notebooks/eda_cleaning.ipynb`: exploratory data analysis, with many more charts and notes on how and why certain decisions were made when building the models.
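As a rough illustration of the kind of comparison the benchmark script makes, the sketch below times a simple load-and-clean step in Pandas and, if it is installed, in Polars. The function name, the CSV input, and the cleaning step are assumptions for this sketch, not taken from the actual script.

```python
import csv
import tempfile
import time

import pandas as pd


def time_loaders(csv_path: str) -> dict:
    """Time a simple load-and-clean step in Pandas, and in Polars if available."""
    timings = {}

    start = time.perf_counter()
    df = pd.read_csv(csv_path)
    df = df.dropna()  # minimal "cleaning" step
    timings["pandas"] = time.perf_counter() - start

    try:
        import polars as pl

        start = time.perf_counter()
        lf = pl.scan_csv(csv_path).drop_nulls()  # lazy: work is deferred
        lf.collect()  # execute the whole query at once
        timings["polars"] = time.perf_counter() - start
    except ImportError:
        pass  # Polars is only a dev extra, so it may not be installed

    return timings


if __name__ == "__main__":
    # Build a small throwaway CSV so the sketch is self-contained.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".csv", delete=False, newline=""
    ) as f:
        writer = csv.writer(f)
        writer.writerow(["age", "income"])
        writer.writerows([[25, 30000], [40, 80000], [33, 55000]])
        path = f.name
    print(time_loaders(path))
```

On a toy file like this, the timings mostly reflect fixed overheads; the real script runs against the full census dataset, where Polars' lazy, multi-threaded execution tends to show a clearer advantage.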

## Installation

There are two ways to install and run:
- Directly from PyPI as a package (easiest)
- Docker container (most robust, recommended for development)

### Install and Run - Method 1, PyPI

#### 1. Install the package

`pip install income_predict_d100_d400`

#### 2. Run the Pipeline

`python -m income_predict_d100_d400.training_pipeline`

or create your own file and import `income_predict_d100_d400`:
```python
import pandas as pd

from income_predict_d100_d400.data import run_data_fetch_pipeline
from income_predict_d100_d400.cleaning import run_cleaning_pipeline
from income_predict_d100_d400.evaluation import run_evaluation
from income_predict_d100_d400.model_training import (
    TARGET,
    load_training_outputs,
    run_split,
    run_training,
)

print("Starting Pipeline...")

file_path = run_data_fetch_pipeline()
df_raw = pd.read_parquet(file_path)
run_cleaning_pipeline(df_raw)
run_split()
run_training()

results = load_training_outputs()

run_evaluation(
    results["test"],
    TARGET,
    results["glm_model"],
    results["lgbm_model"],
    results["train_X"],
)

print("Pipeline finished.")
```

### Install and Run - Method 2, Docker

#### 1. Download and install Docker Desktop (if you don't have it already)

- link: [Docker Desktop](https://www.docker.com/products/docker-desktop)

#### 2. Clone the Repository

```
git clone https://github.com/caitpj/d100_d400_income_predict.git
cd d100_d400_income_predict
```

#### 3. Build the Docker Image

`docker build -t conda-uciml .`

#### 4. Run the Model Pipeline

This runs the model in the Docker container, including downloading the data, cleaning, training, tuning, and saving key visualisations. It should take a minute or so to run.

```
docker run --rm --shm-size=2g \
  -e PYTHONWARNINGS=ignore \
  -e PYTHONUNBUFFERED=1 \
  -e OMP_NUM_THREADS=1 \
  conda-uciml python src/income_predict/training_pipeline.py
```

#### 5. Run the `final_report.ipynb` Notebook

```
docker run --rm -it \
  -p 8888:8888 \
  -v "$(pwd):/app" \
  conda-uciml \
  jupyter notebook --ip=0.0.0.0 --port=8888 --no-browser --allow-root
```
From the output of the above command, find the URL starting with `http://127.0.0.1:8888/?token=...` and paste it into a browser.


## Development

There are a few more steps needed if you want to develop this repo on your local machine.

To ensure code quality, I use `pre-commit` hooks that run locally on your machine before every commit. This requires a local Conda environment on your host machine (not in Docker).

### 1. Download and install Miniconda (if you don't have it already)

- link: [Miniconda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html) (needed for local development and the git hooks).

### 2. Update the Conda environment

This installs `pre-commit`, `black`, `mypy`, etc. based on `environment.yml`:

`conda env update --file environment.yml --prune`

### 3. Initialize Conda (you will need to restart your terminal)

`conda init zsh`

### 4. Activate the environment

`conda activate d100_d300_env`

### 5. Install the git hooks

`pre-commit install`

Now, every time you run `git commit`, your local machine will first automatically check that the commit meets the rules stated in `.pre-commit-config.yaml`.
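For reference, a minimal `.pre-commit-config.yaml` for this kind of setup might look like the fragment below. This is illustrative only; the hook versions and the exact set of hooks are assumptions, not the repo's actual config.

```yaml
repos:
  - repo: https://github.com/psf/black
    rev: 24.4.2
    hooks:
      - id: black
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.10.0
    hooks:
      - id: mypy
```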


## AI Use
Some code was AI generated, notably:
- Visualisations
- Pandas vs Polars benchmark test

In other areas, AI was used to help with debugging, notably:
- Docker related issues
- Performance issues during hyperparameter tuning
