Metadata-Version: 2.4
Name: my_python_lib-tarik
Version: 0.1.0
Summary: A modular, object-oriented framework for machine learning and data preprocessing
Author-email: Mustafa Tarık Kocabıyık <mtarikkb@gmail.com>
License: MIT
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: numpy>=2.4.4
Requires-Dist: pandas>=3.0.2

# Machine Learning & Data Preprocessing Library

## Introduction

This is a simple machine learning library that implements Linear Regression, a KNN classifier, and several data preprocessing algorithms from scratch on top of NumPy and pandas.

---

## Mapping Core Learning Outcomes

The project applies all six required patterns. Each one is explained below.

### 1. Object-Oriented Programming (OOP)
- **Where**: In `core.py` and `data.py`.
- **How**: 
  - **Inheritance & Abstraction**: Abstract base classes (`BaseAlgorithm`, `RegressionStrategy`, `DistanceMetric`, `DataLoader`, `DataCreator`, `ImputeStrategy`, `EncodingStrategy`) define the interfaces that concrete classes must implement.
  - **Polymorphism**: Concrete implementations can be swapped in wherever a base type is expected. For example, `LinearRegression` calls `.train()` on whichever regression strategy it was assigned, so training behavior varies without any change to the class itself.
  - **Encapsulation**: Internal state is kept behind protected attributes. In `data.py`, the raw DataFrame is stored in the protected attribute `self._data` and exposed through a `@property` getter.
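
The three ideas above can be sketched in a few lines. This is a minimal illustration, not the library's actual code: the method names `fill` and `impute`, the choice of `DataProcessor` as the owner of `_data`, and the use of a plain list in place of a pandas DataFrame are all assumptions made to keep the example dependency-free.

```python
from abc import ABC, abstractmethod


class ImputeStrategy(ABC):
    """Abstract blueprint: subclasses must implement fill() (hypothetical name)."""

    @abstractmethod
    def fill(self, values):
        """Return a new list with missing entries (None) replaced."""


class MeanImputer(ImputeStrategy):
    def fill(self, values):
        known = [v for v in values if v is not None]
        mean = sum(known) / len(known)
        return [mean if v is None else v for v in values]


class DataProcessor:
    def __init__(self, data):
        # Protected attribute; a list stands in for the DataFrame here.
        self._data = list(data)

    @property
    def data(self):
        # Read-only access; a copy keeps callers from mutating internal state.
        return list(self._data)

    def impute(self, strategy: ImputeStrategy):
        # Polymorphism: any ImputeStrategy subclass works here.
        self._data = strategy.fill(self._data)
        return self


processor = DataProcessor([1.0, None, 3.0]).impute(MeanImputer())
print(processor.data)
```

Instantiating `ImputeStrategy` directly raises `TypeError`, which is how the abstract base class enforces the contract.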

### 2. Functional Programming
- **Where**: In `core.py` and `utils.py`.
- **How**:
  - **Pure Functions & Lambda**: `evaluate_model` does not modify external state and depends only on its arguments; it computes the mean squared error with a side-effect-free lambda.
  - **Higher-Order Functions & Map/Reduce**:
    - `reduce` combined with `lambda` is used inside `evaluate_model` to sum squared errors.
    - `map` is used inside `series_to_ndarray` to convert the elements of a pandas Series to floats.
    - `apply_pipeline` uses `reduce` to apply a list of transformation callables to the data in sequence (`reduce(lambda d, func: func(d), transformations, data)`).
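
A rough sketch of these functional helpers follows. Only the `reduce(lambda d, func: func(d), transformations, data)` expression comes from the description above; the argument order of `apply_pipeline`, the body of `evaluate_model`, and the `to_floats` stand-in for `series_to_ndarray` are assumptions, and plain lists replace NumPy and pandas objects.

```python
from functools import reduce


def evaluate_model(y_true, y_pred):
    """Mean squared error; pure, since it reads only its arguments."""
    squared_error = lambda t, p: (t - p) ** 2
    total = reduce(
        lambda acc, pair: acc + squared_error(*pair),
        zip(y_true, y_pred),
        0.0,
    )
    return total / len(y_true)


def to_floats(values):
    """Hypothetical stand-in for series_to_ndarray: cast each element with map."""
    return list(map(float, values))


def apply_pipeline(data, transformations):
    """Feed data through each callable in order, left to right."""
    return reduce(lambda d, func: func(d), transformations, data)


doubled_total = apply_pipeline(
    ["1", "2", "3"],
    [to_floats, lambda xs: [x * 2 for x in xs], sum],
)
print(doubled_total)
print(evaluate_model([1, 2, 3], [1, 2, 5]))
```

Because `reduce` threads the output of one callable into the next, adding a preprocessing step means appending a function to the list rather than editing the pipeline.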

### 3. Concurrency (Multi-threading)
- **Where**: Implemented in `core.py` inside the `KNNClassifier` class.
- **How**: 
  - Predicting classes one sample at a time is slow for large inputs, so the `predict` method starts one `threading.Thread` per evaluation sample.
  - Each `_predict_single` worker handles one row and writes its prediction into a shared, pre-allocated NumPy results array at its own position (`results[index]`). Because every thread writes to a distinct index, no two threads touch the same slot.
  - All threads are started with `t.start()` and then joined with `t.join()`, which blocks the main thread until every prediction has finished.
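
The start/join pattern described above might look roughly like this. It is a sketch under stated assumptions: the function signatures, the squared-distance vote, and the use of a plain list instead of a NumPy results array are illustrative choices, not the library's actual implementation.

```python
import threading
from collections import Counter


def _predict_single(index, sample, train_X, train_y, k, results):
    """Classify one sample and store the label in its own result slot."""
    # Squared Euclidean distance from the sample to every training row.
    distances = sorted(
        ((sum((a - b) ** 2 for a, b in zip(sample, row)), label)
         for row, label in zip(train_X, train_y)),
        key=lambda pair: pair[0],
    )
    votes = Counter(label for _, label in distances[:k])
    results[index] = votes.most_common(1)[0][0]


def predict(samples, train_X, train_y, k=3):
    results = [None] * len(samples)  # shared, pre-allocated output
    threads = [
        threading.Thread(
            target=_predict_single,
            args=(i, sample, train_X, train_y, k, results),
        )
        for i, sample in enumerate(samples)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # block until every worker has written its slot
    return results


train_X = [(0, 0), (0, 1), (5, 5), (6, 5)]
train_y = ["a", "a", "b", "b"]
print(predict([(0.2, 0.1), (5.5, 5.0)], train_X, train_y))
```

In standard CPython builds the global interpreter lock prevents threads from running pure-Python arithmetic like this in parallel, so the sketch demonstrates the coordination pattern rather than a guaranteed speedup.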

### 4. Recursion / Dynamic Programming
- **Where**: In `core.py` inside the `EuclideanDistance` class.
- **How**:
  - Distance metrics are usually computed with loops or vectorized library calls. This implementation instead sums the element-wise squared differences with a custom recursive function, `recursive_sum_sq(a, b, idx)`.
  - The function adds the squared difference at each index and recurses on the next one until it reaches the base case (`idx == len(a)`); the square root of the accumulated sum is the distance.
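
One way to write such a recursion is sketched below. The name `recursive_sum_sq` and the base case `idx == len(a)` come from the description; where exactly the square root is applied is not specified there, so taking it in a separate wrapper (the hypothetical `euclidean_distance`) is an assumption.

```python
import math


def recursive_sum_sq(a, b, idx=0):
    """Sum of squared differences from position idx to the end."""
    if idx == len(a):  # base case: past the last element
        return 0.0
    diff = a[idx] - b[idx]
    return diff * diff + recursive_sum_sq(a, b, idx + 1)


def euclidean_distance(a, b):
    # The square root is applied once, after the recursion finishes.
    return math.sqrt(recursive_sum_sq(a, b))


print(euclidean_distance([0, 0], [3, 4]))
```

Each element adds one stack frame, so very long vectors can exceed Python's recursion limit (1000 by default in CPython); an iterative or vectorized sum avoids that.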

### 5. SOLID Principles
- **Where**: In `core.py` and `data.py`.
- **How**:
  - **Single Responsibility Principle (SRP)**: Each class has one job: `CSVLoader` only loads data, `MeanImputer` only fills missing values, and `DataProcessor` only handles data manipulation.
  - **Open/Closed Principle (OCP)**: The system is open for extension but closed for modification. Introducing a new distance metric (e.g., Cosine Distance) requires only subclassing `DistanceMetric`, without touching `KNNClassifier`.
  - **Liskov Substitution Principle (LSP)**: Derived classes can stand in for their abstractions. Any encoder (`LabelEncoder`, `OneHotEncoder`, `TargetEncoder`) satisfies the interface that `DataProcessor` expects.
  - **Interface Segregation Principle (ISP)**: Interfaces are kept small. `RegressionStrategy` requires a single method (`train`), so implementers are not forced to provide unrelated methods.
  - **Dependency Inversion Principle (DIP)**: High-level objects depend on abstractions rather than on low-level concrete modules. `LinearRegression` depends only on the `RegressionStrategy` interface, which decouples the model from any specific training algorithm.
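
The Open/Closed and Dependency Inversion points can be illustrated with the Cosine Distance example mentioned above. This is a hedged sketch: the `compute` method name is hypothetical, and the `nearest_label` function is a simplified 1-nearest-neighbor stand-in for `KNNClassifier`, not the real class.

```python
import math
from abc import ABC, abstractmethod


class DistanceMetric(ABC):
    @abstractmethod
    def compute(self, a, b):
        """Return a non-negative dissimilarity between two vectors."""


class ManhattanDistance(DistanceMetric):
    def compute(self, a, b):
        return sum(abs(x - y) for x, y in zip(a, b))


class CosineDistance(DistanceMetric):
    """New metric added by subclassing only; the consumer below is unchanged."""

    def compute(self, a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return 1.0 - dot / norms


def nearest_label(sample, train_X, train_y, metric: DistanceMetric):
    # Depends only on the DistanceMetric abstraction (DIP).
    best = min(range(len(train_X)), key=lambda i: metric.compute(sample, train_X[i]))
    return train_y[best]


train_X = [(10, 0), (0, 1)]
train_y = ["x", "y"]
sample = (1, 0.1)
print(nearest_label(sample, train_X, train_y, ManhattanDistance()))
print(nearest_label(sample, train_X, train_y, CosineDistance()))
```

The two metrics disagree on this sample because Manhattan distance favors the closer point while cosine distance favors the point in the same direction, yet `nearest_label` needed no edits to support either.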

### 6. Architectural & Design Patterns
- **Where**: Throughout the design of `data.py` and `core.py`.
- **How**:
  - **Pipeline Architecture**: `DataPipeline` chains file checking, factory creation, loading, and feature preparation into a single linear entry point (`run_default_preprocessing`).
  - **Strategy Pattern**: Implemented multiple times to provide interchangeable components:
    - Optimization algorithms in `LinearRegression` via `LeastSquaresStrategy` and `GradientDescentStrategy`.
    - Distance formulations in `KNNClassifier` via `EuclideanDistance` and `ManhattanDistance`.
    - Data imputation in `DataProcessor` via `MeanImputer`, `MedianImputer`, and `ModeImputer`.
    - Variable transformations via `LabelEncoder`, `OneHotEncoder`, and `TargetEncoder`.
  - **Factory Method Pattern**: Used to create the appropriate data loader without coupling the pipeline to concrete loader classes. `DataCreator` acts as the creator interface and declares `create_document()`. The concrete creators `CSVCreator` and `JSONCreator` override this method to instantiate and return a `CSVLoader` or a `JSONLoader` respectively, which keeps instantiation out of the main pipeline.
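
The Factory Method arrangement could be sketched as follows. The class names and `create_document()` come from the description above; the `load(text)` signature, the hypothetical `load_with` helper, and parsing from an in-memory string rather than a file path are assumptions made so the example runs without external files.

```python
import csv
import io
import json
from abc import ABC, abstractmethod


class DataLoader(ABC):
    @abstractmethod
    def load(self, text):
        """Parse raw text into a list of records (hypothetical signature)."""


class CSVLoader(DataLoader):
    def load(self, text):
        return list(csv.DictReader(io.StringIO(text)))


class JSONLoader(DataLoader):
    def load(self, text):
        return json.loads(text)


class DataCreator(ABC):
    @abstractmethod
    def create_document(self) -> DataLoader:
        """Factory method: subclasses decide which loader to build."""


class CSVCreator(DataCreator):
    def create_document(self) -> DataLoader:
        return CSVLoader()


class JSONCreator(DataCreator):
    def create_document(self) -> DataLoader:
        return JSONLoader()


def load_with(creator: DataCreator, text):
    # The caller never names a concrete loader class.
    return creator.create_document().load(text)


print(load_with(CSVCreator(), "a,b\n1,2\n"))
print(load_with(JSONCreator(), '[{"a": 1, "b": 2}]'))
```

Supporting another format would mean adding one loader and one creator subclass, with `load_with` left unchanged.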
