Metadata-Version: 2.3
Name: mynk_etl
Version: 0.1.15
Summary: A PySpark-based ETL pipeline framework for financial data processing
Requires-Dist: authlib>=1.7.0
Requires-Dist: cachetools>=7.0.6
Requires-Dist: confluent-kafka==2.12.2
Requires-Dist: fastavro>=1.12.1
Requires-Dist: httpx>=0.28.1
Requires-Dist: ipykernel>=7.2.0
Requires-Dist: pandas-market-calendars>=5.3.0
Requires-Dist: pathlib>=1.0.1
Requires-Dist: psycopg2-binary>=2.9.12
Requires-Dist: pynessie>=0.67.0
Requires-Dist: pyspark==3.5.6
Requires-Dist: xxhash>=3.7.0
Requires-Python: >=3.11, <=3.12.10
Description-Content-Type: text/markdown

# MYNK ETL

A comprehensive Python-based Extract-Transform-Load (ETL) pipeline framework designed for financial data processing and management. This project leverages PySpark, Apache Iceberg, Apache Kafka, and Nessie for building scalable, production-ready data pipelines.

## Overview

MYNK ETL provides a modular architecture for:
- **Extracting** data from multiple sources (Kafka, databases, APIs)
- **Transforming** financial data with built-in support for market data operations
- **Loading** processed data to data lakes and RDBMS systems
- **Managing** distributed computing workflows using Apache Spark

## Features

- **Modular Pipeline Architecture**: Abstract base classes for extensible Extract, Transform, and Load operations
- **Multi-Source Support**: 
  - Kafka streaming integration with Avro schema support
  - RDBMS connectors (PostgreSQL, etc.)
  - File-based I/O
- **Data Lake Integration**: Apache Iceberg support for ACID transactions and time-travel queries
- **Financial Data Processing**: Specialized utilities for yfinance ticker data transformation
- **Scalable Computing**: PySpark for distributed batch and streaming operations
- **Catalog Management**: Nessie catalog integration for Git-like version control of data
- **Comprehensive Logging**: JSON-based logging with decorators for operation tracking
- **Configuration Management**: YAML-based configuration system for infrastructure and tables

## Project Structure

```
mynk_etl/
├── extract/              # Data extraction implementations
│   ├── extract.py       # Abstract Extract base class
│   ├── kafkaExtract.py  # Kafka streaming extraction
│   └── __init__.py
├── load/                # Data loading implementations
│   ├── load.py          # Abstract Load base class
│   ├── icebergWriter.py # Apache Iceberg writer
│   ├── fileWriter.py    # File-based writer
│   ├── rdbmsWriter.py   # RDBMS writer
│   └── __init__.py
├── transform/           # Data transformation implementations
│   ├── transform.py     # Abstract Transform base class
│   ├── yfinance/        # Financial data transformations
│   │   ├── tickTransform.py
│   │   └── __init__.py
│   └── __init__.py
├── sparkUtils/          # Spark initialization and utilities
│   ├── sparkInit.py     # Spark session creation
│   ├── sparkComfunc.py  # Common Spark functions
│   └── __init__.py
├── utils/               # Utility modules
│   ├── common/
│   │   ├── constants.py     # Application-wide constants
│   │   ├── confs.py         # Configuration loading
│   │   ├── genUtils.py      # General utilities
│   │   ├── kUtils.py        # Kafka utilities
│   │   ├── marketUtils.py   # Market/financial utilities
│   │   └── __init__.py
│   └── logger/
│       ├── jsonLogging.py        # JSON log formatting
│       ├── logDecor.py           # Logging decorators
│       ├── psgLogging.py         # PostgreSQL logging
│       ├── QueueListenerHandler.py
│       └── __init__.py
├── mainCalls/
│   ├── yfinanceUtils.py     # yfinance ETL orchestration
│   └── __init__.py
├── main.py              # Main ETL orchestration
└── __init__.py
```

## Prerequisites

- **Python**: 3.11 - 3.12.10
- **PySpark**: 3.5.6
- **Java Runtime**: Java 8, 11, or 17 (for Spark 3.5)
- **Configuration Files**: `tables.yml` and infrastructure config (loaded via environment)

## Installation

### Using the uv Package Manager

```bash
uv sync
```

### Using pip

```bash
pip install -e .
```

### Dependencies

Key dependencies include:
- `pyspark==3.5.6` - Distributed computing framework
- `confluent-kafka==2.12.2` - Kafka client
- `pynessie>=0.67.0` - Nessie catalog client
- `fastavro>=1.12.1` - Avro serialization
- `psycopg2-binary>=2.9.12` - PostgreSQL connector
- `pandas-market-calendars>=5.3.0` - Market calendar support

## Configuration

### Environment Variables

Set the infrastructure environment:
```bash
export INFRA_ENV=DEV  # or PROD
```

### Configuration Files

1. **tables.yml** - Define table properties and configurations per environment
2. **Infrastructure Config** - Spark parameters, Kafka brokers, Nessie settings, MinIO credentials

Configuration is loaded via the `mynk_etl.utils.common.confs` module.
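
An illustrative `tables.yml` shape, keyed by environment and `table_group.table_name` (the keys and values below are hypothetical; the real schema is whatever `confs` expects):

```yaml
# Hypothetical layout -- adjust keys to match what confs actually parses
DEV:
  financial:
    stock_ticks:
      namespace: bronze
      partition_by: trade_date
      write_mode: append
```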

## Usage

### Command Line Interface

```bash
mynk_etl <method> <config_key> [--path /path/to/config]
```

**Parameters:**
- `method`: Operation to execute (e.g., `yfinanceDataLoad`)
- `config_key`: Configuration key in format `table_group.table_name`
- `path`: Optional path to configuration directory (defaults to current working directory)
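
For example, assuming `tables.yml` defines a `financial.stock_ticks` entry:

```bash
export INFRA_ENV=DEV
mynk_etl yfinanceDataLoad financial.stock_ticks --path ./configs
```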

### Programmatic Usage

```python
from mynk_etl import main_function

# Execute yfinance data load
main_function('yfinanceDataLoad', 'financial.stock_ticks')
```

## Module Details

### Extract Module
Abstract interface for data extraction from various sources. Supports both streaming and batch modes.
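
A minimal sketch of a custom batch extractor, assuming the `Extract` interface shown under Architecture Patterns below (the import path and JDBC details are illustrative, not the project's actual API):

```python
from pyspark.sql import DataFrame, SparkSession

from mynk_etl.extract.extract import Extract  # assumed import path


class JdbcExtract(Extract):
    """Hypothetical batch-only extractor reading one table over JDBC."""

    def __init__(self, spark: SparkSession, url: str, table: str):
        self.spark = spark
        self.url = url
        self.table = table

    def extractSparkData(self) -> DataFrame:
        # Plain batch read; credentials would come from the infra config
        return (self.spark.read.format("jdbc")
                .option("url", self.url)
                .option("dbtable", self.table)
                .load())

    def extractSparkStreamData(self) -> DataFrame:
        raise NotImplementedError("this source is batch-only")
```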

### Transform Module
Abstract interface for data transformations. Specialized implementations for financial data (ticker transformations, temporal partitioning).
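
As an illustration of temporal partitioning (the column names here are hypothetical):

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F


def add_temporal_partitions(df: DataFrame, ts_col: str = "ts") -> DataFrame:
    """Derive date and hour partition columns from an event timestamp."""
    return (df.withColumn("trade_date", F.to_date(F.col(ts_col)))
              .withColumn("trade_hour", F.hour(F.col(ts_col))))
```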

### Load Module
Abstract interface for writing data to destination systems:
- **Iceberg**: ACID transactions, time-travel queries (see the write sketch after this list)
- **RDBMS**: PostgreSQL and other relational databases
- **File**: Parquet, CSV exports
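
A hedged sketch of both Iceberg write paths, assuming a session already wired to a catalog named `nessie` (the table identifier is hypothetical):

```python
from pyspark.sql import DataFrame

TABLE = "nessie.bronze.stock_ticks"  # hypothetical <catalog>.<namespace>.<table>


def append_batch(df: DataFrame, table: str = TABLE) -> None:
    """Append a batch DataFrame to an existing Iceberg table (DataFrameWriterV2)."""
    df.writeTo(table).append()


def start_stream(df: DataFrame, table: str = TABLE,
                 checkpoint: str = "/tmp/ckpt"):
    """Continuously append a streaming DataFrame to the same table."""
    return (df.writeStream
              .format("iceberg")
              .outputMode("append")
              .option("checkpointLocation", checkpoint)
              .toTable(table))
```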

### Spark Utilities
- Session initialization with Iceberg and Kafka support
- Common DataFrame operations and optimizations
- Distributed computing utilities
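
The session bootstrap in `sparkInit.py` plausibly resembles the standard Iceberg-on-Nessie wiring; a minimal sketch under that assumption (URI and warehouse path are placeholders that would come from the infrastructure config):

```python
from pyspark.sql import SparkSession


def build_session(app_name: str = "mynk_etl") -> SparkSession:
    """Create a SparkSession with an Iceberg catalog backed by Nessie."""
    return (SparkSession.builder
            .appName(app_name)
            .config("spark.sql.extensions",
                    "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
            .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
            .config("spark.sql.catalog.nessie.catalog-impl",
                    "org.apache.iceberg.nessie.NessieCatalog")
            .config("spark.sql.catalog.nessie.uri", "http://nessie:19120/api/v1")
            .config("spark.sql.catalog.nessie.ref", "main")
            .config("spark.sql.catalog.nessie.warehouse", "s3a://warehouse/")
            .getOrCreate())
```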

### Utils
- **Constants**: Application-wide configuration and enums
- **Configuration**: YAML-based config loading
- **Logging**: JSON-formatted logs with operation decorators
- **Market Utils**: Financial-specific utilities such as market-calendar helpers (sketched after this list)
- **Kafka Utils**: Kafka connection and messaging utilities
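
The market utilities build on `pandas-market-calendars` (a declared dependency); a small sketch of the kind of helper involved (the function itself is hypothetical):

```python
import pandas_market_calendars as mcal


def nyse_trading_days(start: str, end: str) -> list[str]:
    """Return NYSE trading days in [start, end] as ISO date strings."""
    schedule = mcal.get_calendar("NYSE").schedule(start_date=start, end_date=end)
    return [d.strftime("%Y-%m-%d") for d in schedule.index]


# nyse_trading_days("2024-01-01", "2024-01-10") -> ["2024-01-02", ...]
```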

## Testing

Run tests using pytest:

```bash
pytest tests/
```

Run with multiple workers:
```bash
pytest -n auto tests/
```

### Test Structure

- `tests/unit/` - Unit tests
- `tests/fixtures/` - Reusable test fixtures
- `tests/helper/` - Test helper utilities
- `tests/data/` - Test data files
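
A minimal example of the kind of unit test this layout suggests (the helper under test is a stand-in, not a real project function):

```python
# tests/unit/test_market_utils.py -- illustrative only
import pytest


def is_trading_day(date: str) -> bool:
    """Stand-in for a hypothetical helper in marketUtils."""
    return date not in {"2024-01-01"}  # New Year's Day


@pytest.mark.parametrize("date,expected", [
    ("2024-01-01", False),
    ("2024-01-02", True),
])
def test_is_trading_day(date, expected):
    assert is_trading_day(date) == expected
```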

### Dev Dependencies

- `pytest>=9.0.2` - Test framework
- `pytest-mock>=3.15.1` - Mocking utilities
- `pytest-xdist>=3.8.0` - Parallel test execution

## Logging

The framework provides structured JSON logging with:
- Log decoration for tracking function execution (sketched below)
- Operation-level logging
- Multiple handlers (console, file, PostgreSQL)
- Correlation tracking via RUN_ID
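
An illustrative sketch of the pattern (the formatter fields and decorator are assumptions, not the project's actual API):

```python
import functools
import json
import logging
import os
import time

RUN_ID = os.environ.get("RUN_ID", "local")  # correlation id, per the list above


class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "run_id": RUN_ID,
            "msg": record.getMessage(),
        })


def log_call(func):
    """Log entry, exit, and duration of the wrapped function."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        log = logging.getLogger(func.__module__)
        start = time.perf_counter()
        log.info("start %s", func.__name__)
        try:
            return func(*args, **kwargs)
        finally:
            log.info("end %s (%.3fs)", func.__name__, time.perf_counter() - start)
    return wrapper
```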

## Architecture Patterns

### Abstract Base Classes

The framework uses abstract base classes for extensibility:

```python
from abc import ABC, abstractmethod


# Extract
class Extract(ABC):
    @abstractmethod
    def extractSparkData(self): ...

    @abstractmethod
    def extractSparkStreamData(self): ...


# Transform
class Transform(ABC):
    # Custom transformation implementations
    ...


# Load
class Load(ABC):
    @abstractmethod
    def streamWriter(self): ...

    @abstractmethod
    def nonStreamWriter(self): ...
```

This design allows easy implementation of new data sources, transformations, and destinations.
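
For instance, a new destination only needs the two abstract methods (a toy console writer, assuming the methods take a DataFrame and the import path below):

```python
from mynk_etl.load.load import Load  # assumed import path


class ConsoleWriter(Load):
    """Hypothetical debug destination that prints instead of persisting."""

    def nonStreamWriter(self, df):
        df.show(truncate=False)

    def streamWriter(self, df):
        return (df.writeStream
                  .format("console")
                  .outputMode("append")
                  .start())
```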

## Performance Features

- **PySpark Optimization**: Leverages Spark's distributed computing
- **Iceberg Features**: ACID transactions, schema evolution, time-travel queries (example after this list)
- **Kafka Integration**: Reliable streaming data ingestion
- **Horizontal Scalability**: Designed for multi-node Spark clusters
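
Time travel, for example, becomes plain Spark SQL once the catalog is configured as shown under Spark Utilities (`spark`, the catalog, and the table name are assumed):

```python
# Read the table as it was at a past point in time
df_then = spark.sql(
    "SELECT * FROM nessie.bronze.stock_ticks TIMESTAMP AS OF '2024-01-02 00:00:00'"
)

# Equivalent DataFrame API: Iceberg's as-of-timestamp read option (epoch millis)
df_then = (spark.read
           .option("as-of-timestamp", "1704153600000")
           .table("nessie.bronze.stock_ticks"))
```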

## Version

Current version: **0.1.15**

## Contributing

When adding new modules:
1. Follow the modular structure (Extract → Transform → Load)
2. Implement abstract base classes
3. Add comprehensive docstrings
4. Include unit tests in `tests/unit/`
5. Add fixtures to `tests/fixtures/` if needed

## License

[Add your license here]

## Support

For issues and questions, please refer to the project documentation or contact the development team.
