Metadata-Version: 2.4
Name: nedo-vision-training
Version: 1.3.7
Summary: A comprehensive training service library for AI models in the Nedo Vision platform
Author-email: Willy Achmat Fauzi <willy.achmat@gmail.com>
Maintainer-email: Willy Achmat Fauzi <willy.achmat@gmail.com>
License-Expression: MIT
Project-URL: Homepage, https://gitlab.com/sindika/research/nedo-vision/nedo-vision-training-service
Project-URL: Documentation, https://gitlab.com/sindika/research/nedo-vision/nedo-vision-training-service/-/blob/main/README.md
Project-URL: Repository, https://gitlab.com/sindika/research/nedo-vision/nedo-vision-training-service
Project-URL: Bug Reports, https://gitlab.com/sindika/research/nedo-vision/nedo-vision-training-service/-/issues
Keywords: computer-vision,machine-learning,ai,training,deep-learning,object-detection,neural-networks,pytorch
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: grpcio<2.0.0,>=1.59.0
Requires-Dist: grpcio-tools<2.0.0,>=1.59.0
Requires-Dist: pika<2.0.0,>=1.3.0
Requires-Dist: rfdetr<2.0.0,>=1.2.0
Requires-Dist: pynvml<12.0.0,>=11.4.0
Requires-Dist: psutil<6.0.0,>=5.8.0
Requires-Dist: torch<3.0.0,>=2.0.0
Requires-Dist: torchvision<1.0.0,>=0.15.0
Requires-Dist: numpy<2.0.0,>=1.21.0
Requires-Dist: pillow<11.0.0,>=9.0.0
Requires-Dist: opencv-python<5.0.0,>=4.8.0
Requires-Dist: requests<3.0.0,>=2.31.0
Requires-Dist: tqdm<5.0.0,>=4.65.0
Provides-Extra: gpu
Requires-Dist: torch==2.3.1; extra == "gpu"
Requires-Dist: torchvision==0.18.1; extra == "gpu"
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: black>=22.0.0; extra == "dev"
Requires-Dist: isort>=5.10.0; extra == "dev"
Requires-Dist: mypy>=0.950; extra == "dev"
Requires-Dist: flake8>=4.0.0; extra == "dev"
Requires-Dist: pre-commit>=2.17.0; extra == "dev"

# Nedo Vision Training Service

A distributed AI model training service for the Nedo Vision platform. It manages the training workflow, monitoring, and lifecycle of computer vision models built on the RF-DETR architecture.

## Features

- **Configurable Training Service**: Automated training with customizable intervals and parameters
- **gRPC Communication**: Reliable communication with the vision manager and other services
- **Distributed Training**: Support for multi-GPU and distributed training scenarios
- **Real-time Monitoring**: System resource monitoring and training progress tracking
- **Cloud Integration**: AWS S3 integration for model storage and dataset management
- **Message Queue Support**: RabbitMQ integration for task queue management

## Installation

Install the package from PyPI:

```bash
pip install nedo-vision-training
```

For GPU support with CUDA 12.1:

```bash
pip install "nedo-vision-training[gpu]" --extra-index-url https://download.pytorch.org/whl/cu121
```

For development with all tools:

```bash
pip install "nedo-vision-training[dev]"
```

## Quick Start

### Using the CLI

After installation, the `nedo-training` CLI is available:

```bash
# Show CLI help
nedo-training --help

# Check system dependencies and requirements
nedo-training doctor

# Start training service with authentication token
nedo-training run --token YOUR_TOKEN

# Start with custom server configuration
nedo-training run --token YOUR_TOKEN --server-host custom.server.com --server-port 60000

# Start with custom REST API port
nedo-training run --token YOUR_TOKEN --rest-api-port 8081

# Start with custom intervals
nedo-training run --token YOUR_TOKEN --system-usage-interval 30 --latency-check-interval 15

# Start with all custom configurations
nedo-training run --token YOUR_TOKEN \
  --server-host custom.server.com \
  --server-port 60000 \
  --rest-api-port 8081 \
  --system-usage-interval 30 \
  --latency-check-interval 15
```

### Configuration Options

The CLI provides two commands. The `run` command accepts the options listed below.

#### Available Commands

- `doctor`: Check system dependencies and requirements (CUDA, NVIDIA drivers, etc.)
- `run`: Start the training service

#### Run Command Options

- `--token`: Authentication token for secure communication (required)
- `--server-host`: gRPC server host (default: localhost)
- `--server-port`: gRPC server port (default: 50051)
- `--rest-api-port`: Manager REST API port (default: 8081)
- `--system-usage-interval`: System usage reporting interval in seconds (default: 30)
- `--latency-check-interval`: Latency monitoring interval in seconds (default: 10)
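
The options and defaults above can be summarized as a small `argparse` sketch. This is not the package's actual CLI code. `build_parser` is a hypothetical name, and treating the two intervals as integers is an assumption. It only restates the documented flags in runnable form.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented `nedo-training` options."""
    parser = argparse.ArgumentParser(prog="nedo-training")
    commands = parser.add_subparsers(dest="command", required=True)

    # `doctor` has no documented options.
    commands.add_parser("doctor", help="Check system dependencies and requirements")

    run = commands.add_parser("run", help="Start the training service")
    run.add_argument("--token", required=True, help="Authentication token")
    run.add_argument("--server-host", default="localhost", help="gRPC server host")
    run.add_argument("--server-port", type=int, default=50051, help="gRPC server port")
    run.add_argument("--rest-api-port", type=int, default=8081, help="Manager REST API port")
    # Assumption: both intervals are whole seconds.
    run.add_argument("--system-usage-interval", type=int, default=30,
                     help="System usage reporting interval in seconds")
    run.add_argument("--latency-check-interval", type=int, default=10,
                     help="Latency monitoring interval in seconds")
    return parser


# Unspecified options fall back to the documented defaults.
args = build_parser().parse_args(["run", "--token", "YOUR_TOKEN", "--server-port", "60000"])
print(args.server_host, args.server_port, args.latency_check_interval)  # localhost 60000 10
```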

## Architecture

### Core Components

- **TrainingService**: Main service orchestrator for training workflows
- **RFDETRTrainer**: RF-DETR algorithm implementation with PyTorch backend
- **TrainerLogger**: Real-time training progress logging via gRPC
- **ResourceMonitor**: System resource monitoring (GPU, CPU, memory)
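
The internals of `ResourceMonitor` are not shown here. As a rough illustration of the kind of sampling it describes, the sketch below takes one CPU, memory, and GPU snapshot with `psutil` and `pynvml`, both of which are declared dependencies. The function name `sample_usage` and the shape of the returned dictionary are assumptions for illustration only.

```python
import time

try:
    import psutil  # declared dependency; guarded so the sketch still imports without it
except ImportError:
    psutil = None

try:
    import pynvml  # declared dependency; needs an NVIDIA driver to report anything
except ImportError:
    pynvml = None


def sample_usage() -> dict:
    """Return one snapshot of CPU, memory, and (when available) GPU usage."""
    snapshot = {
        "timestamp": time.time(),
        "cpu_percent": None,
        "memory_percent": None,
        "gpus": [],
    }
    if psutil is not None:
        # Measure CPU utilization over a short 0.1 s window.
        snapshot["cpu_percent"] = psutil.cpu_percent(interval=0.1)
        snapshot["memory_percent"] = psutil.virtual_memory().percent

    if pynvml is not None:
        try:
            pynvml.nvmlInit()
        except pynvml.NVMLError:
            return snapshot  # no NVIDIA driver or GPU on this machine
        try:
            for index in range(pynvml.nvmlDeviceGetCount()):
                handle = pynvml.nvmlDeviceGetHandleByIndex(index)
                util = pynvml.nvmlDeviceGetUtilizationRates(handle)
                mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
                snapshot["gpus"].append({
                    "index": index,
                    "gpu_percent": util.gpu,
                    "memory_used_bytes": mem.used,
                    "memory_total_bytes": mem.total,
                })
        except pynvml.NVMLError:
            pass  # keep whatever was collected before the error
        finally:
            pynvml.nvmlShutdown()
    return snapshot


print(sample_usage())
```

A monitor built this way would call `sample_usage()` once per reporting interval, for example every `--system-usage-interval` seconds.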

### Dependencies

The service relies on several key technologies:

- **PyTorch**: Deep learning framework with CUDA support
- **RF-DETR**: Roboflow's Real-time Detection Transformer
- **gRPC**: High-performance RPC framework
- **RabbitMQ**: Message queue for distributed task management
- **AWS SDK**: Cloud storage integration
- **NVIDIA Management Library (NVML)**: GPU monitoring and management, accessed through `pynvml`
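
To see which of these dependencies are installed in the current environment, the standard library's `importlib.metadata` can report their versions. This is a small sketch that complements `nedo-training doctor` and is not its implementation. `KEY_PACKAGES` and `installed_versions` are hypothetical names, and the package list is taken from the `Requires-Dist` entries of this distribution.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as declared in this package's metadata.
KEY_PACKAGES = ("torch", "torchvision", "rfdetr", "grpcio", "pika", "pynvml", "psutil")


def installed_versions(packages=KEY_PACKAGES) -> dict:
    """Map each distribution name to its installed version, or None if absent."""
    result = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result


for name, found in installed_versions().items():
    print(f"{name:12s} {found or 'not installed'}")
```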

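## Development Setup

Install the `dev` extra to get the development tools declared in the package metadata: pytest, black, isort, mypy, flake8, and pre-commit.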

## Troubleshooting

### Common Issues

1. **gRPC Connection Timeouts**: Check that `--server-host` and `--server-port` point at a reachable gRPC server
2. **CUDA Out of Memory**: Reduce batch size or use gradient accumulation
3. **Missing Dependencies**: Run `nedo-training doctor` to see what is missing, then reinstall with `pip install --upgrade nedo-vision-training`
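
For the out-of-memory case, gradient accumulation trades a smaller per-step batch for more steps per optimizer update, so the effective batch size stays roughly constant. The arithmetic can be sketched as follows. `accumulation_plan` is a hypothetical helper, and how (or whether) this service exposes batch size or accumulation settings is not documented here.

```python
import math


def accumulation_plan(target_batch_size: int, micro_batch_size: int) -> tuple[int, int]:
    """Return (accumulation_steps, effective_batch_size) for a smaller micro-batch."""
    if target_batch_size < 1 or micro_batch_size < 1:
        raise ValueError("batch sizes must be positive integers")
    # Round up so the effective batch is never smaller than the target.
    steps = math.ceil(target_batch_size / micro_batch_size)
    return steps, steps * micro_batch_size


print(accumulation_plan(16, 4))  # (4, 16)
print(accumulation_plan(16, 6))  # (3, 18)
```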

### Support

For issues and questions:

- Check the logs for detailed error information
- Ensure your token is valid and not expired
- Verify network connectivity to the training manager
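
A quick way to check basic network connectivity is to open a plain TCP connection to the configured host and port. The sketch below uses only the standard library. `can_connect` is a hypothetical helper, and the host and port shown are the documented defaults. A successful connection only shows that something is listening: it does not validate the token or confirm that the endpoint speaks gRPC.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


print(can_connect("localhost", 50051))
```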

## License

This project is part of the Nedo Vision platform and is released under the MIT License, as declared in the package metadata.
