Metadata-Version: 2.4
Name: yta-fastapi-docker-llamacpp
Version: 0.0.1
Summary: Youtube Autonomous FastAPI Docker Llama.cpp Module
License-File: LICENSE
Author: danialcala94
Author-email: danielalcalavalera@gmail.com
Requires-Python: >=3.10,<3.14
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Dist: fastapi (>=0.0.1,<9999.0.0)
Requires-Dist: uvicorn (>=0.0.1,<9999.0.0)
Requires-Dist: yta_fastapi_docker_pydantic_models (>=0.0.4,<1.0.0)
Requires-Dist: yta_httpx (>=0.0.27,<1.0.0)
Description-Content-Type: text/markdown

# Youtube Autonomous FastAPI Docker Llama.cpp Module

The module that is providing the functionality related to the Llama.cpp models hub (having the models and using them) through a FastAPI that is included and isolated in a Docker container.

This module is meant to be exposed as a container inside the internal network, to be connected with its own FastAPI that is exposing the functionality outside.

### Endpoints

#### GET
No endpoints by now.

#### POST
No endpoints by now.

## Instructions
I've followed these steps to make `llama.cpp` available in my laptop as a container running with cuda, and I've adapted this workflow to this project so its done automatically:

1. Nos aseguramos de tener la imagen de Nvidia en docker:
`$docker run --rm --gpus all nvidia/cuda:12.9.1-runtime-ubuntu24.04 nvidia-smi`

2. Creamos una carpeta `models` para tener los modelos ahí guardados (en mi caso en un SSD externo para ahorrar espacio) en `D:/llama/models`.

3. Descargamos el modelo GGUF que necesitemos (para ello, ver que tipo y qué características en función de nuestro PC), en cmd desde la carpeta `models` del paso 2:
`$huggingface-cli download unsloth/gemma-4-E2B-it-GGUF gemma-4-E2B-it-UD-Q4_K_XL.gguf --local-dir ./`

1. Descargamos el contenedor 'llama.cpp' adaptado a CUDA, estando en el cmd de la carpeta `models`:
`$docker run --rm --gpus all -p 8080:8080 -v "${PWD}:/models" ghcr.io/ggml-org/llama.cpp:server-cuda -m  --host 0.0.0.0 -ngl 999`
