Cortex is a local AI engine for developers to run and customize local LLMs. It is packaged with a Docker-inspired command-line interface and a TypeScript client library. It can be used as a standalone server or imported as a library.

Available Commands:

cortex start - Start the Cortex API server (starts automatically with other commands)

cortex run - Shortcut for `cortex models start`. Pull a remote model or start a local model, and start chatting.

cortex pull - Download a model.

cortex models - Manage and configure models.

cortex ps - Display active models and their operational status.

cortex engines - Manage Cortex engines.

cortex update - Update the Cortex version.

cortex stop - Stop the Cortex API server.

cortex pull - Downloads the model if it is not already present.

cortex engines install - Installs the required engines if they are missing.

cortex models start - Starts the model.

cortex run [options] <model_id> - Runs all of the above steps with a single command (see the example after the options below).

cortex --verbose [subcommand] - Use this flag to display verbose output of the internal processes.

-h, --help - Display help information for the command.

-d, --detached - Load the model without starting an interactive chat.
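
For example, a minimal interactive and detached invocation, using the `llama3.1:8b-gguf-q4-km` model that appears in the text generation guide later in this document:

```sh
# Pull (if needed), load the model, and open an interactive chat
cortex run llama3.1:8b-gguf-q4-km

# Load the model without starting an interactive chat
cortex run -d llama3.1:8b-gguf-q4-km
```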

cortex pull <model_name> - Download a model from Cortex's built-in models.

cortex pull <author/RepoID> - Download a model from a HuggingFace repository.

cortex pull <huggingface URL ending with .gguf> - Download a model from a HuggingFace direct URL.

cortex pull [options] <model_id> - Download a model specified by a model_id (see the examples below).

cortex --verbose [subcommand] - Display more detailed output of the internal processes.

cortex.exe pull [options] <model_id> - Windows equivalent of `cortex pull`.
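
A sketch of the three pull forms, borrowing model and repository names used elsewhere in this document; the direct URL is illustrative:

```sh
# Built-in model from the Cortex Model Hub
cortex pull llama3.1:8b-gguf-q4-km

# Model from a HuggingFace repository (author/RepoID)
cortex pull bartowski/Mixtral-8x22B-v0.1

# Direct HuggingFace URL to a .gguf file (illustrative URL)
cortex pull https://huggingface.co/bartowski/Mixtral-8x22B-v0.1/resolve/main/Mixtral-8x22B-v0.1-IQ3_M-00001-of-00005.gguf
```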

cortex stop [options] - This command stops the API server.

cortex.exe stop [options] - This command stops the API server.

cortex serve [options] - Start the API server.

cortex serve [options] stop - Stop the API server.

cortex start [options] - Starts the Cortex API server. If the server is not yet running, it will start automatically when you run other Cortex commands. See the example after the options below.

cortex.exe start [options] - Starts the Cortex API server processes on Windows.

-p, --port <port> - Port to serve the application.

--loglevel <loglevel> - Set the log level for the Cortex server. Levels in order of priority: ERROR, WARN, INFO, DEBUG, TRACE.
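
For example, using the default port 39281 that the Python client connects to later in this guide:

```sh
# Start the server on an explicit port with INFO-level logging
cortex start -p 39281 --loglevel INFO
```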

cortex config - Update server configurations such as CORS and Allowed Headers.

cortex config status - Returns all server configurations.

cortex update [options] - Updates Cortex.cpp to the provided version or the latest version.

sudo cortex update [options] - Updates Cortex.cpp to the provided version or the latest version on macOS/Linux.

cortex.exe update [options] - Updates Cortex.cpp to the provided version or the latest version on Windows.

-v - Specify the version of Cortex to update to (see the example below).
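
For example, on macOS/Linux (the version number below is hypothetical):

```sh
# Update to the latest version
sudo cortex update

# Update to a specific version (hypothetical version number)
sudo cortex update -v 1.0.1
```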

cortex hardware - Manage and monitor hardware resources.

cortex hardware list - Lists all the hardware resources.

cortex hardware activate - Activates hardware for Cortex; currently only GPUs are supported.

cortex telemetry [options] - Fetch telemetry logs, providing vital data for assessing Cortex's performance, usage, and health.

-t, --type - Specify the type of telemetry log to fetch. Currently only `crash` is supported (see the example below).
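
For example:

```sh
# Fetch crash telemetry logs (the only type currently supported)
cortex telemetry --type crash
```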

cortex embeddings [options] [model_id] [message] - Creates the embedding vector representing the input text (see the example after this list).

cortex-beta embeddings [options] [model_id] [message] - Creates the embedding vector representing the input text.

cortex-nightly embeddings [options] [model_id] [message] - Creates the embedding vector representing the input text.

cortex.exe embeddings [options] [model_id] [message] - Creates the embedding vector representing the input text.

cortex-beta.exe embeddings [options] [model_id] [message] - Creates the embedding vector representing the input text.

cortex-nightly.exe embeddings [options] [model_id] [message] - Creates the embedding vector representing the input text.
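
For example, assuming the model from the text generation guide below is available:

```sh
# Create an embedding vector for a short input text
cortex embeddings llama3.1:8b-gguf-q4-km "Hello, world"
```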

cortex models get <model_id> - This command returns the details of a model specified by a model_id.

cortex models list [options] - This command lists all the downloaded local and remote models.

cortex models start [options] <model_id> - This command starts a model defined by a model_id.

cortex models stop <model_id> - This command stops a model defined by a model_id.

cortex models delete <model_id> - This command deletes a local model defined by a model_id.

cortex models update [options] - This command updates the model.yaml file of a local model.

cortex models import --model_id <model_id> --model_path </path/to/your/model.gguf> - This command imports a local model using the model's gguf file (see the workflow example below).
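
A typical model-management workflow might look like the following sketch; the model ID is taken from the text generation guide below, and the import ID and path are hypothetical:

```sh
# Inspect what is available
cortex models list
cortex models get llama3.1:8b-gguf-q4-km

# Start and later stop the model
cortex models start llama3.1:8b-gguf-q4-km
cortex models stop llama3.1:8b-gguf-q4-km

# Import a local GGUF file (hypothetical ID and path)
cortex models import --model_id my-model --model_path /path/to/your/model.gguf
```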

cortex configs - This command allows you to customize Cortex's configurations.

cortex configs get <name> - This command returns the detail of a configuration specified by its `name`.

cortex configs list - This command lists all of Cortex's configurations.

cortex configs set [options] - This command sets a specific configuration within Cortex.

cortex engines - This command allows you to manage the engines available within Cortex (see the example after this list).

cortex engines list - This command lists all of Cortex's engines.

cortex engines get <engine_name> - This command returns the details of an engine specified by an engine_name.

cortex engines install [options] <engine_name> - This command downloads the required dependencies and installs the engine within Cortex.

cortex engines uninstall [options] <engine_name> - This command uninstalls the engine within Cortex.
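
For example, using the llama-cpp engine referenced in the model.yaml section below:

```sh
# List engines, inspect one, then install and uninstall it
cortex engines list
cortex engines get llama-cpp
cortex engines install llama-cpp
cortex engines uninstall llama-cpp
```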

cortex models list [options] - Lists all local and remote models.

-f, --format <format> - Specify output format for the models list.

-h, --help - Display help for command.

cortex models stop <model_id> - This command stops a model specified by a `model_id`. Use the `model_id` of a model that you started previously.

cortex models start [model_id] - Start a model defined by a model_id.

cortex models start [model_id]:[engine] [options] - Start with a specified engine.

cortex models remove <model_id> - This command deletes a local model defined by a model_id.

cortex models update [options] <model_id> - Updates a model configuration defined by a model_id.

-c, --options <options...> - Specify the options to update the model. Syntax: -c option1=value1 option2=value2 (see the example below).
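
For example, adjusting two inference parameters from the model.yaml section below (a sketch of the `-c` syntax):

```sh
# Update two inference parameters on a local model
cortex models update llama3.1:8b-gguf-q4-km -c temperature=0.7 max_tokens=4096
```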

cortex models pull <model_id> - Downloads a model using a HuggingFace model_id.

cortex models download _ - Alias for downloading models.

cortex configs get <name> - Returns a config detail defined by a config name.

cortex configs set - Sets a specific configuration within Cortex (see the example after the options below).

-k, --key <key> - Configuration key.

-v, --value <value> - Configuration value.

-g, --group <group> - Configuration group.
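
A sketch for the CORS setting mentioned above; the key name and value format are assumptions:

```sh
# Enable CORS on the server (key name and value format assumed)
cortex configs set -k cors -v on
```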

cortex engines init llama-cpp - Sets up and downloads the required dependencies to run the Llama.cpp engine.

cortex engines init onnxruntime - Sets up and downloads the required dependencies to run the ONNX Runtime engine.

cortex engines init tensorrt-llm - Sets up and downloads the required dependencies to run the TensorRT-LLM engine.

cortex engines init -h - Displays help for the command.


model.yaml

Cortex uses a model.yaml file to specify the configuration desired for each model. Models can be downloaded from the Cortex Model Hub or Hugging Face repositories. Once downloaded, the model data is parsed and stored in the models directory.

```yaml
# BEGIN GENERAL METADATA
model: gemma-2-9b-it-Q8_0 ## Model ID used to construct requests; should be unique across models (author/quantization)
name: Llama 3.1      ## metadata.general.name
version: 1           ## metadata.version
sources:             ## can be universal protocol (models://) OR absolute local file path (file://) OR https remote URL (https://)
  - models://huggingface/bartowski/Mixtral-8x22B-v0.1/main/Mixtral-8x22B-v0.1-IQ3_M-00001-of-00005.gguf # for downloaded model from HF
  - file://C:/Users/user/Downloads/Mixtral-8x22B-v0.1-IQ3_M-00001-of-00005.gguf # for imported model
# END GENERAL METADATA

# BEGIN INFERENCE PARAMETERS
## BEGIN REQUIRED
stop:                ## tokenizer.ggml.eos_token_id
  - <|end_of_text|>
  - <|eot_id|>
  - <|eom_id|>
## END REQUIRED
## BEGIN OPTIONAL
stream: true         # Default: true
top_p: 0.9           # Ranges: 0 to 1
temperature: 0.6     # Ranges: 0 to 1
frequency_penalty: 0 # Ranges: 0 to 1
presence_penalty: 0  # Ranges: 0 to 1
max_tokens: 8192     # Defaults to the model's context length
seed: -1
dynatemp_range: 0
dynatemp_exponent: 1
top_k: 40
min_p: 0.05
tfs_z: 1
typ_p: 1
repeat_last_n: 64
repeat_penalty: 1
mirostat: false
mirostat_tau: 5
mirostat_eta: 0.1
penalize_nl: false
ignore_eos: false
n_probs: 0
n_parallels: 1
min_keep: 0
## END OPTIONAL
# END INFERENCE PARAMETERS

# BEGIN MODEL LOAD PARAMETERS
## BEGIN REQUIRED
prompt_template: |+  # tokenizer.chat_template
  <|begin_of_text|><|start_header_id|>system<|end_header_id|>

  {system_message}<|eot_id|><|start_header_id|>user<|end_header_id|>

  {prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
## END REQUIRED
## BEGIN OPTIONAL
ctx_len: 0          # llama.context_length | 0 or undefined = loaded from model
ngl: 33             # Undefined = loaded from model
engine: llama-cpp
## END OPTIONAL
# END MODEL LOAD PARAMETERS
```

The model.yaml file is composed of three high-level sections: general metadata, inference parameters, and model load parameters. Each section contains a set of parameters that define the model's behavior and configuration; some parameters are optional.

The model.yaml file contains sensible defaults for each parameter, but there are instances where you may need to override these defaults to get your model to work as intended. For example, if you train or fine-tune a highly bespoke model with a custom template and less common parameters, you can specify these in the model.yaml file.

Text Generation


Cortex provides a text generation endpoint that is fully compatible with OpenAI's API.
This section shows you how to generate text using Cortex with the OpenAI Python SDK.

## Text Generation with OpenAI compatibility

Start the server and run a model in detached mode.

```sh
cortex run -d llama3.1:8b-gguf-q4-km
```

Create a directory and a Python environment, then start a Python or IPython shell.

```sh
mkdir test-generation
cd test-generation
```
```sh
python -m venv .venv # or uv venv .venv --python 3.13
source .venv/bin/activate
pip install ipython openai rich # or uv pip install ipython openai rich
```
```sh
ipython # or "uv run ipython"
```

Import the necessary modules and create a client.

```py
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:39281/v1",
    api_key="not-needed"
)
```

### Generate Text

Basic completion:

```py
response = client.chat.completions.create(
    model="llama3.1:8b-gguf-q4-km",
    messages=[
        {"role": "user", "content": "Tell me a short story about a friendly robot."}
    ]
)
print(response.choices[0].message.content)
```
```
Here's a short story about a friendly robot:

**Zeta's Gift**

In a small town surrounded by lush green hills, there lived a robot named Zeta. Zeta was unlike any other robot in the world. While others
were designed for specific tasks like assembly or transportation, Zeta was created with a single purpose: to spread joy and kindness.

Zeta's bright blue body was shaped like a ball, with glowing lines that pulsed with warmth on its surface. Its large, round eyes sparkled
with a warm light, as if reflecting the friendliness within. Zeta loved nothing more than making new friends and surprising them with small
gifts.

One sunny morning, Zeta decided to visit the local bakery owned by Mrs. Emma, who was famous for her delicious pastries. As Zeta entered the
shop, it was greeted by the sweet aroma of freshly baked bread. The robot's advanced sensors detected a young customer, Timmy, sitting at a
corner table, looking sad.

Zeta quickly approached Timmy and offered him a warm smile. "Hello there! I'm Zeta. What seems to be troubling you?" Timmy explained that he
was feeling down because his family couldn't afford his favorite dessert – Mrs. Emma's famous chocolate cake – for his birthday.

Moved by Timmy's story, Zeta asked Mrs. Emma if she could help the young boy celebrate his special day. The baker smiled and handed Zeta a
beautifully decorated cake. As the robot carefully placed the cake on a tray, it sang a gentle melody: "Happy Birthday, Timmy! May your day
be as sweet as this treat!"

Timmy's eyes widened with joy, and he hugged Zeta tightly. Word of Zeta's kindness spread quickly through the town, earning the robot the
nickname "The Friendly Robot." From that day on, whenever anyone in need was spotted, Zeta would appear at their side, bearing gifts and
spreading love.

Zeta continued to surprise the townspeople with its thoughtfulness and warm heart, proving that even a machine could be a source of comfort
and joy.
```

With additional parameters:

```py
response = client.chat.completions.create(
    model="llama3.1:8b-gguf-q4-km",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What are the main differences between Python and C++?"}
    ],
    temperature=0.7,
    max_tokens=150,
    top_p=1.0,
    frequency_penalty=0.0,
    presence_penalty=0.0
)
```
```
ChatCompletion(
    id='dnMbB12ZR6JdVDw2Spi8',
    choices=[
        Choice(
            finish_reason='stop',
            index=0,
            logprobs=None,
            message=ChatCompletionMessage(
                content="Python and C++ are two popular programming languages with distinct characteristics, use cases, ...",
                refusal=None,
                role='assistant',
                audio=None,
                function_call=None,
                tool_calls=None
            )
        )
    ],
    created=1738236652,
    model='_',
    object='chat.completion',
    service_tier=None,
    system_fingerprint='_',
    usage=CompletionUsage(
        completion_tokens=150,
        prompt_tokens=33,
        total_tokens=183,
        completion_tokens_details=None,
        prompt_tokens_details=None
    )
)
```

Stream the response:

```py
stream = client.chat.completions.create(
    model="llama3.1:8b-gguf-q4-km",
    messages=[
        {"role": "user", "content": "Write a haiku about programming."}
    ],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
```
```
Code flows like a stream
 Errors lurk in every line
Bug hunt, endless quest
```

Multiple messages in a conversation:

```py
messages = [
    {"role": "system", "content": "You are a knowledgeable science teacher."},
    {"role": "user", "content": "What is photosynthesis?"},
    {"role": "assistant", "content": "Photosynthesis is the process by which plants convert sunlight into energy."},
    {"role": "user", "content": "Can you explain it in more detail?"}
]

response = client.chat.completions.create(
    model="llama3.1:8b-gguf-q4-km",
    messages=messages
)
print(response.choices[0].message.content)
```
```
"Photosynthesis is actually one of my favorite topics to teach! It's a crucial process that supports life on Earth, and
I'd be happy to break it down for you.\n\nPhotosynthesis occurs in specialized organelles called chloroplasts, which are present in plant
cells. These tiny factories use energy from the sun to convert carbon dioxide (CO2) and water (H2O) into glucose (a type of sugar) and
oxygen (O2).\n\nHere's a simplified equation:\n\n6 CO2 + 6 H2O + light energy → C6H12O6 (glucose) + 6 O2\n\nIn more detail, the process
involves several steps:\n\n1. **Light absorption**: Light from the sun is absorbed by pigments ..."
```

The API endpoint provided by Cortex supports all standard OpenAI parameters, including:
- `temperature`: Controls randomness (0.0 to 2.0)
- `max_tokens`: Limits the length of the response
- `top_p`: Controls diversity via nucleus sampling
- `frequency_penalty`: Reduces repetition of token sequences
- `presence_penalty`: Encourages talking about new topics
- `stop`: Custom stop sequences
- `stream`: Enable/disable streaming responses
