Metadata-Version: 2.4
Name: self-ai-operating-computer
Version: 2.0.15
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: annotated-types==0.6.0
Requires-Dist: anyio==3.7.1
Requires-Dist: certifi==2025.6.15
Requires-Dist: charset-normalizer==3.3.2
Requires-Dist: colorama==0.4.6
Requires-Dist: contourpy==1.2.0
Requires-Dist: cycler==0.12.1
Requires-Dist: distro==1.8.0
Requires-Dist: EasyProcess==1.1
Requires-Dist: entrypoint2==1.1
Requires-Dist: exceptiongroup==1.1.3
Requires-Dist: fonttools==4.44.0
Requires-Dist: h11==0.14.0
Requires-Dist: httpcore==1.0.2
Requires-Dist: httpx>=0.25.2
Requires-Dist: idna==3.4
Requires-Dist: importlib-resources==6.1.1
Requires-Dist: kiwisolver==1.4.5
Requires-Dist: matplotlib==3.8.1
Requires-Dist: MouseInfo==0.1.3
Requires-Dist: mss==9.0.1
Requires-Dist: numpy==1.26.1
Requires-Dist: openai==1.2.3
Requires-Dist: packaging==23.2
Requires-Dist: Pillow==10.1.0
Requires-Dist: prompt-toolkit==3.0.39
Requires-Dist: PyAutoGUI==0.9.54
Requires-Dist: pydantic==2.4.2
Requires-Dist: pydantic_core==2.10.1
Requires-Dist: PyGetWindow==0.0.9
Requires-Dist: PyMsgBox==1.0.9
Requires-Dist: pyparsing==3.1.1
Requires-Dist: pyperclip==1.8.2
Requires-Dist: PyRect==0.2.0
Requires-Dist: pyscreenshot==3.1
Requires-Dist: PyScreeze==0.1.29
Requires-Dist: python3-xlib==0.15
Requires-Dist: python-dateutil==2.8.2
Requires-Dist: python-dotenv==1.0.0
Requires-Dist: pytweening==1.0.7
Requires-Dist: requests==2.31.0
Requires-Dist: rubicon-objc==0.4.7
Requires-Dist: six==1.16.0
Requires-Dist: sniffio==1.3.0
Requires-Dist: tqdm==4.66.1
Requires-Dist: typing_extensions==4.8.0
Requires-Dist: urllib3==2.0.7
Requires-Dist: wcwidth==0.2.9
Requires-Dist: zipp==3.17.0
Requires-Dist: google-generativeai==0.3.0
Requires-Dist: aiohttp==3.9.1
Requires-Dist: ultralytics==8.0.227
Requires-Dist: easyocr==1.7.1
Requires-Dist: ollama==0.1.6
Requires-Dist: anthropic
Dynamic: description
Dynamic: description-content-type
Dynamic: license-file
Dynamic: requires-dist

ome
<h1 align="center">Self-Operating Computer Framework</h1>

<p align="center">
  <strong>A framework to enable multimodal models to operate a computer.</strong>
</p>

### Version v2.0.15 (Latest) Release Summary

**New Features:**

*   **Centralized Model Management:** All model configurations are now managed in a single file (`operate/models/model_configs.py`), making it easier to add, remove, and manage models.
*   **Expanded Ollama Model Support:** Added support for `qwen2.5vl:3b` and `gemma3:4b`.
*   **Enhanced Debugging:** Added a `-d` flag (alias for `--verbose`) that provides detailed debugging information, including the full prompt sent to the AI and the raw response received.

**Improvements:**

*   **Improved System Prompt:** The system prompt has been enhanced with a more structured format, explicit JSON schema definitions, and clear examples to improve model accuracy and reliability.

**Bug Fixes:**

*   Fixed an issue where the model selection screen was not correctly displaying all available models.
*   Resolved an `IndentationError` in the model configuration file.

<p align="center">
  Using the same inputs and outputs as a human operator, the model views the screen and decides on a series of mouse and keyboard actions to reach an objective. Released Nov 2023, the Self-Operating Computer Framework was one of the first examples of usiself-ai-operating-computerng a multimodal model to view the screen and operate a computer.
</p>

<div align="center">
  <img src="/readme/choose-model.png" width="750"  style="max-width: 100%;"/>
</div>


<!--
:rotating_light: **OUTAGE NOTIFICATION: gpt-4o**
**This model is currently experiencing an outage so the self-operating computer may not work as expected.**
-->


## Key Features
- **Compatibility**: Designed for various multimodal models.
- **Expanded Model Support**: Now integrated with the latest **OpenAI o3, o4-mini, GPT-4.1, GPT-4.1 mini, GPT-4.1 nano**, **Gemini 2.5 Pro, Gemini 2.5 Flash**, and **Gemma 3n** models (including `e2b` and `e4b` variants), and **Gemma 3:12b** alongside existing support for GPT-4o, Claude 3, Qwen-VL, and LLaVa.
- **Enhanced Ollama Integration**: Improved handling for Ollama models, including default host configuration and more informative error messages.
- **Future Plans**: Support for additional models.

## Demo
https://github.com/malah-code/self-ai-operating-computer/assets/42594239/9e8abc96-c76a-46fb-9b13-03678b3c67e0


## Run `Self-Operating Computer`

### Run from PyPI

To run the application by installing it from PyPI, follow these steps:

1.  **Install the project:**
    ```bash
    pip install self-ai-operating-computer
    ```
2.  **Run the project:**
    ```bash
    operate
    ```

### Run from Source Code

To run the application from your local copy (after making changes), follow these steps:

1.  **Uninstall previous installations (if any):**
    ```bash
    pip uninstall self-ai-operating-computer
    ```
    Confirm uninstallation when prompted.

2.  **Install in editable mode:**
    Navigate to the project's root directory (where `setup.py` is located) and run:
    ```bash
    pip install -e .
    ```
    This links your local source code to your Python environment, so changes are immediately reflected.

3.  **Run the project:**
    ```bash
    operate
    ```
    If you run `operate` without any arguments, a welcome screen will be displayed, followed by an interactive model selection menu. If a model requires an API key that is not found in your environment variables or `.env` file, you will be prompted to enter it.

<div align="center">
  <img src="/readme/choose-model.png" width="750"  style="max-width: 100%;"/>
</div>

Once you select a model, you will be prompted to provide a custom system prompt if the `CUSTOM_SYSTEM_PROMPT` environment variable is not set. If the environment variable is set, the prompt screen will be skipped and the value from the environment variable will be used automatically. You can also load the prompt from a file if needed.

<div align="center">
  <img src="/readme/system-parameter-1.png" width="750"  style="max-width: 100%;"/>
  <img src="/readme/prompt1.png" width="750"  style="max-width: 100%;"/>
</div>

4.  **Enter your OpenAI Key**: If you don't have one, you can obtain an OpenAI key [here](https://platform.openai.com/account/api-keys). If you need you change your key at a later point, run `vim .env` to open the `.env` and replace the old key. 

<div align="center">
  <img src="/readme/key.png" width="300"  style="margin: 10px;"/>
</div>

5.  **Give Terminal app the required permissions**: As a last step, the Terminal app will ask for permission for "Screen Recording" and "Accessibility" in the "Security & Privacy" page of Mac's "System Preferences".

<div align="center">
  <img src="/readme/terminal-access-1.png" width="300"  style="margin: 10px;"/>
  <img src="/readme/terminal-access-2.png" width="300"  style="margin: 10px;"/>
</div>

## Using `operate` Modes

#### OpenAI models

The default model for the project is gpt-4o which you can use by simply typing `operate`. To try running OpenAI's new `o1` model, use the command below.

```
operate -m o1-with-ocr
```

To experiment with OpenAI's latest `gpt-4.1` model, run:

```
operate -m gpt-4.1
```

To experiment with OpenAI's latest `gpt-4.1 mini` model, run:

```
operate -m gpt-4.1-mini
```

To experiment with OpenAI's latest `gpt-4.1 nano` model, run:

```
operate -m gpt-4.1-nano
```

To experiment with OpenAI's latest `o3` model, run:

```
operate -m o3
```

To experiment with OpenAI's latest `o4-mini` model, run:

```
operate -m o4-mini
```


### Multimodal Models  `-m`
Try Google's `gemini-1.5-pro-latest`, `gemini-2.5-pro`, or `gemini-2.5-flash` by following the instructions below. Start `operate` with the Gemini model
```
operate -m gemini-2.5-pro
```

**Enter your Google AI Studio API key when terminal prompts you for it** If you don't have one, you can obtain a key [here](https://makersuite.google.com/app/apikey) after setting up your Google AI Studio account. You may also need [authorize credentials for a desktop application](https://ai.google.dev/palm_docs/oauth_quickstart). It took me a bit of time to get it working, if anyone knows a simpler way, please make a PR.

#### Try Claude `-m claude-3`
Use Claude 3 with Vision to see how it stacks up to GPT-4-Vision at operating a computer. Navigate to the [Claude dashboard](https://console.anthropic.com/dashboard) to get an API key and run the command below to try it. 

```
operate -m claude-3
```

#### Try qwen `-m qwen-vl`
Use Qwen-vl with Vision to see how it stacks up to GPT-4-Vision at operating a computer. Navigate to the [Qwen dashboard](https://bailian.console.aliyun.com/) to get an API key and run the command below to try it. 

```
operate -m qwen-vl
```

#### Try LLaVa Hosted Through Ollama `-m llava`
If you wish to experiment with the Self-Operating Computer Framework using LLaVA on your own machine, you can with Ollama!   

First, install Ollama on your machine from https://ollama.ai/download.   

Once Ollama is installed, pull the LLaVA model:
```
ollama pull llava
```
This will download the model on your machine which takes approximately 5 GB of storage.   

When Ollama has finished pulling LLaVA, start the server:
```
ollama serve
```

That's it! Now start `operate` and select the LLaVA model:
```
operate -m llava
```   
**Important:** Error rates when using LLaVA are very high. This is simply intended to be a base to build off of as local multimodal models improve over time.

#### Try Gemma 3n Hosted Through Ollama `-m gemma3n`
If you wish to experiment with the Self-Operating Computer Framework using Gemma 3n on your own machine, you can with Ollama!   
*Note: Ollama currently only supports MacOS and Linux. Windows now in Preview*   

First, install Ollama on your machine from https://ollama.ai/download.   

Once Ollama is installed, pull the Gemma 3n model:
```
ollama pull gemma3n
```
This will download the model on your machine. You can also pull specific versions like `ollama pull gemma3n:e2b` or `ollama pull gemma3n:e4b`.

When Ollama has finished pulling Gemma 3n, start the server:
```
ollama serve
```

That's it! Now start `operate` and select the Gemma 3n model. You can specify `gemma3n`, `gemma3n:e2b`, or `gemma3n:e4b`:
```
operate -m gemma3n:e2b
```   

#### Try Gemma 3:12b Hosted Through Ollama `-m gemma3:12b`
If you wish to experiment with the Self-Operating Computer Framework using Gemma 3:12b on your own machine, you can with Ollama!   
*Note: Ollama currently only supports MacOS and Linux. Windows now in Preview*   

First, install Ollama on your machine from https://ollama.ai/download.   

Once Ollama is installed, pull the Gemma 3:12b model:
```
ollama pull gemma3:12b
```
This will download the model on your machine.   

When Ollama has finished pulling Gemma 3:12b, start the server:
```
ollama serve
```

That's it! Now start `operate` and select the Gemma 3:12b model:
```
operate -m gemma3:12b
```   

Learn more about Ollama at its [GitHub Repository](https://www.github.com/ollama/ollama)

### Voice Mode `--voice`
The framework supports voice inputs for the objective. Try voice by following the instructions below. 
**Clone the repo** to a directory on your computer:
```
git clone https://github.com/malah-code/self-ai-operating-computer.git
```
**Cd into directory**:
```
cd self-ai-operating-computer
```
Install the additional `requirements-audio.txt`
```
pip install -r requirements-audio.txt
```
**Install device requirements**
For mac users:
```
brew install portaudio
```
For Linux users:
```
sudo apt install portaudio19-dev python3-pyaudio
```
Run with voice mode
```
operate --voice
```

### Optical Character Recognition Mode `-m gpt-4-with-ocr`
The Self-Operating Computer Framework now integrates Optical Character Recognition (OCR) capabilities with the `gpt-4-with-ocr` mode. This mode gives GPT-4 a hash map of clickable elements by coordinates. GPT-4 can decide to `click` elements by text and then the code references the hash map to get the coordinates for that element GPT-4 wanted to click. 

Based on recent tests, OCR performs better than `som` and vanilla GPT-4 so we made it the default for the project. To use the OCR mode you can simply write: 

 `operate` or `operate -m gpt-4-with-ocr` will also work. 

### Set-of-Mark Prompting `-m gpt-4-with-som`
The Self-Operating Computer Framework now supports Set-of-Mark (SoM) Prompting with the `gpt-4-with-som` command. This new visual prompting method enhances the visual grounding capabilities of large multimodal models.

Learn more about SoM Prompting in the detailed arXiv paper: [here](https://arxiv.org/abs/2310.11441).

For this initial version, a simple YOLOv8 model is trained for button detection, and the `best.pt` file is included under `model/weights/`. Users are encouraged to swap in their `best.pt` file to evaluate performance improvements. If your model outperforms the existing one, please contribute by creating a pull request (PR).

Start `operate` with the SoM model

```
operate -m gpt-4-with-som
```



## Contributions are Welcomed!:

If you want to contribute yourself, see [CONTRIBUTING.md](/CONTRIBUTING.md).

## Feedback

For any input on improving this project, feel free to reach out to [Josh](https://twitter.com/josh_bickett) on Twitter. 

## Join Our Discord Community

For real-time discussions and community support, join our Discord server. 
- If you're already a member, join the discussion in [#self-ai-operating-computer](https://discord.com/channels/877638638001877052/1181241785834541157).
- If you're new, first [join our Discord Server](https://discord.com/channels/877638638001877052/1181241785834541157).

## Follow HyperWriteAI for More Updates

Stay updated with the latest developments:
- Follow HyperWriteAI on [Twitter](https://twitter.com/HyperWriteAI).
- Follow HyperWriteAI on [LinkedIn](https://www.linkedin.com/company/malah-code/).

## Compatibility
- This project is compatible with Mac OS, Windows, and Linux (with X server installed).

## OpenAI Rate Limiting Note
The ```gpt-4o``` model is required. To unlock access to this model, your account needs to spend at least $5 in API credits. Pre-paying for these credits will unlock access if you haven't already spent the minimum $5.   
Learn more **[here](https://platform.openai.com/docs/guides/rate-limits?context=tier-one)**

## Supported Models Summary

Here's a summary of all currently supported models and how to run them:

*   **OpenAI GPT-4o (Default):**
    ```
    operate
    ```
*   **OpenAI o1-with-ocr:**
    ```
    operate -m o1-with-ocr
    ```
*   **OpenAI o3:**
    ```
    operate -m o3
    ```
*   **OpenAI o4-mini:**
    ```
    operate -m o4-mini
    ```
*   **OpenAI gpt-4.1:**
    ```
    operate -m gpt-4.1
    ```
*   **OpenAI gpt-4.1 mini:**
    ```
    operate -m gpt-4.1-mini
    ```
*   **OpenAI gpt-4.1 nano:**
    ```
    operate -m gpt-4.1-nano
    ```
*   **OpenAI gpt-4-with-ocr:**
    ```
    operate -m gpt-4-with-ocr
    ```
*   **OpenAI gpt-4-with-som:**
    ```
    operate -m gpt-4-with-som
    ```
*   **Google Gemini 1.5 Pro (latest):**
    ```
    operate -m gemini-1.5-pro-latest
    ```
*   **Google Gemini 2.5 Pro:**
    ```
    operate -m gemini-2.5-pro
    ```
*   **Google Gemini 2.5 Flash:**
    ```
    operate -m gemini-2.5-flash
    ```
*   **Anthropic Claude 3:**
    ```
    operate -m claude-3
    ```
*   **Qwen-VL:**
    ```
    operate -m qwen-vl
    ```
*   **LLaVa (via Ollama):**
    ```
    operate -m llava
    ```
*   **Gemma 3n (via Ollama):**
    ```
    operate -m gemma3n
    ```
*   **Gemma 3n:e2b (via Ollama):**
    ```
    operate -m gemma3n:e2b
    ```
*   **Gemma 3n:e4b (via Ollama):**
    ```
    operate -m gemma3n:e4b
    ```
*   **Gemma 3:12b (via Ollama):**
    ```
    operate -m gemma3:12b
    ```

## Release Notes

### Version 0.0.X (Latest)

**New Features:**

*   **Interactive Model Selection:** When running `operate` without specifying a model, a welcome screen is displayed, followed by an interactive menu to select your desired model.
*   **Dynamic API Key Prompting:** The application now intelligently prompts for required API keys (e.g., OpenAI, Google, Anthropic) only when a model requiring that key is selected and the key is not found in your environment variables or `.env` file.
*   **Custom System Prompt:** Users can now provide a custom system prompt from a file or an environment variable (`CUSTOM_SYSTEM_PROMPT`). If the environment variable is set, the option to load from it will be hidden.

**Improvements:**

*   **Expanded Google Gemini Support:** Added full support for `gemini-2.5-pro` and `gemini-2.5-flash` models.
*   **Enhanced Ollama Integration:** Improved handling for Ollama models, including setting `http://localhost:11434` as the default host and providing more informative error messages when Ollama models are not found.
*   **Gemma 3n Model Support:** Integrated support for `gemma3n`, `gemma3n:e2b`, and `gemma3n:e4b` models via Ollama.
*   **Robust Error Handling:** Improved error handling for API calls to prevent unexpected fallbacks and provide clearer error messages.

**Bug Fixes:**

*   Resolved an issue where the application would incorrectly prompt for an OpenAI API key when a Google Gemini model was selected.
*   Fixed an issue where the application would attempt to use an incorrect model name for `gemini-2.5-flash-lite` (which is not a valid model name).
*   Addressed the "Extra data" JSON parsing error when receiving responses from Gemini models.
