Metadata-Version: 2.1
Name: cua-som
Version: 0.1.4
Summary: Computer Vision and OCR library for detecting and analyzing UI elements
Keywords: computer-vision,ocr,ui-analysis,icon-detection
Author-Email: TryCua <gh@trycua.com>
License: AGPL-3.0-or-later
Classifier: Development Status :: 4 - Beta
Classifier: License :: OSI Approved :: GNU Affero General Public License v3 or later (AGPLv3+)
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Image Recognition
Project-URL: Homepage, https://github.com/trycua/cua
Project-URL: Repository, https://github.com/trycua/cua
Project-URL: Documentation, https://github.com/trycua/cua/tree/main/docs
Requires-Python: <3.14,>=3.12
Requires-Dist: torch>=2.2.1
Requires-Dist: torchvision>=0.17.1
Requires-Dist: ultralytics>=8.1.28
Requires-Dist: easyocr>=1.7.1
Requires-Dist: numpy>=1.26.4
Requires-Dist: pillow>=10.2.0
Requires-Dist: setuptools>=75.8.1
Requires-Dist: opencv-python-headless>=4.11.0.86
Requires-Dist: matplotlib>=3.8.3
Requires-Dist: huggingface-hub>=0.21.4
Requires-Dist: supervision>=0.25.1
Requires-Dist: typing-extensions>=4.9.0
Requires-Dist: pydantic>=2.6.3
Description-Content-Type: text/markdown

<div align="center">
<h1>
  <div class="image-wrapper" style="display: inline-block;">
    <picture>
      <source media="(prefers-color-scheme: dark)" alt="logo" height="150" srcset="https://raw.githubusercontent.com/trycua/cua/main/img/logo_white.png" style="display: block; margin: auto;">
      <source media="(prefers-color-scheme: light)" alt="logo" height="150" srcset="https://raw.githubusercontent.com/trycua/cua/main/img/logo_black.png" style="display: block; margin: auto;">
      <img alt="Shows my svg">
    </picture>
  </div>

[![Python](https://img.shields.io/badge/Python-333333?logo=python&logoColor=white&labelColor=333333)](#)
[![macOS](https://img.shields.io/badge/macOS-000000?logo=apple&logoColor=F0F0F0)](#)
[![Discord](https://img.shields.io/badge/Discord-%235865F2.svg?&logo=discord&logoColor=white)](https://discord.com/invite/mVnXXpdE85)
[![PyPI](https://img.shields.io/pypi/v/cua-computer?color=333333)](https://pypi.org/project/cua-computer/)

</h1>
</div>

**Som** (Set-of-Mark) is a visual grounding component for the Computer-Use Agent (Cua) framework powering Cua, for detecting and analyzing UI elements in screenshots. Optimized for macOS Silicon with Metal Performance Shaders (MPS), it combines YOLO-based icon detection with EasyOCR text recognition to provide comprehensive UI element analysis.

## Features

- Optimized for Apple Silicon with MPS acceleration
- Icon detection using YOLO with multi-scale processing
- Text recognition using EasyOCR (GPU-accelerated)
- Automatic hardware detection (MPS → CUDA → CPU)
- Smart detection parameters tuned for UI elements
- Detailed visualization with numbered annotations
- Performance benchmarking tools

## System Requirements

- **Recommended**: macOS with Apple Silicon
  - Uses Metal Performance Shaders (MPS)
  - Multi-scale detection enabled
  - ~0.4s average detection time
- **Supported**: Any Python 3.11+ environment
  - Falls back to CPU if no GPU available
  - Single-scale detection on CPU
  - ~1.3s average detection time

## Installation

```bash
# Using PDM (recommended)
pdm install

# Using pip
pip install -e .
```

## Quick Start

```python
from som import OmniParser
from PIL import Image

# Initialize parser
parser = OmniParser()

# Process an image
image = Image.open("screenshot.png")
result = parser.parse(
    image,
    box_threshold=0.3,    # Confidence threshold
    iou_threshold=0.1,    # Overlap threshold
    use_ocr=True         # Enable text detection
)

# Access results
for elem in result.elements:
    if elem.type == "icon":
        print(f"Icon: confidence={elem.confidence:.3f}, bbox={elem.bbox.coordinates}")
    else:  # text
        print(f"Text: '{elem.content}', confidence={elem.confidence:.3f}")
```

## Docs

- [Configuration](http://localhost:8090/docs/libraries/som/configuration)

## Development

### Test Data

- Place test screenshots in `examples/test_data/`
- Not tracked in git to keep repository size manageable
- Default test image: `test_screen.png` (1920x1080)

### Running Tests

```bash
# Run benchmark with no OCR
python examples/omniparser_examples.py examples/test_data/test_screen.png --runs 5 --ocr none

# Run benchmark with OCR
python examples/omniparser_examples.py examples/test_data/test_screen.png --runs 5 --ocr easyocr
```

## License

MIT License - See LICENSE file for details.
