Metadata-Version: 2.4
Name: somaya
Version: 1.0.5
Summary: SOMA - Advanced Tokenization & Intelligence Framework
Home-page: https://github.com/chavalasantosh/SanVerse
Author: Santosh Chavala
Author-email: Santosh Chavala <chavalasantosh@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/chavalasantosh/SanVerse
Project-URL: Documentation, https://github.com/chavalasantosh/SanVerse#readme
Project-URL: Repository, https://github.com/chavalasantosh/SanVerse.git
Project-URL: Issues, https://github.com/chavalasantosh/SanVerse/issues
Keywords: soma,tokenization,nlp,embeddings,intelligence
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests>=2.31.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: fastapi>=0.104.0
Requires-Dist: uvicorn[standard]>=0.24.0
Requires-Dist: pydantic>=2.5.0
Requires-Dist: click>=8.1.7
Requires-Dist: rich>=13.7.0
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# SOMA: Advanced Intelligence Framework

[![PyPI version](https://badge.fury.io/py/somaya.svg)](https://badge.fury.io/py/somaya)
[![Python Versions](https://img.shields.io/pypi/pyversions/somaya.svg)](https://pypi.org/project/somaya/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Code Style: Black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Build Status](https://github.com/chavalasantosh/SanVerse/actions/workflows/python-tests.yml/badge.svg)](https://github.com/chavalasantosh/SanVerse/actions)
[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/chavalasantosh/SanVerse)
[![zread](https://img.shields.io/badge/Ask_Zread-_.svg?style=flat&color=00b0aa&labelColor=000000&logo=data%3Aimage%2Fsvg%2Bxml%3Bbase64%2CPHN2ZyB3aWR0aD0iMTYiIGhlaWdodD0iMTYiIHZpZXdCb3g9IjAgMCAxNiAxNiIgZmlsbD0ibm9uZSIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KPHBhdGggZD0iTTQuOTYxNTYgMS42MDAxSDIuMjQxNTZDMS44ODgxIDEuNjAwMSAxLjYwMTU2IDEuODg2NjQgMS42MDE1NiAyLjI0MDFWNC45NjAxQzEuNjAxNTYgNS4zMTM1NiAxLjg4ODEgNS42MDAxIDIuMjQxNTYgNS42MDAxSDQuOTYxNTZDNS4zMTUwMiA1LjYwMDEgNS42MDE1NiA1LjMxMzU2IDUuNjAxNTYgNC45NjAxVjIuMjQwMUM1LjYwMTU2IDEuODg2NjQgNS4zMTUwMiAxLjYwMDEgNC45NjE1NiAxLjYwMDFaIiBmaWxsPSIjZmZmIi8%2BCjxwYXRoIGQ9Ik00Ljk2MTU2IDEwLjM5OTlIMi4yNDE1NkMxLjg4ODEgMTAuMzk5OSAxLjYwMTU2IDEwLjY4NjQgMS42MDE1NiAxMS4wMzk5VjEzLjc1OTlDMS42MDE1NiAxNC4xMTM0IDEuODg4MSAxNC4zOTk5IDIuMjQxNTYgMTQuMzk5OUg0Ljk2MTU2QzUuMzE1MDIgMTQuMzk5OSA1LjYwMTU2IDE0LjExMzQgNS42MDE1NiAxMy43NTk5VjExLjAzOTlDNS42MDE1NiAxMC42ODY0IDUuMzE1MDIgMTAuMzk5OSA0Ljk2MTU2IDEwLjM5OTlaIiBmaWxsPSIjZmZmIi8%2BCjxwYXRoIGQ9Ik0xMy43NTg0IDEuNjAwMUgxMS4wMzg0QzEwLjY4NSAxLjYwMDEgMTAuMzk4NCAxLjg4NjY0IDEwLjM5ODQgMi4yNDAxVjQuOTYwMUMxMC4zOTg0IDUuMzEzNTYgMTAuNjg1IDUuNjAwMSAxMS4wMzg0IDUuNjAwMUgxMy43NTg0QzE0LjExMTkgNS42MDAxIDE0LjM5ODQgNS4zMTM1NiAxNC4zOTg0IDQuOTYwMVYyLjI0MDFDMTQuMzk4NCAxLjg4NjY0IDE0LjExMTkgMS42MDAxIDEzLjc1ODQgMS42MDAxWiIgZmlsbD0iI2ZmZiIvPgo8cGF0aCBkPSJNNCAxMkwxMiA0TDQgMTJaIiBmaWxsPSIjZmZmIi8%2BCjxwYXRoIGQ9Ik00IDEyTDEyIDQiIHN0cm9rZT0iI2ZmZiIgc3Ryb2tlLXdpZHRoPSIxLjUiIHN0cm9rZS1saW5lY2FwPSJyb3VuZCIvPgo8L3N2Zz4K&logoColor=ffffff)](https://zread.ai/chavalasantosh/SANVerse)

**SOMA** is a next-generation tokenization and intelligence framework designed to bridge the gap between raw text and semantic understanding. Unlike traditional tokenizers that simply split text, SOMA applies mathematical analysis, feature extraction, and cognitive structures to create a richer representation of language.

> _"Intelligence begins with how we perceive the data. SOMA changes the perception."_

---

## 🚀 Why SOMA?

SOMA is built for researchers and developers who need more than just BPE (Byte Pair Encoding). It offers a unified engine for:

- **Universal Tokenization**: Seamlessly switch between whitespace, word, character, subword, and grammar-based strategies.
- **Mathematical Embeddings**: Proprietary "Frontend Digit" calculation for deterministic, low-compute feature extraction.
- **Cognitive Architecture**: Integrated support for Small Language Models (SLMs) and reasoning pipelines.
- **Structure-Aware Parsing**: The `soma_core` module models text hierarchy and structural patterns.
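
To make the idea of switching strategies concrete, here is a minimal, standard-library sketch of three of the strategies named above (whitespace, word, and character). It is not SOMA's implementation: the function names, the `STRATEGIES` registry, and the `tokenize` helper are hypothetical and exist only for illustration.

```python
import re
from typing import Callable, Dict, List


def whitespace_tokens(text: str) -> List[str]:
    """Split on runs of whitespace; punctuation stays attached to words."""
    return text.split()


def word_tokens(text: str) -> List[str]:
    """Separate word characters from individual punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def character_tokens(text: str) -> List[str]:
    """Emit every non-whitespace character as its own token."""
    return [ch for ch in text if not ch.isspace()]


# Hypothetical registry (not part of SOMA): strategy name -> tokenizer function.
STRATEGIES: Dict[str, Callable[[str], List[str]]] = {
    "whitespace": whitespace_tokens,
    "word": word_tokens,
    "character": character_tokens,
}


def tokenize(text: str, method: str = "word") -> List[str]:
    """Dispatch to the requested strategy and fail loudly on unknown names."""
    if method not in STRATEGIES:
        raise ValueError(
            f"unknown method {method!r}; expected one of {sorted(STRATEGIES)}"
        )
    return STRATEGIES[method](text)


sample = "AI is structural."
for name in STRATEGIES:
    print(name, tokenize(sample, method=name))
```

Keeping the strategies behind one dispatch function means calling code changes only a string argument when it switches granularity, which mirrors the `tokenization_method` argument shown in the Quick Start.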

## 📦 Installation

```bash
pip install somaya
```

## ⚡ Quick Start

### Python API

```python
from soma import TextTokenizationEngine

# Initialize the engine
engine = TextTokenizationEngine()

# Process text with advanced analysis
text = "The future of AI is structural."
result = engine.tokenize(text, tokenization_method="subword")

print(f"Tokens:   {result['tokens']}")
print(f"Features: {result['features']}")
# Output:
# Tokens:   ['The', 'fut', 'ure', 'of', 'AI', 'is', 'str', 'uct', 'ural', '.']
# Features: {'entropy_index': 7, 'balance_index': 4, ...}
```

### Command Line Interface

Process files directly from your terminal:

```bash
# Tokenize a file
soma tokenize input.txt --method subword --output result.json

# Analyze text structure
soma analyze "Analyze this sentence for structural balance."
```
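
If you want to drive the CLI from a script, one possible approach is to assemble the same arguments in Python and hand them to `subprocess`. The sketch below uses only the subcommand and flags shown above (`tokenize`, `--method`, `--output`); the helper names `build_tokenize_command` and `run_tokenize` are hypothetical, and the wrapper simply returns `None` when no `soma` executable is found on `PATH`.

```python
import shutil
import subprocess
from typing import List, Optional


def build_tokenize_command(
    input_path: str,
    method: str = "subword",
    output_path: Optional[str] = None,
) -> List[str]:
    """Build the argument list for `soma tokenize` (flags taken from the README)."""
    cmd = ["soma", "tokenize", input_path, "--method", method]
    if output_path is not None:
        cmd += ["--output", output_path]
    return cmd


def run_tokenize(
    input_path: str,
    method: str = "subword",
    output_path: Optional[str] = None,
) -> Optional[int]:
    """Run the command and return its exit code, or None if `soma` is not installed."""
    if shutil.which("soma") is None:
        return None
    cmd = build_tokenize_command(input_path, method, output_path)
    # Passing a list (no shell=True) avoids shell quoting problems in file names.
    return subprocess.run(cmd, check=False).returncode


print(build_tokenize_command("input.txt", "subword", "result.json"))
```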

## 🏗️ Architecture

SOMA is modular by design, allowing you to use only what you need:

| Module                 | Purpose                                                                                                           |
| :--------------------- | :---------------------------------------------------------------------------------------------------------------- |
| **`soma`**             | The high-level wrapper and entry point for all standard operations.                                               |
| **`soma_core`**        | **Structural Core**: Handles metrics, pattern recognition, and hierarchy detection.                               |
| **`cognitive`**        | **AI Layer**: Contains reasoning engines, SLM (Small Language Model) architectures (`soma_gpt`), and verbalizers. |
| **`src`**              | **Engine Room**: The low-level implementations of parallel tokenizers and embedding generators.                   |
| **`semantic_trainer`** | **Training**: Tools for training custom semantic embeddings on your own corpora.                                  |

## 🔧 Modules Overview

### 1. SOMA Core (`soma_core`)

The backbone of the system. It replaces simple regex splitting with structure-aware parsing.

- _Key Class_: `StructureHierarchy`
- _Capabilities_: Pattern building and similarity metrics via `soma_core_metrics`.
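
As a rough illustration of what structure-aware parsing can mean in contrast to flat splitting, the sketch below builds a paragraph → sentence → word hierarchy using only the standard library. It does not use or reproduce `StructureHierarchy`; the functions `build_hierarchy` and `shape` are hypothetical, and the sentence splitter is deliberately naive (it breaks after `.`, `!`, or `?`).

```python
import re
from typing import Dict, List

Hierarchy = List[List[List[str]]]  # paragraphs -> sentences -> words


def build_hierarchy(text: str) -> Hierarchy:
    """Nest words inside sentences inside paragraphs."""
    hierarchy: Hierarchy = []
    # Paragraphs are separated by one or more blank lines.
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        sentences = []
        # Naive sentence boundary: whitespace that follows ., ! or ?
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph.strip()):
            words = re.findall(r"\w+", sentence)
            if words:
                sentences.append(words)
        if sentences:
            hierarchy.append(sentences)
    return hierarchy


def shape(hierarchy: Hierarchy) -> Dict[str, int]:
    """Summarize how many units exist at each level of the hierarchy."""
    return {
        "paragraphs": len(hierarchy),
        "sentences": sum(len(p) for p in hierarchy),
        "words": sum(len(s) for p in hierarchy for s in p),
    }


doc = "SOMA parses text. It tracks structure.\n\nA second paragraph follows."
tree = build_hierarchy(doc)
print(shape(tree))  # {'paragraphs': 2, 'sentences': 3, 'words': 10}
```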

### 2. Cognitive Layer (`cognitive`)

Where text meets reasoning.

- **Reasoning**: `soma_reasoner.py` enables logical deduction chains.
- **SLM**: `soma_gpt.py` provides a lightweight, trainable transformer implementation for specialized tasks.

### 3. Vector Integration

Seamlessly plug into vector databases.

- Built-in support for **Weaviate** and **ChromaDB**.
- Easy export of semantic embeddings to downstream ML tasks.
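
The sketch below shows, under stated assumptions, the general shape of data a vector database typically stores (an id, an embedding, and the source document) and the nearest-neighbor lookup it performs. It is a pure-Python stand-in, not SOMA's export API and not the Weaviate or ChromaDB client: the record layout, the toy three-dimensional embeddings, and the helpers `cosine_similarity` and `top_k` are all hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], records: List[Dict], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k record ids most similar to the query, best match first."""
    scored = [(r["id"], cosine_similarity(query, r["embedding"])) for r in records]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy records with made-up embeddings, purely for illustration.
records = [
    {"id": "doc-1", "embedding": [1.0, 0.0, 0.0], "document": "tokenization"},
    {"id": "doc-2", "embedding": [0.0, 1.0, 0.0], "document": "reasoning"},
    {"id": "doc-3", "embedding": [0.9, 0.1, 0.0], "document": "subword tokenization"},
]

print(top_k([1.0, 0.0, 0.0], records, k=2))
```

A real vector database replaces the linear scan in `top_k` with an approximate nearest-neighbor index, but the records you hand over usually carry the same three pieces of information.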

## 🤝 Contributing

We welcome contributions! Please see `CONTRIBUTING.md` for details.

1.  Fork the repository
2.  Create your feature branch (`git checkout -b feature/amazing-feature`)
3.  Commit your changes (`git commit -m 'Add some amazing feature'`)
4.  Push to the branch (`git push origin feature/amazing-feature`)
5.  Open a Pull Request

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

---

**Author**: Santosh Chavala

**Repository**: [https://github.com/chavalasantosh/SanVerse](https://github.com/chavalasantosh/SanVerse)
