Metadata-Version: 2.4
Name: structai
Version: 0.1.22
Summary: A utility package for AI development
Author-email: Wanghan Xu <xu_wanghan@sjtu.edu.cn>
Project-URL: Homepage, https://github.com/black-yt/structai
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: openai
Requires-Dist: python-Levenshtein
Requires-Dist: json_repair
Requires-Dist: pillow
Requires-Dist: httpx[socks]
Requires-Dist: pandas
Requires-Dist: numpy
Requires-Dist: tqdm
Dynamic: license-file

<div align="center">
  <h1>StructAI</h1>
</div>

<div align="center">

[![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)
[![Python](https://img.shields.io/badge/python-3.10+-yellow.svg)](https://www.python.org/)
[![GitHub](https://img.shields.io/badge/GitHub-000000?logo=github&logoColor=white)](https://github.com/black-yt/structai)
[![PyPI version](https://img.shields.io/pypi/v/structai.svg)](https://pypi.org/project/structai/)

</div>

<p align="center">
  <img src="banner.jpg" alt="banner" width="850">
</p>

StructAI is a utility library for building LLM applications, including multi-agent systems. It provides tools for LLM interaction, such as structured outputs, context management, and parallel execution, that streamline development and help when deploying AI systems at scale.

## ✨ Key Features

| Feature Category | Description | Key Capabilities |
| :--- | :--- | :--- |
| **🤖 LLM Agents** | Powerful wrappers for LLM API interactions. | Structured JSON/Dict output parsing, conversation memory management, automatic retries, timeout handling, multimodal support. |
| **⚖️ LLM Judge & Arena** | Evaluation framework for LLM responses. | Ground truth exact matching, mathematical equivalence verification (`math_verify`), LLM-as-a-judge correctness, A/B testing (Arena). |
| **🚀 Concurrency** | Parallel execution utilities. | Easy-to-use thread pool (`multi_thread`) and process pool (`multi_process`) mapping with progress bars. |
| **📄 PDF & Document** | Advanced document processing. | High-quality PDF parsing via MinerU, Markdown extraction, and embedded image extraction. |
| **🛠️ Utilities & I/O** | Essential tools for AI workflows. | Auto-detect file loading/saving (JSON, CSV, PT, etc.), text sanitization, tag extraction (`<think>`, `<answer>`), network proxy handling, caching. |
| **🌟 Claude Skills** | Self-documenting capabilities. | Generates comprehensive Markdown documentation (`structai_skill`) for providing context to Claude/LLMs about this library. |

## ⚙️ Installation

> **Recommended for most users.** Installs the latest stable release from PyPI.
```bash
pip install structai
```

> **For development.** Installs StructAI in editable mode from source, enabling live code changes.

```bash
git clone https://github.com/black-yt/structai.git
cd structai
pip install -e .
```

> **Note:** Before using LLM-related features, please ensure you have set the necessary environment variables:

```bash
export LLM_API_KEY="your-api-key"
export LLM_BASE_URL="your-api-base-url"
```

> **Note:** To use the PDF parsing functions, apply for an API token at [MinerU](https://mineru.net/) and add it to your environment variables.
```bash
export MINERU_TOKEN="your-mineru-api-key"
```

---

## 📚 StructAI Library Documentation

### Table of Contents

- [🌟 Skill](#skill)
  - [`structai_skill`](#structai_skill)
- [🤖 LLMs/vLLMs](#llmsvllms)
  - [`prompts`](#prompts)
  - [`LLMAgent` Class](#llmagent-class)
    - [`initialization`](#initialization)
    - [`__call__`](#__call__)
  - [`Judge` Class](#judge-class)
    - [`initialization`](#initialization-1)
    - [`__call__`](#__call__-1)
  - [`messages_to_responses_input`](#messages_to_responses_input)
  - [`extract_text_outputs`](#extract_text_outputs)
  - [`print_messages`](#print_messages)
- [🚀 Concurrent](#concurrent)
  - [`multi_thread`](#multi_thread)
  - [`multi_process`](#multi_process)
- [📂 I/O](#io)
  - [`load_file`](#load_file)
  - [`save_file`](#save_file)
  - [`read_pdf`](#read_pdf)
  - [`encode_image`](#encode_image)
  - [`get_all_file_paths`](#get_all_file_paths)
  - [`print_once`](#print_once)
  - [`make_print_once`](#make_print_once)
- [📝 String Processing](#string-processing)
  - [`extract_markdown_images`](#extract_markdown_images)
  - [`sanitize_text`](#sanitize_text)
  - [`filter_excessive_repeats`](#filter_excessive_repeats)
  - [`cutoff_text`](#cutoff_text)
  - [`str2dict`](#str2dict)
  - [`str2list`](#str2list)
  - [`remove_tag`](#remove_tag)
  - [`parse_think_answer`](#parse_think_answer)
  - [`extract_within_tags`](#extract_within_tags)
- [🌐 Network Service](#network-service)
  - [`add_no_proxy_if_private`](#add_no_proxy_if_private)
  - [`run_server`](#run_server)
- [⏱️ Time Limit](#time-limit)
  - [`timeout_limit`](#timeout_limit)
  - [`run_with_timeout`](#run_with_timeout)

### Skill

#### `structai_skill`

Returns a comprehensive documentation string for the StructAI library in Markdown format. This is useful for providing context to LLMs about the available tools in this library.

*   **Args**:
    *   None
*   **Returns**:
    *   (str): The documentation string.

*   **Example**:
```python
from structai import structai_skill

docs = structai_skill()
print(docs)
```

```bash
python -c "from structai import structai_skill; print(structai_skill())" > structai_skill.md
```

[Back to Table of Contents](#table-of-contents)

### LLMs/vLLMs

#### `prompts`

A dictionary containing predefined LLM prompts for various evaluation tasks, such as LLM-as-a-judge and Arena comparisons.

| Prompt Name | Description | kwargs |
| :--- | :--- | :--- |
| `llm_judge_closed_answer` | Evaluates mathematical and logical equivalence between a model answer and a ground truth answer. | `prompt_tmp` (str): template with `{question}`, `{answer}`, `{model_answer}` placeholders; `llm_tags` (dict): `{"correct": 1, "incorrect": 0}` |
| `llm_judge_arena` | Compares two answers (Answer A and Answer B) to an open-ended question and determines which is better overall. | `prompt_tmp` (str): template with `{question}`, `{answer}` (Answer A), `{model_answer}` (Answer B) placeholders; `llm_tags` (dict): `{"A": 0, "B": 1}` |

*   **Example**:
```python
from structai import prompts

# Access the prompt template and tags for closed answer evaluation
closed_answer_prompt = prompts["llm_judge_closed_answer"]["prompt_tmp"]
tags = prompts["llm_judge_closed_answer"]["llm_tags"]

print(closed_answer_prompt)
# Output:
# # Role
# You are a precise mathematical and logical evaluator...
```
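
The table above describes each prompt as a template plus a tag-to-score mapping. A minimal sketch of how such an entry could be used follows. The template text, `build_prompt`, and `tag_to_score` are hypothetical stand-ins written for illustration, not the library's actual prompt or internals.

```python
# Hypothetical template shaped like the entries described above (not the real prompt text).
template = (
    "Question: {question}\n"
    "Reference answer: {answer}\n"
    "Model answer: {model_answer}\n"
    "Reply with 'correct' or 'incorrect'."
)
llm_tags = {"correct": 1, "incorrect": 0}


def build_prompt(template: str, item: dict) -> str:
    """Fill the three placeholders from a question dictionary."""
    return template.format(
        question=item["question"],
        answer=item["answer"],
        model_answer=item["model_answer"],
    )


def tag_to_score(reply: str, tags: dict):
    """Map a judge reply to its score (case-insensitive), or None if no tag matches."""
    lookup = {key.lower(): value for key, value in tags.items()}
    return lookup.get(reply.strip().lower())


item = {"question": "1+1=?", "answer": "2", "model_answer": "2"}
prompt = build_prompt(template, item)
score = tag_to_score(" Correct ", llm_tags)
```

The same two helpers would work with the Arena tags `{"A": 0, "B": 1}`, where `answer` plays the role of Answer A and `model_answer` the role of Answer B.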

[Back to Table of Contents](#table-of-contents)

#### `LLMAgent` Class

A powerful wrapper class for interacting with OpenAI-compatible LLM APIs. It handles retries, timeouts, and structured output validation.

##### `initialization`

*   **Args**:
    *   `api_key` (str, optional): API Key. Defaults to `os.environ["LLM_API_KEY"]`.
    *   `api_base` (str, optional): Base URL. Defaults to `os.environ["LLM_BASE_URL"]`.
    *   `model_version` (str, optional): Model identifier. Default `'gpt-4.1-mini'`.
    *   `system_prompt` (str, optional): Default system prompt. Default `'You are a helpful assistant.'`.
    *   `max_tokens` (int, optional): Maximum tokens for generation. Default `None`.
    *   `temperature` (float, optional): Sampling temperature. Default `0`.
    *   `http_client` (httpx.Client, optional): Optional custom httpx client.
    *   `headers` (dict, optional): Optional custom headers.
    *   `time_limit` (int, optional): Timeout in seconds. Default `300` (5 minutes).
    *   `max_try` (int, optional): Default maximum number of attempts per call. Default `1`.
    *   `use_responses_api` (bool, optional): Whether to use the Responses API format. Default `False`.

*   **Returns**:
    *   (LLMAgent): LLMAgent instance.

*   **Example**:
```python
from structai import LLMAgent

agent = LLMAgent()
```

[Back to Table of Contents](#table-of-contents)

##### `__call__`
Sends a query to the LLM with built-in validation, parsing, and retry logic.

*   **Args**:
    *   `query` (str): The main input text or prompt to be sent to the LLM.
    *   `system_prompt` (str, optional): The system instruction. Overrides the default if provided.
    *   `return_example` (str | list | dict, optional): A template defining the expected structure and type of the response.
        *   `None` or `str` (default): Returns raw response string.
        *   `list`: Expects a JSON list string. Validates element types if example elements are provided.
        *   `dict`: Expects a JSON object string. Validates keys (supports fuzzy matching).
    *   `max_try` (int, optional): Max attempts. Defaults to instance's `max_try`.
    *   `wait_time` (float, optional): Time in seconds to wait between retries. Default `0.0`.
    *   `n` (int, optional): Number of completion choices. Default `1`.
    *   `max_tokens` (int, optional): Overrides instance's `max_tokens`.
    *   `temperature` (float, optional): Overrides instance's `temperature`.
    *   `image_paths` (list[str], optional): List of local image paths for multimodal models.
    *   `history` (list[dict], optional): Conversation history `[{"role": "user", "content": "..."}, ...]`.
    *   `use_responses_api` (bool, optional): Overrides instance setting.
    *   `list_len` (int, optional): *Validation* - Enforces exact list length.
    *   `list_min` (int | float, optional): *Validation* - Enforces minimum value for list elements.
    *   `list_max` (int | float, optional): *Validation* - Enforces maximum value for list elements.
    *   `check_keys` (bool, optional): *Validation* - Whether to validate dict keys. Default `True`.

*   **Returns**:
    *   (str | list | dict): The parsed response from the LLM.
        *   If `n > 1`, returns a list of results.
        *   Returns `None` if all retries fail.

*   **Example**:
```python
# Basic usage
response = agent("Generate a random number.", n=3, temperature=1)
# Output: ["Sure! Here's a random number for you: 738", "Sure! Here's a random number: 7382", "Sure! Here's a random number: 487."]

# Enforce the output format (List, Dict, or specific types) using `return_example`. Note that the output format needs to be explicitly specified in the prompt.
numbers = agent(
    "Generate 3 random numbers, for example, [1, 2, 3].", 
    return_example=[1], 
    list_len=3
)
# Output: [10, 42, 7]

profile = agent(
    "Create a user profile for Alice, for example, {'name': 'Alice', 'age': 1, 'city': 'shanghai'}.",
    return_example={"name": "str", "age": 1, "city": "str"}
)
# Output: {'name': 'Alice', 'age': 25, 'city': 'New York'}

# Multimodal input for vision models
description = agent(
    "Describe these images", 
    image_paths=["path/to/image_1.jpg", "path/to/image_2.jpg"]
)

# Memory context
history = [
    {"role": "user", "content": "My name is Bob."},
    {"role": "assistant", "content": "Hello Bob."}
]
answer = agent(
    "What is my name?", 
    history=history, 
)
# Output: 'Your name is Bob.'
```
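
To make the list validation options (`return_example`, `list_len`, `list_min`, `list_max`) more concrete, here is a rough sketch of the kind of check they describe. `validate_list` is a hypothetical helper, not part of StructAI, and it assumes strict JSON parsing with `json.loads`; the library's own parsing may be more lenient.

```python
import json


def validate_list(raw, example, list_len=None, list_min=None, list_max=None):
    """Parse `raw` as a JSON list and check it against the constraints.

    Returns the parsed list, or None when a check fails, in which case
    a caller could retry the request.
    """
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, list):
        return None
    if list_len is not None and len(parsed) != list_len:
        return None
    # Element types are compared with the first element of the example, if any.
    if example and not all(isinstance(x, type(example[0])) for x in parsed):
        return None
    # Range checks only apply to numeric elements.
    numbers = [x for x in parsed if isinstance(x, (int, float))]
    if list_min is not None and any(x < list_min for x in numbers):
        return None
    if list_max is not None and any(x > list_max for x in numbers):
        return None
    return parsed


ok = validate_list("[10, 42, 7]", [1], list_len=3)
too_short = validate_list("[10, 42]", [1], list_len=3)
```

Returning `None` on failure mirrors the documented behavior of `__call__` when all attempts fail.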

[Back to Table of Contents](#table-of-contents)

#### `Judge` Class

A class for evaluating model answers against ground truth answers using multiple methods: Exact Match, Math Verify, and LLM-based Judge.

##### `initialization`

*   **Args**:
    *   `api_key` (str, optional): API Key. Defaults to `os.environ["LLM_API_KEY"]`.
    *   `api_base` (str, optional): Base URL. Defaults to `os.environ["LLM_BASE_URL"]`.
    *   `model_version` (str, optional): Model identifier for the LLM Judge. Default `'gpt-4.1'`.
    *   `system_prompt` (str, optional): System prompt for the LLM Judge. Default `'You are a helpful assistant.'`.
    *   `max_tokens` (int, optional): Maximum tokens for LLM generation. Default `10`.
    *   `temperature` (float, optional): Sampling temperature for LLM. Default `0`.
    *   `http_client` (httpx.Client, optional): Optional custom httpx client.
    *   `headers` (dict, optional): Optional custom headers.
    *   `time_limit` (int, optional): Timeout in seconds for LLM API calls. Default `60`.
    *   `max_try` (int, optional): Number of retries for LLM API calls. Default `2`.
    *   `use_responses_api` (bool, optional): Whether to use the Responses API format. Default `False`.
    *   `prompt_tmp` (str, optional): Template for the LLM Judge prompt. Defaults to `default_prompt_tmp`.
    *   `use_tqdm` (bool, optional): Whether to show a progress bar for batch processing. Default `True`.
    *   `use_math_verify` (bool, optional): Whether to use the `math_verify` library for evaluation. Default `True`.
    *   `use_llm_judge` (bool, optional): Whether to use an LLM for evaluation. Default `True`.
    *   `llm_tags` (dict, optional): Mapping of LLM output strings to scores. Default `{"correct": 1, "incorrect": 0}`.
    *   `workers` (int, optional): Number of threads for parallel processing. Default `100`.

*   **Returns**:
    *   (Judge): Judge instance.

*   **Example**:
```python
from structai import Judge

judge = Judge()
```

[Back to Table of Contents](#table-of-contents)

##### `__call__`

Evaluates one or more question dictionaries using the configured evaluation methods (Exact Match, Math Verify, LLM Judge).

This method processes the input dictionary (or list of dictionaries), extracts the model answer(s), and compares them against the ground truth answer using the enabled evaluation strategies. It supports multiple model answer samples separated by `<answer_split>`.

*   **Args**:
    *   `ques_dict` (dict | list[dict]): A single dictionary or a list of dictionaries containing evaluation data.
        Each dictionary must contain the following keys:
        - `"question"` (str): The question text.
        - `"answer"` (str): The ground truth answer.
        - `"model_answer"` (str): The model's answer. If multiple samples are provided, they should be separated by `<answer_split>`.
        - `"solution"` (str, optional): The step-by-step ground truth solution.

*   **Returns**:
    *   (dict | list[dict]): The input dictionary (or list of dictionaries) updated with the following evaluation metrics:

        **Per-Sample Results (Lists):**
        - `"exact_match_list"` (list[int]): A list of 0s and 1s indicating whether each sample in `model_answer` exactly matches the ground truth (case-insensitive).
        - `"math_verify_list"` (list[int | None]): A list of 0s and 1s indicating mathematical equivalence for each sample (if `use_math_verify` is True).
        - `"llm_judge_list"` (list[int | None]): A list of 0s and 1s indicating correctness as judged by an LLM for each sample (if `use_llm_judge` is True).

        **Single-Sample Metrics (Based on the LAST sample):**
        - `"exact_match"` (int): 1 if the **last** sample is an exact match, 0 otherwise.
        - `"math_verify"` (int): 1 if the **last** sample is mathematically equivalent, 0 otherwise (if enabled).
        - `"llm_judge"` (int): 1 if the **last** sample is correct according to the LLM, 0 otherwise (if enabled).

        **Pass@k Metrics (At least ONE sample is correct):**
        - `"exact_match_pass@k"` (int): 1 if **any** sample in the list is an exact match, 0 otherwise.
        - `"math_verify_pass@k"` (int): 1 if **any** sample is mathematically equivalent, 0 otherwise (if enabled).
        - `"llm_judge_pass@k"` (int): 1 if **any** sample is correct according to the LLM, 0 otherwise (if enabled).

        **PassAll@k Metrics (ALL samples are correct):**
        - `"exact_match_passall@k"` (int): 1 if **all** samples are exact matches, 0 otherwise.
        - `"math_verify_passall@k"` (int): 1 if **all** samples are mathematically equivalent, 0 otherwise (if enabled).
        - `"llm_judge_passall@k"` (int): 1 if **all** samples are correct according to the LLM, 0 otherwise (if enabled).

        **LLM Arena Metrics (If `prompt_tmp` and `llm_tags` are set to `llm_judge_arena`):**
        In Arena mode, `llm_judge` (and each element in `llm_judge_list`) is `1` if `model_answer` (Answer B) is better, or `0` if `answer` (Answer A) is better.

*   **Example**:
```python
from structai import Judge, prompts

judge = Judge()

ques_dict = {
    "question": "1+1=?",
    "answer": "2",
    "model_answer": "2"
}
result = judge(ques_dict)
print(result["exact_match"]) # 1

ques_dicts = [
    {
        "question": "Bob's age?",
        "answer": "22",
        "model_answer": "22<answer_split>Twenty-two"
    },
    {
        "question": "Bob's age?",
        "solution": "He was born in 2003, and today is 2025.",
        "answer": "22",
        "model_answer": "20<answer_split>Bob's age is 22"
    },
    {
        "question": "Bob's age?",
        "solution": "He was born in 2003, and today is 2025.",
        "answer": "22",
        "model_answer": "20<answer_split>20+2"
    }
]
results = judge(ques_dicts)
# Output:
[
    {
        "question": "Bob's age?",
        "answer": "22",
        "model_answer": "22<answer_split>Twenty-two",
        "exact_match_list": [
            1,
            0
        ],
        "math_verify_list": [
            1,
            0
        ],
        "llm_judge_list": [
            1,
            1
        ],
        "exact_match": 0,
        "math_verify": 0,
        "llm_judge": 1,
        "exact_match_pass@k": 1,
        "math_verify_pass@k": 1,
        "llm_judge_pass@k": 1,
        "exact_match_passall@k": 0,
        "math_verify_passall@k": 0,
        "llm_judge_passall@k": 1
    },
    {
        "question": "Bob's age?",
        "solution": "He was born in 2003, and today is 2025.",
        "answer": "22",
        "model_answer": "20<answer_split>Bob's age is 22",
        "exact_match_list": [
            0,
            0
        ],
        "math_verify_list": [
            0,
            1
        ],
        "llm_judge_list": [
            0,
            1
        ],
        "exact_match": 0,
        "math_verify": 1,
        "llm_judge": 1,
        "exact_match_pass@k": 0,
        "math_verify_pass@k": 1,
        "llm_judge_pass@k": 1,
        "exact_match_passall@k": 0,
        "math_verify_passall@k": 0,
        "llm_judge_passall@k": 0
    },
    {
        "question": "Bob's age?",
        "solution": "He was born in 2003, and today is 2025.",
        "answer": "22",
        "model_answer": "20<answer_split>20+2",
        "exact_match_list": [
            0,
            0
        ],
        "math_verify_list": [
            0,
            1
        ],
        "llm_judge_list": [
            0,
            1
        ],
        "exact_match": 0,
        "math_verify": 1,
        "llm_judge": 1,
        "exact_match_pass@k": 0,
        "math_verify_pass@k": 1,
        "llm_judge_pass@k": 1,
        "exact_match_passall@k": 0,
        "math_verify_passall@k": 0,
        "llm_judge_passall@k": 0
    }
]

# Using LLM Arena for A/B testing
arena_judge = Judge(
    prompt_tmp=prompts["llm_judge_arena"]["prompt_tmp"],
    llm_tags=prompts["llm_judge_arena"]["llm_tags"],
    use_math_verify=False
)

arena_data = {
    "question": "What are the benefits of learning Python?",
    "answer": "Python is great.", # Answer A
    "model_answer": "Python is easy to read, has a large ecosystem, and is widely used in data science and web development." # Answer B
}

arena_result = arena_judge(arena_data)
# If the LLM prefers model_answer (Answer B), arena_result["llm_judge"] will be 1.
```
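
The relationship between the per-sample lists and the aggregate metrics can be sketched in a few lines. This is an illustration rather than the library's implementation: `split_exact_match` and `aggregate` are hypothetical names, and stripping whitespace before comparing is an assumption.

```python
def split_exact_match(answer: str, model_answer: str, sep: str = "<answer_split>") -> list[int]:
    """Score each sample with a case-insensitive exact match against the ground truth."""
    target = answer.strip().lower()
    return [int(sample.strip().lower() == target) for sample in model_answer.split(sep)]


def aggregate(scores: list[int], name: str) -> dict:
    """Derive the last-sample, pass@k, and passall@k metrics from per-sample scores."""
    return {
        f"{name}_list": scores,
        name: scores[-1],                       # based on the LAST sample
        f"{name}_pass@k": int(any(scores)),     # at least one sample is correct
        f"{name}_passall@k": int(all(scores)),  # every sample is correct
    }


scores = split_exact_match("22", "22<answer_split>Twenty-two")
metrics = aggregate(scores, "exact_match")
```

For the first dictionary in the batch example, this reproduces the documented `exact_match` values: the list is `[1, 0]`, the last-sample metric is `0`, pass@k is `1`, and passall@k is `0`.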

[Back to Table of Contents](#table-of-contents)

#### `messages_to_responses_input`

Converts standard Chat Completions `messages` format (list of dicts) to the input format required by the Responses API.

*   **Args**:
    *   `messages` (list[dict]): List of message dictionaries with 'role' and 'content'.
*   **Returns**:
    *   (tuple): A tuple containing `(system_prompt_content, input_blocks)`.

*   **Example**:
```python
from structai import messages_to_responses_input

messages = [{"role": "user", "content": "Hello"}]
system_prompt, input_blocks = messages_to_responses_input(messages)
```

[Back to Table of Contents](#table-of-contents)

#### `extract_text_outputs`

Extracts the text content from an LLM API response object (supports both Chat Completions and Responses API formats).

*   **Args**:
    *   `result` (object): The response object from the LLM API.
*   **Returns**:
    *   (list[str]): A list of extracted text outputs.

*   **Example**:
```python
from structai import extract_text_outputs

# Assuming 'response' is the object returned by the OpenAI client
texts = extract_text_outputs(response)
print(texts[0])
```

[Back to Table of Contents](#table-of-contents)

#### `print_messages`

Prints chat messages with colored labels and text.

*   **Args**:
    *   `messages` (list): List of message dictionaries with `role` and `content`.
    *   `user_color` (str, optional): Color for the user's message text and label background. Default is `cyan`.
    *   `ai_color` (str, optional): Color for the assistant's message text and label background. Default is `yellow`.
    *   `label_text_color` (str, optional): Color for the label text (User and Assistant). Default is `grey`.
*   **Returns**:
    *   None

*   **Example**:
```python
from structai import print_messages

messages = [
    {"role": "user", "content": "My name is Bob."},
    {"role": "assistant", "content": "Hello Bob."}
]
print_messages(messages)
```

[Back to Table of Contents](#table-of-contents)

### Concurrent

#### `multi_thread`

Executes a function concurrently for each item in `inp_list` using a thread pool.

*   **Args**:
    *   `inp_list` (list[dict]): A list of dictionaries, where each dictionary contains keyword arguments for `function`.
    *   `function` (callable): The function to execute.
    *   `max_workers` (int, optional): The maximum number of threads. Default `40`.
    *   `use_tqdm` (bool, optional): Whether to show a progress bar. Default `True`.
*   **Returns**:
    *   (list): A list of results corresponding to the input list order.

*   **Example**:
```python
from structai import multi_thread

def square(x):
    return x * x

inputs = [{"x": i} for i in range(10)]
results = multi_thread(inputs, square, max_workers=4)
print(results) # [0, 1, 4, 9, ...]
```
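
For readers curious how order-preserving keyword-argument mapping can be done, here is a minimal sketch built on the standard library's `ThreadPoolExecutor`. `run_threaded` is a hypothetical name, the progress bar is omitted, and this is not necessarily how `multi_thread` is implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def run_threaded(inp_list, function, max_workers=4):
    """Call `function(**kwargs)` for each dict and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Futures are submitted in order, so reading them back in the same
        # order keeps results aligned with inputs even if tasks finish out of order.
        futures = [pool.submit(function, **kwargs) for kwargs in inp_list]
        return [future.result() for future in futures]


def square(x):
    return x * x


results = run_threaded([{"x": i} for i in range(10)], square)
```

Collecting `future.result()` in submission order is what keeps the output aligned with `inp_list`.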

[Back to Table of Contents](#table-of-contents)

#### `multi_process`

Executes a function concurrently for each item in `inp_list` using a process pool. Ideal for CPU-bound tasks.

*   **Args**:
    *   `inp_list` (list[dict]): A list of dictionaries, where each dictionary contains keyword arguments for `function`.
    *   `function` (callable): The function to execute.
    *   `max_workers` (int, optional): The maximum number of processes. Default `40`.
    *   `use_tqdm` (bool, optional): Whether to show a progress bar. Default `True`.
*   **Returns**:
    *   (list): A list of results corresponding to the input list order.

*   **Example**:
```python
from structai import multi_process

# 'heavy_computation' must be defined at the top level so it can be pickled.
def heavy_computation(n):
    return sum(range(n))

# Guard the entry point: spawn-based platforms re-import this module in each worker.
if __name__ == "__main__":
    inputs = [{"n": 1000} for _ in range(5)]
    results = multi_process(inputs, heavy_computation)
```

[Back to Table of Contents](#table-of-contents)

### I/O

#### `load_file`
Automatically reads a file based on its extension.

*   **Args**:
    *   `path` (str): The path to the file to be read.
*   **Returns**:
    *   (Any): The content of the file, parsed into an appropriate Python object.
        *   `.json` -> `dict` or `list`
        *   `.jsonl` -> `list` of dicts
        *   `.csv`, `.parquet`, `.xlsx` -> `pandas.DataFrame`
        *   `.txt`, `.md`, `.py` -> `str`
        *   `.pkl` -> unpickled object
        *   `.npy` -> `numpy.ndarray`
        *   `.pt` -> `torch` object
        *   `.png`, `.jpg`, `.jpeg` -> `PIL.Image.Image`

*   **Example**:
```python
from structai import load_file

# Load a JSON file
data = load_file("config.json")

# Load a CSV file as a pandas DataFrame
df = load_file("data.csv")

# Load an image
image = load_file("photo.jpg")
```
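
Extension-based dispatch, as described in the mapping above, can be sketched with the standard library for the text-like formats. `load_by_extension` is a hypothetical helper covering only `.json`, `.jsonl`, and plain text; it is an illustration, not the code behind `load_file`.

```python
import json
import tempfile
from pathlib import Path


def load_by_extension(path: str):
    """Pick a loader from the file suffix (text-like formats only)."""
    suffix = Path(path).suffix.lower()
    if suffix == ".json":
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    if suffix == ".jsonl":
        with open(path, encoding="utf-8") as f:
            # One JSON object per non-empty line.
            return [json.loads(line) for line in f if line.strip()]
    if suffix in {".txt", ".md", ".py"}:
        return Path(path).read_text(encoding="utf-8")
    raise ValueError(f"Unsupported extension: {suffix}")


# Usage: write two small files to a temporary directory and read them back.
with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "config.json"
    json_path.write_text('{"key": "value"}', encoding="utf-8")
    jsonl_path = Path(tmp) / "rows.jsonl"
    jsonl_path.write_text('{"id": 1}\n{"id": 2}\n', encoding="utf-8")
    config = load_by_extension(str(json_path))
    rows = load_by_extension(str(jsonl_path))
```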

[Back to Table of Contents](#table-of-contents)

#### `save_file`
Automatically saves data to a file based on the extension. Creates necessary directories if they don't exist.

*   **Args**:
    *   `data` (Any): The data object to save.
    *   `path` (str): The destination file path.
*   **Returns**:
    *   None

*   **Example**:
```python
from structai import save_file

data = {"key": "value"}

# Save as JSON
save_file(data, "output.json")

# Save as Pickle
save_file(data, "backup.pkl")
```

[Back to Table of Contents](#table-of-contents)

#### `read_pdf`

Processes PDF file(s) by uploading them to MinerU for parsing, downloading the results, and loading the extracted content (text and images) into memory.

*   **Args**:
    *   `path` (str | list[str]): A single file path (str) or a list of file paths (list[str]) pointing to the PDF files to be processed.
*   **Returns**:
    *   (dict | list[dict | None] | None):
        *   If `path` is a single string, returns a dictionary containing the parsed data, or None if processing failed.
        *   If `path` is a list, returns a list where each element is either a dictionary (success) or None (failure).
        *   The result dictionary has the following structure:
            ```python
            {
                "path": str,        # The original path of the PDF file.
                "text": str,        # The full extracted text content in Markdown format.
                "img_paths": list[str], # A list of absolute file paths to the extracted images.
                "imgs": list[PIL.Image.Image] # A list of PIL Image objects corresponding to the images in `img_paths`.
            }
            ```

*   **Example**:
```python
from structai import read_pdf

# Process a single PDF
result = read_pdf("paper.pdf")
if result:
    print(result["text"][:100])
    print(f"Found {len(result['imgs'])} images")

# Process multiple PDFs
results = read_pdf(["doc1.pdf", "doc2.pdf"])
```

[Back to Table of Contents](#table-of-contents)

#### `encode_image`

Encodes a PIL Image object into a base64 string.

*   **Args**:
    *   `image_obj` (PIL.Image.Image): The image object to encode.
*   **Returns**:
    *   (str): The base64 encoded string.

*   **Example**:
```python
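from PIL import Image
from structai import encode_image

# Open any local image as a PIL Image, then encode it.
img = Image.open("photo.jpg")
b64_str = encode_image(img)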
```

[Back to Table of Contents](#table-of-contents)

#### `get_all_file_paths`

Recursively retrieves all file paths in a directory that match a given suffix.

*   **Args**:
    *   `directory` (str): The root directory to search.
    *   `suffix` (str, optional): The file suffix to filter by (e.g., '.py'). Default `''` (matches all files).
    *   `filter_func` (callable, optional): A function that takes a file path and returns True to include it. Default `None`.
    *   `absolute` (bool, optional): Whether to return absolute paths. Default `True`.
*   **Returns**:
    *   (list[str]): A list of matching file paths.

*   **Example**:
```python
from structai import get_all_file_paths

# Get all Python files in the current directory
py_files = get_all_file_paths(".", suffix=".py")
print(py_files)

# Get relative paths of all files, excluding those in 'test' directory
files = get_all_file_paths(
    ".", 
    filter_func=lambda p: "test" not in p, 
    absolute=False
)
```
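
A recursive search with suffix and predicate filtering could look roughly like the sketch below, which uses `os.walk`. `list_files` is a hypothetical name, and two details are assumptions: the results are sorted, and relative paths are expressed relative to the `directory` argument as given.

```python
import os
import tempfile


def list_files(directory, suffix="", filter_func=None, absolute=True):
    """Walk `directory` recursively and collect paths that pass both filters."""
    matches = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if not name.endswith(suffix):
                continue
            path = os.path.join(root, name)
            if absolute:
                path = os.path.abspath(path)
            if filter_func is not None and not filter_func(path):
                continue
            matches.append(path)
    return sorted(matches)


# Usage: build a small tree in a temporary directory and search it.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "pkg"))
    for rel in ("a.py", "notes.txt", os.path.join("pkg", "b.py")):
        open(os.path.join(tmp, rel), "w").close()
    py_files = list_files(tmp, suffix=".py")
    py_names = [os.path.basename(p) for p in py_files]
    txt_only = list_files(tmp, filter_func=lambda p: p.endswith(".txt"))
```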

[Back to Table of Contents](#table-of-contents)

#### `print_once`
Prints a message to stdout only once during the entire program execution. Useful for logging warnings or info inside loops.

*   **Args**:
    *   `msg` (str): The message to print.
*   **Returns**:
    *   None

*   **Example**:
```python
from structai import print_once

for i in range(10):
    print_once("Starting processing...") # print only once
```

[Back to Table of Contents](#table-of-contents)

#### `make_print_once`
Creates and returns a local function that prints a message only once. This is useful if you need a "print once" behavior scoped to a specific function or instance rather than globally.

*   **Args**:
    *   None
*   **Returns**:
    *   (callable): A function `inner(msg)` that behaves like `print_once`.

*   **Example**:
```python
from structai import make_print_once

logger1 = make_print_once()
logger2 = make_print_once()

logger1("Hello") # Prints "Hello"
logger1("Hello") # Does nothing

logger2("World") # Prints "World"
logger2("World") # Does nothing
```
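
The scoping difference between the two functions comes down to where the "already printed" state lives. A closure is one way to keep that state local, as in this sketch. `make_once_printer` is a hypothetical name, tracking by message text is an assumption, and the boolean return value exists only to make the behavior easy to observe here.

```python
def make_once_printer():
    """Return a printer whose 'seen' set is private to this closure."""
    seen = set()

    def inner(msg):
        if msg in seen:
            return False  # already printed by this printer
        seen.add(msg)
        print(msg)
        return True

    return inner


printer_a = make_once_printer()
printer_b = make_once_printer()

first = printer_a("Hello")   # prints "Hello"
second = printer_a("Hello")  # silent: printer_a has seen it
other = printer_b("Hello")   # prints again: printer_b has its own state
```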

[Back to Table of Contents](#table-of-contents)

### String Processing

#### `extract_markdown_images`

Parses Markdown text to extract paths of embedded images.

*   **Args**:
    *   `text` (str): The Markdown content string to analyze.
*   **Returns**:
    *   (list[str]): A list of image file paths extracted from the Markdown text.

*   **Example**:
```python
from structai import extract_markdown_images

md_text = "Here is an image: ![alt](images/img1.jpg)"
images = extract_markdown_images(md_text)
print(images) # ['images/img1.jpg']
```
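
One common way to pull image paths out of Markdown is a regular expression over the `![alt](path)` syntax. The sketch below takes that approach. The pattern and the name `find_markdown_images` are illustrative assumptions; the library's own pattern may differ (for example, in how it treats titles or reference-style images).

```python
import re

# Matches ![alt](path) and ![alt](path "title"); captures only the path.
IMAGE_PATTERN = re.compile(r'!\[[^\]]*\]\(\s*([^)\s]+)(?:\s+"[^"]*")?\s*\)')


def find_markdown_images(text: str) -> list[str]:
    """Return the path of every inline Markdown image in `text`."""
    return IMAGE_PATTERN.findall(text)


paths = find_markdown_images("Here is an image: ![alt](images/img1.jpg)")
titled = find_markdown_images('![a](x.png "title") and ![b](y.png)')
```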

[Back to Table of Contents](#table-of-contents)

#### `sanitize_text`

Sanitizes text by keeping only ASCII English characters, digits, and common punctuation. Removes control characters and ANSI codes.

*   **Args**:
    *   `text` (str): The text to sanitize.
*   **Returns**:
    *   (str): The sanitized text.

*   **Example**:
```python
from structai import sanitize_text

clean = sanitize_text("Hello \x1b[31mWorld\x1b[0m!")
print(clean) # 'Hello [31mWorld[0m!'
```

[Back to Table of Contents](#table-of-contents)

#### `filter_excessive_repeats`

Identifies sequences where a single character or a two-character substring repeats at least the specified threshold times and removes them entirely from the string.

*   **Args**:
    *   `text` (str): The input string.
    *   `threshold` (int, optional): The number of consecutive repetitions at which a sequence is removed. Default `5`.
*   **Returns**:
    *   (str): The processed string with excessive repetitions removed.

*   **Example**:
```python
from structai import filter_excessive_repeats

clean = filter_excessive_repeats("Helloooooo World", threshold=5)
print(clean) # "Hell World"

clean = filter_excessive_repeats("Hello\\b\\b World", threshold=2)
print(clean) # "Heo World"
```
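
The behavior described above can be approximated with a single backreference pattern, sketched below. `strip_repeats` is a hypothetical name and the regex is an assumption about one way to do it, not the library's code. It reproduces both documented outputs.

```python
import re


def strip_repeats(text: str, threshold: int = 5) -> str:
    """Remove runs where a 1- or 2-character unit repeats `threshold` or more times."""
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    # (.{1,2}?) captures the unit (trying one character first); \1{n-1,}
    # then requires at least n-1 further copies, i.e. n repetitions in total.
    pattern = re.compile(r"(.{1,2}?)\1{%d,}" % (threshold - 1), flags=re.DOTALL)
    return pattern.sub("", text)


a = strip_repeats("Helloooooo World", threshold=5)
b = strip_repeats("Hello\\b\\b World", threshold=2)
```

In the second call, both `ll` (a single character repeated twice) and `\b\b` (a two-character unit repeated twice) reach the threshold of 2, so both runs are removed.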

[Back to Table of Contents](#table-of-contents)

#### `cutoff_text`

Truncates and sanitizes a string so that its final length is guaranteed to be at most `l`. The function applies a series of progressively stronger transformations:
1. Sanitize text with `sanitize_text`.
2. Reduce repetitions with `filter_excessive_repeats`.
3. If still too long, keep a head and tail segment and insert a separator in the middle.
4. Apply a final hard cutoff as a safety net.

*   **Args**:
    *   `s` (str): Input string to be processed. May contain invalid Unicode, excessive repetition, or arbitrarily long content.
    *   `l` (int, optional): Maximum allowed length of the returned string. Must be greater than `9`. Default `20_000`.
*   **Returns**:
    *   (str): A processed string whose length is guaranteed to be less than or equal to `l`.

*   **Example**:
```python
from structai import cutoff_text

s = cutoff_text("aaaaaaasdddddfdf", l=10)
print(s) # "sfdf"

s = cutoff_text("asdfjsdjgofgofdkmsdlfmldmsgkgnfkdsfagfsdafdsfskfn", 22)
print(s) # "asdfjsd\n\n...\n\ndsfskfn"
```
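
Steps 3 and 4 (keep a head and a tail, then hard-cut) can be sketched on their own as follows. `head_tail_cutoff` is a hypothetical name and the sanitizing steps are deliberately left out. The separator is taken from the example output above, while the even head/tail split is an assumption.

```python
def head_tail_cutoff(s: str, l: int = 20_000, sep: str = "\n\n...\n\n") -> str:
    """Shorten `s` to at most `l` characters by keeping its head and tail."""
    if l <= 9:
        raise ValueError("l must be greater than 9")
    if len(s) <= l:
        return s
    # Split the remaining budget evenly between head and tail.
    keep = (l - len(sep)) // 2
    shortened = s[:keep] + sep + s[-keep:]
    # Final hard cutoff as a safety net.
    return shortened[:l]


short = head_tail_cutoff("asdfjsdjgofgofdkmsdlfmldmsgkgnfkdsfagfsdafdsfskfn", 22)
```

With a 7-character separator, the lower bound of `l > 9` guarantees that at least one character survives on each side.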

[Back to Table of Contents](#table-of-contents)

#### `str2dict`

Robustly converts a string representation of a dictionary to a Python `dict`. It handles common formatting errors and uses `json_repair` as a fallback.

*   **Args**:
    *   `s` (str): The string representation of a dictionary.
*   **Returns**:
    *   (dict): The parsed dictionary.

*   **Example**:
```python
from structai import str2dict

d = str2dict("{'a': 1, 'b': 2}")
print(d['a']) # 1
```
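
The general idea of trying progressively more lenient parsers can be sketched with the standard library alone. `parse_dict` is a hypothetical helper that tries strict JSON first and then Python literal syntax. Unlike `str2dict`, it has no `json_repair` fallback, so badly malformed input raises an error here.

```python
import ast
import json


def parse_dict(s: str) -> dict:
    """Parse a dict from JSON or Python-literal text, trying JSON first."""
    s = s.strip()
    for parser in (json.loads, ast.literal_eval):
        try:
            value = parser(s)
        except (ValueError, SyntaxError, TypeError):
            continue  # fall through to the next, more lenient parser
        if isinstance(value, dict):
            return value
    raise ValueError("could not parse a dictionary from the input")


from_python = parse_dict("{'a': 1, 'b': 2}")
from_json = parse_dict('{"a": true}')
```

Single-quoted input fails strict JSON parsing and is picked up by `ast.literal_eval`, which is why the first call succeeds.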

[Back to Table of Contents](#table-of-contents)

#### `str2list`

Robustly converts a string representation of a list to a Python `list`.

*   **Args**:
    *   `s` (str): The string representation of a list.
*   **Returns**:
    *   (list): The parsed list.

*   **Example**:
```python
from structai import str2list

l = str2list("[1, 2, 3]")
print(len(l)) # 3
```

[Back to Table of Contents](#table-of-contents)

#### `remove_tag`

Removes specified tags from a string, replacing them with a separator (default newline).

*   **Args**:
    *   `s` (str): The input string.
    *   `tags` (list[str], optional): A list of tags to remove. Default `["<think>", "</think>", "<answer>", "</answer>"]`.
    *   `r` (str, optional): The replacement string. Default `"\n"`.
*   **Returns**:
    *   (str): The cleaned string.

*   **Example**:
```python
from structai import remove_tag

clean_text = remove_tag("<think>...</think> Answer")
# Output: "...\n Answer"
```

[Back to Table of Contents](#table-of-contents)

#### `parse_think_answer`

Parses a string containing Chain-of-Thought tags (`<think>...</think>` and `<answer>...</answer>`) and returns the content of both.

*   **Args**:
    *   `text` (str): The input text containing the tags.
*   **Returns**:
    *   (tuple): A tuple `(think_content, answer_content)`.

*   **Example**:
```python
from structai import parse_think_answer

raw_text = "<think>Step 1...</think><answer>42</answer>"
think, answer = parse_think_answer(raw_text)
print(f"Reasoning: {think}") # Reasoning: Step 1...
print(f"Result: {answer}") # Result: 42
```

[Back to Table of Contents](#table-of-contents)

#### `extract_within_tags`

Extracts the substring found between two specific tags.

*   **Args**:
    *   `content` (str): The text to search within.
    *   `start_tag` (str, optional): The opening tag. Default `'<answer>'`.
    *   `end_tag` (str, optional): The closing tag. Default `'</answer>'`.
    *   `default_return` (Any, optional): The value to return if tags are not found. Default `None`.
*   **Returns**:
    *   (str | Any): The extracted content string, or `default_return` if not found.

*   **Example**:
```python
from structai import extract_within_tags

text = "Result: <json>{...}</json>"
json_str = extract_within_tags(text, "<json>", "</json>")
# Output: "{...}"
```
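
For readers who want to see the mechanics, tag extraction of this kind can be sketched with a non-greedy regular expression. The helper `between_tags` is hypothetical and only mirrors the documented parameters. Whether `extract_within_tags` returns the first or last match, or strips whitespace, is not stated here, so the sketch assumes the first match and no stripping.

```python
import re
from typing import Any


def between_tags(
    content: str,
    start_tag: str = "<answer>",
    end_tag: str = "</answer>",
    default_return: Any = None,
) -> Any:
    """Hypothetical helper: return the text between the first tag pair."""
    # re.escape keeps characters such as "<", "/" and "." literal.
    pattern = re.escape(start_tag) + r"(.*?)" + re.escape(end_tag)
    match = re.search(pattern, content, flags=re.DOTALL)  # DOTALL: span newlines
    return match.group(1) if match else default_return


print(between_tags("Result: <json>{...}</json>", "<json>", "</json>"))  # {...}
print(repr(between_tags("no tags here", default_return="")))            # ''

raw = "<think>Step 1...</think><answer>42</answer>"
print(between_tags(raw, "<think>", "</think>"))  # Step 1...
print(between_tags(raw))                         # 42
```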

[Back to Table of Contents](#table-of-contents)

### Network Service

#### `add_no_proxy_if_private`

Checks whether the hostname in the URL is a private IP address. If so, the hostname is added to the `no_proxy` environment variable so that requests to it bypass any configured proxy.

*   **Args**:
    *   `url` (str): The URL to check.
*   **Returns**:
    *   None

*   **Example**:
```python
from structai import add_no_proxy_if_private

add_no_proxy_if_private("http://192.168.1.100:8080/v1")
```
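
The private-address check can be sketched with the standard library. The helpers `is_private_host` and `add_host_to_no_proxy` are hypothetical names, not part of StructAI, and the comma-separated handling of `no_proxy` is an assumption about a reasonable approach rather than the library's actual logic.

```python
import ipaddress
import os
from urllib.parse import urlparse


def is_private_host(url: str) -> bool:
    """Hypothetical helper: True if the URL's hostname is a private IP literal."""
    host = urlparse(url).hostname
    if host is None:
        return False
    try:
        return ipaddress.ip_address(host).is_private
    except ValueError:
        return False  # hostname is a DNS name, not an IP literal


def add_host_to_no_proxy(url: str) -> None:
    """Hypothetical helper: append a private host to the no_proxy variable."""
    if not is_private_host(url):
        return
    host = urlparse(url).hostname
    entries = [e for e in os.environ.get("no_proxy", "").split(",") if e]
    if host not in entries:  # avoid duplicate entries on repeated calls
        entries.append(host)
    os.environ["no_proxy"] = ",".join(entries)


print(is_private_host("http://192.168.1.100:8080/v1"))  # True
print(is_private_host("https://api.openai.com/v1"))     # False
add_host_to_no_proxy("http://192.168.1.100:8080/v1")
```

The sketch only touches the lowercase `no_proxy` variable named above. Some tools read the uppercase `NO_PROXY` instead.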

[Back to Table of Contents](#table-of-contents)

#### `run_server`

Starts a FastAPI server that acts as a proxy to an OpenAI-compatible LLM provider, configured through the `LLM_BASE_URL` and `LLM_API_KEY` environment variables.

*   **Args**:
    *   `host` (str, optional): The host to bind to. Default `"0.0.0.0"`.
    *   `port` (int, optional): The port to bind to. Default `8001`.
*   **Returns**:
    *   None (Runs indefinitely until stopped).

*   **Example**:
```python
from structai import run_server

if __name__ == "__main__":
    run_server()
```

[Back to Table of Contents](#table-of-contents)

### Time Limit

#### `timeout_limit`

A decorator that enforces a maximum execution time on a function. Raises `TimeoutError` if the limit is exceeded.

*   **Args**:
    *   `timeout` (float | None): Maximum allowed execution time in seconds.
*   **Returns**:
    *   (decorator): A decorator function that wraps the target function.

*   **Example**:
```python
from structai import timeout_limit
import time

@timeout_limit(timeout=2.0)
def task():
    time.sleep(5)

# This will raise TimeoutError
task()
```
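
How StructAI enforces the limit internally is not described here. As a rough illustration of the decorator pattern, the sketch below assumes a thread-based approach built on `concurrent.futures`. The name `thread_timeout` is hypothetical, and the choice to call the function directly when `timeout` is `None` is also an assumption.

```python
import concurrent.futures
import functools
import time


def thread_timeout(timeout):
    """Hypothetical decorator: raise TimeoutError if the call exceeds `timeout` seconds."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if timeout is None:
                return func(*args, **kwargs)  # assumption: None means no limit
            executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
            future = executor.submit(func, *args, **kwargs)
            try:
                return future.result(timeout=timeout)
            except concurrent.futures.TimeoutError:
                raise TimeoutError(f"{func.__name__} exceeded {timeout}s") from None
            finally:
                executor.shutdown(wait=False)  # do not block on the worker
        return wrapper
    return decorator


@thread_timeout(timeout=0.1)
def slow():
    time.sleep(0.5)
    return "done"


try:
    slow()
except TimeoutError as exc:
    print(f"timed out: {exc}")
```

A thread-based limit like this cannot stop the running function: the caller gets control back, but the worker keeps going in the background until it finishes. Process-based or signal-based designs trade that limitation for other constraints.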

[Back to Table of Contents](#table-of-contents)

#### `run_with_timeout`

Runs a function with a specified timeout without using a decorator.

*   **Args**:
    *   `func` (callable): The function to run.
    *   `args` (tuple, optional): Positional arguments for the function. Default `()`.
    *   `kwargs` (dict, optional): Keyword arguments for the function. Default `None`.
    *   `timeout` (float | None): Maximum allowed execution time in seconds.
*   **Returns**:
    *   (Any): The return value of the function.

*   **Example**:
```python
from structai import run_with_timeout

def task(x):
    return x * 2

result = run_with_timeout(task, args=(10,), timeout=1.0)
print(result) # 20
```

[Back to Table of Contents](#table-of-contents)
