Metadata-Version: 2.4
Name: sectionminer
Version: 0.1.11
Summary: Extract sections and subsections from academic PDFs
Author: SectionMiner Contributors
License-Expression: MIT
Project-URL: Homepage, https://github.com/ehodiogo/SectionMiner
Project-URL: Repository, https://github.com/ehodiogo/SectionMiner
Keywords: pdf,nlp,llm,sections,academic
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Text Processing
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pymupdf
Requires-Dist: langchain
Requires-Dist: langchain-openai
Requires-Dist: langchain-text-splitters
Requires-Dist: langchain-community
Requires-Dist: python-decouple
Requires-Dist: google-genai
Requires-Dist: fastapi
Requires-Dist: uvicorn
Requires-Dist: python-multipart
Requires-Dist: jinja2
Dynamic: license-file

<div align="center">

<br/>

```
███████╗███████╗ ██████╗████████╗██╗ ██████╗ ███╗   ██╗
██╔════╝██╔════╝██╔════╝╚══██╔══╝██║██╔═══██╗████╗  ██║
███████╗█████╗  ██║        ██║   ██║██║   ██║██╔██╗ ██║
╚════██║██╔══╝  ██║        ██║   ██║██║   ██║██║╚██╗██║
███████║███████╗╚██████╗   ██║   ██║╚██████╔╝██║ ╚████║
╚══════╝╚══════╝ ╚═════╝   ╚═╝   ╚═╝ ╚═════╝ ╚═╝  ╚═══╝
███╗   ███╗██╗███╗   ██╗███████╗██████╗
████╗ ████║██║████╗  ██║██╔════╝██╔══██╗
██╔████╔██║██║██╔██╗ ██║█████╗  ██████╔╝
██║╚██╔╝██║██║██║╚██╗██║██╔══╝  ██╔══██╗
██║ ╚═╝ ██║██║██║ ╚████║███████╗██║  ██║
╚═╝     ╚═╝╚═╝╚═╝  ╚═══╝╚══════╝╚═╝  ╚═╝
```

**Extract sections and subsections from academic PDFs — powered by layout heuristics and LLM consolidation.**

<br/>

[![PyPI version](https://img.shields.io/pypi/v/sectionminer?style=flat-square&color=0a0a0a&labelColor=f5f5f5)](https://pypi.org/project/sectionminer/)
[![Python](https://img.shields.io/pypi/pyversions/sectionminer?style=flat-square&color=0a0a0a&labelColor=f5f5f5)](https://pypi.org/project/sectionminer/)
[![License](https://img.shields.io/github/license/ehodiogo/SectionMiner?style=flat-square&color=0a0a0a&labelColor=f5f5f5)](LICENSE)
[![PyPI Downloads](https://img.shields.io/pypi/dm/sectionminer?style=flat-square&color=0a0a0a&labelColor=f5f5f5)](https://pypi.org/project/sectionminer/)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg?style=flat-square)](https://github.com/psf/black)

<br/>

[**Quickstart**](#-quickstart) · [**Installation**](#-installation) · [**Preset Sections**](#-preset-sections) · [**LiteLLM**](#-litellm-support) · [**CLI**](#-cli) · [**API Reference**](#-api-reference) · [**Web UI**](#-web-ui) · [**Examples**](#-examples)

<br/>

</div>

---

## Overview

**SectionMiner** is a Python library for extracting structured sections and subsections from academic PDFs. It combines local layout analysis (font sizes, spans) with LLM-based tree consolidation to reliably identify section boundaries — even in complex, multi-column, or OCR-heavy documents.

```
PDF File  →  Text Extraction  →  Heading Detection  →  LLM Consolidation  →  Structured Tree
              (PyMuPDF / Gemini)   (font heuristics)    (OpenAI / LiteLLM)
```
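The heading-detection step is described here only as "font heuristics". As a rough illustration, and not SectionMiner's actual implementation, the sketch below treats the most common font size as body text and flags short spans set noticeably larger as heading candidates. The `Span` type, the `find_heading_candidates` function, and the `ratio` and `max_words` cut-offs are all assumptions made for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Span:
    """One run of text with a single font size (hypothetical shape)."""
    text: str
    size: float


def find_heading_candidates(spans, ratio=1.15, max_words=12):
    """Return spans that look like headings under a simple font-size rule."""
    if not spans:
        return []
    # Weight each font size by character count so the body font dominates.
    weights = Counter()
    for span in spans:
        weights[round(span.size, 1)] += len(span.text)
    body_size = weights.most_common(1)[0][0]
    return [
        span
        for span in spans
        if span.size >= body_size * ratio
        and 0 < len(span.text.split()) <= max_words
    ]


spans = [
    Span("1. Introduction", 14.0),
    Span("Section extraction from PDFs is a common preprocessing step. " * 3, 10.0),
    Span("2. Method", 14.0),
    Span("We combine layout cues with a language model. " * 3, 10.0),
]
for heading in find_heading_candidates(spans):
    print(heading.text)
```

A rule this simple is fragile on two-column layouts and noisy OCR, which is one reason a consolidation pass after the heuristics is useful.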

### Extraction Backends

| Backend | Description | Best For |
|---------|-------------|----------|
| `pymupdf` *(default)* | Local text extraction using PDF layout spans | Clean, text-native PDFs |
| `gemini` | OCR and extraction via Google Gemini | Scanned docs, complex layouts |

### LLM Consolidation Backends

| Backend | Description |
|---------|-------------|
| OpenAI *(default)* | Uses `ChatOpenAI` with any OpenAI model |
| LiteLLM | Uses `ChatLiteLLM` — supports OpenAI, Anthropic, Groq, Azure, Gemini, and more via a unified interface |

---

## ✦ Quickstart

```python
import json
from sectionminer import SectionMiner

miner = SectionMiner("paper.pdf", api_key="sk-...")

try:
    structure, usage = miner.extract_structure(return_tokens=True)

    print(json.dumps(structure, indent=2, ensure_ascii=False))
    print(usage)  # { prompt_tokens, completion_tokens, cost_usd, ... }

    # Get text from a specific section
    print(miner.get_section_text("introduction"))

    # Or slice by character offsets (guard against a section that was not found)
    start, end = miner.get_section_start_and_end_chars("introduction")
    if start is not None:
        print(miner.get_full_text()[start:end])
finally:
    miner.close()
```

---

## ⬇ Installation

**From PyPI:**

```bash
pip install sectionminer
```

**From source:**

```bash
git clone https://github.com/ehodiogo/SectionMiner.git
cd SectionMiner
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
pip install -e .
```

**With LiteLLM support:**

```bash
pip install sectionminer litellm langchain-community
```

### Requirements

- Python **3.10+**
- `OPENAI_API_KEY` — required for LLM consolidation (unless using LiteLLM with a different provider)
- `GEMINI_API_KEY` — required only when using `extraction_backend="gemini"`

### API Keys

Via environment variable:

```bash
export OPENAI_API_KEY="sk-..."
export GEMINI_API_KEY="..."      # optional, Gemini backend only
export LITELLM_API_KEY="..."     # optional, LiteLLM with non-OpenAI providers
export LITELLM_MODEL="openai/gpt-4o-mini"  # optional, LiteLLM model with provider prefix
```

Or via `.env` in your project root:

```env
OPENAI_API_KEY=sk-...
GEMINI_API_KEY=...
LITELLM_API_KEY=...
LITELLM_MODEL=openai/gpt-4o-mini
```

---

## 🎯 Preset Sections

By default, SectionMiner extracts **all** sections it detects in the PDF. When you only need specific sections, use `preset_sections` to activate **filter mode** — the library will return only the sections whose titles match your list, ignoring everything else.

```python
miner = SectionMiner(
    "paper.pdf",
    api_key="sk-...",
    preset_sections=["Introdução", "Metodologia", "Conclusão"],
)

try:
    structure, usage = miner.extract_structure(return_tokens=True)
    print(miner.get_section_text("Introdução"))
finally:
    miner.close()
```

### How matching works

Matching is flexible and normalised: before comparing, it strips leading numbering, folds case, removes diacritics, and collapses whitespace. A preset of `"Introdução"` therefore matches headings such as `"-Introdução"`, `"1. INTRODUÇÃO"`, and `"2.1 Introdução Geral"`. As the last example shows, a heading may also carry extra words after the preset text.

| Preset | Matches in PDF |
|--------|----------------|
| `"Introdução"` | `"-Introdução"`, `"1. INTRODUÇÃO"`, `"Introdução Geral"` |
| `"Metodologia"` | `"3. Metodologia"`, `"METODOLOGIA"`, `"2.3 Metodologia de Pesquisa"` |
| `"Conclusão"` | `"-CONCLUSÃO"`, `"Conclusão e Trabalhos Futuros"` |

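The normalisation steps above can be sketched with the standard library. This is an illustration under stated assumptions, not the library's code: `normalize_title` and `matches_preset` are hypothetical names, and the "normalised heading starts with the normalised preset" rule is inferred from the table rather than documented.

```python
import re
import unicodedata


def normalize_title(title):
    """Fold a heading to a comparable form: no numbering, accents, or case."""
    # Split accented characters into base letter + combining mark, drop marks.
    decomposed = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    text = text.casefold()
    # Strip leading dashes and numbering such as "-", "1.", "2.3 ", "1 - ".
    text = re.sub(r"^[\s\-]*(?:\d+(?:\.\d+)*[.)]?\s+)?[\s\-]*", "", text)
    # Collapse internal whitespace.
    return re.sub(r"\s+", " ", text).strip()


def matches_preset(preset, heading):
    """Assumed rule: the normalised heading starts with the normalised preset."""
    return normalize_title(heading).startswith(normalize_title(preset))


for heading in ["-Introdução", "1. INTRODUÇÃO", "2.1 Introdução Geral", "3. Metodologia"]:
    print(heading, "->", matches_preset("Introdução", heading))
```

The numbering pattern requires whitespace after the number, so a title such as `"3D Reconstruction"` keeps its leading digit.
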
### Key behaviours

- **No fabrication** — if a preset name has no match in the document, it is silently omitted. SectionMiner never invents sections.
- **Subsections follow their parent** — subsections are included only when their parent section was matched.
- **Document order preserved** — matched sections appear in the order they occur in the PDF, not in preset list order.
- **Double-filtered** — the LLM is instructed to filter, and a Python post-processing step removes any hallucinated nodes before results are returned.

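A minimal sketch of the Python-side filtering described in these bullets might look like the following. The node shape (`title` plus a `subsections` list), the `filter_sections` name, and the containment-based comparison are assumptions for illustration; the real post-processing step may differ.

```python
import unicodedata


def _fold(text):
    """Lower-case and strip accents so 'CONCLUSÃO' compares like 'conclusao'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()


def filter_sections(sections, presets):
    """Keep top-level sections whose title contains a preset, in document order."""
    wanted = [_fold(p) for p in presets]
    kept = []
    for section in sections:  # iterate the document, not the preset list
        title = _fold(section["title"])
        if any(w in title for w in wanted):
            kept.append(section)  # subsections ride along with their parent
    return kept


document = [
    {"title": "1. Introdução", "subsections": [{"title": "1.1 Motivação"}]},
    {"title": "2. Trabalhos Relacionados", "subsections": []},
    {"title": "3. Metodologia", "subsections": [{"title": "3.1 Dados"}]},
]
# "Resultados" has no match, so it is omitted rather than invented.
result = filter_sections(document, ["Metodologia", "Introdução", "Resultados"])
print([s["title"] for s in result])  # prints ['1. Introdução', '3. Metodologia']
```
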
---

## 🔀 LiteLLM Support

LiteLLM lets you swap the LLM consolidation provider without changing your code — just set a model name with the appropriate provider prefix.

### Supported providers (examples)

| Provider | `model` value |
|----------|--------------|
| OpenAI | `openai/gpt-4o-mini` |
| Anthropic | `anthropic/claude-3-haiku-20240307` |
| Groq | `groq/llama3-8b-8192` |
| Azure OpenAI | `azure/your-deployment-name` |
| Google Gemini | `gemini/gemini-2.0-flash` |

### Python

```python
from sectionminer import SectionMiner

miner = SectionMiner(
    "paper.pdf",
    api_key="your-provider-api-key",
    model="anthropic/claude-3-haiku-20240307",
    use_litellm=True,
    preset_sections=["Introdução"],
)

try:
    structure, usage = miner.extract_structure(return_tokens=True)
    print(miner.get_section_text("Introdução"))
finally:
    miner.close()
```

### LiteLLM + Gemini extraction backend

Use Gemini for PDF text extraction **and** LiteLLM for tree consolidation simultaneously:

```python
miner = SectionMiner(
    "paper.pdf",
    api_key="your-litellm-provider-key",
    model="openai/gpt-4o-mini",         # LiteLLM: tree consolidation
    extraction_backend="gemini",         # Gemini: PDF text extraction
    gemini_api_key="AIza...",
    use_litellm=True,
    preset_sections=["Introdução"],
)
```

### Via environment variables

```env
LITELLM_MODEL=groq/llama3-8b-8192
LITELLM_API_KEY=gsk_...
```

> **Note:** `get_openai_callback` in `_run` tracks token usage via OpenAI's SDK internals. When using LiteLLM with non-OpenAI providers, token counts may be reported as zero. Cost tracking works reliably only with OpenAI-compatible backends.

---

## ⌨ CLI

SectionMiner installs a `sectionminer` command.

```bash
sectionminer --help
```

### Extract section structure

```bash
# Full extraction with LLM consolidation (OpenAI)
sectionminer extract paper.pdf --tokens --pretty

# Heuristic-only (no LLM / no API key needed)
sectionminer extract paper.pdf --heuristic-only --pretty

# Show cost estimate
sectionminer extract paper.pdf --show-cost --pretty

# Save output to JSON
sectionminer extract paper.pdf --output out.json --pretty

# Use Gemini for extraction
sectionminer extract paper.pdf --extraction-backend gemini --gemini-api-key AIza... --pretty

# Use LiteLLM for consolidation
sectionminer extract paper.pdf --use-litellm --litellm-model groq/llama3-8b-8192 --litellm-api-key gsk_...

# Gemini extraction + LiteLLM consolidation
sectionminer extract paper.pdf \
  --extraction-backend gemini --gemini-api-key AIza... \
  --use-litellm --litellm-model openai/gpt-4o-mini
```

### Get text of a specific section

```bash
sectionminer section-text paper.pdf "introduction"

# With cost breakdown (printed to stderr, JSON unaffected)
sectionminer section-text paper.pdf "introduction" --show-cost

# Without LLM
sectionminer section-text paper.pdf "introduction" --heuristic-only

# With LiteLLM
sectionminer section-text paper.pdf "Introdução" \
  --use-litellm --litellm-model anthropic/claude-3-haiku-20240307
```

> **Note:** `--show-cost` outputs cost info to `stderr` so it never pollutes JSON output.

### LiteLLM CLI flags (available in all subcommands)

| Flag | Description |
|------|-------------|
| `--use-litellm` | Enable LiteLLM backend (replaces OpenAI) |
| `--litellm-model` | Model with provider prefix (e.g. `groq/llama3-8b-8192`). Fallback: `LITELLM_MODEL` env var |
| `--litellm-api-key` | Provider API key. Fallback: `LITELLM_API_KEY` → `OPENAI_API_KEY` |
| `--preset-section` / `--preset-sections` | Optional section title filter; repeat the flag to pass several titles |

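The fallback order in the table can be pictured as a short resolution function. This is a sketch of the documented precedence only (explicit flag first, then `LITELLM_MODEL` / `LITELLM_API_KEY`, then `OPENAI_API_KEY`); the `resolve_litellm_settings` helper is hypothetical and is not part of the CLI.

```python
import os


def resolve_litellm_settings(cli_model=None, cli_api_key=None, env=None):
    """Resolve model and key: CLI value first, then environment fallbacks."""
    env = os.environ if env is None else env
    model = cli_model or env.get("LITELLM_MODEL")
    api_key = cli_api_key or env.get("LITELLM_API_KEY") or env.get("OPENAI_API_KEY")
    return model, api_key


# A plain dict stands in for os.environ so the example is reproducible.
fake_env = {"LITELLM_MODEL": "groq/llama3-8b-8192", "OPENAI_API_KEY": "sk-..."}
print(resolve_litellm_settings(env=fake_env))
print(resolve_litellm_settings(cli_model="openai/gpt-4o-mini", cli_api_key="key", env=fake_env))
```
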
---

## 🌐 Web UI

SectionMiner includes a FastAPI-powered dashboard with real-time PDF rendering, section cards, a detail modal, and social links.

```bash
# Start with default PyMuPDF backend
sectionminer runserver --host 127.0.0.1 --port 8000 --reload

# Or, if you prefer the module entrypoint
python -m sectionminer runserver --host 127.0.0.1 --port 8000 --reload

# Use Gemini for extraction
sectionminer runserver --extraction-backend gemini --gemini-model gemini-2.0-flash

# Use LiteLLM for consolidation
sectionminer runserver --use-litellm --litellm-model groq/llama3-8b-8192

# Heuristic-only (no LLM)
sectionminer runserver --heuristic-only
```

> If the `sectionminer runserver` command is not available, your local installation is out of date. Run `pip install -e .` in the project or `pip install -U sectionminer` in the active virtual environment.

Open in your browser: **http://127.0.0.1:8000**

**Features:**
- Upload any PDF and view extracted sections in real time
- Click a section to highlight its exact location in the PDF viewer
- Open "Ver detalhe" ("View details") to read the full section text in a modal
- Dashboard shows: backend used, page count, section count, and extraction mode
- Preset sections can be passed from the UI, CLI, API, or Python code

### API Endpoints

| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/` | Visual UI |
| `POST` | `/api/extract` | Upload PDF, returns structured JSON |
| `GET` | `/api/files/{job_id}` | Stream the uploaded PDF for rendering |

<details>
<summary><strong>Sample <code>POST /api/extract</code> response</strong></summary>

```json
{
  "job_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "filename": "paper.pdf",
  "pdf_url": "/api/files/3fa85f64-...",
  "extraction_backend": "pymupdf",
  "heuristic_only": false,
  "pages": 10,
  "metrics": {
    "pages": 10,
    "sections": 24,
    "prompt_tokens": 1800,
    "completion_tokens": 450,
    "total_tokens": 2250,
    "cost_usd": 0.00046
  },
  "sections": [
    {
      "title": "1. Introduction",
      "level": 1,
      "start_char": 0,
      "end_char": 1200,
      "text": "...",
      "locations": [
        { "page": 0, "bbox": [72.0, 120.0, 380.0, 138.0], "text": "..." }
      ]
    }
  ]
}
```

</details>

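To show how a client might consume this response, the sketch below parses a trimmed copy of the sample payload and summarises each section. The field names come from the sample; treating `page` as zero-based and `end_char` as exclusive are assumptions, and `summarize` is a hypothetical helper.

```python
import json

payload = json.loads("""
{
  "filename": "paper.pdf",
  "pages": 10,
  "sections": [
    {
      "title": "1. Introduction",
      "level": 1,
      "start_char": 0,
      "end_char": 1200,
      "text": "...",
      "locations": [
        {"page": 0, "bbox": [72.0, 120.0, 380.0, 138.0], "text": "..."}
      ]
    }
  ]
}
""")


def summarize(section):
    """Return (indented title, character length, 1-based pages) for one section."""
    indent = "  " * (section["level"] - 1)
    length = section["end_char"] - section["start_char"]
    pages = sorted({loc["page"] + 1 for loc in section["locations"]})
    return f"{indent}{section['title']}", length, pages


for section in payload["sections"]:
    print(summarize(section))
```
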
### Frontend styles (Tailwind)

The dashboard uses Tailwind utilities. If you want to customize the stylesheet build pipeline, install the Node dev dependencies and run:

```bash
npm install
npm run build:css   # one-off build
npm run dev:css     # watch mode
```

The entry stylesheet lives at `sectionminer/server/static/tailwind.css` and compiles to `sectionminer/server/static/styles.css` (served by FastAPI).

---

## 📖 API Reference

### `SectionMiner(path, api_key, **kwargs)`

```python
miner = SectionMiner(
    "paper.pdf",
    api_key="sk-...",                     # API key for LLM consolidation
    model="gpt-4o-mini",                  # Model name (OpenAI) or provider/model (LiteLLM)
    extraction_backend="pymupdf",         # "pymupdf" | "gemini"
    gemini_api_key="...",                 # required if extraction_backend="gemini"
    gemini_model="gemini-2.5-flash-lite", # optional; overrides the default Gemini model
    preset_sections=["Introdução", "Metodologia"],  # optional filter
    use_litellm=False,                    # set True to use LiteLLM instead of OpenAI
)
```

#### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `path` | `str` | — | Path to the PDF file |
| `api_key` | `str` | — | API key for LLM consolidation |
| `model` | `str` | `"gpt-4o-mini"` | Model name. For LiteLLM, include provider prefix (e.g. `"groq/llama3-8b-8192"`) |
| `extraction_backend` | `str` | `"pymupdf"` | `"pymupdf"` or `"gemini"` |
| `gemini_api_key` | `str` | `None` | Google Gemini API key |
| `gemini_model` | `str` | `"gemini-2.0-flash"` | Gemini model name |
| `preset_sections` | `list[str]` | `None` | If provided, return **only** sections matching these names |
| `use_litellm` | `bool` | `False` | Use LiteLLM instead of direct OpenAI for LLM consolidation |

### Methods

| Method | Returns | Description |
|--------|---------|-------------|
| `extract_structure(return_tokens=False)` | `dict` or `(dict, usage)` | Full extraction pipeline. Returns section tree. |
| `get_section_text(title)` | `str` | Retrieve text of a section by title (fuzzy match). |
| `get_section_start_and_end_chars(title)` | `(int, int)` | Character offsets for a section in the full text; the offset-slicing example checks for a `None` start before slicing. |
| `get_full_text()` | `str` | Complete linearized text of the PDF. |
| `get_sections()` | `list[str]` | List of all detected section titles. |
| `close()` | `None` | Release the open PDF file handle. |

<details>
<summary><strong>Low-level pipeline methods</strong></summary>

| Method | Description |
|--------|-------------|
| `extract_blocks()` | Extract raw text spans from PDF |
| `build_full_text()` | Assemble linearized full text |
| `build_sections()` | Run heading detection heuristics |

Useful for debugging or custom pipelines.

</details>

---

## 🔌 Backends

### PyMuPDF *(default)*

```python
miner = SectionMiner("paper.pdf", api_key="sk-...")
```

Reads text directly from PDF layout data (font sizes, span positions). Fast, offline, no external API needed for extraction.

### Gemini

```python
miner = SectionMiner(
    "paper.pdf",
    api_key="sk-...",
    extraction_backend="gemini",
    gemini_api_key="...",
    gemini_model="gemini-2.5-flash-lite",
)
```

Sends the PDF to Google Gemini for OCR-based text extraction. Better for scanned documents or PDFs with unusual layouts.

---

## 💡 Examples

<details>
<summary><strong>Basic extraction</strong></summary>

```python
from sectionminer import SectionMiner

miner = SectionMiner("paper.pdf", api_key="sk-...")
try:
    structure, usage = miner.extract_structure(return_tokens=True)
    for section in miner.get_sections():
        print(f"→ {section}")
        print(miner.get_section_text(section)[:200])
        print()
finally:
    miner.close()
```

</details>

<details>
<summary><strong>Extract only specific sections (preset filter)</strong></summary>

```python
from sectionminer import SectionMiner

miner = SectionMiner(
    "paper.pdf",
    api_key="sk-...",
    preset_sections=["Introdução", "Metodologia", "Conclusão"],
)
try:
    miner.extract_structure()
    print(miner.get_section_text("Introdução"))
    print(miner.get_section_text("Metodologia"))
finally:
    miner.close()
```

</details>

<details>
<summary><strong>LiteLLM — swap provider without changing code</strong></summary>

```python
from sectionminer import SectionMiner

miner = SectionMiner(
    "paper.pdf",
    api_key="gsk_...",   # Groq API key
    model="groq/llama3-8b-8192",
    use_litellm=True,
    preset_sections=["Introdução"],
)
try:
    structure, usage = miner.extract_structure(return_tokens=True)
    print(miner.get_section_text("Introdução"))
finally:
    miner.close()
```

</details>

<details>
<summary><strong>Gemini extraction + LiteLLM consolidation</strong></summary>

```python
from sectionminer import SectionMiner

miner = SectionMiner(
    "paper.pdf",
    api_key="sk-...",                    # LiteLLM provider key
    model="openai/gpt-4o-mini",          # LiteLLM model
    extraction_backend="gemini",          # Gemini for PDF text extraction
    gemini_api_key="AIza...",
    use_litellm=True,
    preset_sections=["Introdução"],
)
try:
    structure, usage = miner.extract_structure(return_tokens=True)
    print(usage)
    print(miner.get_section_text("Introdução"))
finally:
    miner.close()
```

</details>

<details>
<summary><strong>Preset sections with Gemini backend</strong></summary>

```python
from sectionminer import SectionMiner

miner = SectionMiner(
    "paper.pdf",
    api_key="sk-...",
    extraction_backend="gemini",
    gemini_api_key="AIza...",
    preset_sections=["Introdução"],
)
try:
    structure, usage = miner.extract_structure(return_tokens=True)
    print(usage)
    print(miner.get_section_text("Introdução"))
finally:
    miner.close()
```

</details>

<details>
<summary><strong>Slice text by character offsets</strong></summary>

```python
miner = SectionMiner("paper.pdf", api_key="sk-...")
try:
    miner.extract_structure()
    start, end = miner.get_section_start_and_end_chars("conclusion")
    if start is not None:
        excerpt = miner.get_full_text()[start:end]
        print(excerpt[:500])
finally:
    miner.close()
```

</details>

---

## 💰 Cost Reference

Measured locally on `2026-03-21` using `gpt-4o-mini`:

| File           | Size | Pages | Tokens | Cost |
|----------------|------|-------|--------|------|
| `artigo_1.pdf` | 0.74 MB | 21 | 2,297 | `$0.000475` |
| `artigo_2.pdf` | 0.04 MB | 4 | 356 | `$0.000060` |

> Section text retrieval after extraction is **free** — it uses local character offsets.
> Using `preset_sections` reduces token usage further by limiting LLM output to matched sections only.

Reproduce with:
```bash
sectionminer extract paper.pdf --show-cost --pretty
```
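For readers who want to sanity-check a reported figure, the usual arithmetic is tokens multiplied by a per-million-token price, summed over prompt and completion. The sketch below is illustrative only: `estimate_cost_usd` is a hypothetical helper, and the rates are placeholders that do not come from this README, so check your provider's current pricing before relying on them.

```python
def estimate_cost_usd(prompt_tokens, completion_tokens,
                      input_per_million, output_per_million):
    """Sum prompt and completion cost from per-million-token prices (USD)."""
    return (prompt_tokens * input_per_million
            + completion_tokens * output_per_million) / 1_000_000


# Placeholder rates in USD per million tokens (assumption, not from this README).
cost = estimate_cost_usd(2000, 300, input_per_million=0.15, output_per_million=0.60)
print(f"${cost:.6f}")
```

Because output tokens are typically priced higher than input tokens, limiting what the LLM has to return can matter more than the raw token total suggests.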

---

## 🗂 Project Structure

```
SectionMiner/
├── sectionminer/
│   ├── __init__.py        # Public API
│   ├── miner.py           # SectionMiner class
│   ├── client.py          # LLM client + tree merge (OpenAI / LiteLLM)
│   ├── prompts.py         # Consolidation prompt
│   └── server/            # FastAPI + UI (routes, static, templates)
├── examples/
│   ├── basic_usage.py
│   └── api_smoke_test.py
├── files/                 # Sample PDFs
├── test.py                # PyMuPDF + OpenAI pipeline example
├── test_litellm.py        # LiteLLM pipeline example
├── test_gemini_litellm.py # Gemini extraction + LiteLLM consolidation example
└── requirements.txt
```

---

## 🐛 Troubleshooting

<details>
<summary><strong>"Invalid control character" when processing PDF</strong></summary>

The PDF contains invalid control characters that break JSON serialization.
The current version sanitizes these automatically. If the error persists, try a different PDF or validate it with a PDF reader.

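For reference, this kind of sanitisation can be approximated in a few lines. The sketch below is not the library's implementation: `strip_control_chars` is a hypothetical helper that replaces C0 control characters (other than tab, newline, and carriage return) with a space before the text reaches a JSON parser.

```python
import json
import re

# C0 control characters except tab, newline, and carriage return, plus DEL.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def strip_control_chars(text):
    """Replace stray control characters with a space."""
    return _CONTROL_CHARS.sub(" ", text)


raw = '{"title": "Intro\x0bduction"}'  # a vertical tab embedded in the title
try:
    json.loads(raw)
except json.JSONDecodeError as exc:
    print("before:", exc.msg)

print("after:", json.loads(strip_control_chars(raw)))
```
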
</details>

<details>
<summary><strong>Sections are fragmented or broken</strong></summary>

- Review `_is_noise_heading` and `_looks_like_heading` in `sectionminer/miner.py`
- Adjust the threshold in `_detect_threshold` for your PDF's font pattern
- Two-column layouts, intrusive footers, and poor OCR quality increase detection errors

</details>

<details>
<summary><strong>Section not found by title</strong></summary>

- Try a variation without accents or in lowercase (search normalizes text)
- Inspect available titles with `miner.get_sections()`
- If using `preset_sections`, confirm the section actually exists in the PDF — presets with no match are silently omitted, never fabricated

</details>

<details>
<summary><strong>Preset section returns None text</strong></summary>

The section was matched by the LLM but `start_char` is null, meaning the title in `section_structures` differs from what the LLM returned. Debug with:

```python
miner.extract_structure()
for s in miner.section_structures:
    print(repr(s["title"]), s["start"])
```

Use the exact title shown there (or a close variation) in `preset_sections`.

</details>

<details>
<summary><strong>LiteLLM: "LLM Provider NOT provided"</strong></summary>

You passed a model name without the provider prefix (e.g. `"gpt-4o-mini"` instead of `"openai/gpt-4o-mini"`). LiteLLM requires the prefix to identify the provider. Always use `provider/model-name` format.

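A small pre-flight check can surface this mistake before any request is sent. The `split_provider` helper below is a hypothetical sketch; it only verifies the `provider/model-name` shape and does not check that the provider is one LiteLLM actually supports.

```python
def split_provider(model):
    """Split 'provider/model-name' and fail early when the prefix is missing."""
    provider, sep, name = model.partition("/")
    if not sep or not provider or not name:
        raise ValueError(
            f"Expected 'provider/model-name', got {model!r} "
            "(for example 'openai/gpt-4o-mini')"
        )
    return provider, name


print(split_provider("openai/gpt-4o-mini"))
try:
    split_provider("gpt-4o-mini")
except ValueError as exc:
    print(exc)
```
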
</details>

<details>
<summary><strong>LiteLLM: token usage shows zeros</strong></summary>

`get_openai_callback` only captures usage from OpenAI-compatible calls. With non-OpenAI providers via LiteLLM, token counts may be reported as zero. This is a known limitation and does not affect the extraction itself.

</details>

<details>
<summary><strong>OpenAI key error</strong></summary>

- Confirm `OPENAI_API_KEY` is set in the same environment as your script
- If using `.env`, ensure it's in the project root

</details>

---

## 🗺 Roadmap

- [ ] Automated tests for `detect_headings`, `build_sections`, `get_section_text`
- [ ] Expose heuristic parameters via config (threshold, noise filters)
- [ ] Native LiteLLM token/cost tracking (replace `get_openai_callback`)
- [x] LiteLLM support — use any provider for LLM consolidation
- [x] CLI: `sectionminer extract file.pdf --output out.json`
- [x] Heuristic-only mode (no LLM, fully offline)
- [x] Improved merge — keeps only valid sections/subsections without broken fragments
- [x] Web UI with PDF viewer and section highlighting
- [x] Preset sections filter — extract only named sections with flexible normalised matching

---

## 📄 License

[MIT](LICENSE) © [ehodiogo](https://github.com/ehodiogo)

---

<div align="center">

Made with ♥ for researchers who'd rather spend time reading papers than parsing them.

**[⬆ back to top](#)**

</div>
