Metadata-Version: 2.4
Name: neuraparse
Version: 0.1.0a1
Summary: Production-grade agentic document-to-dataset pipeline with GraphRAG support.
Author: Neura Parse
License-Expression: MIT
Project-URL: Homepage, https://github.com/neuraparse/neuraparse
Project-URL: Repository, https://github.com/neuraparse/neuraparse
Project-URL: Documentation, https://github.com/neuraparse/neuraparse/blob/main/README.md
Project-URL: Issues, https://github.com/neuraparse/neuraparse/issues
Keywords: graphrag,rag,document-processing,dataset-generation,llm,evaluation,ranking,entity-extraction,knowledge-graph
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Operating System :: OS Independent
Classifier: Typing :: Typed
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: recipes-yaml
Requires-Dist: pyyaml>=6.0; extra == "recipes-yaml"
Provides-Extra: pdf
Requires-Dist: pypdf>=4.0.0; extra == "pdf"
Provides-Extra: office
Requires-Dist: python-docx>=1.0.0; extra == "office"
Provides-Extra: llm-openai
Requires-Dist: openai>=1.0.0; extra == "llm-openai"
Provides-Extra: llm-anthropic
Requires-Dist: anthropic>=0.40.0; extra == "llm-anthropic"
Provides-Extra: llm-ollama
Requires-Dist: ollama>=0.4.0; extra == "llm-ollama"
Provides-Extra: llm-all
Requires-Dist: openai>=1.0.0; extra == "llm-all"
Requires-Dist: anthropic>=0.40.0; extra == "llm-all"
Requires-Dist: ollama>=0.4.0; extra == "llm-all"
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.23.0; extra == "dev"
Requires-Dist: mypy>=1.8.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Provides-Extra: all
Requires-Dist: pyyaml>=6.0; extra == "all"
Requires-Dist: pypdf>=4.0.0; extra == "all"
Requires-Dist: python-docx>=1.0.0; extra == "all"
Requires-Dist: openai>=1.0.0; extra == "all"
Requires-Dist: anthropic>=0.40.0; extra == "all"
Requires-Dist: ollama>=0.4.0; extra == "all"
Dynamic: license-file

# neuraparse

**Production-grade agentic document-to-dataset pipeline with GraphRAG support.**

[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![Tests](https://img.shields.io/badge/tests-39%20passed-brightgreen.svg)](tests/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
[![Version](https://img.shields.io/badge/version-0.1.0a1-orange.svg)](https://pypi.org/project/neuraparse/)

> ⚠️ **Alpha Release**: This is an early alpha version (0.1.0a1). APIs may change. Feedback and contributions welcome!

## 🚀 **What is neuraparse?**

neuraparse transforms documents into high-quality datasets for:
- **GraphRAG systems** (entity extraction, graph neighborhoods, hierarchical summaries)
- **Retrieval evaluation** (graded relevance, cross-document ranking, multi-context ranking)
- **LLM fine-tuning** (QA pairs, instruction datasets, summarization)
- **Agentic workflows** (memory, tool usage, knowledge graphs)

### **Key Features**

- ✅ **Multi-format ingestion**: Web pages, PDFs, Office docs, Markdown, plain text
- ✅ **Hierarchical parsing**: Layout-aware DocumentTree (sections, paragraphs, metadata)
- ✅ **GraphRAG-ready**: DocumentGraph with structural + semantic nodes (entities, summaries)
- ✅ **10+ dataset recipes**: RAG chunks, QA pairs, entity knowledge, graded relevance, cross-doc ranking
- ✅ **Profile system**: Bundle recipes into workflows (`graphrag`, `eval_ranking`, `eval_advanced`)
- ✅ **Real LLM integration**: OpenAI, Anthropic (Claude), Ollama (local models)
- ✅ **Production-ready**: 39 tests, type hints, comprehensive error handling

---

## 📦 **Installation**

```bash
# Basic installation
pip install neuraparse

# With LLM providers (quotes keep shells like zsh from expanding the brackets)
pip install "neuraparse[llm-openai]"      # OpenAI GPT-4/3.5
pip install "neuraparse[llm-anthropic]"   # Anthropic Claude
pip install "neuraparse[llm-ollama]"      # Local Ollama models
pip install "neuraparse[llm-all]"         # All LLM providers

# With document parsing
pip install "neuraparse[pdf]"             # PDF support
pip install "neuraparse[office]"          # DOCX support
pip install "neuraparse[recipes-yaml]"    # YAML recipe configs

# Full installation
pip install "neuraparse[llm-all,pdf,office,recipes-yaml]"
```

---

## 🎯 **Quick Start**

### **1. Ingest a document**

```bash
# From a web page
neuraparse ingest https://example.com/article.html

# From a local file
neuraparse ingest path/to/document.pdf

# From markdown
neuraparse ingest path/to/notes.md
```

### **2. Build a document graph**

```bash
neuraparse build-graph <document_id>
```

This creates a **DocumentGraph** with:
- **Structural nodes**: DOCUMENT → SECTION → PARAGRAPH hierarchy
- **Semantic nodes**: ENTITY (keywords), SUMMARY (section summaries)
- **Edges**: parent_of, next_sibling, mentions, summarizes
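
As a rough sketch of this layout (the `Node`/`Graph` classes below are illustrative only, not the library's actual API), the node kinds and edge relations can be pictured as:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a DocumentGraph -- not neuraparse's real classes.
@dataclass
class Node:
    id: str
    kind: str      # DOCUMENT | SECTION | PARAGRAPH | ENTITY | SUMMARY
    text: str = ""

@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (source_id, relation, target_id)

    def add(self, node: Node) -> None:
        self.nodes[node.id] = node

    def link(self, src: str, relation: str, dst: str) -> None:
        self.edges.append((src, relation, dst))

g = Graph()
g.add(Node("doc:1", "DOCUMENT"))
g.add(Node("sec:1", "SECTION", "Introduction"))
g.add(Node("par:1", "PARAGRAPH", "GraphRAG combines retrieval with graphs."))
g.add(Node("ent:graphrag", "ENTITY", "GraphRAG"))
g.link("doc:1", "parent_of", "sec:1")
g.link("sec:1", "parent_of", "par:1")
g.link("par:1", "mentions", "ent:graphrag")

# Graph queries then become simple edge scans, e.g. paragraphs mentioning an entity:
hits = [s for s, r, d in g.edges if r == "mentions" and d == "ent:graphrag"]
```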

### **3. Generate datasets**

#### **Option A: Run a single recipe**

```bash
# Generate RAG chunks
neuraparse run-recipe <document_id> --recipe examples/rag_chunks.json

# Generate QA pairs with OpenAI
neuraparse run-recipe <document_id> --recipe examples/recipe_with_openai.json

# Generate graded relevance dataset
neuraparse run-recipe <document_id> --recipe examples/graded_relevance.json
```

#### **Option B: Run a profile (multiple recipes)**

```bash
# GraphRAG profile (6 recipes: chunks, QA, summaries, entities, neighborhoods, relevance)
neuraparse run-profile <document_id> --profile graphrag

# Evaluation ranking profile (2 recipes: section_relevance, multi_context_ranking)
neuraparse run-profile <document_id> --profile eval_ranking

# Advanced evaluation profile (3 recipes: graded_relevance, cross_doc_ranking, entity_context_ranking)
neuraparse run-profile <document_id> --profile eval_advanced
```

---

## 📚 **Available Recipes**

| Recipe | Description | Output Format |
|--------|-------------|---------------|
| `rag_chunks` | Paragraph chunks for RAG | `{chunk_id, text, metadata}` |
| `basic_qa` | QA pairs per paragraph | `{question, answer, context}` |
| `outline_summary` | Hierarchical section summaries | `{section, summary, level}` |
| `entity_knowledge` | Entity-centric knowledge aggregation | `{entity, mentions, contexts}` |
| `graph_neighborhood` | Paragraph + graph context | `{paragraph, siblings, summary}` |
| `section_relevance` | Binary relevance pairs | `{query, context, label}` |
| `multi_context_ranking` | Multi-context ranking | `{query, contexts: [{text, label}]}` |
| `graded_relevance` | Graded relevance (0-3) | `{query, context, grade}` |
| `cross_document_ranking` | Cross-doc ranking | `{query, contexts: [{text, label, source_doc}]}` |
| `entity_context_ranking` | Entity + summary ranking | `{query, contexts: [{text, label, type}]}` |

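For example, a single `rag_chunks` record following the `{chunk_id, text, metadata}` shape might look like this (field values are illustrative):

```json
{"chunk_id": "doc1-par-003", "text": "GraphRAG combines retrieval with knowledge graphs.", "metadata": {"section": "Introduction", "level": 1}}
```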
---

## 🧠 **LLM Integration**

### **OpenAI**

```json
{
  "kind": "basic_qa",
  "params": {
    "llm": {
      "provider": "openai",
      "model": "gpt-4",
      "api_key": "sk-...",
      "temperature": 0.7,
      "max_tokens": 512
    }
  }
}
```

If `api_key` is omitted, the `OPENAI_API_KEY` environment variable is used instead.
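
Rather than hard-coding keys in recipe files, you can export them in the shell (`OPENAI_API_KEY` is the fallback described above; `ANTHROPIC_API_KEY` is the Anthropic SDK's standard variable, assuming the same convention applies here):

```shell
# Keep API keys out of recipe JSON by exporting them instead.
export OPENAI_API_KEY="sk-..."         # read when "api_key" is omitted
export ANTHROPIC_API_KEY="sk-ant-..."  # Anthropic SDK's standard variable
```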

### **Anthropic (Claude)**

```json
{
  "kind": "outline_summary",
  "params": {
    "llm": {
      "provider": "anthropic",
      "model": "claude-3-5-sonnet-20241022",
      "temperature": 0.5,
      "max_tokens": 1024
    }
  }
}
```

### **Ollama (Local)**

```json
{
  "kind": "basic_qa",
  "params": {
    "llm": {
      "provider": "ollama",
      "model": "llama3.2",
      "base_url": "http://localhost:11434",
      "temperature": 0.6
    }
  }
}
```

---

## 🏗️ **Architecture**

```
┌─────────────────┐
│  Raw Documents  │  (Web, PDF, DOCX, Markdown, Text)
└────────┬────────┘
         │ Ingestion
         ▼
┌─────────────────┐
│ DocumentTree    │  (Hierarchical: sections, paragraphs, metadata)
└────────┬────────┘
         │ Graph Building
         ▼
┌─────────────────┐
│ DocumentGraph   │  (Nodes: DOCUMENT, SECTION, PARAGRAPH, ENTITY, SUMMARY)
└────────┬────────┘
         │ Recipe Execution
         ▼
┌─────────────────┐
│   Datasets      │  (RAG chunks, QA pairs, rankings, evaluations)
└─────────────────┘
```

---

## 🔬 **Advanced Usage**

### **Custom Profiles**

Create `my_profiles.json`:

```json
{
  "profiles": {
    "my_custom_profile": [
      "rag_chunks",
      "graded_relevance",
      "entity_context_ranking"
    ]
  }
}
```

Run it:

```bash
neuraparse run-profile <document_id> --profile my_custom_profile --profiles-config my_profiles.json
```

### **Python API**

```python
from neuraparse.core.ingestion import ingest_from_url
from neuraparse.core.graph_builder import build_document_graph
from neuraparse.recipes import execute_recipe, execute_profile

# Ingest
doc = ingest_from_url("https://example.com/article.html", base_dir="./data")

# Build graph
graph = build_document_graph(doc.id, base_dir="./data")

# Run recipe
output_path = execute_recipe(
    config_path="examples/rag_chunks.json",
    graph=graph,
    base_dir="./data",
    document_id=doc.id
)

# Or run profile
outputs = execute_profile(
    profile_name="graphrag",
    graph=graph,
    base_dir="./data",
    document_id=doc.id
)
```
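
`execute_recipe` returns the path of the generated dataset. Assuming the output is JSON Lines (one record per line, as the recipe formats above suggest), loading it is a few lines; the file name here is a synthetic stand-in for the returned path:

```python
import json
from pathlib import Path

def load_jsonl(path):
    """Load a JSON Lines dataset: one JSON object per non-empty line."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Demo with a synthetic file standing in for the recipe output path:
demo = Path("demo_chunks.jsonl")
demo.write_text('{"chunk_id": "c1", "text": "alpha"}\n{"chunk_id": "c2", "text": "beta"}\n')
records = load_jsonl(demo)
```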

---

## 🧪 **Testing**

```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=neuraparse --cov-report=html

# Run specific test file
pytest tests/test_advanced_eval_recipes.py -v
```

**Current status**: ✅ **39 passed, 1 skipped**

---

## 📖 **Documentation**

- [Full Documentation](docs/) (coming soon)
- [Recipe Guide](docs/recipes.md) (coming soon)
- [LLM Integration Guide](docs/llm_integration.md) (coming soon)
- [Examples](examples/)

---

## 🛣️ **Roadmap**

- [x] Core ingestion + parsing + graph building
- [x] 10+ dataset recipes
- [x] Profile system
- [x] Real LLM integration (OpenAI, Anthropic, Ollama)
- [x] Advanced evaluation recipes (graded relevance, cross-doc ranking)
- [ ] Multi-document graph merging
- [ ] Streaming ingestion for large documents
- [ ] Web UI for graph visualization
- [ ] PyPI package release

---

## 📄 **License**

MIT License - see [LICENSE](LICENSE) for details.

---

## 🤝 **Contributing**

Contributions welcome! Please:
1. Fork the repo
2. Create a feature branch
3. Add tests for new features
4. Ensure all tests pass (`pytest`)
5. Submit a pull request

---

## 🙏 **Acknowledgments**

Built on modern GraphRAG and agentic data pipeline patterns, inspired by:
- Microsoft GraphRAG
- LlamaIndex
- LangChain
- Recent ACL/NAACL/ICLR papers on retrieval evaluation

