Metadata-Version: 2.4
Name: bibr
Version: 0.0.1
Summary: Scientific paper processing pipeline for metadata extraction from PDF/DOCX/LaTeX/etc.
Project-URL: Homepage, https://bibr.org
Project-URL: Repository, https://github.com/scienceverse/pytacheck
Project-URL: Documentation, https://bibr.org
Author-email: Jakub Langr <james@bibr.org>
License: GPL-3.0
License-File: LICENSE.md
Keywords: bibliography,metadata extraction,pdf,research,scientific papers
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Text Processing
Requires-Python: >=3.10
Description-Content-Type: text/markdown

# bibr 🦫
<!-- badges: start -->
![Version](https://img.shields.io/badge/version-0.1.3-blue.svg)
![Made in Europe](https://img.shields.io/badge/Made_in_Europe-003399?logo=european-union&logoColor=FFCC00)
[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
[![codecov](https://codecov.io/gh/scienceverse/pytacheck/graph/badge.svg?token=Mt0vQyE4qX)](https://codecov.io/gh/scienceverse/pytacheck)
<!-- badges: end -->

## Description
A scientific paper processing pipeline that extracts comprehensive metadata from PDF, DOCX, LaTeX, and other formats. Designed as a preprocessing backend for [Metacheck](https://github.com/scienceverse/metacheck).

### Overview

At its core, bibr aims to extract metadata more accurately than existing tools by combining traditional parsing with more modern approaches (OCR and structured LLM-based extraction), while offering additional features on top.
It is written in Python.

### Features
- easy-to-use Python library, installable with uv (see the usage sketch after this list)
- modular, scalable pipeline architecture
- supports PDF, TXT, Markdown, DOCX, HTML, etc. as input
- outputs an Arrow stream with parsed metadata, references, full-text content, etc.
- extracts:
   - paper metadata (title, authors, DOI, etc.) with high accuracy
   - sections with semantic classification of canonical sections (abstract, methods, etc.)
   - sentences (using spaCy tokenization for better accuracy)
   - tables with representation in markdown and JSON (currently fragile)
   - references / bibliography entries
   - (URL) links
- uses a mix of traditional methods (Regex, NLP) and structured LLM-based extraction for better accuracy
- response caching with Redis, with TTL and versioning options
- API key authentication, rate limiting, and other security features
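
The README does not document the public API, so the snippet below is only a minimal usage sketch under assumptions: the `bibr.process()` entry point and its return value are hypothetical, not the documented interface. Only the "Arrow stream" output format comes from the feature list above.

```python
# Hypothetical usage sketch -- `bibr.process` and its return type are assumptions,
# not the documented interface; only the Arrow-stream output is taken from this README.
import bibr
import pyarrow as pa

# Assumed entry point: run the pipeline on one document and get back an
# Arrow IPC stream containing metadata, sections, sentences, references, etc.
stream_bytes = bibr.process("paper.pdf")

# Read the Arrow IPC stream with pyarrow and inspect the resulting schema.
reader = pa.ipc.open_stream(stream_bytes)
table = reader.read_all()
print(table.schema)
```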

### Dependencies

Not every service is used for every document; the pipeline adapts based on input type and available metadata.

- **LightOnOCR 2**: Open-source OCR model for PDFs, made in the EU (https://huggingface.co/lightonai/LightOnOCR-2-1B)
- **Pandoc**: Converts markdown/docx/LaTeX/etc. documents into a standardized structured tree format (AST); see the sketch after this list
- **LLM API**: Helps extract paper metadata with structured output
- **Crossref API**: Optional - enhances reference metadata, checks for DOIs, etc.
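
bibr's internal wiring for the Pandoc step isn't shown in this README; as a standalone illustration of what Pandoc produces, the sketch below uses the `pypandoc` wrapper (an assumption — bibr may invoke the `pandoc` binary differently) to convert a DOCX file into Pandoc's JSON-serialized AST.

```python
# Standalone illustration of the Pandoc conversion step.
# Using pypandoc here is an assumption; bibr may call pandoc differently.
import json
import pypandoc

# Convert a DOCX document to Pandoc's JSON-serialized AST.
ast_json = pypandoc.convert_file("paper.docx", to="json")
ast = json.loads(ast_json)

# The AST is a tree of typed blocks (Para, Header, Table, ...) that a pipeline
# can walk to recover document structure independent of the input format.
print(ast["pandoc-api-version"])
print([block["t"] for block in ast["blocks"]][:10])
```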


## LLM disclaimer
In some places, bibr selectively uses Large Language Models to extract metadata more accurately. This is done only where strictly needed - such as with references and citations, where traditional methods (e.g. regex) are not always accurate or robust across varying contexts and citation styles.

### Supported LLM Providers
bibr supports multiple LLM providers. Set `LLM_PROVIDER` in your `.env` file (example `.env` after this list):
- **`google`** (default): Google AI Studio (Gemini). Requires `GOOGLE_API_KEY`.
- **`openai`**: OpenAI API or any OpenAI-compatible endpoint. Requires `LLM_API_KEY`. Use `LLM_BASE_URL` for custom endpoints (vLLM, LM Studio, etc.).
- **`ollama`**: Local Ollama instance. Set `OLLAMA_BASE_URL` if not using the default `http://localhost:11434`.
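
A minimal example `.env`, using only the variables listed above; the key values and URLs are placeholders, not defaults shipped with bibr.

```
# Choose one provider; variables for the other providers can be omitted.
LLM_PROVIDER=google
GOOGLE_API_KEY=your-google-ai-studio-key

# For an OpenAI-compatible endpoint instead:
# LLM_PROVIDER=openai
# LLM_API_KEY=your-api-key
# LLM_BASE_URL=http://localhost:8000/v1   # optional, e.g. vLLM or LM Studio

# For a local Ollama instance:
# LLM_PROVIDER=ollama
# OLLAMA_BASE_URL=http://localhost:11434
```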

## Privacy and security
For LLM providers, we recommend choosing providers with strict privacy policies or using open-source models. Be careful with API keys stored in the `.env` file; when deploying to production, use a secrets manager instead.

## How to run  