Pipeline Processing Reference
==============================

Overview
--------

The OneCite pipeline is a 4-stage process that transforms raw references into formatted citations:

1. **Validation** - Check reference validity
2. **Identification** - Query data sources
3. **Completion** - Enrich with metadata
4. **Formatting** - Convert to output format

Pipeline Stages
===============

Stage 1: Validation
-------------------

**Purpose:** Ensure input is valid and can be processed

**Input:** Raw reference text

**Output:** Validated RawEntry object

**Process:**

1. Check for empty/null input
2. Validate format (txt or bib)
3. Detect reference type
4. Extract metadata hints

**Error Handling:**

Raises ``ValidationError`` if:

- Input is empty
- Format is unrecognized
- Data is malformed
- Required fields missing

**Example:**

::

    from onecite import RawEntry
    from onecite.pipeline import Validator
    
    raw = RawEntry(content="10.1038/nature14539")
    validator = Validator()
    
    if validator.validate(raw):
        print("Valid reference")
    else:
        print("Invalid reference")

Stage 2: Identification
-----------------------

**Purpose:** Find matching citations in data sources

**Input:** Validated RawEntry

**Output:** List of IdentifiedEntry objects

**Process:**

1. Detect identifier type (DOI, arXiv, etc.)
2. Query appropriate data source
3. Parse results
4. Rank by relevance
5. Return candidates

**Data Sources:**

- CrossRef (DOI-based)
- Semantic Scholar (keyword search)
- OpenAlex (academic graph)
- PubMed (biomedical)
- DBLP (computer science)
- arXiv (preprints)
- DataCite (datasets)
- Zenodo (open research)
- Google Books (books)

**Intelligent Routing:**

OneCite automatically selects best sources:

- **Medical terms** → PubMed priority
- **CS terms** → DBLP/arXiv priority
- **DOI** → CrossRef priority
- **Mixed** → Semantic Scholar

**Example:**

::

    from onecite.pipeline import Identifier
    from onecite import RawEntry
    
    identifier = Identifier()
    raw = RawEntry(content="10.1038/nature14539")
    
    matches = identifier.identify(raw)
    for match in matches:
        print(f"{match.title} ({match.year})")

Stage 3: Completion
-------------------

**Purpose:** Enrich entries with complete metadata

**Input:** IdentifiedEntry (often incomplete)

**Output:** CompletedEntry (fully enriched)

**Process:**

1. Query additional data sources
2. Fill missing fields
3. Normalize author names
4. Verify publication details
5. Calculate completeness score

**Fields Enriched:**

- Authors
- Title
- Journal/Publisher
- Year
- Volume/Issue
- Pages
- DOI/URL
- Keywords
- Abstract

**Completeness Scoring:**

A score from 0-1 indicating data completeness:

- 0.9-1.0: Excellent (all fields present)
- 0.7-0.9: Good (most fields present)
- 0.5-0.7: Fair (core fields present)
- < 0.5: Poor (incomplete)

**Example:**

::

    from onecite.pipeline import Completer
    from onecite import IdentifiedEntry
    
    completer = Completer()
    identified = IdentifiedEntry(...)
    
    completed = completer.complete(identified)
    print(f"Completeness: {completed.completeness_score}")

Stage 4: Formatting
-------------------

**Purpose:** Convert to output format

**Input:** CompletedEntry

**Output:** Formatted string

**Supported Formats:**

- BibTeX
- APA
- MLA
- Custom (via templates)

**Process:**

1. Load template for format
2. Map fields to template variables
3. Apply formatting rules
4. Handle special characters
5. Return formatted string

**Example:**

::

    from onecite.pipeline import Formatter
    from onecite import CompletedEntry
    
    formatter = Formatter()
    completed = CompletedEntry(...)
    
    # BibTeX output
    bibtex = formatter.format(completed, "bibtex")
    
    # APA output
    apa = formatter.format(completed, "apa")

Complete Pipeline
=================

The PipelineController orchestrates all stages:

::

    from onecite import PipelineController
    
    controller = PipelineController()
    
    result = controller.process(
        entries=["10.1038/nature14539"],
        output_format="bibtex"
    )

Internal Process
~~~~~~~~~~~~~~~~

1. Validate input
2. For each entry:
   - Identify sources
   - Select best match
   - Complete entry
   - Format output
3. Aggregate results
4. Return summary

Advanced Pipeline Usage
=======================

Custom Data Processing
----------------------

::

    from onecite.pipeline import (
        Validator,
        Identifier,
        Completer,
        Formatter
    )
    from onecite import RawEntry
    
    # Create components
    validator = Validator()
    identifier = Identifier()
    completer = Completer()
    formatter = Formatter()
    
    # Manual pipeline
    raw = RawEntry(content="10.1038/nature14539")
    
    # Stage 1
    if not validator.validate(raw):
        raise ValidationError("Invalid reference")
    
    # Stage 2
    matches = identifier.identify(raw)
    if not matches:
        raise ResolverError("No matches found")
    
    # Stage 3
    identified = matches[0]
    completed = completer.complete(identified)
    
    # Stage 4
    formatted = formatter.format(completed, "bibtex")
    print(formatted)

Batch Processing
----------------

::

    from onecite import PipelineController
    
    controller = PipelineController()
    
    references = [
        "10.1038/nature14539",
        "1706.03762",
        "Smith (2020) Machine Learning"
    ]
    
    result = controller.process(
        entries=references,
        output_format="bibtex"
    )
    
    print(f"Processed: {result['processed_count']}")
    print(f"Failed: {result['failed_count']}")

Performance Optimization
------------------------

**Single Reference:**

::

    # Fast path for single reference
    result = process_references("10.1038/nature14539")

**Batch References:**

::

    # Use --quiet flag for better performance
    onecite process refs.txt --quiet -o output.bib

**Large Batches:**

::

    # Split into chunks
    split -l 100 large_file.txt chunk_
    
    for chunk in chunk_*; do
        onecite process "$chunk" -o "${chunk}.bib" --quiet
    done

Error Handling in Pipeline
==========================

Validation Errors
-----------------

::

    from onecite import ValidationError
    
    try:
        result = process_references("")
    except ValidationError:
        print("Empty input")

Resolution Errors
-----------------

::

    from onecite import ResolverError
    
    try:
        result = process_references("invalid/doi")
    except ResolverError:
        print("Could not find reference")
        print("Check identifier or try again later")

Partial Success
---------------

::

    from onecite import process_references
    
    result = process_references(mixed_refs)
    
    print(f"Success: {result['processed_count']}")
    print(f"Failed: {result['failed_count']}")
    
    if result['warnings']:
        for warning in result['warnings']:
            print(f"Warning: {warning}")

Pipeline Configuration
======================

Custom Templates
----------------

::

    from onecite import PipelineController
    
    controller = PipelineController()
    controller.add_template_path("./my_templates")
    
    result = controller.process(
        entries=["10.1038/nature14539"],
        output_format="my_format"
    )

Data Source Priority
--------------------

::

    from onecite.pipeline import Identifier
    
    identifier = Identifier()
    
    # Set priority for specific query types
    identifier.set_source_priority(
        query_type="biomedical",
        sources=["pubmed", "crossref", "openalex"]
    )

Timeout Configuration
---------------------

::

    from onecite import PipelineController
    
    controller = PipelineController()
    controller.set_timeout(10)  # 10 seconds per query

Next Steps
----------

- See :doc:`../python_api` for usage examples
- Check :doc:`../api/core` for class reference
- Review :doc:`../advanced_usage` for complex scenarios
