Metadata-Version: 2.4
Name: gliner2
Version: 1.3.1
Summary: GLiNER2: Unified Schema-Based Information Extraction and Text Classification
Maintainer: Urchade Zaratiana
License:                                  Apache License
                                   Version 2.0, January 2004
                                http://www.apache.org/licenses/
        
           TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
        
           1. Definitions.
        
              "License" shall mean the terms and conditions for use, reproduction,
              and distribution as defined by Sections 1 through 9 of this document.
        
              "Licensor" shall mean the copyright owner or entity authorized by
              the copyright owner that is granting the License.
        
              "Legal Entity" shall mean the union of the acting entity and all
              other entities that control, are controlled by, or are under common
              control with that entity. For the purposes of this definition,
              "control" means (i) the power, direct or indirect, to cause the
              direction or management of such entity, whether by contract or
              otherwise, or (ii) ownership of fifty percent (50%) or more of the
              outstanding shares, or (iii) beneficial ownership of such entity.
        
              "You" (or "Your") shall mean an individual or Legal Entity
              exercising permissions granted by this License.
        
              "Source" form shall mean the preferred form for making modifications,
              including but not limited to software source code, documentation
              source, and configuration files.
        
              "Object" form shall mean any form resulting from mechanical
              transformation or translation of a Source form, including but
              not limited to compiled object code, generated documentation,
              and conversions to other media types.
        
              "Work" shall mean the work of authorship, whether in Source or
              Object form, made available under the License, as indicated by a
              copyright notice that is included in or attached to the work
              (an example is provided in the Appendix below).
        
              "Derivative Works" shall mean any work, whether in Source or Object
              form, that is based on (or derived from) the Work and for which the
              editorial revisions, annotations, elaborations, or other modifications
              represent, as a whole, an original work of authorship. For the purposes
              of this License, Derivative Works shall not include works that remain
              separable from, or merely link (or bind by name) to the interfaces of,
              the Work and Derivative Works thereof.
        
              "Contribution" shall mean any work of authorship, including
              the original version of the Work and any modifications or additions
              to that Work or Derivative Works thereof, that is intentionally
              submitted to Licensor for inclusion in the Work by the copyright owner
              or by an individual or Legal Entity authorized to submit on behalf of
              the copyright owner. For the purposes of this definition, "submitted"
              means any form of electronic, verbal, or written communication sent
              to the Licensor or its representatives, including but not limited to
              communication on electronic mailing lists, source code control systems,
              and issue tracking systems that are managed by, or on behalf of, the
              Licensor for the purpose of discussing and improving the Work, but
              excluding communication that is conspicuously marked or otherwise
              designated in writing by the copyright owner as "Not a Contribution."
        
              "Contributor" shall mean Licensor and any individual or Legal Entity
              on behalf of whom a Contribution has been received by Licensor and
              subsequently incorporated within the Work.
        
           2. Grant of Copyright License. Subject to the terms and conditions of
              this License, each Contributor hereby grants to You a perpetual,
              worldwide, non-exclusive, no-charge, royalty-free, irrevocable
              copyright license to reproduce, prepare Derivative Works of,
              publicly display, publicly perform, sublicense, and distribute the
              Work and such Derivative Works in Source or Object form.
        
           3. Grant of Patent License. Subject to the terms and conditions of
              this License, each Contributor hereby grants to You a perpetual,
              worldwide, non-exclusive, no-charge, royalty-free, irrevocable
              (except as stated in this section) patent license to make, have made,
              use, offer to sell, sell, import, and otherwise transfer the Work,
              where such license applies only to those patent claims licensable
              by such Contributor that are necessarily infringed by their
              Contribution(s) alone or by combination of their Contribution(s)
              with the Work to which such Contribution(s) was submitted. If You
              institute patent litigation against any entity (including a
              cross-claim or counterclaim in a lawsuit) alleging that the Work
              or a Contribution incorporated within the Work constitutes direct
              or contributory patent infringement, then any patent licenses
              granted to You under this License for that Work shall terminate
              as of the date such litigation is filed.
        
           4. Redistribution. You may reproduce and distribute copies of the
              Work or Derivative Works thereof in any medium, with or without
              modifications, and in Source or Object form, provided that You
              meet the following conditions:
        
              (a) You must give any other recipients of the Work or
                  Derivative Works a copy of this License; and
        
              (b) You must cause any modified files to carry prominent notices
                  stating that You changed the files; and
        
              (c) You must retain, in the Source form of any Derivative Works
                  that You distribute, all copyright, patent, trademark, and
                  attribution notices from the Source form of the Work,
                  excluding those notices that do not pertain to any part of
                  the Derivative Works; and
        
              (d) If the Work includes a "NOTICE" text file as part of its
                  distribution, then any Derivative Works that You distribute must
                  include a readable copy of the attribution notices contained
                  within such NOTICE file, excluding those notices that do not
                  pertain to any part of the Derivative Works, in at least one
                  of the following places: within a NOTICE text file distributed
                  as part of the Derivative Works; within the Source form or
                  documentation, if provided along with the Derivative Works; or,
                  within a display generated by the Derivative Works, if and
                  wherever such third-party notices normally appear. The contents
                  of the NOTICE file are for informational purposes only and
                  do not modify the License. You may add Your own attribution
                  notices within Derivative Works that You distribute, alongside
                  or as an addendum to the NOTICE text from the Work, provided
                  that such additional attribution notices cannot be construed
                  as modifying the License.
        
              You may add Your own copyright statement to Your modifications and
              may provide additional or different license terms and conditions
              for use, reproduction, or distribution of Your modifications, or
              for any such Derivative Works as a whole, provided Your use,
              reproduction, and distribution of the Work otherwise complies with
              the conditions stated in this License.
        
           5. Submission of Contributions. Unless You explicitly state otherwise,
              any Contribution intentionally submitted for inclusion in the Work
              by You to the Licensor shall be under the terms and conditions of
              this License, without any additional terms or conditions.
              Notwithstanding the above, nothing herein shall supersede or modify
              the terms of any separate license agreement you may have executed
              with Licensor regarding such Contributions.
        
           6. Trademarks. This License does not grant permission to use the trade
              names, trademarks, service marks, or product names of the Licensor,
              except as required for reasonable and customary use in describing the
              origin of the Work and reproducing the content of the NOTICE file.
        
           7. Disclaimer of Warranty. Unless required by applicable law or
              agreed to in writing, Licensor provides the Work (and each
              Contributor provides its Contributions) on an "AS IS" BASIS,
              WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
              implied, including, without limitation, any warranties or conditions
              of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
              PARTICULAR PURPOSE. You are solely responsible for determining the
              appropriateness of using or redistributing the Work and assume any
              risks associated with Your exercise of permissions under this License.
        
           8. Limitation of Liability. In no event and under no legal theory,
              whether in tort (including negligence), contract, or otherwise,
              unless required by applicable law (such as deliberate and grossly
              negligent acts) or agreed to in writing, shall any Contributor be
              liable to You for damages, including any direct, indirect, special,
              incidental, or consequential damages of any character arising as a
              result of this License or out of the use or inability to use the
              Work (including but not limited to damages for loss of goodwill,
              work stoppage, computer failure or malfunction, or any and all
              other commercial damages or losses), even if such Contributor
              has been advised of the possibility of such damages.
        
           9. Accepting Warranty or Additional Liability. While redistributing
              the Work or Derivative Works thereof, You may choose to offer,
              and charge a fee for, acceptance of support, warranty, indemnity,
              or other liability obligations and/or rights consistent with this
              License. However, in accepting such obligations, You may act only
              on Your own behalf and on Your sole responsibility, not on behalf
              of any other Contributor, and only if You agree to indemnify,
              defend, and hold each Contributor harmless for any liability
              incurred by, or claims asserted against, such Contributor by reason
              of your accepting any such warranty or additional liability.
        
           END OF TERMS AND CONDITIONS
        
           APPENDIX: How to apply the Apache License to your work.
        
              To apply the Apache License to your work, attach the following
              boilerplate notice, with the fields enclosed by brackets "[]"
              replaced with your own identifying information. (Don't include
              the brackets!)  The text should be enclosed in the appropriate
              comment syntax for the file format. We also recommend that a
              file or class name and description of purpose be included on the
              same "printed page" as the copyright notice for easier
              identification within third-party archives.
        
           Copyright [yyyy] [name of copyright owner]
        
           Licensed under the Apache License, Version 2.0 (the "License");
           you may not use this file except in compliance with the License.
           You may obtain a copy of the License at
        
               http://www.apache.org/licenses/LICENSE-2.0
        
           Unless required by applicable law or agreed to in writing, software
           distributed under the License is distributed on an "AS IS" BASIS,
           WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
           See the License for the specific language governing permissions and
           limitations under the License.
        
Project-URL: Homepage, https://github.com/fastino-ai/GLiNER2
Project-URL: Repository, https://github.com/fastino-ai/GLiNER2
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: gliner
Requires-Dist: peft
Requires-Dist: pydantic>=2.0.0
Requires-Dist: requests>=2.28
Requires-Dist: urllib3>=1.26
Dynamic: license-file

<div align="center">
  <a href="https://pioneer.ai/gliner" target="_blank" rel="noopener noreferrer">
    <img src="image/GitHub.png" alt="Pioneer AI - Fine-tune GLiNER with a single prompt" width="100%"/>
  </a>
</div>

# GLiNER2: Unified Schema-Based Information Extraction and Text Classification

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![PyPI version](https://badge.fury.io/py/gliner2.svg)](https://badge.fury.io/py/gliner2)
[![Downloads](https://pepy.tech/badge/gliner2)](https://pepy.tech/project/gliner2)
[![Reddit](https://img.shields.io/badge/Reddit-r%2FGLiNER-FF4500?logo=reddit&logoColor=white)](https://www.reddit.com/r/GLiNER/)
[![Discord](https://img.shields.io/badge/Discord-Join%20Community-5865F2?logo=discord&logoColor=white)](https://discord.gg/fastino)

> *Extract entities, classify text, parse structured data, and extract relations—all in one efficient model.*

GLiNER2 unifies **Named Entity Recognition**, **Text Classification**, **Structured Data Extraction**, and **Relation Extraction** into a single 205M parameter model. It provides efficient CPU-based inference without requiring complex pipelines or external API dependencies.

Fine-tune via [Pioneer](https://pioneer.ai/gliner). Additional documentation via [Pioneer docs](https://agent.pioneer.ai/docs/api-reference). Join discussions on [Discord](https://discord.gg/fastino) and [Reddit](https://www.reddit.com/r/GLiNER/).

## ✨ Why GLiNER2?

- **🎯 One Model, Four Tasks**: Entities, classification, structured data, and relations in a single forward pass
- **💻 CPU First**: Lightning-fast inference on standard hardware—no GPU required
- **🛡️ Privacy**: 100% local processing, zero external dependencies

## 🚀 Installation & Quick Start

```bash
# Schema validation, API client, training-data utilities — no torch required
pip install gliner2

# Full local inference and training (installs torch, transformers, etc.)
pip install gliner2[local]
```

The base install gives you `Schema`, `SchemaInput`, `RegexValidator`, `GLiNER2API`,
`InputExample`, `TrainingDataset`, and all JSONL validation tooling — everything
needed to build schemas, validate data, and call the cloud API without pulling in
PyTorch.

To load and run models locally, install the `[local]` extra:

```python
from gliner2 import GLiNER2  # requires gliner2[local]

# Load model once, use everywhere
extractor = GLiNER2.from_pretrained("fastino/gliner2-base-v1")

# Extract entities in one line
text = "Apple CEO Tim Cook announced iPhone 15 in Cupertino yesterday."
result = extractor.extract_entities(text, ["company", "person", "product", "location"])

print(result)
# {'entities': {'company': ['Apple'], 'person': ['Tim Cook'], 'product': ['iPhone 15'], 'location': ['Cupertino']}}
```

### Quantization and Compilation

Enable fp16 and/or `torch.compile` for faster inference — no extra dependencies required.

```python
# fp16
model = GLiNER2.from_pretrained("fastino/gliner2-base-v1", map_location="cuda", quantize=True)

# torch.compile (fused GPU kernels, first call triggers tracing)
model = GLiNER2.from_pretrained("fastino/gliner2-base-v1", map_location="cuda", compile=True)

# Both
model = GLiNER2.from_pretrained("fastino/gliner2-base-v1", map_location="cuda", quantize=True, compile=True)

# Or after loading
model.quantize()
model.compile()
```

### 🌐 API Access: GLiNER XL 1B

Our biggest and most powerful model—**GLiNER XL 1B**—is available exclusively via API. No GPU required, no model downloads, just instant access to state-of-the-art extraction. Get your API key at [gliner.pioneer.ai](https://gliner.pioneer.ai).

```python
from gliner2 import GLiNER2

# Access GLiNER XL 1B via API
extractor = GLiNER2.from_api()  # Uses PIONEER_API_KEY env variable

result = extractor.extract_entities(
    "OpenAI CEO Sam Altman announced GPT-5 at their San Francisco headquarters.",
    ["company", "person", "product", "location"]
)
# {'entities': {'company': ['OpenAI'], 'person': ['Sam Altman'], 'product': ['GPT-5'], 'location': ['San Francisco']}}
```

## 📦 Available Models

| Model | Parameters | Description | Use Case                                         |
|-------|------------|-------------|--------------------------------------------------|
| `fastino/gliner2-base-v1` | 205M | base size   | Extraction / classification |
| `fastino/gliner2-large-v1` | 340M | large size  | Extraction / classification                      |

The models are available on [Hugging Face](https://huggingface.co/collections/fastino/gliner2-family).

## 📚 Documentation & Tutorials

Comprehensive guides for all GLiNER2 features:

### Core Features
- **[Text Classification](tutorial/1-classification.md)** - Single and multi-label classification with confidence scores
- **[Entity Extraction](tutorial/2-ner.md)** - Named entity recognition with descriptions and spans
- **[Structured Data Extraction](tutorial/3-json_extraction.md)** - Parse complex JSON structures from text
- **[Combined Schemas](tutorial/4-combined.md)** - Multi-task extraction in a single pass
- **[Regex Validators](tutorial/5-validator.md)** - Filter and validate extracted spans
- **[Relation Extraction](tutorial/6-relation_extraction.md)** - Extract relationships between entities
- **[API Access](tutorial/7-api.md)** - Use GLiNER2 via cloud API

### Training & Customization
- **[Training Data Format](tutorial/8-train_data.md)** - Complete guide to preparing training data
- **[Model Training](tutorial/9-training.md)** - Train custom models for your domain
- **[LoRA Adapters](tutorial/10-lora_adapters.md)** - Parameter-efficient fine-tuning
- **[Adapter Switching](tutorial/11-adapter_switching.md)** - Switch between domain adapters

## 🎯 Core Capabilities

### 1. Entity Extraction
Extract named entities with optional descriptions for precision:

```python
# Basic entity extraction
entities = extractor.extract_entities(
    "Patient received 400mg ibuprofen for severe headache at 2 PM.",
    ["medication", "dosage", "symptom", "time"]
)
# Output: {'entities': {'medication': ['ibuprofen'], 'dosage': ['400mg'], 'symptom': ['severe headache'], 'time': ['2 PM']}}

# Enhanced with descriptions for medical accuracy
entities = extractor.extract_entities(
    "Patient received 400mg ibuprofen for severe headache at 2 PM.",
    {
        "medication": "Names of drugs, medications, or pharmaceutical substances",
        "dosage": "Specific amounts like '400mg', '2 tablets', or '5ml'",
        "symptom": "Medical symptoms, conditions, or patient complaints",
        "time": "Time references like '2 PM', 'morning', or 'after lunch'"
    }
)
# Same output but with higher accuracy due to context descriptions

# With confidence scores
entities = extractor.extract_entities(
    "Apple Inc. CEO Tim Cook announced iPhone 15 in Cupertino.",
    ["company", "person", "product", "location"],
    include_confidence=True
)
# Output: {
#     'entities': {
#         'company': [{'text': 'Apple Inc.', 'confidence': 0.95}],
#         'person': [{'text': 'Tim Cook', 'confidence': 0.92}],
#         'product': [{'text': 'iPhone 15', 'confidence': 0.88}],
#         'location': [{'text': 'Cupertino', 'confidence': 0.90}]
#     }
# }

# With character positions (spans)
entities = extractor.extract_entities(
    "Apple Inc. CEO Tim Cook announced iPhone 15 in Cupertino.",
    ["company", "person", "product"],
    include_spans=True
)
# Output: {
#     'entities': {
#         'company': [{'text': 'Apple Inc.', 'start': 0, 'end': 9}],
#         'person': [{'text': 'Tim Cook', 'start': 15, 'end': 23}],
#         'product': [{'text': 'iPhone 15', 'start': 35, 'end': 44}]
#     }
# }

# With both confidence and spans
entities = extractor.extract_entities(
    "Apple Inc. CEO Tim Cook announced iPhone 15 in Cupertino.",
    ["company", "person", "product"],
    include_confidence=True,
    include_spans=True
)
# Output: {
#     'entities': {
#         'company': [{'text': 'Apple Inc.', 'confidence': 0.95, 'start': 0, 'end': 9}],
#         'person': [{'text': 'Tim Cook', 'confidence': 0.92, 'start': 15, 'end': 23}],
#         'product': [{'text': 'iPhone 15', 'confidence': 0.88, 'start': 35, 'end': 44}]
#     }
# }
```

### 2. Text Classification
Single or multi-label classification with configurable confidence:

```python
# Sentiment analysis
result = extractor.classify_text(
    "This laptop has amazing performance but terrible battery life!",
    {"sentiment": ["positive", "negative", "neutral"]}
)
# Output: {'sentiment': 'negative'}

# Multi-aspect classification
result = extractor.classify_text(
    "Great camera quality, decent performance, but poor battery life.",
    {
        "aspects": {
            "labels": ["camera", "performance", "battery", "display", "price"],
            "multi_label": True,
            "cls_threshold": 0.4
        }
    }
)
# Output: {'aspects': ['camera', 'performance', 'battery']}

# With confidence scores
result = extractor.classify_text(
    "This laptop has amazing performance but terrible battery life!",
    {"sentiment": ["positive", "negative", "neutral"]},
    include_confidence=True
)
# Output: {'sentiment': {'label': 'negative', 'confidence': 0.82}}

# Multi-label with confidence
schema = extractor.create_schema().classification(
    "topics",
    ["technology", "business", "health", "politics", "sports"],
    multi_label=True,
    cls_threshold=0.3
)
text = "Apple announced new health monitoring features in their latest smartwatch, boosting their stock price."
results = extractor.extract(text, schema, include_confidence=True)
# Output: {
#     'topics': [
#         {'label': 'technology', 'confidence': 0.92},
#         {'label': 'business', 'confidence': 0.78},
#         {'label': 'health', 'confidence': 0.65}
#     ]
# }
```

### 3. Structured Data Extraction
Parse complex structured information with field-level control:

```python
# Product information extraction
text = "iPhone 15 Pro Max with 256GB storage, A17 Pro chip, priced at $1199. Available in titanium and black colors."

result = extractor.extract_json(
    text,
    {
        "product": [
            "name::str::Full product name and model",
            "storage::str::Storage capacity like 256GB or 1TB", 
            "processor::str::Chip or processor information",
            "price::str::Product price with currency",
            "colors::list::Available color options"
        ]
    }
)
# Output: {
#     'product': [{
#         'name': 'iPhone 15 Pro Max',
#         'storage': '256GB', 
#         'processor': 'A17 Pro chip',
#         'price': '$1199',
#         'colors': ['titanium', 'black']
#     }]
# }

# Multiple structured entities
text = "Apple Inc. headquarters in Cupertino launched iPhone 15 for $999 and MacBook Air for $1299."

result = extractor.extract_json(
    text,
    {
        "company": [
            "name::str::Company name",
            "location::str::Company headquarters or office location"
        ],
        "products": [
            "name::str::Product name and model",
            "price::str::Product retail price"
        ]
    }
)
# Output: {
#     'company': [{'name': 'Apple Inc.', 'location': 'Cupertino'}],
#     'products': [
#         {'name': 'iPhone 15', 'price': '$999'},
#         {'name': 'MacBook Air', 'price': '$1299'}
#     ]
# }

# With confidence scores
result = extractor.extract_json(
    "The MacBook Pro costs $1999 and features M3 chip, 16GB RAM, and 512GB storage.",
    {
        "product": [
            "name::str",
            "price",
            "features"
        ]
    },
    include_confidence=True
)
# Output: {
#     'product': [{
#         'name': {'text': 'MacBook Pro', 'confidence': 0.95},
#         'price': [{'text': '$1999', 'confidence': 0.92}],
#         'features': [
#             {'text': 'M3 chip', 'confidence': 0.88},
#             {'text': '16GB RAM', 'confidence': 0.90},
#             {'text': '512GB storage', 'confidence': 0.87}
#         ]
#     }]
# }

# With character positions (spans)
result = extractor.extract_json(
    "The MacBook Pro costs $1999 and features M3 chip.",
    {
        "product": [
            "name::str",
            "price"
        ]
    },
    include_spans=True
)
# Output: {
#     'product': [{
#         'name': {'text': 'MacBook Pro', 'start': 4, 'end': 15},
#         'price': [{'text': '$1999', 'start': 22, 'end': 27}]
#     }]
# }

# With both confidence and spans
result = extractor.extract_json(
    "The MacBook Pro costs $1999 and features M3 chip, 16GB RAM, and 512GB storage.",
    {
        "product": [
            "name::str",
            "price",
            "features"
        ]
    },
    include_confidence=True,
    include_spans=True
)
# Output: {
#     'product': [{
#         'name': {'text': 'MacBook Pro', 'confidence': 0.95, 'start': 4, 'end': 15},
#         'price': [{'text': '$1999', 'confidence': 0.92, 'start': 22, 'end': 27}],
#         'features': [
#             {'text': 'M3 chip', 'confidence': 0.88, 'start': 32, 'end': 39},
#             {'text': '16GB RAM', 'confidence': 0.90, 'start': 41, 'end': 49},
#             {'text': '512GB storage', 'confidence': 0.87, 'start': 55, 'end': 68}
#         ]
#     }]
# }
```

### 4. Relation Extraction
Extract relationships between entities as directional tuples:

```python
# Basic relation extraction
text = "John works for Apple Inc. and lives in San Francisco. Apple Inc. is located in Cupertino."

result = extractor.extract_relations(
    text,
    ["works_for", "lives_in", "located_in"]
)
# Output: {
#     'relation_extraction': {
#         'works_for': [('John', 'Apple Inc.')],
#         'lives_in': [('John', 'San Francisco')],
#         'located_in': [('Apple Inc.', 'Cupertino')]
#     }
# }

# With descriptions for better accuracy
schema = extractor.create_schema().relations({
    "works_for": "Employment relationship where person works at organization",
    "founded": "Founding relationship where person created organization",
    "acquired": "Acquisition relationship where company bought another company",
    "located_in": "Geographic relationship where entity is in a location"
})

text = "Elon Musk founded SpaceX in 2002. SpaceX is located in Hawthorne, California."
results = extractor.extract(text, schema)
# Output: {
#     'relation_extraction': {
#         'founded': [('Elon Musk', 'SpaceX')],
#         'located_in': [('SpaceX', 'Hawthorne, California')]
#     }
# }

# With confidence scores
results = extractor.extract_relations(
    "John works for Apple Inc. and lives in San Francisco.",
    ["works_for", "lives_in"],
    include_confidence=True
)
# Output: {
#     'relation_extraction': {
#         'works_for': [{
#             'head': {'text': 'John', 'confidence': 0.95},
#             'tail': {'text': 'Apple Inc.', 'confidence': 0.92}
#         }],
#         'lives_in': [{
#             'head': {'text': 'John', 'confidence': 0.94},
#             'tail': {'text': 'San Francisco', 'confidence': 0.91}
#         }]
#     }
# }

# With character positions (spans)
results = extractor.extract_relations(
    "John works for Apple Inc. and lives in San Francisco.",
    ["works_for", "lives_in"],
    include_spans=True
)
# Output: {
#     'relation_extraction': {
#         'works_for': [{
#             'head': {'text': 'John', 'start': 0, 'end': 4},
#             'tail': {'text': 'Apple Inc.', 'start': 15, 'end': 25}
#         }],
#         'lives_in': [{
#             'head': {'text': 'John', 'start': 0, 'end': 4},
#             'tail': {'text': 'San Francisco', 'start': 33, 'end': 46}
#         }]
#     }
# }

# With both confidence and spans
results = extractor.extract_relations(
    "John works for Apple Inc. and lives in San Francisco.",
    ["works_for", "lives_in"],
    include_confidence=True,
    include_spans=True
)
# Output: {
#     'relation_extraction': {
#         'works_for': [{
#             'head': {'text': 'John', 'confidence': 0.95, 'start': 0, 'end': 4},
#             'tail': {'text': 'Apple Inc.', 'confidence': 0.92, 'start': 15, 'end': 25}
#         }],
#         'lives_in': [{
#             'head': {'text': 'John', 'confidence': 0.94, 'start': 0, 'end': 4},
#             'tail': {'text': 'San Francisco', 'confidence': 0.91, 'start': 33, 'end': 46}
#         }]
#     }
# }
```

### 5. Multi-Task Schema Composition
Combine all extraction types when you need comprehensive analysis:

```python
# Use create_schema() for multi-task scenarios
schema = (extractor.create_schema()
    # Extract key entities
    .entities({
        "person": "Names of people, executives, or individuals",
        "company": "Organization, corporation, or business names", 
        "product": "Products, services, or offerings mentioned"
    })
    
    # Classify the content
    .classification("sentiment", ["positive", "negative", "neutral"])
    .classification("category", ["technology", "business", "finance", "healthcare"])
    
    # Extract relationships
    .relations(["works_for", "founded", "located_in"])
    
    # Extract structured product details
    .structure("product_info")
        .field("name", dtype="str")
        .field("price", dtype="str")
        .field("features", dtype="list")
        .field("availability", dtype="str", choices=["in_stock", "pre_order", "sold_out"])
)

# Comprehensive extraction in one pass
text = "Apple CEO Tim Cook unveiled the revolutionary iPhone 15 Pro for $999. The device features an A17 Pro chip and titanium design. Tim Cook works for Apple, which is located in Cupertino."

results = extractor.extract(text, schema)
# Output: {
#     'entities': {
#         'person': ['Tim Cook'], 
#         'company': ['Apple'], 
#         'product': ['iPhone 15 Pro']
#     },
#     'sentiment': 'positive',
#     'category': 'technology',
#     'relation_extraction': {
#         'works_for': [('Tim Cook', 'Apple')],
#         'located_in': [('Apple', 'Cupertino')]
#     },
#     'product_info': [{
#         'name': 'iPhone 15 Pro',
#         'price': '$999',
#         'features': ['A17 Pro chip', 'titanium design'],
#         'availability': 'in_stock'
#     }]
# }
```

## 🏭 Example Usage Scenarios

### Financial Document Processing

```python
financial_text = """
Transaction Report: Goldman Sachs processed a $2.5M equity trade for Tesla Inc. 
on March 15, 2024. Commission: $1,250. Status: Completed.
"""

# Extract structured financial data
result = extractor.extract_json(
    financial_text,
    {
        "transaction": [
            "broker::str::Financial institution or brokerage firm",
            "amount::str::Transaction amount with currency",
            "security::str::Stock, bond, or financial instrument",
            "date::str::Transaction date",
            "commission::str::Fees or commission charged", 
            "status::str::Transaction status",
            "type::[equity|bond|option|future|forex]::str::Type of financial instrument"
        ]
    }
)
# Output: {
#     'transaction': [{
#         'broker': 'Goldman Sachs',
#         'amount': '$2.5M', 
#         'security': 'Tesla Inc.',
#         'date': 'March 15, 2024',
#         'commission': '$1,250',
#         'status': 'Completed',
#         'type': 'equity'
#     }]
# }
```

### Healthcare Information Extraction

```python
medical_record = """
Patient: Sarah Johnson, 34, presented with acute chest pain and shortness of breath.
Prescribed: Lisinopril 10mg daily, Metoprolol 25mg twice daily.
Follow-up scheduled for next Tuesday.
"""

result = extractor.extract_json(
    medical_record,
    {
        "patient_info": [
            "name::str::Patient full name",
            "age::str::Patient age",
            "symptoms::list::Reported symptoms or complaints"
        ],
        "prescriptions": [
            "medication::str::Drug or medication name",
            "dosage::str::Dosage amount and frequency",
            "frequency::str::How often to take the medication"
        ]
    }
)
# Output: {
#     'patient_info': [{
#         'name': 'Sarah Johnson',
#         'age': '34',
#         'symptoms': ['acute chest pain', 'shortness of breath']
#     }],
#     'prescriptions': [
#         {'medication': 'Lisinopril', 'dosage': '10mg', 'frequency': 'daily'},
#         {'medication': 'Metoprolol', 'dosage': '25mg', 'frequency': 'twice daily'}
#     ]
# }
```

### Legal Contract Analysis

```python
contract_text = """
Service Agreement between TechCorp LLC and DataSystems Inc., effective January 1, 2024.
Monthly fee: $15,000. Contract term: 24 months with automatic renewal.
Termination clause: 30-day written notice required.
"""

# Multi-task extraction for comprehensive analysis
schema = (extractor.create_schema()
    .entities(["company", "date", "duration", "fee"])
    .classification("contract_type", ["service", "employment", "nda", "partnership"])
    .relations(["signed_by", "involves", "dated"])
    .structure("contract_terms")
        .field("parties", dtype="list")
        .field("effective_date", dtype="str")
        .field("monthly_fee", dtype="str")
        .field("term_length", dtype="str")
        .field("renewal", dtype="str", choices=["automatic", "manual", "none"])
        .field("termination_notice", dtype="str")
)

results = extractor.extract(contract_text, schema)
# Output: {
#     'entities': {
#         'company': ['TechCorp LLC', 'DataSystems Inc.'],
#         'date': ['January 1, 2024'],
#         'duration': ['24 months'],
#         'fee': ['$15,000']
#     },
#     'contract_type': 'service',
#     'relation_extraction': {
#         'involves': [('TechCorp LLC', 'DataSystems Inc.')],
#         'dated': [('Service Agreement', 'January 1, 2024')]
#     },
#     'contract_terms': [{
#         'parties': ['TechCorp LLC', 'DataSystems Inc.'],
#         'effective_date': 'January 1, 2024',
#         'monthly_fee': '$15,000',
#         'term_length': '24 months', 
#         'renewal': 'automatic',
#         'termination_notice': '30-day written notice'
#     }]
# }
```

### Knowledge Graph Construction

```python
# Extract entities and relations for knowledge graph building
text = """
Elon Musk founded SpaceX in 2002. SpaceX is located in Hawthorne, California.
SpaceX acquired Swarm Technologies in 2021. Many engineers work for SpaceX.
"""

schema = (extractor.create_schema()
    .entities(["person", "organization", "location", "date"])
    .relations({
        "founded": "Founding relationship where person created organization",
        "acquired": "Acquisition relationship where company bought another company",
        "located_in": "Geographic relationship where entity is in a location",
        "works_for": "Employment relationship where person works at organization"
    })
)

results = extractor.extract(text, schema)
# Output: {
#     'entities': {
#         'person': ['Elon Musk', 'engineers'],
#         'organization': ['SpaceX', 'Swarm Technologies'],
#         'location': ['Hawthorne, California'],
#         'date': ['2002', '2021']
#     },
#     'relation_extraction': {
#         'founded': [('Elon Musk', 'SpaceX')],
#         'acquired': [('SpaceX', 'Swarm Technologies')],
#         'located_in': [('SpaceX', 'Hawthorne, California')],
#         'works_for': [('engineers', 'SpaceX')]
#     }
# }
```

## ⚙️ Advanced Configuration

### Custom Confidence Thresholds

```python
# High-precision extraction for critical fields
result = extractor.extract_json(
    text,
    {
        "financial_data": [
            "account_number::str::Bank account number",  # default threshold
            "amount::str::Transaction amount",           # default threshold  
            "routing_number::str::Bank routing number"   # default threshold
        ]
    },
    threshold=0.9  # High confidence for all fields
)

# Per-field thresholds using schema builder (for multi-task scenarios)
schema = (extractor.create_schema()
    .structure("sensitive_data")
        .field("ssn", dtype="str", threshold=0.95)         # Highest precision
        .field("email", dtype="str", threshold=0.8)        # Medium precision  
        .field("phone", dtype="str", threshold=0.7)        # Lower precision
)
```

### Field Types and Constraints

```python
# Structured extraction with choices and types
result = extractor.extract_json(
    "Premium subscription at $99/month with mobile and web access.",
    {
        "subscription": [
            "tier::[basic|premium|enterprise]::str::Subscription level",
            "price::str::Monthly or annual cost",
            "billing::[monthly|annual]::str::Billing frequency", 
            "features::[mobile|web|api|analytics]::list::Included features"
        ]
    }
)
# Output: {
#     'subscription': [{
#         'tier': 'premium',
#         'price': '$99/month', 
#         'billing': 'monthly',
#         'features': ['mobile', 'web']
#     }]
# }
```

## 🔍 Regex Validators

Filter extracted spans to ensure they match expected patterns, improving extraction quality and reducing false positives.

```python
from gliner2 import GLiNER2, RegexValidator

extractor = GLiNER2.from_pretrained("fastino/gliner2-base-v1")

# Email validation
email_validator = RegexValidator(r"^[\w\.-]+@[\w\.-]+\.\w+$")
schema = (extractor.create_schema()
    .structure("contact")
        .field("email", dtype="str", validators=[email_validator])
)

text = "Contact: john@company.com, not-an-email, jane@domain.org"
results = extractor.extract(text, schema)
# Output: {'contact': [{'email': 'john@company.com'}]}  # Only valid emails

# Phone number validation (US format)
phone_validator = RegexValidator(r"\(\d{3}\)\s\d{3}-\d{4}", mode="partial")
schema = (extractor.create_schema()
    .structure("contact")
        .field("phone", dtype="str", validators=[phone_validator])
)

text = "Call (555) 123-4567 or 5551234567"
results = extractor.extract(text, schema)
# Output: {'contact': [{'phone': '(555) 123-4567'}]}  # Second number filtered out

# URL validation
url_validator = RegexValidator(r"^https?://", mode="partial")
schema = (extractor.create_schema()
    .structure("links")
        .field("url", dtype="list", validators=[url_validator])
)

text = "Visit https://example.com or www.site.com"
results = extractor.extract(text, schema)
# Output: {'links': [{'url': ['https://example.com']}]}  # www.site.com filtered out

# Exclude test data
import re
no_test_validator = RegexValidator(r"^(test|demo|sample)", exclude=True, flags=re.IGNORECASE)
schema = (extractor.create_schema()
    .structure("products")
        .field("name", dtype="list", validators=[no_test_validator])
)

text = "Products: iPhone, Test Phone, Samsung Galaxy"
results = extractor.extract(text, schema)
# Output: {'products': [{'name': ['iPhone', 'Samsung Galaxy']}]}  # Test Phone excluded

# Multiple validators (all must pass)
username_validators = [
    RegexValidator(r"^[a-zA-Z0-9_]+$"),  # Alphanumeric + underscore
    RegexValidator(r"^.{3,20}$"),        # 3-20 characters
    RegexValidator(r"^(?!admin)", exclude=True, flags=re.IGNORECASE)  # No "admin"
]

schema = (extractor.create_schema()
    .structure("user")
        .field("username", dtype="str", validators=username_validators)
)

text = "Users: ab, john_doe, user@domain, admin, valid_user123"
results = extractor.extract(text, schema)
# Output: {'user': [{'username': 'john_doe'}]}  # Only valid usernames
```

## FlashDeberta (Optional GPU Acceleration)

For DebertaV2-based models, you can use [FlashDeberta](https://github.com/fastino-ai/flashdeberta) to accelerate inference on GPU via flash attention kernels.

**Install:**

```bash
pip install flashdeberta
```

**Use:**

```python
import os
os.environ["USE_FLASHDEBERTA"] = "1"  # set before importing gliner2

from gliner2 import GLiNER2

extractor = GLiNER2.from_pretrained("fastino/gliner2-base-v1")
# Prints: "Using FlashDeberta backend."

result = extractor.extract_entities(
    "Apple CEO Tim Cook announced iPhone 15 in Cupertino.",
    ["company", "person", "product", "location"]
)
```

The flag is only effective when the model uses a DebertaV2 encoder and the `flashdeberta` package is installed. Otherwise standard HuggingFace `AutoModel` is used automatically.

A benchmark script is included to compare the two backends:

```bash
python benchmarks/benchmark_flashdeberta.py
```

## 📦 Batch Processing

Process multiple texts efficiently in a single call:

```python
# Batch entity extraction
texts = [
    "Google's Sundar Pichai unveiled Gemini AI in Mountain View.",
    "Microsoft CEO Satya Nadella announced Copilot at Build 2023.",
    "Amazon's Andy Jassy revealed new AWS services in Seattle."
]

results = extractor.batch_extract_entities(
    texts,
    ["company", "person", "product", "location"],
    batch_size=8
)
# Returns list of results, one per input text

# Batch relation extraction
texts = [
    "John works for Microsoft and lives in Seattle.",
    "Sarah founded TechStartup in 2020.",
    "Bob reports to Alice at Google."
]

results = extractor.batch_extract_relations(
    texts,
    ["works_for", "founded", "reports_to", "lives_in"],
    batch_size=8
)
# Returns list of relation extraction results for each text
# All requested relation types appear in each result, even if empty

# Batch with confidence and spans
results = extractor.batch_extract_entities(
    texts,
    ["company", "person"],
    include_confidence=True,
    include_spans=True,
    batch_size=8
)
```

## 🎓 Training Custom Models

Train GLiNER2 on your own data to specialize for your domain or use case.

### Quick Start Training

```python
from gliner2 import GLiNER2
from gliner2.training.data import InputExample
from gliner2.training.trainer import GLiNER2Trainer, TrainingConfig

# 1. Prepare training data
examples = [
    InputExample(
        text="John works at Google in California.",
        entities={"person": ["John"], "company": ["Google"], "location": ["California"]}
    ),
    InputExample(
        text="Apple released iPhone 15.",
        entities={"company": ["Apple"], "product": ["iPhone 15"]}
    ),
    # Add more examples...
]

# 2. Configure training
model = GLiNER2.from_pretrained("fastino/gliner2-base-v1")
config = TrainingConfig(
    output_dir="./output",
    num_epochs=10,
    batch_size=8,
    encoder_lr=1e-5,
    task_lr=5e-4
)

# 3. Train
trainer = GLiNER2Trainer(model, config)
trainer.train(train_data=examples)
```

### Training Data Format (JSONL)

GLiNER2 uses JSONL format where each line contains an `input` and `output` field:

```jsonl
{"input": "Tim Cook is the CEO of Apple Inc., based in Cupertino, California.", "output": {"entities": {"person": ["Tim Cook"], "company": ["Apple Inc."], "location": ["Cupertino", "California"]}, "entity_descriptions": {"person": "Full name of a person", "company": "Business organization name", "location": "Geographic location or place"}}}
{"input": "OpenAI released GPT-4 in March 2023.", "output": {"entities": {"company": ["OpenAI"], "model": ["GPT-4"], "date": ["March 2023"]}}}
```

**Classification Example:**
```jsonl
{"input": "This movie is absolutely fantastic! I loved every minute of it.", "output": {"classifications": [{"task": "sentiment", "labels": ["positive", "negative", "neutral"], "true_label": ["positive"]}]}}
{"input": "The service was terrible and the food was cold.", "output": {"classifications": [{"task": "sentiment", "labels": ["positive", "negative", "neutral"], "true_label": ["negative"]}]}}
```

**Structured Extraction Example:**
```jsonl
{"input": "iPhone 15 Pro Max with 256GB storage, priced at $1199.", "output": {"json_structures": [{"product": {"name": "iPhone 15 Pro Max", "storage": "256GB", "price": "$1199"}}]}}
```

**Relation Extraction Example:**
```jsonl
{"input": "John works for Apple Inc. and lives in San Francisco.", "output": {"relations": [{"works_for": {"head": "John", "tail": "Apple Inc."}}, {"lives_in": {"head": "John", "tail": "San Francisco"}}]}}
```

### Training from JSONL File

```python
from gliner2 import GLiNER2
from gliner2.training.trainer import GLiNER2Trainer, TrainingConfig

# Load model and train from JSONL file
model = GLiNER2.from_pretrained("fastino/gliner2-base-v1")
config = TrainingConfig(output_dir="./output", num_epochs=10)

trainer = GLiNER2Trainer(model, config)
trainer.train(train_data="train.jsonl")  # Path to your JSONL file
```

### LoRA Training (Parameter-Efficient Fine-Tuning)

Train lightweight adapters for domain-specific tasks:

```python
from gliner2 import GLiNER2
from gliner2.training.data import InputExample
from gliner2.training.trainer import GLiNER2Trainer, TrainingConfig

# Prepare domain-specific data
legal_examples = [
    InputExample(
        text="Apple Inc. filed a lawsuit against Samsung Electronics.",
        entities={"company": ["Apple Inc.", "Samsung Electronics"]}
    ),
    # Add more examples...
]

# Configure LoRA training
model = GLiNER2.from_pretrained("fastino/gliner2-base-v1")
config = TrainingConfig(
    output_dir="./legal_adapter",
    num_epochs=10,
    batch_size=8,
    encoder_lr=1e-5,
    task_lr=5e-4,
    
    # LoRA settings
    use_lora=True,                    # Enable LoRA
    lora_r=8,                         # Rank (4, 8, 16, 32)
    lora_alpha=16.0,                  # Scaling factor (usually 2*r)
    lora_dropout=0.0,                 # Dropout for LoRA layers
    save_adapter_only=True            # Save only adapter (~5MB vs ~450MB)
)

# Train adapter
trainer = GLiNER2Trainer(model, config)
trainer.train(train_data=legal_examples)

# Use the adapter
model.load_adapter("./legal_adapter/final")
results = model.extract_entities(legal_text, ["company", "law"])
```

**Benefits of LoRA:**
- **Smaller size**: Adapters are ~2-10 MB vs ~450 MB for full models
- **Faster training**: 2-3x faster than full fine-tuning
- **Easy switching**: Swap adapters in milliseconds for different domains

### Complete Training Example

```python
from gliner2 import GLiNER2
from gliner2.training.data import InputExample, TrainingDataset
from gliner2.training.trainer import GLiNER2Trainer, TrainingConfig

# Prepare training data
train_examples = [
    InputExample(
        text="Tim Cook is the CEO of Apple Inc., based in Cupertino, California.",
        entities={
            "person": ["Tim Cook"],
            "company": ["Apple Inc."],
            "location": ["Cupertino", "California"]
        },
        entity_descriptions={
            "person": "Full name of a person",
            "company": "Business organization name",
            "location": "Geographic location or place"
        }
    ),
    # Add more examples...
]

# Create and validate dataset
train_dataset = TrainingDataset(train_examples)
train_dataset.validate(strict=True, raise_on_error=True)
train_dataset.print_stats()

# Split into train/validation
train_data, val_data, _ = train_dataset.split(
    train_ratio=0.8,
    val_ratio=0.2,
    test_ratio=0.0,
    shuffle=True,
    seed=42
)

# Configure training
model = GLiNER2.from_pretrained("fastino/gliner2-base-v1")
config = TrainingConfig(
    output_dir="./ner_model",
    experiment_name="ner_training",
    num_epochs=15,
    batch_size=16,
    encoder_lr=1e-5,
    task_lr=5e-4,
    warmup_ratio=0.1,
    scheduler_type="cosine",
    fp16=True,
    eval_strategy="epoch",
    save_best=True,
    early_stopping=True,
    early_stopping_patience=3
)

# Train
trainer = GLiNER2Trainer(model, config)
trainer.train(train_data=train_data, val_data=val_data)

# Load best model
model = GLiNER2.from_pretrained("./ner_model/best")
```

For more details, see the [Training Tutorial](tutorial/9-training.md) and [Data Format Guide](tutorial/8-train_data.md).

## 📄 License

This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.

## 📚 Citation

If you use GLiNER2 in your research, please cite:

```bibtex
@inproceedings{zaratiana-etal-2025-gliner2,
    title = "{GL}i{NER}2: Schema-Driven Multi-Task Learning for Structured Information Extraction",
    author = "Zaratiana, Urchade  and
      Pasternak, Gil  and
      Boyd, Oliver  and
      Hurn-Maloney, George  and
      Lewis, Ash",
    editor = {Habernal, Ivan  and
      Schulam, Peter  and
      Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-demos.10/",
    pages = "130--140",
    ISBN = "979-8-89176-334-0",
    abstract = "Information extraction (IE) is fundamental to numerous NLP applications, yet existing solutions often require specialized models for different tasks or rely on computationally expensive large language models. We present GLiNER2, a unified framework that enhances the original GLiNER architecture to support named entity recognition, text classification, and hierarchical structured data extraction within a single efficient model. Built on a fine-tuned encoder architecture, GLiNER2 maintains CPU efficiency and compact size while introducing multi-task composition through an intuitive schema-based interface. Our experiments demonstrate competitive performance across diverse IE tasks with substantial improvements in deployment accessibility compared to LLM-based alternatives. We release GLiNER2 as an open-source library available through pip, complete with pre-trained models and comprehensive documentation."
}
```

## 🙏 Acknowledgments

Built upon the original [GLiNER](https://github.com/urchade/GLiNER) architecture by the team at [Fastino AI](https://fastino.ai).

---

<div align="center">
    <strong>Ready to extract insights from your data?</strong><br>
    <code>pip install gliner2</code>
</div>
