Metadata-Version: 2.4
Name: swedish-nlp-utils
Version: 1.0.0
Summary: Swedish NLP utilities — stopwords, legal NER, text normalisation, chunking, and Pinecone vector pipeline helpers
Author-email: Trollfabriken AITrix AB <dev@trollfabriken.se>
License: MIT
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pydantic>=2.5
Requires-Dist: pyyaml>=6.0
Requires-Dist: python-dotenv>=1.0
Requires-Dist: regex>=2024.0
Provides-Extra: vectors
Requires-Dist: pinecone-client>=3.0; extra == "vectors"
Requires-Dist: openai>=1.40; extra == "vectors"
Requires-Dist: httpx>=0.27; extra == "vectors"
Requires-Dist: tiktoken>=0.7; extra == "vectors"
Provides-Extra: spacy
Requires-Dist: spacy>=3.7; extra == "spacy"
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-cov>=5.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.23; extra == "dev"
Requires-Dist: black>=24.0; extra == "dev"
Requires-Dist: ruff>=0.4; extra == "dev"
Requires-Dist: mypy>=1.9; extra == "dev"
Dynamic: license-file

# swedish-nlp-utils

Utilities for Swedish text processing: stopword filtering, normalisation,
rule-based entity extraction, date and number parsing, chunking, and a
Pinecone vector pipeline. Built for
[Trollfabriken AITrix AB](https://trollfabriken.se) projects, including
**SocKartan**, **DocFlow**, **ScrapeAssistant**, and **RevisionsUpproret**.

---

## Modules

| Module | Class | Purpose |
|---|---|---|
| `stopwords` | `SwedishStopwords` | Domain-specific Swedish stopword sets |
| `normalizer` | `SwedishNormalizer` | Municipality, authority, law-ref, OCR normalisation |
| `ner` | `SwedishNER` | Rule-based NER for Swedish legal/municipal texts |
| `formats` | `SwedishFormats` | Date, number, SEK, personnummer parsing & formatting |
| `chunker` | `SwedishChunker` | Swedish-aware text chunking for vector pipelines |
| `vectors` | `SwedishVectorPipeline` | Pinecone + OpenAI embedding pipeline |

---

## Installation

```bash
# Core (no external API dependencies)
pip install swedish-nlp-utils

# With vector pipeline support (Pinecone + OpenAI)
pip install "swedish-nlp-utils[vectors]"

# With spaCy NER upgrade
pip install "swedish-nlp-utils[spacy]"
python -m spacy download sv_core_news_sm
```

---

## Quick start

### Stopwords

```python
from swedish_nlp import SwedishStopwords

# General Swedish
sw = SwedishStopwords()
sw.filter_tokens(["socialnämnden", "beslutade", "att", "bevilja", "insats"])
# → ["socialnämnden", "beslutade", "bevilja", "insats"]

# Domain-specific
sw = SwedishStopwords.for_social_services()  # general + legal + social
sw.filter_text("handläggaren beslutade att genomföra utredningen")

# All domains combined
sw = SwedishStopwords.all_domains()

# Custom additions (fluent)
sw = SwedishStopwords.for_legal().add("paragrafen", "stycket")
```
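
The fluent `add(...)` call above works because the method returns the object it was called on. A minimal sketch of that pattern follows, assuming a plain set-based filter; `StopwordSet` and its three-word list are hypothetical stand-ins, not the library's `SwedishStopwords` or its real word lists.

```python
class StopwordSet:
    """Hypothetical stand-in for illustration; not the library's class."""

    def __init__(self, words: set[str]) -> None:
        self._words = {w.lower() for w in words}

    def add(self, *words: str) -> "StopwordSet":
        # Returning self is what makes chained (fluent) calls possible.
        self._words.update(w.lower() for w in words)
        return self

    def filter_tokens(self, tokens: list[str]) -> list[str]:
        # Case-insensitive membership test; the original token casing is kept.
        return [t for t in tokens if t.lower() not in self._words]


sw = StopwordSet({"att", "och", "i"}).add("paragrafen", "stycket")
print(sw.filter_tokens(["socialnämnden", "beslutade", "att", "bevilja", "insats"]))
# ['socialnämnden', 'beslutade', 'bevilja', 'insats']
```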

### Normaliser

```python
from swedish_nlp import SwedishNormalizer

n = SwedishNormalizer()

# Municipality names
n.normalize_municipality("Göteborgs stad")       # → "Göteborg"
n.normalize_municipality("Stockholms kommunen")  # → "Stockholm"
n.normalize_municipality("GBG")                  # → "Göteborg"

# Authority names → canonical short form
n.normalize_authority("Justitieombudsmannen")              # → "JO"
n.normalize_authority("Inspektionen för vård och omsorg")  # → "IVO"
n.normalize_authority("Högsta förvaltningsdomstolen")      # → "HFD"

# Law references
n.normalize_law_reference("Socialtjänstlagen")  # → "SoL"
n.normalize_law_reference("sol")                # → "SoL"
n.normalize_law_reference("förvaltningslagen")  # → "FL"

# Replace law names in running text
n.normalize_law_references_in_text(
    "Enligt socialtjänstlagen och föräldrabalken..."
)
# → "Enligt SoL och FB..."

# OCR artefact correction
n.normalize_ocr("§  12 nämnd\xadbeslut")  # → "§ 12 nämndbeslut"

# Utilities
n.ascii_fold("åäö ÅÄÖ")            # → "aao AAO"
n.to_slug("Göteborgs Stad 2024")  # → "goteborgs-stad-2024"
```
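
For readers curious how ASCII folding and slug generation can work, here is a minimal standard-library sketch. It assumes Unicode decomposition (NFKD) followed by removal of combining marks; the two standalone functions are illustrative and may differ from how `SwedishNormalizer` implements its methods.

```python
import re
import unicodedata


def ascii_fold(text: str) -> str:
    """Strip diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def to_slug(text: str) -> str:
    """Fold to ASCII, lowercase, and join alphanumeric runs with hyphens."""
    folded = ascii_fold(text).lower()
    return re.sub(r"[^a-z0-9]+", "-", folded).strip("-")


print(ascii_fold("åäö ÅÄÖ"))           # aao AAO
print(to_slug("Göteborgs Stad 2024"))  # goteborgs-stad-2024
```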

### Named Entity Recognition

```python
from swedish_nlp import SwedishNER

ner = SwedishNER()
entities = ner.extract("""
    Socialnämnden fattade beslut 2024-03-15 enligt SoL 4 kap. 1 §.
    Handläggare: Anna Lindqvist. Diarienummer: SOC-2024-0042.
    JO har i HFD 2015:5 klargjort rättsläget.
""")

entities.authorities   # ["Socialnämnden", "JO"]
entities.courts        # ["HFD"]
entities.law_refs      # ["SoL", "4 kap. 1 §", "HFD 2015:5"]
entities.persons       # ["Anna Lindqvist"]
entities.diarienummer  # ["SOC-2024-0042"]
entities.dates         # ["2024-03-15"]
entities.roles         # ["handläggare"]

# Optional spaCy upgrade (better person/org detection)
ner = SwedishNER(use_spacy=True)
```
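
Rule-based extraction of this kind usually comes down to a set of regular expressions. The sketch below shows the idea for three of the entity types; the patterns and the `extract_basic` helper are assumptions made for illustration and are not taken from the library's rules.

```python
import re

# Hypothetical patterns for illustration; the library's own rules may differ.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")                     # ISO dates
DIARIENUMMER_RE = re.compile(r"\b[A-ZÅÄÖ]{2,5}-\d{4}-\d{3,5}\b")   # e.g. SOC-2024-0042
CHAPTER_SECTION_RE = re.compile(r"\b\d+\s*kap\.\s*\d+\s*§")        # e.g. 4 kap. 1 §


def extract_basic(text: str) -> dict[str, list[str]]:
    """Return every match of each pattern, in order of appearance."""
    return {
        "dates": DATE_RE.findall(text),
        "diarienummer": DIARIENUMMER_RE.findall(text),
        "law_refs": CHAPTER_SECTION_RE.findall(text),
    }


sample = (
    "Socialnämnden fattade beslut 2024-03-15 enligt SoL 4 kap. 1 §. "
    "Diarienummer: SOC-2024-0042."
)
print(extract_basic(sample))
```

Patterns like these are cheap and predictable, which is why person and organisation names, where surface patterns are weaker, are the part an optional statistical model tends to improve.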

### Formats

```python
from swedish_nlp import SwedishFormats
from datetime import date

# Dates — parse
SwedishFormats.parse_date("15 mars 2024")   # → date(2024, 3, 15)
SwedishFormats.parse_date("2024-03-15")     # → date(2024, 3, 15)

# Dates — format
SwedishFormats.format_date(date(2024, 3, 15))       # → "2024-03-15"
SwedishFormats.format_date_long(date(2024, 3, 15))  # → "15 mars 2024"

# Numbers (Swedish format: space-thousands, comma-decimal)
SwedishFormats.parse_number("1 234 567,89")   # → 1234567.89
SwedishFormats.format_number(1234567.89)      # → "1 234 567,89"

# SEK
SwedishFormats.parse_sek("1 234 567 kr")       # → 1234567.0
SwedishFormats.format_sek(1234567.0)           # → "1 234 567 kr"
SwedishFormats.format_sek(1234.5, decimals=2)  # → "1 234,50 kr"
SwedishFormats.format_sek(1000.0, unit="tkr")  # → "1 000 tkr"

# Personnummer
SwedishFormats.validate_personnummer("19850312-4564")         # → bool
SwedishFormats.pseudonymize_personnummer("19850312-4564")     # → "1985-XX-XXXX"

# Extract from text
SwedishFormats.extract_sek_amounts("Budget 5 000 SEK")    # → [5000.0]
SwedishFormats.extract_dates("Beslut 2024-03-15")         # → ["2024-03-15"]
SwedishFormats.parse_postal_code("Göteborg 413 01")       # → "413 01"
```
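
The last digit of a personnummer is a Luhn check digit computed over the nine digits `YYMMDDNNN`. The sketch below shows that checksum step on its own. It is a standalone illustration, not the library's `validate_personnummer`, which may apply further checks (for example on the date part); the function names are hypothetical.

```python
import re


def luhn_check_digit(nine_digits: str) -> int:
    """Compute the Luhn check digit for the nine digits YYMMDDNNN."""
    total = 0
    for position, char in enumerate(nine_digits):
        # Weights alternate 2, 1, 2, 1, ... starting from the first digit.
        product = int(char) * (2 if position % 2 == 0 else 1)
        total += product // 10 + product % 10  # digit sum of the product
    return (10 - total % 10) % 10


def has_valid_checksum(personnummer: str) -> bool:
    """Check only the checksum; the date part is not validated here."""
    digits = re.sub(r"\D", "", personnummer)
    if len(digits) == 12:  # drop the century: YYYYMMDDNNNC -> YYMMDDNNNC
        digits = digits[2:]
    if len(digits) != 10:
        return False
    return luhn_check_digit(digits[:9]) == int(digits[9])


print(has_valid_checksum("811228-9874"))    # True
print(has_valid_checksum("19811228-9874"))  # True
print(has_valid_checksum("811228-9875"))    # False
```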

### Chunker

```python
from swedish_nlp import SwedishChunker
from swedish_nlp.chunker.chunker import ChunkConfig

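# Any Swedish document text; the file name is only a placeholder
with open("arsredovisning.txt", encoding="utf-8") as f:
    text = f.read()

# Default config (512 tokens, 50 overlap)
chunker = SwedishChunker()
chunks = chunker.chunk(text)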

# Custom config
cfg = ChunkConfig(chunk_size=256, chunk_overlap=30, min_chunk_size=40)
chunker = SwedishChunker(cfg)

# With Pinecone metadata for every chunk
chunks = chunker.chunk_document(
    text,
    doc_id       = "arsredovisning-goteborg-2023",
    doc_type     = "årsredovisning",
    municipality = "Göteborg",
    year         = 2023,
    extra_metadata = {"source_url": "https://goteborg.se/doc.pdf"},
)

for c in chunks:
    print(c.index, c.token_estimate, c.text[:80])
    print(c.metadata)  # {"doc_id": ..., "doc_type": ..., "municipality": ...}

# Format one chunk for Pinecone upsert (here: the first chunk, with a dummy embedding)
vector = chunks[0].to_pinecone_dict("vec-001", embedding=[0.1] * 1536)
```
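
To make the interaction between `chunk_size` and `chunk_overlap` concrete, here is a minimal sliding-window sketch. It assumes whitespace-separated words as a stand-in for tokens and ignores section, paragraph, and sentence boundaries, so it is a simplification and not how `SwedishChunker` itself splits text; `chunk_tokens` is a hypothetical helper.

```python
def chunk_tokens(
    tokens: list[str], chunk_size: int, chunk_overlap: int
) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


words = (
    "Socialnämnden beslutade att bevilja insatsen "
    "enligt gällande riktlinjer för året"
).split()
for chunk in chunk_tokens(words, chunk_size=4, chunk_overlap=1):
    print(" ".join(chunk))
```

Each window starts `chunk_size - chunk_overlap` tokens after the previous one, so the last token of one chunk reappears as the first token of the next when the overlap is 1.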

### Vector pipeline (Pinecone + OpenAI)

```python
from swedish_nlp.vectors import SwedishVectorPipeline

# Requires: pip install "swedish-nlp-utils[vectors]"
# Requires: PINECONE_API_KEY and OPENAI_API_KEY in environment

pipeline = SwedishVectorPipeline(
    index_name = "sockartan-documents",
    namespace  = "protokoll",
)

# Index a document (chunk → embed → upsert in one call).
# full_protocol_text is any str holding the document text.
n_vectors = pipeline.chunk_and_upsert(
    text         = full_protocol_text,
    doc_id       = "protokoll-goteborg-2024-03",
    doc_type     = "protokoll",
    municipality = "Göteborg",
    year         = 2024,
)

# Semantic search
results = pipeline.search("socialtjänstens insatser för barn", top_k=5)
for r in results:
    print(r.score, r.municipality, r.text[:100])

# Filtered search
results = pipeline.search_with_filter(
    "budget underskott",
    doc_type     = "årsredovisning",
    municipality = "Göteborg",
    year         = 2023,
)

# Management
pipeline.delete_by_doc_id("protokoll-goteborg-2024-03")
stats = pipeline.get_index_stats()
print(stats["total_vectors"], stats["namespaces"])
```

---

## CLI

```bash
# Named entity extraction
swe-nlp analyze "Socialnämnden fattade beslut enligt SoL 4 kap. 1 §"

# From file, JSON output
swe-nlp analyze --file document.txt --format json

# Chunk a document
swe-nlp chunk --file arsredovisning.txt --size 256 --show-tokens

# Normalize OCR output and law references
swe-nlp normalize --file ocr_output.txt
swe-nlp normalize "Socialtjänstlagen 4 kap §  1" --municipality
```

---

## Environment variables

| Variable | Used by | Required |
|---|---|---|
| `OPENAI_API_KEY` | `SwedishVectorPipeline` | Yes, when using the vector pipeline |
| `PINECONE_API_KEY` | `SwedishVectorPipeline` | Yes, when using the vector pipeline |
| `PINECONE_INDEX_NAME` | `SwedishVectorPipeline` | No (default: `swedish-docs`) |

---

## Package structure

```
swedish_nlp/
├── __init__.py              ← Public API surface
├── cli.py                   ← swe-nlp CLI (analyze, chunk, normalize)
├── py.typed                 ← PEP 561 typed marker
├── stopwords/
│   └── stopwords.py         ← SwedishStopwords (5 domains)
├── normalizer/
│   └── normalizer.py        ← SwedishNormalizer (municipality/authority/law/OCR)
├── ner/
│   └── ner.py               ← SwedishNER (rule-based + optional spaCy)
├── formats/
│   └── formats.py           ← SwedishFormats (dates/numbers/SEK/personnummer)
├── chunker/
│   └── chunker.py           ← SwedishChunker (section/paragraph/sentence/token)
└── vectors/
    └── vectors.py           ← SwedishVectorPipeline (Pinecone + OpenAI)

VNV/tests/
├── test_stopwords.py        ← 19 tests
├── test_normalizer.py       ← 30 tests
├── test_ner.py              ← 25 tests
├── test_formats.py          ← 38 tests
└── test_chunker.py          ← 17 tests
```

---

## Extending

**Add terms to a stopword domain:**

```python
from swedish_nlp.stopwords.stopwords import _DOMAIN_MAP, Domain

# Extend an existing domain's stopword set with custom terms
_DOMAIN_MAP[Domain.MEDICAL].update({"ny_term", "annan_term"})
```

**Add a new authority alias:**

```python
# In normalizer/normalizer.py
_AUTHORITY_CANONICAL["ny myndighet"] = "NM"
```

**Add a new law abbreviation:**

```python
# In normalizer/normalizer.py
_LAW_ALIASES["ny lagtext"] = "NL"
```
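
To see why a single dictionary entry is enough, the sketch below shows one way an alias table can drive replacement in running text. The table contents, the `replace_law_names` helper, and the matching strategy are assumptions for illustration; the real `_LAW_ALIASES` structure and lookup logic may differ.

```python
import re

# Hypothetical alias table; the real `_LAW_ALIASES` contents are not shown here.
LAW_ALIASES = {
    "socialtjänstlagen": "SoL",
    "föräldrabalken": "FB",
    "förvaltningslagen": "FL",
}


def replace_law_names(text: str, aliases: dict[str, str]) -> str:
    """Replace each known law name (case-insensitive) with its abbreviation."""
    # Longest names first so a longer name wins over any shorter name it contains.
    names = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, names)) + r")\b", re.IGNORECASE
    )
    return pattern.sub(lambda m: aliases[m.group(1).lower()], text)


print(replace_law_names("Enligt socialtjänstlagen och föräldrabalken...", LAW_ALIASES))
# Enligt SoL och FB...

# A new entry takes effect on the next call, with no other changes needed.
LAW_ALIASES["ny lagtext"] = "NL"
print(replace_law_names("Se Ny lagtext 3 §", LAW_ALIASES))
# Se NL 3 §
```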

---

© 2025 Trollfabriken AITrix AB — MIT License
