Metadata-Version: 2.4
Name: Grimmerie
Version: 0.1.5
Summary: Functions for Prototyping, QOL and Sanity checking
Author: Joe Petrecca
License-Expression: MIT
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: torch>=2.0
Requires-Dist: transformers>=4.38
Requires-Dist: adapters>=1.0
Requires-Dist: numpy>=1.23
Requires-Dist: sentencepiece
Requires-Dist: scikit-learn>=1.2
Requires-Dist: pandas>=1.5
Provides-Extra: nlp
Requires-Dist: spacy>=3.0; extra == "nlp"

# Grimmerie

A spellbook for Python.

Grimmerie is a collection of high-level utilities (“spells”) designed for rapid prototyping, sanity checking, and reducing friction in experimentation.

Each spell performs a **non-trivial amount of work under the hood**.  
They are intentionally designed to trade fine-grained control for speed, clarity, and momentum.

Use them when you want to move fast.  
Understand them before you rely on them.

---

## Installation

```bash
pip install grimmerie
```

---

## The Idea

Instead of wiring together pipelines every time, Grimmerie gives you:

- One function call  
- Sensible defaults  
- Heavy lifting handled internally  

The philosophy in one line:

```python
embeddings = specterize(papers)
```

Behind this single call:
- Model loading  
- Tokenization  
- Batching  
- Device handling  
- Adapter loading  
- Output formatting  

All handled for you.
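
For a sense of what that hides, here is a rough sketch of the same steps written by hand. It follows the usage pattern published on the SPECTER2 model card (`allenai/specter2_base` plus the `allenai/specter2` adapter) and is an assumption about the general shape of the work, not Grimmerie's actual implementation. The names `format_paper`, `batched`, and `embed_papers` are hypothetical.

```python
from typing import Iterator, Sequence


def format_paper(paper: dict, sep_token: str) -> str:
    """Join title and abstract with the tokenizer's separator (either may be missing)."""
    title = paper.get("title") or ""
    abstract = paper.get("abstract") or ""
    return title + sep_token + abstract


def batched(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_papers(papers: list, batch_size: int = 16, max_length: int = 512):
    """Hand-written version of the steps listed above (illustrative only)."""
    # Heavy imports live inside the function so the helpers work without them.
    import torch
    from adapters import AutoAdapterModel
    from transformers import AutoTokenizer

    # Model loading and adapter loading.
    tokenizer = AutoTokenizer.from_pretrained("allenai/specter2_base")
    model = AutoAdapterModel.from_pretrained("allenai/specter2_base")
    model.load_adapter("allenai/specter2", source="hf", load_as="specter2", set_active=True)

    # Device handling.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device).eval()

    chunks = []
    with torch.no_grad():
        for batch in batched(papers, batch_size):  # batching
            texts = [format_paper(p, tokenizer.sep_token) for p in batch]
            inputs = tokenizer(  # tokenization
                texts, padding=True, truncation=True, max_length=max_length,
                return_tensors="pt", return_token_type_ids=False,
            ).to(device)
            output = model(**inputs)
            # The first-token ([CLS]) vector serves as the document embedding.
            chunks.append(output.last_hidden_state[:, 0, :].cpu())

    # Output formatting: one NumPy array of shape (n_papers, hidden_size).
    return torch.cat(chunks).numpy()
```

Even in this compact form there are several places to get a detail wrong (separator token, truncation length, device placement), which is the friction a spell is meant to remove.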

---

## Spells

### `specterize`

Generate SPECTER2 embeddings from text or paper-like inputs.

```python
from grimmerie import specterize

papers = [
    {'abstract': 'We introduce a new language representation model called BERT'},
    {'abstract': 'The dominant sequence transduction models are based on neural networks'},
]

embeddings = specterize(papers, return_type='numpy')
```

### `tfidfize`

Generate TF-IDF representations from text, with optional preprocessing.

```python
from grimmerie import tfidfize

docs = [
    {'abstract': 'We introduce a new language representation model called BERT'},
    {'abstract': 'The dominant sequence transduction models are based on neural networks'},
]

X = tfidfize(docs, return_type='array')
```

Full signature, with defaults:

```python
tfidfize(
    input_data,
    lemmatize: bool = False,
    spacy_model: str = 'en_core_web_sm',
    batch_size: int = 2000,
    n_process: int = 1,
    progress_interval: int | None = None,
    min_df: int | float = 1,
    max_df: int | float = 1.0,
    stop_words: str | list[str] | None = 'english',
    ngram_range: tuple[int, int] = (1, 1),
    lowercase: bool = True,
    max_features: int | None = None,
    norm: Literal['l1', 'l2'] | None = 'l2',
    use_idf: bool = True,
    smooth_idf: bool = True,
    sublinear_tf: bool = False,
    return_type: Literal['sparse', 'array', 'list', 'frame'] = 'sparse',
    return_vectorizer: bool = False,
    vectorizer: TfidfVectorizer | None = None,
)
```

**Parameters:**

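- `input_data`: Documents to vectorize, e.g. a list of dicts with an `'abstract'` key as in the example above  
- `lemmatize`: Apply spaCy lemmatization (default `False`); spaCy is installed by the `nlp` extra (`pip install "grimmerie[nlp]"`)  
- `spacy_model`: spaCy model used for lemmatization (default `'en_core_web_sm'`)  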
- `batch_size`: Processing batch size (default `2000`)  
- `n_process`: Number of processes (default `1`)  
- `progress_interval`: Progress reporting interval (default `None`)  
- `min_df`: Minimum document frequency, as a count (`int`) or proportion (`float`) (default `1`)  
- `max_df`: Maximum document frequency, as a count (`int`) or proportion (`float`) (default `1.0`)  
- `stop_words`: Stop words to filter (default `'english'`)  
- `ngram_range`: N-gram range (default `(1, 1)`)  
- `lowercase`: Convert to lowercase (default `True`)  
- `max_features`: Maximum vocabulary size (default `None`)  
- `norm`: Normalization method: `'l1'`, `'l2'`, or `None` (default `'l2'`)  
- `use_idf`: Enable IDF weighting (default `True`)  
- `smooth_idf`: Smooth IDF values (default `True`)  
- `sublinear_tf`: Apply sublinear TF scaling (default `False`)  
- `return_type`: Output format: `'sparse'`, `'array'`, `'list'`, or `'frame'` (default `'sparse'`)  
- `return_vectorizer`: Return fitted vectorizer (default `False`)  
- `vectorizer`: Pre-fitted `TfidfVectorizer` instance (default `None`)
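
The weighting flags are easier to reason about with the arithmetic written out. The sketch below is a pure-Python illustration of scikit-learn's documented TF-IDF formulas for `smooth_idf`, `sublinear_tf`, and `norm`. It assumes `tfidfize` passes these options through to `TfidfVectorizer` unchanged, which the source does not confirm. `tfidf_row` is a hypothetical helper, not part of Grimmerie.

```python
import math


def tfidf_row(counts, doc_freq, n_docs, smooth_idf=True, sublinear_tf=False, norm="l2"):
    """Weight one document's term counts using scikit-learn's TF-IDF definitions."""
    weights = {}
    for term, tf in counts.items():
        if sublinear_tf:
            tf = 1.0 + math.log(tf)  # dampen terms that repeat many times
        df = doc_freq[term]
        if smooth_idf:
            # Acts as if one extra document contained every term once.
            idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        else:
            idf = math.log(n_docs / df) + 1.0
        weights[term] = tf * idf

    if norm == "l2":
        scale = math.sqrt(sum(w * w for w in weights.values()))
    elif norm == "l1":
        scale = sum(abs(w) for w in weights.values())
    else:
        scale = 1.0  # norm=None leaves the raw weights untouched
    return {term: w / scale for term, w in weights.items()} if scale else weights


# Toy corpus of two documents: "model" appears in both, "bert" only in this one.
row = tfidf_row({"model": 1, "bert": 2}, doc_freq={"model": 2, "bert": 1}, n_docs=2)
print(row)  # the rarer, repeated term "bert" receives the larger weight
```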

---

## API

Signature and parameters for `specterize`:

```python
specterize(input_data, return_type='list', max_length=512)
```

- `input_data`: `str`, `dict`, `list`, or other iterable  
- `return_type`: `'list'`, `'numpy'`, or `'tensor'` (default `'list'`)  
- `max_length`: tokenizer truncation length (default `512`)  
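
Accepting that many input shapes usually means coercing them to a list first. A minimal sketch of such a step, assuming a lone string or dict counts as a single document. `normalize_inputs` is a hypothetical helper, and Grimmerie's own coercion rules may differ.

```python
from collections.abc import Iterable


def normalize_inputs(input_data):
    """Coerce a str, dict, list, or other iterable into a list of items."""
    # A lone string or dict is one document, not a collection of characters or keys.
    if isinstance(input_data, (str, dict)):
        return [input_data]
    if isinstance(input_data, Iterable):
        return list(input_data)  # consumes generators exactly once
    raise TypeError(f"Unsupported input type: {type(input_data).__name__}")


print(normalize_inputs("One abstract"))         # ['One abstract']
print(normalize_inputs({"abstract": "A"}))      # [{'abstract': 'A'}]
print(normalize_inputs(s for s in ["a", "b"]))  # ['a', 'b']
```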

---

## Design Principles

### 1. Abstraction over configuration  
You should not need to think about setup for common workflows.

### 2. Strong defaults  
Spells are opinionated. They are built to “just work” for most cases.

### 3. Hidden complexity  
A spell may do significantly more than it appears.

### 4. Use with awareness  
Because complexity is hidden, you should understand what a spell does before using it in critical systems.

---

## When to Use Grimmerie

- Rapid experimentation  
- Prototyping ML/NLP pipelines  
- Sanity checking ideas  
- Building quick demos  

---

## When *Not* to Use It

- When you need full control over every step  
- When reproducibility requires explicit pipelines  
- When debugging low-level behavior  

---

## Notes

- First call may be slower due to model downloads  
- Models are cached locally after first use  
- Subsequent calls reuse loaded resources within the same process  
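
In-process reuse of this kind is commonly implemented by memoizing the loader. A minimal sketch with `functools.lru_cache`, where `load_model` is a hypothetical stand-in for an expensive load and not a Grimmerie function:

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def load_model(name: str) -> dict:
    """Pretend to load a model; the result is cached per `name` for the life of the process."""
    return {"name": name}


first = load_model("specter2")
second = load_model("specter2")  # served from the cache, no second load
info = load_model.cache_info()
print(first is second, info.misses, info.hits)  # True 1 1
```

A cache like this lives in memory only, so a new Python process pays the load cost again even though downloaded files remain on disk.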

---

## Direction

Grimmerie will expand into a broader system of spells for:

- Vectorization  
- Dimensionality reduction  
- Visualization  
- Data inspection  
- ML prototyping utilities  

Each designed to compress multi-step workflows into a single, intentional call.
