Metadata-Version: 2.4
Name: flatfish
Version: 0.1.3
Summary: Historical document analysis CLI - Extract, analyze, and present handwritten text from document images
Project-URL: Homepage, https://github.com/PULdischo/flatfish
Project-URL: Documentation, https://github.com/PULdischo/flatfish#readme
Project-URL: Repository, https://github.com/PULdischo/flatfish
Project-URL: Issues, https://github.com/PULdischo/flatfish/issues
Author-email: Andrew Janco <apjanco@gmail.com>
License-Expression: MIT
Keywords: document-analysis,handwriting,historical-documents,htr,named-entity-recognition,ocr
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Image Recognition
Classifier: Topic :: Text Processing
Requires-Python: >=3.10
Requires-Dist: datasets>=2.0
Requires-Dist: deep-translator>=1.11.0
Requires-Dist: httpx>=0.25
Requires-Dist: jinja2>=3.0
Requires-Dist: markdown>=3.0
Requires-Dist: netlify-python>=0.4.0
Requires-Dist: openai>=1.0
Requires-Dist: pillow>=10.0
Requires-Dist: pydantic-settings>=2.0
Requires-Dist: pydantic>=2.0
Requires-Dist: python-dotenv>=1.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.0
Requires-Dist: tqdm>=4.0
Requires-Dist: typer[all]>=0.12
Provides-Extra: dev
Requires-Dist: black>=23.0; extra == 'dev'
Requires-Dist: mypy>=1.0; extra == 'dev'
Requires-Dist: pre-commit>=3.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff>=0.1; extra == 'dev'
Description-Content-Type: text/markdown

<picture>
  <source media="(prefers-color-scheme: dark)" srcset="logo-dark.png">
  <source media="(prefers-color-scheme: light)" srcset="logo-light.png">
  <img width="100" src="logo-light.png" alt="Flatfish Logo">
</picture>

# Flatfish

Historical document analysis CLI - Extract, analyze, and present handwritten text from document images.

## Features

- 📜 **Handwritten Text Recognition (HTR)** - Extract text from historical document images
- 🏷️ **Named Entity Recognition** - Identify people, places, dates, and more with contextual descriptions
- 📊 **AI-Powered Summaries** - Generate timelines, track changes, and suggest research questions
- 🌐 **Static Website Builder** - Create searchable, browsable document collections

## Installation

```bash
pip install flatfish
```

## Quick Start

```bash
# Initialize a new project
flatfish init

# Edit configuration
nano flatfish.yaml
nano .env

# Validate setup
flatfish validate

# Process documents
flatfish process

# Preview the site locally
flatfish serve
```

## Configuration

### flatfish.yaml

```yaml
dataset:
  source: "username/dataset-name"
  splits:
    - "train"
  image_column: "image"

processing:
  extract_entities: true
  entity_context: true

summary:
  enabled: true
  model: "qwen-vl-max"

website:
  title: "Document Collection"
  password: "changeme"
```

### .env

```bash
HUGGINGFACE_TOKEN=hf_xxxxxxxxxxxxx
DASHSCOPE_API_KEY=sk-xxxxxxxxxxxxx
```
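
If you want to confirm these variables are visible before running the pipeline, a small standalone check can help. The sketch below is not part of flatfish and does not replace `flatfish validate`. The function name `missing_env_vars` is hypothetical, and treating both variables as mandatory is an assumption based on the example above. It only inspects the process environment, so export the values or load `.env` first.

```python
import os

# Variable names come from the .env example above. Whether flatfish
# requires both of them in every setup is an assumption.
REQUIRED_VARS = ("HUGGINGFACE_TOKEN", "DASHSCOPE_API_KEY")


def missing_env_vars(env=os.environ, required=REQUIRED_VARS):
    """Return the names in `required` that are unset or blank in `env`."""
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All expected environment variables are set.")
```

Passing a plain dictionary instead of `os.environ` makes the check easy to try without touching your real environment.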

## Commands

| Command | Description |
|---------|-------------|
| `flatfish init` | Initialize a new project |
| `flatfish process` | Run the full pipeline |
| `flatfish extract` | Extract text from images only |
| `flatfish entities` | Extract entities only |
| `flatfish summarize` | Generate AI summary only |
| `flatfish build` | Build static site only |
| `flatfish serve` | Preview site locally |
| `flatfish deploy` | Deploy to Netlify |
| `flatfish status` | Show processing status |
| `flatfish validate` | Validate configuration |

## Deployment

Deploy your site to Netlify:

```bash
# Install netlify-python (already declared as a flatfish dependency; run this only if it is missing)
pip install netlify-python

# Set your Netlify token (create one at https://app.netlify.com/user/applications) and site ID
export NETLIFY_TOKEN=your-token
export NETLIFY_SITE_ID=your-site-id

# Deploy a draft preview
flatfish deploy

# Deploy to production
flatfish deploy --prod

# Specify a site ID directly
flatfish deploy --prod --site your-site-id
```

## Output

```
project/
├── transcriptions/     # Extracted text files
├── entities/           # Entity JSON files
├── summaries/          # AI-generated summaries
└── _site/              # Built static website
```
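
To see how much output a run produced, you can count the files in each of these directories. The following is a minimal sketch, not a flatfish feature: the function name `count_outputs` is hypothetical, and it assumes the four directories sit directly under the project root as shown above. It makes no assumption about file names or extensions.

```python
from pathlib import Path

# Directory names taken from the output layout above.
OUTPUT_DIRS = ("transcriptions", "entities", "summaries", "_site")


def count_outputs(project_root="."):
    """Count the files under each output directory (0 if it does not exist)."""
    root = Path(project_root)
    counts = {}
    for name in OUTPUT_DIRS:
        directory = root / name
        if directory.is_dir():
            # rglob walks subdirectories too, which matters for _site/.
            counts[name] = sum(1 for p in directory.rglob("*") if p.is_file())
        else:
            counts[name] = 0
    return counts


if __name__ == "__main__":
    for name, total in count_outputs().items():
        print(f"{name}: {total} file(s)")
```

For the built-in view of pipeline progress, use `flatfish status`.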

## License

MIT

## Disclosure of Delegation to Generative AI

The authors declare the use of generative AI in the research and writing process. According to the GAIDeT taxonomy (2025), the following tasks were delegated to GAI tools under full human supervision:

- Code generation
- Code optimization

The GAI tool used was: Claude Sonnet.
Responsibility for the final manuscript lies entirely with the authors.
GAI tools are not listed as authors and do not bear responsibility for the final outcomes.
Declaration submitted by: Andrew Janco
