Metadata-Version: 2.4
Name: architxt
Version: 0.7.0
Summary: ArchiTXT is a tool for structuring textual data into a valid database model. It is guided by a meta-grammar and uses an iterative process of tree rewriting.
License-Expression: GPL-3.0-only
License-File: LICENSE
Keywords: python,nlp,database,structuration,text mining,text analysis,data analysis
Author: Jacques Chabin
Author-email: jacques.chabin@univ-orleans.fr
Maintainer: Nicolas Hiot
Maintainer-email: nicolas.hiot@univ-orleans.fr
Requires-Python: >=3.10,<3.13
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Environment :: Web Environment
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Database
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Typing :: Typed
Provides-Extra: flair
Provides-Extra: llm
Provides-Extra: ui
Requires-Dist: aiostream (>=0.7,<0.8)
Requires-Dist: antlr4-python3-runtime (>=4.13,<5)
Requires-Dist: anyio (>=4.9.0,<5.0.0)
Requires-Dist: benepar
Requires-Dist: cachetools (>=5.5.0,<6.0.0)
Requires-Dist: click (<8.4.0)
Requires-Dist: flair (>=0.15,<0.16) ; extra == "flair"
Requires-Dist: googletrans (>=4.0.2,<5.0.0)
Requires-Dist: hdbscan (>=0.8.41,<0.9.0)
Requires-Dist: json-repair (>=0.52,<0.53) ; extra == "llm"
Requires-Dist: langchain (>=0.3,<0.4) ; extra == "llm"
Requires-Dist: langchain-core ; extra == "llm"
Requires-Dist: langchain-huggingface ; extra == "llm"
Requires-Dist: levenshtein
Requires-Dist: matplotlib (>=3.8.0,<4.0.0)
Requires-Dist: mlflow (>=3.5,<4.0)
Requires-Dist: more-itertools
Requires-Dist: neo4j (>=5.28.1,<6.0.0)
Requires-Dist: nltk (>=3.9,<4.0)
Requires-Dist: numpy (>=1.16,<2.0)
Requires-Dist: pandas (>=2.3.0,<3.0.0)
Requires-Dist: platformdirs (>=4.4.0,<5.0.0)
Requires-Dist: pybrat (>=0.1.7,<0.2.0)
Requires-Dist: relstorage (>=4.1.1,<5.0.0)
Requires-Dist: rich (>=13.9.4,<15.0.0)
Requires-Dist: ruamel.yaml (>=0.18.0,<0.19.0)
Requires-Dist: scikit-learn (>=1.7.0,<2.0.0)
Requires-Dist: scipy (>=1.15.0,<2.0.0)
Requires-Dist: scispacy (>=0.5.5,<0.6.0)
Requires-Dist: spacy (>=3.7.0,<4.0.0)
Requires-Dist: sqlalchemy (>=2.0.39,<3.0.0)
Requires-Dist: streamlit (>=1.53.0) ; extra == "ui"
Requires-Dist: streamlit-agraph ; extra == "ui"
Requires-Dist: streamlit-tags ; extra == "ui"
Requires-Dist: toml (>=0.10.0,<0.11.0)
Requires-Dist: tqdm (>=4.60)
Requires-Dist: typer (>=0.15.1,<0.22.0)
Requires-Dist: typing-extensions
Requires-Dist: unidecode
Requires-Dist: xlrd (>=2.0.1,<3.0.0)
Requires-Dist: xmltodict (>=0.14.2,<1.1.0)
Requires-Dist: zodb (>=6.0,<7.0)
Requires-Dist: zodburi (>=3.0,<3.1)
Project-URL: Documentation, https://neplex.github.io/ArchiTXT
Project-URL: Repository, https://github.com/neplex/ArchiTXT
Project-URL: funding, https://github.com/Neplex/ArchiTXT?tab=readme-ov-file#sponsors
Project-URL: issue, https://github.com/neplex/ArchiTXT/issues
Description-Content-Type: text/markdown

# ArchiTXT: Text-to-Database Structuring Tool

[![Project Status: WIP – Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.](https://www.repostatus.org/badges/latest/wip.svg)](https://www.repostatus.org/#wip)
![PyPI - Status](https://img.shields.io/pypi/status/architxt)
[![PyPI - Version](https://img.shields.io/pypi/v/architxt)](https://pypi.org/project/architxt/)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/architxt)
[![GitHub Actions Workflow Status](https://img.shields.io/github/actions/workflow/status/neplex/architxt/python-build.yml)](https://github.com/Neplex/ArchiTXT/actions)
[![SWH](https://archive.softwareheritage.org/badge/origin/https://github.com/Neplex/ArchiTXT/)](https://archive.softwareheritage.org/browse/origin/?origin_url=https://github.com/Neplex/ArchiTXT)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.15688157.svg)](https://doi.org/10.5281/zenodo.15688157)

**ArchiTXT** is a Python library and CLI tool that automatically converts unstructured text corpora into structured,
database-ready data. It infers database schemas directly from text and generates corresponding structured instances
using a meta-grammar and an iterative tree-rewriting process.

**ArchiTXT** is designed for researchers, data engineers, and NLP practitioners who need a transparent and auditable
process to transform raw textual data into storable, queryable and machine-learning-ready datasets.

## Why ArchiTXT?

Working with unstructured text becomes complex when you need:
- Structured storage
- Queryable entities and relations
- Reproducible data modeling

**ArchiTXT** bridges this gap by:
- Discovering latent structural patterns in annotated corpora
- Automatically generating database schemas
- Producing structured instances aligned with the inferred schema
- Ensuring transparency through rule-based rewriting

## Installation

To install **ArchiTXT**, make sure you have Python 3.10 to 3.12 and pip installed, then run:

```sh
pip install architxt
```

To install the development version directly from the Git repository, run:

```sh
pip install git+https://github.com/Neplex/ArchiTXT.git
```
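
To confirm which version is installed, you can query the package metadata from Python. The snippet below is a minimal sketch that relies only on the standard library's `importlib.metadata`. The helper name `describe_distribution` is hypothetical and is not part of **ArchiTXT**. The package metadata declares the optional extras `flair`, `llm`, and `ui`, which should appear in the `extras` field when the package is installed.

```python
from importlib.metadata import PackageNotFoundError, metadata
from typing import Optional


def describe_distribution(name: str) -> Optional[dict]:
    """Return basic metadata for an installed distribution, or None if it is missing."""
    try:
        meta = metadata(name)
    except PackageNotFoundError:
        return None
    return {
        "version": meta["Version"],
        # Requires-Python may be absent for some distributions, hence .get().
        "requires_python": meta.get("Requires-Python"),
        # get_all() returns None when the field is not declared.
        "extras": meta.get_all("Provides-Extra") or [],
    }


info = describe_distribution("architxt")
print(info if info is not None else "architxt is not installed in this environment")
```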

## Usage

**ArchiTXT** is built to work with BRAT-annotated corpora that include pre-labeled named entities.
It can parse the texts using either CoreNLP or spaCy, depending on your preference and setup.
See the [documentation](https://neplex.github.io/ArchiTXT/importers/text.html) for more information.
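
For readers unfamiliar with BRAT, each text file is paired with a standoff `.ann` file in which every named entity is a tab-separated line holding an identifier, a label with character offsets, and the covered text. The sketch below illustrates that layout only. It is not **ArchiTXT**'s importer, and the names `Entity` and `parse_brat_entities` as well as the sample sentence are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    id: str
    label: str
    spans: tuple  # one (start, end) pair per fragment; several if discontinuous
    text: str


def parse_brat_entities(ann: str) -> list:
    """Extract text-bound (entity) annotations from the content of a BRAT .ann file."""
    entities = []
    for line in ann.splitlines():
        if not line.startswith("T"):
            continue  # skip relations (R), events (E), attributes (A), and notes (#)
        ann_id, header, text = line.split("\t", 2)
        label, offsets = header.split(" ", 1)
        # Discontinuous entities list several "start end" pairs separated by ';'.
        spans = tuple(
            (int(start), int(end))
            for start, end in (part.split(" ") for part in offsets.split(";"))
        )
        entities.append(Entity(ann_id, label, spans, text))
    return entities


# Hypothetical sample document and its annotations.
text = "Alice works at Acme in Paris."
ann = (
    "T1\tPerson 0 5\tAlice\n"
    "T2\tOrganization 15 19\tAcme\n"
    "T3\tLocation 23 28\tParis\n"
    "R1\tWorksFor Arg1:T1 Arg2:T2\n"
)

entities = parse_brat_entities(ann)
for entity in entities:
    start, end = entity.spans[0]
    print(entity.id, entity.label, repr(text[start:end]))
```

Only the `T` lines are read here. Relation lines such as `R1` are skipped in this sketch.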

The CoreNLP parser requires access to a running CoreNLP server, which you can set up with the Docker Compose
configuration available in the source repository. To start it, run:

```sh
docker compose up -d corenlp
```

After parsing the annotated texts into **ArchiTXT**'s internal representation, you can infer a database schema from
the annotated entities and generate the corresponding structured instances.
See the [documentation](https://neplex.github.io/ArchiTXT/transformers/simplify.html) for more information.

The result can be exported as a relational or property graph database.
See the [documentation](https://neplex.github.io/ArchiTXT/exporters.html) for more information.

**ArchiTXT** is available as a Python library but also provides a command-line interface (CLI) for users who prefer
working in the terminal. You can run the CLI using:

```sh
architxt --help
```

## Sponsors

This work has received support under the JUNON Program, with financial support from Région Centre-Val de Loire (France).

<a href="https://www.junon-cvl.fr">
  <img src="https://www.junon-cvl.fr/sites/websites/www.junon-cvl.fr/files/logos/2025-07/logo-junon-new.svg" width="200" alt="JUNON Program logo">
</a>

<a href="https://www.univ-orleans.fr">
  <img src="https://ent.univ-orleans.fr/pages-locales-uo/images/logo_univ.svg" width="200" alt="UO logo">
</a>

<a href="https://www.univ-orleans.fr/lifo/">
  <img src="https://www.univ-orleans.fr/lifo/themes/custom/bs5_lifo_theme/logo.svg" width="200" alt="LIFO logo">
</a>

