Metadata-Version: 2.4
Name: datamarket
Version: 0.10.18
Summary: Utilities that integrate advanced scraping knowledge into just one library.
License: GPL-3.0-or-later
License-File: LICENSE
Author: DataMarket
Author-email: techsupport@datamarket.es
Requires-Python: >=3.12,<4.0
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Provides-Extra: aws
Provides-Extra: azure-storage-blob
Provides-Extra: boto3
Provides-Extra: camoufox
Provides-Extra: chompjs
Provides-Extra: click
Provides-Extra: clickhouse-connect
Provides-Extra: clickhouse-driver
Provides-Extra: datetime
Provides-Extra: ddgs
Provides-Extra: demjson3
Provides-Extra: dnspython
Provides-Extra: drive
Provides-Extra: fake-useragent
Provides-Extra: geoalchemy2
Provides-Extra: geopandas
Provides-Extra: google-api-python-client
Provides-Extra: google-auth-httplib2
Provides-Extra: google-auth-oauthlib
Provides-Extra: html2text
Provides-Extra: httpx
Provides-Extra: json5
Provides-Extra: llm
Provides-Extra: lxml
Provides-Extra: matplotlib
Provides-Extra: nodriver
Provides-Extra: openai
Provides-Extra: openpyxl
Provides-Extra: pandarallel
Provides-Extra: pandas
Provides-Extra: pandera
Provides-Extra: peerdb
Provides-Extra: pii
Provides-Extra: pillow
Provides-Extra: playwright
Provides-Extra: playwright-stealth
Provides-Extra: plotly
Provides-Extra: pyarrow
Provides-Extra: pydantic
Provides-Extra: pydrive2
Provides-Extra: pygeohash
Provides-Extra: pymupdf
Provides-Extra: pyproj
Provides-Extra: pyrate-limiter
Provides-Extra: pysocks
Provides-Extra: pyspark
Provides-Extra: pytest
Provides-Extra: retry
Provides-Extra: shapely
Provides-Extra: soda-core-mysql
Provides-Extra: soda-core-postgres
Provides-Extra: sqlparse
Provides-Extra: tqdm
Provides-Extra: undetected-chromedriver
Provides-Extra: xmltodict
Requires-Dist: SQLAlchemy (>=2.0.0,<3.0.0)
Requires-Dist: azure-storage-blob (>=12.0.0,<13.0.0) ; extra == "azure-storage-blob"
Requires-Dist: babel (>=2.0.0,<3.0.0)
Requires-Dist: beautifulsoup4 (>=4.0.0,<5.0.0)
Requires-Dist: boto3 (>=1.35.0,<1.36.0) ; extra == "boto3" or extra == "aws" or extra == "peerdb"
Requires-Dist: botocore (>=1.35.0,<1.36.0) ; extra == "boto3" or extra == "aws"
Requires-Dist: browserforge (>=1.2.0,<2.0.0) ; extra == "camoufox"
Requires-Dist: camoufox[geoip] (>=0.4.11,<0.5.0) ; extra == "camoufox"
Requires-Dist: chompjs (>=1.0.0,<2.0.0) ; extra == "chompjs"
Requires-Dist: click (>=8.0.0,<9.0.0) ; extra == "click"
Requires-Dist: clickhouse-connect (>=0.11.0,<0.12.0) ; extra == "clickhouse-connect"
Requires-Dist: clickhouse-driver (>=0.2.0,<0.3.0) ; extra == "clickhouse-driver" or extra == "peerdb"
Requires-Dist: croniter (>=3.0.0,<4.0.0)
Requires-Dist: cryptography (>=43.0.0,<44.0.0) ; extra == "boto3" or extra == "aws"
Requires-Dist: datetime (>=5.0,<6.0) ; extra == "datetime"
Requires-Dist: ddgs (>=9.0.0,<10.0.0) ; extra == "ddgs"
Requires-Dist: demjson3 (>=3.0.0,<4.0.0) ; extra == "demjson3"
Requires-Dist: dnspython (>=2.0.0,<3.0.0) ; extra == "dnspython"
Requires-Dist: dynaconf (>=3.0.0,<4.0.0)
Requires-Dist: fake-useragent (>=2.0.0,<3.0.0) ; extra == "fake-useragent"
Requires-Dist: geoalchemy2 (>=0.18.0,<0.19.0) ; extra == "geoalchemy2"
Requires-Dist: geopandas (>=1.0.0,<2.0.0) ; extra == "geopandas"
Requires-Dist: geopy (>=2.0.0,<3.0.0)
Requires-Dist: google-api-python-client (>=2.0.0,<3.0.0) ; extra == "google-api-python-client"
Requires-Dist: google-auth-httplib2 (>=0.2.0,<0.3.0) ; extra == "google-auth-httplib2"
Requires-Dist: google-auth-oauthlib (>=1.0.0,<2.0.0) ; extra == "google-auth-oauthlib"
Requires-Dist: html2text (>=2024.0.0,<2025.0.0) ; extra == "html2text"
Requires-Dist: httpx[http2] (>=0.28.0,<0.29.0) ; extra == "httpx"
Requires-Dist: inflection (>=0.5.0,<0.6.0)
Requires-Dist: jellyfish (>=1.0.0,<2.0.0)
Requires-Dist: jinja2 (>=3.0.0,<4.0.0)
Requires-Dist: json5 (>=0.10.0,<0.11.0) ; extra == "json5"
Requires-Dist: lxml[html-clean] (>=5.0.0,<6.0.0) ; extra == "lxml"
Requires-Dist: matplotlib (>=3.0.0,<4.0.0) ; extra == "matplotlib"
Requires-Dist: nodriver (>=0.44,<0.45) ; extra == "nodriver"
Requires-Dist: numpy (>=2.0.0,<3.0.0)
Requires-Dist: openai (>=2.0.0,<3.0.0) ; extra == "openai" or extra == "llm"
Requires-Dist: openpyxl (>=3.0.0,<4.0.0) ; extra == "openpyxl"
Requires-Dist: pandarallel (>=1.0.0,<2.0.0) ; extra == "pandarallel"
Requires-Dist: pandas (>=2.0.0,<3.0.0) ; extra == "pandas"
Requires-Dist: pandera (>=0.22.0,<0.23.0) ; extra == "pandera"
Requires-Dist: pendulum (>=3.0.0,<4.0.0)
Requires-Dist: phonenumbers (>=9.0.0,<10.0.0)
Requires-Dist: pillow (>=11.0.0,<12.0.0) ; extra == "pillow"
Requires-Dist: playwright (==1.57.0) ; extra == "playwright" or extra == "camoufox"
Requires-Dist: plotly (>=6.0.0,<7.0.0) ; extra == "plotly"
Requires-Dist: pre-commit (>=4.0.0,<5.0.0)
Requires-Dist: presidio-analyzer[phonenumbers] (>=2.0.0,<3.0.0) ; extra == "pii"
Requires-Dist: presidio-anonymizer (>=2.0.0,<3.0.0) ; extra == "pii"
Requires-Dist: psycopg2-binary (>=2.0.0,<3.0.0)
Requires-Dist: pyarrow (>=19.0.0,<20.0.0) ; extra == "pyarrow"
Requires-Dist: pycountry (>=24.0.0,<25.0.0)
Requires-Dist: pydantic (>=2.0.0,<3.0.0) ; extra == "pydantic" or extra == "llm"
Requires-Dist: pydrive2 (>=1.0.0,<2.0.0) ; extra == "pydrive2" or extra == "drive"
Requires-Dist: pygeohash (>=3.0.0,<4.0.0) ; extra == "pygeohash"
Requires-Dist: pymupdf (>=1.0.0,<2.0.0) ; extra == "pymupdf"
Requires-Dist: pyproj (>=3.0.0,<4.0.0) ; extra == "pyproj"
Requires-Dist: pyrate-limiter (>=3.0.0,<4.0.0) ; extra == "pyrate-limiter"
Requires-Dist: pysocks (>=1.0.0,<2.0.0) ; extra == "pysocks"
Requires-Dist: pyspark (>=3.0.0,<4.0.0) ; extra == "pyspark"
Requires-Dist: pytest (>=8.0.0,<9.0.0) ; extra == "pytest"
Requires-Dist: python-string-utils (>=1.0.0,<2.0.0)
Requires-Dist: rapidfuzz (>=3.0.0,<4.0.0)
Requires-Dist: requests (>=2.0.0,<3.0.0)
Requires-Dist: retry (>=0.9.0,<0.10.0) ; extra == "retry"
Requires-Dist: rnet (>=3.0.0rc10,<4.0.0)
Requires-Dist: shapely (>=2.0.0,<3.0.0) ; extra == "shapely"
Requires-Dist: soda-core-mysql-utf8-hotfix (>=3.0.0,<4.0.0) ; extra == "soda-core-mysql"
Requires-Dist: soda-core-postgres (>=3.0.0,<4.0.0) ; extra == "soda-core-postgres"
Requires-Dist: spacy (>=3.0.0,<4.0.0) ; extra == "pii"
Requires-Dist: spacy-langdetect (>=0.1.0,<0.2.0) ; extra == "pii"
Requires-Dist: sqlparse (>=0.5.0,<0.6.0) ; extra == "sqlparse"
Requires-Dist: stem (>=1.0.0,<2.0.0)
Requires-Dist: tenacity (>=9.0.0,<10.0.0)
Requires-Dist: tqdm (>=4.0.0,<5.0.0) ; extra == "tqdm"
Requires-Dist: typer (>=0.15.0,<0.16.0)
Requires-Dist: unidecode (>=1.0.0,<2.0.0)
Requires-Dist: xmltodict (>=0.14.0,<0.15.0) ; extra == "xmltodict"
Project-URL: Documentation, https://github.com/Data-Market/datamarket
Project-URL: Homepage, https://datamarket.es
Project-URL: Repository, https://github.com/Data-Market/datamarket
Description-Content-Type: text/markdown

# datamarket

`datamarket` is a Python library of reusable scraping, data ingestion, and integration utilities used across DataMarket projects.

This README explains what the library is, how to run it locally, how to use it, and where deeper documentation lives.

The library solves a practical problem: different scrapers and ETL jobs often re-implement the same low-level pieces (HTTP retries, proxy rotation, SQLAlchemy batch writes, cloud storage clients, LLM wrappers). This repository centralizes those capabilities in a single package so projects can stay focused on business logic.

## Project Overview

- **Primary value**: standardized interfaces for data collection, transformation, and delivery.
- **Language/runtime**: Python `^3.12` (from `pyproject.toml`).
- **Package manager/build**: Poetry (`pyproject.toml`, `poetry.lock`).
- **Testing**: pytest-based tests in `tests/`.
- **Lint/format**: pre-commit hooks (Ruff + Ruff format) via the `pre-commit-config` submodule.

## High-Level Architecture

Core package code lives in `src/datamarket/` and is organized by responsibility:

- `src/datamarket/interfaces/`: service-facing interfaces (LLM, SQLAlchemy, proxy, AWS, Azure Blob, Drive, FTP, Tinybird, Nominatim, PeerDB).
- `src/datamarket/utils/`: shared helpers (HTTP client wrapper, config loading, logging, Playwright/Selenium helpers, string normalization, data quality sampler).
- `src/datamarket/exceptions/`: custom exception types used across request and proxy workflows.
- `src/datamarket/params/`: static parameter dictionaries/constants (for example, Nominatim enrichment data).

For architecture diagrams and deeper design notes, see `docs/2. Architecture Overview.md`.

## Prerequisites

- Python `3.12` or newer (the package metadata declares `>=3.12,<4.0`).
- `pip` (for installation) and, optionally, Poetry (for dependency and workflow management).
- Optional: Conda if you want to use the bootstrap helper in `init.sh`.

## Installation

To install this library in your Python environment:

`pip install datamarket`
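
Optional integrations are published as extras (the package metadata declares names such as `llm`, `aws`, and `pii`), so a command like `pip install "datamarket[llm]"` pulls in the matching dependencies. As a minimal sketch, the standard-library `importlib.metadata` module can list the extras an installed distribution declares. The helper name `declared_extras` is illustrative and is not part of `datamarket`.

```python
from importlib.metadata import PackageNotFoundError, metadata


def declared_extras(dist_name: str) -> list[str]:
    """Return the extras declared by an installed distribution, sorted by name.

    Returns an empty list when the distribution is not installed.
    """
    try:
        dist_metadata = metadata(dist_name)
    except PackageNotFoundError:
        return []
    # "Provides-Extra" can appear many times; get_all returns None if absent.
    return sorted(dist_metadata.get_all("Provides-Extra") or [])


# Prints an empty list unless datamarket is installed in the active environment.
print(declared_extras("datamarket"))
```

Checking the declared extras first can save a failed install when an extra name is misspelled.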

## Environment Setup

### Option A: Poetry workflow

```bash
poetry install
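# On Poetry 2.x, `poetry shell` requires poetry-plugin-shell; `poetry run <command>` works without it.
poetry shell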
```

### Option B: Conda bootstrap script

`init.sh` creates a Conda environment named `<package>_env`, installs the package in editable mode, initializes submodules, and installs pre-commit hooks.

```bash
bash init.sh
```

## Basic Usage

This section shows how to use the package from consumer projects.

- Import interfaces directly from module paths, for example:
  - `from datamarket.interfaces.llm import LLMInterface`
  - `from datamarket.interfaces.proxy import ProxyInterface`
  - `from datamarket.interfaces.alchemy import AlchemyInterface`
- Load INI-style config using `datamarket.utils.main.get_config` when needed.
- Run end-to-end examples from `examples/` for LLM and vision use cases.
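
Because some interfaces may depend on packages that are only installed with an extra, a consumer project might guard its imports. The following is a sketch under that assumption: `load_interface` is a hypothetical helper, not a `datamarket` function, and only the module path and class name come from the list above.

```python
import importlib


def load_interface(module_path: str, class_name: str):
    """Import module_path and return the attribute named class_name.

    Re-raises ImportError with a hint about optional extras.
    """
    try:
        module = importlib.import_module(module_path)
    except ImportError as exc:
        raise ImportError(
            f"Could not import {module_path!r} ({exc}). "
            "If this interface relies on an optional dependency, "
            "install the matching extra, e.g. pip install 'datamarket[llm]'."
        ) from exc
    return getattr(module, class_name)


try:
    LLMInterface = load_interface("datamarket.interfaces.llm", "LLMInterface")
except ImportError as exc:
    # Reached when datamarket or one of its optional dependencies is missing.
    print(exc)
```

A plain `from datamarket.interfaces.llm import LLMInterface` is enough when the environment is known to have the required extras; the guard only improves the error message.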

## Development Workflow

### Run examples

```bash
python examples/llm_usage_examples.py
python examples/llm_vision_examples.py
```

### Run tests

```bash
pytest -v
```

### Lint and format

This repo uses pre-commit hooks defined in `pre-commit-config/.pre-commit-config.yaml`:

```bash
pre-commit run --all-files
```

### Build artifacts

```bash
poetry build
```

Built distributions are output to `dist/`.

## Configuration

This library is configuration-driven. Most interfaces expect either:

- a dict-like object (`config["section"]["key"]`), or
- a `ConfigParser`/`RawConfigParser` object for INI files.
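
Both shapes support the same `config["section"]["key"]` lookup, which the sketch below demonstrates with the standard-library `configparser` module. The helper `get_setting` and all values are placeholders chosen for illustration; they are not part of `datamarket`.

```python
from configparser import ConfigParser

INI_TEXT = """
[llm]
provider = example-provider
api_key = replace-me
model = example-model
"""


def get_setting(config, section, key, default=None):
    """Read config[section][key] from a dict or a ConfigParser, with a default."""
    try:
        return config[section][key]
    except KeyError:
        # Both dicts and ConfigParser raise KeyError for a missing section or key.
        return default


parser = ConfigParser()
parser.read_string(INI_TEXT)

as_dict = {
    "llm": {
        "provider": "example-provider",
        "api_key": "replace-me",
        "model": "example-model",
    }
}

# The same lookup works against either object.
for cfg in (parser, as_dict):
    print(get_setting(cfg, "llm", "model"), get_setting(cfg, "llm", "timeout", "unset"))
```

Note that `ConfigParser` values are always strings, so numeric or boolean settings need explicit conversion when a plain dict would already hold typed values.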

Common sections used by interfaces include:

- `[llm]` for `LLMInterface` (`provider`, `api_key`, `model`).
- `[db]` for `AlchemyInterface` and Postgres peer operations.
- `[proxy]` for `ProxyInterface` (`hosts`, optional `tor_password`).
- `[tinybird]`, `[osm]`, `[drive]`.
- Profile-based sections such as `[aws:<profile>]`, `[azure:<profile>]`, `[ftp:<profile>]`.
- PeerDB-specific sections: `[peerdb]`, `[clickhouse]`, `[peerdb-s3]`.
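
To make the section layout concrete, here is a small sketch that parses a sample INI file and checks for the settings named above before any interface is created. The required-key table reflects only the keys this README mentions, the `bucket` key under `[aws:default]` is a made-up placeholder, and both helper functions are hypothetical rather than part of `datamarket`.

```python
from configparser import ConfigParser

SAMPLE_INI = """
[llm]
provider = example-provider
api_key = replace-me
model = example-model

[proxy]
hosts = proxy1.example.com:8080

[aws:default]
bucket = example-bucket
"""

# Assumed minimum per section, based on the keys named in this README.
REQUIRED_KEYS = {"llm": ("api_key",), "proxy": ("hosts",)}


def missing_settings(config, required=REQUIRED_KEYS):
    """Return 'section.key' strings for required settings that are absent or empty."""
    missing = []
    for section, keys in required.items():
        for key in keys:
            if section not in config or not config[section].get(key):
                missing.append(f"{section}.{key}")
    return missing


def profile_section(config, service, profile):
    """Return the '[service:profile]' section, for example '[aws:default]'."""
    return config[f"{service}:{profile}"]


config = ConfigParser()
config.read_string(SAMPLE_INI)

print(missing_settings(config))
print(profile_section(config, "aws", "default")["bucket"])
```

Running a check like this at startup turns a late interface error into an early, readable list of missing settings.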

See the generated wiki pages in `docs/` for concrete config and workflow details, especially `docs/3. Workflows.md` and `docs/Deep Dive/Interfaces.md`.

## Deployment and Release Notes

- This repository is a library package, not a deployable service.
- Release packaging is supported through Poetry (`poetry build`); Twine can be used for publishing.
- CI/CD release automation is **not configured in this repository** (no `.github/workflows/` present).
- See `docs/4. ADRs.md` for architecture-level release and maintenance trade-offs.

## Troubleshooting

- `ModuleNotFoundError` for optional features: install the required extras, for example `pip install "datamarket[llm]"` in a consumer project, or `.[llm]`, `.[pytest]`, `.[boto3]` from a local checkout.
- `Configuration must contain 'llm' section`: include `[llm]` with `api_key` before creating `LLMInterface`.
- `No working proxies available`: verify `[proxy] hosts` format (`host:port` or `user:pass@host:port`) and network access.
- SQLAlchemy connection errors: verify `[db]` credentials and engine string.
- Pre-commit command not found: install `pre-commit` in your active environment.

## Contributing (Summary)

- Keep changes scoped and aligned with existing module boundaries in `src/datamarket/`.
- Add or update tests under `tests/` for behavioral changes.
- Run `pytest -v` and `pre-commit run --all-files` before opening a PR.
- Keep docs current when interfaces, config keys, or workflows change.

## Documentation Map

- Wiki home: `docs/Home.md`
- Project overview: `docs/1. Project Overview.md`
- Architecture overview (C4): `docs/2. Architecture Overview.md`
- Workflows: `docs/3. Workflows.md`
- Architecture decisions: `docs/4. ADRs.md`
- Deep dives: `docs/Deep Dive/Interfaces.md`, `docs/Deep Dive/LLM.md`, `docs/Deep Dive/SQLAlchemy.md`, `docs/Deep Dive/Utilities.md`, `docs/Deep Dive/Geo Enrichment.md`
- Digital twin artifacts: `docs/_twin/inventory.json`, `docs/_twin/graph.json`, `docs/_twin/domain-map.md`, `docs/_twin/patterns.md`

## Documentation Status

Diataxis type: Reference.

- This README is the entry point and is maintained incrementally from validated repository summaries.
- Current generated references in this run:
  - `docs/1. Project Overview.md`
  - `docs/2. Architecture Overview.md`
  - `docs/3. Workflows.md`
  - `docs/4. ADRs.md`
- Known unknowns:
  - `UNKNOWN`: linked pages under the external `docs` submodule may differ from this local snapshot.

## License

GPL-3.0-or-later. See `LICENSE`.


