Metadata-Version: 2.4
Name: data_prep_connector
Version: 0.2.5.dev0
Summary: Scalable and Compliant Web Crawler
Author-email: Hiroya Matsubara <hmtbr@jp.ibm.com>
License: Apache-2.0
Keywords: data,data acquisition,crawler,web crawler,llm,generative,ai,fine-tuning,llmapps
Requires-Python: <3.15,>=3.10
Description-Content-Type: text/markdown
Requires-Dist: scrapy==2.12.0
Requires-Dist: pydantic>=2.8.1
Requires-Dist: tldextract>=5.1.2
Provides-Extra: dev
Requires-Dist: twine; extra == "dev"
Requires-Dist: pytest>=7.3.2; extra == "dev"
Requires-Dist: pytest-dotenv>=0.5.2; extra == "dev"
Requires-Dist: pytest-env>=1.0.0; extra == "dev"
Requires-Dist: pre-commit>=3.3.2; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: pytest-mock>=3.10.0; extra == "dev"
Requires-Dist: pytest-datadir>=1.5.0; extra == "dev"
Requires-Dist: moto==5.0.5; extra == "dev"
Requires-Dist: markupsafe==2.0.1; extra == "dev"

# DPK Connector

DPK Connector is a scalable and compliant web crawler for acquiring data for LLM development. It is built on [Scrapy](https://scrapy.org/).
For more details, see [the documentation](doc/overview.md).

## Virtual Environment

The project uses `pyproject.toml` and a Makefile for operations.
For development, first create the virtual environment:
```shell
make venv
```
and then either activate it:
```shell
source venv/bin/activate
```
or set up your IDE to use the `venv` directory when developing in this project.
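
The package metadata declares `Requires-Python: <3.15,>=3.10`, so it can help to check the interpreter before creating the environment. The following is a minimal sketch, not part of the project: the helper name `supported_python` is hypothetical, and it assumes the interpreter that `make venv` picks up is the `python3` on your `PATH`.

```shell
# Hypothetical helper: succeeds when a "major.minor" version string
# falls inside the declared range >=3.10,<3.15.
supported_python() {
    major=${1%%.*}   # text before the first dot
    minor=${1#*.}    # text after the first dot
    [ "$major" -eq 3 ] && [ "$minor" -ge 10 ] && [ "$minor" -lt 15 ]
}

# Assumption: python3 on PATH is the interpreter used for the venv.
version=$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')
if supported_python "$version"; then
    echo "Python $version is supported"
else
    echo "Python $version is outside >=3.10,<3.15" >&2
fi
```

If the check reports an unsupported version, point your shell or IDE at a supported interpreter before running `make venv`.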

## Library Artifact Build and Publish

To test, build, and publish the library:
```shell
make test build publish
```

To bump the version number, edit the `Makefile` to change `VERSION` and rerun the command above. You will then need to commit both the `Makefile` and the automatically updated `pyproject.toml` file.
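
The commit step can be sketched as a small shell function. This is only an illustration: the function name `commit_version_bump` and the commit message format are assumptions, not project conventions, and it should be run from the project root after `VERSION` has been edited and the make targets have regenerated `pyproject.toml`.

```shell
# Hypothetical helper: commit the two files touched by a version bump.
# $1 is the new version string, used only in the commit message.
commit_version_bump() {
    git add Makefile pyproject.toml &&
    git commit -m "Bump version to $1"
}
```

Call it with the value you set for `VERSION`, for example `commit_version_bump <new-version>`.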

## How to use

See [the overview](doc/overview.md).
