Metadata-Version: 2.1
Name: data_prep_toolkit_transforms
Version: 1.0.0a0
Summary: Data Preparation Toolkit Transforms using Ray
Author-email: Maroun Touma <touma@us.ibm.com>
License: Apache-2.0
Keywords: transforms,data preprocessing,data preparation,llm,generative,ai,fine-tuning,llmapps
Requires-Python: <3.13,>=3.10
Description-Content-Type: text/markdown
Requires-Dist: data-prep-toolkit>=0.2.3.dev0
Provides-Extra: dev
Requires-Dist: twine; extra == "dev"
Requires-Dist: pytest>=7.3.2; extra == "dev"
Requires-Dist: pytest-dotenv>=0.5.2; extra == "dev"
Requires-Dist: pytest-env>=1.0.0; extra == "dev"
Requires-Dist: pre-commit>=3.3.2; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: pytest-mock>=3.10.0; extra == "dev"
Requires-Dist: moto==5.0.5; extra == "dev"
Requires-Dist: markupsafe==2.0.1; extra == "dev"
Provides-Extra: ray
Requires-Dist: data-prep-toolkit[ray]>=0.2.3.dev0; extra == "ray"
Requires-Dist: networkx==3.3; extra == "ray"
Requires-Dist: colorlog==6.8.2; extra == "ray"
Requires-Dist: func-timeout==4.3.5; extra == "ray"
Requires-Dist: emerge-viz==2.0.0; extra == "ray"
Provides-Extra: all
Requires-Dist: data-prep-toolkit>=0.2.3.dev0; extra == "all"
Requires-Dist: scancode-toolkit==32.1.0; platform_system != "Darwin" and extra == "all"
Requires-Dist: bs4==0.0.2; extra == "all"
Requires-Dist: transformers==4.38.2; extra == "all"
Requires-Dist: parameterized; extra == "all"
Requires-Dist: pandas; extra == "all"
Requires-Dist: pyyaml>=6.0.2; extra == "all"
Requires-Dist: boto3>=1.34.69; extra == "all"
Requires-Dist: kubernetes>=30.1.0; extra == "all"
Requires-Dist: polars==1.9.0; extra == "all"
Requires-Dist: disjoint-set>=0.8.0; extra == "all"
Requires-Dist: scipy<2.0.0,>=1.14.1; extra == "all"
Requires-Dist: numpy<1.29.0; extra == "all"
Requires-Dist: sentencepiece>=0.2.0; extra == "all"
Requires-Dist: mmh3>=4.1.0; extra == "all"
Requires-Dist: mmh3==4.1.0; extra == "all"
Requires-Dist: xxhash==3.4.1; extra == "all"
Requires-Dist: duckdb>=0.10.1; extra == "all"
Requires-Dist: docling-core==2.3.0; extra == "all"
Requires-Dist: pydantic<2.10.0,>=2.0.0; extra == "all"
Requires-Dist: llama-index-core<0.12.0,>=0.11.22; extra == "all"
Requires-Dist: fasttext==0.9.2; extra == "all"
Requires-Dist: langcodes==3.3.0; extra == "all"
Requires-Dist: huggingface-hub<1.0.0,>=0.21.4; extra == "all"
Requires-Dist: numpy==1.26.4; extra == "all"
Requires-Dist: sentence-transformers==3.0.1; extra == "all"
Requires-Dist: docling-ibm-models==2.0.3; extra == "all"
Requires-Dist: deepsearch-glm==0.26.1; extra == "all"
Requires-Dist: docling==2.3.1; extra == "all"
Requires-Dist: filetype<2.0.0,>=1.2.0; extra == "all"
Requires-Dist: nltk==3.9.1; extra == "all"
Requires-Dist: torch<=2.4.1,>=2.2.2; extra == "all"
Requires-Dist: pandas==2.2.2; extra == "all"
Requires-Dist: data_prep_connector>=0.2.3; extra == "all"
Provides-Extra: language
Requires-Dist: data-prep-toolkit>=0.2.3.dev0; extra == "language"
Requires-Dist: pyyaml>=6.0.2; extra == "language"
Requires-Dist: boto3>=1.34.69; extra == "language"
Requires-Dist: kubernetes>=30.1.0; extra == "language"
Requires-Dist: polars==1.9.0; extra == "language"
Requires-Dist: disjoint-set>=0.8.0; extra == "language"
Requires-Dist: scipy<2.0.0,>=1.14.1; extra == "language"
Requires-Dist: numpy<1.29.0; extra == "language"
Requires-Dist: sentencepiece>=0.2.0; extra == "language"
Requires-Dist: mmh3>=4.1.0; extra == "language"
Requires-Dist: mmh3==4.1.0; extra == "language"
Requires-Dist: xxhash==3.4.1; extra == "language"
Requires-Dist: duckdb>=0.10.1; extra == "language"
Requires-Dist: docling-core==2.3.0; extra == "language"
Requires-Dist: pydantic<2.10.0,>=2.0.0; extra == "language"
Requires-Dist: llama-index-core<0.12.0,>=0.11.22; extra == "language"
Requires-Dist: fasttext==0.9.2; extra == "language"
Requires-Dist: langcodes==3.3.0; extra == "language"
Requires-Dist: huggingface-hub<1.0.0,>=0.21.4; extra == "language"
Requires-Dist: numpy==1.26.4; extra == "language"
Requires-Dist: sentence-transformers==3.0.1; extra == "language"
Requires-Dist: docling-ibm-models==2.0.3; extra == "language"
Requires-Dist: deepsearch-glm==0.26.1; extra == "language"
Requires-Dist: docling==2.3.1; extra == "language"
Requires-Dist: filetype<2.0.0,>=1.2.0; extra == "language"
Requires-Dist: nltk==3.9.1; extra == "language"
Requires-Dist: transformers==4.38.2; extra == "language"
Requires-Dist: torch<=2.4.1,>=2.2.2; extra == "language"
Requires-Dist: pandas==2.2.2; extra == "language"
Requires-Dist: data_prep_connector>=0.2.3; extra == "language"
Provides-Extra: proglang-select
Requires-Dist: data-prep-toolkit>=0.2.3.dev0; extra == "proglang-select"
Provides-Extra: header-cleanser
Requires-Dist: data-prep-toolkit>=0.2.3.dev0; extra == "header-cleanser"
Requires-Dist: scancode-toolkit==32.1.0; platform_system != "Darwin" and extra == "header-cleanser"
Provides-Extra: license-select
Requires-Dist: data-prep-toolkit>=0.2.3.dev0; extra == "license-select"
Provides-Extra: code-quality
Requires-Dist: data-prep-toolkit>=0.2.3.dev0; extra == "code-quality"
Requires-Dist: bs4==0.0.2; extra == "code-quality"
Requires-Dist: transformers==4.38.2; extra == "code-quality"
Provides-Extra: code2parquet
Requires-Dist: data-prep-toolkit>=0.2.3.dev0; extra == "code2parquet"
Requires-Dist: parameterized; extra == "code2parquet"
Requires-Dist: pandas; extra == "code2parquet"
Provides-Extra: pii-redactor
Requires-Dist: data-prep-toolkit>=0.2.3.dev0; extra == "pii-redactor"
Requires-Dist: presidio-analyzer>=2.2.355; extra == "pii-redactor"
Requires-Dist: presidio-anonymizer>=2.2.355; extra == "pii-redactor"
Requires-Dist: flair>=0.14.0; extra == "pii-redactor"
Requires-Dist: pandas>=2.2.2; extra == "pii-redactor"
Provides-Extra: fdedup
Requires-Dist: data-prep-toolkit>=0.2.3.dev0; extra == "fdedup"
Requires-Dist: pyyaml>=6.0.2; extra == "fdedup"
Requires-Dist: boto3>=1.34.69; extra == "fdedup"
Requires-Dist: kubernetes>=30.1.0; extra == "fdedup"
Requires-Dist: polars==1.9.0; extra == "fdedup"
Requires-Dist: disjoint-set>=0.8.0; extra == "fdedup"
Requires-Dist: scipy<2.0.0,>=1.14.1; extra == "fdedup"
Requires-Dist: numpy<1.29.0; extra == "fdedup"
Requires-Dist: sentencepiece>=0.2.0; extra == "fdedup"
Requires-Dist: mmh3>=4.1.0; extra == "fdedup"
Provides-Extra: profiler
Requires-Dist: data-prep-toolkit>=0.2.3.dev0; extra == "profiler"
Requires-Dist: mmh3==4.1.0; extra == "profiler"
Requires-Dist: xxhash==3.4.1; extra == "profiler"
Provides-Extra: filter
Requires-Dist: data-prep-toolkit>=0.2.3.dev0; extra == "filter"
Requires-Dist: duckdb>=0.10.1; extra == "filter"
Provides-Extra: resize
Requires-Dist: data-prep-toolkit>=0.2.3.dev0; extra == "resize"
Provides-Extra: doc-chunk
Requires-Dist: data-prep-toolkit>=0.2.3.dev0; extra == "doc-chunk"
Requires-Dist: docling-core==2.3.0; extra == "doc-chunk"
Requires-Dist: pydantic<2.10.0,>=2.0.0; extra == "doc-chunk"
Requires-Dist: llama-index-core<0.12.0,>=0.11.22; extra == "doc-chunk"
Provides-Extra: doc-quality
Requires-Dist: data-prep-toolkit>=0.2.3.dev0; extra == "doc-quality"
Provides-Extra: html2parquet
Requires-Dist: data-prep-toolkit>=0.2.3.dev0; extra == "html2parquet"
Requires-Dist: trafilatura==1.12.0; extra == "html2parquet"
Provides-Extra: lang-id
Requires-Dist: data-prep-toolkit>=0.2.3.dev0; extra == "lang-id"
Requires-Dist: fasttext==0.9.2; extra == "lang-id"
Requires-Dist: langcodes==3.3.0; extra == "lang-id"
Requires-Dist: huggingface-hub<1.0.0,>=0.21.4; extra == "lang-id"
Requires-Dist: numpy==1.26.4; extra == "lang-id"
Provides-Extra: pdf2parquet
Requires-Dist: docling-core==2.3.0; extra == "pdf2parquet"
Requires-Dist: docling-ibm-models==2.0.3; extra == "pdf2parquet"
Requires-Dist: deepsearch-glm==0.26.1; extra == "pdf2parquet"
Requires-Dist: docling==2.3.1; extra == "pdf2parquet"
Requires-Dist: filetype<2.0.0,>=1.2.0; extra == "pdf2parquet"
Provides-Extra: text-encoder
Requires-Dist: data-prep-toolkit>=0.2.3.dev0; extra == "text-encoder"
Requires-Dist: sentence-transformers==3.0.1; extra == "text-encoder"
Provides-Extra: doc-id
Requires-Dist: data-prep-toolkit>=0.2.3.dev0; extra == "doc-id"
Provides-Extra: ededup
Requires-Dist: data-prep-toolkit>=0.2.3.dev0; extra == "ededup"
Requires-Dist: mmh3>=4.1.0; extra == "ededup"
Requires-Dist: xxhash==3.4.1; extra == "ededup"
Provides-Extra: hap
Requires-Dist: data-prep-toolkit>=0.2.3.dev0; extra == "hap"
Requires-Dist: nltk==3.9.1; extra == "hap"
Requires-Dist: transformers==4.38.2; extra == "hap"
Requires-Dist: torch<=2.4.1,>=2.2.2; extra == "hap"
Requires-Dist: pandas==2.2.2; extra == "hap"
Provides-Extra: tokenization
Requires-Dist: data-prep-toolkit>=0.2.3.dev0; extra == "tokenization"
Requires-Dist: transformers==4.38.2; extra == "tokenization"
Provides-Extra: web2parquet
Requires-Dist: data-prep-toolkit>=0.2.3.dev0; extra == "web2parquet"
Requires-Dist: data_prep_connector>=0.2.3; extra == "web2parquet"

# DPK Python Transforms

## Installation

The [transforms](https://github.com/IBM/data-prep-kit/blob/dev/transforms/README.md) are delivered as a standard Python library available on PyPI and can be installed with pip:

`python -m pip install data-prep-toolkit-transforms`
or
`python -m pip install data-prep-toolkit-transforms[ray]`


Installing the Python transforms also installs `data-prep-toolkit`.

Installing the Ray transforms also installs `data-prep-toolkit[ray]`.

## List of transforms in the current package

Note: This list includes the transforms that have been part of the release since data-prep-toolkit-transforms:0.2.1. It may not always be up to date. Users are encouraged to raise an issue on GitHub when they find missing components, or packages that are listed below but absent from the release they get from PyPI.

* code
    * [code2parquet](https://github.com/IBM/data-prep-kit/blob/dev/transforms/code/code2parquet/python/README.md)
    * [header_cleanser (not available on macOS)](https://github.com/IBM/data-prep-kit/blob/dev/transforms/code/header_cleanser/python/README.md)
    * [code_quality](https://github.com/IBM/data-prep-kit/blob/dev/transforms/code/code_quality/python/README.md)
    * [proglang_select](https://github.com/IBM/data-prep-kit/blob/dev/transforms/code/proglang_select/python/README.md)
* language
    * [doc_chunk](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/doc_chunk/python/README.md)
    * [doc_quality](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/doc_quality/python/README.md)
    * [lang_id](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/lang_id/python/README.md)
    * [pdf2parquet](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/pdf2parquet/python/README.md)
    * [text_encoder](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/text_encoder/python/README.md)
    * [pii_redactor](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/pii_redactor/python/README.md)
* universal
    * [ededup](https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/ededup/python/README.md)
    * [filter](https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/filter/python/README.md)
    * [resize](https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/resize/python/README.md)
    * [tokenization](https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/tokenization/python/README.md)
    * [doc_id](https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/doc_id/python/README.md)
    * [web2parquet](https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/web2parquet/README.md)
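
The package metadata declares one extra per transform (for example `pdf2parquet`, `doc-chunk`, `lang-id`), so the dependencies of a single transform can presumably be installed on their own. The sketch below lists the extras that the installed distribution declares, using only the standard-library `importlib.metadata` module. The helper names `declared_extras` and `pip_requirement` are hypothetical and not part of the toolkit.

```python
from importlib import metadata


def declared_extras(dist_name: str) -> list[str]:
    """Return the sorted extras a distribution declares, or [] if it is not installed."""
    try:
        md = metadata.metadata(dist_name)
    except metadata.PackageNotFoundError:
        return []
    # "Provides-Extra" appears once per extra in the core metadata.
    return sorted(md.get_all("Provides-Extra") or [])


def pip_requirement(dist_name: str, extra: str) -> str:
    """Build a requirement string such as 'pkg[extra]' to pass to pip install."""
    return f"{dist_name}[{extra}]"


if __name__ == "__main__":
    name = "data-prep-toolkit-transforms"
    for extra in declared_extras(name):
        print(pip_requirement(name, extra))
```

Note that extra names use hyphens (`doc-chunk`) where the transform names in the list above use underscores (`doc_chunk`). In shells such as zsh, quote the requirement string because of the square brackets.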
   
## Release notes

### 0.2.2.dev3
* web2parquet
### 0.2.2.dev2
* pdf2parquet now supports HTML, DOCX, PPTX, ... in addition to PDF




 
