Metadata-Version: 2.4
Name: multimodal-parsers
Version: 0.1.6
Summary: PDF processing pipeline: remove headers/footers, convert to markdown, and generate image captions
Home-page: https://github.com/thuuyen98/PIER-QA
Author: Uyen Hoang
Author-email: thho00003@stud.uni-saarland.de
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: absl-py==2.3.1
Requires-Dist: accelerate==1.12.0
Requires-Dist: addict==2.4.0
Requires-Dist: aiofiles==24.1.0
Requires-Dist: aiohappyeyeballs==2.6.1
Requires-Dist: aiohttp==3.13.2
Requires-Dist: aioice==0.10.2
Requires-Dist: aiortc==1.14.0
Requires-Dist: aiosignal==1.4.0
Requires-Dist: alembic==1.17.2
Requires-Dist: annotated-doc==0.0.4
Requires-Dist: annotated-types==0.7.0
Requires-Dist: anthropic==0.46.0
Requires-Dist: antlr4-python3-runtime==4.9.3
Requires-Dist: anyio==4.12.0
Requires-Dist: asttokens==3.0.1
Requires-Dist: async-timeout==5.0.1
Requires-Dist: attrs==25.4.0
Requires-Dist: audioread==3.1.0
Requires-Dist: av==16.0.1
Requires-Dist: babel==2.17.0
Requires-Dist: backports-datetime-fromisoformat==2.0.3
Requires-Dist: backports.asyncio.runner==1.2.0
Requires-Dist: beautifulsoup4==4.14.3
Requires-Dist: blis==1.3.3
Requires-Dist: braceexpand==0.1.7
Requires-Dist: brotli==1.2.0
Requires-Dist: cachetools==6.2.2
Requires-Dist: catalogue==2.0.10
Requires-Dist: certifi==2025.11.12
Requires-Dist: cffi==2.0.0
Requires-Dist: cfgv==3.5.0
Requires-Dist: charset-normalizer==3.4.4
Requires-Dist: click==8.3.1
Requires-Dist: cloudpathlib==0.23.0
Requires-Dist: cloudpickle==3.1.2
Requires-Dist: coloredlogs==15.0.1
Requires-Dist: colorlog==6.10.1
Requires-Dist: confection==0.1.5
Requires-Dist: contourpy==1.3.2
Requires-Dist: cryptography==46.0.3
Requires-Dist: csvw==3.7.0
Requires-Dist: ctc_segmentation==1.7.4
Requires-Dist: curated-tokenizers==0.0.9
Requires-Dist: curated-transformers==0.1.1
Requires-Dist: cycler==0.12.1
Requires-Dist: cymem==2.0.13
Requires-Dist: Cython==3.2.2
Requires-Dist: cytoolz==1.1.0
Requires-Dist: dacite==1.9.2
Requires-Dist: datasets==4.4.1
Requires-Dist: decorator==5.2.1
Requires-Dist: dill==0.4.0
Requires-Dist: distlib==0.4.0
Requires-Dist: distro==1.9.0
Requires-Dist: dlinfo==2.0.0
Requires-Dist: dnspython==2.8.0
Requires-Dist: docopt==0.6.2
Requires-Dist: editdistance==0.8.1
Requires-Dist: einops==0.8.1
Requires-Dist: einx==0.3.0
Requires-Dist: espeakng-loader==0.2.4
Requires-Dist: exceptiongroup==1.3.1
Requires-Dist: executing==2.2.1
Requires-Dist: fastapi==0.124.0
Requires-Dist: fastrtc==0.0.34
Requires-Dist: fastrtc-moonshine-onnx==20241016
Requires-Dist: ffmpy==1.0.0
Requires-Dist: fiddle==0.3.0
Requires-Dist: filelock==3.20.0
Requires-Dist: filetype==1.2.0
Requires-Dist: flatbuffers==25.9.23
Requires-Dist: fonttools==4.61.0
Requires-Dist: frozendict==2.4.7
Requires-Dist: frozenlist==1.8.0
Requires-Dist: fsspec==2024.12.0
Requires-Dist: ftfy==6.3.1
Requires-Dist: future==1.0.0
Requires-Dist: gitdb==4.0.12
Requires-Dist: GitPython==3.1.45
Requires-Dist: google-auth==2.43.0
Requires-Dist: google-crc32c==1.7.1
Requires-Dist: google-genai==1.54.0
Requires-Dist: gradio==5.50.0
Requires-Dist: gradio_client==1.14.0
Requires-Dist: graphviz==0.21
Requires-Dist: groovy==0.1.2
Requires-Dist: grpcio==1.76.0
Requires-Dist: h11==0.16.0
Requires-Dist: hf-xet==1.2.0
Requires-Dist: httpcore==1.0.9
Requires-Dist: httpx==0.28.1
Requires-Dist: huggingface-hub==0.36.0
Requires-Dist: humanfriendly==10.0
Requires-Dist: hydra-core==1.3.2
Requires-Dist: identify==2.6.15
Requires-Dist: idna==3.11
Requires-Dist: ifaddr==0.2.0
Requires-Dist: indic_numtowords==1.1.0
Requires-Dist: inflect==7.5.0
Requires-Dist: iniconfig==2.3.0
Requires-Dist: intervaltree==3.1.0
Requires-Dist: ipython==8.37.0
Requires-Dist: isodate==0.7.2
Requires-Dist: jedi==0.19.2
Requires-Dist: Jinja2==3.1.6
Requires-Dist: jiter==0.12.0
Requires-Dist: jiwer==3.1.0
Requires-Dist: joblib==1.5.2
Requires-Dist: jsonschema==4.25.1
Requires-Dist: jsonschema-specifications==2025.9.1
Requires-Dist: kaldi-python-io==1.2.2
Requires-Dist: kiwisolver==1.4.9
Requires-Dist: language-tags==1.2.0
Requires-Dist: lazy_loader==0.4
Requires-Dist: Levenshtein==0.27.3
Requires-Dist: lhotse==1.32.1
Requires-Dist: libcst==1.8.6
Requires-Dist: librosa==0.11.0
Requires-Dist: lightning==2.4.0
Requires-Dist: lightning-utilities==0.15.2
Requires-Dist: lilcom==1.8.1
Requires-Dist: llvmlite==0.46.0
Requires-Dist: loguru==0.7.3
Requires-Dist: Mako==1.3.10
Requires-Dist: Markdown==3.10
Requires-Dist: markdown-it-py==4.0.0
Requires-Dist: markdown2==2.5.4
Requires-Dist: markdownify==1.2.2
Requires-Dist: marker-pdf==1.8.0
Requires-Dist: MarkupSafe==3.0.3
Requires-Dist: marshmallow==4.1.1
Requires-Dist: matplotlib==3.10.7
Requires-Dist: matplotlib-inline==0.2.1
Requires-Dist: mdurl==0.1.2
Requires-Dist: mediapy==1.1.6
Requires-Dist: misaki==0.9.4
Requires-Dist: mistral_common==1.8.6
Requires-Dist: ml_dtypes==0.5.4
Requires-Dist: mlx==0.30.0
Requires-Dist: mlx-audio==0.2.6
Requires-Dist: mlx-lm==0.28.4
Requires-Dist: mlx-metal==0.30.0
Requires-Dist: mlx-vlm==0.3.3
Requires-Dist: more-itertools==10.8.0
Requires-Dist: mpmath==1.3.0
Requires-Dist: msgpack==1.1.2
Requires-Dist: multidict==6.7.0
Requires-Dist: multiprocess==0.70.18
Requires-Dist: murmurhash==1.0.15
Requires-Dist: nemo-toolkit==2.5.3
Requires-Dist: networkx==3.4.2
Requires-Dist: nodeenv==1.9.1
Requires-Dist: num2words==0.5.14
Requires-Dist: numba==0.63.0
Requires-Dist: numpy==2.2.6
Requires-Dist: nv-one-logger-core==2.3.1
Requires-Dist: nv-one-logger-pytorch-lightning-integration==2.3.1
Requires-Dist: nv-one-logger-training-telemetry==2.3.1
Requires-Dist: omegaconf==2.3.0
Requires-Dist: onnx==1.19.1
Requires-Dist: onnxruntime==1.23.2
Requires-Dist: openai==1.109.1
Requires-Dist: opencv-python==4.12.0.88
Requires-Dist: opencv-python-headless==4.11.0.86
Requires-Dist: optuna==4.6.0
Requires-Dist: orjson==3.11.5
Requires-Dist: overrides==7.7.0
Requires-Dist: packaging==24.2
Requires-Dist: pandas==2.3.3
Requires-Dist: parso==0.8.5
Requires-Dist: pdftext==0.6.3
Requires-Dist: peft==0.18.0
Requires-Dist: pexpect==4.9.0
Requires-Dist: phonemizer-fork==3.3.2
Requires-Dist: pillow==10.4.0
Requires-Dist: plac==1.4.5
Requires-Dist: platformdirs==4.5.1
Requires-Dist: pluggy==1.6.0
Requires-Dist: pooch==1.8.2
Requires-Dist: pre_commit==4.5.0
Requires-Dist: preshed==3.0.12
Requires-Dist: prompt_toolkit==3.0.52
Requires-Dist: propcache==0.4.1
Requires-Dist: protobuf==5.29.5
Requires-Dist: psutil==7.1.3
Requires-Dist: ptyprocess==0.7.0
Requires-Dist: pure_eval==0.2.3
Requires-Dist: pyannote.core==5.0.0
Requires-Dist: pyannote.database==5.1.3
Requires-Dist: pyannote.metrics==3.2.1
Requires-Dist: pyarrow==22.0.0
Requires-Dist: pyasn1==0.6.1
Requires-Dist: pyasn1_modules==0.4.2
Requires-Dist: pybind11==3.0.1
Requires-Dist: pycountry==24.6.1
Requires-Dist: pycparser==2.23
Requires-Dist: pydantic==2.12.3
Requires-Dist: pydantic-extra-types==2.10.6
Requires-Dist: pydantic-settings==2.12.0
Requires-Dist: pydantic_core==2.41.4
Requires-Dist: pydub==0.25.1
Requires-Dist: pyee==13.0.0
Requires-Dist: Pygments==2.19.2
Requires-Dist: pylibsrtp==1.0.0
Requires-Dist: pyloudnorm==0.1.1
Requires-Dist: PyMuPDF==1.26.6
Requires-Dist: pyOpenSSL==25.3.0
Requires-Dist: pyparsing==3.2.5
Requires-Dist: pypdfium2==4.30.0
Requires-Dist: pytest==9.0.2
Requires-Dist: pytest-asyncio==1.3.0
Requires-Dist: python-dateutil==2.9.0.post0
Requires-Dist: python-dotenv==1.2.1
Requires-Dist: python-multipart==0.0.20
Requires-Dist: pytorch-lightning==2.6.0
Requires-Dist: pytz==2025.2
Requires-Dist: PyYAML==6.0.3
Requires-Dist: RapidFuzz==3.14.3
Requires-Dist: rdflib==7.5.0
Requires-Dist: referencing==0.37.0
Requires-Dist: regex==2024.11.6
Requires-Dist: requests==2.32.5
Requires-Dist: resampy==0.4.3
Requires-Dist: rfc3986==1.5.0
Requires-Dist: rich==14.2.0
Requires-Dist: rpds-py==0.30.0
Requires-Dist: rsa==4.9.1
Requires-Dist: ruamel.yaml==0.18.16
Requires-Dist: ruamel.yaml.clib==0.2.15
Requires-Dist: ruff==0.14.8
Requires-Dist: sacremoses==0.1.1
Requires-Dist: safehttpx==0.1.7
Requires-Dist: safetensors==0.7.0
Requires-Dist: scikit-learn==1.7.2
Requires-Dist: scipy==1.15.3
Requires-Dist: segments==2.3.0
Requires-Dist: semantic-version==2.10.0
Requires-Dist: sentencepiece==0.2.1
Requires-Dist: sentry-sdk==2.47.0
Requires-Dist: shellingham==1.5.4
Requires-Dist: six==1.17.0
Requires-Dist: smart_open==7.5.0
Requires-Dist: smmap==5.0.2
Requires-Dist: sniffio==1.3.1
Requires-Dist: sortedcontainers==2.4.0
Requires-Dist: sounddevice==0.5.3
Requires-Dist: soundfile==0.13.1
Requires-Dist: soupsieve==2.8
Requires-Dist: sox==1.5.0
Requires-Dist: soxr==1.0.0
Requires-Dist: spacy==3.8.11
Requires-Dist: spacy-curated-transformers==0.3.1
Requires-Dist: spacy-legacy==3.0.12
Requires-Dist: spacy-loggers==1.0.5
Requires-Dist: SQLAlchemy==2.0.44
Requires-Dist: srsly==2.5.2
Requires-Dist: stack-data==0.6.3
Requires-Dist: starlette==0.50.0
Requires-Dist: StrEnum==0.4.15
Requires-Dist: surya-ocr==0.14.7
Requires-Dist: sympy==1.14.0
Requires-Dist: tabulate==0.9.0
Requires-Dist: tenacity==9.1.2
Requires-Dist: tensorboard==2.20.0
Requires-Dist: tensorboard-data-server==0.7.2
Requires-Dist: termcolor==3.2.0
Requires-Dist: text-unidecode==1.3
Requires-Dist: texterrors==0.5.1
Requires-Dist: thinc==8.3.10
Requires-Dist: threadpoolctl==3.6.0
Requires-Dist: tiktoken==0.12.0
Requires-Dist: tokenizers==0.21.4
Requires-Dist: toml==0.10.2
Requires-Dist: tomli==2.3.0
Requires-Dist: tomlkit==0.13.3
Requires-Dist: toolz==1.1.0
Requires-Dist: torch==2.9.1
Requires-Dist: torchmetrics==1.8.2
Requires-Dist: tqdm==4.67.1
Requires-Dist: traitlets==5.14.3
Requires-Dist: transformers==4.53.3
Requires-Dist: typeguard==4.4.4
Requires-Dist: typer==0.20.0
Requires-Dist: typer-slim==0.20.0
Requires-Dist: typing-inspection==0.4.2
Requires-Dist: typing_extensions==4.15.0
Requires-Dist: tzdata==2025.2
Requires-Dist: uritemplate==4.2.0
Requires-Dist: urllib3==2.6.1
Requires-Dist: uvicorn==0.38.0
Requires-Dist: virtualenv==20.35.4
Requires-Dist: wandb==0.23.1
Requires-Dist: wasabi==1.1.3
Requires-Dist: wcwidth==0.2.14
Requires-Dist: weasel==0.4.3
Requires-Dist: webdataset==1.0.2
Requires-Dist: webrtcvad==2.0.10
Requires-Dist: websockets==15.0.1
Requires-Dist: Werkzeug==3.1.4
Requires-Dist: wget==3.2
Requires-Dist: whisper_normalizer==0.1.12
Requires-Dist: wrapt==2.0.1
Requires-Dist: xxhash==3.6.0
Requires-Dist: yarl==1.22.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license-file
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# multimodal-parsers

PDF processing pipeline: removes headers/footers, converts to markdown, and generates image captions using MLX VLM.

## Installation

```bash
pip install multimodal-parsers
```

## Dependencies

The package automatically installs its pinned dependencies, including:
- Pillow
- mlx-vlm
- pymupdf
- scikit-learn
- numpy
- marker-pdf

Additionally, you may need to install:
```bash
pip install "unstructured[pdf]"
```
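
To confirm which of the dependencies listed above are present in the current environment, a minimal sketch using the standard library's `importlib.metadata` may help. The helper name `installed_versions` and the `CORE_DEPENDENCIES` list are illustrative and not part of this package.

```python
from importlib import metadata

# Distribution names copied from the dependency list above.
CORE_DEPENDENCIES = ["Pillow", "mlx-vlm", "pymupdf", "scikit-learn", "numpy", "marker-pdf"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if it is missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(CORE_DEPENDENCIES).items():
        print(f"{name}: {version or 'not installed'}")
```

A value of `None` means the distribution was not found and may need to be installed manually.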

## Usage

After installation, use the `multimodal-parsers` command:

```bash
multimodal-parsers <input_dir> <output_dir>
```

### Example

```bash
multimodal-parsers Database/Private/Files Database/Private/Files/output
```
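
If the pipeline needs to be started from a Python script rather than a shell, the command can be wrapped with `subprocess`. This is a sketch that assumes only the two positional arguments shown above; `build_command` and `run_pipeline` are hypothetical helper names, not functions provided by the package.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(input_dir, output_dir):
    """Return the argument list for the CLI form shown above."""
    return ["multimodal-parsers", str(Path(input_dir)), str(Path(output_dir))]


def run_pipeline(input_dir, output_dir):
    """Run the CLI; raise if it is not on PATH or exits with a non-zero status."""
    if shutil.which("multimodal-parsers") is None:
        raise FileNotFoundError("multimodal-parsers is not on PATH; is the package installed?")
    subprocess.run(build_command(input_dir, output_dir), check=True)
```

Calling `run_pipeline("Database/Private/Files", "Database/Private/Files/output")` would then mirror the shell example.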

## What it does

1. **Removes headers and footers** from PDF files using clustering algorithms
2. **Converts PDFs to markdown** using marker-pdf
3. **Generates image captions** using MLX VLM (InternVL3-1B-4bit)
4. **Outputs final markdown files** with captioned images
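
The README does not say which clustering algorithm step 1 uses. As an illustration of the general idea only, the sketch below groups text lines by vertical position with a simple gap-based (single-linkage) 1-D clustering and treats a group as a header or footer when it recurs in the page margins on most pages. The input format (`(y, text)` tuples with `y` normalised to 0.0 at the top and 1.0 at the bottom), the function names, and all thresholds are assumptions made for this example, not the package's implementation.

```python
def cluster_positions(values, tolerance=0.01):
    """Group 1-D values into clusters; a gap above `tolerance` starts a new cluster."""
    clusters = []
    for value in sorted(values):
        if clusters and value - clusters[-1][-1] <= tolerance:
            clusters[-1].append(value)
        else:
            clusters.append([value])
    return clusters


def find_repeated_bands(pages, margin=0.1, min_share=0.8, tolerance=0.01):
    """Return (low, high) vertical bands that recur in the top or bottom margin.

    `pages` is a list of pages; each page is a list of (y, text) tuples.
    """
    # Only lines inside the top or bottom margin are header/footer candidates.
    candidates = [
        (y, page_index)
        for page_index, lines in enumerate(pages)
        for y, _text in lines
        if y <= margin or y >= 1.0 - margin
    ]
    bands = []
    for cluster in cluster_positions([y for y, _ in candidates], tolerance):
        low, high = cluster[0], cluster[-1]
        pages_hit = {page for y, page in candidates if low <= y <= high}
        # Keep the band only if it appears on a large enough share of pages.
        if len(pages_hit) >= min_share * len(pages):
            bands.append((low, high))
    return bands


def strip_bands(pages, bands):
    """Drop every line whose position falls inside one of the bands."""
    return [
        [(y, text) for y, text in lines if not any(lo <= y <= hi for lo, hi in bands)]
        for lines in pages
    ]
```

Requiring a band to recur across pages is what keeps a one-off heading near the top of a single page from being removed along with the running header.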

## Development

```bash
git clone https://github.com/thuuyen98/PIER-QA
cd PIER-QA
pip install -e ".[dev]"
```

## License

MIT License
