Metadata-Version: 2.4
Name: mistral-pdf-to-markdown
Version: 1.2.0
Summary: A CLI tool to convert PDF files to Markdown using the Mistral AI OCR API.
License-File: LICENSE
Author: Arcangelo Massari
Author-email: info@arcangelomassari.com
Requires-Python: >=3.10,<4.0
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Requires-Dist: click (>=8.1.8,<9.0.0)
Requires-Dist: mistralai (>=1.7.0,<2.0.0)
Requires-Dist: pypandoc (>=1.15,<2.0)
Requires-Dist: python-dotenv (>=1.1.0,<2.0.0)
Requires-Dist: weasyprint (>=66.0,<67.0)
Description-Content-Type: text/markdown

# Mistral PDF to Markdown Converter

[![PyPI version](https://img.shields.io/pypi/v/mistral-pdf-to-markdown.svg)](https://pypi.org/project/mistral-pdf-to-markdown/)
![Poetry](https://img.shields.io/badge/poetry-2.1.2-blue?logo=poetry&logoColor=blue)

A simple command-line tool to convert PDF and EPUB files into Markdown format using the Mistral AI OCR API.
This tool also extracts embedded images and saves them in a subdirectory relative to the output markdown file.

## Installation

You can install the package directly from PyPI using pip:

```bash
pip install mistral-pdf-to-markdown
```

### Global Installation (Recommended for CLI Usage)

If you want to use the `pdf2md` command from anywhere in your system without activating a specific virtual environment, the recommended way is to use `pipx`:

1.  **Install `pipx`** (if you don't have it already). Follow the official [pipx installation guide](https://pipx.pypa.io/stable/installation/). A common method is:
    ```bash
    python3 -m pip install --user pipx
    python3 -m pipx ensurepath
    ```
    *(Restart your terminal after running `ensurepath`)*

2.  **Install the package using `pipx`:**
    ```bash
    pipx install mistral-pdf-to-markdown
    ```

This installs the package in an isolated environment but makes the `pdf2md` command globally available.

### Installation from Source

Alternatively, if you want to install from the source:

1.  **Clone the repository:**
    ```bash
    git clone https://github.com/arcangelo7/mistral-pdf-to-markdown.git
    cd mistral-pdf-to-markdown
    ```

2.  **Install dependencies using Poetry:**
    ```bash
    poetry install
    ```

### Additional Requirements for EPUB Support

To convert EPUB files, you need to install `pandoc`. See the [official installation guide](https://pandoc.org/installing.html) for your operating system.

## Usage

1.  **Set your Mistral API Key:**
    You can set your API key as an environment variable:
    ```bash
    export MISTRAL_API_KEY='your_api_key_here'
    ```
    Alternatively, you can create a `.env` file in the project root directory with the following content:
    ```
    MISTRAL_API_KEY=your_api_key_here
    ```
    You can also pass the API key directly using the `--api-key` option.

2.  **Run the conversion:**

    ### Convert a Single PDF or EPUB File
    The `convert` command processes a single PDF or EPUB file.
    ```bash
    poetry run pdf2md convert <path/to/your/document.pdf> [options]
    ```
    Or, if you have activated the virtual environment (`poetry shell`):
    ```bash
    pdf2md convert <path/to/your/document.pdf> [options]
    ```

    **Options for Single File Conversion:**
    *   `--output` or `-o`: Specify the path for the output Markdown file. If not provided, it defaults to the same name as the input file but with a `.md` extension (e.g., `document.md`).
    *   `--api-key`: Provide the Mistral API key directly.

    ### Convert Multiple PDF and EPUB Files from a Directory
    The `convert-dir` command processes all PDF and EPUB files in a specified directory.
    ```bash
    poetry run pdf2md convert-dir <path/to/directory/with/files> [options]
    ```
    Or, if you have activated the virtual environment (`poetry shell`):
    ```bash
    pdf2md convert-dir <path/to/directory/with/files> [options]
    ```

    **Options for Directory Conversion:**
    *   `--output-dir` or `-o`: Specify the directory where output Markdown files will be saved. If not provided, it defaults to the same directory as the input files.
    *   `--api-key`: Provide the Mistral API key directly.
    *   `--max-workers` or `-w`: Maximum number of concurrent conversions (default: 2). Increase this value to process multiple files in parallel for faster conversion.

**Image Handling:**

The script will attempt to extract images embedded in the document.
*   Images are saved in a subdirectory named `<output_filename_stem>_images` (e.g., if the output is `report.md`, images will be in `report_images/`).
*   The generated Markdown file will contain relative links pointing to the images in this subdirectory.

**Examples:**

```bash
# Convert a single PDF file (output: ./my_report.md)
poetry run pdf2md convert ./my_report.pdf

# Convert with custom output path
poetry run pdf2md convert ./my_report.pdf -o ./output/report.md

# Convert all files in a directory with 4 concurrent workers
poetry run pdf2md convert-dir ./documents/ -o ./markdown_output/ -w 4
```

An example output generated from `example.pdf` (included in the repository) can be found in [example.md](example.md), with its corresponding images located in the `example_images/` directory.

## License

This project is licensed under the ISC License - see the [LICENSE](LICENSE) file for details.
