Metadata-Version: 2.3
Name: article-to-md
Version: 0.4.0
Summary: Convert an article or web page to Markdown
Keywords: markdown,scraping,extraction,web scraping
Author: nateify
Author-email: nateify <nateify@users.noreply.github.com>
Classifier: Environment :: Console
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.13
Classifier: License :: OSI Approved :: MIT License
Classifier: Topic :: Utilities
Requires-Dist: curl-cffi>=0.13.0
Requires-Dist: cyclopts>=4.8.0
Requires-Dist: html-to-markdown>=2.28.1
Requires-Dist: lxml>=6.0.2
Requires-Dist: markdownify>=1.2.2
Requires-Dist: readabilipy
Requires-Dist: requests>=2.32.5
Requires-Dist: requests-cache>=1.3.1
Requires-Dist: trafilatura>=2.0.0
Requires-Dist: unidecode>=1.4.0
Requires-Python: >=3.13
Description-Content-Type: text/markdown

# article-to-md
A CLI tool that extracts the core content from web pages or local HTML files and converts it to Markdown.

```
╭─ Commands ────────────────────────────────────────────────────────────────────╮
│ --help (-h)  Display this message and exit.                                   │
│ --version    Display application version.                                     │
╰───────────────────────────────────────────────────────────────────────────────╯
╭─ Parameters ──────────────────────────────────────────────────────────────────╮
│ *  SOURCE --source                [required]                                  │
│    --method                       [choices: readability, trafilatura, raw]    │
│                                   [default: readability]                      │
│    --favor                        [choices: recall, precision]                │
│    --remove-ads --no-remove-ads   [default: False]                            │
│    --strip-tag --empty-strip-tag                                              │
╰───────────────────────────────────────────────────────────────────────────────╯
```

## Installation

Installing with [uv](https://docs.astral.sh/uv/) is recommended, since it places the tool in its own managed environment:

    uv tool install article-to-md

**Note**: The `readability` method can only use Readability.js when Node.js (v14+) is installed on your system. Without Node.js, the tool falls back to Python-based extraction.
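
To see which case applies on a given machine, a quick check along the following lines may help. This is only a sketch: `node_major` is a hypothetical helper name introduced here, and it assumes `node --version` prints a string of the form `v18.19.0`.

```bash
# Hypothetical helper: extract the major version from a string such as "v18.19.0".
node_major() {
  ver="${1#v}"                 # drop the leading "v"
  printf '%s\n' "${ver%%.*}"   # keep everything before the first dot
}

if command -v node >/dev/null 2>&1; then
  major="$(node_major "$(node --version)")"
  if [ "$major" -ge 14 ]; then
    echo "Node.js v$major found: Readability.js can be used"
  else
    echo "Node.js v$major is older than v14"
  fi
else
  echo "Node.js not found: Python-based extraction will be used"
fi
```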

## Usage

From a publicly accessible web page:

```bash
article-to-md https://example.com/article
```

From a local HTML file:

```bash
article-to-md /path/to/file.html
```

Advanced options:

- `--remove-ads` - Basic ad removal from the DOM using generic cosmetic filters from [EasyList](https://easylist.to/). Disabled by default.
- `--method` - Selects how the DOM is pre-processed before conversion to Markdown.
  - `readability` (default) - Uses [ReadabiliPy](https://github.com/alan-turing-institute/ReadabiliPy), which can use the original Readability.js Node package when Node.js is present on the system.
  - `trafilatura` - Uses the [Trafilatura](https://trafilatura.readthedocs.io/en/latest/index.html) pure-Python library.
  - `raw` - Sends the full DOM to be converted.
- `--favor` - Only used with `--method trafilatura`; selects `recall` or `precision`, as [documented here](https://trafilatura.readthedocs.io/en/latest/usage-cli.html#optimizing-for-precision-and-recall).
- `--strip-tag` - An HTML tag to be stripped from the DOM before conversion.
  - This argument can be supplied multiple times.
  - By default, `<img>` tags are stripped; use `--empty-strip-tag` to keep them.
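
The options above might be combined as in the sketch below. It is illustrative only: the URL is a placeholder, the tag names passed to `--strip-tag` (`nav`, `footer`) are arbitrary examples, and writing them without angle brackets is an assumption about the expected format. The `command -v` guard just makes the snippet safe to paste on a machine where the tool is not installed yet.

```bash
# Placeholder source; substitute a real URL or a local HTML path.
src="https://example.com/article"

if command -v article-to-md >/dev/null 2>&1; then
  # Trafilatura extraction, favoring precision over recall.
  article-to-md "$src" --method trafilatura --favor precision

  # Default readability method with basic ad removal enabled.
  article-to-md "$src" --remove-ads

  # Strip additional tags (assumed to be given as bare tag names).
  article-to-md "$src" --strip-tag nav --strip-tag footer

  # Keep <img> tags instead of stripping them.
  article-to-md "$src" --empty-strip-tag
fi
```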

## Features

- Stealth Requests: Uses curl_cffi to impersonate a Chrome browser, which helps avoid bot detection.
- Enhanced Markdown:
  - Converts `<var>` to italics.
  - Includes `<abbr>` titles in the text output.
  - Converts HTML tables to Markdown tables.