Metadata-Version: 2.4
Name: ggwebextract
Version: 0.1.1
Summary: Add your description here
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: html2text>=2025.4.15
Requires-Dist: lxml>=6.0.2
Requires-Dist: readability-lxml>=0.8.4.1
Requires-Dist: selenium>=4.41.0
Dynamic: license-file

## Installation

### Install chromium

If Chromium is not detected after installation, set the `CHROME_PATH` environment variable to the path of the Chromium executable.

**Linux**

```sh
sudo apt update
sudo apt upgrade
sudo apt install chromium
```

**Windows**

You can install Chromium from [woolyss](https://chromium.woolyss.com/).
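
The `CHROME_PATH` fallback described above can be sketched as follows. This is only an illustration, not the package's actual detection code: the function name `find_chromium`, the candidate binary names, and the lookup order (environment variable first, then `PATH`) are all assumptions.

```python
import os
import shutil
from pathlib import Path

# Hypothetical candidate names; the real binary name varies by platform and package.
CANDIDATE_NAMES = ("chromium", "chromium-browser", "chrome")


def find_chromium(env=None):
    """Return the path to a Chromium executable, or None if none is found."""
    env = os.environ if env is None else env
    override = env.get("CHROME_PATH")
    if override:
        # Assumption: an explicit CHROME_PATH wins, but only if the file exists.
        return override if Path(override).is_file() else None
    for name in CANDIDATE_NAMES:
        found = shutil.which(name)  # searches the directories listed in PATH
        if found:
            return found
    return None
```

Calling `find_chromium()` with no arguments reads the real environment; passing a dictionary makes the lookup easy to exercise in tests.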

### Install venv to run tests

```sh
uv run python  # uv should create the venv and install dependencies automatically
```

## Usage

- Call a method to get the HTML content of a page.
- Or work with a "tab" handle if you want to automate a flow.

Check the tests for details on how to use the package.

## Notes

**Browser control**

- selenium (used here): old and battle-tested
- pyppeteer: Chrome only, uses the Chrome debugging protocol
- playwright: modern, supports more browsers

**How to set user agents**

- The only working option found is the `--user-agent` Chromium command-line argument.
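
A minimal sketch of passing that argument is shown below. The helper names `user_agent_argument` and `chrome_options_with_user_agent` are hypothetical and not part of this package; the Selenium part relies only on `ChromeOptions.add_argument`.

```python
def user_agent_argument(user_agent: str) -> str:
    """Build the Chromium command-line argument that overrides the user agent."""
    if not user_agent.strip():
        raise ValueError("user_agent must not be empty")
    # No shell quoting is needed when the argument is passed as one list element.
    return f"--user-agent={user_agent}"


def chrome_options_with_user_agent(user_agent: str):
    """Return Selenium ChromeOptions carrying the user-agent argument."""
    # Imported lazily so the pure helper above works without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument(user_agent_argument(user_agent))
    return options
```

The same string can be appended to the argument list when Chromium is launched directly as a subprocess.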

**Chromium started in its own process**

- Vanilla Selenium with a patched driver does not work: the browser is detected as a bot. The cause is that Selenium launches Chromium itself, so Chromium has to be started independently, in its own process.
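
One way to start Chromium independently and then attach Selenium to it is sketched below. This is an assumption about how such a setup could look, not a description of this package's implementation: the function names, the port, and the profile directory are hypothetical, and attaching through the remote debugging port is just one possible approach.

```python
import subprocess


def chromium_command(executable, port, profile_dir, user_agent=None):
    """Build the argument list for launching Chromium with remote debugging enabled."""
    command = [
        executable,
        f"--remote-debugging-port={port}",
        f"--user-data-dir={profile_dir}",  # separate profile for the automated browser
    ]
    if user_agent:
        command.append(f"--user-agent={user_agent}")
    return command


def launch_chromium(executable, port=9222, profile_dir="/tmp/chromium-profile"):
    """Start Chromium as its own process; Selenium is not involved in the launch."""
    return subprocess.Popen(chromium_command(executable, port, profile_dir))


def attach(port=9222):
    """Attach Selenium to the already running Chromium instead of launching one."""
    from selenium import webdriver  # lazy import: only needed when attaching

    options = webdriver.ChromeOptions()
    options.debugger_address = f"127.0.0.1:{port}"
    return webdriver.Chrome(options=options)
```

In practice the caller would wait until the debugging port accepts connections before calling `attach`, and terminate the `Popen` handle when finished.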

## References

- [undetected-chromedriver](https://github.com/ultrafunkamsterdam/undetected-chromedriver)

## Improvements

- ?
