Metadata-Version: 2.4
Name: scitex-web
Version: 0.1.2
Summary: Web scraping (URLs, images), PubMed search, URL summarization helpers — standalone module from the SciTeX ecosystem
Author-email: Yusuke Watanabe <ywatanabe@scitex.ai>
License-Expression: AGPL-3.0-only
Project-URL: Homepage, https://github.com/ywatanabe1989/scitex-web
Project-URL: Repository, https://github.com/ywatanabe1989/scitex-web
Project-URL: Documentation, https://scitex-web.readthedocs.io
Keywords: scitex,web,scraping,pubmed,url-summarize
Classifier: Development Status :: 3 - Alpha
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Internet :: WWW/HTTP
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests
Requires-Dist: aiohttp
Requires-Dist: beautifulsoup4>=4.12.0
Requires-Dist: tqdm
Provides-Extra: readability
Requires-Dist: readability-lxml; extra == "readability"
Provides-Extra: summarize
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Provides-Extra: docs
Requires-Dist: sphinx>=7.0; extra == "docs"
Requires-Dist: sphinx-rtd-theme>=2.0; extra == "docs"
Requires-Dist: myst-parser>=2.0; extra == "docs"
Requires-Dist: sphinx-copybutton>=0.5; extra == "docs"
Requires-Dist: sphinx-autodoc-typehints>=1.25; extra == "docs"
Provides-Extra: all
Requires-Dist: scitex-web[readability]; extra == "all"
Requires-Dist: scitex-web[summarize]; extra == "all"
Dynamic: license-file

# scitex-web

<!-- scitex-badges:start -->
[![PyPI](https://img.shields.io/pypi/v/scitex-web.svg)](https://pypi.org/project/scitex-web/)
[![Python](https://img.shields.io/pypi/pyversions/scitex-web.svg)](https://pypi.org/project/scitex-web/)
[![Tests](https://github.com/ywatanabe1989/scitex-web/actions/workflows/test.yml/badge.svg)](https://github.com/ywatanabe1989/scitex-web/actions/workflows/test.yml)
[![Install Test](https://github.com/ywatanabe1989/scitex-web/actions/workflows/install-test.yml/badge.svg)](https://github.com/ywatanabe1989/scitex-web/actions/workflows/install-test.yml)
[![Coverage](https://codecov.io/gh/ywatanabe1989/scitex-web/graph/badge.svg)](https://codecov.io/gh/ywatanabe1989/scitex-web)
[![Docs](https://readthedocs.org/projects/scitex-web/badge/?version=latest)](https://scitex-web.readthedocs.io/en/latest/)
[![License: AGPL v3](https://img.shields.io/badge/license-AGPL_v3-blue.svg)](https://www.gnu.org/licenses/agpl-3.0)
<!-- scitex-badges:end -->


Web scraping, PubMed search, and URL summarization helpers, extracted from the [SciTeX](https://github.com/ywatanabe1989/scitex-python) ecosystem as a standalone package.

## Install

```bash
pip install scitex-web
pip install "scitex-web[readability]"   # readability-lxml for cleaner extraction
```
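
Because `readability` is an optional extra, calling code may want to check for it before relying on it. The sketch below is one way to do that with the standard library. The helper name `has_module` is hypothetical and not part of scitex-web; it assumes the `readability-lxml` distribution is imported as `readability`.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module called `name` can be imported."""
    # find_spec returns None when no such top-level module is installed.
    return importlib.util.find_spec(name) is not None


# Hypothetical feature flag: True only if the optional extra is installed.
HAS_READABILITY = has_module("readability")
```

A caller could branch on `HAS_READABILITY` and fall back to plain HTML parsing when the extra is missing.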

## API

```python
import scitex_web as web

url = "https://example.com"  # page to scrape

# Scraping
web.get_urls(url, pattern=r"\.pdf$")
web.get_image_urls(url, min_size=128)
web.download_images(url, out_dir="imgs", same_domain=True)

# PubMed
web.search_pubmed("CRISPR Cas9 review", retmax=50)

# URL summarization (requires the umbrella scitex package, which provides scitex.ai.GenAI)
web.summarize_url("https://example.com/article")
```
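
To make the `pattern` argument of `get_urls` concrete, here is a minimal offline sketch of regex-based link filtering. It is not the package's implementation: scitex-web depends on `requests` and `bs4`, whereas this sketch uses only the standard library and a hard-coded HTML string so it runs without network access. The function name `filter_links` is hypothetical, and matching the regex against absolute URLs with `re.search` is an assumption.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def filter_links(html, base_url, pattern=None):
    """Return absolute link URLs, optionally keeping only regex matches."""
    collector = _LinkCollector()
    collector.feed(html)
    # Resolve relative hrefs against the page URL.
    absolute = [urljoin(base_url, href) for href in collector.hrefs]
    if pattern is None:
        return absolute
    regex = re.compile(pattern)
    return [link for link in absolute if regex.search(link)]


sample = (
    '<a href="/papers/a.pdf">A</a> '
    '<a href="about.html">About</a> '
    '<a href="https://other.org/b.pdf">B</a>'
)
pdf_links = filter_links(sample, "https://example.com/index.html", pattern=r"\.pdf$")
```

With this sample, `pdf_links` contains the two PDF URLs and drops `about.html`.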

## Status

Standalone fork of `scitex.web`. Runtime dependencies: `requests`, `aiohttp`,
`beautifulsoup4`, and `tqdm`. The umbrella package keeps its `scitex.web` import
path through a `sys.modules` alias bridge.
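
For readers unfamiliar with the technique, a `sys.modules` alias bridge can be sketched as follows. This is an illustration of the general pattern, not the actual code in `scitex`; the module names `demo_standalone_web` and `demo_umbrella` and the helper `alias_module` are hypothetical stand-ins so the example runs without either package installed.

```python
import importlib
import sys
import types


def alias_module(alias: str, target: types.ModuleType) -> None:
    """Make `target` importable under the dotted name `alias`."""
    sys.modules[alias] = target
    # Also expose it as an attribute of the parent package, if one is loaded,
    # so that `parent.child` attribute access works after import.
    parent_name, _, child_name = alias.rpartition(".")
    parent = sys.modules.get(parent_name)
    if parent is not None:
        setattr(parent, child_name, target)


# Stand-in for the standalone package (plays the role of scitex_web).
standalone = types.ModuleType("demo_standalone_web")
standalone.get_urls = lambda url: [url]
sys.modules["demo_standalone_web"] = standalone

# Stand-in for the umbrella package (plays the role of scitex).
umbrella = types.ModuleType("demo_umbrella")
umbrella.__path__ = []  # mark the stand-in as a package
sys.modules["demo_umbrella"] = umbrella

# The bridge: the old dotted path now resolves to the standalone module.
alias_module("demo_umbrella.web", standalone)
legacy_web = importlib.import_module("demo_umbrella.web")
```

After the alias is registered, `legacy_web` and `standalone` are the same module object, so existing imports of the old path keep working.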

Decoupling notes:
- `scitex.logging.getLogger` → stdlib `logging.getLogger`.
- `scitex.str.printc` (colored print) → tiny inline ANSI helper.
- `scitex.ai.GenAI` (used by `summarize_url`) → deferred import that raises
  a clear ImportError if the umbrella `scitex` package isn't installed.
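
The three replacements above could look roughly like this. It is a sketch under stated assumptions rather than the package's source: the names `colorize` and `load_genai`, the color table, and the error message wording are all hypothetical.

```python
import logging

# Stdlib logger in place of scitex.logging.getLogger.
logger = logging.getLogger(__name__)

# Tiny inline ANSI helper in place of scitex.str.printc (hypothetical colors).
_ANSI_COLORS = {
    "red": "\033[31m",
    "green": "\033[32m",
    "yellow": "\033[33m",
    "blue": "\033[34m",
}
_ANSI_RESET = "\033[0m"


def colorize(message: str, color: str = "green") -> str:
    """Wrap `message` in ANSI color codes; unknown colors return it unchanged."""
    code = _ANSI_COLORS.get(color)
    if code is None:
        return message
    return f"{code}{message}{_ANSI_RESET}"


def load_genai():
    """Import scitex.ai.GenAI only when summarization is actually requested."""
    try:
        # Deferred import: the rest of the module never needs scitex.
        from scitex.ai import GenAI
    except ImportError as exc:
        raise ImportError(
            "summarize_url requires the umbrella 'scitex' package, "
            "which provides scitex.ai.GenAI."
        ) from exc
    return GenAI
```

Deferring the import this way keeps scraping and PubMed search usable without the umbrella package, and the failure only surfaces when summarization is called.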

Test status: 14 of 23 tests pass, 7 fail, and 2 are skipped. The 7 failures
involve bs4 mocking; they predate the extraction and also fail in scitex-python.

## License

AGPL-3.0-only (see [LICENSE](./LICENSE)).
