Metadata-Version: 2.4
Name: AutoWebPdfSummarizer
Version: 0.1.9
Summary: Summarize web pages and PDFs with Google Gemini
License: MIT
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests>=2.31
Requires-Dist: playwright>=1.43
Requires-Dist: pillow>=9.0
Requires-Dist: PyMuPDF>=1.23
Requires-Dist: nest_asyncio>=1.5
Requires-Dist: google-generativeai>=0.5.0
Dynamic: license-file

# AI Knowledge Summarizer

`AutoWebPdfSummarizer` packages the core logic of the original notebook as a reusable,
pip-installable library. It classifies each incoming URL as either a standard web page or a
PDF document, extracts text and imagery, and sends the material to Google Gemini for a
structured summary.

## Installation

The project uses Playwright for browser automation. Install the Python package and the
Chromium browser binaries:

```bash
pip install AutoWebPdfSummarizer
playwright install chromium
```

Additional runtime dependencies (such as PyMuPDF) are pulled in automatically via the
package metadata.

## Usage

```python
import logging
from AutoWebPdfSummarizer import summarize_url

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s - %(message)s",
)

result = summarize_url(
    "https://example.com/article",
    google_api_key="YOUR_API_KEY",
    logger=logging.getLogger("demo"),
    min_gemini_interval=2.0,
)

print(result.summary)
```

Key features:

- Automatic detection of PDF vs. HTML content.
- Smart truncation of large text blocks and screenshot size management for web pages.
- PDF rendering and text extraction powered by PyMuPDF.
- Customizable logging: pass any `logging.Logger` instance or rely on the built-in
  no-op logger.
- Configurable Gemini prompt, model selection, and request limits.
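
As an illustration of the PDF vs. HTML detection listed above, the following is a minimal
sketch of one common heuristic. It is not the library's actual implementation:
`looks_like_pdf` and its parameters are hypothetical names, and the order of checks (magic
bytes, then `Content-Type`, then file extension) is an assumption.

```python
from typing import Optional
from urllib.parse import urlparse

PDF_MAGIC = b"%PDF-"  # PDF files begin with this signature


def looks_like_pdf(url: str, content_type: Optional[str] = None, head: bytes = b"") -> bool:
    """Heuristically decide whether a resource is a PDF (hypothetical helper)."""
    # 1. Magic bytes are the strongest signal when part of the body is available.
    if head.lstrip().startswith(PDF_MAGIC):
        return True
    # 2. Fall back to the declared media type, ignoring parameters such as charset.
    if content_type:
        media_type = content_type.split(";", 1)[0].strip().lower()
        if media_type == "application/pdf":
            return True
    # 3. Finally, check the path extension (the query string is ignored).
    return urlparse(url).path.lower().endswith(".pdf")
```

Checking the body signature first guards against servers that return PDFs with a generic
`Content-Type` such as `application/octet-stream`.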

### Enabling detailed Gemini call tracing

Set the logger level to `DEBUG` to surface the additional timings that wrap every
`model.generate_content` call:

```python
logging.getLogger("demo").setLevel(logging.DEBUG)
```

The summarizer reports how long each Gemini request takes and logs any exception before it
propagates, which makes it easier to diagnose stalls around the
`"Sending %d parts to Gemini"` log message.
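
The timing and exception logging described above can be approximated with a small wrapper.
This is a sketch only: `timed_call` is a hypothetical helper, not part of the package, and
the log message wording is assumed.

```python
import logging
import time
from typing import Any, Callable


def timed_call(
    logger: logging.Logger,
    label: str,
    func: Callable[..., Any],
    *args: Any,
    **kwargs: Any,
) -> Any:
    """Run func, log its duration at DEBUG, and log any exception before re-raising."""
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
    except Exception:
        elapsed = time.perf_counter() - start
        # logger.exception records the traceback at ERROR level.
        logger.exception("%s failed after %.2fs", label, elapsed)
        raise
    elapsed = time.perf_counter() - start
    logger.debug("%s finished in %.2fs", label, elapsed)
    return result
```

A call such as `timed_call(logger, "gemini", model.generate_content, parts)` would then
emit one timing line per request when the logger is set to `DEBUG`.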

### Running inside Google Colab with a debugger

Colab notebooks can forward debug sessions using `debugpy`. Install it and start a listener
before invoking the summarizer:

```python
# Notebook shell escape: run this in a Colab cell, not in a plain Python script.
!pip install debugpy

import debugpy

debugpy.listen(("0.0.0.0", 5678))
print("debugpy is listening on port 5678")
debugpy.wait_for_client()  # optional: pause until your IDE attaches
```

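With the listener active, attach your local IDE (VS Code, PyCharm, etc.) to the running
Colab kernel using its public URL and port 5678. Colab runtimes are generally not reachable
from outside directly, so obtaining a public URL usually means exposing the port through a
tunnel or port-forwarding service. Once connected, you can set breakpoints in the notebook
and inspect the Gemini calls while the enhanced logging streams to the notebook output.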

## Configuration Options

`summarize_url` accepts several optional keyword arguments:

- `prompt`: supply a custom Gemini prompt string. The default prompt produces an English
  analyst-style summary.
- `max_chars`: maximum number of characters retained from the extracted text (default
  `6000`).
- `max_image_mb`: per-image size ceiling in megabytes for web page screenshots (default
  `4.0`).
- `max_pdf_pages`: number of PDF pages to process (default `5`).
- `request_timeout`: timeout in seconds applied to HTTP requests and Playwright navigation
  (default `300`).
- `min_gemini_interval`: minimum delay in seconds enforced between Gemini requests (default
  `2.0`).
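
To illustrate what a minimum interval between requests means in practice, here is a minimal
throttle sketch. It is an assumption about how `min_gemini_interval` could be enforced, not
the package's code. `MinIntervalThrottle` is a hypothetical class, and the injectable
`clock` and `sleep` arguments exist only to make it testable.

```python
import time
from typing import Callable, Optional


class MinIntervalThrottle:
    """Ensure at least min_interval seconds elapse between successive calls."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time the previous call was released

    def wait(self) -> float:
        """Block until the interval has passed; return the number of seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay
```

Calling `throttle.wait()` immediately before each request keeps consecutive requests at
least `min_interval` seconds apart, while the first request proceeds without delay.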

For streaming chunk-by-chunk updates when summarizing PDFs, use
`summarize_url_stream`, which yields `ChunkSummary` objects for each processed chunk before
emitting the final `SummarizationResult`.

The Google API key can be provided explicitly or via the `GOOGLE_API_KEY` environment
variable.
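
The key lookup described above can be sketched as follows. `resolve_api_key` is a
hypothetical helper, and giving the explicit argument precedence over `GOOGLE_API_KEY` is
an assumption; the package may resolve the key differently.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    explicit: Optional[str] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the explicit key if given, else GOOGLE_API_KEY from the environment."""
    env = os.environ if environ is None else environ
    # Assumed precedence: explicit argument first, environment variable second.
    for candidate in (explicit, env.get("GOOGLE_API_KEY")):
        if candidate and candidate.strip():
            return candidate.strip()
    raise ValueError("No API key: pass one explicitly or set GOOGLE_API_KEY")
```

Failing early with a clear error avoids a less informative authentication failure on the
first Gemini request.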

## Development

Install the local package in editable mode, then install the Playwright Chromium binary:

```bash
pip install -e .
playwright install chromium
```

Then byte-compile the sources as a basic syntax check:

```bash
python -m compileall src
```

## License

MIT
