Metadata-Version: 2.4
Name: capcat
Version: 1.1.9
Summary: A command-line tool designed to solve content preservation challenges with Ethical Scraping.
Author: Stayu Kasabov - Product Designer and Experiences Builder | AI-powered Prototyping & MVP | Strategic Generalist
License: MIT-Style Non-Commercial License
        
        Copyright (c) 2025 Stayu Kasabov
        
        Original Product: Capcat - News Article Archiving System
        Author: Stayu Kasabov | https://stayux.com
        Product Designer with Holistic Production Expertise
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction for NON-COMMERCIAL PURPOSES ONLY,
        including without limitation the rights to use, copy, modify, merge, publish,
        distribute, sublicense, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        1. NON-COMMERCIAL USE ONLY: This software may not be used for commercial
           purposes. Commercial purposes include, but are not limited to: selling
           the software, using it in a commercial product or service, or using it
           to generate revenue.
        
        2. ATTRIBUTION: The above copyright notice and this permission notice shall
           be included in all copies or substantial portions of the Software. Credit
           must be given to the original author: Stayu Kasabov (https://stayux.com)
        
        3. SHARE ALIKE: Any modifications or derivative works must be released under
           the same non-commercial terms.
        
        4. CONTRIBUTIONS WELCOME: Users are encouraged to contribute improvements
           back to the original project.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
        ---
        
        A free and open-source tool to make people's lives easier.
        Contributions welcome! Contact: Stayu Kasabov | https://stayux.com
Project-URL: Homepage, https://github.com/<owner>/capcat
Classifier: Programming Language :: Python :: 3
Classifier: License :: Free for non-commercial use
Classifier: Operating System :: OS Independent
Classifier: Environment :: Console
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE.txt
Requires-Dist: requests<3.0.0,>=2.28.0
Requires-Dist: beautifulsoup4<5.0.0,>=4.12.0
Requires-Dist: PyYAML<7.0,>=6.0
Requires-Dist: feedparser<7.0,>=6.0
Requires-Dist: questionary<3.0,>=2.0
Requires-Dist: markdownify<1.0,>=0.11
Requires-Dist: ruamel.yaml<0.19,>=0.17
Requires-Dist: validators<1.0,>=0.20
Requires-Dist: prompt_toolkit<4.0,>=3.0
Requires-Dist: yt-dlp<2027.0.0,>=2023.1.6
Requires-Dist: markdown<4.0,>=3.5
Requires-Dist: pygments<3.0,>=2.16
Requires-Dist: charset-normalizer<4.0,>=3.0
Requires-Dist: rich<15.0,>=13.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: mypy>=1.0; extra == "dev"
Requires-Dist: ruff>=0.1; extra == "dev"
Requires-Dist: build>=1.0; extra == "dev"
Requires-Dist: twine>=5.0; extra == "dev"
Dynamic: license-file

# Capcat — Content Preservation CLI

A command-line tool designed to solve content preservation challenges with Ethical Scraping.

Captures articles from 17+ curated sources as clean Markdown files with optional self-contained HTML output. Supports interactive TUI and batch automation.

## Installation

```bash
pip install capcat
```

Requires Python 3.9+.

## Quick Start

```bash
# Interactive TUI
capcat catch

# Fetch a bundle
capcat bundle tech --count 10

# Fetch specific sources
capcat fetch hn,bbc --count 15

# Archive a single article
capcat single https://example.com/article

# List available sources
capcat list sources

# Show version
capcat --version
```

No init required — capcat initializes automatically on first run.

## Commands

| Command | Description |
|---------|-------------|
| `catch` | Launch the interactive TUI |
| `single <url>` | Archive a single article |
| `fetch <sources>` | Batch fetch from sources (comma-separated) |
| `bundle <name>` | Fetch a pre-configured bundle |
| `list sources` | List all available sources |
| `list bundles` | List all available bundles |
| `add-source --url <url>` | Add a custom RSS/news source |
| `remove-source` | Remove a source |
| `generate-config` | Generate a YAML config |
| `init` | Manually initialize project in current directory |

## Options

| Flag | Description |
|------|-------------|
| `--count N` | Number of articles to fetch (default: 30) |
| `--output DIR` | Output directory (default: current dir) |
| `--media` | Download video, audio, and PDF files |
| `--html` | Generate self-contained HTML output |
| `--update` | Re-fetch and update existing articles |
| `-V, --verbose` | Verbose output |
| `-q, --quiet` | Quiet output |
| `-L <file>` | Log output to file |
| `--version` | Show version and exit |
| `--help` | Show help and exit |

## Bundles

Pre-configured topic collections:

| Bundle | Sources | Description |
|--------|---------|-------------|
| `tech` | IEEE, Mashable | Consumer technology news |
| `techpro` | HN, Lobsters, InfoQ | Professional developer news |
| `ai` | MIT News, Google Research | AI research and developments |
| `science` | Nature, Scientific American | Scientific publications |
| `news` | BBC, Guardian | General news |
| `sports` | BBC Sport | Sports coverage |

## Available Sources

**Tech**: Hacker News (`hn`), Lobsters (`lb`), InfoQ (`iq`), IEEE Spectrum (`ieee`), Mashable, Gizmodo, Futurism

**AI**: Google Research (`googleai`), OpenAI (`openai`), MIT News (`mitnews`), LessWrong (`lesswrong`)

**News**: BBC (`bbc`), The Guardian (`guardian`)

**Science**: Nature (`nature`), Scientific American (`scientificamerican`)

**Sports**: BBC Sport (`bbcsport`)

## Output Structure

### Batch mode (`fetch` / `bundle`)

```
News/news_DD-MM-YYYY/
├── Hacker-News_DD-MM-YYYY/
│   ├── 01_Article_Title/
│   │   ├── article.md
│   │   ├── comments.md
│   │   ├── html/
│   │   │   ├── article.html
│   │   │   └── comments.html
│   │   └── images/
│   └── 02_Another_Article/
└── BBC_DD-MM-YYYY/
```

### Single article mode

```
Capcats/cc_DD-MM-YYYY-Title/
├── article.md
├── html/
│   └── article.html
└── images/
```

HTML output is fully self-contained — embedded CSS, no external dependencies. Open in any browser, share via email, archive permanently.
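
The folder names shown above follow a date-stamped pattern. A minimal Python sketch of how such paths could be composed is given here. The helper names (`slugify`, `stamp`, `batch_article_dir`, `single_article_dir`), the slug rule, and the 40-character title cap are assumptions for illustration, not Capcat's actual implementation.

```python
import re
from datetime import date
from pathlib import PurePosixPath


def slugify(title: str, sep: str = "_", max_len: int = 40) -> str:
    """Collapse runs of non-alphanumeric characters into `sep` (assumed rule)."""
    slug = re.sub(r"[^A-Za-z0-9]+", sep, title).strip(sep)
    return slug[:max_len].rstrip(sep)


def stamp(day: date) -> str:
    """Format a date as DD-MM-YYYY, matching the layout shown above."""
    return day.strftime("%d-%m-%Y")


def batch_article_dir(source: str, index: int, title: str, day: date) -> PurePosixPath:
    """Folder for one article inside a batch run (fetch / bundle)."""
    return (
        PurePosixPath("News")
        / f"news_{stamp(day)}"
        / f"{slugify(source, sep='-')}_{stamp(day)}"
        / f"{index:02d}_{slugify(title)}"
    )


def single_article_dir(title: str, day: date) -> PurePosixPath:
    """Folder for an article archived in single article mode."""
    return PurePosixPath("Capcats") / f"cc_{stamp(day)}-{slugify(title)}"


if __name__ == "__main__":
    print(batch_article_dir("Hacker News", 1, "Article Title", date.today()))
    print(single_article_dir("Article Title", date.today()))
```

A helper like this can be handy in scripts that post-process an archive and need to locate the folder for a given day.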

## Configuration

Optional `capcat.yml` in your project directory:

```yaml
output_base_dir: "../MyNews"
max_workers: 8
download_media: false
```

Config priority, highest first: CLI args → environment variables → `capcat.yml` → defaults.
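
The layering described above can be sketched with `collections.ChainMap`, where the first mapping that contains a key wins. This is an illustration only: the `CAPCAT_` environment prefix, the `resolve_config` function, and the default values are assumptions, not Capcat's documented behavior.

```python
import os
from collections import ChainMap

# Hypothetical prefix for environment overrides; not confirmed by Capcat's docs.
ENV_PREFIX = "CAPCAT_"

# Assumed defaults, for illustration only.
DEFAULTS = {"output_base_dir": ".", "max_workers": 8, "download_media": False}


def resolve_config(cli_args, environ, file_config, defaults=DEFAULTS):
    """Merge four layers; earlier layers override later ones."""
    # Flags the user did not pass are None and must not mask lower layers.
    cli_layer = {key: value for key, value in cli_args.items() if value is not None}
    # CAPCAT_MAX_WORKERS -> max_workers. Values stay strings in this sketch.
    env_layer = {
        key[len(ENV_PREFIX):].lower(): value
        for key, value in environ.items()
        if key.startswith(ENV_PREFIX)
    }
    return dict(ChainMap(cli_layer, env_layer, file_config, defaults))


if __name__ == "__main__":
    file_config = {"output_base_dir": "../MyNews", "max_workers": 8}
    print(resolve_config({"max_workers": None}, os.environ, file_config))
```

Environment values arrive as strings, so a real implementation would also need to convert them to the expected types.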

## Automation

```bash
# Daily tech news
0 9 * * * cd ~/news && capcat bundle tech --count 20 --html

# Weekly science digest
0 10 * * 0 cd ~/news && capcat bundle science --count 30 --media
```
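
Cron runs jobs with a minimal environment, so some setups are easier to maintain with a small wrapper script than with long crontab lines. The sketch below assumes `capcat` is on `PATH` and uses only the flags listed under Options. The function names `bundle_command` and `run_bundle` are hypothetical.

```python
import subprocess
from pathlib import Path


def bundle_command(bundle, count=30, html=False, media=False):
    """Build the argument list for a `capcat bundle` run."""
    cmd = ["capcat", "bundle", bundle, "--count", str(count)]
    if html:
        cmd.append("--html")
    if media:
        cmd.append("--media")
    return cmd


def run_bundle(workdir, **options):
    """Run the command inside `workdir` and return the process exit code."""
    result = subprocess.run(
        bundle_command(**options),
        cwd=Path(workdir).expanduser(),
        check=False,  # leave error handling to the caller
    )
    return result.returncode


if __name__ == "__main__":
    # Mirrors the daily tech job above.
    raise SystemExit(run_bundle("~/news", bundle="tech", count=20, html=True))
```

Passing the arguments as a list avoids shell quoting issues, and the exit code lets cron or a monitoring tool detect failed runs.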

## Privacy and Ethics

- Usernames anonymized as "Anonymous" in comment archives
- Respects `robots.txt`
- Rate limiting: 1 request per 10 seconds
- Prefers RSS/APIs over HTML scraping
- No paywall circumvention
- Proper source attribution
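
Two of the points above, honoring `robots.txt` and spacing requests 10 seconds apart, can be sketched with the standard library alone. This is not Capcat's code: the `PoliteGate` class, the `capcat-example` user agent, and the injectable clock are assumptions made for illustration.

```python
import time
from urllib.robotparser import RobotFileParser


class PoliteGate:
    """Checks robots.txt rules and enforces a minimum gap between requests."""

    def __init__(self, robots_lines, user_agent="capcat-example",
                 min_interval=10.0, clock=time.monotonic, sleep=time.sleep):
        self._robots = RobotFileParser()
        self._robots.parse(robots_lines)  # lines of an already-downloaded robots.txt
        self._agent = user_agent          # hypothetical user agent string
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None                 # time of the previous request, if any

    def allowed(self, url):
        """Return True if robots.txt permits fetching `url`."""
        return self._robots.can_fetch(self._agent, url)

    def wait_turn(self):
        """Sleep until the interval has elapsed; return the seconds waited."""
        waited = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            waited = max(0.0, self._min_interval - elapsed)
            if waited:
                self._sleep(waited)
        self._last = self._clock()
        return waited


if __name__ == "__main__":
    gate = PoliteGate(["User-agent: *", "Disallow: /private/"])
    for url in ["https://example.com/news/1", "https://example.com/private/1"]:
        if gate.allowed(url):
            gate.wait_turn()
            print("would fetch", url)
        else:
            print("skipped by robots.txt:", url)
```

Injecting `clock` and `sleep` keeps the gate testable without real delays.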

## Documentation

Full documentation at [capcat.org](https://capcat.org):
- [Quick Start Guide](https://capcat.org/docs/quick-start.html)
- [Architecture Overview](https://capcat.org/docs/architecture.html)
- [Source Development](https://capcat.org/docs/source-development.html)
- [Interactive Mode](https://capcat.org/docs/interactive-mode.html)

## Contributing

Open an issue or pull request on [GitHub](https://github.com/stayukasabov/capcat).

## License

MIT-Style Non-Commercial License; see [LICENSE.txt](LICENSE.txt)

## Links

- **Website**: [capcat.org](https://capcat.org)
- **Repository**: [github.com/stayukasabov/capcat](https://github.com/stayukasabov/capcat)
- **Issues**: [github.com/stayukasabov/capcat/issues](https://github.com/stayukasabov/capcat/issues)
