Metadata-Version: 2.4
Name: blog2md
Version: 1.0.1
Summary: Convert web pages to clean Markdown format
Author: blog2md
License: MIT
Project-URL: Homepage, https://github.com/yourusername/blog2md
Project-URL: Repository, https://github.com/yourusername/blog2md
Project-URL: Documentation, https://github.com/yourusername/blog2md#readme
Project-URL: Issues, https://github.com/yourusername/blog2md/issues
Keywords: web,scraping,markdown,html,converter,blog
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Text Processing :: Filters
Classifier: Topic :: Internet :: WWW/HTTP
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests>=2.28.0
Requires-Dist: beautifulsoup4>=4.12.0
Requires-Dist: lxml>=4.9.0
Requires-Dist: trafilatura>=1.6.0
Requires-Dist: markdownify>=0.11.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: pydantic-settings>=2.0.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: typer>=0.9.0
Requires-Dist: httpx>=0.24.0
Requires-Dist: aiofiles>=23.0.0
Requires-Dist: aiosqlite>=0.19.0
Requires-Dist: feedparser>=6.0.0
Provides-Extra: js
Requires-Dist: playwright>=1.40.0; extra == "js"
Provides-Extra: dev
Requires-Dist: pytest>=7.4.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Dynamic: license-file

# blog2md

[English](#english) | [中文](#中文)

Convert web pages to clean Markdown format.

---

## English

### Use Cases

| Scenario | Command | Description |
|----------|---------|-------------|
| Save blog articles offline | `blog2md <url>` | Fetch blog and convert to Markdown |
| Backup technical docs | `blog2md <doc-url> --force-js` | Render JS then fetch |
| Batch process websites | `blog2md --batch urls.txt` | Concurrent URL processing |
| Subscribe to blog updates | `blog2md --rss <feed-url>` | Auto-discover articles from RSS |
| Build knowledge base | `blog2md <url> --crawl` | Recursively crawl links |
| Extract video subtitles | `blog2md <url>` | Get page text (video not included) |
| Migrate to Obsidian | `blog2md <url> --target obsidian --vault-path /path/to/vault` | Write directly to Obsidian vault |

### Installation

```bash
# Basic (CLI + SDK)
pip install blog2md

# With JavaScript rendering (for SPA/React/Vue sites)
pip install "blog2md[js]"
playwright install chromium

# From source
git clone https://github.com/nestedcat/blog2md.git
cd blog2md
pip install -e ".[js]"
```

### CLI Examples

```bash
# 1. Basic extraction (most common)
blog2md https://example.com/article

# 2. Specify output directory
blog2md https://example.com/article --output ./docs

# 3. Output to stdout
blog2md https://example.com/article --stdout

# 4. Keep original image URLs (no download)
blog2md https://example.com/article --images keep

# 5. Skip YAML frontmatter
blog2md https://example.com/article --no-frontmatter

# 6. Batch process URL list
blog2md --batch urls.txt --concurrency 5

# 7. RSS subscription
blog2md --rss https://example.com/feed.xml --output ./articles

# 8. JavaScript rendering (for SPA/React/Vue)
blog2md https://example.com/spa-page --force-js

# 9. Crawl with depth 1 (same domain)
blog2md https://example.com --crawl --crawl-depth 1

# 10. Crawl cross-domain with depth 2
blog2md https://example.com --crawl --crawl-depth 2 --crawl-cross-domain

# 11. Crawl with delay (reduce server load)
blog2md https://example.com --crawl --crawl-depth 2 --crawl-delay 2.0

# 12. Force refresh (ignore cache)
blog2md https://example.com/article --force

# 13. Write to GitHub repo
blog2md https://example.com/article \
  --target github \
  --github-token ghp_xxx \
  --github-repo owner/repo \
  --github-path content/posts/

# 14. Write to Obsidian vault
blog2md https://example.com/article \
  --target obsidian \
  --vault-path /path/to/your/vault
```

### Python API

```python
import blog2md

# One-liner to extract
result = blog2md.extract("https://example.com/article")
print(result.markdown)
print(result.title)    # Title from frontmatter
print(result.author)   # Author
print(result.date)     # Publication date

# Save to file
result.save("/path/to/file.md")
```
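
For several URLs, the same two calls can be wrapped in a small loop. The sketch below is an illustration, not part of the package: `save_all` and the `article-NNN.md` naming are hypothetical, and the extract function is passed in as a parameter (pass `blog2md.extract`) so the only behavior assumed is what the example above shows, namely that the result has a `save(path)` method.

```python
from pathlib import Path
from typing import Callable, Iterable


def save_all(urls: Iterable[str], out_dir: str, extract: Callable) -> list[Path]:
    """Extract each URL and save it as article-001.md, article-002.md, ...

    `extract` is expected to behave like blog2md.extract: it takes a URL and
    returns an object with a save(path) method. A failing URL is reported and
    skipped so it does not stop the rest of the run.
    """
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, url in enumerate(urls, start=1):
        path = target / f"article-{index:03d}.md"
        try:
            extract(url).save(str(path))
        except Exception as exc:  # the exception types blog2md raises are not documented here
            print(f"skipped {url}: {exc}")
            continue
        saved.append(path)
    return saved
```

A call would look like `save_all(["https://example.com/a", "https://example.com/b"], "./docs", blog2md.extract)`. For larger lists, the CLI's `--batch` option is likely the better fit because it handles concurrency.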

### CLI Options

| Option | Description |
|--------|-------------|
| `-o, --output` | Output directory |
| `--batch <file>` | Batch file (one URL per line or JSON) |
| `--rss <url>` | RSS/Atom feed URL |
| `--concurrency N` | Concurrent requests (1-10, default 3) |
| `-f, --filename` | Custom output filename |
| `--stdout` | Output to stdout |
| `--no-frontmatter` | Skip YAML frontmatter |
| `--force-js` | Force JavaScript rendering |
| `--images` | Image mode: `local`/`keep`/`inline`/`skip` (see Image Modes Explained below) |
| `--images-dir` | Image directory (default `_images`) |
| `--crawl` | Recursively follow links |
| `--crawl-depth N` | Max crawl depth (default 1) |
| `--crawl-cross-domain` | Allow cross-domain crawling |
| `--crawl-delay N` | Delay between requests in seconds (default 1.0) |
| `--target` | Output target: file/github/obsidian |
| `--cache/--no-cache` | Enable/disable cache (default enabled) |
| `--force` | Force reprocess ignoring cache |
| `-v, --verbose` | Verbose logging |
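
The table does not spell out the JSON layout that `--batch` accepts. As an illustration only, the sketch below reads a batch file under two assumptions: the JSON form is a flat array of URL strings, and the plain-text form tolerates blank lines and `#` comments. `read_batch_file` is a hypothetical helper, not blog2md's own parser.

```python
import json
from pathlib import Path


def read_batch_file(path: str) -> list[str]:
    """Return the URLs listed in a batch file, without duplicates."""
    text = Path(path).read_text(encoding="utf-8").strip()
    if text.startswith("["):
        # Assumed JSON form: ["https://...", "https://..."]
        data = json.loads(text)
        if not all(isinstance(item, str) for item in data):
            raise ValueError("expected a JSON array of URL strings")
        urls = data
    else:
        # Plain-text form: one URL per line; skip blanks and comments
        urls = [line.strip() for line in text.splitlines()]
        urls = [u for u in urls if u and not u.startswith("#")]
    # Drop duplicates while keeping first-seen order
    return list(dict.fromkeys(urls))
```

Checking a file this way before running `blog2md --batch urls.txt` can catch stray lines or duplicates early.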

### Image Modes Explained

| Mode | Description | Use Case |
|------|-------------|----------|
| `local` | Download images to `_images/` folder, convert URLs to relative paths | **Default** - Best for offline reading |
| `keep` | Keep original image URLs unchanged | When images are already hosted reliably |
| `inline` | (Not fully implemented) Convert to base64 inline | For single-file portability |
| `skip` | Replace images with placeholder `[image]` | When you only want text content |
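
To make the table concrete, here is a rough sketch of what each mode does to an image reference in the generated Markdown. It is an illustration under assumptions, not blog2md's code: `rewrite_images` is hypothetical, it only rewrites links and downloads nothing, it names local files after the last URL path segment, and it omits `inline` because that mode is marked as not fully implemented.

```python
import posixpath
import re
from urllib.parse import urlparse

# Matches ![alt](url); a simplification that ignores titles and reference-style images.
IMAGE = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def rewrite_images(markdown: str, mode: str = "local", images_dir: str = "_images") -> str:
    """Rewrite Markdown image references according to the chosen mode."""

    def replace(match: re.Match) -> str:
        alt, url = match.group(1), match.group(2)
        if mode == "keep":
            return match.group(0)  # leave the original URL untouched
        if mode == "skip":
            return "[image]"  # placeholder from the table above
        if mode == "local":
            # Use the last path segment as the file name (assumed naming scheme)
            name = posixpath.basename(urlparse(url).path) or "image"
            return f"![{alt}]({images_dir}/{name})"
        raise ValueError(f"unsupported mode: {mode}")

    return IMAGE.sub(replace, markdown)
```

With the default `local` mode, `![Logo](https://example.com/img/logo.png?v=2)` becomes `![Logo](_images/logo.png)`; a real implementation would also have to download the file and handle name collisions.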

### Features

- **trafilatura**: High-precision content extraction, removes ads/nav
- **BeautifulSoup**: HTML parsing and content detection
- **Format preservation**: Headings, code blocks, tables, bold/italic
- **Image handling**: Download to local `_images/`, convert URLs to relative
- **Sidebar cleaning**: Auto-detect and remove nav elements
- **Caching**: Content hash detection, ETag/Last-Modified support
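
The caching bullet combines two common ideas: HTTP validators (ETag / Last-Modified) to avoid re-downloading a page, and a content hash to detect whether the body actually changed. The sketch below shows both in plain Python. The `CacheEntry` fields, the function names, and the choice of SHA-256 are assumptions for illustration; the README does not document blog2md's cache schema or hash algorithm.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheEntry:
    """What a cache might remember about a previously fetched page."""
    content_hash: str
    etag: Optional[str] = None
    last_modified: Optional[str] = None


def conditional_headers(entry: Optional[CacheEntry]) -> dict[str, str]:
    """Build request headers that let the server answer 304 Not Modified."""
    headers = {}
    if entry and entry.etag:
        headers["If-None-Match"] = entry.etag
    if entry and entry.last_modified:
        headers["If-Modified-Since"] = entry.last_modified
    return headers


def has_changed(entry: Optional[CacheEntry], status_code: int, body: bytes) -> bool:
    """Decide whether a fetched page needs to be reprocessed."""
    if status_code == 304:
        return False  # server confirmed the cached copy is current
    if entry is None:
        return True  # never seen before
    # Server sent a full body; compare hashes to catch unchanged content
    return hashlib.sha256(body).hexdigest() != entry.content_hash
```

In this model, `--force` would correspond to skipping both checks and reprocessing unconditionally.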

---

## 中文

### 使用场景

| 场景 | 命令 | 说明 |
|------|------|------|
| 保存博客文章离线阅读 | `blog2md <url>` | 抓取博客，转为 Markdown |
| 备份技术文档 | `blog2md <doc-url> --force-js` | 渲染 JS 后抓取 |
| 批量抓取网站 | `blog2md --batch urls.txt` | 并发处理多个 URL |
| 订阅博客更新 | `blog2md --rss <feed-url>` | 从 RSS 自动发现文章 |
| 构建知识库 | `blog2md <url> --crawl` | 循环抓取链接建立知识库 |
| 提取视频字幕 | `blog2md <url>` | 获取页面文字（不含视频） |
| 迁移到 Obsidian | `blog2md <url> --target obsidian --vault-path /path/to/vault` | 直接写入 Obsidian 库 |

### 安装

```bash
# 基本安装 (CLI + SDK)
pip install blog2md

# JavaScript 渲染支持 (适用于 SPA/React/Vue)
pip install "blog2md[js]"
playwright install chromium

# 从源码安装
git clone https://github.com/nestedcat/blog2md.git
cd blog2md
pip install -e ".[js]"
```

### CLI 示例

```bash
# 1. 基本抓取（最常用）
blog2md https://example.com/article

# 2. 指定输出目录
blog2md https://example.com/article --output ./docs

# 3. 输出到标准输出
blog2md https://example.com/article --stdout

# 4. 保留原始图片 URL（不下载）
blog2md https://example.com/article --images keep

# 5. 跳过 YAML frontmatter
blog2md https://example.com/article --no-frontmatter

# 6. 批量处理 URL 列表
blog2md --batch urls.txt --concurrency 5

# 7. RSS 订阅
blog2md --rss https://example.com/feed.xml --output ./articles

# 8. JavaScript 渲染（适合 React/Vue/SPA）
blog2md https://example.com/spa-page --force-js

# 9. 循环抓取（深度 1，站内链接）
blog2md https://example.com --crawl --crawl-depth 1

# 10. 循环抓取（跨域，深度 2）
blog2md https://example.com --crawl --crawl-depth 2 --crawl-cross-domain

# 11. 循环抓取（带延迟，减少对服务器的压力）
blog2md https://example.com --crawl --crawl-depth 2 --crawl-delay 2.0

# 12. 强制刷新（忽略缓存）
blog2md https://example.com/article --force

# 13. 写入 GitHub 仓库
blog2md https://example.com/article \
  --target github \
  --github-token ghp_xxx \
  --github-repo owner/repo \
  --github-path content/posts/

# 14. 写入 Obsidian 库
blog2md https://example.com/article \
  --target obsidian \
  --vault-path /path/to/your/vault
```

### Python API

```python
import blog2md

# 一行代码抓取
result = blog2md.extract("https://example.com/article")
print(result.markdown)
print(result.title)    # frontmatter 中的标题
print(result.author)   # 作者
print(result.date)    # 发布日期

# 保存到文件
result.save("/path/to/file.md")
```

### CLI 选项

| 选项 | 说明 |
|------|------|
| `-o, --output` | 输出目录 |
| `--batch <file>` | 批量处理文件（每行一个 URL 或 JSON） |
| `--rss <url>` | RSS/Atom 订阅地址 |
| `--concurrency N` | 并发数（1-10，默认 3） |
| `-f, --filename` | 自定义输出文件名 |
| `--stdout` | 输出到标准输出 |
| `--no-frontmatter` | 跳过 YAML 头 |
| `--force-js` | 强制 JavaScript 渲染 |
| `--images` | 图片模式：local/keep/inline/skip |
| `--crawl` | 循环抓取页面内链接 |
| `--crawl-depth N` | 抓取深度（默认 1） |
| `--crawl-cross-domain` | 允许跨域抓取 |
| `--crawl-delay N` | 请求间隔（秒，默认 1.0） |
| `--target` | 输出目标：file/github/obsidian |
| `--cache/--no-cache` | 启用/禁用缓存（默认启用） |
| `--force` | 强制重新抓取（忽略缓存） |
| `-v, --verbose` | 详细日志 |

### 图片模式说明

| 模式 | 说明 | 适用场景 |
|------|------|----------|
| `local` | 下载图片到 `_images/` 文件夹，URL 转为相对路径 | **默认** - 离线阅读最佳 |
| `keep` | 保留原始图片 URL | 图片已可靠托管时使用 |
| `inline` | （未完全实现）转为 base64 内联 | 单文件便携性 |
| `skip` | 用 `[image]` 占位符替换图片 | 只需文本内容时 |

### 功能特点

- **trafilatura**: 高精度正文提取，去除广告/导航
- **BeautifulSoup**: HTML 解析与内容识别
- **格式保留**: 标题层级、代码块、表格、加粗斜体
- **图片处理**: 下载到本地 `_images/`，自动转换 URL
- **侧边栏清理**: 自动识别并移除导航元素
- **缓存机制**: 内容哈希检测变化，支持 ETag/Last-Modified

---

## License

MIT License - see [LICENSE](LICENSE) for details.
