Metadata-Version: 2.4
Name: cve-extractor
Version: 0.1.0
Summary: Extract, filter, and analyze CVE data from the official CVE List
Requires-Python: >=3.11
Requires-Dist: cwe-tree>=0.1.0
Requires-Dist: githubkit>=0.11.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: requests>=2.31.0
Requires-Dist: rich>=13.0.0
Requires-Dist: typer>=0.9.0
Provides-Extra: cli
Requires-Dist: rich>=13.0.0; extra == 'cli'
Requires-Dist: typer>=0.9.0; extra == 'cli'
Provides-Extra: core
Requires-Dist: cwe-tree>=0.1.0; extra == 'core'
Requires-Dist: githubkit>=0.11.0; extra == 'core'
Requires-Dist: pydantic>=2.0.0; extra == 'core'
Requires-Dist: requests>=2.31.0; extra == 'core'
Provides-Extra: dev
Requires-Dist: black>=23.0.0; extra == 'dev'
Requires-Dist: isort>=5.12.0; extra == 'dev'
Requires-Dist: mypy>=1.4.0; extra == 'dev'
Requires-Dist: pylint>=2.17.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.1.0; extra == 'dev'
Requires-Dist: pytest-mock>=3.11.0; extra == 'dev'
Requires-Dist: pytest>=7.4.0; extra == 'dev'
Requires-Dist: ruff>=0.0.280; extra == 'dev'
Provides-Extra: full
Requires-Dist: cwe-tree>=0.1.0; extra == 'full'
Requires-Dist: githubkit>=0.11.0; extra == 'full'
Requires-Dist: pydantic>=2.0.0; extra == 'full'
Requires-Dist: requests>=2.31.0; extra == 'full'
Requires-Dist: rich>=13.0.0; extra == 'full'
Requires-Dist: typer>=0.9.0; extra == 'full'
Description-Content-Type: text/markdown

# CVE Extractor

Extract, filter, and analyze CVE data from the official CVE List.

## Features

- **Download & Extract**: Automatically download and extract CVE data from the official CVE List
- **Filter**: Identify and filter language/ecosystem-specific CVEs from the full dataset
- **Extract**: Extract key information from CVE records including ID, type, and description
- **Analyze**: Generate statistics and analysis of CVE distribution
- **Caching**: Intelligent caching system to avoid redundant downloads and processing

## Requirements

- [uv](https://docs.astral.sh/uv/) for dependency and environment management.

## Installation

Clone the repository and install with uv:

```bash
# Install the package with dependencies
uv sync

# Install with all optional extras, including development dependencies
uv sync --all-extras
```

Run the CLI via `uv run cve-extractor` or ensure the project's virtual environment is activated so the `cve-extractor` script is on PATH.

## Usage

All commands below assume the project's virtual environment (created by `uv sync`) is activated. Otherwise, use `uv run cve-extractor` in place of `cve-extractor`.

### Command Line Interface

#### Download and Extract CVE Data

Download the latest CVE data and extract CVEs for a given language/ecosystem:

```bash
# Basic usage (requires --language; uses cache if available)
cve-extractor download output/ --language php

# Force fresh download
cve-extractor download output/ --language python --no-use-cache

# Verbose output for debugging
cve-extractor download output/ --language php --verbose
```

**Output**: Creates `output/collected.csv` with extracted CVE data.

#### Analyze CVE Distribution

Generate statistics about CVE distribution:

```bash
# Basic analysis
cve-extractor analyze output/collected.csv

# Filter by minimum count
cve-extractor analyze output/collected.csv --min 5
```
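
For readers who want to reproduce this kind of count outside the CLI, the following is a minimal sketch, not the project's analyzer. It assumes the CSV layout described under Output Format and assumes that `--min` means "keep types seen at least this many times"; the function name `count_by_type` is hypothetical.

```python
import csv
from collections import Counter
from pathlib import Path


def count_by_type(csv_path: Path, min_count: int = 1) -> list[tuple[str, int]]:
    """Count rows per cve_type, keeping types seen at least min_count times.

    Hypothetical helper: the real analyzer may group or filter differently.
    """
    with csv_path.open(newline="", encoding="utf-8") as handle:
        counts = Counter(row["cve_type"] for row in csv.DictReader(handle))
    # most_common() returns (type, count) pairs sorted by descending count.
    return [(cve_type, n) for cve_type, n in counts.most_common() if n >= min_count]


if __name__ == "__main__":
    for cve_type, n in count_by_type(Path("output/collected.csv"), min_count=5):
        print(f"{cve_type}\t{n}")
```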

#### Clean Cache

Remove cache and intermediate files:

```bash
# Interactive cleanup
cve-extractor clean

# Force cleanup without confirmation
cve-extractor clean --force
```

## Project Structure

```
cve-extractor/
├── src/cve_extractor/          # Main package
│   ├── __init__.py
│   ├── config.py               # Configuration management
│   ├── logger.py               # Logging utilities
│   ├── cli.py                  # CLI interface
│   ├── core/                   # Core functionality
│   │   ├── __init__.py
│   │   ├── downloader.py       # CVE data download and extraction
│   │   ├── filter.py           # CVE filtering by language
│   │   └── extractor.py        # CVE information extraction
│   └── stats/                  # Statistics and analysis
│       ├── __init__.py
│       └── analyzer.py         # CVE distribution analysis
├── main.py                      # CLI entry point
├── pyproject.toml               # Project configuration (uv)
└── README.md                    # This file
```

## Configuration

Configuration is managed through `src/cve_extractor/config.py`. Default settings:

- **CACHE_PATH**: `data/.cache/` - Stores downloaded CVE data
- **INTER_PATH**: `data/.inter/` - Stores intermediate files and logs
- **CVELISTV5_URL**: Official CVE List v5 release URL

## Output Format

### CSV Output

The extracted CVE data is saved as CSV with the following columns:

| Column | Description |
|--------|-------------|
| `cve_id` | CVE identifier (e.g., CVE-2024-1234) |
| `cve_type` | CVE type/classification (e.g., CWE-79) |
| `description` | CVE description (first 200 chars) |

Example:
```
cve_id,cve_type,description
CVE-2024-1234,CWE-79,"Cross-site scripting (XSS) vulnerability in..."
CVE-2024-5678,CWE-89,"SQL injection vulnerability in..."
```
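
The CSV can be loaded with the standard library alone. The sketch below is one possible way to do it, assuming the three columns listed above; `CveRow` and `parse_rows` are hypothetical names, and skipping rows with malformed IDs is a choice made for this example rather than documented behavior.

```python
import csv
import io
import re
from dataclasses import dataclass

# CVE IDs have the form CVE-<4-digit year>-<4 or more digits>.
CVE_ID_PATTERN = re.compile(r"^CVE-\d{4}-\d{4,}$")


@dataclass(frozen=True)
class CveRow:
    cve_id: str
    cve_type: str
    description: str


def parse_rows(text: str) -> list[CveRow]:
    """Parse CSV text into CveRow objects, skipping rows with malformed IDs."""
    rows = []
    # csv.DictReader handles quoted descriptions that contain commas.
    for record in csv.DictReader(io.StringIO(text)):
        if not CVE_ID_PATTERN.match(record["cve_id"]):
            continue
        rows.append(
            CveRow(record["cve_id"], record["cve_type"], record["description"])
        )
    return rows
```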

## Logging

Log files are written to `data/.inter/.logs/`. File and console output use different formats:
- File logs: Detailed format with timestamps and line numbers
- Console output: Formatted with colors and emojis for easy reading
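
A two-handler setup of this kind can be sketched with the standard `logging` module. This is an illustrative approximation, not the contents of `logger.py`: the format strings, the `run.log` file name, and the `build_logger` function are assumptions, and a plain `StreamHandler` stands in for whatever colored console output the project produces.

```python
import logging
from pathlib import Path


def build_logger(log_dir: Path, name: str = "cve_extractor") -> logging.Logger:
    """Create a logger with a detailed file handler and a terse console handler."""
    log_dir.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls

    # File output: timestamp, level, source file and line number, message.
    file_handler = logging.FileHandler(log_dir / "run.log", encoding="utf-8")
    file_handler.setLevel(logging.DEBUG)
    file_handler.setFormatter(
        logging.Formatter(
            "%(asctime)s | %(levelname)-8s | %(filename)s:%(lineno)d | %(message)s"
        )
    )

    # Console output: short format, INFO and above only.
    console_handler = logging.StreamHandler()
    console_handler.setLevel(logging.INFO)
    console_handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))

    logger.addHandler(file_handler)
    logger.addHandler(console_handler)
    return logger
```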

## Dependencies

All dependencies are declared in `pyproject.toml` and managed by uv.

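- **Core**: cwe-tree, githubkit, pydantic, requests, rich, typer
- **Optional [core]**: `uv sync --extra core` for the API-only dependency set (cwe-tree, githubkit, pydantic, requests)
- **Optional [cli]**: `uv sync --extra cli` for the CLI dependencies (rich, typer)
- **Optional [full]**: API and CLI dependencies together; the default install already includes them
- **Optional [dev]**: `uv sync --all-extras` for black, isort, mypy, pylint, pytest, pytest-cov, pytest-mock, ruff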

## Development

### Setup Development Environment

```bash
# Install with dev extras (uv manages the environment)
uv sync --all-extras

# Format code
uv run black src/
uv run isort src/

# Lint code
uv run ruff check src/
uv run pylint src/

# Type check
uv run mypy src/

# Run tests
uv run pytest
```

### Code Style

This project uses uv for all tooling. Before committing, run:

- `uv run black src/` and `uv run isort src/` for formatting
- `uv run ruff check src/` and `uv run pylint src/` for linting
- `uv run mypy src/` for type checking

## Performance Notes

The extraction process is optimized for performance:

- **Batch processing**: Processes CVE files in batches of 5000
- **Progress tracking**: Real-time progress display with ETA
- **Caching**: 7-day cache for extracted data and GitHub requests
- **Incremental updates**: Only processes new CVEs since last run
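
The batch-processing idea can be illustrated with a small generator. This is a sketch under the assumption that batching simply groups an iterable of files into fixed-size lists; `batched` and `BATCH_SIZE` are hypothetical names, and the helper is written by hand because `itertools.batched` is only available from Python 3.12 while this package supports 3.11.

```python
from collections.abc import Iterable, Iterator
from itertools import islice
from typing import TypeVar

T = TypeVar("T")

# Batch size stated in the list above.
BATCH_SIZE = 5000


def batched(items: Iterable[T], size: int = BATCH_SIZE) -> Iterator[list[T]]:
    """Yield consecutive lists of at most size items from items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    # islice pulls the next `size` items; an empty list ends the loop.
    while batch := list(islice(iterator, size)):
        yield batch
```

Because the generator consumes its input lazily, only one batch needs to be held in memory at a time, which is also why lowering the batch size can help with memory pressure.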

## Troubleshooting

### No CVEs Found

If no CVEs are found for the selected language:
1. Check network connectivity
2. Verify the CVE data source URL is accessible
3. Try with `--no-use-cache` to force fresh download

### Out of Memory

For large datasets:
1. Reduce batch size in `src/cve_extractor/core/extractor.py`
2. Run on a machine with more RAM
3. Process data in smaller chunks

### API Rate Limiting

The downloader includes automatic rate limit handling:
- Automatic retries with exponential backoff
- Caching with a 7-day TTL to minimize API calls
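
Retrying with exponential backoff generally looks like the sketch below. It is a generic illustration and not the downloader's actual code: `retry_with_backoff`, its parameters, and the default delays are assumptions, and catching every `Exception` is a simplification (real code would normally retry only on rate-limit or transient network errors).

```python
import random
import time
from collections.abc import Callable
from typing import TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying on failure with exponentially growing delays."""
    for attempt in range(retries + 1):
        try:
            return call()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            # Delay doubles each attempt (1s, 2s, 4s, ...), capped at max_delay.
            delay = min(max_delay, base_delay * 2**attempt)
            # Up to 10% random jitter so parallel clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 10))
    raise AssertionError("unreachable")
```

The `sleep` parameter exists so the function can be exercised in tests without actually waiting.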

## License

This project is open source and available under the MIT License.

## Contributing

Contributions are welcome! Please ensure:
1. All commands are run through uv, and code follows the project style (black, isort, ruff, pylint)
2. Type hints are included where applicable
3. Tests are added for new functionality
4. Documentation is updated

## Support

For issues, questions, or suggestions, please open an issue on the project repository.
