Metadata-Version: 2.4
Name: dns-threat-detector
Version: 1.0.3
Summary: Production-ready DNS threat detection using machine learning
Home-page: https://github.com/SubratDash67/DNS-Security
Author: UMUDGA Project
Author-email: UMUDGA Project <contact@umudga.edu>
License: MIT
Project-URL: Homepage, https://github.com/SubratDash67/DNS-Security
Project-URL: Documentation, https://github.com/SubratDash67/DNS-Security/wiki
Project-URL: Repository, https://github.com/SubratDash67/DNS-Security
Project-URL: Bug Reports, https://github.com/SubratDash67/DNS-Security/issues
Keywords: dns,threat-detection,machine-learning,dga,typosquatting,security,cybersecurity,malware,phishing
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Classifier: Topic :: Security
Classifier: Topic :: Internet :: Name Service (DNS)
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.0.0
Requires-Dist: lightgbm>=4.0.0
Requires-Dist: scikit-learn>=1.3.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: numpy>=1.24.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: flake8>=6.0.0; extra == "dev"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# DNS Threat Detector

![Python](https://img.shields.io/badge/python-3.8%2B-blue)
![License](https://img.shields.io/badge/license-MIT-green)
![Version](https://img.shields.io/badge/version-1.0.0-brightgreen)

A production-ready machine learning system for detecting malicious DNS domains including DGA domains, typosquatting attempts, malware C&C domains, and phishing sites.

## Features

- **99.68% F1-Score** on comprehensive test dataset
- **100% Typosquatting Detection** with zero false positives
- **Sub-millisecond Latency** (0.439ms average inference time)
- **Multi-tier Safelist** with O(1) lookup for instant benign classification
- **99 Protected Brands** including Google, Microsoft, PayPal, Amazon, etc.
- **Hybrid Architecture** combining LightGBM, LSTM, and meta-learning
- **Easy Integration** with Python API and CLI tool

## Performance Metrics

| Metric | Value |
|--------|-------|
| F1-Score | 99.68% |
| Accuracy | 99.38% |
| Precision | 97.15% |
| Recall | 99.95% |
| Typosquatting Detection | 100% |
| False Positive Rate | 28.5% |
| False Negative Rate | 0.05% |
| Avg Latency | 0.439 ms |
| Throughput | ~2,275 domains/sec |

## Installation

```bash
pip install dns-threat-detector
```

**Having installation issues?** See [TROUBLESHOOTING.md](TROUBLESHOOTING.md) for solutions.

**Quick diagnostic:**
```bash
dns-detect check
```

## Quick Start

### Python API

```python
from dns_threat_detector import DNS_ThreatDetector

# Initialize detector with safelist enabled
detector = DNS_ThreatDetector(use_safelist=True)
detector.load_models()

# Predict a single domain
result = detector.predict('gooogle.com')
print(result)
# Output:
# {
#   'prediction': 'MALICIOUS',
#   'confidence': 0.9000,
#   'reason': 'Typosquatting (dist=1 to google)',
#   'method': 'typosquatting_rule',
#   'latency_ms': 0.234
# }

# Batch predictions
domains = ['google.com', 'gooogle.com', 'example.com']
results = detector.predict_batch(domains)
```

### Command-Line Interface

```bash
# Predict a single domain
dns-detect predict gooogle.com

# Get JSON output
dns-detect predict gooogle.com --json

# Batch process domains from file
dns-detect batch domains.txt --output results.json

# Show model information
dns-detect info

# Run self-tests
dns-detect test
```

## Architecture

The DNS Threat Detector uses a sophisticated hybrid ensemble approach:

### 1. LightGBM Classifier
- Gradient-boosted decision trees
- 11 features (4 FQDN + 7 typosquatting-specific)
- 200 trees with max depth 7
- Handles structured feature patterns

### 2. Bidirectional LSTM
- Character-level neural network
- 41-character vocabulary
- 159K parameters
- Embedding(41→32) → Bi-LSTM(32→64×2) → FC(128→64→2)
- Captures sequential patterns

### 3. Meta-Learner
- Logistic regression stacking ensemble
- Combines LightGBM and LSTM predictions
- Learned weights: LSTM=7.04, LightGBM=2.53
- Final classification decision

### 4. Typosquatting Detector
- Rule-based + ML hybrid approach
- Edit distance (Levenshtein) to 99 top brands
- Distance 1-3 → Malicious (typosquatting)
- Exact brand match → Benign (whitelist)

### 5. Multi-Tier Safelist (Optional)
- Tier 1: 30K critical domains (government, finance)
- Tier 2: 29K high-trust domains (tech, education)
- Tier 3: 85K general trusted domains
- O(1) in-memory lookup
- 322× speedup for safelisted domains

## Prediction Flow

```
Domain Input
    ↓
┌─────────────────────┐
│  Safelist Check     │ → BENIGN (if listed)
└─────────────────────┘
    ↓
┌─────────────────────┐
│  Brand Whitelist    │ → BENIGN (exact match)
└─────────────────────┘
    ↓
┌─────────────────────┐
│  Typosquatting Rule │ → MALICIOUS (edit dist 1-3)
└─────────────────────┘
    ↓
┌─────────────────────┐
│  ML Ensemble        │ → MALICIOUS/BENIGN
│  (LightGBM + LSTM)  │
└─────────────────────┘
```

## Features Extracted

### FQDN Features (4)
1. `domain_length` - Length of domain name excluding TLD
2. `subdomain_count` - Number of subdomains
3. `numeric_chars` - Count of numeric characters
4. `entropy` - Shannon entropy of character distribution

### Typosquatting Features (7)
5. `min_edit_distance` - Minimum Levenshtein distance to top brands
6. `edit_distance_ratio` - Normalized edit distance by brand length
7. `length_diff_to_closest` - Length difference to closest brand
8. `has_extra_char` - Binary: domain has 1 extra character
9. `has_missing_char` - Binary: domain missing 1 character
10. `has_swapped_char` - Binary: adjacent characters swapped
11. `digit_substitution` - Binary: contains digit substitution

## API Reference

### DNS_ThreatDetector

```python
class DNS_ThreatDetector(
    models_dir: Optional[str] = None,
    use_safelist: bool = False,
    safelist_dir: Optional[str] = None,
    safelist_tiers: List[int] = [1, 2, 3]
)
```

#### Methods

**`load_models()`**
Load all model components (LightGBM, LSTM, meta-learner, safelist)

**`predict(domain: str) -> Dict`**
Predict if a domain is malicious or benign

Returns:
```python
{
    'prediction': 'MALICIOUS' | 'BENIGN',
    'confidence': float,  # 0.0 to 1.0
    'reason': str,  # Human-readable explanation
    'method': str,  # 'safelist' | 'brand_whitelist' | 'typosquatting_rule' | 'ensemble'
    'latency_ms': float  # Inference time in milliseconds
}
```

**`predict_batch(domains: List[str]) -> List[Dict]`**
Predict multiple domains

**`get_model_info() -> Dict`**
Get comprehensive model information and statistics

**`save_metadata(output_path: str)`**
Save model metadata to JSON file

## Advanced Usage

### Custom Model Paths

```python
detector = DNS_ThreatDetector(
    models_dir='/path/to/models',
    use_safelist=True,
    safelist_dir='/path/to/safelists',
    safelist_tiers=[1, 2, 3]
)
detector.load_models()
```

### Disable Safelist

```python
# Faster initialization, no safelist loading
detector = DNS_ThreatDetector(use_safelist=False)
detector.load_models()
```

### Batch Processing with Progress

```python
from tqdm import tqdm

domains = ['example1.com', 'example2.com', ...]
results = []

for domain in tqdm(domains):
    result = detector.predict(domain)
    results.append(result)
```

### Model Statistics

```python
info = detector.get_model_info()
print(f"Total predictions: {info['usage_statistics']['total_predictions']}")
print(f"Safelist hits: {info['usage_statistics']['safelist_hits']}")
print(f"Typosquatting detections: {info['usage_statistics']['typosquatting_detections']}")
```

## CLI Reference

### Commands

**`dns-detect predict <domain>`**
Predict a single domain
- `--json`: Output as JSON
- `--no-safelist`: Disable safelist checking

**`dns-detect batch <file>`**
Batch process domains from file (one domain per line)
- `--output <file>`: Output file path (default: results.json)
- `--no-safelist`: Disable safelist checking

**`dns-detect info`**
Show model information and statistics
- `--no-safelist`: Show info without loading safelist

**`dns-detect test`**
Run built-in self-tests

## Examples

### Detecting Typosquatting

```python
detector = DNS_ThreatDetector()
detector.load_models()

# Legitimate brand
result = detector.predict('google.com')
# → BENIGN (brand_whitelist)

# Typosquatting attempts
result = detector.predict('gooogle.com')  # Extra 'o'
# → MALICIOUS (typosquatting_rule, dist=1)

result = detector.predict('g00gle.com')  # Digit substitution
# → MALICIOUS (typosquatting_rule, dist=2)
```

### Processing Domain Lists

```python
import pandas as pd

detector = DNS_ThreatDetector(use_safelist=True)
detector.load_models()

# Read domains from CSV
df = pd.read_csv('domains.csv')

# Add predictions
df['prediction'] = df['domain'].apply(
    lambda d: detector.predict(d)['prediction']
)
df['confidence'] = df['domain'].apply(
    lambda d: detector.predict(d)['confidence']
)

# Filter malicious domains
malicious = df[df['prediction'] == 'MALICIOUS']
print(malicious)
```

## Requirements

- Python ≥ 3.8
- PyTorch ≥ 2.0.0
- LightGBM ≥ 4.0.0
- scikit-learn ≥ 1.3.0
- pandas ≥ 2.0.0
- numpy ≥ 1.24.0

## Model Size

- Total package size: ~60 MB
- LightGBM models: ~10 MB
- LSTM model: ~5 MB
- Safelist files (tiers 1-3): ~20 MB
- Tokenizer: ~1 MB

## Dataset

Trained on 51,000 domains:
- 50% benign (legitimate domains)
- 50% malicious (DGA, typosquatting, malware C&C)
- 80/20 train/test split with stratification

## Citation

If you use this tool in your research or project, please cite:

```
@software{dns_threat_detector,
  title = {DNS Threat Detector},
  author = {UMUDGA Project},
  year = {2025},
  version = {1.0.0},
  url = {https://github.com/umudga/dns-threat-detector}
}
```

## License

MIT License - see LICENSE file for details

## Contributing

Contributions are welcome! Please:
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Submit a pull request

## Support

- GitHub Issues: https://github.com/umudga/dns-threat-detector/issues
- Documentation: https://github.com/umudga/dns-threat-detector/wiki

## Changelog

### Version 1.0.0 (2025-10-30)
- Initial release
- Hybrid ensemble architecture (LightGBM + LSTM + Meta-learner)
- 99.68% F1-score on test data
- 100% typosquatting detection
- Multi-tier safelist integration
- CLI tool with batch processing
- Comprehensive API documentation

## Acknowledgments

Developed by the UMUDGA Project team as part of a final-year academic research project on DNS threat detection using machine learning.

## Disclaimer

This tool is provided for educational and research purposes. While it achieves high accuracy, no detection system is perfect. Always use multiple layers of security in production environments.
