Metadata-Version: 2.4
Name: subtitle-deduplicator
Version: 1.0.0
Summary: Remove duplicate/ghost entries from SRT subtitle files generated by auto-captioning tools
Project-URL: Homepage, https://github.com/fr0stb1rd/subtitle-deduplicator
Project-URL: Issues, https://github.com/fr0stb1rd/subtitle-deduplicator/issues
License: MIT
License-File: LICENSE
Keywords: caption,cleanup,deduplicator,srt,subtitle
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: End Users/Desktop
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Multimedia :: Video
Classifier: Topic :: Text Processing
Requires-Python: >=3.8
Description-Content-Type: text/markdown

# subtitle-deduplicator 🎬

[![PyPI](https://img.shields.io/pypi/v/subtitle-deduplicator)](https://pypi.org/project/subtitle-deduplicator/)
[![GitHub](https://img.shields.io/badge/github-fr0stb1rd%2Fsubtitle--deduplicator-blue?logo=github)](https://github.com/fr0stb1rd/subtitle-deduplicator)

**SRT Subtitle Duplicate Remover**

Remove duplicate/ghost entries from auto-generated SRT subtitle files. | [GitHub](https://github.com/fr0stb1rd/subtitle-deduplicator) | [PyPI](https://pypi.org/project/subtitle-deduplicator/)

## Why?

Auto-generated SRT subtitles (from YouTube, Whisper, etc.) often contain duplicated entries in a "scrolling karaoke" pattern:

```
1
00:00:00,440 --> 00:00:02,909

so welcome to the last Talk of the day

2
00:00:02,909 --> 00:00:02,919
so welcome to the last Talk of the day
 

3
00:00:02,919 --> 00:00:06,630
so welcome to the last Talk of the day
and OSS what do ABI and why should they
```

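Each real entry shows two lines (the previous line plus the new one), and between real entries sit 10 ms "ghost" entries that only repeat the previous text. Every line of speech therefore appears about three times: once as the new line, once in the ghost entry, and once as the carried-over line of the next entry. This roughly **triples** the file size and makes the subtitles unreadable.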
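To make the ghost pattern concrete, here is a minimal sketch of how an entry's duration can be read from its timing line and compared against a threshold. The helper names (`to_ms`, `duration_ms`, `is_ghost`) are hypothetical and are not the package's API. Treating the threshold as inclusive is also an assumption.

```python
import re

# Hypothetical helpers for illustration only; not subtitle-deduplicator's API.
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2}),(\d{3}) --> (\d+):(\d{2}):(\d{2}),(\d{3})"
)


def to_ms(hours, minutes, seconds, millis):
    """Convert the four captured timestamp fields to milliseconds."""
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def duration_ms(timing_line):
    """Return end minus start, in milliseconds, for an SRT timing line."""
    match = TIMING.match(timing_line.strip())
    if match is None:
        raise ValueError(f"not an SRT timing line: {timing_line!r}")
    fields = match.groups()
    return to_ms(*fields[4:]) - to_ms(*fields[:4])


def is_ghost(timing_line, threshold_ms=20):
    """Assumption: an entry counts as a ghost candidate when its duration
    is at or below the threshold (inclusive)."""
    return duration_ms(timing_line) <= threshold_ms
```

Applied to the sample above, the second entry lasts 10 ms and is flagged, while the first and third are not. A full check would also confirm that the short entry only repeats earlier text.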

## Install

```bash
pip install subtitle-deduplicator
```

## Usage

```bash
# Basic usage (outputs to video_deduped.srt)
subtitle-dedup video.srt

# Specify output file
subtitle-dedup video.srt -o video_clean.srt

# Overwrite in place
subtitle-dedup video.srt --in-place

# Custom ghost threshold (default: 20ms)
subtitle-dedup video.srt -t 50

# Specify encoding
subtitle-dedup video.srt -e latin-1
```

**Example Output:**

```text
✔ Deduplication complete!
ℹ Input:               video.srt
ℹ Output:              video_clean.srt
ℹ Original entries:    1559
ℹ Deduplicated:        760
ℹ Removed:             799 (51.3%)
```
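The figures in the summary are related by simple arithmetic: the removed count is the difference between the original and deduplicated counts, and the percentage is taken relative to the original count. A small sketch (the function name `removal_summary` is hypothetical):

```python
def removal_summary(original, kept):
    """Return (removed, percent_removed) relative to the original entry count."""
    removed = original - kept
    percent = 100.0 * removed / original if original else 0.0
    return removed, round(percent, 1)


removed, percent = removal_summary(1559, 760)
print(f"Removed: {removed} ({percent}%)")  # Removed: 799 (51.3%)
```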

## What It Removes

| Duplicate Type | Description |
|---|---|
| Ghost entries | Very short entries (about 10 ms, below the ghost threshold) that only repeat the previous text |
| Carry-over lines | A first line that duplicates the last line of the previous entry |
| Identical entries | Back-to-back entries with the same text |
| Empty entries | Entries with no actual text content |
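The four rules above can be sketched as a single pass over parsed entries. This is an illustrative approximation under stated assumptions, not the package's actual implementation. Entries are assumed to be `(start_ms, end_ms, lines)` tuples, the function name `dedupe` is hypothetical, and the threshold is treated as inclusive.

```python
def dedupe(entries, threshold_ms=20):
    """Illustrative sketch only; the real tool may differ in detail.

    entries: iterable of (start_ms, end_ms, lines) tuples (assumed shape).
    Returns the kept entries with carry-over lines stripped.
    """
    kept = []
    last_seen = []  # cleaned text of the previous kept entry, before stripping
    for start, end, lines in entries:
        text = [line.strip() for line in lines if line.strip()]
        if not text:
            continue  # empty entry: no actual text content
        repeats_previous = all(line in last_seen for line in text)
        if end - start <= threshold_ms and repeats_previous:
            continue  # ghost entry: very short and only repeats earlier text
        if text == last_seen:
            continue  # identical back-to-back entry
        out = text
        if last_seen and len(text) > 1 and text[0] == last_seen[-1]:
            out = text[1:]  # carry-over: drop the line already shown
        kept.append((start, end, out))
        last_seen = text
    return kept


sample = [
    (440, 2909, ["", "so welcome to the last Talk of the day"]),
    (2909, 2919, ["so welcome to the last Talk of the day", " "]),
    (2919, 6630, ["so welcome to the last Talk of the day",
                  "and OSS what do ABI and why should they"]),
]
for start, end, lines in dedupe(sample):
    print(start, end, lines)
```

On the sample from the "Why?" section, this keeps two entries: the first with its single line, and the third with only its new second line. Renumbering the kept entries and writing them back out as SRT is left out of the sketch.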

## Zero Dependencies

subtitle-deduplicator uses only the Python standard library, so it has no third-party dependencies. The only requirement is Python 3.8 or newer.

## License

MIT
