Metadata-Version: 2.4
Name: cjm-transcription-utils
Version: 0.0.3
Summary: Miscellaneous utilities for helping with audio transcription.
Home-page: https://github.com/cj-mills/cjm-transcription-utils
Author: Christian J. Mills
Author-email: 9126128+cj-mills@users.noreply.github.com
License: Apache Software License 2.0
Keywords: nbdev jupyter notebook python
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: License :: OSI Approved :: Apache Software License
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: silero-vad
Requires-Dist: num2words
Requires-Dist: librosa
Requires-Dist: pydub
Requires-Dist: RapidFuzz
Requires-Dist: numerizer
Provides-Extra: dev
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: license
Dynamic: license-file
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# cjm-transcription-utils


<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

## Install

``` bash
pip install cjm-transcription-utils
```

## Project Structure

    nbs/
    ├── chunking.ipynb            # Utilities for splitting audio into chunks using VAD timestamps and merging transcripts with overlap correction.
    ├── formatting.ipynb          # Utilities for formatting time intervals into human-readable timestamp ranges.
    ├── librosa.ipynb             # Audio loading and normalization utilities using librosa.
    ├── numerizer.ipynb           # Text number conversion utilities with patched numerizer to preserve articles like 'a'.
    ├── postprocessing.ipynb      # Transcript post-processing utilities for converting numbers to words and normalizing text.
    ├── pydub.ipynb               # Audio segment extraction utilities using pydub.
    ├── silero_vad.ipynb          # Voice Activity Detection utilities using the Silero VAD model.
    └── timestamp_alignment.ipynb # Utilities for aligning VAD timestamps to corrected transcripts using fuzzy matching.

Total: 8 notebooks

## Module Dependencies

``` mermaid
graph LR
    chunking[chunking]
    formatting[formatting]
    librosa[librosa]
    numerizer[numerizer]
    postprocessing[postprocessing]
    pydub[pydub]
    silero_vad[silero_vad]
    timestamp_alignment[timestamp_alignment]

    silero_vad --> chunking
    silero_vad --> librosa
```

*2 cross-module dependencies detected*

## CLI Reference

No CLI commands found in this project.

## Module Overview

Detailed documentation for each module in the project:

### chunking (`chunking.ipynb`)

> Utilities for splitting audio into chunks using VAD timestamps and
> merging transcripts with overlap correction.

#### Import

``` python
from cjm_transcription_utils.chunking import (
    get_extended_timestamp_boundaries,
    get_extended_chunk_boundaries,
    generate_chunks_with_vad,
    generate_intermediate_chunks,
    generate_intermediate_chunk_tuples,
    merge_transcripts_with_overlaps
)
```

#### Functions

``` python
def get_extended_timestamp_boundaries(
    timestamps: List[Dict[str, float]],  # List of timestamp dictionaries with 'start' and 'end' keys
    index: int  # Index of the current timestamp
) -> Tuple[float, float]
    "Get extended boundaries for a timestamp using adjacent timestamps."
```

``` python
def get_extended_chunk_boundaries(
    chunks: List[Tuple[float, float]],  # List of (start, end) chunk boundary tuples in seconds
    index: int  # Index of the current chunk
) -> Tuple[float, float]
    "Get extended boundaries for a chunk using adjacent chunks."
```

``` python
def generate_chunks_with_vad(
    audio_array: np.ndarray,  # Audio array
    duration: float,  # Total duration of audio in seconds
    max_chunk_seconds: float = 120,  # Maximum chunk duration in seconds
    max_chunk_seconds_offset: float = 0,  # Offset for chunk duration calculation
    speech_timestamps: Optional[List[Dict]] = None,  # List of speech timestamp dictionaries with 'start' and 'end' keys
    max_silence_threshold: float = 2.0  # Maximum silence duration (in seconds) before creating a new chunk
) -> Tuple[List[Tuple[float, float]], List[List[Dict]]]
    "Generate chunks using VAD timestamps with silence-based splitting"
```
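As a rough illustration of the silence-based splitting described above, here is a self-contained sketch; the `split_by_silence` name and the exact accumulation rules are assumptions for illustration, not the library's implementation:

``` python
def split_by_silence(
    timestamps,               # list of {'start': s, 'end': s} speech spans, sorted by time
    max_chunk_seconds=120.0,  # cap on chunk duration in seconds
    max_silence=2.0,          # silence gap (seconds) that forces a new chunk
):
    "Group speech timestamps into (start, end) chunks, splitting on long silences or the duration cap."
    chunks, group = [], []
    for ts in timestamps:
        if group:
            gap = ts['start'] - group[-1]['end']
            too_long = ts['end'] - group[0]['start'] > max_chunk_seconds
            if gap > max_silence or too_long:
                # Close the current chunk and start a new one.
                chunks.append((group[0]['start'], group[-1]['end']))
                group = []
        group.append(ts)
    if group:
        chunks.append((group[0]['start'], group[-1]['end']))
    return chunks
```
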

``` python
def generate_intermediate_chunks(
    chunks: List[Tuple[float, float]],  # List of (start, end) chunk boundary tuples in seconds
    chunk_timestamps: List[List[Dict]],  # List of timestamp dictionaries for each chunk
    use_extended_boundaries: bool  # Whether to use extended boundaries from adjacent timestamps
) -> List[Tuple[float, float]]:  # List of tuples representing time intervals for intermediate chunks
    "Generate overlapping chunks between consecutive chunk boundaries"
```

``` python
def generate_intermediate_chunk_tuples(
    chunks: List[Tuple[float, float]],  # List of (start, end) chunk boundary tuples in seconds
    chunk_timestamps: List[List[Dict]],  # List of timestamp dictionaries for each chunk
    use_extended_boundaries: bool  # Whether to use extended boundaries from adjacent timestamps
) -> List[Tuple[Dict, Dict]]
    "Generate tuples of (last_timestamp, first_timestamp) from consecutive chunks."
```

``` python
def merge_transcripts_with_overlaps(
    normal_transcripts: List[str],  # List of transcripts for normal chunks
    intermediate_transcripts: List[str],  # List of transcripts for intermediate chunks
    segment_transcripts: List[Tuple[str, str]],  # List of transcript pairs for the boundary segments between consecutive chunks
    verbose: bool = True  # Whether to print debug information
) -> str
    "Merge normal and intermediate transcripts with overlap correction"
```
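The core idea of overlap correction, stitching two transcripts where the tail of one duplicates the head of the next, can be sketched with the standard library's `difflib` standing in for whatever matching the library actually uses; the `stitch` helper and its `min_overlap` cutoff are assumptions:

``` python
from difflib import SequenceMatcher

def stitch(left: str, right: str, min_overlap: int = 5) -> str:
    "Join two transcripts, dropping the duplicated region where left's tail overlaps right's head."
    m = SequenceMatcher(None, left, right, autojunk=False)
    a, b, size = m.find_longest_match(0, len(left), 0, len(right))
    # Only treat it as an overlap if the match reaches left's tail and right's head.
    if size >= min_overlap and a + size == len(left) and b == 0:
        return left + right[size:]
    return left + " " + right
```
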

### formatting (`formatting.ipynb`)

> Utilities for formatting time intervals into human-readable timestamp
> ranges.

#### Import

``` python
from cjm_transcription_utils.formatting import (
    time_interval_to_hms_range
)
```

#### Functions

``` python
def time_interval_to_hms_range(
    duration_tuple: tuple[float, float]  # A tuple of (start_seconds, end_seconds) as floats
) -> str:  # Formatted timestamp range string in [HH:MM:SS.ss]-[HH:MM:SS.ss] format
    "Convert a time interval tuple (start_seconds, end_seconds) to HMS timestamp range format."
```
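A minimal self-contained sketch of this conversion, using the `[HH:MM:SS.ss]-[HH:MM:SS.ss]` format from the return comment above (the exact zero-padding rules are an assumption):

``` python
def to_hms_range(interval):
    "Format a (start_seconds, end_seconds) tuple as [HH:MM:SS.ss]-[HH:MM:SS.ss]."
    def hms(seconds: float) -> str:
        h, rem = divmod(seconds, 3600)
        m, s = divmod(rem, 60)
        return f"[{int(h):02d}:{int(m):02d}:{s:05.2f}]"
    start, end = interval
    return f"{hms(start)}-{hms(end)}"
```
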

### librosa (`librosa.ipynb`)

> Audio loading and normalization utilities using librosa.

#### Import

``` python
from cjm_transcription_utils.librosa import (
    load_audio
)
```

#### Functions

``` python
def load_audio(
    audio_path: str,  # Path to the audio file to load
    target_sr: int = 16000  # Target sample rate for resampling
) -> Tuple[np.ndarray, int]:  # Tuple of (normalized audio array, sample rate)
    "Load and normalize audio file"
```

### numerizer (`numerizer.ipynb`)

> Text number conversion utilities with patched numerizer to preserve
> articles like 'a'.

#### Import

``` python
from cjm_transcription_utils.numerizer import (
    original_numerize_numerals,
    patched_numerize_numerals,
    smart_numerize
)
```

#### Functions

``` python
def patched_numerize_numerals(
    s: str,  # String to convert written numbers to digits
    ignore: list = None,  # List of words to ignore during conversion
    bias: str = None  # Conversion bias (e.g., 'ordinal')
) -> str:  # String with written numbers converted to digits
    "Patched version that doesn't convert 'a' to '1'"
```

``` python
def smart_numerize(
    text: str  # Text containing written numbers to convert
) -> str:  # Text with written numbers converted to digits
    "Convert written numbers to digits with special handling for compound ordinals."
```
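The behavior the patch restores, converting written numbers to digits while leaving the article "a" alone, can be illustrated with a deliberately tiny stand-in (the `WORDS` map covers only a few numbers; the real numerizer handles far more):

``` python
import re

# Tiny illustrative word-to-digit map; not the library's vocabulary.
WORDS = {"one": "1", "two": "2", "three": "3", "ten": "10", "twenty": "20"}

def naive_numerize(text: str) -> str:
    "Replace written numbers with digits, leaving the article 'a' untouched."
    def repl(m: re.Match) -> str:
        # Words not in the map (including 'a') pass through unchanged.
        return WORDS.get(m.group(0).lower(), m.group(0))
    return re.sub(r"[A-Za-z]+", repl, text)
```
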

### postprocessing (`postprocessing.ipynb`)

> Transcript post-processing utilities for converting numbers to words
> and normalizing text.

#### Import

``` python
from cjm_transcription_utils.postprocessing import (
    replace_integers_in_string,
    transcription_post_processing
)
```

#### Functions

``` python
def replace_integers_in_string(
    text: str  # Text containing integers to convert to words
) -> str:  # Text with integers converted to their word representation
    "Replace integer numbers with their word equivalents while preserving special formats."
```
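To show the "preserving special formats" idea, here is a narrow stand-in that only spells out standalone single-digit integers and skips numbers attached to dots or colons (versions, times, decimals); the `replace_small_integers` name and regex are assumptions, not the library's implementation:

``` python
import re

ONES = ["zero", "one", "two", "three", "four", "five",
        "six", "seven", "eight", "nine"]

def replace_small_integers(text: str) -> str:
    "Spell out lone single-digit integers while leaving '3.11'- or '10:30'-style tokens intact."
    # A lone digit not attached to word characters, dots, or colons.
    return re.sub(r"(?<![\w.:])(\d)(?![\w.:])",
                  lambda m: ONES[int(m.group(1))], text)
```
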

``` python
def transcription_post_processing(
    transcript: str  # Raw transcript text to process
) -> str:  # Processed transcript with integers converted to words and dashes normalized
    "Apply post-processing transformations to transcript text."
```

### pydub (`pydub.ipynb`)

> Audio segment extraction utilities using pydub.

#### Import

``` python
from cjm_transcription_utils.pydub import (
    get_audio_segment
)
```

#### Functions

``` python
def get_audio_segment(
    audio: AudioSegment,  # Source audio segment to extract from
    start: float,  # Start time in seconds
    end: float,  # End time in seconds
    offset: float = 0  # Offset in milliseconds to expand segment boundaries
) -> AudioSegment:  # Extracted audio segment
    "Extract audio segment between start and end times"
```

### silero vad (`silero_vad.ipynb`)

> Voice Activity Detection utilities using the Silero VAD model.

#### Import

``` python
from cjm_transcription_utils.silero_vad import (
    prepare_audio_and_vad
)
```

#### Functions

``` python
def prepare_audio_and_vad(
    audio_path: str,  # Path to audio file
    max_chunk_seconds: float,  # Maximum chunk duration in seconds
    max_silence_threshold: float,  # Maximum silence duration before creating a new chunk
    include_timestamps: bool,  # Whether timestamps will be needed
    verbose: bool = True  # Whether to print progress
) -> tuple[np.ndarray, int, float, list, list, list]:  # Tuple of (audio array, sample rate, duration, speech timestamps, chunks, chunk timestamps)
    "Load audio and prepare VAD timestamps if needed."
```

### timestamp alignment (`timestamp_alignment.ipynb`)

> Utilities for aligning VAD timestamps to corrected transcripts using
> fuzzy matching.

#### Import

``` python
from cjm_transcription_utils.timestamp_alignment import (
    TranscriptAligner,
    align_timestamps_to_transcript
)
```

#### Functions

``` python
def align_timestamps_to_transcript(
    final_transcript: str,  # The final merged transcript
    timestamp_transcripts: List[str],  # List of transcripts for each timestamp segment
    speech_timestamps: List[Dict],  # List of speech timestamp dictionaries
    verbose: bool = True  # Whether to print alignment details
) -> List[Dict]:  # List of alignment dictionaries with timestamp, text, and confidence info
    "Align timestamp segments to the final transcript."
```

#### Classes

``` python
class TranscriptAligner:
    "Aligns VAD timestamps to a corrected transcript using fuzzy matching."
    
    def __init__(
            self,
            correct_transcript: str,  # The full, correct transcript text
            segment_transcripts: List[str],  # List of individual segment transcriptions (may have errors)
            timestamps: List[Dict[str, float]],  # List of timestamp dictionaries with 'start' and 'end' keys
            confidence_threshold: int = 70  # Minimum confidence score to accept an alignment
        )
        "Initialize the transcript aligner with complete coverage and correction mechanisms."
    
    def align_timestamps_to_correct_transcript(
            self
        ) -> List[Dict]:  # List of alignment dictionaries with timestamp, text, and confidence info
        "Align timestamps to the correct transcript with optional corrections."
```
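The overall alignment idea, matching each possibly noisy segment transcript against a window of the corrected transcript and keeping its timestamp when the similarity is high enough, can be sketched with `difflib` standing in for RapidFuzz. The greedy window search and 0-1 scoring here are assumptions, not the class's actual algorithm:

``` python
from difflib import SequenceMatcher

def align_segments(correct: str, segments, timestamps, threshold=0.7):
    "Greedily align each segment transcript to a window of the corrected transcript."
    words = correct.split()
    alignments, pos = [], 0
    for seg, ts in zip(segments, timestamps):
        n = len(seg.split())
        best = (0.0, pos)
        # Scan candidate windows starting at or after the last accepted position.
        for start in range(pos, max(pos, len(words) - n) + 1):
            window = " ".join(words[start:start + n])
            score = SequenceMatcher(None, seg.lower(), window.lower()).ratio()
            if score > best[0]:
                best = (score, start)
        score, start = best
        if score >= threshold:
            alignments.append({
                "start": ts["start"], "end": ts["end"],
                "text": " ".join(words[start:start + n]),
                "confidence": score,
            })
            pos = start + n
    return alignments
```
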
