Metadata-Version: 2.4
Name: cjm-transcription-utils
Version: 0.0.1
Summary: Miscellaneous utilities for helping with audio transcription.
Home-page: https://github.com/cj-mills/cjm-transcription-utils
Author: Christian J. Mills
Author-email: 9126128+cj-mills@users.noreply.github.com
License: Apache Software License 2.0
Keywords: nbdev jupyter notebook python
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: License :: OSI Approved :: Apache Software License
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: silero-vad
Requires-Dist: num2words
Requires-Dist: librosa
Requires-Dist: pydub
Requires-Dist: RapidFuzz
Requires-Dist: numerizer
Provides-Extra: dev
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: license
Dynamic: license-file
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# cjm-transcription-utils


<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

## Install

``` bash
pip install cjm_transcription_utils
```

## Project Structure

    nbs/
    ├── chunking.ipynb            # VAD-based audio chunking and overlap-aware transcript merging
    ├── formatting.ipynb          # Formatting time intervals as HMS timestamp ranges
    ├── librosa.ipynb             # Loading and normalizing audio files
    ├── numerizer.ipynb           # Patched numerizer that doesn't convert 'a' to '1'
    ├── postprocessing.ipynb      # Transcript post-processing, including integer replacement
    ├── pydub.ipynb               # Extracting audio segments between start and end times
    ├── silero_vad.ipynb          # Loading audio and preparing VAD timestamps
    └── timestamp_alignment.ipynb # Aligning timestamp segments to the final transcript

Total: 8 notebooks

## Module Dependencies

``` mermaid
graph LR
    chunking[chunking<br/>chunking]
    formatting[formatting<br/>formatting]
    librosa[librosa<br/>librosa]
    numerizer[numerizer<br/>numerizer]
    postprocessing[postprocessing<br/>postprocessing]
    pydub[pydub<br/>pydub]
    silero_vad[silero_vad<br/>silero vad]
    timestamp_alignment[timestamp_alignment<br/>timestamp alignment]

    silero_vad --> chunking
```

*1 cross-module dependency detected*

## CLI Reference

No CLI commands found in this project.

## Module Overview

Detailed documentation for each module in the project:

### chunking (`chunking.ipynb`)

> VAD-based audio chunking and overlap-aware transcript merging.

#### Import

``` python
from cjm_transcription_utils.chunking import (
    get_extended_timestamp_boundaries,
    get_extended_chunk_boundaries,
    generate_chunks_with_vad,
    generate_intermediate_chunks,
    generate_intermediate_chunk_tuples,
    merge_transcripts_with_overlaps
)
```

#### Functions

``` python
def get_extended_timestamp_boundaries(
    timestamps: List[Dict[str, float]],  # List of timestamp dictionaries with 'start' and 'end' keys
    index: int  # Index of the current timestamp
) -> Tuple[float, float]
    "Get extended boundaries for a timestamp using adjacent timestamps."
```

``` python
def get_extended_chunk_boundaries(
    chunks: List[Tuple[float, float]],  # List of (start, end) chunk boundaries
    index: int  # Index of the current chunk
) -> Tuple[float, float]
    "Get extended boundaries for a chunk using adjacent chunks."
```
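The signatures above do not spell out what "extended" means. The following is a minimal sketch, assuming a chunk's extended boundaries reach back to the end of the previous chunk and forward to the start of the next one, with the first and last chunks keeping their own outer edge. `extended_bounds` is a hypothetical name used for illustration only, not the library's implementation.

```python
from typing import List, Tuple

def extended_bounds(chunks: List[Tuple[float, float]], index: int) -> Tuple[float, float]:
    """Widen chunk `index` to cover the gaps on either side (assumed behavior)."""
    start, end = chunks[index]
    # Assumption: extend back to where the previous chunk ended, if there is one.
    if index > 0:
        start = min(start, chunks[index - 1][1])
    # Assumption: extend forward to where the next chunk starts, if there is one.
    if index < len(chunks) - 1:
        end = max(end, chunks[index + 1][0])
    return start, end

chunks = [(0.0, 4.5), (5.0, 9.0), (10.5, 12.0)]
middle = extended_bounds(chunks, 1)  # covers the silence on both sides of chunk 1
```

Under this assumption, widening a chunk into the neighbouring silence gives a transcription model some slack around words that start or end right at a boundary.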

``` python
def generate_chunks_with_vad(
    audio_array: np.ndarray,  # Audio array
    duration: float,  # Total duration of audio in seconds
    max_chunk_seconds: float = 120,  # Maximum chunk duration in seconds
    max_chunk_seconds_offset: float = 0,  # Offset for chunk duration calculation
    speech_timestamps: Optional[List[Dict]] = None,  # List of speech timestamp dictionaries with 'start' and 'end' keys
    max_silence_threshold: float = 2.0  # Maximum silence duration (in seconds) before creating a new chunk
) -> Tuple[List[Tuple[float, float]], List[List[Dict]]]
    "Generate chunks using VAD timestamps with silence-based splitting"
```
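To illustrate the idea of silence-based splitting, here is a simplified sketch that groups speech timestamps into chunks, starting a new chunk when the pause between two segments exceeds `max_silence_threshold` or when the chunk would grow past `max_chunk_seconds`. `chunk_by_silence` is a hypothetical helper; it ignores the `audio_array`, `duration`, and `max_chunk_seconds_offset` parameters and may differ from what `generate_chunks_with_vad` actually does.

```python
from typing import Dict, List, Tuple

def chunk_by_silence(
    speech_timestamps: List[Dict[str, float]],  # [{'start': s, 'end': s}, ...] in seconds
    max_chunk_seconds: float = 120,
    max_silence_threshold: float = 2.0,
) -> Tuple[List[Tuple[float, float]], List[List[Dict[str, float]]]]:
    """Group speech segments into chunks (illustrative, not the library's code)."""
    chunks: List[Tuple[float, float]] = []
    grouped: List[List[Dict[str, float]]] = []
    current: List[Dict[str, float]] = []
    for ts in speech_timestamps:
        if current:
            silence = ts["start"] - current[-1]["end"]
            would_span = ts["end"] - current[0]["start"]
            # Close the current chunk on a long pause or when it would grow too long.
            if silence > max_silence_threshold or would_span > max_chunk_seconds:
                chunks.append((current[0]["start"], current[-1]["end"]))
                grouped.append(current)
                current = []
        current.append(ts)
    if current:
        chunks.append((current[0]["start"], current[-1]["end"]))
        grouped.append(current)
    return chunks, grouped

timestamps = [
    {"start": 0.5, "end": 3.0},
    {"start": 3.4, "end": 6.0},
    {"start": 9.5, "end": 12.0},  # 3.5 s of silence before this segment
]
chunks, grouped = chunk_by_silence(timestamps, max_chunk_seconds=120, max_silence_threshold=2.0)
```

The return value mirrors the documented shape: a list of `(start, end)` chunk boundaries plus the timestamps that fall inside each chunk.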

``` python
def generate_intermediate_chunks(
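    chunks: List[Tuple[float, float]],  # List of (start, end) chunk boundaries
    chunk_timestamps: List[List[Dict]],  # List of timestamp dictionaries for each chunk
    use_extended_boundaries: bool  # Whether to use extended boundaries
) -> List[Tuple[float, float]]:  # Overlapping chunks between consecutive chunk boundaries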
    "Generate overlapping chunks between consecutive chunk boundaries"
```

``` python
def generate_intermediate_chunk_tuples(
    chunks: List[Tuple[float, float]],  # List of (start, end) chunk boundaries
    chunk_timestamps: List[List[Dict]],  # List of timestamp dictionaries for each chunk
    use_extended_boundaries: bool  # Whether to use extended boundaries
) -> List[Tuple[Dict, Dict]]
    "Generate tuples of (last_timestamp, first_timestamp) from consecutive chunks."
```
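The pairing described by `generate_intermediate_chunk_tuples` can be sketched as follows. `boundary_pairs` and `bridging_intervals` are hypothetical helpers; the assumption that an intermediate chunk spans from the start of the last timestamp to the end of the next chunk's first timestamp is illustrative, and the `use_extended_boundaries` option is not modelled.

```python
from typing import Dict, List, Tuple

Timestamp = Dict[str, float]

def boundary_pairs(chunk_timestamps: List[List[Timestamp]]) -> List[Tuple[Timestamp, Timestamp]]:
    """Pair the last timestamp of each chunk with the first timestamp of the next."""
    pairs = []
    for prev, nxt in zip(chunk_timestamps, chunk_timestamps[1:]):
        if prev and nxt:  # skip chunks that contain no speech timestamps
            pairs.append((prev[-1], nxt[0]))
    return pairs

def bridging_intervals(pairs: List[Tuple[Timestamp, Timestamp]]) -> List[Tuple[float, float]]:
    """Assumption: an intermediate chunk runs from the start of the last
    timestamp to the end of the first timestamp of the following chunk."""
    return [(last["start"], first["end"]) for last, first in pairs]

chunk_timestamps = [
    [{"start": 0.5, "end": 3.0}, {"start": 3.4, "end": 6.0}],
    [{"start": 9.5, "end": 12.0}],
]
pairs = boundary_pairs(chunk_timestamps)
intervals = bridging_intervals(pairs)
```

Each bridging interval overlaps the end of one chunk and the beginning of the next, which is what makes overlap correction at the seams possible.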

``` python
def merge_transcripts_with_overlaps(
    normal_transcripts: List[str],  # List of transcripts for normal chunks
    intermediate_transcripts: List[str],  # List of transcripts for intermediate chunks
    segment_transcripts: List[Tuple[str, str]],
    verbose: bool = True  # Whether to print debug information
) -> str
    "Merge normal and intermediate transcripts with overlap correction"
```

### formatting (`formatting.ipynb`)

> Formatting time intervals as HMS timestamp ranges.

#### Import

``` python
from cjm_transcription_utils.formatting import (
    time_interval_to_hms_range
)
```

#### Functions

``` python
def time_interval_to_hms_range(
    duration_tuple  # A tuple of (start_seconds, end_seconds) as floats
)
    "Convert a time interval tuple (start_seconds, end_seconds) to HMS timestamp range format."
```
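The exact output format is not documented here. A minimal sketch of the conversion, assuming an `HH:MM:SS - HH:MM:SS` layout with fractional seconds truncated (both `seconds_to_hms` and `hms_range` are hypothetical names):

```python
from typing import Tuple

def seconds_to_hms(seconds: float) -> str:
    """Format a duration in seconds as HH:MM:SS."""
    total = int(seconds)  # assumption: fractional seconds are truncated
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"

def hms_range(interval: Tuple[float, float]) -> str:
    """Format a (start_seconds, end_seconds) tuple as an HMS range."""
    start, end = interval
    return f"{seconds_to_hms(start)} - {seconds_to_hms(end)}"

label = hms_range((75.0, 3725.9))
print(label)  # 00:01:15 - 01:02:05
```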

### librosa (`librosa.ipynb`)

> Loading and normalizing audio files.

#### Import

``` python
from cjm_transcription_utils.librosa import (
    load_audio
)
```

#### Functions

``` python
def load_audio(
    audio_path: str,  # Path to audio file
    target_sr: int = 16000  # Target sample rate in Hz
) -> Tuple[np.ndarray, int]:  # Audio array and its sample rate
    "Load and normalize audio file"
```

### numerizer (`numerizer.ipynb`)

> Patched numerizer helpers, including a version that doesn't convert 'a' to '1'.

#### Import

``` python
from cjm_transcription_utils.numerizer import (
    original_numerize_numerals,
    patched_numerize_numerals,
    smart_numerize
)
```

#### Functions

``` python
def patched_numerize_numerals(
    s,  # TODO: Add type hint and description
    ignore=None,  # TODO: Add type hint and description
    bias=None  # TODO: Add type hint and description
): # TODO: Add type hint
    "Patched version that doesn't convert 'a' to '1'"
```

``` python
def smart_numerize(
    text  # TODO: Add type hint and description
): # TODO: Add type hint
    "TODO: Add function description"
```

### postprocessing (`postprocessing.ipynb`)

> Transcript post-processing, including integer replacement.

#### Import

``` python
from cjm_transcription_utils.postprocessing import (
    replace_integers_in_string,
    transcription_post_processing
)
```

#### Functions

``` python
def replace_integers_in_string(
    text  # TODO: Add type hint and description
): # TODO: Add type hint
    "TODO: Add function description"
```

``` python
def transcription_post_processing(
    transcript: str  # Transcript text to post-process
) -> str:  # Post-processed transcript
    "Apply post-processing to a transcript."
```

### pydub (`pydub.ipynb`)

> Extracting audio segments between start and end times.

#### Import

``` python
from cjm_transcription_utils.pydub import (
    get_audio_segment
)
```

#### Functions

``` python
def get_audio_segment(
    audio: AudioSegment,  # Audio to extract the segment from
    start: float,  # Start time of the segment
    end: float,  # End time of the segment
    offset: float=0  # TODO: Add description
) -> AudioSegment:  # TODO: Add return description
    "Extract audio segment between start and end times"
```
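pydub's `AudioSegment` is sliced in milliseconds, so any helper that accepts times in seconds has to convert first. The sketch below shows only that conversion, assuming `start` and `end` are given in seconds; `ms_bounds` is a hypothetical helper and the `offset` parameter is deliberately left out because its meaning is not documented here.

```python
from typing import Tuple

def ms_bounds(start: float, end: float) -> Tuple[int, int]:
    """Convert a (start, end) interval in seconds to integer milliseconds."""
    if end < start:
        raise ValueError("end must not be earlier than start")
    return int(round(start * 1000)), int(round(end * 1000))

start_ms, end_ms = ms_bounds(1.25, 3.5)
# With pydub installed and `audio` an AudioSegment, the slice would be:
#     segment = audio[start_ms:end_ms]
```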

### silero vad (`silero_vad.ipynb`)

> Loading audio and preparing VAD timestamps.

#### Import

``` python
from cjm_transcription_utils.silero_vad import (
    prepare_audio_and_vad
)
```

#### Functions

``` python
def prepare_audio_and_vad(
    audio_path: str,  # Path to audio file
    max_chunk_seconds: float,  # Maximum chunk duration in seconds
    max_silence_threshold: float,  # Maximum silence duration before creating a new chunk
    include_timestamps: bool,  # Whether timestamps will be needed
    verbose: bool = True  # Whether to print progress
)
    "Load audio and prepare VAD timestamps if needed."
```

### timestamp alignment (`timestamp_alignment.ipynb`)

> Aligning timestamp segments to the final transcript.

#### Import

``` python
from cjm_transcription_utils.timestamp_alignment import (
    TranscriptAligner,
    align_timestamps_to_transcript
)
```

#### Functions

``` python
def align_timestamps_to_transcript(
    final_transcript: str,  # The final merged transcript
    timestamp_transcripts: List[str],  # List of transcripts for each timestamp segment
    speech_timestamps: List[Dict],  # List of speech timestamp dictionaries
    verbose: bool = True  # Whether to print alignment details
) -> List[Dict]
    "Align timestamp segments to the final transcript."
```

#### Classes

``` python
class TranscriptAligner:
    "Align timestamped segment transcripts to a correct full transcript."
    
    def __init__(self,
                 correct_transcript: str, # The full, correct transcript text
                 segment_transcripts: List[str], # List of individual segment transcriptions (may have errors)
                 timestamps: List[Dict[str, float]], # List of timestamp dictionaries with 'start' and 'end' keys
                 confidence_threshold: int = 70 # Minimum confidence score to accept an alignment
                )
        "Initialize the transcript aligner with complete coverage and correction mechanisms."
    
    def align_timestamps_to_correct_transcript(
            self
        ) -> List[Dict]:  # Timestamp dictionaries aligned to the correct transcript
        "Align timestamps to the correct transcript with optional corrections."
```
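To give a feel for threshold-based alignment, here is a rough sketch that matches each noisy segment transcript against a sliding window of words in the correct transcript and accepts the best match only if its score reaches `confidence_threshold`. It uses the standard library's `difflib.SequenceMatcher` as a stand-in scorer on a 0 to 100 scale. `align_segments` and the `text` and `score` output keys are hypothetical, and this is not how `TranscriptAligner` is implemented.

```python
from difflib import SequenceMatcher
from typing import Dict, List

def align_segments(
    correct_transcript: str,
    segment_transcripts: List[str],
    timestamps: List[Dict[str, float]],
    confidence_threshold: int = 70,
) -> List[Dict]:
    """Attach text from the correct transcript to each timestamp (illustrative only)."""
    words = correct_transcript.split()
    aligned = []
    cursor = 0  # search forward only, so segments stay in order
    for segment, ts in zip(segment_transcripts, timestamps):
        n = max(1, len(segment.split()))
        best_score, best_pos = 0.0, cursor
        # Slide a window of n words over the remaining part of the transcript.
        for pos in range(cursor, max(cursor + 1, len(words) - n + 1)):
            window = " ".join(words[pos:pos + n])
            score = 100 * SequenceMatcher(None, segment.lower(), window.lower()).ratio()
            if score > best_score:
                best_score, best_pos = score, pos
        if best_score >= confidence_threshold:
            text = " ".join(words[best_pos:best_pos + n])
            cursor = best_pos + n
        else:
            text = segment  # low confidence: keep the segment's own text
        aligned.append({"start": ts["start"], "end": ts["end"],
                        "text": text, "score": round(best_score)})
    return aligned

aligned = align_segments(
    "the quick brown fox jumps over the lazy dog",
    ["the quik brown fox", "jumps over the lazy dog"],
    [{"start": 0.0, "end": 1.8}, {"start": 2.1, "end": 4.0}],
)
```

In this sketch the misspelled first segment still clears the default threshold, so its timestamp receives the corrected wording from the full transcript.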
