Metadata-Version: 2.4
Name: whisper-lid
Version: 0.0.1
Summary: Spoken Language IDentification (LID) using a multilingual Whisper model
Home-page: https://github.com/bond005/whisper-lid
Author: Ivan Bondarenko
Author-email: bond005@yandex.ru
License: Apache License Version 2.0
Keywords: whisper,LID,spoken-language,language-identification,spoken-language-identification
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing
Classifier: Topic :: Text Processing :: Linguistic
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
License-File: LICENSE
Requires-Dist: librosa>=0.10.0
Requires-Dist: numpy
Requires-Dist: scipy
Requires-Dist: sentencepiece
Requires-Dist: soundfile>=0.11.0
Requires-Dist: torch>=2.0.1
Requires-Dist: torchaudio>=2.0.1
Requires-Dist: transformers>=4.38.1
Requires-Dist: datasets<4.0
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: home-page
Dynamic: keywords
Dynamic: license
Dynamic: license-file
Dynamic: requires-dist
Dynamic: summary


Whisper-LID
===============

Whisper-LID is a spoken language identification system built on the
multilingual Whisper model. It identifies either the spoken language or a
non-speech event. Section 2.3 of the Whisper paper
(https://arxiv.org/abs/2212.04356) states that the model predicts a language
tag or a non-speech tag immediately after the `<|startoftranscript|>` special
token. Based on this, the system estimates the probability distribution of the
token that follows `<|startoftranscript|>` and returns the most probable token
as its prediction. Since that token can be either a language tag or a
non-speech tag, the system acts as both a spoken language identifier and a
voice activity detector.
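
The selection step described above can be sketched in a few lines of plain
Python. This is only an illustration under stated assumptions, not the
package's actual code: the function name `pick_tag`, the toy logits, the toy
token ids, and the choice to normalize over the candidate tags only are all
hypothetical. In practice the logits would come from the Whisper decoder at
the position right after `<|startoftranscript|>`, and the token ids from the
Whisper tokenizer.

```python
import math


def pick_tag(logits, candidate_ids):
    """Return the most probable tag and its probability.

    logits: one score per vocabulary entry for the position that follows
        <|startoftranscript|> (assumed to be computed elsewhere).
    candidate_ids: dict mapping a tag string to its token id. It is assumed
        to hold the language tags plus the non-speech tag.
    """
    # Keep only the scores of the candidate tags (an assumption of this
    # sketch: the distribution is normalized over these tags only).
    scores = {tag: logits[idx] for tag, idx in candidate_ids.items()}

    # Numerically stable softmax: subtract the maximum before exponentiating.
    peak = max(scores.values())
    exps = {tag: math.exp(score - peak) for tag, score in scores.items()}
    total = sum(exps.values())
    probs = {tag: value / total for tag, value in exps.items()}

    # The tag with the highest probability is the final prediction.
    best = max(probs, key=probs.get)
    return best, probs[best]


# Toy example: a 4-entry "vocabulary" with made-up scores and token ids.
toy_logits = [0.1, 2.0, 0.5, 3.0]
toy_candidates = {"<|en|>": 1, "<|ru|>": 2, "<|nospeech|>": 3}
tag, prob = pick_tag(toy_logits, toy_candidates)
print(tag, round(prob, 3))
```

With these toy values the non-speech tag has the largest score, so the sketch
reports a non-speech event rather than a language. This is how a single
prediction can serve both as language identification and as voice activity
detection.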
