Metadata-Version: 2.4
Name: openflam
Version: 1.0.1
Summary: Framewise Lanaguge-Audio Modeling
Author: Yusong Wu
Author-email: Ke Chen <kechen@adobe.com>, Oriol Nieto <onieto@adobe.com>, Prem Seetharaman <seethara@adobe.com>
Maintainer: Yusong Wu
Maintainer-email: Ke Chen <kechen@adobe.com>, Oriol Nieto <onieto@adobe.com>, Prem Seetharaman <seethara@adobe.com>
License: Copyright © 2025, Adobe Inc. and its licensors. All rights reserved.
        
        ADOBE RESEARCH LICENSE
         
        Adobe grants any person or entity ("you" or "your") obtaining a copy of these certain research materials that are owned by Adobe ("Licensed Materials") a nonexclusive, worldwide, royalty-free, revocable, fully paid license to (A) reproduce, use, modify, and publicly display the Licensed Materials; and (B) redistribute the Licensed Materials, and modifications or derivative works thereof, provided the following conditions are met:
         
        -      The rights granted herein may be exercised for noncommercial research purposes (i.e., academic research and teaching) only. Noncommercial research purposes do not include commercial licensing or distribution, development of commercial products, or any other activity that results in commercial gain.
        -      You may add your own copyright statement to your modifications and/or provide additional or different license terms for use, reproduction, modification, public display, and redistribution of your modifications and derivative works, provided that such license terms limit the use, reproduction, modification, public display, and redistribution of such modifications and derivative works to noncommercial research purposes only.
        -      You acknowledge that Adobe and its licensors own all right, title, and interest in the Licensed Materials. 
        -      All copies of the Licensed Materials must include the above copyright notice, this list of conditions, and the disclaimer below.
         
        Failure to meet any of the above conditions will automatically terminate the rights granted herein. 
         
        THE LICENSED MATERIALS ARE PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND. THE ENTIRE RISK AS TO THE USE, RESULTS, AND PERFORMANCE OF THE LICENSED MATERIALS IS ASSUMED BY YOU. ADOBE DISCLAIMS ALL WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, WITH REGARD TO YOUR USE OF THE LICENSED MATERIALS, INCLUDING, BUT NOT LIMITED TO, NONINFRINGEMENT OF THIRD-PARTY RIGHTS. IN NO EVENT WILL ADOBE BE LIABLE FOR ANY ACTUAL, INCIDENTAL, SPECIAL OR CONSEQUENTIAL DAMAGES, INCLUDING WITHOUT LIMITATION, LOSS OF PROFITS OR OTHER COMMERCIAL LOSS, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THE LICENSED MATERIALS, EVEN IF ADOBE HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
        
Project-URL: Homepage, https://github.com/adobe-research/openflam
Project-URL: Bug Tracker, https://github.com/adobe-research/openflam/issues
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy<2.0.0,>=1.23.5
Requires-Dist: soundfile
Requires-Dist: librosa
Requires-Dist: torchlibrosa
Requires-Dist: torch<2.8.0,>=2.6.0
Requires-Dist: torchaudio<2.8.0,>=2.6.0
Requires-Dist: torchvision<0.23.0,>=0.21.0
Requires-Dist: lightning
Requires-Dist: transformers==4.56.1
Requires-Dist: matplotlib
Dynamic: license-file

# OpenFLAM
<p align="center">
  <img src="https://raw.githubusercontent.com/adobe-research/openflam/main/assets/FLAM_SLOGAN.png" alt="Framewise Language-Audio Modeling" width="75%"/>
</p>
<p align="center">
  <a href="https://arxiv.org/abs/2505.05335"><img src="https://img.shields.io/badge/arXiv-2505.05335-brightgreen.svg?logo=arxiv&logoColor=red"/></a>
  <a href="https://pypi.org/project/openflam"><img src="https://badge.fury.io/py/openflam.svg?icon=si%3Apython&icon_color=%232add51"/></a>
  <a href="./LICENSE"><img alt="Static Badge" src="https://img.shields.io/badge/License-Adobe_Research-yellow?logo=bookstack&logoColor=yellow"></a>
  <a href="https://flam-model.github.io/"><img alt="Static Badge" src="https://img.shields.io/badge/FLAM%20Website-8A2BE2?logo=wolfram"></a>
</p>
 
### Joint Audio and Text Embeddings via Framewise Language-Audio Modeling (FLAM)

FLAM is a cutting-edge language–audio model that supports both zero-shot sound even detection and large-scale audio retrieval via free-form text.

This code accompanies the following ICML 2025 publication:
```
@inproceedings{flam2025,
    title={{FLAM}: Frame-Wise Language-Audio Modeling},
    author={Yusong Wu and Christos Tsirigotis and Ke Chen and Cheng-Zhi Anna Huang and Aaron Courville and Oriol Nieto and Prem Seetharaman and Justin Salamon},
    booktitle={Forty-second International Conference on Machine Learning (ICML)},
    year={2025},
    url={https://openreview.net/forum?id=7fQohcFrxG}
}
```

## Architecture

FLAM is based on contrastive language-audio pretraining, known as CLAP, and improve its capability by supporting the frame-wise event localization via learnable text and audio biases and scales.  
<p align="center">
  <img src="https://raw.githubusercontent.com/adobe-research/openflam/main/assets/FLAM_ARCH.png" alt="FLAM Architecture" width="100%"/>
</p>

## Quick Start 

Install FLAM via PyPi:

```bash
pip install openflam
```

Two examples are provided:

1. [global_example.py](./test/global_example.py): to obtain audio and text embeddings and do clip-wise similarity.
2. [local_example.py](./test/local_example.py) to do sound event localization and plot the results.

For the API documentation, please refer to [hook.py](./src/openflam/hook.py).


### Global Example: To obtain clip-wise similarity between audio and text embeddings

Please refer to [global_example.py](./test/global_example.py):

```python
import librosa
import torch

import openflam

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
SR = 48000  # Sampling Rate (FLAM requires 48kHz)

flam = openflam.OpenFLAM(model_name="v1-base", default_ckpt_path="/tmp/openflam").to(
    DEVICE
)

# Sanity Check (Optional)
flam.sanity_check()

# load audio
audio, sr = librosa.load("test/test_data/test_example.wav", sr=SR)
audio = audio[: int(10 * sr)]
audio_samples = torch.tensor(audio).unsqueeze(0).to(DEVICE)  # [B, 480000 = 10 sec]

# Define text
text_samples = [
    "breaking bones",
    "metallic creak",
    "tennis ball",
    "troll scream",
    "female speaker",
]

# Get Global Audio Features (10sec = 0.1Hz embeddings)
audio_global_feature = flam.get_global_audio_features(audio_samples)  # [B, 512]

# Get Text Features
text_feature = flam.get_text_features(text_samples)  # [B, 512]

# Calculate similarity (dot product)
global_similarities = (text_feature @ audio_global_feature.T).squeeze(1)

print("\nGlobal Cosine Similarities:")
for text, score in zip(text_samples, global_similarities):
    print(f"{text}: {score.item():.4f}")
```

### Local Example: To perform sound event localization and plot the diagram

Please refer to [local_example.py](./test/local_example.py). 

The following plot will be generated by running the code below:

<p align="center">
  <img src="https://raw.githubusercontent.com/adobe-research/openflam/main/assets/sed_heatmap.png" alt="FLAM Architecture" width="100%"/>
</p>


```python
from pathlib import Path

import librosa
import numpy as np
import scipy
import torch

import openflam
from openflam.module.plot_utils import plot_sed_heatmap

# Configuration
OUTPUT_DIR = Path("sed_output")  # Directory to save output figures

# Define target sound events
TEXTS = [
    "breaking bones",
    "metallic creak",
    "tennis ball",
    "troll scream",
    "female speaker",
]

# Define negative class (sounds that shouldn't be in the audio)
NEGATIVE_CLASS = [
    "female speaker"
]

SR = 48000
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

flam = openflam.OpenFLAM(model_name="v1-base", default_ckpt_path="/tmp/openflam")
flam.to(DEVICE)

# Load and prepare audio
audio, sr = librosa.load("test/test_data/test_example.wav", sr=SR)
audio = audio[: int(10 * sr)]

# Convert to tensor and move to device
audio_tensor = torch.tensor(audio).unsqueeze(0).to(DEVICE)

# Run inference
with torch.no_grad():
    # Get local similarity using the wrapper's built-in method
    # This uses the unbiased method (Eq. 9 in the paper)
    act_map_cross = (
        flam.get_local_similarity(
            audio_tensor,
            TEXTS,
            method="unbiased",
            cross_product=True,
        )
        .cpu()
        .numpy()
    )

# Apply median filtering for smoother results
act_map_filter = []
for i in range(act_map_cross.shape[0]):
    act_map_filter.append(scipy.ndimage.median_filter(act_map_cross[i], (1, 3)))
act_map_filter = np.array(act_map_filter)

# Prepare similarity dictionary for plotting
similarity = {f"{TEXTS[i]}": act_map_filter[0][i] for i in range(len(TEXTS))}

# Prepare audio for plotting (resample to 32kHz)
target_sr = 32000
audio_plot = librosa.resample(audio, orig_sr=SR, target_sr=target_sr)

# Create output directory if it doesn't exist
OUTPUT_DIR.mkdir(exist_ok=True)

# Generate and save visualization
output_path = OUTPUT_DIR / "sed_heatmap.png"
plot_sed_heatmap(
    audio_plot,
    target_sr,
    post_similarity=similarity,
    duration=10.0,
    negative_class=NEGATIVE_CLASS,
    figsize=(14, 8),
    save_path=output_path,
)

print(f"Plot saved: {output_path}")
```

## License

Both **code** and **models** for OpenFLAM are released under a non-commercial [Adobe Research License](./LICENSE). Please, review it carefully before using this technology.

## Pretrained Models

The pretrained checkpoints can be found [here](https://huggingface.co/kechenadobe/OpenFLAM/blob/main/open_flam_oct17.pth).

OpenFLAM automatically handles the downloading of the checkpoint. Please, refer to the previous section for more details.

## Datasets

The original experimental results reported in [our paper](https://arxiv.org/abs/2505.05335) were obtained by the model trained on internal datasets that are not publicly shareable.

OpenFLAM is trained **on all publicly available datasets**, including: 

1. Datasets with coarse (aka, global or weak) labels: AudioSet-ACD (a LLM-based captioning for AudioSet), FreeSound, WavCaps, AudioCaps, Clotho;
2. Datasets with fine-grained (aka, local or strong) labels: AudioSet Strong, UrbanSED, DESED, Maestro, and Simulation data from AudioSet-ACD & FreeSound.

We report a comparison of the OpenFLAM performance to the original paper report (the global retrieval metrics --ie, A2T and T2A-- are R@1 / R@5):
<p align="center">
  <img src="https://raw.githubusercontent.com/adobe-research/openflam/main/assets/Exp.png" alt="FLAM Exp" width="100%"/>
</p>


## Citation

If you use OpenFLAM, please cite our main work:

```
@inproceedings{flam2025,
    title={{FLAM}: Frame-Wise Language-Audio Modeling},
    author={Yusong Wu and Christos Tsirigotis and Ke Chen and Cheng-Zhi Anna Huang and Aaron Courville and Oriol Nieto and Prem Seetharaman and Justin Salamon},
    booktitle={Forty-second International Conference on Machine Learning (ICML)},
    year={2025},
    url={https://openreview.net/forum?id=7fQohcFrxG}
}
```

