Metadata-Version: 2.4
Name: bertrend
Version: 0.4.9
Summary: BERTrend analyses topic evolution over time using state-of-the-art transformer models: topic modelling and LLM
License: LICENSE.md
License-File: AUTHORS.txt
License-File: LICENSE.md
Keywords: topics,BERTopic,transformers,weak signals,LLM,trend analysis
Author: Jérôme Picault
Author-email: jerome.picault@rte-france.com
Maintainer: Jérôme Picault
Maintainer-email: jerome.picault@rte-france.com
Requires-Python: >=3.12,<4.0
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Development Status :: 4 - Beta
Requires-Dist: APScheduler (>=3.11.2)
Requires-Dist: PyJWT (>=2.11.0)
Requires-Dist: SQLAlchemy (>=2.0.46)
Requires-Dist: aio-pika (>=9.5.5)
Requires-Dist: arxiv (>=2.4.0)
Requires-Dist: bertopic (>=0.16.2)
Requires-Dist: coverage (>=7.13.2)
Requires-Dist: cron-descriptor (>=2.0.6)
Requires-Dist: datamapplot (>=0.6.4)
Requires-Dist: dateparser (>=1.2.2)
Requires-Dist: ddgs (>=9.10.0)
Requires-Dist: dill (>=0.4.1)
Requires-Dist: fastapi (>=0.128.0)
Requires-Dist: feedparser (>=6.0.12)
Requires-Dist: folium (>=0.20.0)
Requires-Dist: gensim (>=4.4.0)
Requires-Dist: geopy (>=2.4.1)
Requires-Dist: google-api-python-client (>=2.189.0)
Requires-Dist: google-auth-oauthlib (>=1.2.4)
Requires-Dist: googlenewsdecoder (>=0.1.7)
Requires-Dist: goose3 (>=3.1.21)
Requires-Dist: hdbscan (>=0.8.41)
Requires-Dist: ipython (>=8.28.0)
Requires-Dist: joblib (>=1.5.3)
Requires-Dist: jsonlines (>=4.0.0)
Requires-Dist: langchain-core (>=1.2.9)
Requires-Dist: langdetect (>=1.0.9)
Requires-Dist: loguru (>=0.7.3)
Requires-Dist: lxml-html-clean (>=0.4.3)
Requires-Dist: markdown (>=3.10.1)
Requires-Dist: newscatcher-sdk (>=1.4.0)
Requires-Dist: newspaper4k (>=0.9.3.1)
Requires-Dist: nicegui[highcharts] (>=3.8.0)
Requires-Dist: nltk (>=3.9.2)
Requires-Dist: openai (>=2.18.0)
Requires-Dist: openai-agents[litellm] (>=0.8.3)
Requires-Dist: openpyxl (>=3.1.5)
Requires-Dist: pandas (<3.0.0)
Requires-Dist: pika (>=1.3.2)
Requires-Dist: plotly (>=6.5.2)
Requires-Dist: plotly-resampler (>=0.11.0)
Requires-Dist: psutil (>=7.2.2)
Requires-Dist: pygooglenews (>=0.1.3)
Requires-Dist: pytest
Requires-Dist: pytest-asyncio
Requires-Dist: python-dotenv (>=1.2.1)
Requires-Dist: requests (>=2.32.3)
Requires-Dist: ruff (>=0.15.0)
Requires-Dist: scikit-learn (>=1.8.0)
Requires-Dist: seaborn (>=0.13.2)
Requires-Dist: sentence-transformers (>=5.2.2)
Requires-Dist: starlette (>=0.52.1)
Requires-Dist: streamlit (>=1.54.0)
Requires-Dist: streamlit-code-editor (>=0.1.22)
Requires-Dist: streamlit-folium (>=0.26.1)
Requires-Dist: supervisor (>=4.3.0)
Requires-Dist: thefuzz (>=0.22.1)
Requires-Dist: tiktoken (>=0.12.0)
Requires-Dist: tldextract (>=5.3.1)
Requires-Dist: torch (>=2.10.0)
Requires-Dist: tqdm (>=4.67.3)
Requires-Dist: typer (>=0.21.1)
Requires-Dist: umap-learn (>=0.5.11)
Project-URL: Bug Tracker, https://github.com/rte-france/BERTrend/issues
Project-URL: Documentation, https://github.com/rte-france/BERTrend
Project-URL: Homepage, https://github.com/rte-france/BERTrend
Project-URL: Repository, https://github.com/rte-france/BERTrend
Description-Content-Type: text/markdown

# BERTrend

[![rte-france - BERTrend](https://img.shields.io/static/v1?label=rte-france&message=BERTrend&color=blue&logo=github)](https://github.com/rte-france/BERTrend "Go to GitHub repo")
[![stars - BERTrend](https://img.shields.io/github/stars/rte-france/BERTrend?style=social)](https://github.com/rte-france/BERTrend)
[![forks - BERTrend](https://img.shields.io/github/forks/rte-france/BERTrend?style=social)](https://github.com/rte-france/BERTrend)
[![issues - BERTrend](https://img.shields.io/github/issues/rte-france/BERTrend)](https://github.com/rte-france/BERTrend/issues)
[![Downloads](https://pepy.tech/badge/bertrend)](https://pepy.tech/project/bertrend)
[![PyPi_Version](https://img.shields.io/pypi/v/bertrend.svg)](https://pypi.org/project/bertrend/)
![Static Badge](https://img.shields.io/badge/python-%3E%3D3.12-blue)
![GitHub License](https://img.shields.io/github/license/rte-france/BERTrend)
[![Lines of Code](https://img.shields.io/endpoint?url=https%3A%2F%2Ftokei.kojix2.net%2Fbadge%2Fgithub%2Frte-france%2FBERTrend%2Flines)](https://tokei.kojix2.net/github/rte-france/BERTrend)

## News

2025-11-24: BERTrend has been integrated into
the [Netminer 5](https://netminer.gitbook.io/netminer5/module-reference/text/bertopic/bertrend) data analysis tool !

## Overview

![BERTrend pipeline](./BERTrend_overview.png)

## Paper

The code in this repository is part of the work described in the paper:
[BERTrend: Neural Topic Modeling for Emerging Trends Detection](./BERTrend%20paper.pdf).

## Description

BERTrend is a novel framework for detecting and monitoring weak signals in large, evolving text corpora. It uses neural
topic modeling, specifically [BERTopic](https://github.com/MaartenGr/BERTopic), in an online learning setting to
identify and track topic evolution over time.

BERTrend classifies topics as noise, weak signals, or strong signals based on their popularity trends, using a metric
that considers both the number of documents and update frequency. This method allows for real-time monitoring and
analysis of emerging trends while filtering out noise.

## Main Features

The BERTrend repository provides a set of metrics, methods, and visualization tools to support the analysis of dynamic
topic modeling in large text corpora. The tools are designed to help users detect, analyze, and visualize emerging
topics and trends over time.

The code base is illustrated via three main demos:

### Dynamic Topic Modeling Analysis

- [bertrend/demos/topic_analysis](bertrend/demos/topic_analysis): Focuses on dynamic topic modeling analysis using:
    - Dedicated metrics, called TEMPTopic:
        - Stability Evaluation: Measures the stability of topics over time, evaluating how consistent and coherent
          topics remain.
        - Volatility Assessment: Analyzes the volatility of topics, identifying how much topics change over different
          time periods.
        - Two Perspectives:
            - Temporal Topic Embedding Stability: Evaluates the stability of topic embeddings over time.
            - Temporal Topic Representation Stability: Assesses the stability of topic representations over time.
    - Visualization tools to easily assess the results.

#### Topic Modeling Analysis Demonstrator Video

https://github.com/user-attachments/assets/f600f666-a6da-40be-8b07-5041b3bde1dc

#### Demonstrator Features

- Data management: selection of single/multiple datasets, filter by time, filter by size, split long documents by
  paragraphs
- Customization of BERTopic parameters
- Exploration of topics by source, generate topic summaries using specific sources
- Generate and customize newsletters by applying LLMs on BERTopic's output topics
- Visualize and explore topics via different types of interactive figures (topics treemap, data maps)
- Explore the evolution of topics in a temporal semantic space
- Explore the stability of topics measured by TEMPTopic

### Weak Signal Analysis

- [bertrend/demos/weak_signals](bertrend/demos/weak_signals): Identifies and analyzes emerging trends and signals

#### Weak Signal Analysis Demonstrator Video

https://github.com/user-attachments/assets/d79368d9-d4e0-4324-8a98-a888f0ab3b65

#### Demonstrator Features

- Exploration of signals in real time at any given timestamp
- Categorization of signals into three different dataframes: noise, weak and strong signals
  (each dataframe contains useful information about the signals)
- Interpretation of these signals using LLMs
- Inspection of the topic merging process using Sankey diagram

### Prospective Analysis

- [bertrend/bertrend_apps/prospective_demo](bertrend/bertrend_apps/prospective_demo): A comprehensive "Prospective
  Analysis Demo" application for monitoring, analysis, and trend detection in real time (works with live data).

#### Prospective Analysis Demonstrator Video

(Coming soon)

#### Demonstrator Features

- Data flow configuration and collection status monitoring
- Model status and monitoring by information source
- Real-time signal analysis and trend exploration
- Detailed topic analysis and interpretation
- Automated report and newsletter generation
- Comparative analysis between different periods

## Installation

### Environment Setup (.env and .venv)

BERTrend now supports configuration via a repository-level `.env` file. On import, the package attempts to load
environment variables automatically using `python-dotenv`.

Recommended steps:

1. Create and fill your `.env` at the repo root (a template is provided):
    - Copy `.env` and set your values (do not commit secrets).
    - Common keys:
        - `BERTREND_BASE_DIR`: base directory for BERTrend data/models/logs
        - OpenAI/LLM:
            - `OPENAI_API_KEY`
            - `OPENAI_BASE_URL` (optional for OpenAI-compatible providers such as LiteLLM and Azure)
            - `OPENAI_DEFAULT_MODEL`
        - Providers (optional): `SEMANTIC_SCHOLAR_API_KEY`, `NEWSCATCHER_API_KEY`, `DBPEDIA_REST_API_URL`
        - Embedding service security (optional): `BERTREND_SECRET_KEY`, `DEFAULT_RATE_LIMIT`, `DEFAULT_RATE_WINDOW`,
          `CLIENT_REGISTRY_FILE`
        - Email (optional): `BERTREND_EMAIL_BACKEND`, `GMAIL_*`, `SMTP_*`, `SENDGRID_API_KEY`
        - Misc: `CUDA_VISIBLE_DEVICES`
2. Use a virtual environment and install dependencies inside it:
   ```bash
   uv venv --python 3.13
   source .venv/bin/activate  # Windows: .venv\\Scripts\\activate
   uv pip install . 
   uv pip install python-dotenv  # ensures .env is auto-loaded
   ```
3. Notes
    - If `python-dotenv` isn’t installed, `.env` won’t be auto-loaded; you can either install it or export variables via
      your shell/runner.
    - Scheduled jobs created from the apps use the exact interpreter of your active environment (`sys.executable`),
      ensuring `.venv/bin/python` is used. Keep your venv path stable on the machine running cron.
    - Cron entries source your shell profile and pass through any required variables. You can also rely on `.env` for
      most settings.

### Hardware and Software Requirements

- Recommended: at least one GPU with 16GB RAM for faster processing
- Python >= 3.13 is required

### Docker Installation

BERTrend can be run using Docker, which provides an easy way to run the application without installing dependencies
directly on your system.

For detailed instructions on using BERTrend with Docker, see the [Docker documentation](./docs/docker.md).

### Package Installation

- (Recommended): create a new python environment using `uv`:

  `uv venv --python 3.13`


- Installation from pypi

  `uv pip install bertrend`


- Install using pip or poetry in the BERTrend home directory:

  ```bash
  # Basic installation with core dependencies
  uv pip install . 

  ```

- (Use `uv pip install -e .` for a local installation)

NB. Due to the number of dependencies, the installation may take some time on an empty environment.

### Dependency Management

BERTrend's dependencies are organized into logical groups for easier management:

- **Core dependencies**: Essential packages for topic modeling and analysis
- **NLP and text processing**: Libraries for natural language processing tasks
- **LLM integration**: Components for working with large language models
- **Visualization and UI**: Tools for data visualization and user interfaces
- **Utilities**: General utility libraries

This organization simplifies installation and maintenance, allowing users to install only the dependencies they need for
their specific use case.

### Verifying Installation

To verify that your installation is working correctly, you can run the installation verification script:

```bash
python -m bertrend.tests.test_installation
```

This script will check that all dependencies are installed correctly and provide a summary of the installation status.
If any dependencies are missing, the script will provide instructions on how to install them.

### Datasets

#### Dataset Sources

BERTrend can work with various text datasets such as:

- Scientific publications
- News articles
- Social media publications

Recommended public datasets:

- New York Times dataset: https://www.kaggle.com/datasets/aryansingh0909/nyt-articles-21m-2000-present
- Arxiv dataset: https://www.kaggle.com/datasets/Cornell-University/arxiv

For custom dataset creation, please refer to the documentation page about [data providers](./docs/data_provider.md).

For email configuration and newsletter sending (optional), please refer to
the [mail configuration guide](./docs/mail_configuration.md).

In order to use the provided demos, datasets :

- can be stored on the server in: `$BERTREND_BASE_DIR/data/bertrend/`
- or can be uploaded from the client device on which are displayed the Streamlit apps.

#### Dataset Format

Required columns:

- `text`
- `timestamp`

Supported formats:

- csv
- xlsx
- parquet
- json/jsonl

## Quick Start

### 1. Dynamic Topic Modeling Analysis

#### Launch the Demonstrator

```bash
cd bertrend/demos/topic_analysis
CUDA_VISIBLE_DEVICES=<gpu_number> streamlit run app.py
```

#### Description of Main Pages

- `Parameters`: Include all BERTopic parameters. Choose the embedding model carefully based on your data language.
  The embedding model can run locally (a GPU is recommended); it is also possible to use a remote embedding service.

We recommend using a remote embedding service (this is now the default config) as it simplifies the computing.

The change of configuration can be done
in: [services_default_config.toml](bertrend/config/services_default_config.toml).

An example of a very simple implementation to deploy a separate embedding service can be found here:
`bertrend/services/embedding_server`

![parameters](docs/images/parameters.png)

Please refer to the [BERTopic documentation](https://maartengr.github.io/BERTopic/algorithm/algorithm.html) for
parameter details.

The application is split into several pages:

- `Data loading & model training`: Allows to
    - Select a dataset
    - Select the range of timestamps
    - Split paragraphs if needed
    - Train a topic model

![data_selection](docs/images/data_selection.png)

- `Topic exploration`
    - Computes topics over time
      ![exploration_over_time](./docs/images/explore1.png)

    - Allow selecting sources and generating short descriptions for each theme
      ![topic_exploration](./docs/images/explore2.png)

- `Topic visualization`: Several visualization types available:
    - `Overall Results`: Standard information such as inter-topic distance map
    - `Topics Treemap`: Block view of topics and their relative importance
    - `Data Map`: Easy topic and data search interface

  ![data_map](./docs/images/data_map.png)

- `Temporal Visualization`: Focus on dynamic topic modeling:
    - 3D visualization of temporal topic evolution
    - Metrics to assess topic stability over time
    - Overall topic stability, temporal topic stability, and temporal representation stability

  ![temptopic_metrics](./docs/images/temptopic_metrics.png)

- `Generate Newsletters`
    - This is an example of an application combining topic analysis and LLMs
    - Generates newsletters based on prominent topics
    - Uses LLMs to generate topic descriptions and summarize representative documents

### 2. Weak Signal Analysis

**Weak signals** refer to early, small indicators of larger, important trends (like disruptive technologies or public
health issues).

Identifying weak signals early is critical for decision-making across domains such as politics, healthcare, and
innovation

Key Features of the demonstrator:

- Dynamic Topic Modeling in an online learning setting
- Popularity-based Signal Classification
    * based on their popularity trends: noise, weak signals, and strong signals
    * **new popularity metric** considers both the number of documents within a topic and the **frequency of updates**

#### Launch the Weak Signal Analysis Demonstrator

```bash
cd bertrend/demos/weak_signals
CUDA_VISIBLE_DEVICES=<gpu_number> streamlit run app.py
```

#### Data Preparation

- Choose from available datasets
  ![bertrend_data](docs/images/bertrend_data.png)
- Click on `Embed Documents`

#### Train Model

- Select the `Model Training` tab
- Choose time granularity (in days)
- Click `Train Models` to generate topic models per time slice
- Click `Merge Models` for the aggregated model

#### Results Analysis

Provides various statistics about topics:

- Number of topics detected
- Topic size evolution
- Topic popularity evolution
- Overall popularity evolution

![popularity_evolution](docs/images/popularity_evolution.png)

- Lists of noise, weak, and strong signals

#### LLM-based signal interpretation

- As topics are hard to interprete by humans, we showcase how it can be
  done using LLMs.
- Choose a specific topic and click on the button to generate an interpretation
- LLM-based in-depth signal analysis is done using predefined templates, for example:
    - describe what's happened since the last timestamp (summary of main events)
    - present through a multi-faceted description info about key developments, foreseen changes, potential impact, etc.

![llm_interpretation](docs/images/llm_interpretation.png)

## Article Scoring Service

For service setup and API details, see:
[`docs/services/article_scoring_service.md`](docs/services/article_scoring_service.md).

## Support

Please use the 'Issues' section to report and discuss any problems with BERTrend.

## Contributing

We welcome contributions. For major changes, please open an issue first to discuss your proposal.

To contribute:

1. Fork the repository at https://github.com/rte-france/BERTrend
2. Sync your fork with the latest development branch
3. Implement your changes
4. Add tests and documentation if applicable
5. Sync with the latest development branch again
6. Submit a pull request targeting the appropriate branch

## Authors and Acknowledgment

See [AUTHORS](./AUTHORS.txt) for the full list of contributors.

## License Information

Copyright 2023–2025 RTE France RTE: http://www.rte-france.com

This Source Code is subject to the terms of the Mozilla Public License (MPL) v2 also available
here: https://www.mozilla.org/en-US/MPL/2.0/

## Citation

If you use BERTrend in your work, please cite:

```bibtex
@inproceedings{boutaleb-etal-2024-bertrend,
    title = "{BERT}rend: Neural Topic Modeling for Emerging Trends Detection",
    author = "Boutaleb, Allaa  and
      Picault, Jerome  and
      Grosjean, Guillaume",
    editor = "Tetreault, Joel  and
      Nguyen, Thien Huu  and
      Lamba, Hemank  and
      Hughes, Amanda",
    booktitle = "Proceedings of the Workshop on the Future of Event Detection (FuturED)",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.futured-1.1",
    pages = "1--17",
}

```

