Metadata-Version: 2.1
Name: easy-detm
Version: 0.1.1
Summary: A simple, easy-to-use toolkit for Dynamic Embedded Topic Models on temporal document collections.
Author: Jm Su
License: MIT
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch >=1.7.0
Requires-Dist: numpy >=1.19.0
Requires-Dist: scipy >=1.5.0
Requires-Dist: pandas >=1.1.0
Requires-Dist: scikit-learn >=0.23.0
Requires-Dist: matplotlib >=3.3.0
Requires-Dist: seaborn >=0.11.0
Requires-Dist: umap-learn >=0.5.0
Requires-Dist: plotly >=5.0.0
Requires-Dist: scienceplots >=2.0.0
Provides-Extra: dev
Requires-Dist: pytest >=6.0 ; extra == 'dev'
Requires-Dist: pytest-cov >=2.0 ; extra == 'dev'
Requires-Dist: black >=20.0 ; extra == 'dev'
Requires-Dist: flake8 >=3.8 ; extra == 'dev'

# easy-detm Package Documentation

## Package Scope

This package provides a Python interface for training and visualizing the Dynamic
Embedded Topic Model (DETM).

The current data API is intentionally simple:

```python
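from typing import List

documents: List[str]   # one whitespace-tokenized string per document
timestamps: List[int]  # one integer time index per document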
```


## Main API

### Model

```python
from easy_detm import DETMModel
```

`DETMModel` is the high-level class for:

- creating the DETM model,
- fitting it to temporal documents,
- extracting topics,
- inferring document-topic distributions,
- saving and loading checkpoints,
- evaluating topic coherence and topic diversity.

### Data

```python
from easy_detm.data import create_dataset_from_list, DocumentCorpus
```

Use `create_dataset_from_list()` for most workflows. Use `DocumentCorpus` only
when you need to manually control train/validation/test splits.
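
As a rough illustration of what manually controlling the splits can involve, the sketch below partitions document indices into train/validation/test sets within each time period, so that every period is represented in every split. It uses only the standard library; `split_by_time` and its parameters are hypothetical names for this example and are not part of the `easy_detm` or `DocumentCorpus` API.

```python
import random
from collections import defaultdict
from typing import Dict, List


def split_by_time(
    timestamps: List[int],
    train_frac: float = 0.8,
    val_frac: float = 0.1,
    seed: int = 0,
) -> Dict[str, List[int]]:
    """Return document indices for train/val/test, split within each time period."""
    rng = random.Random(seed)

    # Group document indices by their time period.
    by_time = defaultdict(list)
    for idx, t in enumerate(timestamps):
        by_time[t].append(idx)

    splits: Dict[str, List[int]] = {"train": [], "val": [], "test": []}
    for t in sorted(by_time):
        indices = by_time[t]
        rng.shuffle(indices)
        n_train = int(len(indices) * train_frac)
        n_val = int(len(indices) * val_frac)
        splits["train"] += indices[:n_train]
        splits["val"] += indices[n_train:n_train + n_val]
        splits["test"] += indices[n_train + n_val:]  # remainder goes to test
    return splits


timestamps = [0] * 10 + [1] * 10
splits = split_by_time(timestamps)
```

The resulting index lists can be used to select the matching documents and timestamps for each split before handing them to the package.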

### Visualization

```python
from easy_detm import (
    configure_cjk_fonts,
    visualize_embeddings,
    visualize_embeddings_over_time,
    visualize_topic_evolution,
)
```

Visualization functions read the learned model parameters; they do not retrain or
modify the model. `configure_cjk_fonts()` is called automatically by the
visualization module. It can also be called manually to inspect or reset the font
configuration so that Chinese, Japanese, Korean, and English labels render
correctly.

### Topic Metrics

```python
diversity = model.get_topic_diversity(num_words=10)
coherence = model.get_topic_coherence(data=train, num_words=10)
```

`get_topic_diversity()` uses only the trained topic-word distributions.
`get_topic_coherence()` also needs a reference corpus in DETM format, usually
the training split. If you call it on a model restored with `load()`, pass
`data=train` because checkpoints store model parameters and vocabulary, not the
original corpus.
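
To make the diversity metric concrete, the sketch below computes a common definition from the embedded topic model literature: the fraction of unique words among the top words of all topics. Whether `get_topic_diversity()` uses exactly this formula is an assumption; `topic_diversity` here is a standalone, hypothetical helper for illustration.

```python
from typing import List


def topic_diversity(top_words_per_topic: List[List[str]]) -> float:
    """Fraction of unique words among the top words of all topics."""
    all_words = [word for topic in top_words_per_topic for word in topic]
    if not all_words:
        return 0.0
    return len(set(all_words)) / len(all_words)


example_topics = [
    ["climate", "carbon", "emissions"],
    ["trade", "market", "finance"],
    ["carbon", "market", "tax"],  # shares two words with the other topics
]
score = topic_diversity(example_topics)  # 7 unique words out of 9
```

Under this definition, a value close to 1 means the topics share few top words, and a value close to 0 means the topics are largely redundant.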


## Input Requirements

### Documents

Documents should be strings where tokens are separated by whitespace:

```python
documents = [
    "climate carbon emissions",
    "trade market finance",
]
```

The package does not currently perform advanced NLP preprocessing. Recommended
preprocessing steps before calling the package:

- lowercase text,
- remove or normalize punctuation,
- remove domain-specific noise,
- tokenize consistently,
- optionally remove stopwords,
- optionally lemmatize or stem terms.
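
A minimal sketch of the first few steps above, using only the standard library. The `preprocess` function and the tiny `STOPWORDS` set are hypothetical placeholders, not part of the package; a real pipeline would likely use a fuller stopword list and a proper tokenizer or lemmatizer.

```python
import re
from typing import Iterable, List

# Illustrative stopword list only; replace with one suited to your corpus.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}


def preprocess(text: str, stopwords: Iterable[str] = STOPWORDS) -> str:
    """Lowercase, replace punctuation with spaces, and drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # keep word characters and whitespace
    stop = set(stopwords)
    tokens = [tok for tok in text.split() if tok not in stop]
    return " ".join(tokens)  # whitespace-separated tokens, as the package expects


raw = ["The Climate: carbon emissions, rising!", "Trade and the market (finance)."]
documents: List[str] = [preprocess(doc) for doc in raw]
```

Because `\w` matches Unicode word characters in Python 3, this pattern leaves non-ASCII text such as CJK characters intact, although unsegmented CJK text still needs a separate tokenizer to produce whitespace-separated tokens.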

### Timestamps

Timestamps should be integers, one per document:

```python
timestamps = [0, 0, 1, 1, 2, 2]
```

Recommended convention:

- use zero-based indices,
- keep time IDs contiguous,
- make sure every document has one timestamp.
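
If the raw time labels are calendar years or other non-contiguous integers, they can be remapped to this convention before training. The sketch below is one way to do it; `to_time_indices` is a hypothetical helper, not a package function.

```python
from typing import Dict, List, Tuple


def to_time_indices(raw_times: List[int]) -> Tuple[List[int], Dict[int, int]]:
    """Map raw time labels to contiguous zero-based indices, preserving order."""
    unique_sorted = sorted(set(raw_times))
    mapping = {value: idx for idx, value in enumerate(unique_sorted)}
    return [mapping[value] for value in raw_times], mapping


years = [2019, 2019, 2021, 2021, 2024, 2024]
timestamps, year_to_index = to_time_indices(years)
num_times = len(year_to_index)  # number of distinct time periods

# Every document needs exactly one timestamp.
documents = ["a b", "c d", "e f", "g h", "i j", "k l"]
assert len(documents) == len(timestamps)
```

Keeping the returned mapping makes it possible to translate time indices back to the original labels when reading plots or topic listings.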

## Hyperparameter Notes


Important parameters:

- `num_topics`: number of topics.
- `num_times`: number of time periods.
- `rho_size`: topic embedding dimension.
- `emb_size`: word embedding dimension.
- `t_hidden_size`: hidden size for the theta encoder.
- `eta_hidden_size`: hidden size for the eta LSTM.
- `eta_nlayers`: number of LSTM layers for eta.
- `delta`: random-walk prior variance used by the original DETM implementation.
- `enc_drop`: dropout in the theta encoder.
- `batch_size`: minibatch size.
- `learning_rate`: optimizer learning rate.

## Output Interpretation

### Topics

`model.get_topics()` returns top words from the learned topic-word distributions.
Because DETM is dynamic, each topic can have different top words at different
time points.
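
The idea of time-dependent top words can be illustrated without the package. The sketch below assumes a hypothetical nested list `beta[k][t][v]` holding the probability of word `v` under topic `k` at time `t`; the actual return format of `model.get_topics()` may differ.

```python
from typing import List

vocab = ["climate", "carbon", "trade", "market"]

# Hypothetical topic-word probabilities for one topic at two time points.
beta = [
    [
        [0.5, 0.3, 0.1, 0.1],  # topic 0, time 0
        [0.2, 0.6, 0.1, 0.1],  # topic 0, time 1
    ],
]


def top_words(word_probs: List[float], vocab: List[str], num_words: int = 2) -> List[str]:
    """Return the num_words most probable words for one topic at one time point."""
    ranked = sorted(range(len(vocab)), key=lambda v: word_probs[v], reverse=True)
    return [vocab[v] for v in ranked[:num_words]]


top_by_time = [top_words(beta[0][t], vocab) for t in range(len(beta[0]))]
```

Here the same topic ranks "climate" first at time 0 and "carbon" first at time 1, which is the kind of drift a dynamic topic model is meant to capture.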

### Document-Topic Matrix

`model.get_document_topics()` returns an array with shape:

```text
num_documents x num_topics
```

Each row is a topic-proportion vector for one input document.
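
Two common ways to read such a matrix are the dominant topic per document and the average topic proportions per time period. The sketch below uses a small hand-written matrix in place of real model output; the values and variable names are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical document-topic matrix: 4 documents x 2 topics.
doc_topics = [
    [0.7, 0.3],
    [0.6, 0.4],
    [0.2, 0.8],
    [0.1, 0.9],
]
timestamps = [0, 0, 1, 1]

# Index of the most probable topic for each document.
dominant = [max(range(len(row)), key=row.__getitem__) for row in doc_topics]

# Average topic proportions within each time period.
num_topics = len(doc_topics[0])
sums: Dict[int, List[float]] = defaultdict(lambda: [0.0] * num_topics)
counts: Dict[int, int] = defaultdict(int)
for row, t in zip(doc_topics, timestamps):
    counts[t] += 1
    for k, p in enumerate(row):
        sums[t][k] += p
mean_by_time = {t: [s / counts[t] for s in sums[t]] for t in sorted(sums)}
```

In this toy example topic 0 dominates the documents at time 0 and topic 1 dominates at time 1, and `mean_by_time` summarizes that shift as one averaged proportion vector per period.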

### Visualizations

- Embedding plots show topics and words in a shared 2D projection.
- Topic evolution plots show word probability changes over time for one topic.

## Acknowledgements

The core DETM model implementation is adapted from the original DETM code by
Adji Bousso Dieng, Francisco J. R. Ruiz, and David M. Blei:
https://github.com/adjidieng/DETM

Please cite the original paper when using the DETM model:
"The Dynamic Embedded Topic Model" (Dieng, Ruiz, and Blei, 2019).
