Metadata-Version: 2.4
Name: bocluster
Version: 0.1.0
Summary: Low code text clustering for the Tibetan language
Project-URL: Homepage, https://github.com/billingsmoore/bocluster
Project-URL: Issues, https://github.com/billingsmoore/bocluster/issues
Author-email: billingsmoore <billingsmoore@gmail.com>
License-File: LICENSE
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Requires-Python: >=3.8
Requires-Dist: botok
Requires-Dist: matplotlib
Requires-Dist: numpy
Requires-Dist: pandas
Requires-Dist: scikit-learn
Requires-Dist: sentence-transformers
Requires-Dist: tqdm
Requires-Dist: umap-learn
Description-Content-Type: text/markdown

# Text Clustering

This repository contains tools for embedding and clustering texts, labeling the resulting clusters, and visualizing the labeled clusters.

## Install

Install the library to get started:

```bash
pip install --upgrade bocluster
```

## Usage

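The code block below runs the full pipeline: it loads a dataset, fits the classifier, optionally reassigns outliers to clusters, and displays a visualization.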

```python
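# `datasets` is not listed among bocluster's dependencies;
# install it separately if needed (pip install datasets)
from datasets import load_dataset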
from bocluster.cluster import BoClusterClassifier

# load a Tibetan-language text dataset
ds = load_dataset('billingsmoore/LotsawaHouse-bo-en', split='train')

# initialize a BoClusterClassifier object
bcc = BoClusterClassifier()

# fit the classifier on a set of texts
bcc.fit(ds['bo'][:1000])

# optional: assign every data point to a cluster, so that none are treated as outliers
bcc.classify_outliers()

# show a visualization of results
bcc.show()
```
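
The `classify_outliers()` step above is described only as treating every data point as a member of a cluster. One common way to do this is nearest-centroid assignment, sketched below in plain Python. This is an illustration of the general idea, not bocluster's implementation: the function name `assign_outliers_to_nearest_cluster`, the sample points, and the use of `-1` as the outlier label (a convention in density-based clusterers such as DBSCAN) are all assumptions made for the example.

```python
import math
from collections import defaultdict


def assign_outliers_to_nearest_cluster(points, labels, outlier_label=-1):
    """Return a copy of `labels` in which each outlier takes the label
    of the cluster whose centroid is closest (Euclidean distance).

    Hypothetical helper for illustration; not part of bocluster.
    """
    # Group the non-outlier points by their cluster label.
    members = defaultdict(list)
    for point, label in zip(points, labels):
        if label != outlier_label:
            members[label].append(point)

    # With no clusters at all, there is nothing to assign outliers to.
    if not members:
        return list(labels)

    # A centroid is the coordinate-wise mean of a cluster's points.
    centroids = {
        label: [sum(coords) / len(pts) for coords in zip(*pts)]
        for label, pts in members.items()
    }

    new_labels = []
    for point, label in zip(points, labels):
        if label == outlier_label:
            # Pick the cluster label with the nearest centroid.
            label = min(centroids, key=lambda c: math.dist(point, centroids[c]))
        new_labels.append(label)
    return new_labels


# Toy 2-D "embeddings": two tight clusters plus two points marked as outliers.
points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (1.0, 1.0), (4.0, 4.2)]
labels = [0, 0, 1, 1, -1, -1]

new_labels = assign_outliers_to_nearest_cluster(points, labels)
print(new_labels)  # [0, 0, 1, 1, 0, 1]
```

Nearest-centroid assignment is simple and deterministic, but it ignores cluster shape and density, so a real pipeline may use a different rule.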