Metadata-Version: 2.1
Name: wordreduce
Version: 0.0.1
Author-email: Jordi Carrera Ventura <jordi.carrera.ventura@gmail.com>
Project-URL: GitHub repository page, https://github.com/JordiCarreraVentura/wordreduce
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: build
Requires-Dist: numpy==2.1.3
Requires-Dist: pandas==2.2.3
Requires-Dist: pytest==8.3.4
Requires-Dist: scikit-learn==1.5.2
Requires-Dist: twine
Requires-Dist: Unidecode==1.3.8

# WordReduce


This package implements two classes, `WordReduce` and `WordReduceLabeler`, that can be used in Natural Language Processing (NLP) for (i) **self-supervised explicit dimensionality reduction**, (ii) **parameter-free clustering**, and (iii) **self-supervised multilabel classification**.

### Key concept

`WordReduce` encodes a collection of raw texts (unstructured data) into a matrix (structured data) with a pre-defined number of dimensions.

The user is expected to provide the desired number of output dimensions. `WordReduce` then determines which words in the document collection best summarize the data, and maps all documents into the set of coordinates defined by those words.
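To make "coordinates defined by words" concrete, here is a toy sketch of the mapping once the words have been chosen. The helper `project_onto_words` is hypothetical and not part of the package; it assumes whitespace tokenization and raw counts, whereas `WordReduce` selects the words automatically and may weight them differently.

```python
def project_onto_words(docs, words):
    """Map each document to one count per selected word (illustrative only)."""
    rows = []
    for doc in docs:
        tokens = doc.lower().split()  # assumption: naive whitespace tokenization
        rows.append([tokens.count(word) for word in words])
    return rows


docs = ["the cat sat", "the dog sat down"]
selected_words = ["cat", "sat"]  # assumption: words chosen by hand for this example
matrix = project_onto_words(docs, selected_words)
# Each row is a document, each column is one of the selected words.
```

Every column of the resulting matrix corresponds to an actual word, which is what keeps the reduced space readable.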

# Usage

### Low-dimensional Vectorization

```
from wordreduce import WordReduce
# `retokenized` stands for your preprocessed document collection.
wr = WordReduce(schema_size=100, max_df=0.01, min_df=10)
low_dim_matrix = wr.fit_transform(retokenized)
```

### Multilabel Classification

```
from wordreduce import WordReduceLabeler
wrl = WordReduceLabeler(schema_size=100, max_df=0.01, min_df=10)
bags_of_words = wrl.fit_transform(retokenized)
```

### Clustering

```
from wordreduce import WordReduceLabeler
wrl = WordReduceLabeler(schema_size=100, max_df=0.01, min_df=10)
cluster_ids = wrl.fit_clusterize(retokenized)
```

## Technical Description


### Motivation: feature selection _versus_ dimensionality reduction

Linguistic data can be transformed into structured data trivially using the Bag-of-Words (BOW) model. However, the resulting representations are high-dimensional, and cannot be easily used for other types of analysis in the context of data science problems.

High-dimensional spaces can be transformed into low-dimensional ones using dimensionality reduction techniques (e.g. LDA, PCA, NMF, SVD). However, these methods work by projecting an observable space onto a **_latent_** space and, as a result, end up as **black boxes**: the original structure is lost, along with its meaning, which again hampers further analysis.

`WordReduce` addresses this problem by returning an **observable** space of the desired dimensionality. Hence, it performs dimensionality reduction while also retaining explainability and interpretability. The exact methodology is described in detail [below](#how_does_it_work).

<a id="how_does_it_work"></a>
#### How does it work?

##### WordReduce

`WordReduce` bridges the gap between **feature selection** and **dimensionality reduction** by applying the following steps:

1. Vectorization of the input dataset into a BOW-TFIDF representation (by default).
2. Dimensionality reduction on the vectorized dataset (Non-Negative Matrix Factorization by default).
3. <a id="discretization"></a>_k_-bins discretization of the latent topography resulting from the previous step. This lowers its resolution through implicit clustering and serves as a simpler version of product quantization.
4. <a id="supervision"></a>Supervised learning of a feature selection model (a decision tree in the current implementation) using the quantized embeddings as the dependent variable. Each unique discretization is encoded as a nominal categorical label.
5. Feature selection on the original input matrix: the decision tree trained in the preceding step is used to select units from the representation obtained in step 1, down to the target dimensionality requested by the user.

##### WordReduceLabeler

`WordReduceLabeler` builds on top of `WordReduce`: it implicitly invokes `WordReduce` to perform steps 1-5, but then returns a different output. Two options are available:

1. When this class' `transform` or `fit_transform` methods are invoked, the class takes the original dataset as input and, for every document, returns the list of words in that document that [were selected as features](#supervision) for describing the data.
2. When the class' `clusterize` or `fit_clusterize` methods are invoked, for every input document an integer is returned, corresponding to that document's discretization as computed by [step 3 above](#discretization).
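The two output modes can be pictured with plain Python. This is a hedged sketch rather than the class's implementation: `label_documents` and `cluster_ids_from_bins` are hypothetical helpers, and the inputs (tokenized documents, a list of selected words, rows of bin indices) are assumed to come from the steps described above.

```python
def label_documents(tokenized_docs, schema):
    """For every document, keep the tokens that belong to the selected schema."""
    schema_set = set(schema)
    # dict.fromkeys removes repeated tokens while preserving their order.
    return [
        [token for token in dict.fromkeys(doc) if token in schema_set]
        for doc in tokenized_docs
    ]


def cluster_ids_from_bins(binned_rows):
    """Assign one integer per distinct combination of bin indices."""
    ids = {}
    return [ids.setdefault(tuple(row), len(ids)) for row in binned_rows]


tokenized_docs = [["cat", "sat", "cat"], ["dog", "ran"]]
bags_of_words = label_documents(tokenized_docs, schema=["cat", "dog"])
cluster_ids = cluster_ids_from_bins([(0, 1), (1, 1), (0, 1)])
```

In the second helper the number of clusters is never passed in; it is simply the number of distinct discretizations observed in the data.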

### Questions

- **Why parameter-free clustering?** Unlike e.g. _k_-means, where the number of clusters _k_ must be provided by the user, `WordReduce` relies on the discretization step and infers the target number of clusters empirically as a byproduct of that step.
- **Why not feature selection?** Because no dimensions are expected as input: the user provides raw text, so the output dimensions are not a subset of user-supplied input dimensions.
- **Why not dimensionality reduction?** Because the output dimensions are transparent and interpretable. The latent topography is only used for self-supervision, and it is not used as the output schema.



# Testing

```
$ cd wordreduce
$ python -m tests.wordreduce
$ pytest tests/wordreduce.py
$ pytest tests/*
```

# Build

## Instructions for building the package

1. Build the package before uploading: `python -m build` (run from the `wordreduce` directory).
2. Upload the package to pypi: `python -m twine upload --repository {pypi|testpypi} dist/*`
3. Install the package from pypi: `python -m pip install --index-url {https://test.pypi.org/simple|https://pypi.org/simple} --no-deps wordreduce`
4. If any dependencies are required, edit the `pyproject.toml` file, `[project]` field, and add a `dependencies` key with a `List[str]` value, where each string is a `pip`-readable dependency.
