Metadata-Version: 2.4
Name: GeometricNearestNeighborsProcessor
Version: 0.1.0
Summary: Geometric nearest-neighbor workflows with threshold-based anomaly detection.
License: MIT
Project-URL: Homepage, https://github.com/antononcube/Python-GeometricNearestNeighborsProcessor
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy
Requires-Dist: pandas
Requires-Dist: scipy
Requires-Dist: scikit-learn
Requires-Dist: plotly
Requires-Dist: OutlierIdentifiers
Dynamic: license-file

# GeometricNearestNeighborsProcessor

A Python package for Geometric Nearest Neighbors (GNN) workflows: data rescaling, fast anomaly finding, and derivation of similarity matrices.

This package is a Python translation of the Wolfram Language software monad 
["MonadicGeometricNearestNeighbors"](https://resources.wolframcloud.com/PacletRepository/resources/AntonAntonov/MonadicGeometricNearestNeighbors), [AAp1].
(The R package ["GNNMon-R"](https://github.com/antononcube/R-packages/tree/master/GNNMon-R), [AAp4], is another translation of [AAp1].)
 
**Remark:** To keep outputs consistent with the R and WL packages, the data frame outputs of the Python package
use camel case for the column names. (That might change in the future.)

----

## Purpose and theoretical background

Consider the following computational tasks for a given set of $n$-dimensional (nD) points $P$:

1. Find the points of $P$ that are outliers or anomalies
2. Find the points of another set $P_1$ that can be seen as anomalies with respect to $P$
3. For a given nD point find its Nearest Neighbors (NNs) in $P$
   - The points of $P$ can have labels 
   - It might be desired to get the distances and labels of the NNs  
4. Plot the points $P$ with minimal setup or specification writing
5. Give the (sparse) proximity matrix of $P$ for a specified number of neighbors 
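
Tasks 3 and 5 can be illustrated directly with scikit-learn, which the package builds on. The following is a minimal sketch under assumptions, not the package's own API: the point set, the labels `p0`, `p1`, ... and the query point are made up for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical labeled 2D point set (illustration only)
rng = np.random.default_rng(1)
points = rng.normal(size=(20, 2))
labels = [f"p{i}" for i in range(len(points))]

nn = NearestNeighbors(n_neighbors=3).fit(points)

# Task 3: nearest neighbors of a query point, with distances and labels
distances, indices = nn.kneighbors(np.array([[0.0, 0.0]]))
neighbor_labels = [labels[i] for i in indices[0]]
print(list(zip(neighbor_labels, distances[0])))

# Task 5: sparse proximity matrix of the point set for 3 neighbors.
# Called without arguments, a point is not counted as its own neighbor.
proximity = nn.kneighbors_graph(mode="distance")
print(proximity.shape, proximity.nnz)
```

Each row of `proximity` holds the distances from one point to its 3 nearest neighbors; all other entries are left unstored.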

Let us define an anomalous point as one that is "too far" from the other points.  
Which points are "too far" can be determined by examining statistics of the distances between each point 
and its $k$ nearest neighbors.

More concretely, point anomalies are found in the following way:

1. Input:
   - Points `P` as a data frame, list, or dictionary
   - Number of nearest neighbors `n`
   - Distance function `d`
   - Aggregation function `a`
2. For each point of `P` 
   - Find its `n` nearest neighbors
   - Aggregate with `a` the corresponding `n` distances
3. Using the aggregated values from the previous step (a 1D array), find outlier identification parameters
   - For example, Hampel, SPLUS, or Quartile parameters
4. Identify anomalies using the parameters of the previous step
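
The steps above can be sketched with NumPy and scikit-learn. This is a minimal illustration under assumptions, not the package's implementation: the helper names `knn_aggregated_distances` and `upper_quartile_threshold` are hypothetical, and the threshold uses the common rule Q3 + 1.5 IQR, which may differ from the parameters computed by "OutlierIdentifiers".

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_aggregated_distances(points, n_neighbors=10, aggregate=np.mean):
    """Step 2: aggregate the distances from each point to its nearest neighbors."""
    points = np.asarray(points, dtype=float)
    nn = NearestNeighbors(n_neighbors=n_neighbors, metric="euclidean").fit(points)
    # Called without arguments, kneighbors() excludes each point from its own neighbors
    distances, _ = nn.kneighbors()
    return aggregate(distances, axis=1)

def upper_quartile_threshold(values, factor=1.5):
    """Step 3 (assumed rule): upper fence Q3 + factor * IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    return q3 + factor * (q3 - q1)

# Step 1: 30 random 2D points plus one point placed far away on purpose
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(size=(30, 2)), [[8.0, 8.0]]])

scores = knn_aggregated_distances(points, n_neighbors=5)
threshold = upper_quartile_threshold(scores)

# Step 4: points whose aggregated distance exceeds the threshold
anomaly_indices = np.flatnonzero(scores > threshold)
print(anomaly_indices)
```

Only the upper threshold is used here, because a point with unusually small neighbor distances is not "too far" from the rest.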

----

## Usage examples

Load packages:

```python
# Assumed import path, matching the package name
from GeometricNearestNeighborsProcessor import *
from RandomDataGenerators import *
from OutlierIdentifiers import *

import numpy
import random

import pandas as pd
import plotly.express as px
```


Generate random points:

```python
help(random_data_frame)
```

```python
dfPoints = random_data_frame(n_rows=30, columns_spec = ["X", "Y"], generators= {"X": numpy.random.normal, "Y": numpy.random.normal})
print(dfPoints.shape)
print(dfPoints[1:6])
```

Here is a summary:

```python
dfPoints.describe()
```

Here is a plot of the points:

```python
fig = px.scatter(dfPoints, x="X", y="Y", template="plotly_dark")
fig.show()
```

A typical pipeline of geometric nearest neighbors processing:

```python
gnnObj = (GeometricNearestNeighborsProcessor(dfPoints)
   .make_nearest_function(distance_function = "EuclideanDistance")
   .compute_thresholds(number_of_nearest_neighbors = 10, aggregation_function = "mean", outlier_identifier = "QuartileIdentifierParameters")
   .find_anomalies()
   .echo_function_value("Anomaly points:", lambda x: print(x))
   .plot(title="Random points", template="plotly_dark")
)
```

Show the plot obtained above:

```python
gnnObj.take_value().show()
```

Here we generate another set of random points using the same random point generators:

```python
dfPoints2 = random_data_frame(n_rows=40, columns_spec = ["X", "Y"], generators= {"X": numpy.random.normal, "Y": numpy.random.normal})
print(dfPoints2.shape)
```

Here the points of the second set are classified as anomalous or not:

```python
gnnObj.classify(dfPoints2).take_value()
```
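
One plausible reading of this classification step, written with plain scikit-learn: a new point is flagged when its aggregated distance to the nearest neighbors in the reference set exceeds a threshold derived from the reference set itself. This is a sketch under assumptions (mean aggregation, the rule Q3 + 1.5 IQR, made-up data), not a description of what `classify` does internally.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
reference = rng.normal(size=(200, 2))   # plays the role of the first point set
n_neighbors = 10

nn = NearestNeighbors(n_neighbors=n_neighbors).fit(reference)

# Threshold from the reference points (each point excluded from its own neighbors)
ref_distances, _ = nn.kneighbors()
ref_scores = ref_distances.mean(axis=1)
q1, q3 = np.percentile(ref_scores, [25, 75])
threshold = q3 + 1.5 * (q3 - q1)        # assumed outlier rule

# New points: one in the middle of the cloud, one far away
new_points = np.array([[0.0, 0.0], [10.0, 10.0]])
new_distances, _ = nn.kneighbors(new_points)
is_anomaly = new_distances.mean(axis=1) > threshold
print(is_anomaly)
```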

See the notebook ["Usage-examples.ipynb"](./docs/Usage-examples.ipynb) for more detailed examples.

----

## Implementation details

- The package provides the class `GeometricNearestNeighborsProcessor`, whose methods can be chained into monadic, pipeline-like workflows.

- Plotting of the data points is done via ["plotly"](https://plotly.com/python/) -- just a scatter plot of 2D points for now. 

- The chainable behavior of the methods of the class `GeometricNearestNeighborsProcessor` is implemented 
  by following the principle that all methods return `self`, except the so-called "takers".
  - I.e., methods with names that start with "take_".

- The core Nearest Neighbors (NNs) finding functionality is provided by the ["scikit-learn"](https://scikit-learn.org) class
[NearestNeighbors](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html).

- The NN-finding algorithms used by `GeometricNearestNeighborsProcessor` are "scan" and "kdtree".
  - "scan" is implemented in the class `GeometricNearestNeighborsProcessor` itself, instead of delegating to the "brute" algorithm of scikit-learn's `NearestNeighbors`.
  - "kdtree" delegates to the "kd_tree" algorithm of `NearestNeighbors`.
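
The "methods return `self`, takers return the value" principle, together with a brute-force scan for nearest neighbors, can be illustrated with a toy class. `TinyPipeline` and its method names are hypothetical and are not part of `GeometricNearestNeighborsProcessor`; the sketch only shows the pattern.

```python
import numpy as np

class TinyPipeline:
    """Hypothetical toy class showing the chaining pattern."""

    def __init__(self, points):
        self._points = np.asarray(points, dtype=float)
        self._value = None

    def find_nearest_by_scan(self, point, n=1):
        # "scan": compute the distance to every stored point, then keep the n smallest
        dists = np.linalg.norm(self._points - np.asarray(point, dtype=float), axis=1)
        order = np.argsort(dists)[:n]
        self._value = [(int(i), float(dists[i])) for i in order]
        return self  # ordinary methods return self, so calls can be chained

    def echo_value(self, note=""):
        print(note, self._value)
        return self

    def take_value(self):
        # a "taker" ends the chain and returns the pipeline value
        return self._value

result = (TinyPipeline([[0, 0], [1, 0], [5, 5]])
          .find_nearest_by_scan([0.9, 0.1], n=2)
          .echo_value("Nearest (index, distance):")
          .take_value())
```

Here `result` is a list of `(index, distance)` pairs ordered from nearest to farthest.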

----

## References

### Wolfram Language

[AAp1] Anton Antonov, [MonadicGeometricNearestNeighbors](https://resources.wolframcloud.com/PacletRepository/resources/AntonAntonov/MonadicGeometricNearestNeighbors), Wolfram Language paclet, 
(2023-2025), 
[Wolfram Language Paclet Repository](https://resources.wolframcloud.com/PacletRepository).

[AAp2] Anton Antonov, [OutlierIdentifiers](https://resources.wolframcloud.com/PacletRepository/resources/AntonAntonov/OutlierIdentifiers/), Wolfram Language paclet, 
(2023), 
[Wolfram Language Paclet Repository](https://resources.wolframcloud.com/PacletRepository).

### R

[AAp3] Anton Antonov, [OutlierIdentifiers](https://github.com/antononcube/R-packages/tree/master/OutlierIdentifiers), R package,
(2019-2024),
[GitHub/antononcube](https://github.com/antononcube).

[AAp4] Anton Antonov, [GNNMon-R](https://github.com/antononcube/R-packages/tree/master/GNNMon-R), R package,
(2019-2025),
[GitHub/antononcube](https://github.com/antononcube).

[AAp5] Anton Antonov, [KDTreeAlgorithm](https://github.com/antononcube/R-packages/tree/master/KDTreeAlgorithm), R package,
(2025),
[GitHub/antononcube](https://github.com/antononcube).

### Python

[AAp6] Anton Antonov, [RandomDataGenerators](https://github.com/antononcube/Python-packages/tree/main/RandomDataGenerators), Python package,
(2021-2026), 
[GitHub/antononcube](https://github.com/antononcube).
([PyPI.org](https://pypi.org/project/RandomDataGenerators/).)

[AAp7] Anton Antonov, [OutlierIdentifiers](https://github.com/antononcube/Python-packages/tree/main/OutlierIdentifiers), Python package,
(2024), 
[GitHub/antononcube](https://github.com/antononcube).
([PyPI.org](https://pypi.org/project/OutlierIdentifiers/).)
