Metadata-Version: 2.1
Name: pysdoclust-stream
Version: 0.1
Summary: SDOstreamclust is an algorithm for clustering data streams
Home-page: UNKNOWN
Author: Simon Konzett
Author-email: konzett.simon@gmail.com
License: LGPL-3.0
Description: # pysdoclust-stream
        
        **SDOstreamclust** 
        
        Incremental stream clustering (and outlier detection) algorithm based on Sparse Data Observers (SDO). 
        
        SDOstreamclust is suitable for large, multi-dimensional datasets where clusters are statistically well represented.
        
        <br>
        
        ## Dependencies
        
        SDOstreamclust requires **numpy**.
        
        <br>
        
        ## Installation
        
        SDOstreamclust can be installed from the **main branch**:
        
                pip3 install git+https://github.com/CN-TU/pysdoclust-stream
        
        or simply:
        
                pip3 install pysdoclust-stream
        
        <br>
        
        ## Example
        
        SDOstreamclust is a straighforward algorithm and very easy to configure. The main parameters are the number of observers `k`, which determines the size of the model and the parameter `T`, which defines the memory of the algorithm. 
        
        Setting the right `k` (default=300) depends on the variability of the data and the expected number of clusters, but this is quite a robust parameter that gives proper performances with values between [200,500] in most scenarios. On the other hand, `T` (default=500) sets the model dynamics and inertia. Intuitively, it is the number of points processed that results in a fully replaced model (on average). Low `T` is recommended when the data show very fast dynamics, while if data evolution is slow and retaining old clusters is dedired, `T` should be set with high values.
        
        Additionally, `input_buffer` (default=0) establishes how many points are necessary for the observers to update the internal clustering. This fundamentally affects the processing speed. Most scenarios commonly tolerate high values in the `input_buffer` without significantly affecting the accuracy performance. Beyond the mentioned ones, other parameters are inherited from SDOclust and SDOstream and do not usually require adjustment. They are described in *python/clustering.py* file.
        
        The following example code retrieves a data stream and initialize SDOstreamclust.
        
        ```python
        from SDOstreamclust import clustering
        import numpy as np
        import pandas as pd
        
        df = pd.read_csv('example/dataset.csv')
        t = df['timestamp'].to_numpy()
        x = df[['f0','f1']].to_numpy()
        y = df['label'].to_numpy()
        
        k = 200 # Model size
        T = 400 # Time Horizon
        ibuff = 10 # input buffer
        classifier = clustering.SDOstreamclust(k=k, T=T, input_buffer=ibuff)
        ```
        
        In the piece of code below the stream data is processed point by point. SDOstreamclust provides a clustering label and an outlierness score per point. It can also perform outlier thresholding internally by giving the label *-1* to outliers. To do this, ``outlier_handling=True`` must be set and the ``outlier_threshold`` (default=5) adjusted.
        
        
        ```python
        all_predic = []
        all_scores = []
        
        block_size = 1 # per-point processing
        for i in range(0, x.shape[0], block_size):
            chunk = x[i:i + block_size, :]
            chunk_time = t[i:i + block_size]
            labels, outlier_scores = classifier.fit_predict(chunk, chunk_time)
            all_predic.append(labels)
            all_scores.append(outlier_scores)
        p = np.concatenate(all_predic) # clustering labels
        s = np.concatenate(all_scores) # outlierness scores
        s = -1/(s+1) # norm. to avoid inf scores
        
        # Thresholding top outliers based on Chebyshev's inequality (88.9%)
        th = np.mean(s)+3*np.std(s)
        p[s>th]=-1
        
        # Evaluation metrics
        from sklearn.metrics.cluster import adjusted_rand_score
        from sklearn.metrics import roc_auc_score
        print("Adjusted Rand Index (clustering):", adjusted_rand_score(y,p))
        print("ROC AUC score (outlier/anomaly detection):", roc_auc_score(y<0,s))
        ```
        
        Giving *ARI=0.97* and *ROC-AUC=0.99*. Note how SDOstreamclust assigns high outlierness scores to the first points of emerging clusters.
        
        ![](example/example_dataset.png)
        
        
Platform: UNKNOWN
Requires-Python: >=3.5
Description-Content-Type: text/markdown
