Metadata-Version: 2.4
Name: sf-sml
Version: 0.1.0.dev20250620
Summary: Secretflow Secure Machine Learning
Home-page: https://github.com/secretflow/spu
Author: SecretFlow Team
Author-email: secretflow-contact@service.alipay.com
License: Apache 2.0
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Python: >=3.10, <3.12
Description-Content-Type: text/markdown
Requires-Dist: numpy<2,>=1.22.0
Requires-Dist: multiprocess>=0.70.12.2
Requires-Dist: jax[cpu]<=0.4.34,>=0.4.16
Requires-Dist: spu==0.9.4.dev20250618
Requires-Dist: pandas==1.5.3
Requires-Dist: scikit-learn==1.5.2
Provides-Extra: dev
Requires-Dist: pylint; extra == "dev"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# SML: Secure Machine Learning

**SML** is a Python module that implements machine learning algorithms with [JAX](https://github.com/google/jax)
and supports **secure** training and inference, powered by [SPU](https://github.com/secretflow/spu).

Our vision is to establish a general-purpose privacy-preserving machine learning (PPML) library
that serves as a secure version of [scikit-learn](https://github.com/scikit-learn/scikit-learn).

In general, the APIs of our algorithms are designed to be as consistent as possible with scikit-learn.
However, due to security considerations and certain limitations of SPU, some APIs differ.
Any differences are explained in detail in the documentation.

## Why not scikit-learn

First, scikit-learn is built on top of NumPy and SciPy and runs in a centralized mode.
You must therefore collect all data on one node, which cannot protect data privacy.

The implementations in scikit-learn are usually efficient and well validated, so why not just "translate" them to MPC?

The short answer is **accuracy** and **efficiency**.

In PPML, we observe that most frameworks encode floating-point numbers as fixed-point numbers,
parameterized by `field` (the bitwidth of the underlying integer) and `fxp_fraction_bits` (the bitwidth of the fractional part),
which greatly restricts the effective range and precision compared with floating-point numbers.
On the other hand, the computational overhead is determined mainly by the MPC protocol,
so the original CPU-friendly ops may perform poorly.
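
To make the accuracy limitation concrete, here is a minimal plain-Python sketch of fixed-point encoding. It is a simplified model, not SPU's actual encoding (which operates on secret shares in a ring). The helper names `encode`, `decode`, and `fxp_mul`, and the values `FIELD_BITS = 64` and `FXP_FRACTION_BITS = 18`, are assumptions chosen for illustration.

```python
# Illustrative model of fixed-point encoding; helper names and parameter
# values are assumptions for this sketch, not part of the SPU API.
FIELD_BITS = 64          # bitwidth of the underlying integer (`field`)
FXP_FRACTION_BITS = 18   # bits reserved for the fractional part

SCALE = 1 << FXP_FRACTION_BITS


def encode(x: float) -> int:
    """Map a float to the nearest fixed-point integer."""
    return round(x * SCALE)


def decode(v: int) -> float:
    """Map a fixed-point integer back to a float."""
    return v / SCALE


def fxp_mul(a: int, b: int) -> int:
    """Multiply two encoded values and truncate back to one scale factor."""
    return (a * b) >> FXP_FRACTION_BITS


# Precision: the smallest representable step is 2**-FXP_FRACTION_BITS.
resolution = decode(1)

# Values much smaller than the resolution are flushed to zero.
tiny = decode(encode(1e-7))

# Range: the raw product a * b carries 2 * FXP_FRACTION_BITS fractional bits
# before truncation, so it must still fit in a signed FIELD_BITS integer.
max_safe_operand = 2.0 ** ((FIELD_BITS - 1 - 2 * FXP_FRACTION_BITS) / 2)

product = decode(fxp_mul(encode(1.5), encode(2.25)))
```

With these illustrative settings, `resolution` is about 3.8e-6, `tiny` is exactly `0.0`, and two operands of equal magnitude must stay below roughly 1.2e4 for their product not to overflow in this model.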

### Our Solution

So we established a new library, SML, to bridge these gaps:

1. accuracy: optimize and test the algorithms on fixed-point numbers,
e.g. prefer high-precision ops (`rsqrt` rather than `1/sqrt`) and
re-transform inputs where necessary to fit the valid range of non-linear ops
(see [fxp pitfalls](../docs/development/fxp.ipynb)).
2. efficiency: replace CPU-friendly ops with MPC-friendly ops,
e.g. use numeric approximation tricks to avoid complex computation, and prefer arithmetic ops to comparison ops.
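
As an illustration of both points, the sketch below approximates `rsqrt` with Newton iterations, which use only multiplications and subtractions and need no comparisons or divisions. It is a plain-Python illustration under assumed names (`rsqrt_newton`, `iters`), not the approximation SPU actually uses. The iteration reaches the right value only when the initial guess suits the input, which is one reason inputs may need to be re-transformed into a valid range first.

```python
import math


def rsqrt_newton(x: float, iters: int = 8) -> float:
    """Approximate 1/sqrt(x) using only multiply and subtract.

    With the fixed initial guess y = 1.0, Newton's iteration converges to
    1/sqrt(x) only for 0 < x < 3, and quickly only when x is close to 1,
    so callers are expected to rescale x into a suitable range first.
    """
    y = 1.0
    for _ in range(iters):
        y = y * (1.5 - 0.5 * x * y * y)
    return y


# Inside the valid range the approximation matches the reference closely.
errors = [abs(rsqrt_newton(x) - 1.0 / math.sqrt(x)) for x in (0.5, 1.0, 2.0)]

# Outside the valid range the same iteration settles on a wrong value.
out_of_range = rsqrt_newton(4.0)  # lands on -0.5, not 0.5
```

For inputs in `[0.5, 2]` the errors are far below 1e-9 after eight iterations, while `x = 4.0` lands on the negative root, showing why range handling matters for non-linear ops.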

Of course, we also supply an easy-to-test toolbox for advanced developers
who want to develop their own MPC programs:

1. `Simulator`: provides a fixed-point computation environment and runs at high speed.
However, it cannot reproduce a real SPU performance environment,
so its test results do not reflect the actual performance of the algorithm.
2. `Emulator`: emulates the real MPC protocol using multiple processes/Docker (coming soon),
and provides meaningful performance results.

So the **accuracy** can be verified if the algorithm passes the `simulator` test,
and you should test the **efficiency** using `emulator`.

> WARNING: SML is currently undergoing rapid development,
> so it is not recommended for direct use in production environments.

## Installation

First, clone the spu repo to your local disk:

```bash
git clone https://github.com/secretflow/spu.git
```

Some [prerequisites](../CONTRIBUTING.md#build) are required, depending on your system.
Once they are all installed, you can run any test, for example:

```bash
# run kmeans simulation
# simulation: run program in single process
# used for correctness test
pytest -n auto sml/sml/cluster/tests/kmeans_test.py

# run kmeans emulation
# emulation: run program with multiple processes(LAN setting)
# or multiple dockers(WAN setting, will come soon)
# used for efficiency test.
python3 sml/sml/cluster/emulations/kmeans_emul.py
```

## Algorithm Support lists

See [support lists](./support_lists.md) for all the algorithms and features we support.

## Development

See [development](./development.md) if you would like to contribute to SML.

## FAQ

We have collected some [FAQ](./faq.md); please check them before submitting an issue.
