Metadata-Version: 2.3
Name: EMBER_CDD__Wdis
Version: 1.0.1
Summary: Elastic Malware Benchmark for Empowering Researchers 
License: MIT
Author: Luca Fabri
Author-email: luca.fabri1999@gmail.com
Requires-Python: >=3.10,<4.0
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Dist: coverage (>=7.8.0,<8.0.0)
Requires-Dist: lief (>=0.16.5,<0.17.0)
Requires-Dist: lightgbm (>=4.6.0,<5.0.0)
Requires-Dist: mypy (>=1.15.0,<2.0.0)
Requires-Dist: numpy (>=2.0.0,<3.0.0)
Requires-Dist: pandas (>=2.0,<3.0)
Requires-Dist: pytest (>=8.3.5,<9.0.0)
Requires-Dist: ruff (>=0.11.6,<0.12.0)
Requires-Dist: scikit-learn (>=1.6.1,<2.0.0)
Requires-Dist: tqdm (>=4.0,<5.0)
Description-Content-Type: text/markdown



# EMBER Feature Extraction

![CI status](https://github.com/malware-concept-drift-detection/ember-features-extraction/actions/workflows/check.yml/badge.svg) 
![Version](https://img.shields.io/github/v/release/malware-concept-drift-detection/ember-features-extraction?style=plastic)



This repository makes it easy to build a dataset of EMBER features from a collection of PE files.

Set up the directory of PE executables, configure the Docker Compose file, and deploy the pipeline. The result is a single `csv` file containing all the features.

If you want to work with the EMBER2017 dataset (containing features from 1.1 million PE files scanned in or before 2017) or the EMBER2018 dataset (containing features from 1 million PE files scanned in or before 2018), please refer to the official EMBER repository.

Details of the selected features are available here: [https://arxiv.org/abs/1804.04637](https://arxiv.org/abs/1804.04637)


## Prerequisites

- Make sure [Docker](https://docs.docker.com/engine/install/) is installed and running.

## Usage

1. Clone the repository and change directory:
    ```bash
    git clone git@github.com:w-disaster/ember.git && cd ember
    ```

2. Set up the directory containing the PE files. The directory should have the following structure:

    ```plaintext
    your_base_dir/
    ├── malware_family_0/
    │   ├── id_malware_sample_0_0.exe
    │   ├── id_malware_sample_0_1.exe
    ├── malware_family_1/
    │   ├── id_malware_sample_1_0.exe
    └── ...
    ```

    Each PE filename will be used as the sample index in the final dataset.

    The directory structure stays the same for malware detection: simply create two directories, `benign` and `malicious`, in place of the malware family directories.

3. Configure `docker-compose.yaml`:
    1. Set the number of processes `N_PROCESSES` used for parallel processing;
    2. Change the source of the volume that mounts the base directory containing the PE files (`your_base_dir`). The default is `/home/luca/WD/NortonDataset670/MALWARE/`;
    3. Set the volume directory where the final dataset will be saved (default `./dataset/`).


4. *Deploy* the pipeline:

    ```bash
    docker compose up
    ```

5. Find the dataset, named `malware_ember_features.csv`, inside the configured output directory. All of its columns are named.

    Besides the features, it contains a `sha256` column and a `family` column. The `sha256` column holds the PE file id, named after the identifier used in our case, while `family` holds the malware family of the corresponding sample.
    If you use a different PE id or do malware detection, consider renaming these columns afterwards.
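
Once the pipeline has finished, the resulting file can be split into identifiers, labels, and feature values. The following is a minimal sketch using only the Python standard library. The function name `load_ember_csv` is hypothetical, and the sketch assumes that `sha256` and `family` are ordinary columns and that every remaining column holds a numeric value; adjust it if your output differs.

```python
import csv

ID_COLUMN = "sha256"     # sample id column described in step 5
LABEL_COLUMN = "family"  # malware family (or benign/malicious) column


def load_ember_csv(csv_path, id_column=ID_COLUMN, label_column=LABEL_COLUMN):
    """Split the dataset into feature names, feature rows, labels, and ids.

    Assumption: every column other than the id and label columns is numeric.
    """
    ids, labels, features = [], [], []
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        fieldnames = reader.fieldnames or []
        missing = {id_column, label_column} - set(fieldnames)
        if missing:
            raise ValueError(f"missing expected columns: {sorted(missing)}")
        # Everything that is not the id or the label is treated as a feature.
        feature_names = [
            name for name in fieldnames if name not in (id_column, label_column)
        ]
        for row in reader:
            ids.append(row[id_column])
            labels.append(row[label_column])
            features.append([float(row[name]) for name in feature_names])
    return feature_names, features, labels, ids
```

If you renamed the id or label column after generating the dataset, pass the new names through `id_column` and `label_column`, for example `load_ember_csv("dataset/malware_ember_features.csv", label_column="label")`.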

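The directory layout from step 2 can also be checked before deploying. The sketch below is not part of this repository; the helper names are hypothetical. It counts the samples in each family directory and, because each PE filename is used as the sample index, it also flags filenames that appear under more than one family, on the assumption that such duplicates would be ambiguous in the final dataset.

```python
from collections import defaultdict
from pathlib import Path


def count_samples_per_family(base_dir):
    """Map each family directory name to the number of files it contains."""
    base = Path(base_dir)
    if not base.is_dir():
        raise NotADirectoryError(f"{base} is not a directory")
    counts = {}
    for family_dir in sorted(p for p in base.iterdir() if p.is_dir()):
        counts[family_dir.name] = sum(1 for f in family_dir.iterdir() if f.is_file())
    return counts


def find_duplicate_filenames(base_dir):
    """Return filenames that occur in more than one family directory."""
    families_by_name = defaultdict(list)
    for family_dir in sorted(p for p in Path(base_dir).iterdir() if p.is_dir()):
        for sample in family_dir.iterdir():
            if sample.is_file():
                families_by_name[sample.name].append(family_dir.name)
    return {
        name: families
        for name, families in families_by_name.items()
        if len(families) > 1
    }
```

A family with a count of zero, or a non-empty result from `find_duplicate_filenames`, is worth resolving before running `docker compose up`.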

