Metadata-Version: 2.1
Name: tcpfeature
Version: 1.1.0
Summary: Feature extraction from TCP/IP traffic for machine learning using Cython & libpcap
Home-page: https://github.com/dzokha/tcpfeature
Author: dzokha
Author-email: dzokha1010@gmail.com
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Cython
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: MacOS
Classifier: Topic :: Security
Classifier: License :: Other/Proprietary License
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: Cython>=3.0.0
Requires-Dist: numpy>=1.20.0

# tcpfeature

[![PyPI version](https://img.shields.io/pypi/v/tcpfeature.svg)](https://pypi.org/project/tcpfeature/)
[![Python Supported](https://img.shields.io/pypi/pyversions/tcpfeature.svg)](https://pypi.org/project/tcpfeature/)
[![Platform Support](https://img.shields.io/badge/platform-windows%20%7C%20macos%20%7C%20linux-lightgrey)](https://pypi.org/project/tcpfeature/)
[![License: Proprietary](https://img.shields.io/badge/License-Proprietary%20Academic-orange.svg)](#-license)

A fast, cross-platform network traffic feature extraction tool designed for machine learning and cybersecurity research. It is built with **Cython** and links directly against the native **libpcap/Npcap** capture libraries for high throughput, and its core logic ships only in compiled form.

---

## 🚀 Key Features

* **High Performance:** Packet parsing runs in Cython C extensions, so the per-packet hot path avoids Python-level overhead.
* **ML-Ready Output:** Extract advanced statistical features directly into structured formats compatible with standard network benchmarks.
* **Cross-Platform Binary Wheels:** Native support for Windows (AMD64), Linux (x86_64), and macOS (Apple Silicon M1/M2/M3/M4).
* **Live & Offline Modes:** Continuous streaming feature extraction from live network cards (using Python Generators) or batch processing from standard PCAP files.
* **Integrated Alert Manager:** High-speed Snort raw alert log parser with automated bitwise IP subnet checking to calculate directional network metrics (`is_src_internal`, `is_dst_internal`).
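
The bitwise subnet check behind flags such as `is_src_internal` and `is_dst_internal` can be illustrated with the standard library alone. This is a minimal sketch and not the package's compiled implementation: the function name `is_internal` and the list of networks treated as internal are assumptions made for illustration.

```python
import ipaddress

# Hypothetical list of networks treated as "internal" (the RFC 1918 private ranges).
INTERNAL_NETWORKS = ["10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"]


def is_internal(ip, networks=INTERNAL_NETWORKS):
    """Return 1 if the IPv4 address falls inside any listed network, else 0."""
    addr = int(ipaddress.IPv4Address(ip))
    for cidr in networks:
        net = ipaddress.IPv4Network(cidr)
        mask = int(net.netmask)
        # An address belongs to a subnet when its masked bits equal the network address.
        if (addr & mask) == int(net.network_address):
            return 1
    return 0


print(is_internal("192.168.1.20"))
print(is_internal("8.8.8.8"))
```

Working on integers keeps the check to a single AND and a comparison per network, which is the usual reason to prefer it over string prefix matching.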

---

## 🛠️ System Requirements & Prerequisites

Because `tcpfeature` talks directly to kernel-level packet capture facilities, the host system must have a packet capture library installed:

### 1. Linux (Ubuntu/Debian/CentOS)
```bash
# Ubuntu/Debian
sudo apt-get install libpcap-dev

# CentOS/RHEL/Fedora
sudo yum install libpcap-devel
```

### 2. macOS
macOS ships with libpcap preinstalled. No additional configuration is required.

### 3. Windows
Download and install the Npcap driver from the official website: https://npcap.com/. Enable the "WinPcap API-compatible Mode" option during installation.

On Windows, the package uses the WinSock2 networking stack automatically.
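
Before installing, it can help to confirm that a capture library is visible to Python at all. The sketch below uses only `ctypes.util.find_library` from the standard library. The helper name `find_capture_library` is hypothetical, and the library names it probes (`wpcap` on Windows, `pcap` elsewhere) are the conventional ones rather than anything documented by `tcpfeature`.

```python
import ctypes.util
import sys


def find_capture_library():
    """Return the name or path of the packet capture library, or None if not found."""
    # Npcap in WinPcap-compatible mode exposes wpcap.dll; Unix-like systems use libpcap.
    candidates = ["wpcap"] if sys.platform.startswith("win") else ["pcap"]
    for name in candidates:
        found = ctypes.util.find_library(name)
        if found:
            return found
    return None


library = find_capture_library()
print(f"capture library: {library or 'not found'}")
```

A `None` result is only a hint: the lookup depends on the platform's library search path, so a library installed in a non-standard location may still be missed.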

## 📦 Installation
Install the compiled binary wheels directly via pip:

```bash
pip install tcpfeature
```

## 💻 Quick Start & Usage Examples

`tcpfeature` exposes a small set of stateless functions at the package root, so their results can be passed directly to standard Python data science tools.

### 1. Offline Batch Feature Extraction (PCAP to ML Features)
Extract the 29 supported features for every connection in an offline packet capture file.

```python
import tcpfeature
import json

# Define path to target pcap file
pcap_path = "data/pcaps/sample.pcap"

# Execute the high-speed Cython feature extraction core
dataset = tcpfeature.run_extraction(pcap_path, mode="offline", version="100")

if dataset:
    print(f"[+] Successfully extracted {len(dataset)} connection records.")
    # Print the first feature vector formatted as JSON
    print(json.dumps(dataset[0], indent=4))

```
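
The labeling workflow later in this document starts from a feature CSV, so one way to bridge the two steps is to write the extracted records to disk. This is a sketch under assumptions: it treats the result as a list of flat dictionaries (as the `json.dumps(dataset[0])` call above suggests), the helper `save_records_to_csv` is hypothetical, and the column layout that `build_labeled_dataset` expects is not documented here.

```python
import csv


def save_records_to_csv(records, csv_path):
    """Write a list of flat feature dictionaries to a CSV file with a header row."""
    if not records:
        return 0
    # Collect column names in first-seen order so that every record is covered.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(records)
    return len(records)


# Stand-in records; in practice this would be the list returned by run_extraction.
sample = [
    {"protocol_type": "tcp", "src_bytes": 120, "dst_bytes": 300},
    {"protocol_type": "udp", "src_bytes": 48, "dst_bytes": 0},
]
```

Calling `save_records_to_csv(sample, "pcap_features.csv")` writes a header row followed by one line per record and returns the number of records written.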

### 2. Live Network Traffic Feature Streaming (Generator Mode)
Capture packets from a physical network interface in real time. The Cython engine tracks connections as traffic arrives and yields each feature record as a dictionary through a Python generator.

```python
import tcpfeature

# Automatically detect the optimal active network interface card
target_device = tcpfeature.get_best_device()

print(f"[*] Starting live traffic feature extraction on: {target_device}")
try:
    # Stream features continuously through the generator
    for ml_features in tcpfeature.run_extraction(target_device, mode="live", version="100"):
        proto = ml_features.get('protocol_type')
        src_port = ml_features.get('src_port')
        dst_port = ml_features.get('dst_port')
        src_bytes = ml_features.get('src_bytes', 0)
        count_2sec = ml_features.get('count', 0)
        
        print(f"[{proto}] {src_port}->{dst_port} | Bytes: {src_bytes} | Windowed Count (2s): {count_2sec}")
except KeyboardInterrupt:
    print("\n[+] Live feature streaming stopped securely.")
```
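
A live generator does not end on its own, so a script that needs a fixed-size sample has to bound it. The sketch below shows one way to do that with `itertools.islice`. The helper `take_records` is hypothetical, and `fake_stream` is only a stand-in for the live generator so that the example runs without a network interface.

```python
from itertools import islice


def take_records(stream, limit):
    """Collect at most `limit` records from a (possibly endless) generator."""
    return list(islice(stream, limit))


def fake_stream():
    """Stand-in for a live feature generator; yields records forever."""
    n = 0
    while True:
        n += 1
        yield {"protocol_type": "tcp", "src_bytes": 40 * n, "count": n}


batch = take_records(fake_stream(), 3)
print([record["src_bytes"] for record in batch])
```

With a real capture the call still blocks until enough records have arrived, so a quiet interface can keep it waiting.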

### 3. Basic Stream Reader (For Custom Feature Engineering)
If you need to engineer your own features, use the lightweight stream reader to iterate over the raw header fields of each packet in capture order.

```python
import tcpfeature

pcap_file = "data/pcaps/sample.pcap"
total_bytes = 0
tcp_count = 0

# Loop through raw C-extracted layer-3/layer-4 boundary fields
for packet in tcpfeature.run_basic_capture(pcap_file, mode="offline"):
    frame_len = packet.get("frame_length", 0)
    proto = packet.get("protocol_type", "Others")
    
    # Apply custom real-time cumulative logic (Feature Engineering)
    total_bytes += frame_len
    if proto == "TCP":
        tcp_count += 1
        # Report progress once per 100 TCP frames (not on every non-TCP packet)
        if tcp_count % 100 == 0:
            print(f"[Custom Metric] Accumulated Bytes: {total_bytes} | Total TCP Frames: {tcp_count}")
```
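
The same loop pattern extends to per-protocol statistics. The following sketch aggregates frame counts and byte totals with `collections.Counter`. It relies only on the `protocol_type` and `frame_length` keys used above; the helper `summarize_by_protocol` and the sample packets are made up for illustration.

```python
from collections import Counter


def summarize_by_protocol(packets):
    """Return (frame counts, byte totals), each keyed by protocol name."""
    frames = Counter()
    byte_totals = Counter()
    for packet in packets:
        proto = packet.get("protocol_type", "Others")
        frames[proto] += 1
        byte_totals[proto] += packet.get("frame_length", 0)
    return frames, byte_totals


# Stand-in packets; in practice, pass the iterator returned by run_basic_capture.
sample_packets = [
    {"protocol_type": "TCP", "frame_length": 60},
    {"protocol_type": "TCP", "frame_length": 1514},
    {"protocol_type": "UDP", "frame_length": 90},
    {"frame_length": 42},
]
frames, byte_totals = summarize_by_protocol(sample_packets)
print(dict(frames))
print(dict(byte_totals))
```

Because the function accepts any iterable, it consumes packets one at a time and does not need the whole capture in memory.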

### 4. Automated End-to-End Dataset Labeling (Snort Integration)
Construct fully labeled classification training datasets (DARPA29F or HNET3A standards) by cross-referencing your network feature vectors with standard Snort alert logs.

```python
import tcpfeature

feature_csv = "data/csv/pcap_features.csv"
snort_log = "data/alerts/alert.log"

# Step A: Parse raw Snort alerts and compute bitwise IP ranges (is_src_internal/is_dst_internal)
alert_csv_path = tcpfeature.convert_snort_log(snort_log)

# Step B: Match extracted traffic records with alert logs inside a 2-second moving time window
final_dataset_csv = "data/final/training_data_labeled.csv"
tcpfeature.build_labeled_dataset(
    feature_csv=feature_csv,
    alert_csv=alert_csv_path,
    output_csv=final_dataset_csv,
    time_window=2.0
)
print(f"[+] ML Training Dataset generated successfully at: {final_dataset_csv}")

```
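
To make the idea of time-window matching concrete, here is a small pure-Python sketch. It is not the logic inside `build_labeled_dataset`, whose matching criteria are not described in this document. The sketch assumes that a record and an alert match when they share the same source and destination IP and their timestamps differ by at most `time_window` seconds. The field names `timestamp`, `src_ip`, `dst_ip`, and `label`, and the default label `"normal"`, are hypothetical.

```python
def label_records(records, alerts, time_window=2.0):
    """Attach the label of the first matching alert to each record, else 'normal'."""
    labeled = []
    for record in records:
        label = "normal"
        for alert in alerts:
            same_pair = (
                record["src_ip"] == alert["src_ip"]
                and record["dst_ip"] == alert["dst_ip"]
            )
            close_in_time = abs(record["timestamp"] - alert["timestamp"]) <= time_window
            if same_pair and close_in_time:
                label = alert["label"]
                break
        # Copy the record so the input list is left unchanged.
        labeled.append({**record, "label": label})
    return labeled


records = [
    {"timestamp": 10.0, "src_ip": "10.0.0.2", "dst_ip": "10.0.0.9"},
    {"timestamp": 20.0, "src_ip": "10.0.0.2", "dst_ip": "10.0.0.9"},
    {"timestamp": 11.0, "src_ip": "10.0.0.3", "dst_ip": "10.0.0.9"},
]
alerts = [
    {"timestamp": 11.5, "src_ip": "10.0.0.2", "dst_ip": "10.0.0.9", "label": "portscan"},
]
for row in label_records(records, alerts):
    print(row["timestamp"], row["label"])
```

The nested loop compares every record with every alert, which is fine for small files; for large captures, sorting both lists by timestamp and sweeping them together would scale better.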

## 📊 Extracted Feature Architecture

The core Cython engine evaluates network sessions and maps them onto the standard KDD Cup'99 feature set; 29 of the 41 features are currently supported.

<details>
<summary>🔍 Click to expand the Full 41 KDD-Cup'99 Feature Matrix & Support Status</summary>

### 1. Basic Features

These features are derived from packet headers alone, without analyzing the payload. Features 1 to 9 are in this category.

| ID | Feature Name | Type 	| tcpfeature Support Status |
| :--- | :--- | :--- | :--- |
| 1 | `duration` 			| continuous 	| ✅ length (number of seconds) of the connection 					|
| 2 | `protocol_type` 		| symbolic 		| ✅ type of the protocol, e.g. tcp, udp, etc.  					|
| 3 | `service` 			| symbolic 		| ✅ network service on the destination, e.g., http, telnet, etc. 	|
| 4 | `flag` 				| symbolic 		| ✅ normal or error status of the connection  						|
| 5 | `src_bytes` 			| continuous 	| ✅ number of data bytes from source to destination 				|
| 6 | `dst_bytes` 			| continuous 	| ✅ number of data bytes from destination to source 				|
| 7 | `land` 				| symbolic 		| ✅ 1 if connection is from/to the same host/port; 0 otherwise 	|
| 8 | `wrong_fragment` 		| continuous 	| ✅ number of "wrong" fragments 									|
| 9 | `urgent` 				| continuous 	| ✅ number of urgent packets  										|

### 2. Content Features

In this category, the original TCP packets are analyzed with the help of domain knowledge. An example is the number of "hot" indicators.

| ID | Feature Name | Type  | tcpfeature Support Status |
| :--- | :--- | :--- | :--- |
| 10 | `hot` | continuous | ✅ number of "hot" indicators |
| 11 | `num_failed_logins` | continuous | ❌ number of failed login attempts |
| 12 | `logged_in` | symbolic | ❌ 1 if successfully logged in; 0 otherwise |
| 13 | `num_compromised` | continuous | ❌ number of "compromised" conditions |
| 14 | `root_shell` | continuous | ❌ 1 if root shell is obtained; 0 otherwise |
| 15 | `su_attempted` | continuous | ❌ 1 if "su root" command attempted; 0 otherwise |
| 16 | `num_root` | continuous | ❌ number of "root" accesses |
| 17 | `num_file_creations` | continuous 	| ❌ number of file creation operations  				            |
| 18 | `num_shells` | continuous | ❌ number of shell prompts |
| 19 | `num_access_files` | continuous | ❌ number of operations on access control files |
| 20 | `num_outbound_cmds` | continuous | ❌ number of outbound commands in an ftp session |
| 21 | `is_hot_login` | symbolic | ❌ 1 if the login belongs to the "hot" list; 0 otherwise |
| 22 | `is_guest_login` | symbolic | ❌ 1 if the login is a "guest" login; 0 otherwise |

### 3. Time-based Traffic Features

These features are computed over a 2-second time window. Within that window, properties of the observed connections are measured, for example the number of connections to the same service as the current connection in the past two seconds.

| ID | Feature Name | Type | tcpfeature Support Status |
| :--- | :--- | :--- | :--- |
| 23 | `count` 				| continuous 	| ✅ number of connections to the same host as the current connection in the past two seconds  		|
| 24 | `srv_count` 			| continuous 	| ✅ number of connections to the same service as the current connection in the past two seconds 	|
| 25 | `serror_rate` | continuous | ✅ % of connections that have "SYN" errors (same-host) |
| 26 | `srv_serror_rate` | continuous | ✅ % of connections that have "SYN" errors (same-service) |
| 27 | `rerror_rate` | continuous | ✅ % of connections that have "REJ" errors (same-host) |
| 28 | `srv_rerror_rate` | continuous | ✅ % of connections that have "REJ" errors (same-service) |
| 29 | `same_srv_rate` 		| continuous 	| ✅ % of connections to the same service (same-host) 			    |
| 30 | `diff_srv_rate` 		| continuous 	| ✅ % of connections to different services (same-host) 		    |
| 31 | `srv_diff_host_rate` | continuous 	| ✅ % of connections to different hosts (same-service) 		    |
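
As an illustration of how the window-based definitions above translate into code, the sketch below computes `count`, `srv_count`, and `same_srv_rate` for one connection. It is a plain-Python approximation and not the Cython engine's implementation. It assumes the connections are sorted by time, that the current connection is itself counted, and that each connection is a dictionary with the hypothetical keys `time`, `dst_ip`, and `service`.

```python
def time_window_features(connections, index, window=2.0):
    """Compute count, srv_count and same_srv_rate for connections[index]."""
    current = connections[index]
    start = current["time"] - window
    # Connections seen so far that fall inside the window (current one included).
    recent = [c for c in connections[: index + 1] if c["time"] >= start]
    same_host = [c for c in recent if c["dst_ip"] == current["dst_ip"]]
    same_service = [c for c in recent if c["service"] == current["service"]]
    same_host_same_srv = [c for c in same_host if c["service"] == current["service"]]
    return {
        "count": len(same_host),
        "srv_count": len(same_service),
        # same_host always contains the current connection, so it is never empty.
        "same_srv_rate": len(same_host_same_srv) / len(same_host),
    }


connections = [
    {"time": 0.0, "dst_ip": "10.0.0.5", "service": "http"},
    {"time": 1.0, "dst_ip": "10.0.0.5", "service": "http"},
    {"time": 1.5, "dst_ip": "10.0.0.5", "service": "ssh"},
    {"time": 2.5, "dst_ip": "10.0.0.9", "service": "http"},
    {"time": 3.0, "dst_ip": "10.0.0.5", "service": "http"},
]
print(time_window_features(connections, 4))
```

For the last connection the window covers times 1.0 to 3.0, so the first connection is excluded while the other four are considered.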

### 4. Host-based Traffic Features

In this category, the window is defined by a number of connections rather than by time. This allows attacks that span more than 2 seconds to be detected.

| ID | Feature Name | Type | tcpfeature Support Status |
| :--- | :--- | :--- | :--- |
| 32 | `dst_host_count` 				| continuous 	| ✅ count of connections having the same destination host 								|
| 33 | `dst_host_srv_count` 			| continuous 	| ✅ count of connections having the same destination host and using the same service 	|
| 34 | `dst_host_same_srv_rate` 		| continuous 	| ✅ % of connections having the same destination host and using the same service 		|
| 35 | `dst_host_diff_srv_rate` 		| continuous 	| ✅ % of different services on the current host 										|
| 36 | `dst_host_same_src_port_rate` 	| continuous 	| ✅ % of connections to the current host having the same src port 						|
| 37 | `dst_host_srv_diff_host_rate` 	| continuous 	| ✅ % of connections to the same service coming from different hosts 					|
| 38 | `dst_host_serror_rate` 			| continuous 	| ✅ % of connections to the current host that have an S0 error 						|
| 39 | `dst_host_srv_serror_rate` 		| continuous 	| ✅ % of connections to the current host and specified service that have an S0 error 	|
| 40 | `dst_host_rerror_rate` 			| continuous 	| ✅ % of connections to the current host that have an RST error 						|
| 41 | `dst_host_srv_rerror_rate` 		| continuous 	| ✅ % of connections to the current host and specified service that have an RST error 	|
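
A connection-count window can be sketched with a bounded `collections.deque`. The example below derives `dst_host_count`, `dst_host_srv_count`, and `dst_host_same_srv_rate` from the most recent connections. It is an illustration only: the default window of 100 connections follows the usual KDD Cup'99 convention and is an assumption about this package, and the dictionary keys `dst_ip` and `service` are hypothetical.

```python
from collections import deque


def host_window_features(connections, window_size=100):
    """Yield host-based features for each connection, in arrival order."""
    history = deque(maxlen=window_size)  # keeps only the most recent connections
    for conn in connections:
        history.append(conn)
        same_host = [c for c in history if c["dst_ip"] == conn["dst_ip"]]
        same_host_srv = [c for c in same_host if c["service"] == conn["service"]]
        yield {
            "dst_host_count": len(same_host),
            "dst_host_srv_count": len(same_host_srv),
            # same_host always contains the current connection, so it is never empty.
            "dst_host_same_srv_rate": len(same_host_srv) / len(same_host),
        }


conns = [
    {"dst_ip": "10.0.0.5", "service": "http"},
    {"dst_ip": "10.0.0.5", "service": "http"},
    {"dst_ip": "10.0.0.9", "service": "http"},
    {"dst_ip": "10.0.0.5", "service": "ssh"},
    {"dst_ip": "10.0.0.5", "service": "http"},
]
# A tiny window of 3 makes the eviction of old connections easy to see.
results = list(host_window_features(conns, window_size=3))
print(results[-1])
```

Because the deque drops the oldest entry automatically, the window depends on how many connections have been seen rather than on how much time has passed.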

*💡 **Note:** Features marked ❌ require Deep Packet Inspection (DPI) at the application layer (Layer 7), which is outside the scope of the current native core.*

</details>

---

## 🎓 Citation & Academic Reference

If you use `tcpfeature` or the accompanying `DARPA29F` dataset in your research, academic publications, or permitted educational settings, please cite the original peer-reviewed paper that introduced this methodology:

### IEEE Format
Van Kha Nguyen, Minh Nhat Quang Truong, Van Lam Le, Quyet Thang Le, and Thanh Hai Nguyen, "A Novel Approach for Data Collection and Network Attack Warning," in *2019 11th International Conference on Knowledge and Systems Engineering (KSE)*, IEEE, 2019, pp. 1-6, doi: 10.1109/KSE.2019.8919494.

### BibTeX (For LaTeX Researchers)
```bibtex
@inproceedings{nguyen2019novel,
  title={A Novel Approach for Data Collection and Network Attack Warning}, 
  author={Nguyen, Van Kha and Truong, Minh Nhat Quang and Le, Van Lam and Le, Quyet Thang and Nguyen, Thanh Hai},
  booktitle={2019 11th International Conference on Knowledge and Systems Engineering (KSE)},
  pages={1--6},
  year={2019},
  publisher={IEEE},
  doi={10.1109/KSE.2019.8919494},
  note={Funded by Can Tho Department of Science and Technology}
}
```

## 🛡️ License

This project is licensed under a **Custom Proprietary Academic License**. 

* **Academic & Research Use:** Free to use for non-commercial research, academic publications, and educational purposes, provided that proper citation is given to the original paper.
* **Commercial Use:** Strictly prohibited. Any commercial deployment, production usage, or redistribution of this binary wheel requires a separate commercial license from the author.
* **Source Code:** The underlying Cython/C source code is proprietary and not open to the public. Reverse engineering or decompiling the binaries is strictly forbidden.

## 👥 Authors & Contact

Author: dzokha

Email: dzokha1010@gmail.com

GitHub: https://github.com/dzokha/tcpfeature



