Metadata-Version: 2.4
Name: reap-cli
Version: 0.1.1
Summary: Recursive extraction and parsing of firmware and partition images
Author: Blackbox Research
License-Expression: Apache-2.0
Keywords: firmware,forensics,reverse-engineering,android,partition,extraction,unpacker
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Security
Classifier: Topic :: System :: Archiving
Classifier: Topic :: System :: Filesystems
Classifier: Topic :: Utilities
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: NOTICE
Requires-Dist: ext4>=1.2.2
Requires-Dist: lz4>=4.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Dynamic: license-file

# REAP

**Recursive Extraction And Parsing** -- a general-purpose CLI tool for identifying and recursively extracting firmware and partition images. Works with raw eMMC/flash dumps, individual partition images, full-disk images (GPT or Rockchip PARM), and forensic disk images from a wide range of embedded Linux and Android devices.

Pure Python. No root, no FUSE, no mounting, no Linux kernel modules. Runs on macOS, Linux, and Windows.

## What it does

Point it at a directory of partition `.bin` files, a single image, or a set of 7z archives and it will:

1. **Identify** each image's format via magic bytes (37 format signatures)
2. **Annotate** each image's partition role (boot, recovery, system, userdata, etc.) by reading ext4 superblock metadata and analyzing ramdisk contents
3. **Extract** contents recursively -- e.g. a boot image yields a kernel + ramdisk; the ramdisk decompresses to a cpio archive; the cpio extracts to a filesystem tree
4. **Analyze** kernels (version, config, build paths, kallsyms symbol table), bootloaders (U-Boot environment, embedded DTBs), and unknown partitions (forensic hex dump, strings, SHA256)
5. **Report** everything found in human-readable text and machine-readable JSON, including FBE encryption detection

## Supported formats

### Partition tables and disk layouts

| Format | Detection | Extraction |
|--------|-----------|------------|
| GPT partition table | `EFI PART` at offset 0x200 or 0x1000 (UFS 4K sectors) | Individual partition images |
| Rockchip PARM partition table | `PARM` at offset 0 | Individual partition images (RK29xx/RK3xxx flash dumps) |
| Android super.img (LP metadata) | `0x616C4467` (LP geometry magic, bytes `gDla`) at offset 0x1000 | Logical partition images (system, vendor, product, etc.) |
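
The GPT row can be illustrated with a minimal sketch. It assumes the GPT header sits at LBA 1, so its byte offset equals the sector size (0x200 for 512-byte sectors, 0x1000 for 4K UFS sectors). `detect_gpt_sector_size` is a hypothetical name for illustration, not part of REAP's API.

```python
from typing import Optional

GPT_SIGNATURE = b"EFI PART"


def detect_gpt_sector_size(image: bytes) -> Optional[int]:
    """Return 512 or 4096 if a GPT header signature is found at LBA 1, else None."""
    for sector_size in (512, 4096):  # byte offsets 0x200 and 0x1000
        if image[sector_size:sector_size + 8] == GPT_SIGNATURE:
            return sector_size
    return None
```

Knowing the sector size matters because every LBA in the partition entries has to be multiplied by it to get a byte offset.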

### Android boot and kernel

| Format | Detection | Extraction |
|--------|-----------|------------|
| Android Boot Image (v0--v4) | `ANDROID!` magic | Kernel, ramdisk, second-stage, recovery DTBO, DTB |
| ARM zImage | `0x016F2818` at offset 0x24 | Decompressed vmlinux, kernel config, version string, source paths, kallsyms, all strings |
| ARM64 Image | `ARM\x64` at offset 0x38 | Kernel config, version string, source paths, kallsyms, all strings |
| Raw ARM kernel binary | MSR CPSR instruction + `Linux version` string | Kernel config, version string, source paths, kallsyms, all strings |
| Device Tree Blob (DTB) | `0xD00DFEED` | Extracted DTB, optional `dtc` decompile to DTS |
| DTBO container | `0xD7B7AB1E` | Individual DT overlay entries |
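
As a rough illustration of the two ARM kernel rows, the sketch below checks the magics at the offsets given in the table. It assumes a little-endian zImage; `detect_arm_kernel` and its return labels are hypothetical, not REAP's own identifiers.

```python
import struct
from typing import Optional

ZIMAGE_MAGIC = 0x016F2818


def detect_arm_kernel(image: bytes) -> Optional[str]:
    """Classify a buffer as 'zimage' or 'arm64_image' using the table's magics."""
    if len(image) >= 0x28:
        # ARM zImage: 32-bit magic at offset 0x24 (little-endian assumed here)
        (magic,) = struct.unpack_from("<I", image, 0x24)
        if magic == ZIMAGE_MAGIC:
            return "zimage"
    # ARM64 Image: the four bytes 'A' 'R' 'M' 0x64 at offset 0x38
    if image[0x38:0x3C] == b"ARM\x64":
        return "arm64_image"
    return None
```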

### Bootloaders and firmware

| Format | Detection | Extraction |
|--------|-----------|------------|
| U-Boot uImage | `0x27051956` | Unwrapped payload (kernel, ramdisk, firmware, device tree, etc.) |
| U-Boot binary | `U-Boot <version>` string, 64 KB--4 MB | Default environment, embedded DTBs, strings |
| U-Boot environment | CRC32 + key=value pairs, power-of-2 size | Parsed environment variables |
| Samsung Exynos boot partition | BL1 header pointer + `Exynos BL` label | bl1.bin, u-boot.bin, tzsw.bin |
| Rockchip KRNL wrapper | `KRNL` at offset 0 | Unwrapped payload (re-identified as gzip, zImage, etc.) |
| ELF binary | `\x7fELF` magic | Metadata dump (class, machine, entry point), strings |
| AVB vbmeta | `AVB0` / `AVBf` | Metadata dump (version, algorithm, rollback index, flags) |
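
To make the uImage row concrete, here is a sketch that parses the 64-byte legacy U-Boot image header (all fields big-endian). The `UImageHeader` tuple, the function name, and the reduced type table are assumptions for illustration; the header CRCs are read but not verified.

```python
import struct
from typing import NamedTuple

UIMAGE_MAGIC = 0x27051956
# Subset of U-Boot's IH_TYPE_* values, enough for role annotation
UIMAGE_TYPES = {2: "kernel", 3: "ramdisk", 5: "firmware", 8: "device tree"}


class UImageHeader(NamedTuple):
    size: int          # payload size in bytes
    load_addr: int
    entry_point: int
    image_type: str
    name: str


def parse_uimage_header(data: bytes) -> UImageHeader:
    """Parse the legacy uImage header at the start of `data`."""
    if len(data) < 64:
        raise ValueError("uImage header is 64 bytes")
    # magic, hcrc, time, size, load, ep, dcrc, os, arch, type, comp, name[32]
    (magic, _hcrc, _time, size, load, entry, _dcrc,
     _os, _arch, type_id, _comp, raw_name) = struct.unpack_from(">7I4B32s", data)
    if magic != UIMAGE_MAGIC:
        raise ValueError("not a uImage")
    name = raw_name.split(b"\0", 1)[0].decode("ascii", "replace")
    image_type = UIMAGE_TYPES.get(type_id, f"type {type_id}")
    return UImageHeader(size, load, entry, image_type, name)
```

The payload starts immediately after the header, so unwrapping amounts to slicing `data[64:64 + header.size]` and re-identifying the result.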

### Encrypted firmware containers

| Format | Detection | Extraction |
|--------|-----------|------------|
| IM\*H firmware container | `IM*H` at offset 0 or 0x400 | Header parse (version, module name/type, chunk table, key family). Encrypted chunks (RTOS, kernel, TZOS, DTB, etc.) extracted as raw `.bin` files. Decryption is out of scope for this tool. |
| Ambarella environment (UNR0) | `UNR0` + `0x5AA5` flags | Boot config, A/B slot status, firmware versions, bootloader logs |

### Filesystems

| Format | Detection | Extraction |
|--------|-----------|------------|
| ext4 | `0xEF53` at offset 0x438 | Full filesystem tree with FBE encryption detection |
| FAT12/16/32 | Jump opcode `0xEB`/`0xE9` at offset 0 + boot signature bytes `55 AA` at offset 0x1FE | Full filesystem tree (LFN support) |
| exFAT | `EXFAT   ` OEM ID at offset 3 | Identified only (no extraction yet) |
| EROFS | `0xE0F5E1E2` at offset 0x400 | Identified only (no extraction yet) |
| F2FS | `0xF2F52010` at offset 0x400 | Identified only (no extraction yet) |
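
The filesystem checks in this table can be sketched as a single function. This is an illustrative approximation, not REAP's `identify.py`: the function name and return labels are hypothetical, and multi-byte magics are assumed to be stored little-endian on disk.

```python
from typing import Optional


def detect_filesystem(image: bytes) -> Optional[str]:
    """Return a filesystem label based on the magic values in the table above."""
    # 0xEF53 is shared by ext2/ext3/ext4; telling them apart needs feature flags
    if image[0x438:0x43A] == (0xEF53).to_bytes(2, "little"):
        return "ext4"
    if image[0x400:0x404] == (0xE0F5E1E2).to_bytes(4, "little"):
        return "erofs"
    if image[0x400:0x404] == (0xF2F52010).to_bytes(4, "little"):
        return "f2fs"
    # exFAT also carries a jump opcode and 55 AA, so test it before FAT
    if image[3:11] == b"EXFAT   ":
        return "exfat"
    if image[:1] in (b"\xEB", b"\xE9") and image[510:512] == b"\x55\xAA":
        return "fat"
    return None
```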

### Compression and archives

| Format | Detection | Extraction |
|--------|-----------|------------|
| gzip | `1F 8B` | Decompressed content |
| LZ4 frame | `04 22 4D 18` | Decompressed content |
| LZ4 legacy | `02 21 4C 18` | Decompressed content (Android ramdisk format) |
| LZMA | `5D 00 00` | Decompressed content |
| bzip2 | `BZh` | Decompressed content |
| XZ | `FD 37 7A 58 5A 00` | Decompressed content |
| cpio newc | `070701` / `070702` | Files, directories, symlinks (as text files with `-> target`) |
| 7z archive | `37 7A BC AF 27 1C` | Full decompression (supports split `.7z.001` parts) |
| Android sparse image | `0xED26FF3A` | Converted to raw image, then re-identified and extracted |
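
For the compression rows, a minimal standard-library sketch of "sniff the magic, then decompress" might look like the following. It covers only gzip, bzip2, XZ, and LZMA; the LZ4 variants would need the `lz4` package and are left out. `decompress_by_magic` is a hypothetical helper, not REAP's handler.

```python
import bz2
import gzip
import lzma


def decompress_by_magic(data: bytes) -> bytes:
    """Pick a decompressor from the leading magic bytes and return the payload."""
    if data.startswith(b"\x1f\x8b"):
        return gzip.decompress(data)
    if data.startswith(b"BZh"):
        return bz2.decompress(data)
    if data.startswith(b"\xfd7zXZ\x00"):  # FD 37 7A 58 5A 00
        return lzma.decompress(data, format=lzma.FORMAT_XZ)
    if data.startswith(b"\x5d\x00\x00"):  # legacy .lzma ("LZMA alone") header
        return lzma.decompress(data, format=lzma.FORMAT_ALONE)
    raise ValueError("unrecognized compression magic")
```

In a recursive extractor, the returned bytes would be fed back into identification, which is how a gzip-compressed ramdisk ends up handled as a cpio archive.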

### Device-specific partitions

| Format | Detection | Extraction |
|--------|-----------|------------|
| Android devinfo | `ANDROID-BOOT!` magic | Lock status, tamper flags |
| Qualcomm modemst (EFS) | `IMGEFS` marker in first 64 bytes | Forensic scan (SHA256, strings, hex dump) |
| BMP image | `BM` + valid DIB header | Trimmed BMP (strips partition padding) |
| Boot logo container | ASCII count/sizes header + BMP at 0x200 | Individual BMP images |
| Empty / zeroed | All-zero content | Verified-empty marker with likely purpose annotation |
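
The BMP row ("strips partition padding") can be approximated by trusting the file-size field in the BMP header. This is a sketch under that assumption; the set of accepted DIB header sizes and the name `trim_bmp` are illustrative choices rather than REAP's actual validation rules.

```python
import struct

# Common DIB header sizes (BITMAPCOREHEADER through BITMAPV5HEADER)
VALID_DIB_SIZES = {12, 40, 52, 56, 64, 108, 124}


def trim_bmp(partition: bytes) -> bytes:
    """Return only the BMP file, dropping any padding that follows it."""
    if len(partition) < 18 or partition[:2] != b"BM":
        raise ValueError("not a BMP")
    # Offsets 2..17: file size, reserved, pixel data offset, DIB header size
    file_size, _reserved, _pixel_offset, dib_size = struct.unpack_from(
        "<IIII", partition, 2
    )
    if dib_size not in VALID_DIB_SIZES or not 18 <= file_size <= len(partition):
        raise ValueError("implausible BMP header")
    return partition[:file_size]
```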

## Installation

Requires **Python 3.10+** (tested with 3.11).

From PyPI:

```bash
pip install reap-cli
```

> The PyPI distribution is `reap-cli` because the bare `reap` name on PyPI is held by an unrelated, long-abandoned 2012 package. We are pursuing a [PEP 541](https://peps.python.org/pep-0541/) transfer. The installed CLI command is `reap` regardless.

From source (for development):

```bash
git clone https://gitlab.com/blackbox-research/reap
cd reap
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
```

Dependencies (installed automatically):
- `ext4` -- pure-Python ext4 filesystem reader (no FUSE/mounting)
- `lz4` -- LZ4 decompression for Android ramdisks

Third-party plugins that add format handlers or detectors are discovered automatically via the `reap.plugins` entry point group. See the architecture section below.

## Usage

```
reap <input_path> [options]
```

`input_path` can be a single image file, a directory containing partition images, or a set of 7z archives.

### Options

| Flag | Description |
|------|-------------|
| `-o DIR` | Output directory (default: `<input>_unpacked/`) |
| `--identify-only` | Print format identification only, no extraction |
| `--skip-ext4` | Skip ext4 filesystem extraction (useful for huge partitions) |
| `--skip-archives` | Skip 7z archive extraction |
| `--force-archives` | Force archive extraction even when `physicalImage/` already exists |
| `--no-recursive` | Don't recurse into extracted children |
| `--max-depth N` | Maximum recursion depth (default: 10) |
| `-j, --jobs N` | Parallel extraction workers (0=auto, 1=sequential; default: auto) |
| `-v` | Verbose output (INFO level) |
| `-vv` | Debug output |
| `--report text\|json\|both` | Report format (default: both) |

### Examples

**Identify all partitions in a dump:**
```bash
reap ./physicalImage --identify-only
```

**Full extraction (skip large ext4 partitions):**
```bash
reap ./physicalImage --skip-ext4 -v
```

**Extract a single boot image:**
```bash
reap boot.img -o ./boot_extracted -v
```

**Extract a directory of 7z archives (split parts supported):**
```bash
reap ./archives/ -v
```

**Parallel extraction with 4 workers:**
```bash
reap ./physicalImage -j 4 -v
```
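
When REAP is driven from a script, the same invocations can be assembled programmatically. The sketch below uses only flags from the options table; the helper names are hypothetical, and the exit-code semantics of `reap` are not documented here, so the return code is passed through unchanged.

```python
import subprocess
from typing import List, Optional


def build_reap_command(input_path: str, output_dir: Optional[str] = None,
                       identify_only: bool = False, skip_ext4: bool = False,
                       jobs: Optional[int] = None,
                       report: Optional[str] = None) -> List[str]:
    """Build an argv list for the `reap` CLI from a few documented options."""
    cmd = ["reap", input_path]
    if output_dir is not None:
        cmd += ["-o", output_dir]
    if identify_only:
        cmd.append("--identify-only")
    if skip_ext4:
        cmd.append("--skip-ext4")
    if jobs is not None:
        cmd += ["-j", str(jobs)]
    if report is not None:
        cmd += ["--report", report]  # text, json, or both
    return cmd


def run_reap(input_path: str, **options) -> int:
    """Run `reap` (must be on PATH) and return its exit code."""
    return subprocess.run(build_reap_command(input_path, **options)).returncode
```

Passing an argv list rather than a shell string avoids quoting problems with image paths that contain spaces.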

### Output structure

For a boot image, the recursive extraction produces:

```
boot_unpacked/
    kernel_info.txt          # Kernel analysis summary
    kernel_config.txt        # Build-time .config (if IKCONFIG enabled)
    kernel_source_paths.txt  # Build-time source paths
    kernel_strings.txt       # All embedded ASCII strings
    kallsyms.txt             # Kernel symbol table (if present)
    vmlinux                  # Decompressed kernel binary
    ramdisk_unpacked/
        ramdisk_unpacked/    # cpio filesystem tree
            init
            init.rc
            fstab.*
            sbin/
            ...
```

For a directory of partitions, you get a subdirectory per partition plus reports:

```
physicalImage_unpacked/
    report.txt               # Human-readable report
    report.json              # Machine-readable report
    mmcblk0p1/               # boot image contents
    mmcblk0p2/               # DTB contents
    mmcblk0p3/               # recovery image contents
    mmcblk0p4/               # system filesystem tree
    ...
```

### Partition annotation

The tool automatically identifies partition roles by:
- Reading the ext4 superblock `s_last_mounted` field (e.g. `/system`, `/data`, `/cache`)
- Checking boot image ramdisks for `/sbin/recovery` to distinguish boot images from recovery images
- Parsing U-Boot uImage type fields (kernel, ramdisk, firmware, device tree)
- Parsing IM\*H firmware module names and types (bootloader, kernel, RTOS)
- Recognizing format-specific roles (DTB, vbmeta, DTBO, sparse, super, modemst)
- Inferring empty partition purpose from size (<=4 MB zeroed = likely misc or metadata)

Annotations appear in reports and verbose output as labels like `(recovery)`, `(system)`, `(userdata)`, etc.
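
The first rule above, reading `s_last_mounted`, can be sketched directly against the ext4 on-disk layout: the superblock starts at byte 0x400, and `s_last_mounted` is a 64-byte NUL-padded field at offset 0x88 within it. The mount-point-to-role mapping and both function names are assumptions for illustration; REAP's actual mapping may differ.

```python
from typing import Optional

SUPERBLOCK_OFFSET = 0x400
# Hypothetical mapping from last mount point to a role label
ROLE_BY_MOUNT = {"/system": "system", "/data": "userdata", "/cache": "cache"}


def read_last_mounted(image: bytes) -> Optional[str]:
    """Return the ext4 s_last_mounted path, or None if absent or not ext4."""
    sb = image[SUPERBLOCK_OFFSET:SUPERBLOCK_OFFSET + 0x400]
    # s_magic (0xEF53, little-endian) lives at superblock offset 0x38
    if len(sb) < 0xC8 or sb[0x38:0x3A] != b"\x53\xEF":
        return None
    raw = sb[0x88:0x88 + 64]
    path = raw.split(b"\0", 1)[0].decode("utf-8", "replace")
    return path or None


def annotate_role(image: bytes) -> Optional[str]:
    """Map the last mount point to a role label, if one is known."""
    return ROLE_BY_MOUNT.get(read_last_mounted(image) or "")
```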

### FBE encryption detection

When extracting ext4 filesystems with File-Based Encryption (FBE), the tool:
- Detects the encryption superblock flag and per-inode encryption flags
- Hex-encodes encrypted filenames for safe extraction
- Writes `encrypted_paths.txt` listing all encrypted files and directories
- Reports encryption algorithms (AES-256-XTS, AES-256-GCM, etc.) in JSON output
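
The flag checks and filename encoding above can be sketched as follows. The constants come from the ext4 on-disk format (`s_feature_incompat` at superblock offset 0x60, the `ENCRYPT` incompat bit, and the per-inode `EXT4_ENCRYPT_FL` flag); the function names are hypothetical, and reading the inode flags out of the inode table is left to an ext4 reader.

```python
import struct

EXT4_FEATURE_INCOMPAT_ENCRYPT = 0x10000  # bit in s_feature_incompat
EXT4_ENCRYPT_FL = 0x800                  # bit in the per-inode i_flags


def superblock_has_encryption(image: bytes) -> bool:
    """True if the ext4 superblock advertises the encryption feature."""
    sb = image[0x400:0x800]
    if len(sb) < 0x64 or sb[0x38:0x3A] != b"\x53\xEF":
        return False
    (incompat,) = struct.unpack_from("<I", sb, 0x60)
    return bool(incompat & EXT4_FEATURE_INCOMPAT_ENCRYPT)


def inode_is_encrypted(i_flags: int) -> bool:
    """True if an inode's flags mark it as encrypted."""
    return bool(i_flags & EXT4_ENCRYPT_FL)


def safe_encrypted_name(raw_name: bytes) -> str:
    """Hex-encode an encrypted (arbitrary-bytes) filename for safe extraction."""
    return raw_name.hex()
```

Hex encoding keeps names unique and filesystem-safe even when the ciphertext contains `/`, NUL, or bytes that are invalid in the host's filename encoding.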

## Architecture

```
reap/
    cli.py              # Argument parsing, entry point
    identify.py         # Magic-byte format detection (37 signatures)
    annotate.py         # Partition role inference
    pipeline.py         # Recursive extraction orchestrator (parallel workers)
    report.py           # Text + JSON report generation (FBE-aware)
    handlers/
        __init__.py     # BaseHandler ABC, handler registry
        ambarella_env.py # Ambarella UNR0 boot environment
        avb.py          # AVB vbmeta metadata
        bmp.py          # BMP image (partition padding trim)
        boot_img.py     # Android boot image (v0--v4)
        bootlogo.py     # Boot logo container (multiple BMPs)
        compression.py  # gzip, LZ4, LZMA, bzip2, XZ
        cpio_handler.py # cpio newc archives
        devinfo.py      # Android devinfo (lock status)
        dji_imah.py     # IM*H encrypted firmware container (header parse, encrypted chunks)
        dtb.py          # Device Tree Blob
        dtbo.py         # DTBO container
        elf.py          # ELF binary metadata + strings
        ext4_handler.py # ext4 filesystem (FBE detection, dir_index fallback)
        exynos_boot.py  # Samsung Exynos eMMC boot partition
        fat.py          # FAT12/16/32 filesystem
        gpt.py          # GPT partition table (512-byte + 4K UFS sectors)
        modemst.py      # Qualcomm modem EFS partition
        raw.py          # Empty + unknown fallback (forensic scan)
        raw_kernel.py   # Raw ARM kernel binary
        rk_krnl.py      # Rockchip KRNL wrapper
        rkparm.py       # Rockchip PARM partition table
        seven_zip.py    # 7z archive (split-part support)
        sparse_img.py   # Android sparse -> raw conversion
        super_img.py    # super.img LP metadata
        uboot_bin.py    # U-Boot binary (environment, embedded DTBs)
        uboot_env.py    # U-Boot environment block
        uimage.py       # U-Boot uImage wrapper
        zimage.py       # ARM zImage / ARM64 Image kernel extraction
        _kernel_utils.py # Shared kernel analysis (version, config, kallsyms)
```

Each handler implements `BaseHandler.extract()` and returns an `ExtractionResult` with optional children for recursive processing. Handlers register themselves at import time via `register_handler()`.

The pipeline orchestrator (`pipeline.py`) drives the flow: identify -> annotate -> dispatch to handler -> recurse into children. Children can be processed in parallel via `ThreadPoolExecutor`. Adding new formats is straightforward -- write a handler, register it for a `Format` enum value, and the pipeline picks it up automatically.
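
The register-and-recurse flow can be sketched in a few dozen lines. The names `BaseHandler`, `ExtractionResult`, `register_handler`, and `Format` come from the description above, but every signature and field below is an assumption made for illustration: this toy version passes bytes in memory, skips the annotate step, and is not REAP's actual implementation.

```python
import gzip
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Dict, List, Type


class Format(Enum):
    GZIP = auto()
    UNKNOWN = auto()


@dataclass
class ExtractionResult:
    summary: str
    children: List[bytes] = field(default_factory=list)  # payloads to recurse into


class BaseHandler(ABC):
    @abstractmethod
    def extract(self, data: bytes) -> ExtractionResult:
        ...


_REGISTRY: Dict[Format, Type[BaseHandler]] = {}


def register_handler(fmt: Format, handler_cls: Type[BaseHandler]) -> None:
    _REGISTRY[fmt] = handler_cls


class GzipHandler(BaseHandler):
    def extract(self, data: bytes) -> ExtractionResult:
        return ExtractionResult("gzip", [gzip.decompress(data)])


class RawHandler(BaseHandler):
    def extract(self, data: bytes) -> ExtractionResult:
        return ExtractionResult(f"raw, {len(data)} bytes")


register_handler(Format.GZIP, GzipHandler)
register_handler(Format.UNKNOWN, RawHandler)


def identify(data: bytes) -> Format:
    return Format.GZIP if data.startswith(b"\x1f\x8b") else Format.UNKNOWN


def process(data: bytes, depth: int = 0, max_depth: int = 10, jobs: int = 4) -> List[str]:
    """identify -> dispatch to handler -> recurse into children (in parallel)."""
    result = _REGISTRY[identify(data)]().extract(data)
    report = ["  " * depth + result.summary]
    if result.children and depth < max_depth:
        with ThreadPoolExecutor(max_workers=jobs) as pool:
            for child_report in pool.map(
                lambda child: process(child, depth + 1, max_depth, jobs),
                result.children,
            ):
                report.extend(child_report)
    return report
```

Because dispatch goes through the registry, supporting another format in this sketch means adding one `Format` member, one handler class, and one `register_handler` call; `process` itself does not change.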

## Symlink handling

Symlinks found inside ext4 filesystems and cpio archives are **not** created as OS symlinks, which are restricted on some platforms and can let a crafted link target escape the output directory (path traversal). Instead, each one is written as a small text file containing `-> target` and recorded in the extraction metadata / JSON report.
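
A minimal sketch of that approach is shown below. The function name, the trailing newline, and the extra check that the placeholder's own path stays inside the output directory are assumptions for illustration, not a description of REAP's code.

```python
from pathlib import Path


def write_symlink_placeholder(out_root: Path, rel_path: str, target: str) -> Path:
    """Write a text file holding '-> target' where the symlink would have been."""
    root = out_root.resolve()
    dest = (root / rel_path).resolve()
    # Refuse entry names that would land outside the output directory
    if not dest.is_relative_to(root):
        raise ValueError(f"entry escapes output directory: {rel_path!r}")
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_text(f"-> {target}\n", encoding="utf-8")
    return dest
```

Since the target is stored only as text, nothing in the extracted tree can redirect a later write or read to a location chosen by the image's author.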

## Running tests

```bash
pip install -e ".[dev]"
pytest tests/ -v
```

The suite has 110 tests covering format detection, handler extraction, kernel analysis (kallsyms), pipeline orchestration, and forensic scanning. All tests use synthetic data, so no real image files are needed.

## Known limitations

- **EROFS, F2FS, and exFAT**: Identified but not yet extracted (no pure-Python reader available).
- **Encrypted partitions**: FBE-encrypted ext4 partitions are detected and documented, but file contents remain encrypted. The tool does not perform Android FDE/FBE decryption.
- **IM\*H decryption**: Out of scope. The core tool parses IM\*H headers and extracts encrypted chunks as raw `.bin` files. Producing plaintext requires AES keys that are not distributed with this tool.
- **Large partitions**: Extracting a large ext4 partition (e.g. 54 GB) is slow and needs enough free disk space for the full file tree. Use `--skip-ext4` to skip these, or extract individual partitions as needed.
- **7z extraction**: Split archives require a system `7z` binary; single-file archives fall back to `py7zr`, which is not among the automatically installed dependencies listed above.
- **Symlinks**: Recorded as text files, not created as actual OS symlinks.
- **Text files**: Plain-text metadata files (.txt, .sha256, .xml, README) in the input directory are detected and skipped rather than subjected to forensic extraction.
