Metadata-Version: 2.4
Name: msvcpp-normalize-pe
Version: 0.0.post38
Summary: Normalize PE files for reproducible MSVC++ builds
Author-email: Tim Ansell <me@mith.ro>
License: Apache-2.0
Project-URL: Homepage, https://github.com/mithro/msvcpp-normalize-pe
Project-URL: Documentation, https://msvcpp-normalize-pe.readthedocs.io
Project-URL: Repository, https://github.com/mithro/msvcpp-normalize-pe
Project-URL: Issues, https://github.com/mithro/msvcpp-normalize-pe/issues
Project-URL: Changelog, https://github.com/mithro/msvcpp-normalize-pe/releases
Keywords: msvc,reproducible-builds,pe,windows,compiler,deterministic
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Build Tools
Classifier: Topic :: Software Development :: Compilers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Dynamic: license-file

# msvcpp-normalize-pe - Normalize PE Files for Reproducible MSVC++ Builds

[![Documentation](https://readthedocs.org/projects/msvcpp-normalize-pe/badge/?version=latest)](https://msvcpp-normalize-pe.readthedocs.io/en/latest/)
[![PyPI](https://img.shields.io/pypi/v/msvcpp-normalize-pe.svg)](https://pypi.org/project/msvcpp-normalize-pe/)
[![Python Version](https://img.shields.io/pypi/pyversions/msvcpp-normalize-pe.svg)](https://pypi.org/project/msvcpp-normalize-pe/)
[![Tests](https://github.com/mithro/msvcpp-normalize-pe/workflows/Test%20%26%20Lint/badge.svg)](https://github.com/mithro/msvcpp-normalize-pe/actions)
[![Windows Tests](https://github.com/mithro/msvcpp-normalize-pe/workflows/Test%20Windows%20C++%20Compilation/badge.svg)](https://github.com/mithro/msvcpp-normalize-pe/actions/workflows/test-windows.yml)

**⚠️ AI-Assisted Development Notice: This project was developed as an experiment in AI-assisted "vibe coding" using Claude Code. While the code has comprehensive tests and linting, it was primarily generated through AI assistance. The implementation is based on reverse-engineering PE file formats and may have edge cases or behaviors that haven't been thoroughly tested with all possible MSVC configurations. Use with caution in production environments and verify results with your specific toolchain.**

A Python tool to patch Windows PE (Portable Executable) files to make MSVC builds reproducible by normalizing timestamps, GUIDs, and other non-deterministic debug metadata.

## The Problem

When compiling Windows executables with Microsoft Visual C++ (MSVC), even with the `/Brepro` flag enabled, builds are **not fully reproducible**. The same source code compiled twice produces different binaries due to non-deterministic debug information:

- **COFF Header TimeDateStamp**: Build timestamp in PE header
- **Debug Directory Timestamps**: 4 separate timestamps in debug entries (CODEVIEW, VC_FEATURE, POGO, REPRO)
- **CODEVIEW GUID**: Random GUID linking .exe to .pdb file
- **CODEVIEW Age**: Incremental counter that varies between builds
- **REPRO Hash**: Composite hash containing the GUID and timestamps

This makes **binary verification in CI impossible** - you can't verify that committed binaries match the source code because every rebuild produces different bytes, even though the executable code is identical.

## The Solution

This tool patches all non-deterministic fields in PE files to fixed, deterministic values:

- **All timestamps** → `0x00000001` (January 1, 1970 + 1 second)
- **CODEVIEW GUID** → `00000000-0000-0000-0000-000000000000`
- **CODEVIEW Age** → `1`
- **REPRO Hash** → All zeros

After patching, **identical source code produces byte-for-byte identical binaries**, enabling reproducible builds and CI verification.

## What Gets Patched

### Fields Patched (8 total)

1. **PE COFF Header TimeDateStamp** (offset varies, typically 0xC0-0x100)
2. **Debug CODEVIEW Entry Timestamp**
3. **Debug CODEVIEW GUID** (16 bytes)
4. **Debug CODEVIEW Age** (4 bytes)
5. **Debug VC_FEATURE Entry Timestamp**
6. **Debug POGO Entry Timestamp**
7. **Debug REPRO Entry Timestamp**
8. **Debug REPRO Hash** (36 bytes)

### What Doesn't Change

- **All executable code** (.text section)
- **All program data** (.data, .rdata sections)
- **Import/Export tables**
- **Section headers**
- **Relocations**

The binary behaves **identically at runtime** - only metadata used for debugging is normalized.

## Installation

### From PyPI (Recommended)

```bash
pip install msvcpp-normalize-pe
```

### From Source

```bash
git clone https://github.com/mithro/msvcpp-normalize-pe.git
cd msvcpp-normalize-pe
pip install .
```

### Using uv

```bash
uv pip install msvcpp-normalize-pe
```

## Usage

### Command Line

After installation, the `msvcpp-normalize-pe` command is available:

```bash
# Basic usage
msvcpp-normalize-pe program.exe

# Custom timestamp
msvcpp-normalize-pe program.exe 1234567890

# Verbose output
msvcpp-normalize-pe --verbose program.exe

# See all options
msvcpp-normalize-pe --help
```

### Python API

You can also use msvcpp-normalize-pe as a library in your Python code:

```python
from pathlib import Path
from msvcpp_normalize_pe import patch_pe_file

result = patch_pe_file(Path("program.exe"), timestamp=1, verbose=True)
if result.success:
    print(f"Patched {result.patches_applied} fields")
else:
    print(f"Errors: {result.errors}")
```

### Example Output

```
[1/1] COFF header: 0x829692a8 -> 0x00000001
  [2/?] Debug CODEVIEW timestamp: 0x829692a8 -> 0x00000001
  [3/?] Debug CODEVIEW GUID: e97b6ac706ea9b2dd577392d2bf08df7 -> 00000000000000000000000000000000
  [4/?] Debug CODEVIEW Age: 7 -> 1
  [5/?] Debug VC_FEATURE timestamp: 0x829692a8 -> 0x00000001
  [6/?] Debug POGO timestamp: 0x829692a8 -> 0x00000001
  [7/?] Debug REPRO timestamp: 0x829692a8 -> 0x00000001
  [8/?] Debug REPRO hash: 20000000e97b6ac7... -> 000000000000000000...
  Total: 8 timestamp(s) patched in program.exe
```

## Integration with Build Systems

### Makefile Integration (Native MSVC)

```makefile
# Native MSVC builds
ifeq ($(USE_NATIVE_MSVC),1)
  program.exe: program.cpp
	cl.exe /O2 /Zi program.cpp /link /DEBUG:FULL /Brepro
	msvcpp-normalize-pe program.exe 1
endif
```

### CI/CD Verification Workflow

```yaml
name: Verify Binary Reproducibility

jobs:
  verify:
    runs-on: windows-latest
    steps:
      - name: Build from source
        run: |
          cl.exe /O2 program.cpp /link /DEBUG:FULL /Brepro
          msvcpp-normalize-pe program.exe 1

      - name: Compare with committed binary
        run: |
          fc /b program.exe committed/program.exe
```

## Requirements

- **Python 3.9+** (type hints, dataclasses)
- **Target files**: Windows PE executables (.exe) or DLLs (.dll)
- **Architecture**: Works with both 32-bit (PE32) and 64-bit (PE32+) binaries

No runtime dependencies - uses only Python standard library (`struct`, `sys`, `pathlib`, `dataclasses`).

## Limitations and Known Issues

### What This Tool Fixes

- ✅ Makes **PE executables** reproducible (timestamps, GUIDs)
- ✅ Works with **native MSVC** (cl.exe + link.exe)
- ✅ Preserves **debugging capability** (PDB files still work)

### What This Tool Cannot Fix

- ❌ **PDB files remain non-deterministic** (~11% of PDB content varies)
  - PDB files contain thousands of small differences (padding, internal offsets, GUIDs)
  - Microsoft's PDB format has fundamental non-determinism issues
  - Industry solution: Use clang-cl + lld-link instead of native MSVC

- ❌ **Does not work with stripped binaries** (no debug directory to patch)

### Alternative: Use clang-cl + lld-link

For **fully reproducible builds** including PDB files, use LLVM's Windows toolchain:

```bash
clang-cl /O2 /std:c++17 program.cpp /link /DEBUG:FULL /Brepro /TIMESTAMP:1
```

The `/TIMESTAMP:` flag is **only supported by lld-link**, not native MSVC link.exe.

## Technical Details

### PE File Structure

The tool parses the PE file structure to locate and patch:

1. **DOS Header** (offset 0x3C) → PE signature offset
2. **PE Signature** (offset varies) → Verify "PE\0\0"
3. **COFF Header** (after PE sig) → TimeDateStamp at +4
4. **Optional Header** (after COFF) → Contains Data Directories
5. **Data Directory #6** → Debug Directory (RVA + Size)
6. **Debug Directory Entries** → 28-byte structures with timestamps
7. **CODEVIEW RSDS Structure** → GUID at +4, Age at +20
8. **REPRO Hash** → Full hash data

### Why /Brepro Isn't Enough

MSVC's `/Brepro` flag:
- ✅ Removes **some** non-determinism
- ✅ Uses hash-based timestamps instead of wall clock time
- ❌ Still produces **different hashes** for each build
- ❌ GUID remains random
- ❌ Age field increments

This is because `/Brepro` computes a hash of build inputs, but includes random/variable data in that hash.

## Comparison with Alternatives

### vs. ducible

[ducible](https://github.com/jasonwhite/ducible) is an older tool with similar goals:
- ❌ **Unmaintained** (last update 2018)
- ❌ Only patches COFF header timestamp
- ❌ Does not patch Debug Directory timestamps
- ❌ Does not patch GUIDs or Age fields

### vs. clang-cl + lld-link

Using LLVM's toolchain:
- ✅ **Fully reproducible** (including PDB files)
- ✅ Supports `/TIMESTAMP:` flag
- ❌ Not always possible (may need native MSVC for compatibility)

This tool fills the gap when you **must use native MSVC** but still want reproducible .exe files.

## Research and References

The non-determinism of MSVC builds with debug symbols is well-documented:

- **Microsoft PDB Repository Issue #9**: PDB non-determinism issues (GUIDs, padding, uninitialized buffers)
- **Chromium Project**: Uses clang-cl + lld-link specifically for reproducible builds
- **Bazel Team**: Marked `/experimental:deterministic` as "not planned" because "PDBs are not deterministic"
- **Reproducible Builds Mailing List** (Dec 2024): "there is no way to really solve this issue" with MSVC
- **Stack Overflow** (Nov 2024): "No complete solution currently exists for achieving fully reproducible MSVC builds with debug symbols"

## License

Apache License 2.0 - See LICENSE file

## Contributing

Contributions welcome! Please test thoroughly with your build system before submitting PRs.

## Credits

Developed as part of the [ghidra-optimized-stdvector-decompiler](https://github.com/mithro/ghidra-optimized-stdvector-decompiler) project to enable CI verification of demo binaries compiled with multiple MSVC versions.
