Metadata-Version: 2.4
Name: redactflow
Version: 0.0.1
Summary: AI-powered PDF redaction tool with conversational interface
License: MIT
License-File: LICENSE
Requires-Python: >=3.11
Requires-Dist: azure-ai-documentintelligence>=1.0.0b4
Requires-Dist: fastapi>=0.109.0
Requires-Dist: google-genai>=1.0.0
Requires-Dist: httpx>=0.25.0
Requires-Dist: langchain-openai>=0.2.0
Requires-Dist: langchain>=0.3.0
Requires-Dist: langgraph>=0.2.0
Requires-Dist: langsmith>=0.1.0
Requires-Dist: openai>=1.93.0
Requires-Dist: opentelemetry-exporter-otlp-proto-grpc>=1.29.0
Requires-Dist: opentelemetry-instrumentation-fastapi>=0.50b0
Requires-Dist: opentelemetry-instrumentation-httpx>=0.50b0
Requires-Dist: opentelemetry-sdk>=1.29.0
Requires-Dist: pandas>=1.5.0
Requires-Dist: pillow>=9.0.0
Requires-Dist: pydantic>=2.5.3
Requires-Dist: pymupdf<2.0.0,>=1.25.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: python-jose[cryptography]>=3.3.0
Requires-Dist: python-multipart>=0.0.6
Requires-Dist: requests>=2.31.0
Requires-Dist: rich>=13.0.0
Requires-Dist: stripe>=8.0.0
Requires-Dist: supabase>=2.0.0
Requires-Dist: tavily-python>=0.3.7
Requires-Dist: typer>=0.9.0
Requires-Dist: uvicorn>=0.27.0
Provides-Extra: cli
Requires-Dist: httpx>=0.25.0; extra == 'cli'
Requires-Dist: rich>=13.0.0; extra == 'cli'
Requires-Dist: typer>=0.9.0; extra == 'cli'
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff>=0.4.0; extra == 'dev'
Description-Content-Type: text/markdown

# ✂️🤖 RedactFlow: Agentic PDF Sanitizer

RedactFlow is a PDF sanitization tool that uses an agentic workflow to detect and redact sensitive information in your documents. It combines LLM-based detection with a human-in-the-loop (HITL) interface so that detected redactions can be reviewed, edited, and approved before they are applied.

## 🚀 Quick Start

### Prerequisites

- **Python 3.11+** (the package metadata declares `Requires-Python: >=3.11`)
- **Node.js 16+** (for the React frontend)
- **An Azure account** with access to Azure OpenAI and Azure Document Intelligence
- **A Supabase account** for user authentication and subscription management (free tier available)

### Option 1: Modern React Frontend (Recommended)

The application now features a React frontend with a FastAPI backend for better performance and user experience.

#### 1. Clone and Setup

```bash
git clone https://github.com/matthewyijielu0317/RedactFlow.git
cd RedactFlow
```

#### 2. Backend Setup

```bash
# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install backend dependencies
pip install -r backend/requirements.txt
```

#### 3. Frontend Setup

```bash
# Install frontend dependencies
cd frontend
npm install
cd ..
```

#### 4. Environment Configuration

> **Note on Environment Files**: This project uses two `.env` files:
> - **Root `.env`**: Backend configuration (Python/FastAPI)
> - **`frontend/.env`**: Frontend configuration (React)
> 
> This separation is necessary because:
> - React (Create React App) only reads `.env` from its own directory
> - Backend and frontend have different environment variable requirements
> - **Important**: Keep Supabase URLs in sync between both files!

**Root `.env` file** (Backend Configuration):

Create a `.env` file in the root directory:

```bash
# Azure OpenAI Configuration
AZURE_OPENAI_ENDPOINT=your_azure_openai_endpoint
AZURE_OPENAI_API_KEY=your_azure_openai_api_key

# Azure Document Intelligence Configuration
AZURE_DI_ENDPOINT=your_azure_di_endpoint
AZURE_DI_KEY=your_azure_di_key

# Google Cloud Configuration (Optional - for image detection)
GOOGLE_CLOUD_PROJECT=redactflow-486302
ENABLE_IMAGE_DETECTION=true
GEMINI_MODEL_ID=gemini-2.5-pro

# Supabase Configuration (for backend authentication)
SUPABASE_URL=your_supabase_project_url
SUPABASE_JWT_SECRET=your_supabase_jwt_secret
SUPABASE_ANON_KEY=your_supabase_anon_key
SUPABASE_SERVICE_KEY=your_supabase_service_key

# Stripe Configuration (Optional, for payment integration)
STRIPE_SECRET_KEY=your_stripe_secret_key
STRIPE_PUBLISHABLE_KEY=your_stripe_publishable_key
STRIPE_WEBHOOK_SECRET=your_stripe_webhook_secret

# Tavily Search (Optional)
TAVILY_KEY=your_tavily_key
```

**`frontend/.env` file** (Frontend Configuration):

Create a `.env` file in the `frontend/` directory (use `frontend/.env.example` as template):

```bash
# Supabase Configuration
# ⚠️ IMPORTANT: These values should match the SUPABASE_URL and SUPABASE_ANON_KEY in root .env
# Get these values from: https://app.supabase.com/project/_/settings/api
REACT_APP_SUPABASE_URL=your_supabase_project_url
REACT_APP_SUPABASE_ANON_KEY=your_supabase_anon_key
```
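
Because the Supabase values must stay in sync across both files, a small check can catch drift early. The following is a minimal sketch using only the standard library; `read_env` and `find_mismatches` are hypothetical helper names, not part of RedactFlow, and the parser only handles plain `KEY=VALUE` lines.

```python
from pathlib import Path

# Backend variable -> frontend variable that must carry the same value.
SHARED_KEYS = {
    "SUPABASE_URL": "REACT_APP_SUPABASE_URL",
    "SUPABASE_ANON_KEY": "REACT_APP_SUPABASE_ANON_KEY",
}


def read_env(path):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def find_mismatches(root_env, frontend_env):
    """Return backend keys whose value differs from the frontend counterpart."""
    root = read_env(root_env)
    front = read_env(frontend_env)
    return [
        backend_key
        for backend_key, frontend_key in SHARED_KEYS.items()
        if root.get(backend_key) != front.get(frontend_key)
    ]
```

Calling `find_mismatches(".env", "frontend/.env")` from the project root returns an empty list when both files agree.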

#### 5. Run the Application

**Option A: Use the provided scripts (Recommended)**

```bash
# Terminal 1: Start Backend
chmod +x start_backend.sh
./start_backend.sh

# Terminal 2: Start Frontend
chmod +x start_frontend.sh
./start_frontend.sh
```

**Option B: Manual startup**

```bash
# Terminal 1: Start Backend
source venv/bin/activate
cd backend
python main.py

# Terminal 2: Start Frontend
cd frontend
npm start
```

#### 6. Access the Application

- **Frontend**: http://localhost:3000
- **Backend API**: http://localhost:8000
- **API Documentation**: http://localhost:8000/docs

### Option 2: Original Streamlit Interface

If you prefer the original Streamlit interface:

```bash
# Setup (same as above)
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Run Streamlit app
streamlit run app.py
```

## 🎯 Features

### Modern React Frontend
- **User Authentication**: Secure login/signup with Supabase (email/password and Google OAuth)
- **Subscription Management**: Track user plans and page limits
- **Interactive PDF Canvas**: Draw manual redactions directly on the PDF
- **Real-time Preview**: See redactions applied instantly
- **Smart Detection Panel**: Quick prompts for common redaction types
- **Workflow Progress**: Visual progress tracking through the AI workflow
- **Responsive Design**: Works on desktop and mobile devices

### Core AI Features
- **Agentic Workflow**: Utilizes a robust agentic workflow powered by LangGraph
- **Dual OCR Technology**: Employs a creative dual OCR process for both high-level content understanding and precise word-level coordinate mapping
- **Intelligent Detection**: Leverages large language models (LLMs) to analyze document content and identify sensitive information
- **Image Detection (Optional)**: Uses Google Gemini's spatial understanding API to detect and redact logos, icons, stamps, seals, and other graphical elements
- **Evaluator Feedback Loop**: Includes an evaluator agent that provides feedback to the detector, iteratively improving accuracy
- **Human-in-the-Loop (HITL) Interface**: Review, edit, and approve AI-detected redactions, plus add manual redactions
- **Flexible and Configurable**: Easily configurable with your own Azure OpenAI and Document Intelligence API keys

## 🏗️ System Architecture

The RedactFlow system is built around a `langgraph` state machine that orchestrates the flow of data through a series of nodes:


### Workflow Nodes


- **Orchestrator**: Entry point that interprets user prompts and routes requests
- **Searcher**: (Optional) Searches for external regulations and compliance information
- **Detector**: Core detection using dual OCR and dual LLM architecture
- **Evaluator**: Reviews detected data and provides feedback to improve accuracy
- **Human-in-the-Loop (HITL)**: Pauses workflow for user review and approval
- **Redactor**: Applies final redactions to create sanitized PDF

### The Detector Workflow

The **Detector Workflow** is built on a dual OCR and dual LLM architecture. The nodes around it work as follows:

-   **Orchestrator:** The entry point of the workflow. It interprets the user's prompt and decides whether to route the request to the `Searcher` for external regulation lookup or directly to the `Detector`.
-   **Searcher:** (Optional) Searches for external regulations and compliance information to enrich the detection criteria.
-   **Detector:** The core of the sensitive data detection process. Uses an innovative content batching approach with dual OCR and LLM architecture to analyze all pages together in a single API call, achieving massive efficiency gains.
-   **Evaluator:** Reviews all detected sensitive data across the entire document in a single LLM call and provides comprehensive feedback to improve accuracy.
-   **Corrector:** Applies evaluator feedback to refine detections across all pages in a single LLM call, ensuring consistent quality improvements throughout the document.
-   **Human-in-the-Loop (HITL):** Pauses the workflow and waits for the user to review, edit, and approve the redactions through the Streamlit UI.
-   **Redactor:** Applies the final redactions to the PDF, creating a sanitized version of the document.
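
The routing described above can be sketched in plain Python. This is not the LangGraph graph used by RedactFlow; it only illustrates the order of the nodes. The state keys `needs_regulation_lookup` and `approved` are hypothetical, and ending the run when the user rejects (so a new prompt can be submitted) is an assumption.

```python
# Nodes with exactly one successor; None marks the end of the workflow.
LINEAR_EDGES = {
    "searcher": "detector",
    "detector": "evaluator",
    "evaluator": "corrector",
    "corrector": "hitl",
    "redactor": None,
}


def next_node(current, state):
    """Return the name of the node that follows `current`, or None at the end."""
    if current == "orchestrator":
        # Hypothetical flag: route through the Searcher only when needed.
        return "searcher" if state.get("needs_regulation_lookup") else "detector"
    if current == "hitl":
        # Assumption: a rejection ends this run; the user retries with a new prompt.
        return "redactor" if state.get("approved") else None
    return LINEAR_EDGES[current]


def trace(state):
    """List the nodes visited for a given state, starting at the orchestrator."""
    path, node = [], "orchestrator"
    while node is not None:
        path.append(node)
        node = next_node(node, state)
    return path


if __name__ == "__main__":
    print(trace({"needs_regulation_lookup": True, "approved": True}))
```

In the real workflow the HITL node pauses and waits for user input; here the decision is read from the state so the sketch stays self-contained.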

## The Innovative Content Batching Architecture

The **Content Batching Architecture** changes how multi-page documents are processed. Instead of handling pages one at a time, RedactFlow combines all pages into a single prompt per stage and processes the entire document in just 3 LLM API calls, which cuts the number of API calls substantially.
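
As a rough illustration of the batching idea, the sketch below joins every page into one prompt with page markers so a model can report which page each item came from. The function name, the marker format, and the prompt wording are assumptions for illustration, not RedactFlow's actual prompt.

```python
def build_batched_prompt(pages, user_prompt):
    """Join every page's text into one prompt, tagging each page with its number."""
    sections = [
        f"=== PAGE {number} ===\n{text.strip()}"
        for number, text in enumerate(pages, start=1)
    ]
    return (
        f"Redaction request: {user_prompt}\n\n"
        "Identify sensitive items and report the page number for each.\n\n"
        + "\n\n".join(sections)
    )


if __name__ == "__main__":
    # Placeholder page text; a real run would use the page-level OCR output.
    print(build_batched_prompt(["Name: Jane Doe", "Phone: 555-0100"], "names, phone numbers"))
```

One prompt for the whole document means one API call per stage, regardless of page count, as long as the combined text fits the model's context window.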

### How Content Batching Works:

1. **Dual OCR in Parallel:**
   - **Page-level OCR**: Extracts content for high-level semantic analysis
   - **Word-level OCR**: Gets precise coordinates of each word

2. **Dual LLM Analysis:**
   - **Sensitive Identification LLM**: Analyzes content to identify sensitive information
   - **Mapping LLM**: Maps sensitive content to precise word-level coordinates
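
To make the mapping step concrete, here is a simplified sketch that locates a sensitive phrase in word-level OCR output and returns a bounding box around it. RedactFlow uses a Mapping LLM for this step; the exact string matching below is a swapped-in stand-in, and the `Word` type and `locate_phrase` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One OCR word with its bounding box in page coordinates."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def normalize(token):
    """Lowercase a token and strip surrounding punctuation for comparison."""
    return token.strip(".,:;()").lower()


def locate_phrase(words, phrase):
    """Return one bounding box (x0, y0, x1, y1) per occurrence of `phrase`."""
    target = [normalize(t) for t in phrase.split()]
    boxes = []
    if not target:
        return boxes
    for start in range(len(words) - len(target) + 1):
        window = words[start:start + len(target)]
        if [normalize(w.text) for w in window] == target:
            # Union of the word boxes covers the whole phrase.
            boxes.append((
                min(w.x0 for w in window),
                min(w.y0 for w in window),
                max(w.x1 for w in window),
                max(w.y1 for w in window),
            ))
    return boxes
```

A phrase that wraps across two lines would get one large box spanning both lines in this sketch; a production mapper would likely emit one box per line.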

## 🎮 How to Use

### Using the React Frontend

1. **Upload a PDF**: Drag and drop or click to upload your PDF file
2. **Set Detection Prompt**: Enter what you want to redact (e.g., "names, addresses, phone numbers")
3. **Run Detection**: Click "Run Detection" to start the AI workflow
4. **Review Results**: 
   - Review AI-detected sensitive information
   - Edit or delete incorrect detections
   - Add manual redactions by drawing on the PDF
5. **Approve or Reject**:
   - **Approve**: Proceed to final redaction
   - **Reject**: Modify your prompt and try again
6. **Download**: Get your redacted PDF

### Key Features

- **Manual Redaction**: Draw rectangles directly on the PDF to mark sensitive areas
- **Edit AI Detections**: Modify content, reasons, or bounding boxes
- **Real-time Preview**: See changes applied instantly
- **Workflow Control**: Approve, reject, or go back to review stage


### Feedback Loop

RedactFlow refines its detections through a feedback loop between the `Evaluator`, `Corrector`, and `HumanInLoop` nodes.

-   **Content Batching Feedback Loop:** After the `Detector` identifies sensitive data across all pages, the `Evaluator` analyzes the entire document in a single LLM call, comparing the results against the user prompt and the document context. Because all content is processed together, it can flag patterns and inconsistencies across pages and produce feedback for every page at once. The `Corrector` then applies this feedback in a single LLM call, so refinements are consistent throughout the document.


## 📁 Project Structure


```
RedactFlow/
├── frontend/                 # React frontend
│   ├── src/
│   │   ├── components/      # React components
│   │   ├── types.ts         # TypeScript type definitions
│   │   └── App.tsx          # Main React app
│   ├── package.json         # Frontend dependencies
│   └── tsconfig.json        # TypeScript configuration
├── backend/                  # FastAPI backend
│   ├── main.py              # FastAPI server
│   ├── requirements.txt     # Python dependencies
│   └── static/              # Generated PDF previews
├── nodes/                    # LangGraph workflow nodes
│   ├── orchestrator.py      # Main workflow orchestration
│   ├── detector_node.py     # Dual OCR/LLM detection
│   ├── evaluator_node.py    # Detection evaluation
│   ├── hitl_node.py         # Human-in-the-loop logic
│   └── redactor_node.py     # PDF redaction
├── output/                   # Generated files
│   ├── original/            # Original uploaded PDFs
│   ├── preview/             # Preview images
│   └── redacted/            # Final redacted PDFs
├── app.py                   # Original Streamlit app
├── requirements.txt         # Root Python dependencies
└── start_*.sh              # Startup scripts
```

Combining the automated content batching feedback loop with human oversight helps ensure that the final redacted document is accurate, reliable, and meets the user's specific needs.

## Performance & Real-World Results

RedactFlow has been tested with real-world documents, including complex immigration forms (I-20). The results from one such run are summarized below:

### **Real I-20 Form Processing Results:**
- **Document**: 4-page I-20 Certificate of Eligibility for Nonimmigrant Student Status
- **OCR Extraction**: 318 page elements, 1,297 word elements
- **Detection Results**: 39 sensitive items with precise coordinates
- **API Efficiency**: 3 LLM calls vs 48+ traditional individual calls
- **Cost Savings**: ~90% reduction in API costs
- **Processing Time**: Significant reduction through content batching
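
For reference, the call-count figures above work out as follows. This is plain arithmetic on the numbers quoted in the list; it measures the reduction in the number of calls only, whereas actual cost also depends on the tokens sent per call.

```python
def call_reduction(batched_calls, individual_calls):
    """Fraction of LLM calls avoided by batching."""
    return 1 - batched_calls / individual_calls


if __name__ == "__main__":
    # 3 batched calls versus 48 individual calls, as quoted above.
    print(f"{call_reduction(3, 48):.2%}")
```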

### **Detection Categories Successfully Identified:**
- **Student Information**: Names, SEVIS IDs, birth dates, citizenship
- **Academic Program**: School names, degree types, program duration, majors
- **Official Information**: School codes, approval dates, certification details
- **Financial Data**: Tuition amounts, funding sources
- **Immigration Data**: Document numbers, status classifications

### **Quality Assurance:**
- **Content Batching Evaluation**: Comprehensive quality checks across all pages in single LLM calls
- **Feedback Integration**: Automatic correction of detection gaps
- **Coordinate Precision**: Exact pixel-level redaction boundaries
- **Fallback Mechanisms**: Robust error handling and recovery

## Preview before and after user's feedback

![Preview Before Human Feedback](diagrams/preview_before_feedback.png)

Example user feedback prompt: "You failed to detect the UID and financial amount. Also, include the date."

![Preview After Human Feedback](diagrams/preview_after_feedback.png)

The final redacted version is shown below:

![Final Redacted Version](diagrams/final_redacted_version.png)


## 🔧 Configuration

### Environment Variables

Create a `.env` file with the following variables:

```env
# Required: Azure OpenAI
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_API_KEY=your_api_key_here

# Required: Azure Document Intelligence
AZURE_DI_ENDPOINT=https://your-resource.cognitiveservices.azure.com/
AZURE_DI_KEY=your_di_key_here

# Optional: Google Cloud Image Detection
GOOGLE_CLOUD_PROJECT=redactflow-486302
ENABLE_IMAGE_DETECTION=true
GEMINI_MODEL_ID=gemini-2.5-pro
# GOOGLE_APPLICATION_CREDENTIALS=secrets/redactflow-service-account.json  # If using service account

# Optional: Tavily Search (for external regulation lookup)
TAVILY_KEY=your_tavily_key_here

# Demo/Deployment Controls
# Limit analysis to the first N pages (0 disables the cap)
MAX_ANALYZED_PAGES=4
```
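
The `MAX_ANALYZED_PAGES` semantics (first N pages, with 0 disabling the cap) can be sketched as below. The helper name is hypothetical, the default of 4 follows the value shown in this README, and treating negative values like 0 is an assumption.

```python
import os


def pages_to_analyze(total_pages, env=None):
    """Return how many pages to analyze given the MAX_ANALYZED_PAGES cap."""
    env = os.environ if env is None else env
    cap = int(env.get("MAX_ANALYZED_PAGES", "4"))
    if cap <= 0:
        # 0 disables the cap (negative values are treated the same way here).
        return total_pages
    return min(total_pages, cap)
```

With `MAX_ANALYZED_PAGES=4`, a 10-page PDF is analyzed up to page 4, while a 2-page PDF is analyzed in full.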

### Azure Setup

1. **Azure OpenAI Service**:
   - Create an Azure OpenAI resource
   - Deploy a GPT-4 model
   - Get your endpoint and API key

2. **Azure Document Intelligence**:
   - Create a Document Intelligence resource
   - Get your endpoint and key

### Google Cloud Setup (for Image Detection - Optional)

RedactFlow uses **Google Gemini's spatial understanding API** to detect logos, icons, stamps, seals, and other graphical elements within PDF documents. This feature is optional but highly recommended for comprehensive document sanitization.

#### Quick Setup Steps

1. **Install Google Cloud SDK**:
   ```bash
   # macOS (Homebrew)
   brew install --cask google-cloud-sdk
   source "$(brew --prefix)/share/google-cloud-sdk/path.zsh.inc"
   
   # Verify installation
   gcloud --version
   ```

2. **Fix Permissions (if needed)**:
   If you encounter permission errors during installation, run:
   ```bash
   sudo chown -R $USER:staff ~/.config
   brew reinstall gcloud-cli
   ```

3. **Authenticate**:
   ```bash
   gcloud auth application-default login
   ```
   
   A browser window will open. Sign in with a Google account that has access to the `redactflow-486302` project.

4. **Set Quota Project**:
   ```bash
   gcloud auth application-default set-quota-project redactflow-486302
   ```

5. **Update Your `.env` File**:
   Add these lines to your root `.env` file:
   ```env
   # Google Cloud Image Detection
   GOOGLE_CLOUD_PROJECT=redactflow-486302
   ENABLE_IMAGE_DETECTION=true
   GEMINI_MODEL_ID=gemini-2.5-pro
   ```

6. **Restart the Backend**:
   ```bash
   ./start_backend.sh
   ```

#### Alternative: Service Account Key

If you don't want to install `gcloud`, you can use a service account key file:

1. Place the key file at `secrets/redactflow-service-account.json`
2. Add to `.env`:
   ```env
   GOOGLE_APPLICATION_CREDENTIALS=secrets/redactflow-service-account.json
   GOOGLE_CLOUD_PROJECT=redactflow-486302
   ENABLE_IMAGE_DETECTION=true
   GEMINI_MODEL_ID=gemini-2.5-pro
   ```
3. Restart the backend (no `gcloud` CLI needed)

#### Environment Variables for Image Detection

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `GOOGLE_CLOUD_PROJECT` | Yes | — | Your GCP project ID |
| `ENABLE_IMAGE_DETECTION` | Yes | `false` | Set to `true` to enable |
| `GEMINI_MODEL_ID` | No | `gemini-2.5-flash` | Model choice (see below) |
| `GEMINI_LOCATION` | No | `global` | Vertex AI region |
| `IMAGE_DETECT_CONCURRENCY` | No | `4` | Max pages processed in parallel |
| `MAX_ANALYZED_PAGES` | No | `4` | Page limit for detection |
| `GOOGLE_APPLICATION_CREDENTIALS` | No | — | Path to service account JSON |
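
One way to read the variables in the table above, with the same defaults, is sketched below. `ImageDetectionConfig` and `load_image_detection_config` are hypothetical names, and the backend's actual loading code may differ. `MAX_ANALYZED_PAGES` is left out to keep the sketch short.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ImageDetectionConfig:
    project: str
    enabled: bool
    model_id: str
    location: str
    concurrency: int
    credentials_path: Optional[str]


def load_image_detection_config(env=None):
    """Build the config from environment variables, applying the table's defaults."""
    env = os.environ if env is None else env
    enabled = env.get("ENABLE_IMAGE_DETECTION", "false").strip().lower() == "true"
    project = env.get("GOOGLE_CLOUD_PROJECT", "")
    if enabled and not project:
        raise ValueError("GOOGLE_CLOUD_PROJECT is required when image detection is enabled")
    return ImageDetectionConfig(
        project=project,
        enabled=enabled,
        model_id=env.get("GEMINI_MODEL_ID", "gemini-2.5-flash"),
        location=env.get("GEMINI_LOCATION", "global"),
        concurrency=int(env.get("IMAGE_DETECT_CONCURRENCY", "4")),
        credentials_path=env.get("GOOGLE_APPLICATION_CREDENTIALS"),
    )
```

Failing fast on a missing project ID surfaces configuration mistakes at startup rather than on the first detection request.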

#### Model Choices

| Model | Speed | Cost | Accuracy | Best for |
|-------|-------|------|----------|----------|
| `gemini-2.5-flash` | Fast | Low | Good | Cost-sensitive, high volume |
| `gemini-2.5-pro` | Moderate | Mid | Better | Precision matters, small logos |

For detailed setup instructions and troubleshooting, see [`docs/IMAGE_DETECTION_SETUP.md`](docs/IMAGE_DETECTION_SETUP.md).

### Supabase Setup (Authentication + Storage)

1. **Create a Supabase Project**:
   - Go to [Supabase](https://supabase.com/) and create a new project
   - Wait for the project to be fully provisioned

2. **Get API Credentials**:
   - Navigate to **Project Settings → API**
   - Copy your `Project URL` → `SUPABASE_URL`
   - Copy your `anon/public` key → `SUPABASE_ANON_KEY`
   - Copy your `service_role` key → `SUPABASE_SERVICE_KEY` (keep this secret!)
   - Copy your `JWT Secret` from **Project Settings → API → JWT Settings** → `SUPABASE_JWT_SECRET`

3. **Run Database Migrations** (in order):

   Open **Supabase Dashboard → SQL Editor** and run these scripts:

   | Order | Script | What it creates |
   |-------|--------|-----------------|
   | 1 | `scripts/create_storage_tables.sql` | `users` table (synced from auth.users via trigger), `files` table (metadata, OCR, annotations), `source-files` and `redacted-files` Storage buckets, RLS policies, indexes |
   | 2 | `scripts/create_subscriptions_table.sql` | Subscription management table (for Stripe integration) |

   > **Important**: `create_storage_tables.sql` must be run first. It creates the core `users` and `files` tables, Storage buckets, and Row Level Security policies that the application depends on. Without it, the backend will fail to start.

4. **Configure Environment Variables**:
   - Copy `.env.example` to `.env` in the project root and fill in your Supabase credentials
   - Copy `frontend/.env.example` to `frontend/.env` and fill in the frontend Supabase credentials
   - Make sure `SUPABASE_URL` and `SUPABASE_ANON_KEY` match between both files

## 🐛 Troubleshooting

### Common Issues

1. **Port Already in Use**:
   ```bash
   # Kill processes on ports 3000 and 8000
   lsof -ti:3000 | xargs kill -9
   lsof -ti:8000 | xargs kill -9
   ```

2. **TypeScript Errors**:
   ```bash
   cd frontend
   npm install
   ```

3. **Python Dependencies**:
   ```bash
   pip install --upgrade pip
   pip install -r backend/requirements.txt
   ```

4. **Node Modules Issues**:
   ```bash
   cd frontend
   rm -rf node_modules package-lock.json
   npm install
   ```

5. **Google Cloud Authentication Errors**:
   If you see "Your default credentials were not found" errors:
   
   ```bash
   # Option A: Re-authenticate with gcloud
   gcloud auth application-default login
   gcloud auth application-default set-quota-project redactflow-486302
   
   # Option B: Use service account key
   # Add to .env: GOOGLE_APPLICATION_CREDENTIALS=secrets/redactflow-service-account.json
   ```

6. **Permission Denied for .config Directory**:
   ```bash
   # Fix ownership of .config directory
   sudo chown -R $USER:staff ~/.config
   
   # Reinstall gcloud if needed
   brew reinstall gcloud-cli
   ```

7. **Image Detection Not Working**:
   - Ensure `ENABLE_IMAGE_DETECTION=true` in your `.env` file
   - Verify Google Cloud credentials are set up correctly
   - Check backend logs for specific error messages
   - See [`docs/IMAGE_DETECTION_SETUP.md`](docs/IMAGE_DETECTION_SETUP.md) for detailed troubleshooting
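
For the port conflict in item 1, a quick cross-platform check before killing anything can confirm whether ports 3000 and 8000 are actually occupied. This is a minimal standard-library sketch; `port_in_use` is a hypothetical helper and it only probes `127.0.0.1`.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 when the connection succeeds.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    for port in (3000, 8000):
        status = "in use" if port_in_use(port) else "free"
        print(f"Port {port}: {status}")
```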

### Development

- **Frontend Development**: `cd frontend && npm start`
- **Backend Development**: `cd backend && python main.py`
- **API Testing**: Visit http://localhost:8000/docs for interactive API documentation

## 📄 File Descriptions

- **`frontend/src/App.tsx`**: Main React application with PDF canvas and workflow management
- **`backend/main.py`**: FastAPI server handling PDF processing and AI workflow
- **`nodes/orchestrator.py`**: LangGraph workflow orchestration
- **`nodes/detector_node.py`**: Dual OCR and dual LLM detection logic
- **`nodes/recall_node.py`**: AI recall additions and feedback
- **`nodes/hitl_node.py`**: Human-in-the-loop workflow control
- **`nodes/redactor_node.py`**: PDF redaction application
- **`app.py`**: Original Streamlit interface (legacy)

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test thoroughly
5. Submit a pull request

## 📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

## 🙏 Acknowledgments


- Built with [LangGraph](https://github.com/langchain-ai/langgraph) for workflow orchestration
- Powered by [Azure OpenAI](https://azure.microsoft.com/en-us/products/ai-services/openai-service) and [Azure Document Intelligence](https://azure.microsoft.com/en-us/products/ai-services/form-recognizer)
- Frontend built with [React](https://reactjs.org/) and [Tailwind CSS](https://tailwindcss.com/)
- Backend powered by [FastAPI](https://fastapi.tiangolo.com/)
