Metadata-Version: 2.4
Name: formalyzer
Version: 0.0.5
Summary: Analyze PDF and web forms and fill in the forms
Home-page: https://github.com/drscotthawley/formalyzer
Author: Scott H. Hawley
Author-email: scott.hawley@belmont.edu
License: Apache Software License 2.0
Keywords: nbdev jupyter notebook python pdf
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: License :: OSI Approved :: Apache Software License
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: beautifulsoup4
Requires-Dist: playwright
Requires-Dist: claudette
Requires-Dist: lisette
Requires-Dist: pypdf
Requires-Dist: fastcore
Provides-Extra: dev
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: license
Dynamic: license-file
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# formalyzer


<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

## Motivation

I am happy to write a recommendation letter “by hand” for a student. But
then each graduate school has their own lengthy, idiosyncratic form,
foisting upon me their job of data entry. This is tedious work,
especially with many schools and several students. Thus, I’ve wanted to
automate the form-filling for quite a while.

## Description

Formalyzer extracts the text from the PDF recommendation letter and
then, for each URL in the URL list, it will:

- launch a browser tab for that URL
- fill in the form using what the LLM has gleaned from the
  recommendation letter
- attach the PDF via the form’s upload/attachment button

…and do no more.

The user will need to review the page and press the Submit button
manually.

### Requirements

- Either `ollama` installed locally or the `ANTHROPIC_API_KEY`
  environment variable set
- Python packages: `beautifulsoup4`, `playwright`, `claudette`,
  `lisette`, `pypdf`, `fastcore`

### Technical Approach

You *could* try to feed raw HTML and PDF into an LLM, but that might be
a waste of resources – prohibitively slow, expensive, and error-prone.
Instead, `formalyzer` uses

- standard packages to pre-process & reduce the inputs: `bs4` for HTML,
  `pypdf` for PDF
- the LLM *only* for *reading* the reduced input texts (+ a system
  prompt) and *outputting* values to assign to form fields.
- another existing package (`playwright`) to fill in those fields.
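
To illustrate the "reduce the inputs" step, here is a minimal sketch of boiling a form page down to a short list of fillable fields before anything is sent to the LLM. It is not formalyzer's actual code: it uses the standard library's `html.parser` in place of `bs4` so that it runs without extra dependencies, and the names `FormFieldCollector` and `reduce_form_html` are hypothetical.

``` python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect a compact description of each fillable form field."""

    FIELD_TAGS = {"input", "select", "textarea"}
    # Input types that a recommender would never type a value into.
    SKIPPED_INPUT_TYPES = {"hidden", "submit", "button", "reset", "image"}

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.FIELD_TAGS:
            return
        attr = dict(attrs)
        # Only <input> carries a meaningful "type"; otherwise use the tag name.
        field_type = attr.get("type", "text") if tag == "input" else tag
        if field_type in self.SKIPPED_INPUT_TYPES:
            return
        field_id = attr.get("id") or attr.get("name")
        if field_id:  # a field with no id or name cannot be targeted later
            self.fields.append({"id": field_id, "type": field_type})


def reduce_form_html(html):
    """Return a list of {"id": ..., "type": ...} dicts for the form fields."""
    collector = FormFieldCollector()
    collector.feed(html)
    return collector.fields
```

Calling `reduce_form_html(page_html)` on a page's HTML yields a list that is typically far smaller than the raw markup, which is the point of pre-processing: the LLM only needs to see field identifiers and types, not layout or scripts.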

## Usage

On macOS, start the Chrome browser with remote debugging enabled on
port 9222 by executing this command in the terminal:

``` bash
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222 --user-data-dir=/tmp/chrome-debug
```
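
Before going further, it can help to confirm that Chrome is actually listening on the debugging port. The sketch below queries Chrome's `/json/version` DevTools endpoint using only the standard library. The helper name `chrome_debug_info` is hypothetical and is not part of formalyzer.

``` python
import http.client
import json
import urllib.request


def chrome_debug_info(port=9222, timeout=2.0):
    """Return Chrome's DevTools version info as a dict, or None if unreachable."""
    url = f"http://localhost:{port}/json/version"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.load(response)
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, or a non-JSON reply: treat as "not running".
        return None


if __name__ == "__main__":
    info = chrome_debug_info()
    if info is None:
        print("Chrome is not reachable on port 9222; start it with the command above.")
    else:
        print("Connected to", info.get("Browser", "an unknown browser"))
```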

Then you can run this command:

``` bash
formalyzer --debug <recc_info.txt> <recc_letter.pdf> <url_list.txt>
```

where `recc_info.txt` contains information about the recommender (name,
title, address, phone number, and email) and `url_list.txt` is a file
containing one URL per line.
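
As a rough illustration of the "one URL per line" format, the following sketch reads such a file into a list. Skipping blank lines and `#` comment lines is an assumption made for this example, not documented formalyzer behavior, and `read_url_list` is a hypothetical name.

``` python
from pathlib import Path


def read_url_list(path):
    """Read a text file with one URL per line and return the URLs as a list."""
    urls = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Assumption for this sketch: ignore blank lines and '#' comments.
        if line and not line.startswith("#"):
            urls.append(line)
    return urls
```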

### Installation

Install the latest version from the GitHub
[repository](https://github.com/drscotthawley/formalyzer):

``` sh
$ pip install git+https://github.com/drscotthawley/formalyzer.git
```

or from [PyPI](https://pypi.org/project/formalyzer/):

``` sh
$ pip install formalyzer
```

After installing, run `playwright install chromium` to download the
browser binaries.

## Demo

On macOS, run these commands in Terminal:

1.  `/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222 --user-data-dir=/tmp/chrome-debug &`
2.  `cd example`
3.  `python -m http.server 8000 &`
4.  `export ANTHROPIC_API_KEY="__your_API_key_goes_here__"`
5.  `formalyzer --debug recc_info.txt sample_letter.pdf sample_urls.txt`

## Local LLM Execution

For [FERPA](https://studentprivacy.ed.gov/ferpa) compliance, running a
local model is preferable so that student data is not sent to a third
party. I recommend using [`ollama`](https://ollama.com) and starting
with something medium-small like `qwen2.5:14b` (9 GB). Start up ollama:

``` bash
ollama serve &
ollama pull qwen2.5:14b
```

Then you can use the `--model` CLI flag, e.g.

``` bash
formalyzer --debug --model 'ollama/qwen2.5:14b' recc_info.txt sample_letter.pdf sample_urls.txt
```

The quality of the form-filling will vary depending on the quality and
size of the model you use. Smaller models like `mistral` (4 GB) may
hallucinate many of the form field IDs, resulting in a mostly-blank form
in the end. For a huge (41 GB) model, try `ollama/qwen2:72b`.
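
One way to see how badly a model is hallucinating field IDs is to compare its proposed values against the IDs that actually exist on the page. The sketch below is only an illustration of that check, not formalyzer's implementation. The function name `check_proposed_values` and the sample field IDs are hypothetical.

``` python
def check_proposed_values(proposed, known_ids):
    """Split LLM-proposed field values by whether the field ID exists on the page."""
    known = set(known_ids)
    return {
        # Values whose field ID really exists and can be filled in.
        "accepted": {fid: val for fid, val in proposed.items() if fid in known},
        # IDs the model invented; these would silently fill nothing.
        "rejected": sorted(fid for fid in proposed if fid not in known),
        # Real fields the model left out; these still need filling by hand.
        "unfilled": sorted(known - set(proposed)),
    }


if __name__ == "__main__":
    # Hypothetical example data for illustration only.
    report = check_proposed_values(
        proposed={"first_name": "Ada", "applicant_gpa": "4.0"},
        known_ids=["first_name", "email"],
    )
    print(report)
```

A long `rejected` list relative to `accepted` suggests the model is too small for the form at hand and a larger one is worth trying.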

## Developer Guide

### Install formalyzer in development mode

``` sh
# make sure formalyzer package is installed in development mode
$ pip install -e .

# make changes under nbs/ directory
# ...

# compile to have changes apply to formalyzer
$ nbdev_prepare
```

## Documentation

Documentation is hosted on this GitHub
[repository](https://github.com/drscotthawley/formalyzer)’s
[pages](https://drscotthawley.github.io/formalyzer/). Package-manager
specific instructions are available on
[conda](https://anaconda.org/drscotthawley/formalyzer) and
[PyPI](https://pypi.org/project/formalyzer/).

## Limitations

Sometimes the LLM will miss certain fields – that’s just the nature of
the game – so you’ll still need to fill those in by hand. But it gets
*most* of them!
