Semantic PDF Highlighting
Overlay colour-coded highlights on a PDF that separate what the paper claims from its methods, self-admitted limitations, and stance toward related work. Highlights are standard PDF annotation objects placed on a copy of the source — the original bytes are unchanged and any viewer can show or strip them.
Five rhetorical categories, five colours:
| colour | category | meaning |
|---|---|---|
| green | | what THIS paper clarifies, suggests, demonstrates, establishes, reports. First-person stance markers (“we show/find/demonstrate”, “our results”, “this finding”) and quantitative results attached to the paper’s own analysis. |
| purple | | THIS paper’s own novel method, model, cohort, or analysis pipeline. Not routine study logistics, not background on existing methods. |
| red | | self-admitted limitation, caveat, confound, or threat to validity. |
| blue | | a specific prior/other paper whose finding SUPPORTS this paper’s position. |
| orange | | a specific prior/other paper whose finding CONTRADICTS this paper. |
Precision-biased: when a sentence does not clearly fall into one category, the classifier
returns none and no highlight is drawn.
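
The precision bias can be pictured as a guard between the classifier output and the annotator. The sketch below is illustrative only: the function name `resolve_label`, the 0.7 threshold, and the idea that the decision is driven by the per-label confidence value are assumptions, not the package's actual logic. The category identifiers are the ones named elsewhere in this document; the full set may be larger.

```python
from typing import Optional

# Category identifiers that appear elsewhere in this document.
# The classifier's full category set may contain more entries.
KNOWN_CATEGORIES = {"focal_claim", "focal_method", "related_supportive"}


def resolve_label(
    category: Optional[str],
    confidence: float,
    allowed=KNOWN_CATEGORIES,
    threshold: float = 0.7,  # hypothetical cut-off, chosen for illustration
) -> Optional[str]:
    """Return the category only when the prediction is unambiguous.

    Anything unknown or below the threshold becomes None, which means
    no highlight is drawn for that sentence.
    """
    if category not in allowed:
        return None
    if confidence < threshold:
        return None
    return category
```

Returning `None` rather than the best guess trades recall for precision: a missed highlight costs less than a wrong colour on the page.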
Pipeline
PDF
-> PyMuPDF paragraph extraction (bounding boxes)
-> sentence splitting (abbreviation-aware — handles Fig., e.g., et al.)
-> Claude classifier (5 categories, batched JSON-structured output)
-> tight per-line quads via PyMuPDF dict layout
-> PDF Highlight annotations (opacity 0.4)
-> compact legend overlay stamped in the lower-right corner of the last page
-> save
The source PDF is never modified; the output is a copy with annotation objects overlaid.
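
The annotation and save stages can be sketched with standard PyMuPDF calls. This is a minimal illustration, not the package's implementation: the RGB values are assumed (only the colour names come from the table above), the helper names are hypothetical, and `page.search_for` stands in for the tighter per-line quads that the real pipeline derives from the dict layout.

```python
# Hypothetical RGB values (0..1 floats); only the colour names are from the docs.
COLOURS = {
    "green": (0.60, 0.90, 0.60),
    "purple": (0.80, 0.65, 0.95),
    "red": (1.00, 0.60, 0.60),
    "blue": (0.60, 0.80, 1.00),
    "orange": (1.00, 0.80, 0.50),
}
OPACITY = 0.4  # value stated in the pipeline description


def add_highlight(page, quads, colour_name):
    """Draw one Highlight annotation over `quads` on a PyMuPDF page."""
    annot = page.add_highlight_annot(quads)
    annot.set_colors(stroke=COLOURS[colour_name])
    annot.set_opacity(OPACITY)
    annot.update()  # regenerate the appearance stream after the changes
    return annot


def annotate_copy(src, dst, sentence, colour_name):
    """Highlight every occurrence of `sentence` and write the result to `dst`."""
    import fitz  # PyMuPDF, imported lazily so add_highlight stays dependency-free

    doc = fitz.open(src)
    for page in doc:
        quads = page.search_for(sentence, quads=True)
        if quads:
            add_highlight(page, quads, colour_name)
    doc.save(dst)  # a new file is written; `src` is left untouched on disk
    doc.close()
```

Because the highlights are ordinary annotation objects, a viewer can hide or delete them without touching the page content.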
Four Interfaces
Python API
from scitex_scholar.pdf_highlight import (
    highlight_pdf,          # one-shot: extract -> classify -> annotate -> save
    extract_blocks,         # stage 1 only
    apply_classifications,  # merge offline labels into extracted blocks
    save_with_highlights,   # stage 3 only (apply labels -> write PDF)
)

result = highlight_pdf(
    "paper.pdf",
    output_path="paper.highlighted.pdf",
    model="claude-haiku-4-5-20251001",
    sentence_level=True,
    add_legend=True,
)
print(result.counts(), result.annotations_added)
CLI
# Top-level scholar CLI
scitex-scholar highlight paper.pdf
scitex-scholar highlight paper.pdf --output paper.marked.pdf
scitex-scholar highlight paper.pdf --model claude-sonnet-4-6
scitex-scholar highlight paper.pdf --stub # offline keyword heuristic
scitex-scholar highlight paper.pdf --dry-run # classify only, no PDF write
scitex-scholar highlight paper.pdf --max-blocks 20 # smoke test
# Offline label-apply workflow
scitex-scholar highlight paper.pdf --labels-dump blocks.json # extract
# ... produce labels.json with {id, category, confidence} objects ...
scitex-scholar highlight paper.pdf --labels-apply labels.json # annotate only
# Standalone module entry point
python -m scitex_scholar.pdf_highlight paper.pdf
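
The middle step of the offline workflow (turning the dumped blocks into labels.json) is left to the user. The sketch below shows one way it might look. It assumes blocks.json is a JSON list of objects with `id` and `text` fields, which is not confirmed by this document; only the `{id, category, confidence}` shape of the output is. The keyword rule and the function names are hypothetical placeholders for whatever labeller is actually used.

```python
import json
from pathlib import Path


def toy_classify(text):
    """Hypothetical keyword rule standing in for a real offline labeller."""
    lowered = text.lower()
    if "we show" in lowered or "our results" in lowered:
        return "focal_claim", 0.9
    return None  # unclear sentences get no label at all


def build_labels(blocks):
    """Map extracted blocks to {id, category, confidence} objects."""
    labels = []
    for block in blocks:
        hit = toy_classify(block["text"])  # "text" field is an assumption
        if hit is not None:
            category, confidence = hit
            labels.append(
                {"id": block["id"], "category": category, "confidence": confidence}
            )
    return labels


def convert(blocks_path="blocks.json", labels_path="labels.json"):
    """Read the dumped blocks and write the labels file for --labels-apply."""
    blocks = json.loads(Path(blocks_path).read_text(encoding="utf-8"))
    labels = build_labels(blocks)
    Path(labels_path).write_text(json.dumps(labels, indent=2), encoding="utf-8")
```

Blocks that receive no label are simply omitted from the output, which matches the precision-biased behaviour of the online classifier.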
MCP
Exposed as scholar_highlight_pdf in the unified scitex serve server. The handler
is re-exported from scitex_scholar._mcp.all_handlers for direct registration.
from scitex_scholar._mcp.all_handlers import highlight_pdf_handler
# Register with any FastMCP-compatible server
Skill
The AI-agent skill lives at
src/scitex_scholar/_skills/scitex-scholar/semantic-highlight.md
and documents the classification scheme, model trade-offs, and the label-apply workflow.
Model Selection
| model | wall time (8-page paper) | cost tier | strengths |
|---|---|---|---|
| | ~40 s sentence / ~10 s paragraph | cheap | structural tagging (method / claim / limitation) |
| | ~70 s sentence / ~17 s paragraph | mid | catches subtle supportive / contradictive stances in related work |
Use Haiku for library-wide scans. Escalate to Sonnet when the supportive/contradictive axis matters for a specific paper.
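
The selection rule and a rough time budget can be written down as a small helper. Several things here are assumptions: that the model IDs used in the API and CLI examples correspond to the Haiku and Sonnet rows of the table, that wall time scales linearly with page count, and the helper names themselves. The per-paper timings are the sentence-level figures from the table.

```python
HAIKU = "claude-haiku-4-5-20251001"  # ID from the Python API example
SONNET = "claude-sonnet-4-6"         # ID from the CLI example

# Sentence-level wall time in seconds for an 8-page paper (from the table).
SECONDS_PER_8_PAGES = {HAIKU: 40, SONNET: 70}


def pick_model(stance_matters: bool) -> str:
    """Haiku by default; Sonnet when supportive/contradictive stance matters."""
    return SONNET if stance_matters else HAIKU


def estimate_seconds(model: str, pages: int) -> float:
    """Linear extrapolation from the 8-page figure; a guess, not a benchmark."""
    return SECONDS_PER_8_PAGES[model] * pages / 8
```

For example, `estimate_seconds(pick_model(False), 16)` gives 80.0 under the linear-scaling assumption.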
Known Limitations
Paragraph clip regions. Extraction uses PyMuPDF’s paragraph blocks as the clip region for each sentence. Multi-column papers are handled, but pathological layouts (wrapped tables, pull-quotes) may produce skewed rectangles.
Sentence splitter is regex-based. Handles common academic abbreviations (Fig., e.g., et al., single-initial names) but is not a full statistical sentence tokenizer. Edge cases fall back to line-level quads (still tight, with no margin inflation).
One label per sentence. A sentence that both describes a method and reports what the method found gets the higher-priority label (focal_claim wins over focal_method per the prompt). Overlapping annotations would be possible but are not currently emitted.
No self-citation detection. “Cook et al. (11)” where (11) is the same first author as the focal paper is tagged as related_supportive when the sentence endorses it, even though mechanically it is self-citation.
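
To make the splitter limitation concrete, here is a minimal sketch of an abbreviation-aware regex splitter. It is not the package's code: the abbreviation list, the boundary pattern, and the function name are assumptions chosen for illustration.

```python
import re

# Illustrative abbreviation list; the real splitter's list is not documented here.
ABBREVIATIONS = {"fig.", "figs.", "e.g.", "i.e.", "al.", "vs.", "eq."}

# Candidate boundary: terminal punctuation, whitespace, then a capital or "(".
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z(])")
SINGLE_INITIAL = re.compile(r"[a-z]\.")


def split_sentences(text):
    """Split on candidate boundaries unless the preceding token is an abbreviation."""
    sentences, start = [], 0
    for match in BOUNDARY.finditer(text):
        before = text[start:match.start()]
        tokens = before.split()
        last = tokens[-1].lower().lstrip("([") if tokens else ""
        if last in ABBREVIATIONS or SINGLE_INITIAL.fullmatch(last):
            continue  # "et al. (11)" or "J. Smith": not a sentence end
        sentences.append(before.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

A rule-based approach like this shows why edge cases remain: a sentence that genuinely ends in a single letter ("group A.") is treated as an initial and merged with the next one, which is the kind of miss a statistical tokenizer would handle better.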