==========================================
EMBEDDINGS INFRASTRUCTURE - QUICK REFERENCE
==========================================

EMBEDDING MODEL
---------------
Name: Universal Sentence Encoder v4
Dimensions: 512 floats
Use: Semantic similarity search
Type: Transformer-based

CHUNKING STATISTICS
-------------------
Total Chunks: 120 sentence-level
Sections: 10 different types
Avg Chunk Size: ~8-10 words

DISTRIBUTION BY SECTION:
  qualifications     25 chunks | Nested structure
  aboutRole          24 chunks | Multi-paragraph
  howDifferent       16 chunks | Paragraph list
  responsibilities   16 chunks | Bulleted list
  logistics          12 chunks | Bulleted list
  strongCandidates    9 chunks | Short items
  aboutCompany        6 chunks | Single paragraph
  comeWorkWithUs      6 chunks | Single paragraph
  notThisRole         4 chunks | Single paragraph
  compensation        2 chunks | Minimal text

STORAGE FORMATS
---------------
Primary Storage:    SQLite (embeddings.db)
  - Binary Float32Array embeddings
  - Indexed by section_type and parent_section
  - ~500KB total size

Frontend Export:    JSON (src/data/embeddings.json)
  - Text format for easy loading in browser
  - ~1.7MB size
  - Loaded at app startup
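
Sketch: packing one vector for binary storage. This is a minimal
illustration, not the project's actual code. toBlob and fromBlob are
hypothetical helper names, and it assumes each embedding is stored as
the raw bytes of a Float32Array (4 bytes per value). Under that
assumption the raw vectors take 120 x 512 x 4 = 245,760 bytes; the rest
of the ~500KB file would presumably be text, metadata, and indexes.

```typescript
// Figures taken from this reference sheet.
const DIMS = 512;
const CHUNKS = 120;

// Raw vector payload: 120 vectors x 512 floats x 4 bytes each.
const RAW_VECTOR_BYTES = CHUNKS * DIMS * Float32Array.BYTES_PER_ELEMENT;

// Hypothetical helper: copy a vector's bytes out for a BLOB column.
function toBlob(vector: Float32Array): Uint8Array {
  // The view respects byteOffset, and slice() copies so the result owns its bytes.
  return new Uint8Array(vector.buffer, vector.byteOffset, vector.byteLength).slice();
}

// Hypothetical helper: rebuild a vector from BLOB bytes.
function fromBlob(blob: Uint8Array): Float32Array {
  if (blob.byteLength % Float32Array.BYTES_PER_ELEMENT !== 0) {
    throw new Error("BLOB length is not a multiple of 4 bytes");
  }
  // Copy first so the float view starts at an aligned offset of 0.
  const copy = blob.slice();
  return new Float32Array(copy.buffer, 0, copy.byteLength / Float32Array.BYTES_PER_ELEMENT);
}
```

The same float written as JSON text usually takes far more than 4
characters, which is consistent with the JSON export being several
times larger than the binary store.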

GENERATION PIPELINE
-------------------
Input:              Job posting (src/data/jobData.ts)
                         ↓
Process:            Sentence splitting
                    Metadata enrichment
                    Universal Sentence Encoder
                         ↓
Output:             120 vectors × 512 dims
                         ↓
Storage:            embeddings.db (SQLite)
                    src/data/embeddings.json (JSON)
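
Sketch: the splitting and metadata steps. This is a rough illustration
of the first two process steps, not a copy of scripts/preprocess.ts.
The Chunk fields, splitSentences, and chunkSection are hypothetical
names, and the splitting rule is deliberately naive (it would also
break on abbreviations such as "e.g.").

```typescript
// Hypothetical chunk shape: one sentence plus where it came from.
interface Chunk {
  text: string;
  sectionType: string;
  sentenceIndex: number;
}

// Naive splitter: a run of non-terminators followed by any ., ! or ? characters.
function splitSentences(text: string): string[] {
  const pieces = text.match(/[^.!?]+[.!?]*/g) ?? [];
  return pieces.map((piece) => piece.trim()).filter((piece) => piece.length > 0);
}

// Attach section metadata to each sentence of a section's text.
function chunkSection(sectionType: string, text: string): Chunk[] {
  return splitSentences(text).map((sentence, sentenceIndex) => ({
    text: sentence,
    sectionType,
    sentenceIndex,
  }));
}

const demoChunks = chunkSection(
  "aboutRole",
  "We build tools. You will ship code! Remote OK?"
);
```

Each resulting chunk text would then be passed to the encoder to
produce one 512-dimension vector.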

SIMILARITY SEARCH
-----------------
Algorithm:          Cosine similarity
Query Processing:   
  1. Embed user text (50-100ms WebGL, 200-500ms CPU)
  2. Compute cosine with all 120 vectors (<5ms)
  3. Rank by score
  4. Return top K results

Response Range:     -1 to 1 in theory; typically 0 to 1 in practice (1.0 = identical)
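
Sketch: steps 2 to 4 as a brute-force scan. A minimal version assuming
the stored vectors are already in memory as Float32Array values.
cosineSimilarity, topK, and Scored are illustrative names and may not
match what src/utils/semanticSearch.ts actually exposes.

```typescript
interface Scored {
  index: number; // position of the chunk in the stored list
  score: number; // cosine similarity against the query
}

function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  // Guard against zero vectors instead of returning NaN.
  return denom === 0 ? 0 : dot / denom;
}

// Score every stored vector, sort descending, keep the first k.
function topK(query: Float32Array, vectors: Float32Array[], k: number): Scored[] {
  return vectors
    .map((vector, index) => ({ index, score: cosineSimilarity(query, vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

With only 120 vectors of 512 dimensions, a full scan is about 61,440
multiply-adds per query, so no approximate index is needed.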
UI Visualization:   
  >70%  = Green   (high similarity)
  50-70% = Yellow (moderate)
  <50%  = Gray    (low)
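
Sketch: mapping a score to a color band. bucketFor is a hypothetical
helper. The table above does not say which band a score of exactly 70%
or 50% belongs to; this sketch assumes both fall into Yellow.

```typescript
type Bucket = "green" | "yellow" | "gray";

// score is a cosine similarity, where 0.7 corresponds to 70%.
function bucketFor(score: number): Bucket {
  if (score > 0.7) return "green"; // >70%: high similarity
  if (score >= 0.5) return "yellow"; // 50-70%: moderate (boundaries assumed inclusive)
  return "gray"; // <50%: low, including any negative scores
}
```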

KEY FILES
---------
Generation:
  scripts/preprocess.ts           ~300 lines, TypeScript
  src/data/jobData.ts             ~140 lines, TypeScript

Storage:
  embeddings.db                   SQLite binary (~500KB)
  src/data/embeddings.json        JSON export (~1.7MB)

Frontend:
  src/utils/semanticSearch.ts     ~180 lines, SemanticSearch class
  src/components/SemanticSimilarityView.tsx  ~340 lines, React component
  src/App.tsx                     Integration point

DEPENDENCIES
------------
@tensorflow/tfjs@^4.22.0                    JS runtime
@tensorflow/tfjs-node@^4.22.0               Node.js backend
@tensorflow-models/universal-sentence-encoder@^1.3.3
better-sqlite3@^12.4.1                      SQLite access
tsx@^4.20.6                                 TS execution

EXECUTION
---------
Generate Embeddings:  npm run preprocess
                      (creates embeddings.db and embeddings.json)

Run Application:      npm run dev
                      (loads embeddings.json in browser)

View Feature:         Select "Semantic Similarity" from dropdown
                      Select text in left panel
                      Results appear in right panel

PERFORMANCE
-----------
Model Load:          ~5-10s (first time)
Query Embedding:     50-100ms (WebGL) / 200-500ms (CPU)
Similarity Calc:     <5ms
Total Query Time:    60-150ms typical (WebGL); 200-500ms+ on CPU

ARCHITECTURE
------------
                    ┌─────────────────────┐
                    │   Job Data (TS)     │
                    └──────────┬──────────┘
                               ↓
                    ┌─────────────────────┐
                    │  Preprocessing (TS) │
                    │  - Split sentences  │
                    │  - USE v4 model     │
                    │  - Store vectors    │
                    └────────┬──────┬─────┘
                             ↓      ↓
              ┌──────────────────┐  ┌─────────────────┐
              │ embeddings.db    │  │ embeddings.json │
              │ (SQLite binary)  │  │ (JSON frontend) │
              └─────────┬────────┘  └────────┬────────┘
                        └────────────┬───────┘
                                    ↓
                    ┌──────────────────────────┐
                    │  SemanticSearch Class    │
                    │  - Load embeddings       │
                    │  - Initialize TF.js      │
                    │  - Search & rank         │
                    └──────────┬───────────────┘
                               ↓
                    ┌──────────────────────────┐
                    │ SemanticSimilarityView   │
                    │ - Split panel layout     │
                    │ - Text selection detect  │
                    │ - Result visualization   │
                    └──────────────────────────┘

WHAT'S EMBEDDED
---------------
Content Type:       Sentences and sub-sentences from job posting
Granularity:        Sentence-level (not word, not document)
Scope:              Single document (120 chunks total)
Coverage:           All major job description sections
Metadata:           Section type, indices, parent sections
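
Sketch: one possible record shape. The field names below are
assumptions chosen to mirror the metadata listed above; the actual keys
in src/data/embeddings.json may differ. The guard shows how a loader
could reject records whose vector is not 512 numbers long.

```typescript
// Hypothetical record shape for one embedded chunk.
interface EmbeddingRecord {
  text: string;
  sectionType: string;
  parentSection: string | null;
  index: number;
  embedding: number[];
}

const EXPECTED_DIMS = 512;

// Runtime check for data parsed from JSON (typed as unknown).
function isEmbeddingRecord(value: unknown): value is EmbeddingRecord {
  if (typeof value !== "object" || value === null) return false;
  const r = value as Record<string, unknown>;
  return (
    typeof r.text === "string" &&
    typeof r.sectionType === "string" &&
    (r.parentSection === null || typeof r.parentSection === "string") &&
    typeof r.index === "number" &&
    Array.isArray(r.embedding) &&
    r.embedding.length === EXPECTED_DIMS &&
    r.embedding.every((x) => typeof x === "number" && Number.isFinite(x))
  );
}
```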

WHAT'S NOT EMBEDDED
-------------------
- Document-level vectors
- Cross-document similarity
- Section-level embeddings
- Hierarchical relationships
- Full-text search index
- Keyword extraction

LIMITATIONS
-----------
1. Single document only
2. Static embeddings (no real-time updates)
3. Client-side model loading (50MB+)
4. WebGL not available in all browsers (CPU fallback is slower)
5. 10-character minimum selection length
6. Fixed sentence chunking strategy
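
Sketch: the selection-length guard from item 5. shouldSearch is a
hypothetical helper, and trimming whitespace before counting is an
assumption; the real component may count raw characters.

```typescript
const MIN_SELECTION_LENGTH = 10;

// Skip the (relatively expensive) query embedding for very short selections.
function shouldSearch(selection: string): boolean {
  return selection.trim().length >= MIN_SELECTION_LENGTH;
}
```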

EXTENSIBILITY
-------------
Add More Documents:  Update jobData.ts format, modify preprocessing
Change Model:        Swap imports, adjust dimension handling
Support Updates:     Re-run preprocessing script
Offline Storage:     Implement IndexedDB + Service Worker
Real-time Queries:   Already supported via search() method
Hierarchical Search: Add parent embedding relationships
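
Sketch: one possible way to derive a parent vector. The source does not
say how parent embeddings would be built. Averaging the child sentence
vectors is only one common option, shown here as an assumption, and
meanVector is a hypothetical helper name.

```typescript
// Element-wise mean of the child vectors belonging to one parent section.
function meanVector(children: Float32Array[]): Float32Array {
  if (children.length === 0) throw new Error("no child vectors");
  const dims = children[0].length;
  const mean = new Float32Array(dims);
  for (const child of children) {
    if (child.length !== dims) throw new Error("dimension mismatch");
    for (let i = 0; i < dims; i++) {
      mean[i] += child[i] / children.length;
    }
  }
  return mean;
}
```

A parent vector built this way could be scored against the query with
the same cosine measure, then used to narrow the search to that
section's sentences.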

