flowchart TD
START([Input Source]) --> TYPE{Ingestor Type}
TYPE -->|URL| WC_START[Start BFS Crawl]
TYPE -->|API| API_START[Build HTTP Request]
TYPE -->|Docs| DOC_START[Load Local File]
WC_START --> WC_SCROLL[Inject Scroll JS\nTrigger lazy content]
WC_SCROLL --> WC_WAIT[Wait for network idle\n+ 2s buffer]
WC_WAIT --> WC_ANTI[Apply anti-bot\nsimulate user · override navigator]
WC_ANTI --> WC_FETCH{Page fetched?}
WC_FETCH -->|Yes| WC_CLEAN[Strip tracking pixels\n+ nav/footer noise]
WC_FETCH -->|No| WC_RETRY{Retries < 3?}
WC_RETRY -->|Yes| WC_BACKOFF[Wait · backoff\n5s · 10s · 15s]
WC_BACKOFF --> WC_SCROLL
WC_RETRY -->|No| DLQ[Dead Letter Queue\nlog url + error]
DLQ --> WC_BFS
WC_CLEAN --> WC_BFS{More pages\nin BFS queue?}
WC_BFS -->|Yes| WC_SCROLL
WC_BFS -->|No| WC_REORDER[Reorder files\nby URL depth]
WC_REORDER --> FM[Add Frontmatter\npath · title · type · timestamp]
API_START --> API_PAG{Pagination type?}
API_PAG -->|Page| API_PAGE[Increment page number]
API_PAG -->|Offset| API_OFFSET[Advance offset]
API_PAG -->|Cursor| API_CURSOR[Follow next cursor]
API_PAG -->|None| API_FETCH[Single fetch]
API_PAGE & API_OFFSET & API_CURSOR --> API_FETCH
API_FETCH --> API_CHECK{Data returned?}
API_CHECK -->|Yes| API_CONVERT[Convert JSON to Markdown]
API_CHECK -->|No / cap hit| FM
API_CONVERT --> API_MORE{Paginated source?}
API_MORE -->|Yes| API_PAG
API_MORE -->|No| FM
DOC_START --> DOC_TYPE{File type?}
DOC_TYPE -->|PDF| DOC_LAYOUT[Layout-aware converter\npreserves tables · headings · code]
DOC_TYPE -->|CSV · DOCX · XLSX| DOC_GENERAL[General converter]
DOC_LAYOUT --> DOC_CHECK{Success?}
DOC_CHECK -->|Yes| FM
DOC_CHECK -->|No| DOC_GENERAL
DOC_GENERAL --> FM
FM --> OUT[Markdown file\nready for chunker]
style DLQ fill:#ff6b6b,color:#fff
style OUT fill:#51cf66,color:#fff
style START fill:#339af0,color:#fff