Ingestor Flow Diagram

flowchart TD
    START([Input Source]) --> TYPE{Ingestor Type}
    TYPE -->|URL| WC_START[Start BFS Crawl]
    TYPE -->|API| API_START[Build HTTP Request]
    TYPE -->|Docs| DOC_START[Load Local File]

    WC_START --> WC_SCROLL[Inject Scroll JS\nTrigger lazy content]
    WC_SCROLL --> WC_WAIT[Wait for network idle\n+ 2s buffer]
    WC_WAIT --> WC_ANTI[Apply anti-bot\nsimulate user · override navigator]
    WC_ANTI --> WC_FETCH{Page fetched?}
    WC_FETCH -->|Yes| WC_CLEAN[Strip tracking pixels\n+ nav/footer noise]
    WC_FETCH -->|No| WC_RETRY{Retries < 3?}
    WC_RETRY -->|Yes| WC_BACKOFF[Wait · backoff\n5s · 10s · 15s]
    WC_BACKOFF --> WC_FETCH
    WC_RETRY -->|No| DLQ[Dead Letter Queue\nlog url + error]
    WC_CLEAN --> WC_BFS{More pages\nin BFS queue?}
    WC_BFS -->|Yes| WC_SCROLL
    WC_BFS -->|No| WC_REORDER[Reorder files\nby URL depth]
    WC_REORDER --> FM[Add Frontmatter\npath · title · type · timestamp]

    API_START --> API_PAG{Pagination type?}
    API_PAG -->|Page| API_PAGE[Increment page number]
    API_PAG -->|Offset| API_OFFSET[Advance offset]
    API_PAG -->|Cursor| API_CURSOR[Follow next cursor]
    API_PAG -->|None| API_FETCH[Single fetch]
    API_PAGE & API_OFFSET & API_CURSOR --> API_FETCH
    API_FETCH --> API_CHECK{Data returned?}
    API_CHECK -->|Yes| API_CONVERT[Convert JSON to Markdown]
    API_CHECK -->|No / cap hit| FM
    API_CONVERT --> API_CHECK

    DOC_START --> DOC_TYPE{File type?}
    DOC_TYPE -->|PDF| DOC_LAYOUT[Layout-aware converter\npreserves tables · headings · code]
    DOC_TYPE -->|CSV · DOCX · XLSX| DOC_GENERAL[General converter]
    DOC_LAYOUT --> DOC_CHECK{Success?}
    DOC_CHECK -->|Yes| FM
    DOC_CHECK -->|No| DOC_GENERAL
    DOC_GENERAL --> FM

    FM --> OUT[Markdown file\nready for chunker]

    style DLQ fill:#ff6b6b,color:#fff
    style OUT fill:#51cf66,color:#fff
    style START fill:#339af0,color:#fff
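
The retry branch of the crawler (up to 3 retries with 5s, 10s and 15s waits, then the Dead Letter Queue) and the shared frontmatter step can be sketched roughly as follows. This is a minimal illustration, not the actual ingestor code: the function names `fetch_with_retry` and `add_frontmatter`, the list-based dead letter queue, the injected `fetch` callable and the YAML-style frontmatter layout are all assumptions made for the example.

```python
import json
import time
from datetime import datetime, timezone

# Waits before retry 1, 2 and 3, taken from the diagram's backoff node.
BACKOFF_SECONDS = (5, 10, 15)


def fetch_with_retry(url, fetch, dead_letters, sleep=time.sleep):
    """Call fetch(url); on failure retry up to 3 times, then dead-letter.

    `fetch` is a hypothetical callable that returns page content or raises.
    `dead_letters` is a plain list standing in for the Dead Letter Queue.
    """
    retries = 0
    while True:
        try:
            return fetch(url)
        except Exception as exc:
            if retries >= len(BACKOFF_SECONDS):
                # "Retries < 3?" answered No: log url + error and give up.
                dead_letters.append({"url": url, "error": str(exc)})
                return None
            sleep(BACKOFF_SECONDS[retries])
            retries += 1


def add_frontmatter(body, path, title, source_type, now=None):
    """Prefix Markdown with path, title, type and timestamp fields.

    The YAML-style layout is an assumption; the diagram only names the fields.
    json.dumps is used to quote strings, since JSON strings are valid YAML.
    """
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    header = "\n".join([
        "---",
        f"path: {json.dumps(path)}",
        f"title: {json.dumps(title)}",
        f"type: {json.dumps(source_type)}",
        f"timestamp: {timestamp}",
        "---",
    ])
    return f"{header}\n\n{body}"
```

Passing `sleep` as a parameter keeps the backoff schedule testable without real delays. A production crawler would likely catch narrower exception types and persist dead letters somewhere durable.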