When the input HTML is completely empty, the extraction pipeline raises an AttributeError instead of returning an empty result.

Steps to reproduce

extract_web_content("", url)

Expected behavior

Should return an ExtractedWebContent with empty markdown.
I can reproduce this. The issue is in pipeline.py line 42 where soup.body is accessed without a None check.
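A minimal sketch of the kind of guard that would avoid the error, assuming the fix is simply to check soup.body for None before using it. Everything here is illustrative: the ExtractedWebContent fields, the extract_from_soup helper, and the get_text() placeholder conversion are hypothetical stand-ins, not the project's actual code, and the parsed document is simulated so the snippet runs without any third-party parser.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class ExtractedWebContent:
    """Hypothetical stand-in for the project's result type."""
    markdown: str
    url: str


def extract_from_soup(soup, url: str) -> ExtractedWebContent:
    """Hypothetical helper: return an empty result when there is no <body>."""
    body = soup.body
    if body is None:
        # Empty or body-less input: nothing to convert, so return early
        # instead of touching attributes on None.
        return ExtractedWebContent(markdown="", url=url)
    # Placeholder for the real HTML-to-markdown conversion (assumption).
    return ExtractedWebContent(markdown=body.get_text().strip(), url=url)


# Simulate a parsed empty document, whose .body is None.
empty_soup = SimpleNamespace(body=None)
result = extract_from_soup(empty_soup, "https://example.com")

# Simulate a parsed document that does have a body.
full_soup = SimpleNamespace(body=SimpleNamespace(get_text=lambda: "  hello  "))
full_result = extract_from_soup(full_soup, "https://example.com")
```

With the guard in place, the empty input yields a result with empty markdown rather than an AttributeError, which matches the expected behavior described in the report.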
Thanks for the confirmation! PR incoming.