Summary

This fixes a regression where extracted content was clipped partway through the document.

The root cause was a malformed <figure> in the source HTML.

Changes

  • Skip figure processing when the <figure> element contains unexpected content
  • Preserve the content that follows the figure instead of clipping it
  • Add regression fixture and test coverage

Testing

  • npm test

Consider removing just the image element instead of the entire anchor, to preserve any text content inside the link.
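A rough sketch of what I mean, not a drop-in patch. It assumes a simplified, hypothetical node shape (`{ type, tag, children }` for elements, `{ type, value }` for text) in place of whatever DOM the extractor actually uses, and `stripImages` and `textContent` are illustrative names only:

```javascript
// Hypothetical node shape, standing in for the real DOM:
//   element: { type: "element", tag: "a", children: [...] }
//   text:    { type: "text", value: "..." }

// Return a copy of `node` with every <img> descendant removed.
// Text nodes and all other elements are kept, so link text survives.
function stripImages(node) {
  if (node.type !== "element") return node;
  const children = node.children
    .filter((child) => !(child.type === "element" && child.tag === "img"))
    .map(stripImages);
  return { ...node, children };
}

// Concatenate the text of a node and all of its descendants.
function textContent(node) {
  if (node.type === "text") return node.value;
  return node.children.map(textContent).join("");
}
```

With this shape, an anchor that wraps an image plus a caption keeps the caption, and an anchor that wraps only an image ends up empty, so the caller can still decide to drop it.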

The early return here might skip valid figures that happen to contain extra whitespace nodes. Consider checking for actual block-level content instead.
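Something along these lines could replace the early return. This is only a sketch under the same simplified, hypothetical node shape (`{ type, tag, children }` for elements, `{ type, value }` for text); `hasBlockContent` is an illustrative name and the `BLOCK_TAGS` list is my guess at what should count as block-level here, so adjust it to the cases the extractor cares about:

```javascript
// Assumed list of tags treated as block-level content inside a <figure>.
const BLOCK_TAGS = new Set([
  "p", "div", "ul", "ol", "table", "blockquote", "pre", "section",
]);

// True if `node` has a block-level element anywhere beneath it.
// Text nodes (including whitespace-only ones) never count, so a figure
// that merely contains indentation or newlines is still treated as valid.
function hasBlockContent(node) {
  if (node.type !== "element") return false;
  return node.children.some(
    (child) =>
      child.type === "element" &&
      (BLOCK_TAGS.has(child.tag) || hasBlockContent(child))
  );
}
```

The caller would then skip the figure only when `hasBlockContent(figure)` is true, rather than whenever an unexpected child node shows up.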

Posted a follow-up commit to address the review comments.

  • Preserve linked text when stripping the image
  • Check for block-level content instead of early return