A Python library that extracts structured data (title, author, publish time, content, images) from the HTML of any news page, using text density algorithms with near-perfect accuracy.
from gne import GeneralNewsExtractor

extractor = GeneralNewsExtractor()
result = extractor.extract(html)
print(result)

# {
#     "title": "Breaking News...",
#     "author": "Jane Doe",
#     "publish_time": "2025-01-15",
#     "content": "Full article...",
#     "images": ["header.jpg"]
# }
Density-based algorithm adapts to any news site structure without custom rules or selectors.
Built on a published research paper on text and symbol density methods for web content extraction.
Extracts title, author, publish time, images, and full article content in one call.
Filter out comments, ads, and sidebars with XPath-based noise node exclusion.
Extract article listings from index pages. Provide one XPath, get every item back.
Just lxml and pyyaml. No numpy, no heavy dependencies. Fast and lean.
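To make the density idea and the noise exclusion step concrete, here is a minimal sketch using only the standard library. It is not GNE's implementation: it scores each block with a simplified text density (non-link characters divided by non-link tags), skips the symbol density term entirely, and needs well-formed markup because it parses with `xml.etree.ElementTree` rather than lxml. The helper names (`text_len`, `text_density`, `drop_noise`, `densest`) and the sample page are invented for illustration.

```python
import xml.etree.ElementTree as ET


def text_len(node):
    """Count non-whitespace characters in all text under `node`."""
    return sum(len("".join(chunk.split())) for chunk in node.itertext())


def text_density(node):
    """Simplified text density: non-link characters per non-link tag."""
    links = list(node.iter("a"))
    chars = text_len(node) - sum(text_len(a) for a in links)
    tags = sum(1 for _ in node.iter()) - len(links)
    return chars / max(tags, 1)


def drop_noise(root, noise_xpaths):
    """Remove every element matched by any XPath in `noise_xpaths`."""
    for xpath in noise_xpaths:
        # ElementTree has no parent pointers, so build a child -> parent map.
        parent_of = {child: parent for parent in root.iter() for child in parent}
        for node in root.findall(xpath):
            parent = parent_of.get(node)
            if parent is not None:
                parent.remove(node)


def densest(root):
    """Return the top-level block under <body> with the highest density."""
    return max(root.findall("./body/div"), key=text_density)


# Made-up page: a link-only nav bar, a short article, and a long comment.
PAGE = """
<html><body>
  <div id="nav"><a href="/">Home</a><a href="/world">World</a></div>
  <div id="article">
    <p>The city council voted on Tuesday to expand the tram network.</p>
    <p>Construction is expected to begin next spring.</p>
  </div>
  <div id="comments"><p>This is a very long reader comment that rambles
    on about trams, buses, taxes and everything else for quite a while.</p></div>
</body></html>
"""

root = ET.fromstring(PAGE.strip())
print(densest(root).get("id"))  # comments

# ElementTree only accepts relative paths, hence the leading "." here.
drop_noise(root, [".//div[@id='comments']"])
print(densest(root).get("id"))  # article
```

The nav bar scores zero because all of its text sits inside links. The long comment outscores the real article until it is excluded, which is the situation the noise node list is meant to handle.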
class GeneralNewsExtractor:
    def extract(
        self,
        html,
        title_xpath='',
        host='',
        author_xpath='',
        publish_time_xpath='',
        body_xpath='',
        noise_node_list=None,
        with_body_html=False,
        use_visiable_info=False
    )
{
    "title": "Article headline",
    "author": "Author name",
    "publish_time": "2025-01-15 09:30",
    "content": "Full article text...",
    "images": ["/img/photo.jpg"],
    "body_html": "<div>...</div>"
}
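When a particular site needs a hint, the optional keyword arguments in the signature above can be kept in a small per-site table and forwarded to `extract`. The sketch below shows one way to arrange that. The table, the site name, the XPath values and the `extract_for_site` helper are all hypothetical. Only the parameter names come from the signature. `EchoExtractor` is a stand-in so the example runs without GNE installed.

```python
# Hypothetical per-site overrides. The keys are parameter names from the
# extract() signature above; the site name and XPath values are invented.
SITE_OVERRIDES = {
    "news.example.com": {
        "title_xpath": "//h1[@class='headline']/text()",
        "noise_node_list": ["//div[@class='comments']"],
        "with_body_html": True,
    },
}


def extract_for_site(extractor, html, site):
    """Call extractor.extract() with the overrides registered for `site`, if any."""
    return extractor.extract(html, **SITE_OVERRIDES.get(site, {}))


class EchoExtractor:
    """Stand-in that returns the keyword arguments it received."""

    def extract(self, html, **kwargs):
        return kwargs


page = "<html></html>"
print(extract_for_site(EchoExtractor(), page, "unknown.example.org"))  # {}
print(sorted(extract_for_site(EchoExtractor(), page, "news.example.com")))
# ['noise_node_list', 'title_xpath', 'with_body_html']
```

With GNE installed, passing `GeneralNewsExtractor()` in place of the stand-in should send the same keyword arguments to the real `extract` method. Sites without an entry fall back to the fully automatic call.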
One command. No configuration required.