Open Source News Extraction

Extract news content with precision

A Python library that pulls structured data (title, author, publish time, content, images) from news page HTML using a text density algorithm, with reported accuracy close to 100%.

extract.py
from gne import GeneralNewsExtractor

# html: the raw HTML of a news article page, as a string
extractor = GeneralNewsExtractor()
result = extractor.extract(html)
print(result)

# {
#   "title": "Breaking News...",
#   "author": "Jane Doe",
#   "publish_time": "2025-01-15",
#   "content": "Full article...",
#   "images": ["header.jpg"]
# }
Accuracy: ~100%
Python: 3.8 to 3.13
License: MIT
Dependencies: 2

Built for reliable extraction at scale

Universal Extraction

Density-based algorithm adapts to any news site structure without custom rules or selectors.
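
To make the idea of text density concrete, here is a minimal standard-library sketch. It is not GNE's implementation: the `DensityParser`, `density`, and `rank_blocks` names are hypothetical, and the score is a simplified form of the idea (non-link characters divided by non-link tags), without the symbol density term. It only illustrates why a navigation block full of links scores low while a block of paragraphs scores high.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not kept on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Text inside these tags is never article content.
SKIP_TAGS = {"script", "style"}


class DensityParser(HTMLParser):
    """Collects per-element text and tag counts (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements
        self.nodes = []  # every element seen, in document order

    def handle_starttag(self, tag, attrs):
        for node in self.stack:  # the new tag counts for every open ancestor
            node["tags"] += 1
            if tag == "a":
                node["link_tags"] += 1
        node = {"tag": tag, "id": dict(attrs).get("id"),
                "text": 0, "link_text": 0, "tags": 0, "link_tags": 0}
        self.nodes.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray closing tags are ignored.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        n = len(data.strip())
        if n == 0 or any(node["tag"] in SKIP_TAGS for node in self.stack):
            return
        in_link = any(node["tag"] == "a" for node in self.stack)
        for node in self.stack:  # text counts for every open ancestor
            node["text"] += n
            if in_link:
                node["link_text"] += n


def density(node):
    """Non-link characters per non-link tag; higher suggests body text."""
    return (node["text"] - node["link_text"]) / max(node["tags"] - node["link_tags"], 1)


def rank_blocks(html, block_tags=("div", "article", "section")):
    """Return (id, density) pairs for block elements, densest first."""
    parser = DensityParser()
    parser.feed(html)
    parser.close()
    blocks = [n for n in parser.nodes if n["tag"] in block_tags]
    return sorted(((n["id"], density(n)) for n in blocks),
                  key=lambda pair: pair[1], reverse=True)


SAMPLE = """
<html><body>
<div id="nav"><a href="/">Home</a><a href="/world">World</a><a href="/sport">Sport</a></div>
<div id="article">
  <p>The city council approved the new transit plan on Tuesday after a long debate.</p>
  <p>Construction is expected to begin next spring.</p>
</div>
</body></html>
"""

ranked = rank_blocks(SAMPLE)
print(ranked[0][0])  # article
```

Because the score depends only on text and tag counts, no site-specific selector is involved, which is the property the feature above relies on.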

Academic Foundation

Built on a published research paper on text and symbol density methods for web content extraction.

Rich Metadata

Extracts title, author, publish time, images, and full article content in one call.

Noise Removal

Filter out comments, ads, and sidebars with XPath-based noise node exclusion.
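
A short sketch of how noise exclusion might be wired up, using the `noise_node_list` parameter from the method signature on this page. The XPath expressions and the `extract_clean` wrapper are hypothetical; adjust the selectors to the markup of the site you are working with. The import is done inside the function so the snippet loads even where `gne` is not installed.

```python
# Hypothetical selectors for blocks that should never count as article text.
NOISE_XPATHS = [
    '//div[@class="comment-list"]',
    '//aside[@id="sidebar"]',
]


def extract_clean(html, noise_xpaths=NOISE_XPATHS):
    """Run GNE with the given nodes excluded (requires `pip install gne`)."""
    from gne import GeneralNewsExtractor  # lazy import: only needed when called

    extractor = GeneralNewsExtractor()
    return extractor.extract(html, noise_node_list=list(noise_xpaths))
```

Calling `extract_clean(html)` should return the usual result dictionary, with the listed nodes left out of the content.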

List Page Support

Extract article listings from index pages. Provide one XPath, get every item back.

Lightweight

Just lxml and pyyaml. No numpy, no heavy dependencies. Fast and lean.

A clean API that stays out of your way

Method Signature
class GeneralNewsExtractor:
    def extract(
        self,
        html,
        title_xpath='',
        host='',
        author_xpath='',
        publish_time_xpath='',
        body_xpath='',
        noise_node_list=None,
        with_body_html=False,
        use_visiable_info=False
    )
Return Value
{
    "title":        "Article headline",
    "author":       "Author name",
    "publish_time": "2025-01-15 09:30",
    "content":      "Full article text...",
    "images":       ["/img/photo.jpg"],
    "body_html":    "<div>...</div>"
}
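
If image paths come back relative, as in the sample above, they can be resolved against the page address after extraction. This is a small post-processing sketch using only the standard library; `absolutize_images` and the example URL are hypothetical and not part of GNE.

```python
from urllib.parse import urljoin


def absolutize_images(result, page_url):
    """Return a copy of `result` with each image path resolved against `page_url`."""
    fixed = dict(result)
    fixed["images"] = [urljoin(page_url, src) for src in result.get("images", [])]
    return fixed


# A result shaped like the return value shown above.
result = {
    "title": "Article headline",
    "author": "Author name",
    "publish_time": "2025-01-15 09:30",
    "content": "Full article text...",
    "images": ["/img/photo.jpg"],
}
page_url = "https://news.example.com/2025/01/15/story.html"  # hypothetical URL

print(absolutize_images(result, page_url)["images"])
# ['https://news.example.com/img/photo.jpg']
```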
Get started now

One command. No configuration required.

$ pip install gne