BeautifulSoup Parser

Author: Stefan Behnel

BeautifulSoup is a Python package that parses broken HTML. While libxml2 (and thus lxml) can also parse broken HTML, BeautifulSoup is much more forgiving and has superior support for encoding detection.

lxml can benefit from the parsing capabilities of BeautifulSoup through the lxml.html.ElementSoup module. It provides two main functions: parse() to parse a file using BeautifulSoup, and convert_tree() to convert a BeautifulSoup tree into a list of top-level Elements.
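
As a rough sketch of the second function, the snippet below hands convert_tree() a tree that BeautifulSoup has already built. It assumes the bs4 package and its 'html.parser' backend, neither of which is named above, and the sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 is installed
from lxml.html.ElementSoup import convert_tree

# Let BeautifulSoup build its own tree first (illustrative markup).
soup = BeautifulSoup('<p>first</p><p>second</p>', 'html.parser')

# convert_tree() returns the top-level nodes as a list of lxml Elements.
elements = convert_tree(soup)
tags = [el.tag for el in elements]
texts = [el.text for el in elements]
```

Each entry in the list is an ordinary lxml Element, so it can be serialised or inserted into another tree like any other.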

Here is a document full of tag soup, similar to, but not quite like, HTML:

>>> tag_soup = '<meta><head><title>Hello</head<body onload=crash()>Hi all<p>'

All you need to do is pass it to the parse() function:

>>> from lxml.html.ElementSoup import parse
>>> from StringIO import StringIO
>>> root = parse(StringIO(tag_soup))

To see what we have here, you can serialise it:

>>> from lxml.etree import tostring
>>> print tostring(root, pretty_print=True)
<html>
  <meta/>
  <head>
    <title>Hello</title>
  </head>
  <body onload="crash()">Hi all<p/></body>
</html>

Not quite what you'd expect from an HTML page, but, well, it was broken already, right? BeautifulSoup did its best, and so now it's a tree.

To control which Element implementation is used, you can pass a makeelement factory function to parse(). By default, this is based on the HTML parser defined in lxml.html.
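
A minimal sketch of that hook, assuming the bs4 package is installed: passing etree.Element as the factory should give a plain etree root instead of the lxml.html class. The markup and variable names are illustrative only.

```python
from io import StringIO

from lxml import etree, html
from lxml.html.ElementSoup import parse

soup_text = u'<title>Hello</title><p>Hi all'  # illustrative tag soup

# Default factory: the root comes from the lxml.html parser.
html_root = parse(StringIO(soup_text))

# Custom factory (an assumption for this sketch): etree.Element
# creates the root as a plain etree element.
plain_root = parse(StringIO(soup_text), makeelement=etree.Element)

is_html_default = isinstance(html_root, html.HtmlElement)
is_html_custom = isinstance(plain_root, html.HtmlElement)
```

Any callable that accepts a tag name and an attrib keyword argument should work here, for example the makeelement method of a custom parser.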