What's new in lxml 2.0?

During the development of the lxml 1.x series, a couple of quirks were discovered in the design that made the API less obvious and its future extensions harder than necessary. lxml 2.0 is a soft evolution of lxml 1.x towards a simpler, more consistent and more powerful API - with some major extensions. Wherever possible, lxml 1.3 comes close to the semantics of lxml 2.0, so that migrating should be easier for code that currently runs with 1.3.

Changes in etree and objectify

Moving towards a more consistent API is not possible without a certain number of incompatible changes. The following list covers the differences that applications need to take into account when migrating from lxml 1.x to lxml 2.0.

Incompatible changes

  • lxml 0.9 introduced a feature called namespace implementation. The global Namespace factory was added to register custom element classes and have lxml.etree look them up automatically. However, as further class lookup mechanisms were developed, registering this mapping at a global level appeared less and less adequate. lxml 1.1 therefore removed the namespace-based lookup from the default setup, and lxml 2.0 finally removes the global namespace registry completely. Like all other lookup mechanisms, the namespace lookup is now local to a parser, including the registry itself. Applications that use a module-level parser can easily map its get_namespace() method to a global Namespace function to mimic the old behaviour.
  • XPath now raises exceptions specific to the part of the execution that failed: XPathSyntaxError for parser errors and XPathEvalError for errors that occurred during the evaluation. Note that the distinction only works for the XPath() class. The other two evaluators only have a single evaluation call that includes the parsing step, and will therefore only raise an XPathEvalError. Applications can catch both exceptions through the common base class XPathError (which also exists in earlier lxml versions).
  • Network access in parsers is now disabled by default, i.e. the no_network option defaults to True. Due to a somewhat 'interesting' implementation in libxml2, this does not affect the first document (i.e. the URL that is parsed), but only subsequent documents, such as a DTD when parsing with validation. This means that you will have to check the URL you pass, instead of relying on lxml to prevent any access to external resources. As this can be helpful in some use cases, lxml does not work around it.
  • The type annotations in lxml.objectify (the pytype attribute) now use NoneType for the None value as this is the correct Python type name. Previously, lxml 1.x used a lower case none.
  • Another change in objectify regards the way it deals with ambiguous types. Previously, setting a value like the string "3" through normal attribute access would let it come back as an integer when reading the object attribute. lxml 2.0 prevents this by always setting the pytype attribute to the type the user passed in, so "3" will come back as a string, while the number 3 will come back as a number. To remove the type annotation on serialisation, you can use the deannotate() function.
  • The C-API function findOrBuildNodeNs() was replaced by the more generic findOrBuildNodeNsPrefix().
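
The XPath exception split described in the list above can be illustrated with a short sketch. The sample document and the deliberately malformed expression are made up for illustration, and the helper function failure_kind() is hypothetical. The sketch assumes a reasonably recent lxml release in which the behaviour described above still holds.

```python
from lxml import etree

# Hypothetical sample document, only used as an evaluation context.
root = etree.XML("<root><a/></root>")

BROKEN_EXPRESSION = "//["  # deliberately malformed XPath


def failure_kind(run):
    """Call run() and report which XPath exception it raised, if any."""
    try:
        run()
    except etree.XPathSyntaxError:
        return "XPathSyntaxError"
    except etree.XPathEvalError:
        return "XPathEvalError"
    return None


# The XPath() class compiles the expression up front, so the parse
# failure is reported as a syntax error.
compiled_failure = failure_kind(lambda: etree.XPath(BROKEN_EXPRESSION))

# The xpath() method parses and evaluates in a single call, so the same
# broken expression surfaces as an evaluation error.
method_failure = failure_kind(lambda: root.xpath(BROKEN_EXPRESSION))

# Both exceptions can be caught through the common XPathError base class.
try:
    etree.XPath(BROKEN_EXPRESSION)
except etree.XPathError as exc:
    caught_via_base = type(exc).__name__
```

Code that does not care which phase failed can simply catch XPathError, as in the last few lines.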

Enhancements

Most of the enhancements of lxml 2.0 were made under the hood. Most people won't even notice them, but they make the maintenance of lxml easier and thus facilitate further enhancements and an improved integration between lxml's features.

  • lxml.objectify now has its own implementation of the E factory. It uses the built-in type lookup mechanism of lxml.objectify, thus removing the need for an additional type registry mechanism (as previously available through the typemap parameter).
  • XML entities are supported through the Entity() factory, an Entity element class and a parser option resolve_entities that keeps entities in the element tree when set to False. Also, the parser will now report undefined entities as errors if it needs to resolve them (which is still the default, as in lxml 1.x).
  • A major part of the XPath code was rewritten and can now benefit from a bigger overlap with the XSLT code. The main benefits are improved thread safety in the XPath evaluators and Python RegExp support in standard XPath.
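
As a minimal sketch of the entity support mentioned in the list above, the following parses the same document with and without entity resolution. The document text and the entity name greet are invented for illustration.

```python
from lxml import etree

# Hypothetical document that defines and uses an internal entity.
XML_SOURCE = (
    '<!DOCTYPE root [<!ENTITY greet "hello">]>'
    "<root>&greet;</root>"
)

# Default behaviour: the parser substitutes the entity's replacement text.
resolved_root = etree.fromstring(XML_SOURCE)
resolved_text = resolved_root.text

# With resolve_entities=False the reference stays in the tree as a node.
keeping_parser = etree.XMLParser(resolve_entities=False)
kept_root = etree.fromstring(XML_SOURCE, keeping_parser)
entity_node = kept_root[0]
entity_is_kept = entity_node.tag is etree.Entity
entity_name = entity_node.name

# Entity references can also be created programmatically with the factory.
built_root = etree.Element("root")
built_root.append(etree.Entity("greet"))
built_name = built_root[0].name
```

Entity nodes appear as children of the element that contains them, so code that iterates over children may want to check the tag before treating a child as a regular element.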

New modules

The most visible changes in lxml 2.0 regard the new modules that were added.

lxml.usedoctest

A very useful module for doctests based on XML or HTML is lxml.doctestcompare. It provides a relaxed comparison mechanism for XML and HTML in doctests. Using it is as simple as:

>>> import lxml.usedoctest

for XML comparisons and:

>>> import lxml.html.usedoctest

for HTML comparisons.
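
The usedoctest imports are designed to be used from inside a running doctest, which makes them hard to show in isolation. As a rough sketch of the relaxed comparison itself, the checker class in lxml.doctestcompare can be called directly. The XML snippets are invented for illustration, and calling the checker by hand like this is an assumption about how to demonstrate the mechanism, not the documented way to use the module.

```python
from lxml.doctestcompare import LXMLOutputChecker, PARSE_XML

checker = LXMLOutputChecker()

# Hypothetical expected and actual outputs of a doctest example.
WANT = '<item id="1" lang="en">text</item>'
GOT = '<item lang="en" id="1">text</item>'        # attributes reordered
DIFFERENT = '<item id="2" lang="en">text</item>'  # attribute value changed

# PARSE_XML asks the checker to parse both strings as XML and compare
# the resulting trees instead of the raw text.
same_document = checker.check_output(WANT, GOT, PARSE_XML)
changed_document = checker.check_output(WANT, DIFFERENT, PARSE_XML)
```

Here the reordered attributes are accepted as equal, while the changed attribute value is reported as a mismatch.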

lxml.html

The largest new package that was added to lxml 2.0 is lxml.html. It contains various tools and modules for HTML handling. The major features include support for cleaning up HTML (removing unwanted content), a readable HTML diff and various tools for working with links.
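
To give an idea of the link tools, here is a small sketch that rewrites relative links against a base URL and then lists them. The HTML fragment and the example.com base URL are made up for illustration.

```python
import lxml.html

# Hypothetical HTML fragment with one site-relative and one relative link.
doc = lxml.html.fromstring(
    '<div><a href="/docs">Docs</a> <img src="logo.png"></div>'
)

# Rewrite every link in the fragment against an assumed base URL.
doc.make_links_absolute("http://example.com/base/")

# iterlinks() yields (element, attribute, link, pos) tuples.
links = [link for element, attribute, link, pos in doc.iterlinks()]
```

After the call to make_links_absolute(), the tree itself is modified, so serialising doc would also show the absolute URLs.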

lxml.cssselect

Cascading Style Sheets (CSS) has a very short and generic path language for pointing at elements in XML/HTML trees (CSS selectors). The module lxml.cssselect provides an implementation based on XPath.