This module defines an xml.etree.ElementTree “tree builder” that uses the .NET System.Xml.XmlReader XML parser to parse an Aglyph XML context document.
IronPython is not able to load CPython’s xml.parsers.expat module, and so the default parser (used by ElementTree) simply does not exist. This means that, by default, Aglyph XML contexts cannot be loaded in IronPython applications.
IronPython developers who wish to use XML configuration have several options to work around this limitation:
Note
Programmatic configuration using aglyph.context.Context is fully supported in IronPython.
Builds an ElementTree using the .NET System.Xml.XmlReader XML parser.
Set validating to True to use a validating parser.
Adds more XML data to be parsed.
data is raw XML read from a stream or passed in as a string.
Note
All data across calls to this method are buffered internally; the parser itself is not actually created until the close() method is called.
Parses the XML from the internal buffer to build an element tree.
Returns: the root element of the XML document
Return type: xml.etree.ElementTree.ElementTree
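The feed()/close() contract described above can be illustrated with a small stand-in. This is only a sketch under stated assumptions: the class name BufferingTreeBuilder is hypothetical, it uses the standard-library parser rather than System.Xml.XmlReader, and it accepts byte chunks only. It is not Aglyph's implementation.

```python
import xml.etree.ElementTree as ET


class BufferingTreeBuilder(object):
    """Hypothetical stand-in that mimics the buffered feed()/close() contract."""

    def __init__(self):
        self._chunks = []

    def feed(self, data):
        # Nothing is parsed here; the chunk is only remembered.
        # Assumption: data is always a bytes object.
        self._chunks.append(data)

    def close(self):
        # The parser is created and run once, over everything fed so far.
        return ET.fromstring(b"".join(self._chunks))


builder = BufferingTreeBuilder()
builder.feed(b"<test>")        # an incomplete document is fine at this point
builder.feed(b"hello</test>")
root = builder.close()         # parsing happens only now
```

Because parsing is deferred, a malformed document is reported when close() is called, not during feed().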
IronPython does not have an encoded-bytes str type; rather, the str and unicode types are one and the same:
>>> str is unicode
True
Unfortunately, this means that IronPython cannot properly decode byte streams or byte sequences to Unicode strings using Python language facilities. Consider the simple example of a UTF-8-encoded XML file test.xml:
<?xml version="1.0" encoding="utf-8"?>
<test>façade</test>
CPython
>>> open("test.xml", "rb").read()
'<?xml version="1.0" encoding="utf-8"?>\n<test>fa\xc3\xa7ade</test>\n'
IronPython
>>> open("test.xml", "rb").read()
u'<?xml version="1.0" encoding="utf-8"?>\n<test>fa\xc3\xa7ade</test>\n'
The byte sequence C3 A7 in a UTF-8-encoded byte string represents a single Unicode code point (U+00E7 LATIN SMALL LETTER C WITH CEDILLA), while the character sequence C3 A7 in a Unicode string is two Unicode code points: U+00C3 LATIN CAPITAL LETTER A WITH TILDE followed by U+00A7 SECTION SIGN. Clearly the latter is incorrect.
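The two readings of C3 A7 can be compared directly. The sketch below is written for Python 3 (an assumption; the sessions in this document are Python 2 and IronPython 2), where decoding with ISO-8859-1 maps each byte to the code point of the same value and so reproduces what IronPython returns from open("test.xml", "rb").read().

```python
import unicodedata

raw = b"fa\xc3\xa7ade"  # the bytes as stored in the UTF-8 file

# Correct reading: C3 A7 is one UTF-8 sequence, so one code point.
decoded = raw.decode("utf-8")

# IronPython-style reading: every byte becomes its own code point.
misread = raw.decode("iso-8859-1")

decoded_names = [unicodedata.name(ch) for ch in decoded[2:3]]
misread_names = [unicodedata.name(ch) for ch in misread[2:4]]
```

Here decoded has six characters while misread has seven, and the names collected for the affected positions match the code points named in the paragraph above.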
In many cases, this difference between CPython and IronPython will be transparent. For example:
CPython
>>> "fa\xc3\xa7ade".decode("utf-8")
u'fa\xe7ade'
IronPython
>>> u"fa\xc3\xa7ade".decode("utf-8")
u'fa\xe7ade'
However, IronPython’s behavior poses a problem for Aglyph XML context parsing because the xml.etree.ElementTree.ElementTree class uses open(source, "rb") (as in the first comparison) to access the file contents when the source argument to xml.etree.ElementTree.ElementTree.parse() is a string (filename). This would cause the XML parser to return the Unicode string u"fa\xc3\xa7ade" as the value of the text node under <test>. If, for example, this were in an Aglyph <str> or <bytes> element (e.g. <str encoding="iso-8859-1">façade</str>), Aglyph would attempt (correctly) to encode the Unicode string using ISO-8859-1, which would result in an incorrect ISO-8859-1 string under IronPython:
>>> u"fa\xc3\xa7ade".encode("iso-8859-1")
u'fa\xc3\xa7ade'
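This happens because both '\xc3' and '\xa7' represent valid ISO-8859-1 characters (LATIN CAPITAL LETTER A WITH TILDE and SECTION SIGN, respectively), so the encode call succeeds without error instead of producing the single byte '\xe7' that was intended.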
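The silent mis-encoding can be reproduced outside IronPython. The following is a Python 3 sketch (an assumption, since the sessions above are Python 2 style); the variable names are illustrative only.

```python
# What the parser hands back under IronPython: two code points for one letter.
misread = "fa\u00c3\u00a7ade"
# What the document author actually wrote.
intended = "fa\u00e7ade"

# Both strings encode without error, so nothing signals the problem.
wrong_bytes = misread.encode("iso-8859-1")
right_bytes = intended.encode("iso-8859-1")
```

The wrong result is one byte longer than the right one and still contains the original UTF-8 byte pair C3 A7 where the single ISO-8859-1 byte E7 should be.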
One workaround is to use the .NET System.IO.StreamReader class instead of the Python built-in function open():
>>> from System.IO import StreamReader
>>> from System.Text import Encoding
>>> sr = StreamReader("test.xml", Encoding.UTF8)
>>> sr.ReadToEnd()
u'<?xml version="1.0" encoding="utf-8"?>\n<test>fa\xe7ade</test>\n'
Unfortunately, this requires knowledge of the file encoding prior to reading, which isn’t always possible when parsing XML. (Arguably, it should not need to be known in advance for XML parsing, since the XML declaration should convey this piece of metadata to the XML parser.)
Aglyph’s aglyph.compat.ipyetree.XmlReaderTreeBuilder takes a two-step approach to work around IronPython’s Unicode issues when parsing an Aglyph XML context document:
Step #1 is possible because, luckily, the System.Xml.XmlReader class reports XmlNodeType.XmlDeclaration.
Note
If the XML document does not specify an explicit encoding in the XML declaration, XmlReaderTreeBuilder assumes UTF-8.
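The idea of reading the encoding from the XML declaration, with a UTF-8 fallback, can be sketched in plain Python 3. Note the substitution: XmlReaderTreeBuilder obtains the declaration from System.Xml.XmlReader, whereas this sketch uses a regular expression as a stand-in. The function name declared_encoding is hypothetical, and the sketch assumes the document starts with an ASCII-compatible declaration and no byte order mark.

```python
import re

# Matches the encoding pseudo-attribute inside a leading XML declaration.
_XML_DECL_ENCODING = re.compile(
    rb"""<\?xml[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def declared_encoding(raw, default="utf-8"):
    """Return the encoding named in the XML declaration, or the default."""
    match = _XML_DECL_ENCODING.match(raw)
    if match is None:
        # No declaration, or a declaration without an explicit encoding.
        return default
    return match.group(1).decode("ascii").lower()


with_encoding = declared_encoding(
    b'<?xml version="1.0" encoding="ISO-8859-1"?><test/>')
without_encoding = declared_encoding(b'<?xml version="1.0"?><test/>')
no_declaration = declared_encoding(b"<test/>")
```

A real XML parser handles cases this pattern does not, such as UTF-16 input, which is one reason to rely on the parser's own report of the declaration.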
Step #2 works because the same “glitch” that causes IronPython’s Unicode issues can be exploited to work around it:
>>> str is unicode
True
>>> u"fa\xc3\xa7ade".decode("utf-8")
u'fa\xe7ade'
>>> "no non-ascii bytes".decode("utf-8")
'no non-ascii bytes'
Because of this, the text node string u"fa\xc3\xa7ade" can actually be decoded to u"fa\xe7ade" before being handed off to aglyph.context.XMLContext, allowing XMLContext to remain ignorant of IronPython’s Unicode issues.
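The same repair can be emulated on Python 3, where str has no decode() method. In this sketch the byte-valued code points are first mapped back to bytes via ISO-8859-1 and then decoded with the document encoding. The helper name repair_text is hypothetical and is not part of Aglyph.

```python
def repair_text(text, encoding="utf-8"):
    """Re-decode a string whose code points are really encoded bytes."""
    try:
        # Code points 0-255 map one-to-one onto bytes under ISO-8859-1.
        raw = text.encode("iso-8859-1")
    except UnicodeEncodeError:
        # A code point above 255 cannot be a raw byte; leave the text alone.
        return text
    return raw.decode(encoding)


repaired = repair_text("fa\u00c3\u00a7ade")
unchanged = repair_text("no non-ascii bytes")
```

The sketch assumes its input really is mis-read byte data. Text that was already decoded correctly (for example "fa\u00e7ade") would raise UnicodeDecodeError here, so a caller would need to apply it only to strings known to come from the byte-oriented read.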