aglyph.compat.ipyetree: an ElementTree parser for IronPython
Release: 2.1.1
This module defines an xml.etree.ElementTree.XMLParser
that
delegates to the .NET
System.Xml.XmlReader XML
parser to parse an Aglyph XML context document.
IronPython is not able to load CPython’s xml.parsers.expat module, so the default parser used by ElementTree is not available.
New in version 2.0.0: To address the missing xml.parsers.expat
module, this module now defines the CLRXMLParser
class, which replaces XmlReaderTreeBuilder
and is used by aglyph.context.XMLContext
as the default parser when running under IronPython.
Alternatively, IronPython developers may wish to install expat or an expat-compatible library as a site package. However, this has not been tested with Aglyph.
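As a rough illustration of the fallback described above, the following sketch picks a parser based on the running interpreter. The helper name `choose_parser` is hypothetical and not part of Aglyph. Per the text above, aglyph.context.XMLContext already defaults to CLRXMLParser under IronPython, so applications should not normally need anything like this.

```python
import platform
import xml.etree.ElementTree as ET


def choose_parser():
    """Return an XMLParser suited to the current interpreter.

    Hypothetical helper for illustration only; not part of Aglyph.
    """
    if platform.python_implementation() == "IronPython":
        # Only importable under IronPython, where the CLR is available.
        from aglyph.compat.ipyetree import CLRXMLParser
        return CLRXMLParser()
    # Other implementations can use the standard expat-backed parser.
    return ET.XMLParser()


parser = choose_parser()
```

The import is deferred into the IronPython branch so that the module is never loaded on interpreters that lack the .NET runtime.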
class aglyph.compat.ipyetree.CLRXMLParser(target=None, validating=False)
Bases: xml.etree.ElementTree.XMLParser
An xml.etree.ElementTree.XMLParser that delegates parsing to the .NET System.Xml.XmlReader parser.
If target is omitted, a standard TreeBuilder instance is used.
If validating is True, the System.Xml.XmlReader parser will be configured for DTD validation.
feed(data)
Add more XML data to be parsed.
Parameters: data (str) – raw XML read from a stream
Note
All data across calls to this method are buffered internally; the parser itself is not actually created until the close() method is called.
close()
Parse the XML from the internal buffer to build an element tree.
Returns: the root element of the XML document
Return type: xml.etree.ElementTree.ElementTree
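The buffering behavior of feed() and close() can be pictured with a minimal sketch. `BufferingParser` is a hypothetical stand-in, not the actual CLRXMLParser implementation: it parses with ElementTree's `fromstring` instead of System.Xml.XmlReader so that it runs on any Python interpreter.

```python
import xml.etree.ElementTree as ET


class BufferingParser(object):
    """Hypothetical stand-in that mimics the feed()/close() protocol."""

    def __init__(self):
        self._chunks = []

    def feed(self, data):
        # Nothing is parsed here; the data is only accumulated.
        self._chunks.append(data)

    def close(self):
        # Parsing happens once, over everything fed so far.
        return ET.fromstring("".join(self._chunks))


parser = BufferingParser()
parser.feed("<context id='demo'>")
parser.feed("<component id='x'/></context>")
root = parser.close()
```

Deferring all work to close() means a malformed document is only reported at the end, not at the feed() call that supplied the bad data.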
class aglyph.compat.ipyetree.XmlReaderTreeBuilder(validating=False)
Bases: aglyph.compat.ipyetree.CLRXMLParser
Build an ElementTree using the .NET System.Xml.XmlReader XML parser.
Changed in version 2.0.0: It is no longer necessary for IronPython applications to use this class explicitly. aglyph.context.XMLContext now uses CLRXMLParser by default if running under IronPython.
Deprecated since version 2.0.0: This class has been renamed to CLRXMLParser. XmlReaderTreeBuilder will be removed in release 3.0.0.
A note on IronPython Unicode issues
IronPython does not have an encoded-bytes str
type; rather, the
str
and unicode
types are one and the same:
>>> str is unicode
True
Unfortunately, this means that IronPython cannot properly decode byte streams/sequences to Unicode strings using Python language facilities. Consider the simple example of a UTF-8-encoded XML file test.xml:
<?xml version="1.0" encoding="utf-8"?>
<test>façade</test>
CPython
>>> open("test.xml", "rb").read()
'<?xml version="1.0" encoding="utf-8"?>\n<test>fa\xc3\xa7ade</test>\n'
IronPython
>>> open("test.xml", "rb").read()
u'<?xml version="1.0" encoding="utf-8"?>\n<test>fa\xc3\xa7ade</test>\n'
The byte sequence C3 A7 in a UTF-8-encoded byte string represents a single Unicode code point (U+00E7 LATIN SMALL LETTER C WITH CEDILLA), while the character sequence C3 A7 in a Unicode string represents two Unicode code points: U+00C3 LATIN CAPITAL LETTER A WITH TILDE followed by U+00A7 SECTION SIGN. Clearly the latter is incorrect.
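The two readings of C3 A7 can be reproduced on CPython with a short sketch. Decoding the bytes as Latin-1 is used here only as an assumption-laden way to imitate IronPython's one-byte-per-character result; it is not what IronPython does internally.

```python
import unicodedata

raw = b"fa\xc3\xa7ade"

# Correct reading: C3 A7 is one UTF-8 sequence for a single code point.
as_utf8 = raw.decode("utf-8")

# Imitation of the IronPython result: each byte becomes its own code point.
as_chars = raw.decode("latin-1")

utf8_names = [unicodedata.name(ch) for ch in as_utf8[2:3]]
char_names = [unicodedata.name(ch) for ch in as_chars[2:4]]
```

Here `utf8_names` holds one character name and `char_names` holds two, matching the code points listed above.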
In many cases, this difference between CPython and IronPython will be transparent. For example:
CPython
>>> "fa\xc3\xa7ade".decode("utf-8")
u'fa\xe7ade'
IronPython
>>> u"fa\xc3\xa7ade".decode("utf-8")
u'fa\xe7ade'
However, IronPython’s behavior poses a problem for Aglyph XML context parsing because the xml.etree.ElementTree.ElementTree class uses open(source, "rb") (as in the first comparison) to access the file contents when the source argument to xml.etree.ElementTree.ElementTree.parse() is a string (filename). This would cause the XML parser to return the Unicode string u"fa\xc3\xa7ade" as the value of the text node under <test>.
If, for example, this text were in an Aglyph <str> or <bytes> element (e.g. <str encoding="iso-8859-1">façade</str>), Aglyph would attempt (correctly) to encode the Unicode string using ISO-8859-1, which would result in an incorrect ISO-8859-1 string under IronPython:
>>> u"fa\xc3\xa7ade".encode("iso-8859-1")
u'fa\xc3\xa7ade'
This happens because both '\xc3' and '\xa7' represent valid ISO-8859-1 characters (LATIN CAPITAL LETTER A WITH TILDE and SECTION SIGN, respectively), so the encoder emits two bytes where the single byte '\xe7' was expected.
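To make the difference concrete, this sketch (written for Python 3, where encoding yields a distinct bytes type) encodes both the intended text and the mangled text as ISO-8859-1. The variable names are illustrative only.

```python
intended = u"fa\xe7ade"       # what the document author meant
mangled = u"fa\xc3\xa7ade"    # what the parser returns under IronPython

good_bytes = intended.encode("iso-8859-1")
bad_bytes = mangled.encode("iso-8859-1")

# Both calls succeed, so no exception signals that anything is wrong.
one_byte_longer = len(bad_bytes) == len(good_bytes) + 1
```

Because the encode step raises no error, the corruption stays silent until the value is used.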
One workaround is to use the .NET System.IO.StreamReader class instead of the Python built-in function open():
>>> from System.IO import StreamReader
>>> from System.Text import Encoding
>>> sr = StreamReader("test.xml", Encoding.UTF8)
>>> sr.ReadToEnd()
u'<?xml version="1.0" encoding="utf-8"?>\n<test>fa\xe7ade</test>\n'
Unfortunately, this requires knowledge of the file encoding prior to reading, which isn’t always possible when parsing XML. (Arguably, it should not need to be known in advance for XML parsing, since the XML declaration should convey this piece of metadata to the XML parser.)
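For comparison, the closest standard-library analogue is `io.open()` with an explicit encoding, which shares the limitation just described: the caller must supply the encoding up front. The sketch writes a temporary copy of the sample document so that it is self-contained; the temporary file handling is scaffolding, not part of the workaround.

```python
import io
import os
import tempfile

xml_bytes = b'<?xml version="1.0" encoding="utf-8"?>\n<test>fa\xc3\xa7ade</test>\n'

fd, path = tempfile.mkstemp(suffix=".xml")
try:
    with os.fdopen(fd, "wb") as handle:
        handle.write(xml_bytes)
    # The encoding is hard-coded here, which is exactly the weakness:
    # it has to be known before the XML declaration has been read.
    with io.open(path, "r", encoding="utf-8") as handle:
        text = handle.read()
finally:
    os.remove(path)
```" in text
assert u"\xc3" not in text
</test>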
Aglyph’s aglyph.compat.ipyetree.XmlReaderTreeBuilder takes a two-step approach to work around IronPython’s Unicode issues when parsing an Aglyph XML context document:
- Save the document encoding from the XML declaration.
- Use the document encoding to decode data before handing it off to aglyph.context.XMLContext.
Step #1 is possible because, luckily, the System.Xml.XmlReader class reports the XML declaration as a node of type XmlNodeType.XmlDeclaration.
Note
If the XML document does not specify an explicit encoding in the XML declaration, XmlReaderTreeBuilder assumes UTF-8.
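Step #1 and the UTF-8 default can be approximated in pure Python. This is only a sketch: the real class reads the declaration through System.Xml.XmlReader, whereas the hypothetical `declared_encoding` helper below uses a regular expression that handles common declarations but is not a full XML parser.

```python
import re

# Matches the encoding pseudo-attribute inside a leading XML declaration.
_DECL_ENCODING = re.compile(
    r'^<\?xml[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(xml_text, default="utf-8"):
    """Return the encoding named in the XML declaration, else the default.

    Hypothetical helper for illustration only; not part of Aglyph.
    """
    match = _DECL_ENCODING.match(xml_text)
    return match.group(1) if match else default


explicit = declared_encoding('<?xml version="1.0" encoding="iso-8859-1"?><a/>')
implicit = declared_encoding('<?xml version="1.0"?><a/>')
```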
Step #2 works because the same “glitch” that causes IronPython’s Unicode issues can be exploited to work around them:
>>> str is unicode
True
>>> u"fa\xc3\xa7ade".decode("utf-8")
u'fa\xe7ade'
>>> "no non-ascii bytes".decode("utf-8")
'no non-ascii bytes'
Because of this, the text node string u"fa\xc3\xa7ade" can actually be decoded to u"fa\xe7ade" before being handed off to aglyph.context.XMLContext, allowing XMLContext to remain ignorant of IronPython’s Unicode issues.
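On Python 3, where str has no decode() method, the same repair can be sketched by first mapping each character back to the byte it came from. The `repair_text` helper is hypothetical and assumes that every character in the input is in the range U+0000 to U+00FF, as is the case for text that was read one byte per character.

```python
def repair_text(text, encoding="utf-8"):
    """Re-decode text whose characters are really undecoded bytes.

    Hypothetical helper for illustration only; not part of Aglyph.
    """
    # Latin-1 maps code points 0-255 to the same byte values, so this
    # recovers the original byte sequence before decoding it properly.
    return text.encode("latin-1").decode(encoding)


fixed = repair_text(u"fa\xc3\xa7ade")
unchanged = repair_text(u"no non-ascii bytes")
```

Pure ASCII text passes through untouched, which mirrors the second IronPython example above.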