Coverage for lino/utils/html2odf.py : 85%

# -*- coding: UTF-8 -*-
# Copyright 2011-2016 Luc Saffre
# License: BSD (see file COPYING for details)
which converts an ElementTree object generated using :mod:`lino.utils.xmlgen.html` to a fragment of ODF.
.. This is part of the Lino test suite. To test it individually, run:
$ python lino/utils/html2odf.py
This is not trivial, because HTML and ODF are quite different document representations, but something like it seems necessary. Lino uses it to generate .odt documents that contain, among other things, chunks of HTML that were entered using TinyMCE and stored in database fields.
TODO: Is there really no existing library for this task? I saw approaches that call LibreOffice in headless mode to do the conversion, but this sounds inappropriate for our situation, where we must glue together fragments from different sources. Also note that we use :mod:`appy.pod` to do the actual generation.
Usage examples:
>>> from lino.utils.xmlgen.html import E
>>> def test(e):
...     print E.tostring(e)
...     print toxml(html2odf(e))

>>> test(E.p("This is a ", E.b("first"), " test."))
... #doctest: +NORMALIZE_WHITESPACE
<p>This is a <b>first</b> test.</p>
<text:p xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">This is a <text:span text:style-name="Strong Emphasis">first</text:span> test.</text:p>

>>> test(E.p(E.b("This")," is another test."))
... #doctest: +NORMALIZE_WHITESPACE
<p><b>This</b> is another test.</p>
<text:p xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"><text:span text:style-name="Strong Emphasis">This</text:span> is another test.</text:p>

>>> test(E.p(E.i("This")," is another test."))
... #doctest: +NORMALIZE_WHITESPACE
<p><i>This</i> is another test.</p>
<text:p xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"><text:span text:style-name="Emphasis">This</text:span> is another test.</text:p>

>>> test(E.td(E.p("This is another test.")))
... #doctest: +NORMALIZE_WHITESPACE
<td><p>This is another test.</p></td>
<text:p xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">This is another test.</text:p>

>>> test(E.td(E.p(E.b("This"), " is another test.")))
... #doctest: +NORMALIZE_WHITESPACE
<td><p><b>This</b> is another test.</p></td>
<text:p xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"><text:span text:style-name="Strong Emphasis">This</text:span> is another test.</text:p>

>>> test(E.ul(E.li("First item"),E.li("Second item"))) #doctest: +NORMALIZE_WHITESPACE
<ul><li>First item</li><li>Second item</li></ul>
<text:list xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0" text:style-name="podBulletedList"><text:list-item><text:p text:style-name="podBulletItem">First item</text:p></text:list-item><text:list-item><text:p text:style-name="podBulletItem">Second item</text:p></text:list-item></text:list>
N.B.: The above chunk is obviously not correct, since Writer doesn't display it. (How can I debug a generated .odt file? I mean, if my content.xml is syntactically valid but Writer ...) Idea: validate it against the ODF specification using lxml.
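As a first debugging step, short of full validation, one could open the generated file as a zip archive and list the style names that content.xml refers to. The following is a minimal sketch using only the standard library. The helper names `referenced_text_styles` and `_demo_archive` are hypothetical and not part of this module, and the demo archive contains only a content.xml, so it is not a complete .odt file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
STYLE_NAME = "{%s}style-name" % TEXT_NS


def referenced_text_styles(odt_file):
    """Return the sorted text:style-name values used in content.xml.

    `odt_file` is a path or a binary file-like object. A ParseError is
    raised if content.xml is not well-formed XML.
    """
    with zipfile.ZipFile(odt_file) as zf:
        root = ET.fromstring(zf.read("content.xml"))
    names = set()
    for node in root.iter():
        value = node.get(STYLE_NAME)
        if value is not None:
            names.add(value)
    return sorted(names)


def _demo_archive():
    """Build an in-memory zip holding only a content.xml (assumption: demo data)."""
    xml = (
        '<text:list xmlns:text="%s" text:style-name="podBulletedList">'
        '<text:list-item><text:p text:style-name="podBulletItem">First item'
        '</text:p></text:list-item></text:list>' % TEXT_NS
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("content.xml", xml)
    buf.seek(0)
    return buf


styles = referenced_text_styles(_demo_archive())
```

Comparing this list with the styles actually defined in the template is one way to rule out a missing style as the cause. It does not replace validation against the ODF schema.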
:func:`html2odf` converts bold text to a span with a style named "Strong Emphasis". That's currently a hard-coded name, and the caller must make sure that a style of that name is defined in the document.
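The text formats `<i>` and `<em>` are converted to a span with a style named "Emphasis". As with "Strong Emphasis", the caller must make sure that this style is defined in the document.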
Edge case:
>>> print toxml(html2odf("Plain string"))
<text:p xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">Plain string</text:p>

>>> print toxml(html2odf(u"Ein schöner Text"))
<text:p xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">Ein schöner Text</text:p>
Not yet supported
=================
The following is an example for :ticket:`788`. Conversion fails if a sequence of paragraph-level items is grouped using a div:
>>> test(E.div(E.p("Two numbered items:"),
...    E.ol(E.li("first"), E.li("second"))))
... #doctest: +NORMALIZE_WHITESPACE
Traceback (most recent call last):
...
IllegalText: The <text:section> element does not allow text
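Until this case is supported, a caller might sidestep the failure by unwrapping the div and converting its paragraph-level children one at a time. The following is a minimal sketch under that assumption, using plain ElementTree elements. `iter_blocks` is a hypothetical helper, not part of this module, and text placed directly inside the div (outside any child element) is ignored.

```python
import xml.etree.ElementTree as ET


def iter_blocks(e):
    """Yield paragraph-level elements, descending into <div> wrappers."""
    if e.tag == "div":
        # A div only groups other blocks, so yield its children instead.
        for child in e:
            for block in iter_blocks(child):
                yield block
    else:
        yield e


div = ET.fromstring(
    "<div><p>Two numbered items:</p><ol><li>first</li><li>second</li></ol></div>")
tags = [block.tag for block in iter_blocks(div)]
```

Each yielded element could then be passed to :func:`html2odf` separately. Whether every such element converts cleanly still depends on which tags the function supports.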
>>> test(E.raw('<ul type="disc"><li>First</li><li>Second</li></ul>'))
<ul type="disc"><li>First</li><li>Second</li></ul>
<text:list xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0" text:style-name="podBulletedList"><text:list-item><text:p text:style-name="podBulletItem">First</text:p></text:list-item><text:list-item><text:p text:style-name="podBulletItem">Second</text:p></text:list-item></text:list>
"""
"""Convert an ODF node to a string with its XML representation."""
#~ PTAGS = ('p','td','li')
"""
Convert a :mod:`lino.utils.xmlgen.html` element to an ODF text element.
Most formats are not implemented.
There's probably a better way to do this...

:ct: the root element ("container"). If not specified, we create one.
"""
#~ print "20120613 html2odf()", e.tag, e.text
#~ if e.tag in PTAGS:
    #~ oe = text.P(**ctargs)
#~ else:
    #~ oe = text.P(**ctargs)
#~ logger.info("20130201 %s",E.tostring(e))
#~ raise NotImplementedError("<%s> without container" % e.tag)
#~ oe = text.Span()
#~ oe.addText(e)
#~ yield oe
#~ ctargs = dict()
#~ oe = text.Span(stylename='Bold Text')
oe = text.Span(stylename='Strong Emphasis')
#~ oe = text.Span(stylename='Bold Text')
oe = text.Span()
oe = text.LineBreak()
"""
<text:h text:style-name="Heading_20_1" text:outline-level="1">
"""
oe = ct = text.H(stylename="Heading 1", outlinelevel=1)
oe = ct = text.H(stylename="Heading 2", outlinelevel=2)
oe = ct = text.H(stylename="Heading 3", outlinelevel=3)
return  # ignore images
#~ elif e.tag in ('ul','ol'):
    #~ oe = text.List(stylename=e.tag.upper())
    #~ ctargs = dict(stylename=e.tag.upper()+"_P")
#~ oe = ct
#~ if ct.tagName == 'p':
    #~ oe = ct
#~ else:
    #~ oe = text.P(**ctargs)
else:
    #~ logger.info("20130201 %s",E.tostring(e))
    raise NotImplementedError("<%s> inside <%s>" % (e.tag, ct.tagName))
    #~ oe = text.Span()
#~ html2odf(child,oe)
#~ for oc in html2odf(child,oe):
    #~ # oe.addElement(oc)
    #~ oe.appendChild(oc)
#~ if not True:
    #~ if e.tail:
        #~ oe.addText(e.tail)
#~ yield oe
#~ if True:
    #~ yield e.tail
    #~ yield text.Span(text=e.tail)
    #~ yield Text(e.tail)
def _test():
    import doctest
    doctest.testmod()

if __name__ == "__main__":
    _test()