# -*- coding: utf-8 -*-
# Copyright 2011-2016 Luc Saffre
# License: BSD (see file COPYING for details)

# How to test this document:
#
#  $ python setup.py test -s tests.UtilsTests.test_tidy

r"""Defines the :func:`html2xhtml` function which converts HTML to
valid XHTML.

It uses Jason Stitt's `pytidylib
<http://countergram.com/open-source/pytidylib/docs/index.html>`__
module. This module requires the `HTML Tidy library
<http://tidy.sourceforge.net/>`__ to be installed on the system::

    $ sudo aptitude install tidy

Some examples:

>>> print(html2xhtml('''\
... <p>Hello,&nbsp;world!<br>Again I say: Hello,&nbsp;world!</p>
... <img src="foo.org" alt="Foo">'''))
... #doctest: +NORMALIZE_WHITESPACE -SKIP
<p>Hello,&nbsp;world!<br />
Again I say: Hello,&nbsp;world!</p>
<img src="foo.org" alt="Foo" />

The test above is sensitive to the installed version of tidylib, whose
output can differ slightly (``alt="Foo">`` versus ``alt="Foo" >``).
It is currently enabled (``-SKIP``); change the directive to ``+SKIP``
to skip it.

>>> html = '''\
... <p style="font-family: &quot;Verdana&quot;;">Verdana</p>'''
>>> print(html2xhtml(html))
<p style="font-family: &quot;Verdana&quot;;">Verdana</p>

>>> print(html2xhtml('A &amp; B'))
A &amp; B

>>> print(html2xhtml('a &lt; b'))
a &lt; b

A ``<div>`` inside a ``<span>`` is not valid XHTML.
Neither is a ``<li>`` inside a ``<strong>``.

Such markup is converted by temporarily closing the inline tags
before a block element and reopening them after it.

>>> print(html2xhtml('<p>foo<span class="c">bar<div> oops </div>baz</span>bam</p>'))
<p>foo<span class="c">bar</span></p>
<div><span class="c">oops</span></div>
<span class="c">baz</span>bam

>>> print(html2xhtml('''<strong><ul><em><li>Foo</li></em><li>Bar</li></ul></strong>'''))
<ul>
<li><strong><em>Foo</em></strong></li>
<li><strong>Bar</strong></li>
</ul>

HTML tolerated leaving certain tags unclosed.
For example, the string ``<p>foo<p>bar<p>baz`` is converted
to ``<p>foo</p><p>bar</p><p>baz</p>``.

>>> print(html2xhtml('<p>foo<p>bar<p>baz'))
<p>foo</p>
<p>bar</p>
<p>baz</p>

"""

# from __future__ import print_function, unicode_literals

WRAP_BEFORE = """\
<html>
<head>
<title></title>
</head>
<body>
"""

try:

    from tidylib import tidy_fragment

    # http://tidy.sourceforge.net/docs/quickref.html

    def html2xhtml(html, **options):
        """Convert the HTML fragment `html` to XHTML using tidylib.

        Extra keyword arguments are passed on as Tidy options. The
        options set below always override values given by the caller.
        """
        options.update(doctype='omit')
        options.update(show_warnings=0)
        options.update(indent=0)
        # options.update(output_xml=1)
        options.update(output_xhtml=1)
        document, errors = tidy_fragment(html, options=options)
        if errors:
            #~ raise Exception(repr(errors))
            raise Exception("Errors while processing %s\n==========\n%s" %
                            (html, errors))
        # if document.startswith(WRAP_BEFORE):
        #     document = document[len(WRAP_BEFORE):]
        #     document = document[:-15]
        return document.strip()

    HAS_TIDYLIB = True

except OSError:
    # happens on readthedocs.org and Travis CI: OSError: Could not
    # load libtidy using any of these names:
    # libtidy,libtidy.so,libtidy-0.99.so.0,cygtidy-0-99-0,tidylib,
    # libtidy.dylib,tidy

    # We can simply ignore it since it is just for building the docs.
    from lino.utils.mytidylib import html2xhtml
    # TODO: emulate it well enough so that at least the test suite passes

    HAS_TIDYLIB = False
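

# Illustration only, not part of the original module: a minimal sketch of
# how html2xhtml() appears to treat its keyword options. The four settings
# it applies with options.update() are written after the caller's values,
# so they always win, while any other option is passed through unchanged.
# The helper name `_effective_tidy_options` is hypothetical and exists only
# for this example; it does not call tidylib.
def _effective_tidy_options(**options):
    """Return the options dict that html2xhtml() would hand to tidylib."""
    options.update(doctype='omit')    # forced: no DOCTYPE declaration
    options.update(show_warnings=0)   # forced: do not report warnings
    options.update(indent=0)          # forced: no indentation
    options.update(output_xhtml=1)    # forced: emit XHTML
    return options

# For example, _effective_tidy_options(indent=1, wrap=0) yields a dict in
# which indent is 0 again, while the caller's wrap=0 is kept as given.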

 

 

def _test():
    import doctest
    doctest.testmod()


if __name__ == "__main__":
    _test()