Usage¶
Converting files using the command line interface¶
Using the pydocx
command,
you can specify the output format
with the input and output files:
$ pydocx --html input.docx output.html
Converting files using the library directly¶
If you don’t want to mess around
having to create exporters,
you can use the
PyDocX.to_html
helper method:
from pydocx.pydocx import PyDocX
# Pass in a path
html = PyDocX.to_html('file.docx')
# Pass in a file object
html = PyDocX.to_html(open('file.docx', 'rb'))
# Pass in a file-like object
from cStringIO import StringIO
buf = StringIO()
with open('file.docx') as f:
buf.write(f.read())
html = PyDocX.to_html(buf)
Of course, you can do the same using the exporter class:
from pydocx.export import PyDocXHTMLExporter
# Pass in a path
exporter = PyDocXHTMLExporter('file.docx')
html = exporter.parsed
# Pass in a file object
exporter = PyDocXHTMLExporter(open('file.docx', 'rb'))
html = exporter.parsed
# Pass in a file-like object
from cStringIO import StringIO
buf = StringIO()
with open('file.docx') as f:
buf.write(f.read())
exporter = PyDocXHTMLExporter(buf)
html = exporter.parsed
Currently Supported HTML elements¶
- tables
- nested tables
- rowspans
- colspans
- lists in tables
- lists
- list styles
- nested lists
- list of tables
- list of pragraphs
- justification
- images
- styles
- bold
- italics
- underline
- hyperlinks
- headings
HTML Styles¶
The export class
pydocx.export.PyDocXHTMLExporter
relies on certain
CSS classes being defined
for certain behavior to occur.
Currently these include:
- class
pydocx-insert
-> Turns the text green. - class
pydocx-delete
-> Turns the text red and draws a line through the text. - class
pydocx-center
-> Aligns the text to the center. - class
pydocx-right
-> Aligns the text to the right. - class
pydocx-left
-> Aligns the text to the left. - class
pydocx-comment
-> Turns the text blue. - class
pydocx-underline
-> Underlines the text. - class
pydocx-caps
-> Makes all text uppercase. - class
pydocx-small-caps
-> Makes all text uppercase, however truly lowercase letters will be small than their uppercase counterparts. - class
pydocx-strike
-> Strike a line through. - class
pydocx-hidden
-> Hide the text. - class
pydocx-tab
-> Represents a tab within the document.
Exceptions¶
There is only one custom exception (MalformedDocxException
).
It is raised if either the xml
or zipfile
libraries raise an exception.
Deviations from the ECMA-376 Specification¶
Missing val attribute in underline tag¶
In the event that the
val
attribute is missing from au
(ST_Underline
type), we treat the underline as off, or none. See also http://msdn.microsoft.com/en-us/library/ff532016%28v=office.12%29.aspxIf the val attribute is not specified, Word defaults to the value defined in the style hierarchy and then to no underline.