module documentation

ZMS content extraction toolkit module

This module provides helpful functions and classes for use in Python Scripts. Import it from Python with the statement "import Products.zms.content_extraction".

Function extract_content: No summary
Function extract_content_pdfminer: Apply the pdfminer.six library to extract text from a PDF file. Pdfminer.six is a community-maintained fork of the original PDFMiner. It is a tool for extracting information from PDF documents. It focuses on getting and analyzing text data...
Function extract_content_tika: Apache Tika - a content analysis toolkit
Function extract_text_from_html: Removes HTML tags and converts HTML entities to plain text.
Variable security: Undocumented
def extract_content(context, b, content_type=''):
Parameters
context: the ZMS-context
b (bytes): the bytes
content_type: Undocumented
def extract_content_pdfminer(context, b, content_type=None):

Apply the pdfminer.six library to extract text from a PDF file. Pdfminer.six is a community-maintained fork of the original PDFMiner, a tool for extracting information from PDF documents that focuses on getting and analyzing text data. It extracts the text of a page directly from the source code of the PDF.

Install: pip install pdfminer.six

Parameters
context: the ZMS-context
b (bytes): PDF data stream
content_type (str or None): the content type
See Also
https://github.com/pdfminer/pdfminer.six
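
The following is a minimal sketch of how a PDF byte stream can be handed to pdfminer.six. It is not the source of extract_content_pdfminer. The helper name pdf_bytes_to_text is hypothetical, and the sketch assumes that pdfminer.six is installed and that its high-level extract_text function is given an in-memory file object.

```python
import io


def pdf_bytes_to_text(b):
    """Return the text of a PDF given as bytes (hypothetical helper)."""
    # Imported lazily so this snippet still loads when pdfminer.six
    # is not installed; the import fails only when the helper is called.
    from pdfminer.high_level import extract_text

    # extract_text accepts a file path or a binary file-like object,
    # so the bytes are wrapped in an in-memory stream.
    return extract_text(io.BytesIO(b))
```

A caller would read the PDF in binary mode, for example with open(path, "rb"), and pass the resulting bytes to the helper.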
def extract_content_tika(context, b, content_type=None):

Extract content with Apache Tika, a content analysis toolkit.

See Also
https://tika.apache.org/
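
This page does not describe how extract_content_tika reaches Tika. As an illustration only, the sketch below sends bytes to a separately running Tika Server through its REST endpoint, PUT /tika with the header Accept: text/plain. The helper names build_tika_request and tika_extract, and the server URL http://localhost:9998 (the Tika Server default port), are assumptions and not part of this module.

```python
import urllib.request


def build_tika_request(b, content_type=None,
                       server_url="http://localhost:9998"):
    """Build a PUT request for the /tika endpoint (hypothetical helper)."""
    # Ask the server for plain text rather than XHTML.
    headers = {"Accept": "text/plain"}
    if content_type:
        # A content type hint spares Tika from detecting the format itself.
        headers["Content-Type"] = content_type
    return urllib.request.Request(
        server_url.rstrip("/") + "/tika",
        data=b,
        headers=headers,
        method="PUT",
    )


def tika_extract(b, content_type=None, server_url="http://localhost:9998"):
    """Send the bytes to a running Tika Server and return the text."""
    request = build_tika_request(b, content_type, server_url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")
```

Splitting request construction from the network call keeps the first helper usable without a running server, for example in tests.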
def extract_text_from_html(context, html_data):

Removes HTML tags and converts HTML entities to plain text.

Parameters
context: the ZMS-context
html_data (str or bytes): HTML data stream
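
To illustrate the behaviour described for extract_text_from_html, here is a minimal sketch built only on the standard library. It is not the module's implementation. The names html_to_text and _TextCollector are hypothetical, and skipping script and style content, collapsing whitespace, and decoding bytes as UTF-8 are assumptions of this sketch.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes and skips <script> and <style> content."""

    _SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities
        # such as &amp; before handle_data is called.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Collapse the whitespace runs left behind by removed tags.
        return " ".join("".join(self._chunks).split())


def html_to_text(html_data, encoding="utf-8"):
    """Return plain text for a str or bytes HTML fragment."""
    if isinstance(html_data, bytes):
        html_data = html_data.decode(encoding, errors="replace")
    parser = _TextCollector()
    parser.feed(html_data)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()
```

For example, html_to_text("<p>Fish &amp; chips</p>") returns "Fish & chips".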
security

Undocumented