module documentation

content_extraction.py - ZMS Content Extraction Toolkit for Search Indexing

This module extracts text content from various file types, including HTML and PDF, so that the text can be used for search indexing or other text processing tasks within ZMS. It uses Apache Tika for content analysis and pdfminer.six for PDF text extraction. It also includes a helper function that extracts text from HTML content by removing tags and unescaping entities.

The main function, extract_content, takes the content as bytes together with its content type and attempts to extract text using the method appropriate for that content type.

If Tika is configured, it will be used for extraction; otherwise, pdfminer.six will be used for PDFs, and plain text extraction will be applied for certain text-based content types.
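
The fallback order described above can be sketched as a small selection function. This is an illustration only, not the module's actual code: the name choose_extractor, the tika_configured flag, and the PLAIN_TEXT_TYPES set are assumptions introduced for the example, and the real list of text-based content types may differ.

```python
from typing import Optional

# Hypothetical set of text-based content types treated as plain text.
# The content types actually handled by the module are not documented here.
PLAIN_TEXT_TYPES = {"text/plain", "text/csv", "text/xml"}


def choose_extractor(content_type: str, tika_configured: bool) -> Optional[str]:
    """Return the name of the extraction strategy for a content type.

    Mirrors the documented order: Tika first if configured, then
    pdfminer.six for PDFs, then plain text for text-based types.
    """
    if tika_configured:
        return "tika"
    if content_type == "application/pdf":
        return "pdfminer"
    if content_type in PLAIN_TEXT_TYPES:
        return "plain"
    return None  # no extraction method applies
```

Returning a strategy name instead of calling the extractor directly keeps the decision easy to inspect in isolation.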

It can be accessed from Python with the statement:

    import Products.zms.content_extraction

License: GNU General Public License v2 or later, Organization: ZMS Publishing

Function extract_content: No summary
Function extract_content_pdfminer: Apply the pdfminer.six library to extract text from a PDF file. Pdfminer.six is a community-maintained fork of the original PDFMiner. It is a tool for extracting information from PDF documents. It focuses on getting and analyzing text data...
Function extract_content_tika: Apache Tika - a content analysis toolkit
Function extract_text_from_html: Removes HTML tags and converts HTML entities to plain text.
Variable security: Security declaration for the content_extraction module.

def extract_content(context, b, content_type=''):

Parameters
    context (ZMS): the ZMS-context
    b (bytes): the bytes
    content_type: Undocumented

def extract_content_pdfminer(context, b, content_type=None):

Apply the pdfminer.six library to extract text from a PDF file. pdfminer.six is a community-maintained fork of the original PDFMiner, a tool for extracting information from PDF documents that focuses on getting and analyzing text data. It extracts the text of a page directly from the source code of the PDF. Install:

    pip install pdfminer.six

Parameters
    context: the ZMS-context
    b (bytes): PDF data stream
    content_type (str or None): the content type
See Also
    //github.com/pdfminer/pdfminer.six
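
As a rough illustration of PDF text extraction with pdfminer.six, the sketch below passes an in-memory byte stream to the library's high-level extract_text function. The helper name pdf_bytes_to_text is hypothetical, and the actual body of extract_content_pdfminer may differ.

```python
import io


def pdf_bytes_to_text(b: bytes) -> str:
    """Extract text from PDF bytes with pdfminer.six (illustrative helper)."""
    if not b:
        return ""  # nothing to parse
    # Imported lazily so this code loads even when pdfminer.six is not installed.
    from pdfminer.high_level import extract_text

    # extract_text accepts a file path or a binary file-like object.
    return extract_text(io.BytesIO(b))
```

Wrapping the bytes in io.BytesIO avoids writing a temporary file to disk.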

def extract_content_tika(context, b, content_type=None):

Extracts content using Apache Tika, a content analysis toolkit.

See Also
    //tika.apache.org/
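
One common way to use Tika is through a running Tika Server, which returns plain text for a document sent with PUT to its /tika endpoint. The sketch below assumes that setup. Whether ZMS talks to Tika this way is not stated here, and the names build_tika_request, extract_with_tika, and server_url are hypothetical.

```python
import urllib.request
from typing import Optional


def build_tika_request(server_url: str, b: bytes,
                       content_type: Optional[str] = None) -> urllib.request.Request:
    """Build a PUT request asking a Tika Server for plain-text output."""
    headers = {"Accept": "text/plain"}
    if content_type:
        # Passing the content type lets Tika skip type detection.
        headers["Content-Type"] = content_type
    return urllib.request.Request(
        server_url.rstrip("/") + "/tika", data=b, headers=headers, method="PUT"
    )


def extract_with_tika(server_url: str, b: bytes,
                      content_type: Optional[str] = None,
                      timeout: float = 30.0) -> str:
    """Send the bytes to the (assumed) Tika Server and return the text."""
    req = build_tika_request(server_url, b, content_type)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Splitting request construction from the network call makes the request easy to inspect without a running server.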

def extract_text_from_html(context, html_data):

Removes HTML tags and converts HTML entities to plain text.

Parameters
    context: the ZMS-context
    html_data (str or bytes): HTML data stream
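
The idea of removing tags and unescaping entities can be sketched with the standard library alone. This is a minimal approximation, not the module's implementation: the name strip_html is hypothetical, UTF-8 is assumed for byte input, and the regular expression is deliberately naive (it does not treat script or style blocks or comments specially).

```python
import html
import re

# Naive pattern: anything between "<" and the next ">" is treated as a tag.
_TAG_RE = re.compile(r"<[^>]+>")


def strip_html(html_data) -> str:
    """Return plain text from HTML given as str or bytes."""
    if isinstance(html_data, bytes):
        # Assumption: input bytes are UTF-8; undecodable bytes are replaced.
        html_data = html_data.decode("utf-8", errors="replace")
    text = _TAG_RE.sub(" ", html_data)  # replace tags with spaces
    text = html.unescape(text)          # convert entities such as &amp;
    return " ".join(text.split())       # collapse runs of whitespace
```

For example, strip_html("<p>Fish &amp; chips</p>") returns "Fish & chips".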

security =

Security declaration for the content_extraction module.

This ModuleSecurityInfo object declares public access permissions for functions in the Products.zms.content_extraction module, allowing authorized users to call extract_content() and extract_text_from_html() functions through Zope's AccessControl security mechanism.