content_extraction.py - ZMS Content Extraction Toolkit for Search Indexing
This module provides functions for extracting text content from various file types, including HTML and PDF. It uses Apache Tika for content analysis and pdfminer.six for PDF text extraction. The extracted text can be used for search indexing or other text processing tasks within ZMS. It also includes a helper function for extracting text from HTML content by removing tags and unescaping entities. The main function, extract_content, takes a byte stream and content type, and attempts to extract text content using the appropriate method based on the content type.
If Tika is configured, it will be used for extraction; otherwise, pdfminer.six will be used for PDFs, and plain text extraction will be applied for certain text-based content types.
It can be accessed from Python with the statement:
import Products.zms.content_extraction
License: GNU General Public License v2 or later, Organization: ZMS Publishing
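The fallback order described above (Tika if configured, otherwise pdfminer.six for PDFs, otherwise plain-text decoding for text-based types) can be sketched roughly as follows. This is an illustrative sketch only, not the module's implementation: `extract_sketch`, the `tika` and `pdf` callables, and the `PLAIN_TEXT_TYPES` tuple are hypothetical names introduced for this example.

```python
from typing import Callable, Optional

Extractor = Callable[[bytes], str]

# Hypothetical list of text-based content types; the real module may use a different one.
PLAIN_TEXT_TYPES = ("text/plain", "text/csv", "text/xml")


def extract_sketch(data: bytes, content_type: str,
                   tika: Optional[Extractor] = None,
                   pdf: Optional[Extractor] = None) -> str:
    """Pick an extraction backend in the order described above."""
    if tika is not None:
        # Tika is configured, so it is used regardless of content type.
        return tika(data)
    if content_type == "application/pdf" and pdf is not None:
        # No Tika: fall back to a PDF-specific extractor such as pdfminer.six.
        return pdf(data)
    if content_type in PLAIN_TEXT_TYPES:
        # Text-based types are decoded directly; undecodable bytes are replaced.
        return data.decode("utf-8", errors="replace")
    # Nothing suitable is available for this content type.
    return ""
```

Passing the backends in as callables keeps the sketch independent of whether Tika or pdfminer.six is actually installed.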
| Function | extract | No summary |
| Function | extract | Apply the pdfminer.six library to extract text from a PDF file. Pdfminer.six is a community-maintained fork of the original PDFMiner. It is a tool for extracting information from PDF documents. It focuses on getting and analyzing text data... |
| Function | extract | Apache Tika - a content analysis toolkit |
| Function | extract | Removes HTML tags and converts HTML entities to plain text. |
| Variable | security | Security declaration for the content_extraction module. |
Apply the pdfminer.six library to extract text from a PDF file. Pdfminer.six is a community-maintained fork of the original PDFMiner, a tool for extracting information from PDF documents that focuses on getting and analyzing text data. It extracts the text of a page directly from the source code of the PDF. Install: pip install pdfminer.six
| Parameters | |
| context | the ZMS-context |
| b:bytes | PDF data stream |
| content | the content type |
| See Also | |
| https://github.com/pdfminer/pdfminer.six | |
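A minimal sketch of PDF text extraction from a byte stream with pdfminer.six, using its high-level `extract_text` function. The helper name `pdf_bytes_to_text` and its error handling (returning `None` when the library is missing and an empty string for unreadable input) are assumptions made for this example and may differ from what the module does.

```python
import io


def pdf_bytes_to_text(b: bytes):
    """Return the text of a PDF byte stream, or None if pdfminer.six is missing."""
    try:
        # extract_text is the high-level entry point of pdfminer.six.
        from pdfminer.high_level import extract_text
    except ImportError:
        return None
    try:
        # extract_text accepts a path or a binary file-like object.
        return extract_text(io.BytesIO(b))
    except Exception:
        # Malformed or non-PDF input: return no text rather than fail.
        return ""
```

Wrapping the bytes in `io.BytesIO` avoids writing the stream to a temporary file before parsing.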
Removes HTML tags and converts HTML entities to plain text.
| Parameters | |
| context | the ZMS-context |
| html | HTML data stream |
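The two steps named above, removing tags and unescaping entities, can be illustrated with the standard library alone. This is a simplified sketch, not the module's code: `html_to_text` is a hypothetical name, the regular expression does not treat `<script>` or `<style>` contents specially, and the whitespace collapsing at the end is an addition for readability.

```python
import html
import re

# Matches anything that looks like an opening or closing tag.
_TAG_RE = re.compile(r"<[^>]+>")


def html_to_text(markup: str) -> str:
    """Strip tags, unescape entities, and collapse whitespace."""
    # Replace tags with a space so adjacent words do not run together.
    without_tags = _TAG_RE.sub(" ", markup)
    # Unescape after tag removal so escaped text like "&lt;" survives as "<".
    unescaped = html.unescape(without_tags)
    return " ".join(unescaped.split())
```

For example, `html_to_text("<p>Fish &amp; <b>chips</b></p>")` returns `"Fish & chips"`.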
Security declaration for the content_extraction module.
This ModuleSecurityInfo object declares public access permissions for functions in the Products.zms.content_extraction module, allowing authorized users to call the extract_content() and extract_text_from_html() functions through Zope's AccessControl security mechanism.
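A declaration of this kind typically looks like the sketch below. It assumes the AccessControl package is available and uses the two function names given above. The guarded import is only there so the snippet also loads outside a Zope environment and is not part of the usual pattern.

```python
try:
    from AccessControl import ModuleSecurityInfo
except ImportError:
    # Outside a Zope environment AccessControl is not installed.
    ModuleSecurityInfo = None

security = None
if ModuleSecurityInfo is not None:
    security = ModuleSecurityInfo("Products.zms.content_extraction")
    # Declare the two functions named above as public.
    security.declarePublic("extract_content", "extract_text_from_html")
```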