- html.parser.HTMLParser(_markupbase.ParserBase)
-
- AdvancedHTMLParser
-
- IndexedAdvancedHTMLParser
class AdvancedHTMLParser(html.parser.HTMLParser)

AdvancedHTMLParser - This class parses and allows searching of documents

- Method resolution order:
- AdvancedHTMLParser
- html.parser.HTMLParser
- _markupbase.ParserBase
- builtins.object
Methods defined here:
- __contains__(self, other)
- __init__(self, filename=None, encoding='utf-8')
- __init__ - Creates an Advanced HTML parser object. For read-only parsing, consider IndexedAdvancedHTMLParser for faster searching.
@param filename <str> - Optional filename to parse. Otherwise use parseFile or parseStr methods.
@param encoding <str> - Specifies the document encoding. Default utf-8
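A minimal usage sketch of the constructor together with parseStr and getElementsByTagName, assuming the package is importable as `AdvancedHTMLParser` (the module path used in the cross-references of this listing). The helper name `count_spans` and the sample markup are illustrative only. The import is guarded so that the snippet returns None when the package is not installed.

```python
try:
    from AdvancedHTMLParser import AdvancedHTMLParser
except ImportError:  # package not installed; the sketch then returns None
    AdvancedHTMLParser = None


def count_spans(html_text):
    """Parse html_text and count its <span> elements (hypothetical helper)."""
    if AdvancedHTMLParser is None:
        return None
    parser = AdvancedHTMLParser()   # no filename given, so parse a string
    parser.parseStr(html_text)
    # The result is documented as a TagCollection, which is a list of tags.
    return len(parser.getElementsByTagName('span'))


sample = '<div><span>a</span><span>b</span></div>'
result = count_spans(sample)
```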
- contains(self, em)
- Checks if #em is found anywhere within this element tree
@param em <AdvancedTag> - Tag of interest
@return <bool> - If element #em is within this tree
- containsUid(self, uid)
- Check if #uid is found anywhere within this element tree
@param uid <uuid.UUID> - Uid
@return <bool> - If #uid is found within this tree
- feed(self, contents)
- feed - Feed contents. Use parseStr or parseFile instead.
@param contents - Contents
- filter(self, **kwargs)
- filter aka filterAnd - Filter ALL the elements in this DOM.
Results must match ALL the filter criteria. For ANY, use the *Or methods
Requires the QueryableList module to be installed (i.e. AdvancedHTMLParser was installed
without '--no-deps' flag.)
For alternative without QueryableList,
consider #AdvancedHTMLParser.AdvancedHTMLParser.find method or the getElement* methods
Special Keys:
tagname - The tag name
text - The inner text
@return TagCollection<AdvancedTag>
- filterAnd = filter(self, **kwargs)
- filterOr(self, **kwargs)
- filterOr - Perform a filter operation on this node and all children (and their children, onto the end)
Results must match ANY of the filter criteria. For ALL, use the *And methods
For special filter keys, @see #AdvancedHTMLParser.AdvancedHTMLParser.filter
Requires the QueryableList module to be installed (i.e. AdvancedHTMLParser was installed
without '--no-deps' flag.)
For alternative, consider AdvancedHTMLParser.AdvancedHTMLParser.find method or the getElement* methods
@return TagCollection<AdvancedTag>
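The difference between the ALL semantics of filter/filterAnd and the ANY semantics of filterOr can be sketched with plain dicts standing in for AdvancedTag objects. This is not the QueryableList implementation; `matches_and`, `matches_or`, and the sample data are hypothetical names used only for illustration.

```python
def matches_and(attrs, **criteria):
    """True when every criterion matches (filter / filterAnd semantics)."""
    return all(attrs.get(key) == value for key, value in criteria.items())


def matches_or(attrs, **criteria):
    """True when at least one criterion matches (filterOr semantics)."""
    return any(attrs.get(key) == value for key, value in criteria.items())


# Plain dicts stand in for tags; 'tagname' mirrors the special key above.
elements = [
    {'tagname': 'span', 'name': 'blah'},
    {'tagname': 'div', 'name': 'blah'},
    {'tagname': 'span', 'name': 'other'},
]

and_hits = [e for e in elements if matches_and(e, tagname='span', name='blah')]
or_hits = [e for e in elements if matches_or(e, tagname='span', name='blah')]
```

With this data, only the first element satisfies both criteria, while all three satisfy at least one.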
- find(self, **kwargs)
- find - Perform a search of elements using attributes as keys and potential values as values
(i.e. parser.find(name='blah', tagname='span') will return all elements in this document
with the name "blah" of the tag type "span" )
Arguments are key = value, or key can equal a tuple/list of values to match ANY of those values.
Append a key with __contains to test if some strs (or several possible strs) are within an element
Append a key with __icontains to perform the same __contains op, but ignoring case
Special keys:
tagname - The tag name of the element
text - The text within an element
NOTE: Empty string means both "not set" and "no value" in this implementation.
NOTE: If you installed the QueryableList module (i.e. ran setup.py without --no-deps) it is
better to use the "filter"/"filterAnd" or "filterOr" methods, which are also available
on all tags and tag collections (tag collections also have filterAllAnd and filterAllOr)
@return TagCollection<AdvancedTag> - A list of tags that matched the filter criteria
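The matching rules described for find (exact value, tuple/list meaning ANY of the values, and the __contains / __icontains suffixes) can be sketched as follows. Plain dicts stand in for tags, and the function names are hypothetical; this illustrates the documented semantics under those assumptions, not the library's own code.

```python
def value_matches(actual, op, expected):
    """Compare one attribute value against one find() criterion."""
    # A tuple/list of expected values means "match ANY of those values".
    candidates = expected if isinstance(expected, (list, tuple)) else [expected]
    if op == 'contains':
        return any(c in actual for c in candidates)
    if op == 'icontains':
        return any(c.lower() in actual.lower() for c in candidates)
    return actual in candidates


def find(elements, **kwargs):
    """Return the elements that satisfy every key=value criterion."""
    results = []
    for attrs in elements:
        for key, expected in kwargs.items():
            name, _, op = key.partition('__')   # 'text__icontains' -> 'text', 'icontains'
            # A missing attribute is treated as '' ("not set" equals "no value").
            if not value_matches(attrs.get(name, ''), op, expected):
                break
        else:
            results.append(attrs)
    return results


elements = [
    {'tagname': 'span', 'name': 'blah', 'text': 'Hello World'},
    {'tagname': 'div', 'name': 'blah', 'text': 'bye'},
    {'tagname': 'span', 'name': 'x', 'text': 'hello again'},
]

exact = find(elements, name='blah', tagname='span')
any_tag = find(elements, tagname=('div', 'span'), text__icontains='hello')
case_sensitive = find(elements, text__contains='Hello')
```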
- getAllNodes(self)
- getAllNodes - Get every element
@return TagCollection<AdvancedTag>
- getElementById(self, _id, root='root')
- getElementById - Searches and returns the first (should only be one) element with the given ID.
@param _id <str> - A string of the id attribute.
@param root <AdvancedTag/'root'> - Search starting at a specific node, if provided. if string 'root' [default], the root of the parsed tree will be used.
- getElementsByAttr(self, attrName, attrValue, root='root')
- getElementsByAttr - Searches the full tree for elements with a given attribute name and value combination. This is always a full scan.
@param attrName <lowercase str> - A lowercase attribute name
@param attrValue <str> - Expected value of attribute
@param root <AdvancedTag/'root'> - Search starting at a specific node, if provided. if string 'root', the root of the parsed tree will be used.
- getElementsByClassName(self, className, root='root')
- getElementsByClassName - Searches and returns all elements containing a given class name.
@param className <str> - A one-word class name
@param root <AdvancedTag/'root'> - Search starting at a specific node, if provided. if string 'root' [default], the root of the parsed tree will be used.
- getElementsByName(self, name, root='root')
- getElementsByName - Searches and returns all elements with a specific name.
@param name <str> - A string of the name attribute
@param root <AdvancedTag/'root'> - Search starting at a specific node, if provided. if string 'root' [default], the root of the parsed tree will be used.
- getElementsByTagName(self, tagName, root='root')
- getElementsByTagName - Searches and returns all elements with a specific tag name.
@param tagName <lowercase str> - A lowercase string of the tag name.
@param root <AdvancedTag/'root'> - Search starting at a specific node, if provided. if string 'root', the root of the parsed tree will be used.
- getElementsCustomFilter(self, filterFunc, root='root')
- getElementsCustomFilter - Scan elements using a provided function
@param filterFunc <function>(node) - A function that takes an AdvancedTag as an argument, and returns True if some arbitrary criteria is met
@return - TagCollection of all matching elements
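A custom filter is just a predicate applied to each node during a tree walk. The sketch below uses a hypothetical `Node` class in place of AdvancedTag, and it assumes the root itself is also tested, which the description above does not state.

```python
class Node:
    """Hypothetical stand-in for AdvancedTag: a tag name, attributes, children."""

    def __init__(self, tag, attrs=None, children=()):
        self.tag = tag
        self.attrs = attrs or {}
        self.children = list(children)


def custom_filter(root, filter_func):
    """Collect, in document order, every node for which filter_func is true."""
    matches = []
    stack = [root]
    while stack:
        node = stack.pop()
        if filter_func(node):
            matches.append(node)
        # Push children reversed so the first child is visited first.
        stack.extend(reversed(node.children))
    return matches


tree = Node('div', {}, [
    Node('a', {'href': '/x'}),
    Node('p', {}, [Node('a', {})]),
])

links_with_href = custom_filter(tree, lambda node: 'href' in node.attrs)
all_links = custom_filter(tree, lambda node: node.tag == 'a')
```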
- getElementsWithAttrValues(self, attrName, attrValues, root='root')
- getElementsWithAttrValues - Returns elements whose attribute #attrName matches one of the values in the set #attrValues
@param attrName <lowercase str> - A lowercase attribute name
@param attrValues set<str> - A set of all valid values.
@return - TagCollection of all matching elements
- getFormattedHTML(self, indent=' ')
- getFormattedHTML - Get the formatted (indented) XHTML of this document
@param indent - space/tab/newline of each level of indent, or integer for how many spaces per level
@return - Formatted html as string
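The indent parameter accepts either a literal string or an integer count of spaces. A rough sketch of how such an argument could be normalized and applied per nesting level follows; `normalize_indent` and `render` are hypothetical helpers, and the (tag, children) tuples are a stand-in for the real tree.

```python
def normalize_indent(indent):
    """Turn an integer into that many spaces; pass strings through unchanged."""
    if isinstance(indent, int):
        return ' ' * indent
    return indent


def render(node, indent='  ', level=0):
    """Render a (tag, children) tuple as a list of indented lines."""
    tag, children = node
    pad = normalize_indent(indent) * level
    lines = [f'{pad}<{tag}>']
    for child in children:
        lines.extend(render(child, indent, level + 1))
    lines.append(f'{pad}</{tag}>')
    return lines


doc = ('div', [('p', [])])
with_int = render(doc, indent=2)
with_tab = render(doc, indent='\t')
```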
- getHTML(self)
- getHTML - Get the full HTML as contained within this tree
@returns - String
- getRoot(self)
- getRoot - returns the root Tag.
NOTE: if there are multiple roots, this will be a special tag.
You may want to consider using getRootNodes instead if this
is a possible situation for you.
@return AdvancedTag
- getRootNodes(self)
- getRootNodes - Gets all objects at the "root" (first level; no parent). Use this if you may have multiple roots (not children of <html>)
Use this method to get objects, for example, in an AJAX request where <html> may not be your root.
Note: If there are multiple root nodes (i.e. no <html> at the top), getRoot will return a special tag. This function automatically
handles that, and returns all root nodes.
@return list<AdvancedTag> - A list of AdvancedTags which are at the root level of the tree.
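The multiple-root situation that getRootNodes handles can be reproduced with the standard library parser this class derives from. The sketch below only records which tags open at nesting depth zero; `RootCollector` is a hypothetical name, and the depth counter assumes well-formed markup without void elements such as <br>.

```python
from html.parser import HTMLParser


class RootCollector(HTMLParser):
    """Record the tag names that appear at the top level of a fragment."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.roots = []

    def handle_starttag(self, tag, attrs):
        if self.depth == 0:       # no open parent: this is a root node
            self.roots.append(tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


collector = RootCollector()
# An AJAX-style fragment: two sibling roots and no enclosing <html>.
collector.feed('<div id="a"><span>x</span></div><div id="b"></div>')
collector.close()
root_tags = collector.roots
```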
- parseFile(self, filename)
- parseFile - Parses a file and creates the DOM tree and indexes
@param filename <str/file> - A filename string or a file object. A file object is not closed by this method; the caller must close it.
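The filename-or-file-object convention described for parseFile can be sketched like this. `read_source` is a hypothetical helper, not part of the library; it only shows the ownership rule that a file opened by the helper is closed by it, while a caller-supplied object is left open.

```python
import io


def read_source(source, encoding='utf-8'):
    """Return the text of a filename or of an already open file object."""
    if isinstance(source, str):
        # We opened the file, so we are responsible for closing it.
        with open(source, encoding=encoding) as handle:
            return handle.read()
    # The caller supplied the object and remains responsible for closing it.
    return source.read()


buf = io.StringIO('<p>hi</p>')
text = read_source(buf)
still_open = not buf.closed
buf.close()
```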
- parseStr(self, html)
- parseStr - Parses a string and creates the DOM tree and indexes.
@param html <str> - valid HTML
- setRoot(self, root)
- Sets the root node, and reprocesses the indexes
- reset(self)
- Reset this instance. Loses all unprocessed data.

class IndexedAdvancedHTMLParser(AdvancedHTMLParser)

An AdvancedHTMLParser that builds indexes for much faster searching. If you are searching or validating, this is your best bet.
If you are writing/modifying, you may use this, but be sure to call reindex() after changes.

- Method resolution order:
- IndexedAdvancedHTMLParser
- AdvancedHTMLParser
- html.parser.HTMLParser
- _markupbase.ParserBase
- builtins.object
Methods defined here:
- __init__(self, filename=None, encoding='utf-8', indexIDs=True, indexNames=True, indexClassNames=True, indexTagNames=True)
- __init__ - Creates an Advanced HTML parser object, with specific indexing settings.
For the various index* arguments, if True the index will be collected and used (if useIndex=True [default] on get* function)
@param filename <str> - Optional filename to parse. Otherwise use parseFile or parseStr methods.
@param encoding <str> - Specifies the document encoding. Default utf-8
@param indexIDs <bool> - True to create an index for getElementById method. <default True>
@param indexNames <bool> - True to create an index for getElementsByName method <default True>
@param indexClassNames <bool> - True to create an index for getElementsByClassName method. <default True>
@param indexTagNames <bool> - True to create an index for tag names. <default True>
For indexing other attributes, see the more generic addIndexOnAttribute
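The trade-off behind the index* flags and useIndex can be illustrated with a small parser built on the standard library base class. This is a sketch of the general idea (a dictionary lookup versus a full scan), not the library's implementation; `TinyIndexedParser` and its snake_case names are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TinyIndexedParser(HTMLParser):
    """Collect elements and, optionally, id and class-name indexes."""

    def __init__(self, index_ids=True, index_class_names=True):
        super().__init__()
        self.index_ids = index_ids
        self.index_class_names = index_class_names
        self.elements = []                  # (tag, attrs) in document order
        self.by_id = {}
        self.by_class = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        element = (tag, attr_map)
        self.elements.append(element)
        if self.index_ids and attr_map.get('id'):
            self.by_id.setdefault(attr_map['id'], element)   # keep the first
        if self.index_class_names:
            # Valueless attributes arrive as None, hence the "or ''".
            for name in (attr_map.get('class') or '').split():
                self.by_class[name].append(element)

    def get_by_id(self, wanted, use_index=True):
        if use_index and self.index_ids:
            return self.by_id.get(wanted)   # index only, no scan
        for element in self.elements:       # full scan
            if element[1].get('id') == wanted:
                return element
        return None


parser = TinyIndexedParser()
parser.feed('<div id="a" class="x y"><p class="x" id="b"></p></div>')
indexed = parser.get_by_id('b')
scanned = parser.get_by_id('b', use_index=False)
```

Both lookups return the same element; the indexed one avoids walking every element.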
- addIndexOnAttribute(self, attributeName)
- addIndexOnAttribute - Add an index for an arbitrary attribute. This will be used by the getElementsByAttr function.
You should do this prior to parsing, or call reindex. Otherwise it will be blank. "name" and "id" will have no effect.
@param attributeName <lowercase str> - An attribute name. Will be lowercased.
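Why an index added after parsing starts out blank, and how a reindex fills it, can be sketched with a toy attribute index. `AttrIndex` and its methods are hypothetical stand-ins that mirror the behaviour described above under that assumption.

```python
class AttrIndex:
    """Toy attribute index: attribute name -> {value: [elements]}."""

    def __init__(self):
        self.elements = []   # attribute dicts, in insertion order
        self.indexes = {}

    def add_index_on_attribute(self, name):
        # The name is lowercased; the new index starts out empty.
        self.indexes[name.lower()] = {}

    def _index_one(self, attrs):
        for name, table in self.indexes.items():
            if name in attrs:
                table.setdefault(attrs[name], []).append(attrs)

    def add_element(self, attrs):
        self.elements.append(attrs)
        self._index_one(attrs)

    def reindex(self):
        # Rebuild every index from the elements collected so far.
        self.indexes = {name: {} for name in self.indexes}
        for attrs in self.elements:
            self._index_one(attrs)


idx = AttrIndex()
idx.add_element({'data-role': 'nav'})        # "parsed" before the index exists
idx.add_index_on_attribute('Data-Role')
blank_before = idx.indexes['data-role'] == {}
idx.reindex()
hits_after = idx.indexes['data-role'].get('nav', [])
```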
- disableIndexing(self)
- disableIndexing - Disables indexing. Consider using plain AdvancedHTMLParser class.
May be useful in some scenarios where you want to parse, add a ton of elements, then index
and do a bunch of searching.
- getElementById(self, _id, root='root', useIndex=True)
- getElementById - Searches and returns the first (should only be one) element with the given ID.
@param _id <str> - A string of the id attribute.
@param root <AdvancedTag/'root'> - Search starting at a specific node, if provided. if string 'root', the root of the parsed tree will be used.
@param useIndex <bool> If useIndex is True and ids are indexed [see constructor] only the index will be used. Otherwise a full search is performed.
- getElementsByAttr(self, attrName, attrValue, root='root', useIndex=True)
- getElementsByAttr - Searches the full tree for elements with a given attribute name and value combination. If you want multiple potential values, see getElementsWithAttrValues
If you want an index on an arbitrary attribute, use the addIndexOnAttribute function.
@param attrName <lowercase str> - A lowercase attribute name
@param attrValue <str> - Expected value of attribute
@param root <AdvancedTag/'root'> - Search starting at a specific node, if provided. if string 'root', the root of the parsed tree will be used.
@param useIndex <bool> If useIndex is True and this specific attribute is indexed [see addIndexOnAttribute] only the index will be used. Otherwise a full search is performed.
- getElementsByClassName(self, className, root='root', useIndex=True)
- getElementsByClassName - Searches and returns all elements containing a given class name.
@param className <str> - A one-word class name
@param root <AdvancedTag/'root'> - Search starting at a specific node, if provided. if string 'root', the root of the parsed tree will be used.
@param useIndex <bool> If useIndex is True and class names are indexed [see constructor] only the index will be used. Otherwise a full search is performed.
- getElementsByName(self, name, root='root', useIndex=True)
- getElementsByName - Searches and returns all elements with a specific name.
@param name <str> - A string of the name attribute
@param root <AdvancedTag/'root'> - Search starting at a specific node, if provided. if string 'root', the root of the parsed tree will be used.
@param useIndex <bool> If useIndex is True and names are indexed [see constructor] only the index will be used. Otherwise a full search is performed.
- getElementsByTagName(self, tagName, root='root', useIndex=True)
- getElementsByTagName - Searches and returns all elements with a specific tag name.
@param tagName <lowercase str> - A lowercase string of the tag name.
@param root <AdvancedTag/'root'> - Search starting at a specific node, if provided. if string 'root', the root of the parsed tree will be used.
@param useIndex - If True [default] and tag names are set to be indexed [default, see constructor], only the index will be used. If False, all tags
will be searched.
- getElementsWithAttrValues(self, attrName, values, root='root', useIndex=True)
- getElementsWithAttrValues - Returns elements with an attribute matching one of several values. For a single name/value combination, see getElementsByAttr
@param attrName <lowercase str> - A lowercase attribute name
@param values set<str> - Set of expected values of attribute
@param root <AdvancedTag/'root'> - Search starting at a specific node, if provided. if string 'root', the root of the parsed tree will be used.
@param useIndex <bool> If useIndex is True and this specific attribute is indexed [see addIndexOnAttribute] only the index will be used. Otherwise a full search is performed.
- reindex(self, newIndexIDs=None, newIndexNames=None, newIndexClassNames=None, newIndexTagNames=None)
- reindex - reindex the tree. Optionally, change what fields are indexed.
@param newIndexIDs <bool/None> - None to leave same, otherwise new value to index IDs
@param newIndexNames <bool/None> - None to leave same, otherwise new value to index names
@param newIndexClassNames <bool/None> - None to leave same, otherwise new value to index class names
@param newIndexTagNames <bool/None> - None to leave same, otherwise new value to index tag names
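The "None to leave same" convention of reindex can be sketched as a small merge of settings. `merge_index_flags` and the flag names are hypothetical; the point is only that None is distinct from False, so passing False turns an index off while None keeps the current setting.

```python
def merge_index_flags(current, **changes):
    """Return updated flags; a change of None keeps the current value."""
    unknown = set(changes) - set(current)
    if unknown:
        raise KeyError(f'unknown index flags: {sorted(unknown)}')
    return {
        key: current[key] if changes.get(key) is None else changes[key]
        for key in current
    }


current = {'ids': True, 'names': True, 'class_names': True, 'tag_names': True}
merged = merge_index_flags(current, ids=None, names=False)
```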
- removeIndexOnAttribute(self, attributeName)
- removeIndexOnAttribute - Remove an attribute from indexing (for getElementsByAttr function) and remove indexed data.
@param attributeName <lowercase str> - An attribute name. Will be lowercased. "name" and "id" will have no effect.
- setRoot(self, root)
- Sets the root node, and reprocesses the indexes
@param root - AdvancedTag for root
Methods inherited from AdvancedHTMLParser:
- __contains__(self, other)
- contains(self, em)
- Checks if #em is found anywhere within this element tree
@param em <AdvancedTag> - Tag of interest
@return <bool> - If element #em is within this tree
- containsUid(self, uid)
- Check if #uid is found anywhere within this element tree
@param uid <uuid.UUID> - Uid
@return <bool> - If #uid is found within this tree
- feed(self, contents)
- feed - Feed contents. Use parseStr or parseFile instead.
@param contents - Contents
- filter(self, **kwargs)
- filter aka filterAnd - Filter ALL the elements in this DOM.
Results must match ALL the filter criteria. For ANY, use the *Or methods
Requires the QueryableList module to be installed (i.e. AdvancedHTMLParser was installed
without '--no-deps' flag.)
For alternative without QueryableList,
consider #AdvancedHTMLParser.AdvancedHTMLParser.find method or the getElement* methods
Special Keys:
tagname - The tag name
text - The inner text
@return TagCollection<AdvancedTag>
- filterAnd = filter(self, **kwargs)
- filterOr(self, **kwargs)
- filterOr - Perform a filter operation on this node and all children (and their children, onto the end)
Results must match ANY of the filter criteria. For ALL, use the *And methods
For special filter keys, @see #AdvancedHTMLParser.AdvancedHTMLParser.filter
Requires the QueryableList module to be installed (i.e. AdvancedHTMLParser was installed
without '--no-deps' flag.)
For alternative, consider AdvancedHTMLParser.AdvancedHTMLParser.find method or the getElement* methods
@return TagCollection<AdvancedTag>
- find(self, **kwargs)
- find - Perform a search of elements using attributes as keys and potential values as values
(i.e. parser.find(name='blah', tagname='span') will return all elements in this document
with the name "blah" of the tag type "span" )
Arguments are key = value, or key can equal a tuple/list of values to match ANY of those values.
Append a key with __contains to test if some strs (or several possible strs) are within an element
Append a key with __icontains to perform the same __contains op, but ignoring case
Special keys:
tagname - The tag name of the element
text - The text within an element
NOTE: Empty string means both "not set" and "no value" in this implementation.
NOTE: If you installed the QueryableList module (i.e. ran setup.py without --no-deps) it is
better to use the "filter"/"filterAnd" or "filterOr" methods, which are also available
on all tags and tag collections (tag collections also have filterAllAnd and filterAllOr)
@return TagCollection<AdvancedTag> - A list of tags that matched the filter criteria
- getAllNodes(self)
- getAllNodes - Get every element
@return TagCollection<AdvancedTag>
- getElementsCustomFilter(self, filterFunc, root='root')
- getElementsCustomFilter - Scan elements using a provided function
@param filterFunc <function>(node) - A function that takes an AdvancedTag as an argument, and returns True if some arbitrary criteria is met
@return - TagCollection of all matching elements
- getFormattedHTML(self, indent=' ')
- getFormattedHTML - Get the formatted (indented) XHTML of this document
@param indent - space/tab/newline of each level of indent, or integer for how many spaces per level
@return - Formatted html as string
- getHTML(self)
- getHTML - Get the full HTML as contained within this tree
@returns - String
- getRoot(self)
- getRoot - returns the root Tag.
NOTE: if there are multiple roots, this will be a special tag.
You may want to consider using getRootNodes instead if this
is a possible situation for you.
@return AdvancedTag
- getRootNodes(self)
- getRootNodes - Gets all objects at the "root" (first level; no parent). Use this if you may have multiple roots (not children of <html>)
Use this method to get objects, for example, in an AJAX request where <html> may not be your root.
Note: If there are multiple root nodes (i.e. no <html> at the top), getRoot will return a special tag. This function automatically
handles that, and returns all root nodes.
@return list<AdvancedTag> - A list of AdvancedTags which are at the root level of the tree.
- parseFile(self, filename)
- parseFile - Parses a file and creates the DOM tree and indexes
@param filename <str/file> - A filename string or a file object. A file object is not closed by this method; the caller must close it.
- parseStr(self, html)
- parseStr - Parses a string and creates the DOM tree and indexes.
@param html <str> - valid HTML
Methods inherited from html.parser.HTMLParser:
- check_for_whole_start_tag(self, i)
- # Internal -- check to see if we have a complete starttag; return end
# or -1 if incomplete.
- reset(self)
- Reset this instance. Loses all unprocessed data.