- AdvancedHTMLParser.Parser.AdvancedHTMLParser(html.parser.HTMLParser)
  - ValidatingAdvancedHTMLParser
- AdvancedHTMLParser.exceptions.HTMLValidationException(builtins.Exception)
  - AdvancedHTMLParser.exceptions.InvalidCloseException
  - AdvancedHTMLParser.exceptions.MissedCloseException
class InvalidCloseException(HTMLValidationException)
|
InvalidCloseException - Raised when a tag is closed that shouldn't be closed.
|
- Method resolution order:
- InvalidCloseException
- HTMLValidationException
- builtins.Exception
- builtins.BaseException
- builtins.object
Methods defined here:
- __init__(self, triedToClose, stillOpen)
- Initialize self. See help(type(self)) for accurate signature.
Data descriptors inherited from HTMLValidationException:
- __weakref__
- list of weak references to the object (if defined)
Methods inherited from builtins.Exception:
- __new__(*args, **kwargs) from builtins.type
- Create and return a new object. See help(type) for accurate signature.
Methods inherited from builtins.BaseException:
- __delattr__(self, name, /)
- Implement delattr(self, name).
- __getattribute__(self, name, /)
- Return getattr(self, name).
- __reduce__(...)
- helper for pickle
- __repr__(self, /)
- Return repr(self).
- __setattr__(self, name, value, /)
- Implement setattr(self, name, value).
- __setstate__(...)
- __str__(self, /)
- Return str(self).
- with_traceback(...)
- Exception.with_traceback(tb) --
set self.__traceback__ to tb and return self.
Data descriptors inherited from builtins.BaseException:
- __cause__
- exception cause
- __context__
- exception context
- __dict__
- __suppress_context__
- __traceback__
- args
|
class MissedCloseException(HTMLValidationException)
|
MissedCloseException - Raised when a closing tag was missed.
|
- Method resolution order:
- MissedCloseException
- HTMLValidationException
- builtins.Exception
- builtins.BaseException
- builtins.object
Methods defined here:
- __init__(self, triedToClose, stillOpen)
- Initialize self. See help(type(self)) for accurate signature.
Data descriptors inherited from HTMLValidationException:
- __weakref__
- list of weak references to the object (if defined)
Methods inherited from builtins.Exception:
- __new__(*args, **kwargs) from builtins.type
- Create and return a new object. See help(type) for accurate signature.
Methods inherited from builtins.BaseException:
- __delattr__(self, name, /)
- Implement delattr(self, name).
- __getattribute__(self, name, /)
- Return getattr(self, name).
- __reduce__(...)
- helper for pickle
- __repr__(self, /)
- Return repr(self).
- __setattr__(self, name, value, /)
- Implement setattr(self, name, value).
- __setstate__(...)
- __str__(self, /)
- Return str(self).
- with_traceback(...)
- Exception.with_traceback(tb) --
set self.__traceback__ to tb and return self.
Data descriptors inherited from builtins.BaseException:
- __cause__
- exception cause
- __context__
- exception context
- __dict__
- __suppress_context__
- __traceback__
- args
|
class ValidatingAdvancedHTMLParser(AdvancedHTMLParser.Parser.AdvancedHTMLParser)
|
ValidatingAdvancedHTMLParser - A parser that raises exceptions for certain HTML errors which would otherwise
force an assumption to be made during parsing.
exceptions.InvalidCloseException - The parsed string/file tried to close something it shouldn't have.
exceptions.MissedCloseException - The parsed string/file missed closing an item.
|
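The two error conditions described above can be illustrated with a minimal sketch built only on the standard library's html.parser.HTMLParser. This is not the library's implementation: the class names HTMLValidationError, InvalidClose, MissedClose and StrictParser, the VOID_TAGS set, and the exact rule for choosing between the two exceptions are all assumptions made for illustration. Only the (triedToClose, stillOpen) argument pair mirrors the documented __init__ signature.

```python
from html.parser import HTMLParser


class HTMLValidationError(Exception):
    """Hypothetical stand-in for a common validation base class."""


class InvalidClose(HTMLValidationError):
    """Hypothetical: a tag was closed that is not open at all."""

    def __init__(self, tried_to_close, still_open):
        self.tried_to_close = tried_to_close
        self.still_open = list(still_open)
        super().__init__("Tried to close <%s>, open tags: %r"
                         % (tried_to_close, self.still_open))


class MissedClose(HTMLValidationError):
    """Hypothetical: an outer tag was closed while an inner one is still open."""

    def __init__(self, tried_to_close, still_open):
        self.tried_to_close = tried_to_close
        self.still_open = list(still_open)
        super().__init__("Closed <%s> before closing: %r"
                         % (tried_to_close, self.still_open))


# Assumed subset of elements that never take a closing tag.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class StrictParser(HTMLParser):
    """Tracks open tags on a stack and raises instead of guessing."""

    def __init__(self):
        super().__init__()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> open and close in one step.
        pass

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag not in self.open_tags:
            raise InvalidClose(tag, self.open_tags)
        if self.open_tags[-1] != tag:
            raise MissedClose(tag, self.open_tags)
        self.open_tags.pop()


def check(html):
    """Return 'ok' or the name of the validation error that was raised."""
    parser = StrictParser()
    try:
        parser.feed(html)
        parser.close()
    except HTMLValidationError as exc:
        return type(exc).__name__
    return "ok"
```

Because both stand-in exceptions share one base class, a caller can catch either condition with a single except clause, which is the same pattern the documented hierarchy under HTMLValidationException allows.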
- Method resolution order:
- ValidatingAdvancedHTMLParser
- AdvancedHTMLParser.Parser.AdvancedHTMLParser
- html.parser.HTMLParser
- _markupbase.ParserBase
- builtins.object
Methods defined here:
- handle_endtag(self, tagName)
- Internal for parsing
Methods inherited from AdvancedHTMLParser.Parser.AdvancedHTMLParser:
- __contains__(self, other)
- __init__(self, filename=None, encoding='utf-8')
- __init__ - Creates an Advanced HTML parser object. For read-only parsing, consider IndexedAdvancedHTMLParser for faster searching.
@param filename <str> - Optional filename to parse. Otherwise use parseFile or parseStr methods.
@param encoding <str> - Specifies the document encoding. Default utf-8
- contains(self, em)
- Checks if #em is found anywhere within this element tree
@param em <AdvancedTag> - Tag of interest
@return <bool> - If element #em is within this tree
- containsUid(self, uid)
- Check if #uid is found anywhere within this element tree
@param uid <uuid.UUID> - Uid
@return <bool> - If #uid is found within this tree
- feed(self, contents)
- feed - Feed contents. Use parseStr or parseFile instead.
@param contents - Contents
- filter(self, **kwargs)
- filter aka filterAnd - Filter ALL the elements in this DOM.
Results must match ALL of the filter criteria. For ANY, use the *Or methods.
Requires the QueryableList module to be installed (i.e. AdvancedHTMLParser was installed
without the '--no-deps' flag).
For an alternative without QueryableList,
consider the #AdvancedHTMLParser.AdvancedHTMLParser.find method or the getElement* methods.
Special Keys:
tagname - The tag name
text - The inner text
@return TagCollection<AdvancedTag>
- filterAnd = filter(self, **kwargs)
- filter aka filterAnd - Filter ALL the elements in this DOM.
Results must match ALL of the filter criteria. For ANY, use the *Or methods.
Requires the QueryableList module to be installed (i.e. AdvancedHTMLParser was installed
without the '--no-deps' flag).
For an alternative without QueryableList,
consider the #AdvancedHTMLParser.AdvancedHTMLParser.find method or the getElement* methods.
Special Keys:
tagname - The tag name
text - The inner text
@return TagCollection<AdvancedTag>
- filterOr(self, **kwargs)
- filterOr - Perform a filter operation on this node and all children (and their children, and so on down the tree).
Results must match ANY of the filter criteria. For ALL, use the *And methods.
For special filter keys, @see #AdvancedHTMLParser.AdvancedHTMLParser.filter
Requires the QueryableList module to be installed (i.e. AdvancedHTMLParser was installed
without the '--no-deps' flag).
For an alternative, consider the AdvancedHTMLParser.AdvancedHTMLParser.find method or the getElement* methods.
@return TagCollection<AdvancedTag>
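The difference between the ALL semantics of filter/filterAnd and the ANY semantics of filterOr can be sketched in plain Python. This is only an illustration on dictionaries: filter_and, filter_or and the sample data are hypothetical names, and the sketch does not use QueryableList or the library's tag objects.

```python
def filter_and(items, **criteria):
    """Keep items whose fields equal ALL of the given criteria."""
    return [item for item in items
            if all(item.get(key) == value for key, value in criteria.items())]


def filter_or(items, **criteria):
    """Keep items whose fields equal ANY of the given criteria."""
    return [item for item in items
            if any(item.get(key) == value for key, value in criteria.items())]


# Hypothetical stand-ins for parsed tags.
tags = [
    {"tagname": "span", "name": "price"},
    {"tagname": "div", "name": "price"},
    {"tagname": "span", "name": "title"},
]

both = filter_and(tags, tagname="span", name="price")   # only the first entry
either = filter_or(tags, tagname="span", name="price")  # all three entries
```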
- find(self, **kwargs)
- find - Perform a search of elements using attributes as keys and potential values as values
(i.e. parser.find(name='blah', tagname='span') will return all elements in this document
with the name "blah" of the tag type "span" )
Arguments are key = value, or key can equal a tuple/list of values to match ANY of those values.
Append __contains to a key to test whether a str (or any of several possible strs) is contained in the element's value for that key
Append __icontains to a key to perform the same __contains test, but ignoring case
Special keys:
tagname - The tag name of the element
text - The text within an element
NOTE: Empty string means both "not set" and "no value" in this implementation.
NOTE: If you installed the QueryableList module (i.e. ran setup.py without --no-deps) it is
better to use the "filter"/"filterAnd" or "filterOr" methods, which are also available
on all tags and tag collections (tag collections also have filterAllAnd and filterAllOr)
@return TagCollection<AdvancedTag> - A list of tags that matched the filter criteria
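The keyword conventions described for find (exact match, a tuple/list meaning "any of these", and the __contains / __icontains suffixes) can be sketched as follows. This is a hedged illustration over plain dictionaries, not the library's code: matches, find_in and the sample elements are hypothetical, and treating a missing key as an empty string is an assumption based on the note that an empty string means both "not set" and "no value".

```python
def matches(attrs, **criteria):
    """Return True if the attribute dict satisfies every criterion."""
    for key, wanted in criteria.items():
        if key.endswith("__icontains"):
            name, mode = key[:-len("__icontains")], "icontains"
        elif key.endswith("__contains"):
            name, mode = key[:-len("__contains")], "contains"
        else:
            name, mode = key, "equals"

        # A tuple/list of values means "match ANY of these".
        options = wanted if isinstance(wanted, (list, tuple)) else [wanted]
        actual = attrs.get(name, "")  # assumption: missing == empty string

        if mode == "equals":
            ok = actual in options
        elif mode == "contains":
            ok = any(option in actual for option in options)
        else:
            ok = any(option.lower() in actual.lower() for option in options)

        if not ok:
            return False
    return True


def find_in(elements, **criteria):
    """Return the elements that satisfy all criteria, in original order."""
    return [element for element in elements if matches(element, **criteria)]


# Hypothetical stand-ins for parsed elements.
elements = [
    {"tagname": "span", "name": "blah", "text": "Hello World"},
    {"tagname": "div", "name": "blah", "text": "other"},
    {"tagname": "span", "name": "", "text": "hello"},
]
```

With this data, find_in(elements, name='blah', tagname='span') selects only the first element, while find_in(elements, text__icontains='hello') selects the first and third.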
- getAllNodes(self)
- getAllNodes - Get every element
@return TagCollection<AdvancedTag>
- getElementById(self, _id, root='root')
- getElementById - Searches and returns the first (should only be one) element with the given ID.
@param _id <str> - A string of the id attribute.
@param root <AdvancedTag/'root'> - Search starting at a specific node, if provided. If the string 'root' [default] is given, the root of the parsed tree will be used.
- getElementsByAttr(self, attrName, attrValue, root='root')
- getElementsByAttr - Searches the full tree for elements with a given attribute name and value combination. This is always a full scan.
@param attrName <lowercase str> - A lowercase attribute name
@param attrValue <str> - Expected value of attribute
@param root <AdvancedTag/'root'> - Search starting at a specific node, if provided. If the string 'root' [default] is given, the root of the parsed tree will be used.
- getElementsByClassName(self, className, root='root')
- getElementsByClassName - Searches and returns all elements containing a given class name.
@param className <str> - A one-word class name
@param root <AdvancedTag/'root'> - Search starting at a specific node, if provided. If the string 'root' [default] is given, the root of the parsed tree will be used.
- getElementsByName(self, name, root='root')
- getElementsByName - Searches and returns all elements with a specific name.
@param name <str> - A string of the name attribute
@param root <AdvancedTag/'root'> - Search starting at a specific node, if provided. If the string 'root' [default] is given, the root of the parsed tree will be used.
- getElementsByTagName(self, tagName, root='root')
- getElementsByTagName - Searches and returns all elements with a specific tag name.
@param tagName <lowercase str> - A lowercase string of the tag name.
@param root <AdvancedTag/'root'> - Search starting at a specific node, if provided. If the string 'root' [default] is given, the root of the parsed tree will be used.
- getElementsCustomFilter(self, filterFunc, root='root')
- getElementsCustomFilter - Scan elements using a provided function
@param filterFunc <function>(node) - A function that takes an AdvancedTag as an argument and returns True if some arbitrary criteria are met
@return - TagCollection of all matching elements
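The idea of scanning a tree with a caller-supplied predicate can be sketched as below. The Node class and custom_filter function are hypothetical stand-ins, not the library's AdvancedTag or its traversal; whether the real method also tests the starting node, and in which order it returns matches, is not stated here and is assumed for the sketch.

```python
class Node:
    """Hypothetical minimal element: a tag name, attributes and children."""

    def __init__(self, tag, children=(), **attrs):
        self.tag = tag
        self.attrs = attrs
        self.children = list(children)


def custom_filter(root, filter_func):
    """Return every node (root included) for which filter_func is true."""
    found = []
    stack = [root]
    while stack:
        node = stack.pop()
        if filter_func(node):
            found.append(node)
        # Push children reversed so they are visited in document order.
        stack.extend(reversed(node.children))
    return found


tree = Node("div", [
    Node("p", id="intro"),
    Node("span", [Node("p", id="nested")]),
])

paragraphs = custom_filter(tree, lambda node: node.tag == "p")
paragraph_ids = [node.attrs["id"] for node in paragraphs]
```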
- getElementsWithAttrValues(self, attrName, attrValues, root='root')
- getElementsWithAttrValues - Returns elements whose attribute, named by #attrName, has one of the values in the set #attrValues
@param attrName <lowercase str> - A lowercase attribute name
@param attrValues set<str> - A set of all valid values.
@return - TagCollection of all matching elements
- getFormattedHTML(self, indent=' ')
- getFormattedHTML - Get the formatted XHTML of this document
@param indent - space/tab/newline of each level of indent, or integer for how many spaces per level
@return - Formatted html as string
- getHTML(self)
- getHTML - Get the full HTML as contained within this tree
@returns - String
- getRoot(self)
- getRoot - returns the root Tag.
NOTE: if there are multiple roots, this will be a special tag.
You may want to consider using getRootNodes instead if this
is a possible situation for you.
@return AdvancedTag
- getRootNodes(self)
- getRootNodes - Gets all objects at the "root" (first level; no parent). Use this if you may have multiple roots (not children of <html>)
Use this method to get objects, for example, in an AJAX request where <html> may not be your root.
Note: If there are multiple root nodes (i.e. no <html> at the top), getRoot will return a special tag. This function automatically
handles that, and returns all root nodes.
@return list<AdvancedTag> - A list of AdvancedTags which are at the root level of the tree.
- handle_charref(self, charRef)
- Internal for parsing
- handle_comment(self, comment)
- Internal for parsing
- handle_data(self, data)
- Internal for parsing
- handle_decl(self, decl)
- Internal for parsing
- handle_entityref(self, entity)
- Internal for parsing
- handle_startendtag(self, tagName, attributeList)
- Internal for parsing
- handle_starttag(self, tagName, attributeList, isSelfClosing=False)
- Internal for parsing
- parseFile(self, filename)
- parseFile - Parses a file and creates the DOM tree and indexes
@param filename <str/file> - A filename string or a file object. If a file object is given, it will not be closed; you must close it yourself.
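The "filename or file object" convention described for parseFile is a common pattern: a file opened by the function is closed by the function, while a file object passed in by the caller is left open. A minimal sketch, assuming a hypothetical helper named read_source that is not part of the library:

```python
import io


def read_source(source, encoding="utf-8"):
    """Read text from a filename or from an already-open file object."""
    if isinstance(source, str):
        # We opened it, so we close it.
        with open(source, encoding=encoding) as handle:
            return handle.read()
    # The caller opened it, so the caller remains responsible for closing it.
    return source.read()


stream = io.StringIO("<p>hello</p>")
contents = read_source(stream)
still_open = not stream.closed
stream.close()
```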
- parseStr(self, html)
- parseStr - Parses a string and creates the DOM tree and indexes.
@param html <str> - valid HTML
- setRoot(self, root)
- Sets the root node, and reprocesses the indexes
- unknown_decl(self, decl)
- Internal for parsing
Methods inherited from html.parser.HTMLParser:
- check_for_whole_start_tag(self, i)
- # Internal -- check to see if we have a complete starttag; return end
# or -1 if incomplete.
- clear_cdata_mode(self)
- close(self)
- Handle any buffered data.
- get_starttag_text(self)
- Return full source of start tag: '<...>'.
- goahead(self, end)
- # Internal -- handle data as far as reasonable. May leave state
# and data to be processed by a subsequent call. If 'end' is
# true, force handling all data as if followed by EOF marker.
- handle_pi(self, data)
- # Overridable -- handle processing instruction
- parse_bogus_comment(self, i, report=1)
- # Internal -- parse bogus comment, return length or -1 if not terminated
# see http://www.w3.org/TR/html5/tokenization.html#bogus-comment-state
- parse_endtag(self, i)
- # Internal -- parse endtag, return end or -1 if incomplete
- parse_html_declaration(self, i)
- # Internal -- parse html declarations, return length or -1 if not terminated
# See w3.org/TR/html5/tokenization.html#markup-declaration-open-state
# See also parse_declaration in _markupbase
- parse_pi(self, i)
- # Internal -- parse processing instr, return end or -1 if not terminated
- parse_starttag(self, i)
- # Internal -- handle starttag, return end or -1 if not terminated
- reset(self)
- Reset this instance. Loses all unprocessed data.
- set_cdata_mode(self, elem)
- unescape(self, s)
- # Internal -- helper to remove special character quoting
Data and other attributes inherited from html.parser.HTMLParser:
- CDATA_CONTENT_ELEMENTS = ('script', 'style')
Methods inherited from _markupbase.ParserBase:
- error(self, message)
- getpos(self)
- Return current line number and offset.
- parse_comment(self, i, report=1)
- # Internal -- parse comment, return length or -1 if not terminated
- parse_declaration(self, i)
- # Internal -- parse declaration (for use by subclasses).
- parse_marked_section(self, i, report=1)
- # Internal -- parse a marked section
# Override this to handle MS-word extension syntax <![if word]>content<![endif]>
- updatepos(self, i, j)
- # Internal -- update line number and offset. This should be
# called for each piece of data exactly once, in order -- in other
# words the concatenation of all the input strings to this
# function should be exactly the entire input.
Data descriptors inherited from _markupbase.ParserBase:
- __dict__
- dictionary for instance variables (if defined)
- __weakref__
- list of weak references to the object (if defined)