- builtins.Exception(builtins.BaseException)
  - AdvancedHTMLParser.exceptions.HTMLValidationException
    - AdvancedHTMLParser.exceptions.InvalidCloseException
    - AdvancedHTMLParser.exceptions.MissedCloseException
- builtins.object
  - AdvancedHTMLParser.Tags.AdvancedTag
- html.parser.HTMLParser(_markupbase.ParserBase)
  - AdvancedHTMLParser.Formatter.AdvancedHTMLFormatter
  - AdvancedHTMLParser.Parser.AdvancedHTMLParser
    - AdvancedHTMLParser.Parser.IndexedAdvancedHTMLParser
class AdvancedHTMLFormatter(html.parser.HTMLParser) |
|
A formatter for HTML. Note that this does not understand CSS, so text that is preformatted only through CSS rules will not be preserved.
It does, however, understand "pre", "code" and "script" tags and will not try to format their contents. |
|
- Method resolution order:
- AdvancedHTMLFormatter
- html.parser.HTMLParser
- _markupbase.ParserBase
- builtins.object
Methods defined here:
- __init__(self, indent=' ', encoding='utf-8')
- Create a formatter.
@param indent - Either a space/tab/newline that represents one level of indent, or an integer to use that number of spaces
@param encoding - Use this encoding for the document.
- feed(self, contents)
- feed - Load contents
@param contents - HTML contents
- getHTML(self)
- getHTML - Get the full HTML as contained within this tree, converted to valid XHTML
@returns - String
- getRoot(self)
- getRoot - returns the root Tag
@return - AdvancedTag at root. If you provided multiple root nodes, this will be a "holder" with tagName value as constants.INVISIBLE_ROOT_TAG
- getRootNodes(self)
- getRootNodes - Gets all objects at the "root" (first level; no parent). Use this if you may have multiple roots (not children of <html>)
Use this method to get objects, for example, in an AJAX request where <html> may not be your root.
Note: If there are multiple root nodes (i.e. no <html> at the top), getRoot will return a special tag. This function automatically
handles that, and returns all root nodes.
@return list<AdvancedTag> - A list of AdvancedTags which are at the root level of the tree.
- handle_charref(self, charRef)
- Internal for parsing
- handle_comment(self, comment)
- Internal for parsing
- handle_data(self, data)
- handle_data - Internal for parsing
- handle_decl(self, decl)
- Internal for parsing
- handle_endtag(self, tagName)
- handle_endtag - Internal for parsing
- handle_entityref(self, entity)
- Internal for parsing
- handle_startendtag(self, tagName, attributeList)
- handle_startendtag - Internal for parsing
- handle_starttag(self, tagName, attributeList, isSelfClosing=False)
- handle_starttag - Internal for parsing
- parseFile(self, filename)
- parseFile - Parses a file and creates the DOM tree and indexes
@param filename <str/file> - A filename string or a file object. A file object will not be closed; you must close it yourself.
- parseStr(self, html)
- parseStr - Parses a string and creates the DOM tree and indexes.
@param html <str> - valid HTML
- setRoot(self, root)
- setRoot - Sets the root node, and reprocesses the indexes
@param root - AdvancedTag to be new root
- unknown_decl(self, decl)
- Internal for parsing
Methods inherited from html.parser.HTMLParser:
- check_for_whole_start_tag(self, i)
- # Internal -- check to see if we have a complete starttag; return end
# or -1 if incomplete.
- clear_cdata_mode(self)
- close(self)
- Handle any buffered data.
- error(self, message)
- get_starttag_text(self)
- Return full source of start tag: '<...>'.
- goahead(self, end)
- # Internal -- handle data as far as reasonable. May leave state
# and data to be processed by a subsequent call. If 'end' is
# true, force handling all data as if followed by EOF marker.
- handle_pi(self, data)
- # Overridable -- handle processing instruction
- parse_bogus_comment(self, i, report=1)
- # Internal -- parse bogus comment, return length or -1 if not terminated
# see http://www.w3.org/TR/html5/tokenization.html#bogus-comment-state
- parse_endtag(self, i)
- # Internal -- parse endtag, return end or -1 if incomplete
- parse_html_declaration(self, i)
- # Internal -- parse html declarations, return length or -1 if not terminated
# See w3.org/TR/html5/tokenization.html#markup-declaration-open-state
# See also parse_declaration in _markupbase
- parse_pi(self, i)
- # Internal -- parse processing instr, return end or -1 if not terminated
- parse_starttag(self, i)
- # Internal -- handle starttag, return end or -1 if not terminated
- reset(self)
- Reset this instance. Loses all unprocessed data.
- set_cdata_mode(self, elem)
- unescape(self, s)
- # Internal -- helper to remove special character quoting
Data and other attributes inherited from html.parser.HTMLParser:
- CDATA_CONTENT_ELEMENTS = ('script', 'style')
Methods inherited from _markupbase.ParserBase:
- getpos(self)
- Return current line number and offset.
- parse_comment(self, i, report=1)
- # Internal -- parse comment, return length or -1 if not terminated
- parse_declaration(self, i)
- # Internal -- parse declaration (for use by subclasses).
- parse_marked_section(self, i, report=1)
- # Internal -- parse a marked section
# Override this to handle MS-word extension syntax <![if word]>content<![endif]>
- updatepos(self, i, j)
- # Internal -- update line number and offset. This should be
# called for each piece of data exactly once, in order -- in other
# words the concatenation of all the input strings to this
# function should be exactly the entire input.
Data descriptors inherited from _markupbase.ParserBase:
- __dict__
- dictionary for instance variables (if defined)
- __weakref__
- list of weak references to the object (if defined)
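The rule described for this class (indent nested tags, but leave the contents of "pre", "code" and "script" untouched) can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: `TinyFormatter`, `PRESERVE_TAGS` and `get_html` are hypothetical names, and the sketch assumes well-formed input in which every opened tag is closed.

```python
from html.parser import HTMLParser

# Tags whose contents are emitted verbatim (taken from the class description).
PRESERVE_TAGS = {"pre", "code", "script"}


class TinyFormatter(HTMLParser):
    """Hypothetical stand-in, not the library's AdvancedHTMLFormatter."""

    def __init__(self, indent="  "):
        super().__init__()
        # Mirror the documented rule: an integer means that many spaces.
        self.indent = " " * indent if isinstance(indent, int) else indent
        self.depth = 0      # current nesting level
        self.preserve = 0   # greater than 0 while inside a preserved tag
        self.parts = []

    def handle_starttag(self, tag, attrs):
        text = self.get_starttag_text()
        if self.preserve:
            self.parts.append(text)
        else:
            self.parts.append("\n" + self.indent * self.depth + text)
        self.depth += 1
        if tag in PRESERVE_TAGS:
            self.preserve += 1

    def handle_endtag(self, tag):
        self.depth -= 1
        if self.preserve:
            # Keep the closing tag inline so preserved text is not altered.
            self.parts.append("</%s>" % tag)
            if tag in PRESERVE_TAGS:
                self.preserve -= 1
        else:
            self.parts.append("\n" + self.indent * self.depth + "</%s>" % tag)

    def handle_data(self, data):
        if self.preserve:
            self.parts.append(data)
        elif data.strip():
            self.parts.append("\n" + self.indent * self.depth + data.strip())

    def get_html(self):
        return "".join(self.parts).lstrip("\n")


formatter = TinyFormatter(indent=2)
formatter.feed("<div><p>Hi</p><pre>  a\n   b</pre></div>")
formatter.close()
formatted = formatter.get_html()
```

The whitespace inside the `pre` element survives unchanged, while the surrounding tags are placed on their own indented lines.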
|
class AdvancedHTMLParser(html.parser.HTMLParser) |
|
AdvancedHTMLParser - This class parses and allows searching of documents |
|
- Method resolution order:
- AdvancedHTMLParser
- html.parser.HTMLParser
- _markupbase.ParserBase
- builtins.object
Methods defined here:
- __init__(self, filename=None, encoding='utf-8')
- __init__ - Creates an Advanced HTML parser object. For read-only parsing, consider IndexedAdvancedHTMLParser for faster searching.
@param filename <str> - Optional filename to parse. Otherwise use parseFile or parseStr methods.
@param encoding <str> - Specifies the document encoding. Default utf-8
- feed(self, contents)
- feed - Feed contents. Use parseStr or parseFile instead.
@param contents - Contents
- getElementById(self, _id, root='root')
- getElementById - Searches and returns the first (should only be one) element with the given ID.
@param _id <str> - A string of the id attribute.
@param root <AdvancedTag/'root'> - Search starting at a specific node, if provided. if string 'root' [default], the root of the parsed tree will be used.
- getElementsByAttr(self, attrName, attrValue, root='root')
- getElementsByAttr - Searches the full tree for elements with a given attribute name and value combination. This is always a full scan.
@param attrName <lowercase str> - A lowercase attribute name
@param attrValue <str> - Expected value of attribute
@param root <AdvancedTag/'root'> - Search starting at a specific node, if provided. if string 'root', the root of the parsed tree will be used.
- getElementsByClassName(self, className, root='root')
- getElementsByClassName - Searches and returns all elements containing a given class name.
@param className <str> - A one-word class name
@param root <AdvancedTag/'root'> - Search starting at a specific node, if provided. if string 'root' [default], the root of the parsed tree will be used.
- getElementsByName(self, name, root='root')
- getElementsByName - Searches and returns all elements with a specific name.
@param name <str> - A string of the name attribute
@param root <AdvancedTag/'root'> - Search starting at a specific node, if provided. if string 'root' [default], the root of the parsed tree will be used.
- getElementsByTagName(self, tagName, root='root')
- getElementsByTagName - Searches and returns all elements with a specific tag name.
@param tagName <lowercase str> - A lowercase string of the tag name.
@param root <AdvancedTag/'root'> - Search starting at a specific node, if provided. if string 'root', the root of the parsed tree will be used.
- getElementsCustomFilter(self, filterFunc, root='root')
- getElementsCustomFilter - Scan elements using a provided function
@param filterFunc <function>(node) - A function that takes an AdvancedTag as an argument, and returns True if some arbitrary criteria is met
@return - TagCollection of all matching elements
- getElementsWithAttrValues(self, attrName, attrValues, root='root')
- getElementsWithAttrValues - Returns elements whose attribute, named by #attrName, has one of the values in #attrValues
@param attrName <lowercase str> - A lowercase attribute name
@param attrValues set<str> - A set of all valid values.
@return - TagCollection of all matching elements
- getFormattedHTML(self, indent=' ')
- getFormattedHTML - Get the formatted XHTML of this document
@param indent - space/tab/newline of each level of indent, or integer for how many spaces per level
@return - Formatted html as string
- getHTML(self)
- getHTML - Get the full HTML as contained within this tree
@returns - String
- getRoot(self)
- getRoot - returns the root Tag
@return Tag
- getRootNodes(self)
- getRootNodes - Gets all objects at the "root" (first level; no parent). Use this if you may have multiple roots (not children of <html>)
Use this method to get objects, for example, in an AJAX request where <html> may not be your root.
Note: If there are multiple root nodes (i.e. no <html> at the top), getRoot will return a special tag. This function automatically
handles that, and returns all root nodes.
@return list<AdvancedTag> - A list of AdvancedTags which are at the root level of the tree.
- handle_charref(self, charRef)
- Internal for parsing
- handle_comment(self, comment)
- Internal for parsing
- handle_data(self, data)
- Internal for parsing
- handle_decl(self, decl)
- Internal for parsing
- handle_endtag(self, tagName)
- Internal for parsing
- handle_entityref(self, entity)
- Internal for parsing
- handle_startendtag(self, tagName, attributeList)
- Internal for parsing
- handle_starttag(self, tagName, attributeList, isSelfClosing=False)
- Internal for parsing
- parseFile(self, filename)
- parseFile - Parses a file and creates the DOM tree and indexes
@param filename <str/file> - A filename string or a file object. A file object will not be closed; you must close it yourself.
- parseStr(self, html)
- parseStr - Parses a string and creates the DOM tree and indexes.
@param html <str> - valid HTML
- setRoot(self, root)
- Sets the root node, and reprocesses the indexes
- unknown_decl(self, decl)
- Internal for parsing
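The search methods listed above, and getElementsCustomFilter in particular, can be pictured as a walk over the parsed tree that keeps every node for which a predicate returns True. The sketch below uses only the standard library and is an assumption about the general technique, not the library's code: `Node`, `TreeBuilder` and `custom_filter` are hypothetical names, and the input is assumed to close every tag it opens.

```python
from html.parser import HTMLParser


class Node:
    """Hypothetical, much reduced stand-in for AdvancedTag."""

    def __init__(self, tag, attrs, parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simple tree of Node objects under an invisible root."""

    def __init__(self):
        super().__init__()
        self.root = Node("root", [])
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent


def custom_filter(node, filter_func):
    """Return all descendants of node for which filter_func(node) is true."""
    matches = []
    for child in node.children:
        if filter_func(child):
            matches.append(child)
        matches.extend(custom_filter(child, filter_func))
    return matches


builder = TreeBuilder()
builder.feed('<div class="a"><p class="b x">1</p><p>2</p></div>')
builder.close()

paragraphs = custom_filter(builder.root, lambda n: n.tag == "p")
with_class_x = custom_filter(
    builder.root, lambda n: "x" in n.attrs.get("class", "").split()
)
```

A tag-name or class-name search is then just a particular choice of predicate, which is why an unindexed search always visits the whole subtree.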
Methods inherited from html.parser.HTMLParser:
- check_for_whole_start_tag(self, i)
- # Internal -- check to see if we have a complete starttag; return end
# or -1 if incomplete.
- clear_cdata_mode(self)
- close(self)
- Handle any buffered data.
- error(self, message)
- get_starttag_text(self)
- Return full source of start tag: '<...>'.
- goahead(self, end)
- # Internal -- handle data as far as reasonable. May leave state
# and data to be processed by a subsequent call. If 'end' is
# true, force handling all data as if followed by EOF marker.
- handle_pi(self, data)
- # Overridable -- handle processing instruction
- parse_bogus_comment(self, i, report=1)
- # Internal -- parse bogus comment, return length or -1 if not terminated
# see http://www.w3.org/TR/html5/tokenization.html#bogus-comment-state
- parse_endtag(self, i)
- # Internal -- parse endtag, return end or -1 if incomplete
- parse_html_declaration(self, i)
- # Internal -- parse html declarations, return length or -1 if not terminated
# See w3.org/TR/html5/tokenization.html#markup-declaration-open-state
# See also parse_declaration in _markupbase
- parse_pi(self, i)
- # Internal -- parse processing instr, return end or -1 if not terminated
- parse_starttag(self, i)
- # Internal -- handle starttag, return end or -1 if not terminated
- reset(self)
- Reset this instance. Loses all unprocessed data.
- set_cdata_mode(self, elem)
- unescape(self, s)
- # Internal -- helper to remove special character quoting
Data and other attributes inherited from html.parser.HTMLParser:
- CDATA_CONTENT_ELEMENTS = ('script', 'style')
Methods inherited from _markupbase.ParserBase:
- getpos(self)
- Return current line number and offset.
- parse_comment(self, i, report=1)
- # Internal -- parse comment, return length or -1 if not terminated
- parse_declaration(self, i)
- # Internal -- parse declaration (for use by subclasses).
- parse_marked_section(self, i, report=1)
- # Internal -- parse a marked section
# Override this to handle MS-word extension syntax <![if word]>content<![endif]>
- updatepos(self, i, j)
- # Internal -- update line number and offset. This should be
# called for each piece of data exactly once, in order -- in other
# words the concatenation of all the input strings to this
# function should be exactly the entire input.
Data descriptors inherited from _markupbase.ParserBase:
- __dict__
- dictionary for instance variables (if defined)
- __weakref__
- list of weak references to the object (if defined)
|
class AdvancedTag(builtins.object) |
|
AdvancedTag - Represents a Tag. Used with AdvancedHTMLParser to create a DOM-model
Keep tag names lowercase.
Use the getters and setters instead of attributes directly, or you may lose accounting. |
|
Methods defined here:
- __getitem__(self, key)
- __init__(self, tagName, attrList=None, isSelfClosing=False)
- __init__ - Construct
@param tagName - String of tag name. This will be lowercased!
@param attrList - A list of tuples (key, value)
@param isSelfClosing - True if this is a self-closing tag ( <tagName attrs /> ). Will be set to False if text or children are added.
- __setattr__(self, name, value)
- __str__(self)
- __str__ - Returns start tag, inner text, and end tag
- addClass(self, className)
- addClass - append a class name if not present
- appendChild(self, child)
- appendChild - Append a child to this element.
@param child <AdvancedTag> - Append a child element to this element
- appendNode = appendChild(self, child)
- appendText(self, text)
- appendText - append some inner text
- getAttribute(self, attrName)
- getAttribute - Gets an attribute on this tag. Be wary of using this for the class name; consider addClass/removeClass instead. Attribute names are all lowercase.
@return - The attribute value, or None if none exists.
- getChildren(self)
- getChildren - returns child nodes as a searchable TagCollection.
@return - TagCollection of the immediate children to this tag.
- getElementById(self, _id)
- getElementById - Search children of this tag for a tag containing an id
@param _id - String of id
@return - AdvancedTag or None
- getElementsByAttr(self, attrName, attrValue)
- getElementsByAttr - Search children of this tag for tags with an attribute name/value pair
@param attrName - Attribute name (lowercase)
@param attrValue - Attribute value
@return - TagCollection of matching elements
- getElementsByClassName(self, className)
- getElementsByClassName - Search children of this tag for tags containing a given class name
@param className - Class name
@return - TagCollection of matching elements
- getElementsByName(self, name)
- getElementsByName - Search children of this tag for tags with a given name
@param name - name to search
@return - TagCollection of matching elements
- getElementsCustomFilter(self, filterFunc)
- getElementsCustomFilter - Searches children of this tag for those matching a provided user function
@param filterFunc <function> - A function or lambda expression that should return "True" if the passed node matches criteria.
@return - TagCollection of matching results
- getElementsWithAttrValues(self, attrName, attrValues)
- getElementsWithAttrValues - Search children of this tag for tags with an attribute name and one of several values
@param attrName <lowercase str> - Attribute name (lowercase)
@param attrValues set<str> - set of acceptable attribute values
@return - TagCollection of matching elements
- getEndTag(self)
- getEndTag - returns the end tag
@return - String of end tag
- getPeers(self)
- getPeers - Get elements who share a parent with this element
@return - TagCollection of elements
- getPeersByAttr(self, attrName, attrValue)
- getPeersByAttr - Gets peers (elements on same level) which match an attribute/value combination.
@param attrName - Name of attribute
@param attrValue - Value that must match
@return - None if no parent element (error condition), otherwise a TagCollection of peers that matched.
- getPeersByClassName(self, className)
- getPeersByClassName - Gets peers (elements on same level) with a given class name
@param className - classname must contain this name
@return - None if no parent element (error condition), otherwise a TagCollection of peers that matched.
- getPeersByName(self, name)
- getPeersByName - Gets peers (elements on same level) with a given name
@param name - Name to match
@return - None if no parent element (error condition), otherwise a TagCollection of peers that matched.
- getPeersWithAttrValues(self, attrName, attrValues)
- getPeersWithAttrValues - Gets peers (elements on same level) whose attribute given by #attrName
is in the list of possible values #attrValues
@param attrName - Name of attribute
@param attrValues - List of possible values which will match
@return - None if no parent element (error condition), otherwise a TagCollection of peers that matched.
- getStartTag(self)
- getStartTag - Returns the start tag
@return - String of start tag with attributes
- getStyle(self, styleName)
- getStyle - Gets the value of a style parameter, part of the "style" attribute
@param styleName - The name of the style
@return - String of the value of the style. '' if there is no value.
- getStyleDict(self)
- getStyleDict - Gets a dictionary of style attribute/value pairs.
@return - OrderedDict of "style" attribute.
- getTagName(self)
- getTagName - Gets the tag name of this Tag.
@return - str
- getUid(self)
- hasAttribute(self, attrName)
- hasAttribute - Checks for the existence of an attribute. Attribute names are all lowercase.
@param attrName <str> - The attribute name
@return <bool> - True or False if attribute exists by that name
- hasClass(self, className)
- hasClass - Test if this tag has a particular class name
@param className - A class to search
- insertAfter(self, child, afterChild)
- insertAfter - Inserts a child after @afterChild
child - Child to insert
afterChild - Child to insert after. if None, will be appended
- insertBefore(self, child, beforeChild)
- insertBefore - Inserts a child before @beforeChild
child - Child to insert
beforeChild - Child to insert before. if None, will be appended
- removeAttribute(self, attrName)
- removeAttribute - Removes an attribute, by name.
@param attrName <str> - The attribute name
- removeChild(self, child)
- removeChild - Remove a child, if present.
@param child - The child to remove
@return - The child [with parentNode cleared] if removed, otherwise None.
- removeClass(self, className)
- removeClass - remove a class name if present. Returns the class name if removed, otherwise None.
- removeNode = removeChild(self, child)
- removeText(self, text)
- removeText - Removes some inner text
- setAttribute(self, attrName, attrValue)
- setAttribute - Sets an attribute. Be wary of using this for the class name; consider addClass/removeClass instead. Attribute names are all lowercase.
@param attrName <str> - The name of the attribute
@param attrValue <str> - The value of the attribute
- setAttributes(self, attributesDict)
- setAttributes - Sets several attributes at once, using a dictionary of attrName : attrValue
@param attributesDict - <str:str> - New attribute names -> values
- setStyle(self, styleName, styleValue)
- setStyle - Sets a style param. Example: "display", "block"
If you need to set many styles on an element, use setStyles instead.
It takes a dictionary of attribute, value pairs and applies it all in one go (faster)
To remove a style, set its value to empty string.
When all styles are removed, the "style" attribute will be nullified.
@param styleName - The name of the style element
@param styleValue - The value of which to assign the style element
@return - String of current value of "style" after change is made.
- setStyles(self, styleUpdatesDict)
- setStyles - Sets one or more style params.
This all happens in one shot, so it is much much faster than calling setStyle for every value.
To remove a style, set its value to empty string.
When all styles are removed, the "style" attribute will be nullified.
@param styleUpdatesDict - Dictionary of attribute : value styles.
@return - String of current value of "style" after change is made.
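The behaviour described for setStyle and setStyles (several updates applied in one pass, an empty string removing a style, and the attribute becoming empty once no styles remain) can be sketched on a plain style string. This is only an illustration under stated assumptions: `parse_style` and `apply_styles` are hypothetical helpers, and the `name: value; name: value` serialization is assumed rather than taken from the library.

```python
from collections import OrderedDict


def parse_style(style):
    """Split 'a: b; c: d' into an ordered name -> value mapping."""
    styles = OrderedDict()
    for declaration in style.split(";"):
        name, sep, value = declaration.partition(":")
        if sep and name.strip():
            styles[name.strip()] = value.strip()
    return styles


def apply_styles(style, updates):
    """Apply several updates in one pass; an empty value removes the style."""
    styles = parse_style(style)
    for name, value in updates.items():
        if value == "":
            styles.pop(name, None)
        else:
            styles[name] = value
    # An empty result stands in for the "style" attribute being cleared.
    return "; ".join("%s: %s" % item for item in styles.items())


current = "display: block; color: red"
updated = apply_styles(current, {"color": "", "float": "left"})
cleared = apply_styles(updated, {"display": "", "float": ""})
```

Parsing the string once and writing it back once is what makes a batched update cheaper than one parse and one write per style.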
Data descriptors defined here:
- __dict__
- dictionary for instance variables (if defined)
- __weakref__
- list of weak references to the object (if defined)
- childNodes
- childNodes - returns immediate child nodes as a TagCollection
@return - TagCollection of child nodes
- classList
- classList - get the list of class names
- id
- innerHTML
- innerHTML - Returns a string of the inner contents of this tag, including children.
@return - String of inner contents
- name
- nextSibling
- nextSibling - Returns the next sibling. This could be text or an element. use nextSiblingElement to ensure element
- nextSiblingElement
- nextSiblingElement - Returns the next sibling that is an element.
- outerHTML
- outerHTML - Returns start tag, innerHTML, and end tag
@return - String of start tag, innerHTML, and end tag
- parentElement
- parentElement - get the parent element
- peers
- peers - Get elements with same parent as this item
@return - TagCollection of elements
- previousSibling
- previousSibling - Returns the previous sibling. This could be text or an element. use previousSiblingElement to ensure element
- previousSiblingElement
- previousSiblingElement - Returns the previous sibling that is an element.
- value
- value - The "value" attribute of this element
|
class HTMLValidationException(builtins.Exception) |
|
HTMLValidationException - common baseclass for invalid-HTML validation errors |
|
- Method resolution order:
- HTMLValidationException
- builtins.Exception
- builtins.BaseException
- builtins.object
Data descriptors defined here:
- __weakref__
- list of weak references to the object (if defined)
Methods inherited from builtins.Exception:
- __init__(self, /, *args, **kwargs)
- Initialize self. See help(type(self)) for accurate signature.
- __new__(*args, **kwargs) from builtins.type
- Create and return a new object. See help(type) for accurate signature.
Methods inherited from builtins.BaseException:
- __delattr__(self, name, /)
- Implement delattr(self, name).
- __getattribute__(self, name, /)
- Return getattr(self, name).
- __reduce__(...)
- __repr__(self, /)
- Return repr(self).
- __setattr__(self, name, value, /)
- Implement setattr(self, name, value).
- __setstate__(...)
- __str__(self, /)
- Return str(self).
- with_traceback(...)
- Exception.with_traceback(tb) --
set self.__traceback__ to tb and return self.
Data descriptors inherited from builtins.BaseException:
- __cause__
- exception cause
- __context__
- exception context
- __dict__
- __suppress_context__
- __traceback__
- args
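Because InvalidCloseException and MissedCloseException both derive from HTMLValidationException, a single `except` clause on the base class handles either one. The sketch below defines local stand-in classes that mirror the documented hierarchy so that it runs without the library. The `check_tags` function and the conditions under which it raises each exception are invented for illustration; this section does not say when the library itself raises them.

```python
class HTMLValidationException(Exception):
    """Stand-in mirroring the documented common base class."""


class InvalidCloseException(HTMLValidationException):
    """Stand-in; raised here when a close tag has nothing to close."""


class MissedCloseException(HTMLValidationException):
    """Stand-in; raised here when a different tag should be closed first."""


def check_tags(events):
    """Hypothetical checker over ('open' | 'close', tag_name) pairs."""
    stack = []
    for kind, name in events:
        if kind == "open":
            stack.append(name)
        elif not stack:
            raise InvalidCloseException("close of %r with nothing open" % name)
        elif stack[-1] != name:
            raise MissedCloseException(
                "expected close of %r before %r" % (stack[-1], name)
            )
        else:
            stack.pop()


caught = []
samples = (
    [("close", "p")],
    [("open", "div"), ("open", "p"), ("close", "div")],
)
for events in samples:
    try:
        check_tags(events)
    except HTMLValidationException as exc:
        # One handler covers both subclasses.
        caught.append(type(exc).__name__)
```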
|
class IndexedAdvancedHTMLParser(AdvancedHTMLParser) |
|
An AdvancedHTMLParser that indexes for much faster searching. If you are doing searching/validation, this is your best bet.
If you are writing/modifying, you may use this, but be sure to call reindex() after changes. |
|
- Method resolution order:
- IndexedAdvancedHTMLParser
- AdvancedHTMLParser
- html.parser.HTMLParser
- _markupbase.ParserBase
- builtins.object
Methods defined here:
- __init__(self, filename=None, encoding='utf-8', indexIDs=True, indexNames=True, indexClassNames=True, indexTagNames=True)
- __init__ - Creates an Advanced HTML parser object, with specific indexing settings.
For the various index* arguments, if True the index will be collected and used (if useIndex=True [default] on get* function)
@param filename <str> - Optional filename to parse. Otherwise use parseFile or parseStr methods.
@param encoding <str> - Specifies the document encoding. Default utf-8
@param indexIDs <bool> - True to create an index for getElementById method. <default True>
@param indexNames <bool> - True to create an index for getElementsByName method <default True>
@param indexClassNames <bool> - True to create an index for getElementsByClassName method. <default True>
@param indexTagNames <bool> - True to create an index for tag names. <default True>
For indexing other attributes, see the more generic addIndexOnAttribute
- addIndexOnAttribute(self, attributeName)
- addIndexOnAttribute - Add an index for an arbitrary attribute. This will be used by the getElementsByAttr function.
You should do this prior to parsing, or call reindex. Otherwise it will be blank. "name" and "id" will have no effect.
@param attributeName <lowercase str> - An attribute name. Will be lowercased.
- disableIndexing(self)
- disableIndexing - Disables indexing. Consider using plain AdvancedHTMLParser class.
Maybe useful in some scenarios where you want to parse, add a ton of elements, then index
and do a bunch of searching.
- getElementById(self, _id, root='root', useIndex=True)
- getElementById - Searches and returns the first (should only be one) element with the given ID.
@param _id <str> - A string of the id attribute.
@param root <AdvancedTag/'root'> - Search starting at a specific node, if provided. if string 'root', the root of the parsed tree will be used.
@param useIndex <bool> If useIndex is True and ids are indexed [see constructor] only the index will be used. Otherwise a full search is performed.
- getElementsByAttr(self, attrName, attrValue, root='root', useIndex=True)
- getElementsByAttr - Searches the full tree for elements with a given attribute name and value combination. If you want multiple potential values, see getElementsWithAttrValues
If you want an index on a random attribute, use the addIndexOnAttribute function.
@param attrName <lowercase str> - A lowercase attribute name
@param attrValue <str> - Expected value of attribute
@param root <AdvancedTag/'root'> - Search starting at a specific node, if provided. if string 'root', the root of the parsed tree will be used.
@param useIndex <bool> If useIndex is True and this specific attribute is indexed [see addIndexOnAttribute] only the index will be used. Otherwise a full search is performed.
- getElementsByClassName(self, className, root='root', useIndex=True)
- getElementsByClassName - Searches and returns all elements containing a given class name.
@param className <str> - A one-word class name
@param root <AdvancedTag/'root'> - Search starting at a specific node, if provided. if string 'root', the root of the parsed tree will be used.
@param useIndex <bool> If useIndex is True and class names are indexed [see constructor] only the index will be used. Otherwise a full search is performed.
- getElementsByName(self, name, root='root', useIndex=True)
- getElementsByName - Searches and returns all elements with a specific name.
@param name <str> - A string of the name attribute
@param root <AdvancedTag/'root'> - Search starting at a specific node, if provided. if string 'root', the root of the parsed tree will be used.
@param useIndex <bool> If useIndex is True and names are indexed [see constructor] only the index will be used. Otherwise a full search is performed.
- getElementsByTagName(self, tagName, root='root', useIndex=True)
- getElementsByTagName - Searches and returns all elements with a specific tag name.
@param tagName <lowercase str> - A lowercase string of the tag name.
@param root <AdvancedTag/'root'> - Search starting at a specific node, if provided. if string 'root', the root of the parsed tree will be used.
@param useIndex - If True [default] and tag names are set to be indexed [default, see constructor], only the index will be used. If False, all tags
will be searched.
- getElementsWithAttrValues(self, attrName, values, root='root', useIndex=True)
- getElementsWithAttrValues - Returns elements with an attribute matching one of several values. For a single name/value combination, see getElementsByAttr
@param attrName <lowercase str> - A lowercase attribute name
@param values set<str> - Set of expected values of the attribute
@param root <AdvancedTag/'root'> - Search starting at a specific node, if provided. if string 'root', the root of the parsed tree will be used.
@param useIndex <bool> If useIndex is True and this specific attribute is indexed [see addIndexOnAttribute] only the index will be used. Otherwise a full search is performed.
- handle_starttag(self, tagName, attributeList, isSelfClosing=False)
- internal for parsing
- reindex(self, newIndexIDs=None, newIndexNames=None, newIndexClassNames=None, newIndexTagNames=None)
- reindex - reindex the tree. Optionally, change what fields are indexed.
@param newIndexIDs <bool/None> - None to leave same, otherwise new value to index IDs
@param newIndexNames <bool/None> - None to leave same, otherwise new value to index names
@param newIndexClassNames <bool/None> - None to leave same, otherwise new value to index class names
@param newIndexTagNames <bool/None> - None to leave same, otherwise new value to index tag names
- removeIndexOnAttribute(self, attributeName)
- removeIndexOnAttribute - Remove an attribute from indexing (for getElementsByAttr function) and remove indexed data.
@param attributeName <lowercase str> - An attribute name. Will be lowercased. "name" and "id" will have no effect.
- setRoot(self, root)
- Sets the root node, and reprocesses the indexes
@param root - AdvancedTag for root
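The trade-off between an indexed lookup and a full scan, and the reason reindex() is needed after changes, can be sketched with a dictionary built during parsing. The code below is a standard-library illustration of the general idea only: `IdIndexer`, `get_by_id` and `use_index` are hypothetical names, and elements are reduced to `(tag, attributes)` pairs.

```python
from html.parser import HTMLParser


class IdIndexer(HTMLParser):
    """Hypothetical sketch of an id index kept next to the parsed elements."""

    def __init__(self):
        super().__init__()
        self.elements = []  # every start tag seen, as (tag, attrs dict)
        self.by_id = {}     # id value -> element, filled while parsing

    def handle_starttag(self, tag, attrs):
        element = (tag, dict(attrs))
        self.elements.append(element)
        if "id" in element[1]:
            self.by_id.setdefault(element[1]["id"], element)

    def get_by_id(self, _id, use_index=True):
        if use_index:
            # Only the index is consulted, so stale data is possible.
            return self.by_id.get(_id)
        for element in self.elements:  # full scan
            if element[1].get("id") == _id:
                return element
        return None

    def reindex(self):
        self.by_id = {}
        for element in self.elements:
            if "id" in element[1]:
                self.by_id.setdefault(element[1]["id"], element)


parser = IdIndexer()
parser.feed('<div id="a"><span id="b">x</span></div>')
parser.close()

# Change an id after parsing without telling the index.
parser.elements[1][1]["id"] = "c"
stale = parser.get_by_id("c")                    # index not rebuilt yet
scanned = parser.get_by_id("c", use_index=False)  # full scan sees the change
parser.reindex()
fresh = parser.get_by_id("c")
```

The indexed lookup is a single dictionary access, but it only reflects the tree as it was when the index was last built.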
Methods inherited from AdvancedHTMLParser:
- feed(self, contents)
- feed - Feed contents. Use parseStr or parseFile instead.
@param contents - Contents
- getElementsCustomFilter(self, filterFunc, root='root')
- getElementsCustomFilter - Scan elements using a provided function
@param filterFunc <function>(node) - A function that takes an AdvancedTag as an argument, and returns True if some arbitrary criteria is met
@return - TagCollection of all matching elements
- getFormattedHTML(self, indent=' ')
- getFormattedHTML - Get the formatted XHTML of this document
@param indent - space/tab/newline of each level of indent, or integer for how many spaces per level
@return - Formatted html as string
- getHTML(self)
- getHTML - Get the full HTML as contained within this tree
@returns - String
- getRoot(self)
- getRoot - returns the root Tag
@return Tag
- getRootNodes(self)
- getRootNodes - Gets all objects at the "root" (first level; no parent). Use this if you may have multiple roots (not children of <html>)
Use this method to get objects, for example, in an AJAX request where <html> may not be your root.
Note: If there are multiple root nodes (i.e. no <html> at the top), getRoot will return a special tag. This function automatically
handles that, and returns all root nodes.
@return list<AdvancedTag> - A list of AdvancedTags which are at the root level of the tree.
- handle_charref(self, charRef)
- Internal for parsing
- handle_comment(self, comment)
- Internal for parsing
- handle_data(self, data)
- Internal for parsing
- handle_decl(self, decl)
- Internal for parsing
- handle_endtag(self, tagName)
- Internal for parsing
- handle_entityref(self, entity)
- Internal for parsing
- handle_startendtag(self, tagName, attributeList)
- Internal for parsing
- parseFile(self, filename)
- parseFile - Parses a file and creates the DOM tree and indexes
@param filename <str/file> - A filename string or a file object. A file object will not be closed; you must close it yourself.
- parseStr(self, html)
- parseStr - Parses a string and creates the DOM tree and indexes.
@param html <str> - valid HTML
- unknown_decl(self, decl)
- Internal for parsing
Methods inherited from html.parser.HTMLParser:
- check_for_whole_start_tag(self, i)
- # Internal -- check to see if we have a complete starttag; return end
# or -1 if incomplete.
- clear_cdata_mode(self)
- close(self)
- Handle any buffered data.
- error(self, message)
- get_starttag_text(self)
- Return full source of start tag: '<...>'.
- goahead(self, end)
- # Internal -- handle data as far as reasonable. May leave state
# and data to be processed by a subsequent call. If 'end' is
# true, force handling all data as if followed by EOF marker.
- handle_pi(self, data)
- # Overridable -- handle processing instruction
- parse_bogus_comment(self, i, report=1)
- # Internal -- parse bogus comment, return length or -1 if not terminated
# see http://www.w3.org/TR/html5/tokenization.html#bogus-comment-state
- parse_endtag(self, i)
- # Internal -- parse endtag, return end or -1 if incomplete
- parse_html_declaration(self, i)
- # Internal -- parse html declarations, return length or -1 if not terminated
# See w3.org/TR/html5/tokenization.html#markup-declaration-open-state
# See also parse_declaration in _markupbase
- parse_pi(self, i)
- # Internal -- parse processing instr, return end or -1 if not terminated
- parse_starttag(self, i)
- # Internal -- handle starttag, return end or -1 if not terminated
- reset(self)
- Reset this instance. Loses all unprocessed data.
- set_cdata_mode(self, elem)
- unescape(self, s)
- # Internal -- helper to remove special character quoting
Data and other attributes inherited from html.parser.HTMLParser:
- CDATA_CONTENT_ELEMENTS = ('script', 'style')
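CDATA_CONTENT_ELEMENTS names the tags whose contents html.parser.HTMLParser treats as raw text rather than markup. A short sketch of that behaviour, using only the standard library; RawTextProbe is a hypothetical name chosen for illustration:

```python
from html.parser import HTMLParser


class RawTextProbe(HTMLParser):
    """Records start tags and text so the raw-text handling is visible."""

    def __init__(self):
        super().__init__()
        self.startTags = []
        self.dataChunks = []

    def handle_starttag(self, tagName, attributeList):
        self.startTags.append(tagName)

    def handle_data(self, data):
        self.dataChunks.append(data)


probe = RawTextProbe()
probe.feed('<script>if (a < b) { x = "<b>"; }</script>')
probe.close()

# Text may arrive in several chunks, so join before inspecting it.
scriptBody = ''.join(probe.dataChunks)
print(probe.startTags)
print(scriptBody)
```

Only the script tag is reported as a start tag; the "<b>" inside the script body is passed through as data.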
Methods inherited from _markupbase.ParserBase:
- getpos(self)
- Return current line number and offset.
- parse_comment(self, i, report=1)
- # Internal -- parse comment, return length or -1 if not terminated
- parse_declaration(self, i)
- # Internal -- parse declaration (for use by subclasses).
- parse_marked_section(self, i, report=1)
- # Internal -- parse a marked section
# Override this to handle MS-word extension syntax <![if word]>content<![endif]>
- updatepos(self, i, j)
- # Internal -- update line number and offset. This should be
# called for each piece of data exactly once, in order -- in other
# words the concatenation of all the input strings to this
# function should be exactly the entire input.
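As a rough illustration of getpos (together with get_starttag_text, inherited from html.parser.HTMLParser), the sketch below records where each start tag begins. It uses only the standard library; PositionLogger is a hypothetical name. Line numbers start at 1 and offsets at 0.

```python
from html.parser import HTMLParser


class PositionLogger(HTMLParser):
    """Records the position and source text of every start tag."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tagName, attributeList):
        # getpos() reports where the tag currently being handled begins.
        self.positions.append(
            (tagName, self.getpos(), self.get_starttag_text())
        )


logger = PositionLogger()
logger.feed('<p>\n  <b class="x">bold</b></p>')
logger.close()

for tagName, (lineNumber, offset), source in logger.positions:
    print(tagName, lineNumber, offset, source)
```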
Data descriptors inherited from _markupbase.ParserBase:
- __dict__
- dictionary for instance variables (if defined)
- __weakref__
- list of weak references to the object (if defined)
|
class InvalidCloseException(HTMLValidationException) |
|
InvalidCloseException - Raised when a tag is closed that shouldn't be closed. |
|
- Method resolution order:
- InvalidCloseException
- HTMLValidationException
- builtins.Exception
- builtins.BaseException
- builtins.object
Methods defined here:
- __init__(self, triedToClose, stillOpen)
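Because InvalidCloseException derives from HTMLValidationException, a handler for the base class also catches it. The sketch below shows that pattern with local stand-in classes that mirror the documented hierarchy and constructor signature. Everything else is an assumption made for illustration: the stored attributes, the message format, the treatment of stillOpen as a list of tag names, and the close_tag helper are not taken from the library.

```python
class HTMLValidationException(Exception):
    """Stand-in for the documented base class."""


class InvalidCloseException(HTMLValidationException):
    """Stand-in: a tag was closed that should not have been closed."""

    def __init__(self, triedToClose, stillOpen):
        # Assumption: keep both arguments and build a readable message.
        self.triedToClose = triedToClose
        self.stillOpen = stillOpen
        super().__init__(
            'Tried to close "%s" while still open: %s'
            % (triedToClose, ', '.join(stillOpen))
        )


def close_tag(openTags, tagName):
    """Hypothetical helper: pop tagName only if it is the innermost open tag."""
    if not openTags or openTags[-1] != tagName:
        raise InvalidCloseException(tagName, list(openTags))
    openTags.pop()


openTags = ['html', 'body', 'div']
caught = None
try:
    close_tag(openTags, 'span')
except HTMLValidationException as exc:
    # Catching the base class also catches its subclasses.
    caught = exc
    print(type(exc).__name__, '-', exc)
```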
Data descriptors inherited from HTMLValidationException:
- __weakref__
- list of weak references to the object (if defined)
Methods inherited from builtins.Exception:
- __new__(*args, **kwargs) from builtins.type
- Create and return a new object. See help(type) for accurate signature.
Methods inherited from builtins.BaseException:
- __delattr__(self, name, /)
- Implement delattr(self, name).
- __getattribute__(self, name, /)
- Return getattr(self, name).
- __reduce__(...)
- __repr__(self, /)
- Return repr(self).
- __setattr__(self, name, value, /)
- Implement setattr(self, name, value).
- __setstate__(...)
- __str__(self, /)
- Return str(self).
- with_traceback(...)
- Exception.with_traceback(tb) --
set self.__traceback__ to tb and return self.
Data descriptors inherited from builtins.BaseException:
- __cause__
- exception cause
- __context__
- exception context
- __dict__
- __suppress_context__
- __traceback__
- args
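The data descriptors inherited from builtins.BaseException behave the same way for every exception class. A small standard-library sketch of how __cause__, __context__, __suppress_context__ and args are populated when one exception is raised from another; the error message is made up for illustration:

```python
wrapped = None
try:
    try:
        int('<b>')  # raises ValueError
    except ValueError as err:
        # "raise ... from err" records err as the explicit cause.
        raise RuntimeError('bad markup') from err
except RuntimeError as exc:
    wrapped = exc

print(repr(wrapped.__cause__))
print(wrapped.__suppress_context__)
print(wrapped.args)
```

Here __cause__ and __context__ both refer to the original ValueError, and __suppress_context__ is set because an explicit cause was given.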
|
class MissedCloseException(HTMLValidationException) |
|
MissedCloseException - Raised when an expected closing tag was missed. |
|
- Method resolution order:
- MissedCloseException
- HTMLValidationException
- builtins.Exception
- builtins.BaseException
- builtins.object
Methods defined here:
- __init__(self, triedToClose, stillOpen)
Data descriptors inherited from HTMLValidationException:
- __weakref__
- list of weak references to the object (if defined)
Methods inherited from builtins.Exception:
- __new__(*args, **kwargs) from builtins.type
- Create and return a new object. See help(type) for accurate signature.
Methods inherited from builtins.BaseException:
- __delattr__(self, name, /)
- Implement delattr(self, name).
- __getattribute__(self, name, /)
- Return getattr(self, name).
- __reduce__(...)
- __repr__(self, /)
- Return repr(self).
- __setattr__(self, name, value, /)
- Implement setattr(self, name, value).
- __setstate__(...)
- __str__(self, /)
- Return str(self).
- with_traceback(...)
- Exception.with_traceback(tb) --
set self.__traceback__ to tb and return self.
Data descriptors inherited from builtins.BaseException:
- __cause__
- exception cause
- __context__
- exception context
- __dict__
- __suppress_context__
- __traceback__
- args
| |