Filters¶
Filters can be applied at either of two stages of processing:
filter: applied to the downloaded data before storing it and diffing for changes
diff_filter: applied to the diff result before reporting the changes
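For example (hypothetical URL), a single job can use both stages: converting the downloaded HTML to text before it is stored and diffed, and then stripping whitespace from the resulting diff before it is reported:
url: https://example.net/both-stages.html
filter:
  - html2text
diff_filter:
  - strip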
While creating your job pipeline, you might want to preview what the filtered output looks like. For filters applied to the data, you can run webchanges with the --test command-line option, passing in the index (from --list) or the URL/command of the job to be tested:
webchanges --test 1 # Test the first job in the list and show the data collected after it's filtered
webchanges --test https://example.net/ # Test the job that matches the given URL
This command shows the output that will be captured and stored, and used for comparison with the old version stored from a previous run against the same URL or shell command.
Once webchanges has collected at least 2 historic snapshots of a job (two different states of a webpage), you can start testing the effects of your diff_filter with the command-line option --test-diff, passing in the index (from --list) or the URL/command of the job to be tested; this uses the historic data saved locally in the cache:
webchanges --test-diff 1 # Test the first job in the list and show the report
At the moment, the following filters are available:
To select HTML (or XML) elements:
css: Filter XML/HTML using CSS selectors
xpath: Filter XML/HTML using XPath expressions
element-by-class: Get all HTML elements by class
element-by-id: Get an HTML element by its ID
element-by-style: Get all HTML elements by style
element-by-tag: Get an HTML element by its tag
To make HTML more readable:
html2text: Convert HTML to readable plaintext (Markdown)
beautify: Beautify (reformat) HTML
To make PDFs readable:
pdf2text: Convert PDF to plaintext
To extract text from images:
ocr: Extract text from images
To make JSON more readable:
format-json: Reformat (pretty-print) JSON
To make XML more readable:
format-xml: Reformat (pretty-print) XML
To make iCal more readable:
ical2text: Convert iCalendar to plaintext
To make binary readable:
hexdump: Display data in hex dump format
To just detect changes:
sha1sum: Calculate the SHA-1 checksum of the data
To edit/filter text:
keep_lines_containing: Keep only lines matching a regular expression
delete_lines_containing: Delete lines matching a regular expression
re.sub: Replace or remove text matching a regular expression
strip: Strip leading and trailing whitespace
sort: Sort lines
reverse: Reverse the order of items (lines)
Any custom script or program:
shellpipe: Run a program or custom script
Python programmers can write their own plug-in that could include filters; see Hooks.
css and xpath¶
The css filter extracts content based on a CSS selector. It uses the cssselect Python package, which has limitations and extensions as explained in its documentation.
The xpath filter extracts content based on an XPath expression.
Examples:
url: https://example.net/css.html
filter:
- css: ul#groceries > li.unchecked
url: https://example.net/xpath.html
filter:
- xpath: /html/body/marquee
See Microsoft’s XPath Examples page for other examples.
Using CSS and XPath filters with XML and exclusions¶
By default, CSS and XPath filters are set up for HTML documents, but it is possible to use them for XML documents as well.
Example to parse an RSS feed and filter only the titles and publication dates:
url: https://example.com/blog/css-index.rss
filter:
- css:
method: xml
selector: 'item > title, item > pubDate'
- html2text: re
url: https://example.com/blog/xpath-index.rss
filter:
- xpath:
method: xml
path: '//item/title/text()|//item/pubDate/text()'
To match an element in an XML namespace, use a namespace prefix before the tag name. Use a | to separate the namespace prefix and the tag name in a CSS selector, and use a : in an XPath expression.
url: https://example.org/feed/css-namespace.xml
filter:
- css:
method: xml
selector: 'item > media|keywords'
namespaces:
media: http://search.yahoo.com/mrss/
- html2text
url: https://example.net/feed/xpath-namespace.xml
filter:
- xpath:
method: xml
path: '//item/media:keywords/text()'
namespaces:
media: http://search.yahoo.com/mrss/
Alternatively, use the XPath expression //*[name()='<tag_name>'] to bypass the namespace entirely.
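For example (hypothetical URL, building on the feed examples above), the same media:keywords elements could be matched without declaring the namespace by using the qualified name as it appears in the document:
url: https://example.net/feed/xpath-name.xml
filter:
  - xpath:
      method: xml
      path: "//*[name()='media:keywords']/text()"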
Another useful option with XPath and CSS filters is exclude. Elements selected by this exclude expression are removed from the final result. For example, the following job will not have any <a> tag in its results:
url: https://example.org/css-exclude.html
filter:
- css:
selector: 'body'
exclude: 'a'
Limiting the returned items from a CSS Selector or XPath¶
If you only want to return a subset of the items returned by a CSS selector or XPath filter, you can use two additional subfilters:
skip: how many elements to skip from the beginning (default: 0)
maxitems: how many elements to return at most (default: no limit)
For example, if the page has multiple elements, but you only want to select the second and third matching elements (skip the first, and return at most two elements), you can use this filter:
url: https://example.net/css-skip-maxitems.html
filter:
- css:
selector: div.cpu
skip: 1
maxitems: 2
Duplicated results¶
If you get multiple results from one page, but you only expected one (e.g. because the page contains both a mobile and a desktop version in the same HTML document, and shows/hides one via CSS depending on the viewport size), you can use maxitems: 1 to return only the first item.
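A minimal sketch (hypothetical URL and selector) keeping only the first of several duplicated matches:
url: https://example.com/mobile-and-desktop.html
filter:
  - css:
      selector: div.price
      maxitems: 1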
Optional directives¶
selector (for css) or path (for xpath): the CSS selector or XPath expression [can be entered as the value of the css or xpath directive]
method: either html (default) or xml
namespaces: mapping of XML namespaces for matching
exclude: elements to remove from the final result
skip: number of elements to skip from the beginning (default: 0)
maxitems: maximum number of items to be returned
element-by-¶
The filters element-by-class, element-by-id, element-by-style, and element-by-tag allow you to select all matching instances of a given HTML element.
Examples:
To extract only the <body>
of a page:
url: https://example.org/bodytag.html
filter:
- element-by-tag: body
To extract <div id="something">...</div> from a page:
from a page:
url: https://example.org/idtest.html
filter:
- element-by-id: something
Since you can chain filters, use this to extract an element within another element:
url: https://example.org/idtest_2.html
filter:
- element-by-id: outer_container
- element-by-id: something_inside
To make the output human-friendly you can chain html2text on the result:
url: https://example.net/id2text.html
filter:
- element-by-id: something
- html2text: pyhtml2text
html2text¶
This filter converts HTML (or XML) to plaintext.
Optional directives¶
method: one of:
html2text: uses the html2text Python package (default)
bs4: uses the BeautifulSoup Python package
re: a simple regex-based tag stripper
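For example (hypothetical URL), to select the bs4 method instead of the default, pass it as the value of the html2text directive, in the same way the re method is used in the RSS example earlier:
url: https://example.com/html2text-bs4.html
filter:
  - html2text: bs4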
html2text¶
This filter converts HTML into Markdown using the html2text Python package. It is the recommended option for converting all types of HTML into readable text.
Example configuration:
Note: If the content has tables, adding the sub-directive pad_tables: true may improve readability.
url: https://example.com/html2text.html
filter:
- xpath: '//section[@role="main"]'
- html2text:
pad_tables: true
Optional sub-directives¶
See documentation
Note that the following options are set by default (but can be overridden): ensure that accented characters are kept as they are (unicode_snob: true), lines aren’t chopped up (body_width: 0), additional empty lines aren’t added between sections (single_line_break: true), and images are ignored (ignore_images: true).
bs4¶
This filter extracts unformatted text from HTML using the BeautifulSoup Python package, specifically its get_text(strip=True) method.
Note that as of Beautiful Soup version 4.9.0, when lxml or html.parser are in use, the contents of <script>, <style>, and <template> tags are not considered to be ‘text’, since those tags are not part of the human-visible content of the page.
Optional sub-directives¶
parser (defaults to lxml): as per documentation
Required packages¶
To run jobs with this filter, you need to have additional Python package(s) installed.
Install them using:
pip install --upgrade webchanges[bs4]
re¶
A simple HTML/XML tag stripper based on applying a regex. Very fast but may not yield the prettiest results.
beautify¶
This filter uses the BeautifulSoup, jsbeautifier and cssbeautifier Python packages to reformat an HTML document to make it more readable.
Required packages¶
To run jobs with this filter, you need to install Optional packages. Install them using:
pip install --upgrade webchanges[beautify]
pdf2text¶
This filter converts a PDF file to plaintext using the pdftotext Python library, itself based on the Poppler library.
This filter must be the first filter in a chain of filters.
url: https://example.net/pdf-test.pdf
filter:
- pdf2text
- strip
If the PDF file is password protected, you can specify its password:
url: https://example.net/pdf-test-password.pdf
filter:
- pdf2text:
password: webchangessecret
- strip
Optional sub-directives¶
password
: password for a password-protected PDF file
Required packages¶
To run jobs with this filter, you need to install Optional packages. Install them using:
pip install --upgrade webchanges[pdf2text]
In addition, you need to install any of the OS-specific dependencies of Poppler (see website).
Example:
name: Convert PDF to text
url: https://example.net/sample.pdf
filter:
- pdf2text:
password: pdfpassword
format-json¶
This filter deserializes a JSON object and reformats it using Python’s json.dumps with indentations.
Optional sub-directives¶
indentation (defaults to 4): indent to pretty-print JSON array elements. None selects the most compact representation.
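For example (hypothetical URL), to pretty-print JSON with a 2-space indent instead of the default 4:
url: https://example.net/api-data.json
filter:
  - format-json:
      indentation: 2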
format-xml¶
This filter deserializes an XML object and reformats it using the lxml Python package’s etree.tostring pretty_print option.
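A minimal example (hypothetical URL):
url: https://example.com/format-xml.xml
filter:
  - format-xml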
ical2text¶
This filter reads an iCalendar document and converts it to easy-to-read text.
name: "Make iCal file readable test"
url: https://example.com/cal.ics
filter:
- ical2text:
Required packages¶
To run jobs with this filter, you need to install Optional packages. Install them using:
pip install --upgrade webchanges[ical2text]
hexdump¶
This filter displays the contents in both hexadecimal and ASCII (hex dump format).
name: Display binary and ASCII test
command: cat testfile
filter:
- hexdump:
sha1sum¶
This filter calculates a SHA-1 hash for the document.
name: "Calculate SHA-1 hash test"
url: https://example.com/sha.html
filter:
- sha1sum:
keep_lines_containing¶
This filter emulates Linux’s grep using Python’s regular expression matching (regex) and keeps only lines that match the pattern, discarding the others. Note that notwithstanding its name, this filter does not use the grep executable.
Example: convert HTML to text, strip whitespace, and keep only lines that have the sequence a,b: in them:
name: Keep line matching test
url: https://example.com/keep_lines_containing.html
filter:
- html2text:
- strip:
- keep_lines_containing:
re: 'a,b:'
Example: keep only lines that contain “error” irrespective of its case (e.g. Error, ERROR, etc.):
name: "Lines with error in them, case insensitive"
url: https://example.com/keep_lines_containing_i.txt
filter:
- keep_lines_containing:
re: '(?i)error'
delete_lines_containing¶
This filter is the inverse of keep_lines_containing above and keeps only lines that do not match the text or the regular expression, discarding the others.
Example: eliminate lines that contain “xyz”:
name: "Delete lines containing xyz"
url: https://example.com/delete_lines_containing.txt
filter:
- delete_lines_containing: 'xyz'
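Assuming this filter accepts the same re: sub-directive as keep_lines_containing above (hypothetical URL), a regular expression can be used instead of a plain string:
url: https://example.com/delete_lines_containing_re.txt
filter:
  - delete_lines_containing:
      re: '(?i)debug'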
re.sub¶
This filter removes or replaces text using regular expressions.
Just specifying a string as the value will remove the matches.
Simple patterns can be replaced with another string using pattern as the expression and repl as the replacement.
You can use regex groups (()) and back-reference them with \1 (etc.) to put groups into the replacement string.
All features are described in Python’s re.sub documentation.
The pattern and repl values are passed to this function as-is.
Just like Python’s re.sub function, it is possible to apply a regular expression and either remove or replace the matched text. The following example applies the filter 3 times:
name: "re.sub test"
url: https://example.com/re_sub.txt
filter:
- re.sub: '\s*href="[^"]*"'
- re.sub:
pattern: '<h1>'
repl: 'HEADING 1: '
- re.sub:
pattern: '</([^>]*)>'
repl: '<END OF TAG \1>'
Optional sub-directives¶
pattern: the pattern to be replaced. This sub-directive must be specified if also using the repl sub-directive; otherwise the pattern can be specified as the value of re.sub.
repl: the string for replacement. If this sub-directive is missing, it defaults to the empty string (i.e. deletes the string matched in pattern).
strip¶
This filter removes leading and trailing whitespace. It applies to the entire document; it is not applied line by line.
name: "Stripping leading and trailing whitespace test"
url: https://example.com/strip.html
filter:
- strip:
sort¶
This filter performs a line-based sort, ignoring case (case folding as per Python’s implementation).
If the source provides data in random order, you should sort it before the comparison in order to avoid diffing based only on changes in the sequence.
name: "Sorting lines test"
url: https://example.net/sorting.txt
filter:
- sort
The sort filter takes an optional separator parameter that defines the item separator (by default sorting is line-based), for example to sort text paragraphs (text separated by an empty line):
url: https://example.org/paragraphs.txt
filter:
- sort:
separator: "\n\n"
This can be combined with a boolean reverse option, which is useful for sorting and reversing with the same separator (using % as separator, this would turn 3%2%4%1 into 4%3%2%1):
url: https://example.org/sort-reverse-percent.txt
filter:
- sort:
separator: '%'
reverse: true
reverse¶
This filter reverses the order of items (lines) without sorting:
url: https://example.com/reverse-lines.txt
filter:
- reverse
This behavior can be changed by using an optional separator string argument (e.g. items separated by a pipe (|) symbol, as in 1|4|2|3, which would be reversed to 3|2|4|1):
url: https://example.net/reverse-separator.txt
filter:
- reverse: '|'
Alternatively, the filter can be specified more verbosely with a dict. In this example "\n\n" is used to separate paragraphs (items that are separated by an empty line):
url: https://example.org/reverse-paragraphs.txt
filter:
- reverse:
separator: "\n\n"
ocr¶
This filter extracts text from images using the Tesseract OCR engine. It requires two Python modules to be installed: pytesseract and Pillow. Any file format supported by Pillow (PIL) is supported.
This filter must be the first filter in a chain of filters, since it consumes binary data and outputs text data.
url: https://example.net/ocr-test.png
filter:
- ocr:
timeout: 5
language: eng
- strip
Optional sub-directives¶
timeout: timeout for the recognition, in seconds (default: 10 seconds)
language: text language (e.g. fra or eng+fra; default: eng)
Required packages¶
To run jobs with this filter, you need to install Optional packages. Install them using:
pip install --upgrade webchanges[ocr]
In addition, you need to install Tesseract.
shellpipe¶
The data to be filtered is passed to a command or script, and the output from the script is used. The environment variable URLWATCH_JOB_NAME will have the name of the job, while URLWATCH_JOB_LOCATION its location (either URL or command).
url: https://example.net/shellpipe.html
filter:
- shellpipe: customscript.py
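As a sketch (hypothetical URL and command), an inline shell pipeline can make use of these environment variables, e.g. to prefix each line of the output with the job's name; this assumes the command is run through a shell so the variables are expanded:
url: https://example.net/shellpipe-env.html
filter:
  - shellpipe: |
      sed "s/^/[$URLWATCH_JOB_NAME] /"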