Introduction
webchanges monitors the content of webpages (or the output of commands run in your computer's shell).
Every time you run webchanges, it:
retrieves the output;
transforms and filters it (optional);
compares this with what it saved from the previous run, producing a “diff” report if it finds changes;
(optional) filters the diff report;
displays the report (default) and/or sends it via one or more methods, such as email.
Jobs
Each such source of data is a “job”. The instructions for each job are contained in a YAML-format configuration file called jobs.yaml, located in ~/.config/webchanges (Linux), ~/Library/Preferences/webchanges (macOS), or in the webchanges folder within your Documents folder, i.e. %USERPROFILE%/Documents/webchanges (Windows). You can edit it with any text editor or by running:
webchanges --edit
For additional information on YAML, see the YAML specifications. You can learn more about when to use quotes in YAML here (note: the library we use supports YAML 1.1, and our examples use “flow scalars”). URLs are always safe and don’t need to be enclosed in quotes.
The minimum configuration necessary for webchanges to work is a single url directive (for web resources) or command directive (for the output of a shell command):
url: https://example.com/
If you have multiple sources to monitor, i.e. multiple “jobs”, separate each with a line of three dashes (---):
url: https://example.com/
---
url: https://example.com/page2
---
command: dir
By default, the content is downloaded as-is. If a webpage needs to be rendered and its JavaScript run before the HTML is captured, add the directive use_browser: true. This requires additional installations and uses many resources; see here for more information.
url: https://example.com/
use_browser: true
You can add a name to help you identify what you’re monitoring; if you don’t, webchanges will automatically try to use the webpage’s title:
name: Example
url: https://example.com/
You can enter remarks in your YAML configuration file by using #:
# I am monitoring this site because I expect it to change for the better
name: Sample # One more remark
url: https://example.com/
Finally, you have a choice of many options to finely control the data acquisition; see here.
Filters
Once you have collected the output, you may transform it to increase its utility. You use the filter directive to activate one or more Filters to:
select HTML (or XML): css, xpath, element-by-class, element-by-id, element-by-style, element-by-tag
extract text from HTML: html2text
make HTML more readable: beautify
extract text from PDF: pdf2text, ocr
make JSON more readable: format-json
make XML more readable: format-xml
make iCal more readable: ical2text
make binary readable: hexdump
detect if anything changed: sha1sum
edit text: keep_lines_containing, delete_lines_containing, re.sub, strip, sort
If all you’re doing is monitoring the text of a website, this filter will do it:
url: https://example.com/
filter:
  - html2text:  # notice the 2 empty spaces before the hyphen and the colon at the end
Filters can be chained. As an example, after retrieving an HTML document by using the url directive, you can extract a selection with the xpath filter, convert it to text with html2text, extract only lines matching a specific regular expression with keep_lines_containing, and sort the result with sort:
name: Sample webchanges job definition
url: https://example.com/
https_proxy: http://user:password@example.net:8080
max_tries: 2
filter:
  - xpath: //section[@role="main"]
  - html2text:
      method: html2text
      unicode_snob: true
      body_width: 0
      inline_links: false
      ignore_links: true
      ignore_images: true
      pad_tables: false
      single_line_break: true
  - keep_lines_containing: lines I care about
  - sort:
---
Filters are explained here.
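Conceptually, a filter chain behaves like function composition: each filter receives the text produced by the one before it. The Python sketch below mimics the last two steps of the job above. The function names mirror the filter names, but the implementations and the sample text are assumptions made for illustration only; they are not webchanges's own code.

```python
from functools import reduce
from typing import Callable, List

Filter = Callable[[str], str]


def keep_lines_containing(text: str) -> Filter:
    """Return a filter that keeps only the lines containing `text`."""
    def apply(data: str) -> str:
        return "\n".join(line for line in data.splitlines() if text in line)
    return apply


def sort_lines(data: str) -> str:
    """Sort the lines of `data` alphabetically."""
    return "\n".join(sorted(data.splitlines()))


def run_chain(data: str, filters: List[Filter]) -> str:
    """Apply each filter in order, feeding each one the previous output."""
    return reduce(lambda current, f: f(current), filters, data)


# Hypothetical text, standing in for the output of the html2text step.
page_text = "zebra: lines I care about\nnoise\napple: lines I care about"
result = run_chain(page_text, [keep_lines_containing("lines I care about"), sort_lines])
print(result)
# apple: lines I care about
# zebra: lines I care about
```

Because each step only sees the previous step's output, the order of the filters matters: sorting before selecting lines would give the same result here, but converting to text after selecting lines generally would not.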
Comparison
webchanges then automatically compares the filtered data collected in this run with the data saved from a prior run, computing a diff in the unified format by default.
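To show what a diff in the unified format looks like, here is a minimal sketch using Python's standard difflib module. The sample lines and the "previous run" / "this run" labels are made up for the example, and this is not necessarily how webchanges produces its reports.

```python
import difflib

# Hypothetical filtered data from two consecutive runs.
old = ["Price: 10 USD", "In stock: yes"]
new = ["Price: 12 USD", "In stock: yes"]

diff = list(
    difflib.unified_diff(old, new, fromfile="previous run", tofile="this run", lineterm="")
)
print("\n".join(diff))
# --- previous run
# +++ this run
# @@ -1,2 +1,2 @@
# -Price: 10 USD
# +Price: 12 USD
#  In stock: yes
```

Lines starting with - were removed, lines starting with + were added, and lines starting with a space are unchanged context.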
Diff filters
After the comparison is generated, you can apply one of the filters above to the diff itself (see Diff filters) or one of the two diff-specific ones:
additions_only
deletions_only
If all you’re doing is monitoring the text of a website to see if anything was added, this job definition will do it:
url: https://example.com/
filter:
  - html2text:  # notice the 2 empty spaces before the hyphen and the colon at the end
additions_only:
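The idea behind the two diff-specific filters can be sketched in a few lines of Python: take a unified diff and keep only the lines with the wanted prefix. The helper name filter_diff and the sample data are hypothetical, and webchanges's real output may differ, for instance in which header lines it keeps.

```python
import difflib
from typing import List


def filter_diff(diff_lines: List[str], prefix: str) -> List[str]:
    """Keep only the changed lines that start with `prefix` ("+" or "-").

    The first two lines of a unified diff are the "---"/"+++" file headers,
    so they are skipped before matching on the prefix.
    """
    body = diff_lines[2:]
    return [line for line in body if line.startswith(prefix)]


# Hypothetical filtered data from two consecutive runs.
old = ["Item A", "Item B"]
new = ["Item A", "Item C", "Item D"]
diff = list(difflib.unified_diff(old, new, fromfile="old", tofile="new", lineterm=""))

print(filter_diff(diff, "+"))  # additions only: ['+Item C', '+Item D']
print(filter_diff(diff, "-"))  # deletions only: ['-Item B']
```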