Metadata-Version: 2.4
Name: cs-urlutils
Version: 20260531
Summary: convenience functions for working with URLs
Keywords: python3
Author-email: Cameron Simpson <cs@cskk.id.au>
Description-Content-Type: text/markdown
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)
Requires-Dist: beautifulsoup4
Requires-Dist: cs.lex>=20260526
Requires-Dist: cs.logutils>=20250323
Requires-Dist: cs.obj>=20260526
Requires-Dist: cs.rfc2616>=20260531
Requires-Dist: cs.threads>=20260531
Requires-Dist: html5lib
Requires-Dist: lxml
Project-URL: MonoRepo Commits, https://bitbucket.org/cameron_simpson/css/commits/branch/main
Project-URL: Monorepo Git Mirror, https://github.com/cameron-simpson/css
Project-URL: Monorepo Hg/Mercurial Mirror, https://hg.sr.ht/~cameron-simpson/css
Project-URL: Source, https://github.com/cameron-simpson/css/blob/main/lib/python/cs/urlutils.py

URL related utility functions and classes.
- Cameron Simpson <cs@cskk.id.au> 26dec2011

*Latest release 20260531*:
* Require html5lib and lxml and python 3 urllib modules.
* URL.flush: clean out defined cached attributes.
* URL: new session() context manager to make a requests.Session.
* URL: hrefs, srcs: return a URLs collection.
* URL: Promotable and Formatable.
* URL: replace caching methods .GET() and .HEAD() with @cached_property .GET_response and .HEAD_response.
* URL: new .url_parsed property being the namedtuple from urlparse, drop .parts.
* URL: new .query_dict() method, returning the query parameters as a dict.
* URL: new .cleanpath and .cleanrpath properties.
* URL: new .urlto(other_url) to resolve other_url against self, use it in hrefs() and srcs().
* URL: rename content_type to content_type_full, make content_type the plain text/html value.
* URL: make .text a cached_property, get the soup using just lxml (the list-of-parsers approach seems unsupported).
* URL: new .short attribute being a shortened URL for messages.
* URL: new .ext property for the URL file extension.
* URL: new isabs() method to test if a URL has a hostname and a path commencing with /
* URL: support extending a URL with /

Short summary:
* `NetrcHTTPPasswordMgr`: A subclass of `HTTPPasswordMgrWithDefaultRealm` that consults the `.netrc` file if no overriding credentials have been stored.
* `skip_url_errs`: A version of `cs.seq.skip_map` which skips `URLError` and `HTTPError`.
* `strip_whitespace`: Strip whitespace characters from a string, per HTML 4.01 section 1.6 and appendix E.
* `URL`: Utility class to do simple stuff to URLs, subclasses `str`.
* `urljoin`: This is `urllib.parse.urljoin` after coercing both arguments to `str`.

Module contents:
- <a name="NetrcHTTPPasswordMgr"></a>`class NetrcHTTPPasswordMgr(urllib.request.HTTPPasswordMgrWithDefaultRealm)`: A subclass of `HTTPPasswordMgrWithDefaultRealm` that consults
  the `.netrc` file if no overriding credentials have been stored.
- <a name="skip_url_errs"></a>`skip_url_errs(func, *iterables, **skip_map_kw)`: A version of `cs.seq.skip_map` which skips `URLError` and `HTTPError`.
- <a name="strip_whitespace"></a>`strip_whitespace(s)`: Strip whitespace characters from a string, per HTML 4.01 section 1.6 and appendix E.
- <a name="URL"></a>`class URL(cs.threads.HasThreadState, cs.lex.FormatableMixin, cs.deco.Promotable)`: Utility class to do simple stuff to URLs, subclasses `str`.

*`URL.__init__(self, url_s: str, referer=None, soup=None, text=None)`*:
Initialise the `URL` from the URL string `url_s`.

*`URL.__getattr__(self, attr)`*:
Ad hoc attributes.
Upper case attributes named "FOO" parse the text and find
the (sole) node named "foo".
Upper case attributes named "FOOs" parse the text and find
all the nodes named "foo".

*`URL.__truediv__(self, subpath)`*:
Return a new `URL` with `subpath` appended.

*`URL.basename`*:
The URL basename.

*`URL.cleanrpath`*:
The `cleanpath` with its leading slash stripped.

*`URL.content`*:
The decoded URL content as `bytes`.

*`URL.content_length`*:
The value of the Content-Length: header or `None`.

*`URL.content_transfer_encoding`*:
The URL content transfer encoding.

*`URL.context`*

*`URL.default_limit(self)`*:
Default URLLimit for this URL: same host:port, any subpath.

*`URL.domain`*:
The URL domain - the hostname with the first dotted component removed.
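
The rule above can be illustrated with a minimal standalone sketch using only the standard library. `domain_of` is a hypothetical helper, not part of this module, and the module's own handling of edge cases (such as a hostname with no dot) may differ.

```python
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Return the hostname of `url` with its first dotted component removed.

    Hypothetical helper for illustration; returns "" if there is no
    hostname or the hostname contains no dot.
    """
    hostname = urlparse(url).hostname or ""
    # Split once on the first dot and keep everything after it.
    _, _, rest = hostname.partition(".")
    return rest
```

For example, `domain_of("https://www.example.com/index.html")` drops the leading `www` component.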

*`URL.exists(self) -> bool`*:
Test if this URL exists via a `HEAD` request.

*`URL.ext`*:
The URL basename file extension, as from `os.path.splitext`.
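
As a rough standalone equivalent, assuming the basename is the last component of the URL path, the extension could be computed as below. `url_ext` is a hypothetical name used only for this sketch.

```python
import os.path
import posixpath
from urllib.parse import urlparse


def url_ext(url: str) -> str:
    """Return the file extension of the URL path basename, or ""."""
    # URL paths always use "/" separators, so take the basename with posixpath.
    basename = posixpath.basename(urlparse(url).path)
    # splitext returns (root, ext); ext includes the leading dot.
    return os.path.splitext(basename)[1]
```

As with `os.path.splitext`, only the final suffix is returned, so a name such as `report.tar.gz` yields `.gz`.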

*`URL.feedparsed(self)`*:
A parse of the content via the feedparser module.

*`URL.find_all(self, *a, **kw)`*:
Convenience routine to call BeautifulSoup's .find_all() method.

*`URL.flush(self)`*:
Forget all cached content.

*`URL.format_kwargs(self)`*:
Return a dict for use with `FormatableMixin.format_as()`.

*`URL.fragment`*:
The URL fragment as returned by `urlparse.urlparse`.

*`URL.headers`*:
A `requests.Response` headers mapping.

*`URL.hostname`*:
The URL hostname as returned by `urlparse.urlparse`.

*`URL.hrefs(self, absolute=False) -> Iterable[ForwardRef('URL')]`*:
All 'href=' values from the content HTML 'A' tags.
If `absolute`, resolve the sources with respect to our URL.

*`URL.isabs(self)`*:
Test whether this `URL` is absolute, having a hostname and
a path commencing with `'/'`.
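
The test described above can be sketched with the standard library alone. `is_absolute` is a hypothetical helper written for this example, not the module's own code.

```python
from urllib.parse import urlparse


def is_absolute(url: str) -> bool:
    """Test for a hostname and a path commencing with "/"."""
    parsed = urlparse(url)
    return bool(parsed.hostname) and parsed.path.startswith("/")
```

Under this sketch `https://example.com` is not absolute because its path is empty, and `/a/b` is not absolute because it has no hostname.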

*`URL.last_modified`*:
The value of the Last-Modified: header as a UNIX timestamp, or None.
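
One way to perform this kind of conversion is with `email.utils.parsedate_to_datetime` from the standard library. The sketch below assumes a plain mapping of header names to values; `last_modified_timestamp` is a hypothetical helper and the module may parse the header differently.

```python
from email.utils import parsedate_to_datetime


def last_modified_timestamp(headers):
    """Return the Last-Modified header as a UNIX timestamp, or None."""
    value = headers.get("Last-Modified")
    if value is None:
        return None
    # An HTTP date such as "Wed, 21 Oct 2015 07:28:00 GMT" parses to a
    # timezone-aware datetime, so .timestamp() is independent of local time.
    return parsedate_to_datetime(value).timestamp()
```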

*`URL.netloc`*:
The URL netloc as returned by `urlparse.urlparse`.

*`URL.normalised(self)`*:
Return a normalised URL where "." and ".." components have been processed.
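
To illustrate the idea, here is a sketch which substitutes `posixpath.normpath` for whatever this method does internally. `normalise_url` is a hypothetical name, and note that `normpath` also drops a trailing slash, which the real method may or may not do.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def normalise_url(url: str) -> str:
    """Return `url` with "." and ".." path components processed."""
    parts = urlsplit(url)
    path = parts.path
    if path:
        # normpath collapses "." and resolves ".." against the preceding component.
        path = posixpath.normpath(path)
    # Reassemble the URL, leaving scheme, netloc, query and fragment untouched.
    return urlunsplit(parts._replace(path=path))
```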

*`URL.params`*:
The URL params as returned by `urlparse.urlparse`.

*`URL.password`*:
The URL password as returned by `urlparse.urlparse`.

*`URL.path`*:
The URL path as returned by `urlparse.urlparse`.

*`URL.path_elements`*:
Return the non-empty path components; NB: a new list every time.

*`URL.port`*:
The URL port as returned by `urlparse.urlparse`.

*`URL.promote(obj)`*:
Promote `obj` to an instance of `cls`.
Instances of `cls` are passed through unchanged.
`str` is promoted directly to `cls(obj)`.
`(url,referer)` is promoted to `cls(url,referer=referer)`.
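
The three promotion rules can be sketched on a toy `str` subclass. `MiniURL` is a hypothetical class invented for this example; it is not the module's `URL` and omits everything except the `referer` attribute.

```python
class MiniURL(str):
    """A toy str subclass carrying an optional referer (illustration only)."""

    def __new__(cls, url_s, referer=None):
        self = super().__new__(cls, url_s)
        self.referer = referer
        return self

    @classmethod
    def promote(cls, obj):
        # Instances of cls are passed through unchanged.
        if isinstance(obj, cls):
            return obj
        # A plain str is promoted directly.
        if isinstance(obj, str):
            return cls(obj)
        # Otherwise expect a (url, referer) pair.
        try:
            url, referer = obj
        except (TypeError, ValueError):
            raise TypeError(
                f"{cls.__name__}.promote: cannot promote {obj!r}"
            ) from None
        return cls(url, referer=referer)
```

A function accepting "a URL or a string" can then call `MiniURL.promote(arg)` once at its top and work with a single type afterwards.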

*`URL.query`*:
The URL query as returned by `urlparse.urlparse`.

*`URL.query_dict(self)`*:
Return a new `dict` containing the parsed param=value pairs from `self.query`.
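
A minimal standalone sketch of this kind of parse, using `urllib.parse.parse_qsl`. The function `query_dict_of` is hypothetical, and it is an assumption of this sketch that a repeated parameter keeps only its last value; the method itself may behave differently.

```python
from urllib.parse import parse_qsl, urlparse


def query_dict_of(url: str) -> dict:
    """Return a new dict of the param=value pairs in the URL query string."""
    # parse_qsl yields (name, value) pairs with percent-escapes and "+" decoded;
    # dict() keeps the last value for any repeated name.
    return dict(parse_qsl(urlparse(url).query))
```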

*`URL.resolve(self, base)`*:
Resolve this URL with respect to a base URL.

*`URL.rpath`*:
The URL path as returned by `urlparse.urlparse`, after any leading slashes.

*`URL.savepath(self, rootdir)`*:
Compute a local filesystem save pathname for this URL.
This scheme is designed to accommodate the fact that 'a',
'a/' and 'a/b' can all coexist.
Extend any component ending in '.' with another '.'.
Extend directory components with '.d.'.

*`URL.scheme`*:
The URL scheme as returned by `urlparse.urlparse`.

*`URL.session(self, session=None)`*:
Context manager yielding a `requests.Session`.

*`URL.short`*:
A shortened form of the URL for use in messages.

*`URL.srcs(self, *a, **kw)`*:
All 'src=' values from the content HTML.
If `absolute`, resolve the sources with respect to our URL.

*`URL.unsavepath(savepath)`*:
Compute URL path component from a savepath as returned by URL.savepath.
This should always round trip with URL.savepath.

*`URL.urlto(self, other: Union[ForwardRef('URL'), str]) -> 'URL'`*:
Return `other` resolved against `self.baseurl`.
If `other` is an absolute URL it will not be changed.
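
The resolution behaviour described here matches that of `urllib.parse.urljoin`, so it can be illustrated with the standard library alone. `resolve_against` is a hypothetical wrapper for this sketch; that the method is built on `urljoin` is an assumption.

```python
from urllib.parse import urljoin


def resolve_against(base: str, other: str) -> str:
    """Return `other` resolved against `base`."""
    # Coerce both to str first, as this module's own urljoin wrapper does.
    return urljoin(str(base), str(other))


base = "https://example.com/docs/index.html"
relative = resolve_against(base, "img/logo.png")
rooted = resolve_against(base, "/top")
absolute = resolve_against(base, "https://other.example.org/x")
```

Here `relative` lands beside `index.html`, `rooted` replaces the whole path, and `absolute` comes back unchanged.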

*`URL.username`*:
The URL username as returned by `urlparse.urlparse`.

*`URL.walk(self, limit=None, seen=None, follow_redirects=False)`*:
Walk a website from this URL yielding this and all descendent URLs.
`limit`: an object with a constraint test method "ok".
         If not supplied, limit URLs to the same host and port.
`seen`: a setlike object with a "__contains__" method and an "add" method.
         URLs already in the set will not be yielded or visited.
`follow_redirects`: whether to follow URL redirects
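
The roles of `limit` and `seen` can be shown with a small breadth-first sketch over an in-memory link table, with no network access. Everything here (`walk_urls`, `SameHostLimit`, `get_links`, `LINKS`) is hypothetical and the traversal order of the real method may differ.

```python
from collections import deque
from urllib.parse import urlparse


class SameHostLimit:
    """Accept only URLs whose netloc (host:port) matches the starting URL."""

    def __init__(self, url):
        self.netloc = urlparse(url).netloc

    def ok(self, url):
        return urlparse(url).netloc == self.netloc


def walk_urls(start, get_links, limit=None, seen=None):
    """Yield `start` and every URL reachable from it, each at most once."""
    if limit is None:
        limit = SameHostLimit(start)
    if seen is None:
        seen = set()
    todo = deque([start])
    while todo:
        url = todo.popleft()
        # Skip URLs already visited or rejected by the limit.
        if url in seen or not limit.ok(url):
            continue
        seen.add(url)
        yield url
        todo.extend(get_links(url))


# A made-up site: page -> links found on that page.
LINKS = {
    "http://example.com/": ["http://example.com/a", "http://other.example.org/"],
    "http://example.com/a": ["http://example.com/", "http://example.com/b"],
}
visited = list(walk_urls("http://example.com/", lambda u: LINKS.get(u, [])))
```

Passing your own `seen` set lets several walks share state, so that a later walk skips pages an earlier one already yielded.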

*`URL.xml_find_all(self, match)`*:
Convenience routine to call ElementTree.XML's .findall() method.
- <a name="urljoin"></a>`urljoin(url, other_url)`: This is `urllib.parse.urljoin` after coercing both arguments to `str`.

# Release Log



*Release 20260531*:
* Require html5lib and lxml and python 3 urllib modules.
* URL.flush: clean out defined cached attributes.
* URL: new session() context manager to make a requests.Session.
* URL: hrefs, srcs: return a URLs collection.
* URL: Promotable and Formatable.
* URL: replace caching methods .GET() and .HEAD() with @cached_property .GET_response and .HEAD_response.
* URL: new .url_parsed property being the namedtuple from urlparse, drop .parts.
* URL: new .query_dict() method, returning the query parameters as a dict.
* URL: new .cleanpath and .cleanrpath properties.
* URL: new .urlto(other_url) to resolve other_url against self, use it in hrefs() and srcs().
* URL: rename content_type to content_type_full, make content_type the plain text/html value.
* URL: make .text a cached_property, get the soup using just lxml (the list-of-parsers approach seems unsupported).
* URL: new .short attribute being a shortened URL for messages.
* URL: new .ext property for the URL file extension.
* URL: new isabs() method to test if a URL has a hostname and a path commencing with /
* URL: support extending a URL with /

*Release 20231129*:
* Drop Python 2 support.
* No longer use cs.xml, which is going away.
* Make _URL type public as URL with a new promote() method, drop URL factory function, update URL constructors throughout.
* URL.__init__: make parameters keyword only.

*Release 20191004*:
Small updates for changes to other modules.

*Release 20160828*:
Use "install_requires" instead of "requires" in DISTINFO.

*Release 20160827*:
* Handle TimeoutError, reporting elapsed time.
* URL: present ._fetch as .GET.
* URL: add .resolve to resolve this URL against a base URL.
* URL: add .savepath and .unsavepath methods to generate nonconflicting save pathnames for URLs and the reverse.
* URL._fetch: record the post-redirection URL as final_url.
* New URLLimit class for specifying simple tests for URL acceptance.
* New walk() method to walk a website from a starting URL, yielding URLs.
* URL.content_length property, returns int or None if header missing.
* New URL.normalised method to return URL with . and .. processed in the path.
* New URL.exists test function.
* Assorted bugfixes and improvements.

*Release 20150116*:
Initial PyPI release.
