Metadata-Version: 2.4
Name: epublib
Version: 0.1.0
Summary: A modern EPUB3 python library
Author-email: João Seckler <jseckler@riseup.net>
Requires-Python: >=3.13
Requires-Dist: beautifulsoup4>=4.13.5
Requires-Dist: lxml>=6.0.1
Description-Content-Type: text/markdown

# EPUBLib

A spec-compliant, memory-efficient EPUB3 library. Designed for editing
EPUBs, but can also create them.

* Spec compliant: the code aims to comply with the
  [EPUB 3.3 specification](https://www.w3.org/TR/epub-33/). It does not
  attempt to validate the EPUB, though; use
  [Ace by DAISY](https://daisy.org/activities/software/ace/) and
  [EPUBCheck](https://www.w3.org/publishing/epubcheck/) for that;
* Memory efficient: leverages the `zipfile` module from the Python standard
  library to load data into memory only as needed;
* Designed for editing: handles EPUBs non-intrusively (e.g. won't
  recreate the manifest and the metadata).
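
To illustrate the "memory efficient" point, here is a minimal sketch of on-demand loading with the standard library's `zipfile` module. It is not EPUBLib's own code: the archive is a tiny in-memory stand-in rather than a valid EPUB, and the variable names are made up for the example.

```python
import io
import zipfile

# Build a tiny stand-in archive in memory (a real EPUB needs more files).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mimetype", "application/epub+zip")
    archive.writestr("Text/chapter1.xhtml", "<html><body><p>Hi</p></body></html>")

with zipfile.ZipFile(buffer) as archive:
    # Opening the archive only reads its directory (names, sizes, offsets).
    names = archive.namelist()
    # A member's bytes are read and decompressed only when explicitly requested.
    chapter = archive.read("Text/chapter1.xhtml")

print(names)  # ['mimetype', 'Text/chapter1.xhtml']
```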

## Installation

```bash
pip install epublib
```

### Dependencies

Installing EPUBLib will also install its dependencies:

* BeautifulSoup (`pip install beautifulsoup4`)
* lxml (`pip install lxml`)

## Contributing

1. Use [`uv`](https://docs.astral.sh/uv/) to manage development
   dependencies. Sync with `uv sync --all-packages`
2. `pre-commit install`

## Related

* [Ebooklib](https://github.com/aerkalov/ebooklib)
* [Sigil](https://sigil-ebook.com/)

## Usage

### Basic usage

```python
from epublib import EPUB

with EPUB("book.epub") as book:
    book.metadata.title = "New title"

    for doc in book.documents():
        new_script = doc.soup.new_tag("script", attrs={"src": "../Misc/myscript.js"})
        doc.soup.head.append(new_script)

        new_heading = doc.soup.new_tag("h1", string="New heading")
        doc.soup.body.insert(0, new_heading)

    book.update_manifest_properties()
    book.write("book-modified.epub")
```

### Reading, writing and creating

```python
from epublib import EPUB

# From path
with EPUB("book.epub") as book:
    book.write("book-modified.epub")

# From file
with open("book.epub", "rb") as source:
    book = EPUB(source)

    with open("book-modified.epub", "wb") as target:
        book.write(target)

# Read from folder path (unzipped EPUB)
with EPUB("book-folder/") as book:
    book.write_to_folder("book-folder-modified/")

# Create new EPUB
book = EPUB()
book.metadata.title = "A new book"
book.metadata.identifier = "urn:uuid:123e4567-e89b-12d3-a456-426614174000"
book.metadata.language = "en"
book.nav.soup.title.string = "Navigation title"

# the default TOC comes with one single self referential item
book.nav.toc.text = "Toc title" # Title of the toc
item_referencing_toc = next(book.nav.toc.items_referencing(book.nav.filename))
item_referencing_toc.text = "Toc title"
```

EPUBLib does not guarantee the validity of the EPUB resulting from
calling `EPUB()`. It is the user's responsibility to add, at least:

* a title (`book.metadata.title = <title>`)
* an identifier (`book.metadata.identifier = <id>`)
* a language (`book.metadata.language = <language>`)
* a title for the navigation document (`book.nav.soup.title.string = <title>`)
* a title for the elements of the table of contents (see the example above
  for one way of doing it)

### Accessing resources

Each resource corresponds to a file in the EPUB archive.

```python
import zipfile

from epublib import EPUB
from epublib.mediatype import MediaType, Category

book = EPUB("book.epub")
book.resources #  all resources
print([resource.filename for resource in book.resources])
# [
#     "mimetype",
#     "META-INF/container.xml",
#     "content.opf",
#     "Text/chapter1.xhtml",
#     "Images/image.png",
#     ...,
# ]

resource = book.resources.get("Text/chapter1.xhtml")

assert resource.filename == "Text/chapter1.xhtml"
assert isinstance(resource.content, bytes)
assert isinstance(resource.zipinfo, zipfile.ZipInfo)

documents = book.documents() # All XHTML and SVG resources
images = book.images() # All image resources
scripts = book.scripts() # All JavaScript resources
styles = book.styles() # All style resources

assert book.resources.get("Text/chapter1.xhtml") # ContentDocument(Text/chapter1.xhtml)
assert book.resources.get("Images/image.png") # PublicationResource(Images/image.png)

pngs = book.resources.filter(MediaType.IMAGE_PNG) # All PNG images
assert all(img.media_type == MediaType.IMAGE_PNG for img in pngs)

images = book.resources.filter(Category.IMAGE) # All images. Same as book.images()
assert all(img.media_type.category == Category.IMAGE for img in images)
```

#### Creating

```python
from epublib import EPUB
from epublib.identifier import EPUBId
from epublib.resources import PublicationResource, ContentDocument
from epublib.resources.create import create_resource_from_path, create_resource

book = EPUB("book.epub")

# Create a new resource from filesystem path
new_resource = create_resource_from_path("new-image.jpg", "Images/name-in-epub.jpg")
assert isinstance(new_resource, PublicationResource)
book.resources.add(resource=new_resource)

# Create a new resource from content

xhtml = """\
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
<head>
  <title>A Small Document</title>
</head>
<body>
  <p>A simple page!</p>
</body>
</html>
"""

new_resource = create_resource(xhtml.encode(), "Text/Chapter4.xhtml")
assert isinstance(new_resource, ContentDocument)
book.resources.add(resource=new_resource)

# More options when adding are available (see full signature in the API
# documentation)
new_resource = create_resource(xhtml.encode(), "Text/Chapter5.xhtml")
book.resources.add(
    resource=new_resource,
    is_cover = False,
    position = 0, # position in book.resources list
                  # (and thus in archive). Default: None
    after = "Text/Chapter1.xhtml", # insert after this resource, default: None
    before = None,                 # insert before this resource

    # if None, it will be added unless it is the mimetype or the container.xml file
    # caution: setting this to False or True may yield invalid EPUBs
    add_to_manifest = None,

    add_to_spine = None,
    spine_position = None,
    linear = None,
    add_to_toc = None,
    toc_position = None,
)
```

#### Removing

```python
from epublib import EPUB
from epublib.identifier import EPUBId

book = EPUB("book.epub")

resource = book.resources.get("Text/chapter1.xhtml")
book.resources.remove(resource)

# It is possible to use the filename directly
book.resources.remove("Images/image.png")

# or the manifest item id
book.resources.remove(EPUBId("nav"))

# If it is a CSS or JS file, you can set the remove_css_js_links flag
# to remove any <link rel="stylesheet"> or <script> tags pointing to it
book.resources.remove("Styles/style.css", remove_css_js_links=True)

# If it has any other type, you'll have to individually remove any
# references to it
```

#### Renaming

```python
from epublib import EPUB
from epublib.identifier import EPUBId

book = EPUB("book.epub")

resource = book.resources.get("Text/chapter1.xhtml")
book.resources.rename(resource, "Text/chapter-one.xhtml")

# The resource can also be given by its filename (this renames it back)
book.resources.rename("Text/chapter-one.xhtml", "Text/chapter1.xhtml")

# or by its manifest item id
book.resources.rename(EPUBId("chapter1"), "Text/chapter-one.xhtml")
```

By default, renaming a resource will update all references to it in the
rest of the book -- namely, in every `XMLResource` (see
[below](#internal-representation)). If you want to rename
a resource without updating references to it, you can set the
`update_references` flag to `False`:

```python
book = EPUB("book.epub")

book.resources.rename(
    "Text/chapter1.xhtml",
    "Text/chapter-one.xhtml",
    update_references=False,
)
```

By default, these references are looked up by using the following XML
attributes: `["href", "src", "full-path", "xlink:href"]`. If you want to
use a different set of attributes, you can pass them as a list to the
`reference_attrs` parameter:

```python
book = EPUB("book.epub")

book.resources.rename(
    "Text/chapter1.xhtml",
    "Text/chapter-one.xhtml",
    reference_attrs=["data-src", "href"],
)
```
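
To make the idea of attribute-based reference updating concrete, here is a rough sketch built only on the standard library. It is not how EPUBLib implements `rename`: the function `rewrite_references` and its parameters are hypothetical, and it only handles unprefixed attributes such as `href` and `src` (not `xlink:href`).

```python
import posixpath
import xml.etree.ElementTree as ET


def rewrite_references(xml_text, doc_path, old, new, attrs=("href", "src")):
    """Return xml_text with references to `old` re-pointed at `new`.

    doc_path, old and new are paths from the archive root, e.g. "Text/nav.xhtml".
    """
    root = ET.fromstring(xml_text)
    base = posixpath.dirname(doc_path)
    for element in root.iter():
        for attr in attrs:
            value = element.get(attr)
            if value is None:
                continue
            path, _, fragment = value.partition("#")
            if not path:
                continue  # a fragment-only link points inside the same document
            if posixpath.normpath(posixpath.join(base, path)) != old:
                continue  # this reference points somewhere else
            updated = posixpath.relpath(new, base or ".")
            element.set(attr, updated + ("#" + fragment if fragment else ""))
    return ET.tostring(root, encoding="unicode")


sample = '<nav><a href="chapter1.xhtml#s1">One</a><a href="other.xhtml">Two</a></nav>'
result = rewrite_references(
    sample, "Text/nav.xhtml", "Text/chapter1.xhtml", "Text/chapter-one.xhtml"
)
```

Only the first link is rewritten; its fragment is kept and the second link is left untouched.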

#### Internal representation

Resources are represented by instances of `epublib.resources.Resource`
or one of its subclasses, depending on the type of resource:

* `Resource`: generic resource. Usually, the only file in the EPUB that is
  represented by a generic Resource is
  [the `mimetype` file](https://www.w3.org/TR/epub-33/#sec-zip-container-mime);

* `XMLResource`: XML resources (XHTML, SVG, XML). Provides a `soup`
  attribute representing the content as a BeautifulSoup object.
  Subclasses `Resource`;

* `PublicationResource`: A resource that contributes to the logic and
  rendering of the publication. This includes CSS files, fonts, images,
  JavaScript files, XHTML and SVG (although the last two have their
  own specific subclass: see below). All publication resources should
  have a manifest entry associated to them. Provides a `media_type: str
  | MediaType` (more on media types [below](#media-types)). Subclasses
  `Resource`;

* `ContentDocument`: An XHTML or SVG document. Subclasses `XMLResource` and
  `PublicationResource`;

* `PackageDocument`: The package document (content.opf). Subclasses
  `XMLResource`. More about the package document [below](#the-package-document);

* `NavigationDocument`: An XHTML or SVG document that represents the
  navigation document of the EPUB (the one with `properties="nav"` in
  the manifest). Subclasses `ContentDocument`. More about the navigation
  document [below](#navigation-document);

* `NCXFile`: An XML document that represents the NCX file of the EPUB
  (if it exists). Subclasses `PublicationResource` and `XMLResource`. More
  about the NCX file [below](#ncx-file).

The class hierarchy is as follows:

```text
                     ┌────────┐
                ┌────│Resource│───────┐
                │    └────────┘       │
                │                     │
                │                     │
                │                     │
           ┌────▼──────┐    ┌─────────▼─────────┐
      ┌────│XMLResource│──┬─│PublicationResource│
      │    └───────────┘  │ └───────────────────┘
      │                   │
      │                   ├─────────────┐
      │                   │             │
┌─────▼─────────┐ ┌───────▼───────┐ ┌───▼───┐
│PackageDocument│ │ContentDocument│ │NCXFile│
└───────────────┘ └───────────────┘ └───────┘
                         │
                         │
                 ┌───────▼──────────┐
                 │NavigationDocument│
                 └──────────────────┘
```
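
The same hierarchy can be written as a skeleton of empty Python classes. These are stand-ins that reuse the names above purely for illustration; the real classes live in `epublib.resources` and carry actual behaviour. The order of the base classes follows the order in which the list above names them.

```python
class Resource:
    """Generic resource (e.g. the mimetype file)."""


class XMLResource(Resource):
    """Resource whose content is parsed as XML."""


class PublicationResource(Resource):
    """Resource with a manifest entry and a media type."""


class PackageDocument(XMLResource):
    pass


class ContentDocument(XMLResource, PublicationResource):
    pass


class NCXFile(PublicationResource, XMLResource):
    pass


class NavigationDocument(ContentDocument):
    pass


# A navigation document is both an XML resource and a publication resource.
nav = NavigationDocument()
assert isinstance(nav, XMLResource) and isinstance(nav, PublicationResource)

# The package document is XML, but it is not a publication resource.
assert not issubclass(PackageDocument, PublicationResource)

mro = [cls.__name__ for cls in NavigationDocument.__mro__]
```

In practice this means an `isinstance(resource, XMLResource)` check is a reasonable guard before touching a `soup` attribute.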

### The package document

The package document (sometimes referred to as OPF or `content.opf`) is
"an XML document that consists of a set of elements that each
encapsulate information about a particular aspect of an EPUB
publication" (from [the spec](https://www.w3.org/TR/epub-33/#sec-package-intro)).
It contains:

* Metadata: title, author, language, date, etc;
* Manifest: list of all resources in the EPUB;
* Spine: reading order of resources;
* Collections (optional): groupings of resources;
* Manifest fallback chains (optional): define equivalence of resources
  to be used as fallbacks.

EPUBLib has specific features for handling the first three elements.
Further reading is available in the [spec section about the package
document](https://www.w3.org/TR/epub-33/#sec-package-doc). The package
document itself is a resource of the EPUB and is available at
`book.package_document`.
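
For orientation, the sketch below reads the metadata, manifest and spine of a trimmed-down package document with the standard library's `xml.etree.ElementTree`. It does not use EPUBLib, the sample is shortened for the example (it omits, among other things, the required `dcterms:modified` entry), and it assumes the package document sits at the root of the archive so that `href` values need no further resolution.

```python
import xml.etree.ElementTree as ET

OPF = """<?xml version="1.0" encoding="utf-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="pub-id">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="pub-id">urn:uuid:123e4567-e89b-12d3-a456-426614174000</dc:identifier>
    <dc:title>A new book</dc:title>
    <dc:language>en</dc:language>
  </metadata>
  <manifest>
    <item id="nav" href="Text/nav.xhtml" media-type="application/xhtml+xml" properties="nav"/>
    <item id="chapter1" href="Text/chapter1.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine>
    <itemref idref="chapter1"/>
    <itemref idref="nav" linear="no"/>
  </spine>
</package>
"""

NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}

package = ET.fromstring(OPF.encode("utf-8"))

# Metadata: Dublin Core elements live in their own namespace.
title = package.findtext("opf:metadata/dc:title", namespaces=NS)

# Manifest: maps each item id to the resource it describes.
manifest = {
    item.get("id"): item.get("href")
    for item in package.iterfind("opf:manifest/opf:item", NS)
}

# Spine: an ordered list of manifest ids, resolved here to hrefs.
reading_order = [
    manifest[ref.get("idref")]
    for ref in package.iterfind("opf:spine/opf:itemref", NS)
]

print(title)          # A new book
print(reading_order)  # ['Text/chapter1.xhtml', 'Text/nav.xhtml']
```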

#### Metadata

```python
from datetime import datetime
from epublib import EPUB

book = EPUB("book.epub")

print(book.metadata) # BookMetadata(10 items)

# book.metadata is an alias of book.package_document.metadata
assert book.metadata is book.package_document.metadata

# Mandatory metadata fields are available as attributes of convenient types
assert isinstance(book.metadata.title, str)
assert isinstance(book.metadata.language, str)
assert isinstance(book.metadata.modified, datetime)
book.metadata.title = "New title"
book.metadata.modified = datetime.now()

# Access as item (read-only) yields internal representation
print(book.metadata["title"])
# DublinCoreMetadataItem(
#     name='title',
#     tag=<dc:title>New title</dc:title>,
#     value='New title',
#     id=None,
#     dir=None,
#     lang=None
# )

```

##### Adding metadata

```python
from epublib.package.metadata import (
    GenericMetadataItem,
    DublinCoreMetadataItem,
)

new_item = book.metadata.add("pageBreakSource", "Our print version, 1976")
new_item_dc = book.metadata.add_dc("rights", "© 1976 Our Publisher")

assert isinstance(new_item, GenericMetadataItem)
assert isinstance(new_item_dc, DublinCoreMetadataItem)

print(new_item)
# GenericMetadataItem(
#     name='pageBreakSource',
#     tag=<meta property="pageBreakSource">Our print version, 1976</meta>,
#     value='Our print version, 1976',
#     id=None,
#     dir=None,
#     lang=None,
#     refines=None,
#     scheme=None
# )

print(new_item_dc)
# DublinCoreMetadataItem(
#     name='rights',
#     tag=<dc:rights>© 1976 Our Publisher</dc:rights>,
#     value='© 1976 Our Publisher',
#     id=None,
#     dir=None,
#     lang=None
# )
```

##### Adding other types of metadata

```python
from epublib.package.metadata import MetadataItem, LinkMetadataItem

link_item = LinkMetadataItem(
    name="front.xhtml#meta-json", # corresponds to href in the tag
    rel="record",
    media_type="application/xhtml+xml",
    hreflang="en",
)
book.metadata.add_item(link_item)

# You can also create your own custom metadata items by subclassing MetadataItem
from custom_item import create_some_custom_item

custom_item = create_some_custom_item()
assert isinstance(custom_item, MetadataItem)
book.metadata.add_item(custom_item)
```

##### Getting all metadata

```python
book.metadata.items # Each item in internal representation
book.metadata.tag # The full metadata tag as a bs4.Tag element
```

#### Manifest

From the [spec](https://www.w3.org/TR/epub-33/#sec-manifest-elem), the
manifest "provides an exhaustive list of publication resources used
in the rendering of the content." Each of its items needs to have:

* an href, a relative path to the resource in the archive;
* a media-type (see [media types](#media-types) below);
* a unique identifier;

and can optionally have:

* properties (see [manifest properties](#manifest-properties) below);
* a fallback;
* a media-overlay.

The manifest is internally represented by `BookManifest`, and each item
by `ManifestItem`. Instead of the relative path, we primarily use the
absolute path of each resource to identify it in the EPUB (corresponding
to the `href` and `filename` attributes of `ManifestItem`,
respectively). If you wish to use the identifier instead, you can
signal that by using `EPUBId`, a `str` subclass, to wrap the identifier
string.

```python
from epublib import EPUB
from epublib.package.manifest import BookManifest, ManifestItem
from epublib.identifier import EPUBId

book = EPUB("book.epub")

# book.manifest is an alias of book.package_document.manifest
assert book.manifest is book.package_document.manifest

print(book.manifest) # BookManifest(4 items)
assert all(isinstance(item, ManifestItem) for item in book.manifest.items)

# Get manifest item by filename (absolute path). Raise KeyError if not found
item = book.manifest["Text/chapter1.xhtml"]
assert item

# Get manifest item, return None if not found
item = book.manifest.get("Text/chapter99.xhtml")
assert item is None

# Get manifest item by identifier (EPUBId)
nav_item = book.manifest[EPUBId("nav")]
assert nav_item
```

Manifest items are normally added and removed along with their
resources (see [above](#accessing-resources)); the `EPUB` class handles
this under the hood. If you need custom control of
manifest items regardless of their resource counterparts, you can use
the `add_item`, `insert_item` and `remove_item` methods of
`BookManifest`. Caution is advised, as this may result in invalid EPUBs.

##### Manifest properties

Each manifest item can have a set of properties, which convey additional
information about the resource (read more [in the
spec](https://www.w3.org/TR/epub-33/#sec-item-resource-properties)). A
non-exhaustive list of properties follows:

* nav (mandatory and unique, sets the navigation document)
* cover-image
* [mathml](https://www.w3.org/TR/epub-33/#sec-mathml)
* [remote-resources](https://www.w3.org/TR/epub-33/#sec-remote-resources)
* [scripted](https://www.w3.org/TR/epub-33/#sec-scripted)
* [svg](https://www.w3.org/TR/epub-33/#sec-svg)
* [switch](https://www.w3.org/TR/epub-33/#sec-switch)

```python
from epublib import EPUB
from epublib.identifier import EPUBId

book = EPUB("book.epub")

item = book.manifest.get("Text/chapter1.xhtml")

# Only do this if chapter 1 references remote resources (e.g. remote audio or fonts)
item.add_property("remote-resources")
# Only do this if there are math expressions in chapter 1
item.add_property("mathml")

item.remove_property("remote-resources")

assert item.has_property("mathml")
assert not item.has_property("remote-resources")

# There are shortcuts to the nav item and the cover image item.
assert book.manifest.nav is book.manifest[EPUBId("nav")]

# Get the manifest item corresponding to the cover image. Currently,
# there is no cover.
assert book.manifest.cover_image is None

# Promote some image to cover image
book.manifest.set_cover_image("Images/image.png")

assert book.manifest.cover_image is book.manifest["Images/image.png"]
```
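
As a rough illustration of how some of these properties relate to a document's content, the sketch below inspects an XHTML document with the standard library and reports which of `scripted`, `mathml` and `svg` would apply. This is an assumption-laden simplification, not the logic behind `update_manifest_properties`: the function `detect_properties` is hypothetical, and it ignores cases such as HTML forms (which also count as scripted content) and `remote-resources`.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"
MATHML = "{http://www.w3.org/1998/Math/MathML}"
SVG = "{http://www.w3.org/2000/svg}"


def detect_properties(xhtml: bytes) -> set[str]:
    """Guess manifest properties from the elements used in a document."""
    root = ET.fromstring(xhtml)
    tags = {element.tag for element in root.iter()}
    properties = set()
    if XHTML + "script" in tags:
        properties.add("scripted")
    if MATHML + "math" in tags:
        properties.add("mathml")
    if SVG + "svg" in tags:
        properties.add("svg")
    return properties


doc = (
    b'<html xmlns="http://www.w3.org/1999/xhtml">'
    b'<head><script src="../Misc/myscript.js"></script></head>'
    b'<body><math xmlns="http://www.w3.org/1998/Math/MathML"><mi>x</mi></math></body>'
    b"</html>"
)
found = detect_properties(doc)
```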

#### Spine

The spine defines the default reading order of the publication. Each
spine item conveys the following information:

* idref (required): the identifier of the corresponding manifest item;
* linear: whether the item is part of the default reading order or not;
* properties (optional): additional information about the item;
* id: an identifier for the spine item itself.

Only the first one is mandatory. The spine is internally represented by
`BookSpine` (found at `book.spine`, an alias of
`book.package_document.spine`), and each item by `SpineItemRef`.
Unlike manifest items, spine items are primarily identified by
their `idref` (their only required attribute).

```python
from epublib import EPUB
import random

book = EPUB("book.epub")

print(book.spine) # BookSpine(2 items)

assert book.spine["nav"]
assert book.spine["chapter1"]

# Getting spine item by position
assert book.spine[0] is book.spine["chapter1"]

# If you need to get a spine item by its filename, go through the
# manifest first (since the filename information is not stored in the spine):
item = book.spine[book.manifest["Text/chapter1.xhtml"].id]

# To reorder the spine, you can use the move_item method:
book.spine.move_item("nav", 0) # Move nav to the beginning of the spine
assert book.spine[0].idref == "nav"

# Or completely reorder the spine
new_order = list(book.spine.items)
random.shuffle(new_order)

book.spine.reorder(new_order)
assert list(book.spine.items) == new_order
```

As with the manifest, adding and removing spine items is normally done
when adding or removing resources (see [above](#accessing-resources)).
Refer to the following parameters of the `EPUB.resources.add` method:

* `after` and `before`;
* `add_to_spine`;
* `spine_position`;
* `linear`.

If you need custom control of spine items, you can use the `add_item`,
`insert_item` and `remove_item` methods of `BookSpine`. Caution is
advised, as this may result in invalid EPUBs.

### Navigation document

The navigation document is a special XHTML document that contains
"human- and machine-readable global navigation information." (from [the
spec](https://www.w3.org/TR/epub-33/#dfn-epub-navigation-document)). In
other words, it is a regular XHTML file with some extra requirements:

* Must include exactly one `nav` html element with `epub:type="toc"`
  (the table of contents);
* All `nav` html elements with a `epub:type` attribute, including the
  table of contents, must follow a [specific
  structure](https://www.w3.org/TR/epub-33/#sec-nav-def-model), using
  only ordered lists (`ol`, possibly nested), list items (`li`), spans
  (`span`) and anchors (`a`).

There may also exist other `nav` elements with different `epub:type`
attributes. The spec talks about two other types:

* `page-list`: a list of links to the locations in the publication
  that correspond to page numbers in a print edition of the work;
* `landmarks`: a list of links to important locations in the
  publication, such as the title page, table of contents, main
  content, bibliography, etc.

These requirements allow EPUBLib to provide specific features for
handling the navigation document, which is represented by
a `NavigationDocument` resource, available at `book.nav`.
There are features for handling the table of contents, page list and
landmarks.

```python
from epublib import EPUB
from epublib.resources import ContentDocument

book = EPUB("book.epub")

for tag in book.nav.soup.find_all("nav"):
    tag.extract()

# Table of contents
book.create_toc(
    targets_selector = "h1, h2, h3",  # defaults to None, in which case
                                      # will only list filename without fragments
    include_filenames = False,        # Whether to include filenames in TOC entries
                                      # (i.e. hrefs with no fragments)
    spine_only = False,               # Only read from resources in the spine
                                      # (yields correctly ordered TOC)
    resource_class = ContentDocument, # Only consider resources of this class
)
# This will error if a TOC already exists. Use reset_toc to force recreation
book.reset_toc()

# Landmarks
book.create_landmarks(
    include_toc = True,                          # Include TOC in landmarks
    targets_selector = "#landmark1, #landmark2", # Defaults to None,
                                                 # selecting no landmark
)

# This will error if a landmarks list already exists. Use the following
# to force recreation
book.reset_landmarks()


# Page list
book.create_page_list(
    id_format = "page_{page}", # If a page break is identified but has
                               # no id, use this format to assign one
    label_format = "{page}",   # Format for the page label, shown in the page list
    pagebreak_selector = '[role="doc-pagebreak"], [epub|type="pagebreak"]',
)

# This will error if a page list already exists. Use the following to force recreation
book.reset_page_list()
```
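
The fixed `nav`/`ol`/`li`/`a` structure described above is what makes the table of contents machine-readable. As a sketch of that idea, independent of EPUBLib, the following walks a small hand-written navigation document with the standard library and lists its entries with their nesting depth. The sample markup and the helper `read_entries` are made up for this example.

```python
import xml.etree.ElementTree as ET

NAV = """<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
<body>
  <nav epub:type="toc" id="toc">
    <h1>Contents</h1>
    <ol>
      <li><a href="chapter1.xhtml">Chapter 1</a>
        <ol>
          <li><a href="chapter1.xhtml#section1">Section 1</a></li>
        </ol>
      </li>
      <li><a href="chapter2.xhtml">Chapter 2</a></li>
    </ol>
  </nav>
</body>
</html>"""

XHTML = "{http://www.w3.org/1999/xhtml}"
EPUB_TYPE = "{http://www.idpf.org/2007/ops}type"


def read_entries(ol, depth=0):
    """Yield (depth, label, href) for each list item, depth-first."""
    for li in ol.findall(XHTML + "li"):
        label = li.find(XHTML + "a")
        if label is None:
            label = li.find(XHTML + "span")  # unlinked headings use a span
        yield depth, "".join(label.itertext()).strip(), label.get("href")
        nested = li.find(XHTML + "ol")
        if nested is not None:
            yield from read_entries(nested, depth + 1)


root = ET.fromstring(NAV)
# Pick the nav element whose epub:type is "toc".
toc = next(nav for nav in root.iter(XHTML + "nav") if nav.get(EPUB_TYPE) == "toc")
entries = list(read_entries(toc.find(XHTML + "ol")))

for depth, label, href in entries:
    print("  " * depth + f"{label} -> {href}")
```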

### NCX file

The NCX file is an XML file used in EPUB 2 publications to define the
table of contents. It has been superseded by the [navigation
document](#navigation-document), but may optionally be included in EPUB
3 publications for backwards compatibility with EPUB 2 readers. There
are several features of the NCX format, only some of which are
represented in EPUBLib:

* `head` element contains metadata, some of which are required (`uid`,
  `depth`, `totalPageCount`, `maxPageNumber`);
* `docTitle` element contains the title of the publication;
* `docAuthor` elements contain the authors of the publication;
* `navMap` element contains the actual table of contents;
* `pageList` element contains the list of pages;
* `navList` elements (any number of them) can contain other lists of
  points of interest.

Refer to the
[specification](https://daisy.org/activities/standards/daisy/daisy-3/z39-86-2005-r2012-specifications-for-the-digital-talking-book/#NCX)
for more details.

```python
from epublib import EPUB
from epublib.ncx import NCXHead, NCXNavMap, NCXPageList

book = EPUB("book.epub")

book.generate_ncx() # use reset_ncx if one already exists
assert book.ncx
assert book.ncx.nav_map
assert book.ncx.head


assert isinstance(book.ncx.head, NCXHead)
assert isinstance(book.ncx.nav_map, NCXNavMap)
assert book.ncx.page_list is None # No page list yet!

item = book.ncx.nav_map.items[0]

assert item.href == "Text/chapter1.xhtml"
assert item.text == "Start"

# Will recreate the nav_map unless reset_ncx is False or there is no NCX file
book.reset_toc(reset_ncx=True)

# Will recreate the page_list unless reset_ncx is False or there is no NCX file
book.reset_page_list(reset_ncx=True)
assert isinstance(book.ncx.page_list, NCXPageList)


# To synchronize specific parts of the NCX file with the rest of the book:
book.ncx.sync_head(book.metadata)
book.ncx.sync_toc(book.nav)
book.ncx.sync_page_list(book.nav)

# Update metadata numbers in the head of the NCX which are calculated
# (depth, total page count, max page number and play order)
book.ncx.update_numbers()

# Use reset_ncx to do all of the above at once
book.reset_ncx()
```
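
To give a feel for the calculated numbers mentioned above, here is a simplified sketch that derives the nesting depth of a `navMap` and assigns sequential `playOrder` values using only the standard library. It is not the implementation of `update_numbers`: the sample NCX is trimmed (no `head`, `docTitle` or `pageList`), `nav_depth` is a hypothetical helper, and a full NCX would also number page targets.

```python
import xml.etree.ElementTree as ET

NCX_NS = "{http://www.daisy.org/z3986/2005/ncx/}"

NCX = """<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <navMap>
    <navPoint id="np1">
      <navLabel><text>Start</text></navLabel>
      <content src="Text/chapter1.xhtml"/>
      <navPoint id="np2">
        <navLabel><text>Section 1</text></navLabel>
        <content src="Text/chapter1.xhtml#section1"/>
      </navPoint>
    </navPoint>
    <navPoint id="np3">
      <navLabel><text>Chapter 2</text></navLabel>
      <content src="Text/chapter2.xhtml"/>
    </navPoint>
  </navMap>
</ncx>"""


def nav_depth(parent):
    """Number of navPoint levels nested under parent."""
    children = parent.findall(NCX_NS + "navPoint")
    return max((1 + nav_depth(child) for child in children), default=0)


root = ET.fromstring(NCX)
nav_map = root.find(NCX_NS + "navMap")

depth = nav_depth(nav_map)

# iter() walks in document order, which matches the reading order of the TOC.
for order, point in enumerate(nav_map.iter(NCX_NS + "navPoint"), start=1):
    point.set("playOrder", str(order))

play_orders = {
    point.get("id"): point.get("playOrder")
    for point in nav_map.iter(NCX_NS + "navPoint")
}
```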

### Soup and internal representations

> tl;dr: If possible, do not alter the `soup` attribute of
> `PackageDocument` and the `NavigationDocument` directly. If you do need
> to alter them make sure to call `book.package_document.on_soup_change()`
> or `book.nav.on_soup_change()` afterwards.

The features described above for handling the package document and the
navigation document involve parsing the corresponding XML/XHTML files and
building an internal representation of their content. These
representations are built lazily (i.e., the parsing only occurs when
some part of the representation is accessed). Due to the mutable
nature of BeautifulSoup objects, the user may inadvertently introduce
discrepancies between them and the internal representation, which may
lead to errors. For example, if a user adds an item tag directly to the
soup of the package document, there is no way for EPUBLib to know about
the new item and add it to the `BookManifest` object.

If you do need to alter the `soup` attribute of these resources (or the
`tag` attributes of the internal representations), there may be two
scenarios:

1. You don't need the internal representation, so we're all good.

    ```python
    from epublib import EPUB

    book = EPUB("book.epub")

    new_tag = book.package_document.soup.new_tag(
        "item",
        attrs={"href": "file.txt", "media-type": "text/plain", "id": "file"},
    )
    book.manifest.tag.append(new_tag)
    book.write("book-modified.epub") # All good
    ```

2. You do need the internal representation. In this case, you need to call
   the `on_soup_change` method of the corresponding resource after
   altering its soup.

    ```python
    from epublib import EPUB

    book = EPUB("book.epub")

    new_tag = book.package_document.soup.new_tag(
        "item",
        attrs={"href": "file.txt", "media-type": "text/plain", "id": "file"},
    )
    book.package_document.soup.manifest.append(new_tag)

    # Mark the internal representation for reparsing
    book.package_document.on_soup_change()

    # Internal representation is up to date
    assert book.manifest.get("file.txt")
    ```

Note that changes made through the internal representation are reflected
in the soup, so you don't need to do anything to see the changes there.

```python
from epublib import EPUB
from epublib.resources.create import create_resource

book = EPUB("book.epub")

book.resources.add_to_manifest(
    create_resource(b"Some text content", "Text/file.txt"),
    identifier="new-item"
)

assert book.package_document.soup.find(id="new-item")
```

If you completely overwrite the `soup` attribute of these resources,
there is also no need to call `on_soup_change`, as the property setter
will already do that for you. This is why there is no similar issue
with the `content` attribute: since bytes are immutable, every change
to it will trigger a reparse from the property setter.
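
The general pattern at play here, a lazily built view that must be invalidated when the underlying tree is mutated in place, can be sketched in a few lines. The class below is a hypothetical stand-in built on `xml.etree.ElementTree`, not EPUBLib's `PackageDocument`; `tree` plays the role of `soup` and `on_tree_change` the role of `on_soup_change`.

```python
import xml.etree.ElementTree as ET


class PackageSketch:
    """Hypothetical stand-in showing lazy parsing plus manual invalidation."""

    def __init__(self, xml_text):
        self.tree = ET.fromstring(xml_text)  # mutable tree, like `soup`
        self._manifest_ids = None            # parsed view, built on first access

    @property
    def manifest_ids(self):
        if self._manifest_ids is None:
            self._manifest_ids = [item.get("id") for item in self.tree.iter("item")]
        return self._manifest_ids

    def on_tree_change(self):
        """Drop the cached view so the next access reparses the tree."""
        self._manifest_ids = None


doc = PackageSketch('<package><manifest><item id="nav"/></manifest></package>')
first = list(doc.manifest_ids)

# Mutating the tree directly goes unnoticed by the cached view...
ET.SubElement(doc.tree.find("manifest"), "item", id="file")
stale = list(doc.manifest_ids)

# ...until the cache is explicitly invalidated.
doc.on_tree_change()
fresh = list(doc.manifest_ids)

print(first, stale, fresh)  # ['nav'] ['nav'] ['nav', 'file']
```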

### Media types

Media types (also known as MIME types or content types) are strings that
represent the format of a file. They are used in EPUBs to describe the
format of each resource, and are required in every manifest item.

EPUBLib provides a `MediaType` class that represents the core media
types as described in the
[specification](https://www.w3.org/TR/epub-33/#sec-core-media-types).
Other media types are possible, but they will be represented by regular
strings.

We also introduce a helper class called `Category`, which represents the
main category of a media type. For example, the media type
`image/png` (`MediaType.IMAGE_PNG`) has the category `Category.IMAGE`.

```python
from epublib.mediatype import MediaType, Category

# From filename
assert MediaType.from_filename("image.png") is MediaType.IMAGE_PNG
assert MediaType.from_filename("image.jpg") is MediaType.IMAGE_JPEG
assert MediaType.from_filename("audio.ogg") is MediaType.AUDIO_OGG


# From mimetype string
assert MediaType("font/ttf") is MediaType.FONT_TTF
assert MediaType("text/css") is MediaType.CSS

# Utilities
assert MediaType.from_filename("script.js").is_js()
assert MediaType.from_filename("style.css").is_css()

# If you need lenient parsing of mimetypes (i.e. not raising errors for
# non-core media types), use coalesce
assert MediaType.coalesce("image/png") is MediaType.IMAGE_PNG
assert MediaType.coalesce("application/x-zerosize") == "application/x-zerosize"

# The category and mimetype are available as properties in MediaType instances
media_type = MediaType.from_filename("image.png")
assert media_type.category is Category.IMAGE
assert media_type.value == "image/png"
```
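
One way to picture the difference between strict construction and `coalesce` is a string-valued enum with a lenient fallback. The sketch below is an illustration under that assumption only; `MediaTypeSketch` and `CategorySketch` are hypothetical and cover just three media types, and `CategorySketch.STYLE` is a made-up member.

```python
from enum import Enum


class CategorySketch(Enum):
    IMAGE = "image"
    STYLE = "style"  # hypothetical member, for illustration only


class MediaTypeSketch(str, Enum):
    IMAGE_PNG = "image/png"
    IMAGE_JPEG = "image/jpeg"
    CSS = "text/css"

    @property
    def category(self):
        if self is MediaTypeSketch.CSS:
            return CategorySketch.STYLE
        return CategorySketch.IMAGE

    @classmethod
    def coalesce(cls, value):
        """Return the enum member if the value is known, else the raw string."""
        try:
            return cls(value)
        except ValueError:
            return value


known = MediaTypeSketch.coalesce("image/png")
unknown = MediaTypeSketch.coalesce("application/x-zerosize")

assert known is MediaTypeSketch.IMAGE_PNG
assert unknown == "application/x-zerosize"
assert known.category is CategorySketch.IMAGE
```

Because the enum subclasses `str`, members compare equal to their plain mimetype strings, which keeps core and non-core media types interchangeable in comparisons.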

### Utilities

#### Relative path resolution

When dealing with EPUBs it is often necessary to, given a relative path
(e.g. in an `href` or `src` attribute), find the full path of the
referred file. The other way around may also be necessary: given the
absolute filename, find the relative path from some resource to that
filename. Two helper functions are provided for this:

```python
from epublib.util import get_absolute_href, get_relative_href
from epublib import EPUB

book = EPUB("book.epub")

href = book.nav.soup.select_one("a")["href"] # "chapter1.xhtml"
absolute_path = get_absolute_href(
    origin_href=book.nav.filename, # "Text/nav.xhtml"
    href=href,                     # "chapter1.xhtml"
)

assert absolute_path == "Text/chapter1.xhtml"

# Vice versa:
relative_path = get_relative_href(
    relative_to=book.nav.filename, # "Text/nav.xhtml"
    absolute_href="Text/chapter1.xhtml",
)

assert relative_path == "chapter1.xhtml"
```
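
The underlying path arithmetic can be approximated with `posixpath` from the standard library, since paths inside an EPUB always use forward slashes. The two functions below are hypothetical equivalents written for illustration, not the source of `get_absolute_href` and `get_relative_href`, and they ignore fragments and percent-encoding.

```python
import posixpath


def absolute_href(origin, href):
    """Resolve href, found inside the file `origin`, to a path from the archive root."""
    return posixpath.normpath(posixpath.join(posixpath.dirname(origin), href))


def relative_href(relative_to, absolute):
    """Express `absolute` as a path relative to the folder of `relative_to`."""
    return posixpath.relpath(absolute, posixpath.dirname(relative_to) or ".")


assert absolute_href("Text/nav.xhtml", "chapter1.xhtml") == "Text/chapter1.xhtml"
assert absolute_href("Text/nav.xhtml", "../Images/image.png") == "Images/image.png"
assert relative_href("Text/nav.xhtml", "Images/image.png") == "../Images/image.png"
```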

At a higher level, `EPUB.resources` provides a method for resolving a
string representing an href (possibly with a fragment) to the actual
resource it refers to (and optionally to the tag it refers to):
`resolve_href`.

```python
import bs4
from epublib import EPUB

book = EPUB("book.epub")

resource = book.resources.resolve_href("Text/chapter1.xhtml#section1", with_tag=False)
assert resource is book.resources.get("Text/chapter1.xhtml")

# If the href is found inside some resource, you can use the
# `relative_to` parameter
resource = book.resources.resolve_href(
    "../Text/chapter1.xhtml#section1",
    with_tag=False,
    relative_to="Styles/style.css",
)
assert resource is book.resources.get("Text/chapter1.xhtml")

# To capture the tag the href refers to, use the `with_tag` parameter:
resource, tag = book.resources.resolve_href(
    "../Text/nav.xhtml#toc",
    with_tag=True,
    relative_to="Styles/style.css",
)
assert resource.filename == "Text/nav.xhtml"
assert isinstance(tag, bs4.Tag)
assert tag["id"] == "toc"
```
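
Conceptually, resolving an href with a fragment involves three steps: split off the fragment, resolve the path against the referring file, and look up the element with the matching `id`. The sketch below walks through those steps with the standard library over a plain dictionary of files. It is a simplified stand-in, not the implementation of `resolve_href`; the `resources` dictionary and the `resolve` function are assumptions made for the example.

```python
import posixpath
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the archive: filename -> content.
resources = {
    "Text/chapter1.xhtml": b'<html><body><h2 id="section1">Section</h2></body></html>',
}


def resolve(href, relative_to=None):
    """Return (filename, element) for an href; element is None without a fragment."""
    path, _, fragment = href.partition("#")
    if relative_to is not None:
        base = posixpath.dirname(relative_to)
        path = posixpath.normpath(posixpath.join(base, path))
    root = ET.fromstring(resources[path])
    target = None
    if fragment:
        target = next((el for el in root.iter() if el.get("id") == fragment), None)
    return path, target


filename, tag = resolve("../Text/chapter1.xhtml#section1", relative_to="Styles/style.css")
assert filename == "Text/chapter1.xhtml"
assert tag.tag == "h2" and tag.get("id") == "section1"
```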
