Metadata-Version: 2.4
Name: swh.indexer
Version: 4.8.3
Summary: Software Heritage indexer
Author-email: Software Heritage developers <swh-devel@inria.fr>
Project-URL: Homepage, https://gitlab.softwareheritage.org/swh/devel/swh-indexer
Project-URL: Bug Reports, https://gitlab.softwareheritage.org/swh/devel/swh-indexer/-/issues
Project-URL: Funding, https://www.softwareheritage.org/donate
Project-URL: Documentation, https://docs.softwareheritage.org/devel/swh-indexer/
Project-URL: Source, https://gitlab.softwareheritage.org/swh/devel/swh-indexer.git
Classifier: Programming Language :: Python :: 3
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 5 - Production/Stable
Requires-Python: >=3.9
Description-Content-Type: text/x-rst
License-File: LICENSE
License-File: AUTHORS
Requires-Dist: python-magic>=0.4.13
Requires-Dist: click
Requires-Dist: frozendict!=2.1.2
Requires-Dist: iso8601
Requires-Dist: pybtex>=0.25.0
Requires-Dist: pyld>=3.0.0
Requires-Dist: rdflib>=7.1.4
Requires-Dist: sentry-sdk
Requires-Dist: typing-extensions
Requires-Dist: xmltodict
Requires-Dist: swh.core[db,http]>=4.0.0
Requires-Dist: swh.model>=8.4.0
Requires-Dist: swh.objstorage>=2.3.1
Requires-Dist: swh.storage>=3.0.0
Requires-Dist: swh.journal>=0.1.0
Provides-Extra: testing
Requires-Dist: confluent-kafka; extra == "testing"
Requires-Dist: hypothesis>=3.11.0; extra == "testing"
Requires-Dist: importlib_metadata; extra == "testing"
Requires-Dist: pytest>=8.1; extra == "testing"
Requires-Dist: pytest-mock; extra == "testing"
Requires-Dist: swh.coarnotify>=0.9.0; extra == "testing"
Requires-Dist: swh.core[testing]>=3.0.0; extra == "testing"
Requires-Dist: swh.journal[pytest]>=2.0.0; extra == "testing"
Requires-Dist: swh.storage[pytest]>=3.1.0; extra == "testing"
Requires-Dist: types-confluent-kafka; extra == "testing"
Requires-Dist: types-pyyaml; extra == "testing"
Requires-Dist: types-xmltodict; extra == "testing"
Dynamic: license-file

Software Heritage - Indexer
===========================

Tools to compute multiple indexes on SWH's raw contents:

- content:

  - mimetype
  - fossology-license
  - metadata

- origin:

  - metadata (intrinsic, using the content indexer; and extrinsic)

An indexer is in charge of:

- looking up objects
- extracting information from those objects
- storing that information in the swh-indexer db

There are multiple indexers working on different object types:

- content indexer: works with content sha1 hashes
- revision indexer: works with revision sha1 hashes
- origin indexer: works with origin identifiers

Indexation procedure:

- receive a batch of ids
- retrieve the associated data, depending on the object type
- compute some index for each object
- store the result in swh's storage
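
The procedure above can be sketched as a small loop. This is only an
illustration, not the swh.indexer API: ``index_batch`` and the ``fetch``,
``compute`` and ``store`` callables are hypothetical names, and plain
in-memory containers stand in for the object storage and the indexer
storage.

```python
from typing import Any, Callable, Dict, Iterable, List, Optional


def index_batch(
    ids: Iterable[str],
    fetch: Callable[[str], Optional[bytes]],
    compute: Callable[[bytes], Dict[str, Any]],
    store: Callable[[List[Dict[str, Any]]], None],
) -> List[Dict[str, Any]]:
    """Hypothetical sketch of one indexation run over a batch of ids."""
    results: List[Dict[str, Any]] = []
    for obj_id in ids:
        data = fetch(obj_id)  # retrieve the associated data
        if data is None:
            continue  # object not found: nothing to index
        # compute some index for that object and tag it with its id
        results.append({"id": obj_id, **compute(data)})
    store(results)  # persist the whole batch at once
    return results


# In-memory stand-ins (assumptions, for illustration only).
objects = {"aaa": b"hello", "bbb": b"hello world"}
saved: List[Dict[str, Any]] = []

indexed = index_batch(
    ["aaa", "missing", "bbb"],
    fetch=objects.get,
    compute=lambda data: {"length": len(data)},
    store=saved.extend,
)
```

Passing the three steps as callables keeps the loop independent of the
object type, which mirrors how the same procedure applies to contents,
revisions and origins.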

Current content indexers:

- mimetype (queue swh_indexer_content_mimetype): detect the encoding
  and mimetype

- fossology-license (queue swh_indexer_fossology_license): compute the
  license

- metadata: translate files from ecosystem-specific formats to JSON-LD
  (using schema.org/CodeMeta vocabulary)
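
As a rough illustration of what the metadata translation involves, the
sketch below turns a few fields of an npm ``package.json`` into a CodeMeta
JSON-LD document. It is a minimal sketch under stated assumptions, not the
mapping shipped with swh.indexer: ``translate_package_json`` and
``FIELD_MAP`` are hypothetical names, and only three string fields are
handled.

```python
import json
from typing import Any, Dict

# JSON-LD context of the CodeMeta 2.0 vocabulary.
CODEMETA_CONTEXT = "https://doi.org/10.5063/schema/codemeta-2.0"

# Hypothetical, deliberately tiny mapping: source key -> CodeMeta property.
FIELD_MAP = {
    "name": "name",
    "version": "version",
    "description": "description",
}


def translate_package_json(raw: bytes) -> Dict[str, Any]:
    """Translate the raw bytes of a package.json into a JSON-LD dict."""
    source = json.loads(raw)
    doc: Dict[str, Any] = {
        "@context": CODEMETA_CONTEXT,
        "@type": "SoftwareSourceCode",
    }
    for src_key, target_key in FIELD_MAP.items():
        value = source.get(src_key)
        # Keep only non-empty strings; unknown or malformed fields are dropped.
        if isinstance(value, str) and value:
            doc[target_key] = value
    return doc


example = translate_package_json(
    b'{"name": "demo", "version": "1.0.0", "private": true}'
)
```

Fields without an entry in the mapping (``private`` here) are simply
ignored, so the output only contains terms from the target vocabulary.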

Current origin indexers:

- metadata: translate files from ecosystem-specific formats to JSON-LD
  (using schema.org/CodeMeta and ForgeFed vocabularies)

`Custom indexers and metadata mappings
<https://docs.softwareheritage.org/devel/swh-indexer/custom-indexers.html>`_
can be added as plugins.
