Metadata-Version: 2.4
Name: swh-osv
Version: 0.0.0
Summary: Research project to analyze the OSV database
Author-email: Software Heritage developers <swh-devel@inria.fr>
Classifier: Programming Language :: Python :: 3
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: GNU Affero General Public License v3 or later (AGPLv3+)
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 5 - Production/Stable
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: AUTHORS
Requires-Dist: aiohttp
Requires-Dist: bs4
Requires-Dist: google-cloud-storage
Requires-Dist: grpcio
Requires-Dist: pyzstd
Requires-Dist: matplotlib
Requires-Dist: matplotlib-terminal
Requires-Dist: tqdm
Requires-Dist: matplotlib-backend-kitty
Requires-Dist: python-dateutil
Requires-Dist: swh.export
Requires-Dist: swh.graph
Requires-Dist: swh.model
Provides-Extra: luigi
Requires-Dist: luigi; extra == "luigi"
Provides-Extra: testing
Requires-Dist: pytest>=8.1; extra == "testing"
Requires-Dist: pytest-mock; extra == "testing"
Requires-Dist: pyarrow-stubs; extra == "testing"
Requires-Dist: types-grpcio; extra == "testing"
Requires-Dist: types-protobuf; extra == "testing"
Requires-Dist: types-python-dateutil; extra == "testing"
Requires-Dist: types-tqdm; extra == "testing"
Dynamic: license-file

# osv analysis

This repository mines data from vulnerability databases in the [OSV format](https://osv.dev/),
looking to map vulnerabilities to software versions they affect.

The main features are:

* computing stats
* exhaustively listing which revisions are affected by a vulnerability, and vice versa
* an experimental implementation of the SZZ algorithm, see ["Computing introducing commits"](#computing-introducing-commits) below.

## Preliminaries

* Install dependencies, get data files, and index them:

  ```
  pip3 install -e .
  make all.sqlite
  ```
* Run [swh-graph-grpc-serve](https://docs.softwareheritage.org/devel/swh-graph/grpc-api.html) on localhost (or port-forward it, e.g. `ssh maxxi.internal.softwareheritage.org -L 50091:localhost:50091`).

Generate all reports:

```
make
```

## Data sources

* https://osv-vulnerabilities.storage.googleapis.com/all.zip
* https://storage.googleapis.com/cve-osv-conversion/index.html?prefix=osv-output/

## Stats

Publications per year start at 120 in 2003 and increase exponentially to 10k in 2024.

The number of reports last modified in a given year is low and erratic from 2003 to 2012-2013, then increases exponentially from 100 to 10k in 2024.

10k reports are published and last modified on the same day.
From that value, there is an exponential decrease down to 100 reports that were last modified 8000 days after their publication.
There is also a small number of reports last modified before their publication, with an exponential decrease to 100 reports at 100 days.

![](output/stats/publication_and_modification_times.svg)

### 'affected' items

Each OSV document lists what software packages
(which can be real packages, VCS repositories, or projects in the abstract)
are affected.

![Exponential decrease from 10k reports with 1 affected item to single reports with 1000 affected items. A couple thousand reports don't fit the regression and have between 100 and 115 affected items per report](output/stats/affected_items_per_file_log_scale.svg)

Each 'affected' entry lists
[events](https://ossf.github.io/osv-schema/#affectedrangesevents-fields)
for that affected package, which roughly map to when the vulnerability was introduced and fixed.
As VCS commits and versions are not linear, there can be many of these.
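
As an illustration of how events are nested inside a report, here is a minimal sketch that counts event kinds in one parsed OSV document. It relies only on the `affected` / `ranges` / `events` nesting from the OSV schema; the function name and the sample document are made up for this example and are not part of this repository.

```python
from collections import Counter


def count_event_types(osv_document: dict) -> Counter:
    """Count event kinds (introduced, fixed, ...) across all affected entries."""
    counts: Counter = Counter()
    for affected in osv_document.get("affected", []):
        for version_range in affected.get("ranges", []):
            for event in version_range.get("events", []):
                # Each event object carries a single key naming its kind.
                counts.update(event.keys())
    return counts


# Hand-written sample for illustration, not taken from a real report.
sample = {
    "id": "EXAMPLE-0000-0001",
    "affected": [
        {
            "package": {"ecosystem": "PyPI", "name": "example-package"},
            "ranges": [
                {
                    "type": "ECOSYSTEM",
                    "events": [
                        {"introduced": "0"},
                        {"fixed": "1.2.3"},
                        {"introduced": "2.0.0"},
                        {"fixed": "2.0.1"},
                    ],
                }
            ],
        }
    ],
}

print(count_event_types(sample))
```

The sample has two introduced/fixed pairs in a single range, which is the "many events" situation described above.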

![Event types per affected entry](output/stats/event_types_per_affected_type_log_scale.svg)

Number of event types per database:

![Event types per database](output/stats/event_types_per_database_lin_scale.svg)

Events are grouped by range (usually only two events per range), and each range has one of three types:

* GIT when events are associated with Git commits

* SEMVER when they are associated with regular X.Y.Z versions with the expected semantics

* ECOSYSTEM for everything else, with no defined semantics. Due to the lack of semantics, the OSV spec recommends that reports explicitly list all affected versions in this case. The graph below breaks ECOSYSTEM ranges into two, based on whether they follow this recommendation.
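
The four buckets used in the graph can be sketched as follows. This assumes the split is made on whether the enclosing `affected` entry has a non-empty `versions` list (the field the OSV schema provides for enumerating versions); the function name and bucket labels are illustrative, and the actual chart may be computed differently.

```python
def range_buckets(affected_entry: dict) -> list[str]:
    """Return one bucket label per range of a single 'affected' entry."""
    # In the OSV schema, explicit versions are listed on the affected entry,
    # next to (not inside) its ranges.
    has_versions = bool(affected_entry.get("versions"))
    buckets = []
    for version_range in affected_entry.get("ranges", []):
        range_type = version_range.get("type", "UNKNOWN")
        if range_type == "ECOSYSTEM":
            suffix = "with versions" if has_versions else "without versions"
            buckets.append(f"ECOSYSTEM {suffix}")
        else:
            buckets.append(range_type)
    return buckets


# Hand-written entries for illustration.
git_entry = {"ranges": [{"type": "GIT", "events": [{"introduced": "0"}]}]}
listed_entry = {
    "versions": ["1.0", "1.1"],
    "ranges": [{"type": "ECOSYSTEM", "events": [{"introduced": "0"}]}],
}
unlisted_entry = {"ranges": [{"type": "ECOSYSTEM", "events": [{"introduced": "0"}]}]}

for entry in (git_entry, listed_entry, unlisted_entry):
    print(range_buckets(entry))
```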

![Not very readable chart, we see 40k GIT ranges from each of NVD and CVE DB, 40k ECOSYSTEM with versions from Ubuntu, 110k ECOSYSTEM without version across NVD CVE / CGA / CVE / and a couple others. 30k SEMVER in total, mostly from the MAL database](output/stats/event_types_per_range_type_log_scale.svg)

### Identifiers

The OSV spec [describes identifiers as](https://ossf.github.io/osv-schema/#id-modified-fields)
"a string of the format `<DB>-<ENTRYID>`, where DB names the database and ENTRYID is in the format used by the database".
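
A minimal sketch of splitting an identifier along that format, assuming the database prefix is everything before the first dash. The helper name is hypothetical.

```python
def split_osv_id(osv_id: str) -> tuple[str, str]:
    """Split '<DB>-<ENTRYID>' into (DB, ENTRYID) at the first dash."""
    db, separator, entry_id = osv_id.partition("-")
    if not separator or not db or not entry_id:
        raise ValueError(f"not in <DB>-<ENTRYID> format: {osv_id!r}")
    return db, entry_id


print(split_osv_id("CVE-2021-25313"))
```

Only the first dash is treated as the separator, since the entry ID may itself contain dashes.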

::include{file=output/identifiers.md}

## Mapping software packages to SWH objects

### Mapping package names to SWH origins

Each OSV document lists what software packages
(which can be real packages, VCS repositories, or projects in the abstract)
are [affected](https://ossf.github.io/osv-schema/#affectedpackage-field).

Each of these software packages is identified by an
[ecosystem](https://ossf.github.io/osv-schema/#defined-ecosystems)
(eg. `OSS-Fuzz`, `npm`, `PyPI`, or `Ubuntu`).

### Mapping GIT `affected` entries to origins and SWHIDs in SWH

Each software package is associated with a list of version ranges
that are affected by the vulnerability, and each version range is
made of events (that mark when a vulnerability is introduced and then fixed).

Version ranges can be associated with Git commits (for software packages of type `GIT`),
version numbers (for software packages of type `SEMVER`),
or opaque/ecosystem-specific strings (type `ECOSYSTEM`).

In this section, we look only at software packages of type GIT
and count how many of them we can find in SWH
(using the origin URL matching described below),
and how many of their commits were in the 2025-05-18 graph.

::include{file=output/git_origins.md}

### Mapping non-GIT `affected` entries to origins in SWH

For non-`GIT` affected packages, we currently only try to map to an origin URL.

This relies on OSV's specified [ecosystems](https://ossf.github.io/osv-schema/#defined-ecosystems) and knowledge of SWH's idiosyncrasies. See [`osv/map_packages.py`](https://github.com/softwareheritage/swh-osv/blob/main/osv/map_packages.py) for details.

::include{file=output/packages.md}

Worth noting are:

* thousands of packages on NPM are not in SWH and don't seem to exist. All but a handful come from the [MAL database](https://github.com/ossf/malicious-packages/tree/main/osv/). They are marked as "hallucinated" above.
* the OSV spec says about Maven: "The ecosystem string might optionally have a `:<REMOTE-REPO-URL>` suffix to denote the remote repository URL that best represents the source of truth for this package, without a trailing slash (e.g. `Maven:https://maven.google.com`). If this is omitted, this is assumed to be the Maven Central repository (`https://repo.maven.apache.org/maven2`).". Not a single report uses this suffix, even though between a quarter and a half of Maven packages are not in Maven Central but in other repositories.

## Cherry-picks stats

### 2024-05-16-history-hosting graph

247970180 commits are in the 7079 connected components that contain any of the `introduced`, `fixed`, `last_affected`, or `limit` commits mentioned by a vulnerability report.

Of the commits mentioned by vulnerability reports:
* 243008413 are not deemed to be cherry-picks, though 300632 mention the keywords "cherry picked from commit" (this typically happens because a cherry-picked commit's message is quoted in another commit message)
* 4961689 have at least one valid "cherry picked from commit" stanza, 236022 have at least two. Of those stanzas:
    * 293315 reference unknown commits and don't have a repo URL
    * 0 reference unknown commits and a non-UTF8 repo URL
    * 0 reference unknown repo URLs (i.e. these were origins unknown to SWH at the time of graph export)
    * 1640 reference unknown commits and a known repo URL (i.e. these commits were unknown at the time of the export, but probably will be in the future)
    * 55449 claim to be cherry-picks of commits in different connected components
    * 5005620 reference known commits
* 4989284 are cherry-picks of known commits

Of the 4700546 cherry-pick commits, 166743 are cherry-picks of multiple commits.
Of the latter, 141972 can be reduced to cherry-picks of a single commit, which is itself a cherry-pick of one (or more). 805 can only be partially reduced through that process.


### 2025-05-18 graph

293340976 commits are in the 7767 connected components that contain any of the `introduced`, `fixed`, `last_affected`, or `limit` commits mentioned by a vulnerability report.

Of the commits mentioned by vulnerability reports:
* 287043476 are not deemed to be cherry-picks, though 346120 mention the keywords "cherry picked from commit" (this typically happens because a cherry-picked commit's message is quoted in another commit message)
* 6296974 have at least one valid "cherry picked from commit" stanza, 303125 have at least two. Of those stanzas:
    * 348192 reference unknown commits and don't have a repo URL
    * 0 reference unknown commits and a non-UTF8 repo URL
    * 0 reference unknown repo URLs (i.e. these were origins unknown to SWH at the time of graph export)
    * 1998 reference unknown commits and a known repo URL (i.e. these commits were unknown at the time of the export, but probably will be in the future)
    * 70117 claim to be cherry-picks of commits in different connected components
    * 6372854 reference known commits
* 6354701 are cherry-picks of known commits
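
To illustrate the difference between a commit message that merely mentions "cherry picked from commit" and one that carries a stanza, here is a rough sketch. It assumes a stanza is a line of its own in the form written by `git cherry-pick -x`, with a full 40-character hexadecimal commit ID; the criteria actually used to compute the numbers above may differ, and this sketch ignores repo URLs entirely.

```python
import re

# Assumed stanza shape: the trailer appended by `git cherry-pick -x`,
# alone on its line, with a full SHA-1 commit ID.
CHERRY_PICK_RE = re.compile(
    r"^\(cherry picked from commit ([0-9a-f]{40})\)[ \t]*$", re.MULTILINE
)


def cherry_pick_sources(message: str) -> list[str]:
    """Return the commit IDs named by cherry-pick stanzas in a commit message."""
    return CHERRY_PICK_RE.findall(message)


# Made-up commit messages for illustration.
picked = "Fix overflow\n\n(cherry picked from commit " + "a" * 40 + ")\n"
quoted = 'Revert "Fix overflow (cherry picked from commit abc123)"\n'

print(cherry_pick_sources(picked))
print(cherry_pick_sources(quoted))
```

The second message contains the keywords but no stanza under these assumptions, which matches the "quoted in another commit message" case mentioned above.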


## Algorithms

### Origin URL matching

Origin URLs in SWH are case-sensitive, but in OSV reports they are usually lowercased.

Given a Git repository URL from an OSV report, we try to match it to a SWH origin URL this way:

1. if there is an exact match in swh-graph, return it
2. remove the `.git` suffix; if there is then an exact match in swh-graph, return it (this catches 20% of the origins that failed the exact match)
3. send a request to `https://archive.softwareheritage.org/api/1/origin/search/{url}?limit=10`; if any result matches the URL (case-insensitively and with the `.git` suffix stripped from both), return it
4. fail
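
The steps above can be sketched as follows. The set `known_origins` stands in for the swh-graph lookup and the `search` callable stands in for the archive's origin search request; both are placeholders invented for this example, not the interfaces used in this repository.

```python
from typing import Callable, Iterable, Optional


def _normalize(url: str) -> str:
    """Lowercase and drop a trailing '.git', for the comparison in step 3."""
    return url.lower().removesuffix(".git")


def match_origin(
    url: str,
    known_origins: set[str],
    search: Callable[[str], Iterable[str]],
) -> Optional[str]:
    # Step 1: exact match.
    if url in known_origins:
        return url
    # Step 2: exact match after removing the '.git' suffix.
    stripped = url.removesuffix(".git")
    if stripped in known_origins:
        return stripped
    # Step 3: ask the search backend, then compare case-insensitively.
    wanted = _normalize(url)
    for candidate in search(url):
        if _normalize(candidate) == wanted:
            return candidate
    # Step 4: fail.
    return None


# Toy data: one origin, and a substring search emulating the remote endpoint.
origins = {"https://github.com/Example/Project"}


def toy_search(query: str) -> list[str]:
    return [o for o in origins if _normalize(query) in o.lower()]


print(match_origin("https://github.com/example/project.git", origins, toy_search))
```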

Other origins are normalized similarly, with an addition for PyPI projects:
OSV reports frequently mangle PyPI project names by using dots, underscores, and dashes interchangeably,
so we consider those characters equivalent (like PyPI does) in step 3.
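
For reference, PyPI's own name normalization (PEP 503) can be written as below. Whether this repository uses exactly this expression is an assumption; the function name is illustrative.

```python
import re


def normalize_pypi_name(name: str) -> str:
    """Collapse runs of '-', '_' and '.' into a single '-' and lowercase (PEP 503)."""
    return re.sub(r"[-_.]+", "-", name).lower()


for raw in ("Zope.Interface", "zope_interface", "zope-interface"):
    print(raw, "->", normalize_pypi_name(raw))
```

Two names are then treated as the same project when their normalized forms are equal.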


(computing-introducing-commits)=
## Computing introducing commits

We implement the [SZZ](https://www.st.cs.uni-saarland.de/papers/msr2005/msr2005.pdf) algorithm to compute, from a known commit fixing a vulnerability, which commits may have introduced it. Fundamentally, it works this way:

1. Look up the parent of the fixing commit, and compute the diff between it and the fixing commit
2. Compute a [git-blame](https://git-scm.com/docs/git-blame) of the parent of the fixing commit
3. For each line removed (or modified) by the fixing commit, look up in the git-blame which commit introduced it
4. Take the union of the commits introducing these lines
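
The four steps can be illustrated on a single file with plain Python lists. This is a toy sketch, not the Rust implementation in this repository: the blame is given as a precomputed list (one commit ID per line of the parent version), the diff comes from `difflib`, and all names and data are made up.

```python
from difflib import SequenceMatcher


def szz_introducing_commits(
    parent_lines: list[str], fixed_lines: list[str], blame: list[str]
) -> set[str]:
    """Union of the commits blamed for lines the fix removed or modified."""
    if len(blame) != len(parent_lines):
        raise ValueError("blame must have one entry per line of the parent file")
    introducing: set[str] = set()
    # Step 1: diff between the parent of the fix and the fix.
    matcher = SequenceMatcher(a=parent_lines, b=fixed_lines, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # Step 3: only lines removed or modified by the fix are looked up
        # in the blame (step 2) of the parent version.
        if tag in ("replace", "delete"):
            # Step 4: take the union.
            introducing.update(blame[i1:i2])
    return introducing


parent = ["int n = read();", "char buf[8];", "copy(buf, n);", "return 0;"]
blame = ["c1", "c2", "c3", "c1"]
fixed = ["int n = read();", "char buf[8];", "copy(buf, min(n, 8));", "return 0;"]

print(szz_introducing_commits(parent, fixed, blame))
```

A fix that only adds lines removes nothing from the parent, so this procedure returns no candidate for it.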

To compute all recommended variants, use:

```
make data/szz.tar     # standard SZZ, fast
make data/vszz90.tar  # V-SZZ with similarity ratio 90%
make data/vszz75.tar  # V-SZZ with similarity ratio 75%, slow
```

At a minimum, it can be run with:

```
cargo run --release --bin szz -- \
    --graph "$GRAPH_PATH" \
    --digestmap "$DIGESTMAP_PATH" \
    --url gzip:https://softwareheritage.s3.amazonaws.com/content/{sha1} \
    --url plain:https://archive.softwareheritage.org/api/1/content/sha1:{sha1}/raw/ \
    --db all.sqlite \
    --content-cache data/contents.rocksdb/ \
    --max-tree-diff-items 100 \
    --vuln-out-dir data/szz/
```

This writes ndjson (newline-delimited JSON) files to `data/szz/`, in this format:

```rust
pub struct SzzOutputRecord {
    /// Filename of the vulnerability report in the OSV database
    pub vuln_filename: String,
    /// Revisions that are a fix according to the vulnerability report
    pub fix_revs: Vec<SWHID>,
    /// Union of tag names of each revision in `fix_revs`.
    ///
    /// For each revision in `fix_revs`, takes the name of any release that points
    /// **directly** to it, or the name of a `refs/tags/` branch that points **directly** to it.
    pub fix_revs_tag_names: Vec<String>,
    /// Whether the vulnerability report claims the vulnerability is present since the beginning of
    /// the project
    ///
    /// ie. the event `{ "introduced": "0" }` is present.
    ///
    /// This is usually a lie, and actually means the author of the vulnerability report does not
    /// know when the vulnerability was introduced.
    pub introduced_at_zero: bool,
    /// Revisions that introduced the vulnerability according to the vulnerability report
    pub known_introduction_revs: Vec<SWHID>,
    /// Union of tag names of each revision in `known_introduction_revs`.
    pub known_introduction_revs_tag_names: Vec<String>,
    /// Revisions that introduced the vulnerability according to SZZ,
    /// and for each path, which lines in it are considered to be the introduction
    pub computed_introduction_revs: HashMap<SWHID, IntroductionFilesRecord>,
    /// Union of tag names of each revision in `computed_introduction_revs`.
    pub computed_introduction_revs_tag_names: Vec<String>,
}
```

For example:

```json
{"vuln_filename":"osv-output/CVE-2021-25313.json","fix_revs":["swh:1:rev:65f7c844267bf7336a38ee6ea3e0e63af9e21274"],"fix_revs_tag_names":["v2.5.6","v2.5.6-rc9"],"introduced_at_zero":true,"known_introduction_revs":[],"known_introduction_revs_tag_names":[],"computed_introduction_revs":{"swh:1:rev:e7bbe784067ff66c6992171d51c1c2f5f5330806":{"pkg/catalogv2/helmop/operation.go":{"line_ranges":[{"start":765,"end":766},{"start":759,"end":760}]}},"swh:1:rev:1b6a525e1052da363f3f71c4451d7c2d50b7e967":{"pkg/catalogv2/helmop/operation.go":{"line_ranges":[{"start":589,"end":590}]}}},"computed_introduction_revs_tag_names":[]}
{"vuln_filename":"osv-output/CVE-2021-45099.json","fix_revs":["swh:1:rev:d9a9cbb4ac90e065543bc96ec2516666ff73f1ce"],"fix_revs_tag_names":["v10.0.0"],"introduced_at_zero":true,"known_introduction_revs":[],"known_introduction_revs_tag_names":[],"computed_introduction_revs":{"swh:1:rev:8b1a4f016a3e109dfaa8726f2f3a1c1940ff4c2c":{"ssh/rootfs/root/.zshrc":{"line_ranges":[{"start":94,"end":95}]}},"swh:1:rev:c3cfef680a51828c57c9d4c7b24ed756cab95f13":{"ssh/rootfs/root/.zshrc":{"line_ranges":[{"start":97,"end":98}]}},"swh:1:rev:147fd4e87c41380f683805f28005dfcd7082356f":{"ssh/rootfs/root/.zshrc":{"line_ranges":[{"start":96,"end":97}]}}},"computed_introduction_revs_tag_names":[]}
{"vuln_filename":"osv-output/CVE-2020-36179.json","fix_revs":["swh:1:rev:e19c557b789113f900018208d87446c34ae4fab3"],"fix_revs_tag_names":["jackson-databind-2.6.7.5"],"introduced_at_zero":false,"known_introduction_revs":["swh:1:rev:e8df0987e3034d102ee6d704d30a05a2e3ac7089"],"known_introduction_revs_tag_names":["jackson-databind-2.0.0"],"computed_introduction_revs":{"swh:1:rev:8069e46dd9c288d4a52911ebdc52192cd3d0e96c":{"pom.xml":{"line_ranges":[{"start":23,"end":24},{"start":12,"end":13}]}}},"computed_introduction_revs_tag_names":[]}
{"vuln_filename":"osv-output/CVE-2022-27818.json","fix_revs":["swh:1:rev:f70b99dd575fab79d8a942111a6980431f006818"],"fix_revs_tag_names":[],"introduced_at_zero":true,"known_introduction_revs":[],"known_introduction_revs_tag_names":[],"computed_introduction_revs":{"swh:1:rev:c15b5b153e94f12d3e92ed9568f7ea0928141c1c":{"src/daemon.rs":{"line_ranges":[{"start":31,"end":34}]}},"swh:1:rev:978fa8195b46eed1b6e479e5679b1fd95a3f55a8":{"src/daemon.rs":{"line_ranges":[{"start":30,"end":32}]}},"swh:1:rev:6097674e18e2e34f68b340a40d36dcd23c258d00":{"src/daemon.rs":{"line_ranges":[{"start":307,"end":308}]},"src/server.rs":{"line_ranges":[{"start":14,"end":15}]}},"swh:1:rev:b4e6dc76f4845ab03104187a42ac6d1bbc1e0021":{"src/daemon.rs":{"line_ranges":[{"start":408,"end":409},{"start":404,"end":405}]}}},"computed_introduction_revs_tag_names":[]}
{"vuln_filename":"osv-output/CVE-2018-16846.json","fix_revs":["swh:1:rev:b10be4d44915a4d78a8e06aa31919e74927b142e"],"fix_revs_tag_names":["v13.2.4"],"introduced_at_zero":true,"known_introduction_revs":[],"known_introduction_revs_tag_names":[],"computed_introduction_revs":{"swh:1:rev:9bf3c8b1a04b0aa4a3cc78456a508f1c48e70279":{"CMakeLists.txt":{"line_ranges":[{"start":3,"end":4}]}}},"computed_introduction_revs_tag_names":["v13.2.3"]}
{"vuln_filename":"osv-output/CVE-2020-4071.json","fix_revs":["swh:1:rev:8a7dfe2161d241f4a79775a99c7c94405ad3975d"],"fix_revs_tag_names":["v0.3.4"],"introduced_at_zero":true,"known_introduction_revs":[],"known_introduction_revs_tag_names":[],"computed_introduction_revs":{"swh:1:rev:42ccebf98daa7c86ead0df65345361f9bdc17b5a":{"setup.cfg":{"line_ranges":[{"start":5,"end":6}]}}},"computed_introduction_revs_tag_names":[]}
{"vuln_filename":"osv-output/CVE-2021-44108.json","fix_revs":["swh:1:rev:d919b2744cd05abae043490f0a3dd1946c1ccb8c"],"fix_revs_tag_names":[],"introduced_at_zero":true,"known_introduction_revs":[],"known_introduction_revs_tag_names":[],"computed_introduction_revs":{"swh:1:rev:235a041b8d7638db931114ace49e4f771508830f":{"src/amf/namf-handler.c":{"line_ranges":[{"start":202,"end":204}]}},"swh:1:rev:d0673e3066ff14ce2d965b436ccb9b3646a38705":{"lib/sbi/message.c":{"line_ranges":[{"start":467,"end":468}]}},"swh:1:rev:c9363b132093581b6fd2ce794aa63cd597bf83a6":{"src/amf/namf-handler.c":{"line_ranges":[{"start":172,"end":173}]}},"swh:1:rev:dbee687a75797e0be5f8484030d11ea22e18b63c":{"lib/sbi/message.c":{"line_ranges":[{"start":1325,"end":1326},{"start":1391,"end":1392},{"start":1499,"end":1501},{"start":1328,"end":1330},{"start":1357,"end":1358},{"start":1334,"end":1336}]}}},"computed_introduction_revs_tag_names":[]}
{"vuln_filename":"osv-output/CVE-2016-6306.json","fix_revs":["swh:1:rev:848d650dade802c835b4b3a1e29c7581e79494ed"],"fix_revs_tag_names":["v0.10.47"],"introduced_at_zero":false,"known_introduction_revs":["swh:1:rev:163ca274230fce536afe76c64676c332693ad7c1"],"known_introduction_revs_tag_names":["v0.10.0"],"computed_introduction_revs":{"swh:1:rev:3e711f14ae7db34350fcc5b1d7ffd4a8cfc2daef":{"src/node_version.h":{"line_ranges":[{"start":28,"end":29}]}}},"computed_introduction_revs_tag_names":[]}
{"vuln_filename":"osv-output/CVE-2020-25706.json","fix_revs":["swh:1:rev:39458efcd5286d50e6b7f905fedcdc1059354e6e"],"fix_revs_tag_names":[],"introduced_at_zero":true,"known_introduction_revs":[],"known_introduction_revs_tag_names":[],"computed_introduction_revs":{"swh:1:rev:20a1b073420eeecab9eca2f1a8d86f30f81c2f23":{"lib/import.php":{"line_ranges":[{"start":983,"end":984}]}},"swh:1:rev:0ba5711f09338a7019ed5622701a7effd83ba701":{"lib/import.php":{"line_ranges":[{"start":1756,"end":1757}]}}},"computed_introduction_revs_tag_names":[]}
{"vuln_filename":"osv-output/CVE-2024-43440.json","fix_revs":["swh:1:rev:aea9770cfc7d003a737f4899489d1e3982efe9ac"],"fix_revs_tag_names":["v4.1.12"],"introduced_at_zero":true,"known_introduction_revs":[],"known_introduction_revs_tag_names":[],"computed_introduction_revs":{"swh:1:rev:44305df587ad156e6ccc8495bfbcd45e45370c23":{"version.php":{"line_ranges":[{"start":31,"end":32},{"start":34,"end":35}]}}},"computed_introduction_revs_tag_names":[]}
```
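
These records can be consumed with nothing more than the standard library. The sketch below parses ndjson lines and lists, for each computed introduction revision, the paths it touched; the helper names are illustrative and the sample line is one of the records shown above.

```python
import json
from typing import Iterable, Iterator


def parse_ndjson(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty line."""
    for line in lines:
        if line.strip():
            yield json.loads(line)


def introduced_paths(record: dict) -> dict[str, list[str]]:
    """Map each computed introduction revision to the sorted paths it is blamed for."""
    return {
        rev: sorted(files)
        for rev, files in record["computed_introduction_revs"].items()
    }


sample_lines = [
    '{"vuln_filename":"osv-output/CVE-2020-4071.json","fix_revs":["swh:1:rev:8a7dfe2161d241f4a79775a99c7c94405ad3975d"],"fix_revs_tag_names":["v0.3.4"],"introduced_at_zero":true,"known_introduction_revs":[],"known_introduction_revs_tag_names":[],"computed_introduction_revs":{"swh:1:rev:42ccebf98daa7c86ead0df65345361f9bdc17b5a":{"setup.cfg":{"line_ranges":[{"start":5,"end":6}]}}},"computed_introduction_revs_tag_names":[]}',
]

for record in parse_ndjson(sample_lines):
    print(record["vuln_filename"], introduced_paths(record))
```

To read a real output file, pass an open file object to `parse_ndjson` instead of the list.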

We also implement some variants of it:

* Ignoring all whitespace at the beginning or end of a line, with `--tokenizer trimmed-line`
* Ignoring all whitespace, with `--tokenizer whitespace-stripped-line`
* Considering a changed line to be the same line when it remains at least N% similar to its previous version (similar to [V-SZZ](https://baolingfeng.github.io/papers/ICSE2022VSZZ.pdf)), with `--min-line-similarity-ratio 0.75` (for 75%, like V-SZZ)
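
A rough sketch of what these three variants compare. The function names mirror the option values but are written for this example, and `difflib`'s ratio is used as a stand-in similarity measure; the measure actually used by the `szz` binary may be different.

```python
from difflib import SequenceMatcher


def trimmed_line(line: str) -> str:
    """Ignore whitespace at the beginning and end of the line."""
    return line.strip()


def whitespace_stripped_line(line: str) -> str:
    """Ignore all whitespace in the line."""
    return "".join(line.split())


def similar_enough(before: str, after: str, min_ratio: float = 0.75) -> bool:
    """Treat an edited line as the same line when it stays similar enough."""
    ratio = SequenceMatcher(None, before, after, autojunk=False).ratio()
    return ratio >= min_ratio


print(trimmed_line("\t\tfoo bar \n"))
print(whitespace_stripped_line("\tfoo  bar \n"))
print(similar_enough("  <version>2.6.7.4</version>", "  <version>2.6.7.5-SNAPSHOT</version>"))
```

With the first two, lines that differ only in the ignored whitespace compare equal; with the third, a small edit such as a version bump no longer ends the blame walk at that revision.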

We also provide optional verbose output. `--line-details-out-dir data/szz-line-details/` enables ndjson output of the provenance of each line in this format:

```rust
pub struct SzzLineDetailsRecord {
    pub fix_rev: RevisionRecord,
    pub path: String,
    pub vulnerable_hunk_before_fix: HunkRecord,
    /// Introduction rev computed by SZZ
    pub introduction_rev: RevisionRecord,
    pub vulnerable_hunk_after_introduction: HunkRecord,
    pub hunk_before_introduction: Option<HunkRecord>,
    pub file_creation: RapidHashSet<RevisionRecord>,
    /// number of blame steps from the fix rev to find the introduction rev
    ///
    /// It is found through a DFS or BFS, so it is greater or equal to `intro_to_fix_rev_distance`.
    pub intro_to_fix_rev_num_blame_steps: u64,
    /// number of revisions between introduction_rev and fix_rev (shortest path)
    pub intro_to_fix_rev_distance: u64,
    /// number of revisions between file_creation_rev and introduction_rev (shortest path)
    pub creation_to_intro_rev_distance: u64,
    /// number of revisions between file_creation_rev and fix_rev (shortest path, usually equal
    /// to `intro_to_fix_rev_distance + creation_to_intro_rev_distance` but may be smaller)
    pub creation_to_fix_rev_distance: u64,
}

pub struct HunkRecord {
    pub hunk: String,
    pub line_range: LineRange,
}

pub struct RevisionRecord {
    pub swhid: SWHID,
    pub author_timestamp: Option<i64>,
    pub committer_timestamp: Option<i64>,
}
```

For example:

```json
{"fix_rev":{"swhid":"swh:1:rev:65f7c844267bf7336a38ee6ea3e0e63af9e21274","author_timestamp":1614838845,"committer_timestamp":1614839856},"path":"pkg/catalogv2/helmop/operation.go","vulnerable_hunk_before_fix":{"hunk":"\t\t\t\t\tOperator: \"Equal\",\n","line_range":{"start":752,"end":753}},"introduction_rev":{"swhid":"swh:1:rev:1b6a525e1052da363f3f71c4451d7c2d50b7e967","author_timestamp":1599023421,"committer_timestamp":1599023421},"vulnerable_hunk_after_introduction":{"hunk":"\t\t\t\t\tOperator: \"Equal\",\n","line_range":{"start":589,"end":590}},"hunk_before_introduction":{"hunk":"\t\t\t\t\tOperator: \"Equals\",\n","line_range":{"start":589,"end":590}},"file_creation":[{"swhid":"swh:1:rev:01105fa239444886d000dbf14f41f0909b2ac699","author_timestamp":1596350639,"committer_timestamp":1596352311}],"intro_to_fix_rev_num_blame_steps":18,"intro_to_fix_rev_distance":146,"creation_to_intro_rev_distance":31,"creation_to_fix_rev_distance":149}
{"fix_rev":{"swhid":"swh:1:rev:65f7c844267bf7336a38ee6ea3e0e63af9e21274","author_timestamp":1614838845,"committer_timestamp":1614839856},"path":"pkg/catalogv2/helmop/operation.go","vulnerable_hunk_before_fix":{"hunk":"\t\t\t\t\tOperator: \"Equal\",\n","line_range":{"start":758,"end":759}},"introduction_rev":{"swhid":"swh:1:rev:e7bbe784067ff66c6992171d51c1c2f5f5330806","author_timestamp":1612981111,"committer_timestamp":1612981722},"vulnerable_hunk_after_introduction":{"hunk":"\t\t\t\t\tOperator: \"Equal\",\n","line_range":{"start":759,"end":760}},"hunk_before_introduction":{"hunk":"","line_range":{"start":759,"end":759}},"file_creation":[{"swhid":"swh:1:rev:01105fa239444886d000dbf14f41f0909b2ac699","author_timestamp":1596350639,"committer_timestamp":1596352311}],"intro_to_fix_rev_num_blame_steps":18,"intro_to_fix_rev_distance":35,"creation_to_intro_rev_distance":133,"creation_to_fix_rev_distance":149}
{"fix_rev":{"swhid":"swh:1:rev:65f7c844267bf7336a38ee6ea3e0e63af9e21274","author_timestamp":1614838845,"committer_timestamp":1614839856},"path":"pkg/catalogv2/helmop/operation.go","vulnerable_hunk_before_fix":{"hunk":"\t\t\t\t\tOperator: \"Equal\",\n","line_range":{"start":764,"end":765}},"introduction_rev":{"swhid":"swh:1:rev:e7bbe784067ff66c6992171d51c1c2f5f5330806","author_timestamp":1612981111,"committer_timestamp":1612981722},"vulnerable_hunk_after_introduction":{"hunk":"\t\t\t\t\tOperator: \"Equal\",\n","line_range":{"start":765,"end":766}},"hunk_before_introduction":{"hunk":"","line_range":{"start":765,"end":765}},"file_creation":[{"swhid":"swh:1:rev:01105fa239444886d000dbf14f41f0909b2ac699","author_timestamp":1596350639,"committer_timestamp":1596352311}],"intro_to_fix_rev_num_blame_steps":18,"intro_to_fix_rev_distance":35,"creation_to_intro_rev_distance":133,"creation_to_fix_rev_distance":149}
{"fix_rev":{"swhid":"swh:1:rev:d9a9cbb4ac90e065543bc96ec2516666ff73f1ce","author_timestamp":1639584568,"committer_timestamp":1639584568},"path":"ssh/rootfs/root/.zshrc","vulnerable_hunk_before_fix":{"hunk":"# Home Assistant Core CLI\n","line_range":{"start":95,"end":96}},"introduction_rev":{"swhid":"swh:1:rev:8b1a4f016a3e109dfaa8726f2f3a1c1940ff4c2c","author_timestamp":1581777236,"committer_timestamp":1581777236},"vulnerable_hunk_after_introduction":{"hunk":"# Home Assistant Core CLI\n","line_range":{"start":94,"end":95}},"hunk_before_introduction":{"hunk":"# Home Assistant CLI\n","line_range":{"start":94,"end":95}},"file_creation":[{"swhid":"swh:1:rev:f57215516081de79978ad0da71d046d70504dce0","author_timestamp":1506461095,"committer_timestamp":1506461095}],"intro_to_fix_rev_num_blame_steps":7,"intro_to_fix_rev_distance":237,"creation_to_intro_rev_distance":399,"creation_to_fix_rev_distance":636}
{"fix_rev":{"swhid":"swh:1:rev:d9a9cbb4ac90e065543bc96ec2516666ff73f1ce","author_timestamp":1639584568,"committer_timestamp":1639584568},"path":"ssh/rootfs/root/.zshrc","vulnerable_hunk_before_fix":{"hunk":"eval \"$(_HASS_CLI_COMPLETE=source_zsh hass-cli)\"\n","line_range":{"start":96,"end":97}},"introduction_rev":{"swhid":"swh:1:rev:c3cfef680a51828c57c9d4c7b24ed756cab95f13","author_timestamp":1543786701,"committer_timestamp":1543786701},"vulnerable_hunk_after_introduction":{"hunk":"eval \"$(_HASS_CLI_COMPLETE=source_zsh hass-cli)\"\n","line_range":{"start":97,"end":98}},"hunk_before_introduction":{"hunk":"","line_range":{"start":97,"end":97}},"file_creation":[{"swhid":"swh:1:rev:f57215516081de79978ad0da71d046d70504dce0","author_timestamp":1506461095,"committer_timestamp":1506461095}],"intro_to_fix_rev_num_blame_steps":7,"intro_to_fix_rev_distance":475,"creation_to_intro_rev_distance":161,"creation_to_fix_rev_distance":636}
{"fix_rev":{"swhid":"swh:1:rev:d9a9cbb4ac90e065543bc96ec2516666ff73f1ce","author_timestamp":1639584568,"committer_timestamp":1639584568},"path":"ssh/rootfs/root/.zshrc","vulnerable_hunk_before_fix":{"hunk":"\n","line_range":{"start":97,"end":98}},"introduction_rev":{"swhid":"swh:1:rev:147fd4e87c41380f683805f28005dfcd7082356f","author_timestamp":1577619683,"committer_timestamp":1577619683},"vulnerable_hunk_after_introduction":{"hunk":"\n","line_range":{"start":96,"end":97}},"hunk_before_introduction":{"hunk":"","line_range":{"start":96,"end":96}},"file_creation":[{"swhid":"swh:1:rev:f57215516081de79978ad0da71d046d70504dce0","author_timestamp":1506461095,"committer_timestamp":1506461095}],"intro_to_fix_rev_num_blame_steps":7,"intro_to_fix_rev_distance":248,"creation_to_intro_rev_distance":388,"creation_to_fix_rev_distance":636}
{"fix_rev":{"swhid":"swh:1:rev:e19c557b789113f900018208d87446c34ae4fab3","author_timestamp":1624334005,"committer_timestamp":1624334005},"path":"pom.xml","vulnerable_hunk_before_fix":{"hunk":"  <version>2.6.7.5-SNAPSHOT</version>\n","line_range":{"start":12,"end":13}},"introduction_rev":{"swhid":"swh:1:rev:8069e46dd9c288d4a52911ebdc52192cd3d0e96c","author_timestamp":1603594079,"committer_timestamp":1603594079},"vulnerable_hunk_after_introduction":{"hunk":"  <version>2.6.7.5-SNAPSHOT</version>\n","line_range":{"start":12,"end":13}},"hunk_before_introduction":{"hunk":"  <version>2.6.7.4</version>\n","line_range":{"start":12,"end":13}},"file_creation":[{"swhid":"swh:1:rev:90c4352c4d2412fbe4be10e93e2c520b9658a752","author_timestamp":1324625127,"committer_timestamp":1324625127}],"intro_to_fix_rev_num_blame_steps":1,"intro_to_fix_rev_distance":4,"creation_to_intro_rev_distance":1267,"creation_to_fix_rev_distance":1271}
{"fix_rev":{"swhid":"swh:1:rev:e19c557b789113f900018208d87446c34ae4fab3","author_timestamp":1624334005,"committer_timestamp":1624334005},"path":"pom.xml","vulnerable_hunk_before_fix":{"hunk":"    <tag>HEAD</tag>\n","line_range":{"start":23,"end":24}},"introduction_rev":{"swhid":"swh:1:rev:8069e46dd9c288d4a52911ebdc52192cd3d0e96c","author_timestamp":1603594079,"committer_timestamp":1603594079},"vulnerable_hunk_after_introduction":{"hunk":"    <tag>HEAD</tag>\n","line_range":{"start":23,"end":24}},"hunk_before_introduction":{"hunk":"    <tag>jackson-databind-2.6.7.4</tag>\n","line_range":{"start":23,"end":24}},"file_creation":[{"swhid":"swh:1:rev:90c4352c4d2412fbe4be10e93e2c520b9658a752","author_timestamp":1324625127,"committer_timestamp":1324625127}],"intro_to_fix_rev_num_blame_steps":1,"intro_to_fix_rev_distance":4,"creation_to_intro_rev_distance":1267,"creation_to_fix_rev_distance":1271}
{"fix_rev":{"swhid":"swh:1:rev:f70b99dd575fab79d8a942111a6980431f006818","author_timestamp":1648221476,"committer_timestamp":1648221476},"path":"src/daemon.rs","vulnerable_hunk_before_fix":{"hunk":"        if !config_file_path.exists() {\n            log::error!(\"{:#?} doesn't exist\", config_file_path);\n            exit(1);\n","line_range":{"start":96,"end":99}},"introduction_rev":{"swhid":"swh:1:rev:c15b5b153e94f12d3e92ed9568f7ea0928141c1c","author_timestamp":1642534411,"committer_timestamp":1642534411},"vulnerable_hunk_after_introduction":{"hunk":"    if !config_file_path.exists() {\n        log::error!(\"{:#?} doesn't exist\", config_file_path);\n        exit(1);\n","line_range":{"start":31,"end":34}},"hunk_before_introduction":null,"file_creation":[{"swhid":"swh:1:rev:c15b5b153e94f12d3e92ed9568f7ea0928141c1c","author_timestamp":1642534411,"committer_timestamp":1642534411}],"intro_to_fix_rev_num_blame_steps":72,"intro_to_fix_rev_distance":156,"creation_to_intro_rev_distance":0,"creation_to_fix_rev_distance":156}
{"fix_rev":{"swhid":"swh:1:rev:f70b99dd575fab79d8a942111a6980431f006818","author_timestamp":1648221476,"committer_timestamp":1648221476},"path":"src/daemon.rs","vulnerable_hunk_before_fix":{"hunk":"        }\n\n","line_range":{"start":99,"end":101}},"introduction_rev":{"swhid":"swh:1:rev:978fa8195b46eed1b6e479e5679b1fd95a3f55a8","author_timestamp":1644118333,"committer_timestamp":1644118333},"vulnerable_hunk_after_introduction":{"hunk":"    }\n\n","line_range":{"start":30,"end":32}},"hunk_before_introduction":{"hunk":"","line_range":{"start":30,"end":30}},"file_creation":[{"swhid":"swh:1:rev:c15b5b153e94f12d3e92ed9568f7ea0928141c1c","author_timestamp":1642534411,"committer_timestamp":1642534411}],"intro_to_fix_rev_num_blame_steps":72,"intro_to_fix_rev_distance":107,"creation_to_intro_rev_distance":51,"creation_to_fix_rev_distance":156}
```
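
As a usage sketch, the fields of one such record can be combined to get the delay between introduction and fix, and to check the relation between the three distances stated in the struct comments. It assumes timestamps are Unix seconds; the helper names are illustrative and the sample keeps only the fields needed, with values copied from one of the `pom.xml` records above.

```python
from typing import Optional

SECONDS_PER_DAY = 86_400


def days_from_introduction_to_fix(record: dict) -> Optional[float]:
    """Days between the committer dates of the introduction and fix revisions."""
    intro = record["introduction_rev"]["committer_timestamp"]
    fix = record["fix_rev"]["committer_timestamp"]
    if intro is None or fix is None:  # timestamps are optional in the format
        return None
    return (fix - intro) / SECONDS_PER_DAY


def distances_are_consistent(record: dict) -> bool:
    """The creation-to-fix distance is at most the sum of the other two."""
    return record["creation_to_fix_rev_distance"] <= (
        record["intro_to_fix_rev_distance"] + record["creation_to_intro_rev_distance"]
    )


# Trimmed copy of a record shown above.
sample_record = {
    "fix_rev": {"committer_timestamp": 1624334005},
    "introduction_rev": {"committer_timestamp": 1603594079},
    "intro_to_fix_rev_distance": 4,
    "creation_to_intro_rev_distance": 1267,
    "creation_to_fix_rev_distance": 1271,
}

print(days_from_introduction_to_fix(sample_record))
print(distances_are_consistent(sample_record))
```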

When `--min-line-similarity-ratio` is given, `--middle-revisions-out-dir` additionally enables CSV output listing all revisions that made a non-significant change to a line, in this format:

```rust
pub struct MiddleCommitRecord {
    /// parent revision of `predecessor_rev`
    pub rev: SWHID,
    pub rev_author_timestamp: Option<i64>,
    pub rev_committer_timestamp: Option<i64>,
    /// middle rev touching the line
    pub predecessor_rev: SWHID,
    pub predecessor_rev_author_timestamp: Option<i64>,
    pub predecessor_rev_committer_timestamp: Option<i64>,
    pub hunk_id: String,
    /// fix rev
    pub ancestor_rev: SWHID,
    pub ancestor_rev_author_timestamp: Option<i64>,
    pub ancestor_rev_committer_timestamp: Option<i64>,
    /// always "V-SZZ_middle_revision"
    pub tag: String,
    pub code_before_id: SWHID,
    pub code_after_id: SWHID,
    pub file_path: String,
    /// entire file at 'rev'
    pub code_before: String,
    /// entire file at 'predecessor_rev'
    pub code_after: String,
}
```

For example:

```text
rev,rev_author_timestamp,rev_committer_timestamp,predecessor_rev,predecessor_rev_author_timestamp,predecessor_rev_committer_timestamp,hunk_id,ancestor_rev,ancestor_rev_author_timestamp,ancestor_rev_committer_timestamp,tag,code_before,code_after
swh:1:rev:ff68f28b1e21feb9fd584847b2272aef2fc370dd,1533889455,1533889686,swh:1:rev:93dcbcf3b9e0726c03b45b7e74ec9ca4c89eab03,1533893246,1533893246,,swh:1:rev:9b5bbd48a72096930af08402c5e07fce7dd770f3,1544087928,1544087928,V-SZZ_similar_line,"
","
"
swh:1:rev:ff68f28b1e21feb9fd584847b2272aef2fc370dd,1533889455,1533889686,swh:1:rev:93dcbcf3b9e0726c03b45b7e74ec9ca4c89eab03,1533893246,1533893246,,swh:1:rev:9b5bbd48a72096930af08402c5e07fce7dd770f3,1544087928,1544087928,V-SZZ_similar_line,"	fmt.Fprintf(w, `
","	fmt.Fprintf(w, `
"
swh:1:rev:71adb3c4170dc47f71c21bf8d95ed7ddd640819e,1635286588,1635286588,swh:1:rev:d9492ec19b76aca2b13e18131fe46078810984af,1635287082,1635287082,,swh:1:rev:fddf01938d3789e06cc1c3774e4cd0c7d2a89976,1674068199,1674068199,V-SZZ_similar_line,"SET (CARES_LIB_VERSIONINFO ""7:0:5"")
","SET (CARES_LIB_VERSIONINFO ""7:1:5"")
"
swh:1:rev:7586c5f19f94923b9c722351cfd41696cd9764d9,1634813012,1634813012,swh:1:rev:800e4727d1e38cec97767437b8202f60a94f3f1d,1635175567,1635175567,,swh:1:rev:fddf01938d3789e06cc1c3774e4cd0c7d2a89976,1674068199,1674068199,V-SZZ_similar_line,"PROJECT (c-ares LANGUAGES C VERSION ""1.17.2"" )
","PROJECT (c-ares LANGUAGES C VERSION ""1.18.0"" )
"
swh:1:rev:7586c5f19f94923b9c722351cfd41696cd9764d9,1634813012,1634813012,swh:1:rev:800e4727d1e38cec97767437b8202f60a94f3f1d,1635175567,1635175567,,swh:1:rev:fddf01938d3789e06cc1c3774e4cd0c7d2a89976,1674068199,1674068199,V-SZZ_similar_line,"SET (CARES_LIB_VERSIONINFO ""6:3:4"")
","SET (CARES_LIB_VERSIONINFO ""7:0:5"")
"
swh:1:rev:11a2bf8efd88d961f3b2c5dea04b09b4af247bce,1625070329,1625070329,swh:1:rev:fe282cf172c63f2bca21e8fda50a318cad4a7c69,1626972694,1626972694,,swh:1:rev:fddf01938d3789e06cc1c3774e4cd0c7d2a89976,1674068199,1674068199,V-SZZ_similar_line,"PROJECT (c-ares LANGUAGES C VERSION ""1.17.0"" )
","PROJECT (c-ares LANGUAGES C VERSION ""1.17.2"" )
"
```
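
Because `code_before` and `code_after` contain newlines and doubled quotes, these files cannot be parsed line by line. The following is a minimal sketch of reading them with Python's standard `csv` module, assuming the header shown in the example above. The helper name `read_middle_revisions` is made up for illustration and is not part of this package.

```python
import csv
import io

# One record copied from the example above; quoted fields span several lines.
SAMPLE = '''rev,rev_author_timestamp,rev_committer_timestamp,predecessor_rev,predecessor_rev_author_timestamp,predecessor_rev_committer_timestamp,hunk_id,ancestor_rev,ancestor_rev_author_timestamp,ancestor_rev_committer_timestamp,tag,code_before,code_after
swh:1:rev:71adb3c4170dc47f71c21bf8d95ed7ddd640819e,1635286588,1635286588,swh:1:rev:d9492ec19b76aca2b13e18131fe46078810984af,1635287082,1635287082,,swh:1:rev:fddf01938d3789e06cc1c3774e4cd0c7d2a89976,1674068199,1674068199,V-SZZ_similar_line,"SET (CARES_LIB_VERSIONINFO ""7:0:5"")
","SET (CARES_LIB_VERSIONINFO ""7:1:5"")
"
'''


def read_middle_revisions(stream):
    """Hypothetical helper: return one dict per CSV record.

    csv.DictReader follows quoted fields across physical lines and
    unescapes doubled quotes, so each record comes back whole.
    """
    return list(csv.DictReader(stream))


# newline="" keeps embedded line endings untouched, as the csv docs recommend.
rows = read_middle_revisions(io.StringIO(SAMPLE, newline=""))
for row in rows:
    print(row["predecessor_rev"], repr(row["code_before"]), "->", repr(row["code_after"]))
```

When reading a real file, open it with `open(path, newline="")` for the same reason.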

## SZZ-related diffs

Diffs of all revisions mentioned in any of SZZ's outputs can also be produced. For the recommended SZZ variants, compute them with:

```
make data/szz-diffs.tar.zst
make data/vszz90-diffs.tar.zst
make data/vszz75-diffs.tar.zst
```

See the `Makefile` for details.

## Customizing SZZ

The SZZ implementation ([`SzzProcessor`](https://docs.rs/swh-osv/latest/swh_osv/szz/struct.SzzProcessor.html)) is parametrized by multiple types, which can be provided by users:

* [`StrategyFactory`](https://docs.rs/swh-osv/latest/swh_osv/szz/strategies/trait.StrategyFactory.html), which returns instances of [`Strategy`](https://docs.rs/swh-osv/latest/swh_osv/szz/strategies/trait.Strategy.html), which in turn compute:
  * given a version range from an OSV document, the fix revision to start from ([`NaiveStrategy`](https://docs.rs/swh-osv/latest/swh_osv/szz/strategies/struct.NaiveStrategy.html) returns the "fix" events)
  * from a list of diffs, the set of vulnerable hunks ([`NaiveStrategy`](https://docs.rs/swh-osv/latest/swh_osv/szz/strategies/struct.NaiveStrategy.html) returns all deleted/modified hunks)
* [`RevisionSkipper`](https://docs.rs/swh-contents/latest/swh_contents/blame/blame/trait.RevisionSkipper.html), which takes as input a revision and its parent (and the version of the file in each), and returns, if the revision should be skipped, a mapping from lines in the revision to lines in the parent. [`DefaultRevisionSkipper`](https://docs.rs/swh-contents/latest/swh_contents/blame/blame/struct.DefaultRevisionSkipper.html) never returns anything (ie. it skips no revision)
* [`Tokenizer`](https://docs.rs/swh-contents/latest/swh_contents/diff/tokenizers/trait.Tokenizer.html) which takes as input a version of a file, and returns its lines with customizable comparison implementations.
  * the default [`line_tokenizer`](https://docs.rs/swh-contents/latest/swh_contents/diff/tokenizers/fn.line_tokenizer.html) returns lines as-is
  * [`TrimmedAsciiTokenizer`](https://docs.rs/swh-contents/latest/swh_contents/diff/tokenizers/struct.TrimmedAsciiTokenizer.html) returns lines whose comparisons are insensitive to leading and trailing ASCII spaces
  * [`StrippedAsciiWhitespaceTokenizer`](https://docs.rs/swh-osv/latest/swh_osv/szz/tokenizers/struct.StrippedAsciiWhitespaceTokenizer.html) returns lines whose comparisons are insensitive to any ASCII spaces
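
To make the difference between the three tokenizers concrete, here is a Python sketch of the comparison semantics described above. It is only an illustration, not the Rust API: the function names are hypothetical, and it assumes that "ASCII spaces" means the ASCII whitespace characters (space, tab, newline, carriage return, vertical tab, form feed).

```python
# Assumption: "ASCII spaces" covers these six ASCII whitespace characters.
ASCII_WHITESPACE = " \t\n\r\x0b\x0c"


def as_is_key(line: str) -> str:
    """Lines compare byte for byte (behaviour described for line_tokenizer)."""
    return line


def trimmed_key(line: str) -> str:
    """Ignore leading and trailing ASCII whitespace only."""
    return line.strip(ASCII_WHITESPACE)


def stripped_key(line: str) -> str:
    """Ignore ASCII whitespace anywhere in the line."""
    return "".join(ch for ch in line if ch not in ASCII_WHITESPACE)


def lines_equal(a: str, b: str, key) -> bool:
    """Two lines are 'the same' when their comparison keys match."""
    return key(a) == key(b)


reindented = ("    foo(bar);", "\tfoo(bar);")
respaced = ("foo( bar );", "foo(bar);")

for name, key in [("as-is", as_is_key), ("trimmed", trimmed_key), ("stripped", stripped_key)]:
    print(name, lines_equal(*reindented, key), lines_equal(*respaced, key))
```

A more permissive comparison makes blame treat more edits (such as reindentation) as leaving a line unchanged, so it keeps following that line further back in history.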

## Data formats

This package produces files in various formats:

* `all.sqlite`: a database with verbatim OSV documents plus some indexes, and an integer id for each document. See `swh/osv/to_sqlite.py` for the exact schema
* `connected_components.wccs`: a renumbering of revisions connected to any vulnerable commit. This allows revisions to be identified by a small integer (in the range [0; 300M]) instead of being a sparse subset of all node ids in the graph ([0; 60G]). This is an [epserde](https://docs.rs/epserde/) serialization of [swh_graph_stdlib::connectivity::SubgraphWccs](https://docs.rs/swh-graph-stdlib/latest/swh_graph_stdlib/connectivity/struct.SubgraphWccs.html), which is based on an [Elias-Fano](https://docs.rs/sux/0.12.3/sux/dict/elias_fano/index.html) sequence. It also identifies which connected component a revision belongs to, which is useful to identify cherry-picks.
* `commit2vuln_without_cherrypicks.*`: a map from small revision id to ids of documents in sqlite, using the [BVGraph](https://docs.rs/webgraph/latest/webgraph/graphs/bvgraph/) format (note: this is not actually a graph, it just reuses BVGraph as a generic map from integers to sets of integers). It is built directly from the `introduced` and `fixed` information in OSV documents using graph traversals
* `commit2vuln_with_cherrypicks.*`: same as `commit2vuln_without_cherrypicks.*`, but enriches the sets of `introduced` and `fixed` events by mining commit messages for [cherry-pick](https://git-scm.com/docs/git-cherry-pick) information. It does so by considering that any cherry-pick of an introducing (resp. fixing) commit is also an introducing (resp. fixing) commit, transitively.
* `commit2vuln_without_cherrypicks/*.parquet` and `commit2vuln_using_cherrypicks/*.parquet`: same as above, but designed for portability at the expense of query time and file size
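
The transitive cherry-pick rule used for `commit2vuln_with_cherrypicks.*` can be sketched as a breadth-first closure. This is a simplified illustration under stated assumptions, not the package's implementation: the function name and the `cherry_picks` mapping (original commit to the commits that cherry-pick it, as would be mined from commit messages) are hypothetical.

```python
from collections import deque


def close_over_cherry_picks(seeds, cherry_picks):
    """Return `seeds` plus every commit reachable through cherry-pick links.

    seeds: commits listed as `introduced` (or `fixed`) events.
    cherry_picks: hypothetical dict mapping a commit to the commits that
        cherry-pick it.
    """
    closed = set(seeds)
    queue = deque(closed)
    while queue:
        commit = queue.popleft()
        for pick in cherry_picks.get(commit, ()):
            if pick not in closed:  # also guards against cycles in mined data
                closed.add(pick)
                queue.append(pick)
    return closed


# "fix1" was cherry-picked as "fix2", which was itself cherry-picked as "fix3".
picks = {"fix1": ["fix2"], "fix2": ["fix3"], "other": ["other-pick"]}
fixing_commits = close_over_cherry_picks({"fix1"}, picks)
print(sorted(fixing_commits))
```

The same closure applies separately to introducing commits and to fixing commits.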
