Metadata-Version: 2.3
Name: scorep-db
Version: 0.2.0
Summary: Minimal tooling to keep track of Score-P Profiles + Traces.
Author: Maximilian Sander
Author-email: maximilian.sander1@tu-dresden.de
Requires-Python: >=3.10,<4
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Dist: minio (>=7.2.14,<7.3.0)
Requires-Dist: psycopg2-binary (>=2.9.10,<2.10.0)
Requires-Dist: python-dotenv (>=1.0.1,<1.1.0)
Requires-Dist: rdflib (>=7.1.2,<7.2.0)
Requires-Dist: rdflib-sqlalchemy (>=0.5.4,<0.6.0)
Requires-Dist: setuptools (>=75.8.0,<75.9.0)
Requires-Dist: sparqlwrapper (>=2.0.0,<2.1.0)
Project-URL: Homepage, https://gitlab.hrz.tu-chemnitz.de/s0872522--tu-dresden.de/scorep-db
Project-URL: Repository, https://gitlab.hrz.tu-chemnitz.de/s0872522--tu-dresden.de/scorep-db
Description-Content-Type: text/markdown

# scorep-db
Minimal tooling to keep track of Score-P Profiles + Traces.
It relies on Score-P's metadata collection abilities.
It is not an official Score-P software tool.

<img src="scorep_db/figures/scorep-db-overview.png" alt="scorep-db overview" width="400"/>

__Until__ version 1.0 is reached, this is a proof of concept.
Metadata schema structure may change without notice before version 1.0.

It currently relies on [feature branches of Score-P](https://perftools.pages.jsc.fz-juelich.de/cicd/scorep/branches/MR300/latest.tar.gz).
Find the repository [here](https://gitlab.hrz.tu-chemnitz.de/s0872522--tu-dresden.de/scorep-db).

---
## Install
Either install via `pip`
```
pip install scorep-db
```
or from source (git).

## Configuration

This project allows configuration via CLI arguments, environment variables, and a configuration file. Most configuration values can be specified using any of these three methods, providing high flexibility. However, certain values, such as database credentials, should be specified either via environment variables or the configuration file for security reasons.

The order of precedence for configuration values is as follows:

1. CLI arguments
2. Configuration file
3. Environment variables
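
The lookup order above can be illustrated with a small sketch. This is not the project's actual implementation: the function name `resolve_setting` and the plain dictionaries passed to it are hypothetical, and only the order (CLI, then configuration file, then environment) is taken from the list above.

```python
import os


def resolve_setting(key, cli_args, file_values, environ=None):
    """Return the first value found for `key`, or None if it is unset.

    Sources are checked in the documented order of precedence:
    CLI arguments, configuration file, environment variables.
    All three sources are plain dicts keyed by the SCOREP_DB_* name,
    which is an assumption made for this sketch.
    """
    if environ is None:
        environ = os.environ
    for source in (cli_args, file_values, environ):
        value = source.get(key)
        if value is not None:
            return value
    return None
```

For instance, if `SCOREP_DB_EXPERIMENT_PATH` is present in both the configuration file and the environment, this sketch returns the value from the configuration file.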

### CLI Arguments

The program is invoked by its command name, followed by a subcommand and any additional arguments. For example:

```bash
scorep-db add --experiment-path /path/to/experiment
```

Below is a list of available subcommands with brief explanations:

- __add__: Adds an experiment to the database
- __query__: Query the database with a SPARQL query file
- __download__: Download test cases selected by a query to the target path. The query must follow a specific structure (see "Download Query" below)
- __health-check__: Test whether the databases are available
- __merge__: Merges an offline database into an online database
- __get-id__: Get the Score-P Experiment ID (same as `scorep-info show-metadata --experiment-id`)
- __clear__: Delete everything within the selected database.

You can use `--help` or `-h` to display the available subcommands. If `--help` is provided after a subcommand, it will also display the relevant arguments for that command.

### Environment Variables

Environment variables can be used for configuration, especially for sensitive information like credentials. You can set these variables directly in your shell. The environment variables use the same keys as the configuration file.

Setting environment variables directly:

```bash
export SCOREP_DB_EXPERIMENT_PATH=/path/to/experiment
export SCOREP_DB_METADATA_SQLITE_PATH=/experiment.db
```

This method is useful when you do not want to store sensitive data in a file.

### Configuration File

The application can also be configured using a configuration file in the dotenv format.
By default, the `.env` file in the current working directory is used.
Alternatively, the configuration file can be specified via the CLI argument `--config_file`
or the environment variable `SCOREP_DB_CONFIG_FILE`.
Only one configuration file can be used at a time.
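
As a rough illustration of how the configuration file location could be determined, the sketch below checks the CLI value first, then `SCOREP_DB_CONFIG_FILE`, and finally falls back to `.env` in the current working directory. The function name `resolve_config_path` is hypothetical, and the assumption that the CLI argument wins over the environment variable is borrowed from the general order of precedence, not confirmed for this specific setting.

```python
import os
from pathlib import Path


def resolve_config_path(cli_value=None, environ=None):
    """Pick the single configuration file to use.

    Assumed order (not confirmed by the tool): CLI argument,
    then SCOREP_DB_CONFIG_FILE, then `.env` in the working directory.
    """
    if environ is None:
        environ = os.environ
    if cli_value:
        return Path(cli_value)
    env_value = environ.get("SCOREP_DB_CONFIG_FILE")
    if env_value:
        return Path(env_value)
    return Path.cwd() / ".env"
```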

With this file, you can set all configuration values:

```
SCOREP_DB_RECORD_MODE=s3
SCOREP_DB_METADATA_MODE=rdf4j
SCOREP_DB_EXPERIMENT_PATH=/path/to/experiment
SCOREP_DB_APPEND_FILES=/path/to/file1,/path/to/file2
SCOREP_DB_DOWNLOAD_DIRECTORY=/path/to/download
SCOREP_DB_QUERY_FILE=/path/to/query
SCOREP_DB_DRYRUN=True

# Record Store
SCOREP_DB_RECORD_LOCAL_DIRECTORY=/path/to/record_local

SCOREP_DB_RECORD_S3_HOSTNAME=s3.hostname.com
SCOREP_DB_RECORD_S3_PORT=443
SCOREP_DB_RECORD_S3_USER=username
SCOREP_DB_RECORD_S3_PASSWORD=password
SCOREP_DB_RECORD_S3_BUCKET_NAME=bucket_name

# Metadata Store
SCOREP_DB_METADATA_SQLITE_PATH=/experiment.db
SCOREP_DB_METADATA_SQLITE_NAME=metadata_name

SCOREP_DB_METADATA_RDF4J_HOSTNAME=rdf4j.hostname.com
SCOREP_DB_METADATA_RDF4J_PORT=8080
SCOREP_DB_METADATA_RDF4J_USER=username
SCOREP_DB_METADATA_RDF4J_PASSWORD=password
SCOREP_DB_METADATA_RDF4J_DB_NAME=db_name

SCOREP_DB_METADATA_FUSEKI_HOSTNAME=fuseki.hostname.com
SCOREP_DB_METADATA_FUSEKI_PORT=8080
SCOREP_DB_METADATA_FUSEKI_USER=username
SCOREP_DB_METADATA_FUSEKI_PASSWORD=password
SCOREP_DB_METADATA_FUSEKI_DB_NAME=db_name
```
The keys in the configuration file correspond to those of the environment variables.

## Query
See directory `example/query/` for some queries.


## Download Query
The download capability currently relies on a query with a special format.
The query __must__ be based on the following minimal example.
It is critical that the variables `?Experiment` and `?storeLocation` are part of the result.
Their names are matched case-insensitively, so upper, lower, or mixed case all work.
```
PREFIX scorep: <http://scorep-fair.github.io/ontology#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?Experiment ?storeLocation
WHERE {
  ?Experiment rdf:type scorep:ExperimentRun ;
              scorep:storeLocation ?storeLocation .
}
```
The query above will download all experiments into the specified target path.
Each download keeps its (random) `uuid` name; no renaming takes place.
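
The case-insensitive requirement on the two variables can be sketched as a small check. The helper `missing_download_variables` is hypothetical and not part of `scorep-db`; it only shows one way to verify a query's result variables before attempting a download.

```python
REQUIRED_VARIABLES = ("experiment", "storelocation")


def missing_download_variables(variable_names):
    """Return the required variables that are absent from `variable_names`.

    Names are compared case-insensitively and a leading `?` is ignored,
    so `?Experiment`, `experiment`, and `EXPERIMENT` are all accepted.
    """
    present = {name.lstrip("?").lower() for name in variable_names}
    return [required for required in REQUIRED_VARIABLES if required not in present]
```

An empty list means both required variables were found.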

The download name can be modified by providing additional search terms.
```
PREFIX scorep: <http://scorep-fair.github.io/ontology#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?Experiment ?storeLocation ?program
WHERE {
  ?Experiment rdf:type scorep:ExperimentRun  ;
              scorep:storeLocation ?storeLocation ;
              scorep:program   ?program   .
}
```
The query above leads to the name
`program_<program_name>.<experiment-id>` (e.g. `program_sp-mz.A.x.709330_1725173478_308410`).
This name is created by concatenating each additional search keyword with its value.
Each `?Experiment` is unique: if it appears multiple times in the search results, the folder name
is created by merging the key-value pairs of all its rows.

E.g. the following query
```
PREFIX scorep: <http://scorep-fair.github.io/ontology#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?experiment ?storeLocation ?program ?n ?c ?toolchain
WHERE {
  ?experiment rdf:type scorep:ExperimentRun    ;
              scorep:storeLocation   ?storeLocation ;
              scorep:program     ?program   ;
              scorep:environment ?envVar    .

  ?envVar scorep:envName  ?envName  ;
          scorep:envState ?envState ;
          scorep:envValue ?envValue .

  FILTER(?envName IN ("SLURM_NTASKS", "SLURM_CPUS_PER_TASK", "TC_NAME"))
  FILTER(?envState = "set")
  FILTER(?program = "sp-mz.A.x")

  BIND(IF(?envName = "SLURM_NTASKS", ?envValue, "") AS ?n)
  BIND(IF(?envName = "SLURM_CPUS_PER_TASK", ?envValue, "") AS ?c)
  BIND(IF(?envName = "TC_NAME", ?envValue, "") AS ?toolchain)
}
```
will create a download pattern of, e.g.,
```
n_2.c_2.toolchain_foss2022a.program_sp.mz.A.x.709330_1725173478_308410/
```

which allows the user to associate some case setup with the folder name.
Note that this naming scheme is _close_ to the
[one needed for Extra-P](https://github.com/extra-p/extrap/blob/master/docs/file-formats.md#cube-file-format).
This does not work yet, but might be addressed in the future.
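
The merging of key-value pairs into a folder name can be sketched as follows. This is only an approximation of the behaviour described above: the function `build_download_name` is hypothetical, and both the ordering of the pairs (first appearance in the result rows) and the skipping of empty values are assumptions, so the real tool may order the parts differently.

```python
RESERVED = ("experiment", "storelocation")


def build_download_name(rows, experiment_id):
    """Build a folder name from all result rows of ONE experiment.

    `rows` is a list of dicts mapping variable names to values.
    Reserved variables and empty values are skipped; the remaining
    pairs are joined as `key_value` parts separated by dots, followed
    by the experiment ID.
    """
    merged = {}
    for row in rows:
        for key, value in row.items():
            if key.lower() in RESERVED or value in ("", None):
                continue
            # Keep the first non-empty value seen for each key.
            merged.setdefault(key, value)
    parts = [f"{key}_{value}" for key, value in merged.items()]
    parts.append(experiment_id)
    return ".".join(parts)
```

With a single row containing only `program = "sp-mz.A.x"`, this yields `program_sp-mz.A.x.709330_1725173478_308410` for the experiment ID used in the examples above.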


## Inclusion of 'External' Data
Other JSON-LD files may be merged and linked into the metadata.

The user has to link their JSON-LD to the metadata of the Score-P run.
The runtime ID of the Score-P run can be extracted with

```
SCOREP_RUN_ID=`scorep-db get-id --experiment-path <path/to/experiment_directory>`
echo $SCOREP_RUN_ID
```
or
```
SCOREP_RUN_ID=`scorep-info show-metadata --experiment-id <path/to/experiment_directory>`
echo $SCOREP_RUN_ID
```

which can then be used to link the external JSON-LD to this Score-P Experiment.

See the scripts in `cube_x_to_jsonld/*` for an example of how this might be done.


## Performance
Querying via RDFlib is quite slow and, depending on the query, can be _very, very_ slow.
This issue can be solved by using a different, more performant "Triple Store" backend.

## Config File Layout
Almost all `scorep-db` commands need a config file (except `get-id`).
The config file configures paths and credentials such as the following.
```
# Offline - Data Store
SCOREP_DB_OFFLINE_DIRECTORY=${HOME}/repos/scorep-db/example/showcase_NAS-NPB/showcase_database/

# Offline - Metadata Store
SCOREP_DB_OFFLINE_PATH=${HOME}/repos/scorep-db/example/showcase_NAS-NPB/showcase_database/
SCOREP_DB_OFFLINE_NAME=scorep-experiments.db

# ----------------------------------------- #

# Online - Data Store
SCOREP_DB_ONLINE_OBJ_HOSTNAME=localhost
SCOREP_DB_ONLINE_OBJ_PORT=9000
SCOREP_DB_ONLINE_OBJ_USER=minioadmin
SCOREP_DB_ONLINE_OBJ_PASSWORD=minioadmin
SCOREP_DB_ONLINE_OBJ_BUCKET_NAME=scorep-experiments

# Online - Metadata Store
SCOREP_DB_ONLINE_RDF_HOSTNAME=localhost
SCOREP_DB_ONLINE_RDF_PORT=5432
SCOREP_DB_ONLINE_RDF_USER=postgres
SCOREP_DB_ONLINE_RDF_PASSWORD=mysecretpassword
SCOREP_DB_ONLINE_RDF_DB_NAME=postgres

# Online - Metadata Store
SCOREP_DB_METADATA_FUSEKI_HOSTNAME=fuseki.hostname.com
SCOREP_DB_METADATA_FUSEKI_PORT=8080
SCOREP_DB_METADATA_FUSEKI_USER=username
SCOREP_DB_METADATA_FUSEKI_PASSWORD=password
SCOREP_DB_METADATA_FUSEKI_DB_NAME=db_name
```

## Metadata Attribution


`scorep-db` depends on the metadata emitted by Score-P.

The environment may contain further data, which must be attributed as well.

In this case it is attributed to the run:
the environment variable's name is the property, and its value is the value.

## Example usage

View the showcase in `example/showcase_NAS-NPB/03_run_testcases.sh`

## Set Up 'Online' Infrastructure
You can use docker to host the _online_ infrastructure.
Make sure the settings below match those in your config files.

__Postgres__ (rdf database)
```bash
docker pull postgres
docker run --name my_postgres \
           -e POSTGRES_PASSWORD=mysecretpassword \
           -p 5432:5432 \
           -d \
           postgres
```
__rdf4j__ (rdf database)
```bash
docker pull eclipse/rdf4j-workbench:latest
docker run -d \
           -p 8080:8080 \
           -e JAVA_OPTS="-Xms1g -Xmx4g" \
           -v data:/var/rdf4j \
           -v logs:/usr/local/tomcat/logs \
           eclipse/rdf4j-workbench:latest testing
```
__Apache Jena Fuseki__ (rdf database)
```bash
docker run --name fuseki-data -v $HOME/fuseki busybox
docker run -d \
           -p 8100:3030 \
           -e JVM_ARGS=-Xmx2g \
           -e ADMIN_PASSWORD=<admin_password> \
           --volumes-from fuseki-data \
           stain/jena-fuseki
```
__Minio__ (profile / trace database)
```bash
docker pull minio/minio
docker run --name minio \
    -v $HOME/.minio-data:/data \
    -v $HOME/.minio:/root/.minio \
    -e "MINIO_ROOT_USER=minioadmin" \
    -e "MINIO_ROOT_PASSWORD=minioadmin" \
    -p 9000:9000 \
    -p 9001:9001 \
    -d \
    minio/minio server /data --console-address ":9001"
```
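
Once the containers are running, it can be useful to confirm that their ports accept connections before pointing `scorep-db` at them. The sketch below is a plain TCP check and is not the tool's own `health-check` subcommand. The `SERVICES` mapping and both function names are assumptions for illustration; the host ports are taken from the `-p` options in the docker commands above.

```python
import socket

# Host ports as published by the docker commands above (assumed to run on localhost).
SERVICES = {
    "postgres": ("localhost", 5432),
    "rdf4j": ("localhost", 8080),
    "fuseki": ("localhost", 8100),
    "minio": ("localhost", 9000),
}


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(services):
    """Map each service name to whether its port accepts connections."""
    return {name: is_reachable(host, port) for name, (host, port) in services.items()}
```

Calling `check_services(SERVICES)` returns a dictionary of service names and booleans. A `True` entry only means the port is open, not that the credentials or database names in the config file are correct.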

