Quickstart
Save and Load
Format is auto-detected from the file extension:
from scitex_io import save, load
# DataFrames
import pandas as pd
df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
save(df, "data.csv")
loaded = load("data.csv")
# NumPy arrays
import numpy as np
save(np.array([1, 2, 3]), "data.npy")
# Dictionaries
save({"key": "value"}, "config.yaml")
save({"nested": [1, 2]}, "data.json")
# Any Python object
save({"complex": object()}, "data.pkl")
One function for save, one for load — 30+ formats work the same way.
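To make the idea of extension-based dispatch concrete, here is a minimal sketch using only the standard library. It is not scitex_io's implementation; the names sketch_save, sketch_load, SAVERS, and LOADERS are hypothetical and exist only for illustration.

```python
import json
import pickle
import tempfile
from pathlib import Path


def _save_json(obj, path):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(obj, f)


def _load_json(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def _save_pickle(obj, path):
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def _load_pickle(path):
    with open(path, "rb") as f:
        return pickle.load(f)


# Hypothetical dispatch tables: lowercase extension -> handler function.
SAVERS = {".json": _save_json, ".pkl": _save_pickle}
LOADERS = {".json": _load_json, ".pkl": _load_pickle}


def _lookup(table, path):
    """Pick the handler whose key matches the file extension of `path`."""
    ext = Path(path).suffix.lower()
    if ext not in table:
        raise ValueError(f"no handler for extension {ext!r}")
    return table[ext]


def sketch_save(obj, path):
    _lookup(SAVERS, path)(obj, path)


def sketch_load(path):
    return _lookup(LOADERS, path)(path)


# Round-trip two formats through the same pair of functions.
with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "data.json"
    pkl_path = Path(tmp) / "data.pkl"
    sketch_save({"nested": [1, 2]}, json_path)
    sketch_save({"point": (1, 2)}, pkl_path)
    from_json = sketch_load(json_path)
    from_pkl = sketch_load(pkl_path)
```

The caller never names a format; adding one means adding a table entry, which is the property the save/load pair above relies on.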
List Available Formats
from scitex_io import list_formats
formats = list_formats()
print(f"Save: {len(formats['save']['builtin'])} built-in formats")
print(f"Load: {len(formats['load']['builtin'])} built-in formats")
Save: 24 built-in formats
Load: 29 built-in formats
Custom Format Registration
Register handlers for any file extension:
from scitex_io import register_saver, register_loader, save, load
SEP = " " * 3  # three spaces, matching the ".tsv3" name

@register_saver(".tsv3")
def save_tsv3(obj, path, **kwargs):
    """Save with 3-space-separated values."""
    with open(path, "w") as f:
        for row in obj:
            f.write(SEP.join(str(v) for v in row) + "\n")

@register_loader(".tsv3")
def load_tsv3(path, **kwargs):
    """Load 3-space-separated values."""
    with open(path) as f:
        return [line.strip().split(SEP) for line in f]
# Now .tsv3 works like any built-in format
save([[1, 2], [3, 4]], "data.tsv3")
assert load("data.tsv3") == [["1", "2"], ["3", "4"]]
Note
User-registered handlers take priority over built-in ones for the same extension. This lets you override default behavior without modifying the library.
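One way such a priority rule could work is a two-table lookup that consults user handlers first. The sketch below is an assumption about the general technique, not a description of scitex_io internals; register_loader_sketch, find_loader, and both tables are hypothetical names.

```python
# Hypothetical registries: built-in handlers and user overrides kept apart.
BUILTIN_LOADERS = {}
USER_LOADERS = {}


def register_loader_sketch(ext, *, builtin=False):
    """Decorator that stores a handler under a lowercase extension."""
    table = BUILTIN_LOADERS if builtin else USER_LOADERS

    def decorator(func):
        table[ext.lower()] = func
        return func

    return decorator


def find_loader(ext):
    """Return the user handler if one exists, else the built-in one."""
    ext = ext.lower()
    if ext in USER_LOADERS:
        return USER_LOADERS[ext]
    return BUILTIN_LOADERS[ext]  # raises KeyError for unknown extensions


@register_loader_sketch(".txt", builtin=True)
def builtin_txt(text):
    return text


before_override = find_loader(".txt")


@register_loader_sketch(".txt")
def user_txt(text):
    return text.upper()


after_override = find_loader(".TXT")
```

Keeping the two tables separate means an override never destroys the built-in handler, so removing the user entry would restore the default.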
Project Configuration
Most research projects have hard-coded parameters scattered across scripts — sample rates,
thresholds, model hyperparameters, plot dimensions. load_configs centralizes them in
YAML files under a config/ directory:
project/
    config/
        PATHS.yaml       # DATA_DIR: /data/experiment_01
        PREPROCESS.yaml  # SAMPLE_RATE: 1000, BANDPASS: [0.5, 40]
        MODEL.yaml       # HIDDEN_DIM: 256, DROPOUT: 0.3
        PLOT.yaml        # FIGSIZE: [180, 60], DPI: 300
        IS_DEBUG.yaml    # IS_DEBUG: true
from scitex_io import load_configs
# One call loads all YAML files, namespaced by filename
CONFIG = load_configs() # ./config/*.yaml
CONFIG = load_configs(config_dir="./my_configs") # custom path
CONFIG.PATHS.DATA_DIR # "/data/experiment_01"
CONFIG.PREPROCESS.SAMPLE_RATE # 1000
CONFIG.MODEL.HIDDEN_DIM # 256
# Debug mode: keys prefixed with DEBUG_ override their counterparts
# In MODEL.yaml: HIDDEN_DIM: 256, DEBUG_HIDDEN_DIM: 32
CONFIG = load_configs(IS_DEBUG=True)
CONFIG.MODEL.HIDDEN_DIM # 32 (debug value promoted)
Returns a DotDict — a nested dictionary with dot-notation access. Config keys should
use UPPER_CASE to signal that they are user-defined constants, not runtime variables.
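The DEBUG_ override rule can be illustrated on plain dictionaries. This is a sketch under assumptions: promote_debug_keys is a hypothetical helper, and whether the library keeps or drops the DEBUG_-prefixed keys after promotion is not stated above, so the sketch simply leaves them in place.

```python
PREFIX = "DEBUG_"


def promote_debug_keys(config, is_debug):
    """Return a copy of `config`; in debug mode DEBUG_X overwrites X."""
    out = {}
    for key, value in config.items():
        # Recurse so the rule applies inside each namespace (e.g. MODEL).
        if isinstance(value, dict):
            out[key] = promote_debug_keys(value, is_debug)
        else:
            out[key] = value
    if is_debug:
        for key in list(out):
            if key.startswith(PREFIX) and len(key) > len(PREFIX):
                out[key[len(PREFIX):]] = out[key]
    return out


raw = {"MODEL": {"HIDDEN_DIM": 256, "DEBUG_HIDDEN_DIM": 32, "DROPOUT": 0.3}}
normal = promote_debug_keys(raw, is_debug=False)
debug = promote_debug_keys(raw, is_debug=True)
```

Keys without a DEBUG_ counterpart, such as DROPOUT here, are identical in both modes.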
DotDict
DotDict gives dot-notation access to nested dictionaries:
from scitex_io import DotDict
d = DotDict({"MODEL": {"HIDDEN_DIM": 256, "LAYERS": 4}})
d.MODEL.HIDDEN_DIM # 256
d.MODEL.LAYERS # 4
# Standard dict operations work too
d["MODEL"]["HIDDEN_DIM"] # 256
d.MODEL.keys() # dict_keys(["HIDDEN_DIM", "LAYERS"])
d.MODEL.to_dict() # {"HIDDEN_DIM": 256, "LAYERS": 4}
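For readers curious how dot-notation access over nested dictionaries can be built, here is a minimal sketch. DotDictSketch is a hypothetical class, not the library's DotDict; in particular, subclassing dict is an assumption made for brevity.

```python
class DotDictSketch(dict):
    """A dict whose keys are also readable and writable as attributes."""

    def __init__(self, mapping=None):
        super().__init__()
        for key, value in (mapping or {}).items():
            self[key] = value

    def __setitem__(self, key, value):
        # Wrap nested plain dicts so chained attribute access works.
        if isinstance(value, dict) and not isinstance(value, DotDictSketch):
            value = DotDictSketch(value)
        super().__setitem__(key, value)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so methods
        # such as keys() and items() keep working.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        self[name] = value

    def to_dict(self):
        """Convert back to plain nested dicts."""
        return {
            k: v.to_dict() if isinstance(v, DotDictSketch) else v
            for k, v in self.items()
        }


d = DotDictSketch({"MODEL": {"HIDDEN_DIM": 256, "LAYERS": 4}})
d.MODEL.DROPOUT = 0.3
plain = d.to_dict()
```

A key that collides with a dict method name (for example keys) is still reachable with bracket syntax, which is one reason the upper-case key convention is convenient.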
Metadata Embedding
A saved PNG has no record of the code, parameters, or session that produced it.
embed_metadata solves this by writing provenance directly into the file:
from scitex_io import embed_metadata, read_metadata, has_metadata
# Embed metadata into a figure
embed_metadata("figure.png", {
    "experiment": "exp_042",
    "model": "resnet50",
    "accuracy": 0.94,
    "timestamp": "2026-03-11",
})
# Read it back — months later, from the file alone
meta = read_metadata("figure.png")
meta["experiment"] # "exp_042"
# Check if a file carries metadata
has_metadata("figure.png") # True
Supports PNG (tEXt chunks), JPEG (EXIF), SVG (XML metadata), and PDF (XMP metadata). The figure carries its own history — no external database needed.
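To show what writing metadata into a PNG tEXt chunk involves, the following standard-library sketch inserts a JSON string just before the IEND chunk and reads it back. It is an illustration of the PNG chunk format, not the library's code; the keyword "metadata" and the functions embed_text and read_text are hypothetical choices.

```python
import json
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type, data):
    """Serialize one chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def iter_chunks(png):
    """Yield (type, data) for every chunk in a PNG byte string."""
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        yield png[pos + 4:pos + 8], png[pos + 8:pos + 8 + length]
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC


def embed_text(png, keyword, text):
    """Return a copy of `png` with a tEXt chunk added before IEND."""
    payload = keyword.encode("latin-1") + b"\x00" + text.encode("latin-1")
    parts = [PNG_SIGNATURE]
    for chunk_type, data in iter_chunks(png):
        if chunk_type == b"IEND":
            parts.append(make_chunk(b"tEXt", payload))
        parts.append(make_chunk(chunk_type, data))
    return b"".join(parts)


def read_text(png):
    """Collect all tEXt chunks as a {keyword: text} dict."""
    found = {}
    for chunk_type, data in iter_chunks(png):
        if chunk_type == b"tEXt":
            keyword, _, text = data.partition(b"\x00")
            found[keyword.decode("latin-1")] = text.decode("latin-1")
    return found


# A 1x1 8-bit grayscale image, built by hand so the example needs no files.
tiny_png = (
    PNG_SIGNATURE
    + make_chunk(b"IHDR", struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0))
    + make_chunk(b"IDAT", zlib.compress(b"\x00\x00"))
    + make_chunk(b"IEND", b"")
)
meta = {"experiment": "exp_042", "accuracy": 0.94}
tagged = embed_text(tiny_png, "metadata", json.dumps(meta))
recovered = json.loads(read_text(tagged)["metadata"])
```

Because tEXt is an ancillary chunk, image viewers that do not understand it skip it, so the pixels render unchanged.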
Auto Path Routing
When save() receives a relative path, the output directory is determined
automatically from the execution context:
| Context | Output directory | Example |
|---|---|---|
| Script | | |
| Notebook | | |
| Interactive / IPython | | |
| Absolute path | Used as-is | |
# In ~/proj/scripts/analysis.py:
save(df, "results.csv")
# → ~/proj/scripts/analysis_out/results.csv
# Absolute path bypasses routing:
save(df, "/data/shared/results.csv")
# → /data/shared/results.csv
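The script and absolute-path cases in the example above can be reproduced with a few lines of pathlib. This sketch assumes the rule is "sibling directory named after the script stem plus _out", inferred only from that one example; route_output is a hypothetical helper, and the notebook and interactive cases are deliberately left out.

```python
from pathlib import PurePosixPath


def route_output(script_path, target):
    """Resolve `target` relative to a '<script stem>_out' directory.

    Absolute targets are returned unchanged.
    """
    target = PurePosixPath(target)
    if target.is_absolute():
        return target
    script = PurePosixPath(script_path)
    return script.parent / f"{script.stem}_out" / target


routed = route_output("~/proj/scripts/analysis.py", "results.csv")
bypassed = route_output("~/proj/scripts/analysis.py", "/data/shared/results.csv")
```

PurePosixPath is used so the sketch manipulates path strings only and never touches the filesystem.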
Advanced Save Options
from scitex_io import save
# Symlink from cwd to saved file
save(df, "results.csv", symlink_from_cwd=True)
# Symlink at a specific path
save(fig, "fig1.png", symlink_to="/data/latest/fig1.png")
# Skip auto CSV export for image saves (default: images get .csv companion)
save(fig, "plot.png", no_csv=True)
# Resolve path from the calling script, not the immediate caller.
# Essential when save() is called through library wrappers.
save(df, "results.csv", use_caller_path=True)
# Dry run — print resolved path without writing
save(df, "results.csv", dry_run=True)
Glob and Pattern Matching
Natural-sorted file matching with named placeholder parsing:
from scitex_io import glob, parse_glob, load
# Natural sort (1, 2, 10 — not 1, 10, 2)
paths = glob("data/**/*.csv")
# Brace expansion
paths = glob("results/{exp1,exp2}/*.npy")
# Parse named placeholders from paths
paths, parsed = parse_glob("sub_{id}/ses_{session}/*.vhdr")
# parsed = [{'id': '001', 'session': 'pre'}, ...]
# Glob works directly in load()
dfs = load("results/*.csv") # → list of DataFrames
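Natural sorting and placeholder parsing are both small regular-expression techniques. The sketch below shows one common way to do each; natural_key and parse_placeholders are hypothetical helpers, not the functions exported by scitex_io, and the placeholder syntax handled here is limited to {name} and *.

```python
import re


def natural_key(path):
    """Sort key that compares digit runs as integers."""
    tokens = re.split(r"(\d+)", path)
    # re.split with a capture group puts the digit runs at odd indices.
    return [int(tok) if i % 2 else tok for i, tok in enumerate(tokens)]


def parse_placeholders(pattern, path):
    """Extract {name} fields from `path`, or return None if it does not match."""
    parts = re.split(r"\{([A-Za-z_]\w*)\}", pattern)
    regex = []
    for i, part in enumerate(parts):
        if i % 2:
            # A placeholder matches one path component.
            regex.append(f"(?P<{part}>[^/]+)")
        else:
            # Literal text; '*' matches anything except a path separator.
            regex.append(re.escape(part).replace(r"\*", "[^/]*"))
    match = re.fullmatch("".join(regex), path)
    return match.groupdict() if match else None


ordered = sorted(["run10.csv", "run2.csv", "run1.csv"], key=natural_key)
fields = parse_placeholders(
    "sub_{id}/ses_{session}/*.vhdr", "sub_001/ses_pre/rec.vhdr"
)
```

Extracted fields stay strings ("001", not 1), which matches the parsed output shown above and preserves zero padding.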
Caching
Repeated loads are cached automatically by path + mtime:
from scitex_io import load, get_cache_info, configure_cache, clear_load_cache
data1 = load("large_file.hdf5") # reads from disk
data2 = load("large_file.hdf5") # returns cached copy (instant)
info = get_cache_info()
print(f"Cache hits: {info['stats']['hits']}, misses: {info['stats']['misses']}")
configure_cache(max_size=64) # increase cache size (default: 32)
clear_load_cache() # free memory
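A cache keyed on path plus modification time can be sketched in a few lines. The class below is an illustration of the technique rather than the library's cache: MtimeCache and its reader argument are hypothetical, and least-recently-used eviction is an assumption (only the default size of 32 comes from the text above).

```python
import os
import tempfile
from collections import OrderedDict


class MtimeCache:
    """Cache file contents under the key (absolute path, mtime)."""

    def __init__(self, max_size=32):
        self.max_size = max_size
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def load(self, path, reader):
        key = (os.path.abspath(path), os.stat(path).st_mtime_ns)
        if key in self._entries:
            self.hits += 1
            self._entries.move_to_end(key)  # mark as recently used
            return self._entries[key]
        self.misses += 1
        value = reader(path)
        self._entries[key] = value
        # Evict the least recently used entries beyond max_size.
        while len(self._entries) > self.max_size:
            self._entries.popitem(last=False)
        return value


def read_text(path):
    with open(path, encoding="utf-8") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("v1")
    cache = MtimeCache()
    first = cache.load(path, read_text)   # miss: reads from disk
    second = cache.load(path, read_text)  # hit: same path and mtime

    with open(path, "w", encoding="utf-8") as f:
        f.write("v2")
    # Push mtime forward explicitly so the change is visible even on
    # filesystems with coarse timestamp resolution.
    st = os.stat(path)
    os.utime(path, ns=(st.st_atime_ns, st.st_mtime_ns + 1_000_000_000))
    third = cache.load(path, read_text)   # miss: mtime changed
```

Including mtime in the key means a modified file is never served stale; the outdated entry simply ages out through eviction.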