docs for zanj v0.5.2
The ZANJ format is meant to be a way of saving arbitrary
objects to disk, in a way that is flexible, allows keeping configuration
and data together, and is human readable. It is very loosely inspired by
HDF5 and the derived exdir format, and the implementation
is inspired by npz files.
- Take any SerializableDataclass from the muutils library and save it to disk: any large arrays or lists will be stored efficiently as external files in the zip archive, while the basic structure and metadata will be stored in readable JSON files.
- Use ConfiguredModel, which inherits from torch.nn.Module, to save not just your model weights but all required configuration information, plus any other metadata (like training logs), in a single file.

This library was originally a module in muutils.
Available on PyPI as zanj
pip install zanj
You can find a runnable example of this in demo.ipynb
Any SerializableDataclass of basic types can be saved as
zanj:
    import numpy as np
    import pandas as pd
    from muutils.json_serialize import SerializableDataclass, serializable_dataclass, serializable_field
    from zanj import ZANJ

    @serializable_dataclass
    class BasicZanj(SerializableDataclass):
        a: str
        q: int = 42
        c: list[int] = serializable_field(default_factory=list)

    # initialize a zanj reader/writer
    zj = ZANJ()

    # create an instance
    instance: BasicZanj = BasicZanj("hello", 42, [1, 2, 3])
    path: str = "tests/junk_data/path_to_save_instance.zanj"
    zj.save(instance, path)
    recovered: BasicZanj = zj.read(path)

ZANJ will intelligently handle nested serializable dataclasses, numpy arrays, pytorch tensors, and pandas dataframes:
    import torch
    import pandas as pd

    @serializable_dataclass
    class Complicated(SerializableDataclass):
        name: str
        arr1: np.ndarray
        arr2: np.ndarray
        iris_data: pd.DataFrame
        brain_data: pd.DataFrame
        container: list[BasicZanj]
        torch_tensor: torch.Tensor

For custom classes, you can specify a serialization_fn and loading_fn to handle the logic of converting to and from a JSON-serializable format:
    @serializable_dataclass
    class Complicated(SerializableDataclass):
        name: str
        # serialization_fn receives the value of the field
        device: torch.device = serializable_field(
            serialization_fn=lambda device: str(device),
            loading_fn=lambda data: torch.device(data["device"]),
        )

Note that loading_fn takes the dictionary of the whole class. This is in case you’ve stored data in multiple fields of the dict which are needed to reconstruct the object.
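To illustrate why loading_fn receives the whole dictionary, here is a minimal sketch in plain Python that mimics the contract without importing muutils. `FieldSpec`, `dump`, and `load` are hypothetical stand-ins, not muutils API. The assumption is that serialization_fn is given the value of its own field, while loading_fn is given the full serialized dict.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class FieldSpec:
    """Hypothetical stand-in for a serializable field (not muutils API)."""

    name: str
    serialization_fn: Callable[[Any], Any]
    loading_fn: Callable[[dict], Any]


def dump(obj: dict, specs: list[FieldSpec]) -> dict:
    # serialization_fn sees only the value of its own field
    return {s.name: s.serialization_fn(obj[s.name]) for s in specs}


def load(data: dict, specs: list[FieldSpec]) -> dict:
    # loading_fn sees the whole serialized dict, so it can combine fields
    return {s.name: s.loading_fn(data) for s in specs}


specs = [
    FieldSpec("unit", str, lambda d: d["unit"]),
    # "length" is stored as a bare number and rebuilt using the "unit" field
    FieldSpec("length", lambda v: v[0], lambda d: (d["length"], d["unit"])),
]

original = {"unit": "cm", "length": (12.5, "cm")}
serialized = dump(original, specs)
recovered = load(serialized, specs)
```

Here the serialized form of `length` drops the unit, and its loading function recovers it from the sibling `unit` entry, which would not be possible if it only saw its own value.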
First, define a configuration class for your model. This class will
hold the parameters for your model and any associated objects (like
losses and optimizers). The configuration class should be a subclass of
SerializableDataclass and use the
serializable_field function to define fields that need
special serialization.
Here’s an example that defines the configuration for a small feed-forward network:
    from zanj.torchutil import ConfiguredModel, set_config_class

    @serializable_dataclass
    class MyNNConfig(SerializableDataclass):
        input_dim: int
        hidden_dim: int
        output_dim: int

        # store the activation function by name, reconstruct it by looking it up in torch.nn
        act_fn: torch.nn.Module = serializable_field(
            serialization_fn=lambda x: x.__name__,
            loading_fn=lambda x: getattr(torch.nn, x["act_fn"]),
        )

        # same for the loss function
        loss_kwargs: dict = serializable_field(default_factory=dict)
        loss_factory: torch.nn.modules.loss._Loss = serializable_field(
            default_factory=lambda: torch.nn.CrossEntropyLoss,
            serialization_fn=lambda x: x.__name__,
            loading_fn=lambda x: getattr(torch.nn, x["loss_factory"]),
        )
        loss = property(lambda self: self.loss_factory(**self.loss_kwargs))

Then, define your model class. It should be a subclass of ConfiguredModel, and use the set_config_class decorator to associate it with your configuration class. The __init__ method should take a single argument, which is an instance of your configuration class. You must also call the superclass __init__ method with the configuration instance.
    @set_config_class(MyNNConfig)
    class MyNN(ConfiguredModel[MyNNConfig]):
        def __init__(self, config: MyNNConfig):
            # call the superclass init!
            # this will store the config in the zanj_model_config field
            super().__init__(config)

            # whatever you want here
            self.net = torch.nn.Sequential(
                torch.nn.Linear(config.input_dim, config.hidden_dim),
                config.act_fn(),
                torch.nn.Linear(config.hidden_dim, config.output_dim),
            )

        def forward(self, x):
            return self.net(x)

You can now create instances of your model, save them to disk, and load them back into memory:
    config = MyNNConfig(
        input_dim=10,
        hidden_dim=20,
        output_dim=2,
        act_fn=torch.nn.ReLU,
        loss_kwargs=dict(reduction="mean"),
    )

    # create your model from the config, and save
    model = MyNN(config)
    fname = "tests/junk_data/path_to_save_model.zanj"
    ZANJ().save(model, fname)

    # load by calling the class method `read()`
    loaded_model = MyNN.read(fname)

    # zanj will actually infer the type of the object in the file
    # -- and will warn you if you don't have the correct package installed
    loaded_another_way = ZANJ().read(fname)

When initializing a ZANJ object, you can specify some configuration info about saving, such as:
    # how big an array or list (including pandas DataFrame) can be before moving it from the core JSON file
    external_array_threshold: int = ZANJ_GLOBAL_DEFAULTS.external_array_threshold
    external_list_threshold: int = ZANJ_GLOBAL_DEFAULTS.external_list_threshold
    # compression settings passed to `zipfile` package
    compress: bool | int = ZANJ_GLOBAL_DEFAULTS.compress
    # for doing very cursed things in your own custom loading or serialization functions
    custom_settings: dict[str, Any] | None = ZANJ_GLOBAL_DEFAULTS.custom_settings
    # specify additional serialization handlers
    handlers_pre: MonoTuple[SerializerHandler] = tuple()
    handlers_default: MonoTuple[SerializerHandler] = DEFAULT_SERIALIZER_HANDLERS_ZANJ,

The on-disk format is a file <filename>.zanj, which is a zip file containing:

- __zanj_meta__.json: a file containing zanj-specific metadata
- __zanj__.json: a file containing user-specified data
  - elements stored externally use .npy for numpy arrays or torch tensors, and .jsonl for pandas dataframes or large sequences
  - external files are recorded in __zanj_meta__.json
  - the key specified in _REF_KEY in muutils will have a value pointing to the external file
  - the _FORMAT_KEY key will detail an external format type

| Format | Safe | Zero-copy | Lazy loading | No file size limit | Layout control | Flexibility | Bfloat16 |
|---|---|---|---|---|---|---|---|
| pickle (PyTorch) | ❌ | ❌ | ❌ | ✅ | ❌ | ✅ | ✅ |
| H5 (Tensorflow) | ✅ | ❌ | ✅ | ✅ | ~ | ~ | ❌ |
| HDF5 | ✅ | ? | ✅ | ✅ | ~ | ✅ | ❌ |
| SavedModel (Tensorflow) | ✅ | ❌ | ❌ | ✅ | ✅ | ❌ | ✅ |
| MsgPack (flax) | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ |
| Protobuf (ONNX) | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ |
| Cap’n’Proto | ✅ | ✅ | ~ | ✅ | ✅ | ~ | ❌ |
| Numpy (npy,npz) | ✅ | ? | ? | ❌ | ✅ | ❌ | ❌ |
| SafeTensors | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ |
| exdir | ✅ | ? | ? | ? | ? | ✅ | ❌ |
| ZANJ | ✅ | ❌ | ❌* | ✅ | ✅ | ✅ | ❌* |
* denotes this feature may be coming at a future date
:)
(This table was stolen from safetensors)
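Because a .zanj file is a zip archive with JSON inside, it can be inspected with the standard library alone. The sketch below does not call zanj: it builds a stand-in archive in memory using the two documented file names, then reads it back. The JSON contents are placeholders invented for illustration, and the use of ZIP_DEFLATED is an assumption about what the compress setting maps to.

```python
import io
import json
import zipfile

ZANJ_MAIN = "__zanj__.json"
ZANJ_META = "__zanj_meta__.json"

# build a stand-in archive in memory; a real file would come from ZANJ().save()
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr(ZANJ_MAIN, json.dumps({"a": "hello", "q": 42}))
    zf.writestr(ZANJ_META, json.dumps({"note": "placeholder metadata"}))

# read it back the same way any .zanj file could be inspected
with zipfile.ZipFile(buffer) as zf:
    names = zf.namelist()
    main = json.loads(zf.read(ZANJ_MAIN))
    meta = json.loads(zf.read(ZANJ_META))
```

For a file on disk, pass its path to `zipfile.ZipFile` instead of the in-memory buffer.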
def register_loader_handler(handler: zanj.loading.LoaderHandler)

register a custom loader handler

class ZANJ(muutils.json_serialize.json_serialize.JsonSerializer):

Zip up: Arrays in Numpy, JSON for everything else

given an arbitrary object, throw into a zip file, with arrays stored in .npy files, and everything else stored in a json file

(basically npz file with json)

- everything other than arrays is stored in zanj.json in the root of the archive, via muutils.json_serialize.JsonSerializer
- metadata is stored in a __zanj_meta__.json file in the root of the archive

create a ZANJ-class via z_cls = ZANJ().create(obj), and save/read instances of the object via z_cls.save(obj, path), z_cls.load(path). be sure to pass an instance of the object, to make sure that the attributes of the class can be correctly recognized
ZANJ(
error_mode: muutils.errormode.ErrorMode = ErrorMode.Except,
internal_array_mode: Literal['list', 'array_list_meta', 'array_hex_meta', 'array_b64_meta', 'external', 'zero_dim'] = 'array_list_meta',
external_array_threshold: int = 256,
external_list_threshold: int = 256,
compress: bool | int = True,
custom_settings: dict[str, typing.Any] | None = None,
handlers_pre: None = (),
handlers_default: None = DEFAULT_SERIALIZER_HANDLERS_ZANJ  # the full tuple is listed under zanj.serializing
)

external_array_threshold: int

external_list_threshold: int

custom_settings: dict

compress
def externals_info(self) -> dict[str, dict[str, str | int | list[int]]]

return information about the current externals
def meta(
self
) -> Union[bool, int, float, str, NoneType, List[Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]], Dict[str, Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]]]

return the metadata of the ZANJ archive

def save(self, obj: Any, file_path: str | pathlib._local.Path) -> str

save the object to a ZANJ archive. returns the path to the archive

def read(self, file_path: Union[str, pathlib._local.Path]) -> Any

load the object from a ZANJ archive

TODO: load only some part of the zanj file by passing an ObjectPath
zanj.externals

for storing/retrieving an item externally in a ZANJ archive
ZANJ_MAIN: str = '__zanj__.json'
ZANJ_META: str = '__zanj_meta__.json'
ExternalItemType = typing.Literal['jsonl', 'npy']
ExternalItemType_vals = ('jsonl', 'npy')
class ExternalItem(typing.NamedTuple):

ExternalItem(item_type, data, path)

ExternalItem(
item_type: Literal['jsonl', 'npy'],
data: Any,
path: tuple[typing.Union[str, int], ...]
)

Create new instance of ExternalItem(item_type, data, path)

item_type: Literal['jsonl', 'npy']
Alias for field number 0

data: Any
Alias for field number 1

path: tuple[typing.Union[str, int], ...]
Alias for field number 2
def load_jsonl(
zanj: "'LoadedZANJ'",
fp: IO[bytes]
) -> list[typing.Union[bool, int, float, str, NoneType, typing.List[typing.Union[bool, int, float, str, NoneType, typing.List[typing.Any], typing.Dict[str, typing.Any]]], typing.Dict[str, typing.Union[bool, int, float, str, NoneType, typing.List[typing.Any], typing.Dict[str, typing.Any]]]]]

def load_npy(zanj: "'LoadedZANJ'", fp: IO[bytes]) -> numpy.ndarray

EXTERNAL_LOAD_FUNCS: dict[typing.Literal['jsonl', 'npy'], typing.Callable[[zanj.zanj.ZANJ, typing.IO[bytes]], typing.Any]] = {'jsonl': <function load_jsonl>, 'npy': <function load_npy>}

def GET_EXTERNAL_LOAD_FUNC(item_type: str) -> Callable[[zanj.zanj.ZANJ, IO[bytes]], Any]
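The pairing of each external item type with a loader function is a plain dispatch table. The sketch below shows the pattern with the standard library only; it is not the zanj implementation. All names ending in `_sketch` are hypothetical, the loaders omit the leading zanj argument that the real ones take, a text loader stands in for the binary .npy loader, and raising ValueError on an unknown type is an assumption.

```python
import io
import json
from typing import IO, Any, Callable


def load_jsonl_sketch(fp: IO[bytes]) -> list[Any]:
    # one JSON document per non-empty line
    return [json.loads(line) for line in fp if line.strip()]


def load_text_sketch(fp: IO[bytes]) -> str:
    # stands in for a binary loader such as the .npy one
    return fp.read().decode("utf-8")


LOAD_FUNCS_SKETCH: dict[str, Callable[[IO[bytes]], Any]] = {
    "jsonl": load_jsonl_sketch,
    "txt": load_text_sketch,
}


def get_load_func_sketch(item_type: str) -> Callable[[IO[bytes]], Any]:
    # assumption: an unknown item type is reported as a ValueError
    if item_type not in LOAD_FUNCS_SKETCH:
        raise ValueError(f"unknown item type {item_type!r}")
    return LOAD_FUNCS_SKETCH[item_type]


rows = get_load_func_sketch("jsonl")(io.BytesIO(b'{"x": 1}\n{"x": 2}\n'))
```

Keeping the loaders in a dict keyed by item type means a new external format only needs one new function and one new entry.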
zanj.loading

class LoaderHandler:

handler for loading an object from a json file or a ZANJ archive
LoaderHandler(
check: Callable[[Union[bool, int, float, str, NoneType, List[Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]], Dict[str, Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]]], tuple[Union[str, int], ...], Any], bool],
load: Callable[[Union[bool, int, float, str, NoneType, List[Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]], Dict[str, Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]]], tuple[Union[str, int], ...], Any], Any],
uid: str,
source_pckg: str,
priority: int = 0,
desc: str = '(no description)'
)

check: Callable[[Union[bool, int, float, str, NoneType, List[Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]], Dict[str, Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]]], tuple[Union[str, int], ...], Any], bool]
load: Callable[[Union[bool, int, float, str, NoneType, List[Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]], Dict[str, Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]]], tuple[Union[str, int], ...], Any], Any]
uid: str
source_pckg: str
priority: int = 0
desc: str = '(no description)'
def serialize(
self
) -> Dict[str, Union[bool, int, float, str, NoneType, List[Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]], Dict[str, Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]]]]

serialize the handler info

def from_formattedclass(cls, fc: type, priority: int = 0)

create a loader from a class with serialize, load methods and __muutils_format__ attribute
LOADER_MAP_LOCK = <unlocked _thread.lock object>
LOADER_MAP: dict[str, zanj.loading.LoaderHandler] = {
    'numpy.ndarray': LoaderHandler(check=<function <lambda>>, load=<function <lambda>>, uid='numpy.ndarray', source_pckg='zanj', priority=0, desc='numpy.ndarray loader'),
    'torch.Tensor': LoaderHandler(check=<function <lambda>>, load=<function _torch_loaderhandler_load>, uid='torch.Tensor', source_pckg='zanj', priority=0, desc='torch.Tensor loader'),
    'pandas.DataFrame': LoaderHandler(check=<function <lambda>>, load=<function <lambda>>, uid='pandas.DataFrame', source_pckg='zanj', priority=0, desc='pandas.DataFrame loader'),
    'list': LoaderHandler(check=<function <lambda>>, load=<function <lambda>>, uid='list', source_pckg='zanj', priority=0, desc='list loader, for externals'),
    'tuple': LoaderHandler(check=<function <lambda>>, load=<function <lambda>>, uid='tuple', source_pckg='zanj', priority=0, desc='tuple loader, for externals')
}

def register_loader_handler(handler: zanj.loading.LoaderHandler)

register a custom loader handler
def get_item_loader(
json_item: Union[bool, int, float, str, NoneType, List[Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]], Dict[str, Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]]],
path: tuple[typing.Union[str, int], ...],
zanj: typing.Any | None = None,
error_mode: muutils.errormode.ErrorMode = ErrorMode.Warn
) -> zanj.loading.LoaderHandler | None

get the loader for a json item
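The combination of a uid-keyed map, a lock, and per-handler check/load callables suggests a registry that is searched for the first handler accepting an item. The sketch below is a simplified illustration in plain Python, not the zanj code. `HandlerSketch`, `register`, and `find_loader` are hypothetical names, the callables take only the item (the real ones also receive a path and a zanj object), and trying higher priority first is an assumption.

```python
import threading
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class HandlerSketch:
    check: Callable[[Any], bool]
    load: Callable[[Any], Any]
    uid: str
    priority: int = 0


REGISTRY: dict[str, HandlerSketch] = {}
REGISTRY_LOCK = threading.Lock()


def register(handler: HandlerSketch) -> None:
    # the lock guards against concurrent registration
    with REGISTRY_LOCK:
        REGISTRY[handler.uid] = handler


def find_loader(item: Any) -> Optional[HandlerSketch]:
    # assumption: handlers with higher priority are tried first
    with REGISTRY_LOCK:
        candidates = sorted(REGISTRY.values(), key=lambda h: -h.priority)
    for handler in candidates:
        if handler.check(item):
            return handler
    return None


register(
    HandlerSketch(
        check=lambda item: isinstance(item, dict) and item.get("kind") == "complex",
        load=lambda item: complex(item["re"], item["im"]),
        uid="complex",
    )
)

item = {"kind": "complex", "re": 1.0, "im": -2.0}
handler = find_loader(item)
value = handler.load(item) if handler is not None else item
```

Items that no handler claims are left as plain JSON data in this sketch, which is why `find_loader` returns None instead of raising.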
def load_item_recursive(
json_item: Union[bool, int, float, str, NoneType, List[Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]], Dict[str, Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]]],
path: tuple[typing.Union[str, int], ...],
zanj: typing.Any | None = None,
error_mode: muutils.errormode.ErrorMode = ErrorMode.Warn,
allow_not_loading: bool = True
) -> Any

class LoadedZANJ:

for loading a zanj file

LoadedZANJ(path: str | pathlib._local.Path, zanj: Any)

def populate_externals(self) -> None

put all external items into the main json data
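Putting an external item back into the main JSON data amounts to walking an object path (a tuple of string keys and integer indices, as in ExternalItem.path) and replacing the element found there. The following is a minimal sketch of that idea under stated assumptions: `set_by_path` is a hypothetical helper, and the `"<external>"` placeholder is invented for illustration and is not the reference format zanj writes.

```python
from typing import Any, Union

ObjectPath = tuple[Union[str, int], ...]


def set_by_path(root: Any, path: ObjectPath, value: Any) -> None:
    """Replace the element at `path` inside nested dicts/lists with `value`."""
    if not path:
        raise ValueError("cannot replace the root object in place")
    target = root
    # walk down to the parent container of the final key
    for key in path[:-1]:
        target = target[key]
    target[path[-1]] = value


main_json = {"name": "demo", "layers": [{"weights": "<external>"}]}
externals = {("layers", 0, "weights"): [0.1, 0.2, 0.3]}

for path, data in externals.items():
    set_by_path(main_json, path, data)
```

The same indexing works for dict keys and list positions, so one helper covers both container types.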
zanj.serializing

KW_ONLY_KWARGS: dict = {'kw_only': True}

def jsonl_metadata(
data: list[typing.Dict[str, typing.Union[bool, int, float, str, NoneType, typing.List[typing.Union[bool, int, float, str, NoneType, typing.List[typing.Any], typing.Dict[str, typing.Any]]], typing.Dict[str, typing.Union[bool, int, float, str, NoneType, typing.List[typing.Any], typing.Dict[str, typing.Any]]]]]]
) -> dict

metadata about a jsonl object

def store_npy(self: Any, fp: IO[bytes], data: numpy.ndarray) -> None

store numpy array to given file as .npy
def store_jsonl(
self: Any,
fp: IO[bytes],
data: Sequence[Union[bool, int, float, str, NoneType, List[Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]], Dict[str, Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]]]]
) -> None

store sequence to given file as .jsonl

EXTERNAL_STORE_FUNCS: dict[typing.Literal['jsonl', 'npy'], typing.Callable[[typing.Any, typing.IO[bytes], typing.Any], NoneType]] = {'npy': <function store_npy>, 'jsonl': <function store_jsonl>}

class ZANJSerializerHandler(muutils.json_serialize.json_serialize.SerializerHandler):

a handler for ZANJ serialization
ZANJSerializerHandler(
uid: str,
desc: str,
*,
check: Callable[[Any, Any, tuple[Union[str, int], ...]], bool],
serialize_func: Callable[[Any, Any, tuple[Union[str, int], ...]], Union[bool, int, float, str, NoneType, List[Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]], Dict[str, Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]]]],
source_pckg: str
)

source_pckg: str
check: Callable[[Any, Any, tuple[Union[str, int], ...]], bool]
serialize_func: Callable[[Any, Any, tuple[Union[str, int], ...]], Union[bool, int, float, str, NoneType, List[Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]], Dict[str, Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]]]]
def zanj_external_serialize(
jser: Any,
data: Any,
path: tuple[typing.Union[str, int], ...],
item_type: Literal['jsonl', 'npy'],
_format: str
) -> Union[bool, int, float, str, NoneType, List[Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]], Dict[str, Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]]]

stores a numpy array or jsonl externally in a ZANJ object

Parameters:
- jser: ZANJ
- data: Any
- path: ObjectPath
- item_type: ExternalItemType

Returns:
- JSONitem: json data with reference

Modifies:
- jser._externals
DEFAULT_SERIALIZER_HANDLERS_ZANJ: None = (
    ZANJSerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='numpy.ndarray:external', desc='external numpy array', source_pckg='zanj'),
    ZANJSerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='torch.Tensor:external', desc='external torch tensor', source_pckg='zanj'),
    ZANJSerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='list:external', desc='external list', source_pckg='zanj'),
    ZANJSerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='tuple:external', desc='external tuple', source_pckg='zanj'),
    ZANJSerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='pandas.DataFrame:external', desc='external pandas DataFrame', source_pckg='zanj'),
    SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='base types', desc='base types (bool, int, float, str, None)'),
    SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='dictionaries', desc='dictionaries'),
    SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='(list, tuple) -> list', desc='lists and tuples as lists'),
    SerializerHandler(check=<function <lambda>>, serialize_func=<function _serialize_override_serialize_func>, uid='.serialize override', desc='objects with .serialize method'),
    SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='namedtuple -> dict', desc='namedtuples as dicts'),
    SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='dataclass -> dict', desc='dataclasses as dicts'),
    SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='path -> str', desc='Path objects as posix strings'),
    SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='obj -> str(obj)', desc='directly serialize objects in `SERIALIZE_DIRECT_AS_STR` to strings'),
    SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='numpy.ndarray', desc='numpy arrays'),
    SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='torch.Tensor', desc='pytorch tensors'),
    SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='pandas.DataFrame', desc='pandas DataFrames'),
    SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='(set, list, tuple, Iterable) -> list', desc='sets, lists, tuples, and Iterables as lists'),
    SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='fallback', desc='fallback handler -- serialize object attributes and special functions as strings')
)
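In the default tuple, the ":external" handlers come before the generic ones, which hints at how the size thresholds take effect: a handler is a check plus a serialize function, and an earlier handler can claim large objects before the generic handler sees them. The sketch below illustrates that ordering in plain Python. It is not the zanj or muutils code: `SerializerSketch`, `store_external`, and the `{"ref": ...}` shape are hypothetical, and both "first matching handler wins" and "at least the threshold counts as big" are assumptions.

```python
from typing import Any, Callable, NamedTuple

EXTERNAL_LIST_THRESHOLD = 256  # mirrors the documented default


class SerializerSketch(NamedTuple):
    uid: str
    check: Callable[[Any], bool]
    serialize_func: Callable[[Any], Any]


externals: dict[str, list] = {}


def store_external(data: list) -> dict:
    # park the data outside the main JSON and leave a reference behind
    key = f"external_{len(externals)}"
    externals[key] = list(data)
    return {"ref": key, "len": len(data)}


HANDLERS = (
    # assumption: "big" means at least the threshold number of elements
    SerializerSketch(
        "list:external",
        lambda o: isinstance(o, list) and len(o) >= EXTERNAL_LIST_THRESHOLD,
        store_external,
    ),
    SerializerSketch(
        "base types",
        lambda o: isinstance(o, (bool, int, float, str, type(None))),
        lambda o: o,
    ),
    SerializerSketch(
        "list",
        lambda o: isinstance(o, list),
        lambda o: [serialize(x) for x in o],
    ),
)


def serialize(obj: Any) -> Any:
    # assumption: the first handler whose check passes wins
    for handler in HANDLERS:
        if handler.check(obj):
            return handler.serialize_func(obj)
    raise TypeError(f"no handler for {type(obj).__name__}")


small = serialize([1, 2, 3])
big = serialize(list(range(1000)))
```

The short list stays inline, while the long one is replaced by a small reference and its data lands in `externals`.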
zanj.torchutil

torch utilities for zanj – in particular the ConfiguredModel base class

note that this requires torch
KWArgs = typing.Any

def num_params(m: torch.nn.modules.module.Module, only_trainable: bool = True)

return total number of parameters in a model

- if only_trainable is False, will include parameters with requires_grad = False

https://stackoverflow.com/questions/49201236/check-the-total-number-of-parameters-in-a-pytorch-model
def get_module_device(
m: torch.nn.modules.module.Module
) -> tuple[bool, torch.device | dict[str, torch.device]]

get the current devices

class ConfiguredModel(torch.nn.modules.module.Module, typing.Generic[~T_config]):

a model that has a configuration, for saving with ZANJ
    @set_config_class(YourConfig)
    class YourModule(ConfiguredModel[YourConfig]):
        def __init__(self, cfg: YourConfig):
            super().__init__(cfg)

__init__() must initialize the model from a config object only, and call super().__init__(zanj_model_config)

If you are inheriting from another class + ConfiguredModel, ConfiguredModel must be the first class in the inheritance list

zanj_config_class

zanj_model_config: ~T_config
training_records: dict | None
def serialize(
self,
path: tuple[typing.Union[str, int], ...] = (),
zanj: zanj.zanj.ZANJ | None = None
) -> dict[str, typing.Any]

def save(self, file_path: str, zanj: zanj.zanj.ZANJ | None = None)

def load(
cls,
obj: dict[str, typing.Any],
path: tuple[typing.Union[str, int], ...],
zanj: zanj.zanj.ZANJ | None = None
) -> zanj.torchutil.ConfiguredModel

load a model from a serialized object
def read(
cls,
file_path: str,
zanj: zanj.zanj.ZANJ | None = None
) -> zanj.torchutil.ConfiguredModel

read a model from a file
def load_file(
cls,
file_path: str,
zanj: zanj.zanj.ZANJ | None = None
) -> zanj.torchutil.ConfiguredModel

read a model from a file
def get_handler(cls) -> zanj.loading.LoaderHandlerdef num_params(self) -> intModuledump_patchestrainingcall_super_initforwardregister_bufferregister_parameteradd_moduleregister_moduleget_submoduleset_submoduleget_parameterget_bufferget_extra_stateset_extra_stateapplycudaipuxpumtiacputypefloatdoublehalfbfloat16to_emptytoregister_full_backward_pre_hookregister_backward_hookregister_full_backward_hookregister_forward_pre_hookregister_forward_hookregister_state_dict_post_hookregister_state_dict_pre_hookstate_dictregister_load_state_dict_pre_hookregister_load_state_dict_post_hookload_state_dictparametersnamed_parametersbuffersnamed_bufferschildrennamed_childrenmodulesnamed_modulestrainevalrequires_grad_zero_gradshare_memoryextra_reprcompiledef set_config_class(
config_class: Type[muutils.json_serialize.serializable_dataclass.SerializableDataclass]
) -> Callable[[Type[zanj.torchutil.ConfiguredModel]], Type[zanj.torchutil.ConfiguredModel]]
class ConfigMismatchException(builtins.ValueError):
Inappropriate argument value (of correct type).
ConfigMismatchException(msg: str, diff)
diff
def assert_model_cfg_equality(
model_a: zanj.torchutil.ConfiguredModel,
model_b: zanj.torchutil.ConfiguredModel
)
check both models are correct instances and have the same config
Raises: ConfigMismatchException: if the configs don’t match, e.diff will contain the diff
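The error-handling pattern described above can be sketched with stand-ins, so that it runs without torch or zanj installed. This is only an illustration of the shape given by the signatures (a ValueError subclass that carries a diff attribute); the names ConfigMismatchSketch, dict_diff, and assert_cfg_equal_sketch are hypothetical, plain dicts stand in for model configs, and the format of the real diff may differ.

```python
class ConfigMismatchSketch(ValueError):
    """Hypothetical stand-in for ConfigMismatchException: a ValueError carrying a diff."""

    def __init__(self, msg: str, diff: dict):
        super().__init__(msg)
        self.diff = diff


def dict_diff(a: dict, b: dict) -> dict:
    """Map each differing key to its (a, b) value pair; a missing key shows up as None."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }


def assert_cfg_equal_sketch(cfg_a: dict, cfg_b: dict) -> None:
    """Raise ConfigMismatchSketch if the two config dicts differ (assumed behaviour)."""
    diff = dict_diff(cfg_a, cfg_b)
    if diff:
        raise ConfigMismatchSketch("configs don't match", diff)


# plain dicts stand in for the configs of two models
cfg_a = {"n_layers": 2, "d_model": 64}
cfg_b = {"n_layers": 3, "d_model": 64}

try:
    assert_cfg_equal_sketch(cfg_a, cfg_b)
except ConfigMismatchSketch as e:
    found = e.diff  # only the keys whose values differ
```

With the real library, the same try/except shape would presumably wrap a call to assert_model_cfg_equality(model_a, model_b) and inspect e.diff on failure.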
def assert_model_exact_equality(
model_a: zanj.torchutil.ConfiguredModel,
model_b: zanj.torchutil.ConfiguredModel
)
check the models are exactly equal, including state dict contents
zanj.zanj
an HDF5/exdir file alternative, which uses json for attributes, allows serialization of arbitrary data
for large arrays, the output is a .tar.gz file with most data in a json file, but with sufficiently large arrays stored in binary .npy files
“ZANJ” is an acronym that the AI tool Elicit came up with for me. not to be confused with:
ZANJitem = typing.Union[bool, int, float, str, NoneType, typing.List[typing.Union[bool, int, float, str, NoneType, typing.List[typing.Any], typing.Dict[str, typing.Any]]], typing.Dict[str, typing.Union[bool, int, float, str, NoneType, typing.List[typing.Any], typing.Dict[str, typing.Any]]], numpy.ndarray, ForwardRef('pd.DataFrame')]
ZANJ_GLOBAL_DEFAULTS: zanj.zanj._ZANJ_GLOBAL_DEFAULTS_CLASS = _ZANJ_GLOBAL_DEFAULTS_CLASS(error_mode=ErrorMode.Except, internal_array_mode='array_list_meta', external_array_threshold=256, external_list_threshold=256, compress=True, custom_settings=None)
class ZANJ(muutils.json_serialize.json_serialize.JsonSerializer):
Zip up: Arrays in Numpy, JSON for everything else
given an arbitrary object, throw into a zip file, with arrays stored in .npy files, and everything else stored in a json file
(basically npz file with json)
everything else about the object is stored in a json file zanj.json in the root of the archive, via muutils.json_serialize.JsonSerializer
metadata about the archive is stored in a __zanj_meta__.json file in the root of the archive
create a ZANJ-class via z_cls = ZANJ().create(obj), and save/read instances of the object via z_cls.save(obj, path), z_cls.load(path). be sure to pass an instance of the object, to make sure that the attributes of the class can be correctly recognized
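The archive layout described above can be illustrated with a minimal sketch that uses only the standard library. It is not ZANJ's implementation: save_sketch, read_sketch, the "$ref" marker, the member naming scheme, and storing long lists as separate JSON members are all assumptions made for illustration (ZANJ itself stores arrays in .npy files). The value 256 mirrors the documented default of external_list_threshold, but the exact comparison ZANJ applies is not stated here.

```python
import io
import json
import zipfile

# mirrors the documented default of external_list_threshold (assumed semantics)
THRESHOLD = 256


def save_sketch(obj: dict, threshold: int = THRESHOLD) -> bytes:
    """Write obj to an in-memory zip: short values inline, long lists as separate members."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        root = {}
        for key, value in obj.items():
            if isinstance(value, list) and len(value) >= threshold:
                member = f"{key}.json"  # hypothetical naming scheme
                archive.writestr(member, json.dumps(value))
                root[key] = {"$ref": member}  # hypothetical reference marker
            else:
                root[key] = value
        # the readable structure goes in one JSON file at the archive root
        archive.writestr("zanj.json", json.dumps(root, indent=2))
    return buffer.getvalue()


def read_sketch(data: bytes) -> dict:
    """Load the root JSON and replace every reference with the member it points to."""
    restored = {}
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = json.loads(archive.read("zanj.json"))
        for key, value in root.items():
            if isinstance(value, dict) and "$ref" in value:
                restored[key] = json.loads(archive.read(value["$ref"]))
            else:
                restored[key] = value
    return restored


original = {"name": "demo", "small": [1, 2, 3], "big": list(range(1000))}
blob = save_sketch(original)
restored = read_sketch(blob)

with zipfile.ZipFile(io.BytesIO(blob)) as archive:
    members = sorted(archive.namelist())
```

In this sketch only the 1000-element list crosses the threshold, so the archive holds two members: the root JSON file and one external file for the long list.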
ZANJ(
error_mode: muutils.errormode.ErrorMode = ErrorMode.Except,
internal_array_mode: Literal['list', 'array_list_meta', 'array_hex_meta', 'array_b64_meta', 'external', 'zero_dim'] = 'array_list_meta',
external_array_threshold: int = 256,
external_list_threshold: int = 256,
compress: bool | int = True,
custom_settings: dict[str, typing.Any] | None = None,
handlers_pre: None = (),
handlers_default: None = (
    ZANJSerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='numpy.ndarray:external', desc='external numpy array', source_pckg='zanj'),
    ZANJSerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='torch.Tensor:external', desc='external torch tensor', source_pckg='zanj'),
    ZANJSerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='list:external', desc='external list', source_pckg='zanj'),
    ZANJSerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='tuple:external', desc='external tuple', source_pckg='zanj'),
    ZANJSerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='pandas.DataFrame:external', desc='external pandas DataFrame', source_pckg='zanj'),
    SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='base types', desc='base types (bool, int, float, str, None)'),
    SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='dictionaries', desc='dictionaries'),
    SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='(list, tuple) -> list', desc='lists and tuples as lists'),
    SerializerHandler(check=<function <lambda>>, serialize_func=<function _serialize_override_serialize_func>, uid='.serialize override', desc='objects with .serialize method'),
    SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='namedtuple -> dict', desc='namedtuples as dicts'),
    SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='dataclass -> dict', desc='dataclasses as dicts'),
    SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='path -> str', desc='Path objects as posix strings'),
    SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='obj -> str(obj)', desc='directly serialize objects in `SERIALIZE_DIRECT_AS_STR` to strings'),
    SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='numpy.ndarray', desc='numpy arrays'),
    SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='torch.Tensor', desc='pytorch tensors'),
    SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='pandas.DataFrame', desc='pandas DataFrames'),
    SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='(set, list, tuple, Iterable) -> list', desc='sets, lists, tuples, and Iterables as lists'),
    SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='fallback', desc='fallback handler -- serialize object attributes and special functions as strings')
)
)
external_array_threshold: int
external_list_threshold: int
custom_settings: dict
compress
def externals_info(self) -> dict[str, dict[str, str | int | list[int]]]
return information about the current externals
def meta(
self
) -> Union[bool, int, float, str, NoneType, List[Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]], Dict[str, Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]]]
return the metadata of the ZANJ archive
def save(self, obj: Any, file_path: str | pathlib._local.Path) -> str
save the object to a ZANJ archive. returns the path to the archive
def read(self, file_path: Union[str, pathlib._local.Path]) -> Any
load the object from a ZANJ archive
TODO: load only some part of the zanj file by passing an ObjectPath