spacr.io
========

.. py:module:: spacr.io






Module Contents
---------------

.. py:function:: process_non_tif_non_2D_images(folder)

   Processes all images in the folder and splits them into grayscale channels, preserving bit depth.


.. py:class:: CombineLoaders(train_loaders)

   A class that combines multiple data loaders into a single iterator.

   :param train_loaders: A list of data loaders.
   :type train_loaders: list

   :raises StopIteration: If all data loaders have been exhausted.
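
   A minimal sketch of how several loaders can be merged into one iterator. This is an
   illustration, not the spacr implementation: the class name ``RoundRobinLoaders`` and the
   round-robin order are assumptions, and plain lists stand in for data loaders.

   .. code-block:: python

      class RoundRobinLoaders:
          """Hypothetical stand-in that yields (loader_index, batch) pairs."""

          def __init__(self, train_loaders):
              self.train_loaders = train_loaders
              # Keep the original index next to each live iterator.
              self.loader_iters = [(i, iter(loader)) for i, loader in enumerate(train_loaders)]

          def __iter__(self):
              return self

          def __next__(self):
              while self.loader_iters:
                  idx, it = self.loader_iters.pop(0)
                  try:
                      batch = next(it)
                  except StopIteration:
                      continue  # this loader is exhausted; drop it
                  self.loader_iters.append((idx, it))  # rotate to the back
                  return idx, batch
              raise StopIteration  # every loader has been exhausted

      combined = list(RoundRobinLoaders([[1, 2, 3], ["a"], ["x", "y"]]))

   Iteration ends only once every underlying loader has run out, which matches the
   ``StopIteration`` condition documented above.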


   .. py:attribute:: train_loaders


   .. py:attribute:: loader_iters


.. py:class:: CombinedDataset(datasets, shuffle=True)

   Bases: :py:obj:`torch.utils.data.Dataset`


   A dataset that combines multiple datasets into one.

   :param datasets: A list of datasets to be combined.
   :type datasets: list
   :param shuffle: Whether to shuffle the combined dataset. Defaults to True.
   :type shuffle: bool, optional
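
   The index bookkeeping behind a combined dataset can be sketched as below. This is an
   illustration under assumptions, not the spacr code: ``locate`` is a hypothetical helper,
   and plain lists stand in for datasets.

   .. code-block:: python

      import bisect
      import itertools

      datasets = [["a0", "a1", "a2"], ["b0"], ["c0", "c1"]]
      lengths = [len(d) for d in datasets]
      total_length = sum(lengths)
      # Cumulative end offsets of each dataset: [3, 4, 6]
      offsets = list(itertools.accumulate(lengths))

      def locate(index):
          """Map a global index to (dataset_index, local_index)."""
          if not 0 <= index < total_length:
              raise IndexError(index)
          ds = bisect.bisect_right(offsets, index)
          start = offsets[ds - 1] if ds else 0
          return ds, index - start

      items = [datasets[ds][i] for ds, i in map(locate, range(total_length))]

   Shuffling can then be layered on top by visiting a random permutation of
   ``range(total_length)`` instead of the indices in order.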


   .. py:attribute:: datasets


   .. py:attribute:: lengths


   .. py:attribute:: total_length


   .. py:attribute:: shuffle
      :value: True



.. py:class:: NoClassDataset(data_dir, transform=None, shuffle=True, load_to_memory=False)

   Bases: :py:obj:`torch.utils.data.Dataset`


   A custom dataset class for handling image data without class labels.

   :param data_dir: The directory path where the image files are located.
   :type data_dir: str
   :param transform: A function/transform to apply to the image data. Default is None.
   :type transform: callable, optional
   :param shuffle: Whether to shuffle the dataset. Default is True.
   :type shuffle: bool, optional
   :param load_to_memory: Whether to load all images into memory. Default is False.
   :type load_to_memory: bool, optional


   .. py:attribute:: data_dir


   .. py:attribute:: transform
      :value: None



   .. py:attribute:: shuffle
      :value: True



   .. py:attribute:: load_to_memory
      :value: False



   .. py:attribute:: filenames


   .. py:method:: load_image(img_path)

      Load an image from the given file path.

      :param img_path: The file path of the image.
      :type img_path: str

      :returns: The loaded image.
      :rtype: PIL.Image



   .. py:method:: shuffle_dataset()

      Shuffle the dataset.



.. py:class:: spacrDataset(data_dir, loader_classes, transform=None, shuffle=True, pin_memory=False, specific_files=None, specific_labels=None)

   Bases: :py:obj:`torch.utils.data.Dataset`


   An abstract class representing a :class:`Dataset`.

   All datasets that represent a map from keys to data samples should subclass
   it. All subclasses should override :meth:`__getitem__`, supporting fetching a
   data sample for a given key. Subclasses can also optionally override
   :meth:`__len__`, which is expected to return the size of the dataset by many
   :class:`~torch.utils.data.Sampler` implementations and the default options
   of :class:`~torch.utils.data.DataLoader`. Subclasses can also
   optionally implement :meth:`__getitems__` to speed up batched sample
   loading. This method accepts a list of sample indices for a batch and returns
   a list of samples.

   .. note::
     :class:`~torch.utils.data.DataLoader` by default constructs an index
     sampler that yields integral indices.  To make it work with a map-style
     dataset with non-integral indices/keys, a custom sampler must be provided.


   .. py:attribute:: data_dir


   .. py:attribute:: classes


   .. py:attribute:: transform
      :value: None



   .. py:attribute:: shuffle
      :value: True



   .. py:attribute:: pin_memory
      :value: False



   .. py:attribute:: filenames
      :value: []



   .. py:attribute:: labels
      :value: []



   .. py:method:: load_image(img_path)


   .. py:method:: shuffle_dataset()


   .. py:method:: get_plate(filepath)


.. py:class:: spacrDataLoader(*args, preload_batches=1, **kwargs)

   Bases: :py:obj:`torch.utils.data.DataLoader`


   Data loader combines a dataset and a sampler, and provides an iterable over the given dataset.

   The :class:`~torch.utils.data.DataLoader` supports both map-style and
   iterable-style datasets with single- or multi-process loading, customizing
   loading order and optional automatic batching (collation) and memory pinning.

   See :py:mod:`torch.utils.data` documentation page for more details.

   :param dataset: dataset from which to load the data.
   :type dataset: Dataset
   :param batch_size: how many samples per batch to load
                      (default: ``1``).
   :type batch_size: int, optional
   :param shuffle: set to ``True`` to have the data reshuffled
                   at every epoch (default: ``False``).
   :type shuffle: bool, optional
   :param sampler: defines the strategy to draw
                   samples from the dataset. Can be any ``Iterable`` with ``__len__``
                   implemented. If specified, :attr:`shuffle` must not be specified.
   :type sampler: Sampler or Iterable, optional
   :param batch_sampler: like :attr:`sampler`, but
                         returns a batch of indices at a time. Mutually exclusive with
                         :attr:`batch_size`, :attr:`shuffle`, :attr:`sampler`,
                         and :attr:`drop_last`.
   :type batch_sampler: Sampler or Iterable, optional
   :param num_workers: how many subprocesses to use for data
                       loading. ``0`` means that the data will be loaded in the main process.
                       (default: ``0``)
   :type num_workers: int, optional
   :param collate_fn: merges a list of samples to form a
                      mini-batch of Tensor(s).  Used when using batched loading from a
                      map-style dataset.
   :type collate_fn: Callable, optional
   :param pin_memory: If ``True``, the data loader will copy Tensors
                      into device/CUDA pinned memory before returning them.  If your data elements
                      are a custom type, or your :attr:`collate_fn` returns a batch that is a custom type,
                      see the memory pinning section of the :py:mod:`torch.utils.data` documentation.
   :type pin_memory: bool, optional
   :param drop_last: set to ``True`` to drop the last incomplete batch,
                     if the dataset size is not divisible by the batch size. If ``False`` and
                     the size of dataset is not divisible by the batch size, then the last batch
                     will be smaller. (default: ``False``)
   :type drop_last: bool, optional
   :param timeout: if positive, the timeout value for collecting a batch
                   from workers. Should always be non-negative. (default: ``0``)
   :type timeout: numeric, optional
   :param worker_init_fn: If not ``None``, this will be called on each
                          worker subprocess with the worker id (an int in ``[0, num_workers - 1]``) as
                          input, after seeding and before data loading. (default: ``None``)
   :type worker_init_fn: Callable, optional
   :param multiprocessing_context: If
                                   ``None``, the default `multiprocessing context`_ of your operating system will
                                   be used. (default: ``None``)
   :type multiprocessing_context: str or multiprocessing.context.BaseContext, optional
   :param generator: If not ``None``, this RNG will be used
                     by RandomSampler to generate random indexes and multiprocessing to generate
                     ``base_seed`` for workers. (default: ``None``)
   :type generator: torch.Generator, optional
   :param prefetch_factor: Number of batches loaded
                           in advance by each worker. ``2`` means there will be a total of
                           ``2 * num_workers`` batches prefetched across all workers. The default
                           depends on ``num_workers``: ``None`` if ``num_workers=0``,
                           otherwise ``2``.
   :type prefetch_factor: int, optional, keyword-only arg
   :param persistent_workers: If ``True``, the data loader will not shut down
                              the worker processes after a dataset has been consumed once. This keeps
                              the workers' ``Dataset`` instances alive. (default: ``False``)
   :type persistent_workers: bool, optional
   :param pin_memory_device: the device to :attr:`pin_memory` to if ``pin_memory`` is
                             ``True``.
   :type pin_memory_device: str, optional
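
   As a quick check of the ``drop_last`` rule for a map-style dataset with the default sampler,
   the batch count is a floor or a ceiling division. ``num_batches`` is a hypothetical helper
   written for this illustration; it is not part of PyTorch or spacr.

   .. code-block:: python

      import math

      def num_batches(n_samples, batch_size, drop_last=False):
          """Number of batches for n_samples items split into batch_size chunks."""
          if drop_last:
              # The incomplete final batch is discarded.
              return n_samples // batch_size
          # The final batch may be smaller than batch_size.
          return math.ceil(n_samples / batch_size)

      kept = num_batches(100, 32)
      dropped = num_batches(100, 32, drop_last=True)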

   .. warning:: If the ``spawn`` start method is used, :attr:`worker_init_fn`
                cannot be an unpicklable object, e.g., a lambda function. See
                :ref:`multiprocessing-best-practices` on more details related
                to multiprocessing in PyTorch.

   .. warning:: The ``len(dataloader)`` heuristic is based on the length of the sampler used.
                When :attr:`dataset` is an :class:`~torch.utils.data.IterableDataset`,
                it instead returns an estimate based on ``len(dataset) / batch_size``, with proper
                rounding depending on :attr:`drop_last`, regardless of multi-process loading
                configurations. This represents the best guess PyTorch can make because PyTorch
                trusts user :attr:`dataset` code in correctly handling multi-process
                loading to avoid duplicate data.

                However, if sharding results in multiple workers having incomplete last batches,
                this estimate can still be inaccurate, because (1) an otherwise complete batch can
                be broken into multiple ones and (2) more than one batch worth of samples can be
                dropped when :attr:`drop_last` is set. Unfortunately, PyTorch cannot detect such
                cases in general.

                See the "Dataset Types" section of the :py:mod:`torch.utils.data` documentation
                for more details on these two types of datasets and how
                :class:`~torch.utils.data.IterableDataset` interacts with
                multi-process data loading.

   .. warning:: See the :ref:`reproducibility`, :ref:`dataloader-workers-random-seed`, and
                :ref:`data-loading-randomness` notes for questions related to random seeds.

   .. _multiprocessing context:
       https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods


   .. py:attribute:: preload_batches
      :value: 1



   .. py:attribute:: batch_queue


   .. py:attribute:: process
      :value: None



   .. py:attribute:: current_batch_index
      :value: 0



   .. py:attribute:: pin_memory


   .. py:method:: cleanup()
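
   The ``preload_batches``, ``batch_queue`` and ``cleanup()`` names suggest that batches are
   prepared ahead of consumption. A minimal sketch of that general pattern follows, assuming a
   bounded queue filled by a background thread. ``PrefetchIterator`` is hypothetical, and the
   actual spacr class may work differently (for example with a separate process).

   .. code-block:: python

      import queue
      import threading

      class PrefetchIterator:
          """Hypothetical iterator that keeps up to preload_batches batches ready."""

          _DONE = object()  # sentinel marking the end of the stream

          def __init__(self, iterable, preload_batches=1):
              self.batch_queue = queue.Queue(maxsize=preload_batches)
              self._thread = threading.Thread(target=self._fill, args=(iterable,), daemon=True)
              self._thread.start()

          def _fill(self, iterable):
              try:
                  for batch in iterable:
                      self.batch_queue.put(batch)  # blocks while the queue is full
              finally:
                  self.batch_queue.put(self._DONE)  # always signal completion

          def __iter__(self):
              return self

          def __next__(self):
              batch = self.batch_queue.get()
              if batch is self._DONE:
                  raise StopIteration
              return batch

      batches = list(PrefetchIterator([[0, 1], [2, 3], [4]], preload_batches=2))

   The bounded queue caps memory use: the producer stalls once ``preload_batches`` batches are
   waiting, and resumes as the consumer takes them.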


.. py:class:: TarImageDataset(tar_path, transform=None)

   Bases: :py:obj:`torch.utils.data.Dataset`


   An abstract class representing a :class:`Dataset`.

   All datasets that represent a map from keys to data samples should subclass
   it. All subclasses should override :meth:`__getitem__`, supporting fetching a
   data sample for a given key. Subclasses can also optionally override
   :meth:`__len__`, which is expected to return the size of the dataset by many
   :class:`~torch.utils.data.Sampler` implementations and the default options
   of :class:`~torch.utils.data.DataLoader`. Subclasses can also
   optionally implement :meth:`__getitems__` to speed up batched sample
   loading. This method accepts a list of sample indices for a batch and returns
   a list of samples.

   .. note::
     :class:`~torch.utils.data.DataLoader` by default constructs an index
     sampler that yields integral indices.  To make it work with a map-style
     dataset with non-integral indices/keys, a custom sampler must be provided.


   .. py:attribute:: tar_path


   .. py:attribute:: transform
      :value: None



.. py:function:: load_images_from_paths(images_by_key)

.. py:function:: concatenate_and_normalize(src, channels, save_dtype=np.float32, settings={})

.. py:function:: delete_empty_subdirectories(folder_path)

   Deletes all empty subdirectories in the specified folder.

   :param folder_path: The path to the folder in which to look for empty subdirectories.
   :type folder_path: str
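
   One way to do this with the standard library is a bottom-up walk, sketched below. This is an
   assumption about the approach, not the spacr source; ``remove_empty_dirs`` is a hypothetical
   name, and whether spacr also removes parents that become empty is not stated here.

   .. code-block:: python

      import os
      import tempfile

      def remove_empty_dirs(folder_path):
          """Remove empty subdirectories of folder_path and return the removed paths."""
          removed = []
          # topdown=False visits children first, so a parent emptied by the
          # removal of its children is itself seen as empty.
          for dirpath, _dirnames, _filenames in os.walk(folder_path, topdown=False):
              if dirpath != folder_path and not os.listdir(dirpath):
                  os.rmdir(dirpath)
                  removed.append(dirpath)
          return removed

      with tempfile.TemporaryDirectory() as root:
          os.makedirs(os.path.join(root, "a", "b"))  # nested empty directories
          os.makedirs(os.path.join(root, "keep"))
          open(os.path.join(root, "keep", "f.txt"), "w").close()
          removed = remove_empty_dirs(root)
          remaining = sorted(os.listdir(root))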


.. py:function:: preprocess_img_data(settings)

.. py:function:: read_plot_model_stats(train_file_path, val_file_path, save=False)

.. py:function:: convert_numpy_to_tiff(folder_path, limit=None)

   Converts all numpy files in a folder to TIFF format and saves them in a subdirectory 'tiff'.

   :param folder_path: The path to the folder containing numpy files.
   :type folder_path: str


.. py:function:: generate_cellpose_train_test(src, test_split=0.1)

.. py:function:: parse_gz_files(folder_path)

   Parses the .fastq.gz files in the specified folder path and returns a dictionary
   containing the sample names and their corresponding file paths.

   :param folder_path: The path to the folder containing the .fastq.gz files.
   :type folder_path: str

   :returns: A dictionary where the keys are the sample names and the values are
             dictionaries containing the file paths for the 'R1' and 'R2' read directions.
   :rtype: dict
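
   The shape of the returned dictionary can be illustrated with a small sketch. The file-naming
   pattern (``<sample>_R1.fastq.gz``, optionally with a numeric suffix) and the helper name
   ``group_fastq_pairs`` are assumptions made for this example; the rules spacr uses to derive
   sample names are not documented here.

   .. code-block:: python

      import os
      import re
      import tempfile

      def group_fastq_pairs(folder_path):
          """Group .fastq.gz files by sample name and read direction (hypothetical)."""
          pattern = re.compile(r"^(?P<sample>.+?)_(?P<read>R[12])(?:_\d+)?\.fastq\.gz$")
          samples = {}
          for name in sorted(os.listdir(folder_path)):
              match = pattern.match(name)
              if match:
                  path = os.path.join(folder_path, name)
                  samples.setdefault(match["sample"], {})[match["read"]] = path
          return samples

      with tempfile.TemporaryDirectory() as folder:
          for name in ("s1_R1.fastq.gz", "s1_R2.fastq.gz", "s2_R1_001.fastq.gz", "notes.txt"):
              open(os.path.join(folder, name), "w").close()
          result = group_fastq_pairs(folder)
          # Reduce to {sample: [read directions]} for easy inspection.
          summary = {sample: sorted(reads) for sample, reads in result.items()}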


.. py:function:: generate_dataset(settings={})

.. py:function:: generate_loaders(src, mode='train', image_size=224, batch_size=32, classes=['nc', 'pc'], n_jobs=None, validation_split=0.0, pin_memory=False, normalize=False, channels=['r', 'g', 'b'], augment=False, verbose=False)

.. py:function:: generate_loaders_v1(src, mode='train', image_size=224, batch_size=32, classes=['nc', 'pc'], n_jobs=None, validation_split=0.0, pin_memory=False, normalize=False, channels=['r', 'g', 'b'], augment=False, verbose=False)

.. py:function:: generate_training_dataset(settings)

   Build a balanced training/testing dataset from one of:

   - metadata rules (exact matches or compound 'where' rules)
   - annotation columns (each ``<col>_<value>`` is a standalone class)
   - measurement rules (numeric ranges/bins; supports multiple conditions per class)

   New behavior (annotation mode):

   - If a column has only one annotated value (e.g., only 1s), a
     ``<column>_random`` class is added using unannotated rows for that column (same size as the positives).
   - Optionally, that random selection is persisted to the DB as a new INT column named ``<column>_random``, filled with 1s.

   Required helpers:

   - ``_read_and_merge_data``, ``_read_db`` (from ``.io``)
   - ``generate_dataset_from_lists(dst, class_data, classes, test_split)``
   - ``save_settings`` (from ``.utils``)
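
   The annotation-mode rule for single-valued columns can be sketched as follows. This is a
   simplified illustration with plain dictionaries, not the spacr code; ``add_random_class``,
   the row layout and the fixed seed are assumptions.

   .. code-block:: python

      import random

      def add_random_class(rows, column, seed=0):
          """Return {class_name: [row ids]} for one annotation column (hypothetical)."""
          annotated = {}
          unannotated = []
          for row_id, row in rows.items():
              value = row.get(column)
              if value is None:
                  unannotated.append(row_id)
              else:
                  annotated.setdefault(f"{column}_{value}", []).append(row_id)
          if len(annotated) == 1:
              # Only one annotated value: draw as many unannotated rows as positives.
              positives = next(iter(annotated.values()))
              k = min(len(positives), len(unannotated))
              annotated[f"{column}_random"] = random.Random(seed).sample(unannotated, k)
          return annotated

      # Rows 0-2 are annotated with 1; rows 3-9 carry no annotation.
      rows = {i: {"test": 1 if i < 3 else None} for i in range(10)}
      classes = add_random_class(rows, "test")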


.. py:function:: training_dataset_from_annotation(db_path, dst, annotation_column='test', annotated_classes=(1, 2))

.. py:function:: training_dataset_from_annotation_metadata(db_path, dst, annotation_column='test', annotated_classes=(1, 2), metadata_type_by='columnID', class_metadata=['c1', 'c2'])

.. py:function:: generate_dataset_from_lists(dst, class_data, classes, test_split=0.1)

.. py:function:: convert_separate_files_to_yokogawa(folder, regex)

.. py:function:: convert_to_yokogawa(folder)

   Detects the file type in the folder and converts the files
   to Yokogawa-style naming with maximum intensity projection (MIP).
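
   A maximum intensity projection keeps, for each pixel, the largest value along the z-axis.
   A minimal NumPy sketch, assuming a stack laid out as ``(z, y, x)``; the axis order and the
   file handling used by spacr are not stated here:

   .. code-block:: python

      import numpy as np

      # Hypothetical 3-slice stack of 2x2 images, 16-bit.
      stack = np.array(
          [[[1, 5], [0, 2]],
           [[4, 3], [9, 2]],
           [[2, 2], [1, 7]]],
          dtype=np.uint16,
      )
      # Collapse the z-axis; the result has shape (2, 2) and keeps the dtype.
      mip = stack.max(axis=0)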


.. py:function:: apply_augmentation(image, method)

.. py:function:: process_instruction(entry)

.. py:function:: prepare_cellpose_dataset(input_root, augment_data=False, train_fraction=0.8, n_jobs=None)

