RemotePathIterator
- class pyremotedata.implicit_mount.RemotePathIterator(io_handler: IOHandler, batch_size: int = 64, batch_parallel: int = 10, max_queued_batches: int = 3, n_local_files: int = 384, clear_local: bool = False, **kwargs)
Bases: object
This class provides a high-level buffered iterator for downloading files from a remote directory. All heavy computation is done in a separate thread to avoid blocking the main thread unnecessarily.
Note: The attributes of this class are meant for development and advanced use cases only. If you access them directly, all responsibility is on you.
- Parameters:
io_handler (IOHandler) – A backend object of class “IOHandler” to use for downloading files.
batch_size (int) – The number of files to download in each batch. Larger batches are more efficient, but may cause memory issues.
batch_parallel (int) – The number of files to download in parallel in each batch. Larger values may be more efficient, but can cause excessive loads on the remote server.
max_queued_batches (int) – The batches are processed sequentially from a queue, which is filled on request. This parameter specifies the maximum number of batches in the queue. Larger values can ensure a stable streaming rate, but may require more files to be stored locally.
n_local_files (int) – The number of files to store locally. Note: This MUST be larger than batch_size * max_queued_batches, otherwise files may be deleted before they are consumed. Twice that product is suggested, which is what the defaults use (2 * 64 * 3 = 384).
clear_local (bool) – If True, the local directory will be cleared after the iterator is stopped.
**kwargs – Keyword arguments to pass to the IOHandler.get_file_index() function. Set ‘store’ to False to avoid altering the remote directory. This is much slower if you intend to use the iterator multiple times, but it may be necessary if the remote directory is read-only. Note: If ‘store’ is False, ‘override’ must also be False.
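The sizing rule for n_local_files can be checked before constructing the iterator. The following is a minimal sketch, not part of pyremotedata: the helper name check_local_buffer is hypothetical, and the only facts it relies on are the documented defaults and the rule that n_local_files must exceed batch_size * max_queued_batches.

```python
def check_local_buffer(
    batch_size: int = 64,
    max_queued_batches: int = 3,
    n_local_files: int = 384,
) -> int:
    """Validate the local buffer size and return the suggested size.

    Hypothetical helper (not part of pyremotedata). It encodes the documented
    rule: n_local_files must be larger than batch_size * max_queued_batches.
    """
    # Maximum number of downloaded files that can be waiting in the queue.
    queued_files = batch_size * max_queued_batches
    if n_local_files <= queued_files:
        raise ValueError(
            f"n_local_files={n_local_files} must be larger than "
            f"batch_size * max_queued_batches = {queued_files}"
        )
    # The documentation suggests twice the queued amount.
    return 2 * queued_files
```

With the default arguments the check passes and the suggested size equals the default n_local_files.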
- Yields:
Tuple[str, str] – A tuple containing the local path and the remote path of the downloaded file.
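Because the iterator yields (local path, remote path) tuples, a consumer loop only needs to unpack each pair. The sketch below is an illustration under stated assumptions: consume and process are hypothetical names, and the function accepts any iterable of such pairs, so it can be tried with a plain list before pointing it at a real RemotePathIterator.

```python
from typing import Callable, Dict, Iterable, Tuple, TypeVar

T = TypeVar("T")


def consume(
    pairs: Iterable[Tuple[str, str]],
    process: Callable[[str], T],
) -> Dict[str, T]:
    """Apply `process` to each local path and key the result by remote path.

    Hypothetical helper (not part of pyremotedata). `pairs` is any iterable of
    (local_path, remote_path) tuples, the shape documented under "Yields".
    """
    results: Dict[str, T] = {}
    for local_path, remote_path in pairs:
        # Handle each local file as soon as it is yielded: only a bounded
        # number of files is kept locally, so older files may be deleted.
        results[remote_path] = process(local_path)
    return results
```

Keying results by remote path keeps them stable across runs, since local paths refer to temporary copies.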
- download_files()
Downloads the entire list of remote paths in batches and stores the local paths in a queue (self.download_queue).
This function is not intended to be called directly, but nothing prevents doing so, and it can be useful for debugging and testing.
- shuffle() → None
Shuffle the remote paths.
Shuffles the remote paths in-place. This function should not be called while iterating.
- split(proportion: List[float | int] | None = None, indices: List[List[int]] | None = None) → List[RemotePathIterator]
Split the remote paths into multiple iterators that share the same backend. These CANNOT be used in parallel.
Either, but not both, of proportion and indices must be specified.
- Parameters:
- Returns:
A list of RemotePathIterator objects.
- Return type:
List[RemotePathIterator]
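To make the shape of the indices argument concrete, the sketch below builds a List[List[int]] from a list of proportions. It rests on assumptions that the documentation above does not confirm: that each inner list holds positions into the iterator's list of remote paths, and that proportions are normalised by their sum. The helper name proportions_to_indices is hypothetical and does not reproduce how split() works internally.

```python
from typing import List, Sequence, Union


def proportions_to_indices(
    n_items: int,
    proportions: Sequence[Union[float, int]],
) -> List[List[int]]:
    """Partition range(n_items) into contiguous groups sized by `proportions`.

    Hypothetical helper (not part of pyremotedata). Assumes indices are
    positions into the list of remote paths.
    """
    if n_items < 0:
        raise ValueError("n_items must be non-negative")
    if not proportions or any(p < 0 for p in proportions):
        raise ValueError("proportions must be a non-empty list of non-negative numbers")
    total = float(sum(proportions))
    if total <= 0:
        raise ValueError("proportions must sum to a positive number")

    groups: List[List[int]] = []
    start = 0
    cumulative = 0.0
    for i, p in enumerate(proportions):
        cumulative += p
        if i == len(proportions) - 1:
            # The last group absorbs any rounding remainder.
            end = n_items
        else:
            end = round(n_items * cumulative / total)
        groups.append(list(range(start, end)))
        start = end
    return groups
```

The groups are disjoint and together cover every position exactly once, so no remote path would be assigned to two iterators. Shuffling before splitting would be needed for random rather than contiguous groups.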