Package bytewax

Bytewax is an open source Python framework for building highly scalable dataflows in a streaming or batch context.

See our readme for more documentation: https://github.com/bytewax/bytewax

Expand source code
"""Bytewax is an open source Python framework for building highly
scalable dataflows in a streaming or batch context.

[See our readme for more
documentation.](https://github.com/bytewax/bytewax)

"""
from .bytewax import cluster_main, Dataflow, run_main, AdvanceTo, Emit
from .execution import run, run_cluster, spawn_cluster

__all__ = [
    "Dataflow",
    "run_main",
    "run",
    "run_cluster",
    "spawn_cluster",
    "cluster_main",
    "AdvanceTo",
    "Emit",
]

__pdoc__ = {
    # This is the PyO3 module that has to be named "bytewax". Hide it
    # since we import all its members here.
    "bytewax": False,
    # Hide execution because we import all its members here.
    "execution": False,
}

Sub-modules

bytewax.exhash

Exhash is a consistent hash that Bytewax calls internally to route data to workers …

bytewax.inputs

Helpers to let you quickly define epoch / batching semantics …

bytewax.parse

Helpers to read execution arguments from the environment or command line.

bytewax.testing

Helper tools for testing dataflows.

Functions

def cluster_main(flow, input_builder, output_builder, addresses, proc_id, worker_count_per_proc)

Execute a dataflow in the current process as part of a cluster.

You have to coordinate starting up all the processes in the cluster, ensuring each is assigned a unique ID and knows the addresses of the other processes. You'd commonly use this for starting processes as part of a Kubernetes cluster.

Blocks until execution is complete.

>>> flow = Dataflow()
>>> flow.capture()
>>> def input_builder(worker_index, worker_count):
...     for epoch, item in enumerate(range(3)):
...         yield AdvanceTo(epoch)
...         yield Emit(item)
>>> def output_builder(worker_index, worker_count):
...     return print
>>> cluster_main(
...     flow,
...     input_builder,
...     output_builder,
...     ["localhost:2101"],  # addresses; assumes one local process on an arbitrary free port
...     0,  # proc_id
...     1,  # worker_count_per_proc
... )  # doctest: +ELLIPSIS
(...)

See run_main() for a way to test input and output builders without the complexity of starting a cluster.

See run_cluster() for a convenience method to pass data through a dataflow for notebook development.

See spawn_cluster() for starting a simple cluster locally on one machine.

Args

flow
Dataflow to run.
input_builder
Returns the input that each worker thread should process. Should yield AdvanceTo to advance the epoch, or Emit to introduce new data into the dataflow.
output_builder
Returns a callback function for each worker thread, called with (epoch, item) whenever an item passes by a capture operator on this process.
addresses
List of host/port addresses for all processes in this cluster (including this one).
proc_id
Index of this process in cluster; starts from 0.
worker_count_per_proc
Number of worker threads to start on each process.
def run(flow: Dataflow, inp: Iterable[Tuple[int, Any]]) -> List[Tuple[int, Any]]

Pass data through a dataflow running in the current thread.

Blocks until execution is complete.

Handles distributing input and collecting output. You'd commonly use this for tests or prototyping in notebooks.

Input must be finite, otherwise collected output will grow unbounded.

>>> flow = Dataflow()
>>> flow.map(str.upper)
>>> flow.capture()
>>> out = run(flow, [(0, "a"), (1, "b"), (2, "c")])
>>> sorted(out)
[(0, 'A'), (1, 'B'), (2, 'C')]

Args

flow
Dataflow to run.
inp
Input data.

Returns

List of (epoch, item) tuples seen by capture operators.

Expand source code
def run(flow: Dataflow, inp: Iterable[Tuple[int, Any]]) -> List[Tuple[int, Any]]:
    """Pass data through a dataflow running in the current thread.

    Blocks until execution is complete.

    Handles distributing input and collecting output. You'd commonly
    use this for tests or prototyping in notebooks.

    Input must be finite, otherwise collected output will grow
    unbounded.

    >>> flow = Dataflow()
    >>> flow.map(str.upper)
    >>> flow.capture()
    >>> out = run(flow, [(0, "a"), (1, "b"), (2, "c")])
    >>> sorted(out)
    [(0, 'A'), (1, 'B'), (2, 'C')]

    Args:

        flow: Dataflow to run.

        inp: Input data.

    Returns:

        List of `(epoch, item)` tuples seen by capture operators.

    """

    def input_builder(worker_index, worker_count):
        assert worker_index == 0
        for epoch, input in inp:
            yield AdvanceTo(epoch)
            yield Emit(input)

    out = []

    def output_builder(worker_index, worker_count):
        assert worker_index == 0
        return out.append

    run_main(flow, input_builder, output_builder)

    return out
def run_cluster(flow: Dataflow, inp: Iterable[Tuple[int, Any]], proc_count: int = 1, worker_count_per_proc: int = 1, mp_ctx=<multiprocess.context.SpawnContext object>) -> List[Tuple[int, Any]]

Pass data through a dataflow running as a cluster of processes on this machine.

Blocks until execution is complete.

Starts up cluster processes for you, handles connecting them together, distributing input, and collecting output. You'd commonly use this for notebook analysis that needs parallelism and higher throughput, or simple stand-alone demo programs.

Input must be finite because it is reified into a list before distribution to the cluster; otherwise, collected output would grow unbounded.

>>> from bytewax.testing import doctest_ctx
>>> flow = Dataflow()
>>> flow.map(str.upper)
>>> flow.capture()
>>> out = run_cluster(
...     flow,
...     [(0, "a"), (1, "b"), (2, "c")],
...     proc_count=2,
...     mp_ctx=doctest_ctx,  # Outside a doctest, you'd skip this.
... )
>>> sorted(out)
[(0, 'A'), (1, 'B'), (2, 'C')]

See spawn_cluster() for starting a cluster on this machine with full control over inputs and outputs.

See cluster_main() for starting one process in a cluster in a distributed situation.

Args

flow
Dataflow to run.
inp
Input data. Will be reified into a list before sending to processes. Will be partitioned between workers for you.
proc_count
Number of processes to start.
worker_count_per_proc
Number of worker threads to start on each process.
mp_ctx
multiprocessing context to use. Use this to configure starting up subprocesses via spawn or fork. Defaults to spawn.

Returns

List of (epoch, item) tuples seen by capture operators.

Expand source code
def run_cluster(
    flow: Dataflow,
    inp: Iterable[Tuple[int, Any]],
    proc_count: int = 1,
    worker_count_per_proc: int = 1,
    mp_ctx=get_context("spawn"),
) -> List[Tuple[int, Any]]:
    """Pass data through a dataflow running as a cluster of processes on
    this machine.
    Blocks until execution is complete.
    Starts up cluster processes for you, handles connecting them
    together, distributing input, and collecting output. You'd
    commonly use this for notebook analysis that needs parallelism and
    higher throughput, or simple stand-alone demo programs.
    Input must be finite because it is reified into a list before
    distribution to the cluster; otherwise collected output will grow
    unbounded.
    >>> from bytewax.testing import doctest_ctx
    >>> flow = Dataflow()
    >>> flow.map(str.upper)
    >>> flow.capture()
    >>> out = run_cluster(
    ...     flow,
    ...     [(0, "a"), (1, "b"), (2, "c")],
    ...     proc_count=2,
    ...     mp_ctx=doctest_ctx,  # Outside a doctest, you'd skip this.
    ... )
    >>> sorted(out)
    [(0, 'A'), (1, 'B'), (2, 'C')]
    See `bytewax.spawn_cluster()` for starting a cluster on this
    machine with full control over inputs and outputs.
    See `bytewax.cluster_main()` for starting one process in a cluster
    in a distributed situation.
    Args:
        flow: Dataflow to run.
        inp: Input data. Will be reified into a list before sending to
            processes. Will be partitioned between workers for you.
        proc_count: Number of processes to start.
        worker_count_per_proc: Number of worker threads to start on
            each process.
        mp_ctx: `multiprocessing` context to use. Use this to
            configure starting up subprocesses via spawn or
            fork. Defaults to spawn.
    Returns:
        List of `(epoch, item)` tuples seen by capture operators.
    """
    # A Manager starts up a background process to manage shared state.
    with mp_ctx.Manager() as man:
        inp = man.list(list(inp))

        def input_builder(worker_index, worker_count):
            for i, epoch_item in enumerate(inp):
                if i % worker_count == worker_index:
                    (epoch, item) = epoch_item
                    yield AdvanceTo(epoch)
                    yield Emit(item)

        out = man.list()

        def output_builder(worker_index, worker_count):
            return out.append

        spawn_cluster(
            flow,
            input_builder,
            output_builder,
            proc_count,
            worker_count_per_proc,
            mp_ctx,
        )

        # We have to copy out the shared state before process
        # shutdown.
        return list(out)
def run_main(flow, input_builder, output_builder)

Execute a dataflow in the current thread.

Blocks until execution is complete.

You'd commonly use this for prototyping custom input and output builders with a single worker before using them in a cluster setting.

>>> flow = Dataflow()
>>> flow.capture()
>>> def input_builder(worker_index, worker_count):
...     for epoch, item in enumerate(range(3)):
...         yield AdvanceTo(epoch)
...         yield Emit(item)
>>> def output_builder(worker_index, worker_count):
...     return print
>>> run_main(flow, input_builder, output_builder)  # doctest: +ELLIPSIS
(...)

See run() for a convenience method that handles the input and output builders for you.

See spawn_cluster() for starting a cluster on this machine with full control over inputs and outputs.

Args

flow
Dataflow to run.
input_builder
Returns the input that each worker thread should process. Should yield AdvanceTo or Emit objects.
output_builder
Returns a callback function for each worker thread, called with (epoch, item) whenever an item passes by a capture operator on this process.
def spawn_cluster(flow: Dataflow, input_builder: Callable[[int, int], Iterable[Union[AdvanceTo, Emit]]], output_builder: Callable[[int, int], Callable[[Tuple[int, Any]], None]], proc_count: int = 1, worker_count_per_proc: int = 1, mp_ctx=<multiprocess.context.SpawnContext object>) -> List[Tuple[int, Any]]

Execute a dataflow as a cluster of processes on this machine.

Blocks until execution is complete.

Starts up cluster processes for you and handles connecting them together. You'd commonly use this for notebook analysis that needs parallelism and higher throughput, or simple stand-alone demo programs.

>>> from bytewax.testing import doctest_ctx
>>> flow = Dataflow()
>>> flow.capture()
>>> def input_builder(i, n):
...   for epoch, item in enumerate(range(3)):
...     yield AdvanceTo(epoch)
...     yield Emit(item)
>>> def output_builder(worker_index, worker_count):
...     return print
>>> spawn_cluster(
...     flow,
...     input_builder,
...     output_builder,
...     proc_count=2,
...     mp_ctx=doctest_ctx,  # Outside a doctest, you'd skip this.
... )  # doctest: +ELLIPSIS
(...)

See run_main() for a way to test input and output builders without the complexity of starting a cluster.

See run_cluster() for a convenience method to pass data through a dataflow for notebook development.

See cluster_main() for starting one process in a cluster in a distributed situation.

Args

flow
Dataflow to run.
input_builder
Returns the input that each worker thread should process. Should yield AdvanceTo or Emit objects.
output_builder
Returns a callback function for each worker thread, called with (epoch, item) whenever an item passes by a capture operator on this process.
proc_count
Number of processes to start.
worker_count_per_proc
Number of worker threads to start on each process.
mp_ctx
multiprocessing context to use. Use this to configure starting up subprocesses via spawn or fork. Defaults to spawn.

Expand source code
def spawn_cluster(
    flow: Dataflow,
    input_builder: Callable[[int, int], Iterable[Union[AdvanceTo, Emit]]],
    output_builder: Callable[[int, int], Callable[[Tuple[int, Any]], None]],
    proc_count: int = 1,
    worker_count_per_proc: int = 1,
    mp_ctx=get_context("spawn"),
) -> List[Tuple[int, Any]]:
    """Execute a dataflow as a cluster of processes on this machine.

    Blocks until execution is complete.

    Starts up cluster processes for you and handles connecting them
    together. You'd commonly use this for notebook analysis that needs
    parallelism and higher throughput, or simple stand-alone demo
    programs.

    >>> from bytewax.testing import doctest_ctx
    >>> flow = Dataflow()
    >>> flow.capture()
    >>> def input_builder(i, n):
    ...   for epoch, item in enumerate(range(3)):
    ...     yield AdvanceTo(epoch)
    ...     yield Emit(item)
    >>> def output_builder(worker_index, worker_count):
    ...     return print
    >>> spawn_cluster(
    ...     flow,
    ...     input_builder,
    ...     output_builder,
    ...     proc_count=2,
    ...     mp_ctx=doctest_ctx,  # Outside a doctest, you'd skip this.
    ... )  # doctest: +ELLIPSIS
    (...)

    See `bytewax.run_main()` for a way to test input and output
    builders without the complexity of starting a cluster.

    See `bytewax.run_cluster()` for a convenience method to pass data
    through a dataflow for notebook development.

    See `bytewax.cluster_main()` for starting one process in a cluster
    in a distributed situation.

    Args:

        flow: Dataflow to run.

        input_builder: Returns input that each worker thread should
            process.

        output_builder: Returns a callback function for each worker
            thread, called with `(epoch, item)` whenever an item
            passes by a capture operator on this process.

        proc_count: Number of processes to start.

        worker_count_per_proc: Number of worker threads to start on
            each process.

        mp_ctx: `multiprocessing` context to use. Use this to
            configure starting up subprocesses via spawn or
            fork. Defaults to spawn.

    """
    addresses = _gen_addresses(proc_count)
    with mp_ctx.Pool(processes=proc_count) as pool:
        futures = [
            pool.apply_async(
                cluster_main,
                (
                    flow,
                    input_builder,
                    output_builder,
                    addresses,
                    proc_id,
                    worker_count_per_proc,
                ),
            )
            for proc_id in range(proc_count)
        ]
        pool.close()

        for future in futures:
            # Will re-raise exceptions from subprocesses.
            future.get()

        pool.join()

Classes

class AdvanceTo (epoch)

Instance variables

var epoch

The epoch to advance to.
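
For example, a sketch of how an input builder interleaves AdvanceTo and Emit, and how the wrapped values can be read back from the attributes listed here:

>>> def input_builder(worker_index, worker_count):
...     for epoch, item in enumerate(["a", "b", "c"]):
...         yield AdvanceTo(epoch)
...         yield Emit(item)
>>> AdvanceTo(3).epoch
3
>>> Emit("x").item
'x'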

class Dataflow

A definition of a Bytewax dataflow graph.

Use the methods defined on this class to add steps with operators of the same name.

See the execution functions in the bytewax module to run it.

TODO: Right now this is a linear dataflow only.
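
For example, a sketch of building a small linear dataflow and running it with run() from this module:

>>> flow = Dataflow()
>>> flow.map(str.strip)
>>> flow.map(str.upper)
>>> flow.capture()
>>> sorted(run(flow, [(0, " a "), (1, " b ")]))
[(0, 'A'), (1, 'B')]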

Methods

def capture(self)

Capture causes all (epoch, item) tuples that pass by this point in the Dataflow to be passed to the Dataflow's output handler.

Every dataflow must contain at least one capture.

If you use this operator multiple times, the results will be combined.

There are no guarantees on the order that output is passed to the handler. Read the attached epoch to discern order.
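
For example, a minimal sketch where capture is the only step, driven with run() from this module:

>>> flow = Dataflow()
>>> flow.capture()
>>> sorted(run(flow, [(0, "a"), (1, "b")]))
[(0, 'a'), (1, 'b')]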

def filter(self, predicate)

Filter selectively keeps only some items.

It calls a function predicate(item: Any) => should_emit: bool on each item.

It emits the item downstream unmodified if the predicate returns True.
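
For example, a sketch that keeps only even items, driven with run() from this module:

>>> flow = Dataflow()
>>> flow.filter(lambda item: item % 2 == 0)
>>> flow.capture()
>>> sorted(run(flow, enumerate(range(4))))
[(0, 0), (2, 2)]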

def flat_map(self, mapper)

Flat Map is a one-to-many transformation of items.

It calls a function mapper(item: Any) => emit: Iterable[Any] on each item.

It emits each element in the downstream iterator individually.
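
For example, a sketch that splits lines into words, driven with run() from this module:

>>> flow = Dataflow()
>>> flow.flat_map(str.split)
>>> flow.capture()
>>> sorted(run(flow, [(0, "hello world")]))
[(0, 'hello'), (0, 'world')]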

def inspect(self, inspector)

Inspect allows you to observe, but not modify, items.

It calls a function inspector(item: Any) => None on each item.

The return value is ignored; it emits items downstream unmodified.
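
For example, a sketch that prints each item as it flows past, driven with run() from this module:

>>> flow = Dataflow()
>>> flow.inspect(print)
>>> flow.capture()
>>> out = run(flow, [(0, "a")])
a
>>> sorted(out)
[(0, 'a')]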

def inspect_epoch(self, inspector)

Inspect Epoch allows you to observe, but not modify, items and their epochs.

It calls a function inspector(epoch: int, item: Any) => None on each item with its epoch.

The return value is ignored; it emits items downstream unmodified.
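
For example, a sketch that prints each item together with its epoch, driven with run() from this module:

>>> flow = Dataflow()
>>> flow.inspect_epoch(lambda epoch, item: print(epoch, item))
>>> flow.capture()
>>> out = run(flow, [(1, "a")])
1 a
>>> sorted(out)
[(1, 'a')]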

def map(self, mapper)

Map is a one-to-one transformation of items.

It calls a function mapper(item: Any) => updated_item: Any on each item.

It emits each updated item downstream.
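
For example, a sketch that doubles each item, driven with run() from this module:

>>> flow = Dataflow()
>>> flow.map(lambda item: item * 2)
>>> flow.capture()
>>> sorted(run(flow, enumerate(range(3))))
[(0, 0), (1, 2), (2, 4)]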

def reduce(self, reducer, is_complete)

Reduce lets you combine items for a key into an aggregator in epoch order.

Since this is a stateful operator, it requires that the input stream has items that are (key, value) tuples so we can ensure that all relevant values are routed to the relevant aggregator.

It calls two functions:

  • A reducer(aggregator: Any, value: Any) => updated_aggregator: Any which combines two values. The aggregator is initially the first value seen for a key. Values will be passed in epoch order, but no order is defined within an epoch.

  • An is_complete(updated_aggregator: Any) => should_emit: bool which returns true if the most recent (key, aggregator) should be emitted downstream and the aggregator for that key forgotten. If there was only a single value for a key, it is passed in as the aggregator here.

It emits (key, aggregator) tuples downstream when you tell it to.
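
For example, a sketch that collects per-user session events until a "logout" event arrives; it assumes the emitted epoch is the epoch of the value that completed the aggregation:

>>> def extend_session(session, events):
...     return session + events
>>> def session_complete(session):
...     return "logout" in session
>>> flow = Dataflow()
>>> flow.reduce(extend_session, session_complete)
>>> flow.capture()
>>> inp = [
...     (0, ("alice", ["login"])),
...     (1, ("alice", ["post"])),
...     (2, ("alice", ["logout"])),
... ]
>>> sorted(run(flow, inp))
[(2, ('alice', ['login', 'post', 'logout']))]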

def reduce_epoch(self, reducer)

Reduce Epoch lets you combine all items for a key within an epoch into an aggregator.

This is like reduce but marks the aggregator as complete automatically at the end of each epoch.

Since this is a stateful operator, it requires that the input stream has items that are (key, value) tuples so we can ensure that all relevant values are routed to the relevant aggregator.

It calls a function reducer(aggregator: Any, value: Any) => updated_aggregator: Any which combines two values. The aggregator is initially the first value seen for a key. Values will be passed in arbitrary order.

It emits (key, aggregator) tuples downstream at the end of each epoch.
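
For example, a sketch of a per-epoch word count, driven with run() from this module:

>>> def add(count1, count2):
...     return count1 + count2
>>> flow = Dataflow()
>>> flow.map(lambda word: (word, 1))
>>> flow.reduce_epoch(add)
>>> flow.capture()
>>> inp = [(0, "a"), (0, "b"), (0, "a"), (1, "a")]
>>> sorted(run(flow, inp))
[(0, ('a', 2)), (0, ('b', 1)), (1, ('a', 1))]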

def reduce_epoch_local(self, reducer)

Reduce Epoch Local lets you combine all items for a key within an epoch on a single worker.

It is exactly like reduce_epoch but does no internal exchange between workers. You should probably use reduce_epoch instead, unless you are using this as a network-overhead optimization.
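
For example, a sketch of per-worker counting; with the single worker that run() provides there is no exchange to skip, so the result matches reduce_epoch:

>>> flow = Dataflow()
>>> flow.map(lambda word: (word, 1))
>>> flow.reduce_epoch_local(lambda a, b: a + b)
>>> flow.capture()
>>> sorted(run(flow, [(0, "a"), (0, "a")]))
[(0, ('a', 2))]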

def stateful_map(self, builder, mapper)

Stateful Map is a one-to-one transformation of values in (key, value) pairs, but allows you to reference a persistent state for each key when doing the transformation.

Since this is a stateful operator, it requires that the input stream has items that are (key, value) tuples so we can ensure that all relevant values are routed to the relevant state.

It calls two functions:

  • A builder(key: Any) => new_state: Any which returns a new state and will be called whenever a new key is encountered with the key as a parameter.

  • A mapper(state: Any, value: Any) => (updated_state: Any, updated_value: Any) which transforms values. Values will be passed in epoch order, but no order is defined within an epoch. If the updated state is None, the state will be forgotten.

It emits a (key, updated_value) tuple downstream for each input item.
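
For example, a sketch of a running count per key, driven with run() from this module:

>>> def build_count(key):
...     return 0
>>> def running_count(count, value):
...     count += value
...     return count, count
>>> flow = Dataflow()
>>> flow.stateful_map(build_count, running_count)
>>> flow.capture()
>>> inp = [(0, ("a", 1)), (1, ("a", 1)), (1, ("b", 1))]
>>> sorted(run(flow, inp))
[(0, ('a', 1)), (1, ('a', 2)), (1, ('b', 1))]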

class Emit (item)

Instance variables

var item

The item to introduce into the dataflow.