API Reference
Messages
Most data processing workflows share the same basic architecture and differ only in the type of data and how the inputs are formatted. Minor differences in formatting can make almost identical code non-reusable. To address this, the framework uses a single data structure to pass information between components: the Message object. A Message consists of two components: a Pandas DataFrame and a TensorMessage. The former is a general-purpose structure that benefits from all of the features of the popular Pandas library; a DataFrame is essentially a dictionary of arrays. However, in the context of PyTorch deep learning, DataFrames cannot be used for everything, because tensor objects cannot be stored inside a DataFrame (Pandas breaks tensors down into unit-sized tensors and stores those units as objects instead of storing each tensor as one entity). The TensorMessage emulates the structure and methods of DataFrames, except that it only stores PyTorch tensors (in the future, tensors from other frameworks could be supported). Because of this, it also attempts to autoconvert inputs to tensors. With this combined structure, one could store metadata in the DataFrame and example/label pairs in the TensorMessage.
-
class
Fireworks.core.message.
Message
(*args, metadata=None, length=None, **kwargs)[source]¶ Bases:
object
A Message is a class for representing data in a way that can be consumed by different analysis pipelines in Python. It does this by representing the data as a dictionary of arrays. This is the same approach that Pandas takes, but Messages are specifically designed to be used with PyTorch, in which the core data structure is a tensor, which cannot be mixed into a Pandas DataFrame. Hence, a Message has two elements: a TensorMessage, which specifically stores tensor objects, and a DataFrame (df), which can be used for handling any other type of array data. With this structure, you could put your training data in the TensorMessage and any associated metadata in the df, and combine both into a single Message. All of the existing df methods can be run on the metadata, and any PyTorch operations can be performed on the tensor part.
Messages support operations such as appending one message to another, joining two messages along keys, applying maps to the message’s values, and equality checks.
Additionally, messages behave as much as possible like dicts. In many scenarios, a dict will be able to substitute for a message and vice-versa. For example, for preparing batches to feed into an RNN to classify DNA sequences, one could create a Message like this:
my_message = Message({
    'embedded_sequences': torch.Tensor([...]),
    'labels': torch.LongTensor([...]),
    'raw_sequences': ['TCGA...', ...],
    ...})
The Message constructor will parse this dictionary and store the labels and embedded sequences inside a TensorMessage and the raw_sequences and other metadata in the dataframe.
Now we can access elements of this Message:
my_message[0]
my_message[2:10]
my_message['labels']
len(my_message)
We can also move tensors to the GPU and back:
my_message.cpu()
my_message.cuda(device=1)  # The device number defaults to 0
my_message.cuda(keys=['labels'])  # Optionally move only certain columns
-
classmethod
from_objects
(*args, **kwargs)[source]¶ Returns a message by treating all of the provided values as atomic elements and ignoring length differences. This results in a Message of length 1. This can be useful if you want to treat an entire blob of data as a single unit. For example, if you want a row of the Message to contain the state of a neural network, one column could be ‘bias’, and the element could be an entire bias vector, and another column could be ‘weights’, containing an entire weight matrix in a single row.
-
classmethod
read
(method, path, *args, **kwargs)[source]¶ Reads the file located at a provided path using a given method and loads the data into a Message. This method uses the Pandas .read_ functions and converts the resulting DataFrame to a Message. Thus, there is no way currently to go straight from data in a file to a TensorMessage. Additionally, only the methods in the read_methods dict at the top of the file are supported.
You can provide additional positional and key-word arguments which will be passed to the relevant Pandas method to parameterize the read.
Parameters: - method (-) – The method to read the file as. Must be one of: json, csv, excel, hdf, parquet, pickle, sql_table, stata, table.
- path (-) – The location of the file. This file must exist, be readable, and must be the type specified by method.
- *args, **kwargs (-) –
Additional arguments for the underlying Pandas function call.
Returns: A Message representation of the loaded data.
Return type: - message
-
to
(method, *args, **kwargs)[source]¶ Converts all elements of the Message to a dataframe and then calls df.to_x based on the given method. This can be used to serialize and save Messages or convert them into a different format.
Parameters: - method (-) – The method to save the file as. Must be one of: ‘json’, ‘dict’, ‘html’, ‘feather’, ‘latex’, ‘stata’, ‘msgpack’, ‘gbq’, ‘records’, ‘sparse’, ‘dense’, ‘string’, ‘clipboard’
- *args, **kwargs (-) –
Additional arguments for the underlying Pandas pd.to_ function call as desired. If the first argument (or kwarg path_or_buf) is provided, it will be used as a filepath to save to.
Returns: - Depending on the method chosen, this will save the Message to a file location and then return the converted Message (eg. to_json will return a json)
-
check_length
()[source]¶ Checks that the lengths of the internal tensor_message and dataframe are the same and equal to self.len. If one of the two is empty (length 0), then that is fine.
-
columns
¶ Returns the names of columns in this Message
-
index
¶ Returns index for internal tensors
-
append
(other)[source]¶ Combines messages together. Should initialize other if it is not a Message already.
Parameters: other – The message to append to self. Must have the same keys as self so that in the resulting Message, every column continues to have the same length as needed.
-
merge
(other)[source]¶ Combines messages horizontally by producing a message with the keys/values of both.
Parameters: other – The message to merge with self. Must have different keys and the same length as self to ensure length consistencies. Alternatively, if either self or other have an empty TensorMessage or df, then they can be merged together safely as long as the resulting Message has a consistent length. For example:
message_a = Message({'a': [1,2,3]})  # This is essentially a DataFrame
message_b = Message({'b': torch.Tensor([1,2,3])})  # This is essentially a TensorMessage
message_c = message_a.merge(message_b)  # This works
Returns: The concatenated Message containing columns from self and other. Return type: message
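The difference between append (vertical, same keys) and merge (horizontal, distinct keys, equal length) can be sketched on plain dicts of lists. This is only an illustration of the constraints described above, not the Fireworks implementation; append_columns and merge_columns are hypothetical helpers, and the exception types are assumptions.

```python
def append_columns(first, second):
    """Vertical combination: same keys, rows of `second` follow rows of `first`."""
    if set(first) != set(second):
        raise KeyError("append requires both sides to have the same keys")
    return {key: list(first[key]) + list(second[key]) for key in first}


def merge_columns(first, second):
    """Horizontal combination: distinct keys, columns of equal length."""
    overlap = set(first) & set(second)
    if overlap:
        raise KeyError(f"merge requires distinct keys, got overlap: {sorted(overlap)}")
    lengths = {len(values) for values in list(first.values()) + list(second.values())}
    if len(lengths) > 1:
        raise ValueError("merge requires columns of equal length")
    return {**first, **second}


appended = append_columns({"a": [1, 2]}, {"a": [3]})
merged = merge_columns({"a": [1, 2, 3]}, {"b": [4, 5, 6]})
```

Here appended has one column of length 3, while merged has two columns of length 3 each.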
-
map
(mapping)[source]¶ Applies function mapping to message. If mapping is a dict, then maps will be applied to the corresponding keys as columns, leaving columns not present in mapping untouched. In other words, mapping would be a dict of column_name:function pairs specifying the mappings.
Parameters: mapping – Can either be a dict mapping column names to functions that should be applied to those columns, or a single function. In the latter case, the mapping function will be applied to every column. Returns: A Message with the column:value pairs produced by the mapping. Return type: message
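As a rough illustration of the two forms that mapping can take, the sketch below applies the same rule to a plain dict of lists. It does not use Fireworks; apply_mapping is a hypothetical helper, and it assumes each function receives a whole column (the description above does not say whether the real method works per column or per element).

```python
def apply_mapping(columns, mapping):
    """Apply `mapping` to a dict of columns (plain-Python stand-in for Message.map)."""
    if callable(mapping):
        # A single function is applied to every column.
        return {key: mapping(values) for key, values in columns.items()}
    # A dict of column_name: function only touches the listed columns.
    return {
        key: mapping[key](values) if key in mapping else values
        for key, values in columns.items()
    }


columns = {"a": [1, 2, 3], "b": [10, 20, 30]}
doubled_a = apply_mapping(columns, {"a": lambda col: [2 * v for v in col]})
reversed_all = apply_mapping(columns, lambda col: list(reversed(col)))
```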
-
tensors
(keys=None)[source]¶ Return tensors associated with message as a tensormessage. If keys are specified, returns tensors associated with those keys, performing conversions as needed.
Parameters: keys – Keys to get. Default = None, in which case all tensors are returned as a TensorMessage. If columns corresponding to requested keys are not tensors, they will be converted. Returns: A TensorMessage containing the tensors requested. Return type: tensors (TensorMessage)
-
to_tensors
(keys=None)[source]¶ Returns message with columns indicated by keys converted to Tensors. If keys is None, all columns are converted.
Parameters: keys – Keys to get. Default = None, in which case all columns are mapped to Tensor. Returns: A Message in which the desired columns are Tensors. Return type: message
-
dataframe
(keys=None)[source]¶ Returns message as a dataframe. If keys are specified, only returns those keys as a dataframe.
Parameters: keys – Keys to get. Default = None, in which case all non-tensors are returned as a DataFrame. If columns corresponding to requested keys are tensors, they will be converted (to np.arrays). Returns: A DataFrame containing the columns requested. Return type: df (pd.DataFrame)
-
to_dataframe
(keys=None)[source]¶ Returns message with columns indicated by keys converted to DataFrame. If keys is None, all tensors are converted.
Parameters: keys – Keys to get. Default = None, in which case all tensors are mapped to DataFrame. Returns: A Message in which the desired columns are DataFrames. Return type: message
-
permute
(index)[source]¶ Reorders elements of message based on index.
Parameters: index – A valid index for the message. Returns: - A new Message with the elements arranged according to the input index.
- For example:
message_a = Message({'a': [1,2,3]})
message_b = message_a.permute([2,1,0])
message_c = Message({'a': [3,2,1]})
message_b == message_c
The last statement will evaluate to True.
Return type: message
-
cpu
(keys=None)[source]¶ Moves tensors to system memory. Can specify which ones to move by specifying keys.
Parameters: keys – Keys to move to system memory. Default = None, meaning all columns are moved. Returns: Moved message Return type: message (Message)
-
cuda
(device=0, keys=None)[source]¶ Moves tensors to gpu with given device number. Can specify which ones to move by specifying keys.
Parameters: - device (int) – CUDA device number to use. Default = 0.
- keys – Keys to move to GPU. Default = None, meaning all columns are moved.
Returns: Moved message
Return type: message (Message)
-
class
Fireworks.core.message.
TensorMessage
(message_dict=None, map_dict=None)[source]¶ Bases:
object
A TensorMessage is a class for representing data meant for consumption by PyTorch as a dictionary of tensors. It is analogous to a Pandas DataFrame, except all elements are PyTorch Tensors (or torch.nn.Parameter). This can be useful when working with PyTorch, as you can have your Models use column names to keep track of what each Tensor is for, and by parameterizing column names, you can make your Models somewhat generic: instead of taking x as an argument that must have a specific format, you take a TensorMessage and act on columns that can be configured at runtime or in your script.
TensorMessages, along with DataFrames, constitute the components of a Message. We recommend using Messages rather than TensorMessages directly, as all of the functionality of TensorMessages is encapsulated by a Message, and this gives you the ability to seamlessly move data to and from Tensor format, so you can mix your deep learning tasks with other data science tasks that you would use Pandas for.
-
keys
(*args, **kwargs)[source]¶ Returns the column names of this TensorMessage, which are the keys of the internal tensor_dict.
-
tensor_message
¶
-
df
¶
-
columns
¶ Returns names of tensors in TensorMessage
-
index
¶ Returns the index for internal tensors
-
append
(other)[source]¶ Note that if the target message has additional keys, those will be dropped. The target message must also have every key present in this message in order to avoid a value error due to length differences.
-
merge
(other)[source]¶ Combines self with other into a message with the keys of both. self and other must have distinct keys.
-
-
Fireworks.core.message.
compute_length
(of_this)[source]¶ Of_this is a dict of listlikes. This function computes the length of that object, which is the length of all of the listlikes, which are assumed to be equal. This also implicitly checks for the lengths to be equal, which is necessary for Message/TensorMessage.
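A minimal sketch of this kind of length computation, assuming a plain dict of list-likes. The name compute_length_sketch, the choice of ValueError, and returning 0 for an empty dict are assumptions made for illustration, not confirmed behavior of the library function.

```python
def compute_length_sketch(of_this):
    """Return the shared length of all list-likes in `of_this`, checking they agree."""
    lengths = {key: len(values) for key, values in of_this.items()}
    distinct = set(lengths.values())
    if not distinct:
        return 0  # Assumption: an empty dict has length 0.
    if len(distinct) > 1:
        raise ValueError(f"Columns have inconsistent lengths: {lengths}")
    return distinct.pop()


shared_length = compute_length_sketch({"a": [1, 2, 3], "b": "xyz"})
```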
-
Fireworks.core.message.
extract_tensors
(from_this)[source]¶ Given a dict from_this, returns two dicts, one containing all of the key/value pairs corresponding to tensors in from_this, and the other containing the remaining pairs.
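The split described here can be sketched without PyTorch by passing in the test for "is a tensor" as a predicate. In this illustration split_by_type and FakeTensor are hypothetical names; with PyTorch installed, torch.is_tensor could serve as the predicate instead.

```python
def split_by_type(from_this, is_tensor):
    """Split a dict into (tensor-valued pairs, remaining pairs)."""
    tensors, others = {}, {}
    for key, value in from_this.items():
        target = tensors if is_tensor(value) else others
        target[key] = value
    return tensors, others


class FakeTensor(list):
    """Hypothetical placeholder type standing in for torch.Tensor."""


data = {"labels": FakeTensor([0, 1]), "raw_sequences": ["TCGA", "GGCA"]}
tensors, others = split_by_type(data, lambda value: isinstance(value, FakeTensor))
```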
-
Fireworks.core.message.
complement
(indices, n)[source]¶ Given an index, returns all indices between 0 and n that are not in the index.
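For illustration, a complement of this kind can be computed as below. The sketch assumes that "between 0 and n" means the half-open range 0, 1, ..., n-1, which matches Python indexing but is not stated above; complement_sketch is a hypothetical name.

```python
def complement_sketch(indices, n):
    """Return every position in range(n) that is not listed in `indices`."""
    excluded = set(indices)
    return [i for i in range(n) if i not in excluded]


remaining = complement_sketch([1, 3], 5)
```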
Pipes
With a uniform data structure for information transfer established, we can create functions and classes that are reusable because of the standardized I/O expectations. A Pipe object represents some transformation that is applied to data as it flows through a pipeline. For example, a pipeline could begin with a source that reads from the database, followed by one that caches those reads in memory, then one that applies embedding transformations to create tensors, and so on.
These transformations are represented as classes rather than functions because we sometimes want to be able to apply transformations in a just-in-time or ahead-of-time manner, or have the transformations be dependent on some upstream or downstream aspect of the pipeline. For example, the Pipe that creates minibatches for training can convert its inputs to tensors and move them to GPU as a minibatch is created, using the tensor-conversion method implemented by an upstream Pipe. Or a Pipe that caches its inputs can prefetch objects to improve overall performance, and so on.

-
Fireworks.core.pipe.
recursive
(accumulate=False)[source]¶ Decorator that labels a Pipe method as recursive. This means that the decorated method will first be called on the Pipe’s inputs and then on the Pipe itself. If accumulate is set to True, then the result from calling the method on a given Pipe will be used as input to the next one. If False, then the original arguments will be used when calling the method each time.
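The behavior described for this decorator can be approximated in plain Python as follows. This is a sketch under stated assumptions, not the Fireworks source: recursive_sketch and Stage are hypothetical names, each object is assumed to hold a single upstream object in an attribute called input, and with accumulate=True the upstream result replaces all arguments.

```python
import functools


def recursive_sketch(accumulate=False):
    """Call the decorated method upstream first, then on the object itself."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, *args, **kwargs):
            upstream = getattr(self, "input", None)
            if upstream is not None and hasattr(upstream, func.__name__):
                result = getattr(upstream, func.__name__)(*args, **kwargs)
                if accumulate:
                    # Feed the upstream result into this stage.
                    args, kwargs = (result,), {}
            return func(self, *args, **kwargs)
        return wrapper
    return decorator


class Stage:
    """Hypothetical stand-in for a Pipe with one input."""

    def __init__(self, offset, input=None):
        self.offset = offset
        self.input = input

    @recursive_sketch(accumulate=True)
    def transform(self, value):
        return value + self.offset

    @recursive_sketch()
    def mark(self, log):
        log.append(self.offset)


chain = Stage(100, input=Stage(10, input=Stage(1)))
total = chain.transform(0)  # Each stage adds its offset to the upstream result.
log = []
chain.mark(log)  # Every stage receives the same original argument.
```

After running this, log lists the offsets in upstream-to-downstream order, which shows that the inputs are visited before the Pipe itself.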
-
class
Fireworks.core.pipe.
Pipe
(input=None, *args, **kwargs)[source]¶ Bases:
abc.ABC
The core object of computation in fireworks. A Pipe can take Pipes as inputs, and its outputs can be streamed to other Pipes. All communication is done via Message objects. Method calls are deferred to input Pipes recursively until a Pipe that implements the method is reached.
This is made possible with a recursive function call method. Any Pipe can use this method to call a method on its inputs; this will recursively loop until reaching a Pipe that implements the method and return those outputs (as a Message) or raise an error if there are none. For example, we can do something like this:
reader = pipe_for_reading_from_some_dataset(...)
cache = CachingPipe(reader, cache_type='LRU')
embedder = CreateEmbeddingsPipe(cache)
loader = CreateMinibatchesPipe(embedder)
loader.reset()
for batch in loader:
    ...  # Code for training
Under the hood, the code for loader.__next__() can choose to recursively call a to_tensor() method which is implemented by embedder. Index queries and other magic methods can also be implemented recursively, and this enables a degree of commutativity when stacking Pipes together (changing the order of Pipes is often allowed because of the pass-through nature of recursive calls).
Note that in order for this to work well, there must be some consistency among method names. If a Pipe expects ‘to_tensor’ to convert batches to tensor format, then an upstream Pipe must have a method with that name, and this should remain consistent across projects to maintain reusability. Lastly, the format for specifying inputs to a Pipe is a dictionary of Pipes. The keys in this dictionary can provide information for the Pipe to use or be ignored completely.
-
name
= 'base_pipe'¶
-
stateful_attributes
= []¶
-
get_state
()[source]¶ This returns the current state of the Pipe, which consists of the values of all attributes designated in the list ‘stateful_attributes’. This can be used to save and load a Pipe’s state.
Parameters: None. Returns: A dict of the form {‘internal’: {…}, ‘external’: {…}}, where the ‘external’ subdict is empty. This is so that the representation is consistent with the get_state methods of Junctions and Models. We consider all attributes of a Pipe to be internal, and that is why the ‘external’ subdict is empty. See documentation on Component Map for more details on what we mean by that (note that Pipes don’t use Component_Maps to store state, but simply expose similar methods for compatibility). Return type: dict
-
set_state
(state, *args, **kwargs)[source]¶ Sets the state of the pipe based on the provided state argument.
Parameters: state (-) – A dict of the form {‘internal’: {…}, ‘external’: {…}}. The ‘external’ dict will be ignored, because we consider all attributes of a Pipe to be internal (for simplicity). See Component_Map documentation for details.
-
recursive_call
(attribute, *args, ignore_first=True, call=True, **kwargs)[source]¶ Recursively calls method/attribute on input until reaching an upstream Pipe that implements the method and returns the response as a message (empty if response is None). Recursive calls enable a stack of Pipes to behave as one entity; any method implemented by any component can be accessed recursively.
Parameters: - attribute – The name of the attribute/method to call.
- args – The arguments if this is a recursive method call.
- ignore_first – If True, then ignore whether or not the target attribute is implemented by self. This can be useful if a Pipe implements a method and wants to use an upstream call of the same method as well.
- call – If True, and the attribute is a method, the method will be called. Otherwise, it will be returned as is.
- kwargs – The kwargs if this is a recursive method call.
Returns: A dictionary mapping the name of each input Pipe to the response that was returned.
Return type: Responses (dict)
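The upstream search that recursive_call performs can be sketched with a simple loop over a chain of objects. The classes below are hypothetical stand-ins that assume one input per Pipe; unlike the method documented above, the sketch returns the raw result rather than a Message or a dict of responses.

```python
class SketchPipe:
    """Hypothetical minimal Pipe that holds a single upstream input."""

    def __init__(self, input=None):
        self.input = input

    def recursive_call(self, attribute, *args, ignore_first=True, call=True, **kwargs):
        # Start at the input unless this object itself should be considered.
        pipe = self.input if ignore_first else self
        while pipe is not None:
            if hasattr(pipe, attribute):
                found = getattr(pipe, attribute)
                return found(*args, **kwargs) if call and callable(found) else found
            pipe = getattr(pipe, "input", None)
        raise AttributeError(f"No upstream pipe implements {attribute!r}")


class Embedder(SketchPipe):
    def to_tensor(self, batch):
        # Placeholder conversion; a real Pipe would produce tensors.
        return [float(x) for x in batch]


class Loader(SketchPipe):
    pass


loader = Loader(input=SketchPipe(input=Embedder()))
converted = loader.recursive_call("to_tensor", [1, 2])
```

The call skips the intermediate SketchPipe, which does not implement to_tensor, and is answered by the Embedder two levels upstream.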
-
-
class
Fireworks.core.pipe.
HookedPassThroughPipe
(input=None, *args, **kwargs)[source]¶ Bases:
Fireworks.core.pipe.Pipe
This Pipe has hooks which can be implemented by subclasses to modify the behavior of passed-through calls. You can define hooks for the following (magic) methods: __getitem__, __call__, and __next__. Whenever you call one of these methods, the following happens:
- The method will be recursively called on this Pipe’s input (if it exists)
- The appropriate hook function will be called on the result of that recursive call
- This will be the returned value.
These hooks can make it easy to create pipes that ‘do something’ every time data is accessed in a certain way. For example, you could have the pipe apply some transform to the data.
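The three steps above can be sketched for __getitem__ as follows. The hook name _getitem_hook and both classes are hypothetical and chosen only for illustration; the actual hook names used by HookedPassThroughPipe are not given in this section.

```python
class HookedSketch:
    """Hypothetical pass-through wrapper with a hook on item access."""

    def __init__(self, input):
        self.input = input

    def _getitem_hook(self, message):
        # Subclasses override this; the default passes data through unchanged.
        return message

    def __getitem__(self, index):
        # 1. Call the method on the input, 2. apply the hook, 3. return the result.
        return self._getitem_hook(self.input[index])


class Doubler(HookedSketch):
    def _getitem_hook(self, message):
        if isinstance(message, list):
            return [2 * value for value in message]
        return 2 * message


doubled = Doubler([1, 2, 3, 4])
single = doubled[1]
window = doubled[1:3]
```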
-
name
= 'Hooked-passthrough Pipe'¶
-
class
Fireworks.core.cache.
MessageCache
(max_size)[source]¶ Bases:
abc.ABC
A message cache stores parts of a larger message and supports retrievals and insertions based on index. The use case for a MessageCache is storing parts of a large dataset in memory. The MessageCache can keep track of which elements and which indices are present in memory at a given time and allow for updates and retrievals.
-
insert
(index, message)[source]¶ Inserts message into cache along with the desired indices. This method should be called by __setitem__ as needed to perform the insertion.
Parameters: - index – The index to insert into. Can be an int, slice, or list of integer indices.
- message – The Message to insert. Should have the same length as the provided index.
-
delete
(index)[source]¶ Deletes elements in the message corresponding to index. This method should be called by __setitem__ or __delitem__ as needed.
Parameters: index – The index to delete. Can be an int, slice, or list of integer indices.
-
size
¶
-
-
class
Fireworks.core.cache.
UnlimitedCache
[source]¶ Bases:
Fireworks.core.cache.MessageCache
This is a basic implementation of a MessageCache that simply appends new elements and never clears memory internally.
-
class
Fireworks.core.cache.
BufferedCache
(max_size)[source]¶ Bases:
Fireworks.core.cache.MessageCache
This implements a setitem method that assumes that when the cache is full, elements must be deleted until it is max_size - buffer_size in length. The deletion method, _free, must be implemented by a subclass.
-
class
Fireworks.core.cache.
RankingCache
(max_size)[source]¶ Bases:
Fireworks.core.cache.MessageCache
Implements a free method that deletes elements based on a ranking function.
-
class
Fireworks.core.cache.
LRUCache
(*args, buffer_size=0, **kwargs)[source]¶ Bases:
Fireworks.core.cache.RankingCache
,Fireworks.core.cache.BufferedCache
Implements a Least Recently Used cache. When space must be freed, the least recently accessed items are deleted first. A call to __getitem__ or __setitem__ counts as accessing an element.
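The eviction policy can be illustrated with the standard library's OrderedDict. This is a generic sketch of least-recently-used eviction, not the Fireworks LRUCache: LRUSketch is a hypothetical class that stores one value per integer index and ignores buffer_size.

```python
from collections import OrderedDict


class LRUSketch:
    """Hypothetical minimal LRU cache keyed by index."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._items = OrderedDict()

    def __getitem__(self, index):
        self._items.move_to_end(index)  # Reading counts as an access.
        return self._items[index]

    def __setitem__(self, index, value):
        self._items[index] = value
        self._items.move_to_end(index)  # Writing counts as an access.
        while len(self._items) > self.max_size:
            self._items.popitem(last=False)  # Evict the least recently used entry.

    def __contains__(self, index):
        return index in self._items


cache = LRUSketch(max_size=2)
cache[0] = "a"
cache[1] = "b"
first = cache[0]  # Index 0 is now the most recently used entry.
cache[2] = "c"    # The cache is full, so index 1 is evicted.
```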
-
class
Fireworks.core.cache.
LFUCache
(*args, buffer_size=0, **kwargs)[source]¶ Bases:
Fireworks.core.cache.RankingCache
,Fireworks.core.cache.BufferedCache
Implements a Least Frequently Used cache. Items are deleted in increasing order of how frequently they are accessed. A call to __getitem__ or __setitem__ counts as accessing an element.
-
Fireworks.core.cache.
pointer_adjustment_function
(index)[source]¶ Given an index, returns a function that takes an integer as input and returns how many elements of the index the number is greater than. This is used for readjusting pointers after a deletion. For example, if you delete index 2, then every index greater than 2 must slide down by 1, but indices 0 and 1 do not move.
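One way to build such a function is a binary search over the sorted deleted positions. The sketch below is an illustration using the standard bisect module; pointer_adjustment_sketch is a hypothetical name and the real implementation may differ.

```python
from bisect import bisect_left


def pointer_adjustment_sketch(index):
    """Return a function giving how far a pointer must slide down after deletion."""
    deleted = sorted(index)

    def adjustment(pointer):
        # Number of deleted positions strictly smaller than `pointer`.
        return bisect_left(deleted, pointer)

    return adjustment


adjust = pointer_adjustment_sketch([2])
shifts = [adjust(pointer) for pointer in range(5)]
```

With index 2 deleted, pointers 0, 1 and 2 get a shift of 0, and pointers 3 and 4 get a shift of 1.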
-
Fireworks.core.cache.
get_indices
(values, listlike)[source]¶ Returns the indices in listlike that match elements in values.
-
class
Fireworks.toolbox.pipes.
BioSeqPipe
(path, input=None, filetype='fasta', **kwargs)[source]¶ Bases:
Fireworks.core.pipe.Pipe
Class for representing biosequence data. Specifically, this class can read biological data files (such as fasta) and iterate through them as a Pipe. This can serve as the first Pipe in a pipeline for analyzing genomic data.
-
name
= 'BioSeqPipe'¶
-
reset
()¶
-
-
class
Fireworks.toolbox.pipes.
LoopingPipe
(input, *args, **kwargs)[source]¶ Bases:
Fireworks.core.pipe.Pipe
This Pipe can take any iterator and make it appear to be indexable by iterating through the input as needed to reach any given index.
The input Pipe must implement __next__ and reset (to be repeatable), and this will simulate __getitem__ by repeatedly looping through the iterator as needed.
For example, say we have a Pipe that iterates through the lines of a FASTA file:
fasta = BioSeqPipe('genes.fasta')
This Pipe can only iterate through the file in one direction. If we want to access arbitrary elements, we can do this:
clock = LoopingPipe(input=fasta)
clock[10]
clock[2:6]
len(clock)
All of these actions are now possible. Note that this is in general an expensive process, because the Pipe has to iterate one at a time to get to the index it needs. In practice, this Pipe should pipe its output to a CachingPipe that can store values in memory. This approach enables you to process datasets that don’t entirely fit in memory; you can stream in what you need and cache portions. From the perspective of the downstream Pipes, every element of the dataset is accessible as if it were in memory.
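The idea of simulating __getitem__ on top of a forward-only iterator can be sketched as below. LoopingSketch is a hypothetical class that supports only non-negative integer indices, and make_iterator is an assumed zero-argument callable that plays the role of the input's reset method.

```python
class LoopingSketch:
    """Hypothetical wrapper that makes a resettable iterator indexable."""

    def __init__(self, make_iterator):
        # `make_iterator` returns a fresh iterator each time it is called.
        self._make_iterator = make_iterator
        self.reset()

    def reset(self):
        self._iterator = self._make_iterator()
        self._position = 0  # Number of items consumed so far.

    def __getitem__(self, index):
        if index < 0:
            raise IndexError("negative indices are not supported in this sketch")
        if index < self._position:
            self.reset()  # Already past the target, so start over.
        for item in self._iterator:
            self._position += 1
            if self._position - 1 == index:
                return item
        raise IndexError(index)


lines = ["A", "C", "G", "T"]
loop = LoopingSketch(lambda: iter(lines))
picks = [loop[2], loop[0], loop[3]]  # The second access forces a reset.
```

Every backward jump costs a full restart of the iteration, which is why caching the results downstream matters in practice.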
-
name
= 'LoopingPipe'¶
-
reset
(*args, **kwargs)¶
-
-
class
Fireworks.toolbox.pipes.
CachingPipe
(input, *args, cache_size=100, buffer_size=0, cache_type='LRU', infinite=False, **kwargs)[source]¶ Bases:
Fireworks.core.pipe.Pipe
This Pipe can be used to dynamically cache elements from upstream Pipes. Whenever data is requested by index, this Pipe will intercept the request and add that message alongside the index to its internal cache. This can be useful for dealing with datasets that don’t fit in memory or are streamed in. You can cache portions of the dataset as you use them. By combining this with a LoopingPipe, you can create the illusion of making the entire dataset available to downstream Pipes regardless of the type and size of the original data.
More specifically, given input Pipes that implement __getitem__, this Pipe will store the results of all calls to __getitem__ in an internal cache. Thereafter, __getitem__ calls will either read from the cache or trigger __getitem__ calls on the input and an update to the cache.
For example,
fasta = BioSeqPipe(path='genes.fasta')
clock = LoopingPipe(input=fasta)
cache = CachingPipe(input=clock, cache_size=100)  # cache_size is optional; default=100
This sets up a pipeline that reads lines from a FASTA file and accesses and caches elements as requests are made:
cache[20:40]  # This will be cached
cache[25:30]  # This will read from the cache
cache[44]  # This will update the cache
cache[40:140]  # This will fill the cache, flushing out old elements
cache[25:30]  # This will read from the dataset and update the cache again
-
init_cache
(*args, **kwargs)[source]¶ This initializes a cache object at self.cache. There are currently two types of cache available, LRUCache and LFUCache, and you can choose which one by specifying the cache_type argument in the initializer. See Fireworks/core/cache.py for more information on Message caches.
-
-
class
Fireworks.toolbox.pipes.
Title2LabelPipe
(title, input, *args, labels_column='labels', **kwargs)[source]¶ Bases:
Fireworks.core.pipe.HookedPassThroughPipe
This Pipe takes one Pipe as input and inserts a column containing the provided title of the input Pipe into all outputs. The column name is set by labels_column (default ‘labels’).
-
class
Fireworks.toolbox.pipes.
LabelerPipe
(input, labels, *args, **kwargs)[source]¶ Bases:
Fireworks.core.pipe.Pipe
This Pipe implements a to_tensor function that converts labels contained in messages to tensors based on an internal labels dict.
-
class
Fireworks.toolbox.pipes.
RepeaterPipe
(input, *args, repetitions=10, **kwargs)[source]¶ Bases:
Fireworks.core.pipe.Pipe
Given an input Pipe that is iterable, enables repeat iteration. In other words, one loop through a RepeaterPipe is equivalent to n loops through the original dataset, where n is the number of repetitions that have been configured. This can be useful for oversampling a data set without having to duplicate it.
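Repeat iteration of this kind can be sketched with a generator. The function repeat_iteration and its make_iterator argument (a zero-argument callable returning a fresh iterator, standing in for resetting the input) are hypothetical names used for illustration only.

```python
def repeat_iteration(make_iterator, repetitions=10):
    """Yield the items of a fresh iterator `repetitions` times in a row."""
    for _ in range(repetitions):
        yield from make_iterator()


repeated = list(repeat_iteration(lambda: iter([1, 2]), repetitions=3))
```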
-
reset
(*args, **kwargs)¶
-
-
class
Fireworks.toolbox.pipes.
ShufflerPipe
(input, *args, **kwargs)[source]¶ Bases:
Fireworks.core.pipe.Pipe
Given input Pipes that implement __getitem__ and __len__, will shuffle the indices so that iterating through the Pipe or calling __getitem__ will return different values.
-
reset
(*args, **kwargs)¶
-
-
class
Fireworks.toolbox.pipes.
IndexMapperPipe
(input_indices, output_indices, *args, **kwargs)[source]¶ Bases:
Fireworks.core.pipe.Pipe
Given input Pipes that implement __getitem__, returns a Pipe that maps indices in input_indices to output_indices via __getitem__
-
class
Fireworks.toolbox.pipes.
BatchingPipe
(*args, batch_size=5, **kwargs)[source]¶ Bases:
Fireworks.core.pipe.Pipe
Generates minibatches.
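As a simple illustration of minibatching over an indexable dataset, the sketch below slices the data into consecutive chunks. The function minibatches is hypothetical; it yields a shorter final batch when the length is not a multiple of batch_size, which is an assumption and may not match how BatchingPipe treats the remainder.

```python
def minibatches(data, batch_size=5):
    """Yield consecutive slices of `data` with at most `batch_size` items each."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


batches = list(minibatches(list(range(12)), batch_size=5))
```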
-
reset
(*args, **kwargs)¶
-
-
class
Fireworks.toolbox.pipes.
FunctionPipe
(*args, function, **kwargs)[source]¶ Bases:
Fireworks.core.pipe.HookedPassThroughPipe
This Pipe is initialized with a function that is applied to all instances of __getitem__, __next__, and __call__. This can be useful for quickly inserting a function into the middle of a pipeline.
-
class
Fireworks.toolbox.pipes.
TensorPipe
(*args, columns=None, cuda=True, device=0, **kwargs)[source]¶ Bases:
Fireworks.core.pipe.HookedPassThroughPipe
This Pipe converts Messages to tensors. You can specify which columns should be converted.
-
class
Fireworks.toolbox.pipes.
GradientPipe
(*args, columns=None, cuda=True, device=0, **kwargs)[source]¶ Bases:
Fireworks.core.pipe.HookedPassThroughPipe
This Pipe sets requires_grad=True for Tensor Columns. This is necessary for computing gradients when training a model. You can specify which columns should be converted.
Junctions
Whereas Pipes are designed to have one input, Junctions can have multiple inputs, called components. Since there is no unambiguous way to define how recursive method calls would work in this situation, it is the responsibility of each Junction to have built-in logic for how to aggregate its components in order to respond to method calls from downstream sources. This provides a way to construct more complex computation graphs.

-
class
Fireworks.core.junction.
Junction
(*args, components=None, **kwargs)[source]¶ Bases:
object
A junction can take pipes as inputs, and its outputs can be piped to other pipes. All communication is done via Message objects.
Unlike Pipes, junctions do not automatically have recursive method calling. This is because they have multiple input sources, which would result in ambiguity. Instead, junctions are meant to act as bridges between multiple pipes in order to enable complex workflows which require more than a linear pipeline.
Like Models, Junctions can have internal and external components in their state.
-
check_components
(components=None)[source]¶ Checks to see if the provided components dict provides all necessary params for this Junction to run.
-
required_components
¶ This should be overridden by a subclass in order to specify components that should be provided during initialization. Otherwise, this defaults to returning the components already present within the Junction.
-
-
class
Fireworks.core.junction.
PyTorch_Junction
(*args, components=None, **kwargs)[source]¶ Bases:
Fireworks.core.junction.Junction
A PyTorch Junction can automatically convert components to their PyTorch equivalents (e.g. convert a numpy array or list to a torch.Tensor), and this can be useful when using the Junction for PyTorch-related tasks.
-
class
Fireworks.toolbox.junctions.
HubJunction
(*args, **kwargs)[source]¶ Bases:
Fireworks.core.junction.Junction
This junction takes multiple sources implementing __next__ as input and implements a new __next__ method that samples its input sources.
-
class
Fireworks.toolbox.junctions.
RandomHubJunction
(*args, **kwargs)[source]¶ Bases:
Fireworks.toolbox.junctions.HubJunction
HubJunction that randomly chooses inputs to step through.
-
class
Fireworks.toolbox.junctions.
ClockworkHubJunction
(*args, **kwargs)[source]¶ Bases:
Fireworks.toolbox.junctions.HubJunction
HubJunction that iterates through input sources one at a time.
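Stepping through several sources one at a time is a round-robin traversal, which can be sketched with a generator. The function clockwork is hypothetical; dropping a source once it is exhausted and continuing with the rest is an assumption, since the behavior at exhaustion is not described above.

```python
def clockwork(sources):
    """Yield one item from each source in turn until all are exhausted."""
    iterators = [iter(source) for source in sources]
    while iterators:
        for iterator in list(iterators):
            try:
                yield next(iterator)
            except StopIteration:
                # Assumption: an exhausted source is dropped, the others continue.
                iterators.remove(iterator)


interleaved = list(clockwork([[1, 2, 3], ["a"], [10, 20]]))
```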
-
class
Fireworks.toolbox.junctions.
SwitchJunction
(*args, **kwargs)[source]¶ Bases:
Fireworks.core.junction.Junction
This junction has an internal switch that determines which of its components all method calls will be routed to.
-
route
¶ Returns the component to route method calls to based on the internal switch.
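The routing idea can be sketched by forwarding attribute lookups to whichever component the switch currently selects. SwitchSketch is a hypothetical class; it stores components in a dict keyed by name and uses __getattr__ for forwarding, neither of which is confirmed by the description above.

```python
class SwitchSketch:
    """Hypothetical junction that forwards calls to the selected component."""

    def __init__(self, components, switch):
        self.components = components
        self.switch = switch

    @property
    def route(self):
        # The component currently selected by the switch.
        return self.components[self.switch]

    def __getattr__(self, name):
        # Only reached when normal lookup fails; guard the junction's own
        # attributes so that a missing one cannot cause infinite recursion.
        if name in ("components", "switch", "route"):
            raise AttributeError(name)
        return getattr(self.route, name)


junction = SwitchSketch({"train": [1, 2, 3], "test": [9]}, switch="train")
train_count = junction.count(2)  # Forwarded to the 'train' list.
junction.switch = "test"
test_index = junction.index(9)   # Forwarded to the 'test' list.
```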
-
Models
Models are a data structure for representing mathematical models that can be stacked together, incorporated into pipelines, and have their parameters trained using PyTorch. These Models don’t have to be neural networks or even machine learning models; they can represent any function that you want. The goal of the Models class is to decouple the parameterization of a model from its computation. By doing this, those parameters can be swapped in and out as needed, while the computation logic is contained in the code itself. This structure makes it easy to save and load models. For example, if a Model computes y = m*x+b, the parameters m and b can be provided during initialization, they can be learned using gradient descent, or loaded in from a database.

Models function like Junctions with respect to their parameters, which are called components. These components can be PyTorch Parameters, PyTorch Modules, or some other object that has whatever methods/attributes the Model requires. Models function like Pipes with respect to their arguments. Hence, you can insert a Model inside a Pipeline. Models also function like PyTorch Modules with respect to computation and training. Hence, once you have created a Model, you can train it using a method like gradient descent. PyTorch will keep track of gradients and Parameters inside your Models automatically. You can also freeze and unfreeze components of a Model using the freeze/unfreeze methods.

m = LinearModel(components={'m': [1.]}) # Initialize model for y = m*x+b with m = 1.
print(m.required_components) # This will return ['m', 'b']. A model can optionally have initialization logic for components not provided
# For example, the y-intercept b can have a default initialization if not provided here.
print(m.components) # This should return a dict containing both m and b. The model should have initialized a y-intercept and automatically added that to its components dict.
f = NonlinearModel(input=m) # Initialize a model that represents some nonlinearity and give it m as an input.
result = f(x) # Evaluates f(m(x)) on argument message x. Because m is an input of f, m will be called first and pipe its output to f.
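The separation between parameters and computation can be illustrated without Fireworks or PyTorch. The following is a minimal sketch in plain Python, not the library's implementation: `TinyLinearModel` is a hypothetical class that only mimics the `required_components` / `init_default_components` / `forward` pattern, and the default intercept of 0 is an assumption.

```python
class TinyLinearModel:
    """Hypothetical stand-in for a Model computing y = m*x + b."""

    required_components = ['m', 'b']

    def __init__(self, components=None):
        # Parameters live in a dict, separate from the computation in forward().
        self.components = dict(components or {})
        self.init_default_components()
        missing = [k for k in self.required_components if k not in self.components]
        if missing:
            raise ValueError(f"Missing components: {missing}")

    def init_default_components(self):
        # Assumed default: a y-intercept of 0 when none is supplied.
        self.components.setdefault('b', 0.0)

    def forward(self, x):
        return [self.components['m'] * xi + self.components['b'] for xi in x]


model = TinyLinearModel(components={'m': 2.0})
print(model.components)           # {'m': 2.0, 'b': 0.0}
print(model.forward([1.0, 2.0]))  # [2.0, 4.0]

# Swapping a parameter does not touch the computation logic.
model.components['b'] = 1.0
print(model.forward([1.0, 2.0]))  # [3.0, 5.0]
```

Because the parameters are plain dictionary entries, they could equally be loaded from a file or a database before calling `forward`.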
-
class
Fireworks.core.model.
Model
(components={}, *args, input=None, **kwargs)[source]¶ Bases:
Fireworks.core.pipe.HookedPassThroughPipe
,Fireworks.core.junction.Junction
,abc.ABC
Represents a statistical model which has a set of components, and a means for converting inputs into outputs. The model functions like a Pipe with respect to the input/output stream, and it functions like a Junction with respect to the parameterization. components can be provided via multiple different sources in this way, providing flexibility in model configuration. Models can also provide components for other Models, enabling one to create complex graphs of Models that can be trained simultaneously or individually.
-
init_default_components
()[source]¶ This method can optionally be implemented in order for the model to provide a default initialization for some or all of its required components.
-
forward
(message)[source]¶ Represents a forward pass application of the model to an input. Must be implemented by a subclass. This should return a Message.
-
save
(*args, method='json', **kwargs)[source]¶ Aggregates this model’s components into a single Message and saves them using the chosen method. You can use any method that Messages support writing to via the to_ method, and you can provide additional keyword arguments as needed to support this. If the save method involves a path, then the path will be modified for each component and the state_dict. For the state_dict, the path will be torch_{name}-{path}, and for each component it will be {key}_{path}, where name is either the name of the class or self.__name__ if it is defined, and key is the value of the key in the components dictionary.
-
load_state
(*args, method='json', **kwargs)[source]¶ Loads the data in the given save file into the state dict.
-
get_state
()[source]¶ This returns the current state of the Pipe, which consists of the values of all attributes designated in the list ‘stateful_attributes’. This can be used to save and load a Pipe’s state.
Parameters: None
Returns: A dict of the form {‘internal’: {…}, ‘external’: {…}}, where the ‘external’ subdict is empty. This is so that the representation is consistent with the get_state methods of Junctions and Models. We consider all attributes of a Pipe to be internal, and that is why the ‘external’ subdict is empty. See the documentation on Component_Map for more details on what we mean by that (note that Pipes don’t use Component_Maps to store state, but simply expose similar methods for compatibility).
Return type: dict
-
set_state
(state, reset=True)[source]¶ Sets the state of the pipe based on the provided state argument.
Parameters: state (-) – A dict of the form {‘internal’: {…}, ‘external’: {…}}. The ‘external’ dict will be ignored, because we consider all attributes of a Pipe to be internal (for simplicity). See Component_Map documentation for details.
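The state format used by get_state and set_state can be sketched in plain Python. `TinyPipe` below is a hypothetical class, not part of Fireworks; it only assumes the convention described above (a `stateful_attributes` list, an ‘internal’ dict, and an empty ‘external’ dict). The `reset` argument is left out because its behaviour is not described here.

```python
class TinyPipe:
    """Hypothetical pipe-like object exposing get_state / set_state."""

    stateful_attributes = ['count', 'label']

    def __init__(self):
        self.count = 0
        self.label = 'fresh'

    def get_state(self):
        # Every tracked attribute is treated as internal; 'external' stays empty.
        internal = {name: getattr(self, name) for name in self.stateful_attributes}
        return {'internal': internal, 'external': {}}

    def set_state(self, state):
        # The 'external' part is ignored, mirroring the description above.
        for name, value in state['internal'].items():
            setattr(self, name, value)


source = TinyPipe()
source.count = 7
snapshot = source.get_state()

restored = TinyPipe()
restored.set_state(snapshot)
print(restored.get_state())
```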
-
enable_inference_all
(*args, **kwargs)¶
-
disable_inference_all
(*args, **kwargs)¶
-
enable_updates_all
(*args, **kwargs)¶
-
disable_updates_all
(*args, **kwargs)¶
-
-
Fireworks.core.model.
freeze_module
(module, parameters=None, submodules=None)[source]¶ Recursively freezes the parameters in a PyTorch module.
-
Fireworks.core.model.
unfreeze_module
(module, parameters=None, submodules=None)[source]¶ Recursively unfreezes the parameters in a PyTorch module.
-
Fireworks.core.model.
model_from_module
(module_class)[source]¶ Given the class definition for a pytorch module, returns a model that encapsulates that module.
-
Fireworks.core.model.
to_parameter
(component)[source]¶ Attempts to convert a component to a PyTorch Parameter if it is tensor-like. This is required for using that component during model training.
-
class
Fireworks.core.model.
PyTorch_Model
(components={}, *args, input=None, skip_module_init=False, **kwargs)[source]¶ Bases:
torch.nn.modules.module.Module
,Fireworks.core.model.Model
,Fireworks.core.junction.PyTorch_Junction
-
state_dict
()[source]¶ Returns a dictionary containing a whole state of the module.
Both parameters and persistent buffers (e.g. running averages) are included. Keys are corresponding parameter and buffer names.
Returns: a dictionary containing a whole state of the module Return type: dict Example:
>>> module.state_dict().keys() ['bias', 'weight']
-
all_parameters
()[source]¶ Returns a list of every PyTorch parameter that this Model depends on that is unfrozen. This is useful for providing a parameters list to an optimizer.
-
set_state
(state, reset=True)[source]¶ Sets the state of the pipe based on the provided state argument.
Parameters: state (-) – A dict of the form {‘internal’: {…}, ‘external’: {…}}. The ‘external’ dict will be ignored, because we consider all attributes of a Pipe to be internal (for simplicity). See Component_Map documentation for details.
-
State¶
-
class
Fireworks.core.component_map.
Component_Map
(components)[source]¶ Bases:
dict
Each of the main objects that can be used to construct pipelines (Pipes, Junctions, and Models) has a means for tracking its internal state. For Pipes, this is handled by a simple dictionary, but for Junctions and Models we use this class, which satisfies the more complex needs of these objects. In particular, a Component_Map can track whether a variable is internal or external to a given object, and if it’s external, whom the variable belongs to. This lets us dynamically assign a variable of one object to a component of a Junction or Model while maintaining the distinction that the assigned variable is not internal to that Junction or Model. This distinction can be useful for variables such as hyperparameters or runtime configurations (eg. whether to use Cuda) that one does not want to store alongside variables like model weights. You can also have a Model ‘borrow’ variables from another Model while maintaining this distinction (eg. use the first two layers from this other model, then use the remainder of the layers using internal weights), and this can be useful when training Models (you could have your optimizer operate only on a Model’s internal parameters, treating everything else as constant). These are just a few examples of how this abstraction can be useful. In simpler terms, it is essentially a means to deliberately pass variables by reference, which is not how Python’s memory model operates by default, but it can be extremely helpful when doing machine learning. The details of the interaction with a Component_Map are abstracted away by Junctions and Models. Hence, you shouldn’t have to directly interact with a Component_Map. Instead, you can generally just call set_state and get_state on Junctions and Models to get serialized representations of the Component_Maps.
The format of this serialization is a dict of the form {‘external’: {…}, ‘internal’: {…}}. The ‘internal’ dict contains a mapping between variable names and those variables. The ‘external’ dict contains a mapping between variable names and the object that those variables belong to. In this way, a Component_Map can keep track of the owner of the linked variable and also get its value as needed. Hence, Junctions and Models can simply use that variable as if it were internal, and this makes it easy to swap variables around without changing syntax (eg. replace some internal component of a Model with an attribute of some other object on the fly.)
A Component_Map behaves like a dict with the special property that if you assign a tuple of the form (obj, x) to the dict, where x is a string, then the Component_Map will treat that as a ‘pass by reference’ assignment. In other words, it will assume that you want to externally link the variable obj.x to the Component_Map. For example, if you do this:
A = some_object()
cm = Component_Map({}) # The constructor takes a components dict; start from an empty one.
cm['a'] = (A, 'x')
Now whenever you call cm[‘a’], you will get whatever is returned by A.x.
cm['a'] == A.x # This evaluates to True.
cm['a'] is A.x # This also evaluates to True, because the assignment is by reference.
If you call cm.get_state(), the ‘external’ dict will contain a reference to A.
state = cm.get_state()
external = state['external']
external['a'] == (A, 'x') # This evaluates to True.
On the other hand, if you do this:
cm['a'] = A.x # Don't pass by reference.
cm['a'] == A.x # This evaluates to True.
cm['a'] is A.x # This may or may not be True because Python sometimes assigns by reference and sometimes copies data depending on the situation.
This will be treated as an internal assignment. Note that PyTorch implements logic for enforcing pass-by-reference for torch.nn.Parameter objects. Hence, if A.x was a Parameter, then the assignment will be by reference. However, we will have no way of knowing who the ‘owner’ of the Parameter is, and by using Component_Maps, we also are able to extend this functionality to any Python object. If you now get the state, it will be in the ‘internal’ dict.
state = cm.get_state()
internal = state['internal']
internal['a'] == A.x # This evaluates to True. If A.x is vector/tensor-valued, you may get a vector/tensor of 1's.
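The ‘pass by reference’ behaviour can be sketched with a small dict subclass. This is an illustration of the idea only, not the Fireworks implementation; `TinyComponentMap`, its `_external` attribute, and the `Owner` class are hypothetical names.

```python
class TinyComponentMap(dict):
    """Hypothetical sketch: (obj, 'attr') values are stored as external links."""

    def __init__(self, components=None):
        super().__init__()
        self._external = {}  # key -> (owner, attribute name)
        for key, value in (components or {}).items():
            self[key] = value

    def __setitem__(self, key, value):
        is_link = isinstance(value, tuple) and len(value) == 2 and isinstance(value[1], str)
        if is_link:
            self._external[key] = value
            super().__setitem__(key, None)  # placeholder; the value is looked up on access
        else:
            self._external.pop(key, None)
            super().__setitem__(key, value)

    def __getitem__(self, key):
        if key in self._external:
            owner, attr = self._external[key]
            return getattr(owner, attr)  # always reflects the owner's current value
        return super().__getitem__(key)

    def get_state(self):
        internal = {k: dict.__getitem__(self, k) for k in self if k not in self._external}
        return {'external': dict(self._external), 'internal': internal}


class Owner:
    x = 1


A = Owner()
cm = TinyComponentMap()
cm['a'] = (A, 'x')  # external link to A.x
cm['b'] = 3         # ordinary internal value
A.x = 5             # the change is visible through the map
print(cm['a'])      # 5
state = cm.get_state()
```

The sketch treats any 2-tuple whose second element is a string as a link, so a genuine tuple value of that shape could not be stored internally; a real implementation would need to decide how to handle that case.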
-
setitem_hook
(key, value)[source]¶ This can be overridden by a subclass in order to implement specific actions that should take place before an attribute is set.
-
set_state
(state)[source]¶ This method can be used to apply a serialized representation of state to a Component_Map at once. This is used for loading in saved data.
Parameters: state – A dict of the form {‘external’: {…}, ‘internal’: {…}}. The elements of this dict will be assigned to the Component_Map. Note that this will not reset the Component_Map, so if there were previous elements already present, those will remain in the Component_Map.
-
-
class
Fireworks.core.component_map.
PyTorch_Component_Map
(components, model=None)[source]¶ Bases:
Fireworks.core.component_map.Component_Map
This is a subclass of Component_Map with additional functionality for dealing with PyTorch data structures. PyTorch has a lot of logic in the background to keep track of Parameters and gradients and where objects are located in memory. The PyTorch_Component_Map has a modified __setitem__ method which ensures that there are no conflicts with any of these background operations by PyTorch. In particular, a PyTorch_Component_Map can have a (PyTorch) Model assigned to it, and whenever __setitem__ is called, the item is:
1) Converted to a torch.nn.Parameter object if possible. This is essential for computing gradients and training the parameter.
2) Recursively assigned if necessary. This concept is best explained with an example. Say you have a neural network with a convolutional layer:
model = some_pytorch_model()
model.conv1 = torch.nn.Conv2d(4,4,4) # This represents a 4x4 convolutional layer with 4 channels.
‘model.conv1’ is itself a PyTorch Module with its own internal state, and in general, models can have models that have models, and so on. In other words, ‘model.conv1’ could itself have variables that are Modules/Models and so on. When you get the state dict for the original model, you will get nested dictionaries. These can still be serialized and saved to a file like normal, but when we call set_state, we want to make sure that we assign these nested dictionary elements to the correct submodules.
state = model.get_state()
internal = state['internal']
internal['conv1'] == {'weight': ['This is some Tensor'], 'bias': ['This is some vector']}
If we naively called model.set_state(state) to load some other state from a file, then we would end up assigning a nested dictionary to the value of model.conv1. What we actually want is:
model.set_state(state)
print(model.conv1) # This is a PyTorch Module
print(model.conv1.weight) # This is a Tensor
print(model.conv1.bias) # This is a Tensor
PyTorch_Component_Map checks if the attribute being assigned to is a PyTorch_Model or (PyTorch) Module and performs this type of assignment.
3) ‘Registered’ to the Model. This is something that PyTorch does whenever you assign a value to a PyTorch Module and is essential for proper functioning of PyTorch methods/functions, such as getting a state_dict, submodules, etc.
This additional logic is important, because in general, all of the layers of a Neural Network are implemented as Modules, and PyTorch Modules inherently have a nested structure.
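The recursive assignment idea can be shown without PyTorch. In the sketch below, `Node` is a hypothetical stand-in for a module that may own sub-modules, and `assign_recursively` is an illustrative helper, not a Fireworks function: nested dictionaries are pushed down into existing sub-objects instead of overwriting them.

```python
class Node:
    """Hypothetical stand-in for a module that may contain sub-modules."""

    def __init__(self, **attrs):
        self.__dict__.update(attrs)


def assign_recursively(target, state):
    for key, value in state.items():
        current = getattr(target, key, None)
        if isinstance(value, dict) and isinstance(current, Node):
            # Descend into the sub-module rather than replacing it with a dict.
            assign_recursively(current, value)
        else:
            setattr(target, key, value)


model = Node(conv1=Node(weight=[0.0], bias=[0.0]), lr=0.1)
saved = {'conv1': {'weight': [1.0, 2.0], 'bias': [0.5]}, 'lr': 0.01}
assign_recursively(model, saved)

print(type(model.conv1).__name__)  # Node
print(model.conv1.weight)          # [1.0, 2.0]
```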
-
class
Fireworks.core.scaffold.
Scaffold
(to_attach=None)[source]¶ Bases:
object
A Scaffold can keep track of the internal state of objects in a pipeline. This can be used to save and load the entire state of a pipeline, allowing one to pause and resume a project, take snapshots, and log the internal state of components as an experiment proceeds. The current implementation of Scaffold is very simple; you attach objects to the Scaffold while providing a name that serves as an identifier for that object. You can then call the serialize method to get a dictionary of the current states of all attached objects, the save method to save those serialized states to a given folder, or load to update the states of attached objects using data in a provided directory.
scaffold = Scaffold()
# Attach components as desired
scaffold.attach('a', A)
scaffold.attach('b', B)
. . .
# This will save the current state of attached objects in folder 'save_directory'. The filenames will be based on the identifiers
# eg. 'a.json', 'b.json', etc.
scaffold.save(path='save_directory', method='json')
# This will read files in folder 'save_directory2' and set the state of attached objects based on identifiers.
# eg. the components attached with identifier 'a' will load from 'a.json' and so on. Note that the provided method must be
# consistent with the filetypes in the save directory.
scaffold.load(path='save_directory2', method='json')
# This will produce a dictionary of identifiers to state object serializations (see Component_Map documentation for details).
# You could use this dictionary to log state information however you want. For example, you could log the current weights of
# neural network layers in your model for later plotting.
state = scaffold.serialize()
-
attach
(name, obj)[source]¶ Attaches an object to the Scaffold with a provided identifier. The Scaffold can then track the object’s internal state, enabling one to access, save, and load the serialized states of all tracked objects at once.
Parameters: - name – The identifier for the object.
- obj – The object to attach. Note that each object must implement a get_state method which returns a dictionary of the form {‘external’: {…}, ‘internal’: {…}}. Pipes, Junctions, and Models satisfy this criterion.
-
serialize
()[source]¶ Returns a dictionary containing serialized representations of all objects tracked by the scaffold. See Component_Map documentation for more information on these serializations.
Parameters: None
Returns: A dict of the form {key: state}, where state is a dict of the form {‘external’: {…}, ‘internal’: {…}} corresponding to the internal and external state of objects tracked by the Scaffold. See Component_Map documentation for more information on state.
Return type: state
-
save
(path, method='json', **kwargs)[source]¶ Saves serialized representation of all objects linked to Scaffold using a desired method (json, csv, pickle, etc.)
Parameters: - path – The folder to save serializations to. This folder must exist and be writable by the program.
- method – The method for saving. This must be one of the methods supported by the Message.to(…) method (see Message documentation), as state dicts are converted to Messages and saved using Message.to(…).
-
load
(path, method='json', reset=False)[source]¶ Loads serialized representations of all objects linked to Scaffold using the given names in the given path.
Parameters: - path – The folder to load serializations from. This folder must exist and be readable by the program.
- method – The method for loading. This must be one of the methods supported by the Message.load(…) method (see Message documentation), as that method is used to read the files. Note that load will only look for files corresponding to the provided method that also have the corresponding suffix (eg. json filenames must end with ‘.json’, pickle files with ‘.pickle’, etc.). So if you have files in the folder that were not saved with the given method, or have different filename suffixes, they will be ignored.
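The attach / serialize / save / load cycle can be sketched with the standard library. `TinyScaffold` and `Counter` are hypothetical classes written for this illustration; the sketch assumes JSON files named `{identifier}.json` and objects that expose get_state and set_state, as described above.

```python
import json
import os
import tempfile


class TinyScaffold:
    """Hypothetical sketch of a scaffold that tracks named objects."""

    def __init__(self):
        self.attached = {}

    def attach(self, name, obj):
        self.attached[name] = obj

    def serialize(self):
        return {name: obj.get_state() for name, obj in self.attached.items()}

    def save(self, path):
        for name, state in self.serialize().items():
            with open(os.path.join(path, name + '.json'), 'w') as f:
                json.dump(state, f)

    def load(self, path):
        for name, obj in self.attached.items():
            filename = os.path.join(path, name + '.json')
            if os.path.exists(filename):  # other names or suffixes are ignored
                with open(filename) as f:
                    obj.set_state(json.load(f))


class Counter:
    def __init__(self, count=0):
        self.count = count

    def get_state(self):
        return {'internal': {'count': self.count}, 'external': {}}

    def set_state(self, state):
        self.count = state['internal']['count']


scaffold = TinyScaffold()
a = Counter(3)
scaffold.attach('a', a)
with tempfile.TemporaryDirectory() as folder:
    scaffold.save(folder)
    saved_files = os.listdir(folder)
    a.count = 0           # simulate losing the state
    scaffold.load(folder)  # restore it from 'a.json'
print(a.count)  # 3
```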
-
Database¶
This module contains methods and classes for ingesting and reading data to/from a database. A user can specify a schema and stream messages from a source into a relational database. You can also create a pipe that streams data from a database based on a query. Because this module is built using SQLalchemy, it inherits all of the capabilities of that library, such as the ability to interface with many different relational databases and very precise control over schema and access. There are two classes: a TablePipe implements methods for writing a Message to a table, and a DBPipe is an iterable that produces Messages as it loops through a database query.
TablePipe
A TablePipe is initialized with an SQLalchemy table, an SQLalchemy engine, and an optional list of columns that the TablePipe will write to in the table. By specifying columns, you can choose to use only a subset of the columns in a table (for example, if there are auto-incrementing ID columns that don’t need to be explicitly written). In addition to methods for standard relational database actions such as rollback, commit, etc., the TablePipe has an insert method that takes a Message object, converts it into a format that can be written to the database and then performs the insert. It also has a query method that takes the same arguments that the query function in SQLalchemy takes (or does a SELECT * query by default) and returns a DBPipe object corresponding to that query.
DBPipe
This Pipe is built around an SQLalchemy query and iterates through the results of that query. It converts the outputs to Messages as it does so, enabling one to easily incorporate database queries into a pipeline.
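The insert-then-iterate pattern described above can be sketched with Python's built-in sqlite3 module. This deliberately swaps SQLalchemy for sqlite3 so the example is self-contained, and it uses a plain dict of columns in place of a Message; `insert_batch` and `iterate_query` are hypothetical helpers, not Fireworks functions.

```python
import sqlite3


def insert_batch(conn, table, batch):
    """Insert a dict of equal-length columns as rows of `table`."""
    columns = list(batch)
    rows = list(zip(*(batch[c] for c in columns)))
    placeholders = ', '.join('?' for _ in columns)
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    conn.executemany(sql, rows)
    conn.commit()


def iterate_query(conn, sql):
    """Yield each result row as a one-row dict of lists."""
    cursor = conn.execute(sql)
    names = [description[0] for description in cursor.description]
    for row in cursor:
        yield {name: [value] for name, value in zip(names, row)}


conn = sqlite3.connect(':memory:')
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, age INTEGER)"
)
# Only 'name' and 'age' are written; the auto-incrementing id is left to the database.
insert_batch(conn, 'users', {'name': ['ada', 'bob'], 'age': [36, 41]})
rows = list(iterate_query(conn, "SELECT name, age FROM users ORDER BY id"))
print(rows)  # [{'name': ['ada'], 'age': [36]}, {'name': ['bob'], 'age': [41]}]
```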
-
class
Fireworks.extensions.database.
TablePipe
(table, engine, columns=None, input=None, **kwargs)[source]¶ Bases:
Fireworks.core.pipe.Pipe
Represents an SQLalchemy Table while having the functionality of a Pipe.
-
init_db
()[source]¶ Initializes metadata for internal table. This ensures that the table exists in the database.
-
insert
(batch)[source]¶ Inserts the contents of the batch message into the database using the self.table object. NOTE: Only the dataframe components of a message will be inserted.
Parameters: batch (Message) – A message to be inserted. The columns and types must be consistent with the database schema.
-
query
(entities=None, *args, **kwargs)[source]¶ Queries the database and generates a DBPipe corresponding to the result.
Parameters: - entities – A list of column names
- args – Optional positional arguments for the SQLalchemy query function
- kwargs – Optional keyword arguments for the SQLalchemy query function
Returns: A DBPipe object that can iterate through the results of the query.
Return type: dbpipe (DBPipe)
-
upsert
(batch)[source]¶ Performs an upsert into the database. This is equivalent to performing an update + insert (ie. if value is not present, insert it, otherwise update the existing value.)
Parameters: batch (Message) – The message to upsert.
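The update-or-insert semantics can be sketched with the standard-library sqlite3 module. This is an illustration of the concept only; the `settings` table, its columns, and the `upsert` helper are hypothetical and do not reflect how the method above is implemented.

```python
import sqlite3


def upsert(conn, name, val):
    """Update the row for `name` if it exists, otherwise insert it."""
    cursor = conn.execute("UPDATE settings SET val = ? WHERE name = ?", (val, name))
    if cursor.rowcount == 0:  # nothing was updated, so the row is new
        conn.execute("INSERT INTO settings (name, val) VALUES (?, ?)", (name, val))
    conn.commit()


conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE settings (name TEXT PRIMARY KEY, val REAL)")
upsert(conn, 'lr', 0.1)        # inserted
upsert(conn, 'lr', 0.01)       # updated in place
upsert(conn, 'momentum', 0.9)  # inserted
rows = conn.execute("SELECT name, val FROM settings ORDER BY name").fetchall()
print(rows)  # [('lr', 0.01), ('momentum', 0.9)]
```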
-
make_row
(row)[source]¶ Converts a Message or dict mapping columns to values into a table object that can be inserted into an SQLalchemy database.
Parameters: row – row in Message or dict form to convert.
Returns: row converted to table form.
Return type: table
-
make_row_dict
(row)[source]¶ Converts a 1-row Message into a dict of atomic (non-listlike) elements. This can be used for the bulk_insert_mappings method of an SQLalchemy session, which skips table instantiation and takes dictionaries as arguments instead.
Parameters: row – row in Message or dict form to convert.
Returns: row converted to a dict of atomic elements.
Return type: dict
-
-
Fireworks.extensions.database.
create_table
(name, columns, primary_key=None, Base=None)[source]¶ Creates a table given a dict of column names to data types. This is an easy way to quickly create a schema for a data pipeline.
Parameters: - columns (dict) – Dict mapping column names to SQLalchemy types.
- primary_key – The column that should be the primary key of the table. If unspecified, a new auto-incrementing column called ‘id’ will be added as the primary key. SQLalchemy requires that all tables have a primary key, and this ensures that every row is always uniquely identifiable.
- Base – An optional argument that can be provided to specify the Base class that the new table class will inherit from. By default, this will be set to an instance of declarative_base from SQLalchemy.
Returns: A table class specifying the schema for the database table.
Return type: simpletable (sqlalchemy.ext.declarative.api.DeclarativeMeta)
-
class
Fireworks.extensions.database.
DBPipe
(table, engine, query=None, columns_and_types=None)[source]¶ Bases:
Fireworks.core.pipe.Pipe
Pipe that can iterate through the output of a database query.
-
Fireworks.extensions.database.
parse_columns
(object, ignore_id=True)[source]¶ Returns the names of columns in a table or query object
Parameters: - table (sqlalchemy.ext.declarative.api.DeclarativeMeta) –
- ignore_id (bool) – If True, ignore the ‘id’ column, which is a default primary key added by the create_table function.
Returns: A list of columns names in the sqlalchemy object.
Return type: columns (list)
-
Fireworks.extensions.database.
parse_columns_and_types
(object, ignore_id=True)[source]¶ Returns column names and types in a table or query object as a dict.
Parameters: - object – An SQLalchemy table or Query object
- ignore_id (bool) – If True, ignore the ‘id’ column, which is a default primary key added by the create_table function.
Returns: A dict mapping column names to their SQLalchemy type.
Return type: columns_and_types (dict)
-
Fireworks.extensions.database.
to_message
(row, columns_and_types=None)[source]¶ Converts a database query result produced by SQLalchemy into a Message
Parameters: - row – A row from the query.
- columns_and_types (dict) – If unspecified, this will be inferred. Otherwise, you can specify the columns to parse, for example, if you only want to extract some columns.
Returns: Message representation of input.
Return type: message
-
Fireworks.extensions.database.
cast
(value)[source]¶ Converts values to basic types (ie. np.int64 to int)
Parameters: value – The object to be cast.
Returns: The cast object.
-
Fireworks.extensions.database.
reflect_table
(table_name, engine)[source]¶ Gets the table with the given name from the sqlalchemy engine.
Parameters: - table_name (str) – Name of the table to extract.
- engine (sqlalchemy.engine.base.Engine) – Engine to extract from.
Returns: The extracted table, which can now be used to read from the database.
Return type: table (sqlalchemy.ext.declarative.api.DeclarativeMeta)
Experiment¶
The Experiment module offers a way to save data from individual runs of a model. This makes it convenient to compare results from different experiments and to replicate those experiments.
exp = Experiment('name', 'db_path', 'description')
will create a folder named db_path/name containing a sqlite file called name.sqlite. You can now save any objects to that folder using
with exp.open('filename') as f:
    f.save(...)
This will create a file handle f to the desired filename in the folder. You can also use exp.get_engine(‘name’) or exp.get_session(‘name’) to get an SQLalchemy engine/session object with the given name that you can then use to save/load data. Combined with the Fireworks.extensions.database module, you can save any data in Message format relatively easily.
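The folder layout described above can be sketched with the standard library. `TinyExperiment` is a hypothetical class, not the Fireworks Experiment; it assumes one folder per experiment containing `{name}.sqlite`, and it uses file mode 'x' to get the documented error when a file already exists. The `metadata` table is an assumption made only so that the database file is written.

```python
import os
import sqlite3
import tempfile


class TinyExperiment:
    """Hypothetical sketch: one folder per experiment plus a sqlite file."""

    def __init__(self, name, db_path='.'):
        self.folder = os.path.join(db_path, name)
        os.makedirs(self.folder, exist_ok=True)
        conn = sqlite3.connect(os.path.join(self.folder, name + '.sqlite'))
        conn.execute("CREATE TABLE IF NOT EXISTS metadata (name TEXT, val TEXT)")
        conn.commit()
        conn.close()

    def open(self, filename, mode='x'):
        # Mode 'x' raises FileExistsError if the file is already present.
        return open(os.path.join(self.folder, filename), mode)


with tempfile.TemporaryDirectory() as root:
    exp = TinyExperiment('run1', root)
    with exp.open('notes.txt') as f:
        f.write('learning rate 0.01')
    files = sorted(os.listdir(exp.folder))
    try:
        exp.open('notes.txt')
        second_open_failed = False
    except FileExistsError:
        second_open_failed = True

print(files)  # ['notes.txt', 'run1.sqlite']
```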
-
Fireworks.extensions.experiment.
load_experiment
(experiment_path)[source]¶ Returns an experiment object corresponding to the database in the given path.
Parameters: experiment_path (str) – Path to the experiment folder.
Returns: An Experiment object loaded using the files in the given folder path.
Return type: experiment (Experiment)
-
class
Fireworks.extensions.experiment.
Experiment
(experiment_name, db_path='.', description=None, load=False)[source]¶ Bases:
object
-
load_experiment
(path=None, experiment_name=None)[source]¶ Loads in parameters associated with this experiment from a directory.
Parameters:
-
init_metadata
()[source]¶ Initializes metadata table. This is a necessary action whenever using an SQLalchemy table for the first time and is idempotent, so calling this method multiple times does not produce side-effects.
-
get_engine
(name)[source]¶ Creates an engine corresponding to a database with the given name. In particular, this creates a file called {name}.sqlite in this experiment’s save directory, and makes an engine to connect to it.
Parameters: name – Name of engine to create. This will also be the name of the file that is created.
Returns: The new engine. You can also reach this engine now by calling self.engines[name].
Return type: engine
-
get_session
(name)[source]¶ Creates an SQLalchemy session corresponding to the engine with the given name that can be used to interact with the database.
Parameters: name – Name of engine corresponding to session. The engine will be created if one with that name does not already exist.
Returns: A session created from the chosen engine.
Return type: session
-
open
(filename, *args, string_only=False)[source]¶ Returns a handle to a file with the given filename inside this experiment’s directory. If string_only is true, then this instead returns a string with the path to create the file. If a file with ‘filename’ is already present in the directory, this will raise an error.
Returns: If string_only is True, the path to the file. Otherwise, the opened file handle. Note: You can use this method in a with statement to auto-close the file.
Return type: file
-
Factory¶
The Factory module contains a class with the same name that performs hyperparameter optimization by repeatedly spawning independent instances of a model, training and evaluating them, and recording their parameters. The design of this module is based off of a ‘blackboard architecture’ in software engineering, in which multiple independent processes can read and write from a shared pool of information, the blackboard. In this case, the shared pool of information is the hyperparameters and their corresponding evaluation metrics. The factory class is able to use that information to choose new hyperparameters (based on a user supplied search algorithm) and repeat this process until a trigger to stop is raised.
A factory class takes four arguments:
- Trainer - A function that takes a dictionary of hyperparameters, trains a model and returns the trained model.
- Metrics_dict - A dictionary of objects that compute metrics during model training or evaluation.
- Generator - A function that takes the computed metrics and parameters up to this point as arguments and generates a new set of hyperparameters to use for training. The generator represents the search strategy that you are using.
- Eval_dataloader - A dataloader (an iterable that produces minibatches as Message objects) that represents the evaluation dataset.
After being instantiated with these arguments, and once the run method is called, the factory will use its generator to generate hyperparameters, train models using those hyperparameters, and compute metrics by evaluating those models against the eval_dataloader. This will loop until something raises a StopHyperparameterOptimization flag.
Different subclasses of Factory have different means for storing metrics and parameters. The LocalMemoryFactory stores them in memory as the name implies. The SQLFactory stores them in a relational database table. Because of this, SQLFactory takes three additional initialization arguments:
- Params_table - An SQLalchemy table specifying the schema for storing parameters.
- Metrics_table - An SQLalchemy table specifying the schema for storing metrics.
- Engine - An SQLalchemy engine, representing the database connection.
Additionally, to reduce memory and network bandwidth usage, the SQLFactory caches information in local memory while regularly syncing with the database.
Currently, all of these steps take place on a single thread, but in the future we will be able to automatically parallelize and distribute them.
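The generate, train, evaluate loop described in this section can be sketched in plain Python. This is not the Factory implementation: `run_factory`, `trainer`, `mse`, and `grid_generator` are hypothetical, and modelling StopHyperparameterOptimization as an exception is an assumption. The `history` list plays the role of the shared blackboard.

```python
class StopHyperparameterOptimization(Exception):
    """Assumed to be an exception; raised by the generator to end the search."""


def run_factory(trainer, metrics, generator, eval_set):
    history = []  # shared record of (parameters, computed metrics)
    while True:
        try:
            params = generator(history)
        except StopHyperparameterOptimization:
            break
        model = trainer(params)
        scores = {name: metric(model, eval_set) for name, metric in metrics.items()}
        history.append((params, scores))
    return history


def trainer(params):
    slope = params['slope']
    return lambda x: slope * x  # the 'trained model' is just a function here


def mse(model, eval_set):
    return sum((model(x) - y) ** 2 for x, y in eval_set) / len(eval_set)


def grid_generator(history, grid=(1.0, 2.0, 3.0)):
    # A trivial search strategy: walk through a fixed grid, then stop.
    if len(history) >= len(grid):
        raise StopHyperparameterOptimization
    return {'slope': grid[len(history)]}


eval_set = [(1.0, 2.0), (2.0, 4.0)]
history = run_factory(trainer, {'mse': mse}, grid_generator, eval_set)
best_params, best_scores = min(history, key=lambda item: item[1]['mse'])
print(best_params)  # {'slope': 2.0}
```

A smarter generator would inspect the scores in `history` before proposing the next parameters; the grid walk above ignores them for brevity.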
-
Fireworks.extensions.factory.
update
(bundle: dict, parameters: dict)[source]¶ Parameters: - bundle – A dictionary of key: (obj, attr). obj is the object referred to, and attr is a string with the name of the attribute to be assigned.
- parameters – A dictionary of key: value. Wherever keys match, obj.attr will be set to value.
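The matching rule described above can be sketched as follows. `update_sketch` and `Config` are hypothetical names used for illustration; the sketch assumes that keys present in only one of the two dictionaries are skipped.

```python
class Config:
    learning_rate = 0.1
    batch_size = 32


def update_sketch(bundle, parameters):
    """Set obj.attr = value wherever a key appears in both dictionaries."""
    for key, value in parameters.items():
        if key in bundle:
            obj, attr = bundle[key]
            setattr(obj, attr, value)


config = Config()
bundle = {'lr': (config, 'learning_rate'), 'bs': (config, 'batch_size')}
update_sketch(bundle, {'lr': 0.01, 'unused': 123})
print(config.learning_rate, config.batch_size)  # 0.01 32
```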
-
class
Fireworks.extensions.factory.
Factory
(*args, components=None, **kwargs)[source]¶ Bases:
Fireworks.core.junction.Junction
Base class for hyperparameter optimization in pytorch using queues.
-
required_components
= {'eval_set': <class 'object'>, 'metrics': <class 'dict'>, 'parameterizer': <class 'function'>, 'trainer': <class 'function'>}¶
-
-
class
Fireworks.extensions.factory.
LocalMemoryFactory
(*args, components=None, **kwargs)[source]¶ Bases:
Fireworks.extensions.factory.Factory
Factory that stores parameters in memory.
-
class
Fireworks.extensions.factory.
SQLFactory
(*args, components=None, **kwargs)[source]¶ Bases:
Fireworks.extensions.factory.Factory
Factory that stores parameters in SQLalchemy database while caching them locally.
-
required_components
= {'engine': <class 'object'>, 'eval_set': <class 'object'>, 'metrics': <class 'dict'>, 'metrics_tables': <class 'object'>, 'parameterizer': <class 'function'>, 'params_table': <class 'object'>, 'trainer': <class 'function'>}¶
-
Miscellaneous¶
-
Fireworks.toolbox.text.
character_tokenizer
(sequence)[source]¶ Splits sequence into a list of characters.
-
Fireworks.toolbox.text.
pad_sequence
(sequence, max_length, embeddings_dict, pad_token='EOS')[source]¶ Adds EOS tokens until sequence length is max_length.
-
Fireworks.toolbox.text.
pad
(batch, embeddings_dict, pad_token='EOS')[source]¶ Pads all embeddings in a batch to be the same length.
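The padding behaviour can be sketched on plain token lists. The functions below are hypothetical simplifications: they work on tokens rather than embeddings, so they omit the embeddings_dict argument of the functions documented above.

```python
def pad_tokens(tokens, max_length, pad_token='EOS'):
    """Append pad_token until the sequence has max_length elements."""
    return list(tokens) + [pad_token] * max(0, max_length - len(tokens))


def pad_batch(batch, pad_token='EOS'):
    """Pad every sequence in the batch to the length of the longest one."""
    longest = max(len(sequence) for sequence in batch)
    return [pad_tokens(sequence, longest, pad_token) for sequence in batch]


padded = pad_batch([['h', 'i'], ['h', 'e', 'y']])
print(padded)  # [['h', 'i', 'EOS'], ['h', 'e', 'y']]
```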
-
Fireworks.toolbox.text.
apply_embeddings
(sequence, embeddings_dict, tokenizer)[source]¶ Decomposes sequence into tokens using tokenizer and then converts tokens to embeddings using embeddings_dict.
-
Fireworks.toolbox.text.
create_pretrained_embeddings
(embeddings_file)[source]¶ Loads embeddings vectors from file into a dict.
-
Fireworks.toolbox.text.
load_embeddings
(name='glove840b')[source]¶ Loads serialized embeddings from pickle.
-
Fireworks.toolbox.text.
make_vocabulary
(text, tokenizer=None, cutoff_rule=None)[source]¶ Converts an iterable of phrases into the set of unique tokens that are in the vocabulary.
-
Fireworks.toolbox.text.
make_indices
(vocabulary)[source]¶ Constructs a dictionary of token names to indices from a vocabulary. Each index value corresponds to a one-hot vector.
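Building a vocabulary and an index table can be sketched as below. These are hypothetical simplifications, not the library functions: the cutoff_rule argument is omitted, the default tokenizer splits into characters, and the indices are assigned in sorted order so that the result is deterministic.

```python
def vocabulary_sketch(text, tokenizer=list):
    """Collect the set of unique tokens across an iterable of phrases."""
    vocabulary = set()
    for phrase in text:
        vocabulary.update(tokenizer(phrase))
    return vocabulary


def indices_sketch(vocabulary):
    """Map each token to an integer index (sorted order is an assumption)."""
    return {token: index for index, token in enumerate(sorted(vocabulary))}


vocabulary = vocabulary_sketch(['ab', 'bc'])
indices = indices_sketch(vocabulary)
print(indices)  # {'a': 0, 'b': 1, 'c': 2}
```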
-
Fireworks.toolbox.text.
too_big
(dataset, start, end, dim=300, cutoff=620000)[source]¶ Calculates if a batch consisting of dataset[start:end] is too big based on cutoff. This can be used for constructing dynamic batches.
-
Fireworks.utils.utils.
one_hot
[source]¶ Converts an integer to a one-hot vector, a format often used by statistical classifiers to represent predictions.
Parameters: - index – The index in the one-hot array to set to 1.
- max – The size of the one-hot array.
Returns: A one-hot array, which consists of all 0s except for a 1 at the index corresponding to the label classification.
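The one-hot encoding described above can be sketched in plain Python. This is an illustration of the format only. The function name, the `size` parameter name, and the use of a list rather than an array or tensor are assumptions.

```python
def one_hot_sketch(index, size):
    """Return a list of `size` zeros with a single 1 at position `index`."""
    if not 0 <= index < size:
        raise IndexError(f"index {index} is out of range for size {size}")
    vector = [0] * size
    vector[index] = 1
    return vector


print(one_hot_sketch(2, 5))  # [0, 0, 1, 0, 0]
```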
-
Fireworks.utils.utils.
index_to_list
(index)[source]¶ Converts an index to a list. This is used by some of the methods in message.py.
-
Fireworks.utils.utils.
slice_to_list
(s)[source]¶ Converts a slice object to a list of indices. This is used by some of the methods in message.py.
-
Fireworks.utils.utils.
get_indices
(values, listlike)[source]¶ Returns the indices in listlike that match elements in values. This is used by some of the methods in message.py.
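One possible reading of this lookup, sketched standalone: return every position in `listlike` whose element appears in `values`, in order of position. The function name and this ordering are assumptions; the library may order or deduplicate results differently.

```python
def get_indices_sketch(values, listlike):
    """Return the positions in listlike whose elements appear in values."""
    wanted = set(values)
    return [position for position, item in enumerate(listlike) if item in wanted]


print(get_indices_sketch(["a", "c"], ["a", "b", "c", "a"]))  # [0, 2, 3]
```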
-
Fireworks.utils.utils.
slice_length
(orange)[source]¶ Returns the length of the index corresponding to a slice. For example, slice(0,4,2) has a length of two. This is used by some of the methods in message.py.
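The slice helpers can be illustrated with `range`, which follows the same start/stop/step arithmetic. This is a sketch under stated assumptions: the `_sketch` names are hypothetical, and only slices with an explicit non-negative stop and a positive step are handled.

```python
def slice_to_list_sketch(s):
    """Expand a slice with an explicit stop into the list of indices it covers."""
    if s.stop is None:
        raise ValueError("this sketch needs a slice with an explicit stop")
    start = 0 if s.start is None else s.start
    step = 1 if s.step is None else s.step
    return list(range(start, s.stop, step))


def slice_length_sketch(s):
    """Count the indices a slice covers."""
    return len(slice_to_list_sketch(s))


print(slice_to_list_sketch(slice(0, 4, 2)))  # [0, 2]
print(slice_length_sketch(slice(0, 4, 2)))   # 2
```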
-
Fireworks.utils.utils.
subset_dict
(dictionary, keys)[source]¶ Returns a dict that contains all key,value pairs in dictionary where the key is one of the provided keys. This is used by some of the methods in message.py.
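The described behavior corresponds to a dict comprehension. A minimal sketch follows; the function name is hypothetical, and silently skipping requested keys that are absent from the dictionary is an assumption.

```python
def subset_dict_sketch(dictionary, keys):
    """Keep only the key/value pairs whose key is in keys."""
    wanted = set(keys)
    return {key: value for key, value in dictionary.items() if key in wanted}


print(subset_dict_sketch({"a": 1, "b": 2, "c": 3}, ["a", "c"]))  # {'a': 1, 'c': 3}
```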
-
Fireworks.toolbox.preprocessing.
train_test_split
(pipe, test=0.2)[source]¶ Splits the input pipe into a training pipe and a test pipe. The indices representing the input pipe are shuffled and assigned to the training and test sets randomly, based on the proportions specified.
Parameters: - pipe – A pipe which represents the data to be split up.
- test – The proportion of the set that should be returned as the test set. This should be between 0 and 1.
Returns: - train_pipe: A pipe that represents the training data. You can call __getitem__, __next__, etc. on this pipe and it will transparently provide elements from the shuffled training set.
- test_pipe: Analogous to train_pipe, this represents the test data, which is shuffled and disjoint from the training data.
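The index shuffling behind such a split can be sketched without any pipe machinery. The function below is a standalone illustration, not the library's code. Its name, the `seed` argument, and the rounding of the test-set size are assumptions. It returns index lists rather than pipes.

```python
import random


def split_indices_sketch(length, test=0.2, seed=None):
    """Shuffle range(length) and split it into disjoint train and test index lists."""
    if not 0 <= test <= 1:
        raise ValueError("test must be a proportion between 0 and 1")
    indices = list(range(length))
    random.Random(seed).shuffle(indices)
    test_size = int(round(length * test))
    return indices[test_size:], indices[:test_size]


train_indices, test_indices = split_indices_sketch(10, test=0.2, seed=0)
print(len(train_indices), len(test_indices))  # 8 2
```

A pipe wrapping each index list could then translate a request for element i into a lookup of the upstream element at the shuffled index.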
-
class
Fireworks.toolbox.preprocessing.
Normalizer
(*args, **kwargs)[source]¶ Bases:
Fireworks.core.model.PyTorch_Model
Normalizes data by mean and variance. Analogous to sklearn.preprocessing.Normalizer. This Model uses a one-pass method to estimate the sample variance, which is not guaranteed to be numerically stable.
The functionality is implemented using hooks. Every time data is accessed from upstream pipes, this Model updates its estimate of the population mean and variance using the update() method. If self._inference_enabled is set to True, then the data will also be normalized based on those estimates. Means and variances are calculated on a per-column basis. You can also disable or enable the updating of these estimates by calling self.disable_updates / self.enable_updates.
-
required_components
= ['mean', 'variance', 'count', 'rolling_sum', 'rolling_squares']¶
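A one-pass estimate of this kind can be built from a running count, a running sum, and a running sum of squares. The sketch below handles a single column in plain Python. The class name is hypothetical, the attribute names mirror the required components listed above, and the population-variance formula E[x^2] - E[x]^2 is an assumption about how those components are combined.

```python
class RunningStatsSketch:
    """One-pass mean and variance for a single column of numbers."""

    def __init__(self):
        self.count = 0
        self.rolling_sum = 0.0
        self.rolling_squares = 0.0

    def update(self, batch):
        """Fold a new batch of values into the running totals."""
        for value in batch:
            self.count += 1
            self.rolling_sum += value
            self.rolling_squares += value * value

    @property
    def mean(self):
        return self.rolling_sum / self.count

    @property
    def variance(self):
        # Subtracting two large, similar numbers here is the source of the
        # numerical instability mentioned in the description.
        return self.rolling_squares / self.count - self.mean ** 2

    def normalize(self, batch):
        """Center and scale values using the current estimates."""
        std = self.variance ** 0.5
        return [(value - self.mean) / std for value in batch]


stats = RunningStatsSketch()
stats.update([1.0, 2.0])
stats.update([3.0, 4.0])
print(stats.mean, stats.variance)  # 2.5 1.25
```

The sketch does not guard against an empty history or a zero variance; both would raise ZeroDivisionError.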
-
-
Fireworks.utils.events.
visdom_loss_handler
(modules_dict, model_name)[source]¶ Attaches plots and metrics to trainer. This handler creates or connects to an environment on a running Visdom dashboard and creates a line plot that tracks the loss function of a training loop as a function of the number of iterations. This can be attached to an Ignite Engine, and the training closure must have ‘loss’ as one of the keys in its return dict for this plot to be made. See documentation for Ignite (https://github.com/pytorch/ignite) and Visdom (https://github.com/facebookresearch/visdom) for more information.
-
exception
Fireworks.utils.exceptions.
EndHyperparameterOptimization
[source]¶ Bases:
RuntimeError
This exception can be raised to signal a factory to stop looping.
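The stop-signal pattern can be sketched as follows. Everything here is a hypothetical stand-in: the exception class is a local substitute, and the loop and parameterizer are simplified illustrations rather than the factory's actual control flow.

```python
class EndHyperparameterOptimizationSketch(RuntimeError):
    """Local stand-in for the library's stop signal."""


def countdown_parameterizer(history):
    """Propose a learning rate each round; signal the end after three rounds."""
    if len(history) >= 3:
        raise EndHyperparameterOptimizationSketch
    return {"learning_rate": 0.1 / (len(history) + 1)}


def run_loop_sketch(parameterizer, max_rounds=100):
    """Ask for parameters repeatedly until the parameterizer raises the signal."""
    history = []
    for _ in range(max_rounds):
        try:
            params = parameterizer(history)
        except EndHyperparameterOptimizationSketch:
            break
        history.append(params)
    return history


history = run_loop_sketch(countdown_parameterizer)
print(len(history))  # 3
```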