Module for running Monte Carlo and other batch jobs on a Hadoop
instance. The module allows scripts (and supporting files) to be
submitted to a Hadoop MapReduce cluster for batch execution. By
default, the submitted script is run for the specified number of
iterations on the configured Hadoop instance. By supplying an
additional reducer script, data generated in the batch process can be
reduced, filtered, or otherwise processed before it is written to HDFS
and made available to the user.

WARNING: Piped UNIX commands tend to fail when used as mappers and
reducers. Write a Bash or Python script instead.
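A minimal end-to-end sketch of the intended workflow, using
submit_inline and print_hdfs_data (both documented below). The script
and path names are hypothetical, and the module is assumed to import
as hdmc:

    from hdmc import submit_inline, print_hdfs_data

    # Run a hypothetical Monte Carlo script 500 times across the
    # cluster, averaging the per-iteration output with a reducer.
    submit_inline("pi_estimate.py", "/home/user/pi_results",
                  iterations=500,
                  reduction_script="average.py")

    # Inspect the reduced output the job wrote to HDFS.
    print_hdfs_data("/home/user/pi_results")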
make_checkpointing_filter()
    Generates a Python script which filters checkpointing results from
    HDMC.
make_checkpointing_frame(script, checkpoint_names, checkpoint_dir,
                         arguments='', debug=False)
    Generates a Python script which, given a list of files to be
    processed, executes the specified script over the files in parallel
    via MapReduce.
make_frame(script, arguments='', iterations=1, debug=False)
    Generates a basic Python frame for running a batch job on a
    MapReduce cluster.
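make_frame's generated code is not shown in this listing; as a rough
illustration only, a streaming-style wrapper of the kind it plausibly
emits might look like the following (hypothetical script name and
structure; the real template may differ):

    import subprocess
    import sys

    SCRIPT = "./simulate.py"  # hypothetical user-submitted script
    ARGS = []                 # parsed from the arguments string

    # Hadoop streaming feeds the mapper one input line per requested
    # iteration; each run's stdout is passed through to the reducer.
    for _ in sys.stdin:
        result = subprocess.run([SCRIPT] + ARGS,
                                capture_output=True, text=True)
        sys.stdout.write(result.stdout)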
get_output_hdfs_name(output_data_name)
    Given the full path to a file or directory, returns its HDFS
    equivalent.
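The exact local-to-HDFS mapping is configuration-dependent; one
plausible reading of "HDFS equivalent" is simply the basename, since
streaming output lands in the job's HDFS working directory. A loudly
hypothetical sketch:

    def get_output_hdfs_name(output_data_name):
        # Hypothetical: keep only the final path component, e.g.
        # "/home/user/pi_results" -> "pi_results".
        return output_data_name.rstrip('/').split('/')[-1]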
build_hadoop_call(script, output_data_name, iterations=1,
                  supporting_file_list=None, reduction_script=None,
                  arguments=None, debug=False)
    Builds a call array suitable for subprocess.Popen which submits a
    streaming job to the configured MapReduce instance.
build_checkpoint_call(script, output_data_name, supporting_file_list,
                      reduction_script=None, arguments=None)
    Builds a call array suitable for subprocess.Popen which submits a
    streaming job to the configured MapReduce instance.
build_generic_hadoop_call(mapper, reducer, input, output,
                          supporting_file_list=None, num_mappers=None,
                          num_reducers=None, key_comparator=None)
    Builds a call array suitable for subprocess.Popen which submits a
    streaming job to the configured MapReduce instance.
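For orientation, a streaming call array generally takes the shape
below. The jar location varies by Hadoop installation, so treat this
as a sketch of the array's structure rather than the function's exact
output; num_mappers, num_reducers, and key_comparator would translate
into additional generic options:

    # Hypothetical result of
    # build_generic_hadoop_call("mapper.py", "reducer.py",
    #                           "input_dir", "output_dir",
    #                           supporting_file_list=["helper.py"])
    call = [
        "hadoop", "jar",
        "/usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar",
        "-input", "input_dir",
        "-output", "output_dir",
        "-mapper", "mapper.py",
        "-reducer", "reducer.py",
        "-file", "mapper.py",
        "-file", "reducer.py",
        "-file", "helper.py",  # supporting files are shipped to the nodes
    ]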
execute(hadoop_call)
    Non-blocking execution of the given call array.
execute_and_wait(hadoop_call)
    Blocking execution of the given call array.
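In standard-library terms this pair plausibly reduces to
subprocess.Popen with and without a wait; a minimal sketch, assuming
no extra logging or error handling:

    import subprocess

    def execute(hadoop_call):
        # Fire and forget: the job runs while the caller continues.
        return subprocess.Popen(hadoop_call)

    def execute_and_wait(hadoop_call):
        # Block until the streaming job finishes; return its exit code.
        process = subprocess.Popen(hadoop_call)
        return process.wait()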
download_hdfs_data(output_data_name)
    Given a full path, downloads an output directory from HDFS to the
    specified location.
print_hdfs_data(output_data_name)
    Given a full path, prints the output of all parts of an HDFS
    directory.
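Both helpers map naturally onto the hadoop fs command line; a sketch
under that assumption (the real functions may add error handling or a
different HDFS-path mapping):

    import subprocess

    def download_hdfs_data(output_data_name):
        # Copy the HDFS output directory to the given local path.
        subprocess.call(["hadoop", "fs", "-get",
                         get_output_hdfs_name(output_data_name),
                         output_data_name])

    def print_hdfs_data(output_data_name):
        # Concatenate every part-* file the reducers wrote.
        subprocess.call(["hadoop", "fs", "-cat",
                         get_output_hdfs_name(output_data_name) + "/part-*"])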
get_checkpoint_names(file_list)
    Given a list of file or command names, produces checkpoint names by
    taking the last member of the array generated by splitting on '/'.
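The description pins the behavior down almost completely; the
equivalent one-liner:

    def get_checkpoint_names(file_list):
        # "/data/runs/run_01.dat" -> "run_01.dat"
        return [name.split('/')[-1] for name in file_list]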
submit(script, output_data_name, iterations=1, supporting_file_list=None,
       reduction_script=None, arguments='', debug=False)
    Submits the script as a non-blocking job to a MapReduce cluster and
    collects output in output_data_name.
submit_inline(script, output_data_name, iterations=1,
              supporting_file_list=None, reduction_script=None,
              arguments='', debug=False)
    Submits the script as a blocking job to a MapReduce cluster and
    collects output in output_data_name.
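A hypothetical Monte Carlo submission (file names invented for
illustration); submit returns immediately, while submit_inline takes
the same arguments and waits for the job to finish:

    # Run simulate.py 1000 times across the cluster, shipping a
    # parameter file with it and averaging results with a reducer.
    submit("simulate.py", "/home/user/mc_results",
           iterations=1000,
           supporting_file_list=["params.json"],
           reduction_script="average_reducer.py")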
submit_checkpoint_inline(script, output_data_name, file_list,
                         reduction_script=None, arguments='', debug=False)
    Submits a script to a MapReduce cluster for parallel operation on a
    number of files, blocking until the job completes.
submit_checkpoint(script, output_data_name, file_list,
                  reduction_script=None, arguments='', debug=False)
    Submits a script to a MapReduce cluster for parallel operation on a
    number of files without blocking.
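A hypothetical checkpointed run, with one mapper invocation per input
file (paths invented for illustration):

    import glob

    # Process every run file in parallel, one checkpoint per file,
    # merging per-file results with a reducer before they reach HDFS.
    run_files = glob.glob("/data/runs/*.dat")
    submit_checkpoint_inline("process_run.py", "/home/user/run_results",
                             run_files,
                             reduction_script="merge_reducer.py")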