
Module hdmc


Module for running Monte Carlo and other batch jobs on a Hadoop instance. The module allows the submission of scripts (and supporting files) to a Hadoop MapReduce cluster for batch execution. By default, the submitted script is run for the specified number of iterations on the configured Hadoop instance. By supplying an additional reducer script, data generated in the batch process can be reduced, filtered, or otherwise processed before it is written to HDFS and made available to the user.

WARNING: Piped UNIX commands tend to fail when used as mappers and reducers. Instead, write a bash or Python script.
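For example, rather than passing a pipeline such as `cat | wc -l` as a mapper, the same line count can be written as a small standalone Python script. This is a hypothetical illustration (the script name and sample data are not part of hdmc); Hadoop streaming feeds input records to the mapper on stdin and reads tab-separated key/value pairs from stdout:

```python
# line_count_mapper.py -- hypothetical standalone mapper script,
# equivalent to the piped command `cat | wc -l`.
import sys

def count_lines(stream):
    """Count the records arriving on the mapper's input stream."""
    return sum(1 for _ in stream)

# In a real mapper the stream would be sys.stdin; here we
# demonstrate on a small in-memory sample.
sample = ["record one\n", "record two\n", "record three\n"]
print("lines\t%d" % count_lines(sample))  # -> lines	3
```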

Created on Jul 28, 2010


Author: dwmclary

Functions
 
make_checkpointing_filter()

make_checkpointing_frame(script, checkpoint_names, checkpoint_dir, arguments='', debug=False)
Generates a Python script which, given a list of files to be processed, executes the specified script over those files in parallel via MapReduce.

make_frame(script, arguments='', iterations=1, debug=False)
Generates a basic Python frame for running a batch job on a MapReduce cluster.

get_output_hdfs_name(output_data_name)
Given the full path to a file or directory, returns its HDFS equivalent.

build_hadoop_call(script, output_data_name, iterations=1, supporting_file_list=None, reduction_script=None, arguments=None, debug=False)
Builds a call array suitable for subprocess.Popen which submits a streaming job to the configured MapReduce instance.

build_checkpoint_call(script, output_data_name, supporting_file_list, reduction_script=None, arguments=None)
Builds a call array suitable for subprocess.Popen which submits a streaming job to the configured MapReduce instance.

build_generic_hadoop_call(mapper, reducer, input, output, supporting_file_list=None, num_mappers=None, num_reducers=None, key_comparator=None)
Builds a call array suitable for subprocess.Popen which submits a streaming job to the configured MapReduce instance.

execute(hadoop_call)
Non-blocking execution of the given call array.

execute_and_wait(hadoop_call)
Blocking execution of the given call array.

create_dummy_data()
Creates a piece of dummy map input data in HDFS.

load_data_to_hfds(input_data_file)
Loads a data file to HDFS.

download_hdfs_data(output_data_name)
Given a full path, downloads an output directory from HDFS to the specified location.

print_hdfs_data(output_data_name)
Given a full path, prints the output of all parts of an HDFS directory.

set_checkpoint_directory(output_data_name)
Creates a checkpoint directory for parallel file processing.

get_checkpoint_names(file_list)

submit(script, output_data_name, iterations=1, supporting_file_list=None, reduction_script=None, arguments='', debug=False)
Submits script as a non-blocking job to a MapReduce cluster and collects output in output_data_name.

submit_inline(script, output_data_name, iterations=1, supporting_file_list=None, reduction_script=None, arguments='', debug=False)
Submits script as a blocking job to a MapReduce cluster and collects output in output_data_name.

submit_checkpoint_inline(script, output_data_name, file_list, reduction_script=None, arguments='', debug=False)

submit_checkpoint(script, output_data_name, file_list, reduction_script=None, arguments='', debug=False)
Variables
  __package__ = 'ziggy.hdmc'
Function Details

build_hadoop_call(script, output_data_name, iterations=1, supporting_file_list=None, reduction_script=None, arguments=None, debug=False)


Builds a call array suitable for subprocess.Popen which submits a streaming job to the configured MapReduce instance. The function also generates the necessary execution frame.
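The exact array depends on the configured Hadoop instance, but a streaming-job call array for subprocess.Popen generally takes a shape like the following sketch. The jar path, function name, and file names here are illustrative assumptions, not hdmc's actual internals:

```python
# Illustrative only -- the real jar path comes from the local
# Hadoop configuration.
streaming_jar = "/usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar"

def sketch_hadoop_call(mapper, reducer, hdfs_input, hdfs_output):
    """Assemble a Hadoop streaming call array for subprocess.Popen."""
    call = ["hadoop", "jar", streaming_jar,
            "-input", hdfs_input,
            "-output", hdfs_output,
            "-mapper", mapper]
    if reducer is not None:
        # Reduction is optional; without it the mapper output is
        # written to HDFS as-is.
        call += ["-reducer", reducer]
    return call

call = sketch_hadoop_call("frame.py", None, "dummy_input", "mc_output")
# The array could then be launched with subprocess.Popen(call).
print(call[:3])
```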

build_checkpoint_call(script, output_data_name, supporting_file_list, reduction_script=None, arguments=None)


Builds a call array suitable for subprocess.Popen which submits a streaming job to the configured MapReduce instance. The function also generates the necessary execution frame.

create_dummy_data()


Creates a piece of dummy map input data in HDFS. This is necessary because Hadoop streaming requires input for mapping tasks.

load_data_to_hfds(input_data_file)


Loads a data file to HDFS. For future use.

set_checkpoint_directory(output_data_name)


Creates a checkpoint directory for parallel file processing. This directory is always named hdmc_checkpoints and exists at the same level as the trailing entry in output_data_name.
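Based on that description, the checkpoint path can be derived from output_data_name roughly as follows. This is a sketch of the naming rule, not the module's actual implementation, and the example path is hypothetical:

```python
import os.path

def checkpoint_dir_for(output_data_name):
    """Return the hdmc_checkpoints directory that sits at the same
    level as the trailing entry of output_data_name."""
    return os.path.join(os.path.dirname(output_data_name),
                        "hdmc_checkpoints")

# Hypothetical output path, for illustration only.
print(checkpoint_dir_for("/user/hadoop/results"))
# -> /user/hadoop/hdmc_checkpoints
```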

submit(script, output_data_name, iterations=1, supporting_file_list=None, reduction_script=None, arguments='', debug=False)


Submits script as a non-blocking job to a MapReduce cluster and collects output in output_data_name. Supporting filenames can be passed as a list, as can a reducing/filtering script. Arguments to the submitted script should be passed as a string.

submit_inline(script, output_data_name, iterations=1, supporting_file_list=None, reduction_script=None, arguments='', debug=False)


Submits script as a blocking job to a MapReduce cluster and collects output in output_data_name. Supporting filenames can be passed as a list, as can a reducing/filtering script. Arguments to the submitted script should be passed as a string.
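The blocking/non-blocking distinction between submit_inline and submit mirrors execute_and_wait versus execute. In subprocess terms the difference is roughly the following (the call array here is a trivial stand-in, not a real Hadoop invocation):

```python
import subprocess
import sys

# A trivial stand-in for a Hadoop streaming call array.
call = [sys.executable, "-c", "print('job done')"]

# Non-blocking (as in execute/submit): start the job and return
# immediately; the Popen handle can be polled or waited on later.
proc = subprocess.Popen(call, stdout=subprocess.PIPE)

# Blocking (as in execute_and_wait/submit_inline): wait for the
# job to finish and collect its output before continuing.
out, _ = proc.communicate()
print(out.decode().strip())  # -> job done
```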