Pydoop applications are run as Hadoop Pipes applications. To start, you will need a working Hadoop cluster. If you don’t have one available, you can bring up a single-node Hadoop cluster on your machine – see the Hadoop web site for instructions.
Assuming the hadoop executable is in your path, a typical pipes command line looks like this:
hadoop pipes -conf conf.xml -input input -output output
where input (a file or directory) and output (a directory) are HDFS paths. The configuration file, read from the local file system, is an XML document consisting of a simple (name, value) property list, as explained below.
Here’s an example of a configuration file:
<?xml version="1.0"?>
<configuration>

  <property>
    <name>hadoop.pipes.executable</name>
    <value>app_launcher</value>
  </property>

  <property>
    <name>hadoop.pipes.java.recordreader</name>
    <value>true</value>
  </property>

  <property>
    <name>hadoop.pipes.java.recordwriter</name>
    <value>true</value>
  </property>

  <property>
    <name>mapred.job.name</name>
    <value>app_name</value>
  </property>

  [...]

</configuration>
The meaning of these properties is as follows:

- hadoop.pipes.executable: the HDFS path (absolute, or relative to your HDFS home directory) of the application launcher, i.e., the executable that contains the program's entry point;
- hadoop.pipes.java.recordreader: must be set to true unless you are using your own customized RecordReader;
- hadoop.pipes.java.recordwriter: must be set to true unless you are using your own customized RecordWriter;
- mapred.job.name: an identifier for your application; it appears in the MapReduce web interface.
In the job configuration file you can also set application-specific properties; their values will be accessible at run time through the JobConf object.
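For instance, a mapper can read a custom property back through the JobConf. The following is a minimal sketch based on Pydoop's pipes API; the my.app.threshold property name is made up for illustration:

import pydoop.pipes as pp

class Mapper(pp.Mapper):

    def __init__(self, context):
        super(Mapper, self).__init__(context)
        jc = context.getJobConf()
        # "my.app.threshold" is a hypothetical application-specific property
        if jc.hasKey("my.app.threshold"):
            self.threshold = jc.getInt("my.app.threshold")
        else:
            self.threshold = 0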
Finally, you can include general Hadoop properties (e.g., mapred.reduce.tasks). See the Hadoop documentation for a list of the available properties and their meanings.
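For example, to request four reduce tasks, you could add the following property to the configuration file:

<property>
  <name>mapred.reduce.tasks</name>
  <value>4</value>
</property>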
Note
You can also set property values on the command line with the -D property.name=value syntax. You may find this more convenient when scripting or when temporarily overriding a specific property value. If you specify all required properties with -D switches, the XML configuration file is not necessary.
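For instance, the configuration shown above could be expressed entirely on the command line:

hadoop pipes \
    -D hadoop.pipes.executable=app_launcher \
    -D hadoop.pipes.java.recordreader=true \
    -D hadoop.pipes.java.recordwriter=true \
    -D mapred.job.name=app_name \
    -input input -output output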
When working on a shared cluster where you don't have root access, you may have a lot of software installed in non-standard locations, such as your home directory. Since non-interactive ssh connections do not usually preserve your environment, you might lose essential settings such as LD_LIBRARY_PATH.
A quick way to fix this is to insert a snippet like this one at the start of your launcher program:
#!/bin/sh
""":"
export LD_LIBRARY_PATH="my/lib/path:${LD_LIBRARY_PATH}"
exec /path/to/pyexe/python -u "$0" "$@"
":"""
# Python code for the launcher follows
In this way, the launcher is run as a shell script that sets up the environment and then executes Python on itself. Note that the sh code is wrapped in a triple-quoted Python string, so it is ignored when the script is interpreted by Python.
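To give an idea of what follows the snippet, here is a minimal word count launcher sketched with Pydoop's pipes API (details may vary across Pydoop versions):

import pydoop.pipes as pp

class Mapper(pp.Mapper):

    def map(self, context):
        # emit a <word, 1> pair for each word in the input line
        for word in context.getInputValue().split():
            context.emit(word, "1")

class Reducer(pp.Reducer):

    def reduce(self, context):
        # sum the counts collected for the current word
        total = 0
        while context.nextValue():
            total += int(context.getInputValue())
        context.emit(context.getInputKey(), str(total))

if __name__ == "__main__":
    pp.runTask(pp.Factory(Mapper, Reducer))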
Before running your application, you need to perform the following steps (an example command sequence is shown after this list):

- upload the application launcher and your input data to HDFS;
- set hadoop.pipes.executable to the launcher's HDFS path, either in the configuration file as shown above or on the command line (the -program option also serves this purpose);
- make sure the output directory does not already exist: Hadoop will not overwrite it, and will fail with an error instead.
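With the configuration file shown above, a typical sequence looks like this (HDFS paths are illustrative):

hadoop fs -put app_launcher app_launcher
hadoop fs -put input input
hadoop fs -rmr output   # only if left over from a previous run
hadoop pipes -conf conf.xml -input input -output output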
The examples subdirectory of the Pydoop distribution root contains several Python scripts that generate hadoop pipes command lines; they are documented in the Examples section.