Before we begin - if you’re here and haven’t completed the first tutorial on queue submission, you should go back and complete that first. This tutorial assumes that you already have queue submission working and just need to overcome some of the limitations of simple queue submission.
In this tutorial, we’ll introduce the notion of reserving FireWorks on queue submission. Some differences between the simple method of the previous tutorial and the reservation method are outlined below:
Situation | Simple Queue Launching | Reservation Queue Launching |
---|---|---|
write/submit queue script | write generic script using QueueParams file alone | 1. reserve a FW from the database; 2. use FW’s spec to modify queue script |
queue manager runs queue script | determine a FW to run and run it | run the reserved FW |
job is deleted from queue | no action needed by the user | any affected reserved jobs must be unreserved by user manually |
Reserving jobs allows for more flexibility, but also adds maintenance overhead when queues go down or jobs in the queue are cancelled. Hence, there are some advantages to sticking with Simple Queue Launching. With that out of the way, let’s explore the reservation method of queue submission!
Begin in your working directory from the previous tutorial. You should have four files: fw_test.yaml, my_qp.yaml, my_fworker.yaml, and my_launchpad.yaml.
Note
Because we are using standard filenames for the LaunchPad and FireWorker, we will omit the -l and -w parameters when running scripts for the remainder of this tutorial.
Let’s reset our database and add a FireWork for testing:
lp_run.py reset <TODAY'S DATE>
lp_run.py add fw_test.yaml
Reserving a FireWork is as simple as adding the -r option to the Queue Launcher. Let’s queue up a reserved FireWork and immediately check its state:
qlauncher_run.py -r singleshot my_qp.yaml
lp_run.py get_fw 1
When you get the FireWork, you should notice that its state is RESERVED. No other Rocket Launchers will run that FireWork; it is now bound to your queue. Some details of the reservation are given in the launches key of the FireWork.
When your queue runs and completes your job, you should see that the state is updated to COMPLETED:
lp_run.py get_fw 1
One nice feature of reserving FireWorks is that you are automatically prevented from submitting more jobs to the queue than there are FireWorks in the database. Let’s try to submit too many jobs and see what happens.
Clean your working directory of everything but four files: fw_test.yaml, my_qp.yaml, my_fworker.yaml, and my_launchpad.yaml
Reset the database and add a FireWork for testing:
lp_run.py reset <TODAY'S DATE>
lp_run.py add fw_test.yaml
We have only one FireWork in the database, so we should only be able to submit one job to the queue. Let’s try submitting two:
qlauncher_run.py -r singleshot my_qp.yaml
qlauncher_run.py -r singleshot my_qp.yaml
You should see that the first submission went OK, but the second one reported "No jobs exist in the LaunchPad for submission to queue!" If we repeated this sequence without the -r option, we would submit too many jobs to the queue.
Note
Once the job starts running or completes, both the Simple version of the queue launcher and the Reservation version will stop you from submitting jobs. However, only the Reservation version will identify that a job is already queued.
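The bookkeeping behind this behavior can be pictured with a small in-memory model. This is only a sketch in Python, not the FireWorks implementation: the `reserve_next` function and the dictionary of states are hypothetical stand-ins for the LaunchPad database.

```python
READY, RESERVED = "READY", "RESERVED"


def reserve_next(fireworks):
    """Mark the first READY FireWork as RESERVED and return its id.

    Returns None when nothing is READY, which corresponds to the
    "No jobs exist in the LaunchPad" message. Hypothetical helper,
    not part of the FireWorks API.
    """
    for fw_id in sorted(fireworks):
        if fireworks[fw_id] == READY:
            fireworks[fw_id] = RESERVED
            return fw_id
    return None


# One FireWork in the (toy) database, two submission attempts.
fireworks = {1: READY}
first = reserve_next(fireworks)   # reserves FireWork 1
second = reserve_next(fireworks)  # nothing left to reserve
print(first, second)
```

In this model the second attempt finds no READY FireWork, so no second queue script would be submitted.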
Another key feature of reserving FireWorks before queue submission is that the FireWork can override queue parameters. This is done by specifying the _queueparams reserved key in the spec. For example, let’s override the walltime parameter.
Clean your working directory of everything but four files: fw_test.yaml, my_qp.yaml, my_fworker.yaml, and my_launchpad.yaml
Look in the file my_qp.yaml. You should have a walltime parameter listed, perhaps set to 2 minutes. By default, all jobs submitted by this Queue Launcher would have a 2-minute walltime.
Let’s copy over the fw_walltime.yaml file from the tutorials dir:
cp <INSTALL_DIR>/fw_tutorials/queue_pt2/fw_walltime.yaml .
Look inside fw_walltime.yaml. You will see a _queueparams key in the spec that specifies a walltime of 10 minutes. Anything in the _queueparams key will override the corresponding parameter in my_qp.yaml when the Queue Launcher is run in reservation mode. So now, the FireWork itself is determining key properties of the queue submission.
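The override rule can be thought of as a dictionary merge in which the FireWork’s values win. The Python sketch below illustrates that idea only; it is not the FireWorks source code. The function name `merge_queue_params`, the `ntasks` key, and the walltime strings are assumptions made for the example and may not match your my_qp.yaml.

```python
def merge_queue_params(queue_defaults, fw_spec):
    """Return queue parameters with the spec's _queueparams applied on top.

    Hypothetical helper for illustration; not part of the FireWorks API.
    """
    merged = dict(queue_defaults)  # copy so the defaults are left untouched
    merged.update(fw_spec.get("_queueparams", {}))  # spec values win on conflict
    return merged


# Assumed example values: a 2-minute default and a 10-minute override.
defaults = {"walltime": "00:02:00", "ntasks": 1}
spec = {"_queueparams": {"walltime": "00:10:00"}}

merged = merge_queue_params(defaults, spec)
print(merged)
```

Parameters that the spec does not mention (here, `ntasks`) keep their default values, and a spec without a _queueparams key leaves the defaults unchanged.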
Let’s add and run this FireWork:
lp_run.py reset <TODAY'S DATE>
lp_run.py add fw_walltime.yaml
qlauncher_run.py -r singleshot my_qp.yaml
You might check the walltime that your job was submitted with using your queue manager’s built-in commands (e.g., qstat or mstat). You can also see the queue submission script by looking inside the file FW_submit.script. Inside, you’ll see the job was submitted with the walltime specified by your FireWork, not the default walltime from my_qp.yaml.
Your job should complete successfully as before. You could also try to override other queue parameters such as the number of cores for running the job or the account which is charged for running the job. In this way, your queue submission can be tailored on a per-job basis!
One limitation of reserving FireWorks is that the FireWork’s fate is tied to that of the queue submission. If the place in the queue is deleted, that FireWork is stuck in limbo unless you reset its state from RESERVED back to READY. Let’s try to simulate this:
Clean your working directory of everything but four files: fw_test.yaml, my_qp.yaml, my_fworker.yaml, and my_launchpad.yaml
Let’s add and run this FireWork. Before the job starts running, delete it from the queue (if you’re too slow, repeat this entire step):
lp_run.py reset <TODAY'S DATE>
lp_run.py add fw_test.yaml
qlauncher_run.py -r singleshot my_qp.yaml
qdel <JOB_ID>
Note
The job id should have been printed by the Queue Launcher, or you can check your queue manager. The qdel command might need to be modified, depending on the type of queue manager you use.
Now we have no jobs in the queue. But our FireWork still shows up as RESERVED:
lp_run.py get_fw 1
Because our FireWork is RESERVED, we cannot run it:
qlauncher_run.py -r singleshot my_qp.yaml
The Queue Launcher tells us "No jobs exist in the LaunchPad for submission to queue!" FireWorks thinks that our old queue submission (the one that we deleted) is going to run this FireWork and is not letting us submit another queue script for the same job.
The solution is to un-reserve our RESERVED FireWork:
lp_run.py unreserve
Now the FireWork should be in the READY state:
lp_run.py get_fw 1
And we can run it again:
qlauncher_run.py -r singleshot my_qp.yaml
Note
The unreserve command is currently a blunt instrument that un-reserves all reserved FireWorks. If you un-reserve a FireWork that is still in a queue, the consequences are not so bad. FireWorks might submit a second job to the queue that reserves this same FireWork. The first queue script to run will run the FireWork properly. The second job to run will not find a FireWork to run and simply exit.
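To make the "blunt instrument" behavior concrete, here is a toy Python model of an unreserve step that resets every RESERVED entry, regardless of whether its queue job still exists. It is a sketch under stated assumptions, not the actual command's implementation; `unreserve_all` and the state dictionary are hypothetical.

```python
def unreserve_all(fireworks):
    """Set every RESERVED entry back to READY and return the affected ids.

    Hypothetical helper for illustration; not part of the FireWorks API.
    """
    freed = [fw_id for fw_id, state in fireworks.items() if state == "RESERVED"]
    for fw_id in freed:
        fireworks[fw_id] = "READY"
    return freed


# FireWork 1 lost its queue job; FireWork 3 is still waiting in the queue.
fireworks = {1: "RESERVED", 2: "COMPLETED", 3: "RESERVED"}
freed = unreserve_all(fireworks)
print(sorted(freed), fireworks)
```

Both reserved entries return to READY in this model, including the one whose job is still queued, which is why a duplicate queue submission can follow.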
As we demonstrated, reserving jobs in the queue has several advantages, but also adds the complication that queue failure can hold up a FireWork until you run the unreserve command to free up broken reservations. It is up to you which mode you prefer for your application. However, we suggest that you use only one of the two methods throughout your application. In particular, do not use the Simple Queue Launcher if you are defining the _queueparams parameter in your spec. Jobs launched from the Simple Queue Launcher will not carry out this override!