FireWorks typically requires a network connection between the LaunchPad and the FireWorker to operate. The network connection allows the FireWorker to check out a job from the LaunchPad and subsequently update the LaunchPad with the status of that job.
Unfortunately, many computing centers employ an internal system of “compute nodes” that cannot access an outside network. A FireWorker running on such a node cannot check out jobs from the LaunchPad or report job status back to it. However, if the login node of the computing center can access the network, we can design a system whereby the login node handles all network connections and communicates with the compute nodes by serializing information as files.
This is the offline mode of FireWorks operation. Before using this option, however, it is important to understand that:
In offline mode, the login node checks out a job, serializes it to a FW.json file, and puts that file in the launch directory. When the compute node starts the job, the rlaunch command, run with the --offline option, reads the FW.json file to instantiate the FireWork and run it.
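The file-based handoff described above can be illustrated with a minimal sketch. This is not FireWorks' own code: the function names `checkout_to_file` and `load_from_file` and the job payload are hypothetical, and a real FW.json holds a full serialized FireWork rather than this toy dictionary. Only the FW.json file name comes from the text above.

```python
import json
import tempfile
from pathlib import Path


def checkout_to_file(job: dict, launch_dir: Path) -> Path:
    """Login-node side (hypothetical helper): write the checked-out job to FW.json."""
    launch_dir.mkdir(parents=True, exist_ok=True)
    fw_file = launch_dir / "FW.json"
    fw_file.write_text(json.dumps(job, indent=2))
    return fw_file


def load_from_file(launch_dir: Path) -> dict:
    """Compute-node side (hypothetical helper): rebuild the job from the file only."""
    return json.loads((launch_dir / "FW.json").read_text())


if __name__ == "__main__":
    # Toy payload for illustration; the keys are assumptions, not the real schema.
    job = {"fw_id": 1, "name": "example", "spec": {"script": "echo hello"}}
    with tempfile.TemporaryDirectory() as tmp:
        launch_dir = Path(tmp) / "launcher_example"
        checkout_to_file(job, launch_dir)
        restored = load_from_file(launch_dir)
        print(restored == job)
```

The point of the design is that the compute-node side touches only the shared filesystem, so it needs no network access at any time.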
To submit jobs in offline mode:
With those two small modifications, your job should be submitted and run successfully. You’ll notice that a FW.json file as well as a FW_offline.json file were written to your submission’s launch directory.
Next, we allow the compute node to report job information back to the LaunchPad via the login node.
Since the compute nodes have no way to communicate job status over a network, they write files (FW_ping.json and FW_action.json) to report this information. The login node can periodically read these files and pass the information back to the LaunchPad.
To recover all offline jobs, run the following command from the login node:
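The reporting path can be sketched in the same spirit. The example below is an illustration under stated assumptions, not the FireWorks implementation: the helper names, the JSON field names, and the rule that the presence of an action file means "finished" are all hypothetical simplifications. Only the two file names come from the text above.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

PING_FILE = "FW_ping.json"
ACTION_FILE = "FW_action.json"


def write_ping(launch_dir: Path) -> None:
    """Compute-node side: record that the job is still alive (assumed field name)."""
    payload = {"ping_time": datetime.now(timezone.utc).isoformat()}
    (launch_dir / PING_FILE).write_text(json.dumps(payload))


def write_action(launch_dir: Path, action: dict) -> None:
    """Compute-node side: record the job's result when it completes."""
    (launch_dir / ACTION_FILE).write_text(json.dumps(action))


def collect_reports(launch_dirs) -> dict:
    """Login-node side: read whichever report files exist in each directory."""
    reports = {}
    for launch_dir in launch_dirs:
        report = {}
        for key, fname in (("ping", PING_FILE), ("action", ACTION_FILE)):
            path = launch_dir / fname
            if path.exists():
                report[key] = json.loads(path.read_text())
        if report:  # skip directories where the job has not reported anything yet
            reports[str(launch_dir)] = report
    return reports


def infer_state(report: dict) -> str:
    """Simplified rule (an assumption): an action file means the job has finished."""
    return "finished" if "action" in report else "running"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        running, done = Path(tmp) / "job_a", Path(tmp) / "job_b"
        for d in (running, done):
            d.mkdir()
            write_ping(d)
        write_action(done, {"stored_data": {"ok": True}})
        for path, report in collect_reports([running, done]).items():
            print(Path(path).name, infer_state(report))
```

In a real deployment, the login node would forward what it reads to the LaunchPad rather than printing it.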
lpad recover_offline
Note
Type lpad recover_offline -h to see further options.
This command searches the launch directories of all offline jobs for FW_ping.json and FW_action.json files. If it finds them, it connects to the LaunchPad and updates the status of the jobs based on the files’ contents. At this point, we should note a few things:
Generally, you will not need to tell FireWorks to forget about certain directories. However, if you want to stop trying to recover certain FireWorks, you can type:
lpad forget_offline -h
This prints a help message explaining how to “forget” certain FireWorks so that FireWorks no longer tries to recover them. The state of these FireWorks in the database will be frozen unless you run a command like defuse_fws or rerun_fws to handle them.
Although offline mode is typically less desirable than normal FireWorks operation, one advantage is that it minimizes the need for database access. Whereas normal operation requires the database to be fully operational while jobs are running, offline operation only requires database access when checking out and submitting jobs (qlaunch) and when recovering jobs (recover_offline). In between, while jobs are running, the database can be down for maintenance.