Class SGELoadBalancer
This class is able to query each SGE host and return load & queue
statistics
*** All times are in SECONDS unless otherwise specified ***

The polling interval in seconds. Recommended: 60-300. Polling any more
frequently is very wasteful. The polling loop with the visualizer takes
about 15 seconds.
polling_interval = 60

VERY IMPORTANT: Set this to the maximum number of nodes you're willing to
have in your cluster. Try setting this to the default cluster size you'd
ordinarily use.
max_nodes = 5

IMPORTANT: Set this to the longest time a job can wait before another
host is added to the cluster to help. Recommended: 300-900 seconds (5-15
mins). Do not use a value less than 300 seconds, because that is how long
an instance takes to start up.
longest_allowed_queue_time = 900

Keep this at 1 (your master) for now.
min_nodes = 1

This would allow the master to be killed when the queue empties.
UNTESTED.
allow_master_kill = False

How many nodes to add per iteration. Setting it > 1 opens up the
possibility of spending too much $$.
add_nodes_per_iteration = 1

Kill an instance only after it has been up for this many minutes. Do not
kill earlier, since you've already paid for that hour. (In minutes.)
kill_after = 45

After adding a node, how long to wait for the instance to start new jobs.
stabilization_time = 180

Visualizer off by default. Start it with "starcluster loadbalance -p tag".
plot_stats = False

How many hours qacct should look back to gather past job data. Lower
values minimize data transfer.
lookback_window = 3
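As an illustration of how lookback_window could be turned into the string that get_qatime feeds to qacct, here is a minimal sketch. It assumes the string is a start time in the YYYYMMDDhhmm form that qacct's -b option accepts. The function name qacct_begin_time is hypothetical and is not part of SGELoadBalancer.

```python
from datetime import datetime, timedelta


def qacct_begin_time(now, lookback_window_hours=3):
    """Return the start of the lookback window as a YYYYMMDDhhmm string.

    Hypothetical helper: assumes qacct is given a start time (its -b
    option) so that only jobs from the last few hours are returned.
    """
    start = now - timedelta(hours=lookback_window_hours)
    return start.strftime("%Y%m%d%H%M")


# With the default 3-hour window, 12:30 maps to 09:30 on the same day.
print(qacct_begin_time(datetime(2011, 5, 4, 12, 30)))  # 201105040930
```

A smaller window yields a later start time, which is why lower values reduce the amount of accounting data transferred from the master.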
__init__(self, interval=60, max_nodes=5, wait_time=900, add_pi=1,
kill_after=45, stab=180, lookback_win=3, min_nodes=1,
allow_master_kill=False, plot_stats=False, plot_output_dir=None,
dump_stats=False, stats_file=None)
x.__init__(...) initializes x; see x.__class__.__doc__ for signature
_validate_dir(self, dirname, msg_prefix='')
get_remote_time(self)
Remotely executes 'date' on the master node and returns a datetime
object holding the master's time, rather than reading the time from the
local machine, whose clock may be inaccurate.
get_qatime(self, now)
Uses the lookback window to build a string representing the past few
hours, which is fed to qacct to limit the data set qacct returns.
run(self, cluster)
This function will loop indefinitely, using
SGELoadBalancer.get_stats() to get the cluster's status.
_eval_add_node(self)
This function uses the metrics available to it to decide whether to
add a new node to the cluster or not.
_eval_remove_node(self)
This function uses the SGE stats to decide whether or not to remove a
node from the cluster.
_minutes_uptime(self, node)
This function uses data available to boto to determine how many total
minutes this instance has been running.
Inherited from object: __delattr__, __format__, __getattribute__,
__hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__,
__sizeof__, __str__, __subclasshook__
__init__(self, interval=60, max_nodes=5, wait_time=900, add_pi=1,
kill_after=45, stab=180, lookback_win=3, min_nodes=1,
allow_master_kill=False, plot_stats=False, plot_output_dir=None,
dump_stats=False, stats_file=None)
(Constructor)
x.__init__(...) initializes x; see x.__class__.__doc__ for signature
- Overrides: object.__init__ (inherited documentation)
This function will ssh to the SGE master and get load and queue
stats. It feeds these stats to SGEStats, which parses the XML. It
returns two arrays: a host array, in which each host has a hash holding
its host information, and a job array, which contains a hash for every
job with statistics such as the job name, priority, etc.
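To make the shape of the two returned arrays concrete, here is a rough sketch of parsing host and job XML into lists of dictionaries. The XML samples are trimmed, assumed approximations of what qhost -xml and qstat -xml emit, and parse_hosts and parse_jobs are hypothetical names. How SGEStats actually parses the data is not shown in this page.

```python
import xml.etree.ElementTree as ET

# Assumed, heavily trimmed samples; real SGE output has more fields.
QHOST_XML = """<qhost>
  <host name="master">
    <hostvalue name="num_proc">2</hostvalue>
    <hostvalue name="load_avg">0.15</hostvalue>
  </host>
  <host name="node001">
    <hostvalue name="num_proc">2</hostvalue>
    <hostvalue name="load_avg">1.90</hostvalue>
  </host>
</qhost>"""

QSTAT_XML = """<job_info>
  <queue_info>
    <job_list state="running">
      <JB_job_number>7</JB_job_number>
      <JAT_prio>0.55500</JAT_prio>
      <JB_name>align</JB_name>
    </job_list>
  </queue_info>
  <job_info>
    <job_list state="pending">
      <JB_job_number>8</JB_job_number>
      <JAT_prio>0.00000</JAT_prio>
      <JB_name>sort</JB_name>
    </job_list>
  </job_info>
</job_info>"""


def parse_hosts(xml_text):
    """Return one dict per <host>, keyed by each hostvalue's name."""
    hosts = []
    for host in ET.fromstring(xml_text).iter("host"):
        info = {"name": host.get("name")}
        for value in host.iter("hostvalue"):
            info[value.get("name")] = value.text
        hosts.append(info)
    return hosts


def parse_jobs(xml_text):
    """Return one dict per <job_list>, keyed by child tag name."""
    jobs = []
    for job in ET.fromstring(xml_text).iter("job_list"):
        info = {"state": job.get("state")}
        for child in job:
            info[child.tag] = child.text
        jobs.append(info)
    return jobs


hosts = parse_hosts(QHOST_XML)
jobs = parse_jobs(QSTAT_XML)
```

All values are kept as strings here; a real consumer would convert load averages and priorities to numbers before comparing them.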
This function will loop indefinitely, using
SGELoadBalancer.get_stats() to get the cluster's status. It looks at the
job queue and tries to decide whether to add or remove a node. It does
not currently look at job durations, but should do so later.
- Overrides: LoadBalancer.run
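The overall shape of such a polling loop can be sketched as follows. This is only an illustration under stated assumptions: balance_loop and its callback parameters are hypothetical, and the max_iterations and sleep arguments exist purely so the sketch can be exercised without running forever.

```python
import time


def balance_loop(get_stats, eval_add_node, eval_remove_node,
                 polling_interval=60, max_iterations=None, sleep=time.sleep):
    """Poll the cluster, then evaluate adding and removing a node.

    Hypothetical sketch: loops forever when max_iterations is None,
    otherwise stops after that many polls. Returns the number of polls.
    """
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        hosts, jobs = get_stats()       # host and job arrays
        eval_add_node(hosts, jobs)      # may decide to grow the cluster
        eval_remove_node(hosts, jobs)   # may decide to shrink the cluster
        iterations += 1
        sleep(polling_interval)         # wait before the next poll
    return iterations
```

Passing a no-op sleep and a small max_iterations makes it possible to check the call order without waiting for real polling intervals.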
This function uses the metrics available to it to decide whether or
not to add a new node to the cluster. It isn't able to add a node yet.
TODO: See if the recent jobs have taken more than 5 minutes (how long it
takes to start an instance).
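One way the add decision could be expressed, using the longest_allowed_queue_time and max_nodes settings from the class documentation, is sketched below. The function should_add_node and its inputs are hypothetical, and the real method may weigh other metrics.

```python
from datetime import datetime, timedelta


def should_add_node(queued_since, now, current_nodes,
                    max_nodes=5, longest_allowed_queue_time=900):
    """Decide whether the oldest queued job has waited too long.

    Hypothetical sketch. queued_since is a list of datetimes, one per
    queued job, giving when each job entered the queue.
    """
    if current_nodes >= max_nodes or not queued_since:
        return False  # cluster is at its cap, or nothing is waiting
    longest_wait = (now - min(queued_since)).total_seconds()
    return longest_wait > longest_allowed_queue_time


now = datetime(2011, 5, 4, 12, 0, 0)
waiting = [now - timedelta(seconds=1000), now - timedelta(seconds=30)]
print(should_add_node(waiting, now, current_nodes=2))  # True
```

The max_nodes check comes first so that a long queue can never push the cluster past the configured cap.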
This function will find a suitable node to remove from the cluster.
The criteria for removal are:
1. The node must not be running any SGE job
2. The node must have been up for 50-60 minutes past its start time
3. The node must not be the master, unless allow_master_kill=True
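The three criteria above can be sketched as a single predicate. This is an interpretation rather than the actual implementation: it reads criterion 2 as "50 or more minutes into the current billable hour", and the node dictionary keys and the names is_removable and min_minutes_into_hour are hypothetical.

```python
def is_removable(node, allow_master_kill=False, min_minutes_into_hour=50):
    """Check the three removal criteria for one node.

    Hypothetical sketch. node is a dict with the assumed keys
    'running_jobs', 'is_master' and 'uptime_minutes'.
    """
    if node["running_jobs"] > 0:
        return False  # criterion 1: must be idle
    if node["is_master"] and not allow_master_kill:
        return False  # criterion 3: protect the master by default
    # Criterion 2: late in the billable hour (uptime modulo 60).
    return node["uptime_minutes"] % 60 >= min_minutes_into_hour


idle_worker = {"running_jobs": 0, "is_master": False, "uptime_minutes": 115}
print(is_removable(idle_worker))  # True
```

In this example 115 minutes of uptime is 55 minutes into the second billable hour, so the idle worker qualifies.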
This function uses data available to boto to determine how many total
minutes this instance has been running. You can take the return value
modulo 60 (%) to determine how many minutes into a billable hour this
node has been running.
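The uptime arithmetic can be illustrated with a small sketch. It assumes the launch time is available as a UTC string such as "2011-05-04T09:10:00.000Z" and that the current time is a naive UTC datetime. The standalone function minutes_uptime below is hypothetical and takes those values directly instead of a node object.

```python
from datetime import datetime


def minutes_uptime(launch_time, now):
    """Return whole minutes elapsed between launch_time and now.

    Hypothetical sketch. launch_time is assumed to be a UTC string
    like '2011-05-04T09:10:00.000Z'; now is a naive UTC datetime.
    """
    launched = datetime.strptime(launch_time, "%Y-%m-%dT%H:%M:%S.%fZ")
    return int((now - launched).total_seconds() // 60)


uptime = minutes_uptime("2011-05-04T09:10:00.000Z",
                        datetime(2011, 5, 4, 11, 5, 30))
print(uptime)       # 115
print(uptime % 60)  # 55 minutes into the current billable hour
```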
visualizer
- Get Method: unreachable.visualizer(self)