
Class SGELoadBalancer

This class queries each SGE host and returns load and queue statistics.

*** All times are in SECONDS unless otherwise specified ***

The polling interval in seconds. Recommended: 60-300. Polling any more frequently is very wasteful; the polling loop with the visualizer takes about 15 seconds.
polling_interval = 60

VERY IMPORTANT: Set this to the maximum number of nodes you're willing to have in your cluster. Try setting this to the default cluster size you'd ordinarily use.
max_nodes = 5

IMPORTANT: Set this to the longest time a job can wait before another host is added to the cluster to help. Recommended: 300-900 seconds (5-15 minutes). Do not use a value less than 300 seconds, because that is how long an instance takes to start up.
longest_allowed_queue_time = 900

Keep this at 1 (your master) for now.
min_nodes = 1

This would allow the master to be killed when the queue empties. UNTESTED.
allow_master_kill = False

How many nodes to add per iteration. Setting it above 1 opens up the possibility of spending too much money.
add_nodes_per_iteration = 1

Kill an instance only after it has been up for this many minutes. Do not kill it earlier, since you've already paid for that hour. (In minutes.)
kill_after = 45

After adding a node, how long to wait for the instance to start new jobs.
stabilization_time = 180

Visualizer off by default. Start it with "starcluster loadbalance -p tag".
plot_stats = False

How many hours qacct should look back to gather past job data. Lower values minimize data transfer.
lookback_window = 3
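
The lookback window could be turned into a begin time for qacct roughly as follows. This is a minimal sketch, not StarCluster's own code: the function name qacct_begin_time is hypothetical, and it assumes the value is passed to qacct's -b option, which takes a begin time of the form [[CC]YY]MMDDhhmm[.SS].

```python
from datetime import datetime, timedelta


def qacct_begin_time(now, lookback_window_hours=3):
    """Return `now` minus the lookback window as a CCYYMMDDhhmm string."""
    start = now - timedelta(hours=lookback_window_hours)
    # Assumed target: `qacct -b <begin_time>`, using the full CCYYMMDDhhmm form.
    return start.strftime("%Y%m%d%H%M")


# Three hours before 14:30 on 1 June 2011 is 11:30 the same day.
example = qacct_begin_time(datetime(2011, 6, 1, 14, 30), lookback_window_hours=3)
```

A smaller window produces a later begin time, so qacct has less accounting data to return.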

Instance Methods
 
__init__(self, interval=60, max_nodes=None, wait_time=900, add_pi=1, kill_after=45, stab=180, lookback_win=3, min_nodes=1, allow_master_kill=False, plot_stats=False, plot_output_dir=None, dump_stats=False, stats_file=None)
x.__init__(...) initializes x; see x.__class__.__doc__ for signature
 
_validate_dir(self, dirname, msg_prefix='')
 
_mkdir(self, directory, makedirs=False)
 
get_remote_time(self)
This function remotely executes 'date' on the master node and returns a datetime object with the master's time, instead of fetching the time from the local machine, which may be inaccurate.
 
get_qatime(self, now)
This function takes the lookback window and creates a string representation of the past few hours to feed to qacct, limiting the data set that qacct returns.
 
get_stats(self)
This function will ssh to the SGE master and get load and queue stats.
 
run(self, cluster)
This function will loop indefinitely, using SGELoadBalancer.get_stats() to get the cluster's status.
 
has_cluster_stabilized(self)
 
_eval_add_node(self)
This function uses the metrics available to it to decide whether to add a new node to the cluster or not.
 
_eval_remove_node(self)
This function uses the SGE stats to decide whether or not to remove a node from the cluster.
 
_find_node_for_removal(self)
This function will find a suitable node to remove from the cluster.
 
_minutes_uptime(self, node)
This function uses data available to boto to determine how many total minutes this instance has been running.

Inherited from object: __delattr__, __format__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __sizeof__, __str__, __subclasshook__

Properties
  visualizer

Inherited from object: __class__

Method Details

__init__(self, interval=60, max_nodes=None, wait_time=900, add_pi=1, kill_after=45, stab=180, lookback_win=3, min_nodes=1, allow_master_kill=False, plot_stats=False, plot_output_dir=None, dump_stats=False, stats_file=None)
(Constructor)

x.__init__(...) initializes x; see x.__class__.__doc__ for signature

Overrides: object.__init__
(inherited documentation)

get_stats(self)

This function will ssh to the SGE master and get load and queue stats. It feeds these stats to SGEStats, which parses the XML. It returns two arrays: a host array, in which each host has a hash with its host information, and a job array, which contains a hash for every job with statistics such as the job name and priority.
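
The shape of the host array described here can be illustrated with a small parser. This is only a sketch under stated assumptions: the XML sample is a simplified, made-up document loosely modelled on SGE host listings, and parse_hosts is a hypothetical helper, not the SGEStats parser.

```python
import xml.etree.ElementTree as ET

# Illustrative XML only; real SGE output has more fields and may differ.
SAMPLE_HOSTS_XML = """
<qhost>
  <host name="master">
    <hostvalue name="num_proc">2</hostvalue>
    <hostvalue name="load_avg">0.05</hostvalue>
  </host>
  <host name="node001">
    <hostvalue name="num_proc">2</hostvalue>
    <hostvalue name="load_avg">1.90</hostvalue>
  </host>
</qhost>
"""


def parse_hosts(xml_text):
    """Return a list of dicts (hashes), one per <host> element."""
    hosts = []
    for host in ET.fromstring(xml_text.strip()).findall("host"):
        info = {"name": host.get("name")}
        for value in host.findall("hostvalue"):
            # Values are kept as strings; callers convert them as needed.
            info[value.get("name")] = value.text
        hosts.append(info)
    return hosts


hosts = parse_hosts(SAMPLE_HOSTS_XML)
```

A job array would follow the same pattern, with one dict per job element.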

run(self, cluster)

This function will loop indefinitely, using SGELoadBalancer.get_stats() to get the cluster's status. It looks at the job queue and tries to decide whether to add or remove a node. It should later look at job durations, but currently does not.

Overrides: LoadBalancer.run

_eval_add_node(self)

This function uses the metrics available to it to decide whether to add a new node to the cluster or not. It isn't able to add a node yet. TODO: See if the recent jobs have taken more than 5 minutes (how long it takes to start an instance).
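
One way such a decision could look, using the class settings max_nodes and longest_allowed_queue_time, is sketched below. The function should_add_node and its inputs are hypothetical and chosen for illustration; the actual method may weigh other metrics.

```python
def should_add_node(queued_wait_times, node_count, max_nodes=5,
                    longest_allowed_queue_time=900):
    """Decide whether to grow the cluster.

    queued_wait_times: seconds each queued job has been waiting (assumed input).
    node_count: number of nodes currently in the cluster (assumed input).
    """
    if node_count >= max_nodes:
        return False  # never exceed the configured ceiling
    if not queued_wait_times:
        return False  # nothing is waiting, so no extra capacity is needed
    # Add a node only when the longest-waiting job exceeds the allowed time.
    return max(queued_wait_times) > longest_allowed_queue_time


decision = should_add_node([120, 1000], node_count=2)
```

Checking the node ceiling first keeps the cost limit in force however long the queue gets.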

_find_node_for_removal(self)

This function will find a suitable node to remove from the cluster.
The criteria for removal are:
1. The node must not be running any SGE job
2. The node must have been up for 50-60 minutes past its start time
3. The node must not be the master, or allow_master_kill=True
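
The three criteria above can be sketched as a filter over candidate nodes. This is a hedged illustration, not the library's code: nodes are represented as plain dicts with hypothetical keys, and criterion 2 is interpreted as "at least min_minutes_into_hour minutes into the current billable hour", which is an assumption.

```python
def find_node_for_removal(nodes, allow_master_kill=False, min_minutes_into_hour=50):
    """Return the first node that meets all three removal criteria, or None."""
    for node in nodes:
        if node["running_jobs"] > 0:
            continue  # criterion 1: node must be idle
        if node["minutes_up"] % 60 < min_minutes_into_hour:
            continue  # criterion 2 (assumed interpretation): late in the paid hour
        if node["is_master"] and not allow_master_kill:
            continue  # criterion 3: protect the master unless explicitly allowed
        return node
    return None


nodes = [
    {"alias": "master", "is_master": True, "running_jobs": 0, "minutes_up": 175},
    {"alias": "node001", "is_master": False, "running_jobs": 2, "minutes_up": 115},
    {"alias": "node002", "is_master": False, "running_jobs": 0, "minutes_up": 20},
    {"alias": "node003", "is_master": False, "running_jobs": 0, "minutes_up": 115},
]
candidate = find_node_for_removal(nodes)
```

Here node003 is selected: it is idle, 55 minutes into its second hour, and not the master.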

_minutes_uptime(self, node)

This function uses data available to boto to determine how many total minutes this instance has been running. You can mod (%) the return value with 60 to determine how many minutes into a billable hour this node has been running.
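
The modulo arithmetic can be shown with a short worked example. It is a sketch only: minutes_uptime here is a hypothetical stand-in that takes the launch time as a datetime directly, rather than reading it from boto.

```python
from datetime import datetime, timezone


def minutes_uptime(launch_time, now):
    """Return whole minutes elapsed between launch_time and now."""
    return int((now - launch_time).total_seconds() // 60)


launch = datetime(2011, 6, 1, 12, 0, tzinfo=timezone.utc)
now = datetime(2011, 6, 1, 13, 47, tzinfo=timezone.utc)

total = minutes_uptime(launch, now)  # 1 hour 47 minutes of total uptime
into_hour = total % 60               # minutes into the current billable hour
```

With these inputs, total is 107 and into_hour is 47, so the node is 47 minutes into its second billable hour.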


Property Details

visualizer

Get Method:
unreachable.visualizer(self)