🤝 Binary Consensus

The Binary Consensus problem is a simple yet non-trivial decentralized multi-agent environment inspired by the classic voter model introduced by Holley and Liggett in 1975 [Holley and Liggett]. In the original model, agents (or “particles”) interact over a graph and influence each other’s decisions. While this model has been extensively studied in statistical physics, its adaptation to the Markov Decision Process (MDP) framework—particularly in decentralized, fully cooperative settings—is relatively rare.

This environment adapts the voter model into a discrete-time decentralized partially observable Markov decision process (Dec-POMDP). Each agent observes its own binary vote as well as those of its neighbors and can choose to retain or flip its vote at each time step. Agents’ states are also stochastically influenced by their neighbors’ actions. Despite its simplicity, the problem presents interesting challenges for coordination and decision-making.

Illustration for one state update in the Binary Consensus problem.

Problem Formulation

Agents and Votes: There are \(N\) agents. Each agent maintains a binary state \(s_i(t) \in \{0, 1\}\), representing its vote at time \(t\). The joint state at time \(t\) is:

\[S(t) = \{s_i(t)\}_{i=1}^N\]

Observations: Each agent observes its own vote and the votes of its direct neighbors as defined by a fixed graph structure.

Action Space: At every time step, each agent selects an action \(a_i(t) \in \{0, 1\}\):

  • \(a_i(t) = 0\): keep the current vote

  • \(a_i(t) = 1\): switch to the opposite vote

Objective: The goal is to reach a consensus corresponding to the initial majority vote within a fixed time horizon \(T\). The initial majority \(m_0\) is defined as:

\[m_0 = \arg\max_{v \in \{0,1\}} \sum_{i=1}^{N} \mathbf{1}[s_i(0) = v]\]

A correct consensus is reached at time \(t\) if all agents share the initial majority vote:

\[s_i(t) = m_0 \quad \forall i \in \{1, \dots, N\}\]

An episode terminates either when the agents become unanimous (even if on the wrong value) or when the time horizon \(T\) is exceeded.
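In code, the initial majority and the consensus check can be sketched as follows. The helper names are illustrative (not part of the library); the tie convention of returning -1 mirrors the environment's get_majority_value method documented below.

```python
import numpy as np


def initial_majority(votes: np.ndarray) -> int:
    """Majority value of a binary vote vector: 1, 0, or -1 on a tie."""
    ones = int(votes.sum())
    zeros = votes.size - ones
    if ones == zeros:
        return -1  # tie, as reported by get_majority_value
    return 1 if ones > zeros else 0


def consensus_reached(votes: np.ndarray, target: int) -> bool:
    """True when every agent holds the target vote."""
    return bool(np.all(votes == target))
```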

Use Cases and Complexity

While this theoretical model is not tied to any specific real-world application, its simplicity and flexibility make it a strong benchmark for evaluating decentralized, centralized, and hybrid learning strategies across varying graph sizes and topologies.

The environment has both a state space and an action space of size \(2^N\), which quickly becomes intractable as the number of agents increases. This property makes it a useful stress test for centralized methods and a valuable tool for studying scalability in multi-agent reinforcement learning.

A simple, intuitive heuristic policy is available as a baseline: at each time step, each agent adopts the majority vote of its local neighborhood. This provides a competitive reference for evaluating learned strategies.
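A minimal sketch of this heuristic, assuming a NumPy adjacency matrix with zero diagonal and an agent's own vote counted as part of its neighborhood (the function name is illustrative, not the library's API):

```python
import numpy as np


def heuristic_actions(adjacency: np.ndarray, votes: np.ndarray) -> np.ndarray:
    """Each agent flips (action 1) iff its vote disagrees with the
    majority of its neighborhood (direct neighbors plus itself)."""
    n = votes.size
    actions = np.zeros(n, dtype=int)
    for i in range(n):
        neighbors = np.flatnonzero(adjacency[i] > 0)
        hood = np.append(neighbors, i)  # include the agent's own vote
        ones = votes[hood].sum()
        majority = 1 if 2 * ones > hood.size else 0  # ties default to 0
        actions[i] = int(votes[i] != majority)
    return actions
```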

Environment

class cognac.env.BinaryConsensus.env.BinaryConsensusNetworkEnvironment(adjacency_matrix: numpy.ndarray, max_steps: int = 100, show_neighborhood_state: bool = True, reward_class: type[cognac.core.BaseReward.BaseReward] = FactoredRewardModel, is_global_reward: bool = False)

Bases: ParallelEnv

A multi-agent reinforcement learning environment modeling binary consensus.

Agents interact on a probabilistic influence graph and attempt to reach a common binary state (0 or 1). Each agent’s action influences its own state and that of its neighbors according to the adjacency matrix.

_check_adjacency_matrix() None

Ensure adjacency matrix has valid structure.

Raises

AssertionError

If the diagonal is non-zero or probabilities are not in [0, 1].

action_space(agent: int) Discrete

Return the action space for a given agent.

Parameters

agent : int

Index of the agent.

Returns

Discrete

Action space with two discrete actions: 0 or 1.

get_majority_value() int

Compute the current majority binary value.

Returns

int

1 if majority is 1, 0 if majority is 0, -1 if tied.

get_obs() Dict[int, ndarray]

Get current observations for all agents.

Returns

dict

Observations keyed by agent index, each containing a binary vector of the agent’s neighborhood.

metadata: dict[str, Any] = {'name': 'binary_consensus_environment_v0'}

observation_space(agent: int) MultiDiscrete

Return the observation space for a given agent.

Parameters

agent : int

Index of the agent.

Returns

MultiDiscrete

Observation space describing possible binary observations from the agent’s neighborhood.

render(save_frame: bool = False, fig=None, ax=None) None

Render the current state of the environment.

Parameters

save_frame : bool, optional

Whether to save the current frame as an image. Defaults to False.

fig : matplotlib.figure.Figure, optional

Figure for rendering. Defaults to None.

ax : matplotlib.axes.Axes, optional

Axes for rendering. Defaults to None.

reset(seed: int | None = None, options: Dict | None = None) Tuple[Dict[int, ndarray], Dict[int, dict]]

Reset the environment to its initial state.

Parameters

seed : int, optional

Random seed for reproducibility.

options : dict, optional

Options for reset (e.g. “init_vect” for setting an initial state).

Returns

observations : dict

Observations for each agent after reset.

infos : dict

Info dictionaries for each agent.

state() ndarray

Get the internal environment state.

Returns

np.ndarray

Array of current binary states for all agents.

step(actions: Dict[int, int]) Tuple[Dict[int, ndarray], Dict[int, float], Dict[int, bool], Dict[int, bool], Dict[int, dict]]

Perform one environment step using the given agent actions.

Parameters

actions : dict

Mapping from agent index to their binary action (0 or 1).

Returns

observations : dict

New observations for each agent.

rewards : dict

Rewards assigned to each agent or shared globally.

terminations : dict

Flags indicating whether each agent’s episode is terminated.

truncations : dict

Flags indicating whether each agent’s episode is truncated.

infos : dict

Additional metadata for each agent.

Rewards

The default reward model here is the FactoredRewardModel. It penalizes each agent at every step for disagreeing with the current majority (which does not necessarily match the objective consensus). At the terminal state, it gives a large reward for reaching the consensus, weighted by how quickly it was reached (the faster, the better). If the consensus is not reached before the maximum horizon, a large negative penalty is applied.

More formally, the reward model works as follows:

During an episode: Each agent gets a local reward at each step:

\[\begin{split}r_i(t) = \begin{cases} 0 & \text{if agent } i \text{ agrees with majority} \\ -1 & \text{otherwise} \end{cases}\end{split}\]
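This per-step local reward is straightforward to express in code (an illustrative sketch, not the library implementation):

```python
def step_reward(vote: int, majority: int) -> float:
    """Per-step local reward: 0 if the agent matches the majority, else -1."""
    return 0.0 if vote == majority else -1.0
```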

At episode end:

  • Let \(\tau\) be the temporal weight factor of the reward:

\[\tau = \frac{t_{\max}-t_{\rm final}}{t_{\max}}\]

where \(t_{\max}\) is the maximum length of an episode and \(t_{\rm final}\) is the actual terminal timestep. This factor decreases linearly from 1 (immediate consensus) to 0 (reaching the horizon).

  • Let \(\xi\) be a penalty term that is added whenever the consensus is not reached by the end of the episode:

\[\begin{split}\xi = \begin{cases} -100 & \tau = 0 \\ 0 & \text{otherwise} \end{cases}\end{split}\]

Then the final reward is computed using the ratio to the consensus \(x_{\rm final}/N\), where \(x_{\rm final}\) is the number of agents agreeing with the objective value.

\[r_i(\text{end}) = \frac{\tau \cdot 100\, x_{\rm final}/N + \xi}{N}\]
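The terminal reward can be sketched as below. This assumes \(\xi\) is added outside the temporal factor \(\tau\) (otherwise \(\tau = 0\) at the horizon would cancel the penalty); the names and constants follow the formulas above, and this is an illustrative sketch rather than the library implementation.

```python
def terminal_reward(x_final: int, n_agents: int, t_final: int, t_max: int) -> float:
    """Per-agent terminal reward under the factored model described above.

    x_final -- number of agents agreeing with the objective value
    """
    tau = (t_max - t_final) / t_max        # 1 at t=0, 0 at the horizon
    xi = -100.0 if tau == 0 else 0.0       # penalty only when the horizon is hit
    return (tau * 100.0 * x_final / n_agents + xi) / n_agents
```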

class cognac.env.BinaryConsensus.rewards.FactoredRewardModel(max_reward: float = 100.0, min_reward: float = -100.0)

Bases: BaseReward

Reward model that encourages agents to reach consensus with the majority.

This reward model assigns rewards based on local agreement with the current majority state. Upon episode termination, a factored reward is computed based on the proportion of agents in consensus and the remaining time steps.

This model supports both local (per-agent) and global reward configurations.

Parameters

max_reward : float, optional

Maximum reward achievable at full consensus. Default is 100.0.

min_reward : float, optional

Penalty applied when the episode is truncated before consensus. Default is -100.0.

get_consensus_value(env: ParallelEnv) int

Compute how many agents currently agree on the majority value.

Parameters

env : ParallelEnv

The current environment instance.

Returns

int

Number of agents voting for the majority value.

class cognac.env.BinaryConsensus.rewards.RewardWInitTarget(max_reward: float = 100.0, min_reward: float = -10.0)

Bases: BaseReward

Reward model encouraging agents to converge on the initial majority state.

This reward model stores the initial majority value after reset, then provides:

  • A large positive reward if the final state reaches full consensus on the initial value.

  • A large penalty if consensus is not reached or if the consensus is on the wrong value.

  • Stepwise feedback (+1 or -1) during the episode based on agreement with the target.

Parameters

max_reward : float, optional

Reward for reaching full consensus on the initial majority value. Default is 100.0.

min_reward : float, optional

Penalty for failing to reach the correct consensus. Default is -10.0.

reset(init_state: ndarray)

Set the target consensus value from the initial state.

Parameters

init_state : np.ndarray

Initial binary vector of agent states.