🤝 Binary Consensus
The Binary Consensus problem is a simple yet non-trivial decentralized multi-agent environment inspired by the classic voter model introduced by Holley and Liggett in 1975 [Holley and Liggett]. In the original model, agents (or “particles”) interact over a graph and influence each other’s decisions. While this model has been extensively studied in statistical physics, its adaptation to the Markov Decision Process (MDP) framework—particularly in decentralized, fully cooperative settings—is relatively rare.
This environment adapts the voter model into a discrete-time decentralized partially observable Markov decision process (Dec-POMDP). Each agent observes its own binary vote as well as those of its neighbors and can choose to retain or flip its vote at each time step. Agents’ states are also stochastically influenced by their neighbors’ actions. Despite its simplicity, the problem presents interesting challenges for coordination and decision-making.
Illustration of one state update in the Binary Consensus problem.
Problem Formulation
Agents and Votes: There are \(N\) agents. Each agent maintains a binary state \(s_i(t) \in \{0, 1\}\), representing its vote at time \(t\). The joint state at time \(t\) is:
\[s(t) = \big(s_1(t), s_2(t), \ldots, s_N(t)\big) \in \{0, 1\}^N\]
Observations: Each agent observes its own vote and the votes of its direct neighbors as defined by a fixed graph structure.
Action Space: At every time step, each agent selects an action \(a_i(t) \in \{0, 1\}\):
\(a_i(t) = 0\): keep the current vote
\(a_i(t) = 1\): switch to the opposite vote
Objective: The goal is to reach a consensus corresponding to the initial majority vote within a fixed time horizon \(T\). The initial majority \(m_0\) is defined as:
\[m_0 = \mathbb{1}\!\left[\sum_{i=1}^{N} s_i(0) > \frac{N}{2}\right]\]
i.e. the most common vote at \(t = 0\).
A consensus is said to be reached at time \(t\) if all agents share the same vote:
\[s_1(t) = s_2(t) = \cdots = s_N(t)\]
An episode terminates either when consensus is reached (even if incorrect) or when the time horizon \(T\) is exceeded.
Use Cases and Complexity
While this theoretical model is not tied to any specific real-world application, its simplicity and flexibility make it a strong benchmark for evaluating decentralized, centralized, and hybrid learning strategies across varying graph sizes and topologies.
The environment has both a state space and an action space of size \(2^N\), which quickly becomes intractable as the number of agents increases. This property makes it a useful stress test for centralized methods and a valuable tool for studying scalability in multi-agent reinforcement learning.
A simple, intuitive heuristic policy is available as a baseline: at each time step, each agent adopts the majority vote of its local neighborhood. This provides a competitive reference for evaluating learned strategies.
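A minimal sketch of this heuristic is shown below. It assumes the dict-of-binary-vectors observation format returned by get_obs; the assumption that the first entry of each observation vector is the agent's own vote is illustrative, not guaranteed by the API.

```python
import numpy as np


def majority_heuristic(observations: dict[int, np.ndarray]) -> dict[int, int]:
    """Local-majority baseline: flip whenever the neighborhood majority
    disagrees with the agent's own vote (action 1 = flip, action 0 = keep).
    """
    actions = {}
    for agent, obs in observations.items():
        own_vote = int(obs[0])  # assumed layout: own vote first, then neighbors
        # Ties are broken toward 0 for simplicity.
        neighborhood_majority = 1 if obs.sum() > len(obs) / 2 else 0
        actions[agent] = int(own_vote != neighborhood_majority)
    return actions
```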
Environment
- class cognac.env.BinaryConsensus.env.BinaryConsensusNetworkEnvironment(adjacency_matrix: numpy.ndarray, max_steps: int = 100, show_neighborhood_state: bool = True, reward_class: type[cognac.core.BaseReward.BaseReward] = <class 'cognac.env.BinaryConsensus.rewards.FactoredRewardModel'>, is_global_reward: bool = False)
Bases: ParallelEnv
A multi-agent reinforcement learning environment modeling binary consensus.
Agents interact on a probabilistic influence graph and attempt to reach a common binary state (0 or 1). Each agent’s action influences its own state and that of its neighbors according to the adjacency matrix.
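As an illustration, here is a minimal construction sketch. The ring topology and the 0.5 influence probabilities are arbitrary choices for this example, not library defaults.

```python
import numpy as np

from cognac.env.BinaryConsensus.env import BinaryConsensusNetworkEnvironment

N = 6
# Ring topology: each agent influences its two neighbors with probability 0.5.
# The diagonal must be zero and all entries must lie in [0, 1]
# (see _check_adjacency_matrix below).
adjacency = np.zeros((N, N))
for i in range(N):
    adjacency[i, (i - 1) % N] = 0.5
    adjacency[i, (i + 1) % N] = 0.5

env = BinaryConsensusNetworkEnvironment(adjacency_matrix=adjacency, max_steps=50)
```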
- _check_adjacency_matrix() → None
Ensure adjacency matrix has valid structure.
Raises
- AssertionError
If the diagonal is non-zero or probabilities are not in [0, 1].
- action_space(agent: int) → Discrete
Return the action space for a given agent.
Parameters
- agent : int
Index of the agent.
Returns
- Discrete
Action space with two discrete actions: 0 or 1.
- get_majority_value() → int
Compute the current majority binary value.
Returns
- int
1 if majority is 1, 0 if majority is 0, -1 if tied.
- get_obs() → Dict[int, ndarray]
Get current observations for all agents.
Returns
- dict
Observations keyed by agent index, each containing a binary vector of the agent’s neighborhood.
- metadata: dict[str, Any] = {'name': 'binary_consensus_environment_v0'}
- observation_space(agent: int) → MultiDiscrete
Return the observation space for a given agent.
Parameters
- agent : int
Index of the agent.
Returns
- MultiDiscrete
Observation space describing possible binary observations from the agent’s neighborhood.
- render(save_frame: bool = False, fig=None, ax=None) → None
Render the current state of the environment.
Parameters
- save_frame : bool, optional
Whether to save the current frame as an image. Defaults to False.
- fig : matplotlib.figure.Figure, optional
Figure for rendering. Defaults to None.
- ax : matplotlib.axes.Axes, optional
Axes for rendering. Defaults to None.
- reset(seed: int | None = None, options: Dict | None = None) → Tuple[Dict[int, ndarray], Dict[int, dict]]
Reset the environment to its initial state.
Parameters
- seed : int, optional
Random seed for reproducibility.
- options : dict, optional
Options for reset (e.g. “init_vect” for setting an initial state).
Returns
- observations : dict
Observations for each agent after reset.
- infos : dict
Info dictionaries for each agent.
- state() → ndarray
Get the internal environment state.
Returns
- np.ndarray
Array of current binary states for all agents.
- step(actions: Dict[int, int]) → Tuple[Dict[int, ndarray], Dict[int, float], Dict[int, bool], Dict[int, bool], Dict[int, dict]]
Perform one environment step using the given agent actions.
Parameters
- actions : dict
Mapping from agent index to their binary action (0 or 1).
Returns
- observations : dict
New observations for each agent.
- rewards : dict
Rewards assigned to each agent or shared globally.
- terminations : dict
Flags indicating whether each agent’s episode is terminated.
- truncations : dict
Flags indicating whether each agent’s episode is truncated.
- infos : dict
Additional metadata for each agent.
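A minimal rollout sketch using reset and step, assuming the environment instance built in the earlier construction example and a uniformly random placeholder policy:

```python
observations, infos = env.reset(seed=0)

done = False
while not done:
    # Placeholder policy: sample a binary action for every agent.
    actions = {agent: env.action_space(agent).sample() for agent in observations}
    observations, rewards, terminations, truncations, infos = env.step(actions)
    # Stop once every agent reports termination (consensus) or truncation (horizon).
    done = all(terminations.values()) or all(truncations.values())

print("Final joint state:", env.state())
```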
Rewards
The default reward model here is the FactoredRewardModel. This reward gives a penalty to each agent at each step for disagreeing with the current majority (which does not necessarily match the objective consensus). At the terminal state, it gives a large reward for reaching consensus, weighted by the time it took to reach it (the faster, the better). If consensus is not reached and the game hits the maximum horizon, it gives a large negative reward weighted by the distance to consensus.
More formally, the reward model works like this:
During an episode: each agent gets a local reward at each step, penalizing disagreement with the current majority vote.
At episode end:
Let \(\tau = \frac{t_{\max} - t_{\rm final}}{t_{\max}}\) be the temporal weight factor in the reward, where \(t_{\max}\) is the maximum length of an episode and \(t_{\rm final}\) is the actual terminal timestep. This temporal factor thus decreases linearly from 1 to 0 over an episode.
Let \(\xi\) be a penalty term that is added whenever the consensus is not reached at the end of an episode.
Then the final reward is computed using the ratio to the consensus \(x_{\rm final}/N\), \(x_{\rm final}\) being the number of agents agreeing with the objective value at the terminal step.
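Putting these pieces together, one plausible form of the terminal reward consistent with the description above is sketched below; this exact combination is an illustrative assumption, and the authoritative expression lives in FactoredRewardModel:
\[R_{\rm final} = \begin{cases} R_{\max} \cdot \dfrac{x_{\rm final}}{N} \cdot \tau & \text{if consensus is reached,} \\[4pt] R_{\min} \cdot \left(1 - \dfrac{x_{\rm final}}{N}\right) - \xi & \text{if the horizon is exceeded,} \end{cases}\]
where \(R_{\max}\) and \(R_{\min}\) correspond to the max_reward and min_reward parameters documented below.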
- class cognac.env.BinaryConsensus.rewards.FactoredRewardModel(max_reward: float = 100.0, min_reward: float = -100.0)
Bases: BaseReward
Reward model that encourages agents to reach consensus with the majority.
This reward model assigns rewards based on local agreement with the current majority state. Upon episode termination, a factored reward is computed based on the proportion of agents in consensus and the remaining time steps.
This model supports both local (per-agent) and global reward configurations.
Parameters
- max_reward : float, optional
Maximum reward achievable at full consensus. Default is 100.0.
- min_reward : float, optional
Penalty applied when the episode is truncated before consensus. Default is -100.0.
- class cognac.env.BinaryConsensus.rewards.RewardWInitTarget(max_reward: float = 100.0, min_reward: float = -10.0)
Bases: BaseReward
Reward model encouraging agents to converge on the initial majority state.
This reward model stores the initial majority value after reset, then provides:
- A large positive reward if the final state reaches full consensus on the initial value.
- A large penalty if consensus is not reached or if the consensus is on the wrong value.
- Stepwise feedback (+1 or -1) during the episode based on agreement with the target.
Parameters
- max_reward : float, optional
Reward for reaching full consensus on the initial majority value. Default is 100.0.
- min_reward : float, optional
Penalty for failing to reach the correct consensus. Default is -10.0.
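A brief usage sketch showing how an alternative reward model can be plugged in through the environment's reward_class argument, reusing the adjacency matrix from the earlier construction example:

```python
from cognac.env.BinaryConsensus.env import BinaryConsensusNetworkEnvironment
from cognac.env.BinaryConsensus.rewards import RewardWInitTarget

# Swap the default FactoredRewardModel for the initial-majority-target reward.
env = BinaryConsensusNetworkEnvironment(
    adjacency_matrix=adjacency,
    max_steps=50,
    reward_class=RewardWInitTarget,
)
```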