SoLU

Softmax Linear Unit (SoLU) - A PyTorch implementation.

This package implements the Softmax Linear Unit activation function as described in https://www.anthropic.com/research/softmax-linear-units. SoLU multiplies the input element-wise by its own softmax, f(x) = x * softmax(x), so small activations along the softmax dimension are suppressed relative to large ones. The referenced work proposes it as a way to make MLP neurons in transformers easier to interpret.

Example:
>>> import torch
>>> from SoLU import SoLU, SoLULayer
>>> x = torch.randn(2, 5, 4)
>>> solu = SoLU()
>>> output = solu(x)
>>> layer = SoLULayer(hidden_size=4)
>>> output = layer(x)
"""Softmax Linear Unit (SoLU) - A PyTorch implementation.

This package implements the Softmax Linear Unit activation function
as described in https://www.anthropic.com/research/softmax-linear-units.
SoLU multiplies the input element-wise by its own softmax,
f(x) = x * softmax(x), so small activations along the softmax
dimension are suppressed relative to large ones. The referenced work
proposes it as a way to make MLP neurons in transformers easier to
interpret.

Example:
    >>> import torch
    >>> from SoLU import SoLU, SoLULayer
    >>> x = torch.randn(2, 5, 4)
    >>> solu = SoLU()
    >>> output = solu(x)
    >>> layer = SoLULayer(hidden_size=4)
    >>> output = layer(x)
"""

from .module import SoLU
from .layers import SoLULayer

__all__ = ["SoLU", "SoLULayer"]
class SoLU(torch.nn.modules.module.Module):
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoLU(nn.Module):
    """Softmax Linear Unit (SoLU) activation function.

    The SoLU activation function computes a softmax along a specified
    dimension and multiplies the result element-wise with the input tensor:

        f(x) = x * softmax(x, dim=dim)

    Because the softmax weights lie in (0, 1] and sum to 1 along ``dim``,
    each output keeps the sign of its input and is no larger in magnitude.
    The largest entries along ``dim`` are attenuated least.

    Attributes:
        dim: The dimension along which to apply softmax.

    Example:
        >>> import torch
        >>> from SoLU import SoLU
        >>> solu = SoLU(dim=-1)
        >>> x = torch.randn(2, 5, 4)
        >>> output = solu(x)
        >>> assert output.shape == x.shape
    """

    def __init__(self, dim: int = -1):
        """Initialize the SoLU activation function.

        Args:
            dim: The dimension along which to apply softmax. Defaults to -1
                (the last dimension), which is typically the feature dimension
                in transformer architectures.
        """
        super().__init__()
        self.dim = dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Apply the SoLU activation to the input tensor.

        Args:
            x: Input tensor of any shape. The softmax operation will be
                applied along the dimension specified in ``self.dim``.

        Returns:
            A tensor with the same shape as ``x``, where each element is
            the product of the corresponding input element and its softmax
            weight along ``self.dim``.
        """
        # Element-wise product of the input and its softmax along self.dim.
        return x * F.softmax(x, dim=self.dim)

Softmax Linear Unit (SoLU) activation function.

The SoLU activation function computes a softmax along a specified dimension and multiplies the result element-wise with the input tensor:

f(x) = x * softmax(x, dim=dim)

Because the softmax weights lie in (0, 1] and sum to 1 along dim, each output keeps the sign of its input and is no larger in magnitude. The largest entries along dim are attenuated least.

Attributes:
  • dim: The dimension along which to apply softmax.
Example:
>>> import torch
>>> from SoLU import SoLU
>>> solu = SoLU(dim=-1)
>>> x = torch.randn(2, 5, 4)
>>> output = solu(x)
>>> assert output.shape == x.shape
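
As a hand-checkable illustration of f(x) = x * softmax(x), the sketch below evaluates the formula on a single vector in plain Python, without PyTorch. The helper name solu_1d is hypothetical and is not part of this package.

```python
import math


def solu_1d(values):
    """Compute x * softmax(x) for a flat list of floats (illustrative helper)."""
    # Subtract the maximum before exponentiating; softmax is unchanged by this shift.
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [v * e / total for v, e in zip(values, exps)]


# softmax([0, 0, ln 2]) = [1/4, 1/4, 1/2], so SoLU gives [0, 0, ln(2) / 2].
x = [0.0, 0.0, math.log(2.0)]
y = solu_1d(x)
```

The input was chosen so that the softmax weights are simple fractions; any other vector works the same way.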
SoLU(dim: int = -1)

Initialize the SoLU activation function.

Arguments:
  • dim: The dimension along which to apply softmax. Defaults to -1 (the last dimension), which is typically the feature dimension in transformer architectures.
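
To make the role of dim concrete, the sketch below applies the same formula along two different axes of a small matrix, calling torch.nn.functional.softmax directly rather than importing this package. It assumes PyTorch is installed; the variable names are illustrative.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[1.0, 2.0, 3.0],
                  [1.0, 1.0, 1.0]])

# dim=-1: softmax weights are normalized within each row.
row_weights = F.softmax(x, dim=-1)
# dim=0: softmax weights are normalized within each column.
col_weights = F.softmax(x, dim=0)

solu_rows = x * row_weights
solu_cols = x * col_weights

# The second row is constant, so each of its row-wise weights is 1/3.
print(solu_rows[1])
```

Both results have the same shape as x, but they differ in value because the weights are normalized over different groups of elements.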
dim
def forward(self, x: torch.Tensor) -> torch.Tensor:

Apply the SoLU activation to the input tensor.

Arguments:
  • x: Input tensor of any shape. The softmax operation will be applied along the dimension specified in self.dim.
Returns:

A tensor with the same shape as x, where each element is the product of the corresponding input element and its softmax weight along self.dim.
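
The properties of the returned tensor can be checked directly. The sketch below, which assumes PyTorch is installed and uses the same expression as forward instead of importing this package, confirms that the output keeps the input's shape and sign and never exceeds it in magnitude.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(2, 5, 4)
output = x * F.softmax(x, dim=-1)  # same expression as in forward

same_shape = output.shape == x.shape
# Softmax weights are in (0, 1], so outputs never exceed inputs in magnitude...
bounded = bool(torch.all(output.abs() <= x.abs()))
# ...and each output keeps the sign of its input.
same_sign = bool(torch.all(torch.sign(output) == torch.sign(x)))
```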

class SoLULayer(torch.nn.modules.module.Module):
import torch
import torch.nn as nn

# Assumed import path, matching the package's ``from .module import SoLU``.
from .module import SoLU


class SoLULayer(nn.Module):
    """A neural network layer combining SoLU activation with LayerNorm.

    This layer applies SoLU followed by LayerNorm, the block the referenced
    SoLU work uses in transformer MLP layers to recover model performance:

        f(x) = LayerNorm(SoLU(x))

    LayerNorm rescales the SoLU output to zero mean and unit variance over
    the last ``hidden_size`` elements, then applies its learned scale and
    bias. It always acts on the last dimension, independent of ``dim``.

    Attributes:
        solu: The SoLU activation function module.
        layer_norm: A LayerNorm module that normalizes over the last
            dimension, of size ``hidden_size``.

    Example:
        >>> import torch
        >>> from SoLU import SoLULayer
        >>> layer = SoLULayer(hidden_size=4)
        >>> x = torch.randn(2, 5, 4)  # batch_size=2, seq_len=5, hidden_size=4
        >>> output = layer(x)
        >>> assert output.shape == x.shape
    """

    def __init__(self, hidden_size: int, dim: int = -1):
        """Initialize the SoLULayer.

        Args:
            hidden_size: The size of the last (hidden) dimension, passed to
                LayerNorm as its ``normalized_shape``.
            dim: The dimension along which to apply softmax in the SoLU
                activation. Defaults to -1 (the last dimension).
        """
        super().__init__()
        self.solu = SoLU(dim=dim)
        self.layer_norm = nn.LayerNorm(hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Apply the SoLU activation and LayerNorm to the input tensor.

        Args:
            x: Input tensor of shape ``(*, hidden_size)`` where ``hidden_size``
                matches the ``hidden_size`` passed to the constructor.

        Returns:
            A tensor with the same shape as ``x``, after applying SoLU
            activation followed by layer normalization.
        """
        x = self.solu(x)
        return self.layer_norm(x)

A neural network layer combining SoLU activation with LayerNorm.

This layer applies SoLU followed by LayerNorm, the block the referenced SoLU work uses in transformer MLP layers to recover model performance:

f(x) = LayerNorm(SoLU(x))

LayerNorm rescales the SoLU output to zero mean and unit variance over the last hidden_size elements, then applies its learned scale and bias. It always acts on the last dimension, independent of dim.

Attributes:
  • solu: The SoLU activation function module.
  • layer_norm: A LayerNorm module that normalizes over the last dimension, of size hidden_size.
Example:
>>> import torch
>>> from SoLU import SoLULayer
>>> layer = SoLULayer(hidden_size=4)
>>> x = torch.randn(2, 5, 4)  # batch_size=2, seq_len=5, hidden_size=4
>>> output = layer(x)
>>> assert output.shape == x.shape
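
The composition f(x) = LayerNorm(SoLU(x)) can also be written out step by step, which makes it easy to see what LayerNorm does to the SoLU output. This is a sketch that assumes PyTorch is installed and a freshly initialized LayerNorm (scale 1, bias 0); it does not import this package.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
hidden_size = 4
x = torch.randn(2, 5, hidden_size)

layer_norm = nn.LayerNorm(hidden_size)  # default init: weight=1, bias=0

activated = x * F.softmax(x, dim=-1)  # SoLU along the last dimension
output = layer_norm(activated)        # normalize over the last dimension

# With the default parameters, each length-4 feature vector has mean ~0.
feature_means = output.mean(dim=-1)
```

After training, the learned scale and bias will generally move the per-vector mean away from zero, so this check only holds at initialization.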
SoLULayer(hidden_size: int, dim: int = -1)

Initialize the SoLULayer.

Arguments:
  • hidden_size: The size of the last (hidden) dimension, passed to LayerNorm as its normalized_shape.
  • dim: The dimension along which to apply softmax in the SoLU activation. Defaults to -1 (the last dimension).
solu
layer_norm
def forward(self, x: torch.Tensor) -> torch.Tensor:

Apply the SoLU activation and LayerNorm to the input tensor.

Arguments:
  • x: Input tensor of shape (*, hidden_size) where hidden_size matches the hidden_size passed to the constructor.
Returns:

A tensor with the same shape as x, after applying SoLU activation followed by layer normalization.
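
The shape requirement on x comes from LayerNorm, which checks the trailing dimension against hidden_size. The sketch below, assuming PyTorch is installed, shows the error raised when the two differ. It uses nn.LayerNorm directly rather than this package.

```python
import torch
import torch.nn as nn

layer_norm = nn.LayerNorm(4)  # plays the role of hidden_size=4

ok = layer_norm(torch.randn(2, 5, 4))  # last dimension matches

try:
    layer_norm(torch.randn(2, 5, 3))  # last dimension is 3, not 4
    mismatch_error = None
except RuntimeError as err:
    # LayerNorm rejects inputs whose trailing shape differs from normalized_shape.
    mismatch_error = err
```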