Introduction
Activation functions are the unsung heroes of deep learning. While much attention goes to architecture design and training techniques, the choice of activation function profoundly impacts model performance, training stability, and computational efficiency. Over the past decade, we've witnessed a fascinating evolution from simple ReLU to sophisticated gated variants like SwiGLU that power today's largest language models.
In this post, we'll explore the most important activation functions in modern deep learning, understand why certain choices dominate in specific architectures, and examine empirical data on their performance, sparsity patterns, and computational costs.
The Basics: What Makes a Good Activation Function?
Before diving into specific functions, let's establish what we want from an activation function:
- Non-linearity: Essential for learning complex patterns (otherwise, deep networks collapse to linear models; see the quick check after this list)
- Gradient flow: Gradients should propagate effectively through deep networks
- Computational efficiency: Fast to compute in both forward and backward passes
- No upper saturation: Output can grow without hitting a ceiling, avoiding vanishing gradients (functions like ReLU grow linearly for positive inputs, unlike sigmoid which saturates at 1)
- Smoothness: Aids optimization (though not strictly necessary, as the non-smooth but continuous ReLU shows)
- Sparsity (optional): Can improve efficiency and interpretability
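To make the non-linearity point concrete, here is a quick PyTorch check (a minimal sketch; the layer sizes are arbitrary) that two stacked linear layers with no activation in between collapse into a single linear map:
import torch
import torch.nn as nn
torch.manual_seed(0)
f = nn.Sequential(nn.Linear(8, 32, bias=False), nn.Linear(32, 8, bias=False))  # no activation in between
W_combined = f[1].weight @ f[0].weight                    # the equivalent single weight matrix
x = torch.randn(4, 8)
print(torch.allclose(f(x), x @ W_combined.T, atol=1e-5))  # True: the two-layer stack is just one linear layer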
ReLU: The Foundation
Definition
ReLU(x) = max(0, x)
Why It Won
Popularized by Nair & Hinton (2010) and AlexNet (2012), ReLU revolutionized deep learning by largely avoiding the vanishing gradient problem that plagued sigmoid and tanh activations:
- No saturation for positive values: Gradient is 1 for x > 0, enabling deep networks
- Computational simplicity: Just a comparison and zeroing operation
- Sparsity: Zero neurons can enable sparse computation optimizations (though standard dense implementations still compute zeros)
- Scale invariance: ReLU(αx) = αReLU(x) for α > 0
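Two of these properties are easy to verify numerically; here is a minimal sketch (the standard-normal input is arbitrary):
import torch
import torch.nn.functional as F
x = torch.randn(10_000)
print((F.relu(x) == 0).float().mean())                   # ≈ 0.5: about half the activations are exact zeros
print(torch.allclose(F.relu(3.0 * x), 3.0 * F.relu(x)))  # True: ReLU(αx) = αReLU(x) for α > 0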
Limitations
Despite its success, ReLU has known issues:
- Dying ReLU problem: Neurons whose pre-activations become negative for every input output zero, receive zero gradient, and may never recover
- Not zero-centered: Outputs are always non-negative, which can slow learning
- Non-smooth at zero: The kink at x = 0 can cause optimization challenges
Widespread Adoption
ReLU became and remains the default choice for many architectures:
- CNNs: Standard activation in ResNets, VGG, and most convolutional architectures for computer vision
- Simplicity: Easy to implement and debug, contributing to its widespread adoption
- Hardware optimization: Modern accelerators (GPUs, TPUs) have highly optimized implementations
GELU: Smooth Approximation
Definition
Gaussian Error Linear Unit (GELU), introduced by Hendrycks & Gimpel (2016), provides a smooth approximation to ReLU:
GELU(x) = x · Φ(x)
where Φ(x) is the cumulative distribution function of the standard normal distribution
# Approximation often used in practice:
GELU(x) ≈ 0.5 * x * (1 + tanh(√(2/π) * (x + 0.044715 * x³)))
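To see how close the tanh approximation is, you can compare it against PyTorch's exact GELU on a grid of inputs (a small sketch; the grid is arbitrary):
import math
import torch
import torch.nn.functional as F
x = torch.linspace(-5, 5, steps=1001)
exact = F.gelu(x)                                   # x · Φ(x), computed via the error function
approx = 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x ** 3)))
print((exact - approx).abs().max())                 # tiny maximum deviation (on the order of 1e-3 or less)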
Key Properties
- Smooth everywhere: No kink at zero, better for optimization
- Better gradient flow: Non-zero gradients for negative inputs (unlike ReLU, whose gradient is zero for x < 0)
Transformers Loved GELU
GELU became the standard activation of the first wave of transformer models, including BERT, GPT-2, GPT-3, and the original Vision Transformer (ViT).
Empirical Performance
Practical considerations:
- Compute overhead: More expensive than ReLU due to computing the Gaussian CDF (though approximations help)
- Sparsity: Lower activation sparsity compared to ReLU since GELU doesn't zero out negative values completely
SiLU/Swish: Self-Gated Activation
Definition
Swish (Ramachandran et al., 2017), also known as SiLU (Sigmoid Linear Unit), is a simple self-gated activation:
Swish(x) = x · σ(x) = x / (1 + exp(-x))
# Parameterized version (β typically learned or set to 1):
Swish_β(x) = x · σ(βx)
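When β is learned rather than fixed, it can be wrapped in a tiny module; the class name below is illustrative:
import torch
import torch.nn as nn
class SwishBeta(nn.Module):
    def __init__(self, beta_init: float = 1.0):
        super().__init__()
        self.beta = nn.Parameter(torch.tensor(beta_init))  # trained jointly with the rest of the network
    def forward(self, x):
        return x * torch.sigmoid(self.beta * x)            # β = 1 recovers plain SiLU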
Key Properties
- Smooth and non-monotonic: Similar benefits to GELU
- Self-gating (bounded below, unbounded above): The input x controls how much of itself passes through via the sigmoid σ(x). When x is large and positive, σ(x) ≈ 1 so the output ≈ x (linear growth, unbounded). When x is large and negative, σ(x) ≈ 0 so the output ≈ 0 (bounded below). This creates adaptive behavior where the activation "decides" based on the input value itself.
- Nearly identical to GELU: Very similar shapes and empirical performance
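That last point is easy to check numerically (a quick sketch):
import torch
import torch.nn.functional as F
x = torch.linspace(-6, 6, steps=1001)
print((F.silu(x) - F.gelu(x)).abs().max())   # largest gap ≈ 0.2 (near |x| ≈ 2); elsewhere the curves nearly coincide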
Empirical Performance
From the Swish paper and EfficientNet:
- EfficientNet: Uses Swish throughout, achieving state-of-the-art efficiency at the time of publication
- Compute overhead: Costlier than ReLU due to the sigmoid; the exact overhead is kernel-dependent
- Sparsity: Produces dense activations (no hard zeros)
GLU: The Gating Revolution
Definition
Gated Linear Units (GLU), introduced by Dauphin et al. (2017), brought explicit gating to activations:
GLU(x, W, V, b, c) = (xW + b) ⊗ σ(xV + c)
where ⊗ means element-wise multiplication (multiply corresponding elements)
# In practice, for input x ∈ ℝ^d, GLU works as follows:
# 1. Linear projection: split input into two parts using different weight matrices
# - Value path: xW + b (what information to pass)
# - Gate path: xV + c (how much to pass)
# 2. Apply sigmoid to gate: σ(xV + c) gives values between 0 and 1
# 3. Element-wise multiply: (xW + b) ⊗ σ(xV + c)
# - Each element in the value is multiplied by corresponding gate value
# - Gate ≈ 0 blocks information, gate ≈ 1 allows it through
# Simplified notation when input is pre-split into [x₁, x₂]:
GLU([x₁, x₂]) = x₁ ⊗ σ(x₂)
# x₁ is the value (what), σ(x₂) is the gate (how much)
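The same computation, spelled out with plain tensors (a minimal sketch; dimensions and random weights are purely illustrative):
import torch
d_in, d_out = 8, 16
x = torch.randn(4, d_in)                  # a batch of 4 inputs
W, V = torch.randn(d_in, d_out), torch.randn(d_in, d_out)
b, c = torch.zeros(d_out), torch.zeros(d_out)
value = x @ W + b                         # what information to pass
gate = torch.sigmoid(x @ V + c)           # how much of it to pass, each entry in (0, 1)
out = value * gate                        # element-wise gating, shape (4, d_out)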
Why Gating Matters
GLU introduced a key insight: learned gating mechanisms can control information flow:
- Selective activation: Gates determine which information passes through
- Non-linear interactions: Multiplicative gating creates richer representations
- Better gradient flow: Linear path through one branch aids backpropagation
- Adaptive behavior: Gates learn context-dependent activation patterns
Empirical Performance
From the original GLU paper:
- Parameter efficiency: Better performance with same or fewer parameters
- Compute overhead: At the same hidden width, gating adds a second input projection, so more parameters and higher FLOPs than a standard (non-gated) layer
GLU Variants: ReGLU, SwiGLU, and GeGLU
The Variant Zoo
The success of GLU inspired a natural question: what if we replace the sigmoid function in GLU with other activations? Shazeer (2020) systematically explored this idea:
# General form of GLU variants:
GLU_variant([x₁, x₂]) = x₁ ⊗ activation(x₂)
# Specific variants:
GLU([x₁, x₂]) = x₁ ⊗ σ(x₂) # Original (sigmoid)
ReGLU([x₁, x₂]) = x₁ ⊗ ReLU(x₂) # ReLU gating
GeGLU([x₁, x₂]) = x₁ ⊗ GELU(x₂) # GELU gating
SwiGLU([x₁, x₂]) = x₁ ⊗ Swish(x₂) # Swish gating
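In a transformer FFN these variants differ only in the gate activation, so they can all be expressed with one small module; the sketch below uses assumed names and arbitrary sizes:
import torch
import torch.nn as nn
import torch.nn.functional as F
class GLUVariantFFN(nn.Module):
    """FFN of the form out_proj(value_proj(x) ⊗ act(gate_proj(x)))."""
    def __init__(self, d_model, d_ff, gate_act):
        super().__init__()
        self.value_proj = nn.Linear(d_model, d_ff, bias=False)  # x₁: what to pass
        self.gate_proj = nn.Linear(d_model, d_ff, bias=False)   # x₂: how much to pass
        self.out_proj = nn.Linear(d_ff, d_model, bias=False)
        self.gate_act = gate_act
    def forward(self, x):
        return self.out_proj(self.value_proj(x) * self.gate_act(self.gate_proj(x)))
# gate_act selects the variant: torch.sigmoid → GLU, F.relu → ReGLU, F.gelu → GeGLU, F.silu → SwiGLU
swiglu_ffn = GLUVariantFFN(d_model=512, d_ff=1365, gate_act=F.silu)  # d_ff often shrunk vs a standard FFN to match parameters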
Empirical Findings
Shazeer's experiments tested these variants in the feedforward layers of Transformer models and found that some variants (particularly GeGLU and SwiGLU) yield quality improvements over the typically-used ReLU or GELU activations.
Adoption in Modern LLMs
Both GeGLU and SwiGLU have been adopted in state-of-the-art language models (see the architecture comparison below).
The Shazeer (2020) paper notes that GeGLU and SwiGLU produce the best perplexities among the tested variants, and somewhat humorously attributes their success "to divine benevolence," acknowledging that the precise reasons for their effectiveness remain unclear.
Why Gated Activations Work
The success of GLU variants (both SwiGLU and GeGLU) can be understood through several mechanisms:
- Gating mechanism: Allows the network to dynamically control information flow, selectively amplifying important features
- Smooth gradients: Both Swish and GELU provide smooth, differentiable functions that help gradient flow in deep networks
- Increased capacity: The gating structure adds a third projection matrix to the FFN (roughly 1.5x the parameters at the same hidden width), improving expressiveness; in practice the hidden width is often scaled down to about 2/3 to keep parameter counts comparable
- Non-monotonicity: Both activation functions are non-monotonic, allowing richer feature representations
Note: The choice between SwiGLU and GeGLU appears largely empirical, with both delivering strong performance. Different research groups and model families have converged on different choices, suggesting the differences may be marginal in practice.
ReLU²: An Alternative to Gated Activations
While GLU variants have gained popularity, they add an extra projection matrix, and the associated compute, to every FFN. For scenarios where this cost is prohibitive, ReLU² (Squared ReLU) offers an alternative that improves upon standard ReLU without any parameter increase.
Definition
Squared ReLU, introduced in the Primer paper (So et al., 2021):
ReLU²(x) = (max(0, x))²
Key Properties
- Increased non-linearity: Squaring amplifies large activations, providing richer representations
- Maintains sparsity: Still zero for negative inputs, enabling sparse computation
- No parameter overhead: Unlike GLU variants, doesn't double parameters
- Simple implementation: Just ReLU followed by squaring, minimal compute cost
Use Cases and Recent Adoption
- Primer (So et al., 2021): Uses ReLU² in feedforward blocks for efficient transformers
- ReLU² Wins (Zhang et al., 2024): Systematic study showing ReLU² excels in sparse LLMs across sparsity-performance trade-offs, sparsity predictivity, and hardware affinity
- PLM-1.8B (2025): 1.8B parameter model using squared ReLU with multi-head latent attention, optimized for edge deployment (mobile phones, Raspberry Pi)
- Accelerating Transformer Inference and Training with 2:4 Activation Sparsity (Haziza et al., 2025): Leverages squared ReLU's inherent sparsity for GPU-accelerated 2:4 sparsity patterns, achieving 1.3x faster FFN in both forward and backward passes
- Sparse computation research: ReLU² enables efficient inference through activation sparsity without parameter overhead of GLU variants
Trade-off: ReLU² offers improvements over ReLU with minimal cost, but GLU variants generally provide better performance when parameters and compute are available.
Comparative Analysis: Architecture-Specific Choices
Convolutional Neural Networks (CNNs)
Standard choice: ReLU or Swish
- ResNets: ReLU remains dominant for its simplicity and speed
- EfficientNets: Swish provides marginal improvements worth the cost
- Compute-constrained: ReLU is preferred for its efficiency
- High-capacity models: Swish or GELU can help squeeze out extra performance
Transformer Language Models
Evolution: GELU → GLU Variants
Early models (2018-2020):
- BERT (2018) - GELU
- GPT-2 (2019) - GELU
- GPT-3 (175B, 2020) - GELU
Modern LLMs (2022+):
- PaLM (540B, 2022) - SwiGLU
- LLaMA 2 (7B-70B, 2023) - SwiGLU
- Gemma (2B-7B, 2024) - GeGLU
- Qwen2 (0.5B-72B, 2024) - SwiGLU
Why the shift to GLU variants:
- Better performance: Consistent improvements over GELU/ReLU justify the extra FFN projection at scale
- Gating mechanism: Dynamic information flow control becomes more valuable in larger models
- Scaling benefits: Performance gains compound with model size
Vision Transformers
Evolution: GELU → SwiGLU (following language models)
Classic Vision Transformers (GELU):
- ViT (2020) - GELU
- CLIP (2021) - GELU
- DINOv2 (2023) - GELU
- SAM (Segment Anything, 2023) - GELU
Diffusion Transformers:
- DiT (Diffusion Transformer, 2023) - GELU/SiLU
- Hunyuan-DiT (2024) - GELU-approximate
Recent Vision Models with SwiGLU (2024-2025):
- BAGEL (ByteDance, 2025) - SwiGLU in vision encoders and LLM backbone
- Llama 3.2-Vision (Meta, 2024) - SwiGLU (32-layer vision encoder + adapter)
- LoLA-SpecViT (2025) - SwiGLU for hyperspectral imaging
- Enhanced Swin Transformer (Pacal et al., 2024) - SwiGLU-based MLP for skin cancer diagnosis
The shift to SwiGLU in vision:
- Following LLM success: Modern multimodal models inherit SwiGLU from their language model backbones
- Unified architectures: Vision-language models benefit from consistent activation functions across modalities
- Empirical improvements: Studies show SwiGLU improves accuracy, training speed, and parameter efficiency
- Recent trend (2024+): New vision transformers increasingly adopt SwiGLU, especially in multimodal contexts
Note: While GELU dominated early vision transformers (2020-2023), the field is shifting toward SwiGLU in modern architectures (2024+), particularly in multimodal and specialized vision models.
Practical Considerations
Sparsity Patterns
- Exact activation sparsity: ReLU/ReLU² induce hard zeros
- Dense activations: GELU/SiLU/SwiGLU generally produce non-zero outputs (no hard zeros).
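The gap shows up immediately if you measure the fraction of exact zeros on random inputs (a small sketch; exact zeros are what sparse kernels can exploit):
import torch
import torch.nn.functional as F
x = torch.randn(100_000)
for name, fn in [("ReLU", F.relu), ("ReLU²", lambda t: F.relu(t) ** 2), ("GELU", F.gelu), ("SiLU", F.silu)]:
    print(name, f"{(fn(x) == 0).float().mean().item():.1%} exact zeros")  # ≈ 50% for ReLU/ReLU², ≈ 0% for GELU/SiLU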
Implementation Tips
# PyTorch implementations
import torch.nn.functional as F

# GELU (use the native implementation)
output = F.gelu(x)

# Swish/SiLU (use the native implementation)
output = F.silu(x)

# SwiGLU (typical implementation in transformer FFNs)
def swiglu(x):
    # split the last dimension into a value half and a gate half, then gate with SiLU
    x, gate = x.chunk(2, dim=-1)
    return x * F.silu(gate)

# Usage in FFN:
# Project to 2*d_ff, apply SwiGLU to get d_ff, then project back to d
# ffn_output = W2(swiglu(W1(x)))  # W1: d -> 2*d_ff, W2: d_ff -> d

# ReLU² (simple but effective)
output = F.relu(x) ** 2
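Putting the pieces together, a typical SwiGLU feedforward block with a fused input projection might look like this (a sketch with assumed names, mirroring the swiglu helper and the W1/W2 shapes above):
import torch
import torch.nn as nn
import torch.nn.functional as F
class SwiGLUFFN(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w1 = nn.Linear(d_model, 2 * d_ff, bias=False)  # fused value + gate projection: d -> 2*d_ff
        self.w2 = nn.Linear(d_ff, d_model, bias=False)       # output projection: d_ff -> d
    def forward(self, x):
        value, gate = self.w1(x).chunk(2, dim=-1)
        return self.w2(value * F.silu(gate))
ffn = SwiGLUFFN(d_model=512, d_ff=1365)
out = ffn(torch.randn(2, 16, 512))                           # (batch, sequence, d_model) in, same shape out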
Looking Forward: What's Next?
The evolution of activation functions continues:
- Learned activations: Meta-learning optimal activation shapes for specific tasks
- Hardware co-design: Activations optimized for specific accelerators (TPUs, specialized AI chips)
- Sparse activations: Explicit sparsity mechanisms for efficiency (e.g., top-k activation)
- Adaptive activations: Context-dependent activation functions that change based on input
- Normalization-free alternatives: Activations that provide built-in normalization properties
Practical Decision Guide
When to Use ReLU
- Training CNNs for computer vision
- Compute or memory constrained environments
- When simplicity and speed are priorities
- Inference-optimized deployment (sparsity can be exploited)
When to Use GELU/Swish
- Training transformers (when not using GLU variants)
- Vision transformers following standard practices
- When smooth activation is beneficial for optimization
- Medium to large scale models where compute cost is acceptable
When to Use SwiGLU
- Training large language models (10B+ parameters)
- When performance justifies the extra FFN projection and compute
- Following modern LLM architecture best practices
- Research or high-capacity scenarios
When to Use ReLU²
- Experimenting with alternatives to GELU in transformers
- When you want ReLU's sparsity benefits with slightly better performance
- When minimal compute overhead is critical
Conclusion
The journey from ReLU to SwiGLU reflects deep learning's evolution: from simple, sparse activations optimized for CNNs to sophisticated gated variants that excel in massive transformer models. Key takeaways:
- ReLU remains king for CNNs: Simple, fast, and effective for convolutional architectures
- GELU/Swish for smooth optimization: Better gradient flow in transformers, small but consistent improvements
- SwiGLU for large-scale LLMs: Best performance when parameters and compute are available
- Context matters: The "best" activation depends on architecture, scale, and constraints
Understanding these activation functions and their trade-offs empowers you to make informed architecture decisions. As models continue to scale and new architectures emerge, activation functions will continue evolving—but the fundamental principles of smooth gradients, computational efficiency, and expressiveness will remain central to their design.
References
Foundational Papers:
- Rectified Linear Units Improve Restricted Boltzmann Machines - Nair & Hinton, ICML 2010 (ReLU)
- Gaussian Error Linear Units (GELUs) - Hendrycks & Gimpel, arXiv 2016 (GELU)
- Searching for Activation Functions - Ramachandran et al., arXiv 2017 (Swish/SiLU)
Gated Activations:
- Language Modeling with Gated Convolutional Networks - Dauphin et al., ICML 2017 (GLU)
- GLU Variants Improve Transformer - Shazeer, arXiv 2020 (ReGLU/GeGLU/SwiGLU)
Language Models:
- PaLM: Scaling Language Modeling with Pathways - Chowdhery et al., arXiv 2022 (uses SwiGLU)
- Llama 2: Open Foundation and Fine-Tuned Chat Models - Touvron et al., arXiv 2023 (uses SwiGLU)
- Gemma: Open Models Based on Gemini Research and Technology - Gemma Team, arXiv 2024 (uses GeGLU)
- Qwen2 Technical Report - Yang et al., arXiv 2024 (uses SwiGLU)
Vision Models:
- Very Deep Convolutional Networks for Large-Scale Image Recognition - Simonyan & Zisserman, ICLR 2015 (VGG, uses ReLU)
- Deep Residual Learning for Image Recognition - He et al., CVPR 2016 (ResNet, uses ReLU)
- EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks - Tan & Le, ICML 2019 (uses Swish)
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale - Dosovitskiy et al., ICLR 2021 (ViT, uses GELU)
- Learning Transferable Visual Models From Natural Language Supervision - Radford et al., ICML 2021 (CLIP, uses GELU)
- Scalable Diffusion Models with Transformers - Peebles & Xie, ICCV 2023 (DiT, uses GELU/SiLU)
- DINOv2: Learning Robust Visual Features without Supervision - Oquab et al., arXiv 2023 (uses GELU)
- Segment Anything - Kirillov et al., ICCV 2023 (SAM, uses GELU)
- Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer - Li et al., arXiv 2024 (uses GELU)
- The Llama 3 Herd of Models - Llama Team, Meta AI, arXiv 2024 (Llama 3.2-Vision, uses SwiGLU)
- Enhancing Skin Cancer Diagnosis Using Swin Transformer with Hybrid Shifted Window-Based Multi-head Self-attention and SwiGLU-Based MLP - Pacal et al., J Imaging Inform Med 2024 (uses SwiGLU)
- Emerging Properties in Unified Multimodal Pretraining - Deng et al., arXiv 2025 (BAGEL, uses SwiGLU)
- LoLA-SpecViT: Local Attention SwiGLU Vision Transformer with LoRA - arXiv 2025 (uses SwiGLU)
Sparse and Efficient Models (ReLU²):
- Primer: Searching for Efficient Transformers for Language Modeling - So et al., NeurIPS 2021 (Squared ReLU)
- ReLU² Wins: Discovering Efficient Activation Functions for Sparse LLMs - Zhang et al., arXiv 2024
- Accelerating Transformer Inference and Training with 2:4 Activation Sparsity - Haziza et al., arXiv 2025