Mixture of Experts (MoE) is a model design in which only a subset of specialized sub-networks (experts) is activated per input. This allows much larger models with sparse computation, improving efficiency because each token only pays for the experts it is routed to. While not the focus of this article, MoE appears in advanced Transformer variants, where it typically replaces the feed-forward sublayer of a block so that different experts process different tokens. A small gating network scores the experts for each input and selects which ones to activate. This strategy scales Transformers to very large parameter counts while keeping per-token compute roughly constant, and it shows how the standard Transformer block can be extended in more complex architectures. Understanding MoE helps contextualize future directions of Transformers.
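As a rough illustration, the sketch below implements a top-k gated MoE feed-forward layer in PyTorch. The class name `MoELayer`, the dimensions, and the dense per-expert loop are illustrative assumptions, not a reference implementation; production systems dispatch tokens to experts sparsely for efficiency.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Sparse Mixture-of-Experts feed-forward layer with top-k gating (illustrative sketch)."""

    def __init__(self, d_model, d_hidden, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Each expert is an independent feed-forward network.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])
        # The gating network scores every expert for each token.
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x):                          # x: (batch, seq_len, d_model)
        scores = self.gate(x)                      # (batch, seq_len, num_experts)
        weights, indices = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # normalize over the selected experts only
        out = torch.zeros_like(x)
        # Dense loop over experts for clarity; tokens not routed to an expert get weight 0.
        for e, expert in enumerate(self.experts):
            mask = (indices == e)                  # (batch, seq_len, top_k) boolean routing mask
            if mask.any():
                w = (weights * mask).sum(dim=-1, keepdim=True)  # gate weight for expert e
                out = out + w * expert(x)
        return out

# Usage: output keeps the input shape, so the layer drops into a Transformer block.
layer = MoELayer(d_model=64, d_hidden=256)
x = torch.randn(2, 10, 64)
y = layer(x)                                       # (2, 10, 64)
```

Routing only the top-k experts per token is what keeps the computation sparse: parameter count grows with the number of experts, while per-token compute stays close to that of a single feed-forward network.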