Single-Head and Multi-Head Self-Attention

Date:

Single-head self-attention computes scaled dot-product attention with a single set of query, key, and value projections, so every dependency must be expressed in one shared representation space. Multi-head self-attention instead runs h attention heads in parallel, each with its own projections over a smaller d_model / h subspace; the head outputs are concatenated and passed through a final linear projection. Because each head can specialize in a different kind of relationship from its own subspace, the model captures richer contextual information at roughly the same computational cost as one full-width head, which is why multi-head attention is standard in Transformer models for natural language understanding.
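The following is a minimal PyTorch sketch of both variants for reference. The class names, the fused `qkv_proj` layer, and the toy dimensions (`d_model=64`, 8 heads) are illustrative choices for this note, not a definitive implementation; the key contrast is that the multi-head module splits d_model across heads, attends per head, then concatenates and applies an output projection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleHeadSelfAttention(nn.Module):
    """One set of Q/K/V projections over the full model dimension."""
    def __init__(self, d_model):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                          # x: (batch, seq_len, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = q @ k.transpose(-2, -1) / (x.size(-1) ** 0.5)
        attn = F.softmax(scores, dim=-1)           # (batch, seq_len, seq_len)
        return attn @ v                            # (batch, seq_len, d_model)

class MultiHeadSelfAttention(nn.Module):
    """h parallel heads, each attending in a d_model // h subspace."""
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.qkv_proj = nn.Linear(d_model, 3 * d_model)  # fused Q, K, V
        self.out_proj = nn.Linear(d_model, d_model)      # mixes head outputs

    def forward(self, x):                          # x: (batch, seq_len, d_model)
        b, t, d = x.shape
        qkv = self.qkv_proj(x).reshape(b, t, 3, self.num_heads, self.d_head)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)       # each: (b, heads, t, d_head)
        scores = q @ k.transpose(-2, -1) / (self.d_head ** 0.5)
        attn = F.softmax(scores, dim=-1)           # per-head attention weights
        out = (attn @ v).transpose(1, 2).reshape(b, t, d)  # concatenate heads
        return self.out_proj(out)                  # final linear projection

x = torch.randn(2, 16, 64)                         # (batch=2, seq_len=16, d_model=64)
print(SingleHeadSelfAttention(64)(x).shape)        # torch.Size([2, 16, 64])
print(MultiHeadSelfAttention(64, num_heads=8)(x).shape)  # torch.Size([2, 16, 64])
```

Note that both modules map (batch, seq_len, d_model) to the same shape; the multi-head version differs only in reshaping Q, K, and V into per-head subspaces before attention, and the output projection is what lets information from different heads mix.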