Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention (arxiv.org)
Long-context modeling is crucial for next-generation language models, yet the quadratic cost of standard attention mechanisms poses a significant computational challenge.