Flash Attention - II
This chapter assumes you are familiar with the attention mechanism. If not, please see this Andrej Karpathy video, which walks through modelling and training GPT-2 from the ground up. This chapter comprises:
- Attention PyTorch code.
- Attention from scratch using CUDA.
The Vanilla Self-Attention: Math Refresher
Before we look at the code, let’s quickly recap the math for a single attention head.
Given Query ($Q$), Key ($K$), and Value ($V$) matrices for a sequence of length $T$ and feature dimension $d_k$ (per head):
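The head output is the standard scaled dot-product attention:

$$
\text{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
$$

Here $QK^{\top}$ is a $T \times T$ matrix of similarity scores, the division by $\sqrt{d_k}$ keeps the scores in a range where the softmax is well behaved, and the row-wise softmax turns each row into attention weights that are applied to $V$.

As a rough sketch of this computation for a single head (not the chapter's actual code, which lives in the linked post; causal masking is omitted for brevity):

```python
import torch
import torch.nn.functional as F

def vanilla_attention(Q, K, V):
    # Q, K, V: (T, d_k) tensors for one head.
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / (d_k ** 0.5)  # (T, T) similarity scores
    weights = F.softmax(scores, dim=-1)              # row-wise softmax
    return weights @ V                               # (T, d_k) weighted sum of values

# Example with T = 8 tokens and d_k = 64 features per head.
T, d_k = 8, 64
Q, K, V = torch.randn(T, d_k), torch.randn(T, d_k), torch.randn(T, d_k)
out = vanilla_attention(Q, K, V)
print(out.shape)  # torch.Size([8, 64])
```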
To read more, click the link below. Since writing math in plain HTML is time-consuming, I use jupyter-book for writing my blogs. (For comfortable reading, please toggle the sidebar.)
https://yogheswaran-a.github.io/cuda-notes/04-flash-attention-II.html
More CUDA blogs:
https://yogheswaran-a.github.io/cuda-notes/00-landing-page.html