Flash Attention - II
This chapter assumes you are familiar with the attention mechanism. If not, please see this Andrej Karpathy video, which walks through modelling and training GPT-2 from the ground up. This chapter comprises:
- Attention PyTorch code.
- Attention from scratch using CUDA.
The Vanilla Self-Attention: Math Refresher
Before we look at the code, let’s quickly recap the math for a single attention head.
Given Query ($Q$), Key ($K$), and Value ($V$) matrices for a sequence of length $T$ and feature dimension $d_k$ (per head):
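The head output is the standard scaled dot-product attention:

$$
\text{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
$$

Here $QK^{\top}$ is a $T \times T$ matrix of similarity scores, the division by $\sqrt{d_k}$ keeps the scores in a range where the softmax is well behaved, and the row-wise softmax turns each row into attention weights that are applied to $V$.

As a rough sketch of this computation for a single head (not the chapter's actual code, which lives in the linked post; causal masking is omitted for brevity):

```python
import torch
import torch.nn.functional as F

def vanilla_attention(Q, K, V):
    # Q, K, V: (T, d_k) tensors for one head.
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / (d_k ** 0.5)  # (T, T) similarity scores
    weights = F.softmax(scores, dim=-1)              # row-wise softmax
    return weights @ V                               # (T, d_k) weighted sum of values

# Example with T = 8 tokens and d_k = 64 features per head.
T, d_k = 8, 64
Q, K, V = torch.randn(T, d_k), torch.randn(T, d_k), torch.randn(T, d_k)
out = vanilla_attention(Q, K, V)
print(out.shape)  # torch.Size([8, 64])
```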
To read more, click the link below. Since writing math in plain HTML is time-consuming, I use jupyter-book for writing my blogs. (For comfortable reading, please toggle the sidebar.)
https://yogheswaran-a.github.io/cuda-notes/04-flash-attention-II.html
More CUDA blogs:
https://yogheswaran-a.github.io/cuda-notes/00-landing-page.html