Flow Matching: The Theory Behind Stable Diffusion - 3.

Most posts start with a picture or a quote; I'm going to start with a lame joke. Gaussian noise and the MNIST dataset fell in love with each other. How did they meet? They used a flow matching app. What is the post about? Diffusion models and flow matching have improved image generation (both can be written under the same formulation). In this blog post I write up what I have learned about flow matching from the ground up; it was used to develop SD3, OpenAI's Sora, Meta's Movie Gen video, etc. The topics covered are: ...

June 25, 2025 · 1 min · 204 words · Yoghes Waran

SDE, Wiener Process, Itô's Lemma and Reverse Time Equation

Understanding Wiener Process. Motivation: I wrote this post while trying to understand the math behind diffusion models. One can follow the algorithms of the different formulations of diffusion models, such as hierarchical VAEs, score-based models, rectified flow, and flow models, without knowing much about SDEs. To understand in detail why they all fall under the same category, though, I believe a deeper understanding of the Wiener process and the OU process is needed, especially the stochastic calculus that is the key to Itô's lemma and the forward and backward equations. This section gives a brief, informal introduction to the Wiener process and to why stochastic calculus is needed; a later section attempts a formal definition. ...

June 5, 2025 · 1 min · 148 words · Yoghes Waran

Understanding Flash Attention: Part II

Flash Attention - II. This chapter assumes that you know about the attention mechanism. If not, please see this video by Andrej Karpathy, which covers in detail how to model and train a GPT-2 from the ground up. This chapter comprises: Attention PyTorch Code. Attention From Scratch Using CUDA. The Vanilla Self-Attention: Math Refresher. Before we look at the code, let's quickly recap the math for a single attention head. ...

June 4, 2025 · 1 min · 128 words · Yoghes Waran

Understanding Flash Attention: Part I

Flash Attention - I. This chapter assumes that you know about the attention mechanism. If not, please see this video by Andrej Karpathy, which covers in detail how to model and train a GPT-2 from the ground up. This chapter comprises: cuBLAS. Some Functions And Classes To Know. Finding Maximum And Sum Of An Array. cuBLAS. The official CUDA documentation is very detailed and well explained. I would encourage everyone to go through it; I believe it is self-sufficient. ...

June 3, 2025 · 1 min · 118 words · Yoghes Waran

CUDA C++: Instruction Dispatch and Memory

Instruction Dispatch and Memory. This chapter comprises: Warp And Warp Scheduler. Types of Memories. Why Coalescing Matters? Matrix Multiplication Using Shared Memory. More About Shared Memory. Warp And Warp Scheduler. I hope you remember what the warp scheduler is; it was defined in the previous chapter. It is defined here again. Warp scheduler: put simply, the warp scheduler is the unit that issues instructions to the SM. It decides which instructions are executed and when. Warp schedulers are dual-issue capable: a warp scheduler can issue two instructions to the same SM in the same clock cycle if the two instructions do not depend on each other. To read more, click the link below. Since writing math in plain HTML is time-consuming, I use jupyter-book for writing my blogs. (To read comfortably, please toggle the sidebar.) ...

June 2, 2025 · 1 min · 145 words · Yoghes Waran

Getting Started: CUDA C++

Getting Started: CUDA C++. This chapter comprises: GPU Intro. Transfer Of Data From CPU to GPU. GPU Kernels. Streaming Multiprocessor and GPU features. What are Threads, Blocks and Grids? Writing our first program. CUDA Error Checking. Putting it all together: Vector Addition. GPU Intro. When a C++ program is compiled, it is converted into machine code that can be executed on the CPU. When a program is written for the CPU, the machine code is executed sequentially inside a core. If multiple cores are present, each core can be used to perform the desired operation in parallel (using OpenMP in C++), but the instructions inside each core are still executed sequentially. ...

June 1, 2025 · 1 min · 147 words · Yoghes Waran
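
The GPU Intro excerpt above describes CPU parallelism with OpenMP: work is split across cores, while each core still runs its own share sequentially. A minimal sketch of that idea, applied to the vector addition the chapter builds toward, might look like the following. The function name `add_vectors` is hypothetical and not taken from the post, and the code assumes a compiler with OpenMP support (for example the `-fopenmp` flag on GCC or Clang); without it the pragma is ignored and the loop simply runs on one core.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not from the post): element-wise vector addition on the CPU.
// With OpenMP enabled, the loop iterations are divided among the available cores;
// each core still processes its own chunk of indices one after another.
std::vector<float> add_vectors(const std::vector<float>& a,
                               const std::vector<float>& b) {
    assert(a.size() == b.size());  // both inputs must have the same length
    std::vector<float> c(a.size());
    const long long n = static_cast<long long>(a.size());

    // Iterations are independent, so they can be shared across threads safely.
    #pragma omp parallel for
    for (long long i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
    return c;
}
```

Calling `add_vectors({1.0f, 2.0f}, {3.0f, 4.0f})` returns a vector holding the element-wise sums. A GPU version of the same loop would typically replace the loop index with a per-thread index, which is roughly the step the chapter's later sections cover.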