The Tiny Number Powering AI - FP8
A new 8-bit floating-point format called FP8 is transforming how large language models like GPT are trained. This blog explores how FP8 drastically reduces memory usage, boosts training speed, and enables more efficient AI—without sacrificing accuracy. Discover why smaller numbers mean bigger gains for the future of AI.



Have you ever wondered how today's LLMs like ChatGPT or Gemini manage to stay intelligent while becoming faster and cheaper to run?
Well, training these giants used to require hundreds of gigabytes of memory and weeks of GPU time. Now, thanks to a tiny number format called FP8, we can speed things up, cut costs, and keep the intelligence intact.
How LLMs Are Generally Trained
Large Language Models (LLMs) like GPT, LLaMA, or Claude are trained on billions of words using neural networks built from layers of matrix multiplications, attention heads, and non-linearities. Every input, output, and internal parameter is a number, usually a floating-point number, which can represent decimals as well as very large or very small values. Traditionally, training these models relied on FP32 (32-bit floating point), a format that uses 32 bits to store each number. It's precise and accurate, but also memory-heavy and slow: think of lifting a weight with both hands, powerful, but more effort than the job really needs.
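To make that concrete, here is a minimal PyTorch sketch (the 4096-wide layer is just an illustrative size, not taken from any particular model) of the kind of matrix multiply an LLM repeats billions of times, and how much memory a single FP32 weight matrix already costs:

```python
# A tiny stand-in for one transformer layer: a matrix multiply over
# floating-point tensors. PyTorch stores these in FP32 by default,
# i.e. 4 bytes per value.
import torch

hidden = torch.randn(1, 4096)       # one token's activation vector
weight = torch.randn(4096, 4096)    # one layer's weight matrix (FP32 by default)

output = hidden @ weight            # the matmul repeated billions of times in training

print(weight.dtype)                                    # torch.float32
print(weight.element_size())                           # 4 bytes per parameter
print(weight.numel() * weight.element_size() / 1e6)    # ~67 MB for this one matrix alone
```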
That is exactly where FP8 comes into the picture.
The Need for FP8
As models grow larger (some with 175 billion+ parameters!), researchers started asking: Do we really need that much precision to train them? The answer? Not always.
That’s where FP8 (8-bit floating point) comes in.
It uses just one-quarter the memory of FP32 and half that of FP16/BF16, making everything faster and leaner. And unlike 8-bit integer formats (like INT8), FP8 still keeps an exponent, so it can represent a much wider range of values without breaking the math. Think of FP8 as switching from a bulky, power-hungry desktop PC to an M4 Mac: still powerful, but far lighter.
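Here is the back-of-the-envelope arithmetic behind that claim, counting only the raw weights of a 175-billion-parameter model (optimizer states, gradients, and activations add more on top, so real training footprints are larger):

```python
# Weight memory for a 175B-parameter model at different precisions.
params = 175e9

bytes_per_param = {"FP32": 4, "FP16/BF16": 2, "FP8": 1}

for fmt, nbytes in bytes_per_param.items():
    print(f"{fmt:10s} -> {params * nbytes / 1e9:6.0f} GB")

# FP32       ->    700 GB
# FP16/BF16  ->    350 GB
# FP8        ->    175 GB
```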
What is FP8?

FP8 stands for 8-bit floating point, a very compact numerical format used to store and process numbers in deep learning models. Unlike regular integers, FP8 numbers include an exponent, which means they can represent both very small and very large values, much like the familiar FP32, but with far less memory. It comes in two variants: E4M3 (1 sign bit, 4 exponent bits, 3 mantissa bits), which offers better precision but a smaller range, and E5M2 (1 sign bit, 5 exponent bits, 2 mantissa bits), which offers a wider range but less precision. By shrinking each number to just 8 bits (1 byte), FP8 helps AI models run faster, consume less memory, and train more efficiently, especially on modern GPUs designed to handle these smaller formats.
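If you have a recent PyTorch build (2.1 or newer ships both FP8 dtypes), you can inspect that trade-off directly; this is just a small sketch, not tied to any particular training setup:

```python
# Comparing the two FP8 variants: E4M3 trades range for precision,
# E5M2 does the opposite.
import torch

for name, dtype in [("E4M3", torch.float8_e4m3fn), ("E5M2", torch.float8_e5m2)]:
    info = torch.finfo(dtype)
    print(f"{name}: max={info.max:>8}  smallest normal={info.tiny:.1e}  eps={info.eps}")

# E4M3: max=   448.0  smallest normal=1.6e-02  eps=0.125   (finer steps, smaller range)
# E5M2: max= 57344.0  smallest normal=6.1e-05  eps=0.25    (coarser steps, wider range)
```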
In essence, it compresses the number format by trimming precision and range down to what the model actually needs to learn effectively. This only works in practice because of hardware-level support (NVIDIA's FP8 Tensor Cores) combined with dynamic scaling, which rescales each tensor so its values fit inside FP8's limited range.
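Production libraries track running max-value statistics ("delayed scaling") and carry the scale factors alongside the FP8 tensors, but the core idea fits in a few lines. The following is a simplified per-tensor sketch, assuming PyTorch 2.1+ for the float8 dtypes:

```python
# Per-tensor scaling: rescale the tensor so its largest value lands near the
# top of the FP8 range, quantize to 8 bits, and divide the scale back out
# when converting back to higher precision.
import torch

def to_fp8_e4m3(x: torch.Tensor):
    fp8_max = torch.finfo(torch.float8_e4m3fn).max       # 448 for E4M3
    scale = fp8_max / x.abs().max().clamp(min=1e-12)     # fit the tensor into range
    x_fp8 = (x * scale).to(torch.float8_e4m3fn)          # quantize to 8 bits
    return x_fp8, scale

x = torch.randn(4, 4) * 1000                 # values far outside E4M3's native range
x_fp8, scale = to_fp8_e4m3(x)
x_restored = x_fp8.to(torch.float32) / scale # dequantize
print((x - x_restored).abs().max())          # small relative error, not zero
```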
How FP8 is Transforming LLM Training
Speed: Training can run up to 75% faster with FP8 on supported hardware such as NVIDIA's H100 GPUs.
Memory Savings: Models like GPT-175B saw a 39% reduction in GPU memory use during training.
Lower Costs: Less memory and compute means smaller cloud bills and better energy efficiency.
Easy Inference: Since FP8 is still a floating-point format, the same representation works for training and inference, with no separate quantization step required (a minimal training sketch follows this list).
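In practice, NVIDIA's Transformer Engine wraps all of this up for you. The sketch below is an outline rather than a recipe: it assumes an FP8-capable GPU (Hopper/H100 or newer) and the transformer_engine package, and the layer sizes and learning rate are placeholders. The essential pieces are the te.Linear modules and the fp8_autocast context, which route matmuls through FP8 Tensor Cores while keeping master weights in higher precision.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# HYBRID = E4M3 for the forward pass, E5M2 for gradients in the backward pass
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)

model = torch.nn.Sequential(
    te.Linear(4096, 4096),   # placeholder sizes; FP8 GEMMs want dims divisible by 16
    te.Linear(4096, 4096),
).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

inp = torch.randn(32, 4096, device="cuda")
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = model(inp)         # matmuls execute on FP8 Tensor Cores

loss = out.float().sum()     # toy loss just to drive a backward pass
loss.backward()              # gradients use FP8 where supported
optimizer.step()
```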

What’s Next?
The success of FP8 has researchers dreaming even bigger—or should we say smaller? Formats like FP6 and FP4 are being explored. Ideas like per-layer or block-wise scaling, smarter training techniques, and more robust software support are also on the horizon. Plus, upcoming chips from NVIDIA (Blackwell) and AMD (MI300) are doubling down on FP8.
Final Thought
As AI models grow more powerful, the secret to scaling them might not be more hardware but smarter, smaller numbers. FP8 shows that by cutting precision just enough, we can train faster, cheaper, and greener—without losing accuracy. And with future formats and tools already on the way, we're heading into a world where AI becomes both ultra-capable and ultra-efficient, especially with the rise of TinyML. Shrinking numbers, growing power—that's the FP8 future.
Here's a link to a super-interesting research paper published in 2023.