The Tiny Number Powering AI - FP8
A new 8-bit floating-point format called FP8 is transforming how large language models like GPT are trained. This blog explores how FP8 drastically reduces memory usage, boosts training speed, and enables more efficient AI—without sacrificing accuracy. Discover why smaller numbers mean bigger gains for the future of AI.



Have you ever wondered how today's LLMs like ChatGPT or Gemini manage to stay intelligent while becoming faster and cheaper to run?
Well, training these giants used to require hundreds of gigabytes of memory and weeks of GPU time. Now, thanks to a tiny number format called FP8, we can speed things up, cut costs, and keep the intelligence intact.
How LLMs Are Generally Trained
Large Language Models (LLMs) like GPT, LLaMA, or Claude are trained on billions of words using neural networks built from layers of matrix multiplications, attention heads, and non-linearities. Every input, output, and internal parameter is a number, usually a floating-point number, which can represent decimals as well as very large or very small values. Traditionally, training these models relied on FP32 (32-bit floating point), a format that uses 32 bits to store each number. It's precise and accurate, but also memory-heavy and slow: think of lifting a weight with both hands, powerful, but more effort than the job really needs.
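To make that concrete, here is a minimal PyTorch sketch (the 4096-wide layer is just an illustrative size, not taken from any particular model) of the kind of matrix multiply an LLM repeats billions of times, and how much memory a single FP32 weight matrix already costs:

```python
# A tiny stand-in for one transformer layer: a matrix multiply over
# floating-point tensors. PyTorch stores these in FP32 by default,
# i.e. 4 bytes per value.
import torch

hidden = torch.randn(1, 4096)       # one token's activation vector
weight = torch.randn(4096, 4096)    # one layer's weight matrix (FP32 by default)

output = hidden @ weight            # the matmul repeated billions of times in training

print(weight.dtype)                                    # torch.float32
print(weight.element_size())                           # 4 bytes per parameter
print(weight.numel() * weight.element_size() / 1e6)    # ~67 MB for this one matrix alone
```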
That is exactly where FP8 comes into the picture.
The Need for FP8
As models grow larger (some with 175 billion+ parameters!), researchers started asking: Do we really need that much precision to train them? The answer? Not always.
That’s where FP8 (8-bit floating point) comes in.
It uses just one-quarter the memory of FP32 and half that of FP16/BF16, making everything faster and leaner. And unlike 8-bit integer formats (like INT8), FP8 still keeps an exponent, so it can represent a much wider range of values without breaking the math. Think of FP8 as switching from a bulky, power-hungry desktop PC to an M4 Mac: still powerful, but far lighter.
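Here is the back-of-the-envelope arithmetic behind that claim, counting only the raw weights of a 175-billion-parameter model (optimizer states, gradients, and activations add more on top, so real training footprints are larger):

```python
# Weight memory for a 175B-parameter model at different precisions.
params = 175e9

bytes_per_param = {"FP32": 4, "FP16/BF16": 2, "FP8": 1}

for fmt, nbytes in bytes_per_param.items():
    print(f"{fmt:10s} -> {params * nbytes / 1e9:6.0f} GB")

# FP32       ->    700 GB
# FP16/BF16  ->    350 GB
# FP8        ->    175 GB
```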
What is FP8?

FP8 stands for 8-bit floating point, a very compact numerical format used to store and process numbers in deep learning models. Unlike regular integers, FP8 numbers include an exponent, which means they can represent both very small and very large values, much like the familiar FP32, but with far less memory. It comes in two variants: E4M3 (1 sign bit, 4 exponent bits, 3 mantissa bits), which offers better precision but a smaller range, and E5M2 (1 sign bit, 5 exponent bits, 2 mantissa bits), which offers a wider range but less precision. By shrinking each number to just 8 bits (1 byte), FP8 helps AI models run faster, consume less memory, and train more efficiently, especially on modern GPUs designed to handle these smaller formats.
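If you have a recent PyTorch build (2.1 or newer ships both FP8 dtypes), you can inspect that trade-off directly; this is just a small sketch, not tied to any particular training setup:

```python
# Comparing the two FP8 variants: E4M3 trades range for precision,
# E5M2 does the opposite.
import torch

for name, dtype in [("E4M3", torch.float8_e4m3fn), ("E5M2", torch.float8_e5m2)]:
    info = torch.finfo(dtype)
    print(f"{name}: max={info.max:>8}  smallest normal={info.tiny:.1e}  eps={info.eps}")

# E4M3: max=   448.0  smallest normal=1.6e-02  eps=0.125   (finer steps, smaller range)
# E5M2: max= 57344.0  smallest normal=6.1e-05  eps=0.25    (coarser steps, wider range)
```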
In essence, it compresses the number format by trimming precision and range down to what the model actually needs to learn effectively. This only works in practice because of hardware-level support (NVIDIA's FP8 Tensor Cores) combined with dynamic scaling, which rescales each tensor so its values fit inside FP8's limited range.
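Production libraries track running max-value statistics ("delayed scaling") and carry the scale factors alongside the FP8 tensors, but the core idea fits in a few lines. The following is a simplified per-tensor sketch, assuming PyTorch 2.1+ for the float8 dtypes:

```python
# Per-tensor scaling: rescale the tensor so its largest value lands near the
# top of the FP8 range, quantize to 8 bits, and divide the scale back out
# when converting back to higher precision.
import torch

def to_fp8_e4m3(x: torch.Tensor):
    fp8_max = torch.finfo(torch.float8_e4m3fn).max       # 448 for E4M3
    scale = fp8_max / x.abs().max().clamp(min=1e-12)     # fit the tensor into range
    x_fp8 = (x * scale).to(torch.float8_e4m3fn)          # quantize to 8 bits
    return x_fp8, scale

x = torch.randn(4, 4) * 1000                 # values far outside E4M3's native range
x_fp8, scale = to_fp8_e4m3(x)
x_restored = x_fp8.to(torch.float32) / scale # dequantize
print((x - x_restored).abs().max())          # small relative error, not zero
```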
How FP8 is Transforming LLM Training
Speed: Training can run up to 75% faster with FP8 on supported hardware such as NVIDIA's H100 GPUs.
Memory Savings: Models like GPT-175B saw a 39% reduction in GPU memory use during training.
Lower Costs: Less memory and compute means smaller cloud bills and better energy efficiency.
Easy Inference: Since FP8 is still a floating-point format, the same representation works for training and inference, with no separate quantization step required (a minimal training sketch follows this list).
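In practice, NVIDIA's Transformer Engine wraps all of this up for you. The sketch below is an outline rather than a recipe: it assumes an FP8-capable GPU (Hopper/H100 or newer) and the transformer_engine package, and the layer sizes and learning rate are placeholders. The essential pieces are the te.Linear modules and the fp8_autocast context, which route matmuls through FP8 Tensor Cores while keeping master weights in higher precision.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# HYBRID = E4M3 for the forward pass, E5M2 for gradients in the backward pass
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)

model = torch.nn.Sequential(
    te.Linear(4096, 4096),   # placeholder sizes; FP8 GEMMs want dims divisible by 16
    te.Linear(4096, 4096),
).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

inp = torch.randn(32, 4096, device="cuda")
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = model(inp)         # matmuls execute on FP8 Tensor Cores

loss = out.float().sum()     # toy loss just to drive a backward pass
loss.backward()              # gradients use FP8 where supported
optimizer.step()
```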

What’s Next?
The success of FP8 has researchers dreaming even bigger—or should we say smaller? Formats like FP6 and FP4 are being explored. Ideas like per-layer or block-wise scaling, smarter training techniques, and more robust software support are also on the horizon. Plus, upcoming chips from NVIDIA (Blackwell) and AMD (MI300) are doubling down on FP8.
Final Thought
As AI models grow more powerful, the secret to scaling them might not be more hardware but smarter, smaller numbers. FP8 shows that by cutting precision just enough, we can train faster, cheaper, and greener—without losing accuracy. And with future formats and tools already on the way, we're heading into a world where AI becomes both ultra-capable and ultra-efficient, especially with the rise of TinyML. Shrinking numbers, growing power—that's the FP8 future.
Here's a link to a super-interesting research paper published in 2023.