Floating-point formats in the world of machine learning


What you will learn:

  • Why floating point is important for developing machine learning models.
  • What floating point formats are used with machine learning?

Over the past two decades, computationally intensive artificial intelligence (AI) tasks have encouraged the use of custom hardware to efficiently drive these new, robust systems. Machine learning (ML) models, one of the most widely used forms of AI, are trained to handle these intensive tasks using floating-point arithmetic.

However, because floating-point formats have been extremely resource-intensive, AI deployment systems often rely on one of the few now-standard integer quantization techniques using floating-point formats, such as bfloat16 from Google and FP16 from IEEE.

Since computer memory is limited, it is not efficient to store numbers with infinite precision, whether binary or decimal fractions. This is due to the imprecision of the numbers when it comes to certain applications, such as AI training.

While software engineers can design machine learning algorithms, they often cannot rely on ever-changing hardware to be able to run those algorithms efficiently. The same can be said for hardware manufacturers, who often produce next-gen processors without being task-oriented, meaning the processor is designed to be a complete platform to handle most tasks instead target-specific applications.

In computing, floating point are arithmetic formulas representative of real numbers that are an approximation to support a trade-off between range and precision, or rather huge amounts of data and accurate results. For this reason, floating point calculation is often used in systems with small and large numbers that require fast processing times.

It is well known that deep neural networks can tolerate lower numerical precision because high precision computations are less efficient when training or inferring neural networks. Additional precision provides no benefit while being slower and less memory efficient.

In fact, some models can even achieve greater accuracy with less accuracy. A paper published by Cornell University attributes lower precision regularization effects.

Floating point formats

Although there are a ton of floating point formats, only a few have gained traction for machine learning applications, as these formats require the proper hardware and firmware support to operate effectively. In this section, we’ll look at several examples of floating-point formats designed to handle machine learning development.

IEEE 754

The IEEE 754 standard (Fig.1) is one of the best-known formats for AI applications. It is a set of representations of numeric values ​​and symbols, including FP16, FP32 and FP64 (AKA Half, Single and Double-precision formats). FP32, for example, is broken down into a sequence of 32 bits, such as b31, b30, and b29, down to zero.

A floating-point format is specified by a radix (b)which is either 2 (binary) or 10 (decimal), a precision (p) and an exponent range of emin to emax, with emin = 1 – emax for all IEEE 754 formats. The format includes finite numbers that can be described by three integers.

These integers include s = a sign (zero or one), c = a significand (or coefficient) having no more than p digits when written in base b (i.e. an integer between 0 and bp − 1), and q = an exponent such that emin ≤ q + p − 1 ≤ emax. The format also includes two infinities (+∞ and -∞) and two types of NaN (Not a Number), including a silent NaN (qNaN) and a signaling NaN (sNaN).

The details here are many, but this is the general format of how IEEE 754 floating point works; more detailed information can be found at the link above. FP32 and FP64 are on the larger floating-point spectrum, and they’re supported by x86 processors and most current GPUs, as well as the C/C++, PyTorch, and TensorFlow programming languages. FP16, on the other hand, is not widely used with modern CPUs, but is widely supported by current GPUs in conjunction with machine learning frameworks.


Google’s bfloat16 (Fig.2) is another widely used floating-point format intended for machine learning workloads. The Brain Floating Point format is essentially a truncated version of the IEEE FP16, allowing fast, single-precision conversion from 754 to and from this format. When applied to machine learning, there are generally three types of values, including weights, activations, and gradients.

Google recommends storing weights and gradients in FP32 format and storing activations in bfloat16. Of course, weights can also be stored in BFloat16 without significant performance degradation depending on the circumstances.

Basically, bfloat16 consists of one sign bit, eight exponent bits, and seven mantissa bits. This differs from IEEE 16-bit floating point, which was not designed for deep learning applications when it was developed. The format is used in Intel AI processors including Nervana NNP-L1000, Xeon processors, Intel FPGAs, and Google Cloud TPUs.

Unlike the IEEE format, bfloat16 is not used with C/C++ programming languages. However, it leverages TensorFlow, AMD ROCm, NVIDIA CUDA, and ARMv8.6-A software stack for AI applications.


NVIDIA TensorFloat (Fig.3) is another great floating point format. However, it was only designed to take advantage of TensorFlow TPUs designed explicitly for AI applications. According to NVIDIA, “TensorFloat-32 is the new math mode in NVIDIA A100 GPUs for handling matrix operations also called tensor operations used at the core of AI and some HPC applications. TF32 running on Tensor cores in A100 GPUs can provide up to 10X speedup over single-precision floating-point (FP32) calculations on Volta GPUs.”

The format is just a 32 bit float which removes 13 bits of precision to run on Tensor Cores. Thus, it has the precision of the FP16 (10 bits), but has the range of the IEEE 754 format of the FP32 (8 bits).

NVIDIA states that TF32 uses the same 10-bit mantissa as the FP16 half-precision math calculations, which turn out to have more than enough headroom for the precision requirements of AI workloads. TF32 also adopts the same 8-bit exponent as FP32, so it can support the same numeric range. This means that content can be converted from FP32 to TF32, making it easier to switch platforms.

Currently, TF32 does not support C/C++ programming languages, but NVIDIA says the TensorFlow framework and a version of the PyTorch framework that supports TF32 on NGC are available to developers. Although it limits the hardware and software that can be used with the format, it performs exceptionally well on the company’s GPUs.


This is just a basic overview of floating-point formats, an introduction to a larger, more expansive world designed to reduce hardware and software demands to drive innovation within the software industry. ‘IA. It will be interesting to see how these platforms evolve over the next few years as AI becomes more advanced and entrenched in our lives. Technology is constantly evolving, as are the formats that make machine learning application development increasingly efficient in running software.


Comments are closed.