
Describe the format of binary floating-point real numbers


Floating-Point Numbers - Cambridge A-Level Computer Science 9618

13.3 Floating-point numbers, representation and manipulation

Objective: Describe the format of binary floating-point real numbers

Floating-point numbers are used to represent real numbers, including those with a fractional part, in computers. The format is similar to scientific notation: a number is stored as a sign, a significand and an exponent, except that the base is 2 rather than 10. This allows a wide range of values, from very small to very large, to be represented using a fixed number of bits.

The IEEE 754 Standard

The most common standard for representing floating-point numbers is IEEE 754. This standard defines how floating-point numbers are stored in memory. We will focus on the single-precision (32-bit) and double-precision (64-bit) formats, which are commonly used.

Single-Precision (32-bit) Floating-Point Format

A single-precision floating-point number uses 32 bits, divided into three fields:

Sign bit (1 bit): Indicates the sign of the number (0 for positive, 1 for negative).
Exponent (8 bits): Represents the power of 2 by which the significand is multiplied.
Significand (23 bits): Represents the significant digits of the number. It is normalized, meaning it has the form 1.xxxxx; only the 23 bits after the binary point (xxxxx) are stored.

The format can be summarized as follows:

Bit Position	Name	Description
31	Sign Bit	0: Positive, 1: Negative
30-23	Exponent	Represents the power of 2. It is biased.
22-0	Significand	Represents the digits of the number (normalized).

Exponent Bias

The exponent is biased so that both positive and negative exponents can be stored as an unsigned value. The bias is a fixed value that is added to the actual exponent before it is stored, and subtracted again when the number is read. For single-precision, the bias is 127. This means that the actual exponent is calculated as: actual exponent = stored exponent - 127
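The field layout and the bias calculation can be sketched in Python. This is a minimal illustration, not part of the syllabus text: the helper name `split_single` is hypothetical, and it relies on the standard `struct` module to obtain the 32-bit pattern of a value.

```python
import struct

BIAS = 127  # single-precision exponent bias

def split_single(value):
    """Return (sign, stored_exponent, fraction) for value held as a 32-bit float."""
    # Pack as a big-endian single, then reinterpret the same 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                      # bit 31
    stored_exponent = (bits >> 23) & 0xFF  # bits 30-23
    fraction = bits & 0x7FFFFF             # bits 22-0
    return sign, stored_exponent, fraction

sign, stored, fraction = split_single(-6.5)   # -6.5 = -1.101 (binary) * 2^2
actual = stored - BIAS
print(sign, stored, actual, format(fraction, "023b"))
# 1 129 2 10100000000000000000000
```

Here the stored exponent 129 corresponds to an actual exponent of 129 - 127 = 2.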

Significand Representation

The significand is normalized so that there is exactly one non-zero digit to the left of the binary point. In binary that digit is always 1, so it does not need to be stored: the leading '1' is implicit, and only the bits after the binary point are kept. This gives each number a unique representation and one extra bit of precision from the same number of stored bits.
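As a small sketch of how the implicit leading 1 is restored, the function below (a hypothetical helper, `significand_value`) turns the stored fraction bits into the value 1.xxxxx for a normalized number.

```python
def significand_value(fraction, fraction_bits=23):
    """Value of 1.f for a normalized number, given the stored fraction field f."""
    # The stored bits are the digits after the binary point, so scale them
    # by 2^-fraction_bits and add back the implicit leading 1.
    return 1.0 + fraction / (1 << fraction_bits)

print(significand_value(1 << 22))  # stored bits 100...0 -> 1.1 (binary)  = 1.5
print(significand_value(1 << 21))  # stored bits 010...0 -> 1.01 (binary) = 1.25
```

Passing `fraction_bits=52` applies the same rule to the double-precision format.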

Double-Precision (64-bit) Floating-Point Format

A double-precision floating-point number uses 64 bits, divided into three fields:

Sign bit (1 bit): Indicates the sign of the number (0 for positive, 1 for negative).
Exponent (11 bits): Represents the power of 2 by which the significand is multiplied.
Significand (52 bits): Represents the significant digits of the number. It is normalized, meaning it has the form 1.xxxxx; only the 52 bits after the binary point (xxxxx) are stored.

The format can be summarized as follows:

Bit Position	Name	Description
63	Sign Bit	0: Positive, 1: Negative
62-52	Exponent	Represents the power of 2. It is biased.
51-0	Significand	Represents the digits of the number (normalized).

Double-Precision Exponent Bias

The exponent is biased to allow for the representation of both positive and negative exponents. For double-precision, the bias is 1023. This means that the actual exponent is calculated as: actual exponent = stored exponent - 1023
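The same idea can be sketched for double precision. The helper name `split_double` is an assumption for illustration; Python's own `float` is normally a 64-bit double, so the fields come straight from its bit pattern.

```python
import struct

DOUBLE_BIAS = 1023  # double-precision exponent bias

def split_double(value):
    """Return (sign, stored_exponent, fraction) for value held as a 64-bit float."""
    # Pack as a big-endian double, then reinterpret the same 8 bytes as an unsigned int.
    (bits,) = struct.unpack(">Q", struct.pack(">d", value))
    sign = bits >> 63                       # bit 63
    stored_exponent = (bits >> 52) & 0x7FF  # bits 62-52 (11 bits)
    fraction = bits & ((1 << 52) - 1)       # bits 51-0 (52 bits)
    return sign, stored_exponent, fraction

sign, stored, fraction = split_double(0.15625)  # 0.15625 = 1.01 (binary) * 2^-3
print(sign, stored, stored - DOUBLE_BIAS)
# 0 1020 -3
```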

Limitations of Floating-Point Representation

Floating-point representation has limitations:

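Limited Precision: Not all real numbers can be represented exactly, because only a fixed number of significand bits is available. For example, 0.1 has no exact binary representation. This can lead to rounding errors.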
Special Values: Certain bit patterns are reserved for special values such as 0, positive infinity, negative infinity, and NaN (Not a Number), so they do not follow the normal sign, exponent and significand rule.
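The limited-precision point can be seen with a short Python sketch. It assumes Python's `float` is an IEEE 754 double, which is the case on common platforms.

```python
import math
from decimal import Decimal

total = 0.1 + 0.2

# Neither 0.1 nor 0.2 is stored exactly, so the sum is not exactly 0.3.
print(total == 0.3)              # False

# Decimal(float) shows the exact value that is actually stored for 0.1.
print(Decimal(0.1))

# Compare with a tolerance instead of testing for exact equality.
print(math.isclose(total, 0.3))  # True
```

Comparing with a tolerance, as in the last line, is the usual way to work around small rounding errors.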

Special Values

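Zero: Represented by an exponent and significand that are all zeros. The sign bit may be 0 or 1, giving +0 and -0.
Positive and Negative Infinity: Represented when the exponent is all ones and the significand is all zeros. The sign bit distinguishes positive infinity (0) from negative infinity (1).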
NaN: Represented when the exponent is all ones and the significand is non-zero. NaN indicates an undefined or unrepresentable result (e.g., 0/0).
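The single-precision bit patterns of these special values can be printed with a short sketch. The helper `single_bits` is hypothetical. The sign bit and the exact significand bits of the NaN produced by `math.nan` can vary between platforms, so only its exponent and the fact that its significand is non-zero should be relied on.

```python
import math
import struct

def single_bits(value):
    """Return the 32-bit pattern of value as 'sign exponent significand'."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    s = format(bits, "032b")
    return f"{s[0]} {s[1:9]} {s[9:]}"

special_values = [
    ("+0", 0.0),
    ("-0", -0.0),
    ("+inf", math.inf),
    ("-inf", -math.inf),
    ("nan", math.nan),
]

for name, value in special_values:
    print(f"{name:>5}: {single_bits(value)}")
```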

Example

Consider a single-precision floating-point number with the following binary representation: 0 01111100 10000000000000000000000. This represents the number 0.1875.

The sign bit is 0 (positive). The exponent field is 01111100, which is 124 in decimal. The bias is 127, so the actual exponent is 124 - 127 = -3. The stored significand bits are 10000000000000000000000; adding the implicit leading '1' gives 1.1 in binary, which is 1.5 in decimal. Therefore, the number is calculated as: 1.5 * 2^-3 = 1.5 / 8 = 0.1875.
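The decoding steps for this bit pattern can be checked with a short Python sketch. It assumes a normalized number (so the implicit leading 1 applies); `decode_fields` is a hypothetical helper, and the result is cross-checked against the standard `struct` module.

```python
import struct

BIAS = 127
FRACTION_BITS = 23

def decode_fields(sign_bits, exponent_bits, fraction_bits):
    """Decode a normalized single-precision number from its three bit fields."""
    sign = int(sign_bits, 2)
    stored_exponent = int(exponent_bits, 2)
    fraction = int(fraction_bits, 2)
    significand = 1 + fraction / 2**FRACTION_BITS  # restore the implicit leading 1
    return (-1) ** sign * significand * 2.0 ** (stored_exponent - BIAS)

# Sign 0, exponent 01111100, significand 1 followed by 22 zeros.
pattern = ("0", "01111100", "1" + "0" * 22)
by_hand = decode_fields(*pattern)

# Cross-check: let struct interpret the same 32 bits as a float.
(from_struct,) = struct.unpack(">f", int("".join(pattern), 2).to_bytes(4, "big"))

print(by_hand, from_struct)  # 0.1875 0.1875
```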

Suggested diagram: A diagram illustrating the single-precision and double-precision floating-point formats, showing the bit positions and their corresponding names.