Optimisation removes the “33rd-bit problem” and is already in LLVM, with GCC and MSVC updates on the way

Harriet L. Whitcombe • May 04, 2026 18:23

Optimisation removes the “33rd-bit problem” and is already in LLVM, with GCC and MSVC updates on the way

Engineers at the Japanese company Cybozu Labs, which focuses on software development and computational optimisation, have proposed a new way to perform division by a constant on 64-bit processors. The technique takes advantage of the spare bit-width available in modern registers, removing limitations inherited from older 32-bit algorithms. The patch has already been merged into LLVM (Low Level Virtual Machine), the widely used open-source project that includes the Clang compiler (version 23.0.0). Updates for GCC (GNU Compiler Collection) and MSVC (Microsoft Visual C++) are currently being tested.

Background: constant division and the Granlund–Montgomery (GM) method

Even on today’s high-performance 64-bit systems, mainstream compilers (GCC, Clang, MSVC) have continued to rely on algorithms designed roughly 30 years ago for 32-bit CPUs. Since 1994, the standard approach in compilers for division by a constant has been the Granlund and Montgomery method (the GM method). Rather than emitting an actual division instruction, it replaces division with a multiplication by a “magic constant” followed by bit shifts.

However, the GM method runs into trouble with so-called “33-bit divisors”. In those cases it requires extra work, leading to unnecessary computation and lower performance on modern 64-bit processors. In practice, about 3% of constant divisions of 32-bit values (for example dividing by 7, 19, or 107) need intermediate steps using 33-bit “magic numbers”. That creates a long critical path and reduces parallelism.

Cybozu Labs’ new 64-bit-grid approach to the “33rd-bit problem”

The contribution from Mitsunari Shigeo and Hoshino Takashi is to stop emulating 33-bit arithmetic and instead rewrite the underlying formula directly using a 64-bit grid. Rather than a complicated correction sequence, the method uses a compact mathematical model:

(x⋅(2^(64−a) ⋅c))//2^64,

where x is the dividend widened to 64 bits and c is the magic constant.

On x86-64, the approach uses MULX (Unsigned Multiply Without Affecting Flags), which leaves the processor flags untouched. On ARM/Apple Silicon it uses UMULH (Unsigned Multiply High), which returns the upper 64 bits of the multiplication result. With these instructions, the division can be implemented as a single multiply-high operation, delivering a substantial speed-up.

Instruction count and latency: GM method vs the new method

In a loop, the older GM method can require as many as 9 instructions, including additions and shifts, which lengthens the dependency chain. The new method reduces the chain to 3 operations, cutting latency and data dependencies-an important gain on modern processors.

Benchmarks on Intel Xeon w9-3495X and Apple M4

Benchmarks on an Intel Xeon w9-3495X and an Apple M4 showed speed-ups of up to 1.67x and 1.98x respectively. The uplift was more pronounced on the Apple M4 due to the high throughput of its multiply units. On the Xeon, the new method also improved the predictability of execution time, which matters for server workloads-for example, the standard deviation of runtime dropped from 0.013 to 0.009 seconds.

LLVM is merged; GCC and MSVC are in final testing

By incorporating the new technique into LLVM and GCC, software that processes large volumes of data-such as databases, cryptographic systems, and network-traffic analysis-can run faster.

This is not only an academic result but a practical optimisation that is already shipping into industry toolchains. At present, the patch is fully integrated into LLVM, while GCC and MSVC updates are in the final testing phase. As a result, in the near future most programs rebuilt with the updated compilers should see a meaningful acceleration without any changes to their source code. In compilers, a long-standing historical anachronism is removed, and the power of 64-bit processors is finally applied to basic arithmetic operations-delivering almost a twofold speed-up in certain scenarios.

Optimisation removes the “33rd-bit problem” and is already in LLVM, with GCC and MSVC updates on the way

Optimisation removes the “33rd-bit problem” and is already in LLVM, with GCC and MSVC updates on the way

Background: constant division and the Granlund–Montgomery (GM) method

Cybozu Labs’ new 64-bit-grid approach to the “33rd-bit problem”

Instruction count and latency: GM method vs the new method

Benchmarks on Intel Xeon w9-3495X and Apple M4

LLVM is merged; GCC and MSVC are in final testing

Comments

Leave a Comment