
Me, myself & IT

True Lies – or What LLVM Claims, but Fails to Deliver

Purpose
Reason
Example 1
Example 2
Violation of the Microsoft calling convention
Poor to abysmal performance of 128÷128-bit integer division
Bloated code and poor performance of 64÷64-bit integer division
Counterexample for the 3 preceding bugs
Unoptimised code of 64÷64-bit integer division functions
Miserable performance of 128÷128-bit (and 64÷64-bit) integer division
Runtime measurement on AMD Ryzen 5 3600
Runtime measurement on AMD EPYC 7262
Runtime measurement on Intel Core i5-6600
Careless (or clueless?) wasting of resources on Microsoft Windows
Which cross-compiler?
Misleading message box from installer on Microsoft Windows
Inconsistent undefined behaviour
Impaired performance of one of the most trivial functions
Invalid machine code generated for __bswapsi2(), __bswapdi2() and __builtin_bswap*()
Bad, insufficient or no (peephole) optimisation at all in various library functions
__absvti2()
__ashldi3(), __ashrdi3() and __lshrdi3()
__cmp?i2() and __ucmp?i2()
__mulo?i4()
__parity?i2()
Braindead implementation of code generator for __builtin_*()
__builtin_parity()
__builtin_rotateleft*() and __builtin_rotateright*()
__builtin_mul_overflow()
__builtin_copysign()
Utterly devastating performance of overflow-checking 128×128-bit (and 64×64-bit) integer multiplication
Runtime measurement on AMD EPYC 7262
Which optimiser?
The optimiser fails – take 2
The optimiser fails – take 3
The optimiser fails – take 4
The optimiser fails – take 5
The optimiser fails – take 6
The optimiser fails – take 7
The optimiser fails – take 8
The optimiser fails – take 9
The optimiser fails – take 10
The optimiser fails – take 11
The optimiser fails – take 12
The optimiser fails – take 13

Purpose

Demonstrate multiple bugs, defects, deficiencies and flaws of LLVM’s Clang compiler: from bloated, unoptimised or outright wrong generated code, through the poor, miserable, disastrous and abysmal performance of the integer division and overflow-checking integer multiplication routines of its compiler-rt runtime libraries, to the careless waste of resources on Microsoft® Windows®.

Reason

For their compiler-rt runtime libraries, LLVM boasts
  1. The compiler-rt project provides highly tuned implementations of the low-level code generator support routines like "__fixunsdfdi" and other calls generated when a target doesn't have a short sequence of native instructions to implement a core IR operation.
and brags
The builtins library provides optimized implementations of this and other low-level routines, either in target-independent C form, or as a heavily-optimized assembly.
but dares to ship unoptimised and untuned crap like the library functions __cmpdi2(), __cmpti2(), __ucmpdi2(), __ucmpti2(), __udivmoddi4(), __udivmodti4(), __mulodi4(), __muloti4() or __popcountti2(), whose pessimised code even keeps their own Clang compiler and its optimiser from generating proper and performant machine code suitable for current super-scalar processors!

Note: for the synopsis of the library’s contents see README.txt.

As proved below, especially in the sections miserable performance of integer division, unoptimised library functions and utterly devastating performance of overflow-checking 128×128-bit (and 64×64-bit) integer multiplication, the statements cited above are blatant lies!

Note: I recommend reading Randall Hyde’s ACM article, especially the following part:

Observation #6: Software engineers have been led to believe that their time is more valuable than CPU time; therefore, wasting CPU cycles in order to reduce development time is always a win. They've forgotten, however, that the application users' time is more valuable than their time.

Example 1

Literally bad-smelling code from the 80’s, taken almost verbatim from GCC, ugly and defeating proper optimisation:
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
[…]
// This file implements __ucmpdi2 for the compiler_rt library.
[…]
// Returns:  if (a <  b) returns 0
//           if (a == b) returns 1
//           if (a >  b) returns 2
[…]
int __ucmpdi2(uint64_t a, uint64_t b) {
    union {
        uint64_t all;
        struct {
            uint32_t low, high;
        } s;
    } x, y;
    x.all = a;
    y.all = b;
    if (x.s.high < y.s.high)
        return 0;
    if (x.s.high > y.s.high)
        return 2;
    if (x.s.low < y.s.low)
        return 0;
    if (x.s.low > y.s.low)
        return 2;
    return 1;
}

#ifdef __ARM_EABI__
// Returns:  if (a <  b) returns -1
//           if (a == b) returns  0
//           if (a >  b) returns  1
int __aeabi_ulcmp(int64_t a, int64_t b) {
    return __ucmpdi2(a, b) - 1;
}
#endif
Note: spotting the most obvious bug is left as an exercise to the reader.

Properly written code which lets the optimiser do its job follows:

// Copyleft © 2004-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

typedef unsigned long long uint64_t;

int __ucmpdi2(uint64_t a, uint64_t b) {
    return (a > b) - (a < b) + 1;
}

#ifdef __ARM_EABI__
int __aeabi_ulcmp(uint64_t a, uint64_t b) {
    return (a > b) - (a < b);
}
#endif

Example 2

Awkward construction of magic 128-bit constants:
//===-- popcountti2.c - Implement __popcountti2
[…]
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
[…]
int __popcountti2(__int128_t a) {
  __uint128_t x3 = (__uint128_t)a;
  x3 = x3 - ((x3 >> 1) &
             (((__uint128_t)0x5555555555555555uLL << 64) | 0x5555555555555555uLL));
  // Every 2 bits holds the sum of every pair of bits (64)
  x3 = ((x3 >> 2) &
        (((__uint128_t)0x3333333333333333uLL << 64) | 0x3333333333333333uLL)) +
       (x3 & (((__uint128_t)0x3333333333333333uLL << 64) | 0x3333333333333333uLL));
  // Every 4 bits holds the sum of every 4-set of bits (3 significant bits) (32)
  x3 = (x3 + (x3 >> 4)) &
       (((__uint128_t)0x0F0F0F0F0F0F0F0FuLL << 64) | 0x0F0F0F0F0F0F0F0FuLL);
[…]
Properly written code looks as follows:
// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

int __popcountti2(__uint128_t value) {
    value -= (value >> 1) & (~(__uint128_t) 0 / 3); // 0x55...55
    value = ((value >> 2) & (~(__uint128_t) 0 / 5)) // 0x33...33
          + (value & (~(__uint128_t) 0 / 5));
    value += value >> 4;
    value &= ~(__uint128_t) 0 / 17;                 // 0x0F...0F
    value *= ~(__uint128_t) 0 / 255;                // 0x01...01
    return value >> 120;
}

Violation of the Microsoft calling convention

The MSDN articles Overview of x64 Calling Conventions and Parameter Passing specify:
There's a strict one-to-one correspondence between a function call's arguments and the registers used for those arguments. Any argument that doesn't fit in 8 bytes, or isn't 1, 2, 4, or 8 bytes, must be passed by reference. A single argument is never spread across multiple registers.

[…] 16-byte arguments are passed by reference. […]

To return a user-defined type by value in RAX, it must have a length of 1, 2, 4, 8, 16, 32, or 64 bits. […] Otherwise, the caller must allocate memory for the return value and pass a pointer to it as the first argument. The remaining arguments are then shifted one argument to the right. The same pointer must be returned by the callee in RAX.

The LLVM Project Blog post Clang is now used to build Chrome for Windows states:
Clang is the first-ever open-source C++ compiler that’s ABI-compatible with Microsoft Visual C++ (MSVC) – meaning you can build some parts of your program (for example, system libraries) with the MSVC compiler (“cl.exe”), other parts with Clang, and when linked together (either by MSVC’s linker, “link.exe“, or LLD, the LLVM project’s linker – see below) the parts will form a working program.
To falsify LLVM’s claim, create the text file case0.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifndef __clang__
typedef struct {
    unsigned __int64 low, high;
} __uint128_t;
#else
__attribute__ ((ms_abi))
#endif
__uint128_t __ubugti4(__uint128_t dividend, __uint128_t divisor, __uint128_t *remainder) {
    if (remainder != 0)
#ifndef ALTERNATE
        *remainder = divisor;
    return dividend;
#else
        *remainder = dividend % divisor;
    return dividend / divisor;
#endif
}
Note: the actual definition or name of the 128-bit integer user-defined aggregate data type __uint128_t for the Visual C compiler is irrelevant; for binary compatibility, defined by the respective Application Binary Interface (ABI), only its matching size and memory layout matter!

Compile the source file case0.c with Clang, engaging its optimiser, and display the generated assembly code:

clang -o- -O3 -S -target amd64-pc-windows case0.c
[…]
__ubugti4:				# @__ubugti4
# %bb.0:
	movaps	(%rcx), %xmm0
	testq	%r8, %r8
	je	.LBB0_2
# %bb.1:
	movaps	(%rdx), %xmm1
	movaps	%xmm1, (%r8)
.LBB0_2:
	retq
[…]
OOPS: the code generated by Clang expects the addresses of its three (128-bit) arguments in the registers RCX, RDX and R8 instead of RDX, R8 and R9, and it places the (128-bit) return value in the SSE register XMM0 instead of the memory pointed to by RCX, without returning this pointer in RAX, thus violating the Microsoft ABI twice!

Compile the source file case0.c a second time with Clang, now without SSE support, and display the generated assembly code:

clang -mno-sse -o- -O3 -S -target amd64-pc-windows case0.c
[…]
__ubugti4:				# @__ubugti4
# %bb.0:
	movq	%rdx, %r9
	movq	(%rcx), %rax
	movq	8(%rcx), %rdx
	testq	%r8, %r8
	je	.LBB0_2
# %bb.1:
	movq	(%r9), %r10
	movq	8(%r9), %rcx
	movq	%rcx, 8(%r8)
	movq	%r10, (%r8)
.LBB0_2:
	retq
[…]
OUCH: the code generated by Clang still expects the addresses of its three (128-bit) arguments in the registers RCX, RDX and R8 instead of RDX, R8 and R9, and it places the (128-bit) return value in the register pair RDX:RAX instead of the memory pointed to by RCX, without returning this pointer in RAX, again violating the Microsoft ABI twice!

Now compile the source file case0.c with the reference compiler for the Microsoft ABI, Visual C, and display the generated assembly code:

CL.EXE /c /FaCON: /FoNUL: /nologo /Ox /W4 case0.c
case0.c
; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01 
[…]
__ubugti4 PROC
; Line 11
	test	r9, r9
	je	SHORT $LN1@ubugti4
; Line 13
	mov	rax, QWORD PTR [r8]
	mov	QWORD PTR [r9], rax
	mov	rax, QWORD PTR [r8+8]
	mov	QWORD PTR [r9+8], rax
$LN1@ubugti4:
; Line 14
	mov	rax, QWORD PTR [rdx]
	mov	QWORD PTR [rcx], rax
	mov	rax, QWORD PTR [rdx+8]
	mov	QWORD PTR [rcx+8], rax
	mov	rax, rcx
; Line 19
	ret	0
__ubugti4 ENDP
[…]
This code complies (of course) with the Microsoft ABI.

Poor to abysmal performance of 128÷128-bit integer division

Compile the source file case0.c a third time with Clang, now with the preprocessor macro ALTERNATE defined, and display the generated assembly code:
clang -DALTERNATE -o- -O3 -S -target amd64-pc-windows case0.c
[…]
__ubugti4:				# @__ubugti4
[…]
	testq	%r8, %r8
	je	.LBB0_2
# %bb.1:
	movq	%r8, %r15
[…]
	leaq	80(%rsp), %rcx
	leaq	64(%rsp), %rdx
	callq	__umodti3
	movaps	%xmm0, (%r15)
.LBB0_2:
[…]
	leaq	48(%rsp), %rcx
	leaq	32(%rsp), %rdx
	callq	__udivti3
	nop
[…]
Oops: instead of generating a single call of the library function __udivmodti4(), which returns both quotient and remainder, even with the command line option -O3 specified Clang generates separate calls of the library functions __umodti3() and __udivti3() – which is especially funny, since both call the library function __udivmodti4() in turn!

Note: as demonstrated below, this code runs about 3 to 34 (in words: thirty-four) times slower than code generated by GCC!

Additionally count the number of instructions and the code size of the __udivmodti4() function provided in the clang_rt.builtins-x86_64.lib library: it shows 234 instructions in 765 bytes, while properly written (assembly) code shows only 71 instructions in just 225 bytes – and runs up to 17 (in words: seventeen) times faster!

Note: exploration of the equally entertaining call chain for signed 128÷128-bit integer division is left as an exercise to the reader.

Bloated code and poor performance of 64÷64-bit integer division

Create the text file case2.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifndef ALTERNATE
typedef struct {
   __int64 quotient, remainder;
} lldiv_t;

__attribute__ ((ms_abi))
lldiv_t lldiv(__int64 dividend, __int64 divisor) {
    lldiv_t lldiv = {dividend / divisor, dividend % divisor};
    return lldiv;
}
#else
unsigned long __udivmoddi4(unsigned long dividend, unsigned long divisor, unsigned long *remainder) {
    if (remainder != 0)
        *remainder = dividend % divisor;
    return dividend / divisor;
}
#endif // ALTERNATE
Compile the source file case2.c with Clang, engaging its optimiser, targeting the i386 platform, and display the generated assembly code:
clang -o- -O3 -S -target i386-pc-windows case2.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…]
_lldiv:					# @lldiv
# %bb.0:
	pushl	%ebp			#	push	ebp
	pushl	%ebx			#	mov	ebp, [esp+8]
	pushl	%edi			#	push	ebx
	pushl	%esi			#
	movl	36(%esp), %esi		#
	movl	20(%esp), %ebp		#
	movl	24(%esp), %edi		#
	movl	28(%esp), %ebx		#
	pushl	%esi			#	push	[esp+28]
	pushl	36(%esp)		#	push	[esp+28]
	pushl	%ebx			#	push	[esp+28]
	pushl	%edi			#	push	[esp+28]
	calll	__alldiv		#	call	__alldvrm
	movl	%edx, %ecx		#
	movl	%eax, (%ebp)		#	mov	[ebp], eax
	imull	%eax, %esi		#
	mull	32(%esp)		#
	movl	%ecx, 4(%ebp)		#	mov	[ebp+4], edx
	imull	32(%esp), %ecx		#
	addl	%esi, %edx		#
	addl	%edx, %ecx		#
	subl	%eax, %edi		#
	movl	%ebp, %eax		#
	sbbl	%ecx, %ebx		#
	movl	%edi, 8(%ebp)		#	mov	[ebp+8], ecx
	movl	%ebx, 12(%ebp)		#	mov	[ebp+12], ebx
	popl	%esi			#
	popl	%edi			#	pop	ebx
	popl	%ebx			#	mov	eax, ebp
	popl	%ebp			#	pop	ebp
	retl				#	ret
[…]
Ouch: instead of generating a single call of the compiler helper function __alldvrm(), which returns both quotient and remainder in the register pairs EDX:EAX and EBX:ECX, even with the command line option -O3 specified Clang generates code to compute the remainder itself, wasting 15 of the total 31 instructions, 34 of the total 74 bytes, and about 9 CPU clock cycles per function call!

Note: the code of the __divmoddi4() function provided in the clang_rt.builtins-i386.lib library is equally bad:

# Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

	.arch	generic32
	.code32
	.intel_syntax noprefix
	.global	___divmoddi4
	.type	___divmoddi4, @function
	.text

___divmoddi4:
	push	ebp
	push	ebx
	push	edi
	push	esi
	push	eax
	mov	edi, [esp+36]
	mov	ebx, [esp+24]
	mov	ebp, [esp+28]
	mov	esi, [esp+32]
	push	edi
	push	esi
	push	ebp
	push	ebx
	call	___divdi3
	add	esp, 16
	mov	ecx, eax
	mov	[esp], edx
	imul	edi, eax
	mul	eax, esi
	add	edx, edi
	mov	edi, [esp]
	imul	esi, edi
	add	esi, edx
	sub	ebx, eax
	mov	eax, [esp+40]
	mov	edx, edi
	sbb	ebp, esi
	mov	[eax], ebx
	mov	[eax+4], ebp
	mov	eax, ecx
	add	esp, 4
	pop	esi
	pop	edi
	pop	ebx
	pop	ebp
	ret

	.ident	"clang version 10.0.0 "
	.end
Additionally count the number of instructions and the code size of the __udivmoddi4() function provided in the clang_rt.builtins-i386.lib library: it shows 256 instructions in 759 bytes, while properly written (assembly) code shows only 68 instructions in just 166 bytes – and runs up to 11 (in words: eleven) times faster!

Compile the source file case2.c a second time with Clang, now with the preprocessor macro ALTERNATE defined, targeting the AMD64 platform, and display the generated assembly code:

clang -DALTERNATE -o- -O3 -S -target amd64-pc-linux case2.c
[…]
__udivmoddi4:				# @__udivmoddi4
# %bb.0:
	testq	%rdx, %rdx
	je	.LBB0_2
# %bb.1:
	movq	%rdx, %rcx
	movq	%rdi, %rax
	xorl	%edx, %edx
	divq	%rsi
	movq	%rdx, (%rcx)
.LBB0_2:
	movq	%rdi, %rax
	xorl	%edx, %edx
	divq	%rsi
	retq
[…]
	.ident	"clang version 10.0.0 "
[…]
Oops: despite the command line option -O3, the generated code performs two expensive divisions instead of just one!

Counterexample for the 3 preceding bugs

Compile the source file case2.c a third time with Clang, now without the preprocessor macro ALTERNATE defined and targeting the AMD64 platform, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-windows case2.c
[…]
lldiv:					# @lldiv
# %bb.0:
	movq	%rdx, %rax
	cqto
	idivq	%r8
	movq	%rax, (%rcx)
	movq	%rdx, 8(%rcx)
	movq	%rcx, %rax
	retq
[…]

Unoptimised code of 64÷64-bit integer division functions

Create the text file case3.c with the following source code, refactored from __divdi3() and __moddi3(), in an arbitrary, preferably empty directory:
// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

unsigned long long __udivmoddi4(unsigned long long dividend,
                                unsigned long long divisor,
                                unsigned long long *remainder);
#ifndef ALTERNATE
long long __divdi3(long long dividend, long long divisor) {
    long long r = divisor >> 63;    // r = divisor < 0 ? -1 : 0
    long long s = dividend >> 63;   // s = dividend < 0 ? -1 : 0
    divisor = (divisor ^ r) - r;    // negate if divisor < 0
    dividend = (dividend ^ s) - s;  // negate if dividend < 0
    s ^= r;                         // sign of quotient
                                    // negate if quotient < 0
    return (__udivmoddi4(dividend, divisor, 0) ^ s) - s;
}
#else
long long __moddi3(long long dividend, long long divisor) {
    long long r = divisor >> 63;    // r = divisor < 0 ? -1 : 0
    long long s = dividend >> 63;   // s = dividend < 0 ? -1 : 0
    divisor = (divisor ^ r) - r;    // negate if divisor < 0
    dividend = (dividend ^ s) - s;  // negate if dividend < 0
    __udivmoddi4(dividend, divisor, (unsigned long long *) &r);
    return (r ^ s) - s;             // negate if dividend < 0
}
#endif // ALTERNATE
Compile the source file case3.c with Clang, engaging its optimiser, targeting the i386 platform, and display the generated, not properly optimised assembly code:
clang -o- -O3 -S -target i386-pc-windows case3.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…]
___divdi3:				# @__divdi3
# %bb.0:
	pushl	%ebp			#
	movl	%esp, %ebp		#
	pushl	%ebx			#	push	ebx
	pushl	%edi			#
	pushl	%esi			#
	movl	20(%ebp), %ecx		#	mov	eax, [esp+20]
	movl	12(%ebp), %eax		#
	movl	16(%ebp), %edi		#	mov	ebx, [esp+16]
	movl	8(%ebp), %ebx		#
	movl	%ecx, %edx		#
	movl	%eax, %esi		#
	sarl	$31, %edx		#
	sarl	$31, %esi		#	cdq
	xorl	%edx, %edi		#	xor	ebx, edx
	xorl	%edx, %ecx		#	xor	eax, edx
	subl	%edx, %edi		#	sub	ebx, edx
	sbbl	%edx, %ecx		#	sbb	eax, edx
	xorl	%esi, %ebx		#
	xorl	%esi, %eax		#
	subl	%esi, %ebx		#
	sbbl	%esi, %eax		#
	xorl	%edx, %esi		#	mov	ebx, edx
	pushl	$0			#	push	0
	pushl	%ecx			#	push	eax
	pushl	%edi			#	push	ecx
					#	mov	eax, [esp+24]
					#	mov	ecx, [esp+20]
					#	cdq
					#	xor	ecx, edx
					#	xor	eax, edx
					#	sub	ecx, edx
					#	sbb	eax, edx
					#	xor	ebx, edx
	pushl	%eax			#	push	eax
	pushl	%ebx			#	push	ecx
	calll	___udivmoddi4		#	call	___udivmoddi4
	addl	$20, %esp		#	add	esp, 20
	xorl	%esi, %eax		#	xor	eax, ebx
	xorl	%esi, %edx		#	xor	edx, ebx
	subl	%esi, %eax		#	sub	eax, ebx
	sbbl	%esi, %edx		#	sbb	edx, ebx
	popl	%esi			#
	popl	%edi			#
	popl	%ebx			#	pop	ebx
	popl	%ebp			#
	retl				#	ret
[…]
Oops: despite the command line option -O3, the generated code shows 38 instructions and clobbers the registers EBP, EDI and ESI without necessity; the properly optimised code shows just 30 instructions and avoids 6 superfluous memory accesses.

Compile the source file case3.c a second time with Clang, now with the preprocessor macro ALTERNATE defined, again targeting the i386 platform, and display the generated assembly code:

clang -DALTERNATE -o- -O3 -S -target i386-pc-windows case3.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…]
___moddi3:				# @__moddi3
# %bb.0:
	pushl	%ebp			#
	movl	%esp, %ebp		#
	pushl	%ebx			#	push	ebx
	pushl	%edi			#
	pushl	%esi			#
	andl	$-8, %esp		#
	subl	$16, %esp		#	sub	esp, 8
	movl	20(%ebp), %ecx		#	mov	eax, [esp+28]
	movl	12(%ebp), %eax		#
	movl	16(%ebp), %edi		#	mov	ecx, [esp+24]
	movl	%esp, %ebx		#	push	esp
	movl	%ecx, %edx		#
	movl	%eax, %esi		#
	sarl	$31, %edx		#	cdq
	sarl	$31, %esi		#
	xorl	%edx, %edi		#	xor	ecx, edx
	xorl	%edx, %ecx		#	xor	eax, edx
	movl	%edx, 4(%esp)		#
	movl	%edx, (%esp)		#
	subl	%edx, %edi		#	sub	ecx, edx
	sbbl	%edx, %ecx		#	sbb	eax, edx
					#	push	eax
					#	push	ecx
					#	mov	eax, [esp+32]
	movl	8(%ebp), %edx		#	mov	ecx, [esp+28]
					#	cdq
					#	mov	ebx, edx
	xorl	%esi, %eax		#	xor	ecx, edx
	xorl	%esi, %edx		#	xor	eax, edx
	subl	%esi, %edx		#	sub	ecx, edx
	sbbl	%esi, %eax		#	sbb	eax, edx
	pushl	%ebx			#
	pushl	%ecx			#
	pushl	%edi			#
	pushl	%eax			#	push	eax
	pushl	%edx			#	push	ecx
	calll	___udivmoddi4		#	call	___udivmoddi4
	addl	$20, %esp		#	add	esp, 20
	movl	(%esp), %eax		#	pop	eax
	movl	4(%esp), %edx		#	pop	edx
	xorl	%esi, %eax		#	xor	eax, ebx
	xorl	%esi, %edx		#	xor	edx, ebx
	subl	%esi, %eax		#	sub	eax, ebx
	sbbl	%esi, %edx		#	sbb	edx, ebx
	leal	-12(%ebp), %esp		#
	popl	%esi			#
	popl	%edi			#
	popl	%ebx			#	pop	ebx
	popl	%ebp			#
	retl				#	ret
[…]
OUCH: despite the command line option -O3, the generated code shows 45 instructions and clobbers the registers EBP, EDI and ESI without necessity; the properly optimised code shows just 32 instructions and avoids 8 superfluous memory accesses.

Now compare the code shown above with the code from the clang_rt.builtins-i386.lib library, which is even worse and shows 51 instructions:

# Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

	.arch	generic32
	.code32
	.intel_syntax noprefix
	.global	___moddi3
	.type	___moddi3, @function
	.text

___moddi3:
	push	ebp
	mov	ebp, esp
	push	ebx
	push	edi
	push	esi
	and	esp, -8
	sub	esp, 16
	mov	edx, [___security_cookie]
	mov	ecx, [ebp+20]
	mov	esi, [ebp+16]
	mov	eax, [ebp+12]
	mov	edi, esp
	xor	edx, ebp
	mov	ebx, eax
	mov	[esp+8], edx
	mov	edx, ecx
	sar	edx, 31
	add	esi, edx
	adc	ecx, edx
	xor	esi, edx
	sar	ebx, 31
	xor	ecx, edx
	mov	edx, [ebp+8]
	xor	eax, ebx
	xor	edx, ebx
	sub	edx, ebx
	sbb	eax, ebx
	push	edi
	push	ecx
	push	esi
	push	eax
	push	edx
	call	___udivmoddi4
	add	esp, 20
	mov	edi, [esp]
	mov	esi, [esp+4]
	mov	ecx, [esp+8]
	xor	edi, ebx
	xor	esi, ebx
	sub	edi, ebx
	sbb	esi, ebx
	xor	ecx, ebp
	call	@__security_check_cookie@4
	mov	eax, edi
	mov	edx, esi
	lea	esp, [ebp-12]
	pop	esi
	pop	edi
	pop	ebx
	pop	ebp
	ret

	.ident	"clang version 10.0.0 "
	.end

Miserable performance of 128÷128-bit (and 64÷64-bit) integer division

Create the text file case4llvm.c with the following source code, refactored from __udivmodti4(), in an arbitrary, preferably empty directory:
// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

// Effects: if modulus != 0, *modulus = numerator % denominator
// Returns: numerator / denominator

// Translated from Figure 3-40 of The PowerPC Compiler Writer's Guide

__uint128_t __udivmodti4(__uint128_t numerator, __uint128_t denominator, __uint128_t *modulus) {
    union {
        __uint128_t all;
        struct {
            unsigned long long low, high;
        };
    } dividend, divisor, quotient, remainder;
    dividend.all = numerator;
    divisor.all = denominator;
    unsigned carry, shift;
    // special cases, X is unknown, K != 0
    if (dividend.high == 0) {
        if (divisor.high == 0) {
            // 0 X
            // ---
            // 0 X
            if (modulus)
                *modulus = dividend.low % divisor.low;
            return dividend.low / divisor.low;
        }
        // 0 X
        // ---
        // K X
        if (modulus)
            *modulus = dividend.low;
        return 0;
    }
    // dividend.high != 0
    if (divisor.low == 0) {
        if (divisor.high == 0) {
            // K X
            // ---
            // 0 0
            if (modulus)
                *modulus = dividend.high % divisor.low;
            return dividend.high / divisor.low;
        }
        // divisor.high != 0
        if (dividend.low == 0) {
            // K 0
            // ---
            // K 0
            if (modulus) {
                remainder.high = dividend.high % divisor.high;
                remainder.low = 0;
                *modulus = remainder.all;
            }
            return dividend.high / divisor.high;
        }
        // K K
        // ---
        // K 0
        if ((divisor.high & (divisor.high - 1)) == 0) {
            // divisor is a power of 2
            if (modulus) {
                remainder.low = dividend.low;
                remainder.high = dividend.high & (divisor.high - 1);
                *modulus = remainder.all;
            }
            return dividend.high >> __builtin_ctzll(divisor.high);
        }
        // K K
        // ---
        // K 0
        shift = __builtin_clzll(divisor.high)
              - __builtin_clzll(dividend.high);
        // 0 <= shift < 63 or shift large
        if (shift > 62) {
            if (modulus)
                *modulus = dividend.all;
            return 0;
        }
        ++shift;
        // 0 < shift < 64
#ifdef OPTIMIZE
        remainder.all = dividend.all >> shift;
#else
        remainder.high = dividend.high >> shift;
        remainder.low = (dividend.low >> shift)
                      | (dividend.high << (64 - shift));
#endif
        quotient.high = dividend.low << (64 - shift);
        quotient.low = 0;
    } else {
        // divisor.low != 0
        if (divisor.high == 0) {
            // K X
            // ---
            // 0 K
            if ((divisor.low & (divisor.low - 1)) == 0) {
                // divisor is a power of 2
                if (modulus)
                    *modulus = dividend.low & (divisor.low - 1);
                if (divisor.low == 1)
                    return dividend.all;
                shift = __builtin_ctzll(divisor.low);
#ifdef OPTIMIZE
                quotient.all = dividend.all >> shift;
#else
                quotient.high = dividend.high >> shift;
                quotient.low = (dividend.low >> shift)
                             | (dividend.high << (64 - shift));
#endif
                return quotient.all;
            }
            // K X
            // ---
            // 0 K
            shift = __builtin_clzll(divisor.low)
                  - __builtin_clzll(dividend.high)
                  + 65;
            // 1 < shift < 128
#ifdef OPTIMIZE
            remainder.all = dividend.all >> shift;
            quotient.all = dividend.all << (128 - shift);
#else
            if (shift == 64) {
                remainder.high = 0;
                remainder.low = dividend.high;
                quotient.high = dividend.low;
                quotient.low = 0;
            } else if (shift < 64) {
                // 1 < shift < 64
                remainder.high = dividend.high >> shift;
                remainder.low = (dividend.low >> shift)
                              | (dividend.high << (64 - shift));
                quotient.high = dividend.low << (64 - shift);
                quotient.low = 0;
            } else {
                // 64 < shift < 128
                remainder.high = 0;
                remainder.low = dividend.high >> (shift - 64);
                quotient.high = (dividend.low >> (shift - 64))
                              | (dividend.high << (128 - shift));
                quotient.low = dividend.low << (128 - shift);
            }
#endif
        } else {
            // K X
            // ---
            // K K
            shift = __builtin_clzll(divisor.high)
                  - __builtin_clzll(dividend.high);
            // 0 <= shift < 64 or shift large
            if (shift > 63) {
                if (modulus)
                    *modulus = dividend.all;
                return 0;
            }
            ++shift;
            // 0 < shift <= 64
#ifdef OPTIMIZE
            remainder.all = dividend.all >> shift;
            quotient.high = dividend.low << (64 - shift);
#else
            if (shift == 64) {
                remainder.high = 0;
                remainder.low = dividend.high;
                quotient.high = dividend.low;
            } else {
                remainder.high = dividend.high >> shift;
                remainder.low = (dividend.low >> shift)
                              | (dividend.high << (64 - shift));
                quotient.high = dividend.low << (64 - shift);
            }
#endif
            quotient.low = 0;
        }
    }
    // Not a special case
    // q and r are initialized with:
    // remainder.all = dividend.all >> shift;
    // quotient.all = dividend.all << (128 - shift);
    // 0 < shift < 128
    for (carry = 0; shift > 0; --shift) {
        // remainder:quotient = ((remainder:quotient) << 1) | carry
#ifdef OPTIMIZE
        remainder.all = (remainder.all << 1) | (quotient.all >> 127);
        quotient.all = (quotient.all << 1) | carry;
#else
        remainder.high = (remainder.high << 1) | (remainder.low >> 63);
        remainder.low = (remainder.low << 1) | (quotient.high >> 63);
        quotient.high = (quotient.high << 1) | (quotient.low >> 63);
        quotient.low = (quotient.low << 1) | carry;
#endif
#if 0
        carry = 0;
        if (remainder.all < divisor.all)
            continue;

        carry = 1;
        remainder.all -= divisor.all;
#elif 0
        carry = remainder.all >= divisor.all;
        if (carry != 0)
            remainder.all -= divisor.all;
#else
        const __int128_t sign = (__int128_t) (divisor.all - remainder.all - 1) >> 127;
        carry = sign & 1;
        remainder.all -= divisor.all & sign;
#endif
    }
#ifdef OPTIMIZE
    quotient.all = (quotient.all << 1) | carry;
#else
    quotient.high = (quotient.high << 1) | (quotient.low >> 63);
    quotient.low = (quotient.low << 1) | carry;
#endif
    if (modulus)
        *modulus = remainder.all;
    return quotient.all;
}
Note: with the preprocessor macro OPTIMIZE defined, this function gains about 10 % to 25 % performance when compiled with Clang, but loses about the same amount when compiled with GCC!

Create the text file case4gcc.c with the following source code, refactored from libgcc2.c, in the same directory as case4llvm.c:

// Copyleft © 2015-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>
// Copyright © 1989-2021 Free Software Foundation, Inc.

typedef unsigned long long uint64_t;

static inline
uint64_t DIV(uint64_t dividend_hi, uint64_t dividend_lo, uint64_t divisor, uint64_t *remainder) {
    uint64_t quotient;
    __asm__("divq\t%[divisor]"
           :"=a" (quotient),
            "=d" (*remainder)
           :"a" (dividend_lo),
            "d" (dividend_hi),
            [divisor] "rm" (divisor)
           :"cc");
    return quotient;
}

__uint128_t __udivmodti4(__uint128_t numerator, __uint128_t denominator, __uint128_t *modulus) {
    union {
        __uint128_t all;
        struct {
            uint64_t low, high;
        };
    } dividend, divisor, quotient, product;
    unsigned bn, bm;

    dividend.all = numerator;
    divisor.all = denominator;

    if (divisor.high == 0) {
        if (divisor.low > dividend.high) // 0:q = n:n / 0:d
            quotient.high = 0;
        else                             // q:q = n:n / 0:d
            quotient.high = DIV(0, dividend.high, divisor.low, &dividend.high);

        quotient.low = DIV(dividend.high, dividend.low, divisor.low, &dividend.low);

        // remainder in dividend.low
        if (modulus != 0) {
            dividend.high = 0;
            *modulus = dividend.all;
        }
    } else {
        if (divisor.high > dividend.high) { // 0:0 = n:n / d:d
            quotient.all = 0;

            // remainder in dividend.all
            if (modulus != 0)
                *modulus = dividend.all;
        } else { // 0:q = n:n / d:d

            bm = __builtin_clzll(divisor.high);
            if (bm == 0) {
                // since the most significant bit of divisor.high is set
                // and dividend.high >= divisor.high, the most significant
                // bit of dividend.high is set as well, and quotient.low
                // is either 0 or 1
                //
                // this special case is necessary, not an optimization!

                // the condition on the next line takes advantage that
                // (due to program flow) dividend.high >= divisor.high
                if (dividend.high > divisor.high || dividend.low >= divisor.low) {
                    dividend.all -= divisor.all;
                    quotient.low = 1;
                } else
                    quotient.low = 0;

                quotient.high = 0;

                if (modulus != 0)
                    *modulus = dividend.all;
            } else { // normalize
                bn = 64 - bm;

                quotient.high = dividend.high >> bn;
#ifdef OPTIMIZE
                dividend.all <<= bm;
                divisor.all <<= bm;
#else
                dividend.high <<= bm;
                dividend.high |= dividend.low >> bn;
                dividend.low <<= bm;

                divisor.high <<= bm;
                divisor.high |= divisor.low >> bn;
                divisor.low <<= bm;
#endif
                quotient.low = DIV(quotient.high, dividend.high, divisor.high, &dividend.high);
                quotient.high = 0;
                product.all = quotient.all * divisor.low;

                if (product.all > dividend.all) {
                    product.all -= divisor.all;
                    quotient.low -= 1;
                }

                // remainder is (dividend - product) >> bm
                if (modulus != 0) {
                    dividend.all -= product.all;
#ifdef OPTIMIZE
                    dividend.all >>= bm;
#else
                    dividend.low >>= bm;
                    dividend.low |= dividend.high << bn;
                    dividend.high >>= bm;
#endif
                    *modulus = dividend.all;
                }
            }
        }
    }

    return quotient.all;
}
Create the text file case4.c with the following content in the same directory as case4gcc.c and case4llvm.c:
// Copyright © 2015-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#include <stdint.h>
#include <stdio.h>
#include <time.h>

extern
__uint128_t __udivmodti4(__uint128_t dividend, __uint128_t divisor, __uint128_t *remainder);

__attribute__ ((noinline))
static
__uint128_t __unopti4(__uint128_t dividend, __uint128_t divisor, __uint128_t *remainder)
{
    if (remainder != NULL)
        *remainder = divisor;

    return dividend;
}

__attribute__ ((always_inline))
static
__uint128_t lfsr128(void)
{
    // 128-bit linear feedback shift register (Galois form) using
    //  primitive polynomial 0x5DB2B62B0C5F8E1B:D8CCE715FCB2726D,
    //   initialised with 2**128 / golden ratio

    static __uint128_t lfsr = (__uint128_t) 0x9E3779B97F4A7C15 << 64 | 0xF39CC0605CEDC834;
    const  __uint128_t poly = (__uint128_t) 0x5DB2B62B0C5F8E1B << 64 | 0xD8CCE715FCB2726D;
    const  __uint128_t sign = (__int128_t) lfsr >> 127;

    return lfsr = (lfsr << 1) ^ (poly & sign);
}

__attribute__ ((always_inline))
static
__uint128_t lfsr64(void)
{
    // 64-bit linear feedback shift register (Galois form) using
    //  primitive polynomial 0xAD93D23594C935A9 (CRC-64 "Jones"),
    //   initialised with 2**64 / golden ratio

    static uint64_t lfsr = 0x9E3779B97F4A7C15;
    const  uint64_t sign = (int64_t) lfsr >> 63;

    return lfsr = (lfsr << 1) ^ (0xAD93D23594C935A9 & sign);
}

__attribute__ ((always_inline))
static
__uint128_t lfsr32(void)
{
    // 32-bit linear feedback shift register (Galois form) using
    //  primitive polynomial 0xDB710641 (CRC-32 IEEE),
    //   initialised with 2**32 / golden ratio

    static uint32_t lfsr = 0x9E3779B9;
    const  uint32_t sign = (int32_t) lfsr >> 31;

    return lfsr = (lfsr << 1) ^ (0xDB710641 & sign);
}

int main(void)
{
    clock_t t0, t1, t2, tt;
    uint32_t n;
    __uint128_t dividend, divisor = ~0, remainder;
    volatile __uint128_t quotient;

    t0 = clock();

    for (n = 500000000u; n > 0u; n--)
    {
        dividend = lfsr128();
        quotient = __unopti4(dividend, divisor, NULL);
        divisor = lfsr64();
        quotient = __unopti4(dividend, divisor, &remainder);
    }

    t1 = clock();

    for (n = 500000000u; n > 0u; n--)
    {
        dividend = lfsr128();
        quotient = __udivmodti4(dividend, divisor, NULL);
        divisor = lfsr64();
        quotient = __udivmodti4(dividend, divisor, &remainder);
    }

    t2 = clock();
    tt = t2 - t0;
    t2 -= t1;
    t1 -= t0;
    t0 = t2 - t1;

    printf("\n"
           "__unopti4()       %4lu.%06lu       0\n"
           "__udivmodti4()    %4lu.%06lu    %4lu.%06lu\n"
           "                  %4lu.%06lu nano-seconds\n",
           t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);

    t0 = clock();

    for (n = 500000000u; n > 0u; n--)
    {
        dividend = lfsr128();
        quotient = __unopti4(dividend, divisor, NULL);
        divisor = lfsr32();
        quotient = __unopti4(dividend, divisor, &remainder);
    }

    t1 = clock();

    for (n = 500000000u; n > 0u; n--)
    {
        dividend = lfsr128();
        quotient = __udivmodti4(dividend, divisor, NULL);
        divisor = lfsr32();
        quotient = __udivmodti4(dividend, divisor, &remainder);
    }

    t2 = clock();
    tt = t2 - t0;
    t2 -= t1;
    t1 -= t0;
    t0 = t2 - t1;

    printf("\n"
           "__unopti4()       %4lu.%06lu       0\n"
           "__udivmodti4()    %4lu.%06lu    %4lu.%06lu\n"
           "                  %4lu.%06lu nano-seconds\n",
           t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);

    t0 = clock();

    for (n = 500000000u; n > 0u; n--)
    {
        dividend = lfsr64();
        quotient = __unopti4(dividend, divisor, NULL);
        divisor = lfsr32();
        quotient = __unopti4(dividend, divisor, &remainder);
    }

    t1 = clock();

    for (n = 500000000u; n > 0u; n--)
    {
        dividend = lfsr64();
        quotient = __udivmodti4(dividend, divisor, NULL);
        divisor = lfsr32();
        quotient = __udivmodti4(dividend, divisor, &remainder);
    }

    t2 = clock();
    tt = t2 - t0;
    t2 -= t1;
    t1 -= t0;
    t0 = t2 - t1;

    printf("\n"
           "__unopti4()       %4lu.%06lu       0\n"
           "__udivmodti4()    %4lu.%06lu    %4lu.%06lu\n"
           "                  %4lu.%06lu nano-seconds\n",
           t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);

    t0 = clock();

    for (n = 500000u; n > 0u; n--)
    {
        quotient = __unopti4(~0, 3, NULL);
        quotient = __unopti4(~0, 3, &remainder);
    }

    t1 = clock();

    for (n = 500000u; n > 0u; n--)
    {
        quotient = __udivmodti4(~0, 3, NULL);
        quotient = __udivmodti4(~0, 3, &remainder);
    }

    t2 = clock();
    tt = t2 - t0;
    t2 -= t1;
    t1 -= t0;
    t0 = t2 - t1;

    printf("\n"
           "__unopti4()       %4lu.%06lu       0\n"
           "__udivmodti4()    %4lu.%06lu    %4lu.%06lu\n"
           "                  %4lu.%06lu micro-seconds\n",
           t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);
}
Note: modification of the source files case4llvm.c, case4gcc.c and case4.c to demonstrate the equally disastrous (mis)performance of the 64÷64-bit integer division on 32-bit processors is left as an exercise for the reader.

On your favourite machine with an AMD64 processor, where both Clang and the GNU Compiler Collection are installed, run the following command lines to compile, link and run the benchmark program:

lscpu
gcc -mno-sse -O3 case4.c case4gcc.c
echo 'GCC/libgcc'
./a.out
clang -mno-sse -O3 case4.c case4llvm.c
echo 'LLVM/clang/compiler-rt'
./a.out
Note: if you prefer to use the library function __udivmodti4() provided by the respective compiler, run the following command lines instead:
lscpu
gcc -mno-sse -O3 case4.c
echo 'GCC/libgcc'
./a.out
clang -mno-sse -O3 -rtlib=compiler-rt case4.c
echo 'LLVM/clang/compiler-rt'
./a.out

Runtime measurement on AMD® Ryzen 5 3600

Note: for better readability and to ease comparison, the numbers are shown in two columns.
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   43 bits physical, 48 bits virtual
CPU(s):                          12
On-line CPU(s) list:             0-11
Thread(s) per core:              2
Core(s) per socket:              6
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           113
Model name:                      AMD Ryzen 5 3600 6-Core Processor
Stepping:                        0
Frequency boost:                 enabled
CPU MHz:                         2195.688
CPU max MHz:                     3600,0000
CPU min MHz:                     2200,0000
BogoMIPS:                        7186.29
Virtualization:                  AMD-V
L1d cache:                       192 KiB
L1i cache:                       192 KiB
L2 cache:                        3 MiB
L3 cache:                        32 MiB
NUMA node0 CPU(s):               0-11
[…]
                      GCC/libgcc                    LLVM/clang/compiler-rt

__unopti4()          1.668976       0              1.695685       0
__udivmodti4()      12.617999      10.949023     167.891651     166.195966
                    14.286975 nano-seconds       169.587336 nano-seconds

__unopti4()          1.760362       0              1.708451       0
__udivmodti4()      18.046460      16.286098     246.065291     244.356840
                    19.806822 nano-seconds       247.773742 nano-seconds

__unopti4()          1.248846       0              1.463868       0
__udivmodti4()       7.155582       5.906736      10.833658       9.369790
                     8.404428 nano-seconds        12.297526 nano-seconds
OUCH: in the best case, the executable generated by Clang performs the 128÷128-bit integer division 1.6 times slower than the executable generated by GCC, while in the worst case it is fully 15 (in words: fifteen) times slower – an utterly disastrous result!

Runtime measurement on AMD® EPYC 7262

Note: for better readability and to ease comparison, the numbers are shown in two columns.
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             1
NUMA node(s):          1
Vendor ID:             AuthenticAMD
CPU family:            23
Model:                 49
Model name:            AMD EPYC 7262 8-Core Processor
Stepping:              0
CPU MHz:               3193.726
BogoMIPS:              6387.45
Virtualization:        AMD-V
L1d cache:             32K
L1i cache:             32K
L2 cache:              512K
L3 cache:              16384K
NUMA node0 CPU(s):     0-15
[…]
                      GCC/libgcc                    LLVM/clang/compiler-rt

__unopti4()          1.920000       0              2.250000       0
__udivmodti4()      15.420000      13.500000     230.230000     227.980000
                    17.340000 nano-seconds       232.480000 nano-seconds

__unopti4()          2.000000       0              2.280000       0
__udivmodti4()      22.120000      20.120000     340.400000     338.120000
                    24.120000 nano-seconds       342.680000 nano-seconds

__unopti4()          1.620000       0              1.780000       0
__udivmodti4()       8.810000       7.190000      13.200000      11.420000
                    10.430000 nano-seconds        14.980000 nano-seconds
OUCH: now the worst case is 17 (in words: seventeen) times slower!

Runtime measurement on Intel® Core i5-6600

Note: for better readability and to ease comparison, the numbers are shown in two columns.
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          4
On-line CPU(s) list:             0-3
Thread(s) per core:              1
Core(s) per socket:              4
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           94
Model name:                      Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz
Stepping:                        3
CPU MHz:                         3311.998
BogoMIPS:                        6623.99
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       128 KiB
L1i cache:                       128 KiB
L2 cache:                        1 MiB
L3 cache:                        24 MiB
NUMA node0 CPU(s):               0-3
[…]
                      GCC/libgcc                    LLVM/clang/compiler-rt

__unopti4()          5.075482       0              5.077103       0
__udivmodti4()      34.428038      29.352556     183.407210     178.330107
                    39.503520 nano-seconds       188.484313 nano-seconds

__unopti4()          4.652310       0              4.653633       0
__udivmodti4()      36.139848      31.487538     280.501125     275.847492
                    40.792158 nano-seconds       285.154758 nano-seconds

__unopti4()          4.938115       0              4.650043       0
__udivmodti4()       8.971484       4.033369      12.585962       7.935919
                    13.909599 nano-seconds        17.236005 nano-seconds
Oops: in the best case, the executable generated by Clang performs the 128÷128-bit integer division at half the speed of the executable generated by GCC, while in the worst case it is fully 9 (in words: nine) times slower – another devastating result!

Careless (or clueless?) wasting of resources on Microsoft Windows

The executable installer LLVM-10.0.0-win64.exe dumps the following duplicate files in the directory C:\Program Files\LLVM\bin\, wasting nearly 500 MB disk space, which is about a third of the disk space occupied by the whole package:
C:\>DIR "C:\Program Files\LLVM\bin" /O:-S
[…]
03/25/2020   0:15 PM        83,258,880 clang-cl.exe
03/25/2020   0:03 PM        83,258,880 clang.exe
03/25/2020   0:15 PM        83,258,880 clang++.exe
03/25/2020   0:15 PM        83,258,880 clang-cpp.exe
[…]
03/25/2020   0:15 PM        57,812,480 lld-link.exe
03/25/2020   0:05 PM        57,812,480 lld.exe
03/25/2020   0:15 PM        57,812,480 ld.lld.exe
03/25/2020   0:15 PM        57,812,480 ld64.lld.exe
03/25/2020   0:15 PM        57,812,480 wasm-ld.exe
[…]
03/25/2020   0:15 PM        18,182,144 llvm-ranlib.exe
03/25/2020   0:15 PM        18,182,144 llvm-lib.exe
03/25/2020   0:00 PM        18,182,144 llvm-ar.exe
[…]
C:\>FC.EXE /B "C:\Program Files\LLVM\bin\clang.exe" "C:\Program Files\LLVM\bin\clang-cl.exe"
Comparing files C:\Program Files\LLVM\bin\clang.exe and C:\Program Files\LLVM\bin\clang-cl.exe
FC: no differences encountered
[…]
C:\>FSUTIL.EXE Hardlink List "C:\Program Files\LLVM\bin\clang.exe"
\Program Files\LLVM\bin\clang.exe
C:\>
[…]
Ditto for the executable installer LLVM-10.0.0-win32.exe:
C:\>DIR "C:\Program Files (x86)\LLVM\bin" /O:-S
[…]
03/25/2020  11:21 AM        77,862,912 clang-cpp.exe
03/25/2020  11:21 AM        77,862,912 clang-cl.exe
03/25/2020  11:21 AM        77,862,912 clang++.exe
03/25/2020  11:13 AM        77,862,912 clang.exe
[…]
03/25/2020  11:22 AM        54,811,648 ld.lld.exe
03/25/2020  11:22 AM        54,811,648 ld64.lld.exe
03/25/2020  11:15 AM        54,811,648 lld.exe
03/25/2020  11:22 AM        54,811,648 lld-link.exe
03/25/2020  11:22 AM        54,811,648 wasm-ld.exe
[…]
03/25/2020  11:21 AM        17,346,560 llvm-ranlib.exe
03/25/2020  11:10 AM        17,346,560 llvm-ar.exe
03/25/2020  11:21 AM        17,346,560 llvm-lib.exe
[…]
C:\>
Ever heard of hardlinks?
NTFS has supported them since its introduction nearly 30 years ago, and Windows NT has provided an API to create them for 24 years.

Which cross-compiler?

The pages Getting Started with the LLVM System and Cross-compilation using Clang state:
It is possible to cross-compile LLVM itself. That is, you can create LLVM executables and libraries to be hosted on a platform different from the platform where they are built (a Canadian Cross build).
Clang/LLVM is natively a cross-compiler, meaning that one set of programs can compile to all targets by setting the -target option.
Contrary to both statements, it’s not even possible to build an executable for 32-bit processors with the 64-bit compiler and linker on Windows, and vice versa, i.e. for another architecture of the same platform!

To falsify LLVM’s statements and prove my claim, create the text file case6.c with the following content in an arbitrary, preferably empty directory:

// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

void case6(_Complex double a, _Complex double b) {
    volatile _Complex double p = a * b, q = a / b;
}
Compile the source file case6.c with Clang from the 32-bit package, targeting the AMD64 platform, and display the generated assembly code:
"C:\Program Files (x86)\LLVM\bin\clang.exe" -o- -O3 -S -target amd64-pc-windows case6.c
[…]
case6:					# @case6
[…]
	callq	__muldc3
[…]
	callq	__divdc3
[…]
Note: the functions __divdc3() and __muldc3() for the AMD64 platform are provided in the library clang_rt.builtins-x86_64.lib.
"C:\Program Files (x86)\LLVM\bin\clang.exe" -O3 -target amd64-pc-windows case6.c
llvm-3d1c91.o : error LNK2019: unresolved external symbol __muldc3 referenced in function _main
llvm-3d1c91.o : error LNK2019: unresolved external symbol __divdc3 referenced in function _main
a.exe : fatal error LNK1120: 2 unresolved externals
clang: error: linker command failed with exit code 1120 (use -v to see invocation)
Compile the source file case6.c a second time, now with Clang from the 64-bit package, targeting the i386 platform, and display the generated assembly code:
"C:\Program Files\LLVM\bin\clang.exe" -o- -O3 -S -target i386-pc-windows case6.c
[…]
_case6:					# @case6
[…]
	calll	___muldc3
[…]
	calll	___divdc3
[…]
Note: the functions __divdc3() and __muldc3() for the i386 platform are provided in the library clang_rt.builtins-i386.lib.
"C:\Program Files\LLVM\bin\clang.exe" -O3 -target i386-pc-windows case6.c
llvm-f4c2d1.o : error LNK2019: unresolved external symbol ___muldc3 referenced in function _main
llvm-f4c2d1.o : error LNK2019: unresolved external symbol ___divdc3 referenced in function _main
a.exe : fatal error LNK1120: 2 unresolved externals
clang: error: linker command failed with exit code 1120 (use -v to see invocation)
Although the compiler installed with both packages is able to produce 32-bit as well as 64-bit objects, and the linker is able to produce either 32-bit executables from 32-bit objects or 64-bit executables from 64-bit objects, 32-bit objects can’t be linked to produce 32-bit executables using the 64-bit package, and vice versa: each package contains only the clang_rt.*.lib for its own processor architecture!
C:\>DIR "C:\Program Files\LLVM\lib\clang\10.0.0\lib\windows\*-i386.lib"
[…]
File Not Found
[…]
C:\>DIR "C:\Program Files (x86)\LLVM\lib\clang\10.0.0\lib\windows\*-x86_64.lib"
[…]
File Not Found
[…]
Note: side-by-side installation of the 32-bit and 64-bit package needs 3 GB disk space, wasting 1 GB for duplicate files – that’s 100 times the disk space of the missing clang_rt.*.lib libraries!

Misleading message box from installer on Microsoft Windows

[Screen shot of misleading installer message box] Poor souls who want (really: need; see above) to install the 64-bit package after or aside the 32-bit package (or vice versa) on Windows are greeted by the installer with the misleading message box shown on the right: the version labelled as old is in fact the current version for the other processor architecture!
If they choose to continue without uninstallation, the shortcuts previously created in the start menu by the other version, together with its uninstall information, are overwritten – a really user-friendly move!

Also note the Denglisch gibberish: the title bar and the buttons are localised, but the message text isn’t.

Inconsistent undefined behaviour

Create the text file case8.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

int __absusi2(int value) {
    if (value < 0)
        value = -value;
    if (value < 0)
        __builtin_trap();
    return value;
}

int __absvsi2(int value) {
    const int sign = value >> 31;
    value += sign;
    value ^= sign;
    if (value < 0)
        __builtin_trap();
    return value;
}

int __abswsi2(int value) {
    const int sign = 0 - (value < 0);
    value ^= sign;
    value -= sign;
    if (value < 0)
        __builtin_trap();
    return value;
}

long __absudi2(long value) {
    if (value < 0)
        value = -value;
    if (value < 0)
        __builtin_trap();
    return value;
}

long __absvdi2(long value) {
    const long sign = value >> 63;
    value += sign;
    value ^= sign;
    if (value < 0)
        __builtin_trap();
    return value;
}

long __abswdi2(long value) {
    const long sign = 0 - (value < 0);
    value ^= sign;
    value -= sign;
    if (value < 0)
        __builtin_trap();
    return value;
}

__int128_t __absuti2(__int128_t value) {
    if (value < 0)
        value = -value;
    if (value < 0)
        __builtin_trap();
    return value;
}

__int128_t __absvti2(__int128_t value) {
    const __int128_t sign = value >> 127;
    value += sign;
    value ^= sign;
    if (value < 0)
        __builtin_trap();
    return value;
}

__int128_t __abswti2(__int128_t value) {
    const __int128_t sign = 0 - (value < 0);
    value ^= sign;
    value -= sign;
    if (value < 0)
        __builtin_trap();
    return value;
}
Compile the source file case8.c with Clang, engaging the optimiser, targeting the AMD64 platform, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-linux case8.c
[…]
__absusi2:				# @__absusi2
# %bb.0:
	movl	%edi, %eax
	negl	%eax
	cmovll	%edi, %eax
	retq
[…]
__absvsi2:				# @__absvsi2
# %bb.0:
	movl	%edi, %eax
	negl	%eax
	cmovll	%edi, %eax
	retq
[…]
__abswsi2:				# @__abswsi2
# %bb.0:
	movl	%edi, %eax
	negl	%eax
	cmovll	%edi, %eax
	retq
[…]
__absudi2:				# @__absudi2
# %bb.0:
	movq	%rdi, %rax
	negq	%rax
	cmovlq	%rdi, %rax
	retq
[…]
__absvdi2:				# @__absvdi2
# %bb.0:
	movq	%rdi, %rax
	negq	%rax
	cmovlq	%rdi, %rax
	retq
[…]
__abswdi2:				# @__abswdi2
# %bb.0:
	movq	%rdi, %rax
	negq	%rax
	cmovlq	%rdi, %rax
	retq
[…]
__absuti2:				# @__absuti2
# %bb.0:
	xorl	%edx, %edx
	movq	%rdi, %rax
	negq	%rax
	sbbq	%rsi, %rdx
	testq	%rsi, %rsi
	cmovnsq	%rdi, %rax
	cmovnsq	%rsi, %rdx
	retq
[…]
__absvti2:				# @__absvti2
# %bb.0:
	movq	%rsi, %rdx
	movq	%rdi, %rax
	movq	%rsi, %rcx
	sarq	$63, %rcx
	addq	%rcx, %rax
	adcq	%rcx, %rdx
	xorq	%rcx, %rdx
	js	.LBB7_1
# %bb.2:
	xorq	%rcx, %rax
	retq
.LBB7_1:
	ud2
[…]
__abswti2:				# @__abswti2
# %bb.0:
	movq	%rsi, %rdx
	movq	%rdi, %rax
	movq	%rsi, %rcx
	sarq	$63, %rcx
	xorq	%rcx, %rdx
	xorq	%rcx, %rax
	subq	%rcx, %rax
	sbbq	%rcx, %rdx
	js	.LBB8_1
# %bb.2:
	retq
.LBB8_1:
	ud2
[…]
	.ident	"clang version 10.0.0 "
[…]
Oops: except for the __absvti2() and __abswti2() functions, the optimiser removes the conditional expression detecting integer overflow – i.e. undefined behaviour – so it recognises the two alternative implementations of abs() for 32-bit and 64-bit integers, but fails to do so only for 128-bit integers!

Impaired performance of one of the most trivial functions

Create the text file case9.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifdef __amd64__
__int128_t __absti2(__int128_t value) {
    return value < 0 ? -value : value;
}
#else
long long __absdi2(long long value) {
#ifndef ALTERNATE
    return value < 0 ? -value : value;
#else
    return __builtin_llabs(value);
#endif
}

int __abssi2(int value) {
#ifndef ALTERNATE
    return value < 0 ? -value : value;
#else
    return __builtin_abs(value);
#endif
}
#endif
Compile the source file case9.c with Clang, engaging its optimiser, targeting the AMD64 platform, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-linux case9.c
Note: the left column shows the generated code, while the right column shows shorter, but still not properly optimised, code as comments.
[…]
__absti2:				# @__absti2
# %bb.0:
	xorl	%edx, %edx		#	xor	edx, edx
	movq	%rdi, %rax		#	mov	rax, rdi
	negq	%rax			#	neg	rax
	sbbq	%rsi, %rdx		#	sbb	rdx, rsi
	testq	%rsi, %rsi		#
	cmovnsq	%rdi, %rax		#	cmovs	rax, rdi
	cmovnsq	%rsi, %rdx		#	cmovs	rdx, rsi
	retq				#	ret
[…]
	.ident	"clang version 10.0.0 "
[…]
Oops: the use of CMOVcc instructions introduces an explicit data dependency without necessity and impairs performance!

Ouch: the TEST instruction is superfluous!

The following code for the __absti2() function avoids the CMOVcc instructions and is 3 bytes shorter:

# Copyright © 2004-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.arch	generic64
.code64
.intel_syntax noprefix
.text

__absti2:
	mov	rax, rsi	# rax = high qword of argument
	cqo			# rdx = (argument < 0) ? -1 : 0
	mov	rax, rdx	# rdx:rax = (argument < 0) ? -1 : 0
	add	rdi, rdx
	adc	rsi, rdx	# rsi:rdi = (argument < 0) ? argument - 1 : argument
	xor	rax, rdi
	xor	rdx, rsi	# rdx:rax = (argument < 0) ? -argument : argument
				#         = |argument|
	ret

.size	__absti2, .-__absti2
.type	__absti2, @function
.global	__absti2
.end
Compile the source file case9.c a second time with Clang, now targeting the i386 platform, and display the generated assembly code:
clang -o- -O3 -S -target i386-pc-linux case9.c
[…]
__absdi2:				# @__absdi2
# %bb.0:
	movl	8(%esp), %edx
	movl	4(%esp), %eax
	movl	%edx, %ecx
	sarl	$31, %ecx
	addl	%ecx, %eax
	adcl	%ecx, %edx
	xorl	%ecx, %eax
	xorl	%ecx, %edx
	retl
[…]
__abssi2:				# @__abssi2
# %bb.0:
	movl	4(%esp), %ecx
	movl	%ecx, %eax
	negl	%eax
	cmovll	%ecx, %eax
	retl
[…]
	.ident	"clang version 10.0.0 "
[…]
Oops: the use of a CMOVcc instruction introduces an explicit data dependency without necessity and impairs performance!

The following code for the __absdi2() and __abssi2() functions avoids the CMOVcc instruction and is shorter too:

# Copyright © 2004-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.arch	generic32
.code32
.intel_syntax noprefix
.text

__absdi2:
	mov	eax, [esp+8]
	mov	ecx, [esp+4]	# eax:ecx = argument
	cdq			# edx = (argument < 0) ? -1 : 0
	add	ecx, edx
	adc	eax, edx	# eax:ecx = (argument < 0) ? argument - 1 : argument
	xor	ecx, edx
	xor	edx, eax	# edx:ecx = (argument < 0) ? -argument : argument
				#         = |argument|
	mov	eax, ecx	# edx:eax = |argument|
	ret

.size	__absdi2, .-__absdi2
.type	__absdi2, @function
.global	__absdi2

__abssi2:
	mov	eax, [esp+4]	# eax = argument
	cdq			# edx = (argument < 0) ? -1 : 0
	add	eax, edx	# eax = (argument < 0) ? argument - 1 : argument
	xor	eax, edx	# eax = (argument < 0) ? -argument : argument
				#     = |argument|
	ret

.size	__abssi2, .-__abssi2
.type	__abssi2, @function
.global	__abssi2
.end
Note: exploration of the code generated with the preprocessor macro ALTERNATE defined is left as an exercise for the reader.

Invalid machine code generated for __bswapsi2(), __bswapdi2() and __builtin_bswap*()

Create the text file case10.c with the following source code, partially copied from __bswapsi2() and __bswapdi2(), in an arbitrary, preferably empty directory:
// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

unsigned int __aswapsi2(unsigned int u) {
  return __builtin_bswap32(u);
}

unsigned int __bswapsi2(unsigned int u) {
  return (((u & 0xFF000000U) >> 24)
        | ((u & 0x00FF0000U) >> 8)
        | ((u & 0x0000FF00U) << 8)
        | ((u & 0x000000FFU) << 24));
}

unsigned long long __aswapdi2(unsigned long long u) {
  return __builtin_bswap64(u);
}

unsigned long long __bswapdi2(unsigned long long u) {
  return (((u & 0xFF00000000000000ULL) >> 56)
        | ((u & 0x00FF000000000000ULL) >> 40)
        | ((u & 0x0000FF0000000000ULL) >> 24)
        | ((u & 0x000000FF00000000ULL) >> 8)
        | ((u & 0x00000000FF000000ULL) << 8)
        | ((u & 0x0000000000FF0000ULL) << 24)
        | ((u & 0x000000000000FF00ULL) << 40)
        | ((u & 0x00000000000000FFULL) << 56));
}
Compile the source file case10.c with Clang, engaging its optimiser, targeting the i386 processor, and display the generated assembly code:
clang -m32 -march=i386 -o- -O3 -S -target i386-pc-linux case10.c
Note: the left column shows the generated invalid code, while the right column shows properly optimised valid code as comments.
[…]
__aswapsi2:				# @__aswapsi2
# %bb.0:
	movl	4(%esp), %eax		#	mov	eax, [esp+4]
	bswapl	%eax			#	ror	ax, 8
					#	ror	eax, 16
					#	ror	ax, 8
	retl				#	ret
[…]
__bswapsi2:				# @__bswapsi2
# %bb.0:
	movl	4(%esp), %eax		#	mov	eax, [esp+4]
	bswapl	%eax			#	ror	ax, 8
					#	ror	eax, 16
					#	ror	ax, 8
	retl				#	ret
[…]
__aswapdi2:				# @__aswapdi2
# %bb.0:
	movl	4(%esp), %edx		#	mov	eax, [esp+8]
	movl	8(%esp), %eax		#	mov	edx, [esp+4]
	bswapl	%eax			#	ror	ax, 8
					#	ror	dx, 8
					#	ror	eax, 16
					#	ror	edx, 16
					#	ror	ax, 8
	bswapl	%edx			#	ror	dx, 8
	retl				#	ret
[…]
__bswapdi2:				# @__bswapdi2
# %bb.0:
	movl	4(%esp), %edx		#	mov	eax, [esp+8]
	movl	8(%esp), %eax		#	mov	edx, [esp+4]
	bswapl	%eax			#	ror	ax, 8
					#	ror	dx, 8
					#	ror	eax, 16
					#	ror	edx, 16
					#	ror	ax, 8
	bswapl	%edx			#	ror	dx, 8
	retl				#	ret
[…]
	.ident	"clang version 10.0.0 "
[…]
Ouch: the BSWAP instruction was introduced with the i486 processor; it is not supported on the i386!

Bad, insufficient or no (peephole) optimisation at all in various library functions

Peek into the code of some functions provided in the clang_rt.builtins-i386.lib and the libclang_rt.builtins-x86_64.a libraries.

__absvti2()

The __absvti2() function provided in the libclang_rt.builtins-x86_64.a library exhibits the code shown in the left column below; the right column shows better, but still not properly optimised code, with only 10 instructions in just 27 bytes instead of 18 instructions in 69 bytes:
# Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

	.arch	generic64
	.code64
	.intel_syntax noprefix
	.global	__absvti2
	.type	__absvti2, @function
	.text

__absvti2:
	mov	rcx, 8000000000000000h	#
	xor	rcx, rsi		#
	or	rcx, rdi		#
	jz	.overflow		#
	mov	rdx, rsi		#	mov	rdx, rsi
	mov	rax, rdi		#	mov	rax, rdi
	mov	rcx, rsi		#	sar	rsi, 63
	sar	rcx, 63			#	xor	rax, rsi
	xor	rdx, rcx		#	xor	rdx, rsi
	xor	rax, rcx		#	sub	rax, rsi
	sub	rax, rcx		#	sbb	rdx, rsi
	sbb	rdx, rcx		#	jo	.overflow
	ret				#	ret
.overflow:
	push	rax			#	ud2
	lea	rdi, …			#
	lea	rdx, …			#
	mov	esi, 24			#
	call	__abort			#

	.ident	"clang version 10.0.0 "
	.end
The following source coaxes Clang to generate this better code – almost, with but one superfluous MOV instruction:
// Copyright © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

__int128_t __absvti2(__int128_t value) {
#ifndef ALTERNATE
    const __int128_t sign = 0 - (value < 0);
    value ^= sign;
    value -= sign;
#else
    const __int128_t sign = value >> 127;
    value += sign;
    value ^= sign;
#endif
    if (value < 0)
        __builtin_trap();
    return value;
}

__ashldi3(), __ashrdi3() and __lshrdi3()

The __ashldi3() function provided in the clang_rt.builtins-i386.lib library exhibits the braindead code shown in the left column below; the right column shows properly optimised code, with only 12 instructions in just 30 bytes instead of 24 instructions in 51 bytes, avoiding a superfluous conditional branch:
# Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

	.arch	generic32
	.code32
	.intel_syntax noprefix
	.global	___ashldi3
	.type	___ashldi3, @function
	.text

___ashldi3:
	push	esi			#
	mov	ecx, [esp+16]		#	mov	ecx, [esp+12]
	mov	edx, [esp+8]		#	mov	eax, [esp+4]
	test	cl, 32			#	test	cl, 32
	jnz	0f			#	jnz	0f
	mov	esi, [esp+12]		#	mov	edx, [esp+8]
	test	ecx, ecx		#
	jz	2f			#
	mov	eax, edx		#
	shl	eax, cl			#
	shld	esi, edx, cl		#	shld	edx, eax, cl
	xor	ecx, ecx		#	shl	eax, cl
	mov	edx, esi		#
	jmp	1f			#	ret
0:
	shl	edx, cl			#	shl	eax, cl
	xor	eax, eax		#	mov	edx, eax
	xor	ecx, ecx		#	xor	eax, eax
1:
	or	edx, ecx		#
	pop	esi			#
	ret				#
2:
	mov	eax, edx		#
	mov	edx, esi		#
	pop	esi			#
	ret				#	ret

	.ident	"clang version 10.0.0 "
	.end
Note: exploration of the equally bad code generated for the __ashrdi3() and __lshrdi3() functions is left as an exercise to the reader.

__cmp?i2() and __ucmp?i2()

In their second halves, the functions __cmpdi2() and __ucmpdi2(), provided in the clang_rt.builtins-i386.lib library, exhibit the insane code shown in the left column below; the right column shows shorter (but still unoptimised) code, with 12 instructions in 35 bytes instead of 15 instructions in 47 bytes for each function:
# Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

	.arch	generic32
	.code32
	.intel_syntax noprefix
	.global	___cmpdi2
	.global	___ucmpdi2
	.type	___cmpdi2, @function
	.type	___ucmpdi2, @function
	.text

___cmpdi2:
	mov	ecx, [esp+16]		#	mov	ecx, [esp+16]
	xor	eax, eax		#	xor	eax, eax
	cmp	[esp+8], ecx		#	cmp	ecx, [esp+8]
	jl	0f			#	jg	0f
	mov	eax, 2			#	mov	eax, 2
	jg	0f			#	jl	0f
	mov	ecx, [esp+4]		#
	mov	edx, [esp+12]		#	mov	ecx, [esp+12]
	mov	eax, 0			#	xor	eax, eax
	cmp	ecx, edx		#	cmp	ecx, [esp+4]
	jb	0f			#	ja	0f
	cmp	edx, ecx		#
	mov	eax, 1			#
	adc	eax, 0			#	adc	eax, 1
0:					# 0:
	ret				#	ret

___ucmpdi2:
	mov	ecx, [esp+16]		#	mov	ecx, [esp+16]
	xor	eax, eax		#	xor	eax, eax
	cmp	[esp+8], ecx		#	cmp	ecx, [esp+8]
	jb	0f			#	ja	0f
	mov	eax, 2			#	mov	eax, 2
	ja	0f			#	jb	0f
	mov	ecx, [esp+4]		#
	mov	edx, [esp+12]		#	mov	ecx, [esp+12]
	mov	eax, 0			#	xor	eax, eax
	cmp	ecx, edx		#	cmp	ecx, [esp+4]
	jb	0f			#	ja	0f
	cmp	edx, ecx		#
	mov	eax, 1			#
	adc	eax, 0			#	adc	eax, 1
0:					# 0:
	ret				#	ret

	.ident	"clang version 10.0.0 "
	.end
Assembly programmers, however, write straightforward and branch-free code, using 14 instructions in 41 bytes for the __cmpdi2() function, and only 11 instructions in just 34 bytes for the __ucmpdi2() function:
# Copyright © 2004-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.arch	generic32
.code32
.intel_syntax noprefix
.text

__cmpdi2:
	mov	ecx, [esp+4]	# ecx = low dword of left argument
	mov	edx, [esp+12]	# edx = low dword of right argument
	cmp	ecx, edx
	mov	eax, [esp+8]	# eax = high dword of left argument
	sbb	eax, [esp+16]	# eax = high dword of left argument
				#     - high dword of right argument
				#     - borrow
	setl	ah		# ah = left argument < right argument
	cmp	edx, ecx
	mov	edx, [esp+16]	# edx = high dword of right argument
	sbb	edx, [esp+8]	# edx = high dword of right argument
				#     - high dword of left argument
				#     - borrow
	setl	al		# al = left argument > right argument
	sub	al, ah		# al = left argument > right argument
				#    - left argument < right argument
	movsx	eax, al		# eax = left argument > right argument
				#     - left argument < right argument
				#     = {-1, 0, 1}
	inc	eax		# eax = {0, 1, 2}
	ret

.size	__cmpdi2, .-__cmpdi2
.type	__cmpdi2, @function
.global	__cmpdi2

__ucmpdi2:
	mov	ecx, [esp+4]	# ecx = low dword of left argument
	mov	edx, [esp+12]	# edx = low dword of right argument
	cmp	ecx, edx
	mov	eax, [esp+8]	# eax = high dword of left argument
	sbb	eax, [esp+16]	# eax = high dword of left argument
				#     - high dword of right argument
				#     - borrow
	sbb	eax, eax	# eax = 0
				#     - left argument < right argument
	cmp	edx, ecx
	mov	edx, [esp+16]	# edx = high dword of right argument
	sbb	edx, [esp+8]	# edx = high dword of right argument
				#     - high dword of left argument
				#     - borrow
	adc	eax, 1		# eax = left argument > right argument
				#     - left argument < right argument
				#     + 1
				#     = {0, 1, 2}
	ret

.size	__ucmpdi2, .-__ucmpdi2
.type	__ucmpdi2, @function
.global	__ucmpdi2
.end
Note: exploration of the equally bad code generated for the __cmpti2() and __ucmpti2() functions is left as an exercise to the reader.

__mulo?i4()

The __mulosi4() and __mulodi4() functions provided in the clang_rt.builtins-i386.lib library exhibit completely insane and horrible code with 51 instructions in 130 bytes and 98 instructions in 266 bytes respectively, while the __mulosi4(), __mulodi4() and __muloti4() functions provided in the libclang_rt.builtins-x86_64.a library exhibit completely insane and horrible code with 44 instructions in 131 bytes, 48 instructions in 155 bytes, and 94 instructions in 302 bytes respectively.

Note: completely insane and horrible code not shown here to preserve the mental health of the reader!

Note: Clang generates calls of these monstrosities for the __builtin_*mul*_overflow builtins as well as the -fsanitize=integer, -fsanitize=undefined and -ftrapv command line options.

Create the text file case11.c with the following content in an arbitrary, preferably empty directory:

// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifndef __amd64__
#ifndef OPTIMIZE
long long mulodi4(long long multiplicand, long long multiplier, int *overflow) {
    long long product;
    *overflow = __builtin_smulll_overflow(multiplicand, multiplier, &product);
    return product;
}
#else
long long __mulodi4(long long multiplicand, long long multiplier, int *overflow) {
    unsigned long long product, sign = 0 - (multiplicand < 0), tmp = 0 - (multiplier < 0);
    *overflow = __builtin_umulll_overflow((multiplicand ^ sign) - sign,
                                          (multiplier ^ tmp) - tmp,
                                          &product);
    sign ^= tmp;
    product = (product ^ sign) - sign;
    *overflow |= (long long) (product ^ sign) < 0;
    return product;
}
#endif
#else
#ifndef OPTIMIZE
__int128_t muloti4(__int128_t multiplicand, __int128_t multiplier, int *overflow) {
    __int128_t product;
    *overflow = __builtin_mul_overflow(multiplicand, multiplier, &product);
    return product;
}
#else
__int128_t __muloti4(__int128_t multiplicand, __int128_t multiplier, int *overflow) {
    __uint128_t product, sign = 0 - (multiplicand < 0), tmp = 0 - (multiplier < 0);
    *overflow = __builtin_mul_overflow((multiplicand ^ sign) - sign,
                                       (multiplier ^ tmp) - tmp,
                                       &product);
    sign ^= tmp;
    product = (product ^ sign) - sign;
    *overflow |= (__int128_t) (product ^ sign) < 0;
    return product;
}
#endif
#endif // __amd64__
Compile the source file case11.c with Clang, engaging the optimiser, targeting the AMD64 platform, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-linux case11.c
Note: the left column shows the generated code, while the right column shows shorter, but still not optimised code as comment.
[…]
muloti4:				# @muloti4
# %bb.0:
	pushq	%rbx			#	push	r8
	subq	$16, %rsp		#	xor	eax, eax
	movq	%r8, %rbx		#	push	rax
	movq	$0, 8(%rsp)		#	push	rax
	leaq	8(%rsp), %r8		#	mov	r8, rsp
	callq	__muloti4		#	call	__muloti4
	xorl	%ecx, %ecx		#	pop	r8
	cmpq	$0, 8(%rsp)		#	pop	rcx
	setne	%cl			#	cmp	rcx, r8
	movl	%ecx, (%rbx)		#	setne	cl
	addq	$16, %rsp		#	pop	r8
	popq	%rbx			#	mov	[r8], ecx
	retq				#	ret
[…]
	.ident	"clang version 10.0.0 "
[…]
OOPS: 13 instructions in just 30 bytes instead of 13 instructions in 46 bytes – not counting the additional 343 instructions in 1109 bytes of the called __muloti4(), __divti3(), __udivti3() and __udivmodti4() functions!

Compile the source file case11.c a second time with Clang, now with the preprocessor macro OPTIMIZE defined, and display the generated assembly code:

clang -DOPTIMIZE -o- -O3 -S -target amd64-pc-linux case11.c
[…]
__muloti4:				# @__muloti4
# %bb.0:
	pushq	%rbp
	pushq	%r15
	pushq	%r14
	pushq	%rbx
	movq	%rcx, %r10
	movq	%rsi, %rax
	movq	%rsi, %r11
	sarq	$63, %r11
	sarq	$63, %rcx
	xorq	%r11, %rax
	xorq	%r11, %rdi
	subq	%r11, %rdi
	sbbq	%r11, %rax
	movq	%rdx, %rsi
	xorq	%rcx, %r10
	xorq	%rcx, %rsi
	subq	%rcx, %rsi
	sbbq	%rcx, %r10
	setne	%dl
	testq	%rax, %rax
	setne	%r14b
	andb	%dl, %r14b
	mulq	%rsi
	movq	%rax, %r9
	seto	%bpl
	movq	%r10, %rax
	mulq	%rdi
	movq	%rax, %r10
	seto	%r15b
	orb	%bpl, %r15b
	orb	%r14b, %r15b
	addq	%r9, %r10
	movq	%rdi, %rax
	mulq	%rsi
	addq	%r10, %rdx
	setb	%bl
	orb	%r15b, %bl
	movzbl	%bl, %esi
	xorq	%r11, %rcx
	xorq	%rcx, %rdx
	xorq	%rcx, %rax
	subq	%rcx, %rax
	sbbq	%rcx, %rdx
	xorq	%rdx, %rcx
	shrq	$63, %rcx
	orl	%esi, %ecx
	movl	%ecx, (%r8)
	popq	%rbx
	popq	%r14
	popq	%r15
	popq	%rbp
	retq
[…]
	.ident	"clang version 10.0.0 "
[…]
Oops: the proper implementation of the __muloti4() function coaxes Clang to generate 52 instructions in 147 bytes instead of 94 instructions in 302 bytes – without calls of the __divti3() and __udivti3() functions!

Assembly programmers, however, use only 44 instructions in just 121 bytes, don't clobber the non-volatile registers RBP, RBX, R14 and R15, and avoid the 8 memory accesses as well as the 9 partial register writes which impair performance:

# Copyright © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.arch	generic64
.code64
.intel_syntax noprefix
.text
				# rsi:rdi = multiplicand
				# rcx:rdx = multiplier
				# r8 = dword ptr flag
__muloti4:
	mov	r10, rdx	# r10 = low qword of multiplier
	mov	r11, rcx	# r11 = high qword of multiplier

	mov	rax, rcx	# rax = high qword of multiplier
	cqo			# rdx = (multiplier < 0) ? -1 : 0
	mov	r9, rdx		# r9 = (multiplier < 0) ? -1 : 0
	xor	r10, rdx
	xor	r11, rdx	# r11:r10 = (multiplier < 0) ? ~multiplier : multiplier
	sub	r10, rdx
	sbb	r11, rdx	# r11:r10 = (multiplier < 0) ? -multiplier : multiplier
				#         = |multiplier|
	mov	rax, rsi	# rax = high qword of multiplicand
	cqo			# rdx = (multiplicand < 0) ? -1 : 0
	xor	r9, rdx		# r9 = (multiplicand < 0) <> (multiplier < 0) ? -1 : 0
				#    = (product < 0) ? -1 : 0
	xor	rdi, rdx
	xor	rsi, rdx	# rsi:rdi = (multiplicand < 0) ? ~multiplicand : multiplicand
	sub	rdi, rdx
	sbb	rsi, rdx	# rsi:rdi = (multiplicand < 0) ? -multiplicand : multiplicand
				#         = |multiplicand|
	xor	ecx, ecx
	mov	[r8], ecx	# flag = 0
	cmp	rcx, rsi
	sbb	edx, edx	# edx = (high qword of |multiplicand| = 0) ? 0 : -1
				#     = (|multiplicand| < 2**64) ? 0 : -1
	cmp	rcx, r11
	sbb	ecx, ecx	# ecx = (high qword of |multiplier| = 0) ? 0 : -1
				#     = (|multiplier| < 2**64) ? 0 : -1
	and	ecx, edx	# ecx = (|multiplicand| < 2**64)
				#     & (|multiplier| < 2**64) ? 0 : -1
				#     = (|product| < 2**128) ? 0 : -1
	mov	rax, rsi
	mul	r10		# rdx:rax = high qword of |multiplicand|
				#         * low qword of |multiplier|
	adc	ecx, ecx	# ecx = (|product| < 2**128) ? 0 : *

	mov	rsi, rax
	mov	rax, rdi
	mul	r11		# rdx:rax = low qword of |multiplicand|
				#         * high qword of |multiplier|
	adc	ecx, ecx	# ecx = (|product| < 2**128) ? 0 : *

	add	rsi, rax	# rsi = high qword of |multiplicand|
				#     * low qword of |multiplier|
				#     + low qword of |multiplicand|
				#     * high qword of |multiplier|
#	adc	ecx, ecx	# ecx = (|product| < 2**128) ? 0 : *

	mov	rax, rdi
	mul	r10		# rdx:rax = low qword of |multiplicand|
				#         * low qword of |multiplier|
	add	rdx, rsi	# rdx:rax = |product % 2**128|
				#         = |product| % 2**128
	adc	ecx, ecx	# ecx = (|product| < 2**128) ? 0 : *
.if 0
	xor	rax, r9
	xor	rdx, r9		# rdx:rax = (product < 0)
				#         ? product % 2**128 - 1 : product % 2**128
	sub	rax, r9
	sbb	rdx, r9		# rdx:rax = product % 2**128

	xor	r9, rdx		# r9 = (product % 2**128 < -2**127)
				#    | (product % 2**128 >= 2**127)
				#    ? negative : positive
	add	r9, r9
.else
	add	rax, r9
	adc	rdx, r9		# rdx:rax = (product < 0)
				#         ? ~product % 2**128 : product % 2**128
	mov	rsi, rdx	# rsi = (product % 2**128 < -2**127)
				#     | (product % 2**128 >= 2**127)
				#     ? negative : positive
	xor	rax, r9
	xor	rdx, r9		# rdx:rax = product % 2**128

	add	rsi, rsi
.endif
	adc	ecx, ecx	# ecx = (-2**127 <= product < 2**127) ? 0 : *
	setnz	byte ptr [r8]
	ret

.size	__muloti4, .-__muloti4
.type	__muloti4, @function
.global	__muloti4
.end
Compile the source file case11.c a third time with Clang, now targeting the i386 platform, and display the generated assembly code:
clang -o- -O3 -S -target i386-pc-linux case11.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…]
mulodi4:				# @mulodi4
# %bb.0:
	pushl	%ebx			#	jmp	__mulodi4
	pushl	%edi			#
	pushl	%esi			#
	subl	$16, %esp		#
	movl	48(%esp), %esi		#
	movl	32(%esp), %eax		#
	movl	36(%esp), %ecx		#
	movl	40(%esp), %edx		#
	movl	44(%esp), %edi		#
	movl	$0, 12(%esp)		#
	subl	$12, %esp		#
	leal	24(%esp), %ebx		#
	pushl	%ebx			#
	pushl	%edi			#
	pushl	%edx			#
	pushl	%ecx			#
	pushl	%eax			#
	calll	__mulodi4		#
	addl	$32, %esp		#
	xorl	%ecx, %ecx		#
	cmpl	$0, 12(%esp)		#
	setne	%cl			#
	movl	%ecx, (%esi)		#
	addl	$16, %esp		#
	popl	%esi			#
	popl	%edi			#
	popl	%ebx			#
	retl				#
[…]
	.ident	"clang version 10.0.0 "
[…]
OUCH: only 1 instruction in just 5 bytes instead of 28 instructions in 73 bytes – not counting the additional 434 instructions in 1087 bytes of the called __mulodi4(), __divdi3(), __udivdi3() and __udivmoddi4() functions!

Compile the source file case11.c a fourth time with Clang, now with the preprocessor macro OPTIMIZE defined, and display the generated assembly code:

clang -DOPTIMIZE -o- -O3 -S -target i386-pc-linux case11.c
[…]
__mulodi4:				# @__mulodi4
# %bb.0:
	pushl	%ebp
	pushl	%ebx
	pushl	%edi
	pushl	%esi
	subl	$8, %esp
	movl	32(%esp), %eax
	movl	40(%esp), %esi
	movl	28(%esp), %ecx
	movl	36(%esp), %edi
	movl	%eax, %ebp
	movl	%esi, %ebx
	sarl	$31, %ebp
	sarl	$31, %ebx
	xorl	%ebp, %ecx
	xorl	%ebp, %eax
	subl	%ebp, %ecx
	sbbl	%ebp, %eax
	xorl	%ebx, %edi
	xorl	%ebx, %esi
	subl	%ebx, %edi
	sbbl	%ebx, %esi
	setne	%dl
	testl	%eax, %eax
	setne	%dh
	andb	%dl, %dh
	movb	%dh, 3(%esp)		# 1-byte Spill
	mull	%edi
	movl	%eax, 4(%esp)		# 4-byte Spill
	movl	%esi, %eax
	seto	2(%esp)			# 1-byte Folded Spill
	mull	%ecx
	movl	%eax, %esi
	seto	%al
	orb	2(%esp), %al		# 1-byte Folded Reload
	addl	4(%esp), %esi		# 4-byte Folded Reload
	movb	%al, 4(%esp)		# 1-byte Spill
	movl	%ecx, %eax
	mull	%edi
	addl	%esi, %edx
	movl	44(%esp), %esi
	setb	%cl
	xorl	%ebp, %ebx
	orb	4(%esp), %cl		# 1-byte Folded Reload
	xorl	%ebx, %eax
	xorl	%ebx, %edx
	orb	3(%esp), %cl		# 1-byte Folded Reload
	subl	%ebx, %eax
	sbbl	%ebx, %edx
	xorl	%edx, %ebx
	shrl	$31, %ebx
	movzbl	%cl, %ecx
	orl	%ecx, %ebx
	movl	%ebx, (%esi)
	addl	$8, %esp
	popl	%esi
	popl	%edi
	popl	%ebx
	popl	%ebp
	retl
[…]
	.ident	"clang version 10.0.0 "
[…]
Ouch: the proper implementation of the __mulodi4() function coaxes Clang to generate 59 instructions in 146 bytes instead of 98 instructions in 266 bytes – without calls of the __divdi3() and __udivdi3() functions!

Assembly programmers, however, use only 57 instructions in just 134 bytes, don't clobber the non-volatile registers EBP, EDI and ESI, and avoid 8 partial register writes which impair performance:

# Copyright © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.arch	generic32
.code32
.intel_syntax noprefix
.text
				# [esp+20] = address of flag
				# [esp+16] = high dword of multiplier
				# [esp+12] = low dword of multiplier
				# [esp+8] = high dword of multiplicand
				# [esp+4] = low dword of multiplicand
__mulodi4:
	push	ebx
	mov	eax, [esp+20]	# eax = high dword of multiplier
	mov	ecx, [esp+16]	# ecx = low dword of multiplier
	cdq			# edx = (multiplier < 0) ? -1 : 0
	mov	ebx, edx	# ebx = (multiplier < 0) ? -1 : 0
	xor	ecx, edx
	xor	eax, edx	# eax:ecx = (multiplier < 0) ? ~multiplier : multiplier
	sub	ecx, edx
	sbb	eax, edx	# eax:ecx = (multiplier < 0) ? -multiplier : multiplier
				#         = |multiplier|
	mov	[esp+16], ecx
	mov	[esp+20], eax	# multiplier = |multiplier|

	mov	eax, [esp+12]	# eax = high dword of multiplicand
	mov	ecx, [esp+8]	# ecx = low dword of multiplicand
	cdq			# edx = (multiplicand < 0) ? -1 : 0
	xor	ebx, edx	# ebx = (multiplier < 0)
				#     ^ (multiplicand < 0) ? -1 : 0
				#     = (product < 0) ? -1 : 0
	xor	ecx, edx
	xor	eax, edx	# eax:ecx = (multiplicand < 0) ? ~multiplicand : multiplicand
	sub	ecx, edx
	sbb	eax, edx	# eax:ecx = (multiplicand < 0) ? -multiplicand : multiplicand
				#         = |multiplicand|
	mov	[esp+8], ecx
	mov	[esp+12], eax	# multiplicand = |multiplicand|

	push	ebx		# [esp] = (product < 0) ? -1 : 0
	xor	ebx, ebx	# ebx = 0
#	mov	eax, [esp+16]	# eax = high dword of |multiplicand|
	cmp	ebx, eax
	sbb	edx, edx	# edx = (high dword of |multiplicand| = 0) ? 0 : -1
				#     = (|multiplicand| < 2**32) ? 0 : -1
	mov	ecx, [esp+24]	# ecx = high dword of |multiplier|
	cmp	ebx, ecx
	sbb	ebx, ebx	# ebx = (high dword of |multiplier| = 0) ? 0 : -1
				#     = (|multiplier| < 2**32) ? 0 : -1
	and	ebx, edx	# ebx = (|multiplicand| < 2**32)
				#     & (|multiplier| < 2**32) ? 0 : -1
				#     = (|product| < 2**64) ? 0 : -1

	mov	edx, [esp+20]	# edx = low dword of |multiplier|
	mul	edx		# edx:eax = high dword of |multiplicand|
				#         * low dword of |multiplier|
	adc	ebx, ebx	# ebx = (|product| < 2**64) ? 0 : *

	xchg	ecx, eax	# ecx = high dword of |multiplicand|
				#     * low dword of |multiplier|,
				# eax = high dword of |multiplier|
	mov	edx, [esp+12]	# edx = low dword of |multiplicand|
	mul	edx		# edx:eax = high dword of |multiplier|
				#         * low dword of |multiplicand|
	adc	ebx, ebx	# ebx = (|product| < 2**64) ? 0 : *

	add	ecx, eax	# ecx = high dword of |multiplicand|
				#     * low dword of |multiplier|
				#     + high dword of |multiplier|
				#     * low dword of |multiplicand|
#	adc	ebx, ebx	# ebx = (|product| < 2**64) ? 0 : *

	mov	eax, [esp+12]	# eax = low dword of |multiplicand|
	mov	edx, [esp+20]	# edx = low dword of |multiplier|
	mul	edx		# edx:eax = low dword of |multiplicand|
				#         * low dword of |multiplier|
	add	edx, ecx	# edx:eax = |product % 2**64|
				#         = |product| % 2**64
	adc	ebx, ebx	# ebx = (|product| < 2**64) ? 0 : *

	pop	ecx		# ecx = (product < 0) ? -1 : 0
	xor	eax, ecx
	xor	edx, ecx	# edx:eax = (product < 0) ? product % 2**64 - 1 : product % 2**64
	sub	eax, ecx
	sbb	edx, ecx	# edx:eax = product % 2**64

	xor	ecx, edx	# ecx = (multiplicand < 0)
				#     ^ (multiplier < 0)
				#     ^ (product < 0) ? negative : positive
	add	ecx, ecx
	mov	ecx, [esp+28]	# ecx = address of flag
	adc	ebx, ebx	# ebx = (-2**63 <= product < 2**63) ? 0 : *
	neg	ebx
	sbb	ebx, ebx	# ebx = (-2**63 <= product < 2**63) ? 0 : -1
	neg	ebx		# ebx = (-2**63 <= product < 2**63) ? 0 : 1
	mov	[ecx], ebx
	pop	ebx
	ret

.size	__mulodi4, .-__mulodi4
.type	__mulodi4, @function
.global	__mulodi4
.end

__parity?i2()

The __paritysi2() and __paritydi2() functions provided in the clang_rt.builtins-i386.lib library exhibit the insane code shown in the left column below; the right column shows properly optimised code, with 15 instructions in 40 bytes instead of 21 instructions in 57 bytes for both functions together, more than halving the number of instructions executed per function call:
# Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

	.arch	generic32
	.code32
	.intel_syntax noprefix
	.global	___paritysi2
	.global	___paritydi2
	.type	___paritysi2, @function
	.type	___paritydi2, @function
	.text

___paritysi2:
	mov	eax, [esp+4]		#	mov	eax, [esp+4]
	mov	ecx, eax		#
	shr	ecx, 16			#	shld	ecx, eax, 16
	xor	ecx, eax		#	xor	ecx, eax
	mov	eax, ecx		#
	shr	eax, 8			#
	xor	eax, ecx		#
	mov	ecx, eax		#
	shr	ecx, 4			#
	xor	ecx, eax		#
	mov	eax, 0x6996		#
	and	cl, 15			#	xor	eax, eax
	shr	eax, cl			#	xor	cl, ch
	and	eax, 1			#	setnp	al
	ret				#	ret

___paritydi2:
	mov	eax, [esp+8]		#	mov	eax, [esp+8]
	xor	eax, [esp+4]		#	xor	eax, [esp+4]
	push	eax			#	shld	ecx, eax, 16
	call	___paritysi2		#	xor	ecx, eax
	add	esp, 4			#	xor	eax, eax
					#	xor	cl, ch
					#	setnp	al
	ret				#	ret

	.ident	"clang version 10.0.0 "
	.end

Braindead implementation of code generator for __builtin_*()

For the __builtin_*() functions, Clang emits unoptimised code that can be even worse than the poor code provided in the clang_rt.builtins-i386.* and clang_rt.builtins-x86_64.* libraries, and fails to optimise it!

__builtin_parity()

Create the text file case12a.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

int __paritysi2(unsigned int value) {
    return __builtin_parity(value);
}

int __paritydi2(unsigned long long value) {
    return __builtin_parityll(value);
}
Compile the source file case12a.c with Clang, targeting the AMD64 platform, and display the generated (unoptimised) assembly code:
clang -o- -S -target amd64-pc-linux case12a.c
[…]
__paritysi2:				# @__paritysi2
# %bb.0:
	pushq	%rbp
	movq	%rsp, %rbp
	movl	%edi, -4(%rbp)
	movl	-4(%rbp), %eax
	movl	%eax, %ecx
	shrl	%ecx
	andl	$0x55555555, %ecx
	subl	%ecx, %eax
	movl	%eax, %ecx
	andl	$0x33333333, %ecx
	shrl	$2, %eax
	andl	$0x33333333, %eax
	addl	%eax, %ecx
	movl	%ecx, %eax
	shrl	$4, %eax
	addl	%eax, %ecx
	andl	$0xF0F0F0F, %ecx
	imull	$0x1010101, %ecx, %eax
	shrl	$24, %eax
	andl	$1, %eax
	popq	%rbp
	retq
[…]
__paritydi2:				# @__paritydi2
# %bb.0:
	pushq	%rbp
	movq	%rsp, %rbp
	movq	%rdi, -8(%rbp)
	movq	-8(%rbp), %rax
	movq	%rax, %rcx
	shrq	%rcx
	movabsq	$0x5555555555555555, %rdx
	andq	%rdx, %rcx
	subq	%rcx, %rax
	movabsq	$0x3333333333333333, %rcx
	movq	%rax, %rdx
	andq	%rcx, %rdx
	shrq	$2, %rax
	andq	%rcx, %rax
	addq	%rax, %rdx
	movq	%rdx, %rax
	shrq	$4, %rax
	addq	%rax, %rdx
	movabsq	$0xF0F0F0F0F0F0F0F, %rax
	andq	%rax, %rdx
	movabsq	$0x101010101010101, %rax
	imulq	%rax, %rdx
	shrq	$56, %rdx
	andq	$1, %rdx
	movl	%edx, %eax
	popq	%rbp
	retq
[…]
	.ident	"clang version 10.0.0 "
[…]
Compile the source file case12a.c a second time with Clang, now engaging its optimiser, again targeting the AMD64 platform, and display the generated optimised assembly code:
clang -o- -O3 -S -target amd64-pc-linux case12a.c
[…]
__paritysi2:				# @__paritysi2
# %bb.0:
	movl	%edi, %ecx
	shrl	$16, %ecx
	xorl	%edi, %ecx
	movl	%ecx, %edx
	shrl	$8, %edx
	xorl	%eax, %eax
	xorb	%cl, %dl
	setnp	%al
	retq
[…]
__paritydi2:				# @__paritydi2
# %bb.0:
	movq	%rdi, %rax
	shrq	%rax
	movabsq	$0x5555555555555555, %rcx
	andq	%rax, %rcx
	subq	%rcx, %rdi
	movabsq	$0x3333333333333333, %rax
	movq	%rdi, %rcx
	andq	%rax, %rcx
	shrq	$2, %rdi
	andq	%rax, %rdi
	addq	%rcx, %rdi
	movq	%rdi, %rax
	shrq	$4, %rax
	addq	%rdi, %rax
	movabsq	$0x10F0F0F0F0F0F0F, %rcx
	andq	%rax, %rcx
	movabsq	$0x101010101010101, %rax
	imulq	%rcx, %rax
	shrq	$56, %rax
	andl	$1, %eax
	retq
[…]
	.ident	"clang version 10.0.0 "
[…]
OUCH: the optimised code generated for __builtin_parityll() is a perfect and very convincing declaration of bankruptcy!

Compile the source file case12a.c a third time with Clang, again engaging its optimiser, now targeting the i386 platform, and display the generated optimised assembly code:

clang -o- -O3 -S -target i386-pc-linux case12a.c
[…]
__paritysi2:				# @__paritysi2
# %bb.0:
	movl	4(%esp), %eax
	movl	%eax, %ecx
	shrl	$16, %ecx
	xorl	%eax, %ecx
	xorl	%eax, %eax
	xorb	%ch, %cl
	setnp	%al
	retl
[…]
.LCPI1_0:
	.zero	16,85
.LCPI1_1:
	.zero	16,51
.LCPI1_2:
	.zero	16,15
[…]
__paritydi2:				# @__paritydi2
# %bb.0:
	movq	4(%esp), %xmm0		# xmm0 = mem[0],zero
	movdqa	%xmm0, %xmm1
	psrlw	$1, %xmm1
	pand	.LCPI1_0, %xmm1
	psubb	%xmm1, %xmm0
	movdqa	.LCPI1_1, %xmm1		# xmm1 = [51,51,51,51,51,51,51,51,51,51,51,51,51,51,51,51]
	movdqa	%xmm0, %xmm2
	psrlw	$2, %xmm0
	pand	%xmm1, %xmm2
	pand	%xmm1, %xmm0
	paddb	%xmm2, %xmm0
	movdqa	%xmm0, %xmm1
	psrlw	$4, %xmm1
	paddb	%xmm0, %xmm1
	pxor	%xmm0, %xmm0
	pand	.LCPI1_2, %xmm1
	psadbw	%xmm1, %xmm0
	movd	%xmm0, %eax
	andl	$1, %eax
	retl
[…]
	.ident	"clang version 10.0.0 "
[…]
OUCH: again the optimised code generated for __builtin_parityll() is a perfect and very convincing declaration of bankruptcy!

Compile the source file case12a.c a fourth time with Clang, again targeting the i386 platform, now without SSE support, and display the generated optimised assembly code:

clang -mno-sse -o- -O3 -S -target i386-pc-linux case12a.c
[…]
__paritysi2:				# @__paritysi2
# %bb.0:
	movl	4(%esp), %eax
	movl	%eax, %ecx
	shrl	$16, %ecx
	xorl	%eax, %ecx
	xorl	%eax, %eax
	xorb	%ch, %cl
	setnp	%al
	retl
[…]
__paritydi2:				# @__paritydi2
# %bb.0:
	movl	8(%esp), %ecx
	movl	4(%esp), %eax
	movl	%ecx, %edx
	shrl	%edx
	andl	$0x55555555, %edx
	subl	%edx, %ecx
	movl	%ecx, %edx
	shrl	$2, %ecx
	andl	$0x33333333, %edx
	andl	$0x33333333, %ecx
	addl	%edx, %ecx
	movl	%ecx, %edx
	shrl	$4, %edx
	addl	%ecx, %edx
	andl	$0x10F0F0F, %edx
	imull	$0x1010101, %edx, %ecx
	movl	%eax, %edx
	shrl	%edx
	shrl	$24, %ecx
	andl	$0x55555555, %edx
	subl	%edx, %eax
	movl	%eax, %edx
	shrl	$2, %eax
	andl	$0x33333333, %edx
	andl	$0x33333333, %eax
	addl	%edx, %eax
	movl	%eax, %edx
	shrl	$4, %edx
	addl	%eax, %edx
	andl	$0x10F0F0F, %edx
	imull	$0x1010101, %edx, %eax
	shrl	$24, %eax
	addl	%ecx, %eax
	andl	$1, %eax
	retl
[…]
	.ident	"clang version 10.0.0 "
[…]
OUCH: once more the optimised code generated for __builtin_parityll() is a perfect and very convincing declaration of bankruptcy!

__builtin_rotateleft*() and __builtin_rotateright*()

Create the text file case12b.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

long long __rotldi3(long long value, long long count) {
    return __builtin_rotateleft64(value, count);
}

long long __rotrdi3(long long value, long long count) {
    return __builtin_rotateright64(value, count);
}
Compile the source file case12b.c with Clang, engaging the optimiser, targeting the i386 platform, and display the generated assembly code:
clang -o- -O3 -S -target i386-pc-linux case12b.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…]
__rotldi3:				# @__rotldi3
# %bb.0:
	pushl	%ebp			#
	pushl	%ebx			#	push	ebx
	pushl	%edi			#
	pushl	%esi			#
	movb	28(%esp), %ch		#	mov	ecx, [esp+16]
	movl	20(%esp), %esi		#	mov	edx, [esp+12]
	movl	24(%esp), %edx		#	mov	eax, [esp+8]
	xorl	%ebp, %ebp		#	mov	ebx, edx
	movb	%ch, %cl		#	test	cl, 32
	movl	%edx, %edi		#	cmovnz	edx, eax
	movl	%esi, %ebx		#	cmovnz	eax, ebx
	movl	%esi, %eax		#	cmovnz	ebx, edx
	negb	%cl			#
	shrl	%cl, %edi		#	shld	edx, eax, cl
	shrdl	%cl, %edx, %ebx		#	shld	eax, ebx, cl
	testb	$32, %cl		#
	movb	%ch, %cl		#
	cmovnel	%edi, %ebx		#
	cmovnel	%ebp, %edi		#
	shll	%cl, %eax		#
	shldl	%cl, %esi, %edx		#
	testb	$32, %ch		#
	cmovnel	%eax, %edx		#
	cmovnel	%ebp, %eax		#
	orl	%ebx, %eax		#
	orl	%edi, %edx		#
	popl	%esi			#
	popl	%edi			#
	popl	%ebx			#	pop	ebx
	popl	%ebp			#
	retl				#	ret
[…]
__rotrdi3:				# @__rotrdi3
# %bb.0:
	pushl	%ebp			#
	pushl	%ebx			#	push	ebx
	pushl	%edi			#
	pushl	%esi			#
	movl	20(%esp), %esi		#	mov	eax, [esp+8]
	movl	24(%esp), %edx		#	mov	edx, [esp+12]
	movb	28(%esp), %cl		#	mov	ecx, [esp+16]
	xorl	%ebp, %ebp		#	mov	ebx, eax
					#	test	cl, 32
	movl	%edx, %edi		#	cmovnz	eax, edx
	movl	%esi, %ebx		#	cmovnz	edx, ebx
	movl	%esi, %eax		#	cmovnz	ebx, eax
	shrl	%cl, %edi		#	shrd	eax, edx, cl
	shrdl	%cl, %edx, %ebx		#	shrd	edx, ebx, cl
	testb	$32, %cl		#
	cmovnel	%edi, %ebx		#
	cmovnel	%ebp, %edi		#
	negb	%cl			#
	shll	%cl, %eax		#
	shldl	%cl, %esi, %edx		#
	testb	$32, %cl		#
	cmovnel	%eax, %edx		#
	cmovnel	%ebp, %eax		#
	orl	%ebx, %eax		#
	orl	%edi, %edx		#
	popl	%esi			#
	popl	%edi			#
	popl	%ebx			#	pop	ebx
	popl	%ebp			#
	retl				#	ret
[…]
	.ident	"clang version 10.0.0 "
[…]
OUCH: 31 instructions in 67 bytes for the __rotldi3() function, and 29 instructions in 63 bytes for the __rotrdi3() function, instead of only 13 instructions in 35 bytes for each of these functions – the optimised code generated for __builtin_rotate*() is yet another perfect and very convincing declaration of bankruptcy!

__builtin_mul_overflow()

Create the text file case12c.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

unsigned int __umulvsi3(unsigned int multiplicand, unsigned int multiplier) {
    unsigned int product;
    if (__builtin_mul_overflow(multiplicand, multiplier, &product))
        __builtin_trap();
    return product;
}

unsigned long long __umulvdi3(unsigned long long multiplicand, unsigned long long multiplier) {
    unsigned long long product;
    if (__builtin_mul_overflow(multiplicand, multiplier, &product))
        __builtin_trap();
    return product;
}

#ifdef __amd64__
__uint128_t __umulvti3(__uint128_t multiplicand, __uint128_t multiplier) {
    __uint128_t product;
    if (__builtin_mul_overflow(multiplicand, multiplier, &product))
        __builtin_trap();
    return product;
}
#endif
Compile the source file case12c.c with Clang, engaging the optimiser, targeting the AMD64 platform, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-linux case12c.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…]
__umulvsi3:				# @__umulvsi3
# %bb.0:
	movl	%edi, %eax
	mull	%esi
	jo	.LBB0_1
# %bb.2:
	retq
.LBB0_1:
	ud2
[…]
__umulvdi3:				# @__umulvdi3
# %bb.0:
	movq	%rdi, %rax
	mulq	%rsi
	jo	.LBB1_1
# %bb.2:
	retq
.LBB1_1:
	ud2
[…]
__umulvti3:				# @__umulvti3
# %bb.0:
	pushq	%rbx			#
	movq	%rdx, %r8		#	mov	r8, rdx
	movq	%rsi, %rax		#	mov	r9, rcx
	testq	%rcx, %rcx		#	neg	rcx
	setne	%dl			#	sbb	ecx, ecx
	testq	%rsi, %rsi		#	xor	eax, eax
	setne	%r9b			#	cmp	rax, rsi
	andb	%dl, %r9b		#	sbb	eax, eax
	mulq	%r8			#	and	ecx, eax
	movq	%rax, %r11		#
	seto	%r10b			#	mov	rax, rdi
	movq	%rcx, %rax		#	mul	r9
	mulq	%rdi			#	adc	ecx, ecx
	movq	%rax, %rsi		#	mov	r9, rax
	seto	%bl			#
	orb	%r10b, %bl		#	mov	rax, rsi
	orb	%r9b, %bl		#	mul	r8
	addq	%r11, %rsi		#	adc	ecx, ecx
	movq	%rdi, %rax		#	add	r9, rax
	mulq	%r8			#
	addq	%rsi, %rdx		#	mov	rax, rdi
	setb	%cl			#	mul	r8
	orb	%bl, %cl		#	add	rdx, r9
	cmpb	$1, %cl			#	adc	ecx, ecx
	je	.LBB2_1			#	jnz	.LBB2_1
# %bb.2:
	popq	%rbx			#
	retq				#	ret
.LBB2_1:
	ud2				#	ud2
[…]
	.ident	"clang version 10.0.0 "
[…]
Oops: the __umulvti3() function shows 28 instructions in 77 bytes instead of only 23 instructions in just 58 bytes, clobbers registers RBX and R10 without necessity, and performs 2 superfluous memory accesses plus 9 partial register writes which impair performance!
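The schoolbook decomposition that the hand-optimised column builds on can also be sketched in plain C (a sketch only; the helper name umul128_overflow and its signature are made up here and are not compiler-rt's):

```c
#include <stdint.h>

// Sketch: unsigned 128x128-bit multiplication with overflow check,
// decomposed into 64-bit halves. The product overflows iff both high
// halves are non-zero, a cross product exceeds 64 bits, or the final
// addition into the high half carries out.
int umul128_overflow(uint64_t a_lo, uint64_t a_hi,
                     uint64_t b_lo, uint64_t b_hi,
                     uint64_t *p_lo, uint64_t *p_hi) {
    int overflow = (a_hi != 0) && (b_hi != 0);
    unsigned __int128 cross1 = (unsigned __int128) a_hi * b_lo;
    unsigned __int128 cross2 = (unsigned __int128) a_lo * b_hi;
    overflow |= (uint64_t) (cross1 >> 64) != 0;
    overflow |= (uint64_t) (cross2 >> 64) != 0;
    unsigned __int128 low = (unsigned __int128) a_lo * b_lo;
    uint64_t mid;
    overflow |= __builtin_add_overflow((uint64_t) cross1, (uint64_t) cross2, &mid);
    overflow |= __builtin_add_overflow((uint64_t) (low >> 64), mid, p_hi);
    *p_lo = (uint64_t) low;
    return overflow;
}
```

Each of the three 64×64-bit multiplications maps to one MUL instruction; the overflow conditions are exactly the flags the hand-optimised code accumulates with SBB and ADC.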

Compile the source file case12c.c a second time with Clang, now targeting the i386 platform, and display the generated assembly code:

clang -o- -O3 -S -target i386-pc-linux case12c.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…]
__umulvsi3:				# @__umulvsi3
# %bb.0:
	movl	4(%esp), %eax
	mull	8(%esp)
	jo	.LBB0_1
# %bb.2:
	retl
.LBB0_1:
	ud2
[…]
__umulvdi3:				# @__umulvdi3
# %bb.0:
	pushl	%ebp			#
	pushl	%ebx			#	push	ebx
	pushl	%edi			#
	pushl	%esi			#
	movl	32(%esp), %esi		#	xor	ebx, ebx
	movl	24(%esp), %eax		#	mov	ecx, [esp+12]
	movl	20(%esp), %ebp		#	cmp	ebx, ecx
	testl	%esi, %esi		#	sbb	edx, edx
	setne	%dl			#	mov	eax, [esp+20]
	testl	%eax, %eax		#	cmp	ebx, eax
	setne	%cl			#	sbb	ebx, ebx
	andb	%dl, %cl		#	and	ebx, edx
	mull	28(%esp)		#	mul	[esp+8]
	movl	%eax, %edi		#
	movl	%esi, %eax		#	xchg	ecx, eax
	seto	%bl			#	adc	ebx, ebx
	mull	%ebp			#	mul	[esp+16]
	movl	%eax, %esi		#
	movl	%ebp, %eax		#
	seto	%ch			#	adc	ebx, ebx
	mull	28(%esp)		#	add	ecx, eax
	addl	%edi, %esi		#	mov	eax, [esp+8]
	orb	%bl, %ch		#	mul	[esp+16]
	addl	%esi, %edx		#	add	edx, ecx
	setb	%bl			#	adc	ebx, ebx
	orb	%ch, %bl		#
	orb	%cl, %bl		#
	cmpb	$1, %bl			#
	je	.LBB1_1			#	jnz	.LBB1_1
# %bb.2:
	popl	%esi			#
	popl	%edi			#
	popl	%ebx			#	pop	ebx
	popl	%ebp			#
	retl				#	ret
.LBB1_1:
	ud2				#	ud2
[…]
	.ident	"clang version 10.0.0 "
[…]
Ouch: the __umulvdi3() function shows 35 instructions in 77 bytes instead of only 23 instructions in just 54 bytes, clobbers registers EBP, EDI and ESI without necessity, and performs 5 excess memory accesses plus 9 partial register writes which impair performance!

__builtin_copysign()

Create the text file case12d.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

double copysign64(double destination, double source) {
    return __builtin_copysign(destination, source);
}
Compile the source file case12d.c with Clang, engaging the optimiser, targeting the i386 platform, and display the generated assembly code:
clang -o- -Os -S -target i386-pc-linux case12d.c
[…]
.LCPI0_0:
	.quad	-9223372036854775808	# double -0
	.quad	-9223372036854775808	# double -0
.LCPI0_1:
	.quad	9223372036854775807	# double NaN
	.quad	9223372036854775807	# double NaN
	.text
copysign64:				# @copysign64
# %bb.0:
	subl	$12, %esp
	movsd	24(%esp), %xmm0		# xmm0 = mem[0],zero
	movsd	16(%esp), %xmm1		# xmm1 = mem[0],zero
	andps	.LCPI0_0, %xmm0
	andps	.LCPI0_1, %xmm1
	orps	%xmm0, %xmm1
	movlps	%xmm1, (%esp)
	fldl	(%esp)
	addl	$12, %esp
	retl
[…]
	.ident	"clang version 10.0.0 "
[…]
Ouch: 10 instructions in 43 bytes, plus 32 bytes for 2 constants, whose access yields a page fault in the worst case.
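What the two 16-byte constants do becomes obvious in a plain C formulation of the same operation (a sketch using type punning via memcpy; the name copysign64_bits is made up here):

```c
#include <stdint.h>
#include <string.h>

// Sketch: copysign on an IEEE 754 double via integer bit operations.
// 0x8000000000000000 is the sign-bit mask applied to the source (the
// .LCPI0_0 constant above); its complement 0x7FFFFFFFFFFFFFFF keeps
// the magnitude of the destination (the .LCPI0_1 constant).
double copysign64_bits(double destination, double source) {
    uint64_t d, s;
    memcpy(&d, &destination, sizeof d);
    memcpy(&s, &source, sizeof s);
    d = (d & 0x7FFFFFFFFFFFFFFFULL) | (s & 0x8000000000000000ULL);
    memcpy(&destination, &d, sizeof d);
    return destination;
}
```

Since only one bit of the source is needed, the operation does not require 128-bit constants, SSE registers or memory-resident masks at all.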

Compile the source file case12d.c a second time with Clang, now without SSE support, and display the generated assembly code:

clang -mno-sse -o- -Os -S -target i386-pc-linux case12d.c
[…]
copysign64:				# @copysign64
# %bb.0:
	subl	$12, %esp
	fldl	16(%esp)
	fldl	24(%esp)
	fstpl	(%esp)
	fabs
	fld	%st(0)
	fchs
	testb	$-128, 7(%esp)
	fxch	%st(1)
	fcmovne	%st(1), %st
	fstp	%st(1)
	addl	$12, %esp
	retl
[…]
	.ident	"clang version 10.0.0 "
[…]
Oops: 13 instructions in 33 bytes, wasting 3 instructions that needlessly write the second argument to the stack.

Proper code uses only 5 instructions in just 19 bytes:

# Copyright © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.arch	generic32
.code32
.intel_syntax noprefix
.text

copysign64:
	mov	eax, [esp+16]
	shld	[esp+8], eax, 1
	ror	dword ptr [esp+8], 1
	fld	qword ptr [esp+4]
	ret

.size	copysign64, .-copysign64
.type	copysign64, @function
.global	copysign64
.end

Utterly devastating performance of overflow-checking 128×128-bit (and 64×64-bit) integer multiplication

Create the text file case13.c with the following content in an arbitrary, preferably empty directory:
// Copyright © 2015-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#include <stdint.h>
#include <stdio.h>
#include <time.h>

#ifdef OPTIMIZE
__attribute__ ((noinline))
static
__int128_t __muloti4(__int128_t multiplicand, __int128_t multiplier, int *overflow)
{
    __uint128_t product, sign = 0 - (multiplicand < 0), tmp = 0 - (multiplier < 0);

    *overflow = __builtin_mul_overflow((multiplicand ^ sign) - sign,
                                       (multiplier ^ tmp) - tmp,
                                       &product);
    sign ^= tmp;
    product = (product ^ sign) - sign;
    *overflow |= (__int128_t) (product ^ sign) < 0;

    return product;
}
#endif

__attribute__ ((always_inline))
static
__uint128_t lfsr128(void)
{
    // 128-bit linear feedback shift register (Galois form) using
    //  primitive polynomial 0x5DB2B62B0C5F8E1B:D8CCE715FCB2726D,
    //   initialised with 2**128 / golden ratio

    static __uint128_t lfsr = (__uint128_t) 0x9E3779B97F4A7C15 << 64 | 0xF39CC0605CEDC834;
    const  __uint128_t poly = (__uint128_t) 0x5DB2B62B0C5F8E1B << 64 | 0xD8CCE715FCB2726D;
    const  __uint128_t sign = 0 - __builtin_add_overflow(lfsr, lfsr, &lfsr);

    return lfsr ^= poly & sign;
}

__attribute__ ((always_inline))
static
__uint128_t lfsr64(void)
{
    // 64-bit linear feedback shift register (Galois form) using
    //  primitive polynomial 0xAD93D23594C935A9 (CRC-64 "Jones"),
    //   initialised with 2**64 / golden ratio

    static uint64_t lfsr = 0x9E3779B97F4A7C15;
    const  uint64_t sign = 0 - __builtin_add_overflow(lfsr, lfsr, &lfsr);

    return lfsr ^= 0xAD93D23594C935A9 & sign;
}

__attribute__ ((always_inline))
static
__uint128_t lfsr32(void)
{
    // 32-bit linear feedback shift register (Galois form) using
    //  primitive polynomial 0xDB710641 (CRC-32 IEEE),
    //   initialised with 2**32 / golden ratio

    static uint32_t lfsr = 0x9E3779B9;
    const  uint32_t sign = 0 - __builtin_add_overflow(lfsr, lfsr, &lfsr);

    return lfsr ^= 0xDB710641 & sign;
}

int main(void)
{
    clock_t t0, t1, t2, tt;
    uint32_t n;
    __int128_t multiplicand, multiplier = ~0;
    volatile __int128_t product;
    volatile int overflow;

    t0 = clock();

    for (n = 500000000u; n > 0u; n--)
    {
        multiplicand = lfsr128();
        product = multiplicand * multiplier;
        multiplier = lfsr64();
        product = multiplicand * multiplier;
    }

    t1 = clock();

    for (n = 500000000u; n > 0u; n--)
    {
        multiplicand = lfsr128();
        overflow = __builtin_mul_overflow(multiplicand, multiplier, &product);
        multiplier = lfsr64();
        overflow = __builtin_mul_overflow(multiplicand, multiplier, &product);
    }

    t2 = clock();
    tt = t2 - t0;
    t2 -= t1;
    t1 -= t0;
    t0 = t2 - t1;

    printf("\n"
           "__multi3()      %4lu.%06lu       0\n"
           "__muloti4()     %4lu.%06lu    %4lu.%06lu\n"
           "                %4lu.%06lu nano-seconds\n",
           t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);

    t0 = clock();

    for (n = 500000000u; n > 0u; n--)
    {
        multiplicand = lfsr128();
        product = multiplicand * multiplier;
        multiplier = lfsr32();
        product = multiplicand * multiplier;
    }

    t1 = clock();

    for (n = 500000000u; n > 0u; n--)
    {
        multiplicand = lfsr128();
        overflow = __builtin_mul_overflow(multiplicand, multiplier, &product);
        multiplier = lfsr32();
        overflow = __builtin_mul_overflow(multiplicand, multiplier, &product);
    }

    t2 = clock();
    tt = t2 - t0;
    t2 -= t1;
    t1 -= t0;
    t0 = t2 - t1;

    printf("\n"
           "__multi3()      %4lu.%06lu       0\n"
           "__muloti4()     %4lu.%06lu    %4lu.%06lu\n"
           "                %4lu.%06lu nano-seconds\n",
           t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);

    t0 = clock();

    for (n = 500000000u; n > 0u; n--)
    {
        multiplicand = lfsr64();
        product = multiplicand * multiplier;
        multiplier = lfsr64();
        product = multiplicand * multiplier;
    }

    t1 = clock();

    for (n = 500000000u; n > 0u; n--)
    {
        multiplicand = lfsr64();
        overflow = __builtin_mul_overflow(multiplicand, multiplier, &product);
        multiplier = lfsr64();
        overflow = __builtin_mul_overflow(multiplicand, multiplier, &product);
    }

    t2 = clock();
    tt = t2 - t0;
    t2 -= t1;
    t1 -= t0;
    t0 = t2 - t1;

    printf("\n"
           "__multi3()      %4lu.%06lu       0\n"
           "__muloti4()     %4lu.%06lu    %4lu.%06lu\n"
           "                %4lu.%06lu nano-seconds\n",
           t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);

    t0 = clock();

    for (n = 500000000u; n > 0u; n--)
    {
        multiplicand = lfsr32();
        product = multiplicand * multiplier;
        multiplier = lfsr32();
        product = multiplicand * multiplier;
    }

    t1 = clock();

    for (n = 500000000u; n > 0u; n--)
    {
        multiplicand = lfsr32();
        overflow = __builtin_mul_overflow(multiplicand, multiplier, &product);
        multiplier = lfsr32();
        overflow = __builtin_mul_overflow(multiplicand, multiplier, &product);
    }

    t2 = clock();
    tt = t2 - t0;
    t2 -= t1;
    t1 -= t0;
    t0 = t2 - t1;

    printf("\n"
           "__multi3()      %4lu.%06lu       0\n"
           "__muloti4()     %4lu.%06lu    %4lu.%06lu\n"
           "                %4lu.%06lu nano-seconds\n",
           t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);
}
Note: modification of the source file case13.c to demonstrate the equally devastating (mis)performance of the overflow-checking 64×64-bit integer multiplication on 32-bit processors is left as an exercise to the reader.

Run the following command lines to compile, link and run the benchmark program:

lscpu
clang -O3 -rtlib=compiler-rt case13.c
echo 'LLVM/clang/compiler-rt'
./a.out
clang -DOPTIMIZE -O3 case13.c
echo 'LLVM/clang'
./a.out

Runtime measurement on AMD® EPYC 7262

Note: for better readability and to ease their comparison, the numbers are shown in two columns.
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             1
NUMA node(s):          1
Vendor ID:             AuthenticAMD
CPU family:            23
Model:                 49
Model name:            AMD EPYC 7262 8-Core Processor
Stepping:              0
CPU MHz:               3193.726
BogoMIPS:              6387.45
Virtualization:        AMD-V
L1d cache:             32K
L1i cache:             32K
L2 cache:              512K
L3 cache:              16384K
NUMA node0 CPU(s):     0-15
[…]
                    LLVM/clang/compiler-rt         LLVM/clang

__multi3()         4.190000       0               4.180000       0
__muloti4()      238.320000     234.130000        7.170000       2.990000
                 242.510000 nano-seconds         11.350000 nano-seconds

__multi3()         4.050000       0               4.050000       0
__muloti4()      348.200000     344.150000        7.230000       3.180000
                 352.250000 nano-seconds         11.280000 nano-seconds

__multi3()         3.900000       0               3.880000       0
__muloti4()      232.630000     228.730000        8.420000       4.540000
                 236.530000 nano-seconds         12.300000 nano-seconds

__multi3()         3.850000       0               3.850000       0
__muloti4()        4.240000       0.390000        4.230000       0.380000
                   8.090000 nano-seconds          8.080000 nano-seconds
Note: the overhead for the pseudo-random number generators is negligible here!

Oops: with the properly implemented __muloti4() function, which is called by __builtin_mul_overflow() for signed 128-bit integers, overflow-checking multiplication runs about 2 times slower than unchecked multiplication.

OUCH: with the __muloti4() function provided in the compiler-rt library, overflow-checking multiplication runs about 2 orders of magnitude slower than unchecked multiplication – an utterly devastating result, proving the miserable implementation of Clang and its runtime library again!
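The strip-the-signs technique used by the OPTIMIZE variant above scales down to 64 bits unchanged; a compact sketch, with the final range check written out explicitly (the name mulo64 and this formulation of the check are made up here, not compiler-rt's):

```c
#include <stdint.h>

// Sketch: signed 64x64-bit multiplication with overflow check, reduced
// to the unsigned builtin. Strip the signs, multiply the magnitudes,
// then verify the magnitude fits the signed result before reapplying
// the sign of the product.
static int64_t mulo64(int64_t multiplicand, int64_t multiplier, int *overflow)
{
    uint64_t magnitude;
    uint64_t sign = 0 - (uint64_t) (multiplicand < 0);
    uint64_t tmp  = 0 - (uint64_t) (multiplier < 0);

    *overflow = __builtin_mul_overflow(((uint64_t) multiplicand ^ sign) - sign,
                                       ((uint64_t) multiplier ^ tmp) - tmp,
                                       &magnitude);
    sign ^= tmp;    // sign of the final product
    // a non-negative product must fit 63 bits; a negative one may
    // additionally be exactly 2**63 (INT64_MIN)
    *overflow |= magnitude > (uint64_t) INT64_MAX + (sign != 0);

    return (int64_t) ((magnitude ^ sign) - sign);
}
```

On AMD64 the whole function compiles to one MUL plus a handful of flag operations, i.e. it needs no call into a runtime library at all.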

Which optimiser?

Create the text file case14.c with the following content, presented by Matt Godbolt as example (t) in his ACM Queue article Optimizations in C++ Compilers – A practical journey, in an arbitrary, preferably empty directory:
bool isWhitespace(char c)
{
    return c == ' '
      || c == '\r'
      || c == '\n'
      || c == '\t';
}
Compile the source file case14.c with Clang, engaging the optimiser, targeting the AMD64 platform, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-linux case14.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…]
isWhitespace:				# @isWhitespace
# %bb.0:
	cmpb	$32, %dil		#
	ja	.LBB0_2			#
# %bb.1:
	movl	$1, %eax		#
	movzbl	%dil, %ecx		#	mov	ecx, edi
	movabsq	$0x100002400, %rdx	#	mov	rax, 100002600h
	btq	%rcx, %rdx		#	shr	rax, cl
	jae	.LBB0_2			#
# %bb.3:
	retq				#
.LBB0_2:				#
	xorl	%eax, %eax		#	cmp	cl, 33
	cmpb	$9, %dil		#	sbb	ecx, ecx
					#	neg	ecx
	sete	%al			#	and	eax, ecx
	retq				#	ret
[…]
	.ident	"clang version 10.0.0 "
[…]
OOPS: 12 instructions in 42 bytes, including 2 conditional branches which impair performance, plus register RDX clobbered, instead of only 8 instructions in just 19 bytes, and without (superfluous) conditional branches.

Compile the source file case14.c a second time with Clang, now targeting the i386 platform, and display the generated assembly code:

clang -o- -O3 -S -target i386-pc-linux case14.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…]
isWhitespace:				# @isWhitespace
# %bb.0:
	pushl	%esi			#
	movb	8(%esp), %cl		#	mov	ecx, [esp+4]
	movl	%ecx, %edx		#
	addb	$-10, %dl		#
	cmpb	$22, %dl		#
	ja	.LBB0_2			#
# %bb.1:
	movzbl	%dl, %edx		#	xor	eax, eax
	movl	$0x400009, %esi		#	cmp	eax, ecx
	movl	$1, %eax		#	adc	eax, 2600h
	btl	%edx, %esi		#	shr	eax, cl
	jae	.LBB0_2			#
# %bb.3:
	popl	%esi			#
	retl				#
.LBB0_2:
	xorl	%eax, %eax		#	xor	edx, edx
	cmpb	$9, %cl			#	cmp	ecx, 33
	sete	%al			#	adc	edx, edx
	popl	%esi			#	and	eax, edx
	retl				#	ret
[…]
	.ident	"clang version 10.0.0 "
[…]
OUCH: 18 instructions in 45 bytes, including 2 conditional branches which impair performance, plus register ESI clobbered, instead of only 10 instructions in just 25 bytes, again without (superfluous) conditional branches, and no non-volatile register clobbered.
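The lookup-mask idiom shown in the right columns can also be expressed directly in C (a sketch; the name isWhitespaceMask is made up here to avoid clashing with the original function):

```c
#include <stdbool.h>

// Sketch: whitespace test via a 33-bit lookup mask. Bits 9 ('\t'),
// 10 ('\n'), 13 ('\r') and 32 (' ') are set in 0x100002600; the range
// check also guards the shift, whose count must stay below 64.
bool isWhitespaceMask(char c) {
    unsigned int cc = (unsigned char) c;
    return cc <= 32 && ((0x100002600ULL >> cc) & 1u);
}
```

The range check and the mask test each reduce to a compare plus a flag operation, which is how the hand-optimised columns avoid conditional branches entirely.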

The optimiser fails – take 2

Create the text file case15.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2018-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

long long __absvdi2(long long value) {
#ifdef ALTERNATE // undefined behaviour
    value = __builtin_llabs(value);
    if (value < 0)
        __builtin_trap();
    return value;
#else
    long long sign = 0 - (value < 0);
    if (__builtin_saddll_overflow(value, sign, &value))
        __builtin_trap();
    return value ^ sign;
#endif
}

long long __addvdi3(long long augend, long long addend) {
    long long sum;
    if (__builtin_saddll_overflow(augend, addend, &sum))
        __builtin_trap();
    return sum;
}

long long __mulvdi3(long long multiplicand, long long multiplier) {
    long long product;
    if (__builtin_smulll_overflow(multiplicand, multiplier, &product))
        __builtin_trap();
    return product;
}

long long __negvdi2(long long negend) {
    long long negation;
    if (__builtin_ssubll_overflow(0, negend, &negation))
        __builtin_trap();
    return negation;
}

long long __subvdi3(long long minuend, long long subtrahend) {
    long long difference;
    if (__builtin_ssubll_overflow(minuend, subtrahend, &difference))
        __builtin_trap();
    return difference;
}
Compile the source file case15.c with Clang, engaging the optimiser, targeting the i386 platform, and display the generated assembly code:
clang -o- -O3 -S -target i386-pc-linux case15.c
Note: the left column shows the generated insane code, while the right column shows properly optimised code as comment.
[…]
__absvdi2:				# @__absvdi2
# %bb.0:
	pushl	%ebx			#
	pushl	%esi			#
	pushl	%eax			#
	movl	20(%esp), %esi		#	mov	eax, [esp+8]
	movl	16(%esp), %eax		#	mov	ecx, [esp+4]
	movl	%esi, %ecx		#
	movl	%esi, %edx		#
	sarl	$31, %ecx		#	cdq
	addl	%ecx, %eax		#	add	ecx, edx
	adcl	%ecx, %edx		#	adc	eax, edx
	setns	%bl			#
	testl	%esi, %esi		#
	setns	%bh			#
	cmpb	%bl, %bh		#
	setne	3(%esp)			#
	testl	%ecx, %ecx		#
	setns	%bl			#
	cmpb	%bl, %bh		#
	sete	%bl			#
	andb	3(%esp), %bl		#
	cmpb	$1, %bl			#
	je	.LBB0_1			#	into
# %bb.2:
	xorl	%ecx, %eax		#	xor	ecx, edx
	xorl	%ecx, %edx		#	xor	edx, eax
	addl	$4, %esp		#	mov	eax, ecx
	popl	%esi			#
	popl	%ebx			#
	retl				#	ret
.LBB0_1:
	ud2				#
[…]
__addvdi3:				# @__addvdi3
# %bb.0:
	pushl	%ebx			#
	pushl	%esi			#
	movl	12(%esp), %eax		#	mov	eax, [esp+4]
	movl	16(%esp), %esi		#	mov	edx, [esp+8]
	movl	24(%esp), %ecx		#	add	eax, [esp+12]
	addl	20(%esp), %eax		#	adc	edx, [esp+16]
	movl	%esi, %edx		#
	adcl	%ecx, %edx		#
	setns	%bl			#
	testl	%esi, %esi		#
	setns	%bh			#
	cmpb	%bl, %bh		#
	setne	%bl			#
	testl	%ecx, %ecx		#
	setns	%cl			#
	cmpb	%cl, %bh		#
	sete	%cl			#
	andb	%bl, %cl		#
	cmpb	$1, %cl			#
	je	.LBB1_1			#	into
# %bb.2:
	popl	%esi			#
	popl	%ebx			#
	retl				#	ret
.LBB1_1:
	ud2				#
[…]
__mulvdi3:				# @__mulvdi3
# %bb.0:
	pushl	%edi			#
	pushl	%esi			#
	pushl	%eax			#	push	ebx
	movl	16(%esp), %eax		#	mov	ebx, [esp+16]
	movl	24(%esp), %edx		#	mov	eax, [esp+8]
	movl	20(%esp), %ecx		#	mul	ebx
	movl	28(%esp), %esi		#	push	eax
	movl	$0, (%esp)		#	mov	ecx, edx
	subl	$12, %esp		#	mov	eax, [esp+16]
	leal	12(%esp), %edi		#	mul	ebx
	pushl	%edi			#	xor	ebx, ebx
	pushl	%esi			#	add	ecx, eax
	pushl	%edx			#	adc	ebx, edx
	pushl	%ecx			#	mov	eax, [esp+12]
	pushl	%eax			#	mul	dword ptr [esp+24]
	calll	__mulodi4		#	add	ecx, eax
	addl	$32, %esp		#	adc	ebx, edx
	cmpl	$0, (%esp)		#	push	ecx
					#	sbb	ecx, ecx
					#	mov	eax, [esp+20]
					#	mul	dword ptr [esp+28]
					#	neg	ecx
					#	add	eax, ebx
					#	adc	edx, ecx
					#	xor	ebx, ebx
					#	cmp	ebx, [esp+28]
					#	jle	0f
					#	not	ebx
					#	sub	eax, [esp+16]
					#	sbb	edx, [esp+20]
					# 0:
					#	xor	ecx, ecx
					#	cmp	ecx, [esp+20]
					#	jle	1f
					#	not	ebx
					#	sub	eax, [esp+24]
					#	sbb	edx, [esp+28]
					# 1:
					#	xor	eax, ebx
					#	xor	edx, ebx
					#	or	eax, edx
					#	pop	edx
					#	shld	ecx, edx, 1
					#	add	ebx, ecx
					#	or	eax, ebx
	jne	.LBB2_1			#	jnz	.LBB2_1
					#	pop	eax
# %bb.2:
	addl	$4, %esp		#	pop	ebx
	popl	%esi			#
	popl	%edi			#
	retl				#	ret
.LBB2_1:
	ud2				#	ud2
[…]
__negvdi2:				# @__negvdi2
# %bb.0:
	xorl	%eax, %eax		#	mov	eax, [esp+4]
	xorl	%edx, %edx		#	xor	edx, edx
	movl	8(%esp), %ecx		#
	subl	4(%esp), %eax		#	neg	eax
	sbbl	%ecx, %edx		#	sbb	edx, [esp+8]
	testl	%edx, %ecx		#
	js	.LBB3_1			#	into
# %bb.2:
	retl				#	ret
.LBB3_1:
	ud2				#
[…]
__subvdi3:				# @__subvdi3
# %bb.0:
	pushl	%ebx			#
	pushl	%esi			#
	movl	12(%esp), %eax		#	mov	eax, [esp+4]
	movl	16(%esp), %esi		#	mov	edx, [esp+8]
	movl	24(%esp), %ecx		#	sub	eax, [esp+12]
	subl	20(%esp), %eax		#	sbb	edx, [esp+16]
	movl	%esi, %edx		#
	sbbl	%ecx, %edx		#
	setns	%bl			#
	testl	%esi, %esi		#
	setns	%bh			#
	cmpb	%bl, %bh		#
	setne	%bl			#
	testl	%ecx, %ecx		#
	setns	%cl			#
	cmpb	%cl, %bh		#
	setne	%cl			#
	andb	%bl, %cl		#
	cmpb	$1, %cl			#
	je	.LBB4_1			#	into
# %bb.2:
	popl	%esi			#
	popl	%ebx			#
	retl				#	ret
.LBB4_1:
	ud2				#
[…]
	.ident	"clang version 10.0.0 "
[…]
OUCH: 29 instructions in 68 bytes instead of only 10 instructions in just 21 bytes for the __absvdi2() function, and 24 instructions in 47 bytes instead of only 6 instructions in just 18 bytes for each of the __addvdi3() and __subvdi3() functions!

OOPS: 24 instructions in 60 bytes, plus 98 instructions in 266 bytes of the called __mulodi4() function, instead of only 46 instructions in just 113 bytes for the __mulvdi3() function.

Note: exploration of the equally bad code generated for the __absvsi2(), __addvsi3(), __mulvsi3() and __subvsi3() functions as well as the __absvti2(), __addvti3(), __mulvti3() and __subvti3() functions is left as an exercise to the reader.
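In isolation, the sign-mask idiom behind __absvdi2() reads as follows (a sketch; note that for LLONG_MIN the final subtraction would overflow, which is exactly the case the __builtin_saddll_overflow() check in case15.c catches):

```c
// Sketch: branch-free absolute value via a sign mask. sign is 0 for
// non-negative values and -1 (all bits set) for negative ones;
// (value ^ sign) - sign therefore negates exactly the negative inputs.
long long sign_mask_abs(long long value) {
    long long sign = 0 - (value < 0);
    return (value ^ sign) - sign;
}
```

On i386 this is the CDQ/ADD/ADC/XOR sequence shown in the right column of __absvdi2() above.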

The optimiser fails – take 3

Create the text file case16.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

unsigned int reverse(unsigned int value) {
#ifndef ALTERNATE
    // swap all adjacent bits
    value = ((value &  0xAAAAAAAAU) >> 1)
          | ((value & ~0xAAAAAAAAU) << 1);
    // swap adjacent pairs of bits
    value = ((value &  0xCCCCCCCCU) >> 2)
          | ((value & ~0xCCCCCCCCU) << 2);
    // swap adjacent nibbles
    value = ((value &  0xF0F0F0F0U) >> 4)
          | ((value & ~0xF0F0F0F0U) << 4);
    // swap adjacent octets
    value = ((value &  0xFF00FF00U) >> 8)
          | ((value & ~0xFF00FF00U) << 8);
    // swap high and low part
    value = ((value &  0xFFFF0000U) >> 16)
          | ((value & ~0xFFFF0000U) << 16);
#else
    value = ((value <<  1) &  0xAAAAAAAAU)
          | ((value >>  1) & ~0xAAAAAAAAU);
    value = ((value <<  2) &  0xCCCCCCCCU)
          | ((value >>  2) & ~0xCCCCCCCCU);
    value = ((value <<  4) &  0xF0F0F0F0U)
          | ((value >>  4) & ~0xF0F0F0F0U);
    value = ((value <<  8) &  0xFF00FF00U)
          | ((value >>  8) & ~0xFF00FF00U);
    value = ((value << 16) &  0xFFFF0000U)
          | ((value >> 16) & ~0xFFFF0000U);
#endif
    return value;
}
Compile the source file case16.c with Clang, engaging the optimiser, targeting the i386 platform, and display the generated assembly code:
clang -o- -O3 -S -target i386-pc-linux case16.c
Note: the left column shows the generated code, while the right column shows only the replacement for properly optimised code as comment.
[…]
reverse:				# @reverse
# %bb.0:
	movl	4(%esp), %eax
	movl	%eax, %ecx
	andl	$0xD5555555, %eax
	shrl	%ecx
	andl	$0x55555555, %ecx
	leal	(%ecx,%eax,2), %eax
	movl	%eax, %ecx
	andl	$0xF3333333, %eax
	shrl	$2, %ecx
	andl	$0x33333333, %ecx
	leal	(%ecx,%eax,4), %eax
	movl	%eax, %ecx
	shll	$4, %eax
	shrl	$4, %ecx
	andl	$0xF0F0F0F0, %eax
	andl	$0x0F0F0F0F, %ecx
	orl	%ecx, %eax
	movl	%eax, %ecx		#	bswap	%eax
	shll	$8, %eax		#
	shrl	$8, %ecx		#
	andl	$0xFF00FF00, %eax	#
	andl	$0x00FF00FF, %ecx	#
	orl	%ecx, %eax		#
	roll	$16, %eax		#
	retl
[…]
	.ident	"clang version 10.0.0 "
[…]
OUCH: the optimiser fails to recognise this common and well-known idiom for endian conversion!

Note: exploration of the code generated with the preprocessor macro ALTERNATE defined is left as an exercise to the reader.
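The last two swap steps together are exactly an endian conversion, so the whole function can be written with the builtin the optimiser should have substituted itself (a sketch, assuming GCC/Clang's __builtin_bswap32(); the name reverse_bswap is made up here):

```c
// Sketch: bit reversal with the final octet and half-word swaps
// collapsed into a single byte swap.
unsigned int reverse_bswap(unsigned int value) {
    // swap all adjacent bits
    value = ((value & 0xAAAAAAAAU) >> 1) | ((value & 0x55555555U) << 1);
    // swap adjacent pairs of bits
    value = ((value & 0xCCCCCCCCU) >> 2) | ((value & 0x33333333U) << 2);
    // swap adjacent nibbles
    value = ((value & 0xF0F0F0F0U) >> 4) | ((value & 0x0F0F0F0FU) << 4);
    // octet and half-word swap in one step: a single BSWAP instruction
    return __builtin_bswap32(value);
}
```

This is precisely the replacement shown in the right column: one BSWAP instead of seven shift/mask/or instructions.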

The optimiser fails – take 4

Create the text file case17.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2004-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

int __ucmpti2(__uint128_t a, __uint128_t b) {
    if (a < b)
        return 0;
    if (a > b)
        return 2;
    return 1;
}
Compile the source file case17.c with Clang, engaging the optimiser, targeting the AMD64 platform, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-linux case17.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…]
__ucmpti2:				# @__ucmpti2
# %bb.0:
	cmpq	%rdi, %rdx		#	cmp	rdi, rdx
	movq	%rcx, %rax		#	mov	rax, rsi
	sbbq	%rsi, %rax		#	sbb	rax, rcx
	movl	$1, %r8d		#
	adcl	$0, %r8d		#
	xorl	%eax, %eax		#	sbb	eax, eax
	cmpq	%rdx, %rdi		#	cmp	rdx, rdi
	sbbq	%rcx, %rsi		#	sbb	rcx, rsi
	cmovael	%r8d, %eax		#	adc	eax, 1
	retq				#	ret
[…]
	.ident	"clang version 10.0.0 "
[…]
Oops: 10 instructions in 32 bytes, including a superfluous conditional move which impairs performance, instead of 8 instructions in 21 bytes, without conditional move.
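The same branch- and cmov-free result can be obtained directly at C level (a sketch; the name ucmp128 is made up here):

```c
// Sketch: three-way unsigned comparison returning 0, 1 or 2 without a
// conditional move or branch: each relational expression yields 0 or 1,
// and their sum encodes less (0), equal (1) and greater (2).
int ucmp128(unsigned __int128 a, unsigned __int128 b) {
    return (a >= b) + (a > b);
}
```

Each comparison maps to a CMP/SBB pair on AMD64, which is the pattern the hand-optimised column accumulates with SBB and ADC.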

The optimiser fails – take 5

Create the text file case18.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2004-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

int __ucmpti2(__uint128_t a, __uint128_t b) {
    if (a > b)
        return 2;
    if (a == b)
        return 1;
    return 0;
}
Compile the source file case18.c with Clang, engaging the optimiser, targeting the AMD64 platform with the Microsoft calling convention, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-windows case18.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…]
__ucmpti2:				# @__ucmpti2
# %bb.0:
	movdqa	(%rcx), %xmm0		#	mov	r8, [rcx]
	movq	(%rdx), %r8		#	mov	r9, [rcx+8]
	pcmpeqb	(%rdx), %xmm0		#	mov	rcx, [rdx]
	pmovmskb	%xmm0, %eax	#	mov	rdx, [rdx+8]
	xorl	%r9d, %r9d		#	cmp	r8, rcx
	cmpl	$0xFFFF, %eax		#	mov	rax, r9
	sete	%r9b			#	sbb	rax, rdx
	cmpq	(%rcx), %r8		#	sbb	eax, eax
	movq	8(%rdx), %rax		#	cmp	rcx, r8
	sbbq	8(%rcx), %rax		#	sbb	rdx, r9
	movl	$2, %eax		#	adc	eax, 1
	cmovael %r9d, %eax		#
	retq				#	ret
[…]
	.ident	"clang version 10.0.0 "
[…]
Oops: 13 instructions in 48 bytes, including a superfluous conditional move which impairs performance, needlessly using the SSE register XMM0, and performing 6 memory accesses, instead of 12 instructions in just 35 bytes, without conditional move and with only 4 memory accesses.

The optimiser fails – take 6

Create the text file case19.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

int __clzti2(__uint128_t value) {
    if ((value >> 64) != 0)
        return __builtin_clzll(value >> 64);
    if ((value & ~0ULL) != 0)
        return __builtin_clzll(value) + 64;
    return 128;
}

int __ctzti2(__uint128_t value) {
    if ((value & ~0ULL) != 0)
        return __builtin_ctzll(value);
    if ((value >> 64) != 0)
        return __builtin_ctzll(value >> 64) + 64;
    return 128;
}
Compile the source file case19.c with Clang, engaging the optimiser, targeting the AMD64 platform, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-linux case19.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…]
__clzti2:				# @__clzti2
# %bb.0:
	testq	%rsi, %rsi		#	bsr	rax, rsi
	je	.LBB0_2			#	jz	.LBB0_2
# %bb.1:
	bsrq	%rsi, %rax		#
	xorl	$63, %eax		#	xor	eax, 63
	retq				#	ret
.LBB0_2:
	testq	%rdi, %rdi		#	bsr	rax, rdi
	je	.LBB0_3			#	jz	.LBB0_3
# %bb.4:
	bsrq	%rdi, %rax		#
	xorl	$63, %eax		#	xor	eax, 127
	orl	$64, %eax		#
	retq				#	ret
.LBB0_3:
	movl	$128, %eax		#	mov	eax, 128
	retq				#	ret
[…]
__ctzti2:				# @__ctzti2
# %bb.0:
	testq	%rdi, %rdi		#
	je	.LBB1_2			#
# %bb.1:
	bsfq	%rdi, %rax		#	bsf	rax, rdi
	retq				#	jnz	.return
.LBB1_2:
	testq	%rsi, %rsi		#
	je	.LBB1_3			#
# %bb.4:
	bsfq	%rsi, %rax		#	bsf	rax, rsi
					#	jz	.LBB1_3
	orl	$64, %eax		#	or	eax, 64
					# .return:
	retq				#	ret
.LBB1_3:
	movl	$128, %eax		#	mov	eax, 128
	retq				#	ret
[…]
	.ident	"clang version 10.0.0 "
[…]
Oops: 13 instructions in 35 bytes instead of 10 instructions in 26 bytes for the __clzti2() function, and 11 instructions in 30 bytes instead of 8 instructions in 22 bytes for the __ctzti2() function.
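For reference, the semantics both helpers must implement can be pinned down with a few test values (function bodies repeated from case19.c, renamed clz128/ctz128 to avoid colliding with the library symbols):

```c
// Reference semantics of the two count-leading/trailing-zero helpers.
static int clz128(__uint128_t value) {
    if ((value >> 64) != 0)
        return __builtin_clzll((unsigned long long) (value >> 64));
    if ((value & ~0ULL) != 0)
        return __builtin_clzll((unsigned long long) value) + 64;
    return 128;
}

static int ctz128(__uint128_t value) {
    if ((value & ~0ULL) != 0)
        return __builtin_ctzll((unsigned long long) value);
    if ((value >> 64) != 0)
        return __builtin_ctzll((unsigned long long) (value >> 64)) + 64;
    return 128;
}
// Expected: clz128(0) == 128, clz128(1) == 127, ctz128(1) == 0,
//           ctz128((__uint128_t) 1 << 127) == 127.
```

The hand-optimised columns exploit that BSR and BSF set ZF when their source operand is zero, so the separate TEST instructions Clang emits are redundant.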

The optimiser fails – take 7

Create the text file case20.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifndef __amd64__
long long __mindi3(long long a, long long b) {
    return a < b ? a : b;
}

unsigned long long __umindi3(unsigned long long a, unsigned long long b) {
    return a < b ? a : b;
}
#else
__int128_t __maxti3(__int128_t a, __int128_t b) {
    return a > b ? a : b;
}

__uint128_t __umaxti3(__uint128_t a, __uint128_t b) {
    return a > b ? a : b;
}
#endif
Compile the source file case20.c with Clang, engaging the optimiser, targeting the AMD64 platform, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-linux case20.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…]
__maxti3:				# @__maxti3
# %bb.0:
	movq	%rdx, %rax		#	mov	rax, rdx
	cmpq	%rdx, %rdi		#	cmp	rdx, rdi
	movq	%rsi, %rdx		#	mov	rdx, rcx
	sbbq	%rcx, %rdx		#	sbb	rcx, rsi
	cmovgeq	%rdi, %rax		#	cmovl	rax, rdi
	cmovgeq	%rsi, %rcx		#	cmovl	rdx, rsi
	movq	%rcx, %rdx		#
	retq				#	ret
[…]
__umaxti3:				# @__umaxti3
# %bb.0:
	movq	%rdx, %rax		#	mov	rax, rdx
	cmpq	%rdi, %rdx		#	cmp	rdx, rdi
	movq	%rcx, %rdx		#	mov	rdx, rcx
	sbbq	%rsi, %rdx		#	sbb	rcx, rsi
	cmovbq	%rdi, %rax		#	cmovb	rax, rdi
	cmovbq	%rsi, %rcx		#	cmovb	rdx, rsi
	movq	%rcx, %rdx		#
	retq				#	ret
[…]
	.ident	"clang version 10.0.0 "
[…]
Oops: 8 instructions in 24 bytes instead of 7 instructions in 21 bytes for the __maxti3() and __umaxti3() functions.

Note: exploration of the code generated for the __minti3() and __uminti3() functions on the AMD64 platform is left as an exercise to the reader.

Compile the source file case20.c a second time with Clang, now targeting the i386 platform, and display the generated assembly code:

clang -o- -O3 -S -target i386-pc-linux case20.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…]
__mindi3:				# @__mindi3
# %bb.0:
	pushl	%edi			#
	pushl	%esi			#
	movl	24(%esp), %edx		#	mov	edx, [esp+16]
	movl	12(%esp), %ecx		#
	movl	20(%esp), %eax		#	mov	eax, [esp+12]
	movl	16(%esp), %esi		#
	cmpl	%ecx, %eax		#	cmp	eax, [esp+4]
	movl	%edx, %edi		#	mov	ecx, edx
	sbbl	%esi, %edi		#	sbb	ecx, [esp+8]
	cmovgel	%ecx, %eax		#	cmovge	eax, [esp+4]
	cmovgel	%esi, %edx		#	cmovge	edx, [esp+8]
	popl	%esi			#
	popl	%edi			#
	retl				#	ret
[…]
__umindi3:				# @__umindi3
# %bb.0:
	pushl	%edi			#
	pushl	%esi			#
	movl	16(%esp), %esi		#
	movl	20(%esp), %eax		#	mov	eax, [esp+4]
	movl	12(%esp), %ecx		#
	movl	24(%esp), %edx		#	mov	edx, [esp+8]
	cmpl	%eax, %ecx		#	cmp	eax, [esp+12]
	movl	%esi, %edi		#	mov	ecx, edx
	sbbl	%edx, %edi		#	sbb	ecx, [esp+16]
	cmovbl	%ecx, %eax		#	cmovnb	eax, [esp+12]
	cmovbl	%esi, %edx		#	cmovnb	edx, [esp+16]
	popl	%esi			#
	popl	%edi			#
	retl				#	ret
[…]
	.ident	"clang version 10.0.0 "
[…]
Ouch: 14 instructions in 33 bytes, clobbering the registers EDI and ESI without necessity, instead of only 8 instructions in 29 bytes for the __mindi3() and __umindi3() functions.

Note: exploration of the equally bad code generated for the __maxdi3() and __umaxdi3() functions on the i386 platform is left as an exercise to the reader.

The optimiser fails – take 8

Create the text file case21.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

unsigned int __rotlsi3(unsigned int value, int count) {
    return (value << (31 & count))
         | (value >> (31 & -count));
}

unsigned int __rotrsi3(unsigned int value, int count) {
    return (value >> (31 & count))
         | (value << (31 & -count));
}

unsigned long long __rotldi3(unsigned long long value, int count) {
    return (value << (63 & count))
         | (value >> (63 & -count));
}

unsigned long long __rotrdi3(unsigned long long value, int count) {
    return (value >> (63 & count))
         | (value << (63 & -count));
}

#ifdef __amd64__
__uint128_t __rotlti3(__uint128_t value, int count) {
    return (value << (127 & count))
         | (value >> (127 & -count));
}

__uint128_t __rotrti3(__uint128_t value, int count) {
#ifdef OPTIMIZE
    __asm__("movq\t%[low], %%rax\n\t"
            "shrdq\t%%cl, %[high], %%rax\n\t"
            "shrdq\t%%cl, %[low], %[high]\n\t"
            "movq\t%[high], %%rdx\n\t"
            "test\t$64, %%cl\n\t"
            "cmovnz\t%%rax, %%rdx\n\t"
            "cmovnz\t%[high], %%rax"
           :"=A" (value)
           :"c" (count),
            [low] "r" ((unsigned long long) (value & ~0ULL)),
            [high] "r" ((unsigned long long) (value >> 64))
           :"cc");
    return value;
#else
    return (value >> (127 & count))
         | (value << (127 & -count));
#endif
}
#endif
Compile the source file case21.c with Clang, engaging the optimiser, targeting the AMD64 platform, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-linux case21.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…]
__rotlsi3:				# @__rotlsi3
# %bb.0:
	movl	%esi, %ecx
	movl	%edi, %eax
	roll	%cl, %eax
	retq
[…]
__rotrsi3:				# @__rotrsi3
# %bb.0:
	movl	%esi, %ecx
	movl	%edi, %eax
	rorl	%cl, %eax
	retq
[…]
__rotldi3:				# @__rotldi3
# %bb.0:
	movl	%esi, %ecx
	movq	%rdi, %rax
	rolq	%cl, %rax
	retq
[…]
__rotrdi3:				# @__rotrdi3
# %bb.0:
	movl	%esi, %ecx
	movq	%rdi, %rax
	rorq	%cl, %rax
	retq
[…]
__rotlti3:				# @__rotlti3
# %bb.0:
	movl	%edx, %ecx		#	mov	ecx, edx
	movq	%rsi, %rdx		#
	movq	%rdi, %rax		#	mov	rax, rdi
	shldq	%cl, %rdi, %rsi		#	shld	rax, rsi, cl
	shlq	%cl, %rdi		#	shld	rsi, rdi, cl
	xorl	%r8d, %r8d		#	mov	rdx, rsi
	testb	$64, %cl		#	test	cl, 64
	cmovneq	%rdi, %rsi		#	cmovnz	rdx, rax
	cmovneq	%r8, %rdi		#	cmovnz	rax, rsi
	negb	%cl			#
	shrdq	%cl, %rdx, %rax		#
	shrq	%cl, %rdx		#
	testb	$64, %cl		#
	cmovneq	%rdx, %rax		#
	cmovneq	%r8, %rdx		#
	orq	%rdi, %rax		#
	orq	%rsi, %rdx		#
	retq				#	ret
[…]
__rotrti3:				# @__rotrti3
# %bb.0:
	movl	%edx, %ecx		#	mov	ecx, edx
	movq	%rsi, %rdx		#
	movq	%rdi, %rax		#	mov	rax, rdi
	movq	%rdi, %rsi		#
	shrdq	%cl, %rdx, %rsi		#	shrd	rax, rsi, cl
	movq	%rdx, %rdi		#
	shrq	%cl, %rdi		#	shrd	rsi, rdi, cl
	xorl	%r8d, %r8d		#	mov	rdx, rsi
	testb	$64, %cl		#	test	cl, 64
	cmovneq	%rdi, %rsi		#	cmovnz	rdx, rax
	cmovneq	%r8, %rdi		#	cmovnz	rax, rsi
	negb	%cl			#
	shldq	%cl, %rax, %rdx		#
	shlq	%cl, %rax		#
	testb	$64, %cl		#
	cmovneq	%rax, %rdx		#
	cmovneq	%r8, %rax		#
	orq	%rsi, %rax		#
	orq	%rdi, %rdx		#
	retq				#	ret
[…]
	.ident	"clang version 10.0.0 "
[…]
Ouch: while the optimiser recognises the common and well-known idiom for rotation of 32-bit as well as 64-bit integers and generates proper code for the __rotlsi3(), __rotrsi3(), __rotldi3() and __rotrdi3() functions, it fails rather badly at 128-bit integers and generates awful code using 18 instructions in 56 bytes and 20 instructions in 62 bytes respectively, instead of only 9 instructions in just 28 bytes for each of the __rotlti3() and __rotrti3() functions!

Note: exploration of the equally bad code generated for the __rotldi3() and __rotrdi3() functions on the i386 platform is left as an exercise to the reader.

The optimiser fails – take 9

Create the text file case22.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

unsigned int __rotlsi3(unsigned int value, int count) {
#ifdef OPTIMIZE
    return __builtin_rotateleft32(value, count);
#else
    return (value << (31 & count))
         | (value >> (31 & -count));
#endif
}

unsigned int __rotrsi3(unsigned int value, int count) {
#ifdef OPTIMIZE
    return __builtin_rotateright32(value, count);
#else
    return (value >> (31 & count))
         | (value << (31 & -count));
#endif
}

unsigned long long __rotldi3(unsigned long long value, int count) {
#ifdef OPTIMIZE
    return __builtin_rotateleft64(value, count);
#else
    return (value << (63 & count))
         | (value >> (63 & -count));
#endif
}

unsigned long long __rotrdi3(unsigned long long value, int count) {
#ifdef OPTIMIZE
    return __builtin_rotateright64(value, count);
#else
    return (value >> (63 & count))
         | (value << (63 & -count));
#endif
}

unsigned int __aswapsi2(unsigned int value) {
    return (__rotlsi3(value, 8) & 0x00FF00FFU)
#ifdef ALTERNATE
         | (__rotlsi3(value, 24) & 0xFF00FF00U);
#else
         | (__rotrsi3(value, 8) & 0xFF00FF00U);
#endif
}

unsigned int __bswapsi2(unsigned int value) {
    return __rotlsi3(value & 0xFF00FF00U, 8)
#ifdef ALTERNATE
         | __rotlsi3(value & 0x00FF00FFU, 24);
#else
         | __rotrsi3(value & 0x00FF00FFU, 8);
#endif
}

unsigned long long __aswapdi2(unsigned long long value) {
    return (__rotldi3(value, 8)  & 0x000000FF000000FFULL)
         | (__rotldi3(value, 24) & 0x0000FF000000FF00ULL)
#ifdef ALTERNATE
         | (__rotldi3(value, 40) & 0x00FF000000FF0000ULL)
         | (__rotldi3(value, 56) & 0xFF000000FF000000ULL);
#else
         | (__rotrdi3(value, 24) & 0x00FF000000FF0000ULL)
         | (__rotrdi3(value, 8)  & 0xFF000000FF000000ULL);
#endif
}

unsigned long long __bswapdi2(unsigned long long value) {
    return __rotldi3(value & 0xFF000000FF000000ULL, 8)
         | __rotldi3(value & 0x00FF000000FF0000ULL, 24)
#ifdef ALTERNATE
         | __rotldi3(value & 0x0000FF000000FF00ULL, 40)
         | __rotldi3(value & 0x000000FF000000FFULL, 56);
#else
         | __rotrdi3(value & 0x0000FF000000FF00ULL, 24)
         | __rotrdi3(value & 0x000000FF000000FFULL, 8);
#endif
}
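For the record, the rotate-and-mask formula computed by __aswapsi2() is just an ordinary 32-bit byte swap. A quick sanity check (a sketch using hypothetical helper names, mirroring the rotate helpers defined in case22.c):

```c
/* Sanity check: the rotate-based __aswapsi2() formula from case22.c
   is an ordinary 32-bit byte swap. */
static unsigned int rotl32(unsigned int v, int c) {
    return (v << (31 & c)) | (v >> (31 & -c));
}

static unsigned int rotr32(unsigned int v, int c) {
    return (v >> (31 & c)) | (v << (31 & -c));
}

static unsigned int aswap32(unsigned int v) {
    return (rotl32(v, 8) & 0x00FF00FFU)   /* bytes 2 and 0 land in place */
         | (rotr32(v, 8) & 0xFF00FF00U);  /* bytes 3 and 1 land in place */
}
```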
Compile the source file case22.c with Clang, engaging the optimiser, targeting the i386 platform, and display the generated assembly code:
clang -o- -O3 -S -target i386-pc-linux case22.c
[…]
__rotlsi3:				# @__rotlsi3
# %bb.0:
	movb	8(%esp), %cl
	movl	4(%esp), %eax
	roll	%cl, %eax
	retl
[…]
__rotrsi3:				# @__rotrsi3
# %bb.0:
	movb	8(%esp), %cl
	movl	4(%esp), %eax
	rorl	%cl, %eax
	retl
[…]
__rotldi3:				# @__rotldi3
# %bb.0:
	pushl	%ebp
	pushl	%ebx
	pushl	%edi
	pushl	%esi
	movl	24(%esp), %esi
	movl	20(%esp), %eax
	movb	28(%esp), %cl
	xorl	%ebp, %ebp
	movl	%eax, %edi
	movl	%esi, %ebx
	movl	%esi, %edx
	shll	%cl, %edi
	shldl	%cl, %eax, %ebx
	testb	$32, %cl
	cmovnel	%edi, %ebx
	cmovnel	%ebp, %edi
	negb	%cl
	shrl	%cl, %edx
	shrdl	%cl, %esi, %eax
	testb	$32, %cl
	cmovnel	%edx, %eax
	cmovnel	%ebp, %edx
	orl	%edi, %eax
	orl	%ebx, %edx
	popl	%esi
	popl	%edi
	popl	%ebx
	popl	%ebp
	retl
[…]
__rotrdi3:				# @__rotrdi3
# %bb.0:
	pushl	%ebp
	pushl	%ebx
	pushl	%edi
	pushl	%esi
	movl	20(%esp), %esi
	movl	24(%esp), %edx
	movb	28(%esp), %cl
	xorl	%ebp, %ebp
	movl	%edx, %edi
	movl	%esi, %ebx
	movl	%esi, %eax
	shrl	%cl, %edi
	shrdl	%cl, %edx, %ebx
	testb	$32, %cl
	cmovnel	%edi, %ebx
	cmovnel	%ebp, %edi
	negb	%cl
	shll	%cl, %eax
	shldl	%cl, %esi, %edx
	testb	$32, %cl
	cmovnel	%eax, %edx
	cmovnel	%ebp, %eax
	orl	%ebx, %eax
	orl	%edi, %edx
	popl	%esi
	popl	%edi
	popl	%ebx
	popl	%ebp
	retl
[…]
__aswapsi2:				# @__aswapsi2
# %bb.0:
	movl	4(%esp), %eax
	movl	%eax, %ecx
	roll	$24, %eax
	roll	$8, %ecx
	andl	$0xFF00FF00, %eax
	andl	$0xFF00FF, %ecx
	orl	%ecx, %eax
	retl
[…]
__bswapsi2:				# @__bswapsi2
# %bb.0:
	movl	4(%esp), %ecx
	movl	%ecx, %edx
	movl	%ecx, %eax
	andl	$0xFF00, %edx
	andl	$0xFF0000, %eax
	shldl	$8, %ecx, %edx
	shrdl	$8, %ecx, %eax
	orl	%edx, %eax
	retl
[…]
__aswapdi2:				# @__aswapdi2
# %bb.0:
	movl	4(%esp), %edx
	movl	8(%esp), %eax
	bswapl	%eax
	bswapl	%edx
	retl
[…]
__bswapdi2:				# @__bswapdi2
# %bb.0:
	movl	4(%esp), %edx
	movl	8(%esp), %eax
	bswapl	%eax
	bswapl	%edx
	retl
[…]
	.ident	"clang version 10.0.0 "
[…]
Oops: although the code generated for the __rotldi3() and __rotrdi3() functions is horrible, the __aswapdi2() and __bswapdi2() functions are properly optimised.

OUCH: contrary to the __aswapdi2() and __bswapdi2() functions, the __aswapsi2() and __bswapsi2() functions are not optimised!

Compile the source file case22.c a second time with Clang, again targeting the i386 platform, now with the preprocessor macro OPTIMIZE defined, and display the generated assembly code:

clang -DOPTIMIZE -o- -O3 -S -target i386-pc-linux case22.c
[…]
__rotlsi3:				# @__rotlsi3
# %bb.0:
	movb	8(%esp), %cl
	movl	4(%esp), %eax
	roll	%cl, %eax
	retl
[…]
__rotrsi3:				# @__rotrsi3
# %bb.0:
	movb	8(%esp), %cl
	movl	4(%esp), %eax
	rorl	%cl, %eax
	retl
[…]
__rotldi3:				# @__rotldi3
# %bb.0:
	pushl	%ebp
	pushl	%ebx
	pushl	%edi
	pushl	%esi
	movb	28(%esp), %ch
	movl	20(%esp), %esi
	movl	24(%esp), %edx
	xorl	%ebp, %ebp
	movb	%ch, %cl
	movl	%edx, %edi
	movl	%esi, %ebx
	movl	%esi, %eax
	negb	%cl
	shrl	%cl, %edi
	shrdl	%cl, %edx, %ebx
	testb	$32, %cl
	movb	%ch, %cl
	cmovnel	%edi, %ebx
	cmovnel	%ebp, %edi
	shll	%cl, %eax
	shldl	%cl, %esi, %edx
	testb	$32, %ch
	cmovnel	%eax, %edx
	cmovnel	%ebp, %eax
	orl	%ebx, %eax
	orl	%edi, %edx
	popl	%esi
	popl	%edi
	popl	%ebx
	popl	%ebp
	retl
[…]
__rotrdi3:				# @__rotrdi3
# %bb.0:
	pushl	%ebp
	pushl	%ebx
	pushl	%edi
	pushl	%esi
	movl	20(%esp), %esi
	movl	24(%esp), %edx
	movb	28(%esp), %cl
	xorl	%ebp, %ebp
	movl	%edx, %edi
	movl	%esi, %ebx
	movl	%esi, %eax
	shrl	%cl, %edi
	shrdl	%cl, %edx, %ebx
	testb	$32, %cl
	cmovnel	%edi, %ebx
	cmovnel	%ebp, %edi
	negb	%cl
	shll	%cl, %eax
	shldl	%cl, %esi, %edx
	testb	$32, %cl
	cmovnel	%eax, %edx
	cmovnel	%ebp, %eax
	orl	%ebx, %eax
	orl	%edi, %edx
	popl	%esi
	popl	%edi
	popl	%ebx
	popl	%ebp
	retl
[…]
__aswapsi2:				# @__aswapsi2
# %bb.0:
	movl	4(%esp), %eax
	movl	%eax, %ecx
	roll	$24, %eax
	roll	$8, %ecx
	andl	$0xFF00FF00, %eax
	andl	$0xFF00FF, %ecx
	orl	%ecx, %eax
	retl
[…]
__bswapsi2:				# @__bswapsi2
# %bb.0:
	movl	4(%esp), %ecx
	movl	%ecx, %edx
	movl	%ecx, %eax
	andl	$0xFF00, %edx
	andl	$0xFF0000, %eax
	shldl	$8, %ecx, %edx
	shrdl	$8, %ecx, %eax
	orl	%edx, %eax
	retl
[…]
.LCPI8_0:
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	255			# 0xff
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	255			# 0xff
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	255			# 0xff
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	255			# 0xff
.LCPI8_1:
	.byte	255			# 0xff
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	255			# 0xff
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	255			# 0xff
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	255			# 0xff
	.byte	0			# 0x0
	.byte	0			# 0x0
[…]
__aswapdi2:				# @__aswapdi2
# %bb.0:
	movq	4(%esp), %xmm0		# xmm0 = mem[0],zero
	pshufd	$68, %xmm0, %xmm0	# xmm0 = xmm0[0,1,0,1]
	movdqa	%xmm0, %xmm1
	movdqa	%xmm0, %xmm2
	movdqa	%xmm0, %xmm3
	movdqa	%xmm0, %xmm5
	movdqa	%xmm0, %xmm4
	psrlq	$56, %xmm1
	psrlq	$40, %xmm2
	psllq	$24, %xmm3
	psrlq	$8, %xmm5
	psllq	$40, %xmm4
	movsd	%xmm1, %xmm2		# xmm2 = xmm1[0],xmm2[1]
	movdqa	%xmm0, %xmm1
	psllq	$8, %xmm1
	movsd	%xmm1, %xmm3		# xmm3 = xmm1[0],xmm3[1]
	movdqa	%xmm0, %xmm1
	psllq	$56, %xmm0
	psrlq	$24, %xmm1
	movsd	%xmm4, %xmm0		# xmm0 = xmm4[0],xmm0[1]
	orpd	%xmm2, %xmm3
	movsd	%xmm1, %xmm5		# xmm5 = xmm1[0],xmm5[1]
	andpd	.LCPI8_1, %xmm3
	orpd	%xmm5, %xmm0
	andpd	.LCPI8_0, %xmm0
	orpd	%xmm0, %xmm3
	pshufd	$78, %xmm3, %xmm0	# xmm0 = xmm3[2,3,0,1]
	por	%xmm3, %xmm0
	movd	%xmm0, %eax
	pshufd	$229, %xmm0, %xmm0	# xmm0 = xmm0[1,1,2,3]
	movd	%xmm0, %edx
	retl
[…]
__bswapdi2:				# @__bswapdi2
# %bb.0:
	pushl	%ebx
	pushl	%edi
	pushl	%esi
	movl	16(%esp), %edx
	movl	20(%esp), %eax
	movl	%edx, %edi
	movl	%eax, %ebx
	movl	%eax, %ecx
	movl	%edx, %esi
	andl	$0xFF0000, %edi
	andl	$0xFF0000, %ebx
	shrl	$24, %ecx
	shrl	$24, %esi
	shrl	$8, %ebx
	shrl	$8, %edi
	orl	%ecx, %ebx
	movl	%eax, %ecx
	orl	%esi, %edi
	movl	%edx, %esi
	shll	$24, %edx
	shll	$24, %eax
	andl	$0xFF00, %ecx
	andl	$0xFF00, %esi
	shll	$8, %esi
	shll	$8, %ecx
	orl	%edi, %esi
	orl	%ebx, %ecx
	orl	%esi, %edx
	orl	%ecx, %eax
	popl	%esi
	popl	%edi
	popl	%ebx
	retl
[…]
	.ident	"clang version 10.0.0 "
[…]
OUCH: the braindead implementation of the code generator for __builtin_*() strikes again!

Compile the source file case22.c a third time with Clang, now targeting the AMD64 platform, and display the generated assembly code:

clang -o- -O3 -S -target amd64-pc-linux case22.c
[…]
__rotlsi3:				# @__rotlsi3
# %bb.0:
	movl	%esi, %ecx
	movl	%edi, %eax
	roll	%cl, %eax
	retq
[…]
__rotrsi3:				# @__rotrsi3
# %bb.0:
	movl	%esi, %ecx
	movl	%edi, %eax
	rorl	%cl, %eax
	retq
[…]
__rotldi3:				# @__rotldi3
# %bb.0:
	movl	%esi, %ecx
	movq	%rdi, %rax
	rolq	%cl, %rax
	retq
[…]
__rotrdi3:				# @__rotrdi3
# %bb.0:
	movl	%esi, %ecx
	movq	%rdi, %rax
	rorq	%cl, %rax
	retq
[…]
__aswapsi2:				# @__aswapsi2
# %bb.0:
	movl	%edi, %eax
	roll	$8, %eax
	andl	$0xFF00FF, %eax
	roll	$24, %edi
	andl	$0xFF00FF00, %edi
	addl	%edi, %eax
	retq
[…]
__bswapsi2:				# @__bswapsi2
# %bb.0:
	movl	%edi, %ecx
	andl	$0xFF00, %ecx
	shldl	$8, %edi, %ecx
	movl	%edi, %eax
	andl	$0xFF0000, %eax
	shrdl	$8, %edi, %eax
	orl	%ecx, %eax
	retq
[…]
__aswapdi2:				# @__aswapdi2
# %bb.0:
	movq	%rdi, %rax
	bswapq	%rax
	retq
[…]
__bswapdi2:				# @__bswapdi2
# %bb.0:
	movq	%rdi, %rax
	bswapq	%rax
	retq
[…]
	.ident	"clang version 10.0.0 "
[…]
OUCH: while the __aswapdi2() and __bswapdi2() functions are optimised, the __aswapsi2() and __bswapsi2() functions are not!

Compile the source file case22.c a fourth time with Clang, again targeting the AMD64 platform, now with the preprocessor macro OPTIMIZE defined, and display the generated assembly code:

clang -DOPTIMIZE -o- -O3 -S -target amd64-pc-linux case22.c
[…]
__rotlsi3:				# @__rotlsi3
# %bb.0:
	movl	%esi, %ecx
	movl	%edi, %eax
	roll	%cl, %eax
	retq
[…]
__rotrsi3:				# @__rotrsi3
# %bb.0:
	movl	%esi, %ecx
	movl	%edi, %eax
	rorl	%cl, %eax
	retq
[…]
__rotldi3:				# @__rotldi3
# %bb.0:
	movl	%esi, %ecx
	movq	%rdi, %rax
	rolq	%cl, %rax
	retq
[…]
__rotrdi3:				# @__rotrdi3
# %bb.0:
	movl	%esi, %ecx
	movq	%rdi, %rax
	rorq	%cl, %rax
	retq
[…]
__aswapsi2:				# @__aswapsi2
# %bb.0:
	movl	%edi, %eax
	roll	$8, %eax
	andl	$0xFF00FF, %eax
	roll	$24, %edi
	andl	$0xFF00FF00, %edi
	addl	%edi, %eax
	retq
[…]
__bswapsi2:				# @__bswapsi2
# %bb.0:
	movl	%edi, %ecx
	andl	$0xFF00, %ecx
	shldl	$8, %edi, %ecx
	movl	%edi, %eax
	andl	$0xFF0000, %eax
	shrdl	$8, %edi, %eax
	orl	%ecx, %eax
	retq
[…]
.LCPI8_0:
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	255			# 0xff
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	255			# 0xff
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	255			# 0xff
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	255			# 0xff
.LCPI8_1:
	.byte	255			# 0xff
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	255			# 0xff
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	255			# 0xff
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	255			# 0xff
	.byte	0			# 0x0
	.byte	0			# 0x0
[…]
__aswapdi2:				# @__aswapdi2
# %bb.0:
	movq	%rdi, %xmm0
	pshufd	$68, %xmm0, %xmm0	# xmm0 = xmm0[0,1,0,1]
	movdqa	%xmm0, %xmm1
	psrlq	$56, %xmm1
	movdqa	%xmm0, %xmm2
	psrlq	$40, %xmm2
	movsd	%xmm1, %xmm2		# xmm2 = xmm1[0],xmm2[1]
	movdqa	%xmm0, %xmm1
	psllq	$8, %xmm1
	movdqa	%xmm0, %xmm3
	psllq	$24, %xmm3
	movsd	%xmm1, %xmm3		# xmm3 = xmm1[0],xmm3[1]
	orpd	%xmm2, %xmm3
	movdqa	%xmm0, %xmm1
	psrlq	$24, %xmm1
	movdqa	%xmm0, %xmm2
	psrlq	$8, %xmm2
	movsd	%xmm1, %xmm2		# xmm2 = xmm1[0],xmm2[1]
	movdqa	%xmm0, %xmm1
	psllq	$40, %xmm1
	psllq	$56, %xmm0
	movsd	%xmm1, %xmm0		# xmm0 = xmm1[0],xmm0[1]
	orpd	%xmm2, %xmm0
	andpd	.LCPI8_0(%rip), %xmm0
	andpd	.LCPI8_1(%rip), %xmm3
	orpd	%xmm0, %xmm3
	pshufd	$78, %xmm3, %xmm0	# xmm0 = xmm3[2,3,0,1]
	por	%xmm3, %xmm0
	movq	%xmm0, %rax
	retq
[…]
__bswapdi2:				# @__bswapdi2
# %bb.0:
	movl	%edi, %eax
	andl	$0xFF000000, %eax
	shldq	$8, %rdi, %rax
	movabsq	$0xFF000000FF0000, %rcx
	andq	%rdi, %rcx
	rolq	$24, %rcx
	orq	%rax, %rcx
	movabsq	$0xFF000000FF00, %rdx
	andq	%rdi, %rdx
	rolq	$40, %rdx
	movabsq	$0xFF00000000, %rax
	andq	%rdi, %rax
	shrdq	$8, %rdi, %rax
	orq	%rdx, %rax
	orq	%rcx, %rax
	retq
[…]
	.ident	"clang version 10.0.0 "
[…]
OUCH: the braindead implementation of the code generator for __builtin_*() strikes again!

The optimiser fails – take 10

Create the text file case23.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifndef OPTIMIZE
#define __builtin_rotateleft32(value, count)  (((value) << (31 & (count))) | ((value) >> (31 & -(count))))
#define __builtin_rotateright32(value, count) (((value) >> (31 & (count))) | ((value) << (31 & -(count))))
#define __builtin_rotateleft64(value, count)  (((value) << (63 & (count))) | ((value) >> (63 & -(count))))
#define __builtin_rotateright64(value, count) (((value) >> (63 & (count))) | ((value) << (63 & -(count))))
#endif

unsigned int __aswapsi2(unsigned int value) {
    return (__builtin_rotateleft32(value, 8) & 0x00FF00FFU)
#ifdef ALTERNATE
         | (__builtin_rotateleft32(value, 24) & 0xFF00FF00U);
#else
         | (__builtin_rotateright32(value, 8) & 0xFF00FF00U);
#endif
}

unsigned int __bswapsi2(unsigned int value) {
    return __builtin_rotateleft32(value & 0xFF00FF00U, 8)
#ifdef ALTERNATE
         | __builtin_rotateleft32(value & 0x00FF00FFU, 24);
#else
         | __builtin_rotateright32(value & 0x00FF00FFU, 8);
#endif
}

unsigned int __cswapsi2(unsigned int value) {
    value = ((value << 8) & 0xFF00FF00U)
          | ((value >> 8) & 0x00FF00FFU);
#ifndef ALTERNATE
    value = ((value << 16) & 0xFFFF0000U)
          | ((value >> 16) & 0x0000FFFFU);
#else
    value = (value << 16) | (value >> 16);
#endif
    return value;
}

unsigned int __dswapsi2(unsigned int value) {
    value = ((value & 0x00FF00FFU) << 8)
          | ((value & 0xFF00FF00U) >> 8);
#ifndef ALTERNATE
    value = ((value & 0x0000FFFFU) << 16)
          | ((value & 0xFFFF0000U) >> 16);
#else
    value = (value << 16) | (value >> 16);
#endif
    return value;
}

unsigned long long __aswapdi2(unsigned long long value) {
    return (__builtin_rotateleft64(value, 8)  & 0x000000FF000000FFULL)
         | (__builtin_rotateleft64(value, 24) & 0x0000FF000000FF00ULL)
#ifdef ALTERNATE
         | (__builtin_rotateleft64(value, 40) & 0x00FF000000FF0000ULL)
         | (__builtin_rotateleft64(value, 56) & 0xFF000000FF000000ULL);
#else
         | (__builtin_rotateright64(value, 24) & 0x00FF000000FF0000ULL)
         | (__builtin_rotateright64(value, 8)  & 0xFF000000FF000000ULL);
#endif
}

unsigned long long __bswapdi2(unsigned long long value) {
    return __builtin_rotateleft64(value & 0xFF000000FF000000ULL, 8)
         | __builtin_rotateleft64(value & 0x00FF000000FF0000ULL, 24)
#ifdef ALTERNATE
         | __builtin_rotateleft64(value & 0x0000FF000000FF00ULL, 40)
         | __builtin_rotateleft64(value & 0x000000FF000000FFULL, 56);
#else
         | __builtin_rotateright64(value & 0x0000FF000000FF00ULL, 24)
         | __builtin_rotateright64(value & 0x000000FF000000FFULL, 8);
#endif
}

unsigned long long __cswapdi2(unsigned long long value) {
    value = ((value << 8) & 0xFF00FF00FF00FF00ULL)
          | ((value >> 8) & 0x00FF00FF00FF00FFULL);
    value = ((value << 16) & 0xFFFF0000FFFF0000ULL)
          | ((value >> 16) & 0x0000FFFF0000FFFFULL);
#ifndef ALTERNATE
    value = ((value << 32) & 0xFFFFFFFF00000000ULL)
          | ((value >> 32) & 0x00000000FFFFFFFFULL);
#else
    value = (value << 32) | (value >> 32);
#endif
    return value;
}

unsigned long long __dswapdi2(unsigned long long value) {
    value = ((value & 0x00FF00FF00FF00FFULL) << 8)
          | ((value & 0xFF00FF00FF00FF00ULL) >> 8);
    value = ((value & 0x0000FFFF0000FFFFULL) << 16)
          | ((value & 0xFFFF0000FFFF0000ULL) >> 16);
#ifndef ALTERNATE
    value = ((value & 0x00000000FFFFFFFFULL) << 32)
          | ((value & 0xFFFFFFFF00000000ULL) >> 32);
#else
    value = (value << 32) | (value >> 32);
#endif
    return value;
}
Compile the source file case23.c with Clang, engaging the optimiser, targeting the i386 platform, and display the generated assembly code:
clang -mno-sse -o- -O3 -S -target i386-pc-linux case23.c
[…]
__aswapsi2:				# @__aswapsi2
# %bb.0:
	movl	4(%esp), %eax
	bswapl	%eax
	retl
[…]
__bswapsi2:				# @__bswapsi2
# %bb.0:
	movl	4(%esp), %eax
	bswapl	%eax
	retl
[…]
__cswapsi2:				# @__cswapsi2
# %bb.0:
	movl	4(%esp), %eax
	bswapl	%eax
	retl
[…]
__dswapsi2:				# @__dswapsi2
# %bb.0:
	movl	4(%esp), %eax
	bswapl	%eax
	retl
[…]
__aswapdi2:				# @__aswapdi2
# %bb.0:
	movl	4(%esp), %edx
	movl	8(%esp), %eax
	bswapl	%eax
	bswapl	%edx
	retl
[…]
__bswapdi2:				# @__bswapdi2
# %bb.0:
	movl	4(%esp), %edx
	movl	8(%esp), %eax
	bswapl	%eax
	bswapl	%edx
	retl
[…]
__cswapdi2:				# @__cswapdi2
# %bb.0:
	movl	4(%esp), %edx
	movl	8(%esp), %eax
	bswapl	%eax
	bswapl	%edx
	retl
[…]
__dswapdi2:				# @__dswapdi2
# %bb.0:
	movl	4(%esp), %edx
	movl	8(%esp), %eax
	bswapl	%eax
	bswapl	%edx
	retl
[…]
	.ident	"clang version 10.0.0 "
[…]
Note: the generated code is properly optimised!

Compile the source file case23.c a second time with Clang, now with the preprocessor macro OPTIMIZE defined, and display the generated assembly code:

clang -DOPTIMIZE -o- -O3 -S -target i386-pc-linux case23.c
[…]
__aswapsi2:				# @__aswapsi2
# %bb.0:
	movl	4(%esp), %eax
	movl	%eax, %ecx
	roll	$24, %eax
	roll	$8, %ecx
	andl	$0xFF00FF00, %eax
	andl	$0xFF00FF, %ecx
	orl	%ecx, %eax
	retl
[…]
__bswapsi2:				# @__bswapsi2
# %bb.0:
	movl	4(%esp), %ecx
	movl	%ecx, %edx
	movl	%ecx, %eax
	andl	$0xFF00, %edx
	andl	$0xFF0000, %eax
	shldl	$8, %ecx, %edx
	shrdl	$8, %ecx, %eax
	orl	%edx, %eax
	retl
[…]
__cswapsi2:				# @__cswapsi2
# %bb.0:
	movl	4(%esp), %eax
	bswapl	%eax
	retl
[…]
__dswapsi2:				# @__dswapsi2
# %bb.0:
	movl	4(%esp), %eax
	bswapl	%eax
	retl
[…]
__aswapdi2:				# @__aswapdi2
# %bb.0:
	pushl	%ebx
	pushl	%edi
	pushl	%esi
	movl	16(%esp), %esi
	movl	20(%esp), %eax
	movl	%esi, %edx
	movl	%esi, %ecx
	shldl	$24, %eax, %edx
	shldl	$8, %eax, %ecx
	movl	%edx, %ebx
	movzbl	%cl, %edi
	andl	$0xFF0000, %ecx
	andl	$0xFF000000, %edx
	andl	$0xFF00, %ebx
	orl	%edi, %ebx
	movl	%esi, %edi
	shrl	$24, %esi
	shrl	$8, %edi
	andl	$0xFF00, %edi
	orl	%edi, %esi
	movl	%eax, %edi
	shll	$24, %eax
	shll	$8, %edi
	orl	%esi, %ecx
	andl	$0xFF0000, %edi
	orl	%ecx, %edx
	orl	%ebx, %edi
	orl	%edi, %eax
	popl	%esi
	popl	%edi
	popl	%ebx
	retl
[…]
__bswapdi2:				# @__bswapdi2
# %bb.0:
	pushl	%ebx
	pushl	%edi
	pushl	%esi
	movl	16(%esp), %edx
	movl	20(%esp), %eax
	movl	%edx, %edi
	movl	%eax, %ebx
	movl	%eax, %ecx
	movl	%edx, %esi
	andl	$0xFF0000, %edi
	andl	$0xFF0000, %ebx
	shrl	$24, %ecx
	shrl	$24, %esi
	shrl	$8, %ebx
	shrl	$8, %edi
	orl	%ecx, %ebx
	movl	%eax, %ecx
	orl	%esi, %edi
	movl	%edx, %esi
	shll	$24, %edx
	shll	$24, %eax
	andl	$0xFF00, %ecx
	andl	$0xFF00, %esi
	shll	$8, %esi
	shll	$8, %ecx
	orl	%edi, %esi
	orl	%ebx, %ecx
	orl	%esi, %edx
	orl	%ecx, %eax
	popl	%esi
	popl	%edi
	popl	%ebx
	retl
[…]
__cswapdi2:				# @__cswapdi2
# %bb.0:
	movl	4(%esp), %edx
	movl	8(%esp), %eax
	bswapl	%eax
	bswapl	%edx
	retl
[…]
__dswapdi2:				# @__dswapdi2
# %bb.0:
	movl	4(%esp), %edx
	movl	8(%esp), %eax
	bswapl	%eax
	bswapl	%edx
	retl
[…]
	.ident	"clang version 10.0.0 "
[…]
OUCH: the braindead implementation of the code generator for __builtin_*() strikes again!

The optimiser fails – take 11

Create the text file case24.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

long long quotient(long long numerator, long long denominator, long long *modulus) {
    *modulus = numerator % denominator;
    return numerator / denominator;
}

long long modulus(long long numerator, long long denominator, long long *quotient) {
    *quotient = numerator / denominator;
    return numerator % denominator;
}
Compile the source file case24.c with Clang, engaging the optimiser, targeting the i386 platform, and display the generated assembly code:
clang -o- -O3 -S -target i386-pc-linux case24.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…]
quotient:				# @quotient
# %bb.0:
	pushl	%ebp			#	jmp	__divmoddi4
	pushl	%ebx			#
	pushl	%edi			#
	pushl	%esi			#
	subl	$12, %esp		#
	movl	44(%esp), %edi		#
	movl	32(%esp), %ebx		#
	movl	36(%esp), %ebp		#
	movl	40(%esp), %esi		#
	pushl	%edi			#
	pushl	%esi			#
	pushl	%ebp			#
	pushl	%ebx			#
	calll	__divdi3		#
	addl	$16, %esp		#
	movl	%eax, %ecx		#
	movl	%edx, 8(%esp)		# 4-byte Spill
	imull	%eax, %edi		#
	mull	%esi			#
	addl	%edi, %edx		#
	movl	8(%esp), %edi		# 4-byte Reload
	imull	%edi, %esi		#
	addl	%edx, %esi		#
	subl	%eax, %ebx		#
	movl	48(%esp), %eax		#
	movl	%edi, %edx		#
	sbbl	%esi, %ebp		#
	movl	%ebx, (%eax)		#
	movl	%ebp, 4(%eax)		#
	movl	%ecx, %eax		#
	addl	$12, %esp		#
	popl	%esi			#
	popl	%edi			#
	popl	%ebx			#
	popl	%ebp			#
	retl				#
[…]
modulus:				# @modulus
# %bb.0:
	pushl	%ebp			#
	pushl	%ebx			#
	pushl	%edi			#
	pushl	%esi			#
	subl	$12, %esp		#	sub	esp, 8
	movl	44(%esp), %ebx		#
	movl	32(%esp), %esi		#
	movl	36(%esp), %edi		#
	movl	40(%esp), %ebp		#
					#	push	esp
	pushl	%ebx			#	push	[esp+28]
	pushl	%ebp			#	push	[esp+28]
	pushl	%edi			#	push	[esp+28]
	pushl	%esi			#	push	[esp+28]
	calll	__divdi3		#	call	__divmoddi4
	addl	$16, %esp		#	add	esp, 20
	movl	%edx, %ecx		#
	movl	48(%esp), %edx		#	mov	ecx, [esp+28]
	imull	%eax, %ebx		#
	movl	%ecx, 4(%edx)		#	mov	[ecx], eax
	movl	%eax, (%edx)		#	mov	[ecx+4], edx
	mull	%ebp			#
	imull	%ebp, %ecx		#
	addl	%ebx, %edx		#
	addl	%edx, %ecx		#
	subl	%eax, %esi		#
	sbbl	%ecx, %edi		#
	movl	%esi, %eax		#
	movl	%edi, %edx		#
	addl	$12, %esp		#	pop	edx
					#	pop	eax
	popl	%esi			#
	popl	%edi			#
	popl	%ebx			#
	popl	%ebp			#
	retl				#	ret
[…]
	.ident	"clang version 10.0.0 "
[…]
OUCH: 35 superfluous instructions in 77 bytes for the function quotient(), wasting about 18 CPU clock cycles per function call, instead of only 1 (in words: one) instruction in just 5 bytes – this optimiser is a bad joke!

OOPS: 34 instructions in 74 bytes for the function modulus(), where just 14 instructions in only 40 bytes yield the same result, but without clobbering 4 registers, without 8 superfluous memory accesses, without 3 superfluous multiplications, without 5 superfluous additions or subtractions, and not wasting about 10 CPU clock cycles per function call.
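The point of __divmoddi4() can be sketched in portable C: a single division yields both quotient and remainder, so neither function needs the multiply-and-subtract dance shown above. The following is an illustrative stand-in with a hypothetical name, not compiler-rt's actual implementation:

```c
/* Illustrative stand-in for compiler-rt's __divmoddi4(): one division
   delivers quotient and remainder together, so quotient() above could
   be a plain tail call to it. */
long long divmod64(long long numerator, long long denominator,
                   long long *remainder) {
    long long q = numerator / denominator;     /* one machine division */
    *remainder = numerator - q * denominator;  /* remainder comes for free */
    return q;
}
```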

The optimiser fails – take 12

Create the text file case25.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

unsigned long long multiply64(unsigned int multiplicand, unsigned int multiplier) {
    unsigned long long product;
    __asm__("mull\t%[multiplier]"
           :"=A" (product)
           :"%0" (multiplicand),
            [multiplier] "rm" (multiplier)
           :"cc");
    return product;
}

#ifdef __amd64__
__uint128_t multiply128(unsigned long long multiplicand, unsigned long long multiplier) {
    __uint128_t product;
    __asm__("mulq\t%[multiplier]"
           :"=A" (product)
           :"%0" (multiplicand),
            [multiplier] "rm" (multiplier)
           :"cc");
    return product;
}
#endif
Compile the source file case25.c with Clang, engaging its optimiser, targeting the AMD64 platform, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-linux case25.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…]
multiply64:				# @multiply64
# %bb.0:
	movl	%edi, %eax		#	mov	eax, edi
	movl	%esi, -4(%rsp)		#
	#APP
	mull	-4(%rsp)		#	mul	esi
	#NO_APP
	retq				#	ret
[…]
multiply128:				# @multiply128
# %bb.0:
	movq	%rdi, %rax		#	mov	rax, rdi
	movq	%rsi, -8(%rsp)		#
	#APP
	mulq	-8(%rsp)		#	mul	rsi
	#NO_APP
	retq				#	ret
[…]
	.ident	"clang version 10.0.0 "
[…]
Ouch: despite the constraint rm (register or memory) given for the multiplier, it is written into the red zone on the stack and then read back from there.

Compile the source file case25.c a second time with Clang, now with the option -mno-red-zone, and display the generated assembly code:

clang -mno-red-zone -o- -O3 -S -target amd64-pc-linux case25.c
Note: the left column shows the generated code, while the right column shows the expected unoptimised code as comment.
[…]
multiply64:				# @multiply64
# %bb.0:
	subq	$4, %rsp		#
	movl	%edi, %eax		#	mov	eax, edi
	movl	%esi, (%rsp)		#	push	rsi
	#APP
	mull	(%rsp)			#	mul	dword ptr [rsp]
	#NO_APP
	addq	$4, %rsp		#	pop	rsi
	retq				#	ret
[…]
multiply128:				# @multiply128
# %bb.0:
	pushq	%rax			#
	movq	%rdi, %rax		#	mov	rax, rdi
	movq	%rsi, (%rsp)		#	push	rsi
	#APP
	mulq	(%rsp)			#	mul	qword ptr [rsp]
	#NO_APP
	popq	%rcx			#	pop	rsi
	retq				#	ret
[…]
	.ident	"clang version 10.0.0 "
[…]
Oops: instead of pushing register ESI, which holds the multiplier, onto the stack, the code generated for the multiply64() function stores it with a MOV instruction after decrementing the stack pointer with a separate SUB instruction, using 17 instead of 8 bytes.

Ouch: instead of pushing register RSI, which holds the multiplier, onto the stack, the code generated for the multiply128() function stores it with a MOV instruction after decrementing the stack pointer with a superfluous PUSH instruction, using 14 instead of 10 bytes.

The optimiser fails – take 13

Compile the source file case4gcc.c with Clang, engaging its optimiser, targeting the AMD64 platform, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-linux case4gcc.c
Note: the left column shows the generated code, while the right column shows only the replacement for properly optimised code as comment.
[…]
	bsrq	%r11, %rbx		#	bsrq	%r11, %rcx
	xorq	$63, %rbx		#	xorl	$63, %ecx
	je	.LBB1_11
# %bb.17:
	movq	%rbx, %rcx		#
	negq	%rcx			#
	movq	%rsi, %rdx		#	xorl	%edx, %edx
	shrq	%cl, %rdx		#	shldq	%cl, %rsi, %rdx
	movl	%ebx, %ecx		#
	shldq	%cl, %rdi, %rsi
	shlq	%cl, %rdi
	shldq	%cl, %r9, %r11
	shlq	%cl, %r9
	movq	%r11, -8(%rsp)		#
	movq	%rsi, %rax
	divq	-8(%rsp)		#	divq	%r11
	movq	%rdx, %rsi
	movq	%rax, %r10
	mulq	%r9
	cmpq	%rax, %rdi
	movq	%rsi, %rcx		#	movq	%rsi, %rbx
	sbbq	%rdx, %rcx		#	sbbq	%rdx, %rbx
	jae	.LBB1_19
# %bb.18:
	subq	%r9, %rax
	sbbq	%r11, %rdx
	addq	$-1, %r10
.LBB1_19:
	testq	%r8, %r8
	je	.LBB1_21
# %bb.20:
	movl	$64, %r9d		#
	subq	%rbx, %r9		#
	subq	%rax, %rdi
	sbbq	%rdx, %rsi
	movl	%ebx, %ecx		#
	shrq	%cl, %rdi		#	shrdq	%cl, %rdi, %rsi
	movq	%rsi, %rax		#
	movl	%r9d, %ecx		#
	shlq	%cl, %rax		#
	movl	%ebx, %ecx		#
	shrq	%cl, %rsi		#	shrq	%cl, %rdi
	orq	%rdi, %rax		#
	movq	%rsi, 8(%r8)
	movq	%rax, (%r8)		#	movq	%rdi, (%r8)
	jmp	.LBB1_21
[…]
	.ident	"clang version 10.0.0 "
[…]
Oops: while the optimiser (partially) recognises the common and well-known idiom for multi-word left shifts and generates the code following line # %bb.17:, it fails to recognise the same idiom for the multi-word right shift and generates the code following line # %bb.20:, using 12 (in words: twelve) superfluous instructions in a total of 42 instructions for the 2 hot basic blocks shown!
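Written out in C for a 128-bit value split into two 64-bit halves, the missed idiom looks as follows (a sketch with a hypothetical helper name, assuming 0 < count < 64 as established by the surrounding code):

```c
/* Double-word right shift for 0 < count < 64: the low half is the
   SHRD pattern, the high half a plain SHR -- the mirror image of the
   left-shift idiom the optimiser does recognise. */
void shr128(unsigned long long *lo, unsigned long long *hi, unsigned count) {
    *lo = (*lo >> count) | (*hi << (64 - count));  /* shrd lo, hi, cl */
    *hi >>= count;                                 /* shr  hi, cl */
}
```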

Compile the source file case4gcc.c a second time with Clang, now with the preprocessor macro OPTIMIZE defined, targeting the AMD64 platform, and display the generated assembly code:

clang -DOPTIMIZE -o- -O3 -S -target amd64-pc-linux case4gcc.c
[…]
.LBB1_7:
	bsrq	%r10, %rbx
	xorq	$63, %rbx
	je	.LBB1_13
# %bb.8:
	movl	%ebx, %ecx
	negb	%cl
	movq	%rsi, %rdx
	shrq	%cl, %rdx
	movl	%ebx, %ecx
	shldq	%cl, %rdi, %rsi
	shlq	%cl, %rdi
	shldq	%cl, %r9, %r10
	shlq	%cl, %r9
	xorl	%r11d, %r11d
	testb	$64, %bl
	cmovneq	%rdi, %rsi
	cmovneq	%r11, %rdi
	cmovneq	%r9, %r10
	cmovneq	%r11, %r9
	movq	%r10, -8(%rsp)
	movq	%rsi, %rax
	divq	-8(%rsp)
	movq	%rax, %r14
	movq	%rdx, %rcx
	movq	%r9, %rax
	andq	$-2, %rax
	mulq	%r14
	andq	$-2, %rdi
	cmpq	%rax, %rdi
	movq	%rcx, %rsi
	sbbq	%rdx, %rsi
	sbbq	$0, %r14
	testq	%r8, %r8
	je	.LBB1_19
# %bb.9:
	xorl	%r11d, %r11d
	subq	%rax, %rdi
	sbbq	%rdx, %rcx
	cmovaeq	%r11, %r10
	cmovaeq	%r11, %r9
	addq	%rdi, %r9
	adcq	%rcx, %r10
	movl	%ebx, %ecx
	shrdq	%cl, %r10, %r9
	shrq	%cl, %r10
	testb	$64, %bl
	cmovneq	%r10, %r9
	cmovneq	%r11, %r10
	movq	%r10, 8(%r8)
	movq	%r9, (%r8)
	jmp	.LBB1_19
[…]
	.ident	"clang version 10.0.0 "
[…]
Ouch: although the shift count is greater than 0 and can’t exceed 63 (which an optimiser worthy of its name could prove from the program flow), this optimiser generates 2 superfluous TEST instructions which always set the zero flag ZF, followed by 6 superfluous CMOVNZ instructions, i.e. NOPs, resulting in a total of 49 instead of the former 42 instructions.

Copyright © 1995–2021 • Stefan Kanthak • <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>