
Me, myself & IT

True Lies – or What LLVM Claims, but Fails to Deliver

Purpose
Reason
Example 1
Example 2
Violation of the Microsoft calling convention
Poor to abysmal performance of 128÷128-bit integer division
Bloated code and poor performance of 64÷64-bit integer division
Counterexample for the 3 preceding bugs
Unoptimised code of 64÷64-bit integer division functions
Miserable performance of 128÷128-bit (and 64÷64-bit) integer division
Runtime measurement on AMD Ryzen 5 3600
Runtime measurement on AMD EPYC 7262
Runtime measurement on Intel Core i5-6600
Careless (or clueless?) wasting of resources on Microsoft Windows
Which cross-compiler?
Misleading message box from installer on Microsoft Windows
Inconsistent undefined behaviour
Impaired performance of one of the most trivial functions
Invalid machine code generated for __bswapsi2(), __bswapdi2() and __builtin_bswap*()
Bad, insufficient or no (peephole) optimisation at all in various library functions
__absvti2()
__ashldi3(), __ashrdi3() and __lshrdi3()
__cmp?i2() and __ucmp?i2()
__mulo?i4()
__parity?i2()
Braindead implementation of code generator for __builtin_*()
__builtin_parity()
__builtin_rotateleft*() and __builtin_rotateright*()
__builtin_mul_overflow()
__builtin_copysign()
Utterly devastating performance of overflow-checking 128×128-bit (and 64×64-bit) integer multiplication
Runtime measurement on AMD EPYC 7262
Which optimiser?
The optimiser fails – take 2
The optimiser fails – take 3
The optimiser fails – take 4
The optimiser fails – take 5
The optimiser fails – take 6
The optimiser fails – take 7
The optimiser fails – take 8
The optimiser fails – take 9
The optimiser fails – take 10
The optimiser fails – take 11
The optimiser fails – take 12
The optimiser fails – take 13

Purpose

Demonstrate multiple bugs, defects, deficiencies and flaws of LLVM’s Clang compiler: from bloated, unoptimised or outright wrong generated code, through the poor, miserable, disastrous and abysmal performance of the integer division and overflow-checking integer multiplication routines of its compiler-rt runtime libraries, to the careless waste of resources on Microsoft® Windows®.

Reason

For their compiler-rt runtime libraries, LLVM boasts
  1. The compiler-rt project provides highly tuned implementations of the low-level code generator support routines like "__fixunsdfdi" and other calls generated when a target doesn't have a short sequence of native instructions to implement a core IR operation.
and brags
The builtins library provides optimized implementations of this and other low-level routines, either in target-independent C form, or as a heavily-optimized assembly.
but dares to ship unoptimised and untuned crap like the library functions __cmpdi2(), __cmpti2(), __ucmpdi2(), __ucmpti2(), __udivmoddi4(), __udivmodti4(), __mulodi4(), __muloti4() or __popcountti2(), whose pessimised code even keeps their own Clang compiler and its optimiser from generating proper and performant machine code suitable for current super-scalar processors!

Note: for the synopsis of the library’s contents see README.txt.

As proved below, especially in the sections miserable performance of integer division, unoptimised library functions and utterly devastating performance of overflow-checking 128×128-bit (and 64×64-bit) integer multiplication, the statements cited above are blatant lies!

Note: I recommend reading Randall Hyde’s ACM article, especially the following part:

Observation #6: Software engineers have been led to believe that their time is more valuable than CPU time; therefore, wasting CPU cycles in order to reduce development time is always a win. They've forgotten, however, that the application users' time is more valuable than their time.

Example 1

Literally bad-smelling code from the 80’s, taken almost verbatim from GCC, ugly and defeating proper optimisation:
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
[…]
// This file implements __ucmpdi2 for the compiler_rt library.
[…]
// Returns:  if (a <  b) returns 0
//           if (a == b) returns 1
//           if (a >  b) returns 2
[…]
int __ucmpdi2(uint64_t a, uint64_t b) {
    union {
        uint64_t all;
        struct {
            uint32_t low, high;
        } s;
    } x, y;
    x.all = a;
    y.all = b;
    if (x.s.high < y.s.high)
        return 0;
    if (x.s.high > y.s.high)
        return 2;
    if (x.s.low < y.s.low)
        return 0;
    if (x.s.low > y.s.low)
        return 2;
    return 1;
}

#ifdef __ARM_EABI__
// Returns:  if (a <  b) returns -1
//           if (a == b) returns  0
//           if (a >  b) returns  1
int __aeabi_ulcmp(int64_t a, int64_t b) {
    return __ucmpdi2(a, b) - 1;
}
#endif
Note: spotting the most obvious bug is left as an exercise to the reader.

Properly written code which lets the optimiser do its job follows:

// Copyleft © 2004-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

typedef unsigned long long uint64_t;

int __ucmpdi2(uint64_t a, uint64_t b) {
    return (a > b) - (a < b) + 1;
}

#ifdef __ARM_EABI__
int __aeabi_ulcmp(uint64_t a, uint64_t b) {
    return (a > b) - (a < b);
}
#endif

Example 2

Awkward construction of magic 128-bit constants:
//===-- popcountti2.c - Implement __popcountti2
[…]
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
[…]
int __popcountti2(__int128_t a) {
  __uint128_t x3 = (__uint128_t)a;
  x3 = x3 - ((x3 >> 1) &
             (((__uint128_t)0x5555555555555555uLL << 64) | 0x5555555555555555uLL));
  // Every 2 bits holds the sum of every pair of bits (64)
  x3 = ((x3 >> 2) &
        (((__uint128_t)0x3333333333333333uLL << 64) | 0x3333333333333333uLL)) +
       (x3 & (((__uint128_t)0x3333333333333333uLL << 64) | 0x3333333333333333uLL));
  // Every 4 bits holds the sum of every 4-set of bits (3 significant bits) (32)
  x3 = (x3 + (x3 >> 4)) &
       (((__uint128_t)0x0F0F0F0F0F0F0F0FuLL << 64) | 0x0F0F0F0F0F0F0F0FuLL);
[…]
Properly written code looks as follows:
// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

int __popcountti2(__uint128_t value) {
    value -= (value >> 1) & (~(__uint128_t) 0 / 3); // 0x55...55
    value = ((value >> 2) & (~(__uint128_t) 0 / 5)) // 0x33...33
          + (value & (~(__uint128_t) 0 / 5));
    value += value >> 4;
    value &= ~(__uint128_t) 0 / 17;                 // 0x0F...0F
    value *= ~(__uint128_t) 0 / 255;                // 0x01...01
    return value >> 120;
}

Violation of the Microsoft calling convention

The MSDN articles Overview of x64 Calling Conventions and Parameter Passing specify:
There's a strict one-to-one correspondence between a function call's arguments and the registers used for those arguments. Any argument that doesn't fit in 8 bytes, or isn't 1, 2, 4, or 8 bytes, must be passed by reference. A single argument is never spread across multiple registers.

[…] 16-byte arguments are passed by reference. […]

To return a user-defined type by value in RAX, it must have a length of 1, 2, 4, 8, 16, 32, or 64 bits. […] Otherwise, the caller must allocate memory for the return value and pass a pointer to it as the first argument. The remaining arguments are then shifted one argument to the right. The same pointer must be returned by the callee in RAX.

The LLVM Project Blog post Clang is now used to build Chrome for Windows states:
Clang is the first-ever open-source C++ compiler that’s ABI-compatible with Microsoft Visual C++ (MSVC) – meaning you can build some parts of your program (for example, system libraries) with the MSVC compiler (“cl.exe”), other parts with Clang, and when linked together (either by MSVC’s linker, “link.exe“, or LLD, the LLVM project’s linker – see below) the parts will form a working program.
To falsify LLVM’s claim, create the text file case0.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifndef __clang__
typedef struct {
    unsigned __int64 low, high;
} __uint128_t;
#else
__attribute__ ((ms_abi))
#endif
__uint128_t __ubugti4(__uint128_t dividend, __uint128_t divisor, __uint128_t *remainder) {
    if (remainder != 0)
#ifndef ALTERNATE
        *remainder = divisor;
    return dividend;
#else
        *remainder = dividend % divisor;
    return dividend / divisor;
#endif
}
Note: the actual definition or name of the 128-bit integer user-defined aggregate data type __uint128_t for the Visual C compiler is irrelevant; for binary compatibility, defined by the respective Application Binary Interface (ABI), only its matching size and memory layout matter!

Compile the source file case0.c with Clang, engaging its optimiser, and display the generated assembly code:

clang -o- -O3 -S -target amd64-pc-windows case0.c
[…]
__ubugti4:				# @__ubugti4
# %bb.0:
	movaps	(%rcx), %xmm0
	testq	%r8, %r8
	je	.LBB0_2
# %bb.1:
	movaps	(%rdx), %xmm1
	movaps	%xmm1, (%r8)
.LBB0_2:
	retq
[…]
OOPS: the code generated by Clang expects the addresses of its three (128-bit) arguments in the registers RCX, RDX and R8 instead of RDX, R8 and R9, and it places the (128-bit) return value in the SSE register XMM0 instead of the memory pointed to by RCX, without returning this pointer in RAX, thus violating the Microsoft ABI twice!

Compile the source file case0.c a second time with Clang, now without SSE support, and display the generated assembly code:

clang -mno-sse -o- -O3 -S -target amd64-pc-windows case0.c
[…]
__ubugti4:				# @__ubugti4
# %bb.0:
	movq	%rdx, %r9
	movq	(%rcx), %rax
	movq	8(%rcx), %rdx
	testq	%r8, %r8
	je	.LBB0_2
# %bb.1:
	movq	(%r9), %r10
	movq	8(%r9), %rcx
	movq	%rcx, 8(%r8)
	movq	%r10, (%r8)
.LBB0_2:
	retq
[…]
OUCH: the code generated by Clang still expects the addresses of its three (128-bit) arguments in the registers RCX, RDX and R8 instead of RDX, R8 and R9, and it places the (128-bit) return value in the register pair RDX:RAX instead of the memory pointed to by RCX, without returning this pointer in RAX, again violating the Microsoft ABI twice!

Now compile the source file case0.c with the reference compiler for the Microsoft ABI, Visual C, and display the generated assembly code:

CL.EXE /c /FaCON: /FoNUL: /nologo /Ox /W4 case0.c
case0.c
; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01 
[…]
__ubugti4 PROC
; Line 11
	test	r9, r9
	je	SHORT $LN1@ubugti4
; Line 13
	mov	rax, QWORD PTR [r8]
	mov	QWORD PTR [r9], rax
	mov	rax, QWORD PTR [r8+8]
	mov	QWORD PTR [r9+8], rax
$LN1@ubugti4:
; Line 14
	mov	rax, QWORD PTR [rdx]
	mov	QWORD PTR [rcx], rax
	mov	rax, QWORD PTR [rdx+8]
	mov	QWORD PTR [rcx+8], rax
	mov	rax, rcx
; Line 19
	ret	0
__ubugti4 ENDP
[…]
This code complies (of course) with the Microsoft ABI.

Poor to abysmal performance of 128÷128-bit integer division

Compile the source file case0.c a third time with Clang, now with the preprocessor macro ALTERNATE defined, and display the generated assembly code:
clang -DALTERNATE -o- -O3 -S -target amd64-pc-windows case0.c
[…]
__ubugti4:				# @__ubugti4
[…]
	testq	%r8, %r8
	je	.LBB0_2
# %bb.1:
	movq	%r8, %r15
[…]
	leaq	80(%rsp), %rcx
	leaq	64(%rsp), %rdx
	callq	__umodti3
	movaps	%xmm0, (%r15)
.LBB0_2:
[…]
	leaq	48(%rsp), %rcx
	leaq	32(%rsp), %rdx
	callq	__udivti3
	nop
[…]
Oops: instead of generating a single call of the library function __udivmodti4(), which returns both quotient and remainder, even with the command line option -O3 specified Clang generates separate calls of the library functions __umodti3() and __udivti3() – which is especially funny, since both call the library function __udivmodti4() in turn!

Note: as demonstrated below, this code runs about 3 to 34 (in words: thirty-four) times slower than code generated by GCC!

Additionally count the number of instructions and the code size of the __udivmodti4() function provided in the clang_rt.builtins-x86_64.lib library: it shows 234 instructions in 765 bytes, while properly written (assembly) code shows only 71 instructions in just 225 bytes – and runs up to 17 (in words: seventeen) times faster!

Note: exploration of the equally entertaining call chain for signed 128÷128-bit integer division is left as an exercise to the reader.

Bloated code and poor performance of 64÷64-bit integer division

Create the text file case2.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifndef ALTERNATE
typedef struct {
   __int64 quotient, remainder;
} lldiv_t;

__attribute__ ((ms_abi))
lldiv_t lldiv(__int64 dividend, __int64 divisor) {
    lldiv_t lldiv = {dividend / divisor, dividend % divisor};
    return lldiv;
}
#else
unsigned long __udivmoddi4(unsigned long dividend, unsigned long divisor, unsigned long *remainder) {
    if (remainder != 0)
        *remainder = dividend % divisor;
    return dividend / divisor;
}
#endif // ALTERNATE
Compile the source file case2.c with Clang, engaging its optimiser, targeting the i386 platform, and display the generated assembly code:
clang -o- -O3 -S -target i386-pc-windows case2.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…]
_lldiv:					# @lldiv
# %bb.0:
	pushl	%ebp			#	push	ebp
	pushl	%ebx			#	mov	ebp, [esp+8]
	pushl	%edi			#	push	ebx
	pushl	%esi			#
	movl	36(%esp), %esi		#
	movl	20(%esp), %ebp		#
	movl	24(%esp), %edi		#
	movl	28(%esp), %ebx		#
	pushl	%esi			#	push	[esp+28]
	pushl	36(%esp)		#	push	[esp+28]
	pushl	%ebx			#	push	[esp+28]
	pushl	%edi			#	push	[esp+28]
	calll	__alldiv		#	call	__alldvrm
	movl	%edx, %ecx		#
	movl	%eax, (%ebp)		#	mov	[ebp], eax
	imull	%eax, %esi		#
	mull	32(%esp)		#
	movl	%ecx, 4(%ebp)		#	mov	[ebp+4], edx
	imull	32(%esp), %ecx		#
	addl	%esi, %edx		#
	addl	%edx, %ecx		#
	subl	%eax, %edi		#
	movl	%ebp, %eax		#
	sbbl	%ecx, %ebx		#
	movl	%edi, 8(%ebp)		#	mov	[ebp+8], ecx
	movl	%ebx, 12(%ebp)		#	mov	[ebp+12], ebx
	popl	%esi			#
	popl	%edi			#	pop	ebx
	popl	%ebx			#	mov	eax, ebp
	popl	%ebp			#	pop	ebp
	retl				#	ret
[…]
Ouch: instead of generating a single call of the compiler helper function __alldvrm(), which returns both quotient and remainder in the register pairs EDX:EAX and EBX:ECX, even with the command line option -O3 specified Clang generates code to compute the remainder itself, wasting 15 of the total 31 instructions, 34 of the total 74 bytes, and about 9 CPU clock cycles per function call!

Note: the code of the __divmoddi4() function provided in the clang_rt.builtins-i386.lib library is equally bad:

# Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

	.arch	generic32
	.code32
	.intel_syntax noprefix
	.global	___divmoddi4
	.type	___divmoddi4, @function
	.text

___divmoddi4:
	push	ebp
	push	ebx
	push	edi
	push	esi
	push	eax
	mov	edi, [esp+36]
	mov	ebx, [esp+24]
	mov	ebp, [esp+28]
	mov	esi, [esp+32]
	push	edi
	push	esi
	push	ebp
	push	ebx
	call	___divdi3
	add	esp, 16
	mov	ecx, eax
	mov	[esp], edx
	imul	edi, eax
	mul	eax, esi
	add	edx, edi
	mov	edi, [esp]
	imul	esi, edi
	add	esi, edx
	sub	ebx, eax
	mov	eax, [esp+40]
	mov	edx, edi
	sbb	ebp, esi
	mov	[eax], ebx
	mov	[eax+4], ebp
	mov	eax, ecx
	add	esp, 4
	pop	esi
	pop	edi
	pop	ebx
	pop	ebp
	ret

	.ident	"clang version 10.0.0 "
	.end
Additionally count the number of instructions and the code size of the __udivmoddi4() function provided in the clang_rt.builtins-i386.lib library: it shows 256 instructions in 759 bytes, while properly written (assembly) code shows only 68 instructions in just 166 bytes – and runs up to 11 (in words: eleven) times faster!

Compile the source file case2.c a second time with Clang, now with the preprocessor macro ALTERNATE defined, targeting the AMD64 platform, and display the generated assembly code:

clang -DALTERNATE -o- -O3 -S -target amd64-pc-linux case2.c
[…]
__udivmoddi4:				# @__udivmoddi4
# %bb.0:
	testq	%rdx, %rdx
	je	.LBB0_2
# %bb.1:
	movq	%rdx, %rcx
	movq	%rdi, %rax
	xorl	%edx, %edx
	divq	%rsi
	movq	%rdx, (%rcx)
.LBB0_2:
	movq	%rdi, %rax
	xorl	%edx, %edx
	divq	%rsi
	retq
[…]
	.ident	"clang version 10.0.0 "
[…]
Oops: despite the command line option -O3, the generated code performs two expensive divisions instead of just one!

Counterexample for the 3 preceding bugs

Compile the source file case2.c a third time with Clang, now without the preprocessor macro ALTERNATE defined and targeting the AMD64 platform, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-windows case2.c
[…]
lldiv:					# @lldiv
# %bb.0:
	movq	%rdx, %rax
	cqto
	idivq	%r8
	movq	%rax, (%rcx)
	movq	%rdx, 8(%rcx)
	movq	%rcx, %rax
	retq
[…]

Unoptimised code of 64÷64-bit integer division functions

Create the text file case3.c with the following source code, refactored from __divdi3() and __moddi3(), in an arbitrary, preferably empty directory:
// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

unsigned long long __udivmoddi4(unsigned long long dividend,
                                unsigned long long divisor,
                                unsigned long long *remainder);
#ifndef ALTERNATE
long long __divdi3(long long dividend, long long divisor) {
    long long r = divisor >> 63;    // r = divisor < 0 ? -1 : 0
    long long s = dividend >> 63;   // s = dividend < 0 ? -1 : 0
    divisor = (divisor ^ r) - r;    // negate if divisor < 0
    dividend = (dividend ^ s) - s;  // negate if dividend < 0
    s ^= r;                         // sign of quotient
                                    // negate if quotient < 0
    return (__udivmoddi4(dividend, divisor, 0) ^ s) - s;
}
#else
long long __moddi3(long long dividend, long long divisor) {
    long long r = divisor >> 63;    // r = divisor < 0 ? -1 : 0
    long long s = dividend >> 63;   // s = dividend < 0 ? -1 : 0
    divisor = (divisor ^ r) - r;    // negate if divisor < 0
    dividend = (dividend ^ s) - s;  // negate if dividend < 0
    __udivmoddi4(dividend, divisor, (unsigned long long *) &r);
    return (r ^ s) - s;             // negate if dividend < 0
}
#endif // ALTERNATE
Compile the source file case3.c with Clang, engaging its optimiser, targeting the i386 platform, and display the generated, not properly optimised assembly code:
clang -o- -O3 -S -target i386-pc-windows case3.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…]
___divdi3:				# @__divdi3
# %bb.0:
	pushl	%ebp			#
	movl	%esp, %ebp		#
	pushl	%ebx			#	push	ebx
	pushl	%edi			#
	pushl	%esi			#
	movl	20(%ebp), %ecx		#	mov	eax, [esp+20]
	movl	12(%ebp), %eax		#
	movl	16(%ebp), %edi		#	mov	ebx, [esp+16]
	movl	8(%ebp), %ebx		#
	movl	%ecx, %edx		#
	movl	%eax, %esi		#
	sarl	$31, %edx		#
	sarl	$31, %esi		#	cdq
	xorl	%edx, %edi		#	xor	ebx, edx
	xorl	%edx, %ecx		#	xor	eax, edx
	subl	%edx, %edi		#	sub	ebx, edx
	sbbl	%edx, %ecx		#	sbb	eax, edx
	xorl	%esi, %ebx		#
	xorl	%esi, %eax		#
	subl	%esi, %ebx		#
	sbbl	%esi, %eax		#
	xorl	%edx, %esi		#	mov	ebx, edx
	pushl	$0			#	push	0
	pushl	%ecx			#	push	eax
	pushl	%edi			#	push	ecx
					#	mov	eax, [esp+24]
					#	mov	ecx, [esp+20]
					#	cdq
					#	xor	ecx, edx
					#	xor	eax, edx
					#	sub	ecx, edx
					#	sbb	eax, edx
					#	xor	ebx, edx
	pushl	%eax			#	push	eax
	pushl	%ebx			#	push	ecx
	calll	___udivmoddi4		#	call	___udivmoddi4
	addl	$20, %esp		#	add	esp, 20
	xorl	%esi, %eax		#	xor	eax, ebx
	xorl	%esi, %edx		#	xor	edx, ebx
	subl	%esi, %eax		#	sub	eax, ebx
	sbbl	%esi, %edx		#	sbb	edx, ebx
	popl	%esi			#
	popl	%edi			#
	popl	%ebx			#	pop	ebx
	popl	%ebp			#
	retl				#	ret
[…]
Oops: despite the command line option -O3, the generated code shows 38 instructions and clobbers the registers EBP, EDI and ESI without necessity; the properly optimised code shows just 30 instructions and avoids 6 superfluous memory accesses.

Compile the source file case3.c a second time with Clang, now with the preprocessor macro ALTERNATE defined, again targeting the i386 platform, and display the generated assembly code:

clang -DALTERNATE -o- -O3 -S -target i386-pc-windows case3.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…]
___moddi3:				# @__moddi3
# %bb.0:
	pushl	%ebp			#
	movl	%esp, %ebp		#
	pushl	%ebx			#	push	ebx
	pushl	%edi			#
	pushl	%esi			#
	andl	$-8, %esp		#
	subl	$16, %esp		#	sub	esp, 8
	movl	20(%ebp), %ecx		#	mov	eax, [esp+28]
	movl	12(%ebp), %eax		#
	movl	16(%ebp), %edi		#	mov	ecx, [esp+24]
	movl	%esp, %ebx		#	push	esp
	movl	%ecx, %edx		#
	movl	%eax, %esi		#
	sarl	$31, %edx		#	cdq
	sarl	$31, %esi		#
	xorl	%edx, %edi		#	xor	ecx, edx
	xorl	%edx, %ecx		#	xor	eax, edx
	movl	%edx, 4(%esp)		#
	movl	%edx, (%esp)		#
	subl	%edx, %edi		#	sub	ecx, edx
	sbbl	%edx, %ecx		#	sbb	eax, edx
					#	push	eax
					#	push	ecx
					#	mov	eax, [esp+32]
	movl	8(%ebp), %edx		#	mov	ecx, [esp+28]
					#	cdq
					#	mov	ebx, edx
	xorl	%esi, %eax		#	xor	ecx, edx
	xorl	%esi, %edx		#	xor	eax, edx
	subl	%esi, %edx		#	sub	ecx, edx
	sbbl	%esi, %eax		#	sbb	eax, edx
	pushl	%ebx			#
	pushl	%ecx			#
	pushl	%edi			#
	pushl	%eax			#	push	eax
	pushl	%edx			#	push	ecx
	calll	___udivmoddi4		#	call	___udivmoddi4
	addl	$20, %esp		#	add	esp, 20
	movl	(%esp), %eax		#	pop	eax
	movl	4(%esp), %edx		#	pop	edx
	xorl	%esi, %eax		#	xor	eax, ebx
	xorl	%esi, %edx		#	xor	edx, ebx
	subl	%esi, %eax		#	sub	eax, ebx
	sbbl	%esi, %edx		#	sbb	edx, ebx
	leal	-12(%ebp), %esp		#
	popl	%esi			#
	popl	%edi			#
	popl	%ebx			#	pop	ebx
	popl	%ebp			#
	retl				#	ret
[…]
OUCH: despite the command line option -O3, the generated code shows 45 instructions and clobbers the registers EBP, EDI and ESI without necessity; the properly optimised code shows just 32 instructions and avoids 8 superfluous memory accesses.

Now compare the code shown above with the code from the clang_rt.builtins-i386.lib library, which is even worse and shows 51 instructions:

# Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

	.arch	generic32
	.code32
	.intel_syntax noprefix
	.global	___moddi3
	.type	___moddi3, @function
	.text

___moddi3:
	push	ebp
	mov	ebp, esp
	push	ebx
	push	edi
	push	esi
	and	esp, -8
	sub	esp, 16
	mov	edx, [___security_cookie]
	mov	ecx, [ebp+20]
	mov	esi, [ebp+16]
	mov	eax, [ebp+12]
	mov	edi, esp
	xor	edx, ebp
	mov	ebx, eax
	mov	[esp+8], edx
	mov	edx, ecx
	sar	edx, 31
	add	esi, edx
	adc	ecx, edx
	xor	esi, edx
	sar	ebx, 31
	xor	ecx, edx
	mov	edx, [ebp+8]
	xor	eax, ebx
	xor	edx, ebx
	sub	edx, ebx
	sbb	eax, ebx
	push	edi
	push	ecx
	push	esi
	push	eax
	push	edx
	call	___udivmoddi4
	add	esp, 20
	mov	edi, [esp]
	mov	esi, [esp+4]
	mov	ecx, [esp+8]
	xor	edi, ebx
	xor	esi, ebx
	sub	edi, ebx
	sbb	esi, ebx
	xor	ecx, ebp
	call	@__security_check_cookie@4
	mov	eax, edi
	mov	edx, esi
	lea	esp, [ebp-12]
	pop	esi
	pop	edi
	pop	ebx
	pop	ebp
	ret

	.ident	"clang version 10.0.0 "
	.end

Miserable performance of 128÷128-bit (and 64÷64-bit) integer division

Create the text file case4llvm.c with the following source code, refactored from __udivmodti4(), in an arbitrary, preferably empty directory:
// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

// Effects: if modulus != 0, *modulus = numerator % denominator
// Returns: numerator / denominator

// Translated from Figure 3-40 of The PowerPC Compiler Writer's Guide

__uint128_t __udivmodti4(__uint128_t numerator, __uint128_t denominator, __uint128_t *modulus) {
    union {
        __uint128_t all;
        struct {
            unsigned long long low, high;
        };
    } dividend, divisor, quotient, remainder;
    dividend.all = numerator;
    divisor.all = denominator;
    unsigned carry, shift;
    // special cases, X is unknown, K != 0
    if (dividend.high == 0) {
        if (divisor.high == 0) {
            // 0 X
            // ---
            // 0 X
            if (modulus)
                *modulus = dividend.low % divisor.low;
            return dividend.low / divisor.low;
        }
        // 0 X
        // ---
        // K X
        if (modulus)
            *modulus = dividend.low;
        return 0;
    }
    // dividend.high != 0
    if (divisor.low == 0) {
        if (divisor.high == 0) {
            // K X
            // ---
            // 0 0
            if (modulus)
                *modulus = dividend.high % divisor.low;
            return dividend.high / divisor.low;
        }
        // divisor.high != 0
        if (dividend.low == 0) {
            // K 0
            // ---
            // K 0
            if (modulus) {
                remainder.high = dividend.high % divisor.high;
                remainder.low = 0;
                *modulus = remainder.all;
            }
            return dividend.high / divisor.high;
        }
        // K K
        // ---
        // K 0
        if ((divisor.high & (divisor.high - 1)) == 0) {
            // divisor is a power of 2
            if (modulus) {
                remainder.low = dividend.low;
                remainder.high = dividend.high & (divisor.high - 1);
                *modulus = remainder.all;
            }
            return dividend.high >> __builtin_ctzll(divisor.high);
        }
        // K K
        // ---
        // K 0
        shift = __builtin_clzll(divisor.high)
              - __builtin_clzll(dividend.high);
        // 0 <= shift < 63 or shift large
        if (shift > 62) {
            if (modulus)
                *modulus = dividend.all;
            return 0;
        }
        ++shift;
        // 0 < shift < 64
#ifdef OPTIMIZE
        remainder.all = dividend.all >> shift;
#else
        remainder.high = dividend.high >> shift;
        remainder.low = (dividend.low >> shift)
                      | (dividend.high << (64 - shift));
#endif
        quotient.high = dividend.low << (64 - shift);
        quotient.low = 0;
    } else {
        // divisor.low != 0
        if (divisor.high == 0) {
            // K X
            // ---
            // 0 K
            if ((divisor.low & (divisor.low - 1)) == 0) {
                // divisor is a power of 2
                if (modulus)
                    *modulus = dividend.low & (divisor.low - 1);
                if (divisor.low == 1)
                    return dividend.all;
                shift = __builtin_ctzll(divisor.low);
#ifdef OPTIMIZE
                quotient.all = dividend.all >> shift;
#else
                quotient.high = dividend.high >> shift;
                quotient.low = (dividend.low >> shift)
                             | (dividend.high << (64 - shift));
#endif
                return quotient.all;
            }
            // K X
            // ---
            // 0 K
            shift = __builtin_clzll(divisor.low)
                  - __builtin_clzll(dividend.high)
                  + 65;
            // 1 < shift < 128
#ifdef OPTIMIZE
            remainder.all = dividend.all >> shift;
            quotient.all = dividend.all << (128 - shift);
#else
            if (shift == 64) {
                remainder.high = 0;
                remainder.low = dividend.high;
                quotient.high = dividend.low;
                quotient.low = 0;
            } else if (shift < 64) {
                // 1 < shift < 64
                remainder.high = dividend.high >> shift;
                remainder.low = (dividend.low >> shift)
                              | (dividend.high << (64 - shift));
                quotient.high = dividend.low << (64 - shift);
                quotient.low = 0;
            } else {
                // 64 < shift < 128
                remainder.high = 0;
                remainder.low = dividend.high >> (shift - 64);
                quotient.high = (dividend.low >> (shift - 64))
                              | (dividend.high << (128 - shift));
                quotient.low = dividend.low << (128 - shift);
            }
#endif
        } else {
            // K X
            // ---
            // K K
            shift = __builtin_clzll(divisor.high)
                  - __builtin_clzll(dividend.high);
            // 0 <= shift < 64 or shift large
            if (shift > 63) {
                if (modulus)
                    *modulus = dividend.all;
                return 0;
            }
            ++shift;
            // 0 < shift <= 64
#ifdef OPTIMIZE
            remainder.all = dividend.all >> shift;
            quotient.high = dividend.low << (64 - shift);
#else
            if (shift == 64) {
                remainder.high = 0;
                remainder.low = dividend.high;
                quotient.high = dividend.low;
            } else {
                remainder.high = dividend.high >> shift;
                remainder.low = (dividend.low >> shift)
                              | (dividend.high << (64 - shift));
                quotient.high = dividend.low << (64 - shift);
            }
#endif
            quotient.low = 0;
        }
    }
    // Not a special case
    // q and r are initialized with:
    // remainder.all = dividend.all >> shift;
    // quotient.all = dividend.all << (128 - shift);
    // 0 < shift < 128
    for (carry = 0; shift > 0; --shift) {
        // remainder:quotient = ((remainder:quotient) << 1) | carry
#ifdef OPTIMIZE
        remainder.all = (remainder.all << 1) | (quotient.all >> 127);
        quotient.all = (quotient.all << 1) | carry;
#else
        remainder.high = (remainder.high << 1) | (remainder.low >> 63);
        remainder.low = (remainder.low << 1) | (quotient.high >> 63);
        quotient.high = (quotient.high << 1) | (quotient.low >> 63);
        quotient.low = (quotient.low << 1) | carry;
#endif
#if 0
        carry = 0;
        if (remainder.all < divisor.all)
            continue;

        carry = 1;
        remainder.all -= divisor.all;
#elif 0
        carry = remainder.all >= divisor.all;
        if (carry != 0)
            remainder.all -= divisor.all;
#else
        const __int128_t sign = (__int128_t) (divisor.all - remainder.all - 1) >> 127;
        carry = sign & 1;
        remainder.all -= divisor.all & sign;
#endif
    }
#ifdef OPTIMIZE
    quotient.all = (quotient.all << 1) | carry;
#else
    quotient.high = (quotient.high << 1) | (quotient.low >> 63);
    quotient.low = (quotient.low << 1) | carry;
#endif
    if (modulus)
        *modulus = remainder.all;
    return quotient.all;
}
Note: with the preprocessor macro OPTIMIZE defined, this function gains about 10 % to 25 % performance when compiled with Clang, but loses about the same amount when compiled with GCC!

Create the text file case4gcc.c with the following source code, refactored from libgcc2.c, in the same directory as case4llvm.c:

// Copyleft © 2015-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>
// Copyright © 1989-2021 Free Software Foundation, Inc.

typedef unsigned long long uint64_t;

static inline
uint64_t DIV(uint64_t dividend_hi, uint64_t dividend_lo, uint64_t divisor, uint64_t *remainder) {
    uint64_t quotient;
    __asm__("divq\t%[divisor]"
           :"=a" (quotient),
            "=d" (*remainder)
           :"a" (dividend_lo),
            "d" (dividend_hi),
            [divisor] "rm" (divisor)
           :"cc");
    return quotient;
}

__uint128_t __udivmodti4(__uint128_t numerator, __uint128_t denominator, __uint128_t *modulus) {
    union {
        __uint128_t all;
        struct {
            uint64_t low, high;
        };
    } dividend, divisor, quotient, product;
    unsigned bn, bm;

    dividend.all = numerator;
    divisor.all = denominator;

    if (divisor.high == 0) {
        if (divisor.low > dividend.high) // 0:q = n:n / 0:d
            quotient.high = 0;
        else                             // q:q = n:n / 0:d
            quotient.high = DIV(0, dividend.high, divisor.low, &dividend.high);

        quotient.low = DIV(dividend.high, dividend.low, divisor.low, &dividend.low);

        // remainder in dividend.low
        if (modulus != 0) {
            dividend.high = 0;
            *modulus = dividend.all;
        }
    } else {
        if (divisor.high > dividend.high) { // 0:0 = n:n / d:d
            quotient.all = 0;

            // remainder in dividend.all
            if (modulus != 0)
                *modulus = dividend.all;
        } else { // 0:q = n:n / d:d

            bm = __builtin_clzll(divisor.high);
            if (bm == 0) {
                // since the most significant bit of divisor.high is set
                // and dividend.high >= divisor.high, the most significant
                // bit of dividend.high is set as well, and quotient.low
                // is either 0 or 1
                //
                // this special case is necessary, not an optimization!

                // the condition on the next line takes advantage that
                // (due to program flow) dividend.high >= divisor.high
                if (dividend.high > divisor.high || dividend.low >= divisor.low) {
                    dividend.all -= divisor.all;
                    quotient.low = 1;
                } else
                    quotient.low = 0;

                quotient.high = 0;

                if (modulus != 0)
                    *modulus = dividend.all;
            } else { // normalize
                bn = 64 - bm;

                quotient.high = dividend.high >> bn;
#ifdef OPTIMIZE
                dividend.all <<= bm;
                divisor.all <<= bm;
#else
                dividend.high <<= bm;
                dividend.high |= dividend.low >> bn;
                dividend.low <<= bm;

                divisor.high <<= bm;
                divisor.high |= divisor.low >> bn;
                divisor.low <<= bm;
#endif
                quotient.low = DIV(quotient.high, dividend.high, divisor.high, &dividend.high);
                quotient.high = 0;
                product.all = quotient.all * divisor.low;

                if (product.all > dividend.all) {
                    product.all -= divisor.all;
                    quotient.low -= 1;
                }

                // remainder is (dividend - product) >> bm
                if (modulus != 0) {
                    dividend.all -= product.all;
#ifdef OPTIMIZE
                    dividend.all >>= bm;
#else
                    dividend.low >>= bm;
                    dividend.low |= dividend.high << bn;
                    dividend.high >>= bm;
#endif
                    *modulus = dividend.all;
                }
            }
        }
    }

    return quotient.all;
}
Create the text file case4.c with the following content in the same directory as case4gcc.c and case4llvm.c:
// Copyright © 2015-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#include <stdint.h>
#include <stdio.h>
#include <time.h>

extern
__uint128_t __udivmodti4(__uint128_t dividend, __uint128_t divisor, __uint128_t *remainder);

__attribute__ ((noinline))
static
__uint128_t __unopti4(__uint128_t dividend, __uint128_t divisor, __uint128_t *remainder)
{
    if (remainder != NULL)
        *remainder = divisor;

    return dividend;
}

__attribute__ ((always_inline))
static
__uint128_t lfsr128(void)
{
    // 128-bit linear feedback shift register (Galois form) using
    //  primitive polynomial 0x5DB2B62B0C5F8E1B:D8CCE715FCB2726D,
    //   initialised with 2**128 / golden ratio

    static __uint128_t lfsr = (__uint128_t) 0x9E3779B97F4A7C15 << 64 | 0xF39CC0605CEDC834;
    const  __uint128_t poly = (__uint128_t) 0x5DB2B62B0C5F8E1B << 64 | 0xD8CCE715FCB2726D;
    const  __uint128_t sign = (__int128_t) lfsr >> 127;

    return lfsr = (lfsr << 1) ^ (poly & sign);
}

__attribute__ ((always_inline))
static
__uint128_t lfsr64(void)
{
    // 64-bit linear feedback shift register (Galois form) using
    //  primitive polynomial 0xAD93D23594C935A9 (CRC-64 "Jones"),
    //   initialised with 2**64 / golden ratio

    static uint64_t lfsr = 0x9E3779B97F4A7C15;
    const  uint64_t sign = (int64_t) lfsr >> 63;

    return lfsr = (lfsr << 1) ^ (0xAD93D23594C935A9 & sign);
}

__attribute__ ((always_inline))
static
__uint128_t lfsr32(void)
{
    // 32-bit linear feedback shift register (Galois form) using
    //  primitive polynomial 0xDB710641 (CRC-32 IEEE),
    //   initialised with 2**32 / golden ratio

    static uint32_t lfsr = 0x9E3779B9;
    const  uint32_t sign = (int32_t) lfsr >> 31;

    return lfsr = (lfsr << 1) ^ (0xDB710641 & sign);
}

int main(void)
{
    clock_t t0, t1, t2, tt;
    uint32_t n;
    __uint128_t dividend, divisor = ~0, remainder;
    volatile __uint128_t quotient;

    t0 = clock();

    for (n = 500000000u; n > 0u; n--)
    {
        dividend = lfsr128();
        quotient = __unopti4(dividend, divisor, NULL);
        divisor = lfsr64();
        quotient = __unopti4(dividend, divisor, &remainder);
    }

    t1 = clock();

    for (n = 500000000u; n > 0u; n--)
    {
        dividend = lfsr128();
        quotient = __udivmodti4(dividend, divisor, NULL);
        divisor = lfsr64();
        quotient = __udivmodti4(dividend, divisor, &remainder);
    }

    t2 = clock();
    tt = t2 - t0;
    t2 -= t1;
    t1 -= t0;
    t0 = t2 - t1;

    printf("\n"
           "__unopti4()       %4lu.%06lu       0\n"
           "__udivmodti4()    %4lu.%06lu    %4lu.%06lu\n"
           "                  %4lu.%06lu nano-seconds\n",
           t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);

    t0 = clock();

    for (n = 500000000u; n > 0u; n--)
    {
        dividend = lfsr128();
        quotient = __unopti4(dividend, divisor, NULL);
        divisor = lfsr32();
        quotient = __unopti4(dividend, divisor, &remainder);
    }

    t1 = clock();

    for (n = 500000000u; n > 0u; n--)
    {
        dividend = lfsr128();
        quotient = __udivmodti4(dividend, divisor, NULL);
        divisor = lfsr32();
        quotient = __udivmodti4(dividend, divisor, &remainder);
    }

    t2 = clock();
    tt = t2 - t0;
    t2 -= t1;
    t1 -= t0;
    t0 = t2 - t1;

    printf("\n"
           "__unopti4()       %4lu.%06lu       0\n"
           "__udivmodti4()    %4lu.%06lu    %4lu.%06lu\n"
           "                  %4lu.%06lu nano-seconds\n",
           t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);

    t0 = clock();

    for (n = 500000000u; n > 0u; n--)
    {
        dividend = lfsr64();
        quotient = __unopti4(dividend, divisor, NULL);
        divisor = lfsr32();
        quotient = __unopti4(dividend, divisor, &remainder);
    }

    t1 = clock();

    for (n = 500000000u; n > 0u; n--)
    {
        dividend = lfsr64();
        quotient = __udivmodti4(dividend, divisor, NULL);
        divisor = lfsr32();
        quotient = __udivmodti4(dividend, divisor, &remainder);
    }

    t2 = clock();
    tt = t2 - t0;
    t2 -= t1;
    t1 -= t0;
    t0 = t2 - t1;

    printf("\n"
           "__unopti4()       %4lu.%06lu       0\n"
           "__udivmodti4()    %4lu.%06lu    %4lu.%06lu\n"
           "                  %4lu.%06lu nano-seconds\n",
           t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);

    t0 = clock();

    for (n = 500000u; n > 0u; n--)
    {
        quotient = __unopti4(~0, 3, NULL);
        quotient = __unopti4(~0, 3, &remainder);
    }

    t1 = clock();

    for (n = 500000u; n > 0u; n--)
    {
        quotient = __udivmodti4(~0, 3, NULL);
        quotient = __udivmodti4(~0, 3, &remainder);
    }

    t2 = clock();
    tt = t2 - t0;
    t2 -= t1;
    t1 -= t0;
    t0 = t2 - t1;

    printf("\n"
           "__unopti4()       %4lu.%06lu       0\n"
           "__udivmodti4()    %4lu.%06lu    %4lu.%06lu\n"
           "                  %4lu.%06lu micro-seconds\n",
           t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);
}
Note: modification of the source files case4llvm.c, case4gcc.c and case4.c to demonstrate the equally disastrous (mis)performance of the 64÷64-bit integer division on 32-bit processors is left as an exercise for the reader.

On your favourite machine with an AMD64 processor, where both Clang and the GNU Compiler Collection are installed, run the following command lines to compile, link and run the benchmark program:

lscpu
gcc -mno-sse -O3 case4.c case4gcc.c
echo 'GCC/libgcc'
./a.out
clang -mno-sse -O3 case4.c case4llvm.c
echo 'LLVM/clang/compiler-rt'
./a.out
Note: if you prefer to use the library function __udivmodti4() provided by the respective compiler, run the following command lines instead:
lscpu
gcc -mno-sse -O3 case4.c
echo 'GCC/libgcc'
./a.out
clang -mno-sse -O3 -rtlib=compiler-rt case4.c
echo 'LLVM/clang/compiler-rt'
./a.out

Runtime measurement on AMD® Ryzen 5 3600

Note: for better readability and to ease comparison, the numbers are shown in two columns.
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   43 bits physical, 48 bits virtual
CPU(s):                          12
On-line CPU(s) list:             0-11
Thread(s) per core:              2
Core(s) per socket:              6
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           113
Model name:                      AMD Ryzen 5 3600 6-Core Processor
Stepping:                        0
Frequency boost:                 enabled
CPU MHz:                         2195.688
CPU max MHz:                     3600,0000
CPU min MHz:                     2200,0000
BogoMIPS:                        7186.29
Virtualization:                  AMD-V
L1d cache:                       192 KiB
L1i cache:                       192 KiB
L2 cache:                        3 MiB
L3 cache:                        32 MiB
NUMA node0 CPU(s):               0-11
[…]
                      GCC/libgcc                    LLVM/clang/compiler-rt

__unopti4()          1.668976       0              1.695685       0
__udivmodti4()      12.617999      10.949023     167.891651     166.195966
                    14.286975 nano-seconds       169.587336 nano-seconds

__unopti4()          1.760362       0              1.708451       0
__udivmodti4()      18.046460      16.286098     246.065291     244.356840
                    19.806822 nano-seconds       247.773742 nano-seconds

__unopti4()          1.248846       0              1.463868       0
__udivmodti4()       7.155582       5.906736      10.833658       9.369790
                     8.404428 nano-seconds        12.297526 nano-seconds
OUCH: in the best case, the executable generated by Clang performs the 128÷128-bit integer division 1.6 times slower than the executable generated by GCC, while in the worst case it is fully 15 (in words: fifteen) times slower – an utterly disastrous result!

Runtime measurement on AMD® EPYC 7262

Note: for better readability and to ease comparison, the numbers are shown in two columns.
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             1
NUMA node(s):          1
Vendor ID:             AuthenticAMD
CPU family:            23
Model:                 49
Model name:            AMD EPYC 7262 8-Core Processor
Stepping:              0
CPU MHz:               3193.726
BogoMIPS:              6387.45
Virtualization:        AMD-V
L1d cache:             32K
L1i cache:             32K
L2 cache:              512K
L3 cache:              16384K
NUMA node0 CPU(s):     0-15
[…]
                      GCC/libgcc                    LLVM/clang/compiler-rt

__unopti4()          1.920000       0              2.250000       0
__udivmodti4()      15.420000      13.500000     230.230000     227.980000
                    17.340000 nano-seconds       232.480000 nano-seconds

__unopti4()          2.000000       0              2.280000       0
__udivmodti4()      22.120000      20.120000     340.400000     338.120000
                    24.120000 nano-seconds       342.680000 nano-seconds

__unopti4()          1.620000       0              1.780000       0
__udivmodti4()       8.810000       7.190000      13.200000      11.420000
                    10.430000 nano-seconds        14.980000 nano-seconds
OUCH: now the worst case is 17 (in words: seventeen) times slower!

Runtime measurement on Intel® Core i5-6600

Note: for better readability and to ease comparison, the numbers are shown in two columns.
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          4
On-line CPU(s) list:             0-3
Thread(s) per core:              1
Core(s) per socket:              4
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           94
Model name:                      Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz
Stepping:                        3
CPU MHz:                         3311.998
BogoMIPS:                        6623.99
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       128 KiB
L1i cache:                       128 KiB
L2 cache:                        1 MiB
L3 cache:                        24 MiB
NUMA node0 CPU(s):               0-3
[…]
                      GCC/libgcc                    LLVM/clang/compiler-rt

__unopti4()          5.075482       0              5.077103       0
__udivmodti4()      34.428038      29.352556     183.407210     178.330107
                    39.503520 nano-seconds       188.484313 nano-seconds

__unopti4()          4.652310       0              4.653633       0
__udivmodti4()      36.139848      31.487538     280.501125     275.847492
                    40.792158 nano-seconds       285.154758 nano-seconds

__unopti4()          4.938115       0              4.650043       0
__udivmodti4()       8.971484       4.033369      12.585962       7.935919
                    13.909599 nano-seconds        17.236005 nano-seconds
Oops: in the best case, the executable generated by Clang performs the 128÷128-bit integer division at half the speed of the executable generated by GCC, while in the worst case it is fully 9 (in words: nine) times slower – another devastating result!

Careless (or clueless?) wasting of resources on Microsoft Windows

The executable installer LLVM-10.0.0-win64.exe dumps the following duplicate files in the directory C:\Program Files\LLVM\bin\, wasting nearly 500 MB disk space, which is about a third of the disk space occupied by the whole package:
C:\>DIR "C:\Program Files\LLVM\bin" /O:-S
[…]
03/25/2020   0:15 PM        83,258,880 clang-cl.exe
03/25/2020   0:03 PM        83,258,880 clang.exe
03/25/2020   0:15 PM        83,258,880 clang++.exe
03/25/2020   0:15 PM        83,258,880 clang-cpp.exe
[…]
03/25/2020   0:15 PM        57,812,480 lld-link.exe
03/25/2020   0:05 PM        57,812,480 lld.exe
03/25/2020   0:15 PM        57,812,480 ld.lld.exe
03/25/2020   0:15 PM        57,812,480 ld64.lld.exe
03/25/2020   0:15 PM        57,812,480 wasm-ld.exe
[…]
03/25/2020   0:15 PM        18,182,144 llvm-ranlib.exe
03/25/2020   0:15 PM        18,182,144 llvm-lib.exe
03/25/2020   0:00 PM        18,182,144 llvm-ar.exe
[…]
C:\>FC.EXE /B "C:\Program Files\LLVM\bin\clang.exe" "C:\Program Files\LLVM\bin\clang-cl.exe"
Comparing files C:\Program Files\LLVM\bin\clang.exe and C:\Program Files\LLVM\bin\clang-cl.exe
FC: no differences encountered
[…]
C:\>FSUTIL.EXE Hardlink List "C:\Program Files\LLVM\bin\clang.exe"
\Program Files\LLVM\bin\clang.exe
C:\>
[…]
Ditto for the executable installer LLVM-10.0.0-win32.exe:
C:\>DIR "C:\Program Files (x86)\LLVM\bin" /O:-S
[…]
03/25/2020  11:21 AM        77,862,912 clang-cpp.exe
03/25/2020  11:21 AM        77,862,912 clang-cl.exe
03/25/2020  11:21 AM        77,862,912 clang++.exe
03/25/2020  11:13 AM        77,862,912 clang.exe
[…]
03/25/2020  11:22 AM        54,811,648 ld.lld.exe
03/25/2020  11:22 AM        54,811,648 ld64.lld.exe
03/25/2020  11:15 AM        54,811,648 lld.exe
03/25/2020  11:22 AM        54,811,648 lld-link.exe
03/25/2020  11:22 AM        54,811,648 wasm-ld.exe
[…]
03/25/2020  11:21 AM        17,346,560 llvm-ranlib.exe
03/25/2020  11:10 AM        17,346,560 llvm-ar.exe
03/25/2020  11:21 AM        17,346,560 llvm-lib.exe
[…]
C:\>
Ever heard of hardlinks?
NTFS has supported them since its introduction nearly 30 years ago, and Windows NT has provided an API to create them for 24 years.

Which cross-compiler?

The pages Getting Started with the LLVM System and Cross-compilation using Clang state:
It is possible to cross-compile LLVM itself. That is, you can create LLVM executables and libraries to be hosted on a platform different from the platform where they are built (a Canadian Cross build).
Clang/LLVM is natively a cross-compiler, meaning that one set of programs can compile to all targets by setting the -target option.
Contrary to both statements, it’s not even possible to build an executable for 32-bit processors with the 64-bit compiler and linker on Windows, and vice versa, i.e. for another architecture of the same platform!

To falsify LLVM’s statements and prove my claim, create the text file case6.c with the following content in an arbitrary, preferably empty directory:

// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

void case6(_Complex double a, _Complex double b) {
    volatile _Complex double p = a * b, q = a / b;
}
Compile the source file case6.c with Clang from the 32-bit package, targeting the AMD64 platform, and display the generated assembly code:
"C:\Program Files (x86)\LLVM\bin\clang.exe" -o- -O3 -S -target amd64-pc-windows case6.c
[…]
case6:					# @case6
[…]
	callq	__muldc3
[…]
	callq	__divdc3
[…]
Note: the functions __divdc3() and __muldc3() for the AMD64 platform are provided in the library clang_rt.builtins-x86_64.lib.
"C:\Program Files (x86)\LLVM\bin\clang.exe" -O3 -target amd64-pc-windows case6.c
llvm-3d1c91.o : error LNK2019: unresolved external symbol __muldc3 referenced in function _main
llvm-3d1c91.o : error LNK2019: unresolved external symbol __divdc3 referenced in function _main
a.exe : fatal error LNK1120: 2 unresolved externals
clang: error: linker command failed with exit code 1120 (use -v to see invocation)
Compile the source file case6.c a second time, now with Clang from the 64-bit package, targeting the i386 platform, and display the generated assembly code:
"C:\Program Files\LLVM\bin\clang.exe" -o- -O3 -S -target i386-pc-windows case6.c
[…]
_case6:					# @case6
[…]
	calll	___muldc3
[…]
	calll	___divdc3
[…]
Note: the functions __divdc3() and __muldc3() for the i386 platform are provided in the library clang_rt.builtins-i386.lib.
"C:\Program Files\LLVM\bin\clang.exe" -O3 -target i386-pc-windows case6.c
llvm-f4c2d1.o : error LNK2019: unresolved external symbol ___muldc3 referenced in function _main
llvm-f4c2d1.o : error LNK2019: unresolved external symbol ___divdc3 referenced in function _main
a.exe : fatal error LNK1120: 2 unresolved externals
clang: error: linker command failed with exit code 1120 (use -v to see invocation)
Although the compiler installed with both packages is able to produce 32-bit as well as 64-bit objects, and the linker is able to produce either 32-bit executables from 32-bit objects or 64-bit executables from 64-bit objects, 32-bit objects can’t be linked to produce 32-bit executables using the 64-bit package, and vice versa: each package contains only the clang_rt.*.lib for its own processor architecture!
C:\>DIR "C:\Program Files\LLVM\lib\clang\10.0.0\lib\windows\*-i386.lib"
[…]
File Not Found
[…]
C:\>DIR "C:\Program Files (x86)\LLVM\lib\clang\10.0.0\lib\windows\*-x86_64.lib"
[…]
File Not Found
[…]
Note: side-by-side installation of the 32-bit and 64-bit package needs 3 GB disk space, wasting 1 GB for duplicate files – that’s 100 times the disk space of the missing clang_rt.*.lib libraries!

Misleading message box from installer on Microsoft Windows

[Screen shot of misleading installer message box] Poor souls who want (really: need; see above) to install the 64-bit package after or aside the 32-bit package (or vice versa) on Windows are greeted by the installer with the misleading message box shown on the right: the version labelled as old is in fact the current version for the other processor architecture!
If they choose to continue without uninstallation, the shortcuts previously created in the start menu by the other version, together with its uninstall information, are overwritten – a really user-friendly move!

Also note the Denglisch gibberish: the title bar and the buttons are localised, but the message text isn’t.

Inconsistent undefined behaviour

Create the text file case8.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

int __absusi2(int value) {
    if (value < 0)
        value = -value;
    if (value < 0)
        __builtin_trap();
    return value;
}

int __absvsi2(int value) {
    const int sign = value >> 31;
    value += sign;
    value ^= sign;
    if (value < 0)
        __builtin_trap();
    return value;
}

int __abswsi2(int value) {
    const int sign = 0 - (value < 0);
    value ^= sign;
    value -= sign;
    if (value < 0)
        __builtin_trap();
    return value;
}

long __absudi2(long value) {
    if (value < 0)
        value = -value;
    if (value < 0)
        __builtin_trap();
    return value;
}

long __absvdi2(long value) {
    const long sign = value >> 63;
    value += sign;
    value ^= sign;
    if (value < 0)
        __builtin_trap();
    return value;
}

long __abswdi2(long value) {
    const long sign = 0 - (value < 0);
    value ^= sign;
    value -= sign;
    if (value < 0)
        __builtin_trap();
    return value;
}

__int128_t __absuti2(__int128_t value) {
    if (value < 0)
        value = -value;
    if (value < 0)
        __builtin_trap();
    return value;
}

__int128_t __absvti2(__int128_t value) {
    const __int128_t sign = value >> 127;
    value += sign;
    value ^= sign;
    if (value < 0)
        __builtin_trap();
    return value;
}

__int128_t __abswti2(__int128_t value) {
    const __int128_t sign = 0 - (value < 0);
    value ^= sign;
    value -= sign;
    if (value < 0)
        __builtin_trap();
    return value;
}
Compile the source file case8.c with Clang, engaging the optimiser, targeting the AMD64 platform, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-linux case8.c
[…]
__absusi2:				# @__absusi2
# %bb.0:
	movl	%edi, %eax
	negl	%eax
	cmovll	%edi, %eax
	retq
[…]
__absvsi2:				# @__absvsi2
# %bb.0:
	movl	%edi, %eax
	negl	%eax
	cmovll	%edi, %eax
	retq
[…]
__abswsi2:				# @__abswsi2
# %bb.0:
	movl	%edi, %eax
	negl	%eax
	cmovll	%edi, %eax
	retq
[…]
__absudi2:				# @__absudi2
# %bb.0:
	movq	%rdi, %rax
	negq	%rax
	cmovlq	%rdi, %rax
	retq
[…]
__absvdi2:				# @__absvdi2
# %bb.0:
	movq	%rdi, %rax
	negq	%rax
	cmovlq	%rdi, %rax
	retq
[…]
__abswdi2:				# @__abswdi2
# %bb.0:
	movq	%rdi, %rax
	negq	%rax
	cmovlq	%rdi, %rax
	retq
[…]
__absuti2:				# @__absuti2
# %bb.0:
	xorl	%edx, %edx
	movq	%rdi, %rax
	negq	%rax
	sbbq	%rsi, %rdx
	testq	%rsi, %rsi
	cmovnsq	%rdi, %rax
	cmovnsq	%rsi, %rdx
	retq
[…]
__absvti2:				# @__absvti2
# %bb.0:
	movq	%rsi, %rdx
	movq	%rdi, %rax
	movq	%rsi, %rcx
	sarq	$63, %rcx
	addq	%rcx, %rax
	adcq	%rcx, %rdx
	xorq	%rcx, %rdx
	js	.LBB7_1
# %bb.2:
	xorq	%rcx, %rax
	retq
.LBB7_1:
	ud2
[…]
__abswti2:				# @__abswti2
# %bb.0:
	movq	%rsi, %rdx
	movq	%rdi, %rax
	movq	%rsi, %rcx
	sarq	$63, %rcx
	xorq	%rcx, %rdx
	xorq	%rcx, %rax
	subq	%rcx, %rax
	sbbq	%rcx, %rdx
	js	.LBB8_1
# %bb.2:
	retq
.LBB8_1:
	ud2
[…]
	.ident	"clang version 10.0.0 "
[…]
Oops: except for the __absvti2() and __abswti2() functions, the optimiser removes the conditional expression detecting integer overflow – i.e. undefined behaviour – so it recognises the two alternative implementations of abs() for 32-bit and 64-bit integers, but fails to do so only for 128-bit integers!

Impaired performance of one of the most trivial functions

Create the text file case9.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifdef __amd64__
__int128_t __absti2(__int128_t value) {
    return value < 0 ? -value : value;
}
#else
long long __absdi2(long long value) {
#ifndef ALTERNATE
    return value < 0 ? -value : value;
#else
    return __builtin_llabs(value);
#endif
}

int __abssi2(int value) {
#ifndef ALTERNATE
    return value < 0 ? -value : value;
#else
    return __builtin_abs(value);
#endif
}
#endif
Compile the source file case9.c with Clang, engaging its optimiser, targeting the AMD64 platform, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-linux case9.c
Note: the left column shows the generated code, while the right column shows shorter, but still not properly optimised, code as comments.
[…]
__absti2:				# @__absti2
# %bb.0:
	xorl	%edx, %edx		#	xor	edx, edx
	movq	%rdi, %rax		#	mov	rax, rdi
	negq	%rax			#	neg	rax
	sbbq	%rsi, %rdx		#	sbb	rdx, rsi
	testq	%rsi, %rsi		#
	cmovnsq	%rdi, %rax		#	cmovs	rax, rdi
	cmovnsq	%rsi, %rdx		#	cmovs	rdx, rsi
	retq				#	ret
[…]
	.ident	"clang version 10.0.0 "
[…]
Oops: the use of CMOVcc instructions introduces an explicit data dependency without necessity and impairs performance!

Ouch: the TEST instruction is superfluous!

The following code for the __absti2() function avoids the CMOVcc instructions and is 3 bytes shorter:

# Copyright © 2004-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.arch	generic64
.code64
.intel_syntax noprefix
.text

__absti2:
	mov	rax, rsi	# rax = high qword of argument
	cqo			# rdx = (argument < 0) ? -1 : 0
	mov	rax, rdx	# rdx:rax = (argument < 0) ? -1 : 0
	add	rdi, rdx
	adc	rsi, rdx	# rsi:rdi = (argument < 0) ? argument - 1 : argument
	xor	rax, rdi
	xor	rdx, rsi	# rdx:rax = (argument < 0) ? -argument : argument
				#         = |argument|
	ret

.size	__absti2, .-__absti2
.type	__absti2, @function
.global	__absti2
.end
Compile the source file case9.c a second time with Clang, now targeting the i386 platform, and display the generated assembly code:
clang -o- -O3 -S -target i386-pc-linux case9.c
[…]
__absdi2:				# @__absdi2
# %bb.0:
	movl	8(%esp), %edx
	movl	4(%esp), %eax
	movl	%edx, %ecx
	sarl	$31, %ecx
	addl	%ecx, %eax
	adcl	%ecx, %edx
	xorl	%ecx, %eax
	xorl	%ecx, %edx
	retl
[…]
__abssi2:				# @__abssi2
# %bb.0:
	movl	4(%esp), %ecx
	movl	%ecx, %eax
	negl	%eax
	cmovll	%ecx, %eax
	retl
[…]
	.ident	"clang version 10.0.0 "
[…]
Oops: the use of a CMOVcc instruction introduces an explicit data dependency without necessity and impairs performance!

The following code for the __absdi2() and __abssi2() functions avoids the CMOVcc instruction and is shorter too:

# Copyright © 2004-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.arch	generic32
.code32
.intel_syntax noprefix
.text

__absdi2:
	mov	eax, [esp+8]
	mov	ecx, [esp+4]	# eax:ecx = argument
	cdq			# edx = (argument < 0) ? -1 : 0
	add	ecx, edx
	adc	eax, edx	# eax:ecx = (argument < 0) ? argument - 1 : argument
	xor	ecx, edx
	xor	edx, eax	# edx:ecx = (argument < 0) ? -argument : argument
				#         = |argument|
	mov	eax, ecx	# edx:eax = |argument|
	ret

.size	__absdi2, .-__absdi2
.type	__absdi2, @function
.global	__absdi2

__abssi2:
	mov	eax, [esp+4]	# eax = argument
	cdq			# edx = (argument < 0) ? -1 : 0
	add	eax, edx	# eax = (argument < 0) ? argument - 1 : argument
	xor	eax, edx	# eax = (argument < 0) ? -argument : argument
				#     = |argument|
	ret

.size	__abssi2, .-__abssi2
.type	__abssi2, @function
.global	__abssi2
.end
Note: exploration of the code generated with the preprocessor macro ALTERNATE defined is left as an exercise for the reader.

Invalid machine code generated for __bswapsi2(), __bswapdi2() and __builtin_bswap*()

Create the text file case10.c with the following source code, partially copied from __bswapsi2() and __bswapdi2(), in an arbitrary, preferably empty directory:
// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

unsigned int __aswapsi2(unsigned int u) {
  return __builtin_bswap32(u);
}

unsigned int __bswapsi2(unsigned int u) {
  return (((u & 0xFF000000U) >> 24)
        | ((u & 0x00FF0000U) >> 8)
        | ((u & 0x0000FF00U) << 8)
        | ((u & 0x000000FFU) << 24));
}

unsigned long long __aswapdi2(unsigned long long u) {
  return __builtin_bswap64(u);
}

unsigned long long __bswapdi2(unsigned long long u) {
  return (((u & 0xFF00000000000000ULL) >> 56)
        | ((u & 0x00FF000000000000ULL) >> 40)
        | ((u & 0x0000FF0000000000ULL) >> 24)
        | ((u & 0x000000FF00000000ULL) >> 8)
        | ((u & 0x00000000FF000000ULL) << 8)
        | ((u & 0x0000000000FF0000ULL) << 24)
        | ((u & 0x000000000000FF00ULL) << 40)
        | ((u & 0x00000000000000FFULL) << 56));
}
Compile the source file case10.c with Clang, engaging its optimiser, targeting the i386 processor, and display the generated assembly code:
clang -m32 -march=i386 -o- -O3 -S -target i386-pc-linux case10.c
Note: the left column shows the generated invalid code, while the right column shows properly optimised valid code as comments.
[…]
__aswapsi2:				# @__aswapsi2
# %bb.0:
	movl	4(%esp), %eax		#	mov	eax, [esp+4]
	bswapl	%eax			#	ror	ax, 8
					#	ror	eax, 16
					#	ror	ax, 8
	retl				#	ret
[…]
__bswapsi2:				# @__bswapsi2
# %bb.0:
	movl	4(%esp), %eax		#	mov	eax, [esp+4]
	bswapl	%eax			#	ror	ax, 8
					#	ror	eax, 16
					#	ror	ax, 8
	retl				#	ret
[…]
__aswapdi2:				# @__aswapdi2
# %bb.0:
	movl	4(%esp), %edx		#	mov	eax, [esp+8]
	movl	8(%esp), %eax		#	mov	edx, [esp+4]
	bswapl	%eax			#	ror	ax, 8
					#	ror	dx, 8
					#	ror	eax, 16
					#	ror	edx, 16
					#	ror	ax, 8
	bswapl	%edx			#	ror	dx, 8
	retl				#	ret
[…]
__bswapdi2:				# @__bswapdi2
# %bb.0:
	movl	4(%esp), %edx		#	mov	eax, [esp+8]
	movl	8(%esp), %eax		#	mov	edx, [esp+4]
	bswapl	%eax			#	ror	ax, 8
					#	ror	dx, 8
					#	ror	eax, 16
					#	ror	edx, 16
					#	ror	ax, 8
	bswapl	%edx			#	ror	dx, 8
	retl				#	ret
[…]
	.ident	"clang version 10.0.0 "
[…]
Ouch: the BSWAP instruction was introduced with the i486 processor; it is not supported on the i386!

Bad, insufficient or no (peephole) optimisation at all in various library functions

Peek into the code of some functions provided in the clang_rt.builtins-i386.lib and the libclang_rt.builtins-x86_64.a libraries.

__absvti2()

The __absvti2() function provided in the libclang_rt.builtins-x86_64.a library exhibits the code shown in the left column below; the right column shows better, but still not properly optimised code, with only 10 instructions in just 27 bytes instead of 18 instructions in 69 bytes:
# Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

	.arch	generic64
	.code64
	.intel_syntax noprefix
	.global	__absvti2
	.type	__absvti2, @function
	.text

__absvti2:
	mov	rcx, 8000000000000000h	#
	xor	rcx, rsi		#
	or	rcx, rdi		#
	jz	.overflow		#
	mov	rdx, rsi		#	mov	rdx, rsi
	mov	rax, rdi		#	mov	rax, rdi
	mov	rcx, rsi		#	sar	rsi, 63
	sar	rcx, 63			#	xor	rax, rsi
	xor	rdx, rcx		#	xor	rdx, rsi
	xor	rax, rcx		#	sub	rax, rsi
	sub	rax, rcx		#	sbb	rdx, rsi
	sbb	rdx, rcx		#	jo	.overflow
	ret				#	ret
.overflow:
	push	rax			#	ud2
	lea	rdi, …			#
	lea	rdx, …			#
	mov	esi, 24			#
	call	__abort			#

	.ident	"clang version 10.0.0 "
	.end
The following source coaxes Clang to generate this better code – almost, with but one superfluous MOV instruction:
// Copyright © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

__int128_t __absvti2(__int128_t value) {
#ifndef ALTERNATE
    const __int128_t sign = 0 - (value < 0);
    value ^= sign;
    value -= sign;
#else
    const __int128_t sign = value >> 127;
    value += sign;
    value ^= sign;
#endif
    if (value < 0)
        __builtin_trap();
    return value;
}

__ashldi3(), __ashrdi3() and __lshrdi3()

The __ashldi3() function provided in the clang_rt.builtins-i386.lib library exhibits the braindead code shown in the left column below; the right column shows properly optimised code, with only 12 instructions in just 30 bytes instead of 24 instructions in 51 bytes, avoiding a superfluous conditional branch:
# Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

	.arch	generic32
	.code32
	.intel_syntax noprefix
	.global	___ashldi3
	.type	___ashldi3, @function
	.text

___ashldi3:
	push	esi			#
	mov	ecx, [esp+16]		#	mov	ecx, [esp+12]
	mov	edx, [esp+8]		#	mov	eax, [esp+4]
	test	cl, 32			#	test	cl, 32
	jnz	0f			#	jnz	0f
	mov	esi, [esp+12]		#	mov	edx, [esp+8]
	test	ecx, ecx		#
	jz	2f			#
	mov	eax, edx		#
	shl	eax, cl			#
	shld	esi, edx, cl		#	shld	edx, eax, cl
	xor	ecx, ecx		#	shl	eax, cl
	mov	edx, esi		#
	jmp	1f			#	ret
0:
	shl	edx, cl			#	shl	eax, cl
	xor	eax, eax		#	mov	edx, eax
	xor	ecx, ecx		#	xor	eax, eax
1:
	or	edx, ecx		#
	pop	esi			#
	ret				#
2:
	mov	eax, edx		#
	mov	edx, esi		#
	pop	esi			#
	ret				#	ret

	.ident	"clang version 10.0.0 "
	.end
Note: exploration of the equally bad code generated for the __ashrdi3() and __lshrdi3() functions is left as an exercise to the reader.

__cmp?i2() and __ucmp?i2()

In their second halves, the functions __cmpdi2() and __ucmpdi2(), provided in the clang_rt.builtins-i386.lib library, exhibit the insane code shown in the left column below; the right column shows shorter (but still unoptimised) code, with 12 instructions in 35 bytes instead of 15 instructions in 47 bytes for each function:
# Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

	.arch	generic32
	.code32
	.intel_syntax noprefix
	.global	___cmpdi2
	.global	___ucmpdi2
	.type	___cmpdi2, @function
	.type	___ucmpdi2, @function
	.text

___cmpdi2:
	mov	ecx, [esp+16]		#	mov	ecx, [esp+16]
	xor	eax, eax		#	xor	eax, eax
	cmp	[esp+8], ecx		#	cmp	ecx, [esp+8]
	jl	0f			#	jg	0f
	mov	eax, 2			#	mov	eax, 2
	jg	0f			#	jl	0f
	mov	ecx, [esp+4]		#
	mov	edx, [esp+12]		#	mov	ecx, [esp+12]
	mov	eax, 0			#	xor	eax, eax
	cmp	ecx, edx		#	cmp	ecx, [esp+4]
	jb	0f			#	ja	0f
	cmp	edx, ecx		#
	mov	eax, 1			#
	adc	eax, 0			#	adc	eax, 1
0:					# 0:
	ret				#	ret

___ucmpdi2:
	mov	ecx, [esp+16]		#	mov	ecx, [esp+16]
	xor	eax, eax		#	xor	eax, eax
	cmp	[esp+8], ecx		#	cmp	ecx, [esp+8]
	jb	0f			#	ja	0f
	mov	eax, 2			#	mov	eax, 2
	ja	0f			#	jb	0f
	mov	ecx, [esp+4]		#
	mov	edx, [esp+12]		#	mov	ecx, [esp+12]
	mov	eax, 0			#	xor	eax, eax
	cmp	ecx, edx		#	cmp	ecx, [esp+4]
	jb	0f			#	ja	0f
	cmp	edx, ecx		#
	mov	eax, 1			#
	adc	eax, 0			#	adc	eax, 1
0:					# 0:
	ret				#	ret

	.ident	"clang version 10.0.0 "
	.end
Assembly programmers, however, write straightforward and branch-free code, using 14 instructions in 41 bytes for the __cmpdi2() function, and only 11 instructions in just 34 bytes for the __ucmpdi2() function:
# Copyright © 2004-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.arch	generic32
.code32
.intel_syntax noprefix
.text

__cmpdi2:
	mov	ecx, [esp+4]	# ecx = low dword of left argument
	mov	edx, [esp+12]	# edx = low dword of right argument
	cmp	ecx, edx
	mov	eax, [esp+8]	# eax = high dword of left argument
	sbb	eax, [esp+16]	# eax = high dword of left argument
				#     - high dword of right argument
				#     - borrow
	setl	ah		# ah = left argument < right argument
	cmp	edx, ecx
	mov	edx, [esp+16]	# edx = high dword of right argument
	sbb	edx, [esp+8]	# edx = high dword of right argument
				#     - high dword of left argument
				#     - borrow
	setl	al		# al = left argument > right argument
	sub	al, ah		# al = left argument > right argument
				#    - left argument < right argument
	movsx	eax, al		# eax = left argument > right argument
				#     - left argument < right argument
				#     = {-1, 0, 1}
	inc	eax		# eax = {0, 1, 2}
	ret

.size	__cmpdi2, .-__cmpdi2
.type	__cmpdi2, @function
.global	__cmpdi2

__ucmpdi2:
	mov	ecx, [esp+4]	# ecx = low dword of left argument
	mov	edx, [esp+12]	# edx = low dword of right argument
	cmp	ecx, edx
	mov	eax, [esp+8]	# eax = high dword of left argument
	sbb	eax, [esp+16]	# eax = high dword of left argument
				#     - high dword of right argument
				#     - borrow
	sbb	eax, eax	# eax = 0
				#     - left argument < right argument
	cmp	edx, ecx
	mov	edx, [esp+16]	# edx = high dword of right argument
	sbb	edx, [esp+8]	# edx = high dword of right argument
				#     - high dword of left argument
				#     - borrow
	adc	eax, 1		# eax = left argument > right argument
				#     - left argument < right argument
				#     + 1
				#     = {0, 1, 2}
	ret

.size	__ucmpdi2, .-__ucmpdi2
.type	__ucmpdi2, @function
.global	__ucmpdi2
.end
Note: exploration of the equally bad code generated for the __cmpti2() and __ucmpti2() functions is left as an exercise to the reader.

__mulo?i4()

The __mulosi4() and __mulodi4() functions provided in the clang_rt.builtins-i386.lib library exhibit completely insane and horrible code with 51 instructions in 130 bytes and 98 instructions in 266 bytes respectively, while the __mulosi4(), __mulodi4() and __muloti4() functions provided in the libclang_rt.builtins-x86_64.a library exhibit completely insane and horrible code with 44 instructions in 131 bytes, 48 instructions in 155 bytes, and 94 instructions in 302 bytes respectively.

Note: completely insane and horrible code not shown here to preserve the mental health of the reader!

Note: Clang generates calls of these monstrosities for the __builtin_*mul*_overflow builtins as well as the -fsanitize=integer, -fsanitize=undefined and -ftrapv command line options.

Create the text file case11.c with the following content in an arbitrary, preferably empty directory:

// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifndef __amd64__
#ifndef OPTIMIZE
long long mulodi4(long long multiplicand, long long multiplier, int *overflow) {
    long long product;
    *overflow = __builtin_smulll_overflow(multiplicand, multiplier, &product);
    return product;
}
#else
long long __mulodi4(long long multiplicand, long long multiplier, int *overflow) {
    unsigned long long product, sign = 0 - (multiplicand < 0), tmp = 0 - (multiplier < 0);
    *overflow = __builtin_umulll_overflow((multiplicand ^ sign) - sign,
                                          (multiplier ^ tmp) - tmp,
                                          &product);
    sign ^= tmp;
    product = (product ^ sign) - sign;
    *overflow |= (long long) (product ^ sign) < 0;
    return product;
}
#endif
#else
#ifndef OPTIMIZE
__int128_t muloti4(__int128_t multiplicand, __int128_t multiplier, int *overflow) {
    __int128_t product;
    *overflow = __builtin_mul_overflow(multiplicand, multiplier, &product);
    return product;
}
#else
__int128_t __muloti4(__int128_t multiplicand, __int128_t multiplier, int *overflow) {
    __uint128_t product, sign = 0 - (multiplicand < 0), tmp = 0 - (multiplier < 0);
    *overflow = __builtin_mul_overflow((multiplicand ^ sign) - sign,
                                       (multiplier ^ tmp) - tmp,
                                       &product);
    sign ^= tmp;
    product = (product ^ sign) - sign;
    *overflow |= (__int128_t) (product ^ sign) < 0;
    return product;
}
#endif
#endif // __amd64__
Compile the source file case11.c with Clang, engaging the optimiser, targeting the AMD64 platform, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-linux case11.c
Note: the left column shows the generated code, while the right column shows shorter, but still not optimised code as comment.
[…]
muloti4:				# @muloti4
# %bb.0:
	pushq	%rbx			#	push	r8
	subq	$16, %rsp		#	xor	eax, eax
	movq	%r8, %rbx		#	push	rax
	movq	$0, 8(%rsp)		#	push	rax
	leaq	8(%rsp), %r8		#	mov	r8, rsp
	callq	__muloti4		#	call	__muloti4
	xorl	%ecx, %ecx		#	pop	r8
	cmpq	$0, 8(%rsp)		#	pop	rcx
	setne	%cl			#	cmp	rcx, r8
	movl	%ecx, (%rbx)		#	setne	cl
	addq	$16, %rsp		#	pop	r8
	popq	%rbx			#	mov	[r8], ecx
	retq				#	ret
[…]
	.ident	"clang version 10.0.0 "
[…]
OOPS: 13 instructions in just 30 bytes instead of 13 instructions in 46 bytes – not counting the additional 343 instructions in 1109 bytes of the called __muloti4(), __divti3(), __udivti3() and __udivmodti4() functions!

Compile the source file case11.c a second time with Clang, now with the preprocessor macro OPTIMIZE defined, and display the generated assembly code:

clang -DOPTIMIZE -o- -O3 -S -target amd64-pc-linux case11.c
[…]
__muloti4:				# @__muloti4
# %bb.0:
	pushq	%rbp
	pushq	%r15
	pushq	%r14
	pushq	%rbx
	movq	%rcx, %r10
	movq	%rsi, %rax
	movq	%rsi, %r11
	sarq	$63, %r11
	sarq	$63, %rcx
	xorq	%r11, %rax
	xorq	%r11, %rdi
	subq	%r11, %rdi
	sbbq	%r11, %rax
	movq	%rdx, %rsi
	xorq	%rcx, %r10
	xorq	%rcx, %rsi
	subq	%rcx, %rsi
	sbbq	%rcx, %r10
	setne	%dl
	testq	%rax, %rax
	setne	%r14b
	andb	%dl, %r14b
	mulq	%rsi
	movq	%rax, %r9
	seto	%bpl
	movq	%r10, %rax
	mulq	%rdi
	movq	%rax, %r10
	seto	%r15b
	orb	%bpl, %r15b
	orb	%r14b, %r15b
	addq	%r9, %r10
	movq	%rdi, %rax
	mulq	%rsi
	addq	%r10, %rdx
	setb	%bl
	orb	%r15b, %bl
	movzbl	%bl, %esi
	xorq	%r11, %rcx
	xorq	%rcx, %rdx
	xorq	%rcx, %rax
	subq	%rcx, %rax
	sbbq	%rcx, %rdx
	xorq	%rdx, %rcx
	shrq	$63, %rcx
	orl	%esi, %ecx
	movl	%ecx, (%r8)
	popq	%rbx
	popq	%r14
	popq	%r15
	popq	%rbp
	retq
[…]
	.ident	"clang version 10.0.0 "
[…]
Oops: the proper implementation of the __muloti4() function coaxes Clang to generate 52 instructions in 147 bytes instead of 94 instructions in 302 bytes – without calls of the __divti3() and __udivti3() functions!

Assembly programmers, however, use only 44 instructions in just 121 bytes, don't clobber the non-volatile registers RBP, RBX, R14 and R15, and avoid the 8 memory accesses as well as the 9 partial register writes which impair performance:

# Copyright © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.arch	generic64
.code64
.intel_syntax noprefix
.text
				# rsi:rdi = multiplicand
				# rcx:rdx = multiplier
				# r8 = dword ptr flag
__muloti4:
	mov	r10, rdx	# r10 = low qword of multiplier
	mov	r11, rcx	# r11 = high qword of multiplier

	mov	rax, rcx	# rax = high qword of multiplier
	cqo			# rdx = (multiplier < 0) ? -1 : 0
	mov	r9, rdx		# r9 = (multiplier < 0) ? -1 : 0
	xor	r10, rdx
	xor	r11, rdx	# r11:r10 = (multiplier < 0) ? ~multiplier : multiplier
	sub	r10, rdx
	sbb	r11, rdx	# r11:r10 = (multiplier < 0) ? -multiplier : multiplier
				#         = |multiplier|
	mov	rax, rsi	# rax = high qword of multiplicand
	cqo			# rdx = (multiplicand < 0) ? -1 : 0
	xor	r9, rdx		# r9 = (multiplicand < 0) <> (multiplier < 0) ? -1 : 0
				#    = (product < 0) ? -1 : 0
	xor	rdi, rdx
	xor	rsi, rdx	# rsi:rdi = (multiplicand < 0) ? ~multiplicand : multiplicand
	sub	rdi, rdx
	sbb	rsi, rdx	# rsi:rdi = (multiplicand < 0) ? -multiplicand : multiplicand
				#         = |multiplicand|
	xor	ecx, ecx
	mov	[r8], ecx	# flag = 0
	cmp	rcx, rsi
	sbb	edx, edx	# edx = (high qword of |multiplicand| = 0) ? 0 : -1
				#     = (|multiplicand| < 2**64) ? 0 : -1
	cmp	rcx, r11
	sbb	ecx, ecx	# ecx = (high qword of |multiplier| = 0) ? 0 : -1
				#     = (|multiplier| < 2**64) ? 0 : -1
	and	ecx, edx	# ecx = (|multiplicand| < 2**64)
				#     & (|multiplier| < 2**64) ? 0 : -1
				#     = (|product| < 2**128) ? 0 : -1
	mov	rax, rsi
	mul	r10		# rdx:rax = high qword of |multiplicand|
				#         * low qword of |multiplier|
	adc	ecx, ecx	# ecx = (|product| < 2**128) ? 0 : *

	mov	rsi, rax
	mov	rax, rdi
	mul	r11		# rdx:rax = low qword of |multiplicand|
				#         * high qword of |multiplier|
	adc	ecx, ecx	# ecx = (|product| < 2**128) ? 0 : *

	add	rsi, rax	# rsi = high qword of |multiplicand|
				#     * low qword of |multiplier|
				#     + low qword of |multiplicand|
				#     * high qword of |multiplier|
#	adc	ecx, ecx	# ecx = (|product| < 2**128) ? 0 : *

	mov	rax, rdi
	mul	r10		# rdx:rax = low qword of |multiplicand|
				#         * low qword of |multiplier|
	add	rdx, rsi	# rdx:rax = |product % 2**128|
				#         = |product| % 2**128
	adc	ecx, ecx	# ecx = (|product| < 2**128) ? 0 : *
.if 0
	xor	rax, r9
	xor	rdx, r9		# rdx:rax = (product < 0)
				#         ? product % 2**128 - 1 : product % 2**128
	sub	rax, r9
	sbb	rdx, r9		# rdx:rax = product % 2**128

	xor	r9, rdx		# r9 = (product % 2**128 < -2**127)
				#    | (product % 2**128 >= 2**127)
				#    ? negative : positive
	add	r9, r9
.else
	add	rax, r9
	adc	rdx, r9		# rdx:rax = (product < 0)
				#         ? ~product % 2**128 : product % 2**128
	mov	rsi, rdx	# rsi = (product % 2**128 < -2**127)
				#     | (product % 2**128 >= 2**127)
				#     ? negative : positive
	xor	rax, r9
	xor	rdx, r9		# rdx:rax = product % 2**128

	add	rsi, rsi
.endif
	adc	ecx, ecx	# ecx = (-2**127 <= product < 2**127) ? 0 : *
	setnz	byte ptr [r8]
	ret

.size	__muloti4, .-__muloti4
.type	__muloti4, @function
.global	__muloti4
.end
Compile the source file case11.c a third time with Clang, now targeting the i386 platform, and display the generated assembly code:
clang -o- -O3 -S -target i386-pc-linux case11.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…]
mulodi4:				# @mulodi4
# %bb.0:
	pushl	%ebx			#	jmp	__mulodi4
	pushl	%edi			#
	pushl	%esi			#
	subl	$16, %esp		#
	movl	48(%esp), %esi		#
	movl	32(%esp), %eax		#
	movl	36(%esp), %ecx		#
	movl	40(%esp), %edx		#
	movl	44(%esp), %edi		#
	movl	$0, 12(%esp)		#
	subl	$12, %esp		#
	leal	24(%esp), %ebx		#
	pushl	%ebx			#
	pushl	%edi			#
	pushl	%edx			#
	pushl	%ecx			#
	pushl	%eax			#
	calll	__mulodi4		#
	addl	$32, %esp		#
	xorl	%ecx, %ecx		#
	cmpl	$0, 12(%esp)		#
	setne	%cl			#
	movl	%ecx, (%esi)		#
	addl	$16, %esp		#
	popl	%esi			#
	popl	%edi			#
	popl	%ebx			#
	retl				#
[…]
	.ident	"clang version 10.0.0 "
[…]
OUCH: only 1 instruction in just 5 bytes instead of 28 instructions in 73 bytes – not counting the additional 434 instructions in 1087 bytes of the called __mulodi4(), __divdi3(), __udivdi3() and __udivmoddi4() functions!

Compile the source file case11.c a fourth time with Clang, now with the preprocessor macro OPTIMIZE defined, and display the generated assembly code:

clang -DOPTIMIZE -o- -O3 -S -target i386-pc-linux case11.c
[…]
__mulodi4:				# @__mulodi4
# %bb.0:
	pushl	%ebp
	pushl	%ebx
	pushl	%edi
	pushl	%esi
	subl	$8, %esp
	movl	32(%esp), %eax
	movl	40(%esp), %esi
	movl	28(%esp), %ecx
	movl	36(%esp), %edi
	movl	%eax, %ebp
	movl	%esi, %ebx
	sarl	$31, %ebp
	sarl	$31, %ebx
	xorl	%ebp, %ecx
	xorl	%ebp, %eax
	subl	%ebp, %ecx
	sbbl	%ebp, %eax
	xorl	%ebx, %edi
	xorl	%ebx, %esi
	subl	%ebx, %edi
	sbbl	%ebx, %esi
	setne	%dl
	testl	%eax, %eax
	setne	%dh
	andb	%dl, %dh
	movb	%dh, 3(%esp)		# 1-byte Spill
	mull	%edi
	movl	%eax, 4(%esp)		# 4-byte Spill
	movl	%esi, %eax
	seto	2(%esp)			# 1-byte Folded Spill
	mull	%ecx
	movl	%eax, %esi
	seto	%al
	orb	2(%esp), %al		# 1-byte Folded Reload
	addl	4(%esp), %esi		# 4-byte Folded Reload
	movb	%al, 4(%esp)		# 1-byte Spill
	movl	%ecx, %eax
	mull	%edi
	addl	%esi, %edx
	movl	44(%esp), %esi
	setb	%cl
	xorl	%ebp, %ebx
	orb	4(%esp), %cl		# 1-byte Folded Reload
	xorl	%ebx, %eax
	xorl	%ebx, %edx
	orb	3(%esp), %cl		# 1-byte Folded Reload
	subl	%ebx, %eax
	sbbl	%ebx, %edx
	xorl	%edx, %ebx
	shrl	$31, %ebx
	movzbl	%cl, %ecx
	orl	%ecx, %ebx
	movl	%ebx, (%esi)
	addl	$8, %esp
	popl	%esi
	popl	%edi
	popl	%ebx
	popl	%ebp
	retl
[…]
	.ident	"clang version 10.0.0 "
[…]
Ouch: the proper implementation of the __mulodi4() function coaxes Clang to generate 59 instructions in 146 bytes instead of 98 instructions in 266 bytes – without calls of the __divdi3() and __udivdi3() functions!

Assembly programmers, however, use only 57 instructions in just 134 bytes, don't clobber the non-volatile registers EBP, EDI and ESI, and avoid 8 partial register writes which impair performance:

# Copyright © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.arch	generic32
.code32
.intel_syntax noprefix
.text
				# [esp+20] = address of flag
				# [esp+16] = high dword of multiplier
				# [esp+12] = low dword of multiplier
				# [esp+8] = high dword of multiplicand
				# [esp+4] = low dword of multiplicand
__mulodi4:
	push	ebx
	mov	eax, [esp+20]	# eax = high dword of multiplier
	mov	ecx, [esp+16]	# ecx = low dword of multiplier
	cdq			# edx = (multiplier < 0) ? -1 : 0
	mov	ebx, edx	# ebx = (multiplier < 0) ? -1 : 0
	xor	ecx, edx
	xor	eax, edx	# eax:ecx = (multiplier < 0) ? ~multiplier : multiplier
	sub	ecx, edx
	sbb	eax, edx	# eax:ecx = (multiplier < 0) ? -multiplier : multiplier
				#         = |multiplier|
	mov	[esp+16], ecx
	mov	[esp+20], eax	# multiplier = |multiplier|

	mov	eax, [esp+12]	# eax = high dword of multiplicand
	mov	ecx, [esp+8]	# ecx = low dword of multiplicand
	cdq			# edx = (multiplicand < 0) ? -1 : 0
	xor	ebx, edx	# ebx = (multiplier < 0)
				#     ^ (multiplicand < 0) ? -1 : 0
				#     = (product < 0) ? -1 : 0
	xor	ecx, edx
	xor	eax, edx	# eax:ecx = (multiplicand < 0) ? ~multiplicand : multiplicand
	sub	ecx, edx
	sbb	eax, edx	# eax:ecx = (multiplicand < 0) ? -multiplicand : multiplicand
				#         = |multiplicand|
	mov	[esp+8], ecx
	mov	[esp+12], eax	# multiplicand = |multiplicand|

	push	ebx		# [esp] = (product < 0) ? -1 : 0
	xor	ebx, ebx	# ebx = 0
#	mov	eax, [esp+16]	# eax = high dword of |multiplicand|
	cmp	ebx, eax
	sbb	edx, edx	# edx = (high dword of |multiplicand| = 0) ? 0 : -1
				#     = (|multiplicand| < 2**32) ? 0 : -1
	mov	ecx, [esp+24]	# ecx = high dword of |multiplier|
	cmp	ebx, ecx
	sbb	ebx, ebx	# ebx = (high dword of |multiplier| = 0) ? 0 : -1
				#     = (|multiplier| < 2**32) ? 0 : -1
	and	ebx, edx	# ebx = (|multiplicand| < 2**32)
				#     & (|multiplier| < 2**32) ? 0 : -1
				#     = (|product| < 2**64) ? 0 : -1

	mov	edx, [esp+20]	# edx = low dword of |multiplier|
	mul	edx		# edx:eax = high dword of |multiplicand|
				#         * low dword of |multiplier|
	adc	ebx, ebx	# ebx = (|product| < 2**64) ? 0 : *

	xchg	ecx, eax	# ecx = high dword of |multiplicand|
				#     * low dword of |multiplier|,
				# eax = high dword of |multiplier|
	mov	edx, [esp+12]	# edx = low dword of |multiplicand|
	mul	edx		# edx:eax = high dword of |multiplier|
				#         * low dword of |multiplicand|
	adc	ebx, ebx	# ebx = (|product| < 2**64) ? 0 : *

	add	ecx, eax	# ecx = high dword of |multiplicand|
				#     * low dword of |multiplier|
				#     + high dword of |multiplier|
				#     * low dword of |multiplicand|
#	adc	ebx, ebx	# ebx = (|product| < 2**64) ? 0 : *

	mov	eax, [esp+12]	# eax = low dword of |multiplicand|
	mov	edx, [esp+20]	# edx = low dword of |multiplier|
	mul	edx		# edx:eax = low dword of |multiplicand|
				#         * low dword of |multiplier|
	add	edx, ecx	# edx:eax = |product % 2**64|
				#         = |product| % 2**64
	adc	ebx, ebx	# ebx = (|product| < 2**64) ? 0 : *

	pop	ecx		# ecx = (product < 0) ? -1 : 0
	xor	eax, ecx
	xor	edx, ecx	# edx:eax = (product < 0) ? product % 2**64 - 1 : product % 2**64
	sub	eax, ecx
	sbb	edx, ecx	# edx:eax = product % 2**64

	xor	ecx, edx	# ecx = (multiplicand < 0)
				#     ^ (multiplier < 0)
				#     ^ (product < 0) ? negative : positive
	add	ecx, ecx
	mov	ecx, [esp+28]	# ecx = address of flag
	adc	ebx, ebx	# ebx = (-2**63 <= product < 2**63) ? 0 : *
	neg	ebx
	sbb	ebx, ebx	# ebx = (-2**63 <= product < 2**63) ? 0 : -1
	neg	ebx		# ebx = (-2**63 <= product < 2**63) ? 0 : 1
	mov	[ecx], ebx
	pop	ebx
	ret

.size	__mulodi4, .-__mulodi4
.type	__mulodi4, @function
.global	__mulodi4
.end

__parity?i2()

The __paritysi2() and __paritydi2() functions provided in the clang_rt.builtins-i386.lib library exhibit the insane code shown in the left column below; the right column shows properly optimised code, with 15 instructions in 40 bytes instead of 21 instructions in 57 bytes for both functions together, more than halving the number of instructions executed per function call:
# Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

	.arch	generic32
	.code32
	.intel_syntax noprefix
	.global	___paritysi2
	.global	___paritydi2
	.type	___paritysi2, @function
	.type	___paritydi2, @function
	.text

___paritysi2:
	mov	eax, [esp+4]		#	mov	eax, [esp+4]
	mov	ecx, eax		#
	shr	ecx, 16			#	shld	ecx, eax, 16
	xor	ecx, eax		#	xor	ecx, eax
	mov	eax, ecx		#
	shr	eax, 8			#
	xor	eax, ecx		#
	mov	ecx, eax		#
	shr	ecx, 4			#
	xor	ecx, eax		#
	mov	eax, 0x6996		#
	and	cl, 15			#	xor	eax, eax
	shr	eax, cl			#	xor	cl, ch
	and	eax, 1			#	setnp	al
	ret				#	ret

___paritydi2:
	mov	eax, [esp+8]		#	mov	eax, [esp+8]
	xor	eax, [esp+4]		#	xor	eax, [esp+4]
	push	eax			#	shld	ecx, eax, 16
	call	___paritysi2		#	xor	ecx, eax
	add	esp, 4			#	xor	eax, eax
					#	xor	cl, ch
					#	setnp	al
	ret				#	ret

	.ident	"clang version 10.0.0 "
	.end

Braindead implementation of code generator for __builtin_*()

For the __builtin_*() functions, Clang emits unoptimised code that can be even worse than the poor code provided in the clang_rt.builtins-i386.* and clang_rt.builtins-x86_64.* libraries, and fails to optimise it!

__builtin_parity()

Create the text file case12a.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

int __paritysi2(unsigned int value) {
    return __builtin_parity(value);
}

int __paritydi2(unsigned long long value) {
    return __builtin_parityll(value);
}
Compile the source file case12a.c with Clang, targeting the AMD64 platform, and display the generated (unoptimised) assembly code:
clang -o- -S -target amd64-pc-linux case12a.c
[…]
__paritysi2:				# @__paritysi2
# %bb.0:
	pushq	%rbp
	movq	%rsp, %rbp
	movl	%edi, -4(%rbp)
	movl	-4(%rbp), %eax
	movl	%eax, %ecx
	shrl	%ecx
	andl	$0x55555555, %ecx
	subl	%ecx, %eax
	movl	%eax, %ecx
	andl	$0x33333333, %ecx
	shrl	$2, %eax
	andl	$0x33333333, %eax
	addl	%eax, %ecx
	movl	%ecx, %eax
	shrl	$4, %eax
	addl	%eax, %ecx
	andl	$0xF0F0F0F, %ecx
	imull	$0x1010101, %ecx, %eax
	shrl	$24, %eax
	andl	$1, %eax
	popq	%rbp
	retq
[…]
__paritydi2:				# @__paritydi2
# %bb.0:
	pushq	%rbp
	movq	%rsp, %rbp
	movq	%rdi, -8(%rbp)
	movq	-8(%rbp), %rax
	movq	%rax, %rcx
	shrq	%rcx
	movabsq	$0x5555555555555555, %rdx
	andq	%rdx, %rcx
	subq	%rcx, %rax
	movabsq	$0x3333333333333333, %rcx
	movq	%rax, %rdx
	andq	%rcx, %rdx
	shrq	$2, %rax
	andq	%rcx, %rax
	addq	%rax, %rdx
	movq	%rdx, %rax
	shrq	$4, %rax
	addq	%rax, %rdx
	movabsq	$0xF0F0F0F0F0F0F0F, %rax
	andq	%rax, %rdx
	movabsq	$0x101010101010101, %rax
	imulq	%rax, %rdx
	shrq	$56, %rdx
	andq	$1, %rdx
	movl	%edx, %eax
	popq	%rbp
	retq
[…]
	.ident	"clang version 10.0.0 "
[…]
Compile the source file case12a.c a second time with Clang, now engaging its optimiser, again targeting the AMD64 platform, and display the generated optimised assembly code:
clang -o- -O3 -S -target amd64-pc-linux case12a.c
[…]
__paritysi2:				# @__paritysi2
# %bb.0:
	movl	%edi, %ecx
	shrl	$16, %ecx
	xorl	%edi, %ecx
	movl	%ecx, %edx
	shrl	$8, %edx
	xorl	%eax, %eax
	xorb	%cl, %dl
	setnp	%al
	retq
[…]
__paritydi2:				# @__paritydi2
# %bb.0:
	movq	%rdi, %rax
	shrq	%rax
	movabsq	$0x5555555555555555, %rcx
	andq	%rax, %rcx
	subq	%rcx, %rdi
	movabsq	$0x3333333333333333, %rax
	movq	%rdi, %rcx
	andq	%rax, %rcx
	shrq	$2, %rdi
	andq	%rax, %rdi
	addq	%rcx, %rdi
	movq	%rdi, %rax
	shrq	$4, %rax
	addq	%rdi, %rax
	movabsq	$0x10F0F0F0F0F0F0F, %rcx
	andq	%rax, %rcx
	movabsq	$0x101010101010101, %rax
	imulq	%rcx, %rax
	shrq	$56, %rax
	andl	$1, %eax
	retq
[…]
	.ident	"clang version 10.0.0 "
[…]
OUCH: the optimised code generated for __builtin_parityll() is a perfect and very convincing declaration of bankruptcy!

Compile the source file case12a.c a third time with Clang, again engaging its optimiser, now targeting the i386 platform, and display the generated optimised assembly code:

clang -o- -O3 -S -target i386-pc-linux case12a.c
[…]
__paritysi2:				# @__paritysi2
# %bb.0:
	movl	4(%esp), %eax
	movl	%eax, %ecx
	shrl	$16, %ecx
	xorl	%eax, %ecx
	xorl	%eax, %eax
	xorb	%ch, %cl
	setnp	%al
	retl
[…]
.LCPI1_0:
	.zero	16,85
.LCPI1_1:
	.zero	16,51
.LCPI1_2:
	.zero	16,15
[…]
__paritydi2:				# @__paritydi2
# %bb.0:
	movq	4(%esp), %xmm0		# xmm0 = mem[0],zero
	movdqa	%xmm0, %xmm1
	psrlw	$1, %xmm1
	pand	.LCPI1_0, %xmm1
	psubb	%xmm1, %xmm0
	movdqa	.LCPI1_1, %xmm1		# xmm1 = [51,51,51,51,51,51,51,51,51,51,51,51,51,51,51,51]
	movdqa	%xmm0, %xmm2
	psrlw	$2, %xmm0
	pand	%xmm1, %xmm2
	pand	%xmm1, %xmm0
	paddb	%xmm2, %xmm0
	movdqa	%xmm0, %xmm1
	psrlw	$4, %xmm1
	paddb	%xmm0, %xmm1
	pxor	%xmm0, %xmm0
	pand	.LCPI1_2, %xmm1
	psadbw	%xmm1, %xmm0
	movd	%xmm0, %eax
	andl	$1, %eax
	retl
[…]
	.ident	"clang version 10.0.0 "
[…]
OUCH: again the optimised code generated for __builtin_parityll() is a perfect and very convincing declaration of bankruptcy!

Compile the source file case12a.c a fourth time with Clang, again targeting the i386 platform, now without SSE support, and display the generated optimised assembly code:

clang -mno-sse -o- -O3 -S -target i386-pc-linux case12a.c
[…]
__paritysi2:				# @__paritysi2
# %bb.0:
	movl	4(%esp), %eax
	movl	%eax, %ecx
	shrl	$16, %ecx
	xorl	%eax, %ecx
	xorl	%eax, %eax
	xorb	%ch, %cl
	setnp	%al
	retl
[…]
__paritydi2:				# @__paritydi2
# %bb.0:
	movl	8(%esp), %ecx
	movl	4(%esp), %eax
	movl	%ecx, %edx
	shrl	%edx
	andl	$0x55555555, %edx
	subl	%edx, %ecx
	movl	%ecx, %edx
	shrl	$2, %ecx
	andl	$0x33333333, %edx
	andl	$0x33333333, %ecx
	addl	%edx, %ecx
	movl	%ecx, %edx
	shrl	$4, %edx
	addl	%ecx, %edx
	andl	$0x10F0F0F, %edx
	imull	$0x1010101, %edx, %ecx
	movl	%eax, %edx
	shrl	%edx
	shrl	$24, %ecx
	andl	$0x55555555, %edx
	subl	%edx, %eax
	movl	%eax, %edx
	shrl	$2, %eax
	andl	$0x33333333, %edx
	andl	$0x33333333, %eax
	addl	%edx, %eax
	movl	%eax, %edx
	shrl	$4, %edx
	addl	%eax, %edx
	andl	$0x10F0F0F, %edx
	imull	$0x1010101, %edx, %eax
	shrl	$24, %eax
	addl	%ecx, %eax
	andl	$1, %eax
	retl
[…]
	.ident	"clang version 10.0.0 "
[…]
OUCH: once more the optimised code generated for __builtin_parityll() is a perfect and very convincing declaration of bankruptcy!

__builtin_rotateleft*() and __builtin_rotateright*()

Create the text file case12b.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

long long __rotldi3(long long value, long long count) {
    return __builtin_rotateleft64(value, count);
}

long long __rotrdi3(long long value, long long count) {
    return __builtin_rotateright64(value, count);
}
Compile the source file case12b.c with Clang, engaging the optimiser, targeting the i386 platform, and display the generated assembly code:
clang -o- -O3 -S -target i386-pc-linux case12b.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…]
__rotldi3:				# @__rotldi3
# %bb.0:
	pushl	%ebp			#
	pushl	%ebx			#	push	ebx
	pushl	%edi			#
	pushl	%esi			#
	movb	28(%esp), %ch		#	mov	ecx, [esp+16]
	movl	20(%esp), %esi		#	mov	edx, [esp+12]
	movl	24(%esp), %edx		#	mov	eax, [esp+8]
	xorl	%ebp, %ebp		#	mov	ebx, edx
	movb	%ch, %cl		#	test	cl, 32
	movl	%edx, %edi		#	cmovnz	edx, eax
	movl	%esi, %ebx		#	cmovnz	eax, ebx
	movl	%esi, %eax		#	cmovnz	ebx, edx
	negb	%cl			#
	shrl	%cl, %edi		#	shld	edx, eax, cl
	shrdl	%cl, %edx, %ebx		#	shld	eax, ebx, cl
	testb	$32, %cl		#
	movb	%ch, %cl		#
	cmovnel	%edi, %ebx		#
	cmovnel	%ebp, %edi		#
	shll	%cl, %eax		#
	shldl	%cl, %esi, %edx		#
	testb	$32, %ch		#
	cmovnel	%eax, %edx		#
	cmovnel	%ebp, %eax		#
	orl	%ebx, %eax		#
	orl	%edi, %edx		#
	popl	%esi			#
	popl	%edi			#
	popl	%ebx			#	pop	ebx
	popl	%ebp			#
	retl				#	ret
[…]
__rotrdi3:				# @__rotrdi3
# %bb.0:
	pushl	%ebp			#
	pushl	%ebx			#	push	ebx
	pushl	%edi			#
	pushl	%esi			#
	movl	20(%esp), %esi		#	mov	eax, [esp+8]
	movl	24(%esp), %edx		#	mov	edx, [esp+12]
	movb	28(%esp), %cl		#	mov	ecx, [esp+16]
	xorl	%ebp, %ebp		#	mov	ebx, eax
					#	test	cl, 32
	movl	%edx, %edi		#	cmovnz	eax, edx
	movl	%esi, %ebx		#	cmovnz	edx, ebx
	movl	%esi, %eax		#	cmovnz	ebx, eax
	shrl	%cl, %edi		#	shrd	eax, edx, cl
	shrdl	%cl, %edx, %ebx		#	shrd	edx, ebx, cl
	testb	$32, %cl		#
	cmovnel	%edi, %ebx		#
	cmovnel	%ebp, %edi		#
	negb	%cl			#
	shll	%cl, %eax		#
	shldl	%cl, %esi, %edx		#
	testb	$32, %cl		#
	cmovnel	%eax, %edx		#
	cmovnel	%ebp, %eax		#
	orl	%ebx, %eax		#
	orl	%edi, %edx		#
	popl	%esi			#
	popl	%edi			#
	popl	%ebx			#	pop	ebx
	popl	%ebp			#
	retl				#	ret
[…]
	.ident	"clang version 10.0.0 "
[…]
OUCH: 31 instructions in 67 bytes for the __rotldi3() function, and 29 instructions in 63 bytes for the __rotrdi3() function, instead of only 13 instructions in 35 bytes for each of these functions – the optimised code generated for __builtin_rotate*() is yet another perfect and very convincing declaration of bankruptcy!

__builtin_mul_overflow()

Create the text file case12c.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

unsigned int __umulvsi3(unsigned int multiplicand, unsigned int multiplier) {
    unsigned int product;
    if (__builtin_mul_overflow(multiplicand, multiplier, &product))
        __builtin_trap();
    return product;
}

unsigned long long __umulvdi3(unsigned long long multiplicand, unsigned long long multiplier) {
    unsigned long long product;
    if (__builtin_mul_overflow(multiplicand, multiplier, &product))
        __builtin_trap();
    return product;
}

#ifdef __amd64__
__uint128_t __umulvti3(__uint128_t multiplicand, __uint128_t multiplier) {
    __uint128_t product;
    if (__builtin_mul_overflow(multiplicand, multiplier, &product))
        __builtin_trap();
    return product;
}
#endif
Compile the source file case12c.c with Clang, engaging the optimiser, targeting the AMD64 platform, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-linux case12c.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…]
__umulvsi3:				# @__umulvsi3
# %bb.0:
	movl	%edi, %eax
	mull	%esi
	jo	.LBB0_1
# %bb.2:
	retq
.LBB0_1:
	ud2
[…]
__umulvdi3:				# @__umulvdi3
# %bb.0:
	movq	%rdi, %rax
	mulq	%rsi
	jo	.LBB1_1
# %bb.2:
	retq
.LBB1_1:
	ud2
[…]
__umulvti3:				# @__umulvti3
# %bb.0:
	pushq	%rbx			#
	movq	%rdx, %r8		#	mov	r8, rdx
	movq	%rsi, %rax		#	mov	r9, rcx
	testq	%rcx, %rcx		#	neg	rcx
	setne	%dl			#	sbb	ecx, ecx
	testq	%rsi, %rsi		#	xor	eax, eax
	setne	%r9b			#	cmp	rax, rsi
	andb	%dl, %r9b		#	sbb	eax, eax
	mulq	%r8			#	and	ecx, eax
	movq	%rax, %r11		#
	seto	%r10b			#	mov	rax, rdi
	movq	%rcx, %rax		#	mul	r9
	mulq	%rdi			#	adc	ecx, ecx
	movq	%rax, %rsi		#	mov	r9, rax
	seto	%bl			#
	orb	%r10b, %bl		#	mov	rax, rsi
	orb	%r9b, %bl		#	mul	r8
	addq	%r11, %rsi		#	adc	ecx, ecx
	movq	%rdi, %rax		#	add	r9, rax
	mulq	%r8			#
	addq	%rsi, %rdx		#	mov	rax, rdi
	setb	%cl			#	mul	r8
	orb	%bl, %cl		#	add	rdx, r9
	cmpb	$1, %cl			#	adc	ecx, ecx
	je	.LBB2_1			#	jnz	.LBB2_1
# %bb.2:
	popq	%rbx			#
	retq				#	ret
.LBB2_1:
	ud2				#	ud2
[…]
	.ident	"clang version 10.0.0 "
[…]
Oops: the __umulvti3() function shows 28 instructions in 77 bytes instead of only 23 instructions in just 58 bytes, clobbers registers RBX and R10 without necessity, and performs 2 superfluous memory accesses plus 9 partial register writes which impair performance!
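The schoolbook decomposition that the hand-optimised column builds on can also be sketched in plain C (a sketch only; the helper name umul128_overflow and its signature are made up here and are not compiler-rt's):

```c
#include <stdint.h>

// Sketch: unsigned 128x128-bit multiplication with overflow check,
// decomposed into 64-bit halves. The product overflows iff both high
// halves are non-zero, a cross product exceeds 64 bits, or the final
// addition into the high half carries out.
int umul128_overflow(uint64_t a_lo, uint64_t a_hi,
                     uint64_t b_lo, uint64_t b_hi,
                     uint64_t *p_lo, uint64_t *p_hi) {
    int overflow = (a_hi != 0) && (b_hi != 0);
    unsigned __int128 cross1 = (unsigned __int128) a_hi * b_lo;
    unsigned __int128 cross2 = (unsigned __int128) a_lo * b_hi;
    overflow |= (uint64_t) (cross1 >> 64) != 0;
    overflow |= (uint64_t) (cross2 >> 64) != 0;
    unsigned __int128 low = (unsigned __int128) a_lo * b_lo;
    uint64_t mid;
    overflow |= __builtin_add_overflow((uint64_t) cross1, (uint64_t) cross2, &mid);
    overflow |= __builtin_add_overflow((uint64_t) (low >> 64), mid, p_hi);
    *p_lo = (uint64_t) low;
    return overflow;
}
```

Each of the three 64×64-bit multiplications maps to one MUL instruction; the overflow conditions are exactly the flags the hand-optimised code accumulates with SBB and ADC.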

Compile the source file case12c.c a second time with Clang, now targeting the i386 platform, and display the generated assembly code:

clang -o- -O3 -S -target i386-pc-linux case12c.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…]
__umulvsi3:				# @__umulvsi3
# %bb.0:
	movl	4(%esp), %eax
	mull	8(%esp)
	jo	.LBB0_1
# %bb.2:
	retl
.LBB0_1:
	ud2
[…]
__umulvdi3:				# @__umulvdi3
# %bb.0:
	pushl	%ebp			#
	pushl	%ebx			#	push	ebx
	pushl	%edi			#
	pushl	%esi			#
	movl	32(%esp), %esi		#	xor	ebx, ebx
	movl	24(%esp), %eax		#	mov	ecx, [esp+12]
	movl	20(%esp), %ebp		#	cmp	ebx, ecx
	testl	%esi, %esi		#	sbb	edx, edx
	setne	%dl			#	mov	eax, [esp+20]
	testl	%eax, %eax		#	cmp	ebx, eax
	setne	%cl			#	sbb	ebx, ebx
	andb	%dl, %cl		#	and	ebx, edx
	mull	28(%esp)		#	mul	[esp+8]
	movl	%eax, %edi		#
	movl	%esi, %eax		#	xchg	ecx, eax
	seto	%bl			#	adc	ebx, ebx
	mull	%ebp			#	mul	[esp+16]
	movl	%eax, %esi		#
	movl	%ebp, %eax		#
	seto	%ch			#	adc	ebx, ebx
	mull	28(%esp)		#	add	ecx, eax
	addl	%edi, %esi		#	mov	eax, [esp+8]
	orb	%bl, %ch		#	mul	[esp+16]
	addl	%esi, %edx		#	add	edx, ecx
	setb	%bl			#	adc	ebx, ebx
	orb	%ch, %bl		#
	orb	%cl, %bl		#
	cmpb	$1, %bl			#
	je	.LBB1_1			#	jnz	.LBB1_1
# %bb.2:
	popl	%esi			#
	popl	%edi			#
	popl	%ebx			#	pop	ebx
	popl	%ebp			#
	retl				#	ret
.LBB1_1:
	ud2				#	ud2
[…]
	.ident	"clang version 10.0.0 "
[…]
Ouch: the __umulvdi3() function shows 35 instructions in 77 bytes instead of only 23 instructions in just 54 bytes, clobbers registers EBP, EDI and ESI without necessity, and performs 5 excess memory accesses plus 9 partial register writes which impair performance!

__builtin_copysign()

Create the text file case12d.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

double copysign64(double destination, double source) {
    return __builtin_copysign(destination, source);
}
Compile the source file case12d.c with Clang, engaging the optimiser, targeting the i386 platform, and display the generated assembly code:
clang -o- -Os -S -target i386-pc-linux case12d.c
[…]
.LCPI0_0:
	.quad	-9223372036854775808	# double -0
	.quad	-9223372036854775808	# double -0
.LCPI0_1:
	.quad	9223372036854775807	# double NaN
	.quad	9223372036854775807	# double NaN
	.text
copysign64:				# @copysign64
# %bb.0:
	subl	$12, %esp
	movsd	24(%esp), %xmm0		# xmm0 = mem[0],zero
	movsd	16(%esp), %xmm1		# xmm1 = mem[0],zero
	andps	.LCPI0_0, %xmm0
	andps	.LCPI0_1, %xmm1
	orps	%xmm0, %xmm1
	movlps	%xmm1, (%esp)
	fldl	(%esp)
	addl	$12, %esp
	retl
[…]
	.ident	"clang version 10.0.0 "
[…]
Ouch: 10 instructions in 43 bytes, plus 32 bytes for 2 constants, whose access yields a page fault in the worst case.
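What the two 16-byte constants do becomes obvious in a plain C formulation of the same operation (a sketch using type punning via memcpy; the name copysign64_bits is made up here):

```c
#include <stdint.h>
#include <string.h>

// Sketch: copysign on an IEEE 754 double via integer bit operations.
// 0x8000000000000000 is the sign-bit mask applied to the source (the
// .LCPI0_0 constant above); its complement 0x7FFFFFFFFFFFFFFF keeps
// the magnitude of the destination (the .LCPI0_1 constant).
double copysign64_bits(double destination, double source) {
    uint64_t d, s;
    memcpy(&d, &destination, sizeof d);
    memcpy(&s, &source, sizeof s);
    d = (d & 0x7FFFFFFFFFFFFFFFULL) | (s & 0x8000000000000000ULL);
    memcpy(&destination, &d, sizeof d);
    return destination;
}
```

Since only one bit of the source is needed, the operation does not require 128-bit constants, SSE registers or memory-resident masks at all.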

Compile the source file case12d.c a second time with Clang, now without SSE support, and display the generated assembly code:

clang -mno-sse -o- -Os -S -target i386-pc-linux case12d.c
[…]
copysign64:				# @copysign64
# %bb.0:
	subl	$12, %esp
	fldl	16(%esp)
	fldl	24(%esp)
	fstpl	(%esp)
	fabs
	fld	%st(0)
	fchs
	testb	$-128, 7(%esp)
	fxch	%st(1)
	fcmovne	%st(1), %st
	fstp	%st(1)
	addl	$12, %esp
	retl
[…]
	.ident	"clang version 10.0.0 "
[…]
Oops: 13 instructions in 33 bytes, wasting 3 instructions that needlessly write the second argument to the stack.

Proper code uses only 5 instructions in just 19 bytes:

# Copyright © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.arch	generic32
.code32
.intel_syntax noprefix
.text

copysign64:
	mov	eax, [esp+16]
	shld	[esp+8], eax, 1
	ror	dword ptr [esp+8], 1
	fld	qword ptr [esp+4]
	ret

.size	copysign64, .-copysign64
.type	copysign64, @function
.global	copysign64
.end

Utterly devastating performance of overflow-checking 128×128-bit (and 64×64-bit) integer multiplication

Create the text file case13.c with the following content in an arbitrary, preferably empty directory:
// Copyright © 2015-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#include <stdint.h>
#include <stdio.h>
#include <time.h>

#ifdef OPTIMIZE
__attribute__ ((noinline))
static
__int128_t __muloti4(__int128_t multiplicand, __int128_t multiplier, int *overflow)
{
    __uint128_t product, sign = 0 - (multiplicand < 0), tmp = 0 - (multiplier < 0);

    *overflow = __builtin_mul_overflow((multiplicand ^ sign) - sign,
                                       (multiplier ^ tmp) - tmp,
                                       &product);
    sign ^= tmp;
    product = (product ^ sign) - sign;
    *overflow |= (__int128_t) (product ^ sign) < 0;

    return product;
}
#endif

__attribute__ ((always_inline))
static
__uint128_t lfsr128(void)
{
    // 128-bit linear feedback shift register (Galois form) using
    //  primitive polynomial 0x5DB2B62B0C5F8E1B:D8CCE715FCB2726D,
    //   initialised with 2**128 / golden ratio

    static __uint128_t lfsr = (__uint128_t) 0x9E3779B97F4A7C15 << 64 | 0xF39CC0605CEDC834;
    const  __uint128_t poly = (__uint128_t) 0x5DB2B62B0C5F8E1B << 64 | 0xD8CCE715FCB2726D;
    const  __uint128_t sign = 0 - __builtin_add_overflow(lfsr, lfsr, &lfsr);

    return lfsr ^= poly & sign;
}

__attribute__ ((always_inline))
static
__uint128_t lfsr64(void)
{
    // 64-bit linear feedback shift register (Galois form) using
    //  primitive polynomial 0xAD93D23594C935A9 (CRC-64 "Jones"),
    //   initialised with 2**64 / golden ratio

    static uint64_t lfsr = 0x9E3779B97F4A7C15;
    const  uint64_t sign = 0 - __builtin_add_overflow(lfsr, lfsr, &lfsr);

    return lfsr ^= 0xAD93D23594C935A9 & sign;
}

__attribute__ ((always_inline))
static
__uint128_t lfsr32(void)
{
    // 32-bit linear feedback shift register (Galois form) using
    //  primitive polynomial 0xDB710641 (CRC-32 IEEE),
    //   initialised with 2**32 / golden ratio

    static uint32_t lfsr = 0x9E3779B9;
    const  uint32_t sign = 0 - __builtin_add_overflow(lfsr, lfsr, &lfsr);

    return lfsr ^= 0xDB710641 & sign;
}

int main(void)
{
    clock_t t0, t1, t2, tt;
    uint32_t n;
    __int128_t multiplicand, multiplier = ~0;
    volatile __int128_t product;
    volatile int overflow;

    t0 = clock();

    for (n = 500000000u; n > 0u; n--)
    {
        multiplicand = lfsr128();
        product = multiplicand * multiplier;
        multiplier = lfsr64();
        product = multiplicand * multiplier;
    }

    t1 = clock();

    for (n = 500000000u; n > 0u; n--)
    {
        multiplicand = lfsr128();
        overflow = __builtin_mul_overflow(multiplicand, multiplier, &product);
        multiplier = lfsr64();
        overflow = __builtin_mul_overflow(multiplicand, multiplier, &product);
    }

    t2 = clock();
    tt = t2 - t0;
    t2 -= t1;
    t1 -= t0;
    t0 = t2 - t1;

    printf("\n"
           "__multi3()      %4lu.%06lu       0\n"
           "__muloti4()     %4lu.%06lu    %4lu.%06lu\n"
           "                %4lu.%06lu nano-seconds\n",
           t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);

    t0 = clock();

    for (n = 500000000u; n > 0u; n--)
    {
        multiplicand = lfsr128();
        product = multiplicand * multiplier;
        multiplier = lfsr32();
        product = multiplicand * multiplier;
    }

    t1 = clock();

    for (n = 500000000u; n > 0u; n--)
    {
        multiplicand = lfsr128();
        overflow = __builtin_mul_overflow(multiplicand, multiplier, &product);
        multiplier = lfsr32();
        overflow = __builtin_mul_overflow(multiplicand, multiplier, &product);
    }

    t2 = clock();
    tt = t2 - t0;
    t2 -= t1;
    t1 -= t0;
    t0 = t2 - t1;

    printf("\n"
           "__multi3()      %4lu.%06lu       0\n"
           "__muloti4()     %4lu.%06lu    %4lu.%06lu\n"
           "                %4lu.%06lu nano-seconds\n",
           t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);

    t0 = clock();

    for (n = 500000000u; n > 0u; n--)
    {
        multiplicand = lfsr64();
        product = multiplicand * multiplier;
        multiplier = lfsr64();
        product = multiplicand * multiplier;
    }

    t1 = clock();

    for (n = 500000000u; n > 0u; n--)
    {
        multiplicand = lfsr64();
        overflow = __builtin_mul_overflow(multiplicand, multiplier, &product);
        multiplier = lfsr64();
        overflow = __builtin_mul_overflow(multiplicand, multiplier, &product);
    }

    t2 = clock();
    tt = t2 - t0;
    t2 -= t1;
    t1 -= t0;
    t0 = t2 - t1;

    printf("\n"
           "__multi3()      %4lu.%06lu       0\n"
           "__muloti4()     %4lu.%06lu    %4lu.%06lu\n"
           "                %4lu.%06lu nano-seconds\n",
           t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);

    t0 = clock();

    for (n = 500000000u; n > 0u; n--)
    {
        multiplicand = lfsr32();
        product = multiplicand * multiplier;
        multiplier = lfsr32();
        product = multiplicand * multiplier;
    }

    t1 = clock();

    for (n = 500000000u; n > 0u; n--)
    {
        multiplicand = lfsr32();
        overflow = __builtin_mul_overflow(multiplicand, multiplier, &product);
        multiplier = lfsr32();
        overflow = __builtin_mul_overflow(multiplicand, multiplier, &product);
    }

    t2 = clock();
    tt = t2 - t0;
    t2 -= t1;
    t1 -= t0;
    t0 = t2 - t1;

    printf("\n"
           "__multi3()      %4lu.%06lu       0\n"
           "__muloti4()     %4lu.%06lu    %4lu.%06lu\n"
           "                %4lu.%06lu nano-seconds\n",
           t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);
}
Note: modification of the source file case13.c to demonstrate the equally devastating (mis)performance of the overflow-checking 64×64-bit integer multiplication on 32-bit processors is left as an exercise to the reader.

Run the following command lines to compile, link and run the benchmark program:

lscpu
clang -O3 -rtlib=compiler-rt case13.c
echo 'LLVM/clang/compiler-rt'
./a.out
clang -DOPTIMIZE -O3 case13.c
echo 'LLVM/clang'
./a.out

Runtime measurement on AMD® EPYC 7262

Note: for better readability and to ease their comparison, the numbers are shown in two columns.
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             1
NUMA node(s):          1
Vendor ID:             AuthenticAMD
CPU family:            23
Model:                 49
Model name:            AMD EPYC 7262 8-Core Processor
Stepping:              0
CPU MHz:               3193.726
BogoMIPS:              6387.45
Virtualization:        AMD-V
L1d cache:             32K
L1i cache:             32K
L2 cache:              512K
L3 cache:              16384K
NUMA node0 CPU(s):     0-15
[…]
                    LLVM/clang/compiler-rt         LLVM/clang

__multi3()         4.190000       0               4.180000       0
__muloti4()      238.320000     234.130000        7.170000       2.990000
                 242.510000 nano-seconds         11.350000 nano-seconds

__multi3()         4.050000       0               4.050000       0
__muloti4()      348.200000     344.150000        7.230000       3.180000
                 352.250000 nano-seconds         11.280000 nano-seconds

__multi3()         3.900000       0               3.880000       0
__muloti4()      232.630000     228.730000        8.420000       4.540000
                 236.530000 nano-seconds         12.300000 nano-seconds

__multi3()         3.850000       0               3.850000       0
__muloti4()        4.240000       0.390000        4.230000       0.380000
                   8.090000 nano-seconds          8.080000 nano-seconds
Note: the overhead for the pseudo-random number generators is negligible here!

Oops: with the properly implemented __muloti4() function, which is called by __builtin_mul_overflow() for signed 128-bit integers, overflow-checking multiplication runs about 2 times slower than unchecked multiplication.

OUCH: with the __muloti4() function provided in the compiler-rt library, overflow-checking multiplication runs about 2 orders of magnitude slower than unchecked multiplication – an utterly devastating result, proving the miserable implementation of Clang and its runtime library again!
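The strip-the-signs technique used by the OPTIMIZE variant above scales down to 64 bits unchanged; a compact sketch, with the final range check written out explicitly (the name mulo64 and this formulation of the check are made up here, not compiler-rt's):

```c
#include <stdint.h>

// Sketch: signed 64x64-bit multiplication with overflow check, reduced
// to the unsigned builtin. Strip the signs, multiply the magnitudes,
// then verify the magnitude fits the signed result before reapplying
// the sign of the product.
static int64_t mulo64(int64_t multiplicand, int64_t multiplier, int *overflow)
{
    uint64_t magnitude;
    uint64_t sign = 0 - (uint64_t) (multiplicand < 0);
    uint64_t tmp  = 0 - (uint64_t) (multiplier < 0);

    *overflow = __builtin_mul_overflow(((uint64_t) multiplicand ^ sign) - sign,
                                       ((uint64_t) multiplier ^ tmp) - tmp,
                                       &magnitude);
    sign ^= tmp;    // sign of the final product
    // a non-negative product must fit 63 bits; a negative one may
    // additionally be exactly 2**63 (INT64_MIN)
    *overflow |= magnitude > (uint64_t) INT64_MAX + (sign != 0);

    return (int64_t) ((magnitude ^ sign) - sign);
}
```

On AMD64 the whole function compiles to one MUL plus a handful of flag operations, i.e. it needs no call into a runtime library at all.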

Which optimiser?

Create the text file case14.c with the following content, presented by Matt Godbolt as example (t) in his ACM Queue article Optimizations in C++ Compilers – A practical journey, in an arbitrary, preferably empty directory:
bool isWhitespace(char c)
{
    return c == ' '
      || c == '\r'
      || c == '\n'
      || c == '\t';
}
Compile the source file case14.c with Clang, engaging the optimiser, targeting the AMD64 platform, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-linux case14.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…]
isWhitespace:				# @isWhitespace
# %bb.0:
	cmpb	$32, %dil		#
	ja	.LBB0_2			#
# %bb.1:
	movl	$1, %eax		#
	movzbl	%dil, %ecx		#	mov	ecx, edi
	movabsq	$0x100002400, %rdx	#	mov	rax, 100002600h
	btq	%rcx, %rdx		#	shr	rax, cl
	jae	.LBB0_2			#
# %bb.3:
	retq				#
.LBB0_2:				#
	xorl	%eax, %eax		#	cmp	cl, 33
	cmpb	$9, %dil		#	sbb	ecx, ecx
					#	neg	ecx
	sete	%al			#	and	eax, ecx
	retq				#	ret
[…]
	.ident	"clang version 10.0.0 "
[…]
OOPS: 12 instructions in 42 bytes, including 2 conditional branches which impair performance, plus register RDX clobbered, instead of only 8 instructions in just 19 bytes, and without (superfluous) conditional branches.

Compile the source file case14.c a second time with Clang, now targeting the i386 platform, and display the generated assembly code:

clang -o- -O3 -S -target i386-pc-linux case14.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…]
isWhitespace:				# @isWhitespace
# %bb.0:
	pushl	%esi			#
	movb	8(%esp), %cl		#	mov	ecx, [esp+4]
	movl	%ecx, %edx		#
	addb	$-10, %dl		#
	cmpb	$22, %dl		#
	ja	.LBB0_2			#
# %bb.1:
	movzbl	%dl, %edx		#	xor	eax, eax
	movl	$0x400009, %esi		#	cmp	eax, ecx
	movl	$1, %eax		#	adc	eax, 2600h
	btl	%edx, %esi		#	shr	eax, cl
	jae	.LBB0_2			#
# %bb.3:
	popl	%esi			#
	retl				#
.LBB0_2:
	xorl	%eax, %eax		#	xor	edx, edx
	cmpb	$9, %cl			#	cmp	ecx, 33
	sete	%al			#	adc	edx, edx
	popl	%esi			#	and	eax, edx
	retl				#	ret
[…]
	.ident	"clang version 10.0.0 "
[…]
OUCH: 18 instructions in 45 bytes, including 2 conditional branches which impair performance, plus register ESI clobbered, instead of only 10 instructions in just 25 bytes, again without (superfluous) conditional branches, and no non-volatile register clobbered.
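The lookup-mask idiom shown in the right columns can also be expressed directly in C (a sketch; the name isWhitespaceMask is made up here to avoid clashing with the original function):

```c
#include <stdbool.h>

// Sketch: whitespace test via a 33-bit lookup mask. Bits 9 ('\t'),
// 10 ('\n'), 13 ('\r') and 32 (' ') are set in 0x100002600; the range
// check also guards the shift, whose count must stay below 64.
bool isWhitespaceMask(char c) {
    unsigned int cc = (unsigned char) c;
    return cc <= 32 && ((0x100002600ULL >> cc) & 1u);
}
```

The range check and the mask test each reduce to a compare plus a flag operation, which is how the hand-optimised columns avoid conditional branches entirely.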

The optimiser fails – take 2

Create the text file case15.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2018-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

long long __absvdi2(long long value) {
#ifdef ALTERNATE // undefined behaviour
    value = __builtin_llabs(value);
    if (value < 0)
        __builtin_trap();
    return value;
#else
    long long sign = 0 - (value < 0);
    if (__builtin_saddll_overflow(value, sign, &value))
        __builtin_trap();
    return value ^ sign;
#endif
}

long long __addvdi3(long long augend, long long addend) {
    long long sum;
    if (__builtin_saddll_overflow(augend, addend, &sum))
        __builtin_trap();
    return sum;
}

long long __mulvdi3(long long multiplicand, long long multiplier) {
    long long product;
    if (__builtin_smulll_overflow(multiplicand, multiplier, &product))
        __builtin_trap();
    return product;
}

long long __negvdi2(long long negend) {
    long long negation;
    if (__builtin_ssubll_overflow(0, negend, &negation))
        __builtin_trap();
    return negation;
}

long long __subvdi3(long long minuend, long long subtrahend) {
    long long difference;
    if (__builtin_ssubll_overflow(minuend, subtrahend, &difference))
        __builtin_trap();
    return difference;
}
Compile the source file case15.c with Clang, engaging the optimiser, targeting the i386 platform, and display the generated assembly code:
clang -o- -O3 -S -target i386-pc-linux case15.c
Note: the left column shows the generated insane code, while the right column shows properly optimised code as comment.
[…]
__absvdi2:				# @__absvdi2
# %bb.0:
	pushl	%ebx			#
	pushl	%esi			#
	pushl	%eax			#
	movl	20(%esp), %esi		#	mov	eax, [esp+8]
	movl	16(%esp), %eax		#	mov	ecx, [esp+4]
	movl	%esi, %ecx		#
	movl	%esi, %edx		#
	sarl	$31, %ecx		#	cdq
	addl	%ecx, %eax		#	add	ecx, edx
	adcl	%ecx, %edx		#	adc	eax, edx
	setns	%bl			#
	testl	%esi, %esi		#
	setns	%bh			#
	cmpb	%bl, %bh		#
	setne	3(%esp)			#
	testl	%ecx, %ecx		#
	setns	%bl			#
	cmpb	%bl, %bh		#
	sete	%bl			#
	andb	3(%esp), %bl		#
	cmpb	$1, %bl			#
	je	.LBB0_1			#	into
# %bb.2:
	xorl	%ecx, %eax		#	xor	ecx, edx
	xorl	%ecx, %edx		#	xor	edx, eax
	addl	$4, %esp		#	mov	eax, ecx
	popl	%esi			#
	popl	%ebx			#
	retl				#	ret
.LBB0_1:
	ud2				#
[…]
__addvdi3:				# @__addvdi3
# %bb.0:
	pushl	%ebx			#
	pushl	%esi			#
	movl	12(%esp), %eax		#	mov	eax, [esp+4]
	movl	16(%esp), %esi		#	mov	edx, [esp+8]
	movl	24(%esp), %ecx		#	add	eax, [esp+12]
	addl	20(%esp), %eax		#	adc	edx, [esp+16]
	movl	%esi, %edx		#
	adcl	%ecx, %edx		#
	setns	%bl			#
	testl	%esi, %esi		#
	setns	%bh			#
	cmpb	%bl, %bh		#
	setne	%bl			#
	testl	%ecx, %ecx		#
	setns	%cl			#
	cmpb	%cl, %bh		#
	sete	%cl			#
	andb	%bl, %cl		#
	cmpb	$1, %cl			#
	je	.LBB1_1			#	into
# %bb.2:
	popl	%esi			#
	popl	%ebx			#
	retl				#	ret
.LBB1_1:
	ud2				#
[…]
__mulvdi3:				# @__mulvdi3
# %bb.0:
	pushl	%edi			#
	pushl	%esi			#
	pushl	%eax			#	push	ebx
	movl	16(%esp), %eax		#	mov	ebx, [esp+16]
	movl	24(%esp), %edx		#	mov	eax, [esp+8]
	movl	20(%esp), %ecx		#	mul	ebx
	movl	28(%esp), %esi		#	push	eax
	movl	$0, (%esp)		#	mov	ecx, edx
	subl	$12, %esp		#	mov	eax, [esp+16]
	leal	12(%esp), %edi		#	mul	ebx
	pushl	%edi			#	xor	ebx, ebx
	pushl	%esi			#	add	ecx, eax
	pushl	%edx			#	adc	ebx, edx
	pushl	%ecx			#	mov	eax, [esp+12]
	pushl	%eax			#	mul	dword ptr [esp+24]
	calll	__mulodi4		#	add	ecx, eax
	addl	$32, %esp		#	adc	ebx, edx
	cmpl	$0, (%esp)		#	push	ecx
					#	sbb	ecx, ecx
					#	mov	eax, [esp+20]
					#	mul	dword ptr [esp+28]
					#	neg	ecx
					#	add	eax, ebx
					#	adc	edx, ecx
					#	xor	ebx, ebx
					#	cmp	ebx, [esp+28]
					#	jle	0f
					#	not	ebx
					#	sub	eax, [esp+16]
					#	sbb	edx, [esp+20]
					# 0:
					#	xor	ecx, ecx
					#	cmp	ecx, [esp+20]
					#	jle	1f
					#	not	ebx
					#	sub	eax, [esp+24]
					#	sbb	edx, [esp+28]
					# 1:
					#	xor	eax, ebx
					#	xor	edx, ebx
					#	or	eax, edx
					#	pop	edx
					#	shld	ecx, edx, 1
					#	add	ebx, ecx
					#	or	eax, ebx
	jne	.LBB2_1			#	jnz	.LBB2_1
					#	pop	eax
# %bb.2:
	addl	$4, %esp		#	pop	ebx
	popl	%esi			#
	popl	%edi			#
	retl				#	ret
.LBB2_1:
	ud2				#	ud2
[…]
__negvdi2:				# @__negvdi2
# %bb.0:
	xorl	%eax, %eax		#	mov	eax, [esp+4]
	xorl	%edx, %edx		#	xor	edx, edx
	movl	8(%esp), %ecx		#
	subl	4(%esp), %eax		#	neg	eax
	sbbl	%ecx, %edx		#	sbb	edx, [esp+8]
	testl	%edx, %ecx		#
	js	.LBB3_1			#	into
# %bb.2:
	retl				#	ret
.LBB3_1:
	ud2				#
[…]
__subvdi3:				# @__subvdi3
# %bb.0:
	pushl	%ebx			#
	pushl	%esi			#
	movl	12(%esp), %eax		#	mov	eax, [esp+4]
	movl	16(%esp), %esi		#	mov	edx, [esp+8]
	movl	24(%esp), %ecx		#	sub	eax, [esp+12]
	subl	20(%esp), %eax		#	sbb	edx, [esp+16]
	movl	%esi, %edx		#
	sbbl	%ecx, %edx		#
	setns	%bl			#
	testl	%esi, %esi		#
	setns	%bh			#
	cmpb	%bl, %bh		#
	setne	%bl			#
	testl	%ecx, %ecx		#
	setns	%cl			#
	cmpb	%cl, %bh		#
	setne	%cl			#
	andb	%bl, %cl		#
	cmpb	$1, %cl			#
	je	.LBB4_1			#	into
# %bb.2:
	popl	%esi			#
	popl	%ebx			#
	retl				#	ret
.LBB4_1:
	ud2				#
[…]
	.ident	"clang version 10.0.0 "
[…]
OUCH: 29 instructions in 68 bytes instead of only 10 instructions in just 21 bytes for the __absvdi2() function, and 24 instructions in 47 bytes instead of only 6 instructions in just 18 bytes for each of the __addvdi3() and __subvdi3() functions!

OOPS: 24 instructions in 60 bytes, plus 98 instructions in 266 bytes of the called __mulodi4() function, instead of only 46 instructions in just 113 bytes for the __mulvdi3() function.

Note: exploration of the equally bad code generated for the __absvsi2(), __addvsi3(), __mulvsi3() and __subvsi3() functions as well as the __absvti2(), __addvti3(), __mulvti3() and __subvti3() functions is left as an exercise to the reader.
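In isolation, the sign-mask idiom behind __absvdi2() reads as follows (a sketch; note that for LLONG_MIN the final subtraction would overflow, which is exactly the case the __builtin_saddll_overflow() check in case15.c catches):

```c
// Sketch: branch-free absolute value via a sign mask. sign is 0 for
// non-negative values and -1 (all bits set) for negative ones;
// (value ^ sign) - sign therefore negates exactly the negative inputs.
long long sign_mask_abs(long long value) {
    long long sign = 0 - (value < 0);
    return (value ^ sign) - sign;
}
```

On i386 this is the CDQ/ADD/ADC/XOR sequence shown in the right column of __absvdi2() above.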

The optimiser fails – take 3

Create the text file case16.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

unsigned int reverse(unsigned int value) {
#ifndef ALTERNATE
    // swap all adjacent bits
    value = ((value &  0xAAAAAAAAU) >> 1)
          | ((value & ~0xAAAAAAAAU) << 1);
    // swap adjacent pairs of bits
    value = ((value &  0xCCCCCCCCU) >> 2)
          | ((value & ~0xCCCCCCCCU) << 2);
    // swap adjacent nibbles
    value = ((value &  0xF0F0F0F0U) >> 4)
          | ((value & ~0xF0F0F0F0U) << 4);
    // swap adjacent octets
    value = ((value &  0xFF00FF00U) >> 8)
          | ((value & ~0xFF00FF00U) << 8);
    // swap high and low part
    value = ((value &  0xFFFF0000U) >> 16)
          | ((value & ~0xFFFF0000U) << 16);
#else
    value = ((value <<  1) &  0xAAAAAAAAU)
          | ((value >>  1) & ~0xAAAAAAAAU);
    value = ((value <<  2) &  0xCCCCCCCCU)
          | ((value >>  2) & ~0xCCCCCCCCU);
    value = ((value <<  4) &  0xF0F0F0F0U)
          | ((value >>  4) & ~0xF0F0F0F0U);
    value = ((value <<  8) &  0xFF00FF00U)
          | ((value >>  8) & ~0xFF00FF00U);
    value = ((value << 16) &  0xFFFF0000U)
          | ((value >> 16) & ~0xFFFF0000U);
#endif
    return value;
}
Compile the source file case16.c with Clang, engaging the optimiser, targeting the i386 platform, and display the generated assembly code:
clang -o- -O3 -S -target i386-pc-linux case16.c
Note: the left column shows the generated code, while the right column shows only the replacement for properly optimised code as comment.
[…]
reverse:				# @reverse
# %bb.0:
	movl	4(%esp), %eax
	movl	%eax, %ecx
	andl	$0xD5555555, %eax
	shrl	%ecx
	andl	$0x55555555, %ecx
	leal	(%ecx,%eax,2), %eax
	movl	%eax, %ecx
	andl	$0xF3333333, %eax
	shrl	$2, %ecx
	andl	$0x33333333, %ecx
	leal	(%ecx,%eax,4), %eax
	movl	%eax, %ecx
	shll	$4, %eax
	shrl	$4, %ecx
	andl	$0xF0F0F0F0, %eax
	andl	$0x0F0F0F0F, %ecx
	orl	%ecx, %eax
	movl	%eax, %ecx		#	bswap	%eax
	shll	$8, %eax		#
	shrl	$8, %ecx		#
	andl	$0xFF00FF00, %eax	#
	andl	$0x00FF00FF, %ecx	#
	orl	%ecx, %eax		#
	roll	$16, %eax		#
	retl
[…]
	.ident	"clang version 10.0.0 "
[…]
OUCH: the optimiser fails to recognise this common and well-known idiom for endian conversion!

Note: exploration of the code generated with the preprocessor macro ALTERNATE defined is left as an exercise to the reader.
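The last two swap steps together are exactly an endian conversion, so the whole function can be written with the builtin the optimiser should have substituted itself (a sketch, assuming GCC/Clang's __builtin_bswap32(); the name reverse_bswap is made up here):

```c
// Sketch: bit reversal with the final octet and half-word swaps
// collapsed into a single byte swap.
unsigned int reverse_bswap(unsigned int value) {
    // swap all adjacent bits
    value = ((value & 0xAAAAAAAAU) >> 1) | ((value & 0x55555555U) << 1);
    // swap adjacent pairs of bits
    value = ((value & 0xCCCCCCCCU) >> 2) | ((value & 0x33333333U) << 2);
    // swap adjacent nibbles
    value = ((value & 0xF0F0F0F0U) >> 4) | ((value & 0x0F0F0F0FU) << 4);
    // octet and half-word swap in one step: a single BSWAP instruction
    return __builtin_bswap32(value);
}
```

This is precisely the replacement shown in the right column: one BSWAP instead of seven shift/mask/or instructions.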

The optimiser fails – take 4

Create the text file case17.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2004-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

int __ucmpti2(__uint128_t a, __uint128_t b) {
    if (a < b)
        return 0;
    if (a > b)
        return 2;
    return 1;
}
Compile the source file case17.c with Clang, engaging the optimiser, targeting the AMD64 platform, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-linux case17.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…]
__ucmpti2:				# @__ucmpti2
# %bb.0:
	cmpq	%rdi, %rdx		#	cmp	rdi, rdx
	movq	%rcx, %rax		#	mov	rax, rsi
	sbbq	%rsi, %rax		#	sbb	rax, rcx
	movl	$1, %r8d		#
	adcl	$0, %r8d		#
	xorl	%eax, %eax		#	sbb	eax, eax
	cmpq	%rdx, %rdi		#	cmp	rdx, rdi
	sbbq	%rcx, %rsi		#	sbb	rcx, rsi
	cmovael	%r8d, %eax		#	adc	eax, 1
	retq				#	ret
[…]
	.ident	"clang version 10.0.0 "
[…]
Oops: 10 instructions in 32 bytes, including a superfluous conditional move which impairs performance, instead of 8 instructions in 21 bytes, without conditional move.
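The same branch- and cmov-free result can be obtained directly at C level (a sketch; the name ucmp128 is made up here):

```c
// Sketch: three-way unsigned comparison returning 0, 1 or 2 without a
// conditional move or branch: each relational expression yields 0 or 1,
// and their sum encodes less (0), equal (1) and greater (2).
int ucmp128(unsigned __int128 a, unsigned __int128 b) {
    return (a >= b) + (a > b);
}
```

Each comparison maps to a CMP/SBB pair on AMD64, which is the pattern the hand-optimised column accumulates with SBB and ADC.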

The optimiser fails – take 5

Create the text file case18.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2004-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

int __ucmpti2(__uint128_t a, __uint128_t b) {
    if (a > b)
        return 2;
    if (a == b)
        return 1;
    return 0;
}
Compile the source file case18.c with Clang, engaging the optimiser, targeting the AMD64 platform with the Microsoft calling convention, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-windows case18.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…]
__ucmpti2:				# @__ucmpti2
# %bb.0:
	movdqa	(%rcx), %xmm0		#	mov	r8, [rcx]
	movq	(%rdx), %r8		#	mov	r9, [rcx+8]
	pcmpeqb	(%rdx), %xmm0		#	mov	rcx, [rdx]
	pmovmskb	%xmm0, %eax	#	mov	rdx, [rdx+8]
	xorl	%r9d, %r9d		#	cmp	r8, rcx
	cmpl	$0xFFFF, %eax		#	mov	rax, r9
	sete	%r9b			#	sbb	rax, rdx
	cmpq	(%rcx), %r8		#	sbb	eax, eax
	movq	8(%rdx), %rax		#	cmp	rcx, r8
	sbbq	8(%rcx), %rax		#	sbb	rdx, r9
	movl	$2, %eax		#	adc	eax, 1
	cmovael %r9d, %eax		#
	retq				#	ret
[…]
	.ident	"clang version 10.0.0 "
[…]
Oops: 13 instructions in 48 bytes, including a superfluous conditional move which impairs performance, needlessly using the SSE register XMM0, and performing 6 memory accesses, instead of 12 instructions in just 35 bytes, without conditional move and with only 4 memory accesses.

The optimiser fails – take 6

Create the text file case19.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

int __clzti2(__uint128_t value) {
    if ((value >> 64) != 0)
        return __builtin_clzll(value >> 64);
    if ((value & ~0ULL) != 0)
        return __builtin_clzll(value) + 64;
    return 128;
}

int __ctzti2(__uint128_t value) {
    if ((value & ~0ULL) != 0)
        return __builtin_ctzll(value);
    if ((value >> 64) != 0)
        return __builtin_ctzll(value >> 64) + 64;
    return 128;
}
Compile the source file case19.c with Clang, engaging the optimiser, targeting the AMD64 platform, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-linux case19.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…]
__clzti2:				# @__clzti2
# %bb.0:
	testq	%rsi, %rsi		#	bsr	rax, rsi
	je	.LBB0_2			#	jz	.LBB0_2
# %bb.1:
	bsrq	%rsi, %rax		#
	xorl	$63, %eax		#	xor	eax, 63
	retq				#	ret
.LBB0_2:
	testq	%rdi, %rdi		#	bsr	rax, rdi
	je	.LBB0_3			#	jz	.LBB0_3
# %bb.4:
	bsrq	%rdi, %rax		#
	xorl	$63, %eax		#	xor	eax, 127
	orl	$64, %eax		#
	retq				#	ret
.LBB0_3:
	movl	$128, %eax		#	mov	eax, 128
	retq				#	ret
[…]
__ctzti2:				# @__ctzti2
# %bb.0:
	testq	%rdi, %rdi		#
	je	.LBB1_2			#
# %bb.1:
	bsfq	%rdi, %rax		#	bsf	rax, rdi
	retq				#	jnz	.return
.LBB1_2:
	testq	%rsi, %rsi		#
	je	.LBB1_3			#
# %bb.4:
	bsfq	%rsi, %rax		#	bsf	rax, rsi
					#	jz	.LBB1_3
	orl	$64, %eax		#	or	eax, 64
					# .return:
	retq				#	ret
.LBB1_3:
	movl	$128, %eax		#	mov	eax, 128
	retq				#	ret
[…]
	.ident	"clang version 10.0.0 "
[…]
Oops: 13 instructions in 35 bytes instead of 10 instructions in 26 bytes for the __clzti2() function, and 11 instructions in 30 bytes instead of 8 instructions in 22 bytes for the __ctzti2() function.
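For reference, the semantics both helpers must implement can be pinned down with a few test values (function bodies repeated from case19.c, renamed clz128/ctz128 to avoid colliding with the library symbols):

```c
// Reference semantics of the two count-leading/trailing-zero helpers.
static int clz128(__uint128_t value) {
    if ((value >> 64) != 0)
        return __builtin_clzll((unsigned long long) (value >> 64));
    if ((value & ~0ULL) != 0)
        return __builtin_clzll((unsigned long long) value) + 64;
    return 128;
}

static int ctz128(__uint128_t value) {
    if ((value & ~0ULL) != 0)
        return __builtin_ctzll((unsigned long long) value);
    if ((value >> 64) != 0)
        return __builtin_ctzll((unsigned long long) (value >> 64)) + 64;
    return 128;
}
// Expected: clz128(0) == 128, clz128(1) == 127, ctz128(1) == 0,
//           ctz128((__uint128_t) 1 << 127) == 127.
```

The hand-optimised columns exploit that BSR and BSF set ZF when their source operand is zero, so the separate TEST instructions Clang emits are redundant.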

The optimiser fails – take 7

Create the text file case20.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifndef __amd64__
long long __mindi3(long long a, long long b) {
    return a < b ? a : b;
}

unsigned long long __umindi3(unsigned long long a, unsigned long long b) {
    return a < b ? a : b;
}
#else
__int128_t __maxti3(__int128_t a, __int128_t b) {
    return a > b ? a : b;
}

__uint128_t __umaxti3(__uint128_t a, __uint128_t b) {
    return a > b ? a : b;
}
#endif
Compile the source file case20.c with Clang, engaging the optimiser, targeting the AMD64 platform, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-linux case20.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…]
__maxti3:				# @__maxti3
# %bb.0:
	movq	%rdx, %rax		#	mov	rax, rdx
	cmpq	%rdx, %rdi		#	cmp	rdx, rdi
	movq	%rsi, %rdx		#	mov	rdx, rcx
	sbbq	%rcx, %rdx		#	sbb	rcx, rsi
	cmovgeq	%rdi, %rax		#	cmovl	rax, rdi
	cmovgeq	%rsi, %rcx		#	cmovl	rdx, rsi
	movq	%rcx, %rdx		#
	retq				#	ret
[…]
__umaxti3:				# @__umaxti3
# %bb.0:
	movq	%rdx, %rax		#	mov	rax, rdx
	cmpq	%rdi, %rdx		#	cmp	rdx, rdi
	movq	%rcx, %rdx		#	mov	rdx, rcx
	sbbq	%rsi, %rdx		#	sbb	rcx, rsi
	cmovbq	%rdi, %rax		#	cmovb	rax, rdi
	cmovbq	%rsi, %rcx		#	cmovb	rdx, rsi
	movq	%rcx, %rdx		#
	retq				#	ret
[…]
	.ident	"clang version 10.0.0 "
[…]
Oops: 8 instructions in 24 bytes instead of 7 instructions in 21 bytes for the __maxti3() and __umaxti3() functions.

Note: exploration of the code generated for the __minti3() and __uminti3() functions on the AMD64 platform is left as an exercise to the reader.

Compile the source file case20.c a second time with Clang, now targeting the i386 platform, and display the generated assembly code:

clang -o- -O3 -S -target i386-pc-linux case20.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…]
__mindi3:				# @__mindi3
# %bb.0:
	pushl	%edi			#
	pushl	%esi			#
	movl	24(%esp), %edx		#	mov	edx, [esp+16]
	movl	12(%esp), %ecx		#
	movl	20(%esp), %eax		#	mov	eax, [esp+12]
	movl	16(%esp), %esi		#
	cmpl	%ecx, %eax		#	cmp	eax, [esp+4]
	movl	%edx, %edi		#	mov	ecx, edx
	sbbl	%esi, %edi		#	sbb	ecx, [esp+8]
	cmovgel	%ecx, %eax		#	cmovge	eax, [esp+4]
	cmovgel	%esi, %edx		#	cmovge	edx, [esp+8]
	popl	%esi			#
	popl	%edi			#
	retl				#	ret
[…]
__umindi3:				# @__umindi3
# %bb.0:
	pushl	%edi			#
	pushl	%esi			#
	movl	16(%esp), %esi		#
	movl	20(%esp), %eax		#	mov	eax, [esp+4]
	movl	12(%esp), %ecx		#
	movl	24(%esp), %edx		#	mov	edx, [esp+8]
	cmpl	%eax, %ecx		#	cmp	eax, [esp+12]
	movl	%esi, %edi		#	mov	ecx, edx
	sbbl	%edx, %edi		#	sbb	ecx, [esp+16]
	cmovbl	%ecx, %eax		#	cmovnb	eax, [esp+12]
	cmovbl	%esi, %edx		#	cmovnb	edx, [esp+16]
	popl	%esi			#
	popl	%edi			#
	retl				#	ret
[…]
	.ident	"clang version 10.0.0 "
[…]
Ouch: 14 instructions in 33 bytes, clobbering the registers EDI and ESI without necessity, instead of only 8 instructions in 29 bytes for the __mindi3() and __umindi3() functions.

Note: exploration of the equally bad code generated for the __maxdi3() and __umaxdi3() functions on the i386 platform is left as an exercise to the reader.

The optimiser fails – take 8

Create the text file case21.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

unsigned int __rotlsi3(unsigned int value, int count) {
    return (value << (31 & count))
         | (value >> (31 & -count));
}

unsigned int __rotrsi3(unsigned int value, int count) {
    return (value >> (31 & count))
         | (value << (31 & -count));
}

unsigned long long __rotldi3(unsigned long long value, int count) {
    return (value << (63 & count))
         | (value >> (63 & -count));
}

unsigned long long __rotrdi3(unsigned long long value, int count) {
    return (value >> (63 & count))
         | (value << (63 & -count));
}

#ifdef __amd64__
__uint128_t __rotlti3(__uint128_t value, int count) {
    return (value << (127 & count))
         | (value >> (127 & -count));
}

__uint128_t __rotrti3(__uint128_t value, int count) {
#ifdef OPTIMIZE
    __asm__("movq\t%[low], %%rax\n\t"
            "shrdq\t%%cl, %[high], %%rax\n\t"
            "shrdq\t%%cl, %[low], %[high]\n\t"
            "movq\t%[high], %%rdx\n\t"
            "test\t$64, %%cl\n\t"
            "cmovnz\t%%rax, %%rdx\n\t"
            "cmovnz\t%[high], %%rax"
           :"=A" (value)
           :"c" (count),
            [low] "r" ((unsigned long long) (value & ~0ULL)),
            [high] "r" ((unsigned long long) (value >> 64))
           :"cc");
    return value;
#else
    return (value >> (127 & count))
         | (value << (127 & -count));
#endif
}
#endif
Compile the source file case21.c with Clang, engaging the optimiser, targeting the AMD64 platform, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-linux case21.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…]
__rotlsi3:				# @__rotlsi3
# %bb.0:
	movl	%esi, %ecx
	movl	%edi, %eax
	roll	%cl, %eax
	retq
[…]
__rotrsi3:				# @__rotrsi3
# %bb.0:
	movl	%esi, %ecx
	movl	%edi, %eax
	rorl	%cl, %eax
	retq
[…]
__rotldi3:				# @__rotldi3
# %bb.0:
	movl	%esi, %ecx
	movq	%rdi, %rax
	rolq	%cl, %rax
	retq
[…]
__rotrdi3:				# @__rotrdi3
# %bb.0:
	movl	%esi, %ecx
	movq	%rdi, %rax
	rorq	%cl, %rax
	retq
[…]
__rotlti3:				# @__rotlti3
# %bb.0:
	movl	%edx, %ecx		#	mov	ecx, edx
	movq	%rsi, %rdx		#
	movq	%rdi, %rax		#	mov	rax, rdi
	shldq	%cl, %rdi, %rsi		#	shld	rax, rsi, cl
	shlq	%cl, %rdi		#	shld	rsi, rdi, cl
	xorl	%r8d, %r8d		#	mov	rdx, rsi
	testb	$64, %cl		#	test	cl, 64
	cmovneq	%rdi, %rsi		#	cmovnz	rdx, rax
	cmovneq	%r8, %rdi		#	cmovnz	rax, rsi
	negb	%cl			#
	shrdq	%cl, %rdx, %rax		#
	shrq	%cl, %rdx		#
	testb	$64, %cl		#
	cmovneq	%rdx, %rax		#
	cmovneq	%r8, %rdx		#
	orq	%rdi, %rax		#
	orq	%rsi, %rdx		#
	retq				#	ret
[…]
__rotrti3:				# @__rotrti3
# %bb.0:
	movl	%edx, %ecx		#	mov	ecx, edx
	movq	%rsi, %rdx		#
	movq	%rdi, %rax		#	mov	rax, rdi
	movq	%rdi, %rsi		#
	shrdq	%cl, %rdx, %rsi		#	shrd	rax, rsi, cl
	movq	%rdx, %rdi		#
	shrq	%cl, %rdi		#	shrd	rsi, rdi, cl
	xorl	%r8d, %r8d		#	mov	rdx, rsi
	testb	$64, %cl		#	test	cl, 64
	cmovneq	%rdi, %rsi		#	cmovnz	rdx, rax
	cmovneq	%r8, %rdi		#	cmovnz	rax, rsi
	negb	%cl			#
	shldq	%cl, %rax, %rdx		#
	shlq	%cl, %rax		#
	testb	$64, %cl		#
	cmovneq	%rax, %rdx		#
	cmovneq	%r8, %rax		#
	orq	%rsi, %rax		#
	orq	%rdi, %rdx		#
	retq				#	ret
[…]
	.ident	"clang version 10.0.0 "
[…]
Ouch: while the optimiser recognises the common and well-known idiom for rotation of 32-bit as well as 64-bit integers and generates proper code for the __rotlsi3(), __rotrsi3(), __rotldi3() and __rotrdi3() functions, it fails rather badly at 128-bit integers and generates awful code using 18 instructions in 56 bytes and 20 instructions in 62 bytes respectively, instead of only 9 instructions in just 28 bytes for each of the __rotlti3() and __rotrti3() functions!

Note: exploration of the equally bad code generated for the __rotldi3() and __rotrdi3() functions on the i386 platform is left as an exercise to the reader.

The optimiser fails – take 9

Create the text file case22.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

unsigned int __rotlsi3(unsigned int value, int count) {
#ifdef OPTIMIZE
    return __builtin_rotateleft32(value, count);
#else
    return (value << (31 & count))
         | (value >> (31 & -count));
#endif
}

unsigned int __rotrsi3(unsigned int value, int count) {
#ifdef OPTIMIZE
    return __builtin_rotateright32(value, count);
#else
    return (value >> (31 & count))
         | (value << (31 & -count));
#endif
}

unsigned long long __rotldi3(unsigned long long value, int count) {
#ifdef OPTIMIZE
    return __builtin_rotateleft64(value, count);
#else
    return (value << (63 & count))
         | (value >> (63 & -count));
#endif
}

unsigned long long __rotrdi3(unsigned long long value, int count) {
#ifdef OPTIMIZE
    return __builtin_rotateright64(value, count);
#else
    return (value >> (63 & count))
         | (value << (63 & -count));
#endif
}

unsigned int __aswapsi2(unsigned int value) {
    return (__rotlsi3(value, 8) & 0x00FF00FFU)
#ifdef ALTERNATE
         | (__rotlsi3(value, 24) & 0xFF00FF00U);
#else
         | (__rotrsi3(value, 8) & 0xFF00FF00U);
#endif
}

unsigned int __bswapsi2(unsigned int value) {
    return __rotlsi3(value & 0xFF00FF00U, 8)
#ifdef ALTERNATE
         | __rotlsi3(value & 0x00FF00FFU, 24);
#else
         | __rotrsi3(value & 0x00FF00FFU, 8);
#endif
}

unsigned long long __aswapdi2(unsigned long long value) {
    return (__rotldi3(value, 8)  & 0x000000FF000000FFULL)
         | (__rotldi3(value, 24) & 0x0000FF000000FF00ULL)
#ifdef ALTERNATE
         | (__rotldi3(value, 40) & 0x00FF000000FF0000ULL)
         | (__rotldi3(value, 56) & 0xFF000000FF000000ULL);
#else
         | (__rotrdi3(value, 24) & 0x00FF000000FF0000ULL)
         | (__rotrdi3(value, 8)  & 0xFF000000FF000000ULL);
#endif
}

unsigned long long __bswapdi2(unsigned long long value) {
    return __rotldi3(value & 0xFF000000FF000000ULL, 8)
         | __rotldi3(value & 0x00FF000000FF0000ULL, 24)
#ifdef ALTERNATE
         | __rotldi3(value & 0x0000FF000000FF00ULL, 40)
         | __rotldi3(value & 0x000000FF000000FFULL, 56);
#else
         | __rotrdi3(value & 0x0000FF000000FF00ULL, 24)
         | __rotrdi3(value & 0x000000FF000000FFULL, 8);
#endif
}
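For the record, the rotate-and-mask formula computed by __aswapsi2() is just an ordinary 32-bit byte swap. A quick sanity check (a sketch using hypothetical helper names, mirroring the rotate helpers defined in case22.c):

```c
/* Sanity check: the rotate-based __aswapsi2() formula from case22.c
   is an ordinary 32-bit byte swap. */
static unsigned int rotl32(unsigned int v, int c) {
    return (v << (31 & c)) | (v >> (31 & -c));
}

static unsigned int rotr32(unsigned int v, int c) {
    return (v >> (31 & c)) | (v << (31 & -c));
}

static unsigned int aswap32(unsigned int v) {
    return (rotl32(v, 8) & 0x00FF00FFU)   /* bytes 2 and 0 land in place */
         | (rotr32(v, 8) & 0xFF00FF00U);  /* bytes 3 and 1 land in place */
}
```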
Compile the source file case22.c with Clang, engaging the optimiser, targeting the i386 platform, and display the generated assembly code:
clang -o- -O3 -S -target i386-pc-linux case22.c
[…]
__rotlsi3:				# @__rotlsi3
# %bb.0:
	movb	8(%esp), %cl
	movl	4(%esp), %eax
	roll	%cl, %eax
	retl
[…]
__rotrsi3:				# @__rotrsi3
# %bb.0:
	movb	8(%esp), %cl
	movl	4(%esp), %eax
	rorl	%cl, %eax
	retl
[…]
__rotldi3:				# @__rotldi3
# %bb.0:
	pushl	%ebp
	pushl	%ebx
	pushl	%edi
	pushl	%esi
	movl	24(%esp), %esi
	movl	20(%esp), %eax
	movb	28(%esp), %cl
	xorl	%ebp, %ebp
	movl	%eax, %edi
	movl	%esi, %ebx
	movl	%esi, %edx
	shll	%cl, %edi
	shldl	%cl, %eax, %ebx
	testb	$32, %cl
	cmovnel	%edi, %ebx
	cmovnel	%ebp, %edi
	negb	%cl
	shrl	%cl, %edx
	shrdl	%cl, %esi, %eax
	testb	$32, %cl
	cmovnel	%edx, %eax
	cmovnel	%ebp, %edx
	orl	%edi, %eax
	orl	%ebx, %edx
	popl	%esi
	popl	%edi
	popl	%ebx
	popl	%ebp
	retl
[…]
__rotrdi3:				# @__rotrdi3
# %bb.0:
	pushl	%ebp
	pushl	%ebx
	pushl	%edi
	pushl	%esi
	movl	20(%esp), %esi
	movl	24(%esp), %edx
	movb	28(%esp), %cl
	xorl	%ebp, %ebp
	movl	%edx, %edi
	movl	%esi, %ebx
	movl	%esi, %eax
	shrl	%cl, %edi
	shrdl	%cl, %edx, %ebx
	testb	$32, %cl
	cmovnel	%edi, %ebx
	cmovnel	%ebp, %edi
	negb	%cl
	shll	%cl, %eax
	shldl	%cl, %esi, %edx
	testb	$32, %cl
	cmovnel	%eax, %edx
	cmovnel	%ebp, %eax
	orl	%ebx, %eax
	orl	%edi, %edx
	popl	%esi
	popl	%edi
	popl	%ebx
	popl	%ebp
	retl
[…]
__aswapsi2:				# @__aswapsi2
# %bb.0:
	movl	4(%esp), %eax
	movl	%eax, %ecx
	roll	$24, %eax
	roll	$8, %ecx
	andl	$0xFF00FF00, %eax
	andl	$0xFF00FF, %ecx
	orl	%ecx, %eax
	retl
[…]
__bswapsi2:				# @__bswapsi2
# %bb.0:
	movl	4(%esp), %ecx
	movl	%ecx, %edx
	movl	%ecx, %eax
	andl	$0xFF00, %edx
	andl	$0xFF0000, %eax
	shldl	$8, %ecx, %edx
	shrdl	$8, %ecx, %eax
	orl	%edx, %eax
	retl
[…]
__aswapdi2:				# @__aswapdi2
# %bb.0:
	movl	4(%esp), %edx
	movl	8(%esp), %eax
	bswapl	%eax
	bswapl	%edx
	retl
[…]
__bswapdi2:				# @__bswapdi2
# %bb.0:
	movl	4(%esp), %edx
	movl	8(%esp), %eax
	bswapl	%eax
	bswapl	%edx
	retl
[…]
	.ident	"clang version 10.0.0 "
[…]
Oops: although the code generated for the __rotldi3() and __rotrdi3() functions is horrible, the __aswapdi2() and __bswapdi2() functions are properly optimised.

OUCH: contrary to the __aswapdi2() and __bswapdi2() functions, the __aswapsi2() and __bswapsi2() functions are not optimised!

Compile the source file case22.c a second time with Clang, again targeting the i386 platform, now with the preprocessor macro OPTIMIZE defined, and display the generated assembly code:

clang -DOPTIMIZE -o- -O3 -S -target i386-pc-linux case22.c
[…]
__rotlsi3:				# @__rotlsi3
# %bb.0:
	movb	8(%esp), %cl
	movl	4(%esp), %eax
	roll	%cl, %eax
	retl
[…]
__rotrsi3:				# @__rotrsi3
# %bb.0:
	movb	8(%esp), %cl
	movl	4(%esp), %eax
	rorl	%cl, %eax
	retl
[…]
__rotldi3:				# @__rotldi3
# %bb.0:
	pushl	%ebp
	pushl	%ebx
	pushl	%edi
	pushl	%esi
	movb	28(%esp), %ch
	movl	20(%esp), %esi
	movl	24(%esp), %edx
	xorl	%ebp, %ebp
	movb	%ch, %cl
	movl	%edx, %edi
	movl	%esi, %ebx
	movl	%esi, %eax
	negb	%cl
	shrl	%cl, %edi
	shrdl	%cl, %edx, %ebx
	testb	$32, %cl
	movb	%ch, %cl
	cmovnel	%edi, %ebx
	cmovnel	%ebp, %edi
	shll	%cl, %eax
	shldl	%cl, %esi, %edx
	testb	$32, %ch
	cmovnel	%eax, %edx
	cmovnel	%ebp, %eax
	orl	%ebx, %eax
	orl	%edi, %edx
	popl	%esi
	popl	%edi
	popl	%ebx
	popl	%ebp
	retl
[…]
__rotrdi3:				# @__rotrdi3
# %bb.0:
	pushl	%ebp
	pushl	%ebx
	pushl	%edi
	pushl	%esi
	movl	20(%esp), %esi
	movl	24(%esp), %edx
	movb	28(%esp), %cl
	xorl	%ebp, %ebp
	movl	%edx, %edi
	movl	%esi, %ebx
	movl	%esi, %eax
	shrl	%cl, %edi
	shrdl	%cl, %edx, %ebx
	testb	$32, %cl
	cmovnel	%edi, %ebx
	cmovnel	%ebp, %edi
	negb	%cl
	shll	%cl, %eax
	shldl	%cl, %esi, %edx
	testb	$32, %cl
	cmovnel	%eax, %edx
	cmovnel	%ebp, %eax
	orl	%ebx, %eax
	orl	%edi, %edx
	popl	%esi
	popl	%edi
	popl	%ebx
	popl	%ebp
	retl
[…]
__aswapsi2:				# @__aswapsi2
# %bb.0:
	movl	4(%esp), %eax
	movl	%eax, %ecx
	roll	$24, %eax
	roll	$8, %ecx
	andl	$0xFF00FF00, %eax
	andl	$0xFF00FF, %ecx
	orl	%ecx, %eax
	retl
[…]
__bswapsi2:				# @__bswapsi2
# %bb.0:
	movl	4(%esp), %ecx
	movl	%ecx, %edx
	movl	%ecx, %eax
	andl	$0xFF00, %edx
	andl	$0xFF0000, %eax
	shldl	$8, %ecx, %edx
	shrdl	$8, %ecx, %eax
	orl	%edx, %eax
	retl
[…]
.LCPI8_0:
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	255			# 0xff
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	255			# 0xff
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	255			# 0xff
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	255			# 0xff
.LCPI8_1:
	.byte	255			# 0xff
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	255			# 0xff
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	255			# 0xff
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	255			# 0xff
	.byte	0			# 0x0
	.byte	0			# 0x0
[…]
__aswapdi2:				# @__aswapdi2
# %bb.0:
	movq	4(%esp), %xmm0		# xmm0 = mem[0],zero
	pshufd	$68, %xmm0, %xmm0	# xmm0 = xmm0[0,1,0,1]
	movdqa	%xmm0, %xmm1
	movdqa	%xmm0, %xmm2
	movdqa	%xmm0, %xmm3
	movdqa	%xmm0, %xmm5
	movdqa	%xmm0, %xmm4
	psrlq	$56, %xmm1
	psrlq	$40, %xmm2
	psllq	$24, %xmm3
	psrlq	$8, %xmm5
	psllq	$40, %xmm4
	movsd	%xmm1, %xmm2		# xmm2 = xmm1[0],xmm2[1]
	movdqa	%xmm0, %xmm1
	psllq	$8, %xmm1
	movsd	%xmm1, %xmm3		# xmm3 = xmm1[0],xmm3[1]
	movdqa	%xmm0, %xmm1
	psllq	$56, %xmm0
	psrlq	$24, %xmm1
	movsd	%xmm4, %xmm0		# xmm0 = xmm4[0],xmm0[1]
	orpd	%xmm2, %xmm3
	movsd	%xmm1, %xmm5		# xmm5 = xmm1[0],xmm5[1]
	andpd	.LCPI8_1, %xmm3
	orpd	%xmm5, %xmm0
	andpd	.LCPI8_0, %xmm0
	orpd	%xmm0, %xmm3
	pshufd	$78, %xmm3, %xmm0	# xmm0 = xmm3[2,3,0,1]
	por	%xmm3, %xmm0
	movd	%xmm0, %eax
	pshufd	$229, %xmm0, %xmm0	# xmm0 = xmm0[1,1,2,3]
	movd	%xmm0, %edx
	retl
[…]
__bswapdi2:				# @__bswapdi2
# %bb.0:
	pushl	%ebx
	pushl	%edi
	pushl	%esi
	movl	16(%esp), %edx
	movl	20(%esp), %eax
	movl	%edx, %edi
	movl	%eax, %ebx
	movl	%eax, %ecx
	movl	%edx, %esi
	andl	$0xFF0000, %edi
	andl	$0xFF0000, %ebx
	shrl	$24, %ecx
	shrl	$24, %esi
	shrl	$8, %ebx
	shrl	$8, %edi
	orl	%ecx, %ebx
	movl	%eax, %ecx
	orl	%esi, %edi
	movl	%edx, %esi
	shll	$24, %edx
	shll	$24, %eax
	andl	$0xFF00, %ecx
	andl	$0xFF00, %esi
	shll	$8, %esi
	shll	$8, %ecx
	orl	%edi, %esi
	orl	%ebx, %ecx
	orl	%esi, %edx
	orl	%ecx, %eax
	popl	%esi
	popl	%edi
	popl	%ebx
	retl
[…]
	.ident	"clang version 10.0.0 "
[…]
OUCH: the braindead implementation of the code generator for __builtin_*() strikes again!

Compile the source file case22.c a third time with Clang, now targeting the AMD64 platform, and display the generated assembly code:

clang -o- -O3 -S -target amd64-pc-linux case22.c
[…]
__rotlsi3:				# @__rotlsi3
# %bb.0:
	movl	%esi, %ecx
	movl	%edi, %eax
	roll	%cl, %eax
	retq
[…]
__rotrsi3:				# @__rotrsi3
# %bb.0:
	movl	%esi, %ecx
	movl	%edi, %eax
	rorl	%cl, %eax
	retq
[…]
__rotldi3:				# @__rotldi3
# %bb.0:
	movl	%esi, %ecx
	movq	%rdi, %rax
	rolq	%cl, %rax
	retq
[…]
__rotrdi3:				# @__rotrdi3
# %bb.0:
	movl	%esi, %ecx
	movq	%rdi, %rax
	rorq	%cl, %rax
	retq
[…]
__aswapsi2:				# @__aswapsi2
# %bb.0:
	movl	%edi, %eax
	roll	$8, %eax
	andl	$0xFF00FF, %eax
	roll	$24, %edi
	andl	$0xFF00FF00, %edi
	addl	%edi, %eax
	retq
[…]
__bswapsi2:				# @__bswapsi2
# %bb.0:
	movl	%edi, %ecx
	andl	$0xFF00, %ecx
	shldl	$8, %edi, %ecx
	movl	%edi, %eax
	andl	$0xFF0000, %eax
	shrdl	$8, %edi, %eax
	orl	%ecx, %eax
	retq
[…]
__aswapdi2:				# @__aswapdi2
# %bb.0:
	movq	%rdi, %rax
	bswapq	%rax
	retq
[…]
__bswapdi2:				# @__bswapdi2
# %bb.0:
	movq	%rdi, %rax
	bswapq	%rax
	retq
[…]
	.ident	"clang version 10.0.0 "
[…]
OUCH: while the __aswapdi2() and __bswapdi2() functions are optimised, the __aswapsi2() and __bswapsi2() functions are not!

Compile the source file case22.c a fourth time with Clang, again targeting the AMD64 platform, now with the preprocessor macro OPTIMIZE defined, and display the generated assembly code:

clang -DOPTIMIZE -o- -O3 -S -target amd64-pc-linux case22.c
[…]
__rotlsi3:				# @__rotlsi3
# %bb.0:
	movl	%esi, %ecx
	movl	%edi, %eax
	roll	%cl, %eax
	retq
[…]
__rotrsi3:				# @__rotrsi3
# %bb.0:
	movl	%esi, %ecx
	movl	%edi, %eax
	rorl	%cl, %eax
	retq
[…]
__rotldi3:				# @__rotldi3
# %bb.0:
	movl	%esi, %ecx
	movq	%rdi, %rax
	rolq	%cl, %rax
	retq
[…]
__rotrdi3:				# @__rotrdi3
# %bb.0:
	movl	%esi, %ecx
	movq	%rdi, %rax
	rorq	%cl, %rax
	retq
[…]
__aswapsi2:				# @__aswapsi2
# %bb.0:
	movl	%edi, %eax
	roll	$8, %eax
	andl	$0xFF00FF, %eax
	roll	$24, %edi
	andl	$0xFF00FF00, %edi
	addl	%edi, %eax
	retq
[…]
__bswapsi2:				# @__bswapsi2
# %bb.0:
	movl	%edi, %ecx
	andl	$0xFF00, %ecx
	shldl	$8, %edi, %ecx
	movl	%edi, %eax
	andl	$0xFF0000, %eax
	shrdl	$8, %edi, %eax
	orl	%ecx, %eax
	retq
[…]
.LCPI8_0:
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	255			# 0xff
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	255			# 0xff
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	255			# 0xff
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	255			# 0xff
.LCPI8_1:
	.byte	255			# 0xff
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	255			# 0xff
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	255			# 0xff
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	0			# 0x0
	.byte	255			# 0xff
	.byte	0			# 0x0
	.byte	0			# 0x0
[…]
__aswapdi2:				# @__aswapdi2
# %bb.0:
	movq	%rdi, %xmm0
	pshufd	$68, %xmm0, %xmm0	# xmm0 = xmm0[0,1,0,1]
	movdqa	%xmm0, %xmm1
	psrlq	$56, %xmm1
	movdqa	%xmm0, %xmm2
	psrlq	$40, %xmm2
	movsd	%xmm1, %xmm2		# xmm2 = xmm1[0],xmm2[1]
	movdqa	%xmm0, %xmm1
	psllq	$8, %xmm1
	movdqa	%xmm0, %xmm3
	psllq	$24, %xmm3
	movsd	%xmm1, %xmm3		# xmm3 = xmm1[0],xmm3[1]
	orpd	%xmm2, %xmm3
	movdqa	%xmm0, %xmm1
	psrlq	$24, %xmm1
	movdqa	%xmm0, %xmm2
	psrlq	$8, %xmm2
	movsd	%xmm1, %xmm2		# xmm2 = xmm1[0],xmm2[1]
	movdqa	%xmm0, %xmm1
	psllq	$40, %xmm1
	psllq	$56, %xmm0
	movsd	%xmm1, %xmm0		# xmm0 = xmm1[0],xmm0[1]
	orpd	%xmm2, %xmm0
	andpd	.LCPI8_0(%rip), %xmm0
	andpd	.LCPI8_1(%rip), %xmm3
	orpd	%xmm0, %xmm3
	pshufd	$78, %xmm3, %xmm0	# xmm0 = xmm3[2,3,0,1]
	por	%xmm3, %xmm0
	movq	%xmm0, %rax
	retq
[…]
__bswapdi2:				# @__bswapdi2
# %bb.0:
	movl	%edi, %eax
	andl	$0xFF000000, %eax
	shldq	$8, %rdi, %rax
	movabsq	$0xFF000000FF0000, %rcx
	andq	%rdi, %rcx
	rolq	$24, %rcx
	orq	%rax, %rcx
	movabsq	$0xFF000000FF00, %rdx
	andq	%rdi, %rdx
	rolq	$40, %rdx
	movabsq	$0xFF00000000, %rax
	andq	%rdi, %rax
	shrdq	$8, %rdi, %rax
	orq	%rdx, %rax
	orq	%rcx, %rax
	retq
[…]
	.ident	"clang version 10.0.0 "
[…]
OUCH: the braindead implementation of the code generator for __builtin_*() strikes again!

The optimiser fails – take 10

Create the text file case23.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifndef OPTIMIZE
#define __builtin_rotateleft32(value, count)  (((value) << (31 & (count))) | ((value) >> (31 & -(count))))
#define __builtin_rotateright32(value, count) (((value) >> (31 & (count))) | ((value) << (31 & -(count))))
#define __builtin_rotateleft64(value, count)  (((value) << (63 & (count))) | ((value) >> (63 & -(count))))
#define __builtin_rotateright64(value, count) (((value) >> (63 & (count))) | ((value) << (63 & -(count))))
#endif

unsigned int __aswapsi2(unsigned int value) {
    return (__builtin_rotateleft32(value, 8) & 0x00FF00FFU)
#ifdef ALTERNATE
         | (__builtin_rotateleft32(value, 24) & 0xFF00FF00U);
#else
         | (__builtin_rotateright32(value, 8) & 0xFF00FF00U);
#endif
}

unsigned int __bswapsi2(unsigned int value) {
    return __builtin_rotateleft32(value & 0xFF00FF00U, 8)
#ifdef ALTERNATE
         | __builtin_rotateleft32(value & 0x00FF00FFU, 24);
#else
         | __builtin_rotateright32(value & 0x00FF00FFU, 8);
#endif
}

unsigned int __cswapsi2(unsigned int value) {
    value = ((value << 8) & 0xFF00FF00U)
          | ((value >> 8) & 0x00FF00FFU);
#ifndef ALTERNATE
    value = ((value << 16) & 0xFFFF0000U)
          | ((value >> 16) & 0x0000FFFFU);
#else
    value = (value << 16) | (value >> 16);
#endif
    return value;
}

unsigned int __dswapsi2(unsigned int value) {
    value = ((value & 0x00FF00FFU) << 8)
          | ((value & 0xFF00FF00U) >> 8);
#ifndef ALTERNATE
    value = ((value & 0x0000FFFFU) << 16)
          | ((value & 0xFFFF0000U) >> 16);
#else
    value = (value << 16) | (value >> 16);
#endif
    return value;
}

unsigned long long __aswapdi2(unsigned long long value) {
    return (__builtin_rotateleft64(value, 8)  & 0x000000FF000000FFULL)
         | (__builtin_rotateleft64(value, 24) & 0x0000FF000000FF00ULL)
#ifdef ALTERNATE
         | (__builtin_rotateleft64(value, 40) & 0x00FF000000FF0000ULL)
         | (__builtin_rotateleft64(value, 56) & 0xFF000000FF000000ULL);
#else
         | (__builtin_rotateright64(value, 24) & 0x00FF000000FF0000ULL)
         | (__builtin_rotateright64(value, 8)  & 0xFF000000FF000000ULL);
#endif
}

unsigned long long __bswapdi2(unsigned long long value) {
    return __builtin_rotateleft64(value & 0xFF000000FF000000ULL, 8)
         | __builtin_rotateleft64(value & 0x00FF000000FF0000ULL, 24)
#ifdef ALTERNATE
         | __builtin_rotateleft64(value & 0x0000FF000000FF00ULL, 40)
         | __builtin_rotateleft64(value & 0x000000FF000000FFULL, 56);
#else
         | __builtin_rotateright64(value & 0x0000FF000000FF00ULL, 24)
         | __builtin_rotateright64(value & 0x000000FF000000FFULL, 8);
#endif
}

unsigned long long __cswapdi2(unsigned long long value) {
    value = ((value << 8) & 0xFF00FF00FF00FF00ULL)
          | ((value >> 8) & 0x00FF00FF00FF00FFULL);
    value = ((value << 16) & 0xFFFF0000FFFF0000ULL)
          | ((value >> 16) & 0x0000FFFF0000FFFFULL);
#ifndef ALTERNATE
    value = ((value << 32) & 0xFFFFFFFF00000000ULL)
          | ((value >> 32) & 0x00000000FFFFFFFFULL);
#else
    value = (value << 32) | (value >> 32);
#endif
    return value;
}

unsigned long long __dswapdi2(unsigned long long value) {
    value = ((value & 0x00FF00FF00FF00FFULL) << 8)
          | ((value & 0xFF00FF00FF00FF00ULL) >> 8);
    value = ((value & 0x0000FFFF0000FFFFULL) << 16)
          | ((value & 0xFFFF0000FFFF0000ULL) >> 16);
#ifndef ALTERNATE
    value = ((value & 0x00000000FFFFFFFFULL) << 32)
          | ((value & 0xFFFFFFFF00000000ULL) >> 32);
#else
    value = (value << 32) | (value >> 32);
#endif
    return value;
}
Compile the source file case23.c with Clang, engaging the optimiser, targeting the i386 platform, and display the generated assembly code:
clang -mno-sse -o- -O3 -S -target i386-pc-linux case23.c
[…]
__aswapsi2:				# @__aswapsi2
# %bb.0:
	movl	4(%esp), %eax
	bswapl	%eax
	retl
[…]
__bswapsi2:				# @__bswapsi2
# %bb.0:
	movl	4(%esp), %eax
	bswapl	%eax
	retl
[…]
__cswapsi2:				# @__cswapsi2
# %bb.0:
	movl	4(%esp), %eax
	bswapl	%eax
	retl
[…]
__dswapsi2:				# @__dswapsi2
# %bb.0:
	movl	4(%esp), %eax
	bswapl	%eax
	retl
[…]
__aswapdi2:				# @__aswapdi2
# %bb.0:
	movl	4(%esp), %edx
	movl	8(%esp), %eax
	bswapl	%eax
	bswapl	%edx
	retl
[…]
__bswapdi2:				# @__bswapdi2
# %bb.0:
	movl	4(%esp), %edx
	movl	8(%esp), %eax
	bswapl	%eax
	bswapl	%edx
	retl
[…]
__cswapdi2:				# @__cswapdi2
# %bb.0:
	movl	4(%esp), %edx
	movl	8(%esp), %eax
	bswapl	%eax
	bswapl	%edx
	retl
[…]
__dswapdi2:				# @__dswapdi2
# %bb.0:
	movl	4(%esp), %edx
	movl	8(%esp), %eax
	bswapl	%eax
	bswapl	%edx
	retl
[…]
	.ident	"clang version 10.0.0 "
[…]
Note: the generated code is properly optimised!

Compile the source file case23.c a second time with Clang, now with the preprocessor macro OPTIMIZE defined, and display the generated assembly code:

clang -DOPTIMIZE -o- -O3 -S -target i386-pc-linux case23.c
[…]
__aswapsi2:				# @__aswapsi2
# %bb.0:
	movl	4(%esp), %eax
	movl	%eax, %ecx
	roll	$24, %eax
	roll	$8, %ecx
	andl	$0xFF00FF00, %eax
	andl	$0xFF00FF, %ecx
	orl	%ecx, %eax
	retl
[…]
__bswapsi2:				# @__bswapsi2
# %bb.0:
	movl	4(%esp), %ecx
	movl	%ecx, %edx
	movl	%ecx, %eax
	andl	$0xFF00, %edx
	andl	$0xFF0000, %eax
	shldl	$8, %ecx, %edx
	shrdl	$8, %ecx, %eax
	orl	%edx, %eax
	retl
[…]
__cswapsi2:				# @__cswapsi2
# %bb.0:
	movl	4(%esp), %eax
	bswapl	%eax
	retl
[…]
__dswapsi2:				# @__dswapsi2
# %bb.0:
	movl	4(%esp), %eax
	bswapl	%eax
	retl
[…]
__aswapdi2:				# @__aswapdi2
# %bb.0:
	pushl	%ebx
	pushl	%edi
	pushl	%esi
	movl	16(%esp), %esi
	movl	20(%esp), %eax
	movl	%esi, %edx
	movl	%esi, %ecx
	shldl	$24, %eax, %edx
	shldl	$8, %eax, %ecx
	movl	%edx, %ebx
	movzbl	%cl, %edi
	andl	$0xFF0000, %ecx
	andl	$0xFF000000, %edx
	andl	$0xFF00, %ebx
	orl	%edi, %ebx
	movl	%esi, %edi
	shrl	$24, %esi
	shrl	$8, %edi
	andl	$0xFF00, %edi
	orl	%edi, %esi
	movl	%eax, %edi
	shll	$24, %eax
	shll	$8, %edi
	orl	%esi, %ecx
	andl	$0xFF0000, %edi
	orl	%ecx, %edx
	orl	%ebx, %edi
	orl	%edi, %eax
	popl	%esi
	popl	%edi
	popl	%ebx
	retl
[…]
__bswapdi2:				# @__bswapdi2
# %bb.0:
	pushl	%ebx
	pushl	%edi
	pushl	%esi
	movl	16(%esp), %edx
	movl	20(%esp), %eax
	movl	%edx, %edi
	movl	%eax, %ebx
	movl	%eax, %ecx
	movl	%edx, %esi
	andl	$0xFF0000, %edi
	andl	$0xFF0000, %ebx
	shrl	$24, %ecx
	shrl	$24, %esi
	shrl	$8, %ebx
	shrl	$8, %edi
	orl	%ecx, %ebx
	movl	%eax, %ecx
	orl	%esi, %edi
	movl	%edx, %esi
	shll	$24, %edx
	shll	$24, %eax
	andl	$0xFF00, %ecx
	andl	$0xFF00, %esi
	shll	$8, %esi
	shll	$8, %ecx
	orl	%edi, %esi
	orl	%ebx, %ecx
	orl	%esi, %edx
	orl	%ecx, %eax
	popl	%esi
	popl	%edi
	popl	%ebx
	retl
[…]
__cswapdi2:				# @__cswapdi2
# %bb.0:
	movl	4(%esp), %edx
	movl	8(%esp), %eax
	bswapl	%eax
	bswapl	%edx
	retl
[…]
__dswapdi2:				# @__dswapdi2
# %bb.0:
	movl	4(%esp), %edx
	movl	8(%esp), %eax
	bswapl	%eax
	bswapl	%edx
	retl
[…]
	.ident	"clang version 10.0.0 "
[…]
OUCH: the braindead implementation of the code generator for __builtin_*() strikes again!

The optimiser fails – take 11

Create the text file case24.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

long long quotient(long long numerator, long long denominator, long long *modulus) {
    *modulus = numerator % denominator;
    return numerator / denominator;
}

long long modulus(long long numerator, long long denominator, long long *quotient) {
    *quotient = numerator / denominator;
    return numerator % denominator;
}
Compile the source file case24.c with Clang, engaging the optimiser, targeting the i386 platform, and display the generated assembly code:
clang -o- -O3 -S -target i386-pc-linux case24.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…]
quotient:				# @quotient
# %bb.0:
	pushl	%ebp			#	jmp	__divmoddi4
	pushl	%ebx			#
	pushl	%edi			#
	pushl	%esi			#
	subl	$12, %esp		#
	movl	44(%esp), %edi		#
	movl	32(%esp), %ebx		#
	movl	36(%esp), %ebp		#
	movl	40(%esp), %esi		#
	pushl	%edi			#
	pushl	%esi			#
	pushl	%ebp			#
	pushl	%ebx			#
	calll	__divdi3		#
	addl	$16, %esp		#
	movl	%eax, %ecx		#
	movl	%edx, 8(%esp)		# 4-byte Spill
	imull	%eax, %edi		#
	mull	%esi			#
	addl	%edi, %edx		#
	movl	8(%esp), %edi		# 4-byte Reload
	imull	%edi, %esi		#
	addl	%edx, %esi		#
	subl	%eax, %ebx		#
	movl	48(%esp), %eax		#
	movl	%edi, %edx		#
	sbbl	%esi, %ebp		#
	movl	%ebx, (%eax)		#
	movl	%ebp, 4(%eax)		#
	movl	%ecx, %eax		#
	addl	$12, %esp		#
	popl	%esi			#
	popl	%edi			#
	popl	%ebx			#
	popl	%ebp			#
	retl				#
[…]
modulus:				# @modulus
# %bb.0:
	pushl	%ebp			#
	pushl	%ebx			#
	pushl	%edi			#
	pushl	%esi			#
	subl	$12, %esp		#	sub	esp, 8
	movl	44(%esp), %ebx		#
	movl	32(%esp), %esi		#
	movl	36(%esp), %edi		#
	movl	40(%esp), %ebp		#
					#	push	esp
	pushl	%ebx			#	push	[esp+28]
	pushl	%ebp			#	push	[esp+28]
	pushl	%edi			#	push	[esp+28]
	pushl	%esi			#	push	[esp+28]
	calll	__divdi3		#	call	__divmoddi4
	addl	$16, %esp		#	add	esp, 20
	movl	%edx, %ecx		#
	movl	48(%esp), %edx		#	mov	ecx, [esp+28]
	imull	%eax, %ebx		#
	movl	%ecx, 4(%edx)		#	mov	[ecx], eax
	movl	%eax, (%edx)		#	mov	[ecx+4], edx
	mull	%ebp			#
	imull	%ebp, %ecx		#
	addl	%ebx, %edx		#
	addl	%edx, %ecx		#
	subl	%eax, %esi		#
	sbbl	%ecx, %edi		#
	movl	%esi, %eax		#
	movl	%edi, %edx		#
	addl	$12, %esp		#	pop	edx
					#	pop	eax
	popl	%esi			#
	popl	%edi			#
	popl	%ebx			#
	popl	%ebp			#
	retl				#	ret
[…]
	.ident	"clang version 10.0.0 "
[…]
OUCH: 35 superfluous instructions in 77 bytes for the function quotient(), wasting about 18 CPU clock cycles per function call, instead of only 1 (in words: one) instruction in just 5 bytes – this optimiser is a bad joke!

OOPS: 34 instructions in 74 bytes for the function modulus(), where just 14 instructions in only 40 bytes yield the same result, but without clobbering 4 registers, without 8 superfluous memory accesses, without 3 superfluous multiplications, without 5 superfluous additions or subtractions, and not wasting about 10 CPU clock cycles per function call.
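The point of __divmoddi4() can be sketched in portable C: a single division yields both quotient and remainder, so neither function needs the multiply-and-subtract dance shown above. The following is an illustrative stand-in with a hypothetical name, not compiler-rt's actual implementation:

```c
/* Illustrative stand-in for compiler-rt's __divmoddi4(): one division
   delivers quotient and remainder together, so quotient() above could
   be a plain tail call to it. */
long long divmod64(long long numerator, long long denominator,
                   long long *remainder) {
    long long q = numerator / denominator;     /* one machine division */
    *remainder = numerator - q * denominator;  /* remainder comes for free */
    return q;
}
```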

The optimiser fails – take 12

Create the text file case25.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2021, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

unsigned long long multiply64(unsigned int multiplicand, unsigned int multiplier) {
    unsigned long long product;
    __asm__("mull\t%[multiplier]"
           :"=A" (product)
           :"%0" (multiplicand),
            [multiplier] "rm" (multiplier)
           :"cc");
    return product;
}

#ifdef __amd64__
__uint128_t multiply128(unsigned long long multiplicand, unsigned long long multiplier) {
    __uint128_t product;
    __asm__("mulq\t%[multiplier]"
           :"=A" (product)
           :"%0" (multiplicand),
            [multiplier] "rm" (multiplier)
           :"cc");
    return product;
}
#endif
Compile the source file case25.c with Clang, engaging its optimiser, targeting the AMD64 platform, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-linux case25.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…]
multiply64:				# @multiply64
# %bb.0:
	movl	%edi, %eax		#	mov	eax, edi
	movl	%esi, -4(%rsp)		#
	#APP
	mull	-4(%rsp)		#	mul	esi
	#NO_APP
	retq				#	ret
[…]
multiply128:				# @multiply128
# %bb.0:
	movq	%rdi, %rax		#	mov	rax, rdi
	movq	%rsi, -8(%rsp)		#
	#APP
	mulq	-8(%rsp)		#	mul	rsi
	#NO_APP
	retq				#	ret
[…]
	.ident	"clang version 10.0.0 "
[…]
Ouch: despite the constraint rm (register or memory) given for the multiplier, it is written into the red zone on the stack and then read back from there.

Compile the source file case25.c a second time with Clang, now with the option -mno-red-zone, and display the generated assembly code:

clang -mno-red-zone -o- -O3 -S -target amd64-pc-linux case25.c
Note: the left column shows the generated code, while the right column shows the expected unoptimised code as comment.
[…]
multiply64:				# @multiply64
# %bb.0:
	subq	$4, %rsp		#
	movl	%edi, %eax		#	mov	eax, edi
	movl	%esi, (%rsp)		#	push	rsi
	#APP
	mull	(%rsp)			#	mul	dword ptr [rsp]
	#NO_APP
	addq	$4, %rsp		#	pop	rsi
	retq				#	ret
[…]
multiply128:				# @multiply128
# %bb.0:
	pushq	%rax			#
	movq	%rdi, %rax		#	mov	rax, rdi
	movq	%rsi, (%rsp)		#	push	rsi
	#APP
	mulq	(%rsp)			#	mul	qword ptr [rsp]
	#NO_APP
	popq	%rcx			#	pop	rsi
	retq				#	ret
[…]
	.ident	"clang version 10.0.0 "
[…]
Oops: instead of pushing register ESI, which holds the multiplier, onto the stack, the code generated for the multiply64() function stores it with a MOV instruction after decrementing the stack pointer with a separate SUB instruction, using 17 instead of 8 bytes.

Ouch: instead of pushing register RSI, which holds the multiplier, onto the stack, the code generated for the multiply128() function stores it with a MOV instruction after decrementing the stack pointer with a superfluous PUSH instruction, using 14 instead of 10 bytes.

The optimiser fails – take 13

Compile the source file case4gcc.c with Clang, engaging its optimiser, targeting the AMD64 platform, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-linux case4gcc.c
Note: the left column shows the generated code, while the right column shows only the replacement for properly optimised code as comment.
[…]
	bsrq	%r11, %rbx		#	bsrq	%r11, %rcx
	xorq	$63, %rbx		#	xorl	$63, %ecx
	je	.LBB1_11
# %bb.17:
	movq	%rbx, %rcx		#
	negq	%rcx			#
	movq	%rsi, %rdx		#	xorl	%edx, %edx
	shrq	%cl, %rdx		#	shldq	%cl, %rsi, %rdx
	movl	%ebx, %ecx		#
	shldq	%cl, %rdi, %rsi
	shlq	%cl, %rdi
	shldq	%cl, %r9, %r11
	shlq	%cl, %r9
	movq	%r11, -8(%rsp)		#
	movq	%rsi, %rax
	divq	-8(%rsp)		#	divq	%r11
	movq	%rdx, %rsi
	movq	%rax, %r10
	mulq	%r9
	cmpq	%rax, %rdi
	movq	%rsi, %rcx		#	movq	%rsi, %rbx
	sbbq	%rdx, %rcx		#	sbbq	%rdx, %rbx
	jae	.LBB1_19
# %bb.18:
	subq	%r9, %rax
	sbbq	%r11, %rdx
	addq	$-1, %r10
.LBB1_19:
	testq	%r8, %r8
	je	.LBB1_21
# %bb.20:
	movl	$64, %r9d		#
	subq	%rbx, %r9		#
	subq	%rax, %rdi
	sbbq	%rdx, %rsi
	movl	%ebx, %ecx		#
	shrq	%cl, %rdi		#	shrdq	%cl, %rdi, %rsi
	movq	%rsi, %rax		#
	movl	%r9d, %ecx		#
	shlq	%cl, %rax		#
	movl	%ebx, %ecx		#
	shrq	%cl, %rsi		#	shrq	%cl, %rdi
	orq	%rdi, %rax		#
	movq	%rsi, 8(%r8)
	movq	%rax, (%r8)		#	movq	%rdi, (%r8)
	jmp	.LBB1_21
[…]
	.ident	"clang version 10.0.0 "
[…]
Oops: while the optimiser (partially) recognises the common and well-known idiom for multi-word left shifts and generates the code following line # %bb.17:, it fails to recognise the same idiom for the multi-word right shift and generates the code following line # %bb.20:, using 12 (in words: twelve) superfluous instructions in a total of 42 instructions for the 2 hot basic blocks shown!
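Written out in C for a 128-bit value split into two 64-bit halves, the missed idiom looks as follows (a sketch with a hypothetical helper name, assuming 0 < count < 64 as established by the surrounding code):

```c
/* Double-word right shift for 0 < count < 64: the low half is the
   SHRD pattern, the high half a plain SHR -- the mirror image of the
   left-shift idiom the optimiser does recognise. */
void shr128(unsigned long long *lo, unsigned long long *hi, unsigned count) {
    *lo = (*lo >> count) | (*hi << (64 - count));  /* shrd lo, hi, cl */
    *hi >>= count;                                 /* shr  hi, cl */
}
```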

Compile the source file case4gcc.c a second time with Clang, now with the preprocessor macro OPTIMIZE defined, targeting the AMD64 platform, and display the generated assembly code:

clang -DOPTIMIZE -o- -O3 -S -target amd64-pc-linux case4gcc.c
[…]
.LBB1_7:
	bsrq	%r10, %rbx
	xorq	$63, %rbx
	je	.LBB1_13
# %bb.8:
	movl	%ebx, %ecx
	negb	%cl
	movq	%rsi, %rdx
	shrq	%cl, %rdx
	movl	%ebx, %ecx
	shldq	%cl, %rdi, %rsi
	shlq	%cl, %rdi
	shldq	%cl, %r9, %r10
	shlq	%cl, %r9
	xorl	%r11d, %r11d
	testb	$64, %bl
	cmovneq	%rdi, %rsi
	cmovneq	%r11, %rdi
	cmovneq	%r9, %r10
	cmovneq	%r11, %r9
	movq	%r10, -8(%rsp)
	movq	%rsi, %rax
	divq	-8(%rsp)
	movq	%rax, %r14
	movq	%rdx, %rcx
	movq	%r9, %rax
	andq	$-2, %rax
	mulq	%r14
	andq	$-2, %rdi
	cmpq	%rax, %rdi
	movq	%rcx, %rsi
	sbbq	%rdx, %rsi
	sbbq	$0, %r14
	testq	%r8, %r8
	je	.LBB1_19
# %bb.9:
	xorl	%r11d, %r11d
	subq	%rax, %rdi
	sbbq	%rdx, %rcx
	cmovaeq	%r11, %r10
	cmovaeq	%r11, %r9
	addq	%rdi, %r9
	adcq	%rcx, %r10
	movl	%ebx, %ecx
	shrdq	%cl, %r10, %r9
	shrq	%cl, %r10
	testb	$64, %bl
	cmovneq	%r10, %r9
	cmovneq	%r11, %r10
	movq	%r10, 8(%r8)
	movq	%r9, (%r8)
	jmp	.LBB1_19
[…]
	.ident	"clang version 10.0.0 "
[…]
Ouch: although the shift count is greater than 0 and can’t exceed 63 (which an optimiser worthy of its name could prove from the program flow), this optimiser generates 2 superfluous TEST instructions which always set the zero flag ZF, followed by 6 superfluous CMOVNZ instructions, i.e. NOPs, resulting in a total of 49 instead of the former 42 instructions.

Copyright © 1995–2021 • Stefan Kanthak • <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>