
Me, myself & IT

True lies – or what LLVM claims, but fails to deliver

Purpose
Reason
Example
Violation of the Microsoft calling convention
Poor to abysmal performance of 128÷128-bit integer division returning both quotient and remainder
Bloated code and poor performance of 64÷64-bit integer division returning both quotient and remainder
Counterexample for the 3 preceding bugs
Unoptimised code of 64÷64-bit integer division functions
Disastrous performance of 128÷128-bit (and 64÷64-bit) integer division
Runtime measurement on AMD Ryzen 5 3600
Runtime measurement on AMD EPYC 7262
Runtime measurement on Intel Core i5-6600
Careless (or clueless?) wasting of resources on Microsoft Windows
Which cross-compiler?
Misleading message box from installer on Microsoft Windows
Impaired performance of one of the most trivial functions
Bad, insufficient or no (peephole) optimisation at all in various library functions
__absvti2()
__ashldi3(), __ashrdi3() and __lshrdi3()
__cmp?i2() and __ucmp?i2()
__mulo?i4()
__parity?i2()
Braindead implementation of code generation for __builtin_*()
__builtin_parity*()
Which optimiser?
The optimiser fails, quite bad
The optimiser fails, third time
The optimiser fails, fourth time
The optimiser fails, fifth time
The optimiser fails, sixth time
The optimiser fails, seventh time
The optimiser fails, eighth time

Purpose

Demonstrate multiple bugs, defects, deficiencies and flaws of LLVM’s Clang compiler, ranging from bloated, unoptimised or outright wrong generated code, through the poor, abysmal and disastrous performance of the integer division routines of its compiler-rt runtime libraries, to the careless wasting of resources on Microsoft® Windows®.

Reason

For its compiler-rt runtime libraries, LLVM boasts:
  1. The compiler-rt project provides highly tuned implementations of the low-level code generator support routines like "__fixunsdfdi" and other calls generated when a target doesn't have a short sequence of native instructions to implement a core IR operation.
and brags:
The builtins library provides optimized implementations of this and other low-level routines, either in target-independent C form, or as a heavily-optimized assembly.
but dares to write (and even ship) unoptimised and untuned crap like the library functions __cmpdi2(), __cmpti2(), __ucmpdi2(), __ucmpti2(), __udivmoddi4() or __udivmodti4(), which even exhibit pessimised code that keeps their own Clang compiler and its optimiser from generating proper and performant machine code suitable for current super-scalar processors!

As proved below by the disastrous performance of integer division as well as by the unoptimised library functions, the statements cited above are blatant lies!

Note: I recommend reading Randall Hyde’s ACM article, especially the following part:

Observation #6: Software engineers have been led to believe that their time is more valuable than CPU time; therefore, wasting CPU cycles in order to reduce development time is always a win. They've forgotten, however, that the application users' time is more valuable than their time.

Example

Literally bad-smelling code from the 1980s, ugly and defeating proper optimisation:
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
[…]
// This file implements __ucmpdi2 for the compiler_rt library.
[…]
// Returns:  if (a <  b) returns 0
//           if (a == b) returns 1
//           if (a >  b) returns 2
[…]
int __ucmpdi2(uint64_t a, uint64_t b) {
    union {
        uint64_t all;
        struct {
            uint32_t low, high;
        } s;
    } x, y;
    x.all = a;
    y.all = b;
    if (x.s.high < y.s.high)
        return 0;
    if (x.s.high > y.s.high)
        return 2;
    if (x.s.low < y.s.low)
        return 0;
    if (x.s.low > y.s.low)
        return 2;
    return 1;
}

#ifdef __ARM_EABI__
// Returns:  if (a <  b) returns -1
//           if (a == b) returns  0
//           if (a >  b) returns  1
int __aeabi_ulcmp(int64_t a, int64_t b) {
    return __ucmpdi2(a, b) - 1;
}
#endif
Note: spotting the most obvious bug is left as an exercise to the reader.

Properly written code which lets the optimiser do its job follows:

// Copyleft © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

int __ucmpdi2(uint64_t a, uint64_t b) {
    return (a > b) - (a < b) + 1;
}

#ifdef __ARM_EABI__
int __aeabi_ulcmp(uint64_t a, uint64_t b) {
    return (a > b) - (a < b);
}
#endif
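
For illustration, the branch-free expression maps the three possible orderings directly onto the required return values; the following tiny test (my own addition, not part of compiler-rt) spells the mapping out:

// a minimal sketch; assumes the __ucmpdi2() shown above is linked in
#include <assert.h>
#include <stdint.h>

int __ucmpdi2(uint64_t a, uint64_t b);

int main(void) {
    assert(__ucmpdi2(1, 2) == 0);   // a <  b: (0) - (1) + 1 = 0
    assert(__ucmpdi2(2, 2) == 1);   // a == b: (0) - (0) + 1 = 1
    assert(__ucmpdi2(3, 2) == 2);   // a >  b: (1) - (0) + 1 = 2
    return 0;
}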

Violation of the Microsoft calling convention

The MSDN articles Overview of x64 Calling Conventions and Parameter Passing specify:
There's a strict one-to-one correspondence between a function call's arguments and the registers used for those arguments. Any argument that doesn't fit in 8 bytes, or isn't 1, 2, 4, or 8 bytes, must be passed by reference. A single argument is never spread across multiple registers.

[…] 16-byte arguments are passed by reference. […]

To return a user-defined type by value in RAX, it must have a length of 1, 2, 4, 8, 16, 32, or 64 bits. […] Otherwise, the caller must allocate memory for the return value and pass a pointer to it as the first argument. The remaining arguments are then shifted one argument to the right. The same pointer must be returned by the callee in RAX.

The LLVM Project Blog post Clang is now used to build Chrome for Windows states:
Clang is the first-ever open-source C++ compiler that’s ABI-compatible with Microsoft Visual C++ (MSVC) – meaning you can build some parts of your program (for example, system libraries) with the MSVC compiler (“cl.exe”), other parts with Clang, and when linked together (either by MSVC’s linker, “link.exe“, or LLD, the LLVM project’s linker – see below) the parts will form a working program.
To falsify LLVM’s claim, create the text file case0.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifndef __clang__
typedef struct {
    unsigned __int64 low, high;
} __uint128_t;
#else
__attribute__((ms_abi))
#endif // __clang__
__uint128_t __ubugti4(__uint128_t dividend, __uint128_t divisor, __uint128_t *remainder) {
    if (remainder != 0)
#ifndef BUNNY
        *remainder = divisor;
    return dividend;
#else
        *remainder = dividend % divisor;
    return dividend / divisor;
#endif // BUNNY
}
Note: the actual definition or name of the user-defined aggregate data type that substitutes for the 128-bit integer scalar __uint128_t under the Visual C compiler is irrelevant; for binary compatibility, which is defined by the respective Application Binary Interface alias ABI, only its matching size and memory layout matter!
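
To make the cited rules concrete: under the Microsoft x64 calling convention the function above effectively takes the following shape – a pseudo-prototype of my own, not actual compiler output – with the caller-allocated 16-byte return value addressed by RCX, the two 16-byte arguments passed by reference in RDX and R8, the (8-byte) remainder pointer passed by value in R9, and RCX handed back in RAX:

// hedged sketch of the argument/register mapping the Microsoft ABI mandates
__uint128_t *__ubugti4(__uint128_t *retval,           // RCX, to be returned in RAX
                       const __uint128_t *dividend,   // RDX: 16-byte argument, by reference
                       const __uint128_t *divisor,    // R8:  16-byte argument, by reference
                       __uint128_t *remainder);       // R9:  8-byte pointer, by value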

Compile the source file case0.c with Clang, engaging its optimiser, and display the generated assembly code:

clang -o- -O3 -S -target amd64-pc-windows case0.c
	.text
[…]
__ubugti4:				# @__ubugti4
# %bb.0:
	movaps	(%rcx), %xmm0
	testq	%r8, %r8
	je	.LBB0_2
# %bb.1:
	movaps	(%rdx), %xmm1
	movaps	%xmm1, (%r8)
.LBB0_2:
	retq
[…]
OOPS: the code generated by Clang expects the addresses of its two (128-bit) arguments and the remainder pointer in the registers RCX, RDX and R8 instead of RDX, R8 and R9, and it places the (128-bit) return value in the SSE register XMM0 instead of the memory pointed to by RCX, failing to return this pointer in RAX – thus violating the Microsoft ABI twice!

Compile the source file case0.c with Clang again, now without SSE support, and display the generated assembly code:

clang -mno-sse -o- -O3 -S -target amd64-pc-windows case0.c
	.text
[…]
__ubugti4:				# @__ubugti4
# %bb.0:
	movq	%rdx, %r9
	movq	(%rcx), %rax
	movq	8(%rcx), %rdx
	testq	%r8, %r8
	je	.LBB0_2
# %bb.1:
	movq	(%r9), %r10
	movq	8(%r9), %rcx
	movq	%rcx, 8(%r8)
	movq	%r10, (%r8)
.LBB0_2:
	retq
[…]
OUCH: the code generated by Clang still expects the addresses of its two (128-bit) arguments and the remainder pointer in the registers RCX, RDX and R8 instead of RDX, R8 and R9, and it places the (128-bit) return value in the register pair RDX:RAX instead of the memory pointed to by RCX, failing to return this pointer in RAX – again violating the Microsoft ABI twice!

Now compile the source file case0.c with the reference compiler for the Microsoft ABI, Visual C, and display the generated assembly code:

CL.EXE /c /FaCON: /FoNUL: /nologo /Ox /W4 case0.c
case0.c
; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01 
[…]
__ubugti4 PROC
; Line 11
	test	r9, r9
	je	SHORT $LN1@ubugti4
; Line 13
	mov	rax, QWORD PTR [r8]
	mov	QWORD PTR [r9], rax
	mov	rax, QWORD PTR [r8+8]
	mov	QWORD PTR [r9+8], rax
$LN1@ubugti4:
; Line 14
	mov	rax, QWORD PTR [rdx]
	mov	QWORD PTR [rcx], rax
	mov	rax, QWORD PTR [rdx+8]
	mov	QWORD PTR [rcx+8], rax
	mov	rax, rcx
; Line 19
	ret	0
__ubugti4 ENDP
[…]
This code complies (of course) with the Microsoft ABI.

Poor to abysmal performance of 128÷128-bit integer division returning both quotient and remainder

Compile the source file case0.c a third time with Clang, now with the preprocessor macro BUNNY defined, and display the generated assembly code:
clang -DBUNNY -o- -O3 -S -target amd64-pc-windows case0.c
	.text
[…]
__ubugti4:				# @__ubugti4
[…]
	testq	%r8, %r8
	je	.LBB0_2
# %bb.1:
	movq	%r8, %r15
[…]
	leaq	80(%rsp), %rcx
	leaq	64(%rsp), %rdx
	callq	__umodti3
	movaps	%xmm0, (%r15)
.LBB0_2:
[…]
	leaq	48(%rsp), %rcx
	leaq	32(%rsp), %rdx
	callq	__udivti3
	nop
[…]
Oops: instead of generating a single call to the library function __udivmodti4(), which returns both quotient and remainder, Clang generates – even with the command line option -O3 specified – separate calls to the library functions __umodti3() and __udivti3(), which is especially funny since both in turn call the library function __udivmodti4()!

Note: as demonstrated below, this code runs about 3 to 30 (in words: thirty) times slower than code generated by GCC!

Additionally count the number of instructions and the code size of the __udivmodti4() function provided in the clang_rt.builtins-x86_64.lib library: it shows 234 instructions in 765 bytes, while properly written (assembly) code shows only 71 instructions in just 225 bytes – and runs up to 15 (in words: fifteen) times faster!

Note: exploration of the equally entertaining call chain for signed 128÷128-bit integer division is left as an exercise to the reader.
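
For comparison, the single-call translation is trivial to write by hand; the following sketch of mine simply forwards to the __udivmodti4() routine whose prototype compiler-rt provides, obtaining quotient and remainder from one call:

// a minimal sketch; __udivmodti4() is the compiler-rt routine discussed above
extern __uint128_t __udivmodti4(__uint128_t dividend, __uint128_t divisor,
                                __uint128_t *remainder);

__uint128_t divmod128(__uint128_t dividend, __uint128_t divisor,
                      __uint128_t *remainder) {
    return __udivmodti4(dividend, divisor, remainder);  // one call, both results
}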

Bloated code and poor performance of 64÷64-bit integer division returning both quotient and remainder

Create the text file case2.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifndef BUNNY
typedef struct {
   __int64 quotient, remainder;
} lldiv_t;

__attribute__((ms_abi))
lldiv_t lldiv(__int64 dividend, __int64 divisor) {
    lldiv_t lldiv = {dividend / divisor, dividend % divisor};
    return lldiv;
}
#else
unsigned long __udivmoddi4(unsigned long dividend, unsigned long divisor, unsigned long *remainder) {
    if (remainder != 0)
        *remainder = dividend % divisor;
    return dividend / divisor;
}
#endif // BUNNY
Compile the source file case2.c with Clang, engaging its optimiser, targeting the i386 platform, and display the generated assembly code:
clang -o- -O3 -S -target i386-pc-windows case2.c
Note: the left column shows the generated code, while the right column shows the properly optimised code as comment!
	.text
[…]
_lldiv:					# @lldiv
# %bb.0:
	pushl	%ebp			#	push	ebp
	pushl	%ebx			#	mov	ebp, [esp+8]
	pushl	%edi			#	push	ebx
	pushl	%esi			#
	movl	36(%esp), %esi		#
	movl	20(%esp), %ebp		#
	movl	24(%esp), %edi		#
	movl	28(%esp), %ebx		#
	pushl	%esi			#	push	[esp+28]
	pushl	36(%esp)		#	push	[esp+28]
	pushl	%ebx			#	push	[esp+28]
	pushl	%edi			#	push	[esp+28]
	calll	__alldiv		#	call	__alldvrm
	movl	%edx, %ecx		#
	movl	%eax, (%ebp)		#	mov	[ebp], eax
	imull	%eax, %esi		#
	mull	32(%esp)		#
	movl	%ecx, 4(%ebp)		#	mov	[ebp+4], edx
	imull	32(%esp), %ecx		#
	addl	%esi, %edx		#
	addl	%edx, %ecx		#
	subl	%eax, %edi		#
	movl	%ebp, %eax		#
	sbbl	%ecx, %ebx		#
	movl	%edi, 8(%ebp)		#	mov	[ebp+8], ecx
	movl	%ebx, 12(%ebp)		#	mov	[ebp+12], ebx
	popl	%esi			#
	popl	%edi			#	pop	ebx
	popl	%ebx			#	mov	eax, ebp
	popl	%ebp			#	pop	ebp
	retl				#	ret
[…]
Ouch: instead of generating a single call to the compiler helper function __alldvrm(), which returns both quotient and remainder in the register pairs EDX:EAX and EBX:ECX, Clang generates – even with the command line option -O3 specified – code that computes the remainder itself, wasting 15 of the total 31 instructions, 34 of the total 74 bytes, and about 9 CPU clock cycles per function call!

Note: the code of the __divmoddi4() function provided in the clang_rt.builtins-i386.lib library is equally bad:

# Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

	.arch	i386
	.code32
	.intel_syntax noprefix
	.global	___divmoddi4
	.type	___divmoddi4, @function
	.text

___divmoddi4:
	push	ebp
	push	ebx
	push	edi
	push	esi
	push	eax
	mov	edi, [esp+36]
	mov	ebx, [esp+24]
	mov	ebp, [esp+28]
	mov	esi, [esp+32]
	push	edi
	push	esi
	push	ebp
	push	ebx
	call	___divdi3
	add	esp, 16
	mov	ecx, eax
	mov	[esp], edx
	imul	edi, eax
	mul	eax, esi
	add	edx, edi
	mov	edi, [esp]
	imul	esi, edi
	add	esi, edx
	sub	ebx, eax
	mov	eax, [esp+40]
	mov	edx, edi
	sbb	ebp, esi
	mov	[eax], ebx
	mov	[eax+4], ebp
	mov	eax, ecx
	add	esp, 4
	pop	esi
	pop	edi
	pop	ebx
	pop	ebp
	ret

	.ident	"clang version 10.0.0 "
	.end
Additionally count the number of instructions and the code size of the __udivmoddi4() function provided in the clang_rt.builtins-i386.lib library: it shows 256 instructions in 759 bytes, while properly written (assembly) code shows only 68 instructions in just 166 bytes – and runs up to 11 (in words: eleven) times faster!

Compile the source file case2.c with Clang again, now with the preprocessor macro BUNNY defined, targeting the AMD64 platform, and display the generated assembly code:

clang -DBUNNY -o- -O3 -S -target amd64-pc-linux case2.c
	.text
[…]
__udivmoddi4:				# @__udivmoddi4
# %bb.0:
	testq	%rdx, %rdx
	je	.LBB0_2
# %bb.1:
	movq	%rdx, %rcx
	movq	%rdi, %rax
	xorl	%edx, %edx
	divq	%rsi
	movq	%rdx, (%rcx)
.LBB0_2:
	movq	%rdi, %rax
	xorl	%edx, %edx
	divq	%rsi
	retq
[…]
	.ident	"clang version 10.0.0 "
[…]
Oops: despite the command line option -O3, the generated code performs two expensive divisions instead of just one!

Counterexample for the 3 preceding bugs

Compile the source file case2.c a third time with Clang, now targeting the AMD64 platform, and display the generated assembly code:
clang -DBUNNY -o- -O3 -S -target amd64-pc-windows case2.c
	.text
[…]
lldiv:					# @lldiv
# %bb.0:
	movq	%rdx, %rax
	cqto
	idivq	%r8
	movq	%rax, (%rcx)
	movq	%rdx, 8(%rcx)
	movq	%rcx, %rax
	retq
[…]

Unoptimised code of 64÷64-bit integer division functions

Create the text file case3.c with the following source code, refactored from __divdi3() and __moddi3(), in an arbitrary, preferably empty directory:
// Copyleft © 2014-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

unsigned long long __udivmoddi4(unsigned long long dividend,
                                unsigned long long divisor,
                                unsigned long long *remainder);
#ifndef BUNNY
long long __divdi3(long long dividend, long long divisor) {
    long long r = divisor >> 63;    // r = divisor < 0 ? -1 : 0
    long long s = dividend >> 63;   // s = dividend < 0 ? -1 : 0
    divisor = (divisor ^ r) - r;    // negate if divisor < 0
    dividend = (dividend ^ s) - s;  // negate if dividend < 0
    s ^= r;                         // sign of quotient
                                    // negate if quotient < 0
    return (__udivmoddi4(dividend, divisor, 0) ^ s) - s;
}
#else
long long __moddi3(long long dividend, long long divisor) {
    long long r = divisor >> 63;    // r = divisor < 0 ? -1 : 0
    long long s = dividend >> 63;   // s = dividend < 0 ? -1 : 0
    divisor = (divisor ^ r) - r;    // negate if divisor < 0
    dividend = (dividend ^ s) - s;  // negate if dividend < 0
    __udivmoddi4(dividend, divisor, (unsigned long long *) &r);
    return (r ^ s) - s;             // negate if dividend < 0
}
#endif // BUNNY
Compile the source file case3.c with Clang, engaging its optimiser, targeting the i386 platform, and display the generated, not properly optimised assembly code:
clang -o- -O3 -S -target i386-pc-windows case3.c
Note: the left column shows the generated code, while the right column shows the properly optimised code as comment!
	.text
[…]
___divdi3:				# @__divdi3
# %bb.0:
	pushl	%ebp			#
	movl	%esp, %ebp		#
	pushl	%ebx			#	push	ebx
	pushl	%edi			#
	pushl	%esi			#
	movl	20(%ebp), %ecx		#	mov	eax, [esp+20]
	movl	12(%ebp), %eax		#
	movl	16(%ebp), %edi		#	mov	ebx, [esp+16]
	movl	8(%ebp), %ebx		#
	movl	%ecx, %edx		#
	movl	%eax, %esi		#
	sarl	$31, %edx		#
	sarl	$31, %esi		#	cdq
	xorl	%edx, %edi		#	xor	ebx, edx
	xorl	%edx, %ecx		#	xor	eax, edx
	subl	%edx, %edi		#	sub	ebx, edx
	sbbl	%edx, %ecx		#	sbb	eax, edx
	xorl	%esi, %ebx		#
	xorl	%esi, %eax		#
	subl	%esi, %ebx		#
	sbbl	%esi, %eax		#
	xorl	%edx, %esi		#	mov	ebx, edx
	pushl	$0			#	push	0
	pushl	%ecx			#	push	eax
	pushl	%edi			#	push	ecx
					#	mov	eax, [esp+24]
					#	mov	ecx, [esp+20]
					#	cdq
					#	xor	ecx, edx
					#	xor	eax, edx
					#	sub	ecx, edx
					#	sbb	eax, edx
					#	xor	ebx, edx
	pushl	%eax			#	push	eax
	pushl	%ebx			#	push	ecx
	calll	___udivmoddi4		#	call	___udivmoddi4
	addl	$20, %esp		#	add	esp, 20
	xorl	%esi, %eax		#	xor	eax, ebx
	xorl	%esi, %edx		#	xor	edx, ebx
	subl	%esi, %eax		#	sub	eax, ebx
	sbbl	%esi, %edx		#	sbb	edx, ebx
	popl	%esi			#
	popl	%edi			#
	popl	%ebx			#	pop	ebx
	popl	%ebp			#
	retl				#	ret
[…]
Oops: despite the command line option -O3, the generated code shows 38 instructions and clobbers the registers EBP, EDI and ESI without necessity; the properly optimised code shows just 30 instructions and avoids 6 superfluous memory accesses.

Compile the source file case3.c a second time with Clang, now with the preprocessor macro BUNNY defined, again targeting the i386 platform, and display the generated assembly code:

clang -DBUNNY -o- -O3 -S -target i386-pc-windows case3.c
Note: the left column shows the generated code, while the right column shows the properly optimised code as comment!
	.text
[…]
___moddi3:				# @__moddi3
# %bb.0:
	pushl	%ebp			#
	movl	%esp, %ebp		#
	pushl	%ebx			#	push	ebx
	pushl	%edi			#
	pushl	%esi			#
	andl	$-8, %esp		#
	subl	$16, %esp		#	sub	esp, 8
	movl	20(%ebp), %ecx		#	mov	eax, [esp+28]
	movl	12(%ebp), %eax		#
	movl	16(%ebp), %edi		#	mov	ecx, [esp+24]
	movl	%esp, %ebx		#	push	esp
	movl	%ecx, %edx		#
	movl	%eax, %esi		#
	sarl	$31, %edx		#	cdq
	sarl	$31, %esi		#
	xorl	%edx, %edi		#	xor	ecx, edx
	xorl	%edx, %ecx		#	xor	eax, edx
	movl	%edx, 4(%esp)		#
	movl	%edx, (%esp)		#
	subl	%edx, %edi		#	sub	ecx, edx
	sbbl	%edx, %ecx		#	sbb	eax, edx
					#	push	eax
					#	push	ecx
					#	mov	eax, [esp+32]
	movl	8(%ebp), %edx		#	mov	ecx, [esp+28]
					#	cdq
					#	mov	ebx, edx
	xorl	%esi, %eax		#	xor	ecx, edx
	xorl	%esi, %edx		#	xor	eax, edx
	subl	%esi, %edx		#	sub	ecx, edx
	sbbl	%esi, %eax		#	sbb	eax, edx
	pushl	%ebx			#
	pushl	%ecx			#
	pushl	%edi			#
	pushl	%eax			#	push	eax
	pushl	%edx			#	push	ecx
	calll	___udivmoddi4		#	call	___udivmoddi4
	addl	$20, %esp		#	add	esp, 20
	movl	(%esp), %eax		#	pop	eax
	movl	4(%esp), %edx		#	pop	edx
	xorl	%esi, %eax		#	xor	eax, ebx
	xorl	%esi, %edx		#	xor	edx, ebx
	subl	%esi, %eax		#	sub	eax, ebx
	sbbl	%esi, %edx		#	sbb	edx, ebx
	leal	-12(%ebp), %esp		#
	popl	%esi			#
	popl	%edi			#
	popl	%ebx			#	pop	ebx
	popl	%ebp			#
	retl				#	ret
[…]
OUCH: despite the command line option -O3, the generated code shows 45 instructions and clobbers the registers EBP, EDI and ESI without necessity; the properly optimised code shows just 32 instructions and avoids 8 superfluous memory accesses.

Now compare the code shown above with the code from the clang_rt.builtins-i386.lib library, which is even worse and shows 51 instructions:

# Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

	.arch	i386
	.code32
	.intel_syntax noprefix
	.global	___moddi3
	.type	___moddi3, @function
	.text

___moddi3:
	push	ebp
	mov	ebp, esp
	push	ebx
	push	edi
	push	esi
	and	esp, -8
	sub	esp, 16
	mov	edx, [___security_cookie]
	mov	ecx, [ebp+20]
	mov	esi, [ebp+16]
	mov	eax, [ebp+12]
	mov	edi, esp
	xor	edx, ebp
	mov	ebx, eax
	mov	[esp+8], edx
	mov	edx, ecx
	sar	edx, 31
	add	esi, edx
	adc	ecx, edx
	xor	esi, edx
	sar	ebx, 31
	xor	ecx, edx
	mov	edx, [ebp+8]
	xor	eax, ebx
	xor	edx, ebx
	sub	edx, ebx
	sbb	eax, ebx
	push	edi
	push	ecx
	push	esi
	push	eax
	push	edx
	call	___udivmoddi4
	add	esp, 20
	mov	edi, [esp]
	mov	esi, [esp+4]
	mov	ecx, [esp+8]
	xor	edi, ebx
	xor	esi, ebx
	sub	edi, ebx
	sbb	esi, ebx
	xor	ecx, ebp
	call	@__security_check_cookie@4
	mov	eax, edi
	mov	edx, esi
	lea	esp, [ebp-12]
	pop	esi
	pop	edi
	pop	ebx
	pop	ebp
	ret

	.ident	"clang version 10.0.0 "
	.end

Disastrous performance of 128÷128-bit (and 64÷64-bit) integer division

Create the text file case4llvm.c with the following source code, refactored from __udivmodti4(), in an arbitrary, preferably empty directory:
// Copyleft © 2014-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

// Effects: if modulus != 0, *modulus = numerator % denominator
// Returns: numerator / denominator

// Translated from Figure 3-40 of The PowerPC Compiler Writer's Guide

__uint128_t __udivmodti4(__uint128_t numerator, __uint128_t denominator, __uint128_t *modulus) {
    union {
        __uint128_t all;
        struct {
            unsigned long long low, high;
        };
    } dividend, divisor, quotient, remainder;
    dividend.all = numerator;
    divisor.all = denominator;
    unsigned carry, shift;
    // special cases, X is unknown, K != 0
    if (dividend.high == 0) {
        if (divisor.high == 0) {
            // 0 X
            // ---
            // 0 X
            if (modulus)
                *modulus = dividend.low % divisor.low;
            return dividend.low / divisor.low;
        }
        // 0 X
        // ---
        // K X
        if (modulus)
            *modulus = dividend.low;
        return 0;
    }
    // dividend.high != 0
    if (divisor.low == 0) {
        if (divisor.high == 0) {
            // K X
            // ---
            // 0 0
            if (modulus)
                *modulus = dividend.high % divisor.low;
            return dividend.high / divisor.low;
        }
        // divisor.high != 0
        if (dividend.low == 0) {
            // K 0
            // ---
            // K 0
            if (modulus) {
                remainder.high = dividend.high % divisor.high;
                remainder.low = 0;
                *modulus = remainder.all;
            }
            return dividend.high / divisor.high;
        }
        // K K
        // ---
        // K 0
        if ((divisor.high & (divisor.high - 1)) == 0) {
            // divisor is a power of 2
            if (modulus) {
                remainder.low = dividend.low;
                remainder.high = dividend.high & (divisor.high - 1);
                *modulus = remainder.all;
            }
            return dividend.high >> __builtin_ctzll(divisor.high);
        }
        // K K
        // ---
        // K 0
        shift = __builtin_clzll(divisor.high)
              - __builtin_clzll(dividend.high);
        // 0 <= shift < 63 or shift large
        if (shift > 62) {
            if (modulus)
                *modulus = dividend.all;
            return 0;
        }
        ++shift;
        // 0 < shift < 64
#ifdef OPTIMIZE
        remainder.all = dividend.all >> shift;
#else
        remainder.high = dividend.high >> shift;
        remainder.low = (dividend.low >> shift)
                      | (dividend.high << (64 - shift));
#endif
        quotient.high = dividend.low << (64 - shift);
        quotient.low = 0;
    } else {
        // divisor.low != 0
        if (divisor.high == 0) {
            // K X
            // ---
            // 0 K
            if ((divisor.low & (divisor.low - 1)) == 0) {
                // divisor is a power of 2
                if (modulus)
                    *modulus = dividend.low & (divisor.low - 1);
                if (divisor.low == 1)
                    return dividend.all;
                shift = __builtin_ctzll(divisor.low);
#ifdef OPTIMIZE
                quotient.all = dividend.all >> shift;
#else
                quotient.high = dividend.high >> shift;
                quotient.low = (dividend.low >> shift)
                             | (dividend.high << (64 - shift));
#endif
                return quotient.all;
            }
            // K X
            // ---
            // 0 K
            shift = __builtin_clzll(divisor.low)
                  - __builtin_clzll(dividend.high)
                  + 65;
            // 1 < shift < 128
#ifdef OPTIMIZE
            remainder.all = dividend.all >> shift;
            quotient.all = dividend.all << (128 - shift);
#else
            if (shift == 64) {
                remainder.high = 0;
                remainder.low = dividend.high;
                quotient.high = dividend.low;
                quotient.low = 0;
            } else if (shift < 64) {
                // 1 < shift < 64
                remainder.high = dividend.high >> shift;
                remainder.low = (dividend.low >> shift)
                              | (dividend.high << (64 - shift));
                quotient.high = dividend.low << (64 - shift);
                quotient.low = 0;
            } else {
                // 64 < shift < 128
                remainder.high = 0;
                remainder.low = dividend.high >> (shift - 64);
                quotient.high = (dividend.low >> (shift - 64))
                              | (dividend.high << (128 - shift));
                quotient.low = dividend.low << (128 - shift);
            }
#endif
        } else {
            // K X
            // ---
            // K K
            shift = __builtin_clzll(divisor.high)
                  - __builtin_clzll(dividend.high);
            // 0 <= shift < 64 or shift large
            if (shift > 63) {
                if (modulus)
                    *modulus = dividend.all;
                return 0;
            }
            ++shift;
            // 0 < shift <= 64
#ifdef OPTIMIZE
            remainder.all = dividend.all >> shift;
            quotient.high = dividend.low << (64 - shift);
#else
            if (shift == 64) {
                remainder.high = 0;
                remainder.low = dividend.high;
                quotient.high = dividend.low;
            } else {
                remainder.high = dividend.high >> shift;
                remainder.low = (dividend.low >> shift)
                              | (dividend.high << (64 - shift));
                quotient.high = dividend.low << (64 - shift);
            }
#endif
            quotient.low = 0;
        }
    }
    // Not a special case
    // q and r are initialized with:
    // remainder.all = dividend.all >> shift;
    // quotient.all = dividend.all << (128 - shift);
    // 0 < shift < 128
    for (carry = 0; shift > 0; --shift) {
        // remainder:quotient = ((remainder:quotient) << 1) | carry
#ifdef OPTIMIZE
        remainder.all = (remainder.all << 1) | (quotient.all >> 127);
        quotient.all = (quotient.all << 1) | carry;
#else
        remainder.high = (remainder.high << 1) | (remainder.low >> 63);
        remainder.low = (remainder.low << 1) | (quotient.high >> 63);
        quotient.high = (quotient.high << 1) | (quotient.low >> 63);
        quotient.low = (quotient.low << 1) | carry;
#endif
#if 0
        carry = 0;
        if (remainder.all < divisor.all)
            continue;

        carry = 1;
        remainder.all -= divisor.all;
#elif 0
        carry = remainder.all >= divisor.all;
        if (carry != 0)
            remainder.all -= divisor.all;
#else
        const __int128_t sign = (__int128_t) (divisor.all - remainder.all - 1) >> 127;
        carry = sign & 1;
        remainder.all -= divisor.all & sign;
#endif
    }
#ifdef OPTIMIZE
    quotient.all = (quotient.all << 1) | carry;
#else
    quotient.high = (quotient.high << 1) | (quotient.low >> 63);
    quotient.low = (quotient.low << 1) | carry;
#endif
    if (modulus)
        *modulus = remainder.all;
    return quotient.all;
}
Note: with the preprocessor macro OPTIMIZE defined, this function gains about 10 % to 25 % performance when compiled with Clang, but loses about the same amount when compiled with GCC!

Create the text file case4gcc.c with the following source code, refactored from libgcc2.c, in the same directory as case4llvm.c:

// Copyleft © 2015-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>
// Copyright © 1989-2020 Free Software Foundation, Inc.

typedef unsigned long long uint64_t;

static inline
uint64_t DIV(uint64_t dividend_hi, uint64_t dividend_lo, uint64_t divisor, uint64_t *remainder) {
    uint64_t quotient;
    __asm__("divq %[divisor]"
           : "=a" (quotient), "=d" (*remainder)
           : [divisor] "rm" (divisor), "a" (dividend_lo), "d" (dividend_hi));
    return quotient;
}

__uint128_t __udivmodti4(__uint128_t numerator, __uint128_t denominator, __uint128_t *modulus) {
    union {
        __uint128_t all;
        struct {
            uint64_t low, high;
        };
    } dividend, divisor, quotient, product;
    unsigned bn, bm;

    dividend.all = numerator;
    divisor.all = denominator;

    if (divisor.high == 0) {
        if (divisor.low > dividend.high) // 0:q = n:n / 0:d
            quotient.high = 0;
        else                             // q:q = n:n / 0:d
            quotient.high = DIV(0, dividend.high, divisor.low, &dividend.high);

        quotient.low = DIV(dividend.high, dividend.low, divisor.low, &dividend.low);

        // remainder in dividend.low
        if (modulus != 0) {
            dividend.high = 0;
            *modulus = dividend.all;
        }
    } else {
        if (divisor.high > dividend.high) { // 0:0 = n:n / d:d
            quotient.all = 0;

            // remainder in dividend.all
            if (modulus != 0)
                *modulus = dividend.all;
        } else { // 0:q = n:n / d:d

            bm = __builtin_clzll(divisor.high);
            if (bm == 0) {
                // from most significant bit of divisor.high is set
                // and dividend.high >= divisor.high
                // follows most significant bit of dividend.high is set
                // as well as quotient.low is either 0 or 1
                //
                // this special case is necessary, not an optimization!

                // the condition on the next line takes advantage that
                // (due to program flow) dividend.high >= divisor.high
                if (dividend.high > divisor.high || dividend.low >= divisor.low) {
                    dividend.all -= divisor.all;
                    quotient.low = 1;
                } else
                    quotient.low = 0;

                quotient.high = 0;

                if (modulus != 0)
                    *modulus = dividend.all;
            } else { // normalize
                bn = 64 - bm;

                quotient.high = dividend.high >> bn;
#ifdef OPTIMIZE
                dividend.all <<= bm;
                divisor.all <<= bm;
#else
                dividend.high <<= bm;
                dividend.high |= dividend.low >> bn;
                dividend.low <<= bm;

                divisor.high <<= bm;
                divisor.high |= divisor.low >> bn;
                divisor.low <<= bm;
#endif
                quotient.low = DIV(quotient.high, dividend.high, divisor.high, &dividend.high);
                quotient.high = 0;
                product.all = quotient.all * divisor.low;

                if (product.all > dividend.all) {
                    product.all -= divisor.all;
                    quotient.low -= 1;
                }

                // remainder is (dividend - product) >> bm
                if (modulus != 0) {
                    dividend.all -= product.all;
#ifdef OPTIMIZE
                    dividend.all >>= bm;
#else
                    dividend.low >>= bm;
                    dividend.low |= dividend.high << bn;
                    dividend.high >>= bm;
#endif
                    *modulus = dividend.all;
                }
            }
        }
    }

    return quotient.all;
}
Create the text file case4.c with the following content in the same directory as case4gcc.c and case4llvm.c:
// Copyright © 2015-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#include <stdint.h>
#include <stdio.h>
#include <time.h>

extern
__uint128_t __udivmodti4(__uint128_t dividend, __uint128_t divisor, __uint128_t *remainder);

__attribute__((noinline))
__uint128_t __unopti4(__uint128_t dividend, __uint128_t divisor, __uint128_t *remainder)
{
    if (remainder != NULL)
        *remainder = divisor;

    return dividend;
}

__attribute__((always_inline))
__uint128_t lfsr128(void)
{
    // 128-bit linear feedback shift register (Galois form) using
    //  primitive polynomial 0x5DB2B62B0C5F8E1B:D8CCE715FCB2726D,
    //   initialised with 2**128 / golden ratio

    const  __uint128_t poly = (__uint128_t) 0x5DB2B62B0C5F8E1B << 64 | 0xD8CCE715FCB2726D;
    static __uint128_t lfsr = (__uint128_t) 0x9E3779B97F4A7C15 << 64 | 0xF39CC0605CEDC834;
    const  __uint128_t sign = (__int128_t) lfsr >> 127;

    return lfsr = (lfsr << 1) ^ (poly & sign);
}

__attribute__((always_inline))
__uint128_t lfsr64(void)
{
    // 64-bit linear feedback shift register (Galois form) using
    //  primitive polynomial 0xAD93D23594C935A9 (CRC-64 "Jones"),
    //   initialised with 2**64 / golden ratio

    static uint64_t lfsr = 0x9E3779B97F4A7C15;
    const  uint64_t sign = (int64_t) lfsr >> 63;

    return lfsr = (lfsr << 1) ^ (0xAD93D23594C935A9 & sign);
}

__attribute__((always_inline))
__uint128_t lfsr32(void)
{
    // 32-bit linear feedback shift register (Galois form) using
    //  primitive polynomial 0xDB710641 (CRC-32 IEEE),
    //   initialised with 2**32 / golden ratio

    static uint32_t lfsr = 0x9E3779B9;
    const  uint32_t sign = (int32_t) lfsr >> 31;

    return lfsr = (lfsr << 1) ^ (0xDB710641 & sign);
}

int main(void)
{
    clock_t t0, t1, t2, tt;
    uint32_t n;
    __uint128_t dividend, divisor = ~0, remainder;
    volatile __uint128_t quotient;

    t0 = clock();

    for (n = 500000000; n > 0; n--)
    {
        dividend = lfsr128();
        quotient = __unopti4(dividend, divisor, NULL);
        divisor = lfsr64();
        quotient = __unopti4(dividend, divisor, &remainder);
    }

    t1 = clock();

    for (n = 500000000; n > 0; n--)
    {
        dividend = lfsr128();
        quotient = __udivmodti4(dividend, divisor, NULL);
        divisor = lfsr64();
        quotient = __udivmodti4(dividend, divisor, &remainder);
    }

    t2 = clock();
    tt = t2 - t0;
    t2 -= t1;
    t1 -= t0;
    t0 = t2 - t1;

    printf("\n"
           "__unopti4()       %4lu.%06lu       0\n"
           "__udivmodti4()    %4lu.%06lu    %4lu.%06lu\n"
           "                  %4lu.%06lu nano-seconds\n",
           t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);

    t0 = clock();

    for (n = 500000000; n > 0; n--)
    {
        dividend = lfsr128();
        quotient = __unopti4(dividend, divisor, NULL);
        divisor = lfsr32();
        quotient = __unopti4(dividend, divisor, &remainder);
    }

    t1 = clock();

    for (n = 500000000; n > 0; n--)
    {
        dividend = lfsr128();
        quotient = __udivmodti4(dividend, divisor, NULL);
        divisor = lfsr32();
        quotient = __udivmodti4(dividend, divisor, &remainder);
    }

    t2 = clock();
    tt = t2 - t0;
    t2 -= t1;
    t1 -= t0;
    t0 = t2 - t1;

    printf("\n"
           "__unopti4()       %4lu.%06lu       0\n"
           "__udivmodti4()    %4lu.%06lu    %4lu.%06lu\n"
           "                  %4lu.%06lu nano-seconds\n",
           t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);

    t0 = clock();

    for (n = 500000000; n > 0; n--)
    {
        dividend = lfsr64();
        quotient = __unopti4(dividend, divisor, NULL);
        divisor = lfsr32();
        quotient = __unopti4(dividend, divisor, &remainder);
    }

    t1 = clock();

    for (n = 500000000; n > 0; n--)
    {
        dividend = lfsr64();
        quotient = __udivmodti4(dividend, divisor, NULL);
        divisor = lfsr32();
        quotient = __udivmodti4(dividend, divisor, &remainder);
    }

    t2 = clock();
    tt = t2 - t0;
    t2 -= t1;
    t1 -= t0;
    t0 = t2 - t1;

    printf("\n"
           "__unopti4()       %4lu.%06lu       0\n"
           "__udivmodti4()    %4lu.%06lu    %4lu.%06lu\n"
           "                  %4lu.%06lu nano-seconds\n",
           t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);

    t0 = clock();

    for (n = 500000; n > 0; n--)
    {
        quotient = __unopti4(~0, 3, NULL);
        quotient = __unopti4(~0, 3, &remainder);
    }

    t1 = clock();

    for (n = 500000; n > 0; n--)
    {
        quotient = __udivmodti4(~0, 3, NULL);
        quotient = __udivmodti4(~0, 3, &remainder);
    }

    t2 = clock();
    tt = t2 - t0;
    t2 -= t1;
    t1 -= t0;
    t0 = t2 - t1;

    printf("\n"
           "__unopti4()       %4lu.%06lu       0\n"
           "__udivmodti4()    %4lu.%06lu    %4lu.%06lu\n"
           "                  %4lu.%06lu micro-seconds\n",
           t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);
}
Note: modification of the source files case4llvm.c, case4gcc.c and case4.c to demonstrate the equally disastrous (mis)performance of the 64÷64-bit integer division on 32-bit processors is left as an exercise to the reader.

On your favourite system with an AMD64 processor, where both Clang and the GNU Compiler Collection are installed, run the following commands:

lscpu
gcc -mno-sse -O3 case4.c case4gcc.c
echo 'GCC/libgcc'
./a.out
clang -mno-sse -O3 case4.c case4llvm.c
echo 'LLVM/clang/compiler-rt'
./a.out
Note: if you prefer to use the library function __udivmodti4() provided by the respective compiler, run the following commands instead:
lscpu
gcc -mno-sse -O3 case4.c
echo 'GCC/libgcc'
./a.out
clang -mno-sse -O3 -rtlib=compiler-rt case4.c
echo 'LLVM/clang/compiler-rt'
./a.out

Runtime measurement on AMD® Ryzen 5 3600

Note: for better readability and to ease their comparison, the numbers are shown in two columns.
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   43 bits physical, 48 bits virtual
CPU(s):                          12
On-line CPU(s) list:             0-11
Thread(s) per core:              2
Core(s) per socket:              6
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           113
Model name:                      AMD Ryzen 5 3600 6-Core Processor
Stepping:                        0
Frequency boost:                 enabled
CPU MHz:                         2195.688
CPU max MHz:                     3600,0000
CPU min MHz:                     2200,0000
BogoMIPS:                        7186.29
Virtualization:                  AMD-V
L1d cache:                       192 KiB
L1i cache:                       192 KiB
L2 cache:                        3 MiB
L3 cache:                        32 MiB
NUMA node0 CPU(s):               0-11
[…]
                      GCC/libgcc                    LLVM/clang/compiler-rt

__unopti4()          1.668976       0              1.695685       0
__udivmodti4()      12.617999      10.949023     167.891651     166.195966
                    14.286975 nano-seconds       169.587336 nano-seconds

__unopti4()          1.760362       0              1.708451       0
__udivmodti4()      18.046460      16.286098     246.065291     244.356840
                    19.806822 nano-seconds       247.773742 nano-seconds

__unopti4()          1.248846       0              1.463868       0
__udivmodti4()       7.155582       5.906736      10.833658       9.369790
                     8.404428 nano-seconds        12.297526 nano-seconds
OUCH: in the best case, the executable generated by Clang performs the 128÷128-bit integer division 1.6 times slower than the executable generated by GCC, while in the worst case it is a full 15 (in words: fifteen) times slower – an utterly devastating result!

Runtime measurement on AMD® EPYC 7262

Note: for better readability and to ease their comparison, the numbers are shown in two columns.
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             1
NUMA node(s):          1
Vendor ID:             AuthenticAMD
CPU family:            23
Model:                 49
Model name:            AMD EPYC 7262 8-Core Processor
Stepping:              0
CPU MHz:               3193.726
BogoMIPS:              6387.45
Virtualization:        AMD-V
L1d cache:             32K
L1i cache:             32K
L2 cache:              512K
L3 cache:              16384K
NUMA node0 CPU(s):     0-15
[…]
                      GCC/libgcc                    LLVM/clang/compiler-rt

__unopti4()          1.920000       0              2.250000       0
__udivmodti4()      15.420000      13.500000     230.230000     227.980000
                    17.340000 nano-seconds       232.480000 nano-seconds

__unopti4()          2.000000       0              2.280000       0
__udivmodti4()      22.120000      20.120000     340.400000     338.120000
                    24.120000 nano-seconds       342.680000 nano-seconds

__unopti4()          1.620000       0              1.780000       0
__udivmodti4()       8.810000       7.190000      13.200000      11.420000
                    10.430000 nano-seconds        14.980000 nano-seconds
OUCH: now the worst case is 17 (in words: seventeen) times slower!

Runtime measurement on Intel® Core i5-6600

Note: for better readability and to ease their comparison, the numbers are shown in two columns.
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          4
On-line CPU(s) list:             0-3
Thread(s) per core:              1
Core(s) per socket:              4
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           94
Model name:                      Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz
Stepping:                        3
CPU MHz:                         3311.998
BogoMIPS:                        6623.99
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       128 KiB
L1i cache:                       128 KiB
L2 cache:                        1 MiB
L3 cache:                        24 MiB
NUMA node0 CPU(s):               0-3
[…]
                      GCC/libgcc                    LLVM/clang/compiler-rt

__unopti4()          5.075482       0              5.077103       0
__udivmodti4()      34.428038      29.352556     183.407210     178.330107
                    39.503520 nano-seconds       188.484313 nano-seconds

__unopti4()          4.652310       0              4.653633       0
__udivmodti4()      36.139848      31.487538     280.501125     275.847492
                    40.792158 nano-seconds       285.154758 nano-seconds

__unopti4()          4.938115       0              4.650043       0
__udivmodti4()       8.971484       4.033369      12.585962       7.935919
                    13.909599 nano-seconds        17.236005 nano-seconds
Oops: in the best case, the executable generated by Clang performs the 128÷128-bit integer division at half the speed of the executable generated by GCC, while in the worst case it is a full 9 (in words: nine) times slower – another devastating result!

Careless (or clueless?) wasting of resources on Microsoft Windows

The executable installer LLVM-10.0.0-win64.exe dumps the following duplicate files in the directory C:\Program Files\LLVM\bin\, wasting nearly 500 MB disk space, which is about a third of the disk space occupied by the whole package:
C:\>DIR "C:\Program Files\LLVM\bin" /O:-S
[…]
03/25/2020   0:15 PM        83,258,880 clang-cl.exe
03/25/2020   0:03 PM        83,258,880 clang.exe
03/25/2020   0:15 PM        83,258,880 clang++.exe
03/25/2020   0:15 PM        83,258,880 clang-cpp.exe
[…]
03/25/2020   0:15 PM        57,812,480 lld-link.exe
03/25/2020   0:05 PM        57,812,480 lld.exe
03/25/2020   0:15 PM        57,812,480 ld.lld.exe
03/25/2020   0:15 PM        57,812,480 ld64.lld.exe
03/25/2020   0:15 PM        57,812,480 wasm-ld.exe
[…]
03/25/2020   0:15 PM        18,182,144 llvm-ranlib.exe
03/25/2020   0:15 PM        18,182,144 llvm-lib.exe
03/25/2020   0:00 PM        18,182,144 llvm-ar.exe
[…]
C:\>FC.EXE /B "C:\Program Files\LLVM\bin\clang.exe" "C:\Program Files\LLVM\bin\clang-cl.exe"
Comparing files C:\Program Files\LLVM\bin\clang.exe and C:\Program Files\LLVM\bin\clang-cl.exe
FC: no differences encountered
[…]
C:\>FSUTIL.EXE Hardlink List "C:\Program Files\LLVM\bin\clang.exe"
\Program Files\LLVM\bin\clang.exe
C:\>
[…]
Ditto for the executable installer LLVM-10.0.0-win32.exe:
C:\>DIR "C:\Program Files (x86)\LLVM\bin" /O:-S
[…]
03/25/2020  11:21 AM        77,862,912 clang-cpp.exe
03/25/2020  11:21 AM        77,862,912 clang-cl.exe
03/25/2020  11:21 AM        77,862,912 clang++.exe
03/25/2020  11:13 AM        77,862,912 clang.exe
[…]
03/25/2020  11:22 AM        54,811,648 ld.lld.exe
03/25/2020  11:22 AM        54,811,648 ld64.lld.exe
03/25/2020  11:15 AM        54,811,648 lld.exe
03/25/2020  11:22 AM        54,811,648 lld-link.exe
03/25/2020  11:22 AM        54,811,648 wasm-ld.exe
[…]
03/25/2020  11:21 AM        17,346,560 llvm-ranlib.exe
03/25/2020  11:10 AM        17,346,560 llvm-ar.exe
03/25/2020  11:21 AM        17,346,560 llvm-lib.exe
[…]
C:\>
Ever heard of hardlinks?
NTFS has supported them since its introduction nearly 30 years ago, and Windows NT has provided an API to create them for 24 years.
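
A single Win32 call per duplicate would have sufficed; a minimal sketch (mine, not taken from the LLVM installer), using clang-cl.exe as an example:

// hedged sketch: create clang-cl.exe as a hardlink to the identical clang.exe
#include <windows.h>
#include <stdio.h>

int main(void) {
    if (!CreateHardLinkW(L"C:\\Program Files\\LLVM\\bin\\clang-cl.exe",  // new link
                         L"C:\\Program Files\\LLVM\\bin\\clang.exe",     // existing file
                         NULL))                                          // default security
        fprintf(stderr, "CreateHardLinkW() failed: %lu\n", GetLastError());
    return 0;
}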

Which cross-compiler?

The pages Getting Started with the LLVM System and Cross-compilation using Clang state:
It is possible to cross-compile LLVM itself. That is, you can create LLVM executables and libraries to be hosted on a platform different from the platform where they are built (a Canadian Cross build).
Clang/LLVM is natively a cross-compiler, meaning that one set of programs can compile to all targets by setting the -target option.
Contrary to both statements, it’s not even possible to build an executable for 32-bit processors with the 64-bit compiler and linker on Windows, and vice versa!

To falsify LLVM’s statements and prove my claim, create the text file case6.c with the following content in an arbitrary, preferably empty directory:

// Copyleft © 2014-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

int main() {
    volatile _Complex double a, b;
    volatile _Complex double m = a * b;
    volatile _Complex double n = a / b;
}
Compile the source file case6.c with Clang from the 32-bit package, targeting the AMD64 platform, and display the generated assembly code:
"C:\Program Files (x86)\LLVM\bin\clang.exe" -o- -O3 -S -target amd64-pc-windows case6.c
	.text
[…]
_main:					# @main
[…]
	callq	__muldc3
[…]
	callq	__divdc3
[…]
Note: the functions __divdc3() and __muldc3() for the AMD64 platform are provided in the library clang_rt.builtins-x86_64.lib.
"C:\Program Files (x86)\LLVM\bin\clang.exe" -O3 -target amd64-pc-windows case6.c
llvm-3d1c91.o : error LNK2019: unresolved external symbol __muldc3 referenced in function _main
llvm-3d1c91.o : error LNK2019: unresolved external symbol __divdc3 referenced in function _main
a.exe : fatal error LNK1120: 2 unresolved externals
clang: error: linker command failed with exit code 1120 (use -v to see invocation)
Compile the source file case6.c a second time, now with Clang from the 64-bit package, targeting the i386 platform, and display the generated assembly code:
"C:\Program Files\LLVM\bin\clang.exe" -o- -O3 -S -target i386-pc-windows case6.c
	.text
[…]
_main:					# @main
[…]
	calll	___muldc3
[…]
	calll	___divdc3
[…]
Note: the functions __divdc3() and __muldc3() for the i386 platform are provided in the library clang_rt.builtins-i386.lib.
"C:\Program Files\LLVM\bin\clang.exe" -O3 -target i386-pc-windows case6.c
llvm-f4c2d1.o : error LNK2019: unresolved external symbol ___muldc3 referenced in function _main
llvm-f4c2d1.o : error LNK2019: unresolved external symbol ___divdc3 referenced in function _main
a.exe : fatal error LNK1120: 2 unresolved externals
clang: error: linker command failed with exit code 1120 (use -v to see invocation)
Although the compiler installed with both packages is able to produce 32-bit as well as 64-bit objects, and the linker is able to produce either 32-bit executables from 32-bit objects or 64-bit executables from 64-bit objects, 32-bit objects can’t be linked to produce 32-bit executables using the 64-bit package, and vice versa: each package contains only the clang_rt.*.lib for its own processor architecture!
C:\>DIR "C:\Program Files\LLVM\lib\clang\10.0.0\lib\windows\*-i386.lib"
[…]
File Not Found
[…]
C:\>DIR "C:\Program Files (x86)\LLVM\lib\clang\10.0.0\lib\windows\*-x86_64.lib"
[…]
File Not Found
[…]
Note: side-by-side installation of the 32-bit and 64-bit package needs 3 GB disk space, wasting 1 GB for duplicate files – that’s 100 times the disk space of the missing clang_rt.*.lib libraries!

Misleading message box from installer on Microsoft Windows

[Screenshot of misleading installer message box] Poor souls who want (really: need; see above) to install the 64-bit package after or alongside the 32-bit package (or vice versa) on Windows are greeted by the installer with the misleading message box shown on the right: the version labelled as old is really just the current version for the other processor architecture!
If they choose to continue without uninstallation, the shortcuts created in the start menu by the other version, as well as its uninstall information, are overwritten – a really user-friendly move!

Also note the denglisch kauderwelsch: the title bar and the buttons are localised, but the message text isn’t.

Impaired performance of one of the most trivial functions

Create the text file case8.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifdef __amd64__
__int128_t __absti2(__int128_t argument) {
    return argument < 0 ? -argument : argument;
}
#else
long long __absdi2(long long argument) {
#ifdef BUNNY
    return __builtin_llabs(argument);
#else
    return argument < 0 ? -argument : argument;
#endif // BUNNY
}

long __abssi2(long argument) {
#ifdef BUNNY
    return __builtin_labs(argument);
#else
    return argument < 0 ? -argument : argument;
#endif // BUNNY
}
#endif // __amd64__
Compile the source file case8.c with Clang, engaging its optimiser, targeting the AMD64 platform, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-linux case8.c
Note: the left column shows the generated code, while the right column shows shorter, but still not properly optimised code as comment!
	.text
[…]
__absti2:				# @__absti2
# %bb.0:
	xorl	%edx, %edx		#	xor	edx, edx
	movq	%rdi, %rax		#	mov	rax, rdi
	negq	%rax			#	neg	rax
	sbbq	%rsi, %rdx		#	sbb	rdx, rsi
	testq	%rsi, %rsi		#
	cmovnsq	%rdi, %rax		#	cmovs	rax, rdi
	cmovnsq	%rsi, %rdx		#	cmovs	rdx, rsi
	retq				#	ret
[…]
	.ident	"clang version 10.0.0 "
[…]
Oops: the use of CMOVcc instructions introduces an explicit data dependency without necessity and impairs performance!

The following code avoids the CMOVcc instructions and is 3 bytes shorter:

# Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.arch	generic64
	.code64
	.intel_syntax noprefix
	.global	__absti2
	.type	__absti2, @function
	.text

__absti2:
	mov	rax, rsi		# rax = high qword of argument
	cqo				# rdx = (argument < 0) ? -1 : 0
	mov	rax, rdx		# rdx:rax = (argument < 0) ? -1 : 0
	add	rdi, rdx
	adc	rsi, rdx		# rsi:rdi = (argument < 0) ? argument - 1 : argument
	xor	rax, rdi
	xor	rdx, rsi		# rdx:rax = (argument < 0) ? -argument : argument
					#         = |argument|
	ret

	.end
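
The same sign-propagation trick can also be expressed in C, coercing the compiler toward the CMOVcc-free sequence; a minimal sketch of mine, relying on the arithmetic right shift of negative values that both Clang and GCC implement:

// alternative formulation of __absti2(), without conditional moves
__int128_t __absti2(__int128_t argument) {
    const __int128_t sign = argument >> 127;   // -1 if argument < 0, else 0
    return (argument + sign) ^ sign;           // decrement and complement when negative
}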
Compile the source file case8.c a second time with Clang, now targeting the i386 platform, and display the generated assembly code:
clang -o- -O3 -S -target i386-pc-linux case8.c
	.text
[…]
___absdi2:				# @__absdi2
# %bb.0:
	movl	8(%esp), %edx
	movl	4(%esp), %eax
	movl	%edx, %ecx
	sarl	$31, %ecx
	addl	%ecx, %eax
	adcl	%ecx, %edx
	xorl	%ecx, %eax
	xorl	%ecx, %edx
	retl
[…]
___abssi2:				# @__abssi2
# %bb.0:
	movl	4(%esp), %ecx
	movl	%ecx, %eax
	negl	%eax
	cmovll	%ecx, %eax
	retl
[…]
	.ident	"clang version 10.0.0 "
[…]
Oops: the use of a CMOVcc instruction introduces an explicit data dependency without necessity and impairs performance!

The following code for the __absdi2() and __abssi2() functions avoids the CMOVcc instruction and is shorter too:

# Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.arch	i386
	.code32
	.intel_syntax noprefix
	.global	___absdi2
	.global	___abssi2
	.type	___absdi2, @function
	.type	___abssi2, @function
	.text

___absdi2:
	mov	eax, [esp+8]
	mov	ecx, [esp+4]		# eax:ecx = argument
	cdq				# edx = (argument < 0) ? -1 : 0
	add	ecx, edx
	adc	eax, edx		# eax:ecx = (argument < 0) ? argument - 1 : argument
	xor	ecx, edx
	xor	edx, eax		# edx:ecx = (argument < 0) ? -argument : argument
					#         = |argument|
	mov	eax, ecx		# edx:eax = |argument|
	ret

___abssi2:
	mov	eax, [esp+4]		# eax = argument
	cdq				# edx = (argument < 0) ? -1 : 0
	add	eax, edx		# eax = (argument < 0) ? argument - 1 : argument
	xor	eax, edx		# eax = (argument < 0) ? -argument : argument
					#     = |argument|
	ret

	.end
Note: exploration of the code generated with the preprocessor macro BUNNY defined is left as an exercise to the reader.

Bad, insufficient or no (peephole) optimisation at all in various library functions

Peek into the code of some functions provided in the clang_rt.builtins-i386.lib and the libclang_rt.builtins-x86_64.a libraries.

__absvti2()

The __absvti2() function provided in the libclang_rt.builtins-x86_64.a library exhibits the code shown in the left column below; the right column shows properly optimised code, with only 10 instructions in just 27 bytes instead of 18 instructions in 69 bytes:
# Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

	.arch	generic64
	.code64
	.intel_syntax noprefix
	.global	__absvti2
	.type	__absvti2, @function
	.text

__absvti2:
	mov	rcx, 8000000000000000h	#
	xor	rcx, rsi		#
	or	rcx, rdi		#
	jz	.overflow		#
	mov	rdx, rsi		#	mov	rdx, rsi
	mov	rax, rdi		#	mov	rax, rdi
	mov	rcx, rsi		#	sar	rsi, 63
	sar	rcx, 63			#	xor	rax, rsi
	xor	rdx, rcx		#	xor	rdx, rsi
	xor	rax, rcx		#	sub	rax, rsi
	sub	rax, rcx		#	sbb	rdx, rsi
	sbb	rdx, rcx		#	jo	.overflow
	ret				#	ret
.overflow:
	push	rax			#	ud2
	lea	rdi, …			#
	lea	rdx, …			#
	mov	esi, 24			#
	call	__abort			#

	.ident	"clang version 10.0.0 "
	.end
The following source coerces Clang to generate this optimised code – almost, with but one superfluous instruction:
// Copyright © 2014-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

__int128_t __absvti2(__int128_t argument) {
    const __int128_t sign = 0 - (argument < 0);
    argument ^= sign;
    argument -= sign;
    if (argument < 0)
        __builtin_trap();
    return argument;
}

__ashldi3(), __ashrdi3() and __lshrdi3()

The __ashldi3() function provided in the clang_rt.builtins-i386.lib library exhibits the rather bad code shown in the left column below; the right column shows the proper code, with only 12 instructions in just 30 bytes instead of 24 instructions in 51 bytes, and without a superfluous conditional branch:
# Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

	.arch	i386
	.code32
	.intel_syntax noprefix
	.global	___ashldi3
	.type	___ashldi3, @function
	.text

___ashldi3:
	push	esi			#
	mov	ecx, [esp+16]		#	mov	ecx, [esp+16]
	mov	edx, [esp+8]		#	mov	eax, [esp+8]
	test	cl, 32			#	test	cl, 32
	jnz	.0f			#	jnz	.0f
	mov	esi, [esp+12]		#	mov	edx, [esp+12]
	test	ecx, ecx		#
	jz	.2f			#
	mov	eax, edx		#
	shl	eax, cl			#
	shld	esi, edx, cl		#	shld	edx, eax, cl
	xor	ecx, ecx		#	shl	eax, cl
	mov	edx, esi		#
	jmp	.1f			#	ret
.0:
	shl	edx, cl			#	shl	eax, cl
	xor	eax, eax		#	mov	edx, eax
	xor	ecx, ecx		#	xor	eax, eax
.1:
	or	edx, ecx		#
	pop	esi			#
	ret				#
.2:
	mov	eax, edx		#
	mov	edx, esi		#
	pop	esi			#
	ret				#	ret

	.ident	"clang version 10.0.0 "
	.end
Note: exploration of the equally bad code generated for the __ashrdi3() and __lshrdi3() functions is left as an exercise to the reader.
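
For reference, here is a minimal C sketch (an illustration, not the library source; little-endian layout and a shift count between 0 and 63 assumed) of the same two-case split the right-hand column implements – whether Clang maps the 0…31 case to a single SHLD is not guaranteed:
// split on bit 5 of the shift count, just like the hand-written code above
long long ashldi3_sketch(long long value, int count) {
    union {
        unsigned long long all;
        struct {
            unsigned int low;	// little-endian layout assumed
            unsigned int high;
        };
    } v;
    v.all = (unsigned long long) value;
    if (count & 32) {		// shift by 32...63: the low dword moves into the high dword
        v.high = v.low << (count & 31);
        v.low  = 0;
    } else {			// shift by 0...31
        v.high = (v.high << count) | (count ? v.low >> (32 - count) : 0);
        v.low <<= count;
    }
    return (long long) v.all;
}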

__cmp?i2() and __ucmp?i2()

In their second halves, the functions __cmpdi2() and __ucmpdi2() provided in the clang_rt.builtins-i386.lib library exhibit the insane code shown in the left column below; the right column shows shorter (but still unoptimised) code, with 12 instructions in 35 bytes instead of 15 instructions in 47 bytes for each function:
# Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

	.arch	i386
	.code32
	.intel_syntax noprefix
	.global	___cmpdi2
	.global	___ucmpdi2
	.type	___cmpdi2, @function
	.type	___ucmpdi2, @function
	.text

___cmpdi2:
	mov	ecx, [esp+16]		#	mov	ecx, [esp+16]
	xor	eax, eax		#	xor	eax, eax
	cmp	[esp+8], ecx		#	cmp	ecx, [esp+8]
	jl	@f			#	jg	@f
	mov	eax, 2			#	mov	eax, 2
	jg	@f			#	jl	@f
	mov	ecx, [esp+4]		#
	mov	edx, [esp+12]		#	mov	ecx, [esp+12]
	mov	eax, 0			#	xor	eax, eax
	cmp	ecx, edx		#	cmp	ecx, [esp+4]
	jb	@f			#	ja	@f
	cmp	edx, ecx		#
	mov	eax, 1			#
	adc	eax, 0			#	adc	eax, 1
@@:					# @@:
	ret				#	ret

___ucmpdi2:
	mov	ecx, [esp+16]		#	mov	ecx, [esp+16]
	xor	eax, eax		#	xor	eax, eax
	cmp	[esp+8], ecx		#	cmp	ecx, [esp+8]
	jb	@f			#	ja	@f
	mov	eax, 2			#	mov	eax, 2
	ja	@f			#	jb	@f
	mov	ecx, [esp+4]		#
	mov	edx, [esp+12]		#	mov	ecx, [esp+12]
	mov	eax, 0			#	xor	eax, eax
	cmp	ecx, edx		#	cmp	ecx, [esp+4]
	jb	@f			#	ja	@f
	cmp	edx, ecx		#
	mov	eax, 1			#
	adc	eax, 0			#	adc	eax, 1
@@:					# @@:
	ret				#	ret

	.ident	"clang version 10.0.0 "
	.end
Assembly programmers write straightforward, branch-free code instead, using 14 instructions in 41 bytes for the __cmpdi2() function, and only 11 instructions in just 34 bytes for the __ucmpdi2() function:
# Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.arch	i386
	.code32
	.intel_syntax noprefix
	.global	___cmpdi2
	.global	___ucmpdi2
	.type	___cmpdi2, @function
	.type	___ucmpdi2, @function
	.text

___cmpdi2:
	mov	ecx, [esp+4]		# ecx = low dword of left argument
	mov	edx, [esp+12]		# edx = low dword of right argument
	cmp	ecx, edx
	mov	eax, [esp+8]		# eax = high dword of left argument
	sbb	eax, [esp+16]		# eax = high dword of left argument
					#     - high dword of right argument
					#     - borrow
	setl	ah			# ah = left argument < right argument
	cmp	edx, ecx
	mov	edx, [esp+16]		# edx = high dword of right argument
	sbb	edx, [esp+8]		# edx = high dword of right argument
					#     - high dword of left argument
					#     - borrow
	setl	al			# al = left argument > right argument
	sub	al, ah			# al = left argument > right argument
					#    - left argument < right argument
	movsx	eax, al			# eax = left argument > right argument
					#     - left argument < right argument
					#     = {-1, 0, 1}
	inc	eax			# eax = {0, 1, 2}
	ret

___ucmpdi2:
	mov	ecx, [esp+4]		# ecx = low dword of left argument
	mov	edx, [esp+12]		# edx = low dword of right argument
	cmp	ecx, edx
	mov	eax, [esp+8]		# eax = high dword of left argument
	sbb	eax, [esp+16]		# eax = high dword of left argument
					#     - high dword of right argument
					#     - borrow
	sbb	eax, eax		# eax = 0
					#     - left argument < right argument
	cmp	edx, ecx
	mov	edx, [esp+16]		# edx = high dword of right argument
	sbb	edx, [esp+8]		# edx = high dword of right argument
					#     - high dword of left argument
					#     - borrow
	adc	eax, 1			# eax = left argument > right argument
					#     - left argument < right argument
					#     + 1
					#     = {0, 1, 2}
	ret

	.end
Note: exploration of the equally bad code generated for the __cmpti2() and __ucmpti2() functions is left as an exercise to the reader.
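
The whole contract of the __ucmpdi2() function even fits into a single C expression; the following minimal sketch (an illustration, not the library source) usually lets a compiler produce branch-free code along the lines of the hand-written routine above:
// returns 0 if a < b, 1 if a == b, 2 if a > b – the contract of __ucmpdi2()
int ucmpdi2_sketch(unsigned long long a, unsigned long long b) {
    return (a > b) - (a < b) + 1;
}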

__mulo?i4()

The __mulosi4() and __mulodi4() functions provided in the clang_rt.builtins-i386.lib library exhibit completely insane and horrible code with 51 instructions in 130 bytes and 98 instructions in 266 bytes respectively, while the __mulosi4(), __mulodi4() and __muloti4() functions provided in the libclang_rt.builtins-x86_64.a library exhibit equally insane and horrible code with 44 instructions in 131 bytes, 48 instructions in 155 bytes, and 94 instructions in 302 bytes respectively.

Note: Clang generates calls of these monstrosities when the -ftrapv command line option is specified, and for the __builtin_*mul*_overflow builtins.

Create the text file case9.c with the following content in an arbitrary, preferably empty directory:

// Copyleft © 2014-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifdef __amd64__
__int128_t __muloti4(__int128_t multiplicand, __int128_t multiplier, int *overflow) {
    union {
        __int128_t all;
        struct {
            unsigned long long low;
            signed long long high;
        };
    } product;
    long long tmp;
    product.all = (multiplicand & ~0ULL) * (multiplier & ~0ULL);
    *overflow = __builtin_smulll_overflow(multiplicand & ~0ULL, multiplier >> 64, &tmp);
    *overflow |= __builtin_saddll_overflow(tmp, product.high, &product.high);
    *overflow |= __builtin_smulll_overflow(multiplicand >> 64, multiplier & ~0ULL, &tmp);
    *overflow |= __builtin_saddll_overflow(tmp, product.high, &product.high);
    return product.all;
}
#else
long long __mulodi4(long long multiplicand, long long multiplier, int *overflow) {
    union {
        long long all;
        struct {
            unsigned long low;
            signed long high;
        };
    } product;
    long tmp;
    product.all = (multiplicand & ~0UL) * (multiplier & ~0UL);
    *overflow = __builtin_smull_overflow(multiplicand & ~0UL, multiplier >> 32, &tmp);
    *overflow |= __builtin_saddl_overflow(tmp, product.high, &product.high);
    *overflow |= __builtin_smull_overflow(multiplicand >> 32, multiplier & ~0UL, &tmp);
    *overflow |= __builtin_saddl_overflow(tmp, product.high, &product.high);
    return product.all;
}
#endif // __amd64__
Compile the source file case9.c with Clang, engaging the optimiser, targeting the AMD64 platform, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-linux case9.c
	.text
[…]
__muloti4:				# @__muloti4
# %bb.0:
	movq	%rdx, %r9
	movq	%rdx, %rax
	mulq	%rdi
	imulq	%rdi, %rcx
	seto	%dil
	addq	%rdx, %rcx
	seto	%r10b
	imulq	%rsi, %r9
	seto	%dl
	orb	%dil, %dl
	orb	%r10b, %dl
	addq	%rcx, %r9
	seto	%cl
	orb	%dl, %cl
	movzbl  %cl, %ecx
	movl	%ecx, (%r8)
	movq	%r9, %rdx
	retq
[…]
	.ident	"clang version 10.0.0 "
[…]
Note: this proper implementation of the __muloti4() function shows only 18 instructions in just 55 bytes instead of 94 instructions in 302 bytes!

Compile the source file case9.c a second time with Clang, now targeting the i386 platform, and display the generated assembly code:

clang -o- -O3 -S -target i386-pc-linux case9.c
	.text
[…]
__mulodi4:				# @__mulodi4
# %bb.0:
	pushl	%ebx
	pushl	%esi
	movl	20(%esp), %ecx
	movl	12(%esp), %esi
	movl	%ecx, %eax
	mull	%esi
	imull	24(%esp), %esi
	seto	%bl
	addl	%edx, %esi
	seto	%dl
	imull	16(%esp), %ecx
	seto	%dh
	orb	%dl, %dh
	addl	%esi, %ecx
	movl	28(%esp), %esi
	seto	%dl
	orb	%dh, %dl
	orb	%bl, %dl
	movzbl	%dl, %edx
	movl	%edx, (%esi)
	movl	%ecx, %edx
	popl	%esi
	popl	%ebx
	retl
[…]
	.ident	"clang version 10.0.0 "
[…]
Note: this proper implementation of the __mulodi4() function shows only 24 instructions in just 60 bytes instead of 98 instructions in 266 bytes!

__parity?i2()

The __paritysi2() and __paritydi2() functions provided in the clang_rt.builtins-i386.lib library exhibit the insane code shown in the left column below; the right column shows the proper code, with 15 instructions in 40 bytes instead of 21 instructions in 57 bytes for both functions together, more than halving the number of instructions executed per function call:
# Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

	.arch	i386
	.code32
	.intel_syntax noprefix
	.global	___paritysi2
	.global	___paritydi2
	.type	___paritysi2, @function
	.type	___paritydi2, @function
	.text

___paritysi2:
	mov	eax, [esp+4]		#	mov	eax, [esp+4]
	mov	ecx, eax		#
	shr	ecx, 16			#	shld	ecx, eax, 16
	xor	ecx, eax		#	xor	ecx, eax
	mov	eax, ecx		#
	shr	eax, 8			#
	xor	eax, ecx		#
	mov	ecx, eax		#
	shr	ecx, 4			#
	xor	ecx, eax		#
	mov	eax, 0x6996		#
	and	cl, 15			#	xor	eax, eax
	shr	eax, cl			#	xor	cl, ch
	and	eax, 1			#	setnp	al
	ret				#	ret

___paritydi2:
	mov	eax, [esp+8]		#	mov	eax, [esp+8]
	xor	eax, [esp+4]		#	xor	eax, [esp+4]
	push	eax			#	shld	ecx, eax, 16
	call	___paritysi2		#	xor	ecx, eax
	add	esp, 4			#	xor	eax, eax
					#	xor	cl, ch
					#	setnp	al
	ret				#	ret

	.ident	"clang version 10.0.0 "
	.end

Braindead implementation of code generation for __builtin_*()

For the __builtin_*() functions, Clang emits unoptimised code which can be even worse than the poor code provided in the clang_rt.builtins-i386.* and clang_rt.builtins-x86_64.* libraries, and its optimiser fails to clean it up!

__builtin_parity*()

Create the text file case10.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

int parity32(unsigned x) {
    return __builtin_parity(x);
}

int parity64(unsigned long long x) {
    return __builtin_parityll(x);
}
Compile the source file case10.c with Clang, targeting the AMD64 platform, and display the generated (unoptimised) assembly code:
clang -o- -S -target amd64-pc-linux case10.c
	.text
[…]
parity32:				# @parity32
# %bb.0:
	pushq	%rbp
	movq	%rsp, %rbp
	movl	%edi, -4(%rbp)
	movl	-4(%rbp), %eax
	movl	%eax, %ecx
	shrl	%ecx
	andl	$1431655765, %ecx	# imm = 0x55555555
	subl	%ecx, %eax
	movl	%eax, %ecx
	andl	$858993459, %ecx	# imm = 0x33333333
	shrl	$2, %eax
	andl	$858993459, %eax	# imm = 0x33333333
	addl	%eax, %ecx
	movl	%ecx, %eax
	shrl	$4, %eax
	addl	%eax, %ecx
	andl	$252645135, %ecx	# imm = 0xF0F0F0F
	imull	$16843009, %ecx, %eax	# imm = 0x1010101
	shrl	$24, %eax
	andl	$1, %eax
	popq	%rbp
	retq
[…]
parity64:				# @parity64
# %bb.0:
	pushq	%rbp
	movq	%rsp, %rbp
	movq	%rdi, -8(%rbp)
	movq	-8(%rbp), %rax
	movq	%rax, %rcx
	shrq	%rcx
	movabsq	$6148914691236517205, %rdx # imm = 0x5555555555555555
	andq	%rdx, %rcx
	subq	%rcx, %rax
	movabsq	$3689348814741910323, %rcx # imm = 0x3333333333333333
	movq	%rax, %rdx
	andq	%rcx, %rdx
	shrq	$2, %rax
	andq	%rcx, %rax
	addq	%rax, %rdx
	movq	%rdx, %rax
	shrq	$4, %rax
	addq	%rax, %rdx
	movabsq	$1085102592571150095, %rax # imm = 0xF0F0F0F0F0F0F0F
	andq	%rax, %rdx
	movabsq	$72340172838076673, %rax # imm = 0x101010101010101
	imulq	%rax, %rdx
	shrq	$56, %rdx
	andq	$1, %rdx
	movl	%edx, %eax
	popq	%rbp
	retq
[…]
	.ident	"clang version 10.0.0 "
[…]
Compile the source file case10.c a second time with Clang, now engaging its optimiser, again targeting the AMD64 platform, and display the generated optimised assembly code:
clang -o- -O3 -S -target amd64-pc-linux case10.c
	.text
[…]
parity32:				# @parity32
# %bb.0:
	movl	%edi, %ecx
	shrl	$16, %ecx
	xorl	%edi, %ecx
	movl	%ecx, %edx
	shrl	$8, %edx
	xorl	%eax, %eax
	xorb	%cl, %dl
	setnp	%al
	retq
[…]
parity64:				# @parity64
# %bb.0:
	movq	%rdi, %rax
	shrq	%rax
	movabsq	$6148914691236517205, %rcx # imm = 0x5555555555555555
	andq	%rax, %rcx
	subq	%rcx, %rdi
	movabsq	$3689348814741910323, %rax # imm = 0x3333333333333333
	movq	%rdi, %rcx
	andq	%rax, %rcx
	shrq	$2, %rdi
	andq	%rax, %rdi
	addq	%rcx, %rdi
	movq	%rdi, %rax
	shrq	$4, %rax
	addq	%rdi, %rax
	movabsq	$76296276040158991, %rcx # imm = 0x10F0F0F0F0F0F0F
	andq	%rax, %rcx
	movabsq	$72340172838076673, %rax # imm = 0x101010101010101
	imulq	%rcx, %rax
	shrq	$56, %rax
	andl	$1, %eax
	retq
[…]
	.ident	"clang version 10.0.0 "
[…]
OUCH: the optimised code for __builtin_parityll() is a perfect declaration of bankruptcy!
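
Until the code generator catches up, the 64-bit argument can be folded to 32 bits by hand; the following sketch (an assumption about current Clang behaviour, not a guarantee) then reuses the compact SETNP sequence Clang already emits for the 32-bit builtin:
// parity of a 64-bit value = parity of the XOR of its two halves
int parity64_folded(unsigned long long x) {
    return __builtin_parity((unsigned) (x ^ (x >> 32)));
}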

Which optimiser?

Create the text file case11.c with the following content, presented by Matt Godbolt as example (t) in his ACM Queue article Optimizations in C++ Compilers – A practical journey, in an arbitrary, preferably empty directory:
bool isWhitespace(char c)
{
    return c == ' '
      || c == '\r'
      || c == '\n'
      || c == '\t';
}
Compile the source file case11.c with Clang, engaging the optimiser, targeting the AMD64 platform, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-linux case11.c
Note: the left column shows the generated code, while the right column shows the properly optimised code as comment!
	.text
[…]
isWhitespace:				# @isWhitespace
# %bb.0:
	cmpb	$32, %dil		#
	ja	.LBB0_2			#
# %bb.1:
	movl	$1, %eax		#
	movzbl	%dil, %ecx		#	mov	ecx, edi
	movabsq	$0x100002400, %rdx	#	mov	rax, 100002600h
	btq	%rcx, %rdx		#	shr	rax, cl
	jae	.LBB0_2			#
# %bb.3:
	retq				#
.LBB0_2:				#
	xorl	%eax, %eax		#	cmp	cl, 33
	cmpb	$9, %dil		#	sbb	ecx, ecx
					#	neg	ecx
	sete	%al			#	and	eax, ecx
	retq				#	ret
[…]
	.ident	"clang version 10.0.0 "
[…]
OOPS: 12 instructions in 42 bytes, including 2 conditional branches which impair performance, plus register RDX clobbered, instead of only 8 instructions in just 19 bytes, and without (superfluous) conditional branches.
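
The mask 100002600h from the right-hand column can of course be spelled out in portable C too; the following sketch (an illustration only – the range check keeps the shift well-defined, and Clang is not guaranteed to turn it into the branch-free sequence shown above) expresses the same predicate:
// one bit per whitespace character: ' ' (32), '\r' (13), '\n' (10), '\t' (9) = 100002600h
int isWhitespace_sketch(char c) {
    const unsigned long long mask = 1ULL << ' ' | 1ULL << '\r' | 1ULL << '\n' | 1ULL << '\t';
    unsigned char u = (unsigned char) c;
    return u <= ' ' && ((mask >> u) & 1);
}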

Compile the source file case11.c with Clang again, now targeting the i386 platform, and display the generated assembly code:

clang -o- -O3 -S -target i386-pc-linux case11.c
Note: the left column shows the generated code, while the right column shows the properly optimised code as comment!
	.text
[…]
isWhitespace:			# @isWhitespace
# %bb.0:
	pushl	%esi		#
	movb	8(%esp), %cl	#	mov	ecx, [esp+4]
	movl	%ecx, %edx	#
	addb	$-10, %dl	#
	cmpb	$22, %dl	#
	ja	.LBB0_2		#
# %bb.1:
	movzbl	%dl, %edx	#	xor	eax, eax
	movl	$0x400009, %esi	#	cmp	eax, ecx
	movl	$1, %eax	#	adc	eax, 2600h
	btl	%edx, %esi	#	shr	eax, cl
	jae	.LBB0_2		#
# %bb.3:
	popl	%esi		#
	retl			#
.LBB0_2:
	xorl	%eax, %eax	#	xor	edx, edx
	cmpb	$9, %cl		#	cmp	ecx, 33
	sete	%al		#	adc	edx, edx
	popl	%esi		#	and	eax, edx
	retl			#	ret
[…]
	.ident	"clang version 10.0.0 "
[…]
OUCH: 18 instructions in 45 bytes, including 2 conditional branches which impair performance, plus register ESI clobbered, instead of only 10 instructions in just 25 bytes, again without (superfluous) conditional branches, and no non-volatile register clobbered.

The optimiser fails, quite bad

Create the text file case12.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2018-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

long long __absvdi2(long long argument) {
#ifdef BUNNY // undefined behaviour
    long long magnitude = __builtin_llabs(argument);
    if (magnitude < 0)
        __builtin_trap();
    return magnitude;
#else
    long long sign = 0 - (argument < 0);
    if (__builtin_saddll_overflow(argument, sign, &argument))
        __builtin_trap();
    return argument ^ sign;
#endif
}

long long __addvdi3(long long augend, long long addend) {
    long long sum;
    if (__builtin_saddll_overflow(augend, addend, &sum))
        __builtin_trap();
    return sum;
}

long long __mulvdi3(long long multiplicand, long long multiplier) {
    long long product;
    if (__builtin_smulll_overflow(multiplicand, multiplier, &product))
        __builtin_trap();
    return product;
}

long long __negvdi2(long long negend) {
    long long negation;
    if (__builtin_ssubll_overflow(0, negend, &negation))
        __builtin_trap();
    return negation;
}

long long __subvdi3(long long minuend, long long subtrahend) {
    long long difference;
    if (__builtin_ssubll_overflow(minuend, subtrahend, &difference))
        __builtin_trap();
    return difference;
}
Compile the source file case12.c with Clang, engaging the optimiser, targeting the i386 platform, and display the generated assembly code:
clang -o- -O3 -S -target i386-pc-linux case12.c
Note: the left column shows the generated insane code, while the right column shows the properly optimised code as comment!
	.text
[…]
__absvdi2:				# @__absvdi2
# %bb.0:
	pushl	%ebx			#
	pushl	%esi			#
	pushl	%eax			#
	movl	20(%esp), %esi		#	mov	eax, [esp+8]
	movl	16(%esp), %eax		#	mov	ecx, [esp+4]
	movl	%esi, %ecx		#
	movl	%esi, %edx		#
	sarl	$31, %ecx		#	cdq
	addl	%ecx, %eax		#	add	ecx, edx
	adcl	%ecx, %edx		#	adc	eax, edx
	setns	%bl			#
	testl	%esi, %esi		#
	setns	%bh			#
	cmpb	%bl, %bh		#
	setne	3(%esp)			#
	testl	%ecx, %ecx		#
	setns	%bl			#
	cmpb	%bl, %bh		#
	sete	%bl			#
	andb	3(%esp), %bl		#
	cmpb	$1, %bl			#
	je	.LBB0_1			#	into
# %bb.2:
	xorl	%ecx, %eax		#	xor	ecx, edx
	xorl	%ecx, %edx		#	xor	edx, eax
	addl	$4, %esp		#	mov	eax, ecx
	popl	%esi			#
	popl	%ebx			#
	retl				#	ret
.LBB0_1:
	ud2				#
[…]
__addvdi3:				# @__addvdi3
# %bb.0:
	pushl	%ebx			#
	pushl	%esi			#
	movl	12(%esp), %eax		#	mov	eax, [esp+4]
	movl	16(%esp), %esi		#	mov	edx, [esp+8]
	movl	24(%esp), %ecx		#	add	eax, [esp+12]
	addl	20(%esp), %eax		#	adc	edx, [esp+16]
	movl	%esi, %edx		#
	adcl	%ecx, %edx		#
	setns	%bl			#
	testl	%esi, %esi		#
	setns	%bh			#
	cmpb	%bl, %bh		#
	setne	%bl			#
	testl	%ecx, %ecx		#
	setns	%cl			#
	cmpb	%cl, %bh		#
	sete	%cl			#
	andb	%bl, %cl		#
	cmpb	$1, %cl			#
	je	.LBB1_1			#	into
# %bb.2:
	popl	%esi			#
	popl	%ebx			#
	retl				#	ret
.LBB1_1:
	ud2				#
[…]
__mulvdi3:				# @__mulvdi3
# %bb.0:
	pushl	%edi			#
	pushl	%esi			#
	pushl	%eax			#	push	ebx
	movl	16(%esp), %eax		#	mov	eax, [esp+16]
	movl	24(%esp), %edx		#	mov	ecx, [esp+12]
	movl	20(%esp), %ecx		#	imul	ecx, eax
	movl	28(%esp), %esi		#	into
	movl	$0, (%esp)		#	mov	edx, [esp+8]
	subl	$12, %esp		#	mov	ebx, [esp+20]
	leal	12(%esp), %edi		#	imul	ebx, edx
	pushl	%edi			#	into
	pushl	%esi			#	mul	edx
	pushl	%edx			#	add	ecx, ebx
	pushl	%ecx			#	into
	pushl	%eax			#	add	edx, ecx
	calll	__mulodi4		#	into
	addl	$32, %esp		#
	cmpl	$0, (%esp)		#
	jne	.LBB2_1			#
# %bb.2:
	addl	$4, %esp		#	pop	ebx
	popl	%esi			#
	popl	%edi			#
	retl				#	ret
.LBB2_1:
	ud2				#
[…]
__negvdi2:				# @__negvdi2
# %bb.0:
	xorl	%eax, %eax		#	mov	eax, [esp+4]
	xorl	%edx, %edx		#	xor	edx, edx
	movl	8(%esp), %ecx		#
	subl	4(%esp), %eax		#	neg	eax
	sbbl	%ecx, %edx		#	sbb	edx, [esp+8]
	testl	%edx, %ecx		#
	js	.LBB3_1			#	into
# %bb.2:
	retl				#	ret
.LBB3_1:
	ud2				#
[…]
__subvdi3:				# @__subvdi3
# %bb.0:
	pushl	%ebx			#
	pushl	%esi			#
	movl	12(%esp), %eax		#	mov	eax, [esp+4]
	movl	16(%esp), %esi		#	mov	edx, [esp+8]
	movl	24(%esp), %ecx		#	sub	eax, [esp+12]
	subl	20(%esp), %eax		#	sbb	edx, [esp+16]
	movl	%esi, %edx		#
	sbbl	%ecx, %edx		#
	setns	%bl			#
	testl	%esi, %esi		#
	setns	%bh			#
	cmpb	%bl, %bh		#
	setne	%bl			#
	testl	%ecx, %ecx		#
	setns	%cl			#
	cmpb	%cl, %bh		#
	setne	%cl			#
	andb	%bl, %cl		#
	cmpb	$1, %cl			#
	je	.LBB4_1			#	into
# %bb.2:
	popl	%esi			#
	popl	%ebx			#
	retl				#	ret
.LBB4_1:
	ud2				#
[…]
	.ident	"clang version 10.0.0 "
[…]
OUCH: 29 instructions in 68 bytes instead of only 10 instructions in just 21 bytes for the __absvdi2() function, and 24 instructions in 47 bytes instead of only 6 instructions in just 18 bytes for each of the __addvdi3() and __subvdi3() functions!

OOPS: 24 instructions in 60 bytes instead of only 16 instructions in just 35 bytes for the __mulvdi3() function – not counting the 98 instructions in 266 bytes of the called __mulodi4() function.

Note: exploration of the equally bad code generated for the __absvsi2(), __addvsi3(), __mulvsi3() and __subvsi3() as well as the __absvti2(), __addvti3(), __mulvti3() and __subvti3() functions is left as an exercise to the reader.

The optimiser fails, third time

Create the text file case13.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

int __clzti2(__uint128_t argument) {
    if ((argument >> 64) != 0)
        return __builtin_clzll(argument >> 64);

    if ((argument & ~0ULL) != 0)
        return __builtin_clzll(argument) + 64;

    return 128;
}

int __ctzti2(__uint128_t argument) {
    if ((argument & ~0ULL) != 0)
        return __builtin_ctzll(argument);

    if ((argument >> 64) != 0)
        return __builtin_ctzll(argument >> 64) + 64;

    return 128;
}
Compile the source file case13.c with Clang, engaging the optimiser, targeting the AMD64 platform, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-linux case13.c
Note: the left column shows the generated code, while the right column shows the properly optimised code as comment!
	.text
[…]
__clzti2:				# @__clzti2
# %bb.0:
	testq	%rsi, %rsi		#	bsr	rax, rsi
	je	.LBB0_2			#	jz	.LBB0_2
# %bb.1:
	bsrq	%rsi, %rax		#
	xorl	$63, %eax		#	xor	eax, 63
	retq				#	ret
.LBB0_2:
	testq	%rdi, %rdi		#	bsr	rax, rdi
	je	.LBB0_3			#	jz	.LBB0_3
# %bb.4:
	bsrq	%rdi, %rax		#
	xorl	$63, %eax		#	xor	eax, 127
	orl	$64, %eax		#
	retq				#	ret
.LBB0_3:
	movl	$128, %eax		#	mov	eax, 128
	retq				#	ret
[…]
__ctzti2:				# @__ctzti2
# %bb.0:
	testq	%rdi, %rdi		#
	je	.LBB1_2			#
# %bb.1:
	bsfq	%rdi, %rax		#	bsf	rax, rdi
	retq				#	jnz	.return
.LBB1_2:
	testq	%rsi, %rsi		#
	je	.LBB1_3			#
# %bb.4:
	bsfq	%rsi, %rax		#	bsf	rax, rsi
					#	jz	.LBB1_3
	orl	$64, %eax		#	or	eax, 64
					#.return:
	retq				#	ret
.LBB1_3:
	movl	$128, %eax		#	mov	eax, 128
	retq				#	ret
[…]
	.ident	"clang version 10.0.0 "
[…]
Oops: 13 instructions in 35 bytes instead of 10 instructions in 26 bytes for the __clzti2() function, and 11 instructions in 30 bytes instead of 8 instructions in 22 bytes for the __ctzti2() function.

The optimiser fails, fourth time

Create the text file case14.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

unsigned reverse(unsigned x) { // reverse all bits
    x = ((x &  0xAAAAAAAAU) >> 1)
      | ((x & ~0xAAAAAAAAU) << 1);
    x = ((x &  0xCCCCCCCCU) >> 2)
      | ((x & ~0xCCCCCCCCU) << 2);
    x = ((x &  0xF0F0F0F0U) >> 4)
      | ((x & ~0xF0F0F0F0U) << 4);
    x = ((x &  0xFF00FF00U) >> 8)
      | ((x & ~0xFF00FF00U) << 8);
    x = ((x &  0xFFFF0000U) >> 16)
      | ((x & ~0xFFFF0000U) << 16);

    return x;
}
Compile the source file case14.c with Clang, engaging the optimiser, targeting the i386 platform, and display the generated assembly code:
clang -o- -O3 -S -target i386-pc-linux case14.c
Note: the left column shows the generated code, while the right column shows only the replacement for properly optimised code as comment!
	.text
[…]
reverse:				# @reverse
# %bb.0:
	movl	4(%esp), %eax
	movl	%eax, %ecx
	andl	$0xD5555555, %eax
	shrl	%ecx
	andl	$0x55555555, %ecx
	leal	(%ecx,%eax,2), %eax
	movl	%eax, %ecx
	andl	$0xF3333333, %eax
	shrl	$2, %ecx
	andl	$0x33333333, %ecx
	leal	(%ecx,%eax,4), %eax
	movl	%eax, %ecx
	shll	$4, %eax
	shrl	$4, %ecx
	andl	$0xF0F0F0F0, %eax
	andl	$0x0F0F0F0F, %ecx
	orl	%ecx, %eax
	movl	%eax, %ecx		#	bswap	%eax
	shll	$8, %eax		#
	shrl	$8, %ecx		#
	andl	$0xFF00FF00, %eax	#
	andl	$0x00FF00FF, %ecx	#
	orl	%ecx, %eax		#
	roll	$16, %eax		#
	retl
[…]
	.ident	"clang version 10.0.0 "
[…]
OUCH: the optimiser fails to recognise the common and well-known pattern for endian conversion!
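
When the byte swap is spelled out explicitly, Clang has no excuse: the following sketch (a rewrite of case14.c, relying on the fact that its last two mask-and-shift steps are exactly a byte swap) uses the corresponding builtin instead:
// reverse the bits within each byte, then reverse the byte order = full bit reversal
unsigned reverse(unsigned x) {
    x = ((x &  0xAAAAAAAAU) >> 1) | ((x & ~0xAAAAAAAAU) << 1);
    x = ((x &  0xCCCCCCCCU) >> 2) | ((x & ~0xCCCCCCCCU) << 2);
    x = ((x &  0xF0F0F0F0U) >> 4) | ((x & ~0xF0F0F0F0U) << 4);
    return __builtin_bswap32(x);  // replaces steps 4 and 5 of case14.c
}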

The optimiser fails, fifth time

Create the text file case15.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

int __ucmpti2(__uint128_t a, __uint128_t b) {
    if (a < b)
        return 0;

    if (a > b)
        return 2;

    return 1;
}
Compile the source file case15.c with Clang, engaging the optimiser, targeting the AMD64 platform, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-linux case15.c
Note: the left column shows the generated code, while the right column shows the properly optimised code as comment!
	.text
[…]
__ucmpti2:				# @__ucmpti2
# %bb.0:
	cmpq	%rdi, %rdx		#	cmp	rdi, rdx
	movq	%rcx, %rax		#	mov	rax, rsi
	sbbq	%rsi, %rax		#	sbb	rax, rcx
	movl	$1, %r8d		#
	adcl	$0, %r8d		#
	xorl	%eax, %eax		#	sbb	eax, eax
	cmpq	%rdx, %rdi		#	cmp	rdx, rdi
	sbbq	%rcx, %rsi		#	sbb	rcx, rsi
	cmovael	%r8d, %eax		#	adc	eax, 1
	retq				#	ret
[…]
	.ident	"clang version 10.0.0 "
[…]
Oops: 10 instructions in 32 bytes, including a superfluous conditional move which impairs performance, instead of 8 instructions in 21 bytes, without conditional move.

The optimiser fails, sixth time

Create the text file case16.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

int __ucmpti2(__uint128_t a, __uint128_t b) {
    if (a > b)
        return 2;

    if (a == b)
        return 1;

    return 0;
}
Compile the source file case16.c with Clang, engaging the optimiser, now targeting the AMD64 platform with the Microsoft calling convention, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-windows case16.c
Note: the left column shows the generated code, while the right column shows the properly optimised code as comment!
	.text
[…]
__ucmpti2:				# @__ucmpti2
# %bb.0:
	movdqa	(%rcx), %xmm0		#	mov	r8, [rcx]
	movq	(%rdx), %r8		#	mov	r9, [rcx+8]
	pcmpeqb	(%rdx), %xmm0		#	mov	rcx, [rdx]
	pmovmskb	%xmm0, %eax	#	mov	rdx, [rdx+8]
	xorl	%r9d, %r9d		#	cmp	r8, rcx
	cmpl	$0xFFFF, %eax		#	mov	rax, r9
	sete	%r9b			#	sbb	rax, rdx
	cmpq	(%rcx), %r8		#	sbb	eax, eax
	movq	8(%rdx), %rax		#	cmp	rcx, r8
	sbbq	8(%rcx), %rax		#	sbb	rdx, r9
	movl	$2, %eax		#	adc	eax, 1
	cmovael %r9d, %eax		#
	retq				#	ret
[…]
	.ident	"clang version 10.0.0 "
[…]
Oops: 13 instructions in 48 bytes, including a superfluous conditional move which impairs performance, using the SSE register XMM0 without necessity, performing 6 memory accesses, instead of 12 instructions in 35 bytes, without conditional move, performing 4 memory accesses.

The optimiser fails, seventh time

Create the text file case17.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

long long quotient(long long numerator, long long denominator, long long *modulus) {
    *modulus = numerator % denominator;
    return numerator / denominator;
}

long long modulus(long long numerator, long long denominator, long long *quotient) {
    *quotient = numerator / denominator;
    return numerator % denominator;
}
Compile the source file case17.c with Clang, engaging the optimiser, targeting the i386 platform, and display the generated assembly code:
clang -o- -O3 -S -target i386-pc-linux case17.c
Note: the left column shows the generated code, while the right column shows the properly optimised code as comment!
	.text
[…]
quotient:				# @quotient
# %bb.0:
	pushl	%ebp			#	jmp	__divmoddi4
	pushl	%ebx			#
	pushl	%edi			#
	pushl	%esi			#
	subl	$12, %esp		#
	movl	44(%esp), %edi		#
	movl	32(%esp), %ebx		#
	movl	36(%esp), %ebp		#
	movl	40(%esp), %esi		#
	pushl	%edi			#
	pushl	%esi			#
	pushl	%ebp			#
	pushl	%ebx			#
	calll	__divdi3		#
	addl	$16, %esp		#
	movl	%eax, %ecx		#
	movl	%edx, 8(%esp)		# 4-byte Spill
	imull	%eax, %edi		#
	mull	%esi			#
	addl	%edi, %edx		#
	movl	8(%esp), %edi		# 4-byte Reload
	imull	%edi, %esi		#
	addl	%edx, %esi		#
	subl	%eax, %ebx		#
	movl	48(%esp), %eax		#
	movl	%edi, %edx		#
	sbbl	%esi, %ebp		#
	movl	%ebx, (%eax)		#
	movl	%ebp, 4(%eax)		#
	movl	%ecx, %eax		#
	addl	$12, %esp		#
	popl	%esi			#
	popl	%edi			#
	popl	%ebx			#
	popl	%ebp			#
	retl				#
[…]
modulus:				# @modulus
# %bb.0:
	pushl	%ebp			#
	pushl	%ebx			#
	pushl	%edi			#
	pushl	%esi			#
	subl	$12, %esp		#	sub	esp, 8
	movl	44(%esp), %ebx		#
	movl	32(%esp), %esi		#
	movl	36(%esp), %edi		#
	movl	40(%esp), %ebp		#
					#	push	esp
	pushl	%ebx			#	push	[esp+28]
	pushl	%ebp			#	push	[esp+28]
	pushl	%edi			#	push	[esp+28]
	pushl	%esi			#	push	[esp+28]
	calll	__divdi3		#	call	__divmoddi4
	addl	$16, %esp		#	add	esp, 20
	movl	%edx, %ecx		#
	movl	48(%esp), %edx		#	mov	ecx, [esp+28]
	imull	%eax, %ebx		#
	movl	%ecx, 4(%edx)		#	mov	[ecx], eax
	movl	%eax, (%edx)		#	mov	[ecx+4], edx
	mull	%ebp			#
	imull	%ebp, %ecx		#
	addl	%ebx, %edx		#
	addl	%edx, %ecx		#
	subl	%eax, %esi		#
	sbbl	%ecx, %edi		#
	movl	%esi, %eax		#
	movl	%edi, %edx		#
	addl	$12, %esp		#	pop	edx
					#	pop	eax
	popl	%esi			#
	popl	%edi			#
	popl	%ebx			#
	popl	%ebp			#
	retl				#	ret
[…]
	.ident	"clang version 10.0.0 "
[…]
OUCH: 35 superfluous instructions in 77 bytes for the function quotient(), wasting about 18 CPU clock cycles per function call, instead of only 1 (in words: one) instruction in just 5 bytes – this optimiser is a bad joke!

OOPS: 34 instructions in 74 bytes for the function modulus(), where just 14 instructions in only 40 bytes yield the same result, but without clobbering 4 registers, without 8 superfluous memory accesses, without 3 superfluous multiplications, without 5 superfluous additions or subtractions, and not wasting about 10 CPU clock cycles per function call.
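
What the optimiser should boil the quotient() function down to can also be written by hand; the following sketch (assuming compiler-rt's __divmoddi4() is available to link against) is the single call that the right-hand column shows as a tail jump:
// compiler-rt: __divmoddi4() returns the quotient and stores the remainder through the pointer
extern long long __divmoddi4(long long numerator, long long denominator, long long *remainder);

long long quotient(long long numerator, long long denominator, long long *modulus) {
    return __divmoddi4(numerator, denominator, modulus);
}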

The optimiser fails, eighth time

Compile the source file case4gcc.c with Clang, engaging its optimiser, targeting the AMD64 platform, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-linux case4gcc.c
Note: the left column shows the generated code, while the right column shows only the replacement for properly optimised code as comment!
	.text
[…]
	bsrq	%r11, %rbx	#	bsrq	%r11, %rcx
	xorq	$63, %rbx	#	xorl	$63, %ecx
	je	.LBB1_11
# %bb.17:
	movq	%rbx, %rcx	#
	negq	%rcx		#
	movq	%rsi, %rdx	#	xorl	%edx, %edx
	shrq	%cl, %rdx	#	shldq	%cl, %rsi, %rdx
	movl	%ebx, %ecx	#
	shldq	%cl, %rdi, %rsi
	shlq	%cl, %rdi
	shldq	%cl, %r9, %r11
	shlq	%cl, %r9
	movq	%r11, -8(%rsp)	#
	movq	%rsi, %rax
	divq	-8(%rsp)	#	divq	%r11
	movq	%rdx, %rsi
	movq	%rax, %r10
	mulq	%r9
	cmpq	%rax, %rdi
	movq	%rsi, %rcx	#	movq	%rsi, %rbx
	sbbq	%rdx, %rcx	#	sbbq	%rdx, %rbx
	jae	.LBB1_19
# %bb.18:
	subq	%r9, %rax
	sbbq	%r11, %rdx
	addq	$-1, %r10
.LBB1_19:
	testq	%r8, %r8
	je	.LBB1_21
# %bb.20:
	movl	$64, %r9d	#
	subq	%rbx, %r9	#
	subq	%rax, %rdi
	sbbq	%rdx, %rsi
	movl	%ebx, %ecx	#
	shrq	%cl, %rdi	#	shrdq	%cl, %rdi, %rsi
	movq	%rsi, %rax	#
	movl	%r9d, %ecx	#
	shlq	%cl, %rax	#
	movl	%ebx, %ecx	#
	shrq	%cl, %rsi	#	shrq	%cl, %rdi
	orq 	%rdi, %rax	#
	movq	%rsi, 8(%r8)
	movq	%rax, (%r8)	#	movq	%rdi, (%r8)
	jmp	.LBB1_21
[…]
	.ident	"clang version 10.0.0 "
[…]
Oops: while the optimiser (partially) recognises the common and well-known pattern for multi-word left shifts and generates the code following line # %bb.17:, it fails to recognise the same pattern for the multi-word right shift and generates the code following line # %bb.20:, using 12 (in words: twelve) superfluous instructions in a total of 42 instructions for the 2 hot basic blocks shown!
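
For comparison, here is the well-known pattern in question as a minimal C sketch (shift count assumed to be 1 to 63, as the program flow guarantees at this point); written this way it usually maps to a single SHRD plus SHR pair:
// double-word right shift of the 128-bit value high:low by count, count in 1 to 63
static void shr128_sketch(unsigned long long *low, unsigned long long *high, unsigned count) {
    *low  = (*low >> count) | (*high << (64 - count));	// SHRD low, high, cl
    *high = *high >> count;				// SHR  high, cl
}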

Compile the source file case4gcc.c with Clang again, now with the preprocessor macro OPTIMIZE defined, targeting the AMD64 platform, and display the generated assembly code:

clang -DOPTIMIZE -o- -O3 -S -target amd64-pc-linux case4gcc.c
	.text
[…]
.LBB1_7:
	bsrq	%r10, %rbx
	xorq	$63, %rbx
	je	.LBB1_13
# %bb.8:
	movl	%ebx, %ecx
	negb	%cl
	movq	%rsi, %rdx
	shrq	%cl, %rdx
	movl	%ebx, %ecx
	shldq	%cl, %rdi, %rsi
	shlq	%cl, %rdi
	shldq	%cl, %r9, %r10
	shlq	%cl, %r9
	xorl	%r11d, %r11d
	testb	$64, %bl
	cmovneq	%rdi, %rsi
	cmovneq	%r11, %rdi
	cmovneq	%r9, %r10
	cmovneq	%r11, %r9
	movq	%r10, -8(%rsp)
	movq	%rsi, %rax
	divq	-8(%rsp)
	movq	%rax, %r14
	movq	%rdx, %rcx
	movq	%r9, %rax
	andq	$-2, %rax
	mulq	%r14
	andq	$-2, %rdi
	cmpq	%rax, %rdi
	movq	%rcx, %rsi
	sbbq	%rdx, %rsi
	sbbq	$0, %r14
	testq	%r8, %r8
	je	.LBB1_19
# %bb.9:
	xorl	%r11d, %r11d
	subq	%rax, %rdi
	sbbq	%rdx, %rcx
	cmovaeq	%r11, %r10
	cmovaeq	%r11, %r9
	addq	%rdi, %r9
	adcq	%rcx, %r10
	movl	%ebx, %ecx
	shrdq	%cl, %r10, %r9
	shrq	%cl, %r10
	testb	$64, %bl
	cmovneq	%r10, %r9
	cmovneq	%r11, %r10
	movq	%r10, 8(%r8)
	movq	%r9, (%r8)
	jmp	.LBB1_19
[…]
	.ident	"clang version 10.0.0 "
[…]
Ouch: although the shift count is greater than 0 and can’t exceed 63 (which an optimiser worth its name could prove from the program flow), this optimiser generates 2 superfluous TEST instructions which always set the zero flag ZF, followed by 6 superfluous CMOVNZ instructions, i.e. NOPs, resulting in a total of 49 (instead of the former 42) instructions.

Contact

If you miss anything here, have additions, comments, corrections, criticism or questions, want to give feedback, hints or tips, report broken links, bugs, deficiencies, errors, inaccuracies, misrepresentations, omissions, shortcomings, vulnerabilities or weaknesses, …: don’t hesitate to contact me and feel free to ask, comment, criticise, flame, notify or report!

Use the X.509 certificate to send S/MIME encrypted mail.

Note: email in weird format and without a proper sender name is likely to be discarded!

I dislike HTML (and even weirder formats too) in email, I prefer to receive plain text.
I also expect to see your full (real) name as sender, not your nickname.
I abhor top posts and expect inline quotes in replies.

Copyright © 1995–2020 • Stefan Kanthak • <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>