
Me, myself & IT

Deficiencies in GCC’s code generator and optimiser

Purpose
Reason
Case 0
Case 1: __udivmod?i4() Functions
Case 2: Missing __rotl?i3() and __rotr?i3() Functions
Case 3: __ashl?i3() Functions
Case 4: __ashr?i3() Functions
Case 5: __lshr?i3() Functions
Case 6: __cmp?i2() Functions
Case 7: __ucmp?i2() Functions
Case 8: __mul?i3() Functions
Case 9: __neg?i2() Functions
Case 10: __absv?i2() Functions
Case 11: __addv?i3() Functions
Case 12: __subv?i3() Functions
Case 13: __mulv?i3() Functions
Case 14: __negv?i2() Functions
Case 15: __parity?i2() Functions
Case 16: __builtin_parity() Builtin
Case 17: __builtin_mul_overflow() Builtin
Case 18: __builtin_sub_overflow() Builtin
Case 19: __builtin_copysign() Builtin
Case 20: -ftrapv Command Line Option
Case 21: Shell Game, or Crazy Braindead Register Allocator
Case 22: Undefined Behaviour or Optimiser Failure?
Case 23: Optimiser Failures

Purpose

Demonstrate multiple bugs, defects, deficiencies and flaws of GCC and its libgcc library, especially the rather poor code generator.

Reason

The code generated by GCC (not only for its libgcc library) is rather poor, up to an order of magnitude worse than proper (optimised) code.

Note: I recommend reading Randall Hyde’s ACM article, especially the following part:

Observation #6: Software engineers have been led to believe that their time is more valuable than CPU time; therefore, wasting CPU cycles in order to reduce development time is always a win. They've forgotten, however, that the application users' time is more valuable than their time.

Case 0

In his ACM queue article Optimizations in C++ Compilers – A practical journey, Matt Godbolt presents the following function as an example:
bool isWhitespace(char c)
{
    return c == ' '
      || c == '\r'
      || c == '\n'
      || c == '\t';
}
He then shows the following code generated by GCC 9.1 for the AMD64 platform, which uses 7 instructions in 27 bytes:
isWhitespace(char):
  xor    eax, eax           ; result = false
  cmp    dil, 32            ; is c > 32
  ja     .L4                ; if so, exit with false
  movabs rax, 4294977024    ; rax = 0x100002600
  shrx   rax, rax, rdi      ; rax >>= c
  and    eax, 1             ; result = rax & 1
.L4:
  ret
This code is not optimal, though: the conditional branch impairs performance; it can and should be avoided, as demonstrated by the following equivalent code, which uses 7 instructions in 30 bytes when the preprocessor macro OPTIMIZE is not defined, else just 26 bytes:
; Copyleft © 2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

isWhitespace:
	movabs	rax, 100002600h
	shrx	rax, rax, rdi
	cmp	dil, 33
#ifndef OPTIMIZE
	setb	dil
	movzx	edi, dil
#else
	sbb	edi, edi
	neg	edi
#endif
	and	eax, edi
	ret
Since the SHRX instruction is only available on newer CPUs, here is also branch-free code that runs on all AMD64 processors, with 8 instructions in just 25 bytes:
; Copyleft © 2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

isWhitespace:
	mov	ecx, edi
	movabs	rax, 100002600h
	shr	rax, cl
	cmp	cl, 33
	sbb	ecx, ecx
	neg	ecx
	and	eax, ecx
	ret
Finally, equivalent branch-free code that works on 35-year-old i386 processors, needing neither an AMD64 processor nor the SHRX instruction, with only 10 instructions in 26 bytes when the preprocessor macro OPTIMIZE is not defined, else just 25 bytes:
; Copyleft © 2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

isWhitespace:
	mov	ecx, [esp+4]
#ifndef OPTIMIZE
	mov	eax, 2600h	; eax = (1 << '\r') | (1 << '\n') | (1 << '\t')
	cmp	cl, al
	seta	al		; eax |= (c > '\0')
#else
	xor	eax, eax
	cmp	eax, ecx	; CF = (c > '\0')
	adc	eax, 2600h
#endif
	shr	eax, cl		; eax >>= (c % ' ')
	xor	edx, edx
	cmp	ecx, 33		; CF = (c <= ' ')
	adc	edx, edx	; edx = (c <= ' ')
	and	eax, edx
	ret
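The same bitmask trick can be expressed in portable C; the following sketch (mine, not from Godbolt’s article; the function name is made up) is the kind of source a compiler should be able to turn into the branch-free code shown above:

#include <stdbool.h>

bool isWhitespaceMasked(char c)
{
    // 64-bit mask with one bit set per whitespace character: 0x100002600
    unsigned long long mask = (1ULL << ' ')
                            | (1ULL << '\r')
                            | (1ULL << '\n')
                            | (1ULL << '\t');
    unsigned char u = (unsigned char) c;
    // "& 63" keeps the shift count defined for characters >= 64;
    // the comparison yields 0 or 1 and masks the shifted bit.
    return ((mask >> (u & 63)) & 1) & (u < 64);
}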

Case 1: __udivmod?i4() Functions

libgcc provides the __udivmoddi4() and __udivmodti4() functions for unsigned integer division:
Runtime Function: unsigned long __udivmoddi4 (unsigned long a, unsigned long b, unsigned long *c)
Runtime Function: unsigned long long __udivmodti4 (unsigned long long a, unsigned long long b, unsigned long long *c)

These functions calculate both the quotient and remainder of the unsigned division of a and b. The return value is the quotient, and the remainder is placed in the variable pointed to by c.
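
A minimal usage sketch (assuming the i386 target, where the “double word” type is unsigned long long and libgcc supplies __udivmoddi4(); the wrapper name is just for illustration):

extern unsigned long long __udivmoddi4(unsigned long long a,
                                       unsigned long long b,
                                       unsigned long long *c);

unsigned long long divmod_example(unsigned long long a, unsigned long long b,
                                  unsigned long long *remainder)
{
    // the quotient is returned, the remainder is stored through the pointer,
    // e.g. a = 100, b = 7: returns 14, *remainder = 2
    return __udivmoddi4(a, b, remainder);
}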

The last 5 statements of the following part of their source perform two multi-word left shifts on the variable pair d1, d0 holding the divisor and the variable triple n2, n1, n0 holding the dividend:
/* More subroutines needed by GCC output code on some machines.  */
/* Compile this one with gcc.  */
/* Copyright (C) 1989-2020 Free Software Foundation, Inc.
[…]
UDWtype
__udivmoddi4 (UDWtype n, UDWtype d, UDWtype *rp)
{
[…]
  UWtype d0, d1, n0, n1, n2;
  UWtype b, bm;
[…]
	  count_leading_zeros (bm, d1);
	  if (bm == 0)
[…]
	  else
	    {
	      UWtype m1, m0;
	      /* Normalize.  */

	      b = W_TYPE_SIZE - bm;

	      d1 = (d1 << bm) | (d0 >> b);
	      d0 = d0 << bm;
	      n2 = n1 >> b;
	      n1 = (n1 << bm) | (n0 >> b);
	      n0 = n0 << bm;
[…]
For the AMD64 target platform, GCC compiles this sequence of shifts to the following unoptimised and rather awful code:
[…]
0000000000000000 <__udivmodti4>:
[…]
  b0: 4c 0f bd da           bsr    %rdx,%r11
  b4: 49 83 f3 3f           xor    $0x3f,%r11
  b8: 45 85 db              test   %r11d,%r11d
  bb: 75 33                 jne    f0 <__udivmodti4+0xf0>
[…]
  f0: 49 63 c3              movslq %r11d,%rax
  f3: bd 40 00 00 00        mov    $0x40,%ebp
  f8: 44 89 d9              mov    %r11d,%ecx
  fb: 4d 89 ec              mov    %r13,%r12
  fe: 48 29 c5              sub    %rax,%rbp
 101: 48 d3 e2              shl    %cl,%rdx
 104: 49 89 f2              mov    %rsi,%r10
 107: 48 89 f8              mov    %rdi,%rax
 10a: 89 e9                 mov    %ebp,%ecx
 10c: 44 89 db              mov    %r11d,%ebx
 10f: 49 d3 ec              shr    %cl,%r12
 112: 44 89 d9              mov    %r11d,%ecx
 115: 49 d3 e5              shl    %cl,%r13
 118: 89 e9                 mov    %ebp,%ecx
 11a: 49 09 d4              or     %rdx,%r12
 11d: 49 d3 ea              shr    %cl,%r10
 120: 44 89 d9              mov    %r11d,%ecx
 123: 48 d3 e6              shl    %cl,%rsi
 126: 89 e9                 mov    %ebp,%ecx
 128: 4c 89 d2              mov    %r10,%rdx
 12b: 48 d3 e8              shr    %cl,%rax
 12e: 44 89 d9              mov    %r11d,%ecx
 131: 48 09 c6              or     %rax,%rsi
 134: 48 d3 e7              shl    %cl,%rdi
 137: 48 89 f0              mov    %rsi,%rax
 13a: 49 89 f9              mov    %rdi,%r9
[…]
Oops: the TEST instruction at address b8 is superfluous.

Ouch: the optimiser fails to recognise the common and well-known idiom of complementary shifts, for which it should generate the following properly optimised code:

[…]
                            mov    %r11d, %ecx
                            xor    n2, n2
                            shld   %cl, n1, n2
                            shld   %cl, n0, n1
                            shl    %cl, n0
                            shld   %cl, d0, d1
                            shl    %cl, d0
[…]
Note: the optimiser does, however, recognise the following similar, also common and well-known idiom used for rotation:
unsigned long long __rotldi3(unsigned long long value, int count) {
    return (value << (63 & count))
         | (value >> (63 & -count));
}
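For contrast, a stand-alone rendition of the complementary-shift idiom from the __udivmoddi4() source above looks like this (a sketch; the function name is made up), yet, as the listing above shows, GCC compiles this pattern with separate shifts and an OR instead of a single SHLD:

unsigned long long shift_high(unsigned long long high,
                              unsigned long long low,
                              unsigned count)   // count in 1...63
{
    return (high << count) | (low >> (64 - count));
}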
Note: exploration of the same unoptimised and awful code in the __divmodti4(), __divti3(), __modti3(), __udivti3() and __umodti3() as well as the __divmoddi4(), __divdi3(), __moddi3(), __udivmoddi4(), __udivdi3() and __umoddi3() functions is left as an exercise to the reader.

Case 2: Missing __rotl?i3() and __rotr?i3() Functions

libgcc fails to provide the __rotlsi3(), __rotldi3() and __rotlti3() as well as the __rotrsi3(), __rotrdi3() and __rotrti3() functions for (unsigned) integer rotation, also known as circular shift.

Create the text file case2.c with the following content in an arbitrary, preferably empty directory:

// Copyleft © 2014-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

unsigned int __rotlsi3(unsigned int value, int count) {
    return (value << (31 & count))
         | (value >> (31 & -count));
}

unsigned int __rotrsi3(unsigned int value, int count) {
    return (value >> (31 & count))
         | (value << (31 & -count));
}

unsigned long long __rotldi3(unsigned long long value, int count) {
    return (value << (63 & count))
         | (value >> (63 & -count));
}

unsigned long long __rotrdi3(unsigned long long value, int count) {
    return (value >> (63 & count))
         | (value << (63 & -count));
}

#ifdef __amd64__
__uint128_t __rotlti3(__uint128_t value, int count) {
    return (value << (127 & count))
         | (value >> (127 & -count));
}

__uint128_t __rotrti3(__uint128_t value, int count) {
#ifdef OPTIMIZE
    __asm__("movq\t%[low], %%rax\n\t"
            "shrdq\t%%cl, %[high], %%rax\n\t"
            "shrdq\t%%cl, %[low], %[high]\n\t"
            "movq\t%[high], %%rdx\n\t"
            "test\t$64, %%cl\n\t"
            "cmovnz\t%%rax, %%rdx\n\t"
            "cmovnz\t%[high], %%rax"
           :"=A" (value)
           :"c" (count),
            [low] "r" ((unsigned long long) (value & ~0ULL)),
            [high] "r" ((unsigned long long) (value >> 64))
           :"cc");
    return value;
#else
    return (value >> (127 & count))
         | (value << (127 & -count));
#endif
}
#endif // __amd64__
Compile the source file case2.c with GCC, engaging the optimiser, targeting the AMD64 platform, and display the generated assembly code:
gcc -m64 -mabi=sysv -o- -Os -S case2.c
[…]
__rotlsi3:
	movl	%edi, %eax
	movl	%esi, %ecx
	roll	%cl, %eax
	ret
[…]
__rotrsi3:
	movl	%edi, %eax
	movl	%esi, %ecx
	rorl	%cl, %eax
	ret
[…]
__rotldi3:
	movq	%rdi, %rax
	movl	%esi, %ecx
	rolq	%cl, %rax
	ret
[…]
__rotrdi3:
	movq	%rdi, %rax
	movl	%esi, %ecx
	rorq	%cl, %rax
	ret
[…]
__rotlti3:
	movq	%rdi, %r9
	movq	%rsi, %r8
	andl	$127, %edx
	movq	%rdi, %rsi
	movl	%edx, %ecx
	movq	%r9, %rax
	movq	%r8, %rdx
	movq	%r8, %rdi
	shldq	%cl, %r9, %rdx
	salq	%cl, %rax
	xorl	%r8d, %r8d
	testb	$64, %cl
	cmovne	%rax, %rdx
	cmovne	%r8, %rax
	negl	%ecx
	xorl	%r9d, %r9d
	andl	$127, %ecx
	shrdq	%cl, %rdi, %rsi
	shrq	%cl, %rdi
	testb	$64, %cl
	cmovne	%rdi, %rsi
	cmovne	%r9, %rdi
	orq	%rsi, %rax
	orq	%rdi, %rdx
	ret
[…]
__rotrti3:
	movq	%rsi, %r8
	movq	%rdi, %r9
	andl	$127, %edx
	movq	%rdi, %rsi
	movl	%edx, %ecx
	movq	%r9, %rax
	movq	%r8, %rdx
	movq	%r8, %rdi
	shrdq	%cl, %r8, %rax
	shrq	%cl, %rdx
	xorl	%r8d, %r8d
	testb	$64, %cl
	cmovne	%rdx, %rax
	cmovne	%r8, %rdx
	negl	%ecx
	andl	$127, %ecx
	shldq	%cl, %r9, %rdi
	salq	%cl, %rsi
	xorl	%r9d, %r9d
	testb	$64, %cl
	cmovne	%rsi, %rdi
	cmovne	%r9, %rsi
	orq	%rdi, %rdx
	orq	%rsi, %rax
	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"
Ouch: while the optimiser recognises the common and well-known idiom for rotation of 32-bit and 64-bit integers, it fails rather badly at 128-bit integers and generates awful code!

Proper code uses only 9 instructions in just 31 bytes instead of 25 instructions in 77 bytes for each of the __rotlti3() and __rotrti3() functions:

# Copyright © 2014-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.arch	generic64
	.code64
	.intel_syntax noprefix
	.global	__rotlti3
	.global	__rotrti3
	.type	__rotlti3, @function
	.type	__rotrti3, @function
	.text
				# rsi:rdi = value
				# rdx = count
__rotlti3:
	mov	ecx, edx
	mov	rax, rdi
	shld	rax, rsi, cl
	shld	rsi, rdi, cl
	mov	rdx, rsi
	test	ecx, 64
	cmovnz	rdx, rax
	cmovnz	rax, rsi
	ret

__rotrti3:
	mov	ecx, edx
	mov	rax, rdi
	shrd	rax, rsi, cl
	shrd	rsi, rdi, cl
	mov	rdx, rsi
	test	ecx, 64
	cmovnz	rdx, rax
	cmovnz	rax, rsi
	ret

	.end

Note: exploration of the equally bad code generated for rotation of 64-bit integers on the i386 platform is left as an exercise to the reader.

Case 3: __ashl?i3() Functions

libgcc provides the __ashlsi3(), __ashldi3() and __ashlti3() functions for integer shift operations:
Runtime Function: int __ashlsi3 (int a, int b)
Runtime Function: long __ashldi3 (long a, int b)
Runtime Function: long long __ashlti3 (long long a, int b)

These functions return the result of shifting a left by b bits.
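
To see what the function has to do, here is a sketch in C (illustrative name) that builds the 128-bit left shift from 64-bit halves and, like the hand-written code further below, also copes with shift counts greater than 127:

__uint128_t ashlti3_ref(__uint128_t value, int count)
{
    unsigned long long low  = (unsigned long long) value;
    unsigned long long high = (unsigned long long) (value >> 64);

    if ((unsigned) count > 127)
        return 0;                               // everything shifted out
    if (count >= 64)                            // low half becomes the high half
        return (__uint128_t) (low << (count - 64)) << 64;
    if (count == 0)                             // avoid the undefined shift by 64 below
        return value;
    return ((__uint128_t) ((high << count) | (low >> (64 - count))) << 64)
         | (low << count);
}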

For the AMD64 target platform, GCC compiles __ashlti3() to the following unoptimised and awful code:
[…]
0000000000000000 <__ashlti3>:
   0:	48 85 d2             	test   %rdx,%rdx
   3:	74 3b                	je     40 <__ashlti3+0x40>
   5:	41 b8 40 00 00 00    	mov    $0x40,%r8d
   b:	49 29 d0             	sub    %rdx,%r8
   e:	4d 85 c0             	test   %r8,%r8
  11:	7e 1d                	jle    30 <__ashlti3+0x30>
  13:	89 d1                	mov    %edx,%ecx
  15:	48 89 f8             	mov    %rdi,%rax
  18:	48 d3 e6             	shl    %cl,%rsi
  1b:	48 d3 e0             	shl    %cl,%rax
  1e:	44 89 c1             	mov    %r8d,%ecx
  21:	48 89 f2             	mov    %rsi,%rdx
  24:	48 d3 ef             	shr    %cl,%rdi
  27:	48 09 fa             	or     %rdi,%rdx
  2a:	c3                   	retq   
  2b:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
  30:	44 89 c1             	mov    %r8d,%ecx
  33:	31 c0                	xor    %eax,%eax
  35:	f7 d9                	neg    %ecx
  37:	48 d3 e7             	shl    %cl,%rdi
  3a:	48 89 fa             	mov    %rdi,%rdx
  3d:	c3                   	retq   
  3e:	66 90                	xchg   %ax,%ax
  40:	48 89 f8             	mov    %rdi,%rax
  43:	48 89 f2             	mov    %rsi,%rdx
  46:	c3                   	retq   
[…]
Oops: the TEST instruction at address e is superfluous.

Proper, also branch-free code uses only 11 instructions in just 32 bytes instead of 26 instructions in 71 bytes, and handles shift counts greater than 127:

# Copyright © 2014-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.arch	generic64
	.code64
	.intel_syntax noprefix
	.global	__ashlti3
	.type	__ashlti3, @function
	.text
				# rsi:rdi = value
				# rdx = count
__ashlti3:
	mov	ecx, edx
	xor	edx, edx
	xor	eax, eax
	shld	rsi, rdi, cl
	shl	rdi, cl
	cmp	ecx, 127
	cmovbe	rdx, rdi
	cmp	ecx, 63
	cmovbe	rdx, rsi
	cmovbe	rax, rdi
	ret

	.end
Note: exploration of the equally bad code generated for the __ashldi3() function on the i386 platform is left as an exercise to the reader.

Case 4: __ashr?i3() Functions

libgcc provides the __ashrsi3(), __ashrdi3() and __ashrti3() functions for integer shift operations:
Runtime Function: int __ashrsi3 (int a, int b)
Runtime Function: long __ashrdi3 (long a, int b)
Runtime Function: long long __ashrti3 (long long a, int b)

These functions return the result of arithmetically shifting a right by b bits.

For the AMD64 target platform, GCC compiles __ashrti3() to the following unoptimised and awful code:
[…]
0000000000000000 <__ashrti3>:
   0:	48 89 d1             	mov    %rdx,%rcx
   3:	48 85 d2             	test   %rdx,%rdx
   6:	74 38                	je     40 <__ashrti3+0x40>
   8:	41 b8 40 00 00 00    	mov    $0x40,%r8d
   e:	49 29 d0             	sub    %rdx,%r8
  11:	48 89 f2             	mov    %rsi,%rdx
  14:	4d 85 c0             	test   %r8,%r8
  17:	7e 17                	jle    30 <__ashrti3+0x30>
  19:	48 d3 ef             	shr    %cl,%rdi
  1c:	48 d3 fa             	sar    %cl,%rdx
  1f:	48 89 f0             	mov    %rsi,%rax
  22:	44 89 c1             	mov    %r8d,%ecx
  25:	48 d3 e0             	shl    %cl,%rax
  28:	48 09 f8             	or     %rdi,%rax
  2b:	c3                   	retq   
  2c:	0f 1f 40 00          	nopl   0x0(%rax)
  30:	44 89 c1             	mov    %r8d,%ecx
  33:	48 89 f0             	mov    %rsi,%rax
  36:	48 c1 fa 3f          	sar    $0x3f,%rdx
  3a:	f7 d9                	neg    %ecx
  3c:	48 d3 f8             	sar    %cl,%rax
  3f:	c3                   	retq   
  40:	48 89 f8             	mov    %rdi,%rax
  43:	48 89 f2             	mov    %rsi,%rdx
  46:	c3                   	retq   
[…]
Oops: the TEST instruction at address 14 is superfluous.

Proper, also branch-free code uses only 12 instructions in just 36 bytes instead of 25 instructions in 71 bytes, and handles shift counts greater than 127:

# Copyright © 2014-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.arch	generic64
	.code64
	.intel_syntax noprefix
	.global	__ashrti3
	.type	__ashrti3, @function
	.text
				# rsi:rdi = value
				# rdx = count
__ashrti3:
	mov	ecx, edx
	mov	rax, rsi
	cqo
	mov	rax, rdx
	shrd	rdi, rsi, cl
	sar	rsi, cl
	cmp	ecx, 127
	cmovbe	rax, rsi
	cmp	ecx, 63
	cmovbe	rdx, rsi
	cmovbe	rax, rdi
	ret

	.end
Note: exploration of the equally bad code generated for the __ashrdi3() function on the i386 platform is left as an exercise to the reader.

Case 5: __lshr?i3() Functions

libgcc provides the __lshrsi3(), __lshrdi3() and __lshrti3() functions for integer shift operations:
Runtime Function: int __lshrsi3 (int a, int b)
Runtime Function: long __lshrdi3 (long a, int b)
Runtime Function: long long __lshrti3 (long long a, int b)

These functions return the result of logically shifting a right by b bits.

For the AMD64 target platform, GCC compiles __lshrti3() to the following unoptimised and awful code:
[…]
0000000000000000 <__lshrti3>:
   0:	48 89 d1             	mov    %rdx,%rcx
   3:	48 85 d2             	test   %rdx,%rdx
   6:	74 38                	je     40 <__lshrti3+0x40>
   8:	41 b8 40 00 00 00    	mov    $0x40,%r8d
   e:	49 29 d0             	sub    %rdx,%r8
  11:	4d 85 c0             	test   %r8,%r8
  14:	7e 1a                	jle    30 <__lshrti3+0x30>
  16:	48 89 f2             	mov    %rsi,%rdx
  19:	48 d3 ef             	shr    %cl,%rdi
  1c:	48 89 f0             	mov    %rsi,%rax
  1f:	48 d3 ea             	shr    %cl,%rdx
  22:	44 89 c1             	mov    %r8d,%ecx
  25:	48 d3 e0             	shl    %cl,%rax
  28:	48 09 f8             	or     %rdi,%rax
  2b:	c3                   	retq   
  2c:	0f 1f 40 00          	nopl   0x0(%rax)
  30:	44 89 c1             	mov    %r8d,%ecx
  33:	48 89 f0             	mov    %rsi,%rax
  36:	31 d2                	xor    %edx,%edx
  38:	f7 d9                	neg    %ecx
  3a:	48 d3 e8             	shr    %cl,%rax
  3d:	c3                   	retq   
  3e:	66 90                	xchg   %ax,%ax
  40:	48 89 f8             	mov    %rdi,%rax
  43:	48 89 f2             	mov    %rsi,%rdx
  46:	c3                   	retq   
[…]
Oops: the TEST instruction at address 11 is superfluous.

Proper, also branch-free code uses only 11 instructions in just 32 bytes instead of 25 instructions in 71 bytes, and handles shift counts greater than 127:

# Copyright © 2014-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.arch	generic64
	.code64
	.intel_syntax noprefix
	.global	__lshrti3
	.type	__lshrti3, @function
	.text
				# rsi:rdi = value
				# rdx = count
__lshrti3:
	mov	ecx, edx
	xor	edx, edx
	xor	eax, eax
	shrd	rdi, rsi, cl
	shr	rsi, cl
	cmp	ecx, 127
	cmovbe	rax, rsi
	cmp	ecx, 63
	cmovbe	rdx, rsi
	cmovbe	rax, rdi
	ret

	.end
Note: exploration of the equally bad code generated for the __lshrdi3() function on the i386 platform is left as an exercise to the reader.

Case 6: __cmp?i2() Functions

libgcc provides the __cmpsi2(), __cmpdi2() and __cmpti2() functions for signed integer comparison:
Runtime Function: int __cmpsi2 (int a, int b)
Runtime Function: int __cmpdi2 (long a, long b)
Runtime Function: int __cmpti2 (long long a, long long b)

These functions perform a signed comparison of a and b. If a is less than b, they return 0; if a is greater than b, they return 2; and if a and b are equal they return 1.
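
The specified mapping can be written branch-free in a single line of C, which is essentially what the hand-written code further below computes (a sketch, not the libgcc source; the name is made up):

int cmpti2_ref(__int128_t a, __int128_t b)
{
    return (a > b) - (a < b) + 1;   // 0 if a < b, 1 if a == b, 2 if a > b
}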

For the AMD64 target platform, GCC compiles __cmpti2() to the following unoptimised and awkward code using (superfluous) conditional branches which impair performance:
[…]
0000000000000000 <__cmpti2>:
   0:	31 c0                	xor    %eax,%eax
   2:	48 39 ce             	cmp    %rcx,%rsi
   5:	7c 1c                	jl     23 <__cmpti2+0x23>
   7:	b8 02 00 00 00       	mov    $0x2,%eax
   c:	7f 15                	jg     23 <__cmpti2+0x23>
   e:	31 c0                	xor    %eax,%eax
  10:	48 39 d7             	cmp    %rdx,%rdi
  13:	72 0e                	jb     23 <__cmpti2+0x23>
  15:	b8 01 00 00 00       	mov    $0x1,%eax
  1a:	ba 02 00 00 00       	mov    $0x2,%edx
  1f:	48 0f 47 c2          	cmova  %rdx,%rax
  23:	c3                   	retq   
[…]
Proper, also branch-free code uses 11 instructions in 29 bytes instead of 12 instructions in 36 bytes:
# Copyright © 2014-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.arch	generic64
	.code64
	.intel_syntax noprefix
	.global	__cmpti2
	.type	__cmpti2, @function
	.text
				# rsi:rdi = left argument
				# rcx:rdx = right argument
__cmpti2:
	cmp	rdi, rdx
	mov	rax, rsi	# rax = high qword of left argument
	sbb	rax, rcx	# rax = high qword of left argument
				#     - high qword of right argument
				#     - borrow
	setl	ah		# ah = left argument < right argument
	cmp	rdx, rdi
	sbb	rcx, rsi	# rcx = high qword of right argument
				#     - high qword of left argument
				#     - borrow
	setl	al		# al = right argument < left argument
	sub	al, ah		# al = left argument > right argument
				#    - left argument < right argument
	movsx	eax, al		# eax = left argument > right argument
				#     - left argument < right argument
				#     = {-1, 0, 1}
	inc	eax		# eax = {0, 1, 2}
	ret

	.end
Note: exploration of the equally bad code generated for the __cmpdi2() and __cmpsi2() functions is left as an exercise to the reader.

Case 7: __ucmp?i2() Functions

libgcc provides the __ucmpsi2(), __ucmpdi2() and __ucmpti2() functions for unsigned integer comparison:
Runtime Function: int __ucmpsi2 (unsigned int a, unsigned int b)
Runtime Function: int __ucmpdi2 (unsigned long a, unsigned long b)
Runtime Function: int __ucmpti2 (unsigned long long a, unsigned long long b)

These functions perform an unsigned comparison of a and b. If a is less than b, they return 0; if a is greater than b, they return 2; and if a and b are equal they return 1.

For the AMD64 target platform, GCC compiles __ucmpti2() to the following unoptimised and awkward code using (superfluous) conditional branches which impair performance:
[…]
0000000000000000 <__ucmpti2>:
   0:	31 c0                	xor    %eax,%eax
   2:	48 39 ce             	cmp    %rcx,%rsi
   5:	72 1c                	jb     23 <__ucmpti2+0x23>
   7:	b8 02 00 00 00       	mov    $0x2,%eax
   c:	77 15                	ja     23 <__ucmpti2+0x23>
   e:	31 c0                	xor    %eax,%eax
  10:	48 39 d7             	cmp    %rdx,%rdi
  13:	72 0e                	jb     23 <__ucmpti2+0x23>
  15:	b8 01 00 00 00       	mov    $0x1,%eax
  1a:	ba 02 00 00 00       	mov    $0x2,%edx
  1f:	48 0f 47 c2          	cmova  %rdx,%rax
  23:	c3                   	retq   
[…]
Proper, also branch-free code uses only 8 instructions in just 21 bytes instead of 12 instructions in 36 bytes:
# Copyright © 2014-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.arch	generic64
	.code64
	.intel_syntax noprefix
	.global	__ucmpti2
	.type	__ucmpti2, @function
	.text
				# rsi:rdi = left argument
				# rcx:rdx = right argument
__ucmpti2:
	cmp	rdi, rdx
	mov	rax, rsi	# rax = high qword of left argument
	sbb	rax, rcx	# rax = high qword of left argument
				#     - high qword of right argument
				#     - borrow
	sbb	eax, eax	# eax = 0
				#     - left argument < right argument
	cmp	rdx, rdi
	sbb	rcx, rsi	# rcx = high qword of right argument
				#     - high qword of left argument
				#     - borrow
	adc	eax, 1		# eax = left argument > right argument
				#     - left argument < right argument
				#     + 1
				#     = {0, 1, 2}
	ret

	.end
Note: exploration of the equally bad code generated for the __ucmpdi2() and __ucmpsi2() functions is left as an exercise to the reader.

Case 8: __mul?i3() Functions

libgcc provides the __mulsi3(), __muldi3() and __multi3() functions for integer multiplication:
Runtime Function: int __mulsi3 (int a, int b)
Runtime Function: long __muldi3 (long a, long b)
Runtime Function: long long __multi3 (long long a, long long b)

These functions return the product of a and b.
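
Both the generated and the hand-written code below rely on the usual decomposition into 64-bit halves: with a = a1*2**64 + a0 and b = b1*2**64 + b0, the truncated product is a*b mod 2**128 = a0*b0 + ((a1*b0 + a0*b1) mod 2**64)*2**64. A sketch in C (illustrative name):

__uint128_t multi3_ref(__uint128_t a, __uint128_t b)
{
    unsigned long long a0 = (unsigned long long) a, a1 = (unsigned long long) (a >> 64);
    unsigned long long b0 = (unsigned long long) b, b1 = (unsigned long long) (b >> 64);
    __uint128_t low = (__uint128_t) a0 * b0;        // full 128-bit partial product
    unsigned long long cross = a1 * b0 + a0 * b1;   // cross products, modulo 2**64
    return low + ((__uint128_t) cross << 64);
}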

For the AMD64 target platform, GCC compiles __multi3() to the following code, which uses a LEA instruction instead of the appropriate (shorter) ADD instruction to compute the high part of the product, and two superfluous MOV instructions:
[…]
0000000000000000 <__multi3>:
   0:	49 89 f0             	mov    %rsi,%r8
   3:	48 89 f8             	mov    %rdi,%rax
   6:	48 89 d6             	mov    %rdx,%rsi
   9:	48 0f af f9          	imul   %rcx,%rdi
   d:	49 0f af f0          	imul   %r8,%rsi
  11:	48 f7 e2             	mul    %rdx
  14:	48 01 d7             	add    %rdx,%rdi
  17:	48 8d 14 37          	lea    (%rdi,%rsi,1),%rdx
  1b:	c3                   	retq   
[…]
Proper code uses only 7 instructions in 21 bytes instead of 9 instructions in 28 bytes:
# Copyright © 2014-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.arch	generic64
	.code64
	.intel_syntax noprefix
	.global	__multi3
	.type	__multi3, @function
	.text
				# rsi:rdi = multiplicand
				# rcx:rdx = multiplier
__multi3:
	mov	rax, rdi	# rax = low qword of multiplicand
	imul	rsi, rdx	# rsi = high qword of multiplicand
				#     * low qword of multiplier
	imul	rdi, rcx	# rdi = low qword of multiplicand
				#     * high qword of multiplier
	mul	rdx		# rdx:rax = low qword of multiplicand
				#         * low qword of multiplier
	add	rdx, rsi
	add	rdx, rdi	# rdx:rax = product % 2**128
	ret

	.end
Note: exploration of the equally bad code generated for the __muldi3() function on the i386 platform is left as an exercise to the reader.

Case 9: __neg?i2() Functions

libgcc provides the __negsi2(), __negdi2() and __negti2() functions for signed integer negation:
Runtime Function: int __negsi2 (int a)
Runtime Function: long __negdi2 (long a)
Runtime Function: long long __negti2 (long long a)

These functions return the negation of a.

For the AMD64 target platform, GCC compiles __negti2() to the following unoptimised and awkward code:
[…]
0000000000000000 <__negti2>:
   0:	48 f7 de             	neg    %rsi
   3:	48 89 f8             	mov    %rdi,%rax
   6:	48 f7 d8             	neg    %rax
   9:	48 89 f2             	mov    %rsi,%rdx
   c:	48 83 ff 01          	cmp    $0x1,%rdi
  10:	48 83 d2 ff          	adc    $0xffffffffffffffff,%rdx
  14:	c3                   	retq   
[…]
Proper code uses only 5 instructions in just 12 bytes instead of 7 instructions in 21 bytes:
# Copyright © 2014-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.arch	generic64
	.code64
	.intel_syntax noprefix
	.global	__negti2
	.type	__negti2, @function
	.text
				# rsi:rdi = argument
__negti2:
	mov	rax, rdi
	xor	edx, edx
	neg	rax
	sbb	rdx, rsi
	ret

	.end
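The NEG/SBB pair above implements the usual two-word negation; in C it can be sketched as follows (illustrative name, using unsigned arithmetic so the wraparound is well defined):

__int128_t negti2_ref(__int128_t value)
{
    unsigned long long low  = (unsigned long long) value;
    unsigned long long high = (unsigned long long) ((__uint128_t) value >> 64);
    unsigned long long nlow  = 0 - low;                 // NEG: sets the borrow if low != 0
    unsigned long long nhigh = 0 - high - (low != 0);   // SBB: subtracts high and the borrow
    return (__int128_t) (((__uint128_t) nhigh << 64) | nlow);
}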
Note: exploration of the equally bad code generated for the __negdi2() function on the i386 platform is left as an exercise to the reader.

Case 10: __absv?i2() Functions

libgcc provides the __absvsi2(), __absvdi2() and __absvti2() functions:
Runtime Function: int __absvsi2 (int a)
Runtime Function: long __absvdi2 (long a)
Runtime Function: long long __absvti2 (long long a)

These functions return the absolute value of a.
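
Branch-free absolute value uses a sign mask, which is what the hand-written code below obtains with CQO; a C sketch (illustrative name, without the overflow trap for the most negative value, and using unsigned arithmetic to keep the wraparound well defined):

__int128_t absti2_ref(__int128_t value)
{
    __uint128_t sign = (__uint128_t) (value >> 127);            // all-ones if value < 0, else 0
    return (__int128_t) (((__uint128_t) value + sign) ^ sign);  // equivalent to (value ^ sign) - sign
}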

For the AMD64 target platform, GCC compiles __absvti2() to the following unoptimised and awkward code:
[…]
0000000000000000 <__absvti2>:
   0:	48 89 f8             	mov    %rdi,%rax
   3:	48 89 f2             	mov    %rsi,%rdx
   6:	48 85 f6             	test   %rsi,%rsi
   9:	78 05                	js     10 <__absvti2+0x10>
   b:	c3                   	retq   
   c:	0f 1f 40 00          	nopl   0x0(%rax)
  10:	48 83 ec 08          	sub    $0x8,%rsp
  14:	48 f7 d8             	neg    %rax
  17:	48 83 d2 00          	adc    $0x0,%rdx
  1b:	48 f7 da             	neg    %rdx
  1e:	48 85 d2             	test   %rdx,%rdx
  21:	0f 88 00 00 00 00    	js     …
  27:	48 83 c4 08          	add    $0x8,%rsp
  2b:	c3                   	retq   
[…]
Oops: the TEST instruction at address 1e is superfluous.

Proper code uses only 10 instructions in just 25 bytes instead of 14 instructions in 44 bytes:

# Copyright © 2014-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.arch	generic64
	.code64
	.intel_syntax noprefix
	.global	__absvti2
	.type	__absvti2, @function
	.text
				# rsi:rdi = argument
__absvti2:
	mov	rax, rsi
	cqo			# rdx = (argument < 0) ? -1 : 0
	add	rdi, rdx
	adc	rsi, rdx	# rsi:rdi = (argument < 0) ? argument - 1 : argument
	jo	.overflow
	mov	rax, rdx
	xor	rax, rdi
	xor	rdx, rsi	# rdx:rax = (argument < 0) ? -argument : argument
	ret
.overflow:
	ud2

	.end
Note: exploration of the equally bad code generated for the __absvdi2() and __absvsi2() functions is left as an exercise to the reader.

Case 11: __addv?i3() Functions

libgcc provides the __addvsi3(), __addvdi3() and __addvti3() functions for overflow-trapping signed integer addition:
Runtime Function: int __addvsi3 (int a, int b)
Runtime Function: long __addvdi3 (long a, long b)
Runtime Function: long long __addvti3 (long long a, long long b)

These functions return the sum of a and b; that is a + b.
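
The whole job amounts to an ADD/ADC pair plus a trap on signed overflow, as the hand-written code below shows; with GCC’s builtin it can be sketched in C as follows (illustrative name; see also cases 17 and 18, which use the related builtins):

__int128_t addvti3_ref(__int128_t a, __int128_t b)
{
    __int128_t sum;
    if (__builtin_add_overflow(a, b, &sum))   // signed 128-bit addition with overflow check
        __builtin_trap();
    return sum;
}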

For the AMD64 target platform, GCC compiles __addvti3() to the following unoptimised and awful code:
[…]
0000000000000000 <__addvti3>:
   0:	49 89 f9             	mov    %rdi,%r9
   3:	49 89 d0             	mov    %rdx,%r8
   6:	48 83 ec 08          	sub    $0x8,%rsp
   a:	48 89 f2             	mov    %rsi,%rdx
   d:	4c 89 c8             	mov    %r9,%rax
  10:	48 89 f7             	mov    %rsi,%rdi
  13:	4c 01 c0             	add    %r8,%rax
  16:	48 11 ca             	adc    %rcx,%rdx
  19:	48 85 c9             	test   %rcx,%rcx
  1c:	78 12                	js     30 <__addvti3+0x30>
  1e:	4c 39 c8             	cmp    %r9,%rax
  21:	48 89 d1             	mov    %rdx,%rcx
  24:	48 19 f1             	sbb    %rsi,%rcx
  27:	7c 18                	jl     41 <__addvti3+0x41>
  29:	48 83 c4 08          	add    $0x8,%rsp
  2d:	c3                   	retq   
  2e:	66 90                	xchg   %ax,%ax
  30:	49 39 c1             	cmp    %rax,%r9
  33:	48 19 d7             	sbb    %rdx,%rdi
  36:	0f 8c 00 00 00 00    	jl     …
  3c:	48 83 c4 08          	add    $0x8,%rsp
  40:	c3                   	retq   
  41:	e9 00 00 00 00       	jmpq   …
[…]
Proper code uses only 7 instructions in just 17 bytes instead of 23 instructions in 70 bytes:
# Copyright © 2014-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.arch	generic64
	.code64
	.intel_syntax noprefix
	.global	__addvti3
	.type	__addvti3, @function
	.text
				# rsi:rdi = augend
				# rcx:rdx = addend
__addvti3:
	add	rdi, rdx
	adc	rsi, rcx
	jo	.overflow
	mov	rax, rdi
	mov	rdx, rsi
	ret
.overflow:
	ud2

	.end
Note: exploration of the equally bad code generated for the __addvdi3() and __addvsi3() functions is left as an exercise to the reader.

Case 12: __subv?i3() Functions

libgcc provides the __subvsi3(), __subvdi3() and __subvti3() functions for overflow-trapping signed integer subtraction:
Runtime Function: int __subvsi3 (int a, int b)
Runtime Function: long __subvdi3 (long a, long b)
Runtime Function: long long __subvti3 (long long a, long long b)

These functions return the difference between a and b; that is a - b.

For the AMD64 target platform, GCC compiles __subvti3() to the following unoptimised and awful code:
[…]
0000000000000000 <__subvti3>:
   0:	49 89 f9             	mov    %rdi,%r9
   3:	49 89 d0             	mov    %rdx,%r8
   6:	48 83 ec 08          	sub    $0x8,%rsp
   a:	48 89 f2             	mov    %rsi,%rdx
   d:	4c 89 c8             	mov    %r9,%rax
  10:	4c 29 c0             	sub    %r8,%rax
  13:	48 19 ca             	sbb    %rcx,%rdx
  16:	48 85 c9             	test   %rcx,%rcx
  19:	78 15                	js     30 <__subvti3+0x30>
  1b:	49 39 c1             	cmp    %rax,%r9
  1e:	48 89 f1             	mov    %rsi,%rcx
  21:	48 19 d1             	sbb    %rdx,%rcx
  24:	7c 1e                	jl     44 <__subvti3+0x44>
  26:	48 83 c4 08          	add    $0x8,%rsp
  2a:	c3                   	retq   
  2b:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
  30:	4c 39 c8             	cmp    %r9,%rax
  33:	48 89 d1             	mov    %rdx,%rcx
  36:	48 19 f1             	sbb    %rsi,%rcx
  39:	0f 8c 00 00 00 00    	jl     …
  3f:	48 83 c4 08          	add    $0x8,%rsp
  43:	c3                   	retq   
  44:	e9 00 00 00 00       	jmpq   …
[…]
Proper code uses only 7 instructions in just 17 bytes instead of 23 instructions in 73 bytes:
# Copyright © 2014-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.arch	generic64
	.code64
	.intel_syntax noprefix
	.global	__subvti3
	.type	__subvti3, @function
	.text
				# rsi:rdi = minuend
				# rcx:rdx = subtrahend
__subvti3:
	sub	rdi, rdx
	sbb	rsi, rcx
	jo	.overflow
	mov	rax, rdi
	mov	rdx, rsi
	ret
.overflow:
	ud2

	.end
Note: exploration of the equally bad code generated for the __subvdi3() and __subvsi3() functions is left as an exercise to the reader.

Case 13: __mulv?i3() Functions

libgcc provides the __mulvsi3(), __mulvdi3() and __mulvti3() functions for overflow-trapping signed integer multiplication:
Runtime Function: int __mulvsi3 (int a, int b)
Runtime Function: long __mulvdi3 (long a, long b)
Runtime Function: long long __mulvti3 (long long a, long long b)

These functions return the product of a and b; that is a * b.

For the AMD64 target platform, GCC compiles __mulvti3() to the following horrible code:
[…]
0000000000000000 <__mulvti3>:
   0:	41 55                	push   %r13
   2:	49 89 cb             	mov    %rcx,%r11
   5:	48 89 d0             	mov    %rdx,%rax
   8:	49 89 d2             	mov    %rdx,%r10
   b:	41 54                	push   %r12
   d:	49 89 fc             	mov    %rdi,%r12
  10:	48 89 d1             	mov    %rdx,%rcx
  13:	49 89 f0             	mov    %rsi,%r8
  16:	4c 89 e2             	mov    %r12,%rdx
  19:	49 89 f5             	mov    %rsi,%r13
  1c:	53                   	push   %rbx
  1d:	48 89 fe             	mov    %rdi,%rsi
  20:	48 c1 fa 3f          	sar    $0x3f,%rdx
  24:	48 c1 f8 3f          	sar    $0x3f,%rax
  28:	4c 89 df             	mov    %r11,%rdi
  2b:	4c 39 c2             	cmp    %r8,%rdx
  2e:	75 18                	jne    48 <__mulvti3+0x48>
  30:	4c 39 d8             	cmp    %r11,%rax
  33:	75 6b                	jne    a0 <__mulvti3+0xa0>
  35:	4c 89 e0             	mov    %r12,%rax
  38:	49 f7 ea             	imul   %r10
  3b:	5b                   	pop    %rbx
  3c:	41 5c                	pop    %r12
  3e:	41 5d                	pop    %r13
  40:	c3                   	retq   
  41:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
  48:	4c 39 d8             	cmp    %r11,%rax
  4b:	0f 85 a7 00 00 00    	jne    f8 <__mulvti3+0xf8>
  51:	4c 89 e0             	mov    %r12,%rax
  54:	49 f7 e2             	mul    %r10
  57:	49 89 c1             	mov    %rax,%r9
  5a:	4c 89 c0             	mov    %r8,%rax
  5d:	48 89 d7             	mov    %rdx,%rdi
  60:	49 f7 e2             	mul    %r10
  63:	4d 85 c0             	test   %r8,%r8
  66:	79 09                	jns    71 <__mulvti3+0x71>
  68:	49 89 d0             	mov    %rdx,%r8
  6b:	49 29 c8             	sub    %rcx,%r8
  6e:	4c 89 c2             	mov    %r8,%rdx
  71:	48 85 c9             	test   %rcx,%rcx
  74:	79 06                	jns    7c <__mulvti3+0x7c>
  76:	4c 29 e0             	sub    %r12,%rax
  79:	4c 19 ea             	sbb    %r13,%rdx
  7c:	49 89 fa             	mov    %rdi,%r10
  7f:	45 31 db             	xor    %r11d,%r11d
  82:	49 01 c2             	add    %rax,%r10
  85:	4c 89 d0             	mov    %r10,%rax
  88:	49 11 d3             	adc    %rdx,%r11
  8b:	48 c1 f8 3f          	sar    $0x3f,%rax
  8f:	4c 39 d8             	cmp    %r11,%rax
  92:	0f 85 00 00 00 00    	jne    …
  98:	4c 89 c8             	mov    %r9,%rax
  9b:	4c 89 d2             	mov    %r10,%rdx
  9e:	eb 9b                	jmp    3b <__mulvti3+0x3b>
  a0:	4c 89 d0             	mov    %r10,%rax
  a3:	49 f7 e4             	mul    %r12
  a6:	49 89 c0             	mov    %rax,%r8
  a9:	4c 89 d8             	mov    %r11,%rax
  ac:	48 89 d3             	mov    %rdx,%rbx
  af:	49 f7 e4             	mul    %r12
  b2:	4d 85 db             	test   %r11,%r11
  b5:	79 09                	jns    c0 <__mulvti3+0xc0>
  b7:	48 89 d7             	mov    %rdx,%rdi
  ba:	4c 29 e7             	sub    %r12,%rdi
  bd:	48 89 fa             	mov    %rdi,%rdx
  c0:	48 85 f6             	test   %rsi,%rsi
  c3:	79 06                	jns    cb <__mulvti3+0xcb>
  c5:	4c 29 d0             	sub    %r10,%rax
  c8:	4c 19 da             	sbb    %r11,%rdx
  cb:	48 89 de             	mov    %rbx,%rsi
  ce:	31 ff                	xor    %edi,%edi
  d0:	48 01 c6             	add    %rax,%rsi
  d3:	48 89 f0             	mov    %rsi,%rax
  d6:	48 11 d7             	adc    %rdx,%rdi
  d9:	48 c1 f8 3f          	sar    $0x3f,%rax
  dd:	48 39 f8             	cmp    %rdi,%rax
  e0:	0f 85 00 00 00 00    	jne    …
  e6:	4c 89 c0             	mov    %r8,%rax
  e9:	48 89 f2             	mov    %rsi,%rdx
  ec:	e9 4a ff ff ff       	jmpq   3b <__mulvti3+0x3b>
  f1:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
  f8:	4d 85 c0             	test   %r8,%r8
  fb:	78 23                	js     120 <__mulvti3+0x120>
  fd:	4d 85 db             	test   %r11,%r11
 100:	78 4d                	js     14f <__mulvti3+0x14f>
 102:	4c 09 c7             	or     %r8,%rdi
 105:	0f 85 00 00 00 00    	jne    …
 10b:	4c 89 e0             	mov    %r12,%rax
 10e:	49 f7 e2             	mul    %r10
 111:	48 85 d2             	test   %rdx,%rdx
 114:	0f 89 21 ff ff ff    	jns    3b <__mulvti3+0x3b>
 11a:	e9 00 00 00 00       	jmpq   …
 11f:	90                   	nop
 120:	4d 85 db             	test   %r11,%r11
 123:	78 57                	js     17c <__mulvti3+0x17c>
 125:	0f 85 00 00 00 00    	jne    …
 12b:	49 83 f8 ff          	cmp    $0xffffffffffffffff,%r8
 12f:	0f 85 00 00 00 00    	jne    …
 135:	4c 89 e0             	mov    %r12,%rax
 138:	49 f7 e2             	mul    %r10
 13b:	49 89 d0             	mov    %rdx,%r8
 13e:	4d 29 d0             	sub    %r10,%r8
 141:	0f 89 00 00 00 00    	jns    …
 147:	4c 89 c2             	mov    %r8,%rdx
 14a:	e9 ec fe ff ff       	jmpq   3b <__mulvti3+0x3b>
 14f:	49 83 fb ff          	cmp    $0xffffffffffffffff,%r11
 153:	0f 85 00 00 00 00    	jne    …
 159:	4d 85 c0             	test   %r8,%r8
 15c:	0f 85 00 00 00 00    	jne    …
 162:	4c 89 d0             	mov    %r10,%rax
 165:	49 f7 e4             	mul    %r12
 168:	48 89 d7             	mov    %rdx,%rdi
 16b:	4c 29 e7             	sub    %r12,%rdi
 16e:	0f 89 00 00 00 00    	jns    …
 174:	48 89 fa             	mov    %rdi,%rdx
 177:	e9 bf fe ff ff       	jmpq   3b <__mulvti3+0x3b>
 17c:	4c 21 c7             	and    %r8,%rdi
 17f:	48 83 ff ff          	cmp    $0xffffffffffffffff,%rdi
 183:	0f 85 00 00 00 00    	jne    …
 189:	4c 89 d0             	mov    %r10,%rax
 18c:	4c 09 e0             	or     %r12,%rax
 18f:	0f 84 00 00 00 00    	je     …
 195:	4c 89 e0             	mov    %r12,%rax
 198:	49 f7 e2             	mul    %r10
 19b:	4c 29 e2             	sub    %r12,%rdx
 19e:	49 89 c0             	mov    %rax,%r8
 1a1:	48 89 d6             	mov    %rdx,%rsi
 1a4:	4c 29 d6             	sub    %r10,%rsi
 1a7:	0f 88 00 00 00 00    	js     …
 1ad:	e9 34 ff ff ff       	jmpq   e6 <__mulvti3+0xe6>
[…]
Proper, almost branch-free code uses only 44 instructions in just 118 bytes instead of 130 instructions in 434 bytes:
# Copyright © 2014-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.arch	generic64
	.code64
	.intel_syntax noprefix
	.global	__mulvti3
	.type	__mulvti3, @function
	.text
				# rsi:rdi = multiplicand
				# rcx:rdx = multiplier
__mulvti3:
	mov	r10, rdx	# r10 = low qword of multiplier
	mov	r11, rcx	# r11 = high qword of multiplier

	mov	rax, rcx	# rax = high qword of multiplier
	cqo			# rdx = (multiplier < 0) ? -1 : 0
	mov	r9, rdx		# r9 = (multiplier < 0) ? -1 : 0
	xor	r10, rdx
	xor	r11, rdx	# r11:r10 = (multiplier < 0) ? ~multiplier : multiplier
	sub	r10, rdx
	sbb	r11, rdx	# r11:r10 = (multiplier < 0) ? -multiplier : multiplier
				#         = |multiplier|
	mov	rax, rsi	# rax = high qword of multiplicand
	cqo			# rdx = (multiplicand < 0) ? -1 : 0
	xor	r9, rdx		# r9 = (multiplicand < 0) <> (multiplier < 0) ? -1 : 0
				#    = (product < 0) ? -1 : 0
	xor	rdi, rdx
	xor	rsi, rdx	# rsi:rdi = (multiplicand < 0) ? ~multiplicand : multiplicand
	sub	rdi, rdx
	sbb	rsi, rdx	# rsi:rdi = (multiplicand < 0) ? -multiplicand : multiplicand
				#         = |multiplicand|
	xor	ecx, ecx
	cmp	rcx, rsi
	sbb	edx, edx	# edx = (high qword of |multiplicand| = 0) ? 0 : -1
				#     = (|multiplicand| < 2**64) ? 0 : -1
	cmp	rcx, r11
	sbb	ecx, ecx	# ecx = (high qword of |multiplier| = 0) ? 0 : -1
				#     = (|multiplier| < 2**64) ? 0 : -1
	and	ecx, edx	# ecx = (|multiplicand| < 2**64)
				#     & (|multiplier| < 2**64) ? 0 : -1
				#     = (|product| < 2**128) ? 0 : -1
	mov	rax, rsi
	mul	r10		# rdx:rax = high qword of |multiplicand|
				#         * low qword of |multiplier|
	adc	ecx, ecx	# ecx = (|product| < 2**128) ? 0 : *

	mov	rsi, rax
	mov	rax, rdi
	mul	r11		# rdx:rax = low qword of |multiplicand|
				#         * high qword of |multiplier|
	adc	ecx, ecx	# ecx = (|product| < 2**128) ? 0 : *

	add	rsi, rax	# rsi = high qword of |multiplicand|
				#     * low qword of |multiplier|
				#     + low qword of |multiplicand|
				#     * high qword of |multiplier|
#	adc	ecx, ecx	# ecx = (|product| < 2**128) ? 0 : *

	mov	rax, rdi
	mul	r10		# rdx:rax = low qword of |multiplicand|
				#         * low qword of |multiplier|
	add	rdx, rsi	# rdx:rax = |product % 2**128|
	adc	ecx, ecx	# ecx = (|product| < 2**128) ? 0 : *
.if 0
	xor	rax, r9
	xor	rdx, r9		# rdx:rax = (product < 0)
				#         ? product % 2**128 - 1 : product % 2**128
	sub	rax, r9
	sbb	rdx, r9		# rdx:rax = product % 2**128

	xor	r9, rdx		# r9 = (product % 2**128 < -2**127)
				#    | (product % 2**128 >= 2**127)
				#    ? negative : positive
	add	r9, r9
.else
	add	rax, r9
	adc	rdx, r9		# rdx:rax = (product < 0)
				#         ? ~product % 2**128 : product % 2**128
	mov	rsi, rdx	# rsi = (product % 2**128 < -2**127)
				#     | (product % 2**128 >= 2**127)
				#     ? negative : positive
	xor	rax, r9
	xor	rdx, r9		# rdx:rax = product % 2**128

	add	rsi, rsi
.endif
	adc	ecx, ecx	# ecx = (-2**127 <= product < 2**127) ? 0 : *
	jnz	.overflow	# product < -2**127?
				# product >= 2**127?
	ret

.overflow:
	ud2

	.end
Note: exploration of the equally horrible code generated for the __mulvdi3() and __mulvsi3() functions is left as an exercise to the reader.

Case 14: __negv?i2() Functions

libgcc provides the __negvsi2(), __negvdi2() and __negvti2() functions for overflow-trapping signed integer negation:
Runtime Function: int __negvsi2 (int a)
Runtime Function: long __negvdi2 (long a)
Runtime Function: long long __negvti2 (long long a)

These functions return the negation of a; that is -a.

For the AMD64 target platform, GCC compiles __negvti2() to the following unoptimised and awful code:
[…]
0000000000000000 <__negvti2>:
   0:	48 89 f8             	mov    %rdi,%rax
   3:	48 83 ec 08          	sub    $0x8,%rsp
   7:	48 89 f2             	mov    %rsi,%rdx
   a:	48 f7 d8             	neg    %rax
   d:	48 83 d2 00          	adc    $0x0,%rdx
  11:	48 f7 da             	neg    %rdx
  14:	48 85 f6             	test   %rsi,%rsi
  17:	78 17                	js     30 <__negvti2+0x30>
  19:	31 c9                	xor    %ecx,%ecx
  1b:	48 39 c1             	cmp    %rax,%rcx
  1e:	48 19 d1             	sbb    %rdx,%rcx
  21:	7c 1b                	jl     3e <__negvti2+0x3e>
  23:	48 83 c4 08          	add    $0x8,%rsp
  27:	c3                   	retq   
  28:	0f 1f 84 00 00 00 00 	nopl   0x0(%rax,%rax,1)
  2f:	00 
  30:	48 85 d2             	test   %rdx,%rdx
  33:	0f 88 00 00 00 00    	js     …
  39:	48 83 c4 08          	add    $0x8,%rsp
  3d:	c3                   	retq   
  3e:	e9 00 00 00 00       	jmpq   …
[…]
Proper code uses only 7 instructions in just 16 bytes instead of 20 instructions in 67 bytes:
# Copyright © 2014-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.arch	generic64
	.code64
	.intel_syntax noprefix
	.global	__negvti2
	.type	__negvti2, @function
	.text
				# rsi:rdi = argument
__negvti2:
	mov	rax, rdi
	xor	edx, edx
	neg	rax
	sbb	rdx, rsi
	jo	.overflow
	ret
.overflow:
	ud2

	.end
Note: exploration of the equally bad code generated for the __negvdi2() and __negvsi2() functions is left as an exercise to the reader.

Case 15: __parity?i2() Functions

libgcc provides the __paritysi2(), __paritydi2() and __parityti2() functions:
Runtime Function: int __paritysi2 (unsigned int a)
Runtime Function: int __paritydi2 (unsigned long a)
Runtime Function: int __parityti2 (unsigned long long a)

These functions return the value zero if the number of bits set in a is even, and the value one otherwise.
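
The parity of a value equals the parity of the XOR of its halves, so the argument can be folded down to a single bit; a C sketch (illustrative name):

int parityti2_ref(__uint128_t value)
{
    unsigned long long x = (unsigned long long) value
                         ^ (unsigned long long) (value >> 64);
    x ^= x >> 32;
    x ^= x >> 16;
    x ^= x >> 8;    // the hand-written code below stops folding here and lets
    x ^= x >> 4;    // SETNP read the parity flag of the final 8-bit XOR
    x ^= x >> 2;
    x ^= x >> 1;
    return (int) (x & 1);
}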

For the AMD64 target platform, GCC compiles __parityti2() to the following unoptimised and awful code:
[…]
0000000000000000 <__parityti2>:
   0:	48 31 f7             	xor    %rsi,%rdi
   3:	b8 96 69 00 00       	mov    $0x6996,%eax
   8:	48 89 fe             	mov    %rdi,%rsi
   b:	48 89 f9             	mov    %rdi,%rcx
   e:	48 c1 ee 20          	shr    $0x20,%rsi
  12:	48 31 f1             	xor    %rsi,%rcx
  15:	48 89 ce             	mov    %rcx,%rsi
  18:	48 c1 ee 10          	shr    $0x10,%rsi
  1c:	48 31 ce             	xor    %rcx,%rsi
  1f:	48 89 f1             	mov    %rsi,%rcx
  22:	48 c1 e9 08          	shr    $0x8,%rcx
  26:	48 31 ce             	xor    %rcx,%rsi
  29:	48 89 f1             	mov    %rsi,%rcx
  2c:	48 c1 e9 04          	shr    $0x4,%rcx
  30:	48 31 f1             	xor    %rsi,%rcx
  33:	83 e1 0f             	and    $0xf,%ecx
  36:	d3 f8                	sar    %cl,%eax
  38:	83 e0 01             	and    $0x1,%eax
  3b:	c3                   	retq   
[…]
Proper code uses only 9 instructions in just 24 bytes instead of 19 instructions in 60 bytes:
# Copyright © 2014-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.arch	generic64
	.code64
	.intel_syntax noprefix
	.global	__parityti2
	.type	__parityti2, @function
	.text
				# rsi:rdi = argument
__parityti2:
	xor	rdi, rsi
	shld	rdx, rdi, 32
	xor	edx, edi
	shld	edx, edi, 16
	xor	edx, edi
	xor	eax, eax
	xor	dl, dh
	setnp	al
	ret

	.end
Note: exploration of the equally bad code generated for the __paritydi2() and __paritysi2() functions is left as an exercise to the reader.

Case 16: __builtin_parity() Builtin

Create the text file case16.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

int __paritysi2(unsigned int value) {
    return __builtin_parity(value);
}

int __paritydi2(unsigned long long value) {
    return __builtin_parityll(value);
}
Compile the source file case16.c with GCC, engaging the optimiser, targeting the i386 platform, and display the generated assembly code:
gcc -m32 -mabi=sysv -o- -O3 -S case16.c
[…]
__paritysi2:
	movl	4(%esp), %eax
	movl	%eax, %edx
	shrl	$16, %eax
	xorl	%eax, %edx
	xorl	%eax, %eax
	xorb	%dh, %dl
	setnp	%al
	ret
[…]
__paritydi2:
	pushl	%ebx
	movl	8(%esp), %eax
	movl	12(%esp), %edx
	movl	%edx, %ebx
	xorl	%eax, %ebx
	movl	%ebx, %edx
	shrl	$16, %ebx
	xorl	%ebx, %edx
	xorl	%eax, %eax
	xorb	%dh, %dl
	setnp	%al
	popl	%ebx
	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"
Oops: the code for the __paritydi2() function clobbers register EBX instead of the volatile register ECX, thus wasting a PUSH and a POP instruction which perform 2 superfluous memory accesses.

Case 17: __builtin_mul_overflow() Builtin

Create the text file case17.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifndef OPTIMIZE
__int128_t __mulvti3(__int128_t multiplicand, __int128_t multiplier) {
    __int128_t product;

    if (__builtin_mul_overflow(multiplicand, multiplier, &product))
        __builtin_trap();

    return product;
}

__uint128_t __umulvti3(__uint128_t multiplicand, __uint128_t multiplier) {
    __uint128_t product;

    if (__builtin_mul_overflow(multiplicand, multiplier, &product))
        __builtin_trap();

    return product;
}
#else
__int128_t __mulvti3(__int128_t multiplicand, __int128_t multiplier) {
    __uint128_t product, sign = 0 - (multiplicand < 0), tmp = 0 - (multiplier < 0);

    if (__builtin_mul_overflow((multiplicand ^ sign) - sign,
                               (multiplier ^ tmp) - tmp,
                               &product))
        __builtin_trap();

    sign ^= tmp;
    product = (product ^ sign) - sign;

    if ((__int128_t) (product ^ sign) < 0)
        __builtin_trap();

    return product;
}

__uint128_t __umulvti3(__uint128_t multiplicand, __uint128_t multiplier) {
    union {
        __uint128_t all;
        struct {
            unsigned long long low, high;
        };
    } product;
    unsigned long long tmp;

    product.all = (multiplicand & ~0ULL) * (multiplier & ~0ULL);

    if ((((multiplicand >> 64) != 0) && ((multiplier >> 64) != 0))
     || __builtin_umulll_overflow(multiplicand & ~0ULL, multiplier >> 64, &tmp)
     || __builtin_uaddll_overflow(tmp, product.high, &product.high)
     || __builtin_umulll_overflow(multiplicand >> 64, multiplier & ~0ULL, &tmp)
     || __builtin_uaddll_overflow(tmp, product.high, &product.high))
        __builtin_trap();

    return product.all;
}
#endif // OPTIMIZE
Compile the source file case17.c with GCC, engaging the optimiser, targeting the AMD64 platform, and display the generated assembly code:
gcc -m64 -mabi=sysv -o- -Os -S case17.c
[…]
__mulvti3:
	pushq	%rbp
	movq	%rsi, %r10
	movq	%rdx, %rsi
	movq	%rdi, %rdx
	movq	%rsi, %rax
	sarq	$63, %rdx
	movq	%rcx, %r11
	sarq	$63, %rax
	movq	%rsp, %rbp
	pushq	%r15
	pushq	%r14
	xorl	%r14d, %r14d
	pushq	%r13
	pushq	%r12
	pushq	%rbx
	cmpq	%r10, %rdx
	jne	.L4
	cmpq	%rcx, %rax
	jne	.L5
	movq	%rdi, %rax
	imulq	%rsi
	movq	%rax, %rbx
	movq	%rdx, %r8
	jmp	.L2
.L5:
	movq	%rsi, %rcx
	movq	%r11, %rbx
	movq	%r11, %r8
	movq	%rdi, %r15
	jmp	.L6
.L4:
	cmpq	%rcx, %rax
	jne	.L7
	movq	%rdi, %rcx
	movq	%r10, %rbx
	movq	%r10, %r8
	movq	%rsi, %r15
.L6:
	movq	%rdi, %rax
	mulq	%rsi
	movq	%rax, %r12
	movq	%r15, %rax
	movq	%rdx, %r13
	mulq	%r8
	testq	%r8, %r8
	jns	.L8
	xorl	%r8d, %r8d
	subq	%r8, %rax
	sbbq	%r15, %rdx
.L8:
	testq	%r15, %r15
	jns	.L9
	subq	%rcx, %rax
	sbbq	%rbx, %rdx
.L9:
	movq	%r13, %r8
	xorl	%r9d, %r9d
	movq	%r12, %rbx
	addq	%rax, %r8
	adcq	%rdx, %r9
	movq	%r8, %rdx
	sarq	$63, %rdx
	cmpq	%r9, %rdx
	je	.L2
	imulq	%rsi, %r10
	movq	%rdi, %rax
	imulq	%rdi, %r11
	mulq	%rsi
	leaq	(%r10,%r11), %r8
	addq	%rdx, %r8
	movq	%rax, %rbx
	jmp	.L3
.L7:
	movq	%r10, %r8
	movq	%rcx, %rdx
	movq	%rdi, %rax
	imulq	%rdi, %rdx
	imulq	%rsi, %r8
	addq	%rdx, %r8
	mulq	%rsi
	addq	%rdx, %r8
	leaq	1(%r10), %rdx
	movq	%rax, %rbx
	cmpq	$1, %rdx
	ja	.L3
	leaq	1(%rcx), %rdx
	cmpq	$1, %rdx
	ja	.L3
	cmpq	%rcx, %r10
	jne	.L11
	cmpq	%rbx, %r14
	movq	%r14, %rax
	sbbq	%r8, %rax
	jl	.L2
	jmp	.L3
.L11:
	testq	%r8, %r8
	js	.L2
.L3:
	movl	$1, %r14d
.L2:
	testq	%r14, %r14
	je	.L1
	ud2
.L1:
	movq	%rbx, %rax
	movq	%r8, %rdx
	popq	%rbx
	popq	%r12
	popq	%r13
	popq	%r14
	popq	%r15
	popq	%rbp
	ret
[…]
__umulvti3:
	pushq	%rbp
	movq	%rsi, %r10
	movq	%rcx, %r11
	movq	%rdx, %rsi
	movq	%rsp, %rbp
	pushq	%r12
	xorl	%r12d, %r12d
	pushq	%rbx
	testq	%r10, %r10
	jne	.L18
	testq	%rcx, %rcx
	jne	.L19
	movq	%rdi, %rax
	mulq	%rdx
	movq	%rax, %rbx
	movq	%rdx, %r8
	jmp	.L16
.L19:
	movq	%rcx, %r8
	movq	%rdi, %r9
	jmp	.L20
.L18:
	testq	%rcx, %rcx
	jne	.L22
	movq	%r10, %r8
	movq	%rdx, %r9
.L20:
	movq	%rdi, %rax
	mulq	%rsi
	movq	%rax, %rcx
	movq	%r9, %rax
	movq	%rdx, %rbx
	xorl	%r9d, %r9d
	mulq	%r8
	movq	%rbx, %r8
	movq	%rcx, %rbx
	addq	%rax, %r8
	adcq	%rdx, %r9
	testq	%r9, %r9
	je	.L16
.L22:
	imulq	%rsi, %r10
	movq	%rdi, %rax
	movl	$1, %r12d
	imulq	%rdi, %r11
	mulq	%rsi
	leaq	(%r10,%r11), %r8
	addq	%rdx, %r8
	movq	%rax, %rbx
.L16:
	testq	%r12, %r12
	je	.L15
	ud2
.L15:
	movq	%rbx, %rax
	movq	%r8, %rdx
	popq	%rbx
	popq	%r12
	popq	%rbp
	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"
OUCH: 104 instructions in 295 bytes for the __mulvti3() function – still horrible, but no longer as bad as the __mulvti3() function provided in libgcc!

Ouch: 54 instructions in 148 bytes for the __umulvti3() function – also rather terrible!

Compile the source file case17.c a second time with GCC, now with the preprocessor macro OPTIMIZE defined, and display the generated assembly code:

gcc -DOPTIMIZE -m64 -mabi=sysv -o- -Os -S case17.c
[…]
__mulvti3:
	pushq	%rbp
	movq	%rcx, %rax
	movq	%rdi, %r10
	movq	%rsi, %r11
	movq	%rax, %rdi
	sarq	$63, %rdi
	movq	%rsp, %rbp
	pushq	%r14
	movq	%rdi, %r14
	pushq	%r13
	pushq	%r12
	pushq	%rbx
	movq	%rsi, %rbx
	movslq	%edi, %rsi
	sarq	$63, %rbx
	movslq	%ebx, %rcx
	movq	%rbx, %r12
	movq	%rcx, %r8
	sarq	$63, %rcx
	xorq	%r8, %r10
	movq	%rcx, %r9
	movq	%rsi, %rcx
	sarq	$63, %rsi
	movq	%rsi, %rbx
	xorq	%r9, %r11
	movq	%r10, %rsi
	subq	%r8, %rsi
	movq	%r11, %rdi
	sbbq	%r9, %rdi
	xorq	%rcx, %rdx
	xorq	%rbx, %rax
	movq	%rdx, %r10
	movq	%rax, %r11
	subq	%rcx, %r10
	sbbq	%rbx, %r11
	xorl	%r13d, %r13d
	movq	%r11, %r8
	testq	%rdi, %rdi
	jne	.L4
	testq	%r11, %r11
	jne	.L5
	movq	%rsi, %rax
	mulq	%r10
	movq	%rax, %rcx
	movq	%rdx, %r8
	jmp	.L2
.L5:
	movq	%rsi, %r9
	jmp	.L6
.L4:
	testq	%r11, %r11
	jne	.L8
	movq	%rdi, %r8
	movq	%r10, %r9
.L6:
	movq	%rsi, %rax
	mulq	%r10
	movq	%rax, %rcx
	movq	%r9, %rax
	movq	%rdx, %rbx
	xorl	%r9d, %r9d
	mulq	%r8
	movq	%rbx, %r8
	addq	%rax, %r8
	adcq	%rdx, %r9
	testq	%r9, %r9
	je	.L2
.L8:
	movq	%rdi, %r8
	movq	%r11, %rax
	movl	$1, %r13d
	imulq	%rsi, %rax
	imulq	%r10, %r8
	addq	%rax, %r8
	movq	%rsi, %rax
	mulq	%r10
	addq	%rdx, %r8
	movq	%rax, %rcx
.L2:
	testq	%r13, %r13
	je	.L9
.L10:
	ud2
.L9:
	xorl	%r14d, %r12d
	movslq	%r12d, %r12
	movq	%r12, %rsi
	sarq	$63, %r12
	xorq	%rsi, %rcx
	xorq	%r12, %r8
	movq	%r12, %rbx
	movq	%rcx, %rax
	movq	%r8, %rdx
	subq	%rsi, %rax
	sbbq	%r12, %rdx
	xorq	%rdx, %rbx
	js	.L10
	popq	%rbx
	popq	%r12
	popq	%r13
	popq	%r14
	popq	%rbp
	ret
[…]
__umulvti3:
	movq	%rdi, %rax
	movq	%rdx, %r8
	mulq	%rdx
	movq	%rdx, %r11
	movq	%rax, %r9
	testq	%rsi, %rsi
	je	.L14
	testq	%rcx, %rcx
	je	.L14
.L17:
	ud2
.L14:
	movq	%rdi, %rax
	mulq	%rcx
	movq	%rax, %rdi
	jo	.L17
	addq	%r11, %rdi
	jc	.L17
	movq	%rsi, %rax
	mulq	%r8
	movq	%rax, %rsi
	jo	.L17
	addq	%rdi, %rsi
	movq	%rsi, %rdx
	jc	.L17
	movq	%r9, %rax
	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"
Oops: the optimised implementation coaxes GCC to generate 96 instructions in 273 bytes for the __mulvti3() function – a little shorter and faster than the code generated for __builtin_mul_overflow(), but more than twice the size of the proper assembly code shown in case 13 above, and less than half as fast!

OOPS: 25 instructions in 66 bytes for the __umulvti3() function – also not optimal, but much better than the rather terrible code generated for __builtin_mul_overflow().

Note: exploration of the equally bad code generated for (signed and unsigned) 64-bit integers on the i386 platform is left as an exercise to the reader.

Case 18: __builtin_sub_overflow() Builtin

Create the text file case18.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

__int128_t __absvti2(__int128_t value) {
    __int128_t sign = 0 - (value < 0);

    if (__builtin_sub_overflow(value ^ sign, sign, &value))
        __builtin_trap();

    return value;
}

__int128_t __negvti2(__int128_t negend) {
    __int128_t negation;

    if (__builtin_sub_overflow(0, negend, &negation))
        __builtin_trap();

    return negation;
}

__int128_t __subvti3(__int128_t minuend, __int128_t subtrahend) {
    __int128_t difference;

    if (__builtin_sub_overflow(minuend, subtrahend, &difference))
        __builtin_trap();

    return difference;
}
Compile the source file case18.c with GCC, engaging the optimiser, targeting the AMD64 platform, and display the generated assembly code:
gcc -m64 -mabi=sysv -o- -Os -S case18.c
[…]
__absvti2:
	movq	%rsi, %rax
	movq	%rdi, %r8
	movq	%rsi, %rcx
	sarq	$63, %rax
	cltq
	movq	%rax, %rsi
	sarq	$63, %rax
	movq	%rax, %rdi
	xorq	%rsi, %r8
	xorq	%rdi, %rcx
	movq	%r8, %rax
	movq	%rcx, %rdx
	subq	%rsi, %rax
	sbbq	%rdi, %rdx
	jno	.L1
	ud2
.L1:
	ret
[…]
__negvti2:
	movq	%rdi, %r9
	movq	%rsi, %r8
	movq	%rdi, %rsi
	xorl	%edx, %edx
	movq	%r8, %rdi
	movq	%r9, %r8
	movl	$1, %eax
	negq	%r8
	movq	%rdi, %r9
	adcq	$0, %r9
	salq	$63, %rax
	negq	%r9
	cmpq	%rax, %rdi
	jne	.L7
	xorl	%edx, %edx
	testq	%rsi, %rsi
	sete	%dl
.L7:
	testq	%rdx, %rdx
	je	.L6
	ud2
.L6:
	movq	%r8, %rax
	movq	%r9, %rdx
	ret
[…]
__subvti3:
	movq	%rsi, %r8
	movq	%rdi, %rsi
	movq	%rsi, %rax
	movq	%r8, %rdi
	movq	%rdx, %r8
	subq	%r8, %rax
	movq	%rdi, %rdx
	sbbq	%rcx, %rdx
	jno	.L11
	ud2
.L11:
	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"
OUCH: terrible code for the __absvti2() and __negvti2() functions!

OOPS: awful code with 4 superfluous MOV instructions in a total of 11 instructions for the __subvti3() function.

Case 19: __builtin_copysign() Builtin

Create the text file case19.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

double copysign64(double destination, double source) {
    return __builtin_copysign(destination, source);
}
Compile the source file case19.c with GCC, engaging the optimiser, targeting the i386 platform, and display the generated assembly code:
gcc -m32 -mabi=sysv -o- -O3 -S case19.c
[…]
copysign64:
	fldl	12(%esp)
	fxam
	fnstsw	%ax
	fstp	%st(0)
	fldl	4(%esp)
	fabs
	testb	$2, %ah
	je	L1
	fchs
L1:
	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"
Ouch: 10 instructions in 24 bytes, including a conditional branch which impairs performance!

Proper code uses but 5 instructions in just 19 bytes, without a conditional branch:

# Copyright © 2014-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.arch	i386
	.code32
	.intel_syntax noprefix
	.global	copysign64
	.type	copysign64, @function
	.text

copysign64:
	mov	eax, [esp+16]
	shld	dword ptr [esp+8], eax, 1
	ror	dword ptr [esp+8], 1
	fld	qword ptr [esp+4]
	ret

	.end
Compile the source file case19.c a second time with GCC, now targeting the AMD64 platform, and display the generated assembly code:
gcc -m64 -mabi=sysv -o- -O3 -S case19.c
[…]
copysign64:
	pushq	%rbp
	andpd	.LC1(%rip), %xmm0
	movapd	%xmm1, %xmm2
	andpd	.LC0(%rip), %xmm2
	movq	%rsp, %rbp
	orpd	%xmm2, %xmm0
	popq	%rbp
	ret
	.section .rdata,"dr"
	.align 16
.LC0:
	.long	0
	.long	-2147483648
	.long	0
	.long	0
	.align 16
.LC1:
	.long	-1
	.long	2147483647
	.long	0
	.long	0
[…]
	.ident	"GCC: (GNU […]) 10.2.0"
OUCH: 8 instructions in 30 bytes, plus 2 constants in 32 bytes, of which either 4 instructions or (at least) one of the 2 constants are superfluous!

Proper code uses only 6 instructions in either 28 or just 24 bytes, and performs no memory access:

# Copyright © 2014-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.arch	generic64
	.code64
	.intel_syntax noprefix
	.global	copysign64
	.type	copysign64, @function
	.text
				# xmm0 = destination
				# xmm1 = source
copysign64:
.ifndef OPTIMIZE
	mov	rax, 7FFFFFFFFFFFFFFFh
	movd	xmm2, rax
	andpd	xmm0, xmm2
	andnpd	xmm2, xmm1
	orpd	xmm0, xmm2
.else
	movd	rcx, xmm0
	movd	rdx, xmm1
	shld	rcx, rdx, 1
	ror	rcx, 1
	movd	xmm0, rcx
.endif
	ret

	.end
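
The same bit manipulation can also be written in C; the following minimal sketch (hypothetical name copysign64_bits(), not part of case19.c, assuming IEEE 754 binary64 representation) uses only integer operations on the bit patterns and therefore needs no floating-point constants in memory:
#include <stdint.h>
#include <string.h>

double copysign64_bits(double destination, double source) {
    uint64_t d, s;

    memcpy(&d, &destination, sizeof d);    // type-pun without undefined behaviour
    memcpy(&s, &source, sizeof s);

    d = (d & ~(UINT64_C(1) << 63))         // clear the sign bit of the destination
      | (s &  (UINT64_C(1) << 63));        // copy the sign bit of the source

    memcpy(&destination, &d, sizeof d);
    return destination;
}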

Case 20: -ftrapv Command Line Option

Create the text file case20.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

int si3(int v, int w) {
    volatile int p = v * w, q = v / w, r = v % w, s = v + w, t = v - w, u = -v;
    return v < 0 ? -v : v;
}

long long di3(long long v, long long w) {
    volatile long long p = v * w, q = v / w, r = v % w, s = v + w, t = v - w, u = -v;
    return v < 0 ? -v : v;
}

#ifdef __amd64__
__int128_t ti3(__int128_t v, __int128_t w) {
    volatile __int128_t p = v * w, q = v / w, r = v % w, s = v + w, t = v - w, u = -v;
    return v < 0 ? -v : v;
}
#endif
Compile the source file case20.c with GCC, enabling overflow-trapping, engaging the optimiser, targeting the AMD64 platform, and display the generated assembly code:
gcc -ftrapv -m64 -mabi=sysv -o- -Os -S case20.c
	.file	"case20.c"
	.text
	.globl	__mulvsi3
	.globl	__addvsi3
	.globl	__subvsi3
	.globl	__negvsi2
	.globl	si3
	.type	si3, @function
si3:
.LFB0:
	pushq	%rbp
	movl	%esi, %ebp
	pushq	%rbx
	movl	%edi, %ebx
	subq	$40, %rsp
	call	__mulvsi3
	movl	%ebp, %esi
	movl	%ebx, %edi
	movl	%eax, 8(%rsp)
	movl	%ebx, %eax
	cltd
	idivl	%ebp
	movl	%eax, 12(%rsp)
	movl	%edx, 16(%rsp)
	call	__addvsi3
	movl	%ebp, %esi
	movl	%ebx, %edi
	movl	%eax, 20(%rsp)
	call	__subvsi3
	movl	%ebx, %edi
	movl	%eax, 24(%rsp)
	call	__negvsi2
	movl	%ebx, %esi
	movl	%ebx, %edi
	sarl	$31, %esi
	movl	%eax, 28(%rsp)
	xorl	%esi, %edi
	call	__subvsi3
	addq	$40, %rsp
	popq	%rbx
	popq	%rbp
	ret
[…]
	.globl	__mulvdi3
	.globl	__addvdi3
	.globl	__subvdi3
	.globl	__negvdi2
	.globl	di3
	.type	di3, @function
di3:
.LFB1:
	pushq	%rbp
	movq	%rsi, %rbp
	pushq	%rbx
	movq	%rdi, %rbx
	subq	$56, %rsp
	call	__mulvdi3
	movq	%rbp, %rsi
	movq	%rbx, %rdi
	movq	%rax, (%rsp)
	movq	%rbx, %rax
	cqto
	idivq	%rbp
	movq	%rax, 8(%rsp)
	movq	%rdx, 16(%rsp)
	call	__addvdi3
	movq	%rbp, %rsi
	movq	%rbx, %rdi
	movq	%rax, 24(%rsp)
	call	__subvdi3
	movq	%rbx, %rdi
	movq	%rax, 32(%rsp)
	call	__negvdi2
	movq	%rbx, %rsi
	movq	%rbx, %rdi
	sarq	$63, %rsi
	movq	%rax, 40(%rsp)
	xorq	%rsi, %rdi
	call	__subvdi3
	addq	$56, %rsp
	popq	%rbx
	popq	%rbp
	ret
[…]
	.globl	__mulvti3
	.globl	__divti3
	.globl	__modti3
	.globl	__addvti3
	.globl	__subvti3
	.globl	__negvti2
	.globl	ti3
	.type	ti3, @function
ti3:
.LFB2:
	pushq	%r14
	movq	%rcx, %r14
	pushq	%r13
	movq	%rdx, %r13
	pushq	%rbp
	movq	%rdi, %rbp
	pushq	%rbx
	movq	%rsi, %rbx
	subq	$104, %rsp
	call	__mulvti3
	movq	%r14, %rcx
	movq	%rbp, %rdi
	movq	%rbx, %rsi
	movq	%rax, (%rsp)
	movq	%rdx, 8(%rsp)
	movq	%r13, %rdx
	call	__divti3
	movq	%r14, %rcx
	movq	%rbp, %rdi
	movq	%rbx, %rsi
	movq	%rax, 16(%rsp)
	movq	%rdx, 24(%rsp)
	movq	%r13, %rdx
	call	__modti3
	movq	%r14, %rcx
	movq	%rbp, %rdi
	movq	%rbx, %rsi
	movq	%rax, 32(%rsp)
	movq	%rdx, 40(%rsp)
	movq	%r13, %rdx
	call	__addvti3
	movq	%r14, %rcx
	movq	%rbp, %rdi
	movq	%rbx, %rsi
	movq	%rax, 48(%rsp)
	movq	%rdx, 56(%rsp)
	movq	%r13, %rdx
	call	__subvti3
	movq	%rbp, %rdi
	movq	%rbx, %rsi
	movq	%rax, 64(%rsp)
	movq	%rdx, 72(%rsp)
	call	__negvti2
	movq	%rbx, %rcx
	movq	%rbp, %rdi
	sarq	$63, %rcx
	movq	%rax, 80(%rsp)
	xorq	%rcx, %rbx
	movq	%rdx, 88(%rsp)
	xorq	%rcx, %rdi
	movq	%rcx, %rdx
	movq	%rbx, %rsi
	call	__subvti3
	addq	$104, %rsp
	popq	%rbx
	popq	%rbp
	popq	%r13
	popq	%r14
	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"
Oops: instead of generating a single call of the library function __divmodti4(), which returns both quotient and remainder, GCC generates separate calls of the library functions __divti3() and __modti3(), doubling the execution time.
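
Note: at the source level the second division can at least be avoided by hand; the following minimal C sketch (hypothetical helper divmod128(), not part of case20.c, assuming a non-zero divisor and no overflow of the quotient) recovers the remainder from the quotient with one multiplication and one subtraction instead of a second division (under -ftrapv the multiplication and subtraction are of course subject to overflow trapping themselves):
void divmod128(__int128_t dividend, __int128_t divisor,
               __int128_t *quotient, __int128_t *remainder) {
    __int128_t q = dividend / divisor;    // a single call of __divti3()
    *quotient  = q;
    *remainder = dividend - q * divisor;  // no call of __modti3() needed
}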

Ouch: instead of generating the trivial code for overflow-trapping addition, negation and subtraction inline, GCC generates calls of the library functions __addv?i3(), __mulv?i3(), __negv?i2() and __subv?i3(), increasing the execution time by about an order of magnitude!
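
For comparison, the overflow-checking builtins show what such trivial inline code looks like: the following minimal sketch (hypothetical name addv64(), not part of case20.c) is compiled to an addition, a conditional branch and a trapping instruction instead of a call of __addvdi3():
long long addv64(long long augend, long long addend) {
    long long sum;

    if (__builtin_add_overflow(augend, addend, &sum))
        __builtin_trap();

    return sum;
}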

Note: exploration of the equally bad code generated for the i386 platform is left as an exercise to the reader.

Case 21: Shell Game, or Crazy Braindead Register Allocator

Take a second look at (the beginning of) the __mulvti3() function and the code generated for it:
/* More subroutines needed by GCC output code on some machines.  */
/* Compile this one with gcc.  */
/* Copyright (C) 1989-2020 Free Software Foundation, Inc.
[…]
typedef union {
  DWtype ll;
  struct {
    Wtype low, high;
  } s;
} DWunion;
[…]
DWtype
__mulvDI3 (DWtype u, DWtype v)
{
  /* The unchecked multiplication needs 3 Wtype x Wtype multiplications,
     but the checked multiplication needs only two.  */
  const DWunion uu = {.ll = u};
  const DWunion vv = {.ll = v};

  if (__builtin_expect (uu.s.high == uu.s.low >> (W_TYPE_SIZE - 1), 1))
    {
      /* u fits in a single Wtype.  */
      if (__builtin_expect (vv.s.high == vv.s.low >> (W_TYPE_SIZE - 1), 1))
	{
	  /* v fits in a single Wtype as well.  */
	  /* A single multiplication.  No overflow risk.  */
	  return (DWtype) uu.s.low * (DWtype) vv.s.low;
	}
[…]
[…]
0000000000000000 <__mulvti3>:
   0:	41 55                	push   %r13
   2:	49 89 cb             	mov    %rcx,%r11
   5:	48 89 d0             	mov    %rdx,%rax
   8:	49 89 d2             	mov    %rdx,%r10
   b:	41 54                	push   %r12
   d:	49 89 fc             	mov    %rdi,%r12
  10:	48 89 d1             	mov    %rdx,%rcx
  13:	49 89 f0             	mov    %rsi,%r8
  16:	4c 89 e2             	mov    %r12,%rdx
  19:	49 89 f5             	mov    %rsi,%r13
  1c:	53                   	push   %rbx
  1d:	48 89 fe             	mov    %rdi,%rsi
  20:	48 c1 fa 3f          	sar    $0x3f,%rdx
  24:	48 c1 f8 3f          	sar    $0x3f,%rax
  28:	4c 89 df             	mov    %r11,%rdi
  2b:	4c 39 c2             	cmp    %r8,%rdx
  2e:	75 18                	jne    48 <__mulvti3+0x48>
  30:	4c 39 d8             	cmp    %r11,%rax
  33:	75 6b                	jne    a0 <__mulvti3+0xa0>
  35:	4c 89 e0             	mov    %r12,%rax
  38:	49 f7 ea             	imul   %r10
  3b:	5b                   	pop    %rbx
  3c:	41 5c                	pop    %r12
  3e:	41 5d                	pop    %r13
  40:	c3                   	retq   
[…]
	.ident	"GCC: (GNU […]) 10.2.0"
OUCH: 25 instructions in 65 bytes, including 8 superfluous MOV instructions shuffling registers, clobbering the non-volatile registers RBX, R12 and R13 without necessity, and performing 6 superfluous memory accesses!

Proper and straightforward code uses only 11 instructions in just 31 bytes, without clobbering non-volatile registers, and without memory accesses at all:

# Copyright © 2014-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.arch	generic64
	.code64
	.intel_syntax noprefix
	.global	__mulvti3
	.type	__mulvti3, @function
	.text
				# rsi:rdi = multiplicand
				# rcx:rdx = multiplier
__mulvti3:
	mov	r8, rdi
	mov	r9, rdx
	sar	r8, 63
	sar	r9, 63
	cmp	r8, rsi
	jne	__mulvti3+0x48
	cmp	r9, rcx
	jne	__mulvti3+0xa0
	mov	rax, rdi
	imul	rdx
	ret
[…]

Case 22: Undefined Behaviour or Optimiser Failure?

Create the text file case22.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

int __absusi2(int value) {
    if (value < 0)
        value = -value;

    if (value < 0)
        __builtin_trap();

    return value;
}

int __absvsi2(int value) {
    const int sign = value >> 31;
    value += sign;
    value ^= sign;

    if (value < 0)
        __builtin_trap();

    return value;
}

int __abswsi2(int value) {
    const int sign = 0 - (value < 0);
    value ^= sign;
    value -= sign;

    if (value < 0)
        __builtin_trap();

    return value;
}

long long __absudi2(long long value) {
    if (value < 0)
        value = -value;

    if (value < 0)
        __builtin_trap();

    return value;
}

long long __absvdi2(long long value) {
    const long long sign = value >> 63;
    value += sign;
    value ^= sign;

    if (value < 0)
        __builtin_trap();

    return value;
}

long long __abswdi2(long long value) {
    const long long sign = 0 - (value < 0);
    value ^= sign;
    value -= sign;

    if (value < 0)
        __builtin_trap();

    return value;
}

#ifdef __amd64__
__int128_t __absuti2(__int128_t value) {
    if (value < 0)
        value = -value;

    if (value < 0)
        __builtin_trap();

    return value;
}

__int128_t __absvti2(__int128_t value) {
    const __int128_t sign = value >> 127;
    value += sign;
    value ^= sign;

    if (value < 0)
        __builtin_trap();

    return value;
}

__int128_t __abswti2(__int128_t value) {
    const __int128_t sign = 0 - (value < 0);
    value ^= sign;
    value -= sign;

    if (value < 0)
        __builtin_trap();

    return value;
}
#endif // __amd64__
Compile the source file case22.c with GCC, engaging the optimiser, targeting the AMD64 platform, and display the generated assembly code:
gcc -m64 -mabi=sysv -o- -Os -S -Wall -Wextra case22.c
[…]
__absusi2:
	movl	%edi, %eax
	cltd
	xorl	%edx, %eax
	subl	%edx, %eax
	ret
[…]
__absvsi2:
	movl	%edi, %edx
	sarl	$31, %edx
	leal	(%rdi,%rdx), %eax
	xorl	%edx, %eax
	jns	.L2
	ud2
.L2:
	ret
[…]
__abswsi2:
	movl	%edi, %eax
	movl	%edi, %edx
	shrl	$31, %eax
	movl	%eax, %edi
	negl	%edi
	xorl	%edx, %edi
	addl	%edi, %eax
	jns	.L5
	ud2
.L5:
	ret
[…]
__absudi2:
	movq	%rdi, %rax
	cqto
	xorq	%rdx, %rax
	subq	%rdx, %rax
	ret
[…]
__absvdi2:
	movq	%rdi, %rdx
	sarq	$63, %rdx
	leaq	(%rdi,%rdx), %rax
	xorq	%rdx, %rax
	jns	.L8
	ud2
.L8:
	ret
[…]
__abswdi2:
	movq	%rdi, %rax
	cqto
	movslq	%edx, %rdx
	xorq	%rdx, %rax
	subq	%rdx, %rax
	jns	.L10
	ud2
.L10:
	ret
[…]
__absuti2:
	movq	%rsi, %rax
	movq	%rdi, %r8
	movq	%rsi, %rcx
	sarq	$63, %rax
	movq	%rax, %rsi
	xorq	%rax, %r8
	xorq	%rsi, %rcx
	movq	%r8, %rax
	movq	%rcx, %rdx
	subq	%rsi, %rax
	sbbq	%rsi, %rdx
	ret
[…]
__absvti2:
	movq	%rdi, %rax
	movq	%rsi, %rdi
	movq	%rsi, %rdx
	sarq	$63, %rdi
	addq	%rdi, %rax
	adcq	%rdi, %rdx
	xorq	%rdi, %rax
	xorq	%rdi, %rdx
	jns	.L13
	ud2
.L13:
	ret
[…]
__abswti2:
	movq	%rsi, %rax
	movq	%rdi, %r8
	movq	%rsi, %rcx
	sarq	$63, %rax
	cltq
	movq	%rax, %rsi
	sarq	$63, %rax
	movq	%rax, %rdi
	xorq	%rsi, %r8
	xorq	%rdi, %rcx
	movq	%r8, %rax
	movq	%rcx, %rdx
	subq	%rsi, %rax
	sbbq	%rdi, %rdx
	testq	%rdx, %rdx
	jns	.L15
	ud2
.L15:
	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"
Ouch: the optimiser (or rather pessimiser) removes the conditional expression that detects possible integer overflow of the negation, i.e. undefined behaviour, from the __absusi2(), __absudi2() and __absuti2() functions – without any warning, despite the command line options -Wall and -Wextra.
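
The check survives the optimiser if it is performed before the negation (or if the negation is carried out on the unsigned type), because no undefined behaviour is involved then; a minimal C sketch (hypothetical name absxsi2(), not part of case22.c):
int absxsi2(int value) {
    // only the most negative value has no positive counterpart
    if (value == -__INT_MAX__ - 1)
        __builtin_trap();

    return value < 0 ? -value : value;
}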

Oops: the optimiser fails to recognise the common and well-known alternative implementations used for the __absvsi2(), __absvdi2() and __absvti2() plus the __abswsi2(), __abswdi2() and __abswti2() functions, and generates unoptimised and awful code especially for the latter.

Note: exploration of the equally bad code generated for the i386 platform is left as an exercise to the reader.

Case 23: Optimiser Failures

Create the text file case23.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

int sign32(int value) {
    return 0 - (value < 0);
}

long long sign64(long long value) {
#ifdef ALTERNATE
    return 0LL - (value < 0);
#else
    return 0 - (value < 0);
#endif
}

#ifdef __amd64__
__int128_t sign128(__int128_t value) {
#ifdef ALTERNATE
    return (__int128_t) 0 - (value < 0);
#else
    return 0 - (value < 0);
#endif
}
#endif // __amd64__
Compile the source file case23.c with GCC, engaging the optimiser, targeting the i386 platform, and display the generated assembly code:
gcc -m32 -mabi=sysv -o- -O3 -S case23.c
Note: the left column shows the generated code, while the right column shows properly optimised shorter code as comment.
[…]
_sign32:
	movl	4(%esp), %eax
	sarl	$31, %eax
	ret
[…]
_sign64:
	movl	8(%esp), %eax	#	mov	eax, [esp+8]
	sarl	$31, %eax	#	cdq
	cltd			#	mov	eax, edx
	ret			#	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"
Compile the source file case23.c a second time with GCC, now targeting the AMD64 platform, and display the generated assembly code:
gcc -m64 -mabi=sysv -o- -O3 -S case23.c
Note: the left column shows the generated code, while the right column shows properly optimised shorter code as comment.
[…]
sign32:
	movl	%edi, %eax
	sarl	$31, %eax
	ret
[…]
sign64:
	sarq	$63, %rdi	#	mov	rax, rdi
	movslq	%edi, %rax	#	sar	rax, 63
	ret			#	ret
[…]
sign128:
	sarq	$63, %rsi	#	mov	rax, rsi
	movslq	%esi, %rax	#	cqo
	cqto			#	mov	rax, rdx
	ret			#	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"
Compile the source file case23.c a third time with GCC, again targeting the AMD64 platform, now with the preprocessor macro ALTERNATE defined, and display the generated assembly code:
gcc -DALTERNATE -m64 -mabi=sysv -o- -O3 -S case23.c
[…]
sign32:
	movl	%edi, %eax
	sarl	$31, %eax
	ret
[…]
sign64:
	sarq	$63, %rdi
	movslq	%edi, %rax
	ret
[…]
sign128:
	shrq	$63, %rsi
	xorl	%edx, %edx
	movq	%rsi, %rax
	negq	%rax
	adcq	$0, %rdx
	negq	%rdx
	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"
OUCH: that’s what I call a complete failure!
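
For comparison, the same result can be written as an arithmetic right shift; the shift of a negative signed value is implementation-defined in ISO C, but documented by GCC to sign-extend, so the following minimal sketch (hypothetical name sign128_shift(), not part of case23.c) yields 0 for non-negative and -1 for negative arguments and may give the optimiser a more direct path to the intended code:
__int128_t sign128_shift(__int128_t value) {
    return value >> 127;    // 0 for non-negative values, -1 for negative values
}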

Contact

If you miss anything here, have additions, comments, corrections, criticism or questions, want to give feedback, hints or tips, report broken links, bugs, deficiencies, errors, inaccuracies, misrepresentations, omissions, shortcomings, vulnerabilities or weaknesses, …: don’t hesitate to contact me and feel free to ask, comment, criticise, flame, notify or report!

Use the X.509 certificate to send S/MIME encrypted mail.

Note: email in weird format and without a proper sender name is likely to be discarded!

I dislike HTML (and even weirder formats too) in email, I prefer to receive plain text.
I also expect to see your full (real) name as sender, not your nickname.
I abhor top posts and expect inline quotes in replies.

Copyright © 1995–2020 • Stefan Kanthak • <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>