Me, myself & IT

Deficiencies in GCC’s Code Generator and Optimiser

Purpose
Reason
Case 0
Case 1: __udivmod?i4() Functions
Case 2: Missing __rotl?i3() and __rotr?i3() Functions
Case 3: __ashl?i3() Functions
Case 4: __ashr?i3() Functions
Case 5: __lshr?i3() Functions
Case 6: __cmp?i2() Functions
Case 7: __ucmp?i2() Functions
Case 8: __mul?i3() Functions
Case 9: __neg?i2() Functions
Case 10: __absv?i2() Functions
Case 11: __addv?i3() Functions
Case 12: __subv?i3() Functions
Case 13: __mulv?i3() Functions
Case 14: __negv?i2() Functions
Case 15: __parity?i2() Functions
Case 16: __builtin_parity() Builtin
Case 17: __builtin_mul_overflow() Builtin
Case 18: __builtin_sub_overflow() Builtin
Case 19: __builtin_copysign() Builtin
Case 20: -fsanitize=signed-integer-overflow Command Line Option
Case 21: -ftrapv Command Line Option
Case 22: Shell Game, or ~~Crazy~~ Braindead Register Allocator
Case 23: Undefined Behaviour in __bswapsi2() Function
Case 24: Undefined Behaviour or Optimiser Failure?
Case 25: Optimiser Failures
Case 26: More Optimiser Failures
Case 27: Yet More Optimiser Failures
Case 28: BDead Code Generation
Case 29: Superfluous Extraneous Code Generation
Case 30: Epic Optimiser Failures
Case 31: Cargo Cult, or Ignorance is Bliss, plus Epic Optimiser Failure
Case 32: More Epic Optimiser Failures
Case 33: Yet More Epic Optimiser Failures

Purpose

Demonstrate multiple bugs, defects, deficiencies and flaws of GCC and its libgcc library, especially the rather poor code generator.

Reason

The code generated by GCC (not only for its libgcc library) is rather poor, up to an order of magnitude worse than properly written (optimised) code.

Note: I recommend to read Randall Hyde’s ACM article, especially the following part:

Observation #6: Software engineers have been led to believe that their time is more valuable than CPU time; therefore, wasting CPU cycles in order to reduce development time is always a win. They've forgotten, however, that the application users' time is more valuable than their time.

Case 0

In his ACM queue article Optimizations in C++ Compilers – A practical journey, Matt Godbolt presents the following function as example (t):

bool isWhitespace(char c)
{
    return c == ' '
      || c == '\r'
      || c == '\n'
      || c == '\t';
}

He then shows the following code generated by GCC 9.1 for the AMD64 platform, which uses 7 instructions in 27 bytes:

isWhitespace(char):
  xor    eax, eax           ; result = false
  cmp    dil, 32            ; is c > 32
  ja     .L4                ; if so, exit with false
  movabs rax, 4294977024    ; rax = 0x100002600
  shrx   rax, rax, rdi      ; rax >>= c
  and    eax, 1             ; result = rax & 1
.L4:
  ret

This code is but not optimal – the conditional branch impairs performance, it should and can be avoided, as demonstrated by the following equivalent code, which uses 7 instructions in 26 bytes:

; Copyleft © 2020-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

isWhitespace:
	movabs	rax, 100002600h
	shrx	rax, rax, rdi
	cmp	dil, 33
	sbb	edi, edi
	neg	edi
	and	eax, edi
	ret

Since the SHRX instruction is only available on newer CPUs, also branch-free code that runs on all AMD64 processors, with 8 instructions in just 25 bytes:

; Copyleft © 2020-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

isWhitespace:
	mov	ecx, edi
	movabs	rax, 100002600h
	shr	rax, cl
	cmp	cl, 33
	sbb	ecx, ecx
	neg	ecx
	and	eax, ecx
	ret

Finally equivalent and branch-free code that works on 35 year old i386 processors, needing neither an AMD64 processor nor the SHRX instruction, with but 10 instructions in 26 bytes when the preprocessor macro OPTIMIZE is not defined, else just 25 bytes:

; Copyleft © 2020-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

isWhitespace:
	mov	ecx, [esp+4]
#ifndef OPTIMIZE
	mov	eax, 2600h	; eax = (1 << '\r') | (1 << '\n') | (1 << '\t')
	test	cl, cl
	setnz	al		; eax |= (c != '\0')
#else
	xor	eax, eax
	cmp	eax, ecx	; CF = (c != '\0')
	adc	eax, 2600h	; eax = (c != '\0') | (1 << '\r') | (1 << '\n') | (1 << '\t')
#endif
	shr	eax, cl		; eax >>= (c % ' ')
	xor	edx, edx
	cmp	ecx, 33		; CF = (c <= ' ')
	adc	edx, edx	; edx = (c <= ' ')
	and	eax, edx
	ret

Case 1: `__udivmod?i4()` Functions

libgcc provides the __udivmoddi4() and __udivmodti4() functions for unsigned integer division:

Runtime Function: unsigned long __udivmoddi4 (unsigned long a, unsigned long b, unsigned long *c)

Runtime Function: unsigned long long __udivmodti4 (unsigned long long a, unsigned long long b, unsigned long long *c)

These functions calculate both the quotient and remainder of the unsigned division of a and b. The return value is the quotient, and the remainder is placed in variable pointed to by c.

The last 5 statements of the following part of their source perform two multi-word left shifts on the variable pair d1, d0 holding the divisor and the variable triple n2, n1, n0 holding the dividend:

/* More subroutines needed by GCC output code on some machines.  */
/* Compile this one with gcc.  */
/* Copyright (C) 1989-2024 Free Software Foundation, Inc.
[…]
UDWtype
__udivmoddi4 (UDWtype n, UDWtype d, UDWtype *rp)
{
[…]
  UWtype d0, d1, n0, n1, n2;
  UWtype b, bm;
[…]
	  count_leading_zeros (bm, d1);
	  if (bm == 0)
[…]
	  else
	    {
	      UWtype m1, m0;
	      /* Normalize.  */

	      b = W_TYPE_SIZE - bm;

	      d1 = (d1 << bm) | (d0 >> b);
	      d0 = d0 << bm;
	      n2 = n1 >> b;
	      n1 = (n1 << bm) | (n0 >> b);
	      n0 = n0 << bm;
[…]

For the AMD64 target platform, GCC compiles this sequence of shifts to the following unoptimised and rather awful code:

[…]
0000000000000000 <__udivmodti4>:
[…]
  b0: 4c 0f bd da           bsr    %rdx,%r11
  b4: 49 83 f3 3f           xor    $0x3f,%r11
  b8: 45 85 db              test   %r11d,%r11d
  bb: 75 33                 jne    f0 <__udivmodti4+0xf0>
[…]
  f0: 49 63 c3              movslq %r11d,%rax
  f3: bd 40 00 00 00        mov    $0x40,%ebp
  f8: 44 89 d9              mov    %r11d,%ecx
  fb: 4d 89 ec              mov    %r13,%r12
  fe: 48 29 c5              sub    %rax,%rbp
 101: 48 d3 e2              shl    %cl,%rdx
 104: 49 89 f2              mov    %rsi,%r10
 107: 48 89 f8              mov    %rdi,%rax
 10a: 89 e9                 mov    %ebp,%ecx
 10c: 44 89 db              mov    %r11d,%ebx
 10f: 49 d3 ec              shr    %cl,%r12
 112: 44 89 d9              mov    %r11d,%ecx
 115: 49 d3 e5              shl    %cl,%r13
 118: 89 e9                 mov    %ebp,%ecx
 11a: 49 09 d4              or     %rdx,%r12
 11d: 49 d3 ea              shr    %cl,%r10
 120: 44 89 d9              mov    %r11d,%ecx
 123: 48 d3 e6              shl    %cl,%rsi
 126: 89 e9                 mov    %ebp,%ecx
 128: 4c 89 d2              mov    %r10,%rdx
 12b: 48 d3 e8              shr    %cl,%rax
 12e: 44 89 d9              mov    %r11d,%ecx
 131: 48 09 c6              or     %rax,%rsi
 134: 48 d3 e7              shl    %cl,%rdi
 137: 48 89 f0              mov    %rsi,%rax
 13a: 49 89 f9              mov    %rdi,%r9
[…]

Oops: the TEST instruction at address b8 is superfluous.

Ouch: the optimiser fails to recognise the common and well-known idiom of complementary shifts, for which it should but generate the following properly optimised and significantly shorter code:

[…]
                            mov    %r11d, %ecx
                            xor    n2, n2
                            shld   %cl, n1, n2
                            shld   %cl, n0, n1
                            shl    %cl, n0
                            shld   %cl, d0, d1
                            shl    %cl, d0
[…]

Note: the optimiser but recognises the following similar, also common and well-known idiom used for rotation:

unsigned long long __rotldi3(unsigned long long value, int count) {
    return (value << (63 & count))
         | (value >> (63 & -count));
}

The code generated by GCC for the __udivmodti4() function shows 141 instructions in 439 bytes, while properly written and faster code uses only 85 instructions in just 232 bytes:

# Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

# * The software is provided "as is" without any warranty, neither express
#   nor implied.
# * In no event will the author be held liable for any damage(s) arising
#   from the use of the software.
# * Redistribution of the software is allowed only in unmodified form.
# * Permission is granted to use the software solely for personal private
#   and non-commercial purposes.
# * An individuals use of the software in his or her capacity or function
#   as an agent, (independent) contractor, employee, member or officer of
#   a business, corporation or organization (commercial or non-commercial)
#   does not qualify as personal private and non-commercial purpose.
# * Without written approval from the author the software must not be used
#   for a business, for commercial, corporate, governmental, military or
#   organizational purposes of any kind, or in a commercial, corporate,
#   governmental, military or organizational environment of any kind.

# Unix System V calling convention for AMD64 platform:
# - first 6 integer or pointer arguments (from left to right) are passed
#   in registers RDI/R7, RSI/R6, RDX/R2, RCX/R1, R8 and R9
#   (R10 is used as static chain pointer in case of nested functions);
# - surplus arguments are pushed on stack in reverse order (from right to
#   left), 8-byte aligned;
# - 128-bit integer arguments are passed as pair of 64-bit integer arguments,
#   low part before/below high part;
# - 128-bit integer result is returned in registers RAX/R0 (low part) and
#   RDX/R2 (high part);
# - 64-bit integer or pointer result is returned in register RAX/R0;
# - 32-bit integer result is returned in register EAX;
# - registers RBX/R3, RSP/R4, RBP/R5, R12, R13, R14 and R15 must be
#   preserved;
# - registers RAX/R0, RCX/R1, RDX/R2, RSI/R6, RDI/R7, R8, R9, R10 (in
#   case of normal functions) and R11 are volatile and can be clobbered;
# - stack is 16-byte aligned: callee must decrement RSP by 8+n*16 bytes
#   before calling other functions (CALL instruction pushes 8 bytes);
# - a "red zone" of 128 bytes below the stack pointer can be clobbered.

# NOTE: raises "division exception" when divisor is 0!

.file	"udivmodti4.s"
.arch	generic64
.code64
.intel_syntax noprefix
.text
				# rsi:rdi = dividend
				# rcx:rdx = divisor
__umodti3:
	sub	rsp, 24
	mov	r8, rsp		# r8 = oword ptr remainder
	call	__udivmodti4
	pop	rax
	pop	rdx		# rdx:rax = remainder
	pop	rcx
	ret
				# rsi:rdi = dividend
				# rcx:rdx = divisor
__udivti3:
	xor	r8, r8
				# rsi:rdi = dividend
				# rcx:rdx = divisor
				# r8 = oword ptr remainder
__udivmodti4:
	cmp	rdi, rdx
	mov	rax, rsi
	sbb	rax, rcx
	jb	.trivial	# dividend < divisor?

	mov	r10, rdx
	mov	r11, rcx	# r11:r10 = divisor

	bsr	rcx, rcx	# rcx = index of most significant '1' bit
				#        in high qword of divisor
	jnz	.extended	# high qword of divisor <> 0?

	# remainder < divisor < 2**64

	cmp	rsi, rdx
	jae	.long		# high qword of dividend >= (low qword of) divisor?

	# dividend < divisor * 2**64: quotient < 2**64
	# perform normal division
.normal:
	mov	rdx, rsi
	mov	rax, rdi	# rdx:rax = dividend
	div	r10		# rax = (low qword of) quotient,
				# rdx = (low qword of) remainder
	test	r8, r8
	jz	0f		# address of remainder = 0?

	mov	[r8], rdx
	mov	[r8+8], r11	# high qword of remainder = 0
0:
	mov	rdx, r11	# rdx:rax = quotient
	ret

	# dividend >= divisor * 2**64: quotient >= 2**64
	# perform "long" alias "schoolbook" division
.long:
	mov	rdx, r11	# rdx = 0
	mov	rax, rsi	# rdx:rax = high qword of dividend
	div	r10		# rax = high qword of quotient,
				# rdx = high qword of remainder'
	mov	rcx, rax	# rcx = high qword of quotient
	mov	rax, rdi	# rax = low qword of dividend
	div	r10		# rax = low qword of quotient,
				# rdx = (low qword of) remainder
	test	r8, r8
	jz	1f		# address of remainder = 0?

	mov	[r8], rdx
	mov	[r8+8], r11	# high qword of remainder = 0
1:
	mov	rdx, rcx	# rdx:rax = quotient
	ret

	# dividend < divisor: quotient = 0, remainder = dividend
.trivial:
	test	r8, r8
	jz	2f		# address of remainder = 0?

	mov	[r8], rdi
	mov	[r8+8], rsi	# remainder = dividend
2:
	xor	eax, eax
	xor	edx, edx	# rdx:rax = quotient = 0
	ret

	# dividend >= divisor >= 2**64: quotient < 2**64
.extended:
	xor	ecx, 63		# ecx = number of leading '0' bits
				#        in (high qword of) divisor
	jz	.special	# divisor >= 2**127?

	# perform "extended & adjusted" division

	shld	r11, r10, cl
	shl	r10, cl		# r11:r10 = divisor * 2**(index + 1)
				#         = divisor'
	xor	edx, edx
	shld	rdx, rsi, cl
	shld	rsi, rdi, cl
	shl	rdi, cl		# rdx:rsi:rdi = dividend * 2**(index + 1)
				#             = dividend'
	mov	rax, rsi
	div	r11		# rax = quotient',
				# rdx = high qword of remainder'
	mov	rsi, rdx	# rsi:rdi = remainder'
	mov	r9, rax		# r9 = quotient'
	mul	r10		# rdx:rax = quotient' * low qword of divisor'
				#         = remainder"
.ifnotdef JCCLESS
	cmp	rdx, rsi
	ja	4f
	jb	3f
	cmp	rax, rdi
	ja	4f		# remainder" > remainder'?
.else
	push	rbx
	mov	rbx, rdx
	cmp	rax, rdi
	sbb	rbx, rsi
	pop	rbx
	ja	4f		# quotient' > quotient?
.endif
3:
	test	r8, r8
	jnz	5f		# address of remainder <> 0?

	xor	edx, edx
	mov	rax, r9		# rdx:rax = quotient
	ret
4:
	dec	r9		# r9 = quotient' - 1
				#    = (low qword of) quotient
	test	r8, r8
	jz	6f		# address of remainder = 0?

	sub	rax, r10
	sbb	rdx, r11	# rdx:rax = remainder" - divisor'
5:
	sub	rdi, rax
	sbb	rsi, rdx	# rsi:rdi = remainder * 2**(index + 1)
	shrd	rdi, rsi, cl
	shr	rsi, cl		# rsi:rdi = remainder

	mov	[r8], rdi
	mov	[r8+8], rsi
6:
	xor	edx, edx
	mov	rax, r9		# rdx:rax = quotient
	ret

	# dividend >= divisor >= 2**127:
	# quotient = 1, remainder = dividend - divisor
.special:
	test	r8, r8
	jz	7f		# address of remainder = 0?

	sub	rdi, r10
	sbb	rsi, r11	# rsi:rdi = dividend - divisor
				#         = remainder
	mov	[r8], rdi
	mov	[r8+8], rsi
7:
	xor	eax, eax
	xor	edx, edx
	inc	eax		# rdx:rax = quotient = 1
	ret

.size	__udivmodti4, .-__udivmodti4
.type	__udivmodti4, @function
.global	__udivmodti4
.size	__udivti3, .-__udivti3
.type	__udivti3, @function
.global	__udivti3
.size	__umodti3, .-__umodti3
.type	__umodti3, @function
.global	__umodti3
.end

Note: exploration of the same unoptimised and awful code in the __divmodti4(), __divti3(), __modti3(), __udivti3() and __umodti3() as well as the __divmoddi4(), __divdi3(), __moddi3(), __udivmoddi4(), __udivdi3() and __umoddi3() functions is left as an exercise to the reader.

[…]
00000000 <__udivmoddi4>:
[…]
  90:	0f bd ef             	bsr    %edi,%ebp
  93:	83 f5 1f             	xor    $0x1f,%ebp
  96:	75 58                	jne    f0 <__udivmoddi4+0xf0>
[…]
  f0:	b8 20 00 00 00       	mov    $0x20,%eax
  f5:	89 e9                	mov    %ebp,%ecx
  f7:	89 c2                	mov    %eax,%edx
  f9:	d3 e7                	shl    %cl,%edi
  fb:	89 f0                	mov    %esi,%eax
  fd:	29 ea                	sub    %ebp,%edx
  ff:	89 d1                	mov    %edx,%ecx
 101:	89 54 24 1c          	mov    %edx,0x1c(%esp)
 105:	d3 e8                	shr    %cl,%eax
 107:	89 e9                	mov    %ebp,%ecx
 109:	09 f8                	or     %edi,%eax
 10b:	89 f7                	mov    %esi,%edi
 10d:	d3 e7                	shl    %cl,%edi
 10f:	89 44 24 0c          	mov    %eax,0xc(%esp)
 113:	89 d1                	mov    %edx,%ecx
 115:	89 d8                	mov    %ebx,%eax
 117:	d3 e8                	shr    %cl,%eax
 119:	89 e9                	mov    %ebp,%ecx
 11b:	89 7c 24 14          	mov    %edi,0x14(%esp)
 11f:	89 c6                	mov    %eax,%esi
 121:	8b 44 24 10          	mov    0x10(%esp),%eax
 125:	d3 e3                	shl    %cl,%ebx
 127:	89 d1                	mov    %edx,%ecx
 129:	89 f2                	mov    %esi,%edx
 12b:	d3 e8                	shr    %cl,%eax
 12d:	89 e9                	mov    %ebp,%ecx
 12f:	89 c7                	mov    %eax,%edi
 131:	8b 44 24 10          	mov    0x10(%esp),%eax
 135:	09 df                	or     %ebx,%edi
 137:	d3 e0                	shl    %cl,%eax
 139:	89 44 24 18          	mov    %eax,0x18(%esp)
 13d:	89 f8                	mov    %edi,%eax
[…]

# Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

# Common "cdecl" calling and naming convention for i386 platform:
# - arguments are pushed on stack in reverse order (from right to left),
#   4-byte aligned;
# - 64-bit integer arguments are passed as pair of 32-bit integer arguments,
#   low part below high part;
# - 64-bit integer result is returned in registers EAX (low part) and
#   EDX (high part);
# - 32-bit integer or pointer result is returned in register EAX;
# - registers EAX, ECX and EDX are volatile and can be clobbered;
# - registers EBX, ESP, EBP, ESI and EDI must be preserved;
# - function names are prefixed with an underscore.

# NOTE: raises "division exception" when divisor is 0!

.file	"udivmoddi4.s"
.arch	generic32
.code32
.intel_syntax noprefix
.text
				# [esp+20] = (optional) qword ptr remainder
				# [esp+16] = high dword of divisor
				# [esp+12] = low dword of divisor
				# [esp+8] = high dword of dividend
				# [esp+4] = low dword of dividend
__udivmoddi4:
	mov	ecx, [esp+8]	# ecx = high dword of dividend
	mov	eax, [esp+12]
	mov	edx, [esp+16]	# edx:eax = divisor
	cmp	[esp+4], eax
	sbb	ecx, edx
	jb	.trivial	# dividend < divisor?

	bsr	ecx, edx	# ecx = index of most significant '1' bit
				#        in high dword of divisor
	jnz	.extended	# high dword of divisor <> 0?

	# remainder < divisor < 2**32

	mov	ecx, eax	# ecx = (low dword of) divisor
	mov	eax, [esp+8]	# eax = high dword of dividend
	cmp	eax, ecx
	jae	.long		# high dword of dividend >= divisor?

	# perform normal division
.normal:
	mov	edx, eax	# edx = high dword of dividend
	mov	eax, [esp+4]	# edx:eax = dividend
	div	ecx		# eax = (low dword of) quotient,
				# edx = (low dword of) remainder

	mov	ecx, [esp+20]	# ecx = address of remainder
	test	ecx, ecx
	jz	0f		# address of remainder = 0?

	mov	[ecx], edx	# [ecx] = (low dword of) remainder
	xor	edx, edx
	mov	[ecx+4], edx
0:
	xor	edx, edx	# edx:eax = quotient
	ret

	# perform "long" alias "schoolbook" division
.long:
#	xor	edx, edx	# edx:eax = high dword of dividend
	div	ecx		# eax = high dword of quotient,
				# edx = high dword of remainder'
	push	eax		# [esp] = high dword of quotient
	mov	eax, [esp+8]	# eax = low dword of dividend
	div	ecx		# eax = low dword of quotient,
				# edx = (low dword of) remainder

	mov	ecx, [esp+24]	# ecx = address of remainder
	test	ecx, ecx
	jz	1f		# address of remainder = 0?

	mov	[ecx], edx	# [ecx] = (low dword of) remainder
	xor	edx, edx
	mov	[ecx+4], edx
1:
	pop	edx		# edx:eax = quotient
	ret

	# dividend < divisor: quotient = 0, remainder = dividend
.trivial:
	mov	ecx, [esp+20]	# ecx = address of remainder
	test	ecx, ecx
	jz	2f		# address of remainder = 0?

	mov	eax, [esp+4]
	mov	edx, [esp+8]	# edx:eax = dividend
	mov	[ecx], eax
	mov	[ecx+4], edx	# [ecx] = remainder = dividend
2:
	xor	eax, eax
	xor	edx, edx	# edx:eax = quotient = 0
	ret

	# dividend >= divisor >= 2**32: quotient < 2**32
.extended:
	xor	ecx, 31		# ecx = number of leading '0' bits
				#        in (high dword of) divisor
	jz	.special	# divisor >= 2**63?

	# perform "extended & adjusted" division

	shld	edx, eax, cl	# edx = divisor / 2**(index + 1)
				#     = divisor'
#	shl	eax, cl
	push	ebx
	mov	ebx, edx	# ebx = divisor'
.ifnotdef JCCLESS
	xor	eax, eax	# eax = high dword of quotient' = 0
	mov	edx, [esp+12]	# edx = high dword of dividend
	cmp	edx, ebx
	jb	3f		# high dword of dividend < divisor'?

	# high dword of dividend >= divisor':
	# subtract divisor' from high dword of dividend to prevent possible
	# quotient overflow and set most significant bit of quotient"

	sub	edx, ebx	# edx = high dword of dividend - divisor'
				#     = high dword of dividend'
	inc	eax		# eax = high dword of quotient' = 1
3:
	push	eax		# [esp] = high dword of quotient'
.else # JCCLESS
	mov	edx, [esp+12]	# edx = high dword of dividend
	cmp	edx, ebx	# CF = (high dword of dividend < divisor')
	sbb	eax, eax	# eax = (high dword of dividend < divisor') ? -1 : 0
	inc	eax		# eax = (high dword of dividend < divisor') ? 0 : 1
				#     = high dword of quotient'
	push	eax		# [esp] = high dword of quotient'
.if 0
	neg	eax		# eax = (high dword of dividend < divisor') ? 0 : -1
	and	eax, ebx	# eax = (high dword of dividend < divisor') ? 0 : divisor'
.else
	imul	eax, ebx	# eax = (high dword of dividend < divisor') ? 0 : divisor'
.endif
	sub	edx, eax	# edx = high dword of dividend
				#     - (high dword of dividend < divisor') ? 0 : divisor'
				#     = high dword of dividend'
.endif # JCCLESS
	mov	eax, [esp+12]	# eax = low dword of dividend
				#     = low dword of dividend'
	div	ebx		# eax = dividend' / divisor'
				#     = low dword of quotient',
				# edx = remainder'
	pop	ebx		# ebx = high dword of quotient'
	shld	ebx, eax, cl	# ebx = quotient' / 2**(index + 1)
				#     = dividend / divisor'
				#     = quotient"
#	shl	eax, cl
	push	ebx		# [esp] = quotient"
	mov	eax, [esp+20]	# eax = low dword of divisor
	mul	ebx		# edx:eax = low dword of divisor * quotient"
	imul	ebx, [esp+24]	# ebx = high dword of divisor * quotient"
	mov	ecx, [esp+16]	# ecx = high dword of dividend
	sub	ecx, ebx	# ecx = high dword of dividend
				#     - high dword of divisor * quotient"
	mov	ebx, [esp+12]	# ebx = low dword of dividend
	sub	ebx, eax
	sbb	ecx, edx	# ecx:ebx = dividend - divisor * quotient"
				#         = remainder"
.ifnotdef JCCLESS
	pop	eax		# eax = quotient"
	jnb	4f		# remainder" >= 0?
				#  (with borrow, it is off by divisor,
				#   and quotient" is off by 1)
	add	ebx, [esp+16]
	adc	ecx, [esp+20]	# ecx:ebx = remainder" + divisor
				#         = remainder
	dec	eax		# eax = quotient" - 1
				#     = (low dword of) quotient
4:
.else # JCCLESS
	sbb	eax, eax	# eax = (remainder" < 0) ? -1 : 0
	cdq			# edx = (remainder" < 0) ? -1 : 0
	add	[esp], eax	# [esp] = quotient" - (remainder" < 0)
				#       = (low dword of) quotient
	and	eax, [esp+20]
	and	edx, [esp+24]	# edx:eax = (remainder" < 0) ? divisor : 0
	add	ebx, eax
	adc	ecx, edx	# ecx:ebx = remainder" + divisor
				#         = remainder
	pop	eax		# eax = (low dword of) quotient
.endif # JCCLESS
	mov	edx, [esp+24]	# edx = address of remainder
	test	edx, edx
	jz	5f		# address of remainder = 0?

	mov	[edx], ebx
	mov	[edx+4], ecx	# [edx] = remainder
5:
	xor	edx, edx	# edx:eax = quotient
	pop	ebx
	ret

	# dividend >= divisor >= 2**63:
	# quotient = 1, remainder = dividend - divisor
.special:
	or	ecx, [esp+20]	# ecx = address of remainder
	jz	6f		# address of remainder = 0?
.if 0
	neg	edx
	neg	eax
	sbb	edx, 0		# edx:eax = -divisor
	add	eax, [esp+4]
	adc	edx, [esp+8]	# edx:eax = dividend - divisor
				#         = remainder
.else
	mov	eax, [esp+4]
	mov	edx, [esp+8]	# edx:eax = dividend
	sub	eax, [esp+12]
	sbb	edx, [esp+16]	# edx:eax = dividend - divisor
				#         = remainder
.endif
	mov	[ecx], eax
	mov	[ecx+4], edx	# [ecx] = remainder
6:
	xor	eax, eax
	xor	edx, edx
	inc	eax		# edx:eax = quotient = 1
	ret

.size	__udivmoddi4, .-__udivmoddi4
.type	__udivmoddi4, @function
.global	__udivmoddi4
.end

# Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

# NOTE: raises "division exception" when divisor is 0!

.file	"udivdi3.s"
.arch	generic32
.code32
.intel_syntax noprefix
.text
				# [esp+16] = high dword of divisor
				# [esp+12] = low dword of divisor
				# [esp+8] = high dword of dividend
				# [esp+4] = low dword of dividend
__udivdi3:
	mov	ecx, [esp+8]	# ecx = high dword of dividend
	mov	eax, [esp+12]
	mov	edx, [esp+16]	# edx:eax = divisor
	cmp	[esp+4], eax
	sbb	ecx, edx
	jb	.trivial	# dividend < divisor?

	bsr	ecx, edx	# ecx = index of most significant '1' bit
				#        in high dword of divisor
	jnz	.extended	# high dword of divisor <> 0?

	# remainder < divisor < 2**32

	mov	ecx, eax	# ecx = (low dword of) divisor
	mov	eax, [esp+8]	# eax = high dword of dividend
	cmp	eax, ecx
	jae	.long		# high dword of dividend >= divisor?

	# perform normal division
.normal:
	mov	edx, eax	# edx = high dword of dividend
	mov	eax, [esp+4]	# edx:eax = dividend
	div	ecx		# eax = (low dword of) quotient,
				# edx = (low dword of) remainder
	xor	edx, edx	# edx:eax = quotient
	ret

	# perform "long" alias "schoolbook" division
.long:
#	xor	edx, edx	# edx:eax = high dword of dividend
	div	ecx		# eax = high dword of quotient,
				# edx = high dword of remainder'
	push	eax		# [esp] = high dword of quotient
	mov	eax, [esp+8]	# eax = low dword of dividend
	div	ecx		# eax = low dword of quotient,
				# edx = (low dword of) remainder
	pop	edx		# edx:eax = quotient
	ret

	# dividend < divisor: quotient = 0
.trivial:
	xor	eax, eax
	xor	edx, edx	# edx:eax = quotient = 0
	ret

	# dividend >= divisor >= 2**32: quotient < 2**32
.extended:
	xor	ecx, 31		# ecx = number of leading '0' bits
				#        in (high dword of) divisor
	jz	.special	# divisor >= 2**63?

	# perform "extended & adjusted" division

	shld	edx, eax, cl	# edx = divisor / 2**(index + 1)
				#     = divisor'
#	shl	eax, cl
	push	ebx
	mov	ebx, edx	# ebx = divisor'
.ifnotdef JCCLESS
	xor	eax, eax	# eax = high dword of quotient' = 0
	mov	edx, [esp+12]	# edx = high dword of dividend
	cmp	edx, ebx
	jb	0f		# high dword of dividend < divisor'?

	# high dword of dividend >= divisor':
	# subtract divisor' from high dword of dividend to prevent possible
	# quotient overflow and set most significant bit of quotient"

	sub	edx, ebx	# edx = high dword of dividend - divisor'
				#     = high dword of dividend'
	inc	eax		# eax = high dword of quotient' = 1
0:
	push	eax		# [esp] = high dword of quotient'
.else # JCCLESS
	mov	edx, [esp+12]	# edx = high dword of dividend
	cmp	edx, ebx	# CF = (high dword of dividend < divisor')
	sbb	eax, eax	# eax = (high dword of dividend < divisor') ? -1 : 0
	inc	eax		# eax = (high dword of dividend < divisor') ? 0 : 1
				#     = high dword of quotient'
	push	eax		# [esp] = high dword of quotient'
.if 0
	neg	eax		# eax = (high dword of dividend < divisor') ? 0 : -1
	and	eax, ebx	# eax = (high dword of dividend < divisor') ? 0 : divisor'
.else
	imul	eax, ebx	# eax = (high dword of dividend < divisor') ? 0 : divisor'
.endif
	sub	edx, eax	# edx = high dword of dividend
				#     - (high dword of dividend < divisor') ? 0 : divisor'
				#     = high dword of dividend'
.endif # JCCLESS
	mov	eax, [esp+12]	# eax = low dword of dividend
				#     = low dword of dividend'
	div	ebx		# eax = dividend' / divisor'
				#     = low dword of quotient',
				# edx = remainder'
	pop	ebx		# ebx = high dword of quotient'
	shld	ebx, eax, cl	# ebx = quotient' / 2**(index + 1)
				#     = dividend / divisor'
				#     = quotient"
#	shl	eax, cl
	mov	eax, [esp+16]	# eax = low dword of divisor
	mul	ebx		# edx:eax = low dword of divisor * quotient"
.ifnotdef JCCLESS
	mov	ecx, [esp+20]	# ecx = high dword of divisor
	imul	ecx, ebx	# ecx = high dword of divisor * quotient"
	add	edx, ecx	# edx:eax = divisor * quotient"
	jc	1f		# divisor * quotient" >= 2**64?

	mov	ecx, [esp+12]	# ecx = high dword of dividend
	cmp	[esp+8], eax
	sbb	ecx, edx	# CF = (dividend < divisor * quotient")
				#    = (remainder" < 0)
1:
	sbb	eax, eax	# eax = (remainder" < 0) ? -1 : 0
	add	eax, ebx	# eax = quotient" - (remainder" < 0)
				#     = (low dword of) quotient
	xor	edx, edx	# edx:eax = quotient
.else # JCCLESS
	mov	ecx, [esp+12]	# ecx = high dword of dividend
	cmp	[esp+8], eax
	sbb	ecx, edx	# ecx:... = dividend
				#         - low dword of divisor * quotient"
	mov	eax, [esp+20]	# eax = high dword of divisor
	imul	eax, ebx	# eax = high dword of divisor * quotient"
.if 0
	sub	ecx, eax	# ecx:... = dividend - divisor * quotient"
				#         = remainder"
	sbb	eax, eax	# eax = (remainder" < 0) ? -1 : 0
	add	eax, ebx	# eax = quotient" - (remainder" < 0)
				#     = (low dword of) quotient
	xor	edx, edx	# edx:eax = quotient
.else
	xor	edx, edx	# edx = high dword of quotient = 0
	sub	ecx, eax	# ecx:... = dividend - divisor * quotient"
				#         = remainder"
	mov	eax, ebx	# eax = quotient"
	sbb	eax, edx	# eax = quotient" - (remainder" < 0)
				#     = (low dword of) quotient
.endif
.endif # JCCLESS
	pop	ebx
	ret

	# dividend >= divisor >= 2**63: quotient = 1
.special:
	xor	eax, eax
	xor	edx, edx
	inc	eax		# edx:eax = quotient = 1
	ret

.size	__udivdi3, .-__udivdi3
.type	__udivdi3, @function
.global	__udivdi3
.end

# Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

# NOTE: raises "division exception" when divisor is 0!

.file	"umoddi3.s"
.arch	generic32
.code32
.intel_syntax noprefix
.text
				# [esp+16] = high dword of divisor
				# [esp+12] = low dword of divisor
				# [esp+8] = high dword of dividend
				# [esp+4] = low dword of dividend
__umoddi3:
	mov	ecx, [esp+8]	# ecx = high dword of dividend
	mov	eax, [esp+12]
	mov	edx, [esp+16]	# edx:eax = divisor
	cmp	[esp+4], eax
	sbb	ecx, edx
	jb	.trivial	# dividend < divisor?

	bsr	ecx, edx	# ecx = index of most significant '1' bit
				#        in high dword of divisor
	jnz	.extended	# high dword of divisor <> 0?

	# remainder < divisor < 2**32

	mov	ecx, eax	# ecx = (low dword of) divisor
	mov	eax, [esp+8]	# eax = high dword of dividend
	cmp	eax, ecx
	jae	.long		# high dword of dividend >= divisor?

	# perform normal division
.normal:
	mov	edx, eax	# edx = high dword of dividend
	mov	eax, [esp+4]	# edx:eax = dividend
	div	ecx		# eax = (low dword of) quotient,
				# edx = (low dword of) remainder
	mov	eax, edx	# eax = (low dword of) remainder
	xor	edx, edx	# edx:eax = remainder
	ret

	# perform "long" alias "schoolbook" division
.long:
#	xor	edx, edx	# edx:eax = high dword of dividend
	div	ecx		# eax = high dword of quotient,
				# edx = high dword of remainder'
	mov	eax, [esp+4]	# eax = low dword of dividend
	div	ecx		# eax = low dword of quotient,
				# edx = (low dword of) remainder
	mov	eax, edx	# eax = (low dword of) remainder
	xor	edx, edx	# edx:eax = remainder
	ret

	# dividend < divisor: remainder = dividend
.trivial:
	mov	eax, [esp+4]
	mov	edx, [esp+8]	# edx:eax = remainder = dividend
	ret

	# dividend >= divisor >= 2**32: quotient < 2**32
.extended:
	xor	ecx, 31		# ecx = number of leading '0' bits
				#        in (high dword of) divisor
	jz	.special	# divisor >= 2**63?

	# perform "extended & adjusted" division

	shld	edx, eax, cl	# edx = divisor / 2**(index + 1)
				#     = divisor'
#	shl	eax, cl
	push	ebx
	mov	ebx, edx	# ebx = divisor'
.ifnotdef JCCLESS
	xor	eax, eax	# eax = high dword of quotient' = 0
	mov	edx, [esp+12]	# edx = high dword of dividend
	cmp	edx, ebx
	jb	0f		# high dword of dividend < divisor'?

	# high dword of dividend >= divisor':
	# subtract divisor' from high dword of dividend to prevent possible
	# quotient overflow and set most significant bit of quotient"

	sub	edx, ebx	# edx = high dword of dividend - divisor'
				#     = high dword of dividend'
	inc	eax		# eax = high dword of quotient' = 1
0:
	push	eax		# [esp] = high dword of quotient'
.else # JCCLESS
	mov	edx, [esp+12]	# edx = high dword of dividend
	sub	edx, ebx	# CF = (high dword of dividend < divisor')
	sbb	eax, eax	# eax = (high dword of dividend < divisor') ? -1 : 0
	inc	eax		# eax = (high dword of dividend < divisor') ? 0 : 1
				#     = high dword of quotient'
	push	eax		# [esp] = high dword of quotient'
.if 0
	neg	eax		# eax = (high dword of dividend < divisor') ? 0 : -1
	and	eax, ebx	# eax = (high dword of dividend < divisor') ? 0 : divisor'
.else
	imul	eax, ebx	# eax = (high dword of dividend < divisor') ? 0 : divisor'
.endif
	sub	edx, eax	# edx = high dword of dividend
				#     - (high dword of dividend < divisor') ? 0 : divisor'
				#     = high dword of dividend'
.endif # JCCLESS
	mov	eax, [esp+12]	# eax = low dword of dividend
				#     = low dword of dividend'
	div	ebx		# eax = dividend' / divisor'
				#     = low dword of quotient',
				# edx = remainder'
	pop	ebx		# ebx = high dword of quotient'
	shld	ebx, eax, cl	# ebx = quotient' / 2**(index + 1)
				#     = dividend / divisor'
				#     = quotient"
#	shl	eax, cl
	mov	eax, [esp+16]	# eax = low dword of divisor
	mul	ebx		# edx:eax = low dword of divisor * quotient"
	imul	ebx, [esp+20]	# ebx = high dword of divisor * quotient"
	mov	ecx, [esp+12]	# ecx = high dword of dividend
	sub	ecx, ebx	# ecx = high dword of dividend
				#     - high dword of divisor * quotient"
	mov	ebx, [esp+8]	# ebx = low dword of dividend
	sub	ebx, eax
	sbb	ecx, edx	# ecx:ebx = dividend - divisor * quotient"
				#         = remainder"
.ifnotdef JCCLESS
	jnb	1f		# remainder" >= 0?
				#  (with borrow, it is off by divisor,
				#   and quotient" is off by 1)
	add	ebx, [esp+16]
	adc	ecx, [esp+20]	# ecx:ebx = remainder" + divisor
				#         = remainder
1:
	mov	eax, ebx
	mov	edx, ecx	# edx:eax = remainder
.else # JCCLESS
	sbb	eax, eax	# eax = (remainder" < 0) ? -1 : 0
	cdq			# edx = (remainder" < 0) ? -1 : 0
	and	eax, [esp+16]
	and	edx, [esp+20]	# edx:eax = (remainder" < 0) ? divisor : 0
	add	eax, ebx
	adc	edx, ecx	# edx:eax = remainder" + divisor
				#         = remainder
.endif # JCCLESS
	pop	ebx
	ret

	# dividend >= divisor >= 2**63: remainder = dividend - divisor
.special:
.if 0
	mov	eax, [esp+4]
	mov	edx, [esp+8]	# edx:eax = dividend
	sub	eax, [esp+12]
	sbb	edx, [esp+16]	# edx:eax = dividend - divisor
				#         = remainder
.else
	neg	edx
	neg	eax
	sbb	edx, ecx	# edx:eax = -divisor
	add	eax, [esp+4]
	adc	edx, [esp+8]	# edx:eax = dividend - divisor
				#         = remainder
.endif
	ret

.size	__umoddi3, .-__umoddi3
.type	__umoddi3, @function
.global	__umoddi3
.end

# Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

# NOTE: raises "division exception" when divisor is 0!

.file	"divdi3.s"
.arch	generic32
.code32
.intel_syntax noprefix
.text
				# [esp+16] = high dword of divisor
				# [esp+12] = low dword of divisor
				# [esp+8] = high dword of dividend
				# [esp+4] = low dword of dividend
__divdi3:
	# determine sign of dividend and compute |dividend|

	mov	eax, [esp+8]
	mov	ecx, [esp+4]	# eax:ecx = dividend

	cdq			# edx = (dividend < 0) ? -1 : 0
	xor	ecx, edx
	xor	eax, edx	# eax:ecx = (dividend < 0) ? ~dividend : dividend
	sub	ecx, edx
	sbb	eax, edx	# eax:ecx = (dividend < 0) ? -dividend : dividend
				#         = |dividend|

	mov	[esp+4], ecx	# write |dividend| back on stack
	mov	[esp+8], eax

	push	edx		# [esp] = (dividend < 0) ? -1 : 0

	# determine sign of divisor and compute |divisor|

	mov	edx, [esp+20]
	mov	eax, [esp+16]	# edx:eax = divisor

	mov	ecx, edx
	sar	ecx, 31		# ecx = (divisor < 0) ? -1 : 0
	xor	eax, ecx
	xor	edx, ecx	# edx:eax = (divisor < 0) ? ~divisor : divisor
	sub	eax, ecx
	sbb	edx, ecx	# edx:eax = (divisor < 0) ? -divisor : divisor
				#         = |divisor|

	mov	[esp+16], eax	# write |divisor| back on stack
	mov	[esp+20], edx

	xor	[esp], ecx	# [esp] = (dividend < 0) ^ (divisor < 0) ? -1 : 0
				#       = (quotient < 0) ? -1 : 0

	mov	ecx, [esp+12]	# ecx = high dword of dividend
	cmp	[esp+8], eax
	sbb	ecx, edx
	jb	.trivial	# dividend < divisor?

	bsr	ecx, edx	# ecx = index of most significant '1' bit
				#        in high dword of divisor
	jnz	.extended	# high dword of divisor <> 0?

	# remainder < divisor < 2**32

	mov	ecx, eax	# ecx = (low dword of) divisor
	mov	eax, [esp+12]	# eax = high dword of dividend
	cmp	eax, ecx
	jae	.long		# high dword of dividend >= divisor?

	# perform normal division
.normal:
	mov	edx, eax	# edx = high dword of dividend
	mov	eax, [esp+8]	# edx:eax = dividend
	div	ecx		# eax = (low dword of) quotient,
				# edx = (low dword of) remainder
#	xor	edx, edx	# edx:eax = |quotient|

	jmp	.quotient

	# perform "long" alias "schoolbook" division
.long:
#	xor	edx, edx	# edx:eax = high dword of dividend
	div	ecx		# eax = high dword of quotient,
				# edx = high dword of remainder'
	push	eax		# [esp] = high dword of quotient
	mov	eax, [esp+12]	# eax = low dword of dividend
	div	ecx		# eax = low dword of quotient,
				# edx = (low dword of) remainder
	pop	edx		# edx:eax = |quotient|

	pop	ecx		# ecx = (quotient < 0) ? -1 : 0
	xor	eax, ecx
	xor	edx, ecx
	sub	eax, ecx
	sbb	edx, ecx	# edx:eax = quotient
	ret

	# dividend < divisor: quotient = 0
.trivial:
	pop	ecx		# ecx = (quotient < 0) ? -1 : 0
	xor	eax, eax
	xor	edx, edx	# edx:eax = quotient = 0
	ret

	# 2**63 >= dividend >= divisor >= 2**32: quotient < 2**32
.extended:
	xor	ecx, 31		# ecx = number of leading '0' bits
				#        in (high dword of) divisor
	jz	.special	# divisor = 2**63?

	# perform "extended & adjusted" division

	shld	edx, eax, cl	# edx = divisor / 2**(index + 1)
				#     = divisor'
#	shl	eax, cl
	push	ebx
	mov	ebx, edx	# ebx = divisor'
.ifnotdef JCCLESS
	xor	eax, eax	# eax = high dword of quotient' = 0
	mov	edx, [esp+16]	# edx = high dword of dividend
	cmp	edx, ebx
	jb	0f		# high dword of dividend < divisor'?

	# high dword of dividend >= divisor':
	# subtract divisor' from high dword of dividend to prevent possible
	# quotient overflow and set most significant bit of quotient"

	sub	edx, ebx	# edx = high dword of dividend - divisor'
				#     = high dword of dividend'
	inc	eax		# eax = high dword of quotient' = 1
0:
	push	eax		# [esp] = high dword of quotient'
.else # JCCLESS
	mov	edx, [esp+16]	# edx = high dword of dividend
	cmp	edx, ebx	# CF = (high dword of dividend < divisor')
	sbb	eax, eax	# eax = (high dword of dividend < divisor') ? -1 : 0
	inc	eax		# eax = (high dword of dividend < divisor') ? 0 : 1
				#     = high dword of quotient'
	push	eax		# [esp] = high dword of quotient'
.if 0
	neg	eax		# eax = (high dword of dividend < divisor') ? 0 : -1
	and	eax, ebx	# eax = (high dword of dividend < divisor') ? 0 : divisor'
.else
	imul	eax, ebx	# eax = (high dword of dividend < divisor') ? 0 : divisor'
.endif
	sub	edx, eax	# edx = high dword of dividend
				#     - (high dword of dividend < divisor') ? 0 : divisor'
				#     = high dword of dividend'
.endif # JCCLESS
	mov	eax, [esp+16]	# eax = low dword of dividend
				#     = low dword of dividend'
	div	ebx		# eax = dividend' / divisor'
				#     = low dword of quotient',
				# edx = remainder'
	pop	ebx		# ebx = high dword of quotient'
	shld	ebx, eax, cl	# ebx = quotient' / 2**(index + 1)
				#     = dividend / divisor'
				#     = quotient"
#	shl	eax, cl
	mov	eax, [esp+20]	# eax = low dword of divisor
	mul	ebx		# edx:eax = low dword of divisor * quotient"
	mov	ecx, [esp+24]	# ecx = high dword of divisor
	imul	ecx, ebx	# ecx = high dword of divisor * quotient"
	add	edx, ecx	# edx:eax = divisor * quotient"
#	jc	1f		# divisor * quotient" >= 2**64?

	mov	ecx, [esp+16]	# ecx = high dword of dividend
	cmp	[esp+12], eax
	sbb	ecx, edx	# CF = (dividend < divisor * quotient")
				#    = (remainder" < 0)
1:
	sbb	eax, eax	# eax = (remainder" < 0) ? -1 : 0
	add	eax, ebx	# eax = quotient" - (remainder" < 0)
				#     = (low dword of) |quotient|
#	xor	edx, edx	# edx:eax = |quotient|
	pop	ebx
.quotient:
	pop	edx		# edx = (quotient < 0) ? -1 : 0
	xor	eax, edx
	sub	eax, edx
	sbb	edx, edx	# edx:eax = quotient
	ret

	# dividend = divisor = -2**63: quotient = 1
.special:
	pop	eax		# eax = sign of quotient = 0
	inc	eax		# eax = (low dword of) quotient = 1
	xor	edx, edx	# edx:eax = quotient = 1
	ret

.size	__divdi3, .-__divdi3
.type	__divdi3, @function
.global	__divdi3
.end

# Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

# NOTE: raises "division exception" when divisor is 0!

.file	"moddi3.s"
.arch	generic32
.code32
.intel_syntax noprefix
.text
				# [esp+16] = high dword of divisor
				# [esp+12] = low dword of divisor
				# [esp+8] = high dword of dividend
				# [esp+4] = low dword of dividend
__moddi3:
	# determine sign of dividend and compute |dividend|

	mov	eax, [esp+8]
	mov	ecx, [esp+4]	# eax:ecx = dividend

	cdq			# edx = (dividend < 0) ? -1 : 0
	xor	ecx, edx
	xor	eax, edx	# ecx:eax = (dividend < 0) ? ~dividend : dividend
	sub	ecx, edx
	sbb	eax, edx	# ecx:eax = (dividend < 0) ? -dividend : dividend
				#         = |dividend|

	mov	[esp+4], ecx	# write |dividend| back on stack
	mov	[esp+8], eax

	push	edx		# [esp] = (dividend < 0) ? -1 : 0

	# determine sign of divisor and compute |divisor|

	mov	edx, [esp+20]
	mov	eax, [esp+16]	# edx:eax = divisor

	mov	ecx, edx
	sar	ecx, 31		# ecx = (divisor < 0) ? -1 : 0
	xor	eax, ecx
	xor	edx, ecx	# edx:eax = (divisor < 0) ? ~divisor : divisor
	sub	eax, ecx
	sbb	edx, ecx	# edx:eax = (divisor < 0) ? -divisor : divisor
				#         = |divisor|

	mov	[esp+16], eax	# write |divisor| back on stack
	mov	[esp+20], edx

	mov	ecx, [esp+12]	# ecx = high dword of dividend
	cmp	[esp+8], eax
	sbb	ecx, edx
	jb	.trivial	# dividend < divisor?

	bsr	ecx, edx	# ecx = index of most significant '1' bit
				#        in high dword of divisor
	jnz	.extended	# high dword of divisor <> 0?

	# remainder < divisor < 2**32

	mov	ecx, eax	# ecx = (low dword of) divisor
	mov	eax, [esp+12]	# eax = high dword of dividend
	cmp	eax, ecx
	jae	.long		# high dword of dividend >= divisor?

	# perform normal division
.normal:
	mov	edx, eax	# edx = high dword of dividend
	jmp	.next

	# perform "long" alias "schoolbook" division
.long:
#	xor	edx, edx	# edx:eax = high dword of dividend
	div	ecx		# eax = high dword of quotient,
				# edx = high dword of remainder'
.next:
	mov	eax, [esp+8]	# eax = low dword of dividend
	div	ecx		# eax = low dword of quotient,
				# edx = (low dword of) remainder
	mov	eax, edx	# eax = (low dword of) |remainder|
#	xor	edx, edx	# edx:eax = |remainder|

	pop	edx		# edx = (remainder < 0) ? -1 : 0
	xor	eax, edx
	sub	eax, edx
	sbb	edx, edx	# edx:eax = remainder
	ret

	# dividend < divisor: remainder = dividend
.trivial:
	mov	eax, [esp+8]
	mov	edx, [esp+12]	# edx:eax = |remainder| = |dividend|

	jmp	.remainder

	# 2**63 >= dividend >= divisor >= 2**32: quotient < 2**32
.extended:
	xor	ecx, 31		# ecx = number of leading '0' bits
				#        in (high dword of) divisor
	jz	.special	# divisor = 2**63?

	# perform "extended & adjusted" division

	shld	edx, eax, cl	# edx = divisor / 2**(index + 1)
				#     = divisor'
#	shl	eax, cl
	push	ebx
	mov	ebx, edx	# ebx = divisor'
.ifnotdef JCCLESS
	xor	eax, eax	# eax = high dword of quotient' = 0
	mov	edx, [esp+16]	# edx = high dword of dividend
	cmp	edx, ebx
	jb	0f		# high dword of dividend < divisor'?

	# high dword of dividend >= divisor':
	# subtract divisor' from high dword of dividend to prevent possible
	# quotient overflow and set most significant bit of quotient"

	sub	edx, ebx	# edx = high dword of dividend - divisor'
				#     = high dword of dividend'
	inc	eax		# eax = high dword of quotient' = 1
0:
	push	eax		# [esp] = high dword of quotient'
.else # JCCLESS
	mov	edx, [esp+16]	# edx = high dword of dividend
	sub	edx, ebx	# edx = high dword of dividend - divisor'
	sbb	eax, eax	# eax = (high dword of dividend < divisor') ? -1 : 0
	inc	eax		# eax = (high dword of dividend < divisor') ? 0 : 1
				#     = high dword of quotient'
	push	eax		# [esp] = high dword of quotient'
.if 0
	neg	eax		# eax = (high dword of dividend < divisor') ? 0 : -1
	and	eax, ebx	# eax = (high dword of dividend < divisor') ? 0 : divisor'
.else
	imul	eax, ebx	# eax = (high dword of dividend < divisor') ? 0 : divisor'
.endif
	sub	edx, eax	# edx = high dword of dividend
				#     - (high dword of dividend < divisor') ? 0 : divisor'
				#     = high dword of dividend'
.endif # JCCLESS
	mov	eax, [esp+16]	# eax = low dword of dividend
				#     = low dword of dividend'
	div	ebx		# eax = dividend' / divisor'
				#     = low dword of quotient',
				# edx = remainder'
	pop	ebx		# ebx = high dword of quotient'
	shld	ebx, eax, cl	# ebx = quotient' / 2**(index + 1)
				#     = dividend / divisor'
				#     = quotient"
#	shl	eax, cl
	mov	eax, [esp+20]	# eax = low dword of divisor
	mul	ebx		# edx:eax = low dword of divisor * quotient"
	imul	ebx, [esp+24]	# ebx = high dword of divisor * quotient"
	add	edx, ebx	# edx:eax = divisor * quotient"
	mov	ecx, [esp+12]
	mov	ebx, [esp+16]	# ebx:ecx = dividend
	sub	ecx, eax
	sbb	ebx, edx	# ebx:ecx = dividend - divisor * quotient"
				#         = remainder"
.ifnotdef JCCLESS
	jnb	1f		# remainder" >= 0?
				#  (with borrow, it is off by divisor
				#   and quotient" is off by 1)
	add	ecx, [esp+20]
	adc	ebx, [esp+24]	# ebx:ecx = remainder" + divisor
				#         = remainder
1:
	mov	eax, ecx
	mov	edx, ebx	# edx:eax = |remainder|
.else # JCCLESS
	sbb	eax, eax	# eax = (remainder" < 0) ? -1 : 0
	cdq			# edx = (remainder" < 0) ? -1 : 0
	and	eax, [esp+20]
	and	edx, [esp+24]	# edx:eax = (remainder" < 0) ? divisor : 0
	add	eax, ecx
	adc	edx, ebx	# edx:eax = remainder" + divisor
				#         = |remainder|
.endif # JCCLESS
	pop	ebx
.remainder:
	pop	ecx		# ecx = (remainder < 0) ? -1 : 0
	xor	eax, ecx
	xor	edx, ecx
	sub	eax, ecx
	sbb	edx, ecx	# edx:eax = remainder
	ret

	# dividend = divisor = -2**63: remainder = 0
.special:
	pop	eax		# eax = sign of remainder = -1
	inc	eax
	xor	edx, edx	# edx:eax = remainder = 0
	ret

.size	__moddi3, .-__moddi3
.type	__moddi3, @function
.global	__moddi3
.end

Case 2: the missing `rotl?i3()` and `rotr?i3()` Functions

libgcc fails to provide the __rotlsi3(), __rotldi3() and __rotlti3() as well as the __rotrsi3(), __rotrdi3() and __rotrti3() functions for (unsigned) integer rotation alias circular shift.

Create the text file case2.c with the following content in an arbitrary, preferable empty directory:

// Copyleft © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

unsigned int __rotlsi3(unsigned int value, int count) {
    return (value << (31 & count))
         | (value >> (31 & -count));
}

unsigned int __rotrsi3(unsigned int value, int count) {
    return (value >> (31 & count))
         | (value << (31 & -count));
}

unsigned long long __rotldi3(unsigned long long value, int count) {
    return (value << (63 & count))
         | (value >> (63 & -count));
}

unsigned long long __rotrdi3(unsigned long long value, int count) {
    return (value >> (63 & count))
         | (value << (63 & -count));
}

#ifdef __amd64__
__uint128_t __rotlti3(__uint128_t value, int count) {
    return (value << (127 & count))
         | (value >> (127 & -count));
}

__uint128_t __rotrti3(__uint128_t value, int count) {
#ifndef OPTIMIZE
    return (value >> (127 & count))
         | (value << (127 & -count));
#else
    __asm__("movq\t%[low], %%rax\n\t"
            "shrdq\t%%cl, %[high], %%rax\n\t"
            "shrdq\t%%cl, %[low], %[high]\n\t"
            "movq\t%[high], %%rdx\n\t"
            "test\t$64, %%cl\n\t"
            "cmovnz\t%%rax, %%rdx\n\t"
            "cmovnz\t%[high], %%rax"
           :"=A" (value)
           :"c" (count),
            [low] "r" ((unsigned long long) (value & ~0ULL)),
            [high] "r" ((unsigned long long) (value >> 64))
           :"cc");
    return value;
#endif // OPTIMIZE
}
#endif // __amd64__

Compile the source file case2.c with GCC, engaging its optimiser, targetting the AMD64 platform, and display the generated assembly code:

gcc -m64 -o- -Os -S case2.c

[…]
__rotlsi3:
	movl	%edi, %eax
	movl	%esi, %ecx
	roll	%cl, %eax
	ret
[…]
__rotrsi3:
	movl	%edi, %eax
	movl	%esi, %ecx
	rorl	%cl, %eax
	ret
[…]
__rotldi3:
	movq	%rdi, %rax
	movl	%esi, %ecx
	rolq	%cl, %rax
	ret
[…]
__rotrdi3:
	movq	%rdi, %rax
	movl	%esi, %ecx
	rorq	%cl, %rax
	ret
[…]
__rotlti3:
	movq	%rdi, %r9
	movq	%rsi, %r8
	andl	$127, %edx
	movq	%rdi, %rsi
	movl	%edx, %ecx
	movq	%r9, %rax
	movq	%r8, %rdx
	movq	%r8, %rdi
	shldq	%cl, %r9, %rdx
	salq	%cl, %rax
	xorl	%r8d, %r8d
	testb	$64, %cl
	cmovne	%rax, %rdx
	cmovne	%r8, %rax
	negl	%ecx
	xorl	%r9d, %r9d
	andl	$127, %ecx
	shrdq	%cl, %rdi, %rsi
	shrq	%cl, %rdi
	testb	$64, %cl
	cmovne	%rdi, %rsi
	cmovne	%r9, %rdi
	orq	%rsi, %rax
	orq	%rdi, %rdx
	ret
[…]
__rotrti3:
	movq	%rsi, %r8
	movq	%rdi, %r9
	andl	$127, %edx
	movq	%rdi, %rsi
	movl	%edx, %ecx
	movq	%r9, %rax
	movq	%r8, %rdx
	movq	%r8, %rdi
	shrdq	%cl, %r8, %rax
	shrq	%cl, %rdx
	xorl	%r8d, %r8d
	testb	$64, %cl
	cmovne	%rdx, %rax
	cmovne	%r8, %rdx
	negl	%ecx
	andl	$127, %ecx
	shldq	%cl, %r9, %rdi
	salq	%cl, %rsi
	xorl	%r9d, %r9d
	testb	$64, %cl
	cmovne	%rsi, %rdi
	cmovne	%r9, %rsi
	orq	%rdi, %rdx
	orq	%rsi, %rax
	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"

Ouch: while the optimiser recognises the common and well-known idiom for rotation of 32-bit and 64-bit integers, it but fails rather bad at 128-bit integers and generates awful code!

Properly written code uses but only 9 instructions in just 28 bytes instead of 25 instructions in 77 bytes for each of the __rotlti3() and __rotrti3() functions:

# Copyright © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"rolti3.s"
.arch	generic64
.code64
.intel_syntax noprefix
.text
				# rsi:rdi = value
				# rdx = count
__rotlti3:
	mov	ecx, edx
	mov	rax, rdi
	shld	rax, rsi, cl
	shld	rsi, rdi, cl
	mov	rdx, rsi
	test	cl, 64
	cmovnz	rdx, rax
	cmovnz	rax, rsi
	ret

.size	__rotlti3, .-__rotlti3
.type	__rotlti3, @function
.global	__rotlti3

__rotrti3:
	mov	ecx, edx
	mov	rax, rdi
	shrd	rax, rsi, cl
	shrd	rsi, rdi, cl
	mov	rdx, rsi
	test	cl, 64
	cmovnz	rdx, rax
	cmovnz	rax, rsi
	ret

.size	__rotrti3, .-__rotrti3
.type	__rotrti3, @function
.global	__rotrti3
.end

Note: exploration of the equally bad code generated for rotation of 64-bit integers on the i386 platform is left as an exercise to the reader.

gcc -m32 -o- -Os -S case2.c

[…]
__rotlsi3:
	movl	8(%esp), %ecx
	movl	4(%esp), %eax
	roll	%cl, %eax
	ret
[…]
__rotrsi3:
	movl	8(%esp), %ecx
	movl	4(%esp), %eax
	rorl	%cl, %eax
	ret
[…]
__rotldi3:
	pushl	%edi
	pushl	%esi
	movl	12(%esp), %eax
	movl	16(%esp), %edx
	movl	20(%esp), %ecx
	movl	%eax, %esi
	andl	$63, %ecx
	movl	%edx, %edi
	shldl	%cl, %esi, %edi
	sall	%cl, %esi
	testb	$32, %cl
	je	L8
	movl	%esi, %edi
	xorl	%esi, %esi
L8:
	negl	%ecx
	andl	$63, %ecx
	shrdl	%cl, %edx, %eax
	shrl	%cl, %edx
	testb	$32, %cl
	je	L9
	movl	%edx, %eax
	xorl	%edx, %edx
L9:
	orl	%esi, %eax
	orl	%edi, %edx
	popl	%esi
	popl	%edi
	ret
[…]
__rotrdi3:
	pushl	%edi
	pushl	%esi
	movl	12(%esp), %eax
	movl	16(%esp), %edx
	movl	20(%esp), %ecx
	movl	%eax, %esi
	andl	$63, %ecx
	movl	%edx, %edi
	shrdl	%cl, %edi, %esi
	shrl	%cl, %edi
	testb	$32, %cl
	je	L12
	movl	%edi, %esi
	xorl	%edi, %edi
L12:
	negl	%ecx
	andl	$63, %ecx
	shldl	%cl, %eax, %edx
	sall	%cl, %eax
	testb	$32, %cl
	je	L13
	movl	%eax, %edx
	xorl	%eax, %eax
L13:
	orl	%esi, %eax
	orl	%edi, %edx
	popl	%esi
	popl	%edi
	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"

Case 3: `__ashl?i3()` Functions

libgcc provides the __ashlsi3(), __ashlti3() and __ashlti3() functions for integer shift operations:

Runtime Function: int __ashlsi3 (int a, int b)

Runtime Function: long __ashldi3 (long a, int b)

Runtime Function: long long __ashlti3 (long long a, int b)

These functions return the result of shifting a left by b bits.

For the AMD64 target platform, GCC compiles __ashlti3() to the following unoptimised and awful code:

[…]
0000000000000000 <__ashlti3>:
   0:	48 85 d2             	test   %rdx,%rdx
   3:	74 3b                	je     40 <__ashlti3+0x40>
   5:	41 b8 40 00 00 00    	mov    $0x40,%r8d
   b:	49 29 d0             	sub    %rdx,%r8
   e:	4d 85 c0             	test   %r8,%r8
  11:	7e 1d                	jle    30 <__ashlti3+0x30>
  13:	89 d1                	mov    %edx,%ecx
  15:	48 89 f8             	mov    %rdi,%rax
  18:	48 d3 e6             	shl    %cl,%rsi
  1b:	48 d3 e0             	shl    %cl,%rax
  1e:	44 89 c1             	mov    %r8d,%ecx
  21:	48 89 f2             	mov    %rsi,%rdx
  24:	48 d3 ef             	shr    %cl,%rdi
  27:	48 09 fa             	or     %rdi,%rdx
  2a:	c3                   	retq   
  2b:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
  30:	44 89 c1             	mov    %r8d,%ecx
  33:	31 c0                	xor    %eax,%eax
  35:	f7 d9                	neg    %ecx
  37:	48 d3 e7             	shl    %cl,%rdi
  3a:	48 89 fa             	mov    %rdi,%rdx
  3d:	c3                   	retq   
  3e:	66 90                	xchg   %ax,%ax
  40:	48 89 f8             	mov    %rdi,%rax
  43:	48 89 f2             	mov    %rsi,%rdx
  46:	c3                   	retq   
[…]

Oops: the TEST instruction at address e is superfluous.

Properly written branch-free code uses but only 11 instructions in just 32 bytes instead of 26 instructions in 71 bytes and handles shift counts greater than 127:

# Copyright © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"ashlti3.s"
.arch	generic64
.code64
.intel_syntax noprefix
.text
				# rsi:rdi = value
				# rdx = count
__ashlti3:
	mov	ecx, edx
	xor	edx, edx
	xor	eax, eax
	shld	rsi, rdi, cl
	shl	rdi, cl		# rsi:rdi = value << count % 64
	cmp	ecx, 127
	cmovbe	rdx, rdi
	cmp	ecx, 63
	cmovbe	rdx, rsi
	cmovbe	rax, rdi	# rdx:rax = value << count
	ret

.size	__ashlti3, .-__ashlti3
.type	__ashlti3, @function
.global	__ashlti3
.end

# Copyright © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"ashlti3.s"
.arch	generic64
.code64
.intel_syntax noprefix
.text
				# rsi:rdi = value
				# rdx = count
__ashlti3:
	cmp	edx, 127
	ja	.zero		# count > 127?

	mov	ecx, edx
	cmp	edx, 63
	ja	.high		# count > 63?
.low:
	shld	rsi, rdi, cl
	shl	rdi, cl		# rsi:rdi = value << count % 64
	mov	rdx, rsi
	mov	rax, rdi	# rdx:rax = value << count
	ret
.high:
	shl	rdi, cl
	mov	rdx, rdi
	xor	eax, eax	# rdx:rax = value << count
	ret
.zero:
	xor	edx, edx
	xor	eax, eax	# rdx:rax = 0
				#         = value << count
	ret

.size	__ashlti3, .-__ashlti3
.type	__ashlti3, @function
.global	__ashlti3
.end

Note: exploration of the equally bad code generated for the __ashldi3() function on the i386 platform is left as an exercise to the reader.

[…]
00000000 <__ashldi3>:
   0:	56                   	push   %esi
   1:	53                   	push   %ebx
   2:	8b 5c 24 0c          	mov    0xc(%esp),%ebx
   6:	8b 4c 24 14          	mov    0x14(%esp),%ecx
   a:	8b 54 24 10          	mov    0x10(%esp),%edx
   e:	89 d8                	mov    %ebx,%eax
  10:	85 c9                	test   %ecx,%ecx
  12:	74 15                	je     29 <__ashldi3+0x29>
  14:	be 20 00 00 00       	mov    $0x20,%esi
  19:	29 ce                	sub    %ecx,%esi
  1b:	85 f6                	test   %esi,%esi
  1d:	7e 11                	jle    30 <__ashldi3+0x30>
  1f:	d3 e2                	shl    %cl,%edx
  21:	d3 e0                	shl    %cl,%eax
  23:	89 f1                	mov    %esi,%ecx
  25:	d3 eb                	shr    %cl,%ebx
  27:	09 da                	or     %ebx,%edx
  29:	5b                   	pop    %ebx
  2a:	5e                   	pop    %esi
  2b:	c3                   	ret    
  2c:	8d 74 26 00          	lea    0x0(%esi,%eiz,1),%esi
  30:	83 e9 20             	sub    $0x20,%ecx
  33:	31 c0                	xor    %eax,%eax
  35:	d3 e3                	shl    %cl,%ebx
  37:	89 da                	mov    %ebx,%edx
  39:	5b                   	pop    %ebx
  3a:	5e                   	pop    %esi
  3b:	c3                   	ret    
[…]

# Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

# NOTE: applies shift count modulo 64!

.file	"ashldi3.s"
.arch	generic32
.code32
.intel_syntax noprefix
.text
				# [esp+12] = count
				# [esp+8] = high dword of value
				# [esp+4] = low dword of value
__ashldi3:
	mov	ecx, [esp+12]
	mov	eax, [esp+4]
	test	cl, 32
	jnz	0f		# count > 31?

	mov	edx, [esp+8]
	shld	edx, eax, cl
	shl	eax, cl		# edx:eax = value << count
	ret
0:
	mov	edx, eax
	shl	edx, cl
	xor	eax, eax	# edx:eax = value << count
	ret

.size	__ashldi3, .-__ashldi3
.type	__ashldi3, @function
.global	__ashldi3
.end

Case 4: `__ashr?i3()` Functions

libgcc provides the __ashrsi3(), __ashrdi3() and __ashrti3() functions for integer shift operations:

Runtime Function: int __ashrsi3 (int a, int b)

Runtime Function: long __ashrdi3 (long a, int b)

Runtime Function: long long __ashrti3 (long long a, int b)

These functions return the result of arithmetically shifting a right by b bits.

For the AMD64 target platform, GCC compiles __ashrti3() to the following unoptimised and awful code:

[…]
0000000000000000 <__ashrti3>:
   0:	48 89 d1             	mov    %rdx,%rcx
   3:	48 85 d2             	test   %rdx,%rdx
   6:	74 38                	je     40 <__ashrti3+0x40>
   8:	41 b8 40 00 00 00    	mov    $0x40,%r8d
   e:	49 29 d0             	sub    %rdx,%r8
  11:	48 89 f2             	mov    %rsi,%rdx
  14:	4d 85 c0             	test   %r8,%r8
  17:	7e 17                	jle    30 <__ashrti3+0x30>
  19:	48 d3 ef             	shr    %cl,%rdi
  1c:	48 d3 fa             	sar    %cl,%rdx
  1f:	48 89 f0             	mov    %rsi,%rax
  22:	44 89 c1             	mov    %r8d,%ecx
  25:	48 d3 e0             	shl    %cl,%rax
  28:	48 09 f8             	or     %rdi,%rax
  2b:	c3                   	retq   
  2c:	0f 1f 40 00          	nopl   0x0(%rax)
  30:	44 89 c1             	mov    %r8d,%ecx
  33:	48 89 f0             	mov    %rsi,%rax
  36:	48 c1 fa 3f          	sar    $0x3f,%rdx
  3a:	f7 d9                	neg    %ecx
  3c:	48 d3 f8             	sar    %cl,%rax
  3f:	c3                   	retq   
  40:	48 89 f8             	mov    %rdi,%rax
  43:	48 89 f2             	mov    %rsi,%rdx
  46:	c3                   	retq   
[…]

Oops: the TEST instruction at address 14 is superfluous.

Properly written branch-free code uses but only 12 instructions in just 36 bytes instead of 25 instructions in 71 bytes and handles shift counts greater than 127:

# Copyright © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"ashrti3.s"
.arch	generic64
.code64
.intel_syntax noprefix
.text
				# rsi:rdi = value
				# rdx = count
__ashrti3:
	mov	ecx, edx
	mov	rax, rsi
	cqo
	mov	rax, rdx
	shrd	rdi, rsi, cl
	sar	rsi, cl		# rsi:rdi = value >> count % 64
	cmp	ecx, 127
	cmovbe	rax, rsi
	cmp	ecx, 63
	cmovbe	rdx, rsi
	cmovbe	rax, rdi	# rdx:rax = value >> count
	ret

.size	__ashrti3, .-__ashrti3
.type	__ashrti3, @function
.global	__ashrti3
.end

# Copyright © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"ashrti3.s"
.arch	generic64
.code64
.intel_syntax noprefix
.text
				# rsi:rdi = value
				# rdx = count
__ashrti3:
	cmp	edx, 127
	ja	.sign		# count > 127?

	mov	ecx, edx
	cmp	edx, 63
	ja	.high		# count > 63?
.low:
	shrd	rdi, rsi, cl
	sar	rsi, cl		# rsi:rdi = value >> count % 64
	mov	rdx, rsi
	mov	rax, rdi	# rdx:rax = value >> count
	ret
.high:
	sar	rsi, cl
	mov	rax, rsi
	cqo			# rdx:rax = value >> count
	ret
.sign:
	mov	rax, rsi
	cqo
	mov	rax, rdx	# rdx:rax = (value < 0) ? -1 : 0
				#         = value >> count
	ret

.size	__ashrti3, .-__ashrti3
.type	__ashrti3, @function
.global	__ashrti3
.end

Note: exploration of the equally bad code generated for the __ashrdi3() function on the i386 platform is left as an exercise to the reader.

[…]
00000000 <__ashrdi3>:
   0:	56                   	push   %esi
   1:	53                   	push   %ebx
   2:	8b 5c 24 10          	mov    0x10(%esp),%ebx
   6:	8b 4c 24 14          	mov    0x14(%esp),%ecx
   a:	8b 44 24 0c          	mov    0xc(%esp),%eax
   e:	89 da                	mov    %ebx,%edx
  10:	85 c9                	test   %ecx,%ecx
  12:	74 15                	je     29 <__ashrdi3+0x29>
  14:	be 20 00 00 00       	mov    $0x20,%esi
  19:	29 ce                	sub    %ecx,%esi
  1b:	85 f6                	test   %esi,%esi
  1d:	7e 11                	jle    30 <__ashrdi3+0x30>
  1f:	d3 e8                	shr    %cl,%eax
  21:	d3 fa                	sar    %cl,%edx
  23:	89 f1                	mov    %esi,%ecx
  25:	d3 e3                	shl    %cl,%ebx
  27:	09 d8                	or     %ebx,%eax
  29:	5b                   	pop    %ebx
  2a:	5e                   	pop    %esi
  2b:	c3                   	ret    
  2c:	8d 74 26 00          	lea    0x0(%esi,%eiz,1),%esi
  30:	89 d8                	mov    %ebx,%eax
  32:	83 e9 20             	sub    $0x20,%ecx
  35:	c1 fa 1f             	sar    $0x1f,%edx
  38:	5b                   	pop    %ebx
  39:	d3 f8                	sar    %cl,%eax
  3b:	5e                   	pop    %esi
  3c:	c3                   	ret    
[…]

# Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

# NOTE: applies shift count modulo 64!

.file	"ashrdi3.s"
.arch	generic32
.code32
.intel_syntax noprefix
.text
				# [esp+12] = count
				# [esp+8] = high dword of value
				# [esp+4] = low dword of value
__ashrdi3:
	mov	ecx, [esp+12]
	mov	edx, [esp+8]
	test	cl, 32
	jnz	0f		# count > 31?

	mov	eax, [esp+4]
	shrd	eax, edx, cl
	sar	edx, cl		# edx:eax = value >> count
	ret
0:
	mov	eax, edx
	sar	eax, cl
	sar	edx, 31		# edx:eax = value >> count
	ret

.size	__ashrdi3, .-__ashrdi3
.type	__ashrdi3, @function
.global	__ashrdi3
.end

Case 5: `__lshr?i3()` Functions

libgcc provides the __lshrsi3(), __lshrdi3() and __lshrti3() functions for integer shift operations:

Runtime Function: int __lshrsi3 (int a, int b)

Runtime Function: long __lshrdi3 (long a, int b)

Runtime Function: long long __lshrti3 (long long a, int b)

These functions return the result of logically shifting a right by b bits.

For the AMD64 target platform, GCC compiles __ashlti3() to the following unoptimised and awful code:

[…]
0000000000000000 <__lshrti3>:
   0:	48 89 d1             	mov    %rdx,%rcx
   3:	48 85 d2             	test   %rdx,%rdx
   6:	74 38                	je     40 <__lshrti3+0x40>
   8:	41 b8 40 00 00 00    	mov    $0x40,%r8d
   e:	49 29 d0             	sub    %rdx,%r8
  11:	4d 85 c0             	test   %r8,%r8
  14:	7e 1a                	jle    30 <__lshrti3+0x30>
  16:	48 89 f2             	mov    %rsi,%rdx
  19:	48 d3 ef             	shr    %cl,%rdi
  1c:	48 89 f0             	mov    %rsi,%rax
  1f:	48 d3 ea             	shr    %cl,%rdx
  22:	44 89 c1             	mov    %r8d,%ecx
  25:	48 d3 e0             	shl    %cl,%rax
  28:	48 09 f8             	or     %rdi,%rax
  2b:	c3                   	retq   
  2c:	0f 1f 40 00          	nopl   0x0(%rax)
  30:	44 89 c1             	mov    %r8d,%ecx
  33:	48 89 f0             	mov    %rsi,%rax
  36:	31 d2                	xor    %edx,%edx
  38:	f7 d9                	neg    %ecx
  3a:	48 d3 e8             	shr    %cl,%rax
  3d:	c3                   	retq   
  3e:	66 90                	xchg   %ax,%ax
  40:	48 89 f8             	mov    %rdi,%rax
  43:	48 89 f2             	mov    %rsi,%rdx
  46:	c3                   	retq   
[…]

Oops: the TEST instruction at address 11 is superfluous.

Properly written branch-free code uses but only 11 instructions in just 32 bytes instead of 25 instructions in 71 bytes and handles shift counts greater than 127:

# Copyright © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"lshrti3.s"
.arch	generic64
.code64
.intel_syntax noprefix
.text
				# rsi:rdi = value
				# rdx = count
__lshrti3:
	mov	ecx, edx
	xor	edx, edx
	xor	eax, eax
	shrd	rdi, rsi, cl
	shr	rsi, cl		# rsi:rdi = value >> count % 64
	cmp	ecx, 127
	cmovbe	rax, rsi
	cmp	ecx, 63
	cmovbe	rdx, rsi
	cmovbe	rax, rdi	# rdx:rax = value >> count
	ret

.size	__lshrti3, .-__lshrti3
.type	__lshrti3, @function
.global	__lshrti3
.end

# Copyright © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"lshrti3.s"
.arch	generic64
.code64
.intel_syntax noprefix
.text
				# rsi:rdi = value
				# rdx = count
__lshrti3:
	cmp	edx, 127
	ja	.zero		# count > 127?

	mov	ecx, edx
	cmp	edx, 63
	ja	.high		# count > 63?
.low:
	shrd	rdi, rsi, cl
	shr	rsi, cl		# rsi:rdi = value >> count % 64
	mov	rdx, rsi
	mov	rax, rdi	# rdx:rax = value >> count
	ret
.high:
	shr	rsi, cl
	xor	edx, edx
	mov	rax, rsi	# rdx:rax = value >> count
	ret
.zero:
	xor	edx, edx
	xor	eax, eax	# rdx:rax = 0
				#         = value >> count
	ret

.size	__lshrti3, .-__lshrti3
.type	__lshrti3, @function
.global	__lshrti3
.end

Note: exploration of the equally bad code generated for the __lshrdi3() function on the i386 platform is left as an exercise to the reader.

[…]
00000000 <__lshrdi3>:
   0:	56                   	push   %esi
   1:	53                   	push   %ebx
   2:	8b 5c 24 10          	mov    0x10(%esp),%ebx
   6:	8b 4c 24 14          	mov    0x14(%esp),%ecx
   a:	8b 44 24 0c          	mov    0xc(%esp),%eax
   e:	89 da                	mov    %ebx,%edx
  10:	85 c9                	test   %ecx,%ecx
  12:	74 15                	je     29 <__lshrdi3+0x29>
  14:	be 20 00 00 00       	mov    $0x20,%esi
  19:	29 ce                	sub    %ecx,%esi
  1b:	85 f6                	test   %esi,%esi
  1d:	7e 11                	jle    30 <__lshrdi3+0x30>
  1f:	d3 e8                	shr    %cl,%eax
  21:	d3 ea                	shr    %cl,%edx
  23:	89 f1                	mov    %esi,%ecx
  25:	d3 e3                	shl    %cl,%ebx
  27:	09 d8                	or     %ebx,%eax
  29:	5b                   	pop    %ebx
  2a:	5e                   	pop    %esi
  2b:	c3                   	ret    
  2c:	8d 74 26 00          	lea    0x0(%esi,%eiz,1),%esi
  30:	89 d8                	mov    %ebx,%eax
  32:	83 e9 20             	sub    $0x20,%ecx
  35:	31 d2                	xor    %edx,%edx
  37:	5b                   	pop    %ebx
  38:	d3 e8                	shr    %cl,%eax
  3a:	5e                   	pop    %esi
  3b:	c3                   	ret    
[…]

# Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

# NOTE: applies shift count modulo 64!

.file	"lshrdi3.s"
.arch	generic32
.code32
.intel_syntax noprefix
.text
				# [esp+12] = count
				# [esp+8] = high dword of value
				# [esp+4] = low dword of value
__lshrdi3:
	mov	ecx, [esp+12]
	mov	edx, [esp+8]
	test	cl, 32
	jnz	0f		# count > 31?

	mov	eax, [esp+4]
	shrd	eax, edx, cl
	shr	edx, cl		# edx:eax = value >> count
	ret
0:
	mov	eax, edx
	shr	eax, cl
	xor	edx, edx	# edx:eax = value >> count
	ret

.size	__lshrdi3, .-__lshrdi3
.type	__lshrdi3, @function
.global	__lshrdi3
.end

Case 6: `__cmp?i2()` Functions

libgcc provides the __cmpsi2(), __cmpdi2() and __cmpti2() functions for signed integer comparison:

Runtime Function: int __cmpsi2 (int a, int b)

Runtime Function: int __cmpdi2 (long a, long b)

Runtime Function: int __cmpti2 (long long a, long long b)

These functions perform a signed comparison of a and b. If a is less than b, they return 0; if a is greater than b, they return 2; and if a and b are equal they return 1.

For the AMD64 target platform, GCC compiles __cmpti2() to the following unoptimised and awkward code using (superfluous) conditional branches which impair performance:

[…]
0000000000000000 <__cmpti2>:
   0:	31 c0                	xor    %eax,%eax
   2:	48 39 ce             	cmp    %rcx,%rsi
   5:	7c 1c                	jl     23 <__cmpti2+0x23>
   7:	b8 02 00 00 00       	mov    $0x2,%eax
   c:	7f 15                	jg     23 <__cmpti2+0x23>
   e:	31 c0                	xor    %eax,%eax
  10:	48 39 d7             	cmp    %rdx,%rdi
  13:	72 0e                	jb     23 <__cmpti2+0x23>
  15:	b8 01 00 00 00       	mov    $0x1,%eax
  1a:	ba 02 00 00 00       	mov    $0x2,%edx
  1f:	48 0f 47 c2          	cmova  %rdx,%rax
  23:	c3                   	retq   
[…]

Properly written branch-free code uses 11 instructions in 21 bytes instead of 12 instructions in 36 bytes:

# Copyright © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"cmpti2.s"
.arch	generic64
.code64
.intel_syntax noprefix
.text
				# rsi:rdi = left argument
				# rcx:rdx = right argument
__cmpti2:
	sub	rdi, rdx
	sbb	rsi, rcx	# rsi:rdi = left argument - right argument
	setg	ah		# ah = (left argument > right argument) ? 1 : 0
	setl	al		# al = (left argument < right argument) ? 1 : 0
	stc
	sbb	al, ah		# al = (left argument < right argument)
				#    - (left argument > right argument)
				#    - 1
				#    = {0, -1, -2}
	movsx	eax, al
	neg	eax		# rax = (left argument > right argument)
				#     - (left argument < right argument)
				#     + 1
				#     = {0, 1, 2}
	ret

.size	__cmpti2, .-__cmpti2
.type	__cmpti2, @function
.global	__cmpti2
.end

Note: exploration of the equally bad code generated for the __cmpdi2() and __cmpsi2() functions is left as an exercise to the reader.

[…]
00000000 <__cmpdi2>:
   0:	53                   	push   %ebx
   1:	31 c0                	xor    %eax,%eax
   3:	8b 4c 24 08          	mov    0x8(%esp),%ecx
   7:	8b 54 24 10          	mov    0x10(%esp),%edx
   b:	8b 5c 24 14          	mov    0x14(%esp),%ebx
   f:	39 5c 24 0c          	cmp    %ebx,0xc(%esp)
  13:	7c 1a                	jl     2f <__cmpdi2+0x2f>
  15:	b8 02 00 00 00       	mov    $0x2,%eax
  1a:	7f 13                	jg     2f <__cmpdi2+0x2f>
  1c:	31 c0                	xor    %eax,%eax
  1e:	39 d1                	cmp    %edx,%ecx
  20:	72 0d                	jb     2f <__cmpdi2+0x2f>
  22:	b8 01 00 00 00       	mov    $0x1,%eax
  27:	ba 02 00 00 00       	mov    $0x2,%edx
  2c:	0f 47 c2             	cmova  %edx,%eax
  2f:	5b                   	pop    %ebx
  30:	c3                   	ret    
[…]

// Copyleft © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

int __cmpsi2(int comparand, int comparator) {
    return (comparand > comparator) - (comparand < comparator) + 1;
}

int __cmpdi2(long long comparand, long long comparator) {
    return (comparand > comparator) - (comparand < comparator) + 1;
}

#ifdef __amd64__
int __cmpti2(__int128_t comparand, __int128_t comparator) {
    return (comparand > comparator) - (comparand < comparator) + 1;
}
#endif

Case 7: `__ucmp?i2()` Functions

libgcc provides the __ucmpsi2(), __ucmpdi2() and __ucmpti2() functions for unsigned integer comparison:

Runtime Function: int __ucmpsi2 (unsigned int a, unsigned int b)

Runtime Function: int __ucmpdi2 (unsigned long a, unsigned long b)

Runtime Function: int __ucmpti2 (unsigned long long a, unsigned long long b)

These functions perform an unsigned comparison of a and b. If a is less than b, they return 0; if a is greater than b, they return 2; and if a and b are equal they return 1.

For the AMD64 target platform, GCC compiles __ucmpti2() to the following unoptimised and awkward code using (superfluous) conditional branches which impair performance:

[…]
0000000000000000 <__ucmpti2>:
   0:	31 c0                	xor    %eax,%eax
   2:	48 39 ce             	cmp    %rcx,%rsi
   5:	72 1c                	jb     23 <__ucmpti2+0x23>
   7:	b8 02 00 00 00       	mov    $0x2,%eax
   c:	77 15                	ja     23 <__ucmpti2+0x23>
   e:	31 c0                	xor    %eax,%eax
  10:	48 39 d7             	cmp    %rdx,%rdi
  13:	72 0e                	jb     23 <__ucmpti2+0x23>
  15:	b8 01 00 00 00       	mov    $0x1,%eax
  1a:	ba 02 00 00 00       	mov    $0x2,%edx
  1f:	48 0f 47 c2          	cmova  %rdx,%rax
  23:	c3                   	retq   
[…]

Properly written branch-free code uses but only 6 instructions in just 15 bytes instead of 12 instructions in 36 bytes:

# Copyright © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"ucmpti2.s"
.arch	generic64
.code64
.intel_syntax noprefix
.text
				# rsi:rdi = left argument
				# rcx:rdx = right argument
__ucmpti2:
	xor	eax, eax	# rax = 0
	sub	rdi, rdx
	sbb	rsi, rcx	# rsi:rdi = left argument - right argument
	seta	al		# rax = (left argument > right argument)
	sbb	eax, -1		# rax = (left argument > right argument)
				#     - (left argument < right argument)
				#     + 1
				#     = {0, 1, 2}
	ret

.size	__ucmpti2, .-__ucmpti2
.type	__ucmpti2, @function
.global	__ucmpti2
.end

Note: exploration of the equally bad code generated for the __ucmpdi2() and __ucmpsi2() functions is left as an exercise to the reader.

[…]
00000000 <__ucmpdi2>:
   0:	53                   	push   %ebx
   1:	31 c0                	xor    %eax,%eax
   3:	8b 4c 24 08          	mov    0x8(%esp),%ecx
   7:	8b 54 24 10          	mov    0x10(%esp),%edx
   b:	8b 5c 24 14          	mov    0x14(%esp),%ebx
   f:	39 5c 24 0c          	cmp    %ebx,0xc(%esp)
  13:	72 1a                	jb     2f <__ucmpdi2+0x2f>
  15:	b8 02 00 00 00       	mov    $0x2,%eax
  1a:	77 13                	ja     2f <__ucmpdi2+0x2f>
  1c:	31 c0                	xor    %eax,%eax
  1e:	39 d1                	cmp    %edx,%ecx
  20:	72 0d                	jb     2f <__ucmpdi2+0x2f>
  22:	b8 01 00 00 00       	mov    $0x1,%eax
  27:	ba 02 00 00 00       	mov    $0x2,%edx
  2c:	0f 47 c2             	cmova  %edx,%eax
  2f:	5b                   	pop    %ebx
  30:	c3                   	ret    
[…]

// Copyleft © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

int __ucmpsi2(unsigned int comparand, unsigned int comparator) {
    return (comparand > comparator) - (comparand < comparator) + 1;
}

int __ucmpdi2(unsigned long long comparand, unsigned long long comparator) {
    return (comparand > comparator) - (comparand < comparator) + 1;
}

#ifdef __amd64__
int __ucmpti2(__uint128_t comparand, __uint128_t comparator) {
    return (comparand > comparator) - (comparand < comparator) + 1;
}
#endif

Case 8: `__mul?i3()` Functions

libgcc provides the __mulsi3(), __muldi3() and __multi3() functions for integer multiplication:

Runtime Function: int __mulsi3 (int a, int b)

Runtime Function: long __muldi3 (long a, long b)

Runtime Function: long long __multi3 (long long a, long long b)

These functions return the product of a and b.

For the AMD64 target platform, GCC compiles __multi3() to the following code, which uses a LEA instruction instead of the appropriate (shorter) ADD instruction to compute the high part of the product, and two superfluous MOV instructions:

[…]
0000000000000000 <__multi3>:
   0:	49 89 f0             	mov    %rsi,%r8
   3:	48 89 f8             	mov    %rdi,%rax
   6:	48 89 d6             	mov    %rdx,%rsi
   9:	48 0f af f9          	imul   %rcx,%rdi
   d:	49 0f af f0          	imul   %r8,%rsi
  11:	48 f7 e2             	mul    %rdx
  14:	48 01 d7             	add    %rdx,%rdi
  17:	48 8d 14 37          	lea    (%rdi,%rsi,1),%rdx
  1b:	c3                   	retq   
[…]

Properly written code uses but 7 instructions in 21 bytes instead of 9 instructions in 28 bytes:

# Copyright © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"multi3.s"
.arch	generic64
.code64
.intel_syntax noprefix
.text
				# rsi:rdi = multiplicand
				# rcx:rdx = multiplier
__multi3:
	mov	rax, rdi	# rax = low qword of multiplicand
	imul	rsi, rdx	# rsi = high qword of multiplicand
				#     * low qword of multiplier
	imul	rdi, rcx	# rdi = low qword of multiplicand
				#     * high qword of multiplier
	mul	rdx		# rdx:rax = low qword of multiplicand
				#         * low qword of multiplier
	add	rdi, rsi
	add	rdx, rdi	# rdx:rax = product % 2**128
	ret

.size	__multi3, .-__multi3
.type	__multi3, @function
.global	__multi3
.end

Note: exploration of the equally bad code generated for the __muldi3() function on the i386 platform is left as an exercise to the reader.

[…]
00000000 <__muldi3>:
   0:	53                   	push   %ebx
   1:	8b 4c 24 08          	mov    0x8(%esp),%ecx
   5:	8b 5c 24 10          	mov    0x10(%esp),%ebx
   9:	89 c8                	mov    %ecx,%eax
   b:	0f af 4c 24 14       	imul   0x14(%esp),%ecx
  10:	f7 e3                	mul    %ebx
  12:	0f af 5c 24 0c       	imul   0xc(%esp),%ebx
  17:	01 d1                	add    %edx,%ecx
  19:	8d 14 19             	lea    (%ecx,%ebx,1),%edx
  1c:	5b                   	pop    %ebx
  1d:	c3                   	ret    
[…]

# Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"muldi3.s"
.arch	generic32
.code32
.intel_syntax noprefix
.text
				# [esp+16] = high dword of multiplier
				# [esp+12] = low dword of multiplier
				# [esp+8] = high dword of multiplicand
				# [esp+4] = low dword of multiplicand
__muldi3:
	push	ebx

	mov	edx, [esp+16]	# edx = low dword of multiplier
	mov	ecx, [esp+12]	# ecx = high dword of multiplicand
	imul	ecx, edx	# ecx = high dword of multiplicand
				#     * low dword of multiplier
	mov	eax, [esp+8]	# eax = low dword of multiplicand
	mov	ebx, [esp+20]	# ebx = high dword of multiplier
	imul	ebx, eax	# ebx = high dword of multiplier
				#     * low dword of multiplicand
	mul	edx		# edx:eax = low dword of multiplicand
				#         * low dword of multiplier
	add	ecx, ebx	# ecx = high dword of multiplicand
				#     * low dword of multiplier
				#     + high dword of multiplier
				#     * low dword of multiplicand
	add	edx, ecx	# edx:eax = product % 2**64

	pop	ebx
	ret

.size	__muldi3, .-__muldi3
.type	__muldi3, @function
.global	__muldi3
.end

Case 9: `__neg?i2()` Functions

libgcc provides the __negsi2(), __negdi2() and __negti2() functions for signed integer negation:

Runtime Function: int __negsi2 (int a)

Runtime Function: long __negdi2 (long a)

Runtime Function: long long __negti2 (long long a)

These functions return the negation of a.

For the AMD64 target platform, GCC compiles __negti2() to the following unoptimised and awkward code:

[…]
0000000000000000 <__negti2>:
   0:	48 f7 de             	neg    %rsi
   3:	48 89 f8             	mov    %rdi,%rax
   6:	48 f7 d8             	neg    %rax
   9:	48 89 f2             	mov    %rsi,%rdx
   c:	48 83 ff 01          	cmp    $0x1,%rdi
  10:	48 83 d2 ff          	adc    $0xffffffffffffffff,%rdx
  14:	c3                   	retq   
[…]

Properly written code uses but only 5 instructions in just 10 bytes instead of 7 instructions in 21 bytes:

# Copyright © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"negti2.s"
.arch	generic64
.code64
.intel_syntax noprefix
.text
				# rsi:rdi = argument
__negti2:
	xor	eax, eax
	cdq			# rdx:rax = 0
	sub	rax, rdi
	sbb	rdx, rsi	# rdx:rax = -argument
	ret

.size	__negti2, .-__negti2
.type	__negti2, @function
.global	__negti2
.end

Note: exploration of the equally bad code generated for the __negdi2() function on the i386 platform is left as an exercise to the reader.

[…]
00000000 <__negdi2>:
   0:	8b 4c 24 04          	mov    0x4(%esp),%ecx
   4:	8b 54 24 08          	mov    0x8(%esp),%edx
   8:	89 c8                	mov    %ecx,%eax
   a:	f7 da                	neg    %edx
   c:	f7 d8                	neg    %eax
   e:	83 f9 01             	cmp    $0x1,%ecx
  11:	83 d2 ff             	adc    $0xffffffff,%edx
  14:	c3                   	ret    
[…]

# Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"negdi2.s"
.arch	generic32
.code32
.intel_syntax noprefix
.text
				# [esp+8] = high dword of argument
				# [esp+4] = low dword of argument
__negdi2:
	xor	eax, eax
	cdq			# edx:eax = 0
	sub	eax, [esp+4]
	sbb	edx, [esp+8]	# edx:eax = -argument
	ret

.size	__negdi2, .-__negdi2
.type	__negdi2, @function
.global	__negdi2
.end

Case 10: `__absv?i2()` Functions

libgcc provides the __absvsi2(), __absvdi2() and __absvti2() functions:

Runtime Function: int __absvsi2 (int a)

Runtime Function: long __absvdi2 (long a)

Runtime Function: long long __absvti2 (long long a)

These functions return the absolute value of a.

For the AMD64 target platform, GCC compiles __absvti2() to the following unoptimised and awkward code:

[…]
0000000000000000 <__absvti2>:
   0:	48 89 f8             	mov    %rdi,%rax
   3:	48 89 f2             	mov    %rsi,%rdx
   6:	48 85 f6             	test   %rsi,%rsi
   9:	78 05                	js     10 <__absvti2+0x10>
   b:	c3                   	retq   
   c:	0f 1f 40 00          	nopl   0x0(%rax)
  10:	48 83 ec 08          	sub    $0x8,%rsp
  14:	48 f7 d8             	neg    %rax
  17:	48 83 d2 00          	adc    $0x0,%rdx
  1b:	48 f7 da             	neg    %rdx
  1e:	48 85 d2             	test   %rdx,%rdx
  21:	0f 88 00 00 00 00    	js     …
  27:	48 83 c4 08          	add    $0x8,%rsp
  2b:	c3                   	retq   
[…]

Oops: the TEST instruction at address 1e is superfluous.

Properly written (and almost branch-free) code uses but only 11 instructions in just 26 bytes instead of 14 instructions in 44 bytes:

# Copyright © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"absvti2.s"
.arch	generic64
.code64
.intel_syntax noprefix
.text
				# rsi:rdi = argument
__absvti2:
	mov	rax, rsi
	cqo			# rdx = (argument < 0) ? -1 : 0
	add	rdi, rdx
	adc	rsi, rdx	# rsi:rsi = (argument < 0) ? argument - 1 : argument
	jo	.overflow

	mov	rax, rdx
	xor	rax, rdi
	xor	rdx, rsi	# rdx:rax = (argument < 0) ? -argument : argument
				#         = |argument|
	ret
.overflow:
	int	4
	ret

.size	__absvti2, .-__absvti2
.type	__absvti2, @function
.global	__absvti2
.end

Note: exploration of the equally bad code generated for the __absvdi2() and __absvsi2() functions is left as an exercise to the reader.

[…]
00000000 <__absvdi2>:
   0:	83 ec 0c             	sub    $0xc,%esp
   3:	8b 54 24 14          	mov    0x14(%esp),%edx
   7:	8b 44 24 10          	mov    0x10(%esp),%eax
   b:	85 d2                	test   %edx,%edx
   d:	78 09                	js     18 <__absvdi2+0x18>
   f:	83 c4 0c             	add    $0xc,%esp
  12:	c3                   	ret    
  13:	8d 74 26 00          	lea    0x0(%esi,%eiz,1),%esi
  17:	90                   	nop
  18:	f7 d8                	neg    %eax
  1a:	83 d2 00             	adc    $0x0,%edx
  1d:	f7 da                	neg    %edx
  1f:	85 d2                	test   %edx,%edx
  21:	0f 88 00 00 00 00    	js     …
  27:	83 c4 0c             	add    $0xc,%esp
  2a:	c3                   	ret    
[…]

# Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"absvdi2.s"
.arch	generic32
.code32
.intel_syntax noprefix
.text
				# [esp+8] = high dword of argument
				# [esp+4] = low dword of argument
__absvdi2:
	mov	eax, [esp+8]
	mov	ecx, [esp+4]	# eax:ecx = argument
	cdq			# edx = (argument < 0) ? -1 : 0
	add	ecx, edx
	adc	eax, edx	# eax:ecx = (argument < 0) ? argument - 1 : argument
	into
	xor	ecx, edx
	xor	edx, eax
	mov	eax, ecx	# edx:eax = (argument < 0) ? -argument : argument
				#         = |argument|
	ret

.size	__absvdi2, .-__absvdi2
.type	__absvdi2, @function
.global	__absvdi2
.end

// Copyleft © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifndef ALTERNATE
int __absvsi2(int value) {
    if (value < 0)
        value = 0 - (unsigned int) value;

    if (value < 0)
        __builtin_trap();

    return value;
}

long long __absvdi2(long long value) {
    if (value < 0)
        value = 0 - (unsigned long long) value;

    if (value < 0)
        __builtin_trap();

    return value;
}

#ifdef __amd64__
__int128_t __absvti2(__int128_t value) {
    if (value < 0)
        value = 0 - (__uint128_t) value;

    if (value < 0)
        __builtin_trap();

    return value;
}
#endif
#else
int __absvsi2(int value) {
    const int sign = 0 - (value < 0);

    if (__builtin_add_overflow(sign, value, &value))
        __builtin_trap();

    return sign ^ value;
}

long long __absvdi2(long long value) {
    const long long sign = 0 - (value < 0);

    if (__builtin_add_overflow(sign, value, &value))
        __builtin_trap();

    return sign ^ value;
}

#ifdef __amd64__
__int128_t __absvti2(__int128_t value) {
    const __int128_t sign = 0 - (value < 0);

    if (__builtin_add_overflow(sign, value, &value))
        __builtin_trap();

    return sign ^ value;
}
#endif
#endif // ALTERNATE

Case 11: `__addv?i3()` Functions

libgcc provides the __addvsi3(), __addvdi3() and __addvti3() functions for overflow-trapping signed integer addition:

Runtime Function: int __addvsi3 (int a, int b)

Runtime Function: long __addvdi3 (long a, long b)

Runtime Function: long long __addvti3 (long long a, long long b)

These functions return the sum of a and b; that is a + b.

For the AMD64 target platform, GCC compiles __addvti3() to the following unoptimised and awful code:

[…]
0000000000000000 <__addvti3>:
   0:	49 89 f9             	mov    %rdi,%r9
   3:	49 89 d0             	mov    %rdx,%r8
   6:	48 83 ec 08          	sub    $0x8,%rsp
   a:	48 89 f2             	mov    %rsi,%rdx
   d:	4c 89 c8             	mov    %r9,%rax
  10:	48 89 f7             	mov    %rsi,%rdi
  13:	4c 01 c0             	add    %r8,%rax
  16:	48 11 ca             	adc    %rcx,%rdx
  19:	48 85 c9             	test   %rcx,%rcx
  1c:	78 12                	js     30 <__addvti3+0x30>
  1e:	4c 39 c8             	cmp    %r9,%rax
  21:	48 89 d1             	mov    %rdx,%rcx
  24:	48 19 f1             	sbb    %rsi,%rcx
  27:	7c 18                	jl     41 <__addvti3+0x41>
  29:	48 83 c4 08          	add    $0x8,%rsp
  2d:	c3                   	retq   
  2e:	66 90                	xchg   %ax,%ax
  30:	49 39 c1             	cmp    %rax,%r9
  33:	48 19 d7             	sbb    %rdx,%rdi
  36:	0f 8c 00 00 00 00    	jl     …
  3c:	48 83 c4 08          	add    $0x8,%rsp
  40:	c3                   	retq   
  41:	e9 00 00 00 00       	jmpq   …
[…]

Properly written (and almost branch-free code) uses but only 8 instructions in just 18 bytes instead of 23 instructions in 70 bytes:

# Copyright © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"addvti3.s"
.arch	generic64
.code64
.intel_syntax noprefix
.text
				# rsi:rdi = augend
				# rcx:rdx = addend
__addvti3:
	add	rdi, rdx
	adc	rsi, rcx	# rsi:rdi = augend + addend
				#         = sum
	jo	.overflow

	mov	rax, rdi
	mov	rdx, rsi	# rdx:rax = sum
	ret
.overflow:
	int	4
	ret

.size	__addvti3, .-__addvti3
.type	__addvti3, @function
.global	__addvti3
.end

Note: exploration of the equally bad code generated for the __addvdi3() and __addvsi3() functions is left as an exercise to the reader.

[…]
00000000 <__addvdi3>:
   0:	56                   	push   %esi
   1:	53                   	push   %ebx
   2:	83 ec 04             	sub    $0x4,%esp
   5:	8b 4c 24 10          	mov    0x10(%esp),%ecx
   9:	8b 5c 24 14          	mov    0x14(%esp),%ebx
   d:	8b 74 24 1c          	mov    0x1c(%esp),%esi
  11:	89 c8                	mov    %ecx,%eax
  13:	89 da                	mov    %ebx,%edx
  15:	03 44 24 18          	add    0x18(%esp),%eax
  19:	13 54 24 1c          	adc    0x1c(%esp),%edx
  1d:	85 f6                	test   %esi,%esi
  1f:	78 0f                	js     30 <__addvdi3+0x30>
  21:	39 c8                	cmp    %ecx,%eax
  23:	89 d6                	mov    %edx,%esi
  25:	19 de                	sbb    %ebx,%esi
  27:	7c 17                	jl     40 <__addvdi3+0x40>
  29:	83 c4 04             	add    $0x4,%esp
  2c:	5b                   	pop    %ebx
  2d:	5e                   	pop    %esi
  2e:	c3                   	ret    
  2f:	90                   	nop
  30:	39 c1                	cmp    %eax,%ecx
  32:	19 d3                	sbb    %edx,%ebx
  34:	0f 8c 00 00 00 00    	jl     …
  3a:	83 c4 04             	add    $0x4,%esp
  3d:	5b                   	pop    %ebx
  3e:	5e                   	pop    %esi
  3f:	c3                   	ret    
  40:	e9 00 00 00 00       	jmp    …
[…]

# Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"addvdi3.s"
.arch	generic32
.code32
.intel_syntax noprefix
.text
				# [esp+16] = high dword of addend
				# [esp+12] = low dword of addend
				# [esp+8] = high dword of augend
				# [esp+4] = low dword of augend
__addvdi3:
	mov	eax, [esp+4]
	mov	edx, [esp+8]	# edx:eax = augend
	add	eax, [esp+12]
	adc	edx, [esp+16]	# edx:eax = augend + addend
				#         = sum
	into
	ret

.size	__addvdi3, .-__addvdi3
.type	__addvdi3, @function
.global	__addvdi3
.end

// Copyleft © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifndef ALTERNATE
// overflow if both augend and addend have opposite sign of sum,
//  which is equivalent to augend has sign of addend
//   and addend has opposite sign of sum
//    (or augend has opposite sign of sum)

int __addvsi3(int augend, int addend) {
    const int sum = (unsigned int) augend + (unsigned int) addend;

    if (((augend ^ sum) & (addend ^ sum)) < 0)
        __builtin_trap();

    return sum;
}

long long __addvdi3(long long augend, long long addend) {
    const long long sum = (unsigned long long) augend + (unsigned long long) addend;

    if (((augend ^ sum) & (addend ^ sum)) < 0)
        __builtin_trap();

    return sum;
}

#ifdef __amd64__
__int128_t __addvti3(__int128_t augend, __int128_t addend) {
    const __int128_t sum = (__uint128_t) augend + (__uint128_t) addend;

    if (((augend ^ sum) & (addend ^ sum)) < 0)
        __builtin_trap();

    return sum;
}
#endif
#else
int __addvsi3(int augend, int addend) {
    int sum;

    if (__builtin_add_overflow(augend, addend, &sum))
        __builtin_trap();

    return sum;
}

long long __addvdi3(long long augend, long long addend) {
    long long sum;

    if (__builtin_add_overflow(augend, addend, &sum))
        __builtin_trap();

    return sum;
}

#ifdef __amd64__
__int128_t __addvti3(__int128_t augend, __int128_t addend) {
    __int128_t sum;

    if (__builtin_add_overflow(augend, addend, &sum))
        __builtin_trap();

    return sum;
}
#endif
#endif // ALTERNATE

Case 12: `__subv?i3()` Functions

libgcc provides the __subvsi3(), __subvdi3() and __subvti3() functions for overflow-trapping signed integer subtraction:

Runtime Function: int __subvsi3 (int a, int b)

Runtime Function: long __subvdi3 (long a, long b)

Runtime Function: long long __subvti3 (long long a, long long b)

These functions return the difference between a and b; that is a - b.

For the AMD64 target platform, GCC compiles __subvti3() to the following unoptimised and awful code:

[…]
0000000000000000 <__subvti3>:
   0:	49 89 f9             	mov    %rdi,%r9
   3:	49 89 d0             	mov    %rdx,%r8
   6:	48 83 ec 08          	sub    $0x8,%rsp
   a:	48 89 f2             	mov    %rsi,%rdx
   d:	4c 89 c8             	mov    %r9,%rax
  10:	4c 29 c0             	sub    %r8,%rax
  13:	48 19 ca             	sbb    %rcx,%rdx
  16:	48 85 c9             	test   %rcx,%rcx
  19:	78 15                	js     30 <__subvti3+0x30>
  1b:	49 39 c1             	cmp    %rax,%r9
  1e:	48 89 f1             	mov    %rsi,%rcx
  21:	48 19 d1             	sbb    %rdx,%rcx
  24:	7c 1e                	jl     44 <__subvti3+0x44>
  26:	48 83 c4 08          	add    $0x8,%rsp
  2a:	c3                   	retq   
  2b:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
  30:	4c 39 c8             	cmp    %r9,%rax
  33:	48 89 d1             	mov    %rdx,%rcx
  36:	48 19 f1             	sbb    %rsi,%rcx
  39:	0f 8c 00 00 00 00    	jl     …
  3f:	48 83 c4 08          	add    $0x8,%rsp
  43:	c3                   	retq   
  44:	e9 00 00 00 00       	jmpq   …
[…]

Properly written (and almost branch-free) code uses but only 8 instructions in just 18 bytes instead of 23 instructions in 73 bytes:

# Copyright © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"subvti3.s"
.arch	generic64
.code64
.intel_syntax noprefix
.text
				# rsi:rdi = minuend
				# rcx:rdx = subtrahend
__subvti3:
	sub	rdi, rdx
	sbb	rsi, rcx	# rsi:rdi = minuend - subtrahend
				#         = difference
	jo	.overflow

	mov	rax, rdi
	mov	rdx, rsi	# rdx:rax = difference
	ret
.overflow:
	int	4
	ret

.size	__subvti3, .-__subvti3
.type	__subvti3, @function
.global	__subvti3
.end

Note: exploration of the equally bad code generated for the __subvdi3() and __subvsi3() functions is left as an exercise to the reader.

[…]
00000000 <__subvdi3>:
   0:	56                   	push   %esi
   1:	53                   	push   %ebx
   2:	83 ec 04             	sub    $0x4,%esp
   5:	8b 4c 24 10          	mov    0x10(%esp),%ecx
   9:	8b 5c 24 14          	mov    0x14(%esp),%ebx
   d:	8b 74 24 1c          	mov    0x1c(%esp),%esi
  11:	89 c8                	mov    %ecx,%eax
  13:	89 da                	mov    %ebx,%edx
  15:	2b 44 24 18          	sub    0x18(%esp),%eax
  19:	1b 54 24 1c          	sbb    0x1c(%esp),%edx
  1d:	85 f6                	test   %esi,%esi
  1f:	78 0f                	js     30 <__subvdi3+0x30>
  21:	39 c1                	cmp    %eax,%ecx
  23:	19 d3                	sbb    %edx,%ebx
  25:	7c 1b                	jl     42 <__subvdi3+0x42>
  27:	83 c4 04             	add    $0x4,%esp
  2a:	5b                   	pop    %ebx
  2b:	5e                   	pop    %esi
  2c:	c3                   	ret    
  2d:	8d 76 00             	lea    0x0(%esi),%esi
  30:	39 c8                	cmp    %ecx,%eax
  32:	89 d6                	mov    %edx,%esi
  34:	19 de                	sbb    %ebx,%esi
  36:	0f 8c 00 00 00 00    	jl     …
  3c:	83 c4 04             	add    $0x4,%esp
  3f:	5b                   	pop    %ebx
  40:	5e                   	pop    %esi
  41:	c3                   	ret    
  42:	e9 00 00 00 00       	jmp    …
[…]

# Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"subvdi3.s"
.arch	generic32
.code32
.intel_syntax noprefix
.text
				# [esp+16] = high dword of subtrahend
				# [esp+12] = low dword of subtrahend
				# [esp+8] = high dword of minuend
				# [esp+4] = low dword of minuend
__subvdi3:
	mov	eax, [esp+4]
	mov	edx, [esp+8]	# edx:eax = minuend
	sub	eax, [esp+12]
	sbb	edx, [esp+16]	# edx:eax = minuend - subtrahend
				#         = difference
	into
	ret

.size	__subvdi3, .-__subvdi3
.type	__subvdi3, @function
.global	__subvdi3
.end

// Copyleft © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifndef ALTERNATE
// overflow if minuend has opposite sign of subtrahend
//  and minuend has opposite sign of difference
//   (or subtrahend has sign of difference)

int __subvsi3(int minuend, int subtrahend) {
    const int difference = (unsigned int) minuend - (unsigned int) subtrahend;

    if (((minuend ^ subtrahend) & (minuend ^ difference)) < 0)
        __builtin_trap();

    return difference;
}

long long __subvdi3(long long minuend, long long subtrahend) {
    const long long difference = (unsigned long long) minuend - (unsigned long long) subtrahend;

    if (((minuend ^ subtrahend) & (minuend ^ difference)) < 0)
        __builtin_trap();

    return difference;
}

#ifdef __amd64__
__int128_t __subvti3(__int128_t minuend, __int128_t subtrahend) {
    const __int128_t difference = (__uint128_t) minuend - (__uint128_t) subtrahend;

    if (((minuend ^ subtrahend) & (minuend ^ difference)) < 0)
        __builtin_trap();

    return difference;
}
#endif
#else
int __subvsi3(int minuend, int subtrahend) {
    int difference;

    if (__builtin_sub_overflow(minuend, subtrahend, &difference))
        __builtin_trap();

    return difference;
}

long long __subvdi3(long long minuend, long long subtrahend) {
    long long difference;

    if (__builtin_sub_overflow(minuend, subtrahend, &difference))
        __builtin_trap();

    return difference;
}

#ifdef __amd64__
__int128_t __subvti3(__int128_t minuend, __int128_t subtrahend) {
    __int128_t difference;

    if (__builtin_sub_overflow(minuend, subtrahend, &difference))
        __builtin_trap();

    return difference;
}
#endif
#endif // ALTERNATE

Case 13: `__mulv?i3()` Functions

libgcc provides the __mulvsi3(), __mulvdi3() and __mulvti3() functions for overflow-trapping signed integer multiplication:

Runtime Function: int __mulvsi3 (int a, int b)

Runtime Function: long __mulvdi3 (long a, long b)

Runtime Function: long long __mulvti3 (long long a, long long b)

These functions return the product of a and b; that is a * b.

For the AMD64 target platform, GCC compiles __mulvti3() to the following horrible code:

[…]
0000000000000000 <__mulvti3>:
   0:	41 55                	push   %r13
   2:	49 89 cb             	mov    %rcx,%r11
   5:	48 89 d0             	mov    %rdx,%rax
   8:	49 89 d2             	mov    %rdx,%r10
   b:	41 54                	push   %r12
   d:	49 89 fc             	mov    %rdi,%r12
  10:	48 89 d1             	mov    %rdx,%rcx
  13:	49 89 f0             	mov    %rsi,%r8
  16:	4c 89 e2             	mov    %r12,%rdx
  19:	49 89 f5             	mov    %rsi,%r13
  1c:	53                   	push   %rbx
  1d:	48 89 fe             	mov    %rdi,%rsi
  20:	48 c1 fa 3f          	sar    $0x3f,%rdx
  24:	48 c1 f8 3f          	sar    $0x3f,%rax
  28:	4c 89 df             	mov    %r11,%rdi
  2b:	4c 39 c2             	cmp    %r8,%rdx
  2e:	75 18                	jne    48 <__mulvti3+0x48>
  30:	4c 39 d8             	cmp    %r11,%rax
  33:	75 6b                	jne    a0 <__mulvti3+0xa0>
  35:	4c 89 e0             	mov    %r12,%rax
  38:	49 f7 ea             	imul   %r10
  3b:	5b                   	pop    %rbx
  3c:	41 5c                	pop    %r12
  3e:	41 5d                	pop    %r13
  40:	c3                   	retq   
  41:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
  48:	4c 39 d8             	cmp    %r11,%rax
  4b:	0f 85 a7 00 00 00    	jne    f8 <__mulvti3+0xf8>
  51:	4c 89 e0             	mov    %r12,%rax
  54:	49 f7 e2             	mul    %r10
  57:	49 89 c1             	mov    %rax,%r9
  5a:	4c 89 c0             	mov    %r8,%rax
  5d:	48 89 d7             	mov    %rdx,%rdi
  60:	49 f7 e2             	mul    %r10
  63:	4d 85 c0             	test   %r8,%r8
  66:	79 09                	jns    71 <__mulvti3+0x71>
  68:	49 89 d0             	mov    %rdx,%r8
  6b:	49 29 c8             	sub    %rcx,%r8
  6e:	4c 89 c2             	mov    %r8,%rdx
  71:	48 85 c9             	test   %rcx,%rcx
  74:	79 06                	jns    7c <__mulvti3+0x7c>
  76:	4c 29 e0             	sub    %r12,%rax
  79:	4c 19 ea             	sbb    %r13,%rdx
  7c:	49 89 fa             	mov    %rdi,%r10
  7f:	45 31 db             	xor    %r11d,%r11d
  82:	49 01 c2             	add    %rax,%r10
  85:	4c 89 d0             	mov    %r10,%rax
  88:	49 11 d3             	adc    %rdx,%r11
  8b:	48 c1 f8 3f          	sar    $0x3f,%rax
  8f:	4c 39 d8             	cmp    %r11,%rax
  92:	0f 85 00 00 00 00    	jne    …
  98:	4c 89 c8             	mov    %r9,%rax
  9b:	4c 89 d2             	mov    %r10,%rdx
  9e:	eb 9b                	jmp    3b <__mulvti3+0x3b>
  a0:	4c 89 d0             	mov    %r10,%rax
  a3:	49 f7 e4             	mul    %r12
  a6:	49 89 c0             	mov    %rax,%r8
  a9:	4c 89 d8             	mov    %r11,%rax
  ac:	48 89 d3             	mov    %rdx,%rbx
  af:	49 f7 e4             	mul    %r12
  b2:	4d 85 db             	test   %r11,%r11
  b5:	79 09                	jns    c0 <__mulvti3+0xc0>
  b7:	48 89 d7             	mov    %rdx,%rdi
  ba:	4c 29 e7             	sub    %r12,%rdi
  bd:	48 89 fa             	mov    %rdi,%rdx
  c0:	48 85 f6             	test   %rsi,%rsi
  c3:	79 06                	jns    cb <__mulvti3+0xcb>
  c5:	4c 29 d0             	sub    %r10,%rax
  c8:	4c 19 da             	sbb    %r11,%rdx
  cb:	48 89 de             	mov    %rbx,%rsi
  ce:	31 ff                	xor    %edi,%edi
  d0:	48 01 c6             	add    %rax,%rsi
  d3:	48 89 f0             	mov    %rsi,%rax
  d6:	48 11 d7             	adc    %rdx,%rdi
  d9:	48 c1 f8 3f          	sar    $0x3f,%rax
  dd:	48 39 f8             	cmp    %rdi,%rax
  e0:	0f 85 00 00 00 00    	jne    …
  e6:	4c 89 c0             	mov    %r8,%rax
  e9:	48 89 f2             	mov    %rsi,%rdx
  ec:	e9 4a ff ff ff       	jmpq   3b <__mulvti3+0x3b>
  f1:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
  f8:	4d 85 c0             	test   %r8,%r8
  fb:	78 23                	js     120 <__mulvti3+0x120>
  fd:	4d 85 db             	test   %r11,%r11
 100:	78 4d                	js     14f <__mulvti3+0x14f>
 102:	4c 09 c7             	or     %r8,%rdi
 105:	0f 85 00 00 00 00    	jne    …
 10b:	4c 89 e0             	mov    %r12,%rax
 10e:	49 f7 e2             	mul    %r10
 111:	48 85 d2             	test   %rdx,%rdx
 114:	0f 89 21 ff ff ff    	jns    3b <__mulvti3+0x3b>
 11a:	e9 00 00 00 00       	jmpq   …
 11f:	90                   	nop
 120:	4d 85 db             	test   %r11,%r11
 123:	78 57                	js     17c <__mulvti3+0x17c>
 125:	0f 85 00 00 00 00    	jne    …
 12b:	49 83 f8 ff          	cmp    $0xffffffffffffffff,%r8
 12f:	0f 85 00 00 00 00    	jne    …
 135:	4c 89 e0             	mov    %r12,%rax
 138:	49 f7 e2             	mul    %r10
 13b:	49 89 d0             	mov    %rdx,%r8
 13e:	4d 29 d0             	sub    %r10,%r8
 141:	0f 89 00 00 00 00    	jns    …
 147:	4c 89 c2             	mov    %r8,%rdx
 14a:	e9 ec fe ff ff       	jmpq   3b <__mulvti3+0x3b>
 14f:	49 83 fb ff          	cmp    $0xffffffffffffffff,%r11
 153:	0f 85 00 00 00 00    	jne    …
 159:	4d 85 c0             	test   %r8,%r8
 15c:	0f 85 00 00 00 00    	jne    …
 162:	4c 89 d0             	mov    %r10,%rax
 165:	49 f7 e4             	mul    %r12
 168:	48 89 d7             	mov    %rdx,%rdi
 16b:	4c 29 e7             	sub    %r12,%rdi
 16e:	0f 89 00 00 00 00    	jns    …
 174:	48 89 fa             	mov    %rdi,%rdx
 177:	e9 bf fe ff ff       	jmpq   3b <__mulvti3+0x3b>
 17c:	4c 21 c7             	and    %r8,%rdi
 17f:	48 83 ff ff          	cmp    $0xffffffffffffffff,%rdi
 183:	0f 85 00 00 00 00    	jne    …
 189:	4c 89 d0             	mov    %r10,%rax
 18c:	4c 09 e0             	or     %r12,%rax
 18f:	0f 84 00 00 00 00    	je     …
 195:	4c 89 e0             	mov    %r12,%rax
 198:	49 f7 e2             	mul    %r10
 19b:	4c 29 e2             	sub    %r12,%rdx
 19e:	49 89 c0             	mov    %rax,%r8
 1a1:	48 89 d6             	mov    %rdx,%rsi
 1a4:	4c 29 d6             	sub    %r10,%rsi
 1a7:	0f 88 00 00 00 00    	js     …
 1ad:	e9 34 ff ff ff       	jmpq   e6 <__mulvti3+0xe6>
[…]

Properly written (and almost branch-free) code uses but only 45 instructions in just 119 bytes instead of 130 instructions in 434 bytes:

# Copyright © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"mulvti3.s"
.arch	generic64
.code64
.intel_syntax noprefix
.text
				# rsi:rdi = multiplicand
				# rcx:rdx = multiplier
__mulvti3:
	mov	r10, rdx	# r10 = low qword of multiplier
	mov	r11, rcx	# r11 = high qword of multiplier

	mov	rax, rcx	# rax = high qword of multiplier
	cqo			# rdx = (multiplier < 0) ? -1 : 0
	mov	r9, rdx		# r9 = (multiplier < 0) ? -1 : 0
	xor	r10, rdx
	xor	r11, rdx	# r11:r10 = (multiplier < 0) ? ~multiplier : multiplier
	sub	r10, rdx
	sbb	r11, rdx	# r11:r10 = (multiplier < 0) ? -multiplier : multiplier
				#         = |multiplier|
	mov	rax, rsi	# rax = high qword of multiplicand
	cqo			# rdx = (multiplicand < 0) ? -1 : 0
	xor	r9, rdx		# r9 = (multiplicand < 0) <> (multiplier < 0) ? -1 : 0
				#    = (product < 0) ? -1 : 0
	xor	rdi, rdx
	xor	rsi, rdx	# rsi:rdi = (multiplicand < 0) ? ~multiplicand : multiplicand
	sub	rdi, rdx
	sbb	rsi, rdx	# rsi:rdi = (multiplicand < 0) ? -multiplicand : multiplicand
				#         = |multiplicand|
	xor	ecx, ecx
	cmp	rcx, rsi
	sbb	edx, edx	# edx = (high qword of |multiplicand| = 0) ? 0 : -1
				#     = (|multiplicand| < 2**64) ? 0 : -1
	cmp	rcx, r11
	sbb	ecx, ecx	# ecx = (high qword of |multiplier| = 0) ? 0 : -1
				#     = (|multiplier| < 2**64) ? 0 : -1
	and	ecx, edx	# ecx = (|multiplicand| < 2**64)
				#     & (|multiplier| < 2**64) ? 0 : -1
				#     = (|product| < 2**128) ? 0 : -1
	mov	rax, rsi
	mul	r10		# rdx:rax = high qword of |multiplicand|
				#         * low qword of |multiplier|
	adc	ecx, ecx	# ecx = (|product| < 2**128) ? 0 : *

	mov	rsi, rax
	mov	rax, rdi
	mul	r11		# rdx:rax = low qword of |multiplicand|
				#         * high qword of |multiplier|
	adc	ecx, ecx	# ecx = (|product| < 2**128) ? 0 : *

	add	rsi, rax	# rsi = high qword of |multiplicand|
				#     * low qword of |multiplier|
				#     + low qword of |multiplicand|
				#     * high qword of |multiplier|
#	adc	ecx, ecx	# ecx = (|product| < 2**128) ? 0 : *

	mov	rax, rdi
	mul	r10		# rdx:rax = low qword of |multiplicand|
				#         * low qword of |multiplier|
	add	rdx, rsi	# rdx:rax = |product % 2**128|
				#         = |product| % 2**128
	adc	ecx, ecx	# ecx = (|product| < 2**128) ? 0 : *
.if 0
	xor	rax, r9
	xor	rdx, r9		# rdx:rax = (product < 0)
				#         ? product % 2**128 - 1 : product % 2**128
	sub	rax, r9
	sbb	rdx, r9		# rdx:rax = product % 2**128

	xor	r9, rdx		# r9 = (product % 2**128 < -2**127)
				#    | (product % 2**128 >= 2**127)
				#    ? negative : positive
	add	r9, r9
.else
	add	rax, r9
	adc	rdx, r9		# rdx:rax = (product < 0)
				#         ? ~product % 2**128 : product % 2**128
	mov	rsi, rdx	# rsi = (product % 2**128 < -2**127)
				#     | (product % 2**128 >= 2**127)
				#     ? negative : positive
	xor	rax, r9
	xor	rdx, r9		# rdx:rax = product % 2**128

	add	rsi, rsi
.endif
	adc	ecx, ecx	# ecx = (-2**127 <= product < 2**127) ? 0 : *
	jnz	.overflow	# product < -2**127?
				# product >= 2**127?
	ret
.overflow:
	int	4
	ret

.size	__mulvti3, .-__mulvti3
.type	__mulvti3, @function
.global	__mulvti3
.end

Note: exploration of the equally horrible code generated for the __mulvdi3() and __mulvsi3() functions is left as an exercise to the reader.

[…]
00000000 <__mulvdi3>:
   0:	55                   	push   %ebp
   1:	57                   	push   %edi
   2:	56                   	push   %esi
   3:	53                   	push   %ebx
   4:	83 ec 1c             	sub    $0x1c,%esp
   7:	8b 44 24 38          	mov    0x38(%esp),%eax
   b:	8b 54 24 3c          	mov    0x3c(%esp),%edx
   f:	8b 7c 24 30          	mov    0x30(%esp),%edi
  13:	8b 6c 24 34          	mov    0x34(%esp),%ebp
  17:	89 54 24 0c          	mov    %edx,0xc(%esp)
  1b:	89 c2                	mov    %eax,%edx
  1d:	89 c3                	mov    %eax,%ebx
  1f:	c1 fa 1f             	sar    $0x1f,%edx
  22:	89 44 24 08          	mov    %eax,0x8(%esp)
  26:	89 f9                	mov    %edi,%ecx
  28:	89 d0                	mov    %edx,%eax
  2a:	8b 54 24 0c          	mov    0xc(%esp),%edx
  2e:	89 14 24             	mov    %edx,(%esp)
  31:	89 fa                	mov    %edi,%edx
  33:	c1 fa 1f             	sar    $0x1f,%edx
  36:	39 ea                	cmp    %ebp,%edx
  38:	75 16                	jne    50 <__mulvdi3+0x50>
  3a:	3b 04 24             	cmp    (%esp),%eax
  3d:	75 61                	jne    a0 <__mulvdi3+0xa0>
  3f:	89 f8                	mov    %edi,%eax
  41:	f7 eb                	imul   %ebx
  43:	83 c4 1c             	add    $0x1c,%esp
  46:	5b                   	pop    %ebx
  47:	5e                   	pop    %esi
  48:	5f                   	pop    %edi
  49:	5d                   	pop    %ebp
  4a:	c3                   	ret    
  4b:	8d 74 26 00          	lea    0x0(%esi,%eiz,1),%esi
  4f:	90                   	nop
  50:	89 ee                	mov    %ebp,%esi
  52:	3b 04 24             	cmp    (%esp),%eax
  55:	0f 85 95 00 00 00    	jne    f0 <__mulvdi3+0xf0>
  5b:	89 f8                	mov    %edi,%eax
  5d:	f7 e3                	mul    %ebx
  5f:	89 04 24             	mov    %eax,(%esp)
  62:	89 44 24 08          	mov    %eax,0x8(%esp)
  66:	89 e8                	mov    %ebp,%eax
  68:	89 54 24 04          	mov    %edx,0x4(%esp)
  6c:	f7 e3                	mul    %ebx
  6e:	85 ed                	test   %ebp,%ebp
  70:	79 06                	jns    78 <__mulvdi3+0x78>
  72:	89 d6                	mov    %edx,%esi
  74:	29 de                	sub    %ebx,%esi
  76:	89 f2                	mov    %esi,%edx
  78:	85 db                	test   %ebx,%ebx
  7a:	79 04                	jns    80 <__mulvdi3+0x80>
  7c:	29 f8                	sub    %edi,%eax
  7e:	19 ea                	sbb    %ebp,%edx
  80:	8b 4c 24 04          	mov    0x4(%esp),%ecx
  84:	31 db                	xor    %ebx,%ebx
  86:	01 c1                	add    %eax,%ecx
  88:	89 c8                	mov    %ecx,%eax
  8a:	11 d3                	adc    %edx,%ebx
  8c:	c1 f8 1f             	sar    $0x1f,%eax
  8f:	39 d8                	cmp    %ebx,%eax
  91:	0f 85 00 00 00 00    	jne    …
  97:	8b 44 24 08          	mov    0x8(%esp),%eax
  9b:	89 ca                	mov    %ecx,%edx
  9d:	eb a4                	jmp    43 <__mulvdi3+0x43>
  9f:	90                   	nop
  a0:	89 d8                	mov    %ebx,%eax
  a2:	f7 e7                	mul    %edi
  a4:	8b 3c 24             	mov    (%esp),%edi
  a7:	89 c6                	mov    %eax,%esi
  a9:	8b 04 24             	mov    (%esp),%eax
  ac:	89 d5                	mov    %edx,%ebp
  ae:	f7 e1                	mul    %ecx
  b0:	89 c3                	mov    %eax,%ebx
  b2:	85 ff                	test   %edi,%edi
  b4:	79 0c                	jns    c2 <__mulvdi3+0xc2>
  b6:	89 d0                	mov    %edx,%eax
  b8:	29 c8                	sub    %ecx,%eax
  ba:	89 04 24             	mov    %eax,(%esp)
  bd:	8b 14 24             	mov    (%esp),%edx
  c0:	89 d8                	mov    %ebx,%eax
  c2:	85 c9                	test   %ecx,%ecx
  c4:	79 08                	jns    ce <__mulvdi3+0xce>
  c6:	2b 44 24 08          	sub    0x8(%esp),%eax
  ca:	1b 54 24 0c          	sbb    0xc(%esp),%edx
  ce:	89 e9                	mov    %ebp,%ecx
  d0:	31 db                	xor    %ebx,%ebx
  d2:	01 c1                	add    %eax,%ecx
  d4:	89 c8                	mov    %ecx,%eax
  d6:	11 d3                	adc    %edx,%ebx
  d8:	c1 f8 1f             	sar    $0x1f,%eax
  db:	39 d8                	cmp    %ebx,%eax
  dd:	0f 85 00 00 00 00    	jne    …
  e3:	89 f0                	mov    %esi,%eax
  e5:	89 ca                	mov    %ecx,%edx
  e7:	e9 57 ff ff ff       	jmp    43 <__mulvdi3+0x43>
  ec:	8d 74 26 00          	lea    0x0(%esi,%eiz,1),%esi
  f0:	85 ed                	test   %ebp,%ebp
  f2:	78 24                	js     118 <__mulvdi3+0x118>
  f4:	8b 04 24             	mov    (%esp),%eax
  f7:	85 c0                	test   %eax,%eax
  f9:	78 45                	js     140 <__mulvdi3+0x140>
  fb:	09 e8                	or     %ebp,%eax
  fd:	0f 85 00 00 00 00    	jne    …
 103:	89 f8                	mov    %edi,%eax
 105:	f7 e3                	mul    %ebx
 107:	85 d2                	test   %edx,%edx
 109:	0f 89 34 ff ff ff    	jns    43 <__mulvdi3+0x43>
 10f:	e9 00 00 00 00       	jmp    …
 114:	8d 74 26 00          	lea    0x0(%esi,%eiz,1),%esi
 118:	8b 04 24             	mov    (%esp),%eax
 11b:	85 c0                	test   %eax,%eax
 11d:	78 48                	js     167 <__mulvdi3+0x167>
 11f:	0f 85 00 00 00 00    	jne    …
 125:	83 fd ff             	cmp    $0xffffffff,%ebp
 128:	0f 85 00 00 00 00    	jne    …
 12e:	89 f8                	mov    %edi,%eax
 130:	f7 e3                	mul    %ebx
 132:	89 d1                	mov    %edx,%ecx
 134:	89 c6                	mov    %eax,%esi
 136:	29 d9                	sub    %ebx,%ecx
 138:	78 a9                	js     e3 <__mulvdi3+0xe3>
 13a:	e9 00 00 00 00       	jmp    …
 13f:	90                   	nop
 140:	83 3c 24 ff          	cmpl   $0xffffffff,(%esp)
 144:	0f 85 00 00 00 00    	jne    …
 14a:	85 ed                	test   %ebp,%ebp
 14c:	0f 85 00 00 00 00    	jne    …
 152:	89 d8                	mov    %ebx,%eax
 154:	f7 e7                	mul    %edi
 156:	89 d3                	mov    %edx,%ebx
 158:	29 cb                	sub    %ecx,%ebx
 15a:	0f 89 00 00 00 00    	jns    …
 160:	89 da                	mov    %ebx,%edx
 162:	e9 dc fe ff ff       	jmp    43 <__mulvdi3+0x43>
 167:	23 34 24             	and    (%esp),%esi
 16a:	83 fe ff             	cmp    $0xffffffff,%esi
 16d:	0f 85 00 00 00 00    	jne    …
 173:	89 d8                	mov    %ebx,%eax
 175:	09 f8                	or     %edi,%eax
 177:	0f 84 00 00 00 00    	je     …
 17d:	89 f8                	mov    %edi,%eax
 17f:	f7 e3                	mul    %ebx
 181:	89 c6                	mov    %eax,%esi
 183:	89 d0                	mov    %edx,%eax
 185:	29 c8                	sub    %ecx,%eax
 187:	29 d8                	sub    %ebx,%eax
 189:	89 c1                	mov    %eax,%ecx
 18b:	0f 88 00 00 00 00    	js     …
 191:	e9 4d ff ff ff       	jmp    e3 <__mulvdi3+0xe3>
[…]

# Copyright © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"mulvdi3.s"
.arch	generic32
.code32
.intel_syntax noprefix
.text
				# [esp+16] = high dword of multiplier
				# [esp+12] = low dword of multiplier
				# [esp+8] = high dword of multiplicand
				# [esp+4] = low dword of multiplicand
__mulvdi3:
	push	ebx
	mov	eax, [esp+20]	# eax = high dword of multiplier
	mov	ecx, [esp+16]	# ecx = low dword of multiplier
	cdq			# edx = (multiplier < 0) ? -1 : 0
	mov	ebx, edx	# ebx = (multiplier < 0) ? -1 : 0
	xor	ecx, edx
	xor	eax, edx	# eax:ecx = (multiplier < 0) ? ~multiplier : multiplier
	sub	ecx, edx
	sbb	eax, edx	# eax:ecx = (multiplier < 0) ? -multiplier : multiplier
				#         = |multiplier|
	mov	[esp+16], ecx
	mov	[esp+20], eax	# multiplier = |multiplier|

	mov	eax, [esp+12]	# eax = high dword of multiplicand
	mov	ecx, [esp+8]	# ecx = low dword of multiplicand
	cdq			# edx = (multiplicand < 0) ? -1 : 0
	xor	ebx, edx	# ebx = (multiplier < 0) ^ (multiplicand < 0) ? -1 : 0
				#     = (product < 0) ? -1 : 0
	xor	ecx, edx
	xor	eax, edx	# eax:ecx = (multiplicand < 0) ? ~multiplicand : multiplicand
	sub	ecx, edx
	sbb	eax, edx	# eax:ecx = (multiplicand < 0) ? -multiplicand : multiplicand
				#         = |multiplicand|
	mov	[esp+8], ecx
	mov	[esp+12], eax	# multiplicand = |multiplicand|

	push	ebx		# [esp] = (product < 0) ? -1 : 0

	xor	edx, edx	# edx = 0
#	mov	eax, [esp+16]	# eax = high dword of |multiplicand|
	cmp	edx, eax
	sbb	ebx, ebx	# ebx = (high dword of |multiplicand| = 0) ? 0 : -1
				#     = (|multiplicand| < 2**32) ? 0 : -1
	mov	ecx, [esp+24]	# ecx = high dword of |multiplier|
	cmp	edx, ecx
	sbb	edx, edx	# edx = (high dword of |multiplier| = 0) ? 0 : -1
				#     = (|multiplier| < 2**32) ? 0 : -1
	and	ebx, edx	# ebx = (|multiplicand| < 2**32)
				#     & (|multiplier| < 2**32) ? 0 : -1
				#     = (|product| < 2**64) ? 0 : -1
.ifdef INTO
	shr	ebx, 1
	into
	mov	ebx, [esp+20]	# ebx = low dword of |multiplier|
	mul	ebx		# edx:eax = high dword of |multiplicand|
				#         * low dword of |multiplier|
	into
.if 0
	xchg	ecx, eax	# ecx = high dword of |multiplicand|
				#     * low dword of |multiplier|,
				# eax = high dword of |multiplier|
	mov	edx, [esp+12]	# edx = low dword of |multiplicand|
.else
	mov	edx, ecx	# edx = high dword of |multiplier|
	mov	ecx, eax	# ecx = high dword of |multiplicand|
				#     * low dword of |multiplier|
	mov	eax, [esp+12]	# eax = low dword of |multiplicand|
.endif
	mul	edx		# edx:eax = high dword of |multiplier|
				#         * low dword of |multiplicand|
	into
	add	ecx, eax	# ecx = high dword of |multiplicand|
				#     * low dword of |multiplier|
				#     + high dword of |multiplier|
				#     * low dword of |multiplicand|
#	sbb	eax, eax	# eax = (|product| < 2**64) ? 0 : -1
#	shr	eax, 1
#	into
	mov	eax, [esp+12]	# eax = low dword of |multiplicand|
	mul	ebx		# edx:eax = low dword of |multiplicand|
				#         * low dword of |multiplier|
	add	edx, ecx	# edx:eax = |product % 2**64|
				#         = |product| % 2**64
	sbb	ecx, ecx	# ecx = (|product| < 2**64) ? 0 : -1
	shr	ecx, 1
	into
	pop	ebx		# ebx = (product < 0) ? -1 : 0
	add	eax, ebx
	adc	edx, ebx	# edx:eax = (product < 0) ? ~product % 2**64 : product % 2**64
	mov	ecx, edx
	shr	ecx, 1
	into
	xor	eax, ebx
	xor	edx, ebx	# edx:eax = product % 2**64
	pop	ebx
	ret
.else
	mov	edx, [esp+20]	# edx = low dword of |multiplier|
	mul	edx		# edx:eax = high dword of |multiplicand|
				#         * low dword of |multiplier|
	adc	ebx, ebx	# ebx = (|product| < 2**64) ? 0 : *
.if 0
	xchg	ecx, eax	# ecx = high dword of |multiplicand|
				#     * low dword of |multiplier|,
				# eax = high dword of |multiplier|
	mov	edx, [esp+12]	# edx = low dword of |multiplicand|
.else
	mov	edx, ecx	# edx = high dword of |multiplier|
	mov	ecx, eax	# ecx = high dword of |multiplicand|
				#     * low dword of |multiplier|
	mov	eax, [esp+12]	# eax = low dword of |multiplicand|
.endif
	mul	edx		# edx:eax = high dword of |multiplier|
				#         * low dword of |multiplicand|
	adc	ebx, ebx	# ebx = (|product| < 2**64) ? 0 : *

	add	ecx, eax	# ecx = high dword of |multiplicand|
				#     * low dword of |multiplier|
				#     + high dword of |multiplier|
				#     * low dword of |multiplicand|
#	adc	ebx, ebx	# ebx = (|product| < 2**64) ? 0 : *

	mov	eax, [esp+12]	# eax = low dword of |multiplicand|
	mov	edx, [esp+20]	# edx = low dword of |multiplier|
	mul	edx		# edx:eax = low dword of |multiplicand|
				#         * low dword of |multiplier|
	add	edx, ecx	# edx:eax = |product % 2**64|
				#         = |product| % 2**64
	adc	ebx, ebx	# ebx = (|product| < 2**64) ? 0 : *

	pop	ecx		# ecx = (product < 0) ? -1 : 0
	xor	eax, ecx
	xor	edx, ecx	# edx:eax = (product < 0) ? product % 2**64 - 1 : product % 2**64
	sub	eax, ecx
	sbb	edx, ecx	# edx:eax = product % 2**64

	xor	ecx, edx	# ecx = (multiplicand < 0)
				#     ^ (multiplier < 0)
				#     ^ (product < 0) ? negative : positive
	add	ecx, ecx
	adc	ebx, ebx	# ebx = (-2**63 <= product < 2**63) ? 0 : *
	jnz	.overflow	# product < -2**63?
				# product >= 2**63?
	pop	ebx
	ret
.overflow:
	int	4
	ret
.endif # INTO
.size	__mulvdi3, .-__mulvdi3
.type	__mulvdi3, @function
.global	__mulvdi3
.end

# Copyright © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"mulvdi3.s"
.arch	generic32
.code32
.intel_syntax noprefix
.text
				# [esp+16] = high dword of multiplier
				# [esp+12] = low dword of multiplier
				# [esp+8] = high dword of multiplicand
				# [esp+4] = low dword of multiplicand
__mulvdi3:
	push	ebx
	mov	ebx, [esp+16]	# ebx = low dword of multiplier
	mov	eax, [esp+8]	# eax = low dword of multiplicand
	mul	ebx		# edx:eax = low dword of multiplicand
				#         * low dword of multiplier
	push	eax
	mov	ecx, edx	# ecx = low dword of "inner" product
	mov	eax, [esp+16]	# eax = high dword of multiplicand
	mul	ebx		# edx:eax = high dword of multiplicand
				#         * low dword of multiplier
	xor	ebx, ebx	# ebx = high dword of "inner" product
	add	ecx, eax
	adc	ebx, edx	# ebx:ecx = intermediate "inner" product
	mov	eax, [esp+12]	# eax = low dword of multiplicand
	mul	dword ptr [esp+24]
				# edx:eax = low dword of multiplicand
				#         * high dword of multiplier
	add	ecx, eax
	adc	ebx, edx	# ebx:ecx = final "inner" product
	push	ecx
	sbb	ecx, ecx	# ecx = 0 - carry from "inner" product
	mov	eax, [esp+20]	# eax = high dword of multiplicand
	mul	dword ptr [esp+28]
				# edx:eax = high dword of multiplicand
				#         * high dword of multiplier
	neg	ecx		# ecx = carry from "inner" product
	add	eax, ebx
	adc	edx, ecx	# edx:eax = high dword of multiplicand
				#         * high dword of multiplier
				#         + high dword of "inner" product
				#         + carry from "inner" product
				#         = high qword of (unsigned) product"

	xor	ebx, ebx	# ebx = sign of multiplier = 0
	cmp	ebx, [esp+28]
	jle	0f		# (high dword of) multiplier >= 0?

	not	ebx		# ebx = 0 - sign of multiplier = -1
	sub	eax, [esp+16]
	sbb	edx, [esp+20]	# edx:eax = high qword of product"
				#         - multiplicand
				#         = high qword of product'
0:
	xor	ecx, ecx
	cmp	ecx, [esp+20]
	jle	1f		# (high dword of) multiplicand >= 0?

	not	ebx		# ebx = (0 - sign of multiplier)
				#     ^ (0 - sign of multiplicand)
	sub	eax, [esp+24]
	sbb	edx, [esp+28]	# edx:eax = high qword of product'
				#         - multiplier
				#         = high qword of (signed) product
1:
	xor	eax, ebx
	xor	edx, ebx	# edx:eax = high qword of product
				#         ^ (0 - sign of multiplier)
				#         ^ (0 - sign of multiplicand)
	or	eax, edx	# eax = high qword of product
				#     <> (0 - sign of multiplier)
				#      ^ (0 - sign of multiplicand)
	pop	edx		# edx = high dword of (low qword of) product
	shld	ecx, edx, 1	# ecx = sign of product
	add	ebx, ecx	# ebx = sign of product
				#     <> sign of multiplier
				#      ^ sign of multiplicand
	or	eax, ebx
	jnz	.overflow

	pop	eax		# edx:eax = (low qword of) product
	pop	ebx
	ret
.overflow:
	int	4
	ret

.size	__mulvdi3, .-__mulvdi3
.type	__mulvdi3, @function
.global	__mulvdi3
.end

// Copyleft © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

int __mulvsi3(int multiplicand, int multiplier) {
    int product;

    if (__builtin_mul_overflow(multiplicand, multiplier, &product))
        __builtin_trap();

    return product;
}

long long __mulvdi3(long long multiplicand, long long multiplier) {
    long long product;

    if (__builtin_mul_overflow(multiplicand, multiplier, &product))
        __builtin_trap();

    return product;
}

#ifdef __amd64__
__int128_t __mulvti3(__int128_t multiplicand, __int128_t multiplier) {
    __int128_t product;

    if (__builtin_mul_overflow(multiplicand, multiplier, &product))
        __builtin_trap();

    return product;
}
#endif

Case 14: `__negv?i2()` Functions

libgcc provides the __negvsi2(), __negvdi2() and __negvti2() functions for overflow-trapping signed integer negation:

Runtime Function: int __negvsi2 (int a)

Runtime Function: long __negvdi2 (long a)

Runtime Function: long long __negvti2 (long long a)

These functions return the negation of a; that is -a.

For the AMD64 target platform, GCC compiles __negvti2() to the following unoptimised and awful code:

[…]
0000000000000000 <__negvti2>:
   0:	48 89 f8             	mov    %rdi,%rax
   3:	48 83 ec 08          	sub    $0x8,%rsp
   7:	48 89 f2             	mov    %rsi,%rdx
   a:	48 f7 d8             	neg    %rax
   d:	48 83 d2 00          	adc    $0x0,%rdx
  11:	48 f7 da             	neg    %rdx
  14:	48 85 f6             	test   %rsi,%rsi
  17:	78 17                	js     30 <__negvti2+0x30>
  19:	31 c9                	xor    %ecx,%ecx
  1b:	48 39 c1             	cmp    %rax,%rcx
  1e:	48 19 d1             	sbb    %rdx,%rcx
  21:	7c 1b                	jl     3e <__negvti2+0x3e>
  23:	48 83 c4 08          	add    $0x8,%rsp
  27:	c3                   	retq   
  28:	0f 1f 84 00 00 00 00 	nopl   0x0(%rax,%rax,1)
  2f:	00 
  30:	48 85 d2             	test   %rdx,%rdx
  33:	0f 88 00 00 00 00    	js     …
  39:	48 83 c4 08          	add    $0x8,%rsp
  3d:	c3                   	retq   
  3e:	e9 00 00 00 00       	jmpq   …
[…]

Properly written code uses but only 8 instructions in just 15 bytes instead of 20 instructions in 67 bytes:

# Copyright © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"negvti2.s"
.arch	generic64
.code64
.intel_syntax noprefix
.text
				# rsi:rdi = argument
__negvti2:
	xor	eax, eax
	cdq			# rdx:rax = 0
	sub	rax, rdi
	sbb	rdx, rsi	# rdx:rax = -argument
	jo	.overflow

	ret
.overflow:
	int	4
	ret

.size	__negvti2, .-__negvti2
.type	__negvti2, @function
.global	__negvti2
.end

Note: exploration of the equally bad code generated for the __negvdi2() and __negvsi2() functions is left as an exercise to the reader.

[…]
00000000 <__negvdi2>:
   0:	53                   	push   %ebx
   1:	83 ec 08             	sub    $0x8,%esp
   4:	8b 4c 24 10          	mov    0x10(%esp),%ecx
   8:	8b 5c 24 14          	mov    0x14(%esp),%ebx
   c:	89 c8                	mov    %ecx,%eax
   e:	89 da                	mov    %ebx,%edx
  10:	f7 d8                	neg    %eax
  12:	83 d2 00             	adc    $0x0,%edx
  15:	f7 da                	neg    %edx
  17:	85 db                	test   %ebx,%ebx
  19:	78 15                	js     30 <__negvdi2+0x30>
  1b:	31 c9                	xor    %ecx,%ecx
  1d:	39 c1                	cmp    %eax,%ecx
  1f:	19 d1                	sbb    %edx,%ecx
  21:	7c 1a                	jl     3d <__negvdi2+0x3d>
  23:	83 c4 08             	add    $0x8,%esp
  26:	5b                   	pop    %ebx
  27:	c3                   	ret    
  28:	8d b4 26 00 00 00 00 	lea    0x0(%esi,%eiz,1),%esi
  2f:	90                   	nop
  30:	85 d2                	test   %edx,%edx
  32:	0f 88 00 00 00 00    	js     …
  38:	83 c4 08             	add    $0x8,%esp
  3b:	5b                   	pop    %ebx
  3c:	c3                   	ret    
  3d:	e9 00 00 00 00       	jmp    …
[…]

# Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"negvdi2.s"
.arch	generic32
.code32
.intel_syntax noprefix
.text
				# [esp+8] = high dword of argument
				# [esp+4] = low dword of argument
__negvdi2:
	xor	eax, eax
	cdq			# edx:eax = 0
	sub	eax, [esp+4]
	sbb	edx, [esp+8]	# edx:eax = -argument
	into
	ret

.size	__negvdi2, .-__negvdi2
.type	__negvdi2, @function
.global	__negvdi2
.end

// Copyleft © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifndef ALTERNATE
// overflow if negend and negation are negative

int __negvsi2(int negend) {
    const int negation = 0 - (unsigned int) negend;

    if ((negend & negation) < 0)
        __builtin_trap();

    return negation;
}

long long __negvdi2(long long negend) {
    const long long negation = 0 - (unsigned long long) negend;

    if ((negend & negation) < 0)
        __builtin_trap();

    return negation;
}

#ifdef __amd64__
__int128_t __negvti2(__int128_t negend) {
    const __int128_t negation = 0 - (__uint128_t) negend;

    if ((negend & negation) < 0)
        __builtin_trap();

    return negation;
}
#endif
#else
int __negvsi2(int negend) {
    int negation;

    if (__builtin_sub_overflow(0, negend, &negation))
        __builtin_trap();

    return negation;
}

long long __negvdi2(long long negend) {
    long long negation;

    if (__builtin_sub_overflow(0, negend, &negation))
        __builtin_trap();

    return negation;
}

#ifdef __amd64__
__int128_t __negvti2(__int128_t negend) {
    __int128_t negation;

    if (__builtin_sub_overflow(0, negend, &negation))
        __builtin_trap();

    return negation;
}
#endif
#endif // ALTERNATE

Case 15: `__parity?i2()` Functions

libgcc provides the __paritysi2(), __paritydi2() and __parityti2() functions:

Runtime Function: int __paritysi2 (unsigned int a)

Runtime Function: int __paritydi2 (unsigned long a)

Runtime Function: int __parityti2 (unsigned long long a)

These functions return the value zero if the number of bits set in a is even, and the value one otherwise.

For the AMD64 target platform, GCC compiles __parityti2() to the following unoptimised and awful code:

[…]
0000000000000000 <__parityti2>:
   0:	48 31 f7             	xor    %rsi,%rdi
   3:	b8 96 69 00 00       	mov    $0x6996,%eax
   8:	48 89 fe             	mov    %rdi,%rsi
   b:	48 89 f9             	mov    %rdi,%rcx
   e:	48 c1 ee 20          	shr    $0x20,%rsi
  12:	48 31 f1             	xor    %rsi,%rcx
  15:	48 89 ce             	mov    %rcx,%rsi
  18:	48 c1 ee 10          	shr    $0x10,%rsi
  1c:	48 31 ce             	xor    %rcx,%rsi
  1f:	48 89 f1             	mov    %rsi,%rcx
  22:	48 c1 e9 08          	shr    $0x8,%rcx
  26:	48 31 ce             	xor    %rcx,%rsi
  29:	48 89 f1             	mov    %rsi,%rcx
  2c:	48 c1 e9 04          	shr    $0x4,%rcx
  30:	48 31 f1             	xor    %rsi,%rcx
  33:	83 e1 0f             	and    $0xf,%ecx
  36:	d3 f8                	sar    %cl,%eax
  38:	83 e0 01             	and    $0x1,%eax
  3b:	c3                   	retq   
[…]

Properly written code uses but only 9 instructions in just 24 bytes instead of 19 instructions in 60 bytes:

# Copyright © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"parityti2.s"
.arch	generic64
.code64
.intel_syntax noprefix
.text
				# rsi:rdi = argument
__parityti2:
	xor	rdi, rsi
	shld	rdx, rdi, 32
	xor	edx, edi
	shld	edx, edi, 16
	xor	edx, edi
	xor	eax, eax
	xor	dl, dh
	setpo	al
	ret

.size	__parityti2, .-__parityti2
.type	__parityti2, @function
.global	__parityti2
.end

Note: exploration of the equally bad code generated for the __paritydi2() and __paritysi2() functions is left as an exercise to the reader.

[…]
00000000 <__paritydi2>:
   0:	8b 44 24 08          	mov    0x8(%esp),%eax
   4:	33 44 24 04          	xor    0x4(%esp),%eax
   8:	89 c1                	mov    %eax,%ecx
   a:	c1 e8 10             	shr    $0x10,%eax
   d:	31 c8                	xor    %ecx,%eax
   f:	89 c1                	mov    %eax,%ecx
  11:	c1 e9 08             	shr    $0x8,%ecx
  14:	31 c8                	xor    %ecx,%eax
  16:	89 c1                	mov    %eax,%ecx
  18:	c1 e9 04             	shr    $0x4,%ecx
  1b:	31 c1                	xor    %eax,%ecx
  1d:	b8 96 69 00 00       	mov    $0x6996,%eax
  22:	83 e1 0f             	and    $0xf,%ecx
  25:	d3 f8                	sar    %cl,%eax
  27:	83 e0 01             	and    $0x1,%eax
  2a:	c3                   	ret    
[…]

Case 16: `__builtin_parity()` Builtin

Create the text file case16.c with the following content in an arbitrary, preferable empty directory:

// Copyleft © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

int __paritysi2(unsigned int value) {
    return __builtin_parity(value);
}

int __paritydi2(unsigned long long value) {
    return __builtin_parityll(value);
}

Compile the source file case16.c with GCC, engaging its optimiser, targetting the i386 platform, and display the generated assembly code:

gcc -m32 -o- -O3 -S case16.c

[…]
__paritysi2:
	movl	4(%esp), %eax
	movl	%eax, %edx
	shrl	$16, %eax
	xorl	%eax, %edx
	xorl	%eax, %eax
	xorb	%dh, %dl
	setnp	%al
	ret
[…]
__paritydi2:
	pushl	%ebx
	movl	8(%esp), %eax
	movl	12(%esp), %edx
	movl	%edx, %ebx
	xorl	%eax, %ebx
	movl	%ebx, %edx
	shrl	$16, %ebx
	xorl	%ebx, %edx
	xorl	%eax, %eax
	xorb	%dh, %dl
	setnp	%al
	popl	%ebx
	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"

Oops: the code for the __paritydi2() function clobbers register EBX instead of the volatile register ECX, thus wasting a PUSH and a POP instruction which perform 2 superfluous memory accesses.

Case 17: `__builtin_mul_overflow()` Builtin

Create the text file case17.c with the following content in an arbitrary, preferable empty directory:

// Copyleft © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifndef OPTIMIZE
__int128_t __mulvti3(__int128_t multiplicand, __int128_t multiplier) {
    __int128_t product;

    if (__builtin_mul_overflow(multiplicand, multiplier, &product))
        __builtin_trap();

    return product;
}

__uint128_t __umulvti3(__uint128_t multiplicand, __uint128_t multiplier) {
    __uint128_t product;

    if (__builtin_mul_overflow(multiplicand, multiplier, &product))
        __builtin_trap();

    return product;
}
#else
__int128_t __mulvti3(__int128_t multiplicand, __int128_t multiplier) {
    __uint128_t product, sign = 0 - (multiplicand < 0), tmp = 0 - (multiplier < 0);

    if (__builtin_mul_overflow((multiplicand ^ sign) - sign,
                               (multiplier ^ tmp) - tmp,
                               &product))
        __builtin_trap();

    sign ^= tmp;
    product = (product ^ sign) - sign;

    if ((__int128_t) (product ^ sign) < 0)
        __builtin_trap();

    return product;
}

__uint128_t __umulvti3(__uint128_t multiplicand, __uint128_t multiplier) {
    union {
        __uint128_t all;
        struct {
            unsigned long long low, high;
        };
    } product;
    unsigned long long tmp;

    product.all = (multiplicand & ~0ULL) * (multiplier & ~0ULL);

    if ((((multiplicand >> 64) != 0) && ((multiplier >> 64) != 0))
     || __builtin_umulll_overflow(multiplicand & ~0ULL, multiplier >> 64, &tmp)
     || __builtin_uaddll_overflow(tmp, product.high, &product.high)
     || __builtin_umulll_overflow(multiplicand >> 64, multiplier & ~0ULL, &tmp)
     || __builtin_uaddll_overflow(tmp, product.high, &product.high))
        __builtin_trap();

    return product.all;
}
#endif // OPTIMIZE

Compile the source file case17.c with GCC, engaging its optimiser, targetting the AMD64 platform, and display the generated assembly code:

gcc -m64 -o- -Os -S case17.c

[…]
__mulvti3:
	pushq	%rbp
	movq	%rsi, %r10
	movq	%rdx, %rsi
	movq	%rdi, %rdx
	movq	%rsi, %rax
	sarq	$63, %rdx
	movq	%rcx, %r11
	sarq	$63, %rax
	movq	%rsp, %rbp
	pushq	%r15
	pushq	%r14
	xorl	%r14d, %r14d
	pushq	%r13
	pushq	%r12
	pushq	%rbx
	cmpq	%r10, %rdx
	jne	.L4
	cmpq	%rcx, %rax
	jne	.L5
	movq	%rdi, %rax
	imulq	%rsi
	movq	%rax, %rbx
	movq	%rdx, %r8
	jmp	.L2
.L5:
	movq	%rsi, %rcx
	movq	%r11, %rbx
	movq	%r11, %r8
	movq	%rdi, %r15
	jmp	.L6
.L4:
	cmpq	%rcx, %rax
	jne	.L7
	movq	%rdi, %rcx
	movq	%r10, %rbx
	movq	%r10, %r8
	movq	%rsi, %r15
.L6:
	movq	%rdi, %rax
	mulq	%rsi
	movq	%rax, %r12
	movq	%r15, %rax
	movq	%rdx, %r13
	mulq	%r8
	testq	%r8, %r8
	jns	.L8
	xorl	%r8d, %r8d
	subq	%r8, %rax
	sbbq	%r15, %rdx
.L8:
	testq	%r15, %r15
	jns	.L9
	subq	%rcx, %rax
	sbbq	%rbx, %rdx
.L9:
	movq	%r13, %r8
	xorl	%r9d, %r9d
	movq	%r12, %rbx
	addq	%rax, %r8
	adcq	%rdx, %r9
	movq	%r8, %rdx
	sarq	$63, %rdx
	cmpq	%r9, %rdx
	je	.L2
	imulq	%rsi, %r10
	movq	%rdi, %rax
	imulq	%rdi, %r11
	mulq	%rsi
	leaq	(%r10,%r11), %r8
	addq	%rdx, %r8
	movq	%rax, %rbx
	jmp	.L3
.L7:
	movq	%r10, %r8
	movq	%rcx, %rdx
	movq	%rdi, %rax
	imulq	%rdi, %rdx
	imulq	%rsi, %r8
	addq	%rdx, %r8
	mulq	%rsi
	addq	%rdx, %r8
	leaq	1(%r10), %rdx
	movq	%rax, %rbx
	cmpq	$1, %rdx
	ja	.L3
	leaq	1(%rcx), %rdx
	cmpq	$1, %rdx
	ja	.L3
	cmpq	%rcx, %r10
	jne	.L11
	cmpq	%rbx, %r14
	movq	%r14, %rax
	sbbq	%r8, %rax
	jl	.L2
	jmp	.L3
.L11:
	testq	%r8, %r8
	js	.L2
.L3:
	movl	$1, %r14d
.L2:
	testq	%r14, %r14
	je	.L1
	ud2
.L1:
	movq	%rbx, %rax
	movq	%r8, %rdx
	popq	%rbx
	popq	%r12
	popq	%r13
	popq	%r14
	popq	%r15
	popq	%rbp
	ret
[…]
__umulvti3:
	pushq	%rbp
	movq	%rsi, %r10
	movq	%rcx, %r11
	movq	%rdx, %rsi
	movq	%rsp, %rbp
	pushq	%r12
	xorl	%r12d, %r12d
	pushq	%rbx
	testq	%r10, %r10
	jne	.L18
	testq	%rcx, %rcx
	jne	.L19
	movq	%rdi, %rax
	mulq	%rdx
	movq	%rax, %rbx
	movq	%rdx, %r8
	jmp	.L16
.L19:
	movq	%rcx, %r8
	movq	%rdi, %r9
	jmp	.L20
.L18:
	testq	%rcx, %rcx
	jne	.L22
	movq	%r10, %r8
	movq	%rdx, %r9
.L20:
	movq	%rdi, %rax
	mulq	%rsi
	movq	%rax, %rcx
	movq	%r9, %rax
	movq	%rdx, %rbx
	xorl	%r9d, %r9d
	mulq	%r8
	movq	%rbx, %r8
	movq	%rcx, %rbx
	addq	%rax, %r8
	adcq	%rdx, %r9
	testq	%r9, %r9
	je	.L16
.L22:
	imulq	%rsi, %r10
	movq	%rdi, %rax
	movl	$1, %r12d
	imulq	%rdi, %r11
	mulq	%rsi
	leaq	(%r10,%r11), %r8
	addq	%rdx, %r8
	movq	%rax, %rbx
.L16:
	testq	%r12, %r12
	je	.L15
	ud2
.L15:
	movq	%rbx, %rax
	movq	%r8, %rdx
	popq	%rbx
	popq	%r12
	popq	%rbp
	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"

OUCH: 104 instructions in 295 bytes for the __mulvti3() function – still horrible, but not as bad as the __mulvti3() function provided in libgcc any more!

Ouch: 54 instructions in 148 bytes for the __umulvti3() function – also rather terrible!

Compile the source file case17.c a second time with GCC, now with the preprocessor macro OPTIMIZE defined, and display the generated assembly code:

gcc -DOPTIMIZE -m64 -o- -Os -S case17.c

[…]
__mulvti3:
	pushq	%rbp
	movq	%rcx, %rax
	movq	%rdi, %r10
	movq	%rsi, %r11
	movq	%rax, %rdi
	sarq	$63, %rdi
	movq	%rsp, %rbp
	pushq	%r14
	movq	%rdi, %r14
	pushq	%r13
	pushq	%r12
	pushq	%rbx
	movq	%rsi, %rbx
	movslq	%edi, %rsi
	sarq	$63, %rbx
	movslq	%ebx, %rcx
	movq	%rbx, %r12
	movq	%rcx, %r8
	sarq	$63, %rcx
	xorq	%r8, %r10
	movq	%rcx, %r9
	movq	%rsi, %rcx
	sarq	$63, %rsi
	movq	%rsi, %rbx
	xorq	%r9, %r11
	movq	%r10, %rsi
	subq	%r8, %rsi
	movq	%r11, %rdi
	sbbq	%r9, %rdi
	xorq	%rcx, %rdx
	xorq	%rbx, %rax
	movq	%rdx, %r10
	movq	%rax, %r11
	subq	%rcx, %r10
	sbbq	%rbx, %r11
	xorl	%r13d, %r13d
	movq	%r11, %r8
	testq	%rdi, %rdi
	jne	.L4
	testq	%r11, %r11
	jne	.L5
	movq	%rsi, %rax
	mulq	%r10
	movq	%rax, %rcx
	movq	%rdx, %r8
	jmp	.L2
.L5:
	movq	%rsi, %r9
	jmp	.L6
.L4:
	testq	%r11, %r11
	jne	.L8
	movq	%rdi, %r8
	movq	%r10, %r9
.L6:
	movq	%rsi, %rax
	mulq	%r10
	movq	%rax, %rcx
	movq	%r9, %rax
	movq	%rdx, %rbx
	xorl	%r9d, %r9d
	mulq	%r8
	movq	%rbx, %r8
	addq	%rax, %r8
	adcq	%rdx, %r9
	testq	%r9, %r9
	je	.L2
.L8:
	movq	%rdi, %r8
	movq	%r11, %rax
	movl	$1, %r13d
	imulq	%rsi, %rax
	imulq	%r10, %r8
	addq	%rax, %r8
	movq	%rsi, %rax
	mulq	%r10
	addq	%rdx, %r8
	movq	%rax, %rcx
.L2:
	testq	%r13, %r13
	je	.L9
.L10:
	ud2
.L9:
	xorl	%r14d, %r12d
	movslq	%r12d, %r12
	movq	%r12, %rsi
	sarq	$63, %r12
	xorq	%rsi, %rcx
	xorq	%r12, %r8
	movq	%r12, %rbx
	movq	%rcx, %rax
	movq	%r8, %rdx
	subq	%rsi, %rax
	sbbq	%r12, %rdx
	xorq	%rdx, %rbx
	js	.L10
	popq	%rbx
	popq	%r12
	popq	%r13
	popq	%r14
	popq	%rbp
	ret
[…]
__umulvti3:
	movq	%rdi, %rax
	movq	%rdx, %r8
	mulq	%rdx
	movq	%rdx, %r11
	movq	%rax, %r9
	testq	%rsi, %rsi
	je	.L14
	testq	%rcx, %rcx
	je	.L14
.L17:
	ud2
.L14:
	movq	%rdi, %rax
	mulq	%rcx
	movq	%rax, %rdi
	jo	.L17
	addq	%r11, %rdi
	jc	.L17
	movq	%rsi, %rax
	mulq	%r8
	movq	%rax, %rsi
	jo	.L17
	addq	%rdi, %rsi
	movq	%rsi, %rdx
	jc	.L17
	movq	%r9, %rax
	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"

Oops: the optimised implementation coaxes GCC to generate 96 instructions in 273 bytes for the __mulvti3() function – a little shorter and faster than the code generated for __builtin_mul_overflow(), but more than twice the size of the proper assembly code shown in case 13 above, and less than half as fast!

OOPS: 25 instructions in 66 bytes for the __umulvti3() function – also not optimal, but much better than the rather terrible code generated for __builtin_mul_overflow().

Note: exploration of the equally bad code generated for checked multiplication of signed and unsigned 64-bit integers on the i386 platform is left as an exercise to the reader.

Case 18: `__builtin_sub_overflow()` Builtin

Create the text file case18.c with the following content in an arbitrary, preferable empty directory:

// Copyleft © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

__int128_t __absvti2(__int128_t value) {
    __int128_t sign = 0 - (value < 0);

    if (__builtin_sub_overflow(value ^ sign, sign, &value))
        __builtin_trap();

    return value;
}

__int128_t __negvti2(__int128_t negend) {
    __int128_t negation;

    if (__builtin_sub_overflow(0, negend, &negation))
        __builtin_trap();

    return negation;
}

__int128_t __subvti3(__int128_t minuend, __int128_t subtrahend) {
    __int128_t difference;

    if (__builtin_sub_overflow(minuend, subtrahend, &difference))
        __builtin_trap();

    return difference;
}

Compile the source file case18.c with GCC, engaging its optimiser, targetting the AMD64 platform, and display the generated assembly code:

gcc -m64 -o- -Os -S case18.c

[…]
__absvti2:
	movq	%rsi, %rax
	movq	%rdi, %r8
	movq	%rsi, %rcx
	sarq	$63, %rax
	cltq
	movq	%rax, %rsi
	sarq	$63, %rax
	movq	%rax, %rdi
	xorq	%rsi, %r8
	xorq	%rdi, %rcx
	movq	%r8, %rax
	movq	%rcx, %rdx
	subq	%rsi, %rax
	sbbq	%rdi, %rdx
	jno	.L1
	ud2
.L1:
	ret
[…]
__negvti2:
	movq	%rdi, %r9
	movq	%rsi, %r8
	movq	%rdi, %rsi
	xorl	%edx, %edx
	movq	%r8, %rdi
	movq	%r9, %r8
	movl	$1, %eax
	negq	%r8
	movq	%rdi, %r9
	adcq	$0, %r9
	salq	$63, %rax
	negq	%r9
	cmpq	%rax, %rdi
	jne	.L7
	xorl	%edx, %edx
	testq	%rsi, %rsi
	sete	%dl
.L7:
	testq	%rdx, %rdx
	je	.L6
	ud2
.L6:
	movq	%r8, %rax
	movq	%r9, %rdx
	ret
[…]
__subvti3:
	movq	%rsi, %r8
	movq	%rdi, %rsi
	movq	%rsi, %rax
	movq	%r8, %rdi
	movq	%rdx, %r8
	subq	%r8, %rax
	movq	%rdi, %rdx
	sbbq	%rcx, %rdx
	jno	.L11
	ud2
.L11:
	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"

OUCH: terrible code for the __absvti2() and __negvti2() functions!

OOPS: awful code with 4 superfluous MOV instructions in a total of 11 instructions for the __subvti3() function.

Case 19: `__builtin_copysign()` Builtin

Create the text file case19.c with the following content in an arbitrary, preferable empty directory:

// Copyleft © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

double copysign(double destination, double source) {
    return __builtin_copysign(destination, source);
}

Compile the source file case19.c with GCC, engaging its optimiser, targetting the i386 platform, and display the generated assembly code:

gcc -m32 -o- -O3 -S case19.c

[…]
copysign:
	fldl	12(%esp)
	fxam
	fnstsw	%ax
	fstp	%st(0)
	fldl	4(%esp)
	fabs
	testb	$2, %ah
	je	L1
	fchs
L1:
	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"

Ouch: 10 instructions in 24 bytes, including a conditional branch which impairs performance!

Properly written code uses but only 5 instructions in just 19 bytes, without conditional branch:

# Copyright © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"copysign.s"
.arch	generic32
.code32
.intel_syntax noprefix
.text

copysign:
	mov	eax, [esp+16]
	shld	[esp+8], eax, 1
	ror	[esp+8], 1
	fld	qword ptr [esp+4]
	ret

.size	copysign, .-copysign
.type	copysign, @function
.global	copysign
.end

Compile the source file case19.c a second time with GCC, now targetting the AMD64 platform, and display the generated assembly code:

gcc -m64 -o- -O3 -S case19.c

[…]
copysign:
	pushq	%rbp
	andpd	.LC1(%rip), %xmm0
	movapd	%xmm1, %xmm2
	andpd	.LC0(%rip), %xmm2
	movq	%rsp, %rbp
	orpd	%xmm2, %xmm0
	popq	%rbp
	ret
	.section .rdata,"dr"
	.align 16
.LC0:
	.long	0
	.long	-2147483648
	.long	0
	.long	0
	.align 16
.LC1:
	.long	-1
	.long	2147483647
	.long	0
	.long	0
[…]
	.ident	"GCC: (GNU […]) 10.2.0"

OUCH: 8 instructions in 30 bytes, plus 2 constants in 32 bytes, of which either 4 instructions or (at least) one of the 2 constants are superfluous!

Properly written code uses but only 6 instructions in either 28 or just 24 bytes, and performs no memory access:

# Copyright © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"copysign.s"
.arch	generic64
.code64
.intel_syntax noprefix
.text
				# xmm0 = destination
				# xmm1 = source
copysign:
.ifnotdef OPTIMIZE
	mov	rax, 0x7FFFFFFFFFFFFFFF
	movd	xmm2, rax
	andpd	xmm0, xmm2
	andnpd	xmm2, xmm1
	orpd	xmm0, xmm2
.else
	movd	rcx, xmm0
	movd	rdx, xmm1
	shld	rcx, rdx, 1
	ror	rcx, 1
	movd	xmm0, rcx
.endif
	ret

.size	copysign, .-copysign
.type	copysign, @function
.global	copysign
.end

Note: exploration of the equally bad code generated for the copysignf() and copysignl() functions is left as an exercise to the reader.

Case 20: `-fsanitize=signed-integer-overflow` Command Line Option

Create the text file case20.c with the following content in an arbitrary, preferable empty directory:

// Copyleft © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

int si2(int u) {
    return v << 31;
}

long long di2(long long u) {
    return v << 63;
}

#ifdef __amd64__
__int128_t ti2(__int128_t u) {
    return v << 127;
}
#endif

Compile the source file case20.c with GCC, enabling signed overflow detection, engaging its optimiser, targetting the AMD64 platform, and display the generated assembly code:

gcc -fsanitize=signed-integer-overflow -m64 -o- -Os -S case20.c

[…]
si2:
	movl	%edi, %eax
	sall	$31, %eax
	ret
[…]
di2:
	movq	%rdi, %rax
	salq	$63, %rax
	ret
[…]
ti2:
	movq	%rdi, %rdx
	xorl	%eax, %eax
	salq	$63, %rdx
	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"

OUCH: GCC fails to generate code for integer overflow detection – which occurs for arguments not equal 0 here!

Note: exploration of this bug on the i386 platform is left as an exercise to the reader.

gcc -fsanitize=signed-integer-overflow -m32 -o- -Os -S case20.c

[…]
_si2:
	movl	4(%esp), %eax
	sall	$31, %eax
	ret
[…]
_di2:
	movl	4(%esp), %edx
	xorl	%eax, %eax
	sall	$31, %edx
	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"

Case 21: `-ftrapv` Command Line Option

Create the text file case21.c with the following content in an arbitrary, preferable empty directory:

// Copyleft © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

int si3(int v, int w) {
    volatile int p = v * w, q = v / w, r = v % w, s = v + w, t = v - w, u = -v;
    return v < 0 ? -v : v;
}

long long di3(long long v, long long w) {
    volatile long long p = v * w, q = v / w, r = v % w, s = v + w, t = v - w, u = -v;
    return v < 0 ? -v : v;
}

#ifdef __amd64__
__int128_t ti3(__int128_t v, __int128_t w) {
    volatile __int128_t p = v * w, q = v / w, r = v % w, s = v + w, t = v - w, u = -v;
    return v < 0 ? -v : v;
}
#endif

Compile the source file case21.c with GCC, enabling overflow-trapping, engaging its optimiser, targetting the AMD64 platform, and display the generated assembly code:

gcc -ftrapv -m64 -o- -Os -S case21.c

[…]
	.globl	__mulvsi3
	.globl	__addvsi3
	.globl	__subvsi3
	.globl	__negvsi2
	.globl	si3
	.type	si3, @function
si3:
	pushq	%rbp
	movl	%esi, %ebp
	pushq	%rbx
	movl	%edi, %ebx
	subq	$40, %rsp
	call	__mulvsi3
	movl	%ebp, %esi
	movl	%ebx, %edi
	movl	%eax, 8(%rsp)
	movl	%ebx, %eax
	cltd
	idivl	%ebp
	movl	%eax, 12(%rsp)
	movl	%edx, 16(%rsp)
	call	__addvsi3
	movl	%ebp, %esi
	movl	%ebx, %edi
	movl	%eax, 20(%rsp)
	call	__subvsi3
	movl	%ebx, %edi
	movl	%eax, 24(%rsp)
	call	__negvsi2
	movl	%ebx, %esi
	movl	%ebx, %edi
	sarl	$31, %esi
	movl	%eax, 28(%rsp)
	xorl	%esi, %edi
	call	__subvsi3
	addq	$40, %rsp
	popq	%rbx
	popq	%rbp
	ret
[…]
	.globl	__mulvdi3
	.globl	__addvdi3
	.globl	__subvdi3
	.globl	__negvdi2
	.globl	di3
	.type	di3, @function
di3:
	pushq	%rbp
	movq	%rsi, %rbp
	pushq	%rbx
	movq	%rdi, %rbx
	subq	$56, %rsp
	call	__mulvdi3
	movq	%rbp, %rsi
	movq	%rbx, %rdi
	movq	%rax, (%rsp)
	movq	%rbx, %rax
	cqto
	idivq	%rbp
	movq	%rax, 8(%rsp)
	movq	%rdx, 16(%rsp)
	call	__addvdi3
	movq	%rbp, %rsi
	movq	%rbx, %rdi
	movq	%rax, 24(%rsp)
	call	__subvdi3
	movq	%rbx, %rdi
	movq	%rax, 32(%rsp)
	call	__negvdi2
	movq	%rbx, %rsi
	movq	%rbx, %rdi
	sarq	$63, %rsi
	movq	%rax, 40(%rsp)
	xorq	%rsi, %rdi
	call	__subvdi3
	addq	$56, %rsp
	popq	%rbx
	popq	%rbp
	ret
[…]
	.globl	__mulvti3
	.globl	__divti3
	.globl	__modti3
	.globl	__addvti3
	.globl	__subvti3
	.globl	__negvti2
	.globl	ti3
	.type	ti3, @function
ti3:
	pushq	%r14
	movq	%rcx, %r14
	pushq	%r13
	movq	%rdx, %r13
	pushq	%rbp
	movq	%rdi, %rbp
	pushq	%rbx
	movq	%rsi, %rbx
	subq	$104, %rsp
	call	__mulvti3
	movq	%r14, %rcx
	movq	%rbp, %rdi
	movq	%rbx, %rsi
	movq	%rax, (%rsp)
	movq	%rdx, 8(%rsp)
	movq	%r13, %rdx
	call	__divti3
	movq	%r14, %rcx
	movq	%rbp, %rdi
	movq	%rbx, %rsi
	movq	%rax, 16(%rsp)
	movq	%rdx, 24(%rsp)
	movq	%r13, %rdx
	call	__modti3
	movq	%r14, %rcx
	movq	%rbp, %rdi
	movq	%rbx, %rsi
	movq	%rax, 32(%rsp)
	movq	%rdx, 40(%rsp)
	movq	%r13, %rdx
	call	__addvti3
	movq	%r14, %rcx
	movq	%rbp, %rdi
	movq	%rbx, %rsi
	movq	%rax, 48(%rsp)
	movq	%rdx, 56(%rsp)
	movq	%r13, %rdx
	call	__subvti3
	movq	%rbp, %rdi
	movq	%rbx, %rsi
	movq	%rax, 64(%rsp)
	movq	%rdx, 72(%rsp)
	call	__negvti2
	movq	%rbx, %rcx
	movq	%rbp, %rdi
	sarq	$63, %rcx
	movq	%rax, 80(%rsp)
	xorq	%rcx, %rbx
	movq	%rdx, 88(%rsp)
	xorq	%rcx, %rdi
	movq	%rcx, %rdx
	movq	%rbx, %rsi
	call	__subvti3
	addq	$104, %rsp
	popq	%rbx
	popq	%rbp
	popq	%r13
	popq	%r14
	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"

Oops: instead to generate a single call of the library function __divmodti4() which returns both quotient and remainder, GCC generates separate calls of the library functions __divti3() and __modti3(), doubling the execution time.

Ouch: instead to generate the trivial code for overflow-trapping addition, negation and subtraction inline, GCC generates calls of the library functions __addv?i3(), __mulv?i3(), __negv?i2() and __subv?i3(), increasing the execution time about an order of magnitude!

Note: exploration of the equally bad code generated for overflow-trapping integer arithmetic of 64-bit integers on the i386 platform is left as an exercise to the reader.

gcc -ftrapv -m32 -o- -Os -S case21.c

[…]
_si3:
	pushl	%esi
	pushl	%ebx
	subl	$52, %esp
	movl	64(%esp), %ebx
	movl	68(%esp), %esi
	movl	%ebx, (%esp)
	movl	%esi, 4(%esp)
	call	___mulvsi3
	movl	%esi, 4(%esp)
	movl	%eax, 24(%esp)
	movl	%ebx, %eax
	cltd
	movl	%ebx, (%esp)
	idivl	%esi
	movl	%eax, 28(%esp)
	movl	%edx, 32(%esp)
	call	___addvsi3
	movl	%esi, 4(%esp)
	movl	%ebx, (%esp)
	movl	%eax, 36(%esp)
	call	___subvsi3
	movl	%ebx, (%esp)
	movl	%eax, 40(%esp)
	call	___negvsi2
	movl	%eax, 44(%esp)
	movl	%ebx, %eax
	sarl	$31, %eax
	xorl	%eax, %ebx
	movl	%eax, 4(%esp)
	movl	%ebx, (%esp)
	call	___subvsi3
	addl	$52, %esp
	popl	%ebx
	popl	%esi
	ret
[…]
_di3:
	pushl	%ebp
	pushl	%edi
	pushl	%esi
	pushl	%ebx
	subl	$76, %esp
	movl	100(%esp), %ebx
	movl	96(%esp), %ebp
	movl	104(%esp), %esi
	movl	108(%esp), %edi
	movl	%ebp, (%esp)
	movl	%esi, 8(%esp)
	movl	%edi, 12(%esp)
	movl	%ebx, 4(%esp)
	call	___mulvdi3
	movl	%esi, 8(%esp)
	movl	%edi, 12(%esp)
	movl	%ebp, (%esp)
	movl	%ebx, 4(%esp)
	movl	%eax, 16(%esp)
	movl	%edx, 20(%esp)
	call	___divdi3
	movl	%esi, 8(%esp)
	movl	%edi, 12(%esp)
	movl	%ebp, (%esp)
	movl	%ebx, 4(%esp)
	movl	%eax, 24(%esp)
	movl	%edx, 28(%esp)
	call	___moddi3
	movl	%esi, 8(%esp)
	movl	%edi, 12(%esp)
	movl	%ebp, (%esp)
	movl	%ebx, 4(%esp)
	movl	%eax, 32(%esp)
	movl	%edx, 36(%esp)
	call	___addvdi3
	movl	%esi, 8(%esp)
	movl	%edi, 12(%esp)
	movl	%ebp, (%esp)
	movl	%ebx, 4(%esp)
	movl	%eax, 40(%esp)
	movl	%edx, 44(%esp)
	call	___subvdi3
	movl	%ebp, (%esp)
	movl	%ebx, 4(%esp)
	movl	%eax, 48(%esp)
	movl	%edx, 52(%esp)
	call	___negvdi2
	movl	%eax, 56(%esp)
	movl	%ebx, %eax
	sarl	$31, %eax
	movl	%edx, 60(%esp)
	xorl	%eax, %ebp
	xorl	%eax, %ebx
	movl	%eax, 8(%esp)
	movl	%ebp, (%esp)
	movl	%ebx, 4(%esp)
	movl	%eax, 12(%esp)
	call	___subvdi3
	addl	$76, %esp
	popl	%ebx
	popl	%esi
	popl	%edi
	popl	%ebp
	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"

Case 22: Shell Game, or Crazy Braindead Register Allocator

Take a second look at (the beginning of) the __mulvti3() function and the code generated for it:

/* More subroutines needed by GCC output code on some machines.  */
/* Compile this one with gcc.  */
/* Copyright (C) 1989-2024 Free Software Foundation, Inc.
[…]
typedef union {
  DWtype ll;
  struct {
    Wtype low, high;
  } s;
} DWunion;
[…]
DWtype
__mulvDI3 (DWtype u, DWtype v)
{
  /* The unchecked multiplication needs 3 Wtype x Wtype multiplications,
     but the checked multiplication needs only two.  */
  const DWunion uu = {.ll = u};
  const DWunion vv = {.ll = v};

  if (__builtin_expect (uu.s.high == uu.s.low >> (W_TYPE_SIZE - 1), 1))
    {
      /* u fits in a single Wtype.  */
      if (__builtin_expect (vv.s.high == vv.s.low >> (W_TYPE_SIZE - 1), 1))
	{
	  /* v fits in a single Wtype as well.  */
	  /* A single multiplication.  No overflow risk.  */
	  return (DWtype) uu.s.low * (DWtype) vv.s.low;
	}
[…]

[…]
0000000000000000 <__mulvti3>:
   0:	41 55                	push   %r13
   2:	49 89 cb             	mov    %rcx,%r11
   5:	48 89 d0             	mov    %rdx,%rax
   8:	49 89 d2             	mov    %rdx,%r10
   b:	41 54                	push   %r12
   d:	49 89 fc             	mov    %rdi,%r12
  10:	48 89 d1             	mov    %rdx,%rcx
  13:	49 89 f0             	mov    %rsi,%r8
  16:	4c 89 e2             	mov    %r12,%rdx
  19:	49 89 f5             	mov    %rsi,%r13
  1c:	53                   	push   %rbx
  1d:	48 89 fe             	mov    %rdi,%rsi
  20:	48 c1 fa 3f          	sar    $0x3f,%rdx
  24:	48 c1 f8 3f          	sar    $0x3f,%rax
  28:	4c 89 df             	mov    %r11,%rdi
  2b:	4c 39 c2             	cmp    %r8,%rdx
  2e:	75 18                	jne    48 <__mulvti3+0x48>
  30:	4c 39 d8             	cmp    %r11,%rax
  33:	75 6b                	jne    a0 <__mulvti3+0xa0>
  35:	4c 89 e0             	mov    %r12,%rax
  38:	49 f7 ea             	imul   %r10
  3b:	5b                   	pop    %rbx
  3c:	41 5c                	pop    %r12
  3e:	41 5d                	pop    %r13
  40:	c3                   	retq   
[…]
	.ident	"GCC: (GNU […]) 10.2.0"

OUCH: 25 instructions in 65 bytes, including 8 superfluous MOV instructions shuffling registers, clobbering the non-volatile registers RBX, R12 and R13 without necessity, and performing 6 superfluous memory accesses!

Properly written and straightforward code uses but only 11 instructions in just 31 bytes, without clobbering non-volatile registers, and without memory accesses at all:

# Copyright © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"mulvti3.s"
.arch	generic64
.code64
.intel_syntax noprefix
.global	__mulvti3
.type	__mulvti3, @function
.text
				# rsi:rdi = multiplicand
				# rcx:rdx = multiplier
__mulvti3:
	mov	r8, rdi
	mov	r9, rdx
	sra	r8, 63
	sra	r9, 63
	cmp	r8, rsi
	jne	__mulvti3+0x48
	cmp	r9, rcx
	jne	__mulvti3+0xa0
	mov	rax, rdi
	imul	rdx
	ret
[…]

Case 23: Undefined Behaviour in `__bswapsi2()` Function

The function __bswapsi2() defined in libgcc2.c exhibits Undefined Behaviour in its last line:

/* More subroutines needed by GCC output code on some machines.  */
/* Compile this one with gcc.  */
/* Copyright (C) 1989-2024 Free Software Foundation, Inc.
[…]
typedef int SItype __attribute__ ((mode (SI)));
[…]
SItype
__bswapsi2 (SItype u)
{
  return ((((u) & 0xff000000) >> 24)
	  | (((u) & 0x00ff0000) >>  8)
	  | (((u) & 0x0000ff00) <<  8)
	  | (((u) & 0x000000ff) << 24));
}
[…]

3.4.3
Undefined behavior
[…]
EXAMPLE An example of undefined behavior is the behavior on integer overflow.

Case 24: Undefined Behaviour or Optimiser Failure?

Create the text file case24.c with the following content in an arbitrary, preferable empty directory:

// Copyleft © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

int __absusi2(int value) {
    if (value < 0)
        value = -value;

    if (value < 0)
        __builtin_trap();

    return value;
}

int __absvsi2(int value) {
    const int sign = value >> 31;
    value += sign;
    value ^= sign;

    if (value < 0)
        __builtin_trap();

    return value;
}

int __abswsi2(int value) {
    const int sign = 0 - (value < 0);
    value ^= sign;
    value -= sign;

    if (value < 0)
        __builtin_trap();

    return value;
}

long long __absudi2(long long value) {
    if (value < 0)
        value = -value;

    if (value < 0)
        __builtin_trap();

    return value;
}

long long __absvdi2(long long value) {
    const long long sign = value >> 63;
    value += sign;
    value ^= sign;

    if (value < 0)
        __builtin_trap();

    return value;
}

long long __abswdi2(long long value) {
    const long long sign = 0 - (value < 0);
    value ^= sign;
    value -= sign;

    if (value < 0)
        __builtin_trap();

    return value;
}

#ifdef __amd64__
__int128_t __absuti2(__int128_t value) {
    if (value < 0)
        value = -value;

    if (value < 0)
        __builtin_trap();

    return value;
}

__int128_t __absvti2(__int128_t value) {
    const __int128_t sign = value >> 127;
    value += sign;
    value ^= sign;

    if (value < 0)
        __builtin_trap();

    return value;
}

__int128_t __abswti2(__int128_t value) {
    const __int128_t sign = 0 - (value < 0);
    value ^= sign;
    value -= sign;

    if (value < 0)
        __builtin_trap();

    return value;
}
#endif // __amd64__

Compile the source file case24.c with GCC, engaging its optimiser, targetting the AMD64 platform, and display the generated assembly code:

gcc -m64 -o- -Os -S -Wall -Wextra case24.c

[…]
__absusi2:
	movl	%edi, %eax
	cltd
	xorl	%edx, %eax
	subl	%edx, %eax
	ret
[…]
__absvsi2:
	movl	%edi, %edx
	sarl	$31, %edx
	leal	(%rdi,%rdx), %eax
	xorl	%edx, %eax
	jns	.L2
	ud2
.L2:
	ret
[…]
__abswsi2:
	movl	%edi, %eax
	movl	%edi, %edx
	shrl	$31, %eax
	movl	%eax, %edi
	negl	%edi
	xorl	%edx, %edi
	addl	%edi, %eax
	jns	.L5
	ud2
.L5:
	ret
[…]
__absudi2:
	movq	%rdi, %rax
	cqto
	xorq	%rdx, %rax
	subq	%rdx, %rax
	ret
[…]
__absvdi2:
	movq	%rdi, %rdx
	sarq	$63, %rdx
	leaq	(%rdi,%rdx), %rax
	xorq	%rdx, %rax
	jns	.L8
	ud2
.L8:
	ret
[…]
__abswdi2:
	movq	%rdi, %rax
	cqto
	movslq	%edx, %rdx
	xorq	%rdx, %rax
	subq	%rdx, %rax
	jns	.L10
	ud2
.L10:
	ret
[…]
__absuti2:
	movq	%rsi, %rax
	movq	%rdi, %r8
	movq	%rsi, %rcx
	sarq	$63, %rax
	movq	%rax, %rsi
	xorq	%rax, %r8
	xorq	%rsi, %rcx
	movq	%r8, %rax
	movq	%rcx, %rdx
	subq	%rsi, %rax
	sbbq	%rsi, %rdx
	ret
[…]
__absvti2:
	movq	%rdi, %rax
	movq	%rsi, %rdi
	movq	%rsi, %rdx
	sarq	$63, %rdi
	addq	%rdi, %rax
	adcq	%rdi, %rdx
	xorq	%rdi, %rax
	xorq	%rdi, %rdx
	jns	.L13
	ud2
.L13:
	ret
[…]
__abswti2:
	movq	%rsi, %rax
	movq	%rdi, %r8
	movq	%rsi, %rcx
	sarq	$63, %rax
	cltq
	movq	%rax, %rsi
	sarq	$63, %rax
	movq	%rax, %rdi
	xorq	%rsi, %r8
	xorq	%rdi, %rcx
	movq	%r8, %rax
	movq	%rcx, %rdx
	subq	%rsi, %rax
	sbbq	%rdi, %rdx
	testq	%rdx, %rdx
	jns	.L15
	ud2
.L15:
	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"

Ouch: the ~~optimiser~~ pessimiser removes the conditional expression detecting possible integer overflow of the negation, i.e. undefined behaviour, for the __absusi2(), __absudi2() and __absuti2() functions – without warning, despite the command line options -Wall -Wextra given.

Oops: the optimiser fails to recognise the common and well-known alternative implementations used for the __absvsi2(), __absvdi2() and __absvti2() plus the __abswsi2(), __abswdi2() and __abswti2() functions, and generates unoptimised and awful code especially for the latter.

Note: exploration of the equally bad code generated for the i386 platform is left as an exercise to the reader.

Case 25: Optimiser Failures

Create the text file case25.c with the following content in an arbitrary, preferable empty directory:

// Copyleft © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifndef __amd64__
inline
unsigned long long __rotldi3(unsigned long long m, unsigned n) {
    return (m << n) | (m >> (64 - n));
}

inline
unsigned long long __rotrdi3(unsigned long long m, unsigned n) {
    return (m >> n) | (m << (64 - n));
}

unsigned long long right63(unsigned long long o) {
    return __rotrdi3(o, 63);
}

unsigned long long right33(unsigned long long o) {
    return __rotrdi3(o, 33);
}

unsigned long long right31(unsigned long long o) {
    return __rotrdi3(o, 31);
}

unsigned long long right1(unsigned long long o) {
    return __rotrdi3(o, 1);
}
#else
inline
__uint128_t __rotlti3(__uint128_t m, unsigned n) {
    return (m << n) | (m >> (128 - n));
}

inline
__uint128_t __rotrti3(__uint128_t m, unsigned n) {
    return (m >> n) | (m << (128 - n));
}

__uint128_t left127(__uint128_t o) {
    return __rotlti3(o, 127);
}

__uint128_t left65(__uint128_t o) {
    return __rotlti3(o, 65);
}

__uint128_t left63(__uint128_t o) {
    return __rotlti3(o, 63);
}

__uint128_t left1(__uint128_t o) {
    return __rotlti3(o, 1);
}
#endif // __amd64__

Compile the source file case26.c with GCC, engaging its optimiser, targetting the i386 platform, and display the generated assembly code:

gcc -m32 -o- -O3 -S case26.c

Note: the left column shows the generated code in AT&T syntax, while the right column shows properly optimised shorter code in Intel syntax as comments.

[…]
right63:
	movl	4(%esp), %eax	#	mov	eax, [esp+4]
	movl	8(%esp), %edx	#	mov	edx, [esp+8]
	movl	%eax, %ecx	#	bt	eax, 31
	shldl	$1, %edx, %eax	#	adc	edx, edx
	shldl	$1, %ecx, %edx	#	adc	eax, eax
	ret			#	ret
[…]
right33:
	movl	4(%esp), %eax	#	mov	edx, [esp+4]
	movl	8(%esp), %edx	#	mov	eax, [esp+8]
	movl	%eax, %ecx	#	bt	edx, 1
	shldl	$31, %edx, %eax	#	rcr	eax, 1
	shldl	$31, %ecx, %edx	#	rcr	edx, 1
	ret			#	ret
[…]
right31:
	movl	4(%esp), %eax	#	mov	edx, [esp+4]
	movl	8(%esp), %edx	#	mov	eax, [esp+8]
	movl	%eax, %ecx	#	bt	edx, 31
	shrdl	$31, %edx, %eax	#	adc	eax, eax
	shrdl	$31, %ecx, %edx	#	adc	edx, edx
	ret			#	ret
[…]
right1:
	movl	4(%esp), %eax	#	mov	eax, [esp+4]
	movl	8(%esp), %edx	#	mov	edx, [esp+8]
	movl	%eax, %ecx	#	bt	eax, 1
	shrdl	$1, %edx, %eax	#	rcr	edx, 1
	shrdl	$1, %ecx, %edx	#	rcr	eax, 1
	ret			#	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"

OOPS: while GCC replaces the circular left shift by 63 with the equivalent circular right shift by 1, it but fails to replace the circular left shift by 33, i.e. register size plus 1, with the equivalent circular left shift by 1 of the exchanged registers, and also fails to replace the circular left shift by 31, i.e. register size − 1, with the equivalent circular right shift by 1 of the exchanged registers!

OUCH: additionally GCC clobbers register ECX in all 4 examples without necessity, and (ab)uses slow SHLD respectively SHRD instructions instead of fast RCL respectively RCR instructions or even faster ADC instructions!

Compile the source file case25.c a second time with GCC, now targetting the AMD64 platform, and display the generated assembly code:

gcc -m64 -o- -O3 -S case25.c

Note: the left column shows the generated code in AT&T syntax, while the right column shows properly optimised shorter code in Intel syntax as comments.

[…]
left127:
	movq	%rdi, %rax	#	mov	rax, rdi
	movq	%rsi, %rdx	#	mov	rdx, rsi
				#	shr	esi, 1
	shrdq	$1, %rsi, %rax	#	rcr	rax, 1
	shrdq	$1, %rdi, %rdx	#	rcr	rdx, 1
	ret			#	ret
[…]
left65:
	movq	%rdi, %rax	#	mov	rax, rsi
	movq	%rsi, %rdx	#	mov	rdx, rdi
				#	add	rdi, rdi
	shrdq	$63, %rsi, %rax	#	adc	rax, rax
	shrdq	$63, %rdi, %rdx	#	adc	rdx, rdx
	ret			#	ret
[…]
left63:
	movq	%rdi, %rax	#	mov	rax, rsi
	movq	%rsi, %rdx	#	mov	rdx, rdi
				#	shr	edi, 1
	shldq	$63, %rsi, %rax	#	rcr	rax, 1
	shldq	$63, %rdi, %rdx	#	rcr	rdx, 1
	ret			#	ret
[…]
left1:
	movq	%rdi, %rax	#	mov	rax, rdi
	movq	%rsi, %rdx	#	mov	rdx, rsi
				#	add	rsi, rsi
	shldq	$1, %rsi, %rax	#	adc	rax, rax
	shldq	$1, %rdi, %rdx	#	adc	rdx, rdx
	ret			#	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"

OUCH: like above, GCC (ab)uses slow SHLD respectively SHRD instructions instead of fast RCL respectively RCR instructions or even faster ADC instructions!

Note: unlike above, the second of the SHLD respectively SHRD instructions does not depend on the first one, so processors with 2 execution ports for slow shift instructions can execute them in parallel.

Case 26: More Optimiser Failures

Create the text file case26.c with the following content in an arbitrary, preferable empty directory:

// Copyleft © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

int sign32(int value) {
    return 0 - (value < 0);
}

long long sign64(long long value) {
#ifdef ALTERNATE
    return 0LL - (value < 0);
#else
    return 0 - (value < 0);
#endif
}

#ifdef __amd64__
__int128_t sign128(__int128_t value) {
#ifdef ALTERNATE
    return (__int128_t) 0 - (value < 0);
#else
    return 0 - (value < 0);
#endif
}
#endif // __amd64__

Compile the source file case26.c with GCC, engaging its optimiser, targetting the i386 platform, and display the generated assembly code:

gcc -m32 -o- -O3 -S case26.c

Note: the left column shows the generated code in AT&T syntax, while the right column shows properly optimised shorter code in Intel syntax as comments.

[…]
_sign32:
	movl	4(%esp), %eax
	sarl	$31, %eax
	ret
[…]
_sign64:
	movl	8(%esp), %eax	#	mov	eax, [esp+8]
	sarl	$31, %eax	#	cdq
	cltd			#	mov	eax, edx
	ret			#	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"

Compile the source file case26.c a second time with GCC, now targetting the AMD64 platform, and display the generated assembly code:

gcc -m64 -o- -O3 -S case26.c

Note: the left column shows the generated code in AT&T syntax, while the right column shows properly optimised shorter code in Intel syntax as comments.

[…]
sign32:
	movl	%edi, %eax
	sarl	$31, %eax
	ret
[…]
sign64:
	sarq	$63, %rdi	#	mov	rax, rdi
	movslq	%edi, %rax	#	sar	rax, 63
	ret			#	ret
[…]
sign128:
	sarq	$63, %rsi	#	mov	rax, rsi
	movslq	%esi, %rax	#	cqo
	cqto			#	mov	rax, rdx
	ret			#	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"

Compile the source file case26.c a third time with GCC, again targetting the AMD64 platform, now with the preprocessor macro ALTERNATE defined, and display the generated assembly code:

gcc -DALTERNATE -m64 -o- -O3 -S case26.c

[…]
sign32:
	movl	%edi, %eax
	sarl	$31, %eax
	ret
[…]
sign64:
	sarq	$63, %rdi
	movslq	%edi, %rax
	ret
[…]
sign128:
	shrq	$63, %rsi
	xorl	%edx, %edx
	movq	%rsi, %rax
	negq	%rax
	adcq	$0, %rdx
	negq	%rdx
	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"

OUCH: that’s what I call a complete failure!

Case 27: Yet More Optimiser Failures

Create the text file case27.c with the following content in an arbitrary, preferable empty directory:

// Copyleft © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

int __clzti2(__uint128_t value) {
    unsigned long long high = value >> 64, low = value, mask = -(high == 0);
    return (64 & (int) mask) + __builtin_clzll((high & ~mask) | (low & mask));
}

int __ctzti2(__uint128_t value) {
    unsigned long long high = value >> 64, low = value, mask = -(low == 0);
    return (64 & (int) mask) + __builtin_ctzll((high & mask) | (low & ~mask));
}

Compile the source file case27.c with GCC, engaging its optimiser, targetting the AMD64 platform, and display the generated assembly code:

gcc -m64 -o- -O3 -S case27.c

Note: the left column shows the generated code in AT&T syntax, while the right column shows properly optimised shorter code in Intel syntax as comments.

[…]
__clzti2:
	xorl	%edx, %edx	#	xor	edx, edx
	testq	%rsi, %rsi	#	test	rsi, rsi
	sete	%dl		#	setz	dl
	movl	%edx, %ecx	#
	subl	$1, %edx	#
	negl	%ecx		#
	movslq	%edx, %rax	#
	movslq	%ecx, %rdx	#
	andq	%rsi, %rax	#
	andl	$64, %ecx	#
	andq	%rdi, %rdx	#	cmovnz	rdi, rsi
	orq	%rdx, %rax	#	shl	edx, 6
	bsrq	%rax, %rax	#	bsr	rax, rdi
	xorq	$63, %rax	#	xor	eax, 63
	addl	%ecx, %eax	#	add	eax, edx
	ret			#	ret
[…]
__ctzti2:
	xorl	%eax, %eax	#	xor	edx, edx
	testq	%rdi, %rdi	#	test	rdi, rdi
	sete	%al		#	setz	dl
	movl	%eax, %ecx	#
	subl	$1, %eax	#
	negl	%ecx		#
	cltq			#
	movslq	%ecx, %rdx	#
	andq	%rdi, %rax	#
	andl	$64, %ecx	#
	andq	%rsi, %rdx	#	cmovnz	rsi, rdi
	orq	%rdx, %rax	#	shl	edx, 6
	rep bsfq %rax, %rax	#	bsf	rax, rsi
	addl	%ecx, %eax	#	add	eax, edx
	ret			#	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"

OUCH: that’s what I call a complete failure!

Case 28: BDead Code Generation

Create the text file case28.c with the following content in an arbitrary, preferable empty directory:

// Copyleft © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

// Stein's algorithm: greatest common divisor

unsigned long long __gcddi3(unsigned long long p, unsigned long long q) {
    unsigned           r, s = 0;
    unsigned long long t;

    if (p == 0)
        return q;

    if (q == 0)
        return p;

    if (p == q)
        return p;
#ifndef CTZ
    while (((p | q) & 1) == 0)
        p >>= 1, q >>= 1, s++;

    while ((p & 1) == 0)
        p >>= 1;

    do {
        while ((q & 1) == 0)
            q >>= 1;
#elif CTZ != 32
    s = __builtin_ctzll(p | q), p >>= s, q >>= s;
    r = __builtin_ctzll(p), p >>= r;

    do {
        r = __builtin_ctzll(q), q >>= r;
#else
    if (((unsigned long) p | (unsigned long) q) == 0)
        p >>= 32, q >>= 32, s = 32;

    r = __builtin_ctzl((unsigned long) p | (unsigned long) q), p >>= r, q >>= r, s += r;

    if ((unsigned long) p == 0)
        p >>= 32;

    r = __builtin_ctzl(p), p >>= r;

    do
    {
        if ((unsigned long) q == 0)
            q >>= 32;

        r = __builtin_ctzl(q), q >>= r;
#endif
        if (p < q)
            t = q, q = p, p = t;
    } while (q -= p);

    return p << s;
}

Compile the source file case28.c with GCC, engaging its optimiser, targetting the i386 platform, and display the generated assembly code:

gcc -DCTZ -m32 -mtune=i386 -o- -O3 -S case28.c

[…]
__gcddi3:
	pushl	%ebp
	pushl	%edi
	pushl	%esi
	pushl	%ebx
	subl	$44, %esp
	movl	64(%esp), %ecx
	movl	68(%esp), %ebx
	movl	72(%esp), %esi
	movl	76(%esp), %edi
	movl	%ebx, %ebp
	movl	%ecx, 24(%esp)
	orl	%ecx, %ebp
	movl	%ebx, 28(%esp)
	movl	%esi, %eax
	movl	%edi, %edx
	je	L1
	movl	%ecx, %edx
	movl	%ebx, %eax
	xorl	%esi, %edx
	xorl	%edi, %eax
	orl	%eax, %edx
	je	L8
	movl	%esi, %eax
	orl	%edi, %eax
	je	L8
	movl	%ecx, %edx
	orl	%edi, %ebx
	orl	%esi, %edx
	movl	%ebx, 4(%esp)
	xorl	%ebx, %ebx
	movl	%edx, (%esp)
	call	___ctzdi2
	movl	28(%esp), %edx
	movl	%eax, %ebp
	movl	24(%esp), %eax
	movl	%ebp, %ecx
	shrdl	%cl, %edx, %eax
	shrl	%cl, %edx
	testb	$32, %cl
	cmovne	%edx, %eax
	cmovne	%ebx, %edx
	shrdl	%cl, %edi, %esi
	xorl	%ebx, %ebx
	shrl	%cl, %edi
	testb	$32, %cl
	movl	%edx, 4(%esp)
	cmovne	%edi, %esi
	cmovne	%ebx, %edi
	movl	%eax, (%esp)
	movl	%edx, 28(%esp)
	movl	%eax, 24(%esp)
	call	___ctzdi2
	movl	28(%esp), %edx
	movl	%eax, %ebx
	movl	24(%esp), %eax
	movl	%ebx, %ecx
	xorl	%ebx, %ebx
	shrdl	%cl, %edx, %eax
	shrl	%cl, %edx
	andl	$32, %ecx
	cmovne	%edx, %eax
	cmovne	%ebx, %edx
	movl	%eax, 24(%esp)
	movl	%edx, 28(%esp)
L6:
	movl	%esi, (%esp)
	movl	%edi, 4(%esp)
	call	___ctzdi2
	movl	28(%esp), %edx
	movl	%eax, %ecx
	xorl	%eax, %eax
	shrdl	%cl, %edi, %esi
	shrl	%cl, %edi
	testb	$32, %cl
	cmovne	%edi, %esi
	cmovne	%eax, %edi
	movl	24(%esp), %eax
	cmpl	%eax, %esi
	movl	%edi, %ebx
	movl	%esi, %ecx
	sbbl	%edx, %edi
	jc	L3
	subl	%eax, %esi
	movl	%ebx, %edi
	sbbl	%edx, %edi
	movl	%edi, %eax
	orl	%esi, %eax
	jne	L6
	movl	24(%esp), %eax
	movl	28(%esp), %edx
	movl	%ebp, %ecx
	xorl	%ebx, %ebx
	shldl	%cl, %eax, %edx
	sall	%cl, %eax
	andl	$32, %ecx
	cmovne	%eax, %edx
	cmovne	%ebx, %eax
L1:
	addl	$44, %esp
	popl	%ebx
	popl	%esi
	popl	%edi
	popl	%ebp
	ret
L8:
	movl	24(%esp), %eax
	movl	28(%esp), %edx
	addl	$44, %esp
	popl	%ebx
	popl	%esi
	popl	%edi
	popl	%ebp
	ret
L3:
	movl	%eax, %esi
	movl	%edx, %edi
	movl	%ecx, 24(%esp)
	subl	%ecx, %esi
	movl	%ebx, 28(%esp)
	sbbl	%ebx, %edi
	jmp	L6
[…]
	.ident	"GCC: (GNU […]) 10.2.0"

OUCH: the CMOVcc instruction is not available on the i386 processor!

The code generated by GCC shows 117 instructions in 325 bytes, with 3 calls of the library function __ctzdi2() instead of inlined code for the __builtin_ctzll() builtin, while properly written code uses either only 70 instructions in just 146 bytes or (with the symbol SPEED defined) 75 instructions in 157 bytes:

# Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"gcddi3.s"
.arch	generic32
.code32
.intel_syntax noprefix
.text
				# [esp+16] = high dword of second argument
				# [esp+12] = low dword of second argument
				# [esp+8] = high dword of first argument
				# [esp+4] = low dword of first argument
__gcddi3:
	push	edi
	push	esi
	mov	eax, [esp+12]
	mov	edx, [esp+16]	# edx:eax = first argument
	mov	edi, [esp+20]
	mov	esi, [esp+24]	# esi:edi = second argument
	mov	ecx, eax
	or	ecx, edx	# ecx = low qword of first argument
				#     | high qword of first argument
	jz	.Lsecond	# first argument = 0?

	mov	ecx, edi
	or	ecx, esi	# ecx = low qword of second argument
				#     | high qword of second argument
	jz	.Lfirst		# second argument = 0?

	push	ebx
	mov	ecx, eax
	mov	ebx, edx	# ebx:ecx = first argument
	xor	ecx, edi
	xor	ebx, esi	# ebx:ecx = first argument ^ second argument
	or	ebx, ecx
	jz	.Lequal		# first argument = second argument?

	xor	ebx, ebx	# ebx = exponent of common divisor 2 = 0
	mov	ecx, eax
	or	ecx, edi	# ecx = low dword of first argument
				#     | low dword of second argument
	jnz	.Lcommon	# first argument % 2**32 <> 0?
				# second argument % 2**32 <> 0?
.ifnotdef SPEED
	xchg	eax, edx	# edx:eax = first argument / 2**32
	xchg	edi, esi	# esi:edi = second argument / 2**32
.else
	mov	eax, edx
	mov	edx, ecx	# edx:eax = first argument / 2**32
	mov	edi, esi
	mov	esi, ecx	# esi:edi = second argument / 2**32
.endif
	add	ebx, 32		# ebx = exponent of common divisor 2 = 32
	mov	ecx, eax
	or	ecx, edi	# ecx = high dword of first argument
				#     | high dword of second argument
.Lcommon:
	bsf	ecx, ecx	# ecx = index of least significant '1' bit
	add	ebx, ecx	# ebx = exponent of common divisor 2
	shrd	eax, edx, cl
	shr	edx, cl		# edx:eax = first argument / 2**exponent
				#         = first argument'
	shrd	edi, esi, cl
	shr	esi, cl		# esi:edi = second argument / 2**exponent
				#         = second argument'
	test	eax, eax
	jnz	1f		# first argument' % 2**32 <> 0?
.ifnotdef SPEED
	xchg	eax, edx	# edx:eax = first argument' / 2**32
.else
	mov	eax, edx
	xor	edx, edx	# edx:eax = first argument' / 2**32
.endif
1:
	bsf	ecx, eax	# ecx = index of least significant '1' bit
	shrd	eax, edx, cl
	shr	edx, cl		# edx:eax = first argument' / 2**index
				#         = first argument"
.Lloop:
	test	edi, edi
	jnz	2f		# second argument' % 2**32 <> 0?
.ifnotdef SPEED
	xchg	edi, esi	# esi:edi = second argument' / 2**32
.else
	mov	edi, esi
	xor	esi, esi	# esi:edi = second argument' / 2**32
.endif
2:
	bsf	ecx, edi	# ecx = index of least significant '1' bit
	shrd	edi, esi, cl
	shr	esi, cl		# esi:edi = second argument' / 2**index
				#         = second argument"
.Lsubtract:
	sub	edi, eax
	sbb	esi, edx	# esi:edi = second argument" - first argument"
	jnb	.Lcontinue	# second argument" >= first argument"?
.Lswap:
	add	eax, edi
	adc	edx, esi	# edx:eax = first argument"
				#         + second argument" - first argument"
				#         = second argument"
.ifnotdef SPEED
	neg	esi
	neg	edi
	sbb	esi, 0		# esi:edi = 0
				#         - (second argument" - first argument")
				#         = first argument" - second argument"
.else
	mov	ecx, esi
	xor	esi, esi
	neg	edi
	sbb	esi, ecx	# esi:edi = 0
				#         - (second argument" - first argument")
				#         = first argument" - second argument"
.endif
	jmp	.Lloop
.Lcontinue:
	mov	ecx, edi
	or	ecx, esi
	jnz	.Lloop		# second argument" <> first argument"?

	mov	ecx, ebx	# ecx = exponent of common divisor 2
	shld	edx, eax, cl
	shl	eax, cl		# edx:eax = first argument" * 2**exponent
				#         = greatest common divisor
.Lequal:
	pop	ebx
.Lfirst:
	pop	esi
	pop	edi
	ret
.Lsecond:
	mov	eax, edi
	mov	edx, esi	# edx:eax = second argument
	pop	esi
	pop	edi
	ret

.size	__gcddi3, .-__gcddi3
.type	__gcddi3, @function
.global	__gcddi3
.end

Compile the source file case28.c a second time with GCC, now with the preprocessor macro CTZ=32 defined, and display the generated assembly code:

gcc -DCTZ=32 -m32 -mtune=i386 -o- -O3 -S case28.c

[…]
	pushl	%edi
	pushl	%esi
	pushl	%ebx
	subl	$8, %esp
	movl	24(%esp), %esi
	movl	28(%esp), %edi
	movl	32(%esp), %eax
	movl	36(%esp), %edx
	movl	%edi, %ebx
	orl	%esi, %ebx
	je	L1
	movl	%edx, %ebx
	orl	%eax, %ebx
	je	L11
	movl	%edi, %ebx
	xorl	%edx, %ebx
	movl	%esi, %ecx
	xorl	%eax, %ecx
	orl	%ecx, %ebx
	je	L1
	movl	%esi, %ecx
	orl	%eax, %ecx
	je	L17
	xorl	%ebx, %ebx
L3:
	bsfl	%ecx, %ecx
	shrdl	%cl, %edi, %esi
	shrl	%cl, %edi
	testb	$32, %cl
	je	L19
	movl	%edi, %esi
	xorl	%edi, %edi
L19:
	shrdl	%cl, %edx, %eax
	shrl	%cl, %edx
	testb	$32, %cl
	je	L20
	movl	%edx, %eax
	xorl	%edx, %edx
L20:
	addl	%ecx, %ebx
	movl	%esi, %ecx
	testl	%esi, %esi
	jne	L4
	movl	%edi, %esi
	xorl	%edi, %edi
	movl	%esi, %ecx
L4:
	bsfl	%ecx, %ecx
	shrdl	%cl, %edi, %esi
	shrl	%cl, %edi
	testb	$32, %cl
	je	L18
	movl	%edi, %esi
	xorl	%edi, %edi
L18:
	movl	%esi, (%esp)
	movl	%edi, 4(%esp)
L9:
	movl	%eax, %ecx
	testl	%eax, %eax
	jne	L5
	movl	%edx, %eax
	xorl	%edx, %edx
	movl	%eax, %ecx
L5:
	bsfl	%ecx, %ecx
	shrdl	%cl, %edx, %eax
	shrl	%cl, %edx
	xorl	%esi, %esi
	testb	$32, %cl
	cmovne	%edx, %eax
	cmovne	%esi, %edx
	movl	%eax, %esi
	movl	%edx, %edi
	movl	(%esp), %edx
	movl	4(%esp), %ecx
	cmpl	%edx, %eax
	movl	%edi, %eax
	sbbl	%ecx, %eax
	jc	L6
	subl	%edx, %esi
	sbbl	%ecx, %edi
	movl	%esi, %eax
	movl	%edi, %edx
	orl	%esi, %edi
	jne	L9
	movl	(%esp), %eax
	movl	4(%esp), %edx
	movb	%bl, %cl
	shldl	%cl, %eax, %edx
	sall	%cl, %eax
	xorl	%ebx, %ebx
	andl	$32, %ecx
	cmovne	%eax, %edx
	cmovne	%ebx, %eax
L1:
	addl	$8, %esp
	popl	%ebx
	popl	%esi
	popl	%edi
	ret
L11:
	movl	%esi, %eax
	movl	%edi, %edx
	addl	$8, %esp
	popl	%ebx
	popl	%esi
	popl	%edi
	ret
L17:
	movl	%edi, %esi
	xorl	%edi, %edi
	movl	%edx, %eax
	xorl	%edx, %edx
	movl	%esi, %ecx
	orl	%eax, %ecx
	movl	$32, %ebx
	jmp	L3
L6:
	movl	%edx, %eax
	movl	%ecx, %edx
	subl	%esi, %eax
	sbbl	%edi, %edx
	movl	%esi, (%esp)
	movl	%edi, 4(%esp)
	jmp	L9
[…]
	.ident	"GCC: (GNU […]) 10.2.0"

OOPS: the BSF instruction can’t set bit 2⁵, the 15 highlighted instructions are therefore superfluous!

Now the code generated by GCC shows 116 instructions in 282 bytes, without calls of a library function – better than before, but still outright bad!

// Copyleft © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

// Stein's algorithm: greatest common divisor

__uint128_t __gcdti3(__uint128_t p, __uint128_t q) {
    unsigned    r, s = 0;
    __uint128_t t;

    if (p == 0)
        return q;

    if (q == 0)
        return p;

    if (p == q)
        return p;

    if (((unsigned long long) p | (unsigned long long) q) == 0)
        p >>= 64, q >>= 64, s = 64;

    r = __builtin_ctzll((unsigned long long) p | (unsigned long long) q), p >>= r, q >>= r, s += r;

    if ((unsigned long long) p == 0)
        p >>= 64;

    r = __builtin_ctzll(p), p >>= r;

    do
    {
        if ((unsigned long long) q == 0)
            q >>= 64;

        r = __builtin_ctzll(q), q >>= r;

        if (p < q)
            t = q, q = p, p = t;
    } while (q -= p);

    return p << s;
}

gcc -m64 -mtune=core2 -o- -O3 -S case28.c

[…]
__gcdti3:
	movq	%rsi, %r9
	movq	%rdi, %r8
	movq	%rdx, %rsi
	movq	%r9, %rax
	movq	%rcx, %rdi
	orq	%r8, %rax
	je	.L10
	movq	%r8, %rdx
	movq	%r9, %rax
	xorq	%rsi, %rdx
	xorq	%rcx, %rax
	orq	%rax, %rdx
	je	.L11
	movq	%rsi, %rax
	orq	%rcx, %rax
	je	.L11
	movq	%r8, %rcx
	xorl	%r10d, %r10d
	orq	%rsi, %rcx
	jne	.L3
	movq	%r9, %r8
	movq	%rdi, %rsi
	xorl	%r9d, %r9d
	movq	%r8, %rcx
	xorl	%edi, %edi
	movl	$64, %r10d
	orq	%rsi, %rcx
.L3:
	bsfq	%rcx, %rcx
	movq	%r8, %rax
	movq	%r9, %rdx
	shrdq	%cl, %r9, %rax
	shrq	%cl, %rdx
	xorl	%r8d, %r8d
	testb	$64, %cl
	cmovne	%rdx, %rax
	cmovne	%r8, %rdx
	shrdq	%cl, %rdi, %rsi
	xorl	%r9d, %r9d
	shrq	%cl, %rdi
	testb	$64, %cl
	cmovne	%rdi, %rsi
	cmovne	%r9, %rdi
	addl	%ecx, %r10d
	testq	%rax, %rax
	movq	%rax, %rcx
	je	.L15
.L4:
	bsfq	%rcx, %rcx
	xorl	%r11d, %r11d
	shrdq	%cl, %rdx, %rax
	shrq	%cl, %rdx
	andl	$64, %ecx
	cmovne	%rdx, %rax
	cmovne	%r11, %rdx
.L9:
	testq	%rsi, %rsi
	movq	%rsi, %rcx
	jne	.L5
	movq	%rdi, %rsi
	xorl	%edi, %edi
	movq	%rsi, %rcx
.L5:
	bsfq	%rcx, %rcx
	xorl	%r8d, %r8d
	shrdq	%cl, %rdi, %rsi
	shrq	%cl, %rdi
	testb	$64, %cl
	cmovne	%rdi, %rsi
	cmovne	%r8, %rdi
	cmpq	%rax, %rsi
	movq	%rsi, %r8
	movq	%rdi, %r9
	sbbq	%rdx, %rdi
	jc	.L6
	subq	%rax, %rsi
	movq	%r9, %rdi
	sbbq	%rdx, %rdi
	movq	%rdi, %rcx
	orq	%rsi, %rcx
	jne	.L9
	movl	%r10d, %ecx
	xorl	%esi, %esi
	shldq	%cl, %rax, %rdx
	salq	%cl, %rax
	andl	$64, %ecx
	cmovne	%rax, %rdx
	cmovne	%rsi, %rax
	ret
.L11:
	movq	%r8, %rax
	movq	%r9, %rdx
	ret
.L6:
	subq	%rsi, %rax
	sbbq	%r9, %rdx
	movq	%rax, %rsi
	movq	%r8, %rax
	movq	%rdx, %rdi
	movq	%r9, %rdx
	jmp	.L9
.L10:
	movq	%rdx, %rax
	movq	%rcx, %rdx
	ret
.L15:
	movq	%rdx, %rax
	xorl	%edx, %edx
	movq	%rax, %rcx
	jmp	.L4
[…]
	.ident	"GCC: (GNU […]) 10.2.0"

OOPS: the BSF instruction can’t set bit 2⁶, the 16 highlighted instructions are therefore superfluous!

# Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"gcdti3.s"
.arch	generic64
.code64
.intel_syntax noprefix
.text
				# rsi:rdi = first argument
				# rcx:rdx = second argument
__gcdti3:
	mov	rax, rdi
	or	rax, rsi	# rax = low qword of first argument
				#     | high qword of first argument
	jz	.Lsecond	# first argument = 0?

	mov	rax, rdx
	or	rax, rcx	# rax = low qword of second argument
				#     | high qword of second argument
	jz	.Lfirst		# second argument = 0?

	mov	r10, rdx
	mov	r11, rcx	# r11:r10 = second argument
	xor	rdx, rdi
	xor	rcx, rsi	# rcx:rdx = second argument ^ first argument
	or	rcx, rdx
	jz	.Lequal		# first argument = second argument?

	mov	eax, 64		# rax = 64
	xor	edx, edx	# rdx = exponent of common divisor 2 = 0
	mov	rcx, rdi
	or	rcx, r10	# rcx = low qword of first argument
				#     | low qword of second argument,
				# ZF = (first argument % 2**64 = 0)
				#    & (second argument % 2**64 = 0)
	cmovnz	eax, edx	# rax = ZF ? 64 : 0
	cmovz	rdi, rsi
	cmovz	esi, edx	# rsi:rdi = ZF ? first argument / 2**64
	cmovz	r10, r11
	cmovz	r11, rdx	# r11:r10 = ZF ? second argument / 2**64
	mov	rcx, rdi
	or	rcx, r10	# rcx = high qword of first argument
				#     | high qword of second argument
.Lcommon:
	bsf	rcx, rcx	# rcx = index of least significant '1' bit
	add	eax, ecx	# rax = exponent of common divisor 2
	shrd	rdi, rsi, cl
	shr	rsi, cl		# rsi:rdi = first argument / 2**exponent
				#         = first argument'
	shrd	r10, r11, cl
	shr	r11, cl		# r11:r10 = second argument / 2**exponent
				#         = second argument'

	test	rdi, rdi	# ZF = (first argument' % 2**64 = 0)
	cmovz	rdi, rsi
	cmovz	esi, edx	# rsi:rdi = ZF ? first argument' / 2**64

	bsf	rcx, rdi	# rcx = index of least significant '1' bit
	shrd	rdi, rsi, cl
	shr	rsi, cl		# rsi:rdi = first argument' / 2**index
				#         = first argument"
.Lloop:
	test	r10, r10	# ZF = (second argument' % 2**64 = 0)
	cmovz	r10, r11
	cmovz	r11, rdx	# r11:r10 = ZF ? second argument' / 2**64

	bsf	rcx, r10	# rcx = index of least significant '1' bit
	shrd	r10, r11, cl
	shr	r11, cl		# r11:r10 = second argument' / 2**index
				#         = second argument"
.Lcompare:
	cmp	r10, rdi
	mov	rcx, r11
	sbb	rcx, rsi	# CF = (second argument" < first argument")
.Lswap:
	mov	rcx, r10
	cmovb	r10, rdi
	cmovb	rdi, rcx
	mov	rcx, r11
	cmovb	r11, rsi	# r11:r10 = CF ? first argument" : second argument"
	cmovb	rsi, rcx	# rsi:rdi = CF ? second argument" : first argument"
.Lsubtract:
	sub	r10, rdi
	sbb	r11, rsi	# r11:r10 = second argument" - first argument"
	mov	rcx, r10
	or	rcx, r11
	jnz	.Lloop		# second argument" <> first argument"?

	mov	ecx, eax	# rcx = exponent of common divisor 2
	shld	rsi, rdi, cl
	shl	rdi, cl		# rsi:rdi = first argument" * 2**exponent
				#         = greatest common divisor
.Lequal:
.Lfirst:
	mov	rdx, rsi
	mov	rax, rdi
	ret
.Lsecond:
	mov	rax, rdx
	mov	rdx, rcx	# rdx:rax = second argument
	ret

.size	__gcdti3, .-__gcdti3
.type	__gcdti3, @function
.global	__gcdti3
.end

Case 29: Superfluous Extraneous Code Generation

Create the text file case29.c with the following content in an arbitrary, preferable empty directory:

// Copyleft © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

// Widening carry-less multiplication

#ifndef __amd64__
unsigned long long clmul64(unsigned long p, unsigned long q) {
    unsigned long long r = 0;
    unsigned long      s = 1UL << 31;

    do {
        r <<= 1;
        if (q & s)
            r ^= p; // no need for promotion or conversion of p here!
    } while (s >>= 1);

    return r;
}
#else // __amd64__
__uint128_t clmul128(unsigned long long p, unsigned long long q) {
    __uint128_t        r = 0;
    unsigned long long s = 1UL << 63;

    do {
        r <<= 1;
        if (q & s)
            r ^= p; // no need for promotion or conversion of p here!
    } while (s >>= 1);

    return r;
}
#endif // __amd64__

Compile the source file case29.c with GCC, engaging its optimiser, targetting the i386 platform, and display the generated assembly code:

gcc -m32 -o- -O3 -S case29.c

[…]
clmul64:
	pushl	%ebp
	movl	$-2147483648, %ecx
	xorl	%eax, %eax
	xorl	%edx, %edx
	pushl	%edi
	xorl	%edi, %edi
	pushl	%esi
	pushl	%ebx
	movl	24(%esp), %ebp
	movl	$32, %ebx
	movl	20(%esp), %esi
L3:
	shldl	$1, %eax, %edx
	addl	%eax, %eax
	testl	%ecx, %ebp
	je	L2
	xorl	%esi, %eax
	xorl	%edi, %edx
L2:
	shrl	$1, %ecx
	subl	$1, %ebx
	jne	L3
	popl	%ebx
	popl	%esi
	popl	%edi
	popl	%ebp
	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"

OUCH: 8 (highlighted) superfluous instructions, including 4 extraneous synthesized instructions which use register EBX, out of the total 25 instructions!

Compile the source file case29.c a second time with GCC, now targetting the AMD64 platform, and display the generated assembly code:

gcc -m64 -o- -O3 -S case29.c

[…]
clmul128:
	movq	%rdi, %r10
	xorl	%eax, %eax
	movq	%rsi, %rdi
	xorl	%edx, %edx
	movabsq	$-9223372036854775808, %rcx
	movl	$64, %esi
.L3:
	shldq	$1, %rax, %rdx
	addq	%rax, %rax
	testq	%rcx, %rdi
	je	.L2
	movq	%rax, %r9	#
	xorq	%r10, %r9	#	xorq	%rsi, %rax
	movq	%r9, %rax	#
.L2:
	shrq	$1, %rcx
	subl	$1, %esi
	jne	.L3
	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"

OUCH: 7 (highlighted) superfluous instructions (minus the single instruction added as comment) out of the total 17 instructions!

Case 30: Epic Optimiser Failures

Create the text file case30.c with the following content in an arbitrary, preferable empty directory:

// Copyleft © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifndef __amd64__
int ispowerof2(unsigned long long argument) {
#else
int ispowerof2(__uint128_t argument) {
#endif
    return (argument != 0) && ((argument & argument - 1) == 0);
}

Compile the source file case30.c with GCC, engaging its optimiser, targetting the i386 platform, and display the generated assembly code:

gcc -m32 -o- -O3 -S case30.c

Note: the left column shows the generated code in AT&T syntax, while the right column shows properly optimised code in Intel syntax as comments.

[…]
ispowerof2:
	pushl	%esi		#
	pushl	%ebx		#
	movl	12(%esp), %ecx	#	mov	ecx, [esp+4]
	movl	16(%esp), %ebx	#	mov	edx, [esp+8]
	movl	%ecx, %eax	#	mov	eax, ecx
	orl	%ebx, %eax	#	or	eax, edx
	je	.L5		#	jz	.L5
	movl	%ecx, %eax	#
	movl	%ebx, %edx	#	xor	eax, eax
	addl	$-1, %eax	#	add	ecx, -1
	adcl	$-1, %edx	#	adc	edx, -1
	andl	%ecx, %eax	#	and	ecx, [esp+4]
	andl	%ebx, %edx	#	and	edx, [esp+8]
	popl	%ebx		#
	orl	%edx, %eax	#	or	ecx, edx
	sete	%al		#	setz	al
	movzbl	%al, %eax	#
	movl	%eax, %esi	#
	movl	%esi, %eax	#
	popl	%esi		#
	ret			#
.L5:
	xorl	%esi, %esi	#
	popl	%ebx		#
	movl	%esi, %eax	#
	popl	%esi		#
	ret			#	ret
[…]
	.ident	"GCC: (GNU […]) 13.1.0"

[…]
ispowerof2:
	pushl   %edi		#
	pushl   %esi		#
	pushl   %ebx		#
	movl	20(%esp), %edi	#	mov	ecx, [esp+4]
	movl	16(%esp), %esi	#	mov	edx, [esp+8]
	movl	%edi, %eax	#	mov	eax, ecx
	orl	%esi, %eax	#	or	eax, edx
	je	.L5		#	jz	.L5
	movl	%esi, %eax	#
	movl	%edi, %edx	#
	addl	$-1, %eax	#	add	ecx, -1
	adcl	$-1, %edx	#	adc	edx, -1
	movl	%eax, %ecx	#
	movl	%esi, %eax	#
	movl	%edx, %ebx	#
	movl	%edi, %edx	#
	andl	%ecx, %eax	#	and	ecx, [esp+4]
	xorl	%ecx, %ecx	#	xor	eax, eax
	andl	%ebx, %edx	#	and	edx, [esp+8]
	popl	%ebx		#
	popl	%esi		#
	orl	%edx, %eax	#	or	ecx, edx
	popl	%edi		#
	sete	%cl		#	setz	al
	movl	%ecx, %eax	#
	ret			#
.L5:
	xorl	%ecx, %ecx	#
	popl	%ebx		#
	popl	%esi		#
	movl	%ecx, %eax	#
	popl	%edi		#
	ret			#	ret
[…]
	.ident	"GCC: (GNU […]) 12.3.0"

OUCH: 19 (highlighted) superfluous instructions (which abuse registers EDI and ESI to play a Shell Game and perform 3 superfluous memory writes) out of the total 32 instructions in 60 bytes, plus an epic failure of the optimiser which doesn’t recognise that register EAX is already 0 at label .L5!

Properly written straightforward code uses but only 13 instructions in 36 bytes and doesn’t clobber any non-volatile register!

Compile the source file case30.c a second time with GCC, again targetting the i386 platform, now with explicit support for the SSE4.1 alias Penryn New Instruction Set, and display the generated assembly code:

gcc -m32 -msse4.1 -o- -O3 -S case30.c

Note: the left column shows the generated code in AT&T syntax, while the right column shows properly optimised code in Intel syntax as comments.

[…]
ispowerof2:
	movq	4(%esp), %xmm1	#	movq	xmm0, qword ptr [esp+4]
	xorl	%eax, %eax	#	xor	eax, eax
	movdqa	%xmm1, %xmm0	#
	punpcklqdq %xmm1, %xmm0	#
	ptest	%xmm0, %xmm0	#	ptest	xmm0, xmm0
	je	.L1		#	je	.L1
	pcmpeqd	%xmm0, %xmm0	#	pcmpeqq	xmm1, xmm1
	xorl	%eax, %eax	#
	paddq	%xmm1, %xmm0	#	paddq	xmm1, xmm0
	pand	%xmm1, %xmm0	#
	punpcklqdq %xmm0, %xmm0	#
	ptest	%xmm0, %xmm0	#	ptest	xmm1, xmm0
	sete	%al		#	sete	al
.L1:
	ret			#	ret
[…]
	.ident	"GCC: (GNU […]) 12.3.0"

OUCH: 5 (highlighted) superfluous instructions out of the total 14 instructions in 50 bytes, plus 2 epic failures of the optimiser which doesn’t recognise that the MOVQ instruction clears the upper lane of the SSE register XMM1 and that register EAX is already 0!

Properly written straightforward code uses but only 9 instructions in 32 bytes!

Compile the source file case30.c a third time with GCC, again targetting the i386 platform, now with explicit support for the AVX2 alias Haswell New Instruction Set or the AVX instruction set, and display the generated assembly code:

gcc -m32 -mavx -o- -O3 -S case30.c

Note: the left column shows the generated code in AT&T syntax, while the right column shows properly optimised code in Intel syntax as comments.

[…]
ispowerof2:
	vmovq	4(%esp), %xmm0		#	movq	xmm0, [esp+4]
	xorl	%eax, %eax		#	xor	eax, eax
	vpunpcklqdq %xmm0, %xmm0, %xmm0	#
	vptest	%xmm0, %xmm0		#	ptest	xmm0, xmm0
	je	.L1			#	jz	.L1
	vpcmpeqd %xmm1, %xmm1, %xmm1	#	pcmpeqq	xmm1, xmm1
	xorl	%eax, %eax		#
	vpaddq	%xmm1, %xmm0, %xmm1	#	paddq	xmm1, xmm0
	vpand	%xmm0, %xmm1, %xmm0	#
	vpunpcklqdq %xmm0, %xmm0, %xmm0	#
	vptest	%xmm0, %xmm0		#	ptest	xmm1, xmm0
	sete	%al			#	setz	al
.L1:
	ret				#	ret
[…]
	.ident	"GCC: (GNU […]) 12.3.0"

OUCH: 4 (highlighted) superfluous instructions out of the total 13 instructions in 46 bytes, plus 2 epic failures of the optimiser which doesn’t recognise that the MOVQ instruction clears the upper lane of the SSE register XMM1 and that register EAX is already 0!

OUCH: additionally GCC generates AVX instructions without need where SSE4.1 instructions are sufficient!

Compile the source file case30.c a fourth time with GCC, now targetting the AMD64 platform, and display the generated assembly code:

gcc -m64 -o- -O3 -S case30.c

Note: the left column shows the generated code in AT&T syntax, while the right column shows properly optimised code in Intel syntax as comments.

[…]
ispowerof2:
	movq	%rsi, %rax	#	mov	rax, rdi
	xorl	%ecx, %ecx	#
	movq	%rdi, %r8	#
	movq	%rsi, %r9	#
	orq	%rdi, %rax	#	or	rax, rsi
	je	.L1		#	jz	.L1
	movq	%rdi, %rax	#	mov	rcx, rdi
	movq	%rsi, %rdx	#	mov	rdx, rsi
	addq	$-1, %rax	#	add	rcx, -1
	adcq	$-1, %rdx	#	add	rdx, -1
	movq	%rax, %rsi	#
	movq	%r8, %rax	#
	xorl	%ecx, %ecx	#	xor	eax, eax
	movq	%rdx, %rdi	#
	movq	%r9, %rdx	#
	andq	%rsi, %rax	#	and	rcx, rdi
	andq	%rdi, %rdx	#	and	rdx, rsi
	orq	%rdx, %rax	#	or	rcx, rdx
	sete	%cl		#	setz	al
.L1:
	movl	%ecx, %eax	#
	ret			#	ret
[…]
	.ident	"GCC: (GNU […]) 12.3.0"

OUCH: 8 (highlighted) superfluous instructions (which abuse registers R8 and R9 to play a Shell Game) out of the total 21 instructions in 59 bytes, plus an epic failure of the optimiser which doesn’t recognise that register ECX is already 0!

Properly written straightforward code uses but only 13 instructions in 37 bytes!

Case 31: Cargo Cult, or Ignorance is Bliss, plus Epic Optimiser Failure

In the Hardware Specification Errata HSD146, BDM85, SKL029 and 026, all titled POPCNT Instruction May Take Longer to Execute Than Expected, Intel documents a bug in their Haswell, Broadwell, Skylake and Coffee Lake families of CPUs: the POPCNT instruction has a false dependency on its destination register.

Note: a XOR instruction that zeroes the destination register before the POPCNT instruction breaks this false dependency.

Intel processors from the Sandy Bridge, Ivy Bridge and Coffee Lake Refresh families suffer from this bug too; processors of the Alder Lake, Cannon Lake, Cascade Lake, Comet Lake, Ice Lake, Raptor Lake, Rocket Lake and Tiger Lake families are but not affected.

Create the text file case31.c with the following content in an arbitrary, preferable empty directory:

// Copyleft © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

int ispowerof2(unsigned long long argument) {
    return __builtin_popcountll(argument) == 1;
}

Compile the source file case31.c with GCC, engaging its optimiser, targetting one of the not affected processor families, and display the generated assembly code:

gcc -m32 -march=cannonlake -o- -O3 -S case31.c

[…]
ispowerof2:
	xorl	%eax, %eax
	xorl	%edx, %edx
	popcntl	4(%esp), %eax
	popcntl	8(%esp), %edx
	addl	%edx, %eax
	cmpl	$1, %eax
	sete	%al
	movzbl	%al, %eax
	ret
[…]
	.ident	"GCC: (GNU […]) 12.3.0"

OOPS: GCC generates superfluous XOR instructions!

Compile the source file case31.c with GCC, again targetting one of the not affected processor families, but now optimising for size, and display the generated assembly code:

gcc -m32 -march=cannonlake -o- -Os -S case31.c

[…]
ispowerof2:
	pushl	%ebp
	movl	%esp, %ebp
	subl	$16, %esp
	pushl	12(%ebp)
	pushl	8(%ebp)
	call	__popcountdi2
	addl	$16, %esp
	leave
	decl	%eax
	sete	%al
	movzbl	%al, %eax
	ret
[…]
	.ident	"GCC: (GNU […]) 12.3.0"

OUCH: optimising for speed GCC generates 9 instructions in 24 bytes; optimising for size it but generates 12 instructions in 29 bytes, including a superfluous (highlighted) ADD instruction, in addition to the uncounted instructions and bytes of the __popcountdi2() function!

Case 32: More Epic Optimiser Failures

Create the text file case32.c with the following content in an arbitrary, preferable empty directory:

// Copyleft © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

int ispowerof2(unsigned long long argument) {
    return (argument != 0) && (__builtin_clzll(argument) + __builtin_ctzll(argument) == 63);
}

Compile the source file case32.c with GCC, engaging its optimiser, targetting the i386 platform, and display the generated assembly code:

gcc -m32 -o- -O3 -S case32.c

Note: the left column shows the generated code in AT&T syntax, while the right column shows properly optimised code in Intel syntax as comments.

[…]
ispowerof2:
	pushl	%ebx		#
	subl	$8, %esp	#
	movq	16(%esp), %xmm1	#
	movdqa	%xmm1, %xmm0	#
	movd	%xmm1, %edx	#	mov	ecx, [esp+4]
	psrlq	$32, %xmm0	#
	movd	%xmm0, %eax	#	mov	edx, [esp+8]
	movl	%eax, %ebx	#	mov	eax, ecx
	orl	%edx, %ebx	#	or	eax, edx
	je	.L7		#	jz	.L0
	testl	%eax, %eax	#	bsr	eax, edx
	je	.L3		#	jnz	.L1
	bsrl	%eax, %ebx	#	bsr	eax, ecx
	xorl	$31, %ebx	#	xor	eax, 32
				# L1:
				#	xor	eax, 31
.L4:
	subl	$16, %esp	#	bsf	ecx, ecx
	movq	%xmm1, (%esp)	#	jnz	.L2
	call	__ctzdi2	#	bsf	ecx, edx
	addl	$16, %esp	#	xor	ecx, 32
	xorl	%ecx, %ecx	# L2:
	addl	%eax, %ebx	#	add	eax, ecx
	cmpl	$63, %ebx	#	cmp	eax, 63
	sete	%cl		#	sete	al
	addl	$8, %esp	#
	movl	%ecx, %eax	#
	popl	%ebx		# L0:
	ret			#	ret
.L7:
	xorl	%ecx, %ecx	#
	addl	$8, %esp	#
	movl	%ecx, %eax	#
	popl	%ebx		#
	ret			#
.L3:
	bsrl	%edx, %edx	#
	xorl	$31, %edx	#
	leal	32(%edx), %ebx	#
	jmp	.L4		#
[…]
	.ident	"GCC: (GNU […]) 12.3.0"

OUCH: GCC generates 35 instructions in 96 bytes which clobber register EBX, perform 2 superfluous memory writes plus a superfluous memory read and call the __ctzdi2() function instead to use 2 BSF instructions; straightforward code uses but only 18 instructions in 48 bytes without call of a library function!

OUCH: the highlighted SSE2 instructions (ab)used to load the 64-bit argument into the 32-bit registers EAX and EDX are completely braindead!

Compile the source file case32.c a second time with GCC, engaging its optimiser, targetting the i386 platform, now explicitly enabling the LZCNT and TZCNT instructions, and display the generated assembly code:

gcc -m32 -mabm -mbmi -mlzcnt -o- -O3 -S case32.c

Note: the left column shows the generated code in AT&T syntax, while the right column shows properly optimised code in Intel syntax as comments.

[…]
ispowerof2:
	pushl	%ebx		#
	subl	$8, %esp	#
	movq	16(%esp), %xmm1	#
	movdqa	%xmm1, %xmm0	#
	movd	%xmm1, %edx	#	mov	ecx, [esp+4]
	psrlq	$32, %xmm0	#
	movd	%xmm0, %eax	#	mov	edx, [esp+8]
	movl	%eax, %ebx	#	mov	eax, ecx
	orl	%edx, %ebx	#	or	eax, edx
	je	.L7		#	jz	.L0
	xorl	%ebx, %ebx	#
	lzcntl	%eax, %ebx	#	lzcnt	eax, edx
	testl	%eax, %eax	#
	je	.L10		#	jnc	.L4
				#	lzcnt	eax, ecx
				#	add	eax, 32
.L4:
	subl	$16, %esp	#	tzcnt	ecx, ecx
	movq	%xmm1, (%esp)	#	jnc	.L1
	call	__ctzdi2	#	tzcnt	edx, edx
	addl	$16, %esp	#	add	ecx, edx
	xorl	%ecx, %ecx	# L1:
	addl	%eax, %ebx	#	add	eax, ecx
	cmpl	$63, %ebx	#	cmp	eax, 63
	sete	%cl		#	sete	al
	addl	$8, %esp	#
	movl	%ecx, %eax	#
	popl	%ebx		# L0:
	ret			#	ret
.L7:
	xorl	%ecx, %ecx	#
	addl	$8, %esp	#
	movl	%ecx, %eax	#
	popl	%ebx		#
	ret			#
.L10:
	lzcntl	%edx, %edx	#
	leal	32(%edx), %ebx	#
	jmp	.L4		#
[…]
	.ident	"GCC: (GNU […]) 12.3.0"

OUCH: GCC generates 34 instructions in 94 bytes which clobber register EBX, perform 2 superfluous memory writes plus a superfluous memory read and call the __ctzdi2() function instead to use 2 TZCNT instructions, despite the -mabm and -mbmi options; straightforward code uses but only 17 instructions in 48 bytes without call of a library function!

OUCH: the highlighted SSE2 instructions (ab)used to load the 64-bit argument into the 32-bit registers EAX and EDX are completely braindead!

Compile the source file case32.c a third time with GCC, again targetting the i386 platform, but now optimising for size, and display the generated assembly code:

gcc -m32 -mabm -mbmi -mlzcnt -o- -Os -S case32.c

[…]
ispowerof2:
	pushl	%ebp
	movl	%esp, %ebp
	pushl	%edi
	pushl	%esi
	pushl	%ebx
	subl	$12, %esp
	movl	12(%ebp), %edi
	movl	8(%ebp), %esi
	movl	%edi, %edx
	orl	%esi, %edx
	je	.L7
	lzcntl	%edi, %ebx
	testl	%edi, %edi
	jne	.L4
	lzcntl	%esi, %ebx
	addl	$32, %ebx
.L4:
	pushl	%eax
	pushl	%eax
	pushl	%edi
	pushl	%esi
	call	__ctzdi2
	addl	$16, %esp
	addl	%eax, %ebx
	cmpl	$63, %ebx
	sete	%al
	jmp	.L2
.L7:
	xorl	%eax, %eax
.L2:
	leal	-12(%ebp), %esp
	movzbl	%al, %eax
	popl	%ebx
	popl	%esi
	popl	%edi
	popl	%ebp
	ret
[…]
	.ident	"GCC: (GNU […]) 12.3.0"

OUCH: again GCC generates 34 instructions, now in 71 bytes, with a call of the __ctzdi2() function, instead of the straighforward code shown above – this is as bad as code might get!

Case 33: Yet More Epic Optimiser Failures

Create the text file case33.c with the following content in an arbitrary, preferable empty directory:

// Copyleft © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#include <arpa/inet.h>

unsigned long long htonll(unsigned long long h)
{
#ifndef ALTERNATE
    return htonl(1) == 1 ? h : htonl((unsigned long) (h >> 32))
                             | htonl((unsigned long) h) * 0x100000000u;
#else
    return ((unsigned char *) &h)[0] * 0x0100000000000000u
         | ((unsigned char *) &h)[1] * 0x0001000000000000u
         | ((unsigned char *) &h)[2] * 0x0000010000000000u
         | ((unsigned char *) &h)[3] * 0x0000000100000000u
         | ((unsigned char *) &h)[4] * 0x0000000001000000u
         | ((unsigned char *) &h)[5] * 0x0000000000010000u
         | ((unsigned char *) &h)[6] * 0x0000000000000100u
         | ((unsigned char *) &h)[7];
#endif
}

unsigned long long ntohll(unsigned long long n)
{
#ifdef ALTERNATE
    return ntohl(1) == 1 ? n : ntohl((unsigned long) (n >> 32))
                             | ntohl((unsigned long) n) * 0x100000000u;
#else
    return ((unsigned char *) &n)[0] * 0x0100000000000000u
         | ((unsigned char *) &n)[1] * 0x0001000000000000u
         | ((unsigned char *) &n)[2] * 0x0000010000000000u
         | ((unsigned char *) &n)[3] * 0x0000000100000000u
         | ((unsigned char *) &n)[4] * 0x0000000001000000u
         | ((unsigned char *) &n)[5] * 0x0000000000010000u
         | ((unsigned char *) &n)[6] * 0x0000000000000100u
         | ((unsigned char *) &n)[7];
#endif
}

Compile the source file case33.c with GCC, engaging its optimiser, targetting the i386 platform, and display the generated assembly code:

gcc -m32 -o- -O3 -S case33.c

Note: the left column shows the generated code in AT&T syntax, while the right column shows properly optimised shorter code in Intel syntax as comments.

[…]
htonll:
	movl	8(%esp), %edx	#	mov	eax, [esp+8]
	movl	4(%esp), %eax	#	mov	edx, [esp+4]
	bswapl	%edx		#	bswap	eax
	bswapl	%eax		#	bswap	edx
	xchgl	%edx, %eax	#
	ret			#	ret
[…]
ntohll:
	pushl	%edi		#
	xorl	%ecx, %ecx	#
	pushl	%esi		#
	pushl	%ebx		#
	movl	16(%esp), %edx	#	mov	edx, [esp+4]
	movl	20(%esp), %eax	#	mov	eax, [esp+8]
	movl	%edx, %edi	#	bswap	eax
	movzbl	%dh, %esi	#	bswap	edx
	sall	$24, %edi	#
	movl	%edi, %ebx	#
	movl	%esi, %edi	#
	xorl	%esi, %esi	#
	orl	%ecx, %esi	#
	movl	%eax, %ecx	#
	sall	$16, %edi	#
	shrl	$24, %ecx	#
	orl	%ebx, %edi	#
	xorl	%ebx, %ebx	#
	orl	%ecx, %esi	#
	movl	%edx, %ecx	#
	orl	%ebx, %edi	#
	shrl	$24, %edx	#
	shrl	$16, %ecx	#
	movzbl	%cl, %ecx	#
	movl	%ecx, %ebx	#
	xorl	%ecx, %ecx	#
	orl	%ecx, %esi	#
	xorl	%ecx, %ecx	#
	sall	$8, %ebx	#
	orl	%esi, %ecx	#
	movl	%eax, %esi	#
	orl	%ebx, %edi	#
	movl	%edx, %ebx	#
	sall	$24, %esi	#
	orl	%edi, %ebx	#
	orl	%esi, %ecx	#
	movl	%esi, %edi	#
	movzbl	%ah, %esi	#
	shrl	$8, %eax	#
	sall	$16, %esi	#
	sarl	$31, %edi	#
	andl	$65280, %eax	#
	orl	%edi, %ebx	#
	movl	%esi, %edi	#
	orl	%esi, %ecx	#
	cltd			#
	sarl	$31, %edi	#
	orl	%ecx, %eax	#
	orl	%edi, %ebx	#
	orl	%ebx, %edx	#
	popl	%ebx		#
	popl	%esi		#
	popl	%edi		#
	ret			#	ret
[…]
	.ident	"GCC: (GNU […]) 13.2.0"

OUCH: while the code generated for the first function is almost perfect, its desastrous for the second function!

Compile the source file case33.c a second time with GCC, now targetting the AMD64 platform, and display the generated assembly code:

gcc -m64 -o- -O3 -S case33.c

Note: the left column shows the generated code in AT&T syntax, while the right column shows properly optimised shorter code in Intel syntax as comments.

[…]
htonll:
	movq	%rdi, %rdx	#
	movl	%edi, %eax	#	mov	rax, rdi
	shrq	$32, %rdx	#
	bswapl	%eax		#	bswap	rax
	salq	$32, %rax	#
	bswapl	%edx		#
	movl	%edx, %edx	#
	orq	%rdx, %rax	#
	ret			#	ret
ntohll:
	movq	%rdi, %rdx	#
	movq	%rdi, %rax	#	mov	rax, rdi
	movzbl	%dh, %ecx	#	bswap	rax
	salq	$56, %rax	#
	salq	$48, %rcx	#
	shrq	$40, %rdx	#
	orq	%rcx, %rax	#
	movq	%rdi, %rcx	#
	andl	$65280, %edx	#
	shrq	$56, %rcx	#
	orq	%rcx, %rax	#
	movq	%rdi, %rcx	#
	shrq	$16, %rcx	#
	movzbl	%cl, %ecx	#
	salq	$40, %rcx	#
	orq	%rcx, %rax	#
	movq	%rdi, %rcx	#
	shrq	$24, %rcx	#
	movzbl	%cl, %ecx	#
	salq	$32, %rcx	#
	orq	%rcx, %rax	#
	movq	%rdi, %rcx	#
	shrq	$32, %rcx	#
	sall	$24, %ecx	#
	movslq	%ecx, %rcx	#
	orq	%rcx, %rax	#
	movq	%rdi, %rcx	#
	shrq	$24, %rcx	#
	andl	$16711680, %ecx	#
	orq	%rcx, %rax	#
	orq	%rdx, %rax	#
	ret			#	ret
[…]
	.ident	"GCC: (GNU […]) 13.2.0"

OUCH: now the code generated for both functions is desastrous!

Contact

If you miss anything here, have additions, comments, corrections, criticism or questions, want to give feedback, hints or tipps, report broken links, bugs, deficiencies, errors, inaccuracies, misrepresentations, omissions, shortcomings, vulnerabilities or weaknesses, …: don’t hesitate to contact me and feel free to ask, comment, criticise, flame, notify or report!

Use the X.509 certificate to send S/MIME encrypted mail.

Note: email in weird format and without a proper sender name is likely to be discarded!

I dislike HTML (and even weirder formats too) in email, I prefer to receive plain text.
I also expect to see your full (real) name as sender, not your nickname.
I abhor top posts and expect inline quotes in replies.

Terms and Conditions

By using this site, you signify your agreement to these terms and conditions. If you do not agree to these terms and conditions, do not use this site!

The software and the documentation on this site are provided as is without any warranty, neither express nor implied.
In no event will the author be held liable for any damage(s) arising from the use of the software or the documentation.
Permission is granted to use the current version of the software and the current version of the documentation solely for personal private and non-commercial purposes.
An individuals use of the software or the documentation in his or her capacity or function as an agent, (independent) contractor, employee, member or officer of a business, corporation or organisation (commercial or non-commercial) does not qualify as personal private and non-commercial purpose.
Without written approval from the author the software or the documentation must not be used for a business, for commercial, corporate, governmental, military or organisational purposes of any kind, or in a commercial, corporate, governmental, military or organisational environment of any kind.
Redistribution of the software and the documentation is allowed only in unmodified form of its current version and free of charge.

Data Protection Declaration

This web page records no (personal) data and stores no cookies in the web browser.

The web service is operated and provided by

Telekom Deutschland GmbH
Business Center
D-64306 Darmstadt
Germany
<‍hosting‍@‍telekom‍.‍de‍>
+49 800 5252033

The web service provider stores a session cookie in the web browser and records every visit of this web site with the following data in an access log on their server(s):

the (pseudonymised) IP address;
the date and time of the request;
the URL of the requested web page or file;
the Referer and User-Agent HTTP headers sent by the web browser;
the result (success or failure) of the request;
the amount of data received and sent.

Me, myself & IT

Deficiencies in GCC’s Code Generator and Optimiser

Purpose

Reason

Case 0

Case 1: __udivmod?i4() Functions

Case 2: the missing __rotl?i3() and __rotr?i3() Functions

Case 3: __ashl?i3() Functions

Case 4: __ashr?i3() Functions

Case 5: __lshr?i3() Functions

Case 6: __cmp?i2() Functions

Case 7: __ucmp?i2() Functions

Case 8: __mul?i3() Functions

Case 9: __neg?i2() Functions

Case 10: __absv?i2() Functions

Case 11: __addv?i3() Functions

Case 12: __subv?i3() Functions

Case 13: __mulv?i3() Functions

Case 14: __negv?i2() Functions

Case 15: __parity?i2() Functions

Case 16: __builtin_parity() Builtin

Case 17: __builtin_mul_overflow() Builtin

Case 18: __builtin_sub_overflow() Builtin

Case 19: __builtin_copysign() Builtin

Case 20: -fsanitize=signed-integer-overflow Command Line Option

Case 21: -ftrapv Command Line Option

Case 22: Shell Game, or Crazy Braindead Register Allocator

Case 23: Undefined Behaviour in __bswapsi2() Function

Case 24: Undefined Behaviour or Optimiser Failure?

Case 25: Optimiser Failures

Case 26: More Optimiser Failures

Case 27: Yet More Optimiser Failures

Case 28: BDead Code Generation

Case 29: Superfluous Extraneous Code Generation

Case 30: Epic Optimiser Failures

Case 31: Cargo Cult, or Ignorance is Bliss, plus Epic Optimiser Failure

Case 32: More Epic Optimiser Failures

Case 33: Yet More Epic Optimiser Failures

Contact

Terms and Conditions

Data Protection Declaration

Case 1: `__udivmod?i4()` Functions

Case 2: the missing `rotl?i3()` and `rotr?i3()` Functions

Case 3: `__ashl?i3()` Functions

Case 4: `__ashr?i3()` Functions

Case 5: `__lshr?i3()` Functions

Case 6: `__cmp?i2()` Functions

Case 7: `__ucmp?i2()` Functions

Case 8: `__mul?i3()` Functions

Case 9: `__neg?i2()` Functions

Case 10: `__absv?i2()` Functions

Case 11: `__addv?i3()` Functions

Case 12: `__subv?i3()` Functions

Case 13: `__mulv?i3()` Functions

Case 14: `__negv?i2()` Functions

Case 15: `__parity?i2()` Functions

Case 16: `__builtin_parity()` Builtin

Case 17: `__builtin_mul_overflow()` Builtin

Case 18: `__builtin_sub_overflow()` Builtin

Case 19: `__builtin_copysign()` Builtin

Case 20: `-fsanitize=signed-integer-overflow` Command Line Option

Case 21: `-ftrapv` Command Line Option

Case 23: Undefined Behaviour in `__bswapsi2()` Function