Me, myself & IT

Fast(est) Double-Word Integer Division

Purpose
Algorithms: Extended Precision Division; Shift & Subtract Division; Hybrid Variant
Implementation for AMD64 Processors: 128÷128-bit Unsigned Integer Division (128-bit Quotient and Remainder); 128÷128-bit Signed Integer Division (128-bit Quotient and Remainder); 128÷128-bit Signed Integer Division (128-bit Quotient); 128÷128-bit Signed Integer Division (128-bit Remainder); 128-bit Signed and Unsigned Integer Addition, Comparison, Magnitude, Maximum, Minimum, Multiplication, Negation, Parity, Rotation, Shift, Subtraction, … (128-bit Sum, Product, Result, Difference, …); 64÷64-bit Unsigned Integer Division (64-bit Quotient and Remainder)
Implementation for i386 Processors: 64÷64-bit Unsigned Integer Division (64-bit Quotient and Remainder); 64÷64-bit Unsigned Integer Division (64-bit Quotient); 64÷64-bit Unsigned Integer Division (64-bit Remainder); 64÷64-bit Signed Integer Division (64-bit Quotient and Remainder); 64÷64-bit Signed Integer Division (64-bit Quotient); 64÷64-bit Signed Integer Division (64-bit Remainder); 64×64-bit Signed and Unsigned Integer Multiplication (64-bit Product); 64-bit Signed and Unsigned Integer Shift (64-bit Result)
Execution Times (Sustained Reciprocal Throughput): Running ’round in ~~Circles~~ Cycles; Summary; Benchmark Programs for AMD64 Processors; Benchmark Program for i386 Processors

Purpose

Present the fast 128÷128-bit unsigned integer division routines __udivmodti4(), __udivti3() and __umodti3() plus the fast 128÷128-bit signed integer division routines __divmodti4(), __divti3() and __modti3() for AMD64 processors, the fast 64÷64-bit unsigned integer division routines __udivmoddi4(), __udivdi3() and __umoddi3() plus the fast 64÷64-bit signed integer division routines __divdi3() and __moddi3() for i386 processors, as well as the fast compiler helper routines _alldiv(), _alldvrm(), _allmul(), _allrem(), _allshl(), _allshr(), _aulldiv(), _aulldvrm(), _aullrem() and _aullshr() for the Microsoft® Visual C compiler to perform 64-bit integer division and multiplication on i386 processors.

Note: the fast 128÷64-bit unsigned integer division routine _udiv128() for i386 processors, implemented in ANSI C and Assembler, is presented in my related article Donald Knuth’s Algorithm D …, and provided with my NOMSVCRT.LIB runtime library; in the latter it is complemented by the so-called widening 64×64-bit signed and unsigned integer multiplication routines _mul128() and _umul128() which yield a 128-bit product.

Note: double word means twice the register width here!

Algorithms

The following outline assumes a machine word and corresponding register width of 64 bits.

Extended Precision Division

The extended precision division algorithm is quite simple: it relies on the processor’s native DIV instruction, which performs a so-called narrowing 128÷64-bit division, producing a 64-bit quotient and a 64-bit remainder from a 128-bit dividend and a 64-bit divisor.

Exceptional case: If the divisor is 0, a divide by 0 exception is raised.
Note: this is handled by the second of the simple cases!
Trivial cases:
If the divisor is greater than the dividend (which implies that the divisor is greater than 0), the quotient is 0, while the remainder is equal to the dividend.
If the divisor is equal to the dividend and (both are) greater than 0, the quotient is 1, while the remainder is 0.
If the divisor is 1=2⁰, the quotient is equal to the dividend, and the remainder is 0.
If the divisor is a power of 2, i.e. the logical AND of the divisor and the divisor−1 is 0, the quotient is equal to the dividend shifted right by the number of trailing '0' bits of the divisor, while the remainder is the result of the logical AND of the divisor−1 and the dividend.
Simple cases:
If the divisor is less than 2⁶⁴, i.e. its upper half is 0, but (its lower half is) greater than the upper half of the dividend, quotient and remainder are less than 2⁶⁴, i.e. their upper halves are 0, and a single DIV instruction yields their lower halves.
If the divisor is less than 2⁶⁴, i.e. its upper half is 0, and (its lower half is) not greater than the upper half of the dividend, the upper half of the remainder is 0 too, while its lower half and (both halves of) the quotient are produced with two consecutive DIV instructions using the so-called long alias schoolbook division (and 64-bit numbers as digits) to avoid an overflow of the quotient.
Special case: If the divisor is not less than 2¹²⁷ (which implies that due to the first trivial case the dividend too is not less than 2¹²⁷), i.e. it has no leading '0' bits and its most significant bit is '1', the quotient is 1 and the remainder is computed as difference of the dividend and the divisor.
Hard case (multiple steps):
If the divisor is not less than 2⁶⁴, it is normalised, i.e. shifted left until its most significant bit is set, and its lower half is discarded; together this is equivalent to a division by 2^{64−number of leading '0' bits}.
If the upper half of the dividend is not less than the truncated normalised divisor′, the latter is first subtracted from the former to prevent an overflow; the normalised divisor′ is then used to produce the lower half of an intermediate approximate quotient′ with a single DIV instruction. The upper half of the intermediate approximate quotient′ is 1 if the normalised divisor′ was subtracted before, else 0.
The intermediate approximate quotient′ is shifted left by the same amount as the normalised divisor′, giving the final approximate quotient″ in the upper half of the shifted value (its lower half is discarded); the final approximate quotient″ might be 1 too high due to the discarded lower half of the normalised divisor′.
The approximate remainder′ is computed as difference of the dividend and the product of original divisor and final approximate quotient″ (when the latter is 1 too high and the dividend is not less than 2¹²⁷, the product might not be less than 2¹²⁸, i.e. it may overflow).
If the approximate remainder′ is less than 0, the original divisor is added, while the final approximate quotient″ is decremented by 1, producing the proper quotient and remainder.
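
The hard case can be tried out in portable C when everything is scaled down by a factor of 2: words are 32 bits, double words 64 bits, and the narrowing DIV instruction is replaced by a hypothetical helper div_narrow(). The following is a minimal sketch under these assumptions, not the implementation presented below; the function names are made up for illustration, and the final correction follows the "product overflowed or exceeds the dividend" test instead of testing the sign of the remainder′.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the narrowing DIV instruction, scaled down to
   32-bit words: 64/32-bit division, valid only while high < divisor. */
static uint32_t div_narrow(uint32_t high, uint32_t low, uint32_t divisor,
                           uint32_t *remainder)
{
    uint64_t dividend = ((uint64_t)high << 32) | low;

    assert(high < divisor);     /* else the 32-bit quotient would overflow */
    *remainder = (uint32_t)(dividend % divisor);
    return (uint32_t)(dividend / divisor);
}

/* Hard case only: divisor >= 2**32 and dividend >= divisor. */
uint64_t udivmod_hard(uint64_t dividend, uint64_t divisor, uint64_t *remainder)
{
    uint32_t high = (uint32_t)(divisor >> 32), unused;
    unsigned shift = 0;         /* number of leading '0' bits of the divisor */
    uint64_t quotient, product, tmp;

    assert(high != 0 && dividend >= divisor);

    while (((high << shift) & 0x80000000u) == 0)
        shift++;

    /* truncated normalised divisor' */
    high = (uint32_t)((divisor << shift) >> 32);

    /* intermediate approximate quotient' (33 bits) */
    if ((uint32_t)(dividend >> 32) < high)
        quotient = div_narrow((uint32_t)(dividend >> 32),
                              (uint32_t)dividend, high, &unused);
    else                        /* subtract divisor' first to prevent overflow */
        quotient = div_narrow((uint32_t)(dividend >> 32) - high,
                              (uint32_t)dividend, high, &unused) | (1ULL << 32);

    /* final approximate quotient" (less than 2**32, at most 1 too high) */
    quotient >>= 32 - shift;

    /* tmp = upper 64 bits of the 96-bit product divisor * quotient",
       product = its lower 64 bits */
    tmp = quotient * (divisor >> 32) + ((quotient * (uint32_t)divisor) >> 32);
    product = quotient * divisor;

    if ((tmp >> 32) != 0        /* product >= 2**64 > dividend */
     || product > dividend)     /* product > dividend */
    {
        quotient -= 1;
        product -= divisor;     /* now exact: product <= dividend < 2**64 */
    }

    if (remainder != 0)
        *remainder = dividend - product;

    return quotient;
}
```

With dividend 2⁶⁴−1 and divisor 0x2492492492492493 the final approximate quotient″ is 7 and the product overflows, so the sketch corrects the quotient to 6.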

Shift & Subtract Division

The shift & subtract alias binary long division algorithm is almost trivial: it’s the schoolbook algorithm using bits as digits.

Exceptional case: If the divisor is 0, a divide by 0 exception is raised.
Trivial cases:
If the divisor is greater than the dividend (which implies that the divisor is greater than 0), the quotient is 0, while the remainder is equal to the dividend.
If the divisor is equal to the dividend and (both are) greater than 0, the quotient is 1, while the remainder is 0.
If the divisor is 1=2⁰, the quotient is equal to the dividend, and the remainder is 0.
If the divisor is a power of 2, i.e. the logical AND of the divisor and the divisor−1 is 0, the quotient is equal to the dividend shifted right by the number of trailing '0' bits of the divisor, while the remainder is the result of the logical AND of the divisor−1 and the dividend.
Long case (with multiple steps repeated in a loop):
The quotient is set to 0.
The divisor is aligned to the dividend, i.e. shifted left until their most significant bits are in the same position.
Until the divisor is back in its original position,
¹ the quotient is shifted left one bit,
² if the dividend is not less than the divisor, the divisor is subtracted from the dividend, and the quotient incremented by 1, i.e. its least significant bit is set,
³ the divisor is shifted right one bit.
The remainder is the (remaining) dividend.
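
The cases above can be sketched in portable C for a single 64-bit word. The function names udivmod_shift_subtract() and msb_index() are made up for illustration, the bit scans are plain loops instead of processor instructions, and the exceptional case is modelled with assert(); take it as a sketch, not as one of the routines presented below.

```c
#include <assert.h>
#include <stdint.h>

/* Index of the most significant '1' bit of a non-zero word. */
static unsigned msb_index(uint64_t x)
{
    unsigned index = 0;

    while (x >>= 1)
        index++;

    return index;
}

uint64_t udivmod_shift_subtract(uint64_t dividend, uint64_t divisor,
                                uint64_t *remainder)
{
    uint64_t quotient = 0;
    unsigned shift = 0;

    assert(divisor != 0);               /* exceptional case */

    if (divisor > dividend)             /* trivial case: quotient = 0 */
    {
        *remainder = dividend;
        return 0;
    }

    if ((divisor & (divisor - 1)) == 0) /* trivial case: power of 2 */
    {
        while (((divisor >> shift) & 1) == 0)
            shift++;                    /* count trailing '0' bits */

        *remainder = dividend & (divisor - 1);
        return dividend >> shift;
    }

    /* long case: align the most significant '1' bits */
    shift = msb_index(dividend) - msb_index(divisor);
    divisor <<= shift;

    do                                  /* runs shift + 1 times */
    {
        quotient <<= 1;

        if (dividend >= divisor)
        {
            dividend -= divisor;
            quotient |= 1;
        }

        divisor >>= 1;
    } while (shift-- != 0);

    *remainder = dividend;
    return quotient;
}
```

Note that the test for a power of 2 is also satisfied by a divisor of 0, so the exceptional case has to be handled first.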

Hybrid Variant

The hybrid variant combines the long alias schoolbook division with the binary long alias shift & subtract division algorithm:
if the divisor is less than 2⁶⁴, it uses the simple cases of the extended precision division algorithm,
else it uses the trivial cases and the long case of the shift & subtract alias binary long division algorithm.
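
The first branch, i.e. the two simple cases, can be sketched in portable C when scaled down to 32-bit words and 64-bit double words. The helper div_narrow() is a hypothetical stand-in for the narrowing DIV instruction, and udivmod_simple() is a name made up for illustration; the remainder pointer must not be 0 in this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the narrowing DIV instruction, scaled down to
   32-bit words: 64/32-bit division, valid only while high < divisor. */
static uint32_t div_narrow(uint32_t high, uint32_t low, uint32_t divisor,
                           uint32_t *remainder)
{
    uint64_t dividend = ((uint64_t)high << 32) | low;

    assert(high < divisor);     /* also fails for divisor = 0 */
    *remainder = (uint32_t)(dividend % divisor);
    return (uint32_t)(dividend / divisor);
}

/* Simple cases: the divisor fits into a single word. */
uint64_t udivmod_simple(uint64_t dividend, uint32_t divisor,
                        uint32_t *remainder)
{
    uint32_t high = (uint32_t)(dividend >> 32);
    uint32_t low = (uint32_t)dividend;
    uint32_t quotient_high = 0;

    /* "long" alias "schoolbook" division with 32-bit words as digits:
       divide the upper digit first, its remainder replaces it */
    if (high >= divisor)
        quotient_high = div_narrow(0, high, divisor, &high);

    /* here high < divisor, so the quotient of this step cannot overflow */
    return ((uint64_t)quotient_high << 32)
         | div_narrow(high, low, divisor, remainder);
}
```

A divisor of 0 always takes the schoolbook path, where the assertion (in place of the divide by 0 exception) fires.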

Implementation for AMD64 Processors

Prototypes for the __udivmodti4(), __udivti3(), __umodti3(), __divmodti4(), __divti3() and __modti3() functions:

// Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

uint128_t __udivmodti4(uint128_t dividend, uint128_t divisor, uint128_t *remainder);
uint128_t __udivti3(uint128_t dividend, uint128_t divisor);
uint128_t __umodti3(uint128_t dividend, uint128_t divisor);

int128_t __divmodti4(int128_t dividend, int128_t divisor, int128_t *remainder);
int128_t __divti3(int128_t dividend, int128_t divisor);
int128_t __modti3(int128_t dividend, int128_t divisor);
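
Signed division routines are commonly layered on the unsigned routine: divide the magnitudes, then fix the signs. The following sketch shows this layering scaled down to 64-bit integers; it is an assumption about the usual technique, not the implementation presented here, and the names udivmod() and divmod() are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 64-bit stand-in for the unsigned routine. */
static uint64_t udivmod(uint64_t dividend, uint64_t divisor,
                        uint64_t *remainder)
{
    if (remainder != 0)
        *remainder = dividend % divisor;

    return dividend / divisor;
}

/* Truncating signed division: the quotient is rounded towards 0,
   the remainder has the sign of the dividend. */
int64_t divmod(int64_t dividend, int64_t divisor, int64_t *remainder)
{
    /* magnitudes, computed in unsigned arithmetic to cope with INT64_MIN */
    uint64_t n = dividend < 0 ? 0 - (uint64_t)dividend : (uint64_t)dividend;
    uint64_t d = divisor < 0 ? 0 - (uint64_t)divisor : (uint64_t)divisor;
    uint64_t r, q = udivmod(n, d, &r);

    if ((dividend < 0) != (divisor < 0))
        q = 0 - q;              /* signs differ: negative quotient */

    if (dividend < 0)
        r = 0 - r;              /* remainder follows the dividend */

    /* conversion of values above INT64_MAX is assumed to wrap (two's
       complement), as it does on all mainstream compilers */
    if (remainder != 0)
        *remainder = (int64_t)r;

    return (int64_t)q;
}
```

In this sketch the overflowing quotient of INT64_MIN ÷ −1 simply wraps around instead of raising an exception.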

Prototype for the __udivmoddi4() function, and sample implementation of the 64÷64-bit shift & subtract division for the Microsoft Visual C compiler:

// Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifndef _MSC_VER
uint64_t __udivmoddi4(uint64_t dividend, uint64_t divisor, uint64_t *remainder);
#else
#pragma intrinsic(_BitScanReverse64)

typedef unsigned __int32 uint32_t;
typedef unsigned __int64 uint64_t;

uint64_t __udivmoddi4(uint64_t dividend, uint64_t divisor, uint64_t *remainder)
{
    uint64_t quotient;
    uint32_t index1, index2;

    if (_BitScanReverse64(&index2, divisor))
        if (_BitScanReverse64(&index1, dividend))
#if 0
            if (index1 >= index2)
#else
            if (dividend >= divisor)
#endif
            {
                // dividend >= divisor > 0,
                //  64 > index1 >= index2 >= 0
                //   (number of leading '0' bits = 63 - index)

                divisor <<= index1 - index2;
                quotient = 0;

                do
                {
                    quotient <<= 1;

                    if (dividend >= divisor)
                    {
                        dividend -= divisor;
                        quotient |= 1;
                    }

                    divisor >>= 1;
                } while (index1 >= ++index2);

                if (remainder != 0)
                    *remainder = dividend;

                return quotient;
            }
            else // divisor > dividend > 0:
                 //  quotient = 0, remainder = dividend
            {
                if (remainder != 0)
                    *remainder = dividend;

                return 0;
            }
        else // divisor > dividend == 0:
             //  quotient = 0, remainder = 0
        {
            if (remainder != 0)
                *remainder = 0;

            return 0;
        }
    else // divisor == 0
    {
        if (remainder != 0)
            *remainder = dividend % divisor;

        return dividend / divisor;
    }
}
#endif // _MSC_VER

The suffix di4 specifies the number of arguments plus return value and their size: double integer denotes an 8-byte QWORD alias 64-bit uint64_t.

Prototype for the __udivmodti4() function, and sample implementation of the 128÷128-bit extended precision division as well as the shift & subtract division for the Microsoft Visual C compiler:

// Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifndef _MSC_VER
uint128_t __udivmodti4(uint128_t dividend, uint128_t divisor, uint128_t *remainder);
#else
typedef unsigned __int32 uint32_t;
typedef unsigned __int64 uint64_t;
#if 0
typedef unsigned __int128 uint128_t;
#else
typedef struct
{
    uint64_t low, high;
} uint128_t;
#endif
#if _MSC_VER >= 1920 // MSC 19.20 alias 2019
#pragma intrinsic(__shiftleft128, __shiftright128, _udiv128, _umul128, _BitScanReverse64)

uint128_t __udivmodti4(uint128_t dividend, uint128_t divisor, uint128_t *remainder)
{
    uint128_t quotient;
#ifndef HYBRID
    uint64_t high, low, tmp;
    uint32_t index;

    if (_BitScanReverse64(&index, divisor.high))
    {
        high = __shiftleft128(divisor.low, divisor.high, 63 - index);

        if (high > dividend.high)
        {
            low = _udiv128(dividend.high, dividend.low, high, &tmp);
            low = __shiftleft128(low, 0, 63 - index);
        }
        else // prevent overflow
        {
            low = _udiv128(dividend.high - high, dividend.low, high, &tmp);
            low = __shiftleft128(low, 1, 63 - index);
        }

        quotient.high = 0;
        quotient.low = low;

        tmp = low * divisor.high;
        low = _umul128(low, divisor.low, &high);
        high += tmp;

        if ((high < tmp)           // quotient * divisor >= 2**128 > dividend
         || (high > dividend.high) // quotient * divisor > dividend
         || ((high == dividend.high) && (low > dividend.low)))
        {
            quotient.low -= 1;

            low = _umul128(quotient.low, divisor.low, &high);
            high += quotient.low * divisor.high;
        }

        if (remainder != 0)
        {
            dividend.high -= high + (dividend.low < low);
            dividend.low -= low;

            *remainder = dividend;
        }
    }
#else // HYBRID
    uint64_t tmp;
    uint32_t index1, index2;

    if (_BitScanReverse64(&index2, divisor.high))
        if (_BitScanReverse64(&index1, dividend.high))
            if (index1 >= index2)
            {
                // dividend >= divisor >= 2**64,
                //  64 > index1 >= index2 >= 0
                //   (number of leading '0' bits = 63 - index)

                divisor.high = __shiftleft128(divisor.low, divisor.high, index1 - index2);
                divisor.low <<= index1 - index2;

                quotient.high = quotient.low = 0;

                do
                {
                    quotient.low <<= 1;

                    if ((dividend.high > divisor.high)
                     || ((dividend.high == divisor.high) && (dividend.low >= divisor.low)))
                    {
                        dividend.high -= divisor.high + (dividend.low < divisor.low);
                        dividend.low -= divisor.low;

                        quotient.low |= 1;
                    }

                    divisor.low = __shiftright128(divisor.low, divisor.high, 1);
                    divisor.high >>= 1;
                } while (index1 >= ++index2);

                if (remainder != 0)
                    *remainder = dividend;
            }
            else // divisor > dividend >= 2**64:
                 //  quotient = 0, remainder = dividend
            {
                quotient.high = quotient.low = 0;

                if (remainder != 0)
                    *remainder = dividend;
            }
        else // divisor >= 2**64 > dividend:
             //  quotient = 0, remainder = dividend
        {
            quotient.high = quotient.low = 0;

            if (remainder != 0)
#if 0
            {
                remainder->high = 0;
                remainder->low = dividend.low;
            }
#else
                *remainder = dividend;
#endif
        }
#endif // HYBRID
    else // divisor < 2**64
    {
        if (dividend.high < divisor.low)
        {
            quotient.high = 0;
            quotient.low = _udiv128(dividend.high, dividend.low, divisor.low, &tmp);
        }
        else // "long" alias "schoolbook" division
        {
            quotient.high = _udiv128(0, dividend.high, divisor.low, &tmp);
            quotient.low = _udiv128(tmp, dividend.low, divisor.low, &tmp);
        }

        if (remainder != 0)
        {
            remainder->high = 0;
            remainder->low = tmp;
        }
    }

    return quotient;
}
#endif
#endif // _MSC_VER

The suffix ti4 specifies the number of arguments plus return value and their size: tetra integer denotes a 16-byte OWORD alias 128-bit uint128_t.

Note: the Microsoft Visual C compiler does not provide a 128-bit integer data type; the keyword __int128 is reserved but unsupported, and its use yields error C4235.

Note: with the preprocessor macro HYBRID defined, the hybrid variant of the division algorithm is used.
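
Since the 128-bit integers are represented as a pair of 64-bit halves, comparison and subtraction have to propagate the borrow by hand, as the sample implementation above does inline. A minimal portable sketch of both operations follows; the type name u128 and the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef struct
{
    uint64_t low, high;
} u128;                         /* hypothetical name */

/* a >= b: the high halves decide, unless they are equal */
int u128_ge(u128 a, u128 b)
{
    return (a.high > b.high)
        || ((a.high == b.high) && (a.low >= b.low));
}

/* a - b modulo 2**128 */
u128 u128_sub(u128 a, u128 b)
{
    u128 difference;

    /* the comparison yields the borrow (0 or 1) of the low halves */
    difference.high = a.high - b.high - (a.low < b.low);
    difference.low = a.low - b.low;

    return difference;
}
```

For example, 2⁶⁴ − 1 borrows from the high half and yields low = 2⁶⁴−1, high = 0.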

128÷128-bit Unsigned Integer Division (128-bit Quotient and Remainder)

__udivmodti4(), __udivti3() and __umodti3() functions for AMD64 processors, supporting the Unix® System V calling convention, using the extended precision division algorithm:

# Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

# * The software is provided "as is" without any warranty, neither express
#   nor implied.
# * In no event will the author be held liable for any damage(s) arising
#   from the use of the software.
# * Redistribution of the software is allowed only in unmodified form.
# * Permission is granted to use the software solely for personal private
#   and non-commercial purposes.
# * An individuals use of the software in his or her capacity or function
#   as an agent, (independent) contractor, employee, member or officer of
#   a business, corporation or organization (commercial or non-commercial)
#   does not qualify as personal private and non-commercial purpose.
# * Without written approval from the author the software must not be used
#   for a business, for commercial, corporate, governmental, military or
#   organizational purposes of any kind, or in a commercial, corporate,
#   governmental, military or organizational environment of any kind.

# Unix System V calling convention for AMD64 platform:
# - first 6 integer or pointer arguments (from left to right) are passed
#   in registers RDI/R7, RSI/R6, RDX/R2, RCX/R1, R8 and R9
#   (R10 is used as static chain pointer in case of nested functions);
# - surplus arguments are pushed on stack in reverse order (from right to
#   left), 8-byte aligned;
# - 128-bit integer arguments are passed as pair of 64-bit integer arguments,
#   low part before/below high part;
# - 128-bit integer result is returned in registers RAX/R0 (low part) and
#   RDX/R2 (high part);
# - 64-bit integer or pointer result is returned in register RAX/R0;
# - 32-bit integer result is returned in register EAX;
# - registers RBX/R3, RSP/R4, RBP/R5, R12, R13, R14 and R15 must be
#   preserved;
# - registers RAX/R0, RCX/R1, RDX/R2, RSI/R6, RDI/R7, R8, R9, R10 (in
#   case of normal functions) and R11 are volatile and can be clobbered;
# - stack is 16-byte aligned: callee must decrement RSP by 8+n*16 bytes
#   before calling other functions (CALL instruction pushes 8 bytes);
# - a "red zone" of 128 bytes below the stack pointer can be clobbered.

# NOTE: raises "division exception" when divisor is 0!

.file	"udivmodti4.s"
.arch	generic64
.code64
.intel_syntax noprefix
.text
				# rsi:rdi = dividend
				# rcx:rdx = divisor
__umodti3:
	sub	rsp, 24
	mov	r8, rsp		# r8 = address of remainder
	call	__udivmodti4
	pop	rax
	pop	rdx		# rdx:rax = remainder
	pop	rcx
	ret
				# rsi:rdi = dividend
				# rcx:rdx = divisor
__udivti3:
	xor	r8, r8
				# rsi:rdi = dividend
				# rcx:rdx = divisor
				# r8 = oword ptr remainder
__udivmodti4:
	cmp	rdi, rdx
	mov	rax, rsi
	sbb	rax, rcx
	jb	.trivial	# dividend < divisor?

	mov	r11, rcx	# r11 = high qword of divisor
	mov	r10, rdx	# r10 = low qword of divisor

	bsr	rcx, rcx	# rcx = index of most significant '1' bit
				#        in high qword of divisor
	jnz	.extended	# high qword of divisor <> 0?

	# remainder < divisor < 2**64

	cmp	rsi, rdx
	jae	.long		# high qword of dividend >= (low qword of) divisor?

	# dividend < divisor * 2**64: quotient < 2**64
	# perform normal division
.normal:
	mov	rdx, rsi
	mov	rax, rdi	# rdx:rax = dividend
	div	r10		# rax = (low qword of) quotient,
				# rdx = (low qword of) remainder
	test	r8, r8
	jz	0f		# address of remainder = 0?

	mov	[r8], rdx
	mov	[r8+8], r11	# high qword of remainder = 0
0:
	mov	rdx, r11	# rdx:rax = quotient
	ret

	# dividend >= divisor * 2**64: quotient >= 2**64
	# perform "long" alias "schoolbook" division
.long:
	mov	rdx, r11	# rdx = 0
	mov	rax, rsi	# rdx:rax = high qword of dividend
	div	r10		# rax = high qword of quotient,
				# rdx = high qword of remainder'
	mov	rcx, rax	# rcx = high qword of quotient
	mov	rax, rdi	# rax = low qword of dividend
	div	r10		# rax = low qword of quotient,
				# rdx = (low qword of) remainder
	test	r8, r8
	jz	1f		# address of remainder = 0?

	mov	[r8], rdx
	mov	[r8+8], r11	# high qword of remainder = 0
1:
	mov	rdx, rcx	# rdx:rax = quotient
	ret

	# dividend < divisor: quotient = 0, remainder = dividend
.trivial:
	test	r8, r8
	jz	2f		# address of remainder = 0?

	mov	[r8], rdi
	mov	[r8+8], rsi	# remainder = dividend
2:
	xor	eax, eax
	xor	edx, edx	# rdx:rax = quotient = 0
	ret

	# dividend >= divisor >= 2**64: quotient < 2**64
.extended:
	xor	ecx, 63		# ecx = number of leading '0' bits
				#        in (high qword of) divisor
	jz	.special	# divisor >= 2**127?

	# perform "extended & adjusted" division

	mov	r9, r11		# r9 = high qword of divisor
	shld	r9, r10, cl	# r9 = divisor / 2**(index + 1)
				#    = divisor'
#	shl	r10, cl

	mov	rax, rdi
	mov	rdx, rsi	# rdx:rax = dividend

	push	rbx
.ifnotdef JCCLESS
	xor	ebx, ebx	# rbx = high qword of quotient' = 0

	cmp	rdx, r9
	jb	3f		# high qword of dividend < divisor'?

	# high qword of dividend >= divisor':
	# subtract divisor' from high qword of dividend to prevent possible
	# quotient overflow and set most significant bit of quotient"

	sub	rdx, r9		# rdx = high qword of dividend - divisor'
				#     = high qword of dividend'
	inc	ebx		# rbx = high qword of quotient' = 1
3:
.elseif 0
	sub	rdx, r9		# rdx = high qword of dividend - divisor'
	sbb	rbx, rbx	# rbx = (high qword of dividend < divisor') ? -1 : 0
	and	rbx, r9		# rbx = (high qword of dividend < divisor') ? divisor' : 0
	add	rdx, rbx	# rdx = high qword of dividend
				#     - (high qword of dividend < divisor') ? 0 : divisor'
				#     = high qword of dividend'
	neg	rbx		# CF = (high qword of dividend < divisor')
	sbb	ebx, ebx	# ebx = (high qword of dividend < divisor') ? -1 : 0
	inc	ebx		# rbx = (high qword of dividend < divisor') ? 0 : 1
				#     = high qword of quotient'
.elseif 0
	sub	rdx, r9		# rdx = high qword of dividend - divisor'
	cmovb	rdx, rsi	#     = high qword of dividend'
	sbb	ebx, ebx	# ebx = (high qword of dividend < divisor') ? -1 : 0
	inc	ebx		# rbx = (high qword of dividend < divisor') ? 0 : 1
				#     = high qword of quotient'
.else # JCCLESS
	xor	ebx, ebx	# rbx = high qword of quotient' = 0
	sub	rdx, r9		# rdx = high qword of dividend - divisor'
	cmovb	rdx, rsi	#     = high qword of dividend'
	sbb	ebx, -1		# rbx = (high qword of dividend < divisor') ? 0 : 1
				#     = high qword of quotient'
.endif # JCCLESS
	# high qword of dividend' < divisor'

	div	r9		# rax = dividend' / divisor'
				#     = low qword of quotient',
				# rdx = remainder'
	shld	rbx, rax, cl	# rbx = quotient' / 2**(index + 1)
				#     = dividend / divisor'
				#     = quotient"
#	shl	rax, cl
	mov	rax, r10	# rax = low qword of divisor
	mov	r9, r11		# r9 = high qword of divisor
	imul	r9, rbx		# r9 = high qword of divisor * quotient"
	mul	rbx		# rdx:rax = low qword of divisor * quotient"
.ifnotdef JCCLESS
	add	rdx, r9		# rdx:rax = divisor * quotient"
	jnc	4f		# divisor * quotient" < 2**128?
				#  (with carry, it is off by divisor,
				#   and quotient" is off by 1)
.if 0
	sbb	rbx, 0		# rbx = quotient" - 1
.else
	dec	rbx		# rbx = quotient" - 1
.endif
	sub	rax, r10
	sbb	rdx, r11	# rdx:rax = divisor * (quotient" - 1)
4:
	sub	rdi, rax
	sbb	rsi, rdx	# rsi:rdi = dividend - divisor * quotient"
				#         = remainder"
.else # JCCLESS
	sub	rdi, rax
	sbb	rsi, rdx	# rsi:rdi = dividend
				#         - low qword of divisor * quotient"
	sub	rsi, r9		# rsi:rdi = dividend - divisor * quotient"
				#         = remainder"
.endif # JCCLESS
	jnb	5f		# remainder" >= 0?
				#  (with borrow, it is off by divisor,
				#   and quotient" is off by 1)
.if 0
	sbb	rbx, 0		# rbx = quotient" - 1
				#     = quotient
.else
	dec	rbx		# rbx = quotient" - 1
				#     = quotient
.endif
	add	rdi, r10
	adc	rsi, r11	# rsi:rdi = remainder" + divisor
				#         = remainder
5:
	test	r8, r8
	jz	6f		# address of remainder = 0?

	mov	[r8], rdi
	mov	[r8+8], rsi	# remainder = rsi:rdi
6:
	mov	rax, rbx	# rax = (low qword of) quotient
	xor	edx, edx	# rdx:rax = quotient

	pop	rbx
	ret

	# dividend >= divisor >= 2**127:
	# quotient = 1, remainder = dividend - divisor
.special:
	test	r8, r8
	jz	7f		# address of remainder = 0?

	sub	rdi, r10
	sbb	rsi, r11	# rsi:rdi = dividend - divisor
				#         = remainder
	mov	[r8], rdi
	mov	[r8+8], rsi	# remainder = rsi:rdi
7:
	xor	eax, eax
	xor	edx, edx
	inc	eax		# rdx:rax = quotient = 1
	ret

.size	__udivmodti4, .-__udivmodti4
.type	__udivmodti4, @function
.global	__udivmodti4
.size	__udivti3, .-__udivti3
.type	__udivti3, @function
.global	__udivti3
.size	__umodti3, .-__umodti3
.type	__umodti3, @function
.global	__umodti3
.end
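
The idea behind the JCCLESS variants above, i.e. subtracting divisor′ from the high qword of the dividend only when needed, without a conditional branch, can be expressed in C with a mask. This is a sketch of the idea, not a transcription of any of the instruction sequences; the function name presubtract() is made up, and whether a compiler emits branch-free code for it is not guaranteed.

```c
#include <assert.h>
#include <stdint.h>

/* Subtracts the divisor from *high only if *high >= divisor;
   returns the high qword of quotient' (1 if subtracted, else 0). */
uint64_t presubtract(uint64_t *high, uint64_t divisor)
{
    uint64_t borrow = *high < divisor;  /* carry flag of the subtraction */
    uint64_t mask = borrow - 1;         /* 0 with borrow, else all '1' bits */

    *high -= divisor & mask;            /* subtracts either divisor or 0 */

    return mask & 1;                    /* = 1 - borrow */
}
```

Afterwards *high is less than a normalised divisor (most significant bit set), so the following narrowing division cannot overflow.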

__udivmodti4() function for AMD64 processors, supporting the Microsoft calling convention, using the extended precision division algorithm:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

; Microsoft calling convention for AMD64 platform:
; - first 4 integer or pointer arguments (from left to right) are passed
;   in registers RCX/R1, RDX/R2, R8 and R9;
; - 16-byte arguments are passed by reference;
; - surplus arguments are pushed on stack in reverse order (from right
;   to left), 8-byte aligned;
; - caller allocates memory for 16-byte return value and passes pointer
;   to it as (hidden) first argument, thus shifting all other arguments;
; - caller always allocates "home space" for 4 arguments on stack,
;   even when less than 4 arguments are passed, but does not need to push
;   first 4 arguments;
; - callee can spill first 4 arguments from registers to "home space";
; - callee can clobber "home space";
; - stack is 16-byte aligned: callee must decrement RSP by 8+n*16
;   bytes when it calls other functions (CALL instruction pushes 8 bytes);
; - 64-bit integer or pointer result is returned in register RAX/R0;
; - 32-bit integer result is returned in register EAX;
; - registers RAX/R0, RCX/R1, RDX/R2, R8, R9, R10 and R11 are volatile
;   and can be clobbered;
; - registers RBX/R3, RSP/R4, RBP/R5, RSI/R6, RDI/R7, R12, R13, R14 and
;   R15 must be preserved.

; NOTE: raises "division exception" when divisor is 0!

	.code
				; rcx = oword ptr quotient
				; rdx = oword ptr dividend
				; r8 = oword ptr divisor
				; r9 = oword ptr remainder
__udivmodti4 proc public

	mov	rax, [rdx]	; rax = low qword of dividend
	mov	rdx, [rdx+8]	; rdx = high qword of dividend

	mov	r10, [r8]	; r10 = low qword of divisor
	mov	r11, [r8+8]	; r11 = high qword of divisor

	mov	r8, rcx		; r8 = address of quotient

	cmp	rax, r10
	mov	rcx, rdx
	sbb	rcx, r11
	jb	trivial		; dividend < divisor?

	bsr	rcx, r11	; rcx = index of most significant '1' bit
				;        in high qword of divisor
	jnz	extended	; high qword of divisor <> 0?

	; divisor < 2**64

	cmp	rdx, r10
	jae	long		; high qword of dividend >= (low qword of) divisor?

	; dividend < divisor * 2**64: quotient < 2**64
	; perform normal division
normal:
	div	r10		; rax = (low qword of) quotient,
				; rdx = (low qword of) remainder
	mov	[r8], rax
	mov	[r8+8], r11	; high qword of quotient = 0

	test	r9, r9
	jz	@f		; address of remainder = 0?

	mov	[r9], rdx
	mov	[r9+8], r11	; high qword of remainder = 0
@@:
	mov	rax, r8		; rax = address of quotient
	ret

	; dividend >= divisor * 2**64: quotient >= 2**64
	; perform "long" alias "schoolbook" division
long:
	mov	rcx, rax	; rcx = low qword of dividend
	mov	rax, rdx	; rax = high qword of dividend
	mov	rdx, r11	; rdx:rax = high qword of dividend
	div	r10		; rax = high qword of quotient,
				; rdx = high qword of remainder'
	xchg	rcx, rax	; rcx = high qword of quotient,
				; rax = low qword of dividend
	div	r10		; rax = low qword of quotient,
				; rdx = (low qword of) remainder
	mov	[r8], rax
	mov	[r8+8], rcx	; quotient = rcx:rax

	test	r9, r9
	jz	@f		; address of remainder = 0?

	mov	[r9], rdx
	mov	[r9+8], r11	; high qword of remainder = 0
@@:
	mov	rax, r8		; rax = address of quotient
	ret

	; dividend < divisor: quotient = 0, remainder = dividend
trivial:
	xor	ecx, ecx
	mov	[r8], rcx
	mov	[r8+8], rcx	; quotient = 0

	test	r9, r9
	jz	@f		; address of remainder = 0?

	mov	[r9], rax
	mov	[r9+8], rdx	; remainder = dividend
@@:
	mov	rax, r8		; rax = address of quotient
	ret

	; divisor >= 2**64: quotient < 2**64
extended:
	xor	ecx, 63		; ecx = number of leading '0' bits
				;        in (high qword of) divisor
	jz	special		; divisor >= 2**127?

	; perform "extended & adjusted" division

	mov	[rsp+8], rbx
	mov	[rsp+16], r12
	mov	[rsp+24], r13
	mov	[rsp+32], r14

	mov	r12, r11	; r12 = high qword of divisor
	shld	r12, r10, cl	; r12 = divisor / 2**(index + 1)
				;     = divisor'
;;	shl	r10, cl

	mov	r13, rax
	mov	r14, rdx	; r14:r13 = dividend
ifndef JccLess
	xor	ebx, ebx	; rbx = high qword of quotient' = 0

	cmp	rdx, r12
	jb	@f		; high qword of dividend < divisor'?

	; high qword of dividend >= divisor':
	; subtract divisor' from high qword of dividend to prevent possible
	; quotient overflow and set most significant bit of quotient"

	sub	rdx, r12	; rdx = high qword of dividend - divisor'
				;     = high qword of dividend'
	inc	ebx		; rbx = high qword of quotient' = 1
@@:
elseif 0
	sub	rdx, r12	; rdx = high qword of dividend - divisor'
	sbb	rbx, rbx	; rbx = (high qword of dividend < divisor') ? -1 : 0
	and	rbx, r12	; rbx = (high qword of dividend < divisor') ? divisor' : 0
	add	rdx, rbx	; rdx = high qword of dividend
				;     - (high qword of dividend < divisor') ? 0 : divisor'
				;     = high qword of dividend'
	neg	rbx		; CF = (high qword of dividend < divisor')
	sbb	ebx, ebx	; ebx = (high qword of dividend < divisor') ? -1 : 0
	inc	ebx		; rbx = (high qword of dividend < divisor') ? 0 : 1
				;     = high qword of quotient'
elseif 0
	sub	rdx, r12	; rdx = high qword of dividend - divisor'
	cmovb	rdx, r14	;     = high qword of dividend'
	sbb	ebx, ebx	; ebx = (high qword of dividend < divisor') ? -1 : 0
	inc	ebx		; rbx = (high qword of dividend < divisor') ? 0 : 1
				;     = high qword of quotient'
else ; JccLess
	xor	ebx, ebx	; rbx = high qword of quotient' = 0
	sub	rdx, r12	; rdx = high qword of dividend - divisor'
	cmovb	rdx, r14	;     = high qword of dividend'
	sbb	ebx, -1		; rbx = (high qword of dividend < divisor') ? 0 : 1
				;     = high qword of quotient'
endif ; JccLess
	; high qword of dividend' < divisor'

	div	r12		; rax = dividend' / divisor'
				;     = low qword of quotient',
				; rdx = remainder'
	shld	rbx, rax, cl	; rbx = quotient' / 2**(index + 1)
				;     = dividend / divisor'
				;     = quotient"
;;	shl	rax, cl
ifndef JccLess
	mov	rax, r10	; rax = low qword of divisor
	mov	r12, r11	; r12 = high qword of divisor
	imul	r12, rbx	; r12 = high qword of divisor * quotient"
	mul	rbx		; rdx:rax = low qword of divisor * quotient"
	add	rdx, r12	; rdx:rax = divisor * quotient"
	jnc	@f		; divisor * quotient" < 2**128?
				;  (with carry, it is off by divisor,
				;   and quotient" is off by 1)
if 0
	sbb	rbx, 0		; rbx = quotient" - 1
else
	dec	rbx		; rbx = quotient" - 1
endif
	sub	rax, r10
	sbb	rdx, r11	; rdx:rax = divisor * (quotient" - 1)
@@:
	sub	r13, rax
	sbb	r14, rdx	; r14:r13 = dividend - divisor * quotient"
				;         = remainder"
else ; JccLess
	mov	rax, r10	; rax = low qword of divisor
	mov	r12, r11	; r12 = high qword of divisor
	imul	r12, rbx	; r12 = high qword of divisor * quotient"
	mul	rbx		; rdx:rax = low qword of divisor * quotient"
	sub	r13, rax
	sbb	r14, rdx	; r14:r13 = dividend
				;         - low qword of divisor * quotient"
	sub	r14, r12	; r14:r13 = dividend - divisor * quotient"
				;         = remainder"
endif ; JccLess
	jnb	@f		; remainder" >= 0?
				;  (with borrow, it is off by divisor,
				;   and quotient" is off by 1)
if 0
	sbb	rbx, 0		; rbx = quotient" - 1
				;     = quotient
else
	dec	rbx		; rbx = quotient" - 1
				;     = quotient
endif
	add	r13, r10
	adc	r14, r11	; r14:r13 = remainder" + divisor
				;         = remainder
@@:
	xor	eax, eax	; rax = high qword of quotient = 0
	mov	[r8], rbx
	mov	[r8+8], rax	; quotient = rax:rbx

	test	r9, r9
	jz	@f		; address of remainder = 0?

	mov	[r9], r13
	mov	[r9+8], r14	; remainder = r14:r13
@@:
	mov	rbx, [rsp+8]
	mov	r12, [rsp+16]
	mov	r13, [rsp+24]
	mov	r14, [rsp+32]

	mov	rax, r8		; rax = address of quotient
	ret

	; dividend >= divisor >= 2**127:
	; quotient = 1, remainder = dividend - divisor
special:
	mov	[r8+8], rcx
	inc	ecx
	mov	[r8], rcx	; quotient = 1

	test	r9, r9
	jz	@f		; address of remainder = 0?

	sub	rax, r10
	sbb	rdx, r11	; rdx:rax = dividend - divisor
				;         = remainder
	mov	[r9], rax
	mov	[r9+8], rdx
@@:
	mov	rax, r8		; rax = address of quotient
	ret

__udivmodti4 endp
	end

__udivmodti4(), __udivti3() and __umodti3() functions for AMD64 processors, supporting the Unix System V calling convention, using the hybrid variant of the division algorithm:

# Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

# NOTE: raises "division exception" when divisor is 0!

.file	"udivmodti4.s"
.arch	generic64
.code64
.intel_syntax noprefix
.text
				# rsi:rdi = dividend
				# rcx:rdx = divisor
__umodti3:
	sub	rsp, 24
	mov	r8, rsp		# r8 = address of remainder
	call	__udivmodti4
	pop	rax
	pop	rdx		# rdx:rax = remainder
	pop	rcx
	ret
				# rsi:rdi = dividend
				# rcx:rdx = divisor
__udivti3:
	xor	r8, r8
				# rsi:rdi = dividend
				# rcx:rdx = divisor
				# r8 = oword ptr remainder
__udivmodti4:
	cmp	rdi, rdx
	mov	rax, rsi
	sbb	rax, rcx
	jb	.trivial	# dividend < divisor?

	bsr	r9, rcx		# r9 = index of most significant '1' bit
				#       in high qword of divisor
	jz	.simple		# high qword of divisor = 0?

	# dividend >= divisor >= 2**64: quotient < 2**64

	mov	r11, rcx	# r11 = high qword of divisor
	bsr	rcx, rsi	# rcx = index of most significant '1' bit
				#        in high qword of dividend
#	jz	.trivial	# high qword of dividend = 0?

	# perform "shift & subtract" alias "binary long" division
.large:
	sub	rcx, r9		# rcx = distance of leading '1' bits
#	jb	.trivial	# dividend < divisor?

	xor	r9, r9		# r9 = (low qword of) quotient' = 0
	mov	r10, rdx	# r10 = low qword of divisor
	shld	r11, r10, cl
	shl	r10, cl		# r11:r10 = divisor << distance of leading '1' bits
				#         = divisor'
.loop:
	mov	rax, rdi
	mov	rdx, rsi	# rdx:rax = dividend'
	sub	rdi, r10
	sbb	rsi, r11	# rsi:rdi = dividend' - divisor'
				#         = dividend",
				# CF = (dividend' < divisor')
	cmovb	rdi, rax
	cmovb	rsi, rdx	# rsi:rdi = (dividend' < divisor') ? dividend' : dividend"
	cmc			# CF = (dividend' >= divisor')
	adc	r9, r9		# r9 = quotient' << 1
				#    + dividend' >= divisor'
				#    = quotient"
.if 0
	shrd	r10, r11, 1
	shr	r11, 1		# r11:r10 = divisor' >> 1
				#         = divisor"
.else
	shr	r11, 1
	rcr	r10, 1		# r11:r10 = divisor' >> 1
				#         = divisor"
.endif
	dec	ecx
	jns	.loop

	test	r8, r8
	jz	0f		# address of remainder = 0?

	mov	[r8], rdi
	mov	[r8+8], rsi	# remainder = dividend"
0:
	mov	rax, r9		# rax = (low qword of) quotient
	xor	edx, edx	# rdx:rax = quotient
	ret

	# dividend < divisor: quotient = 0, remainder = dividend
.trivial:
	test	r8, r8
	jz	1f		# address of remainder = 0?

	mov	[r8], rdi
	mov	[r8+8], rsi	# remainder = dividend
1:
	xor	eax, eax
	xor	edx, edx	# rdx:rax = quotient = 0
	ret

	# divisor < 2**64
.simple:
	mov	r9, rdx		# r9 = (low qword of) divisor
	cmp	rsi, rdx
	jae	.long		# high qword of dividend >= (low qword of) divisor?

	# dividend < divisor * 2**64: quotient < 2**64
	# perform normal division
.normal:
	mov	rdx, rsi
	mov	rax, rdi	# rdx:rax = dividend
	div	r9		# rax = (low qword of) quotient,
				# rdx = (low qword of) remainder
	test	r8, r8
	jz	2f		# address of remainder = 0?

	mov	[r8], rdx
	mov	[r8+8], rcx	# high qword of remainder = 0
2:
	mov	rdx, rcx	# rdx:rax = quotient
	ret

	# dividend >= divisor * 2**64: quotient >= 2**64
	# perform "long" alias "schoolbook" division
.long:
	mov	rdx, rcx	# rdx = 0
	mov	rax, rsi	# rdx:rax = high qword of dividend
	div	r9		# rax = high qword of quotient,
				# rdx = high qword of remainder'
	mov	r10, rax	# r10 = high qword of quotient
	mov	rax, rdi	# rax = low qword of dividend
	div	r9		# rax = low qword of quotient,
				# rdx = (low qword of) remainder
	test	r8, r8
	jz	3f		# address of remainder = 0?

	mov	[r8], rdx
	mov	[r8+8], rcx	# high qword of remainder = 0
3:
	mov	rdx, r10	# rdx:rax = quotient
	ret

.size	__udivmodti4, .-__udivmodti4
.type	__udivmodti4, @function
.global	__udivmodti4
.size	__udivti3, .-__udivti3
.type	__udivti3, @function
.global	__udivti3
.size	__umodti3, .-__umodti3
.type	__umodti3, @function
.global	__umodti3
.end
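
The shift & subtract loop of the listing above can be sketched in portable C as follows. This is only an illustration under stated assumptions, not the author's implementation: the type u128 and the helpers msb_index() and shift_subtract() are hypothetical names, the bit scan is a plain loop instead of the BSR instruction, and the conditional moves are written as an ordinary branch.

```c
#include <assert.h>
#include <stdint.h>

/* hypothetical helper type: 128-bit unsigned integer as two qwords */
typedef struct { uint64_t low, high; } u128;

/* index of the most significant '1' bit (stand-in for BSR);
   value must not be 0 */
static int msb_index(uint64_t value)
{
    int index = 0;

    while (value >>= 1)
        index++;

    return index;
}

/* shift & subtract division for dividend >= divisor >= 2**64:
   returns the quotient (always < 2**64), stores the remainder */
static uint64_t shift_subtract(u128 dividend, u128 divisor, u128 *remainder)
{
    uint64_t quotient = 0;
    int distance;

    assert(divisor.high != 0);
    assert(dividend.high >= divisor.high);

    /* distance of the leading '1' bits, 0 to 63 */
    distance = msb_index(dividend.high) - msb_index(divisor.high);

    if (distance > 0)
    {   /* divisor' = divisor << distance */
        divisor.high = (divisor.high << distance)
                     | (divisor.low >> (64 - distance));
        divisor.low <<= distance;
    }

    do
    {   /* borrow = (dividend' < divisor') */
        int borrow = (dividend.high < divisor.high)
                  || ((dividend.high == divisor.high)
                   && (dividend.low < divisor.low));

        if (!borrow)
        {   /* dividend" = dividend' - divisor' */
            dividend.high -= divisor.high + (dividend.low < divisor.low);
            dividend.low -= divisor.low;
        }

        /* quotient" = (quotient' << 1) + (dividend' >= divisor') */
        quotient = (quotient << 1) | (uint64_t) !borrow;

        /* divisor" = divisor' >> 1 */
        divisor.low = (divisor.low >> 1) | (divisor.high << 63);
        divisor.high >>= 1;
    } while (--distance >= 0);

    *remainder = dividend;

    return quotient;
}
```

The loop runs distance + 1 times, i.e. at most 64 times, which is why this path is taken only when the high qword of the divisor is not 0 and the quotient therefore fits into a single qword.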

__udivmodti4() function for AMD64 processors, supporting the Microsoft calling convention, using the hybrid variant of the division algorithm:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

; NOTE: raises "division exception" when divisor is 0!

	.code

__udivmodti4 proc public

	mov	rax, [rdx]	; rax = low qword of dividend
	mov	rdx, [rdx+8]	; rdx = high qword of dividend

	mov	r10, [r8]	; r10 = low qword of divisor
	mov	r11, [r8+8]	; r11 = high qword of divisor

	mov	r8, rcx		; r8 = address of quotient

	cmp	rax, r10
	mov	rcx, rdx
	sbb	rcx, r11
	jb	trivial		; dividend < divisor?

	bsr	rcx, r11	; rcx = index of most significant '1' bit
				;        in high qword of divisor
	jz	simple		; high qword of divisor = 0?

	; dividend >= divisor >= 2**64: quotient < 2**64

	mov	[rsp+8], rbx

	bsr	rbx, rdx	; rbx = index of most significant '1' bit
				;        in high qword of dividend
;;	jz	trivial		; high qword of dividend = 0?

	; perform "shift & subtract" alias "binary long" division
large:
	sub	ebx, ecx	; ebx = distance of leading '1' bits
;;	jb	trivial		; dividend < divisor?

	mov	ecx, ebx
	xor	ebx, ebx	; rbx = (low qword of) quotient' = 0

	shld	r11, r10, cl
	shl	r10, cl		; r11:r10 = divisor << distance of leading '1' bits
				;         = divisor'
	mov	[rsp+16], r12
	mov	[rsp+24], r13
@@:
	mov	r12, rax
	mov	r13, rdx	; r13:r12 = dividend'
	sub	rax, r10
	sbb	rdx, r11	; rdx:rax = dividend' - divisor'
				;         = dividend",
				; CF = (dividend' < divisor')
	cmovb	rax, r12
	cmovb	rdx, r13	; rdx:rax = (dividend' < divisor') ? dividend' : dividend"
	cmc			; CF = (dividend' >= divisor')
	adc	rbx, rbx	; rbx = quotient' << 1
				;     + dividend' >= divisor'
				;     = quotient"
if 0
	shrd	r10, r11, 1
	shr	r11, 1		; r11:r10 = divisor' >> 1
				;         = divisor"
else
	shr	r11, 1
	rcr	r10, 1		; r11:r10 = divisor' >> 1
				;         = divisor"
endif
	dec	ecx
	jns	@b

	mov	r12, [rsp+16]
	mov	r13, [rsp+24]

	xor	ecx, ecx
	mov	[r8], rbx
	mov	[r8+8], rcx	; high qword of quotient = 0

	test	r9, r9
	jz	@f		; address of remainder = 0?

	mov	[r9], rax
	mov	[r9+8], rdx	; remainder = dividend"
@@:
	mov	rax, r8		; rax = address of quotient

	mov	rbx, [rsp+8]
	ret

	; dividend < (2**64 <=) divisor: quotient = 0, remainder = dividend
trivial:
	xor	ecx, ecx
	mov	[r8], rcx
	mov	[r8+8], rcx	; quotient = 0

	test	r9, r9
	jz	@f		; address of remainder = 0?

	mov	[r9], rax
	mov	[r9+8], rdx	; remainder = dividend
@@:
	mov	rax, r8		; rax = address of quotient
	ret

	; divisor < 2**64
simple:
	cmp	rdx, r10
	jae	long		; high qword of dividend >= (low qword of) divisor?

	; dividend < divisor * 2**64: quotient < 2**64
	; perform normal division
normal:
	div	r10		; rax = (low qword of) quotient,
				; rdx = (low qword of) remainder
	mov	[r8], rax
	mov	[r8+8], r11	; high qword of quotient = 0

	test	r9, r9
	jz	@f		; address of remainder = 0?

	mov	[r9], rdx
	mov	[r9+8], r11	; high qword of remainder = 0
@@:
	mov	rax, r8		; rax = address of quotient
	ret

	; dividend >= divisor * 2**64: quotient >= 2**64
	; perform "long" alias "schoolbook" division
long:
	mov	rcx, rax	; rcx = low qword of dividend
	mov	rax, rdx	; rax = high qword of dividend
	mov	rdx, r11	; rdx:rax = high qword of dividend
	div	r10		; rax = high qword of quotient,
				; rdx = high qword of remainder'
	xchg	rcx, rax	; rcx = high qword of quotient,
				; rax = low qword of dividend
	div	r10		; rax = low qword of quotient,
				; rdx = (low qword of) remainder
	mov	[r8], rax
	mov	[r8+8], rcx	; quotient = rcx:rax

	test	r9, r9
	jz	@f		; address of remainder = 0?

	mov	[r9], rdx
	mov	[r9+8], r11	; high qword of remainder = 0
@@:
	mov	rax, r8		; rax = address of quotient
	ret

__udivmodti4 endp
	end
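
The "long" alias "schoolbook" path of the listings above chains two narrowing divisions: the remainder of the first division becomes the high word of the dividend of the second one, so the second quotient always fits into one word. A scaled-down sketch in C, assuming 32-bit words instead of qwords so that it needs only standard integer types; the name long_division() is hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* scaled-down analogue of the "long" division path:
   64-bit dividend, 32-bit divisor, 64-bit quotient, 32-bit remainder,
   built from two steps whose quotients each fit into 32 bits */
static uint64_t long_division(uint64_t dividend, uint32_t divisor, uint32_t *remainder)
{
    uint32_t high = (uint32_t) (dividend >> 32);
    uint32_t low = (uint32_t) dividend;
    uint32_t quotient_high, quotient_low;
    uint64_t partial;

    assert(divisor != 0);

    /* first step: 0:high / divisor yields the high word of the quotient
       and remainder' < divisor */
    quotient_high = high / divisor;
    partial = ((uint64_t) (high % divisor) << 32) | low;

    /* second step: remainder':low / divisor;
       remainder' < divisor guarantees a quotient < 2**32 */
    quotient_low = (uint32_t) (partial / divisor);
    *remainder = (uint32_t) (partial % divisor);

    return ((uint64_t) quotient_high << 32) | quotient_low;
}
```

On AMD64 the same guarantee, remainder' < divisor, is what keeps the second DIV instruction from raising a division exception.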

128÷128-bit Signed Integer Division (128-bit Quotient and Remainder)

__divmodti4() function for AMD64 processors, supporting the Unix^® System V calling convention, wrapping the __udivmodti4() function:

# Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

# NOTE: returns ±2**127 for -2**127 / -1!

.file	"divmodti4.s"
.extern	__udivmodti4
.arch	generic64
.code64
.intel_syntax noprefix
.text
				# rsi:rdi = dividend
				# rcx:rdx = divisor
				# r8 = oword ptr remainder
__divmodti4:
	mov	rax, rsi
	sar	rax, 63		# rax = (dividend < 0) ? -1 : 0
	xor	rdi, rax
	xor	rsi, rax	# rsi:rdi = (dividend < 0) ? ~dividend : dividend
	sub	rdi, rax
	sbb	rsi, rax	# rsi:rdi = (dividend < 0) ? -dividend : dividend
				#         = |dividend|
	mov	r9, rcx
	sar	r9, 63		# r9 = (divisor < 0) ? -1 : 0
	xor	rdx, r9
	xor	rcx, r9		# rcx:rdx = (divisor < 0) ? ~divisor : divisor
	sub	rdx, r9
	sbb	rcx, r9		# rcx:rdx = (divisor < 0) ? -divisor : divisor
				#         = |divisor|
	push	r8
	push	rax
	xor	rax, r9		# rax = (dividend < 0) ^ (divisor < 0) ? -1 : 0
				#     = (quotient < 0) ? -1 : 0
	push	rax
	call	__udivmodti4	# rdx:rax = |quotient|

	pop	r9		# r9 = (quotient < 0) ? -1 : 0
	xor	rax, r9
	xor	rdx, r9		# rdx:rax = (quotient < 0) ? ~|quotient| : |quotient|
	sub	rax, r9
	sbb	rdx, r9		# rdx:rax = (quotient < 0) ? -|quotient| : |quotient|
				#         = quotient
	pop	r9		# r9 = (dividend < 0) ? -1 : 0
				#    = (remainder < 0) ? -1 : 0
	pop	r8		# r8 = address of |remainder|
	test	r8, r9
	jz	0f		# address of remainder = 0?
				# remainder >= 0?
	neg	qword ptr [r8+8]
	neg	qword ptr [r8]
	sbb	qword ptr [r8+8], 0
				# [r8] = -|remainder|
				#      = remainder
0:
	ret

.size	__divmodti4, .-__divmodti4
.type	__divmodti4, @function
.global	__divmodti4
.end
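
The sign handling of the wrapper above relies on the identity (x ^ mask) - mask, which negates x when mask is -1 and leaves it unchanged when mask is 0. A minimal 64-bit analogue in C, offered as a sketch only: the name divmod_by_magnitude() is hypothetical, the C operators / and % stand in for the call of __udivmodti4(), and the final conversions from uint64_t to int64_t assume the usual two's complement wrap-around (implementation-defined in C).

```c
#include <assert.h>
#include <stdint.h>

/* hypothetical 64-bit analogue of the signed wrapper:
   divide the magnitudes, then apply the signs with masks */
static int64_t divmod_by_magnitude(int64_t dividend, int64_t divisor, int64_t *remainder)
{
    /* mask = (value < 0) ? -1 : 0 (the listing uses SAR by 63) */
    uint64_t dividend_mask = 0 - (uint64_t) (dividend < 0);
    uint64_t divisor_mask = 0 - (uint64_t) (divisor < 0);
    /* the quotient is negative if exactly one operand is negative */
    uint64_t quotient_mask = dividend_mask ^ divisor_mask;

    /* |value| = (value ^ mask) - mask, evaluated modulo 2**64 */
    uint64_t n = ((uint64_t) dividend ^ dividend_mask) - dividend_mask;
    uint64_t d = ((uint64_t) divisor ^ divisor_mask) - divisor_mask;
    uint64_t q, r;

    assert(divisor != 0);

    q = n / d;                      /* |quotient| */
    r = n % d;                      /* |remainder| */

    /* the remainder takes the sign of the dividend */
    if (remainder != 0)
        *remainder = (int64_t) ((r ^ dividend_mask) - dividend_mask);

    return (int64_t) ((q ^ quotient_mask) - quotient_mask);
}
```

Because the magnitudes are handled as unsigned values, the most negative dividend needs no special case; dividing it by -1 simply wraps around, in line with the NOTE at the top of the listing.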

__divmodti4() function for AMD64 processors, supporting the Microsoft calling convention, wrapping the __udivmodti4() function:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

; NOTE: returns ±2**127 for -2**127 / -1!

	.code
				; rcx = oword ptr quotient
				; rdx = oword ptr dividend
				; r8 = oword ptr divisor
				; r9 = oword ptr remainder
__divmodti4 proc public

	mov	rax, [rdx+8]
	mov	r10, [rdx]	; rax:r10 = dividend
	cqo			; rdx = (dividend < 0) ? -1 : 0
	push	rdx		;     = (remainder < 0) ? -1 : 0
	xor	r10, rdx
	xor	rax, rdx	; rax:r10 = (dividend < 0) ? ~dividend : dividend
	sub	r10, rdx
	sbb	rax, rdx	; rax:r10 = (dividend < 0) ? -dividend : dividend
				;         = |dividend|
	mov	[rsp+16], r10
	mov	[rsp+24], rax

	mov	rax, [r8+8]
	mov	r8, [r8]	; rax:r8 = divisor
	cqo			; rdx = (divisor < 0) ? -1 : 0
	push	rdx
	xor	r8, rdx
	xor	rax, rdx	; rax:r8 = (divisor < 0) ? ~divisor : divisor
	sub	r8, rdx
	sbb	rax, rdx	; rax:r8 = (divisor < 0) ? -divisor : divisor
				;        = |divisor|
	mov	[rsp+40], r8
	mov	[rsp+48], rax
				; rcx = address of |quotient|
	lea	rdx, [rsp+24]	; rdx = address of |dividend|
	lea	r8, [rsp+40]	; r8 = address of |divisor|
	push	r9		; r9 = address of |remainder|
	extern	__udivmodti4 :proc
	call	__udivmodti4	; rax = address of |quotient|

	pop	r9		; r9 = address of |remainder|
	pop	r10		; r10 = (divisor < 0) ? -1 : 0
	pop	r11		; r11 = (dividend < 0) ? -1 : 0
				;     = (remainder < 0) ? -1 : 0
	test	r9, r11
	jz	@f		; address of remainder = 0?
				; remainder >= 0?
	neg	qword ptr [r9+8]
	neg	qword ptr [r9]
	sbb	qword ptr [r9+8], 0
				; [r9] = -|remainder|
				;      = remainder
@@:
	xor	r10, r11	; r10 = (divisor < 0) ^ (dividend < 0) ? -1 : 0
				;     = (quotient < 0) ? -1 : 0
	jz	@f		; quotient >= 0?

	neg	qword ptr [rax+8]
	neg	qword ptr [rax]
	sbb	qword ptr [rax+8], 0
				; [rax] = -|quotient|
				;       = quotient
@@:
	ret

__divmodti4 endp
	end

128÷128-bit Signed Integer Division (128-bit Quotient)

__divti3() function for AMD64 processors, supporting the Unix^® System V calling convention, using the __udivmodti4() function:

# Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

# NOTE: returns ±2**127 for -2**127 / -1!

.file	"divti3.s"
.extern	__udivmodti4
.arch	generic64
.code64
.intel_syntax noprefix
.text
				# rsi:rdi = dividend
				# rcx:rdx = divisor
__divti3:
	mov	rax, rsi
	sar	rax, 63		# rax = (dividend < 0) ? -1 : 0
	xor	rdi, rax
	xor	rsi, rax	# rsi:rdi = (dividend < 0) ? ~dividend : dividend
	sub	rdi, rax
	sbb	rsi, rax	# rsi:rdi = (dividend < 0) ? -dividend : dividend
				#         = |dividend|
	mov	r8, rcx
	sar	r8, 63		# r8 = (divisor < 0) ? -1 : 0
	xor	rdx, r8
	xor	rcx, r8		# rcx:rdx = (divisor < 0) ? ~divisor : divisor
	sub	rdx, r8
	sbb	rcx, r8		# rcx:rdx = (divisor < 0) ? -divisor : divisor
				#         = |divisor|
	xor	rax, r8		# rax = (dividend < 0) ^ (divisor < 0) ? -1 : 0
				#     = (quotient < 0) ? -1 : 0
	push	rax
	xor	r8, r8		# r8 = address of |remainder| = 0
	call	__udivmodti4	# rdx:rax = |quotient|
	pop	rcx		# rcx = (quotient < 0) ? -1 : 0
	xor	rax, rcx
	xor	rdx, rcx	# rdx:rax = (quotient < 0) ? ~|quotient| : |quotient|
	sub	rax, rcx
	sbb	rdx, rcx	# rdx:rax = (quotient < 0) ? -|quotient| : |quotient|
				#         = quotient
	ret

.size	__divti3, .-__divti3
.type	__divti3, @function
.global	__divti3
.end

128÷128-bit Signed Integer Division (128-bit Remainder)

__modti3() function for AMD64 processors, supporting the Unix^® System V calling convention, using the __udivmodti4() function:

# Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"modti3.s"
.extern	__udivmodti4
.arch	generic64
.code64
.intel_syntax noprefix
.text
				# rsi:rdi = dividend
				# rcx:rdx = divisor
__modti3:
	mov	rax, rcx
	sar	rax, 63		# rax = (divisor < 0) ? -1 : 0
	xor	rdx, rax
	xor	rcx, rax	# rcx:rdx = (divisor < 0) ? ~divisor : divisor
	sub	rdx, rax
	sbb	rcx, rax	# rcx:rdx = (divisor < 0) ? -divisor : divisor
				#         = |divisor|
	mov	rax, rsi
	sar	rax, 63		# rax = (dividend < 0) ? -1 : 0
	xor	rdi, rax
	xor	rsi, rax	# rsi:rdi = (dividend < 0) ? ~dividend : dividend
	sub	rdi, rax
	sbb	rsi, rax	# rsi:rdi = (dividend < 0) ? -dividend : dividend
				#         = |dividend|
	push	rax
	sub	rsp, 16
	mov	r8, rsp		# r8 = address of |remainder|
	call	__udivmodti4	# rdx:rax = |quotient|
	pop	rax
	pop	rdx		# rdx:rax = |remainder|
	pop	rcx		# rcx = (dividend < 0) ? -1 : 0
	xor	rax, rcx
	xor	rdx, rcx	# rdx:rax = (dividend < 0) ? ~|remainder| : |remainder|
	sub	rax, rcx
	sbb	rdx, rcx	# rdx:rax = (dividend < 0) ? -|remainder| : |remainder|
				#         = remainder
	ret

.size	__modti3, .-__modti3
.type	__modti3, @function
.global	__modti3
.end

// Copyleft © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifdef _MSC_VER
typedef unsigned __int64 uint64_t;
typedef __int64 int64_t;
#if 0
typedef __int128 int128_t;
#else
typedef struct
{
    uint64_t low;
    int64_t high;
} int128_t;
#endif

int      __cmpti2(int128_t comparand, int128_t comparator);

int128_t __absti2(int128_t value);
int128_t __absvti2(int128_t value);
int128_t __negti2(int128_t negend);
int128_t __negvti2(int128_t negend);

int128_t __ashlti3(int128_t value, int count);
int128_t __ashrti3(int128_t value, int count);

int128_t __maxti3(int128_t left, int128_t right);
int128_t __minti3(int128_t left, int128_t right);

int128_t __addti3(int128_t augend, int128_t addend);
int128_t __addvti3(int128_t augend, int128_t addend);
int128_t __multi3(int128_t multiplicand, int128_t multiplier);
int128_t __mulvti3(int128_t multiplicand, int128_t multiplier);
int128_t __subti3(int128_t minuend, int128_t subtrahend);
int128_t __subvti3(int128_t minuend, int128_t subtrahend);

#if 0
typedef unsigned __int128 uint128_t;
#else
typedef struct
{
    uint64_t low, high;
} uint128_t;
#endif

int       __clzti2(uint128_t value);
int       __ctzti2(uint128_t value);
int       __parityti2(uint128_t value);
int       __popcountti2(uint128_t value);

int       __ucmpti2(uint128_t comparand, uint128_t comparator);

uint128_t __bswapti2(uint128_t value);
uint128_t __reverseti2(uint128_t value);

uint128_t __lshrti3(uint128_t value, int count);
uint128_t __rotlti3(uint128_t value, int count);
uint128_t __rotrti3(uint128_t value, int count);

uint128_t __umaxti3(uint128_t left, uint128_t right);
uint128_t __uminti3(uint128_t left, uint128_t right);

uint128_t __uaddti3(uint128_t augend, uint128_t addend);
uint128_t __uaddvti3(uint128_t augend, uint128_t addend);
uint128_t __umulti3(uint128_t multiplicand, uint128_t multiplier);
uint128_t __umulvti3(uint128_t multiplicand, uint128_t multiplier);
uint128_t __usubti3(uint128_t minuend, uint128_t subtrahend);
uint128_t __usubvti3(uint128_t minuend, uint128_t subtrahend);

uint128_t __udivmodti4(uint128_t dividend, uint128_t divisor, uint128_t *remainder);

#ifdef INTERN
#pragma intrinsic(_BitScanForward64, _BitScanReverse64)

__inline
int __clzti2(uint128_t value)
{
    unsigned long index;        // _BitScanReverse64() expects unsigned long *

    if (_BitScanReverse64(&index, value.high))
        return index ^ 63;

    if (_BitScanReverse64(&index, value.low))
        return index ^ 127;

    return 128;
}

__inline
int __ctzti2(uint128_t value)
{
    unsigned long index;        // _BitScanForward64() expects unsigned long *

    if (_BitScanForward64(&index, value.low))
        return index;

    if (_BitScanForward64(&index, value.high))
        return index + 64;

    return 128;
}

__inline
int __parityti2(uint128_t value)
{
    unsigned long long ull = value.low ^ value.high;
    unsigned long ul = ull ^ (ull >> 32);
    ul ^= ul >> 16;
    ul ^= ul >> 8;
    ul ^= ul >> 4;

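    // the low 4 bits of ul hold the folded parity of all 16 nibbles;
    //  mask them, else the shift count can exceed the operand width

    return (0x6996 >> (ul & 15)) & 1;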
}

__inline
int __popcountti2(uint128_t value)
{
    unsigned long long low = value.low, high = value.high;

    low -= (low >> 1) & 0x5555555555555555;
    high -= (high >> 1) & 0x5555555555555555;
    low = (low & 0x3333333333333333)
        + ((low >> 2) & 0x3333333333333333);
    high = (high & 0x3333333333333333)
         + ((high >> 2) & 0x3333333333333333);
    low += low >> 4;
    high += high >> 4;
    low &= 0x0F0F0F0F0F0F0F0F;
    high &= 0x0F0F0F0F0F0F0F0F;
    low *= 0x0101010101010101;
    high *= 0x0101010101010101;

    return (low + high) >> 56;
}

__inline
int __cmpti2(int128_t comparand, int128_t comparator)
{
    if (comparand.high == comparator.high)
        return (comparand.low > comparator.low)
             - (comparand.low < comparator.low);

    return (comparand.high > comparator.high)
         - (comparand.high < comparator.high);
}

__inline
int __ucmpti2(uint128_t comparand, uint128_t comparator)
{
    if (comparand.high == comparator.high)
        return (comparand.low > comparator.low)
             - (comparand.low < comparator.low);

    return (comparand.high > comparator.high)
         - (comparand.high < comparator.high);
}

#pragma intrinsic(_byteswap_uint64)

__inline
uint128_t __bswapti2(uint128_t value)
{
    uint128_t result;

    result.low = _byteswap_uint64(value.high);
    result.high = _byteswap_uint64(value.low);

    return result;
}

__inline
uint128_t __reverseti2(uint128_t value)
{
    uint128_t result;

    result.low = _byteswap_uint64(value.high);
    result.high = _byteswap_uint64(value.low);

    result.low = ((result.low >> 4) & 0x0F0F0F0F0F0F0F0F)
               | ((result.low << 4) & 0xF0F0F0F0F0F0F0F0);
    result.high = ((result.high >> 4) & 0x0F0F0F0F0F0F0F0F)
                | ((result.high << 4) & 0xF0F0F0F0F0F0F0F0);
    result.low = ((result.low >> 2) & 0x3333333333333333)
               | ((result.low << 2) & 0xCCCCCCCCCCCCCCCC);
    result.high = ((result.high >> 2) & 0x3333333333333333)
                | ((result.high << 2) & 0xCCCCCCCCCCCCCCCC);
    result.low = ((result.low >> 1) & 0x5555555555555555)
               | ((result.low << 1) & 0xAAAAAAAAAAAAAAAA);
    result.high = ((result.high >> 1) & 0x5555555555555555)
                | ((result.high << 1) & 0xAAAAAAAAAAAAAAAA);

    return result;
}

__inline
int128_t __absti2(int128_t value)
{
    if (value.high < 0)
    {
        value.low = 0 - value.low;
        value.high = 0 - value.high
                   - (0 < value.low);
    }

    return value;
}

__inline
int128_t __absvti2(int128_t value)
{
    if (value.high < 0)
    {
        value.low = 0 - value.low;
        value.high = 0 - value.high
                   - (0 < value.low);
    }

    // overflow if value is negative

    if (value.high < 0)
        __ud2();

    return value;
}

__inline
int128_t __negti2(int128_t negend)
{
    int128_t negation;

    negation.low = 0 - negend.low;
    negation.high = 0 - negend.high
                  - (0 < negend.low);

    return negation;
}

__inline
int128_t __negvti2(int128_t negend)
{
    int128_t negation;

    negation.low = 0 - negend.low;
    negation.high = 0 - negend.high
                  - (0 < negend.low);

    // overflow if negend and negation are negative

    if ((negend.high & negation.high) < 0)
        __ud2();

    return negation;
}

__inline
int128_t __addti3(int128_t augend, int128_t addend)
{
    int128_t sum;

    sum.low = augend.low + addend.low;
    sum.high = augend.high + addend.high
             + (sum.low < augend.low);

    return sum;
}

__inline
int128_t __addvti3(int128_t augend, int128_t addend)
{
    int128_t sum;

    sum.low = augend.low + addend.low;
    sum.high = augend.high + addend.high
             + (sum.low < augend.low);

    // overflow if both augend and addend have opposite sign of sum,
    //  which is equivalent to augend has sign of addend
    //   and addend has opposite sign of sum
    //    (or augend has opposite sign of sum)

    if (((augend.high ^ sum.high) & (addend.high ^ sum.high)) < 0)
        __ud2();

    return sum;
}

__inline
uint128_t __uaddti3(uint128_t augend, uint128_t addend)
{
    uint128_t sum;

    sum.low = augend.low + addend.low;
    sum.high = augend.high + addend.high
             + (sum.low < augend.low);

    return sum;
}

__inline
uint128_t __uaddvti3(uint128_t augend, uint128_t addend)
{
    uint128_t sum;

    sum.low = augend.low + addend.low;
    sum.high = augend.high + addend.high
             + (sum.low < augend.low);

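    // overflow if sum < augend; comparing only the high qwords
    //  misses the case where the carry wraps sum.high back to augend.high

    if ((sum.high < augend.high)
     || ((sum.high == augend.high) && (sum.low < augend.low)))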
        __ud2();

    return sum;
}

__inline
int128_t __subti3(int128_t minuend, int128_t subtrahend)
{
    int128_t difference;

    difference.low = minuend.low - subtrahend.low;
    difference.high = minuend.high - subtrahend.high
                    - (minuend.low < subtrahend.low);

    return difference;
}

__inline
int128_t __subvti3(int128_t minuend, int128_t subtrahend)
{
    int128_t difference;

    difference.low = minuend.low - subtrahend.low;
    difference.high = minuend.high - subtrahend.high
                    - (minuend.low < subtrahend.low);

    // overflow if minuend has opposite sign of subtrahend
    //  and minuend has opposite sign of difference
    //   (or subtrahend has sign of difference)

    if (((minuend.high ^ subtrahend.high) & (minuend.high ^ difference.high)) < 0)
        __ud2();

    return difference;
}

__inline
uint128_t __usubti3(uint128_t minuend, uint128_t subtrahend)
{
    uint128_t difference;

    difference.low = minuend.low - subtrahend.low;
    difference.high = minuend.high - subtrahend.high
                    - (minuend.low < subtrahend.low);

    return difference;
}

__inline
uint128_t __usubvti3(uint128_t minuend, uint128_t subtrahend)
{
    uint128_t difference;

    difference.low = minuend.low - subtrahend.low;
    difference.high = minuend.high - subtrahend.high
                    - (minuend.low < subtrahend.low);

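    // underflow if minuend < subtrahend; adding the borrow to
    //  subtrahend.high can wrap to 0 and hide the underflow

    if ((minuend.high < subtrahend.high)
     || ((minuend.high == subtrahend.high) && (minuend.low < subtrahend.low)))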
        __ud2();

    return difference;
}

__inline
int128_t __maxti3(int128_t left, int128_t right)
{
    if ((left.high < right.high)
     || ((left.high == right.high) && (left.low < right.low)))
        return right;

    return left;
}

__inline
uint128_t __umaxti3(uint128_t left, uint128_t right)
{
    if ((left.high < right.high)
     || ((left.high == right.high) && (left.low < right.low)))
        return right;

    return left;
}

__inline
int128_t __minti3(int128_t left, int128_t right)
{
    if ((left.high > right.high)
     || ((left.high == right.high) && (left.low > right.low)))
        return right;

    return left;
}

__inline
uint128_t __uminti3(uint128_t left, uint128_t right)
{
    if ((left.high > right.high)
     || ((left.high == right.high) && (left.low > right.low)))
        return right;

    return left;
}

#pragma intrinsic(__shiftleft128, __shiftright128)

__inline
int128_t __ashlti3(int128_t value, int count)
{
    int128_t result;

    if (count < 64)
    {
        result.low = value.low << count;
        result.high = __shiftleft128(value.low, value.high, count);
    }
    else if (count < 128)
    {
        result.low = 0;
        result.high = value.high << (count - 64);
    }
    else
        result.low = result.high = 0;

    return result;
}

__inline
int128_t __ashrti3(int128_t value, int count)
{
    int128_t result;

    if (count < 64)
    {
        result.low = __shiftright128(value.low, value.high, count);
        result.high = value.high >> count;
    }
    else if (count < 128)
    {
        result.low = value.high >> (count - 64);
#if 1
        result.high = value.high >> 63;
#else
        result.high = value.high < 0 ? -1 : 0;
#endif
    }
    else
#if 1
        result.low = result.high = value.high >> 63;
#else
        result.low = result.high = value.high < 0 ? -1 : 0;
#endif
    return result;
}

__inline
uint128_t __lshrti3(uint128_t value, int count)
{
    uint128_t result;

    if (count < 64)
    {
        result.low = __shiftright128(value.low, value.high, count);
        result.high = value.high >> count;
    }
    else if (count < 128)
    {
        result.low = value.high >> (count - 64);
        result.high = 0;
    }
    else
        result.low = result.high = 0;

    return result;
}

__inline
uint128_t __rotlti3(uint128_t value, int count)
{
    uint128_t result;

    if ((count & 64) == 0)
    {
        result.low = __shiftleft128(value.high, value.low, count);
        result.high = __shiftleft128(value.low, value.high, count);
    }
    else
    {
        result.low = __shiftleft128(value.low, value.high, count);
        result.high = __shiftleft128(value.high, value.low, count);
    }

    return result;
}

__inline
uint128_t __rotrti3(uint128_t value, int count)
{
    uint128_t result;

    if ((count & 64) == 0)
    {
        result.low = __shiftright128(value.low, value.high, count);
        result.high = __shiftright128(value.high, value.low, count);
    }
    else
    {
        result.low = __shiftright128(value.high, value.low, count);
        result.high = __shiftright128(value.low, value.high, count);
    }

    return result;
}

#pragma intrinsic(_mul128, _umul128)

__inline
int128_t __multi3(int128_t multiplicand, int128_t multiplier)
{
    int128_t product;

    product.low = _umul128(multiplicand.low, multiplier.low, &product.high);
    product.high += multiplicand.low * multiplier.high
                  + multiplicand.high * multiplier.low;

    return product;
}

__inline
int128_t __mulvti3(int128_t multiplicand, int128_t multiplier)
{
    int128_t product, tmp;
#if 0
    if (multiplicand.high == (long long) multiplicand.low >> 63)
    {                               // -2**63 <= multiplicand < 2**63
        if (multiplier.high == (long long) multiplier.low >> 63)
        {                           // -2**63 <= multiplier < 2**63
            product.low = _mul128(multiplicand.low, multiplier.low, &product.high);

            return product;
        }

        product.low = _umul128(multiplicand.low, multiplier.low, &product.high);
        tmp.low = _umul128(multiplicand.low, multiplier.high, &tmp.high);

        if (multiplier.high < 0)
            tmp.high -= multiplicand.low;

        if ((long long) multiplicand.low < 0)
        {
            tmp.high -= multiplier.high
                      + (tmp.low < multiplier.low);
            tmp.low -= multiplier.low;
        }

        tmp.low += product.high;
        tmp.high += tmp.low < (unsigned long long) product.high;

        product.high = tmp.low;

        if (tmp.high == (long long) tmp.low >> 63)
            return product;
    }

    if (multiplier.high == (long long) multiplier.low >> 63)
    {                               // -2**63 <= multiplier < 2**63
        product.low = _umul128(multiplicand.low, multiplier.low, &product.high);
        tmp.low = _umul128(multiplicand.high, multiplier.low, &tmp.high);

        if (multiplicand.high < 0)
            tmp.high -= multiplier.low;

        if ((long long) multiplier.low < 0)
        {
            tmp.high -= multiplicand.high
                      + (tmp.low < multiplicand.low);
            tmp.low -= multiplicand.low;
        }

        tmp.low += product.high;
        tmp.high += tmp.low < (unsigned long long) product.high;

        product.high = tmp.low;

        if (tmp.high == (long long) tmp.low >> 63)
            return product;
    }

    product.low = _umul128(multiplicand.low, multiplier.low, &product.high);

    if (multiplicand.high < 0)
    {
        if (multiplier.high < 0)
        {
            if (((multiplicand.high & multiplier.high) == -1)
             && ((multiplicand.low | multiplier.low) != 0))
            {
                product.high -= multiplicand.low + multiplier.low;

                if (product.high >= 0)
                    return product;
            }
        }
        else
        {
            if ((multiplicand.high == -1) && (multiplier.high == 0))
            {
                product.high -= multiplier.low;

                if (product.high < 0)
                    return product;
            }
        }
    }
    else
    {
        if (multiplier.high < 0)
        {
            if ((multiplicand.high == 0) && (multiplier.high == -1))
            {
                product.high -= multiplicand.low;

                if (product.high < 0)
                    return product;
            }
        }
        else
        {
            if ((multiplicand.high == 0) && (multiplier.high == 0))
            {
                if (product.high >= 0)
                    return product;
            }
        }
    }

    __ud2();
#else
    int overflow, sign = (multiplicand.high ^ multiplier.high) < 0;

    if (multiplicand.high < 0)
    {
        multiplicand.low = 0 - multiplicand.low;
        multiplicand.high = 0 - multiplicand.high
                          - (0 < multiplicand.low);
    }

    if (multiplier.high < 0)
    {
        multiplier.low = 0 - multiplier.low;
        multiplier.high = 0 - multiplier.high
                        - (0 < multiplier.low);
    }

    overflow = (multiplicand.high != 0) & (multiplier.high != 0);

    tmp.low = _umul128(multiplicand.low, multiplier.high, &tmp.high);
    overflow |= tmp.high != 0;

    tmp.low += _umul128(multiplicand.high, multiplier.low, &tmp.high);
    overflow |= tmp.high != 0;

    product.low = _umul128(multiplicand.low, multiplier.low, &product.high);
    product.high += tmp.low;
    overflow |= (unsigned long long) product.high < tmp.low;

    if (sign != 0)
    {
        product.low = 0 - product.low;
        product.high = 0 - product.high
                     - (0 < product.low);
        overflow |= product.high >= 0;
    }
    else
        overflow |= product.high < 0;

    if (overflow != 0)
        __ud2();
#endif
    return product;
}

__inline
uint128_t __umulti3(uint128_t multiplicand, uint128_t multiplier)
{
    uint128_t product;

    product.low = _umul128(multiplicand.low, multiplier.low, &product.high);
    product.high += multiplicand.low * multiplier.high
                  + multiplicand.high * multiplier.low;

    return product;
}

__inline
uint128_t __umulvti3(uint128_t multiplicand, uint128_t multiplier)
{
    uint128_t product, tmp;
    int overflow = (multiplicand.high != 0) & (multiplier.high != 0);

    tmp.low = _umul128(multiplicand.high, multiplier.low, &tmp.high);
    overflow |= tmp.high != 0;

    tmp.low += _umul128(multiplicand.low, multiplier.high, &tmp.high);
    overflow |= tmp.high != 0;

    product.low = _umul128(multiplicand.low, multiplier.low, &product.high);
    product.high += tmp.low;
    overflow |= product.high < tmp.low;

    if (overflow != 0)
        __ud2();

    return product;
}

#if _MSC_VER >= 1920 // MSC 19.20 alias 2019
#pragma intrinsic(__shiftleft128, __shiftright128, _udiv128, _umul128, _BitScanReverse64)

uint128_t __udivmodti4(uint128_t dividend, uint128_t divisor, uint128_t *remainder)
{
    uint128_t quotient;
    uint64_t n0 = dividend.low, n1 = dividend.high, n2;
    uint64_t d0 = divisor.low, d1 = divisor.high;
    uint64_t p0, p1;
    unsigned bm, bn;

    if (!_BitScanReverse64(&bn, d1))
    {                               // *:q = n:n / 0:d
        if (d0 > n1)
            quotient.high = 0;
        else                        // q:q = n:n / 0:d
            quotient.high = _udiv128(0, n1, d0, &n1);

        quotient.low = _udiv128(n1, n0, d0, &n0);

        if (remainder != 0)
        {
            remainder->high = 0;
            remainder->low = n0;
        }
    }
    else
        if (d1 > n1)
        {                           // 0:0 = n:n / d:d
            quotient.low = quotient.high = 0;

            if (remainder != 0)
                *remainder = dividend;
        }
        else
        {                           // 0:q = n:n / d:d
            bm = 63 - bn;

            if (bm == 0)
            {
                // from "dividend.high >= divisor.high"
                //  and "most significant bit of divisor.high is set"
                //   follows "most significant bit of dividend.high is set"
                //    and thus "quotient.low is either 0 or 1"
                //
                // this special case is necessary, not an optimization!

                // the condition on the next line takes advantage of that
                //  (due to program flow) dividend.high >= divisor.high

                if ((n1 > d1) || (n0 >= d0))
                {
                    n1 -= d1 + (n0 < d0);
                    n0 -= d0;

                    quotient.low = 1;
                }
                else
                    quotient.low = 0;

                quotient.high = 0;

                if (remainder != 0)
                {
                    remainder->high = n1;
                    remainder->low = n0;
                }
            }
            else
            {                       // normalize
#if 0
                n2 = n1 >> ++bn;
                n1 <<= bm;
                n1 |= n0 >> bn;
#else
                n2 = __shiftleft128(n1, 0, bm);
                n1 = __shiftleft128(n0, n1, bm);
#endif
                n0 <<= bm;
#if 0
                d1 <<= bm;
                d1 |= d0 >> bn;
#else
                d1 = __shiftleft128(d0, d1, bm);
#endif
                d0 <<= bm;

                quotient.low = _udiv128(n2, n1, d1, &n1);
                quotient.high = 0;

                p0 = _umul128(quotient.low, d0, &p1);

                if ((p1 > n1) || ((p1 == n1) && (p0 > n0)))
                {
                    p1 -= d1 + (p0 < d0);
                    p0 -= d0;

                    quotient.low -= 1;
                }

                if (remainder != 0)
                {
                    n1 -= p1 + (n0 < p0);
                    n0 -= p0;
#if 0
                    remainder->low = (n0 >> bm) | (n1 << bn);
#else
                    remainder->low = __shiftright128(n0, n1, bm);
#endif
                    remainder->high = n1 >> bm;
                }
            }
        }

    return quotient;
}
#endif
#endif // INTERN
#endif // _MSC_VER
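
The overflow detection of __umulvti3() above does not depend on the Microsoft intrinsics. The following is a minimal portable sketch of the same idea, not the author's code: the names u128_sketch, umul128_sketch() and umulv128_sketch() are hypothetical, umul128_sketch() stands in for _umul128() by multiplying 32-bit halves, and the overflow flag is returned instead of executing __ud2().

```c
#include <stdint.h>

/* hypothetical stand-in for the uint128_t structure used above */
typedef struct { uint64_t low, high; } u128_sketch;

/* portable stand-in for _umul128(): 64x64-bit -> 128-bit product */
static uint64_t umul128_sketch(uint64_t a, uint64_t b, uint64_t *high)
{
    uint64_t a0 = (uint32_t) a, a1 = a >> 32;
    uint64_t b0 = (uint32_t) b, b1 = b >> 32;
    uint64_t p00 = a0 * b0, p01 = a0 * b1, p10 = a1 * b0, p11 = a1 * b1;
    /* sum of three 32-bit values, can not overflow 64 bits */
    uint64_t mid = (p00 >> 32) + (uint32_t) p01 + (uint32_t) p10;

    *high = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
    return (mid << 32) | (uint32_t) p00;
}

/* returns 1 when the product needs more than 128 bits, else 0;
   *product receives the product modulo 2**128 in both cases */
static int umulv128_sketch(u128_sketch x, u128_sketch y, u128_sketch *product)
{
    uint64_t cross, tmp;
    /* both high qwords non-zero: product >= 2**128 */
    int overflow = (x.high != 0) & (y.high != 0);

    cross = umul128_sketch(x.high, y.low, &tmp);
    overflow |= tmp != 0;
    /* unless overflow is already set, one of the two cross products is 0,
       so the addition itself needs no carry check */
    cross += umul128_sketch(x.low, y.high, &tmp);
    overflow |= tmp != 0;

    product->low = umul128_sketch(x.low, y.low, &product->high);
    product->high += cross;
    overflow |= product->high < cross;  /* carry out of the high qword */

    return overflow;
}
```

Returning the flag instead of trapping only serves to make the sketch testable; the routines above raise an invalid opcode exception through __ud2() instead.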

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.code
				; rcx = oword ptr |argument|
				; rdx = oword ptr argument
__absti2 proc	public

	mov	r8, [rdx]
	mov	rax, [rdx+8]	; rax:r8 = argument
	cqo			; rdx = (argument < 0) ? -1 : 0
	xor	r8, rdx
	xor	rax, rdx	; rax:r8 = (argument < 0) ? ~argument : argument
	sub	r8, rdx
	sbb	rax, rdx	; rax:r8 = (argument < 0) ? -argument : argument
				;        = |argument|
	mov	[rcx], r8
	mov	[rcx+8], rax

	mov	rax, rcx	; rax = address of |argument|
	ret

__absti2 endp
				; rcx = oword ptr |argument|
				; rdx = oword ptr argument
__absvti2 proc	public

	mov	r8, [rdx]
	mov	rax, [rdx+8]	; rax:r8 = argument
	cqo			; rdx = (argument < 0) ? -1 : 0
	xor	r8, rdx
	xor	rax, rdx	; rax:r8 = (argument < 0) ? ~argument : argument
	sub	r8, rdx
	sbb	rax, rdx	; rax:r8 = (argument < 0) ? -argument : argument
				;        = |argument|
	jo	short overflow	; |argument| = argument = ±2**127?

	mov	[rcx], r8
	mov	[rcx+8], rax

	mov	rax, rcx	; rax = address of |argument|
	ret

overflow:
	ud2

__absvti2 endp
				; rcx = oword ptr result
				; rdx = oword ptr argument
__bswapti2 proc	public

	mov	rax, [rdx]
	mov	rdx, [rdx+8]	; rdx:rax = argument

	movbe	[rcx+8], rax
	movbe	[rcx], rdx

	mov	rax, rcx	; rax = address of result
	ret

__bswapti2 endp
				; rcx = oword ptr result
				; rdx = oword ptr argument
__reverseti2 proc public

	mov	rax, [rdx]
	mov	rdx, [rdx+8]	; rdx:rax = argument

	mov	r9, 0AAAAAAAAAAAAAAAAh
	lea	r10, [rax+rax]
	lea	r11, [rdx+rdx]	; r11:r10 = argument << 1
	and	rax, r9
	and	rdx, r9		; rdx:rax = argument
				;         & 0xAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
	and	r10, r9
	and	r11, r9		; r11:r10 = (argument << 1)
				;         & 0xAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
	shr	rax, 1
	shr	rdx, 1		; rdx:rax = (argument >> 1)
				;         & 0x55555555555555555555555555555555
	or	rax, r10
	or	rdx, r11	; rdx:rax = ((argument >> 1)
				;           & 0x55555555555555555555555555555555)
				;         | ((argument << 1)
				;           & 0xAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
				;         = argument'
	mov	r9, 0CCCCCCCCCCCCCCCCh
if 0
	lea	r10, [4*rax]
	lea	r11, [4*rdx]	; r11:r10 = argument' << 2
else
	mov	r10, rax
	mov	r11, rdx	; r11:r10 = argument'
	shl	r10, 2
	shl	r11, 2		; r11:r10 = argument' << 2
endif
	and	rax, r9
	and	rdx, r9		; rdx:rax = argument'
				;         & 0xCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
	and	r10, r9
	and	r11, r9		; r11:r10 = (argument' << 2)
				;         & 0xCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
	shr	rax, 2
	shr	rdx, 2		; rdx:rax = (argument' >> 2)
				;         & 0x33333333333333333333333333333333
	or	rax, r10
	or	rdx, r11	; rdx:rax = ((argument' >> 2)
				;           & 0x33333333333333333333333333333333)
				;         | ((argument' << 2)
				;           & 0xCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC)
				;         = argument"
	mov	r9, 0F0F0F0F0F0F0F0F0h
	mov	r10, rax
	mov	r11, rdx	; r11:r10 = argument"
	shl	r10, 4
	shl	r11, 4		; r11:r10 = argument" << 4
	and	rax, r9
	and	rdx, r9		; rdx:rax = argument"
				;         & 0xF0F0F0F0F0F0F0F0F0F0F0F0F0F0F0F0
	and	r10, r9
	and	r11, r9		; r11:r10 = (argument" << 4)
				;         & 0xF0F0F0F0F0F0F0F0F0F0F0F0F0F0F0F0
	shr	rax, 4
	shr	rdx, 4		; rdx:rax = (argument" >> 4)
				;         & 0x0F0F0F0F0F0F0F0F0F0F0F0F0F0F0F0F
	or	rax, r10
	or	rdx, r11	; rdx:rax = ((argument" >> 4)
				;           & 0x0F0F0F0F0F0F0F0F0F0F0F0F0F0F0F0F)
				;         | ((argument" << 4)
				;           & 0xF0F0F0F0F0F0F0F0F0F0F0F0F0F0F0F0)
	movbe	[rcx+8], rax
	movbe	[rcx], rdx

	mov	rax, rcx	; rax = address of result
	ret

__reverseti2 endp
				; rcx = oword ptr argument
__clzti2 proc	public

	mov	eax, 64
	bsr	rdx, [rcx+8]	; rdx = index of most significant '1' bit - 64
	jnz	short @f	; high qword of argument <> 0?

	add	eax, eax
	bsr	rdx, [rcx]	; rdx = index of most significant '1' bit
	jz	short return
@@:
	stc
	sbb	eax, edx	; rax = 127 - index of most significant '1' bit
return:
	ret

__clzti2 endp
				; rcx = oword ptr argument
__ctzti2 proc	public

	mov	eax, 64
	bsf	rdx, [rcx+8]	; rdx = index of least significant '1' bit - 64
	cmovz	edx, eax	; rdx = (high qword of argument <> 0)
				;     ? index of least significant '1' bit - 64 : 64
	add	edx, eax	; rdx = (high qword of argument <> 0)
				;     ? index of least significant '1' bit : 128
	bsf	rax, [rcx]	; rax = index of least significant '1' bit
	cmovz	eax, edx
	ret

__ctzti2 endp
				; rcx = oword ptr argument
__parityti2 proc public

	mov	rax, [rcx]
	xor	rax, [rcx+8]
	shld	rcx, rax, 32
	xor	eax, ecx
	shld	ecx, eax, 16
	xor	ecx, eax
	xor	eax, eax
	xor	cl, ch
	setpo	al		; rax = {0, 1}
	ret

__parityti2 endp
				; rcx = oword ptr argument
__popcountti2 proc public

	mov	rax, [rcx]
	mov	rdx, [rcx+8]	; rdx:rax = argument
	mov	rcx, 5555555555555555h
	mov	r10, rax
	mov	r11, rdx	; r11:r10 = argument
	shr	rax, 1
	shr	rdx, 1		; rdx:rax = argument >> 1
	and	rax, rcx
	and	rdx, rcx	; rdx:rax = (argument >> 1)
				;         & 0x55555555555555555555555555555555
	sub	r10, rax
	sub	r11, rdx	; r11:r10 = argument
				;         - ((argument >> 1)
				;           & 0x55555555555555555555555555555555)
				;         = argument'
	mov	rcx, 3333333333333333h
	mov	rax, r10
	mov	rdx, r11	; rdx:rax = argument'
	and	r10, rcx
	and	r11, rcx	; r11:r10 = argument'
				;         & 0x33333333333333333333333333333333
	shr	rax, 2
	shr	rdx, 2		; rdx:rax = argument' >> 2
	and	rax, rcx
	and	rdx, rcx	; rdx:rax = (argument' >> 2)
				;         & 0x33333333333333333333333333333333
	add	r10, rax
	add	r11, rdx	; r11:r10 = (argument'
				;           & 0x33333333333333333333333333333333)
				;         + ((argument' >> 2)
				;           & 0x33333333333333333333333333333333)
				;         = argument"
	mov	rcx, 0F0F0F0F0F0F0F0Fh
	mov	rax, r10
	mov	rdx, r11	; rdx:rax = argument"
	shr	r10, 4
	shr	r11, 4		; r11:r10 = argument" >> 4
	add	rax, r10
	add	rdx, r11	; rdx:rax = argument" + (argument" >> 4)
	and	rax, rcx
	and	rdx, rcx	; rdx:rax = (argument" + (argument" >> 4))
				;         & 0x0F0F0F0F0F0F0F0F0F0F0F0F0F0F0F0F
	mov	rcx, 0101010101010101h
	imul	rax, rcx
	imul	rdx, rcx	; rdx:rax = ((argument" + (argument" >> 4))
				;           & 0x0F0F0F0F0F0F0F0F0F0F0F0F0F0F0F0F)
				;         * 0x01010101010101010101010101010101
	add	rax, rdx
	shr	rax, 56		; rax = [0, 128]
	ret

__popcountti2 endp
				; rcx = oword ptr comparand
				; rdx = oword ptr comparator
__cmpti2 proc	public

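	mov	r8, [rcx]
	mov	r9, [rcx+8]	; r9:r8 = comparand
	mov	r10, [rdx]
	mov	r11, [rdx+8]	; r11:r10 = comparator
				; NOTE: ZF after SBB reflects only the high qword,
				;       SETG is therefore not usable here;
				;       compare both ways with SETL (SF <> OF) instead
	cmp	r8, r10
	mov	rax, r9
	sbb	rax, r11	; SF <> OF = (comparand < comparator)
	setl	cl		; cl = (comparand < comparator) ? 1 : 0
	cmp	r10, r8
	sbb	r11, r9		; SF <> OF = (comparator < comparand)
	setl	al		; al = (comparand > comparator) ? 1 : 0
	sub	al, cl		; al = (comparand > comparator)
				;    - (comparand < comparator)
				;    = {-1, 0, 1}
	movsx	eax, al
	inc	eax		; rax = (comparand > comparator)
				;     - (comparand < comparator)
				;     + 1
				;     = {0, 1, 2}
	ret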

__cmpti2 endp
				; rcx = oword ptr comparand
				; rdx = oword ptr comparator
__ucmpti2 proc	public

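	mov	r8, [rcx]
	mov	r9, [rcx+8]	; r9:r8 = comparand
	mov	r10, [rdx]
	mov	r11, [rdx+8]	; r11:r10 = comparator
				; NOTE: ZF after SBB reflects only the high qword,
				;       SETA is therefore not usable here;
				;       compare both ways using CF instead
	xor	eax, eax	; rax = 0
	cmp	r10, r8
	mov	rcx, r11
	sbb	rcx, r9		; CF = (comparator < comparand)
	setb	al		; rax = (comparand > comparator)
	cmp	r8, r10
	sbb	r9, r11		; CF = (comparand < comparator)
	sbb	eax, -1		; rax = (comparand > comparator)
				;     - (comparand < comparator)
				;     + 1
				;     = {0, 1, 2}
	ret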

__ucmpti2 endp
				; rcx = oword ptr negation
				; rdx = oword ptr negend
__negti2 proc	public

	xor	eax, eax
	mov	r8, [rdx]
	neg	r8
	sbb	rax, [rdx+8]
	mov	[rcx], r8
	mov	[rcx+8], rax

	mov	rax, rcx	; rax = address of negation
	ret

__negti2 endp
				; rcx = oword ptr negation
				; rdx = oword ptr negend
__negvti2 proc	public

	xor	eax, eax
	mov	r8, [rdx]
	neg	r8
	sbb	rax, [rdx+8]
	jo	short overflow	; negation = negend = ±2**127?

	mov	[rcx], r8
	mov	[rcx+8], rax

	mov	rax, rcx	; rax = address of negation
	ret

overflow:
	ud2

__negvti2 endp
				; rcx = oword ptr sum
				; rdx = oword ptr augend
				; r8 = oword ptr addend
__addti3 proc	public
__uaddti3 proc	public

	mov	rax, [rdx]
	mov	rdx, [rdx+8]	; rdx:rax = augend

	add	rax, [r8]
	adc	rdx, [r8+8]	; rdx:rax = augend + addend
				;         = sum
	mov	[rcx], rax
	mov	[rcx+8], rdx

	mov	rax, rcx	; rax = address of sum
	ret

__uaddti3 endp
__addti3 endp
				; rcx = oword ptr sum
				; rdx = oword ptr augend
				; r8 = oword ptr addend
__addvti3 proc	public

	mov	rax, [rdx]
	mov	rdx, [rdx+8]	; rdx:rax = augend

	add	rax, [r8]
	adc	rdx, [r8+8]	; rdx:rax = augend + addend
				;         = sum
	jo	short overflow	; sum < -2**127?
				; sum >= 2**127?
	mov	[rcx], rax
	mov	[rcx+8], rdx

	mov	rax, rcx	; rax = address of sum
	ret

overflow:
	ud2

__addvti3 endp
				; rcx = oword ptr sum
				; rdx = oword ptr augend
				; r8 = oword ptr addend
__uaddvti3 proc	public

	mov	rax, [rdx]
	mov	rdx, [rdx+8]	; rdx:rax = augend

	add	rax, [r8]
	adc	rdx, [r8+8]	; rdx:rax = augend + addend
				;         = sum
	jc	short overflow	; sum >= 2**128?

	mov	[rcx], rax
	mov	[rcx+8], rdx

	mov	rax, rcx	; rax = address of sum
	ret

overflow:
	ud2

__uaddvti3 endp
				; rcx = oword ptr difference
				; rdx = oword ptr minuend
				; r8 = oword ptr subtrahend
__subti3 proc	public
__usubti3 proc	public

	mov	rax, [rdx]
	mov	rdx, [rdx+8]	; rdx:rax = minuend

	sub	rax, [r8]
	sbb	rdx, [r8+8]	; rdx:rax = minuend - subtrahend
				;         = difference
	mov	[rcx], rax
	mov	[rcx+8], rdx

	mov	rax, rcx	; rax = address of difference
	ret

__usubti3 endp
__subti3 endp
				; rcx = oword ptr difference
				; rdx = oword ptr minuend
				; r8 = oword ptr subtrahend
__subvti3 proc	public

	mov	rax, [rdx]
	mov	rdx, [rdx+8]	; rdx:rax = minuend

	sub	rax, [r8]
	sbb	rdx, [r8+8]	; rdx:rax = minuend - subtrahend
				;         = difference
	jo	short overflow	; difference < -2**127?
				; difference >= 2**127?
	mov	[rcx], rax
	mov	[rcx+8], rdx

	mov	rax, rcx	; rax = address of difference
	ret

overflow:
	ud2

__subvti3 endp
				; rcx = oword ptr difference
				; rdx = oword ptr minuend
				; r8 = oword ptr subtrahend
__usubvti3 proc	public

	mov	rax, [rdx]
	mov	rdx, [rdx+8]	; rdx:rax = minuend

	sub	rax, [r8]
	sbb	rdx, [r8+8]	; rdx:rax = minuend - subtrahend
				;         = difference
	jb	short overflow	; difference < 0?

	mov	[rcx], rax
	mov	[rcx+8], rdx

	mov	rax, rcx	; rax = address of difference
	ret

overflow:
	ud2

__usubvti3 endp
				; rcx = oword ptr maximum
				; rdx = oword ptr left argument
				; r8 = oword ptr right argument
__maxti3 proc	public

	mov	r11, [rdx+8]
	mov	r10, [rdx]	; r11:r10 = left argument

	mov	r9, [r8+8]
	mov	r8, [r8]	; r9:r8 = right argument

	cmp	r10, r8
	mov	rax, r11
	sbb	rax, r9
	cmovl	r11, r9
	cmovl	r10, r8		; r11:r10 = (left argument < right argument)
				;         ? right argument : left argument
	mov	[rcx+8], r11
	mov	[rcx], r10	; maximum = r11:r10

	mov	rax, rcx	; rax = address of maximum
	ret

__maxti3 endp
				; rcx = oword ptr maximum
				; rdx = oword ptr left argument
				; r8 = oword ptr right argument
__umaxti3 proc	public

	mov	r11, [rdx+8]
	mov	r10, [rdx]	; r11:r10 = left argument

	mov	r9, [r8+8]
	mov	r8, [r8]	; r9:r8 = right argument

	cmp	r10, r8
	mov	rax, r11
	sbb	rax, r9
	cmovb	r11, r9
	cmovb	r10, r8		; r11:r10 = (left argument < right argument)
				;         ? right argument : left argument
	mov	[rcx+8], r11
	mov	[rcx], r10	; maximum = r11:r10

	mov	rax, rcx	; rax = address of maximum
	ret

__umaxti3 endp
				; rcx = oword ptr minimum
				; rdx = oword ptr left argument
				; r8 = oword ptr right argument
__minti3 proc	public

	mov	r11, [rdx+8]
	mov	r10, [rdx]	; r11:r10 = left argument

	mov	r9, [r8+8]
	mov	r8, [r8]	; r9:r8 = right argument

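				; NOTE: ZF after SBB reflects only the high qword,
				;       so test "right < left" with CMOVL instead
				;       of "left > right" with CMOVG
	cmp	r8, r10
	mov	rax, r9
	sbb	rax, r11
	cmovl	r11, r9
	cmovl	r10, r8		; r11:r10 = (left argument > right argument)
				;         ? right argument : left argument
	mov	[rcx+8], r11
	mov	[rcx], r10	; minimum = r11:r10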

	mov	rax, rcx	; rax = address of minimum
	ret

__minti3 endp
				; rcx = oword ptr minimum
				; rdx = oword ptr left argument
				; r8 = oword ptr right argument
__uminti3 proc	public

	mov	r11, [rdx+8]
	mov	r10, [rdx]	; r11:r10 = left argument

	mov	r9, [r8+8]
	mov	r8, [r8]	; r9:r8 = right argument

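				; NOTE: ZF after SBB reflects only the high qword,
				;       so test "right < left" with CMOVB instead
				;       of "left > right" with CMOVA
	cmp	r8, r10
	mov	rax, r9
	sbb	rax, r11
	cmovb	r11, r9
	cmovb	r10, r8		; r11:r10 = (left argument > right argument)
				;         ? right argument : left argument
	mov	[rcx+8], r11
	mov	[rcx], r10	; minimum = r11:r10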

	mov	rax, rcx	; rax = address of minimum
	ret

__uminti3 endp
				; rcx = oword ptr result
				; rdx = oword ptr value
				; r8 = qword count
__ashlti3 proc	public
__lshlti3 proc	public

	mov	rax, rcx	; rax = address of result
	mov	rcx, r8		; rcx = count
	mov	r8, [rdx]
	mov	r9, [rdx+8]	; r9:r8 = value
ifdef JccLess
	xor	edx, edx

	shld	r9, r8, cl
	shl	r8, cl		; r9:r8 = value << count % 64

	cmp	ecx, 63
	cmova	r9, r8
	cmova	r8, rdx
	cmp	ecx, 127
	cmova	r9, rdx		; r9:r8 = value << count

	mov	[rax], r8
	mov	[rax+8], r9
	ret
else ; JccLess
	cmp	ecx, 127
	ja	short zero	; count > 127?

	cmp	ecx, 63
	ja	short @f	; count > 63?

	shld	r9, r8, cl
	shl	r8, cl		; r9:r8 = value << count % 64
	mov	[rax], r8
	mov	[rax+8], r9
	ret

@@:
	shl	r8, cl
	xor	ecx, ecx
	mov	[rax], rcx
	mov	[rax+8], r8
	ret

zero:
	xor	ecx, ecx
	mov	[rax], rcx
	mov	[rax+8], rcx
	ret
endif ; JccLess

__lshlti3 endp
__ashlti3 endp
				; rcx = oword ptr result
				; rdx = oword ptr value
				; r8 = qword count
__ashrti3 proc	public

	mov	rax, rcx	; rax = address of result
	mov	rcx, r8		; rcx = count
	mov	r8, [rdx]
	mov	r9, [rdx+8]	; r9:r8 = value
ifdef JccLess
	mov	rdx, r9
	sar	rdx, 63

	shrd	r8, r9, cl
	sar	r9, cl		; r9:r8 = value >> count % 64

	cmp	ecx, 63
	cmova	r8, r9
	cmova	r9, rdx
	cmp	ecx, 127
	cmova	r8, rdx		; r9:r8 = value >> count

	mov	[rax], r8
	mov	[rax+8], r9
	ret
else ; JccLess
	cmp	ecx, 127
	ja	short sign	; count > 127?

	cmp	ecx, 63
	ja	short @f	; count > 63?

	shrd	r8, r9, cl
	sar	r9, cl		; r9:r8 = value >> count % 64
	mov	[rax], r8
	mov	[rax+8], r9
	ret

@@:
	mov	r8, r9
	sar	r8, cl
	sar	r9, 63
	mov	[rax], r8
	mov	[rax+8], r9
	ret

sign:
	sar	r9, 63
	mov	[rax], r9
	mov	[rax+8], r9
	ret
endif ; JccLess

__ashrti3 endp
				; rcx = oword ptr result
				; rdx = oword ptr value
				; r8 = qword count
__lshrti3 proc	public

	mov	rax, rcx	; rax = address of result
	mov	rcx, r8		; rcx = count
	mov	r8, [rdx]
	mov	r9, [rdx+8]	; r9:r8 = value
ifdef JccLess
	xor	edx, edx

	shrd	r8, r9, cl
	shr	r9, cl		; r9:r8 = value >> count % 64

	cmp	ecx, 63
	cmova	r8, r9
	cmova	r9, rdx
	cmp	ecx, 127
	cmova	r8, rdx		; r9:r8 = value >> count

	mov	[rax], r8
	mov	[rax+8], r9
	ret
else ; JccLess
	cmp	ecx, 127
	ja	short zero	; count > 127?

	cmp	ecx, 63
	ja	short @f	; count > 63?

	shrd	r8, r9, cl
	shr	r9, cl		; r9:r8 = value >> count % 64
	mov	[rax], r8
	mov	[rax+8], r9
	ret

@@:
	shr	r9, cl
	xor	ecx, ecx
	mov	[rax], r9
	mov	[rax+8], rcx
	ret

zero:
	xor	ecx, ecx
	mov	[rax], rcx
	mov	[rax+8], rcx
	ret
endif ; JccLess

__lshrti3 endp
				; rcx = oword ptr result
				; rdx = oword ptr value
				; r8 = qword count
__rotlti3 proc	public

	mov	rax, rcx	; rax = address of result
	mov	rcx, r8		; rcx = count
	mov	r8, [rdx]
	mov	r9, [rdx+8]	; r9:r8 = value

	mov	rdx, r8
	shld	r8, r9, cl
	shld	r9, rdx, cl	; r9:r8 = value << (count % 64)
				;       | value >> (64 - count % 64)
	test	cl, 64
	jz	short @f

	xchg	r8, r9		; r9:r8 = value << (count % 128)
				;       | value >> (128 - count % 128)
@@:
	mov	[rax], r8
	mov	[rax+8], r9
	ret

__rotlti3 endp
				; rcx = oword ptr result
				; rdx = oword ptr value
				; r8 = qword count
__rotrti3 proc	public

	mov	rax, rcx	; rax = address of result
	mov	rcx, r8		; rcx = count
	mov	r8, [rdx]
	mov	r9, [rdx+8]	; r9:r8 = value

	mov	rdx, r8
	shrd	r8, r9, cl
	shrd	r9, rdx, cl	; r9:r8 = value >> (count % 64)
				;       | value << (64 - count % 64)
	test	cl, 64
	jz	short @f

	xchg	r8, r9		; r9:r8 = value >> (count % 128)
				;       | value << (128 - count % 128)
@@:
	mov	[rax], r8
	mov	[rax+8], r9
	ret

__rotrti3 endp
				; rcx = oword ptr product
				; rdx = oword ptr multiplicand
				; r8 = oword ptr multiplier
__multi3 proc	public
__umulti3 proc	public

	mov	r11, [rdx+8]	; r11 = high qword of multiplicand
	mov	r10, [rdx]	; r10 = low qword of multiplicand

	mov	r9, [r8+8]	; r9 = high qword of multiplier
	mov	r8, [r8]	; r8 = low qword of multiplier

	mov	rax, r10
	mul	r8		; rdx:rax = low qword of multiplicand
				;         * low qword of multiplier
	imul	r8, r11		; r8 = low qword of multiplier
				;    * high qword of multiplicand
	imul	r9, r10		; r9 = high qword of multiplier
				;    * low qword of multiplicand
	add	rdx, r8
	add	rdx, r9		; rdx:rax = product % 2**128

	mov	[rcx], rax
	mov	[rcx+8], rdx

	mov	rax, rcx	; rax = address of product
	ret

__umulti3 endp
__multi3 endp
				; rcx = oword ptr product
				; rdx = oword ptr multiplicand
				; r8 = oword ptr multiplier
__mulvti3 proc	public

	mov	r11, [rdx+8]	; r11 = high qword of multiplicand
	mov	r10, [rdx]	; r10 = low qword of multiplicand

	mov	r9, [r8+8]	; r9 = high qword of multiplier
	mov	r8, [r8]	; r8 = low qword of multiplier

	mov	[rsp+8], rcx
	mov	[rsp+16], rbx

	mov	rax, r11
	cqo			; rdx = (multiplicand < 0) ? -1 : 0
	mov	rcx, rdx	; rcx = (multiplicand < 0) ? -1 : 0
	xor	r10, rdx
	xor	r11, rdx	; r11:r10 = (multiplicand < 0) ? ~multiplicand : multiplicand
	sub	r10, rdx
	sbb	r11, rdx	; r11:r10 = (multiplicand < 0) ? -multiplicand : multiplicand
				;         = |multiplicand|
	mov	rax, r9
	cqo			; rdx = (multiplier < 0) ? -1 : 0
	xor	rcx, rdx	; rcx = (multiplier < 0) <> (multiplicand < 0) ? -1 : 0
				;     = (product < 0) ? -1 : 0
	xor	r8, rdx
	xor	r9, rdx		; r9:r8 = (multiplier < 0) ? ~multiplier : multiplier
	sub	r8, rdx
	sbb	r9, rdx		; r9:r8 = (multiplier < 0) ? -multiplier : multiplier
				;       = |multiplier|
	xor	ebx, ebx	; rbx = 0
	cmp	rbx, r11
	sbb	eax, eax	; eax = (high qword of |multiplicand| = 0) ? 0 : -1
				;     = (|multiplicand| < 2**64) ? 0 : -1
	cmp	rbx, r9
	sbb	ebx, ebx	; ebx = (high qword of |multiplier| = 0) ? 0 : -1
				;     = (|multiplier| < 2**64) ? 0 : -1
	and	ebx, eax	; ebx = (|multiplicand| < 2**64)
				;     | (|multiplier| < 2**64) ? 0 : -1
				;     = (|product| < 2**128) ? 0 : -1
	mov	rax, r11
	mul	r8		; rdx:rax = high qword of |multiplicand|
				;         * low qword of |multiplier|
	adc	ebx, ebx	; ebx = (|product| < 2**128) ? 0 : *
	mov	r11, rax

	mov	rax, r10
	mul	r9		; rdx:rax = low qword of |multiplicand|
				;         * high qword of |multiplier|
	adc	ebx, ebx	; ebx = (|product| < 2**128) ? 0 : *

	add	r11, rax	; r11 = high qword of |multiplicand|
				;     * low qword of |multiplier|
				;     + low qword of |multiplicand|
				;     * high qword of |multiplier|
;;	adc	ebx, ebx	; ebx = (|product| < 2**128) ? 0 : *

	mov	rax, r10
	mul	r8		; rdx:rax = low qword of |multiplicand|
				;         * low qword of |multiplier|
	add	rdx, r11	; rdx:rax = |product % 2**128|
				;         = |product| % 2**128
	adc	ebx, ebx	; ebx = (|product| < 2**128) ? 0 : *
if 0
	xor	rax, rcx
	xor	rdx, rcx	; rdx:rax = (product < 0) ? product % 2**128 - 1 : product % 2**128
	sub	rax, rcx
	sbb	rdx, rcx	; rdx:rax = product % 2**128

	xor	rcx, rdx	; rcx = (multiplicand < 0)
				;     ^ (multiplier < 0)
				;     ^ (product < 0) ? negative : positive
	add	rcx, rcx
else
	add	rax, rcx
	adc	rdx, rcx	; rdx:rax = (product < 0) ? ~product % 2**128 : product % 2**128
	mov	r11, rdx	; r11 = (multiplicand < 0)
				;     ^ (multiplier < 0)
				;     ^ (product < 0) ? negative : positive
	xor	rax, rcx
	xor	rdx, rcx	; rdx:rax = product % 2**128
	add	r11, r11
endif
	adc	ebx, ebx	; ebx = (-2**127 <= product < 2**127) ? 0 : *
	jnz	short overflow	; product < -2**127?
				; product >= 2**127?
	mov	rcx, [rsp+8]
	mov	rbx, [rsp+16]

	mov	[rcx], rax
	mov	[rcx+8], rdx

	mov	rax, rcx	; rax = address of product
	ret

overflow:
	ud2

__mulvti3 endp
				; rcx = oword ptr product
				; rdx = oword ptr multiplicand
				; r8 = oword ptr multiplier
__umulvti3 proc	public

	mov	r11, [rdx+8]	; r11 = high qword of multiplicand
	mov	r10, [rdx]	; r10 = low qword of multiplicand

	mov	r9, [r8+8]	; r9 = high qword of multiplier
	mov	r8, [r8]	; r8 = low qword of multiplier
ifndef JccLess
	test	r11, r11
	jz	short @f	; multiplicand < 2**64?

	test	r9, r9
	jnz	short overflow	; multiplier >= 2**64?

@@:
	mov	rax, r11
	mul	r8		; rdx:rax = high qword of multiplicand
				;         * low qword of multiplier
	jc	short overflow	; product >= 2**128?

	mov	r11, rax

	mov	rax, r10
	mul	r9		; rdx:rax = low qword of multiplicand
				;         * high qword of multiplier
	jc	short overflow	; product >= 2**128?

	add	r11, rax	; r11 = high qword of multiplicand
				;     * low qword of multiplier
				;     + low qword of multiplicand
				;     * high qword of multiplier
;;	jc	short overflow

	mov	rax, r10
	mul	r8		; rdx:rax = low qword of multiplicand
				;         * low qword of multiplier
	add	rdx, r11	; rdx:rax = product % 2**128
	jc	short overflow	; product >= 2**128?
else ; JccLess
	mov	[rsp+8], rbx
if 0
	mov	rax, r11
	mul	r9		; rdx:rax = high qword of multiplicand
				;         * high qword of multiplier
	sbb	ebx, ebx	; ebx = (product < 2**192) ? 0 : -1
	neg	rax
	adc	ebx, ebx	; ebx = (product < 2**128) ? 0 : *
else
	xor	eax, eax
	cmp	rax, r11
	sbb	ebx, ebx	; ebx = (high qword of multiplicand = 0) ? 0 : -1
				;     = (multiplicand < 2**64) ? 0 : -1
	cmp	rax, r9
	sbb	eax, eax	; eax = (high qword of multiplier = 0) ? 0 : -1
				;     = (multiplier < 2**64) ? 0 : -1
	and	ebx, eax	; ebx = (multiplicand < 2**64)
				;     | (multiplier < 2**64) ? 0 : -1
				;     = (product < 2**128) ? 0 : -1
endif
	mov	rax, r11
	mul	r8		; rdx:rax = high qword of multiplicand
				;         * low qword of multiplier
	adc	ebx, ebx	; ebx = (product < 2**128) ? 0 : *
	mov	r11, rax

	mov	rax, r10
	mul	r9		; rdx:rax = low qword of multiplicand
				;         * high qword of multiplier
	adc	ebx, ebx	; ebx = (product < 2**128) ? 0 : *

	add	r11, rax	; r11 = high qword of multiplicand
				;     * low qword of multiplier
				;     + low qword of multiplicand
				;     * high qword of multiplier
;;	adc	ebx, ebx	; ebx = (product < 2**128) ? 0 : *

	mov	rax, r10
	mul	r8		; rdx:rax = low qword of multiplicand
				;         * low qword of multiplier
	add	rdx, r11	; rdx:rax = product % 2**128
	adc	ebx, ebx	; ebx = (product < 2**128) ? 0 : *
	jnz	short overflow	; product >= 2**128?

	mov	rbx, [rsp+8]
endif ; JccLess
	mov	[rcx], rax
	mov	[rcx+8], rdx

	mov	rax, rcx	; rax = address of product
	ret

overflow:
	ud2

__umulvti3 endp
	end
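
The bit-counting technique of __popcountti2 above can be sketched in C as follows. This is an illustration under assumptions, not the author's code: the names bytecount64_sketch() and popcount128_sketch() are hypothetical, and the 128-bit argument is passed as two separate 64-bit halves.

```c
#include <stdint.h>

/* returns the number of '1' bits of each byte of x in the corresponding byte */
static uint64_t bytecount64_sketch(uint64_t x)
{
    /* 2-bit fields: count of '1' bits in each pair of bits */
    x -= (x >> 1) & 0x5555555555555555ULL;
    /* 4-bit fields: sum of two adjacent 2-bit counts */
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    /* 8-bit fields: sum of two adjacent 4-bit counts */
    return (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
}

/* number of '1' bits in the 128-bit value high:low, in the range [0, 128] */
static unsigned popcount128_sketch(uint64_t low, uint64_t high)
{
    const uint64_t ones = 0x0101010101010101ULL;

    /* the multiplication sums all 8 byte counts into the most significant
       byte; both products are added before the shift, like in the assembly,
       since 64 + 64 = 128 still fits into one byte */
    return (unsigned) ((bytecount64_sketch(low) * ones
                      + bytecount64_sketch(high) * ones) >> 56);
}
```

Adding the two products before the single shift saves one shift; no carry can reach the most significant byte, because the partial sums in the lower bytes stay below 256.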

64÷64-bit Unsigned Integer Division (64-bit Quotient and Remainder)

__udivmoddi4() function for AMD64 processors, supporting the Microsoft calling convention and using the shift & subtract algorithm:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

; NOTE: raises "division exception" when divisor is 0!

	.code
				; rcx = dividend
				; rdx = divisor
				; r8 = qword ptr remainder
__udivmoddi4 proc public

	cmp	rcx, rdx
	jb	short trivial	; dividend < divisor?

	bsr	rax, rdx	; rax = index of most significant '1' bit of divisor
	jz	short error	; divisor = 0?

	mov	r9, rcx		; r9 = dividend
	bsr	rcx, rcx	; rcx = index of most significant '1' bit of dividend
	jz	short zero	; dividend = 0?

	sub	ecx, eax	; ecx = distance of leading '1' bits
;;	jb	short trivial	; dividend < divisor?

	shl	rdx, cl		; rdx = divisor << distance of leading '1' bits
				;     = divisor'
	xor	eax, eax	; eax = quotient' = 0
@@:
	mov	r10, r9		; r10 = dividend'
	sub	r9, rdx		; r9 = dividend' - divisor'
				;    = dividend"
				; CF = (dividend' < divisor')
	cmovb	r9, r10		; r9 = (dividend' < divisor') ? dividend' : dividend"
	cmc			; CF = (dividend' >= divisor')
	adc	rax, rax	; rax = quotient' << 1
				;     + (dividend' >= divisor')
				;     = quotient
	shr	rdx, 1		; rdx = divisor' >> 1
				;     = divisor
	dec	ecx
	jns	short @b

	test	r8, r8
	jz	short @f	; address of remainder = 0?

	mov	[r8], r9	; remainder = dividend"
@@:
	ret

	; dividend < divisor: quotient = 0, remainder = dividend
trivial:
	test	r8, r8
	jz	short @f	; address of remainder = 0?

	mov	[r8], rcx	; remainder = dividend
@@:
	xor	eax, eax	; rax = quotient = 0
	ret

	; dividend = 0: quotient = 0, remainder = 0
zero:
	test	r8, r8
	jz	short @f	; address of remainder = 0?

	mov	[r8], r9	; remainder = 0
@@:
	mov	rax, r9		; rax = quotient = 0
	ret

	; divisor = 0
error:
	div	rdx
	ret

__udivmoddi4 endp
	end
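
The loop of the routine above corresponds to the following C sketch. It is an illustration under assumptions, not the author's code: the names msb_index_sketch() and udivmod64_sketch() are hypothetical, msb_index_sketch() stands in for the BSR instruction, and the divisor must not be 0 (the assembly raises a division exception in that case, the sketch does not check).

```c
#include <stdint.h>

/* index of the most significant '1' bit; value must not be 0
   (stand-in for the BSR instruction) */
static unsigned msb_index_sketch(uint64_t value)
{
    unsigned index = 0;

    while (value >>= 1)
        index++;

    return index;
}

/* shift & subtract division; precondition: divisor != 0 */
static uint64_t udivmod64_sketch(uint64_t dividend, uint64_t divisor,
                                 uint64_t *remainder)
{
    uint64_t quotient = 0;
    int distance;

    if (dividend < divisor)
    {   /* trivial case: quotient = 0, remainder = dividend */
        if (remainder != 0)
            *remainder = dividend;

        return 0;
    }

    /* distance of the leading '1' bits, >= 0 since dividend >= divisor */
    distance = (int) msb_index_sketch(dividend)
             - (int) msb_index_sketch(divisor);
    divisor <<= distance;           /* align the leading '1' bits */

    do
    {   /* one quotient bit per iteration, most significant bit first */
        quotient <<= 1;

        if (dividend >= divisor)
        {
            dividend -= divisor;
            quotient |= 1;
        }

        divisor >>= 1;
    } while (--distance >= 0);

    if (remainder != 0)
        *remainder = dividend;

    return quotient;
}
```

The loop runs distance + 1 times, i.e. only as often as the quotient has significant bits, which is what makes this variant faster than a fixed 64-iteration loop for operands of similar magnitude.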

Implementation for i386 Processors

Prototypes for the __udivmoddi4(), __udivdi3(), __umoddi3(), __divmoddi4(), __divdi3() and __moddi3() functions:

// Copyleft © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

uint64_t __udivmoddi4(uint64_t dividend, uint64_t divisor, uint64_t *remainder);
uint64_t __udivdi3(uint64_t dividend, uint64_t divisor);
uint64_t __umoddi3(uint64_t dividend, uint64_t divisor);

int64_t __divmoddi4(int64_t dividend, int64_t divisor, int64_t *remainder);
int64_t __divdi3(int64_t dividend, int64_t divisor);
int64_t __moddi3(int64_t dividend, int64_t divisor);

The suffixes di4 and di3 encode the operand size and the operand count: di (double integer) denotes an 8-byte QWORD, i.e. a 64-bit integer such as uint64_t or int64_t, and the digit counts the arguments plus the return value.
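
How the six prototypes relate to each other can be sketched in portable C. This is a sketch under assumptions, not the implementation presented here: all names carry the hypothetical suffix _sketch, the unsigned routine simply uses the / and % operators of C, the divisor must not be 0, and the conversion of values above INT64_MAX back to int64_t assumes the usual two's complement behaviour.

```c
#include <stdint.h>

/* "di4": 3 arguments + return value; precondition: divisor != 0 */
static uint64_t udivmoddi4_sketch(uint64_t dividend, uint64_t divisor,
                                  uint64_t *remainder)
{
    if (remainder != 0)
        *remainder = dividend % divisor;

    return dividend / divisor;
}

/* signed division on top of the unsigned one: the quotient truncates
   towards 0, the remainder takes the sign of the dividend */
static int64_t divmoddi4_sketch(int64_t dividend, int64_t divisor,
                                int64_t *remainder)
{
    uint64_t n = dividend < 0 ? 0 - (uint64_t) dividend : (uint64_t) dividend;
    uint64_t d = divisor < 0 ? 0 - (uint64_t) divisor : (uint64_t) divisor;
    uint64_t r, q = udivmoddi4_sketch(n, d, &r);

    if ((dividend < 0) != (divisor < 0))
        q = 0 - q;                  /* signs differ: negative quotient */

    if (dividend < 0)
        r = 0 - r;                  /* remainder follows the dividend */

    if (remainder != 0)
        *remainder = (int64_t) r;

    return (int64_t) q;
}

/* "di3": 2 arguments + return value */
static int64_t divdi3_sketch(int64_t dividend, int64_t divisor)
{
    return divmoddi4_sketch(dividend, divisor, 0);
}

static int64_t moddi3_sketch(int64_t dividend, int64_t divisor)
{
    int64_t remainder;

    divmoddi4_sketch(dividend, divisor, &remainder);
    return remainder;
}
```

Working on the magnitudes as uint64_t avoids the signed overflow which the negation of INT64_MIN would otherwise cause.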

Note: the compiler helper routines for the Microsoft Visual C compiler use non-standard calling or naming conventions and therefore cannot be prototyped; they are for internal use by the compiler only!

Note: the other code following here supports the common, so-called cdecl calling and naming convention used by C compilers on Linux®, Unix, Windows™, plus other operating systems, and runs on 35-year-old Intel® 80386 microprocessors.

Note: the branch-free code paths, which are assembled when the macro JCCLESS is defined, actually yield lower performance!

64÷64-bit Unsigned Integer Division (64-bit Quotient and Remainder)

__udivmoddi4() function for i386 processors:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

; NOTE: raises "division exception" when divisor is 0!

	.386
	.model	flat, C
	.code
				; [esp+20] = (optional) qword ptr remainder
				; [esp+16] = high dword of divisor
				; [esp+12] = low dword of divisor
				; [esp+8] = high dword of dividend
				; [esp+4] = low dword of dividend
__udivmoddi4 proc public

	push	ebx
	mov	ebx, [esp+20]	; ebx = high dword of divisor
	mov	edx, [esp+12]	; edx = high dword of dividend
	cmp	edx, ebx
	jb	short trivial	; (high dword of) dividend < (high dword of) divisor?

	bsr	eax, ebx	; eax = index of most significant '1' bit
				;        in high dword of divisor
	jz	short simple	; high dword of divisor = 0?

	; dividend >= divisor >= 2**32: quotient < 2**32

	bsr	ecx, edx	; ecx = index of most significant '1' bit
				;        in high dword of dividend
;;	jz	short trivial	; dividend < 2**32 (<= divisor)?

	; perform "shift & subtract" alias "binary long" division

	sub	ecx, eax	; ecx = distance of leading '1' bits
;;	jb	short trivial	; dividend < divisor?

	mov	eax, [esp+16]	; eax = low dword of divisor
	shld	ebx, eax, cl
	shl	eax, cl		; ebx:eax = divisor'

	push	esi
	push	edi
	mov	esi, [esp+16]	; edx:esi = dividend

	push	ebp
	xor	ebp, ebp	; ebp = quotient = 0
next:
	mov	edi, edx	; edi = high dword of dividend
	cmp	esi, eax
	sbb	edi, ebx
	jb	short @f	; dividend < divisor'?

	sub	esi, eax
	sbb	edx, ebx	; edx:esi = dividend - divisor'
				;         = dividend'
@@:
	cmc			; CF = (dividend >= divisor')
	adc	ebp, ebp	; ebp = quotient << 1
				;     + dividend >= divisor'
				;     = quotient'
if 0
	shrd	eax, ebx, 1
	shr	ebx, 1		; ebx:eax = divisor' >> 1
				;         = divisor"
else
	shr	ebx, 1
	rcr	eax, 1		; ebx:eax = divisor' >> 1
				;         = divisor"
endif
	dec	ecx
	jns	short next

	mov	ecx, [esp+36]	; ecx = address of remainder
	test	ecx, ecx
	jz	short @f	; address of remainder = 0?

	mov	[ecx+4], edx
	mov	[ecx], esi	; [ecx] = remainder
@@:
	xor	edx, edx
	mov	eax, ebp	; edx:eax = quotient

	pop	ebp
	pop	edi
	pop	esi
	pop	ebx
	ret

	; dividend < (2**32 <=) divisor: quotient = 0, remainder = dividend
trivial:
	mov	ecx, [esp+24]	; ecx = address of remainder
	test	ecx, ecx
	jz	short @f	; address of remainder = 0?

	mov	eax, [esp+8]	; eax = low dword of dividend,
				; edx:eax = remainder
	mov	[ecx+4], edx
	mov	[ecx], eax	; [ecx] = remainder
@@:
	xor	edx, edx
	xor	eax, eax	; edx:eax = quotient = 0

	pop	ebx
	ret

	; remainder < divisor < 2**32
simple:
	mov	ecx, [esp+16]	; ecx = (low dword of) divisor
	cmp	ecx, edx
	ja	short normal	; divisor > high dword of dividend?

	; perform "long" alias "schoolbook" division
long:
	mov	eax, edx	; eax = high dword of dividend
	mov	edx, ebx	; edx = 0,
				; edx:eax = high dword of dividend
	div	ecx		; eax = high dword of quotient,
				; edx = high dword of remainder'
	push	eax		; [esp] = high dword of quotient
	mov	eax, [esp+12]	; eax = low dword of dividend
	div	ecx		; eax = low dword of quotient,
				; edx = (low dword of) remainder

	mov	ecx, [esp+28]	; ecx = address of remainder
	test	ecx, ecx
	jz	short @f	; address of remainder = 0?

	mov	[ecx+4], ebx	; [ecx+4] = 0 = high dword of remainder
	mov	[ecx], edx	; [ecx] = (low dword of) remainder
@@:
	pop	edx		; edx:eax = quotient
	pop	ebx
	ret

	; perform normal division
normal:
	mov	eax, [esp+8]	; edx:eax = dividend
	div	ecx		; eax = (low dword of) quotient,
				; edx = (low dword of) remainder

	mov	ecx, [esp+24]	; ecx = address of remainder
	test	ecx, ecx
	jz	short @f	; address of remainder = 0?

	mov	[ecx+4], ebx	; [ecx+4] = 0 = high dword of remainder
	mov	[ecx], edx	; [ecx] = (low dword of) remainder
@@:
	mov	edx, ebx	; edx = 0,
				; edx:eax = quotient
	pop	ebx
	ret

__udivmoddi4 endp
	end
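
For reference, the "shift & subtract" alias "binary long" division loop of the routine above can be modelled in C. This is a minimal sketch, not the author's code: the names shift_subtract_udivmod() and msb_index() are hypothetical, the divisor is assumed to be non-zero, and unlike the assembly routine the sketch does not split off the cases handled by the DIV instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: index of the most significant '1' bit (0 for x < 2). */
static int msb_index(uint64_t x)
{
    int n = 0;
    while (x >>= 1)
        n++;
    return n;
}

/* Hypothetical model of binary long division; assumes divisor != 0. */
uint64_t shift_subtract_udivmod(uint64_t dividend, uint64_t divisor,
                                uint64_t *remainder)
{
    uint64_t quotient = 0;

    if (dividend >= divisor) {
        /* distance of the leading '1' bits */
        int shift = msb_index(dividend) - msb_index(divisor);

        divisor <<= shift;              /* divisor' */
        do {
            quotient <<= 1;
            if (dividend >= divisor) {  /* dividend >= divisor'? */
                dividend -= divisor;    /* dividend' */
                quotient |= 1;
            }
            divisor >>= 1;              /* divisor" */
        } while (--shift >= 0);
    }
    if (remainder != NULL)
        *remainder = dividend;
    return quotient;
}
```

The loop runs once per bit position between the leading '1' bits of dividend and divisor, which is why the assembly routine reserves it for the case the DIV instruction cannot handle directly.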

# Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

# Common "cdecl" calling and naming convention for i386 platform:
# - arguments are pushed on stack in reverse order (from right to left),
#   4-byte aligned;
# - 64-bit integer arguments are passed as pair of 32-bit integer arguments,
#   low part below high part;
# - 64-bit integer result is returned in registers EAX (low part) and
#   EDX (high part);
# - 32-bit integer or pointer result is returned in register EAX;
# - registers EAX, ECX and EDX are volatile and can be clobbered;
# - registers EBX, ESP, EBP, ESI and EDI must be preserved;
# - function names are prefixed with an underscore.

# NOTE: raises "division exception" when divisor is 0!

.file	"udivmoddi4.s"
.arch	generic32
.code32
.intel_syntax noprefix
.text
				# [esp+20] = (optional) qword ptr remainder
				# [esp+16] = high dword of divisor
				# [esp+12] = low dword of divisor
				# [esp+8] = high dword of dividend
				# [esp+4] = low dword of dividend
__udivmoddi4:
	mov	ecx, [esp+8]	# ecx = high dword of dividend
	mov	eax, [esp+12]
	mov	edx, [esp+16]	# edx:eax = divisor
	cmp	[esp+4], eax
	sbb	ecx, edx
	jb	.trivial	# dividend < divisor?

	bsr	ecx, edx	# ecx = index of most significant '1' bit
				#        in high dword of divisor
	jnz	.extended	# high dword of divisor <> 0?

	# remainder < divisor < 2**32

	mov	ecx, eax	# ecx = (low dword of) divisor
	mov	eax, [esp+8]	# eax = high dword of dividend
	cmp	eax, ecx
	jae	.long		# high dword of dividend >= divisor?

	# perform normal division
.normal:
	mov	edx, eax	# edx = high dword of dividend
	mov	eax, [esp+4]	# edx:eax = dividend
	div	ecx		# eax = (low dword of) quotient,
				# edx = (low dword of) remainder

	mov	ecx, [esp+20]	# ecx = address of remainder
	test	ecx, ecx
	jz	0f		# address of remainder = 0?

	mov	[ecx], edx	# [ecx] = (low dword of) remainder
	xor	edx, edx
	mov	[ecx+4], edx
0:
	xor	edx, edx	# edx:eax = quotient
	ret

	# perform "long" alias "schoolbook" division
.long:
#	xor	edx, edx	# edx:eax = high dword of dividend
	div	ecx		# eax = high dword of quotient,
				# edx = high dword of remainder'
	push	eax		# [esp] = high dword of quotient
	mov	eax, [esp+8]	# eax = low dword of dividend
	div	ecx		# eax = low dword of quotient,
				# edx = (low dword of) remainder

	mov	ecx, [esp+24]	# ecx = address of remainder
	test	ecx, ecx
	jz	1f		# address of remainder = 0?

	mov	[ecx], edx	# [ecx] = (low dword of) remainder
	xor	edx, edx
	mov	[ecx+4], edx
1:
	pop	edx		# edx:eax = quotient
	ret

	# dividend < divisor: quotient = 0, remainder = dividend
.trivial:
	mov	ecx, [esp+20]	# ecx = address of remainder
	test	ecx, ecx
	jz	2f		# address of remainder = 0?

	mov	eax, [esp+4]
	mov	edx, [esp+8]	# edx:eax = dividend
	mov	[ecx], eax
	mov	[ecx+4], edx	# [ecx] = remainder = dividend
2:
	xor	eax, eax
	xor	edx, edx	# edx:eax = quotient = 0
	ret

	# dividend >= divisor >= 2**32: quotient < 2**32
.extended:
	xor	ecx, 31		# ecx = number of leading '0' bits
				#        in (high dword of) divisor
	jz	.special	# divisor >= 2**63?

	# perform "extended & adjusted" division

	shld	edx, eax, cl	# edx = divisor / 2**(index + 1)
				#     = divisor'
#	shl	eax, cl
	push	ebx
	mov	ebx, edx	# ebx = divisor'
.ifnotdef JCCLESS
	xor	eax, eax	# eax = high dword of quotient' = 0
	mov	edx, [esp+12]	# edx = high dword of dividend
	cmp	edx, ebx
	jb	3f		# high dword of dividend < divisor'?

	# high dword of dividend >= divisor':
	# subtract divisor' from high dword of dividend to prevent possible
	# quotient overflow and set most significant bit of quotient"

	sub	edx, ebx	# edx = high dword of dividend - divisor'
				#     = high dword of dividend'
	inc	eax		# eax = high dword of quotient' = 1
3:
	push	eax		# [esp] = high dword of quotient'
.else # JCCLESS
	mov	edx, [esp+12]	# edx = high dword of dividend
	cmp	edx, ebx	# CF = (high dword of dividend < divisor')
	sbb	eax, eax	# eax = (high dword of dividend < divisor') ? -1 : 0
	inc	eax		# eax = (high dword of dividend < divisor') ? 0 : 1
				#     = high dword of quotient'
	push	eax		# [esp] = high dword of quotient'
.if 0
	neg	eax		# eax = (high dword of dividend < divisor') ? 0 : -1
	and	eax, ebx	# eax = (high dword of dividend < divisor') ? 0 : divisor'
.else
	imul	eax, ebx	# eax = (high dword of dividend < divisor') ? 0 : divisor'
.endif
	sub	edx, eax	# edx = high dword of dividend
				#     - (high dword of dividend < divisor') ? 0 : divisor'
				#     = high dword of dividend'
.endif # JCCLESS
	mov	eax, [esp+12]	# eax = low dword of dividend
				#     = low dword of dividend'
	div	ebx		# eax = dividend' / divisor'
				#     = low dword of quotient',
				# edx = remainder'
	pop	ebx		# ebx = high dword of quotient'
	shld	ebx, eax, cl	# ebx = quotient' / 2**(index + 1)
				#     = dividend / divisor'
				#     = quotient"
#	shl	eax, cl
	push	ebx		# [esp] = quotient"
	mov	eax, [esp+20]	# eax = low dword of divisor
	mul	ebx		# edx:eax = low dword of divisor * quotient"
	imul	ebx, [esp+24]	# ebx = high dword of divisor * quotient"
	mov	ecx, [esp+16]	# ecx = high dword of dividend
	sub	ecx, ebx	# ecx = high dword of dividend
				#     - high dword of divisor * quotient"
	mov	ebx, [esp+12]	# ebx = low dword of dividend
	sub	ebx, eax
	sbb	ecx, edx	# ecx:ebx = dividend - divisor * quotient"
				#         = remainder"
.ifnotdef JCCLESS
	pop	eax		# eax = quotient"
	jnb	4f		# remainder" >= 0?
				#  (with borrow, it is off by divisor,
				#   and quotient" is off by 1)
	add	ebx, [esp+16]
	adc	ecx, [esp+20]	# ecx:ebx = remainder" + divisor
				#         = remainder
	dec	eax		# eax = quotient" - 1
				#     = quotient
4:
.else # JCCLESS
	sbb	eax, eax	# eax = (remainder" < 0) ? -1 : 0
	cdq			# edx = (remainder" < 0) ? -1 : 0
	add	[esp], eax	# [esp] = quotient" - (remainder" < 0)
				#       = (low dword of) quotient
	and	eax, [esp+20]
	and	edx, [esp+24]	# edx:eax = (remainder" < 0) ? divisor : 0
	add	ebx, eax
	adc	ecx, edx	# ecx:ebx = remainder" + divisor
				#         = remainder
	pop	eax		# eax = (low dword of) quotient
.endif # JCCLESS
	mov	edx, [esp+24]	# edx = address of remainder
	test	edx, edx
	jz	5f		# address of remainder = 0?

	mov	[edx], ebx
	mov	[edx+4], ecx	# [edx] = remainder
5:
	xor	edx, edx	# edx:eax = quotient
	pop	ebx
	ret

	# dividend >= divisor >= 2**63:
	# quotient = 1, remainder = dividend - divisor
.special:
	or	ecx, [esp+20]	# ecx = address of remainder
	jz	6f		# address of remainder = 0?
.if 0
	neg	edx
	neg	eax
	sbb	edx, 0		# edx:eax = -divisor
	add	eax, [esp+4]
	adc	edx, [esp+8]	# edx:eax = dividend - divisor
				#         = remainder
.else
	mov	eax, [esp+4]
	mov	edx, [esp+8]	# edx:eax = dividend
	sub	eax, [esp+12]
	sbb	edx, [esp+16]	# edx:eax = dividend - divisor
				#         = remainder
.endif
	mov	[ecx], eax
	mov	[ecx+4], edx	# [ecx] = remainder
6:
	xor	eax, eax
	xor	edx, edx
	inc	eax		# edx:eax = quotient = 1
	ret

.size	__udivmoddi4, .-__udivmoddi4
.type	__udivmoddi4, @function
.global	__udivmoddi4
.end
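
The "extended & adjusted" division used above for divisors of 2**32 or more can be sketched in C as follows. This is a hypothetical model under stated assumptions, not the author's implementation: extended_udivmod() and leading_zeros32() are made-up names, the caller must guarantee dividend >= divisor >= 2**32, and the 64÷32-bit DIV step is written as a plain 64-bit division.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of leading '0' bits; assumes x != 0. */
static int leading_zeros32(uint32_t x)
{
    int n = 0;
    while (!(x & 0x80000000u)) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Hypothetical model; assumes dividend >= divisor >= 2**32. */
uint32_t extended_udivmod(uint64_t dividend, uint64_t divisor,
                          uint64_t *remainder)
{
    int shift = leading_zeros32((uint32_t)(divisor >> 32));
    uint64_t r;
    uint32_t quotient;

    if (shift == 0) {                   /* divisor >= 2**63: quotient = 1 */
        r = dividend - divisor;
        quotient = 1;
    } else {
        /* divisor' = upper 32 bits of the normalised divisor */
        uint32_t divisor1 = (uint32_t)((divisor << shift) >> 32);
        /* quotient' = dividend / divisor' (needs up to 33 bits) */
        uint64_t quotient1 = dividend / divisor1;
        /* quotient" = estimate, equal to the quotient or 1 too large */
        quotient = (uint32_t)(quotient1 >> (32 - shift));
        /* remainder" wraps modulo 2**64 when the estimate is too large */
        r = dividend - divisor * quotient;
        if (r >= divisor) {             /* wrapped; divisor < 2**63 here */
            r += divisor;
            quotient--;
        }
    }
    if (remainder != NULL)
        *remainder = r;
    return quotient;
}
```

The test r >= divisor stands in for the carry flag that the assembly routine evaluates: it is only valid because divisors of 2**63 or more take the separate branch.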

Microsoft Visual C compiler helper routine _aulldvrm() for i386 processors:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

; "stdcall" calling and naming convention for i386 platform:
; - arguments are pushed on stack in reverse order (from right to left),
;   4-byte aligned;
; - 64-bit integer arguments are passed as pair of 32-bit integer arguments,
;   low part below high part;
; - 64-bit integer result is returned in registers EAX (low part) and
;   EDX (high part);
; - 32-bit integer or pointer result is returned in register EAX;
; - registers EAX, ECX and EDX are volatile and can be clobbered;
; - registers EBX, EBP, ESI and EDI must be preserved;
; - register ESP (the stack pointer) must be restored by callee;
; - function names are prefixed with an underscore and suffixed with an
;   at-sign, followed by the total number of bytes for all arguments.

; NOTE: returns quotient in EDX:EAX and remainder in EBX:ECX

; NOTE: raises "division exception" when divisor is 0!

	.386
	.model	flat, C
	.code
				; [esp+16] = high dword of divisor
				; [esp+12] = low dword of divisor
				; [esp+8] = high dword of dividend
				; [esp+4] = low dword of dividend
_aulldvrm proc	public

	mov	ecx, [esp+8]	; ecx = high dword of dividend
	mov	eax, [esp+12]
	mov	edx, [esp+16]	; edx:eax = divisor
	cmp	[esp+4], eax
	sbb	ecx, edx
	jb	short trivial	; dividend < divisor?

	bsr	ecx, edx	; ecx = index of most significant '1' bit
				;        in high dword of divisor
	jnz	short extended	; high dword of divisor <> 0?

	; remainder < divisor < 2**32

	mov	ecx, eax	; ecx = (low dword of) divisor
	mov	eax, [esp+8]	; eax = high dword of dividend
	cmp	eax, ecx
	jae	short long	; high dword of dividend >= divisor?

	; perform normal division
normal:
	mov	edx, eax	; edx = high dword of dividend
	mov	eax, [esp+4]	; edx:eax = dividend
	div	ecx		; eax = (low dword of) quotient,
				; edx = (low dword of) remainder
	mov	ecx, edx	; ecx = (low dword of) remainder
	xor	ebx, ebx	; ebx:ecx = remainder
	xor	edx, edx	; edx:eax = quotient

	ret	16		; callee restores stack

	; perform "long" alias "schoolbook" division
long:
;;	xor	edx, edx	; edx:eax = high dword of dividend
	div	ecx		; eax = high dword of quotient,
				; edx = high dword of remainder'
	mov	ebx, eax	; ebx = high dword of quotient
	mov	eax, [esp+4]	; eax = low dword of dividend
	div	ecx		; eax = low dword of quotient,
				; edx = (low dword of) remainder
	mov	ecx, edx	; ecx = (low dword of) remainder
	mov	edx, ebx	; edx:eax = quotient
	xor	ebx, ebx	; ebx:ecx = remainder

	ret	16		; callee restores stack

	; dividend < divisor: quotient = 0, remainder = dividend
trivial:
	mov	ecx, [esp+4]
	mov	ebx, [esp+8]	; ebx:ecx = remainder = dividend
	xor	eax, eax
	xor	edx, edx	; edx:eax = quotient = 0

	ret	16		; callee restores stack

	; dividend >= divisor >= 2**32: quotient < 2**32
extended:
	xor	ecx, 31		; ecx = number of leading '0' bits
				;        in (high dword of) divisor
	jz	short special	; divisor >= 2**63?

	; perform "extended & adjusted" division

	shld	edx, eax, cl	; edx = divisor / 2**(index + 1)
				;     = divisor'
;;	shl	eax, cl
	mov	ebx, edx	; ebx = divisor'
ifndef JccLess
	xor	eax, eax	; eax = high dword of quotient' = 0
	mov	edx, [esp+8]	; edx = high dword of dividend
	cmp	edx, ebx
	jb	short @f	; high dword of dividend < divisor'?

	; high dword of dividend >= divisor':
	; subtract divisor' from high dword of dividend to prevent possible
	; division overflow and set most significant bit of quotient"

	sub	edx, ebx	; edx = high dword of dividend - divisor'
				;     = high dword of dividend'
	inc	eax		; eax = high dword of quotient' = 1
@@:
	push	eax		; [esp] = high dword of quotient'
else ; JccLess
	mov	edx, [esp+8]	; edx = high dword of dividend
	cmp	edx, ebx	; CF = (high dword of dividend < divisor')
	sbb	eax, eax	; eax = (high dword of dividend < divisor') ? -1 : 0
	inc	eax		; eax = (high dword of dividend < divisor') ? 0 : 1
				;     = high dword of quotient'
	push	eax		; [esp] = high dword of quotient'
if 0
	neg	eax		; eax = (high dword of dividend < divisor') ? 0 : -1
	and	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
else
	imul	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
endif
	sub	edx, eax	; edx = high dword of dividend
				;     - (high dword of dividend < divisor') ? 0 : divisor'
				;     = high dword of dividend'
endif ; JccLess
	mov	eax, [esp+8]	; eax = low dword of dividend
				;     = low dword of dividend'
	div	ebx		; eax = dividend' / divisor'
				;     = low dword of quotient',
				; edx = remainder'
	pop	ebx		; ebx = high dword of quotient'
	shld	ebx, eax, cl	; ebx = quotient' / 2**(index + 1)
				;     = dividend / divisor'
				;     = quotient"
;;	shl	eax, cl
	mov	eax, [esp+12]	; eax = low dword of divisor
	mul	ebx		; edx:eax = low dword of divisor * quotient"
	mov	ecx, [esp+16]	; ecx = high dword of divisor
	imul	ecx, ebx	; ecx = high dword of divisor * quotient"
	push	ebx		; [esp] = quotient"
	mov	ebx, [esp+12]	; ebx = high dword of dividend
	sub	ebx, ecx	; ebx = high dword of dividend
				;     - high dword of divisor * quotient"
	mov	ecx, [esp+8]	; ecx = low dword of dividend
	sub	ecx, eax
	sbb	ebx, edx	; ebx:ecx = dividend - divisor * quotient"
				;         = remainder"
ifndef JccLess
	pop	eax		; eax = quotient"
	jnb	short @f	; remainder" >= 0?
				;  (with borrow, it is off by divisor,
				;   and quotient" is off by 1)
	add	ecx, [esp+12]
	adc	ebx, [esp+16]	; ebx:ecx = remainder" + divisor
				;         = remainder
	dec	eax		; eax = quotient" - 1
				;     = (low dword of) quotient
@@:
else ; JccLess
	sbb	eax, eax	; eax = (remainder" < 0) ? -1 : 0
	cdq			; edx = (remainder" < 0) ? -1 : 0
	add	[esp], eax	; [esp] = quotient" - (remainder" < 0)
				;       = (low dword of) quotient
	and	eax, [esp+16]
	and	edx, [esp+20]	; edx:eax = (remainder" < 0) ? divisor : 0
	add	ecx, eax
	adc	ebx, edx	; ebx:ecx = remainder" + divisor
				;         = remainder
	pop	eax		; eax = (low dword of) quotient
endif ; JccLess
	xor	edx, edx	; edx:eax = quotient

	ret	16		; callee restores stack

	; dividend >= divisor >= 2**63:
	; quotient = 1, remainder = dividend - divisor
special:
	mov	ecx, [esp+4]
	mov	ebx, [esp+8]	; ebx:ecx = dividend
	sub	ecx, eax
	sbb	ebx, edx	; ebx:ecx = dividend - divisor
				;         = remainder
	xor	eax, eax
	xor	edx, edx
	inc	eax		; edx:eax = quotient = 1

	ret	16		; callee restores stack

_aulldvrm endp
	end
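
The "long" alias "schoolbook" branch of the routines above chains two 64÷32-bit divisions when the divisor is less than 2**32. A minimal C sketch, with the hypothetical name long_udivmod() and the assumption 0 < divisor < 2**32:

```c
#include <stdint.h>

/* Hypothetical model of the two chained DIV steps; assumes divisor != 0. */
uint64_t long_udivmod(uint64_t dividend, uint32_t divisor,
                      uint32_t *remainder)
{
    uint32_t high = (uint32_t)(dividend >> 32);
    uint32_t low = (uint32_t)dividend;

    /* first step: 0:high / divisor */
    uint32_t quotient_high = high / divisor;
    uint32_t remainder_high = high % divisor;   /* remainder' < divisor */

    /* second step: remainder':low / divisor; the quotient fits in
       32 bits because remainder' < divisor */
    uint64_t partial = ((uint64_t)remainder_high << 32) | low;
    uint32_t quotient_low = (uint32_t)(partial / divisor);

    *remainder = (uint32_t)(partial % divisor);
    return ((uint64_t)quotient_high << 32) | quotient_low;
}
```

When the high dword of the dividend is already less than the divisor, the first step yields 0, which is the case the "normal" branch handles with a single division.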

64÷64-bit Unsigned Integer Division (64-bit Quotient)

__udivdi3() function for i386 processors:

# Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

# NOTE: raises "division exception" when divisor is 0!

.file	"udivdi3.s"
.arch	generic32
.code32
.intel_syntax noprefix
.text
				# [esp+16] = high dword of divisor
				# [esp+12] = low dword of divisor
				# [esp+8] = high dword of dividend
				# [esp+4] = low dword of dividend
__udivdi3:
	mov	ecx, [esp+8]	# ecx = high dword of dividend
	mov	eax, [esp+12]
	mov	edx, [esp+16]	# edx:eax = divisor
	cmp	[esp+4], eax
	sbb	ecx, edx
	jb	.trivial	# dividend < divisor?

	bsr	ecx, edx	# ecx = index of most significant '1' bit
				#        in high dword of divisor
	jnz	.extended	# high dword of divisor <> 0?

	# remainder < divisor < 2**32

	mov	ecx, eax	# ecx = (low dword of) divisor
	mov	eax, [esp+8]	# eax = high dword of dividend
	cmp	eax, ecx
	jae	.long		# high dword of dividend >= divisor?

	# perform normal division
.normal:
	mov	edx, eax	# edx = high dword of dividend
	mov	eax, [esp+4]	# edx:eax = dividend
	div	ecx		# eax = (low dword of) quotient,
				# edx = (low dword of) remainder
	xor	edx, edx	# edx:eax = quotient
	ret

	# perform "long" alias "schoolbook" division
.long:
#	xor	edx, edx	# edx:eax = high dword of dividend
	div	ecx		# eax = high dword of quotient,
				# edx = high dword of remainder'
	push	eax		# [esp] = high dword of quotient
	mov	eax, [esp+8]	# eax = low dword of dividend
	div	ecx		# eax = low dword of quotient,
				# edx = (low dword of) remainder
	pop	edx		# edx:eax = quotient
	ret

	# dividend < divisor: quotient = 0
.trivial:
	xor	eax, eax
	xor	edx, edx	# edx:eax = quotient = 0
	ret

	# dividend >= divisor >= 2**32: quotient < 2**32
.extended:
	xor	ecx, 31		# ecx = number of leading '0' bits
				#        in (high dword of) divisor
	jz	.special	# divisor >= 2**63?

	# perform "extended & adjusted" division

	shld	edx, eax, cl	# edx = divisor / 2**(index + 1)
				#     = divisor'
#	shl	eax, cl
	push	ebx
	mov	ebx, edx	# ebx = divisor'
.ifnotdef JCCLESS
	xor	eax, eax	# eax = high dword of quotient' = 0
	mov	edx, [esp+12]	# edx = high dword of dividend
	cmp	edx, ebx
	jb	0f		# high dword of dividend < divisor'?

	# high dword of dividend >= divisor':
	# subtract divisor' from high dword of dividend to prevent possible
	# quotient overflow and set most significant bit of quotient"

	sub	edx, ebx	# edx = high dword of dividend - divisor'
				#     = high dword of dividend'
	inc	eax		# eax = high dword of quotient' = 1
0:
	push	eax		# [esp] = high dword of quotient'
.else # JCCLESS
	mov	edx, [esp+12]	# edx = high dword of dividend
	cmp	edx, ebx	# CF = (high dword of dividend < divisor')
	sbb	eax, eax	# eax = (high dword of dividend < divisor') ? -1 : 0
	inc	eax		# eax = (high dword of dividend < divisor') ? 0 : 1
				#     = high dword of quotient'
	push	eax		# [esp] = high dword of quotient'
.if 0
	neg	eax		# eax = (high dword of dividend < divisor') ? 0 : -1
	and	eax, ebx	# eax = (high dword of dividend < divisor') ? 0 : divisor'
.else
	imul	eax, ebx	# eax = (high dword of dividend < divisor') ? 0 : divisor'
.endif
	sub	edx, eax	# edx = high dword of dividend
				#     - (high dword of dividend < divisor') ? 0 : divisor'
				#     = high dword of dividend'
.endif # JCCLESS
	mov	eax, [esp+12]	# eax = low dword of dividend
				#     = low dword of dividend'
	div	ebx		# eax = dividend' / divisor'
				#     = low dword of quotient',
				# edx = remainder'
	pop	ebx		# ebx = high dword of quotient'
	shld	ebx, eax, cl	# ebx = quotient' / 2**(index + 1)
				#     = dividend / divisor'
				#     = quotient"
#	shl	eax, cl
	mov	eax, [esp+16]	# eax = low dword of divisor
	mul	ebx		# edx:eax = low dword of divisor * quotient"
.ifnotdef JCCLESS
	mov	ecx, [esp+20]	# ecx = high dword of divisor
	imul	ecx, ebx	# ecx = high dword of divisor * quotient"
	add	edx, ecx	# edx:eax = divisor * quotient"
	jc	1f		# divisor * quotient" >= 2**64?

	mov	ecx, [esp+12]	# ecx = high dword of dividend
	cmp	[esp+8], eax
	sbb	ecx, edx	# CF = (dividend < divisor * quotient")
				#    = (remainder" < 0)
1:
	sbb	eax, eax	# eax = (remainder" < 0) ? -1 : 0
	add	eax, ebx	# eax = quotient" - (remainder" < 0)
				#     = (low dword of) quotient
	xor	edx, edx	# edx:eax = quotient
.else # JCCLESS
	mov	ecx, [esp+12]	# ecx = high dword of dividend
	cmp	[esp+8], eax
	sbb	ecx, edx	# ecx:... = dividend
				#         - low dword of divisor * quotient"
	mov	eax, [esp+20]	# eax = high dword of divisor
	imul	eax, ebx	# eax = high dword of divisor * quotient"
.if 0
	sub	ecx, eax	# ecx:... = dividend - divisor * quotient"
				#         = remainder"
	sbb	eax, eax	# eax = (remainder" < 0) ? -1 : 0
	add	eax, ebx	# eax = quotient" - (remainder" < 0)
				#     = (low dword of) quotient
	xor	edx, edx	# edx:eax = quotient
.else
	xor	edx, edx	# edx = high dword of quotient = 0
	sub	ecx, eax	# ecx:... = dividend - divisor * quotient"
				#         = remainder"
	mov	eax, ebx	# eax = quotient"
	sbb	eax, edx	# eax = quotient" - (remainder" < 0)
				#     = (low dword of) quotient
.endif
.endif # JCCLESS
	pop	ebx
	ret

	# dividend >= divisor >= 2**63: quotient = 1
.special:
	xor	eax, eax
	xor	edx, edx
	inc	eax		# edx:eax = quotient = 1
	ret

.size	__udivdi3, .-__udivdi3
.type	__udivdi3, @function
.global	__udivdi3
.end

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

; NOTE: raises "division exception" when divisor is 0!

	.386
	.model	flat, C
	.code
				; [esp+16] = high dword of divisor
				; [esp+12] = low dword of divisor
				; [esp+8] = high dword of dividend
				; [esp+4] = low dword of dividend
__udivdi3 proc	public

	mov	ecx, [esp+8]	; ecx = high dword of dividend
	mov	eax, [esp+12]
	mov	edx, [esp+16]	; edx:eax = divisor
	cmp	[esp+4], eax
	sbb	ecx, edx
	jb	short trivial	; dividend < divisor?

	bsr	ecx, edx	; ecx = index of most significant '1' bit
				;        in high dword of divisor
	jnz	short extended	; high dword of divisor <> 0?

	; remainder < divisor < 2**32

	mov	ecx, eax	; ecx = (low dword of) divisor
	mov	eax, [esp+8]	; eax = high dword of dividend
	cmp	eax, ecx
	jae	short long	; high dword of dividend >= divisor?

	; perform normal division
normal:
	mov	edx, eax	; edx = high dword of dividend
	mov	eax, [esp+4]	; edx:eax = dividend
	div	ecx		; eax = (low dword of) quotient,
				; edx = (low dword of) remainder
	xor	edx, edx	; edx:eax = quotient
	ret

	; perform "long" alias "schoolbook" division
long:
;;	xor	edx, edx	; edx:eax = high dword of dividend
	div	ecx		; eax = high dword of quotient,
				; edx = high dword of remainder'
	push	eax		; [esp] = high dword of quotient
	mov	eax, [esp+8]	; eax = low dword of dividend
	div	ecx		; eax = low dword of quotient,
				; edx = (low dword of) remainder
	pop	edx		; edx:eax = quotient
	ret

	; dividend < divisor: quotient = 0
trivial:
	xor	eax, eax
	xor	edx, edx	; edx:eax = quotient = 0
	ret

	; dividend >= divisor >= 2**32: quotient < 2**32
extended:
	xor	ecx, 31		; ecx = number of leading '0' bits
				;        in (high dword of) divisor
	jz	short special	; divisor >= 2**63?

	; perform "extended & adjusted" division

	shld	edx, eax, cl	; edx = divisor / 2**(index + 1)
				;     = divisor'
;;	shl	eax, cl
	push	ebx
	mov	ebx, edx	; ebx = divisor'
ifndef JccLess
	xor	eax, eax	; eax = high dword of quotient' = 0
	mov	edx, [esp+12]	; edx = high dword of dividend
	cmp	edx, ebx
	jb	short @f	; high dword of dividend < divisor'?

	; high dword of dividend >= divisor':
	; subtract divisor' from high dword of dividend to prevent possible
	; quotient overflow and set most significant bit of quotient"

	sub	edx, ebx	; edx = high dword of dividend - divisor'
				;     = high dword of dividend'
	inc	eax		; eax = high dword of quotient' = 1
@@:
	push	eax		; [esp] = high dword of quotient'
else ; JccLess
	mov	edx, [esp+12]	; edx = high dword of dividend
	cmp	edx, ebx	; CF = (high dword of dividend < divisor')
	sbb	eax, eax	; eax = (high dword of dividend < divisor') ? -1 : 0
	inc	eax		; eax = (high dword of dividend < divisor') ? 0 : 1
				;     = high dword of quotient'
	push	eax		; [esp] = high dword of quotient'
if 0
	neg	eax		; eax = (high dword of dividend < divisor') ? 0 : -1
	and	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
else
	imul	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
endif
	sub	edx, eax	; edx = high dword of dividend
				;     - (high dword of dividend < divisor') ? 0 : divisor'
				;     = high dword of dividend'
endif ; JccLess
	mov	eax, [esp+12]	; eax = low dword of dividend
				;     = low dword of dividend'
	div	ebx		; eax = dividend' / divisor'
				;     = low dword of quotient',
				; edx = remainder'
	pop	ebx		; ebx = high dword of quotient'
	shld	ebx, eax, cl	; ebx = quotient' / 2**(index + 1)
				;     = dividend / divisor'
				;     = quotient"
;;	shl	eax, cl
	mov	eax, [esp+16]	; eax = low dword of divisor
	mul	ebx		; edx:eax = low dword of divisor * quotient"
ifndef JccLess
	mov	ecx, [esp+20]	; ecx = high dword of divisor
	imul	ecx, ebx	; ecx = high dword of divisor * quotient"
	add	edx, ecx	; edx:eax = divisor * quotient"
	jc	short @f	; divisor * quotient" >= 2**64?

	mov	ecx, [esp+12]	; ecx = high dword of dividend
	cmp	[esp+8], eax
	sbb	ecx, edx	; CF = (dividend < divisor * quotient")
				;    = (remainder" < 0)
@@:
	sbb	eax, eax	; eax = (remainder" < 0) ? -1 : 0
	add	eax, ebx	; eax = quotient" - (remainder" < 0)
				;     = (low dword of) quotient
	xor	edx, edx	; edx:eax = quotient
else ; JccLess
	mov	ecx, [esp+12]	; ecx = high dword of dividend
	cmp	[esp+8], eax
	sbb	ecx, edx	; ecx:... = dividend
				;         - low dword of divisor * quotient"
	mov	eax, [esp+20]	; eax = high dword of divisor
	imul	eax, ebx	; eax = high dword of divisor * quotient"
if 0
	sub	ecx, eax	; ecx:... = dividend - divisor * quotient"
				;         = remainder"
	sbb	eax, eax	; eax = (remainder" < 0) ? -1 : 0
	add	eax, ebx	; eax = quotient" - (remainder" < 0)
				;     = (low dword of) quotient
	xor	edx, edx	; edx:eax = quotient
else
	xor	edx, edx	; edx = high dword of quotient = 0
	sub	ecx, eax	; ecx:... = dividend - divisor * quotient"
				;         = remainder"
	mov	eax, ebx	; eax = quotient"
	sbb	eax, edx	; eax = quotient" - (remainder" < 0)
				;     = (low dword of) quotient
endif
endif ; JccLess
	pop	ebx
	ret

	; dividend >= divisor >= 2**63: quotient = 1
special:
	xor	eax, eax
	xor	edx, edx
	inc	eax		; edx:eax = quotient = 1
	ret

__udivdi3 endp
	end
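
When only the quotient is needed, the adjustment step above does not compute the remainder; it only tests whether divisor * quotient" exceeds the dividend, either by overflowing 64 bits or by comparison. A hedged C sketch of that test, with the hypothetical name corrected_quotient(); it assumes the estimate is either the quotient or 1 too large:

```c
#include <stdint.h>

/* Hypothetical model: returns estimate, or estimate - 1 when
   divisor * estimate is greater than dividend. */
uint32_t corrected_quotient(uint64_t dividend, uint64_t divisor,
                            uint32_t estimate)
{
    /* low dword of divisor * estimate (64-bit product) */
    uint64_t low = (uint64_t)(uint32_t)divisor * estimate;
    /* bits 32 and above of the full 96-bit product */
    uint64_t high = (divisor >> 32) * estimate + (low >> 32);
    /* the lower 64 bits of the product */
    uint64_t product = (high << 32) | (uint32_t)low;

    int too_large = (high >> 32) != 0   /* product >= 2**64 > dividend */
                 || product > dividend; /* remainder" < 0 */

    return estimate - (uint32_t)too_large;
}
```

For example, the dividend 0x200000000 and the divisor 0x100000001 give the estimate 2, which this test reduces to the quotient 1.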

Microsoft Visual C compiler helper routine _aulldiv() for i386 processors:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

; NOTE: raises "division exception" when divisor is 0!

	.386
	.model	flat, C
	.code
				; [esp+16] = high dword of divisor
				; [esp+12] = low dword of divisor
				; [esp+8] = high dword of dividend
				; [esp+4] = low dword of dividend
_aulldiv proc	public

	mov	ecx, [esp+8]	; ecx = high dword of dividend
	mov	eax, [esp+12]
	mov	edx, [esp+16]	; edx:eax = divisor
	cmp	[esp+4], eax
	sbb	ecx, edx
	jb	short trivial	; dividend < divisor?

	bsr	ecx, edx	; ecx = index of most significant '1' bit
				;        in high dword of divisor
	jnz	short extended	; high dword of divisor <> 0?

	; remainder < divisor < 2**32

	mov	ecx, eax	; ecx = (low dword of) divisor
	mov	eax, [esp+8]	; eax = high dword of dividend
	cmp	eax, ecx
	jae	short long	; high dword of dividend >= divisor?

	; perform normal division
normal:
	mov	edx, eax	; edx = high dword of dividend
	mov	eax, [esp+4]	; edx:eax = dividend
	div	ecx		; eax = (low dword of) quotient,
				; edx = (low dword of) remainder
	xor	edx, edx	; edx:eax = quotient

	ret	16		; callee restores stack

	; perform "long" alias "schoolbook" division
long:
;;	xor	edx, edx	; edx:eax = high dword of dividend
	div	ecx		; eax = high dword of quotient,
				; edx = high dword of remainder'
	push	eax		; [esp] = high dword of quotient
	mov	eax, [esp+8]	; eax = low dword of dividend
	div	ecx		; eax = low dword of quotient,
				; edx = (low dword of) remainder
	pop	edx		; edx:eax = quotient

	ret	16		; callee restores stack

	; dividend < divisor: quotient = 0
trivial:
	xor	eax, eax
	xor	edx, edx	; edx:eax = quotient = 0

	ret	16		; callee restores stack

	; dividend >= divisor >= 2**32: quotient < 2**32
extended:
	xor	ecx, 31		; ecx = number of leading '0' bits
				;        in (high dword of) divisor
	jz	short special	; divisor >= 2**63?

	; perform "extended & adjusted" division

	shld	edx, eax, cl	; edx = divisor / 2**(index + 1)
				;     = divisor'
;;	shl	eax, cl
	push	ebx
	mov	ebx, edx	; ebx = divisor'
ifndef JccLess
	xor	eax, eax	; eax = high dword of quotient' = 0
	mov	edx, [esp+12]	; edx = high dword of dividend
	cmp	edx, ebx
	jb	short @f	; high dword of dividend < divisor'?

	; high dword of dividend >= divisor':
	; subtract divisor' from high dword of dividend to prevent possible
	; division overflow and set most significant bit of quotient"

	sub	edx, ebx	; edx = high dword of dividend - divisor'
				;     = high dword of dividend'
	inc	eax		; eax = high dword of quotient' = 1
@@:
	push	eax		; [esp] = high dword of quotient'
else ; JccLess
	mov	edx, [esp+12]	; edx = high dword of dividend
	cmp	edx, ebx	; CF = (high dword of dividend < divisor')
	sbb	eax, eax	; eax = (high dword of dividend < divisor') ? -1 : 0
	inc	eax		; eax = (high dword of dividend < divisor') ? 0 : 1
				;     = high dword of quotient'
	push	eax		; [esp] = high dword of quotient'
if 0
	neg	eax		; eax = (high dword of dividend < divisor') ? 0 : -1
	and	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
else
	imul	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
endif
	sub	edx, eax	; edx = high dword of dividend
				;     - (high dword of dividend < divisor') ? 0 : divisor'
				;     = high dword of dividend'
endif ; JccLess
	mov	eax, [esp+12]	; eax = low dword of dividend
				;     = low dword of dividend'
	div	ebx		; eax = dividend' / divisor'
				;     = low dword of quotient',
				; edx = remainder'
	pop	ebx		; ebx = high dword of quotient'
	shld	ebx, eax, cl	; ebx = quotient' / 2**(index + 1)
				;     = dividend / divisor'
				;     = quotient"
;;	shl	eax, cl
	mov	eax, [esp+16]	; eax = low dword of divisor
	mul	ebx		; edx:eax = low dword of divisor * quotient"
ifndef JccLess
	mov	ecx, [esp+20]	; ecx = high dword of divisor
	imul	ecx, ebx	; ecx = high dword of divisor * quotient"
	add	edx, ecx	; edx:eax = divisor * quotient"
	jc	short @f	; divisor * quotient" >= 2**64?

	mov	ecx, [esp+12]	; ecx = high dword of dividend
	cmp	[esp+8], eax
	sbb	ecx, edx	; CF = (dividend < divisor * quotient")
				;    = (remainder" < 0)
@@:
	sbb	eax, eax	; eax = (remainder" < 0) ? -1 : 0
	add	eax, ebx	; eax = quotient" - (remainder" < 0)
				;     = (low dword of) quotient
	xor	edx, edx	; edx:eax = quotient
else ; JccLess
	mov	ecx, [esp+12]	; ecx = high dword of dividend
	cmp	[esp+8], eax
	sbb	ecx, edx	; ecx:... = dividend
				;         - low dword of divisor * quotient"
	mov	eax, [esp+20]	; eax = high dword of divisor
	imul	eax, ebx	; eax = high dword of divisor * quotient"
if 0
	sub	ecx, eax	; ecx:... = dividend - divisor * quotient"
				;         = remainder"
	sbb	eax, eax	; eax = (remainder" < 0) ? -1 : 0
	add	eax, ebx	; eax = quotient" - (remainder" < 0)
				;     = (low dword of) quotient
	xor	edx, edx	; edx:eax = quotient
else
	xor	edx, edx	; edx = high dword of quotient = 0
	sub	ecx, eax	; ecx:... = dividend - divisor * quotient"
				;         = remainder"
	mov	eax, ebx	; eax = quotient"
	sbb	eax, edx	; eax = quotient" - (remainder" < 0)
				;     = (low dword of) quotient
endif
endif ; JccLess
	pop	ebx
	ret	16		; callee restores stack

	; dividend >= divisor >= 2**63: quotient = 1
special:
	xor	eax, eax
	xor	edx, edx
	inc	eax		; edx:eax = quotient = 1

	ret	16		; callee restores stack

_aulldiv endp
	end
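
The JccLess variants above replace the conditional jump of the adjustment step with a mask that is 0 or -1 (SBB of a register with itself), which then selects either 0 or the divisor. The same idea in C, as a sketch only: adjust_branch_free() is a hypothetical name, the divisor is assumed to be less than 2**63 so that a wrapped remainder can be recognised by comparison, and a C compiler is free to emit a branch anyway.

```c
#include <stdint.h>

/* Hypothetical model of the branch-free adjustment; *remainder holds
   dividend - divisor * estimate modulo 2**64, divisor < 2**63. */
uint32_t adjust_branch_free(uint32_t estimate, uint64_t *remainder,
                            uint64_t divisor)
{
    /* mask = (remainder" < 0) ? -1 : 0 */
    uint64_t mask = 0 - (uint64_t)(*remainder >= divisor);

    /* remainder = remainder" + ((remainder" < 0) ? divisor : 0) */
    *remainder += divisor & mask;

    /* quotient = quotient" - (remainder" < 0) */
    return estimate + (uint32_t)mask;
}
```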

64÷64-bit Unsigned Integer Division (64-bit Remainder)

__umoddi3() function for i386 processors:

# Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

# NOTE: raises "division exception" when divisor is 0!

.file	"umoddi3.s"
.arch	generic32
.code32
.intel_syntax noprefix
.text
				# [esp+16] = high dword of divisor
				# [esp+12] = low dword of divisor
				# [esp+8] = high dword of dividend
				# [esp+4] = low dword of dividend
__umoddi3:
	mov	ecx, [esp+8]	# ecx = high dword of dividend
	mov	eax, [esp+12]
	mov	edx, [esp+16]	# edx:eax = divisor
	cmp	[esp+4], eax
	sbb	ecx, edx
	jb	.trivial	# dividend < divisor?

	bsr	ecx, edx	# ecx = index of most significant '1' bit
				#        in high dword of divisor
	jnz	.extended	# high dword of divisor <> 0?

	# remainder < divisor < 2**32

	mov	ecx, eax	# ecx = (low dword of) divisor
	mov	eax, [esp+8]	# eax = high dword of dividend
	cmp	eax, ecx
	jae	.long		# high dword of dividend >= divisor?

	# perform normal division
.normal:
	mov	edx, eax	# edx = high dword of dividend
	mov	eax, [esp+4]	# edx:eax = dividend
	div	ecx		# eax = (low dword of) quotient,
				# edx = (low dword of) remainder
	mov	eax, edx	# eax = (low dword of) remainder
	xor	edx, edx	# edx:eax = remainder
	ret

	# perform "long" alias "schoolbook" division
.long:
#	xor	edx, edx	# edx:eax = high dword of dividend
	div	ecx		# eax = high dword of quotient,
				# edx = high dword of remainder'
	mov	eax, [esp+4]	# eax = low dword of dividend
	div	ecx		# eax = low dword of quotient,
				# edx = (low dword of) remainder
	mov	eax, edx	# eax = (low dword of) remainder
	xor	edx, edx	# edx:eax = remainder
	ret

	# dividend < divisor: remainder = dividend
.trivial:
	mov	eax, [esp+4]
	mov	edx, [esp+8]	# edx:eax = remainder = dividend
	ret

	# dividend >= divisor >= 2**32: quotient < 2**32
.extended:
	xor	ecx, 31		# ecx = number of leading '0' bits
				#        in (high dword of) divisor
	jz	.special	# divisor >= 2**63?

	# perform "extended & adjusted" division

	shld	edx, eax, cl	# edx = divisor / 2**(index + 1)
				#     = divisor'
#	shl	eax, cl
	push	ebx
	mov	ebx, edx	# ebx = divisor'
.ifnotdef JCCLESS
	xor	eax, eax	# eax = high dword of quotient' = 0
	mov	edx, [esp+12]	# edx = high dword of dividend
	cmp	edx, ebx
	jb	0f		# high dword of dividend < divisor'?

	# high dword of dividend >= divisor':
	# subtract divisor' from high dword of dividend to prevent possible
	# quotient overflow and set most significant bit of quotient"

	sub	edx, ebx	# edx = high dword of dividend - divisor'
				#     = high dword of dividend'
	inc	eax		# eax = high dword of quotient' = 1
0:
	push	eax		# [esp] = high dword of quotient'
.else # JCCLESS
	mov	edx, [esp+12]	# edx = high dword of dividend
	cmp	edx, ebx	# CF = (high dword of dividend < divisor')
	sbb	eax, eax	# eax = (high dword of dividend < divisor') ? -1 : 0
	inc	eax		# eax = (high dword of dividend < divisor') ? 0 : 1
				#     = high dword of quotient'
	push	eax		# [esp] = high dword of quotient'
.if 0
	neg	eax		# eax = (high dword of dividend < divisor') ? 0 : -1
	and	eax, ebx	# eax = (high dword of dividend < divisor') ? 0 : divisor'
.else
	imul	eax, ebx	# eax = (high dword of dividend < divisor') ? 0 : divisor'
.endif
	sub	edx, eax	# edx = high dword of dividend
				#     - (high dword of dividend < divisor') ? 0 : divisor'
				#     = high dword of dividend'
.endif # JCCLESS
	mov	eax, [esp+12]	# eax = low dword of dividend
				#     = low dword of dividend'
	div	ebx		# eax = dividend' / divisor'
				#     = low dword of quotient',
				# edx = remainder'
	pop	ebx		# ebx = high dword of quotient'
	shld	ebx, eax, cl	# ebx = quotient' / 2**(index + 1)
				#     = dividend / divisor'
				#     = quotient"
#	shl	eax, cl
	mov	eax, [esp+16]	# eax = low dword of divisor
	mul	ebx		# edx:eax = low dword of divisor * quotient"
	imul	ebx, [esp+20]	# ebx = high dword of divisor * quotient"
	mov	ecx, [esp+12]	# ecx = high dword of dividend
	sub	ecx, ebx	# ecx = high dword of dividend
				#     - high dword of divisor * quotient"
	mov	ebx, [esp+8]	# ebx = low dword of dividend
	sub	ebx, eax
	sbb	ecx, edx	# ecx:ebx = dividend - divisor * quotient"
				#         = remainder"
.ifnotdef JCCLESS
	jnb	1f		# remainder" >= 0?
				#  (with borrow, it is off by divisor,
				#   and quotient" is off by 1)
	add	ebx, [esp+16]
	adc	ecx, [esp+20]	# ecx:ebx = remainder" + divisor
				#         = remainder
1:
	mov	eax, ebx
	mov	edx, ecx	# edx:eax = remainder
.else # JCCLESS
	sbb	eax, eax	# eax = (remainder" < 0) ? -1 : 0
	cdq			# edx = (remainder" < 0) ? -1 : 0
	and	eax, [esp+16]
	and	edx, [esp+20]	# edx:eax = (remainder" < 0) ? divisor : 0
	add	eax, ebx
	adc	edx, ecx	# edx:eax = remainder" + divisor
				#         = remainder
.endif # JCCLESS
	pop	ebx
	ret

	# dividend >= divisor >= 2**63: remainder = dividend - divisor
.special:
.if 0
	mov	eax, [esp+4]
	mov	edx, [esp+8]	# edx:eax = dividend
	sub	eax, [esp+12]
	sbb	edx, [esp+16]	# edx:eax = dividend - divisor
				#         = remainder
.else
	neg	edx
	neg	eax
	sbb	edx, ecx	# edx:eax = -divisor
	add	eax, [esp+4]
	adc	edx, [esp+8]	# edx:eax = dividend - divisor
				#         = remainder
.endif
	ret

.size	__umoddi3, .-__umoddi3
.type	__umoddi3, @function
.global	__umoddi3
.end

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

; NOTE: raises "division exception" when divisor is 0!

	.386
	.model	flat, C
	.code
				; [esp+16] = high dword of divisor
				; [esp+12] = low dword of divisor
				; [esp+8] = high dword of dividend
				; [esp+4] = low dword of dividend
__umoddi3 proc	public

	mov	ecx, [esp+8]	; ecx = high dword of dividend
	mov	eax, [esp+12]
	mov	edx, [esp+16]	; edx:eax = divisor
	cmp	[esp+4], eax
	sbb	ecx, edx
	jb	short trivial	; dividend < divisor?

	bsr	ecx, edx	; ecx = index of most significant '1' bit
				;        in high dword of divisor
	jnz	short extended	; high dword of divisor <> 0?

	; remainder < divisor < 2**32

	mov	ecx, eax	; ecx = (low dword of) divisor
	mov	eax, [esp+8]	; eax = high dword of dividend
	cmp	eax, ecx
	jae	short long	; high dword of dividend >= divisor?

	; perform normal division
normal:
	mov	edx, eax	; edx = high dword of dividend
	mov	eax, [esp+4]	; edx:eax = dividend
	div	ecx		; eax = (low dword of) quotient,
				; edx = (low dword of) remainder
	mov	eax, edx	; eax = (low dword of) remainder
	xor	edx, edx	; edx:eax = remainder
	ret

	; perform "long" alias "schoolbook" division
long:
;;	xor	edx, edx	; edx:eax = high dword of dividend
	div	ecx		; eax = high dword of quotient,
				; edx = high dword of remainder'
	mov	eax, [esp+4]	; eax = low dword of dividend
	div	ecx		; eax = low dword of quotient,
				; edx = (low dword of) remainder
	mov	eax, edx	; eax = (low dword of) remainder
	xor	edx, edx	; edx:eax = remainder
	ret

	; dividend < divisor: remainder = dividend
trivial:
	mov	eax, [esp+4]
	mov	edx, [esp+8]	; edx:eax = remainder = dividend
	ret

	; dividend >= divisor >= 2**32: quotient < 2**32
extended:
	xor	ecx, 31		; ecx = number of leading '0' bits
				;        in (high dword of) divisor
	jz	short special	; divisor >= 2**63?

	; perform "extended & adjusted" division

	shld	edx, eax, cl	; edx = divisor / 2**(index + 1)
				;     = divisor'
;;	shl	eax, cl
	push	ebx
	mov	ebx, edx	; ebx = divisor'
ifndef JccLess
	xor	eax, eax	; eax = high dword of quotient' = 0
	mov	edx, [esp+12]	; edx = high dword of dividend
	cmp	edx, ebx
	jb	short @f	; high dword of dividend < divisor'?

	; high dword of dividend >= divisor':
	; subtract divisor' from high dword of dividend to prevent possible
	; quotient overflow and set most significant bit of quotient"

	sub	edx, ebx	; edx = high dword of dividend - divisor'
				;     = high dword of dividend'
	inc	eax		; eax = high dword of quotient' = 1
@@:
	push	eax		; [esp] = high dword of quotient'
else ; JccLess
	mov	edx, [esp+12]	; edx = high dword of dividend
	cmp	edx, ebx	; CF = (high dword of dividend < divisor')
	sbb	eax, eax	; eax = (high dword of dividend < divisor') ? -1 : 0
	inc	eax		; eax = (high dword of dividend < divisor') ? 0 : 1
				;     = high dword of quotient'
	push	eax		; [esp] = high dword of quotient'
if 0
	neg	eax		; eax = (high dword of dividend < divisor') ? 0 : -1
	and	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
else
	imul	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
endif
	sub	edx, eax	; edx = high dword of dividend
				;     - (high dword of dividend < divisor') ? 0 : divisor'
				;     = high dword of dividend'
endif ; JccLess
	mov	eax, [esp+12]	; eax = low dword of dividend
				;     = low dword of dividend'
	div	ebx		; eax = dividend' / divisor'
				;     = low dword of quotient',
				; edx = remainder'
	pop	ebx		; ebx = high dword of quotient'
	shld	ebx, eax, cl	; ebx = quotient' / 2**(index + 1)
				;     = dividend / divisor'
				;     = quotient"
;;	shl	eax, cl
	mov	eax, [esp+16]	; eax = low dword of divisor
	mul	ebx		; edx:eax = low dword of divisor * quotient"
	imul	ebx, [esp+20]	; ebx = high dword of divisor * quotient"
	mov	ecx, [esp+12]	; ecx = high dword of dividend
	sub	ecx, ebx	; ecx = high dword of dividend
				;     - high dword of divisor * quotient"
	mov	ebx, [esp+8]	; ebx = low dword of dividend
	sub	ebx, eax
	sbb	ecx, edx	; ecx:ebx = dividend - divisor * quotient"
				;         = remainder"
ifndef JccLess
	jnb	short @f	; remainder" >= 0?
				;  (with borrow, it is off by divisor,
				;   and quotient" is off by 1)
	add	ebx, [esp+16]
	adc	ecx, [esp+20]	; ecx:ebx = remainder" + divisor
				;         = remainder
@@:
	mov	eax, ebx
	mov	edx, ecx	; edx:eax = remainder
else ; JccLess
	sbb	eax, eax	; eax = (remainder" < 0) ? -1 : 0
	cdq			; edx = (remainder" < 0) ? -1 : 0
	and	eax, [esp+16]
	and	edx, [esp+20]	; edx:eax = (remainder" < 0) ? divisor : 0
	add	eax, ebx
	adc	edx, ecx	; edx:eax = remainder" + divisor
				;         = remainder
endif ; JccLess
	pop	ebx
	ret

	; dividend >= divisor >= 2**63: remainder = dividend - divisor
special:
if 0
	mov	eax, [esp+4]
	mov	edx, [esp+8]	; edx:eax = dividend
	sub	eax, [esp+12]
	sbb	edx, [esp+16]	; edx:eax = dividend - divisor
				;         = remainder
else
	neg	edx
	neg	eax
	sbb	edx, ecx	; edx:eax = -divisor
	add	eax, [esp+4]
	adc	edx, [esp+8]	; edx:eax = dividend - divisor
				;         = remainder
endif
	ret

__umoddi3 endp
	end
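
The five paths of the listings above (trivial, normal, long, extended & adjusted, special) can be cross-checked on any host against a portable model. The following C sketch is not part of the original sources: the name umod64_model is hypothetical, the leading-zero loop stands in for BSR followed by XOR 31, and the wrap test r >= divisor is an assumed substitute for the carry flag test of the assembly. It relies on quotient" being at most 1 too large, as the comments in the listings state.

```c
#include <stdint.h>

/* Hypothetical C model of the remainder routine; names are illustrative. */
uint64_t umod64_model(uint64_t dividend, uint64_t divisor)
{
    uint32_t high = (uint32_t)(divisor >> 32);

    if (dividend < divisor)             /* "trivial": remainder = dividend */
        return dividend;

    if (high == 0) {                    /* divisor < 2**32: "normal" or "long" */
        uint32_t d = (uint32_t)divisor; /* d = 0 is undefined in C; DIV traps */
        uint32_t r = (uint32_t)(dividend >> 32) % d;  /* no-op on "normal" path */
        return (((uint64_t)r << 32) | (uint32_t)dividend) % d;
    }

    if (high & 0x80000000u)             /* "special": quotient = 1 */
        return dividend - divisor;

    /* "extended & adjusted": divisor has 1 to 31 leading '0' bits */
    unsigned lz = 0;
    while (((high << lz) & 0x80000000u) == 0)
        lz++;

    uint32_t d1 = (uint32_t)((divisor << lz) >> 32);         /* divisor'  */
    uint32_t q  = (uint32_t)((dividend / d1) >> (32 - lz));  /* quotient" */
    uint64_t r  = dividend - divisor * q;    /* remainder" modulo 2**64 */

    if (r >= divisor)                   /* wrapped: quotient" was 1 too large */
        r += divisor;
    return r;
}
```

Comparing umod64_model(a, b) with a % b for a handful of operands per path is a cheap sanity check before benchmarking the assembly routines.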

Microsoft Visual C compiler helper routine _aullrem() for i386 processors:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

; NOTE: raises "division exception" when divisor is 0!

	.386
	.model	flat, C
	.code
				; [esp+16] = high dword of divisor
				; [esp+12] = low dword of divisor
				; [esp+8] = high dword of dividend
				; [esp+4] = low dword of dividend
_aullrem proc	public

	mov	ecx, [esp+8]	; ecx = high dword of dividend
	mov	eax, [esp+12]
	mov	edx, [esp+16]	; edx:eax = divisor
	cmp	[esp+4], eax
	sbb	ecx, edx
	jb	short trivial	; dividend < divisor?

	bsr	ecx, edx	; ecx = index of most significant '1' bit
				;        in high dword of divisor
	jnz	short extended	; high dword of divisor <> 0?

	; remainder < divisor < 2**32

	mov	ecx, eax	; ecx = (low dword of) divisor
	mov	eax, [esp+8]	; eax = high dword of dividend
	cmp	eax, ecx
	jae	short long	; high dword of dividend >= divisor?

	; perform normal division
normal:
	mov	edx, eax	; edx = high dword of dividend
	mov	eax, [esp+4]	; edx:eax = dividend
	div	ecx		; eax = (low dword of) quotient,
				; edx = (low dword of) remainder
	mov	eax, edx	; eax = (low dword of) remainder
	xor	edx, edx	; edx:eax = remainder

	ret	16		; callee restores stack

	; perform "long" alias "schoolbook" division
long:
;;	xor	edx, edx	; edx:eax = high dword of dividend
	div	ecx		; eax = high dword of quotient,
				; edx = high dword of remainder'
	mov	eax, [esp+4]	; eax = low dword of dividend
	div	ecx		; eax = low dword of quotient,
				; edx = (low dword of) remainder
	mov	eax, edx	; eax = (low dword of) remainder
	xor	edx, edx	; edx:eax = remainder

	ret	16		; callee restores stack

	; dividend < divisor: remainder = dividend
trivial:
	mov	eax, [esp+4]
	mov	edx, [esp+8]	; edx:eax = remainder = dividend

	ret	16		; callee restores stack

	; dividend >= divisor >= 2**32: quotient < 2**32
extended:
	xor	ecx, 31		; ecx = number of leading '0' bits
				;        in (high dword of) divisor
	jz	short special	; divisor >= 2**63?

	; perform "extended & adjusted" division

	shld	edx, eax, cl	; edx = divisor / 2**(index + 1)
				;     = divisor'
;;	shl	eax, cl
	push	ebx
	mov	ebx, edx	; ebx = divisor'
ifndef JccLess
	xor	eax, eax	; eax = high dword of quotient' = 0
	mov	edx, [esp+12]	; edx = high dword of dividend
	cmp	edx, ebx
	jb	short @f	; high dword of dividend < divisor'?

	; high dword of dividend >= divisor':
	; subtract divisor' from high dword of dividend to prevent possible
	; division overflow and set most significant bit of quotient"

	sub	edx, ebx	; edx = high dword of dividend - divisor'
				;     = high dword of dividend'
	inc	eax		; eax = high dword of quotient' = 1
@@:
	push	eax		; [esp] = high dword of quotient'
else ; JccLess
	mov	edx, [esp+12]	; edx = high dword of dividend
	cmp	edx, ebx	; CF = (high dword of dividend < divisor')
	sbb	eax, eax	; eax = (high dword of dividend < divisor') ? -1 : 0
	inc	eax		; eax = (high dword of dividend < divisor') ? 0 : 1
				;     = high dword of quotient'
	push	eax		; [esp] = high dword of quotient'
if 0
	neg	eax		; eax = (high dword of dividend < divisor') ? 0 : -1
	and	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
else
	imul	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
endif
	sub	edx, eax	; edx = high dword of dividend
				;     - (high dword of dividend < divisor') ? 0 : divisor'
				;     = high dword of dividend'
endif ; JccLess
	mov	eax, [esp+12]	; eax = low dword of dividend
				;     = low dword of dividend'
	div	ebx		; eax = dividend' / divisor'
				;     = low dword of quotient',
				; edx = remainder'
	pop	ebx		; ebx = high dword of quotient'
	shld	ebx, eax, cl	; ebx = quotient' / 2**(index + 1)
				;     = dividend / divisor'
				;     = quotient"
;;	shl	eax, cl
	mov	eax, [esp+16]	; eax = low dword of divisor
	mul	ebx		; edx:eax = low dword of divisor * quotient"
	imul	ebx, [esp+20]	; ebx = high dword of divisor * quotient"
	mov	ecx, [esp+12]	; ecx = high dword of dividend
	sub	ecx, ebx	; ecx = high dword of dividend
				;     - high dword of divisor * quotient"
	mov	ebx, [esp+8]	; ebx = low dword of dividend
	sub	ebx, eax
	sbb	ecx, edx	; ecx:ebx = dividend - divisor * quotient"
				;         = remainder"
ifndef JccLess
	jnb	short @f	; remainder" >= 0?
				;  (with borrow, it is off by divisor,
				;   and quotient" is off by 1)
	add	ebx, [esp+16]
	adc	ecx, [esp+20]	; ecx:ebx = remainder" + divisor
				;         = remainder
@@:
	mov	eax, ebx
	mov	edx, ecx	; edx:eax = remainder
else ; JccLess
	sbb	eax, eax	; eax = (remainder" < 0) ? -1 : 0
	cdq			; edx = (remainder" < 0) ? -1 : 0
	and	eax, [esp+16]
	and	edx, [esp+20]	; edx:eax = (remainder" < 0) ? divisor : 0
	add	eax, ebx
	adc	edx, ecx	; edx:eax = remainder" + divisor
				;         = remainder
endif ; JccLess
	pop	ebx
	ret	16		; callee restores stack

	; dividend >= divisor >= 2**63: remainder = dividend - divisor
special:
if 0
	mov	eax, [esp+4]
	mov	edx, [esp+8]	; edx:eax = dividend
	sub	eax, [esp+12]
	sbb	edx, [esp+16]	; edx:eax = dividend - divisor
				;         = remainder
else
	neg	edx
	neg	eax
	sbb	edx, ecx	; edx:eax = -divisor
	add	eax, [esp+4]
	adc	edx, [esp+8]	; edx:eax = dividend - divisor
				;         = remainder
endif
	ret	16		; callee restores stack

_aullrem endp
	end
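
The JccLess variants of the listings above avoid conditional jumps by turning a comparison result into a mask of all '1' or all '0' bits. A minimal C sketch of the two idioms follows; the helper names quotient_high_bit and fixup_remainder are hypothetical, and the borrow is passed in as a plain argument because C has no carry flag.

```c
#include <stdint.h>

/* Models CMP edx, ebx / SBB eax, eax / INC eax: yields 0 or 1. */
uint32_t quotient_high_bit(uint32_t dividend_high, uint32_t divisor1)
{
    uint32_t mask = (uint32_t)0 - (uint32_t)(dividend_high < divisor1);
    return mask + 1;                    /* -1 + 1 = 0, or 0 + 1 = 1 */
}

/* Models SBB eax, eax / CDQ / AND / ADD / ADC: adds the divisor back
   only when the trial subtraction borrowed. */
uint64_t fixup_remainder(uint64_t trial, uint64_t divisor, unsigned borrow)
{
    uint64_t mask = (uint64_t)0 - (uint64_t)(borrow & 1);  /* -1 or 0 */
    return trial + (divisor & mask);
}
```

Both helpers do the same work on either outcome, which is the point of the branch-free form.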

64÷64-bit Signed Integer Division (64-bit Quotient and Remainder)

__divmoddi4() function for i386 processors:

# Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

# NOTE: returns ±2**63 for -2**63 / -1 and 0 for -2**63 % -1!

# NOTE: raises "division exception" when divisor is 0!

.file	"divmoddi4.s"
.arch	generic32
.code32
.intel_syntax noprefix
.text
					# [esp+20] = (optional) qword ptr remainder
					# [esp+16] = high dword of divisor
					# [esp+12] = low dword of divisor
					# [esp+8] = high dword of dividend
					# [esp+4] = low dword of dividend
__divmoddi4:
	mov	eax, [esp+16]
	mov	ecx, [esp+12]		# eax:ecx = divisor
	cdq				# edx = (divisor < 0) ? -1 : 0
	xor	ecx, edx
	xor	eax, edx		# eax:ecx = (divisor < 0) ? ~divisor : divisor
	sub	ecx, edx
	sbb	eax, edx		# eax:ecx = (divisor < 0) ? -divisor : divisor
					#         = |divisor|
	push	ebx
	push	edx			# [esp] = (divisor < 0) ? -1 : 0
	push	[esp+28]		# [esp] = address of remainder
	push	eax
	push	ecx			# [esp] = |divisor|
	mov	eax, [esp+28]
	mov	ecx, [esp+24]		# eax:ecx = dividend
	cdq				# edx = (dividend < 0) ? -1 : 0
	xor	ecx, edx
	xor	eax, edx		# eax:ecx = (dividend < 0) ? ~dividend : dividend
	sub	ecx, edx
	sbb	eax, edx		# eax:ecx = (dividend < 0) ? -dividend : dividend
					#         = |dividend|
	mov	ebx, edx		# ebx = (dividend < 0) ? -1 : 0
					#     = (remainder < 0) ? -1 : 0
	push	eax
	push	ecx			# [esp] = |dividend|
	call	__udivmoddi4		# edx:eax = |quotient|
	add	esp, 16
	pop	ecx			# ecx = address of remainder
	test	ecx, ebx
	jz	0f			# address of remainder = 0?
					# remainder >= 0?
	neg	dword ptr [ecx+4]
	neg	dword ptr [ecx]
	sbb	dword ptr [ecx+4], 0	# [ecx] = remainder
0:
	pop	ecx			# ecx = (divisor < 0) ? -1 : 0
	xor	ecx, ebx		# ecx = (divisor < 0) ^ (dividend < 0) ? -1 : 0
					#     = (quotient < 0) ? -1 : 0
	xor	eax, ecx
	xor	edx, ecx		# edx:eax = (quotient < 0) ? ~|quotient| : |quotient|
	sub	eax, ecx
	sbb	edx, ecx		# edx:eax = (quotient < 0) ? -|quotient| : |quotient|
					#         = quotient
	pop	ebx
	ret

.size	__divmoddi4, .-__divmoddi4
.type	__divmoddi4, @function
.global	__divmoddi4
.end
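
The sign handling of the listing above (XOR with a sign mask, then subtract the mask) can be modelled in C. This is a sketch under assumptions, not the author's code: divmod64_model is a hypothetical name, the native / and % operators stand in for the call to __udivmoddi4(), and the final conversions to int64_t are implementation-defined for out-of-range values (they wrap on the usual two's complement compilers).

```c
#include <stdint.h>

/* Hypothetical model: signed division built on unsigned division. */
int64_t divmod64_model(int64_t dividend, int64_t divisor, int64_t *remainder)
{
    uint64_t sn = (uint64_t)0 - (uint64_t)(dividend < 0);  /* -1 or 0 */
    uint64_t sd = (uint64_t)0 - (uint64_t)(divisor < 0);   /* -1 or 0 */
    uint64_t n  = ((uint64_t)dividend ^ sn) - sn;          /* |dividend| */
    uint64_t d  = ((uint64_t)divisor ^ sd) - sd;           /* |divisor|  */
    uint64_t q  = n / d;    /* d = 0 is undefined in C; the assembly traps */
    uint64_t r  = n % d;
    uint64_t sq = sn ^ sd;                                  /* sign of quotient */

    if (remainder)          /* remainder takes the sign of the dividend */
        *remainder = (int64_t)((r ^ sn) - sn);
    return (int64_t)((q ^ sq) - sq);
}
```

With this model, -2**63 / -1 produces the magnitude 2**63, which reads back as -2**63 after conversion, and the remainder is 0.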

Microsoft Visual C compiler helper routine _alldvrm() for i386 processors:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

; NOTE: returns quotient in EDX:EAX and remainder in EBX:ECX

; NOTE: returns ±2**63 for -2**63 / -1!

; NOTE: raises "division exception" when divisor is 0!

	.386
	.model	flat, C
	.code
				; [esp+16] = high dword of divisor
				; [esp+12] = low dword of divisor
				; [esp+8] = high dword of dividend
				; [esp+4] = low dword of dividend
_alldvrm proc	public

	; determine sign of dividend and compute |dividend|

	mov	edx, [esp+8]
	mov	eax, [esp+4]	; edx:eax = dividend

	mov	ebx, edx
	sar	ebx, 31		; ebx = (dividend < 0) ? -1 : 0
				;     = (remainder < 0) ? -1 : 0
	xor	eax, ebx
	xor	edx, ebx	; edx:eax = (dividend < 0) ? ~dividend : dividend
	sub	eax, ebx
	sbb	edx, ebx	; edx:eax = (dividend < 0) ? -dividend : dividend
				;         = |dividend|

	mov	[esp+4], eax	; write |dividend| back on stack
	mov	[esp+8], edx

	; determine sign of divisor and compute |divisor|

	mov	edx, [esp+16]
	mov	eax, [esp+12]	; edx:eax = divisor

	mov	ecx, edx
	sar	ecx, 31		; ecx = (divisor < 0) ? -1 : 0
	xor	eax, ecx
	xor	edx, ecx	; edx:eax = (divisor < 0) ? ~divisor : divisor
	sub	eax, ecx
	sbb	edx, ecx	; edx:eax = (divisor < 0) ? -divisor : divisor
				;         = |divisor|

	mov	[esp+12], eax	; write |divisor| back on stack
	mov	[esp+16], edx

	xor	ecx, ebx	; ecx = (divisor < 0) ^ (dividend < 0) ? -1 : 0
				;     = (quotient < 0) ? -1 : 0
	push	ecx		; [esp] = (quotient < 0) ? -1 : 0
	push	ebx		; [esp] = (remainder < 0) ? -1 : 0

	mov	ecx, [esp+16]	; ecx = high dword of dividend
	cmp	[esp+12], eax
	sbb	ecx, edx
	jb	short trivial	; dividend < divisor?

	bsr	ecx, edx	; ecx = index of most significant '1' bit
				;        in high dword of divisor
	jnz	short extended	; high dword of divisor <> 0?

	; remainder < divisor < 2**32

	mov	ecx, eax	; ecx = (low dword of) divisor
	mov	eax, [esp+16]	; eax = high dword of dividend
	cmp	eax, ecx
	jae	short long	; high dword of dividend >= divisor?

	; perform normal division
normal:
	mov	edx, eax	; edx = high dword of dividend
	xor	ebx, ebx	; ebx = high dword of quotient = 0
	jmp	short next

	; perform "long" alias "schoolbook" division
long:
;;	xor	edx, edx	; edx:eax = high dword of dividend
	div	ecx		; eax = high dword of quotient,
				; edx = high dword of remainder'
	mov	ebx, eax	; ebx = high dword of quotient
next:
	mov	eax, [esp+12]	; eax = low dword of dividend
	div	ecx		; eax = low dword of quotient,
				; edx = (low dword of) remainder

	mov	ecx, edx	; ecx = (low dword of) |remainder|
	mov	edx, ebx	; edx:eax = |quotient|
;;	xor	ebx, ebx	; ebx:ecx = |remainder|
if 0
	mov	ebx, [esp+4]	; ebx = (quotient < 0) ? -1 : 0
	xor	eax, ebx
	xor	edx, ebx
	sub	eax, ebx
	sbb	edx, ebx	; edx:eax = quotient

	pop	ebx		; ebx = (remainder < 0) ? -1 : 0
	xor	ecx, ebx
	sub	ecx, ebx
	sbb	ebx, ebx	; ebx:ecx = remainder
else
	pop	ebx		; ebx = (remainder < 0) ? -1 : 0
	xor	ecx, ebx
	sub	ecx, ebx
	sbb	ebx, ebx	; ebx:ecx = remainder

	xor	eax, [esp]
	xor	edx, [esp]
	sub	eax, [esp]
	sbb	edx, [esp]	; edx:eax = quotient
endif
	add	esp, 4
	ret	16		; callee restores stack

	; dividend < divisor: quotient = 0, remainder = dividend
trivial:
	pop	eax		; eax = (remainder < 0) ? -1 : 0
	mov	ecx, [esp+8]
	mov	ebx, [esp+12]	; ebx:ecx = |remainder| = |dividend|
	xor	ecx, eax
	xor	ebx, eax
	sub	ecx, eax
	sbb	ebx, eax	; ebx:ecx = remainder

	pop	edx		; edx = (quotient < 0) ? -1 : 0
	xor	eax, eax
	xor	edx, edx	; edx:eax = quotient = 0

	ret	16		; callee restores stack

	; 2**63 >= dividend >= divisor >= 2**32: quotient < 2**32
extended:
	xor	ecx, 31		; ecx = number of leading '0' bits
				;        in (high dword of) divisor
	jz	short special	; divisor = 2**63?

	; perform "extended & adjusted" division

	shld	edx, eax, cl	; edx = divisor / 2**(index + 1)
				;     = divisor'
;;	shl	eax, cl
	mov	ebx, edx	; ebx = divisor'
ifndef JccLess
	xor	eax, eax	; eax = high dword of quotient' = 0
	mov	edx, [esp+16]	; edx = high dword of dividend
	cmp	edx, ebx
	jb	short @f	; high dword of dividend < divisor'?

	; high dword of dividend >= divisor':
	; subtract divisor' from high dword of dividend to prevent possible
	; division overflow and set most significant bit of quotient"

	sub	edx, ebx	; edx = high dword of dividend - divisor'
				;     = high dword of dividend'
	inc	eax		; eax = high dword of quotient' = 1
@@:
	push	eax		; [esp] = high dword of quotient'
else ; JccLess
	mov	edx, [esp+16]	; edx = high dword of dividend
	cmp	edx, ebx	; CF = (high dword of dividend < divisor')
	sbb	eax, eax	; eax = (high dword of dividend < divisor') ? -1 : 0
	inc	eax		; eax = (high dword of dividend < divisor') ? 0 : 1
				;     = high dword of quotient'
	push	eax		; [esp] = high dword of quotient'
if 0
	neg	eax		; eax = (high dword of dividend < divisor') ? 0 : -1
	and	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
else
	imul	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
endif
	sub	edx, eax	; edx = high dword of dividend
				;     - (high dword of dividend < divisor') ? 0 : divisor'
				;     = high dword of dividend'
endif ; JccLess
	mov	eax, [esp+16]	; eax = low dword of dividend
				;     = low dword of dividend'
	div	ebx		; eax = dividend' / divisor'
				;     = low dword of quotient',
				; edx = remainder'
	pop	ebx		; ebx = high dword of quotient'
	shld	ebx, eax, cl	; ebx = quotient' / 2**(index + 1)
				;     = dividend / divisor'
				;     = quotient"
;;	shl	eax, cl
	push	ebx		; [esp] = quotient"
	mov	eax, [esp+24]	; eax = low dword of divisor
	mul	ebx		; edx:eax = low dword of divisor * quotient"
	imul	ebx, [esp+28]	; ebx = high dword of divisor * quotient"
	add	edx, ebx	; edx:eax = divisor * quotient"
	mov	ecx, [esp+16]
	mov	ebx, [esp+20]	; ebx:ecx = dividend
	sub	ecx, eax
	sbb	ebx, edx	; ebx:ecx = dividend - divisor * quotient"
				;         = remainder"
ifndef JccLess
	pop	eax		; eax = quotient"
	jnb	short @f	; remainder" >= 0?
				;  (with borrow, it is off by divisor,
				;   and quotient" is off by 1)
	add	ecx, [esp+20]
	adc	ebx, [esp+24]	; ebx:ecx = remainder" + divisor
				;         = |remainder|
	dec	eax		; eax = quotient" - 1
				;     = low dword of |quotient|
@@:
else ; JccLess
	sbb	eax, eax	; eax = (remainder" < 0) ? -1 : 0
	cdq			; edx = (remainder" < 0) ? -1 : 0
	add	[esp], eax	; [esp] = quotient" - (remainder" < 0)
				;       = low dword of |quotient|
	and	eax, [esp+24]
	and	edx, [esp+28]	; edx:eax = (remainder" < 0) ? divisor : 0
	add	ecx, eax
	adc	ebx, edx	; ebx:ecx = remainder" + divisor
				;         = |remainder|
	pop	eax		; eax = (low dword of) |quotient|
endif ; JccLess
;;	xor	edx, edx	; edx:eax = |quotient|

	pop	edx		; edx = (remainder < 0) ? -1 : 0
	xor	ecx, edx
	xor	ebx, edx
	sub	ecx, edx
	sbb	ebx, edx	; ebx:ecx = remainder

	pop	edx		; edx = (quotient < 0) ? -1 : 0
	xor	eax, edx
	sub	eax, edx
	sbb	edx, edx	; edx:eax = quotient

	ret	16		; callee restores stack

	; dividend = divisor = -2**63: quotient = 1, remainder = 0
special:
	pop	ebx		; ebx = sign of remainder = -1
	inc	ebx
;;	xor	ecx, ecx	; ebx:ecx = remainder = 0

	pop	eax		; eax = sign of quotient = 0
	inc	eax		; eax = (low dword of) quotient = 1
	xor	edx, edx	; edx:eax = quotient = 1

	ret	16		; callee restores stack

_alldvrm endp
	end
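
The "long" alias "schoolbook" path of the listing above chains two 32-bit divisions: the remainder of the high dword becomes the high half of the second dividend. A minimal C sketch, assuming a divisor below 2**32; the name long_div32 is hypothetical and the 64-bit / and % operators stand in for the DIV instruction.

```c
#include <stdint.h>

/* Hypothetical model of the two chained DIV instructions. */
uint64_t long_div32(uint64_t dividend, uint32_t divisor, uint32_t *remainder)
{
    uint32_t high = (uint32_t)(dividend >> 32);
    uint32_t low  = (uint32_t)dividend;

    uint32_t q_high = high / divisor;   /* high dword of quotient */
    uint32_t r_high = high % divisor;   /* < divisor, so the second
                                           quotient fits in 32 bits */
    uint64_t part   = ((uint64_t)r_high << 32) | low;
    uint32_t q_low  = (uint32_t)(part / divisor);

    *remainder = (uint32_t)(part % divisor);
    return ((uint64_t)q_high << 32) | q_low;
}
```

When the high dword of the dividend is already smaller than the divisor, the first step yields a zero quotient and an unchanged remainder, which corresponds to the "normal" path with a single division.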

64÷64-bit Signed Integer Division (64-bit Quotient)

__divdi3() function for i386 processors:

# Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

# NOTE: returns ±2**63 for -2**63 / -1!

# NOTE: raises "division exception" when divisor is 0!

.file	"divdi3.s"
.arch	generic32
.code32
.intel_syntax noprefix
.text
				# [esp+16] = high dword of divisor
				# [esp+12] = low dword of divisor
				# [esp+8] = high dword of dividend
				# [esp+4] = low dword of dividend
__divdi3:
	# determine sign of dividend and compute |dividend|

	mov	eax, [esp+8]
	mov	ecx, [esp+4]	# eax:ecx = dividend

	cdq			# edx = (dividend < 0) ? -1 : 0
	xor	ecx, edx
	xor	eax, edx	# eax:ecx = (dividend < 0) ? ~dividend : dividend
	sub	ecx, edx
	sbb	eax, edx	# eax:ecx = (dividend < 0) ? -dividend : dividend
				#         = |dividend|

	mov	[esp+4], ecx	# write |dividend| back on stack
	mov	[esp+8], eax

	push	edx		# [esp] = (dividend < 0) ? -1 : 0

	# determine sign of divisor and compute |divisor|

	mov	edx, [esp+20]
	mov	eax, [esp+16]	# edx:eax = divisor

	mov	ecx, edx
	sar	ecx, 31		# ecx = (divisor < 0) ? -1 : 0
	xor	eax, ecx
	xor	edx, ecx	# edx:eax = (divisor < 0) ? ~divisor : divisor
	sub	eax, ecx
	sbb	edx, ecx	# edx:eax = (divisor < 0) ? -divisor : divisor
				#         = |divisor|

	mov	[esp+16], eax	# write |divisor| back on stack
	mov	[esp+20], edx

	xor	[esp], ecx	# [esp] = (dividend < 0) ^ (divisor < 0) ? -1 : 0
				#       = (quotient < 0) ? -1 : 0

	mov	ecx, [esp+12]	# ecx = high dword of dividend
	cmp	[esp+8], eax
	sbb	ecx, edx
	jb	.trivial	# dividend < divisor?

	bsr	ecx, edx	# ecx = index of most significant '1' bit
				#        in high dword of divisor
	jnz	.extended	# high dword of divisor <> 0?

	# remainder < divisor < 2**32

	mov	ecx, eax	# ecx = (low dword of) divisor
	mov	eax, [esp+12]	# eax = high dword of dividend
	cmp	eax, ecx
	jae	.long		# high dword of dividend >= divisor?

	# perform normal division
.normal:
	mov	edx, eax	# edx = high dword of dividend
	mov	eax, [esp+8]	# edx:eax = dividend
	div	ecx		# eax = (low dword of) quotient,
				# edx = (low dword of) remainder
#	xor	edx, edx	# edx:eax = |quotient|

	jmp	.quotient

	# perform "long" alias "schoolbook" division
.long:
#	xor	edx, edx	# edx:eax = high dword of dividend
	div	ecx		# eax = high dword of quotient,
				# edx = high dword of remainder'
	push	eax		# [esp] = high dword of quotient
	mov	eax, [esp+12]	# eax = low dword of dividend
	div	ecx		# eax = low dword of quotient,
				# edx = (low dword of) remainder
	pop	edx		# edx:eax = |quotient|

	pop	ecx		# ecx = (quotient < 0) ? -1 : 0
	xor	eax, ecx
	xor	edx, ecx
	sub	eax, ecx
	sbb	edx, ecx	# edx:eax = quotient
	ret

	# dividend < divisor: quotient = 0
.trivial:
	pop	ecx		# ecx = (quotient < 0) ? -1 : 0
	xor	eax, eax
	xor	edx, edx	# edx:eax = quotient = 0
	ret

	# 2**63 >= dividend >= divisor >= 2**32: quotient < 2**32
.extended:
	xor	ecx, 31		# ecx = number of leading '0' bits
				#        in (high dword of) divisor
	jz	.special	# divisor = 2**63?

	# perform "extended & adjusted" division

	shld	edx, eax, cl	# edx = divisor / 2**(index + 1)
				#     = divisor'
#	shl	eax, cl
	push	ebx
	mov	ebx, edx	# ebx = divisor'
.ifnotdef JCCLESS
	xor	eax, eax	# eax = high dword of quotient' = 0
	mov	edx, [esp+16]	# edx = high dword of dividend
	cmp	edx, ebx
	jb	0f		# high dword of dividend < divisor'?

	# high dword of dividend >= divisor':
	# subtract divisor' from high dword of dividend to prevent possible
	# quotient overflow and set most significant bit of quotient"

	sub	edx, ebx	# edx = high dword of dividend - divisor'
				#     = high dword of dividend'
	inc	eax		# eax = high dword of quotient' = 1
0:
	push	eax		# [esp] = high dword of quotient'
.else # JCCLESS
	mov	edx, [esp+16]	# edx = high dword of dividend
	cmp	edx, ebx	# CF = (high dword of dividend < divisor')
	sbb	eax, eax	# eax = (high dword of dividend < divisor') ? -1 : 0
	inc	eax		# eax = (high dword of dividend < divisor') ? 0 : 1
				#     = high dword of quotient'
	push	eax		# [esp] = high dword of quotient'
.if 0
	neg	eax		# eax = (high dword of dividend < divisor') ? 0 : -1
	and	eax, ebx	# eax = (high dword of dividend < divisor') ? 0 : divisor'
.else
	imul	eax, ebx	# eax = (high dword of dividend < divisor') ? 0 : divisor'
.endif
	sub	edx, eax	# edx = high dword of dividend
				#     - (high dword of dividend < divisor') ? 0 : divisor'
				#     = high dword of dividend'
.endif # JCCLESS
	mov	eax, [esp+16]	# eax = low dword of dividend
				#     = low dword of dividend'
	div	ebx		# eax = dividend' / divisor'
				#     = low dword of quotient',
				# edx = remainder'
	pop	ebx		# ebx = high dword of quotient'
	shld	ebx, eax, cl	# ebx = quotient' / 2**(index + 1)
				#     = dividend / divisor'
				#     = quotient"
#	shl	eax, cl
	mov	eax, [esp+20]	# eax = low dword of divisor
	mul	ebx		# edx:eax = low dword of divisor * quotient"
	mov	ecx, [esp+24]	# ecx = high dword of divisor
	imul	ecx, ebx	# ecx = high dword of divisor * quotient"
	add	edx, ecx	# edx:eax = divisor * quotient"
#	jc	1f		# divisor * quotient" >= 2**64?

	mov	ecx, [esp+16]	# ecx = high dword of dividend
	cmp	[esp+12], eax
	sbb	ecx, edx	# CF = (dividend < divisor * quotient")
				#    = (remainder" < 0)
1:
	sbb	eax, eax	# eax = (remainder" < 0) ? -1 : 0
	add	eax, ebx	# eax = quotient" - (remainder" < 0)
				#     = (low dword of) |quotient|
#	xor	edx, edx	# edx:eax = |quotient|
	pop	ebx
.quotient:
	pop	edx		# edx = (quotient < 0) ? -1 : 0
	xor	eax, edx
	sub	eax, edx
	sbb	edx, edx	# edx:eax = quotient
	ret

	# dividend = divisor = -2**63: quotient = 1
.special:
	pop	eax		# eax = sign of quotient = 0
	inc	eax		# eax = (low dword of) quotient = 1
	xor	edx, edx	# edx:eax = quotient = 1
	ret

.size	__divdi3, .-__divdi3
.type	__divdi3, @function
.global	__divdi3
.end

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

; NOTE: returns ±2**63 for -2**63 / -1!

; NOTE: raises "division exception" when divisor is 0!

	.386
	.model	flat, C
	.code
				; [esp+16] = high dword of divisor
				; [esp+12] = low dword of divisor
				; [esp+8] = high dword of dividend
				; [esp+4] = low dword of dividend
__divdi3 proc	public

	; determine sign of dividend and compute |dividend|

	mov	eax, [esp+8]
	mov	ecx, [esp+4]	; eax:ecx = dividend

	cdq			; edx = (dividend < 0) ? -1 : 0
	xor	ecx, edx
	xor	eax, edx	; eax:ecx = (dividend < 0) ? ~dividend : dividend
	sub	ecx, edx
	sbb	eax, edx	; eax:ecx = (dividend < 0) ? -dividend : dividend
				;         = |dividend|

	mov	[esp+4], ecx	; write |dividend| back on stack
	mov	[esp+8], eax

	push	edx		; [esp] = (dividend < 0) ? -1 : 0

	; determine sign of divisor and compute |divisor|

	mov	edx, [esp+20]
	mov	eax, [esp+16]	; edx:eax = divisor

	mov	ecx, edx
	sar	ecx, 31		; ecx = (divisor < 0) ? -1 : 0
	xor	eax, ecx
	xor	edx, ecx	; edx:eax = (divisor < 0) ? ~divisor : divisor
	sub	eax, ecx
	sbb	edx, ecx	; edx:eax = (divisor < 0) ? -divisor : divisor
				;         = |divisor|

	mov	[esp+16], eax	; write |divisor| back on stack
	mov	[esp+20], edx

	xor	[esp], ecx	; [esp] = (dividend < 0) ^ (divisor < 0) ? -1 : 0
				;       = (quotient < 0) ? -1 : 0

	mov	ecx, [esp+12]	; ecx = high dword of dividend
	cmp	[esp+8], eax
	sbb	ecx, edx
	jb	short trivial	; dividend < divisor?

	bsr	ecx, edx	; ecx = index of most significant '1' bit
				;        in high dword of divisor
	jnz	short extended	; high dword of divisor <> 0?

	; remainder < divisor < 2**32

	mov	ecx, eax	; ecx = (low dword of) divisor
	mov	eax, [esp+12]	; eax = high dword of dividend
	cmp	eax, ecx
	jae	short long	; high dword of dividend >= divisor?

	; perform normal division
normal:
	mov	edx, eax	; edx = high dword of dividend
	mov	eax, [esp+8]	; edx:eax = dividend
	div	ecx		; eax = (low dword of) quotient,
				; edx = (low dword of) remainder
;;	xor	edx, edx	; edx:eax = |quotient|

	jmp	short quotient

	; perform "long" alias "schoolbook" division
long:
;;	xor	edx, edx	; edx:eax = high dword of dividend
	div	ecx		; eax = high dword of quotient,
				; edx = high dword of remainder'
	push	eax		; [esp] = high dword of quotient
	mov	eax, [esp+12]	; eax = low dword of dividend
	div	ecx		; eax = low dword of quotient,
				; edx = (low dword of) remainder
	pop	edx		; edx:eax = |quotient|

	pop	ecx		; ecx = (quotient < 0) ? -1 : 0
	xor	eax, ecx
	xor	edx, ecx
	sub	eax, ecx
	sbb	edx, ecx	; edx:eax = quotient
	ret

	; dividend < divisor: quotient = 0
trivial:
	pop	ecx		; ecx = (quotient < 0) ? -1 : 0
	xor	eax, eax
	xor	edx, edx	; edx:eax = quotient = 0
	ret

	; 2**63 >= dividend >= divisor >= 2**32: quotient < 2**32
extended:
	xor	ecx, 31		; ecx = number of leading '0' bits
				;        in (high dword of) divisor
	jz	short special	; divisor = 2**63?

	; perform "extended & adjusted" division

	shld	edx, eax, cl	; edx = divisor / 2**(index + 1)
				;     = divisor'
;;	shl	eax, cl
	push	ebx
	mov	ebx, edx	; ebx = divisor'
ifndef JccLess
	xor	eax, eax	; eax = high dword of quotient' = 0
	mov	edx, [esp+16]	; edx = high dword of dividend
	cmp	edx, ebx
	jb	short @f	; high dword of dividend < divisor'?

	; high dword of dividend >= divisor':
	; subtract divisor' from high dword of dividend to prevent possible
	; quotient overflow and set most significant bit of quotient"

	sub	edx, ebx	; edx = high dword of dividend - divisor'
				;     = high dword of dividend'
	inc	eax		; eax = high dword of quotient' = 1
@@:
	push	eax		; [esp] = high dword of quotient'
else ; JccLess
	mov	edx, [esp+16]	; edx = high dword of dividend
	cmp	edx, ebx	; CF = (high dword of dividend < divisor')
	sbb	eax, eax	; eax = (high dword of dividend < divisor') ? -1 : 0
	inc	eax		; eax = (high dword of dividend < divisor') ? 0 : 1
				;     = high dword of quotient'
	push	eax		; [esp] = high dword of quotient'
if 0
	neg	eax		; eax = (high dword of dividend < divisor') ? 0 : -1
	and	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
else
	imul	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
endif
	sub	edx, eax	; edx = high dword of dividend
				;     - (high dword of dividend < divisor') ? 0 : divisor'
				;     = high dword of dividend'
endif ; JccLess
	mov	eax, [esp+16]	; eax = low dword of dividend
				;     = low dword of dividend'
	div	ebx		; eax = dividend' / divisor'
				;     = low dword of quotient',
				; edx = remainder'
	pop	ebx		; ebx = high dword of quotient'
	shld	ebx, eax, cl	; ebx = quotient' / 2**(index + 1)
				;     = dividend / divisor'
				;     = quotient"
;;	shl	eax, cl
	mov	eax, [esp+20]	; eax = low dword of divisor
	mul	ebx		; edx:eax = low dword of divisor * quotient"
	mov	ecx, [esp+24]	; ecx = high dword of divisor
	imul	ecx, ebx	; ecx = high dword of divisor * quotient"
	add	edx, ecx	; edx:eax = divisor * quotient"
	mov	ecx, [esp+16]	; ecx = high dword of dividend
	cmp	[esp+12], eax
	sbb	ecx, edx	; CF = (dividend < divisor * quotient")
				;    = (remainder" < 0)
	sbb	eax, eax	; eax = (remainder" < 0) ? -1 : 0
	add	eax, ebx	; eax = quotient" - (remainder" < 0)
				;     = (low dword of) |quotient|
;;	xor	edx, edx	; edx:eax = |quotient|
	pop	ebx
quotient:
	pop	edx		; edx = (quotient < 0) ? -1 : 0
	xor	eax, edx
	sub	eax, edx
	sbb	edx, edx	; edx:eax = quotient
	ret

	; dividend = divisor = -2**63: quotient = 1
special:
	pop	eax		; eax = sign of quotient = 0
	inc	eax		; eax = (low dword of) quotient = 1
	xor	edx, edx	; edx:eax = quotient = 1
	ret

__divdi3 endp
	end
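
The "extended & adjusted" branch of the listing above can be cross-checked with a small C model. This is only a sketch, not part of the published sources: the names leading_zeros32() and extended_quotient() are hypothetical, and the model assumes the preconditions of that branch, i.e. 2**63 >= dividend >= divisor >= 2**32 and divisor < 2**63.

```c
#include <stdint.h>

/* Hypothetical helper: number of leading '0' bits; x must not be 0
   (the listing uses BSR followed by XOR 31 instead) */
static unsigned leading_zeros32(uint32_t x)
{
    unsigned count = 0;

    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        count++;
    }
    return count;
}

/* Hypothetical model of the "extended & adjusted" branch;
   assumes 2**63 >= dividend >= divisor >= 2**32 and divisor < 2**63 */
uint32_t extended_quotient(uint64_t dividend, uint64_t divisor)
{
    unsigned shift = leading_zeros32((uint32_t)(divisor >> 32)); /* 1 to 31 */
    uint32_t divisor1 = (uint32_t)((divisor << shift) >> 32);    /* divisor' */
    uint64_t quotient1 = dividend / divisor1;                    /* quotient' < 2**33 */
    uint32_t quotient2 = (uint32_t)(quotient1 >> (32 - shift));  /* quotient" */
    uint64_t product = divisor * quotient2;                      /* fits in 64 bits */

    /* quotient" is either exact or 1 too large */
    return quotient2 - (dividend < product);
}
```

For example, extended_quotient(0x3FFFFFFFC, 0x1FFFFFFFF) first estimates quotient" = 2 and then corrects it to 1, since 2 * 0x1FFFFFFFF exceeds the dividend.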

Microsoft Visual C compiler helper routine _alldiv() for i386 processors:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

; NOTE: returns ±2**63 for -2**63 / -1!

; NOTE: raises "division exception" when divisor is 0!

	.386
	.model	flat, C
	.code
				; [esp+16] = high dword of divisor
				; [esp+12] = low dword of divisor
				; [esp+8] = high dword of dividend
				; [esp+4] = low dword of dividend
_alldiv	proc	public

	; determine sign of dividend and compute |dividend|

	mov	eax, [esp+8]
	mov	ecx, [esp+4]	; eax:ecx = dividend

	cdq			; edx = (dividend < 0) ? -1 : 0
	xor	ecx, edx
	xor	eax, edx	; eax:ecx = (dividend < 0) ? ~dividend : dividend
	sub	ecx, edx
	sbb	eax, edx	; eax:ecx = (dividend < 0) ? -dividend : dividend
				;         = |dividend|

	mov	[esp+4], ecx	; write |dividend| back on stack
	mov	[esp+8], eax

	push	edx		; [esp] = (dividend < 0) ? -1 : 0

	; determine sign of divisor and compute |divisor|

	mov	edx, [esp+20]
	mov	eax, [esp+16]	; edx:eax = divisor

	mov	ecx, edx
	sar	ecx, 31		; ecx = (divisor < 0) ? -1 : 0
	xor	eax, ecx
	xor	edx, ecx	; edx:eax = (divisor < 0) ? ~divisor : divisor
	sub	eax, ecx
	sbb	edx, ecx	; edx:eax = (divisor < 0) ? -divisor : divisor
				;         = |divisor|

	mov	[esp+16], eax	; write |divisor| back on stack
	mov	[esp+20], edx

	xor	[esp], ecx	; [esp] = (dividend < 0) ^ (divisor < 0) ? -1 : 0
				;       = (quotient < 0) ? -1 : 0

	mov	ecx, [esp+12]	; ecx = high dword of dividend
	cmp	[esp+8], eax
	sbb	ecx, edx
	jb	short trivial	; dividend < divisor?

	bsr	ecx, edx	; ecx = index of most significant '1' bit
				;        in high dword of divisor
	jnz	short extended	; high dword of divisor <> 0?

	; remainder < divisor < 2**32

	mov	ecx, eax	; ecx = (low dword of) divisor
	mov	eax, [esp+12]	; eax = high dword of dividend
	cmp	eax, ecx
	jae	short long	; high dword of dividend >= divisor?

	; perform normal division
normal:
	mov	edx, eax	; edx = high dword of dividend
	mov	eax, [esp+8]	; edx:eax = dividend
	div	ecx		; eax = (low dword of) quotient,
				; edx = (low dword of) remainder
;;	xor	edx, edx	; edx:eax = |quotient|

	jmp	short quotient

	; perform "long" alias "schoolbook" division
long:
;;	xor	edx, edx	; edx:eax = high dword of dividend
	div	ecx		; eax = high dword of quotient,
				; edx = high dword of remainder'
	push	eax		; [esp] = high dword of quotient
	mov	eax, [esp+12]	; eax = low dword of dividend
	div	ecx		; eax = low dword of quotient,
				; edx = (low dword of) remainder
	pop	edx		; edx:eax = |quotient|

	pop	ecx		; ecx = (quotient < 0) ? -1 : 0
	xor	eax, ecx
	xor	edx, ecx
	sub	eax, ecx
	sbb	edx, ecx	; edx:eax = quotient

	ret	16		; callee restores stack

	; dividend < divisor: quotient = 0
trivial:
	pop	ecx		; ecx = (quotient < 0) ? -1 : 0
	xor	eax, eax
	xor	edx, edx	; edx:eax = quotient = 0

	ret	16		; callee restores stack

	; 2**63 >= dividend >= divisor >= 2**32: quotient < 2**32
extended:
	xor	ecx, 31		; ecx = number of leading '0' bits
				;        in (high dword of) divisor
	jz	short special	; divisor = 2**63?

	; perform "extended & adjusted" division

	shld	edx, eax, cl	; edx = divisor / 2**(index + 1)
				;     = divisor'
;;	shl	eax, cl
	push	ebx
	mov	ebx, edx	; ebx = divisor'
ifndef JccLess
	xor	eax, eax	; eax = high dword of quotient' = 0
	mov	edx, [esp+16]	; edx = high dword of dividend
	cmp	edx, ebx
	jb	short @f	; high dword of dividend < divisor'?

	; high dword of dividend >= divisor':
	; subtract divisor' from high dword of dividend to prevent possible
	; division overflow and set most significant bit of quotient"

	sub	edx, ebx	; edx = high dword of dividend - divisor'
				;     = high dword of dividend'
	inc	eax		; eax = high dword of quotient' = 1
@@:
	push	eax		; [esp] = high dword of quotient'
else ; JccLess
	mov	edx, [esp+16]	; edx = high dword of dividend
	cmp	edx, ebx	; CF = (high dword of dividend < divisor')
	sbb	eax, eax	; eax = (high dword of dividend < divisor') ? -1 : 0
	inc	eax		; eax = (high dword of dividend < divisor') ? 0 : 1
				;     = high dword of quotient'
	push	eax		; [esp] = high dword of quotient'
if 0
	neg	eax		; eax = (high dword of dividend < divisor') ? 0 : -1
	and	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
else
	imul	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
endif
	sub	edx, eax	; edx = high dword of dividend
				;     - (high dword of dividend < divisor') ? 0 : divisor'
				;     = high dword of dividend'
endif ; JccLess
	mov	eax, [esp+16]	; eax = low dword of dividend
				;     = low dword of dividend'
	div	ebx		; eax = dividend' / divisor'
				;     = low dword of quotient',
				; edx = remainder'
	pop	ebx		; ebx = high dword of quotient'
	shld	ebx, eax, cl	; ebx = quotient' / 2**(index + 1)
				;     = dividend / divisor'
				;     = quotient"
;;	shl	eax, cl
	mov	eax, [esp+20]	; eax = low dword of divisor
	mul	ebx		; edx:eax = low dword of divisor * quotient"
	mov	ecx, [esp+24]	; ecx = high dword of divisor
	imul	ecx, ebx	; ecx = high dword of divisor * quotient"
	add	edx, ecx	; edx:eax = divisor * quotient"
	mov	ecx, [esp+16]	; ecx = high dword of dividend
	cmp	[esp+12], eax
	sbb	ecx, edx	; CF = (dividend < divisor * quotient")
				;    = (remainder" < 0)
	sbb	eax, eax	; eax = (remainder" < 0) ? -1 : 0
	add	eax, ebx	; eax = quotient" - (remainder" < 0)
				;     = (low dword of) |quotient|
;;	xor	edx, edx	; edx:eax = |quotient|
	pop	ebx
quotient:
	pop	edx		; edx = (quotient < 0) ? -1 : 0
	xor	eax, edx
	sub	eax, edx
	sbb	edx, edx	; edx:eax = quotient

	ret	16		; callee restores stack

	; dividend = divisor = -2**63: quotient = 1
special:
	pop	eax		; eax = sign of quotient = 0
	inc	eax		; eax = (low dword of) quotient = 1
	xor	edx, edx	; edx:eax = quotient = 1

	ret	16		; callee restores stack

_alldiv	endp
	end
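
The sign handling of the two signed quotient routines above can be sketched in C as follows. This is a hypothetical reference model, not the author's code: the name signed_quotient() is made up, operands and result are passed as two's complement bit patterns to avoid undefined behaviour in C, and the divisor is assumed to be non-zero (the assembly raises a division exception instead).

```c
#include <stdint.h>

/* Hypothetical model: operands and result are two's complement bit patterns;
   assumes divisor != 0 */
uint64_t signed_quotient(uint64_t dividend, uint64_t divisor)
{
    uint64_t sign1 = 0 - (dividend >> 63);  /* (dividend < 0) ? -1 : 0 */
    uint64_t sign2 = 0 - (divisor >> 63);   /* (divisor < 0) ? -1 : 0 */

    /* (x ^ sign) - sign negates x when sign is -1 and keeps x when sign is 0 */
    uint64_t magnitude1 = (dividend ^ sign1) - sign1;   /* |dividend| */
    uint64_t magnitude2 = (divisor ^ sign2) - sign2;    /* |divisor| */
    uint64_t quotient = magnitude1 / magnitude2;        /* |quotient| */
    uint64_t sign = sign1 ^ sign2;          /* (quotient < 0) ? -1 : 0 */

    return (quotient ^ sign) - sign;
}
```

With these definitions, -2**63 / -1 yields the bit pattern 0x8000000000000000, matching the NOTE of the _alldiv() listing.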

64÷64-bit Signed Integer Division (64-bit Remainder)

__moddi3() function for i386 processors:

# Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

# NOTE: raises "division exception" when divisor is 0!

.file	"moddi3.s"
.arch	generic32
.code32
.intel_syntax noprefix
.text
				# [esp+16] = high dword of divisor
				# [esp+12] = low dword of divisor
				# [esp+8] = high dword of dividend
				# [esp+4] = low dword of dividend
__moddi3:
	# determine sign of dividend and compute |dividend|

	mov	eax, [esp+8]
	mov	ecx, [esp+4]	# eax:ecx = dividend

	cdq			# edx = (dividend < 0) ? -1 : 0
	xor	ecx, edx
	xor	eax, edx	# eax:ecx = (dividend < 0) ? ~dividend : dividend
	sub	ecx, edx
	sbb	eax, edx	# eax:ecx = (dividend < 0) ? -dividend : dividend
				#         = |dividend|

	mov	[esp+4], ecx	# write |dividend| back on stack
	mov	[esp+8], eax

	push	edx		# [esp] = (dividend < 0) ? -1 : 0

	# determine sign of divisor and compute |divisor|

	mov	edx, [esp+20]
	mov	eax, [esp+16]	# edx:eax = divisor

	mov	ecx, edx
	sar	ecx, 31		# ecx = (divisor < 0) ? -1 : 0
	xor	eax, ecx
	xor	edx, ecx	# edx:eax = (divisor < 0) ? ~divisor : divisor
	sub	eax, ecx
	sbb	edx, ecx	# edx:eax = (divisor < 0) ? -divisor : divisor
				#         = |divisor|

	mov	[esp+16], eax	# write |divisor| back on stack
	mov	[esp+20], edx

	mov	ecx, [esp+12]	# ecx = high dword of dividend
	cmp	[esp+8], eax
	sbb	ecx, edx
	jb	.trivial	# dividend < divisor?

	bsr	ecx, edx	# ecx = index of most significant '1' bit
				#        in high dword of divisor
	jnz	.extended	# high dword of divisor <> 0?

	# remainder < divisor < 2**32

	mov	ecx, eax	# ecx = (low dword of) divisor
	mov	eax, [esp+12]	# eax = high dword of dividend
	cmp	eax, ecx
	jae	.long		# high dword of dividend >= divisor?

	# perform normal division
.normal:
	mov	edx, eax	# edx = high dword of dividend
	jmp	.next

	# perform "long" alias "schoolbook" division
.long:
#	xor	edx, edx	# edx:eax = high dword of dividend
	div	ecx		# eax = high dword of quotient,
				# edx = high dword of remainder'
.next:
	mov	eax, [esp+8]	# eax = low dword of dividend
	div	ecx		# eax = low dword of quotient,
				# edx = (low dword of) remainder
	mov	eax, edx	# eax = (low dword of) |remainder|
#	xor	edx, edx	# edx:eax = |remainder|

	pop	edx		# edx = (remainder < 0) ? -1 : 0
	xor	eax, edx
	sub	eax, edx
	sbb	edx, edx	# edx:eax = remainder
	ret

	# dividend < divisor: remainder = dividend
.trivial:
	mov	eax, [esp+8]
	mov	edx, [esp+12]	# edx:eax = |remainder| = |dividend|

	jmp	.remainder

	# 2**63 >= dividend >= divisor >= 2**32: quotient < 2**32
.extended:
	xor	ecx, 31		# ecx = number of leading '0' bits
				#        in (high dword of) divisor
	jz	.special	# divisor = 2**63?

	# perform "extended & adjusted" division

	shld	edx, eax, cl	# edx = divisor / 2**(index + 1)
				#     = divisor'
#	shl	eax, cl
	push	ebx
	mov	ebx, edx	# ebx = divisor'
.ifnotdef JCCLESS
	xor	eax, eax	# eax = high dword of quotient' = 0
	mov	edx, [esp+16]	# edx = high dword of dividend
	cmp	edx, ebx
	jb	0f		# high dword of dividend < divisor'?

	# high dword of dividend >= divisor':
	# subtract divisor' from high dword of dividend to prevent possible
	# quotient overflow and set most significant bit of quotient"

	sub	edx, ebx	# edx = high dword of dividend - divisor'
				#     = high dword of dividend'
	inc	eax		# eax = high dword of quotient' = 1
0:
	push	eax		# [esp] = high dword of quotient'
.else # JCCLESS
	mov	edx, [esp+16]	# edx = high dword of dividend
	cmp	edx, ebx	# CF = (high dword of dividend < divisor')
	sbb	eax, eax	# eax = (high dword of dividend < divisor') ? -1 : 0
	inc	eax		# eax = (high dword of dividend < divisor') ? 0 : 1
				#     = high dword of quotient'
	push	eax		# [esp] = high dword of quotient'
.if 0
	neg	eax		# eax = (high dword of dividend < divisor') ? 0 : -1
	and	eax, ebx	# eax = (high dword of dividend < divisor') ? 0 : divisor'
.else
	imul	eax, ebx	# eax = (high dword of dividend < divisor') ? 0 : divisor'
.endif
	sub	edx, eax	# edx = high dword of dividend
				#     - (high dword of dividend < divisor') ? 0 : divisor'
				#     = high dword of dividend'
.endif # JCCLESS
	mov	eax, [esp+16]	# eax = low dword of dividend
				#     = low dword of dividend'
	div	ebx		# eax = dividend' / divisor'
				#     = low dword of quotient',
				# edx = remainder'
	pop	ebx		# ebx = high dword of quotient'
	shld	ebx, eax, cl	# ebx = quotient' / 2**(index + 1)
				#     = dividend / divisor'
				#     = quotient"
#	shl	eax, cl
	mov	eax, [esp+20]	# eax = low dword of divisor
	mul	ebx		# edx:eax = low dword of divisor * quotient"
	imul	ebx, [esp+24]	# ebx = high dword of divisor * quotient"
	add	edx, ebx	# edx:eax = divisor * quotient"
	mov	ecx, [esp+12]
	mov	ebx, [esp+16]	# ebx:ecx = dividend
	sub	ecx, eax
	sbb	ebx, edx	# ebx:ecx = dividend - divisor * quotient"
				#         = remainder"
.ifnotdef JCCLESS
	jnb	1f		# remainder" >= 0?
				#  (with borrow, it is off by divisor,
				#   and quotient" is off by 1)
	add	ecx, [esp+20]
	adc	ebx, [esp+24]	# ebx:ecx = remainder" + divisor
				#         = remainder
1:
	mov	eax, ecx
	mov	edx, ebx	# edx:eax = |remainder|
.else # JCCLESS
	sbb	eax, eax	# eax = (remainder" < 0) ? -1 : 0
	cdq			# edx = (remainder" < 0) ? -1 : 0
	and	eax, [esp+20]
	and	edx, [esp+24]	# edx:eax = (remainder" < 0) ? divisor : 0
	add	eax, ecx
	adc	edx, ebx	# edx:eax = remainder" + divisor
				#         = remainder
.endif # JCCLESS
	pop	ebx
.remainder:
	pop	ecx		# ecx = (remainder < 0) ? -1 : 0
	xor	eax, ecx
	xor	edx, ecx
	sub	eax, ecx
	sbb	edx, ecx	# edx:eax = remainder
	ret

	# dividend = divisor = -2**63: remainder = 0
.special:
	pop	eax		# eax = sign of remainder = -1
	inc	eax
	xor	edx, edx	# edx:eax = remainder = 0
	ret

.size	__moddi3, .-__moddi3
.type	__moddi3, @function
.global	__moddi3
.end

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

; NOTE: raises "division exception" when divisor is 0!

	.386
	.model	flat, C
	.code
				; [esp+16] = high dword of divisor
				; [esp+12] = low dword of divisor
				; [esp+8] = high dword of dividend
				; [esp+4] = low dword of dividend
__moddi3 proc	public

	; determine sign of dividend and compute |dividend|

	mov	eax, [esp+8]
	mov	ecx, [esp+4]	; eax:ecx = dividend

	cdq			; edx = (dividend < 0) ? -1 : 0
	xor	ecx, edx
	xor	eax, edx	; eax:ecx = (dividend < 0) ? ~dividend : dividend
	sub	ecx, edx
	sbb	eax, edx	; eax:ecx = (dividend < 0) ? -dividend : dividend
				;         = |dividend|

	mov	[esp+4], ecx	; write |dividend| back on stack
	mov	[esp+8], eax

	push	edx		; [esp] = (dividend < 0) ? -1 : 0

	; determine sign of divisor and compute |divisor|

	mov	edx, [esp+20]
	mov	eax, [esp+16]	; edx:eax = divisor

	mov	ecx, edx
	sar	ecx, 31		; ecx = (divisor < 0) ? -1 : 0
	xor	eax, ecx
	xor	edx, ecx	; edx:eax = (divisor < 0) ? ~divisor : divisor
	sub	eax, ecx
	sbb	edx, ecx	; edx:eax = (divisor < 0) ? -divisor : divisor
				;         = |divisor|

	mov	[esp+16], eax	; write |divisor| back on stack
	mov	[esp+20], edx

	mov	ecx, [esp+12]	; ecx = high dword of dividend
	cmp	[esp+8], eax
	sbb	ecx, edx
	jb	short trivial	; dividend < divisor?

	bsr	ecx, edx	; ecx = index of most significant '1' bit
				;        in high dword of divisor
	jnz	short extended	; high dword of divisor <> 0?

	; remainder < divisor < 2**32

	mov	ecx, eax	; ecx = (low dword of) divisor
	mov	eax, [esp+12]	; eax = high dword of dividend
	cmp	eax, ecx
	jae	short long	; high dword of dividend >= divisor?

	; perform normal division
normal:
	mov	edx, eax	; edx = high dword of dividend
	jmp	short next

	; perform "long" alias "schoolbook" division
long:
;;	xor	edx, edx	; edx:eax = high dword of dividend
	div	ecx		; eax = high dword of quotient,
				; edx = high dword of remainder'
next:
	mov	eax, [esp+8]	; eax = low dword of dividend
	div	ecx		; eax = low dword of quotient,
				; edx = (low dword of) remainder
	mov	eax, edx	; eax = (low dword of) |remainder|
;;	xor	edx, edx	; edx:eax = |remainder|

	pop	edx		; edx = (remainder < 0) ? -1 : 0
	xor	eax, edx
	sub	eax, edx
	sbb	edx, edx	; edx:eax = remainder
	ret

	; dividend < divisor: remainder = dividend
trivial:
	mov	eax, [esp+8]
	mov	edx, [esp+12]	; edx:eax = |remainder| = |dividend|

	jmp	short remainder

	; 2**63 >= dividend >= divisor >= 2**32: quotient < 2**32
extended:
	xor	ecx, 31		; ecx = number of leading '0' bits
				;        in (high dword of) divisor
	jz	short special	; divisor = 2**63?

	; perform "extended & adjusted" division

	shld	edx, eax, cl	; edx = divisor / 2**(index + 1)
				;     = divisor'
;;	shl	eax, cl
	push	ebx
	mov	ebx, edx	; ebx = divisor'
ifndef JccLess
	xor	eax, eax	; eax = high dword of quotient' = 0
	mov	edx, [esp+16]	; edx = high dword of dividend
	cmp	edx, ebx
	jb	short @f	; high dword of dividend < divisor'?

	; high dword of dividend >= divisor':
	; subtract divisor' from high dword of dividend to prevent possible
	; quotient overflow and set most significant bit of quotient"

	sub	edx, ebx	; edx = high dword of dividend - divisor'
				;     = high dword of dividend'
	inc	eax		; eax = high dword of quotient' = 1
@@:
	push	eax		; [esp] = high dword of quotient'
else ; JccLess
	mov	edx, [esp+16]	; edx = high dword of dividend
	cmp	edx, ebx	; CF = (high dword of dividend < divisor')
	sbb	eax, eax	; eax = (high dword of dividend < divisor') ? -1 : 0
	inc	eax		; eax = (high dword of dividend < divisor') ? 0 : 1
				;     = high dword of quotient'
	push	eax		; [esp] = high dword of quotient'
if 0
	neg	eax		; eax = (high dword of dividend < divisor') ? 0 : -1
	and	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
else
	imul	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
endif
	sub	edx, eax	; edx = high dword of dividend
				;     - (high dword of dividend < divisor') ? 0 : divisor'
				;     = high dword of dividend'
endif ; JccLess
	mov	eax, [esp+16]	; eax = low dword of dividend
				;     = low dword of dividend'
	div	ebx		; eax = dividend' / divisor'
				;     = low dword of quotient',
				; edx = remainder'
	pop	ebx		; ebx = high dword of quotient'
	shld	ebx, eax, cl	; ebx = quotient' / 2**(index + 1)
				;     = dividend / divisor'
				;     = quotient"
;;	shl	eax, cl
	mov	eax, [esp+20]	; eax = low dword of divisor
	mul	ebx		; edx:eax = low dword of divisor * quotient"
	imul	ebx, [esp+24]	; ebx = high dword of divisor * quotient"
	add	edx, ebx	; edx:eax = divisor * quotient"
	mov	ecx, [esp+12]
	mov	ebx, [esp+16]	; ebx:ecx = dividend
	sub	ecx, eax
	sbb	ebx, edx	; ebx:ecx = dividend - divisor * quotient"
				;         = remainder"
ifndef JccLess
	jnb	short @f	; remainder" >= 0?
				;  (with borrow, it is off by divisor,
				;   and quotient" is off by 1)
	add	ecx, [esp+20]
	adc	ebx, [esp+24]	; ebx:ecx = remainder" + divisor
				;         = remainder
@@:
	mov	eax, ecx
	mov	edx, ebx	; edx:eax = |remainder|
else ; JccLess
	sbb	eax, eax	; eax = (remainder" < 0) ? -1 : 0
	cdq			; edx = (remainder" < 0) ? -1 : 0
	and	eax, [esp+20]
	and	edx, [esp+24]	; edx:eax = (remainder" < 0) ? divisor : 0
	add	eax, ecx
	adc	edx, ebx	; edx:eax = remainder" + divisor
				;         = remainder
endif ; JccLess
	pop	ebx
remainder:
	pop	ecx		; ecx = (remainder < 0) ? -1 : 0
	xor	eax, ecx
	xor	edx, ecx
	sub	eax, ecx
	sbb	edx, ecx	; edx:eax = remainder
	ret

	; dividend = divisor = -2**63: remainder = 0
special:
	pop	eax		; eax = sign of remainder = -1
	inc	eax
	xor	edx, edx	; edx:eax = remainder = 0
	ret

__moddi3 endp
	end
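
Unlike the quotient, the remainder takes the sign of the dividend alone, which is why the listings above keep only the dividend's sign on the stack. A minimal C sketch of that rule, with the hypothetical name signed_remainder(), operands passed as two's complement bit patterns and a non-zero divisor assumed:

```c
#include <stdint.h>

/* Hypothetical model: operands and result are two's complement bit patterns;
   assumes divisor != 0 */
uint64_t signed_remainder(uint64_t dividend, uint64_t divisor)
{
    uint64_t sign1 = 0 - (dividend >> 63);  /* (dividend < 0) ? -1 : 0 */
    uint64_t sign2 = 0 - (divisor >> 63);   /* (divisor < 0) ? -1 : 0 */
    uint64_t magnitude1 = (dividend ^ sign1) - sign1;   /* |dividend| */
    uint64_t magnitude2 = (divisor ^ sign2) - sign2;    /* |divisor| */
    uint64_t remainder = magnitude1 % magnitude2;       /* |remainder| */

    /* the sign of the divisor does not influence the result */
    return (remainder ^ sign1) - sign1;
}
```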

Microsoft Visual C compiler helper routine _allrem() for i386 processors:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

; NOTE: raises "division exception" when divisor is 0!

	.386
	.model	flat, C
	.code
				; [esp+16] = high dword of divisor
				; [esp+12] = low dword of divisor
				; [esp+8] = high dword of dividend
				; [esp+4] = low dword of dividend
_allrem	proc	public

	; determine sign of dividend and compute |dividend|

	mov	eax, [esp+8]
	mov	ecx, [esp+4]	; eax:ecx = dividend

	cdq			; edx = (dividend < 0) ? -1 : 0
	xor	ecx, edx
	xor	eax, edx	; eax:ecx = (dividend < 0) ? ~dividend : dividend
	sub	ecx, edx
	sbb	eax, edx	; eax:ecx = (dividend < 0) ? -dividend : dividend
				;         = |dividend|

	mov	[esp+4], ecx	; write |dividend| back on stack
	mov	[esp+8], eax

	push	edx		; [esp] = (dividend < 0) ? -1 : 0

	; determine sign of divisor and compute |divisor|

	mov	edx, [esp+20]
	mov	eax, [esp+16]	; edx:eax = divisor

	mov	ecx, edx
	sar	ecx, 31		; ecx = (divisor < 0) ? -1 : 0
	xor	eax, ecx
	xor	edx, ecx	; edx:eax = (divisor < 0) ? ~divisor : divisor
	sub	eax, ecx
	sbb	edx, ecx	; edx:eax = (divisor < 0) ? -divisor : divisor
				;         = |divisor|

	mov	[esp+16], eax	; write |divisor| back on stack
	mov	[esp+20], edx

	mov	ecx, [esp+12]	; ecx = high dword of dividend
	cmp	[esp+8], eax
	sbb	ecx, edx
	jb	short trivial	; dividend < divisor?

	bsr	ecx, edx	; ecx = index of most significant '1' bit
				;        in high dword of divisor
	jnz	short extended	; high dword of divisor <> 0?

	; remainder < divisor < 2**32

	mov	ecx, eax	; ecx = (low dword of) divisor
	mov	eax, [esp+12]	; eax = high dword of dividend
	cmp	eax, ecx
	jae	short long	; high dword of dividend >= divisor?

	; perform normal division
normal:
	mov	edx, eax	; edx = high dword of dividend
	jmp	short next

	; perform "long" alias "schoolbook" division
long:
;;	xor	edx, edx	; edx:eax = high dword of dividend
	div	ecx		; eax = high dword of quotient,
				; edx = high dword of remainder'
next:
	mov	eax, [esp+8]	; eax = low dword of dividend
	div	ecx		; eax = low dword of quotient,
				; edx = (low dword of) remainder
	mov	eax, edx	; eax = (low dword of) |remainder|
;;	xor	edx, edx	; edx:eax = |remainder|

	pop	edx		; edx = (remainder < 0) ? -1 : 0
	xor	eax, edx
	sub	eax, edx
	sbb	edx, edx	; edx:eax = remainder

	ret	16		; callee restores stack

	; dividend < divisor: remainder = dividend
trivial:
	mov	eax, [esp+8]
	mov	edx, [esp+12]	; edx:eax = |remainder| = |dividend|

	jmp	short remainder

	; 2**63 >= dividend >= divisor >= 2**32: quotient < 2**32
extended:
	xor	ecx, 31		; ecx = number of leading '0' bits
				;        in (high dword of) divisor
	jz	short special	; divisor = 2**63?

	; perform "extended & adjusted" division

	shld	edx, eax, cl	; edx = divisor / 2**(index + 1)
				;     = divisor'
;;	shl	eax, cl
	push	ebx
	mov	ebx, edx	; ebx = divisor'
ifndef JccLess
	xor	eax, eax	; eax = high dword of quotient' = 0
	mov	edx, [esp+16]	; edx = high dword of dividend
	cmp	edx, ebx
	jb	short @f	; high dword of dividend < divisor'?

	; high dword of dividend >= divisor':
	; subtract divisor' from high dword of dividend to prevent possible
	; division overflow and set most significant bit of quotient"

	sub	edx, ebx	; edx = high dword of dividend - divisor'
				;     = high dword of dividend'
	inc	eax		; eax = high dword of quotient' = 1
@@:
	push	eax		; [esp] = high dword of quotient'
else ; JccLess
	mov	edx, [esp+16]	; edx = high dword of dividend
	cmp	edx, ebx	; CF = (high dword of dividend < divisor')
	sbb	eax, eax	; eax = (high dword of dividend < divisor') ? -1 : 0
	inc	eax		; eax = (high dword of dividend < divisor') ? 0 : 1
				;     = high dword of quotient'
	push	eax		; [esp] = high dword of quotient'
if 0
	neg	eax		; eax = (high dword of dividend < divisor') ? 0 : -1
	and	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
else
	imul	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
endif
	sub	edx, eax	; edx = high dword of dividend
				;     - (high dword of dividend < divisor') ? 0 : divisor'
				;     = high dword of dividend'
endif ; JccLess
	mov	eax, [esp+16]	; eax = low dword of dividend
				;     = low dword of dividend'
	div	ebx		; eax = dividend' / divisor'
				;     = low dword of quotient',
				; edx = remainder'
	pop	ebx		; ebx = high dword of quotient'
	shld	ebx, eax, cl	; ebx = quotient' / 2**(index + 1)
				;     = dividend / divisor'
				;     = quotient"
;;	shl	eax, cl
	mov	eax, [esp+20]	; eax = low dword of divisor
	mul	ebx		; edx:eax = low dword of divisor * quotient"
	imul	ebx, [esp+24]	; ebx = high dword of divisor * quotient"
	add	edx, ebx	; edx:eax = divisor * quotient"
	mov	ecx, [esp+12]
	mov	ebx, [esp+16]	; ebx:ecx = dividend
	sub	ecx, eax
	sbb	ebx, edx	; ebx:ecx = dividend - divisor * quotient"
				;         = remainder"
ifndef JccLess
	jnb	short @f	; remainder" >= 0?
				;  (with borrow, it is off by divisor,
				;   and quotient" is off by 1)
	add	ecx, [esp+20]
	adc	ebx, [esp+24]	; ebx:ecx = remainder" + divisor
				;         = remainder
@@:
	mov	eax, ecx
	mov	edx, ebx	; edx:eax = |remainder|
else ; JccLess
	sbb	eax, eax	; eax = (remainder" < 0) ? -1 : 0
	cdq			; edx = (remainder" < 0) ? -1 : 0
	and	eax, [esp+20]
	and	edx, [esp+24]	; edx:eax = (remainder" < 0) ? divisor : 0
	add	eax, ecx
	adc	edx, ebx	; edx:eax = remainder" + divisor
				;         = |remainder|
endif ; JccLess
	pop	ebx
remainder:
	pop	ecx		; ecx = (remainder < 0) ? -1 : 0
	xor	eax, ecx
	xor	edx, ecx
	sub	eax, ecx
	sbb	edx, ecx	; edx:eax = remainder

	ret	16		; callee restores stack

	; dividend = divisor = -2**63: remainder = 0
special:
	pop	eax		; eax = sign of remainder = -1
	inc	eax
	xor	edx, edx	; edx:eax = remainder = 0

	ret	16		; callee restores stack

_allrem	endp
	end
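
The branch-free (JccLess) correction of the remainder used above turns the borrow into a mask and adds the masked divisor back. A hedged C sketch of that step alone; the name adjusted_remainder() and its parameter quotient2 (the estimate quotient", assumed to be either exact or 1 too large) are hypothetical:

```c
#include <stdint.h>

/* Hypothetical model of the JccLess fix-up; quotient2 is the estimate
   quotient", assumed to be either exact or 1 too large */
uint64_t adjusted_remainder(uint64_t dividend, uint64_t divisor, uint32_t quotient2)
{
    uint64_t product = divisor * quotient2;
    uint64_t remainder2 = dividend - product;   /* remainder", wraps around when negative */
    uint64_t mask = 0 - (uint64_t)(dividend < product);
                                                /* (remainder" < 0) ? -1 : 0 */

    return remainder2 + (divisor & mask);       /* add divisor back only after a borrow */
}
```

Both adjusted_remainder(0x3FFFFFFFC, 0x1FFFFFFFF, 2) and adjusted_remainder(0x3FFFFFFFC, 0x1FFFFFFFF, 1) return 0x1FFFFFFFD: the first call takes the add-back path, the second does not.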

64×64-bit Signed and Unsigned Integer Multiplication (64-bit Product)

__muldi3() alias __umuldi3() function for i386 processors:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.386
	.model	flat, C
	.code
				; [esp+16] = high dword of multiplier
				; [esp+12] = low dword of multiplier
				; [esp+8] = high dword of multiplicand
				; [esp+4] = low dword of multiplicand
__muldi3 proc	public
__umuldi3 proc	public

	push	ebx

	mov	edx, [esp+16]	; edx = low dword of multiplier
	mov	ecx, [esp+12]	; ecx = high dword of multiplicand
	imul	ecx, edx	; ecx = high dword of multiplicand
				;     * low dword of multiplier
	mov	eax, [esp+8]	; eax = low dword of multiplicand
	mov	ebx, [esp+20]	; ebx = high dword of multiplier
	imul	ebx, eax	; ebx = high dword of multiplier
				;     * low dword of multiplicand
	mul	edx		; edx:eax = low dword of multiplicand
				;         * low dword of multiplier
	add	ecx, ebx	; ecx = high dword of multiplicand
				;     * low dword of multiplier
				;     + high dword of multiplier
				;     * low dword of multiplicand
	add	edx, ecx	; edx:eax = product % 2**64

	pop	ebx
	ret

__umuldi3 endp
__muldi3 endp
	end

Microsoft Visual C compiler helper routine _allmul() for i386 processors:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.386
	.model	flat, C
	.code
				; [esp+16] = high dword of multiplier
				; [esp+12] = low dword of multiplier
				; [esp+8] = high dword of multiplicand
				; [esp+4] = low dword of multiplicand
_allmul	proc	public

	mov	eax, [esp+4]	; eax = low dword of multiplicand
	mov	edx, [esp+8]	; edx = high dword of multiplicand
	imul	edx, [esp+12]	; edx = high dword of multiplicand
				;     * low dword of multiplier
	mov	ecx, [esp+16]	; ecx = high dword of multiplier
	imul	ecx, eax	; ecx = high dword of multiplier
				;     * low dword of multiplicand
	add	ecx, edx	; ecx = high dword of multiplier
				;     * low dword of multiplicand
				;     + high dword of multiplicand
				;     * low dword of multiplier
	mul	dword ptr [esp+12]
				; edx:eax = low dword of multiplicand
				;         * low dword of multiplier
	add	edx, ecx	; edx:eax = product % 2**64
	ret	16		; callee restores stack

_allmul	endp
	end
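
Both multiplication listings build the low 64 bits of the product from one widening 32×32-bit multiplication and two truncating cross products. A minimal C sketch of that decomposition, under the assumption that only the product modulo 2**64 is wanted; the name multiply64() is hypothetical:

```c
#include <stdint.h>

/* Hypothetical model: product % 2**64 from 32-bit halves */
uint64_t multiply64(uint64_t multiplicand, uint64_t multiplier)
{
    uint32_t low1 = (uint32_t)multiplicand;
    uint32_t high1 = (uint32_t)(multiplicand >> 32);
    uint32_t low2 = (uint32_t)multiplier;
    uint32_t high2 = (uint32_t)(multiplier >> 32);

    /* widening multiplication of the low dwords (MUL in the listings) */
    uint64_t low = (uint64_t)low1 * low2;

    /* cross products: only their low 32 bits matter (IMUL in the listings) */
    uint32_t cross = (uint32_t)((uint64_t)high1 * low2 + (uint64_t)high2 * low1);

    /* the product of the two high dwords is a multiple of 2**64 and vanishes */
    return low + ((uint64_t)cross << 32);
}
```

Since the truncated product is the same for signed and unsigned operands, a single routine can serve both, as the alias names in the listings indicate.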

64-bit Signed and Unsigned Integer Shift (64-bit Result)

Microsoft Visual C compiler helper routine _allshl() for i386 processors:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

; NOTE: applies shift count % 64

	.386
	.model	flat, C
	.code
				; edx:eax = value
				; ecx = count
_allshl	proc	public

	test	cl, 32
	jnz	short @f	; count > 31?

	shld	edx, eax, cl
	shl	eax, cl

	ret
@@:
	mov	edx, eax
	shl	edx, cl
	xor	eax, eax

	ret

_allshl	endp
	end
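
The effect of testing bit 5 of the count and relying on the processor to mask the count to 5 bits can be modelled in C. This is a sketch with the hypothetical name shift_left64(); C leaves shifts by 32 or more bits of a 32-bit operand undefined, so the model masks the count explicitly and skips the double shift for a count of 0.

```c
#include <stdint.h>

/* Hypothetical model of the left shift; applies count % 64 like the listing */
uint64_t shift_left64(uint64_t value, unsigned count)
{
    uint32_t low = (uint32_t)value;
    uint32_t high = (uint32_t)(value >> 32);
    unsigned bits = count & 31;     /* SHL and SHLD use only the low 5 bits of CL */

    if (count & 32) {               /* count % 64 > 31? */
        high = low << bits;
        low = 0;
    } else if (bits != 0) {         /* SHLD with a count of 0 changes nothing */
        high = (high << bits) | (low >> (32 - bits));
        low <<= bits;
    }
    return ((uint64_t)high << 32) | low;
}
```

Note that shift_left64(5, 64) returns 5 in this model, i.e. the value is left unshifted because the count wraps around to 0.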

Microsoft Visual C compiler helper routine _allshr() for i386 processors:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

; NOTE: applies shift count % 64

	.386
	.model	flat, C
	.code
				; edx:eax = value
				; ecx = count
_allshr	proc	public

	test	cl, 32
	jnz	short @f	; count > 31?

	shrd	eax, edx, cl
	sar	edx, cl

	ret
@@:
	mov	eax, edx
	sar	eax, cl
	sar	edx, 31

	ret

_allshr	endp
	end

Microsoft Visual C compiler helper routine _aullshr() for i386 processors:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

; NOTE: applies shift count % 64

	.386
	.model	flat, C
	.code
				; edx:eax = value
				; ecx = count
_aullshr proc	public

	test	cl, 32
	jnz	short @f	; count > 31?

	shrd	eax, edx, cl
	shr	edx, cl

	ret
@@:
	mov	eax, edx
	shr	eax, cl
	xor	edx, edx

	ret

_aullshr endp
	end

Execution Times (Sustained Reciprocal Throughput)

Measurements were performed on Windows 10 with the benchmark programs presented below, which are available for download in the cabinet file INTEGER.CAB: the console programs *.com measure execution times in nanoseconds, while the console programs *.exe measure processor clock cycles.

The makefile INTEGER.MAK for Microsoft’s NMAKE.EXE performs all of the following steps, albeit with slightly different filenames; it contains the sources presented above and below as inline files and was used to create the cabinet file INTEGER.CAB.

Running ’round in ~~Circles~~ Cycles

The table shows the execution times for 128÷128-bit division, 64÷64-bit division and 64×64-bit multiplication on several processors in clock cycles per function call or instruction; the upper half for the AMD64 platform, the lower half for the i386 platform.

For division, the left columns show the execution times for uniformly distributed 128-bit or 64-bit pseudo-random dividends and divisors, i.e. the (rather unlikely) special case of numbers of (almost) equal magnitude, while the right columns show the execution times for 128-bit to 65-bit or 64-bit to 33-bit pseudo-random dividends and divisors respectively, i.e. the (more likely) general case of numbers of different magnitude.

Execution times for 128÷128-bit and 64÷64-bit division in processor clock cycles: AMD64 platform
	128÷128-bit division						64÷64-bit division
	`__udivmodti4()`		`__udivmodti4()`		`__udivmodti4()`		`__udivmoddi4()`		`DIV`
	eSKamation		LLVM		eSKamation		eSKamation		AMD, Intel
AMD EPYC^™ 7713	9	10	25	130	11	33	13	23	3	3
AMD Ryzen^™9 3900XT	19	19	39	190	20	56	21	39	13	16
AMD Ryzen^™5 3600	20	20	41	201	21	59	22	41	14	17
AMD Ryzen^™7 2700X	20	19	44	212	23	63	21	41	14	17
AMD Radeon^™R3	38	41	56	300	26	78	25	59	11	22
Intel Core i5-9500	74	t.b.s.	t.b.s.	t.b.s.	15	t.b.s.	13	t.b.s.	30	t.b.s.
Intel Core i7-8550U	31	32	24	122	11	37	12	21	28	28
Intel Core i5-7400	55	56	41	214	18	65	15	32	28	29
Intel Core i5-6600	91	t.b.s.	t.b.s.	t.b.s.	20	t.b.s.	16	t.b.s.	37	t.b.s.
Intel Core i5-4670	53	55	44	217	22	74	20	39	31	32
Intel Core^™2 Duo P8700	60	60	62	296	27	117	33	60	28	29

Execution times for 64÷64-bit division and multiplication in processor clock cycles: i386 platform
	64÷64-bit division								64×64-bit multiplication
	`__udivmoddi4()`		`__udivmoddi4()`		`_aulldvrm()`		`_aulldvrm()`		`_aullmul()`	`__umuldi3()`	`__umuldi3()`
	eSKamation		LLVM		eSKamation		Microsoft		Microsoft	LLVM	eSKamation
AMD EPYC^™ 7713	9	8	34	63	11	10	47	28	3	4	•
AMD Ryzen^™9 3900XT	17	13	53	99	19	14	70	41	4	7	•
AMD Ryzen^™5 3600	18	13	56	105	20	14	73	44	3	7	•
AMD Ryzen^™7 2700X	19	15	61	114	22	17	82	58	5	9	1
AMD Radeon^™R3	42	29	79	165	45	31	101	71	6	13	1
Intel Core i5-9500	16	t.b.s.	t.b.s.	t.b.s.	t.b.s.	t.b.s.	101	t.b.s.	5	t.b.s.	•
Intel Core i7-8550U	10	9	26	52	11	9	72	46	3	5	1
Intel Core i5-7400	16	14	41	83	19	15	115	78	5	8	1
Intel Core i5-6600	19	t.b.s.	t.b.s.	t.b.s.	t.b.s.	t.b.s.	125	t.b.s.	5	t.b.s.	•
Intel Core i5-4670	21	19	50	97	24	19	124	84	8	10	1
Intel Core^™2 Duo P8700	25	19	82	145	30	24	136	98	9	18	1

Note: the deviation of the measurements for my own __udivmoddi4() and _aulldvrm() division routines is due to their different calling conventions.

Summary

The benchmarks can be summarised as follows:

The extended precision algorithm shows (almost) constant runtime, independent of the magnitude of dividend and divisor, and the best overall performance.
On AMD Ryzen processors, the 128÷128-bit division routine using the hybrid algorithm is always slower than the __udivmodti4() routine using the extended precision algorithm – more than 3 times slower with dividend and divisor of (large) different magnitude!
Contrary to this, on Intel Core processors, the __udivmodti4() 128÷128-bit division routine using the hybrid algorithm is up to 3 times faster than the routine using the extended precision algorithm – but only with dividend and divisor of (nearly) equal magnitude, and slower otherwise.
On modern Core processors, the __udivmoddi4() 64÷64-bit division routines presented above run twice as fast as their native 64-bit DIV instruction, especially in 32-bit mode!
The 64÷64-bit division routine _aulldvrm() for 32-bit i386 processors, which Microsoft dares to ship with Windows, for example in NTDLL.DLL and their MSVCRT libraries, sucks: it is 4 to 6 times slower than a properly implemented division routine!
The same holds for their 64×64-bit multiplication routine _allmul() alias _aullmul() for 32-bit i386 processors, which consumes up to 9 clock cycles.
The 64×64-bit multiplication routine __muldi3() shipped by LLVM in their clang_rt.builtins-i386.lib library is even worse and wastes up to 18 clock cycles, while their 64÷64-bit division routine __udivmoddi4() and their 128÷128-bit division routine __udivmodti4() suck similarly: they are 3 to 13 (in words: thirteen) times slower than properly implemented division routines!
In their own ~~true~~ false words:
The builtins library provides optimized implementations of this and other low-level routines, either in target-independent C form, or as a heavily-optimized assembly.
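The factors quoted in this summary follow directly from the table entries. As a minimal sketch (the helper slowdown() is hypothetical): the right columns for the AMD EPYC 7713 on the AMD64 platform give 130 / 10 = 13 for the __udivmodti4() routine of LLVM, and the left columns for the Intel Core i5-7400 on the i386 platform give 115 / 19 ≈ 6 for the _aulldvrm() routine of Microsoft.

```c
// Hypothetical helper: ratio of two execution times in clock cycles,
// e.g. a shipped routine versus the routine presented here.
static double slowdown(unsigned slower, unsigned faster)
{
    return (double) slower / (double) faster;
}
```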

Benchmark Programs for AMD64 Processors

The first of the following two C programs for 64-bit processors measures the execution time for one billion divisions of uniformly distributed 128-bit pseudo-random numbers and for one billion divisions of pseudo-random numbers in the interval from 2¹²⁸−1 to 2⁶⁴ with the __udivmodti4() function.

With the preprocessor macro NATIVE defined, the second C program measures the execution time for one billion divisions of uniformly distributed 64-bit pseudo-random numbers and one billion divisions of pseudo-random numbers in the interval from 2⁶⁴−1 to 2³² with the DIV instruction, else with the shift & subtract algorithm, both disguised as the __udivmoddi4() function.

Note: with the preprocessor macro CYCLES defined, both programs measure the execution time in processor clock cycles and run on 64-bit editions of Windows Vista^® and newer versions, else they measure the execution time in nano-seconds and run on 64-bit editions of Windows^™ XP and newer versions.
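The listings that follow time one loop that calls the no-op routine __unopti4() and one loop that calls the routine under test, then subtract the former from the latter. Since each loop performs one billion calls, the difference in seconds equals the net time per call in nano-seconds (with cycle counts, the difference divided by one billion equals the clock cycles per call). A minimal sketch of this arithmetic, assuming a hypothetical helper net_ns_per_call() which is not part of the programs:

```c
#include <stdint.h>

// Hypothetical helper: net time per call in nano-seconds, given the ticks
// spent in the measured loop, the ticks spent in the no-op loop, the tick
// rate of the timer and the number of calls per loop.
static double net_ns_per_call(uint64_t measured, uint64_t baseline,
                              uint64_t ticks_per_second, uint64_t calls)
{
    const double seconds = (double) (measured - baseline)
                         / (double) ticks_per_second;

    return seconds * 1e9 / (double) calls;
}
```

With the 100 ns ticks returned by GetThreadTimes(), i.e. 10,000,000 ticks per second, a measured loop of 30,000,000 ticks and a no-op loop of 10,000,000 ticks yield 2 nano-seconds per call for one billion calls.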

// Copyright © 2018-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#include <stdint.h>
#include <stdio.h>
#include <time.h>

extern
__uint128_t __udivmodti4(__uint128_t dividend, __uint128_t divisor, __uint128_t *remainder);

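// no-op stand-in with the signature of __udivmodti4(): timing it yields the
//  overhead of loop, pseudo-random number generation and function call
__attribute__ ((noinline))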
static
__uint128_t __unopti4(__uint128_t dividend, __uint128_t divisor, __uint128_t *remainder)
{
    if (remainder != NULL)
        *remainder = divisor;

    return dividend;
}

__attribute__ ((always_inline))
static
__uint128_t lfsr128r(void)
{
    // 128-bit linear feedback shift register (Galois form) using
    //  primitive polynomial 0xB64E4D3FA8E7331B:D871FA30D46D4DBA,
    //   initialised with bit-vector of prime numbers:
    //    2**prime is set for each prime in [0, 127]

    static __uint128_t lfsr = (__uint128_t) 0x800228A202088288 << 64 | 0x28208A20A08A28AC;
    const  __uint128_t poly = (__uint128_t) 0xB64E4D3FA8E7331B << 64 | 0xD871FA30D46D4DBA;
    const  __uint128_t mask = 0 - (lfsr & 1);

    return lfsr = (lfsr >> 1) ^ (poly & mask);
}

__attribute__ ((always_inline))
static
__uint128_t lfsr128l(void)
{
    // 128-bit linear feedback shift register (Galois form) using
    //  primitive polynomial 0x5DB2B62B0C5F8E1B:D8CCE715FCB2726D,
    //   initialised with 2**128 / golden ratio

    static __uint128_t lfsr = (__uint128_t) 0x9E3779B97F4A7C15 << 64 | 0xF39CC0605CEDC834;
    const  __uint128_t poly = (__uint128_t) 0x5DB2B62B0C5F8E1B << 64 | 0xD8CCE715FCB2726D;
    const  __uint128_t sign = (__int128_t) lfsr >> 127;

    return lfsr = (lfsr << 1) ^ (poly & sign);
}

__attribute__ ((always_inline))
static
__uint128_t lfsr64(void)
{
    // 64-bit linear feedback shift register (Galois form) using
    //  primitive polynomial 0xAD93D23594C935A9 (CRC-64 "Jones"),
    //   initialised with 2**64 / golden ratio

    static uint64_t lfsr = 0x9E3779B97F4A7C15;
    const  uint64_t sign = (int64_t) lfsr >> 63;

    return lfsr = (lfsr << 1) ^ (0xAD93D23594C935A9 & sign);
}

__attribute__ ((always_inline))
static
__uint128_t lfsr32(void)
{
    // 32-bit linear feedback shift register (Galois form) using
    //  primitive polynomial 0xDB710641 (CRC-32 IEEE),
    //   initialised with 2**32 / golden ratio

    static uint32_t lfsr = 0x9E3779B9;
    const  uint32_t sign = (int32_t) lfsr >> 31;

    return lfsr = (lfsr << 1) ^ (0xDB710641 & sign);
}

int main(void)
{
    clock_t t0, t1, t2, tt;
    uint32_t m, n;
    __uint128_t dividend, divisor = ~0, remainder;
    volatile __uint128_t quotient;

    for (m = 0u; m < 64u; m += m + 1u)
    {
        t0 = clock();

        for (n = 500000000u; n > 0u; n--)
        {
            dividend = lfsr128l();
            dividend >>= dividend & m;
            quotient = __unopti4(dividend, divisor, NULL);
            divisor = lfsr128r();
            divisor >>= divisor & m;
            quotient = __unopti4(dividend, divisor, &remainder);
        }

        t1 = clock();

        for (n = 500000000u; n > 0u; n--)
        {
            dividend = lfsr128l();
            dividend >>= dividend & m;
            quotient = __udivmodti4(dividend, divisor, NULL);
            divisor = lfsr128r();
            divisor >>= divisor & m;
            quotient = __udivmodti4(dividend, divisor, &remainder);
        }

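        t2 = clock();
        tt = t2 - t0;   // total time of both loops
        t2 -= t1;       // time of the __udivmodti4() loop
        t1 -= t0;       // time of the __unopti4() loop (overhead)
        t0 = t2 - t1;   // net time of the divisions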

        printf("\n"
               "__unopti4()       %4lu.%06lu       0\n"
               "__udivmodti4()    %4lu.%06lu    %4lu.%06lu\n"
               "                  %4lu.%06lu nano-seconds\n",
               t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
               t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
               t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
               tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);
    }

    t0 = clock();

    for (n = 500000000u; n > 0u; n--)
    {
        dividend = lfsr128l();
        quotient = __unopti4(dividend, divisor, NULL);
        divisor = lfsr64();
        quotient = __unopti4(dividend, divisor, &remainder);
    }

    t1 = clock();

    for (n = 500000000u; n > 0u; n--)
    {
        dividend = lfsr128l();
        quotient = __udivmodti4(dividend, divisor, NULL);
        divisor = lfsr64();
        quotient = __udivmodti4(dividend, divisor, &remainder);
    }

    t2 = clock();
    tt = t2 - t0;
    t2 -= t1;
    t1 -= t0;
    t0 = t2 - t1;

    printf("\n"
           "__unopti4()       %4lu.%06lu       0\n"
           "__udivmodti4()    %4lu.%06lu    %4lu.%06lu\n"
           "                  %4lu.%06lu nano-seconds\n",
           t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);

    t0 = clock();

    for (n = 500000000u; n > 0u; n--)
    {
        dividend = lfsr128l();
        quotient = __unopti4(dividend, divisor, NULL);
        divisor = lfsr32();
        quotient = __unopti4(dividend, divisor, &remainder);
    }

    t1 = clock();

    for (n = 500000000u; n > 0u; n--)
    {
        dividend = lfsr128l();
        quotient = __udivmodti4(dividend, divisor, NULL);
        divisor = lfsr32();
        quotient = __udivmodti4(dividend, divisor, &remainder);
    }

    t2 = clock();
    tt = t2 - t0;
    t2 -= t1;
    t1 -= t0;
    t0 = t2 - t1;

    printf("\n"
           "__unopti4()       %4lu.%06lu       0\n"
           "__udivmodti4()    %4lu.%06lu    %4lu.%06lu\n"
           "                  %4lu.%06lu nano-seconds\n",
           t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);

    t0 = clock();

    for (n = 500000000u; n > 0u; n--)
    {
        dividend = lfsr64();
        quotient = __unopti4(dividend, divisor, NULL);
        divisor = lfsr32();
        quotient = __unopti4(dividend, divisor, &remainder);
    }

    t1 = clock();

    for (n = 500000000u; n > 0u; n--)
    {
        dividend = lfsr64();
        quotient = __udivmodti4(dividend, divisor, NULL);
        divisor = lfsr32();
        quotient = __udivmodti4(dividend, divisor, &remainder);
    }

    t2 = clock();
    tt = t2 - t0;
    t2 -= t1;
    t1 -= t0;
    t0 = t2 - t1;

    printf("\n"
           "__unopti4()       %4lu.%06lu       0\n"
           "__udivmodti4()    %4lu.%06lu    %4lu.%06lu\n"
           "                  %4lu.%06lu nano-seconds\n",
           t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);

    t0 = clock();

    for (n = 500000u; n > 0u; n--)
    {
        quotient = __unopti4(~0, 3, NULL);
        quotient = __unopti4(~0, 3, &remainder);
    }

    t1 = clock();

    for (n = 500000u; n > 0u; n--)
    {
        quotient = __udivmodti4(~0, 3, NULL);
        quotient = __udivmodti4(~0, 3, &remainder);
    }

    t2 = clock();
    tt = t2 - t0;
    t2 -= t1;
    t1 -= t0;
    t0 = t2 - t1;

    printf("\n"
           "__unopti4()       %4lu.%06lu       0\n"
           "__udivmodti4()    %4lu.%06lu    %4lu.%06lu\n"
           "                  %4lu.%06lu micro-seconds\n",
           t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
           tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);
}
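The pseudo-random number generators used above are linear feedback shift registers in Galois form. Their principle can be checked on a small scale with the following sketch of a right-shifting 16-bit register. The feedback mask 0xB400 (polynomial x¹⁶ + x¹⁴ + x¹³ + x¹¹ + 1) and the helpers lfsr16r() and lfsr16r_period() are assumptions of this illustration, not part of the benchmark program.

```c
#include <stdint.h>

// One step of a right-shifting 16-bit Galois LFSR: the bit shifted out
// selects whether the feedback mask is XORed into the register.
static uint16_t lfsr16r(uint16_t lfsr)
{
    const uint16_t mask = (uint16_t) (0u - (lfsr & 1u));

    return (uint16_t) ((lfsr >> 1) ^ (0xB400u & mask));
}

// Number of steps until the start state recurs.
static uint32_t lfsr16r_period(uint16_t start)
{
    uint16_t lfsr = start;
    uint32_t period = 0u;

    do
    {
        lfsr = lfsr16r(lfsr);
        period++;
    } while (lfsr != start);

    return period;
}
```

Since the polynomial is primitive, every non-zero start state runs through all 2¹⁶−1 = 65535 non-zero states before it recurs; the 128-bit, 64-bit and 32-bit registers above rely on the same property.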

// Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifndef _M_AMD64
#pragma message("For AMD64 platform only!")
#endif

#define STRICT
#undef UNICODE
#define WIN32_LEAN_AND_MEAN

#include <windows.h>

typedef	ULONGLONG	QWORD;

typedef	struct
{
	QWORD	qwLow, qwHigh;
} OWORD;

const	struct
{
	OWORD	owDividend, owDivisor, owQuotient, owRemainder;
} owTable[] = {{0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL},
               {0ULL, 0ULL, 1ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL},
               {0ULL, 0ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL},
               {0ULL, 0ULL, 0ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 0ULL},
               {0ULL, 0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 0ULL},
               {0ULL, 0ULL, 0ULL, 1ULL, 0ULL, 0ULL, 0ULL, 0ULL},
               {0ULL, 0ULL, 1ULL, 1ULL, 0ULL, 0ULL, 0ULL, 0ULL},
               {0ULL, 0ULL, ~0ULL, 1ULL, 0ULL, 0ULL, 0ULL, 0ULL},
               {0ULL, 0ULL, 1ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 0ULL},
               {1ULL, 0ULL, 1ULL, 0ULL, 1ULL, 0ULL, 0ULL, 0ULL},
               {1ULL, 0ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 1ULL, 0ULL},
               {1ULL, 0ULL, 0ULL, 1ULL, 0ULL, 0ULL, 1ULL, 0ULL},
               {1ULL, 0ULL, 1ULL, 1ULL, 0ULL, 0ULL, 1ULL, 0ULL},
               {1ULL, 0ULL, ~0ULL, 1ULL, 0ULL, 0ULL, 1ULL, 0ULL},
               {1ULL, 0ULL, 0ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 0ULL},
               {1ULL, 0ULL, 1ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 0ULL},
               {1ULL, 0ULL, ~1ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 0ULL},
               {1ULL, 0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 0ULL},
               {2ULL, 0ULL, 1ULL, 0ULL, 2ULL, 0ULL, 0ULL, 0ULL},
               {2ULL, 0ULL, 2ULL, 0ULL, 1ULL, 0ULL, 0ULL, 0ULL},
               {2ULL, 0ULL, ~1ULL, ~0ULL, 0ULL, 0ULL, 2ULL, 0ULL},
               {2ULL, 0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 2ULL, 0ULL},
               {~0ULL, 0ULL, 1ULL, 0ULL, ~0ULL, 0ULL, 0ULL, 0ULL},
               {~0ULL, 0ULL, ~0ULL, 0ULL, 1ULL, 0ULL, 0ULL, 0ULL},
               {~0ULL, 0ULL, 0ULL, 1ULL, 0ULL, 0ULL, ~0ULL, 0ULL},
               {~0ULL, 0ULL, 1ULL, 1ULL, 0ULL, 0ULL, ~0ULL, 0ULL},
               {~0ULL, 0ULL, ~0ULL, 1ULL, 0ULL, 0ULL, ~0ULL, 0ULL},
               {~0ULL, 0ULL, 0ULL, ~0ULL, 0ULL, 0ULL, ~0ULL, 0ULL},
               {~0ULL, 0ULL, 1ULL, ~0ULL, 0ULL, 0ULL, ~0ULL, 0ULL},
               {~0ULL, 0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, ~0ULL, 0ULL},
               {0ULL, 1ULL, 1ULL, 0ULL, 0ULL, 1ULL, 0ULL, 0ULL},
               {0ULL, 1ULL, ~0ULL, 0ULL, 1ULL, 0ULL, 1ULL, 0ULL},
               {0ULL, 1ULL, 0ULL, 1ULL, 1ULL, 0ULL, 0ULL, 0ULL},
               {0ULL, 1ULL, 1ULL, 1ULL, 0ULL, 0ULL, 0ULL, 1ULL},
               {0ULL, 1ULL, ~0ULL, 1ULL, 0ULL, 0ULL, 0ULL, 1ULL},
               {0ULL, 1ULL, 0ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 1ULL},
               {0ULL, 1ULL, 1ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 1ULL},
               {0ULL, 1ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 1ULL},
               {1ULL, 1ULL, 1ULL, 0ULL, 1ULL, 1ULL, 0ULL, 0ULL},
               {1ULL, 1ULL, ~0ULL, 0ULL, 1ULL, 0ULL, 2ULL, 0ULL},
               {1ULL, 1ULL, 0ULL, 1ULL, 1ULL, 0ULL, 1ULL, 0ULL},
               {1ULL, 1ULL, 1ULL, 1ULL, 1ULL, 0ULL, 0ULL, 0ULL},
               {1ULL, 1ULL, ~0ULL, 1ULL, 0ULL, 0ULL, 1ULL, 1ULL},
               {1ULL, 1ULL, 0ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 1ULL},
               {1ULL, 1ULL, 1ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 1ULL},
               {1ULL, 1ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 1ULL},
               {~0ULL, 1ULL, 1ULL, 0ULL, ~0ULL, 1ULL, 0ULL, 0ULL},
               {~0ULL, 1ULL, ~0ULL, 0ULL, 2ULL, 0ULL, 1ULL, 0ULL},
               {~0ULL, 1ULL, 0ULL, 1ULL, 1ULL, 0ULL, ~0ULL, 0ULL},
               {~0ULL, 1ULL, 1ULL, 1ULL, 1ULL, 0ULL, ~1ULL, 0ULL},
               {~0ULL, 1ULL, ~0ULL, 1ULL, 1ULL, 0ULL, 0ULL, 0ULL},
               {~0ULL, 1ULL, 0ULL, ~0ULL, 0ULL, 0ULL, ~0ULL, 1ULL},
               {~0ULL, 1ULL, 1ULL, ~0ULL, 0ULL, 0ULL, ~0ULL, 1ULL},
               {~0ULL, 1ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, ~0ULL, 1ULL},
               {~0ULL, 0xFULL, 0xFULL, 0ULL, 0x1111111111111111ULL, 1ULL, 0ULL, 0ULL},
               {~0xFULL, ~1ULL, ~1ULL, ~0ULL, 0ULL, 0ULL, ~0xFULL, ~1ULL},
               {0ULL, ~0ULL, 1ULL, 0ULL, 0ULL, ~0ULL, 0ULL, 0ULL},
               {0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 0ULL, 0ULL},
               {0ULL, ~0ULL, 0ULL, 1ULL, ~0ULL, 0ULL, 0ULL, 0ULL},
               {0ULL, ~0ULL, 1ULL, 1ULL, ~1ULL, 0ULL, 2ULL, 0ULL},
               {0ULL, ~0ULL, 0ULL, ~0ULL, 1ULL, 0ULL, 0ULL, 0ULL},
               {0ULL, ~0ULL, 1ULL, ~0ULL, 0ULL, 0ULL, 0ULL, ~0ULL},
               {0ULL, ~0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 0ULL, ~0ULL},
               {1ULL, ~0ULL, 1ULL, 0ULL, 1ULL, ~0ULL, 0ULL, 0ULL},
               {1ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 1ULL, 0ULL},
               {1ULL, ~0ULL, 0ULL, 1ULL, ~0ULL, 0ULL, 1ULL, 0ULL},
               {1ULL, ~0ULL, 1ULL, 1ULL, ~1ULL, 0ULL, 3ULL, 0ULL},
               {1ULL, ~0ULL, 0ULL, ~0ULL, 1ULL, 0ULL, 1ULL, 0ULL},
               {1ULL, ~0ULL, 1ULL, ~0ULL, 1ULL, 0ULL, 0ULL, 0ULL},
               {1ULL, ~0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 1ULL, ~0ULL},
               {~0ULL, ~0ULL, 1ULL, 0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL},
               {~0ULL, ~0ULL, ~0ULL, 0ULL, 1ULL, 1ULL, 0ULL, 0ULL},
               {~0ULL, ~0ULL, 0ULL, 1ULL, ~0ULL, 0ULL, ~0ULL, 0ULL},
               {~0ULL, ~0ULL, 1ULL, 1ULL, ~0ULL, 0ULL, 0ULL, 0ULL},
               {~0ULL, ~0ULL, 1ULL, 3ULL, 0x5555555555555555ULL, 0ULL, 0xAAAAAAAAAAAAAAAAULL, 0ULL},
               {~0ULL, ~0ULL, 0ULL, ~0ULL, 1ULL, 0ULL, ~0ULL, 0ULL},
               {~0ULL, ~0ULL, 1ULL, ~0ULL, 1ULL, 0ULL, ~1ULL, 0ULL},
               {~0ULL, ~0ULL, ~0ULL, ~0ULL, 1ULL, 0ULL, 0ULL, 0ULL},
               {~0ULL, ~0ULL, ~1ULL, ~0ULL, 1ULL, 0ULL, 1ULL, 0ULL},
               {0xBF25975319080000ULL, 0x530EDA741C71D4C3ULL, 0x14C34AB4676E4BABULL, 0x0000004DE2CAB081ULL, 0x0000000001110001ULL, 0ULL, 0x00000000003EB455ULL, 0ULL},
               {0xBF25975319080000ULL, 0x530EDA741C71D4C3ULL, 0x0000000001110001ULL, 0ULL, 0x14C34AB4676E4BABULL, 0x0000004DE2CAB081ULL, 0x00000000003EB455ULL, 0ULL}};

#pragma intrinsic(_umul128)

__declspec(noinline)
OWORD	__umulti3(OWORD owMultiplicand, OWORD owMultiplier)
{
	OWORD	owProduct;

	owProduct.qwLow = _umul128(owMultiplicand.qwLow, owMultiplier.qwLow, &owProduct.qwHigh);
	owProduct.qwHigh += owMultiplicand.qwLow * owMultiplier.qwHigh
	                  + owMultiplicand.qwHigh * owMultiplier.qwLow;

	return owProduct;
}

__declspec(noinline)
OWORD	__unopti4(OWORD owDividend, OWORD owDivisor, OWORD *owRemainder)
{
	if (owRemainder != NULL)
		*owRemainder = owDivisor;

	return owDividend;
}

OWORD	__udivmodti4(OWORD owDividend, OWORD owDivisor, OWORD *owRemainder);

#pragma intrinsic(__shiftleft128, __shiftright128)

__forceinline
VOID	lfsr128l(OWORD *ow)
{
#ifndef XORSHIFT
	// 128-bit linear feedback shift register (Galois form) using
	//  primitive polynomial 0x5DB2B62B0C5F8E1B:D8CCE715FCB2726D

	QWORD	qw = (LONGLONG) ow->qwHigh >> 63;
	ow->qwHigh = __shiftleft128(ow->qwLow, ow->qwHigh, 1)
	              ^ (qw & 0x5DB2B62B0C5F8E1BULL);
	ow->qwLow = (qw & 0xD8CCE715FCB2726DULL) ^ (ow->qwLow << 1);
#elif 1
	// 128-bit linear feedback shift register (XorShift form)
	//  using shift constants from Richard Peirce Brent

	QWORD	qw = ow->qwHigh;
	ow->qwHigh = ow->qwLow;
	ow->qwLow ^= ow->qwLow << 33;
	qw ^= qw << 28;
	ow->qwLow ^= ow->qwLow >> 31;
	qw ^= qw >> 29;
	ow->qwLow ^= qw;
#else
	// 128-bit linear feedback shift register (XorShift form)
	//  using shift constants from Melissa O'Neill

	ow->qwHigh ^= __shiftleft128(ow->qwLow, ow->qwHigh, 26);
	ow->qwLow ^= ow->qwLow << 26;
	ow->qwLow ^= __shiftright128(ow->qwLow, ow->qwHigh, 61);
	ow->qwHigh ^= ow->qwHigh >> 61;
	ow->qwHigh ^= __shiftleft128(ow->qwLow, ow->qwHigh, 37);
	ow->qwLow ^= ow->qwLow << 37;
#endif
}

__forceinline
VOID	lfsr128r(OWORD *ow)
{
#ifndef XORSHIFT
	// 128-bit linear feedback shift register (Galois form) using
	//  primitive polynomial 0xB64E4D3FA8E7331B:D871FA30D46D4DBA

	QWORD	qw = 0ULL - (ow->qwLow & 1ULL);
	ow->qwLow = __shiftright128(ow->qwLow, ow->qwHigh, 1)
	             ^ (qw & 0xD871FA30D46D4DBAULL);
	ow->qwHigh = (qw & 0xB64E4D3FA8E7331BULL) ^ (ow->qwHigh >> 1);
#elif 1
	// 128-bit linear feedback shift register (XorShift form)
	//  using shift constants from Sebastiano Vigna

	QWORD	qw = ow->qwHigh;
	ow->qwHigh = ow->qwLow;
	qw ^= qw << 23;
	ow->qwLow ^= ow->qwLow >> 26;
	qw ^= qw >> 17;
	ow->qwLow ^= qw;
#else
	// 128-bit linear feedback shift register (XorShift form)
	//  using shift constants from Melissa O'Neill

	ow->qwHigh ^= __shiftleft128(ow->qwLow, ow->qwHigh, 11);
	ow->qwLow ^= ow->qwLow << 11;
	ow->qwLow ^= __shiftright128(ow->qwLow, ow->qwHigh, 61);
	ow->qwHigh ^= ow->qwHigh >> 61;
	ow->qwHigh ^= __shiftleft128(ow->qwLow, ow->qwHigh, 45);
	ow->qwLow ^= ow->qwLow << 45;
#endif
}

__forceinline
VOID	scale128(OWORD *owOut, OWORD *owIn)
{
	owOut->qwLow = __shiftright128(owIn->qwLow, owIn->qwHigh, owIn->qwLow /* & 63 */);
	owOut->qwHigh = owIn->qwHigh >> (owIn->qwLow /* & 63 */);
}

__declspec(safebuffers)
BOOL	CDECL	PrintFormat(HANDLE hOutput, [SA_FormatString(Style="printf")] LPCSTR lpFormat, ...)
{
	CHAR	szFormat[1024];
	DWORD	dwFormat;
	DWORD	dwOutput;

	va_list	vaInput;
	va_start(vaInput, lpFormat);

	dwFormat = wvsprintf(szFormat, lpFormat, vaInput);

	va_end(vaInput);

	if ((dwFormat == 0UL)
	 || !WriteFile(hOutput, szFormat, dwFormat, &dwOutput, (LPOVERLAPPED) NULL))
		return FALSE;

	return dwOutput == dwFormat;
}

__declspec(noreturn)
__declspec(safebuffers)
VOID	CDECL	mainCRTStartup(VOID)
{
	DWORD	dw, dwCPUID[12];

	QWORD	qwT0, qwT1, qwT2, qwT3;
	QWORD	qwTx, qwTy, qwTz;

	OWORD	owDividend, owDivisor = {~0ULL, ~0ULL}, owQuotient, owRemainder;
		// 2**128 / square root of 2
	OWORD	owLeft = {0x597D89B3754ABE9FULL, 0xB504F333F9DE6484ULL};
		// 2**128 / square root of 3
	OWORD	owRight = {0x0C7C0F257D92BE83ULL, 0x93CD3A2C8198E269ULL};

	HANDLE	hThread = GetCurrentThread();
	HANDLE	hOutput = GetStdHandle(STD_OUTPUT_HANDLE);

	if (hOutput == INVALID_HANDLE_VALUE)
		ExitProcess(GetLastError());

	if (SetThreadIdealProcessor(hThread, 0UL) == -1L)
		ExitProcess(GetLastError());

	__cpuid(dwCPUID, 0x80000000UL);

	if (*dwCPUID >= 0x80000004UL)
	{
		__cpuid(dwCPUID, 0x80000002UL);
		__cpuid(dwCPUID + 4, 0x80000003UL);
		__cpuid(dwCPUID + 8, 0x80000004UL);
	}
	else
		__movsb(dwCPUID, "undetermined processor", sizeof("undetermined processor"));

	PrintFormat(hOutput, "\r\nTesting 128-bit division...\r\n");

	for (dw = 1UL; dw < sizeof(owTable) / sizeof(*owTable); dw++)
	{
		PrintFormat(hOutput, "\r%lu", dw);
#if 0
		if ((owTable[dw].owDivisor.qwHigh | owTable[dw].owDivisor.qwLow) == 0ULL)
			continue;
#endif
		owQuotient = __udivmodti4(owTable[dw].owDividend, owTable[dw].owDivisor, &owRemainder);

		if ((owQuotient.qwHigh != owTable[dw].owQuotient.qwHigh)
		 || (owQuotient.qwLow != owTable[dw].owQuotient.qwLow))
			PrintFormat(hOutput,
			            "\t0x%016I64X:%016I64X\a / %016I64X:%016I64X\r\n"
			            "\t0x%016I64X:%016I64X\r\n"
			            "\t0x%016I64X:%016I64X\r\n",
			            owTable[dw].owDividend.qwHigh, owTable[dw].owDividend.qwLow,
			            owTable[dw].owDivisor.qwHigh, owTable[dw].owDivisor.qwLow,
			            owTable[dw].owQuotient.qwHigh, owTable[dw].owQuotient.qwLow,
			            owQuotient.qwHigh, owQuotient.qwLow);

		if ((owRemainder.qwHigh != owTable[dw].owRemainder.qwHigh)
		 || (owRemainder.qwLow != owTable[dw].owRemainder.qwLow))
			PrintFormat(hOutput,
			            "\t0x%016I64X:%016I64X\a %% %016I64X:%016I64X\r\n"
			            "\t0x%016I64X:%016I64X\r\n"
			            "\t0x%016I64X:%016I64X\r\n",
			            owTable[dw].owDividend.qwHigh, owTable[dw].owDividend.qwLow,
			            owTable[dw].owDivisor.qwHigh, owTable[dw].owDivisor.qwLow,
			            owTable[dw].owRemainder.qwHigh, owTable[dw].owRemainder.qwLow,
			            owRemainder.qwHigh, owRemainder.qwLow);
	}

	PrintFormat(hOutput, "\r\nTiming 128-bit division on %.48hs\r\n", dwCPUID);
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT0))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT0))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		lfsr128l(&owLeft);
		owQuotient = __unopti4(owLeft, owRight, NULL);
		lfsr128r(&owRight);
		owQuotient = __unopti4(owLeft, owRight, &owRemainder);
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT1))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT1))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		lfsr128l(&owLeft);
		owQuotient = __udivmodti4(owLeft, owRight, NULL);
		lfsr128r(&owRight);
		owQuotient = __udivmodti4(owLeft, owRight, &owRemainder);
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT2))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT2))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		lfsr128l(&owLeft);
		owQuotient = __umulti3(owLeft, owRight);
		lfsr128r(&owRight);
		owQuotient = __umulti3(owLeft, owRight);
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT3))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT3))
#endif
		ExitProcess(GetLastError());

	qwTz = qwT3 - qwT0;
	qwT3 -= qwT2;
	qwT2 -= qwT1;
	qwT1 -= qwT0;
	qwTy = qwT3 - qwT1;
	qwTx = qwT2 - qwT1;
#ifdef CYCLES
	PrintFormat(hOutput,
	            "\r\n"
	            "__unopti4()     %6I64u.%09I64u      0\r\n"
	            "__udivmodti4()  %6I64u.%09I64u %6I64u.%09I64u\r\n"
	            "__umulti3()     %6I64u.%09I64u %6I64u.%09I64u\r\n"
	            "                %6I64u.%09I64u clock cycles\r\n",
	            qwT1 / 1000000000ULL, qwT1 % 1000000000ULL,
	            qwT2 / 1000000000ULL, qwT2 % 1000000000ULL,
	            qwTx / 1000000000ULL, qwTx % 1000000000ULL,
	            qwT3 / 1000000000ULL, qwT3 % 1000000000ULL,
	            qwTy / 1000000000ULL, qwTy % 1000000000ULL,
	            qwTz / 1000000000ULL, qwTz % 1000000000ULL);
#else
	PrintFormat(hOutput,
	            "\r\n"
	            "__unopti4()     %6I64u.%07I64u      0\r\n"
	            "__udivmodti4()  %6I64u.%07I64u %6I64u.%07I64u\r\n"
	            "__umulti3()     %6I64u.%07I64u %6I64u.%07I64u\r\n"
	            "                %6I64u.%07I64u nano-seconds\r\n",
	            qwT1 / 10000000ULL, qwT1 % 10000000ULL,
	            qwT2 / 10000000ULL, qwT2 % 10000000ULL,
	            qwTx / 10000000ULL, qwTx % 10000000ULL,
	            qwT3 / 10000000ULL, qwT3 % 10000000ULL,
	            qwTy / 10000000ULL, qwTy % 10000000ULL,
	            qwTz / 10000000ULL, qwTz % 10000000ULL);
#endif
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT0))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT0))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		lfsr128l(&owLeft);
		scale128(&owDividend, &owLeft);
		owQuotient = __unopti4(owDividend, owDivisor, NULL);
		lfsr128r(&owRight);
		scale128(&owDivisor, &owRight);
		owQuotient = __unopti4(owDividend, owDivisor, &owRemainder);
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT1))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT1))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		lfsr128l(&owLeft);
		scale128(&owDividend, &owLeft);
		owQuotient = __udivmodti4(owDividend, owDivisor, NULL);
		lfsr128r(&owRight);
		scale128(&owDivisor, &owRight);
		owQuotient = __udivmodti4(owDividend, owDivisor, &owRemainder);
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT2))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT2))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		lfsr128l(&owLeft);
		scale128(&owDividend, &owLeft);
		owQuotient = __umulti3(owDividend, owDivisor);
		lfsr128r(&owRight);
		scale128(&owDivisor, &owRight);
		owQuotient = __umulti3(owDividend, owDivisor);
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT3))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT3))
#endif
		ExitProcess(GetLastError());

	qwTz = qwT3 - qwT0;
	qwT3 -= qwT2;
	qwT2 -= qwT1;
	qwT1 -= qwT0;
	qwTy = qwT3 - qwT1;
	qwTx = qwT2 - qwT1;
#ifdef CYCLES
	PrintFormat(hOutput,
	            "\r\n"
	            "__unopti4()     %6I64u.%09I64u      0\r\n"
	            "__udivmodti4()  %6I64u.%09I64u %6I64u.%09I64u\r\n"
	            "__umulti3()     %6I64u.%09I64u %6I64u.%09I64u\r\n"
	            "                %6I64u.%09I64u clock cycles\r\n",
	            qwT1 / 1000000000ULL, qwT1 % 1000000000ULL,
	            qwT2 / 1000000000ULL, qwT2 % 1000000000ULL,
	            qwTx / 1000000000ULL, qwTx % 1000000000ULL,
	            qwT3 / 1000000000ULL, qwT3 % 1000000000ULL,
	            qwTy / 1000000000ULL, qwTy % 1000000000ULL,
	            qwTz / 1000000000ULL, qwTz % 1000000000ULL);
#else
	PrintFormat(hOutput,
	            "\r\n"
	            "__unopti4()     %6I64u.%07I64u      0\r\n"
	            "__udivmodti4()  %6I64u.%07I64u %6I64u.%07I64u\r\n"
	            "__umulti3()     %6I64u.%07I64u %6I64u.%07I64u\r\n"
	            "                %6I64u.%07I64u nano-seconds\r\n",
	            qwT1 / 10000000ULL, qwT1 % 10000000ULL,
	            qwT2 / 10000000ULL, qwT2 % 10000000ULL,
	            qwTx / 10000000ULL, qwTx % 10000000ULL,
	            qwT3 / 10000000ULL, qwT3 % 10000000ULL,
	            qwTy / 10000000ULL, qwTy % 10000000ULL,
	            qwTz / 10000000ULL, qwTz % 10000000ULL);
#endif
	ExitProcess(0UL);
}

DWORD_PTR	__security_cookie = 3141592653589793241ULL >> 16;	// π * 10**18 / 2**16

const	IMAGE_LOAD_CONFIG_DIRECTORY64	_load_config_used = {sizeof(_load_config_used),
					                     'DATE',	// = 2006-04-15 20:15:01 UTC
					                     _MSC_VER / 100, _MSC_VER % 100,
					                     0UL, 0UL, 0UL,
					                     0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL,
					                     0UL,
					                     0U, 0U,
					                     0ULL,
					                     &__security_cookie,
					                     0ULL, 0ULL};

VOID	__security_check_cookie(DWORD_PTR _stackcookie)
{
	if (_stackcookie != __security_cookie)
		__ud2();
}
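The entries of the owTable[] array above can be cross-checked independently of any division routine through the identity dividend = quotient × divisor + remainder with remainder < divisor. The following is a minimal sketch, assuming a compiler that provides the unsigned __int128 type (GCC and Clang on 64-bit targets); the helpers make128() and check128() are hypothetical, and the multiplication is performed modulo 2¹²⁸, so the check is a necessary condition only.

```c
#include <stdint.h>

// Assumption: the compiler provides unsigned __int128.
typedef unsigned __int128 u128;

// Combine two 64-bit halves, low half first as in the OWORD structure.
static u128 make128(uint64_t low, uint64_t high)
{
    return (u128) high << 64 | low;
}

// Returns 1 if remainder < divisor and
// quotient * divisor + remainder == dividend (modulo 2**128), else 0.
static int check128(u128 dividend, u128 divisor, u128 quotient, u128 remainder)
{
    return remainder < divisor
        && quotient * divisor + remainder == dividend;
}
```

For the entry {~0ULL, ~0ULL, 1ULL, 3ULL, 0x5555555555555555ULL, 0ULL, 0xAAAAAAAAAAAAAAAAULL, 0ULL}, check128() returns 1, while a wrong remainder makes it return 0.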

// Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifndef _M_AMD64
#pragma message("For AMD64 platform only!")
#endif

#define STRICT
#undef UNICODE
#define WIN32_LEAN_AND_MEAN

#include <windows.h>

typedef	ULONGLONG	QWORD;

const	struct
{
	QWORD	qwDividend[2], qwDivisor[2], qwQuotient[2], qwRemainder[2];
} owTable[] = {{0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL},
               {0ULL, 0ULL, 1ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL},
               {0ULL, 0ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL},
               {0ULL, 0ULL, 0ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 0ULL},
               {0ULL, 0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 0ULL},
               {0ULL, 0ULL, 0ULL, 1ULL, 0ULL, 0ULL, 0ULL, 0ULL},
               {0ULL, 0ULL, 1ULL, 1ULL, 0ULL, 0ULL, 0ULL, 0ULL},
               {0ULL, 0ULL, ~0ULL, 1ULL, 0ULL, 0ULL, 0ULL, 0ULL},
               {0ULL, 0ULL, 1ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 0ULL},
               {1ULL, 0ULL, 1ULL, 0ULL, 1ULL, 0ULL, 0ULL, 0ULL},
               {1ULL, 0ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 1ULL, 0ULL},
               {1ULL, 0ULL, 0ULL, 1ULL, 0ULL, 0ULL, 1ULL, 0ULL},
               {1ULL, 0ULL, 1ULL, 1ULL, 0ULL, 0ULL, 1ULL, 0ULL},
               {1ULL, 0ULL, ~0ULL, 1ULL, 0ULL, 0ULL, 1ULL, 0ULL},
               {1ULL, 0ULL, 0ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 0ULL},
               {1ULL, 0ULL, 1ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 0ULL},
               {1ULL, 0ULL, ~1ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 0ULL},
               {1ULL, 0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 0ULL},
               {2ULL, 0ULL, 1ULL, 0ULL, 2ULL, 0ULL, 0ULL, 0ULL},
               {2ULL, 0ULL, 2ULL, 0ULL, 1ULL, 0ULL, 0ULL, 0ULL},
               {2ULL, 0ULL, ~1ULL, ~0ULL, 0ULL, 0ULL, 2ULL, 0ULL},
               {2ULL, 0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 2ULL, 0ULL},
               {~0ULL, 0ULL, 1ULL, 0ULL, ~0ULL, 0ULL, 0ULL, 0ULL},
               {~0ULL, 0ULL, ~0ULL, 0ULL, 1ULL, 0ULL, 0ULL, 0ULL},
               {~0ULL, 0ULL, 0ULL, 1ULL, 0ULL, 0ULL, ~0ULL, 0ULL},
               {~0ULL, 0ULL, 1ULL, 1ULL, 0ULL, 0ULL, ~0ULL, 0ULL},
               {~0ULL, 0ULL, ~0ULL, 1ULL, 0ULL, 0ULL, ~0ULL, 0ULL},
               {~0ULL, 0ULL, 0ULL, ~0ULL, 0ULL, 0ULL, ~0ULL, 0ULL},
               {~0ULL, 0ULL, 1ULL, ~0ULL, 0ULL, 0ULL, ~0ULL, 0ULL},
               {~0ULL, 0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, ~0ULL, 0ULL},
               {0ULL, 1ULL, 1ULL, 0ULL, 0ULL, 1ULL, 0ULL, 0ULL},
               {0ULL, 1ULL, ~0ULL, 0ULL, 1ULL, 0ULL, 1ULL, 0ULL},
               {0ULL, 1ULL, 0ULL, 1ULL, 1ULL, 0ULL, 0ULL, 0ULL},
               {0ULL, 1ULL, 1ULL, 1ULL, 0ULL, 0ULL, 0ULL, 1ULL},
               {0ULL, 1ULL, ~0ULL, 1ULL, 0ULL, 0ULL, 0ULL, 1ULL},
               {0ULL, 1ULL, 0ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 1ULL},
               {0ULL, 1ULL, 1ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 1ULL},
               {0ULL, 1ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 1ULL},
               {1ULL, 1ULL, 1ULL, 0ULL, 1ULL, 1ULL, 0ULL, 0ULL},
               {1ULL, 1ULL, ~0ULL, 0ULL, 1ULL, 0ULL, 2ULL, 0ULL},
               {1ULL, 1ULL, 0ULL, 1ULL, 1ULL, 0ULL, 1ULL, 0ULL},
               {1ULL, 1ULL, 1ULL, 1ULL, 1ULL, 0ULL, 0ULL, 0ULL},
               {1ULL, 1ULL, ~0ULL, 1ULL, 0ULL, 0ULL, 1ULL, 1ULL},
               {1ULL, 1ULL, 0ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 1ULL},
               {1ULL, 1ULL, 1ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 1ULL},
               {1ULL, 1ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 1ULL},
               {~0ULL, 1ULL, 1ULL, 0ULL, ~0ULL, 1ULL, 0ULL, 0ULL},
               {~0ULL, 1ULL, ~0ULL, 0ULL, 2ULL, 0ULL, 1ULL, 0ULL},
               {~0ULL, 1ULL, 0ULL, 1ULL, 1ULL, 0ULL, ~0ULL, 0ULL},
               {~0ULL, 1ULL, 1ULL, 1ULL, 1ULL, 0ULL, ~1ULL, 0ULL},
               {~0ULL, 1ULL, ~0ULL, 1ULL, 1ULL, 0ULL, 0ULL, 0ULL},
               {~0ULL, 1ULL, 0ULL, ~0ULL, 0ULL, 0ULL, ~0ULL, 1ULL},
               {~0ULL, 1ULL, 1ULL, ~0ULL, 0ULL, 0ULL, ~0ULL, 1ULL},
               {~0ULL, 1ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, ~0ULL, 1ULL},
               {~0ULL, 0xFULL, 0xFULL, 0ULL, 0x1111111111111111ULL, 1ULL, 0ULL, 0ULL},
               {~0xFULL, ~1ULL, ~1ULL, ~0ULL, 0ULL, 0ULL, ~0xFULL, ~1ULL},
               {0ULL, ~0ULL, 1ULL, 0ULL, 0ULL, ~0ULL, 0ULL, 0ULL},
               {0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 0ULL, 0ULL},
               {0ULL, ~0ULL, 0ULL, 1ULL, ~0ULL, 0ULL, 0ULL, 0ULL},
               {0ULL, ~0ULL, 1ULL, 1ULL, ~1ULL, 0ULL, 2ULL, 0ULL},
               {0ULL, ~0ULL, 0ULL, ~0ULL, 1ULL, 0ULL, 0ULL, 0ULL},
               {0ULL, ~0ULL, 1ULL, ~0ULL, 0ULL, 0ULL, 0ULL, ~0ULL},
               {0ULL, ~0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 0ULL, ~0ULL},
               {1ULL, ~0ULL, 1ULL, 0ULL, 1ULL, ~0ULL, 0ULL, 0ULL},
               {1ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 1ULL, 0ULL},
               {1ULL, ~0ULL, 0ULL, 1ULL, ~0ULL, 0ULL, 1ULL, 0ULL},
               {1ULL, ~0ULL, 1ULL, 1ULL, ~1ULL, 0ULL, 3ULL, 0ULL},
               {1ULL, ~0ULL, 0ULL, ~0ULL, 1ULL, 0ULL, 1ULL, 0ULL},
               {1ULL, ~0ULL, 1ULL, ~0ULL, 1ULL, 0ULL, 0ULL, 0ULL},
               {1ULL, ~0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 1ULL, ~0ULL},
               {~0ULL, ~0ULL, 1ULL, 0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL},
               {~0ULL, ~0ULL, ~0ULL, 0ULL, 1ULL, 1ULL, 0ULL, 0ULL},
               {~0ULL, ~0ULL, 0ULL, 1ULL, ~0ULL, 0ULL, ~0ULL, 0ULL},
               {~0ULL, ~0ULL, 1ULL, 1ULL, ~0ULL, 0ULL, 0ULL, 0ULL},
               {~0ULL, ~0ULL, 1ULL, 3ULL, 0x5555555555555555ULL, 0ULL, 0xAAAAAAAAAAAAAAAAULL, 0ULL},
               {~0ULL, ~0ULL, 0ULL, ~0ULL, 1ULL, 0ULL, ~0ULL, 0ULL},
               {~0ULL, ~0ULL, 1ULL, ~0ULL, 1ULL, 0ULL, ~1ULL, 0ULL},
               {~0ULL, ~0ULL, ~0ULL, ~0ULL, 1ULL, 0ULL, 0ULL, 0ULL},
               {~0ULL, ~0ULL, ~1ULL, ~0ULL, 1ULL, 0ULL, 1ULL, 0ULL},
               {0xBF25975319080000ULL, 0x530EDA741C71D4C3ULL, 0x14C34AB4676E4BABULL, 0x0000004DE2CAB081ULL, 0x0000000001110001ULL, 0ULL, 0x00000000003EB455ULL, 0ULL},
               {0xBF25975319080000ULL, 0x530EDA741C71D4C3ULL, 0x0000000001110001ULL, 0ULL, 0x14C34AB4676E4BABULL, 0x0000004DE2CAB081ULL, 0x00000000003EB455ULL, 0ULL}};

#pragma intrinsic(_umul128)

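// low 128 bits of the 128x128-bit product: full 64x64-bit product of the
//  low halves, plus both cross products added to its high half
__declspec(noinline)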
QWORD	*__umulti3(QWORD qwProduct[2], QWORD qwMultiplicand[2], QWORD qwMultiplier[2])
{
	qwProduct[0] = _umul128(qwMultiplicand[0], qwMultiplier[0], qwProduct + 1);
	qwProduct[1] += qwMultiplicand[0] * qwMultiplier[1]
	              + qwMultiplicand[1] * qwMultiplier[0];

	return qwProduct;
}

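// no-op with the signature of __udivmodti4(): its loop measures the
//  overhead of the generators and the function call
__declspec(noinline)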
QWORD	*__unopti4(QWORD qwQuotient[2], QWORD qwDividend[2], QWORD qwDivisor[2], QWORD qwRemainder[2])
{
	if (qwRemainder != NULL)
		*qwRemainder = *qwDividend;

	return qwQuotient;
}

QWORD	*__udivmodti4(QWORD qwQuotient[2], QWORD qwDividend[2], QWORD qwDivisor[2], QWORD qwRemainder[2]);

#pragma intrinsic(__shiftleft128, __shiftright128)

__forceinline
VOID	lfsr128l(QWORD qw[2])
{
#ifndef XORSHIFT
	// 128-bit linear feedback shift register (Galois form) using
	//  primitive polynomial 0x5DB2B62B0C5F8E1B:D8CCE715FCB2726D

	QWORD	qwMask = (LONGLONG) (qw[1]) >> 63;
	qw[1] = __shiftleft128(qw[0], qw[1], 1)
	      ^ (qwMask & 0x5DB2B62B0C5F8E1BULL);
	qw[0] = (qwMask & 0xD8CCE715FCB2726DULL) ^ (qw[0] << 1);
#elif 1
	// 128-bit linear feedback shift register (XorShift form)
	//  using shift constants from Richard Peirce Brent

	QWORD	qwTemp = qw[1];
	qw[1] = qw[0];
	qw[0] ^= qw[0] << 33;
	qwTemp ^= qwTemp << 28;
	qw[0] ^= qw[0] >> 31;
	qwTemp ^= qwTemp >> 29;
	qw[0] ^= qwTemp;
#else
	// 128-bit linear feedback shift register (XorShift form)
	//  using shift constants from Melissa O'Neill

	qw[1] ^= __shiftleft128(qw[0], qw[1], 26);
	qw[0] ^= qw[0] << 26;
	qw[0] ^= __shiftright128(qw[0], qw[1], 61);
	qw[1] ^= qw[1] >> 61;
	qw[1] ^= __shiftleft128(qw[0], qw[1], 37);
	qw[0] ^= qw[0] << 37;
#endif
}

__forceinline
VOID	lfsr128r(QWORD qw[2])
{
#ifndef XORSHIFT
	// 128-bit linear feedback shift register (Galois form) using
	//  primitive polynomial 0xB64E4D3FA8E7331B:D871FA30D46D4DBA

	QWORD	qwMask = 0ULL - (qw[0] & 1ULL);
	qw[0] = __shiftright128(qw[0], qw[1], 1)
	      ^ (qwMask & 0xD871FA30D46D4DBAULL);
	qw[1] = (qwMask & 0xB64E4D3FA8E7331BULL) ^ (qw[1] >> 1);
#elif 1
	// 128-bit linear feedback shift register (XorShift form)
	//  using shift constants from Sebastiano Vigna

	QWORD	qwTemp = qw[1];
	qw[1] = qw[0];
	qwTemp ^= qwTemp << 23;
	qw[0] ^= qw[0] >> 26;
	qwTemp ^= qwTemp >> 17;
	qw[0] ^= qwTemp;
#else
	// 128-bit linear feedback shift register (XorShift form)
	//  using shift constants from Melissa O'Neill

	qw[1] ^= __shiftleft128(qw[0], qw[1], 11);
	qw[0] ^= qw[0] << 11;
	qw[0] ^= __shiftright128(qw[0], qw[1], 61);
	qw[1] ^= qw[1] >> 61;
	qw[1] ^= __shiftleft128(qw[0], qw[1], 45);
	qw[0] ^= qw[0] << 45;
#endif
}

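// shift the 128-bit input right by the count held in its low bits (the
//  shift instructions use the count modulo 64), so that the operands get
//  a varying number of leading '0' bits
__forceinline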
VOID	scale128(QWORD qwOut[2], QWORD qwIn[2])
{
	qwOut[0] = __shiftright128(qwIn[0], qwIn[1], qwIn[0] /* & 63 */);
	qwOut[1] = qwIn[1] >> (qwIn[0] /* & 63 */);
}

__declspec(safebuffers)
BOOL	CDECL	PrintFormat(HANDLE hOutput, [SA_FormatString(Style="printf")] LPCSTR lpFormat, ...)
{
	CHAR	szFormat[1024];
	DWORD	dwFormat;
	DWORD	dwOutput;

	va_list	vaInput;
	va_start(vaInput, lpFormat);

	dwFormat = wvsprintf(szFormat, lpFormat, vaInput);

	va_end(vaInput);

	if ((dwFormat == 0UL)
	 || !WriteFile(hOutput, szFormat, dwFormat, &dwOutput, (LPOVERLAPPED) NULL))
		return FALSE;

	return dwOutput == dwFormat;
}

__declspec(noreturn)
VOID	CDECL	mainCRTStartup(VOID)
{
	DWORD	dw, dwCPUID[12];

	QWORD	qwT0, qwT1, qwT2, qwT3;
	QWORD	qwTx, qwTy, qwTz;

	QWORD	qwDividend[2], qwDivisor[2], qwQuotient[2], qwRemainder[2];
		// 2**128 / golden ratio
	QWORD	qwLeft[2] = {0xF39CC0605CEDC834ULL, 0x9E3779B97F4A7C15ULL};
		// bit-vector of prime numbers:
		//  2**prime is set for each prime in [0, 127]
	QWORD	qwRight[2] = {0x28208A20A08A28ACULL, 0x800228A202088288ULL};

	HANDLE	hThread = GetCurrentThread();
	HANDLE	hOutput = GetStdHandle(STD_OUTPUT_HANDLE);

	if (hOutput == INVALID_HANDLE_VALUE)
		ExitProcess(GetLastError());

	if (SetThreadIdealProcessor(hThread, 0UL) == -1L)
		ExitProcess(GetLastError());

	__cpuid(dwCPUID, 0x80000000UL);

	if (*dwCPUID >= 0x80000004UL)
	{
		__cpuid(dwCPUID, 0x80000002UL);
		__cpuid(dwCPUID + 4, 0x80000003UL);
		__cpuid(dwCPUID + 8, 0x80000004UL);
	}
	else
		__movsb(dwCPUID, "undetermined processor", sizeof("undetermined processor"));

	PrintFormat(hOutput, "\r\nTesting 128-bit division...\r\n");

	for (dw = 1UL; dw < sizeof(owTable) / sizeof(*owTable); dw++)
	{
		PrintFormat(hOutput, "\r%lu", dw);
#if 0
		if ((owTable[dw].qwDivisor[1] | owTable[dw].qwDivisor[0]) == 0ULL)
			continue;
#endif
		__udivmodti4(qwQuotient, owTable[dw].qwDividend, owTable[dw].qwDivisor, qwRemainder);

		if ((qwQuotient[1] != owTable[dw].qwQuotient[1])
		 || (qwQuotient[0] != owTable[dw].qwQuotient[0]))
			PrintFormat(hOutput,
			            "\t0x%016I64X:%016I64X\a / %016I64X:%016I64X\r\n"
			            "\t0x%016I64X:%016I64X\r\n"
			            "\t0x%016I64X:%016I64X\r\n",
			            owTable[dw].qwDividend[1], owTable[dw].qwDividend[0],
			            owTable[dw].qwDivisor[1], owTable[dw].qwDivisor[0],
			            owTable[dw].qwQuotient[1], owTable[dw].qwQuotient[0],
			            qwQuotient[1], qwQuotient[0]);

		if ((qwRemainder[1] != owTable[dw].qwRemainder[1])
		 || (qwRemainder[0] != owTable[dw].qwRemainder[0]))
			PrintFormat(hOutput,
			            "\t0x%016I64X:%016I64X\a %% %016I64X:%016I64X\r\n"
			            "\t0x%016I64X:%016I64X\r\n"
			            "\t0x%016I64X:%016I64X\r\n",
			            owTable[dw].qwDividend[1], owTable[dw].qwDividend[0],
			            owTable[dw].qwDivisor[1], owTable[dw].qwDivisor[0],
			            owTable[dw].qwRemainder[1], owTable[dw].qwRemainder[0],
			            qwRemainder[1], qwRemainder[0]);
	}

	PrintFormat(hOutput, "\r\nTiming 128-bit division on %.48hs\r\n", dwCPUID);
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT0))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT0))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		lfsr128l(qwLeft);
		__unopti4(qwQuotient, qwLeft, qwRight, NULL);
		lfsr128r(qwRight);
		__unopti4(qwQuotient, qwLeft, qwRight, qwRemainder);
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT1))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT1))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		lfsr128l(qwLeft);
		__udivmodti4(qwQuotient, qwLeft, qwRight, NULL);
		lfsr128r(qwRight);
		__udivmodti4(qwQuotient, qwLeft, qwRight, qwRemainder);
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT2))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT2))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		lfsr128l(qwLeft);
		__umulti3(qwQuotient, qwLeft, qwRight);
		lfsr128r(qwRight);
		__umulti3(qwQuotient, qwLeft, qwRight);
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT3))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT3))
#endif
		ExitProcess(GetLastError());

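	qwTz = qwT3 - qwT0;	// total time of all three loops
	qwT3 -= qwT2;		// time of the __umulti3() loop
	qwT2 -= qwT1;		// time of the __udivmodti4() loop
	qwT1 -= qwT0;		// time of the __unopti4() loop (overhead)
	qwTy = qwT3 - qwT1;	// __umulti3() without overhead
	qwTx = qwT2 - qwT1;	// __udivmodti4() without overhead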
#ifdef CYCLES
	PrintFormat(hOutput,
	            "\r\n"
	            "__unopti4()     %6I64u.%09I64u      0\r\n"
	            "__udivmodti4()  %6I64u.%09I64u %6I64u.%09I64u\r\n"
	            "__umulti3()     %6I64u.%09I64u %6I64u.%09I64u\r\n"
	            "                %6I64u.%09I64u clock cycles\r\n",
	            qwT1 / 1000000000ULL, qwT1 % 1000000000ULL,
	            qwT2 / 1000000000ULL, qwT2 % 1000000000ULL,
	            qwTx / 1000000000ULL, qwTx % 1000000000ULL,
	            qwT3 / 1000000000ULL, qwT3 % 1000000000ULL,
	            qwTy / 1000000000ULL, qwTy % 1000000000ULL,
	            qwTz / 1000000000ULL, qwTz % 1000000000ULL);
#else
	PrintFormat(hOutput,
	            "\r\n"
	            "__unopti4()     %6I64u.%07I64u      0\r\n"
	            "__udivmodti4()  %6I64u.%07I64u %6I64u.%07I64u\r\n"
	            "__umulti3()     %6I64u.%07I64u %6I64u.%07I64u\r\n"
	            "                %6I64u.%07I64u nano-seconds\r\n",
	            qwT1 / 10000000ULL, qwT1 % 10000000ULL,
	            qwT2 / 10000000ULL, qwT2 % 10000000ULL,
	            qwTx / 10000000ULL, qwTx % 10000000ULL,
	            qwT3 / 10000000ULL, qwT3 % 10000000ULL,
	            qwTy / 10000000ULL, qwTy % 10000000ULL,
	            qwTz / 10000000ULL, qwTz % 10000000ULL);
#endif
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT0))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT0))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		lfsr128l(qwLeft);
		scale128(qwDividend, qwLeft);
		__unopti4(qwQuotient, qwDividend, qwDivisor, NULL);
		lfsr128r(qwRight);
		scale128(qwDivisor, qwRight);
		__unopti4(qwQuotient, qwDividend, qwDivisor, qwRemainder);
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT1))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT1))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		lfsr128l(qwLeft);
		scale128(qwDividend, qwLeft);
		__udivmodti4(qwQuotient, qwDividend, qwDivisor, NULL);
		lfsr128r(qwRight);
		scale128(qwDivisor, qwRight);
		__udivmodti4(qwQuotient, qwDividend, qwDivisor, qwRemainder);
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT2))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT2))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		lfsr128l(qwLeft);
		scale128(qwDividend, qwLeft);
		__umulti3(qwQuotient, qwDividend, qwDivisor);
		lfsr128r(qwRight);
		scale128(qwDivisor, qwRight);
		__umulti3(qwQuotient, qwDividend, qwDivisor);
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT3))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT3))
#endif
		ExitProcess(GetLastError());

	qwTz = qwT3 - qwT0;
	qwT3 -= qwT2;
	qwT2 -= qwT1;
	qwT1 -= qwT0;
	qwTy = qwT3 - qwT1;
	qwTx = qwT2 - qwT1;
#ifdef CYCLES
	PrintFormat(hOutput,
	            "\r\n"
	            "__unopti4()     %6I64u.%09I64u      0\r\n"
	            "__udivmodti4()  %6I64u.%09I64u %6I64u.%09I64u\r\n"
	            "__umulti3()     %6I64u.%09I64u %6I64u.%09I64u\r\n"
	            "                %6I64u.%09I64u clock cycles\r\n",
	            qwT1 / 1000000000ULL, qwT1 % 1000000000ULL,
	            qwT2 / 1000000000ULL, qwT2 % 1000000000ULL,
	            qwTx / 1000000000ULL, qwTx % 1000000000ULL,
	            qwT3 / 1000000000ULL, qwT3 % 1000000000ULL,
	            qwTy / 1000000000ULL, qwTy % 1000000000ULL,
	            qwTz / 1000000000ULL, qwTz % 1000000000ULL);
#else
	PrintFormat(hOutput,
	            "\r\n"
	            "__unopti4()     %6I64u.%07I64u      0\r\n"
	            "__udivmodti4()  %6I64u.%07I64u %6I64u.%07I64u\r\n"
	            "__umulti3()     %6I64u.%07I64u %6I64u.%07I64u\r\n"
	            "                %6I64u.%07I64u nano-seconds\r\n",
	            qwT1 / 10000000ULL, qwT1 % 10000000ULL,
	            qwT2 / 10000000ULL, qwT2 % 10000000ULL,
	            qwTx / 10000000ULL, qwTx % 10000000ULL,
	            qwT3 / 10000000ULL, qwT3 % 10000000ULL,
	            qwTy / 10000000ULL, qwTy % 10000000ULL,
	            qwTz / 10000000ULL, qwTz % 10000000ULL);
#endif
	ExitProcess(0UL);
}

// Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifndef _M_AMD64
#pragma message("For AMD64 platform only!")
#endif

#define STRICT
#undef UNICODE
#define WIN32_LEAN_AND_MEAN

#include <windows.h>

typedef	ULONGLONG	QWORD;

#ifndef NATIVE
#define _(DIVIDEND, DIVISOR)	{(DIVIDEND), (DIVISOR), (DIVIDEND) / (DIVISOR), (DIVIDEND) % (DIVISOR)}

const	struct
{
	QWORD	qwDividend, qwDivisor, qwQuotient, qwRemainder;
} qwTable[] = {_(0x0000000000000000ULL, 0x0000000000000001ULL),
               _(0x0000000000000001ULL, 0x0000000000000001ULL),
               _(0x0000000000000002ULL, 0x0000000000000001ULL),
               _(0x0000000000000002ULL, 0x0000000000000002ULL),
               _(0x0000000000000000ULL, 0xFFFFFFFFFFFFFFFFULL),
               _(0x0000000000000001ULL, 0xFFFFFFFFFFFFFFFFULL),
               _(0x0000000000000001ULL, 0xFFFFFFFFFFFFFFFEULL),
               _(0x0000000000000002ULL, 0xFFFFFFFFFFFFFFFEULL),
               _(0x0000000000000003ULL, 0xFFFFFFFFFFFFFFFEULL),
               _(0x0000000000000003ULL, 0xFFFFFFFFFFFFFFFDULL),
               _(0x000000000FFFFFFFULL, 0x0000000000000001ULL),
               _(0x0000000FFFFFFFFFULL, 0x000000000000000FULL),
               _(0x0000000FFFFFFFFFULL, 0x0000000000000010ULL),
               _(0x0000000000000100ULL, 0x000000000FFFFFFFULL),
               _(0x00FFFFFFF0000000ULL, 0x0000000010000000ULL),
               _(0x07FFFFFF80000000ULL, 0x0000000080000000ULL),
               _(0x7FFFFFFEFFFFFFF0ULL, 0xFFFFFFFFFFFFFFFEULL),
               _(0x7FFFFFFEFFFFFFF0ULL, 0x0000FFFFFFFFFFFEULL),
               _(0x7FFFFFFEFFFFFFF0ULL, 0x7FFFFFFEFFFFFFF0ULL),
               _(0x7FFFFFFFFFFFFFFFULL, 0x8000000000000000ULL),
               _(0x7FFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFDULL),
               _(0x7FFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFEULL),
               _(0x7FFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL),
               _(0x8000000000000000ULL, 0x0000000000000001ULL),
               _(0x8000000000000000ULL, 0x0000000000000002ULL),
               _(0x8000000000000000ULL, 0x0000000000000003ULL),
               _(0x8000000000000000ULL, 0x00000000FFFFFFFDULL),
               _(0x8000000000000000ULL, 0x00000000FFFFFFFEULL),
               _(0x8000000000000000ULL, 0x00000000FFFFFFFFULL),
               _(0x8000000000000000ULL, 0x0000000100000000ULL),
               _(0x8000000000000000ULL, 0x0000000100000001ULL),
               _(0x8000000000000000ULL, 0x0000000100000002ULL),
               _(0x8000000000000000ULL, 0x0000000100000003ULL),
               _(0x8000000000000000ULL, 0xFFFFFFFF00000000ULL),
               _(0x8000000000000000ULL, 0xFFFFFFFFFFFFFFFDULL),
               _(0x8000000000000000ULL, 0xFFFFFFFFFFFFFFFEULL),
               _(0x8000000000000000ULL, 0xFFFFFFFFFFFFFFFFULL),
               _(0x8000000080000000ULL, 0x0000000080000000ULL),
               _(0x8000000080000001ULL, 0x0000000080000001ULL),
               _(0xFFFFFFFEFFFFFFF0ULL, 0xFFFFFFFFFFFFFFFEULL),
               _(0xFFFFFFFFFFFFFFFCULL, 0x00000000FFFFFFFEULL),
               _(0xFFFFFFFFFFFFFFFCULL, 0x0000000100000002ULL),
               _(0xFFFFFFFFFFFFFFFEULL, 0x0000000080000000ULL),
               _(0xFFFFFFFFFFFFFFFFULL, 0x0000000000000001ULL),
               _(0xFFFFFFFFFFFFFFFFULL, 0x0000000000000002ULL),
               _(0xFFFFFFFFFFFFFFFFULL, 0x0000000000000003ULL),
               _(0xFFFFFFFFFFFFFFFFULL, 0x00000000FFFFFFFDULL),
               _(0xFFFFFFFFFFFFFFFFULL, 0x00000000FFFFFFFEULL),
               _(0xFFFFFFFFFFFFFFFFULL, 0x00000000FFFFFFFFULL),
               _(0xFFFFFFFFFFFFFFFFULL, 0x0000000100000001ULL),
               _(0xFFFFFFFFFFFFFFFFULL, 0x0000000100000002ULL),
               _(0xFFFFFFFFFFFFFFFFULL, 0x0000000100000003ULL),
               _(0xFFFFFFFFFFFFFFFFULL, 0x00000001C0000001ULL),
               _(0xFFFFFFFFFFFFFFFFULL, 0x0000000380000003ULL),
               _(0xFFFFFFFFFFFFFFFFULL, 0x8000000000000000ULL),
               _(0xFFFFFFFFFFFFFFFFULL, 0x7FFFFFFFFFFFFFFFULL),
               _(0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFEULL),
               _(0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL)};

#undef _
#ifndef INTERN
QWORD	__udivmoddi4(QWORD dividend, QWORD divisor, QWORD *remainder);
#else
__declspec(noinline)
QWORD	__udivmoddi4(QWORD dividend, QWORD divisor, QWORD *remainder)
{
	QWORD	quotient;
	DWORD	index1, index2;

	if (_BitScanReverse64(&index2, divisor))
		if (_BitScanReverse64(&index1, dividend))
#if 0
			if (index1 >= index2)
#else
			if (dividend >= divisor)
#endif
			{
				// dividend >= divisor > 0,
				//  64 > index1 >= index2 >= 0
				//   (number of leading '0' bits = 63 - index)

				divisor <<= index1 - index2;
				quotient = 0ULL;

				do
				{
					quotient <<= 1;

					if (dividend >= divisor)
					{
						dividend -= divisor;
						quotient |= 1ULL;
					}

					divisor >>= 1;
				} while (index1 >= ++index2);

				if (remainder != NULL)
					*remainder = dividend;

				return quotient;
			}
			else // divisor > dividend > 0:
			     //  quotient = 0, remainder = dividend
			{
				if (remainder != NULL)
					*remainder = dividend;

				return 0ULL;
			}
		else // divisor > dividend == 0:
		     //  quotient = 0, remainder = 0
		{
			if (remainder != NULL)
				*remainder = 0ULL;

			return 0ULL;
		}
	else // divisor == 0
	{
		if (remainder != NULL)
			*remainder = dividend % divisor;

		return dividend / divisor;
	}
}
#endif // INTERN
#else
__declspec(noinline)
QWORD	__udivmoddi4(QWORD dividend, QWORD divisor, QWORD *remainder)
{
	if (remainder != NULL)
		*remainder = dividend % divisor;

	return dividend / divisor;
}
#endif // NATIVE

__declspec(noinline)
QWORD	__unopdi4(QWORD dividend, QWORD divisor, QWORD *remainder)
{
	if (remainder != NULL)
		*remainder = divisor;

	return dividend;
}

__declspec(noinline)
QWORD	__umuldi4(QWORD multiplicand, QWORD multiplier, QWORD *dummy)
{
	if (dummy != NULL)
		*dummy = 0ULL;

	return multiplicand * multiplier;
}

__declspec(safebuffers)
BOOL	CDECL	PrintFormat(HANDLE hOutput, [SA_FormatString(Style="printf")] LPCSTR lpFormat, ...)
{
	CHAR	szFormat[1024];
	DWORD	dwFormat;
	DWORD	dwOutput;

	va_list	vaInput;
	va_start(vaInput, lpFormat);

	dwFormat = wvsprintf(szFormat, lpFormat, vaInput);

	va_end(vaInput);

	if ((dwFormat == 0UL)
	 || !WriteFile(hOutput, szFormat, dwFormat, &dwOutput, (LPOVERLAPPED) NULL))
		return FALSE;

	return dwOutput == dwFormat;
}

__declspec(noreturn)
VOID	CDECL	mainCRTStartup(VOID)
{
	DWORD	dw, dwCPUID[12];

	QWORD	qwT0, qwT1, qwT2, qwT3;
	QWORD	qwTx, qwTy, qwTz;

	volatile
	QWORD	qwQuotient;
	QWORD	qwRemainder, qwDividend, qwDivisor = ~0ULL;
		// 2**64 / golden ratio
	QWORD	qwLeft = 0x9E3779B97F4A7C15ULL;
		// bit-vector of prime numbers:
		//  2**prime is set for each prime in [0, 63]
	QWORD	qwRight = 0x28208A20A08A28ACULL;

	HANDLE	hThread = GetCurrentThread();
	HANDLE	hOutput = GetStdHandle(STD_OUTPUT_HANDLE);

	if (hOutput == INVALID_HANDLE_VALUE)
		ExitProcess(GetLastError());

	if (SetThreadIdealProcessor(hThread, 0UL) == -1L)
		ExitProcess(GetLastError());

	__cpuid(dwCPUID, 0x80000000UL);

	if (*dwCPUID >= 0x80000004UL)
	{
		__cpuid(dwCPUID, 0x80000002UL);
		__cpuid(dwCPUID + 4, 0x80000003UL);
		__cpuid(dwCPUID + 8, 0x80000004UL);
	}
	else
		__movsb(dwCPUID, "undetermined processor", sizeof("undetermined processor"));
#ifndef NATIVE
#ifndef INTERN
	PrintFormat(hOutput, "\r\nTesting 64-bit assembler division...\r\n");
#else
	PrintFormat(hOutput, "\r\nTesting 64-bit C division...\r\n");
#endif
	for (dw = 0UL; dw < sizeof(qwTable) / sizeof(*qwTable); dw++)
	{
		PrintFormat(hOutput, "\r%lu", dw);

		qwQuotient = __udivmoddi4(qwTable[dw].qwDividend, qwTable[dw].qwDivisor, &qwRemainder);

		if (qwQuotient != qwTable[dw].qwQuotient)
			PrintFormat(hOutput,
			            "\t%I64u / %I64u:\a quotient %I64u not equal %I64u\r\n",
			            qwTable[dw].qwDividend, qwTable[dw].qwDivisor, qwQuotient, qwTable[dw].qwQuotient);

		if (qwQuotient > qwTable[dw].qwDividend)
			PrintFormat(hOutput,
			            "\t%I64u / %I64u:\a quotient %I64u greater dividend\r\n",
			            qwTable[dw].qwDividend, qwTable[dw].qwDivisor, qwQuotient);

		if (qwRemainder != qwTable[dw].qwRemainder)
			PrintFormat(hOutput,
			            "\t%I64u / %I64u:\a remainder %I64u not equal %I64u\r\n",
			            qwTable[dw].qwDividend, qwTable[dw].qwDivisor, qwRemainder, qwTable[dw].qwRemainder);

		if (qwRemainder >= qwTable[dw].qwDivisor)
			PrintFormat(hOutput,
			            "\t%I64u %% %I64u:\a remainder %I64u not less divisor\r\n",
			            qwTable[dw].qwDividend, qwTable[dw].qwDivisor, qwRemainder);
	}
#ifndef INTERN
	PrintFormat(hOutput, "\r\nTiming 64-bit assembler division on %.48hs\r\n", dwCPUID);
#else
	PrintFormat(hOutput, "\r\nTiming 64-bit C division on %.48hs\r\n", dwCPUID);
#endif
#else
	PrintFormat(hOutput, "\r\nTiming 64-bit native division on %.48hs\r\n", dwCPUID);
#endif // NATIVE
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT0))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT0))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from George Marsaglia

		qwLeft ^= qwLeft << 14;
		qwLeft ^= qwLeft >> 31;
		qwLeft ^= qwLeft << 45;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0xAD93D23594C935A9

		qwLeft = (qwLeft << 1)
		       ^ (((LONGLONG) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
#endif
		qwQuotient = __unopdi4(qwLeft, qwRight, NULL);
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from Richard Peirce Brent

		qwRight ^= qwRight << 10;
		qwRight ^= qwRight >> 15;
		qwRight ^= qwRight << 4;
		qwRight ^= qwRight >> 13;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0x2B5926535897936B

		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
#endif
		qwQuotient = __unopdi4(qwLeft, qwRight, &qwRemainder);
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT1))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT1))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from George Marsaglia

		qwLeft ^= qwLeft << 14;
		qwLeft ^= qwLeft >> 31;
		qwLeft ^= qwLeft << 45;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0xAD93D23594C935A9

		qwLeft = (qwLeft << 1)
		       ^ (((LONGLONG) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
#endif
		qwQuotient = __udivmoddi4(qwLeft, qwRight, NULL);
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from Richard Peirce Brent

		qwRight ^= qwRight << 10;
		qwRight ^= qwRight >> 15;
		qwRight ^= qwRight << 4;
		qwRight ^= qwRight >> 13;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0x2B5926535897936B

		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
#endif
		qwQuotient = __udivmoddi4(qwLeft, qwRight, &qwRemainder);
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT2))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT2))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from George Marsaglia

		qwLeft ^= qwLeft << 14;
		qwLeft ^= qwLeft >> 31;
		qwLeft ^= qwLeft << 45;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0xAD93D23594C935A9

		qwLeft = (qwLeft << 1)
		       ^ (((LONGLONG) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
#endif
		qwQuotient = __umuldi4(qwLeft, qwRight, NULL);
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from Richard Peirce Brent

		qwRight ^= qwRight << 10;
		qwRight ^= qwRight >> 15;
		qwRight ^= qwRight << 4;
		qwRight ^= qwRight >> 13;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0x2B5926535897936B

		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
#endif
		qwQuotient = __umuldi4(qwLeft, qwRight, &qwRemainder);
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT3))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT3))
#endif
		ExitProcess(GetLastError());

	qwTz = qwT3 - qwT0;
	qwT3 -= qwT2;
	qwT2 -= qwT1;
	qwT1 -= qwT0;
	qwTy = qwT3 > qwT1 ? qwT3 - qwT1 : 0ULL;
	qwTx = qwT2 - qwT1;
#ifdef CYCLES
	PrintFormat(hOutput,
	            "\r\n"
	            "__unopdi4()     %6I64u.%09I64u      0\r\n"
	            "__udivmoddi4()  %6I64u.%09I64u %6I64u.%09I64u\r\n"
	            "__umuldi4()     %6I64u.%09I64u %6I64u.%09I64u\r\n"
	            "                %6I64u.%09I64u clock cycles\r\n",
	            qwT1 / 1000000000ULL, qwT1 % 1000000000ULL,
	            qwT2 / 1000000000ULL, qwT2 % 1000000000ULL,
	            qwTx / 1000000000ULL, qwTx % 1000000000ULL,
	            qwT3 / 1000000000ULL, qwT3 % 1000000000ULL,
	            qwTy / 1000000000ULL, qwTy % 1000000000ULL,
	            qwTz / 1000000000ULL, qwTz % 1000000000ULL);
#else
	PrintFormat(hOutput,
	            "\r\n"
	            "__unopdi4()     %6I64u.%07I64u      0\r\n"
	            "__udivmoddi4()  %6I64u.%07I64u %6I64u.%07I64u\r\n"
	            "__umuldi4()     %6I64u.%07I64u %6I64u.%07I64u\r\n"
	            "                %6I64u.%07I64u nano-seconds\r\n",
	            qwT1 / 10000000ULL, qwT1 % 10000000ULL,
	            qwT2 / 10000000ULL, qwT2 % 10000000ULL,
	            qwTx / 10000000ULL, qwTx % 10000000ULL,
	            qwT3 / 10000000ULL, qwT3 % 10000000ULL,
	            qwTy / 10000000ULL, qwTy % 10000000ULL,
	            qwTz / 10000000ULL, qwTz % 10000000ULL);
#endif
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT0))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT0))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from George Marsaglia

		qwLeft ^= qwLeft << 14;
		qwLeft ^= qwLeft >> 31;
		qwLeft ^= qwLeft << 45;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0xAD93D23594C935A9

		qwLeft = (qwLeft << 1)
		       ^ (((LONGLONG) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
#endif
		qwDividend = qwLeft >> (qwLeft & 31ULL);
		qwQuotient = __unopdi4(qwDividend, qwDivisor, &qwRemainder);
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from Richard Peirce Brent

		qwRight ^= qwRight << 10;
		qwRight ^= qwRight >> 15;
		qwRight ^= qwRight << 4;
		qwRight ^= qwRight >> 13;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0x2B5926535897936B

		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
#endif
		qwDivisor = qwRight >> (qwRight & 31ULL);
		qwQuotient = __unopdi4(qwDividend, qwDivisor, &qwRemainder);
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT1))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT1))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from George Marsaglia

		qwLeft ^= qwLeft << 14;
		qwLeft ^= qwLeft >> 31;
		qwLeft ^= qwLeft << 45;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0xAD93D23594C935A9

		qwLeft = (qwLeft << 1)
		       ^ (((LONGLONG) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
#endif
		qwDividend = qwLeft >> (qwLeft & 31ULL);
		qwQuotient = __udivmoddi4(qwDividend, qwDivisor, &qwRemainder);
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from Richard Peirce Brent

		qwRight ^= qwRight << 10;
		qwRight ^= qwRight >> 15;
		qwRight ^= qwRight << 4;
		qwRight ^= qwRight >> 13;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0x2B5926535897936B

		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
#endif
		qwDivisor = qwRight >> (qwRight & 31ULL);
		qwQuotient = __udivmoddi4(qwDividend, qwDivisor, &qwRemainder);
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT2))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT2))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from George Marsaglia

		qwLeft ^= qwLeft << 14;
		qwLeft ^= qwLeft >> 31;
		qwLeft ^= qwLeft << 45;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0xAD93D23594C935A9

		qwLeft = (qwLeft << 1)
		       ^ (((LONGLONG) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
#endif
		qwDividend = qwLeft >> (qwLeft & 31ULL);
		qwQuotient = __umuldi4(qwDividend, qwDivisor, &qwRemainder);
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from Richard Peirce Brent

		qwRight ^= qwRight << 10;
		qwRight ^= qwRight >> 15;
		qwRight ^= qwRight << 4;
		qwRight ^= qwRight >> 13;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0x2B5926535897936B

		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
#endif
		qwDivisor = qwRight >> (qwRight & 31ULL);
		qwQuotient = __umuldi4(qwDividend, qwDivisor, &qwRemainder);
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT3))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT3))
#endif
		ExitProcess(GetLastError());

	qwTz = qwT3 - qwT0;
	qwT3 -= qwT2;
	qwT2 -= qwT1;
	qwT1 -= qwT0;
	qwTy = qwT3 > qwT1 ? qwT3 - qwT1 : 0ULL;
	qwTx = qwT2 - qwT1;
#ifdef CYCLES
	PrintFormat(hOutput,
	            "\r\n"
	            "__unopdi4()     %6I64u.%09I64u      0\r\n"
	            "__udivmoddi4()  %6I64u.%09I64u %6I64u.%09I64u\r\n"
	            "__umuldi3()     %6I64u.%09I64u %6I64u.%09I64u\r\n"
	            "                %6I64u.%09I64u clock cycles\r\n",
	            qwT1 / 1000000000ULL, qwT1 % 1000000000ULL,
	            qwT2 / 1000000000ULL, qwT2 % 1000000000ULL,
	            qwTx / 1000000000ULL, qwTx % 1000000000ULL,
	            qwT3 / 1000000000ULL, qwT3 % 1000000000ULL,
	            qwTy / 1000000000ULL, qwTy % 1000000000ULL,
	            qwTz / 1000000000ULL, qwTz % 1000000000ULL);
#else
	PrintFormat(hOutput,
	            "\r\n"
	            "__unopdi4()     %6I64u.%07I64u      0\r\n"
	            "__udivmoddi4()  %6I64u.%07I64u %6I64u.%07I64u\r\n"
	            "__umuldi3()     %6I64u.%07I64u %6I64u.%07I64u\r\n"
	            "                %6I64u.%07I64u nano-seconds\r\n",
	            qwT1 / 10000000ULL, qwT1 % 10000000ULL,
	            qwT2 / 10000000ULL, qwT2 % 10000000ULL,
	            qwTx / 10000000ULL, qwTx % 10000000ULL,
	            qwT3 / 10000000ULL, qwT3 % 10000000ULL,
	            qwTy / 10000000ULL, qwTy % 10000000ULL,
	            qwTz / 10000000ULL, qwTz % 10000000ULL);
#endif
	ExitProcess(0UL);
}
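
The operand generation used in the three timed loops can be looked at in isolation. The following is a minimal sketch in portable C, not part of the benchmark source: the names lfsr_left(), lfsr_right() and operand() are hypothetical, and the mask for the left-shifting register is built with an unsigned negation instead of the arithmetic right shift of a signed value used in the listing, because the result of that shift is implementation-defined in standard C.

```c
#include <stdint.h>

/* One step of the left-shifting Galois LFSR: the bit shifted out at the
   top decides whether the polynomial is XORed into the new state. */
static uint64_t lfsr_left(uint64_t state)
{
    uint64_t mask = 0ULL - (state >> 63);   /* all ones if bit 63 was set */

    return (state << 1) ^ (mask & 0xAD93D23594C935A9ULL);
}

/* One step of the right-shifting Galois LFSR: here the bit shifted out
   at the bottom selects the polynomial. */
static uint64_t lfsr_right(uint64_t state)
{
    uint64_t mask = 0ULL - (state & 1ULL);  /* all ones if bit 0 was set */

    return (state >> 1) ^ (mask & 0x2B5926535897936BULL);
}

/* Derive a dividend or divisor from the register state: the low 5 bits
   select a right shift by 0 to 31 bits, which varies the magnitude. */
static uint64_t operand(uint64_t state)
{
    return state >> (state & 31ULL);
}
```

A state of 0 maps to 0 in both registers, so the seeds have to be non-zero.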

Save the first C source presented above as 128-amd64.c and the second C source as 64-amd64.c in an arbitrary, preferably empty directory, then save the second assembler source presented above as udivmodti4.asm, the fourth assembler source as udivmodti4-hybrid.asm, and the eighth assembler source as udivmoddi4.asm in the same directory. Start the command prompt of the Windows software development kit for the AMD64 platform there and run the following command lines to ~~assemble, compile and link~~ build the benchmark programs 64.exe, 64-div.exe, 128.exe and 128-hybrid.exe and execute them:

ML64.EXE /Brepro /Cp /Cx /c /W3 /X udivmoddi4.asm
CL.EXE /Brepro /c /DCYCLES /GAFwy /O2y /W4 /Zl 64-amd64.c
LINK.EXE /LINK /BREPRO /DYNAMICBASE /ENTRY:mainCRTStartup /MACHINE:AMD64 /NOCOFFGRPINFO /NODEFAULTLIB /NXCOMPAT /OPT:REF /OUT:64.exe /RELEASE /SUBSYSTEM:CONSOLE 64-amd64.obj udivmoddi4.obj kernel32.lib user32.lib
CL.EXE /Brepro /c /DCYCLES /DNATIVE /GAFwy /O2y /W4 /Zl 64-amd64.c
LINK.EXE /LINK /BREPRO /DYNAMICBASE /ENTRY:mainCRTStartup /MACHINE:AMD64 /NOCOFFGRPINFO /NODEFAULTLIB /NXCOMPAT /OPT:REF /OUT:64-div.exe /RELEASE /SUBSYSTEM:CONSOLE 64-amd64.obj kernel32.lib user32.lib
ML64.EXE /Brepro /Cp /Cx /c /DJccLess /W3 /X udivmodti4.asm
ML64.EXE /Brepro /Cp /Cx /c /W3 /X udivmodti4-hybrid.asm
CL.EXE /Brepro /c /DCYCLES /GAFwy /O2y /W4 /Zl 128-amd64.c
LINK.EXE /LINK /BREPRO /DYNAMICBASE /ENTRY:mainCRTStartup /MACHINE:AMD64 /NOCOFFGRPINFO /NODEFAULTLIB /NXCOMPAT /OPT:REF /OUT:128.exe /RELEASE /SUBSYSTEM:CONSOLE 128-amd64.obj udivmodti4.obj kernel32.lib user32.lib
LINK.EXE /LINK /BREPRO /DYNAMICBASE /ENTRY:mainCRTStartup /MACHINE:AMD64 /NOCOFFGRPINFO /NODEFAULTLIB /NXCOMPAT /OPT:REF /OUT:128-hybrid.exe /RELEASE /SUBSYSTEM:CONSOLE 128-amd64.obj udivmodti4-hybrid.obj kernel32.lib user32.lib
.\128.exe
.\128-hybrid.exe
.\64.exe
.\64-div.exe

For details and reference, see the MSDN articles Compiler Options and Linker Options.

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: all 64-bit programs are pure Win32 console applications and build without the MSVCRT libraries.

Note: the command lines can be copied and pasted as a block into a Command Processor window!

Microsoft (R) Macro Assembler (x64) Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: udivmoddi4.asm

Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

64-amd64.c

Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

64-amd64.c

Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

Microsoft (R) Macro Assembler (x64) Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: udivmodti4.asm

Microsoft (R) Macro Assembler (x64) Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: udivmodti4-hybrid.asm

Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

128-amd64.c
128-amd64.c(197) : warning C4244: 'function' : conversion from 'QWORD' to 'BYTE', possible loss of data
128-amd64.c(266) : warning C4090: 'function' : different 'const' qualifiers
128-amd64.c(266) : warning C4090: 'function' : different 'const' qualifiers

Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

Testing 128-bit division...
80
Timing 128-bit division on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz

__unopti4()          8.406722043      0
__udivmodti4()      68.135103197     59.728381154
__umulti3()         15.426767694      7.020045651
                    91.968592934 clock cycles

__unopti4()         10.955079756      0
__udivmodti4()      71.382337619     60.427257863
__umulti3()         20.658580753      9.703500997
                   102.995998128 clock cycles

Testing 128-bit division...
80
Timing 128-bit division on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz

__unopti4()          8.500168812      0
__udivmodti4()      35.628062934     27.127894122
__umulti3()         15.499977787      6.999808975
                    59.628209533 clock cycles

__unopti4()         10.962429071      0
__udivmodti4()     127.868276342    116.905847271
__umulti3()         20.865134980      9.902705909
                   159.695840393 clock cycles

Testing 64-bit assembler division...
57
Timing 64-bit assembler division on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz

__unopdi4()          7.046572321      0
__udivmoddi4()      39.744549176     32.697976855
__umuldi3()          8.225293991      1.178721670
                    55.016415488 clock cycles

__unopdi4()          7.939823193      0
__udivmoddi4()      67.565681569     59.625858376
__umuldi3()          8.377724642      0.437901449
                    83.883229404 clock cycles

Timing 64-bit native division on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz

__unopdi4()          7.267861652      0
__udivmoddi4()      35.352622330     28.084760678
__umuldi3()          7.988646681      0.720785029
                    50.609130663 clock cycles

__unopdi4()          7.972264793      0
__udivmoddi4()      37.397281198     29.425016405
__umuldi3()          8.457423360      0.485158567
                    53.826969351 clock cycles
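
The two columns of the 64-bit results can be reproduced from the raw counter differences. Each timed loop of 64-amd64.c runs 500000000 iterations with 2 calls, that is 10^9 calls, so a difference of QueryThreadCycleTime() values divided by 10^9 is the number of clock cycles per call, and the remainder supplies nine decimal digits. Without CYCLES, GetThreadTimes() counts units of 100 ns, so the divisor becomes 10^7 and the result is in nano-seconds per call. The sketch below only restates this arithmetic; the helper names are hypothetical and not part of the benchmark source.

```c
#include <stdint.h>

#define CALLS 1000000000ULL     /* 500000000 iterations x 2 calls per loop */

/* Net cost of a timed loop after subtracting the empty-call baseline,
   clamped at 0 so that measurement noise cannot wrap around. */
static uint64_t net_ticks(uint64_t loop, uint64_t baseline)
{
    return loop > baseline ? loop - baseline : 0ULL;
}

/* Clock cycles per call: integer part and fraction in 10^-9 cycles. */
static uint64_t cycles_whole(uint64_t ticks) { return ticks / CALLS; }
static uint64_t cycles_frac(uint64_t ticks)  { return ticks % CALLS; }

/* Nano-seconds per call from 100 ns units:
   ticks * 100 ns / 10^9 calls = ticks / 10^7 ns per call. */
static uint64_t ns_whole(uint64_t ticks) { return ticks / 10000000ULL; }
static uint64_t ns_frac(uint64_t ticks)  { return ticks % 10000000ULL; }
```

With the first 64-bit assembler run above, net_ticks(39744549176, 7046572321) yields 32697976855, which is printed as 32.697976855 in the second column of the __udivmoddi4() line.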

Now without the preprocessor macro CYCLES defined:

[…]

Testing 128-bit division...
80
Timing 128-bit division on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz

__unopti4()          3.3852217      0
__udivmodti4()      27.3937756     24.0085539
__umulti3()          6.2556401      2.8704184
                    37.0346374 nano-seconds

__unopti4()          4.3992282      0
__udivmodti4()      28.3453817     23.9461535
__umulti3()          8.3772537      3.9780255
                    41.1218636 nano-seconds

Testing 128-bit division...
80
Timing 128-bit division on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz

__unopti4()          3.3852217      0
__udivmodti4()      14.5080930     11.1228713
__umulti3()          6.2868403      2.9016186
                    24.1801550 nano-seconds

__unopti4()          4.3368278      0
__udivmodti4()      51.9015327     47.5647049
__umulti3()          8.3772537      4.0404259
                    64.6156142 nano-seconds

Testing 64-bit assembler division...
57
Timing 64-bit assembler division on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz

__unopdi4()          3.0264194      0
__udivmoddi4()      14.6952942     11.6688748
__umuldi3()          3.1512202      0.1248008
                    20.8729338 nano-seconds

__unopdi4()          3.0888198      0
__udivmoddi4()      26.7229713     23.6341515
__umuldi3()          3.5256226      0.4368028
                    33.3374137 nano-seconds

Timing 64-bit native division on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz

__unopdi4()          2.7300175      0
__udivmoddi4()      14.4144924     11.6844749
__umuldi3()          3.1356201      0.4056026
                    20.2801300 nano-seconds

__unopdi4()          3.1044199      0
__udivmoddi4()      15.1320970     12.0276771
__umuldi3()          3.5100225      0.4056026
                    21.7465394 nano-seconds
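
As a plausibility check, the clock cycle and nano-second figures can be compared through the processor's clock frequency. The sketch below assumes that the cycle counter advances at the nominal 2.53 GHz from the processor's name string, which need not hold on every system; the helper cycles_to_ps() is hypothetical and works on integers only, like the benchmark itself.

```c
#include <stdint.h>

/* Convert a cost given in 10^-9 clock cycles into picoseconds for a
   clock frequency given in MHz (assumed nominal frequency):
   t = nanocycles * 10^-9 / (mhz * 10^6) s = nanocycles / (1000 * mhz) ps */
static uint64_t cycles_to_ps(uint64_t nanocycles, uint64_t mhz)
{
    return nanocycles / (1000ULL * mhz);
}
```

For the first 128-bit run with CYCLES defined, cycles_to_ps(59728381154, 2530) gives 23608 ps, or about 23.6 ns per call. This is within about 2 % of the 24.0085539 nano-seconds measured in the corresponding separate run without CYCLES.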

[…]

Testing 128-bit division...
80
Timing 128-bit division on Intel(R) Core(TM) i5-4670 CPU @ 3.40GHz

__unopti4()          6.787912997      0
__udivmodti4()      59.706410079     52.918497082
__umulti3()          9.539191229      2.751278232
                    76.033514305 clock cycles

__unopti4()          9.064762228      0
__udivmodti4()      64.037532443     54.972770215
__umulti3()         12.381997632      3.317235404
                    85.484292303 clock cycles

Testing 128-bit division...
80
Timing 128-bit division on Intel(R) Core(TM) i5-4670 CPU @ 3.40GHz

__unopti4()          6.721662277      0
__udivmodti4()      29.001432494     22.279770217
__umulti3()          9.428926160      2.707263883
                    45.152020931 clock cycles

__unopti4()          8.960905247      0
__udivmodti4()      83.353539778     74.392634531
__umulti3()         12.205547423      3.244642176
                   104.519992448 clock cycles

Testing 64-bit assembler division...
57
Timing 64-bit assembler division on Intel(R) Core(TM) i5-4670 CPU @ 3.40GHz

__unopdi4()          5.247617465      0
__udivmoddi4()      24.890194689     19.642577224
__umuldi3()          6.295783447      1.048165982
                    36.433595601 clock cycles

__unopdi4()          5.941744662      0
__udivmoddi4()      44.505047583     38.563302921
__umuldi3()          7.127920907      1.186176245
                    57.574713152 clock cycles

Timing 64-bit native division on Intel(R) Core(TM) i5-4670 CPU @ 3.40GHz

__unopdi4()          4.928616197      0
__udivmoddi4()      36.111315035     31.182698838
__umuldi3()          6.104300272      1.175684075
                    47.144231504 clock cycles

__unopdi4()          5.832601506      0
__udivmoddi4()      37.489066778     31.656465272
__umuldi3()          6.694749067      0.862147561
                    50.016417351 clock cycles

[…]

Testing 128-bit division...
80
Timing 128-bit division on Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz

__unopti4()          6.137470372      0
__udivmodti4()      97.558400914     91.420930542
__umulti3()          7.721820100      1.584349728
                   111.417691386 clock cycles

Testing 128-bit division...
80
Timing 128-bit division on Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz

__unopti4()          6.169866194      0
__udivmodti4()      26.153439674     19.983573480
__umulti3()          7.535166722      1.365300528
                    39.858472590 clock cycles

Testing 64-bit assembler division...
57
Timing 64-bit assembler division on Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz

__unopdi4()          3.936338838      0
__udivmoddi4()      20.136963782     16.200624944
__umuldi3()          4.070675478      0.134336640
                    28.143978098 clock cycles

Timing 64-bit native division on Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz

__unopdi4()          3.938140068      0
__udivmoddi4()      40.965728700     37.027588632
__umuldi3()          4.276305996      0.338165928
                    49.180174764 clock cycles

[…]

Testing 128-bit division...
80
Timing 128-bit division on Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz

__unopti4()          5.739886508      0
__udivmodti4()      60.265247522     54.525361014
__umulti3()          8.030493537      2.290607029
                    74.035627567 clock cycles

__unopti4()          8.376397925      0
__udivmodti4()      63.878099605     55.501701680
__umulti3()         10.674320936      2.297923011
                    82.928818466 clock cycles

Testing 128-bit division...
80
Timing 128-bit division on Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz

__unopti4()          5.759768704      0
__udivmodti4()      24.207704489     18.447935785
__umulti3()          7.991973095      2.232204391
                    37.959446288 clock cycles

__unopti4()          8.356751289      0
__udivmodti4()      73.164876383     64.808125094
__umulti3()         10.667141168      2.310389879
                    92.188768840 clock cycles

Testing 64-bit assembler division...
57
Timing 64-bit assembler division on Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz

__unopdi4()          5.215172319      0
__udivmoddi4()      20.464980809     15.249808490
__umuldi3()          4.339737255      0.000000000
                    30.019890383 clock cycles

__unopdi4()          6.034145232      0
__udivmoddi4()      37.823775351     31.789630119
__umuldi3()          5.595748061      0.000000000
                    49.453668644 clock cycles

Timing 64-bit native division on Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz

__unopdi4()          4.311459182      0
__udivmoddi4()      32.456528199     28.145069017
__umuldi3()          5.798396574      1.486937392
                    42.566383955 clock cycles

__unopdi4()          5.594526971      0
__udivmoddi4()      34.625131407     29.030604436
__umuldi3()          5.613384251      0.018857280
                    45.833042629 clock cycles

[…]

Testing 128-bit division...
80
Timing 128-bit division on Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz

__unopti4()          3.384227961      0
__udivmodti4()      34.535941576     31.151713615
__umulti3()          4.561600376      1.177372415
                    42.481769913 clock cycles

__unopti4()          4.958807640      0
__udivmodti4()      36.796688055     31.837880415
__umulti3()          6.071705006      1.112897366
                    47.827200701 clock cycles

Testing 128-bit division...
80
Timing 128-bit division on Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz

__unopti4()          3.285746669      0
__udivmodti4()      14.265474814     10.979728145
__umulti3()          4.595527857      1.309781188
                    22.146749340 clock cycles

__unopti4()          4.911969125      0
__udivmodti4()      42.153665292     37.241696167
__umulti3()          6.065414902      1.153445777
                    53.131049319 clock cycles

Testing 64-bit assembler division...
57
Timing 64-bit assembler division on Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz

__unopdi4()          4.833738399      0
__udivmoddi4()      16.775388790     11.941650391
__umuldi3()          2.665548625      0.000000000
                    24.274675814 clock cycles

__unopdi4()          4.713733765      0
__udivmoddi4()      25.241120837     20.527387072
__umuldi3()          3.737127811      0.000000000
                    33.691982413 clock cycles

Timing 64-bit native division on Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz

__unopdi4()          5.110940314      0
__udivmoddi4()      33.205803994     28.094863680
__umuldi3()          5.463078377      0.352138063
                    43.779822685 clock cycles

__unopdi4()          5.326425239      0
__udivmoddi4()      32.876005661     27.549580422
__umuldi3()          5.327849707      0.001424468
                    43.530280607 clock cycles

[…]

Testing 128-bit division...
80
Timing 128-bit division on Intel(R) Core(TM) i5-9500 CPU @ 3.00GHz

__unopti4()          5.540168007      0
__udivmodti4()      79.168789872     73.628621865
__umulti3()          6.110792248      0.570624241
                    90.819750127 clock cycles

Testing 128-bit division...
80
Timing 128-bit division on Intel(R) Core(TM) i5-9500 CPU @ 3.00GHz

__unopti4()          5.564817542      0
__udivmodti4()      20.911815694     15.346998152
__umulti3()          6.121753970      0.556936428
                    32.598387206 clock cycles

Testing 64-bit assembler division...
57
Timing 64-bit assembler division on Intel(R) Core(TM) i5-9500 CPU @ 3.00GHz

__unopdi4()          3.170010417      0
__udivmoddi4()      16.466358752     13.296348335
__umuldi3()          3.334743188      0.164732771
                    22.971112357 clock cycles

Timing 64-bit native division on Intel(R) Core(TM) i5-9500 CPU @ 3.00GHz

__unopdi4()          3.157196080      0
__udivmoddi4()      33.291178787     30.133982707
__umuldi3()          3.494135522      0.336939442
                    39.942510389 clock cycles

[…]

Testing 128-bit division...
80
Timing 128-bit division on AMD A4-9125 RADEON R3, 4 COMPUTE CORES 2C+2G

__unopti4()          8.868646128      0
__udivmodti4()      46.689956608     37.821310480
__umulti3()         21.770531089     12.901884961
                    77.329133825 clock cycles

__unopti4()         11.520301596      0
__udivmodti4()      52.089834710     40.569533114
__umulti3()         17.011904024      5.491602428
                    80.622040330 clock cycles

Testing 128-bit division...
80
Timing 128-bit division on AMD A4-9125 RADEON R3, 4 COMPUTE CORES 2C+2G

__unopti4()          8.881698241      0
__udivmodti4()      35.183735229     26.302036988
__umulti3()         21.749787857     12.868089616
                    65.815221327 clock cycles

__unopti4()         11.526274366      0
__udivmodti4()      89.981335374     78.455061008
__umulti3()         18.157572955      6.631298589
                   119.665182695 clock cycles

Testing 64-bit assembler division...
57
Timing 64-bit assembler division on AMD A4-9125 RADEON R3, 4 COMPUTE CORES 2C+2G

__unopdi4()         10.051541116      0
__udivmoddi4()      34.927090936     24.875549820
__umuldi3()          9.377512848      0.000000000
                    54.356144900 clock cycles

__unopdi4()          8.748480388      0
__udivmoddi4()      67.648442691     58.899962303
__umuldi3()          8.607450535      0.000000000
                    85.004373614 clock cycles

Timing 64-bit native division on AMD A4-9125 RADEON R3, 4 COMPUTE CORES 2C+2G

__unopdi4()         10.965153851      0
__udivmoddi4()      21.871554419     10.906400568
__umuldi3()          8.548907922      0.000000000
                    41.385616192 clock cycles

__unopdi4()          8.868903219      0
__udivmoddi4()      31.380700111     22.511796892
__umuldi3()          9.254742318      0.385839099
                    49.504345648 clock cycles

Now without the preprocessor macro CYCLES defined:

[…]

Testing 128-bit division...
80
Timing 128-bit division on AMD A4-9125 RADEON R3, 4 COMPUTE CORES 2C+2G

__unopti4()          4.0781250      0
__udivmodti4()      23.7656250     19.6875000
__umulti3()         11.1250000      7.0468750
                    38.9687500 nano-seconds

__unopti4()          5.2968750      0
__udivmodti4()      23.0625000     17.7656250
__umulti3()          7.4218750      2.1250000
                    35.7812500 nano-seconds

Testing 128-bit division...
80
Timing 128-bit division on AMD A4-9125 RADEON R3, 4 COMPUTE CORES 2C+2G

__unopti4()          4.0937500      0
__udivmodti4()      18.9062500     14.8125000
__umulti3()          9.7968750      5.7031250
                    32.7968750 nano-seconds

__unopti4()          5.0156250      0
__udivmodti4()      37.1875000     32.1718750
__umulti3()          7.5000000      2.4843750
                    49.7031250 nano-seconds

Testing 64-bit assembler division...
57
Timing 64-bit assembler division on AMD A4-9125 RADEON R3, 4 COMPUTE CORES 2C+2G

__unopdi4()          3.3906250      0
__udivmoddi4()      16.2968750     12.9062500
__umuldi3()          4.8750000      1.4843750
                    24.5625000 nano-seconds

__unopdi4()          5.1093750      0
__udivmoddi4()      24.6718750     19.5625000
__umuldi3()          5.2500000      0.1406250
                    35.0312500 nano-seconds

Timing 64-bit native division on AMD A4-9125 RADEON R3, 4 COMPUTE CORES 2C+2G

__unopdi4()          3.1875000      0
__udivmoddi4()       8.7812500      5.5937500
__umuldi3()          3.1093750      0.0000000
                    15.0781250 nano-seconds

__unopdi4()          3.5000000      0
__udivmoddi4()      11.6093750      8.1093750
__umuldi3()          3.2968750      0.0000000
                    18.4062500 nano-seconds

[…]

Testing 128-bit division...
80
Timing 128-bit division on AMD Ryzen 7 2700X Eight-Core Processor

__unopti4()          6.486039210      0
__udivmodti4()      26.669040891     20.183001681
__umulti3()          9.301024928      2.814985718
                    42.456105029 clock cycles

__unopti4()          9.463028389      0
__udivmodti4()      28.686989038     19.223960649
__umulti3()         12.162446193      2.699417804
                    50.312463620 clock cycles

Testing 128-bit division...
80
Timing 128-bit division on AMD Ryzen 7 2700X Eight-Core Processor

__unopti4()          6.759914430      0
__udivmodti4()      29.315665619     22.555751189
__umulti3()          9.908795028      3.148880598
                    45.984375077 clock cycles

__unopti4()         10.063938544      0
__udivmodti4()      73.125239046     63.061300502
__umulti3()         12.680222751      2.616284207
                    95.869400341 clock cycles

Testing 64-bit assembler division...
57
Timing 64-bit assembler division on AMD Ryzen 7 2700X Eight-Core Processor

__unopdi4()          5.169526411      0
__udivmoddi4()      26.320155604     21.150629193
__umuldi3()          5.166327910      0.000000000
                    36.656009925 clock cycles

__unopdi4()          5.596172042      0
__udivmoddi4()      47.084314600     41.488142558
__umuldi3()          5.595622223      0.000000000
                    58.276108865 clock cycles

Timing 64-bit native division on AMD Ryzen 7 2700X Eight-Core Processor

__unopdi4()          5.242663759      0
__udivmoddi4()      19.351564974     14.108901215
__umuldi3()          6.522515592      1.279851833
                    31.116744325 clock cycles

__unopdi4()          5.654228263      0
__udivmoddi4()      22.197831810     16.543603547
__umuldi3()          6.958791467      1.304563204
                    34.810851540 clock cycles

[…]

Testing 128-bit division...
80
Timing 128-bit division on AMD Ryzen 5 3600 6-Core Processor

__unopti4()          6.898510224      0
__udivmodti4()      26.925326748     20.026816524
__umulti3()          9.177466284      2.278956060
                    43.001303256 clock cycles

__unopti4()          9.476578368      0
__udivmodti4()      29.322601849     19.846023481
__umulti3()         11.710555056      2.233976688
                    50.509735273 clock cycles

Testing 128-bit division...
80
Timing 128-bit division on AMD Ryzen 5 3600 6-Core Processor

__unopti4()          6.865702096      0
__udivmodti4()      27.542023885     20.676321789
__umulti3()          9.108802297      2.243100201
                    43.516528278 clock cycles

__unopti4()          9.442571687      0
__udivmodti4()      68.794109504     59.351537817
__umulti3()         11.703519360      2.260947673
                    89.940200551 clock cycles

Testing 64-bit assembler division...
57
Timing 64-bit assembler division on AMD Ryzen 5 3600 6-Core Processor

__unopdi4()          4.722583824      0
__udivmoddi4()      26.829651312     22.107067488
__umuldi3()          4.722048143      0.000000000
                    36.274283279 clock cycles

__unopdi4()          5.156534846      0
__udivmoddi4()      46.419813577     41.263278731
__umuldi3()          5.153521140      0.000000000
                    56.729869563 clock cycles

Timing 64-bit native division on AMD Ryzen 5 3600 6-Core Processor

__unopdi4()          4.721372604      0
__udivmoddi4()      19.197411303     14.476038699
__umuldi3()          5.582367577      0.860994973
                    29.501151484 clock cycles

__unopdi4()          5.153817924      0
__udivmoddi4()      22.015109233     16.861291309
__umuldi3()          6.009193188      0.855375264
                    33.178120345 clock cycles

And without the preprocessor macro CYCLES defined:

[…]

Testing 128-bit division...
80
Timing 128-bit division on AMD Ryzen 5 3600 6-Core Processor

__unopti4()          2.0000000      0
__udivmodti4()      10.7343750      8.7343750
__umulti3()          2.5312500      0.5312500
                    15.2656250 nano-seconds

__unopti4()          2.6250000      0
__udivmodti4()       8.1093750      5.4843750
__umulti3()          3.2031250      0.5781250
                    13.9375000 nano-seconds

Testing 128-bit division...
80
Timing 128-bit division on AMD Ryzen 5 3600 6-Core Processor

__unopti4()          2.0156250      0
__udivmodti4()       8.7500000      6.7343750
__umulti3()          2.5312500      0.5156250
                    13.2968750 nano-seconds

__unopti4()          2.6250000      0
__udivmodti4()      20.0468750     17.4218750
__umulti3()          3.1875000      0.5625000
                    25.8593750 nano-seconds

Testing 64-bit assembler division...
57
Timing 64-bit assembler division on AMD Ryzen 5 3600 6-Core Processor

__unopdi4()          1.3125000      0
__udivmoddi4()       7.2812500      5.9687500
__umuldi3()          1.3281250      0.0156250
                     9.9218750 nano-seconds

__unopdi4()          1.4375000      0
__udivmoddi4()      12.3281250     10.8906250
__umuldi3()          1.3125000      0.0000000
                    15.0781250 nano-seconds

Timing 64-bit native division on AMD Ryzen 5 3600 6-Core Processor

__unopdi4()          1.3125000      0
__udivmoddi4()       5.3281250      4.0156250
__umuldi3()          1.5625000      0.2500000
                     8.2031250 nano-seconds

__unopdi4()          1.4218750      0
__udivmoddi4()       6.1250000      4.7031250
__umuldi3()          1.6718750      0.2500000
                     9.2187500 nano-seconds

[…]

Testing 128-bit division...
80
Timing 128-bit division on AMD Ryzen 9 3900XT 12-Core Processor

__unopti4()          6.639901585      0
__udivmodti4()      25.407730112     18.767828527
__umulti3()          8.787561378      2.147659793
                    40.835193075 clock cycles

__unopti4()          9.009978790      0
__udivmodti4()      27.641956656     18.631977866
__umulti3()         11.091630160      2.081651370
                    47.743565606 clock cycles

Testing 128-bit division...
80
Timing 128-bit division on AMD Ryzen 9 3900XT 12-Core Processor

__unopti4()          6.564753997      0
__udivmodti4()      26.123169316     19.558415319
__umulti3()          8.788291128      2.223537131
                    41.476214441 clock cycles

__unopti4()          9.088542617      0
__udivmodti4()      65.108671217     56.020128600
__umulti3()         11.086648437      1.998105820
                    85.283862271 clock cycles

Testing 64-bit assembler division...
57
Timing 64-bit assembler division on AMD Ryzen 9 3900XT 12-Core Processor

__unopdi4()          4.487864035      0
__udivmoddi4()      25.416913553     20.929049518
__umuldi3()          4.530544610      0.042680575
                    34.435322198 clock cycles

__unopdi4()          4.909401335      0
__udivmoddi4()      43.696358312     38.786956977
__umuldi3()          4.915678491      0.006277156
                    53.521438138 clock cycles

Timing 64-bit native division on AMD Ryzen 9 3900XT 12-Core Processor

__unopdi4()          4.486075560      0
__udivmoddi4()      17.826014215     13.339938655
__umuldi3()          5.293388605      0.807313045
                    27.605478380 clock cycles

__unopdi4()          4.913181349      0
__udivmoddi4()      20.468841039     15.555659690
__umuldi3()          5.707274649      0.794093300
                    31.089297037 clock cycles

And without the preprocessor macro CYCLES defined:

[…]

Testing 128-bit division...
80
Timing 128-bit division on AMD Ryzen 9 3900XT 12-Core Processor

__unopti4()          1.8437500      0
__udivmodti4()       6.7187500      4.8750000
__umulti3()          2.3125000      0.4687500
                    10.8750000 nano-seconds

__unopti4()          2.3906250      0
__udivmodti4()       7.2343750      4.8437500
__umulti3()          2.9531250      0.5625000
                    12.5781250 nano-seconds

Testing 128-bit division...
80
Timing 128-bit division on AMD Ryzen 9 3900XT 12-Core Processor

__unopti4()          1.7968750      0
__udivmodti4()       7.1875000      5.3906250
__umulti3()          2.3125000      0.5156250
                    11.2968750 nano-seconds

__unopti4()          2.3437500      0
__udivmodti4()      17.2812500     14.9375000
__umulti3()          2.9062500      0.5625000
                    22.5312500 nano-seconds

Testing 64-bit assembler division...
57
Timing 64-bit assembler division on AMD Ryzen 9 3900XT 12-Core Processor

__unopdi4()          1.1718750      0
__udivmoddi4()       6.5468750      5.3750000
__umuldi3()          1.3281250      0.1562500
                     9.0468750 nano-seconds

__unopdi4()          1.2968750      0
__udivmoddi4()      11.0937500      9.7968750
__umuldi3()          1.1562500      0.0000000
                    13.5468750 nano-seconds

Timing 64-bit native division on AMD Ryzen 9 3900XT 12-Core Processor

__unopdi4()          1.2031250      0
__udivmoddi4()       4.7031250      3.5000000
__umuldi3()          1.3593750      0.1562500
                     7.2656250 nano-seconds

__unopdi4()          1.2968750      0
__udivmoddi4()       5.4062500      4.1093750
__umuldi3()          1.6406250      0.3437500
                     8.3437500 nano-seconds

[…]

Testing 128-bit division...
80
Timing 128-bit division on AMD EPYC 7713 64-Core Processor

__unopti4()          4.210726820      0
__udivmodti4()      13.480688960      9.269962140
__umulti3()          8.231189400      4.020462580
                    25.922605180 clock cycles

__unopti4()          6.144526480      0
__udivmodti4()      16.105263640      9.960737160
__umulti3()          7.747329420      1.602802940
                    29.997119540 clock cycles

Testing 128-bit division...
80
Timing 128-bit division on AMD EPYC 7713 64-Core Processor

__unopti4()          4.213688400      0
__udivmodti4()      15.444323440     11.230635040
__umulti3()          8.231448620      4.017760220
                    27.889460460 clock cycles

__unopti4()          6.144599980      0
__udivmodti4()      38.763362940     32.618762960
__umulti3()          7.749265980      1.604666000
                    52.657228900 clock cycles

Testing 64-bit assembler division...
57
Timing 64-bit assembler division on AMD EPYC 7713 64-Core Processor

__unopdi4()          3.566402000      0
__udivmoddi4()      16.570870520     13.004468520
__umuldi3()          3.561979020      0.000000000
                    23.699251540 clock cycles

__unopdi4()          3.889461880      0
__udivmoddi4()      26.724160800     22.834698920
__umuldi3()          3.889344680      0.000000000
                    34.502967360 clock cycles


Timing 64-bit native division on AMD EPYC 7713 64-Core Processor

__unopdi4()          3.557871840      0
__udivmoddi4()       6.782881360      3.225009520
__umuldi3()          4.205765360      0.647893520
                    14.546518560 clock cycles

__unopdi4()          3.882580440      0
__udivmoddi4()       6.782383720      2.899803280
__umuldi3()          4.532050080      0.649469640
                    15.197014240 clock cycles

And without the preprocessor macro CYCLES defined:

[…]

Testing 128-bit division...
80
Timing 128-bit division on AMD EPYC 7713 64-Core Processor

__unopti4()          2.1718750      0
__udivmodti4()       6.7656250      4.5937500
__umulti3()          4.0781250      1.9062500
                    13.0156250 nano-seconds

__unopti4()          2.9687500      0
__udivmodti4()       8.1875000      5.2187500
__umulti3()          3.8906250      0.9218750
                    15.0468750 nano-seconds

Testing 128-bit division...
80
Timing 128-bit division on AMD EPYC 7713 64-Core Processor

__unopti4()          2.1406250      0
__udivmodti4()       8.1875000      6.0468750
__umulti3()          4.0781250      1.9375000
                    14.4062500 nano-seconds

__unopti4()          2.9687500      0
__udivmodti4()      20.0000000     17.0312500
__umulti3()          3.9062500      0.9375000
                    26.8750000 nano-seconds

Testing 64-bit assembler division...
57
Timing 64-bit assembler division on AMD EPYC 7713 64-Core Processor

__unopdi4()          1.8125000      0
__udivmoddi4()       8.4218750      6.6093750
__umuldi3()          1.8281250      0.0156250
                    12.0625000 nano-seconds

__unopdi4()          1.9843750      0
__udivmoddi4()      13.5000000     11.5156250
__umuldi3()          1.8125000      0.0000000
                    17.2968750 nano-seconds

Timing 64-bit native division on AMD EPYC 7713 64-Core Processor

__unopdi4()          1.8281250      0
__udivmoddi4()       3.4531250      1.6250000
__umuldi3()          2.1406250      0.3125000
                     7.4218750 nano-seconds

__unopdi4()          1.9843750      0
__udivmoddi4()       3.4531250      1.4687500
__umuldi3()          2.3125000      0.3281250
                     7.7500000 nano-seconds
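
The columns of the listings above can be read as raw time per call, then time per call net of the no-op baseline, with the sum of the three raw figures on the last line; for example, 17.826014215 − 4.486075560 = 13.339938655. The following minimal sketch in portable C reproduces that arithmetic. It assumes the AMD64 programs derive their columns the same way the i386 program computes qwTx, qwTy and qwTz; the names raw_times, net_times, net_cost() and print_per_call() are hypothetical and not part of the benchmark programs.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Raw timer deltas of the three measurement loops (hypothetical names). */
typedef struct { uint64_t noop, divide, multiply; } raw_times;

/* Derived figures, mirroring qwTx, qwTy and qwTz of the i386 listing. */
typedef struct { uint64_t divide_net, multiply_net, total; } net_times;

net_times net_cost(raw_times raw)
{
    net_times net;

    /* a division loop faster than the no-op loop indicates a broken measurement */
    assert(raw.divide >= raw.noop);

    net.divide_net   = raw.divide - raw.noop;
    /* clamp at zero: the multiplication loop may be as fast as the no-op loop */
    net.multiply_net = raw.multiply > raw.noop ? raw.multiply - raw.noop : 0;
    net.total        = raw.noop + raw.divide + raw.multiply;
    return net;
}

/* Print value / scale as "integer.fraction" using integer arithmetic only;
   scale must equal 10**digits.
   For one billion calls: scale = 1000000000 (9 digits) with clock cycles,
   scale = 10000000 (7 digits) with 100 ns timer units, since
   units * 100 ns / 10**9 calls = units / 10**7 ns per call. */
void print_per_call(const char *label, uint64_t value, uint64_t scale, int digits)
{
    printf("%-16s %6llu.%0*llu\n", label,
           (unsigned long long) (value / scale),
           digits, (unsigned long long) (value % scale));
}
```

Splitting each delta into quotient and remainder keeps the output free of floating-point arithmetic, which is presumably why the benchmark program does the same with a single DIV instruction.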

Benchmark Program for i386 Processors

The following C program for 32-bit processors measures the execution time for one billion divisions of uniformly distributed 64-bit pseudo-random numbers and for one billion divisions of pseudo-random numbers in the interval from 2⁶⁴−1 to 2³² with the __udivmoddi4() function.

Note: with the preprocessor macro HELPER defined, it uses the compiler helper routines _alldiv(), _alldvrm(), _allmul(), _allrem(), _aulldiv(), _aulldvrm() and _aullrem() instead, which the Microsoft Visual C compiler calls to perform 64-bit division and multiplication.

Note: with the preprocessor macro CYCLES defined, it measures the execution time in processor clock cycles and runs on Windows Vista® and newer versions; otherwise it measures the execution time in nano-seconds and runs on all versions of Windows™ NT.

Note: it uses the same pseudo-random number generators as the second C program for 64-bit processors, so their results are directly comparable.
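
As an illustration separate from the benchmark program itself, the default (non-XORSHIFT) generators can be sketched in portable C as follows. The polynomial constants are those of the program; the function names lfsr_left(), lfsr_right() and skew_operand() are hypothetical. skew_operand() assumes that the __ull_rshift() intrinsic takes its shift count modulo 32 on i386, as the commented-out "& 31" in the program suggests.

```c
#include <assert.h>
#include <stdint.h>

/* One step of the left-shifting Galois LFSR applied to qwLeft:
   shift left by one bit and XOR in the polynomial
   when the bit shifted out was set. */
uint64_t lfsr_left(uint64_t state)
{
    uint64_t mask;

    assert(state != 0);                 /* an all-zero state never changes */

    mask = 0 - (state >> 63);           /* all ones if the top bit is set */
    return (state << 1) ^ (mask & 0xAD93D23594C935A9ULL);
}

/* One step of the right-shifting Galois LFSR applied to qwRight. */
uint64_t lfsr_right(uint64_t state)
{
    uint64_t mask;

    assert(state != 0);

    mask = 0 - (state & 1);             /* all ones if the low bit is set */
    return (state >> 1) ^ (mask & 0x2B5926535897936BULL);
}

/* Operand for the second measurement: the generator output shifted right
   by its own low 5 bits (assumption: shift count modulo 32), which varies
   the number of significant bits of dividend and divisor. */
uint64_t skew_operand(uint64_t value)
{
    return value >> (value & 31);
}
```

The benchmark program derives the mask for the left-shifting generator with an arithmetic right shift of the signed value instead; the subtraction from zero used here yields the same mask without relying on implementation-defined behaviour.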

// Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifndef _M_IX86
#pragma message("For I386 platform only!")
#endif

#define STRICT
#undef UNICODE
#define WIN32_LEAN_AND_MEAN

#include <windows.h>

typedef	LONGLONG	SQWORD;
typedef	ULONGLONG	QWORD;

#define _(DIVIDEND, DIVISOR)	{(DIVIDEND), (DIVISOR), (DIVIDEND) / (DIVISOR), (DIVIDEND) % (DIVISOR)}

const	struct	_ull
{
	QWORD	ullDividend, ullDivisor, ullQuotient, ullRemainder;
} ullTable[] = {_(0x0000000000000000ULL, 0x0000000000000001ULL),
                _(0x0000000000000001ULL, 0x0000000000000001ULL),
                _(0x0000000000000002ULL, 0x0000000000000001ULL),
                _(0x0000000000000002ULL, 0x0000000000000002ULL),
                _(0x0000000000000000ULL, 0xFFFFFFFFFFFFFFFFULL),
                _(0x0000000000000001ULL, 0xFFFFFFFFFFFFFFFFULL),
                _(0x0000000000000001ULL, 0xFFFFFFFFFFFFFFFEULL),
                _(0x0000000000000002ULL, 0xFFFFFFFFFFFFFFFEULL),
                _(0x0000000000000003ULL, 0xFFFFFFFFFFFFFFFEULL),
                _(0x0000000000000003ULL, 0xFFFFFFFFFFFFFFFDULL),
                _(0x000000000FFFFFFFULL, 0x0000000000000001ULL),
                _(0x0000000FFFFFFFFFULL, 0x000000000000000FULL),
                _(0x0000000FFFFFFFFFULL, 0x0000000000000010ULL),
                _(0x0000000000000100ULL, 0x000000000FFFFFFFULL),
                _(0x00FFFFFFF0000000ULL, 0x0000000010000000ULL),
                _(0x07FFFFFF80000000ULL, 0x0000000080000000ULL),
                _(0x7FFFFFFEFFFFFFF0ULL, 0xFFFFFFFFFFFFFFFEULL),
                _(0x7FFFFFFEFFFFFFF0ULL, 0x0000FFFFFFFFFFFEULL),
                _(0x7FFFFFFEFFFFFFF0ULL, 0x7FFFFFFEFFFFFFF0ULL),
                _(0x7FFFFFFFFFFFFFFFULL, 0x8000000000000000ULL),
                _(0x7FFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFDULL),
                _(0x7FFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFEULL),
                _(0x7FFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL),
                _(0x8000000000000000ULL, 0x0000000000000001ULL),
                _(0x8000000000000000ULL, 0x0000000000000002ULL),
                _(0x8000000000000000ULL, 0x0000000000000003ULL),
                _(0x8000000000000000ULL, 0x00000000FFFFFFFDULL),
                _(0x8000000000000000ULL, 0x00000000FFFFFFFEULL),
                _(0x8000000000000000ULL, 0x00000000FFFFFFFFULL),
                _(0x8000000000000000ULL, 0x0000000100000000ULL),
                _(0x8000000000000000ULL, 0x0000000100000001ULL),
                _(0x8000000000000000ULL, 0x0000000100000002ULL),
                _(0x8000000000000000ULL, 0x0000000100000003ULL),
                _(0x8000000000000000ULL, 0xFFFFFFFF00000000ULL),
                _(0x8000000000000000ULL, 0xFFFFFFFFFFFFFFFDULL),
                _(0x8000000000000000ULL, 0xFFFFFFFFFFFFFFFEULL),
                _(0x8000000000000000ULL, 0xFFFFFFFFFFFFFFFFULL),
                _(0x8000000080000000ULL, 0x0000000080000000ULL),
                _(0x8000000080000001ULL, 0x0000000080000001ULL),
                _(0xFFFFFFFEFFFFFFF0ULL, 0xFFFFFFFFFFFFFFFEULL),
                _(0xFFFFFFFFFFFFFFFCULL, 0x00000000FFFFFFFEULL),
                _(0xFFFFFFFFFFFFFFFCULL, 0x0000000100000002ULL),
                _(0xFFFFFFFFFFFFFFFEULL, 0x0000000080000000ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x0000000000000001ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x0000000000000002ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x0000000000000003ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x00000000FFFFFFFDULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x00000000FFFFFFFEULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x00000000FFFFFFFFULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x0000000100000001ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x0000000100000002ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x0000000100000003ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x00000001C0000001ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x0000000380000003ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x8000000000000000ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x7FFFFFFFFFFFFFFFULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFEULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL)};

const	struct	_ll
{
	SQWORD	llDividend, llDivisor, llQuotient, llRemainder;
} llTable[] = {_(0x0000000000000000LL, 0x0000000000000001LL),	// 0, 1
               _(0x0000000000000001LL, 0x0000000000000001LL),	// 1, 1
               _(0x0000000000000000LL, 0xFFFFFFFFFFFFFFFFLL),	// 0, -1
               _(0x0000000000000001LL, 0xFFFFFFFFFFFFFFFFLL),	// 1, -1
               _(0x0000000000000001LL, 0xFFFFFFFFFFFFFFFELL),	// 1, -2
               _(0x0000000000000002LL, 0xFFFFFFFFFFFFFFFELL),	// 2, -2
               _(0x000000000FFFFFFFLL, 0x0000000000000001LL),
               _(0x0000000FFFFFFFFFLL, 0x000000000000000FLL),
               _(0x0000000FFFFFFFFFLL, 0x0000000000000010LL),
               _(0x0000000000000100LL, 0x000000000FFFFFFFLL),
               _(0x00FFFFFFF0000000LL, 0x0000000010000000LL),
               _(0x07FFFFFF80000000LL, 0x0000000080000000LL),
               _(0x7FFFFFFEFFFFFFF0LL, 0xFFFFFFFFFFFFFFFELL),
               _(0x7FFFFFFEFFFFFFF0LL, 0x0000FFFFFFFFFFFELL),
               _(0x7FFFFFFEFFFFFFF0LL, 0x7FFFFFFEFFFFFFF0LL),
               _(0x7FFFFFFFFFFFFFFFLL, 0x8000000000000000LL),	// llmax, llmin
               _(0x7FFFFFFFFFFFFFFFLL, 0xFFFFFFFFFFFFFFFDLL),	// llmax, -3
               _(0x7FFFFFFFFFFFFFFFLL, 0xFFFFFFFFFFFFFFFELL),	// llmax, -2
               _(0x7FFFFFFFFFFFFFFFLL, 0xFFFFFFFFFFFFFFFFLL),	// llmax, -1
               _(0x8000000000000000LL, 0x0000000000000001LL),	// llmin, 1
               _(0x8000000000000000LL, 0x0000000000000002LL),	// llmin, 2
               _(0x8000000000000000LL, 0x0000000000000003LL),	// llmin, 3
               _(0x8000000000000000LL, 0x00000000FFFFFFFELL),
               _(0x8000000000000000LL, 0x00000000FFFFFFFFLL),
               _(0x8000000000000000LL, 0x0000000100000000LL),
               _(0x8000000000000000LL, 0x0000000100000001LL),
               _(0x8000000000000000LL, 0x0000000100000002LL),
               _(0x8000000000000000LL, 0x8000000000000000LL),	// llmin, llmin
               _(0x8000000000000000LL, 0xFFFFFFFF00000000LL),
               _(0x8000000000000000LL, 0xFFFFFFFFFFFFFFFDLL),	// llmin, -3
               _(0x8000000000000000LL, 0xFFFFFFFFFFFFFFFELL),	// llmin, -2
               _(0x8000000000000000LL, 0xFFFFFFFFFFFFFFFFLL),	// llmin, -1
               _(0x8000000080000000LL, 0x0000000080000000LL),
               _(0x8000000080000001LL, 0x0000000080000001LL),
               _(0xFFFFFFFEFFFFFFF0LL, 0xFFFFFFFFFFFFFFFELL),
               _(0xFFFFFFFFFFFFFFFELL, 0x0000000080000000LL),
               _(0xFFFFFFFFFFFFFFFELL, 0x0000000000000001LL),	// -2, 1
               _(0xFFFFFFFFFFFFFFFELL, 0x0000000000000002LL),	// -2, 2
               _(0xFFFFFFFFFFFFFFFELL, 0xFFFFFFFFFFFFFFFELL),	// -2, -2
               _(0xFFFFFFFFFFFFFFFELL, 0xFFFFFFFFFFFFFFFFLL),	// -2, -1
               _(0xFFFFFFFFFFFFFFFFLL, 0x0000000000000001LL),	// -1, 1
               _(0xFFFFFFFFFFFFFFFFLL, 0x0000000000000002LL),	// -1, 2
               _(0xFFFFFFFFFFFFFFFFLL, 0xFFFFFFFFFFFFFFFELL),	// -1, -2
               _(0xFFFFFFFFFFFFFFFFLL, 0xFFFFFFFFFFFFFFFFLL)};	// -1, -1

#undef _

#ifndef HELPER
SQWORD	__divdi3(SQWORD dividend, SQWORD divisor);
SQWORD	__moddi3(SQWORD dividend, SQWORD divisor);
SQWORD	__muldi3(SQWORD multiplicand, SQWORD multiplier);
QWORD	__udivdi3(QWORD dividend, QWORD divisor);
QWORD	__umoddi3(QWORD dividend, QWORD divisor);
QWORD	__umuldi3(QWORD multiplicand, QWORD multiplier);
QWORD	__udivmoddi4(QWORD dividend, QWORD divisor, QWORD *remainder);

__declspec(noinline)
QWORD	__unopdi4(QWORD dividend, QWORD divisor, QWORD *remainder)
{
	if (remainder != NULL)
		*remainder = divisor;

	return dividend;
}
#else
__declspec(naked)
__declspec(noinline)
QWORD	WINAPI	_aullnop(QWORD left, QWORD right)
{
	__asm	ret	16
}
#endif // HELPER

__forceinline	// companion for __emulu()
struct
{
	DWORD	ulQuotient, ulRemainder;
}	WINAPI	__edivmodu(QWORD ullDividend, DWORD ulDivisor)
{
	__asm	mov	eax, dword ptr ullDividend
	__asm	mov	edx, dword ptr ullDividend+4
	__asm	div	ulDivisor
}

__declspec(safebuffers)
BOOL	CDECL	PrintFormat(HANDLE hOutput, [SA_FormatString(Style="printf")] LPCSTR lpFormat, ...)
{
	CHAR	szFormat[1024];
	DWORD	dwFormat;
	DWORD	dwOutput;

	va_list	vaInput;
	va_start(vaInput, lpFormat);

	dwFormat = wvsprintf(szFormat, lpFormat, vaInput);

	va_end(vaInput);

	if ((dwFormat == 0UL)
	 || !WriteFile(hOutput, szFormat, dwFormat, &dwOutput, (LPOVERLAPPED) NULL))
		return FALSE;

	return dwOutput == dwFormat;
}

__declspec(noreturn)
VOID	CDECL	mainCRTStartup(VOID)
{
	DWORD	dw, dwCPUID[12];

	QWORD	qwT0, qwT1, qwT2, qwT3;
	QWORD	qwTx, qwTy, qwTz;

	QWORD	ullQuotient, ullRemainder;
	SQWORD	llQuotient, llRemainder;

	volatile
#ifdef HELPER
	QWORD	qwQuotient, qwRemainder;
	QWORD	qwDividend, qwDivisor = ~0ULL;
#else
	QWORD	qwQuotient;
	QWORD	qwDividend, qwDivisor = ~0ULL, qwRemainder;
#endif		// 2**64 / golden ratio
	QWORD	qwLeft = 0x9E3779B97F4A7C15ULL;
		// bit-vector of prime numbers:
		//  2**prime is set for each prime in [0, 63]
	QWORD	qwRight = 0x28208A20A08A28ACULL;

	HANDLE	hThread = GetCurrentThread();
	HANDLE	hOutput = GetStdHandle(STD_OUTPUT_HANDLE);

	if (hOutput == INVALID_HANDLE_VALUE)
		ExitProcess(GetLastError());

	if (SetThreadIdealProcessor(hThread, 0UL) == -1L)
		ExitProcess(GetLastError());

	__cpuid(dwCPUID, 0x80000000UL);

	if (*dwCPUID >= 0x80000004UL)
	{
		__cpuid(dwCPUID, 0x80000002UL);
		__cpuid(dwCPUID + 4, 0x80000003UL);
		__cpuid(dwCPUID + 8, 0x80000004UL);
	}
	else
		__movsb(dwCPUID, "undetermined processor", sizeof("undetermined processor"));

	PrintFormat(hOutput, "\r\nTesting 64-bit division...\r\n");

	for (dw = 0UL; dw < sizeof(ullTable) / sizeof(*ullTable); dw++)
	{
		PrintFormat(hOutput, "\r%lu", dw);
#ifndef HELPER
		ullQuotient = __udivmoddi4(ullTable[dw].ullDividend, ullTable[dw].ullDivisor, &ullRemainder);
#else
		ullQuotient = ullTable[dw].ullDividend / ullTable[dw].ullDivisor;
		ullRemainder = ullTable[dw].ullDividend % ullTable[dw].ullDivisor;
#endif
		if (ullQuotient != ullTable[dw].ullQuotient)
			PrintFormat(hOutput,
			            "\t%I64u / %I64u:\a quotient %I64u not equal %I64u\r\n",
			            ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullQuotient, ullTable[dw].ullQuotient);

		if (ullQuotient > ullTable[dw].ullDividend)
			PrintFormat(hOutput,
			            "\t%I64u / %I64u:\a quotient %I64u greater dividend\r\n",
			            ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullQuotient);

		if (ullRemainder != ullTable[dw].ullRemainder)
			PrintFormat(hOutput,
			            "\t%I64u %% %I64u:\a remainder %I64u not equal %I64u\r\n",
			            ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullRemainder, ullTable[dw].ullRemainder);

		if (ullRemainder >= ullTable[dw].ullDivisor)
			PrintFormat(hOutput,
			            "\t%I64u %% %I64u:\a remainder %I64u not less divisor\r\n",
			            ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullRemainder);
	}

	PrintFormat(hOutput, "\r\nTesting unsigned 64-bit division...\r\n");

	for (dw = 0UL; dw < sizeof(ullTable) / sizeof(*ullTable); dw++)
	{
		PrintFormat(hOutput, "\r%lu", dw);
#ifndef HELPER
		ullQuotient = __udivdi3(ullTable[dw].ullDividend, ullTable[dw].ullDivisor);
#else
		ullQuotient = ullTable[dw].ullDividend / ullTable[dw].ullDivisor;
#endif
		if (ullQuotient != ullTable[dw].ullQuotient)
			PrintFormat(hOutput,
			            "\t%I64u / %I64u:\a quotient %I64u not equal %I64u\r\n",
			            ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullQuotient, ullTable[dw].ullQuotient);

		if (ullQuotient > ullTable[dw].ullDividend)
			PrintFormat(hOutput,
			            "\t%I64u / %I64u:\a quotient %I64u greater dividend\r\n",
			            ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullQuotient);
#ifndef HELPER
		ullRemainder = ullTable[dw].ullDividend - __umuldi3(ullTable[dw].ullDivisor, ullQuotient);
#else
		ullRemainder = ullTable[dw].ullDividend - ullTable[dw].ullDivisor * ullQuotient;
#endif
		if (ullRemainder != ullTable[dw].ullRemainder)
			PrintFormat(hOutput,
			            "\t%I64u / %I64u:\a remainder %I64u not equal %I64u\r\n",
			            ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullRemainder, ullTable[dw].ullRemainder);
#ifndef HELPER
		ullRemainder = __umoddi3(ullTable[dw].ullDividend, ullTable[dw].ullDivisor);
#else
		ullRemainder = ullTable[dw].ullDividend % ullTable[dw].ullDivisor;
#endif
		if (ullRemainder != ullTable[dw].ullRemainder)
			PrintFormat(hOutput,
			            "\t%I64u %% %I64u:\a remainder %I64u not equal %I64u\r\n",
			            ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullRemainder, ullTable[dw].ullRemainder);

		if (ullRemainder >= ullTable[dw].ullDivisor)
			PrintFormat(hOutput,
			            "\t%I64u %% %I64u:\a remainder %I64u not less divisor\r\n",
			            ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullRemainder);
	}

	PrintFormat(hOutput, "\r\nTesting signed 64-bit division...\r\n");

	for (dw = 0UL; dw < sizeof(llTable) / sizeof(*llTable); dw++)
	{
		PrintFormat(hOutput, "\r%lu", dw);
#ifndef HELPER
		llQuotient = __divdi3(llTable[dw].llDividend, llTable[dw].llDivisor);
#else
		llQuotient = llTable[dw].llDividend / llTable[dw].llDivisor;
#endif
		if (llQuotient != llTable[dw].llQuotient)
			PrintFormat(hOutput,
			            "\t%I64d / %I64d:\a quotient %I64d not equal %I64d\r\n",
			            llTable[dw].llDividend, llTable[dw].llDivisor, llQuotient, llTable[dw].llQuotient);

		if ((llTable[dw].llDividend < 0LL) && (llQuotient < llTable[dw].llDividend)
		 || (llTable[dw].llDividend >= 0LL) && (llQuotient > llTable[dw].llDividend))
			PrintFormat(hOutput,
			            "\t%I64d / %I64d:\a quotient %I64d greater dividend\r\n",
			            llTable[dw].llDividend, llTable[dw].llDivisor, llQuotient);
#ifndef HELPER
		llRemainder = llTable[dw].llDividend - __muldi3(llTable[dw].llDivisor, llQuotient);
#else
		llRemainder = llTable[dw].llDividend - llTable[dw].llDivisor * llQuotient;
#endif
		if (llRemainder != llTable[dw].llRemainder)
			PrintFormat(hOutput,
			            "\t%I64d / %I64d:\a remainder %I64d not equal %I64d\r\n",
			            llTable[dw].llDividend, llTable[dw].llDivisor, llRemainder, llTable[dw].llRemainder);

		if ((llRemainder != 0LL)
		 && ((llRemainder < 0LL) != (llTable[dw].llDividend < 0LL)))
			PrintFormat(hOutput,
			            "\t%I64d / %I64d:\a sign of remainder %I64d not equal sign of dividend %I64d\r\n",
			            llTable[dw].llDividend, llTable[dw].llDivisor, llRemainder, llTable[dw].llDividend);
#ifndef HELPER
		llRemainder = __moddi3(llTable[dw].llDividend, llTable[dw].llDivisor);
#else
		llRemainder = llTable[dw].llDividend % llTable[dw].llDivisor;
#endif
		if (llRemainder != llTable[dw].llRemainder)
			PrintFormat(hOutput,
			            "\t%I64d %% %I64d:\a remainder %I64d not equal %I64d\r\n",
			            llTable[dw].llDividend, llTable[dw].llDivisor, llRemainder, llTable[dw].llRemainder);

		if ((llTable[dw].llDivisor < 0LL) && (llRemainder <= llTable[dw].llDivisor)
		 || (llTable[dw].llDivisor > 0LL) && (llRemainder >= llTable[dw].llDivisor))
			PrintFormat(hOutput,
			            "\t%I64d %% %I64d:\a remainder %I64d not less divisor\r\n",
			            llTable[dw].llDividend, llTable[dw].llDivisor, llRemainder);

		if ((llRemainder != 0LL) && ((llRemainder < 0LL) != (llTable[dw].llDividend < 0LL)))
			PrintFormat(hOutput,
			            "\t%I64d %% %I64d:\a sign of remainder %I64d not equal sign of dividend %I64d\r\n",
			            llTable[dw].llDividend, llTable[dw].llDivisor, llRemainder, llTable[dw].llDividend);
	}

	PrintFormat(hOutput, "\r\nTiming 64-bit division on %.48hs\r\n", dwCPUID);
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT0))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT0))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from George Marsaglia

		qwLeft ^= qwLeft << 14;
		qwLeft ^= qwLeft >> 31;
		qwLeft ^= qwLeft << 45;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0xAD93D23594C935A9

		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
#endif
#ifndef HELPER
		qwQuotient = __unopdi4(qwLeft, qwRight, NULL);
#else
		qwQuotient = _aullnop(qwLeft, qwRight);
#endif
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from Richard Peirce Brent

		qwRight ^= qwRight << 10;
		qwRight ^= qwRight >> 15;
		qwRight ^= qwRight << 4;
		qwRight ^= qwRight >> 13;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0x2B5926535897936B

		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
#endif
#ifndef HELPER
		qwQuotient = __unopdi4(qwLeft, qwRight, &qwRemainder);
#else
		qwQuotient = _aullnop(qwLeft, qwRight);
#endif
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT1))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT1))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from George Marsaglia

		qwLeft ^= qwLeft << 14;
		qwLeft ^= qwLeft >> 31;
		qwLeft ^= qwLeft << 45;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0xAD93D23594C935A9

		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
#endif
#ifndef HELPER
		qwQuotient = __udivmoddi4(qwLeft, qwRight, NULL);
#else
		qwQuotient = qwLeft / qwRight;
		qwRemainder = qwLeft % qwRight;
#endif
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from Richard Peirce Brent

		qwRight ^= qwRight << 10;
		qwRight ^= qwRight >> 15;
		qwRight ^= qwRight << 4;
		qwRight ^= qwRight >> 13;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0x2B5926535897936B

		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
#endif
#ifndef HELPER
		qwQuotient = __udivmoddi4(qwLeft, qwRight, &qwRemainder);
#else
		qwQuotient = qwLeft / qwRight;
		qwRemainder = qwLeft % qwRight;
#endif
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT2))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT2))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from George Marsaglia

		qwLeft ^= qwLeft << 14;
		qwLeft ^= qwLeft >> 31;
		qwLeft ^= qwLeft << 45;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0xAD93D23594C935A9

		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
#endif
#ifndef HELPER
		qwQuotient = __umuldi3(qwLeft, qwRight);
#else
		qwQuotient = qwLeft * qwRight;
#endif
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from Richard Peirce Brent

		qwRight ^= qwRight << 10;
		qwRight ^= qwRight >> 15;
		qwRight ^= qwRight << 4;
		qwRight ^= qwRight >> 13;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0x2B5926535897936B

		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
#endif
#ifndef HELPER
		qwQuotient = __umuldi3(qwLeft, qwRight);
#else
		qwQuotient = qwLeft * qwRight;
#endif
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT3))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT3))
#endif
		ExitProcess(GetLastError());

	qwTz = qwT3 - qwT0;
	qwT3 -= qwT2;
	qwT2 -= qwT1;
	qwT1 -= qwT0;
	qwTy = qwT3 > qwT1 ? qwT3 - qwT1 : 0ULL;
	qwTx = qwT2 - qwT1;
#ifndef HELPER
#ifdef CYCLES
	PrintFormat(hOutput,
	            "\r\n"
	            "__unopdi4()     %6lu.%09lu      0\r\n"
	            "__udivmoddi4()  %6lu.%09lu %6lu.%09lu\r\n"
	            "__umuldi3()     %6lu.%09lu %6lu.%09lu\r\n"
	            "                %6lu.%09lu clock cycles\r\n",
	            __edivmodu(qwT1, 1000000000UL),
	            __edivmodu(qwT2, 1000000000UL),
	            __edivmodu(qwTx, 1000000000UL),
	            __edivmodu(qwT3, 1000000000UL),
	            __edivmodu(qwTy, 1000000000UL),
	            __edivmodu(qwTz, 1000000000UL));
#else
	PrintFormat(hOutput,
	            "\r\n"
	            "__unopdi4()     %6lu.%07lu      0\r\n"
	            "__udivmoddi4()  %6lu.%07lu %6lu.%07lu\r\n"
	            "__umuldi3()     %6lu.%07lu %6lu.%07lu\r\n"
	            "                %6lu.%07lu nano-seconds\r\n",
	            __edivmodu(qwT1, 10000000UL),
	            __edivmodu(qwT2, 10000000UL),
	            __edivmodu(qwTx, 10000000UL),
	            __edivmodu(qwT3, 10000000UL),
	            __edivmodu(qwTy, 10000000UL),
	            __edivmodu(qwTz, 10000000UL));
#endif // CYCLES
#else // HELPER
#ifdef CYCLES
	PrintFormat(hOutput,
	            "\r\n"
	            "_aullnop()      %6lu.%09lu      0\r\n"
	            "_aulldvrm()     %6lu.%09lu %6lu.%09lu\r\n"
	            "_aullmul()      %6lu.%09lu %6lu.%09lu\r\n"
	            "                %6lu.%09lu clock cycles\r\n",
	            __edivmodu(qwT1, 1000000000UL),
	            __edivmodu(qwT2, 1000000000UL),
	            __edivmodu(qwTx, 1000000000UL),
	            __edivmodu(qwT3, 1000000000UL),
	            __edivmodu(qwTy, 1000000000UL),
	            __edivmodu(qwTz, 1000000000UL));
#else
	PrintFormat(hOutput,
	            "\r\n"
	            "_aullnop()      %6lu.%07lu      0\r\n"
	            "_aulldvrm()     %6lu.%07lu %6lu.%07lu\r\n"
	            "_aullmul()      %6lu.%07lu %6lu.%07lu\r\n"
	            "                %6lu.%07lu nano-seconds\r\n",
	            __edivmodu(qwT1, 10000000UL),
	            __edivmodu(qwT2, 10000000UL),
	            __edivmodu(qwTx, 10000000UL),
	            __edivmodu(qwT3, 10000000UL),
	            __edivmodu(qwTy, 10000000UL),
	            __edivmodu(qwTz, 10000000UL));
#endif // CYCLES
#endif // HELPER
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT0))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT0))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from George Marsaglia

		qwLeft ^= qwLeft << 14;
		qwLeft ^= qwLeft >> 31;
		qwLeft ^= qwLeft << 45;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0xAD93D23594C935A9

		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
#endif
		qwDividend = __ull_rshift(qwLeft, qwLeft /* & 31 */);
#ifndef HELPER
		qwQuotient = __unopdi4(qwDividend, qwDivisor, NULL);
#else
		qwQuotient = _aullnop(qwDividend, qwDivisor);
#endif
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from Richard Peirce Brent

		qwRight ^= qwRight << 10;
		qwRight ^= qwRight >> 15;
		qwRight ^= qwRight << 4;
		qwRight ^= qwRight >> 13;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0x2B5926535897936B

		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
#endif
		qwDivisor = __ull_rshift(qwRight, qwRight /* & 31 */);
#ifndef HELPER
		qwQuotient = __unopdi4(qwDividend, qwDivisor, &qwRemainder);
#else
		qwQuotient = _aullnop(qwDividend, qwDivisor);
#endif
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT1))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT1))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from George Marsaglia

		qwLeft ^= qwLeft << 14;
		qwLeft ^= qwLeft >> 31;
		qwLeft ^= qwLeft << 45;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0xAD93D23594C935A9

		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
#endif
		qwDividend = __ull_rshift(qwLeft, qwLeft /* & 31 */);
#ifndef HELPER
		qwQuotient = __udivmoddi4(qwDividend, qwDivisor, NULL);
#else
		qwQuotient = qwDividend / qwDivisor;
		qwRemainder = qwDividend % qwDivisor;
#endif
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from Richard Peirce Brent

		qwRight ^= qwRight << 10;
		qwRight ^= qwRight >> 15;
		qwRight ^= qwRight << 4;
		qwRight ^= qwRight >> 13;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0x2B5926535897936B

		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
#endif
		qwDivisor = __ull_rshift(qwRight, qwRight /* & 31 */);
#ifndef HELPER
		qwQuotient = __udivmoddi4(qwDividend, qwDivisor, &qwRemainder);
#else
		qwQuotient = qwDividend / qwDivisor;
		qwRemainder = qwDividend % qwDivisor;
#endif
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT2))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT2))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from George Marsaglia

		qwLeft ^= qwLeft << 14;
		qwLeft ^= qwLeft >> 31;
		qwLeft ^= qwLeft << 45;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0xAD93D23594C935A9

		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
#endif
		qwDividend = __ull_rshift(qwLeft, qwLeft /* & 31 */);
#ifndef HELPER
		qwQuotient = __umuldi3(qwDividend, qwDivisor);
#else
		qwQuotient = qwDividend * qwDivisor;
#endif
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from Richard Peirce Brent

		qwRight ^= qwRight << 10;
		qwRight ^= qwRight >> 15;
		qwRight ^= qwRight << 4;
		qwRight ^= qwRight >> 13;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0x2B5926535897936B

		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
#endif
		qwDivisor = __ull_rshift(qwRight, qwRight /* & 31 */);
#ifndef HELPER
		qwQuotient = __umuldi3(qwDividend, qwDivisor);
#else
		qwQuotient = qwDividend * qwDivisor;
#endif
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT3))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT3))
#endif
		ExitProcess(GetLastError());

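	qwTz = qwT3 - qwT0;	// total of all three loops
	qwT3 -= qwT2;		// gross count of the multiplication loop
	qwT2 -= qwT1;		// gross count of the division loop
	qwT1 -= qwT0;		// gross count of the empty (nop) loop
	qwTy = qwT3 > qwT1 ? qwT3 - qwT1 : 0ULL;	// net multiplication, clamped at 0
	qwTx = qwT2 - qwT1;	// net division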
#ifndef HELPER
#ifdef CYCLES
	PrintFormat(hOutput,
	            "\r\n"
	            "__unopdi4()     %6lu.%09lu      0\r\n"
	            "__udivmoddi4()  %6lu.%09lu %6lu.%09lu\r\n"
	            "__umuldi3()     %6lu.%09lu %6lu.%09lu\r\n"
	            "                %6lu.%09lu clock cycles\r\n",
	            __edivmodu(qwT1, 1000000000UL),
	            __edivmodu(qwT2, 1000000000UL),
	            __edivmodu(qwTx, 1000000000UL),
	            __edivmodu(qwT3, 1000000000UL),
	            __edivmodu(qwTy, 1000000000UL),
	            __edivmodu(qwTz, 1000000000UL));
#else
	PrintFormat(hOutput,
	            "\r\n"
	            "__unopdi4()     %6lu.%07lu      0\r\n"
	            "__udivmoddi4()  %6lu.%07lu %6lu.%07lu\r\n"
	            "__umuldi3()     %6lu.%07lu %6lu.%07lu\r\n"
	            "                %6lu.%07lu nano-seconds\r\n",
	            __edivmodu(qwT1, 10000000UL),
	            __edivmodu(qwT2, 10000000UL),
	            __edivmodu(qwTx, 10000000UL),
	            __edivmodu(qwT3, 10000000UL),
	            __edivmodu(qwTy, 10000000UL),
	            __edivmodu(qwTz, 10000000UL));
#endif // CYCLES
#else // HELPER
#ifdef CYCLES
	PrintFormat(hOutput,
	            "\r\n"
	            "_aullnop()      %6lu.%09lu      0\r\n"
	            "_aulldvrm()     %6lu.%09lu %6lu.%09lu\r\n"
	            "_aullmul()      %6lu.%09lu %6lu.%09lu\r\n"
	            "                %6lu.%09lu clock cycles\r\n",
	            __edivmodu(qwT1, 1000000000UL),
	            __edivmodu(qwT2, 1000000000UL),
	            __edivmodu(qwTx, 1000000000UL),
	            __edivmodu(qwT3, 1000000000UL),
	            __edivmodu(qwTy, 1000000000UL),
	            __edivmodu(qwTz, 1000000000UL));
#else
	PrintFormat(hOutput,
	            "\r\n"
	            "_aullnop()      %6lu.%07lu      0\r\n"
	            "_aulldvrm()     %6lu.%07lu %6lu.%07lu\r\n"
	            "_aullmul()      %6lu.%07lu %6lu.%07lu\r\n"
	            "                %6lu.%07lu nano-seconds\r\n",
	            __edivmodu(qwT1, 10000000UL),
	            __edivmodu(qwT2, 10000000UL),
	            __edivmodu(qwTx, 10000000UL),
	            __edivmodu(qwT3, 10000000UL),
	            __edivmodu(qwTy, 10000000UL),
	            __edivmodu(qwTz, 10000000UL));
#endif // CYCLES
#endif // HELPER
	ExitProcess(0UL);
}

DWORD_PTR	__security_cookie = 3141592654UL;	// π * 10**9

extern	LPVOID	__safe_se_handler_table[];
extern	BYTE	__safe_se_handler_count;

const	IMAGE_LOAD_CONFIG_DIRECTORY32	_load_config_used = {sizeof(_load_config_used),
					                     'DATE',	// = 2006-04-15 20:15:01 UTC
					                     _MSC_VER / 100, _MSC_VER % 100,
					                     0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL,
					                     0U, 0U,
					                     0UL,
					                     &__security_cookie,
					                     __safe_se_handler_table,
					                     &__safe_se_handler_count};
__declspec(naked)
VOID	__fastcall	__security_check_cookie(DWORD_PTR _stackcookie)
{
	__asm
	{
		cmp	ecx, __security_cookie
		jne	corrupt
		ret

	corrupt:
		ud2
	}
}

Save this C source as 64-i386.c in an arbitrary, preferably empty directory, and save the 16 32-bit assembler sources presented above there too, as alldiv.asm, alldvrm.asm, allmul.asm, allrem.asm, allshl.asm, allshr.asm, aulldiv.asm, aulldvrm.asm, aullrem.asm, aullshr.asm, divdi3.asm, moddi3.asm, muldi3.asm, udivdi3.asm, umoddi3.asm and udivmoddi4.asm respectively. Optionally copy clang_rt.builtins-i386.lib from an installation of LLVM’s Clang into the same directory. Then start the command prompt of the Windows software development kit for the I386 platform in this directory and run the following command lines to build the benchmark programs 64-i386.exe, 64-helper.exe, 64-msft.exe and optionally 64-llvm.exe, and to execute them:

CL.EXE /Brepro /c /DCYCLES /GAFwy /O2y /W4 /Zl 64-i386.c
ML.EXE /Brepro /Cp /Cx /c /W3 /X divdi3.asm
ML.EXE /Brepro /Cp /Cx /c /W3 /X moddi3.asm
ML.EXE /Brepro /Cp /Cx /c /W3 /X muldi3.asm
ML.EXE /Brepro /Cp /Cx /c /W3 /X udivdi3.asm
ML.EXE /Brepro /Cp /Cx /c /W3 /X umoddi3.asm
ML.EXE /Brepro /Cp /Cx /c /W3 /X udivmoddi4.asm
ML.EXE /Brepro /Cp /Cx /c /W3 /X alldiv.asm
ML.EXE /Brepro /Cp /Cx /c /W3 /X alldvrm.asm
ML.EXE /Brepro /Cp /Cx /c /W3 /X allmul.asm
ML.EXE /Brepro /Cp /Cx /c /W3 /X allrem.asm
ML.EXE /Brepro /Cp /Cx /c /W3 /X allshl.asm
ML.EXE /Brepro /Cp /Cx /c /W3 /X allshr.asm
ML.EXE /Brepro /Cp /Cx /c /W3 /X aulldiv.asm
ML.EXE /Brepro /Cp /Cx /c /W3 /X aulldvrm.asm
ML.EXE /Brepro /Cp /Cx /c /W3 /X aullrem.asm
ML.EXE /Brepro /Cp /Cx /c /W3 /X aullshr.asm
LINK.EXE /LIB /BREPRO /MACHINE:I386 /NODEFAULTLIB /OUT:64-i386.lib divdi3.obj moddi3.obj muldi3.obj udivdi3.obj umoddi3.obj udivmoddi4.obj alldiv.obj alldvrm.obj allmul.obj allrem.obj allshl.obj allshr.obj aulldiv.obj aulldvrm.obj aullrem.obj aullshr.obj
LINK.EXE /LINK /BREPRO /DYNAMICBASE /ENTRY:mainCRTStartup /MACHINE:I386 /NOCOFFGRPINFO /NODEFAULTLIB /NXCOMPAT /OPT:REF /OUT:64-i386.exe /RELEASE /SUBSYSTEM:CONSOLE 64-i386.obj 64-i386.lib kernel32.lib user32.lib
CL.EXE /Brepro /c /DCYCLES /DHELPER /GAFwy /O2y /W4 /Zl 64-i386.c
LINK.EXE /LINK /BREPRO /DYNAMICBASE /ENTRY:mainCRTStartup /MACHINE:I386 /NOCOFFGRPINFO /NODEFAULTLIB /NXCOMPAT /OPT:REF /OUT:64-helper.exe /RELEASE /SUBSYSTEM:CONSOLE 64-i386.obj 64-i386.lib kernel32.lib user32.lib
LINK.EXE /LIB /BREPRO /DEF /EXPORT:_alldiv /EXPORT:_alldvrm /EXPORT:_allmul /EXPORT:_allrem /EXPORT:_allshl /EXPORT:_allshr /EXPORT:_aulldiv /EXPORT:_aulldvrm /EXPORT:_aullrem /EXPORT:_aullshr /MACHINE:I386 /NAME:NTDLL /NODEFAULTLIB /OUT:64-msft.lib
LINK.EXE /LINK /BREPRO /DYNAMICBASE /ENTRY:mainCRTStartup /MACHINE:I386 /NOCOFFGRPINFO /NODEFAULTLIB /NXCOMPAT /OPT:REF /OUT:64-msft.exe /RELEASE /SUBSYSTEM:CONSOLE 64-i386.obj 64-msft.lib kernel32.lib user32.lib
IF EXIST clang_rt.builtins-i386.lib LINK.EXE /LINK /BREPRO /DYNAMICBASE /ENTRY:mainCRTStartup /MACHINE:I386 /NOCOFFGRPINFO /NODEFAULTLIB /NXCOMPAT /OPT:REF /OUT:64-llvm.exe /RELEASE /SUBSYSTEM:CONSOLE 64-i386.obj clang_rt.builtins-i386.lib kernel32.lib user32.lib
.\64-i386.exe
.\64-helper.exe
.\64-msft.exe
IF EXIST 64-llvm.exe .\64-llvm.exe

For details and reference see the MSDN articles Compiler Options and Linker Options.

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: all 32-bit programs are pure Win32 console applications and build without the MSVCRT libraries.

Note: the trivial transformation of the assembler sources with directives for Unix’ or GNU’s as into assembler sources for Microsoft’s ML.EXE is left as an exercise to the reader; see the Microsoft Macro Assembler Reference.

Note: linking the program 64-msft.exe with the compiler helper routines built from their source code blcrtasm.asm is also left as an exercise to the reader.

Note: the command lines can be copied and pasted as a block into a Command Processor window!
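The benchmark source above derives its operands from two 64-bit linear feedback shift registers in Galois form and then shifts each state right by a count taken from the state itself. The following is a minimal portable sketch of that scheme, not the author's code: the function names are made up for illustration, and operand_from() assumes that __ull_rshift() on i386 takes its count modulo 32, which is what the commented-out `& 31` in the listing suggests.

```c
#include <stdint.h>

/* One step of the right-shifting Galois LFSR from the listing
   (polynomial constant 0x2B5926535897936B). */
uint64_t lfsr_right(uint64_t state)
{
    /* 0 - (state & 1) is all ones when the low bit is set, else zero */
    uint64_t mask = UINT64_C(0) - (state & 1);
    return (state >> 1) ^ (mask & UINT64_C(0x2B5926535897936B));
}

/* One step of the left-shifting Galois LFSR from the listing
   (polynomial constant 0xAD93D23594C935A9). */
uint64_t lfsr_left(uint64_t state)
{
    /* The listing replicates the top bit with an arithmetic right shift
       of a signed value; this form yields the same mask portably. */
    uint64_t mask = UINT64_C(0) - (state >> 63);
    return (state << 1) ^ (mask & UINT64_C(0xAD93D23594C935A9));
}

/* Hypothetical helper: shift the state right by its own low 5 bits,
   ASSUMING the modulo-32 shift count described in the lead-in. */
uint64_t operand_from(uint64_t state)
{
    return state >> (state & 31);
}
```

Both generators map a zero state to zero, so the seed must be non-zero. Shifting by a count taken from the state gives operands of varying magnitude, so the division routines are not timed on full-width values only.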

Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.

64-i386.c
64-i386.c(823) : warning C4047: 'initializing' : 'DWORD' differs in levels of indirection from 'DWORD_PTR *'
64-i386.c(824) : warning C4047: 'initializing' : 'DWORD' differs in levels of indirection from 'LPVOID *'
64-i386.c(825) : warning C4047: 'initializing' : 'DWORD' differs in levels of indirection from 'BYTE *'
64-i386.c(828) : warning C4100: '_stackcookie' : unreferenced formal parameter

Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: divdi3.asm
…
 Assembling: udivmoddi4.asm

Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: alldiv.asm
…
 Assembling: aullshr.asm

Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.

64-i386.c
64-i386.c(150) : warning C4100: 'right' : unreferenced formal parameter
64-i386.c(150) : warning C4100: 'left' : unreferenced formal parameter
64-i386.c(823) : warning C4047: 'initializing' : 'DWORD' differs in levels of indirection from 'DWORD_PTR *'
64-i386.c(824) : warning C4047: 'initializing' : 'DWORD' differs in levels of indirection from 'LPVOID *'
64-i386.c(825) : warning C4047: 'initializing' : 'DWORD' differs in levels of indirection from 'BYTE *'
64-i386.c(828) : warning C4100: '_stackcookie' : unreferenced formal parameter

Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

Microsoft (R) Program Maintenance Utility Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

   Creating library 64-msft.lib and object 64-msft.exp

Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz

__unopdi4()         12.849202656      0
__udivmoddi4()      37.561920358     24.712717702
__umuldi3()         14.358287749      1.509085093
                    64.769410763 clock cycles

__unopdi4()         18.308137879      0
__udivmoddi4()      37.448476732     19.140338853
__umuldi3()         19.587959635      1.279821756
                    75.344574246 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz

_aullnop()           9.108604673      0
_aulldvrm()         39.178505498     30.069900825
_aullmul()          14.272042690      5.163438017
                    62.559152861 clock cycles

_aullnop()          14.043325395      0
_aulldvrm()         38.404302453     24.360977058
_aullmul()          19.309414816      5.266089421
                    71.757042664 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz

_aullnop()           9.005029514      0
_aulldvrm()        145.500002260    136.494972746
_aullmul()          17.647885499      8.642855985
                   172.152917273 clock cycles

_aullnop()          13.827490013      0
_aulldvrm()        111.386159799     97.558669786
_aullmul()          22.663936806      8.836446793
                   147.877586618 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz

__unopdi4()         12.857257576      0
__udivmoddi4()      94.499937193     81.642679617
__umuldi3()         30.708206573     17.850948997
                   138.065401342 clock cycles

__unopdi4()         17.108234965      0
__udivmoddi4()     161.966266965    144.858032000
__umuldi3()         35.101783471     17.993548506
                   214.176285401 clock cycles
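Each timing block above lists three rows. Judging from the benchmark source, the first number in a row is the gross count for one loop, and the second is that count minus the count of the empty loop (labelled __unopdi4() or _aullnop()); the last line is the total over all three loops. The following is a minimal sketch of that bookkeeping in portable C. The names net_counts and net_from_stamps are made up for illustration, and unlike the original, which clamps only the multiplication row, the sketch clamps both net values at zero.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t nop;           /* gross count of the empty loop          */
    uint64_t divide;        /* gross count of the division loop       */
    uint64_t multiply;      /* gross count of the multiplication loop */
    uint64_t net_divide;    /* divide minus nop, clamped at zero      */
    uint64_t net_multiply;  /* multiply minus nop, clamped at zero    */
    uint64_t total;         /* everything between first and last stamp */
} net_counts;

/* t0..t3 are the four readings taken before the first loop and after
   each of the three loops (cycle counts or thread times). */
net_counts net_from_stamps(uint64_t t0, uint64_t t1, uint64_t t2, uint64_t t3)
{
    net_counts result;

    assert(t0 <= t1 && t1 <= t2 && t2 <= t3);

    result.nop      = t1 - t0;
    result.divide   = t2 - t1;
    result.multiply = t3 - t2;
    result.total    = t3 - t0;
    result.net_divide   = result.divide > result.nop
                        ? result.divide - result.nop : 0;
    result.net_multiply = result.multiply > result.nop
                        ? result.multiply - result.nop : 0;
    return result;
}
```

Fed with the readings behind the first block above (0, 12849202656, 50411123014 and 64769410763 cycles), the sketch reproduces its net figures of 24712717702 and 1509085093 cycles. The 0.000000000 entries in some blocks further down are the clamp at work: there the multiplication loop took no longer than the empty loop.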

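The same programs, built without the preprocessor macro CYCLES defined, measure with GetThreadTimes() instead of QueryThreadCycleTime() and report nano-seconds: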

[…]

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz

__unopdi4()          4.5864294      0
__udivmoddi4()      15.5064994     10.9200700
__umuldi3()          5.6004359      1.0140065
                    25.6933647 nano-seconds

__unopdi4()          7.1760460      0
__udivmoddi4()      14.9760960      7.8000500
__umuldi3()          7.7376496      0.5616036
                    29.8897916 nano-seconds

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz

_aullnop()           3.6660235      0
_aulldvrm()         15.8029013     12.1368778
_aullmul()           5.5380355      1.8720120
                    25.0069603 nano-seconds

_aullnop()           5.9592382      0
_aulldvrm()         15.4752992      9.5160610
_aullmul()           7.9716511      2.0124129
                    29.4061885 nano-seconds

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz

_aullnop()           3.6660235      0
_aulldvrm()         58.4691748     54.8031513
_aullmul()           7.2696466      3.6036231
                    69.4048449 nano-seconds

_aullnop()           5.7564369      0
_aulldvrm()         44.0546824     38.2982455
_aullmul()           9.3132597      3.5568228
                    59.1243790 nano-seconds

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz

__unopdi4()          5.1480330      0
__udivmoddi4()      37.3934397     32.2454067
__umuldi3()         12.4176796      7.2696466
                    54.9591523 nano-seconds

__unopdi4()          6.8640440      0
__udivmoddi4()      64.9432163     58.0791723
__umuldi3()         14.1648908      7.3008468
                    85.9721511 nano-seconds
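The nano-second figures above come from GetThreadTimes(), which reports thread times in units of 100 ns. The source divides each value by 10000000 and prints the quotient and the 7-digit remainder, which is the elapsed time in seconds. The multiplication loop in the listing runs 500000000 iterations with two calls each, and assuming the other loops do the same, 10**9 operations are timed, so elapsed seconds read directly as nano-seconds per operation. The following is a minimal sketch of that conversion. The names are made up for illustration, and split_ticks() only mirrors what __edivmodu() appears to do from the way it is called.

```c
#include <assert.h>
#include <stdint.h>

/* GetThreadTimes() fills FILETIME values, which count 100 ns ticks */
#define TICKS_PER_SECOND UINT64_C(10000000)

typedef struct {
    uint64_t seconds;   /* whole seconds                                */
    uint64_t fraction;  /* remaining ticks, printed as 7 decimal digits */
} split_time;

/* Hypothetical stand-in for __edivmodu(ticks, 10000000UL):
   split a tick count into whole seconds and the remainder. */
split_time split_ticks(uint64_t ticks)
{
    split_time result;
    result.seconds  = ticks / TICKS_PER_SECOND;
    result.fraction = ticks % TICKS_PER_SECOND;
    return result;
}

/* Average time per operation in nano-seconds for a given tick count */
double ns_per_operation(uint64_t ticks, uint64_t operations)
{
    assert(operations != 0);
    return (double) ticks * 100.0 / (double) operations;
}
```

For example, 45864294 ticks split into 4 seconds and a remainder of 5864294, printed as 4.5864294. With 10**9 operations, ns_per_operation() returns the same number, which is why the output can label elapsed seconds as nano-seconds.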

[…]

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM) i5-4670 CPU @ 3.40GHz

__unopdi4()          8.710364463      0
__udivmoddi4()      29.568165444     20.857800981
__umuldi3()         10.016409737      1.306045274
                    48.294939644 clock cycles

__unopdi4()         11.899356861      0
__udivmoddi4()      31.305810062     19.406453201
__umuldi3()         14.074341743      2.174984882
                    57.279508666 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM) i5-4670 CPU @ 3.40GHz

_aullnop()           6.281171716      0
_aulldvrm()         30.299316500     24.018144784
_aullmul()          10.415490092      4.134318376
                    46.995978308 clock cycles

_aullnop()          10.305468488      0
_aulldvrm()         29.560666513     19.255198025
_aullmul()          15.232518004      4.927049516
                    55.098653005 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM) i5-4670 CPU @ 3.40GHz

_aullnop()           6.282753357      0
_aulldvrm()        130.221916499    123.939163142
_aullmul()          12.560291961      6.277538604
                   149.064961817 clock cycles

_aullnop()          10.305609251      0
_aulldvrm()         93.916607827     83.610998576
_aullmul()          17.949963126      7.644353875
                   122.172180204 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM) i5-4670 CPU @ 3.40GHz

__unopdi4()          8.794296716      0
__udivmoddi4()      58.334799420     49.540502704
__umuldi3()         16.971398673      8.177101957
                    84.100494809 clock cycles

__unopdi4()         11.806963851      0
__udivmoddi4()     108.598490949     96.791527098
__umuldi3()         22.271070710     10.464106859
                   142.676525510 clock cycles

[…]

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz

__unopdi4()          9.513493832      0
__udivmoddi4()      28.904259242     19.390765410
__umuldi3()          9.111766044      0.000000000
                    47.529519118 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz

_aullnop()           8.466770864      0
_aulldvrm()        133.568853734    125.102082870
_aullmul()          13.159542118      4.692771254
                   155.195166716 clock cycles

[…]

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
57
Timing 64-bit division on Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz

__unopdi4()          8.176246060      0
__udivmoddi4()      24.540802967     16.364556907
__umuldi3()          8.774901071      0.598655011
                    41.491950098 clock cycles

__unopdi4()         10.752357791      0
__udivmoddi4()      24.479256622     13.726898831
__umuldi3()         12.175662023      1.423304232
                    47.407276436 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz

_aullnop()           6.042327017      0
_aulldvrm()         24.822108405     18.779781388
_aullmul()           8.850690256      2.808363239
                    39.715125678 clock cycles

_aullnop()           9.036137407      0
_aulldvrm()         24.078298463     15.042161056
_aullmul()          12.182378249      3.146240842
                    45.296814119 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz

_aullnop()           6.043049544      0
_aulldvrm()        121.360828766    115.317779222
_aullmul()          11.284504325      5.241454781
                   138.688382635 clock cycles

_aullnop()           9.038334480      0
_aulldvrm()         87.144452426     78.106117946
_aullmul()          14.460059957      5.421725477
                   110.642846863 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz

__unopdi4()          8.182234986      0
__udivmoddi4()      49.594440527     41.412205541
__umuldi3()         15.480297393      7.298062407
                    73.256972906 clock cycles

__unopdi4()         10.785032002      0
__udivmoddi4()      93.296232493     82.511200491
__umuldi3()         19.044985770      8.259953768
                   123.126250265 clock cycles

[…]

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz

__unopdi4()          4.758041439      0
__udivmoddi4()      14.900456178     10.142414739
__umuldi3()          5.118839780      0.360798341
                    24.777337397 clock cycles

__unopdi4()          6.264035993      0
__udivmoddi4()      14.991681122      8.727645129
__umuldi3()          7.116819579      0.852783586
                    28.372536694 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz

_aullnop()           3.511911560      0
_aulldvrm()         14.931006596     11.419095036
_aullmul()           5.185640855      1.673729295
                    23.628559011 clock cycles

_aullnop()           5.267329959      0
_aulldvrm()         14.287518025      9.020188066
_aullmul()           7.649375365      2.382045406
                    27.204223349 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz

_aullnop()           3.888929085      0
_aulldvrm()         75.630752291     71.741823206
_aullmul()           7.039982148      3.151053063
                    86.559663524 clock cycles

_aullnop()           5.706960149      0
_aulldvrm()         51.744648850     46.037688701
_aullmul()           8.437223071      2.730262922
                    65.888832070 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz

__unopdi4()          4.761309343      0
__udivmoddi4()      30.499168861     25.737859518
__umuldi3()          9.146076148      4.384766805
                    44.406554352 clock cycles

__unopdi4()          6.320688165      0
__udivmoddi4()      58.207913400     51.887225235
__umuldi3()         11.342916480      5.022228315
                    75.871518045 clock cycles

[…]

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM) i5-9500 CPU @ 3.00GHz

__unopdi4()          7.677882315      0
__udivmoddi4()      23.667828663     15.989946348
__umuldi3()          7.422475230      0.000000000
                    38.768186208 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM) i5-9500 CPU @ 3.00GHz

_aullnop()           5.841766912      0
_aulldvrm()        106.658152285    100.816385373
_aullmul()          10.760192090      4.918425178
                   123.260111287 clock cycles

[…]

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD A4-9125 RADEON R3, 4 COMPUTE CORES 2C+2G

__unopdi4()         14.243392667      0
__udivmoddi4()      55.997587943     41.754195276
__umuldi3()         13.500837936      0.000000000
                    83.741818546 clock cycles

__unopdi4()         17.199216332      0
__udivmoddi4()      45.874249502     28.675033170
__umuldi3()         18.633292583      1.434076251
                    81.706758417 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD A4-9125 RADEON R3, 4 COMPUTE CORES 2C+2G

_aullnop()           9.027738059      0
_aulldvrm()        120.796853280    111.769115221
_aullmul()          15.186058308      6.158320249
                   145.010649647 clock cycles

_aullnop()          15.578224772      0
_aulldvrm()         90.215115103     74.636890331
_aullmul()          21.860576148      6.282351376
                   127.653916023 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD A4-9125 RADEON R3, 4 COMPUTE CORES 2C+2G

_aullnop()          10.246997781      0
_aulldvrm()         54.808625176     44.561627395
_aullmul()          13.902460030      3.655462249
                    78.958082987 clock cycles

_aullnop()          15.956642108      0
_aulldvrm()         47.420239312     31.463597204
_aullmul()          22.055934131      6.099292023
                    85.432815551 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD A4-9125 RADEON R3, 4 COMPUTE CORES 2C+2G

__unopdi4()         11.401600724      0
__udivmoddi4()      90.877376326     79.475775602
__umuldi3()         24.230306820     12.828706096
                   126.509283870 clock cycles

__unopdi4()         16.350275112      0
__udivmoddi4()     181.099752347    164.749477235
__umuldi3()         28.889429738     12.539154626
                   226.339457197 clock cycles

[…]

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD A4-9125 RADEON R3, 4 COMPUTE CORES 2C+2G

__unopdi4()          5.1406250      0
__udivmoddi4()      20.2656250     15.1250000
__umuldi3()          5.7343750      0.5937500
                    31.1406250 nano-seconds

__unopdi4()          7.1718750      0
__udivmoddi4()      19.9375000     12.7656250
__umuldi3()          8.0468750      0.8750000
                    35.1562500 nano-seconds

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD A4-9125 RADEON R3, 4 COMPUTE CORES 2C+2G

_aullnop()           3.9531250      0
_aulldvrm()         20.4062500     16.4531250
_aullmul()           6.0468750      2.0937500
                    30.4062500 nano-seconds

_aullnop()           6.8281250      0
_aulldvrm()         20.7812500     13.9531250
_aullmul()           8.8281250      2.0000000
                    36.4375000 nano-seconds

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD A4-9125 RADEON R3, 4 COMPUTE CORES 2C+2G

_aullnop()           3.9375000      0
_aulldvrm()         49.3437500     45.4062500
_aullmul()           6.6406250      2.7031250
                    59.9218750 nano-seconds

_aullnop()           7.0781250      0
_aulldvrm()         42.0000000     34.9218750
_aullmul()           9.4843750      2.4062500
                    58.5625000 nano-seconds

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD A4-9125 RADEON R3, 4 COMPUTE CORES 2C+2G

__unopdi4()          5.1093750      0
__udivmoddi4()      39.4843750     34.3750000
__umuldi3()         10.5468750      5.4375000
                    55.1406250 nano-seconds

__unopdi4()          7.2656250      0
__udivmoddi4()      69.2343750     61.9687500
__umuldi3()         12.7656250      5.5000000
                    89.2656250 nano-seconds

[…]

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 7 2700X Eight-Core Processor

__unopdi4()          8.637489867      0
__udivmoddi4()      27.828655455     19.191165588
__umuldi3()          9.761457334      1.123967467
                    46.227602656 clock cycles

__unopdi4()         11.229091635      0
__udivmoddi4()      26.703517279     15.474425644
__umuldi3()         12.675702170      1.446610535
                    50.608311084 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 7 2700X Eight-Core Processor

_aullnop()           6.031238132      0
_aulldvrm()         27.804057740     21.772819608
_aullmul()          10.548285859      4.517047727
                    44.383581731 clock cycles

_aullnop()           9.489672570      0
_aulldvrm()         27.331796039     17.842123469
_aullmul()          11.909754514      2.420081944
                    48.731223123 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 7 2700X Eight-Core Processor

_aullnop()           6.040463934      0
_aulldvrm()         88.367491909     82.327027975
_aullmul()          10.869423368      4.828959434
                   105.277379211 clock cycles

_aullnop()           9.492139142      0
_aulldvrm()         67.661042025     58.168902883
_aullmul()          14.151106441      4.658967299
                    91.304287608 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 7 2700X Eight-Core Processor

__unopdi4()          8.218757132      0
__udivmoddi4()      68.973908219     60.755151087
__umuldi3()         16.682145911      8.463388779
                    93.874811262 clock cycles

__unopdi4()         11.236472949      0
__udivmoddi4()     125.545346907    114.308873958
__umuldi3()         20.463970255      9.227497306
                   157.245790111 clock cycles

[…]

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 5 3600 6-Core Processor

__unopdi4()          8.156522700      0
__udivmoddi4()      26.587471432     18.430948732
__umuldi3()          8.227731420      0.071208720
                    42.971725552 clock cycles

__unopdi4()         11.171514133      0
__udivmoddi4()      24.479854346     13.308340213
__umuldi3()         11.409089652      0.237575519
                    47.060458131 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 5 3600 6-Core Processor

_aullnop()           6.010190784      0
_aulldvrm()         26.050758627     20.040567843
_aullmul()           8.197509891      2.187319107
                    40.258459302 clock cycles

_aullnop()           9.910357023      0
_aulldvrm()         24.299246162     14.388889139
_aullmul()          11.622902328      1.712545305
                    45.832505513 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 5 3600 6-Core Processor

_aullnop()           6.010334245      0
_aulldvrm()         79.145175680     73.134841435
_aullmul()           9.026602597      3.016268352
                    94.182112522 clock cycles

_aullnop()           9.909955188      0
_aulldvrm()         53.473960584     43.564005396
_aullmul()          13.012305555      3.102350367
                    76.396221327 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 5 3600 6-Core Processor

__unopdi4()          8.154862312      0
__udivmoddi4()      64.296126258     56.141263946
__umuldi3()         14.782954357      6.628092045
                    87.233942927 clock cycles

__unopdi4()         11.159903449      0
__udivmoddi4()     115.862991234    104.703087785
__umuldi3()         18.352062682      7.192159233
                   145.374957365 clock cycles

[…]

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 5 3600 6-Core Processor

__unopdi4()          2.2656250      0
__udivmoddi4()       7.4062500      5.1406250
__umuldi3()          2.2812500      0.0156250
                    11.9531250 nano-seconds

__unopdi4()          3.1093750      0
__udivmoddi4()       6.7812500      3.6718750
__umuldi3()          3.2343750      0.1250000
                    13.1250000 nano-seconds

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 5 3600 6-Core Processor

_aullnop()           1.6718750      0
_aulldvrm()          7.2500000      5.5781250
_aullmul()           2.2812500      0.6093750
                    11.2031250 nano-seconds

_aullnop()           2.7343750      0
_aulldvrm()          6.7812500      4.0468750
_aullmul()           3.2343750      0.5000000
                    12.7500000 nano-seconds

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 5 3600 6-Core Processor

_aullnop()           1.6718750      0
_aulldvrm()         22.0156250     20.3437500
_aullmul()           2.5156250      0.8437500
                    26.2031250 nano-seconds

_aullnop()           2.7500000      0
_aulldvrm()         14.9218750     12.1718750
_aullmul()           3.6093750      0.8593750
                    21.2812500 nano-seconds

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 5 3600 6-Core Processor

__unopdi4()          2.2656250      0
__udivmoddi4()      18.0000000     15.7343750
__umuldi3()          4.2968750      2.0312500
                    24.5625000 nano-seconds

__unopdi4()          3.1093750      0
__udivmoddi4()      32.2812500     29.1718750
__umuldi3()          5.1875000      2.0781250
                    40.5781250 nano-seconds

[…]

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 9 3900XT 12-Core Processor

__unopdi4()          7.820339238      0
__udivmoddi4()      25.219830234     17.399490996
__umuldi3()          7.896016504      0.075677266
                    40.936185976 clock cycles

__unopdi4()         10.683594514      0
__udivmoddi4()      23.327167298     12.643572784
__umuldi3()         10.966053464      0.282458950
                    44.976815276 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 9 3900XT 12-Core Processor

_aullnop()           5.782054086      0
_aulldvrm()         24.732162011     18.950107925
_aullmul()           7.867011224      2.084957138
                    38.381227321 clock cycles

_aullnop()           9.460103500      0
_aulldvrm()         23.216316090     13.756212590
_aullmul()          11.151011086      1.690907586
                    43.827430676 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 9 3900XT 12-Core Processor

_aullnop()           5.815668129      0
_aulldvrm()         75.679497601     69.863829472
_aullmul()           9.474285823      3.658617694
                    90.969451553 clock cycles

_aullnop()           9.458013729      0
_aulldvrm()         50.089870151     40.631856422
_aullmul()          12.508264043      3.050250314
                    72.056147923 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 9 3900XT 12-Core Processor

__unopdi4()          7.817188050      0
__udivmoddi4()      61.087041419     53.269853369
__umuldi3()         14.145779236      6.328591186
                    83.050008705 clock cycles

__unopdi4()         10.659368382      0
__udivmoddi4()     109.785429868     99.126061486
__umuldi3()         17.585432466      6.926064084
                   138.030230716 clock cycles

[…]

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 9 3900XT 12-Core Processor

__unopdi4()          2.0781250      0
__udivmoddi4()       6.6562500      4.5781250
__umuldi3()          2.0937500      0.0156250
                    10.8281250 nano-seconds

__unopdi4()          2.8125000      0
__udivmoddi4()       6.2343750      3.4218750
__umuldi3()          2.9843750      0.1718750
                    12.0312500 nano-seconds

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 9 3900XT 12-Core Processor

_aullnop()           1.5312500      0
_aulldvrm()          6.5312500      5.0000000
_aullmul()           2.1718750      0.6406250
                    10.2343750 nano-seconds

_aullnop()           2.5000000      0
_aulldvrm()          6.1250000      3.6250000
_aullmul()           2.9375000      0.4375000
                    11.5625000 nano-seconds

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 9 3900XT 12-Core Processor

_aullnop()           1.5000000      0
_aulldvrm()         19.7187500     18.2187500
_aullmul()           2.5000000      1.0000000
                    23.7187500 nano-seconds

_aullnop()           2.4843750      0
_aulldvrm()         13.2031250     10.7187500
_aullmul()           3.2812500      0.7968750
                    18.9687500 nano-seconds

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 9 3900XT 12-Core Processor

__unopdi4()          2.0625000      0
__udivmoddi4()      16.2343750     14.1718750
__umuldi3()          3.8906250      1.8281250
                    22.1875000 nano-seconds

__unopdi4()          2.8125000      0
__udivmoddi4()      29.0781250     26.2656250
__umuldi3()          4.7187500      1.9062500
                    36.6093750 nano-seconds

[…]

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD EPYC 7713 64-Core Processor

__unopdi4()          5.783152900      0
__udivmoddi4()      15.055998960      9.272846060
__umuldi3()          5.495972020      0.000000000
                    26.335123880 clock cycles

__unopdi4()          7.998956160      0
__udivmoddi4()      15.909793620      7.910837460
__umuldi3()          7.757748040      0.000000000
                    31.666497820 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD EPYC 7713 64-Core Processor

_aullnop()           4.537734560      0
_aulldvrm()         15.874445940     11.336711380
_aullmul()           6.147943420      1.610208860
                    26.560123920 clock cycles

_aullnop()           6.793898680      0
_aulldvrm()         16.717796680      9.923898000
_aullmul()           9.055848460      2.261949780
                    32.567543820 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD EPYC 7713 64-Core Processor

_aullnop()           4.527671860      0
_aulldvrm()         51.431490520     46.903818660
_aullmul()           6.800312040      2.272640180
                    62.759474420 clock cycles

_aullnop()           6.791780420      0
_aulldvrm()         34.746860380     27.955079960
_aullmul()           9.737921840      2.946141420
                    51.276562640 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD EPYC 7713 64-Core Processor

__unopdi4()          5.787192880      0
__udivmoddi4()      39.760556600     33.973363720
__umuldi3()         10.045714200      4.258521320
                    55.593463680 clock cycles

__unopdi4()          7.803949780      0
__udivmoddi4()      70.401916480     62.597966700
__umuldi3()         12.308324680      4.504374900
                    90.514190940 clock cycles

[…]

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD EPYC 7713 64-Core Processor

__unopdi4()          2.7968750      0
__udivmoddi4()       7.5000000      4.7031250
__umuldi3()          2.7968750      0.0000000
                    13.0937500 nano-seconds

__unopdi4()          4.1406250      0
__udivmoddi4()       7.9843750      3.8437500
__umuldi3()          3.9687500      0.0000000
                    16.0937500 nano-seconds

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD EPYC 7713 64-Core Processor

_aullnop()           2.3437500      0
_aulldvrm()          8.1875000      5.8437500
_aullmul()           3.2031250      0.8593750
                    13.7343750 nano-seconds

_aullnop()           3.5468750      0
_aulldvrm()          8.7500000      5.2031250
_aullmul()           4.7187500      1.1718750
                    17.0156250 nano-seconds

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD EPYC 7713 64-Core Processor

_aullnop()           2.3593750      0
_aulldvrm()         26.7656250     24.4062500
_aullmul()           3.5468750      1.1875000
                    32.6718750 nano-seconds

_aullnop()           3.5468750      0
_aulldvrm()         17.9843750     14.4375000
_aullmul()           5.0937500      1.5468750
                    26.6250000 nano-seconds

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD EPYC 7713 64-Core Processor

__unopdi4()          2.8593750      0
__udivmoddi4()      20.8125000     17.9531250
__umuldi3()          5.2187500      2.3593750
                    28.8906250 nano-seconds

__unopdi4()          4.3437500      0
__udivmoddi4()      36.9062500     32.5625000
__umuldi3()          6.4062500      2.0625000
                    47.6562500 nano-seconds

Contact

If you miss anything here, have additions, comments, corrections, criticism or questions, want to give feedback, hints or tips, report broken links, bugs, deficiencies, errors, inaccuracies, misrepresentations, omissions, shortcomings, vulnerabilities or weaknesses, …: don’t hesitate to contact me and feel free to ask, comment, criticise, flame, notify or report!

Use the X.509 certificate to send S/MIME encrypted mail.

Note: email in weird format and without a proper sender name is likely to be discarded!

I dislike HTML (and even weirder formats too) in email, I prefer to receive plain text.
I also expect to see your full (real) name as sender, not your nickname.
I abhor top posts and expect inline quotes in replies.

Terms and Conditions

By using this site, you signify your agreement to these terms and conditions. If you do not agree to these terms and conditions, do not use this site!

The software and the documentation on this site are provided as is without any warranty, either express or implied.
In no event will the author be held liable for any damage(s) arising from the use of the software or the documentation.
Permission is granted to use the current version of the software and the current version of the documentation solely for personal private and non-commercial purposes.
An individual’s use of the software or the documentation in his or her capacity or function as an agent, (independent) contractor, employee, member or officer of a business, corporation or organisation (commercial or non-commercial) does not qualify as personal private and non-commercial purpose.
Without written approval from the author the software or the documentation must not be used for a business, for commercial, corporate, governmental, military or organisational purposes of any kind, or in a commercial, corporate, governmental, military or organisational environment of any kind.
Redistribution of the software and the documentation is allowed only in unmodified form of its current version and free of charge.

Data Protection Declaration

This web page records no (personal) data and stores no cookies in the web browser.

The web service is operated and provided by

Telekom Deutschland GmbH
Business Center
D-64306 Darmstadt
Germany
<‍hosting‍@‍telekom‍.‍de‍>
+49 800 5252033

The web service provider stores a session cookie in the web browser and records every visit of this web site with the following data in an access log on their server(s):

the (pseudonymised) IP address;
the date and time of the request;
the URL of the requested web page or file;
the Referer and User-Agent HTTP headers sent by the web browser;
the result (success or failure) of the request;
the amount of data received and sent.