Me, myself & IT

Fast(est) 128÷128-bit and 64÷64-bit Integer Division

Purpose
Algorithms
Extended Precision Division
Shift & Subtract Division
Hybrid Variant
Implementation for AMD64 Processors
128÷128-bit Unsigned Integer Division (128-bit Quotient and Remainder)
64÷64-bit Unsigned Integer Division (64-bit Quotient and Remainder)
Implementation for i386 Processors
64÷64-bit Unsigned Integer Division (64-bit Quotient and Remainder)
64÷64-bit Unsigned Integer Division (64-bit Quotient)
64÷64-bit Unsigned Integer Division (64-bit Remainder)
64÷64-bit Signed Integer Division (64-bit Quotient and Remainder)
64÷64-bit Signed Integer Division (64-bit Quotient)
64÷64-bit Signed Integer Division (64-bit Remainder)
64×64-bit Signed and Unsigned Integer Multiplication (64-bit Product)
64-bit Signed and Unsigned Integer Shift (64-bit Result)
Execution Times (Sustained Reciprocal Throughput)
Running ’round in Circles Cycles
Summary
Benchmark Programs for AMD64 Processors
Benchmark Program for i386 Processors

Purpose

This article presents the fast 128÷128-bit unsigned integer division routine __udivmodti4() for AMD64 processors; the fast 64÷64-bit unsigned integer division routines __udivmoddi4(), __udivdi3() and __umoddi3() and the fast 64÷64-bit signed integer division routines __divdi3() and __moddi3() for i386 processors; and the fast compiler helper routines _alldiv(), _alldvrm(), _allmul(), _allrem(), _allshl(), _allshr(), _aulldiv(), _aulldvrm(), _aullrem() and _aullshr(), which the Microsoft® Visual C compiler calls to perform 64-bit division, multiplication and shifts on i386 processors.

Note: the fast 128÷64-bit unsigned integer division routine _udiv128() for i386 processors, implemented in ANSI C and Assembler, is presented in my article Donald Knuth’s Algorithm D, and provided with my NOMSVCRT.LIB runtime library.

Algorithms

The following outline assumes a machine word and corresponding register size of 64 bits.

Extended Precision Division

The extended precision division algorithm is quite simple: it relies on the processor’s native DIV instruction, which performs a so-called narrowing 128÷64-bit division, producing a 64-bit quotient and a 64-bit remainder from a 128-bit dividend and a 64-bit divisor.
Exceptional case
If the divisor is 0, a divide by 0 exception is raised.
Note: this is handled by the second simple case!
Trivial cases
If the divisor is greater than the dividend (which implies that the divisor is greater than 0), the quotient is 0, while the remainder is equal to the dividend.
If the divisor is equal to the dividend and (both are) greater than 0, the quotient is 1, while the remainder is 0.
Simple cases
If the divisor is less than 2**64, but greater than the upper half of the dividend, the upper halves of quotient and remainder are 0, and a single DIV instruction yields their lower halves.
If the divisor is less than 2**64 and not greater than the upper half of the dividend, the upper half of the remainder is 0 too, while its lower half and (both halves of) the quotient are produced with two consecutive DIV instructions using the so-called long alias schoolbook division (and 64-bit numbers as digits) to avoid an overflow of the quotient.
Hard case (multiple steps)
If the divisor is not less than 2**64, it is normalised, i.e. shifted left until its most significant bit is set, and its lower half is discarded; taken together, this is equivalent to a division by 2**(64 − number of leading '0' bits).
The truncated normalised divisor′ is, if necessary, subtracted from the upper half of the dividend to prevent an overflow, then used to produce the lower half of an intermediate approximate quotient′ with a single DIV instruction; if the normalised divisor′ was subtracted from the upper half of the dividend before, the upper half of the intermediate approximate quotient′ is 1, else 0.
The intermediate approximate quotient′ is shifted left by the same amount as the normalised divisor′, giving the final approximate quotient″, which might be 1 too high due to the discarded lower half of the normalised divisor′ (the lower half of the final approximate quotient″ is 0 and discarded).
The approximate remainder′ is computed as dividend minus the product of original divisor and final approximate quotient″.
If the approximate remainder′ is less than 0, the original divisor is added, while the final approximate quotient″ is decremented by 1, producing the proper quotient and remainder.
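The cases above can be sketched in portable C by scaling everything down by a factor of two: words are 32 bits wide, dividend and divisor are 64 bits wide, and a helper stands in for the narrowing DIV instruction. The names narrow_div() and sketch_udivmod64() are assumptions made for this illustration only; they are not routines from this article, and the sketch favours readability over speed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the narrowing DIV instruction, scaled down to
   32-bit words: divides hi:lo by d. Like DIV, it requires hi < d, so that
   the quotient fits into one word; a zero divisor trips the assertion. */
static uint32_t narrow_div(uint32_t hi, uint32_t lo, uint32_t d, uint32_t *rem)
{
    uint64_t n = ((uint64_t) hi << 32) | lo;

    assert(hi < d);
    *rem = (uint32_t) (n % d);
    return (uint32_t) (n / d);
}

/* Hypothetical 64÷64-bit division built from narrowing 64÷32-bit steps. */
static uint64_t sketch_udivmod64(uint64_t dividend, uint64_t divisor, uint64_t *remainder)
{
    uint32_t n_hi = (uint32_t) (dividend >> 32), n_lo = (uint32_t) dividend;
    uint32_t d_hi = (uint32_t) (divisor >> 32), d_lo = (uint32_t) divisor;
    uint32_t q_hi = 0, q_lo, d_norm, r;
    uint64_t q = 0, low, high;
    unsigned shift = 0;

    if (d_hi == 0)              /* simple cases: divisor < 2**32 */
    {
        if (n_hi >= d_lo)       /* second simple case: two steps */
        {
            q_hi = narrow_div(0, n_hi, d_lo, &r);
            n_hi = r;
        }

        q_lo = narrow_div(n_hi, n_lo, d_lo, &r);
        *remainder = r;
        return ((uint64_t) q_hi << 32) | q_lo;
    }

    if (divisor > dividend)     /* trivial case */
    {
        *remainder = dividend;
        return 0;
    }

    /* hard case: normalise the divisor and keep only its upper word */
    while (((d_hi << shift) & 0x80000000u) == 0)
        shift++;

    d_norm = (uint32_t) ((divisor << shift) >> 32);

    if (n_hi >= d_norm)         /* prevent overflow of the narrow quotient */
    {
        n_hi -= d_norm;
        q = 1;                  /* upper word of quotient' */
    }

    q = (q << 32) | narrow_div(n_hi, n_lo, d_norm, &r);
    q = (q << shift) >> 32;     /* quotient", at most 1 too high */

    /* bits 32 and up of q * divisor, keeping the carry out of 64 bits */
    low = q * d_lo;
    high = (low >> 32) + q * d_hi;

    if ((high >> 32) != 0 || ((high << 32) | (uint32_t) low) > dividend)
        q -= 1;                 /* the product exceeded the dividend */

    *remainder = dividend - q * divisor;
    return q;
}
```

Instead of testing the sign of an approximate remainder′, the sketch compares the product with the dividend before subtracting; both formulations detect the same off-by-one condition.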

Shift & Subtract Division

The shift & subtract alias binary long division algorithm is almost trivial: it’s the schoolbook algorithm using bits as digits.
Exceptional case
If the divisor is 0, a divide by 0 exception is raised.
Trivial cases
If the divisor is greater than the dividend (which implies that the divisor is greater than 0), the quotient is 0, while the remainder is equal to the dividend.
If the divisor is equal to the dividend and (both are) greater than 0, the quotient is 1, while the remainder is 0.
Long case (multiple steps with a loop)
The quotient is set to 0.
The divisor is aligned to the dividend, i.e. shifted left until their most significant bits are in the same position.
Until the divisor is back in its original position,
¹ the quotient is shifted left one bit,
² if the dividend is not less than the divisor, the divisor is subtracted from the dividend, and the quotient incremented by 1, i.e. its least significant bit is set,
³ the divisor is shifted right one bit.
The remainder is the (remaining) dividend.
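The steps above can be sketched in portable C without compiler intrinsics. The helper names top_bit() and sketch_shift_subtract() are hypothetical and chosen for this illustration; the assertion on a zero divisor merely stands in for the divide by 0 exception.

```c
#include <assert.h>
#include <stdint.h>

/* Index of the most significant '1' bit; value must not be 0. */
static unsigned top_bit(uint64_t value)
{
    unsigned index = 0;

    assert(value != 0);

    while (value >>= 1)
        index++;

    return index;
}

/* Hypothetical 64÷64-bit shift & subtract division. */
static uint64_t sketch_shift_subtract(uint64_t dividend, uint64_t divisor, uint64_t *remainder)
{
    uint64_t quotient = 0;
    unsigned shift;

    assert(divisor != 0);       /* exceptional case */

    if (divisor > dividend)     /* trivial case */
    {
        *remainder = dividend;
        return 0;
    }

    /* align the divisor to the dividend */
    shift = top_bit(dividend) - top_bit(divisor);
    divisor <<= shift;

    do                          /* runs shift + 1 times */
    {
        quotient <<= 1;

        if (dividend >= divisor)
        {
            dividend -= divisor;
            quotient |= 1;
        }

        divisor >>= 1;
    } while (shift-- != 0);

    *remainder = dividend;
    return quotient;
}
```

The loop executes once per bit position between the leading '1' bits of dividend and divisor, so its running time depends on the operands, unlike the extended precision algorithm.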

Hybrid Variant

The hybrid variant combines the long alias schoolbook division with the binary long alias shift & subtract division algorithm:
if the divisor is less than 2**64, it uses the simple cases of the extended precision division algorithm,
else it uses the trivial cases and the long case of the shift & subtract alias binary long division algorithm.

Implementation for AMD64 Processors

Prototype for the __udivmoddi4() function, and sample C implementation of the 64÷64-bit shift & subtract division:
// Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifndef _MSC_VER
uint64_t __udivmoddi4(uint64_t dividend, uint64_t divisor, uint64_t *remainder);
#else
#pragma intrinsic(_BitScanReverse64)

typedef unsigned __int32 uint32_t;
typedef unsigned __int64 uint64_t;

uint64_t __udivmoddi4(uint64_t dividend, uint64_t divisor, uint64_t *remainder)
{
    uint64_t quotient;
    uint32_t index1, index2;

    if (_BitScanReverse64(&index2, divisor))
        if (_BitScanReverse64(&index1, dividend))
#if 0
            if (index1 >= index2)
#else
            if (dividend >= divisor)
#endif
            {
                // dividend >= divisor > 0,
                //  64 > index1 >= index2 >= 0
                //   (number of leading '0' bits = 63 - index)

                divisor <<= index1 - index2;
                quotient = 0;

                do
                {
                    quotient <<= 1;

                    if (dividend >= divisor)
                    {
                        dividend -= divisor;
                        quotient |= 1;
                    }

                    divisor >>= 1;
                } while (index1 >= ++index2);

                if (remainder != 0)
                    *remainder = dividend;

                return quotient;
            }
            else // divisor > dividend > 0:
                 //  quotient = 0, remainder = dividend
            {
                if (remainder != 0)
                    *remainder = dividend;

                return 0;
            }
        else // divisor > dividend == 0:
             //  quotient = 0, remainder = 0
        {
            if (remainder != 0)
                *remainder = 0;

            return 0;
        }
    else // divisor == 0
    {
        if (remainder != 0)
            return dividend % divisor;

        return dividend / divisor;
    }
}
#endif // _MSC_VER
The suffix di4 specifies the number of arguments plus return value and their size: double integer denotes an 8-byte QWORD alias 64-bit uint64_t.

Prototype for the __udivmodti4() function, and sample C implementation of the 128÷128-bit extended precision as well as the shift & subtract division:

// Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifndef _MSC_VER
uint128_t __udivmodti4(uint128_t dividend, uint128_t divisor, uint128_t *remainder);
#else
typedef unsigned __int32 uint32_t;
typedef unsigned __int64 uint64_t;
#if 0
typedef unsigned __int128 uint128_t;
#else
typedef struct
{
    uint64_t low, high;
} uint128_t;
#endif
#if _MSC_VER >= 1920 // MSC 19.20 alias 2019
#pragma intrinsic(__shiftleft128, __shiftright128, _udiv128, _umul128, _BitScanReverse64)

uint128_t __udivmodti4(uint128_t dividend, uint128_t divisor, uint128_t *remainder)
{
    uint128_t quotient;
#ifndef HYBRID
    uint64_t tmp, low, high;
    uint32_t index;

    if (_BitScanReverse64(&index, divisor.high))
    {
        tmp = __shiftleft128(divisor.low, divisor.high, 63 - index);

        if (tmp <= dividend.high)
        {
            tmp = _udiv128(dividend.high - tmp, dividend.low, tmp, &tmp);
            tmp = __shiftleft128(tmp, 1, 63 - index);
        }
        else
        {
            tmp = _udiv128(dividend.high, dividend.low, tmp, &tmp);
            // quotient" = quotient' / 2**(index + 1); a plain >> would be
            //  undefined for index = 63, so shift via the 128-bit intrinsic
            tmp = __shiftleft128(tmp, 0, 63 - index);
        }

        quotient.high = 0;
        quotient.low = tmp;

        low = _umul128(tmp, divisor.low, &high);
        tmp *= divisor.high;
        high += tmp;

        if ((high < tmp)           // quotient * divisor >= 2**128 > dividend
         || (high > dividend.high) // quotient * divisor > dividend
         || ((high == dividend.high) && (low > dividend.low)))
        {
            quotient.low -= 1;

            low = _umul128(quotient.low, divisor.low, &high);
            high += quotient.low * divisor.high;
        }

        if (remainder != 0)
        {
            dividend.high -= high + (dividend.low < low);
            dividend.low -= low;

            *remainder = dividend;
        }
    }
#else // HYBRID
    uint64_t tmp;
    uint32_t index1, index2;

    if (_BitScanReverse64(&index2, divisor.high))
        if (_BitScanReverse64(&index1, dividend.high))
            if (index1 >= index2)
            {
                // dividend >= divisor >= 2**64,
                //  64 > index1 >= index2 >= 0
                //   (number of leading '0' bits = 63 - index)

                divisor.high = __shiftleft128(divisor.low, divisor.high, index1 - index2);
                divisor.low <<= index1 - index2;

                quotient.high = quotient.low = 0;

                do
                {
                    quotient.low <<= 1;

                    if ((dividend.high > divisor.high)
                     || ((dividend.high == divisor.high) && (dividend.low >= divisor.low)))
                    {
                        dividend.high -= divisor.high + (dividend.low < divisor.low);
                        dividend.low -= divisor.low;

                        quotient.low |= 1;
                    }

                    divisor.low = __shiftright128(divisor.low, divisor.high, 1);
                    divisor.high >>= 1;
                } while (index1 >= ++index2);

                if (remainder != 0)
                    *remainder = dividend;
            }
            else // divisor > dividend >= 2**64:
                 //  quotient = 0, remainder = dividend
            {
                if (remainder != 0)
                    *remainder = dividend;
            }
        else // divisor >= 2**64 > dividend:
             //  quotient = 0, remainder = dividend
        {
            if (remainder != 0)
#if 1
                *remainder = dividend;
#else
            {
                remainder->high = 0;
                remainder->low = dividend.low;
            }
#endif
        }
#endif // HYBRID
    else // divisor < 2**64
    {
        if (dividend.high < divisor.low)
        {
            quotient.high = 0;
            quotient.low = _udiv128(dividend.high, dividend.low, divisor.low, &tmp);
        }
        else // "long" alias "schoolbook" division
        {
            quotient.high = _udiv128(0, dividend.high, divisor.low, &tmp);
            quotient.low = _udiv128(tmp, dividend.low, divisor.low, &tmp);
        }

        if (remainder != 0)
        {
            remainder->high = 0;
            remainder->low = tmp;
        }
    }

    return quotient;
}
#endif
#endif // _MSC_VER
The suffix ti4 specifies the number of arguments plus return value and their size: tetra integer denotes a 16-byte OWORD alias 128-bit uint128_t.

Note: the Microsoft Visual C compiler does not provide a 128-bit integer data type; the keyword __int128 is reserved but unsupported, and its use yields error C4235.
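Because a 128-bit value is carried as a pair of 64-bit halves here, every comparison and subtraction in the listing above has to propagate the borrow between the halves by hand. The following minimal sketch isolates that idiom; the type name u128 and the helpers u128_less() and u128_sub() are hypothetical and not part of this article's routines.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical two-word representation of a 128-bit unsigned integer. */
typedef struct
{
    uint64_t low, high;
} u128;

/* a < b: compare the high halves first, the low halves only on a tie. */
static int u128_less(u128 a, u128 b)
{
    return (a.high < b.high)
        || ((a.high == b.high) && (a.low < b.low));
}

/* a - b for a >= b: the low halves borrow from the high halves
   exactly when a.low < b.low. */
static u128 u128_sub(u128 a, u128 b)
{
    u128 difference;

    assert(!u128_less(a, b));

    difference.high = a.high - b.high - (a.low < b.low);
    difference.low = a.low - b.low;

    return difference;
}
```

The borrow must be derived from the original low halves, which is why the high half is computed before the low half is overwritten.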

Note: with the preprocessor macro HYBRID defined, the hybrid variant of the division algorithm is used.

128÷128-bit Unsigned Integer Division (128-bit Quotient and Remainder)

__udivmodti4() function for AMD64 processors, supporting the Unix® System V calling convention, using the extended precision division algorithm:
# Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

# * The software is provided "as is" without any warranty, neither express
#   nor implied.
# * In no event will the author be held liable for any damage(s) arising
#   from the use of the software.
# * Redistribution of the software is allowed only in unmodified form.
# * Permission is granted to use the software solely for personal private
#   and non-commercial purposes.
# * An individuals use of the software in his or her capacity or function
#   as an agent, (independent) contractor, employee, member or officer of
#   a business, corporation or organization (commercial or non-commercial)
#   does not qualify as personal private and non-commercial purpose.
# * Without written approval from the author the software must not be used
#   for a business, for commercial, corporate, governmental, military or
#   organizational purposes of any kind, or in a commercial, corporate,
#   governmental, military or organizational environment of any kind.

# Unix System V calling convention for AMD64 platform:
# - first 6 integer or pointer arguments (from left to right) are passed
#   in registers RDI/R7, RSI/R6, RDX/R2, RCX/R1, R8 and R9
#   (R10 is used as static chain pointer in case of nested functions);
# - surplus arguments are pushed on stack in reverse order (from right to
#   left);
# - 128-bit integer arguments are passed as 2 64-bit integer arguments;
# - 128-bit integer result is returned in registers RAX/R0 (low part) and
#   RDX/R2 (high part);
# - 64-bit integer or pointer result is returned in register RAX/R0;
# - 32-bit integer result is returned in register EAX;
# - registers RBX/R3, RSP/R4, RBP/R5, R12, R13, R14 and R15 must be
#   preserved;
# - registers RAX/R0, RCX/R1, RDX/R2, RSI/R6, RDI/R7, R8, R9, R10 (in
#   case of normal functions) and R11 are volatile and can be clobbered;
# - stack is 16-byte aligned: callee must decrement RSP by 8+n*16 bytes
#   before calling other functions (CALL pushes 8 bytes);
# - a "red zone" of 128 bytes below the stack pointer can be clobbered.

# NOTE: raises "division exception" when divisor is 0!

	.arch	generic64
	.code64
	.intel_syntax noprefix
	.global	___udivmodti4
	.type	___udivmodti4, @function
	.text
				# rsi:rdi = dividend
				# rcx:rdx = divisor
				# r8 = oword ptr remainder
___udivmodti4:
.if 0
	cmp	rdi, rdx
	mov	rax, rsi
	sbb	rax, rcx
	jb	.trivial	# dividend < divisor?
.else
	cmp	rsi, rcx
	jb	.trivial	# (high qword of) dividend < (high qword of) divisor?
.endif
	mov	r11, rcx	# r11 = high qword of divisor
	mov	r10, rdx	# r10 = low qword of divisor

	bsr	rcx, rcx	# rcx = index of leading '1' bit in high qword of divisor
	jnz	.extended	# high qword of divisor <> 0?

	# divisor < 2**64 (so remainder will be < 2**64 too)

	cmp	rsi, rdx
	jae	.long		# high qword of dividend >= (low qword of) divisor?

	# dividend < divisor * 2**64 (so quotient will be < 2**64):
	# perform normal division
.normal:
	mov	rdx, rsi
	mov	rax, rdi	# rdx:rax = dividend
	div	r10		# rax = (low qword of) quotient,
				# rdx = (low qword of) remainder
	test	r8, r8
	jz	0f		# address of remainder = 0?

	mov	[r8], rdx
	mov	[r8+8], r11	# high qword of remainder = 0
0:
	mov	rdx, r11	# rdx:rax = quotient
	ret

	# dividend >= divisor * 2**64 (so quotient will be >= 2**64):
	# perform "long" alias "schoolbook" division
.long:
	mov	rdx, r11	# rdx = 0
	mov	rax, rsi	# rdx:rax = high qword of dividend
	div	r10		# rax = high qword of quotient,
				# rdx = high qword of remainder'
	mov	rcx, rax	# rcx = high qword of quotient
	mov	rax, rdi	# rax = low qword of dividend
	div	r10		# rax = low qword of quotient,
				# rdx = (low qword of) remainder
	test	r8, r8
	jz	1f		# address of remainder = 0?

	mov	[r8], rdx
	mov	[r8+8], r11	# high qword of remainder = 0
1:
	mov	rdx, rcx	# rdx:rax = quotient
	ret

	# dividend < divisor
.trivial:
	test	r8, r8
	jz	2f		# address of remainder = 0?

	mov	[r8], rdi
	mov	[r8+8], rsi	# remainder = dividend
2:
	xor	eax, eax
	xor	edx, edx	# rdx:rax = quotient = 0
	ret

	# divisor >= 2**64 (so quotient will be < 2**64):
	# perform "extended & adjusted" division
.extended:
	not	ecx		# ecx = number of leading '0' bits in (high qword of) divisor
	mov	r9, r11		# r9 = high qword of divisor
	shld	r9, r10, cl	# r9 = divisor / 2**(index + 1)
				#    = divisor'
#	shl	r10, cl
	mov	rdx, rsi	# rdx = high qword of dividend
	mov	rax, rdi	# rax = low qword of dividend

	push	rbx
.ifnotdef JMPLESS
	xor	ebx, ebx	# rbx = high qword of quotient' = 0

	cmp	rdx, r9
	jb	3f		# high qword of dividend < divisor'?

	# high qword of dividend >= divisor':
	# subtract divisor' from high qword of dividend to prevent possible
	# quotient overflow and set most significant bit of quotient"

	sub	rdx, r9		# rdx = high qword of dividend - divisor'
				#     = high qword of dividend'
	inc	ebx		# rbx = high qword of quotient' = 1
3:
.elseif 0
	sub	rdx, r9		# rdx = high qword of dividend - divisor'
	sbb	rbx, rbx	# rbx = (high qword of dividend < divisor') ? -1 : 0
	and	rbx, r9		# rbx = (high qword of dividend < divisor') ? divisor' : 0
	add	rdx, rbx	# rdx = high qword of dividend
				#     - (high qword of dividend < divisor') ? 0 : divisor'
				#     = high qword of dividend'
	neg	rbx		# CF = (high qword of dividend < divisor') ? 1 : 0
	sbb	ebx, ebx	# ebx = (high qword of dividend < divisor') ? -1 : 0
	inc	ebx		# rbx = (high qword of dividend < divisor') ? 0 : 1
				#     = high qword of quotient'
.elseif 0
	sub	rdx, r9		# rdx = high qword of dividend - divisor'
	cmovb	rdx, rsi	#     = high qword of dividend'
	sbb	ebx, ebx	# ebx = (high qword of dividend < divisor') ? -1 : 0
	inc	ebx		# rbx = (high qword of dividend < divisor') ? 0 : 1
				#     = high qword of quotient'
.else
	xor	ebx, ebx	# rbx = high qword of quotient' = 0
	sub	rdx, r9		# rdx = high qword of dividend - divisor'
	cmovb	rdx, rsi	#     = high qword of dividend'
	sbb	ebx, -1		# rbx = (high qword of dividend < divisor') ? 0 : 1
				#     = high qword of quotient'
.endif # JMPLESS
	# high qword of dividend' < divisor'

	div	r9		# rax = dividend' / divisor'
				#     = low qword of quotient',
				# rdx = remainder'
	shld	rbx, rax, cl	# rbx = quotient' / 2**(index + 1)
				#     = dividend / divisor
				#     = quotient"
#	shl	rax, cl

	mov	r9, r11		# r9 = high qword of divisor
	mov	rax, r10	# rax = low qword of divisor
	imul	r9, rbx		# r9 = high qword of divisor * quotient"
	mul	rbx		# rdx:rax = low qword of divisor * quotient"
.ifnotdef JMPLESS
	add	rdx, r9		# rdx:rax = divisor * quotient"
	jnc	4f		# divisor * quotient" < 2**128?
				# with carry, it is off by divisor,
				#  and quotient" is off by 1
.if 0
	sbb	rbx, 0		# rbx = quotient" - 1
.else
	dec	rbx		# rbx = quotient" - 1
.endif
	sub	rax, r10
	sbb	rdx, r11	# rdx:rax = divisor * (quotient" - 1)
4:
	sub	rdi, rax
	sbb	rsi, rdx	# rsi:rdi = dividend - divisor * quotient"
				#         = remainder"
.else
	sub	rdi, rax
	sbb	rsi, rdx	# rsi:rdi = dividend - low qword of divisor * quotient"
	sub	rsi, r9		# rsi:rdi = dividend - divisor * quotient"
				#         = remainder"
.endif # JMPLESS
	jnb	5f		# remainder" >= 0?
				# with borrow, it is off by divisor,
				#  and quotient" is off by 1
.if 0
	sbb	rbx, 0		# rbx = quotient" - 1
				#     = quotient
.else
	dec	rbx		# rbx = quotient" - 1
				#     = quotient
.endif
	add	rdi, r10
	adc	rsi, r11	# rsi:rdi = remainder" + divisor
				#         = dividend - divisor * (quotient" - 1)
				#         = dividend - divisor * quotient
				#         = remainder
5:
	test	r8, r8
	jz	6f		# address of remainder = 0?

	mov	[r8], rdi
	mov	[r8+8], rsi	# remainder = rsi:rdi
6:
	mov	rax, rbx	# rax = (low qword of) quotient
	xor	edx, edx	# rdx:rax = quotient

	pop	rbx
	ret

	.end
__udivmodti4() function for AMD64 processors, supporting the Microsoft calling convention, using the extended precision division algorithm:
; Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

; Microsoft calling convention for AMD64 platform:
; - first 4 integer or pointer arguments (from left to right) are passed
;   in registers RCX/R1, RDX/R2, R8 and R9;
; - 16-byte arguments are passed by reference;
; - surplus arguments are pushed on stack in reverse order (from right
;   to left);
; - caller allocates memory for 16-byte return value and passes pointer
;   to it as (hidden) first argument, thus shifting all other arguments;
; - caller always allocates "home space" for 4 arguments on stack, even
;   when less than 4 arguments are passed, but does not need to push
;   first 4 arguments;
; - callee can spill first 4 arguments from registers to "home space";
; - callee can clobber "home space";
; - stack is 16-byte aligned: callee must decrement RSP by 8+n*16 bytes
;   when it calls other functions (CALL pushes 8 bytes);
; - 64-bit integer or pointer result is returned in register RAX/R0;
; - 32-bit integer result is returned in register EAX;
; - registers RAX/R0, RCX/R1, RDX/R2, R8, R9, R10 and R11 are volatile
;   and can be clobbered;
; - registers RBX/R3, RSP/R4, RBP/R5, RSI/R6, RDI/R7, R12, R13, R14 and
;   R15 must be preserved.

; NOTE: raises "division exception" when divisor is 0!

	.code
				; rcx = oword ptr quotient
				; rdx = oword ptr dividend
				; r8 = oword ptr divisor
				; r9 = oword ptr remainder
__udivmodti4 proc public

	mov	rax, [rdx]	; rax = low qword of dividend
	mov	rdx, [rdx+8]	; rdx = high qword of dividend

	mov	r10, [r8]	; r10 = low qword of divisor
	mov	r11, [r8+8]	; r11 = high qword of divisor

	mov	r8, rcx		; r8 = address of quotient
if 0
	cmp	rax, r10
	mov	rcx, rdx
	sbb	rcx, r11
	jb	trivial		; dividend < divisor?
else
	cmp	rdx, r11
	jb	trivial		; (high qword of) dividend < (high qword of) divisor?
endif
	bsr	rcx, r11	; rcx = index of leading '1' bit in high qword of divisor
	jnz	extended	; high qword of divisor <> 0?

	; divisor < 2**64 (so remainder will be < 2**64 too)

	cmp	rdx, r10
	jae	long		; high qword of dividend >= (low qword of) divisor?

	; dividend < divisor * 2**64 (so quotient will be < 2**64):
	; perform normal division
normal:
	div	r10		; rax = (low qword of) quotient,
				; rdx = (low qword of) remainder
	mov	[r8], rax
	mov	[r8+8], r11	; high qword of quotient = 0

	test	r9, r9
	jz	@f		; address of remainder = 0?

	mov	[r9], rdx
	mov	[r9+8], r11	; high qword of remainder = 0
@@:
	mov	rax, r8		; rax = address of quotient
	ret

	; dividend >= divisor * 2**64 (so quotient will be >= 2**64):
	; perform "long" alias "schoolbook" division
long:
	mov	rcx, rax	; rcx = low qword of dividend
	mov	rax, rdx	; rax = high qword of dividend
	mov	rdx, r11	; rdx:rax = high qword of dividend
	div	r10		; rax = high qword of quotient,
				; rdx = high qword of remainder'
	xchg	rcx, rax	; rcx = high qword of quotient,
				; rax = low qword of dividend
	div	r10		; rax = low qword of quotient,
				; rdx = (low qword of) remainder
	mov	[r8], rax
	mov	[r8+8], rcx	; quotient = rcx:rax

	test	r9, r9
	jz	@f		; address of remainder = 0?

	mov	[r9], rdx
	mov	[r9+8], r11	; high qword of remainder = 0
@@:
	mov	rax, r8		; rax = address of quotient
	ret

	; dividend < divisor
trivial:
	xor	ecx, ecx
	mov	[r8], rcx
	mov	[r8+8], rcx	; quotient = 0

	test	r9, r9
	jz	@f		; address of remainder = 0?

	mov	[r9], rax
	mov	[r9+8], rdx	; remainder = dividend
@@:
	mov	rax, r8		; rax = address of quotient
	ret

	; divisor >= 2**64 (so quotient will be < 2**64):
	; perform "extended & adjusted" division
extended:
	mov	[rsp+8], rbx
	mov	[rsp+16], r12
	mov	[rsp+24], r13
	mov	[rsp+32], r14

	not	ecx		; ecx = number of leading '0' bits in (high qword of) divisor
	mov	r12, r11	; r12 = high qword of divisor
	shld	r12, r10, cl	; r12 = divisor / 2**(index + 1)
				;     = divisor'
;;	shl	r10, cl
	mov	r13, rax	; r13 = low qword of dividend
	mov	r14, rdx	; r14 = high qword of dividend
ifndef JMPLESS
	xor	ebx, ebx	; rbx = high qword of quotient' = 0

	cmp	rdx, r12
	jb	@f		; high qword of dividend < divisor'?

	; high qword of dividend >= divisor':
	; subtract divisor' from high qword of dividend to prevent possible
	; quotient overflow and set most significant bit of quotient"

	sub	rdx, r12	; rdx = high qword of dividend - divisor'
				;     = high qword of dividend'
	inc	ebx		; rbx = high qword of quotient' = 1
@@:
elseif 0
	sub	rdx, r12	; rdx = high qword of dividend - divisor'
	sbb	rbx, rbx	; rbx = (high qword of dividend < divisor') ? -1 : 0
	and	rbx, r12	; rbx = (high qword of dividend < divisor') ? divisor' : 0
	add	rdx, rbx	; rdx = high qword of dividend
				;     - (high qword of dividend < divisor') ? 0 : divisor'
				;     = high qword of dividend'
	neg	rbx		; CF = (high qword of dividend < divisor') ? 1 : 0
	sbb	ebx, ebx	; ebx = (high qword of dividend < divisor') ? -1 : 0
	inc	ebx		; rbx = (high qword of dividend < divisor') ? 0 : 1
				;     = high qword of quotient'
elseif 0
	sub	rdx, r12	; rdx = high qword of dividend - divisor'
	cmovb	rdx, r14	;     = high qword of dividend'
	sbb	ebx, ebx	; ebx = (high qword of dividend < divisor') ? -1 : 0
	inc	ebx		; rbx = (high qword of dividend < divisor') ? 0 : 1
				;     = high qword of quotient'
else
	xor	ebx, ebx	; rbx = high qword of quotient' = 0
	sub	rdx, r12	; rdx = high qword of dividend - divisor'
	cmovb	rdx, r14	;     = high qword of dividend'
	sbb	ebx, -1		; rbx = (high qword of dividend < divisor') ? 0 : 1
				;     = high qword of quotient'
endif ; JMPLESS
	; high qword of dividend' < divisor'

	div	r12		; rax = dividend' / divisor'
				;     = low qword of quotient',
				; rdx = remainder'
	shld	rbx, rax, cl	; rbx = quotient' / 2**(index + 1)
				;     = dividend / divisor
				;     = quotient"
;;	shl	rax, cl
ifndef JMPLESS
	mov	r12, r11	; r12 = high qword of divisor
	mov	rax, r10	; rax = low qword of divisor
	imul	r12, rbx	; r12 = high qword of divisor * quotient"
	mul	rbx		; rdx:rax = low qword of divisor * quotient"
	add	rdx, r12	; rdx:rax = divisor * quotient"
	jnc	@f		; divisor * quotient" < 2**128?
				; with carry, it is off by divisor,
				;  and quotient" is off by 1
if 0
	sbb	rbx, 0		; rbx = quotient" - 1
else
	dec	rbx		; rbx = quotient" - 1
endif
	sub	rax, r10
	sbb	rdx, r11	; rdx:rax = divisor * (quotient" - 1)
@@:
	sub	r13, rax
	sbb	r14, rdx	; r14:r13 = dividend - divisor * quotient"
				;         = remainder"
else
	mov	r12, r11	; r12 = high qword of divisor
	mov	rax, r10	; rax = low qword of divisor
	imul	r12, rbx	; r12 = high qword of divisor * quotient"
	mul	rbx		; rdx:rax = low qword of divisor * quotient"
	sub	r13, rax
	sbb	r14, rdx	; r14:r13 = dividend - low qword of divisor * quotient"
	sub	r14, r12	; r14:r13 = dividend - divisor * quotient"
				;         = remainder"
endif ; JMPLESS
	jnb	@f		; remainder" >= 0?
				; with borrow, it is off by divisor,
				;  and quotient" is off by 1
if 0
	sbb	rbx, 0		; rbx = quotient" - 1
				;     = quotient
else
	dec	rbx		; rbx = quotient" - 1
				;     = quotient
endif
	add	r13, r10
	adc	r14, r11	; r14:r13 = remainder" + divisor
				;         = dividend - divisor * (quotient" - 1)
				;         = dividend - divisor * quotient
				;         = remainder
@@:
	xor	eax, eax	; rax = high qword of quotient = 0
	mov	[r8], rbx
	mov	[r8+8], rax	; quotient = rax:rbx

	test	r9, r9
	jz	@f		; address of remainder = 0?

	mov	[r9], r13
	mov	[r9+8], r14	; remainder = r14:r13
@@:
	mov	rbx, [rsp+8]
	mov	r12, [rsp+16]
	mov	r13, [rsp+24]
	mov	r14, [rsp+32]

	mov	rax, r8		; rax = address of quotient
	ret

__udivmodti4 endp
	end
__udivmodti4() function for AMD64 processors, supporting the Unix System V calling convention, using the hybrid variant of the division algorithm:
# Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

# NOTE: raises "division exception" when divisor is 0!

	.arch	generic64
	.code64
	.intel_syntax noprefix
	.global	___udivmodti4
	.type	___udivmodti4, @function
	.text
				# rsi:rdi = dividend
				# rcx:rdx = divisor
				# r8 = oword ptr remainder
___udivmodti4:
.if 0
	cmp	rdi, rdx
	mov	rax, rsi
	sbb	rax, rcx
	jb	.trivial	# dividend < divisor?
.else
	cmp	rsi, rcx
	jb	.trivial	# (high qword of) dividend < (high qword of) divisor?
.endif
	bsr	r9, rcx		# r9 = index of leading '1' bit in high qword of divisor
	jz	.simple		# high qword of divisor = 0?

	# dividend >= divisor >= 2**64 (so quotient will be < 2**64)

	mov	r11, rcx	# r11 = high qword of divisor
	bsr	rcx, rsi	# rcx = index of leading '1' bit in high qword of dividend
#	jz	.trivial	# high qword of dividend = 0?

	# perform "shift & subtract" alias "binary long" division
.large:
	sub	rcx, r9		# rcx = distance of leading '1' bits
#	jb	.trivial	# dividend < divisor?

	xor	r9, r9		# r9 = (low qword of) quotient' = 0
	mov	r10, rdx	# r10 = low qword of divisor
	shld	r11, r10, cl
	shl	r10, cl		# r11:r10 = divisor << distance of leading '1' bits
				#         = divisor'
.loop:
	mov	rax, rdi
	mov	rdx, rsi	# rdx:rax = dividend'
	sub	rdi, r10
	sbb	rsi, r11	# rsi:rdi = dividend' - divisor'
				#         = dividend",
				# CF = dividend' < divisor'
	cmovb	rdi, rax
	cmovb	rsi, rdx	# rsi:rdi = (dividend' < divisor') ? dividend' : dividend"
	cmc			# CF = dividend' >= divisor'
	adc	r9, r9		# r9 = quotient' << 1
				#    + dividend' >= divisor'
				#    = quotient"
.if 0
	shrd	r10, r11, 1
	shr	r11, 1		# r11:r10 = divisor' >> 1
				#         = divisor"
.else
	shr	r11, 1
	rcr	r10, 1		# r11:r10 = divisor' >> 1
				#         = divisor"
.endif
	dec	ecx
	jns	.loop

	test	r8, r8
	jz	0f		# address of remainder = 0?

	mov	[r8], rdi
	mov	[r8+8], rsi	# remainder = dividend"
0:
	mov	rax, r9		# rax = (low qword of) quotient
	xor	edx, edx	# rdx:rax = quotient
	ret

	# dividend < divisor
.trivial:
	test	r8, r8
	jz	1f		# address of remainder = 0?

	mov	[r8], rdi
	mov	[r8+8], rsi	# remainder = dividend
1:
	xor	eax, eax
	xor	edx, edx	# rdx:rax = quotient = 0
	ret

	# divisor < 2**64 (so remainder will be < 2**64 too)
.simple:
	mov	r9, rdx		# r9 = (low qword of) divisor
	cmp	rsi, rdx
	jae	.long		# high qword of dividend >= (low qword of) divisor?

	# dividend < divisor * 2**64 (so quotient will be < 2**64):
	# perform normal division
.normal:
	mov	rdx, rsi
	mov	rax, rdi	# rdx:rax = dividend
	div	r9		# rax = (low qword of) quotient,
				# rdx = (low qword of) remainder
	test	r8, r8
	jz	2f		# address of remainder = 0?

	mov	[r8], rdx
	mov	[r8+8], rcx	# high qword of remainder = 0
2:
	mov	rdx, rcx	# rdx:rax = quotient
	ret

	# dividend >= divisor * 2**64 (so quotient will be >= 2**64):
	# perform "long" alias "schoolbook" division
.long:
	mov	rdx, rcx	# rdx = 0
	mov	rax, rsi	# rdx:rax = high qword of dividend
	div	r9		# rax = high qword of quotient,
				# rdx = high qword of remainder'
	mov	r10, rax	# r10 = high qword of quotient
	mov	rax, rdi	# rax = low qword of dividend
	div	r9		# rax = low qword of quotient,
				# rdx = (low qword of) remainder
	test	r8, r8
	jz	3f		# address of remainder = 0?

	mov	[r8], rdx
	mov	[r8+8], rcx	# high qword of remainder = 0
3:
	mov	rdx, r10	# rdx:rax = quotient
	ret

	.end
__udivmodti4() function for AMD64 processors, supporting the Microsoft calling convention, using the hybrid variant of the division algorithm:
; Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

; NOTE: raises "division exception" when divisor is 0!

	.code
				; rcx = address of quotient (returned in rax)
				; rdx = address of dividend
				; r8 = address of divisor
				; r9 = address of (optional) remainder

__udivmodti4 proc public

	mov	rax, [rdx]	; rax = low qword of dividend
	mov	rdx, [rdx+8]	; rdx = high qword of dividend

	mov	r10, [r8]	; r10 = low qword of divisor
	mov	r11, [r8+8]	; r11 = high qword of divisor

	mov	r8, rcx		; r8 = address of quotient
if 0
	cmp	rax, r10
	mov	rcx, rdx
	sbb	rcx, r11
	jb	trivial		; dividend < divisor?
else
	cmp	rdx, r11
	jb	trivial		; (high qword of) dividend < (high qword of) divisor?
endif
	bsr	rcx, r11	; rcx = index of leading '1' bit in high qword of divisor
	jz	simple		; high qword of divisor = 0?

	; dividend >= divisor >= 2**64 (so quotient will be < 2**64)

	mov	[rsp+8], rbx

	bsr	rbx, rdx	; rbx = index of leading '1' bit in high qword of dividend
;;	jz	trivial		; high qword of dividend = 0?

	; perform "shift & subtract" alias "binary long" division
large:
	sub	ebx, ecx	; ebx = distance of leading '1' bits
;;	jb	trivial		; dividend < divisor?

	mov	ecx, ebx
	xor	ebx, ebx	; rbx = (low qword of) quotient' = 0

	shld	r11, r10, cl
	shl	r10, cl		; r11:r10 = divisor << distance of leading '1' bits
				;         = divisor'
	mov	[rsp+16], r12
	mov	[rsp+24], r13
@@:
	mov	r12, rax
	mov	r13, rdx	; r13:r12 = dividend'
	sub	rax, r10
	sbb	rdx, r11	; rdx:rax = dividend' - divisor'
				;         = dividend",
				; CF = dividend' < divisor'
	cmovb	rax, r12
	cmovb	rdx, r13	; rdx:rax = (dividend' < divisor') ? dividend' : dividend"
	cmc			; CF = dividend' >= divisor'
	adc	rbx, rbx	; rbx = quotient' << 1
				;     + dividend' >= divisor'
				;     = quotient"
if 0
	shrd	r10, r11, 1
	shr	r11, 1		; r11:r10 = divisor' >> 1
				;         = divisor"
else
	shr	r11, 1
	rcr	r10, 1		; r11:r10 = divisor' >> 1
				;         = divisor"
endif
	dec	ecx
	jns	@b

	mov	r12, [rsp+16]
	mov	r13, [rsp+24]

	xor	ecx, ecx
	mov	[r8], rbx
	mov	[r8+8], rcx	; high qword of quotient = 0

	test	r9, r9
	jz	@f		; address of remainder = 0?

	mov	[r9], rax
	mov	[r9+8], rdx	; remainder = dividend"
@@:
	mov	rax, r8		; rax = address of quotient

	mov	rbx, [rsp+8]
	ret

	; dividend < (2**64 <=) divisor:
	; quotient = 0, remainder = dividend
trivial:
	xor	ecx, ecx
	mov	[r8], rcx
	mov	[r8+8], rcx	; quotient = 0

	test	r9, r9
	jz	@f		; address of remainder = 0?

	mov	[r9], rax
	mov	[r9+8], rdx	; remainder = dividend
@@:
	mov	rax, r8		; rax = address of quotient

				; rbx is neither saved nor modified on this path,
				; so it must not be reloaded from [rsp+8]
	ret

	; divisor < 2**64 (so remainder will be < 2**64 too)
simple:
	cmp	rdx, r10
	jae	long		; high qword of dividend >= (low qword of) divisor?

	; dividend < divisor * 2**64 (so quotient will be < 2**64):
	; perform normal division
normal:
	div	r10		; rax = (low qword of) quotient,
				; rdx = (low qword of) remainder
	mov	[r8], rax
	mov	[r8+8], r11	; high qword of quotient = 0

	test	r9, r9
	jz	@f		; address of remainder = 0?

	mov	[r9], rdx
	mov	[r9+8], r11	; high qword of remainder = 0
@@:
	mov	rax, r8		; rax = address of quotient
	ret

	; dividend >= divisor * 2**64 (so quotient will be >= 2**64):
	; perform "long" alias "schoolbook" division
long:
	mov	rcx, rax	; rcx = low qword of dividend
	mov	rax, rdx	; rax = high qword of dividend
	mov	rdx, r11	; rdx:rax = high qword of dividend
	div	r10		; rax = high qword of quotient,
				; rdx = high qword of remainder'
	xchg	rcx, rax	; rcx = high qword of quotient,
				; rax = low qword of dividend
	div	r10		; rax = low qword of quotient,
				; rdx = (low qword of) remainder
	mov	[r8], rax
	mov	[r8+8], rcx	; quotient = rcx:rax

	test	r9, r9
	jz	@f		; address of remainder = 0?

	mov	[r9], rdx
	mov	[r9+8], r11	; high qword of remainder = 0
@@:
	mov	rax, r8		; rax = address of quotient
	ret

__udivmodti4 endp
	end
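
The order in which the hybrid routine above picks one of its four code paths can be summarised in C. This is only a sketch of the comparisons visible in the listing: the enumeration and the function sketch_select_path are hypothetical names introduced for illustration, and the function performs no division itself.

```c
#include <stdint.h>

/* Hypothetical names for the four code paths of the listing above. */
enum sketch_path {
    SKETCH_TRIVIAL,         /* quotient = 0, remainder = dividend */
    SKETCH_NORMAL,          /* one DIV instruction */
    SKETCH_LONG,            /* two chained DIV instructions */
    SKETCH_SHIFT_SUBTRACT   /* one quotient bit per loop iteration */
};

/* Selects the path from the qwords of dividend and divisor, in the same
   order as the comparisons in the listing.  The low qword of the dividend
   does not influence the choice.  A divisor of 0 ends up in SKETCH_LONG,
   where the first DIV raises the division exception. */
static enum sketch_path sketch_select_path(uint64_t dividend_high,
                                           uint64_t divisor_low,
                                           uint64_t divisor_high)
{
    if (dividend_high < divisor_high)
        return SKETCH_TRIVIAL;
    if (divisor_high != 0)
        return SKETCH_SHIFT_SUBTRACT;
    if (dividend_high >= divisor_low)
        return SKETCH_LONG;
    return SKETCH_NORMAL;
}
```

Only the shift & subtract path needs the additional registers, which may be why the listing saves rbx, r12 and r13 only after the two early exits.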

64÷64-bit Unsigned Integer Division (64-bit Quotient and Remainder)

__udivmoddi4() function for AMD64 processors, supporting the Microsoft calling convention, using shift & subtract division:
; Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

; NOTE: raises "division exception" when divisor is 0!

	.code
				; rcx = dividend
				; rdx = divisor
				; r8 = address of (optional) qword remainder
__udivmoddi4 proc public

	cmp	rcx, rdx
	jb	trivial		; dividend < divisor?

	bsr	rax, rdx	; rax = index of leading '1' bit of divisor
	jz	error		; divisor = 0?

	mov	r9, rcx		; r9 = dividend
	bsr	rcx, rcx	; rcx = index of leading '1' bit of dividend
	jz	zero		; dividend = 0?

	sub	ecx, eax	; ecx = distance of leading '1' bits
;;	jb	trivial		; dividend < divisor?

	shl	rdx, cl		; rdx = divisor << distance of leading '1' bits
				;     = divisor'
	xor	eax, eax	; eax = quotient' = 0
@@:
	mov	r10, r9		; r10 = dividend'
	sub	r9, rdx		; r9 = dividend' - divisor'
				;    = dividend"
				; CF = dividend' < divisor'
	cmovb	r9, r10		; r9 = (dividend' < divisor') ? dividend' : dividend"
	cmc			; CF = dividend' >= divisor'
	adc	rax, rax	; rax = quotient' << 1
				;     + (dividend' >= divisor')
				;     = quotient"
	shr	rdx, 1		; rdx = divisor' >> 1
				;     = divisor"
	dec	ecx
	jns	@b

	test	r8, r8
	jz	@f		; address of remainder = 0?

	mov	[r8], r9	; remainder = dividend"
@@:
	ret

	; dividend < divisor
trivial:
	test	r8, r8
	jz	@f		; address of remainder = 0?

	mov	[r8], rcx	; remainder = dividend
@@:
	xor	eax, eax	; rax = quotient = 0
	ret

	; dividend = 0
zero:
	test	r8, r8
	jz	@f		; address of remainder = 0?

	mov	[r8], r9	; remainder = 0
@@:
	mov	rax, r9		; rax = quotient = 0
	ret

	; divisor = 0
error:
	div	rdx
	ret

__udivmoddi4 endp
	end
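
The loop of the listing above can be modelled in C. What follows is a minimal sketch rather than the author's code: sketch_bsr64 and sketch_udivmod64 are hypothetical names, the bit scan is written as a plain loop instead of the BSR instruction, and an assert stands in for the division exception raised on a zero divisor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Index of the leading '1' bit, like BSR; the argument must not be 0. */
static unsigned sketch_bsr64(uint64_t value)
{
    unsigned index = 0;
    while (value >>= 1)
        index++;
    return index;
}

/* Hypothetical C model of shift & subtract division. */
static uint64_t sketch_udivmod64(uint64_t dividend, uint64_t divisor,
                                 uint64_t *remainder)
{
    uint64_t quotient = 0;

    assert(divisor != 0);   /* the assembler routine raises an exception */

    if (dividend >= divisor) {
        /* align the leading '1' bits of divisor and dividend */
        unsigned distance = sketch_bsr64(dividend) - sketch_bsr64(divisor);
        divisor <<= distance;

        /* one quotient bit per iteration, distance + 1 iterations */
        for (unsigned i = 0; i <= distance; i++) {
            quotient <<= 1;
            if (dividend >= divisor) {
                dividend -= divisor;
                quotient |= 1;
            }
            divisor >>= 1;
        }
    }
    if (remainder != NULL)
        *remainder = dividend;  /* what is left of the dividend */
    return quotient;
}
```

The iteration count depends only on the distance between the leading '1' bits, so operands of similar magnitude finish after a few iterations.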

Implementation for i386 Processors

Prototypes for the __udivmoddi4(), __udivdi3(), __umoddi3(), __divdi3() and __moddi3() functions:
// Copyleft © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

uint64_t __udivmoddi4(uint64_t dividend, uint64_t divisor, uint64_t *remainder);
uint64_t __udivdi3(uint64_t dividend, uint64_t divisor);
uint64_t __umoddi3(uint64_t dividend, uint64_t divisor);

int64_t __divdi3(int64_t dividend, int64_t divisor);
int64_t __moddi3(int64_t dividend, int64_t divisor);
The suffixes di4 and di3 give the operand size and the operand count, where the count covers the arguments plus the return value: di stands for double integer, an 8-byte QWORD alias 64-bit int64_t or uint64_t.

Note: the compiler helper routines for the Microsoft Visual C compiler use non-standard calling or naming conventions and therefore cannot be prototyped; they are for internal use by the compiler only!

Note: the remaining code below supports the common, so-called cdecl calling and naming convention used by C compilers on Linux®, Unix, Windows® and other operating systems, and runs on 35-year-old Intel® 80386 microprocessors.

Note: the branch-free code paths, which are assembled when the macro JMPLESS is defined, actually yield lower performance!
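
The difference between the two kinds of code path can be illustrated in C. This is a sketch under assumptions: the function names are hypothetical, the borrow is passed in as a plain flag because C has no carry flag, and a compiler is free to translate either form into the other.

```c
#include <stdint.h>

/* Branching form: add the divisor back only if the subtraction borrowed. */
static uint64_t sketch_fixup_branch(uint64_t remainder, uint64_t divisor,
                                    int borrowed)
{
    if (borrowed)
        remainder += divisor;
    return remainder;
}

/* Branch-free form: turn the borrow into an all-ones or all-zero mask,
   as "sbb eax, eax" followed by "and" does in the JMPLESS paths. */
static uint64_t sketch_fixup_jmpless(uint64_t remainder, uint64_t divisor,
                                     int borrowed)
{
    uint64_t mask = (uint64_t)0 - (uint64_t)(borrowed != 0);
    return remainder + (divisor & mask);
}
```

The branch-free form always executes the mask, AND and ADD operations, while the branching form skips the addition whenever no correction is needed.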

64÷64-bit Unsigned Integer Division (64-bit Quotient and Remainder)

__udivmoddi4() function for i386 processors:
# Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

# "cdecl" calling and naming convention for I386 platform:
# - arguments are pushed on stack in reverse order (from right to left);
# - 64-bit integer result is returned in registers EAX (low part) and
#   EDX (high part);
# - 32-bit integer or pointer result is returned in register EAX;
# - registers EAX, ECX and EDX are volatile and can be clobbered;
# - registers EBX, ESP, EBP, ESI and EDI must be preserved;
# - function names are prefixed with an underscore.

# NOTE: raises "division exception" when divisor is 0!

	.arch	i386
	.code32
	.intel_syntax noprefix
	.global	___udivmoddi4
	.type	___udivmoddi4, @function
	.text
				# [esp+20] = address of (optional) remainder
				# [esp+16] = high dword of divisor
				# [esp+12] = low dword of divisor
				# [esp+8] = high dword of dividend
				# [esp+4] = low dword of dividend
___udivmoddi4:
	mov	edx, [esp+16]	# edx = high dword of divisor
.ifdef TRIVIAL
	mov	eax, [esp+8]	# eax = high dword of dividend
	cmp	eax, edx
	jb	.trivial	# (high dword of) dividend < (high dword of) divisor?
.endif
	bsr	ecx, edx	# ecx = index of leading '1' bit in high dword of divisor
	jnz	.extended	# high dword of divisor <> 0?

	# high dword of divisor = 0 (so high dword of remainder will be 0 too)

	mov	ecx, [esp+12]	# ecx = (low dword of) divisor
	mov	eax, [esp+8]	# eax = high dword of dividend
	cmp	eax, ecx
	jae	.long		# high dword of dividend >= divisor?

	# perform normal division
.normal:
	mov	edx, eax	# edx = high dword of dividend
	mov	eax, [esp+4]	# edx:eax = dividend
	div	ecx		# eax = (low dword of) quotient,
				# edx = (low dword of) remainder

	mov	ecx, [esp+20]	# ecx = address of remainder
	test	ecx, ecx
	jz	0f		# address of remainder = 0?

	mov	[ecx], edx	# [ecx] = (low dword of) remainder
	xor	edx, edx
	mov	[ecx+4], edx
0:
	xor	edx, edx	# edx:eax = quotient
	ret

	# perform "long" alias "schoolbook" division
.long:
#	xor	edx, edx	# edx:eax = high dword of dividend
	div	ecx		# eax = high dword of quotient,
				# edx = high dword of remainder'
	push	eax		# [esp] = high dword of quotient
	mov	eax, [esp+8]	# eax = low dword of dividend
	div	ecx		# eax = low dword of quotient,
				# edx = (low dword of) remainder

	mov	ecx, [esp+24]	# ecx = address of remainder
	test	ecx, ecx
	jz	1f		# address of remainder = 0?

	mov	[ecx], edx	# [ecx] = (low dword of) remainder
	xor	edx, edx
	mov	[ecx+4], edx
1:
	pop	edx		# edx:eax = quotient
	ret
.ifdef TRIVIAL
	# dividend < divisor
.trivial:
	mov	ecx, [esp+20]	# ecx = address of remainder
	test	ecx, ecx
	jz	2f		# address of remainder = 0?

	mov	edx, [esp+4]	# edx = low dword of dividend
	mov	[ecx+4], eax
	mov	[ecx], edx	# [ecx] = remainder = dividend
2:
	xor	eax, eax
	mov	edx, eax	# edx:eax = quotient = 0
	ret
.endif
	# high dword of divisor <> 0 (so high dword of quotient will be 0):
	# perform "extended & adjusted" division
.extended:
	push	ebx
	push	edi

	mov	eax, [esp+20]	# edx:eax = divisor
	not	ecx		# ecx = number of leading '0' bits in (high dword of) divisor
	shld	edx, eax, cl	# edx = divisor / 2**(index + 1)
				#     = divisor'
#	shl	eax, cl
	mov	ebx, edx	# ebx = divisor'

	mov	edx, [esp+16]	# edx = high dword of dividend
	mov	eax, [esp+12]	# eax = low dword of dividend
.ifnotdef JMPLESS
	xor	edi, edi	# edi = high dword of quotient' = 0

	cmp	edx, ebx
	jb	3f		# high dword of dividend < divisor'?

	# high dword of dividend >= divisor':
	# subtract divisor' from high dword of dividend to prevent possible
	# quotient overflow and set most significant bit of quotient"

	sub	edx, ebx	# edx = high dword of dividend - divisor'
				#     = high dword of dividend'
	inc	edi		# edi = high dword of quotient' = 1
3:
.else
	sub	edx, ebx	# edx = high dword of dividend - divisor'
	sbb	edi, edi	# edi = (high dword of dividend < divisor') ? -1 : 0
	and	edi, ebx	# edi = (high dword of dividend < divisor') ? divisor' : 0
	add	edx, edi	# edx = high dword of dividend
				#     - (high dword of dividend < divisor') ? 0 : divisor'
				#     = high dword of dividend'
	neg	edi		# CF = (high dword of dividend < divisor') ? 1 : 0
	sbb	edi, edi	# edi = (high dword of dividend < divisor') ? -1 : 0
	inc	edi		# edi = (high dword of dividend < divisor') ? 0 : 1
				#     = high dword of quotient'
.endif # JMPLESS
	# high dword of dividend' < divisor'

	div	ebx		# eax = dividend' / divisor'
				#     = low dword of quotient',
				# edx = remainder'
	shld	edi, eax, cl	# edi = quotient' / 2**(index + 1)
				#     = dividend / divisor
				#     = quotient"
#	shl	eax, cl

	mov	eax, [esp+20]	# eax = low dword of divisor
	mul	edi		# edx:eax = low dword of divisor * quotient"

	mov	ecx, [esp+12]
	mov	ebx, [esp+16]	# ebx:ecx = dividend
	sub	ecx, eax
	sbb	ebx, edx	# ebx:ecx = dividend - low dword of divisor * quotient"

	mov	eax, [esp+24]	# eax = high dword of divisor
	imul	eax, edi	# eax = high dword of divisor * quotient"

	sub	ebx, eax	# ebx:ecx = dividend - divisor * quotient"
				#         = remainder"
.ifnotdef JMPLESS
	jnb	4f		# remainder" >= 0?
				# with borrow, it is off by divisor,
				#  and quotient" is off by 1
.if 0
	sbb	edi, 0		# edi = quotient" - 1
				#     = quotient
.else
	dec	edi		# edi = quotient" - 1
				#     = quotient
.endif
	add	ecx, [esp+20]
	adc	ebx, [esp+24]	# ebx:ecx = remainder" + divisor
				#         = dividend - divisor * (quotient" - 1)
				#         = dividend - divisor * quotient
				#         = remainder
4:
.else
	sbb	eax, eax	# eax = (remainder" < 0) ? -1 : 0
	cdq			# edx = (remainder" < 0) ? -1 : 0
	add	edi, eax	# edi = quotient" - 1
				#     = quotient
	and	eax, [esp+20]
	and	edx, [esp+24]	# edx:eax = (remainder" < 0) ? divisor : 0
	add	ecx, eax
	adc	ebx, edx	# ebx:ecx = remainder" + divisor
				#         = dividend - divisor * (quotient" - 1)
				#         = dividend - divisor * quotient
				#         = remainder
.endif # JMPLESS
	mov	eax, edi	# eax = (low dword of) quotient
	xor	edx, edx	# edx:eax = quotient

	mov	edi, [esp+28]	# edi = address of remainder
	test	edi, edi
	jz	5f		# address of remainder = 0?

	mov	[edi], ecx	# [edi] = low dword of remainder
	mov	[edi+4], ebx	# [edi+4] = high dword of remainder
5:
	pop	edi
	pop	ebx
	ret

	.end
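
The arithmetic of the "extended & adjusted" path can be followed in C. The sketch below is a hypothetical model, not a translation of the listing: sketch_udivmod_extended is an invented name, the leading zeros are counted with a loop, and the final correction uses an explicit overflow-aware comparison where the assembler code tests the carry flag of its last subtraction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C model; requires divisor >= 2**32. */
static uint64_t sketch_udivmod_extended(uint64_t dividend, uint64_t divisor,
                                        uint64_t *remainder)
{
    assert((divisor >> 32) != 0);

    /* count the leading '0' bits of the high dword of the divisor */
    unsigned zeros = 0;
    while (((divisor << zeros) >> 63) == 0)
        zeros++;

    /* divisor' = upper 32 bits of the normalised divisor (top bit set) */
    uint32_t divisor1 = (uint32_t)((divisor << zeros) >> 32);

    /* quotient' = dividend / divisor' may need 33 bits: split off the
       top bit so that the remaining division yields a 32-bit result */
    uint64_t top = 0;
    uint64_t rest = dividend;
    if ((dividend >> 32) >= divisor1) {
        rest -= (uint64_t)divisor1 << 32;
        top = 1;
    }
    uint64_t quotient1 = (top << 32) | (rest / divisor1);

    /* quotient" = quotient' / 2**(index + 1), exact or 1 too large */
    uint32_t quotient = (uint32_t)(quotient1 >> (32 - zeros));

    /* multiply back and detect an estimate that is 1 too large */
    uint64_t low = (uint64_t)(uint32_t)divisor * quotient;
    uint64_t high = (divisor >> 32) * quotient;
    uint64_t product = (high << 32) + low;          /* modulo 2**64 */
    int too_large = (high >> 32) != 0 || product < low || product > dividend;
    if (too_large) {
        quotient--;
        product -= divisor;     /* now divisor * quotient, without wrap */
    }
    if (remainder != NULL)
        *remainder = dividend - product;
    return quotient;
}
```

Because the truncated divisor is never larger than the real one, the estimate can only be too large, never too small, so a single downward correction suffices.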
Microsoft Visual C compiler helper routine _aulldvrm() for i386 processors:
; Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

; "stdcall" calling and naming convention for I386 platform:
; - arguments are pushed on stack in reverse order (from right to left);
; - 64-bit integer result is returned in registers EAX (low part) and
;   EDX (high part);
; - 32-bit integer or pointer result is returned in register EAX;
; - registers EAX, ECX and EDX are volatile and can be clobbered;
; - registers EBX, EBP, ESI and EDI must be preserved;
; - register ESP (the stack pointer) must be restored by callee;
; - function names are prefixed with an underscore and suffixed with an
;   at-sign, followed by the total number of bytes for all arguments.

; NOTE: returns quotient in EDX:EAX and remainder in EBX:ECX

; NOTE: raises "division exception" when divisor is 0!

	.386
	.model	flat, C
	.code
				; [esp+16] = high dword of divisor
				; [esp+12] = low dword of divisor
				; [esp+8] = high dword of dividend
				; [esp+4] = low dword of dividend
_aulldvrm proc	public

	mov	edx, [esp+16]	; edx = high dword of divisor
ifdef TRIVIAL
	mov	eax, [esp+8]	; eax = high dword of dividend
	cmp	eax, edx
	jb	trivial		; (high dword of) dividend < (high dword of) divisor?
endif
	bsr	ecx, edx	; ecx = index of leading '1' bit in high dword of divisor
	jnz	extended	; high dword of divisor <> 0?

	; high dword of divisor = 0 (so high dword of remainder will be 0 too)

	mov	ecx, [esp+12]	; ecx = (low dword of) divisor
	mov	eax, [esp+8]	; eax = high dword of dividend
	cmp	eax, ecx
	jae	long		; high dword of dividend >= divisor?

	; perform normal division
normal:
	mov	edx, eax	; edx = high dword of dividend
	mov	eax, [esp+4]	; edx:eax = dividend
	div	ecx		; eax = (low dword of) quotient,
				; edx = (low dword of) remainder
	mov	ecx, edx	; ecx = (low dword of) remainder
	xor	ebx, ebx	; ebx:ecx = remainder
	xor	edx, edx	; edx:eax = quotient

	ret	16		; callee restores stack

	; perform "long" alias "schoolbook" division
long:
;;	xor	edx, edx	; edx:eax = high dword of dividend
	div	ecx		; eax = high dword of quotient,
				; edx = high dword of remainder'
	mov	ebx, eax	; ebx = high dword of quotient
	mov	eax, [esp+4]	; eax = low dword of dividend
	div	ecx		; eax = low dword of quotient,
				; edx = (low dword of) remainder
	mov	ecx, edx	; ecx = (low dword of) remainder
	mov	edx, ebx	; edx:eax = quotient
	xor	ebx, ebx	; ebx:ecx = remainder

	ret	16		; callee restores stack
ifdef TRIVIAL
	; dividend < divisor
trivial:
	mov	ecx, [esp+4]	; ecx = low dword of dividend
	mov	ebx, eax	; ebx:ecx = remainder = dividend
	xor	eax, eax
	mov	edx, eax	; edx:eax = quotient = 0

	ret	16		; callee restores stack
endif
	; high dword of divisor <> 0 (so high dword of quotient will be 0):
	; perform "extended & adjusted" division
extended:
	push	edi

	mov	eax, [esp+16]	; edx:eax = divisor
	not	ecx		; ecx = number of leading '0' bits in (high dword of) divisor
	shld	edx, eax, cl	; edx = divisor / 2**(index + 1)
				;     = divisor'
;;	shl	eax, cl
	mov	ebx, edx	; ebx = divisor'

	mov	edx, [esp+12]	; edx = high dword of dividend
	mov	eax, [esp+8]	; eax = low dword of dividend
ifndef JMPLESS
	xor	edi, edi	; edi = high dword of quotient' = 0

	cmp	edx, ebx
	jb	@f		; high dword of dividend < divisor'?

	; high dword of dividend >= divisor':
	; subtract divisor' from high dword of dividend to prevent possible
	; division overflow and set most significant bit of quotient"

	sub	edx, ebx	; edx = high dword of dividend - divisor'
				;     = high dword of dividend'
	inc	edi		; edi = high dword of quotient' = 1
@@:
else
	sub	edx, ebx	; edx = high dword of dividend - divisor'
	sbb	edi, edi	; edi = (high dword of dividend < divisor') ? -1 : 0
	and	edi, ebx	; edi = (high dword of dividend < divisor') ? divisor' : 0
	add	edx, edi	; edx = high dword of dividend
				;     - (high dword of dividend < divisor') ? 0 : divisor'
				;     = high dword of dividend'
	neg	edi		; CF = (high dword of dividend < divisor') ? 1 : 0
	sbb	edi, edi	; edi = (high dword of dividend < divisor') ? -1 : 0
	inc	edi		; edi = (high dword of dividend < divisor') ? 0 : 1
				;     = high dword of quotient'
endif ; JMPLESS
	; high dword of dividend' < divisor'

	div	ebx		; eax = dividend' / divisor'
				;     = low dword of quotient',
				; edx = remainder'
	shld	edi, eax, cl	; edi = quotient' / 2**(index + 1)
				;     = dividend / divisor
				;     = quotient"
;;	shl	eax, cl

	mov	eax, [esp+16]	; eax = low dword of divisor
	mul	edi		; edx:eax = low dword of divisor * quotient"

	mov	ecx, [esp+8]
	mov	ebx, [esp+12]	; ebx:ecx = dividend
	sub	ecx, eax
	sbb	ebx, edx	; ebx:ecx = dividend - low dword of divisor * quotient"

	mov	eax, [esp+20]	; eax = high dword of divisor
	imul	eax, edi	; eax = high dword of divisor * quotient"

	sub	ebx, eax	; ebx:ecx = dividend - divisor * quotient"
				;         = remainder"
ifndef JMPLESS
	jnb	@f		; remainder" >= 0?
				; with borrow, it is off by divisor,
				;  and quotient" is off by 1
if 0
	sbb	edi, 0		; edi = quotient" - 1
				;     = quotient
else
	dec	edi		; edi = quotient" - 1
				;     = quotient
endif
	add	ecx, [esp+16]
	adc	ebx, [esp+20]	; ebx:ecx = remainder" + divisor
				;         = dividend - divisor * (quotient" - 1)
				;         = dividend - divisor * quotient
				;         = remainder
@@:
else
	sbb	eax, eax	; eax = (remainder" < 0) ? -1 : 0
	cdq			; edx = (remainder" < 0) ? -1 : 0
	add	edi, eax	; edi = quotient" - 1
				;     = quotient
	and	eax, [esp+16]
	and	edx, [esp+20]	; edx:eax = (remainder" < 0) ? divisor : 0
	add	ecx, eax
	adc	ebx, edx	; ebx:ecx = remainder" + divisor
				;         = dividend - divisor * (quotient" - 1)
				;         = dividend - divisor * quotient
				;         = remainder
endif ; JMPLESS
	mov	eax, edi	; eax = (low dword of) quotient
	xor	edx, edx	; edx:eax = quotient

	pop	edi
	ret	16		; callee restores stack

_aulldvrm endp
	end

64÷64-bit Unsigned Integer Division (64-bit Quotient)

__udivdi3() function for i386 processors:
; Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

; NOTE: raises "division exception" when divisor is 0!

	.386
	.model	flat, C
	.code
				; [esp+16] = high dword of divisor
				; [esp+12] = low dword of divisor
				; [esp+8] = high dword of dividend
				; [esp+4] = low dword of dividend
__udivdi3 proc	public

	mov	edx, [esp+16]	; edx = high dword of divisor
ifdef TRIVIAL
	mov	eax, [esp+8]	; eax = high dword of dividend
	cmp	eax, edx
	jb	trivial		; (high dword of) dividend < (high dword of) divisor?
endif
	bsr	ecx, edx	; ecx = index of leading '1' bit in high dword of divisor
	jnz	extended	; high dword of divisor <> 0?

	; high dword of divisor = 0 (so high dword of remainder will be 0 too)

	mov	ecx, [esp+12]	; ecx = (low dword of) divisor
	mov	eax, [esp+8]	; eax = high dword of dividend
	cmp	eax, ecx
	jae	long		; high dword of dividend >= divisor?

	; perform normal division
normal:
	mov	edx, eax	; edx = high dword of dividend
	mov	eax, [esp+4]	; edx:eax = dividend
	div	ecx		; eax = (low dword of) quotient,
				; edx = (low dword of) remainder
	xor	edx, edx	; edx:eax = quotient
	ret

	; perform "long" alias "schoolbook" division
long:
;;	xor	edx, edx	; edx:eax = high dword of dividend
	div	ecx		; eax = high dword of quotient,
				; edx = high dword of remainder'
	push	eax		; [esp] = high dword of quotient

	mov	eax, [esp+8]	; eax = low dword of dividend
	div	ecx		; eax = low dword of quotient,
				; edx = (low dword of) remainder
	pop	edx		; edx:eax = quotient
	ret
ifdef TRIVIAL
	; dividend < divisor
trivial:
	xor	eax, eax
	mov	edx, eax	; edx:eax = quotient = 0
	ret
endif
	; high dword of divisor <> 0 (so high dword of quotient will be 0):
	; perform "extended & adjusted" division
extended:
	push	ebx
	push	edi

	mov	eax, [esp+20]	; edx:eax = divisor
	not	ecx		; ecx = number of leading '0' bits in (high dword of) divisor
	shld	edx, eax, cl	; edx = divisor / 2**(index + 1)
				;     = divisor'
;;	shl	eax, cl
	mov	ebx, edx	; ebx = divisor'

	mov	edx, [esp+16]	; edx = high dword of dividend
	mov	eax, [esp+12]	; eax = low dword of dividend
ifndef JMPLESS
	xor	edi, edi	; edi = high dword of quotient' = 0

	cmp	edx, ebx
	jb	@f		; high dword of dividend < divisor'?

	; high dword of dividend >= divisor':
	; subtract divisor' from high dword of dividend to prevent possible
	; quotient overflow and set most significant bit of quotient"

	sub	edx, ebx	; edx = high dword of dividend - divisor'
				;     = high dword of dividend'
	inc	edi		; edi = high dword of quotient' = 1
@@:
else
	sub	edx, ebx	; edx = high dword of dividend - divisor'
	sbb	edi, edi	; edi = (high dword of dividend < divisor') ? -1 : 0
	and	edi, ebx	; edi = (high dword of dividend < divisor') ? divisor' : 0
	add	edx, edi	; edx = high dword of dividend
				;     - (high dword of dividend < divisor') ? 0 : divisor'
				;     = high dword of dividend'
	neg	edi		; CF = (high dword of dividend < divisor') ? 1 : 0
	sbb	edi, edi	; edi = (high dword of dividend < divisor') ? -1 : 0
	inc	edi		; edi = (high dword of dividend < divisor') ? 0 : 1
				;     = high dword of quotient'
endif ; JMPLESS
	; high dword of dividend' < divisor'

	div	ebx		; eax = dividend' / divisor'
				;     = low dword of quotient',
				; edx = remainder'
	shld	edi, eax, cl	; edi = quotient' / 2**(index + 1)
				;     = dividend / divisor
				;     = quotient"
;;	shl	eax, cl

	mov	eax, [esp+20]	; eax = low dword of divisor
	mul	edi		; edx:eax = low dword of divisor * quotient"

	mov	ecx, [esp+12]
	mov	ebx, [esp+16]	; ebx:ecx = dividend
	sub	ecx, eax
	sbb	ebx, edx	; ebx:ecx = dividend - low dword of divisor * quotient"

	mov	eax, [esp+24]	; eax = high dword of divisor
	imul	eax, edi	; eax = high dword of divisor * quotient"

	sub	ebx, eax	; ebx:ecx = dividend - divisor * quotient"
				;         = remainder"
	sbb	edi, 0		; edi = quotient" - (remainder" < 0)
				;     = quotient
	mov	eax, edi	; eax = (low dword of) quotient
	xor	edx, edx	; edx:eax = quotient

	pop	edi
	pop	ebx
	ret

__udivdi3 endp
	end
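
The "long" alias "schoolbook" path, taken when the high dword of the divisor is 0, chains two divisions: the remainder of the first becomes the high dword of the second dividend. A minimal C sketch of that idea follows; sketch_udiv_long is a hypothetical name, and the assert replaces the division exception of the assembler routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C model; requires 0 < divisor < 2**32. */
static uint64_t sketch_udiv_long(uint64_t dividend, uint32_t divisor,
                                 uint32_t *remainder)
{
    assert(divisor != 0);

    uint32_t high = (uint32_t)(dividend >> 32);
    uint32_t low = (uint32_t)dividend;

    /* first digit: 0:high divided by divisor */
    uint32_t quotient_high = high / divisor;
    uint32_t carry = high % divisor;            /* always < divisor */

    /* second digit: carry:low divided by divisor; since carry < divisor,
       this quotient fits in 32 bits and the DIV cannot overflow */
    uint64_t partial = ((uint64_t)carry << 32) | low;
    uint32_t quotient_low = (uint32_t)(partial / divisor);

    if (remainder != NULL)
        *remainder = (uint32_t)(partial % divisor);
    return ((uint64_t)quotient_high << 32) | quotient_low;
}
```

When the high dword of the dividend is already smaller than the divisor, the first digit is 0 and the listing skips straight to the single "normal" division.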
Microsoft Visual C compiler helper routine _aulldiv() for i386 processors:
; Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

; NOTE: raises "division exception" when divisor is 0!

	.386
	.model	flat, C
	.code
				; [esp+16] = high dword of divisor
				; [esp+12] = low dword of divisor
				; [esp+8] = high dword of dividend
				; [esp+4] = low dword of dividend
_aulldiv proc	public

	mov	edx, [esp+16]	; edx = high dword of divisor
ifdef TRIVIAL
	mov	eax, [esp+8]	; eax = high dword of dividend
	cmp	eax, edx
	jb	trivial		; (high dword of) dividend < (high dword of) divisor?
endif
	bsr	ecx, edx	; ecx = index of leading '1' bit in high dword of divisor
	jnz	extended	; high dword of divisor <> 0?

	; high dword of divisor = 0 (so high dword of remainder will be 0 too)

	mov	ecx, [esp+12]	; ecx = (low dword of) divisor
	mov	eax, [esp+8]	; eax = high dword of dividend
	cmp	eax, ecx
	jae	long		; high dword of dividend >= divisor?

	; perform normal division
normal:
	mov	edx, eax	; edx = high dword of dividend
	mov	eax, [esp+4]	; edx:eax = dividend
	div	ecx		; eax = (low dword of) quotient,
				; edx = (low dword of) remainder
	xor	edx, edx	; edx:eax = quotient

	ret	16		; callee restores stack

	; perform "long" alias "schoolbook" division
long:
;;	xor	edx, edx	; edx:eax = high dword of dividend
	div	ecx		; eax = high dword of quotient,
				; edx = high dword of remainder'
	push	eax		; [esp] = high dword of quotient
	mov	eax, [esp+8]	; eax = low dword of dividend
	div	ecx		; eax = low dword of quotient,
				; edx = (low dword of) remainder
	pop	edx		; edx:eax = quotient

	ret	16		; callee restores stack
ifdef TRIVIAL
	; dividend < divisor
trivial:
	xor	eax, eax
	mov	edx, eax	; edx:eax = quotient = 0

	ret	16		; callee restores stack
endif
	; high dword of divisor <> 0 (so high dword of quotient will be 0):
	; perform "extended & adjusted" division
extended:
	push	ebx
	push	edi

	mov	eax, [esp+20]	; edx:eax = divisor
	not	ecx		; ecx = number of leading '0' bits in (high dword of) divisor
	shld	edx, eax, cl	; edx = divisor / 2**(index + 1)
				;     = divisor'
;;	shl	eax, cl
	mov	ebx, edx	; ebx = divisor'

	mov	edx, [esp+16]	; edx = high dword of dividend
	mov	eax, [esp+12]	; eax = low dword of dividend
ifndef JMPLESS
	xor	edi, edi	; edi = high dword of quotient' = 0

	cmp	edx, ebx
	jb	@f		; high dword of dividend < divisor'?

	; high dword of dividend >= divisor':
	; subtract divisor' from high dword of dividend to prevent possible
	; division overflow and set most significant bit of quotient"

	sub	edx, ebx	; edx = high dword of dividend - divisor'
				;     = high dword of dividend'
	inc	edi		; edi = high dword of quotient' = 1
@@:
else
	sub	edx, ebx	; edx = high dword of dividend - divisor'
	sbb	edi, edi	; edi = (high dword of dividend < divisor') ? -1 : 0
	and	edi, ebx	; edi = (high dword of dividend < divisor') ? divisor' : 0
	add	edx, edi	; edx = high dword of dividend
				;     - (high dword of dividend < divisor') ? 0 : divisor'
				;     = high dword of dividend'
	neg	edi		; CF = (high dword of dividend < divisor') ? 1 : 0
	sbb	edi, edi	; edi = (high dword of dividend < divisor') ? -1 : 0
	inc	edi		; edi = (high dword of dividend < divisor') ? 0 : 1
				;     = high dword of quotient'
endif ; JMPLESS
	; high dword of dividend' < divisor'

	div	ebx		; eax = dividend' / divisor'
				;     = low dword of quotient',
				; edx = remainder'
	shld	edi, eax, cl	; edi = quotient' / 2**(index + 1)
				;     = dividend / divisor
				;     = quotient"
;;	shl	eax, cl

	mov	eax, [esp+20]	; eax = low dword of divisor
	mul	edi		; edx:eax = low dword of divisor * quotient"

	mov	ecx, [esp+12]
	mov	ebx, [esp+16]	; ebx:ecx = dividend
	sub	ecx, eax
	sbb	ebx, edx	; ebx:ecx = dividend - low dword of divisor * quotient"

	mov	eax, [esp+24]	; eax = high dword of divisor
	imul	eax, edi	; eax = high dword of divisor * quotient"

	sub	ebx, eax	; ebx:ecx = dividend - divisor * quotient"
				;         = remainder"
	sbb	edi, 0		; edi = quotient" - (remainder" < 0)
				;     = quotient
	mov	eax, edi	; eax = (low dword of) quotient
	xor	edx, edx	; edx:eax = quotient

	pop	edi
	pop	ebx
	ret	16		; callee restores stack

_aulldiv endp
	end

64÷64-bit Unsigned Integer Division (64-bit Remainder)

__umoddi3() function for i386 processors:
# Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

# NOTE: raises "division exception" when divisor is 0!

	.arch	i386
	.code32
	.intel_syntax noprefix
	.global	___umoddi3
	.type	___umoddi3, @function
	.text
				# [esp+16] = high dword of divisor
				# [esp+12] = low dword of divisor
				# [esp+8] = high dword of dividend
				# [esp+4] = low dword of dividend
___umoddi3:
	mov	edx, [esp+16]	# edx = high dword of divisor
.ifdef TRIVIAL
	mov	eax, [esp+8]	# eax = high dword of dividend
	cmp	eax, edx
	jb	.trivial	# (high dword of) dividend < (high dword of) divisor?
.endif
	bsr	ecx, edx	# ecx = index of leading '1' bit in high dword of divisor
	jnz	.extended	# high dword of divisor <> 0?

	# high dword of divisor = 0 (so high dword of remainder will be 0 too)

	mov	ecx, [esp+12]	# ecx = (low dword of) divisor
	mov	eax, [esp+8]	# eax = high dword of dividend
	cmp	eax, ecx
	jae	.long		# high dword of dividend >= divisor?

	# perform normal division
.normal:
	mov	edx, eax	# edx = high dword of dividend
	mov	eax, [esp+4]	# edx:eax = dividend
	div	ecx		# eax = (low dword of) quotient,
				# edx = (low dword of) remainder
	mov	eax, edx	# eax = (low dword of) remainder
	xor	edx, edx	# edx:eax = remainder
	ret

	# perform "long" alias "schoolbook" division
.long:
#	xor	edx, edx	# edx:eax = high dword of dividend
	div	ecx		# eax = high dword of quotient,
				# edx = high dword of remainder'
	mov	eax, [esp+4]	# eax = low dword of dividend
	div	ecx		# eax = low dword of quotient,
				# edx = (low dword of) remainder
	mov	eax, edx	# eax = (low dword of) remainder
	xor	edx, edx	# edx:eax = remainder
	ret
.ifdef TRIVIAL
	# dividend < divisor
.trivial:
	mov	edx, eax
	mov	eax, [esp+4]	# edx:eax = remainder = dividend
	ret
.endif
	# high dword of divisor <> 0 (so high dword of quotient will be 0):
	# perform "extended & adjusted" division
.extended:
	push	ebx
	push	edi

	mov	eax, [esp+20]	# edx:eax = divisor
	not	ecx		# ecx = number of leading '0' bits in (high dword of) divisor
	shld	edx, eax, cl	# edx = divisor / 2**(index + 1)
				#     = divisor'
#	shl	eax, cl
	mov	ebx, edx	# ebx = divisor'

	mov	edx, [esp+16]	# edx = high dword of dividend
	mov	eax, [esp+12]	# eax = low dword of dividend
.ifnotdef JMPLESS
	xor	edi, edi	# edi = high dword of quotient' = 0

	cmp	edx, ebx
	jb	0f		# high dword of dividend < divisor'?

	# high dword of dividend >= divisor':
	# subtract divisor' from high dword of dividend to prevent possible
	# quotient overflow and set most significant bit of quotient"

	sub	edx, ebx	# edx = high dword of dividend - divisor'
				#     = high dword of dividend'
	inc	edi		# edi = high dword of quotient' = 1
0:
.else
	sub	edx, ebx	# edx = high dword of dividend - divisor'
	sbb	edi, edi	# edi = (high dword of dividend < divisor') ? -1 : 0
	and	edi, ebx	# edi = (high dword of dividend < divisor') ? divisor' : 0
	add	edx, edi	# edx = high dword of dividend
				#     - (high dword of dividend < divisor') ? 0 : divisor'
				#     = high dword of dividend'
	neg	edi		# CF = (high dword of dividend < divisor') ? 1 : 0
	sbb	edi, edi	# edi = (high dword of dividend < divisor') ? -1 : 0
	inc	edi		# edi = (high dword of dividend < divisor') ? 0 : 1
				#     = high dword of quotient'
.endif # JMPLESS
	# high dword of dividend' < divisor'

	div	ebx		# eax = dividend' / divisor'
				#     = low dword of quotient',
				# edx = remainder'
	shld	edi, eax, cl	# edi = quotient' / 2**(index + 1)
				#     = dividend / divisor
				#     = quotient"
#	shl	eax, cl

	mov	eax, [esp+20]	# eax = low dword of divisor
	mul	edi		# edx:eax = low dword of divisor * quotient"

	mov	ecx, [esp+12]
	mov	ebx, [esp+16]	# ebx:ecx = dividend
	sub	ecx, eax
	sbb	ebx, edx	# ebx:ecx = dividend - low dword of divisor * quotient"

	mov	eax, [esp+24]	# eax = high dword of divisor
	imul	eax, edi	# eax = high dword of divisor * quotient"

	sub	ebx, eax	# ebx:ecx = dividend - divisor * quotient"
				#         = remainder"
.ifnotdef JMPLESS
	jnb	1f		# remainder" >= 0?
				# with borrow, it is off by divisor
				#  (and quotient" is off by 1)
	add	ecx, [esp+20]
	adc	ebx, [esp+24]	# ebx:ecx = remainder" + divisor
				#         = dividend - divisor * (quotient" - 1)
				#         = dividend - divisor * quotient
				#         = remainder
1:
	mov	eax, ecx
	mov	edx, ebx	# edx:eax = remainder
.else
	sbb	eax, eax	# eax = (remainder" < 0) ? -1 : 0
	cdq			# edx = (remainder" < 0) ? -1 : 0
	and	eax, [esp+20]
	and	edx, [esp+24]	# edx:eax = (remainder" < 0) ? divisor : 0
	add	eax, ecx
	adc	edx, ebx	# edx:eax = remainder" + ((remainder" < 0) ? divisor : 0)
				#         = dividend - divisor * (quotient" - (remainder" < 0))
				#         = dividend - divisor * quotient
				#         = remainder
.endif # JMPLESS
	pop	edi
	pop	ebx
	ret

	.end
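
The "extended & adjusted" path above can be modelled in C as a rough sketch. The function name umod64_extended() and its local variable names are illustrative assumptions and do not appear in the listings; the sketch only mirrors the commented steps (normalise the divisor, estimate quotient", multiply back, add the divisor again on borrow) and is not meant as a replacement for the assembly.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C model of the "extended & adjusted" path;
   it requires that the high dword of the divisor is not 0. */
static uint64_t umod64_extended(uint64_t dividend, uint64_t divisor)
{
    uint32_t high = (uint32_t)(divisor >> 32);
    uint32_t low  = (uint32_t)divisor;
    assert(high != 0);

    /* index of leading '1' bit in high dword of divisor (BSR) */
    unsigned index = 31;
    while ((high >> index) == 0)
        index--;

    /* divisor' = divisor / 2**(index + 1), the 32 most significant bits */
    uint32_t divisor1 = (uint32_t)(divisor >> (index + 1));

    /* quotient" = (dividend / divisor') / 2**(index + 1);
       it equals the true quotient or exceeds it by 1 */
    uint32_t quotient2 = (uint32_t)((dividend / divisor1) >> (index + 1));

    /* dividend - low dword of divisor * quotient" (MUL, SUB, SBB) */
    uint64_t partial = dividend - (uint64_t)low * quotient2;

    /* high dword of divisor * quotient" (IMUL), then SUB on the high dword */
    uint32_t high_product = high * quotient2;
    int borrow = (uint32_t)(partial >> 32) < high_product;
    uint64_t remainder2 = partial - ((uint64_t)high_product << 32);

    /* with borrow, remainder" is off by divisor */
    if (borrow)
        remainder2 += divisor;
    return remainder2;
}
```

Comparing the result with the compiler's own `%` operator for divisors of at least 2**32 is a cheap plausibility check of the adjustment step.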

Microsoft Visual C compiler helper routine _aullrem() for i386 processors:
; Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

; NOTE: raises "division exception" when divisor is 0!

	.386
	.model	flat, C
	.code
				; [esp+16] = high dword of divisor
				; [esp+12] = low dword of divisor
				; [esp+8] = high dword of dividend
				; [esp+4] = low dword of dividend
_aullrem proc	public

	mov	edx, [esp+16]	; edx = high dword of divisor
ifdef TRIVIAL
	mov	eax, [esp+8]	; eax = high dword of dividend
	cmp	eax, edx
	jb	trivial		; (high dword of) dividend < (high dword of) divisor?
endif
	bsr	ecx, edx	; ecx = index of leading '1' bit in high dword of divisor
	jnz	extended	; high dword of divisor <> 0?

	; high dword of divisor = 0 (so high dword of remainder will be 0 too)

	mov	ecx, [esp+12]	; ecx = (low dword of) divisor
	mov	eax, [esp+8]	; eax = high dword of dividend
	cmp	eax, ecx
	jae	long		; high dword of dividend >= divisor?

	; perform normal division
normal:
	mov	edx, eax	; edx = high dword of dividend
	mov	eax, [esp+4]	; edx:eax = dividend
	div	ecx		; eax = (low dword of) quotient,
				; edx = (low dword of) remainder
	mov	eax, edx	; eax = (low dword of) remainder
	xor	edx, edx	; edx:eax = remainder

	ret	16		; callee restores stack

	; perform "long" alias "schoolbook" division
long:
;;	xor	edx, edx	; edx:eax = high dword of dividend
	div	ecx		; eax = high dword of quotient,
				; edx = high dword of remainder'
	mov	eax, [esp+4]	; eax = low dword of dividend
	div	ecx		; eax = low dword of quotient,
				; edx = (low dword of) remainder
	mov	eax, edx	; eax = (low dword of) remainder
	xor	edx, edx	; edx:eax = remainder

	ret	16		; callee restores stack
ifdef TRIVIAL
	; dividend < divisor
trivial:
	mov	edx, eax
	mov	eax, [esp+4]	; edx:eax = remainder = dividend

	ret	16		; callee restores stack
endif
	; high dword of divisor <> 0 (so high dword of quotient will be 0):
	; perform "extended & adjusted" division
extended:
	push	ebx
	push	edi

	mov	eax, [esp+20]	; edx:eax = divisor
	not	ecx		; ecx = number of leading '0' bits in (high dword of) divisor
	shld	edx, eax, cl	; edx = divisor / 2**(index + 1)
				;     = divisor'
;;	shl	eax, cl
	mov	ebx, edx	; ebx = divisor'

	mov	edx, [esp+16]	; edx = high dword of dividend
	mov	eax, [esp+12]	; eax = low dword of dividend
ifndef JMPLESS
	xor	edi, edi	; edi = high dword of quotient' = 0

	cmp	edx, ebx
	jb	@f		; high dword of dividend < divisor'?

	; high dword of dividend >= divisor':
	; subtract divisor' from high dword of dividend to prevent possible
	; division overflow and set most significant bit of quotient"

	sub	edx, ebx	; edx = high dword of dividend - divisor'
				;     = high dword of dividend'
	inc	edi		; edi = high dword of quotient' = 1
@@:
else
	sub	edx, ebx	; edx = high dword of dividend - divisor'
	sbb	edi, edi	; edi = (high dword of dividend < divisor') ? -1 : 0
	and	edi, ebx	; edi = (high dword of dividend < divisor') ? divisor' : 0
	add	edx, edi	; edx = high dword of dividend
				;     - (high dword of dividend < divisor') ? 0 : divisor'
				;     = high dword of dividend'
	neg	edi		; CF = (high dword of dividend < divisor') ? 1 : 0
	sbb	edi, edi	; edi = (high dword of dividend < divisor') ? -1 : 0
	inc	edi		; edi = (high dword of dividend < divisor') ? 0 : 1
				;     = high dword of quotient'
endif ; JMPLESS
	; high dword of dividend' < divisor'

	div	ebx		; eax = dividend' / divisor'
				;     = low dword of quotient',
				; edx = remainder'
	shld	edi, eax, cl	; edi = quotient' / 2**(index + 1)
				;     = dividend / divisor
				;     = quotient"
;;	shl	eax, cl

	mov	eax, [esp+20]	; eax = low dword of divisor
	mul	edi		; edx:eax = low dword of divisor * quotient"

	mov	ecx, [esp+12]
	mov	ebx, [esp+16]	; ebx:ecx = dividend
	sub	ecx, eax
	sbb	ebx, edx	; ebx:ecx = dividend - low dword of divisor * quotient"

	mov	eax, [esp+24]	; eax = high dword of divisor
	imul	eax, edi	; eax = high dword of divisor * quotient"

	sub	ebx, eax	; ebx:ecx = dividend - divisor * quotient"
				;         = remainder"
ifndef JMPLESS
	jnb	@f		; remainder" >= 0?
				; with borrow, it is off by divisor
				;  (and quotient" is off by 1)
	add	ecx, [esp+20]
	adc	ebx, [esp+24]	; ebx:ecx = remainder" + divisor
				;         = dividend - divisor * (quotient" - 1)
				;         = dividend - divisor * quotient
				;         = remainder
@@:
	mov	eax, ecx
	mov	edx, ebx	; edx:eax = remainder
else
	sbb	eax, eax	; eax = (remainder" < 0) ? -1 : 0
	cdq			; edx = (remainder" < 0) ? -1 : 0
	and	eax, [esp+20]
	and	edx, [esp+24]	; edx:eax = (remainder" < 0) ? divisor : 0
	add	eax, ecx
	adc	edx, ebx	; edx:eax = remainder" + ((remainder" < 0) ? divisor : 0)
				;         = dividend - divisor * (quotient" - (remainder" < 0))
				;         = dividend - divisor * quotient
				;         = remainder
endif ; JMPLESS
	pop	edi
	pop	ebx
	ret	16		; callee restores stack

_aullrem endp
	end
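
The "long" alias "schoolbook" path used when the high dword of the divisor is 0 can be sketched in C as follows. The struct and function names are hypothetical and chosen for illustration only. The sketch shows why the second division cannot overflow: the remainder of the first division is always smaller than the divisor.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: 64-bit quotient, 32-bit remainder */
struct udivrem32 {
    uint64_t quotient;
    uint32_t remainder;
};

/* Divide a 64-bit dividend by a 32-bit divisor with two chained
   divisions, each yielding a quotient that fits in 32 bits. */
static struct udivrem32 udiv64by32_long(uint64_t dividend, uint32_t divisor)
{
    uint32_t high = (uint32_t)(dividend >> 32);
    uint32_t low  = (uint32_t)dividend;

    /* first DIV: 0:high dword of dividend / divisor */
    uint32_t quotient_high  = high / divisor;
    uint32_t remainder_high = high % divisor;   /* remainder' < divisor */

    /* second DIV: remainder':low dword of dividend / divisor;
       remainder' < divisor guarantees a 32-bit quotient */
    uint64_t partial = ((uint64_t)remainder_high << 32) | low;
    uint32_t quotient_low = (uint32_t)(partial / divisor);

    struct udivrem32 result;
    result.quotient  = ((uint64_t)quotient_high << 32) | quotient_low;
    result.remainder = (uint32_t)(partial % divisor);
    return result;
}
```

When the high dword of the dividend is smaller than the divisor, the first division yields 0, which is why the listing skips it on the "normal" path.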

64÷64-bit Signed Integer Division (64-bit Quotient and Remainder)

Microsoft Visual C compiler helper routine _alldvrm() for i386 processors:
; Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

; NOTE: returns quotient in EDX:EAX and remainder in EBX:ECX

; NOTE: raises "division exception" when divisor is 0!

	.386
	.model	flat, C
	.code
				; [esp+16] = high dword of divisor
				; [esp+12] = low dword of divisor
				; [esp+8] = high dword of dividend
				; [esp+4] = low dword of dividend
_alldvrm proc	public

	push	edi

	; determine sign of dividend and compute |dividend|

	mov	edx, [esp+12]	; edx = high dword of dividend
	mov	eax, [esp+8]	; eax = low dword of dividend

	mov	edi, edx
	sar	edi, 31		; edi = (dividend < 0) ? -1 : 0
				;     = sign of dividend
				;     = sign of remainder
	xor	eax, edi
	xor	edx, edi	; edx:eax = (dividend < 0) ? ~dividend : dividend
	sub	eax, edi
	sbb	edx, edi	; edx:eax = (dividend < 0) ? -dividend : dividend
				;         = |dividend|

	mov	[esp+8], eax	; write |dividend| back on stack
	mov	[esp+12], edx

	; determine sign of divisor and compute |divisor|

	mov	edx, [esp+20]	; edx = high dword of divisor
	mov	eax, [esp+16]	; eax = low dword of divisor

	mov	ecx, edx
	sar	ecx, 31		; ecx = (divisor < 0) ? -1 : 0
	xor	eax, ecx
	xor	edx, ecx	; edx:eax = (divisor < 0) ? ~divisor : divisor
	sub	eax, ecx
	sbb	edx, ecx	; edx:eax = (divisor < 0) ? -divisor : divisor
				;         = |divisor|

	mov	[esp+16], eax	; write |divisor| back on stack
	mov	[esp+20], edx

	xor	ecx, edi	; ecx = sign of divisor ^ sign of dividend
				;     = sign of quotient
	push	ecx		; save sign of quotient on stack
ifdef TRIVIAL
	mov	ecx, [esp+16]	; ecx = high dword of dividend
	cmp	ecx, edx
	jb	trivial		; (high dword of) dividend < (high dword of) divisor?
endif
	bsr	ecx, edx	; ecx = index of leading '1' bit in high dword of divisor
	jnz	extended	; high dword of divisor <> 0?

	; high dword of divisor = 0 (so high dword of remainder will be 0 too)

	mov	ecx, eax	; ecx = (low dword of) divisor
	mov	eax, [esp+16]	; eax = high dword of dividend
	cmp	eax, ecx
	jae	long		; high dword of dividend >= divisor?

	; perform normal division
normal:
	mov	edx, eax	; edx = high dword of dividend
	xor	ebx, ebx	; ebx = high dword of quotient = 0
	jmp	next

	; perform "long" alias "schoolbook" division
long:
;;	xor	edx, edx	; edx:eax = high dword of dividend
	div	ecx		; eax = high dword of quotient,
				; edx = high dword of remainder'
	mov	ebx, eax	; ebx = high dword of quotient
next:
	mov	eax, [esp+12]	; eax = low dword of dividend
	div	ecx		; eax = low dword of quotient,
				; edx = (low dword of) remainder

	mov	ecx, edx	; ecx = (low dword of) |remainder|
	mov	edx, ebx	; edx:eax = |quotient|
;;	xor	ebx, ebx	; ebx:ecx = |remainder|

	pop	ebx		; ebx = sign of quotient
	xor	eax, ebx
	xor	edx, ebx
	sub	eax, ebx
	sbb	edx, ebx	; edx:eax = quotient

	mov	ebx, edi	; ebx = sign of remainder
	xor	ecx, ebx
	sub	ecx, ebx
	sbb	ebx, ebx	; ebx:ecx = remainder

	pop	edi
	ret	16		; callee restores stack
ifdef TRIVIAL
	; |dividend| < |divisor|: quotient = 0, remainder = dividend
trivial:
	pop	eax		; discard sign of quotient
	mov	ecx, [esp+8]
	mov	ebx, [esp+12]	; ebx:ecx = |dividend| = |remainder|
	xor	ecx, edi
	xor	ebx, edi
	sub	ecx, edi
	sbb	ebx, edi	; ebx:ecx = remainder
	xor	eax, eax
	xor	edx, edx	; edx:eax = quotient = 0

	pop	edi
	ret	16		; callee restores stack
endif

	; high dword of divisor <> 0 (so high dword of quotient will be 0):
	; perform "extended & adjusted" division
extended:
	push	edi		; save sign of remainder

	not	ecx		; ecx = number of leading '0' bits in (high dword of) divisor
	shld	edx, eax, cl	; edx = divisor / 2**(index + 1)
				;     = divisor'
;;	shl	eax, cl
	mov	ebx, edx	; ebx = divisor'

	mov	edx, [esp+20]	; edx = high dword of dividend
	mov	eax, [esp+16]	; eax = low dword of dividend
ifndef JMPLESS
	xor	edi, edi	; edi = high dword of quotient' = 0

	cmp	edx, ebx
	jb	@f		; high dword of dividend < divisor'?

	; high dword of dividend >= divisor':
	; subtract divisor' from high dword of dividend to prevent possible
	; division overflow and set most significant bit of quotient"

	sub	edx, ebx	; edx = high dword of dividend - divisor'
				;     = high dword of dividend'
	inc	edi		; edi = high dword of quotient' = 1
@@:
else
	sub	edx, ebx	; edx = high dword of dividend - divisor'
	sbb	edi, edi	; edi = (high dword of dividend < divisor') ? -1 : 0
	and	edi, ebx	; edi = (high dword of dividend < divisor') ? divisor' : 0
	add	edx, edi	; edx = high dword of dividend
				;     - (high dword of dividend < divisor') ? 0 : divisor'
				;     = high dword of dividend'
	neg	edi		; CF = (high dword of dividend < divisor') ? 1 : 0
	sbb	edi, edi	; edi = (high dword of dividend < divisor') ? -1 : 0
	inc	edi		; edi = (high dword of dividend < divisor') ? 0 : 1
				;     = high dword of quotient'
endif ; JMPLESS
	; high dword of dividend' < divisor'

	div	ebx		; eax = dividend' / divisor'
				;     = low dword of quotient',
				; edx = remainder'
	shld	edi, eax, cl	; edi = quotient' / 2**(index + 1)
				;     = dividend / divisor
				;     = quotient"
;;	shl	eax, cl

	mov	eax, [esp+24]	; eax = low dword of divisor
	mul	edi		; edx:eax = low dword of divisor * quotient"

	mov	ecx, [esp+16]
	mov	ebx, [esp+20]	; ebx:ecx = dividend
	sub	ecx, eax
	sbb	ebx, edx	; ebx:ecx = dividend - low dword of divisor * quotient"

	mov	eax, [esp+28]	; eax = high dword of divisor
	imul	eax, edi	; eax = high dword of divisor * quotient"

	sub	ebx, eax	; ebx:ecx = dividend - divisor * quotient"
				;         = remainder"
ifndef JMPLESS
	jnb	@f		; remainder" >= 0?
				; with borrow, it is off by divisor,
				;  and quotient" is off by 1
if 0
	sbb	edi, 0		; edi = quotient" - 1
				;     = |quotient|
else
	dec	edi		; edi = quotient" - 1
				;     = |quotient|
endif
	add	ecx, [esp+24]
	adc	ebx, [esp+28]	; ebx:ecx = remainder" + divisor
				;         = dividend - divisor * (quotient" - 1)
				;         = dividend - divisor * quotient
				;         = |remainder|
@@:
else
	sbb	eax, eax	; eax = (remainder" < 0) ? -1 : 0
	cdq			; edx = (remainder" < 0) ? -1 : 0
	add	edi, eax	; edi = quotient" - (remainder" < 0)
				;     = |quotient|
	and	eax, [esp+24]
	and	edx, [esp+28]	; edx:eax = (remainder" < 0) ? divisor : 0
	add	ecx, eax
	adc	ebx, edx	; ebx:ecx = remainder" + ((remainder" < 0) ? divisor : 0)
				;         = dividend - divisor * (quotient" - (remainder" < 0))
				;         = dividend - divisor * quotient
				;         = |remainder|
endif ; JMPLESS
	mov	eax, edi	; eax = (low dword of) |quotient|
;;	xor	edx, edx	; edx:eax = |quotient|

	pop	edi		; edi = sign of remainder

	pop	edx		; edx = sign of quotient
	xor	eax, edx
	sub	eax, edx
	sbb	edx, edx	; edx:eax = quotient

	xor	ecx, edi
	xor	ebx, edi
	sub	ecx, edi
	sbb	ebx, edi	; ebx:ecx = remainder

	pop	edi
	ret	16		; callee restores stack

_alldvrm endp
	end
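
The sign handling of the signed routines can be summarised in a small C sketch. The helper names sign_mask(), negate_if(), sdiv64() and smod64() are assumptions made for illustration. The sketch relies on the conversion from uint64_t to int64_t wrapping around, which is implementation-defined in C but is the behaviour of the usual two's complement compilers. It shows the three rules the listing applies: absolute values are formed with XOR and SUB/SBB against a sign mask, the quotient takes the XOR of both signs, and the remainder takes the sign of the dividend.

```c
#include <assert.h>
#include <stdint.h>

/* mask = (value < 0) ? -1 : 0, as produced by SAR r32, 31 or CDQ */
static uint64_t sign_mask(int64_t value)
{
    return (value < 0) ? UINT64_MAX : 0;
}

/* (value ^ mask) - mask negates value when mask is -1
   and leaves it unchanged when mask is 0 (XOR, SUB, SBB) */
static uint64_t negate_if(uint64_t value, uint64_t mask)
{
    return (value ^ mask) - mask;
}

/* quotient: sign of dividend ^ sign of divisor */
static int64_t sdiv64(int64_t dividend, int64_t divisor)
{
    uint64_t n = negate_if((uint64_t)dividend, sign_mask(dividend));
    uint64_t d = negate_if((uint64_t)divisor, sign_mask(divisor));
    uint64_t mask = sign_mask(dividend) ^ sign_mask(divisor);
    return (int64_t)negate_if(n / d, mask);
}

/* remainder: sign of dividend */
static int64_t smod64(int64_t dividend, int64_t divisor)
{
    uint64_t n = negate_if((uint64_t)dividend, sign_mask(dividend));
    uint64_t d = negate_if((uint64_t)divisor, sign_mask(divisor));
    return (int64_t)negate_if(n % d, sign_mask(dividend));
}
```

Both functions truncate towards zero, like the `/` and `%` operators of C99 and later.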

64÷64-bit Signed Integer Division (64-bit Quotient)

__divdi3() function for i386 processors:
; Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

; NOTE: raises "division exception" when divisor is 0!

	.386
	.model	flat, C
	.code
				; [esp+16] = high dword of divisor
				; [esp+12] = low dword of divisor
				; [esp+8] = high dword of dividend
				; [esp+4] = low dword of dividend
__divdi3 proc	public

	push	ebx

	; determine sign of dividend and compute |dividend|

	mov	edx, [esp+12]	; edx = high dword of dividend
	mov	eax, [esp+8]	; eax = low dword of dividend

	mov	ebx, edx
	sar	ebx, 31		; ebx = (dividend < 0) ? -1 : 0
	xor	eax, ebx
	xor	edx, ebx	; edx:eax = (dividend < 0) ? ~dividend : dividend
	sub	eax, ebx
	sbb	edx, ebx	; edx:eax = (dividend < 0) ? -dividend : dividend
				;         = |dividend|

	mov	[esp+8], eax	; write |dividend| back on stack
	mov	[esp+12], edx

	; determine sign of divisor and compute |divisor|

	mov	edx, [esp+20]	; edx = high dword of divisor
	mov	eax, [esp+16]	; eax = low dword of divisor

	mov	ecx, edx
	sar	ecx, 31		; ecx = (divisor < 0) ? -1 : 0
	xor	eax, ecx
	xor	edx, ecx	; edx:eax = (divisor < 0) ? ~divisor : divisor
	sub	eax, ecx
	sbb	edx, ecx	; edx:eax = (divisor < 0) ? -divisor : divisor
				;         = |divisor|

	mov	[esp+16], eax	; write |divisor| back on stack
	mov	[esp+20], edx

	xor	ecx, ebx	; ecx = sign of dividend ^ sign of divisor
				;     = sign of quotient
	push	ecx		; save sign of quotient on stack
ifdef TRIVIAL
	mov	ecx, [esp+16]	; ecx = high dword of dividend
	cmp	ecx, edx
	jb	trivial		; (high dword of) dividend < (high dword of) divisor?
endif
	bsr	ecx, edx	; ecx = index of leading '1' bit in high dword of divisor
	jnz	extended	; high dword of divisor <> 0?

	; high dword of divisor = 0 (so high dword of remainder will be 0 too)

	mov	ecx, eax	; ecx = (low dword of) divisor
	mov	eax, [esp+16]	; eax = high dword of dividend
	cmp	eax, ecx
	jae	long		; high dword of dividend >= divisor?

	; perform normal division
normal:
	mov	edx, eax	; edx = high dword of dividend
	mov	eax, [esp+12]	; edx:eax = dividend
	div	ecx		; eax = (low dword of) quotient,
				; edx = (low dword of) remainder
;;	xor	edx, edx	; edx:eax = |quotient|

	pop	edx		; edx = sign of quotient
	xor	eax, edx
	sub	eax, edx
	sbb	edx, edx	; edx:eax = quotient

	pop	ebx
	ret

	; perform "long" alias "schoolbook" division
long:
;;	xor	edx, edx	; edx:eax = high dword of dividend
	div	ecx		; eax = high dword of quotient,
				; edx = high dword of remainder'
	mov	ebx, eax	; ebx = high dword of quotient
	mov	eax, [esp+12]	; eax = low dword of dividend
	div	ecx		; eax = low dword of quotient,
				; edx = (low dword of) remainder
	mov	edx, ebx	; edx:eax = |quotient|

	pop	ecx		; ecx = sign of quotient
	xor	eax, ecx
	xor	edx, ecx
	sub	eax, ecx
	sbb	edx, ecx	; edx:eax = quotient

	pop	ebx
	ret
ifdef TRIVIAL
	; |dividend| < |divisor|: quotient = 0
trivial:
	pop	ecx		; discard sign of quotient
	xor	eax, eax
	xor	edx, edx	; edx:eax = quotient = 0

	pop	ebx
	ret
endif

	; high dword of divisor <> 0 (so high dword of quotient will be 0):
	; perform "extended & adjusted" division
extended:
	push	edi

	not	ecx		; ecx = number of leading '0' bits in (high dword of) divisor
	shld	edx, eax, cl	; edx = divisor / 2**(index + 1)
				;     = divisor'
;;	shl	eax, cl
	mov	ebx, edx	; ebx = divisor'

	mov	edx, [esp+20]	; edx = high dword of dividend
	mov	eax, [esp+16]	; eax = low dword of dividend
ifndef JMPLESS
	xor	edi, edi	; edi = high dword of quotient' = 0

	cmp	edx, ebx
	jb	@f		; high dword of dividend < divisor'?

	; high dword of dividend >= divisor':
	; subtract divisor' from high dword of dividend to prevent possible
	; quotient overflow and set most significant bit of quotient"

	sub	edx, ebx	; edx = high dword of dividend - divisor'
				;     = high dword of dividend'
	inc	edi		; edi = high dword of quotient' = 1
@@:
else
	sub	edx, ebx	; edx = high dword of dividend - divisor'
	sbb	edi, edi	; edi = (high dword of dividend < divisor') ? -1 : 0
	and	edi, ebx	; edi = (high dword of dividend < divisor') ? divisor' : 0
	add	edx, edi	; edx = high dword of dividend
				;     - (high dword of dividend < divisor') ? 0 : divisor'
				;     = high dword of dividend'
	neg	edi		; CF = (high dword of dividend < divisor') ? 1 : 0
	sbb	edi, edi	; edi = (high dword of dividend < divisor') ? -1 : 0
	inc	edi		; edi = (high dword of dividend < divisor') ? 0 : 1
				;     = high dword of quotient'
endif ; JMPLESS
	; high dword of dividend' < divisor'

	div	ebx		; eax = dividend' / divisor'
				;     = low dword of quotient',
				; edx = remainder'
	shld	edi, eax, cl	; edi = quotient' / 2**(index + 1)
				;     = dividend / divisor
				;     = quotient"
;;	shl	eax, cl

	mov	eax, [esp+24]	; eax = low dword of divisor
	mul	edi		; edx:eax = low dword of divisor * quotient"

	mov	ecx, [esp+16]
	mov	ebx, [esp+20]	; ebx:ecx = dividend
	sub	ecx, eax
	sbb	ebx, edx	; ebx:ecx = dividend - low dword of divisor * quotient"

	mov	eax, [esp+28]	; eax = high dword of divisor
	imul	eax, edi	; eax = high dword of divisor * quotient"

	sub	ebx, eax	; ebx:ecx = dividend - divisor * quotient"
				;         = remainder"
	sbb	edi, 0		; edi = quotient" - (remainder" < 0)
				;     = |quotient|
	mov	eax, edi	; eax = (low dword of) |quotient|
;;	xor	edx, edx	; edx:eax = |quotient|

	pop	edi

	pop	edx		; edx = sign of quotient
	xor	eax, edx
	sub	eax, edx
	sbb	edx, edx	; edx:eax = quotient

	pop	ebx
	ret

__divdi3 endp
	end

Microsoft Visual C compiler helper routine _alldiv() for i386 processors:
; Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

; NOTE: raises "division exception" when divisor is 0!

	.386
	.model	flat, C
	.code
				; [esp+16] = high dword of divisor
				; [esp+12] = low dword of divisor
				; [esp+8] = high dword of dividend
				; [esp+4] = low dword of dividend
_alldiv	proc	public

	push	ebx

	; determine sign of dividend and compute |dividend|

	mov	edx, [esp+12]	; edx = high dword of dividend
	mov	eax, [esp+8]	; eax = low dword of dividend

	mov	ebx, edx
	sar	ebx, 31		; ebx = (dividend < 0) ? -1 : 0
	xor	eax, ebx
	xor	edx, ebx	; edx:eax = (dividend < 0) ? ~dividend : dividend
	sub	eax, ebx
	sbb	edx, ebx	; edx:eax = (dividend < 0) ? -dividend : dividend
				;         = |dividend|

	mov	[esp+8], eax	; write |dividend| back on stack
	mov	[esp+12], edx

	; determine sign of divisor and compute |divisor|

	mov	edx, [esp+20]	; edx = high dword of divisor
	mov	eax, [esp+16]	; eax = low dword of divisor

	mov	ecx, edx
	sar	ecx, 31		; ecx = (divisor < 0) ? -1 : 0
	xor	eax, ecx
	xor	edx, ecx	; edx:eax = (divisor < 0) ? ~divisor : divisor
	sub	eax, ecx
	sbb	edx, ecx	; edx:eax = (divisor < 0) ? -divisor : divisor
				;         = |divisor|

	mov	[esp+16], eax	; write |divisor| back on stack
	mov	[esp+20], edx

	xor	ecx, ebx	; ecx = sign of dividend ^ sign of divisor
				;     = sign of quotient
	push	ecx		; save sign of quotient on stack
ifdef TRIVIAL
	mov	ecx, [esp+16]	; ecx = high dword of dividend
	cmp	ecx, edx
	jb	trivial		; (high dword of) dividend < (high dword of) divisor?
endif
	bsr	ecx, edx	; ecx = index of leading '1' bit in high dword of divisor
	jnz	extended	; high dword of divisor <> 0?

	; high dword of divisor = 0 (so high dword of remainder will be 0 too)

	mov	ecx, eax	; ecx = (low dword of) divisor
	mov	eax, [esp+16]	; eax = high dword of dividend
	cmp	eax, ecx
	jae	long		; high dword of dividend >= divisor?

	; perform normal division
normal:
	mov	edx, eax	; edx = high dword of dividend
	mov	eax, [esp+12]	; edx:eax = dividend
	div	ecx		; eax = (low dword of) quotient,
				; edx = (low dword of) remainder
;;	xor	edx, edx	; edx:eax = |quotient|

	pop	edx		; edx = sign of quotient
	xor	eax, edx
	sub	eax, edx
	sbb	edx, edx	; edx:eax = quotient

	pop	ebx
	ret	16		; callee restores stack

	; perform "long" alias "schoolbook" division
long:
;;	xor	edx, edx	; edx:eax = high dword of dividend
	div	ecx		; eax = high dword of quotient,
				; edx = high dword of remainder'
	mov	ebx, eax	; ebx = high dword of quotient
	mov	eax, [esp+12]	; eax = low dword of dividend
	div	ecx		; eax = low dword of quotient,
				; edx = (low dword of) remainder
	mov	edx, ebx	; edx:eax = |quotient|

	pop	ecx		; ecx = sign of quotient
	xor	eax, ecx
	xor	edx, ecx
	sub	eax, ecx
	sbb	edx, ecx	; edx:eax = quotient

	pop	ebx
	ret	16		; callee restores stack
ifdef TRIVIAL
	; |dividend| < |divisor|: quotient = 0
trivial:
	pop	ecx		; discard sign of quotient
	xor	eax, eax
	xor	edx, edx	; edx:eax = quotient = 0

	pop	ebx
	ret	16		; callee restores stack
endif

	; high dword of divisor <> 0 (so high dword of quotient will be 0):
	; perform "extended & adjusted" division
extended:
	push	edi

	not	ecx		; ecx = number of leading '0' bits in (high dword of) divisor
	shld	edx, eax, cl	; edx = divisor / 2**(index + 1)
				;     = divisor'
;;	shl	eax, cl
	mov	ebx, edx	; ebx = divisor'

	mov	edx, [esp+20]	; edx = high dword of dividend
	mov	eax, [esp+16]	; eax = low dword of dividend
ifndef JMPLESS
	xor	edi, edi	; edi = high dword of quotient' = 0

	cmp	edx, ebx
	jb	@f		; high dword of dividend < divisor'?

	; high dword of dividend >= divisor':
	; subtract divisor' from high dword of dividend to prevent possible
	; division overflow and set most significant bit of quotient"

	sub	edx, ebx	; edx = high dword of dividend - divisor'
				;     = high dword of dividend'
	inc	edi		; edi = high dword of quotient' = 1
@@:
else
	sub	edx, ebx	; edx = high dword of dividend - divisor'
	sbb	edi, edi	; edi = (high dword of dividend < divisor') ? -1 : 0
	and	edi, ebx	; edi = (high dword of dividend < divisor') ? divisor' : 0
	add	edx, edi	; edx = high dword of dividend
				;     - (high dword of dividend < divisor') ? 0 : divisor'
				;     = high dword of dividend'
	neg	edi		; CF = (high dword of dividend < divisor') ? 1 : 0
	sbb	edi, edi	; edi = (high dword of dividend < divisor') ? -1 : 0
	inc	edi		; edi = (high dword of dividend < divisor') ? 0 : 1
				;     = high dword of quotient'
endif ; JMPLESS
	; high dword of dividend' < divisor'

	div	ebx		; eax = dividend' / divisor'
				;     = low dword of quotient',
				; edx = remainder'
	shld	edi, eax, cl	; edi = quotient' / 2**(index + 1)
				;     = dividend / divisor
				;     = quotient"
;;	shl	eax, cl

	mov	eax, [esp+24]	; eax = low dword of divisor
	mul	edi		; edx:eax = low dword of divisor * quotient"

	mov	ecx, [esp+16]
	mov	ebx, [esp+20]	; ebx:ecx = dividend
	sub	ecx, eax
	sbb	ebx, edx	; ebx:ecx = dividend - low dword of divisor * quotient"

	mov	eax, [esp+28]	; eax = high dword of divisor
	imul	eax, edi	; eax = high dword of divisor * quotient"

	sub	ebx, eax	; ebx:ecx = dividend - divisor * quotient"
				;         = remainder"
	sbb	edi, 0		; edi = quotient" - (remainder" < 0)
				;     = |quotient|
	mov	eax, edi	; eax = (low dword of) |quotient|
;;	xor	edx, edx	; edx:eax = |quotient|

	pop	edi

	pop	edx		; edx = sign of quotient
	xor	eax, edx
	sub	eax, edx
	sbb	edx, edx	; edx:eax = quotient

	pop	ebx
	ret	16		; callee restores stack

_alldiv	endp
	end
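
The JMPLESS variant in the listings replaces the compare and conditional jump before the division by a mask computed from the borrow. A minimal C sketch of that idea follows; the struct cond_sub and the function subtract_if_not_below() are hypothetical names, and a C compiler is free to translate the comparison with or without a branch, so the sketch illustrates the arithmetic only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: adjusted value and the quotient bit */
struct cond_sub {
    uint32_t value;   /* high dword of dividend' */
    uint32_t bit;     /* high dword of quotient': 0 or 1 */
};

/* Subtract limit from value unless value < limit, without an if */
static struct cond_sub subtract_if_not_below(uint32_t value, uint32_t limit)
{
    struct cond_sub result;
    uint32_t difference = value - limit;             /* SUB: wraps around on borrow */
    uint32_t mask = 0u - (uint32_t)(value < limit);  /* SBB r, r: borrow ? -1 : 0 */
    result.value = difference + (mask & limit);      /* AND, ADD: undo SUB on borrow */
    result.bit = mask + 1u;                          /* NEG, SBB, INC: borrow ? 0 : 1 */
    return result;
}
```

For example, subtract_if_not_below(5, 3) yields value 2 and bit 1, while subtract_if_not_below(3, 5) leaves value 3 and yields bit 0.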

64÷64-bit Signed Integer Division (64-bit Remainder)

__moddi3() function for i386 processors:
# Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

# NOTE: raises "division exception" when divisor is 0!

	.arch	i386
	.code32
	.intel_syntax noprefix
	.global	___moddi3
	.type	___moddi3, @function
	.text
				# [esp+16] = high dword of divisor
				# [esp+12] = low dword of divisor
				# [esp+8] = high dword of dividend
				# [esp+4] = low dword of dividend
___moddi3:
	# determine sign of dividend and compute |dividend|

	mov	eax, [esp+8]	# eax = high dword of dividend
	mov	ecx, [esp+4]	# ecx = low dword of dividend

	cdq			# edx = (dividend < 0) ? -1 : 0
	xor	ecx, edx
	xor	eax, edx	# eax:ecx = (dividend < 0) ? ~dividend : dividend
	sub	ecx, edx
	sbb	eax, edx	# eax:ecx = (dividend < 0) ? -dividend : dividend
				#         = |dividend|

	mov	[esp+4], ecx	# write |dividend| back on stack
	mov	[esp+8], eax

	push	edx		# save sign of dividend on stack

	# determine sign of divisor and compute |divisor|

	mov	edx, [esp+20]	# edx = high dword of divisor
	mov	eax, [esp+16]	# eax = low dword of divisor

	mov	ecx, edx
	sar	ecx, 31		# ecx = (divisor < 0) ? -1 : 0
	xor	eax, ecx
	xor	edx, ecx	# edx:eax = (divisor < 0) ? ~divisor : divisor
	sub	eax, ecx
	sbb	edx, ecx	# edx:eax = (divisor < 0) ? -divisor : divisor
				#         = |divisor|

	mov	[esp+16], eax	# write |divisor| back on stack
	mov	[esp+20], edx
.ifdef TRIVIAL
	mov	ecx, [esp+12]	# ecx = high dword of dividend
	cmp	ecx, edx
	jb	.trivial	# (high dword of) dividend < (high dword of) divisor?
.endif
	bsr	ecx, edx	# ecx = index of leading '1' bit in high dword of divisor
	jnz	.extended	# high dword of divisor <> 0?

	# high dword of divisor = 0 (so high dword of remainder will be 0 too)

	mov	ecx, eax	# ecx = (low dword of) divisor
	mov	eax, [esp+12]	# eax = high dword of dividend
	cmp	eax, ecx
	jae	.long		# high dword of dividend >= divisor?

	# perform normal division
.normal:
	mov	edx, eax	# edx = high dword of dividend
	jmp	.next

	# perform "long" alias "schoolbook" division
.long:
#	xor	edx, edx	# edx:eax = high dword of dividend
	div	ecx		# eax = high dword of quotient,
				# edx = high dword of remainder'
.next:
	mov	eax, [esp+8]	# eax = low dword of dividend
	div	ecx		# eax = low dword of quotient,
				# edx = (low dword of) remainder
	mov	eax, edx	# eax = (low dword of) |remainder|
#	xor	edx, edx	# edx:eax = |remainder|

	pop	edx		# edx = sign of remainder
	xor	eax, edx
	sub	eax, edx
	sbb	edx, edx	# edx:eax = remainder
	ret
.ifdef TRIVIAL
	# |dividend| < |divisor|: remainder = dividend
.trivial:
	mov	eax, [esp+8]
	mov	edx, [esp+12]	# edx:eax = |dividend| = |remainder|

	pop	ecx		# ecx = sign of remainder
	xor	eax, ecx
	xor	edx, ecx
	sub	eax, ecx
	sbb	edx, ecx	# edx:eax = remainder
	ret
.endif

	# high dword of divisor <> 0 (so high dword of quotient will be 0):
	# perform "extended & adjusted" division
.extended:
	push	ebx
	push	edi

	not	ecx		# ecx = number of leading '0' bits in (high dword of) divisor
	shld	edx, eax, cl	# edx = divisor / 2**(index + 1)
				#     = divisor'
#	shl	eax, cl
	mov	ebx, edx	# ebx = divisor'

	mov	edx, [esp+20]	# edx = high dword of dividend
	mov	eax, [esp+16]	# eax = low dword of dividend
.ifnotdef JMPLESS
	xor	edi, edi	# edi = high dword of quotient' = 0

	cmp	edx, ebx
	jb	0f		# high dword of dividend < divisor'?

	# high dword of dividend >= divisor':
	# subtract divisor' from high dword of dividend to prevent possible
	# quotient overflow and set most significant bit of quotient"

	sub	edx, ebx	# edx = high dword of dividend - divisor'
				#     = high dword of dividend'
	inc	edi		# edi = high dword of quotient' = 1
0:
.else
	sub	edx, ebx	# edx = high dword of dividend - divisor'
	sbb	edi, edi	# edi = (high dword of dividend < divisor') ? -1 : 0
	and	edi, ebx	# edi = (high dword of dividend < divisor') ? divisor' : 0
	add	edx, edi	# edx = high dword of dividend
				#     - (high dword of dividend < divisor') ? 0 : divisor'
				#     = high dword of dividend'
	neg	edi		# CF = (high dword of dividend < divisor') ? 1 : 0
	sbb	edi, edi	# edi = (high dword of dividend < divisor') ? -1 : 0
	inc	edi		# edi = (high dword of dividend < divisor') ? 0 : 1
				#     = high dword of quotient'
.endif # JMPLESS
	# high dword of dividend' < divisor'

	div	ebx		# eax = dividend' / divisor'
				#     = low dword of quotient',
				# edx = remainder'
	shld	edi, eax, cl	# edi = quotient' / 2**(index + 1)
				#     = dividend / divisor
				#     = quotient"
#	shl	eax, cl

	mov	eax, [esp+24]	# eax = low dword of divisor
	mul	edi		# edx:eax = low dword of divisor * quotient"

	mov	ecx, [esp+16]
	mov	ebx, [esp+20]	# ebx:ecx = dividend
	sub	ecx, eax
	sbb	ebx, edx	# ebx:ecx = dividend - low dword of divisor * quotient"

	mov	eax, [esp+28]	# eax = high dword of divisor
	imul	eax, edi	# eax = high dword of divisor * quotient"

	sub	ebx, eax	# ebx:ecx = dividend - divisor * quotient"
				#         = remainder"
.ifnotdef JMPLESS
	jnb	1f		# remainder" >= 0?
				# with borrow, it is off by divisor
				#  (and quotient" is off by 1)
	add	ecx, [esp+24]
	adc	ebx, [esp+28]	# ebx:ecx = remainder" + divisor
				#         = dividend - divisor * (quotient" - 1)
				#         = dividend - divisor * quotient
				#         = |remainder|
1:
	mov	eax, ecx
	mov	edx, ebx	# edx:eax = |remainder|
.else
	sbb	eax, eax	# eax = (remainder" < 0) ? -1 : 0
	cdq			# edx = (remainder" < 0) ? -1 : 0
	and	eax, [esp+24]
	and	edx, [esp+28]	# edx:eax = (remainder" < 0) ? divisor : 0
	add	eax, ecx
	adc	edx, ebx	# edx:eax = remainder" + ((remainder" < 0) ? divisor : 0)
				#         = dividend - divisor * (quotient" - (remainder" < 0))
				#         = dividend - divisor * quotient
				#         = |remainder|
.endif # JMPLESS
	pop	edi
	pop	ebx

	pop	ecx		# ecx = sign of remainder
	xor	eax, ecx
	xor	edx, ecx
	sub	eax, ecx
	sbb	edx, ecx	# edx:eax = remainder

	ret

	.end

Microsoft Visual C compiler helper routine _allrem() for i386 processors:
; Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

; NOTE: raises "division exception" when divisor is 0!

	.386
	.model	flat, C
	.code
				; [esp+16] = high dword of divisor
				; [esp+12] = low dword of divisor
				; [esp+8] = high dword of dividend
				; [esp+4] = low dword of dividend
_allrem	proc	public

	; determine sign of dividend and compute |dividend|

	mov	eax, [esp+8]	; eax = high dword of dividend
	mov	ecx, [esp+4]	; ecx = low dword of dividend

	cdq			; edx = (dividend < 0) ? -1 : 0
	xor	ecx, edx
	xor	eax, edx	; eax:ecx = (dividend < 0) ? ~dividend : dividend
	sub	ecx, edx
	sbb	eax, edx	; eax:ecx = (dividend < 0) ? -dividend : dividend
				;         = |dividend|

	mov	[esp+4], ecx	; write |dividend| back on stack
	mov	[esp+8], eax

	push	edx		; save sign of dividend on stack

	; determine sign of divisor and compute |divisor|

	mov	edx, [esp+20]	; edx = high dword of divisor
	mov	eax, [esp+16]	; eax = low dword of divisor

	mov	ecx, edx
	sar	ecx, 31		; ecx = (divisor < 0) ? -1 : 0
	xor	eax, ecx
	xor	edx, ecx	; edx:eax = (divisor < 0) ? ~divisor : divisor
	sub	eax, ecx
	sbb	edx, ecx	; edx:eax = (divisor < 0) ? -divisor : divisor
				;         = |divisor|

	mov	[esp+16], eax	; write |divisor| back on stack
	mov	[esp+20], edx
ifdef TRIVIAL
	mov	ecx, [esp+12]	; ecx = high dword of dividend
	cmp	ecx, edx
	jb	trivial		; (high dword of) dividend < (high dword of) divisor?
endif
	bsr	ecx, edx	; ecx = index of leading '1' bit in high dword of divisor
	jnz	extended	; high dword of divisor <> 0?

	; high dword of divisor = 0 (so high dword of remainder will be 0 too)

	mov	ecx, eax	; ecx = (low dword of) divisor
	mov	eax, [esp+12]	; eax = high dword of dividend
	cmp	eax, ecx
	jae	long		; high dword of dividend >= divisor?

	; perform normal division
normal:
	mov	edx, eax	; edx = high dword of dividend
	jmp	next

	; perform "long" alias "schoolbook" division
long:
;;	xor	edx, edx	; edx:eax = high dword of dividend
	div	ecx		; eax = high dword of quotient,
				; edx = high dword of remainder'
next:
	mov	eax, [esp+8]	; eax = low dword of dividend
	div	ecx		; eax = low dword of quotient,
				; edx = (low dword of) remainder
	mov	eax, edx	; eax = (low dword of) |remainder|
;;	xor	edx, edx	; edx:eax = |remainder|

	pop	edx		; edx = sign of remainder
	xor	eax, edx
	sub	eax, edx
	sbb	edx, edx	; edx:eax = remainder

	ret	16		; callee restores stack
ifdef TRIVIAL
	; |dividend| < |divisor|: remainder = dividend
trivial:
	mov	eax, [esp+8]
	mov	edx, [esp+12]	; edx:eax = |dividend| = |remainder|

	pop	ecx		; ecx = sign of remainder
	xor	eax, ecx
	xor	edx, ecx
	sub	eax, ecx
	sbb	edx, ecx	; edx:eax = remainder

	ret	16		; callee restores stack
endif

	; high dword of divisor <> 0 (so high dword of quotient will be 0):
	; perform "extended & adjusted" division
extended:
	push	ebx
	push	edi

	not	ecx		; ecx = number of leading '0' bits in (high dword of) divisor
	shld	edx, eax, cl	; edx = divisor / 2**(index + 1)
				;     = divisor'
;;	shl	eax, cl
	mov	ebx, edx	; ebx = divisor'

	mov	edx, [esp+20]	; edx = high dword of dividend
	mov	eax, [esp+16]	; eax = low dword of dividend
ifndef JMPLESS
	xor	edi, edi	; edi = high dword of quotient' = 0

	cmp	edx, ebx
	jb	@f		; high dword of dividend < divisor'?

	; high dword of dividend >= divisor':
	; subtract divisor' from high dword of dividend to prevent possible
	; division overflow and set most significant bit of quotient"

	sub	edx, ebx	; edx = high dword of dividend - divisor'
				;     = high dword of dividend'
	inc	edi		; edi = high dword of quotient' = 1
@@:
else
	sub	edx, ebx	; edx = high dword of dividend - divisor'
	sbb	edi, edi	; edi = (high dword of dividend < divisor') ? -1 : 0
	and	edi, ebx	; edi = (high dword of dividend < divisor') ? divisor' : 0
	add	edx, edi	; edx = high dword of dividend
				;     - (high dword of dividend < divisor') ? 0 : divisor'
				;     = high dword of dividend'
	neg	edi		; CF = (high dword of dividend < divisor') ? 1 : 0
	sbb	edi, edi	; edi = (high dword of dividend < divisor') ? -1 : 0
	inc	edi		; edi = (high dword of dividend < divisor') ? 0 : 1
				;     = high dword of quotient'
endif ; JMPLESS
	; high dword of dividend' < divisor'

	div	ebx		; eax = dividend' / divisor'
				;     = low dword of quotient',
				; edx = remainder'
	shld	edi, eax, cl	; edi = quotient' / 2**(index + 1)
				;     = dividend / divisor
				;     = quotient"
;;	shl	eax, cl

	mov	eax, [esp+24]	; eax = low dword of divisor
	mul	edi		; edx:eax = low dword of divisor * quotient"

	mov	ecx, [esp+16]
	mov	ebx, [esp+20]	; ebx:ecx = dividend
	sub	ecx, eax
	sbb	ebx, edx	; ebx:ecx = dividend - low dword of divisor * quotient"

	mov	eax, [esp+28]	; eax = high dword of divisor
	imul	eax, edi	; eax = high dword of divisor * quotient"

	sub	ebx, eax	; ebx:ecx = dividend - divisor * quotient"
				;         = remainder"
ifndef JMPLESS
	jnb	@f		; remainder" >= 0?
				; with borrow, it is off by divisor
				;  (and quotient" is off by 1)
	add	ecx, [esp+24]
	adc	ebx, [esp+28]	; ebx:ecx = remainder" + divisor
				;         = dividend - divisor * (quotient" - 1)
				;         = dividend - divisor * quotient
				;         = remainder
@@:
	mov	eax, ecx
	mov	edx, ebx	; edx:eax = |remainder|
else
	sbb	eax, eax	; eax = (remainder" < 0) ? -1 : 0
	cdq			; edx = (remainder" < 0) ? -1 : 0
	and	eax, [esp+24]
	and	edx, [esp+28]	; edx:eax = (remainder" < 0) ? divisor : 0
	add	eax, ecx
	adc	edx, ebx	; edx:eax = remainder" + divisor
				;         = dividend - divisor * (quotient" - 1)
				;         = dividend - divisor * quotient
				;         = |remainder|
endif ; JMPLESS
	pop	edi
	pop	ebx

	pop	ecx		; ecx = sign of remainder
	xor	eax, ecx
	xor	edx, ecx
	sub	eax, ecx
	sbb	edx, ecx	; edx:eax = remainder

	ret	16		; callee restores stack

_allrem	endp
	end
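
As a cross-check of the arithmetic in the extended: path above, the following portable C sketch mirrors its steps for the unsigned core and then applies the sign of the dividend in the way _allrem() does. It is an illustration only, not the article's implementation: the names umod64_extended() and mod64() are hypothetical, C's built-in division stands in for the DIV instruction together with its overflow guard, and a loop replaces the bit scan.

```c
#include <stdint.h>

/* Hypothetical helper, not part of the article's sources: remainder of an
   unsigned 64-bit division; precondition: divisor >= 2**32 */
uint64_t umod64_extended(uint64_t dividend, uint64_t divisor)
{
	unsigned shift = 0;	/* leading '0' bits of the divisor, 0 to 31 */
	uint32_t divisor1;	/* divisor' = upper 32 bits of shifted divisor */
	uint64_t quotient1;	/* quotient' = dividend / divisor', up to 33 bits */
	uint32_t quotient2;	/* quotient" = quotient or quotient + 1 */
	uint64_t low, high, rest;

	while (((divisor << shift) >> 63) == 0)
		shift++;

	divisor1 = (uint32_t) ((divisor << shift) >> 32);
	quotient1 = dividend / divisor1;
	quotient2 = (uint32_t) (quotient1 >> (32 - shift));

	/* partial products of divisor * quotient", each fits in 64 bits */
	low = (uint64_t) (uint32_t) divisor * quotient2;
	high = (divisor >> 32) * quotient2;

	rest = dividend - divisor * quotient2;	/* remainder" modulo 2**64 */

	/* remainder" < 0: quotient" is off by 1, remainder" is off by divisor */
	if (low > dividend || high > ((dividend - low) >> 32))
		rest += divisor;

	return rest;
}

/* Hypothetical signed wrapper: the remainder takes the sign of the dividend */
int64_t mod64(int64_t dividend, int64_t divisor)
{
	uint64_t sign = (dividend < 0) ? ~0ULL : 0ULL;
	uint64_t mask = (divisor < 0) ? ~0ULL : 0ULL;
	uint64_t a = ((uint64_t) dividend ^ sign) - sign;	/* |dividend| */
	uint64_t b = ((uint64_t) divisor ^ mask) - mask;	/* |divisor| */
	uint64_t rest = (b >> 32) == 0 ? a % b : umod64_extended(a, b);

	return (int64_t) ((rest ^ sign) - sign);
}
```

The correction test compares the partial products instead of relying on a carry flag, which C does not expose; a divisor of 0 still traps in the a % b branch, just as the DIV instruction does.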

64×64-bit Signed and Unsigned Integer Multiplication (64-bit Product)

__muldi3() alias __umuldi3() function for i386 processors:
; Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.386
	.model	flat, C
	.code
				; [esp+16] = high dword of multiplier
				; [esp+12] = low dword of multiplier
				; [esp+8] = high dword of multiplicand
				; [esp+4] = low dword of multiplicand
__muldi3 proc	public
__umuldi3 proc	public

	push	ebx

	mov	eax, [esp+16]	; eax = low dword of multiplier
	mov	ecx, [esp+12]	; ecx = high dword of multiplicand
	imul	ecx, eax	; ecx = high dword of multiplicand
				;     * low dword of multiplier

	mov	edx, [esp+8]	; edx = low dword of multiplicand
	mov	ebx, [esp+20]	; ebx = high dword of multiplier
	imul	ebx, edx	; ebx = high dword of multiplier
				;     * low dword of multiplicand

	mul	edx		; edx:eax = low dword of multiplier
				;         * low dword of multiplicand
	add	ecx, ebx	; ecx = high dword of multiplier
				;     * low dword of multiplicand
				;     + high dword of multiplicand
				;     * low dword of multiplier
	add	edx, ecx	; edx:eax = product % 2**64

	pop	ebx
	ret

__umuldi3 endp
__muldi3 endp
	end

Microsoft Visual C compiler helper routine _allmul() for i386 processors:
; Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.386
	.model	flat, C
	.code
				; [esp+16] = high dword of multiplier
				; [esp+12] = low dword of multiplier
				; [esp+8] = high dword of multiplicand
				; [esp+4] = low dword of multiplicand
_allmul	proc	public

	push	ebx

	mov	ecx, [esp+20]	; ecx = high dword of multiplier
	mov	eax, [esp+16]	; eax = low dword of multiplier
	mov	ebx, [esp+12]	; ebx = high dword of multiplicand
	mov	edx, [esp+8]	; edx = low dword of multiplicand
	imul	ebx, eax	; ebx = high dword of multiplicand
				;     * low dword of multiplier
	imul	ecx, edx	; ecx = high dword of multiplier
				;     * low dword of multiplicand
	mul	edx		; edx:eax = low dword of multiplier
				;         * low dword of multiplicand
	add	ecx, ebx	; ecx = high dword of multiplier
				;     * low dword of multiplicand
				;     + high dword of multiplicand
				;     * low dword of multiplier
	add	edx, ecx	; edx:eax = product % 2**64

	pop	ebx
	ret	16		; callee restores stack

_allmul	endp
	end
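
Both multiplication routines above rest on the same decomposition: only the product of the two low dwords is needed in full, while the two cross products contribute just their low 32 bits to the upper half of the result. A minimal C sketch of that decomposition follows; the name mul64() is hypothetical and the code is meant as an illustration, not as a replacement for the assembly.

```c
#include <stdint.h>

/* Hypothetical illustration: low 64 bits of a 64×64-bit product,
   built from 32-bit halves like __muldi3() and _allmul() */
uint64_t mul64(uint64_t multiplicand, uint64_t multiplier)
{
	uint32_t a0 = (uint32_t) multiplicand;		/* low dword */
	uint32_t a1 = (uint32_t) (multiplicand >> 32);	/* high dword */
	uint32_t b0 = (uint32_t) multiplier;
	uint32_t b1 = (uint32_t) (multiplier >> 32);

	/* the cross products are needed modulo 2**32 only, like the two IMUL */
	uint32_t cross = (uint32_t) ((uint64_t) a1 * b0 + (uint64_t) b1 * a0);

	/* the widening MUL yields the full product of the low dwords */
	uint64_t product = (uint64_t) a0 * b0;

	return product + ((uint64_t) cross << 32);
}
```

Since the result is taken modulo 2**64, the same routine serves signed and unsigned operands, which is why __muldi3() and __umuldi3() can share one body.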

64-bit Signed and Unsigned Integer Shift (64-bit Result)

Microsoft Visual C compiler helper routine _allshl() for i386 processors:
; Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

; NOTE: treats shift count % 64

	.386
	.model	flat, C
	.code
				; edx:eax = value
				; ecx = count
_allshl	proc	public

	test	cl, 32
	jz	@f		; count < 32?

	mov	edx, eax
	xor	eax, eax
	shl	edx, cl

	ret
@@:
	shld	edx, eax, cl
	shl	eax, cl

	ret

_allshl	endp
	end

Microsoft Visual C compiler helper routine _allshr() for i386 processors:
; Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

; NOTE: treats shift count % 64

	.386
	.model	flat, C
	.code
				; edx:eax = value
				; ecx = count
_allshr	proc	public

	test	cl, 32
	jz	@f		; count < 32?

	mov	eax, edx
	cdq			; edx = (value < 0) ? -1 : 0
	sar	eax, cl

	ret
@@:
	shrd	eax, edx, cl
	sar	edx, cl

	ret

_allshr	endp
	end

Microsoft Visual C compiler helper routine _aullshr() for i386 processors:
; Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

; NOTE: treats shift count % 64

	.386
	.model	flat, C
	.code
				; edx:eax = value
				; ecx = count
_aullshr proc	public

	test	cl, 32
	jz	@f		; count < 32?

	mov	eax, edx
	xor	edx, edx
	shr	eax, cl

	ret
@@:
	shrd	eax, edx, cl
	shr	edx, cl

	ret

_aullshr endp
	end
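
The three shift routines share one pattern: bit 5 of the count selects whether a whole dword moves across, and the remaining 5 bits drive an ordinary double-precision shift. The C sketch below spells this out on 32-bit halves; the names shl64(), shr64() and sar64() are hypothetical, and for the arithmetic variant it assumes that the vacated upper dword has to be filled with copies of the sign bit.

```c
#include <stdint.h>

/* Hypothetical illustration of a 64-bit left shift on 32-bit halves */
uint64_t shl64(uint64_t value, unsigned count)
{
	uint32_t low = (uint32_t) value;
	uint32_t high = (uint32_t) (value >> 32);
	unsigned n = count & 31;	/* the processor uses CL modulo 32 */

	if (count & 32)			/* count >= 32: low dword moves up */
	{
		high = low << n;
		low = 0;
	}
	else if (n != 0)
	{
		high = (high << n) | (low >> (32 - n));	/* SHLD */
		low <<= n;				/* SHL */
	}

	return ((uint64_t) high << 32) | low;
}

/* common body of both right shifts; fill = 0 or 0xFFFFFFFF */
static uint64_t shift_right(uint64_t value, unsigned count, uint32_t fill)
{
	uint32_t low = (uint32_t) value;
	uint32_t high = (uint32_t) (value >> 32);
	unsigned n = count & 31;

	if (count & 32)			/* count >= 32: high dword moves down */
	{
		low = n ? (high >> n) | (fill << (32 - n)) : high;
		high = fill;
	}
	else if (n != 0)
	{
		low = (low >> n) | (high << (32 - n));	/* SHRD */
		high = (high >> n) | (fill << (32 - n));	/* SHR or SAR */
	}

	return ((uint64_t) high << 32) | low;
}

uint64_t shr64(uint64_t value, unsigned count)
{
	return shift_right(value, count, 0);
}

int64_t sar64(int64_t value, unsigned count)
{
	return (int64_t) shift_right((uint64_t) value, count,
	                             (value < 0) ? 0xFFFFFFFFUL : 0UL);
}
```

Like the assembly routines, the sketch treats the count modulo 64, whereas a shift by 64 or more is undefined for the built-in C operators.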

Execution Times (Sustained Reciprocal Throughput)

Measurements were performed on Windows 10 with the benchmark programs presented below, which are available for download in the cabinet file INTEGER.CAB: the console programs *.com measure execution times in nanoseconds, while the console programs *.exe measure processor clock cycles.

The makefile INTEGER.MAK for Microsoft’s NMAKE.EXE performs all following steps, using only slightly different filenames; it contains the sources presented above and below as inline files and was used to create the cabinet file INTEGER.CAB.
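
How the raw counters become per-call figures may not be obvious from the listings: each timed loop of the benchmark programs performs one billion calls, and the time of an otherwise identical loop around a no-op routine is subtracted before the result is printed. A small C sketch of that arithmetic follows, assuming thread times in the 100 ns units of FILETIME; the function names are hypothetical.

```c
#include <stdint.h>

#define CALLS	1000000000ULL	/* 500000000 iterations * 2 calls each */

/* net time of the measured routine: timed loop minus no-op loop */
uint64_t net_time(uint64_t loop, uint64_t noop)
{
	return loop - noop;
}

/* 100 ns units for all calls -> whole nanoseconds per call */
uint64_t nanoseconds(uint64_t net)
{
	return net / (CALLS / 100ULL);
}

/* 100 ns units for all calls -> 7 decimal places of nanoseconds per call */
uint64_t fraction(uint64_t net)
{
	return net % (CALLS / 100ULL);
}

/* clock cycles for all calls -> whole clock cycles per call */
uint64_t cycles(uint64_t net)
{
	return net / CALLS;
}
```

For instance, a timed loop of 25 seconds against a no-op loop of 5 seconds leaves 200000000 units of 100 ns, which nanoseconds() turns into 20 ns per call.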

Running ’round in Circles Cycles

The table shows the execution times for 128÷128-bit division, 64÷64-bit division and 64×64-bit multiplication on several processors in clock cycles per function call or instruction; the upper half for the AMD64 platform, the lower half for the i386 platform.

For division, the left columns show the execution times for uniformly distributed 128-bit or 64-bit pseudo-random dividend and divisor, i.e. the (rather unlikely) special case with numbers of (almost) equal magnitude, while the right columns show the execution times for 128-bit to 65-bit or 64-bit to 33-bit pseudo-random dividend and divisor respectively, i.e. the (more likely) general case with numbers of different magnitude.

                        128÷128-bit division                            64÷64-bit division
                        __udivmodti4()  __udivmodti4()  __udivmodti4()  __udivmoddi4()  DIV
                        eSKamation      LLVM            eSKamation      eSKamation      AMD, Intel
AMD Ryzen9 3900XT           19     19       39    190       20     56       21     39       13     16
AMD Ryzen5 3600             20     20       41    201       21     59       22     41       14     17
AMD Ryzen7 2700X            20     19       44    212       23     63       21     41       14     17
Intel Core i7-8550U         31     32       24    122       11     37       12     21       28     28
Intel Core i5-7400          55     56       41    214       18     65       15     32       28     29
Intel Core i5-4670          53     55       44    217       22     74       20     39       31     32
Intel Core2 Duo P8700       60     60       62    296       27    117       33     60       28     29

                        64÷64-bit division                                              64×64-bit multiplication
                        __udivmoddi4()  __udivmoddi4()  _aulldvrm()     _aulldvrm()     _aullmul()  __umuldi3() __umuldi3()
                        eSKamation      LLVM            eSKamation      Microsoft       Microsoft   LLVM        eSKamation
AMD Ryzen9 3900XT           17     13       53     99       19     14       70     41        4           7
AMD Ryzen5 3600             18     13       56    105       20     14       73     44        3           7
AMD Ryzen7 2700X            19     15       61    114       22     17       82     58        5           9           1
Intel Core i7-8550U         10      9       26     52       11      9       72     46        3           5           1
Intel Core i5-7400          16     14       41     83       19     15      115     78        5           8           1
Intel Core i5-4670          21     19       50     97       24     19      124     84        8          10           1
Intel Core2 Duo P8700       25     19       82    145       30     24      136     98        9          18           1

Note: the deviation of the measurements for my own __udivmoddi4() and _aulldvrm() division routines is due to their different calling convention.

Summary

The following summary can be given from the benchmarks:

Benchmark Programs for AMD64 Processors

The first of the following two C programs for 64-bit processors measures the execution time for one billion divisions of uniformly distributed 128-bit pseudo-random numbers and for one billion divisions of pseudo-random numbers in the interval from 2**128−1 to 2**64 with the __udivmodti4() function.

With the preprocessor macro NATIVE defined, the second C program measures the execution time for one billion divisions of uniformly distributed 64-bit pseudo-random numbers and one billion divisions of pseudo-random numbers in the interval from 2**64−1 to 2**32 with the DIV instruction, otherwise with the shift & subtract algorithm, both disguised as the __udivmoddi4() function.

Note: with the preprocessor macro CYCLES defined, both programs measure the execution time in processor clock cycles and run on 64-bit editions of Windows Vista® and newer versions, otherwise they measure the execution time in nanoseconds and run on 64-bit editions of Windows XP® and newer versions.
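
The first program below generates its operands with the Microsoft intrinsics __shiftleft128() and __shiftright128(). For readers without that compiler, here is a hedged portable rendering of one Galois step of its lfsr128l() routine and of its scale128() routine, which shifts a value right by the count held in its own low 6 bits and thereby produces operands of varying magnitude. The polynomial constants are copied from the listing; the names shiftright128(), lfsr128l_step() and scale128_portable() are illustrative only.

```c
#include <stdint.h>

/* portable stand-in for the __shiftright128() intrinsic:
   low 64 bits of high:low >> count, with count taken modulo 64 */
static uint64_t shiftright128(uint64_t low, uint64_t high, unsigned count)
{
	count &= 63;

	return count ? (low >> count) | (high << (64 - count)) : low;
}

/* one step of the left-shifting 128-bit Galois LFSR; qw[0] is the low
   and qw[1] the high qword, polynomial as in lfsr128l() */
void lfsr128l_step(uint64_t qw[2])
{
	/* all '1' bits if the most significant bit is shifted out */
	uint64_t mask = (qw[1] >> 63) ? ~0ULL : 0ULL;

	qw[1] = ((qw[1] << 1) | (qw[0] >> 63)) ^ (mask & 0x5DB2B62B0C5F8E1BULL);
	qw[0] = (qw[0] << 1) ^ (mask & 0xD8CCE715FCB2726DULL);
}

/* shift the 128-bit input right by the count in its low 6 bits */
void scale128_portable(uint64_t out[2], const uint64_t in[2])
{
	unsigned count = (unsigned) (in[0] & 63);

	out[0] = shiftright128(in[0], in[1], count);
	out[1] = in[1] >> count;
}
```

Shifting by a count between 0 and 63 leaves between 128 and 65 significant bits when the most significant bit of the input is set, which matches the interval named above.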

// Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifndef _M_AMD64
#pragma message("For AMD64 platform only!")
#endif

#pragma comment(compiler)
#pragma comment(user, __TIMESTAMP__)

#define _CRT_SECURE_NO_WARNINGS
#define STRICT
#define UNICODE
#define WIN32_LEAN_AND_MEAN

#include <windows.h>

typedef	ULONGLONG	QWORD;

const	struct
{
	QWORD	qwDividend[2], qwDivisor[2], qwQuotient[2], qwRemainder[2];
} owVector[] = {{0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL},
                {0ULL, 0ULL, 1ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL},
                {0ULL, 0ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL},
                {0ULL, 0ULL, 0ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 0ULL},
                {0ULL, 0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 0ULL},
                {0ULL, 0ULL, 0ULL, 1ULL, 0ULL, 0ULL, 0ULL, 0ULL},
                {0ULL, 0ULL, 1ULL, 1ULL, 0ULL, 0ULL, 0ULL, 0ULL},
                {0ULL, 0ULL, ~0ULL, 1ULL, 0ULL, 0ULL, 0ULL, 0ULL},
                {0ULL, 0ULL, 1ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 0ULL},
                {1ULL, 0ULL, 1ULL, 0ULL, 1ULL, 0ULL, 0ULL, 0ULL},
                {1ULL, 0ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 1ULL, 0ULL},
                {1ULL, 0ULL, 0ULL, 1ULL, 0ULL, 0ULL, 1ULL, 0ULL},
                {1ULL, 0ULL, 1ULL, 1ULL, 0ULL, 0ULL, 1ULL, 0ULL},
                {1ULL, 0ULL, ~0ULL, 1ULL, 0ULL, 0ULL, 1ULL, 0ULL},
                {1ULL, 0ULL, 0ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 0ULL},
                {1ULL, 0ULL, 1ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 0ULL},
                {1ULL, 0ULL, ~1ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 0ULL},
                {1ULL, 0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 0ULL},
                {2ULL, 0ULL, 1ULL, 0ULL, 2ULL, 0ULL, 0ULL, 0ULL},
                {2ULL, 0ULL, 2ULL, 0ULL, 1ULL, 0ULL, 0ULL, 0ULL},
                {2ULL, 0ULL, ~1ULL, ~0ULL, 0ULL, 0ULL, 2ULL, 0ULL},
                {2ULL, 0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 2ULL, 0ULL},
                {~0ULL, 0ULL, 1ULL, 0ULL, ~0ULL, 0ULL, 0ULL, 0ULL},
                {~0ULL, 0ULL, ~0ULL, 0ULL, 1ULL, 0ULL, 0ULL, 0ULL},
                {~0ULL, 0ULL, 0ULL, 1ULL, 0ULL, 0ULL, ~0ULL, 0ULL},
                {~0ULL, 0ULL, 1ULL, 1ULL, 0ULL, 0ULL, ~0ULL, 0ULL},
                {~0ULL, 0ULL, ~0ULL, 1ULL, 0ULL, 0ULL, ~0ULL, 0ULL},
                {~0ULL, 0ULL, 0ULL, ~0ULL, 0ULL, 0ULL, ~0ULL, 0ULL},
                {~0ULL, 0ULL, 1ULL, ~0ULL, 0ULL, 0ULL, ~0ULL, 0ULL},
                {~0ULL, 0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, ~0ULL, 0ULL},
                {0ULL, 1ULL, 1ULL, 0ULL, 0ULL, 1ULL, 0ULL, 0ULL},
                {0ULL, 1ULL, ~0ULL, 0ULL, 1ULL, 0ULL, 1ULL, 0ULL},
                {0ULL, 1ULL, 0ULL, 1ULL, 1ULL, 0ULL, 0ULL, 0ULL},
                {0ULL, 1ULL, 1ULL, 1ULL, 0ULL, 0ULL, 0ULL, 1ULL},
                {0ULL, 1ULL, ~0ULL, 1ULL, 0ULL, 0ULL, 0ULL, 1ULL},
                {0ULL, 1ULL, 0ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 1ULL},
                {0ULL, 1ULL, 1ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 1ULL},
                {0ULL, 1ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 1ULL},
                {1ULL, 1ULL, 1ULL, 0ULL, 1ULL, 1ULL, 0ULL, 0ULL},
                {1ULL, 1ULL, ~0ULL, 0ULL, 1ULL, 0ULL, 2ULL, 0ULL},
                {1ULL, 1ULL, 0ULL, 1ULL, 1ULL, 0ULL, 1ULL, 0ULL},
                {1ULL, 1ULL, 1ULL, 1ULL, 1ULL, 0ULL, 0ULL, 0ULL},
                {1ULL, 1ULL, ~0ULL, 1ULL, 0ULL, 0ULL, 1ULL, 1ULL},
                {1ULL, 1ULL, 0ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 1ULL},
                {1ULL, 1ULL, 1ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 1ULL},
                {1ULL, 1ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 1ULL},
                {~0ULL, 1ULL, 1ULL, 0ULL, ~0ULL, 1ULL, 0ULL, 0ULL},
                {~0ULL, 1ULL, ~0ULL, 0ULL, 2ULL, 0ULL, 1ULL, 0ULL},
                {~0ULL, 1ULL, 0ULL, 1ULL, 1ULL, 0ULL, ~0ULL, 0ULL},
                {~0ULL, 1ULL, 1ULL, 1ULL, 1ULL, 0ULL, ~1ULL, 0ULL},
                {~0ULL, 1ULL, ~0ULL, 1ULL, 1ULL, 0ULL, 0ULL, 0ULL},
                {~0ULL, 1ULL, 0ULL, ~0ULL, 0ULL, 0ULL, ~0ULL, 1ULL},
                {~0ULL, 1ULL, 1ULL, ~0ULL, 0ULL, 0ULL, ~0ULL, 1ULL},
                {~0ULL, 1ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, ~0ULL, 1ULL},
                {~0ULL, 0xFULL, 0xFULL, 0ULL, 0x1111111111111111ULL, 1ULL, 0ULL, 0ULL},
                {~0xFULL, ~1ULL, ~1ULL, ~0ULL, 0ULL, 0ULL, ~0xFULL, ~1ULL},
                {0ULL, ~0ULL, 1ULL, 0ULL, 0ULL, ~0ULL, 0ULL, 0ULL},
                {0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 0ULL, 0ULL},
                {0ULL, ~0ULL, 0ULL, 1ULL, ~0ULL, 0ULL, 0ULL, 0ULL},
                {0ULL, ~0ULL, 1ULL, 1ULL, ~1ULL, 0ULL, 2ULL, 0ULL},
                {0ULL, ~0ULL, 0ULL, ~0ULL, 1ULL, 0ULL, 0ULL, 0ULL},
                {0ULL, ~0ULL, 1ULL, ~0ULL, 0ULL, 0ULL, 0ULL, ~0ULL},
                {0ULL, ~0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 0ULL, ~0ULL},
                {1ULL, ~0ULL, 1ULL, 0ULL, 1ULL, ~0ULL, 0ULL, 0ULL},
                {1ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 1ULL, 0ULL},
                {1ULL, ~0ULL, 0ULL, 1ULL, ~0ULL, 0ULL, 1ULL, 0ULL},
                {1ULL, ~0ULL, 1ULL, 1ULL, ~1ULL, 0ULL, 3ULL, 0ULL},
                {1ULL, ~0ULL, 0ULL, ~0ULL, 1ULL, 0ULL, 1ULL, 0ULL},
                {1ULL, ~0ULL, 1ULL, ~0ULL, 1ULL, 0ULL, 0ULL, 0ULL},
                {1ULL, ~0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 1ULL, ~0ULL},
                {~0ULL, ~0ULL, 1ULL, 0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL},
                {~0ULL, ~0ULL, ~0ULL, 0ULL, 1ULL, 1ULL, 0ULL, 0ULL},
                {~0ULL, ~0ULL, 0ULL, 1ULL, ~0ULL, 0ULL, ~0ULL, 0ULL},
                {~0ULL, ~0ULL, 1ULL, 1ULL, ~0ULL, 0ULL, 0ULL, 0ULL},
                {~0ULL, ~0ULL, 1ULL, 3ULL, 0x5555555555555555ULL, 0ULL, 0xAAAAAAAAAAAAAAAAULL, 0ULL},
                {~0ULL, ~0ULL, 0ULL, ~0ULL, 1ULL, 0ULL, ~0ULL, 0ULL},
                {~0ULL, ~0ULL, 1ULL, ~0ULL, 1ULL, 0ULL, ~1ULL, 0ULL},
                {~0ULL, ~0ULL, ~0ULL, ~0ULL, 1ULL, 0ULL, 0ULL, 0ULL},
                {~0ULL, ~0ULL, ~1ULL, ~0ULL, 1ULL, 0ULL, 1ULL, 0ULL},
                {0xBF25975319080000ULL, 0x530EDA741C71D4C3ULL, 0x14C34AB4676E4BABULL, 0x0000004DE2CAB081ULL, 0x0000000001110001ULL, 0ULL, 0x00000000003EB455ULL, 0ULL},
                {0xBF25975319080000ULL, 0x530EDA741C71D4C3ULL, 0x0000000001110001ULL, 0ULL, 0x14C34AB4676E4BABULL, 0x0000004DE2CAB081ULL, 0x00000000003EB455ULL, 0ULL}};

#pragma intrinsic(_umul128)

__declspec(noinline)
QWORD	*__umulti3(QWORD qwProduct[2], QWORD qwMultiplicand[2], QWORD qwMultiplier[2])
{
	qwProduct[0] = _umul128(qwMultiplicand[0], qwMultiplier[0], qwProduct + 1);
	qwProduct[1] += qwMultiplicand[0] * qwMultiplier[1]
	              + qwMultiplicand[1] * qwMultiplier[0];

	return qwProduct;
}

__declspec(noinline)
QWORD	*__unopti4(QWORD qwQuotient[2], QWORD qwDividend[2], QWORD qwDivisor[2], QWORD qwRemainder[2])
{
	if (qwRemainder != NULL)
		*qwRemainder = *qwDivisor;

	return qwQuotient;
}

QWORD	*__udivmodti4(QWORD qwQuotient[2], QWORD qwDividend[2], QWORD qwDivisor[2], QWORD qwRemainder[2]);

#pragma intrinsic(__shiftleft128, __shiftright128)

__forceinline
VOID	lfsr128l(QWORD qw[2])
{
#ifndef XORSHIFT
	// 128-bit linear feedback shift register (Galois form) using
	//  primitive polynomial 0x5DB2B62B0C5F8E1B:D8CCE715FCB2726D

	QWORD	qwMask = (LONGLONG) (qw[1]) >> 63;
	qw[1] = __shiftleft128(qw[0], qw[1], 1)
	      ^ (qwMask & 0x5DB2B62B0C5F8E1BULL);
	qw[0] = (qwMask & 0xD8CCE715FCB2726DULL) ^ (qw[0] << 1);
#elif 1
	// 128-bit linear feedback shift register (XorShift form)
	//  using shift constants from Richard Peirce Brent

	QWORD	qwTemp = qw[1];
	qw[1] = qw[0];
	qw[0] ^= qw[0] << 33;
	qwTemp ^= qwTemp << 28;
	qw[0] ^= qw[0] >> 31;
	qwTemp ^= qwTemp >> 29;
	qw[0] ^= qwTemp;
#else
	// 128-bit linear feedback shift register (XorShift form)
	//  using shift constants from Melissa O'Neill

	qw[1] ^= __shiftleft128(qw[0], qw[1], 26);
	qw[0] ^= qw[0] << 26;
	qw[0] ^= __shiftright128(qw[0], qw[1], 61);
	qw[1] ^= qw[1] >> 61;
	qw[1] ^= __shiftleft128(qw[0], qw[1], 37);
	qw[0] ^= qw[0] << 37;
#endif
}

__forceinline
VOID	lfsr128r(QWORD qw[2])
{
#ifndef XORSHIFT
	// 128-bit linear feedback shift register (Galois form) using
	//  primitive polynomial 0xB64E4D3FA8E7331B:D871FA30D46D4DBA

	QWORD	qwMask = 0ULL - (qw[0] & 1ULL);
	qw[0] = __shiftright128(qw[0], qw[1], 1)
	      ^ (qwMask & 0xD871FA30D46D4DBAULL);
	qw[1] = (qwMask & 0xB64E4D3FA8E7331BULL) ^ (qw[1] >> 1);
#elif 1
	// 128-bit linear feedback shift register (XorShift form)
	//  using shift constants from Sebastiano Vigna

	QWORD	qwTemp = qw[1];
	qw[1] = qw[0];
	qwTemp ^= qwTemp << 23;
	qw[0] ^= qw[0] >> 26;
	qwTemp ^= qwTemp >> 17;
	qw[0] ^= qwTemp;
#else
	// 128-bit linear feedback shift register (XorShift form)
	//  using shift constants from Melissa O'Neill

	qw[1] ^= __shiftleft128(qw[0], qw[1], 11);
	qw[0] ^= qw[0] << 11;
	qw[0] ^= __shiftright128(qw[0], qw[1], 61);
	qw[1] ^= qw[1] >> 61;
	qw[1] ^= __shiftleft128(qw[0], qw[1], 45);
	qw[0] ^= qw[0] << 45;
#endif
}

__forceinline
VOID	scale128(QWORD qwOut[2], QWORD qwIn[2])
{
	qwOut[0] = __shiftright128(qwIn[0], qwIn[1], qwIn[0] /* & 63 */);
	qwOut[1] = qwIn[1] >> (qwIn[0] /* & 63 */);
}

__declspec(safebuffers)
BOOL	PrintConsole(HANDLE hConsole, LPCWSTR lpFormat, ...)
{
	WCHAR	szBuffer[1025];
	DWORD	dwBuffer;
	DWORD	dwConsole;

	va_list	vaInserts;
	va_start(vaInserts, lpFormat);

	dwBuffer = wvsprintf(szBuffer, lpFormat, vaInserts);

	va_end(vaInserts);

	if (dwBuffer == 0UL)
		return FALSE;

	if (!WriteConsole(hConsole, szBuffer, dwBuffer, &dwConsole, NULL))
		return FALSE;

	return dwConsole == dwBuffer;
}

__declspec(noreturn)
VOID	wmainCRTStartup(VOID)
{
	DWORD	dw, dwCPUID[12];

	QWORD	qwT0, qwT1, qwT2, qwT3;
	QWORD	qwTx, qwTy, qwTz;

	QWORD	qwDividend[2], qwDivisor[2], qwQuotient[2], qwRemainder[2];
		// bit-vector of prime numbers: bit n is set if n is prime
	QWORD	qwLeft[2] = {0x28208A20A08A28ACULL, 0x800228A202088288ULL};
		// 2**128 / golden ratio
	QWORD	qwRight[2] = {0xF39CC0605CEDC834ULL, 0x9E3779B97F4A7C15ULL};

	HANDLE	hThread = GetCurrentThread();
	HANDLE	hOutput = GetStdHandle(STD_OUTPUT_HANDLE);

	if (hOutput == INVALID_HANDLE_VALUE)
		ExitProcess(GetLastError());

	if (SetThreadIdealProcessor(hThread, 0UL) == -1L)
		ExitProcess(GetLastError());

	__cpuid(dwCPUID, 0x80000000UL);

	if (*dwCPUID >= 0x80000004UL)
	{
		__cpuid(dwCPUID, 0x80000002UL);
		__cpuid(dwCPUID + 4, 0x80000003UL);
		__cpuid(dwCPUID + 8, 0x80000004UL);
	}
	else
		strcpy((LPSTR) dwCPUID, "undetermined processor");

	PrintConsole(hOutput, L"\nTesting 128-bit division...\n");

	for (dw = 1UL; dw < sizeof(owVector) / sizeof(*owVector); dw++)
	{
		PrintConsole(hOutput, L"\r%lu", dw);
#if 0
		if ((owVector[dw].qwDivisor[1] | owVector[dw].qwDivisor[0]) == 0ULL)
			continue;
#endif
		__udivmodti4(qwQuotient, owVector[dw].qwDividend, owVector[dw].qwDivisor, qwRemainder);

		if ((qwQuotient[1] != owVector[dw].qwQuotient[1])
		 || (qwQuotient[0] != owVector[dw].qwQuotient[0]))
			PrintConsole(hOutput, L"\t0x%016I64X:%016I64X\a / %016I64X:%016I64X\n"
			                      L"\t0x%016I64X:%016I64X\n"
			                      L"\t0x%016I64X:%016I64X\n",
			             owVector[dw].qwDividend[1], owVector[dw].qwDividend[0],
			             owVector[dw].qwDivisor[1], owVector[dw].qwDivisor[0],
			             owVector[dw].qwQuotient[1], owVector[dw].qwQuotient[0],
			             qwQuotient[1], qwQuotient[0]);

		if ((qwRemainder[1] != owVector[dw].qwRemainder[1])
		 || (qwRemainder[0] != owVector[dw].qwRemainder[0]))
			PrintConsole(hOutput, L"\t0x%016I64X:%016I64X\a %% %016I64X:%016I64X\n"
			                      L"\t0x%016I64X:%016I64X\n"
			                      L"\t0x%016I64X:%016I64X\n",
			             owVector[dw].qwDividend[1], owVector[dw].qwDividend[0],
			             owVector[dw].qwDivisor[1], owVector[dw].qwDivisor[0],
			             owVector[dw].qwRemainder[1], owVector[dw].qwRemainder[0],
			             qwRemainder[1], qwRemainder[0]);
	}

	PrintConsole(hOutput, L"\nTiming 128-bit division on %.48hs\n", dwCPUID);
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT0))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT0))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		lfsr128l(qwLeft);
		__unopti4(qwQuotient, qwLeft, qwRight, NULL);
		lfsr128r(qwRight);
		__unopti4(qwQuotient, qwLeft, qwRight, qwRemainder);
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT1))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT1))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		lfsr128l(qwLeft);
		__udivmodti4(qwQuotient, qwLeft, qwRight, NULL);
		lfsr128r(qwRight);
		__udivmodti4(qwQuotient, qwLeft, qwRight, qwRemainder);
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT2))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT2))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		lfsr128l(qwLeft);
		__umulti3(qwQuotient, qwLeft, qwRight);
		lfsr128r(qwRight);
		__umulti3(qwQuotient, qwLeft, qwRight);
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT3))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT3))
#endif
		ExitProcess(GetLastError());

	qwTz = qwT3 - qwT0;
	qwT3 -= qwT2;
	qwT2 -= qwT1;
	qwT1 -= qwT0;
	qwTy = qwT3 - qwT1;
	qwTx = qwT2 - qwT1;
#ifdef CYCLES
	PrintConsole(hOutput, L"\n"
	                      L"__unopti4()     %6I64u.%09I64u      0\n"
	                      L"__udivmodti4()  %6I64u.%09I64u %6I64u.%09I64u\n"
	                      L"__umulti3()     %6I64u.%09I64u %6I64u.%09I64u\n"
	                      L"                %6I64u.%09I64u clock cycles\n",
	             qwT1 / 1000000000ULL, qwT1 % 1000000000ULL,
	             qwT2 / 1000000000ULL, qwT2 % 1000000000ULL,
	             qwTx / 1000000000ULL, qwTx % 1000000000ULL,
	             qwT3 / 1000000000ULL, qwT3 % 1000000000ULL,
	             qwTy / 1000000000ULL, qwTy % 1000000000ULL,
	             qwTz / 1000000000ULL, qwTz % 1000000000ULL);
#else
	PrintConsole(hOutput, L"\n"
	                      L"__unopti4()     %6I64u.%07I64u      0\n"
	                      L"__udivmodti4()  %6I64u.%07I64u %6I64u.%07I64u\n"
	                      L"__umulti3()     %6I64u.%07I64u %6I64u.%07I64u\n"
	                      L"                %6I64u.%07I64u nano-seconds\n",
	             qwT1 / 10000000ULL, qwT1 % 10000000ULL,
	             qwT2 / 10000000ULL, qwT2 % 10000000ULL,
	             qwTx / 10000000ULL, qwTx % 10000000ULL,
	             qwT3 / 10000000ULL, qwT3 % 10000000ULL,
	             qwTy / 10000000ULL, qwTy % 10000000ULL,
	             qwTz / 10000000ULL, qwTz % 10000000ULL);
#endif
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT0))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT0))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		lfsr128l(qwLeft);
		scale128(qwDividend, qwLeft);
		__unopti4(qwQuotient, qwDividend, qwDivisor, NULL);
		lfsr128r(qwRight);
		scale128(qwDivisor, qwRight);
		__unopti4(qwQuotient, qwDividend, qwDivisor, qwRemainder);
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT1))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT1))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		lfsr128l(qwLeft);
		scale128(qwDividend, qwLeft);
		__udivmodti4(qwQuotient, qwDividend, qwDivisor, NULL);
		lfsr128r(qwRight);
		scale128(qwDivisor, qwRight);
		__udivmodti4(qwQuotient, qwDividend, qwDivisor, qwRemainder);
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT2))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT2))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		lfsr128l(qwLeft);
		scale128(qwDividend, qwLeft);
		__umulti3(qwQuotient, qwDividend, qwDivisor);
		lfsr128r(qwRight);
		scale128(qwDivisor, qwRight);
		__umulti3(qwQuotient, qwDividend, qwDivisor);
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT3))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT3))
#endif
		ExitProcess(GetLastError());

	qwTz = qwT3 - qwT0;
	qwT3 -= qwT2;
	qwT2 -= qwT1;
	qwT1 -= qwT0;
	qwTy = qwT3 - qwT1;
	qwTx = qwT2 - qwT1;
#ifdef CYCLES
	PrintConsole(hOutput, L"\n"
	                      L"__unopti4()     %6I64u.%09I64u      0\n"
	                      L"__udivmodti4()  %6I64u.%09I64u %6I64u.%09I64u\n"
	                      L"__umulti3()     %6I64u.%09I64u %6I64u.%09I64u\n"
	                      L"                %6I64u.%09I64u clock cycles\n",
	             qwT1 / 1000000000ULL, qwT1 % 1000000000ULL,
	             qwT2 / 1000000000ULL, qwT2 % 1000000000ULL,
	             qwTx / 1000000000ULL, qwTx % 1000000000ULL,
	             qwT3 / 1000000000ULL, qwT3 % 1000000000ULL,
	             qwTy / 1000000000ULL, qwTy % 1000000000ULL,
	             qwTz / 1000000000ULL, qwTz % 1000000000ULL);
#else
	PrintConsole(hOutput, L"\n"
	                      L"__unopti4()     %6I64u.%07I64u      0\n"
	                      L"__udivmodti4()  %6I64u.%07I64u %6I64u.%07I64u\n"
	                      L"__umulti3()     %6I64u.%07I64u %6I64u.%07I64u\n"
	                      L"                %6I64u.%07I64u nano-seconds\n",
	             qwT1 / 10000000ULL, qwT1 % 10000000ULL,
	             qwT2 / 10000000ULL, qwT2 % 10000000ULL,
	             qwTx / 10000000ULL, qwTx % 10000000ULL,
	             qwT3 / 10000000ULL, qwT3 % 10000000ULL,
	             qwTy / 10000000ULL, qwTy % 10000000ULL,
	             qwTz / 10000000ULL, qwTz % 10000000ULL);
#endif
	ExitProcess(0UL);
}

// Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifndef _M_AMD64
#pragma message("For AMD64 platform only!")
#endif

#pragma comment(compiler)
#pragma comment(user, __TIMESTAMP__)

#define _CRT_SECURE_NO_WARNINGS
#define STRICT
#define UNICODE
#define WIN32_LEAN_AND_MEAN

#include <windows.h>

typedef	ULONGLONG	QWORD;

#ifndef NATIVE
#define _(DIVIDEND, DIVISOR)	{(DIVIDEND), (DIVISOR), (DIVIDEND) / (DIVISOR), (DIVIDEND) % (DIVISOR)}

const	struct
{
	QWORD	qwDividend, qwDivisor, qwQuotient, qwRemainder;
} qwVector[] = {_(0x0000000000000000ULL, 0x0000000000000001ULL),
                _(0x0000000000000001ULL, 0x0000000000000001ULL),
                _(0x0000000000000002ULL, 0x0000000000000001ULL),
                _(0x0000000000000002ULL, 0x0000000000000002ULL),
                _(0x0000000000000000ULL, 0xFFFFFFFFFFFFFFFFULL),
                _(0x0000000000000001ULL, 0xFFFFFFFFFFFFFFFFULL),
                _(0x0000000000000001ULL, 0xFFFFFFFFFFFFFFFEULL),
                _(0x0000000000000002ULL, 0xFFFFFFFFFFFFFFFEULL),
                _(0x0000000000000003ULL, 0xFFFFFFFFFFFFFFFEULL),
                _(0x0000000000000003ULL, 0xFFFFFFFFFFFFFFFDULL),
                _(0x000000000FFFFFFFULL, 0x0000000000000001ULL),
                _(0x0000000FFFFFFFFFULL, 0x000000000000000FULL),
                _(0x0000000FFFFFFFFFULL, 0x0000000000000010ULL),
                _(0x0000000000000100ULL, 0x000000000FFFFFFFULL),
                _(0x00FFFFFFF0000000ULL, 0x0000000010000000ULL),
                _(0x07FFFFFF80000000ULL, 0x0000000080000000ULL),
                _(0x7FFFFFFEFFFFFFF0ULL, 0xFFFFFFFFFFFFFFFEULL),
                _(0x7FFFFFFEFFFFFFF0ULL, 0x0000FFFFFFFFFFFEULL),
                _(0x7FFFFFFEFFFFFFF0ULL, 0x7FFFFFFEFFFFFFF0ULL),
                _(0x7FFFFFFFFFFFFFFFULL, 0x8000000000000000ULL),
                _(0x7FFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFDULL),
                _(0x7FFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFEULL),
                _(0x7FFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL),
                _(0x8000000000000000ULL, 0x0000000000000001ULL),
                _(0x8000000000000000ULL, 0x0000000000000002ULL),
                _(0x8000000000000000ULL, 0x0000000000000003ULL),
                _(0x8000000000000000ULL, 0x00000000FFFFFFFDULL),
                _(0x8000000000000000ULL, 0x00000000FFFFFFFEULL),
                _(0x8000000000000000ULL, 0x00000000FFFFFFFFULL),
                _(0x8000000000000000ULL, 0x0000000100000000ULL),
                _(0x8000000000000000ULL, 0x0000000100000001ULL),
                _(0x8000000000000000ULL, 0x0000000100000002ULL),
                _(0x8000000000000000ULL, 0x0000000100000003ULL),
                _(0x8000000000000000ULL, 0xFFFFFFFF00000000ULL),
                _(0x8000000000000000ULL, 0xFFFFFFFFFFFFFFFDULL),
                _(0x8000000000000000ULL, 0xFFFFFFFFFFFFFFFEULL),
                _(0x8000000000000000ULL, 0xFFFFFFFFFFFFFFFFULL),
                _(0x8000000080000000ULL, 0x0000000080000000ULL),
                _(0x8000000080000001ULL, 0x0000000080000001ULL),
                _(0xFFFFFFFEFFFFFFF0ULL, 0xFFFFFFFFFFFFFFFEULL),
                _(0xFFFFFFFFFFFFFFFCULL, 0x00000000FFFFFFFEULL),
                _(0xFFFFFFFFFFFFFFFCULL, 0x0000000100000002ULL),
                _(0xFFFFFFFFFFFFFFFEULL, 0x0000000080000000ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x0000000000000001ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x0000000000000002ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x0000000000000003ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x00000000FFFFFFFDULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x00000000FFFFFFFEULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x00000000FFFFFFFFULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x0000000100000001ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x0000000100000002ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x0000000100000003ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x00000001C0000001ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x0000000380000003ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x8000000000000000ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x7FFFFFFFFFFFFFFFULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFEULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL)};

#undef _
#ifndef INTERN
QWORD	__udivmoddi4(QWORD dividend, QWORD divisor, QWORD *remainder);
#else
__declspec(noinline)
QWORD	__udivmoddi4(QWORD dividend, QWORD divisor, QWORD *remainder)
{
	QWORD	quotient;
	DWORD	index1, index2;

	if (_BitScanReverse64(&index2, divisor))
		if (_BitScanReverse64(&index1, dividend))
#if 0
			if (index1 >= index2)
#else
			if (dividend >= divisor)
#endif
			{
				// dividend >= divisor > 0,
				//  64 > index1 >= index2 >= 0
				//   (number of leading '0' bits = 63 - index)

				divisor <<= index1 - index2;
				quotient = 0ULL;

				do
				{
					quotient <<= 1;

					if (dividend >= divisor)
					{
						dividend -= divisor;
						quotient |= 1ULL;
					}

					divisor >>= 1;
				} while (index1 >= ++index2);

				if (remainder != NULL)
					*remainder = dividend;

				return quotient;
			}
			else // divisor > dividend > 0:
			     //  quotient = 0, remainder = dividend
			{
				if (remainder != NULL)
					*remainder = dividend;

				return 0ULL;
			}
		else // divisor > dividend == 0:
		     //  quotient = 0, remainder = 0
		{
			if (remainder != NULL)
				*remainder = 0ULL;

			return 0ULL;
		}
	else // divisor == 0:
	     //  perform the division anyway to raise the divide error exception
	{
		if (remainder != NULL)
			return dividend % divisor;

		return dividend / divisor;
	}
}
#endif // INTERN
#else
__declspec(noinline)
QWORD	__udivmoddi4(QWORD dividend, QWORD divisor, QWORD *remainder)
{
	if (remainder != NULL)
		*remainder = dividend % divisor;

	return dividend / divisor;
}
#endif // NATIVE

__declspec(noinline)
QWORD	__unopdi4(QWORD dividend, QWORD divisor, QWORD *remainder)
{
	if (remainder != NULL)
		*remainder = divisor;

	return dividend;
}

__declspec(noinline)
QWORD	__umuldi4(QWORD multiplicand, QWORD multiplier, QWORD *dummy)
{
	if (dummy != NULL)
		*dummy = 0ULL;

	return multiplicand * multiplier;
}

__declspec(safebuffers)
BOOL	PrintConsole(HANDLE hConsole, LPCWSTR lpFormat, ...)
{
	WCHAR	szBuffer[1025];
	DWORD	dwBuffer;
	DWORD	dwConsole;

	va_list	vaInserts;
	va_start(vaInserts, lpFormat);

	dwBuffer = wvsprintf(szBuffer, lpFormat, vaInserts);

	va_end(vaInserts);

	if (dwBuffer == 0UL)
		return FALSE;

	if (!WriteConsole(hConsole, szBuffer, dwBuffer, &dwConsole, NULL))
		return FALSE;

	return dwConsole == dwBuffer;
}

__declspec(noreturn)
VOID	wmainCRTStartup(VOID)
{
	DWORD	dw, dwCPUID[12];

	QWORD	qwT0, qwT1, qwT2, qwT3;
	QWORD	qwTx, qwTy, qwTz;

	volatile
	QWORD	qwQuotient;
	QWORD	qwRemainder, qwDividend, qwDivisor = ~0ULL;
		// bit-vector of prime numbers: bit n is set when n is prime
	QWORD	qwLeft = 0x28208A20A08A28ACULL;
		// 2**64 / golden ratio
	QWORD	qwRight = 0x9E3779B97F4A7C15ULL;

	HANDLE	hThread = GetCurrentThread();
	HANDLE	hOutput = GetStdHandle(STD_OUTPUT_HANDLE);

	if (hOutput == INVALID_HANDLE_VALUE)
		ExitProcess(GetLastError());

	if (SetThreadIdealProcessor(hThread, 0UL) == -1L)
		ExitProcess(GetLastError());

	__cpuid(dwCPUID, 0x80000000UL);

	if (*dwCPUID >= 0x80000004UL)
	{
		__cpuid(dwCPUID, 0x80000002UL);
		__cpuid(dwCPUID + 4, 0x80000003UL);
		__cpuid(dwCPUID + 8, 0x80000004UL);
	}
	else
		strcpy((LPSTR) dwCPUID, "undetermined processor");
#ifndef NATIVE
#ifndef INTERN
	PrintConsole(hOutput, L"\nTesting 64-bit assembly division...\n");
#else
	PrintConsole(hOutput, L"\nTesting 64-bit C division...\n");
#endif
	for (dw = 0UL; dw < sizeof(qwVector) / sizeof(*qwVector); dw++)
	{
		PrintConsole(hOutput, L"\r%lu", dw);

		qwQuotient = __udivmoddi4(qwVector[dw].qwDividend, qwVector[dw].qwDivisor, &qwRemainder);

		if (qwQuotient != qwVector[dw].qwQuotient)
			PrintConsole(hOutput, L"\t%I64u / %I64u:\a quotient %I64u not equal %I64u\n",
			             qwVector[dw].qwDividend, qwVector[dw].qwDivisor, qwQuotient, qwVector[dw].qwQuotient);

		if (qwQuotient > qwVector[dw].qwDividend)
			PrintConsole(hOutput, L"\t%I64u / %I64u:\a quotient %I64u greater dividend\n",
			             qwVector[dw].qwDividend, qwVector[dw].qwDivisor, qwQuotient);

		if (qwRemainder != qwVector[dw].qwRemainder)
			PrintConsole(hOutput, L"\t%I64u / %I64u:\a remainder %I64u not equal %I64u\n",
			             qwVector[dw].qwDividend, qwVector[dw].qwDivisor, qwRemainder, qwVector[dw].qwRemainder);

		if (qwRemainder >= qwVector[dw].qwDivisor)
			PrintConsole(hOutput, L"\t%I64u %% %I64u:\a remainder %I64u not less divisor\n",
			             qwVector[dw].qwDividend, qwVector[dw].qwDivisor, qwRemainder);
	}
#ifndef INTERN
	PrintConsole(hOutput, L"\nTiming 64-bit assembly division on %.48hs\n", dwCPUID);
#else
	PrintConsole(hOutput, L"\nTiming 64-bit C division on %.48hs\n", dwCPUID);
#endif
#else
	PrintConsole(hOutput, L"\nTiming 64-bit native division on %.48hs\n", dwCPUID);
#endif // NATIVE
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT0))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT0))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from George Marsaglia

		qwLeft ^= qwLeft << 14;
		qwLeft ^= qwLeft >> 31;
		qwLeft ^= qwLeft << 45;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0xAD93D23594C935A9

		qwLeft = (qwLeft << 1)
		       ^ (((LONGLONG) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
#endif
		qwQuotient = __unopdi4(qwLeft, qwRight, NULL);
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from Richard Peirce Brent

		qwRight ^= qwRight << 10;
		qwRight ^= qwRight >> 15;
		qwRight ^= qwRight << 4;
		qwRight ^= qwRight >> 13;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0x2B5926535897936B

		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
#endif
		qwQuotient = __unopdi4(qwLeft, qwRight, &qwRemainder);
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT1))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT1))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from George Marsaglia

		qwLeft ^= qwLeft << 14;
		qwLeft ^= qwLeft >> 31;
		qwLeft ^= qwLeft << 45;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0xAD93D23594C935A9

		qwLeft = (qwLeft << 1)
		       ^ (((LONGLONG) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
#endif
		qwQuotient = __udivmoddi4(qwLeft, qwRight, NULL);
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from Richard Peirce Brent

		qwRight ^= qwRight << 10;
		qwRight ^= qwRight >> 15;
		qwRight ^= qwRight << 4;
		qwRight ^= qwRight >> 13;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0x2B5926535897936B

		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
#endif
		qwQuotient = __udivmoddi4(qwLeft, qwRight, &qwRemainder);
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT2))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT2))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from George Marsaglia

		qwLeft ^= qwLeft << 14;
		qwLeft ^= qwLeft >> 31;
		qwLeft ^= qwLeft << 45;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0xAD93D23594C935A9

		qwLeft = (qwLeft << 1)
		       ^ (((LONGLONG) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
#endif
		qwQuotient = __umuldi4(qwLeft, qwRight, NULL);
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from Richard Peirce Brent

		qwRight ^= qwRight << 10;
		qwRight ^= qwRight >> 15;
		qwRight ^= qwRight << 4;
		qwRight ^= qwRight >> 13;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0x2B5926535897936B

		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
#endif
		qwQuotient = __umuldi4(qwLeft, qwRight, &qwRemainder);
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT3))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT3))
#endif
		ExitProcess(GetLastError());

	qwTz = qwT3 - qwT0;
	qwT3 -= qwT2;
	qwT2 -= qwT1;
	qwT1 -= qwT0;
	qwTy = qwT3 > qwT1 ? qwT3 - qwT1 : 0ULL;
	qwTx = qwT2 - qwT1;
#ifdef CYCLES
	PrintConsole(hOutput, L"\n"
	                      L"__unopdi4()     %6I64u.%09I64u      0\n"
	                      L"__udivmoddi4()  %6I64u.%09I64u %6I64u.%09I64u\n"
	                      L"__umuldi3()     %6I64u.%09I64u %6I64u.%09I64u\n"
	                      L"                %6I64u.%09I64u clock cycles\n",
	             qwT1 / 1000000000ULL, qwT1 % 1000000000ULL,
	             qwT2 / 1000000000ULL, qwT2 % 1000000000ULL,
	             qwTx / 1000000000ULL, qwTx % 1000000000ULL,
	             qwT3 / 1000000000ULL, qwT3 % 1000000000ULL,
	             qwTy / 1000000000ULL, qwTy % 1000000000ULL,
	             qwTz / 1000000000ULL, qwTz % 1000000000ULL);
#else
	PrintConsole(hOutput, L"\n"
	                      L"__unopdi4()     %6I64u.%07I64u      0\n"
	                      L"__udivmoddi4()  %6I64u.%07I64u %6I64u.%07I64u\n"
	                      L"__umuldi3()     %6I64u.%07I64u %6I64u.%07I64u\n"
	                      L"                %6I64u.%07I64u nano-seconds\n",
	             qwT1 / 10000000ULL, qwT1 % 10000000ULL,
	             qwT2 / 10000000ULL, qwT2 % 10000000ULL,
	             qwTx / 10000000ULL, qwTx % 10000000ULL,
	             qwT3 / 10000000ULL, qwT3 % 10000000ULL,
	             qwTy / 10000000ULL, qwTy % 10000000ULL,
	             qwTz / 10000000ULL, qwTz % 10000000ULL);
#endif
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT0))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT0))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from George Marsaglia

		qwLeft ^= qwLeft << 14;
		qwLeft ^= qwLeft >> 31;
		qwLeft ^= qwLeft << 45;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0xAD93D23594C935A9

		qwLeft = (qwLeft << 1)
		       ^ (((LONGLONG) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
#endif
		qwDividend = qwLeft >> (qwLeft & 31ULL);
		qwQuotient = __unopdi4(qwDividend, qwDivisor, &qwRemainder);
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from Richard Peirce Brent

		qwRight ^= qwRight << 10;
		qwRight ^= qwRight >> 15;
		qwRight ^= qwRight << 4;
		qwRight ^= qwRight >> 13;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0x2B5926535897936B

		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
#endif
		qwDivisor = qwRight >> (qwRight & 31ULL);
		qwQuotient = __unopdi4(qwDividend, qwDivisor, &qwRemainder);
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT1))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT1))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from George Marsaglia

		qwLeft ^= qwLeft << 14;
		qwLeft ^= qwLeft >> 31;
		qwLeft ^= qwLeft << 45;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0xAD93D23594C935A9

		qwLeft = (qwLeft << 1)
		       ^ (((LONGLONG) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
#endif
		qwDividend = qwLeft >> (qwLeft & 31ULL);
		qwQuotient = __udivmoddi4(qwDividend, qwDivisor, &qwRemainder);
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from Richard Peirce Brent

		qwRight ^= qwRight << 10;
		qwRight ^= qwRight >> 15;
		qwRight ^= qwRight << 4;
		qwRight ^= qwRight >> 13;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0x2B5926535897936B

		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
#endif
		qwDivisor = qwRight >> (qwRight & 31ULL);
		qwQuotient = __udivmoddi4(qwDividend, qwDivisor, &qwRemainder);
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT2))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT2))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from George Marsaglia

		qwLeft ^= qwLeft << 14;
		qwLeft ^= qwLeft >> 31;
		qwLeft ^= qwLeft << 45;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0xAD93D23594C935A9

		qwLeft = (qwLeft << 1)
		       ^ (((LONGLONG) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
#endif
		qwDividend = qwLeft >> (qwLeft & 31ULL);
		qwQuotient = __umuldi4(qwDividend, qwDivisor, &qwRemainder);
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from Richard Peirce Brent

		qwRight ^= qwRight << 10;
		qwRight ^= qwRight >> 15;
		qwRight ^= qwRight << 4;
		qwRight ^= qwRight >> 13;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0x2B5926535897936B

		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
#endif
		qwDivisor = qwRight >> (qwRight & 31ULL);
		qwQuotient = __umuldi4(qwDividend, qwDivisor, &qwRemainder);
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT3))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT3))
#endif
		ExitProcess(GetLastError());

	qwTz = qwT3 - qwT0;
	qwT3 -= qwT2;
	qwT2 -= qwT1;
	qwT1 -= qwT0;
	qwTy = qwT3 > qwT1 ? qwT3 - qwT1 : 0ULL;
	qwTx = qwT2 - qwT1;
#ifdef CYCLES
	PrintConsole(hOutput, L"\n"
	                      L"__unopdi4()     %6I64u.%09I64u      0\n"
	                      L"__udivmoddi4()  %6I64u.%09I64u %6I64u.%09I64u\n"
	                      L"__umuldi3()     %6I64u.%09I64u %6I64u.%09I64u\n"
	                      L"                %6I64u.%09I64u clock cycles\n",
	             qwT1 / 1000000000ULL, qwT1 % 1000000000ULL,
	             qwT2 / 1000000000ULL, qwT2 % 1000000000ULL,
	             qwTx / 1000000000ULL, qwTx % 1000000000ULL,
	             qwT3 / 1000000000ULL, qwT3 % 1000000000ULL,
	             qwTy / 1000000000ULL, qwTy % 1000000000ULL,
	             qwTz / 1000000000ULL, qwTz % 1000000000ULL);
#else
	PrintConsole(hOutput, L"\n"
	                      L"__unopdi4()     %6I64u.%07I64u      0\n"
	                      L"__udivmoddi4()  %6I64u.%07I64u %6I64u.%07I64u\n"
	                      L"__umuldi3()     %6I64u.%07I64u %6I64u.%07I64u\n"
	                      L"                %6I64u.%07I64u nano-seconds\n",
	             qwT1 / 10000000ULL, qwT1 % 10000000ULL,
	             qwT2 / 10000000ULL, qwT2 % 10000000ULL,
	             qwTx / 10000000ULL, qwTx % 10000000ULL,
	             qwT3 / 10000000ULL, qwT3 % 10000000ULL,
	             qwTy / 10000000ULL, qwTy % 10000000ULL,
	             qwTz / 10000000ULL, qwTz % 10000000ULL);
#endif
	ExitProcess(0UL);
}
Save the first C source presented above as 128-amd64.c and the second C source as 64-amd64.c in an arbitrary, preferably empty directory; save the second assembly source presented above as udivmodti4.asm, the fourth assembly source as udivmodti4-hybrid.asm and the fifth assembly source as udivmoddi4.asm in this directory too. Then start the command prompt of the Windows software development kit for the AMD64 platform there and run the following command lines to assemble, compile and link the benchmark programs 64.exe, 64-div.exe, 128.exe and 128-hybrid.exe, and to execute them:
ML64.EXE /c /W3 /X udivmoddi4.asm
CL.EXE /c /DCYCLES /GAFy /O2y /W4 /Zl 64-amd64.c
LINK.EXE /DYNAMICBASE /ENTRY:wmainCRTStartup /MACHINE:AMD64 /NODEFAULTLIB /NXCOMPAT /OPT:REF /OUT:64.exe /SUBSYSTEM:CONSOLE 64-amd64.obj udivmoddi4.obj kernel32.lib user32.lib
CL.EXE /c /DCYCLES /DNATIVE /GAFy /O2y /W4 /Zl 64-amd64.c
LINK.EXE /DYNAMICBASE /ENTRY:wmainCRTStartup /MACHINE:AMD64 /NODEFAULTLIB /NXCOMPAT /OPT:REF /OUT:64-div.exe /SUBSYSTEM:CONSOLE 64-amd64.obj kernel32.lib user32.lib
ML64.EXE /c /DJMPLESS /W3 /X udivmodti4.asm
ML64.EXE /c /W3 /X udivmodti4-hybrid.asm
CL.EXE /c /DCYCLES /GAFy /O2y /W4 /Zl 128-amd64.c
LINK.EXE /DYNAMICBASE /ENTRY:wmainCRTStartup /MACHINE:AMD64 /NODEFAULTLIB /NXCOMPAT /OPT:REF /OUT:128.exe /SUBSYSTEM:CONSOLE 128-amd64.obj udivmodti4.obj kernel32.lib user32.lib
LINK.EXE /DYNAMICBASE /ENTRY:wmainCRTStartup /MACHINE:AMD64 /NODEFAULTLIB /NXCOMPAT /OPT:REF /OUT:128-hybrid.exe /SUBSYSTEM:CONSOLE 128-amd64.obj udivmodti4-hybrid.obj kernel32.lib user32.lib
.\128.exe
.\128-hybrid.exe
.\64.exe
.\64-div.exe
Note: the command lines can be copied and pasted as a block into the Command Processor window!

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: all 64-bit programs are pure Win32 console applications and are built without the MSVCRT libraries.
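
Note: the INTERN variant of __udivmoddi4() in 64-amd64.c depends on the _BitScanReverse64() intrinsic of the Microsoft compiler. For readers who want to follow its shift & subtract loop with another toolchain, a minimal portable sketch might look as follows. It is not one of the sources to save or build here; the names msb_index_sketch() and udivmod64_sketch() and the simple bit-counting loop are assumptions made for illustration only, and the divisor is assumed to be non-zero:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical stand-in for _BitScanReverse64(): returns the index (0..63)
// of the most significant '1' bit; value must be non-zero.
unsigned msb_index_sketch(uint64_t value)
{
	unsigned index = 0;

	while (value >>= 1)
		index++;

	return index;
}

// Hypothetical portable counterpart of the INTERN __udivmoddi4():
// returns the quotient and stores the remainder if the pointer is not NULL.
uint64_t udivmod64_sketch(uint64_t dividend, uint64_t divisor, uint64_t *remainder)
{
	uint64_t quotient = 0;

	assert(divisor != 0);	// assumption: the caller never divides by zero

	if (dividend >= divisor)
	{
		// dividend >= divisor > 0, so both indices exist
		//  and index1 >= index2
		unsigned index1 = msb_index_sketch(dividend);
		unsigned index2 = msb_index_sketch(divisor);

		// align the most significant '1' bit of the divisor
		//  with that of the dividend
		divisor <<= index1 - index2;

		do
		{
			quotient <<= 1;

			if (dividend >= divisor)
			{
				dividend -= divisor;
				quotient |= 1;
			}

			divisor >>= 1;
		} while (index1 >= ++index2);
	}

	// here the (reduced) dividend is the remainder
	if (remainder != NULL)
		*remainder = dividend;

	return quotient;
}
```

With a dividend of 0xFFFFFFFFFFFFFFFF and a divisor of 3, for example, the indices are 63 and 1, the loop runs 63 times, and the sketch yields the quotient 0x5555555555555555 with remainder 0.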

Microsoft (R) Macro Assembler (x64) Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: udivmoddi4.asm

Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

64-amd64.c

Microsoft (R) Incremental Linker Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

64-amd64.c

Microsoft (R) Incremental Linker Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

Microsoft (R) Macro Assembler (x64) Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: udivmodti4.asm

Microsoft (R) Macro Assembler (x64) Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: udivmodti4-hybrid.asm

Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

128-amd64.c
128-amd64.c(201) : warning C4244: 'function' : conversion from 'QWORD' to 'BYTE', possible loss of data
128-amd64.c(273) : warning C4090: 'function' : different 'const' qualifiers
128-amd64.c(273) : warning C4090: 'function' : different 'const' qualifiers

Microsoft (R) Incremental Linker Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

Microsoft (R) Incremental Linker Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

Testing 128-bit division...
80
Timing 128-bit division on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz

__unopti4()          8.406722043      0
__udivmodti4()      68.135103197     59.728381154
__umulti3()         15.426767694      7.020045651
                    91.968592934 clock cycles

__unopti4()         10.955079756      0
__udivmodti4()      71.382337619     60.427257863
__umulti3()         20.658580753      9.703500997
                   102.995998128 clock cycles

Testing 128-bit division...
80
Timing 128-bit division on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz

__unopti4()          8.500168812      0
__udivmodti4()      35.628062934     27.127894122
__umulti3()         15.499977787      6.999808975
                    59.628209533 clock cycles

__unopti4()         10.962429071      0
__udivmodti4()     127.868276342    116.905847271
__umulti3()         20.865134980      9.902705909
                   159.695840393 clock cycles

Testing 64-bit assembly division...
57
Timing 64-bit assembly division on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz

__unopdi4()          7.046572321      0
__udivmoddi4()      39.744549176     32.697976855
__umuldi3()          8.225293991      1.178721670
                    55.016415488 clock cycles

__unopdi4()          7.939823193      0
__udivmoddi4()      67.565681569     59.625858376
__umuldi3()          8.377724642      0.437901449
                    83.883229404 clock cycles

Timing 64-bit native division on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz

__unopdi4()          7.267861652      0
__udivmoddi4()      35.352622330     28.084760678
__umuldi3()          7.988646681      0.720785029
                    50.609130663 clock cycles

__unopdi4()          7.972264793      0
__udivmoddi4()      37.397281198     29.425016405
__umuldi3()          8.457423360      0.485158567
                    53.826969351 clock cycles
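
The figures in each output block follow directly from the four counter readings which the benchmark takes around its three loops. The sketch below replays that arithmetic in portable C; it is not one of the sources to be saved, and the names timing_sketch, evaluate_sketch() and print_figure_sketch() are made up for this illustration. Each loop performs 500000000 iterations with 2 calls, so a counter difference divided by 1000000000 gives clock cycles per call; without CYCLES, GetThreadTimes() counts in units of 100 ns, hence the divisor 10000000 for nano-seconds per call.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define CALLS 1000000000ULL	// 500000000 loop iterations x 2 calls each

// Hypothetical container for the figures of one output block;
// all members are raw counter differences, i.e. per-call values times CALLS
typedef struct
{
	uint64_t noop;		// __unopdi4() row
	uint64_t divide;	// __udivmoddi4() row, first column
	uint64_t divide_net;	// __udivmoddi4() row, second column
	uint64_t multiply;	// multiplication row, first column
	uint64_t multiply_net;	// multiplication row, second column
	uint64_t total;		// last row
} timing_sketch;

// t0..t3 are the counter readings before the first loop
// and after each of the three loops
timing_sketch evaluate_sketch(uint64_t t0, uint64_t t1, uint64_t t2, uint64_t t3)
{
	timing_sketch result;

	result.total = t3 - t0;
	result.noop = t1 - t0;
	result.divide = t2 - t1;
	result.multiply = t3 - t2;

	// subtract the cost of operand generation and call overhead,
	//  measured with the no-op routine
	result.divide_net = result.divide - result.noop;

	// the multiplication may run as fast as the no-op: clamp at 0
	result.multiply_net = result.multiply > result.noop
	                    ? result.multiply - result.noop : 0;

	return result;
}

// print one figure as clock cycles per call with 9 decimal places
void print_figure_sketch(uint64_t value)
{
	printf("%6" PRIu64 ".%09" PRIu64 "\n", value / CALLS, value % CALLS);
}
```

Fed with readings that reproduce the first block of 64.exe above (7.046572321 for the no-op and 39.744549176 for the division), the net figure for __udivmoddi4() comes out as 32.697976855; a multiplication row that is not slower than the no-op row is clamped to 0.000000000, as in several listings of this section.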
Now without the preprocessor macro CYCLES defined:
[…]

Testing 128-bit division...
80
Timing 128-bit division on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz

__unopti4()          3.3852217      0
__udivmodti4()      27.3937756     24.0085539
__umulti3()          6.2556401      2.8704184
                    37.0346374 nano-seconds

__unopti4()          4.3992282      0
__udivmodti4()      28.3453817     23.9461535
__umulti3()          8.3772537      3.9780255
                    41.1218636 nano-seconds

Testing 128-bit division...
80
Timing 128-bit division on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz

__unopti4()          3.3852217      0
__udivmodti4()      14.5080930     11.1228713
__umulti3()          6.2868403      2.9016186
                    24.1801550 nano-seconds

__unopti4()          4.3368278      0
__udivmodti4()      51.9015327     47.5647049
__umulti3()          8.3772537      4.0404259
                    64.6156142 nano-seconds

Testing 64-bit assembly division...
57
Timing 64-bit assembly division on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz

__unopdi4()          3.0264194      0
__udivmoddi4()      14.6952942     11.6688748
__umuldi3()          3.1512202      0.1248008
                    20.8729338 nano-seconds

__unopdi4()          3.0888198      0
__udivmoddi4()      26.7229713     23.6341515
__umuldi3()          3.5256226      0.4368028
                    33.3374137 nano-seconds

Timing 64-bit native division on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz

__unopdi4()          2.7300175      0
__udivmoddi4()      14.4144924     11.6844749
__umuldi3()          3.1356201      0.4056026
                    20.2801300 nano-seconds

__unopdi4()          3.1044199      0
__udivmoddi4()      15.1320970     12.0276771
__umuldi3()          3.5100225      0.4056026
                    21.7465394 nano-seconds
[…]

Testing 128-bit division...
80
Timing 128-bit division on Intel(R) Core(TM) i5-4670 CPU @ 3.40GHz

__unopti4()          6.787912997      0
__udivmodti4()      59.706410079     52.918497082
__umulti3()          9.539191229      2.751278232
                    76.033514305 clock cycles

__unopti4()          9.064762228      0
__udivmodti4()      64.037532443     54.972770215
__umulti3()         12.381997632      3.317235404
                    85.484292303 clock cycles

Testing 128-bit division...
80
Timing 128-bit division on Intel(R) Core(TM) i5-4670 CPU @ 3.40GHz

__unopti4()          6.721662277      0
__udivmodti4()      29.001432494     22.279770217
__umulti3()          9.428926160      2.707263883
                    45.152020931 clock cycles

__unopti4()          8.960905247      0
__udivmodti4()      83.353539778     74.392634531
__umulti3()         12.205547423      3.244642176
                   104.519992448 clock cycles

Testing 64-bit assembly division...
57
Timing 64-bit assembly division on Intel(R) Core(TM) i5-4670 CPU @ 3.40GHz

__unopdi4()          5.247617465      0
__udivmoddi4()      24.890194689     19.642577224
__umuldi3()          6.295783447      1.048165982
                    36.433595601 clock cycles

__unopdi4()          5.941744662      0
__udivmoddi4()      44.505047583     38.563302921
__umuldi3()          7.127920907      1.186176245
                    57.574713152 clock cycles

Timing 64-bit native division on Intel(R) Core(TM) i5-4670 CPU @ 3.40GHz

__unopdi4()          4.928616197      0
__udivmoddi4()      36.111315035     31.182698838
__umuldi3()          6.104300272      1.175684075
                    47.144231504 clock cycles

__unopdi4()          5.832601506      0
__udivmoddi4()      37.489066778     31.656465272
__umuldi3()          6.694749067      0.862147561
                    50.016417351 clock cycles
[…]

Testing 128-bit division...
80
Timing 128-bit division on Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz

__unopti4()          5.739886508      0
__udivmodti4()      60.265247522     54.525361014
__umulti3()          8.030493537      2.290607029
                    74.035627567 clock cycles

__unopti4()          8.376397925      0
__udivmodti4()      63.878099605     55.501701680
__umulti3()         10.674320936      2.297923011
                    82.928818466 clock cycles

Testing 128-bit division...
80
Timing 128-bit division on Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz

__unopti4()          5.759768704      0
__udivmodti4()      24.207704489     18.447935785
__umulti3()          7.991973095      2.232204391
                    37.959446288 clock cycles

__unopti4()          8.356751289      0
__udivmodti4()      73.164876383     64.808125094
__umulti3()         10.667141168      2.310389879
                    92.188768840 clock cycles

Testing 64-bit assembly division...
57
Timing 64-bit assembly division on Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz

__unopdi4()          5.215172319      0
__udivmoddi4()      20.464980809     15.249808490
__umuldi3()          4.339737255      0.000000000
                    30.019890383 clock cycles

__unopdi4()          6.034145232      0
__udivmoddi4()      37.823775351     31.789630119
__umuldi3()          5.595748061      0.000000000
                    49.453668644 clock cycles

Timing 64-bit native division on Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz

__unopdi4()          4.311459182      0
__udivmoddi4()      32.456528199     28.145069017
__umuldi3()          5.798396574      1.486937392
                    42.566383955 clock cycles

__unopdi4()          5.594526971      0
__udivmoddi4()      34.625131407     29.030604436
__umuldi3()          5.613384251      0.018857280
                    45.833042629 clock cycles
[…]

Testing 128-bit division...
80
Timing 128-bit division on Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz

__unopti4()          3.384227961      0
__udivmodti4()      34.535941576     31.151713615
__umulti3()          4.561600376      1.177372415
                    42.481769913 clock cycles

__unopti4()          4.958807640      0
__udivmodti4()      36.796688055     31.837880415
__umulti3()          6.071705006      1.112897366
                    47.827200701 clock cycles

Testing 128-bit division...
80
Timing 128-bit division on Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz

__unopti4()          3.285746669      0
__udivmodti4()      14.265474814     10.979728145
__umulti3()          4.595527857      1.309781188
                    22.146749340 clock cycles

__unopti4()          4.911969125      0
__udivmodti4()      42.153665292     37.241696167
__umulti3()          6.065414902      1.153445777
                    53.131049319 clock cycles

Testing 64-bit assembly division...
57
Timing 64-bit assembly division on Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz

__unopdi4()          4.833738399      0
__udivmoddi4()      16.775388790     11.941650391
__umuldi3()          2.665548625      0.000000000
                    24.274675814 clock cycles

__unopdi4()          4.713733765      0
__udivmoddi4()      25.241120837     20.527387072
__umuldi3()          3.737127811      0.000000000
                    33.691982413 clock cycles

Timing 64-bit native division on Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz

__unopdi4()          5.110940314      0
__udivmoddi4()      33.205803994     28.094863680
__umuldi3()          5.463078377      0.352138063
                    43.779822685 clock cycles

__unopdi4()          5.326425239      0
__udivmoddi4()      32.876005661     27.549580422
__umuldi3()          5.327849707      0.001424468
                    43.530280607 clock cycles
[…]

Testing 128-bit division...
80
Timing 128-bit division on AMD Ryzen 7 2700X Eight-Core Processor

__unopti4()          6.486039210      0
__udivmodti4()      26.669040891     20.183001681
__umulti3()          9.301024928      2.814985718
                    42.456105029 clock cycles

__unopti4()          9.463028389      0
__udivmodti4()      28.686989038     19.223960649
__umulti3()         12.162446193      2.699417804
                    50.312463620 clock cycles

Testing 128-bit division...
80
Timing 128-bit division on AMD Ryzen 7 2700X Eight-Core Processor

__unopti4()          6.759914430      0
__udivmodti4()      29.315665619     22.555751189
__umulti3()          9.908795028      3.148880598
                    45.984375077 clock cycles

__unopti4()         10.063938544      0
__udivmodti4()      73.125239046     63.061300502
__umulti3()         12.680222751      2.616284207
                    95.869400341 clock cycles

Timing 64-bit native division on AMD Ryzen 7 2700X Eight-Core Processor

__unopdi4()          5.242663759      0
__udivmoddi4()      19.351564974     14.108901215
__umuldi3()          6.522515592      1.279851833
                    31.116744325 clock cycles

__unopdi4()          5.654228263      0
__udivmoddi4()      22.197831810     16.543603547
__umuldi3()          6.958791467      1.304563204
                    34.810851540 clock cycles

Testing 64-bit assembly division...
57
Timing 64-bit assembly division on AMD Ryzen 7 2700X Eight-Core Processor

__unopdi4()          5.169526411      0
__udivmoddi4()      26.320155604     21.150629193
__umuldi3()          5.166327910      0.000000000
                    36.656009925 clock cycles

__unopdi4()          5.596172042      0
__udivmoddi4()      47.084314600     41.488142558
__umuldi3()          5.595622223      0.000000000
                    58.276108865 clock cycles
[…]

Testing 128-bit division...
80
Timing 128-bit division on AMD Ryzen 5 3600 6-Core Processor

__unopti4()          6.898510224      0
__udivmodti4()      26.925326748     20.026816524
__umulti3()          9.177466284      2.278956060
                    43.001303256 clock cycles

__unopti4()          9.476578368      0
__udivmodti4()      29.322601849     19.846023481
__umulti3()         11.710555056      2.233976688
                    50.509735273 clock cycles

Testing 128-bit division...
80
Timing 128-bit division on AMD Ryzen 5 3600 6-Core Processor

__unopti4()          6.865702096      0
__udivmodti4()      27.542023885     20.676321789
__umulti3()          9.108802297      2.243100201
                    43.516528278 clock cycles

__unopti4()          9.442571687      0
__udivmodti4()      68.794109504     59.351537817
__umulti3()         11.703519360      2.260947673
                    89.940200551 clock cycles

Testing 64-bit assembly division...
57
Timing 64-bit assembly division on AMD Ryzen 5 3600 6-Core Processor

__unopdi4()          4.722583824      0
__udivmoddi4()      26.829651312     22.107067488
__umuldi3()          4.722048143      0.000000000
                    36.274283279 clock cycles

__unopdi4()          5.156534846      0
__udivmoddi4()      46.419813577     41.263278731
__umuldi3()          5.153521140      0.000000000
                    56.729869563 clock cycles

Timing 64-bit native division on AMD Ryzen 5 3600 6-Core Processor

__unopdi4()          4.721372604      0
__udivmoddi4()      19.197411303     14.476038699
__umuldi3()          5.582367577      0.860994973
                    29.501151484 clock cycles

__unopdi4()          5.153817924      0
__udivmoddi4()      22.015109233     16.861291309
__umuldi3()          6.009193188      0.855375264
                    33.178120345 clock cycles
And without the preprocessor macro CYCLES defined:
[…]

Testing 128-bit division...
80
Timing 128-bit division on AMD Ryzen 5 3600 6-Core Processor

__unopti4()          2.0000000      0
__udivmodti4()      10.7343750      8.7343750
__umulti3()          2.5312500      0.5312500
                    15.2656250 nano-seconds

__unopti4()          2.6250000      0
__udivmodti4()       8.1093750      5.4843750
__umulti3()          3.2031250      0.5781250
                    13.9375000 nano-seconds

Testing 128-bit division...
80
Timing 128-bit division on AMD Ryzen 5 3600 6-Core Processor

__unopti4()          2.0156250      0
__udivmodti4()       8.7500000      6.7343750
__umulti3()          2.5312500      0.5156250
                    13.2968750 nano-seconds

__unopti4()          2.6250000      0
__udivmodti4()      20.0468750     17.4218750
__umulti3()          3.1875000      0.5625000
                    25.8593750 nano-seconds

Testing 64-bit assembly division...
57
Timing 64-bit assembly division on AMD Ryzen 5 3600 6-Core Processor

__unopdi4()          1.3125000      0
__udivmoddi4()       7.2812500      5.9687500
__umuldi3()          1.3281250      0.0156250
                     9.9218750 nano-seconds

__unopdi4()          1.4375000      0
__udivmoddi4()      12.3281250     10.8906250
__umuldi3()          1.3125000      0.0000000
                    15.0781250 nano-seconds

Timing 64-bit native division on AMD Ryzen 5 3600 6-Core Processor

__unopdi4()          1.3125000      0
__udivmoddi4()       5.3281250      4.0156250
__umuldi3()          1.5625000      0.2500000
                     8.2031250 nano-seconds

__unopdi4()          1.4218750      0
__udivmoddi4()       6.1250000      4.7031250
__umuldi3()          1.6718750      0.2500000
                     9.2187500 nano-seconds
[…]

Testing 128-bit division...
80
Timing 128-bit division on AMD Ryzen 9 3900XT 12-Core Processor

__unopti4()          6.639901585      0
__udivmodti4()      25.407730112     18.767828527
__umulti3()          8.787561378      2.147659793
                    40.835193075 clock cycles

__unopti4()          9.009978790      0
__udivmodti4()      27.641956656     18.631977866
__umulti3()         11.091630160      2.081651370
                    47.743565606 clock cycles

Testing 128-bit division...
80
Timing 128-bit division on AMD Ryzen 9 3900XT 12-Core Processor

__unopti4()          6.564753997      0
__udivmodti4()      26.123169316     19.558415319
__umulti3()          8.788291128      2.223537131
                    41.476214441 clock cycles

__unopti4()          9.088542617      0
__udivmodti4()      65.108671217     56.020128600
__umulti3()         11.086648437      1.998105820
                    85.283862271 clock cycles

Testing 64-bit assembly division...
57
Timing 64-bit assembly division on AMD Ryzen 9 3900XT 12-Core Processor

__unopdi4()          4.487864035      0
__udivmoddi4()      25.416913553     20.929049518
__umuldi3()          4.530544610      0.042680575
                    34.435322198 clock cycles

__unopdi4()          4.909401335      0
__udivmoddi4()      43.696358312     38.786956977
__umuldi3()          4.915678491      0.006277156
                    53.521438138 clock cycles

Timing 64-bit native division on AMD Ryzen 9 3900XT 12-Core Processor

__unopdi4()          4.486075560      0
__udivmoddi4()      17.826014215     13.339938655
__umuldi3()          5.293388605      0.807313045
                    27.605478380 clock cycles

__unopdi4()          4.913181349      0
__udivmoddi4()      20.468841039     15.555659690
__umuldi3()          5.707274649      0.794093300
                    31.089297037 clock cycles

And without the preprocessor macro CYCLES defined:

[…]

Testing 128-bit division...
80
Timing 128-bit division on AMD Ryzen 9 3900XT 12-Core Processor

__unopti4()          1.8437500      0
__udivmodti4()       6.7187500      4.8750000
__umulti3()          2.3125000      0.4687500
                    10.8750000 nano-seconds

__unopti4()          2.3906250      0
__udivmodti4()       7.2343750      4.8437500
__umulti3()          2.9531250      0.5625000
                    12.5781250 nano-seconds

Testing 128-bit division...
80
Timing 128-bit division on AMD Ryzen 9 3900XT 12-Core Processor

__unopti4()          1.7968750      0
__udivmodti4()       7.1875000      5.3906250
__umulti3()          2.3125000      0.5156250
                    11.2968750 nano-seconds

__unopti4()          2.3437500      0
__udivmodti4()      17.2812500     14.9375000
__umulti3()          2.9062500      0.5625000
                    22.5312500 nano-seconds

Testing 64-bit assembly division...
57
Timing 64-bit assembly division on AMD Ryzen 9 3900XT 12-Core Processor

__unopdi4()          1.1718750      0
__udivmoddi4()       6.5468750      5.3750000
__umuldi3()          1.3281250      0.1562500
                     9.0468750 nano-seconds

__unopdi4()          1.2968750      0
__udivmoddi4()      11.0937500      9.7968750
__umuldi3()          1.1562500      0.0000000
                    13.5468750 nano-seconds

Timing 64-bit native division on AMD Ryzen 9 3900XT 12-Core Processor

__unopdi4()          1.2031250      0
__udivmoddi4()       4.7031250      3.5000000
__umuldi3()          1.3593750      0.1562500
                     7.2656250 nano-seconds

__unopdi4()          1.2968750      0
__udivmoddi4()       5.4062500      4.1093750
__umuldi3()          1.6406250      0.3437500
                     8.3437500 nano-seconds

Benchmark Program for i386 Processors

The following C program for 32-bit processors measures the execution time for one billion divisions of uniformly distributed 64-bit pseudo-random numbers and for one billion divisions of pseudo-random numbers in the interval from 2⁶⁴−1 to 2³² with the __udivmoddi4() function.
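
The second set of operands can be pictured with a minimal sketch like the following. It assumes that the shift count is taken from the value's own low 5 bits, as the commented-out `& 31` in the program suggests; the program itself uses the __ull_rshift() intrinsic, and reduce_magnitude() is a hypothetical name introduced only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of the benchmark program:
   shift a 64-bit value right by its own low 5 bits (0 to 31 positions).
   ASSUMPTION: the i386 shift intrinsic reduces the count modulo 32,
   which the explicit mask reproduces in portable C. */
static uint64_t reduce_magnitude(uint64_t value)
{
    return value >> (value & 31u);
}
```

Under this assumption, every input with its most significant bit set yields a result between 2³² and 2⁶⁴−1, so the operands keep a varying number of leading zero bits.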

Note: with the preprocessor macro HELPER defined, it uses the compiler helper routines _alldiv(), _alldvrm(), _allmul(), _allrem(), _aulldiv(), _aulldvrm() and _aullrem() instead, which the Microsoft Visual C compiler calls to perform 64-bit division and multiplication.
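
For orientation, the relation between the signed and the unsigned routines can be sketched in portable C as below. This is not the assembler implementation presented in this article, only an illustration under the usual C rules (quotient truncated toward zero, remainder with the sign of the dividend); sdiv64() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: signed 64-bit division built on unsigned division.
   ASSUMPTION: converting an out-of-range uint64_t to int64_t wraps
   (two's complement), as it does with the common compilers. */
static int64_t sdiv64(int64_t dividend, int64_t divisor, int64_t *remainder)
{
    /* magnitudes, computed in unsigned arithmetic to avoid overflow */
    uint64_t n = dividend < 0 ? 0ULL - (uint64_t) dividend : (uint64_t) dividend;
    uint64_t d = divisor < 0 ? 0ULL - (uint64_t) divisor : (uint64_t) divisor;
    uint64_t q = n / d;
    uint64_t r = n % d;

    if ((dividend < 0) != (divisor < 0))
        q = 0ULL - q;           /* quotient is negative when the signs differ */

    if (dividend < 0)
        r = 0ULL - r;           /* remainder follows the sign of the dividend */

    if (remainder != NULL)
        *remainder = (int64_t) r;

    return (int64_t) q;
}
```

The sign checks in the test loops of the program below verify exactly these two properties.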

Note: with the preprocessor macro CYCLES defined, it measures the execution time in processor clock cycles and runs on Windows Vista® and newer versions, else it measures the execution time in nano-seconds and runs on all versions of Windows NT®.
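
The reduction of the raw counters to the printed figures can be sketched as follows. Each measurement covers one billion calls, so a cycle count divided by 10⁹ gives clock cycles per call, and a thread time in units of 100 nanoseconds divided by 10⁷ gives nanoseconds per call. The names split_per_call() and net_cost() are hypothetical; the program uses inline 64÷32-bit divisions instead.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers, not part of the benchmark program. */
struct per_call
{
    uint32_t whole;     /* digits before the decimal point */
    uint32_t fraction;  /* digits after the decimal point */
};

/* Split a raw counter into the two fields printed per line. */
static struct per_call split_per_call(uint64_t total, uint32_t scale)
{
    struct per_call result;

    result.whole = (uint32_t) (total / scale);
    result.fraction = (uint32_t) (total % scale);

    return result;
}

/* Subtract the cost of the no-op routine; clamp at 0 when the
   measured routine is not slower than the no-op routine. */
static uint64_t net_cost(uint64_t measured, uint64_t baseline)
{
    return measured > baseline ? measured - baseline : 0ULL;
}
```

With the figures of the first run shown above, a total of 43516528278 clock cycles splits into 43 and 516528278, printed as 43.516528278 clock cycles.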

Note: it uses the same pseudo-random number generators as the second C program for 64-bit processors, so their results are directly comparable.
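
The two default generators (used when the preprocessor macro XORSHIFT is not defined) are Galois linear feedback shift registers. A portable sketch of one step of each follows; the polynomials are the ones used in the program, while the names lfsr_left() and lfsr_right() are hypothetical, and the arithmetic right shift of the program is replaced by an unsigned negation with the same effect.

```c
#include <assert.h>
#include <stdint.h>

/* One step of the left-shifting Galois LFSR used for the dividend. */
static uint64_t lfsr_left(uint64_t state)
{
    /* all ones when the bit shifted out is set, else all zeros */
    uint64_t mask = 0ULL - (state >> 63);

    return (state << 1) ^ (mask & 0xAD93D23594C935A9ULL);
}

/* One step of the right-shifting Galois LFSR used for the divisor. */
static uint64_t lfsr_right(uint64_t state)
{
    /* all ones when the bit shifted out is set, else all zeros */
    uint64_t mask = 0ULL - (state & 1ULL);

    return (state >> 1) ^ (mask & 0x2B5926535897936BULL);
}
```

Both registers must be seeded with a non-zero value, since the state 0 maps to itself; the program seeds them with 0x28208A20A08A28AC and 0x9E3779B97F4A7C15.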

// Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifndef _M_IX86
#pragma message("For I386 platform only!")
#endif

#pragma comment(compiler)
#pragma comment(user, __TIMESTAMP__)

#define _CRT_SECURE_NO_WARNINGS
#define STRICT
#define UNICODE
#define WIN32_LEAN_AND_MEAN

#include <windows.h>

typedef	LONGLONG	SQWORD;
typedef	ULONGLONG	QWORD;

#define _(DIVIDEND, DIVISOR)	{(DIVIDEND), (DIVISOR), (DIVIDEND) / (DIVISOR), (DIVIDEND) % (DIVISOR)}

const	struct	_ull
{
	QWORD	ullDividend, ullDivisor, ullQuotient, ullRemainder;
} ullVector[] = {_(0x0000000000000000ULL, 0x0000000000000001ULL),
                 _(0x0000000000000001ULL, 0x0000000000000001ULL),
                 _(0x0000000000000002ULL, 0x0000000000000001ULL),
                 _(0x0000000000000002ULL, 0x0000000000000002ULL),
                 _(0x0000000000000000ULL, 0xFFFFFFFFFFFFFFFFULL),
                 _(0x0000000000000001ULL, 0xFFFFFFFFFFFFFFFFULL),
                 _(0x0000000000000001ULL, 0xFFFFFFFFFFFFFFFEULL),
                 _(0x0000000000000002ULL, 0xFFFFFFFFFFFFFFFEULL),
                 _(0x0000000000000003ULL, 0xFFFFFFFFFFFFFFFEULL),
                 _(0x0000000000000003ULL, 0xFFFFFFFFFFFFFFFDULL),
                 _(0x000000000FFFFFFFULL, 0x0000000000000001ULL),
                 _(0x0000000FFFFFFFFFULL, 0x000000000000000FULL),
                 _(0x0000000FFFFFFFFFULL, 0x0000000000000010ULL),
                 _(0x0000000000000100ULL, 0x000000000FFFFFFFULL),
                 _(0x00FFFFFFF0000000ULL, 0x0000000010000000ULL),
                 _(0x07FFFFFF80000000ULL, 0x0000000080000000ULL),
                 _(0x7FFFFFFEFFFFFFF0ULL, 0xFFFFFFFFFFFFFFFEULL),
                 _(0x7FFFFFFEFFFFFFF0ULL, 0x0000FFFFFFFFFFFEULL),
                 _(0x7FFFFFFEFFFFFFF0ULL, 0x7FFFFFFEFFFFFFF0ULL),
                 _(0x7FFFFFFFFFFFFFFFULL, 0x8000000000000000ULL),
                 _(0x7FFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFDULL),
                 _(0x7FFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFEULL),
                 _(0x7FFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL),
                 _(0x8000000000000000ULL, 0x0000000000000001ULL),
                 _(0x8000000000000000ULL, 0x0000000000000002ULL),
                 _(0x8000000000000000ULL, 0x0000000000000003ULL),
                 _(0x8000000000000000ULL, 0x00000000FFFFFFFDULL),
                 _(0x8000000000000000ULL, 0x00000000FFFFFFFEULL),
                 _(0x8000000000000000ULL, 0x00000000FFFFFFFFULL),
                 _(0x8000000000000000ULL, 0x0000000100000000ULL),
                 _(0x8000000000000000ULL, 0x0000000100000001ULL),
                 _(0x8000000000000000ULL, 0x0000000100000002ULL),
                 _(0x8000000000000000ULL, 0x0000000100000003ULL),
                 _(0x8000000000000000ULL, 0xFFFFFFFF00000000ULL),
                 _(0x8000000000000000ULL, 0xFFFFFFFFFFFFFFFDULL),
                 _(0x8000000000000000ULL, 0xFFFFFFFFFFFFFFFEULL),
                 _(0x8000000000000000ULL, 0xFFFFFFFFFFFFFFFFULL),
                 _(0x8000000080000000ULL, 0x0000000080000000ULL),
                 _(0x8000000080000001ULL, 0x0000000080000001ULL),
                 _(0xFFFFFFFEFFFFFFF0ULL, 0xFFFFFFFFFFFFFFFEULL),
                 _(0xFFFFFFFFFFFFFFFCULL, 0x00000000FFFFFFFEULL),
                 _(0xFFFFFFFFFFFFFFFCULL, 0x0000000100000002ULL),
                 _(0xFFFFFFFFFFFFFFFEULL, 0x0000000080000000ULL),
                 _(0xFFFFFFFFFFFFFFFFULL, 0x0000000000000001ULL),
                 _(0xFFFFFFFFFFFFFFFFULL, 0x0000000000000002ULL),
                 _(0xFFFFFFFFFFFFFFFFULL, 0x0000000000000003ULL),
                 _(0xFFFFFFFFFFFFFFFFULL, 0x00000000FFFFFFFDULL),
                 _(0xFFFFFFFFFFFFFFFFULL, 0x00000000FFFFFFFEULL),
                 _(0xFFFFFFFFFFFFFFFFULL, 0x00000000FFFFFFFFULL),
                 _(0xFFFFFFFFFFFFFFFFULL, 0x0000000100000001ULL),
                 _(0xFFFFFFFFFFFFFFFFULL, 0x0000000100000002ULL),
                 _(0xFFFFFFFFFFFFFFFFULL, 0x0000000100000003ULL),
                 _(0xFFFFFFFFFFFFFFFFULL, 0x00000001C0000001ULL),
                 _(0xFFFFFFFFFFFFFFFFULL, 0x0000000380000003ULL),
                 _(0xFFFFFFFFFFFFFFFFULL, 0x8000000000000000ULL),
                 _(0xFFFFFFFFFFFFFFFFULL, 0x7FFFFFFFFFFFFFFFULL),
                 _(0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFEULL),
                 _(0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL)};

const	struct	_ll
{
	SQWORD	llDividend, llDivisor, llQuotient, llRemainder;
} llVector[] = {_(0x0000000000000000LL, 0x0000000000000001LL),	// 0, 1
                _(0x0000000000000001LL, 0x0000000000000001LL),	// 1, 1
                _(0x0000000000000000LL, 0xFFFFFFFFFFFFFFFFLL),	// 0, -1
                _(0x0000000000000001LL, 0xFFFFFFFFFFFFFFFFLL),	// 1, -1
                _(0x0000000000000001LL, 0xFFFFFFFFFFFFFFFELL),	// 1, -2
                _(0x0000000000000002LL, 0xFFFFFFFFFFFFFFFELL),	// 2, -2
                _(0x000000000FFFFFFFLL, 0x0000000000000001LL),
                _(0x0000000FFFFFFFFFLL, 0x000000000000000FLL),
                _(0x0000000FFFFFFFFFLL, 0x0000000000000010LL),
                _(0x0000000000000100LL, 0x000000000FFFFFFFLL),
                _(0x00FFFFFFF0000000LL, 0x0000000010000000LL),
                _(0x07FFFFFF80000000LL, 0x0000000080000000LL),
                _(0x7FFFFFFEFFFFFFF0LL, 0xFFFFFFFFFFFFFFFELL),
                _(0x7FFFFFFEFFFFFFF0LL, 0x0000FFFFFFFFFFFELL),
                _(0x7FFFFFFEFFFFFFF0LL, 0x7FFFFFFEFFFFFFF0LL),
                _(0x7FFFFFFFFFFFFFFFLL, 0x8000000000000000LL),	// llmax, llmin
                _(0x7FFFFFFFFFFFFFFFLL, 0xFFFFFFFFFFFFFFFDLL),	// llmax, -3
                _(0x7FFFFFFFFFFFFFFFLL, 0xFFFFFFFFFFFFFFFELL),	// llmax, -2
                _(0x7FFFFFFFFFFFFFFFLL, 0xFFFFFFFFFFFFFFFFLL),	// llmax, -1
                _(0x8000000000000000LL, 0x0000000000000001LL),	// llmin, 1
                _(0x8000000000000000LL, 0x0000000000000002LL),	// llmin, 2
                _(0x8000000000000000LL, 0x0000000000000003LL),	// llmin, 3
                _(0x8000000000000000LL, 0x00000000FFFFFFFELL),
                _(0x8000000000000000LL, 0x00000000FFFFFFFFLL),
                _(0x8000000000000000LL, 0x0000000100000000LL),
                _(0x8000000000000000LL, 0x0000000100000001LL),
                _(0x8000000000000000LL, 0x0000000100000002LL),
                _(0x8000000000000000LL, 0x8000000000000000LL),	// llmin, llmin
                _(0x8000000000000000LL, 0xFFFFFFFF00000000LL),
                _(0x8000000000000000LL, 0xFFFFFFFFFFFFFFFDLL),	// llmin, -3
                _(0x8000000000000000LL, 0xFFFFFFFFFFFFFFFELL),	// llmin, -2
#ifndef _WIN64
                _(0x8000000000000000LL, 0xFFFFFFFFFFFFFFFFLL),	// llmin, -1
#endif
                _(0x8000000080000000LL, 0x0000000080000000LL),
                _(0x8000000080000001LL, 0x0000000080000001LL),
                _(0xFFFFFFFEFFFFFFF0LL, 0xFFFFFFFFFFFFFFFELL),
                _(0xFFFFFFFFFFFFFFFELL, 0x0000000080000000LL),
                _(0xFFFFFFFFFFFFFFFELL, 0x0000000000000001LL),	// -2, 1
                _(0xFFFFFFFFFFFFFFFELL, 0x0000000000000002LL),	// -2, 2
                _(0xFFFFFFFFFFFFFFFELL, 0xFFFFFFFFFFFFFFFELL),	// -2, -2
                _(0xFFFFFFFFFFFFFFFELL, 0xFFFFFFFFFFFFFFFFLL),	// -2, -1
                _(0xFFFFFFFFFFFFFFFFLL, 0x0000000000000001LL),	// -1, 1
                _(0xFFFFFFFFFFFFFFFFLL, 0x0000000000000002LL),	// -1, 2
                _(0xFFFFFFFFFFFFFFFFLL, 0xFFFFFFFFFFFFFFFELL),	// -1, -2
                _(0xFFFFFFFFFFFFFFFFLL, 0xFFFFFFFFFFFFFFFFLL)};	// -1, -1

#undef _

#ifndef HELPER
SQWORD	__divdi3(SQWORD dividend, SQWORD divisor);
SQWORD	__moddi3(SQWORD dividend, SQWORD divisor);
SQWORD	__muldi3(SQWORD multiplicand, SQWORD multiplier);
QWORD	__udivdi3(QWORD dividend, QWORD divisor);
QWORD	__umoddi3(QWORD dividend, QWORD divisor);
QWORD	__umuldi3(QWORD multiplicand, QWORD multiplier);
QWORD	__udivmoddi4(QWORD dividend, QWORD divisor, QWORD *remainder);

__declspec(noinline)
QWORD	__unopdi4(QWORD dividend, QWORD divisor, QWORD *remainder)
{
	if (remainder != NULL)
		*remainder = divisor;

	return dividend;
}
#else
__declspec(naked)
__declspec(noinline)
QWORD	WINAPI	_aullnop(QWORD left, QWORD right)
{
	__asm	ret	16
}
#endif // HELPER

__forceinline	// companion for __emul()
LONG	WINAPI	__ediv(LONGLONG llDividend, LONG lDivisor)
{
	__asm	mov	eax, dword ptr llDividend
	__asm	mov	edx, dword ptr llDividend+4
	__asm	idiv	lDivisor
}

__forceinline	// companion for __emul()
LONG	WINAPI	__emod(LONGLONG llDividend, LONG lDivisor)
{
	__asm	mov	eax, dword ptr llDividend
	__asm	mov	edx, dword ptr llDividend+4
	__asm	idiv	lDivisor
	__asm	mov	eax, edx
}

__forceinline	// companion for __emulu()
DWORD	WINAPI	__edivu(QWORD ullDividend, DWORD ulDivisor)
{
	__asm	mov	eax, dword ptr ullDividend
	__asm	mov	edx, dword ptr ullDividend+4
	__asm	div	ulDivisor
}

__forceinline	// companion for __emulu()
DWORD	WINAPI	__emodu(QWORD ullDividend, DWORD ulDivisor)
{
	__asm	mov	eax, dword ptr ullDividend
	__asm	mov	edx, dword ptr ullDividend+4
	__asm	div	ulDivisor
	__asm	mov	eax, edx
}

__declspec(safebuffers)
BOOL	PrintConsole(HANDLE hConsole, LPCWSTR lpFormat, ...)
{
	WCHAR	szBuffer[1025];
	DWORD	dwBuffer;
	DWORD	dwConsole;

	va_list	vaInserts;
	va_start(vaInserts, lpFormat);

	dwBuffer = wvsprintf(szBuffer, lpFormat, vaInserts);

	va_end(vaInserts);

	if (dwBuffer == 0UL)
		return FALSE;

	if (!WriteConsole(hConsole, szBuffer, dwBuffer, &dwConsole, NULL))
		return FALSE;

	return dwConsole == dwBuffer;
}

__declspec(noreturn)
VOID	wmainCRTStartup(VOID)
{
	DWORD	dw, dwCPUID[12];

	QWORD	qwT0, qwT1, qwT2, qwT3;
	QWORD	qwTx, qwTy, qwTz;

	QWORD	ullQuotient, ullRemainder;
	SQWORD	llQuotient, llRemainder;

	volatile
#ifdef HELPER
	QWORD	qwQuotient, qwRemainder;
	QWORD	qwDividend, qwDivisor = ~0ULL;
#else
	QWORD	qwQuotient;
	QWORD	qwDividend, qwDivisor = ~0ULL, qwRemainder;
#endif		// bit-vector of prime numbers: 2**n == prime
	QWORD	qwLeft = 0x28208A20A08A28ACULL;
		// 2**64 / golden ratio
	QWORD	qwRight = 0x9E3779B97F4A7C15ULL;

	HANDLE	hThread = GetCurrentThread();
	HANDLE	hOutput = GetStdHandle(STD_OUTPUT_HANDLE);

	if (hOutput == INVALID_HANDLE_VALUE)
		ExitProcess(GetLastError());

	if (SetThreadIdealProcessor(hThread, 0UL) == -1L)
		ExitProcess(GetLastError());

	__cpuid(dwCPUID, 0x80000000UL);

	if (*dwCPUID >= 0x80000004UL)
	{
		__cpuid(dwCPUID, 0x80000002UL);
		__cpuid(dwCPUID + 4, 0x80000003UL);
		__cpuid(dwCPUID + 8, 0x80000004UL);
	}
	else
		strcpy((LPSTR) dwCPUID, "undetermined processor");

	PrintConsole(hOutput, L"\nTesting 64-bit division...\n");

	for (dw = 0UL; dw < sizeof(ullVector) / sizeof(*ullVector); dw++)
	{
		PrintConsole(hOutput, L"\r%lu", dw);
#ifndef HELPER
		ullQuotient = __udivmoddi4(ullVector[dw].ullDividend, ullVector[dw].ullDivisor, &ullRemainder);
#else
		ullQuotient = ullVector[dw].ullDividend / ullVector[dw].ullDivisor;
		ullRemainder = ullVector[dw].ullDividend % ullVector[dw].ullDivisor;
#endif
		if (ullQuotient != ullVector[dw].ullQuotient)
			PrintConsole(hOutput, L"\t%I64u / %I64u:\a quotient %I64u not equal %I64u\n",
			             ullVector[dw].ullDividend, ullVector[dw].ullDivisor, ullQuotient, ullVector[dw].ullQuotient);

		if (ullQuotient > ullVector[dw].ullDividend)
			PrintConsole(hOutput, L"\t%I64u / %I64u:\a quotient %I64u greater dividend\n",
			             ullVector[dw].ullDividend, ullVector[dw].ullDivisor, ullQuotient);

		if (ullRemainder != ullVector[dw].ullRemainder)
			PrintConsole(hOutput, L"\t%I64u %% %I64u:\a remainder %I64u not equal %I64u\n",
			             ullVector[dw].ullDividend, ullVector[dw].ullDivisor, ullRemainder, ullVector[dw].ullRemainder);

		if (ullRemainder >= ullVector[dw].ullDivisor)
			PrintConsole(hOutput, L"\t%I64u %% %I64u:\a remainder %I64u not less divisor\n",
			             ullVector[dw].ullDividend, ullVector[dw].ullDivisor, ullRemainder);
	}

	PrintConsole(hOutput, L"\nTesting unsigned 64-bit division...\n");

	for (dw = 0UL; dw < sizeof(ullVector) / sizeof(*ullVector); dw++)
	{
		PrintConsole(hOutput, L"\r%lu", dw);
#ifndef HELPER
		ullQuotient = __udivdi3(ullVector[dw].ullDividend, ullVector[dw].ullDivisor);
#else
		ullQuotient = ullVector[dw].ullDividend / ullVector[dw].ullDivisor;
#endif
		if (ullQuotient != ullVector[dw].ullQuotient)
			PrintConsole(hOutput, L"\t%I64u / %I64u:\a quotient %I64u not equal %I64u\n",
			             ullVector[dw].ullDividend, ullVector[dw].ullDivisor, ullQuotient, ullVector[dw].ullQuotient);

		if (ullQuotient > ullVector[dw].ullDividend)
			PrintConsole(hOutput, L"\t%I64u / %I64u:\a quotient %I64u greater dividend\n",
			             ullVector[dw].ullDividend, ullVector[dw].ullDivisor, ullQuotient);
#ifndef HELPER
		ullRemainder = ullVector[dw].ullDividend - __muldi3(ullVector[dw].ullDivisor, ullQuotient);
#else
		ullRemainder = ullVector[dw].ullDividend - ullVector[dw].ullDivisor * ullQuotient;
#endif
		if (ullRemainder != ullVector[dw].ullRemainder)
			PrintConsole(hOutput, L"\t%I64u / %I64u:\a remainder %I64u not equal %I64u\n",
			             ullVector[dw].ullDividend, ullVector[dw].ullDivisor, ullRemainder, ullVector[dw].ullRemainder);
#ifndef HELPER
		ullRemainder = __umoddi3(ullVector[dw].ullDividend, ullVector[dw].ullDivisor);
#else
		ullRemainder = ullVector[dw].ullDividend % ullVector[dw].ullDivisor;
#endif
		if (ullRemainder != ullVector[dw].ullRemainder)
			PrintConsole(hOutput, L"\t%I64u %% %I64u:\a remainder %I64u not equal %I64u\n",
			             ullVector[dw].ullDividend, ullVector[dw].ullDivisor, ullRemainder, ullVector[dw].ullRemainder);

		if (ullRemainder >= ullVector[dw].ullDivisor)
			PrintConsole(hOutput, L"\t%I64u %% %I64u:\a remainder %I64u not less divisor\n",
			             ullVector[dw].ullDividend, ullVector[dw].ullDivisor, ullRemainder);
	}

	PrintConsole(hOutput, L"\nTesting signed 64-bit division...\n");

	for (dw = 0UL; dw < sizeof(llVector) / sizeof(*llVector); dw++)
	{
		PrintConsole(hOutput, L"\r%lu", dw);
#ifndef HELPER
		llQuotient = __divdi3(llVector[dw].llDividend, llVector[dw].llDivisor);
#else
		llQuotient = llVector[dw].llDividend / llVector[dw].llDivisor;
#endif
		if (llQuotient != llVector[dw].llQuotient)
			PrintConsole(hOutput, L"\t%I64d / %I64d:\a quotient %I64d not equal %I64d\n",
			             llVector[dw].llDividend, llVector[dw].llDivisor, llQuotient, llVector[dw].llQuotient);

		if ((llVector[dw].llDividend < 0LL) && (llQuotient < llVector[dw].llDividend)
		 || (llVector[dw].llDividend >= 0LL) && (llQuotient > llVector[dw].llDividend))
			PrintConsole(hOutput, L"\t%I64d / %I64d:\a quotient %I64d greater dividend\n",
			             llVector[dw].llDividend, llVector[dw].llDivisor, llQuotient);
#ifndef HELPER
		llRemainder = llVector[dw].llDividend - __muldi3(llVector[dw].llDivisor, llQuotient);
#else
		llRemainder = llVector[dw].llDividend - llVector[dw].llDivisor * llQuotient;
#endif
		if (llRemainder != llVector[dw].llRemainder)
			PrintConsole(hOutput, L"\t%I64d / %I64d:\a remainder %I64d not equal %I64d\n",
			             llVector[dw].llDividend, llVector[dw].llDivisor, llRemainder, llVector[dw].llRemainder);

		if ((llRemainder != 0LL)
		 && ((llRemainder < 0LL) != (llVector[dw].llDividend < 0LL)))
			PrintConsole(hOutput, L"\t%I64d / %I64d:\a sign of remainder %I64d not equal sign of dividend %I64d\n",
			             llVector[dw].llDividend, llVector[dw].llDivisor, llRemainder, llVector[dw].llDividend);
#ifndef HELPER
		llRemainder = __moddi3(llVector[dw].llDividend, llVector[dw].llDivisor);
#else
		llRemainder = llVector[dw].llDividend % llVector[dw].llDivisor;
#endif
		if (llRemainder != llVector[dw].llRemainder)
			PrintConsole(hOutput, L"\t%I64d %% %I64d:\a remainder %I64d not equal %I64d\n",
			             llVector[dw].llDividend, llVector[dw].llDivisor, llRemainder, llVector[dw].llRemainder);

		if ((llVector[dw].llDivisor < 0LL) && (llRemainder <= llVector[dw].llDivisor)
		 || (llVector[dw].llDivisor > 0LL) && (llRemainder >= llVector[dw].llDivisor))
			PrintConsole(hOutput, L"\t%I64d %% %I64d:\a remainder %I64d not less divisor\n",
			             llVector[dw].llDividend, llVector[dw].llDivisor, llRemainder);

		if ((llRemainder != 0LL) && ((llRemainder < 0LL) != (llVector[dw].llDividend < 0LL)))
			PrintConsole(hOutput, L"\t%I64d %% %I64d:\a sign of remainder %I64d not equal sign of dividend %I64d\n",
			             llVector[dw].llDividend, llVector[dw].llDivisor, llRemainder, llVector[dw].llDividend);
	}

	PrintConsole(hOutput, L"\nTiming 64-bit division on %.48hs\n", dwCPUID);
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT0))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT0))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from George Marsaglia

		qwLeft ^= qwLeft << 14;
		qwLeft ^= qwLeft >> 31;
		qwLeft ^= qwLeft << 45;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0xAD93D23594C935A9

		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
#endif
#ifndef HELPER
		qwQuotient = __unopdi4(qwLeft, qwRight, NULL);
#else
		qwQuotient = _aullnop(qwLeft, qwRight);
#endif
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from Richard Peirce Brent

		qwRight ^= qwRight << 10;
		qwRight ^= qwRight >> 15;
		qwRight ^= qwRight << 4;
		qwRight ^= qwRight >> 13;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0x2B5926535897936B

		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
#endif
#ifndef HELPER
		qwQuotient = __unopdi4(qwLeft, qwRight, &qwRemainder);
#else
		qwQuotient = _aullnop(qwLeft, qwRight);
#endif
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT1))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT1))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from George Marsaglia

		qwLeft ^= qwLeft << 14;
		qwLeft ^= qwLeft >> 31;
		qwLeft ^= qwLeft << 45;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0xAD93D23594C935A9

		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
#endif
#ifndef HELPER
		qwQuotient = __udivmoddi4(qwLeft, qwRight, NULL);
#else
		qwQuotient = qwLeft / qwRight;
		qwRemainder = qwLeft % qwRight;
#endif
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from Richard Peirce Brent

		qwRight ^= qwRight << 10;
		qwRight ^= qwRight >> 15;
		qwRight ^= qwRight << 4;
		qwRight ^= qwRight >> 13;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0x2B5926535897936B

		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
#endif
#ifndef HELPER
		qwQuotient = __udivmoddi4(qwLeft, qwRight, &qwRemainder);
#else
		qwQuotient = qwLeft / qwRight;
		qwRemainder = qwLeft % qwRight;
#endif
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT2))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT2))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from George Marsaglia

		qwLeft ^= qwLeft << 14;
		qwLeft ^= qwLeft >> 31;
		qwLeft ^= qwLeft << 45;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0xAD93D23594C935A9

		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
#endif
#ifndef HELPER
		qwQuotient = __umuldi3(qwLeft, qwRight);
#else
		qwQuotient = qwLeft * qwRight;
#endif
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from Richard Peirce Brent

		qwRight ^= qwRight << 10;
		qwRight ^= qwRight >> 15;
		qwRight ^= qwRight << 4;
		qwRight ^= qwRight >> 13;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0x2B5926535897936B

		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
#endif
#ifndef HELPER
		qwQuotient = __umuldi3(qwLeft, qwRight);
#else
		qwQuotient = qwLeft * qwRight;
#endif
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT3))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT3))
#endif
		ExitProcess(GetLastError());

	qwTz = qwT3 - qwT0;
	qwT3 -= qwT2;
	qwT2 -= qwT1;
	qwT1 -= qwT0;
	qwTy = qwT3 > qwT1 ? qwT3 - qwT1 : 0ULL;
	qwTx = qwT2 - qwT1;
#ifndef HELPER
#ifdef CYCLES
	PrintConsole(hOutput, L"\n"
	                      L"__unopdi4()     %6lu.%09lu      0\n"
	                      L"__udivmoddi4()  %6lu.%09lu %6lu.%09lu\n"
	                      L"__umuldi3()     %6lu.%09lu %6lu.%09lu\n"
	                      L"                %6lu.%09lu clock cycles\n",
	             __edivu(qwT1, 1000000000UL), __emodu(qwT1, 1000000000UL),
	             __edivu(qwT2, 1000000000UL), __emodu(qwT2, 1000000000UL),
	             __edivu(qwTx, 1000000000UL), __emodu(qwTx, 1000000000UL),
	             __edivu(qwT3, 1000000000UL), __emodu(qwT3, 1000000000UL),
	             __edivu(qwTy, 1000000000UL), __emodu(qwTy, 1000000000UL),
	             __edivu(qwTz, 1000000000UL), __emodu(qwTz, 1000000000UL));
#else
	PrintConsole(hOutput, L"\n"
	                      L"__unopdi4()     %6lu.%07lu      0\n"
	                      L"__udivmoddi4()  %6lu.%07lu %6lu.%07lu\n"
	                      L"__umuldi3()     %6lu.%07lu %6lu.%07lu\n"
	                      L"                %6lu.%07lu nano-seconds\n",
	             __edivu(qwT1, 10000000UL), __emodu(qwT1, 10000000UL),
	             __edivu(qwT2, 10000000UL), __emodu(qwT2, 10000000UL),
	             __edivu(qwTx, 10000000UL), __emodu(qwTx, 10000000UL),
	             __edivu(qwT3, 10000000UL), __emodu(qwT3, 10000000UL),
	             __edivu(qwTy, 10000000UL), __emodu(qwTy, 10000000UL),
	             __edivu(qwTz, 10000000UL), __emodu(qwTz, 10000000UL));
#endif // CYCLES
#else // HELPER
#ifdef CYCLES
	PrintConsole(hOutput, L"\n"
	                      L"_aullnop()      %6lu.%09lu      0\n"
	                      L"_aulldvrm()     %6lu.%09lu %6lu.%09lu\n"
	                      L"_aullmul()      %6lu.%09lu %6lu.%09lu\n"
	                      L"                %6lu.%09lu clock cycles\n",
	             __edivu(qwT1, 1000000000UL), __emodu(qwT1, 1000000000UL),
	             __edivu(qwT2, 1000000000UL), __emodu(qwT2, 1000000000UL),
	             __edivu(qwTx, 1000000000UL), __emodu(qwTx, 1000000000UL),
	             __edivu(qwT3, 1000000000UL), __emodu(qwT3, 1000000000UL),
	             __edivu(qwTy, 1000000000UL), __emodu(qwTy, 1000000000UL),
	             __edivu(qwTz, 1000000000UL), __emodu(qwTz, 1000000000UL));
#else
	PrintConsole(hOutput, L"\n"
	                      L"_aullnop()      %6lu.%07lu      0\n"
	                      L"_aulldvrm()     %6lu.%07lu %6lu.%07lu\n"
	                      L"_aullmul()      %6lu.%07lu %6lu.%07lu\n"
	                      L"                %6lu.%07lu nano-seconds\n",
	             __edivu(qwT1, 10000000UL), __emodu(qwT1, 10000000UL),
	             __edivu(qwT2, 10000000UL), __emodu(qwT2, 10000000UL),
	             __edivu(qwTx, 10000000UL), __emodu(qwTx, 10000000UL),
	             __edivu(qwT3, 10000000UL), __emodu(qwT3, 10000000UL),
	             __edivu(qwTy, 10000000UL), __emodu(qwTy, 10000000UL),
	             __edivu(qwTz, 10000000UL), __emodu(qwTz, 10000000UL));
#endif // CYCLES
#endif // HELPER
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT0))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT0))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from George Marsaglia

		qwLeft ^= qwLeft << 14;
		qwLeft ^= qwLeft >> 31;
		qwLeft ^= qwLeft << 45;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0xAD93D23594C935A9

		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
#endif
		qwDividend = __ull_rshift(qwLeft, qwLeft /* & 31 */);
#ifndef HELPER
		qwQuotient = __unopdi4(qwDividend, qwDivisor, NULL);
#else
		qwQuotient = _aullnop(qwDividend, qwDivisor);
#endif
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from Richard Peirce Brent

		qwRight ^= qwRight << 10;
		qwRight ^= qwRight >> 15;
		qwRight ^= qwRight << 4;
		qwRight ^= qwRight >> 13;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0x2B5926535897936B

		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
#endif
		qwDivisor = __ull_rshift(qwRight, qwRight /* & 31 */);
#ifndef HELPER
		qwQuotient = __unopdi4(qwDividend, qwDivisor, &qwRemainder);
#else
		qwQuotient = _aullnop(qwDividend, qwDivisor);
#endif
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT1))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT1))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from George Marsaglia

		qwLeft ^= qwLeft << 14;
		qwLeft ^= qwLeft >> 31;
		qwLeft ^= qwLeft << 45;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0xAD93D23594C935A9

		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
#endif
		qwDividend = __ull_rshift(qwLeft, qwLeft /* & 31 */);
#ifndef HELPER
		qwQuotient = __udivmoddi4(qwDividend, qwDivisor, NULL);
#else
		qwQuotient = qwDividend / qwDivisor;
		qwRemainder = qwDividend % qwDivisor;
#endif
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from Richard Peirce Brent

		qwRight ^= qwRight << 10;
		qwRight ^= qwRight >> 15;
		qwRight ^= qwRight << 4;
		qwRight ^= qwRight >> 13;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0x2B5926535897936B

		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
#endif
		qwDivisor = __ull_rshift(qwRight, qwRight /* & 31 */);
#ifndef HELPER
		qwQuotient = __udivmoddi4(qwDividend, qwDivisor, &qwRemainder);
#else
		qwQuotient = qwDividend / qwDivisor;
		qwRemainder = qwDividend % qwDivisor;
#endif
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT2))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT2))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from George Marsaglia

		qwLeft ^= qwLeft << 14;
		qwLeft ^= qwLeft >> 31;
		qwLeft ^= qwLeft << 45;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0xAD93D23594C935A9

		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
#endif
		qwDividend = __ull_rshift(qwLeft, qwLeft /* & 31 */);
#ifndef HELPER
		qwQuotient = __umuldi3(qwDividend, qwDivisor);
#else
		qwQuotient = qwDividend * qwDivisor;
#endif
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from Richard Peirce Brent

		qwRight ^= qwRight << 10;
		qwRight ^= qwRight >> 15;
		qwRight ^= qwRight << 4;
		qwRight ^= qwRight >> 13;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0x2B5926535897936B

		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
#endif
		qwDivisor = __ull_rshift(qwRight, qwRight /* & 31 */);
#ifndef HELPER
		qwQuotient = __umuldi3(qwDividend, qwDivisor);
#else
		qwQuotient = qwDividend * qwDivisor;
#endif
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT3))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT3))
#endif
		ExitProcess(GetLastError());

	qwTz = qwT3 - qwT0;
	qwT3 -= qwT2;
	qwT2 -= qwT1;
	qwT1 -= qwT0;
	qwTy = qwT3 > qwT1 ? qwT3 - qwT1 : 0ULL;
	qwTx = qwT2 - qwT1;
#ifndef HELPER
#ifdef CYCLES
	PrintConsole(hOutput, L"\n"
	                      L"__unopdi4()     %6lu.%09lu      0\n"
	                      L"__udivmoddi4()  %6lu.%09lu %6lu.%09lu\n"
	                      L"__umuldi3()     %6lu.%09lu %6lu.%09lu\n"
	                      L"                %6lu.%09lu clock cycles\n",
	             __edivu(qwT1, 1000000000UL), __emodu(qwT1, 1000000000UL),
	             __edivu(qwT2, 1000000000UL), __emodu(qwT2, 1000000000UL),
	             __edivu(qwTx, 1000000000UL), __emodu(qwTx, 1000000000UL),
	             __edivu(qwT3, 1000000000UL), __emodu(qwT3, 1000000000UL),
	             __edivu(qwTy, 1000000000UL), __emodu(qwTy, 1000000000UL),
	             __edivu(qwTz, 1000000000UL), __emodu(qwTz, 1000000000UL));
#else
	PrintConsole(hOutput, L"\n"
	                      L"__unopdi4()     %6lu.%07lu      0\n"
	                      L"__udivmoddi4()  %6lu.%07lu %6lu.%07lu\n"
	                      L"__umuldi3()     %6lu.%07lu %6lu.%07lu\n"
	                      L"                %6lu.%07lu nano-seconds\n",
	             __edivu(qwT1, 10000000UL), __emodu(qwT1, 10000000UL),
	             __edivu(qwT2, 10000000UL), __emodu(qwT2, 10000000UL),
	             __edivu(qwTx, 10000000UL), __emodu(qwTx, 10000000UL),
	             __edivu(qwT3, 10000000UL), __emodu(qwT3, 10000000UL),
	             __edivu(qwTy, 10000000UL), __emodu(qwTy, 10000000UL),
	             __edivu(qwTz, 10000000UL), __emodu(qwTz, 10000000UL));
#endif // CYCLES
#else // HELPER
#ifdef CYCLES
	PrintConsole(hOutput, L"\n"
	                      L"_aullnop()      %6lu.%09lu      0\n"
	                      L"_aulldvrm()     %6lu.%09lu %6lu.%09lu\n"
	                      L"_aullmul()      %6lu.%09lu %6lu.%09lu\n"
	                      L"                %6lu.%09lu clock cycles\n",
	             __edivu(qwT1, 1000000000UL), __emodu(qwT1, 1000000000UL),
	             __edivu(qwT2, 1000000000UL), __emodu(qwT2, 1000000000UL),
	             __edivu(qwTx, 1000000000UL), __emodu(qwTx, 1000000000UL),
	             __edivu(qwT3, 1000000000UL), __emodu(qwT3, 1000000000UL),
	             __edivu(qwTy, 1000000000UL), __emodu(qwTy, 1000000000UL),
	             __edivu(qwTz, 1000000000UL), __emodu(qwTz, 1000000000UL));
#else
	PrintConsole(hOutput, L"\n"
	                      L"_aullnop()      %6lu.%07lu      0\n"
	                      L"_aulldvrm()     %6lu.%07lu %6lu.%07lu\n"
	                      L"_aullmul()      %6lu.%07lu %6lu.%07lu\n"
	                      L"                %6lu.%07lu nano-seconds\n",
	             __edivu(qwT1, 10000000UL), __emodu(qwT1, 10000000UL),
	             __edivu(qwT2, 10000000UL), __emodu(qwT2, 10000000UL),
	             __edivu(qwTx, 10000000UL), __emodu(qwTx, 10000000UL),
	             __edivu(qwT3, 10000000UL), __emodu(qwT3, 10000000UL),
	             __edivu(qwTy, 10000000UL), __emodu(qwTy, 10000000UL),
	             __edivu(qwTz, 10000000UL), __emodu(qwTz, 10000000UL));
#endif // CYCLES
#endif // HELPER
	ExitProcess(0UL);
}

DWORD_PTR	__security_cookie = 3141592654UL;	// π * 10**9

extern	LPVOID	__safe_se_handler_table[];
extern	BYTE	__safe_se_handler_count;

const	IMAGE_LOAD_CONFIG_DIRECTORY32	_load_config_used = {sizeof(_load_config_used),
					                     'CVSM',	// = "MSVC" = 2005-10-19 14:08:13 UTC
					                     _MSC_VER / 100, _MSC_VER % 100,
					                     0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL,
					                     0, 0,
					                     0L,
					                     &__security_cookie,
					                     __safe_se_handler_table,
					                     &__safe_se_handler_count};

__declspec(naked)
VOID	__fastcall	__security_check_cookie(DWORD_PTR _stackcookie)
{
	__asm
	{
		cmp	ecx, __security_cookie
		jne	corrupt
		ret

	corrupt:
		ud2
	}
}
Save this C source as 64-i386.c in an arbitrary, preferably empty directory, and save the sixteen 32-bit assembly sources presented above there too, as divdi3.asm, moddi3.asm, muldi3.asm, udivdi3.asm, umoddi3.asm, udivmoddi4.asm, alldiv.asm, alldvrm.asm, allmul.asm, allrem.asm, allshl.asm, allshr.asm, aulldiv.asm, aulldvrm.asm, aullrem.asm and aullshr.asm respectively. Optionally copy clang_rt.builtins-i386.lib from an installation of LLVM’s Clang into this directory. Then start the command prompt of the Windows software development kit for the i386 platform in this directory and run the following command lines to build the benchmark programs 64-i386.exe, 64-helper.exe, 64-msft.exe and optionally 64-llvm.exe, and to execute them:
CL.EXE /c /DCYCLES /GAFy /O2y /W4 /Zl 64-i386.c
ML.EXE /c /W3 /X divdi3.asm
ML.EXE /c /W3 /X moddi3.asm
ML.EXE /c /W3 /X muldi3.asm
ML.EXE /c /W3 /X udivdi3.asm
ML.EXE /c /W3 /X umoddi3.asm
ML.EXE /c /W3 /X udivmoddi4.asm
ML.EXE /c /W3 /X alldiv.asm
ML.EXE /c /W3 /X alldvrm.asm
ML.EXE /c /W3 /X allmul.asm
ML.EXE /c /W3 /X allrem.asm
ML.EXE /c /W3 /X allshl.asm
ML.EXE /c /W3 /X allshr.asm
ML.EXE /c /W3 /X aulldiv.asm
ML.EXE /c /W3 /X aulldvrm.asm
ML.EXE /c /W3 /X aullrem.asm
ML.EXE /c /W3 /X aullshr.asm
LINK.EXE /LIB /MACHINE:X86 /NODEFAULTLIB /OUT:64-i386.lib divdi3.obj moddi3.obj muldi3.obj udivdi3.obj umoddi3.obj udivmoddi4.obj alldiv.obj alldvrm.obj allmul.obj allrem.obj allshl.obj allshr.obj aulldiv.obj aulldvrm.obj aullrem.obj aullshr.obj
LINK.EXE /DYNAMICBASE /ENTRY:wmainCRTStartup /MACHINE:X86 /NODEFAULTLIB /NXCOMPAT /OPT:REF /OUT:64-i386.exe /SUBSYSTEM:CONSOLE 64-i386.obj 64-i386.lib kernel32.lib user32.lib
CL.EXE /c /DCYCLES /DHELPER /GAFy /O2y /W4 /Zl 64-i386.c
LINK.EXE /DYNAMICBASE /ENTRY:wmainCRTStartup /MACHINE:X86 /NODEFAULTLIB /NXCOMPAT /OPT:REF /OUT:64-helper.exe /SUBSYSTEM:CONSOLE 64-i386.obj 64-i386.lib kernel32.lib user32.lib
LINK.EXE /LIB /DEF /EXPORT:_alldiv /EXPORT:_alldvrm /EXPORT:_allmul /EXPORT:_allrem /EXPORT:_allshl /EXPORT:_allshr /EXPORT:_aulldiv /EXPORT:_aulldvrm /EXPORT:_aullrem /EXPORT:_aullshr /MACHINE:X86 /NAME:NTDLL /NODEFAULTLIB /OUT:64-msft.lib
LINK.EXE /DYNAMICBASE /ENTRY:wmainCRTStartup /MACHINE:X86 /NODEFAULTLIB /NXCOMPAT /OPT:REF /OUT:64-msft.exe /SUBSYSTEM:CONSOLE 64-i386.obj 64-msft.lib kernel32.lib user32.lib
IF EXIST clang_rt.builtins-i386.lib LINK.EXE /DYNAMICBASE /ENTRY:wmainCRTStartup /MACHINE:X86 /NODEFAULTLIB /NXCOMPAT /OPT:REF /OUT:64-llvm.exe /SUBSYSTEM:CONSOLE 64-i386.obj clang_rt.builtins-i386.lib kernel32.lib user32.lib
.\64-i386.exe
.\64-helper.exe
.\64-msft.exe
IF EXIST 64-llvm.exe .\64-llvm.exe
Note: the command lines can be copied and pasted as a block into the Command Processor window!

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: all 32-bit programs are pure Win32 console applications and build without the MSVCRT libraries.

Note: the trivial transformation of the assembly sources with directives for Unix’ or GNU’s as into assembly sources for Microsoft’s ML.EXE is left as an exercise to the reader.

Note: linking the program 64-msft.exe with the compiler helper routines built from their source code blcrtasm.asm is also left as an exercise to the reader.
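
The benchmark loops in 64-i386.c draw their operands from two 64-bit Galois linear feedback shift registers and shorten each value by a data-dependent right shift. The following portable C sketch restates just that operand generation, as an illustration rather than as the code of the benchmark. The function names lfsr_left, lfsr_right and make_operand are hypothetical, the <stdint.h> types replace QWORD and SQWORD, the sign-extending shift of the source is replaced by an equivalent mask, and the shift count is reduced modulo 32 on the assumption that this is what the /* & 31 */ comment next to __ull_rshift() documents for i386.

```c
#include <assert.h>
#include <stdint.h>

// Feedback polynomials taken from the comments in 64-i386.c
#define POLY_LEFT  UINT64_C(0xAD93D23594C935A9)
#define POLY_RIGHT UINT64_C(0x2B5926535897936B)

// Galois LFSR stepping to the left: the bit shifted out of position 63
// decides whether the polynomial is XORed into the shifted state.
static uint64_t lfsr_left(uint64_t state)
{
    assert(state != 0);  // the all-zero state would never change
    return (state << 1) ^ ((UINT64_C(0) - (state >> 63)) & POLY_LEFT);
}

// Galois LFSR stepping to the right: same idea, driven by bit 0.
static uint64_t lfsr_right(uint64_t state)
{
    assert(state != 0);  // the all-zero state would never change
    return (state >> 1) ^ ((UINT64_C(0) - (state & 1)) & POLY_RIGHT);
}

// Shorten the state by a shift count taken from its own low bits.
// Assumption: the count is reduced modulo 32, see the lead-in.
static uint64_t make_operand(uint64_t state)
{
    return state >> (state & 31);
}
```

Under the modulo 32 assumption, up to 31 bits are shifted out of each state, so the dividends and divisors fed to the routines differ in magnitude from call to call.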

Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.

64-i386.c
64-i386.c(830) : warning C4047: 'initializing' : 'DWORD' differs in levels of indirection from 'DWORD_PTR *'
64-i386.c(831) : warning C4047: 'initializing' : 'DWORD' differs in levels of indirection from 'LPVOID *'
64-i386.c(832) : warning C4047: 'initializing' : 'DWORD' differs in levels of indirection from 'BYTE *'
64-i386.c(835) : warning C4100: '_stackcookie' : unreferenced formal parameter

Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: divdi3.asm
…
 Assembling: udivmoddi4.asm
 Assembling: alldiv.asm
…
 Assembling: aullshr.asm

Microsoft (R) Library Manager Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

Microsoft (R) Incremental Linker Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.

64-i386.c
64-i386.c(156) : warning C4100: 'right' : unreferenced formal parameter
64-i386.c(156) : warning C4100: 'left' : unreferenced formal parameter
64-i386.c(830) : warning C4047: 'initializing' : 'DWORD' differs in levels of indirection from 'DWORD_PTR *'
64-i386.c(831) : warning C4047: 'initializing' : 'DWORD' differs in levels of indirection from 'LPVOID *'
64-i386.c(832) : warning C4047: 'initializing' : 'DWORD' differs in levels of indirection from 'BYTE *'
64-i386.c(835) : warning C4100: '_stackcookie' : unreferenced formal parameter

Microsoft (R) Incremental Linker Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

Microsoft (R) Program Maintenance Utility Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

   Creating library 64-msft.lib and object 64-msft.exp

Microsoft (R) Incremental Linker Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

Microsoft (R) Incremental Linker Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz

__unopdi4()         12.849202656      0
__udivmoddi4()      37.561920358     24.712717702
__umuldi3()         14.358287749      1.509085093
                    64.769410763 clock cycles

__unopdi4()         18.308137879      0
__udivmoddi4()      37.448476732     19.140338853
__umuldi3()         19.587959635      1.279821756
                    75.344574246 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz

_aullnop()           9.108604673      0
_aulldvrm()         39.178505498     30.069900825
_aullmul()          14.272042690      5.163438017
                    62.559152861 clock cycles

_aullnop()          14.043325395      0
_aulldvrm()         38.404302453     24.360977058
_aullmul()          19.309414816      5.266089421
                    71.757042664 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz

_aullnop()           9.005029514      0
_aulldvrm()        145.500002260    136.494972746
_aullmul()          17.647885499      8.642855985
                   172.152917273 clock cycles

_aullnop()          13.827490013      0
_aulldvrm()        111.386159799     97.558669786
_aullmul()          22.663936806      8.836446793
                   147.877586618 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz

__unopdi4()         12.857257576      0
__udivmoddi4()      94.499937193     81.642679617
__umuldi3()         30.708206573     17.850948997
                   138.065401342 clock cycles

__unopdi4()         17.108234965      0
__udivmoddi4()     161.966266965    144.858032000
__umuldi3()         35.101783471     17.993548506
                   214.176285401 clock cycles
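
Each result line above is derived from raw counter readings in the final part of 64-i386.c. Every timed loop performs 500000000 iterations with two calls each, so dividing a total by 10**9 yields the cost per call, and the second column is the cost above the empty routine. The sketch below redoes this arithmetic in portable C. The helper names net_cost and format_per_call are hypothetical, snprintf() stands in for the program's own PrintConsole(), and unlike the source, which clamps only the multiplication row, net_cost clamps every difference at zero.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

// 500000000 loop iterations with two calls each
#define CALLS_PER_LOOP UINT64_C(1000000000)

// Cost above the empty-call baseline, clamped at zero
static uint64_t net_cost(uint64_t gross, uint64_t baseline)
{
    return gross > baseline ? gross - baseline : 0;
}

// Render a loop total as a per-call figure with nine decimal places,
// mirroring the "%6lu.%09lu" layout of the CYCLES variant.
static int format_per_call(char *buffer, size_t size, uint64_t total)
{
    assert(buffer != NULL && size > 0);
    return snprintf(buffer, size, "%6" PRIu64 ".%09" PRIu64,
                    total / CALLS_PER_LOOP, total % CALLS_PER_LOOP);
}
```

Feeding it the first Intel Core 2 Duo reading, 37561920358 clock cycles for __udivmoddi4() against 12849202656 for __unopdi4(), reproduces the 24.712717702 shown in the listing. In the variant built without CYCLES, GetThreadTimes() counts in units of 100 ns, which is why the source divides by 10**7 there to print nanoseconds per call.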
The same benchmark programs, built without the preprocessor macro CYCLES defined, report execution times in nanoseconds instead of clock cycles:
[…]

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz

__unopdi4()          4.5864294      0
__udivmoddi4()      15.5064994     10.9200700
__umuldi3()          5.6004359      1.0140065
                    25.6933647 nano-seconds

__unopdi4()          7.1760460      0
__udivmoddi4()      14.9760960      7.8000500
__umuldi3()          7.7376496      0.5616036
                    29.8897916 nano-seconds

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz

_aullnop()           3.6660235      0
_aulldvrm()         15.8029013     12.1368778
_aullmul()           5.5380355      1.8720120
                    25.0069603 nano-seconds

_aullnop()           5.9592382      0
_aulldvrm()         15.4752992      9.5160610
_aullmul()           7.9716511      2.0124129
                    29.4061885 nano-seconds

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz

_aullnop()           3.6660235      0
_aulldvrm()         58.4691748     54.8031513
_aullmul()           7.2696466      3.6036231
                    69.4048449 nano-seconds

_aullnop()           5.7564369      0
_aulldvrm()         44.0546824     38.2982455
_aullmul()           9.3132597      3.5568228
                    59.1243790 nano-seconds

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz

__unopdi4()          5.1480330      0
__udivmoddi4()      37.3934397     32.2454067
__umuldi3()         12.4176796      7.2696466
                    54.9591523 nano-seconds

__unopdi4()          6.8640440      0
__udivmoddi4()      64.9432163     58.0791723
__umuldi3()         14.1648908      7.3008468
                    85.9721511 nano-seconds
[…]

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM) i5-4670 CPU @ 3.40GHz

__unopdi4()          8.710364463      0
__udivmoddi4()      29.568165444     20.857800981
__umuldi3()         10.016409737      1.306045274
                    48.294939644 clock cycles

__unopdi4()         11.899356861      0
__udivmoddi4()      31.305810062     19.406453201
__umuldi3()         14.074341743      2.174984882
                    57.279508666 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM) i5-4670 CPU @ 3.40GHz

_aullnop()           6.281171716      0
_aulldvrm()         30.299316500     24.018144784
_aullmul()          10.415490092      4.134318376
                    46.995978308 clock cycles

_aullnop()          10.305468488      0
_aulldvrm()         29.560666513     19.255198025
_aullmul()          15.232518004      4.927049516
                    55.098653005 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM) i5-4670 CPU @ 3.40GHz

_aullnop()           6.282753357      0
_aulldvrm()        130.221916499    123.939163142
_aullmul()          12.560291961      6.277538604
                   149.064961817 clock cycles

_aullnop()          10.305609251      0
_aulldvrm()         93.916607827     83.610998576
_aullmul()          17.949963126      7.644353875
                   122.172180204 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM) i5-4670 CPU @ 3.40GHz

__unopdi4()          8.794296716      0
__udivmoddi4()      58.334799420     49.540502704
__umuldi3()         16.971398673      8.177101957
                    84.100494809 clock cycles

__unopdi4()         11.806963851      0
__udivmoddi4()     108.598490949     96.791527098
__umuldi3()         22.271070710     10.464106859
                   142.676525510 clock cycles
[…]

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
57
Timing 64-bit division on Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz

__unopdi4()          8.176246060      0
__udivmoddi4()      24.540802967     16.364556907
__umuldi3()          8.774901071      0.598655011
                    41.491950098 clock cycles

__unopdi4()         10.752357791      0
__udivmoddi4()      24.479256622     13.726898831
__umuldi3()         12.175662023      1.423304232
                    47.407276436 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz

_aullnop()           6.042327017      0
_aulldvrm()         24.822108405     18.779781388
_aullmul()           8.850690256      2.808363239
                    39.715125678 clock cycles

_aullnop()           9.036137407      0
_aulldvrm()         24.078298463     15.042161056
_aullmul()          12.182378249      3.146240842
                    45.296814119 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz

_aullnop()           6.043049544      0
_aulldvrm()        121.360828766    115.317779222
_aullmul()          11.284504325      5.241454781
                   138.688382635 clock cycles

_aullnop()           9.038334480      0
_aulldvrm()         87.144452426     78.106117946
_aullmul()          14.460059957      5.421725477
                   110.642846863 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz

__unopdi4()          8.182234986      0
__udivmoddi4()      49.594440527     41.412205541
__umuldi3()         15.480297393      7.298062407
                    73.256972906 clock cycles

__unopdi4()         10.785032002      0
__udivmoddi4()      93.296232493     82.511200491
__umuldi3()         19.044985770      8.259953768
                   123.126250265 clock cycles
[…]

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz

__unopdi4()          4.758041439      0
__udivmoddi4()      14.900456178     10.142414739
__umuldi3()          5.118839780      0.360798341
                    24.777337397 clock cycles

__unopdi4()          6.264035993      0
__udivmoddi4()      14.991681122      8.727645129
__umuldi3()          7.116819579      0.852783586
                    28.372536694 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz

_aullnop()           3.511911560      0
_aulldvrm()         14.931006596     11.419095036
_aullmul()           5.185640855      1.673729295
                    23.628559011 clock cycles

_aullnop()           5.267329959      0
_aulldvrm()         14.287518025      9.020188066
_aullmul()           7.649375365      2.382045406
                    27.204223349 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz

_aullnop()           3.888929085      0
_aulldvrm()         75.630752291     71.741823206
_aullmul()           7.039982148      3.151053063
                    86.559663524 clock cycles

_aullnop()           5.706960149      0
_aulldvrm()         51.744648850     46.037688701
_aullmul()           8.437223071      2.730262922
                    65.888832070 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz

__unopdi4()          4.761309343      0
__udivmoddi4()      30.499168861     25.737859518
__umuldi3()          9.146076148      4.384766805
                    44.406554352 clock cycles

__unopdi4()          6.320688165      0
__udivmoddi4()      58.207913400     51.887225235
__umuldi3()         11.342916480      5.022228315
                    75.871518045 clock cycles
[…]

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 7 2700X Eight-Core Processor

__unopdi4()          8.637489867      0
__udivmoddi4()      27.828655455     19.191165588
__umuldi3()          9.761457334      1.123967467
                    46.227602656 clock cycles

__unopdi4()         11.229091635      0
__udivmoddi4()      26.703517279     15.474425644
__umuldi3()         12.675702170      1.446610535
                    50.608311084 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 7 2700X Eight-Core Processor

_aullnop()           6.031238132      0
_aulldvrm()         27.804057740     21.772819608
_aullmul()          10.548285859      4.517047727
                    44.383581731 clock cycles

_aullnop()           9.489672570      0
_aulldvrm()         27.331796039     17.842123469
_aullmul()          11.909754514      2.420081944
                    48.731223123 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 7 2700X Eight-Core Processor

_aullnop()           6.040463934      0
_aulldvrm()         88.367491909     82.327027975
_aullmul()          10.869423368      4.828959434
                   105.277379211 clock cycles

_aullnop()           9.492139142      0
_aulldvrm()         67.661042025     58.168902883
_aullmul()          14.151106441      4.658967299
                    91.304287608 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 7 2700X Eight-Core Processor

__unopdi4()          8.218757132      0
__udivmoddi4()      68.973908219     60.755151087
__umuldi3()         16.682145911      8.463388779
                    93.874811262 clock cycles

__unopdi4()         11.236472949      0
__udivmoddi4()     125.545346907    114.308873958
__umuldi3()         20.463970255      9.227497306
                   157.245790111 clock cycles
[…]

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 5 3600 6-Core Processor

__unopdi4()          8.156522700      0
__udivmoddi4()      26.587471432     18.430948732
__umuldi3()          8.227731420      0.071208720
                    42.971725552 clock cycles

__unopdi4()         11.171514133      0
__udivmoddi4()      24.479854346     13.308340213
__umuldi3()         11.409089652      0.237575519
                    47.060458131 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 5 3600 6-Core Processor

_aullnop()           6.010190784      0
_aulldvrm()         26.050758627     20.040567843
_aullmul()           8.197509891      2.187319107
                    40.258459302 clock cycles

_aullnop()           9.910357023      0
_aulldvrm()         24.299246162     14.388889139
_aullmul()          11.622902328      1.712545305
                    45.832505513 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 5 3600 6-Core Processor

_aullnop()           6.010334245      0
_aulldvrm()         79.145175680     73.134841435
_aullmul()           9.026602597      3.016268352
                    94.182112522 clock cycles

_aullnop()           9.909955188      0
_aulldvrm()         53.473960584     43.564005396
_aullmul()          13.012305555      3.102350367
                    76.396221327 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 5 3600 6-Core Processor

__unopdi4()          8.154862312      0
__udivmoddi4()      64.296126258     56.141263946
__umuldi3()         14.782954357      6.628092045
                    87.233942927 clock cycles

__unopdi4()         11.159903449      0
__udivmoddi4()     115.862991234    104.703087785
__umuldi3()         18.352062682      7.192159233
                   145.374957365 clock cycles
[…]

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 5 3600 6-Core Processor

__unopdi4()          2.2656250      0
__udivmoddi4()       7.4062500      5.1406250
__umuldi3()          2.2812500      0.0156250
                    11.9531250 nano-seconds

__unopdi4()          3.1093750      0
__udivmoddi4()       6.7812500      3.6718750
__umuldi3()          3.2343750      0.1250000
                    13.1250000 nano-seconds

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 5 3600 6-Core Processor

_aullnop()           1.6718750      0
_aulldvrm()          7.2500000      5.5781250
_aullmul()           2.2812500      0.6093750
                    11.2031250 nano-seconds

_aullnop()           2.7343750      0
_aulldvrm()          6.7812500      4.0468750
_aullmul()           3.2343750      0.5000000
                    12.7500000 nano-seconds

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 5 3600 6-Core Processor

_aullnop()           1.6718750      0
_aulldvrm()         22.0156250     20.3437500
_aullmul()           2.5156250      0.8437500
                    26.2031250 nano-seconds

_aullnop()           2.7500000      0
_aulldvrm()         14.9218750     12.1718750
_aullmul()           3.6093750      0.8593750
                    21.2812500 nano-seconds

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 5 3600 6-Core Processor

__unopdi4()          2.2656250      0
__udivmoddi4()      18.0000000     15.7343750
__umuldi3()          4.2968750      2.0312500
                    24.5625000 nano-seconds

__unopdi4()          3.1093750      0
__udivmoddi4()      32.2812500     29.1718750
__umuldi3()          5.1875000      2.0781250
                    40.5781250 nano-seconds
[…]

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 9 3900XT 12-Core Processor

__unopdi4()          7.820339238      0
__udivmoddi4()      25.219830234     17.399490996
__umuldi3()          7.896016504      0.075677266
                    40.936185976 clock cycles

__unopdi4()         10.683594514      0
__udivmoddi4()      23.327167298     12.643572784
__umuldi3()         10.966053464      0.282458950
                    44.976815276 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 9 3900XT 12-Core Processor

_aullnop()           5.782054086      0
_aulldvrm()         24.732162011     18.950107925
_aullmul()           7.867011224      2.084957138
                    38.381227321 clock cycles

_aullnop()           9.460103500      0
_aulldvrm()         23.216316090     13.756212590
_aullmul()          11.151011086      1.690907586
                    43.827430676 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 9 3900XT 12-Core Processor

_aullnop()           5.815668129      0
_aulldvrm()         75.679497601     69.863829472
_aullmul()           9.474285823      3.658617694
                    90.969451553 clock cycles

_aullnop()           9.458013729      0
_aulldvrm()         50.089870151     40.631856422
_aullmul()          12.508264043      3.050250314
                    72.056147923 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 9 3900XT 12-Core Processor

__unopdi4()          7.817188050      0
__udivmoddi4()      61.087041419     53.269853369
__umuldi3()         14.145779236      6.328591186
                    83.050008705 clock cycles

__unopdi4()         10.659368382      0
__udivmoddi4()     109.785429868     99.126061486
__umuldi3()         17.585432466      6.926064084
                   138.030230716 clock cycles
[…]

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 9 3900XT 12-Core Processor

__unopdi4()          2.0781250      0
__udivmoddi4()       6.6562500      4.5781250
__umuldi3()          2.0937500      0.0156250
                    10.8281250 nano-seconds

__unopdi4()          2.8125000      0
__udivmoddi4()       6.2343750      3.4218750
__umuldi3()          2.9843750      0.1718750
                    12.0312500 nano-seconds

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 9 3900XT 12-Core Processor

_aullnop()           1.5312500      0
_aulldvrm()          6.5312500      5.0000000
_aullmul()           2.1718750      0.6406250
                    10.2343750 nano-seconds

_aullnop()           2.5000000      0
_aulldvrm()          6.1250000      3.6250000
_aullmul()           2.9375000      0.4375000
                    11.5625000 nano-seconds

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 9 3900XT 12-Core Processor

_aullnop()           1.5000000      0
_aulldvrm()         19.7187500     18.2187500
_aullmul()           2.5000000      1.0000000
                    23.7187500 nano-seconds

_aullnop()           2.4843750      0
_aulldvrm()         13.2031250     10.7187500
_aullmul()           3.2812500      0.7968750
                    18.9687500 nano-seconds

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 9 3900XT 12-Core Processor

__unopdi4()          2.0625000      0
__udivmoddi4()      16.2343750     14.1718750
__umuldi3()          3.8906250      1.8281250
                    22.1875000 nano-seconds

__unopdi4()          2.8125000      0
__udivmoddi4()      29.0781250     26.2656250
__umuldi3()          4.7187500      1.9062500
                    36.6093750 nano-seconds

Contact

If you miss anything here, have additions, comments, corrections, criticism or questions, want to give feedback, hints or tips, report broken links, bugs, deficiencies, errors, inaccuracies, misrepresentations, omissions, shortcomings, vulnerabilities or weaknesses, …: don’t hesitate to contact me and feel free to ask, comment, criticise, flame, notify or report!

Use the X.509 certificate to send S/MIME encrypted mail.

Note: email in weird format and without a proper sender name is likely to be discarded!

I dislike HTML (and even weirder formats) in email; I prefer to receive plain text.
I also expect to see your full (real) name as sender, not your nickname.
I abhor top posts and expect inline quotes in replies.

Terms and Conditions

By using this site, you signify your agreement to these terms and conditions. If you do not agree to these terms and conditions, do not use this site!

Data Protection Declaration

This web page records no (personal) data and stores no cookies in the web browser.

The web service is operated and provided by

Telekom Deutschland GmbH
Business Center
D-64306 Darmstadt
Germany
<‍hosting‍@‍telekom‍.‍de‍>
+49 800 5252033

The web service provider stores a session cookie in the web browser and records every visit of this web site with the following data in an access log on their server(s):


Copyright © 1995–2020 • Stefan Kanthak • <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>