Valid HTML 4.01 Transitional Valid CSS Valid SVG 1.0

Me, myself & IT

Fast(est) 128÷128-bit and 64÷64-bit Integer Division

Purpose
Algorithms
Divide & Conquer Division
Shift & Subtract Division
Implementation for AMD64 Processors
128÷128-bit Unsigned Integer Division (128-bit Quotient and Remainder)
64÷64-bit Unsigned Integer Division (64-bit Quotient and Remainder)
Implementation for i386 Processors
64÷64-bit Unsigned Integer Division (64-bit Quotient and Remainder)
64÷64-bit Unsigned Integer Division (64-bit Quotient)
64÷64-bit Unsigned Integer Division (64-bit Remainder)
64÷64-bit Signed Integer Division (64-bit Quotient)
64÷64-bit Signed Integer Division (64-bit Remainder)
64×64-bit Signed and Unsigned Integer Multiplication (64-bit Product)
Execution Times (Sustained Reciprocal Throughput)
Summary Table (Running ’round in Circles Cycles)
Benchmark Programs for AMD64 Processors
Benchmark Program for i386 Processors

Purpose

Present fast 128÷128-bit unsigned integer division routines __udivmodti4() for AMD64 processors, the fast 64÷64-bit unsigned integer division routines __udivmoddi4(), __udivdi3() and __umoddi3() for i386 processors, plus the fast 64÷64-bit signed integer division routines __divdi3() and __moddi3() for i386 processors.

Note: a fast 128÷64-bit unsigned integer division routine for i386 processors, implemented in ANSI C and Assembler, is presented in my article Donald Knuth’s Algorithm D, and provided with my NOMSVCRT.LIB library.

Algorithms

Divide & Conquer Division

The divide & conquer division algorithm is quite simple: it relies on the processor’s DIV instruction, which performs a so-called narrowing 128÷64-bit division, producing a 64-bit quotient and a 64-bit remainder from an 128-bit dividend and a 64-bit divisor.
Exceptional case
If the divisor is 0, a divide by 0 exception is raised.
Note: this is handled in the second simple case!
Simple cases
If the divisor is less than 264, but greater than the upper half of the dividend, the upper halves of quotient and remainder are 0, and a single DIV instruction yields their lower halves.
If the divisor is less than 264 and not greater than the upper half of the dividend, the upper half of the remainder is 0 too, while its lower half and (both halves of) the quotient are produced with two consecutive DIV instructions using the so-called long alias schoolbook division (and 64-bit numbers as digits) to avoid an overflow of the quotient.
Hard case (multiple steps)
If the divisor is not less than 264, it is normalized, i.e. shifted left until its most significant bit is set, which is equivalent to a division by 264−number of leading '0' bits, and its lower half discarded.
The truncated normalized divisor′ is eventually subtracted from the upper half of the dividend to prevent an overflow, then used to produce an intermediate approximate quotient′ with a single DIV instruction.
The intermediate approximate quotient′ is shifted left by the same amount as the normalized divisor′, giving the final approximate quotient″, which might be one to high due to the discarded lower half of the normalized divisor′.
The approximate remainder′ is computed as dividend minus the product of original divisor and final approximate quotient″.
If the approximate remainder′ is less than 0, the original divisor is added, while the final approximate quotient″ is decremented by one, producing the proper quotient and remainder.

Shift & Subtract Division

The shift & subtract alias binary long division algorithm is almost trivial: it’s the schoolbook algorithm using bits as digits.
Exceptional case
If the divisor is 0, a divide by 0 exception is raised.
Trivial cases
If the divisor is greater than the dividend (what implies that the divisor is greater than 0), the quotient is 0, while the remainder is equal to the dividend.
If the divisor is equal to the dividend and (both are) greater than 0, the quotient is 1, while the remainder is 0.
Normal case (multiple steps with a loop)
The quotient is set to 0.
The divisor is aligned to the dividend, i.e. shifted left until their most significant bits are in the same position.
Until the divisor is back in its original position,
¹ the quotient is shifted left one bit,
² if the dividend is not less than the divisor, the divisor is subtracted from the dividend, and the quotient incremented by one, i.e. its least significand bit is set,
³ the divisor is shifted right one bit.
The remainder is the (remaining) dividend.

Implementation for AMD64 Processors

Prototype for the __udivmoddi4() function, and sample C implementation of the 64÷64-bit shift & subtract division:
// Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifndef _MSC_VER
uint64_t __udivmoddi4(uint64_t dividend, uint64_t divisor, uint64_t *remainder);
#else
#pragma intrinsic(_BitScanReverse64)

typedef unsigned __int32 uint32_t;
typedef unsigned __int64 uint64_t;

uint64_t __udivmoddi4(uint64_t dividend, uint64_t divisor, uint64_t *remainder)
{
    uint64_t quotient;
    uint32_t index1, index2;

    if (_BitScanReverse64(&index2, divisor))
        if (_BitScanReverse64(&index1, dividend))
#if 0
            if (index1 >= index2)
#else
            if (dividend >= divisor)
#endif
            {
                // dividend >= divisor > 0,
                //  64 > index1 >= index2 >= 0
                // NOTE: number of leading '0' bits = 64 - index

                divisor <<= index1 - index2;
                quotient = 0;

                do
                {
                    quotient <<= 1;

                    if (dividend >= divisor)
                    {
                        dividend -= divisor;
                        quotient |= 1;
                    }

                    divisor >>= 1;
                } while (index1 >= ++index2);

                if (remainder != 0)
                    *remainder = dividend;

                return quotient;
            }
            else // divisor > dividend > 0:
                 //  quotient = 0, remainder = dividend
            {
                if (remainder != 0)
                    *remainder = dividend;

                return 0;
            }
        else // divisor > dividend == 0:
             //  quotient = 0, remainder = 0
        {
            if (remainder != 0)
                *remainder = 0;

            return 0;
        }
    else // divisor == 0
    {
        if (remainder != 0)
            return dividend % divisor;

        return dividend / divisor;
    }
}
#endif // _MSC_VER
The suffix di4 specifies the number of arguments plus return value and their size: double integer denotes an 8-byte QWORD alias 64-bit uint64_t.

Prototype for the __udivmodti4() function, and sample C implementation of the 128÷128-bit divide & conquer as well as the shift & subtract division:

// Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifndef _MSC_VER
uint128_t __udivmodti4(uint128_t dividend, uint128_t divisor, uint128_t *remainder);
#else
typedef unsigned __int32 uint32_t;
typedef unsigned __int64 uint64_t;
#if 0
typedef unsigned __int128 uint128_t;
#else
typedef struct
{
    uint64_t low;
    uint64_t high;
} uint128_t;
#endif
#if _MSC_VER >= 1920 // MSC 19.20 alias 2019
#pragma intrinsic(__shiftleft128, __shiftright128, _udiv128, _umul128, _BitScanReverse64)

uint128_t __udivmodti4(uint128_t dividend, uint128_t divisor, uint128_t *remainder)
{
    uint128_t quotient;
#ifndef HYBRID
    uint64_t  tmp, low, high;
    uint32_t  index;

    if (_BitScanReverse64(&index, divisor.high))
    {
        tmp = __shiftleft128(divisor.low, divisor.high, 64 - index);

        if (tmp < dividend.high)
        {
            tmp = _udiv128(dividend.high - tmp, dividend.low, tmp, &tmp);
            tmp = __shiftleft128(tmp, 1, 64 - index);
        }
        else
        {
            tmp = _udiv128(dividend.high, dividend.low, tmp, &tmp);
#if 0
            tmp = __shiftleft128(tmp, 0, 64 - index);
#else
            tmp <<= 64 - index;
#endif
        }

        quotient.high = 0;
        quotient.low = tmp;

        low = _umul128(tmp, divisor.low, &high);
        tmp *= divisor.high;
        high += tmp;

        if ((high < tmp)           // quotient * divisor >= 2**128 > dividend
         || (high > dividend.high) // quotient * divisor > dividend
         || ((high == dividend.high) && (low > dividend.low)))
        {
            quotient.low -= 1;

            low = _umul128(quotient.low, divisor.low, &high);
            high += quotient.low * divisor.high;
        }

        if (remainder != 0)
        {
            dividend.high -= high + (dividend.low < low);
            dividend.low -= low;

            *remainder = dividend;
        }
    }
#else
    uint64_t  tmp;
    uint32_t  index1, index2;

    if (_BitScanReverse64(&index2, divisor.high))
        if (_BitScanReverse64(&index1, dividend.high))
            if (index1 >= index2)
            {
                // dividend >= divisor >= 2**64,
                //  64 > index1 >= index2 >= 0
                // NOTE: number of leading '0' bits = 64 - index

                divisor.high = __shiftleft128(divisor.low, divisor.high, index1 - index2);
                divisor.low <<= index1 - index2;

                quotient.high = quotient.low = 0;

                do
                {
                    quotient.low <<= 1;

                    if ((dividend.high > divisor.high)
                     || ((dividend.high == divisor.high) && (dividend.low >= divisor.low)))
                    {
                        dividend.high -= divisor.high + (dividend.low < divisor.low);
                        dividend.low -= divisor.low;

                        quotient.low |= 1;
                    }

                    divisor.low = __shiftright128(divisor.low, divisor.high, 1);
                    divisor.high >>= 1;
                } while (index1 >= ++index2);

                if (remainder != 0)
                    *remainder = dividend;
            }
            else // divisor > dividend >= 2**64:
                 //  quotient = 0, remainder = dividend
            {
                if (remainder != 0)
                    *remainder = dividend;
            }
        else // divisor >= 2**64 > dividend:
             //  quotient = 0, remainder = dividend
        {
            if (remainder != 0)
#if 1
                *remainder = dividend;
#else
            {
                remainder->high = 0;
                remainder->low = dividend.low;
            }
#endif
        }
#endif // HYBRID
    else // divisor < 2**64
    {
        if (dividend.high < divisor.low)
        {
            quotient.high = 0;
            quotient.low = _udiv128(dividend.high, dividend.low, divisor.low, &tmp);
        }
        else // "long" alias "schoolbook" division
        {
            quotient.high = _udiv128(0, dividend.high, divisor.low, &tmp);
            quotient.low = _udiv128(tmp, dividend.low, divisor.low, &tmp);
        }

        if (remainder != 0)
        {
            remainder->high = 0;
            remainder->low = tmp;
        }
    }

    return quotient;
}
#endif
#endif // _MSC_VER
The suffix ti4 specifies the number of arguments plus return value and their size: tetra integer denotes a 16-byte OWORD alias 128-bit uint128_t.

Note: the Microsoft® Visual C compiler does not provide a 128-bit integer data type; the keyword __int128 is reserved, but unsupported, its use yields error C4235.

Note: with the macro HYBRID defined, the hard case of the divide & conquer division is replaced by the normal case of the shift & subtract division.

128÷128-bit Unsigned Integer Division (128-bit Quotient and Remainder)

__udivmodti4() function for AMD64 processors, supporting the Unix® System V calling convention, using the divide & conquer algorithm:
# Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

# * The software is provided "as is" without any warranty, neither express
#   nor implied.
# * In no event will the author be held liable for any damage(s) arising
#   from the use of the software.
# * Redistribution of the software is allowed only in unmodified form.
# * Permission is granted to use the software solely for personal private
#   and non-commercial purposes.
# * An individuals use of the software in his or her capacity or function
#   as an agent, (independent) contractor, employee, member or officer of
#   a business, corporation or organization (commercial or non-commercial)
#   does not qualify as personal private and non-commercial purpose.
# * Without written approval from the author the software must not be used
#   for a business, for commercial, corporate, governmental, military or
#   organizational purposes of any kind, or in a commercial, corporate,
#   governmental, military or organizational environment of any kind.

# Unix System V calling convention for AMD64 platform:
# - first 6 integer or pointer arguments (from left to right) are passed
#   in registers RDI/R7, RSI/R6, RDX/R2, RCX/R1, R8 and R9
#   (R10 is used as static chain pointer in case of nested functions);
# - surplus arguments are pushed on stack in reverse order (from right to
#   left);
# - 128-bit integer result is returned in registers RAX/R0 (low part) and
#   RDX/R2 (high part);
# - 64-bit integer or pointer result is returned in register RAX/R0;
# - 32-bit integer result is returned in register EAX;
# - registers RBX/R3, RSP/R4, RBP/R5, R12, R13, R14 and R15 must be
#   preserved;
# - registers RAX/R0, RCX/R1, RDX/R2, RSI/R6, RDI/R7, R8, R9, R10 (in
#   case of normal functions) and R11 are volatile and can be clobbered;
# - stack is 16-byte aligned: callee must decrement RSP by 8+n*16 bytes
#   before calling other functions (CALL pushes 8 bytes);
# - a "red zone" of 128 bytes below the stack pointer can be clobbered.

# NOTE: raises "division exception" when divisor is 0!

#	.arch	generic64
	.code64
	.intel_syntax noprefix
	.global	___udivmodti4
#	.type	___udivmodti4, @function
	.text
				# rdi = oword ptr dividend
				# rsi = oword ptr divisor
				# rdx = oword ptr remainder
___udivmodti4:
	mov	r10, rdx	# r10 = address of (optional) remainder

	mov	r9, [rsi+8]	# r9 = high qword of divisor
	mov	r8, [rsi]	# r8 = low qword of divisor

	mov	rdx, [rdi+8]	# rdx = high qword of dividend
	mov	rax, [rdi]	# rax = low qword of dividend

	bsr	rcx, r9		# rcx = index of leading '1' bit in high qword of divisor
	jnz	.extended	# high qword of divisor <> 0?

	# divisor < 2**64 (so remainder will be < 2**64 too)

	cmp	rdx, r8
	jae	.long		# high qword of dividend >= (low qword of) divisor?

	# dividend < divisor * 2**64 (so quotient will be < 2**64):
	# perform normal division
.normal:
	div	r8		# rax = (low qword of) quotient,
				# rdx = (low qword of) remainder
	test	r10, r10
	jz	0f		# address of remainder = 0?

	mov	[r10], rdx
	mov	[r10+8], r9	# high qword of remainder = 0
0:
	mov	rdx, r9		# rdx:rax = quotient
	ret

	# dividend >= divisor * 2**64:
	# perform "long" alias "schoolbook" division
.long:
	mov	rcx, rax	# rcx = low qword of dividend
	mov	rax, rdx	# rax = high qword of dividend
	mov	rdx, r9		# rdx:rax = high qword of dividend
	div	r8		# rax = high qword of quotient,
				# rdx = high qword of remainder'
	xchg	rcx, rax	# rcx = high qword of quotient,
				# rax = low qword of dividend
	div	r8		# rax = low qword of quotient,
				# rdx = (low qword of) remainder
	test	r10, r10
	jz	1f		# address of remainder = 0?

	mov	[r10], rdx
	mov	[r10+8], r9	# high qword of remainder = 0
1:
	mov	rdx, rcx	# rdx:rax = quotient
	ret

	# divisor >= 2**64 (so quotient will be < 2**64):
	# perform "extended & adjusted" division
.extended:
	not	ecx		# ecx = number of leading '0' bits in (high qword of) divisor
	mov	r11, r9		# r11 = high qword of divisor
	shld	r11, r8, cl	# r11 = divisor / 2**(index + 1)
				#     = divisor'
	mov	rsi, rdx	# rsi = high qword of dividend
	mov	rdi, rax	# rdi = low qword of dividend

	push	rbx
.ifnotdef JMPLESS
	xor	ebx, ebx	# rbx = high qword of quotient' = 0

	cmp	rdx, r11
	jb	2f		# high qword of dividend < divisor'?

	# high qword of dividend >= divisor':
	# subtract divisor' from high qword of dividend to prevent possible
	# quotient overflow and set most significant bit of quotient"

	sub	rdx, r11	# rdx = high qword of dividend - divisor'
				#     = high qword of dividend'
	inc	ebx		# rbx = high qword of quotient' = 1
2:
.elseif 0
	sub	rdx, r11	# rdx = high qword of dividend - divisor'
	sbb	rbx, rbx	# rbx = (high qword of dividend < divisor') ? -1 : 0
	and	rbx, r11	# rbx = (high qword of dividend < divisor') ? divisor' : 0
	add	rdx, rbx	# rdx = high qword of dividend
				#     - (high qword of dividend < divisor') ? 0 : divisor'
				#     = high qword of dividend'
	neg	rbx		# CF = (high qword of dividend < divisor') ? 1 : 0
	sbb	ebx, ebx	# ebx = (high qword of dividend < divisor') ? -1 : 0
	inc	ebx		# rbx = (high qword of dividend < divisor') ? 0 : 1
				#     = high qword of quotient'
.else
	sub	rdx, r11	# rdx = high qword of dividend - divisor'
	cmovb	rdx, rsi	#     = high qword of dividend'
	sbb	ebx, ebx	# ebx = (high qword of dividend < divisor') ? -1 : 0
	inc	ebx		# rbx = (high qword of dividend < divisor') ? 0 : 1
				#     = high qword of quotient'
.endif # JMPLESS
	# high qword of dividend' < divisor'

	div	r11		# rax = low qword of quotient'
				#     = dividend' / divisor',
				# rdx = remainder'
	shld	rbx, rax, cl	# rbx = quotient' / 2**(index + 1)
				#     = dividend / divisor
				#     = quotient"

	mov	r11, r9		# r11 = high qword of divisor
	mov	rax, r8		# rax = low qword of divisor
	imul	r11, rbx	# r11 = high qword of divisor * quotient"
	mul	rbx		# rdx:rax = low qword of divisor * quotient"
.ifnotdef JMPLESS
	add	rdx, r11	# rdx:rax = divisor * quotient"
	jnc	3f		# divisor * quotient" < 2**64?
				# with carry, it is off by divisor,
				#  and quotient" is off by 1

	dec	rbx		# rbx = quotient" - 1
	sub	rax, r8
	sbb	rdx, r9		# rdx:rax = divisor * (quotient" - 1)
3:
	sub	rdi, rax
	sbb	rsi, rdx	# rsi:rdi = dividend - divisor * quotient"
				#         = remainder"
.else
	sub	rdi, rax
	sbb	rsi, rdx	# rsi:rdi = dividend - low qword of divisor * quotient"
	sub	rsi, r11	# rsi:rdi = dividend - divisor * quotient"
				#         = remainder"
.endif # JMPLESS
	jnb	4f		# remainder" >= 0?
				# with borrow, it is off by divisor,
				#  and quotient" is off by 1

	dec	rbx		# rbx = quotient" - 1
				#     = quotient
	add	rdi, r8
	adc	rsi, r9		# rsi:rdi = remainder" + divisor
				#         = dividend - divisor * (quotient" - 1)
				#         = dividend - divisor * quotient
				#         = remainder
4:
	test	r10, r10
	jz	5f		# address of remainder = 0?

	mov	[r10], rdi
	mov	[r10+8], rsi	# remainder = rsi:rdi
5:
	mov	rax, rbx	# rax = (low qword of) quotient
	xor	edx, edx	# rdx:rax = quotient

	pop	rbx
	ret

	.end
__udivmodti4() function for AMD64 processors, supporting the Microsoft calling convention, using the divide & conquer algorithm:
; Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

; Microsoft calling convention for AMD64 platform:
; - first 4 integer or pointer arguments (from left to right) are passed
;   in registers RCX/R1, RDX/R2, R8 and R9;
; - surplus arguments are pushed on stack in reverse order (from right
;   to left);
; - caller always allocates "home space" for 4 arguments on stack, even
;   when less than 4 arguments are passed, but does not need to push
;   first 4 arguments;
; - callee can spill first 4 arguments from registers to "home space";
; - callee can clobber "home space";
; - stack is 16-byte aligned: callee must decrement RSP by 8+n*16 bytes
;   when it calls other functions (CALL pushes 8 bytes);
; - 64-bit integer or pointer result is returned in register RAX/R0;
; - 32-bit integer result is returned in register EAX;
; - registers RAX/R0, RCX/R1, RDX/R2, R8, R9, R10 and R11 are volatile
;   and can be clobbered;
; - registers RBX/R3, RSP/R4, RBP/R5, RSI/R6, RDI/R7, R12, R13, R14 and
;   R15 must be preserved.

	.code
				; rcx = oword ptr quotient
				; rdx = oword ptr dividend
				; r8 = oword ptr divisor
				; r9 = oword ptr remainder
__udivmodti4 proc public

	mov	rax, [rdx]	; rax = low qword of dividend
	mov	rdx, [rdx+8]	; rdx = high qword of dividend

	mov	r10, [r8]	; r10 = low qword of divisor
	mov	r11, [r8+8]	; r11 = high qword of divisor

	mov	r8, rcx		; r8 = address of quotient

	bsr	rcx, r11	; rcx = index of leading '1' bit in high qword of divisor
	jnz	extended	; high qword of divisor <> 0?

	; divisor < 2**64 (so remainder will be < 2**64 too)

	cmp	rdx, r10
	jae	long		; high qword of dividend >= (low qword of) divisor?

	; dividend < divisor * 2**64 (so quotient will be < 2**64):
	; perform normal division
normal:
	div	r10		; rax = (low qword of) quotient,
				; rdx = (low qword of) remainder
	mov	[r8], rax
	mov	[r8+8], r11	; high qword of quotient = 0

	test	r9, r9
	jz	@f		; address of remainder = 0?

	mov	[r9], rdx
	mov	[r9+8], r11	; high qword of remainder = 0
@@:
	mov	rax, r8		; rax = address of quotient
	ret

	; dividend >= divisor * 2**64:
	; perform "long" alias "schoolbook" division
long:
	mov	rcx, rax	; rcx = low qword of dividend
	mov	rax, rdx	; rax = high qword of dividend
	mov	rdx, r11	; rdx:rax = high qword of dividend
	div	r10		; rax = high qword of quotient,
				; rdx = high qword of remainder'
	xchg	rcx, rax	; rcx = high qword of quotient,
				; rax = low qword of dividend
	div	r10		; rax = low qword of quotient,
				; rdx = (low qword of) remainder
	mov	[r8], rax
	mov	[r8+8], rcx	; quotient = rcx:rax

	test	r9, r9
	jz	@f		; address of remainder = 0?

	mov	[r9], rdx
	mov	[r9+8], r11	; high qword of remainder = 0
@@:
	mov	rax, r8		; rax = address of quotient
	ret

	; divisor >= 2**64 (so quotient will be < 2**64):
	; perform "extended & adjusted" division
extended:
	mov	[rsp+8], rbx
	mov	[rsp+16], r12
	mov	[rsp+24], r13
	mov	[rsp+32], r14

	not	ecx		; ecx = number of leading '0' bits in (high qword of) divisor
	mov	r12, r11	; r12 = high qword of divisor
	shld	r12, r10, cl	; r12 = divisor / 2**(index + 1)
				;     = divisor'
	mov	r13, rax	; r13 = low qword of dividend
	mov	r14, rdx	; r14 = high qword of dividend
ifndef JMPLESS
	xor	ebx, ebx	; rbx = high qword of quotient' = 0

	cmp	rdx, r12
	jb	@f		; high qword of dividend < divisor'?

	; high qword of dividend >= divisor':
	; subtract divisor' from high qword of dividend to prevent possible
	; quotient overflow and set most significant bit of quotient"

	sub	rdx, r12	; rdx = high qword of dividend - divisor'
				;     = high qword of dividend'
	inc	ebx		; rbx = high qword of quotient' = 1
@@:
elseif 0
	sub	rdx, r12	; rdx = high qword of dividend - divisor'
	sbb	rbx, rbx	; rbx = (high qword of dividend < divisor') ? -1 : 0
	and	rbx, r12	; rbx = (high qword of dividend < divisor') ? divisor' : 0
	add	rdx, rbx	; rdx = high qword of dividend
				;     - (high qword of dividend < divisor') ? 0 : divisor'
				;     = high qword of dividend'
	neg	rbx		; CF = (high qword of dividend < divisor') ? 1 : 0
	sbb	ebx, ebx	; ebx = (high qword of dividend < divisor') ? -1 : 0
	inc	ebx		; rbx = (high qword of dividend < divisor') ? 0 : 1
				;     = high qword of quotient'
else
	sub	rdx, r12	; rdx = high qword of dividend - divisor'
	cmovb	rdx, r14	;     = high qword of dividend'
	sbb	ebx, ebx	; ebx = (high qword of dividend < divisor') ? -1 : 0
	inc	ebx		; rbx = (high qword of dividend < divisor') ? 0 : 1
				;     = high qword of quotient'
endif ; JMPLESS
	; high qword of dividend' < divisor'

	div	r12		; rax = low qword of quotient'
				;     = dividend' / divisor',
				; rdx = remainder'
	shld	rbx, rax, cl	; rbx = quotient' / 2**(index + 1)
				;     = dividend / divisor
				;     = quotient"
ifndef JMPLESS
	mov	r12, r11	; r12 = high qword of divisor
	mov	rax, r10	; rax = low qword of divisor
	imul	r12, rbx	; r12 = high qword of divisor * quotient"
	mul	rbx		; rdx:rax = low qword of divisor * quotient"
	add	rdx, r12	; rdx:rax = divisor * quotient"
	jnc	@f		; divisor * quotient" < 2**64?
				; with carry, it is off by divisor,
				;  and quotient" is off by 1

	dec	rbx		; rbx = quotient" - 1

	sub	rax, r10
	sbb	rdx, r11	; rdx:rax = divisor * (quotient" - 1)
@@:
	sub	r13, rax
	sbb	r14, rdx	; r14:r13 = dividend - divisor * quotient"
				;         = remainder"
else
	mov	r12, r11	; r12 = high qword of divisor
	mov	rax, r10	; rax = low qword of divisor
	imul	r12, rbx	; r12 = high qword of divisor * quotient"
	mul	rbx		; rdx:rax = low qword of divisor * quotient"
	sub	r13, rax
	sbb	r14, rdx	; r14:r13 = dividend - low qword of divisor * quotient"
	sub	r14, r12	; r14:r13 = dividend - divisor * quotient"
				;         = remainder"
endif ; JMPLESS
	jnb	@f		; remainder" >= 0?
				; with borrow, it is off by divisor,
				;  and quotient" is off by 1

	dec	rbx		; rbx = quotient" - 1
				;     = quotient
	add	r13, r10
	adc	r14, r11	; r14:r13 = remainder" + divisor
				;         = dividend - divisor * (quotient" - 1)
				;         = dividend - divisor * quotient
				;         = remainder
@@:
	xor	eax, eax	; rax = high qword of quotient = 0
	mov	[r8], rbx
	mov	[r8+8], rax	; quotient = rax:rbx

	test	r9, r9
	jz	@f		; address of remainder = 0?

	mov	[r9], r13
	mov	[r9+8], r14	; remainder = r14:r13
@@:
	mov	rbx, [rsp+8]
	mov	r12, [rsp+16]
	mov	r13, [rsp+24]
	mov	r14, [rsp+32]

	mov	rax, r8		; rax = address of quotient
	ret

__udivmodti4 endp
	end
__udivmodti4() function for AMD64 processors, supporting the Unix System V calling convention, using the divide & conquer algorithm with its hard case replaced by the normal case of the shift & subtract algorithm:
# Copyright © 2020-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#	.arch	generic64
	.code64
	.intel_syntax noprefix
	.global	___udivmodti4
#	.type	___udivmodti4, @function
	.text
				# rdi = oword ptr dividend
				# rsi = oword ptr divisor
				# rdx = oword ptr remainder
___udivmodti4:
	push	rbx

	mov	r10, rdx	# r10 = address of (optional) remainder

	mov	r9, [rsi+8]	# r9 = high qword of divisor
	mov	r8, [rsi]	# r8 = low qword of divisor

	mov	rdx, [rdi+8]	# rdx = high qword of dividend
	mov	rax, [rdi]	# rax = low qword of dividend

	bsr	rbx, r9		# rbx = index of leading '1' bit in high qword of divisor
	jz	.simple		# high qword of divisor = 0?

	# divisor >= 2**64 (so quotient will be < 2**64)

	bsr	rcx, rdx	# rcx = index of leading '1' bit in high qword of dividend
	jz	.trivial	# high qword of dividend = 0?

	# divisor >= 2**64, dividend >= 2**64:
	# perform "binary long" alias "shift & subtract" division
.large:
	sub	ecx, ebx	# ecx = distance of leading '1' bits
	jb	.zero		# dividend < divisor?

	xor	ebx, ebx	# rbx = (low qword of) quotient' = 0

	shld	r9, r8, cl
	shl	r8, cl		# r9:r8 = divisor << distance of leading '1' bits
				#       = divisor'
.loop:
	mov	rsi, rax
	mov	rdi, rdx	# rdi:rsi = dividend'
	sub	rax, r8
	sbb	rdx, r9		# rdx:rax = dividend' - divisor'
				#         = dividend",
				# CF = dividend' < divisor'
	cmovb	rax, rsi
	cmovb	rdx, rdi	# rdx:rax = (dividend' < divisor') ? dividend' : dividend"
	cmc			# CF = dividend' >= divisor'
	adc	rbx, rbx	# rbx = quotient' << 1
				#     + dividend' >= divisor'
				#     = quotient"
	shrd	r8, r9, 1
	shr	r9, 1		# r9:r8 = divisor' >> 1
				#       = divisor"
	dec	ecx
	jns	.loop

	test	r10, r10
	jz	0f		# address of remainder = 0?

	mov	[r10], rax
	mov	[r10+8], rdx	# remainder = dividend"
0:
	mov	rax, rbx	# rax = (low qword of) quotient
	xor	edx, edx	# rdx:rax = quotient

	pop	rbx
	ret

	# dividend < divisor
.zero:
	test	r10, r10
	jz	1f		# address of remainder = 0?

	mov	[r10], rax
	mov	[r10+8], rdx	# remainder = dividend"
1:
	xor	eax, eax
	xor	edx, edx	# rdx:rax = quotient = 0

	pop	rbx
	ret

	# dividend < 2**64 <= divisor
.trivial:
	test	r10, r10
	jz	2f		# address of remainder = 0?

	mov	[r10], rax
	mov	[r10+8], rdx	# remainder = dividend"
2:
	mov	eax, edx	# rdx:rax = quotient = 0

	pop	rbx
	ret

	# divisor < 2**64 (so remainder will be < 2**64 too)
.simple:
	cmp	rdx, r8
	jae	.long		# high qword of dividend >= (low qword of) divisor?

	# dividend < divisor * 2**64 (so quotient will be < 2**64):
	# perform normal division
.normal:
	div	r8		# rax = (low qword of) quotient,
				# rdx = (low qword of) remainder
	test	r10, r10
	jz	3f		# address of remainder = 0?

	mov	[r10], rdx
	mov	[r10+8], r9	# high qword of remainder = 0
3:
	mov	rdx, r9		# rdx:rax = quotient

	pop	rbx
	ret

	# dividend >= divisor * 2**64:
	# perform "long" alias "schoolbook" division
.long:
	mov	rcx, rax	# rcx = low qword of dividend
	mov	rax, rdx	# rax = high qword of dividend
	mov	rdx, r9		# rdx:rax = high qword of dividend
	div	r8		# rax = high qword of quotient,
				# rdx = high qword of remainder'
	xchg	rcx, rax	# rcx = high qword of quotient,
				# rax = low qword of dividend
	div	r8		# rax = low qword of quotient,
				# rdx = (low qword of) remainder
	test	r10, r10
	jz	4f		# address of remainder = 0?

	mov	[r10], rdx
	mov	[r10+8], r9	# high qword of remainder = 0
4:
	mov	rdx, rcx	# rdx:rax = quotient

	pop	rbx
	ret

	.end
__udivmodti4() function for AMD64 processors, supporting the Microsoft calling convention, using the divide & conquer algorithm with its hard case replaced by the normal case of the shift & subtract algorithm:
; Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.code

__udivmodti4 proc public

	mov	rax, [rdx]	; rax = low qword of dividend
	mov	rdx, [rdx+8]	; rdx = high qword of dividend

	mov	r10, [r8]	; r10 = low qword of divisor
	mov	r11, [r8+8]	; r11 = high qword of divisor

	mov	r8, rcx		; r8 = address of quotient

	bsr	rcx, r11	; rcx = index of leading '1' bit in high qword of divisor
	jz	simple		; high qword of divisor = 0?

	; divisor >= 2**64 (so quotient will be < 2**64)

	mov	[rsp+8], rbx

	bsr	rbx, rdx	; rbx = index of leading '1' bit in high qword of dividend
	jz	trivial		; high qword of dividend = 0?

	; divisor >= 2**64, dividend >= 2**64:
	; perform "shift & subtract" alias "binary long" division
large:
	sub	ebx, ecx	; ebx = distance of leading '1' bits
	jb	trivial		; dividend < divisor?

	mov	ecx, ebx
	xor	ebx, ebx	; rbx = (low qword of) quotient' = 0

	shld	r11, r10, cl
	shl	r10, cl		; r11:r10 = divisor << distance of leading '1' bits
				;         = divisor'
	mov	[rsp+16], r12
	mov	[rsp+24], r13
@@:
	mov	r12, rax
	mov	r13, rdx	; r13:r12 = dividend'
	sub	rax, r10
	sbb	rdx, r11	; rdx:rax = dividend' - divisor'
				;         = dividend",
				; CF = dividend' < divisor'
	cmovb	rax, r12
	cmovb	rdx, r13	; rdx:rax = (dividend' < divisor') ? dividend' : dividend"
	cmc			; CF = dividend' >= divisor'
	adc	rbx, rbx	; rbx = quotient' << 1
				;     + dividend' >= divisor'
				;     = quotient
	shrd	r10, r11, 1
	shr	r11, 1		; r11:r10 = divisor' >> 1
				;         = divisor
	dec	ecx
	jns	@b

	mov	r12, [rsp+16]
	mov	r13, [rsp+24]

	xor	ecx, ecx
	mov	[r8], rbx
	mov	[r8+8], rcx	; high qword of quotient = 0

	test	r9, r9
	jz	@f		; address of remainder = 0?

	mov	[r9], rax
	mov	[r9+8], rdx	; remainder = dividend"
@@:
	mov	rax, r8		; rax = address of quotient

	mov	rbx, [rsp+8]
	ret

	; dividend < (2**64 <=) divisor:
	; quotient = 0, remainder = dividend
trivial:
	xor	ecx, ecx
	mov	[r8], rcx
	mov	[r8+8], rcx	; quotient = 0

	test	r9, r9
	jz	@f		; address of remainder = 0?

	mov	[r9], rax
	mov	[r9+8], rdx	; remainder = dividend"
@@:
	mov	rax, r8		; rax = address of quotient

	mov	rbx, [rsp+8]
	ret

	; divisor < 2**64 (so remainder will be < 2**64 too)
simple:
	cmp	rdx, r10
	jae	long		; high qword of dividend >= (low qword of) divisor?

	; dividend < divisor * 2**64 (so quotient will be < 2**64):
	; perform normal division
normal:
	div	r10		; rax = (low qword of) quotient,
				; rdx = (low qword of) remainder
	mov	[r8], rax
	mov	[r8+8], r11	; high qword of quotient = 0

	test	r9, r9
	jz	@f		; address of remainder = 0?

	mov	[r9], rdx
	mov	[r9+8], r11	; high qword of remainder = 0
@@:
	mov	rax, r8		; rax = address of quotient
	ret

	; dividend >= divisor * 2**64:
	; perform "long" alias "schoolbook" division
long:
	mov	rcx, rax	; rcx = low qword of dividend
	mov	rax, rdx	; rax = high qword of dividend
	mov	rdx, r11	; rdx:rax = high qword of dividend
	div	r10		; rax = high qword of quotient,
				; rdx = high qword of remainder'
	xchg	rcx, rax	; rcx = high qword of quotient,
				; rax = low qword of dividend
	div	r10		; rax = low qword of quotient,
				; rdx = (low qword of) remainder
	mov	[r8], rax
	mov	[r8+8], rcx	; quotient = rcx:rax

	test	r9, r9
	jz	@f		; address of remainder = 0?

	mov	[r9], rdx
	mov	[r9+8], r11	; high qword of remainder = 0
@@:
	mov	rax, r8		; rax = address of quotient
	ret

__udivmodti4 endp
	end

64÷64-bit Unsigned Integer Division (64-bit Quotient and Remainder)

__udivmoddi4() function for AMD64 processors, supporting the Microsoft calling convention, using the shift & subtract division:
; Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.code
				; rcx = dividend
				; rdx = divisor
				; r8 = oword ptr remainder
__udivmoddi4 proc public

	cmp	rcx, rdx
	jb	trivial		; dividend < divisor?

	bsr	rax, rdx	; rax = index of leading '1' bit of divisor
	jz	error		; divisor = 0?

	mov	r9, rcx		; r9 = dividend
	bsr	rcx, rcx	; rcx = index of leading '1' bit of dividend
	jz	zero		; dividend = 0?

	sub	ecx, eax	; ecx = distance of leading '1' bits
;;	jb	trivial		; dividend < divisor?

	shl	rdx, cl		; rdx = divisor << distance of leading '1' bits
				;     = divisor'
	xor	eax, eax	; eax = quotient' = 0
@@:
	mov	r10, r9		; r10 = dividend'
	sub	r9, rdx		; r9 = dividend' - divisor'
				;    = dividend"
				; CF = dividend' < divisor'
	cmovb	r9, r10		; r9 = (dividend' < divisor') ? dividend' : dividend"
	cmc			; CF = dividend' >= divisor'
	adc	rax, rax	; rax = quotient' << 1
				;     + (dividend' >= divisor')
				;     = quotient
	shr	rdx, 1		; rdx = divisor' >> 1
				;     = divisor
	dec	ecx
	jns	@b

	test	r8, r8
	jz	@f		; address of remainder = 0?

	mov	[r8], r9	; remainder = dividend"
@@:
	ret

	; dividend < divisor
trivial:
	test	r8, r8
	jz	@f		; address of remainder = 0?

	mov	[r8], rcx	; remainder = dividend
@@:
	xor	eax, eax	; rax = quotient = 0
	ret

	; dividend = 0
zero:
	test	r8, r8
	jz	@f		; address of remainder = 0?

	mov	[r8], r9	; remainder = 0
@@:
	mov	rax, r9		; rax = quotient = 0
	ret

	; divisor = 0
error:
	div	rdx
	ret

__udivmoddi4 endp
	end

Implementation for i386 Processors

Prototypes for the __udivmoddi4(), __udivdi3(), __umoddi3(), __divdi3() and __moddi3() functions:
// Copyleft © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

uint64_t __udivmoddi4(uint64_t dividend, uint64_t divisor, uint64_t *remainder);
uint64_t __udivdi3(uint64_t dividend, uint64_t divisor);
uint64_t __umoddi3(uint64_t dividend, uint64_t divisor);

int64_t __divdi3(int64_t dividend, int64_t divisor);
int64_t __moddi3(int64_t dividend, int64_t divisor);
The suffixes di4 and di3 specify the number of arguments plus return value and their size: double integer denotes an 8-byte QWORD alias 64-bit uint64_t.

Note: the code following here supports the common, so-called cdecl calling and naming convention used by C compilers on Linux®, Unix, Windows®, plus other operating systems, and runs on 35 year old Intel® 80386 micro-processors.

64÷64-bit Unsigned Integer Division (64-bit Quotient and Remainder)

__udivmoddi4() function for i386 processors:
# Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

# "cdecl" calling and naming convention for I386 platform:
# - arguments are pushed on stack in reverse order (from right to left);
# - 64-bit integer result is returned in registers EAX (low part) and
#   EDX (high part);
# - 32-bit integer or pointer result is returned in register EAX;
# - registers EAX, ECX and EDX are volatile and can be clobbered;
# - registers EBX, ESP, EBP, ESI and EDI must be preserved;
# - function names are prefixed with an underscore.

# NOTE: raises "division exception" when divisor is 0!

#	.arch	i386
	.code32
	.intel_syntax noprefix
	.global	___udivmoddi4
#	.type	___udivmoddi4, @function
	.text
				# [esp+20] = address of (optional) remainder
				# [esp+16] = high dword of divisor
				# [esp+12] = low dword of divisor
				# [esp+8] = high dword of dividend
				# [esp+4] = low dword of dividend
___udivmoddi4:
	mov	edx, [esp+16]	# edx = high dword of divisor
	bsr	ecx, edx	# ecx = index of leading '1' bit in high dword of divisor
	jnz	.extended	# high dword of divisor <> 0?

	# high dword of divisor = 0 (so high dword of remainder will be 0 too)

	mov	ecx, [esp+12]	# ecx = (low dword of) divisor
	mov	eax, [esp+8]	# eax = high dword of dividend

	cmp	eax, ecx
	jae	.long		# high dword of dividend >= divisor?

	# perform normal division
.normal:
	mov	edx, eax	# edx = high dword of dividend
	mov	eax, [esp+4]	# edx:eax = dividend
	div	ecx		# eax = (low dword of) quotient,
				# edx = (low dword of) remainder

	mov	ecx, [esp+20]	# ecx = address of remainder
	test	ecx, ecx
	jz	0f		# address of remainder = 0?

	mov	[ecx], edx	# [ecx] = (low dword of) remainder
	xor	edx, edx
	mov	[ecx+4], edx
0:
	xor	edx, edx	# edx:eax = quotient
	ret

	# perform "long" alias "schoolbook" division
.long:
#	xor	edx, edx	# edx:eax = high dword of dividend
	div	ecx		# eax = high dword of quotient,
				# edx = high dword of remainder'
	push	eax		# [esp] = high dword of quotient
	mov	eax, [esp+8]	# eax = low dword of dividend
	div	ecx		# eax = low dword of quotient,
				# edx = (low dword of) remainder

	mov	ecx, [esp+24]	# ecx = address of remainder
	test	ecx, ecx
	jz	1f		# address of remainder = 0?

	mov	[ecx], edx	# [ecx] = (low dword of) remainder
	xor	edx, edx
	mov	[ecx+4], edx
1:
	pop	edx		# edx:eax = quotient
	ret

	# high dword of divisor <> 0 (so high dword of quotient will be 0):
	# perform "extended & adjusted" division
.extended:
	push	ebx
	push	edi

	mov	eax, [esp+20]	# edx:eax = divisor
	not	ecx		# ecx = number of leading '0' bits in (high dword of) divisor
	shld	edx, eax, cl	# edx = divisor / 2**(index + 1)
				#     = divisor'
	mov	ebx, edx	# ebx = divisor'

	mov	edx, [esp+16]	# edx = high dword of dividend
	mov	eax, [esp+12]	# eax = low dword of dividend
.ifnotdef JMPLESS
	xor	edi, edi	# edi = high dword of quotient' = 0

	cmp	edx, ebx
	jb	2f		# high dword of dividend < divisor'?

	# high dword of dividend >= divisor':
	# subtract divisor' from high dword of dividend to prevent possible
	# quotient overflow and set most significant bit of quotient"

	sub	edx, ebx	# edx = high dword of dividend - divisor'
				#     = high dword of dividend'
	inc	edi		# edi = high dword of quotient' = 1
2:
.else
	sub	edx, ebx	# edx = high dword of dividend - divisor'
	sbb	edi, edi	# edi = (high dword of dividend < divisor') ? -1 : 0
	and	edi, ebx	# edi = (high dword of dividend < divisor') ? divisor' : 0
	add	edx, edi	# edx = high dword of dividend
				#     - (high dword of dividend < divisor') ? 0 : divisor'
				#     = high dword of dividend'
	neg	edi		# CF = (high dword of dividend < divisor') ? 1 : 0
	sbb	edi, edi	# edi = (high dword of dividend < divisor') ? -1 : 0
	inc	edi		# edi = (high dword of dividend < divisor') ? 0 : 1
				#     = high dword of quotient'
.endif # JMPLESS
	# high dword of dividend' < divisor'

	div	ebx		# eax = low dword of quotient'
				#     = dividend' / divisor',
				# edx = remainder'
	shld	edi, eax, cl	# edi = quotient' / 2**(index + 1)
				#     = dividend / divisor
				#     = quotient"

	mov	eax, [esp+20]	# eax = low dword of divisor
	mul	edi		# edx:eax = low dword of divisor * quotient"

	mov	ecx, [esp+12]
	mov	ebx, [esp+16]	# ebx:ecx = dividend
	sub	ecx, eax
	sbb	ebx, edx	# ebx:ecx = dividend - low dword of divisor * quotient"

	mov	eax, [esp+24]	# eax = high dword of divisor
	imul	eax, edi	# eax = high dword of divisor * quotient"

	sub	ebx, eax	# ebx:ecx = dividend - divisor * quotient"
				#         = remainder"
.ifnotdef JMPLESS
	jnb	3f		# remainder" >= 0?
				# with borrow, it is off by divisor,
				#  and quotient" is off by 1

	dec	edi		# edi = quotient" - 1
				#     = quotient
	add	ecx, [esp+20]
	adc	ebx, [esp+24]	# ebx:ecx = remainder" + divisor
				#         = dividend - divisor * (quotient" - 1)
				#         = dividend - divisor * quotient
				#         = remainder
3:
.else
	sbb	eax, eax	# eax = (remainder" < 0) ? -1 : 0
	cdq			# edx = (remainder" < 0) ? -1 : 0
	add	edi, eax	# edi = quotient" - 1
				#     = quotient
	and	eax, [esp+20]
	and	edx, [esp+24]	# edx:eax = (remainder" < 0) ? divisor : 0
	add	ecx, eax
	adc	ebx, edx	# ebx:ecx = remainder" + divisor
				#         = dividend - divisor * (quotient" - 1)
				#         = dividend - divisor * quotient
				#         = remainder
.endif # JMPLESS
	mov	eax, edi	# eax = (low dword of) quotient
	xor	edx, edx	# edx:eax = quotient

	mov	edi, [esp+28]	# edi = address of remainder
	test	edi, edi
	jz	4f		# address of remainder = 0?

	mov	[edi], ecx	# [edi] = low dword of remainder
	mov	[edi+4], ebx	# [edi+4] = high dword of remainder
4:
	pop	edi
	pop	ebx
	ret

	.end

64÷64-bit Unsigned Integer Division (64-bit Quotient)

__udivdi3() function for i386 processors:
; Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

; NOTE: raises "division exception" when divisor is 0!

	.386
	.model	flat, C
	.code
				; [esp+16] = high dword of divisor
				; [esp+12] = low dword of divisor
				; [esp+8] = high dword of dividend
				; [esp+4] = low dword of dividend
__udivdi3 proc	public

	mov	edx, [esp+16]	; edx = high dword of divisor
	bsr	ecx, edx	; ecx = index of leading '1' bit in high dword of divisor
	jnz	extended	; high dword of divisor <> 0?

	; high dword of divisor = 0 (so high dword of remainder will be 0 too)

	mov	ecx, [esp+12]	; ecx = (low dword of) divisor
	mov	eax, [esp+8]	; eax = high dword of dividend

	cmp	eax, ecx
	jae	long		; high dword of dividend >= divisor?

	; perform normal division
normal:
	mov	edx, eax	; edx = high dword of dividend
	mov	eax, [esp+4]	; edx:eax = dividend
	div	ecx		; eax = (low dword of) quotient,
				; edx = (low dword of) remainder
	xor	edx, edx	; edx:eax = quotient
	ret

	; perform "long" alias "schoolbook" division
long:
;;	xor	edx, edx	; edx:eax = high dword of dividend
	div	ecx		; eax = high dword of quotient,
				; edx = high dword of remainder'
	push	eax		; [esp] = high dword of quotient

	mov	eax, [esp+8]	; eax = low dword of dividend
	div	ecx		; eax = low dword of quotient,
				; edx = (low dword of) remainder
	pop	edx		; edx:eax = quotient
	ret

	; high dword of divisor <> 0 (so high dword of quotient will be 0):
	; perform "extended & adjusted" division
extended:
	push	ebx
	push	edi

	mov	eax, [esp+20]	; edx:eax = divisor
	not	ecx		; ecx = number of leading '0' bits in (high dword of) divisor
	shld	edx, eax, cl	; edx = divisor / 2**(index + 1)
				;     = divisor'
	mov	ebx, edx	; ebx = divisor'

	mov	edx, [esp+16]	; edx = high dword of dividend
	mov	eax, [esp+12]	; eax = low dword of dividend
ifndef JMPLESS
	xor	edi, edi	; edi = high dword of quotient' = 0

	cmp	edx, ebx
	jb	@f		; high dword of dividend < divisor'?

	; high dword of dividend >= divisor':
	; subtract divisor' from high dword of dividend to prevent possible
	; quotient overflow and set most significant bit of quotient"

	sub	edx, ebx	; edx = high dword of dividend - divisor'
				;     = high dword of dividend'
	inc	edi		; edi = high dword of quotient' = 1
@@:
else
	sub	edx, ebx	; edx = high dword of dividend - divisor'
	sbb	edi, edi	; edi = (high dword of dividend < divisor') ? -1 : 0
	and	edi, ebx	; edi = (high dword of dividend < divisor') ? divisor' : 0
	add	edx, edi	; edx = high dword of dividend
				;     - (high dword of dividend < divisor') ? 0 : divisor'
				;     = high dword of dividend'
	neg	edi		; CF = (high dword of dividend < divisor') ? 1 : 0
	sbb	edi, edi	; edi = (high dword of dividend < divisor') ? -1 : 0
	inc	edi		; edi = (high dword of dividend < divisor') ? 0 : 1
				;     = high dword of quotient'
endif ; JMPLESS
	; high dword of dividend' < divisor'

	div	ebx		; eax = low dword of quotient'
				;     = dividend' / divisor',
				; edx = remainder'
	shld	edi, eax, cl	; edi = quotient' / 2**(index + 1)
				;     = dividend / divisor
				;     = quotient"

	mov	eax, [esp+20]	; eax = low dword of divisor
	mul	edi		; edx:eax = low dword of divisor * quotient"

	mov	ecx, [esp+12]
	mov	ebx, [esp+16]	; ebx:ecx = dividend
	sub	ecx, eax
	sbb	ebx, edx	; ebx:ecx = dividend - low dword of divisor * quotient"

	mov	eax, [esp+24]	; eax = high dword of divisor
	imul	eax, edi	; eax = high dword of divisor * quotient"

	sub	ebx, eax	; ebx:ecx = dividend - divisor * quotient"
				;         = remainder"
	sbb	edi, 0		; edi = quotient" - (remainder" < 0)
				;     = quotient
	mov	eax, edi	; eax = (low dword of) quotient
	xor	edx, edx	; edx:eax = quotient

	pop	edi
	pop	ebx
	ret

__udivdi3 endp
	end

64÷64-bit Unsigned Integer Division (64-bit Remainder)

__umoddi3() function for i386 processors:
# Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

# NOTE: raises "division exception" when divisor is 0!

#	.arch	i386
	.code32
	.intel_syntax noprefix
	.global	___umoddi3
#	.type	___umoddi3, @function
	.text
				# [esp+16] = high dword of divisor
				# [esp+12] = low dword of divisor
				# [esp+8] = high dword of dividend
				# [esp+4] = low dword of dividend
___umoddi3:
	mov	edx, [esp+16]	# edx = high dword of divisor
	bsr	ecx, edx	# ecx = index of leading '1' bit in high dword of divisor
	jnz	.extended	# high dword of divisor <> 0?

	# high dword of divisor = 0 (so high dword of remainder will be 0 too)

	mov	ecx, [esp+12]	# ecx = (low dword of) divisor
	mov	eax, [esp+8]	# eax = high dword of dividend

	cmp	eax, ecx
	jae	.long		# high dword of dividend >= divisor?

	# perform normal division
.normal:
	mov	edx, eax	# edx = high dword of dividend
	mov	eax, [esp+4]	# edx:eax = dividend
	div	ecx		# eax = (low dword of) quotient,
				# edx = (low dword of) remainder
	mov	eax, edx	# eax = (low dword of) remainder
	xor	edx, edx	# edx:eax = remainder
	ret

	# perform "long" alias "schoolbook" division
.long:
#	xor	edx, edx	# edx:eax = high dword of dividend
	div	ecx		# eax = high dword of quotient,
				# edx = high dword of remainder'
	mov	eax, [esp+4]	# eax = low dword of dividend
	div	ecx		# eax = low dword of quotient,
				# edx = (low dword of) remainder
	mov	eax, edx	# eax = (low dword of) remainder
	xor	edx, edx	# edx:eax = remainder
	ret

	# high dword of divisor <> 0 (so high dword of quotient will be 0):
	# perform "extended & adjusted" division
.extended:
	push	ebx
	push	edi

	mov	eax, [esp+20]	# edx:eax = divisor
	not	ecx		# ecx = number of leading '0' bits in (high dword of) divisor
	shld	edx, eax, cl	# edx = divisor / 2**(index + 1)
				#     = divisor'
	mov	ebx, edx	# ebx = divisor'

	mov	edx, [esp+16]	# edx = high dword of dividend
	mov	eax, [esp+12]	# eax = low dword of dividend
.ifnotdef JMPLESS
	xor	edi, edi	# edi = high dword of quotient' = 0

	cmp	edx, ebx
	jb	0f		# high dword of dividend < divisor'?

	# high dword of dividend >= divisor':
	# subtract divisor' from high dword of dividend to prevent possible
	# quotient overflow and set most significant bit of quotient"

	sub	edx, ebx	# edx = high dword of dividend - divisor'
				#     = high dword of dividend'
	inc	edi		# edi = high dword of quotient' = 1
0:
.else
	sub	edx, ebx	# edx = high dword of dividend - divisor'
	sbb	edi, edi	# edi = (high dword of dividend < divisor') ? -1 : 0
	and	edi, ebx	# edi = (high dword of dividend < divisor') ? divisor' : 0
	add	edx, edi	# edx = high dword of dividend
				#     - (high dword of dividend < divisor') ? 0 : divisor'
				#     = high dword of dividend'
	neg	edi		# CF = (high dword of dividend < divisor') ? 1 : 0
	sbb	edi, edi	# edi = (high dword of dividend < divisor') ? -1 : 0
	inc	edi		# edi = (high dword of dividend < divisor') ? 0 : 1
				#     = high dword of quotient'
.endif # JMPLESS
	# high dword of dividend' < divisor'

	div	ebx		# eax = low dword of quotient'
				#     = dividend' / divisor',
				# edx = remainder'
	shld	edi, eax, cl	# edi = quotient' / 2**(index + 1)
				#     = dividend / divisor
				#     = quotient"

	mov	eax, [esp+20]	# eax = low dword of divisor
	mul	edi		# edx:eax = low dword of divisor * quotient"

	mov	ecx, [esp+12]
	mov	ebx, [esp+16]	# ebx:ecx = dividend
	sub	ecx, eax
	sbb	ebx, edx	# ebx:ecx = dividend - low dword of divisor * quotient"

	mov	eax, [esp+24]	# eax = high dword of divisor
	imul	eax, edi	# eax = high dword of divisor * quotient"

	sub	ebx, eax	# ebx:ecx = dividend - divisor * quotient"
				#         = remainder"
.ifnotdef JMPLESS
	jnb	1f		# remainder" >= 0?
				# with borrow, it is off by divisor
				#  (and quotient" is off by 1)
	add	ecx, [esp+20]
	adc	ebx, [esp+24]	# ebx:ecx = remainder" + divisor
				#         = dividend - divisor * (quotient" - 1)
				#         = dividend - divisor * quotient
				#         = remainder
1:
	mov	eax, ecx
	mov	edx, ebx	# edx:eax = remainder
.else
	sbb	eax, eax	# eax = (remainder" < 0) ? -1 : 0
	cdq			# edx = (remainder" < 0) ? -1 : 0
	and	eax, [esp+20]
	and	edx, [esp+24]	# edx:eax = (remainder" < 0) ? divisor : 0
	add	eax, ecx
	adc	edx, ebx	# edx:eax = remainder" + divisor
				#         = dividend - divisor * (quotient" - 1)
				#         = dividend - divisor * quotient
				#         = remainder
.endif # JMPLESS
	pop	edi
	pop	ebx
	ret

	.end

64÷64-bit Signed Integer Division (64-bit Quotient)

__divdi3() function for i386 processors:
; Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

; NOTE: raises "division exception" when divisor is 0!

	.386
	.model	flat, C
	.code
				; [esp+16] = high dword of divisor
				; [esp+12] = low dword of divisor
				; [esp+8] = high dword of dividend
				; [esp+4] = low dword of dividend
__divdi3 proc	public

	push	ebx

	; determine sign of dividend and compute |dividend|

	mov	edx, [esp+12]	; edx = high dword of dividend
	mov	eax, [esp+8]	; eax = low dword of dividend

	mov	ebx, edx
	sar	ebx, 31		; ebx = (dividend < 0) ? -1 : 0
	xor	eax, ebx
	xor	edx, ebx	; edx:eax = (dividend < 0) ? ~dividend : dividend
	sub	eax, ebx
	sbb	edx, ebx	; edx:eax = (dividend < 0) ? -dividend : dividend
				;         = |dividend|

	mov	[esp+8], eax	; write |dividend| back on stack
	mov	[esp+12], edx

	; determine sign of divisor and compute |divisor|

	mov	edx, [esp+20]	; edx = high dword of divisor
	mov	eax, [esp+16]	; eax = low dword of divisor

	mov	ecx, edx
	sar	ecx, 31		; ecx = (divisor < 0) ? -1 : 0
	xor	eax, ecx
	xor	edx, ecx	; edx:eax = (divisor < 0) ? ~divisor : divisor
	sub	eax, ecx
	sbb	edx, ecx	; edx:eax = (divisor < 0) ? -divisor : divisor
				;         = |divisor|

	mov	[esp+16], eax	; write |divisor| back on stack
	mov	[esp+20], edx

	xor	ecx, ebx	; ecx = sign of dividend ^ sign of divisor
				;     = sign of quotient
	push	ecx		; save sign of quotient on stack

	bsr	ecx, edx	; ecx = index of leading '1' bit in high dword of divisor
	jnz	extended	; high dword of divisor <> 0?

	; high dword of divisor = 0 (so high dword of remainder will be 0 too)

	mov	ecx, eax	; ecx = (low dword of) divisor
	mov	eax, [esp+16]	; eax = high dword of dividend

	cmp	eax, ecx
	jae	long		; high dword of dividend >= divisor?

	; perform normal division
normal:
	mov	edx, eax	; edx = high dword of dividend
	mov	eax, [esp+12]	; edx:eax = dividend
	div	ecx		; eax = (low dword of) quotient,
				; edx = (low dword of) remainder
;;	xor	edx, edx	; edx:eax = |quotient|

	pop	edx		; edx = sign of quotient
	xor	eax, edx	; set sign of quotient
	sub	eax, edx
	sbb	edx, edx	; edx:eax = quotient

	pop	ebx
	ret

	; perform "long" alias "schoolbook" division
long:
;;	xor	edx, edx	; edx:eax = high dword of dividend
	div	ecx		; eax = high dword of quotient,
				; edx = high dword of remainder'
	mov	ebx, eax	; ebx = high dword of quotient
	mov	eax, [esp+12]	; eax = low dword of dividend
	div	ecx		; eax = low dword of quotient,
				; edx = (low dword of) remainder
	mov	edx, ebx	; edx:eax = |quotient|

	pop	ecx		; ecx = sign of quotient
ifndef JMPLESS
	jecxz	@f		; quotient >= 0?

	neg	edx		; change sign of quotient
	neg	eax
	sbb	edx, 0		; edx:eax = quotient
@@:
else
	xor	eax, ecx	; set sign of quotient
	xor	edx, ecx
	sub	eax, ecx
	sbb	edx, ecx	; edx:eax = quotient
endif ; JMPLESS
	pop	ebx
	ret

	; high dword of divisor <> 0 (so high dword of quotient will be 0):
	; perform "extended & adjusted" division
extended:
	push	edi

	not	ecx		; ecx = number of leading '0' bits in (high dword of) divisor
	shld	edx, eax, cl	; edx = divisor / 2**(index + 1)
				;     = divisor'
	mov	ebx, edx	; ebx = divisor'

	mov	edx, [esp+20]	; edx = high dword of dividend
	mov	eax, [esp+16]	; eax = low dword of dividend
ifndef JMPLESS
	xor	edi, edi	; edi = high dword of quotient' = 0

	cmp	edx, ebx
	jb	@f		; high dword of dividend < divisor'?

	; high dword of dividend >= divisor':
	; subtract divisor' from high dword of dividend to prevent possible
	; quotient overflow and set most significant bit of quotient"

	sub	edx, ebx	; edx = high dword of dividend - divisor'
				;     = high dword of dividend'
	inc	edi		; edi = high dword of quotient' = 1
@@:
else
	sub	edx, ebx	; edx = high dword of dividend - divisor'
	sbb	edi, edi	; edi = (high dword of dividend < divisor') ? -1 : 0
	and	edi, ebx	; edi = (high dword of dividend < divisor') ? divisor' : 0
	add	edx, edi	; edx = high dword of dividend
				;     - (high dword of dividend < divisor') ? 0 : divisor'
				;     = high dword of dividend'
	neg	edi		; CF = (high dword of dividend < divisor') ? 1 : 0
	sbb	edi, edi	; edi = (high dword of dividend < divisor') ? -1 : 0
	inc	edi		; edi = (high dword of dividend < divisor') ? 0 : 1
				;     = high dword of quotient'
endif ; JMPLESS
	; high dword of dividend' < divisor'

	div	ebx		; eax = low dword of quotient'
				;     = dividend' / divisor',
				; edx = remainder'
	shld	edi, eax, cl	; edi = quotient' / 2**(index + 1)
				;     = dividend / divisor
				;     = quotient"

	mov	eax, [esp+24]	; eax = low dword of divisor
	mul	edi		; edx:eax = low dword of divisor * quotient"

	mov	ecx, [esp+16]
	mov	ebx, [esp+20]	; ebx:ecx = dividend
	sub	ecx, eax
	sbb	ebx, edx	; ebx:ecx = dividend - low dword of divisor * quotient"

	mov	eax, [esp+28]	; eax = high dword of divisor
	imul	eax, edi	; eax = high dword of divisor * quotient"

	sub	ebx, eax	; ebx:ecx = dividend - divisor * quotient"
				;         = remainder"
	sbb	edi, 0		; edi = quotient" - (remainder" < 0)
				;     = quotient
	mov	eax, edi	; eax = (low dword of) |quotient|
;;	xor	edx, edx	; edx:eax = |quotient|

	pop	edi
	pop	edx		; edx = sign of quotient
	xor	eax, edx	; set sign of quotient
	sub	eax, edx
	sbb	edx, edx	; edx:eax = quotient

	pop	ebx
	ret

__divdi3 endp
	end

64÷64-bit Signed Integer Division (64-bit Remainder)

__moddi3() function for i386 processors:
# Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

# NOTE: raises "division exception" when divisor is 0!

#	.arch	i386
	.code32
	.intel_syntax noprefix
	.global	___moddi3
#	.type	___moddi3, @function
	.text
				# [esp+16] = high dword of divisor
				# [esp+12] = low dword of divisor
				# [esp+8] = high dword of dividend
				# [esp+4] = low dword of dividend
___moddi3:
	# determine sign of dividend and compute |dividend|

	mov	eax, [esp+8]	# eax = high dword of dividend
	mov	ecx, [esp+4]	# ecx = low dword of dividend

	cdq			# edx = (dividend < 0) ? -1 : 0
	xor	ecx, edx
	xor	eax, edx	# ecx:eax = (dividend < 0) ? ~dividend : dividend
	sub	ecx, edx
	sbb	eax, edx	# ecx:eax = (dividend < 0) ? -dividend : dividend
				#         = |dividend|

	mov	[esp+4], ecx	# write |dividend| back on stack
	mov	[esp+8], eax

	push	edx		# save sign of dividend on stack

	# determine sign of divisor and compute |divisor|

	mov	edx, [esp+20]	# edx = high dword of divisor
	mov	eax, [esp+16]	# eax = low dword of divisor

	mov	ecx, edx
	sar	ecx, 31		# ecx = (divisor < 0) ? -1 : 0
	xor	eax, ecx
	xor	edx, ecx	# edx:eax = (divisor < 0) ? ~divisor : divisor
	sub	eax, ecx
	sbb	edx, ecx	# edx:eax = (divisor < 0) ? -divisor : divisor
				#         = |divisor|

	mov	[esp+16], eax	# write |divisor| back on stack
	mov	[esp+20], edx

	bsr	ecx, edx	# ecx = index of leading '1' bit in high dword of divisor
	jnz	.extended	# high dword of divisor <> 0?

	# high dword of divisor = 0 (so high dword of remainder will be 0 too)

	mov	ecx, eax	# ecx = (low dword of) divisor
	mov	eax, [esp+12]	# eax = high dword of dividend

	cmp	eax, ecx
	jae	.long		# high dword of dividend >= divisor?

	# perform normal division
.normal:
	mov	edx, eax	# edx = high dword of dividend
	jmp	.continue

	# perform "long" alias "schoolbook" division
.long:
#	xor	edx, edx	# edx:eax = high dword of dividend
	div	ecx		# eax = high dword of quotient,
				# edx = high dword of remainder'
.continue:
	mov	eax, [esp+8]	# eax = low dword of dividend
	div	ecx		# eax = low dword of quotient,
				# edx = (low dword of) remainder
	mov	eax, edx	# eax = (low dword of) |remainder|
#	xor	edx, edx	# edx:eax = |remainder|

	pop	edx		# edx = sign of remainder
	xor	eax, edx	# set sign of remainder
	sub	eax, edx
	sbb	edx, edx	# edx:eax = remainder
	ret

	# high dword of divisor <> 0 (so high dword of quotient will be 0):
	# perform "extended & adjusted" division
.extended:
	push	ebx
	push	edi

	not	ecx		# ecx = number of leading '0' bits in (high dword of) divisor
	shld	edx, eax, cl	# edx = divisor / 2**(index + 1)
				#     = divisor'
	mov	ebx, edx	# ebx = divisor'

	mov	edx, [esp+20]	# edx = high dword of dividend
	mov	eax, [esp+16]	# eax = low dword of dividend
.ifnotdef JMPLESS
	xor	edi, edi	# edi = high dword of quotient' = 0

	cmp	edx, ebx
	jb	0f		# high dword of dividend < divisor'?

	# high dword of dividend >= divisor':
	# subtract divisor' from high dword of dividend to prevent possible
	# quotient overflow and set most significant bit of quotient"

	sub	edx, ebx	# edx = high dword of dividend - divisor'
				#     = high dword of dividend'
	inc	edi		# edi = high dword of quotient' = 1
0:
.else
	sub	edx, ebx	# edx = high dword of dividend - divisor'
	sbb	edi, edi	# edi = (high dword of dividend < divisor') ? -1 : 0
	and	edi, ebx	# edi = (high dword of dividend < divisor') ? divisor' : 0
	add	edx, edi	# edx = high dword of dividend
				#     - (high dword of dividend < divisor') ? 0 : divisor'
				#     = high dword of dividend'
	neg	edi		# CF = (high dword of dividend < divisor') ? 1 : 0
	sbb	edi, edi	# edi = (high dword of dividend < divisor') ? -1 : 0
	inc	edi		# edi = (high dword of dividend < divisor') ? 0 : 1
				#     = high dword of quotient'
.endif # JMPLESS
	# high dword of dividend' < divisor'

	div	ebx		# eax = low dword of quotient'
				#     = dividend' / divisor',
				# edx = remainder'
	shld	edi, eax, cl	# edi = quotient' / 2**(index + 1)
				#     = dividend / divisor
				#     = quotient"

	mov	eax, [esp+24]	# eax = low dword of divisor
	mul	edi		# edx:eax = low dword of divisor * quotient"

	mov	ecx, [esp+16]
	mov	ebx, [esp+20]	# ebx:ecx = dividend
	sub	ecx, eax
	sbb	ebx, edx	# ebx:ecx = dividend - low dword of divisor * quotient"

	mov	eax, [esp+28]	# eax = high dword of divisor
	imul	eax, edi	# eax = high dword of divisor * quotient"

	sub	ebx, eax	# ebx:ecx = dividend - divisor * quotient"
				#         = remainder"
.ifnotdef JMPLESS
	jnb	1f		# remainder" >= 0?
				# with borrow, it is off by divisor
				#  (and quotient" is off by 1)
	add	ecx, [esp+24]
	adc	ebx, [esp+28]	# ebx:ecx = remainder" + divisor
				#         = dividend - divisor * (quotient" - 1)
				#         = dividend - divisor * quotient
				#         = remainder
1:
	mov	eax, ecx
	mov	edx, ebx	# edx:eax = |remainder|
.else
	sbb	eax, eax	# eax = (remainder" < 0) ? -1 : 0
	cdq			# edx = (remainder" < 0) ? -1 : 0
	and	eax, [esp+24]
	and	edx, [esp+28]	# edx:eax = (remainder" < 0) ? divisor : 0
	add	eax, ecx
	adc	edx, ebx	# edx:eax = remainder" + divisor
				#         = dividend - divisor * (quotient" - 1)
				#         = dividend - divisor * quotient
				#         = remainder
.endif # JMPLESS
	pop	edi
	pop	ebx

	pop	ecx		# ecx = sign of remainder
.ifnotdef JMPLESS
	jecxz	2f		# remainder >= 0?

	neg	edx		# change sign of remainder
	neg	eax
	sbb	edx, 0		# edx:eax = remainder
2:
.else
	xor	eax, ecx	# set sign of remainder
	xor	edx, ecx
	sub	eax, ecx
	sbb	edx, ecx	# edx:eax = remainder
.endif # JMPLESS
	ret

	.end

64×64-bit Signed and Unsigned Integer Multiplication (64-bit Product)

__muldi3() alias __umuldi3() function for i386 processors:
; Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.386
	.model	flat, C
	.code
				; [esp+16] = high dword of multiplier
				; [esp+12] = low dword of multiplier
				; [esp+8] = high dword of multiplicand
				; [esp+4] = low dword of multiplicand
__muldi3 proc	public
__umuldi3 proc	public

	push	ebx

	mov	eax, [esp+16]	; eax = low dword of multiplier
	mov	ecx, [esp+12]	; ecx = high dword of multiplicand
	imul	ecx, eax	; ecx = high dword of multiplicand
				;     * low dword of multiplier

	mov	edx, [esp+8]	; edx = low dword of multiplicand
	mov	ebx, [esp+20]	; ebx = high dword of multiplier
	imul	ebx, edx	; ebx = high dword of multiplier
				;     * low dword of multiplicand

	mul	edx		; edx:eax = low dword of multiplier
				;         * low dword of multiplicand
	add	ecx, ebx	; ebx = high dword of multiplier
				;     * low dword of multiplicand
				;     + high dword of multiplicand
				;     * low dword of multiplier
	add	edx, ecx	; edx:eax = product

	pop	ebx
	ret

__umuldi3 endp
__muldi3 endp
	end

Execution Times (Sustained Reciprocal Throughput)

Measurements were performed on Windows 10 with the benchmark programs presented below, which are available for download in the cabinet file integer.cab: while the programs *.exe measure execution times in processor clock cycles, the console programs *.com measure nano-seconds.

Summary Table (Running ’round in Circles Cycles)

The table shows the execution times for 128÷128-bit division, 64÷64-bit division and 64×64-bit multiplication on several processors in clock cycles per call; the upper half for the AMD64 platform, the lower half for the i386 platform.

128÷128-bit 64÷64-bit
__udivmodti4() __udivmodti4() __udivmoddi4() DIV
eSKamationeSKamation eSKamationAMD, Intel
AMD Ryzen5 3600 32? ?21
AMD Ryzen7 2700X 32? ?21
Intel Core i5-9500 74? ?33
Intel Core i5-7400 9320 1738
Intel Core i5-6600 99? ?45
Intel Core i5-4670 91? ?49
Intel Core2 Duo P8700 8029 3538

64÷64-bit 64×64-bit
__udivmoddi4() _aulldvrm() _aullmul() __umuldi3()
eSKamationMicrosoft MicrosofteSKamation
AMD Ryzen5 3600 1773 3
AMD Ryzen7 2700X 20100 6
Intel Core i5-9500 16101 5
Intel Core i5-7400 20127 6
Intel Core i5-6600 20135 7
Intel Core i5-4670 25136 8
Intel Core2 Duo P8700 36145 11

On modern Core processors, the __udivmoddi4() 64÷64-bit division routines presented above run more than twice as fast as their native 64-bit DIV instruction, even in 32-bit mode!
The same holds for the hybrid __udivmodti4() 128÷128-bit division routine.

Microsoft’s 64÷64-bit division routine _aulldvrm() for 32-bit i386 processors, which they dare to ship with Windows and their MSVCRT library, but really sucks: it is 4 to 6 times slower than a properly implemented division routine!
The same holds for their 64×64-bit multiplication routine _allmul() alias _aullmul() for 32-bit i386 processors: while execution of the properly implemented __muldi3() alias __umuldi3() routine is buried in the pipeline of super-scalar processors and therefore not measurable, _aullmul() wastes up to 11 clock cycles.

Benchmark Programs for AMD64 Processors

The first of the following two C programs for 64-bit processors measures the execution time for one billion divisions of uniform distributed 128-bit pseudo-random numbers with the __udivmodti4() function; the second C program measures the execution time for one billion divisions of uniform distributed 64-bit pseudo-random numbers with the DIV instruction, disguised as the __udivmoddi4() function.

Note: with the macro CYCLES defined, both programs measure the execution time in processor clock cycles and run on 64-bit editions of Windows Vista® and newer versions, else they measure the execution time in nano-seconds and run on 64-bit editions of Windows XP® and newer versions.

// Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifndef _M_AMD64
#pragma message("For AMD64 platform only!")
#endif

#pragma comment(compiler)
#pragma comment(user, __TIMESTAMP__)

#define _CRT_SECURE_NO_WARNINGS
#define STRICT
#define UNICODE
#define WIN32_LEAN_AND_MEAN

#include <windows.h>

typedef	ULONGLONG	QWORD;

const	struct
{
	QWORD	qwDividend[2];
	QWORD	qwDivisor[2];
	QWORD	qwQuotient[2];
	QWORD	qwRemainder[2];
} owVector[] = {{0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL},
                {0ULL, 0ULL, 1ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL},
                {0ULL, 0ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL},
                {0ULL, 0ULL, 0ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 0ULL},
                {0ULL, 0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 0ULL},
                {0ULL, 0ULL, 0ULL, 1ULL, 0ULL, 0ULL, 0ULL, 0ULL},
                {0ULL, 0ULL, 1ULL, 1ULL, 0ULL, 0ULL, 0ULL, 0ULL},
                {0ULL, 0ULL, ~0ULL, 1ULL, 0ULL, 0ULL, 0ULL, 0ULL},
                {0ULL, 0ULL, 1ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 0ULL},
                {1ULL, 0ULL, 1ULL, 0ULL, 1ULL, 0ULL, 0ULL, 0ULL},
                {1ULL, 0ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 1ULL, 0ULL},
                {1ULL, 0ULL, 0ULL, 1ULL, 0ULL, 0ULL, 1ULL, 0ULL},
                {1ULL, 0ULL, 1ULL, 1ULL, 0ULL, 0ULL, 1ULL, 0ULL},
                {1ULL, 0ULL, ~0ULL, 1ULL, 0ULL, 0ULL, 1ULL, 0ULL},
                {1ULL, 0ULL, 0ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 0ULL},
                {1ULL, 0ULL, 1ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 0ULL},
                {1ULL, 0ULL, ~1ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 0ULL},
                {1ULL, 0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 0ULL},
                {2ULL, 0ULL, 1ULL, 0ULL, 2ULL, 0ULL, 0ULL, 0ULL},
                {2ULL, 0ULL, 2ULL, 0ULL, 1ULL, 0ULL, 0ULL, 0ULL},
                {2ULL, 0ULL, ~1ULL, ~0ULL, 0ULL, 0ULL, 2ULL, 0ULL},
                {2ULL, 0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 2ULL, 0ULL},
                {~0ULL, 0ULL, 1ULL, 0ULL, ~0ULL, 0ULL, 0ULL, 0ULL},
                {~0ULL, 0ULL, ~0ULL, 0ULL, 1ULL, 0ULL, 0ULL, 0ULL},
                {~0ULL, 0ULL, 0ULL, 1ULL, 0ULL, 0ULL, ~0ULL, 0ULL},
                {~0ULL, 0ULL, 1ULL, 1ULL, 0ULL, 0ULL, ~0ULL, 0ULL},
                {~0ULL, 0ULL, ~0ULL, 1ULL, 0ULL, 0ULL, ~0ULL, 0ULL},
                {~0ULL, 0ULL, 0ULL, ~0ULL, 0ULL, 0ULL, ~0ULL, 0ULL},
                {~0ULL, 0ULL, 1ULL, ~0ULL, 0ULL, 0ULL, ~0ULL, 0ULL},
                {~0ULL, 0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, ~0ULL, 0ULL},
                {0ULL, 1ULL, 1ULL, 0ULL, 0ULL, 1ULL, 0ULL, 0ULL},
                {0ULL, 1ULL, ~0ULL, 0ULL, 1ULL, 0ULL, 1ULL, 0ULL},
                {0ULL, 1ULL, 0ULL, 1ULL, 1ULL, 0ULL, 0ULL, 0ULL},
                {0ULL, 1ULL, 1ULL, 1ULL, 0ULL, 0ULL, 0ULL, 1ULL},
                {0ULL, 1ULL, ~0ULL, 1ULL, 0ULL, 0ULL, 0ULL, 1ULL},
                {0ULL, 1ULL, 0ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 1ULL},
                {0ULL, 1ULL, 1ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 1ULL},
                {0ULL, 1ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 1ULL},
                {1ULL, 1ULL, 1ULL, 0ULL, 1ULL, 1ULL, 0ULL, 0ULL},
                {1ULL, 1ULL, ~0ULL, 0ULL, 1ULL, 0ULL, 2ULL, 0ULL},
                {1ULL, 1ULL, 0ULL, 1ULL, 1ULL, 0ULL, 1ULL, 0ULL},
                {1ULL, 1ULL, 1ULL, 1ULL, 1ULL, 0ULL, 0ULL, 0ULL},
                {1ULL, 1ULL, ~0ULL, 1ULL, 0ULL, 0ULL, 1ULL, 1ULL},
                {1ULL, 1ULL, 0ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 1ULL},
                {1ULL, 1ULL, 1ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 1ULL},
                {1ULL, 1ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 1ULL},
                {~0ULL, 1ULL, 1ULL, 0ULL, ~0ULL, 1ULL, 0ULL, 0ULL},
                {~0ULL, 1ULL, ~0ULL, 0ULL, 2ULL, 0ULL, 1ULL, 0ULL},
                {~0ULL, 1ULL, 0ULL, 1ULL, 1ULL, 0ULL, ~0ULL, 0ULL},
                {~0ULL, 1ULL, 1ULL, 1ULL, 1ULL, 0ULL, ~1ULL, 0ULL},
                {~0ULL, 1ULL, ~0ULL, 1ULL, 1ULL, 0ULL, 0ULL, 0ULL},
                {~0ULL, 1ULL, 0ULL, ~0ULL, 0ULL, 0ULL, ~0ULL, 1ULL},
                {~0ULL, 1ULL, 1ULL, ~0ULL, 0ULL, 0ULL, ~0ULL, 1ULL},
                {~0ULL, 1ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, ~0ULL, 1ULL},
                {~0ULL, 0xFULL, 0xFULL, 0ULL, 0x1111111111111111ULL, 1ULL, 0ULL, 0ULL},
                {~0xFULL, ~1ULL, ~1ULL, ~0ULL, 0ULL, 0ULL, ~0xFULL, ~1ULL},
                {0ULL, ~0ULL, 1ULL, 0ULL, 0ULL, ~0ULL, 0ULL, 0ULL},
                {0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 0ULL, 0ULL},
                {0ULL, ~0ULL, 0ULL, 1ULL, ~0ULL, 0ULL, 0ULL, 0ULL},
                {0ULL, ~0ULL, 1ULL, 1ULL, ~1ULL, 0ULL, 2ULL, 0ULL},
                {0ULL, ~0ULL, 0ULL, ~0ULL, 1ULL, 0ULL, 0ULL, 0ULL},
                {0ULL, ~0ULL, 1ULL, ~0ULL, 0ULL, 0ULL, 0ULL, ~0ULL},
                {0ULL, ~0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 0ULL, ~0ULL},
                {1ULL, ~0ULL, 1ULL, 0ULL, 1ULL, ~0ULL, 0ULL, 0ULL},
                {1ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 1ULL, 0ULL},
                {1ULL, ~0ULL, 0ULL, 1ULL, ~0ULL, 0ULL, 1ULL, 0ULL},
                {1ULL, ~0ULL, 1ULL, 1ULL, ~1ULL, 0ULL, 3ULL, 0ULL},
                {1ULL, ~0ULL, 0ULL, ~0ULL, 1ULL, 0ULL, 1ULL, 0ULL},
                {1ULL, ~0ULL, 1ULL, ~0ULL, 1ULL, 0ULL, 0ULL, 0ULL},
                {1ULL, ~0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 1ULL, ~0ULL},
                {~0ULL, ~0ULL, 1ULL, 0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL},
                {~0ULL, ~0ULL, ~0ULL, 0ULL, 1ULL, 1ULL, 0ULL, 0ULL},
                {~0ULL, ~0ULL, 0ULL, 1ULL, ~0ULL, 0ULL, ~0ULL, 0ULL},
                {~0ULL, ~0ULL, 1ULL, 1ULL, ~0ULL, 0ULL, 0ULL, 0ULL},
                {~0ULL, ~0ULL, 1ULL, 3ULL, 0x5555555555555555ULL, 0ULL, 0xAAAAAAAAAAAAAAAAULL, 0ULL},
                {~0ULL, ~0ULL, 0ULL, ~0ULL, 1ULL, 0ULL, ~0ULL, 0ULL},
                {~0ULL, ~0ULL, 1ULL, ~0ULL, 1ULL, 0ULL, ~1ULL, 0ULL},
                {~0ULL, ~0ULL, ~0ULL, ~0ULL, 1ULL, 0ULL, 0ULL, 0ULL},
                {~0ULL, ~0ULL, ~1ULL, ~0ULL, 1ULL, 0ULL, 1ULL, 0ULL},
                {0xBF25975319080000ULL, 0x530EDA741C71D4C3ULL, 0x14C34AB4676E4BABULL, 0x0000004DE2CAB081ULL, 0x0000000001110001ULL, 0ULL, 0x00000000003EB455ULL, 0ULL},
                {0xBF25975319080000ULL, 0x530EDA741C71D4C3ULL, 0x0000000001110001ULL, 0ULL, 0x14C34AB4676E4BABULL, 0x0000004DE2CAB081ULL, 0x00000000003EB455ULL, 0ULL}};

#pragma intrinsic(_umul128)

__declspec(noinline)
QWORD	*__umulti3(QWORD qwProduct[2], QWORD qwMultiplicand[2], QWORD qwMultiplier[2])
{
	qwProduct[0] = _umul128(qwMultiplicand[0], qwMultiplier[0], qwProduct + 1);
	qwProduct[1] += qwMultiplicand[0] * qwMultiplier[1]
	              + qwMultiplicand[1] * qwMultiplier[0];

	return qwProduct;
}

__declspec(noinline)
QWORD	*__unopti4(QWORD qwQuotient[2], QWORD qwDividend[2], QWORD qwDivisor[2], QWORD qwRemainder[2])
{
	if (qwRemainder != NULL)
		*qwDivisor = *qwDividend;

	return qwQuotient;
}

QWORD	*__udivmodti4(QWORD qwQuotient[2], QWORD qwDividend[2], QWORD qwDivisor[2], QWORD qwRemainder[2]);

__forceinline
VOID	lfsr128l(QWORD qw[2])
{
#ifdef XORSHIFT
	// 128-bit linear feedback shift register (XorShift form)
	//  using shift constants from Richard Peirce Brent

	QWORD	qwTemp = qw[1];
	qw[1] = qw[0];
	qw[0] ^= qw[0] << 33;
	qwTemp ^= qwTemp << 28;
	qw[0] ^= qw[0] >> 31;
	qwTemp ^= qwTemp >> 29;
	qw[0] ^= qwTemp;
#elif 0
	// 128-bit linear feedback shift register (XorShift form)
	//  using shift constants from Melissa O'Neill

	qw[1] ^= __shiftleft128(qw[0], qw[1], 26);
	qw[0] ^= qw[0] << 26;
	qw[0] ^= __shiftright128(qw[0], qw[1], 61);
	qw[1] ^= qw[1] >> 61;
	qw[1] ^= __shiftleft128(qw[0], qw[1], 37);
	qw[0] ^= qw[0] << 37;
#else
	// 128-bit linear feedback shift register (Galois form) using
	//  primitive polynomial 0x5DB2B62B0C5F8E1B:D8CCE715FCB2726D

	QWORD	qwMask = (LONGLONG) (qw[1]) >> 63;
	qw[1] = __shiftleft128(qw[0], qw[1], 1)
	      ^ (qwMask & 0x5DB2B62B0C5F8E1BULL);
	qw[0] = (qwMask & 0xD8CCE715FCB2726DULL) ^ (qw[0] << 1);
#endif
}

__forceinline
VOID	lfsr128r(QWORD qw[2])
{
#ifdef XORSHIFT
	// 128-bit linear feedback shift register (XorShift form)
	//  using shift constants from Sebastiano Vigna

	QWORD	qwTemp = qw[1];
	qw[1] = qw[0];
	qwTemp ^= qwTemp << 23;
	qw[0] ^= qw[0] >> 26;
	qwTemp ^= qwTemp >> 17;
	qw[0] ^= qwTemp;
#elif 0
	// 128-bit linear feedback shift register (XorShift form)
	//  using shift constants from Melissa O'Neill

	qw[1] ^= __shiftleft128(qw[0], qw[1], 11);
	qw[0] ^= qw[0] << 11;
	qw[0] ^= __shiftright128(qw[0], qw[1], 61);
	qw[1] ^= qw[1] >> 61;
	qw[1] ^= __shiftleft128(qw[0], qw[1], 45);
	qw[0] ^= qw[0] << 45;
#else
	// 128-bit linear feedback shift register (Galois form) using
	//  primitive polynomial 0xB64E4D3FA8E7331B:D871FA30D46D4DBA

	QWORD	qwMask = 0ULL - (qw[0] & 1ULL);
	qw[0] = __shiftright128(qw[0], qw[1], 1)
	      ^ (qwMask & 0xD871FA30D46D4DBAULL);
	qw[1] = (qwMask & 0xB64E4D3FA8E7331BULL) ^ (qw[1] >> 1);
#endif
}

__declspec(safebuffers)
BOOL	PrintConsole(HANDLE hConsole, LPCWSTR lpFormat, ...)
{
	WCHAR	szBuffer[1025];
	DWORD	dwBuffer;
	DWORD	dwConsole;

	va_list	vaInserts;
	va_start(vaInserts, lpFormat);

	dwBuffer = wvsprintf(szBuffer, lpFormat, vaInserts);

	va_end(vaInserts);

	if (dwBuffer == 0L)
		return FALSE;

	if (!WriteConsole(hConsole, szBuffer, dwBuffer, &dwConsole, NULL))
		return FALSE;

	return dwConsole == dwBuffer;
}

__declspec(noreturn)
VOID	wmainCRTStartup(VOID)
{
	DWORD	dw, dwCPUID[12];

	QWORD	qwT0, qwT1, qwT2, qwT3;
	QWORD	qwTx, qwTy, qwTz;

	QWORD	qwQuotient[2], qwRemainder[2];
		// bit-vector of prime numbers: 2**n == prime
	QWORD	qwDividend[2] = {0x28208A20A08A28ACULL, 0x800228A202088288ULL};
		// 2**128 / golden ratio
	QWORD	qwDivisor[2] = {0xF39CC0605CEDC834ULL, 0x9E3779B97F4A7C15ULL};

	HANDLE	hThread = GetCurrentThread();
	HANDLE	hOutput = GetStdHandle(STD_OUTPUT_HANDLE);

	if (hOutput == INVALID_HANDLE_VALUE)
		ExitProcess(GetLastError());

	if (SetThreadIdealProcessor(hThread, 0UL) == -1L)
		ExitProcess(GetLastError());

	__cpuid(dwCPUID, 0x80000000UL);

	if (*dwCPUID >= 0x80000004UL)
	{
		__cpuid(dwCPUID, 0x80000002UL);
		__cpuid(dwCPUID + 4, 0x80000003UL);
		__cpuid(dwCPUID + 8, 0x80000004UL);
	}
	else
		strcpy((LPSTR) dwCPUID, "undetermined processor");

	PrintConsole(hOutput, L"\nTesting 128-bit division...\n");

	for (dw = 1UL; dw < sizeof(owVector) / sizeof(*owVector); dw++)
	{
		PrintConsole(hOutput, L"\r%lu", dw);
#if 0
		if ((owVector[dw].qwDivisor[1] | owVector[dw].qwDivisor[0]) == 0ULL)
			continue;
#endif
		__udivmodti4(qwQuotient, owVector[dw].qwDividend, owVector[dw].qwDivisor, qwRemainder);

		if ((qwQuotient[1] != owVector[dw].qwQuotient[1])
		 || (qwQuotient[0] != owVector[dw].qwQuotient[0]))
			PrintConsole(hOutput, L"\t0x%016I64X:%016I64X\a / %016I64X:%016I64X\n"
			                      L"\t0x%016I64X:%016I64X\n"
			                      L"\t0x%016I64X:%016I64X\n",
			             owVector[dw].qwDividend[1], owVector[dw].qwDividend[0],
			             owVector[dw].qwDivisor[1], owVector[dw].qwDivisor[0],
			             owVector[dw].qwQuotient[1], owVector[dw].qwQuotient[0],
			             qwQuotient[1], qwQuotient[0]);

		if ((qwRemainder[1] != owVector[dw].qwRemainder[1])
		 || (qwRemainder[0] != owVector[dw].qwRemainder[0]))
			PrintConsole(hOutput, L"\t0x%016I64X:%016I64X\a %% %016I64X:%016I64X\n"
			                      L"\t0x%016I64X:%016I64X\n"
			                      L"\t0x%016I64X:%016I64X\n",
			             owVector[dw].qwDividend[1], owVector[dw].qwDividend[0],
			             owVector[dw].qwDivisor[1], owVector[dw].qwDivisor[0],
			             owVector[dw].qwRemainder[1], owVector[dw].qwRemainder[0],
			             qwRemainder[1], qwRemainder[0]);
	}

	PrintConsole(hOutput, L"\nTiming 128-bit division on %.48hs\n", dwCPUID);
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT0))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT0))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		lfsr128l(qwDividend);
		__unopti4(qwQuotient, qwDividend, qwDivisor, NULL);
		lfsr128r(qwDivisor);
		__unopti4(qwQuotient, qwDividend, qwDivisor, NULL);
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT1))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT1))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		lfsr128l(qwDividend);
		__udivmodti4(qwQuotient, qwDividend, qwDivisor, NULL);
		lfsr128r(qwDivisor);
		__udivmodti4(qwQuotient, qwDividend, qwDivisor, NULL);
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT2))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT2))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		lfsr128l(qwDividend);
		__umulti3(qwQuotient, qwDividend, qwDivisor);
		lfsr128r(qwDivisor);
		__umulti3(qwQuotient, qwDividend, qwDivisor);
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT3))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT3))
#endif
		ExitProcess(GetLastError());

	qwTz = qwT3 - qwT0;
	qwT3 -= qwT2;
	qwT2 -= qwT1;
	qwT1 -= qwT0;
	qwTy = qwT3 - qwT1;
	qwTx = qwT2 - qwT1;
#ifdef CYCLES
	PrintConsole(hOutput, L"\n"
	                      L"__unopti4()     %6I64u.%09I64u      0\n"
	                      L"__udivmodti4()  %6I64u.%09I64u %6I64u.%09I64u\n"
	                      L"__umulti3()     %6I64u.%09I64u %6I64u.%09I64u\n"
	                      L"                %6I64u.%09I64u clock cycles\n",
	             qwT1 / 1000000000ULL, qwT1 % 1000000000ULL,
	             qwT2 / 1000000000ULL, qwT2 % 1000000000ULL,
	             qwTx / 1000000000ULL, qwTx % 1000000000ULL,
	             qwT3 / 1000000000ULL, qwT3 % 1000000000ULL,
	             qwTy / 1000000000ULL, qwTy % 1000000000ULL,
	             qwTz / 1000000000ULL, qwTz % 1000000000ULL);
#else
	PrintConsole(hOutput, L"\n"
	                      L"__unopti4()     %6I64u.%07I64u      0\n"
	                      L"__udivmodti4()  %6I64u.%07I64u %6I64u.%07I64u\n"
	                      L"__umulti3()     %6I64u.%07I64u %6I64u.%07I64u\n"
	                      L"                %6I64u.%07I64u nano-seconds\n",
	             qwT1 / 10000000ULL, qwT1 % 10000000ULL,
	             qwT2 / 10000000ULL, qwT2 % 10000000ULL,
	             qwTx / 10000000ULL, qwTx % 10000000ULL,
	             qwT3 / 10000000ULL, qwT3 % 10000000ULL,
	             qwTy / 10000000ULL, qwTy % 10000000ULL,
	             qwTz / 10000000ULL, qwTz % 10000000ULL);
#endif
	ExitProcess(0UL);
}
// Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifndef _M_AMD64
#pragma message("For AMD64 platform only!")
#endif

#pragma comment(compiler)
#pragma comment(user, __TIMESTAMP__)

#define _CRT_SECURE_NO_WARNINGS
#define STRICT
#define UNICODE
#define WIN32_LEAN_AND_MEAN

#include <windows.h>

typedef	ULONGLONG	QWORD;

#ifndef NATIVE
#define _(DIVIDEND, DIVISOR)	{(DIVIDEND), (DIVISOR), (DIVIDEND) / (DIVISOR), (DIVIDEND) % (DIVISOR)}

const	struct
{
	QWORD	qwDividend, qwDivisor, qwQuotient, qwRemainder;
} qwVector[] = {_(0x0000000000000000ULL, 0x0000000000000001ULL),
                _(0x0000000000000001ULL, 0x0000000000000001ULL),
                _(0x0000000000000002ULL, 0x0000000000000001ULL),
                _(0x0000000000000002ULL, 0x0000000000000002ULL),
                _(0x0000000000000000ULL, 0xFFFFFFFFFFFFFFFFULL),
                _(0x0000000000000001ULL, 0xFFFFFFFFFFFFFFFFULL),
                _(0x0000000000000001ULL, 0xFFFFFFFFFFFFFFFEULL),
                _(0x0000000000000002ULL, 0xFFFFFFFFFFFFFFFEULL),
                _(0x000000000FFFFFFFULL, 0x0000000000000001ULL),
                _(0x0000000FFFFFFFFFULL, 0x000000000000000FULL),
                _(0x0000000FFFFFFFFFULL, 0x0000000000000010ULL),
                _(0x0000000000000100ULL, 0x000000000FFFFFFFULL),
                _(0x00FFFFFFF0000000ULL, 0x0000000010000000ULL),
                _(0x07FFFFFF80000000ULL, 0x0000000080000000ULL),
                _(0x7FFFFFFEFFFFFFF0ULL, 0xFFFFFFFFFFFFFFFEULL),
                _(0x7FFFFFFEFFFFFFF0ULL, 0x0000FFFFFFFFFFFEULL),
                _(0x7FFFFFFEFFFFFFF0ULL, 0x7FFFFFFEFFFFFFF0ULL),
                _(0x7FFFFFFFFFFFFFFFULL, 0x8000000000000000ULL),
                _(0x7FFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFEULL),
                _(0x7FFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL),
                _(0x8000000000000000ULL, 0x0000000000000001ULL),
                _(0x8000000000000000ULL, 0x0000000000000002ULL),
                _(0x8000000000000000ULL, 0x00000000FFFFFFFEULL),
                _(0x8000000000000000ULL, 0x00000000FFFFFFFFULL),
                _(0x8000000000000000ULL, 0x0000000100000000ULL),
                _(0x8000000000000000ULL, 0x0000000100000001ULL),
                _(0x8000000000000000ULL, 0x0000000100000002ULL),
                _(0x8000000000000000ULL, 0xFFFFFFFF00000000ULL),
                _(0x8000000000000000ULL, 0xFFFFFFFFFFFFFFFEULL),
                _(0x8000000000000000ULL, 0xFFFFFFFFFFFFFFFFULL),
                _(0x8000000080000000ULL, 0x0000000080000000ULL),
                _(0x8000000080000001ULL, 0x0000000080000001ULL),
                _(0xFFFFFFFEFFFFFFF0ULL, 0xFFFFFFFFFFFFFFFEULL),
                _(0xFFFFFFFFFFFFFFFEULL, 0x0000000080000000ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x0000000000000001ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x0000000000000002ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x0000000100000003ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x00000001C0000001ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x0000000380000003ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x8000000000000000ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x7FFFFFFFFFFFFFFFULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFEULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL)};

#undef _
#ifdef ASSEMBLY
QWORD	__udivmoddi4(QWORD dividend, QWORD divisor, QWORD *remainder);
#else
__declspec(noinline)
QWORD	__udivmoddi4(QWORD dividend, QWORD divisor, QWORD *remainder)
{
	QWORD	quotient;
	DWORD	index1, index2;

	if (_BitScanReverse64(&index2, divisor))
		if (_BitScanReverse64(&index1, dividend))
#if 0
			if (index1 >= index2)
#else
			if (dividend >= divisor)
#endif
			{
				// dividend >= divisor > 0,
				//  64 > index1 >= index2 >= 0
				// NOTE: number of leading '0' bits = 64 - index

				divisor <<= index1 - index2;
				quotient = 0ULL;

				do
				{
					quotient <<= 1;

					if (dividend >= divisor)
					{
						dividend -= divisor;
						quotient |= 1ULL;
					}

					divisor >>= 1;
				} while (index1 >= ++index2);

				if (remainder != NULL)
					*remainder = dividend;

				return quotient;
			}
			else // divisor > dividend > 0:
			     //  quotient = 0, remainder = dividend
			{
				if (remainder != NULL)
					*remainder = dividend;

				return 0ULL;
			}
		else // divisor > dividend == 0:
		     //  quotient = 0, remainder = 0
		{
			if (remainder != NULL)
				*remainder = 0ULL;

			return 0ULL;
		}
	else // divisor == 0
	{
		if (remainder != NULL)
			return dividend % divisor;

		return dividend / divisor;
	}
}
#endif // ASSEMBLY
#else
__declspec(noinline)
QWORD	__udivmoddi4(QWORD dividend, QWORD divisor, QWORD *remainder)
{
	if (remainder != NULL)
		*remainder = dividend % divisor;

	return dividend / divisor;
}
#endif // NATIVE

__declspec(noinline)
QWORD	__unopdi4(QWORD dividend, QWORD divisor, QWORD *remainder)
{
	if (remainder != NULL)
		*remainder = divisor;

	return dividend;
}

__declspec(noinline)
QWORD	__umuldi4(QWORD multiplicand, QWORD multiplier, QWORD *dummy)
{
	if (dummy != NULL)
		*dummy = 0ULL;

	return multiplicand * multiplier;
}

__declspec(safebuffers)
BOOL	PrintConsole(HANDLE hConsole, LPCWSTR lpFormat, ...)
{
	WCHAR	szBuffer[1025];
	DWORD	dwBuffer;
	DWORD	dwConsole;

	va_list	vaInserts;
	va_start(vaInserts, lpFormat);

	dwBuffer = wvsprintf(szBuffer, lpFormat, vaInserts);

	va_end(vaInserts);

	if (dwBuffer == 0L)
		return FALSE;

	if (!WriteConsole(hConsole, szBuffer, dwBuffer, &dwConsole, NULL))
		return FALSE;

	return dwConsole == dwBuffer;
}

__declspec(noreturn)
VOID	wmainCRTStartup(VOID)
{
	DWORD	dw, dwCPUID[12];

	QWORD	qwT0, qwT1, qwT2, qwT3;
	QWORD	qwTx, qwTy, qwTz;

	volatile
	QWORD	qwQuotient, qwRemainder;
		// bit-vector of prime numbers: 2**n == prime
	QWORD	qwDividend = 0x28208A20A08A28ACULL;
		// 2**64 / golden ratio
	QWORD	qwDivisor = 0x9E3779B97F4A7C15ULL;

	HANDLE	hThread = GetCurrentThread();
	HANDLE	hOutput = GetStdHandle(STD_OUTPUT_HANDLE);

	if (hOutput == INVALID_HANDLE_VALUE)
		ExitProcess(GetLastError());

	if (SetThreadIdealProcessor(hThread, 0UL) == -1L)
		ExitProcess(GetLastError());

	__cpuid(dwCPUID, 0x80000000UL);

	if (*dwCPUID >= 0x80000004UL)
	{
		__cpuid(dwCPUID, 0x80000002UL);
		__cpuid(dwCPUID + 4, 0x80000003UL);
		__cpuid(dwCPUID + 8, 0x80000004UL);
	}
	else
		strcpy((LPSTR) dwCPUID, "undetermined processor");
#ifndef NATIVE
#ifdef ASSEMBLY
	PrintConsole(hOutput, L"\nTesting 64-bit assembly division...\n");
#else
	PrintConsole(hOutput, L"\nTesting 64-bit C division...\n");
#endif
	for (dw = 0L; dw < sizeof(qwVector) / sizeof(*qwVector); dw++)
	{
		PrintConsole(hOutput, L"\r%lu", dw);

		qwQuotient = __udivmoddi4(qwVector[dw].qwDividend, qwVector[dw].qwDivisor, &qwRemainder);

		if (qwQuotient != qwVector[dw].qwQuotient)
			PrintConsole(hOutput, L"\t%I64u / %I64u:\a quotient %I64u not equal %I64u\n",
			             qwVector[dw].qwDividend, qwVector[dw].qwDivisor, qwQuotient, qwVector[dw].qwQuotient);

		if (qwQuotient > qwVector[dw].qwDividend)
			PrintConsole(hOutput, L"\t%I64u / %I64u:\a quotient %I64u greater dividend\n",
			             qwVector[dw].qwDividend, qwVector[dw].qwDivisor, qwQuotient);

		if (qwRemainder != qwVector[dw].qwRemainder)
			PrintConsole(hOutput, L"\t%I64u / %I64u:\a remainder %I64u not equal %I64u\n",
			             qwVector[dw].qwDividend, qwVector[dw].qwDivisor, qwRemainder, qwVector[dw].qwRemainder);

		if (qwRemainder >= qwVector[dw].qwDivisor)
			PrintConsole(hOutput, L"\t%I64u %% %I64u:\a remainder %I64u not less divisor\n",
			             qwVector[dw].qwDividend, qwVector[dw].qwDivisor, qwRemainder);
	}
#ifdef ASSEMBLY
	PrintConsole(hOutput, L"\nTiming 64-bit assembly division on %.48hs\n", dwCPUID);
#else
	PrintConsole(hOutput, L"\nTiming 64-bit C division on %.48hs\n", dwCPUID);
#endif
#else
	PrintConsole(hOutput, L"\nTiming 64-bit native division on %.48hs\n", dwCPUID);
#endif // NATIVE
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT0))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT0))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from George Marsaglia

		qwDividend ^= qwDividend << 14;
		qwDividend ^= qwDividend >> 31;
		qwDividend ^= qwDividend << 45;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0xAD93D23594C935A9

		qwDividend = (qwDividend << 1)
		           ^ (((LONGLONG) qwDividend >> 63) & 0xAD93D23594C935A9ULL);
#endif
		qwQuotient = __unopdi4(qwDividend, qwDivisor, &qwRemainder);
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from Richard Peirce Brent

		qwDivisor ^= qwDivisor << 10;
		qwDivisor ^= qwDivisor >> 15;
		qwDivisor ^= qwDivisor << 4;
		qwDivisor ^= qwDivisor >> 13;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0x2B5926535897936B

		qwDivisor = (qwDivisor >> 1)
		          ^ ((0ULL - (qwDivisor & 1ULL)) & 0x2B5926535897936BULL);
#endif
		qwQuotient = __unopdi4(qwDividend, qwDivisor, &qwRemainder);
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT1))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT1))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from George Marsaglia

		qwDividend ^= qwDividend << 14;
		qwDividend ^= qwDividend >> 31;
		qwDividend ^= qwDividend << 45;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0xAD93D23594C935A9

		qwDividend = (qwDividend << 1)
		           ^ (((LONGLONG) qwDividend >> 63) & 0xAD93D23594C935A9ULL);
#endif
		qwQuotient = __udivmoddi4(qwDividend, qwDivisor, &qwRemainder);
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from Richard Peirce Brent

		qwDivisor ^= qwDivisor << 10;
		qwDivisor ^= qwDivisor >> 15;
		qwDivisor ^= qwDivisor << 4;
		qwDivisor ^= qwDivisor >> 13;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0x2B5926535897936B

		qwDivisor = (qwDivisor >> 1)
		          ^ ((0ULL - (qwDivisor & 1ULL)) & 0x2B5926535897936BULL);
#endif
		qwQuotient = __udivmoddi4(qwDividend, qwDivisor, &qwRemainder);
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT2))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT2))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from George Marsaglia

		qwDividend ^= qwDividend << 14;
		qwDividend ^= qwDividend >> 31;
		qwDividend ^= qwDividend << 45;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0xAD93D23594C935A9

		qwDividend = (qwDividend << 1)
		           ^ (((LONGLONG) qwDividend >> 63) & 0xAD93D23594C935A9ULL);
#endif
		qwQuotient = __umuldi4(qwDividend, qwDivisor, &qwRemainder);
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from Richard Peirce Brent

		qwDivisor ^= qwDivisor << 10;
		qwDivisor ^= qwDivisor >> 15;
		qwDivisor ^= qwDivisor << 4;
		qwDivisor ^= qwDivisor >> 13;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0x2B5926535897936B

		qwDivisor = (qwDivisor >> 1)
		          ^ ((0ULL - (qwDivisor & 1ULL)) & 0x2B5926535897936BULL);
#endif
		qwQuotient = __umuldi4(qwDividend, qwDivisor, &qwRemainder);
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT3))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT3))
#endif
		ExitProcess(GetLastError());

	qwTz = qwT3 - qwT0;
	qwT3 -= qwT2;
	qwT2 -= qwT1;
	qwT1 -= qwT0;
	qwTy = qwT3 > qwT1 ? qwT3 - qwT1 : 0ULL;
	qwTx = qwT2 - qwT1;
#ifdef CYCLES
	PrintConsole(hOutput, L"\n"
	                      L"__unopdi4()     %6I64u.%09I64u      0\n"
	                      L"__udivmoddi4()  %6I64u.%09I64u %6I64u.%09I64u\n"
	                      L"__umuldi3()     %6I64u.%09I64u %6I64u.%09I64u\n"
	                      L"                %6I64u.%09I64u clock cycles\n",
	             qwT1 / 1000000000ULL, qwT1 % 1000000000ULL,
	             qwT2 / 1000000000ULL, qwT2 % 1000000000ULL,
	             qwTx / 1000000000ULL, qwTx % 1000000000ULL,
	             qwT3 / 1000000000ULL, qwT3 % 1000000000ULL,
	             qwTy / 1000000000ULL, qwTy % 1000000000ULL,
	             qwTz / 1000000000ULL, qwTz % 1000000000ULL);
#else
	PrintConsole(hOutput, L"\n"
	                      L"__unopdi4()     %6I64u.%07I64u      0\n"
	                      L"__udivmoddi4()  %6I64u.%07I64u %6I64u.%07I64u\n"
	                      L"__umuldi3()     %6I64u.%07I64u %6I64u.%07I64u\n"
	                      L"                %6I64u.%07I64u nano-seconds\n",
	             qwT1 / 10000000ULL, qwT1 % 10000000ULL,
	             qwT2 / 10000000ULL, qwT2 % 10000000ULL,
	             qwTx / 10000000ULL, qwTx % 10000000ULL,
	             qwT3 / 10000000ULL, qwT3 % 10000000ULL,
	             qwTy / 10000000ULL, qwTy % 10000000ULL,
	             qwTz / 10000000ULL, qwTz % 10000000ULL);
#endif
	ExitProcess(0UL);
}
Save the first C source shown above as 128-amd.c and the second C source as 64-amd.c in an arbitrary, preferable empty directory, save the second assembly source shown above as udivmodti4.asm, the fourth assembly source as udivmodti4-hybrid.asm, and the fifth assembly source as udivmoddi4.asm in this directory too, then start the command prompt of the Windows software development kit for the AMD64 platform there, run the following command lines to assemble, compile and link build the benchmark programs 64.exe, 64-div.exe, 128.exe plus 128-hybrid.exe, and execute them:
ML64.EXE /c /W3 /X udivmoddi4.asm
CL.EXE /c /DCYCLES /DASSEMBLY /GAFy /O2y /W4 /Zl 64-amd.c
LINK.EXE /DYNAMICBASE /ENTRY:wmainCRTStartup /MACHINE:AMD64 /NODEFAULTLIB /NXCOMPAT /OPT:REF /OUT:64.exe /SUBSYSTEM:CONSOLE 64-amd.obj udivmoddi4.obj kernel32.lib user32.lib
CL.EXE /c /DCYCLES /DNATIVE /GAFy /O2y /W4 /Zl 64-amd.c
LINK.EXE /DYNAMICBASE /ENTRY:wmainCRTStartup /MACHINE:AMD64 /NODEFAULTLIB /NXCOMPAT /OPT:REF /OUT:64-div.exe /SUBSYSTEM:CONSOLE 64-amd.obj kernel32.lib user32.lib
ML64.EXE /c /DJMPLESS /W3 /X udivmodti4.asm
ML64.EXE /c /W3 /X udivmodti4-hybrid.asm
CL.EXE /c /DCYCLES /GAFy /O2y /W4 /Zl 128-amd.c
LINK.EXE /DYNAMICBASE /ENTRY:wmainCRTStartup /MACHINE:AMD64 /NODEFAULTLIB /NXCOMPAT /OPT:REF /OUT:128.exe /SUBSYSTEM:CONSOLE 128-amd.obj udivmodti4.obj kernel32.lib user32.lib
LINK.EXE /DYNAMICBASE /ENTRY:wmainCRTStartup /MACHINE:AMD64 /NODEFAULTLIB /NXCOMPAT /OPT:REF /OUT:128-hybrid.exe /SUBSYSTEM:CONSOLE 128-amd.obj udivmodti4-hybrid.obj kernel32.lib user32.lib
.\128.exe
.\128-hybrid.exe
.\64-div.exe
.\64.exe
Note: the command lines can be copied and pasted as block into the Command Processor window!

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: both 64-bit programs are pure Win32 applications and build without the MSVCRT libraries.

Microsoft (R) Macro Assembler (x64) Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: udivmoddi4.asm

Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

64-amd.c
64-amd.c(233) : warning C4090: 'function' : different 'volatile' qualifiers
64-amd.c(282) : warning C4090: 'function' : different 'volatile' qualifiers
64-amd.c(298) : warning C4090: 'function' : different 'volatile' qualifiers
64-amd.c(323) : warning C4090: 'function' : different 'volatile' qualifiers
64-amd.c(339) : warning C4090: 'function' : different 'volatile' qualifiers
64-amd.c(364) : warning C4090: 'function' : different 'volatile' qualifiers
64-amd.c(380) : warning C4090: 'function' : different 'volatile' qualifiers

Microsoft (R) Incremental Linker Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

64-amd.c
64-amd.c(282) : warning C4090: 'function' : different 'volatile' qualifiers
64-amd.c(298) : warning C4090: 'function' : different 'volatile' qualifiers
64-amd.c(323) : warning C4090: 'function' : different 'volatile' qualifiers
64-amd.c(339) : warning C4090: 'function' : different 'volatile' qualifiers
64-amd.c(364) : warning C4090: 'function' : different 'volatile' qualifiers
64-amd.c(380) : warning C4090: 'function' : different 'volatile' qualifiers

Microsoft (R) Incremental Linker Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

Microsoft (R) Macro Assembler (x64) Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: udivmodti4.asm

Microsoft (R) Macro Assembler (x64) Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: udivmodti4-hybrid.asm

Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

128-amd.c
128-amd.c(265) : warning C4090: 'function' : different 'const' qualifiers
128-amd.c(265) : warning C4090: 'function' : different 'const' qualifiers

Microsoft (R) Incremental Linker Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

Microsoft (R) Incremental Linker Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

Testing 128-bit division...
80
Timing 128-bit division on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz

__unopti4()          8.989818268      0
__udivmodti4()      88.876676755     79.886858487
__umulti3()         15.686146340      6.696328072
                   113.552641363 clock cycles

Testing 128-bit division...
80
Timing 128-bit division on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz

__unopti4()          8.902544693      0
__udivmodti4()      37.789423743     28.886879050
__umulti3()         15.540114856      6.637570163
                    62.232083292 clock cycles

Timing 64-bit native division on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz

__unopdi4()          6.807291005      0
__udivmoddi4()      44.883843913     38.076552908
__umuldi3()          7.463218904      0.655927899
                    59.154353822 clock cycles

Testing 64-bit assembly division...
42
Timing 64-bit assembly division on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz

__unopdi4()          6.842327463      0
__udivmoddi4()      42.269848160     35.427520697
__umuldi3()          7.400889615      0.558562152
                    56.513065238 clock cycles
[…]

Testing 128-bit division...
80
Timing 128-bit division on Intel(R) Core(TM) i5-4670 CPU @ 3.40GHz

__unopti4()          6.726640248      0
__udivmodti4()      97.576265506     90.849625258
                   104.302905754 clock cycles

Timing 64-bit native division on Intel(R) Core(TM) i5-4670 CPU @ 3.40GHz

__unopdi4()          4.611097449      0
__udivmoddi4()      53.284829173     48.673731724
__umuldi3()          5.378845916      0.767748467
                    63.274772538 clock cycles
[…]

Testing 128-bit division...
80
Timing 128-bit division on Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz

__unopti4()          7.199683026      0
__udivmodti4()     106.511525258     99.311842232
__umulti3()          8.255705996      1.056022970
                   121.966914280 clock cycles

Timing 64-bit native division on Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz

__unopdi4()          4.220402804      0
__udivmoddi4()      49.618307342     45.397904538
__umuldi3()          4.714148528      0.493745724
                    58.552858674 clock cycles
[…]

Testing 128-bit division...
80
Timing 128-bit division on Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz

__unopti4()          6.352947369      0
__udivmodti4()      99.540832388     93.187885019
__umulti3()          7.698108475      1.345161106
                   113.591888232 clock cycles

Testing 128-bit division...
9648
Timing 128-bit division on Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz

__unopti4()          6.388772778      0
__udivmodti4()      26.116034388     19.727261610
__umulti3()          7.603151388      1.214378610
                    40.107958554 clock cycles

Timing 64-bit native division on Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz

__unopdi4()          3.970766137      0
__udivmoddi4()      41.977970602     38.007204465
__umuldi3()          4.388517417      0.417751280
                    50.337254156 clock cycles

Testing 64-bit assembly division...
42
Timing 64-bit assembly division on Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz

__unopdi4()          3.987179367      0
__udivmoddi4()      20.801602647     16.814423280
__umuldi3()          4.204111850      0.216932483
                    28.992893864 clock cycles
[…]

Testing 128-bit division...
80
Timing 128-bit division on Intel(R) Core(TM) i5-9500 CPU @ 3.00GHz

__unopti4()          5.533478554      0
__udivmodti4()      79.084291140     73.550812586
__umulti3()          6.123935107      0.590456553
                    90.741704801 clock cycles

Timing 64-bit native division on Intel(R) Core(TM) i5-9500 CPU @ 3.00GHz

__unopdi4()          3.153262630      0
__udivmoddi4()      36.889642083     33.736379453
__umuldi3()          3.505350719      0.352088089
                    43.548255432 clock cycles
[…]

Testing 128-bit division...
80
Timing 128-bit division on AMD Ryzen 7 2700X Eight-Core Processor

__unopti4()          6.079204604      0
__udivmodti4()      38.381435026     32.302230422
                    44.460639630 clock cycles

Timing 64-bit native division on AMD Ryzen 7 2700X Eight-Core Processor

__unopdi4()          4.419591430      0
__udivmoddi4()      25.641230738     21.221639308
__umuldi3()          5.227782949      0.808191519
                    35.288605117 clock cycles
[…]

Testing 128-bit division...
80
Timing 128-bit division on AMD Ryzen 5 3600 6-Core Processor

__unopti4()          6.993401683      0
__udivmodti4()      39.019684399     32.026282716
__umulti3()          9.493567155      2.500165472
                    55.506653237 clock cycles

Timing 64-bit native division on AMD Ryzen 5 3600 6-Core Processor

__unopdi4()          4.868280208      0
__udivmoddi4()      26.223927786     21.355647578
__umuldi3()          4.795732764      0.000000000
                    35.887940758 clock cycles

Benchmark Program for i386 Processors

The following C program for 32-bit processors measures the execution time for one billion divisions of uniform distributed 64-bit pseudo-random numbers with the __udivmoddi4() function.

Note: with the macro CYCLES defined, this program measures the execution time in processor clock cycles and runs on Windows Vista® and newer versions, else it measures the execution time in nano-seconds and runs on all versions of Windows NT®.

// Copyright © 2004-2020, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifndef _M_IX86
#pragma message("For I386 platform only!")
#endif

#pragma comment(compiler)
#pragma comment(user, __TIMESTAMP__)

#define _CRT_SECURE_NO_WARNINGS
#define STRICT
#define UNICODE
#define WIN32_LEAN_AND_MEAN

#include <windows.h>

typedef	LONGLONG	SQWORD;
typedef	ULONGLONG	QWORD;

#define _(DIVIDEND, DIVISOR)	{(DIVIDEND), (DIVISOR), (DIVIDEND) / (DIVISOR), (DIVIDEND) % (DIVISOR)}

const	struct	_ull
{
	QWORD	ullDividend, ullDivisor, ullQuotient, ullRemainder;
} ullVector[] = {_(0x0000000000000000ULL, 0x0000000000000001ULL),
                 _(0x0000000000000001ULL, 0x0000000000000001ULL),
                 _(0x0000000000000002ULL, 0x0000000000000001ULL),
                 _(0x0000000000000002ULL, 0x0000000000000002ULL),
                 _(0x0000000000000000ULL, 0xFFFFFFFFFFFFFFFFULL),
                 _(0x0000000000000001ULL, 0xFFFFFFFFFFFFFFFFULL),
                 _(0x0000000000000001ULL, 0xFFFFFFFFFFFFFFFEULL),
                 _(0x0000000000000002ULL, 0xFFFFFFFFFFFFFFFEULL),
                 _(0x000000000FFFFFFFULL, 0x0000000000000001ULL),
                 _(0x0000000FFFFFFFFFULL, 0x000000000000000FULL),
                 _(0x0000000FFFFFFFFFULL, 0x0000000000000010ULL),
                 _(0x0000000000000100ULL, 0x000000000FFFFFFFULL),
                 _(0x00FFFFFFF0000000ULL, 0x0000000010000000ULL),
                 _(0x07FFFFFF80000000ULL, 0x0000000080000000ULL),
                 _(0x7FFFFFFEFFFFFFF0ULL, 0xFFFFFFFFFFFFFFFEULL),
                 _(0x7FFFFFFEFFFFFFF0ULL, 0x0000FFFFFFFFFFFEULL),
                 _(0x7FFFFFFEFFFFFFF0ULL, 0x7FFFFFFEFFFFFFF0ULL),
                 _(0x7FFFFFFFFFFFFFFFULL, 0x8000000000000000ULL),
                 _(0x7FFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFEULL),
                 _(0x7FFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL),
                 _(0x8000000000000000ULL, 0x0000000000000001ULL),
                 _(0x8000000000000000ULL, 0x0000000000000002ULL),
                 _(0x8000000000000000ULL, 0x00000000FFFFFFFEULL),
                 _(0x8000000000000000ULL, 0x00000000FFFFFFFFULL),
                 _(0x8000000000000000ULL, 0x0000000100000000ULL),
                 _(0x8000000000000000ULL, 0x0000000100000001ULL),
                 _(0x8000000000000000ULL, 0x0000000100000002ULL),
                 _(0x8000000000000000ULL, 0xFFFFFFFF00000000ULL),
                 _(0x8000000000000000ULL, 0xFFFFFFFFFFFFFFFEULL),
                 _(0x8000000000000000ULL, 0xFFFFFFFFFFFFFFFFULL),
                 _(0x8000000080000000ULL, 0x0000000080000000ULL),
                 _(0x8000000080000001ULL, 0x0000000080000001ULL),
                 _(0xFFFFFFFEFFFFFFF0ULL, 0xFFFFFFFFFFFFFFFEULL),
                 _(0xFFFFFFFFFFFFFFFEULL, 0x0000000080000000ULL),
                 _(0xFFFFFFFFFFFFFFFFULL, 0x0000000000000001ULL),
                 _(0xFFFFFFFFFFFFFFFFULL, 0x0000000000000002ULL),
                 _(0xFFFFFFFFFFFFFFFFULL, 0x0000000100000003ULL),
                 _(0xFFFFFFFFFFFFFFFFULL, 0x00000001C0000001ULL),
                 _(0xFFFFFFFFFFFFFFFFULL, 0x0000000380000003ULL),
                 _(0xFFFFFFFFFFFFFFFFULL, 0x8000000000000000ULL),
                 _(0xFFFFFFFFFFFFFFFFULL, 0x7FFFFFFFFFFFFFFFULL),
                 _(0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFEULL),
                 _(0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL)};

const	struct	_ll
{
	SQWORD	llDividend, llDivisor, llQuotient, llRemainder;
} llVector[] = {_(0x0000000000000000LL, 0x0000000000000001LL),	// 0, 1
                _(0x0000000000000001LL, 0x0000000000000001LL),	// 1, 1
                _(0x0000000000000000LL, 0xFFFFFFFFFFFFFFFFLL),	// 0, -1
                _(0x0000000000000001LL, 0xFFFFFFFFFFFFFFFFLL),	// 1, -1
                _(0x0000000000000001LL, 0xFFFFFFFFFFFFFFFELL),	// 1, -2
                _(0x0000000000000002LL, 0xFFFFFFFFFFFFFFFELL),	// 2, -2
                _(0x000000000FFFFFFFLL, 0x0000000000000001LL),
                _(0x0000000FFFFFFFFFLL, 0x000000000000000FLL),
                _(0x0000000FFFFFFFFFLL, 0x0000000000000010LL),
                _(0x0000000000000100LL, 0x000000000FFFFFFFLL),
                _(0x00FFFFFFF0000000LL, 0x0000000010000000LL),
                _(0x07FFFFFF80000000LL, 0x0000000080000000LL),
                _(0x7FFFFFFEFFFFFFF0LL, 0xFFFFFFFFFFFFFFFELL),
                _(0x7FFFFFFEFFFFFFF0LL, 0x0000FFFFFFFFFFFELL),
                _(0x7FFFFFFEFFFFFFF0LL, 0x7FFFFFFEFFFFFFF0LL),
                _(0x7FFFFFFFFFFFFFFFLL, 0x8000000000000000LL),	// llmax, llmin
                _(0x7FFFFFFFFFFFFFFFLL, 0xFFFFFFFFFFFFFFFDLL),	// llmax, -3
                _(0x7FFFFFFFFFFFFFFFLL, 0xFFFFFFFFFFFFFFFELL),	// llmax, -2
                _(0x7FFFFFFFFFFFFFFFLL, 0xFFFFFFFFFFFFFFFFLL),	// llmax, -1
                _(0x8000000000000000LL, 0x0000000000000001LL),	// llmin, 1
                _(0x8000000000000000LL, 0x0000000000000002LL),	// llmin, 2
                _(0x8000000000000000LL, 0x0000000000000003LL),	// llmin, 3
                _(0x8000000000000000LL, 0x00000000FFFFFFFELL),
                _(0x8000000000000000LL, 0x00000000FFFFFFFFLL),
                _(0x8000000000000000LL, 0x0000000100000000LL),
                _(0x8000000000000000LL, 0x0000000100000001LL),
                _(0x8000000000000000LL, 0x0000000100000002LL),
                _(0x8000000000000000LL, 0x8000000000000000LL),	// llmin, llmin
                _(0x8000000000000000LL, 0xFFFFFFFF00000000LL),
                _(0x8000000000000000LL, 0xFFFFFFFFFFFFFFFDLL),	// llmin, -3
                _(0x8000000000000000LL, 0xFFFFFFFFFFFFFFFELL),	// llmin, -2
#ifndef _WIN64
                _(0x8000000000000000LL, 0xFFFFFFFFFFFFFFFFLL),	// llmin, -1
#endif
                _(0x8000000080000000LL, 0x0000000080000000LL),
                _(0x8000000080000001LL, 0x0000000080000001LL),
                _(0xFFFFFFFEFFFFFFF0LL, 0xFFFFFFFFFFFFFFFELL),
                _(0xFFFFFFFFFFFFFFFELL, 0x0000000080000000LL),
                _(0xFFFFFFFFFFFFFFFELL, 0x0000000000000001LL),	// -2, 1
                _(0xFFFFFFFFFFFFFFFELL, 0x0000000000000002LL),	// -2, 2
                _(0xFFFFFFFFFFFFFFFELL, 0xFFFFFFFFFFFFFFFELL),	// -2, -2
                _(0xFFFFFFFFFFFFFFFELL, 0xFFFFFFFFFFFFFFFFLL),	// -2, -1
                _(0xFFFFFFFFFFFFFFFFLL, 0x0000000000000001LL),	// -1, 1
                _(0xFFFFFFFFFFFFFFFFLL, 0x0000000000000002LL),	// -1, 2
                _(0xFFFFFFFFFFFFFFFFLL, 0xFFFFFFFFFFFFFFFELL),	// -1, -2
                _(0xFFFFFFFFFFFFFFFFLL, 0xFFFFFFFFFFFFFFFFLL)};	// -1, -1

#undef _

#ifndef BUILTIN
SQWORD	__divdi3(SQWORD dividend, SQWORD divisor);
SQWORD	__moddi3(SQWORD dividend, SQWORD divisor);
SQWORD	__muldi3(SQWORD multiplicand, SQWORD multiplier);
QWORD	__udivdi3(QWORD dividend, QWORD divisor);
QWORD	__umoddi3(QWORD dividend, QWORD divisor);
QWORD	__umuldi3(QWORD multiplicand, QWORD multiplier);
QWORD	__udivmoddi4(QWORD dividend, QWORD divisor, QWORD *remainder);

__declspec(noinline)
QWORD	__unopdi4(QWORD dividend, QWORD divisor, QWORD *remainder)
{
	if (remainder != NULL)
		*remainder = divisor;

	return dividend;
}
#else
__declspec(naked)
__declspec(noinline)
QWORD	WINAPI	_aullnop(QWORD left, QWORD right)
{
	__asm	ret	16
}
#endif // BUILTIN

__forceinline	// companion for __emul()
LONG	WINAPI	__ediv(LONGLONG llDividend, LONG lDivisor)
{
	__asm
	{
		mov	eax, dword ptr llDividend
		mov	edx, dword ptr llDividend+4
		idiv	lDivisor
	}
}

__forceinline	// companion for __emul()
LONG	WINAPI	__emod(LONGLONG llDividend, LONG lDivisor)
{
	__asm
	{
		mov	eax, dword ptr llDividend
		mov	edx, dword ptr llDividend+4
		idiv	lDivisor
		mov	eax, edx
	}
}

__forceinline	// companion for __emulu()
DWORD	WINAPI	__edivu(QWORD ullDividend, DWORD ulDivisor)
{
	__asm
	{
		mov	eax, dword ptr ullDividend
		mov	edx, dword ptr ullDividend+4
		div	ulDivisor
	}
}

__forceinline	// companion for __emulu()
DWORD	WINAPI	__emodu(QWORD ullDividend, DWORD ulDivisor)
{
	__asm
	{
		mov	eax, dword ptr ullDividend
		mov	edx, dword ptr ullDividend+4
		div	ulDivisor
		mov	eax, edx
	}
}

__declspec(safebuffers)
BOOL	PrintConsole(HANDLE hConsole, LPCWSTR lpFormat, ...)
{
	WCHAR	szBuffer[1025];
	DWORD	dwBuffer;
	DWORD	dwConsole;

	va_list	vaInserts;
	va_start(vaInserts, lpFormat);

	dwBuffer = wvsprintf(szBuffer, lpFormat, vaInserts);

	va_end(vaInserts);

	if (dwBuffer == 0L)
		return FALSE;

	if (!WriteConsole(hConsole, szBuffer, dwBuffer, &dwConsole, NULL))
		return FALSE;

	return dwConsole == dwBuffer;
}

__declspec(noreturn)
VOID	wmainCRTStartup(VOID)
{
	DWORD	dw, dwCPUID[12];

	QWORD	qwT0, qwT1, qwT2, qwT3;
	QWORD	qwTx, qwTy, qwTz;

	QWORD	ullQuotient, ullRemainder;
	SQWORD	llQuotient, llRemainder;

	volatile
	QWORD	qwQuotient, qwRemainder;
		// bit-vector of prime numbers: 2**n == prime
	QWORD	qwDividend = 0x28208A20A08A28ACULL;
		// 2**64 / golden ratio
	QWORD	qwDivisor = 0x9E3779B97F4A7C15ULL;

	HANDLE	hThread = GetCurrentThread();
	HANDLE	hOutput = GetStdHandle(STD_OUTPUT_HANDLE);

	if (hOutput == INVALID_HANDLE_VALUE)
		ExitProcess(GetLastError());

	if (SetThreadIdealProcessor(hThread, 0UL) == -1L)
		ExitProcess(GetLastError());

	__cpuid(dwCPUID, 0x80000000UL);

	if (*dwCPUID >= 0x80000004UL)
	{
		__cpuid(dwCPUID, 0x80000002UL);
		__cpuid(dwCPUID + 4, 0x80000003UL);
		__cpuid(dwCPUID + 8, 0x80000004UL);
	}
	else
		strcpy((LPSTR) dwCPUID, "undetermined processor");

	PrintConsole(hOutput, L"\nTesting 64-bit division...\n");

	for (dw = 0L; dw < sizeof(ullVector) / sizeof(*ullVector); dw++)
	{
		PrintConsole(hOutput, L"\r%lu", dw);
#ifndef BUILTIN
		ullQuotient = __udivmoddi4(ullVector[dw].ullDividend, ullVector[dw].ullDivisor, &ullRemainder);
#else
		ullQuotient = ullVector[dw].ullDividend / ullVector[dw].ullDivisor;
		ullRemainder = ullVector[dw].ullDividend % ullVector[dw].ullDivisor;
#endif
		if (ullQuotient != ullVector[dw].ullQuotient)
			PrintConsole(hOutput, L"\t%I64u / %I64u:\a quotient %I64u not equal %I64u\n",
			             ullVector[dw].ullDividend, ullVector[dw].ullDivisor, ullQuotient, ullVector[dw].ullQuotient);

		if (ullQuotient > ullVector[dw].ullDividend)
			PrintConsole(hOutput, L"\t%I64u / %I64u:\a quotient %I64u greater dividend\n",
			             ullVector[dw].ullDividend, ullVector[dw].ullDivisor, ullQuotient);

		if (ullRemainder != ullVector[dw].ullRemainder)
			PrintConsole(hOutput, L"\t%I64u %% %I64u:\a remainder %I64u not equal %I64u\n",
			             ullVector[dw].ullDividend, ullVector[dw].ullDivisor, ullRemainder, ullVector[dw].ullRemainder);

		if (ullRemainder >= ullVector[dw].ullDivisor)
			PrintConsole(hOutput, L"\t%I64u %% %I64u:\a remainder %I64u not less divisor\n",
			             ullVector[dw].ullDividend, ullVector[dw].ullDivisor, ullRemainder);
	}

	PrintConsole(hOutput, L"\nTesting unsigned 64-bit division...\n");

	for (dw = 0L; dw < sizeof(ullVector) / sizeof(*ullVector); dw++)
	{
		PrintConsole(hOutput, L"\r%lu", dw);
#ifndef BUILTIN
		ullQuotient = __udivdi3(ullVector[dw].ullDividend, ullVector[dw].ullDivisor);
#else
		ullQuotient = ullVector[dw].ullDividend / ullVector[dw].ullDivisor;
#endif
		if (ullQuotient != ullVector[dw].ullQuotient)
			PrintConsole(hOutput, L"\t%I64u / %I64u:\a quotient %I64u not equal %I64u\n",
			             ullVector[dw].ullDividend, ullVector[dw].ullDivisor, ullQuotient, ullVector[dw].ullQuotient);

		if (ullQuotient > ullVector[dw].ullDividend)
			PrintConsole(hOutput, L"\t%I64u / %I64u:\a quotient %I64u greater dividend\n",
			             ullVector[dw].ullDividend, ullVector[dw].ullDivisor, ullQuotient);
#ifndef BUILTIN
		ullRemainder = ullVector[dw].ullDividend - __muldi3(ullVector[dw].ullDivisor, ullQuotient);
#else
		ullRemainder = ullVector[dw].ullDividend - ullVector[dw].ullDivisor * ullQuotient;
#endif
		if (ullRemainder != ullVector[dw].ullRemainder)
			PrintConsole(hOutput, L"\t%I64u / %I64u:\a remainder %I64u not equal %I64u\n",
			             ullVector[dw].ullDividend, ullVector[dw].ullDivisor, ullRemainder, ullVector[dw].ullRemainder);
#ifndef BUILTIN
		ullRemainder = __umoddi3(ullVector[dw].ullDividend, ullVector[dw].ullDivisor);
#else
		ullRemainder = ullVector[dw].ullDividend % ullVector[dw].ullDivisor;
#endif
		if (ullRemainder != ullVector[dw].ullRemainder)
			PrintConsole(hOutput, L"\t%I64u %% %I64u:\a remainder %I64u not equal %I64u\n",
			             ullVector[dw].ullDividend, ullVector[dw].ullDivisor, ullRemainder, ullVector[dw].ullRemainder);

		if (ullRemainder >= ullVector[dw].ullDivisor)
			PrintConsole(hOutput, L"\t%I64u %% %I64u:\a remainder %I64u not less divisor\n",
			             ullVector[dw].ullDividend, ullVector[dw].ullDivisor, ullRemainder);
	}

	PrintConsole(hOutput, L"\nTesting signed 64-bit division...\n");

	for (dw = 0L; dw < sizeof(llVector) / sizeof(*llVector); dw++)
	{
		PrintConsole(hOutput, L"\r%lu", dw);
#ifndef BUILTIN
		llQuotient = __divdi3(llVector[dw].llDividend, llVector[dw].llDivisor);
#else
		llQuotient = llVector[dw].llDividend / llVector[dw].llDivisor;
#endif
		if (llQuotient != llVector[dw].llQuotient)
			PrintConsole(hOutput, L"\t%I64d / %I64d:\a quotient %I64d not equal %I64d\n",
			             llVector[dw].llDividend, llVector[dw].llDivisor, llQuotient, llVector[dw].llQuotient);

		if ((llVector[dw].llDividend < 0LL) && (llQuotient < llVector[dw].llDividend)
		 || (llVector[dw].llDividend >= 0LL) && (llQuotient > llVector[dw].llDividend))
			PrintConsole(hOutput, L"\t%I64d / %I64d:\a quotient %I64d greater dividend\n",
			             llVector[dw].llDividend, llVector[dw].llDivisor, llQuotient);
#ifndef BUILTIN
		llRemainder = llVector[dw].llDividend - __muldi3(llVector[dw].llDivisor, llQuotient);
#else
		llRemainder = llVector[dw].llDividend - llVector[dw].llDivisor * llQuotient;
#endif
		if (llRemainder != llVector[dw].llRemainder)
			PrintConsole(hOutput, L"\t%I64d / %I64d:\a remainder %I64d not equal %I64d\n",
			             llVector[dw].llDividend, llVector[dw].llDivisor, llRemainder, llVector[dw].llRemainder);

		if ((llRemainder != 0LL)
		 && ((llRemainder < 0LL) != (llVector[dw].llDividend < 0LL)))
			PrintConsole(hOutput, L"\t%I64d / %I64d:\a sign of remainder %I64d not equal sign of quotient %I64d\n",
			             llVector[dw].llDividend, llVector[dw].llDivisor, llRemainder, llVector[dw].llDividend);
#ifndef BUILTIN
		llRemainder = __moddi3(llVector[dw].llDividend, llVector[dw].llDivisor);
#else
		llRemainder = llVector[dw].llDividend % llVector[dw].llDivisor;
#endif
		if (llRemainder != llVector[dw].llRemainder)
			PrintConsole(hOutput, L"\t%I64d %% %I64d:\a remainder %I64d not equal %I64d\n",
			             llVector[dw].llDividend, llVector[dw].llDivisor, llRemainder, llVector[dw].llRemainder);

		if ((llVector[dw].llDivisor < 0LL) && (llRemainder <= llVector[dw].llDivisor)
		 || (llVector[dw].llDivisor > 0LL) && (llRemainder >= llVector[dw].llDivisor))
			PrintConsole(hOutput, L"\t%I64d %% %I64d:\a remainder %I64d not less divisor\n",
			             llVector[dw].llDividend, llVector[dw].llDivisor, llRemainder);

		if ((llRemainder != 0LL) && ((llRemainder < 0LL) != (llVector[dw].llDividend < 0LL)))
			PrintConsole(hOutput, L"\t%I64d %% %I64d:\a sign of remainder %I64d not equal sign of quotient %I64d\n",
			             llVector[dw].llDividend, llVector[dw].llDivisor, llRemainder, llVector[dw].llDividend);
	}

	PrintConsole(hOutput, L"\nTiming 64-bit division on %.48hs\n", dwCPUID);
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT0))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT0))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from George Marsaglia

		qwDividend ^= qwDividend << 14;
		qwDividend ^= qwDividend >> 31;
		qwDividend ^= qwDividend << 45;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0xAD93D23594C935A9

		qwDividend = (qwDividend << 1)
		           ^ (((SQWORD) qwDividend >> 63) & 0xAD93D23594C935A9ULL);
#endif
#ifndef BUILTIN
		qwQuotient = __unopdi4(qwDividend, qwDivisor, &qwRemainder);
#else
		qwQuotient = _aullnop(qwDividend, qwDivisor);
#endif
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from Richard Peirce Brent

		qwDivisor ^= qwDivisor << 10;
		qwDivisor ^= qwDivisor >> 15;
		qwDivisor ^= qwDivisor << 4;
		qwDivisor ^= qwDivisor >> 13;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0x2B5926535897936B

		qwDivisor = (qwDivisor >> 1)
		          ^ ((0ULL - (qwDivisor & 1ULL)) & 0x2B5926535897936BULL);
#endif
#ifndef BUILTIN
		qwQuotient = __unopdi4(qwDividend, qwDivisor, &qwRemainder);
#else
		qwQuotient = _aullnop(qwDividend, qwDivisor);
#endif
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT1))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT1))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from George Marsaglia

		qwDividend ^= qwDividend << 14;
		qwDividend ^= qwDividend >> 31;
		qwDividend ^= qwDividend << 45;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0xAD93D23594C935A9

		qwDividend = (qwDividend << 1)
		           ^ (((SQWORD) qwDividend >> 63) & 0xAD93D23594C935A9ULL);
#endif
#ifndef BUILTIN
		qwQuotient = __udivmoddi4(qwDividend, qwDivisor, &qwRemainder);
#else
		qwQuotient = qwDividend / qwDivisor;
		qwRemainder = qwDividend % qwDivisor;
#endif
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from Richard Peirce Brent

		qwDivisor ^= qwDivisor << 10;
		qwDivisor ^= qwDivisor >> 15;
		qwDivisor ^= qwDivisor << 4;
		qwDivisor ^= qwDivisor >> 13;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0x2B5926535897936B

		qwDivisor = (qwDivisor >> 1)
		          ^ ((0ULL - (qwDivisor & 1ULL)) & 0x2B5926535897936BULL);
#endif
#ifndef BUILTIN
		qwQuotient = __udivmoddi4(qwDividend, qwDivisor, &qwRemainder);
#else
		qwQuotient = qwDividend / qwDivisor;
		qwRemainder = qwDividend % qwDivisor;
#endif
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT2))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT2))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from George Marsaglia

		qwDividend ^= qwDividend << 14;
		qwDividend ^= qwDividend >> 31;
		qwDividend ^= qwDividend << 45;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0xAD93D23594C935A9

		qwDividend = (qwDividend << 1)
		           ^ (((SQWORD) qwDividend >> 63) & 0xAD93D23594C935A9ULL);
#endif
#ifndef BUILTIN
		qwQuotient = __umuldi3(qwDividend, qwDivisor);
#else
		qwQuotient = qwDividend * qwDivisor;
#endif
#ifdef XORSHIFT
		// 64-bit linear feedback shift register (XorShift form)
		//  using shift constants from Richard Peirce Brent

		qwDivisor ^= qwDivisor << 10;
		qwDivisor ^= qwDivisor >> 15;
		qwDivisor ^= qwDivisor << 4;
		qwDivisor ^= qwDivisor >> 13;
#else
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0x2B5926535897936B

		qwDivisor = (qwDivisor >> 1)
		          ^ ((0ULL - (qwDivisor & 1ULL)) & 0x2B5926535897936BULL);
#endif
#ifndef BUILTIN
		qwQuotient = __umuldi3(qwDividend, qwDivisor);
#else
		qwQuotient = qwDividend * qwDivisor;
#endif
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT3))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT3))
#endif
		ExitProcess(GetLastError());

	qwTz = qwT3 - qwT0;
	qwT3 -= qwT2;
	qwT2 -= qwT1;
	qwT1 -= qwT0;
	qwTy = qwT3 > qwT1 ? qwT3 - qwT1 : 0ULL;
	qwTx = qwT2 - qwT1;
#ifndef BUILTIN
#ifdef CYCLES
	PrintConsole(hOutput, L"\n"
	                      L"__unopdi4()     %6lu.%09lu      0\n"
	                      L"__udivmoddi4()  %6lu.%09lu %6lu.%09lu\n"
	                      L"__umuldi3()     %6lu.%09lu %6lu.%09lu\n"
	                      L"                %6lu.%09lu clock cycles\n",
	             __edivu(qwT1, 1000000000UL), __emodu(qwT1, 1000000000UL),
	             __edivu(qwT2, 1000000000UL), __emodu(qwT2, 1000000000UL),
	             __edivu(qwTx, 1000000000UL), __emodu(qwTx, 1000000000UL),
	             __edivu(qwT3, 1000000000UL), __emodu(qwT3, 1000000000UL),
	             __edivu(qwTy, 1000000000UL), __emodu(qwTy, 1000000000UL),
	             __edivu(qwTz, 1000000000UL), __emodu(qwTz, 1000000000UL));
#else
	PrintConsole(hOutput, L"\n"
	                      L"__unopdi4()     %6lu.%07lu      0\n"
	                      L"__udivmoddi4()  %6lu.%07lu %6lu.%07lu\n"
	                      L"__umuldi3()     %6lu.%07lu %6lu.%07lu\n"
	                      L"                %6lu.%07lu nano-seconds\n",
	             __edivu(qwT1, 10000000UL), __emodu(qwT1, 10000000UL),
	             __edivu(qwT2, 10000000UL), __emodu(qwT2, 10000000UL),
	             __edivu(qwTx, 10000000UL), __emodu(qwTx, 10000000UL),
	             __edivu(qwT3, 10000000UL), __emodu(qwT3, 10000000UL),
	             __edivu(qwTy, 10000000UL), __emodu(qwTy, 10000000UL),
	             __edivu(qwTz, 10000000UL), __emodu(qwTz, 10000000UL));
#endif // CYCLES
#else // BUILTIN
#ifdef CYCLES
	PrintConsole(hOutput, L"\n"
	                      L"_aullnop()      %6lu.%09lu      0\n"
	                      L"_aulldvrm()     %6lu.%09lu %6lu.%09lu\n"
	                      L"_aullmul()      %6lu.%09lu %6lu.%09lu\n"
	                      L"                %6lu.%09lu clock cycles\n",
	             __edivu(qwT1, 1000000000UL), __emodu(qwT1, 1000000000UL),
	             __edivu(qwT2, 1000000000UL), __emodu(qwT2, 1000000000UL),
	             __edivu(qwTx, 1000000000UL), __emodu(qwTx, 1000000000UL),
	             __edivu(qwT3, 1000000000UL), __emodu(qwT3, 1000000000UL),
	             __edivu(qwTy, 1000000000UL), __emodu(qwTy, 1000000000UL),
	             __edivu(qwTz, 1000000000UL), __emodu(qwTz, 1000000000UL));
#else
	PrintConsole(hOutput, L"\n"
	                      L"_aullnop()      %6lu.%07lu      0\n"
	                      L"_aulldvrm()     %6lu.%07lu %6lu.%07lu\n"
	                      L"_aullmul()      %6lu.%07lu %6lu.%07lu\n"
	                      L"                %6lu.%07lu nano-seconds\n",
	             __edivu(qwT1, 10000000UL), __emodu(qwT1, 10000000UL),
	             __edivu(qwT2, 10000000UL), __emodu(qwT2, 10000000UL),
	             __edivu(qwTx, 10000000UL), __emodu(qwTx, 10000000UL),
	             __edivu(qwT3, 10000000UL), __emodu(qwT3, 10000000UL),
	             __edivu(qwTy, 10000000UL), __emodu(qwTy, 10000000UL),
	             __edivu(qwTz, 10000000UL), __emodu(qwTz, 10000000UL));
#endif // CYCLES
#endif // BUILTIN
	ExitProcess(0UL);
}
Save this C source as 64-i386.c in an arbitrary, preferable empty directory, save the 6 32-bit assembly sources shown above as divdi3.asm, moddi3.asm, muldi3.asm, udivdi3.asm, umoddi3.asm and udivmoddi4.asm there too, then start the command prompt of the Windows software development kit for the I386 platform in this directory, run the following command lines to build the benchmark program 64-i386.exe, and execute it:
CL.EXE /c /DCYCLES /GAFy /O2y /W4 /Zl 64-i386.c
ML.EXE /c /DJMPLESS /W3 /X divdi3.asm
ML.EXE /c /DJMPLESS /W3 /X moddi3.asm
ML.EXE /c /DJMPLESS /W3 /X muldi3.asm
ML.EXE /c /DJMPLESS /W3 /X udivdi3.asm
ML.EXE /c /DJMPLESS /W3 /X umoddi3.asm
ML.EXE /c /DJMPLESS /W3 /X udivmoddi4.asm
LINK.EXE /LIB /MACHINE:X86 /NODEFAULTLIB /OUT:64-i386.lib divdi3.obj moddi3.obj muldi3.obj udivdi3.obj umoddi3.obj udivmoddi4.obj
LINK.EXE /DYNAMICBASE /ENTRY:wmainCRTStartup /MACHINE:X86 /NODEFAULTLIB /NXCOMPAT /OPT:REF /OUT:64-i386.exe /SUBSYSTEM:CONSOLE 64-i386.obj 64-i386.lib kernel32.lib user32.lib
.\64-i386.exe
Note: the command lines can be copied and pasted as block into the Command Processor window!

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: the 32-bit program 64-i386.exe is a pure Win32 application and builds without the MSVCRT libraries.

Note: the programs 64-amd.c and 64-i386.c use the same pseudo-random number generators, so the measured times are directly comparable.

Note: the trivial modification of the assembly sources with directives for Unix’ or GNU’s as is left as an exercise to the reader!

Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.

64-i386.c
64-i386.c(396) : warning C4090: 'function' : different 'volatile' qualifiers
64-i386.c(416) : warning C4090: 'function' : different 'volatile' qualifiers
64-i386.c(445) : warning C4090: 'function' : different 'volatile' qualifiers
64-i386.c(466) : warning C4090: 'function' : different 'volatile' qualifiers

Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: divdi3.asm
…
 Assembling: udivmoddi4.asm

Microsoft (R) Library Manager Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

Microsoft (R) Incremental Linker Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

Testing 64-bit division...
42
Testing unsigned 64-bit division...
42
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz

__unopdi4()         13.283718848      0
__udivmoddi4()      49.139274014     35.855555166
__umuldi3()         13.866633564      0.582914716
                    76.289626426 clock cycles
[…]

Testing 64-bit division...
42
Testing unsigned 64-bit division...
42
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM) i5-4670 CPU @ 3.40GHz

__unopdi4()          9.607796773      0
__udivmoddi4()      34.943224391     25.335427618
__umuldi3()          9.956829707      0.349032934
                    54.507850871 clock cycles
[…]

Testing 64-bit division...
42
Testing unsigned 64-bit division...
42
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz

__unopdi4()          8.994161906      0
__udivmoddi4()      28.671893180     19.677731274
__umuldi3()          8.503948208      0.000000000

                    46.170003294 clock cycles
[…]

Testing 64-bit division...
42
Testing unsigned 64-bit division...
42
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz

__unopdi4()          9.221094296      0
__udivmoddi4()      29.328803383     20.107709087
__umuldi3()          8.614981417      0.000000000
                    47.164879096 clock cycles
[…]

Testing 64-bit division...
42
Testing unsigned 64-bit division...
42
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM) i5-9500 CPU @ 3.00GHz

__unopdi4()          7.332281228      0
__udivmoddi4()      23.562549836     16.230268608
__umuldi3()          6.876648697      0.000000000
                    37.771479761 clock cycles
[…]

Testing 64-bit division...
42
Testing unsigned 64-bit division...
42
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 7 2700X Eight-Core Processor

__unopdi4()          9.061527774      0
__udivmoddi4()      29.915904837     20.854377063
__umuldi3()          9.616930973      0.555403199
                    48.594363584 clock cycles
[…]

Testing 64-bit division...
42
Testing unsigned 64-bit division...
42
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 5 3600 6-Core Processor

__unopdi4()          9.132606839      0
__udivmoddi4()      27.095760960     17.963154121
__umuldi3()          7.924676842      0.000000000
                    44.153044641 clock cycles
With the macro BUILTIN defined, 64-i386.c allows to benchmark the (almost undocumented) helper functions _alldiv(), _alldvrm(), _allmul(), _allrem(), _aulldiv(), _aulldvrm() and _aullrem(), which the Visual C compiler calls to perform 64-bit division and multiplication, instead of the functions __udivmoddi4(), __umuldi3() etc.

Fetch their source code blcrtasm.asm and save it in the directory next to 64-i386.c, then run the following command lines to build this program variant as 64-msft.exe and execute it:

COPY NUL: bl.inc
ML.EXE /c /W3 /X blcrtasm.asm
CL.EXE /c /DBUILTIN /DCYCLES /GAFy /O2y /W4 /Zl 64-i386.c
LINK.EXE /DYNAMICBASE /ENTRY:wmainCRTStartup /MACHINE:X86 /NODEFAULTLIB /NXCOMPAT /OPT:REF /OUT:64-msft.exe /SUBSYSTEM:CONSOLE 64-i386.obj blcrtasm.obj kernel32.lib user32.lib
.\64-msft.exe
Note: the command lines can be copied and pasted as block into the Command Processor window!

        1 file(s) copied.

Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: blcrtasm.asm

Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.

64-i386.c
64-i386.c(141) : warning C4100: 'right' : unreferenced formal parameter
64-i386.c(141) : warning C4100: 'left' : unreferenced formal parameter

Microsoft (R) Incremental Linker Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

Testing 64-bit division...
42
Testing unsigned 64-bit division...
42
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz

_aullnop()          13.671386927      0
_aulldvrm()        158.969835392    145.298448465
_aullmul()          25.148575495     11.477188568
                   197.789797814 clock cycles
[…]

Testing 64-bit division...
42
Testing unsigned 64-bit division...
42
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM) i5-4670 CPU @ 3.40GHz

_aullnop()           7.628671451      0
_aulldvrm()        143.788827025    136.160155574
_aullmul()          15.922523706      8.293852255
                   167.340022182 clock cycles
[…]

Testing 64-bit division...
42
Testing unsigned 64-bit division...
42
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz

_aullnop()           7.924113622      0
_aulldvrm()        143.165813624    135.241700002
_aullmul()          14.465450882      6.541337260
                   165.555378128 clock cycles
[…]

Testing 64-bit division...
42
Testing unsigned 64-bit division...
42
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz

_aullnop()           7.336220170      0
_aulldvrm()        134.324100965    126.987880795
_aullmul()          13.599235530      6.263015360
                   155.259556665 clock cycles
[…]

Testing 64-bit division...
42
Testing unsigned 64-bit division...
42
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM) i5-9500 CPU @ 3.00GHz

_aullnop()           5.849818286      0
_aulldvrm()        106.592794907    100.742976621
_aullmul()          10.783351899      4.933533613
                   123.225965092 clock cycles
[…]

Testing 64-bit division...
42
Testing unsigned 64-bit division...
42
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 7 2700X Eight-Core Processor

_aullnop()          10.599235407      0
_aulldvrm()        110.840374287    100.241138880
_aullmul()          16.299852993      5.700617586
                   137.739462687 clock cycles
[…]

Testing 64-bit division...
42
Testing unsigned 64-bit division...
42
Testing signed 64-bit division...
43
Timing 64-bit division on AMD Ryzen 5 3600 6-Core Processor

_aullnop()           6.530378619      0
_aulldvrm()         80.023609792     73.493231173
_aullmul()           9.170997576      2.640618957
                    95.724985987 clock cycles

Contact

If you miss anything here, have additions, comments, corrections, criticism or questions, want to give feedback, hints or tipps, report broken links, bugs, deficiencies, errors, inaccuracies, misrepresentations, omissions, shortcomings, vulnerabilities or weaknesses, …: don’t hesitate to contact me and feel free to ask, comment, criticise, flame, notify or report!

Use the X.509 certificate to send S/MIME encrypted mail.

Note: email in weird format and without a proper sender name is likely to be discarded!

I dislike HTML (and even weirder formats too) in email, I prefer to receive plain text.
I also expect to see your full (real) name as sender, not your nickname.
I abhor top posts and expect inline quotes in replies.

Terms and Conditions

By using this site, you signify your agreement to these terms and conditions. If you do not agree to these terms and conditions, do not use this site!

Data Protection Declaration

This web page records no (personal) data and stores no cookies in the web browser.

The web service is operated and provided by

Telekom Deutschland GmbH
Business Center
D-64306 Darmstadt
Germany
<‍hosting‍@‍telekom‍.‍de‍>
+49 800 5252033

The web service provider stores a session cookie in the web browser and records every visit of this web site with the following data in an access log on their server(s):


Copyright © 1995–2020 • Stefan Kanthak • <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>