Me, myself & IT

Microsoft® Visual C Compiler Helper Routines: Poor and Stupid Implementation

Purpose
Introduction
__chkstk Routine: Implementation in AMD64 Assembler
_alloca Routine
_chkstk Routine: Implementation in i386 Assembler
_allmul Routine: Implementations in i386 Assembler
_alldiv Routine: Implementations in i386 Assembler
_alldvrm Routine: Implementation in i386 Assembler
_allrem Routine: Implementation in i386 Assembler
_aulldiv Routine: Implementation in i386 Assembler
_aulldvrm Routine: Implementation in i386 Assembler
_aullrem Routine: Implementation in i386 Assembler
_aullshr Routine: Implementation in i386 Assembler
_allshl Routine: Implementation in i386 Assembler
_allshr Routine: Implementation in i386 Assembler
Revision History of _all* and _aull* Routines in Leaked Source
Runtime Measurement of _all* and _aull* Routines
_rotl64() and _rotr64() Intrinsic Functions for i386 Platform: Implementation of _allrol() and _allror() Functions in i386 Assembler
_abs64() Intrinsic Function for i386 Platform: Implementation of _allabs() Function in i386 Assembler
64-bit Integer Negation for i386 Platform: Implementation of _allneg() Function in i386 Assembler
64-bit Integer Negation for i386 Platform (Call by Reference)
64-bit Integer Signum for i386 Platform: Implementation of _allsgn() Function in i386 Assembler
64-bit Integer Comparison for i386 Platform: Implementation of _allcmp() and _aullcmp() Functions in i386 Assembler
64-bit Integer Maximum for i386 Platform: Implementation of _allmax() and _aullmax() Functions in i386 Assembler
64-bit Integer Minimum for i386 Platform: Implementation of _allmin() and _aullmin() Functions in i386 Assembler
acos(), asin(), atan(), atan2(), cos(), cosh(), exp(), fmod(), log(), log10(), pow(), sin(), sinh(), sqrt(), tan() and tanh() Standard Functions for i386 Platform
_CI* and _ftol* Routines
memchr() Standard Function for i386 Platform: Naïve Implementation in i386 Assembler; Smart Implementation in i386 Assembler; Implementation with SSE2 Instructions in i386 Assembler; Implementation with SSSE3 Instructions in i386 Assembler; Implementation with AVX Instructions in i386 Assembler; Implementation with AVX2 Instructions in i386 Assembler; Smart Implementation in AMD64 Assembler
mem*() Standard Functions: Implementation in ANSI C; Implementation in i386 Assembler; Implementation in AMD64 Assembler; Inline Implementation of memcpy() and memset() with Intrinsic Functions
strchr() Standard Function for i386 Platform: Implementation in i386 Assembler; Implementation with SSE2 Instructions in i386 Assembler; Implementation with SSSE3 Instructions in i386 Assembler; Implementation with AVX Instructions in i386 Assembler; Implementation with AVX2 Instructions in i386 Assembler
strlen() Standard Function for i386 Platform: Implementation in i386 Assembler; Implementation with SSE2 Instructions in i386 Assembler; Implementation with AVX Instructions in i386 Assembler; Implementation with AVX2 Instructions in i386 Assembler; Implementation in AMD64 Assembler
strrchr() and strstr() Standard Functions for i386 Platform: Implementation with SSE4.2 Instructions in i386 Assembler
str*() Standard Functions: Implementation in ANSI C; Implementation in i386 Assembler; Implementation in AMD64 Assembler
wcs*() Standard Functions: Implementation in i386 Assembler; Implementation in AMD64 Assembler
Thread Local Storage Support: Demonstration in ANSI C; Implementation in i386 Assembler; Implementation in AMD64 Assembler
_load_config_used and __security_check_cookie() Function (/GS Support): Implementation in ANSI C; Implementation in i386 Assembler; Implementation in AMD64 Assembler
Delay Load Support: Implementation in ANSI C
main() and wmain() Support: Falsification in ANSI C; Demonstration in i386 Assembler; Demonstration in AMD64 Assembler; Implementation in ANSI C
Usage Instructions
Footnote: Bloat: Demonstration
Appendix: .CRT Section Usage; .rtc Section Usage

Purpose

Show deficiencies and flaws of the compiler helper routines shipped with Microsoft’s Visual C compiler and linked into Windows’ executable files.

Additionally, present properly written implementations which are several times faster, especially for 64÷64-bit integer division on the i386 platform.

Introduction

For code running on the AMD64 alias x64 processor architecture, the Microsoft Visual C compiler generates calls to the (almost) undocumented helper routine __chkstk for memory allocations on the stack, and to the standard functions memcpy() and memset() for assignment and initialisation of arrays and structures.

For code running on the i386 alias x86 processor architecture, the Microsoft Visual C compiler generates calls to the (almost) undocumented helper routines _alloca and _chkstk for memory allocations on the stack, to the standard functions memcpy() and memset() for assignment and initialisation of arrays and structures, to the (almost) undocumented helper routines _alldiv, _alldvrm, _allmul, _allrem, _allshl and _allshr for signed 64-bit integer division, multiplication and shift operations, to the also (almost) undocumented helper routines _aulldiv, _aulldvrm, _aullrem and _aullshr for unsigned 64-bit integer division and shift operations, and to the helper routines _CIacos, _CIasin, _CIatan, _CIatan2, _CIcos, _CIcosh, _CIexp, _CIfmod, _CIlog, _CIlog10, _CIpow, _CIsin, _CIsinh, _CIsqrt, _CItan, _CItanh and _ftol for transcendental and trigonometric floating-point functions as well as the conversion of floating-point numbers to integers; see the MSDN article Internal CRT globals and functions.

Note: except for the mem*() and str*() standard functions, all helper routines use non-standard calling or naming convention and can’t be called from C or C++ sources by their name!
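
A minimal sketch, assuming a build for the i386 platform: plain C operations on 64-bit integers like the following are the kind for which the compiler emits calls to the helper routines named above. The function names are hypothetical, and which helper (if any) is actually called depends on the compiler version and the optimisation options.

```c
#include <stdint.h>

/* Hypothetical wrappers; the comments name the helper routine that an i386
   build typically calls for the operation, they are not guaranteed. */
int64_t  mul64(int64_t a, int64_t b)    { return a * b; }   /* _allmul  */
int64_t  div64(int64_t a, int64_t b)    { return a / b; }   /* _alldiv  */
int64_t  rem64(int64_t a, int64_t b)    { return a % b; }   /* _allrem  */
uint64_t udiv64(uint64_t a, uint64_t b) { return a / b; }   /* _aulldiv */
uint64_t urem64(uint64_t a, uint64_t b) { return a % b; }   /* _aullrem */
int64_t  shl64(int64_t a, unsigned n)   { return a << n; }  /* _allshl  */
int64_t  sar64(int64_t a, unsigned n)   { return a >> n; }  /* _allshr  */
uint64_t shr64(uint64_t a, unsigned n)  { return a >> n; }  /* _aullshr */
```

On the AMD64 platform the same operations are performed with native 64-bit instructions, which is why the second import library above exports the _all* and _aull* routines while the first does not.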

These routines are provided in the object file chkstk.obj, the object libraries libcmt.lib, libcmtd.lib, msvcrt.lib and msvcrtd.lib, partially also in runtmchk.lib; their i386 assembler sources are provided in the files alloca16.asm, chkstk.asm, lldiv.asm, lldvrm.asm, llmul.asm, llrem.asm, llshl.asm, llshr.asm, ulldiv.asm, ulldvrm.asm, ullrem.asm, ullshr.asm, memchr.asm, memcpy.asm, memset.asm, strchr.asm, strlen.asm etc., and their ANSI C sources are provided in the files strcat.c, strrchr.c, strstr.c etc.

Note: many of these routines are exported from NTDLL.dll.

Import libraries amd64.lib and i386.lib can be generated with the following 2 command lines:

LINK.EXE /LIB /DEF /EXPORT:__C_specific_handler /EXPORT:__chkstk /EXPORT:atoi /EXPORT:atol /EXPORT:isalnum /EXPORT:isalpha /EXPORT:iscntrl /EXPORT:isdigit /EXPORT:isgraph /EXPORT:islower /EXPORT:isprint /EXPORT:ispunct /EXPORT:isspace /EXPORT:isupper /EXPORT:isxdigit /EXPORT:iswalpha /EXPORT:iswctype /EXPORT:iswdigit /EXPORT:iswlower /EXPORT:iswspace /EXPORT:iswxdigit /EXPORT:memchr /EXPORT:memcmp /EXPORT:memcpy /EXPORT:memmove /EXPORT:memset /EXPORT:qsort /EXPORT:strcat /EXPORT:strcat_s /EXPORT:strchr /EXPORT:strcmp /EXPORT:strcpy /EXPORT:strcpy_s /EXPORT:strcspn /EXPORT:strlen /EXPORT:strncat /EXPORT:strncat_s /EXPORT:strncmp /EXPORT:strncpy /EXPORT:strncpy_s /EXPORT:strnlen /EXPORT:strpbrk /EXPORT:strrchr /EXPORT:strspn /EXPORT:strstr /EXPORT:strtok_s /EXPORT:strtol /EXPORT:strtoul /EXPORT:tolower /EXPORT:toupper /EXPORT:towlower /EXPORT:towupper /EXPORT:wcscat /EXPORT:wcscat_s /EXPORT:wcschr /EXPORT:wcscmp /EXPORT:wcscpy /EXPORT:wcscpy_s /EXPORT:wcscspn /EXPORT:wcslen /EXPORT:wcsncat /EXPORT:wcsncat_s /EXPORT:wcsncmp /EXPORT:wcsncpy /EXPORT:wcsncpy_s /EXPORT:wcsnlen /EXPORT:wcspbrk /EXPORT:wcsspn /EXPORT:wcsstr /EXPORT:wcstol /EXPORT:wcstoul /MACHINE:AMD64 /NAME:NTDLL /NODEFAULTLIB /OUT:amd64.lib
LINK.EXE /LIB /DEF /EXPORT:_CIcos /EXPORT:_CIlog /EXPORT:_CIpow /EXPORT:_CIsin /EXPORT:_CIsqrt /EXPORT:_alldiv /EXPORT:_alldvrm /EXPORT:_allmul /EXPORT:_alloca_probe /EXPORT:_alloca_probe_8 /EXPORT:_alloca_probe_16 /EXPORT:_allrem /EXPORT:_allshl /EXPORT:_allshr /EXPORT:_aulldiv /EXPORT:_aulldvrm /EXPORT:_aullrem /EXPORT:_aullshr /EXPORT:_chkstk /EXPORT:_fltused /EXPORT:_ftol /EXPORT:atoi /EXPORT:atol /EXPORT:isalnum /EXPORT:isalpha /EXPORT:iscntrl /EXPORT:isdigit /EXPORT:isgraph /EXPORT:islower /EXPORT:isprint /EXPORT:ispunct /EXPORT:isspace /EXPORT:isupper /EXPORT:isxdigit /EXPORT:iswalpha /EXPORT:iswctype /EXPORT:iswdigit /EXPORT:iswlower /EXPORT:iswspace /EXPORT:iswxdigit /EXPORT:memchr /EXPORT:memcmp /EXPORT:memcpy /EXPORT:memmove /EXPORT:memset /EXPORT:qsort /EXPORT:strcat /EXPORT:strcat_s /EXPORT:strchr /EXPORT:strcmp /EXPORT:strcpy /EXPORT:strcpy_s /EXPORT:strcspn /EXPORT:strlen /EXPORT:strncat /EXPORT:strncat_s /EXPORT:strncmp /EXPORT:strncpy /EXPORT:strncpy_s /EXPORT:strnlen /EXPORT:strpbrk /EXPORT:strrchr /EXPORT:strspn /EXPORT:strstr /EXPORT:strtok_s /EXPORT:strtol /EXPORT:strtoul /EXPORT:tolower /EXPORT:toupper /EXPORT:towlower /EXPORT:towupper /EXPORT:wcscat /EXPORT:wcscat_s /EXPORT:wcschr /EXPORT:wcscmp /EXPORT:wcscpy /EXPORT:wcscpy_s /EXPORT:wcscspn /EXPORT:wcslen /EXPORT:wcsncat /EXPORT:wcsncat_s /EXPORT:wcsncmp /EXPORT:wcsncpy /EXPORT:wcsncpy_s /EXPORT:wcsnlen /EXPORT:wcspbrk /EXPORT:wcsspn /EXPORT:wcsstr /EXPORT:wcstol /EXPORT:wcstoul /MACHINE:I386 /NAME:NTDLL /NODEFAULTLIB /OUT:i386.lib

Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

   Creating library amd64.lib and object amd64.exp

Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

   Creating library i386.lib and object i386.exp

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

`__chkstk` Routine

The documentation for the __chkstk routine states:

Called by the compiler when you have more than one page of local variables in your function.
Remarks
__chkstk Routine is a helper routine for the C compiler. For x86 compilers, __chkstk Routine is called when the local variables exceed 4K bytes; for x64 compilers it is 8K.
This function is not defined in an SDK header and must be declared by the caller. This function is exported from kernelbase.dll.

OUCH¹: contrary to the first highlighted statement (8K for x64 compilers), the correct number for AMD64 alias x64 processors is 4k too; 8k was used only by compilers for the IA-64 alias Itanium® processor!

OUCH²: contrary to the second highlighted statement (that the function must be declared by the caller), which is complete and dangerous nonsense, this routine uses a non-standard calling convention; it must not be declared and cannot be called from C or C++ sources by its name!

The MSDN article x64 Prolog and Epilog specifies:

The __chkstk helper will not modify any registers other than R10, R11, and the condition codes. In particular, it will return RAX unchanged and leave all nonvolatile registers and argument-passing registers unmodified.

Note: the Visual C compiler calls it through the _alloca() intrinsic function, using register RAX for its argument.
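
As an illustration, assuming the default threshold of one page: a function whose local variables exceed 4096 bytes, like the hypothetical sum_squares() below, is the kind of function for which the compiler emits a call to __chkstk in its prologue. The code itself is plain C and makes no call by name; the probing happens behind the programmer's back.

```c
#include <stddef.h>

#define COUNT 2048  /* 2048 * 4 bytes = 8192 bytes of locals, i.e. more than one page */

/* Hypothetical example function: its stack frame spans more than one page,
   so the pages have to be probed in descending order before they are used. */
unsigned sum_squares(void)
{
    unsigned table[COUNT];
    unsigned sum = 0;
    size_t   i;

    for (i = 0; i < COUNT; i++)
        table[i] = (unsigned) (i * i);
    for (i = 0; i < COUNT; i++)
        sum += table[i];
    return sum;
}
```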

Start the command prompt of the Visual C development environment for the AMD64 platform, then execute the following 3 command lines to locate the object file chkstk.obj and display its disassembly:

FOR %? IN (chkstk.obj) DO SET chkstk=%~$LIB:?
DIR "%chkstk%"
LINK.EXE /DUMP /DISASM "%chkstk%"

SET chkstk=C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\lib\amd64\chkstk.obj

 Volume in drive C has no label.
 Volume Serial Number is 1957-0427

 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\lib\amd64

02/18/2011  03:08 PM             1,922 chkstk.obj
               1 File(s)          1,922 bytes
               0 Dir(s)    9,876,543,210 bytes free

Microsoft (R) COFF/PE Dumper Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.


Dump of file C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\lib\amd64\chkstk.obj

File Type: COFF OBJECT

$$000000:
  0000000000000000: CC                 int         3
  0000000000000001: CC                 int         3
  0000000000000002: CC                 int         3
  0000000000000003: CC                 int         3
  0000000000000004: CC                 int         3
  0000000000000005: CC                 int         3
  0000000000000006: 66 66 0F 1F 84 00  nop         word ptr [rax+rax+00000000h]
                    00 00 00 00
__chkstk:
  0000000000000010: 48 83 EC 10        sub         rsp,10h                        ; deleted
  0000000000000014: 4C 89 14 24        mov         qword ptr [rsp],r10            ; deleted
  0000000000000018: 4C 89 5C 24 08     mov         qword ptr [rsp+8],r11          ; deleted
  000000000000001D: 4D 33 DB           xor         r11,r11
  0000000000000020: 4C 8D 54 24 18     lea         r10,[rsp+18h]                  ; deleted
  0000000000000020: 4C 8D 54 24 08     lea         r10,[rsp+8]                    ; inserted
  0000000000000025: 4C 2B D0           sub         r10,rax
  0000000000000028: 4D 0F 42 D3        cmovb       r10,r11
  000000000000002C: 65 4C 8B 1C 25 10  mov         r11,qword ptr gs:[00000010h]   ; deleted
                    00 00 00
  000000000000002C: 65 4D 8B 5B 10     mov         r11,qword ptr gs:[r11+10h]     ; inserted
  0000000000000035: 4D 3B D3           cmp         r10,r11
  0000000000000038: 73 16              jae         cs20
  000000000000003A: 66 41 81 E2 00 F0  and         r10w,0F000h                    ; deleted
cs10:
  0000000000000040: 4D 8D 9B 00 F0 FF  lea         r11,[r11+FFFFF000h]
                    FF
  0000000000000047: 41 C6 03 00        mov         byte ptr [r11],0               ; deleted
  0000000000000047: 4D 85 1B           test        qword ptr [r11],r11            ; inserted
  000000000000004B: 4D 3B D3           cmp         r10,r11
  000000000000004E: 75 F0              jne         cs10                           ; deleted
  000000000000004E: 72 F0              jnae        cs10                           ; inserted
cs20:
  0000000000000050: 4C 8B 14 24        mov         r10,qword ptr [rsp]            ; deleted
  0000000000000054: 4C 8B 5C 24 08     mov         r11,qword ptr [rsp+8]          ; deleted
  0000000000000059: 48 83 C4 10        add         rsp,10h                        ; deleted
  000000000000005D: C3                 ret

  Summary

           0 .data
         3A8 .debug$S
          70 .debug$T
           C .pdata
          5E .text
           8 .xdata

19 (plus 7) instructions in 78 (plus 16) bytes.

OUCH¹: the __chkstk routine saves and restores the volatile registers R10 and R11 without necessity, and very clumsily too; instead of using 2 PUSH plus 2 POP instructions with just 8 bytes, it decrements respectively increments the stack pointer with SUB and ADD instructions and writes respectively reads the stack with 4 MOV instructions, wasting 26 bytes!

OUCH²: replacing the deleted JNE instruction at address 0x4E with the inserted JNAE alias JB instruction makes the deleted AND instruction at address 0x3A superfluous and saves 6 bytes!

OUCH³: replacing the deleted MOV instruction at address 0x47 with the inserted TEST instruction avoids a superfluous memory write and saves 1 byte!

Note: 7 of the 19 instructions and 37 of the total 78 code bytes are completely superfluous, they only waste memory, processor cycles – and every user’s time!

Oops: replacing the deleted MOV instruction at address 0x2C with the inserted one saves 4 more bytes!

Implementation in AMD64 Assembler

A proper implementation uses 10 instructions in only 38 bytes:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

; * The software is provided "as is" without any warranty, neither express
;   nor implied.
; * In no event will the author be held liable for any damage(s) arising
;   from the use of the software.
; * Redistribution of the software is allowed only in unmodified form.
; * Permission is granted to use the software solely for personal private
;   and non-commercial purposes.
; * An individuals use of the software in his or her capacity or function
;   as an agent, (independent) contractor, employee, member or officer of
;   a business, corporation or organization (commercial or non-commercial)
;   does not qualify as personal private and non-commercial purpose.
; * Without written approval from the author the software must not be used
;   for a business, for commercial, corporate, governmental, military or
;   organizational purposes of any kind, or in a commercial, corporate,
;   governmental, military or organizational environment of any kind.

_nt_tib	struct	8		; thread information block
chain	qword	?		; address of first exception registration record
base	qword	?		; stack base
limit	qword	?		; stack limit
	qword	?		; address of subsystem thread information block
fiber	qword	?		; fiber data
pointer	qword	?		; arbitrary user pointer
self	qword	?		; address of _nt_tib
_nt_tib	ends

	.code

; MSC internal intrinsic _alloca() alias __chkstk():
; receives argument in rax

; NOTE: _alloca() must preserve rax and all argument registers;
;       it can raise 'stack overflow' exception!

;;	alias	<_alloca> = <__chkstk>

__chkstk proc	public		; qword __chkstk(qword size)

	mov	r10, gs:[_nt_tib.limit]
				; r10 = (current) stack limit
	lea	r11, [rsp+8]	; r11 = stack pointer of caller
	sub	r11, rax	; r11 = new stack pointer
	jnb	short limit
overflow:
	xor	r11, r11	; r11 = 0
probe:
	sub	r10, 4096	; r10 = address of guard page
	test	r10, [r10]	; r10 = new stack limit via 'guard page' exception
limit:
	cmp	r10, r11
	ja	short probe	; stack limit > new stack pointer?

	ret

__chkstk endp
	end

Save the AMD64 assembler source presented above as chkstk.asm in an arbitrary, preferably empty directory, then execute the following 3 command lines to generate the object file chkstk.obj and put it into the new object library amd64.lib:

SET ML=/c /W3 /X
ML64.EXE chkstk.asm
LINK.EXE /LIB /OUT:amd64.lib chkstk.obj

For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: the command lines can be copied and pasted as block into a Command Processor window!

Microsoft (R) Macro Assembler (x64) Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: chkstk.asm

Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

`_alloca` Routine

The documentation for the _alloca compiler helper routine states:

Allocates memory on the stack. […]
void *_alloca(
   size_t size
);
[…]
The _alloca routine returns a void pointer to the allocated space, which is suitably aligned for storage of any type of object. If size is 0, _alloca allocates a zero-length item and returns a valid pointer to that item.
A stack overflow exception is generated if the space can't be allocated. […]

OUCH: the _alloca_probe alias _chkstk routine returns but an unaligned pointer; only the _alloca_probe_8 and _alloca_probe_16 routines return an aligned pointer, the first suitable to store MMX™ variables, the second suitable to store SSE variables!

CAVEAT: for constant arguments less than 64, the Visual C compiler generates calls to the _alloca_probe routine!

Start the command prompt of the Visual C development environment for the i386 platform, then execute the following 4 command lines to locate the assembler source file alloca16.asm and display its content:

FOR %? IN (msvcrt.lib) DO SET msvcrt=%~$LIB:?
SET source=%msvcrt:\lib\msvcrt.lib=\crt\src%
DIR "%source%\intel\alloca16.asm"
TYPE "%source%\intel\alloca16.asm"

SET msvcrt=C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\lib\msvcrt.lib
SET source=C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src

 Volume in drive C has no label.
 Volume Serial Number is 1957-0427

 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel

02/18/2011  03:40 PM             2,241 alloca16.asm
               1 File(s)          2,241 bytes
               0 Dir(s)    9,876,543,210 bytes free

        page    ,132
        title   alloca16 - aligned C stack checking routine
;***
;chkstk.asm - aligned C stack checking routine
;
;       Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
;       Provides 16 and 8 bit aligned alloca routines.
;
;*******************************************************************************

.xlist
        include cruntime.inc
.list

extern  _chkstk:near

; size of a page of memory

        CODESEG

page
;***
; _alloca_probe_16, _alloca_probe_8 - align allocation to 16/8 byte boundary
;
;Purpose:
;       Adjust allocation size so the ESP returned from chkstk will be aligned
;       to 16/8 bit boundary. Call chkstk to do the real allocation.
;
;Entry:
;       EAX = size of local frame
;
;Exit:
;       Adjusted EAX.
;
;Uses:
;       EAX
;
;*******************************************************************************

public  _alloca_probe_8

_alloca_probe_16 proc                   ; 16 byte aligned alloca

        push    ecx
        lea     ecx, [esp] + 8          ; TOS before entering this function
        sub     ecx, eax                ; New TOS
        and     ecx, (16 - 1)           ; Distance from 16 bit align (align down)
        add     eax, ecx                ; Increase allocation size
        sbb     ecx, ecx                ; ecx = 0xFFFFFFFF if size wrapped around
        or      eax, ecx                ; cap allocation size on wraparound
        pop     ecx                     ; Restore ecx
        jmp     _chkstk

alloca_8:                               ; 8 byte aligned alloca
_alloca_probe_8 = alloca_8

        push    ecx
        lea     ecx, [esp] + 8          ; TOS before entering this function
        sub     ecx, eax                ; New TOS
        and     ecx, (8 - 1)            ; Distance from 8 bit align (align down)
        add     eax, ecx                ; Increase allocation Size
        sbb     ecx, ecx                ; ecx = 0xFFFFFFFF if size wrapped around
        or      eax, ecx                ; cap allocation size on wraparound
        pop     ecx                     ; Restore ecx
        jmp     _chkstk

_alloca_probe_16 endp

        end

18 instructions in 44 bytes (plus 4 bytes for alignment).

Oops: since both routines are contained in one (linkable) function, they occupy 48 bytes instead of 32 bytes; together with the referenced _chkstk routine they occupy 96 bytes.
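
The size adjustment performed by both routines before they jump to _chkstk can be modelled in C. This is a sketch only: the function name adjust_size() and its parameters are hypothetical, tos stands for the caller's stack pointer (the value computed by the LEA instruction), and 32-bit unsigned arithmetic stands in for the registers.

```c
#include <stdint.h>

/* Model of the size adjustment in _alloca_probe_8 and _alloca_probe_16:
   tos is the caller's stack pointer, size the requested byte count,
   align is 8 or 16. */
uint32_t adjust_size(uint32_t tos, uint32_t size, uint32_t align)
{
    uint32_t slack = (tos - size) & (align - 1); /* distance of new TOS above the boundary */
    uint32_t sum   = size + slack;               /* increase allocation size */

    if (sum < size)                              /* carry: size wrapped around */
        sum = UINT32_MAX;                        /* cap allocation size (SBB + OR) */
    return sum;
}
```

With tos = 0x0012FF7C and size = 100 the adjusted size is 108, so the new top of stack 0x0012FF10 is a multiple of 16.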

`_chkstk` Routine

The documentation for the _chkstk compiler helper routine states:

_chkstk Routine is a helper routine for the C compiler. For x86 compilers, _chkstk Routine is called when the local variables exceed 4K bytes; for x64 compilers it is 8K.

OUCH: contrary to the highlighted statement (8K for x64 compilers), the correct number for x64 alias AMD64 processors is 4096 too; 8192 was used only by compilers for IA-64 alias Itanium® processors!

The documentation for the /Gs compiler option states:

A stack probe is a sequence of code that the compiler inserts at the beginning of a function call. When initiated, a stack probe reaches benignly into memory by the amount of space required to store the function's local variables. This probe causes the operating system to transparently page in more stack memory if necessary, before the rest of the function runs.
By default, the compiler generates code that initiates a stack probe when a function requires more than one page of stack space. This default is equivalent to a compiler option of /Gs4096 for x86, x64, ARM, and ARM64 platforms. This value allows an application and the Windows memory manager to increase the amount of memory committed to the program stack dynamically at run time.

Execute the following 2 command lines to display the content of the assembler source file chkstk.asm shipped with the Visual C compiler:

DIR "%source%\intel\chkstk.asm"
TYPE "%source%\intel\chkstk.asm"

 Volume in drive C has no label.
 Volume Serial Number is 1957-0427

 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel

02/18/2011  03:40 PM             3,465 chkstk.asm
               1 File(s)          3,465 bytes
               0 Dir(s)    9,876,543,210 bytes free

        page    ,132
        title   chkstk - C stack checking routine
;***
;chkstk.asm - C stack checking routine
;
;       Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
;       Provides support for automatic stack checking in C procedures
;       when stack checking is enabled.
;
;*******************************************************************************

.xlist
        include cruntime.inc
.list

; size of a page of memory

_PAGESIZE_      equ     1000h


        CODESEG

page
;***
;_chkstk - check stack upon procedure entry
;
;Purpose:
;       Provide stack checking on procedure entry. Method is to simply probe
;       each page of memory required for the stack in descending order. This
;       causes the necessary pages of memory to be allocated via the guard
;       page scheme, if possible. In the event of failure, the OS raises the
;       _XCPT_UNABLE_TO_GROW_STACK exception.
;
;       NOTE:  Currently, the (EAX < _PAGESIZE_) code path falls through
;       to the "lastpage" label of the (EAX >= _PAGESIZE_) code path.  This
;       is small; a minor speed optimization would be to special case
;       this up top.  This would avoid the painful save/restore of
;       ecx and would shorten the code path by 4-6 instructions.
;
;Entry:
;       EAX = size of local frame
;
;Exit:
;       ESP = new stackframe, if successful
;
;Uses:
;       EAX
;
;Exceptions:
;       _XCPT_GUARD_PAGE_VIOLATION - May be raised on a page probe. NEVER TRAP
;                                    THIS!!!! It is used by the OS to grow the
;                                    stack on demand.
;       _XCPT_UNABLE_TO_GROW_STACK - The stack cannot be grown. More precisely,
;                                    the attempt by the OS memory manager to
;                                    allocate another guard page in response
;                                    to a _XCPT_GUARD_PAGE_VIOLATION has
;                                    failed.
;
;*******************************************************************************

public  _alloca_probe

_chkstk proc

_alloca_probe    =  _chkstk

        push    ecx

; Calculate new TOS.

        lea     ecx, [esp] + 8 - 4      ; TOS before entering function + size for ret value
        sub     ecx, eax                ; new TOS

; Handle allocation size that results in wraparound.
; Wraparound will result in StackOverflow exception.

        cmc                             ; (inserted)
        sbb     eax, eax                ; 0 if CF==0, ~0 if CF==1
        not     eax                     ; (deleted) ~0 if TOS did not wrapped around, 0 otherwise
        and     ecx, eax                ; set to 0 if wraparound

        mov     eax, esp                ; current TOS
        and     eax, not ( _PAGESIZE_ - 1) ; Round down to current page boundary

cs10:
        cmp     ecx, eax                ; Is new TOS
        jb      short cs20              ; in probed page?
        mov     eax, ecx                ; yes.
        pop     ecx
        xchg    esp, eax                ; update esp
        mov     eax, dword ptr [eax]    ; (deleted) get return address
        mov     dword ptr [esp], eax    ; (deleted) and put it at new TOS
        push    [eax]                   ; (inserted)
        ret
        ret

; Find next lower page and probe
cs20:
        sub     eax, _PAGESIZE_         ; decrease by PAGESIZE
        test    dword ptr [eax],eax     ; probe page.
        jmp     short cs10

_chkstk endp

        end

19 instructions in 43 bytes (plus 5 bytes for alignment).

Oops¹: every programmer should really know that two’s-complement binary arithmetic exhibits the identity −value = not (value − 1)!

Oops²: instead of the deleted NOT instruction the CMC instruction inserted before the SBB instruction should be used, saving 1 byte.
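
The identity and the resulting wraparound mask can be checked with a few lines of C. This is a sketch with hypothetical function names; the variable borrow models the carry flag after the SUB instruction, and 32-bit unsigned arithmetic stands in for the registers.

```c
#include <stdint.h>

/* Two's-complement identity from the text: -value == not (value - 1) */
int identity_holds(uint32_t value)
{
    return (0u - value) == ~(value - 1u);
}

/* New top of stack, clamped to 0 when tos - size wraps around */
uint32_t new_tos(uint32_t tos, uint32_t size)
{
    uint32_t borrow = tos < size;   /* CF after SUB ecx, eax */
    uint32_t mask   = borrow - 1u;  /* SBB + NOT as well as CMC + SBB: */
                                    /* 0 on borrow, ~0 otherwise */
    return (tos - size) & mask;
}
```

Both instruction sequences yield the same mask: SBB followed by NOT computes not (−borrow), CMC followed by SBB computes −(1 − borrow), and both equal borrow − 1.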

OOPS: on Pentium® and later processors, instead of the 2 deleted MOV instructions the single inserted PUSH instruction should be used, saving 4 bytes; the term - 4 of the initial LEA instruction must then be removed to account for the additional 4 bytes pushed onto the stack!

OUCH: if the new TOS is within an already allocated stack page, this stupid implementation nevertheless performs superfluous page probes, i.e. superfluous slow memory accesses, loading stale data into the cache hierarchy in the best case and triggering page faults transferring stale data into memory in the worst case!
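
The difference can be made visible with a small C model that only counts probes. This is a sketch under stated assumptions: both function names are hypothetical, the addresses are made-up examples, and the stack limit is passed as a parameter where a real routine would have to read it from the thread information block.

```c
#include <stdint.h>

#define PAGE 4096u

/* Probes issued by the shipped routine: it starts at the page of the
   current stack pointer, whether or not the pages below are allocated. */
unsigned probes_from_esp(uint32_t esp, uint32_t tos)
{
    uint32_t page  = esp & ~(PAGE - 1u);
    unsigned count = 0;

    while (tos < page) {        /* new TOS not yet in probed page */
        page -= PAGE;
        count++;
    }
    return count;
}

/* Probes needed when starting at the (assumed known) current stack limit */
unsigned probes_from_limit(uint32_t limit, uint32_t tos)
{
    unsigned count = 0;

    while (limit > tos) {       /* stack limit above new TOS */
        limit -= PAGE;
        count++;
    }
    return count;
}
```

For a stack pointer of 0x0012FF78 and a new TOS of 0x0012AF7C the first function counts 5 probes; with a stack limit of 0x00128000, i.e. with all these pages already allocated, the second function counts none.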

FOR %? IN (chkstk.obj) DO SET chkstk=%~$LIB:?
DIR "%chkstk%"
LINK.EXE /DUMP /DISASM "%chkstk%"

SET chkstk=C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\lib\chkstk.obj

 Volume in drive C has no label.
 Volume Serial Number is 1957-0427

 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\lib

02/18/2011  03:52 PM             1,377 chkstk.obj
               1 File(s)          1,377 bytes
               0 Dir(s)    9,876,543,210 bytes free

Microsoft (R) COFF/PE Dumper Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.


Dump of file C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\lib\chkstk.obj

File Type: COFF OBJECT

__chkstk:
  00000000: 51                 push        ecx
  00000001: 8D 4C 24 04        lea         ecx,[esp+4]
  00000005: 2B C8              sub         ecx,eax
  00000007: 1B C0              sbb         eax,eax
  00000009: F7 D0              not         eax
  0000000B: 23 C8              and         ecx,eax
  0000000D: 8B C4              mov         eax,esp
  0000000F: 25 00 F0 FF FF     and         eax,0FFFFF000h
cs10:
  00000014: 3B C8              cmp         ecx,eax
  00000016: 72 0A              jb          cs20
  00000018: 8B C1              mov         eax,ecx
  0000001A: 59                 pop         ecx
  0000001B: 94                 xchg        eax,esp
  0000001C: 8B 00              mov         eax,dword ptr [eax]
  0000001E: 89 04 24           mov         dword ptr [esp],eax
  00000021: C3                 ret
cs20:
  00000022: 2D 00 10 00 00     sub         eax,1000h
  00000027: 85 00              test        dword ptr [eax],eax
  00000029: EB E9              jmp         cs10

  Summary

           0 .data
         2EC .debug$S
          24 .debug$T
          2B .text

Implementation in i386 Assembler

A proper implementation queries the thread environment block alias thread information block to avoid superfluous memory accesses, and of course incorporates the alignment with just a single extraneous SHL instruction; it uses 17 instructions in 40 bytes (plus 8 bytes for alignment) when the text macro ALLOCA is undefined, and 18 instructions in 43 bytes (plus 5 bytes for alignment) when the text macro ALLOCA is defined as 8 or 16:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.386
	.model	flat, C

_nt_tib	struct	4		; thread information block
chain	dword	?		; address of first exception registration record
base	dword	?		; stack base
limit	dword	?		; stack limit
	dword	?		; address of subsystem thread information block
fiber	dword	?		; fiber data
pointer	dword	?		; arbitrary user pointer
self	dword	?		; address of _nt_tib
_nt_tib	ends

	.code

; MSC internal intrinsic _alloca() alias _chkstk():
; receives argument in eax, returns result in esp

; NOTE: _alloca() must preserve all registers except eax;
;       it can raise 'stack overflow' exception!

ifndef ALLOCA
_chkstk	proc	public		; void _chkstk(dword size)
_alloca_probe proc	public	; void _alloca_probe(dword size)
elseifidn ALLOCA, %8
_alloca_probe_8 proc	public	; void _alloca_probe_8(dword size)
elseifidn ALLOCA, %16
_alloca_probe_16 proc	public	; void _alloca_probe_16(dword size)
endif
	push	ebx		; decrement stack pointer, save register
	lea	ebx, [esp+8]	; ebx = stack pointer of caller
	sub	ebx, eax	; ebx = new (unaligned) stack pointer
	cmc			; CF = ~(ebx < 0)
	sbb	eax, eax	; eax = (ebx < 0) ? 0 : -1
ifndef ALLOCA
elseifidn ALLOCA, %8
	shl	eax, 3		; eax = (ebx < 0) ? 0 : -8
elseifidn ALLOCA, %16
	shl	eax, 4		; eax = (ebx < 0) ? 0 : -16
endif
	and	eax, ebx	; eax = (ebx < 0) ? 0 : new (aligned) stack pointer
	assume	fs :flat
	mov	ebx, fs:[_nt_tib.limit]
				; ebx = (current) stack limit
	cmp	ebx, eax
	jna	short ready	; stack limit <= new stack pointer?
probe:
	sub	ebx, 4096	; ebx = address of guard page
	test	ebx, [ebx]	; ebx = new stack limit via 'guard page' exception
	cmp	ebx, eax
	ja	short probe	; new stack limit > new stack pointer?
ready:
	pop	ebx		; restore register
	xchg	eax, esp	; esp = new stack pointer,
				; eax = old stack pointer
				;     = address of return address
	push	[eax]		; decrement stack pointer, write return address
	ret

ifndef ALLOCA
_alloca_probe endp
_chkstk	endp
elseifidn ALLOCA, %8
_alloca_probe_8 endp
elseifidn ALLOCA, %16
_alloca_probe_16 endp
else
	echo	ALLOCA must be 8 or 16 when defined!
	.err	ALLOCA
endif
	end

Save the i386 assembler source presented above as alloca.asm in an arbitrary, preferably empty directory, then execute the following 5 command lines to generate the 3 object files alloca.obj, alloca8.obj plus alloca16.obj and put them into the new object library i386.lib:

SET ML=/c /safeseh /W3 /X
ML.EXE alloca.asm
ML.EXE /DALLOCA=8 /Foalloca8.obj alloca.asm
ML.EXE /DALLOCA=16 /Foalloca16.obj alloca.asm
LINK.EXE /LIB /OUT:i386.lib alloca.obj alloca8.obj alloca16.obj

For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: the command lines can be copied and pasted as a block into a Command Processor window!

Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: alloca.asm

Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: alloca.asm

Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: alloca.asm

Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.
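
The probing logic of the alloca.asm listing above can be modelled in C. The following is only a sketch under stated assumptions, not the routine itself: the function name alloca_model and its parameters are hypothetical, the stack limit is passed as a plain variable instead of being read from the thread information block, and lowering the limit by one page stands in for the 'guard page' exception handled by the operating system.

```c
#include <stdint.h>

#define PAGE_SIZE 4096u                 /* size of a stack (guard) page on i386 */

/* Hypothetical model of the probe logic; not part of any library.
 * caller_sp: stack pointer of the caller (ESP + 8 after PUSH EBX)
 * size:      requested number of bytes (argument in EAX)
 * align:     0 (ALLOCA undefined), 8 or 16
 * limit:     in/out, models the stack limit held in the thread information block
 * probes:    in/out, counts the pages touched below the old stack limit
 * Returns the new (aligned) stack pointer, or 0 if caller_sp - size underflows. */
static uint32_t alloca_model(uint32_t caller_sp, uint32_t size, unsigned align,
                             uint32_t *limit, unsigned *probes)
{
    uint32_t sp   = caller_sp - size;                /* sub ebx, eax           */
    uint32_t mask = (caller_sp < size) ? 0u : ~0u;   /* cmc; sbb eax, eax      */

    if (align == 8)  mask <<= 3;                     /* shl eax, 3             */
    if (align == 16) mask <<= 4;                     /* shl eax, 4             */
    sp &= mask;                                      /* and eax, ebx           */

    while (*limit > sp) {                            /* cmp ebx, eax; jna / ja */
        *limit -= PAGE_SIZE;                         /* sub ebx, 4096          */
        ++*probes;                                   /* test ebx, [ebx]        */
    }
    return sp;
}
```

With a caller stack pointer of 0x00130000, a stack limit of 0x0012F000 and a request for 0x2345 bytes aligned to 16, the model touches 2 pages and yields the new stack pointer 0x0012DCB0.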

`_allmul` Routine

The documentation for the _allmul compiler helper routine states:

Multiplies two LONGLONG or ULONGLONG integers. For example, to multiply two int64 values the compiler might generate a call to the _allmul routine.
Remarks
The _allmul routine is a helper routine for the C compiler. Whether the compiler uses _allmul is completely dependent on the optimization set.
This routine is used only on x86 platforms.

OUCH: contrary to the statement in the remarks that the use of _allmul depends on the optimisation set, the Visual C compiler generates calls to the _allmul routine unconditionally, independent of any optimisation, when it encounters a multiplication where at least one of its operands is a (signed or unsigned) 64-bit integer!

Execute the following 2 command lines to display the content of the assembler source file llmul.asm shipped with the Visual C compiler:

DIR "%source%\intel\llmul.asm"
TYPE "%source%\intel\llmul.asm"

 Volume in drive C has no label.
 Volume Serial Number is 1957-0427

 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel

02/18/2011  03:40 PM             2,570 llmul.asm
               1 File(s)          2,570 bytes
               0 Dir(s)    9,876,543,210 bytes free

        title   llmul - long multiply routine
;***
;llmul.asm - long multiply routine
;
;       Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
;       Defines long multiply routine
;       Both signed and unsigned routines are the same, since multiply's
;       work out the same in 2's complement
;       creates the following routine:
;           __allmul
;
;*******************************************************************************


.xlist
include cruntime.inc
include mm.inc
.list

;***
;llmul - long multiply routine
;
;Purpose:
;       Does a long multiply (same for signed/unsigned)
;       Parameters are not changed.
;
;Entry:
;       Parameters are passed on the stack:
;               1st pushed: multiplier (QWORD)
;               2nd pushed: multiplicand (QWORD)
;
;Exit:
;       EDX:EAX - product of multiplier and multiplicand
;       NOTE: parameters are removed from the stack
;
;Uses:
;       ECX
;
;Exceptions:
;
;*******************************************************************************

        CODESEG

_allmul PROC NEAR
.FPO (0, 4, 0, 0, 0, 0)

A       EQU     [esp + 4]       ; stack address of a
B       EQU     [esp + 12]      ; stack address of b

;
;       AHI, BHI : upper 32 bits of A and B
;       ALO, BLO : lower 32 bits of A and B
;
;             ALO * BLO
;       ALO * BHI
; +     BLO * AHI
; ---------------------
;

        mov     eax,HIWORD(A)
        mov     ecx,HIWORD(B)
        or      ecx,eax         ;test for both hiwords zero.
        mov     ecx,LOWORD(B)
        jnz     short hard      ;both are zero, just mult ALO and BLO

        mov     eax,LOWORD(A)
        mul     ecx

        ret     16              ; callee restores the stack

hard:
        push    ebx
.FPO (1, 4, 0, 0, 0, 0)

; must redefine A and B since esp has been altered

A2      EQU     [esp + 8]       ; stack address of a
B2      EQU     [esp + 16]      ; stack address of b

        mul     ecx             ;eax has AHI, ecx has BLO, so AHI * BLO
        mov     ebx,eax         ;save result

        mov     eax,LOWORD(A2)
        mul     dword ptr HIWORD(B2) ;ALO * BHI
        add     ebx,eax         ;ebx = ((ALO * BHI) + (AHI * BLO))

        mov     eax,LOWORD(A2)  ;ecx = BLO
        mul     ecx             ;so edx:eax = ALO*BLO
        add     edx,ebx         ;now edx has all the LO*HI stuff

        pop     ebx

        ret     16              ; callee restores the stack

_allmul ENDP

        end

19 instructions in 52 bytes (plus 12 bytes for alignment).

Ouch¹: since only the low parts of the products of the low and high parts of the arguments are needed, the 2 widening MUL instructions which compute AHI * BLO and ALO * BHI should be replaced with 2 faster IMUL instructions!
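
Why truncating multiplications suffice for the mixed terms can be illustrated with a small C model. It is a sketch only; the function name allmul_model is hypothetical, and a C compiler is of course free to translate it differently from the assembler routines discussed here.

```c
#include <stdint.h>

/* Hypothetical model of _allmul: product modulo 2**64, built from 32-bit halves */
static uint64_t allmul_model(uint64_t a, uint64_t b)
{
    uint32_t alo = (uint32_t)a, ahi = (uint32_t)(a >> 32);
    uint32_t blo = (uint32_t)b, bhi = (uint32_t)(b >> 32);

    /* the mixed terms are shifted left by 32 bits afterwards, so only their
       low 32 bits can reach the result: 2 truncating multiplications (IMUL) */
    uint32_t mixed = ahi * blo + alo * bhi;

    /* only the product of the low halves needs all 64 bits: 1 widening MUL */
    uint64_t low = (uint64_t)alo * blo;

    /* AHI * BHI would be shifted left by 64 bits and vanishes completely */
    return low + ((uint64_t)mixed << 32);
}
```

Since the arithmetic is performed modulo 2**64, the same model covers signed and unsigned arguments alike, e.g. allmul_model((uint64_t)-5, 7) yields (uint64_t)-35.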

Ouch²: on processors featuring speculative execution, i.e. Pentium® Pro (introduced November 1, 1995) and newer, which execute 2 IMUL or MUL instructions faster than a mispredicted conditional branch, the test whether the high parts of both arguments are 0 is superfluous and impairs performance!

Implementations in i386 Assembler

A proper implementation for processors which don’t feature speculative execution uses 14 instructions in 44 bytes (plus 4 bytes for alignment) if the text macro SPACE is undefined, else 12 instructions in 37 bytes (plus 11 bytes for alignment), but no non-volatile register:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.386
	.model	flat, C
	.code

; MSC internal _allmul():
; receives arguments on stack, returns product modulo 2**64 in edx:eax

_allmul	proc	public		; sqword _allmul(sqword multiplicand, sqword multiplier)

	mov	ecx, [esp+16]	; ecx = high dword of multiplier
	mov	edx, [esp+8]	; edx = high dword of multiplicand
	mov	eax, [esp+4]	; eax = low dword of multiplicand
	or	ecx, edx
ifdef SPACE
	jz	short zero	; high dwords are 0?
else ; SPACE
	jnz	short notzero	; high dwords are not 0?

	mul	dword ptr [esp+12]
				; edx:eax = low dword of multiplicand
				;         * low dword of multiplier

	ret	16		; callee restores stack
notzero:
endif ; SPACE
	imul	edx, [esp+12]	; edx = high dword of multiplicand
				;     * low dword of multiplier
	mov	ecx, [esp+16]	; ecx = high dword of multiplier
	imul	ecx, eax	; ecx = high dword of multiplier
				;     * low dword of multiplicand
	add	ecx, edx	; ecx = high dword of multiplier
				;     * low dword of multiplicand
				;     + high dword of multiplicand
				;     * low dword of multiplier
zero:
	mul	dword ptr [esp+12]
				; edx:eax = low dword of multiplicand
				;         * low dword of multiplier
	add	edx, ecx	; edx:eax = product % 2**64

	ret	16		; callee restores stack

_allmul	endp
	end

A proper implementation for processors which feature speculative execution uses 9 instructions in 29 bytes (plus 3 bytes for alignment) without conditional branch instead of the 19 instructions in 52 bytes used by Microsoft’s poor implementation:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.386
	.model	flat, C
	.code

; MSC internal _allmul():
; receives arguments on stack, returns product modulo 2**64 in edx:eax

_allmul	proc	public		; sqword _allmul(sqword multiplicand, sqword multiplier)

	mov	eax, [esp+4]	; eax = low dword of multiplicand
	mov	edx, [esp+8]	; edx = high dword of multiplicand
	imul	edx, [esp+12]	; edx = high dword of multiplicand
				;     * low dword of multiplier
	mov	ecx, [esp+16]	; ecx = high dword of multiplier
	imul	ecx, eax	; ecx = high dword of multiplier
				;     * low dword of multiplicand
	add	ecx, edx	; ecx = high dword of multiplier
				;     * low dword of multiplicand
				;     + high dword of multiplicand
				;     * low dword of multiplier
	mul	dword ptr [esp+12]
				; edx:eax = low dword of multiplicand
				;         * low dword of multiplier
	add	edx, ecx	; edx:eax = product % 2**64

	ret	16		; callee restores stack

_allmul	endp
	end

Save the i386 assembler source presented above as allmul.asm in the directory where you created the object library i386.lib before, then execute the following 2 command lines to generate the object file allmul.obj and add it to the existing object library i386.lib:

ML.EXE allmul.asm
LINK.EXE /LIB /OUT:i386.lib i386.lib allmul.obj

For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: the command lines can be copied and pasted as a block into a Command Processor window!

Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: allmul.asm

Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

`_alldiv` Routine

The documentation for the _alldiv compiler helper routine states:

Divides two LONGLONG integers. For example, to divide two int64 values the compiler might generate a call to _alldiv Routine.
Remarks
_alldiv Routine is a helper routine for the C compiler. Whether the compiler calls _alldiv Routine is completely dependent on the optimization set.

OUCH: contrary to the statement in the remarks that calling _alldiv depends on the optimisation set, the Visual C compiler generates calls to the _alldiv routine unconditionally, independent of any optimisation, when it encounters a division where at least one of its operands is a signed 64-bit integer!

Execute the following 2 command lines to display the content of the assembler source file lldiv.asm shipped with the Visual C compiler:

DIR "%source%\intel\lldiv.asm"
TYPE "%source%\intel\lldiv.asm"

 Volume in drive C has no label.
 Volume Serial Number is 1957-0427

 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel

02/18/2011  03:40 PM             6,670 lldiv.asm
               1 File(s)          6,670 bytes
               0 Dir(s)    9,876,543,210 bytes free

        title   lldiv - signed long divide routine
;***
;lldiv.asm - signed long divide routine
;
;       Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
;       defines the signed long divide routine
;           __alldiv
;
;*******************************************************************************


.xlist
include cruntime.inc
include mm.inc
.list

;***
;lldiv - signed long divide
;
;Purpose:
;       Does a signed long divide of the arguments.  Arguments are
;       not changed.
;
;Entry:
;       Arguments are passed on the stack:
;               1st pushed: divisor (QWORD)
;               2nd pushed: dividend (QWORD)
;
;Exit:
;       EDX:EAX contains the quotient (dividend/divisor)
;       NOTE: this routine removes the parameters from the stack.
;
;Uses:
;       ECX
;
;Exceptions:
;
;*******************************************************************************

        CODESEG

_alldiv PROC NEAR
.FPO (3, 4, 0, 0, 0, 0)

        push    edi
        push    esi
        push    ebx

; Set up the local stack and save the index registers.  When this is done
; the stack frame will look as follows (assuming that the expression a/b will
; generate a call to lldiv(a, b)):
;
;               -----------------
;               |               |
;               |---------------|
;               |               |
;               |--divisor (b)--|
;               |               |
;               |---------------|
;               |               |
;               |--dividend (a)-|
;               |               |
;               |---------------|
;               | return addr** |
;               |---------------|
;               |      EDI      |
;               |---------------|
;               |      ESI      |
;               |---------------|
;       ESP---->|      EBX      |
;               -----------------
;

DVND    equ     [esp + 16]      ; stack address of dividend (a)
DVSR    equ     [esp + 24]      ; stack address of divisor (b)
DVND    equ     [esp + 12]
DVSR    equ     [esp + 20]

; Determine sign of the result (edi = 0 if result is positive, non-zero
; otherwise) and make operands positive.

        xor     edi,edi         ; result sign assumed positive

        mov     eax,HIWORD(DVND) ; hi word of a
        or      eax,eax         ; test to see if signed
        test    eax,eax
        jge     short L1        ; skip rest if a is already positive
        inc     edi             ; complement result sign flag
        mov     edx,LOWORD(DVND) ; lo word of a
        neg     eax             ; make a positive
        neg     edx
        sbb     eax,0
        mov     HIWORD(DVND),eax ; save positive value
        mov     LOWORD(DVND),edx
L1:
        mov     eax,HIWORD(DVSR) ; hi word of b
        or      eax,eax         ; test to see if signed
        test    eax,eax
        jge     short L2        ; skip rest if b is already positive
        inc     edi             ; complement the result sign flag
        mov     edx,LOWORD(DVSR) ; lo word of a
        neg     eax             ; make b positive
        neg     edx
        sbb     eax,0
        mov     HIWORD(DVSR),eax ; save positive value
        mov     LOWORD(DVSR),edx
L2:

;
; Now do the divide.  First look to see if the divisor is less than 4194304K.
; If so, then we can use a simple algorithm with word divides, otherwise
; things get a little more complex.
;
; NOTE - eax currently contains the high order word of DVSR
;

        or      eax,eax         ; check to see if divisor < 4194304K
        test    eax,eax
        jnz     short L3        ; nope, gotta do this the hard way
        mov     ecx,LOWORD(DVSR) ; load divisor
        mov     eax,HIWORD(DVND) ; load high word of dividend
        xor     edx,edx
        div     ecx             ; eax <- high order bits of quotient
        mov     ebx,eax         ; save high bits of quotient
        mov     eax,LOWORD(DVND) ; edx:eax <- remainder:lo word of dividend
        div     ecx             ; eax <- low order bits of quotient
        mov     edx,ebx         ; edx:eax <- quotient
        jmp     short L4        ; set sign, restore stack and return

;
; Here we do it the hard way.  Remember, eax contains the high word of DVSR
;

L3:
        mov     ebx,eax         ; ebx:ecx <- divisor
        mov     ecx,LOWORD(DVSR)
        mov     edx,HIWORD(DVND) ; edx:eax <- dividend
        mov     eax,LOWORD(DVND)
L5:
        shr     ebx,1           ; shift divisor right one bit
        rcr     ecx,1
        shr     edx,1           ; shift dividend right one bit
        rcr     eax,1
        or      ebx,ebx
        test    ebx,ebx
        jnz     short L5        ; loop until divisor < 4194304K
        div     ecx             ; now divide, ignore remainder
        mov     esi,eax         ; save quotient
        mov     ebx,eax

;
; We may be off by one, so to check, we will multiply the quotient
; by the divisor and check the result against the orignal dividend
; Note that we must also check for overflow, which can occur if the
; dividend is close to 2**64 and the quotient is off by 1.
;

        mul     dword ptr HIWORD(DVSR) ; QUOT * HIWORD(DVSR)
        mov     ecx,eax
        mov     ecx,HIWORD(DVSR)
        imul    ecx,ebx
        mov     eax,LOWORD(DVSR)
        mul     esi             ; QUOT * LOWORD(DVSR)
        mul     ebx
        add     edx,ecx         ; EDX:EAX = QUOT * DVSR
        jc      short L6        ; carry means Quotient is off by 1

;
; do long compare here between original dividend and the result of the
; multiply in edx:eax.  If original is larger or equal, we are ok, otherwise
; subtract one (1) from the quotient.
;

        cmp     edx,HIWORD(DVND) ; compare hi words of result and original
        ja      short L6        ; if result > original, do subtract
        jb      short L7        ; if result < original, we are ok
        cmp     eax,LOWORD(DVND) ; hi words are equal, compare lo words
        sbb     edx,HIWORD(DVND)
        jbe     short L7        ; if less or equal we are ok, else subtract
L6:
        dec     esi             ; subtract 1 from quotient
        dec     ebx
L7:
        xor     edx,edx         ; edx:eax <- quotient
        mov     eax,esi
        mov     eax,ebx
;
; Just the cleanup left to do.  edx:eax contains the quotient.  Set the sign
; according to the save value, cleanup the stack, and return.
;

L4:
        dec     edi             ; check to see if result is negative
        jnz     short L8        ; if EDI == 0, result should be negative
        neg     edx             ; otherwise, negate the result
        neg     eax
        sbb     edx,0

;
; Restore the saved registers and return.
;

L8:
        pop     ebx
        pop     esi
        pop     edi

        ret     16

_alldiv ENDP

end

With 70 instructions in 170 bytes (plus 6 bytes for alignment), this routine has several major and minor flaws: 3 major flaws on all kinds of processors, and 4 more only on processors which feature speculative execution!

OOPS¹: instead of the 4 OR instructions (OR EAX, EAX after loading the high part of the dividend, after loading the high part of the divisor and after label L2:, plus OR EBX, EBX inside the loop after label L5:), which perform superfluous writes to their destination register, the TEST instructions listed directly after each of them should be used.

OOPS²: instead of the first widening MUL instruction after the loop, MUL DWORD PTR HIWORD(DVSR), and the MOV ECX, EAX instruction following it, a MOV instruction which loads the high part of the divisor into register ECX, followed by the faster IMUL ECX, EBX instruction, should be used.

OUCH¹: instead of register ESI, register EBX should be used, saving a pair of PUSH and POP instructions and 2 bytes!

OUCH²: for divisors less than 2³² and a dividend less than 2³²×divisor, i.e. if the quotient is less than 2³², a single DIV instruction is sufficient instead of the long alias schoolbook division performed with the 2 chained DIV ECX instructions (each slower than a mispredicted conditional branch) which follow the conditional branch to label L3:, saving about 40 to 240 processor cycles!
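
The distinction between a single and 2 chained divisions can be sketched in C. The helper names div32 and udiv64by32 are hypothetical; div32 merely models the DIV instruction, which requires the high dword of its dividend to be less than the divisor, and the counter only shows how many DIV instructions an assembler routine built this way would execute.

```c
#include <stdint.h>

/* Model of the i386 DIV instruction: divides hi:lo by d; requires hi < d
   (else the real instruction raises an exception) and d != 0 */
static uint32_t div32(uint32_t hi, uint32_t lo, uint32_t d, uint32_t *rem)
{
    uint64_t n = ((uint64_t)hi << 32) | lo;

    *rem = (uint32_t)(n % d);
    return (uint32_t)(n / d);
}

/* Hypothetical helper: unsigned division of a 64-bit dividend by a 32-bit
   divisor; *divs counts the DIV instructions that would be executed */
static uint64_t udiv64by32(uint64_t n, uint32_t d, unsigned *divs)
{
    uint32_t hi = (uint32_t)(n >> 32), lo = (uint32_t)n;
    uint32_t qhi = 0, qlo, rem;

    if (hi >= d) {                      /* quotient >= 2**32: long division */
        qhi = div32(0, hi, d, &hi);     /* hi = remainder, now less than d  */
        ++*divs;
    }
    qlo = div32(hi, lo, d, &rem);       /* quotient fits in 32 bits         */
    ++*divs;
    return ((uint64_t)qhi << 32) | qlo;
}
```

For the dividend 0x0000000500000000 and the divisor 7 the model performs 1 division, for the dividend 0xFEDCBA9876543210 and the same divisor it performs 2.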

OUCH³: instead of the (brain)dead slow loop after label L5:, which shifts divisor and dividend by just a single bit per pass with 2 pairs of SHR and RCR instructions, 2 pairs of SHRD and SHR instructions with their shift count determined by a BSR instruction should be used!

Note: this BSR instruction would also replace the OR instruction, respectively the TEST instruction substituted for it, after label L2:.
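
That the shift count can be determined up front is easily checked with a C model. This is a sketch under assumptions: the names bsr32, normalise_loop and normalise_direct are hypothetical, bsr32 is only a portable stand-in for the BSR instruction (and is itself written as a loop, which the real instruction is not), and plain 64-bit shifts take the place of the SHRD and SHR pairs.

```c
#include <stdint.h>

/* Portable stand-in for BSR: index of the most significant '1' bit, x != 0 */
static unsigned bsr32(uint32_t x)
{
    unsigned index = 0;

    while (x >>= 1)
        index++;
    return index;
}

/* Model of the loop after label L5: shift both operands right by 1 bit
   until the high dword of the divisor is 0; requires divisor >= 2**32 */
static unsigned normalise_loop(uint64_t *dividend, uint64_t *divisor)
{
    unsigned count = 0;

    do {
        *divisor  >>= 1;
        *dividend >>= 1;
        count++;
    } while ((*divisor >> 32) != 0);
    return count;
}

/* Same result with a precomputed shift count; requires divisor >= 2**32 */
static unsigned normalise_direct(uint64_t *dividend, uint64_t *divisor)
{
    unsigned count = bsr32((uint32_t)(*divisor >> 32)) + 1;    /* 1 to 32 */

    *divisor  >>= count;
    *dividend >>= count;
    return count;
}
```

Both functions leave the same operands behind and return the same count; the loop needs up to 32 passes where the direct variant needs none.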

OUCH⁴: on processors which feature speculative execution, instead of the 3 conditional branches to the labels L1:, L2: and L8:, which are slow when mispredicted, and the NEG plus SBB instructions following them to negate the arguments as well as the quotient, a branchless and thus faster instruction sequence should be used!
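
A branchless treatment of the signs can be sketched in C as follows. The names sign_mask, negate_if and alldiv_model are hypothetical; the model assumes a divisor other than 0 and relies on the conversion from uint64_t to int64_t wrapping around, which is implementation-defined in C before C23 but is what common compilers do. Whether a C compiler emits branchless code for it is not guaranteed.

```c
#include <stdint.h>

/* 0 for x >= 0, all bits set for x < 0 (what CDQ or SAR r32, 31 produce) */
static uint64_t sign_mask(int64_t x)
{
    return 0 - ((uint64_t)x >> 63);
}

/* Negates value if mask has all bits set, leaves it unchanged if mask is 0
   (the XOR, XOR, SUB, SBB sequence) */
static uint64_t negate_if(uint64_t value, uint64_t mask)
{
    return (value ^ mask) - mask;
}

/* Hypothetical model of a signed 64-bit division without any sign branch;
   requires divisor != 0 */
static int64_t alldiv_model(int64_t dividend, int64_t divisor)
{
    uint64_t sn = sign_mask(dividend);
    uint64_t sd = sign_mask(divisor);
    uint64_t n  = negate_if((uint64_t)dividend, sn);   /* |dividend| */
    uint64_t d  = negate_if((uint64_t)divisor, sd);    /* |divisor|  */
    uint64_t q  = n / d;                               /* |quotient| */

    return (int64_t)negate_if(q, sn ^ sd);             /* apply sign */
}
```

Like the assembler routines, the model raises no overflow for -2**63 / -1: the magnitude 2**63 of the quotient is simply reinterpreted as -2**63.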

OUCH⁵: on processors which feature speculative execution, instead of the 2 CMP instructions and the 3 conditional branches before label L6:, which are slow when mispredicted, a faster instruction sequence with fewer or no conditional branches should be used!
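
The correction of the estimated quotient can likewise be expressed without conditional branches. The following C sketch is an illustration under assumptions, not a transcription of any of the assembler routines: the name adjust_quotient is hypothetical, the estimate is assumed to be either exact or too large by 1, and the overflow of the product is detected here by computing its bits 64 and up explicitly.

```c
#include <stdint.h>

/* Hypothetical helper: corrects an estimated quotient q_est which is either
   the true quotient of dividend / divisor or exceeds it by 1 */
static uint32_t adjust_quotient(uint64_t dividend, uint64_t divisor, uint32_t q_est)
{
    /* low dword of divisor * estimate: widening multiplication */
    uint64_t lo = (uint64_t)(uint32_t)divisor * q_est;

    /* bits 32 and up of divisor * estimate, at most 64 bits wide */
    uint64_t hi = (divisor >> 32) * q_est + (lo >> 32);

    /* divisor * estimate modulo 2**64 */
    uint64_t product = (hi << 32) | (uint32_t)lo;

    /* the estimate is too large if the product overflows 64 bits
       or exceeds the dividend, i.e. if the remainder would be negative */
    unsigned overflow  = (hi >> 32) != 0;
    unsigned too_large = overflow | (dividend < product);

    return q_est - too_large;
}
```

The 2 conditions are combined with a bitwise OR and subtracted from the estimate, corresponding to an SBB EAX, EAX followed by an ADD in assembler; the overflow check mirrors the conditional branch on carry to label L6: in the listing above.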

Note: with the modifications shown in the source, this routine has 66 instructions in 164 bytes (plus 12 bytes for alignment).

Implementations in i386 Assembler

A proper (and several times faster) implementation for processors which don’t feature speculative execution, minimising the number of actually executed instructions via conditional branches, uses 88 instructions in 208 bytes, including 13 instructions in 27 bytes for the special and trivial cases not covered by Microsoft’s poor implementation:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.386
	.model	flat, C
	.code

; MSC internal _alldiv():
; receives arguments on stack, returns quotient in edx:eax

; NOTE: _alldiv() can raise 'division by zero' exception; it does
;       not raise 'integer overflow' exception on quotient overflow,
;       but returns ±2**63 for -2**63 / -1!

_alldiv	proc	public		; sqword _alldiv(sqword dividend, sqword divisor)

	xor	ecx, ecx	; ecx = sign of quotient = 0

	; determine sign of dividend and compute |dividend|

	mov	edx, [esp+8]	; edx = high dword of dividend
	test	edx, edx
	jns	short @f	; (high dword of) dividend >= 0?

	mov	eax, [esp+4]	; eax = low dword of dividend
	neg	edx
	neg	eax
	sbb	edx, ecx	; edx:eax = -dividend = |dividend|
	mov	[esp+4], eax
	mov	[esp+8], edx	; write |dividend| back on stack

	dec	ecx		; ecx = sign of dividend = -1
@@:
	; determine sign of divisor and compute |divisor|

	mov	eax, [esp+12]
	mov	edx, [esp+16]	; edx:eax = divisor
	test	edx, edx
	jns	short @f	; (high dword of) divisor >= 0?

	neg	edx
	neg	eax
	sbb	edx, 0		; edx:eax = -divisor = |divisor|
	mov	[esp+12], eax
	mov	[esp+16], edx	; write |divisor| back on stack

	not	ecx		; ecx = sign of dividend
				;     ^ sign of divisor
				;     = sign of quotient
@@:
	push	ecx		; [esp] = (quotient < 0) ? -1 : 0

	mov	ecx, [esp+12]	; ecx = high dword of dividend
	cmp	[esp+8], eax
	sbb	ecx, edx
	jb	short trivial	; dividend < divisor?

	bsr	ecx, edx	; ecx = index of most significant '1' bit
				;        in high dword of divisor
	jnz	short extended	; high dword of divisor <> 0?

	; remainder < divisor < 2**32

	mov	ecx, eax	; ecx = (low dword of) divisor
	mov	eax, [esp+12]	; eax = high dword of dividend
	cmp	eax, ecx
	jae	short long	; high dword of dividend >= divisor?

	; perform normal division
normal:
	mov	edx, eax	; edx = high dword of dividend
	mov	eax, [esp+8]	; edx:eax = dividend
	div	ecx		; eax = (low dword of) quotient,
				; edx = (low dword of) remainder
;;	xor	edx, edx	; edx:eax = |quotient|

	jmp	short quotient

	; perform "long" alias "schoolbook" division
long:
;;	xor	edx, edx	; edx:eax = high dword of dividend
	div	ecx		; eax = high dword of quotient,
				; edx = high dword of remainder'
	push	eax		; [esp] = high dword of quotient
	mov	eax, [esp+12]	; eax = low dword of dividend
	div	ecx		; eax = low dword of quotient,
				; edx = (low dword of) remainder
	pop	edx		; edx:eax = |quotient|

	pop	ecx		; ecx = (quotient < 0) ? -1 : 0
	xor	eax, ecx
	xor	edx, ecx
	sub	eax, ecx
	sbb	edx, ecx	; edx:eax = quotient

	ret	16		; callee restores stack

	; dividend < divisor: quotient = 0
trivial:
	pop	ecx		; ecx = (quotient < 0) ? -1 : 0
	xor	eax, eax
	cdq			; edx:eax = quotient = 0

	ret	16		; callee restores stack

	; 2**63 >= dividend >= divisor >= 2**32: quotient < 2**32
extended:
	xor	ecx, 31		; ecx = number of leading '0' bits
				;        in (high dword of) divisor
	jz	short special	; divisor = 2**63?

	; perform "extended & adjusted" division

	shld	edx, eax, cl	; edx = divisor / 2**(index + 1)
				;     = divisor'
;;	shl	eax, cl
	push	ebx
	mov	ebx, edx	; ebx = divisor'
	xor	eax, eax	; eax = high dword of quotient' = 0
	mov	edx, [esp+16]	; edx = high dword of dividend
	cmp	edx, ebx
	jb	short @f	; high dword of dividend < divisor'?

	; high dword of dividend >= divisor':
	; subtract divisor' from high dword of dividend to prevent possible
	; division overflow and set most significant bit of quotient"

	sub	edx, ebx	; edx = high dword of dividend - divisor'
				;     = high dword of dividend'
	inc	eax		; eax = high dword of quotient' = 1
@@:
	push	eax		; [esp] = high dword of quotient'
	mov	eax, [esp+16]	; eax = low dword of dividend
				;     = low dword of dividend'
	div	ebx		; eax = dividend' / divisor'
				;     = low dword of quotient',
				; edx = remainder'
	pop	ebx		; ebx = high dword of quotient'
	shld	ebx, eax, cl	; ebx = quotient' / 2**(index + 1)
				;     = dividend / divisor'
				;     = quotient"
;;	shl	eax, cl
	mov	eax, [esp+20]	; eax = low dword of divisor
	mul	ebx		; edx:eax = low dword of divisor * quotient"
	mov	ecx, [esp+24]	; ecx = high dword of divisor
	imul	ecx, ebx	; ecx = high dword of divisor * quotient"
	add	edx, ecx	; edx:eax = divisor * quotient"
	jc	short @f	; divisor * quotient" >= 2**64?

	mov	ecx, [esp+16]	; ecx = high dword of dividend
	cmp	[esp+12], eax
	sbb	ecx, edx	; CF = (dividend < divisor * quotient")
				;    = (remainder" < 0)
@@:
	sbb	eax, eax	; eax = (quotient < quotient") ? -1 : 0
	add	eax, ebx	; eax = quotient" - (remainder" < 0)
				;     = (low dword of) |quotient|
;;	xor	edx, edx	; edx:eax = |quotient|
	pop	ebx
quotient:
	pop	edx		; edx = (quotient < 0) ? -1 : 0
	xor	eax, edx
	sub	eax, edx
	sbb	edx, edx	; edx:eax = quotient

	ret	16		; callee restores stack

	; dividend = divisor = -2**63: quotient = 1
special:
	pop	eax		; eax = sign of quotient = 0
	inc	eax		; eax = (low dword of) quotient = 1
	cdq			; edx:eax = quotient = 1

	ret	16		; callee restores stack

_alldiv	endp
	end

A proper (and several times faster) implementation for processors which feature speculative execution, minimising the number of (mispredictable) conditional branches, uses 86 instructions in 208 bytes, including 13 instructions in 29 bytes for the special and trivial cases not covered by Microsoft’s poor implementation:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.386
	.model	flat, C
	.code

; MSC internal _alldiv():
; receives arguments on stack, returns quotient in edx:eax

; NOTE: _alldiv() can raise 'division by zero' exception; it does
;       not raise 'integer overflow' exception on quotient overflow,
;       but returns ±2**63 for -2**63 / -1!

_alldiv	proc	public		; sqword _alldiv(sqword dividend, sqword divisor)

	; determine sign of dividend and compute |dividend|

	mov	eax, [esp+8]
	mov	ecx, [esp+4]	; eax:ecx = dividend

	cdq			; edx = (dividend < 0) ? -1 : 0
	xor	ecx, edx
	xor	eax, edx	; eax:ecx = (dividend < 0) ? ~dividend : dividend
	sub	ecx, edx
	sbb	eax, edx	; eax:ecx = (dividend < 0) ? -dividend : dividend
				;         = |dividend|

	mov	[esp+4], ecx	; write |dividend| back on stack
	mov	[esp+8], eax

	push	edx		; [esp] = (dividend < 0) ? -1 : 0

	; determine sign of divisor and compute |divisor|

	mov	edx, [esp+20]
	mov	eax, [esp+16]	; edx:eax = divisor

	mov	ecx, edx
	sar	ecx, 31		; ecx = (divisor < 0) ? -1 : 0
	xor	eax, ecx
	xor	edx, ecx	; edx:eax = (divisor < 0) ? ~divisor : divisor
	sub	eax, ecx
	sbb	edx, ecx	; edx:eax = (divisor < 0) ? -divisor : divisor
				;         = |divisor|

	mov	[esp+16], eax	; write |divisor| back on stack
	mov	[esp+20], edx

	xor	[esp], ecx	; [esp] = (dividend < 0) ^ (divisor < 0) ? -1 : 0
				;       = (quotient < 0) ? -1 : 0

	mov	ecx, [esp+12]	; ecx = high dword of dividend
	cmp	[esp+8], eax
	sbb	ecx, edx
	jb	short trivial	; dividend < divisor?

	bsr	ecx, edx	; ecx = index of most significant '1' bit
				;        in high dword of divisor
	jnz	short extended	; high dword of divisor <> 0?

	; remainder < divisor < 2**32

	mov	ecx, eax	; ecx = (low dword of) divisor
	mov	eax, [esp+12]	; eax = high dword of dividend
	cmp	eax, ecx
	jae	short long	; high dword of dividend >= divisor?

	; perform normal division
normal:
	mov	edx, eax	; edx = high dword of dividend
	mov	eax, [esp+8]	; edx:eax = dividend
	div	ecx		; eax = (low dword of) quotient,
				; edx = (low dword of) remainder
;;	xor	edx, edx	; edx:eax = |quotient|

	jmp	short quotient

	; perform "long" alias "schoolbook" division
long:
;;	xor	edx, edx	; edx:eax = high dword of dividend
	div	ecx		; eax = high dword of quotient,
				; edx = high dword of remainder'
	push	eax		; [esp] = high dword of quotient
	mov	eax, [esp+12]	; eax = low dword of dividend
	div	ecx		; eax = low dword of quotient,
				; edx = (low dword of) remainder
	pop	edx		; edx:eax = |quotient|

	pop	ecx		; ecx = (quotient < 0) ? -1 : 0
	xor	eax, ecx
	xor	edx, ecx
	sub	eax, ecx
	sbb	edx, ecx	; edx:eax = quotient

	ret	16		; callee restores stack

	; dividend < divisor: quotient = 0
trivial:
	pop	ecx		; ecx = (quotient < 0) ? -1 : 0
	xor	eax, eax
	xor	edx, edx	; edx:eax = quotient = 0

	ret	16		; callee restores stack

	; 2**63 >= dividend >= divisor >= 2**32: quotient < 2**32
extended:
	xor	ecx, 31		; ecx = number of leading '0' bits
				;        in (high dword of) divisor
	jz	short special	; divisor = 2**63?

	; perform "extended & adjusted" division

	shld	edx, eax, cl	; edx = divisor / 2**(index + 1)
				;     = divisor'
;;	shl	eax, cl
	push	ebx
	mov	ebx, edx	; ebx = divisor'
ifndef JCCLESS
	xor	eax, eax	; eax = high dword of quotient' = 0
	mov	edx, [esp+16]	; edx = high dword of dividend
	cmp	edx, ebx
	jb	short @f	; high dword of dividend < divisor'?

	; high dword of dividend >= divisor':
	; subtract divisor' from high dword of dividend to prevent possible
	; division overflow and set most significant bit of quotient"

	sub	edx, ebx	; edx = high dword of dividend - divisor'
				;     = high dword of dividend'
	inc	eax		; eax = high dword of quotient' = 1
@@:
	push	eax		; [esp] = high dword of quotient'
else ; JCCLESS
	mov	edx, [esp+16]	; edx = high dword of dividend
	cmp	edx, ebx	; CF = (high dword of dividend < divisor')
	sbb	eax, eax	; eax = (high dword of dividend < divisor') ? -1 : 0
	inc	eax		; eax = (high dword of dividend < divisor') ? 0 : 1
				;     = high dword of quotient'
	push	eax		; [esp] = high dword of quotient'
if 0
	neg	eax		; eax = (high dword of dividend < divisor') ? 0 : -1
	and	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
else
	imul	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
endif
	sub	edx, eax	; edx = high dword of dividend
				;     - (high dword of dividend < divisor') ? 0 : divisor'
				;     = high dword of dividend'
endif ; JCCLESS
	mov	eax, [esp+16]	; eax = low dword of dividend
				;     = low dword of dividend'
	div	ebx		; eax = dividend' / divisor'
				;     = low dword of quotient',
				; edx = remainder'
	pop	ebx		; ebx = high dword of quotient'
	shld	ebx, eax, cl	; ebx = quotient' / 2**(index + 1)
				;     = dividend / divisor'
				;     = quotient"
;;	shl	eax, cl
	mov	eax, [esp+20]	; eax = low dword of divisor
	mul	ebx		; edx:eax = low dword of divisor * quotient"
	mov	ecx, [esp+24]	; ecx = high dword of divisor
	imul	ecx, ebx	; ecx = high dword of divisor * quotient"
	add	edx, ecx	; edx:eax = divisor * quotient"
	mov	ecx, [esp+16]	; ecx = high dword of dividend
	cmp	[esp+12], eax
	sbb	ecx, edx	; CF = (dividend < divisor * quotient")
				;    = (remainder" < 0)
	sbb	eax, eax	; eax = (remainder" < 0) ? -1 : 0
	add	eax, ebx	; eax = quotient" - (remainder" < 0)
				;     = (low dword of) |quotient|
;;	xor	edx, edx	; edx:eax = |quotient|
	pop	ebx
quotient:
	pop	edx		; edx = (quotient < 0) ? -1 : 0
	xor	eax, edx
	sub	eax, edx
	sbb	edx, edx	; edx:eax = quotient

	ret	16		; callee restores stack

	; dividend = divisor = -2**63: quotient = 1
special:
	pop	eax		; eax = sign of quotient = 0
	inc	eax		; eax = (low dword of) quotient = 1
	xor	edx, edx	; edx:eax = quotient = 1

	ret	16		; callee restores stack

_alldiv	endp
	end

Save the i386 assembler source presented above as alldiv.asm in the directory where you created the object library i386.lib before, then execute the following 2 command lines to generate the object file alldiv.obj and add it to the existing object library i386.lib:

ML.EXE alldiv.asm
LINK.EXE /LIB /OUT:i386.lib i386.lib alldiv.obj

For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: the command lines can be copied and pasted as a block into a Command Processor window!

Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: alldiv.asm

Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

`_alldvrm` Routine

Execute the following 2 command lines to display the content of the assembler source file lldvrm.asm shipped with the Visual C compiler:

DIR "%source%\intel\lldvrm.asm"
TYPE "%source%\intel\lldvrm.asm"

 Volume in drive C has no label.
 Volume Serial Number is 1957-0427

 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel

02/18/2011  03:40 PM             8,557 lldvrm.asm
               1 File(s)          8,557 bytes
               0 Dir(s)    9,876,543,210 bytes free

        title   lldvrm - signed long divide and remainder routine
;***
;lldvrm.asm - signed long divide and remainder routine
;
;       Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
;       defines the signed long divide and remainder routine
;           __alldvrm
;
;*******************************************************************************


.xlist
include cruntime.inc
include mm.inc
.list

;***
;lldvrm - signed long divide and remainder
;
;Purpose:
;       Does a signed long divide and remainder of the arguments.  Arguments are
;       not changed.
;
;Entry:
;       Arguments are passed on the stack:
;               1st pushed: divisor (QWORD)
;               2nd pushed: dividend (QWORD)
;
;Exit:
;       EDX:EAX contains the quotient (dividend/divisor)
;       EBX:ECX contains the remainder (divided % divisor)
;       NOTE: this routine removes the parameters from the stack.
;
;Uses:
;       ECX
;
;Exceptions:
;
;*******************************************************************************

        CODESEG

_alldvrm PROC NEAR
.FPO (3, 4, 0, 0, 1, 0)

        push    edi
        push    esi
        push    ebp

; Set up the local stack and save the index registers.  When this is done
; the stack frame will look as follows (assuming that the expression a/b will
; generate a call to alldvrm(a, b)):
;
;               -----------------
;               |               |
;               |---------------|
;               |               |
;               |--divisor (b)--|
;               |               |
;               |---------------|
;               |               |
;               |--dividend (a)-|
;               |               |
;               |---------------|
;               | return addr** |
;               |---------------|
;               |      EDI      |
;               |---------------|
;               |      ESI      |
;               |---------------|
;       ESP---->|      EBP      |
;               -----------------
;

DVND    equ     [esp + 16]      ; stack address of dividend (a)
DVSR    equ     [esp + 24]      ; stack address of divisor (b)


; Determine sign of the quotient (edi = 0 if result is positive, non-zero
; otherwise) and make operands positive.
; Sign of the remainder is kept in ebp.

        xor     edi,edi         ; result sign assumed positive
        xor     ebp,ebp         ; result sign assumed positive

        mov     eax,HIWORD(DVND) ; hi word of a
        or      eax,eax         ; test to see if signed
        jge     short L1        ; skip rest if a is already positive
        inc     edi             ; complement result sign flag
        inc     ebp             ; complement result sign flag
        mov     edx,LOWORD(DVND) ; lo word of a
        neg     eax             ; make a positive
        neg     edx
        sbb     eax,0
        mov     HIWORD(DVND),eax ; save positive value
        mov     LOWORD(DVND),edx
L1:
        mov     eax,HIWORD(DVSR) ; hi word of b
        or      eax,eax         ; test to see if signed
        jge     short L2        ; skip rest if b is already positive
        inc     edi             ; complement the result sign flag
        mov     edx,LOWORD(DVSR) ; lo word of a
        neg     eax             ; make b positive
        neg     edx
        sbb     eax,0
        mov     HIWORD(DVSR),eax ; save positive value
        mov     LOWORD(DVSR),edx
L2:

;
; Now do the divide.  First look to see if the divisor is less than 4194304K.
; If so, then we can use a simple algorithm with word divides, otherwise
; things get a little more complex.
;
; NOTE - eax currently contains the high order word of DVSR
;

        or      eax,eax         ; check to see if divisor < 4194304K
        jnz     short L3        ; nope, gotta do this the hard way
        mov     ecx,LOWORD(DVSR) ; load divisor
        mov     eax,HIWORD(DVND) ; load high word of dividend
        xor     edx,edx
        div     ecx             ; eax <- high order bits of quotient
        mov     ebx,eax         ; save high bits of quotient
        mov     eax,LOWORD(DVND) ; edx:eax <- remainder:lo word of dividend
        div     ecx             ; eax <- low order bits of quotient
        mov     esi,eax         ; ebx:esi <- quotient
;
; Now we need to do a multiply so that we can compute the remainder.
;
        mov     eax,ebx         ; set up high word of quotient
        mul     dword ptr LOWORD(DVSR) ; HIWORD(QUOT) * DVSR
        mov     ecx,eax         ; save the result in ecx
        mov     eax,esi         ; set up low word of quotient
        mul     dword ptr LOWORD(DVSR) ; LOWORD(QUOT) * DVSR
        add     edx,ecx         ; EDX:EAX = QUOT * DVSR
        jmp     short L4        ; complete remainder calculation

;
; Here we do it the hard way.  Remember, eax contains the high word of DVSR
;

L3:
        mov     ebx,eax         ; ebx:ecx <- divisor
        mov     ecx,LOWORD(DVSR)
        mov     edx,HIWORD(DVND) ; edx:eax <- dividend
        mov     eax,LOWORD(DVND)
L5:
        shr     ebx,1           ; shift divisor right one bit
        rcr     ecx,1
        shr     edx,1           ; shift dividend right one bit
        rcr     eax,1
        or      ebx,ebx
        jnz     short L5        ; loop until divisor < 4194304K
        div     ecx             ; now divide, ignore remainder
        mov     esi,eax         ; save quotient

;
; We may be off by one, so to check, we will multiply the quotient
; by the divisor and check the result against the orignal dividend
; Note that we must also check for overflow, which can occur if the
; dividend is close to 2**64 and the quotient is off by 1.
;

        mul     dword ptr HIWORD(DVSR) ; QUOT * HIWORD(DVSR)
        mov     ecx,eax
        mov     eax,LOWORD(DVSR)
        mul     esi             ; QUOT * LOWORD(DVSR)
        add     edx,ecx         ; EDX:EAX = QUOT * DVSR
        jc      short L6        ; carry means Quotient is off by 1

;
; do long compare here between original dividend and the result of the
; multiply in edx:eax.  If original is larger or equal, we are ok, otherwise
; subtract one (1) from the quotient.
;

        cmp     edx,HIWORD(DVND) ; compare hi words of result and original
        ja      short L6        ; if result > original, do subtract
        jb      short L7        ; if result < original, we are ok
        cmp     eax,LOWORD(DVND) ; hi words are equal, compare lo words
        jbe     short L7        ; if less or equal we are ok, else subtract
L6:
        dec     esi             ; subtract 1 from quotient
        sub     eax,LOWORD(DVSR) ; subtract divisor from result
        sbb     edx,HIWORD(DVSR)
L7:
        xor     ebx,ebx         ; ebx:esi <- quotient

L4:
;
; Calculate remainder by subtracting the result from the original dividend.
; Since the result is already in a register, we will do the subtract in the
; opposite direction and negate the result if necessary.
;

        sub     eax,LOWORD(DVND) ; subtract dividend from result
        sbb     edx,HIWORD(DVND)

;
; Now check the result sign flag to see if the result is supposed to be positive
; or negative.  It is currently negated (because we subtracted in the 'wrong'
; direction), so if the sign flag is set we are done, otherwise we must negate
; the result to make it positive again.
;

        dec     ebp             ; check result sign flag
        jns     short L9        ; result is ok, set up the quotient
        neg     edx             ; otherwise, negate the result
        neg     eax
        sbb     edx,0

;
; Now we need to get the quotient into edx:eax and the remainder into ebx:ecx.
;
L9:
        mov     ecx,edx
        mov     edx,ebx
        mov     ebx,ecx
        mov     ecx,eax
        mov     eax,esi

;
; Just the cleanup left to do.  edx:eax contains the quotient.  Set the sign
; according to the save value, cleanup the stack, and return.
;

        dec     edi             ; check to see if result is negative
        jnz     short L8        ; if EDI == 0, result should be negative
        neg     edx             ; otherwise, negate the result
        neg     eax
        sbb     edx,0

;
; Restore the saved registers and return.
;

L8:
        pop     ebp
        pop     esi
        pop     edi

        ret     16

_alldvrm ENDP

end

91 instructions in 223 bytes (plus 1 byte for alignment).

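OUCH: the comment “Now we need to do a multiply so that we can compute the remainder.” together with the code following it is a remarkable gem – after the second DIV instruction the remainder is already present in register EDX!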
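
To illustrate why the multiplication is superfluous: when the divisor fits in 32 bits, the second step of the schoolbook division yields the final remainder as a by-product. A minimal C sketch of this two-step division follows; the function name divrem_64_by_32 and its signature are hypothetical, and the 64-bit intermediate merely stands in for the register pair EDX:EAX:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: divide a 64-bit dividend by a 32-bit divisor in
   two 32-bit steps; returns the quotient and stores the remainder */
uint64_t divrem_64_by_32(uint64_t dividend, uint32_t divisor, uint32_t *remainder)
{
    uint32_t high = (uint32_t)(dividend >> 32);
    uint32_t low = (uint32_t)dividend;

    assert(divisor != 0);

    /* 1st step: high dword of dividend / divisor */
    uint32_t quotient_high = high / divisor;
    uint32_t partial = high % divisor;       /* partial remainder < divisor */

    /* 2nd step: (partial remainder : low dword) / divisor;
       the quotient fits in 32 bits because partial < divisor */
    uint64_t rest = ((uint64_t)partial << 32) | low;
    uint32_t quotient_low = (uint32_t)(rest / divisor);

    /* final remainder, obtained without any multiplication */
    *remainder = (uint32_t)(rest % divisor);

    return ((uint64_t)quotient_high << 32) | quotient_low;
}
```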

Note: unlike the IDIV instruction, which raises a divide error (#DE) exception when dividing −2⁶³, the smallest signed 64-bit integer, by −1, this routine returns the (wrong) quotient −2⁶³ together with the (correct) remainder 0, i.e. the only integer smaller in magnitude than the divisor −1!
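
The wrap-around can be reproduced with a sketch in C that, like the routine, divides the magnitudes as unsigned numbers and applies the signs afterwards. The function name divrem_by_magnitude and the choice to pass and return raw 64-bit patterns (to stay clear of signed overflow, which is undefined in C) are assumptions made for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: operands and results are two's complement bit patterns */
uint64_t divrem_by_magnitude(uint64_t dividend, uint64_t divisor, uint64_t *remainder)
{
    uint64_t dividend_mask = (dividend >> 63) ? UINT64_MAX : 0;
    uint64_t divisor_mask = (divisor >> 63) ? UINT64_MAX : 0;

    /* |x| = (x ^ mask) - mask; |-2**63| = 2**63 fits as unsigned number */
    uint64_t a = (dividend ^ dividend_mask) - dividend_mask;
    uint64_t b = (divisor ^ divisor_mask) - divisor_mask;

    assert(b != 0);

    uint64_t q = a / b;
    uint64_t r = a % b;

    /* the remainder takes the sign of the dividend */
    *remainder = (r ^ dividend_mask) - dividend_mask;

    /* the quotient is negative if the signs differ; 2**63 has no positive
       signed counterpart, so its bit pattern reads as -2**63 */
    uint64_t sign = dividend_mask ^ divisor_mask;
    return (q ^ sign) - sign;
}
```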

Implementation in i386 Assembler

A proper (and several times faster) implementation comes in two variants: with the text macro JCCLESS defined it targets processors which feature speculative execution and uses 111 instructions in 268 bytes (plus 4 bytes for alignment); without it, it targets processors which don’t feature speculative execution and uses 108 instructions in 260 bytes (plus 12 bytes for alignment). Both counts include 22 instructions in 50 bytes for the special and trivial cases not covered by Microsoft’s poor implementation:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.386
	.model	flat, C
	.code

; MSC internal _alldvrm():
; receives arguments on stack, returns quotient in edx:eax and remainder in ebx:ecx

; NOTE: _alldvrm() can raise 'division by zero' exception; it does
;       not raise 'integer overflow' exception on quotient overflow,
;       but returns ±2**63 for -2**63 / -1 and 0 for -2**63 % -1!

_alldvrm proc	public		; sqword _alldvrm(sqword dividend, sqword divisor)

	; determine sign of dividend and compute |dividend|

	mov	edx, [esp+8]
	mov	eax, [esp+4]	; edx:eax = dividend

	mov	ebx, edx
	sar	ebx, 31		; ebx = (dividend < 0) ? -1 : 0
				;     = (remainder < 0) ? -1 : 0
	xor	eax, ebx
	xor	edx, ebx	; edx:eax = (dividend < 0) ? ~dividend : dividend
	sub	eax, ebx
	sbb	edx, ebx	; edx:eax = (dividend < 0) ? -dividend : dividend
				;         = |dividend|

	mov	[esp+4], eax	; write |dividend| back on stack
	mov	[esp+8], edx

	; determine sign of divisor and compute |divisor|

	mov	edx, [esp+16]
	mov	eax, [esp+12]	; edx:eax = divisor

	mov	ecx, edx
	sar	ecx, 31		; ecx = (divisor < 0) ? -1 : 0
	xor	eax, ecx
	xor	edx, ecx	; edx:eax = (divisor < 0) ? ~divisor : divisor
	sub	eax, ecx
	sbb	edx, ecx	; edx:eax = (divisor < 0) ? -divisor : divisor
				;         = |divisor|

	mov	[esp+12], eax	; write |divisor| back on stack
	mov	[esp+16], edx

	xor	ecx, ebx	; ecx = (divisor < 0) ^ (dividend < 0) ? -1 : 0
				;     = (quotient < 0) ? -1 : 0
	push	ecx		; [esp] = (quotient < 0) ? -1 : 0
	push	ebx		; [esp] = (remainder < 0) ? -1 : 0

	mov	ecx, [esp+16]	; ecx = high dword of dividend
	cmp	[esp+12], eax
	sbb	ecx, edx
	jb	short trivial	; dividend < divisor?

	bsr	ecx, edx	; ecx = index of most significant '1' bit
				;        in high dword of divisor
	jnz	short extended	; high dword of divisor <> 0?

	; remainder < divisor < 2**32

	mov	ecx, eax	; ecx = (low dword of) divisor
	mov	eax, [esp+16]	; eax = high dword of dividend
	cmp	eax, ecx
	jae	short long	; high dword of dividend >= divisor?

	; perform normal division
normal:
	mov	edx, eax	; edx = high dword of dividend
	xor	ebx, ebx	; ebx = high dword of quotient = 0
	jmp	short next

	; perform "long" alias "schoolbook" division
long:
;;	xor	edx, edx	; edx:eax = high dword of dividend
	div	ecx		; eax = high dword of quotient,
				; edx = high dword of remainder'
	mov	ebx, eax	; ebx = high dword of quotient
next:
	mov	eax, [esp+12]	; eax = low dword of dividend
	div	ecx		; eax = low dword of quotient,
				; edx = (low dword of) remainder

	mov	ecx, edx	; ecx = (low dword of) |remainder|
	mov	edx, ebx	; edx:eax = |quotient|
;;	xor	ebx, ebx	; ebx:ecx = |remainder|
if 0
	mov	ebx, [esp+4]	; ebx = (quotient < 0) ? -1 : 0
	xor	eax, ebx
	xor	edx, ebx
	sub	eax, ebx
	sbb	edx, ebx	; edx:eax = quotient

	pop	ebx		; ebx = (remainder < 0) ? -1 : 0
	xor	ecx, ebx
	sub	ecx, ebx
	sbb	ebx, ebx	; ebx:ecx = remainder
else
	pop	ebx		; ebx = (remainder < 0) ? -1 : 0
	xor	ecx, ebx
	sub	ecx, ebx
	sbb	ebx, ebx	; ebx:ecx = remainder

	xor	eax, [esp]
	xor	edx, [esp]
	sub	eax, [esp]
	sbb	edx, [esp]	; edx:eax = quotient
endif
	add	esp, 4
	ret	16		; callee restores stack

	; dividend < divisor: quotient = 0, remainder = dividend
trivial:
	pop	eax		; eax = (remainder < 0) ? -1 : 0
	mov	ecx, [esp+8]
	mov	ebx, [esp+12]	; ebx:ecx = |remainder| = |dividend|
	xor	ecx, eax
	xor	ebx, eax
	sub	ecx, eax
	sbb	ebx, eax	; ebx:ecx = remainder

	pop	edx		; edx = (quotient < 0) ? -1 : 0
	xor	eax, eax
	xor	edx, edx	; edx:eax = quotient = 0

	ret	16		; callee restores stack

	; 2**63 >= dividend >= divisor >= 2**32: quotient < 2**32
extended:
	xor	ecx, 31		; ecx = number of leading '0' bits
				;        in (high dword of) divisor
	jz	short special	; divisor = 2**63?

	; perform "extended & adjusted" division

	shld	edx, eax, cl	; edx = divisor / 2**(index + 1)
				;     = divisor'
;;	shl	eax, cl
	mov	ebx, edx	; ebx = divisor'
ifndef JCCLESS
	xor	eax, eax	; eax = high dword of quotient' = 0
	mov	edx, [esp+16]	; edx = high dword of dividend
	cmp	edx, ebx
	jb	short @f	; high dword of dividend < divisor'?

	; high dword of dividend >= divisor':
	; subtract divisor' from high dword of dividend to prevent possible
	; division overflow and set most significant bit of quotient"

	sub	edx, ebx	; edx = high dword of dividend - divisor'
				;     = high dword of dividend'
	inc	eax		; eax = high dword of quotient' = 1
@@:
	push	eax		; [esp] = high dword of quotient'
else ; JCCLESS
	mov	edx, [esp+16]	; edx = high dword of dividend
	cmp	edx, ebx	; CF = (high dword of dividend < divisor')
	sbb	eax, eax	; eax = (high dword of dividend < divisor') ? -1 : 0
	inc	eax		; eax = (high dword of dividend < divisor') ? 0 : 1
				;     = high dword of quotient'
	push	eax		; [esp] = high dword of quotient'
if 0
	neg	eax		; eax = (high dword of dividend < divisor') ? 0 : -1
	and	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
else
	imul	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
endif
	sub	edx, eax	; edx = high dword of dividend
				;     - (high dword of dividend < divisor') ? 0 : divisor'
				;     = high dword of dividend'
endif ; JCCLESS
	mov	eax, [esp+16]	; eax = low dword of dividend
				;     = low dword of dividend'
	div	ebx		; eax = dividend' / divisor'
				;     = low dword of quotient',
				; edx = remainder'
	pop	ebx		; ebx = high dword of quotient'
	shld	ebx, eax, cl	; ebx = quotient' / 2**(index + 1)
				;     = dividend / divisor'
				;     = quotient"
;;	shl	eax, cl
	push	ebx		; [esp] = quotient"
	mov	eax, [esp+24]	; eax = low dword of divisor
	mul	ebx		; edx:eax = low dword of divisor * quotient"
	imul	ebx, [esp+28]	; ebx = high dword of divisor * quotient"
	add	edx, ebx	; edx:eax = divisor * quotient"
	mov	ecx, [esp+16]
	mov	ebx, [esp+20]	; ebx:ecx = dividend
	sub	ecx, eax
	sbb	ebx, edx	; ebx:ecx = dividend - divisor * quotient"
				;         = remainder"
ifndef JCCLESS
	pop	eax		; eax = quotient"
	jnb	short @f	; remainder" >= 0?
				;  (with borrow, it is off by divisor,
				;   and quotient" is off by 1)
	add	ecx, [esp+20]
	adc	ebx, [esp+24]	; ebx:ecx = remainder" + divisor
				;         = |remainder|
	dec	eax		; eax = quotient" - 1
				;     = low dword of |quotient|
@@:
else ; JCCLESS
	sbb	eax, eax	; eax = (remainder" < 0) ? -1 : 0
	cdq			; edx = (remainder" < 0) ? -1 : 0
	add	[esp], eax	; [esp] = quotient" - (remainder" < 0)
				;       = (low dword of) |quotient|
	and	eax, [esp+24]
	and	edx, [esp+28]	; edx:eax = (remainder" < 0) ? divisor : 0
	add	ecx, eax
	adc	ebx, edx	; ebx:ecx = remainder" + divisor
				;         = |remainder|
	pop	eax		; eax = (low dword of) |quotient|
endif ; JCCLESS
;;	xor	edx, edx	; edx:eax = |quotient|

	pop	edx		; edx = (remainder < 0) ? -1 : 0
	xor	ecx, edx
	xor	ebx, edx
	sub	ecx, edx
	sbb	ebx, edx	; ebx:ecx = remainder

	pop	edx		; edx = (quotient < 0) ? -1 : 0
	xor	eax, edx
	sub	eax, edx
	sbb	edx, edx	; edx:eax = quotient

	ret	16		; callee restores stack

	; dividend = divisor = -2**63: quotient = 1, remainder = 0
special:
	pop	ebx		; ebx = sign of remainder = -1
	inc	ebx
;;	xor	ecx, ecx	; ebx:ecx = remainder = 0

	pop	eax		; eax = sign of quotient = 0
	inc	eax		; eax = (low dword of) quotient = 1
	xor	edx, edx	; edx:eax = quotient = 1

	ret	16		; callee restores stack

_alldvrm endp
	end

Save the i386 assembler source presented above as alldvrm.asm in the directory where you created the object library i386.lib before, then execute the following 2 command lines to generate the object file alldvrm.obj and add it to the existing object library i386.lib:

ML.EXE alldvrm.asm
LINK.EXE /LIB /OUT:i386.lib i386.lib alldvrm.obj

For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: the command lines can be copied and pasted as a block into a Command Processor window!

Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: alldvrm.asm

Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.
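
The “extended & adjusted” division of the _alldvrm implementation above handles divisors of 2**32 and more: the divisor is normalised so that its upper 32 bits can serve as a 32-bit divisor, the resulting estimate is shifted back, and a single multiplication decides whether it has to be decremented. The following C sketch models these steps under the stated preconditions; the function name divrem_extended is hypothetical, a loop stands in for the BSR instruction, and a plain 64-bit C division replaces the DIV instruction with its overflow guard:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch for 2**32 <= divisor <= dividend <= 2**63 */
uint64_t divrem_extended(uint64_t dividend, uint64_t divisor, uint64_t *remainder)
{
    assert(divisor >> 32 != 0);
    assert(divisor <= dividend);
    assert(dividend <= (UINT64_C(1) << 63));

    /* number of leading '0' bits in the high dword of the divisor (0..31) */
    unsigned zeros = 0;
    while (((divisor << zeros) >> 63) == 0)
        zeros++;

    /* divisor' = upper 32 bits of the normalised divisor, >= 2**31 */
    uint32_t divisor1 = (uint32_t)((divisor << zeros) >> 32);

    /* quotient' can need 33 bits, so a 64-bit division is used here */
    uint64_t quotient1 = dividend / divisor1;

    /* quotient" is either the true quotient or 1 too large */
    uint64_t quotient = quotient1 >> (32 - zeros);

    if (quotient * divisor > dividend)
        quotient--;

    *remainder = dividend - quotient * divisor;
    return quotient;
}
```

The bound on the dividend matters: it is what keeps the estimate within 1 of the true quotient and the product within 64 bits in this sketch.
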

`_allrem` Routine

Execute the following 2 command lines to display the content of the assembler source file llrem.asm shipped with the Visual C compiler:

DIR "%source%\intel\llrem.asm"
TYPE "%source%\intel\llrem.asm"

 Volume in drive C has no label.
 Volume Serial Number is 1957-0427

 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel

02/18/2011  03:40 PM             7,067 llrem.asm
               1 File(s)          7,067 bytes
               0 Dir(s)    9,876,543,210 bytes free

        title   llrem - signed long remainder routine
;***
;llrem.asm - signed long remainder routine
;
;       Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
;       defines the signed long remainder routine
;           __allrem
;
;*******************************************************************************


.xlist
include cruntime.inc
include mm.inc
.list

;***
;llrem - signed long remainder
;
;Purpose:
;       Does a signed long remainder of the arguments.  Arguments are
;       not changed.
;
;Entry:
;       Arguments are passed on the stack:
;               1st pushed: divisor (QWORD)
;               2nd pushed: dividend (QWORD)
;
;Exit:
;       EDX:EAX contains the remainder (dividend%divisor)
;       NOTE: this routine removes the parameters from the stack.
;
;Uses:
;       ECX
;
;Exceptions:
;
;*******************************************************************************

        CODESEG

_allrem PROC NEAR
.FPO (2, 4, 0, 0, 0, 0)

        push    ebx
        push    edi

; Set up the local stack and save the index registers.  When this is done
; the stack frame will look as follows (assuming that the expression a%b will
; generate a call to lrem(a, b)):
;
;               -----------------
;               |               |
;               |---------------|
;               |               |
;               |--divisor (b)--|
;               |               |
;               |---------------|
;               |               |
;               |--dividend (a)-|
;               |               |
;               |---------------|
;               | return addr** |
;               |---------------|
;               |       EBX     |
;               |---------------|
;       ESP---->|       EDI     |
;               -----------------
;

DVND    equ     [esp + 12]      ; stack address of dividend (a)
DVSR    equ     [esp + 20]      ; stack address of divisor (b)


; Determine sign of the result (edi = 0 if result is positive, non-zero
; otherwise) and make operands positive.

        xor     edi,edi         ; result sign assumed positive

        mov     eax,HIWORD(DVND) ; hi word of a
        or      eax,eax         ; test to see if signed
        jge     short L1        ; skip rest if a is already positive
        inc     edi             ; complement result sign flag bit
        mov     edx,LOWORD(DVND) ; lo word of a
        neg     eax             ; make a positive
        neg     edx
        sbb     eax,0
        mov     HIWORD(DVND),eax ; save positive value
        mov     LOWORD(DVND),edx
L1:
        mov     eax,HIWORD(DVSR) ; hi word of b
        or      eax,eax         ; test to see if signed
        jge     short L2        ; skip rest if b is already positive
        mov     edx,LOWORD(DVSR) ; lo word of b
        neg     eax             ; make b positive
        neg     edx
        sbb     eax,0
        mov     HIWORD(DVSR),eax ; save positive value
        mov     LOWORD(DVSR),edx
L2:

;
; Now do the divide.  First look to see if the divisor is less than 4194304K.
; If so, then we can use a simple algorithm with word divides, otherwise
; things get a little more complex.
;
; NOTE - eax currently contains the high order word of DVSR
;

        or      eax,eax         ; check to see if divisor < 4194304K
        jnz     short L3        ; nope, gotta do this the hard way
        mov     ecx,LOWORD(DVSR) ; load divisor
        mov     eax,HIWORD(DVND) ; load high word of dividend
        xor     edx,edx
        div     ecx             ; edx <- remainder
        mov     eax,LOWORD(DVND) ; edx:eax <- remainder:lo word of dividend
        div     ecx             ; edx <- final remainder
        mov     eax,edx         ; edx:eax <- remainder
        xor     edx,edx
        dec     edi             ; check result sign flag
        jns     short L4        ; negate result, restore stack and return
        jmp     short L8        ; result sign ok, restore stack and return

;
; Here we do it the hard way.  Remember, eax contains the high word of DVSR
;

L3:
        mov     ebx,eax         ; ebx:ecx <- divisor
        mov     ecx,LOWORD(DVSR)
        mov     edx,HIWORD(DVND) ; edx:eax <- dividend
        mov     eax,LOWORD(DVND)
L5:
        shr     ebx,1           ; shift divisor right one bit
        rcr     ecx,1
        shr     edx,1           ; shift dividend right one bit
        rcr     eax,1
        or      ebx,ebx
        jnz     short L5        ; loop until divisor < 4194304K
        div     ecx             ; now divide, ignore remainder

;
; We may be off by one, so to check, we will multiply the quotient
; by the divisor and check the result against the orignal dividend
; Note that we must also check for overflow, which can occur if the
; dividend is close to 2**64 and the quotient is off by 1.
;

        mov     ecx,eax         ; save a copy of quotient in ECX
        mul     dword ptr HIWORD(DVSR)
        xchg    ecx,eax         ; save product, get quotient in EAX
        mul     dword ptr LOWORD(DVSR)
        add     edx,ecx         ; EDX:EAX = QUOT * DVSR
        jc      short L6        ; carry means Quotient is off by 1

;
; do long compare here between original dividend and the result of the
; multiply in edx:eax.  If original is larger or equal, we are ok, otherwise
; subtract the original divisor from the result.
;

        cmp     edx,HIWORD(DVND) ; compare hi words of result and original
        ja      short L6        ; if result > original, do subtract
        jb      short L7        ; if result < original, we are ok
        cmp     eax,LOWORD(DVND) ; hi words are equal, compare lo words
        jbe     short L7        ; if less or equal we are ok, else subtract
L6:
        sub     eax,LOWORD(DVSR) ; subtract divisor from result
        sbb     edx,HIWORD(DVSR)
L7:

;
; Calculate remainder by subtracting the result from the original dividend.
; Since the result is already in a register, we will do the subtract in the
; opposite direction and negate the result if necessary.
;

        sub     eax,LOWORD(DVND) ; subtract dividend from result
        sbb     edx,HIWORD(DVND)

;
; Now check the result sign flag to see if the result is supposed to be positive
; or negative.  It is currently negated (because we subtracted in the 'wrong'
; direction), so if the sign flag is set we are done, otherwise we must negate
; the result to make it positive again.
;

        dec     edi             ; check result sign flag
        jns     short L8        ; result is ok, restore stack and return
L4:
        neg     edx             ; otherwise, negate the result
        neg     eax
        sbb     edx,0

;
; Just the cleanup left to do.  edx:eax contains the quotient.
; Restore the saved registers and return.
;

L8:
        pop     edi
        pop     ebx

        ret     16

_allrem ENDP

	end

69 instructions in 178 bytes (plus 14 bytes for alignment).

Note: unlike the IDIV instruction, which raises a divide error (#DE) exception when dividing −2⁶³, the smallest signed 64-bit integer, by −1, this routine returns the (correct) remainder 0, i.e. the only integer smaller in magnitude than the divisor −1!
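
A short C sketch shows why the remainder cannot overflow: the routine works on magnitudes and only the sign of the dividend is applied to the result. The function name remainder_by_magnitude is hypothetical, and the unsigned negation is an assumption made to avoid undefined signed overflow in C:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a signed remainder computed from magnitudes */
int64_t remainder_by_magnitude(int64_t dividend, int64_t divisor)
{
    /* 0 - (uint64_t)x yields |x| without signed overflow, even for -2**63 */
    uint64_t a = dividend < 0 ? 0 - (uint64_t)dividend : (uint64_t)dividend;
    uint64_t b = divisor < 0 ? 0 - (uint64_t)divisor : (uint64_t)divisor;

    assert(b != 0);

    /* |remainder| < |divisor| <= 2**63, so it always fits in int64_t */
    int64_t r = (int64_t)(a % b);

    /* only the sign of the dividend matters, the sign of the divisor is ignored */
    return dividend < 0 ? -r : r;
}
```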

Implementation in i386 Assembler

A proper (and several times faster) implementation comes in two variants: with the text macro JCCLESS defined it targets processors which feature speculative execution and uses 85 instructions in 213 bytes (plus 11 bytes for alignment); without it, it targets processors which don’t feature speculative execution and uses 84 instructions in 211 bytes (plus 13 bytes for alignment). Both counts include 12 instructions in 33 bytes for the special and trivial cases not covered by Microsoft’s poor implementation:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.386
	.model	flat, C
	.code

; MSC internal _allrem():
; receives arguments on stack, returns remainder in edx:eax

; NOTE: _allrem() can raise 'division by zero' exception; it does
;       not raise 'integer overflow' exception on quotient overflow,
;       but returns 0 for -2**63 % -1!

_allrem	proc	public		; sqword _allrem(sqword dividend, sqword divisor)

	; determine sign of dividend and compute |dividend|

	mov	eax, [esp+8]
	mov	ecx, [esp+4]	; eax:ecx = dividend

	cdq			; edx = (dividend < 0) ? -1 : 0
	xor	ecx, edx
	xor	eax, edx	; eax:ecx = (dividend < 0) ? ~dividend : dividend
	sub	ecx, edx
	sbb	eax, edx	; eax:ecx = (dividend < 0) ? -dividend : dividend
				;         = |dividend|

	mov	[esp+4], ecx	; write |dividend| back on stack
	mov	[esp+8], eax

	push	edx		; [esp] = (dividend < 0) ? -1 : 0

	; determine sign of divisor and compute |divisor|

	mov	edx, [esp+20]
	mov	eax, [esp+16]	; edx:eax = divisor

	mov	ecx, edx
	sar	ecx, 31		; ecx = (divisor < 0) ? -1 : 0
	xor	eax, ecx
	xor	edx, ecx	; edx:eax = (divisor < 0) ? ~divisor : divisor
	sub	eax, ecx
	sbb	edx, ecx	; edx:eax = (divisor < 0) ? -divisor : divisor
				;         = |divisor|

	mov	[esp+16], eax	; write |divisor| back on stack
	mov	[esp+20], edx

	mov	ecx, [esp+12]	; ecx = high dword of dividend
	cmp	[esp+8], eax
	sbb	ecx, edx
	jb	short trivial	; dividend < divisor?

	bsr	ecx, edx	; ecx = index of most significant '1' bit
				;        in high dword of divisor
	jnz	short extended	; high dword of divisor <> 0?

	; remainder < divisor < 2**32

	mov	ecx, eax	; ecx = (low dword of) divisor
	mov	eax, [esp+12]	; eax = high dword of dividend
	cmp	eax, ecx
	jae	short long	; high dword of dividend >= divisor?

	; perform normal division
normal:
	mov	edx, eax	; edx = high dword of dividend
	jmp	short next

	; perform "long" alias "schoolbook" division
long:
;;	xor	edx, edx	; edx:eax = high dword of dividend
	div	ecx		; eax = high dword of quotient,
				; edx = high dword of remainder'
next:
	mov	eax, [esp+8]	; eax = low dword of dividend
	div	ecx		; eax = low dword of quotient,
				; edx = (low dword of) remainder
	mov	eax, edx	; eax = (low dword of) |remainder|
;;	xor	edx, edx	; edx:eax = |remainder|

	pop	edx		; edx = (remainder < 0) ? -1 : 0
	xor	eax, edx
	sub	eax, edx
	sbb	edx, edx	; edx:eax = remainder

	ret	16		; callee restores stack

	; dividend < divisor: remainder = dividend
trivial:
	mov	eax, [esp+8]
	mov	edx, [esp+12]	; edx:eax = |remainder| = |dividend|

	jmp	short remainder

	; 2**63 >= dividend >= divisor >= 2**32: quotient < 2**32
extended:
	xor	ecx, 31		; ecx = number of leading '0' bits
				;        in (high dword of) divisor
	jz	short special	; divisor = 2**63?

	; perform "extended & adjusted" division

	shld	edx, eax, cl	; edx = divisor / 2**(index + 1)
				;     = divisor'
;;	shl	eax, cl
	push	ebx
	mov	ebx, edx	; ebx = divisor'
ifndef JCCLESS
	xor	eax, eax	; eax = high dword of quotient' = 0
	mov	edx, [esp+16]	; edx = high dword of dividend
	cmp	edx, ebx
	jb	short @f	; high dword of dividend < divisor'?

	; high dword of dividend >= divisor':
	; subtract divisor' from high dword of dividend to prevent possible
	; division overflow and set most significant bit of quotient"

	sub	edx, ebx	; edx = high dword of dividend - divisor'
				;     = high dword of dividend'
	inc	eax		; eax = high dword of quotient' = 1
@@:
	push	eax		; [esp] = high dword of quotient'
else ; JCCLESS
	mov	edx, [esp+16]	; edx = high dword of dividend
	cmp	edx, ebx	; CF = (high dword of dividend < divisor')
	sbb	eax, eax	; eax = (high dword of dividend < divisor') ? -1 : 0
	inc	eax		; eax = (high dword of dividend < divisor') ? 0 : 1
				;     = high dword of quotient'
	push	eax		; [esp] = high dword of quotient'
if 0
	neg	eax		; eax = (high dword of dividend < divisor') ? 0 : -1
	and	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
else
	imul	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
endif
	sub	edx, eax	; edx = high dword of dividend
				;     - (high dword of dividend < divisor') ? 0 : divisor'
				;     = high dword of dividend'
endif ; JCCLESS
	mov	eax, [esp+16]	; eax = low dword of dividend
				;     = low dword of dividend'
	div	ebx		; eax = dividend' / divisor'
				;     = low dword of quotient',
				; edx = remainder'
	pop	ebx		; ebx = high dword of quotient'
	shld	ebx, eax, cl	; ebx = quotient' / 2**(index + 1)
				;     = dividend / divisor'
				;     = quotient"
;;	shl	eax, cl
	mov	eax, [esp+20]	; eax = low dword of divisor
	mul	ebx		; edx:eax = low dword of divisor * quotient"
	imul	ebx, [esp+24]	; ebx = high dword of divisor * quotient"
	add	edx, ebx	; edx:eax = divisor * quotient"
	mov	ecx, [esp+12]
	mov	ebx, [esp+16]	; ebx:ecx = dividend
	sub	ecx, eax
	sbb	ebx, edx	; ebx:ecx = dividend - divisor * quotient"
				;         = remainder"
ifndef JCCLESS
	jnb	short @f	; remainder" >= 0?
				;  (with borrow, it is off by divisor,
				;   and quotient" is off by 1)
	add	ecx, [esp+20]
	adc	ebx, [esp+24]	; ebx:ecx = remainder" + divisor
				;         = |remainder|
@@:
	mov	eax, ecx
	mov	edx, ebx	; edx:eax = |remainder|
else ; JCCLESS
	sbb	eax, eax	; eax = (remainder" < 0) ? -1 : 0
	cdq			; edx = (remainder" < 0) ? -1 : 0
	and	eax, [esp+20]
	and	edx, [esp+24]	; edx:eax = (remainder" < 0) ? divisor : 0
	add	eax, ecx
	adc	edx, ebx	; edx:eax = remainder" + divisor
				;         = |remainder|
endif ; JCCLESS
	pop	ebx
remainder:
	pop	ecx		; ecx = (remainder < 0) ? -1 : 0
	xor	eax, ecx
	xor	edx, ecx
	sub	eax, ecx
	sbb	edx, ecx	; edx:eax = remainder

	ret	16		; callee restores stack

	; dividend = divisor = -2**63: remainder = 0
special:
	pop	eax		; eax = sign of remainder = -1
	inc	eax
	xor	edx, edx	; edx:eax = remainder = 0

	ret	16		; callee restores stack

_allrem	endp
	end

Save the i386 assembler source presented above as allrem.asm in the directory where you created the object library i386.lib before, then execute the following 2 command lines to generate the object file allrem.obj and add it to the existing object library i386.lib:

ML.EXE allrem.asm
LINK.EXE /LIB /OUT:i386.lib i386.lib allrem.obj

For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: the command lines can be copied and pasted as a block into a Command Processor window!

Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: allrem.asm

Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.
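
The tail of the _allrem routine above applies the sign to the magnitude of the remainder without a conditional branch: with a mask that is either 0 or -1, XOR followed by SUB and SBB leaves the value unchanged or negates it. A minimal C sketch of this idiom follows; the function name apply_sign is hypothetical, and unsigned arithmetic is used only to keep the wrap-around well defined in C:

```c
#include <stdint.h>

/* Hypothetical helper: returns magnitude unchanged when mask is 0,
   and its two's complement negation when mask has all bits set. */
uint64_t apply_sign(uint64_t magnitude, uint64_t mask)
{
    return (magnitude ^ mask) - mask;   /* XOR pair, then SUB and SBB */
}
```

With mask = 0 both operations are no-ops; with all bits set, the XOR yields the one's complement, and subtracting -1 adds the missing 1.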

`_aulldiv` Routine

The documentation for the _aulldiv compiler helper routine states:

Divides two ULONGLONG integers. For example, to divide two UInt64 values the compiler might generate a call to _aulldiv Routine.
Remarks
_aulldiv Routine is a helper routine for the C compiler. Whether the compiler calls _aulldiv Routine is completely dependent on the optimization set.

OUCH: contrary to the last statement of the Remarks, the Visual C compiler generates calls to the _aulldiv routine unconditionally, independent of any optimisation, whenever it encounters a division where at least one of the operands is an unsigned 64-bit integer!
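
As a minimal illustration (the function name quotient is hypothetical): according to the statement above, a division like the following one is translated into a call to _aulldiv when compiled for the i386 platform, whatever optimisation options are given:

```c
/* Hypothetical example: unsigned 64-bit division with variable operands,
   which cannot be performed by a single i386 DIV instruction. */
unsigned long long quotient(unsigned long long dividend,
                            unsigned long long divisor)
{
    return dividend / divisor;
}
```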

Execute the following 2 command lines to display the content of the assembler source file ulldiv.asm shipped with the Visual C compiler:

DIR "%source%\intel\ulldiv.asm"
TYPE "%source%\intel\ulldiv.asm"

 Volume in drive C has no label.
 Volume Serial Number is 1957-0427

 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel

02/18/2011  03:40 PM             5,079 ulldiv.asm
               1 File(s)          5,079 bytes
               0 Dir(s)    9,876,543,210 bytes free

        title   ulldiv - unsigned long divide routine
;***
;ulldiv.asm - unsigned long divide routine
;
;       Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
;       defines the unsigned long divide routine
;           __aulldiv
;
;*******************************************************************************


.xlist
include cruntime.inc
include mm.inc
.list

;***
;ulldiv - unsigned long divide
;
;Purpose:
;       Does a unsigned long divide of the arguments.  Arguments are
;       not changed.
;
;Entry:
;       Arguments are passed on the stack:
;               1st pushed: divisor (QWORD)
;               2nd pushed: dividend (QWORD)
;
;Exit:
;       EDX:EAX contains the quotient (dividend/divisor)
;       NOTE: this routine removes the parameters from the stack.
;
;Uses:
;       ECX
;
;Exceptions:
;
;*******************************************************************************

        CODESEG

_aulldiv        PROC NEAR
.FPO (2, 4, 0, 0, 0, 0)

        push    ebx
        ~~push    esi~~

; Set up the local stack and save the index registers.  When this is done
; the stack frame will look as follows (assuming that the expression a/b will
; generate a call to uldiv(a, b)):
;
;               -----------------
;               |               |
;               |---------------|
;               |               |
;               |--divisor (b)--|
;               |               |
;               |---------------|
;               |               |
;               |--dividend (a)-|
;               |               |
;               |---------------|
;               | return addr** |
;               |---------------|
;               |      EBX      |
;               |---------------|
;       ESP---->|      ESI      |
;               -----------------
;

~~DVND    equ     [esp + 12]      ; stack address of dividend (a)~~
~~DVSR    equ     [esp + 20]      ; stack address of divisor (b)~~
DVND    equ     [esp + 8]
DVSR    equ     [esp + 16]

;
; Now do the divide.  First look to see if the divisor is less than 4194304K.
; If so, then we can use a simple algorithm with word divides, otherwise
; things get a little more complex.
;

        ~~mov     eax,HIWORD(DVSR) ; check to see if divisor < 4194304K~~
        ~~or      eax,eax~~
        mov     edx,HIWORD(DVSR)
        test    edx,edx
        jnz     short L1        ; nope, gotta do this the hard way
        mov     ecx,LOWORD(DVSR) ; load divisor
        mov     eax,HIWORD(DVND) ; load high word of dividend
        ~~xor     edx,edx~~
        div     ecx             ; get high order bits of quotient
        mov     ebx,eax         ; save high bits of quotient
        mov     eax,LOWORD(DVND) ; edx:eax <- remainder:lo word of dividend
        div     ecx             ; get low order bits of quotient
        mov     edx,ebx         ; edx:eax <- quotient hi:quotient lo
        jmp     short L2        ; restore stack and return

;
; Here we do it the hard way.  Remember, eax contains DVSRHI
;

L1:
        ~~mov     ecx,eax         ; ecx:ebx <- divisor~~
        mov     ecx,edx
        mov     ebx,LOWORD(DVSR)
        mov     edx,HIWORD(DVND) ; edx:eax <- dividend
        mov     eax,LOWORD(DVND)
L3:
        shr     ecx,1           ; shift divisor right one bit; hi bit <- 0
        rcr     ebx,1
        shr     edx,1           ; shift dividend right one bit; hi bit <- 0
        rcr     eax,1
        ~~or      ecx,ecx~~
        test    ecx,ecx
        jnz     short L3        ; loop until divisor < 4194304K
        div     ebx             ; now divide, ignore remainder
        ~~mov     esi,eax         ; save quotient~~
        mov     ebx,eax

;
; We may be off by one, so to check, we will multiply the quotient
; by the divisor and check the result against the orignal dividend
; Note that we must also check for overflow, which can occur if the
; dividend is close to 2**64 and the quotient is off by 1.
;

        ~~mul     dword ptr HIWORD(DVSR) ; QUOT * HIWORD(DVSR)~~
        ~~mov     ecx,eax~~
        mov     ecx,HIWORD(DVSR)
        imul    ecx,ebx
        mov     eax,LOWORD(DVSR)
        ~~mul     esi             ; QUOT * LOWORD(DVSR)~~
        mul     ebx
        add     edx,ecx         ; EDX:EAX = QUOT * DVSR
        jc      short L4        ; carry means Quotient is off by 1

;
; do long compare here between original dividend and the result of the
; multiply in edx:eax.  If original is larger or equal, we are ok, otherwise
; subtract one (1) from the quotient.
;

        cmp     edx,HIWORD(DVND) ; compare hi words of result and original
        ja      short L4        ; if result > original, do subtract
        jb      short L5        ; if result < original, we are ok
        cmp     eax,LOWORD(DVND) ; hi words are equal, compare lo words
        sbb     edx,HIWORD(DVND)
        jbe     short L5        ; if less or equal we are ok, else subtract
L4:
        ~~dec     esi             ; subtract 1 from quotient~~
        dec     ebx
L5:
        xor     edx,edx         ; edx:eax <- quotient
        ~~mov     eax,esi~~
        mov     eax,ebx

;
; Just the cleanup left to do.  edx:eax contains the quotient.
; Restore the saved registers and return.
;

L2:

        ~~pop     esi~~
        pop     ebx

        ret     16

_aulldiv        ENDP

        end

With 43 instructions in 104 bytes (plus 8 bytes for alignment), this routine has several major and minor flaws: 3 major flaws on all kinds of processors, and 1 more only on processors which feature speculative execution!

OOPS¹: instead of the 2 ~~deleted~~ OR instructions which perform superfluous write operations, the 2 inserted TEST instructions should be used.

OOPS²: register EDX should be used instead of register EAX before the conditional branch to label L1:, and the following ~~deleted~~ XOR instruction should be removed.

OOPS³: instead of the ~~deleted~~ first widening MUL instruction and the following ~~deleted~~ MOV instruction, the inserted MOV instruction loading the high part of the divisor into register ECX followed by the inserted faster IMUL instruction should be used.
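
Why the cheaper IMUL instruction suffices here: only the low 32 bits of the product of the quotient and the high part of the divisor are added to the high dword of the result, and these bits are the same for the widening and the truncating multiplication. A small C sketch, with hypothetical function names:

```c
#include <stdint.h>

/* Widening multiplication as performed by MUL; upper half discarded. */
uint32_t low_dword_widening(uint32_t quotient, uint32_t divisor_high)
{
    uint64_t product = (uint64_t)quotient * divisor_high;   /* EDX:EAX */
    return (uint32_t)product;                               /* EAX only */
}

/* Truncating multiplication as performed by the 2-operand IMUL. */
uint32_t low_dword_truncating(uint32_t quotient, uint32_t divisor_high)
{
    return quotient * divisor_high;                         /* modulo 2**32 */
}
```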

OUCH¹: instead of register ESI, register EBX should be used, saving a pair of PUSH and POP instructions and 2 bytes!

OUCH³: instead of the (brain)dead slow loop with 2 pairs of SHR and RCR instructions after label L3:, 2 pairs of SHRD and SHR instructions with their shift count determined by a BSR instruction should be used!

Note: this BSR instruction would also replace the ~~deleted~~ OR instruction or the inserted TEST instruction, respectively, before the conditional branch to label L1:.
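
The difference between the loop and a single shift by a precomputed count can be sketched in C. All names below are hypothetical, and bit_scan_reverse is only a portable stand-in for the BSR instruction, not a compiler intrinsic:

```c
#include <stdint.h>

/* Portable stand-in for BSR: index of the most significant 1 bit
   of a non-zero 32-bit value. */
static unsigned bit_scan_reverse(uint32_t value)
{
    unsigned index = 0;
    if (value >> 16) { value >>= 16; index += 16; }
    if (value >> 8)  { value >>= 8;  index += 8; }
    if (value >> 4)  { value >>= 4;  index += 4; }
    if (value >> 2)  { value >>= 2;  index += 2; }
    if (value >> 1)  { index += 1; }
    return index;
}

/* Bit-by-bit loop as after label L3:; requires divisor >= 2**32.
   Returns the number of bits shifted out. */
unsigned shift_by_loop(uint64_t *dividend, uint64_t *divisor)
{
    unsigned count = 0;
    do {
        *divisor >>= 1;
        *dividend >>= 1;
        count++;
    } while (*divisor >> 32);
    return count;
}

/* Loop-free replacement; requires divisor >= 2**32. */
unsigned shift_at_once(uint64_t *dividend, uint64_t *divisor)
{
    unsigned count = bit_scan_reverse((uint32_t)(*divisor >> 32)) + 1;
    *divisor >>= count;     /* one SHRD and SHR pair */
    *dividend >>= count;    /* one SHRD and SHR pair */
    return count;
}
```

The loop runs once per significant bit in the high dword of the divisor, up to 32 times, while the second variant always shifts exactly once.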

OUCH⁴: on processors which feature speculative execution, instead of the 3 conditional branches before label L4:, which are slow when mispredicted, and the 2 CMP instructions, a faster instruction sequence with fewer or no conditional branches should be used!
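
One possible branch-free correction can be sketched in C under two assumptions which are not taken from the source: the estimate is either exact or 1 too large, and the divisor is less than 2**63 (larger divisors, where the quotient is 0 or 1, would need a separate case). The function name adjust_quotient is hypothetical:

```c
#include <stdint.h>

/* Hypothetical helper: corrects a quotient estimate which is either
   exact or 1 too large. Assumes 0 < divisor < 2**63. */
uint32_t adjust_quotient(uint64_t dividend, uint64_t divisor, uint32_t estimate)
{
    /* Difference modulo 2**64: the true difference lies in the interval
       [-divisor, divisor), so bit 63 is set exactly when it is negative. */
    uint64_t difference = dividend - divisor * estimate;

    return estimate - (uint32_t)(difference >> 63);
}
```

Because the product is formed modulo 2**64, no separate overflow check is needed in this sketch; an assembler routine would rather take the borrow from the carry flag, as the SBB instructions in the implementation below do.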

Note: with the modifications shown in the source, this routine has 38 instructions in 96 bytes.

Implementation in i386 Assembler

A proper (and several times faster) implementation follows. With the text macro JCCLESS defined, it targets processors which feature speculative execution and uses 59 instructions in 147 bytes (plus 13 bytes for alignment); without it, it targets processors which don’t feature speculative execution and uses 60 instructions in 148 bytes (plus 12 bytes for alignment). Both counts include 12 instructions in 30 bytes for the special and trivial cases not covered by Microsoft’s poor implementation:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.386
	.model	flat, C
	.code

; MSC internal _aulldiv():
; receives arguments on stack, returns quotient in edx:eax

; NOTE: _aulldiv() can raise 'division by zero' exception!

_aulldiv proc	public		; qword _aulldiv(qword dividend, qword divisor)

	mov	ecx, [esp+8]	; ecx = high dword of dividend
	mov	eax, [esp+12]
	mov	edx, [esp+16]	; edx:eax = divisor
	cmp	[esp+4], eax
	sbb	ecx, edx
	jb	short trivial	; dividend < divisor?

	bsr	ecx, edx	; ecx = index of most significant '1' bit
				;        in high dword of divisor
	jnz	short extended	; high dword of divisor <> 0?

	; remainder < divisor < 2**32

	mov	ecx, eax	; ecx = (low dword of) divisor
	mov	eax, [esp+8]	; eax = high dword of dividend
	cmp	eax, ecx
	jae	short long	; high dword of dividend >= divisor?

	; perform normal division
normal:
	mov	edx, eax	; edx = high dword of dividend
	mov	eax, [esp+4]	; edx:eax = dividend
	div	ecx		; eax = (low dword of) quotient,
				; edx = (low dword of) remainder
	xor	edx, edx	; edx:eax = quotient

	ret	16		; callee restores stack

	; perform "long" alias "schoolbook" division
long:
;;	xor	edx, edx	; edx:eax = high dword of dividend
	div	ecx		; eax = high dword of quotient,
				; edx = high dword of remainder'
	push	eax		; [esp] = high dword of quotient
	mov	eax, [esp+8]	; eax = low dword of dividend
	div	ecx		; eax = low dword of quotient,
				; edx = (low dword of) remainder
	pop	edx		; edx:eax = quotient

	ret	16		; callee restores stack

	; dividend < divisor: quotient = 0
trivial:
	xor	eax, eax
	xor	edx, edx	; edx:eax = quotient = 0

	ret	16		; callee restores stack

	; dividend >= divisor >= 2**32: quotient < 2**32
extended:
	xor	ecx, 31		; ecx = number of leading '0' bits
				;        in (high dword of) divisor
	jz	short special	; divisor >= 2**63?

	; perform "extended & adjusted" division

	shld	edx, eax, cl	; edx = divisor / 2**(index + 1)
				;     = divisor'
;;	shl	eax, cl
	push	ebx
	mov	ebx, edx	; ebx = divisor'
ifndef JCCLESS
	xor	eax, eax	; eax = high dword of quotient' = 0
	mov	edx, [esp+12]	; edx = high dword of dividend
	cmp	edx, ebx
	jb	short @f	; high dword of dividend < divisor'?

	; high dword of dividend >= divisor':
	; subtract divisor' from high dword of dividend to prevent possible
	; division overflow and set most significant bit of quotient"

	sub	edx, ebx	; edx = high dword of dividend - divisor'
				;     = high dword of dividend'
	inc	eax		; eax = high dword of quotient' = 1
@@:
	push	eax		; [esp] = high dword of quotient'
else ; JCCLESS
	mov	edx, [esp+12]	; edx = high dword of dividend
	cmp	edx, ebx	; CF = (high dword of dividend < divisor')
	sbb	eax, eax	; eax = (high dword of dividend < divisor') ? -1 : 0
	inc	eax		; eax = (high dword of dividend < divisor') ? 0 : 1
				;     = high dword of quotient'
	push	eax		; [esp] = high dword of quotient'
if 0
	neg	eax		; eax = (high dword of dividend < divisor') ? 0 : -1
	and	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
else
	imul	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
endif
	sub	edx, eax	; edx = high dword of dividend
				;     - (high dword of dividend < divisor') ? 0 : divisor'
				;     = high dword of dividend'
endif ; JCCLESS
	mov	eax, [esp+12]	; eax = low dword of dividend
				;     = low dword of dividend'
	div	ebx		; eax = dividend' / divisor'
				;     = low dword of quotient',
				; edx = remainder'
	pop	ebx		; ebx = high dword of quotient'
	shld	ebx, eax, cl	; ebx = quotient' / 2**(index + 1)
				;     = dividend / divisor'
				;     = quotient"
;;	shl	eax, cl
	mov	eax, [esp+16]	; eax = low dword of divisor
	mul	ebx		; edx:eax = low dword of divisor * quotient"
ifndef JCCLESS
	mov	ecx, [esp+20]	; ecx = high dword of divisor
	imul	ecx, ebx	; ecx = high dword of divisor * quotient"
	add	edx, ecx	; edx:eax = divisor * quotient"
	jc	short @f	; divisor * quotient" >= 2**64?

	mov	ecx, [esp+12]	; ecx = high dword of dividend
	cmp	[esp+8], eax
	sbb	ecx, edx	; CF = (dividend < divisor * quotient")
				;    = (remainder" < 0)
@@:
	sbb	eax, eax	; eax = (remainder" < 0) ? -1 : 0
	add	eax, ebx	; eax = quotient" - (remainder" < 0)
				;     = (low dword of) quotient
	xor	edx, edx	; edx:eax = quotient
else ; JCCLESS
	mov	ecx, [esp+12]	; ecx = high dword of dividend
	cmp	[esp+8], eax
	sbb	ecx, edx	; ecx:... = dividend
				;         - low dword of divisor * quotient"
	mov	eax, [esp+20]	; eax = high dword of divisor
	imul	eax, ebx	; eax = high dword of divisor * quotient"
if 0
	sub	ecx, eax	; ecx:... = dividend - divisor * quotient"
				;         = remainder"
	sbb	eax, eax	; eax = (remainder" < 0) ? -1 : 0
	add	eax, ebx	; eax = quotient" - (remainder" < 0)
				;     = (low dword of) quotient
	xor	edx, edx	; edx:eax = quotient
else
	xor	edx, edx	; edx = high dword of quotient = 0
	sub	ecx, eax	; ecx:... = dividend - divisor * quotient"
				;         = remainder"
	mov	eax, ebx	; eax = quotient"
	sbb	eax, edx	; eax = quotient" - (remainder" < 0)
				;     = (low dword of) quotient
endif
endif ; JCCLESS
	pop	ebx
	ret	16		; callee restores stack

	; dividend >= divisor >= 2**63: quotient = 1
special:
	xor	eax, eax
	xor	edx, edx
	inc	eax		; edx:eax = quotient = 1

	ret	16		; callee restores stack

_aulldiv endp
	end

Save the i386 assembler source presented above as aulldiv.asm in the directory where you created the object library i386.lib before, then execute the following 2 command lines to generate the object file aulldiv.obj and add it to the existing object library i386.lib:

ML.EXE aulldiv.asm
LINK.EXE /LIB /OUT:i386.lib i386.lib aulldiv.obj

For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: the command lines can be copied and pasted as a block into a Command Processor window!

Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: aulldiv.asm

Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.
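
The case distinction of the _aulldiv implementation above (trivial, normal, long, special, extended) can be modelled in C. This is only a sketch for following the algorithm, not a replacement for the routine: the names aulldiv_model and div_64_by_32 are hypothetical, the DIV instruction is emulated with a 64-bit division, and a loop stands in for BSR:

```c
#include <stdint.h>

/* Model of DIV: divides high:low by divisor; requires high < divisor,
   so that the quotient fits into 32 bits. */
static uint32_t div_64_by_32(uint32_t high, uint32_t low, uint32_t divisor)
{
    return (uint32_t)((((uint64_t)high << 32) | low) / divisor);
}

/* Hypothetical model; like the routine, it faults for divisor 0. */
uint64_t aulldiv_model(uint64_t dividend, uint64_t divisor)
{
    uint32_t dividend_high = (uint32_t)(dividend >> 32);
    uint32_t dividend_low = (uint32_t)dividend;
    uint32_t divisor_high = (uint32_t)(divisor >> 32);
    uint32_t divisor_low = (uint32_t)divisor;

    if (dividend < divisor)                     /* trivial */
        return 0;

    if (divisor_high == 0) {                    /* divisor < 2**32 */
        if (dividend_high < divisor_low)        /* normal: 1 division */
            return div_64_by_32(dividend_high, dividend_low, divisor_low);

        /* long: 2 chained divisions */
        uint32_t quotient_high = dividend_high / divisor_low;
        uint32_t remainder_high = dividend_high % divisor_low;
        uint32_t quotient_low = div_64_by_32(remainder_high, dividend_low,
                                             divisor_low);
        return ((uint64_t)quotient_high << 32) | quotient_low;
    }

    if (divisor_high >> 31)                     /* special: >= 2**63 */
        return 1;

    /* extended: count the leading 0 bits (1 to 31) of the divisor */
    unsigned zeros = 1;
    while (((uint32_t)(divisor_high << zeros) >> 31) == 0)
        zeros++;

    /* divisor' = most significant 32 bits of the divisor (SHLD) */
    uint32_t scaled_divisor = (uint32_t)(divisor >> (32 - zeros));

    /* quotient' = dividend / divisor' has up to 33 bits: subtracting
       divisor' once beforehand yields bit 32 and prevents overflow */
    uint32_t high = dividend_high;
    uint64_t scaled_quotient = 0;
    if (high >= scaled_divisor) {
        high -= scaled_divisor;
        scaled_quotient = (uint64_t)1 << 32;
    }
    scaled_quotient |= div_64_by_32(high, dividend_low, scaled_divisor);

    /* quotient" is exact or 1 too large */
    uint32_t quotient = (uint32_t)(scaled_quotient >> (32 - zeros));

    /* multiply back and compare; a wrap means the product is >= 2**64 */
    uint64_t low_product = (uint64_t)divisor_low * quotient;
    uint64_t product = low_product
                     + ((uint64_t)(divisor_high * quotient) << 32);
    if (product < low_product || dividend < product)
        quotient--;

    return quotient;
}
```

Comparing the model against the native `/` operator of a 64-bit compiler for operands from each of the five cases is a cheap way to check the reasoning in the comments of the assembler source.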

`_aulldvrm` Routine

Execute the following 2 command lines to display the content of the assembler source file ulldvrm.asm shipped with the Visual C compiler:

DIR "%source%\intel\ulldvrm.asm"
TYPE "%source%\intel\ulldvrm.asm"

 Volume in drive C has no label.
 Volume Serial Number is 1957-0427

 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel

02/18/2011  03:40 PM             6,227 ulldvrm.asm
               1 File(s)          6,227 bytes
               0 Dir(s)    9,876,543,210 bytes free

        title   ulldvrm - unsigned long divide and remainder routine
;***
;ulldvrm.asm - unsigned long divide and remainder routine
;
;       Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
;       defines the unsigned long divide and remainder routine
;           __aulldvrm
;
;*******************************************************************************


.xlist
include cruntime.inc
include mm.inc
.list

;***
;ulldvrm - unsigned long divide and remainder
;
;Purpose:
;       Does a unsigned long divide and remainder of the arguments.  Arguments
;       are not changed.
;
;Entry:
;       Arguments are passed on the stack:
;               1st pushed: divisor (QWORD)
;               2nd pushed: dividend (QWORD)
;
;Exit:
;       EDX:EAX contains the quotient (dividend/divisor)
;       EBX:ECX contains the remainder (divided % divisor)
;       NOTE: this routine removes the parameters from the stack.
;
;Uses:
;       ECX
;
;Exceptions:
;
;*******************************************************************************

        CODESEG

_aulldvrm PROC NEAR
.FPO (1, 4, 0, 0, 0, 0)

        push    esi

; Set up the local stack and save the index registers.  When this is done
; the stack frame will look as follows (assuming that the expression a/b will
; generate a call to aulldvrm(a, b)):
;
;               -----------------
;               |               |
;               |---------------|
;               |               |
;               |--divisor (b)--|
;               |               |
;               |---------------|
;               |               |
;               |--dividend (a)-|
;               |               |
;               |---------------|
;               | return addr** |
;               |---------------|
;       ESP---->|      ESI      |
;               -----------------
;

DVND    equ     [esp + 8]       ; stack address of dividend (a)
DVSR    equ     [esp + 16]      ; stack address of divisor (b)

;
; Now do the divide.  First look to see if the divisor is less than 4194304K.
; If so, then we can use a simple algorithm with word divides, otherwise
; things get a little more complex.
;

        mov     eax,HIWORD(DVSR) ; check to see if divisor < 4194304K
        or      eax,eax
        jnz     short L1        ; nope, gotta do this the hard way
        mov     ecx,LOWORD(DVSR) ; load divisor
        mov     eax,HIWORD(DVND) ; load high word of dividend
        xor     edx,edx
        div     ecx             ; get high order bits of quotient
        mov     ebx,eax         ; save high bits of quotient
        mov     eax,LOWORD(DVND) ; edx:eax <- remainder:lo word of dividend
        div     ecx             ; get low order bits of quotient
        mov     esi,eax         ; ebx:esi <- quotient

;
; Now we need to do a multiply so that we can compute the remainder.
;
        mov     eax,ebx         ; set up high word of quotient
        mul     dword ptr LOWORD(DVSR) ; HIWORD(QUOT) * DVSR
        mov     ecx,eax         ; save the result in ecx
        mov     eax,esi         ; set up low word of quotient
        mul     dword ptr LOWORD(DVSR) ; LOWORD(QUOT) * DVSR
        add     edx,ecx         ; EDX:EAX = QUOT * DVSR
        jmp     short L2        ; complete remainder calculation

;
; Here we do it the hard way.  Remember, eax contains DVSRHI
;

L1:
        mov     ecx,eax         ; ecx:ebx <- divisor
        mov     ebx,LOWORD(DVSR)
        mov     edx,HIWORD(DVND) ; edx:eax <- dividend
        mov     eax,LOWORD(DVND)
L3:
        shr     ecx,1           ; shift divisor right one bit; hi bit <- 0
        rcr     ebx,1
        shr     edx,1           ; shift dividend right one bit; hi bit <- 0
        rcr     eax,1
        or      ecx,ecx
        jnz     short L3        ; loop until divisor < 4194304K
        div     ebx             ; now divide, ignore remainder
        mov     esi,eax         ; save quotient

;
; We may be off by one, so to check, we will multiply the quotient
; by the divisor and check the result against the orignal dividend
; Note that we must also check for overflow, which can occur if the
; dividend is close to 2**64 and the quotient is off by 1.
;

        mul     dword ptr HIWORD(DVSR) ; QUOT * HIWORD(DVSR)
        mov     ecx,eax
        mov     eax,LOWORD(DVSR)
        mul     esi             ; QUOT * LOWORD(DVSR)
        add     edx,ecx         ; EDX:EAX = QUOT * DVSR
        jc      short L4        ; carry means Quotient is off by 1

;
; do long compare here between original dividend and the result of the
; multiply in edx:eax.  If original is larger or equal, we are ok, otherwise
; subtract one (1) from the quotient.
;

        cmp     edx,HIWORD(DVND) ; compare hi words of result and original
        ja      short L4        ; if result > original, do subtract
        jb      short L5        ; if result < original, we are ok
        cmp     eax,LOWORD(DVND) ; hi words are equal, compare lo words
        jbe     short L5        ; if less or equal we are ok, else subtract
L4:
        dec     esi             ; subtract 1 from quotient
        sub     eax,LOWORD(DVSR) ; subtract divisor from result
        sbb     edx,HIWORD(DVSR)
L5:
        xor     ebx,ebx         ; ebx:esi <- quotient

L2:
;
; Calculate remainder by subtracting the result from the original dividend.
; Since the result is already in a register, we will do the subtract in the
; opposite direction and negate the result.
;

        sub     eax,LOWORD(DVND) ; subtract dividend from result
        sbb     edx,HIWORD(DVND)
        neg     edx             ; otherwise, negate the result
        neg     eax
        sbb     edx,0

;
; Now we need to get the quotient into edx:eax and the remainder into ebx:ecx.
;
        mov     ecx,edx
        mov     edx,ebx
        mov     ebx,ecx
        mov     ecx,eax
        mov     eax,esi
;
; Just the cleanup left to do.  edx:eax contains the quotient.
; Restore the saved registers and return.
;

        pop     esi

        ret     16

_aulldvrm ENDP

        end

58 instructions in 149 bytes (plus 11 bytes for alignment).

OUCH: the comment “Now we need to do a multiply so that we can compute the remainder.” and the code following it are a remarkable gem: after the second DIV instruction, the remainder is already present in register EDX!
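
That the 2 chained divisions already deliver the remainder can be seen from a C sketch of this branch of the routine; the function name divide_by_dword is hypothetical, and the divisor is assumed to be non-zero and less than 2**32:

```c
#include <stdint.h>

/* Divides a 64-bit dividend by a non-zero 32-bit divisor with 2 chained
   divisions; the second division leaves the remainder behind, so no
   multiplication and subtraction are required to compute it. */
uint64_t divide_by_dword(uint64_t dividend, uint32_t divisor,
                         uint32_t *remainder)
{
    uint32_t high = (uint32_t)(dividend >> 32);
    uint32_t low = (uint32_t)dividend;

    uint32_t quotient_high = high / divisor;    /* EAX after 1st DIV */
    uint32_t partial = high % divisor;          /* EDX after 1st DIV */

    uint64_t rest = ((uint64_t)partial << 32) | low;
    uint32_t quotient_low = (uint32_t)(rest / divisor);  /* EAX after 2nd DIV */
    *remainder = (uint32_t)(rest % divisor);             /* EDX after 2nd DIV */

    return ((uint64_t)quotient_high << 32) | quotient_low;
}
```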

Implementation in i386 Assembler

A proper (and several times faster) implementation follows. With the text macro JCCLESS defined, it targets processors which feature speculative execution and uses 75 instructions in 193 bytes (plus 15 bytes for alignment); without it, it targets processors which don’t feature speculative execution and uses 72 instructions in 185 bytes (plus 7 bytes for alignment). Both counts include 18 instructions in 50 bytes for the special and trivial cases not covered by Microsoft’s poor implementation:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.386
	.model	flat, C
	.code

; MSC internal _aulldvrm():
; receives arguments on stack, returns quotient in edx:eax and remainder in ebx:ecx

; NOTE: _aulldvrm() can raise 'division by zero' exception!

_aulldvrm proc	public		; qword _aulldvrm(qword dividend, qword divisor)

	mov	ecx, [esp+8]	; ecx = high dword of dividend
	mov	eax, [esp+12]
	mov	edx, [esp+16]	; edx:eax = divisor
	cmp	[esp+4], eax
	sbb	ecx, edx
	jb	short trivial	; dividend < divisor?

	bsr	ecx, edx	; ecx = index of most significant '1' bit
				;        in high dword of divisor
	jnz	short extended	; high dword of divisor <> 0?

	; remainder < divisor < 2**32

	mov	ecx, eax	; ecx = (low dword of) divisor
	mov	eax, [esp+8]	; eax = high dword of dividend
	cmp	eax, ecx
	jae	short long	; high dword of dividend >= divisor?

	; perform normal division
normal:
	mov	edx, eax	; edx = high dword of dividend
	mov	eax, [esp+4]	; edx:eax = dividend
	div	ecx		; eax = (low dword of) quotient,
				; edx = (low dword of) remainder
	mov	ecx, edx	; ecx = (low dword of) remainder
	xor	ebx, ebx	; ebx:ecx = remainder
	xor	edx, edx	; edx:eax = quotient

	ret	16		; callee restores stack

	; perform "long" alias "schoolbook" division
long:
;;	xor	edx, edx	; edx:eax = high dword of dividend
	div	ecx		; eax = high dword of quotient,
				; edx = high dword of remainder'
	mov	ebx, eax	; ebx = high dword of quotient
	mov	eax, [esp+4]	; eax = low dword of dividend
	div	ecx		; eax = low dword of quotient,
				; edx = (low dword of) remainder
	mov	ecx, edx	; ecx = (low dword of) remainder
	mov	edx, ebx	; edx:eax = quotient
	xor	ebx, ebx	; ebx:ecx = remainder

	ret	16		; callee restores stack

	; dividend < divisor: quotient = 0, remainder = dividend
trivial:
	mov	ecx, [esp+4]
	mov	ebx, [esp+8]	; ebx:ecx = remainder = dividend
	xor	eax, eax
	xor	edx, edx	; edx:eax = quotient = 0

	ret	16		; callee restores stack

	; dividend >= divisor >= 2**32: quotient < 2**32
extended:
	xor	ecx, 31		; ecx = number of leading '0' bits
				;        in (high dword of) divisor
	jz	short special	; divisor >= 2**63?

	; perform "extended & adjusted" division

	shld	edx, eax, cl	; edx = divisor / 2**(index + 1)
				;     = divisor'
;;	shl	eax, cl
	mov	ebx, edx	; ebx = divisor'
ifndef JCCLESS
	xor	eax, eax	; eax = high dword of quotient' = 0
	mov	edx, [esp+8]	; edx = high dword of dividend
	cmp	edx, ebx
	jb	short @f	; high dword of dividend < divisor'?

	; high dword of dividend >= divisor':
	; subtract divisor' from high dword of dividend to prevent possible
	; division overflow and set most significant bit of quotient"

	sub	edx, ebx	; edx = high dword of dividend - divisor'
				;     = high dword of dividend'
	inc	eax		; eax = high dword of quotient' = 1
@@:
	push	eax		; [esp] = high dword of quotient'
else ; JCCLESS
	mov	edx, [esp+8]	; edx = high dword of dividend
	cmp	edx, ebx	; CF = (high dword of dividend < divisor')
	sbb	eax, eax	; eax = (high dword of dividend < divisor') ? -1 : 0
	inc	eax		; eax = (high dword of dividend < divisor') ? 0 : 1
				;     = high dword of quotient'
	push	eax		; [esp] = high dword of quotient'
if 0
	neg	eax		; eax = (high dword of dividend < divisor') ? 0 : -1
	and	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
else
	imul	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
endif
	sub	edx, eax	; edx = high dword of dividend
				;     - (high dword of dividend < divisor') ? 0 : divisor'
				;     = high dword of dividend'
endif ; JCCLESS
	mov	eax, [esp+8]	; eax = low dword of dividend
				;     = low dword of dividend'
	div	ebx		; eax = dividend' / divisor'
				;     = low dword of quotient',
				; edx = remainder'
	pop	ebx		; ebx = high dword of quotient'
	shld	ebx, eax, cl	; ebx = quotient' / 2**(index + 1)
				;     = dividend / divisor'
				;     = quotient"
;;	shl	eax, cl
	mov	eax, [esp+12]	; eax = low dword of divisor
	mul	ebx		; edx:eax = low dword of divisor * quotient"
	mov	ecx, [esp+16]	; ecx = high dword of divisor
	imul	ecx, ebx	; ecx = high dword of divisor * quotient"
	push	ebx		; [esp] = quotient"
	mov	ebx, [esp+12]	; ebx = high dword of dividend
	sub	ebx, ecx	; ebx = high dword of dividend
				;     - high dword of divisor * quotient"
	mov	ecx, [esp+8]	; ecx = low dword of dividend
	sub	ecx, eax
	sbb	ebx, edx	; ebx:ecx = dividend - divisor * quotient"
				;         = remainder"
ifndef JCCLESS
	pop	eax		; eax = quotient"
	jnb	short @f	; remainder" >= 0?
				;  (with borrow, it is off by divisor,
				;   and quotient" is off by 1)
	add	ecx, [esp+12]
	adc	ebx, [esp+16]	; ebx:ecx = remainder" + divisor
				;         = remainder
	dec	eax		; eax = quotient" - 1
				;     = (low dword of) quotient
@@:
else ; JCCLESS
	sbb	eax, eax	; eax = (remainder" < 0) ? -1 : 0
	cdq			; edx = (remainder" < 0) ? -1 : 0
	add	[esp], eax	; [esp] = quotient" - (remainder" < 0)
				;       = (low dword of) quotient
	and	eax, [esp+16]
	and	edx, [esp+20]	; edx:eax = (remainder" < 0) ? divisor : 0
	add	ecx, eax
	adc	ebx, edx	; ebx:ecx = remainder" + divisor
				;         = remainder
	pop	eax		; eax = (low dword of) quotient
endif ; JCCLESS
	xor	edx, edx	; edx:eax = quotient

	ret	16		; callee restores stack

	; dividend >= divisor >= 2**63:
	; quotient = 1, remainder = dividend - divisor
special:
	mov	ecx, [esp+4]
	mov	ebx, [esp+8]	; ebx:ecx = dividend
	sub	ecx, eax
	sbb	ebx, edx	; ebx:ecx = dividend - divisor
				;         = remainder
	xor	eax, eax
	xor	edx, edx
	inc	eax		; edx:eax = quotient = 1

	ret	16		; callee restores stack

_aulldvrm endp
	end

Save the i386 assembler source presented above as aulldvrm.asm in the directory where you created the object library i386.lib before, then execute the following 2 command lines to generate the object file aulldvrm.obj and add it to the existing object library i386.lib:

ML.EXE aulldvrm.asm
LINK.EXE /LIB /OUT:i386.lib i386.lib aulldvrm.obj

For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: the command lines can be copied and pasted as a block into a Command Processor window!

Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: aulldvrm.asm

Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

`_aullrem` Routine

Execute the following 2 command lines to display the content of the assembler source file ullrem.asm shipped with the Visual C compiler:

DIR "%source%\intel\ullrem.asm"
TYPE "%source%\intel\ullrem.asm"

 Volume in drive C has no label.
 Volume Serial Number is 1957-0427

 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel

02/18/2011  03:40 PM             5,330 ullrem.asm
               1 File(s)          5,330 bytes
               0 Dir(s)    9,876,543,210 bytes free

        title   ullrem - unsigned long remainder routine
;***
;ullrem.asm - unsigned long remainder routine
;
;       Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
;       defines the unsigned long remainder routine
;           __aullrem
;
;*******************************************************************************


.xlist
include cruntime.inc
include mm.inc
.list

;***
;ullrem - unsigned long remainder
;
;Purpose:
;       Does a unsigned long remainder of the arguments.  Arguments are
;       not changed.
;
;Entry:
;       Arguments are passed on the stack:
;               1st pushed: divisor (QWORD)
;               2nd pushed: dividend (QWORD)
;
;Exit:
;       EDX:EAX contains the remainder (dividend%divisor)
;       NOTE: this routine removes the parameters from the stack.
;
;Uses:
;       ECX
;
;Exceptions:
;
;*******************************************************************************

        CODESEG

_aullrem        PROC NEAR
.FPO (1, 4, 0, 0, 0, 0)

        push    ebx

; Set up the local stack and save the index registers.  When this is done
; the stack frame will look as follows (assuming that the expression a%b will
; generate a call to ullrem(a, b)):
;
;               -----------------
;               |               |
;               |---------------|
;               |               |
;               |--divisor (b)--|
;               |               |
;               |---------------|
;               |               |
;               |--dividend (a)-|
;               |               |
;               |---------------|
;               | return addr** |
;               |---------------|
;       ESP---->|      EBX      |
;               -----------------
;

DVND    equ     [esp + 8]       ; stack address of dividend (a)
DVSR    equ     [esp + 16]      ; stack address of divisor (b)

; Now do the divide.  First look to see if the divisor is less than 4194304K.
; If so, then we can use a simple algorithm with word divides, otherwise
; things get a little more complex.
;

        mov     eax,HIWORD(DVSR) ; check to see if divisor < 4194304K
        or      eax,eax
        jnz     short L1        ; nope, gotta do this the hard way
        mov     ecx,LOWORD(DVSR) ; load divisor
        mov     eax,HIWORD(DVND) ; load high word of dividend
        xor     edx,edx
        div     ecx             ; edx <- remainder, eax <- quotient
        mov     eax,LOWORD(DVND) ; edx:eax <- remainder:lo word of dividend
        div     ecx             ; edx <- final remainder
        mov     eax,edx         ; edx:eax <- remainder
        xor     edx,edx
        jmp     short L2        ; restore stack and return

;
; Here we do it the hard way.  Remember, eax contains DVSRHI
;

L1:
        mov     ecx,eax         ; ecx:ebx <- divisor
        mov     ebx,LOWORD(DVSR)
        mov     edx,HIWORD(DVND) ; edx:eax <- dividend
        mov     eax,LOWORD(DVND)
L3:
        shr     ecx,1           ; shift divisor right one bit; hi bit <- 0
        rcr     ebx,1
        shr     edx,1           ; shift dividend right one bit; hi bit <- 0
        rcr     eax,1
        or      ecx,ecx
        jnz     short L3        ; loop until divisor < 4194304K
        div     ebx             ; now divide, ignore remainder

;
; We may be off by one, so to check, we will multiply the quotient
; by the divisor and check the result against the orignal dividend
; Note that we must also check for overflow, which can occur if the
; dividend is close to 2**64 and the quotient is off by 1.
;

        mov     ecx,eax         ; save a copy of quotient in ECX
        mul     dword ptr HIWORD(DVSR)
        xchg    ecx,eax         ; put partial product in ECX, get quotient in EAX
        mul     dword ptr LOWORD(DVSR)
        add     edx,ecx         ; EDX:EAX = QUOT * DVSR
        jc      short L4        ; carry means Quotient is off by 1

;
; do long compare here between original dividend and the result of the
; multiply in edx:eax.  If original is larger or equal, we're ok, otherwise
; subtract the original divisor from the result.
;

        cmp     edx,HIWORD(DVND) ; compare hi words of result and original
        ja      short L4        ; if result > original, do subtract
        jb      short L5        ; if result < original, we're ok
        cmp     eax,LOWORD(DVND) ; hi words are equal, compare lo words
        jbe     short L5        ; if less or equal we're ok, else subtract
L4:
        sub     eax,LOWORD(DVSR) ; subtract divisor from result
        sbb     edx,HIWORD(DVSR)
L5:

;
; Calculate remainder by subtracting the result from the original dividend.
; Since the result is already in a register, we will perform the subtract in
; the opposite direction and negate the result to make it positive.
;

        sub     eax,LOWORD(DVND) ; subtract original dividend from result
        sbb     edx,HIWORD(DVND)
        neg     edx             ; and negate it
        neg     eax
        sbb     edx,0

;
; Just the cleanup left to do.  dx:ax contains the remainder.
; Restore the saved registers and return.
;

L2:

        pop     ebx

        ret     16

_aullrem        ENDP

        end

44 instructions in 117 bytes (plus 11 bytes for alignment).

Implementation in i386 Assembler

A proper (and several times faster) implementation uses 65 instructions in 173 bytes (plus 3 bytes for alignment) when the text macro JCCLESS is defined, targeting processors which feature speculative execution, else 64 instructions in 171 bytes (plus 5 bytes for alignment), targeting processors which don’t feature speculative execution; both counts include 14 instructions in 43 bytes for the special and trivial cases not covered by Microsoft’s poor implementation:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.386
	.model	flat, C
	.code

; MSC internal _aullrem():
; receives arguments on stack, returns remainder in edx:eax

; NOTE: _aullrem() can raise 'division by zero' exception!

_aullrem proc	public		; qword _aullrem(qword dividend, qword divisor)

	mov	ecx, [esp+8]	; ecx = high dword of dividend
	mov	eax, [esp+12]
	mov	edx, [esp+16]	; edx:eax = divisor
	cmp	[esp+4], eax
	sbb	ecx, edx
	jb	short trivial	; dividend < divisor?

	bsr	ecx, edx	; ecx = index of most significant '1' bit
				;        in high dword of divisor
	jnz	short extended	; high dword of divisor <> 0?

	; remainder < divisor < 2**32

	mov	ecx, eax	; ecx = (low dword of) divisor
	mov	eax, [esp+8]	; eax = high dword of dividend
	cmp	eax, ecx
	jae	short long	; high dword of dividend >= divisor?

	; perform normal division
normal:
	mov	edx, eax	; edx = high dword of dividend
	mov	eax, [esp+4]	; edx:eax = dividend
	div	ecx		; eax = (low dword of) quotient,
				; edx = (low dword of) remainder
	mov	eax, edx	; eax = (low dword of) remainder
	xor	edx, edx	; edx:eax = remainder

	ret	16		; callee restores stack

	; perform "long" alias "schoolbook" division
long:
;;	xor	edx, edx	; edx:eax = high dword of dividend
	div	ecx		; eax = high dword of quotient,
				; edx = high dword of remainder'
	mov	eax, [esp+4]	; eax = low dword of dividend
	div	ecx		; eax = low dword of quotient,
				; edx = (low dword of) remainder
	mov	eax, edx	; eax = (low dword of) remainder
	xor	edx, edx	; edx:eax = remainder

	ret	16		; callee restores stack

	; dividend < divisor: remainder = dividend
trivial:
	mov	eax, [esp+4]
	mov	edx, [esp+8]	; edx:eax = remainder = dividend

	ret	16		; callee restores stack

	; dividend >= divisor >= 2**32: quotient < 2**32
extended:
	xor	ecx, 31		; ecx = number of leading '0' bits
				;        in (high dword of) divisor
	jz	short special	; divisor >= 2**63?

	; perform "extended & adjusted" division

	shld	edx, eax, cl	; edx = divisor / 2**(index + 1)
				;     = divisor'
;;	shl	eax, cl
	push	ebx
	mov	ebx, edx	; ebx = divisor'
ifndef JCCLESS
	xor	eax, eax	; eax = high dword of quotient' = 0
	mov	edx, [esp+12]	; edx = high dword of dividend
	cmp	edx, ebx
	jb	short @f	; high dword of dividend < divisor'?

	; high dword of dividend >= divisor':
	; subtract divisor' from high dword of dividend to prevent possible
	; division overflow and set most significant bit of quotient"

	sub	edx, ebx	; edx = high dword of dividend - divisor'
				;     = high dword of dividend'
	inc	eax		; eax = high dword of quotient' = 1
@@:
	push	eax		; [esp] = high dword of quotient'
else ; JCCLESS
	mov	edx, [esp+12]	; edx = high dword of dividend
	cmp	edx, ebx	; CF = (high dword of dividend < divisor')
	sbb	eax, eax	; eax = (high dword of dividend < divisor') ? -1 : 0
	inc	eax		; eax = (high dword of dividend < divisor') ? 0 : 1
				;     = high dword of quotient'
	push	eax		; [esp] = high dword of quotient'
if 0
	neg	eax		; eax = (high dword of dividend < divisor') ? 0 : -1
	and	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
else
	imul	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
endif
	sub	edx, eax	; edx = high dword of dividend
				;     - (high dword of dividend < divisor') ? 0 : divisor'
				;     = high dword of dividend'
endif ; JCCLESS
	mov	eax, [esp+12]	; eax = low dword of dividend
				;     = low dword of dividend'
	div	ebx		; eax = dividend' / divisor'
				;     = low dword of quotient',
				; edx = remainder'
	pop	ebx		; ebx = high dword of quotient'
	shld	ebx, eax, cl	; ebx = quotient' / 2**(index + 1)
				;     = dividend / divisor'
				;     = quotient"
;;	shl	eax, cl
	mov	eax, [esp+16]	; eax = low dword of divisor
	mul	ebx		; edx:eax = low dword of divisor * quotient"
	imul	ebx, [esp+20]	; ebx = high dword of divisor * quotient"
	mov	ecx, [esp+12]	; ecx = high dword of dividend
	sub	ecx, ebx	; ecx = high dword of dividend
				;     - high dword of divisor * quotient"
	mov	ebx, [esp+8]	; ebx = low dword of dividend
	sub	ebx, eax
	sbb	ecx, edx	; ecx:ebx = dividend - divisor * quotient"
				;         = remainder"
ifndef JCCLESS
	jnb	short @f	; remainder" >= 0?
				;  (with borrow, it is off by divisor,
				;   and quotient" is off by 1)
	add	ebx, [esp+16]
	adc	ecx, [esp+20]	; ecx:ebx = remainder" + divisor
				;         = remainder
@@:
	mov	eax, ebx
	mov	edx, ecx	; edx:eax = remainder
else ; JCCLESS
	sbb	eax, eax	; eax = (remainder" < 0) ? -1 : 0
	cdq			; edx = (remainder" < 0) ? -1 : 0
	and	eax, [esp+16]
	and	edx, [esp+20]	; edx:eax = (remainder" < 0) ? divisor : 0
	add	eax, ebx
	adc	edx, ecx	; edx:eax = remainder" + divisor
				;         = remainder
endif ; JCCLESS
	pop	ebx
	ret	16		; callee restores stack

	; dividend >= divisor >= 2**63: remainder = dividend - divisor
special:
if 0
	mov	eax, [esp+4]
	mov	edx, [esp+8]	; edx:eax = dividend
	sub	eax, [esp+12]
	sbb	edx, [esp+16]	; edx:eax = dividend - divisor
				;         = remainder
else
	neg	edx
	neg	eax
	sbb	edx, ecx	; edx:eax = -divisor
	add	eax, [esp+4]
	adc	edx, [esp+8]	; edx:eax = dividend - divisor
				;         = remainder
endif
	ret	16		; callee restores stack

_aullrem endp
	end

Save the i386 assembler source presented above as aullrem.asm in the directory where you created the object library i386.lib before, then execute the following 2 command lines to generate the object file aullrem.obj and add it to the existing object library i386.lib:

ML.EXE aullrem.asm
LINK.EXE /LIB /OUT:i386.lib i386.lib aullrem.obj

For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: the command lines can be copied and pasted as a block into a Command Processor window!

Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: aullrem.asm

Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.
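
For readers who prefer C, the case distinction of the _aullrem implementation shown above (trivial, normal or long, special, extended and adjusted) can be modelled as follows. This is only a minimal sketch under stated assumptions: the names clz32 and rem64_model are hypothetical and not part of the article, the 64÷32-bit DIV steps are expressed with the compiler's native 64-bit operators, and the final correction tests `r >= divisor` instead of the borrow flag used by the assembler code.

```c
#include <stdint.h>

/* Number of leading '0' bits of a nonzero dword
   (portable stand-in for "31 - BSR"; hypothetical helper). */
static unsigned clz32(uint32_t x)
{
    unsigned n = 0;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Hypothetical C model of the unsigned 64-bit remainder routine. */
uint64_t rem64_model(uint64_t dividend, uint64_t divisor)
{
    uint32_t hi = (uint32_t)(divisor >> 32);

    if (dividend < divisor)         /* trivial: remainder = dividend */
        return dividend;

    if (hi == 0) {                  /* divisor < 2**32: "long" division */
        uint32_t d = (uint32_t)divisor;     /* d == 0 faults, like DIV */
        uint32_t r = (uint32_t)(dividend >> 32) % d;
        return (((uint64_t)r << 32) | (uint32_t)dividend) % d;
    }

    unsigned n = clz32(hi);
    if (n == 0)                     /* special: divisor >= 2**63, quotient = 1 */
        return dividend - divisor;

    /* extended: divisor' = upper dword of the normalised divisor */
    uint32_t d1 = (uint32_t)((divisor << n) >> 32);
    uint64_t q1 = dividend / d1;                /* quotient', up to 33 bits */
    uint32_t q2 = (uint32_t)(q1 >> (32 - n));   /* quotient", exact or 1 too big */

    uint64_t r = dividend - divisor * q2;       /* wraps when q2 is 1 too big */
    if (r >= divisor)               /* wrapped value is > 2**63 > divisor */
        r += divisor;               /* adjusted: add divisor back */
    return r;
}
```

Calling `rem64_model(0xFFFFFFFFFFFFFFFFULL, 0x00000001C0000001ULL)` exercises the adjustment step, since the estimated quotient is too big by 1 for this pair of operands.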

`_aullshr` Routine

Execute the following 2 command lines to display the content of the assembler source file ullshr.asm shipped with the Visual C compiler:

DIR "%source%\intel\ullshr.asm"
TYPE "%source%\intel\ullshr.asm"

 Volume in drive C has no label.
 Volume Serial Number is 1957-0427

 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel

02/18/2011  03:40 PM             1,545 ullshr.asm
               1 File(s)          1,545 bytes
               0 Dir(s)    9,876,543,210 bytes free

        title   ullshr - long shift right
;***
;ullshr.asm - long shift right
;
;       Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
;       define unsigned long shift right routine
;           __aullshr
;
;*******************************************************************************


.xlist
include cruntime.inc
include mm.inc
.list

;***
;ullshr - long shift right
;
;Purpose:
;       Does a unsigned Long Shift Right
;       Shifts a long right any number of bits.
;
;Entry:
;       EDX:EAX - long value to be shifted
;       CL    - number of bits to shift by
;
;Exit:
;       EDX:EAX - shifted value
;
;Uses:
;       CL is destroyed.
;
;Exceptions:
;
;*******************************************************************************

        CODESEG

_aullshr        PROC NEAR
.FPO (0, 0, 0, 0, 0, 0)

;
; Handle shifts of 64 bits or more (if shifting 64 bits or more, the result
; depends only on the high order bit of edx).
;
        cmp     cl,64
        jae     short RETZERO

;
; Handle shifts of between 0 and 31 bits
;
        cmp     cl, 32
        jae     short MORE32
        shrd    eax,edx,cl
        shr     edx,cl
        ret

;
; Handle shifts of between 32 and 63 bits
;
MORE32:
        mov     eax,edx
        xor     edx,edx
        and     cl,31
        shr     eax,cl
        ret

;
; return 0 in edx:eax
;
RETZERO:
        xor     eax,eax
        xor     edx,edx
        ret

_aullshr        ENDP

        end

15 instructions in 31 bytes (plus 1 byte for alignment).

OUCH: i386 and newer processors use only the 5 least significant bits of the count for 32-bit shift operations, i.e. they shift modulo the register size; the AND CL, 31 instruction is therefore superfluous!
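
The effect of the count masking can be modelled in portable C. The following is a minimal sketch, not the routine itself: the names shr32 and shr64_model are hypothetical, and shr32 merely imitates the SHR instruction by using only the 5 least significant bits of the count.

```c
#include <stdint.h>

/* Imitates the i386 SHR instruction on a dword:
   only the 5 least significant bits of the count are used. */
static uint32_t shr32(uint32_t x, unsigned count)
{
    return x >> (count & 31);
}

/* Hypothetical model of a 64-bit logical right shift built from dwords. */
uint64_t shr64_model(uint64_t value, unsigned char count)
{
    uint32_t lo = (uint32_t)value;
    uint32_t hi = (uint32_t)(value >> 32);

    if (count > 63)                 /* all bits shifted out */
        return 0;

    if (count > 31)                 /* 32..63: no explicit "count & 31" here, */
        return shr32(hi, count);    /* shr32() masks like the processor does */

    if (count != 0)                 /* SHRD: fill low dword from high dword */
        lo = (lo >> count) | (hi << (32 - count));
    hi = shr32(hi, count);
    return ((uint64_t)hi << 32) | lo;
}
```

For a count of 36, for example, shr32 shifts the high dword by 36 & 31 = 4 bits, which is exactly the shift required after the 32-bit move from the high to the low dword.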

Implementation in i386 Assembler

A proper implementation prefers the more likely smaller shift counts and uses 13 instructions in 24 bytes (plus 8 bytes for alignment):

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.386
	.model	flat, C
	.code

; MSC internal _aullshr():
; receives arguments in edx:eax and cl, returns result in edx:eax

_aullshr proc	public		; qword _aullshr(qword value, byte count)

	cmp	cl, 31
	ja	short @f	; count > 31?

	shrd	eax, edx, cl
	shr	edx, cl		; edx:eax = result
	ret
@@:
	xor	eax, eax	; eax = high dword of result
				;     = 0
	cmp	cl, 63
	ja	short @f	; count > 63?

	xchg	eax, edx	; eax = high dword of value,
				; edx = high dword of result = 0
	shr	eax, cl		; edx:eax = result
	ret
@@:
	cdq			; edx:eax = result = 0
	ret

_aullshr endp
	end

Save the i386 assembler source presented above as aullshr.asm in the directory where you created the object library i386.lib before, then execute the following 2 command lines to generate the object file aullshr.obj and add it to the existing object library i386.lib:

ML.EXE aullshr.asm
LINK.EXE /LIB /OUT:i386.lib i386.lib aullshr.obj

For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: the command lines can be copied and pasted as a block into a Command Processor window!

Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: aullshr.asm

Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

`_allshl` Routine

Execute the following 2 command lines to display the content of the assembler source file llshl.asm shipped with the Visual C compiler:

DIR "%source%\intel\llshl.asm"
TYPE "%source%\intel\llshl.asm"

 Volume in drive C has no label.
 Volume Serial Number is 1957-0427

 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel

02/18/2011  03:40 PM             1,493 llshl.asm
               1 File(s)          1,493 bytes
               0 Dir(s)    9,876,543,210 bytes free

        title   llshl - long shift left
;***
;llshl.asm - long shift left
;
;       Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
;       define long shift left routine (signed and unsigned are same)
;           __allshl
;
;*******************************************************************************


.xlist
include cruntime.inc
include mm.inc
.list

;***
;llshl - long shift left
;
;Purpose:
;       Does a Long Shift Left (signed and unsigned are identical)
;       Shifts a long left any number of bits.
;
;Entry:
;       EDX:EAX - long value to be shifted
;       CL    - number of bits to shift by
;
;Exit:
;       EDX:EAX - shifted value
;
;Uses:
;       CL is destroyed.
;
;Exceptions:
;
;*******************************************************************************

        CODESEG

_allshl PROC NEAR
.FPO (0, 0, 0, 0, 0 ,0)

;
; Handle shifts of 64 or more bits (all get 0)
;
        cmp     cl, 64
        jae     short RETZERO

;
; Handle shifts of between 0 and 31 bits
;
        cmp     cl, 32
        jae     short MORE32
        shld    edx,eax,cl
        shl     eax,cl
        ret

;
; Handle shifts of between 32 and 63 bits
;
MORE32:
        mov     edx,eax
        xor     eax,eax
        and     cl,31
        shl     edx,cl
        ret

;
; return 0 in edx:eax
;
RETZERO:
        xor     eax,eax
        xor     edx,edx
        ret

_allshl ENDP

        end

15 instructions in 31 bytes (plus 1 byte for alignment).

OUCH: i386 and newer processors use only the 5 least significant bits of the count for 32-bit shift operations, i.e. they shift modulo the register size; the AND CL, 31 instruction is therefore superfluous!

Implementation in i386 Assembler

A proper implementation prefers the more likely smaller shift counts and uses 13 instructions in 25 bytes (plus 7 bytes for alignment):

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.386
	.model	flat, C
	.code

; MSC internal _allshl():
; receives arguments in edx:eax and cl, returns result in edx:eax

_allshl	proc	public		; sqword _allshl(sqword value, byte count)

	cmp	cl, 31
	ja	short @f	; count > 31?

	shld	edx, eax, cl
	shl	eax, cl		; edx:eax = result
	ret
@@:
	mov	edx, eax	; edx = low dword of value
	xor	eax, eax	; eax = low dword of result
				;     = 0
	cmp	cl, 63
	ja	short @f	; count > 63?

	shl	edx, cl		; edx:eax = result
	ret
@@:
	cdq			; edx:eax = result = 0
	ret

_allshl	endp
	end

Save the i386 assembler source presented above as allshl.asm in the directory where you created the object library i386.lib before, then execute the following 2 command lines to generate the object file allshl.obj and add it to the existing object library i386.lib:

ML.EXE allshl.asm
LINK.EXE /LIB /OUT:i386.lib i386.lib allshl.obj

For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: the command lines can be copied and pasted as a block into a Command Processor window!

Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: allshl.asm

Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

`_allshr` Routine

Execute the following 2 command lines to display the content of the assembler source file llshr.asm shipped with the Visual C compiler:

DIR "%source%\intel\llshr.asm"
TYPE "%source%\intel\llshr.asm"

 Volume in drive C has no label.
 Volume Serial Number is 1957-0427

 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel

02/18/2011  03:40 PM             1,561 llshr.asm
               1 File(s)          1,561 bytes
               0 Dir(s)    9,876,543,210 bytes free

        title   llshr - long shift right
;***
;llshr.asm - long shift right
;
;       Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
;       define signed long shift right routine
;           __allshr
;
;*******************************************************************************


.xlist
include cruntime.inc
include mm.inc
.list

;***
;llshr - long shift right
;
;Purpose:
;       Does a signed Long Shift Right
;       Shifts a long right any number of bits.
;
;Entry:
;       EDX:EAX - long value to be shifted
;       CL    - number of bits to shift by
;
;Exit:
;       EDX:EAX - shifted value
;
;Uses:
;       CL is destroyed.
;
;Exceptions:
;
;*******************************************************************************

        CODESEG

_allshr PROC NEAR
.FPO (0, 0, 0, 0, 0, 0)

;
; Handle shifts of 64 bits or more (if shifting 64 bits or more, the result
; depends only on the high order bit of edx).
;
        cmp     cl,64
        jae     short RETSIGN

;
; Handle shifts of between 0 and 31 bits
;
        cmp     cl, 32
        jae     short MORE32
        shrd    eax,edx,cl
        sar     edx,cl
        ret

;
; Handle shifts of between 32 and 63 bits
;
MORE32:
        mov     eax,edx
        sar     edx,31
        and     cl,31
        sar     eax,cl
        ret

;
; Return double precision 0 or -1, depending on the sign of edx
;
RETSIGN:
        sar     edx,31
        mov     eax,edx
        ret

_allshr ENDP

        end

15 instructions in 33 bytes (plus 15 bytes for alignment).

OUCH: i386 and newer processors use only the 5 least significant bits of the count for 32-bit shift operations, i.e. they shift modulo the register size; the AND CL, 31 instruction is therefore superfluous!

Implementation in i386 Assembler

A proper implementation prefers the more likely smaller shift counts and uses 13 instructions in 25 bytes (plus 7 bytes for alignment):

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.386
	.model	flat, C
	.code

; MSC internal _allshr():
; receives arguments in edx:eax and cl, returns result in edx:eax

_allshr	proc	public		; sqword _allshr(sqword value, byte count)

	cmp	cl, 31
	ja	short @f	; count > 31?

	shrd	eax, edx, cl
	sar	edx, cl		; edx:eax = result
	ret
@@:
	mov	eax, edx	; eax = high dword of value
	cdq			; edx = (value < 0) ? -1 : 0
				;     = high dword of result
	cmp	cl, 63
	ja	short @f	; count > 63?

	sar	eax, cl		; edx:eax = result
	ret
@@:
	mov	eax, edx	; edx:eax = (value < 0) ? -1 : 0
				;         = result
	ret

_allshr	endp
	end

Save the i386 assembler source presented above as allshr.asm in the directory where you created the object library i386.lib before, then execute the following 2 command lines to generate the object file allshr.obj and add it to the existing object library i386.lib:

ML.EXE allshr.asm
LINK.EXE /LIB /OUT:i386.lib i386.lib allshr.obj

For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: the command lines can be copied and pasted as a block into a Command Processor window!

Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: allshr.asm

Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.
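
As a cross-check for the signed shift, the three count ranges of _allshr (0 to 31, 32 to 63, 64 and more) can be modelled in portable C. This is a minimal sketch under stated assumptions: the name sar64_model is hypothetical, the sign fill is built from unsigned operations because the right shift of negative values is implementation-defined in C, and the final conversion to int64_t assumes the usual two's complement behaviour.

```c
#include <stdint.h>

/* Hypothetical model of a 64-bit arithmetic right shift built from dwords. */
int64_t sar64_model(int64_t value, unsigned char count)
{
    uint32_t lo = (uint32_t)(uint64_t)value;
    uint32_t hi = (uint32_t)((uint64_t)value >> 32);
    /* dword filled with copies of the sign bit, as SAR EDX, 31 or CDQ yield */
    uint32_t sign = (hi & 0x80000000u) ? 0xFFFFFFFFu : 0u;

    if (count > 63) {               /* only the sign survives */
        lo = sign;
        hi = sign;
    } else if (count > 31) {        /* 32..63: high dword moves to low dword */
        unsigned c = count & 31;    /* what the processor does implicitly */
        lo = hi >> c;
        if (c != 0)
            lo |= sign << (32 - c);
        hi = sign;
    } else if (count != 0) {        /* 1..31: SHRD followed by SAR */
        lo = (lo >> count) | (hi << (32 - count));
        hi = (hi >> count) | (sign << (32 - count));
    }
    return (int64_t)(((uint64_t)hi << 32) | lo);
}
```

With these definitions, `sar64_model(-8, 1)` yields -4 and `sar64_model(INT64_MIN, 63)` yields -1, while any count above 63 returns 0 or -1 depending on the sign of the value.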

Revision History of `_all*` and `_aull*` Routines in Leaked Source

DIR "%source%\intel\ll*.asm"
DIR "%source%\intel\ull*.asm"

 Volume in drive C has no label.
 Volume Serial Number is 1957-0427

 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel

02/18/2011  03:40 PM             6,670 lldiv.asm
02/18/2011  03:40 PM             8,557 lldvrm.asm
02/18/2011  03:40 PM             2,570 llmul.asm
02/18/2011  03:40 PM             7,067 llrem.asm
02/18/2011  03:40 PM             1,493 llshl.asm
02/18/2011  03:40 PM             1,561 llshr.asm
               6 File(s)         27,918 bytes
               0 Dir(s)    9,876,543,210 bytes free

 Volume in drive C has no label.
 Volume Serial Number is 1957-0427

 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel

02/18/2011  03:40 PM             5,079 ulldiv.asm
02/18/2011  03:40 PM             6,227 ulldvrm.asm
02/18/2011  03:40 PM             5,330 ullrem.asm
02/18/2011  03:40 PM             1,545 ullshr.asm
               4 File(s)         18,181 bytes
               0 Dir(s)    9,876,543,210 bytes free

The following table presents the revision history extracted from the i386 assembler source file blcrtasm.asm; this history was stripped from the 10 individual assembler source files shown above.

Note: on November 19, 1993, 8 (in words: eight) years after Intel® introduced their 80386 processor, and 8 months after they introduced the Pentium® processor, these routines were modified to work on 64-bit integers, but without taking advantage of these 32-bit processors’ new instructions like BSF, BSR, SHLD and SHRD to replace the (dead)slow loops which shift both operands by just one bit per pass with SHR and RCR instructions.

Note: even the initial versions of the _alldvrm and _aulldvrm routines, created October 6, 1998, almost 3 years after Intel introduced their Pentium® Pro processor, and 17 months after they introduced the Pentium® II processor, failed to rectify (not just) this performance-degrading deficiency; see Intel Microprocessor Quick Reference Guide - Product Family.
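
The difference between the two approaches can be illustrated in C. The following minimal sketch, with the hypothetical names bsr32, normalize_loop and normalize_once, contrasts the loop which shifts both operands by one bit per pass with a single shift by the count derived from the index of the most significant '1' bit; it only models the normalisation step, not a complete division.

```c
#include <stdint.h>

/* Index of the most significant '1' bit of a nonzero dword
   (portable stand-in for the BSR instruction). */
static unsigned bsr32(uint32_t x)
{
    unsigned index = 31;
    while ((x >> index) == 0)
        index--;
    return index;
}

/* One bit per pass, like the SHR/RCR loop: up to 32 passes. */
unsigned normalize_loop(uint64_t *dividend, uint64_t *divisor)
{
    unsigned passes = 0;
    while ((*divisor >> 32) != 0) {
        *divisor >>= 1;
        *dividend >>= 1;
        passes++;
    }
    return passes;
}

/* All bits at once, as BSR together with SHRD allows. */
unsigned normalize_once(uint64_t *dividend, uint64_t *divisor)
{
    uint32_t hi = (uint32_t)(*divisor >> 32);
    unsigned shift = (hi != 0) ? bsr32(hi) + 1 : 0;

    if (shift != 0) {
        *divisor >>= shift;
        *dividend >>= shift;
    }
    return shift;
}
```

Both functions leave the same values in their operands and return the same count; the second one gets there without a loop whose number of passes depends on the magnitude of the divisor.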

Revision history of Visual C compiler helper routines
Routine(s)										Date	Who	Comment
				llshl	llshr				ullshr	1983-11-??	HS	initial version
lldiv		llmul	llrem			ulldiv		ullrem		1983-11-29	DFW	initial version
				llshl	llshr				ullshr	1983-11-30	DFW	added medium model support
				llshl	llshr				ullshr	1984-03-12	DFW	broke apart; added long model support
lldiv		llmul	llrem	llshl	llshr	ulldiv		ullrem	ullshr	1984-06-01	RN	modified to use cmacros
		llmul								1985-04-17	TC	ignore signs since they take care of themselves do a fast multiply if both hiwords of arguments are 0
		llmul								1986-10-10	MH	slightly faster implementation, for 0 in upper words
lldiv			llrem			ulldiv		ullrem		1987-10-23	SKS	fixed off-by-1 error for dividend close to 2**32.
		llmul								1989-03-20	SKS	Remove redundant "MOV SP,BP" from epilogs
		llmul								1989-05-18	SKS	Preserve BX
lldiv			llrem			ulldiv		ullrem		1989-05-18	SKS	Remove redundant "MOV SP,BP" from epilog
lldiv		llmul	llrem	llshl	llshr	ulldiv		ullrem	ullshr	1989-11-28	GJF	Fixed copyright
lldiv		llmul	llrem	llshl	llshr	ulldiv		ullrem	ullshr	1993-11-19	SMK	Modified to work on 64 bit integers
lldiv		llmul	llrem	llshl	llshr	ulldiv		ullrem	ullshr	1994-01-17	GDF	Minor changes to build with NT's masm386.
				llshl	llshr				ullshr	1994-07-08	GDF	Faster, fatter version for NT.
				llshl	llshr				ullshr	1994-07-13	GDF	Further improvements from JonM.
lldiv		llmul	llrem			ulldiv		ullrem		1994-07-22	GJF	Use esp-relative addressing for args. Shortened conditional jumps. Also, don't use xchg to do a simple move between regs.
	lldvrm						ulldvrm			1998-10-06	SMK	Initial version.

Runtime Measurement of `_all*` and `_aull*` Routines

With the preprocessor macro CYCLES defined, the following program measures the execution times of signed and unsigned 64÷64-bit divisions as well as 64×64-bit multiplications in processor clock cycles and runs on Windows Vista® and later versions; else it measures the execution times in nanoseconds and runs on all versions of Windows™ NT. It executes each operation on 1 billion pairs of uniformly distributed 64-bit pseudo-random numbers in a first pass, then on 1 billion pairs of uniformly distributed 33 to 64-bit pseudo-random numbers in a second pass:

// Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#define STRICT
#undef UNICODE
#define WIN32_LEAN_AND_MEAN

#include <windows.h>

typedef	LONGLONG	SQWORD;
typedef	ULONGLONG	QWORD;

#define _(DIVIDEND, DIVISOR)	{(DIVIDEND), (DIVISOR), (DIVIDEND) / (DIVISOR), (DIVIDEND) % (DIVISOR)}

const	struct	_ull
{
	QWORD	ullDividend, ullDivisor, ullQuotient, ullRemainder;
} ullTable[] = {_(0x0000000000000000ULL, 0x0000000000000001ULL),
                _(0x0000000000000001ULL, 0x0000000000000001ULL),
                _(0x0000000000000002ULL, 0x0000000000000001ULL),
                _(0x0000000000000002ULL, 0x0000000000000002ULL),
                _(0x0000000000000000ULL, 0xFFFFFFFFFFFFFFFFULL),
                _(0x0000000000000001ULL, 0xFFFFFFFFFFFFFFFFULL),
                _(0x0000000000000001ULL, 0xFFFFFFFFFFFFFFFEULL),
                _(0x0000000000000002ULL, 0xFFFFFFFFFFFFFFFEULL),
                _(0x0000000000000003ULL, 0xFFFFFFFFFFFFFFFEULL),
                _(0x0000000000000003ULL, 0xFFFFFFFFFFFFFFFDULL),
                _(0x000000000FFFFFFFULL, 0x0000000000000001ULL),
                _(0x0000000FFFFFFFFFULL, 0x000000000000000FULL),
                _(0x0000000FFFFFFFFFULL, 0x0000000000000010ULL),
                _(0x0000000000000100ULL, 0x000000000FFFFFFFULL),
                _(0x00FFFFFFF0000000ULL, 0x0000000010000000ULL),
                _(0x07FFFFFF80000000ULL, 0x0000000080000000ULL),
                _(0x7FFFFFFEFFFFFFF0ULL, 0xFFFFFFFFFFFFFFFEULL),
                _(0x7FFFFFFEFFFFFFF0ULL, 0x0000FFFFFFFFFFFEULL),
                _(0x7FFFFFFEFFFFFFF0ULL, 0x7FFFFFFEFFFFFFF0ULL),
                _(0x7FFFFFFFFFFFFFFFULL, 0x8000000000000000ULL),
                _(0x7FFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFDULL),
                _(0x7FFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFEULL),
                _(0x7FFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL),
                _(0x8000000000000000ULL, 0x0000000000000001ULL),
                _(0x8000000000000000ULL, 0x0000000000000002ULL),
                _(0x8000000000000000ULL, 0x0000000000000003ULL),
                _(0x8000000000000000ULL, 0x00000000FFFFFFFDULL),
                _(0x8000000000000000ULL, 0x00000000FFFFFFFEULL),
                _(0x8000000000000000ULL, 0x00000000FFFFFFFFULL),
                _(0x8000000000000000ULL, 0x0000000100000000ULL),
                _(0x8000000000000000ULL, 0x0000000100000001ULL),
                _(0x8000000000000000ULL, 0x0000000100000002ULL),
                _(0x8000000000000000ULL, 0x0000000100000003ULL),
                _(0x8000000000000000ULL, 0xFFFFFFFF00000000ULL),
                _(0x8000000000000000ULL, 0xFFFFFFFFFFFFFFFDULL),
                _(0x8000000000000000ULL, 0xFFFFFFFFFFFFFFFEULL),
                _(0x8000000000000000ULL, 0xFFFFFFFFFFFFFFFFULL),
                _(0x8000000080000000ULL, 0x0000000080000000ULL),
                _(0x8000000080000001ULL, 0x0000000080000001ULL),
                _(0xFFFFFFFEFFFFFFF0ULL, 0xFFFFFFFFFFFFFFFEULL),
                _(0xFFFFFFFFFFFFFFFCULL, 0x00000000FFFFFFFEULL),
                _(0xFFFFFFFFFFFFFFFCULL, 0x0000000100000002ULL),
                _(0xFFFFFFFFFFFFFFFEULL, 0x0000000080000000ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x0000000000000001ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x0000000000000002ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x0000000000000003ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x00000000FFFFFFFDULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x00000000FFFFFFFEULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x00000000FFFFFFFFULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x0000000100000001ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x0000000100000002ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x0000000100000003ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x00000001C0000001ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x0000000380000003ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x8000000000000000ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x7FFFFFFFFFFFFFFFULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFEULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL)};

const	struct	_ll
{
	SQWORD	llDividend, llDivisor, llQuotient, llRemainder;
} llTable[] = {_(0x0000000000000000LL, 0x0000000000000001LL),	// 0, 1
               _(0x0000000000000001LL, 0x0000000000000001LL),	// 1, 1
               _(0x0000000000000000LL, 0xFFFFFFFFFFFFFFFFLL),	// 0, -1
               _(0x0000000000000001LL, 0xFFFFFFFFFFFFFFFFLL),	// 1, -1
               _(0x0000000000000001LL, 0xFFFFFFFFFFFFFFFELL),	// 1, -2
               _(0x0000000000000002LL, 0xFFFFFFFFFFFFFFFELL),	// 2, -2
               _(0x000000000FFFFFFFLL, 0x0000000000000001LL),
               _(0x0000000FFFFFFFFFLL, 0x000000000000000FLL),
               _(0x0000000FFFFFFFFFLL, 0x0000000000000010LL),
               _(0x0000000000000100LL, 0x000000000FFFFFFFLL),
               _(0x00FFFFFFF0000000LL, 0x0000000010000000LL),
               _(0x07FFFFFF80000000LL, 0x0000000080000000LL),
               _(0x7FFFFFFEFFFFFFF0LL, 0xFFFFFFFFFFFFFFFELL),
               _(0x7FFFFFFEFFFFFFF0LL, 0x0000FFFFFFFFFFFELL),
               _(0x7FFFFFFEFFFFFFF0LL, 0x7FFFFFFEFFFFFFF0LL),
               _(0x7FFFFFFFFFFFFFFFLL, 0x8000000000000000LL),	// llmax, llmin
               _(0x7FFFFFFFFFFFFFFFLL, 0xFFFFFFFFFFFFFFFDLL),	// llmax, -3
               _(0x7FFFFFFFFFFFFFFFLL, 0xFFFFFFFFFFFFFFFELL),	// llmax, -2
               _(0x7FFFFFFFFFFFFFFFLL, 0xFFFFFFFFFFFFFFFFLL),	// llmax, -1
               _(0x8000000000000000LL, 0x0000000000000001LL),	// llmin, 1
               _(0x8000000000000000LL, 0x0000000000000002LL),	// llmin, 2
               _(0x8000000000000000LL, 0x0000000000000003LL),	// llmin, 3
               _(0x8000000000000000LL, 0x00000000FFFFFFFELL),
               _(0x8000000000000000LL, 0x00000000FFFFFFFFLL),
               _(0x8000000000000000LL, 0x0000000100000000LL),
               _(0x8000000000000000LL, 0x0000000100000001LL),
               _(0x8000000000000000LL, 0x0000000100000002LL),
               _(0x8000000000000000LL, 0x8000000000000000LL),	// llmin, llmin
               _(0x8000000000000000LL, 0xFFFFFFFF00000000LL),
               _(0x8000000000000000LL, 0xFFFFFFFFFFFFFFFDLL),	// llmin, -3
               _(0x8000000000000000LL, 0xFFFFFFFFFFFFFFFELL),	// llmin, -2
               _(0x8000000000000000LL, 0xFFFFFFFFFFFFFFFFLL),	// llmin, -1
               _(0x8000000080000000LL, 0x0000000080000000LL),
               _(0x8000000080000001LL, 0x0000000080000001LL),
               _(0xFFFFFFFEFFFFFFF0LL, 0xFFFFFFFFFFFFFFFELL),
               _(0xFFFFFFFFFFFFFFFELL, 0x0000000080000000LL),
               _(0xFFFFFFFFFFFFFFFELL, 0x0000000000000001LL),	// -2, 1
               _(0xFFFFFFFFFFFFFFFELL, 0x0000000000000002LL),	// -2, 2
               _(0xFFFFFFFFFFFFFFFELL, 0xFFFFFFFFFFFFFFFELL),	// -2, -2
               _(0xFFFFFFFFFFFFFFFELL, 0xFFFFFFFFFFFFFFFFLL),	// -2, -1
               _(0xFFFFFFFFFFFFFFFFLL, 0x0000000000000001LL),	// -1, 1
               _(0xFFFFFFFFFFFFFFFFLL, 0x0000000000000002LL),	// -1, 2
               _(0xFFFFFFFFFFFFFFFFLL, 0xFFFFFFFFFFFFFFFELL),	// -1, -2
               _(0xFFFFFFFFFFFFFFFFLL, 0xFFFFFFFFFFFFFFFFLL)};	// -1, -1

#undef _

__declspec(naked)
__declspec(noinline)
QWORD	WINAPI	_aullnop(QWORD ullLeft, QWORD ullRight)
{
	__asm	ret	16
}

__forceinline	// companion for __emulu()
struct
{
	DWORD	ulQuotient, ulRemainder;
}	WINAPI	__edivmodu(QWORD ullDividend, DWORD ulDivisor)
{
	__asm	mov	eax, dword ptr ullDividend
	__asm	mov	edx, dword ptr ullDividend+4
	__asm	div	ulDivisor
}

__declspec(safebuffers)
BOOL	CDECL	PrintFormat(HANDLE hOutput, [SA_FormatString(Style="printf")] LPCSTR lpFormat, ...)
{
	CHAR	szFormat[1024];
	DWORD	dwFormat;
	DWORD	dwOutput;

	va_list	vaInput;
	va_start(vaInput, lpFormat);

	dwFormat = wvsprintf(szFormat, lpFormat, vaInput);

	va_end(vaInput);

	if ((dwFormat == 0UL)
	 || !WriteFile(hOutput, szFormat, dwFormat, &dwOutput, (LPOVERLAPPED) NULL))
		return FALSE;

	return dwOutput == dwFormat;
}

__declspec(noreturn)
VOID	CDECL	mainCRTStartup(VOID)
{
	DWORD	dw, dwCPUID[12];
	QWORD	qwT0, qwT1, qwT2, qwT3, qwT4, qwT5, qwT6, qwT7, qwT8, qwT9;
	QWORD	ullQuotient, ullRemainder;
	SQWORD	llQuotient, llRemainder;
	volatile
	QWORD	qwQuotient, qwRemainder;
	QWORD	qwDividend, qwDivisor = ~0ULL;
	QWORD	qwLeft = 0x9E3779B97F4A7C15ULL;		// 2**64 / golden ratio
	QWORD	qwRight = 0x28208A20A08A28ACULL;	// bit-vector of prime numbers:
							//  2**prime is set for each prime in [0, 63]
	HANDLE	hThread = GetCurrentThread();
	HANDLE	hOutput = GetStdHandle(STD_OUTPUT_HANDLE);

	if ((hOutput == INVALID_HANDLE_VALUE)
	 || (SetThreadIdealProcessor(hThread, 0UL) == -1L)
	 || (!SetThreadPriority(hThread, THREAD_PRIORITY_HIGHEST)))
		ExitProcess(GetLastError());

	__cpuid(dwCPUID, 0x80000000UL);

	if (*dwCPUID >= 0x80000004UL)
	{
		__cpuid(dwCPUID, 0x80000002UL);
		__cpuid(dwCPUID + 4, 0x80000003UL);
		__cpuid(dwCPUID + 8, 0x80000004UL);
	}
	else
		__movsb(dwCPUID, "undetermined processor", sizeof("undetermined processor"));

	PrintFormat(hOutput, "\r\nTesting unsigned 64-bit division...\r\n");

	for (dw = 0UL; dw < sizeof(ullTable) / sizeof(*ullTable); dw++)
	{
		PrintFormat(hOutput, "\r%lu", dw);

		ullQuotient = ullTable[dw].ullDividend / ullTable[dw].ullDivisor;
		ullRemainder = ullTable[dw].ullDividend % ullTable[dw].ullDivisor;

		if (ullQuotient != ullTable[dw].ullQuotient)
			PrintFormat(hOutput,
			            "\t%I64u / %I64u:\a quotient %I64u not equal %I64u\r\n",
			            ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullQuotient, ullTable[dw].ullQuotient);

		if (ullQuotient > ullTable[dw].ullDividend)
			PrintFormat(hOutput,
			            "\t%I64u / %I64u:\a quotient %I64u greater dividend\r\n",
			            ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullQuotient);

		if (ullRemainder != ullTable[dw].ullRemainder)
			PrintFormat(hOutput,
			            "\t%I64u %% %I64u:\a remainder %I64u not equal %I64u\r\n",
			            ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullRemainder, ullTable[dw].ullRemainder);

		if (ullRemainder >= ullTable[dw].ullDivisor)
			PrintFormat(hOutput,
			            "\t%I64u %% %I64u:\a remainder %I64u not less divisor\r\n",
			            ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullRemainder);
	}

	for (dw = 0UL; dw < sizeof(ullTable) / sizeof(*ullTable); dw++)
	{
		PrintFormat(hOutput, "\r%ld", ~dw);

		ullQuotient = ullTable[dw].ullDividend / ullTable[dw].ullDivisor;

		if (ullQuotient != ullTable[dw].ullQuotient)
			PrintFormat(hOutput,
			            "\t%I64u / %I64u:\a quotient %I64u not equal %I64u\r\n",
			            ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullQuotient, ullTable[dw].ullQuotient);

		if (ullQuotient > ullTable[dw].ullDividend)
			PrintFormat(hOutput,
			            "\t%I64u / %I64u:\a quotient %I64u greater dividend\r\n",
			            ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullQuotient);

		ullRemainder = ullTable[dw].ullDividend - ullTable[dw].ullDivisor * ullQuotient;

		if (ullRemainder != ullTable[dw].ullRemainder)
			PrintFormat(hOutput,
			            "\t%I64u / %I64u:\a remainder %I64u not equal %I64u\r\n",
			            ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullRemainder, ullTable[dw].ullRemainder);

		ullRemainder = ullTable[dw].ullDividend % ullTable[dw].ullDivisor;

		if (ullRemainder != ullTable[dw].ullRemainder)
			PrintFormat(hOutput,
			            "\t%I64u %% %I64u:\a remainder %I64u not equal %I64u\r\n",
			            ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullRemainder, ullTable[dw].ullRemainder);

		if (ullRemainder >= ullTable[dw].ullDivisor)
			PrintFormat(hOutput,
			            "\t%I64u %% %I64u:\a remainder %I64u not less divisor\r\n",
			            ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullRemainder);
	}

	PrintFormat(hOutput, "\r\nTesting signed 64-bit division...\r\n");

	for (dw = 0UL; dw < sizeof(llTable) / sizeof(*llTable); dw++)
	{
		PrintFormat(hOutput, "\r%lu", dw);

		llQuotient = llTable[dw].llDividend / llTable[dw].llDivisor;
		llRemainder = llTable[dw].llDividend % llTable[dw].llDivisor;

		if (llQuotient != llTable[dw].llQuotient)
			PrintFormat(hOutput,
			            "\t%I64d / %I64d:\a quotient %I64d not equal %I64d\r\n",
			            llTable[dw].llDividend, llTable[dw].llDivisor, llQuotient, llTable[dw].llQuotient);

		if ((llTable[dw].llDividend < 0LL) && (llQuotient < llTable[dw].llDividend)
		 || (llTable[dw].llDividend >= 0LL) && (llQuotient > llTable[dw].llDividend))
			PrintFormat(hOutput,
			            "\t%I64d / %I64d:\a quotient %I64d greater dividend\r\n",
			            llTable[dw].llDividend, llTable[dw].llDivisor, llQuotient);

		if (llRemainder != llTable[dw].llRemainder)
			PrintFormat(hOutput,
			            "\t%I64d %% %I64d:\a remainder %I64d not equal %I64d\r\n",
			            llTable[dw].llDividend, llTable[dw].llDivisor, llRemainder, llTable[dw].llRemainder);

		if ((llTable[dw].llDivisor < 0LL) && (llRemainder <= llTable[dw].llDivisor)
		 || (llTable[dw].llDivisor > 0LL) && (llRemainder >= llTable[dw].llDivisor))
			PrintFormat(hOutput,
			            "\t%I64d %% %I64d:\a remainder %I64d not less divisor\r\n",
			            llTable[dw].llDividend, llTable[dw].llDivisor, llRemainder);

		if ((llRemainder != 0LL) && ((llRemainder < 0LL) != (llTable[dw].llDividend < 0LL)))
			PrintFormat(hOutput,
			            "\t%I64d %% %I64d:\a sign of remainder %I64d not equal sign of dividend %I64d\r\n",
			            llTable[dw].llDividend, llTable[dw].llDivisor, llRemainder, llTable[dw].llDividend);

		llRemainder = llTable[dw].llDividend - llTable[dw].llDivisor * llQuotient;

		if (llRemainder != llTable[dw].llRemainder)
			PrintFormat(hOutput,
			            "\t%I64d / %I64d:\a remainder %I64d not equal %I64d\r\n",
			            llTable[dw].llDividend, llTable[dw].llDivisor, llRemainder, llTable[dw].llRemainder);

		if ((llRemainder != 0LL) && ((llRemainder < 0LL) != (llTable[dw].llDividend < 0LL)))
			PrintFormat(hOutput,
			            "\t%I64d / %I64d:\a sign of remainder %I64d not equal sign of dividend %I64d\r\n",
			            llTable[dw].llDividend, llTable[dw].llDivisor, llRemainder, llTable[dw].llDividend);
	}

	for (dw = 0UL; dw < sizeof(llTable) / sizeof(*llTable); dw++)
	{
		PrintFormat(hOutput, "\r%ld", ~dw);

		llQuotient = llTable[dw].llDividend / llTable[dw].llDivisor;

		if (llQuotient != llTable[dw].llQuotient)
			PrintFormat(hOutput,
			            "\t%I64d / %I64d:\a quotient %I64d not equal %I64d\r\n",
			            llTable[dw].llDividend, llTable[dw].llDivisor, llQuotient, llTable[dw].llQuotient);

		if ((llTable[dw].llDividend < 0LL) && (llQuotient < llTable[dw].llDividend)
		 || (llTable[dw].llDividend >= 0LL) && (llQuotient > llTable[dw].llDividend))
			PrintFormat(hOutput,
			            "\t%I64d / %I64d:\a quotient %I64d greater dividend\r\n",
			            llTable[dw].llDividend, llTable[dw].llDivisor, llQuotient);

		llRemainder = llTable[dw].llDividend - llTable[dw].llDivisor * llQuotient;

		if (llRemainder != llTable[dw].llRemainder)
			PrintFormat(hOutput,
			            "\t%I64d / %I64d:\a remainder %I64d not equal %I64d\r\n",
			            llTable[dw].llDividend, llTable[dw].llDivisor, llRemainder, llTable[dw].llRemainder);

		if ((llRemainder != 0LL) && ((llRemainder < 0LL) != (llTable[dw].llDividend < 0LL)))
			PrintFormat(hOutput,
			            "\t%I64d / %I64d:\a sign of remainder %I64d not equal sign of dividend %I64d\r\n",
			            llTable[dw].llDividend, llTable[dw].llDivisor, llRemainder, llTable[dw].llDividend);

		llRemainder = llTable[dw].llDividend % llTable[dw].llDivisor;

		if (llRemainder != llTable[dw].llRemainder)
			PrintFormat(hOutput,
			            "\t%I64d %% %I64d:\a remainder %I64d not equal %I64d\r\n",
			            llTable[dw].llDividend, llTable[dw].llDivisor, llRemainder, llTable[dw].llRemainder);

		if ((llTable[dw].llDivisor < 0LL) && (llRemainder <= llTable[dw].llDivisor)
		 || (llTable[dw].llDivisor > 0LL) && (llRemainder >= llTable[dw].llDivisor))
			PrintFormat(hOutput,
			            "\t%I64d %% %I64d:\a remainder %I64d not less divisor\r\n",
			            llTable[dw].llDividend, llTable[dw].llDivisor, llRemainder);

		if ((llRemainder != 0LL) && ((llRemainder < 0LL) != (llTable[dw].llDividend < 0LL)))
			PrintFormat(hOutput,
			            "\t%I64d %% %I64d:\a sign of remainder %I64d not equal sign of dividend %I64d\r\n",
			            llTable[dw].llDividend, llTable[dw].llDivisor, llRemainder, llTable[dw].llDividend);
	}

	PrintFormat(hOutput, "\r\nTiming 64-bit division and multiplication on %.48hs\r\n", dwCPUID);
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT0))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT0))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0xAD93D23594C935A9

		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);

		qwQuotient = _aullnop(qwLeft, qwRight);

		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0x2B5926535897936B

		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);

		qwRemainder = _aullnop(qwLeft, qwRight);
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT1))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT1))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);

		qwQuotient = qwLeft / qwRight;

		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);

		qwQuotient = qwLeft / qwRight;
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT2))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT2))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);

		qwRemainder = qwLeft % qwRight;

		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);

		qwRemainder = qwLeft % qwRight;
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT3))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT3))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);

		qwQuotient = qwLeft / qwRight;
		qwRemainder = qwLeft % qwRight;

		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);

		qwQuotient = qwLeft / qwRight;
		qwRemainder = qwLeft % qwRight;
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT4))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT4))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);

		qwQuotient = qwLeft * qwRight;

		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);

		qwRemainder = qwLeft * qwRight;
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT5))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT5))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);

		qwQuotient = (SQWORD) qwLeft / (SQWORD) qwRight;

		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);

		qwQuotient = (SQWORD) qwLeft / (SQWORD) qwRight;
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT6))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT6))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);

		qwRemainder = (SQWORD) qwLeft % (SQWORD) qwRight;

		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);

		qwRemainder = (SQWORD) qwLeft % (SQWORD) qwRight;
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT7))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT7))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);

		qwQuotient = (SQWORD) qwLeft / (SQWORD) qwRight;
		qwRemainder = (SQWORD) qwLeft % (SQWORD) qwRight;

		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);

		qwQuotient = (SQWORD) qwLeft / (SQWORD) qwRight;
		qwRemainder = (SQWORD) qwLeft % (SQWORD) qwRight;
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT8))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT8))
#endif
		ExitProcess(GetLastError());

	qwT9 = qwT8 - qwT0;
	qwT8 -= qwT7;
	qwT7 -= qwT6;
	qwT6 -= qwT5;
	qwT5 -= qwT4;
	qwT4 -= qwT3;
	qwT3 -= qwT2;
	qwT2 -= qwT1;
	qwT1 -= qwT0;
#ifdef CYCLES
	PrintFormat(hOutput,
	            "\r\n"
	            "_aullnop()   %6lu.%09lu      0\r\n"
	            "_aulldiv()   %6lu.%09lu %6lu.%09lu\r\n"
	            "_aullrem()   %6lu.%09lu %6lu.%09lu\r\n"
	            "_aulldvrm()  %6lu.%09lu %6lu.%09lu\r\n"
	            "_allmul()    %6lu.%09lu %6lu.%09lu\r\n"
	            "_alldiv()    %6lu.%09lu %6lu.%09lu\r\n"
	            "_allrem()    %6lu.%09lu %6lu.%09lu\r\n"
	            "_alldvrm()   %6lu.%09lu %6lu.%09lu\r\n"
	            "             %6lu.%09lu clock cycles\r\n",
	            __edivmodu(qwT1, 1000000000UL),
	            __edivmodu(qwT2, 1000000000UL), __edivmodu(qwT2 - qwT1, 1000000000UL),
	            __edivmodu(qwT3, 1000000000UL), __edivmodu(qwT3 - qwT1, 1000000000UL),
	            __edivmodu(qwT4, 1000000000UL), __edivmodu(qwT4 - qwT1, 1000000000UL),
	            __edivmodu(qwT5, 1000000000UL), __edivmodu(qwT5 - qwT1, 1000000000UL),
	            __edivmodu(qwT6, 1000000000UL), __edivmodu(qwT6 - qwT1, 1000000000UL),
	            __edivmodu(qwT7, 1000000000UL), __edivmodu(qwT7 - qwT1, 1000000000UL),
	            __edivmodu(qwT8, 1000000000UL), __edivmodu(qwT8 - qwT1, 1000000000UL),
	            __edivmodu(qwT9, 1000000000UL));
#else // CYCLES
	PrintFormat(hOutput,
	            "\r\n"
	            "_aullnop()   %6lu.%07lu      0\r\n"
	            "_aulldiv()   %6lu.%07lu %6lu.%07lu\r\n"
	            "_aullrem()   %6lu.%07lu %6lu.%07lu\r\n"
	            "_aulldvrm()  %6lu.%07lu %6lu.%07lu\r\n"
	            "_allmul()    %6lu.%07lu %6lu.%07lu\r\n"
	            "_alldiv()    %6lu.%07lu %6lu.%07lu\r\n"
	            "_allrem()    %6lu.%07lu %6lu.%07lu\r\n"
	            "_alldvrm()   %6lu.%07lu %6lu.%07lu\r\n"
	            "             %6lu.%07lu nano-seconds\r\n",
	            __edivmodu(qwT1, 10000000UL),
	            __edivmodu(qwT2, 10000000UL), __edivmodu(qwT2 - qwT1, 10000000UL),
	            __edivmodu(qwT3, 10000000UL), __edivmodu(qwT3 - qwT1, 10000000UL),
	            __edivmodu(qwT4, 10000000UL), __edivmodu(qwT4 - qwT1, 10000000UL),
	            __edivmodu(qwT5, 10000000UL), __edivmodu(qwT5 - qwT1, 10000000UL),
	            __edivmodu(qwT6, 10000000UL), __edivmodu(qwT6 - qwT1, 10000000UL),
	            __edivmodu(qwT7, 10000000UL), __edivmodu(qwT7 - qwT1, 10000000UL),
	            __edivmodu(qwT8, 10000000UL), __edivmodu(qwT8 - qwT1, 10000000UL),
	            __edivmodu(qwT9, 10000000UL));
#endif // CYCLES
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT0))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT0))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);

		qwDividend = __ull_rshift(qwLeft, qwLeft /* & 31 */);
		qwQuotient = _aullnop(qwDividend, qwDivisor);

		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);

		qwDivisor = __ull_rshift(qwRight, qwRight /* & 31 */);
		qwRemainder = _aullnop(qwDividend, qwDivisor);
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT1))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT1))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);

		qwDividend = __ull_rshift(qwLeft, qwLeft /* & 31 */);
		qwQuotient = qwDividend / qwDivisor;

		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);

		qwDivisor = __ull_rshift(qwRight, qwRight /* & 31 */);
		qwQuotient = qwDividend / qwDivisor;
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT2))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT2))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);

		qwDividend = __ull_rshift(qwLeft, qwLeft /* & 31 */);
		qwRemainder = qwDividend % qwDivisor;

		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);

		qwDivisor = __ull_rshift(qwRight, qwRight /* & 31 */);
		qwRemainder = qwDividend % qwDivisor;
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT3))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT3))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);

		qwDividend = __ull_rshift(qwLeft, qwLeft /* & 31 */);
		qwQuotient = qwDividend / qwDivisor;
		qwRemainder = qwDividend % qwDivisor;

		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);

		qwDivisor = __ull_rshift(qwRight, qwRight /* & 31 */);
		qwQuotient = qwDividend / qwDivisor;
		qwRemainder = qwDividend % qwDivisor;
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT4))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT4))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);

		qwDividend = __ull_rshift(qwLeft, qwLeft /* & 31 */);
		qwQuotient = qwDividend * qwDivisor;

		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);

		qwDivisor = __ull_rshift(qwRight, qwRight /* & 31 */);
		qwRemainder = qwDividend * qwDivisor;
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT5))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT5))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);

		qwDividend = __ll_rshift(qwLeft, qwLeft /* & 31 */);
		qwQuotient = (SQWORD) qwDividend / (SQWORD) qwDivisor;

		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);

		qwDivisor = __ll_rshift(qwRight, qwRight /* & 31 */);
		qwQuotient = (SQWORD) qwDividend / (SQWORD) qwDivisor;
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT6))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT6))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);

		qwDividend = __ll_rshift(qwLeft, qwLeft /* & 31 */);
		qwRemainder = (SQWORD) qwDividend % (SQWORD) qwDivisor;

		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);

		qwDivisor = __ll_rshift(qwRight, qwRight /* & 31 */);
		qwRemainder = (SQWORD) qwDividend % (SQWORD) qwDivisor;
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT7))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT7))
#endif
		ExitProcess(GetLastError());

	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);

		qwDividend = __ll_rshift(qwLeft, qwLeft /* & 31 */);
		qwQuotient = (SQWORD) qwDividend / (SQWORD) qwDivisor;
		qwRemainder = (SQWORD) qwDividend % (SQWORD) qwDivisor;

		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);

		qwDivisor = __ll_rshift(qwRight, qwRight /* & 31 */);
		qwQuotient = (SQWORD) qwDividend / (SQWORD) qwDivisor;
		qwRemainder = (SQWORD) qwDividend % (SQWORD) qwDivisor;
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT8))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT8))
#endif
		ExitProcess(GetLastError());

	qwT9 = qwT8 - qwT0;
	qwT8 -= qwT7;
	qwT7 -= qwT6;
	qwT6 -= qwT5;
	qwT5 -= qwT4;
	qwT4 -= qwT3;
	qwT3 -= qwT2;
	qwT2 -= qwT1;
	qwT1 -= qwT0;
#ifdef CYCLES
	PrintFormat(hOutput,
	            "\r\n"
	            "_aullnop()   %6lu.%09lu      0\r\n"
	            "_aulldiv()   %6lu.%09lu %6lu.%09lu\r\n"
	            "_aullrem()   %6lu.%09lu %6lu.%09lu\r\n"
	            "_aulldvrm()  %6lu.%09lu %6lu.%09lu\r\n"
	            "_allmul()    %6lu.%09lu %6lu.%09lu\r\n"
	            "_alldiv()    %6lu.%09lu %6lu.%09lu\r\n"
	            "_allrem()    %6lu.%09lu %6lu.%09lu\r\n"
	            "_alldvrm()   %6lu.%09lu %6lu.%09lu\r\n"
	            "             %6lu.%09lu clock cycles\r\n",
	            __edivmodu(qwT1, 1000000000UL),
	            __edivmodu(qwT2, 1000000000UL), __edivmodu(qwT2 - qwT1, 1000000000UL),
	            __edivmodu(qwT3, 1000000000UL), __edivmodu(qwT3 - qwT1, 1000000000UL),
	            __edivmodu(qwT4, 1000000000UL), __edivmodu(qwT4 - qwT1, 1000000000UL),
	            __edivmodu(qwT5, 1000000000UL), __edivmodu(qwT5 - qwT1, 1000000000UL),
	            __edivmodu(qwT6, 1000000000UL), __edivmodu(qwT6 - qwT1, 1000000000UL),
	            __edivmodu(qwT7, 1000000000UL), __edivmodu(qwT7 - qwT1, 1000000000UL),
	            __edivmodu(qwT8, 1000000000UL), __edivmodu(qwT8 - qwT1, 1000000000UL),
	            __edivmodu(qwT9, 1000000000UL));
#else // CYCLES
	PrintFormat(hOutput,
	            "\r\n"
	            "_aullnop()   %6lu.%07lu      0\r\n"
	            "_aulldiv()   %6lu.%07lu %6lu.%07lu\r\n"
	            "_aullrem()   %6lu.%07lu %6lu.%07lu\r\n"
	            "_aulldvrm()  %6lu.%07lu %6lu.%07lu\r\n"
	            "_allmul()    %6lu.%07lu %6lu.%07lu\r\n"
	            "_alldiv()    %6lu.%07lu %6lu.%07lu\r\n"
	            "_allrem()    %6lu.%07lu %6lu.%07lu\r\n"
	            "_alldvrm()   %6lu.%07lu %6lu.%07lu\r\n"
	            "             %6lu.%07lu nano-seconds\r\n",
	            __edivmodu(qwT1, 10000000UL),
	            __edivmodu(qwT2, 10000000UL), __edivmodu(qwT2 - qwT1, 10000000UL),
	            __edivmodu(qwT3, 10000000UL), __edivmodu(qwT3 - qwT1, 10000000UL),
	            __edivmodu(qwT4, 10000000UL), __edivmodu(qwT4 - qwT1, 10000000UL),
	            __edivmodu(qwT5, 10000000UL), __edivmodu(qwT5 - qwT1, 10000000UL),
	            __edivmodu(qwT6, 10000000UL), __edivmodu(qwT6 - qwT1, 10000000UL),
	            __edivmodu(qwT7, 10000000UL), __edivmodu(qwT7 - qwT1, 10000000UL),
	            __edivmodu(qwT8, 10000000UL), __edivmodu(qwT8 - qwT1, 10000000UL),
	            __edivmodu(qwT9, 10000000UL));
#endif // CYCLES
	ExitProcess(ERROR_SUCCESS);
}

Save the ANSI C source presented above as i386-i64.c in the directory where you created the object library i386.lib before, then execute the following 4 command lines to compile it, link the generated object file i386-i64.obj with the routines from the object library i386.lib, and finally execute the image file i386-i64.exe:

SET CL=/GAFy /Oxy /W4 /Zl
SET LINK=/DEFAULTLIB:i386.lib /DEFAULTLIB:kernel32.lib /DEFAULTLIB:user32.lib /ENTRY:mainCRTStartup /SUBSYSTEM:CONSOLE
CL.EXE /DCYCLES i386-i64.c
.\i386-i64.exe

For details and reference see the MSDN articles Compiler Options and Linker Options.

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: the command lines can be copied and pasted as a block into a Command Processor window!

Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.

i386-i64.c
i386-i64.c(128) : warning C4100: 'ullRight' : unreferenced formal parameter
i386-i64.c(128) : warning C4100: 'ullLeft' : unreferenced formal parameter

Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

/DEFAULTLIB:i386.lib /DEFAULTLIB:kernel32.lib /DEFAULTLIB:user32.lib /ENTRY:mainCRTStartup /SUBSYSTEM:CONSOLE
/out:i386-i64.exe
i386-i64.obj

Testing unsigned 64-bit division...
-58
Testing signed 64-bit division...
-44
Timing 64-bit division and multiplication on Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz

_aullnop()        5.130620616      0
_aulldiv()       24.916274238     19.785653622
_aullrem()       24.686248015     19.555627399
_aulldvrm()      25.947651690     20.817031074
_allmul()         6.753417214      1.622796598
_alldiv()        27.691010847     22.560390231
_allrem()        27.923880075     22.793259459
_alldvrm()       29.695561448     24.564940832
                172.744664143 clock cycles

_aullnop()        8.388855142      0
_aulldiv()       25.816723410     17.427868268
_aullrem()       25.383319447     16.994464305
_aulldvrm()      26.106060709     17.717205567
_allmul()        10.095017621      1.706162479
_alldiv()        30.421659163     22.032804021
_allrem()        30.961920386     22.573065244
_alldvrm()       32.481759875     24.092904733
                189.655315753 clock cycles

Testing unsigned 64-bit division...
-58
Testing signed 64-bit division...
-44
Timing 64-bit division and multiplication on Intel(R) Core(TM) i7-8559U CPU @ 2.70GHz

_aullnop()        4.323464204      0
_aulldiv()       20.430453818     16.106989614
_aullrem()       20.833517940     16.510053736
_aulldvrm()      21.894735828     17.571271624
_allmul()         5.704146716      1.380682512
_alldiv()        23.409770870     19.086306666
_allrem()        23.662985522     19.339521318
_alldvrm()       25.053725237     20.730261033
                145.312800135 clock cycles

_aullnop()        7.128891691      0
_aulldiv()       21.592044916     14.463153225
_aullrem()       21.438925969     14.310034278
_aulldvrm()      21.977489828     14.848598137
_allmul()         8.646452477      1.517560786
_alldiv()        25.699076924     18.570185233
_allrem()        26.119234835     18.990343144
_alldvrm()       27.478233473     20.349341782
                160.080350113 clock cycles

Oops: on these Intel® Core™ i7 processors, the division routines for signed 64-bit integers are up to 37 % slower than those for unsigned 64-bit integers.

Copy i386-i64.exe to systems with other processors and execute it there too:

Testing unsigned 64-bit division...
-58
Testing signed 64-bit division...
-44
Timing 64-bit division and multiplication on AMD Ryzen 7 5700X 8-Core Processor

_aullnop()        5.122143146      0
_aulldiv()       20.223817270     15.101674124
_aullrem()       20.164726800     15.042583654
_aulldvrm()      21.157922084     16.035778938
_allmul()         7.981533048      2.859389902
_alldiv()        20.836136360     15.713993214
_allrem()        22.052193688     16.930050542
_alldvrm()       21.613653936     16.491510790
                139.152126332 clock cycles

_aullnop()        7.683783760      0
_aulldiv()       21.300318504     13.616534744
_aullrem()       21.362844062     13.679060302
_aulldvrm()      22.265051198     14.581267438
_allmul()         8.956695616      1.272911856
_alldiv()        23.670479992     15.986696232
_allrem()        24.239343886     16.555560126
_alldvrm()       25.328227580     17.644443820
                154.806744598 clock cycles

Testing unsigned 64-bit division...
-58
Testing signed 64-bit division...
-44
Timing 64-bit division and multiplication on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz

_aullnop()        8.978870978      0
_aulldiv()       41.378538512     32.399667534
_aullrem()       41.459120072     32.480249094
_aulldvrm()      42.496702958     33.517831980
_allmul()        13.102594044      4.123723066
_alldiv()        48.877112646     39.898241668
_allrem()        48.630810323     39.651939345
_alldvrm()       57.155683201     48.176812223
                302.079432734 clock cycles

_aullnop()       13.583675374      0
_aulldiv()       41.220087960     27.636412586
_aullrem()       39.885909615     26.302234241
_aulldvrm()      41.619714690     28.036039316
_allmul()        18.266469971      4.682794597
_alldiv()        52.360017066     38.776341692
_allrem()        52.759947948     39.176272574
_alldvrm()       58.291439368     44.707763994
                317.987261992 clock cycles

Oops: on this 15-year-old Intel® Core™2 processor, the division routines for signed 64-bit integers are up to 54 % slower than those for unsigned 64-bit integers.

Generate the import library i386.lib and link the object file i386-i64.obj with it, then execute the image file i386-i64.exe, which now calls the (several times slower) 64-bit integer division and multiplication routines of NTDLL.dll:

LINK.EXE /LIB /DEF /EXPORT:_alldiv /EXPORT:_alldvrm /EXPORT:_allmul /EXPORT:_allrem /EXPORT:_allshl /EXPORT:_allshr /EXPORT:_aulldiv /EXPORT:_aulldvrm /EXPORT:_aullrem /EXPORT:_aullshr /MACHINE:I386 /NAME:NTDLL /OUT:i386.lib
LINK.EXE i386-i64.obj
.\i386-i64.exe

Note: the existing object library i386.lib and the image file i386-i64.exe are overwritten!

Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

   Creating library i386.lib and object i386.exp

Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

/DEFAULTLIB:i386.lib /DEFAULTLIB:kernel32.lib /DEFAULTLIB:user32.lib /ENTRY:mainCRTStartup /SUBSYSTEM:CONSOLE

Testing unsigned 64-bit division...
-58
Testing signed 64-bit division...
-44
Timing 64-bit division and multiplication on Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz

_aullnop()        5.173417920      0
_aulldiv()      102.145250711     96.971832791
_aullrem()      103.746558934     98.573141014
_aulldvrm()     103.711363598     98.537945678
_allmul()         9.640065030      4.466647110
_alldiv()       108.071904817    102.898486897
_allrem()       109.348691219    104.175273299
_alldvrm()      111.349818320    106.176400400
                653.187070549 clock cycles

_aullnop()        8.391292312      0
_aulldiv()       70.546796218     62.155503906
_aullrem()       72.698974389     64.307682077
_aulldvrm()      72.715990565     64.324698253
_allmul()        12.941190977      4.549898665
_alldiv()        85.064559941     76.673267629
_allrem()        86.951781779     78.560489467
_alldvrm()       89.225890859     80.834598547
                498.536477040 clock cycles

Testing unsigned 64-bit division...
-58
Testing signed 64-bit division...
-44
Timing 64-bit division and multiplication on Intel(R) Core(TM) i7-8559U CPU @ 2.70GHz

_aullnop()        4.293480684      0
_aulldiv()       88.508826840     84.215346156
_aullrem()       86.771270732     82.477790048
_aulldvrm()      88.330881954     84.037401270
_allmul()         8.473599720      4.180119036
_alldiv()        91.503580475     87.210099791
_allrem()        92.444818641     88.151337957
_alldvrm()       94.478769821     90.185289137
                554.805228867 clock cycles

_aullnop()        7.264421067      0
_aulldiv()       61.477534879     54.213113812
_aullrem()       60.671739257     53.407318190
_aulldvrm()      60.621554727     53.357133660
_allmul()        11.255056723      3.990635656
_alldiv()        71.552476501     64.288055434
_allrem()        72.850638899     65.586217832
_alldvrm()       73.504622168     66.240201101
                419.198044221 clock cycles

OUCH: here Microsoft’s division routines are 3.2 to 5.5 times slower than those presented above, and their multiplication routine is 2.5 to 4.5 times slower!

Also copy this variant of i386-i64.exe to systems with other processors and execute it there:

Testing unsigned 64-bit division...
-58
Testing signed 64-bit division...
-44
Timing 64-bit division and multiplication on AMD Ryzen 7 5700X 8-Core Processor

_aullnop()        5.121743374      0
_aulldiv()       55.833332662     50.711589288
_aullrem()       56.467057234     51.345313860
_aulldvrm()      58.395325278     53.273581904
_allmul()         7.690866062      2.569122688
_alldiv()        60.314670158     55.192926784
_allrem()        62.087331252     56.965587878
_alldvrm()       63.125034508     58.003291134
                369.035360528 clock cycles

_aullnop()        7.682569518      0
_aulldiv()       36.154348392     28.471778874
_aullrem()       37.041158934     29.358589416
_aulldvrm()      39.300191898     31.617622380
_allmul()        10.278630848      2.596061330
_alldiv()        45.203128180     37.520558662
_allrem()        46.432703432     38.750133914
_alldvrm()       47.750769056     40.068199538
                269.843500258 clock cycles

OUCH: there Microsoft’s division routines are up to 4 times slower than those presented above, and their multiplication routine is up to 2 times slower!

Testing unsigned 64-bit division...
-58
Testing signed 64-bit division...
-44
Timing 64-bit division and multiplication on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz

_aullnop()        8.952309405      0
_aulldiv()      129.226132812    120.273823407
_aullrem()      134.708512677    125.756203272
_aulldvrm()     143.453665206    134.501355801
_allmul()        17.856662118      8.904352713
_alldiv()       151.624513041    142.672203636
_allrem()       152.423786688    143.471477283
_alldvrm()      171.419909639    162.467600234
                909.665491586 clock cycles

_aullnop()       13.637814045      0
_aulldiv()       97.951782280     84.313968235
_aullrem()      103.122246554     89.484432509
_aulldvrm()     108.556615922     94.918801877
_allmul()        23.786421340     10.148607295
_alldiv()       117.589788695    103.951974650
_allrem()       119.202419933    105.564605888
_alldvrm()      124.590005377    110.952191332
                708.437094146 clock cycles

OUCH: all Microsoft routines are more than 2 times slower than those presented above, their division routines are even up to 4 times slower!

Finally execute the following 10 command lines to recreate the object library i386.lib, now with the division routines for processors which feature speculative execution, then link the object file i386-i64.obj generated before with it and execute the image file i386-i64.exe:

SET ML=/c /DJCCLESS /safeseh /W3 /X
ML.EXE alldiv.asm
ML.EXE alldvrm.asm
ML.EXE allrem.asm
ML.EXE aulldiv.asm
ML.EXE aulldvrm.asm
ML.EXE aullrem.asm
LINK.EXE /LIB /OUT:i386.lib alldiv.obj alldvrm.obj allmul.obj alloca.obj alloca8.obj alloca16.obj allrem.obj allshl.obj allshr.obj aulldiv.obj aulldvrm.obj aullrem.obj aullshr.obj
LINK.EXE i386-i64.obj
.\i386-i64.exe

Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: alldiv.asm

Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: alldvrm.asm

Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: allrem.asm

Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: aulldiv.asm

Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: aulldvrm.asm

Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: aullrem.asm

Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

   Creating library i386.lib and object i386.exp

Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

/DEFAULTLIB:i386.lib /DEFAULTLIB:kernel32.lib /DEFAULTLIB:user32.lib /ENTRY:mainCRTStartup /SUBSYSTEM:CONSOLE

Testing unsigned 64-bit division...
-58
Testing signed 64-bit division...
-44
Timing 64-bit division and multiplication on Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz

_aullnop()        5.130523022      0
_aulldiv()       22.148703486     17.018180464
_aullrem()       25.618235017     20.487711995
_aulldvrm()      27.484994154     22.354471132
_allmul()         6.756777960      1.626254938
_alldiv()        30.000776927     24.870253905
_allrem()        31.364631604     26.234108582
_alldvrm()       34.146860035     29.016337013
                182.651502205 clock cycles

_aullnop()        8.384999331      0
_aulldiv()       26.479597369     18.094598038
_aullrem()       27.445690820     19.060691489
_aulldvrm()      28.586813456     20.201814125
_allmul()        10.095121587      1.710122256
_alldiv()        31.576787296     23.191787965
_allrem()        33.327090912     24.942091581
_alldvrm()       35.484606261     27.099606930
                201.380707032 clock cycles

Testing unsigned 64-bit division...
-58
Testing signed 64-bit division...
-44
Timing 64-bit division and multiplication on Intel(R) Core(TM) i7-8559U CPU @ 2.70GHz

_aullnop()        4.441024919      0
_aulldiv()       19.396569893     14.955544974
_aullrem()       22.313397424     17.872372505
_aulldvrm()      23.100780021     18.659755102
_allmul()         5.659121355      1.218096436
_alldiv()        25.226574313     20.785549394
_allrem()        26.355810792     21.914785873
_alldvrm()       28.902499259     24.461474340
                155.395777976 clock cycles

_aullnop()        7.081996184      0
_aulldiv()       22.156429579     15.074433395
_aullrem()       22.934132959     15.852136775
_aulldvrm()      23.886275722     16.804279538
_allmul()         8.362922762      1.280926578
_alldiv()        26.645217765     19.563221581
_allrem()        27.942975969     20.860979785
_alldvrm()       29.720484454     22.638488270
                168.730435394 clock cycles

Oops: on this Intel^® Core^™ i7 processor, the branch-avoiding division routine runs about 7.5 % faster for big unsigned integers than its conditionally branching variant, while the others are up to 15 % slower.

Again copy this variant of i386-i64.exe to systems with other processors and execute it there too:

Testing unsigned 64-bit division...
-58
Testing signed 64-bit division...
-44
Timing 64-bit division and multiplication on AMD Ryzen 7 5700X 8-Core Processor

_aullnop()        5.120327104      0
_aulldiv()       17.308029902     12.187702798
_aullrem()       18.250468490     13.130141386
_aulldvrm()      20.019941914     14.899614810
_allmul()         7.951225720      2.830898616
_alldiv()        22.051554624     16.931227520
_allrem()        23.685382974     18.565055870
_alldvrm()       24.193433346     19.073106242
                138.580364074 clock cycles

_aullnop()        7.679050314      0
_aulldiv()       21.642133458     13.963083144
_aullrem()       21.867652058     14.188601744
_aulldvrm()      23.638699750     15.959649436
_allmul()         8.959283832      1.280233518
_alldiv()        24.276285226     16.597234912
_allrem()        24.838598612     17.159548298
_alldvrm()       26.549134226     18.870083912
                159.450837476 clock cycles

Oops: on this AMD^® Ryzen^™ processor, the branch-avoiding division routines run up to 20 % faster for big unsigned integers than their conditionally branching variants, but otherwise up to 15 % slower.

Testing unsigned 64-bit division...
-58
Testing signed 64-bit division...
-44
Timing 64-bit division and multiplication on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz

_aullnop()        8.956408592      0
_aulldiv()       40.006629475     31.050220883
_aullrem()       45.331733367     36.375324775
_aulldvrm()      52.817756545     43.861347953
_allmul()        13.100714751      4.144306159
_alldiv()        53.679959575     44.723550983
_allrem()        60.075486247     51.119077655
_alldvrm()       73.557314195     64.600905603
                347.526002747 clock cycles

_aullnop()       13.582502272      0
_aulldiv()       43.965933963     30.383431691
_aullrem()       46.175253389     32.592751117
_aulldvrm()      49.844225982     36.261723710
_allmul()        18.252733959      4.670231687
_alldiv()        53.847451160     40.264948888
_allrem()        59.859327003     46.276824731
_alldvrm()       65.922751282     52.340249010
                351.450179010 clock cycles

OOPS: on this 15 year old Intel^® Core^™2 processor, contrary to the expected results, the branch-avoiding division routines run up to 34 % slower than their conditionally branching variants!
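
Note: the percentages quoted for these measurements can presumably be reproduced from the net clock cycles in the second column of the listings. The following minimal C sketch shows the arithmetic; the function name slowdown_percent() is made up for this illustration and is not part of the benchmark program:

```c
// Illustration only: slowdown_percent() is a hypothetical helper,
// not part of i386-i64.exe.

// Relative slowdown in percent of a candidate routine against a baseline,
// both given as net clock cycles (second column of the listings).
double slowdown_percent(double baseline, double candidate)
{
    return (candidate / baseline - 1.0) * 100.0;
}
```

For example, slowdown_percent(48.176812223, 64.600905603) compares the net timings of _alldvrm() on the Core2 processor (conditionally branching versus branch-avoiding variant) and yields roughly 34.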

`_rotl64()` and `_rotr64()` Intrinsic Functions for i386 Platform

The Visual C compiler provides the intrinsic functions _rotl64() and _rotr64() for rotation of 64-bit integers, but with a stupid implementation.

// Copyleft © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

unsigned long long _allrol(unsigned long long value, unsigned int count)
{
    return _rotl64(value, count);
}

unsigned long long _allror(unsigned long long value, unsigned int count)
{
    return _rotr64(value, count);
}

Save the ANSI C source presented above as i386-rotate.c in an arbitrary, preferably empty directory, then execute the following 2 command lines to compile it and display the generated assembly:

SET CL=/c /FAsc /FaCON: /FoNUL: /Oxy /W4 /X /Zl
CL.EXE i386-rotate.c

For details and reference see the MSDN articles Compiler Options and Linker Options.

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: the command lines can be copied and pasted as a block into a Command Processor window!

Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.

i386-rotate.c

; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01 

	TITLE	C:\Users\Stefan\Desktop\i386-rotate.c
	.686P
	.XMM
	include	listing.inc
	.model	flat

INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES

PUBLIC	__allrol
; Function compile flags: /Ogtpy
; File c:\users\stefan\desktop\i386-rotate.c
_TEXT	SEGMENT
_value$ = 8						; size = 8
_count$ = 16						; size = 4
__allrol PROC

; 5    :     return _rotl64(value, count);

  00000	8a 4c 24 0c	 mov	 cl, BYTE PTR _count$[esp-4]
  00004	56		 push	 esi
  00005	8b 74 24 08	 mov	 esi, DWORD PTR _value$[esp]
  00009	57		 push	 edi
  0000a	8b 7c 24 10	 mov	 edi, DWORD PTR _value$[esp+8]
  0000e	8b c6		 mov	 eax, esi
  00010	8b d7		 mov	 edx, edi
  00012	f6 c1 20	 test	 cl, 32			; 00000020H
  00015	74 04		 je	 SHORT $LN3@allrol
  00017	8b c7		 mov	 eax, edi
  00019	8b d6		 mov	 edx, esi
$LN3@allrol:
  0001b	80 e1 1f	 and	 cl, 31			; 0000001fH
  0001e	74 12		 je	 SHORT $LN4@allrol
  00020	8b f0		 mov	 esi, eax
  00022	8b c2		 mov	 eax, edx
  00024	8b d6		 mov	 edx, esi
  00026	0f a5 c2	 shld	 edx, eax, cl
  00029	0f a5 f0	 shld	 eax, esi, cl
  0002c	8b ca		 mov	 ecx, edx
  0002e	8b d0		 mov	 edx, eax
  00030	8b c1		 mov	 eax, ecx
$LN4@allrol:

; 6    : }

  00032	5f		 pop	 edi
  00033	5e		 pop	 esi
  00034	c3		 ret	 0
__allrol ENDP
_TEXT	ENDS
PUBLIC	__allror
; Function compile flags: /Ogtpy
_TEXT	SEGMENT
_value$ = 8						; size = 8
_count$ = 16						; size = 4
__allror PROC

; 10   :     return _rotr64(value, count);

  00040	8a 4c 24 0c	 mov	 cl, BYTE PTR _count$[esp-4]
  00044	56		 push	 esi
  00045	8b 74 24 08	 mov	 esi, DWORD PTR _value$[esp]
  00049	57		 push	 edi
  0004a	8b 7c 24 10	 mov	 edi, DWORD PTR _value$[esp+8]
  0004e	8b c6		 mov	 eax, esi
  00050	8b d7		 mov	 edx, edi
  00052	f6 c1 20	 test	 cl, 32			; 00000020H
  00055	74 04		 je	 SHORT $LN3@allror
  00057	8b c7		 mov	 eax, edi
  00059	8b d6		 mov	 edx, esi
$LN3@allror:
  0005b	80 e1 1f	 and	 cl, 31			; 0000001fH
  0005e	74 12		 je	 SHORT $LN4@allror
  00060	8b f0		 mov	 esi, eax
  00062	8b c2		 mov	 eax, edx
  00064	8b d6		 mov	 edx, esi
  00066	0f ad c2	 shrd	 edx, eax, cl
  00069	0f ad f0	 shrd	 eax, esi, cl
  0006c	8b ca		 mov	 ecx, edx
  0006e	8b d0		 mov	 edx, eax
  00070	8b c1		 mov	 eax, ecx
$LN4@allror:

; 11   : }

  00072	5f		 pop	 edi
  00073	5e		 pop	 esi
  00074	c3		 ret	 0
__allror ENDP
_TEXT	ENDS
END

With 24 instructions in 53 bytes, each function is as bad as such optimised code can get!

OUCH¹: instead of loading its 64-bit argument into register pair EDX:EAX and swapping the two registers if the shift count÷32 is odd, this stupid code loads the 64-bit argument into register pair EDI:ESI first, copies it from there into register pair EDX:EAX, then reloads the latter in reverse order from register pair EDI:ESI if the shift count÷32 is odd, clobbering registers EDI and ESI without necessity!

OUCH²: instead of loading register ESI from register EDX and then shifting the register pairs EDX:EAX and EAX:ESI, this braindead code swaps registers EAX and EDX through register ESI, shifts register pairs EDX:EAX and EAX:ESI and finally swaps registers EAX and EDX once more, now through ECX!

Note: the evaluation of the code generated with the compiler options /Oisy is left as an exercise to the reader.

Implementation of `_allrol()` and `_allror()` Functions in i386 Assembler

Proper implementations use only 14 instructions in just 34 bytes, i.e. less than two thirds of Microsoft’s abomination:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

; Common "cdecl" calling and naming convention for i386 platform:
; - arguments are pushed on stack in reverse order (from right to left),
;   4-byte aligned;
; - 64-bit integer arguments are passed as pair of 32-bit integer arguments,
;   low part below high part;
; - 80-bit, 64-bit or 32-bit floating-point result is returned in FPU
;   register ST0;
; - 64-bit integer result is returned in registers EAX (low part) and
;   EDX (high part);
; - 32-bit integer or pointer result is returned in register EAX;
; - registers EAX, ECX and EDX are volatile and can be clobbered;
; - registers EBX, ESP, EBP, ESI and EDI must be preserved;
; - function names are prefixed with an underscore.

	.386
	.model  flat, C
	.code

_allrol	proc	public			; qword _allrol(qword value, dword count)

	mov	eax, [esp+4]
	mov	edx, [esp+8]		; edx:eax = value
	mov	ecx, [esp+12]		; ecx = count
ifndef SPACE
	test	cl, 63
	jz	short return		; count % 64 = 0?
endif
	test	cl, 32
	jz	short shift
swap:
	xchg	eax, edx
shift:
	push	ebx
	mov	ebx, edx
	shld	edx, eax, cl
	shld	eax, ebx, cl
	pop	ebx
return:
	ret

_allrol	endp

_allror	proc	public			; qword _allror(qword value, dword count)

	mov	eax, [esp+4]
	mov	edx, [esp+8]		; edx:eax = value
	mov	ecx, [esp+12]		; ecx = count
ifndef SPACE
	test	cl, 63
	jz	short return		; count % 64 = 0?
endif
	test	cl, 32
	jz	short shift
swap:
	xchg	eax, edx
shift:
	push	ebx
	mov	ebx, eax
	shrd	eax, edx, cl
	shrd	edx, ebx, cl
	pop	ebx
return:
	ret

_allror	endp
	end
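
Note: the swap-then-double-shift approach of the two assembler routines can be modelled in portable C, for example to cross-check their results. The following is only a sketch under that assumption; the names rol64_model() and ror64_model() are hypothetical, and the explicit test for a zero shift count exists only because C leaves shifts by 32 bits undefined, while SHLD and SHRD with count 0 are no-ops:

```c
#include <stdint.h>

// Illustration only: rol64_model() and ror64_model() are hypothetical names,
// not part of the assembler source.

// Rotate left: swap the halves if bit 5 of the count is set,
// then double-shift by the count modulo 32.
uint64_t rol64_model(uint64_t value, unsigned int count)
{
    uint32_t low = (uint32_t) value;
    uint32_t high = (uint32_t) (value >> 32);
    uint32_t save;
    unsigned int shift = count & 31;

    if (count & 32) {           // XCHG eax, edx
        save = low;
        low = high;
        high = save;
    }

    if (shift != 0) {
        save = high;                                        // MOV  ebx, edx
        high = (high << shift) | (low >> (32 - shift));     // SHLD edx, eax, cl
        low = (low << shift) | (save >> (32 - shift));      // SHLD eax, ebx, cl
    }

    return ((uint64_t) high << 32) | low;
}

// Rotate right: same scheme with the roles of the halves exchanged.
uint64_t ror64_model(uint64_t value, unsigned int count)
{
    uint32_t low = (uint32_t) value;
    uint32_t high = (uint32_t) (value >> 32);
    uint32_t save;
    unsigned int shift = count & 31;

    if (count & 32) {           // XCHG eax, edx
        save = low;
        low = high;
        high = save;
    }

    if (shift != 0) {
        save = low;                                         // MOV  ebx, eax
        low = (low >> shift) | (high << (32 - shift));      // SHRD eax, edx, cl
        high = (high >> shift) | (save << (32 - shift));    // SHRD edx, ebx, cl
    }

    return ((uint64_t) high << 32) | low;
}
```

Rotating by 32 only swaps the halves, and rotating by a multiple of 64 returns the argument unchanged, just like in the assembler routines.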

`_abs64()` Intrinsic Function for i386 Platform

The intrinsic function _abs64() alias llabs() provided by the Visual C compiler returns the absolute value of its 64-bit integer argument – but even this trivial routine is not fully optimised.

// Copyleft © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

long long _allabs(long long argument)
{
    return _abs64(argument);
}

Save the ANSI C source presented above as i386-magnitude.c in an arbitrary, preferably empty directory, then execute the following 2 command lines to compile it and display the generated assembly:

SET CL=/c /FAsc /FaCON: /FoNUL: /Oxy /W4 /X /Zl
CL.EXE i386-magnitude.c

For details and reference see the MSDN articles Compiler Options and Linker Options.

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: the command lines can be copied and pasted as a block into a Command Processor window!

Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.

i386-magnitude.c

; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01 

	TITLE	C:\Users\Stefan\Desktop\i386-magnitude.c
	.686P
	.XMM
	include	listing.inc
	.model	flat

INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES

PUBLIC	__allabs
; Function compile flags: /Ogtpy
; File c:\users\stefan\desktop\i386-magnitude.c
;	COMDAT	__allabs
_TEXT	SEGMENT
_argument$ = 8						; size = 8
__allabs PROC						; COMDAT

; 5    :     return _abs64(argument);

  00000	8b 44 24 08	 mov	 eax, DWORD PTR _argument$[esp]
  00004	8b 4c 24 04	 mov	 ecx, DWORD PTR _argument$[esp-4]
  00008	99		 cdq
  00009	33 c2		 xor	 eax, edx
  0000b	33 ca		 xor	 ecx, edx
  0000d	2b ca		 sub	 ecx, edx
  0000f	1b c2		 sbb	 eax, edx
  00011	8b d0		 mov	 edx, eax
  00013	8b c1		 mov	 eax, ecx

; 6    : }

  00015	c3		 ret	 0
__allabs ENDP
_TEXT	ENDS
END

10 instructions in 22 bytes.

Spoiler: if you are curious whether the current version of the Visual C compiler has its flaws fixed, view its output in Compiler Explorer.

Implementation of `_allabs()` Function in i386 Assembler

Using addition instead of subtraction allows taking advantage of XOR’s commutativity, saving a MOV instruction and 2 bytes:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.386
	.model  flat, C
	.code

_allabs	proc	public			; sqword _allabs(sqword argument)

	mov	eax, [esp+8]
	mov	ecx, [esp+4]		; eax:ecx = argument
	cdq				; edx = (argument < 0) ? -1 : 0
	add	ecx, edx
	adc	eax, edx		; eax:ecx = (argument < 0) ? argument - 1 : argument
					;         = (argument < 0) ? ~-argument : argument
	xor	ecx, edx
	xor	edx, eax		; edx:ecx = (argument < 0) ? -argument : argument
					;         = |argument|
	mov	eax, ecx		; edx:eax = |argument|
	ret

_allabs	endp
	end
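
Note: both the compiler’s subtraction variant and the addition variant of the assembler routine rest on a sign mask that is all ones for negative arguments and zero otherwise. The following C sketch models the two identities on whole 64-bit values; the function names are hypothetical, and the result is returned as unsigned integer to keep the case of the most negative argument well-defined in C:

```c
#include <stdint.h>

// Illustration only: abs64_subtract() and abs64_add() are hypothetical names.

// Scheme of the compiler-generated code: (argument XOR mask) - mask
uint64_t abs64_subtract(int64_t argument)
{
    uint64_t value = (uint64_t) argument;
    uint64_t mask = (argument < 0) ? UINT64_MAX : 0;    // CDQ

    return (value ^ mask) - mask;                       // XOR, XOR, SUB, SBB
}

// Scheme of the assembler routine: (argument + mask) XOR mask
uint64_t abs64_add(int64_t argument)
{
    uint64_t value = (uint64_t) argument;
    uint64_t mask = (argument < 0) ? UINT64_MAX : 0;    // CDQ

    return (value + mask) ^ mask;                       // ADD, ADC, XOR, XOR
}
```

Since the XOR comes last in the second scheme, its result for the high part can be written directly to the register which holds the mask, so that only the low part needs a final MOV.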

64-bit Integer Negation for i386 Platform

// Copyleft © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

long long _allneg(long long argument)
{
    return -argument;
}

Save the ANSI C source presented above as i386-negate.c in an arbitrary, preferably empty directory, then execute the following 2 command lines to compile it and display the generated assembly:

SET CL=/c /FAsc /FaCON: /FoNUL: /Oxy /W4 /X /Zl
CL.EXE i386-negate.c

Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.

i386-negate.c

; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01 

	TITLE	C:\Users\Stefan\Desktop\i386-negate.c
	.686P
	.XMM
	include	listing.inc
	.model	flat

INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES

PUBLIC	__allneg
; Function compile flags: /Ogtpy
; File c:\users\stefan\desktop\i386-negate.c
;	COMDAT	__allneg
_TEXT	SEGMENT
_argument$ = 8						; size = 8
__allneg PROC						; COMDAT

; 5    :     return -argument;

  00000	8b 44 24 04	 mov	 eax, DWORD PTR _argument$[esp-4]
  00004	8b 54 24 08	 mov	 edx, DWORD PTR _argument$[esp]
  00008	f7 d8		 neg	 eax
  0000a	83 d2 00	 adc	 edx, 0
  0000d	f7 da		 neg	 edx

; 6    : }

  0000f	c3		 ret	 0
__allneg ENDP
_TEXT	ENDS
END

6 instructions in 16 bytes.

Spoiler: if you are curious whether the current version of the Visual C compiler has its flaws fixed, view its output in Compiler Explorer.

Implementation of `_allneg()` Function in i386 Assembler

A proper implementation uses 5 instructions in 12 bytes:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.386
	.model  flat, C
	.code

_allneg	proc	public			; sqword _allneg(sqword argument)

	xor	eax, eax
	cdq				; edx:eax = 0
	sub	eax, [esp+4]
	sbb	edx, [esp+8]		; edx:eax = -argument
	ret

_allneg	endp
	end

Note: the following code for inline use performs the negation in place; it avoids one of the dependencies of the code generated by the Visual C compiler:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.386
	.model  flat, C
	.code

_allneg	proc	public			; sqword _allneg(sqword argument)

	mov	edx, [esp+8]
	mov	eax, [esp+4]		; edx:eax = argument
	not	edx
	neg	eax
	sbb	edx, -1			; edx:eax = -argument
	ret

_allneg	endp
	end
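
Note: the NOT, NEG, SBB sequence works because NEG sets the carry flag exactly when its operand is not zero, and subtracting −1 with borrow then adds 1 to the inverted high part only when the low part was zero. The following C sketch models this on two 32-bit halves; the name neg64_model() is hypothetical, and unsigned arithmetic is used to avoid undefined overflow:

```c
#include <stdint.h>

// Illustration only: neg64_model() is a hypothetical name.

// Two's complement negation of a 64-bit integer, performed on its
// 32-bit halves like the NOT, NEG, SBB sequence.
uint64_t neg64_model(uint64_t argument)
{
    uint32_t low = (uint32_t) argument;
    uint32_t high = (uint32_t) (argument >> 32);
    uint32_t carry = (low != 0);    // CF after NEG: set if operand is not 0

    high = ~high;                   // NOT edx
    low = 0 - low;                  // NEG eax
    high = high + 1 - carry;        // SBB edx, -1: edx = edx - (-1) - CF

    return ((uint64_t) high << 32) | low;
}
```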

64-bit Integer Negation for i386 Platform (Call by Reference)

// Copyleft © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

void blunder(long long *argument)
{
    *argument = -*argument;
}

Save the ANSI C source presented above as i386-blunder.c in an arbitrary, preferably empty directory, then execute the following 2 command lines to compile it and display the generated assembly:

SET CL=/c /FAsc /FaCON: /FoNUL: /Oxy /W4 /X /Zl
CL.EXE i386-blunder.c

Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.

i386-blunder.c

; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01 

	TITLE	C:\Users\Stefan\Desktop\i386-blunder.c
	.686P
	.XMM
	include	listing.inc
	.model	flat

INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES

PUBLIC	_blunder
; Function compile flags: /Ogtpy
; File c:\users\stefan\desktop\i386-blunder.c
;	COMDAT	_blunder
_TEXT	SEGMENT
_argument$ = 8						; size = 4
_blunder PROC						; COMDAT

; 5    :     *argument = -*argument;

  00000	8b 44 24 04	 mov	 eax, DWORD PTR _argument$[esp-4]
  00004	8b 08		 mov	 ecx, DWORD PTR [eax]
  00006	8b 50 04	 mov	 edx, DWORD PTR [eax+4]
  00009	f7 d9		 neg	 ecx
  0000b	83 d2 00	 adc	 edx, 0
  0000e	f7 da		 neg	 edx
  00010	89 08		 mov	 DWORD PTR [eax], ecx
  00012	89 50 04	 mov	 DWORD PTR [eax+4], edx

; 6    : }

  00015	c3		 ret	 0
_blunder ENDP
_TEXT	ENDS
END

OUCH: this atrocity is a perfect declaration of bankruptcy!

Even a non-optimising compiler should generate the following straightforward code, using 5 instructions in 14 bytes, instead of the 9 instructions in 22 bytes generated by the Visual C compiler:

	.code

	mov	eax, [esp+4]		; eax = address of argument
	neg	dword ptr [eax]
	not	dword ptr [eax+4]
	sbb	dword ptr [eax+4], -1
	ret

64-bit Integer Signum for i386 Platform

// Copyleft © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

int _allsgn(long long argument)
{
    return (argument > 0) - (argument < 0);
}

Save the ANSI C source presented above as i386-signum.c in an arbitrary, preferably empty directory, then execute the following 2 command lines to compile it and display the generated assembly:

SET CL=/c /FAsc /FaCON: /FoNUL: /Oxy /W4 /X /Zl
CL.EXE i386-signum.c

Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.

i386-signum.c

; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01 

	TITLE	C:\Users\Stefan\Desktop\i386-signum.c
	.686P
	.XMM
	include	listing.inc
	.model	flat

INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES

PUBLIC	__allsgn
; Function compile flags: /Ogtpy
; File c:\users\stefan\desktop\i386-signum.c
;	COMDAT	__allsgn
_TEXT	SEGMENT
_argument$ = 8						; size = 8
__allsgn PROC						; COMDAT

; 5    :     return (argument > 0) - (argument < 0);

  00000	8b 4c 24 08	 mov	 ecx, DWORD PTR _argument$[esp]
  00004	8b 54 24 04	 mov	 edx, DWORD PTR _argument$[esp-4]
  00008	85 c9		 test	 ecx, ecx
  0000a	7c 0d		 jl	 SHORT $LN5@allsgn
  0000c	7f 04		 jg	 SHORT $LN7@allsgn
  0000e	85 d2		 test	 edx, edx
  00010	74 07		 je	 SHORT $LN5@allsgn
$LN7@allsgn:
  00012	b8 01 00 00 00	 mov	 eax, 1
  00017	eb 02		 jmp	 SHORT $LN6@allsgn
$LN5@allsgn:
  00019	33 c0		 xor	 eax, eax
$LN6@allsgn:
  0001b	85 c9		 test	 ecx, ecx
  0001d	7f 0e		 jg	 SHORT $LN3@allsgn
  0001f	7c 04		 jl	 SHORT $LN8@allsgn
  00021	85 d2		 test	 edx, edx
  00023	73 08		 jae	 SHORT $LN3@allsgn
$LN8@allsgn:
  00025	b9 01 00 00 00	 mov	 ecx, 1
  0002a	2b c1		 sub	 eax, ecx

; 6    : }

  0002c	c3		 ret	 0
$LN3@allsgn:

; 5    :     return (argument > 0) - (argument < 0);

  0002d	33 c9		 xor	 ecx, ecx
  0002f	2b c1		 sub	 eax, ecx

; 6    : }

  00031	c3		 ret	 0
__allsgn ENDP
_TEXT	ENDS
END

21 instructions in 50 bytes.

OUCH¹: 6 (in words: six) superfluous conditional branch instructions – the optimiser and code generator of the Visual C compiler are a crime against all i386 processors since the Pentium^®Pro and their users!

OUCH²: the 2 instructions MOV ECX, 1 and SUB EAX, ECX following label $LN8@allsgn should be replaced with a single DEC EAX instruction, saving 6 bytes.

OUCH³: the 2 instructions XOR ECX, ECX and SUB EAX, ECX following label $LN3@allsgn should be replaced with a 4 byte NOP, which can then be removed together with the following RET instruction, saving 5 more bytes.

Spoiler: if you are curious whether the current version of the Visual C compiler has these deficiencies fixed, view its output in Compiler Explorer.

Implementation of `_allsgn()` Function in i386 Assembler

A proper implementation uses only 8 instructions in just 21 bytes, without conditional branches, and doesn’t clobber any register:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.386
	.model  flat, C
	.code

_allsgn	proc	public			; int _allsgn(sqword argument)

if 0
	xor	eax, eax		; eax = 0
	cmp	eax, [esp+4]		; CF = (low dword of argument != 0)
	mov	ecx, [esp+8]		; ecx = high dword of argument
	cdq				; edx = 0
	sbb	edx, ecx		; edx:... = 0 - argument
	setl	al			; eax = (argument > 0) ? 1 : 0
	sar	ecx, 31			; ecx = (argument < 0) ? -1 : 0
	add	eax, ecx		; eax = (argument > 0) - (argument < 0)
					;     = {1, 0, -1}
elseif 0
	mov	eax, [esp+8]		; eax = high dword of argument
	xor	edx, edx		; edx = 0
	cmp	edx, [esp+4]		; CF = (low dword of argument != 0)
	sbb	edx, eax		; edx:... = 0 - argument
	cdq				; edx = (argument < 0) ? -1 : 0
	setl	al
	movzx	eax, al			; eax = (argument > 0) ? 1 : 0
	add	eax, edx		; eax = (argument > 0) - (argument < 0)
					;     = {1, 0, -1}
else
	xor	eax, eax
	cmp	eax, [esp+4]		; CF = (low dword of argument != 0)
	mov	edx, [esp+8]		; edx = high dword of argument
	sbb	eax, edx		; eax:... = 0 - argument
	sar	edx, 31			; edx = (argument < 0) ? -1 : 0
	shr	eax, 31			; eax = (argument > 0) ? 1 : 0
	or	eax, edx		; eax = (argument > 0) - (argument < 0)
					;     = {1, 0, -1}
endif
	ret

_allsgn	endp
	end
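
Note: the active (last) variant of the assembler routine derives the two summands from sign bits only: the sign of the argument itself and the sign of the high part of 0 − argument. The following C sketch models it; the name sgn64_model() is hypothetical. For the most negative argument both sign bits are set, which is why the routine combines them with OR instead of ADD:

```c
#include <stdint.h>

// Illustration only: sgn64_model() is a hypothetical name.

// Branch-free signum of a 64-bit integer, built from two sign bits.
int sgn64_model(int64_t argument)
{
    uint64_t value = (uint64_t) argument;
    uint32_t high = (uint32_t) (value >> 32);
    uint32_t negated = (uint32_t) ((0 - value) >> 32);  // XOR, CMP, SBB:
                                                        // high part of 0 - argument
    uint32_t minus = (high >> 31) ? UINT32_MAX : 0;     // SAR edx, 31
    uint32_t plus = negated >> 31;                      // SHR eax, 31
    uint32_t result = plus | minus;                     // OR  eax, edx

    // result is 0, 1 or 0xFFFFFFFF; map the latter to -1 portably
    return (result == UINT32_MAX) ? -1 : (int) result;
}
```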

64-bit Integer Comparison for i386 Platform

// Copyleft © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifdef UNSIGNED
int _aullcmp(unsigned long long p, unsigned long long q)
#else
int _allcmp(long long p, long long q)
#endif
{
    return (p > q) - (p < q);
}

Save the ANSI C source presented above as i386-compare.c in an arbitrary, preferably empty directory, then execute the following 2 command lines to compile it and display the generated assembly a first time:

SET CL=/c /FAsc /FaCON: /FoNUL: /Oxy /W4 /X /Zl
CL.EXE i386-compare.c

Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.

i386-compare.c

; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01 

	TITLE	C:\Users\Stefan\Desktop\i386-compare.c
	.686P
	.XMM
	include	listing.inc
	.model	flat

INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES

PUBLIC	__allcmp
; Function compile flags: /Ogtpy
; File c:\users\stefan\desktop\i386-compare.c
;	COMDAT	__allcmp
_TEXT	SEGMENT
_p$ = 8							; size = 8
_q$ = 16						; size = 8
__allcmp PROC						; COMDAT

; 9    :     return (p > q) - (p < q);

  00000	8b 4c 24 08	 mov	 ecx, DWORD PTR _p$[esp]
  00004	8b 54 24 10	 mov	 edx, DWORD PTR _q$[esp]
  00008	56		 push	 esi
  00009	8b 74 24 10	 mov	 esi, DWORD PTR _q$[esp]
  0000d	57		 push	 edi
  0000e	8b 7c 24 0c	 mov	 edi, DWORD PTR _p$[esp+4]
  00012	3b ca		 cmp	 ecx, edx
  00014	7c 0d		 jl	 SHORT $LN5@allcmp
  00016	7f 04		 jg	 SHORT $LN7@allcmp
  00018	3b fe		 cmp	 edi, esi
  0001a	76 07		 jbe	 SHORT $LN5@allcmp
$LN7@allcmp:
  0001c	b8 01 00 00 00	 mov	 eax, 1
  00021	eb 02		 jmp	 SHORT $LN6@allcmp
$LN5@allcmp:
  00023	33 c0		 xor	 eax, eax
$LN6@allcmp:
  00025	3b ca		 cmp	 ecx, edx
  00027	7f 10		 jg	 SHORT $LN3@allcmp
  00029	7c 04		 jl	 SHORT $LN8@allcmp
  0002b	3b fe		 cmp	 edi, esi
  0002d	73 0a		 jae	 SHORT $LN3@allcmp
$LN8@allcmp:
  0002f	b9 01 00 00 00	 mov	 ecx, 1
  00034	5f		 pop	 edi
  00035	2b c1		 sub	 eax, ecx
  00037	5e		 pop	 esi

; 10   : }

  00038	c3		 ret	 0
$LN3@allcmp:

; 9    :     return (p > q) - (p < q);

  00039	33 c9		 xor	 ecx, ecx
  0003b	5f		 pop	 edi
  0003c	2b c1		 sub	 eax, ecx
  0003e	5e		 pop	 esi

; 10   : }

  0003f	c3		 ret	 0
__allcmp ENDP
_TEXT	ENDS
END

29 instructions in 64 bytes.

OUCH¹: 6 (in words: six) superfluous conditional branch instructions, and 2 registers clobbered without necessity – the optimiser and code generator of the Visual C compiler are a crime against all i386 processors since the Pentium^®Pro and their users!

OUCH²: the 2 instructions MOV ECX, 1 and SUB EAX, ECX following label $LN8@allcmp should be replaced with a single DEC EAX instruction, saving 6 bytes.

OUCH³: the 2 instructions XOR ECX, ECX and SUB EAX, ECX following label $LN3@allcmp should be replaced with two 2 byte NOPs, which can then be removed together with the 2 POP instructions and the following RET instruction, saving 7 more bytes.

Spoiler: if you are curious whether the current version of the Visual C compiler has these deficiencies fixed, view its output in Compiler Explorer.

Compile the source file i386-compare.c a second time, now with the preprocessor macro UNSIGNED defined, and display the generated assembly:

CL.EXE /DUNSIGNED i386-compare.c

Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.

i386-compare.c

; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01 

	TITLE	C:\Users\Stefan\Desktop\i386-compare.c
	.686P
	.XMM
	include	listing.inc
	.model	flat

INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES

PUBLIC	__aullcmp
; Function compile flags: /Ogtpy
; File c:\users\stefan\desktop\i386-compare.c
;	COMDAT	__aullcmp
_TEXT	SEGMENT
_p$ = 8							; size = 8
_q$ = 16						; size = 8
__aullcmp PROC						; COMDAT

; 9    :     return (p > q) - (p < q);

  00000	8b 4c 24 08	 mov	 ecx, DWORD PTR _p$[esp]
  00004	8b 54 24 10	 mov	 edx, DWORD PTR _q$[esp]
  00008	56		 push	 esi
  00009	8b 74 24 10	 mov	 esi, DWORD PTR _q$[esp]
  0000d	57		 push	 edi
  0000e	8b 7c 24 0c	 mov	 edi, DWORD PTR _p$[esp+4]
  00012	3b ca		 cmp	 ecx, edx
  00014	72 0d		 jb	 SHORT $LN5@aullcmp
  00016	77 04		 ja	 SHORT $LN7@aullcmp
  00018	3b fe		 cmp	 edi, esi
  0001a	76 07		 jbe	 SHORT $LN5@aullcmp
$LN7@aullcmp:
  0001c	b8 01 00 00 00	 mov	 eax, 1
  00021	eb 02		 jmp	 SHORT $LN6@aullcmp
$LN5@aullcmp:
  00023	33 c0		 xor	 eax, eax
$LN6@aullcmp:
  00025	3b ca		 cmp	 ecx, edx
  00027	77 10		 ja	 SHORT $LN3@aullcmp
  00029	72 04		 jb	 SHORT $LN8@aullcmp
  0002b	3b fe		 cmp	 edi, esi
  0002d	73 0a		 jae	 SHORT $LN3@aullcmp
$LN8@aullcmp:
  0002f	b9 01 00 00 00	 mov	 ecx, 1
  00034	5f		 pop	 edi
  00035	2b c1		 sub	 eax, ecx
  00037	5e		 pop	 esi

; 10   : }

  00038	c3		 ret	 0
$LN3@aullcmp:

; 9    :     return (p > q) - (p < q);

  00039	33 c9		 xor	 ecx, ecx
  0003b	5f		 pop	 edi
  0003c	2b c1		 sub	 eax, ecx
  0003e	5e		 pop	 esi

; 10   : }

  0003f	c3		 ret	 0
__aullcmp ENDP
_TEXT	ENDS
END

Also 29 instructions in 64 bytes.

OUCH¹: again 6 (in words: six) superfluous conditional branch instructions, and 2 registers clobbered without necessity – the optimiser and code generator of the Visual C compiler are a crime against all i386 processors since the Pentium® Pro and their users!

OUCH²: the first 2 highlighted instructions, MOV ECX, 1 and SUB EAX, ECX, should be replaced with a single DEC EAX instruction, saving 6 bytes.

OUCH³: the last 2 highlighted instructions, XOR ECX, ECX and SUB EAX, ECX, should be replaced with two 2-byte NOPs, which can then be removed together with the 2 POP instructions and the following RET instruction, saving 7 more bytes.

Spoiler: if you are curious whether the current version of the Visual C compiler has these deficiencies fixed, view its output in Compiler Explorer.

Implementation of `_allcmp()` and `_aullcmp()` Functions in i386 Assembler

Proper implementations use only 11 and 9 instructions in just 31 and 26 bytes, respectively, without conditional branches, and don’t clobber any register:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.386
	.model  flat, C
	.code

_allcmp	proc	public			; int _allcmp(sqword p, sqword q)

	xor	eax, eax		; eax = 0
	mov	ecx, [esp+4]
	mov	edx, [esp+8]		; edx:ecx = p
	sub	ecx, [esp+12]
	sbb	edx, [esp+16]		; edx:ecx = p - q,
					; SF != OF if (p < q)
	setl	al			; eax = (p < q)
	or	ecx, edx		; ZF = (p == q);
					; the ZF left by sbb covers only
					;  the high dword of p - q
	setnz	cl			; cl = (p != q)
	neg	eax			; eax = (p < q) ? -1 : 0
	or	al, cl			; eax = (p < q) ? -1 : (p != q)
					;     = {-1, 0, 1}
	ret

_allcmp	endp

_aullcmp proc	public			; int _aullcmp(qword p, qword q)

	mov	ecx, [esp+4]
	mov	edx, [esp+8]		; edx:ecx = p
	sub	ecx, [esp+12]
	sbb	edx, [esp+16]		; edx:ecx = p - q,
					; CF = (p < q)
	sbb	eax, eax		; eax = (p < q) ? -1 : 0
	or	ecx, edx		; ZF = (p == q);
					; the ZF left by sbb covers only
					;  the high dword of p - q
	setnz	cl			; cl = (p != q)
	or	al, cl			; eax = (p < q) ? -1 : (p != q)
					;     = {-1, 0, 1}
	ret

_aullcmp endp
	end
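
The following C sketch models a three-way comparison on 32-bit halves the way a subtract-with-borrow sequence sees them. The helper names `cmp_u64_halves` and `cmp_i64_halves` are hypothetical and not part of any runtime library; it is an illustration under these assumptions, not a transcription of the assembler routines. Equality has to be derived from both halves of the difference, since the high half alone can be zero while the low half is not.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: unsigned three-way comparison of p and q, each given
   as a low and a high 32-bit half; returns -1, 0 or 1. */
static int cmp_u64_halves(uint32_t p_lo, uint32_t p_hi,
                          uint32_t q_lo, uint32_t q_hi)
{
    uint32_t borrow = p_lo < q_lo;          /* borrow out of the low halves */
    uint32_t d_lo = p_lo - q_lo;            /* low half of p - q            */
    uint32_t d_hi = p_hi - q_hi - borrow;   /* high half of p - q           */
    /* borrow out of the high halves, i.e. p < q */
    int less = (p_hi < q_hi) || (p_hi == q_hi && borrow);
    /* p != q only if any bit of the full 64-bit difference is set */
    int differ = (d_lo | d_hi) != 0;

    return less ? -1 : differ;
}

/* Hypothetical helper: signed variant; flipping both sign bits maps the
   signed order onto the unsigned order. */
static int cmp_i64_halves(uint32_t p_lo, uint32_t p_hi,
                          uint32_t q_lo, uint32_t q_hi)
{
    return cmp_u64_halves(p_lo, p_hi ^ 0x80000000u,
                          q_lo, q_hi ^ 0x80000000u);
}
```

For example, `cmp_u64_halves(2, 0, 1, 0)` compares 2 with 1: the high half of the difference is 0, yet the operands differ, so the result is 1.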

64-bit Integer Maximum for i386 Platform

Instead of *max() functions, the Visual C compiler provides a preprocessor macro __max:

#define __max(a,b) (((a) > (b)) ? (a) : (b))

Note: the header files shipped with the Windows platform software development kit provide the preprocessor macros min and max, which are (fortunately) not defined when the preprocessor macro NOMINMAX is defined.

// Copyleft © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifdef UNSIGNED
unsigned long long _aullmax(unsigned long long p, unsigned long long q)
#else
long long _allmax(long long p, long long q)
#endif
{
    return p > q ? p : q;
}

Save the ANSI C source presented above as i386-maximum.c in an arbitrary, preferably empty directory, then execute the following 2 command lines to compile it and display the generated assembly a first time:

SET CL=/c /FAsc /FaCON: /FoNUL: /Oxy /W4 /X /Zl
CL.EXE i386-maximum.c

Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.

i386-maximum.c

; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01 

	TITLE	C:\Users\Stefan\Desktop\i386-maximum.c
	.686P
	.XMM
	include	listing.inc
	.model	flat

INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES

PUBLIC	__allmax
; Function compile flags: /Ogtpy
; File c:\users\stefan\desktop\i386-maximum.c
;	COMDAT	__allmax
_TEXT	SEGMENT
_p$ = 8							; size = 8
_q$ = 16						; size = 8
__allmax PROC						; COMDAT

; 9    :     return p > q ? p : q;

  00000	8b 4c 24 08	 mov	 ecx, DWORD PTR _p$[esp]
  00004	8b 54 24 10	 mov	 edx, DWORD PTR _q$[esp]
  00008	56		 push	 esi
  00009	8b 74 24 10	 mov	 esi, DWORD PTR _q$[esp]
  0000d	3b ca		 cmp	 ecx, edx
  0000f	7c 0e		 jl	 SHORT $LN3@allmax
  00011	8b 44 24 08	 mov	 eax, DWORD PTR _p$[esp]
  00015	7f 04		 jg	 SHORT $LN5@allmax
  00017	3b c6		 cmp	 eax, esi
  00019	76 04		 jbe	 SHORT $LN3@allmax
$LN5@allmax:
  0001b	8b d1		 mov	 edx, ecx
  0001d	5e		 pop	 esi

; 10   : }

  0001e	c3		 ret	 0
$LN3@allmax:

; 9    :     return p > q ? p : q;

  0001f	8b c6		 mov	 eax, esi
  00021	5e		 pop	 esi

; 10   : }

  00022	c3		 ret	 0
__allmax ENDP
_TEXT	ENDS
END

16 instructions in 35 bytes.

OUCH: 3 superfluous conditional branch instructions, and 1 register clobbered without necessity – the optimiser and code generator of the Visual C compiler are a crime against all i386 processors since the Pentium® Pro and their users!

Spoiler: if you are curious whether the current version of the Visual C compiler has these deficiencies fixed, view its output in Compiler Explorer.

Compile the source file i386-maximum.c a second time, now with the preprocessor macro UNSIGNED defined, and display the generated assembly:

CL.EXE /DUNSIGNED i386-maximum.c

Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.

i386-maximum.c

; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01 

	TITLE	C:\Users\Stefan\Desktop\i386-maximum.c
	.686P
	.XMM
	include	listing.inc
	.model	flat

INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES

PUBLIC	__aullmax
; Function compile flags: /Ogtpy
; File c:\users\stefan\desktop\i386-maximum.c
;	COMDAT	__aullmax
_TEXT	SEGMENT
_p$ = 8							; size = 8
_q$ = 16						; size = 8
__aullmax PROC						; COMDAT

; 9    :     return p > q ? p : q;

  00000	8b 4c 24 08	 mov	 ecx, DWORD PTR _p$[esp]
  00004	8b 54 24 10	 mov	 edx, DWORD PTR _q$[esp]
  00008	56		 push	 esi
  00009	8b 74 24 10	 mov	 esi, DWORD PTR _q$[esp]
  0000d	3b ca		 cmp	 ecx, edx
  0000f	72 0e		 jb	 SHORT $LN3@aullmax
  00011	8b 44 24 08	 mov	 eax, DWORD PTR _p$[esp]
  00015	77 04		 ja	 SHORT $LN5@aullmax
  00017	3b c6		 cmp	 eax, esi
  00019	76 04		 jbe	 SHORT $LN3@aullmax
$LN5@aullmax:
  0001b	8b d1		 mov	 edx, ecx
  0001d	5e		 pop	 esi

; 10   : }

  0001e	c3		 ret	 0
$LN3@aullmax:

; 9    :     return p > q ? p : q;

  0001f	8b c6		 mov	 eax, esi
  00021	5e		 pop	 esi

; 10   : }

  00022	c3		 ret	 0
__aullmax ENDP
_TEXT	ENDS
END

Also 16 instructions in 35 bytes.

OUCH: again 3 superfluous conditional branch instructions, and 1 register clobbered without necessity – the optimiser and code generator of the Visual C compiler are a crime against all i386 processors since the Pentium® Pro and their users!

Spoiler: if you are curious whether the current version of the Visual C compiler has these deficiencies fixed, view its output in Compiler Explorer.

Implementation of `_allmax()` and `_aullmax()` Functions in i386 Assembler

Proper implementations use only 12 and 11 instructions in 35 and 32 bytes, respectively, without conditional branches, and don’t clobber any register:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.386
	.model  flat, C
	.code

_allmax	proc	public			; sqword _allmax(sqword p, sqword q)

	xor	ecx, ecx		; ecx = 0
	mov	eax, [esp+4]
	mov	edx, [esp+8]		; edx:eax = p
	sub	eax, [esp+12]
	sbb	edx, [esp+16]		; edx:eax = p - q,
					; SF != OF if (p < q);
					; the sign of p - q alone is wrong
					;  when the subtraction overflows
	setl	cl			; ecx = (p < q)
	dec	ecx			; ecx = (p < q) ? 0 : -1
	and	eax, ecx
	and	edx, ecx		; edx:eax = (p < q) ? 0 : p - q
	add	eax, [esp+12]
	adc	edx, [esp+16]		; edx:eax = (p < q) ? q : p
					;         = max(p, q)
	ret

_allmax	endp

_aullmax proc	public			; qword _aullmax(qword p, qword q)

	mov	eax, [esp+4]
	mov	edx, [esp+8]		; edx:eax = p
	sub	eax, [esp+12]
	sbb	edx, [esp+16]		; edx:eax = p - q
	cmc				; CF = ~(p < q)
	sbb	ecx, ecx		; ecx = (p >= q) ? -1 : 0
	and	eax, ecx
	and	edx, ecx		; edx:eax = (p >= q) ? p - q : 0
	add	eax, [esp+12]
	adc	edx, [esp+16]		; edx:eax = (p >= q) ? p : q
					;         = max(p, q)
	ret

_aullmax endp
	end
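
The masking idea can be sketched in C as follows. The helper names `max_u64_masked` and `max_i64_masked` are hypothetical; the sketch assumes a two's complement implementation for the final conversion back to `int64_t`, and whether a given compiler emits branch-free code for it is a separate question. The mask is derived from the comparison itself, not from the sign of the difference, because the difference is only computed modulo 2**64.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: unsigned maximum built from a difference and a mask. */
static uint64_t max_u64_masked(uint64_t p, uint64_t q)
{
    uint64_t diff = p - q;                    /* p - q modulo 2**64         */
    uint64_t mask = 0 - (uint64_t)(p >= q);   /* all ones if p >= q, else 0 */

    return q + (diff & mask);                 /* q + (p - q) = p, or q + 0  */
}

/* Hypothetical helper: signed maximum; the subtraction is performed on
   unsigned values so that it wraps instead of overflowing. The conversion
   of the result back to int64_t assumes two's complement. */
static int64_t max_i64_masked(int64_t p, int64_t q)
{
    uint64_t diff = (uint64_t)p - (uint64_t)q;
    uint64_t mask = 0 - (uint64_t)(p >= q);

    return (int64_t)((uint64_t)q + (diff & mask));
}
```

With `p = INT64_MAX` and `q = -1` the difference wraps to `INT64_MIN`, but the mask still selects `p`, since it comes from `p >= q`.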

64-bit Integer Minimum for i386 Platform

Instead of *min() functions, the Visual C compiler provides a preprocessor macro __min:

#define __min(a,b) (((a) < (b)) ? (a) : (b))

// Copyleft © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifdef UNSIGNED
unsigned long long _aullmin(unsigned long long p, unsigned long long q)
#else
long long _allmin(long long p, long long q)
#endif
{
    return p < q ? p : q;
}

Save the ANSI C source presented above as i386-minimum.c in an arbitrary, preferably empty directory, then execute the following 2 command lines to compile it and display the generated assembly a first time:

SET CL=/c /FAsc /FaCON: /FoNUL: /Oxy /W4 /X /Zl
CL.EXE i386-minimum.c

Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.

i386-minimum.c

; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01 

	TITLE	C:\Users\Stefan\Desktop\i386-minimum.c
	.686P
	.XMM
	include	listing.inc
	.model	flat

INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES

PUBLIC	__allmin
; Function compile flags: /Ogtpy
; File c:\users\stefan\desktop\i386-minimum.c
;	COMDAT	__allmin
_TEXT	SEGMENT
_p$ = 8							; size = 8
_q$ = 16						; size = 8
__allmin PROC						; COMDAT

; 9    :     return p < q ? p : q;

  00000	8b 4c 24 08	 mov	 ecx, DWORD PTR _p$[esp]
  00004	8b 54 24 10	 mov	 edx, DWORD PTR _q$[esp]
  00008	56		 push	 esi
  00009	8b 74 24 10	 mov	 esi, DWORD PTR _q$[esp]
  0000d	3b ca		 cmp	 ecx, edx
  0000f	7f 0e		 jg	 SHORT $LN3@allmin
  00011	8b 44 24 08	 mov	 eax, DWORD PTR _p$[esp]
  00015	7c 04		 jl	 SHORT $LN5@allmin
  00017	3b c6		 cmp	 eax, esi
  00019	73 04		 jae	 SHORT $LN3@allmin
$LN5@allmin:
  0001b	8b d1		 mov	 edx, ecx
  0001d	5e		 pop	 esi

; 10   : }

  0001e	c3		 ret	 0
$LN3@allmin:

; 9    :     return p < q ? p : q;

  0001f	8b c6		 mov	 eax, esi
  00021	5e		 pop	 esi

; 10   : }

  00022	c3		 ret	 0
__allmin ENDP
_TEXT	ENDS
END

16 instructions in 35 bytes.

Spoiler: if you are curious whether the current version of the Visual C compiler has these deficiencies fixed, view its output in Compiler Explorer.

Compile the source file i386-minimum.c a second time, now with the preprocessor macro UNSIGNED defined, and display the generated assembly:

CL.EXE /DUNSIGNED i386-minimum.c

Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.

i386-minimum.c

; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01 

	TITLE	C:\Users\Stefan\Desktop\i386-minimum.c
	.686P
	.XMM
	include	listing.inc
	.model	flat

INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES

PUBLIC	__aullmin
; Function compile flags: /Ogtpy
; File c:\users\stefan\desktop\i386-minimum.c
;	COMDAT	__aullmin
_TEXT	SEGMENT
_p$ = 8							; size = 8
_q$ = 16						; size = 8
__aullmin PROC						; COMDAT

; 9    :     return p < q ? p : q;

  00000	8b 4c 24 08	 mov	 ecx, DWORD PTR _p$[esp]
  00004	8b 54 24 10	 mov	 edx, DWORD PTR _q$[esp]
  00008	56		 push	 esi
  00009	8b 74 24 10	 mov	 esi, DWORD PTR _q$[esp]
  0000d	3b ca		 cmp	 ecx, edx
  0000f	77 0e		 ja	 SHORT $LN3@aullmin
  00011	8b 44 24 08	 mov	 eax, DWORD PTR _p$[esp]
  00015	72 04		 jb	 SHORT $LN5@aullmin
  00017	3b c6		 cmp	 eax, esi
  00019	73 04		 jae	 SHORT $LN3@aullmin
$LN5@aullmin:
  0001b	8b d1		 mov	 edx, ecx
  0001d	5e		 pop	 esi

; 10   : }

  0001e	c3		 ret	 0
$LN3@aullmin:

; 9    :     return p < q ? p : q;

  0001f	8b c6		 mov	 eax, esi
  00021	5e		 pop	 esi

; 10   : }

  00022	c3		 ret	 0
__aullmin ENDP
_TEXT	ENDS
END

Also 16 instructions in 35 bytes.

Spoiler: if you are curious whether the current version of the Visual C compiler has these deficiencies fixed, view its output in Compiler Explorer.

Implementation of `_allmin()` and `_aullmin()` Functions in i386 Assembler

Proper implementations use only 12 and 10 instructions in 35 and 31 bytes, respectively, without conditional branches, and don’t clobber any register:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.386
	.model  flat, C
	.code

_allmin	proc	public			; sqword _allmin(sqword p, sqword q)

	xor	ecx, ecx		; ecx = 0
	mov	eax, [esp+4]
	mov	edx, [esp+8]		; edx:eax = p
	sub	eax, [esp+12]
	sbb	edx, [esp+16]		; edx:eax = p - q,
					; SF = OF if (p >= q);
					; the sign of p - q alone is wrong
					;  when the subtraction overflows
	setge	cl			; ecx = (p >= q)
	dec	ecx			; ecx = (p < q) ? -1 : 0
	and	eax, ecx
	and	edx, ecx		; edx:eax = (p < q) ? p - q : 0
	add	eax, [esp+12]
	adc	edx, [esp+16]		; edx:eax = (p < q) ? p : q
					;         = min(p, q)
	ret

_allmin	endp

_aullmin proc	public			; qword _aullmin(qword p, qword q)

	mov	eax, [esp+4]
	mov	edx, [esp+8]		; edx:eax = p
	sub	eax, [esp+12]
	sbb	edx, [esp+16]		; edx:eax = p - q
	sbb	ecx, ecx		; ecx = (p < q) ? -1 : 0
	and	eax, ecx
	and	edx, ecx		; edx:eax = (p < q) ? p - q : 0
	add	eax, [esp+12]
	adc	edx, [esp+16]		; edx:eax = (p < q) ? p : q
					;         = min(p, q)
	ret

_aullmin endp
	end
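
A corresponding C sketch for the minimum, again with hypothetical helper names (`min_u64_masked`, `min_i64_masked`, `diff_is_negative`) and assuming two's complement for the conversion back to `int64_t`. The third helper only illustrates why the sign bit of the wrapped difference is not a reliable stand-in for the signed comparison `p < q`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: unsigned minimum built from a difference and a mask. */
static uint64_t min_u64_masked(uint64_t p, uint64_t q)
{
    uint64_t diff = p - q;                   /* p - q modulo 2**64        */
    uint64_t mask = 0 - (uint64_t)(p < q);   /* all ones if p < q, else 0 */

    return q + (diff & mask);                /* q + (p - q) = p, or q + 0 */
}

/* Hypothetical helper: signed minimum; unsigned arithmetic avoids signed
   overflow, the conversion of the result assumes two's complement. */
static int64_t min_i64_masked(int64_t p, int64_t q)
{
    uint64_t diff = (uint64_t)p - (uint64_t)q;
    uint64_t mask = 0 - (uint64_t)(p < q);

    return (int64_t)((uint64_t)q + (diff & mask));
}

/* Illustration only: sign bit of the wrapped 64-bit difference p - q. */
static int diff_is_negative(int64_t p, int64_t q)
{
    return (int)(((uint64_t)p - (uint64_t)q) >> 63);
}
```

For `p = INT64_MIN` and `q = 1` the wrapped difference is `INT64_MAX`, so its sign bit is clear although `p < q` holds; the mask in `min_i64_masked` is therefore taken from the comparison.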

`acos()`, `asin()`, `atan()`, `atan2()`, `cos()`, `cosh()`, `exp()`, `fmod()`, `log()`, `log10()`, `pow()`, `sin()`, `sinh()`, `sqrt()`, `tan()` and `tanh()` Standard Functions for i386 Platform

The MSDN article Intrinsics available on all architectures states:

The following math library functions have intrinsic forms on all architectures:

acos acosf acosl

asin asinf asinl

atan atanf atanl

atan2 atan2f atan2l

ceil ceilf ceill

cosh coshf coshl

cos cosf cosl

exp expf expl

floor floorf floorl

fmod fmodf fmodl

log logf logl

log10 log10f log10l

pow powf powl

sin sinf sinl

sinh sinhf sinhl

sqrt sqrtf sqrtl

tan tanf tanl

tanh tanhf tanhl

// Copyleft © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

double fault(double x)
{
    x = acos(x);
    x = asin(x);
    x = atan(x);
    x = atan2(x, x);
    x = cos(x);
    x = cosh(x);
    x = exp(x);
    x = fmod(x, x);
    x = log(x);
    x = log10(x);
    x = pow(x, x);
    x = sin(x);
    x = sinh(x);
    x = sqrt(x);
    x = tan(x);
    x = tanh(x);
    return x;
}

Save the ANSI C source presented above as i386-fault.c in an arbitrary, preferably empty directory, then execute the following 3 command lines to compile and (attempt to) link it a first time:

SET CL=/Oi /W4
SET LINK=/ENTRY:fault /MACHINE:I386 /SUBSYSTEM:CONSOLE
CL.EXE /MD i386-fault.c

Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.

i386-fault.c

Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

/ENTRY:fault /MACHINE:I386 /SUBSYSTEM:CONSOLE
/out:i386-fault.exe
i386-fault.obj
i386-fault.obj : error LNK2019: unresolved external symbol __CItanh referenced in function _fault
i386-fault.obj : error LNK2019: unresolved external symbol __CItan referenced in function _fault
i386-fault.obj : error LNK2019: unresolved external symbol __CIsinh referenced in function _fault
i386-fault.obj : error LNK2019: unresolved external symbol __CIlog10 referenced in function _fault
i386-fault.obj : error LNK2019: unresolved external symbol __CIfmod referenced in function _fault
i386-fault.obj : error LNK2019: unresolved external symbol __CIexp referenced in function _fault
i386-fault.obj : error LNK2019: unresolved external symbol __CIcosh referenced in function _fault
i386-fault.obj : error LNK2019: unresolved external symbol __CIatan2 referenced in function _fault
i386-fault.obj : error LNK2019: unresolved external symbol __CIatan referenced in function _fault
i386-fault.obj : error LNK2019: unresolved external symbol __CIasin referenced in function _fault
i386-fault.obj : error LNK2019: unresolved external symbol __CIacos referenced in function _fault
i386-fault.exe : fatal error LNK1120: 11 unresolved externals

OUCH¹: most obviously nobody at Microsoft ever built an application for the i386 platform which uses one of the floating-point functions acos(), asin(), atan(), atan2(), cosh(), exp(), fmod(), log10(), sinh(), tan() or tanh() with msvcrt.lib!

Repeat the first trial without using the intrinsic functions:

CL.EXE /fp:strict /FImath.h /MD i386-fault.c

Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.

i386-fault.c

Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

/ENTRY:fault /MACHINE:I386 /SUBSYSTEM:CONSOLE
/out:i386-fault.exe
i386-fault.obj
i386-fault.obj : error LNK2019: unresolved external symbol _tanh referenced in function _fault
i386-fault.obj : error LNK2019: unresolved external symbol _tan referenced in function _fault
i386-fault.obj : error LNK2019: unresolved external symbol _sqrt referenced in function _fault
i386-fault.obj : error LNK2019: unresolved external symbol _sinh referenced in function _fault
i386-fault.obj : error LNK2019: unresolved external symbol _sin referenced in function _fault
i386-fault.obj : error LNK2019: unresolved external symbol _pow referenced in function _fault
i386-fault.obj : error LNK2019: unresolved external symbol _log10 referenced in function _fault
i386-fault.obj : error LNK2019: unresolved external symbol _log referenced in function _fault
i386-fault.obj : error LNK2019: unresolved external symbol _fmod referenced in function _fault
i386-fault.obj : error LNK2019: unresolved external symbol _exp referenced in function _fault
i386-fault.obj : error LNK2019: unresolved external symbol _cosh referenced in function _fault
i386-fault.obj : error LNK2019: unresolved external symbol _cos referenced in function _fault
i386-fault.obj : error LNK2019: unresolved external symbol _atan2 referenced in function _fault
i386-fault.obj : error LNK2019: unresolved external symbol _atan referenced in function _fault
i386-fault.obj : error LNK2019: unresolved external symbol _asin referenced in function _fault
i386-fault.obj : error LNK2019: unresolved external symbol _acos referenced in function _fault
i386-fault.exe : fatal error LNK1120: 16 unresolved externals

OUCH²: the steaming pile of crap got even worse!

Execute the following 2 command lines to compile and (attempt to) link i386-fault.c a third time, now as DLL and with the static runtime library libcmt.lib:

SET LINK=/MACHINE:I386 /NOENTRY
CL.EXE /LD i386-fault.c

Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.

i386-fault.c

Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

/MACHINE:I386 /NOENTRY
/out:i386-fault.dll
/dll
/implib:i386-fault.lib
i386-fault.obj
LIBCMT.lib(crt0.obj) : error LNK2019: unresolved external symbol _main referenced in function ___tmainCRTStartup
i386-fault.dll : fatal error LNK1120: 1 unresolved externals

OUCH³: despite building a DLL, the (intrinsic) floating-point functions reference an undocumented internal (startup) routine __tmainCRTStartup() for console applications, which in turn references a main() function – most obviously also nobody at Microsoft ever tried to build a DLL which uses floating-point functions!

Repeat the third trial without using the intrinsic functions:

CL.EXE /fp:strict /FImath.h /LD i386-fault.c

Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.

i386-fault.c

Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

/MACHINE:I386 /NOENTRY
/out:i386-fault.dll
/dll
/implib:i386-fault.lib
i386-fault.obj
LIBCMT.lib(crt0.obj) : error LNK2019: unresolved external symbol _main referenced in function ___tmainCRTStartup
i386-fault.dll : fatal error LNK1120: 1 unresolved externals

OUCH⁴: like before!

Note: the repetition of the last 2 trials in the 64-bit build environment is left as an exercise to the reader!

`_CI*` and `_ftol*` Routines

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.686
	.model	flat, C

single	record	sign:1, exponent:8, mantissa:23

bias	equ	1 shl (width exponent - 1) - 1

	.const

	public	_fltused
_fltused dword	9875h

	.code

; MSC internal intrinsic _CIacos():
; receives argument in FPU st(0), returns result in FPU st(0)

; NOTE: _CIacos() returns correct result for ±0.0 and ±1.0

_CIacos	proc	public

	fld	st(0)		; st(0) = st(1) = argument
	fmul	st(0), st(0)	; st(0) = argument**2,
				; st(1) = argument
	fld1			; st(0) = 1.0,
				; st(1) = argument**2,
				; st(2) = argument
	fsubrp	st(1), st(0)	; st(0) = 1.0 - argument**2,
				; st(1) = argument
	fsqrt			; st(0) = square root of (1.0 - argument**2),
				; st(1) = argument
	fxch	st(1)		; st(0) = argument,
				; st(1) = square root of (1.0 - argument**2)
	fpatan			; st(0) = inverse circular cosine of argument
	ret

_CIacos	endp

; MSC internal intrinsic _CIasin():
; receives argument in FPU st(0), returns result in FPU st(0)

; NOTE: _CIasin() returns correct result for ±0.0 and ±1.0

_CIasin	proc	public

	fld	st(0)		; st(0) = st(1) = argument
	fmul	st(0), st(0)	; st(0) = argument**2,
				; st(1) = argument
	fld1			; st(0) = 1.0,
				; st(1) = argument**2,
				; st(2) = argument
	fsubrp	st(1), st(0)	; st(0) = 1.0 - argument**2,
				; st(1) = argument
	fsqrt			; st(0) = square root of (1.0 - argument**2),
				; st(1) = argument
	fpatan			; st(0) = inverse circular sine of argument
	ret

_CIasin	endp

; MSC internal intrinsic _CIatan():
; receives argument in FPU st(0), returns result in FPU st(0)

; NOTE: _CIatan() returns correct result for ±0.0 and ±INFINITY

_CIatan	proc	public

	fld1			; st(0) = 1.0,
				; st(1) = argument
	fpatan			; st(0) = inverse circular tangent of (argument / 1.0)
	ret

_CIatan	endp

; MSC internal intrinsic _CIatan2():
; receives arguments in FPU st(0) and st(1), returns result in FPU st(0)

; NOTE: _CIatan2() returns correct result for ±0.0 and ±INFINITY

_CIatan2 proc	public

	fxch	st(1)		; st(0) = denominator,
				; st(1) = numerator
	fpatan			; st(0) = inverse circular tangent of (numerator / denominator)
	ret

_CIatan2 endp

; MSC internal intrinsic _CIcos():
; receives argument in FPU st(0), returns result in FPU st(0)

_CIcos	proc	public

	fcos			; st(0) = cosine of argument
	fstsw	ax		; ax = FPU status word,
				; ah = B:C3:T:O:P:C2:C1:C0
	sahf			; SF:ZF:0:AF:0:PF:1:CF = ah
	jnp	short return	; |argument| < 2**63?

	fldpi			; st(0) = pi,
				; st(1) = argument
	fadd	st(0), st(0)	; st(0) = 2.0 * pi,
				; st(1) = argument
	fxch	st(1)		; st(0) = argument,
				; st(1) = 2.0 * pi
reduce:
	fprem1			; st(0) = argument modulo (2.0 * pi),
				; st(1) = 2.0 * pi
	fstsw	ax		; ax = FPU status word,
				; ah = B:C3:T:O:P:C2:C1:C0
	sahf			; SF:ZF:0:AF:0:PF:1:CF = ah
	jp	short reduce

	fstp	st(1)		; st(0) = argument modulo (2.0 * pi)
	fcos			; st(0) = cosine of argument modulo (2.0 * pi)
return:
	ret

_CIcos	endp

; MSC internal intrinsic _CIcosh():
; receives argument in FPU st(0), returns result in FPU st(0)

_CIcosh	proc	public

	call	_CIexp		; st(0) = e**argument
	fld1			; st(0) = 1.0,
				; st(1) = e**argument
	fdiv	st(0), st(1)	; st(0) = 1.0 / e**argument = e**-argument,
				; st(1) = e**argument
	faddp	st(1), st(0)	; st(0) = e**argument + e**-argument
	push	(bias - 1) shl width mantissa
				; [esp] = 0x3F000000
				;       = 0.5F
	fmul	real4 ptr [esp]	; st(0) = hyperbolic cosine of argument
	pop	eax
	ret

_CIcosh	endp

; MSC internal intrinsic _CIexp():
; receives argument in FPU st(0), returns result in FPU st(0)

; NOTE: _CIexp() returns correct result for ±INFINITY

_CIexp	proc	public

	fldl2e			; st(0) = log2(e),
				; st(1) = exponent
	fmulp	st(1), st(0)	; st(0) = exponent * log2(e)
if 0
	fld1			; st(0) = 1.0,
				; st(1) = exponent * log2(e)
	fld	st(1)		; st(0) = exponent * log2(e),
				; st(1) = 1.0,
				; st(2) = exponent * log2(e)
	fprem			; st(0) = (exponent * log2(e)) modulo 1.0,
				; st(1) = 1.0,
				; st(2) = exponent * log2(e)
	f2xm1			; st(0) = 2.0**((exponent * log2(e)) modulo 1.0) - 1.0,
				; st(1) = 1.0,
				; st(2) = exponent * log2(e)
	faddp	st(1), st(0)	; st(0) = 2.0**((exponent * log2(e)) modulo 1.0),
				; st(1) = exponent * log2(e)
	fscale			; st(0) = e**exponent,
				; st(1) = exponent * log2(e)
else
	fld	st(0)		; st(0) = st(1) = exponent * log2(e)
	frndint			; st(0) = integer(exponent * log2(e)),
				; st(1) = exponent * log2(e)
	fsub	st(1), st(0)	; st(0) = integer(exponent * log2(e)),
				; st(1) = fraction(exponent * log2(e))
	fxch	st(1)		; st(0) = fraction(exponent * log2(e)),
				; st(1) = integer(exponent * log2(e))
	f2xm1			; st(0) = 2.0**fraction(exponent * log2(e)) - 1.0,
				; st(1) = integer(exponent * log2(e))
	fld1			; st(0) = 1.0,
				; st(1) = 2.0**fraction(exponent * log2(e)) - 1.0,
				; st(2) = integer(exponent * log2(e))
	faddp	st(1), st(0)	; st(0) = 2.0**fraction(exponent * log2(e)),
				; st(1) = integer(exponent * log2(e))
	fscale			; st(0) = e**exponent,
				; st(1) = integer(exponent * log2(e))
endif
	fstp	st(1)		; st(0) = e**exponent
	ret

_CIexp	endp

; MSC internal intrinsic _CIfmod():
; receives arguments in FPU st(0) and st(1), returns result in FPU st(0)

_CIfmod	proc	public

reduce:
	fprem			; st(0) = remainder,
				; st(1) = divisor
	fstsw	ax		; ax = FPU status word,
				; ah = B:C3:T:O:P:C2:C1:C0
	sahf			; SF:ZF:0:AF:0:PF:1:CF = ah
	jp	short reduce

	fstp	st(1)		; st(0) = remainder
	ret

_CIfmod	endp

; MSC internal intrinsic _CIlog():
; receives argument in FPU st(0), returns result in FPU st(0)

_CIlog	proc	public

	fldln2			; st(0) = ln(2.0),
				; st(1) = argument
	fxch	st(1)		; st(0) = argument,
				; st(1) = ln(2.0)
	fyl2x			; st(0) = natural logarithm of argument
	ret

_CIlog	endp

; MSC internal intrinsic _CIlog10():
; receives argument in FPU st(0), returns result in FPU st(0)

_CIlog10 proc	public

	fldlg2			; st(0) = log10(2.0),
				; st(1) = argument
	fxch	st(1)		; st(0) = argument,
				; st(1) = log10(2.0)
	fyl2x			; st(0) = logarithm to base 10 of argument
	ret

_CIlog10 endp

; MSC internal intrinsic _CIpow():
; receives arguments in FPU st(0) and st(1), returns result in FPU st(0)

_CIpow	proc	public

	fxch	st(1)		; st(0) = base,
				; st(1) = exponent
	fyl2x			; st(0) = exponent * log2(base)
	fld	st(0)		; st(0) = st(1) = exponent * log2(base)
	frndint			; st(0) = integer(exponent * log2(base)),
				; st(1) = exponent * log2(base)
	fsub	st(1), st(0)	; st(0) = integer(exponent * log2(base)),
				; st(1) = fraction(exponent * log2(base))
	fxch	st(1)		; st(0) = fraction(exponent * log2(base)),
				; st(1) = integer(exponent * log2(base))
	f2xm1			; st(0) = 2.0**fraction(exponent * log2(base)) - 1.0,
				; st(1) = integer(exponent * log2(base))
	fld1			; st(0) = 1.0,
				; st(1) = 2.0**fraction(exponent * log2(base)) - 1.0,
				; st(2) = integer(exponent * log2(base))
	faddp	st(1), st(0)	; st(0) = 2.0**fraction(exponent * log2(base)),
				; st(1) = integer(exponent * log2(base))
	fscale			; st(0) = base**exponent,
				; st(1) = integer(exponent * log2(base))
	fstp	st(1)		; st(0) = base**exponent
	ret

_CIpow	endp

; MSC internal intrinsic _CIsin():
; receives argument in FPU st(0), returns result in FPU st(0)

_CIsin	proc	public

	fsin			; st(0) = sine of argument
	fstsw	ax		; ax = FPU status word,
				; ah = B:C3:T:O:P:C2:C1:C0
	sahf			; SF:ZF:0:AF:0:PF:1:CF = ah
	jnp	short return	; |argument| < 2**63?

	fldpi			; st(0) = pi,
				; st(1) = argument
	fadd	st(0), st(0)	; st(0) = 2.0 * pi,
				; st(1) = argument
	fxch	st(1)		; st(0) = argument,
				; st(1) = 2.0 * pi
reduce:
	fprem1			; st(0) = argument modulo (2.0 * pi),
				; st(1) = 2.0 * pi
	fstsw	ax		; ax = FPU status word,
				; ah = B:C3:T:O:P:C2:C1:C0
	sahf			; SF:ZF:0:AF:0:PF:1:CF = ah
	jp	short reduce

	fstp	st(1)		; st(0) = argument modulo (2.0 * pi)
	fsin			; st(0) = sine of argument modulo (2.0 * pi)
return:
	ret

_CIsin	endp

; MSC internal intrinsic _CIsinh():
; receives argument in FPU st(0), returns result in FPU st(0)

_CIsinh	proc	public

	call	_CIexp		; st(0) = e**argument
	fld1			; st(0) = 1.0,
				; st(1) = e**argument
	fdiv	st(0), st(1)	; st(0) = 1.0 / e**argument = e**-argument,
				; st(1) = e**argument
	fsubp	st(1), st(0)	; st(0) = e**argument - e**-argument
	push	(bias - 1) shl width mantissa
				; [esp] = 0x3F000000
				;       = 0.5F
	fmul	real4 ptr [esp]	; st(0) = hyperbolic sine of argument
	pop	eax
	ret

_CIsinh	endp

; MSC internal intrinsic _CIsqrt():
; receives argument in FPU st(0), returns result in FPU st(0)

_CIsqrt	proc	public

	fsqrt			; st(0) = square root of radicand
	ret

_CIsqrt	endp

; MSC internal intrinsic _CItan():
; receives argument in FPU st(0), returns result in FPU st(0)

_CItan	proc	public

	fptan			; st(0) = 1.0,
				; st(1) = tangent of argument
	fstsw	ax		; ax = FPU status word,
				; ah = B:C3:T:O:P:C2:C1:C0
	sahf			; SF:ZF:0:AF:0:PF:1:CF = ah
	jnp	short return	; |argument| < 2**63?

	fldpi			; st(0) = pi,
				; st(1) = argument
	fadd	st(0), st(0)	; st(0) = 2.0 * pi,
				; st(1) = argument
	fxch	st(1)		; st(0) = argument,
				; st(1) = 2.0 * pi
reduce:
	fprem1			; st(0) = argument modulo (2.0 * pi),
				; st(1) = 2.0 * pi
	fstsw	ax		; ax = FPU status word,
				; ah = B:C3:T:O:P:C2:C1:C0
	sahf			; SF:ZF:0:AF:0:PF:1:CF = ah
	jp	short reduce

	fstp	st(1)		; st(0) = argument modulo (2.0 * pi)
	fptan			; st(0) = 1.0,
				; st(1) = tangent of argument modulo (2.0 * pi)
return:
	fstp	st(0)		; st(0) = tangent of argument
	ret

_CItan	endp

; MSC internal intrinsic _CItanh():
; receives argument in FPU st(0), returns result in FPU st(0)

_CItanh	proc	public

	call	_CIexp		; st(0) = e**argument
	fmul	st(0), st(0)	; st(0) = e**argument * e**argument
				;       = e**(argument + argument)
	fld1			; st(0) = 1.0,
				; st(1) = e**(argument + argument)
	fadd	st(1), st(0)	; st(0) = 1.0,
				; st(1) = e**(argument + argument) + 1.0
	fadd	st(0), st(0)	; st(0) = 2.0,
				; st(1) = e**(argument + argument) + 1.0
	fdivrp	st(1), st(0)	; st(0) = 2.0 / (e**(argument + argument) + 1.0)
	fld1			; st(0) = 1.0,
				; st(1) = 2.0 / (e**(argument + argument) + 1.0)
	fsubrp	st(1), st(0)	; st(0) = 1.0 - 2.0 / (e**(argument + argument) + 1.0)
				;       = hyperbolic tangent of argument
	ret

_CItanh	endp

; MSC internal intrinsic _ftol():
; receives argument in FPU st(0), returns result in eax

; NOTE: fistp rounds according to the rounding mode set in the FPU control word,
;       by default to the nearest (even) integer!

_ftol	proc	public

	push	eax
	fistp	dword ptr [esp]	; [esp] = integer(argument)
	pop	eax		; eax = integer(argument)
	ret

_ftol	endp

; MSC internal intrinsic _ftol2():
; receives argument in FPU st(0), returns result in edx:eax

; NOTE: fistp rounds according to the rounding mode set in the FPU control word,
;       by default to the nearest (even) integer!

_ftol2	proc	public

	push	edx
	push	eax
	fistp	qword ptr [esp]	; [esp] = integer(argument)
	pop	eax
	pop	edx		; edx:eax = integer(argument)
	ret

_ftol2	endp

; MSC internal intrinsic _ftol2_sse():
; receives argument in FPU st(0), returns result in edx:eax

; NOTE: fisttp truncates, i.e. rounds towards ±0!

_ftol2_sse proc	public

	push	edx
	push	eax
	fisttp	qword ptr [esp]	; [esp] = integer(argument)
	pop	eax
	pop	edx		; edx:eax = integer(argument)
	ret

_ftol2_sse endp
	end

Save the i386 assembler source presented above as i386-fpu.asm in the directory where you created the object library i386.lib before, then execute the following 3 command lines to generate the object file i386-fpu.obj and add it to the existing object library i386.lib:

SET ML=/c /Gy /safeseh /W3 /X
ML.EXE i386-fpu.asm
LINK.EXE /LIB /OUT:i386.lib i386.lib i386-fpu.obj

For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: if the /Gy option to package every function in its own, separately linkable COMDAT section is not available with the version of the macro assembler ML.EXE you use, split the i386 assembler source into multiple pieces, with one function per source file.

Note: the command lines can be copied and pasted as a block into a Command Processor window!

Microsoft (R) Macro Assembler Version 14.16.27023.1
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: i386-fpu.asm

Microsoft (R) Library Manager Version 14.16.27049.0
Copyright (C) Microsoft Corporation.  All rights reserved.
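
The NOTE comments in the _ftol(), _ftol2() and _ftol2_sse() listings above distinguish rounding to the nearest (even) integer from truncation. The following C sketch illustrates the difference; it is not part of the original sources, the function names are hypothetical, it emulates only the default rounding mode of FISTP, and it assumes that the value fits into a long:

```c
/* Hypothetical helper: truncates like FISTTP, i.e. rounds toward zero,
   which is also what a C cast to an integer type does. */
long truncate_toward_zero(double value)
{
	return (long) value;
}

/* Hypothetical helper: emulates FISTP under the default rounding mode,
   i.e. rounds to the nearest integer, ties to the even integer. */
long round_half_even(double value)
{
	long   whole    = (long) value;			/* truncated toward zero */
	double fraction = value - (double) whole;	/* -1.0 < fraction < 1.0 */

	if (fraction > 0.5 || (fraction == 0.5 && (whole & 1) != 0))
		return whole + 1;

	if (fraction < -0.5 || (fraction == -0.5 && (whole & 1) != 0))
		return whole - 1;

	return whole;
}
```

For the argument 2.5 the first function yields 2 like the second, but for 3.5 truncation yields 3 while rounding to nearest (even) yields 4.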

`memchr()` Standard Function for i386 Platform

Although the memchr() function is neither a compiler helper nor an intrinsic function, it is included here for entertainment due to its ~~quality~~ deficiencies and flaws.

DIR "%source%\intel\mem*.asm"
TYPE "%source%\intel\memchr.asm"

 Volume in drive C has no label.
 Volume Serial Number is 1957-0427

 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel

02/18/2011  03:40 PM             4,097 memccpy.asm
02/18/2011  03:40 PM             5,003 memchr.asm
02/18/2011  03:40 PM            22,486 memcpy.asm
02/18/2011  03:40 PM               475 memmove.asm
02/18/2011  03:40 PM             4,426 memset.asm
               5 File(s)         36,307 bytes
               0 Dir(s)    9,876,543,210 bytes free

        page    ,132
        title   memchr - search memory for a given character
;***
;memchr.asm - search block of memory for a given character
;
;       Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
;       defines memchr() - search memory until a character is
;       found or a limit is reached.
;
;*******************************************************************************

        .xlist
        include cruntime.inc
        .list

page
;***
;char *memchr(buf, chr, cnt) - search memory for given character.
;
;Purpose:
;       Searched at buf for the given character, stopping when chr is
;       first found or cnt bytes have been searched through.
;
;       Algorithm:
;       char *
;       memchr (buf, chr, cnt)
;               char *buf;
;               int chr;
;               unsigned cnt;
;       {
;               while (cnt && *buf++ != c)
;                       cnt--;
;               return(cnt ? --buf : NULL);
;       }
;
;Entry:
;       char *buf - memory buffer to be searched
;       char chr - character to search for
;       unsigned cnt - max number of bytes to search
;
;Exit:
;       returns pointer to first occurence of chr in buf
;       returns NULL if chr not found in the first cnt bytes
;
;Uses:
;
;Exceptions:
;
;*******************************************************************************

        CODESEG

        public  memchr
memchr  proc \
        buf:ptr byte, \
        chr:byte, \
        cnt:dword

        OPTION PROLOGUE:NONE, EPILOGUE:NONE

        .FPO    ( 0, 1, 0, 0, 0, 0 )

        mov     eax,[esp+0ch]   ; eax = count
        push    ebx             ; Preserve ebx

        test    eax,eax         ; check if count=0
        jz      short retnull   ; if count=0, leave

        mov     edx,[esp+8]     ; edx = buffer
        xor     ebx,ebx         ; deleted

        mov     bl,[esp+0ch]    ; bl = search char (deleted)
        movzx   ebx,byte ptr [esp+12]   ; inserted

        test    edx,3           ; test if string is aligned on 32 bits (deleted)
        test    dl,3            ; inserted
        jz      short main_loop_start

str_misaligned:                 ; simple byte loop until string is aligned
        mov     cl,byte ptr [edx]
        add     edx,1           ; deleted
        xor     cl,bl           ; deleted
        inc     edx             ; inserted
        cmp     cl,bl           ; inserted
        je      short found
        sub     eax,1           ; counter-- (deleted)
        dec     eax             ; inserted
        jz      short retnull
        test    edx,3           ; already aligned ? (deleted)
        test    dl,3            ; inserted
        jne     short str_misaligned

main_loop_start:
        sub     eax,4
        jb      short tail_less_then_4

; set all 4 bytes of ebx to [value]
        push    edi             ; Preserve edi
        mov     edi,ebx         ; edi=0/0/0/char (deleted)
        shl     ebx,8           ; ebx=0/0/char/0 (deleted)
        add     ebx,edi         ; ebx=0/0/char/char (deleted)
        mov     edi,ebx         ; edi=0/0/char/char (deleted)
        shl     ebx,10h         ; ebx=char/char/0/0 (deleted)
        add     ebx,edi         ; ebx = all 4 bytes = [search char] (deleted)
        imul    ebx,01010101h   ; inserted
        jmp     short main_loop_entry   ; ecx >=0

return_from_main:
        pop     edi

tail_less_then_4:
        add     eax,4
        jz      retnull

tail_loop:                      ; 0 < eax < 4
        mov     cl,byte ptr [edx]
        add     edx,1           ; deleted
        xor     cl,bl           ; deleted
        inc     edx             ; inserted
        cmp     cl,bl           ; inserted
        je      short found
        sub     eax,1           ; deleted
        dec     eax             ; inserted
        jnz     short tail_loop
retnull:
        pop     ebx
        ret                     ; _cdecl return

main_loop:
        sub     eax,4
        jb      short return_from_main
main_loop_entry:
        mov     ecx,dword ptr [edx]     ; read 4 bytes

        xor     ecx,ebx         ; ebx is byte\byte\byte\byte
        mov     edi,7efefeffh   ; deleted

        add     edi,ecx         ; deleted
        xor     ecx,-1          ; deleted

        xor     ecx,edi         ; deleted
        add     edx,4

        lea     edi,[ecx-01010101h]     ; inserted
        not     ecx             ; inserted
        and     ecx,80808080h   ; inserted
        and     ecx,edi         ; inserted
        and     ecx,81010100h   ; deleted
        je      short main_loop

; found zero byte in the loop?
char_is_found:
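        bsf     ecx,ecx         ; inserted
        shr     ecx,3           ; inserted
        lea     eax,[edx+ecx-4] ; inserted
        pop     edi             ; inserted
        pop     ebx             ; inserted
        ret                     ; inserted
        mov     ecx,[edx - 4]   ; deleted
        xor     cl,bl           ; is it byte 0 (deleted)
        je      short byte_0    ; deleted
        xor     ch,bl           ; is it byte 1 (deleted)
        je      short byte_1    ; deleted
        shr     ecx,10h         ; is it byte 2 (deleted)
        xor     cl,bl           ; deleted
        je      short byte_2    ; deleted
        xor     ch,bl           ; is it byte 3 (deleted)
        je      short byte_3    ; deleted
        jmp     short main_loop ; taken if bits 24-30 are clear and bit
                                ; 31 is set (deleted)

byte_3:                         ; deleted
        pop     edi             ; restore edi (deleted)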
found:
        lea     eax,[edx - 1]
        pop     ebx             ; restore ebx
        ret                     ; _cdecl return

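byte_2:                         ; deleted
        lea     eax,[edx - 2]   ; deleted
        pop     edi             ; deleted
        pop     ebx             ; deleted
        ret                     ; _cdecl return (deleted)

byte_1:                         ; deleted
        lea     eax,[edx - 3]   ; deleted
        pop     edi             ; deleted
        pop     ebx             ; deleted
        ret                     ; _cdecl return (deleted)

byte_0:                         ; deleted
        lea     eax,[edx - 4]   ; deleted
        pop     edi             ; restore edi (deleted)
        pop     ebx             ; restore ebx (deleted)
        ret                     ; _cdecl return (deleted)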

memchr  endp
        end

With 76 instructions in 173 bytes, this routine is yet another true gem!

Oops¹: the ~~deleted~~ XOR instruction followed by the ~~deleted~~ MOV instruction should be replaced with the inserted MOVZX instruction.

Oops²: the ~~deleted~~ TEST instructions with immediate value 3 should be replaced with the inserted shorter ones, saving 6 bytes.

OUCH¹: the ~~deleted~~ ADD and SUB instructions, which increment or decrement by 1, should be replaced with the inserted shorter INC or DEC instructions, saving 4 bytes!

OUCH²: instead of the 6 ~~deleted~~ XOR instructions which perform superfluous partial register write operations the inserted CMP instructions should be used!

OUCH³: instead of the 6 ~~deleted~~ instructions which copy the byte from register BL into the upper 3 bytes of register EBX the single inserted IMUL instruction should be used, saving 8 bytes!

OUCH⁴: instead of the ~~deleted~~ XOR instruction with immediate operand -1 the inserted shorter NOT instruction should be used, saving 1 byte!

OUCH⁵: when the 5 ~~deleted~~ instructions after label main_loop_entry: are replaced with the 4 inserted instructions, the 24 (in words: twenty-four) ~~deleted~~ instructions after label char_is_found: can be replaced with the 6 faster and shorter instructions inserted there, saving 45 (in words: forty-five) bytes!

Note: Alan Mycroft posted the better, faster and shorter method on April 8, 1987 to the USENET news group comp.lang.c:

You might be interested to know that such detection of null bytes in words can be done in 3 or 4 instructions on almost any hardware (nay even in C). (Code that follows relies on x being a 32 bit unsigned (or 2's complement int with overflow ignored)...)

	#define has_nullbyte_(x) ((x - 0x01010101) & ~x & 0x80808080)

Then if e is an expression without side effects (e.g. variable)

	has_nullbyte_(e)

is nonzero iff the value of e has a null byte. (One can view this as explicit programming of the Manchester carry chain present in many processors which is hidden from the instruction level).

Note: see Bit Twiddling Hacks – Determine if a word has a zero byte for a comparison of both methods and more details.
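
As a minimal sketch, Mycroft's expression and its combination with the multiplication by 01010101h used in the modified routine can be written in C as follows; the function names are hypothetical. The sketch also illustrates a property the assembler routines depend on: the result is nonzero exactly when some byte is zero, and the lowest flag always marks a real zero byte, but a byte with value 1 directly above a zero byte gets flagged too, because the subtraction borrows from it.

```c
#include <stdint.h>

/* Alan Mycroft's test: nonzero if and only if some byte of word is zero. */
uint32_t has_nullbyte(uint32_t word)
{
	return (word - 0x01010101u) & ~word & 0x80808080u;
}

/* Hypothetical helper: nonzero if and only if some byte of word equals
   character; the multiplication copies the character into all 4 bytes,
   like IMUL with 01010101h, and the XOR turns matching bytes into zero. */
uint32_t has_byte(uint32_t word, unsigned char character)
{
	return has_nullbyte(word ^ ((uint32_t) character * 0x01010101u));
}
```

For the word 0x7F7F0100 the expression yields 0x00008080: bit 7 flags the real zero byte, bit 15 is a spurious flag for the byte 0x01 above it. A forward search which takes the lowest flag, for example with BSF, is not affected; code which takes the highest flag would need an exact per-byte test instead.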

Note: with the modifications shown in the source, this routine has 51 instructions in 118 bytes, i.e. two thirds of the original’s instructions and bytes.

Naïve Implementation in i386 Assembler

A proper naïve implementation needs only 46 instructions in just 104 bytes, i.e. 60 % of Microsoft’s poor implementation:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.386
	.model  flat, C
	.code

memchr	proc	public		; void *memchr(void const *buffer, int character, size_t count)

	mov	eax, [esp+12]	; eax = count
	test	eax, eax
	jz	short return	; count = 0?

	movzx	ecx, byte ptr [esp+8]
	mov	edx, [esp+4]	; edx = address of buffer
head:
	test	dl, 3
	jz	short aligned	; address % 4 = 0?
unaligned:
	cmp	cl, [edx]
	je	short break

	inc	edx
	dec	eax
	jnz	short head	; count > 0?

	ret
aligned:
	sub	eax, 4
	jb	short tail	; count < 4?

	push	edi
	push	esi
	imul	ecx, 01010101h	; ecx = character
				;     | character << 8
				;     | character << 16
				;     | character << 24
next:
	mov	edi, [edx]	; edi = next 4 aligned bytes
	xor	edi, ecx
	lea	esi, [edi-01010101h]
	not	edi
	and	edi, 80808080h
	and	edi, esi
	jnz	short match

	add	edx, 4
	sub	eax, 4
	jae	short next	; count >= 4?

	pop	esi
	pop	edi
tail:
	add	eax, 4
	jz	short return	; count = 0?
@@:
	cmp	cl, [edx]
	je	short break

	inc	edx
	dec	eax
	jnz	short @b	; count > 0?
return:
	ret
break:
	mov	eax, edx	; eax = address of character
	ret
match:
	bsf	eax, edi	; eax = offset of character * 8 + 7
				;     = {7, 15, 23, 31}
	shr	eax, 3		; eax = offset of character
				;     = {0, 1, 2, 3}
	add	eax, edx	; eax = address of character
	pop	esi
	pop     edi
	ret

memchr	endp
	end

Save the i386 assembler source presented above as i386-memchr.asm and the ANSI C source presented below as i386-memchr.c, then execute the 6 command lines following the ANSI C source to assemble i386-memchr.asm, compile i386-memchr.c, link the generated object files i386-memchr.obj and i386-memchr.tmp, and finally execute the image file i386-memchr.exe to demonstrate the correct operation:

// Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#define STRICT
#undef UNICODE
#define WIN32_LEAN_AND_MEAN

#include <windows.h>

__declspec(safebuffers)
BOOL	CDECL	PrintFormat(HANDLE hOutput, [SA_FormatString(Style="printf")] LPCSTR lpFormat, ...)
{
	CHAR	szFormat[1024];
	DWORD	dwFormat;
	DWORD	dwOutput;

	va_list	vaInput;
	va_start(vaInput, lpFormat);

	dwFormat = wvsprintf(szFormat, lpFormat, vaInput);

	va_end(vaInput);

	if ((dwFormat == 0UL)
	 || !WriteFile(hOutput, szFormat, dwFormat, &dwOutput, (LPOVERLAPPED) NULL))
		return FALSE;

	return dwOutput == dwFormat;
}

const	CHAR	szString[] = "^^9876543210$$";
const	LPCSTR	szFormat[] = {"0x%p: memchr(\"%hs\", \'$\', %lu) = NULL\r\n",
		              "0x%p: memchr(\"%hs\", \'$\', %lu) = 0x%p = \"%hs\"\r\n"};

__declspec(noreturn)
VOID	CDECL	mainCRTStartup(VOID)
{
	LPCSTR	lp;
	LPCSTR	lpString = szString + sizeof(szString);
	DWORD	dwError = ERROR_SUCCESS;
	HANDLE	hOutput = GetStdHandle(STD_OUTPUT_HANDLE);

	if (hOutput == INVALID_HANDLE_VALUE)
		dwError = GetLastError();
	else
		while (--lpString >= szString)
		{
			lp = memchr(lpString, '$', szString + sizeof(szString) - lpString);

			PrintFormat(hOutput,
			            szFormat[lp != NULL],
			            lpString, lpString, szString + sizeof(szString) - lpString, lp, lp);
		}

	ExitProcess(dwError);
}

SET ML=/c /safeseh /W3 /X
ML.EXE i386-memchr.asm
SET CL=/GAFy /Oy /W4 /Zl
SET LINK=/DEFAULTLIB:kernel32.lib /DEFAULTLIB:user32.lib /ENTRY:mainCRTStartup /SUBSYSTEM:CONSOLE
CL.EXE /Foi386-memchr.tmp i386-memchr.obj i386-memchr.c
.\i386-memchr.exe

For details and reference see the MSDN articles Compiler Options and Linker Options.

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: the command lines can be copied and pasted as a block into a Command Processor window!

Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: i386-memchr.asm

Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.

i386-memchr.c

Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

/DEFAULTLIB:kernel32.lib /DEFAULTLIB:user32.lib /ENTRY:mainCRTStartup /SUBSYSTEM:CONSOLE
/out:i386-memchr.exe
i386-memchr.obj
i386-memchr.tmp

0x002F2082: memchr("", '$', 1) = NULL
0x002F2081: memchr("$", '$', 2) = 0x002F2081 = "$"
0x002F2080: memchr("$$", '$', 3) = 0x002F2080 = "$$"
0x002F207F: memchr("0$$", '$', 4) = 0x002F2080 = "$$"
0x002F207E: memchr("10$$", '$', 5) = 0x002F2080 = "$$"
0x002F207D: memchr("210$$", '$', 6) = 0x002F2080 = "$$"
0x002F207C: memchr("3210$$", '$', 7) = 0x002F2080 = "$$"
0x002F207B: memchr("43210$$", '$', 8) = 0x002F2080 = "$$"
0x002F207A: memchr("543210$$", '$', 9) = 0x002F2080 = "$$"
0x002F2079: memchr("6543210$$", '$', 10) = 0x002F2080 = "$$"
0x002F2078: memchr("76543210$$", '$', 11) = 0x002F2080 = "$$"
0x002F2077: memchr("876543210$$", '$', 12) = 0x002F2080 = "$$"
0x002F2076: memchr("9876543210$$", '$', 13) = 0x002F2080 = "$$"
0x002F2075: memchr("^9876543210$$", '$', 14) = 0x002F2080 = "$$"
0x002F2074: memchr("^^9876543210$$", '$', 15) = 0x002F2080 = "$$"
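
For readers who prefer C, the structure of the naïve routine (a byte loop until the address is aligned, a loop over 4 aligned bytes with Mycroft's test, a byte loop for the tail) can be sketched as follows. This is an illustration under assumptions, not a translation: the name memchr_sketch is hypothetical, and instead of BSF the sketch lets the tail loop rescan the matching word byte by byte, which keeps it independent of the byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name, chosen to avoid a clash with the standard memchr(). */
void *memchr_sketch(void const *buffer, int character, size_t count)
{
	unsigned char const *bytes = (unsigned char const *) buffer;
	unsigned char const  byte  = (unsigned char) character;
	uint32_t const       mask  = (uint32_t) byte * 0x01010101u;
	uint32_t             word;

	/* head: single bytes until the address is a multiple of 4 */
	for (; count > 0 && ((uintptr_t) bytes & 3) != 0; bytes++, count--)
		if (*bytes == byte)
			return (void *) bytes;

	/* body: 4 aligned bytes at once, tested with Mycroft's expression */
	for (; count >= 4; bytes += 4, count -= 4)
	{
		memcpy(&word, bytes, sizeof(word));
		word ^= mask;		/* matching bytes become zero */

		if (((word - 0x01010101u) & ~word & 0x80808080u) != 0)
			break;		/* the tail loop locates the matching byte */
	}

	/* tail: the remaining bytes, or the bytes of the matching word */
	for (; count > 0; bytes++, count--)
		if (*bytes == byte)
			return (void *) bytes;

	return NULL;
}
```

Called with the string "^^9876543210$$" of the test program and its full size, the sketch returns the address of the first '$', i.e. the address of the string plus 12.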

Smart Implementation in i386 Assembler

A proper smart implementation without loops for head and tail needs only 40 instructions in 102 bytes, i.e. about half the instructions of Microsoft’s poor implementation; the corresponding smart implementation of the missing memrchr() function also has 40 instructions, in 101 bytes:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.386
	.model  flat, C
	.code

memchr	proc	public		; void *memchr(void const *buffer, int character, size_t count)

	xor	eax, eax	; eax = 0
	cmp	eax, [esp+12]
	je	short return	; count = 0?

	cdq			; edx = 0
	mov	dl, [esp+8]	; edx = character
	imul	edx, 01010101h	; edx = character
				;     | character << 8
				;     | character << 16
				;     | character << 24
	mov	ecx, [esp+4]	; ecx = address of buffer
	add	[esp+12], ecx	; count = address after buffer
	push	ebx
	mov	ebx, ecx	; ebx = address of buffer
	and	ecx, 3		; ecx = address of buffer % 4
				;     = 4 - number of unaligned bytes
	jz	short aligned	; address of buffer % 4 = 0?
unaligned:
	sub	ebx, ecx	; ebx = aligned address before buffer
	shl	ecx, 3		; ecx = (4 - number of unaligned bytes) * 8
				;     = 32 - number of unaligned bits
	dec	eax		; eax = ~0
	shl	eax, cl		; eax = ~0 for unaligned bytes, 0 elsewhere
	not	eax		; eax = 0 for unaligned bytes, ~0 elsewhere
	mov	ecx, [ebx]	; ecx = unaligned bytes
	xor	ecx, edx	; ecx = unaligned bytes ^ multiplied character
	or	eax, ecx	; eax = '\0' for matching bytes
	jmp	short mycroft
next:
	add	ebx, 4		; ebx = address of next 4 aligned bytes
	cmp	ebx, [esp+16]
	jae	short null	; address after buffer?
aligned:
	mov	eax, [ebx]	; eax = next 4 aligned bytes
	xor	eax, edx	; eax = next 4 aligned bytes ^ multiplied character
				;     = '\0' for matching bytes
mycroft:
	mov	ecx, eax
	sub	eax, 01010101h
	not	ecx
	and	eax, 80808080h
	and	eax, ecx	; eax = '\200' for matching bytes, '\0' elsewhere
	jz	short next	; no match in any byte?
match:
	bsf	eax, eax	; eax = offset of matching byte * 8 + 7
				;     = {7, 15, 23, 31}
	shr	eax, 3		; eax = offset of matching byte
				;     = {0, 1, 2, 3}
	add	eax, ebx	; eax = address of matching byte
	cmp	eax, [esp+16]	; CF = (address inside buffer)
	sbb	ecx, ecx	; ecx = (address inside buffer) ? -1 : 0
	and	eax, ecx	; eax = address of character
null:
	pop	ebx
return:
	ret

memchr	endp

memrchr	proc	public		; void *memrchr(void const *buffer, int character, size_t count)

	xor	eax, eax	; eax = 0
	cmp	eax, [esp+12]
	je	short return	; count = 0?

	cdq			; edx = 0
	mov	dl, [esp+8]	; edx = character
	imul	edx, 01010101h	; edx = character
				;     | character << 8
				;     | character << 16
				;     | character << 24
	mov	ecx, [esp+4]	; ecx = address of buffer
	add	ecx, [esp+12]	; ecx = address after buffer
	push	ebx
	mov	ebx, ecx	; ebx = address after buffer
	and	ecx, 3		; ecx = address after buffer % 4
				;     = number of tail bytes
	jz	short previous	; address after buffer % 4 = 0?
unaligned:
	sub	ebx, ecx	; ebx = aligned address of tail bytes
	shl	ecx, 3		; ecx = number of tail bytes * 8
				;     = number of tail bits
	dec	eax		; eax = ~0
	shl	eax, cl		; eax = 0 for tail bytes, ~0 elsewhere
	mov	ecx, [ebx]	; ecx = tail bytes
	xor	ecx, edx	; ecx = tail bytes ^ multiplied character
	or	eax, ecx	; eax = '\0' for matching tail bytes
	jmp	short mycroft
previous:
	cmp	ebx, [esp+8]
	jbe	short null	; no bytes left before aligned address?
	sub	ebx, 4		; ebx = address of previous 4 aligned bytes
aligned:
	mov	eax, [ebx]	; eax = previous 4 aligned bytes
	xor	eax, edx	; eax = previous 4 aligned bytes ^ multiplied character
				;     = '\0' for matching bytes
mycroft:
	mov	ecx, eax
	sub	eax, 01010101h
	not	ecx
	and	eax, 80808080h
	and	eax, ecx	; eax = '\200' for matching bytes, '\0' elsewhere
	jz	short previous	; no match in any byte?
match:
	bsr	eax, eax	; eax = offset of matching byte * 8 + 7
				;     = {31, 23, 15, 7}
	shr	eax, 3		; eax = offset of matching byte
				;     = {3, 2, 1, 0}
	add	eax, ebx	; eax = address of matching byte
	cmp	eax, [esp+8]	; CF = (address of matching byte < address of buffer)
	cmc			; CF = (address of matching byte >= address of buffer)
	sbb	ecx, ecx	; ecx = (address inside buffer) ? -1 : 0
	and	eax, ecx	; eax = address of character
null:
	pop	ebx
return:
	ret

memrchr	endp
	end
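
The way the smart memchr() routine avoids a loop for the head can be sketched in C as follows: it reads the aligned word which contains the first byte, forces the bytes before the buffer to 0xFF so that they can't match, and discards a match after the buffer at the very end. To stay within defined behaviour, this sketch works on byte indices into an array of 32-bit words instead of addresses; the function name, the index-based interface and the return value (size_t) -1 for "not found" are assumptions made for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: byte i of the data is bits (i % 4) * 8 and up of
   words[i / 4]; searches the bytes first .. first + count - 1 and returns
   the index of the first byte equal to character, else (size_t) -1. */
size_t find_byte(uint32_t const *words, size_t first, size_t count, unsigned char character)
{
	uint32_t const mask  = (uint32_t) character * 0x01010101u;
	size_t const   end   = first + count;
	size_t         index = first & ~(size_t) 3;	/* aligned index at or before first */
	uint32_t       word, flags;

	if (count == 0)
		return (size_t) -1;

	word = words[index / 4] ^ mask;		/* matching bytes become zero */
	if (first != index)			/* force the bytes before first to 0xFF */
		word |= ~(~(uint32_t) 0 << (first - index) * 8);

	for (;;)
	{
		flags = (word - 0x01010101u) & ~word & 0x80808080u;
		if (flags != 0)
			break;

		index += 4;
		if (index >= end)
			return (size_t) -1;

		word = words[index / 4] ^ mask;
	}

	while ((flags & 0x80u) == 0)		/* locate the lowest flag, like BSF */
	{
		flags >>= 8;
		index++;
	}

	return index < end ? index : (size_t) -1;	/* discard a match after the end */
}
```

The words array must cover every word touched by the range, which mirrors the assumption of the assembler routine that an aligned 4-byte read never crosses a page boundary.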

Implementation with SSE2 Instructions in i386 Assembler

Implementations of memchr() and memrchr() for processors which support the Streaming SIMD Extensions 2 alias Willamette New Instructions, introduced November 20, 2000 with the Pentium^®4, need only 33 and 35 instructions in 101 and 104 bytes, respectively:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.xmm
	.model	flat, C
	.code

memchr	proc	public		; void *memchr(void const *buffer, int character, size_t count)

	xor	eax, eax	; eax = 0
	cmp	eax, [esp+12]
	je	short null	; count = 0?
if 0
	movd	xmm0, dword ptr [esp+8]
	punpcklbw xmm0, xmm0
	punpcklwd xmm0, xmm0
else
	mov	al, [esp+8]	; eax = character
	imul	eax, 01010101h	; eax = character
				;     | character << 8
				;     | character << 16
				;     | character << 24
	movd	xmm0, eax
endif
	pshufd	xmm0, xmm0, 0	; xmm0 = multiplied character
	mov	ecx, [esp+4]	; ecx = address of buffer
	add	[esp+12], ecx	; count = address after buffer
	mov	edx, ecx
	and	ecx, 15		; ecx = address of buffer % 16
				;     = 16 - number of unaligned bytes
	jz	short aligned
unaligned:
	sub	edx, ecx	; edx = aligned address before buffer
	movdqa	xmm1, [edx]	; xmm1 = chunk of 16 bytes
	pcmpeqb	xmm1, xmm0	; xmm1 = '\377' for each matching byte in chunk
	pmovmskb eax, xmm1	; eax = bitmask for matching bytes in chunk
	shr	eax, cl
	shl	eax, cl		; eax = bitmask for matching bytes in buffer
	jnz	short match
next:
	add	edx, 16		; edx = address of next chunk of aligned bytes
	cmp	edx, [esp+12]
	jae	short null	; address after buffer?
aligned:
	movdqa	xmm1, [edx]	; xmm1 = chunk of 16 bytes
	pcmpeqb	xmm1, xmm0	; xmm1 = '\377' for each matching byte in chunk
	pmovmskb eax, xmm1	; eax = bitmask for matching bytes in chunk
	test	eax, eax
	jz	short next	; no matching byte in chunk?
match:
	bsf	eax, eax	; eax = offset of matching byte in chunk
	add	eax, edx	; eax = address of matching byte
	cmp	eax, [esp+12]	; CF = (address inside buffer)
	sbb	ecx, ecx	; ecx = (address inside buffer) ? -1 : 0
	and	eax, ecx	; eax = address of character
null:
	ret

memchr	endp

memrchr	proc	public		; void *memrchr(void const *buffer, int character, size_t count)

	xor	eax, eax	; eax = 0
	cmp	eax, [esp+12]
	je	short null	; count = 0?
if 0
	movd	xmm0, dword ptr [esp+8]
	punpcklbw xmm0, xmm0
	punpcklwd xmm0, xmm0
else
	mov	al, [esp+8]	; eax = character
	imul	eax, 01010101h	; eax = character
				;     | character << 8
				;     | character << 16
				;     | character << 24
	movd	xmm0, eax
endif
	pshufd	xmm0, xmm0, 0	; xmm0 = multiplied character
	mov	ecx, [esp+4]	; ecx = address of buffer
	add	ecx, [esp+12]	; ecx = address after buffer
	mov	edx, ecx
	and	ecx, 15		; ecx = address after buffer % 16
				;     = number of tail bytes
	jz	short previous
unaligned:
	sub	edx, ecx	; edx = aligned address of tail bytes
	movdqa	xmm1, [edx]	; xmm1 = chunk of 16 bytes
	pcmpeqb	xmm1, xmm0	; xmm1 = '\377' for each matching byte in chunk
	pmovmskb eax, xmm1	; eax = bitmask for matching bytes in chunk
	neg	ecx		; ecx = -number of tail bytes
	shl	eax, cl
	shr	eax, cl		; eax = bitmask for matching bytes in buffer
	jnz	short match
previous:
	cmp	edx, [esp+4]
	jbe	short null	; no bytes left before aligned address?
	sub	edx, 16		; edx = address of previous chunk of aligned bytes
aligned:
	movdqa	xmm1, [edx]	; xmm1 = chunk of 16 bytes
	pcmpeqb	xmm1, xmm0	; xmm1 = '\377' for each matching byte in chunk
	pmovmskb eax, xmm1	; eax = bitmask for matching bytes in chunk
	test	eax, eax
	jz	short previous	; no matching byte in chunk?
match:
	bsr	eax, eax	; eax = offset of matching byte in chunk
	add	eax, edx	; eax = address of matching byte
	cmp	eax, [esp+4]	; CF = (address of matching byte < address of buffer)
	cmc			; CF = (address of matching byte >= address of buffer)
	sbb	ecx, ecx	; ecx = (address inside buffer) ? -1 : 0
	and	eax, ecx	; eax = address of character
null:
	ret

memrchr	endp
	end
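
The pairs of shift instructions in the unaligned paths of both routines clear those bits of the PMOVMSKB bitmask which belong to bytes outside the buffer. A minimal C sketch of this masking follows; the function names are hypothetical, and both functions assume a shift count greater than 0 and less than 32. While the assembler code can use NEG because the processor takes the shift count in CL modulo 32, C requires the explicit difference, since shifting a 32-bit value by 32 or more bits is undefined.

```c
#include <stdint.h>

/* memchr(): discard the matches in the offset bytes before the buffer,
   like SHR EAX, CL followed by SHL EAX, CL. */
uint32_t discard_before(uint32_t matches, unsigned offset)
{
	return matches >> offset << offset;
}

/* memrchr(): keep only the matches in the tail bytes of the buffer,
   like NEG ECX followed by SHL EAX, CL and SHR EAX, CL. */
uint32_t discard_after(uint32_t matches, unsigned tail)
{
	return matches << (32 - tail) >> (32 - tail);
}
```

With all 16 bytes of a chunk matching and 3 bytes before the buffer, discard_before(0xFFFF, 3) leaves 0xFFF8; with 3 tail bytes, discard_after(0xFFFF, 3) leaves 0x0007.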

Implementation with SSSE3 Instructions in i386 Assembler

Implementations of memchr() and memrchr() for processors which support the Supplemental Streaming SIMD Extensions 3 alias Merom New Instructions, introduced June 26, 2006 with the Core^™ micro-architecture, need 32 and 34 instructions in 97 and 100 bytes, respectively:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.xmm
	.model	flat, C
	.code

memchr	proc	public		; void *memchr(void const *buffer, int character, size_t count)

	xor	eax, eax	; eax = 0
	cmp	eax, [esp+12]
	je	short null	; count = 0?

	pxor	xmm0, xmm0	; xmm0 = 0
	movd	xmm1, dword ptr [esp+8]
				; xmm1 = character
	pshufb	xmm1, xmm0	; xmm1 = multiplied character
	mov	ecx, [esp+4]	; ecx = address of buffer
	add	[esp+12], ecx	; count = address after buffer
	mov	edx, ecx
	and	ecx, 15		; ecx = address of buffer % 16
				;     = 16 - number of unaligned bytes
	jz	short aligned
unaligned:
	sub	edx, ecx	; edx = aligned address before buffer
	movdqa	xmm0, [edx]	; xmm0 = chunk of 16 bytes
	pcmpeqb	xmm0, xmm1	; xmm0 = '\377' for each matching byte in chunk
	pmovmskb eax, xmm0	; eax = bitmask for matching bytes in chunk
	shr	eax, cl
	shl	eax, cl		; eax = bitmask for matching bytes in buffer
	jnz	short match
next:
	add	edx, 16		; edx = address of next chunk of aligned bytes
	cmp	edx, [esp+12]
	jae	short null	; address after buffer?
aligned:
	movdqa	xmm0, [edx]	; xmm0 = chunk of 16 bytes
	pcmpeqb	xmm0, xmm1	; xmm0 = '\377' for each matching byte in chunk
	pmovmskb eax, xmm0	; eax = bitmask for matching bytes in chunk
	test	eax, eax
	jz	short next	; no matching byte in chunk?
match:
	bsf	eax, eax	; eax = offset of matching byte in chunk
	add	eax, edx	; eax = address of matching byte
	cmp	eax, [esp+12]	; CF = (address inside buffer)
	sbb	ecx, ecx	; ecx = (address inside buffer) ? -1 : 0
	and	eax, ecx	; eax = address of character
null:
	ret

memchr	endp

memrchr	proc	public		; void *memrchr(void const *buffer, int character, size_t count)

	xor	eax, eax	; eax = 0
	cmp	eax, [esp+12]
	je	short null	; count = 0?

	pxor	xmm0, xmm0	; xmm0 = 0
	movd	xmm1, dword ptr [esp+8]
				; xmm1 = character
	pshufb	xmm1, xmm0	; xmm1 = multiplied character
	mov	ecx, [esp+4]	; ecx = address of buffer
	add	ecx, [esp+12]	; ecx = address after buffer
	mov	edx, ecx
	and	ecx, 15		; ecx = address after buffer % 16
				;     = number of tail bytes
	jz	short previous
unaligned:
	sub	edx, ecx	; edx = aligned address of tail bytes
	movdqa	xmm0, [edx]	; xmm0 = chunk of 16 bytes
	pcmpeqb	xmm0, xmm1	; xmm0 = '\377' for each matching byte in chunk
	pmovmskb eax, xmm0	; eax = bitmask for matching bytes in chunk
	neg	ecx		; ecx = -number of tail bytes
	shl	eax, cl
	shr	eax, cl		; eax = bitmask for matching bytes in buffer
	jnz	short match
previous:
	cmp	edx, [esp+4]
	jbe	short null	; no bytes left before aligned address?
	sub	edx, 16		; edx = address of previous chunk of aligned bytes
aligned:
	movdqa	xmm0, [edx]	; xmm0 = chunk of 16 bytes
	pcmpeqb	xmm0, xmm1	; xmm0 = '\377' for each matching byte in chunk
	pmovmskb eax, xmm0	; eax = bitmask for matching bytes in chunk
	test	eax, eax
	jz	short previous	; no matching byte in chunk?
match:
	bsr	eax, eax	; eax = offset of matching byte in chunk
	add	eax, edx	; eax = address of matching byte
	cmp	eax, [esp+4]	; CF = (address of matching byte < address of buffer)
	cmc			; CF = (address of matching byte >= address of buffer)
	sbb	ecx, ecx	; ecx = (address inside buffer) ? -1 : 0
	and	eax, ecx	; eax = address of character
null:
	ret

memrchr	endp
	end
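
All memchr() routines above end with the instruction sequence CMP, SBB and AND, which returns NULL without a branch when the matching byte lies after the buffer. A rough C equivalent, with a hypothetical function name and addresses represented as uintptr_t values:

```c
#include <stdint.h>

/* Hypothetical helper: returns address if it lies before end, else 0. */
uintptr_t clamp_to_buffer(uintptr_t address, uintptr_t end)
{
	/* all bits set if address < end, else 0, like CMP followed by SBB ECX, ECX */
	uintptr_t const inside = (uintptr_t) 0 - (uintptr_t) (address < end);

	return address & inside;	/* like AND EAX, ECX */
}
```

The memrchr() routines compare against the address of the buffer instead and need the additional CMC instruction to invert the carry flag, since there a match is valid when its address is not below the buffer.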

Implementation with AVX Instructions in i386 Assembler

Implementations of memchr() and memrchr() for processors which support the Advanced Vector Extensions alias Sandy Bridge New Instructions, introduced January 9, 2011 with the Sandy Bridge micro-architecture, need 30 and 32 instructions in 89 and 92 bytes, respectively:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.xmm
	.model	flat, C
	.code

memchr	proc	public		; void *memchr(void const *buffer, int character, size_t count)

	xor	eax, eax	; eax = 0
	cmp	eax, [esp+12]
	je	short null	; count = 0?

	vpxor	xmm0, xmm0, xmm0; xmm0 = 0
	vmovd	xmm1, dword ptr [esp+8]
				; xmm1 = character
	vpshufb	xmm1, xmm1, xmm0; xmm1 = multiplied character
	mov	ecx, [esp+4]	; ecx = address of buffer
	add	[esp+12], ecx	; count = address after buffer
	mov	edx, ecx
	and	ecx, 15		; ecx = address of buffer % 16
				;     = 16 - number of unaligned bytes
	jz	short aligned
unaligned:
	sub	edx, ecx	; edx = aligned address before buffer
	vpcmpeqb xmm0, xmm1, [edx]
				; xmm0 = '\377' for each matching byte in chunk
	vpmovmskb eax, xmm0	; eax = bitmask for matching bytes in chunk
	shr	eax, cl
	shl	eax, cl		; eax = bitmask for matching bytes in buffer
	jnz	short match
next:
	add	edx, 16		; edx = address of next chunk of aligned bytes
	cmp	edx, [esp+12]
	jae	short null	; address after buffer?
aligned:
	vpcmpeqb xmm0, xmm1, [edx]
				; xmm0 = '\377' for each matching byte in chunk
	vpmovmskb eax, xmm0	; eax = bitmask for matching bytes in chunk
	test	eax, eax
	jz	short next	; no matching byte in chunk?
match:
	bsf	eax, eax	; eax = offset of matching byte in chunk
	add	eax, edx	; eax = address of matching byte
	cmp	eax, [esp+12]	; CF = (address inside buffer)
	sbb	ecx, ecx	; ecx = (address inside buffer) ? -1 : 0
	and	eax, ecx	; eax = address of character
null:
	ret

memchr	endp

memrchr	proc	public		; void *memrchr(void const *buffer, int character, size_t count)

	xor	eax, eax	; eax = 0
	cmp	eax, [esp+12]
	je	short null	; count = 0?

	vpxor	xmm0, xmm0, xmm0; xmm0 = 0
	vmovd	xmm1, dword ptr [esp+8]
				; xmm1 = character
	vpshufb	xmm1, xmm1, xmm0; xmm1 = multiplied character
	mov	ecx, [esp+4]	; ecx = address of buffer
	add	ecx, [esp+12]	; ecx = address after buffer
	mov	edx, ecx
	and	ecx, 15		; ecx = address after buffer % 16
				;     = number of tail bytes
	jz	short previous
unaligned:
	sub	edx, ecx	; edx = aligned address of tail bytes
	vpcmpeqb xmm0, xmm1, [edx]
				; xmm0 = '\377' for each matching byte in chunk
	vpmovmskb eax, xmm0	; eax = bitmask for matching bytes in chunk
	neg	ecx		; ecx = -number of tail bytes
	shl	eax, cl
	shr	eax, cl		; eax = bitmask for matching bytes in buffer
	jnz	short match
previous:
	cmp	edx, [esp+4]
	jbe	short null	; no bytes left before aligned address?
	sub	edx, 16		; edx = address of previous chunk of aligned bytes
aligned:
	vpcmpeqb xmm0, xmm1, [edx]
				; xmm0 = '\377' for each matching byte in chunk
	vpmovmskb eax, xmm0	; eax = bitmask for matching bytes in chunk
	test	eax, eax
	jz	short previous	; no matching byte in chunk?
match:
	bsr	eax, eax	; eax = offset of matching byte in chunk
	add	eax, edx	; eax = address of matching byte
	cmp	eax, [esp+4]	; CF = (address of matching byte < address of buffer)
	cmc			; CF = (address of matching byte >= address of buffer)
	sbb	ecx, ecx	; ecx = (address inside buffer) ? -1 : 0
	and	eax, ecx	; eax = address of character
null:
	ret

memrchr	endp
	end

Implementation with AVX2 Instructions in i386 Assembler

Implementations of memchr() and memrchr() for processors which support the Advanced Vector Extensions 2 alias Haswell New Instructions, introduced June 4, 2013 with the Haswell micro-architecture, need 28 and 30 instructions in 82 and 85 bytes, respectively:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.ymm
	.model	flat, C
	.code

memchr	proc	public		; void *memchr(void const *buffer, int character, size_t count)

	xor	eax, eax	; eax = 0
	cmp	eax, [esp+12]
	je	short null	; count = 0?

	vpbroadcastb ymm0, byte ptr [esp+8]
				; ymm0 = multiplied character
	mov	ecx, [esp+4]	; ecx = address of buffer
	add	[esp+12], ecx	; count = address after buffer
	mov	edx, ecx
	and	ecx, 31		; ecx = address of buffer % 32
				;     = 32 - number of unaligned bytes
	jz	short aligned
unaligned:
	sub	edx, ecx	; edx = aligned address before buffer
	vpcmpeqb ymm1, ymm0, [edx]
				; ymm1 = '\377' for each matching byte in chunk
	vpmovmskb eax, ymm1	; eax = bitmask for matching bytes in chunk
	shr	eax, cl
	shl	eax, cl		; eax = bitmask for matching bytes in buffer
	jnz	short match
next:
	add	edx, 32		; edx = address of next chunk of aligned bytes
	cmp	edx, [esp+12]
	jae	short null	; address after buffer?
aligned:
	vpcmpeqb ymm1, ymm0, [edx]
				; ymm1 = '\377' for each matching byte in chunk
	vpmovmskb eax, ymm1	; eax = bitmask for matching bytes in chunk
	test	eax, eax
	jz	short next	; no matching byte in chunk?
match:
	bsf	eax, eax	; eax = offset of matching byte in chunk
	add	eax, edx	; eax = address of matching byte
	cmp	eax, [esp+12]	; CF = (address inside buffer)
	sbb	ecx, ecx	; ecx = (address inside buffer) ? -1 : 0
	and	eax, ecx	; eax = address of character
null:
	ret

memchr	endp

memrchr	proc	public		; void *memrchr(void const *buffer, int character, size_t count)

	xor	eax, eax	; eax = 0
	cmp	eax, [esp+12]
	je	short null	; count = 0?

	vpbroadcastb ymm0, byte ptr [esp+8]
				; ymm0 = multiplied character
	mov	ecx, [esp+4]	; ecx = address of buffer
	add	ecx, [esp+12]	; ecx = address after buffer
	mov	edx, ecx
	and	ecx, 31		; ecx = address after buffer % 32
				;     = number of tail bytes
	jz	short previous	; no tail bytes?
unaligned:
	sub	edx, ecx	; edx = aligned address of tail bytes
	vpcmpeqb ymm1, ymm0, [edx]
				; ymm1 = '\377' for each matching byte in chunk
	vpmovmskb eax, ymm1	; eax = bitmask for matching bytes in chunk
	neg	ecx		; ecx = -number of tail bytes
	shl	eax, cl
	shr	eax, cl		; eax = bitmask for matching bytes in buffer
	jnz	short match
previous:
	cmp	edx, [esp+4]
	jbe	short null	; no bytes of buffer before current chunk?
	sub	edx, 32		; edx = address of previous chunk of aligned bytes
aligned:
	vpcmpeqb ymm1, ymm0, [edx]
				; ymm1 = '\377' for each matching byte in chunk
	vpmovmskb eax, ymm1	; eax = bitmask for matching bytes in chunk
	test	eax, eax
	jz	short previous	; no matching byte in chunk?
match:
	bsr	eax, eax	; eax = offset of matching byte in chunk
	add	eax, edx	; eax = address of matching byte
	cmp	eax, [esp+4]	; CF = (address of matching byte < address of buffer)
	cmc			; CF = (address of matching byte >= address of buffer)
	sbb	ecx, ecx	; ecx = (address inside buffer) ? -1 : 0
	and	eax, ecx	; eax = address of character
null:
	ret

memrchr	endp
	end
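
The control flow shared by the vector implementations can be modelled in portable C. The following is only a sketch under assumptions which are not part of the listings: the names chunk_mask(), lowest_bit() and model_memchr() are hypothetical, the 32-byte comparison is emulated byte by byte, and the buffer is addressed as an offset into storage that holds whole 32-byte chunks, so the reads before and after the buffer stay inside the storage.

```c
#include <stddef.h>
#include <stdint.h>

#define CHUNK	32	/* bytes per YMM register */

/* hypothetical stand-in for VPCMPEQB + VPMOVMSKB:
   bit i of the result is set if chunk[i] equals character */
static uint32_t chunk_mask(unsigned char const *chunk, unsigned char character)
{
	uint32_t mask = 0;
	size_t   i;

	for (i = 0; i < CHUNK; i++)
		if (chunk[i] == character)
			mask |= (uint32_t) 1 << i;

	return mask;
}

/* hypothetical stand-in for BSF: index of lowest set bit, mask must not be 0 */
static unsigned lowest_bit(uint32_t mask)
{
	unsigned index = 0;

	while (!(mask & 1))
		mask >>= 1, index++;

	return index;
}

/* the buffer is storage[offset] to storage[offset + count - 1];
   returns the offset of the first match, or (size_t) -1 if there is none */
size_t model_memchr(unsigned char const *storage, size_t offset, int character, size_t count)
{
	size_t   end   = offset + count;		/* offset after buffer */
	size_t   chunk = offset - offset % CHUNK;	/* aligned offset at or before buffer */
	uint32_t mask;

	if (!count)
		return (size_t) -1;

	/* first chunk: discard matches before the buffer (SHR + SHL in the listing) */
	mask = chunk_mask(storage + chunk, (unsigned char) character);
	mask = mask >> offset % CHUNK << offset % CHUNK;

	while (!mask)
	{
		chunk += CHUNK;

		if (chunk >= end)
			return (size_t) -1;

		mask = chunk_mask(storage + chunk, (unsigned char) character);
	}

	/* a match in the last chunk can lie after the buffer (CMP + SBB + AND in the listing) */
	chunk += lowest_bit(mask);

	return chunk < end ? chunk : (size_t) -1;
}
```

The aligned loads never cross a page boundary, which is why the assembler routines may read bytes outside the buffer; portable C has no such licence, hence the detour through offsets.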

Smart Implementation in AMD64 Assembler

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.code

memchr	proc	public		; void *memchr(void const *buffer, int character, size_t count)

	xor	eax, eax	; rax = 0
	test	r8, r8
	jz	short null	; count = 0?

	mov	r10, 0101010101010101h
if 0
	mov	r11, 8080808080808080h
elseif 0
	imul	r11, r10, 128	; r11 = 0x8080808080808080
else
	mov	r11, r10
	ror	r11, 1		; r11 = 0x8080808080808080
endif
	movzx	edx, dl		; rdx = character & 255
	imul	rdx, r10	; rdx = character
				;     | character << 8
				;     | character << 16
				;     | character << 24
				;     | character << 32
				;     | character << 40
				;     | character << 48
				;     | character << 56
	add	r8, rcx		; r8 = address after buffer
	mov	r9, rcx		; r9 = address of buffer
	and	ecx, 7		; rcx = address of buffer % 8
				;     = 8 - number of unaligned bytes
	jz	short aligned	; address of buffer % 8 = 0?
unaligned:
	sub	r9, rcx		; r9 = aligned address before buffer
	shl	ecx, 3		; rcx = (8 - number of unaligned bytes) * 8
				;     = 64 - number of unaligned bits
	dec	rax		; rax = ~0
	shl	rax, cl		; rax = ~0 for unaligned bytes, 0 elsewhere
	not	rax		; rax = 0 for unaligned bytes, ~0 elsewhere
	mov	rcx, [r9]	; rcx = unaligned bytes
	xor	rcx, rdx	; rcx = unaligned bytes ^ multiplied character
	or	rcx, rax	; rcx = '\0' for matching bytes
	jmp	short mycroft
next:
	add	r9, 8		; r9 = address of next 8 aligned bytes
	cmp	r9, r8
	jae	short null	; address after buffer?
aligned:
	mov	rcx, [r9]	; rcx = next 8 aligned bytes
	xor	rcx, rdx	; rcx = next 8 aligned bytes ^ multiplied character
				;     = '\0' for matching bytes
mycroft:
	mov	rax, rcx
	sub	rcx, r10
	not	rax
	and	rcx, r11
	and	rax, rcx	; rax = '\200' for matching bytes, '\0' elsewhere
	jz	short next	; no match in any byte?
match:
	bsf	rax, rax	; rax = offset of matching byte * 8 + 7
				;     = {7, 15, 23, 31, 39, 47, 55, 63}
	shr	eax, 3		; rax = offset of matching byte
				;     = {0, 1, 2, 3, 4, 5, 6, 7}
	add	rax, r9		; rax = address of matching byte
if 0
	cmp	rax, r8		; CF = (address inside buffer)
	sbb	rcx, rcx	; rcx = (address inside buffer) ? -1 : 0
	and	rax, rcx	; rax = address of character
else
	xor	ecx, ecx	; rcx = 0
	cmp	rax, r8		; CF = (address inside buffer)
	cmovnb	rax, rcx	; rax = address of character
endif
null:
	ret

memchr	endp

memrchr	proc	public		; void *memrchr(void const *buffer, int character, size_t count)

	xor	eax, eax	; rax = 0
	test	r8, r8
	jz	short null	; count = 0?

	mov	r10, 0101010101010101h
	mov	r11, 7F7F7F7F7F7F7F7Fh
	movzx	edx, dl		; rdx = character & 255
	imul	rdx, r10	; rdx = character
				;     | character << 8
				;     | character << 16
				;     | character << 24
				;     | character << 32
				;     | character << 40
				;     | character << 48
				;     | character << 56
	add	r8, rcx		; r8 = address after buffer
	mov	r9, rcx		; r9 = address of buffer
	mov	rcx, r8
	and	ecx, 7		; rcx = address after buffer % 8
				;     = number of tail bytes
	jz	short previous	; address after buffer % 8 = 0?
unaligned:
	sub	r8, rcx		; r8 = aligned address of tail bytes
	shl	ecx, 3		; rcx = number of tail bytes * 8
				;     = number of tail bits
	dec	rax		; rax = ~0
	shl	rax, cl		; rax = '\0' for tail bytes, ~0 elsewhere
	mov	rcx, [r8]	; rcx = tail bytes
	xor	rcx, rdx	; rcx = tail bytes ^ multiplied character
	or	rcx, rax	; rcx = '\0' for matching tail bytes
	jmp	short mycroft
previous:
	cmp	r8, r9
	jbe	short null	; no bytes of buffer before current chunk?
	sub	r8, 8		; r8 = address of previous 8 aligned bytes
aligned:
	mov	rcx, [r8]	; rcx = previous 8 aligned bytes
	xor	rcx, rdx	; rcx = previous 8 aligned bytes ^ multiplied character
				;     = '\0' for matching bytes
mycroft:
	mov	rax, rcx
	and	rax, r11	; rax = bytes without their bit 7
	add	rax, r11	; bit 7 = 1 for bytes with any of bits 0 to 6 set
	or	rax, rcx	; bit 7 = 1 for bytes <> '\0'
	or	rax, r11	; rax = ~'\200' for matching bytes, ~0 elsewhere
	xor	rax, -1		; rax = '\200' for matching bytes, '\0' elsewhere
				;  (exact: BSR must not see false positives)
	jz	short previous	; no match in any byte?
match:
	bsr	rax, rax	; rax = offset of matching byte * 8 + 7
				;     = {63, 55, 47, 39, 31, 23, 15, 7}
	shr	eax, 3		; rax = offset of matching byte
				;     = {7, 6, 5, 4, 3, 2, 1, 0}
	add	rax, r8		; rax = address of matching byte
if 0
	cmp	rax, r9		; CF = (address of matching byte < address of buffer)
	cmc			; CF = (address of matching byte >= address of buffer)
	sbb	rcx, rcx	; rcx = (address inside buffer) ? -1 : 0
	and	rax, rcx	; rax = address of character
else
	xor	ecx, ecx	; rcx = 0
	cmp	rax, r9		; CF = (address of matching byte < address of buffer)
	cmovb	rax, rcx	; rax = address of character
endif
null:
	ret

memrchr	endp
	end
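
The word-at-a-time test behind the mycroft label can be tried out in C. This is a sketch only; the function and macro names are hypothetical and the bit scans are replaced by plain loops. It shows the borrow-based test, which finds the first zero byte reliably but can flag a byte 0x01 sitting directly above a zero byte, next to a carry-free variant that flags zero bytes exactly and is therefore the one suited for locating the last match.

```c
#include <stdint.h>

#define ONES	UINT64_C(0x0101010101010101)
#define HIGHS	UINT64_C(0x8080808080808080)

/* nonzero if and only if word contains a '\0' byte; the lowest set bit
   marks the first '\0' byte, higher bits can be false positives */
uint64_t zero_bytes(uint64_t word)
{
	return (word - ONES) & ~word & HIGHS;
}

/* bit 7 of a byte is set if and only if that byte of word is '\0' */
uint64_t zero_bytes_exact(uint64_t word)
{
	uint64_t const low7 = ~HIGHS;

	return ~(((word & low7) + low7) | word | low7);
}

/* index (0 = least significant byte) of the first byte of word
   equal to character, 8 if there is none */
unsigned first_match(uint64_t word, unsigned char character)
{
	uint64_t mask  = zero_bytes(word ^ (ONES * character));
	unsigned index = 0;

	if (!mask)
		return 8;

	while (!(mask & 0x80))		/* stand-in for BSF + SHR 3 */
		mask >>= 8, index++;

	return index;
}

/* index of the last byte of word equal to character, 8 if there is none */
unsigned last_match(uint64_t word, unsigned char character)
{
	uint64_t mask  = zero_bytes_exact(word ^ (ONES * character));
	unsigned index = 7;

	if (!mask)
		return 8;

	while (!(mask >> 63))		/* stand-in for BSR + SHR 3 */
		mask <<= 8, index--;

	return index;
}
```

For the word 0xFFFFFFFFFFFF0100 the first function yields 0x8080, the second 0x80: byte 1 holds 0x01, not '\0'.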

`mem*()` Standard Functions

Although memcpy() and memset() are intrinsic functions, the Visual C compiler provides no inline implementation, but generates calls to external routines.

Proper implementations of these plus the memchr(), memcmp(), memmem(), memmove() and memrchr() functions for the i386 and the AMD64 platform follow with build instructions.

Note: the memmem() function is like the strstr() function and uses the same algorithm!

Note: neither memmem() nor memrchr() is provided by the Visual C compiler or its runtime libraries!

Implementation in ANSI C

// Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#define NULL	(void *) 0

#ifndef _WIN64
typedef	unsigned int	size_t;
#endif

#pragma function(memcmp, memcpy, memset)

void	*memchr(void const *destination, int character, size_t count)
{
	unsigned char const *mem = (unsigned char const *) destination;

	while (count)
	{
		if (*mem == (unsigned char) character)
			return (void *) mem;

		mem++, --count;
	}

	return NULL;
}

int	memcmp(void const *source, void const *destination, size_t count)
{
	unsigned char const *dst = (unsigned char const *) destination;
	unsigned char const *src = (unsigned char const *) source;

	if (count && source != destination)
		do
			if (*src - *dst)
#if 0
				return *src - *dst;
#else
				return (*src > *dst) - (*src < *dst);
#endif
		while (src++, dst++, --count);

	return 0;
}

void	*memcpy(void *destination, void const *source, size_t count)
{
	unsigned char *dst = (unsigned char *) destination;
	unsigned char const *src = (unsigned char const *) source;

	while (count)
		*dst++ = *src++, --count;

	return destination;
}

void	*memmem(void const *haystack, size_t count, void const *needle, size_t length)
{
	unsigned char const *mem;
	unsigned char const *hay = (unsigned char const *) haystack;
	unsigned char const *pin = (unsigned char const *) needle;

	if (!count || length > count)
		return NULL;

	if (!length)
		return (void *) haystack;

	if (!--length)		// needle is a single character?
		return memchr(haystack, *pin, count);

	count -= length;	// maximum number of characters to scan in haystack

	while (mem = hay, hay = (unsigned char const *) memchr(hay, *pin, count), hay)
	{			// *hay is first character of pin; compare
				//  last character of pin first, then proceed
		if (hay[length] == pin[length]
#if 0
		 && (length == 1 || !memcmp(hay + 1, pin + 1, length - 1)))
#else
		 && !memcmp(hay, pin, length))
#endif
			return (void *) hay;
				// skip character in haystack,
				//  adjust number of characters left in haystack
		count -= ++hay - mem;

		if (!count)
			break;
	}

	return NULL;
}

void	*memmove(void *destination, void const *source, size_t count)
{
	unsigned char *dst = (unsigned char *) destination;
	unsigned char const *src = (unsigned char const *) source;

	if (dst < src || dst - src >= count)
		while (count)
			*dst++ = *src++, --count;
	else if (dst > src)
	{			// overlapping buffers
		dst += count;
		src += count;

		while (count)
			*--dst = *--src, --count;
	}

	return destination;
}

void	*memrchr(void const *destination, int character, size_t count)
{
	unsigned char const *mem = (unsigned char const *) destination + count;

	while (count)
	{
		if (*--mem == (unsigned char) character)
			return (void *) mem;
		--count;
	}

	return NULL;
}

void	*memset(void *destination, int character, size_t count)
{
	unsigned char *dst = (unsigned char *) destination;

	while (count)
		*dst++ = (unsigned char) character, --count;

	return destination;
}
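
The mem*() functions have to interpret every byte as unsigned char. With a signed character type, which is what plain char is with the Visual C compiler unless the /J option is given, bytes above 0x7F turn negative and compare the wrong way round. A minimal sketch of the difference; the function names are hypothetical:

```c
/* three-way comparison of two bytes, read as memcmp() has to read them */
int order_unsigned(void const *left, void const *right)
{
	unsigned char l = *(unsigned char const *) left;
	unsigned char r = *(unsigned char const *) right;

	return (l > r) - (l < r);
}

/* the same comparison with the bytes read through a signed character type */
int order_signed(void const *left, void const *right)
{
	signed char l = *(signed char const *) left;
	signed char r = *(signed char const *) right;

	return (l > r) - (l < r);
}
```

For the bytes 0x80 and 0x7F the first function returns 1, the second -1.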

Implementation in i386 Assembler

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.386
	.model	flat, C
	.code

memchr	proc	public		; void *memchr(void const *destination, int character, size_t count)

	mov	edx, edi
	mov	edi, [esp+4]	; edi = address of destination
	mov	eax, [esp+8]	; eax = character
	mov	ecx, [esp+12]	; ecx = count
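	test	esp, esp	; ZF = 0 (required when count is 0)
	repne	scasb		; ZF = (character found)
	mov	eax, 0		; eax = 0 (MOV leaves the flags untouched)
	setne	al		; eax = (character found) ? 0 : 1
	dec	eax		; eax = (character found) ? -1 : 0
	dec	edi		; edi = address of character
	and	eax, edi	; eax = (character found) ? address of character : 0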
	mov	edi, edx
	ret

memchr	endp

memcmp	proc	public		; int memcmp(void const *source, void const *destination, size_t count)

	mov	eax, [esp+4]	; eax = address of source
	mov	edx, [esp+8]	; edx = address of destination
	cmp	edx, eax
	je	short equal	; address of destination = address of source?

	mov	ecx, [esp+12]	; ecx = count
if 0
	jecxz	short equal	; count = 0?
else
	cmp	ebx, ebx	; CF = 0,
				; ZF = 1 (required when count is 0)
endif
	xchg	esi, eax	; esi = address of source
	xchg	edi, edx	; edi = address of destination
	repe	cmpsb
	mov	edi, edx
	mov	esi, eax
	seta	al
	movzx	eax, al		; eax = (*source > *destination)
	sbb	eax, 0		; eax = (*source > *destination)
				;     - (*source < *destination)
				;     = {1, 0, -1}
	ret
equal:
	xor	eax, eax
	ret

memcmp	endp

memcpy	proc	public		; void *memcpy(void *destination, void const *source, size_t count)

	mov	eax, [esp+4]	; eax = address of destination
	mov	edx, [esp+8]	; edx = address of source
	cmp	edx, eax
	je	short return	; address of source = address of destination?

	mov	ecx, [esp+12]	; ecx = count
;;	jecxz	short return	; count = 0?

	xchg	esi, edx	; esi = address of source
	xchg	edi, eax	; edi = address of destination
if 1
	rep	movsb
else
	shr	ecx, 1		; ecx = count / 2
	jnc	short @f	; count % 2 = 0?

	movsb
@@:
	shr	ecx, 1		; ecx = count / 4
	jnc	short @f	; count % 4 = 0?

	movsw
@@:
	rep	movsd
endif
	mov	esi, edx
	mov	edi, eax
	mov	eax, [esp+4]	; eax = address of destination
return:
	ret

memcpy	endp

memmem	proc	public		; void *memmem(void const *haystack, size_t count,
				;              void const *needle, size_t length)

	xor	eax, eax	; eax = address of needle in haystack = 0
	mov	ecx, [esp+8]	; ecx = length of haystack
	test	ecx, ecx
	jz	short empty	; length of haystack = 0?

	mov	edx, [esp+16]	; edx = length of needle
	cmp	edx, ecx
	ja	short empty	; length of needle > length of haystack?

	mov	eax, [esp+4]	; eax = address of haystack
	test	edx, edx
	jz	short empty	; length of needle = 0?

	push	ebx
	push	edi
	mov	edi, eax	; edi = address of haystack
	push	esi
	sub	ecx, edx
	inc	ecx		; ecx = length of haystack - length of needle + 1
				;     = number of positions where needle fits
search:
	mov	esi, [esp+24]	; esi = address of needle
	mov	al, [esi]	; al = first character of needle
				; edi = current address in haystack
	repne	scasb		; edi = next address in haystack,
				; ecx = number of positions left
	jne	short break	; (first character of) needle not in haystack?

	mov	al, [esi+edx-1]	; al = last character of needle
	cmp	al, [edi+edx-2]
	jne	short continue	; last character of needle not in haystack?
compare:
	mov	eax, edi	; eax = next address in haystack
	mov	ebx, ecx	; ebx = number of positions left
if 0
	dec	edi		; edi = current address in haystack
				;     = address of matching character
				; esi = address of needle
	mov	ecx, edx	; ecx = length of needle
else
				; edi = next address in haystack
	inc	esi		; esi = address of needle + 1
	mov	ecx, edx
	dec	ecx		; ecx = length of needle - 1,
				; ZF = (ecx = 0)
;;	jz	short match	; needle in haystack?
endif
	repe	cmpsb
	je	short match	; needle in haystack?

	mov	edi, eax	; edi = next address in haystack
	mov	ecx, ebx	; ecx = number of positions left
continue:
	test	ecx, ecx
	jnz	short search	; any position left?
break:
	xor	eax, eax
	pop	esi
	pop	edi
	pop	ebx
empty:
	ret
match:
	dec	eax		; eax = address of needle in haystack
	pop	esi
	pop	edi
	pop	ebx
	ret

memmem	endp

memmove	proc	public		; void *memmove(void *destination, void const *source, size_t count)

	mov	eax, [esp+4]	; eax = address of destination
	mov	edx, [esp+8]	; edx = address of source
	cmp	edx, eax
	je	short return	; address of source = address of destination?

	mov	ecx, [esp+12]	; ecx = count
;;	jecxz	short return	; count = 0?

	xchg	esi, edx	; esi = address of source
	xchg	edi, eax	; edi = address of destination

	ja	short default	; address of source > address of destination?
overlap:
	lea	edi, [edi+ecx-1]
	lea	esi, [esi+ecx-1]
	std
default:
	rep	movsb
	cld
	mov	esi, edx
	mov	edi, eax
	mov	eax, [esp+4]	; eax = address of destination
return:
	ret

memmove	endp

memrchr	proc	public		; void *memrchr(void const *destination, int character, size_t count)

	mov	edx, edi
	mov	edi, [esp+4]	; edi = address of destination
	mov	eax, [esp+8]	; eax = character
	mov	ecx, [esp+12]	; ecx = count
	lea	edi, [edi+ecx-1]
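	test	esp, esp	; ZF = 0 (required when count is 0)
	std
	repne	scasb		; ZF = (character found)
	cld
	mov	eax, 0		; eax = 0 (MOV leaves the flags untouched)
	setne	al		; eax = (character found) ? 0 : 1
	dec	eax		; eax = (character found) ? -1 : 0
	inc	edi		; edi = address of character
	and	eax, edi	; eax = (character found) ? address of character : 0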
	mov	edi, edx
	ret

memrchr	endp

memset	proc	public		; void *memset(void *destination, int character, size_t count)

	mov	edx, edi
	mov	edi, [esp+4]	; edi = address of destination
	mov	eax, [esp+8]	; eax = character
	mov	ecx, [esp+12]	; ecx = count
;;	jecxz	short @f	; count = 0?

	rep	stosb
@@:
	mov	eax, [esp+4]	; eax = address of destination
	mov	edi, edx
	ret

memset	endp
	end

Save the i386 assembler source presented above as i386-mem.asm in the directory where you created the object library i386.lib before, then execute the following 3 command lines to generate the object file i386-mem.obj and add it to the existing object library i386.lib:

SET ML=/c /Gy /safeseh /W3 /X
ML.EXE i386-mem.asm
LINK.EXE /LIB /OUT:i386.lib i386.lib i386-mem.obj

For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: the command lines can be copied and pasted as a block into a Command Processor window!

Microsoft (R) Macro Assembler Version 14.16.27023.1
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: i386-mem.asm

Microsoft (R) Library Manager Version 14.16.27049.0
Copyright (C) Microsoft Corporation.  All rights reserved.
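
The scans in memchr(), memrchr() and memmem() rest on the REPNE SCASB instruction. Its behaviour can be modelled in C as sketched below; the structure and function names are hypothetical, only the semantics follow the instruction: the counter is decremented for every byte examined, a matching one included, and the zero flag is left untouched when the counter is 0 on entry. The flag, not the counter, therefore tells whether the scan stopped on a match.

```c
#include <stddef.h>

struct scan_state
{
	unsigned char const *pointer;	/* models EDI: address after last byte examined */
	size_t              counter;	/* models ECX: number of bytes not yet examined */
	int                 equal;	/* models ZF: last comparison found a match */
};

/* models REPNE SCASB with the direction flag clear;
   equal_before is the state of ZF on entry */
struct scan_state repne_scasb(unsigned char const *pointer, size_t counter,
                              unsigned char value, int equal_before)
{
	struct scan_state state;

	state.pointer = pointer;
	state.counter = counter;
	state.equal   = equal_before;

	while (state.counter)
	{
		state.equal = *state.pointer++ == value;
		state.counter--;

		if (state.equal)
			break;
	}

	return state;
}
```

Scanning the 3 bytes "abc" for 'c' and for 'x' leaves the counter at 0 in both cases; only the flag differs.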

Implementation in AMD64 Assembler

; Copyright © 2009-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

; Microsoft calling convention for AMD64 platform:
; - first 4 integer or pointer arguments (from left to right) are passed
;   in registers RCX/R1, RDX/R2, R8 and R9;
; - arguments larger than 8 bytes are passed by reference;
; - surplus arguments are pushed on stack in reverse order (from right
;   to left), 8-byte aligned;
; - caller allocates memory for return value larger than 8 bytes and
;   passes pointer to it as (hidden) first argument, thus shifting
;   all other arguments;
; - caller always allocates "home space" for 4 arguments on stack,
;   even when less than 4 arguments are passed, but does not need to push
;   first 4 arguments;
; - callee can spill first 4 arguments from registers to "home space";
; - callee can clobber "home space";
; - stack is 16-byte aligned: callee must decrement RSP by 8+n*16 bytes
;   when it calls other functions (CALL instruction pushes 8 bytes);
; - 64-bit integer or pointer result is returned in register RAX/R0;
; - 32-bit integer result is returned in register EAX;
; - registers RAX/R0, RCX/R1, RDX/R2, R8, R9, R10 and R11 are volatile
;   and can be clobbered;
; - registers RBX/R3, RSP/R4, RBP/R5, RSI/R6, RDI/R7, R12, R13, R14 and
;   R15 must be preserved.

	.code

memchr	proc	public		; void *memchr(void const *destination, int character, size_t count)

	mov	r9, rdi
	mov	rdi, rcx	; rdi = address of destination
	mov	rcx, r8		; rcx = count
	mov	eax, edx	; rax = character
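	test	rsp, rsp	; ZF = 0 (required when count is 0)
	repne	scasb		; ZF = (character found),
				; rcx = 0 if character not found
	lea	rax, [rdi-1]	; rax = address of character (if found)
	cmovne	rax, rcx	; rax = (character found) ? address of character : 0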
	mov	rdi, r9
	ret

memchr	endp

memcmp	proc	public		; int memcmp(void const *source, void const *destination, size_t count)

;;	xor	eax, eax	; rax = 0
;;	test	r8, r8
;;	jz	short equal	; count = 0?

;;	cmp	rcx, rdx
;;	je	short equal	; address of source = address of destination?

	mov	r9, rsi
	mov	rsi, rcx	; rsi = address of source
	mov	rcx, r8		; rcx = count
	mov	r8, rdi
	mov	rdi, rdx	; rdi = address of destination
	xor	eax, eax	; rax = 0,
				; CF = 0,
				; ZF = 1 (required when count is 0)
	repe	cmpsb
	seta	al		; rax = (*source > *destination)
	sbb	rax, 0		; rax = (*source > *destination)
				;     - (*source < *destination)
				;     = {1, 0, -1}
	mov	rdi, r8
	mov	rsi, r9
equal:
	ret

memcmp	endp

memcpy	proc	public		; void *memcpy(void *destination, void const *source, size_t count)

	mov	rax, rcx	; rax = address of destination
;;	test	r8, r8
;;	jz	short return	; count = 0?

;;	cmp	rcx, rdx
;;	je	short return	; address of destination = address of source?

	mov	r9, rdi
	mov	rdi, rcx	; rdi = address of destination
	mov	rcx, r8		; rcx = count
	mov	r8, rsi
	mov	rsi, rdx	; rsi = address of source
if 1
	rep	movsb
else
	shr	rcx, 1		; rcx = count / 2
	jnc	short @f	; count % 2 = 0?

	movsb
@@:
	shr	rcx, 1		; rcx = count / 4
	jnc	short @f	; count % 4 = 0?

	movsw
@@:
	shr	rcx, 1		; rcx = count / 8
	jnc	short @f	; count % 8 = 0?

	movsd
@@:
	rep	movsq
endif
	mov	rdi, r9
	mov	rsi, r8
return:
	ret

memcpy	endp

memmem	proc	public		; void *memmem(void const *haystack, size_t count,
				;              void const *needle, size_t length)

	xor	eax, eax	; rax = address of needle in haystack = 0
	test	rdx, rdx
	jz	short empty	; length of haystack = 0?

	cmp	rdx, r9
	jb	short empty	; length of haystack < length of needle?

	mov	rax, rcx	; rax = address of haystack
	test	r9, r9
	jz	short empty	; length of needle = 0?

	mov	r10, rdi
	mov	rdi, rcx	; rdi = address of haystack
	mov	rcx, rdx	; rcx = length of haystack
	mov	r11, rsi
	sub	rcx, r9
	inc	rcx		; rcx = length of haystack - length of needle + 1
				;     = number of positions where needle fits
search:
	mov	al, [r8]	; al = first character of needle
				; rdi = current address in haystack
	repne	scasb		; rdi = next address in haystack,
				; rcx = number of positions left
	jne	short break	; (first character of) needle not in haystack?

	mov	al, [r8+r9-1]	; al = last character of needle
	cmp	al, [rdi+r9-2]
	jne	short continue	; last character of needle not in haystack?
compare:
	mov	rax, rdi	; rax = next address in haystack
	mov	rdx, rcx	; rdx = number of positions left
if 0
	dec	rdi		; rdi = current address in haystack
				;     = address of matching character
	mov	rsi, r8		; rsi = address of needle
	mov	rcx, r9		; rcx = length of needle
else
				; rdi = next address in haystack
	mov	rsi, r8
	inc	rsi		; rsi = address of needle + 1
	mov	rcx, r9
	dec	rcx		; rcx = length of needle - 1,
				; ZF = (rcx = 0)
;;	jz	short match	; needle in haystack?
endif
	repe	cmpsb
	je	short match	; needle in haystack?

	mov	rdi, rax	; rdi = next address in haystack
	mov	rcx, rdx	; rcx = number of positions left
continue:
	test	rcx, rcx
	jnz	short search	; any position left?
break:
	xor	eax, eax
	mov	rdi, r10
	mov	rsi, r11
empty:
	ret
match:
	dec	rax		; rax = address of needle in haystack
	mov	rdi, r10
	mov	rsi, r11
	ret

memmem	endp

memmove	proc	public		; void *memmove(void *destination, void const *source, size_t count)

	mov	rax, rcx	; rax = address of destination
;;	test	r8, r8
;;	jz	short return	; count = 0?

	cmp	rcx, rdx
	je	short return	; address of destination = address of source?

	mov	r9, rdi
	mov	rdi, rcx	; rdi = address of destination
	mov	rcx, r8		; rcx = count
	mov	r8, rsi
	mov	rsi, rdx	; rsi = address of source

	jb	short default	; address of destination < address of source?

	add	rdx, rcx	; rdx = address of source + count
	cmp	rdi, rdx
	jae	short default	; address of destination >= address of source + count?
overlap:
	lea	rdi, [rdi+rcx-1]
	lea	rsi, [rsi+rcx-1]
	std
default:
	rep	movsb
	cld
	mov	rdi, r9
	mov	rsi, r8
return:
	ret

memmove	endp

memrchr	proc	public		; void *memrchr(void const *destination, int character, size_t count)

	mov	r9, rdi
	lea	rdi, [rcx+r8-1]	; rdi = address of destination + count - 1
	mov	eax, edx	; rax = character
	mov	rcx, r8		; rcx = count
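	test	rsp, rsp	; ZF = 0 (required when count is 0)
	std
	repne	scasb		; ZF = (character found),
				; rcx = 0 if character not found
	cld
	lea	rax, [rdi+1]	; rax = address of character (if found)
	cmovne	rax, rcx	; rax = (character found) ? address of character : 0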
	mov	rdi, r9
	ret

memrchr	endp

memset	proc	public		; void *memset(void *destination, int character, size_t count)

	mov	r9, rcx		; r9 = address of destination
	mov	rcx, r8		; rcx = count
;;	jrcxz	short @f	; count = 0?

	mov	r8, rdi
	mov	rdi, r9		; rdi = address of destination
	mov	eax, edx	; rax = character
	rep	stosb
	mov	rdi, r8
@@:
	mov	rax, r9		; rax = address of destination
	ret

memset	endp
	end

Save the AMD64 assembler source presented above as amd64-mem.asm in the directory where you created the object library amd64.lib before, then execute the following 3 command lines to generate the object file amd64-mem.obj and add it to the existing object library amd64.lib:

SET ML=/c /Gy /W3 /X
ML64.EXE amd64-mem.asm
LINK.EXE /LIB /OUT:amd64.lib amd64.lib amd64-mem.obj

For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: if the /Gy option to package every function in its own, separately linkable COMDAT section is not available with the version of the macro assembler ML64.EXE you use, split the AMD64 assembler source into multiple pieces, with one function per source file.

Note: the command lines can be copied and pasted as a block into a Command Processor window!

Microsoft (R) Macro Assembler (x64) Version 14.16.27023.1
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: amd64-mem.asm

Microsoft (R) Library Manager Version 14.16.27049.0
Copyright (C) Microsoft Corporation.  All rights reserved.
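
The choice between a forward and a backward copy in memmove() boils down to one question: would a front-to-back copy overwrite source bytes before they are read? A sketch in C, with a hypothetical function name and the addresses compared as integers, which is an assumption that holds for the flat address spaces of both platforms:

```c
#include <stddef.h>
#include <stdint.h>

/* returns 1 if destination lies inside the source buffer, but not at its
   start, i.e. if the bytes have to be copied back to front, else 0 */
int needs_backward_copy(void const *destination, void const *source, size_t count)
{
	uintptr_t dst = (uintptr_t) destination;
	uintptr_t src = (uintptr_t) source;

	/* unsigned wrap-around turns "dst < src" into a huge difference,
	   so one comparison covers "src <= dst" and "dst < src + count" */
	return dst != src && dst - src < count;
}
```

The i386 routine above takes the backward path whenever the destination lies above the source, which is always safe; the AMD64 routine additionally tests against the end of the source buffer, like this sketch.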

Inline Implementation of `memcpy()` and `memset()` with Intrinsic Functions

Optionally add the following snippet at the very top of your C sources to allow the Visual C compiler to inline the memcpy() function as __movsb() alias REP MOVSB and the memset() function as __stosb() alias REP STOSB, but without shuffling as many registers as the external functions:

// Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifndef _WIN64
typedef	unsigned int	size_t;
#endif

#pragma function(memcpy, memset)
#pragma intrinsic(__movsb, __stosb)

__inline
void	*memcpy(void *destination, void const *source, size_t count)
{
	__movsb((unsigned char *) destination, (unsigned char const *) source, count);

	return destination;
}

__inline
void	*memset(void *destination, int character, size_t count)
{
	__stosb((unsigned char *) destination, (unsigned char) character, count);

	return destination;
}
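
These definitions matter even for sources which never call memcpy() or memset() by name, since a compiler may emit calls to both for aggregate initialisation and structure assignment. A small sketch; the type and function names are made up for illustration, and whether a call or inline code is generated depends on compiler version, options and object size:

```c
struct packet			/* hypothetical example type */
{
	unsigned char header[16];
	unsigned char payload[240];
};

/* zero initialisation of a large local object is a candidate for memset(),
   the following structure assignment a candidate for memcpy() */
void packet_clear(struct packet *packet)
{
	struct packet const empty = {{0}, {0}};

	*packet = empty;
}

/* structure assignment is a candidate for memcpy() */
void packet_copy(struct packet *destination, struct packet const *source)
{
	*destination = *source;
}
```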

`strchr()` Standard Function for i386 Platform

Although the strchr() function is not a compiler helper function, it is, like the memchr() function, included here for entertainment due to its ~~quality~~ deficiencies and flaws.

DIR "%source%\intel\strchr.asm"
TYPE "%source%\intel\strchr.asm"

 Volume in drive C has no label.
 Volume Serial Number is 1957-0427

 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel

02/18/2011  03:40 PM             5,904 strchr.asm
               1 File(s)          5,904 bytes
               0 Dir(s)    9,876,543,210 bytes free

        page    ,132
        title   strchr - search string for given character
;***
;strchr.asm - search a string for a given character
;
;       Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
;       defines strchr() - search a string for a character
;
;*******************************************************************************

        .xlist
        include cruntime.inc
        .list

page
;***
;char *strchr(string, chr) - search a string for a character
;
;Purpose:
;       Searches a string for a given character, which may be the
;       null character '\0'.
;
;       Algorithm:
;       char *
;       strchr (string, chr)
;       char *string, chr;
;       {
;         while (*string && *string != chr)
;             string++;
;         if (*string == chr)
;             return(string);
;         return((char *)0);
;       }
;
;Entry:
;       char *string - string to search in
;       char chr     - character to search for
;
;Exit:
;       returns pointer to the first occurence of c in string
;       returns NULL if chr does not occur in string
;
;Uses:
;
;Exceptions:
;
;*******************************************************************************

        CODESEG

found_bx:
        lea     eax,[edx - 1]
        pop     ebx                 ; restore ebx
        ret                         ; _cdecl return

        align   16
        public  strchr, __from_strstr_to_strchr
strchr  proc \
        string:ptr byte, \
        chr:byte

        OPTION PROLOGUE:NONE, EPILOGUE:NONE

        .FPO    ( 0, 2, 0, 0, 0, 0 )

        xor     eax,eax
        mov     al,[esp + 8]        ; al = chr (search char)

__from_strstr_to_strchr label proc

        push    ebx                 ; PRESERVE EBX
        mov     ebx,eax             ; ebx = 0/0/0/chr
        shl     eax,8               ; eax = 0/0/chr/0
        mov     edx,[esp + 8]       ; edx = buffer
        test    edx,3               ; test if string is aligned on 32 bits
        jz      short main_loop_start

str_misaligned:                     ; simple byte loop until string is aligned
        mov     cl,[edx]
        add     edx,1
        cmp     cl,bl
        je      short found_bx
        test    cl,cl
        jz      short retnull_bx
        test    edx,3               ; now aligned ?
        jne     short str_misaligned

main_loop_start:                    ; set all 4 bytes of ebx to [chr]
        or      ebx,eax             ; ebx = 0/0/chr/chr
        push    edi                 ; PRESERVE EDI
        mov     eax,ebx             ; eax = 0/0/chr/chr
        shl     ebx,10h             ; ebx = chr/chr/0/0
        push    esi                 ; PRESERVE ESI
        or      ebx,eax             ; ebx = all 4 bytes = [chr]

; in the main loop (below), we are looking for chr or for EOS (end of string)

main_loop:
        mov     ecx,[edx]           ; read  dword (4 bytes)
        mov     edi,7efefeffh       ; work with edi & ecx for looking for chr

        mov     eax,ecx             ; eax = dword
        mov     esi,edi             ; work with esi & eax for looking for EOS

        xor     ecx,ebx             ; eax = dword xor chr/chr/chr/chr
        add     esi,eax

        add     edi,ecx
        xor     ecx,-1

        xor     eax,-1
        xor     ecx,edi

        xor     eax,esi
        add     edx,4

        and     ecx,81010100h       ; test for chr
        jnz     short chr_is_found  ; chr probably has been found

        ; chr was not found, check for EOS

        and     eax,81010100h       ; is any flag set ??
        jz      short main_loop     ; EOS was not found, go get another dword

        and     eax,01010100h       ; is it in high byte?
        jnz     short retnull       ; no, definitely found EOS, return failure

        and     esi,80000000h       ; check was high byte 0 or 80h
        jnz     short main_loop     ; it just was 80h in high byte, go get
                                    ; another dword
retnull:
        pop     esi
        pop     edi
retnull_bx:
        pop     ebx
        xor     eax,eax
        ret                         ; _cdecl return

chr_is_found:
        mov     eax,[edx - 4]       ; let's look one more time on this dword
        cmp     al,bl               ; is chr in byte 0?
        je      short byte_0
        test    al,al               ; test if low byte is 0
        je      retnull
        cmp     ah,bl               ; is it byte 1
        je      short byte_1
        test    ah,ah               ; found EOS ?
        je      retnull
        shr     eax,10h             ; is it byte 2
        cmp     al,bl
        je      short byte_2
        test    al,al               ; if in al some bits were set, bl!=bh
        je      retnull
        cmp     ah,bl
        je      short byte_3
        test    ah,ah
        jz      retnull
        jmp     short main_loop     ; neither chr nor EOS found, go get
                                    ; another dword
byte_3:
        pop     esi
        pop     edi
        lea     eax,[edx - 1]
        pop     ebx                 ; restore ebx
        ret                         ; _cdecl return

byte_2:
        lea     eax,[edx - 2]
        pop     esi
        pop     edi
        pop     ebx
        ret                         ; _cdecl return

byte_1:
        lea     eax,[edx - 3]
        pop     esi
        pop     edi
        pop     ebx
        ret                         ; _cdecl return

byte_0:
        lea     eax,[edx - 4]
        pop     esi                 ; restore esi
        pop     edi                 ; restore edi
        pop     ebx                 ; restore ebx
        ret                         ; _cdecl return

strchr  endp
        end

With 89 instructions in 206 bytes, this implementation is even worse than that of the memchr() routine shown and dissected above!

Implementation in i386 Assembler

A proper implementation needs only 46 instructions in just 112 bytes, i.e. about 54 % of the size of Microsoft’s poor implementation:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.386
	.model	flat, C
	.code

strchr	proc	public		; char *strchr(unsigned char const *string, int character)

	xor	eax, eax	; eax = 0
	cdq			; edx = 0
	mov	dl, [esp+8]	; edx = character
	imul	edx, 01010101h	; edx = character
				;     | character << 8
				;     | character << 16
				;     | character << 24
	mov	[esp+8], edx
	mov	ecx, [esp+4]	; ecx = address of string
	push	ebx
	mov	ebx, ecx
	and	ecx, 3		; ecx = address of string % 4
				;     = 4 - number of unaligned characters
	jz	short aligned
unaligned:
	sub	ebx, ecx	; ebx = aligned address before string
	shl	ecx, 3		; ecx = (4 - number of unaligned characters) * 8
				;     = 32 - number of unaligned bits
	dec	eax		; eax = ~0
	shl	eax, cl		; eax = ~0 for unaligned characters, 0 elsewhere
	not	eax		; eax = 0 for unaligned characters, ~0 elsewhere
	mov	ecx, [ebx]	; ecx = unaligned characters
	xor	edx, ecx	; edx = unaligned characters ^ multiplied character
	or	edx, eax	; edx = '\0' for matching characters
	or	eax, ecx	; eax = unaligned characters, ~0 elsewhere
	jmp	mycroft
next:
	add	ebx, 4		; ebx = address of next 4 aligned characters
aligned:
	mov	edx, [esp+12]	; edx = multiplied character
	mov	eax, [ebx]	; eax = next 4 aligned characters
	xor	edx, eax	; edx = next 4 aligned characters ^ multiplied character
				;     = '\0' for matching characters
mycroft:
	mov	ecx, eax
	sub	eax, 01010101h
	not	ecx
	and	ecx, eax
	mov	eax, edx
	sub	eax, 01010101h
	not	edx
	and	eax, edx
	or	eax, ecx
	and	eax, 80808080h	; eax = '\200' for '\0' or matching characters
	jz	short next	; neither '\0' nor any matching character?
match:
	bsf	eax, eax	; eax = offset of '\0' or matching character * 8 + 7
				;     = {7, 15, 23, 31}
	shr	eax, 3		; eax = offset of '\0' or matching character
				;     = {0, 1, 2, 3}
	cdq			; edx = 0
	add	eax, ebx	; eax = address of '\0' or matching character
	mov	dl, [esp+12]	; edx = character
	cmp	dl, [eax]	; ZF = (character = matching character)
	setne	dl		; edx = (character = matching character) ? 0 : 1
	dec	edx		; edx = (character = matching character) ? -1 : 0
	and	eax, edx	; eax = address of matching character
	pop	ebx
	ret

strchr	endp
	end

Save the i386 assembler source presented above as i386-strchr.asm and the ANSI C source presented below as i386-strchr.c, then execute the 6 command lines following the ANSI C source to assemble i386-strchr.asm, compile i386-strchr.c, link the generated object files i386-strchr.obj and i386-strchr.tmp, and finally execute the image file i386-strchr.exe to demonstrate the correct operation:

// Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#define STRICT
#undef UNICODE
#define WIN32_LEAN_AND_MEAN

#include <windows.h>

__declspec(safebuffers)
BOOL	CDECL	PrintFormat(HANDLE hOutput, [SA_FormatString(Style="printf")] LPCSTR lpFormat, ...)
{
	CHAR	szFormat[1024];
	DWORD	dwFormat;
	DWORD	dwOutput;

	va_list	vaInput;
	va_start(vaInput, lpFormat);

	dwFormat = wvsprintf(szFormat, lpFormat, vaInput);

	va_end(vaInput);

	if ((dwFormat == 0UL)
	 || !WriteFile(hOutput, szFormat, dwFormat, &dwOutput, (LPOVERLAPPED) NULL))
		return FALSE;

	return dwOutput == dwFormat;
}

const	CHAR	szString[] = "01234567890";

__declspec(noreturn)
VOID	CDECL	mainCRTStartup(VOID)
{
	LPCSTR	lpString = szString + sizeof(szString);
	DWORD	dwError = ERROR_SUCCESS;
	HANDLE	hOutput = GetStdHandle(STD_OUTPUT_HANDLE);

	if (hOutput == INVALID_HANDLE_VALUE)
		dwError = GetLastError();
	else
		while (--lpString >= szString)
		{
			PrintFormat(hOutput,
			            "0x%p: strchr(\"%hs\", '0') = 0x%p\r\n",
			            lpString, lpString, strchr(lpString, '0'));

			PrintFormat(hOutput,
			            "0x%p: strchr(\"%hs\", '%hc') = 0x%p\r\n",
			            lpString, lpString, *lpString, strchr(lpString, *lpString));

			PrintFormat(hOutput,
			            "0x%p: strchr(\"%hs\", '\\0') = 0x%p\r\n",
			            lpString, lpString, strchr(lpString, '\0'));
		}

	ExitProcess(dwError);
}

SET ML=/c /safeseh /W3 /X
ML.EXE i386-strchr.asm
SET CL=/GAFy /Oy /W4 /Zl
SET LINK=/DEFAULTLIB:kernel32.lib /DEFAULTLIB:user32.lib /ENTRY:mainCRTStartup /SUBSYSTEM:CONSOLE
CL.EXE /Foi386-strchr.tmp i386-strchr.obj i386-strchr.c
.\i386-strchr.exe

For details and reference see the MSDN articles Compiler Options and Linker Options.

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: the command lines can be copied and pasted as a block into a Command Processor window!

Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: i386-strchr.asm

Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.

i386-strchr.c

Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

/DEFAULTLIB:kernel32.lib /DEFAULTLIB:user32.lib /ENTRY:mainCRTStartup /SUBSYSTEM:CONSOLE
/out:i386-strchr.exe
i386-strchr.obj
i386-strchr.tmp

0x00922027: strchr("", '0') = 0x00000000
0x00922027: strchr("", '▯') = 0x00922027
0x00922027: strchr("", '\0') = 0x00922027
0x00922026: strchr("0", '0') = 0x00922026
0x00922026: strchr("0", '0') = 0x00922026
0x00922026: strchr("0", '\0') = 0x00922027
0x00922025: strchr("90", '0') = 0x00922026
0x00922025: strchr("90", '9') = 0x00922025
0x00922025: strchr("90", '\0') = 0x00922027
0x00922024: strchr("890", '0') = 0x00922026
0x00922024: strchr("890", '8') = 0x00922024
0x00922024: strchr("890", '\0') = 0x00922027
0x00922023: strchr("7890", '0') = 0x00922026
0x00922023: strchr("7890", '7') = 0x00922023
0x00922023: strchr("7890", '\0') = 0x00922027
0x00922022: strchr("67890", '0') = 0x00922026
0x00922022: strchr("67890", '6') = 0x00922022
0x00922022: strchr("67890", '\0') = 0x00922027
0x00922021: strchr("567890", '0') = 0x00922026
0x00922021: strchr("567890", '5') = 0x00922021
0x00922021: strchr("567890", '\0') = 0x00922027
0x00922020: strchr("4567890", '0') = 0x00922026
0x00922020: strchr("4567890", '4') = 0x00922020
0x00922020: strchr("4567890", '\0') = 0x00922027
0x0092201F: strchr("34567890", '0') = 0x00922026
0x0092201F: strchr("34567890", '3') = 0x0092201F
0x0092201F: strchr("34567890", '\0') = 0x00922027
0x0092201E: strchr("234567890", '0') = 0x00922026
0x0092201E: strchr("234567890", '2') = 0x0092201E
0x0092201E: strchr("234567890", '\0') = 0x00922027
0x0092201D: strchr("1234567890", '0') = 0x00922026
0x0092201D: strchr("1234567890", '1') = 0x0092201D
0x0092201D: strchr("1234567890", '\0') = 0x00922027
0x0092201C: strchr("01234567890", '0') = 0x0092201C
0x0092201C: strchr("01234567890", '0') = 0x0092201C
0x0092201C: strchr("01234567890", '\0') = 0x00922027
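
The core step of the i386 routine above, testing 4 characters at once for either the sought character or the terminating '\0', can be modelled in C. The following is only a sketch under stated assumptions, not the routine itself: the name first_hit() is hypothetical, and the 4 characters are passed as one 32-bit value with the lowest-addressed character in bits 0 to 7, as an i386 load would deliver them.

```c
#include <stdint.h>

/* Hypothetical helper: index (0..3) of the first byte in word that is
   either '\0' or equal to character, or 4 if there is no such byte.
   Assumption: the lowest-addressed character sits in bits 0..7 of word. */
static unsigned first_hit(uint32_t word, unsigned char character)
{
    uint32_t pattern = (uint32_t)character * UINT32_C(0x01010101); /* like IMUL */
    uint32_t diff = word ^ pattern;     /* '\0' in every byte that matches */
    /* Mycroft's test, applied to the word and to the difference */
    uint32_t zero = (word - UINT32_C(0x01010101)) & ~word;
    uint32_t same = (diff - UINT32_C(0x01010101)) & ~diff;
    uint32_t hits = (zero | same) & UINT32_C(0x80808080);
    unsigned index = 0;

    if (hits == 0)
        return 4;
    while ((hits & 0x80) == 0) {        /* like BSF followed by SHR by 3 */
        hits >>= 8;
        index++;
    }
    return index;
}
```

For the word 0x00434241, i.e. the characters 'A', 'B', 'C', '\0', first_hit() yields 2 for 'C' and 3 for 'Z'; whether the hit is a match or the terminator has to be decided afterwards, as the routine does with its final CMP.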

Implementation with SSE2 Instructions in i386 Assembler

An implementation for processors which support the Streaming SIMD Extensions 2 alias Willamette New Instructions, introduced November 20, 2000 with the Pentium® 4, needs 35 instructions in 111 bytes:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.xmm
	.model	flat, C
	.code

strchr	proc	public		; char *strchr(unsigned char const *string, int character)

if 0
	xor	eax, eax	; eax = 0
	mov	al, [esp+8]	; eax = character
	imul	eax, 01010101h	; eax = character
				;     | character << 8
				;     | character << 16
				;     | character << 24
	movd	xmm0, eax
else
	movd	xmm0, dword ptr [esp+8]
	punpcklbw xmm0, xmm0
	punpcklwd xmm0, xmm0
endif
	pshufd	xmm0, xmm0, 0	; xmm0 = multiplied character
	mov	ecx, [esp+4]	; ecx = address of string
	mov	edx, ecx
	and	ecx, 15		; ecx = address of string % 16
				;     = 16 - number of unaligned characters
	jz	short aligned
unaligned:
	sub	edx, ecx	; edx = aligned address before string
	movdqa	xmm1, [edx]	; xmm1 = chunk of 16 characters
	pxor	xmm2, xmm2	; xmm2 = 0
	pcmpeqb	xmm2, xmm1	; xmm2 = '\377' for each '\0' in chunk
	pcmpeqb	xmm1, xmm0	; xmm1 = '\377' for each matching character in chunk
	por	xmm1, xmm2	; xmm1 = '\377' for each '\0' or matching character in chunk
	pmovmskb eax, xmm1	; eax = bitmask for '\0' or matching characters in chunk
	shr	eax, cl
	shl	eax, cl		; eax = bitmask for '\0' or matching characters in string
	jnz	short match
next:
	add	edx, 16		; edx = address of next chunk of aligned characters
aligned:
	movdqa	xmm1, [edx]	; xmm1 = chunk of 16 characters
	pxor	xmm2, xmm2	; xmm2 = 0
	pcmpeqb	xmm2, xmm1	; xmm2 = '\377' for each '\0' in chunk
	pcmpeqb	xmm1, xmm0	; xmm1 = '\377' for each matching character in chunk
	por	xmm1, xmm2	; xmm1 = '\377' for each '\0' or matching character in chunk
	pmovmskb eax, xmm1	; eax = bitmask for '\0' or matching characters in chunk
	test	eax, eax
	jz	short next	; no '\0' or matching character in chunk?
match:
	bsf	eax, eax	; eax = offset of '\0' or matching character in chunk
	add	eax, edx	; eax = address of '\0' or matching character
	mov	cl, [esp+8]	; cl = character
	cmp	cl, [eax]	; ZF = (character = matching character)
	setne	cl		; ecx = (character = matching character) ? 0 : 1
	dec	ecx		; ecx = (character = matching character) ? -1 : 0
	and	eax, ecx	; eax = address of matching character
	ret

strchr	endp
	end
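
The SSE2 routine above reduces each chunk of 16 characters to a 16-bit mask and discards the bits for the characters before the start of the string with a SHR/SHL pair. A scalar C model of this step might look as follows; it is a sketch only, the names chunk_mask() and first_hit_in_chunk() are hypothetical, and a plain loop stands in for the PCMPEQB, POR and PMOVMSKB instructions.

```c
#include <stdint.h>

/* Scalar stand-in for PCMPEQB + POR + PMOVMSKB: bit i of the result is
   set if chunk[i] is '\0' or equal to character. */
static uint32_t chunk_mask(const unsigned char chunk[16], unsigned char character)
{
    uint32_t mask = 0;
    unsigned i;

    for (i = 0; i < 16; i++)
        if (chunk[i] == '\0' || chunk[i] == character)
            mask |= UINT32_C(1) << i;
    return mask;
}

/* Offset within the chunk of the first hit at or after position skip
   (0..15), or 16 if there is none; skip plays the role of CL. */
static unsigned first_hit_in_chunk(const unsigned char chunk[16],
                                   unsigned char character, unsigned skip)
{
    uint32_t mask = chunk_mask(chunk, character);
    unsigned offset = 0;

    mask = (mask >> skip) << skip;      /* like SHR/SHL: drop characters before the string */
    if (mask == 0)
        return 16;
    while ((mask & 1) == 0) {           /* like BSF */
        mask >>= 1;
        offset++;
    }
    return offset;
}
```

With a chunk that starts with the characters 'a', '0', 'b', '0', 'c', '\0', a search for '0' yields offset 1 when nothing is skipped, but offset 3 when the first 2 characters lie before the string.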

Implementation with SSSE3 Instructions in i386 Assembler

An implementation for processors which support the Supplemental Streaming SIMD Extensions 3 alias Merom New Instructions, introduced June 26, 2006 with the Core™ micro-architecture, needs 33 instructions in 103 bytes:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.xmm
	.model	flat, C
	.code

strchr	proc	public		; char *strchr(unsigned char const *string, int character)

	pxor	xmm0, xmm0	; xmm0 = 0
	movd	xmm1, dword ptr [esp+8]
				; xmm1 = character
	pshufb	xmm1, xmm0	; xmm1 = multiplied character
	mov	ecx, [esp+4]	; ecx = address of string
	mov	edx, ecx
	and	ecx, 15		; ecx = address of string % 16
				;     = 16 - number of unaligned characters
	jz	short aligned
unaligned:
	sub	edx, ecx	; edx = aligned address before string
	movdqa	xmm2, [edx]	; xmm2 = chunk of 16 characters
;;	pxor	xmm0, xmm0	; xmm0 = 0
	pcmpeqb	xmm0, xmm2	; xmm0 = '\377' for each '\0' in chunk
	pcmpeqb	xmm2, xmm1	; xmm2 = '\377' for each matching character in chunk
	por	xmm0, xmm2	; xmm0 = '\377' for each '\0' or matching character in chunk
	pmovmskb eax, xmm0	; eax = bitmask for '\0' or matching characters in chunk
	shr	eax, cl
	shl	eax, cl		; eax = bitmask for '\0' or matching characters in string
	jnz	short match
next:
	add	edx, 16		; edx = address of next chunk of aligned characters
aligned:
	movdqa	xmm2, [edx]	; xmm2 = chunk of 16 characters
	pxor	xmm0, xmm0	; xmm0 = 0
	pcmpeqb	xmm0, xmm2	; xmm0 = '\377' for each '\0' in chunk
	pcmpeqb	xmm2, xmm1	; xmm2 = '\377' for each matching character in chunk
	por	xmm0, xmm2	; xmm0 = '\377' for each '\0' or matching character in chunk
	pmovmskb eax, xmm0	; eax = bitmask for '\0' or matching characters in chunk
	test	eax, eax
	jz	short next	; no '\0' or matching character in chunk?
match:
	bsf	eax, eax	; eax = offset of '\0' or matching character in chunk
	add	eax, edx	; eax = address of '\0' or matching character
	mov	cl, [esp+8]	; cl = character
	cmp	cl, [eax]	; ZF = (character = matching character)
	setne	cl		; ecx = (character = matching character) ? 0 : 1
	dec	ecx		; ecx = (character = matching character) ? -1 : 0
	and	eax, ecx	; eax = address of matching character
	ret

strchr	endp
	end
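
The last instructions of the routine above turn the address found, which refers to either a matching character or the terminating '\0', into the return value without a branch. The following C sketch models the SETNE, DEC and AND instructions; the name select_match() is hypothetical, and the code assumes that a pointer with all bits clear is the null pointer, as on the i386 platform.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: hit points at the first character that is either
   '\0' or equal to character; return hit if it equals character and a
   null pointer otherwise, without a conditional branch in the source. */
static const char *select_match(const char *hit, unsigned char character)
{
    uintptr_t differs = ((unsigned char)*hit != character); /* like SETNE: 0 or 1 */
    uintptr_t mask = differs - 1;       /* like DEC: all ones on match, 0 otherwise */

    return (const char *)((uintptr_t)hit & mask);           /* like AND */
}
```

A search for '\0' itself therefore returns the address of the terminator, as required for strchr().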

Implementation with AVX Instructions in i386 Assembler

An implementation for processors which support the Advanced Vector Extensions alias Sandy Bridge New Instructions, introduced January 9, 2011 with the Sandy Bridge micro-architecture, needs 32 instructions in 99 bytes:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.xmm
	.model	flat, C
	.code

strchr	proc	public		; char *strchr(unsigned char const *string, int character)

	vpxor	xmm0, xmm0, xmm0; xmm0 = 0
	vmovd	xmm1, dword ptr [esp+8]
				; xmm1 = character
	vpshufb	xmm1, xmm1, xmm0; xmm1 = multiplied character
	mov	ecx, [esp+4]	; ecx = address of string
	mov	edx, ecx
	and	ecx, 15		; ecx = address of string % 16
				;     = 16 - number of unaligned characters
	jz	short aligned
unaligned:
	sub	edx, ecx	; edx = aligned address before string
	vmovdqa	xmm2, xmmword ptr [edx]
				; xmm2 = chunk of 16 characters
	vpcmpeqb xmm3, xmm2, xmm1
				; xmm3 = '\377' for each matching character in chunk
	vpcmpeqb xmm2, xmm2, xmm0
				; xmm2 = '\377' for each '\0' in chunk
	vpor	xmm2, xmm2, xmm3; xmm2 = '\377' for each '\0' or matching character in chunk
	vpmovmskb eax, xmm2	; eax = bitmask for '\0' or matching characters in chunk
	shr	eax, cl
	shl	eax, cl		; eax = bitmask for '\0' or matching characters in string
	jnz	short match
next:
	add	edx, 16		; edx = address of next chunk of aligned characters
aligned:
	vmovdqa	xmm2, xmmword ptr [edx]
				; xmm2 = chunk of 16 characters
	vpcmpeqb xmm3, xmm2, xmm1
				; xmm3 = '\377' for each matching character in chunk
	vpcmpeqb xmm2, xmm2, xmm0
				; xmm2 = '\377' for each '\0' in chunk
	vpor	xmm2, xmm2, xmm3; xmm2 = '\377' for each '\0' or matching character in chunk
	vpmovmskb eax, xmm2	; eax = bitmask for '\0' or matching characters in chunk
	test	eax, eax
	jz	short next	; no '\0' or matching character in chunk?
match:
	bsf	eax, eax	; eax = offset of '\0' or matching character in chunk
	add	eax, edx	; eax = address of '\0' or matching character
	mov	cl, [esp+8]	; cl = character
	cmp	cl, [eax]	; ZF = (character = matching character)
	setne	cl		; ecx = (character = matching character) ? 0 : 1
	dec	ecx		; ecx = (character = matching character) ? -1 : 0
	and	eax, ecx	; eax = address of matching character
	ret

strchr	endp
	end

Implementation with AVX2 Instructions in i386 Assembler

An implementation for processors which support the Advanced Vector Extensions 2 alias Haswell New Instructions, introduced June 4, 2013 with the Haswell micro-architecture, needs 31 instructions in 95 bytes:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.ymm
	.model	flat, C
	.code

strchr	proc	public		; char *strchr(unsigned char const *string, int character)

	vpxor	ymm0, ymm0, ymm0; ymm0 = 0
	vpbroadcastb ymm1, byte ptr [esp+8]
				; ymm1 = multiplied character
	mov	ecx, [esp+4]	; ecx = address of string
	mov	edx, ecx
	and	ecx, 31		; ecx = address of string % 32
				;     = 32 - number of unaligned characters
	jz	short aligned
unaligned:
	sub	edx, ecx	; edx = aligned address before string
	vmovdqa	ymm2, ymmword ptr [edx]
				; ymm2 = chunk of 32 characters
	vpcmpeqb ymm3, ymm2, ymm1
				; ymm3 = '\377' for each matching character in chunk
	vpcmpeqb ymm2, ymm2, ymm0
				; ymm2 = '\377' for each '\0' in chunk
	vpor	ymm2, ymm2, ymm3; ymm2 = '\377' for each '\0' or matching character in chunk
	vpmovmskb eax, ymm2	; eax = bitmask for '\0' or matching characters in chunk
	shr	eax, cl
	shl	eax, cl		; eax = bitmask for '\0' or matching characters in string
	jnz	short match
next:
	add	edx, 32		; edx = address of next chunk of aligned characters
aligned:
	vmovdqa	ymm2, ymmword ptr [edx]
				; ymm2 = chunk of 32 characters
	vpcmpeqb ymm3, ymm2, ymm1
				; ymm3 = '\377' for each matching character in chunk
	vpcmpeqb ymm2, ymm2, ymm0
				; ymm2 = '\377' for each '\0' in chunk
	vpor	ymm2, ymm2, ymm3; ymm2 = '\377' for each '\0' or matching character in chunk
	vpmovmskb eax, ymm2	; eax = bitmask for '\0' or matching characters in chunk
	test	eax, eax
	jz	short next	; no '\0' or matching character in chunk?
match:
	bsf	eax, eax	; eax = offset of '\0' or matching character in chunk
	add	eax, edx	; eax = address of '\0' or matching character
	mov	cl, [esp+8]	; cl = character
	cmp	cl, [eax]	; ZF = (character = matching character)
	setne	cl		; ecx = (character = matching character) ? 0 : 1
	dec	ecx		; ecx = (character = matching character) ? -1 : 0
	and	eax, ecx	; eax = address of matching character
	ret

strchr	endp
	end

strlen() Standard Function for i386 Platform

Although the strlen() function is not a compiler helper function, it is, like the memchr() function, included here for entertainment due to its ~~quality~~ deficiencies and flaws.

DIR "%source%\intel\strlen.asm"
TYPE "%source%\intel\strlen.asm"

 Volume in drive C has no label.
 Volume Serial Number is 1957-0427

 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel

02/18/2011  03:40 PM             3,208 strlen.asm
               1 File(s)          3,208 bytes
               0 Dir(s)    9,876,543,210 bytes free

        page    ,132
        title   strlen - return the length of a null-terminated string
;***
;strlen.asm - contains strlen() routine
;
;       Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
;       strlen returns the length of a null-terminated string,
;       not including the null byte itself.
;
;*******************************************************************************

        .xlist
        include cruntime.inc
        .list

page
;***
;strlen - return the length of a null-terminated string
;
;Purpose:
;       Finds the length in bytes of the given string, not including
;       the final null character.
;
;       Algorithm:
;       int strlen (const char * str)
;       {
;           int length = 0;
;
;           while( *str++ )
;                   ++length;
;
;           return( length );
;       }
;
;Entry:
;       const char * str - string whose length is to be computed
;
;Exit:
;       EAX = length of the string "str", exclusive of the final null byte
;
;Uses:
;       EAX, ECX, EDX
;
;Exceptions:
;
;*******************************************************************************

        CODESEG

        public  strlen

strlen  proc \
        buf:ptr byte

        OPTION PROLOGUE:NONE, EPILOGUE:NONE

        .FPO    ( 0, 1, 0, 0, 0, 0 )

string  equ     [esp + 4]

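        mov     ecx,string              ; ecx -> string
        test    ecx,3                   ; [deleted] test if string is aligned on 32 bits
        test    cl,3                    ; [inserted]
        je      short main_loop

str_misaligned:
        ; simple byte loop until string is aligned
        mov     al,byte ptr [ecx]
        add     ecx,1                   ; [deleted]
        inc     ecx                     ; [inserted]
        test    al,al
        je      short byte_3
        test    ecx,3                   ; [deleted]
        test    cl,3                    ; [inserted]
        jne     short str_misaligned
        jmp     short main_loop         ; [inserted]
byte_3:                                 ; [inserted]
        lea     eax,[ecx - 1]           ; [inserted]
        mov     ecx,string              ; [inserted]
        sub     eax,ecx                 ; [inserted]
        ret                             ; [inserted]

        add     eax,dword ptr 0         ; [deleted] 5 byte nop to align label below

        align   16                      ; should be redundant

main_loop:
        mov     eax,dword ptr [ecx]     ; read 4 bytes
        mov     edx,7efefeffh           ; [deleted]
        add     edx,eax                 ; [deleted]
        xor     eax,-1                  ; [deleted]
        xor     eax,edx                 ; [deleted]
        add     ecx,4
        test    eax,81010100h           ; [deleted]
        lea     edx,[eax-01010101h]     ; [inserted]
        not     eax                     ; [inserted]
        and     eax,edx                 ; [inserted]
        and     eax,80808080h           ; [inserted]
        je      short main_loop
        ; found zero byte in the loop
        bsf     eax,eax                 ; [inserted]
        shr     eax,3                   ; [inserted]
        lea     eax,[eax+ecx-4]         ; [inserted]
        mov     ecx,string              ; [inserted]
        sub     eax,ecx                 ; [inserted]
        ret                             ; [inserted]
        mov     eax,[ecx - 4]           ; [deleted]
        test    al,al                   ; [deleted] is it byte 0
        je      short byte_0            ; [deleted]
        test    ah,ah                   ; [deleted] is it byte 1
        je      short byte_1            ; [deleted]
        test    eax,00ff0000h           ; [deleted] is it byte 2
        je      short byte_2            ; [deleted]
        test    eax,0ff000000h          ; [deleted] is it byte 3
        je      short byte_3            ; [deleted]
        jmp     short main_loop         ; [deleted] taken if bits 24-30 are clear and bit
                                        ; 31 is set

byte_3:                                 ; [deleted] moved before main_loop
        lea     eax,[ecx - 1]           ; [deleted]
        mov     ecx,string              ; [deleted]
        sub     eax,ecx                 ; [deleted]
        ret                             ; [deleted]
byte_2:                                 ; [deleted]
        lea     eax,[ecx - 2]           ; [deleted]
        mov     ecx,string              ; [deleted]
        sub     eax,ecx                 ; [deleted]
        ret                             ; [deleted]
byte_1:                                 ; [deleted]
        lea     eax,[ecx - 3]           ; [deleted]
        mov     ecx,string              ; [deleted]
        sub     eax,ecx                 ; [deleted]
        ret                             ; [deleted]
byte_0:                                 ; [deleted]
        lea     eax,[ecx - 4]           ; [deleted]
        mov     ecx,string              ; [deleted]
        sub     eax,ecx                 ; [deleted]
        ret                             ; [deleted]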

strlen  endp

        end

With 44 instructions in 139 bytes, this routine is a real gem too, one which nobody in their right mind should even consider using!

CAVEAT: Intel’s current Optimization Reference Manual: Volume 1, published January 2024, presents this dumb implementation as Example 14-3!

OOPS: the ~~deleted~~ TEST instructions with immediate value 3 should be replaced with the inserted shorter ones, saving 6 bytes.

OUCH¹: the ~~deleted~~ ADD instruction which increments by 1 should be replaced with the inserted shorter INC instruction, saving 2 bytes (3 bytes for ADD with an 8-bit immediate operand versus 1 byte for INC).

Note: the 8 saved bytes allow moving the 4 instructions after label byte_3: before the label main_loop:, (ab)using them to align the loop.

OUCH²: instead of the ~~deleted~~ XOR instruction with immediate operand -1 the inserted shorter NOT instruction should be used, saving 1 byte!

OUCH³: when the 5 ~~deleted~~ instructions after label main_loop: are replaced with the 4 instructions inserted there, the 22 (in words: twenty-two) ~~deleted~~ instructions at the end of the function can be replaced with the 6 faster and shorter instructions inserted there, saving 42 (in words: forty-two) bytes!

Note: Alan Mycroft posted the better, faster and shorter method on April 8, 1987 to the USENET news group comp.lang.c:

You might be interested to know that such detection of null bytes in words can be done in 3 or 4 instructions on almost any hardware (nay even in C). (Code that follows relies on x being a 32 bit unsigned (or 2's complement int with overflow ignored)...)

	#define has_nullbyte_(x) ((x - 0x01010101) & ~x & 0x80808080)

Then if e is an expression without side effects (e.g. variable) has_nullbyte_(e) is nonzero iff the value of e has a null byte. (One can view this as explicit programming of the Manchester carry chain present in many processors which is hidden from the instruction level).

Note: see Bit Twiddling Hacks – Determine if a word has a zero byte for a comparison of both methods and more details.
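
The difference between both methods can be shown with a few lines of C. This is a sketch for illustration only: the function names has_nullbyte() and maybe_nullbyte() are hypothetical; the first implements the expression from Alan Mycroft's posting, the second the ADD, XOR and TEST sequence with the constants 7efefeffh and 81010100h from Microsoft's main_loop.

```c
#include <stdint.h>

/* Alan Mycroft's test: nonzero if and only if x has a zero byte. */
static uint32_t has_nullbyte(uint32_t x)
{
    return (x - UINT32_C(0x01010101)) & ~x & UINT32_C(0x80808080);
}

/* Test from Microsoft's main_loop: nonzero if x MAY have a zero byte,
   so every hit must be verified byte by byte afterwards. */
static uint32_t maybe_nullbyte(uint32_t x)
{
    return ((x + UINT32_C(0x7efefeff)) ^ ~x) & UINT32_C(0x81010100);
}
```

For 0x80414141, which has no zero byte, has_nullbyte() yields 0 while maybe_nullbyte() yields a nonzero value: this is the case "bits 24-30 are clear and bit 31 is set" which the original source handles with its final JMP back to main_loop.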

Note: Microsoft’s strcat.asm, strchr.asm, strncat.asm and strncpy.asm sources suffer from the same plus some more deficiencies and flaws!

Note: with the modifications shown in the source, this routine has 27 instructions in 87 bytes, i.e. less than two thirds of the original’s instructions and bytes.

Implementation in i386 Assembler

A proper implementation needs only 24 instructions in just 63 bytes:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.386
	.model	flat, C
	.code

strlen	proc	public		; size_t strlen(unsigned char const *string)

	mov	edx, [esp+4]	; edx = address of string
	mov	ecx, edx
	and	ecx, 3		; ecx = address of string % 4
				;     = 4 - number of unaligned characters
	jz	short aligned	; address of string % 4 = 0?
unaligned:
	sub	edx, ecx	; edx = aligned address before string
	shl	ecx, 3		; ecx = (4 - number of unaligned characters) * 8
				;     = 32 - number of unaligned bits
if 0
	xor	eax, eax
	dec	eax		; eax = ~0
else
	or	eax, -1		; eax = ~0
endif
	shl	eax, cl		; eax = ~0 for unaligned characters, 0 elsewhere
	not	eax		; eax = 0 for unaligned characters, ~0 elsewhere
	or	eax, [edx]	; eax = unaligned characters
	jmp	short mycroft
next:
	add	edx, 4		; edx = address of next 4 aligned characters
aligned:
	mov	eax, [edx]	; eax = next 4 aligned characters
mycroft:
	mov	ecx, eax
	sub	eax, 01010101h
	not	ecx
	and	eax, 80808080h
	and	eax, ecx	; eax = '\200' for matching characters, '\0' elsewhere
	jz	short next	; no '\0' in any character?
match:
	bsf	eax, eax	; eax = offset of '\0' * 8 + 7
				;     = {7, 15, 23, 31}
	shr	eax, 3		; eax = offset of '\0'
				;     = {0, 1, 2, 3}
	add	eax, edx	; eax = address of '\0'
	sub	eax, [esp+4]	; eax = length of string
	ret

strlen	endp
	end

Save the i386 assembler source presented above as i386-strlen.asm and the ANSI C source presented below as i386-strlen.c, then execute the 6 command lines following the ANSI C source to assemble i386-strlen.asm, compile i386-strlen.c, link the generated object files i386-strlen.obj and i386-strlen.tmp, and finally execute the image file i386-strlen.exe to demonstrate the correct operation:

// Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#define STRICT
#undef UNICODE
#define WIN32_LEAN_AND_MEAN

#include <windows.h>

#pragma function(strlen)

__declspec(safebuffers)
BOOL	CDECL	PrintFormat(HANDLE hOutput, [SA_FormatString(Style="printf")] LPCSTR lpFormat, ...)
{
	CHAR	szFormat[1024];
	DWORD	dwFormat;
	DWORD	dwOutput;

	va_list	vaInput;
	va_start(vaInput, lpFormat);

	dwFormat = wvsprintf(szFormat, lpFormat, vaInput);

	va_end(vaInput);

	if ((dwFormat == 0UL)
	 || !WriteFile(hOutput, szFormat, dwFormat, &dwOutput, (LPOVERLAPPED) NULL))
		return FALSE;

	return dwOutput == dwFormat;
}

const	CHAR	szString[] = "987654321";

__declspec(noreturn)
VOID	CDECL	mainCRTStartup(VOID)
{
	LPCSTR	lpString = szString + sizeof(szString);
	DWORD	dwError = ERROR_SUCCESS;
	HANDLE	hOutput = GetStdHandle(STD_OUTPUT_HANDLE);

	if (hOutput == INVALID_HANDLE_VALUE)
		dwError = GetLastError();
	else
		while (--lpString >= szString)
			PrintFormat(hOutput,
			            "0x%p: strlen(\"%hs\") = %lu\r\n",
			            lpString, lpString, strlen(lpString));

	ExitProcess(dwError);
}

SET ML=/c /safeseh /W3 /X
ML.EXE i386-strlen.asm
SET CL=/GAFy /Oy /W4 /Zl
SET LINK=/DEFAULTLIB:kernel32.lib /DEFAULTLIB:user32.lib /ENTRY:mainCRTStartup /SUBSYSTEM:CONSOLE
CL.EXE /Foi386-strlen.tmp i386-strlen.obj i386-strlen.c
.\i386-strlen.exe

For details and reference see the MSDN articles Compiler Options and Linker Options.

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: the command lines can be copied and pasted as a block into a Command Processor window!

Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: i386-strlen.asm

Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.

i386-strlen.c

Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

/DEFAULTLIB:kernel32.lib /DEFAULTLIB:user32.lib /ENTRY:mainCRTStartup /SUBSYSTEM:CONSOLE
/out:i386-strlen.exe
i386-strlen.obj
i386-strlen.tmp

0x01202025: strlen("") = 0
0x01202024: strlen("1") = 1
0x01202023: strlen("21") = 2
0x01202022: strlen("321") = 3
0x01202021: strlen("4321") = 4
0x01202020: strlen("54321") = 5
0x0120201F: strlen("654321") = 6
0x0120201E: strlen("7654321") = 7
0x0120201D: strlen("87654321") = 8
0x0120201C: strlen("987654321") = 9
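
How the i386 routine above copes with a string that does not start on a 4-byte boundary can be modelled in C. The following sketch rests on assumptions: the names load32() and strlen_model() are hypothetical, the string is given as an offset into a buffer which is treated as a sequence of aligned 4-byte words, and the buffer must extend to the end of the word which holds the '\0'. Unlike the assembler routine, the model never reads outside the buffer it is given.

```c
#include <stddef.h>
#include <stdint.h>

/* Load 4 bytes like an i386: lowest address into bits 0..7. */
static uint32_t load32(const unsigned char *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8
         | (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Hypothetical model: length of the string which starts at buffer + start. */
static size_t strlen_model(const unsigned char *buffer, size_t start)
{
    size_t skip = start % 4;            /* characters before the string in the first word */
    size_t base = start - skip;         /* offset of the aligned word */
    uint32_t word = load32(buffer + base);
    uint32_t hits;
    size_t index = 0;

    /* like SHL/NOT/OR: force the skipped characters to '\377' so they cannot match */
    word |= ~(UINT32_C(0xFFFFFFFF) << (8 * skip));
    for (;;) {
        hits = (word - UINT32_C(0x01010101)) & ~word & UINT32_C(0x80808080);
        if (hits != 0)
            break;
        base += 4;
        word = load32(buffer + base);
    }
    while ((hits & 0x80) == 0) {        /* like BSF followed by SHR by 3 */
        hits >>= 8;
        index++;
    }
    return base + index - start;
}
```

Forcing the skipped characters to '\377' instead of testing them one by one is what lets the routine enter its main loop immediately, without the byte loop of Microsoft's implementation.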

Implementation with SSE2 Instructions in i386 Assembler

An implementation for processors which support the Streaming SIMD Extensions 2 alias Willamette New Instructions, introduced November 20, 2000 with the Pentium® 4, needs only 21 instructions in just 60 bytes:

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.xmm
	.model	flat, C
	.code

strlen	proc	public		; size_t strlen(unsigned char const *string)

	mov	ecx, [esp+4]	; ecx = address of string
	mov	edx, ecx
	and	ecx, 15		; ecx = address of string % 16
				;     = 16 - number of unaligned characters
	jz	short aligned
unaligned:
	sub	edx, ecx	; edx = aligned address before string
	pxor	xmm0, xmm0	; xmm0 = 0
	pcmpeqb	xmm0, [edx]	; xmm0 = '\377' for each '\0' in chunk of characters
	pmovmskb eax, xmm0	; eax = bitmask for '\0' in chunk of characters
	shr	eax, cl
	shl	eax, cl		; eax = bitmask for '\0' in string
	jnz	short match
next:
	add	edx, 16		; edx = address of next chunk of aligned characters
aligned:
	pxor	xmm0, xmm0	; xmm0 = 0
	pcmpeqb	xmm0, [edx]	; xmm0 = '\377' for each '\0' in chunk of characters
	pmovmskb eax, xmm0	; eax = bitmask for '\0' in chunk of characters
	test	eax, eax
	jz	short next	; no '\0' in chunk?
match:
	bsf	eax, eax	; eax = offset of '\0' in chunk of characters
	add	eax, edx	; eax = address of '\0'
	sub	eax, [esp+4]	; eax = length of string
	ret

strlen	endp
	end

Implementation with AVX Instructions in i386 Assembler

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.ymm
	.model	flat, C
	.code

strlen	proc	public		; size_t strlen(unsigned char const *string)

	vpxor	xmm0, xmm0, xmm0; xmm0 = 0
	mov	ecx, [esp+4]	; ecx = address of string
	mov	edx, ecx
	and	ecx, 15		; ecx = address of string % 16
				;     = 16 - number of unaligned characters
	jz	short aligned
unaligned:
	sub	edx, ecx	; edx = aligned address before string
	vpcmpeqb xmm1, xmm0, [edx]
				; xmm1 = '\377' for each '\0' in chunk of characters
	vpmovmskb eax, xmm1	; eax = bitmask for '\0' in chunk of characters
	shr	eax, cl
	shl	eax, cl		; eax = bitmask for '\0' in string
	jnz	short match
next:
	add	edx, 16		; edx = address of next chunk of aligned characters
aligned:
	vpcmpeqb xmm1, xmm0, [edx]
				; xmm1 = '\377' for each '\0' in chunk of characters
	vpmovmskb eax, xmm1	; eax = bitmask for '\0' in chunk of characters
	test	eax, eax
	jz	short next	; no '\0' in chunk?
match:
	bsf	eax, eax	; eax = offset of '\0' in chunk of characters
	add	eax, edx	; eax = address of '\0'
	sub	eax, [esp+4]	; eax = length of string
	ret

strlen	endp
	end

Implementation with AVX2 Instructions in i386 Assembler

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.ymm
	.model	flat, C
	.code

strlen	proc	public		; size_t strlen(unsigned char const *string)

	vpxor	ymm0, ymm0, ymm0; ymm0 = 0
	mov	ecx, [esp+4]	; ecx = address of string
	mov	edx, ecx
	and	ecx, 31		; ecx = address of string % 32
				;     = 32 - number of unaligned characters
	jz	short aligned
unaligned:
	sub	edx, ecx	; edx = aligned address before string
	vpcmpeqb ymm1, ymm0, [edx]
				; ymm1 = '\377' for each '\0' in chunk of characters
	vpmovmskb eax, ymm1	; eax = bitmask for '\0' in chunk of characters
	shr	eax, cl
	shl	eax, cl		; eax = bitmask for '\0' in string
	jnz	short match
next:
	add	edx, 32		; edx = address of next chunk of aligned characters
aligned:
	vpcmpeqb ymm1, ymm0, [edx]
				; ymm1 = '\377' for each '\0' in chunk of characters
	vpmovmskb eax, ymm1	; eax = bitmask for '\0' in chunk of characters
	test	eax, eax
	jz	short next	; no '\0' in chunk?
match:
	bsf	eax, eax	; eax = offset of '\0' in chunk of characters
	add	eax, edx	; eax = address of '\0'
	sub	eax, [esp+4]	; eax = length of string
	ret

strlen	endp
	end

Implementation in AMD64 Assembler

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.code

strlen	proc	public		; size_t strlen(unsigned char const *string)

	mov	r8, 0101010101010101h
if 0
	mov	r9, 8080808080808080h
elseif 0
	imul	r9, r8, 128	; r9 = 0x8080808080808080
else
	mov	r9, r8
	ror	r9, 1		; r9 = 0x8080808080808080
endif
	mov	r10, rcx
	mov	rdx, rcx	; rdx = address of string
	and	rcx, 7		; rcx = address of string % 8
				;     = 8 - number of unaligned characters
	jz	short aligned	; address of string % 8 = 0?
unaligned:
	sub	rdx, rcx	; rdx = aligned address before string
	shl	ecx, 3		; rcx = (8 - number of unaligned characters) * 8
				;     = 64 - number of unaligned bits
ifdef AMD
	stc
	sbb	rax, rax	; rax = ~0
else
	or	rax, -1		; rax = ~0
endif
	shl	rax, cl		; rax = ~0 for unaligned characters, 0 elsewhere
	not	rax		; rax = 0 for unaligned characters, ~0 elsewhere
	or	rax, [rdx]	; rax = unaligned characters
	jmp	short mycroft
next:
	add	rdx, 8		; rdx = address of next 8 aligned characters
aligned:
	mov	rax, [rdx]	; rax = next 8 aligned characters
mycroft:
	mov	rcx, rax
	sub	rax, r8
	not	rcx
	and	rax, r9
	and	rcx, rax	; rcx = '\200' for matching characters, '\0' elsewhere
	jz	short next	; no '\0' in any character?
match:
	bsf	rax, rcx	; rax = offset of '\0' * 8 + 7
				;     = {7, 15, 23, 31, 39, 47, 55, 63}
	shr	eax, 3		; rax = offset of '\0'
				;     = {0, 1, 2, 3, 4, 5, 6, 7}
	add	rax, rdx	; rax = address of '\0'
	sub	rax, r10	; rax = length of string
	ret

strlen	endp
	end
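
The listing above, which uses the 64-bit registers r8 and r9, offers 3 alternatives to load the constant 0x8080808080808080: an immediate operand, a multiplication of 0x0101010101010101 by 128, and a rotation of 0x0101010101010101 by 1 bit to the right. The following C sketch checks that the latter two yield the same value and extends Mycroft's test to 64 bits; the names ONES, rotate_right_1(), times_128() and has_nullbyte64() are hypothetical.

```c
#include <stdint.h>

#define ONES UINT64_C(0x0101010101010101)

/* like ROR r9, 1 */
static uint64_t rotate_right_1(uint64_t x)
{
    return (x >> 1) | (x << 63);
}

/* like IMUL r9, r8, 128: only the low 64 bits of the product are kept */
static uint64_t times_128(uint64_t x)
{
    return x * 128;
}

/* 64-bit variant of Mycroft's test: nonzero if and only if x has a zero byte */
static uint64_t has_nullbyte64(uint64_t x)
{
    return (x - ONES) & ~x & rotate_right_1(ONES);
}
```

Deriving the second constant from the first avoids a second MOV with a 64-bit immediate operand.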

Implementation with SSE2 Instructions in AMD64 Assembler

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.code

strlen	proc	public		; size_t strlen(unsigned char const *string)

	mov	rdx, rcx	; rdx = address of string
	mov	r8, rcx
	and	ecx, 15		; rcx = address of string % 16
				;     = 16 - number of unaligned characters
	jz	short aligned
unaligned:
	sub	rdx, rcx	; rdx = aligned address before string
	pxor	xmm0, xmm0	; xmm0 = 0
	pcmpeqb	xmm0, [rdx]	; xmm0 = '\377' for each '\0' in chunk of characters
	pmovmskb eax, xmm0	; rax = bitmask for '\0' in chunk of characters
	shr	eax, cl
	shl	eax, cl		; rax = bitmask for '\0' in string
	jnz	short match
next:
	add	rdx, 16		; rdx = address of next chunk of aligned characters
aligned:
	pxor	xmm0, xmm0	; xmm0 = 0
	pcmpeqb	xmm0, [rdx]	; xmm0 = '\377' for each '\0' in chunk of characters
	pmovmskb eax, xmm0	; rax = bitmask for '\0' in chunk of characters
	test	eax, eax
	jz	short next	; no '\0' in chunk?
match:
	bsf	eax, eax	; rax = offset of '\0' in chunk of characters
	add	rax, rdx	; rax = address of '\0'
	sub	rax, r8		; rax = length of string
	ret

strlen	endp
	end
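
The SIMD variants share one idea: round the address down to the chunk size, build a bitmask of '\0' bytes, and shift out the bits that belong to bytes before the string. A minimal model in portable C, under stated assumptions: nul_mask() imitates PCMPEQB plus PMOVMSKB with a loop, the caller passes a buffer and an offset instead of a raw pointer so that no byte outside the buffer is read, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { CHUNK = 16 };	/* bytes per chunk, as with one XMM register */

/* Model of PCMPEQB + PMOVMSKB: bit i is set iff chunk[i] == '\0'. */
static uint32_t nul_mask(unsigned char const *chunk)
{
	uint32_t mask = 0;
	unsigned index;

	for (index = 0; index < CHUNK; index++)
		mask |= (uint32_t) (chunk[index] == '\0') << index;

	return mask;
}

/* Length of the string starting at buffer[offset]; 'buffer' is assumed to be
   a whole number of chunks long and to contain the terminating '\0'. */
size_t chunked_strlen(unsigned char const *buffer, size_t offset)
{
	size_t chunk = offset / CHUNK * CHUNK;	/* chunk start before string */
	unsigned skip = (unsigned) (offset % CHUNK);
	uint32_t mask = nul_mask(buffer + chunk);

	mask = mask >> skip << skip;	/* drop '\0' bytes before the string */

	while (mask == 0)
	{
		chunk += CHUNK;
		mask = nul_mask(buffer + chunk);
	}

	skip = 0;			/* model of BSF: index of lowest set bit */
	while (!(mask & 1))
		mask >>= 1, skip++;

	return chunk + skip - offset;
}

/* Usage sketch (hypothetical helper). */
void chunked_strlen_demo(void)
{
	/* three stray '\0' bytes precede the string, which starts at offset 3
	   and crosses the boundary between the first and the second chunk */
	static unsigned char const buffer[2 * CHUNK] = "\0\0\0hello, chunked world";

	assert(chunked_strlen(buffer, 3) == 20);
	assert(chunked_strlen(buffer, 0) == 0);
}
```

Without the shift pair the stray '\0' bytes at offsets 0 to 2 would end the scan before the string even starts.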

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.code

strlen	proc	public		; size_t strlen(unsigned char const *string)

	vpxor	xmm0, xmm0, xmm0; xmm0 = 0
	mov	rdx, rcx	; rdx = address of string
	mov	r8, rcx
	and	ecx, 15		; rcx = address of string % 16
				;     = 16 - number of unaligned characters
	jz	short aligned
unaligned:
	sub	rdx, rcx	; rdx = aligned address before string
	vpcmpeqb xmm1, xmm0, [rdx]
				; xmm1 = '\377' for each '\0' in chunk of characters
	vpmovmskb eax, xmm1	; rax = bitmask for '\0' in chunk of characters
	shr	eax, cl
	shl	eax, cl		; rax = bitmask for '\0' in string
	jnz	short match
next:
	add	rdx, 16		; rdx = address of next chunk of aligned characters
aligned:
	vpcmpeqb xmm1, xmm0, [rdx]
				; xmm1 = '\377' for each '\0' in chunk of characters
	vpmovmskb eax, xmm1	; rax = bitmask for '\0' in chunk of characters
	test	eax, eax
	jz	short next	; no '\0' in chunk?
match:
	bsf	eax, eax	; rax = offset of '\0' in chunk of characters
	add	rax, rdx	; rax = address of '\0'
	sub	rax, r8		; rax = length of string
	ret

strlen	endp
	end

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.code

strlen	proc	public		; size_t strlen(unsigned char const *string)

	vpxor	ymm0, ymm0, ymm0; ymm0 = 0
	mov	rdx, rcx	; rdx = address of string
	mov	r8, rcx
	and	ecx, 31		; rcx = address of string % 32
				;     = 32 - number of unaligned characters
	jz	short aligned
unaligned:
	sub	rdx, rcx	; rdx = aligned address before string
	vpcmpeqb ymm1, ymm0, [rdx]
				; ymm1 = '\377' for each '\0' in chunk of characters
	vpmovmskb eax, ymm1	; rax = bitmask for '\0' in chunk of characters
	shr	eax, cl
	shl	eax, cl		; rax = bitmask for '\0' in string
	jnz	short match
next:
	add	rdx, 32		; rdx = address of next chunk of aligned characters
aligned:
	vpcmpeqb ymm1, ymm0, [rdx]
				; ymm1 = '\377' for each '\0' in chunk of characters
	vpmovmskb eax, ymm1	; rax = bitmask for '\0' in chunk of characters
	test	eax, eax
	jz	short next	; no '\0' in chunk?
match:
	bsf	eax, eax	; rax = offset of '\0' in chunk of characters
	add	rax, rdx	; rax = address of '\0'
	sub	rax, r8		; rax = length of string
	ret

strlen	endp
	end

`strrchr()` and `strstr()` Standard Functions for i386 Platform

Although the strrchr() and strstr() functions are not compiler helper functions, they are, like the memchr() function, included here for entertainment due to their extraordinary quality, and also to show that the routines implemented in ANSI C are as bad as those implemented in assembly language!

DIR "%source%\str*.c"
TYPE "%source%\strrchr.c"
TYPE "%source%\strstr.c"

 Volume in drive C has no label.
 Volume Serial Number is 1957-0427

 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src

02/18/2011  03:40 PM             1,998 strcat.c
02/18/2011  03:40 PM               541 strcat_s.c
02/18/2011  03:40 PM             1,102 strchr.c
02/18/2011  03:40 PM             1,566 strcmp.c
02/18/2011  03:40 PM             2,532 strcoll.c
02/18/2011  03:40 PM               479 strcpy_s.c
02/18/2011  03:40 PM               337 strcspn.c
02/18/2011  03:40 PM             3,227 strdate.c
02/18/2011  03:40 PM             1,895 strdup.c
02/18/2011  03:40 PM             4,085 stream.c
02/18/2011  03:40 PM             4,414 strerror.c
02/18/2011  03:40 PM            42,150 strftime.c
02/18/2011  03:40 PM             2,757 stricmp.c
02/18/2011  03:40 PM             2,570 stricoll.c
02/18/2011  03:40 PM             1,009 strlen.c
02/18/2011  03:40 PM             1,276 strlen_s.c
02/18/2011  03:40 PM             5,994 strlwr.c
02/18/2011  03:40 PM             1,496 strncat.c
02/18/2011  03:40 PM               564 strncat_s.c
02/18/2011  03:40 PM             2,546 strncmp.c
02/18/2011  03:40 PM             1,250 strncnt.c
02/18/2011  03:40 PM             3,108 strncoll.c
02/18/2011  03:40 PM             1,464 strncpy.c
02/18/2011  03:40 PM               536 strncpy_s.c
02/18/2011  03:40 PM             3,628 strnicmp.c
02/18/2011  03:40 PM             2,988 strnicol.c
02/18/2011  03:40 PM             1,243 strnset.c
02/18/2011  03:40 PM               580 strnset_s.c
02/18/2011  03:40 PM               337 strpbrk.c
02/18/2011  03:40 PM             1,460 strrchr.c
02/18/2011  03:40 PM             1,204 strrev.c
02/18/2011  03:40 PM             1,204 strset.c
02/18/2011  03:40 PM               519 strset_s.c
02/18/2011  03:40 PM             4,922 strspn.c
02/18/2011  03:40 PM             1,371 strstr.c
02/18/2011  03:40 PM             3,226 strtime.c
02/18/2011  03:40 PM             3,500 strtod.c
02/18/2011  03:40 PM             4,167 strtok.c
02/18/2011  03:40 PM               450 strtok_s.c
02/18/2011  03:40 PM             8,862 strtol.c
02/18/2011  03:40 PM             7,726 strtoq.c
02/18/2011  03:40 PM             6,094 strupr.c
02/18/2011  03:40 PM             4,739 strxfrm.c
              43 File(s)        147,116 bytes
               0 Dir(s)    9,876,543,210 bytes free

/***
*strrchr.c - find last occurrence of character in string
*
*       Copyright (c) Microsoft Corporation. All rights reserved.
*
*Purpose:
*       defines strrchr() - find the last occurrence of a given character
*       in a string.
*
*******************************************************************************/

#include <cruntime.h>
#include <string.h>

/***
*char *strrchr(string, ch) - find last occurrence of ch in string
*
*Purpose:
*       Finds the last occurrence of ch in string.  The terminating
*       null character is used as part of the search.
*
*Entry:
*       char *string - string to search in
*       char ch - character to search for
*
*Exit:
*       returns a pointer to the last occurrence of ch in the given
*       string
*       returns NULL if ch does not occurr in the string
*
*Exceptions:
*
*******************************************************************************/

char * __cdecl strrchr (
        const char * string,
        int ch
        )
{
        char *start = (char *)string;

        while (*string++)                       /* find end of string */
                ;
                                                /* search towards front */
        while (--string != start && *string != (char)ch)
                ;

        if (*string == (char)ch)                /* char found ? */
                return( (char *)string );

        return(NULL);
}

/***
*strstr.c - search for one string inside another
*
*       Copyright (c) Microsoft Corporation. All rights reserved.
*
*Purpose:
*       defines strstr() - search for one string inside another
*
*******************************************************************************/

#include <cruntime.h>
#include <string.h>

/***
*char *strstr(string1, string2) - search for string2 in string1
*
*Purpose:
*       finds the first occurrence of string2 in string1
*
*Entry:
*       char *string1 - string to search in
*       char *string2 - string to search for
*
*Exit:
*       returns a pointer to the first occurrence of string2 in
*       string1, or NULL if string2 does not occur in string1
*
*Uses:
*
*Exceptions:
*
*******************************************************************************/

char * __cdecl strstr (
        const char * str1,
        const char * str2
        )
{
        char *cp = (char *) str1;
        char *s1, *s2;

        if ( !*str2 )
            return((char *)str1);

        while (*cp)
        {
                s1 = cp;
                s2 = (char *) str2;

                while ( *s1 && *s2 && !(*s1-*s2) )
                        s1++, s2++;

                if (!*s2)
                        return(cp);

                cp++;
        }

        return(NULL);

}

OUCH¹: the strrchr() function needlessly traverses its input string twice!

OUCH²: the strstr() function has quadratic, i.e. 𝒪(n²) runtime – a real shame!
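
The quadratic behaviour is easy to observe by counting character comparisons. The following sketch uses the same search strategy as the strstr() quoted above; counting_strstr() and its compares parameter are hypothetical names introduced only for this measurement.

```c
#include <assert.h>
#include <stddef.h>

/* Same search strategy as the quoted strstr(), instrumented to count
   the character comparisons it performs. */
char const *counting_strstr(char const *haystack, char const *needle, size_t *compares)
{
	char const *cp, *s1, *s2;

	*compares = 0;

	if (!*needle)
		return haystack;

	for (cp = haystack; *cp; cp++)
	{
		for (s1 = cp, s2 = needle; *s1 && *s2; s1++, s2++)
		{
			++*compares;

			if (*s1 != *s2)
				break;
		}

		if (!*s2)	/* whole needle matched? */
			return cp;
	}

	return NULL;
}

/* Usage sketch: a haystack of n 'a' and a needle of n/2 - 1 'a' plus 'b'. */
void counting_strstr_demo(void)
{
	size_t compares;

	assert(counting_strstr("aaaaaaaa", "aaab", &compares) == NULL);
	assert(compares == 26);		/* n = 8 */
	assert(counting_strstr("aaaaaaaaaaaaaaaa", "aaaaaaab", &compares) == NULL);
	assert(compares == 100);	/* n = 16 */
}
```

Doubling both lengths raises the count from 26 to 100, i.e. almost fourfold, as expected for a cost proportional to the product of the two lengths.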

Implementation with SSE4.2 Instructions in i386 Assembler

An implementation of the strrchr() function for processors which support the Streaming SIMD Extensions 4.2 alias Nehalem New Instructions, introduced November 11, 2008 with the Core™ i* line of processors, needs only 14 instructions in just 42 bytes:

; Copyright © 2009-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.xmm
	.model	flat, C
	.code

strrchr	proc	public		; char *strrchr(unsigned char const *string, int character)

	xor	eax, eax	; eax = 0
	mov	edx, [esp+4]	; edx = address of string
	and	edx, -16	; edx = aligned address before string
	movzx	ecx, byte ptr [esp+8]
	movd	xmm0, ecx	; xmm0 = prototype string "‹character›"
@@:
	pcmpistri xmm0, [edx], 40h
				; CF = ('‹character›' in chunk of characters),
				; ZF = ('\0' in chunk of characters),
				; ecx = ('\0' or '‹character›' in chunk of characters)
				;     ? index of '\0' or last matching '‹character›' : 16
	lea	ecx, [ecx+edx]
	cmovc	eax, ecx	; eax = address of last matching '‹character›'
	lea	edx, [edx+16]
	jnz	short @b	; no '\0' in chunk of characters?

	xor	ecx, ecx	; ecx = 0
	cmp	eax, [esp+4]
	cmovb	eax, ecx	; eax = (address of '‹character›' < address of string) ? 0
	ret

strrchr	endp
	end

`str*()` Standard Functions

Proper implementations of the strcat(), strchr(), strcmp(), strcoll(), strcpy(), strcspn(), strlen(), strncat(), strncmp(), strncpy(), strnlen(), strnset(), strpbrk(), strrchr(), strrev(), strset(), _strset(), strspn(), strstr(), strtok(), strtok_s() and strtok_r() functions for the i386 and the AMD64 platforms follow with build instructions.

Note: only strcat(), strcmp(), strcpy(), strlen() and strset() are available as intrinsic functions.

Implementation in ANSI C

// Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#define NULL	(void *) 0

#ifndef _WIN64
typedef	unsigned int	size_t;
#endif

#pragma function(strcat, strcmp, strcpy, strlen, strset)
#pragma intrinsic(memcmp)

void	*memchr(void const *memory, int character, size_t count);
int	memcmp(void const *source, void const *destination, size_t count);
size_t	strlen(unsigned char const *string);

char	*strstr(unsigned char const *haystack, unsigned char const *needle)
{
#if 0
	if (*needle == '\0')	// needle is an empty string?
		return (char *) haystack;

	if (*haystack == '\0')	// haystack is an empty string?
		return NULL;

	return (char *) memmem(haystack, strlen(haystack), needle, strlen(needle));
#else
	unsigned char const *string;

	size_t length = strlen(needle);
	size_t count = strlen(haystack);

	if (!length)		// needle is an empty string?
		return (char *) haystack;

	if (length > count)	// needle is longer than haystack?
		return NULL;

	if (!--length)		// needle is a single character?
		return memchr(haystack, *needle, count);

	count -= length;	// maximum number of characters to scan in haystack

	while (string = haystack,
	       haystack = (unsigned char const *) memchr(haystack, *needle, count),
	       haystack)	// *haystack is first character of needle; compare
	{			//  last character of needle first, then proceed
		if (haystack[length] == needle[length]
#if 0
		 && length == 1 || !memcmp(haystack + 1, needle + 1, length - 1))
#else
		 && !memcmp(haystack, needle, length))
#endif
			return (char *) haystack;
				// skip character in haystack,
				//  adjust number of characters left in haystack
		count -= ++haystack - string;

		if (!count)
			break;
	}

	return NULL;
#endif
}

char	*strrchr(unsigned char const *string, int character)
{
	char *address = NULL;

	do
		if (*string == (unsigned char) character)
			address = (char *) string;
	while (*string++);

	return address;
}

char	*strchr(unsigned char const *string, int character)
{
	do
		if (*string == (unsigned char) character)
			return (char *) string;
	while (*string++);

	return NULL;
}

char	*strcat(unsigned char *destination, unsigned char const *source)
{
	char *string = (char *) destination;
#if 0
	destination += strlen(destination);
#else
	while (*destination)
		destination++;
#endif
	while (*source)
		*destination++ = *source++;

	*destination = '\0';	// terminate the concatenated string

	return string;
}

char	*strncat(unsigned char *destination, unsigned char const *source, size_t count)
{
	char *string = (char *) destination;
#if 0
	destination += strlen(destination);
#else
	while (*destination)
		destination++;
#endif
	while (count && *source)
		*destination++ = *source++, --count;

	*destination = '\0';

	return string;
}

char	*strcpy(unsigned char *destination, unsigned char const *source)
{
	char *string = (char *) destination;

	while (*source)
		*destination++ = *source++;

	*destination = '\0';	// terminate the copied string

	return string;
}

char	*strncpy(unsigned char *destination, unsigned char const *source, size_t count)
{
	char *string = (char *) destination;

	while (count && *source)
		*destination++ = *source++, --count;

	while (count)
		*destination++ = '\0', --count;

	return string;
}

int	strcmp(unsigned char const *source, unsigned char const *destination)
{
	if (source != destination)
		do
			if (*source - *destination)
#if 0
				return *source - *destination;
#else
				return (*source > *destination) - (*source < *destination);
#endif
		while (destination++, *source++);

	return 0;
}

int	strncmp(unsigned char const *source, unsigned char const *destination, size_t count)
{
	if (count && source != destination)
		do
			if (*source - *destination)
#if 0
				return *source - *destination;
#else
				return (*source > *destination) - (*source < *destination);
#endif
		while (source++, *destination++ && --count);

	return 0;
}

size_t	strlen(unsigned char const *string)
{
#if 0
	unsigned char *source = string;

	while (*source)
		source++;

	return source - string;
#else
	return (unsigned char *) memchr(string, '\0', ~(size_t) 0) - string;
#endif
}

size_t	strnlen(unsigned char const *string, size_t count)
{
	unsigned char *nul = memchr(string, '\0', count);

	return	nul ? nul - string : count;
}

__declspec(safebuffers)
size_t	strspn(unsigned char const *string, unsigned char const *delimiter)
{
	// yield number of leading characters in array 'string'
	//  which are equal to any character in array 'delimiter'

	size_t bitmap[256 / (8 * sizeof(size_t))] = {0};

	if (!*delimiter)
		return 0;

	if (!*string)
		return 0;
	do			// build bitmap
		bitmap[*delimiter / (8 * sizeof(size_t))] |= (size_t) 1 << *delimiter % (8 * sizeof(size_t));
	while (*++delimiter);

	delimiter = string;

	while (bitmap[*string / (8 * sizeof(size_t))] & ((size_t) 1 << *string % (8 * sizeof(size_t))))
		string++;

	return string - delimiter;
}

__declspec(safebuffers)
size_t	strcspn(unsigned char const *string, unsigned char const *delimiter)
{
	// yield number of leading characters in array 'string'
	//  which differ from each character in array 'delimiter'

	size_t bitmap[256 / (8 * sizeof(size_t))] = {1};

	if (!*delimiter)
		return strlen(string);

	if (!*string)
		return 0;
	do			// build bitmap
		bitmap[*delimiter / (8 * sizeof(size_t))] |= (size_t) 1 << *delimiter % (8 * sizeof(size_t));
	while (*++delimiter);

	delimiter = string;

	while (!(bitmap[*string / (8 * sizeof(size_t))] & ((size_t) 1 << *string % (8 * sizeof(size_t)))))
		string++;

	return string - delimiter;
}

__declspec(safebuffers)
char	*strpbrk(unsigned char const *string, unsigned char const *delimiter)
{
	// yield pointer to first character in array 'string'
	//  which is equal to any character in array 'delimiter'
#if 0
	string += strcspn(string, delimiter);

	return *string ? (char *) string : NULL;
#else
	size_t bitmap[256 / (8 * sizeof(size_t))] = {0};

	if (!*delimiter)
		return NULL;

	if (!*string)
		return NULL;
	do			// build bitmap
		bitmap[*delimiter / (8 * sizeof(size_t))] |= (size_t) 1 << *delimiter % (8 * sizeof(size_t));
	while (*++delimiter);

	do
		if (bitmap[*string / (8 * sizeof(size_t))] & ((size_t) 1 << *string % (8 * sizeof(size_t))))
			return (char *) string;
	while (*++string);

	return NULL;
#endif
}

char	*strset(char *string, int character)
{
	char *destination = string;

	while (*destination)
		*destination++ = (char) character;

	return string;
}

__declspec(safebuffers)
char	*strtok_r(unsigned char *string, unsigned char const *delimiter, char **next)
{
#if 0
	if (!string)
		string = (unsigned char *) *next;

	if (!string || !*string)
		return *next = NULL;
				// skip leading delimiters
	string += strspn(string, delimiter);

	if (!*string)		// no characters left?
		return *next = NULL;
				// skip token, i.e. non-delimiters,
				//  and save its address
	*next = (char *) string + strcspn(string, delimiter);

	if (!**next)		// no characters left?
		*next = NULL;
	else			// terminate token
		*(*next)++ = '\0';

	return (char *) string;
#else
	size_t bitmap[256 / (8 * sizeof(size_t))] = {0};

	if (!string)
		string = (unsigned char *) *next;

	if (!string || !*string)
		return *next = NULL;

	if (!*delimiter)
		return *next = NULL, (char *) string;

	do			// build bitmap
		bitmap[*delimiter / (8 * sizeof(size_t))] |= (size_t) 1 << *delimiter % (8 * sizeof(size_t));
	while (*++delimiter);
				// skip leading delimiters
	while (bitmap[*string / (8 * sizeof(size_t))] & ((size_t) 1 << *string % (8 * sizeof(size_t))))
		string++;

	if (!*string)		// no characters left?
		return *next = NULL;

	delimiter = string;	// save (address of) token

	*bitmap |= 1;		// add '\0' as delimiter

	do			// skip token, i.e. non-delimiters
		string++;
	while (!(bitmap[*string / (8 * sizeof(size_t))] & ((size_t) 1 << *string % (8 * sizeof(size_t)))));

	if (!*string)		// no characters left?
		string = NULL;
	else			// terminate token
		*string++ = '\0';

	*next = (char *) string;// save (address of) next character

	return (char *) delimiter;
#endif
}
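
The bitmap used by strspn(), strcspn(), strpbrk() and strtok_r() above replaces the nested loop over both arrays with one pass over each. A minimal standalone sketch of this technique, assuming 8-bit characters; the charset type and the functions charset_build(), charset_has() and span_of() are hypothetical names, not part of the listing above.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* 256-bit set of character codes, one bit per unsigned char value
   (assumes CHAR_BIT == 8, as on the platforms discussed here). */
typedef struct { unsigned char bits[256 / CHAR_BIT]; } charset;

static void charset_build(charset *set, unsigned char const *members)
{
	size_t index;

	for (index = 0; index < sizeof(set->bits); index++)
		set->bits[index] = 0;

	for (; *members; members++)
		set->bits[*members / CHAR_BIT] |= (unsigned char) (1 << *members % CHAR_BIT);
}

static int charset_has(charset const *set, unsigned char character)
{
	return set->bits[character / CHAR_BIT] >> character % CHAR_BIT & 1;
}

/* In the spirit of strspn(): number of leading characters
   of 'string' which are members of 'accept'. */
size_t span_of(char const *string, char const *accept)
{
	unsigned char const *cursor = (unsigned char const *) string;
	charset set;

	charset_build(&set, (unsigned char const *) accept);

	while (charset_has(&set, *cursor))	/* '\0' is never a member */
		cursor++;

	return (size_t) (cursor - (unsigned char const *) string);
}

/* Usage sketch (hypothetical helper). */
void span_of_demo(void)
{
	assert(span_of("  \t indented", " \t") == 4);
	assert(span_of("aaa", "a") == 3);
	assert(span_of("abc", "") == 0);
}
```

Building the set costs one pass over the delimiters, each membership test is a shift and a mask, so the total work is proportional to the sum of both lengths instead of their product.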

Implementation in i386 Assembler

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

; NOTE: characters are unsigned!

	.386
	.model	flat, C
	.code

strcat	proc	public		; char *strcat(char *destination, char const *source)

	mov	edx, edi
	mov	edi, [esp+8]	; edi = address of source string
	xor	eax, eax	; eax = '\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasb
	not	ecx		; ecx = length of source string (including '\0')
	push	ecx
	mov	edi, [esp+8]	; edi = address of destination string
;;	xor	eax, eax	; eax = '\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasb
	dec	edi		; edi = address of '\0'
				;     = end of destination string
	mov	eax, esi
	mov	esi, [esp+12]	; esi = address of source string
	pop	ecx		; ecx = length of source string (including '\0')
	rep	movsb
	mov	edi, edx
	mov	esi, eax
	mov	eax, [esp+4]	; eax = address of destination string
	ret

strcat	endp

strchr	proc	public		; char *strchr(char const *string, int character)

	mov	edx, edi
	mov	edi, [esp+4]	; edi = address of string
	xor	eax, eax	; eax = '\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasb
	not	ecx		; ecx = length of string (including '\0')
	sub	edi, ecx	; edi = address of string
	mov	eax, [esp+8]	; eax = character
	repne	scasb		; ZF = (character found)
	lea	eax, [edi-1]	; eax = address of character
	mov	edi, edx
	je	short found	; character found (also when it is '\0')?

	xor	eax, eax	; eax = 0
found:
	ret

strchr	endp

strcmp	proc	public		; int strcmp(char const *source, char const *destination)

	push	esi
	push	edi
	xor	eax, eax	; eax = 0
	mov	esi, [esp+12]	; esi = address of source string
	mov	edi, [esp+16]	; edi = address of destination string
	cmp	edi, esi
	je	short equal	; address of destination string = address of source string?

;;	xor	eax, eax	; eax = '\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasb
	not	ecx		; ecx = length of destination string (including '\0')
	sub	edi, ecx	; edi = address of destination string
;;	xor	eax, eax	; eax = 0
	repe	cmpsb
	seta	al		; eax = (*source > *destination)
	sbb	eax, 0		; eax = (*source > *destination)
				;     - (*source < *destination)
				;     = {1, 0, -1}
equal:
	pop	edi
	pop	esi
	ret

strcmp	endp

; NOTE: strcoll() is another implementation of strcmp()!

strcoll	proc	public		; int strcoll(char const *source, char const *destination)

	mov	ecx, [esp+4]	; ecx = address of source string
	mov	edx, [esp+8]	; edx = address of destination string
	sub	edx, ecx
	jz	short equal	; address of destination string = address of source string?
compare:
	mov	al, [ecx]
	cmp	al, [ecx+edx]
	jne	short different

	inc	ecx
	test	al, al
	jnz	short compare	; *source <> '\0'?
equal:
	xor	eax, eax	; eax = 0
	ret
different:
	sbb	eax, eax	; eax = (*source < *destination) ? -1 : 0
	or	eax, 1		; eax = (*source > *destination)
				;     - (*source < *destination)
				;     = {1, -1}
	ret

strcoll	endp

strcpy	proc	public		; char *strcpy(char *destination, char const *source)

	mov	edx, edi
	mov	edi, [esp+8]	; edi = address of source string
	xor	eax, eax	; eax = '\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasb
	not	ecx		; ecx = length of source string (including '\0')
	sub	edi, ecx	; edi = address of source string
	mov	eax, esi
	mov	esi, edi	; esi = address of source string
	mov	edi, [esp+4]	; edi = address of destination string
	rep	movsb
	mov	edi, edx
	mov	esi, eax
	mov	eax, [esp+4]	; eax = address of destination string
	ret

strcpy	endp

strcspn	proc	public		; size_t strcspn(char const *string, char const *delimiter)

	mov	eax, [esp+4]	; eax = address of string
	mov	edx, [esp+8]	; edx = address of delimiter
	xor	ecx, ecx	; ecx = 0
	cmp	cl, [edx]
	je	short empty	; delimiter[0] = '\0'?

	push	ecx
	push	ecx
	push	ecx
	push	ecx
	push	ecx
	push	ecx
	push	ecx
	push	ecx		; bitmap[0..255] = 0,
				; esp = address of bitmap
setup:
	bts	[esp], ecx	; bitmap[ecx] = 1
	mov	cl, [edx]	; ecx = *delimiter
	inc	edx		; edx = ++delimiter
	cmp	cl, ch
	jne	short setup	; ecx <> '\0'?

	mov	edx, eax	; edx = address of string
skip:
	mov	cl, [eax]	; ecx = *string
	inc	eax		; eax = ++string
	bt	[esp], ecx
	jnc	short skip	; bitmap[ecx] = 0 (no match)?
stop:
	sbb	eax, edx	; eax = number of non-matching characters
	add	esp, 32
	ret
empty:
	mov	edx, eax	; edx = address of string
count:
	inc	eax		; eax = ++string
	cmp	cl, [eax-1]
	jne	short count
if 1
	dec	eax		; eax = --string
	sub	eax, edx
else
	stc
	sbb	eax, edx	; eax = number of characters
endif
	ret

strcspn	endp

strlen	proc	public		; size_t strlen(char const *string)

	mov	edx, edi
	mov	edi, [esp+4]	; edi = address of string
	xor	eax, eax	; eax = '\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasb		; ecx = -1 - (address of '\0' + 1 - address of string)
				;     = -1 - (length of string + 1)
				;     = -2 - length of string
if 0
	mov	eax, -2
	sub	eax, ecx	; eax = -2 + 2 + length of string
				;     = length of string
else
	mov	eax, ecx	; eax = -1 - (length of string + 1)
	not	eax		; eax = length of string + 1
	dec	eax		; eax = length of string
endif
	mov	edi, edx
	ret

strlen	endp

strncat	proc	public		; char *strncat(char *destination, char const *source, size_t count)

	push	esi
	push	edi
	mov	esi, [esp+16]	; esi = address of source string
	mov	edx, [esp+20]	; edx = count
	mov	edi, esi	; edi = address of source string
	mov	ecx, edx	; ecx = count
	xor	eax, eax	; eax = '\0'
	repne	scasb
	sub	edx, ecx	; edx = length of source string (including '\0')
	mov	edi, [esp+12]	; edi = address of destination string
;;	xor	eax, eax	; eax = '\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasb
	dec	edi		; edi = address of '\0'
				;     = end of destination string
	mov	ecx, edx	; ecx = length of source string (including '\0')
	rep	movsb
;;	xor	eax, eax	; eax = '\0'
	stosb
	mov	eax, [esp+12]	; eax = address of destination string
	pop	edi
	pop	esi
	ret

strncat	endp

strncmp	proc	public		; int strncmp(char const *source, char const *destination, size_t count)

	push	esi
	push	edi
	xor	eax, eax	; eax = 0
	mov	esi, [esp+12]	; esi = address of source string
	mov	edi, [esp+16]	; edi = address of destination string
	cmp	edi, esi
	je	short equal	; address of destination string = address of source string?

	mov	ecx, [esp+20]	; ecx = count
	test	ecx, ecx
	jz	short equal	; count = 0?

;;	xor	eax, eax	; eax = 0,
;;				; CF = 0,
;;				; ZF = 1 (required when count is 0)
	repe	cmpsb
	seta	al		; eax = (*source > *destination)
	sbb	eax, 0		; eax = (*source > *destination)
				;     - (*source < *destination)
				;     = {1, 0, -1}
equal:
	pop	edi
	pop	esi
	ret

strncmp	endp

strncpy	proc	public		; char *strncpy(char *destination, char const *source, size_t count)

	push	esi
	push	edi
	mov	esi, [esp+16]	; esi = address of source string
	mov	edx, [esp+20]	; edx = count
	mov	edi, esi	; edi = address of source string
	mov	ecx, edx	; ecx = count
	xor	eax, eax	; eax = '\0'
	repne	scasb
	sub	ecx, edx
	neg	ecx		; ecx = length of source string (including '\0')
	sub	edx, ecx	; edx = count - length of source string (including '\0')
	mov	edi, [esp+12]	; edi = address of destination string
	rep	movsb
	mov	ecx, edx	; ecx = count - length of source string (including '\0')
;;	xor	eax, eax	; eax = '\0'
	rep	stosb
	mov	eax, [esp+12]	; eax = address of destination string
	pop	edi
	pop	esi
	ret

strncpy	endp

strnlen	proc	public		; size_t strnlen(char const *string, size_t count)

	mov	ecx, [esp+8]	; ecx = count
	test	ecx, ecx
	jz	short empty	; count = 0?

	xor	eax, eax	; eax = '\0'
	mov	edx, edi
	mov	edi, [esp+4]	; edi = address of string
;;	test	edi, edi	; ZF = 0 (required when count is 0)
	repne	scasb		; ecx = count - number of characters scanned,
				; ZF = ('\0' found)
	sete	al		; eax = ('\0' found) ? 1 : 0
	add	ecx, eax	; ecx = ('\0' found)
				;     ? count - length of string : 0
	neg	ecx
	add	ecx, [esp+8]	; ecx = ('\0' found)
				;     ? length of string : count
	mov	edi, edx
empty:
	mov	eax, ecx	; eax = (length of string < count)
				;     ? length of string : count
	ret

strnlen	endp

strnset	proc	public		; char *strnset(char *string, int character, size_t count)

	mov	edx, [esp+4]	; edx = address of string
	mov	ecx, [esp+12]	; ecx = count
	test	ecx, ecx
	jz	short zero	; count = 0?

	xor	eax, eax	; eax = '\0'
	push	edi
	mov	edi, edx	; edi = address of string
;;	test	edi, edi	; ZF = 0 (required when count is 0)
	repne	scasb		; ecx = count - number of characters scanned,
				; ZF = ('\0' found)
	mov	edi, edx	; edi = address of string
	sete	al		; eax = ('\0' found) ? 1 : 0
	add	ecx, eax	; ecx = ('\0' found)
				;     ? count - length of string : 0
	neg	ecx
	add	ecx, [esp+16]	; ecx = ('\0' found)
				;     ? length of string : count
	mov	eax, [esp+12]	; eax = character
	rep	stosb
	pop	edi
zero:
	mov	eax, edx	; eax = address of string
	ret

strnset	endp

strpbrk	proc	public		; char *strpbrk(char const *string, char const *delimiter)

	mov	eax, [esp+4]	; eax = address of string
	mov	edx, [esp+8]	; edx = address of delimiter
	xor	ecx, ecx	; ecx = 0
	cmp	cl, [edx]
	je	short empty	; delimiter[0] = '\0'?

	push	ecx
	push	ecx
	push	ecx
	push	ecx
	push	ecx
	push	ecx
	push	ecx
	push	ecx		; bitmap[0..255] = 0,
				; esp = address of bitmap
setup:
	bts	[esp], ecx	; bitmap[ecx] = 1
	mov	cl, [edx]	; ecx = *delimiter
	inc	edx		; edx = ++delimiter
	cmp	cl, ch
	jne	short setup	; ecx <> '\0'?
skip:
	mov	cl, [eax]	; ecx = *string
	inc	eax		; eax = ++string
	bt	[esp], ecx
	jnc	short skip	; bitmap[ecx] = 0 (no match)?
stop:
	dec	eax		; eax = --string
	neg	ecx
	sbb	ecx, ecx	; ecx = (*string = '\0') ? 0 : -1
	and	eax, ecx	; eax = (*string = '\0') ? 0 : address of string
	add	esp, 32
	ret
empty:
	xor	eax, eax
	ret

strpbrk	endp

strrchr	proc	public		; char *strrchr(char const *string, int character)

	mov	edx, edi
	mov	edi, [esp+4]	; edi = address of string
	xor	eax, eax	; eax = '\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasb
	not	ecx		; ecx = length of string (including '\0')
	dec	edi		; edi = address of '\0'
				;     = end of string
	mov	eax, [esp+8]	; eax = character
	std
	repne	scasb		; ZF = (character found)
	cld
	lea	eax, [edi+1]	; eax = address of character
	mov	edi, edx
	je	short found	; character found (also at start of string)?

	xor	eax, eax	; eax = 0
found:
	ret

strrchr	endp

strrev	proc	public		; char *strrev(char *string)

	mov	edx, edi
	mov	edi, [esp+4]	; edi = address of string
	xor	eax, eax	; eax = '\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasb
	add	ecx, edi	; ecx = address of string - 1
	dec	edi		; edi = address of '\0'
				;     = end of string
	jmp	short continue
reverse:
	mov	al, [ecx]
	mov	ah, [edi]
	mov	[ecx], ah
	mov	[edi], al
continue:
	inc	ecx
	dec	edi
	cmp	edi, ecx
	ja	short reverse

	mov	edi, edx
	mov	eax, [esp+4]	; eax = address of string
	ret

strrev	endp

strset	proc	public		; char *strset(char *string, int character)

	mov	edx, edi
	mov	edi, [esp+4]	; edi = address of string
	xor	eax, eax	; eax = '\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasb
	not	ecx		; ecx = length of string (including '\0')
	sub	edi, ecx	; edi = address of string
	dec	ecx		; ecx = length of string
	mov	eax, [esp+8]	; eax = character
	rep	stosb
	mov	edi, edx
	mov	eax, [esp+4]	; eax = address of string
	ret

strset	endp

strspn	proc	public		; size_t strspn(char const *string, char const *delimiter)

	mov	eax, [esp+4]	; eax = address of string
	mov	edx, [esp+8]	; edx = address of delimiter
	xor	ecx, ecx	; ecx = 0
	cmp	cl, [edx]
	je	short empty	; delimiter[0] = '\0'?

	push	ecx
	push	ecx
	push	ecx
	push	ecx
	push	ecx
	push	ecx
	push	ecx
	push	ecx		; bitmap[0..255] = 0,
				; esp = address of bitmap
	mov	cl, [edx]	; ecx = *delimiter
	inc	edx		; edx = ++delimiter
setup:
	bts	[esp], ecx	; bitmap[ecx] = 1
	mov	cl, [edx]	; ecx = *delimiter
	inc	edx		; edx = ++delimiter
	cmp	cl, ch
	jne	short setup	; ecx <> '\0'?

	mov	edx, eax	; edx = address of string
skip:
	mov	cl, [eax]	; ecx = *string
	inc	eax		; eax = ++string
	bt	[esp], ecx
	jc	short skip	; bitmap[ecx] = 1 (match)?
if 1
	dec	eax		; eax = --string
	sub	eax, edx
else
	stc
	sbb	eax, edx	; eax = number of matching characters
endif
	add	esp, 32
	ret
empty:
	xor	eax, eax
	ret

strspn	endp

strstr	proc	public		; char *strstr(char const *haystack, char const *needle)

	push	edi
	mov	edi, [esp+12]	; edi = address of needle
	xor	eax, eax	; eax = '\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasb
	not	ecx		; ecx = length of needle (including '\0')
	dec	ecx		; ecx = length of needle
	mov	eax, [esp+8]	; eax = address of haystack
	jz	short empty	; length of needle = 0?

	mov	edx, ecx	; edx = length of needle
ifdef SIMPLE
	push	esi
compare:
	mov	esi, eax	; esi = current address in haystack
	mov	edi, [esp+16]	; edi = address of needle
	mov	ecx, edx	; ecx = length of needle
	repe	cmpsb
	je	short match	; needle in haystack?

	inc	eax		; eax = next address in haystack
	cmp	byte ptr [esi-1], 0
	jne	short compare	; non-matching character in haystack <> '\0'?

	xor	eax, eax
match:
else ; SIMPLE
	mov	edi, eax	; edi = address of haystack
	xor	eax, eax	; eax = '\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasb
	not	ecx		; ecx = length of haystack (including '\0')
	sub	edi, ecx	; edi = address of haystack
	dec	ecx		; ecx = length of haystack
	jz	short empty	; length of haystack = 0?

	cmp	ecx, edx
	jb	short empty	; length of haystack < length of needle?

	push	esi
	push	ebx
search:
	mov	esi, [esp+20]	; esi = address of needle
	mov	al, [esi]	; al = first character of needle
				; edi = current address in haystack
	repne	scasb		; edi = next address in haystack,
				; ecx = next length of haystack
	jne	short break	; (first character of) needle not in haystack?

	mov	al, [esi+edx-1]	; al = last character of needle
	cmp	al, [edi+edx-2]
	jne	short continue	; last character of needle not in haystack?
compare:
	mov	eax, edi	; eax = next address in haystack
	mov	ebx, ecx	; ebx = next length of haystack
if 0
	dec	edi		; edi = current address in haystack
				;     = address of matching character
				; esi = address of needle
	mov	ecx, edx	; ecx = length of needle
else
				; edi = next address in haystack
	inc	esi		; esi = address of needle + 1
	mov	ecx, edx
	dec	ecx		; ecx = length of needle - 1,
				; ZF = (ecx = 0)
;;	jz	short match	; needle in haystack?
endif
	repe	cmpsb
	je	short match	; needle in haystack?

	mov	edi, eax	; edi = current address in haystack
	mov	ecx, ebx	; ecx = current length of haystack
continue:
	cmp	ecx, edx
	jae	short search	; length of haystack >= length of needle?
break:
	xor	eax, eax
	pop	ebx
	pop	esi
	pop	edi
	ret
match:
	dec	eax		; eax = address of needle in haystack
	pop	ebx
endif ; SIMPLE
	pop	esi
empty:
	pop	edi
	ret

strstr	endp

strtok_r proc	public		; char *strtok_r(char *string, char const *delimiter, char **next)

	mov	ecx, [esp+4]	; ecx = address of string
	mov	eax, [esp+8]	; eax = address of delimiter
	mov	edx, [esp+12]	; edx = address of address of next
	test	ecx, ecx
	jnz	short start	; address of string <> 0?

	or	ecx, [edx]	; ecx = address of next
	jz	short null	; address of next = 0 = address of string?
start:
	cmp	byte ptr [ecx], 0
	je	short null	; string[0] = '\0'?

	cmp	byte ptr [eax], 0
	je	short empty	; delimiter[0] = '\0'?

	push	ebx
	xor	ebx, ebx	; ebx = 0
	push	ebx
	push	ebx
	push	ebx
	push	ebx
	push	ebx
	push	ebx
	push	ebx
	push	ebx		; bitmap[0..255] = 0,
				; esp = address of bitmap
	mov	bl, [eax]	; ebx = *delimiter
	inc	eax		; eax = ++delimiter
setup:
	bts	[esp], ebx	; bitmap[ebx] = 1
	mov	bl, [eax]	; ebx = *delimiter
	inc	eax		; eax = ++delimiter
	cmp	bl, bh
	jne	short setup	; ebx <> '\0'?
skip:
	mov	bl, [ecx]	; ebx = *string
	inc	ecx		; ecx = ++string
	bt	[esp], ebx
	jc	short skip	; bitmap[ebx] = 1 (ebx is a delimiter)?

	cmp	bl, bh
	je	short none	; ebx = '\0'?

	mov	bl, bh		; ebx = 0
	bts	[esp], ebx	; bitmap['\0'] = 1
	mov	eax, ecx
	dec	eax		; eax = address of token
token:
	mov	bl, [ecx]	; ebx = *string
	inc	ecx		; ecx = ++string
	bt	[esp], ebx
	jnc	short token	; bitmap[ebx] = 0 (ebx is not a delimiter)?

	cmp	bl, bh
	je	short last	; ebx = '\0'?

	mov	[ecx-1], bh	; string[-1] = '\0' (terminate token)
	mov	[edx], ecx	; *next = address of string
	add	esp, 32
	pop	ebx
	ret
none:
	mov	eax, ebx	; eax = 0
last:
	mov	[edx], ebx	; *next = 0
	add	esp, 32
	pop	ebx
	ret
null:
	xor	eax, eax	; eax = 0
	mov	[edx], eax	; *next = 0
	ret
empty:
	mov	eax, ecx	; eax = address of string
	xor	ecx, ecx
	mov	[edx], ecx	; *next = 0
	ret

strtok_r endp
	end

Save the i386 assembler source presented above as i386-str.asm in the directory where you created the object library i386.lib before, then execute the following 3 command lines to generate the object file i386-str.obj and add it to the existing object library i386.lib:

SET ML=/c /Gy /safeseh /W3 /X
ML.EXE i386-str.asm
LINK.EXE /LIB /OUT:i386.lib i386.lib i386-str.obj

For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: the command lines can be copied and pasted as a block into a Command Processor window!

Microsoft (R) Macro Assembler Version 14.16.27023.1
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: i386-str.asm

Microsoft (R) Library Manager Version 14.16.27049.0
Copyright (C) Microsoft Corporation.  All rights reserved.
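
The bitmap-based routines above, such as strspn() and strtok_r(), share one technique: they build a 256-bit bitmap on the stack with BTS, one bit per possible character value, then test each character of the string against it with BT. The following C sketch models that technique only; it is not a transliteration of the assembler source, and the names bitmap_fill(), bitmap_test(), span_in() and span_not_in() are hypothetical and not part of the object library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers that model the BTS/BT bitmap technique. */

/* Build a 256-bit bitmap: bit c is set when byte c occurs in set. */
void bitmap_fill(uint32_t bitmap[8], char const *set)
{
    unsigned char const *p = (unsigned char const *) set;
    int i;

    for (i = 0; i < 8; i++)
        bitmap[i] = 0;                 /* bitmap[0..255] = 0 */
    for (; *p != '\0'; p++)            /* like BTS: set bit number *p */
        bitmap[*p >> 5] |= UINT32_C(1) << (*p & 31);
}

/* Like BT: return bit number c of the bitmap. */
int bitmap_test(uint32_t const bitmap[8], unsigned char c)
{
    return (int) ((bitmap[c >> 5] >> (c & 31)) & 1);
}

/* Model of strspn(): count the leading bytes of string that are in set. */
size_t span_in(char const *string, char const *set)
{
    uint32_t bitmap[8];
    unsigned char const *p = (unsigned char const *) string;

    assert(string != NULL && set != NULL);
    bitmap_fill(bitmap, set);
    while (bitmap_test(bitmap, *p))    /* '\0' is never set, so the scan stops */
        p++;
    return (size_t) (p - (unsigned char const *) string);
}

/* Model of strcspn(): count the leading bytes of string that are not in set. */
size_t span_not_in(char const *string, char const *set)
{
    uint32_t bitmap[8];
    unsigned char const *p = (unsigned char const *) string;

    assert(string != NULL && set != NULL);
    bitmap_fill(bitmap, set);
    bitmap[0] |= 1;                    /* mark '\0' so the scan stops at the end */
    while (!bitmap_test(bitmap, *p))
        p++;
    return (size_t) (p - (unsigned char const *) string);
}
```

Under these assumptions, span_in("aabbc", "ab") yields 4 and span_not_in("hello, world", ", ") yields 5; the bitmap costs 32 bytes of stack, but makes the per-character test independent of the number of delimiters.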

Implementation in AMD64 Assembler

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

; NOTE: characters are unsigned!

	.code

strcat	proc	public		; char *strcat(char *destination, char const *source)

	mov	r9, rcx		; r9 = address of destination string
	mov	r10, rdi
ifdef VARIANT
	mov	rdi, rcx	; rdi = address of destination string
	xor	eax, eax	; rax = '\0'
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasb
	dec	rdi		; rdi = address of '\0'
				;     = end of destination string
	mov	r11, rsi
	mov	rsi, rdi	; rsi = end of destination string
	mov	rdi, rdx	; rdi = address of source string
;;	xor	eax, eax
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasb
	not	rcx		; rcx = length of source string (including '\0')
	mov	rdi, rsi	; rdi = end of destination string
	mov	rsi, rdx	; rsi = address of source string
else ; VARIANT
	mov	rdi, rdx	; rdi = address of source string
;;	xor	eax, eax
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasb
	not	rcx		; rcx = length of source string (including '\0')
	mov	r11, rsi
	mov	rsi, rdx	; rsi = address of source string
	mov	rdx, rcx
	mov	rdi, r9		; rdi = address of destination string
;;	xor	eax, eax	; rax = '\0'
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasb
	dec	rdi		; rdi = address of '\0'
				;     = end of destination string
	mov	rcx, rdx	; rcx = length of source string (including '\0')
endif ; VARIANT
	rep	movsb
	mov	rax, r9		; rax = address of destination string
	mov	rdi, r10
	mov	rsi, r11
	ret

strcat	endp

strchr	proc	public		; char *strchr(char const *string, int character)

	mov	r10, rdi
	mov	rdi, rcx	; rdi = address of string
	xor	eax, eax	; rax = '\0'
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasb
	not	rcx		; rcx = length of string (including '\0')
	mov	rax, rdx	; rax = character
	sub	rdi, rcx	; rdi = address of string
	repne	scasb
	lea	rax, [rdi-1]	; rax = address of character
	cmovne	rax, rcx	; rax = (rcx = 0) ? 0 : address of character
	mov	rdi, r10
	ret

strchr	endp

strcmp	proc	public		; ssize_t strcmp(char const *source, char const *destination)

	xor	eax, eax	; rax = 0
	cmp	rcx, rdx
	je	short equal	; address of source string = address of destination string?

	mov	r11, rsi
	mov	rsi, rcx	; rsi = address of source string
	mov	r10, rdi
	mov	rdi, rdx	; rdi = address of destination string
;;	xor	eax, eax	; rax = '\0'
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasb
	not	rcx		; rcx = length of destination string (including '\0')
	mov	rdi, rdx	; rdi = address of destination string
;;	xor	eax, eax	; rax = 0
	repe	cmpsb
	seta	al		; rax = (*source > *destination)
	sbb	rax, 0		; rax = (*source > *destination)
				;     - (*source < *destination)
				;     = {1, 0, -1}
	mov	rdi, r10
	mov	rsi, r11
equal:
	ret

strcmp	endp

; NOTE: strcoll() is another implementation of strcmp()!

strcoll	proc	public		; ssize_t strcoll(char const *source, char const *destination)

	sub	rdx, rcx
	jz	short equal	; address of destination string = address of source string?
compare:
	mov	al, [rcx]
	cmp	al, [rcx+rdx]
	jne	short different

	inc	rcx
	test	al, al
	jnz	short compare	; *source <> '\0'?
equal:
	xor	eax, eax	; rax = 0
	ret

different:
	sbb	rax, rax	; rax = (*source < *destination) ? -1 : 0
	or	rax, 1		; rax = (*source > *destination)
				;     - (*source < *destination)
				;     = {1, 0, -1}
	ret

strcoll	endp

strcpy	proc	public		; char *strcpy(char *destination, char const *source)

	mov	r9, rcx		; r9 = address of destination string
	mov	r10, rdi
	mov	rdi, rdx	; rdi = address of source string
	xor	eax, eax	; rax = '\0'
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasb
	not	rcx		; rcx = length of source string (including '\0')
	mov	rdi, r9		; rdi = address of destination string
	mov	r11, rsi
	mov	rsi, rdx	; rsi = address of source string
	rep	movsb
	mov	rax, r9		; rax = address of destination string
	mov	rdi, r10
	mov	rsi, r11
	ret

strcpy	endp

strcspn	proc	public		; size_t strcspn(char const *string, char const *delimiter)

	xor	eax, eax	; rax = 0
	cmp	al, [rdx]
	je	short empty	; delimiter[0] = '\0'?

	mov	[rsp+32], rax
	mov	[rsp+24], rax
	mov	[rsp+16], rax
	mov	[rsp+8], rax	; bitmap[0..255] = 0
setup:
	bts	[rsp+8], rax	; bitmap[rax] = 1
	mov	al, [rdx]	; rax = *delimiter
	inc	rdx		; rdx = ++delimiter
	cmp	al, ah
	jne	short setup	; rax <> '\0'?

	mov	rdx, rcx	; rdx = address of string
skip:
	mov	al, [rcx]	; rax = *string
	inc	rcx		; rcx = ++string
	bt	[rsp+8], rax
	jnc	short skip	; bitmap[rax] = 0 (no match)?
stop:
	sbb	rcx, rdx	; rcx = number of non-matching characters
	mov	rax, rcx
	ret
empty:
	mov	rdx, rcx	; rdx = address of string
count:
	cmp	al, [rcx]
	lea	rcx, [rcx+1]	; rcx = ++string
	jne	short count	; *string <> '\0'?

	stc
	sbb	rcx, rdx	; rcx = number of characters
	mov	rax, rcx
	ret

strcspn	endp

strlen	proc	public		; size_t strlen(char const *string)

	mov	rdx, rdi
	mov	rdi, rcx	; rdi = address of string
	xor	eax, eax	; rax = '\0'
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasb
if 0
	not	rcx		; rcx = length of string (including '\0')
	dec	rcx
	mov	rax, rcx	; rax = length of string
else
	mov	rax, -2
	sub	rax, rcx	; rax = length of string
endif
	mov	rdi, rdx
	ret

strlen	endp

strncat	proc	private		; char *strncat(char *destination, char const *source, size_t count)

	ud2

strncat	endp

strncmp	proc	private		; int strncmp(char const *source, char const *destination, size_t count)

	ud2

strncmp	endp

strncpy	proc	private		; char *strncpy(char *destination, char const *source, size_t count)

	ud2

strncpy	endp

strnlen	proc	private		; size_t strnlen(char const *string, size_t count)

	ud2

strnlen	endp

strnset	proc	private		; char *strnset(char const *string, int character, size_t count)

	ud2

strnset	endp

strpbrk	proc	public		; char *strpbrk(char const *string, char const *delimiter)

	xor	eax, eax	; rax = 0
	cmp	al, [rdx]
	je	short empty	; delimiter[0] = '\0'?

	mov	[rsp+32], rax
	mov	[rsp+24], rax
	mov	[rsp+16], rax
	mov	[rsp+8], rax	; bitmap[0..255] = 0
setup:
	bts	[rsp+8], rax	; bitmap[rax] = 1
	mov	al, [rdx]	; rax = *delimiter
	inc	rdx		; rdx = ++delimiter
	cmp	al, ah
	jne	short setup	; rax <> '\0'?
skip:
	mov	al, [rcx]	; rax = *string
	inc	rcx		; rcx = ++string
	bt	[rsp+8], rax
	jnc	short skip	; bitmap[rax] = 0 (no match)?
stop:
	dec	rcx		; rcx = --string
	neg	eax
	sbb	rax, rax	; rax = (*string = '\0') ? 0 : -1
	and	rax, rcx	; rax = (*string = '\0') ? 0 : address of string
empty:
	ret

strpbrk	endp

strrchr	proc	public		; char *strrchr(char const *string, int character)

	mov	r10, rdi
	mov	rdi, rcx	; rdi = address of string
	xor	eax, eax	; rax = '\0'
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasb
	not	rcx		; rcx = length of string (including '\0')
	mov	rax, rdx	; rax = character
	dec	rdi		; rdi = address of '\0'
				;     = end of string
	std
	repne	scasb
	cld
	lea	rax, [rdi+1]	; rax = address of character
	cmovne	rax, rcx	; rax = (rcx = 0) ? 0 : address of character
	mov	rdi, r10
	ret

strrchr	endp

strrev	proc	private		; char *strrev(char *string)

	ud2

strrev	endp

strset	proc	public		; char *strset(char *string, int character)

	mov	r9, rcx		; r9 = address of string
	mov	r10, rdi
	mov	rdi, rcx	; rdi = address of string
	xor	eax, eax	; rax = '\0'
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasb
	not	rcx		; rcx = length of string (including '\0')
	dec	rcx
	mov	rdi, r9		; rdi = address of string
	mov	rax, rdx	; rax = character
	rep	stosb
	mov	rax, r9		; rax = address of string
	mov	rdi, r10
	ret

strset	endp

strspn	proc	public		; size_t strspn(char const *string, char const *delimiter)

	xor	eax, eax	; rax = 0
	cmp	al, [rdx]
	je	short empty	; delimiter[0] = '\0'?

	mov	[rsp+32], rax
	mov	[rsp+24], rax
	mov	[rsp+16], rax
	mov	[rsp+8], rax	; bitmap[0..255] = 0

	mov	al, [rdx]	; rax = *delimiter
	inc	rdx		; rdx = ++delimiter
setup:
	bts	[rsp+8], rax	; bitmap[rax] = 1
	mov	al, [rdx]	; rax = *delimiter
	inc	rdx		; rdx = ++delimiter
	cmp	al, ah
	jne	short setup	; rax <> '\0'?

	mov	rdx, rcx	; rdx = address of string
skip:
	mov	al, [rcx]	; rax = *string
	inc	rcx		; rcx = ++string
	bt	[rsp+8], rax
	jc	short skip	; bitmap[rax] = 1 (match)?
if 0
	dec	rcx		; rcx = --string
	sub	rcx, rdx
else
	stc
	sbb	rcx, rdx	; rcx = number of matching characters
endif
	mov	rax, rcx
empty:
	ret

strspn	endp

strstr	proc	public		; char *strstr(char const *haystack, char const *needle)

	mov	r8, rcx		; r8 = address of haystack
	mov	r10, rdi
	mov	rdi, rdx	; rdi = address of needle
	xor	eax, eax	; rax = '\0'
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasb
	not	rcx		; rcx = length of needle (including '\0')
	dec	rcx		; rcx = length of needle
	mov	rax, r8		; rax = address of haystack
	jz	short empty	; length of needle = 0?

	mov	r9, rcx		; r9 = length of needle
	mov	rdi, r8		; rdi = address of haystack
	xor	eax, eax	; rax = '\0'
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasb
	not	rcx		; rcx = length of haystack (including '\0')
	sub	rdi, rcx	; rdi = address of haystack
	dec	rcx		; rcx = length of haystack
	jz	short empty	; length of haystack = 0?

	cmp	rcx, r9
	jb	short empty	; length of haystack < length of needle?

	mov	r11, rsi
search:
	mov	al, [rdx]	; al = first character of needle
				; rdi = current address in haystack
	repne	scasb		; rdi = next address in haystack,
				; rcx = current length of haystack
	jne	short break	; (first character of) needle not in haystack?

	dec	rcx		; rcx = next length of haystack

	mov	al, [rdx+r9-1]	; al = last character of needle
	cmp	al, [rdi+r9-2]
	jne	short continue	; last character of needle not in haystack?
compare:
	mov	rax, rdi	; rax = next address in haystack
	mov	r8, rcx		; r8 = next length of haystack
if 0
	dec	rdi		; rdi = current address in haystack
				;     = address of matching character
	mov	rsi, rdx	; rsi = address of needle
	mov	rcx, r9		; rcx = length of needle
else
				; rdi = next address in haystack
	mov	rsi, rdx
	inc	rsi		; rsi = address of needle + 1
	mov	rcx, r9
	dec	rcx		; rcx = length of needle - 1,
				; ZF = (rcx = 0)
;;	jz	short match	; needle in haystack?
endif
	repe	cmpsb
	je	short match	; needle in haystack?

	mov	rdi, rax	; rdi = current address in haystack
	mov	rcx, r8		; rcx = current length of haystack
continue:
	cmp	rcx, r9
	jae	short search	; length of haystack >= length of needle?
break:
	xor	eax, eax
	mov	rsi, r11
empty:
	mov	rdi, r10	; restore rdi (clobbered before every jump to empty)
	ret
match:
	dec	rax		; rax = address of needle in haystack
	mov	rdi, r10
	mov	rsi, r11
	ret

strstr	endp

strtok_r proc	public		; char *strtok_r(char *string, char const *delimiter, char **next)

	xor	eax, eax	; rax = 0
	test	rcx, rcx
	jnz	short start	; string <> 0?

	or	rcx, [r8]	; rcx = next
	jz	short null	; address of next = 0 = address of string?
start:
	cmp	al, [rcx]
	je	short null	; string[0] = '\0'?

	cmp	al, [rdx]
	je	short empty	; *delimiter = '\0'?

	mov	[rsp+32], rax
	mov	[rsp+24], rax
	mov	[rsp+16], rax
	mov	[rsp+8], rax	; bitmap[0..255] = 0

	mov	al, [rdx]	; rax = *delimiter
	inc	rdx		; rdx = ++delimiter
setup:
	bts	[rsp+8], rax	; bitmap[rax] = 1
	mov	al, [rdx]	; rax = *delimiter
	inc	rdx		; rdx = ++delimiter
	cmp	al, ah
	jne	short setup	; rax <> '\0'?
skip:
	mov	al, [rcx]	; rax = *string
	inc	rcx		; rcx = ++string
	bt	[rsp+8], rax
	jc	short skip	; bitmap[rax] = 1 (rax is a delimiter)?

	cmp	al, ah
	je	short none	; rax = '\0'?

	mov	al, ah		; rax = 0
	bts	[rsp+8], rax	; bitmap['\0'] = 1
	lea	rdx, [rcx-1]	; rdx = address of token
token:
	mov	al, [rcx]	; rax = *string
	inc	rcx		; rcx = ++string
	bt	[rsp+8], rax
	jnc	short token	; bitmap[rax] = 0 (rax is not a delimiter)?

	cmp	al, ah
	je	short last	; rax = '\0'?

	mov	[rcx-1], ah	; string[-1] = '\0' (terminate token)
	mov	[r8], rcx	; *next = address of string
	mov	rax, rdx	; rax = address of token
	ret
last:
	mov	[r8], rax	; *next = 0
	mov	rax, rdx	; rax = address of token
	ret
empty:
	mov	[r8], rax	; *next = 0
	mov	rax, rcx	; rax = address of token
	ret
null:
none:
	mov	[r8], rax	; *next = 0
	ret

strtok_r endp
	end

Save the AMD64 assembler source presented above as amd64-str.asm in the directory where you created the object library amd64.lib before, then execute the following 3 command lines to generate the object file amd64-str.obj and add it to the existing object library amd64.lib:

SET ML=/c /Gy /W3 /X
ML64.EXE amd64-str.asm
LINK.EXE /LIB /OUT:amd64.lib amd64.lib amd64-str.obj

For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: the command lines can be copied and pasted as a block into a Command Processor window!

Microsoft (R) Macro Assembler (x64) Version 14.16.27023.1
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: amd64-str.asm

Microsoft (R) Library Manager Version 14.16.27049.0
Copyright (C) Microsoft Corporation.  All rights reserved.
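
The non-SIMPLE strstr() routines above search in three steps: scan the haystack for the first character of the needle, check whether the last character of the needle matches too, and only then compare the characters in between. The following C sketch models that search order under stated assumptions; it is not a transliteration of the assembler source, the name find_needle() is hypothetical, and the standard functions memchr() and memcmp() stand in for REPNE SCASB and REPE CMPSB.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of the three-step search; not part of the library. */
char *find_needle(char const *haystack, char const *needle)
{
    size_t n, h;

    assert(haystack != NULL && needle != NULL);
    n = strlen(needle);
    if (n == 0)
        return (char *) haystack;      /* an empty needle matches at the start */
    h = strlen(haystack);
    while (h >= n) {
        /* step 1: find the first character of the needle; only the first
           h - n + 1 positions can start a complete match */
        char const *p = memchr(haystack, needle[0], h - n + 1);

        if (p == NULL)
            break;
        h -= (size_t) (p - haystack);  /* characters left, counted from p */
        haystack = p;
        /* step 2: cheap filter on the last character of the needle,
           step 3: compare the remaining n - 1 characters */
        if (haystack[n - 1] == needle[n - 1]
         && memcmp(haystack + 1, needle + 1, n - 1) == 0)
            return (char *) haystack;
        haystack++;                    /* continue behind the candidate */
        h--;
    }
    return NULL;
}
```

The check of the last character rejects most false candidates with a single comparison before the longer compare is set up; for example, find_needle("aab", "ab") returns the address of the second character of the haystack.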

wcs*() Standard Functions

Proper implementations of the wcscat(), wcschr(), wcscmp(), wcscoll(), wcscpy(), wcslen(), wcsncat(), wcsncmp(), wcsncpy(), wcsnlen(), wcsnset(), wcsrchr(), wcsrev(), wcsset() and wcsstr() functions for the i386 and AMD64 platforms follow, together with build instructions.

Note: only wcscat(), wcscmp(), wcscpy() and wcslen() are available as intrinsic functions.

Implementation in i386 Assembler

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

; NOTE: counts and lengths are numbers of wide characters, not bytes!

	.386
	.model	flat, C
	.code

wcscat	proc	public		; wchar_t *wcscat(wchar_t *destination, wchar_t const *source)

	mov	edx, edi
	mov	edi, [esp+8]	; edi = address of source string
	xor	eax, eax	; eax = L'\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasw
	not	ecx		; ecx = length of source string (including L'\0')
	push	ecx
	mov	edi, [esp+8]	; edi = address of destination string
;;	xor	eax, eax	; eax = L'\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasw
	dec	edi
	dec	edi		; edi = address of L'\0'
				;     = end of destination string
	mov	eax, esi
	mov	esi, [esp+12]	; esi = address of source string
	pop	ecx		; ecx = length of source string (including L'\0')
	rep	movsw
	mov	edi, edx
	mov	esi, eax
	mov	eax, [esp+4]	; eax = address of destination string
	ret

wcscat	endp

wcschr	proc	public		; wchar_t *wcschr(wchar_t const *string, wchar_t character)

	mov	edx, edi
	mov	edi, [esp+4]	; edi = address of string
	xor	eax, eax	; eax = L'\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasw
	not	ecx		; ecx = length of string (including L'\0')
	sub	edi, ecx
	sub	edi, ecx	; edi = address of string
	mov	eax, [esp+8]	; eax = wide character
	repne	scasw
	dec	edi
	dec	edi		; edi = address of wide character
	neg	ecx		; CF = (ecx <> 0)
				;    = ([edi] = wide character)
	sbb	eax, eax	; eax = (ecx = 0) ? 0 : -1
	and	eax, edi	; eax = (ecx = 0) ? 0 : address of wide character
	mov	edi, edx
	ret

wcschr	endp

wcscmp	proc	public		; int wcscmp(wchar_t const *source, wchar_t const *destination)

	push	esi
	push	edi
	xor	eax, eax	; eax = 0
	mov	esi, [esp+12]	; esi = address of source string
	mov	edi, [esp+16]	; edi = address of destination string
	cmp	edi, esi
	je	short equal	; address of destination string = address of source string?

;;	xor	eax, eax	; eax = L'\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasw
	not	ecx		; ecx = length of destination string (including L'\0')
	sub	edi, ecx
	sub	edi, ecx	; edi = address of destination string
;;	xor	eax, eax	; eax = 0
	repe	cmpsw
	seta	al		; eax = (*source > *destination)
	sbb	eax, 0		; eax = (*source > *destination)
				;     - (*source < *destination)
				;     = {1, 0, -1}
equal:
	pop	edi
	pop	esi
	ret

wcscmp	endp

; NOTE: wcscoll() is another implementation of wcscmp()!

wcscoll	proc	public		; int wcscoll(wchar_t const *source, wchar_t const *destination)

	mov	ecx, [esp+4]	; ecx = address of source string
	mov	edx, [esp+8]	; edx = address of destination string
	sub	edx, ecx
	jz	short equal	; address of destination string = address of source string?
compare:
	mov	ax, [ecx]
	cmp	ax, [ecx+edx]
	jne	short different

	inc	ecx
	inc	ecx
	test	ax, ax
	jnz	short compare	; *source <> L'\0'?
equal:
	xor	eax, eax	; eax = 0
	ret
different:
	sbb	eax, eax	; eax = (*source < *destination) ? -1 : 0
	or	eax, 1		; eax = (*source > *destination)
				;     - (*source < *destination)
				;     = {1, 0, -1}
	ret

wcscoll	endp

wcscpy	proc	public		; wchar_t *wcscpy(wchar_t *destination, wchar_t const *source)

	mov	edx, edi
	mov	edi, [esp+8]	; edi = address of source string
	xor	eax, eax	; eax = L'\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasw
	not	ecx		; ecx = length of source string (including L'\0')
	mov	eax, esi
	mov	esi, [esp+8]	; esi = address of source string
	mov	edi, [esp+4]	; edi = address of destination string
	rep	movsw
	mov	edi, edx
	mov	esi, eax
	mov	eax, [esp+4]	; eax = address of destination string
	ret

wcscpy	endp

wcslen	proc	public		; size_t wcslen(wchar_t const *string)

	mov	edx, edi
	mov	edi, [esp+4]	; edi = address of string
	xor	eax, eax	; eax = L'\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasw		; ecx = -1 - (address of L'\0' + 2 - address of string)
				;     = -1 - (length of string + 1)
				;     = -2 - length of string
if 0
	mov	eax, -2
	sub	eax, ecx	; eax = -2 + 2 + length of string
				;     = length of string
else
	mov	eax, ecx	; eax = -1 - (length of string + 1)
	not	eax		; eax = length of string + 1
	dec	eax		; eax = length of string
endif
	mov	edi, edx
	ret

wcslen	endp

wcsncat	proc	public		; wchar_t *wcsncat(wchar_t *destination, wchar_t const *source, size_t count)

	push	esi
	push	edi
	mov	esi, [esp+16]	; esi = address of source string
	mov	edx, [esp+20]	; edx = count
	mov	edi, esi	; edi = address of source string
	mov	ecx, edx	; ecx = count
	xor	eax, eax	; eax = L'\0'
	repne	scasw
	sub	edx, ecx	; edx = length of source string (including L'\0')
	mov	edi, [esp+12]	; edi = address of destination string
;;	xor	eax, eax	; eax = L'\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasw
	dec	edi
	dec	edi		; edi = address of L'\0'
				;     = end of destination string
	mov	ecx, edx	; ecx = length of source string (including L'\0')
	rep	movsw
;;	xor	eax, eax	; eax = L'\0'
	stosw
	mov	eax, [esp+12]	; eax = address of destination string
	pop	edi
	pop	esi
	ret

wcsncat	endp

wcsncmp	proc	public		; int wcsncmp(wchar_t const *source, wchar_t const *destination, size_t count)

	push	esi
	push	edi
	xor	eax, eax	; eax = 0
	mov	esi, [esp+12]	; esi = address of source string
	mov	edi, [esp+16]	; edi = address of destination string
	cmp	edi, esi
	je	short equal	; address of destination string = address of source string?

	mov	ecx, [esp+20]	; ecx = count
	test	ecx, ecx
	jz	short equal	; count = 0?

;;	xor	eax, eax	; eax = 0,
;;				; CF = 0,
				; ZF = 1 (required when count is 0)
	repe	cmpsw
	seta	al		; eax = (*source > *destination)
	sbb	eax, 0		; eax = (*source > *destination)
				;     - (*source < *destination)
				;     = {1, 0, -1}
equal:
	pop	edi
	pop	esi
	ret

wcsncmp	endp

wcsncpy	proc	public		; wchar_t *wcsncpy(wchar_t *destination, wchar_t const *source, size_t count)

	push	esi
	push	edi
	mov	esi, [esp+16]	; esi = address of source string
	mov	edx, [esp+20]	; edx = count
	mov	edi, esi	; edi = address of source string
	mov	ecx, edx	; ecx = count
	xor	eax, eax	; eax = L'\0'
	repne	scasw
	sub	ecx, edx
	neg	ecx		; ecx = length of source string (including L'\0')
	sub	edx, ecx	; edx = count - length of source string (including L'\0')
	mov	edi, [esp+12]	; edi = address of destination string
	rep	movsw
	mov	ecx, edx	; ecx = count - length of source string (including L'\0')
;;	xor	eax, eax	; eax = L'\0'
	rep	stosw
	mov	eax, [esp+12]	; eax = address of destination string
	pop	edi
	pop	esi
	ret

wcsncpy	endp

wcsnlen	proc	public		; size_t wcsnlen(wchar_t const *string, size_t count)

	mov	ecx, [esp+8]	; ecx = count
	test	ecx, ecx
	jz	short empty	; count = 0?

	xor	eax, eax	; eax = L'\0'
	mov	edx, edi
	mov	edi, [esp+4]	; edi = address of string
;;	test	edi, edi	; ZF = 0 (required when count is 0)
	repne	scasw		; ecx = (length of string < count)
				;     ? count - (length of string + 1) : 0
	neg	ecx		; CF = (ecx <> 0)
				;    = ([edi] = L'\0')
				;    = (length of string < count),
				; ecx = (length of string < count)
				;     ? length of string + 1 - count : 0
	sbb	ecx, eax	; ecx = (length of string < count)
				;     ? length of string - count : 0
	add	ecx, [esp+8]	; ecx = (length of string < count)
				;     ? length of string : count
	mov	edi, edx
empty:
	mov	eax, ecx	; eax = (length of string < count)
				;     ? length of string : count
	ret

wcsnlen	endp

wcsnset	proc	public		; wchar_t *wcsnset(wchar_t *string, wchar_t character, size_t count)

	mov	edx, [esp+4]	; edx = address of string
	mov	ecx, [esp+12]	; ecx = count
	test	ecx, ecx
	jz	short zero	; count = 0?

	xor	eax, eax	; eax = L'\0'
	push	edi
	mov	edi, edx	; edi = address of string
;;	test	edi, edi	; ZF = 0 (required when count is 0)
	repne	scasw		; ecx = (length of string < count)
				;     ? count - (length of string + 1) : 0
	mov	edi, edx	; edi = address of string
	neg	ecx		; CF = (ecx <> 0)
				;    = ([edi] = L'\0')
				;    = (length of string < count),
				; ecx = (length of string < count)
				;     ? length of string + 1 - count : 0
	sbb	ecx, eax	; ecx = (length of string < count)
				;     ? length of string - count : 0
	add	ecx, [esp+16]	; ecx = (length of string < count)
				;     ? length of string : count
	mov	eax, [esp+12]	; eax = wide character
	rep	stosw
	pop	edi
zero:
	mov	eax, edx	; eax = address of string
	ret

wcsnset	endp

wcsrchr	proc	public		; wchar_t *wcsrchr(wchar_t const *string, wchar_t character)

	mov	edx, edi
	mov	edi, [esp+4]	; edi = address of string
	xor	eax, eax	; eax = L'\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasw
	not	ecx		; ecx = length of string (including L'\0')
	dec	edi
	dec	edi		; edi = address of L'\0'
				;     = end of string
	mov	eax, [esp+8]	; eax = wide character
	std
	repne	scasw
	cld
	inc	edi
	inc	edi		; edi = address of wide character
	neg	ecx		; CF = (ecx <> 0)
				;    = ([edi] = wide character)
	sbb	eax, eax	; eax = (ecx = 0) ? 0 : -1
	and	eax, edi	; eax = (ecx = 0) ? 0 : address of wide character
	mov	edi, edx
	ret

wcsrchr	endp

wcsrev	proc	private		; wchar_t *wcsrev(wchar_t *string)

	ud2

wcsrev	endp

wcsset	proc	public		; wchar_t *wcsset(wchar_t *string, wchar_t character)

	mov	edx, edi
	mov	edi, [esp+4]	; edi = address of string
	xor	eax, eax	; eax = L'\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasw
	not	ecx		; ecx = length of string (including L'\0')
	sub	edi, ecx
	sub	edi, ecx	; edi = address of string
	dec	ecx		; ecx = length of string
	mov	eax, [esp+8]	; eax = wide character
	rep	stosw
	mov	edi, edx
	mov	eax, [esp+4]	; eax = address of string
	ret

wcsset	endp

wcsstr	proc	public		; wchar_t *wcsstr(wchar_t const *haystack, wchar_t const *needle)

	push	edi
	mov	edi, [esp+12]	; edi = address of needle
	xor	eax, eax	; eax = L'\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasw
	not	ecx		; ecx = length of needle (including L'\0')
	dec	ecx		; ecx = length of needle
	mov	eax, [esp+8]	; eax = address of haystack
	jz	short empty	; length of needle = 0?

	mov	edx, ecx	; edx = length of needle
ifdef SIMPLE
	push	esi
compare:
	mov	esi, eax	; esi = current address in haystack
	mov	edi, [esp+16]	; edi = address of needle
	mov	ecx, edx	; ecx = length of needle
	repe	cmpsw
	je	short match	; needle in haystack?

	inc	eax
	inc	eax		; eax = next address in haystack
	cmp	word ptr [esi-2], 0
	jne	short compare	; non-matching wide character in haystack <> L'\0'?

	xor	eax, eax
match:
else ; SIMPLE
	mov	edi, eax	; edi = address of haystack
	xor	eax, eax	; eax = L'\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasw
	not	ecx		; ecx = length of haystack (including L'\0')
	sub	edi, ecx
	sub	edi, ecx	; edi = address of haystack
	dec	ecx		; ecx = length of haystack
	jz	short empty	; length of haystack = 0?

	cmp	ecx, edx
	jb	short empty	; length of haystack < length of needle?

	push	esi
	push	ebx
search:
	mov	esi, [esp+20]	; esi = address of needle
	mov	ax, [esi]	; ax = first wide character of needle
				; edi = current address in haystack
	repne	scasw		; edi = next address in haystack,
				; ecx = current length of haystack
	jne	short break	; (first wide character of) needle not in haystack?

	dec	ecx		; ecx = next length of haystack

	mov	ax, [esi+edx*2-2]
				; ax = last wide character of needle
	cmp	ax, [edi+edx*2-4]
	jne	short continue	; last wide character of needle not in haystack?
compare:
	mov	eax, edi	; eax = next address in haystack
	mov	ebx, ecx	; ebx = next length of haystack
if 0
	dec	edi
	dec	edi		; edi = current address in haystack
				;     = address of matching wide character
				; esi = address of needle
	mov	ecx, edx	; ecx = length of needle
else
				; edi = next address in haystack
	inc	esi
	inc	esi		; esi = address of needle + 2
	mov	ecx, edx
	dec	ecx		; ecx = length of needle - 1,
				; ZF = (ecx = 0)
;;	jz	short match	; needle in haystack?
endif
	repe	cmpsw
	je	short match	; needle in haystack?

	mov	edi, eax	; edi = current address in haystack
	mov	ecx, ebx	; ecx = current length of haystack
continue:
	cmp	ecx, edx
	jae	short search	; length of haystack >= length of needle?
break:
	xor	eax, eax
	pop	ebx
	pop	esi
	pop	edi
	ret
match:
	dec	eax
	dec	eax		; eax = address of needle in haystack
	pop	ebx
endif ; SIMPLE
	pop	esi
empty:
	pop	edi
	ret

wcsstr	endp
	end

Save the i386 assembler source presented above as i386-wcs.asm in the directory where you created the object library i386.lib before, then execute the following 3 command lines to generate the object file i386-wcs.obj and add it to the existing object library i386.lib:

SET ML=/c /Gy /safeseh /W3 /X
ML.EXE i386-wcs.asm
LINK.EXE /LIB /OUT:i386.lib i386.lib i386-wcs.obj

For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: the command lines can be copied and pasted as a block into a Command Processor window!

Microsoft (R) Macro Assembler Version 14.16.27023.1
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: i386-wcs.asm

Microsoft (R) Library Manager Version 14.16.27049.0
Copyright (C) Microsoft Corporation.  All rights reserved.
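
The comments in wcslen() above derive the length from the counter left behind by REPNE SCASW: the counter starts at -1 and is decremented once per scanned wide character, including the terminating L'\0', so it ends at -2 - length. The following C sketch replays that arithmetic with an unsigned 32-bit variable; the name scan_length() is hypothetical, and the variable ecx only mimics the register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Hypothetical model of the counter arithmetic; not part of the library. */
size_t scan_length(wchar_t const *string)
{
    uint32_t ecx = UINT32_MAX;         /* xor ecx, ecx; dec ecx: ecx = -1 */

    assert(string != NULL);
    do                                 /* repne scasw: one decrement per wide */
        ecx--;                         /* character, including the L'\0'      */
    while (*string++ != L'\0');
    /* now ecx = -1 - (length + 1) = -2 - length, so both variants agree: */
    assert((uint32_t) (~ecx) - 1u == UINT32_C(0xFFFFFFFE) - ecx);
    return (size_t) ((uint32_t) (~ecx) - 1u);   /* not ecx; dec ecx */
}
```

For example, scan_length(L"wide") yields 4: the counter ends at -6, NOT turns it into 5, and the final decrement removes the terminator from the count.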

Implementation in AMD64 Assembler

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

; NOTE: counts and lengths are numbers of wide characters, not bytes!

	.code

wcscat	proc	public		; wchar_t *wcscat(wchar_t *destination, wchar_t const *source)

	mov	r9, rcx		; r9 = address of destination string
	mov	r10, rdi
ifdef VARIANT
	mov	rdi, rcx	; rdi = address of destination string
	xor	eax, eax	; rax = L'\0'
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasw
	dec	rdi
	dec	rdi		; rdi = address of L'\0'
				;     = end of destination string
	mov	r11, rsi
	mov	rsi, rdi	; rsi = end of destination string
	mov	rdi, rdx	; rdi = address of source string
;;	xor	eax, eax
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasw
	not	rcx		; rcx = length of source string (including L'\0')
	mov	rdi, rsi	; rdi = end of destination string
	mov	rsi, rdx	; rsi = address of source string
else ; VARIANT
	mov	rdi, rdx	; rdi = address of source string
;;	xor	eax, eax
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasw
	not	rcx		; rcx = length of source string (including L'\0')
	mov	r11, rsi
	mov	rsi, rdx	; rsi = address of source string
	mov	rdx, rcx
	mov	rdi, r9		; rdi = address of destination string
;;	xor	eax, eax	; rax = L'\0'
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasw
	dec	rdi
	dec	rdi		; rdi = address of L'\0'
				;     = end of destination string
	mov	rcx, rdx	; rcx = length of source string (including L'\0')
endif ; VARIANT
	rep	movsw
	mov	rax, r9		; rax = address of destination string
	mov	rdi, r10
	mov	rsi, r11
	ret

wcscat	endp

wcschr	proc	public		; wchar_t *wcschr(wchar_t const *string, wchar_t character)

	mov	r10, rdi
	mov	rdi, rcx	; rdi = address of string
	xor	eax, eax	; rax = L'\0'
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasw
	not	rcx		; rcx = length of string (including L'\0')
	mov	rax, rdx	; rax = wide character
	sub	rdi, rcx
	sub	rdi, rcx	; rdi = address of string
				;     (rcx counts wide characters, rdi counts bytes)
	repne	scasw
	lea	rax, [rdi-2]	; rax = address of wide character
	cmovne	rax, rcx	; rax = (rcx = 0) ? 0 : address of wide character
	mov	rdi, r10
	ret

wcschr	endp

wcscmp	proc	public		; int wcscmp(wchar_t const *source, wchar_t const *destination)

	xor	eax, eax	; rax = 0
	cmp	rcx, rdx
	je	short equal	; address of source string = address of destination string?

	mov	r11, rsi
	mov	rsi, rcx	; rsi = address of source string
	mov	r10, rdi
	mov	rdi, rdx	; rdi = address of destination string
;;	xor	eax, eax	; rax = L'\0'
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasw
	not	rcx		; rcx = length of destination string (including L'\0')
	mov	rdi, rdx	; rdi = address of destination string
;;	xor	eax, eax	; rax = 0
	repe	cmpsw
	seta	al		; rax = (*source > *destination)
	sbb	rax, 0		; rax = (*source > *destination)
				;     - (*source < *destination)
				;     = {1, 0, -1}
	mov	rdi, r10
	mov	rsi, r11
equal:
	ret

wcscmp	endp

; NOTE: wcscoll() is another implementation of wcscmp()!

wcscoll	proc	public		; int wcscoll(wchar_t const *source, wchar_t const *destination)

	sub	rdx, rcx
	jz	short equal	; address of destination string = address of source string?
compare:
	mov	ax, [rcx]
	cmp	ax, [rcx+rdx]
	jne	short different

	lea	rcx, [rcx+2]
	test	ax, ax
	jnz	short compare	; *source <> L'\0'?
equal:
	xor	eax, eax	; rax = 0
	ret

different:
	sbb	rax, rax	; rax = (*source < *destination) ? -1 : 0
	or	rax, 1		; rax = (*source > *destination)
				;     - (*source < *destination)
				;     = {1, -1}
	ret

wcscoll	endp

wcscpy	proc	public		; wchar_t *wcscpy(wchar_t *destination, wchar_t const *source)

	mov	r9, rcx		; r9 = address of destination string
	mov	r10, rdi
	mov	rdi, rdx	; rdi = address of source string
	xor	eax, eax	; rax = L'\0'
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasw
	not	rcx		; rcx = length of source string (including L'\0')
	mov	rdi, r9		; rdi = address of destination string
	mov	r11, rsi
	mov	rsi, rdx	; rsi = address of source string
	rep	movsw
	mov	rax, r9		; rax = address of destination string
	mov	rdi, r10
	mov	rsi, r11
	ret

wcscpy	endp

wcslen	proc	public		; size_t wcslen(wchar_t const *string)

	mov	rdx, rdi
	mov	rdi, rcx	; rdi = address of string
	xor	eax, eax	; rax = L'\0'
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasw
	not	rcx		; rcx = length of string (including L'\0')
	dec	rcx		; rcx = length of string
	mov	rax, rcx	; rax = length of string
	mov	rdi, rdx
	ret

wcslen	endp

wcsncat	proc	private		; wchar_t *wcsncat(wchar_t *destination, wchar_t const *source, size_t count)

	ud2

wcsncat	endp

wcsncmp	proc	private		; int wcsncmp(wchar_t const *source, wchar_t const *destination, size_t count)

	ud2

wcsncmp	endp

wcsncpy	proc	private		; wchar_t *wcsncpy(wchar_t *destination, wchar_t const *source, size_t count)

	ud2

wcsncpy	endp

wcsnlen	proc	private		; size_t wcsnlen(wchar_t const *string, size_t count)

	ud2

wcsnlen	endp

wcsnset	proc	private		; wchar_t *wcsnset(wchar_t *string, wchar_t character, size_t count)

	ud2

wcsnset	endp

wcsrchr	proc	public		; wchar_t *wcsrchr(wchar_t const *string, wchar_t character)

	mov	r10, rdi
	mov	rdi, rcx	; rdi = address of string
	xor	eax, eax	; rax = L'\0'
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasw
	not	rcx		; rcx = length of string (including L'\0')
	mov	rax, rdx	; rax = wide character
	lea	rdi, [rdi-2]	; rdi = address of L'\0'
				;     = end of string
	std
	repne	scasw
	cld
	lea	rax, [rdi+2]	; rax = address of wide character
	cmovne	rax, rcx	; rax = (rcx = 0) ? 0 : address of wide character
	mov	rdi, r10
	ret

wcsrchr	endp

wcsrev	proc	private		; wchar_t *wcsrev(wchar_t *string)

	ud2

wcsrev	endp

wcsset	proc	public		; wchar_t *wcsset(wchar_t *string, wchar_t character)

	mov	r9, rcx		; r9 = address of string
	mov	r10, rdi
	mov	rdi, rcx	; rdi = address of string
	xor	eax, eax	; rax = L'\0'
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasw
	not	rcx		; rcx = length of string (including L'\0')
	dec	rcx
	mov	rdi, r9		; rdi = address of string
	mov	rax, rdx	; rax = wide character
	rep	stosw
	mov	rax, r9		; rax = address of string
	mov	rdi, r10
	ret

wcsset	endp

wcsstr	proc	public		; wchar_t *wcsstr(wchar_t const *haystack, wchar_t const *needle)

	mov	r8, rcx		; r8 = address of haystack
	mov	r10, rdi
	mov	rdi, rdx	; rdi = address of needle
	xor	eax, eax	; rax = L'\0'
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasw
	not	rcx		; rcx = length of needle (including L'\0')
	dec	rcx		; rcx = length of needle
	mov	rax, r8		; rax = address of haystack
	jz	short empty	; length of needle = 0?

	mov	r9, rcx		; r9 = length of needle
	mov	rdi, r8		; rdi = address of haystack
	xor	eax, eax	; rax = L'\0'
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasw
	not	rcx		; rcx = length of haystack (including L'\0')
	mov	rdi, r8		; rdi = address of haystack
	dec	rcx		; rcx = length of haystack
	jz	short empty	; length of haystack = 0?

	cmp	rcx, r9
	jb	short empty	; length of haystack < length of needle?

	mov	r11, rsi
search:
	mov	ax, [rdx]	; ax = first wide character of needle
				; rdi = current address in haystack
	repne	scasw		; rdi = next address in haystack,
				; rcx = current length of haystack
	jne	short break	; (first wide character of) needle not in haystack?

				; NOTE: scasw has already decremented rcx,
				;       it is the length of the haystack after rdi

	mov	ax, [rdx+r9*2-2]
				; ax = last wide character of needle
	cmp	ax, [rdi+r9*2-4]
	jne	short continue	; last wide character of needle not in haystack?
compare:
	mov	rax, rdi	; rax = next address in haystack
	mov	r8, rcx		; r8 = next length of haystack
if 0
	lea	rdi, [rdi-2]	; rdi = current address in haystack
				;     = address of matching character
	mov	rsi, rdx	; rsi = address of needle
	mov	rcx, r9		; rcx = length of needle
else
				; rdi = next address in haystack
	lea	rsi, [rdx+2]	; rsi = address of needle + 2
	mov	rcx, r9
	dec	rcx		; rcx = length of needle - 1,
				; ZF = (rcx = 0)
;;	jz	short match	; needle in haystack?
endif
	repe	cmpsw
	je	short match	; needle in haystack?

	mov	rdi, rax	; rdi = current address in haystack
	mov	rcx, r8		; rcx = current length of haystack
continue:
	cmp	rcx, r9
	jae	short search	; length of haystack >= length of needle?
break:
	xor	eax, eax
	mov	rsi, r11
empty:
	mov	rdi, r10	; rdi is nonvolatile, restore it on every exit path
	ret
match:
	lea	rax, [rax-2]	; rax = address of needle in haystack
	mov	rdi, r10
	mov	rsi, r11
	ret

wcsstr	endp
	end

Save the AMD64 assembler source presented above as amd64-wcs.asm in the directory where you created the object library amd64.lib before, then execute the following 3 command lines to generate the object file amd64-wcs.obj and add it to the existing object library amd64.lib:

SET ML=/c /Gy /W3 /X
ML64.EXE amd64-wcs.asm
LINK.EXE /LIB /OUT:amd64.lib amd64.lib amd64-wcs.obj

For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: the command lines can be copied and pasted as block into a Command Processor window!

Microsoft (R) Macro Assembler (x64) Version 14.16.27023.1
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: amd64-wcs.asm

Microsoft (R) Library Manager Version 14.16.27049.0
Copyright (C) Microsoft Corporation.  All rights reserved.
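
Note: the length computation used throughout the listing above (load the counter with -1, let repne scasw decrement it once per scanned wide character, then apply not) can be checked with a small C model. The following sketch is an illustration only: the function names scan_count_model and wcslen_model are hypothetical, and uint16_t stands in for the 16-bit wchar_t of Windows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the instruction sequence
 *	or	rcx, -1
 *	repne	scasw
 *	not	rcx
 * for a string of 16-bit units terminated by 0.
 * Returns the value left in rcx after "not rcx". */
static uint64_t scan_count_model(const uint16_t *string)
{
	uint64_t rcx = UINT64_MAX;	/* rcx = -1 */

	assert(string != NULL);

	/* repne scasw: decrement the counter for every unit compared,
	   stop after the unit that equals ax = 0 */
	do
		rcx--;
	while (*string++ != 0);

	return ~rcx;			/* units scanned, terminator included */
}

/* Length without the terminator: one decrement after "not rcx". */
static uint64_t wcslen_model(const uint16_t *string)
{
	return scan_count_model(string) - 1;
}
```

For a string of n wide characters the counter ends at -(n + 2), so its complement is n + 1, the length including the terminating L'\0'.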

Thread Local Storage Support

The MSDN article Thread Local Storage (TLS) specifies:

Thread Local Storage (TLS) is the method by which each thread in a given multithreaded process can allocate locations in which to store thread-specific data. Dynamically bound (run-time) thread-specific data is supported by way of the TLS API (TlsAlloc, TlsGetValue, TlsSetValue, TlsFree). Win32 and the Microsoft C++ compiler now support statically bound (load-time) per-thread data in addition to the existing API implementation.
[…]
Visual C also provides a Microsoft-specific attribute, thread, as extended storage class modifier. Use the __declspec keyword to declare a thread variable. For example, the following code declares an integer thread local variable and initializes it with a value:
__declspec( thread ) int tls_i = 1;
[…]

On Windows operating systems before Windows Vista, __declspec( thread ) has some limitations. If a DLL declares any data or object as __declspec( thread ), it can cause a protection fault if dynamically loaded. After the DLL is loaded with LoadLibrary, it causes system failure whenever the code references the __declspec( thread ) data. Because the global variable space for a thread is allocated at run time, the size of this space is based on a calculation of the requirements of the application plus the requirements of all the DLLs that are statically linked. When you use LoadLibrary, you can't extend this space to allow for the thread local variables declared with __declspec( thread ). Use the TLS APIs, such as TlsAlloc, in your DLL to allocate TLS if the DLL might be loaded with LoadLibrary.

Under the heading The .tls section, the specification of the PE Format states:

The .tls section provides direct PE and COFF support for static thread local storage (TLS). […] a static TLS variable can be defined as follows, without using the Windows API:
__declspec (thread) int tlsFlag = 1;
To support this programming construct, the PE and COFF .tls section specifies the following information: initialization data, callback routines for per-thread initialization and termination, and the TLS index, which are explained in the following discussion.
Note
Statically declared TLS data objects can be used only in statically loaded image files. This fact makes it unreliable to use static TLS data in a DLL unless you know that the DLL, or anything statically linked with it, will never be loaded dynamically with the LoadLibrary API function.
Executable code accesses a static TLS data object through the following steps:

At link time, the linker sets the Address of Index field of the TLS directory. This field points to a location where the program expects to receive the TLS index.
The Microsoft run-time library facilitates this process by defining a memory image of the TLS directory and giving it the special name "__tls_used" (Intel x86 platforms) or "_tls_used" (other platforms). The linker looks for this memory image and uses the data there to create the TLS directory. Other compilers that support TLS and work with the Microsoft linker must use this same technique.

When a thread is created, the loader communicates the address of the thread's TLS array by placing the address of the thread environment block (TEB) in the FS register. A pointer to the TLS array is at the offset of 0x2C from the beginning of TEB. This behavior is Intel x86-specific.

The loader assigns the value of the TLS index to the place that was indicated by the Address of Index field.

The executable code retrieves the TLS index and also the location of the TLS array.

The code uses the TLS index and the TLS array location (multiplying the index by 4 and using it as an offset to the array) to get the address of the TLS data area for the given program and module. Each thread has its own TLS data area, but this is transparent to the program, which does not need to know how data is allocated for individual threads.

An individual TLS data object is accessed as some fixed offset into the TLS data area.

Ouch: even the very first (highlighted) sentence is wrong; the IMAGE_TLS_DIRECTORY provides the TLS support.

Note: the .tls section is required only when TLS data is initialised; it is not needed when data is just declared.

Ouch: the note at the beginning of the quoted section is obsolete and wrong; Windows Vista and later versions of Windows NT support static TLS data in dynamically loaded DLLs!

Note: the multiplier 4 is of course only correct for 32-bit platforms; 64-bit platforms require the multiplier 8.
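
The scaling mentioned in the note above follows from the TLS array being an array of pointers: slot i lies i * sizeof(void *) bytes past the start of the array. A minimal, portable sketch of the last access steps follows; the names tls_slot_offset and tls_object_address are made up for illustration, and the array used here is ordinary memory, not the real per-thread array referenced from the TEB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte offset of slot "index" in an array of pointers:
   index * 4 on 32-bit platforms, index * 8 on 64-bit platforms. */
static size_t tls_slot_offset(uint32_t index)
{
	return (size_t) index * sizeof(void *);
}

/* Models the last steps of the quoted list: select the TLS data area
   of the module through its TLS index, then add the fixed offset of
   the individual object inside that area. */
static void *tls_object_address(void *const *tls_array, uint32_t index, size_t offset)
{
	assert(tls_array != NULL);

	return (char *) tls_array[index] + offset;
}
```

Indexing with tls_array[index] lets the compiler apply the platform's pointer size, which is why the same C source works for both multipliers.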

The documentation misses the following step for the x64 alias AMD64 processor architecture, and corresponding steps for other processor architectures as well:

When a thread is created, the loader communicates the address of the thread's TLS array by placing the address of the thread environment block (TEB) in the GS register. A pointer to the TLS array is at the offset of 0x58 from the beginning of the TEB. This behavior is Intel x64-specific.

Note: despite the fixed value of this offset, the Visual C compiler references the address of the external symbol __tls_array on the i386 alias x86 platform.

The specification of the PE Format continues:

The TLS directory has the following format:

Offset (PE32/PE32+)	Size (PE32/PE32+)	Field	Description
0	4/8	Raw Data Start VA	The starting address of the TLS template. The template is a block of data that is used to initialize TLS data. The system copies all of this data each time a thread is created, so it must not be corrupted. Note that this address is not an RVA; it is an address for which there should be a base relocation in the .reloc section.
4/8	4/8	Raw Data End VA	The address of the last byte of the TLS, except for the zero fill. As with the Raw Data Start VA field, this is a VA, not an RVA.
8/16	4/8	Address of Index	The location to receive the TLS index, which the loader assigns. This location is in an ordinary data section, so it can be given a symbolic name that is accessible to the program.
12/24	4/8	Address of Callbacks	The pointer to an array of TLS callback functions. The array is null-terminated, so if no callback function is supported, this field points to 4 bytes set to zero. For information about the prototype for these functions, see TLS Callback Functions.
16/32	4	Size of Zero Fill	The size in bytes of the template, beyond the initialized data delimited by the Raw Data Start VA and Raw Data End VA fields. The total template size should be the same as the total size of TLS data in the image file. The zero fill is the amount of data that comes after the initialized nonzero data.
20/36	4	Characteristics	The four bits [23:20] describe alignment info. Possible values are those defined as IMAGE_SCN_ALIGN_*, which are also used to describe alignment of section in object files. The other 28 bits are reserved for future use.
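
The PE32+ offsets of the table can be cross-checked with a plain C structure. The sketch below deliberately avoids <windows.h>: the type tls_directory64_model and the helper tls_alignment are hypothetical stand-ins whose field names follow IMAGE_TLS_DIRECTORY64, and the decoding of the bits [23:20] assumes the usual IMAGE_SCN_ALIGN_* encoding, where the value n selects an alignment of 2^(n-1) bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the PE32+ layout of the TLS directory. */
typedef struct tls_directory64_model {
	uint64_t StartAddressOfRawData;	/* offset  0 */
	uint64_t EndAddressOfRawData;	/* offset  8 */
	uint64_t AddressOfIndex;	/* offset 16 */
	uint64_t AddressOfCallBacks;	/* offset 24 */
	uint32_t SizeOfZeroFill;	/* offset 32 */
	uint32_t Characteristics;	/* offset 36 */
} tls_directory64_model;

/* Alignment in bytes encoded in bits [23:20] of the characteristics,
   0 if no (valid) alignment is specified. */
static uint32_t tls_alignment(uint32_t characteristics)
{
	uint32_t n = (characteristics >> 20) & 15;

	return (n == 0 || n == 15) ? 0 : 1u << (n - 1);
}
```

With 8-byte address fields the structure occupies 40 bytes; the PE32 variant with 4-byte fields occupies 24 bytes.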

Note: the documentation lacks the information that the Visual C compiler puts all data for the TLS template in COFF sections .tls$‹suffix›, which it declares writable instead of read-only, i.e. it fails to protect the template data against corruption, an easily avoidable safety hazard!

OOPS: the Raw Data End VA field contains the address of the first byte after the TLS template!

OUCH: the Size of Zero Fill field is not supported at all!

Note: if the size of the initialised data of the .tls section in the image file is less than the section size, the module loader fills the additional uninitialised data with zeroes, i.e. the Size of Zero Fill field is superfluous.
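
The difference between the documented and the observed meaning of the Raw Data End VA field amounts to an off-by-one in the size computation. A small sketch for comparison; both helpers are hypothetical and not part of any Windows API:

```c
#include <assert.h>
#include <stdint.h>

/* Template size if "end" is the address of the first byte AFTER the
   template, i.e. the behaviour observed above. */
static uint64_t tls_size_end_exclusive(uint64_t start, uint64_t end)
{
	assert(end >= start);

	return end - start;
}

/* Template size if "end" were the address of the LAST byte of the
   template, i.e. the wording of the specification; for comparison only. */
static uint64_t tls_size_end_inclusive(uint64_t start, uint64_t end)
{
	assert(end >= start);

	return end - start + 1;
}
```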

Under the heading TLS Callback Functions, the specification of the PE Format states:

The program can provide one or more TLS callback functions […]
The prototype for a callback function (pointed to by a pointer of type PIMAGE_TLS_CALLBACK) has the same parameters as a DLL entry-point function:
typedef VOID
(NTAPI *PIMAGE_TLS_CALLBACK) (
    PVOID DllHandle,
    DWORD Reason,
    PVOID Reserved
    );
The Reserved parameter should be set to zero. The Reason parameter can take the following values:

Setting	Value	Description
DLL_PROCESS_ATTACH	1	A new process has started, including the first thread.
DLL_THREAD_ATTACH	2	A new thread has been created. This notification sent for all but the first thread.
DLL_THREAD_DETACH	3	A thread is about to be terminated. This notification sent for all but the first thread.
DLL_PROCESS_DETACH	0	A process is about to terminate, including the original thread.
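
The four reason codes map naturally onto a lookup table indexed by their value. A hedged sketch in portable C follows: the constants are redefined locally with a MODEL_ prefix so that the block compiles without <windows.h>, and tls_reason_name is a hypothetical helper which, unlike a bare array access, tolerates unexpected values.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the values from the table above. */
enum {
	MODEL_DLL_PROCESS_DETACH = 0,
	MODEL_DLL_PROCESS_ATTACH = 1,
	MODEL_DLL_THREAD_ATTACH  = 2,
	MODEL_DLL_THREAD_DETACH  = 3
};

/* Returns a readable name for the reason passed to a TLS callback. */
static const char *tls_reason_name(uint32_t reason)
{
	static const char *const names[4] = {"process detach",
	                                     "process attach",
	                                     "thread attach",
	                                     "thread detach"};

	return reason < 4 ? names[reason] : "unknown";
}
```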

Demonstration in ANSI C

The following test application, consisting of a program and a DLL statically linked to the program, starts a thread in the entry point functions mainCRTStartup() and _DllMainCRTStartup() of both its components, and uses a TLS callback function to log the thread’s progress:

// Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#define STRICT
#undef UNICODE
#define WIN32_LEAN_AND_MEAN

#include <windows.h>

__declspec(thread)
DWORD	dwTLS = 'MSVC';			// placed in writable ".tls$" section by the compiler

#ifndef LIBRARY
#pragma data_seg(".tls")

DWORD	_tls_begin = 'JUNK';		// placed before all TLS template data by the linker

#pragma data_seg(".tls$~~~")

DWORD	_tls_end = 'JUNK';		// placed after all TLS template data by the linker

#pragma data_seg()

#pragma bss_seg(".bss$T")

DWORD	_tls_index;			// assigned by the module loader

#pragma bss_seg()
#else
extern	const	DWORD	_tls_index;
#endif // LIBRARY

__declspec(safebuffers)
BOOL	CDECL	PrintFormat(HANDLE hOutput, [SA_FormatString(Style="printf")] LPCSTR lpFormat, ...)
{
	CHAR	szFormat[1024];
	DWORD	dwFormat;
	DWORD	dwOutput;

	va_list	vaInput;
	va_start(vaInput, lpFormat);

	dwFormat = wvsprintf(szFormat, lpFormat, vaInput);

	va_end(vaInput);

	if ((dwFormat == 0UL)
	 || !WriteFile(hOutput, szFormat, dwFormat, &dwOutput, (LPOVERLAPPED) NULL))
		return FALSE;

	return dwOutput == dwFormat;
}

const	LPCSTR	szReason[4] = {"process detach",
		               "process attach",
		               "thread attach",
		               "thread detach"};

__declspec(safebuffers)
VOID	WINAPI	TLSCallback(LPVOID hModule, DWORD dwReason, LPVOID lpUnused)
{
	HANDLE	hOutput = GetStdHandle(STD_OUTPUT_HANDLE);
	HMODULE	hCaller;
	DWORD	dwCaller;
	CHAR	szCaller[MAX_PATH];
	CHAR	szModule[MAX_PATH];
	DWORD	dwModule = GetModuleFileName(hModule, szModule, sizeof(szModule));

	if (hOutput == INVALID_HANDLE_VALUE)
		return;

	if (dwModule < sizeof(szModule))
		szModule[dwModule] = '\0';

	if (!GetModuleHandleEx(GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS | GET_MODULE_HANDLE_EX_FLAG_UNCHANGED_REFCOUNT,
	                       _ReturnAddress(),
	                       &hCaller))
		szCaller[0] = '\0';
	else
	{
		dwCaller = GetModuleFileName(hCaller, szCaller, sizeof(szCaller));

		if (dwCaller < sizeof(szCaller))
			szCaller[dwCaller] = '\0';
	}

	PrintFormat(hOutput,
	            "\r\n"
	            __FUNCTION__ "() function @ 0x%p\r\n"
	            "\tCalled module  @ 0x%p = %hs\r\n"
	            "\tCalling module @ 0x%p = %hs\r\n"
	            "\tReturn address @ 0x%p = 0x%p\r\n"
	            "\tArguments:\r\n"
	            "\t\tModule = 0x%p\r\n"
	            "\t\tReason = %lu (%hs)\r\n"
	            "\t\tUnused = 0x%p\r\n"
	            "\tThread id = %lu\r\n",
	            TLSCallback,
	            hModule, szModule,
	            hCaller, szCaller,
	            _AddressOfReturnAddress(), _ReturnAddress(),
	            hModule, dwReason, szReason[dwReason], lpUnused,
	            GetCurrentThreadId());
}

#ifndef LIBRARY
const	PIMAGE_TLS_CALLBACK	_tls_callbacks[] = {TLSCallback, NULL};

const	IMAGE_TLS_DIRECTORY	_tls_used = {&_tls_begin,
				             &_tls_end + sizeof(_tls_end),
				             &_tls_index,
				             _tls_callbacks,
				             'VOID',
				             0UL};
#else
extern	IMAGE_TLS_DIRECTORY	_tls_used;

#pragma const_seg(".ptr$")		// added to ".ptr" section by the linker

//const	PIMAGE_TLS_CALLBACK	_tls_callback = TLSCallback;

#pragma const_seg()			// place more pointers to callback routines above here

__declspec(allocate(".ptr$"))		// added to ".ptr" section by the linker
const	PIMAGE_TLS_CALLBACK	_tls_callback = TLSCallback;
#endif // LIBRARY

extern	IMAGE_DOS_HEADER	__ImageBase;

#ifdef _DLL
__declspec(dllexport)
__declspec(safebuffers)
DWORD	WINAPI	ThreadProc(LPVOID lpParameter)
{
	HMODULE	hCaller;
	DWORD	dwCaller;
	CHAR	szCaller[MAX_PATH];
	CHAR	szModule[MAX_PATH];
	DWORD	dwModule = GetModuleFileName((HMODULE) &__ImageBase, szModule, sizeof(szModule));

	if (!GetModuleHandleEx(GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS | GET_MODULE_HANDLE_EX_FLAG_UNCHANGED_REFCOUNT,
	                       _ReturnAddress(),
	                       &hCaller))
		szCaller[0] = '\0';
	else
	{
		dwCaller = GetModuleFileName(hCaller, szCaller, sizeof(szCaller));

		if (dwCaller < sizeof(szCaller))
			szCaller[dwCaller] = '\0';
	}

	if (dwModule < sizeof(szModule))
		szModule[dwModule] = '\0';

	PrintFormat(lpParameter,
	            "\r\n"
	            __FUNCTION__ "() function @ 0x%p\r\n"
	            "\tCalled module  @ 0x%p = %hs\r\n"
	            "\tCalling module @ 0x%p = %hs\r\n"
	            "\tReturn address @ 0x%p = 0x%p\r\n"
	            "\tParameter = 0x%p\r\n"
	            "\tThread id = %lu\r\n",
	            ThreadProc,
	            &__ImageBase, szModule,
	            hCaller, szCaller,
	            _AddressOfReturnAddress(), _ReturnAddress(),
	            lpParameter,
	            GetCurrentThreadId());

	return GetLastError();
}

__declspec(safebuffers)
BOOL	WINAPI	_DllMainCRTStartup(HMODULE hModule, DWORD dwReason, CONTEXT *lpContext)
{
	DWORD	dwThreadId = GetCurrentThreadId();
	HANDLE	hThread;
	HANDLE	hOutput = GetStdHandle(STD_OUTPUT_HANDLE);
	HMODULE	hCaller;
	DWORD	dwCaller;
	CHAR	szCaller[MAX_PATH];
	CHAR	szModule[MAX_PATH];
	DWORD	dwModule = GetModuleFileName(hModule, szModule, sizeof(szModule));

	if (hOutput == INVALID_HANDLE_VALUE)
		return FALSE;

	if (!GetModuleHandleEx(GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS | GET_MODULE_HANDLE_EX_FLAG_UNCHANGED_REFCOUNT,
	                       _ReturnAddress(),
	                       &hCaller))
		szCaller[0] = '\0';
	else
	{
		dwCaller = GetModuleFileName(hCaller, szCaller, sizeof(szCaller));

		if (dwCaller < sizeof(szCaller))
			szCaller[dwCaller] = '\0';
	}

	if (dwModule < sizeof(szModule))
		szModule[dwModule] = '\0';

	PrintFormat(hOutput,
	            "\r\n"
	            __FUNCTION__ "() function @ 0x%p\r\n"
	            "\tCalled module  @ 0x%p = %hs\r\n"
	            "\tCalling module @ 0x%p = %hs\r\n"
	            "\tReturn address @ 0x%p = 0x%p\r\n"
	            "\tArguments:\r\n"
	            "\t\tModule = 0x%p\r\n"
	            "\t\tReason = %lu (%hs)\r\n"
	            "\t\tUnused = 0x%p\r\n"
	            "\tThread id = %lu\r\n",
	            _DllMainCRTStartup,
	            hModule, szModule,
	            hCaller, szCaller,
	            _AddressOfReturnAddress(), _ReturnAddress(),
	            hModule, dwReason, szReason[dwReason], lpContext,
	            dwThreadId);

	if (dwReason != DLL_PROCESS_ATTACH)
		return FALSE;

	PrintFormat(hOutput,
	            "\a"
	            "\tTLS index = %lu\r\n"
	            "\tTLS value = 0x%p\r\n"
	            "\tTLS array @ 0x%p\r\n"
	            "\tTLS block @ 0x%p\r\n"
	            "\tTLS dword @ 0x%p = \"%.4hs\"\r\n"
	            "\tTLS directory     @ 0x%p\r\n"
	            "\t\tStart     @ 0x%p\r\n"
	            "\t\tEnd       @ 0x%p\r\n"
	            "\t\tIndex     @ 0x%p\r\n"
	            "\t\tCallbacks @ 0x%p\r\n"
	            "\t\tZerofill  = 0x%08lX = \"%.4hs\"\r\n"
	            "\t\tAlignment = 0x%08lX\r\n" + (dwTLS == 'MSVC'),
	            _tls_index,
	            TlsGetValue(_tls_index),
#ifdef _M_IX86
	            __readfsdword(44),
	            ((LPVOID *) __readfsdword(44))[_tls_index],
#elif _M_AMD64
	            __readgsqword(88),
	            ((LPVOID *) __readgsqword(88))[_tls_index],
#else
#error Only I386 and AMD64 supported!
#endif
	            &dwTLS, &dwTLS,
	            &_tls_used,
	            _tls_used.StartAddressOfRawData,
	            _tls_used.EndAddressOfRawData,
	            _tls_used.AddressOfIndex,
	            _tls_used.AddressOfCallBacks,
	            _tls_used.SizeOfZeroFill, &_tls_used.SizeOfZeroFill,
	            _tls_used.Characteristics);

	hThread = CreateThread((LPSECURITY_ATTRIBUTES) NULL,
	                       (SIZE_T) 65536,
	                       ThreadProc,
	                       hOutput,
	                       0,
	                       &dwThreadId);

	if (hThread == NULL)
		PrintFormat(hOutput,
		            "CreateThread() returned error %lu\r\n",
		            GetLastError());
	else
	{
		PrintFormat(hOutput,
		            "\r\n"
		            "Thread %lu created and started\r\n",
		            dwThreadId);

		if (!CloseHandle(hThread))
			PrintFormat(hOutput,
			            "CloseHandle() returned error %lu\r\n",
			            GetLastError());
	}

	return TRUE;
}
#else // _DLL
__declspec(dllimport)
DWORD	WINAPI	ThreadProc(LPVOID lpParameter);

DWORD	CDECL	mainCRTStartup(VOID)
{
	DWORD	dwError = ERROR_SUCCESS;
	DWORD	dwThreadId = GetCurrentThreadId();
	DWORD	dwThread;
	HANDLE	hThread;
	HANDLE	hOutput = GetStdHandle(STD_OUTPUT_HANDLE);
	HMODULE	hCaller;
	DWORD	dwCaller;
	CHAR	szCaller[MAX_PATH];
	CHAR	szModule[MAX_PATH];
	DWORD	dwModule = GetModuleFileName((HMODULE) &__ImageBase, szModule, sizeof(szModule));

	if (hOutput == INVALID_HANDLE_VALUE)
		return GetLastError();

	if (!GetModuleHandleEx(GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS | GET_MODULE_HANDLE_EX_FLAG_UNCHANGED_REFCOUNT,
	                       _ReturnAddress(),
	                       &hCaller))
		szCaller[0] = '\0';
	else
	{
		dwCaller = GetModuleFileName(hCaller, szCaller, sizeof(szCaller));

		if (dwCaller < sizeof(szCaller))
			szCaller[dwCaller] = '\0';
	}

	if (dwModule < sizeof(szModule))
		szModule[dwModule] = '\0';

	PrintFormat(hOutput,
	            "\a\r\n"
	            __FUNCTION__ "() function @ 0x%p\r\n"
	            "\tCalled module  @ 0x%p = %hs\r\n"
	            "\tCalling module @ 0x%p = %hs\r\n"
	            "\tReturn address @ 0x%p = 0x%p\r\n"
	            "\tThread id = %lu\r\n"
	            "\tTLS index = %ld\r\n"
	            "\tTLS value = 0x%p\r\n"
	            "\tTLS array @ 0x%p\r\n"
	            "\tTLS block @ 0x%p\r\n"
	            "\tTLS dword @ 0x%p = \"%.4hs\"\r\n"
	            "\tTLS directory     @ 0x%p\r\n"
	            "\t\tStart     @ 0x%p\r\n"
	            "\t\tEnd       @ 0x%p\r\n"
	            "\t\tIndex     @ 0x%p\r\n"
	            "\t\tCallbacks @ 0x%p\r\n"
	            "\t\tZerofill  = 0x%08lX = \"%.4hs\"\r\n"
	            "\t\tAlignment = 0x%08lX\r\n" + (dwTLS == 'MSVC'),
	            mainCRTStartup,
	            &__ImageBase, szModule,
	            hCaller, szCaller,
	            _AddressOfReturnAddress(), _ReturnAddress(),
	            dwThreadId,
	            _tls_index,
	            TlsGetValue(_tls_index),
#ifdef _M_IX86
	            __readfsdword(44),
	            ((LPVOID *) __readfsdword(44))[_tls_index],
#elif _M_AMD64
	            __readgsqword(88),
	            ((LPVOID *) __readgsqword(88))[_tls_index],
#else
#error Only I386 and AMD64 supported!
#endif
	            &dwTLS, &dwTLS,
	            &_tls_used,
	            _tls_used.StartAddressOfRawData,
	            _tls_used.EndAddressOfRawData,
	            _tls_used.AddressOfIndex,
	            _tls_used.AddressOfCallBacks,
	            _tls_used.SizeOfZeroFill, &_tls_used.SizeOfZeroFill,
	            _tls_used.Characteristics);

	hThread = CreateThread((LPSECURITY_ATTRIBUTES) NULL,
	                       (SIZE_T) 65536,
	                       ThreadProc,
	                       hOutput,
	                       0UL,
	                       &dwThreadId);

	if (hThread == NULL)
		PrintFormat(hOutput,
		            "CreateThread() returned error %lu\r\n",
		            dwError = GetLastError());
	else
	{
		PrintFormat(hOutput,
		            "\r\n"
		            "Thread %lu created and started\r\n",
		            dwThreadId);

		if (WaitForSingleObject(hThread, INFINITE) == WAIT_FAILED)
			PrintFormat(hOutput,
			            "WaitForSingleObject() returned error %lu\r\n",
			            dwError = GetLastError());

		if (!GetExitCodeThread(hThread, &dwThread))
			PrintFormat(hOutput,
			            "GetExitCodeThread() returned error %lu\r\n",
			            dwError = GetLastError());
		else
			PrintFormat(hOutput,
			            "\r\n"
			            "Thread %lu exited with code %lu\r\n",
			            dwThreadId, dwThread);

		if (!CloseHandle(hThread))
			PrintFormat(hOutput,
			            "CloseHandle() returned error %lu\r\n",
			            GetLastError());
	}

	return dwError;
}
#endif // _DLL

Save the ANSI C source presented above as tls-demo.c in an arbitrary, preferably empty directory, then execute the following 6 command lines to compile and link it a first time to generate the DLL tls-demo.dll and its import library tls-demo.lib for the AMD64 platform, to compile and link it a second time to generate the image file tls-demo.exe for the AMD64 platform too, and finally execute the latter:

SET CL=/GAFy /Oisy /W4 /Zl
SET LINK=/DEFAULTLIB:kernel32.lib /DEFAULTLIB:user32.lib /SECTION:.tls,!w
CL.EXE /LD /MD tls-demo.c
SET LINK=/DEFAULTLIB:kernel32.lib /DEFAULTLIB:user32.lib /ENTRY:mainCRTStartup /SECTION:.tls,!w /SUBSYSTEM:CONSOLE
CL.EXE tls-demo.c tls-demo.lib
.\tls-demo.exe

For details and reference see the MSDN articles Compiler Options and Linker Options.

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: the command lines can be copied and pasted as block into a Command Processor window!

Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

tls-demo.c
tls-demo.c(108) : warning C4047: 'initializing' : 'ULONGLONG' differs in levels of indirection from 'DWORD *'
tls-demo.c(109) : warning C4047: 'initializing' : 'ULONGLONG' differs in levels of indirection from 'DWORD *'
tls-demo.c(110) : warning C4047: 'initializing' : 'ULONGLONG' differs in levels of indirection from 'DWORD *'
tls-demo.c(111) : warning C4047: 'initializing' : 'ULONGLONG' differs in levels of indirection from 'const PIMAGE_TLS_CALLBACK *'

Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

/DEFAULTLIB:kernel32.lib /DEFAULTLIB:user32.lib /SECTION:.tls,!w
/out:tls-demo.dll
/dll
/implib:tls-demo.lib
tls-demo.obj
   Creating library tls-demo.lib and object tls-demo.exp

Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

tls-demo.c
tls-demo.c(108) : warning C4047: 'initializing' : 'ULONGLONG' differs in levels of indirection from 'DWORD *'
tls-demo.c(109) : warning C4047: 'initializing' : 'ULONGLONG' differs in levels of indirection from 'DWORD *'
tls-demo.c(110) : warning C4047: 'initializing' : 'ULONGLONG' differs in levels of indirection from 'DWORD *'
tls-demo.c(111) : warning C4047: 'initializing' : 'ULONGLONG' differs in levels of indirection from 'const PIMAGE_TLS_CALLBACK *'

Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

/DEFAULTLIB:kernel32.lib /DEFAULTLIB:user32.lib /ENTRY:mainCRTStartup /SECTION:.tls,!w /SUBSYSTEM:CONSOLE
/out:tls-demo.exe
tls-demo.obj
tls-demo.lib

TLSCallback() function @ 0x000007FEFACA10D0
	Called module  @ 0x000007FEFACA0000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll
	Return address @ 0x000000000017F038 = 0x0000000077845078
	Arguments:
		Module = 0x000007FEFACA0000
		Reason = 1 (process attach)
		Unused = 0x0000000000000000
	Thread id = 7544

_DllMainCRTStartup() function @ 0x000007FEFACA1384
	Called module  @ 0x000007FEFACA0000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll
	Return address @ 0x000000000017F0A8 = 0x0000000077837C3E
	Arguments:
		Module = 0x000007FEFACA0000
		Reason = 1 (process attach)
		Unused = 0x000000000017F830
	Thread id = 7544
	TLS index = 1
	TLS value = 0x0000000000000000
	TLS array @ 0x00000000002C3280
	TLS block @ 0x00000000002EA590
	TLS dword @ 0x00000000002C32D4 = "CVSM"
	TLS directory     @ 0x000007FEFACA20E0
		Start     @ 0x000007FEFACA5000
		End       @ 0x000007FEFACA5018
		Index     @ 0x000007FEFACA3000
		Callbacks @ 0x000007FEFACA20D0
		Zerofill  = 0x564F4944 = "DIOV"
		Alignment = 0x00000000

Thread 11820 created and started

TLSCallback() function @ 0x000000013F8910D0
	Called module  @ 0x000000013F890000 = C:\Users\Stefan\Desktop\tls-demo.exe
	Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll
	Return address @ 0x000000000017F038 = 0x0000000077845078
	Arguments:
		Module = 0x000000013F890000
		Reason = 1 (process attach)
		Unused = 0x0000000000000000
	Thread id = 7544

mainCRTStartup() function @ 0x000000013F891258
	Called module  @ 0x000000013F890000 = C:\Users\Stefan\Desktop\tls-demo.exe
	Calling module @ 0x00000000776E0000 = C:\Windows\system32\kernel32.dll
	Return address @ 0x000000000017FC88 = 0x00000000776F556D
	Thread id = 7544
	TLS index = 0
	TLS value = 0x0000000000000000
	TLS array @ 0x00000000002C3280
	TLS block @ 0x00000000002C32D0
	TLS dword @ 0x00000000002C32D4 = "CVSM"
	TLS directory     @ 0x000000013F892100
		Start     @ 0x000000013F895000
		End       @ 0x000000013F895018
		Index     @ 0x000000013F893000
		Callbacks @ 0x000000013F8920F0
		Zerofill  = 0x564F4944 = "DIOV"
		Alignment = 0x00000000

Thread 11888 created and started

TLSCallback() function @ 0x000007FEFACA10D0
	Called module  @ 0x000007FEFACA0000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll
	Return address @ 0x000000000201F458 = 0x0000000077845078
	Arguments:
		Module = 0x000007FEFACA0000
		Reason = 2 (thread attach)
		Unused = 0x0000000000000000
	Thread id = 11888

_DllMainCRTStartup() function @ 0x000007FEFACA1384
	Called module  @ 0x000007FEFACA0000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll
	Return address @ 0x000000000201F4C8 = 0x00000000778383CC
	Arguments:
		Module = 0x000007FEFACA0000
		Reason = 2 (thread attach)
		Unused = 0x0000000000000000
	Thread id = 11888

TLSCallback() function @ 0x000000013F8910D0
	Called module  @ 0x000000013F890000 = C:\Users\Stefan\Desktop\tls-demo.exe
	Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll
	Return address @ 0x000000000201F458 = 0x0000000077845078
	Arguments:
		Module = 0x000000013F890000
		Reason = 2 (thread attach)
		Unused = 0x0000000000000000
	Thread id = 11888

ThreadProc() function @ 0x000007FEFACA1258
	Called module  @ 0x000007FEFACA0000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x00000000776E0000 = C:\Windows\system32\kernel32.dll
	Return address @ 0x000000000201FAE8 = 0x00000000776F556D
	Parameter = 0x0000000000000070
	Thread id = 11888

TLSCallback() function @ 0x000007FEFACA10D0
	Called module  @ 0x000007FEFACA0000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll
	Return address @ 0x000000000201F688 = 0x0000000077845078
	Arguments:
		Module = 0x000007FEFACA0000
		Reason = 3 (thread detach)
		Unused = 0x0000000000000000
	Thread id = 11888

_DllMainCRTStartup() function @ 0x000007FEFACA1384
	Called module  @ 0x000007FEFACA0000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll
	Return address @ 0x000000000201F6F8 = 0x0000000077838785
	Arguments:
		Module = 0x000007FEFACA0000
		Reason = 3 (thread detach)
		Unused = 0x0000000000000000
	Thread id = 11888

TLSCallback() function @ 0x000000013F8910D0
	Called module  @ 0x000000013F890000 = C:\Users\Stefan\Desktop\tls-demo.exe
	Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll
	Return address @ 0x000000000201F688 = 0x0000000077845078
	Arguments:
		Module = 0x000000013F890000
		Reason = 3 (thread detach)
		Unused = 0x0000000000000000
	Thread id = 11888

Thread 11888 exited with code 0

TLSCallback() function @ 0x000007FEFACA10D0
	Called module  @ 0x000007FEFACA0000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll
	Return address @ 0x0000000001E0F2A8 = 0x0000000077845078
	Arguments:
		Module = 0x000007FEFACA0000
		Reason = 2 (thread attach)
		Unused = 0x0000000000000000
	Thread id = 11820

_DllMainCRTStartup() function @ 0x000007FEFACA1384
	Called module  @ 0x000007FEFACA0000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll
	Return address @ 0x0000000001E0F318 = 0x00000000778383CC
	Arguments:
		Module = 0x000007FEFACA0000
		Reason = 2 (thread attach)
		Unused = 0x0000000000000000
	Thread id = 11820

TLSCallback() function @ 0x000000013F8910D0
	Called module  @ 0x000000013F890000 = C:\Users\Stefan\Desktop\tls-demo.exe
	Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll
	Return address @ 0x0000000001E0F2A8 = 0x0000000077845078
	Arguments:
		Module = 0x000000013F890000
		Reason = 2 (thread attach)
		Unused = 0x0000000000000000
	Thread id = 11820

TLSCallback() function @ 0x000007FEFACA10D0
	Called module  @ 0x000007FEFACA0000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll
	Return address @ 0x000000000017F758 = 0x0000000077845078
	Arguments:
		Module = 0x000007FEFACA0000
		Reason = 3 (thread detach)
		Unused = 0x0000000000000000
	Thread id = 7544

_DllMainCRTStartup() function @ 0x000007FEFACA1384
	Called module  @ 0x000007FEFACA0000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll
	Return address @ 0x000000000017F7C8 = 0x0000000077838785
	Arguments:
		Module = 0x000007FEFACA0000
		Reason = 3 (thread detach)
		Unused = 0x0000000000000000
	Thread id = 7544

TLSCallback() function @ 0x000000013F8910D0
	Called module  @ 0x000000013F890000 = C:\Users\Stefan\Desktop\tls-demo.exe
	Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll
	Return address @ 0x000000000017F758 = 0x0000000077845078
	Arguments:
		Module = 0x000000013F890000
		Reason = 3 (thread detach)
		Unused = 0x0000000000000000
	Thread id = 7544

ThreadProc() function @ 0x000007FEFACA1258
	Called module  @ 0x000007FEFACA0000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x00000000776E0000 = C:\Windows\system32\kernel32.dll
	Return address @ 0x0000000001E0F938 = 0x00000000776F556D
	Parameter = 0x0000000000000070
	Thread id = 11820

TLSCallback() function @ 0x000007FEFACA10D0
	Called module  @ 0x000007FEFACA0000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll
	Return address @ 0x0000000001E0F488 = 0x0000000077845078
	Arguments:
		Module = 0x000007FEFACA0000
		Reason = 0 (process detach)
		Unused = 0x0000000000000000
	Thread id = 11820

_DllMainCRTStartup() function @ 0x000007FEFACA1384
	Called module  @ 0x000007FEFACA0000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll
	Return address @ 0x0000000001E0F4F8 = 0x000000007783775B
	Arguments:
		Module = 0x000007FEFACA0000
		Reason = 0 (process detach)
		Unused = 0x0000000000000001
	Thread id = 11820

TLSCallback() function @ 0x000000013F8910D0
	Called module  @ 0x000000013F890000 = C:\Users\Stefan\Desktop\tls-demo.exe
	Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll
	Return address @ 0x0000000001E0F488 = 0x0000000077845078
	Arguments:
		Module = 0x000000013F890000
		Reason = 0 (process detach)
		Unused = 0x0000000000000000
	Thread id = 11820

The program works as documented and intended: the variable dwTLS is initialised with the ASCII text CVSM, the TLSCallback() function runs on the secondary thread 11820 and the tertiary thread 11888 before their ThreadProc() function and after the latter returns, and it also runs on the primary thread 7544 before the entry point functions of both the DLL and the program as well as after the latter returns from its entry point function.

Note: the primary thread 7544 exits before the secondary thread 11820 here; as documented in the MSDN article Terminating a Process, the program terminates with its last thread. See also ExitProcess().

Note: the MSDN article Terminating a Thread specifies that threads are terminated upon return of their ThreadProc() callback function. See also ExitThread().

Now (attempt to) build this application for the i386 platform, using the same command lines as before:

SET CL=/GAFy /Oisy /W4 /Zl
SET LINK=/DEFAULTLIB:kernel32.lib /DEFAULTLIB:user32.lib /SECTION:.tls,!w
CL.EXE /LD /MD tls-demo.c
[…]

Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.

tls-demo.c
tls-demo.c(106) : warning C4047: 'initializing' : 'DWORD' differs in levels of indirection from 'DWORD *'
tls-demo.c(107) : warning C4047: 'initializing' : 'DWORD' differs in levels of indirection from 'DWORD *'
tls-demo.c(108) : warning C4047: 'initializing' : 'DWORD' differs in levels of indirection from 'DWORD *'
tls-demo.c(109) : warning C4047: 'initializing' : 'DWORD' differs in levels of indirection from 'const PIMAGE_TLS_CALLBACK *'

Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

/DEFAULTLIB:kernel32.lib /DEFAULTLIB:user32.lib /SECTION:.tls,!w
/out:tls-demo.dll
/dll
/implib:tls-demo.lib
tls-demo.obj
   Creating library tls-demo.lib and object tls-demo.exp
tls-demo.obj : error LNK2019: unresolved external symbol __tls_array referenced in function __DllMainCRTStartup@12
tls-demo.dll : fatal error LNK1120: 1 unresolved externals

OUCH: due to the braindead behaviour of the Visual C compiler for the i386 platform, which references the symbol __tls_array in the generated machine code instead of using its fixed value 44, this build fails!
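
Note: the lookup which the generated code performs can be modelled in portable C; the following is only a sketch under stated assumptions, not the code emitted by the compiler. The type tls_model_t and the function tls_address() are hypothetical names; on the i386 platform the real TLS array pointer is read from offset 44 (the value of __tls_array) of the TEB, indexed with the module's _tls_index, and the offset of the variable within the .tls section is added to the address of the selected TLS block.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the per-thread data read by the generated code;
   on i386 the real pointer lives at offset 44 of the TEB */
typedef struct {
    void **tls_array;   /* 'ThreadLocalStoragePointer': one slot per module */
} tls_model_t;

/* Address of a TLS variable: TLS array -> TLS block of the module
   selected by its _tls_index -> variable at its .tls section offset */
void *tls_address(const tls_model_t *thread, unsigned long tls_index, size_t offset)
{
    char *block;

    assert(thread != NULL && thread->tls_array != NULL);
    block = (char *) thread->tls_array[tls_index];
    return block + offset;
}
```

With TLS index 0 and offset 4, this model yields an address 4 bytes past the TLS block, which matches the relation between the TLS block and TLS dword addresses printed by the AMD64 program above.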

Implementation in i386 Assembler

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.model	flat, C

	public	_tls_array
_tls_array equ	44		; offset of 'ThreadLocalStoragePointer' member in TEB;
				;  symbol referenced in code generated by the compiler!
_tls_32	struct	4
	dword	offset _tls_begin
	dword	offset _tls_end
	dword	offset _tls_index
	dword	offset _tls_start
	dword	'VOID'		; BUG: the module loader does NOT support the 'SizeOfZeroFill' member!
	dword	0
_tls_32	ends

_tls_bss segment alias(".bss$T") dword read write 'BSS'
	public	_tls_index
_tls_index dword ?		; assigned by the module loader!
_tls_bss ends

_tls_note segment alias(".comment") discard info read 'INFO'
	byte	"(C)opyright 2004-2024, Stefan Kanthak"
_tls_note ends

_tls_info segment alias(".drectve") discard info read 'INFO'
	byte	"/MERGE:.ptr=.rdata /SECTION:.tls,!w"
_tls_info ends

_tls_start segment alias(".ptr") dword read 'CONST'
_tls_start ends

_tls_stop segment alias(".ptr$~~~") dword read 'CONST'
	dword	0		; callback function array terminator
_tls_stop ends

_tls	segment alias(".rdata$T") dword read 'CONST'
	public	_tls_used
_tls_used _tls_32 <>		; IMAGE_TLS_DIRECTORY32
_tls	ends

_tls_begin segment alias(".tls") para read write 'DATA'
_tls_begin ends

_tls_end segment alias(".tls$~~~") byte read write 'DATA'
_tls_end ends
	end

Save the i386 assembler source presented above as i386-tls.asm in the directory where you created the object library i386.lib before, then execute the following 3 command lines to generate the object file i386-tls.obj and add it to the existing object library i386.lib:

SET ML=/c /safeseh /W3 /X
ML.EXE i386-tls.asm
LINK.EXE /LIB /OUT:i386.lib i386.lib i386-tls.obj

For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: the command lines can be copied and pasted as block into a Command Processor window!

Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: i386-tls.asm

Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

Move the ANSI C source file tls-demo.c created before into the current directory, then execute the following 6 command lines to compile and link it a first time with the TLS support module from the object library i386.lib to generate the DLL tls-demo.dll and its import library tls-demo.lib for the i386 platform, to compile and link it a second time to generate the image file tls-demo.exe for the i386 platform too, and finally execute the latter:

SET CL=/c /DLIBRARY /GAFy /Oisy /W4 /Zl
SET LINK=/DEFAULTLIB:i386.lib /DEFAULTLIB:kernel32.lib /DEFAULTLIB:user32.lib
CL.EXE /LD /MD tls-demo.c
SET LINK=/DEFAULTLIB:i386.lib /DEFAULTLIB:kernel32.lib /DEFAULTLIB:user32.lib /ENTRY:mainCRTStartup /SUBSYSTEM:CONSOLE
CL.EXE tls-demo.c tls-demo.lib
.\tls-demo.exe

Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.

tls-demo.c

Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

/DEFAULTLIB:i386.lib /DEFAULTLIB:kernel32.lib /DEFAULTLIB:user32.lib
/out:tls-demo.dll
/dll
/implib:tls-demo.lib
tls-demo.obj
   Creating library tls-demo.lib and object tls-demo.exp

Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.

tls-demo.c

Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

/DEFAULTLIB:i386.lib /DEFAULTLIB:kernel32.lib /DEFAULTLIB:user32.lib /ENTRY:mainCRTStartup /SUBSYSTEM:CONSOLE
/out:tls-demo.exe
tls-demo.obj
tls-demo.lib

TLSCallback() function @ 0x70351078
	Called module  @ 0x70350000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll
	Return address @ 0x004AF390 = 0x779F9280
	Arguments:
		Module = 0x70350000
		Reason = 1 (process attach)
		Unused = 0x00000000
	Thread id = 1724

_DllMainCRTStartup() function @ 0x70351240
	Called module  @ 0x70350000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll
	Return address @ 0x004AF3CC = 0x779F9280
	Arguments:
		Module = 0x70350000
		Reason = 1 (process attach)
		Unused = 0x004AF6D0
	Thread id = 1724
	TLS index = 1
	TLS value = 0x00000000
	TLS array @ 0x007F20D0
	TLS block @ 0x0080B728
	TLS dword @ 0x007F4FC8 = "CVSM"
	TLS directory     @ 0x70352468
		Start     @ 0x70354000
		End       @ 0x70354004
		Index     @ 0x70353000
		Callbacks @ 0x70352088
		Zerofill  = 0x564F4944 = "DIOV"
		Alignment = 0x00000000

Thread 2716 created and started

TLSCallback() function @ 0x00331078
	Called module  @ 0x00330000 = C:\Users\Stefan\Desktop\tls-demo.exe
	Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll
	Return address @ 0x004AF390 = 0x779F9280
	Arguments:
		Module = 0x00330000
		Reason = 1 (process attach)
		Unused = 0x00000000
	Thread id = 1724

TLSCallback() function @ 0x70351078
	Called module  @ 0x70350000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll
	Return address @ 0x007CF720 = 0x779F9280
	Arguments:
		Module = 0x70350000
		Reason = 2 (thread attach)
		Unused = 0x00000000
	Thread id = 2716

_DllMainCRTStartup() function @ 0x70351240
	Called module  @ 0x70350000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll
	Return address @ 0x007CF75C = 0x779F9280
	Arguments:
		Module = 0x70350000
		Reason = 2 (thread attach)
		Unused = 0x00000000
	Thread id = 2716

TLSCallback() function @ 0x00331078
	Called module  @ 0x00330000 = C:\Users\Stefan\Desktop\tls-demo.exe
	Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll
	Return address @ 0x007CF720 = 0x779F9280
	Arguments:
		Module = 0x00330000
		Reason = 2 (thread attach)
		Unused = 0x00000000
	Thread id = 2716

mainCRTStartup() function @ 0x0033116B
	Called module  @ 0x00330000 = C:\Users\Stefan\Desktop\tls-demo.exe
	Calling module @ 0x774D0000 = C:\Windows\syswow64\kernel32.dll
	Return address @ 0x004AF938 = 0x774E343D
	Thread id = 1724
	TLS index = 0
	TLS value = 0x00000000
	TLS array @ 0x007F20D0
	TLS block @ 0x007F4FC8
	TLS dword @ 0x007F4FC8 = "CVSM"
	TLS directory     @ 0x00332400
		Start     @ 0x00334000
		End       @ 0x00334004
		Index     @ 0x00333000
		Callbacks @ 0x00332098
		Zerofill  = 0x564F4944 = "DIOV"
		Alignment = 0x00000000

ThreadProc() function @ 0x7035116B
	Called module  @ 0x70350000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x774D0000 = C:\Windows\syswow64\kernel32.dll
	Return address @ 0x007CFAF0 = 0x774E343D
	Parameter = 0x00000074
	Thread id = 2716

TLSCallback() function @ 0x70351078
	Called module  @ 0x70350000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll
	Return address @ 0x007CF7B4 = 0x779F9280
	Arguments:
		Module = 0x70350000
		Reason = 3 (thread detach)
		Unused = 0x00000000
	Thread id = 2716

_DllMainCRTStartup() function @ 0x70351240
	Called module  @ 0x70350000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll
	Return address @ 0x007CF7F0 = 0x779F9280
	Arguments:
		Module = 0x70350000
		Reason = 3 (thread detach)
		Unused = 0x00000000
	Thread id = 2716

TLSCallback() function @ 0x00331078
	Called module  @ 0x00330000 = C:\Users\Stefan\Desktop\tls-demo.exe
	Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll
	Return address @ 0x007CF7B4 = 0x779F9280
	Arguments:
		Module = 0x00330000
		Reason = 3 (thread detach)
		Unused = 0x00000000
	Thread id = 2716

Thread 11748 created and started

TLSCallback() function @ 0x70351078
	Called module  @ 0x70350000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll
	Return address @ 0x021AFBA4 = 0x779F9280
	Arguments:
		Module = 0x70350000
		Reason = 2 (thread attach)
		Unused = 0x00000000
	Thread id = 11748

_DllMainCRTStartup() function @ 0x70351240
	Called module  @ 0x70350000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll
	Return address @ 0x021AFBE0 = 0x779F9280
	Arguments:
		Module = 0x70350000
		Reason = 2 (thread attach)
		Unused = 0x00000000
	Thread id = 11748

TLSCallback() function @ 0x00331078
	Called module  @ 0x00330000 = C:\Users\Stefan\Desktop\tls-demo.exe
	Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll
	Return address @ 0x021AFBA4 = 0x779F9280
	Arguments:
		Module = 0x00330000
		Reason = 2 (thread attach)
		Unused = 0x00000000
	Thread id = 11748

ThreadProc() function @ 0x7035116B
	Called module  @ 0x70350000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x774D0000 = C:\Windows\syswow64\kernel32.dll
	Return address @ 0x021AFF74 = 0x774E343D
	Parameter = 0x00000074
	Thread id = 11748

TLSCallback() function @ 0x70351078
	Called module  @ 0x70350000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll
	Return address @ 0x021AFC38 = 0x779F9280
	Arguments:
		Module = 0x70350000
		Reason = 3 (thread detach)
		Unused = 0x00000000
	Thread id = 11748

_DllMainCRTStartup() function @ 0x70351240
	Called module  @ 0x70350000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll
	Return address @ 0x021AFC74 = 0x779F9280
	Arguments:
		Module = 0x70350000
		Reason = 3 (thread detach)
		Unused = 0x00000000
	Thread id = 11748

TLSCallback() function @ 0x00331078
	Called module  @ 0x00330000 = C:\Users\Stefan\Desktop\tls-demo.exe
	Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll
	Return address @ 0x021AFC38 = 0x779F9280
	Arguments:
		Module = 0x00330000
		Reason = 3 (thread detach)
		Unused = 0x00000000
	Thread id = 11748

Thread 11748 exited with code 0

TLSCallback() function @ 0x70351078
	Called module  @ 0x70350000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll
	Return address @ 0x004AF5CC = 0x779F9280
	Arguments:
		Module = 0x70350000
		Reason = 0 (process detach)
		Unused = 0x00000000
	Thread id = 1724

_DllMainCRTStartup() function @ 0x70351240
	Called module  @ 0x70350000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll
	Return address @ 0x004AF608 = 0x779F9280
	Arguments:
		Module = 0x70350000
		Reason = 0 (process detach)
		Unused = 0x00000001
	Thread id = 1724

TLSCallback() function @ 0x00331078
	Called module  @ 0x00330000 = C:\Users\Stefan\Desktop\tls-demo.exe
	Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll
	Return address @ 0x004AF5CC = 0x779F9280
	Arguments:
		Module = 0x00330000
		Reason = 0 (process detach)
		Unused = 0x00000000
	Thread id = 1724

With the object module i386-tls.obj, program and DLL work as documented and intended now, exhibiting the insignificant difference that the program terminates with the primary thread 1724 here.

Implementation in AMD64 Assembler

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

_tls_64	struct	8
	qword	offset _tls_begin
	qword	offset _tls_end
	qword	offset _tls_index
	qword	offset _tls_start
	dword	'VOID'		; BUG: the module loader does NOT support the 'SizeOfZeroFill' member!
	dword	0
_tls_64	ends

_bss	segment alias(".bss$T") dword read write 'BSS'
	public	_tls_index
_tls_index dword ?		; assigned by the module loader!
_bss	ends

_note	segment alias(".comment") discard info read 'INFO'
	byte	"(C)opyright 2004-2024, Stefan Kanthak"
_note	ends

_linker	segment alias(".drectve") discard info read 'INFO'
	byte	"/MERGE:.ptr=.rdata /SECTION:.tls,!w"
_linker	ends

_start	segment alias(".ptr") para read 'CONST'
_tls_start equ	$
_start	ends

_stop	segment alias(".ptr$~~~") read 'CONST'
	qword	0		; callback function array terminator
_stop	ends

_const	segment alias(".rdata$T") para read 'CONST'
	public	_tls_used
_tls_used _tls_64 <>		; IMAGE_TLS_DIRECTORY64
_const	ends

_begin	segment alias(".tls") para read write 'DATA'
_tls_begin equ	$
_begin	ends

_end	segment alias(".tls$~~~") byte read write 'DATA'
_tls_end equ	$
_end	ends
	end

Save the AMD64 assembler source presented above as amd64-tls.asm in the directory where you created the object library amd64.lib before, then execute the following 3 command lines to generate the object file amd64-tls.obj and add it to the existing object library amd64.lib:

SET ML=/c /W3 /X
ML64.EXE amd64-tls.asm
LINK.EXE /LIB /OUT:amd64.lib amd64.lib amd64-tls.obj

For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: the command lines can be copied and pasted as block into a Command Processor window!

Microsoft (R) Macro Assembler (x64) Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: amd64-tls.asm

Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

Move the ANSI C source file tls-demo.c created before into the current directory, then execute the following 6 command lines to compile and link it a first time with the TLS support module from the object library amd64.lib to generate the DLL tls-demo.dll and its import library tls-demo.lib for the AMD64 platform, to compile and link it a second time to generate the image file tls-demo.exe for the AMD64 platform too, and finally execute the latter:

SET CL=/c /DLIBRARY /GAFy /Oisy /W4 /Zl
SET LINK=/DEFAULTLIB:amd64.lib /DEFAULTLIB:kernel32.lib /DEFAULTLIB:user32.lib
CL.EXE /LD /MD tls-demo.c
SET LINK=/DEFAULTLIB:amd64.lib /DEFAULTLIB:kernel32.lib /DEFAULTLIB:user32.lib /ENTRY:mainCRTStartup /SUBSYSTEM:CONSOLE
CL.EXE tls-demo.c tls-demo.lib
.\tls-demo.exe

Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

tls-demo.c

Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

/DEFAULTLIB:amd64.lib /DEFAULTLIB:kernel32.lib /DEFAULTLIB:user32.lib
/out:tls-demo.dll
/dll
/implib:tls-demo.lib
tls-demo.obj
   Creating library tls-demo.lib and object tls-demo.exp

Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

tls-demo.c

Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

/DEFAULTLIB:amd64.lib /DEFAULTLIB:kernel32.lib /DEFAULTLIB:user32.lib /ENTRY:mainCRTStartup /SUBSYSTEM:CONSOLE
/out:tls-demo.exe
tls-demo.obj
tls-demo.lib

[…]

_load_config_used and __security_check_cookie() Function (/GS Support)

Under the heading The Load Configuration Structure (Image Only), the specification of the PE Format states:

The load configuration structure (IMAGE_LOAD_CONFIG_DIRECTORY) was formerly used in very limited cases in the Windows NT operating system itself to describe various features too difficult or too large to describe in the file header or optional header of the image. Current versions of the Microsoft linker and Windows XP and later versions of Windows use a new version of this structure for 32-bit x86-based systems that include reserved SEH technology.
[…]
The Microsoft linker automatically provides a default load configuration structure to include the reserved SEH data. If the user code already provides a load configuration structure, it must include the new reserved SEH fields. Otherwise, the linker cannot include the reserved SEH data and the image is not marked as containing reserved SEH.

OUCH¹: the statement that the Microsoft linker automatically provides a default load configuration structure is wrong: LINK.EXE neither provides an IMAGE_LOAD_CONFIG_DIRECTORY structure nor reports its omission with an error message!

The documentation of the /SAFESEH linker option gives proper information:

If you link with /NODEFAULTLIB and you want a table of safe exception handlers, you need to supply a load config struct (…) that contains all the entries defined for Visual C++. For example:

#include <windows.h>
extern DWORD_PTR __security_cookie;  /* /GS security cookie */

/*
* The following two names are automatically created by the linker for any
* image that has the safe exception table present.
*/

extern PVOID __safe_se_handler_table[]; /* base of safe handler entry table */
extern BYTE  __safe_se_handler_count;  /* absolute symbol whose address is
                                           the count of table entries */

const IMAGE_LOAD_CONFIG_DIRECTORY32 _load_config_used = {
    sizeof(IMAGE_LOAD_CONFIG_DIRECTORY32),
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    &__security_cookie,
    __safe_se_handler_table,
    (DWORD)(DWORD_PTR) &__safe_se_handler_count
};

The specification of the PE format continues with the following disinformation:

Load Configuration Layout
The load configuration structure has the following layout for 32-bit and 64-bit PE files:

Offset	Size	Field	Description
0	4	Characteristics	Flags that indicate attributes of the file, currently unused.
[…]
54/78	2	Reserved	Must be zero.

OUCH²: the documentation for the IMAGE_LOAD_CONFIG_DIRECTORY structure, however, states that the field at offset 0 stores the size of the structure, and that the field at offset 54 (for 32-bit images) or 78 (for 64-bit images) stores the /DEPENDENTLOADFLAG!

Caveat: only with the GuardFlags member present in the IMAGE_LOAD_CONFIG_DIRECTORY structure, i.e. if its Size member is at least 92 on 32-bit platforms and 148 on 64-bit platforms, the module loader honors the /DEPENDENTLOADFLAG on Windows 10 1607 alias Anniversary Update, codenamed Redstone 1, and later versions of Windows NT!
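
Note: the size requirement from this caveat can be written down as a small check; the following is a minimal sketch in C, where the function dependent_load_flag_honored() and the enumerator names are hypothetical and only the numbers 92 and 148 are taken from the caveat.

```c
#include <assert.h>
#include <stdint.h>

/* Minimum 'Size' values named in the caveat: the structure must reach
   past its 'GuardFlags' member (32-bit: 92 bytes, 64-bit: 148 bytes) */
enum { LOAD_CONFIG_MIN_SIZE_32 = 92, LOAD_CONFIG_MIN_SIZE_64 = 148 };

/* Nonzero if a load configuration with the given 'Size' member is large
   enough for the module loader to honor the /DEPENDENTLOADFLAG value */
int dependent_load_flag_honored(uint32_t size, int is_64bit)
{
    assert(is_64bit == 0 || is_64bit == 1);
    return size >= (uint32_t) (is_64bit ? LOAD_CONFIG_MIN_SIZE_64
                                        : LOAD_CONFIG_MIN_SIZE_32);
}
```

A 32-bit structure of 72 bytes, which ends with the SEHandlerCount member, fails this check although it already contains the field at offset 54.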

The documentation of the /GS compiler option states:

The /GS compiler option requires that the security cookie be initialized before any function that uses the cookie is run. The security cookie must be initialized immediately on entry to an EXE or DLL. This is done automatically if you use the default VCRuntime entry points: mainCRTStartup, wmainCRTStartup, WinMainCRTStartup, wWinMainCRTStartup, or _DllMainCRTStartup. If you use an alternate entry point, you must manually initialize the security cookie by calling __security_init_cookie.

OOPS¹: contrary to the statement that the security cookie must be initialised before any function that uses it is run, the code generated by the compiler requires only that the (arbitrary) value of the security cookie does not change between entry and exit of any function which uses it!

OOPS²: the documentation cited above, however, fails to provide the following (implementation) details:

the security cookie is (typically) initialised during compile time to a well-known non-zero default value, 0xBB40E64E = 3141592654 = π × 10⁹ for 32-bit object modules and 0x00002B992DDFA232 = π × 10¹⁸ ÷ 2¹⁶ for 64-bit object modules;
the module loader assigns a (pseudo-)random value to the security cookie of 32-bit DLLs since Windows XP SP2 if it has this default value;
the module loader assigns a (pseudo-)random value to the security cookie of 32-bit programs since Windows 10 if it has this default value;
the module loader assigns a (pseudo-)random value to the security cookie of 64-bit programs as well as DLLs since Windows 10 if it has this default value and the size of the _load_config_used structure matches the size of the IMAGE_LOAD_CONFIG_DIRECTORY64 structure in the eleventh entry of the IMAGE_DATA_DIRECTORY array in the IMAGE_OPTIONAL_HEADER structure;
the function __security_init_cookie() provided in the MSVCRT libraries (re)initialises the security cookie only if it has this default value or is 0;
TLS callback functions are called before the normal (default) entry points, conventionally named mainCRTStartup, wmainCRTStartup, WinMainCRTStartup, wWinMainCRTStartup and _DllMainCRTStartup!
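The two default values named in the first item can be reproduced with plain integer arithmetic. A minimal sketch; the function names default_cookie_32() and default_cookie_64() are hypothetical, the decimal constants are those used in the listings of this section:

```c
#include <stdint.h>

/* pi * 10**9, rounded to an integer: default of 32-bit object modules */
uint32_t default_cookie_32(void)
{
    return (uint32_t) 3141592654UL;
}

/* 19 decimal digits of pi as written in the listings, shifted right by
   16 bits: default of 64-bit object modules; the upper 16 bits are 0 */
uint64_t default_cookie_64(void)
{
    return 3141592653589793241ULL >> 16;
}
```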

OOPS³: contrary to the second highlighted statement there is no need to call the __security_init_cookie() function to (re)initialise the security cookie any more!
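The behaviour described for __security_init_cookie() amounts to a guarded assignment: a cookie that was already replaced, for example by the module loader, is left alone. A minimal sketch in portable C; init_cookie() and its caller-supplied entropy parameter are hypothetical stand-ins for the real routine and its random source:

```c
#include <stdint.h>

#define DEFAULT_COOKIE_32 0xBB40E64EUL  /* well-known default value */

/* Returns the cookie to use: a value that differs from both the default
   and 0 is kept, otherwise the supplied entropy replaces it */
uint32_t init_cookie(uint32_t cookie, uint32_t entropy)
{
    if (cookie != DEFAULT_COOKIE_32 && cookie != 0)
        return cookie;      /* already initialised, nothing to do */

    return entropy;
}
```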

Note: while compiler and linker reference the security cookie by its symbol name __security_cookie, the module loader references it through the virtual address stored in the SecurityCookie member of the IMAGE_LOAD_CONFIG_DIRECTORY structure.

The MSDN magazine articles Protecting Your Code with Visual C++ Defenses and Visual C++ Support for Stack-Based Buffer Protection provide additional information; see also the MSDN articles strict_gs_check pragma, Security Checks at Runtime and Compile Time and Compiler Security Checks In Depth.

CAVEAT: when an exception is thrown in a function and not handled in place, but by one of the calling functions, i.e. when the function’s epilog is not executed, an overwritten stack cookie is not detected!
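The effect can be reproduced without a compiler-generated cookie. A minimal sketch in portable C, assuming a hand-written canary next to a buffer and longjmp() as a stand-in for an exception that is handled by a calling function; all names are hypothetical:

```c
#include <setjmp.h>
#include <string.h>

static jmp_buf handler;                 /* stands in for the caller's handler */
static int     detected;                /* number of failed cookie checks */
static const unsigned cookie = 0x9E3779B9U;

/* buffer and canary share one object, so the overrun stays inside it */
struct frame { char buffer[8]; unsigned canary; };

static void victim(int unwind)
{
    struct frame f;

    f.canary = cookie;                  /* "prolog": store the cookie */
    memset(&f, 'A', sizeof(f));         /* overrun: buffer and canary overwritten */

    if (unwind)
        longjmp(handler, 1);            /* leaves without executing the "epilog" */

    if (f.canary != cookie)             /* "epilog": check the cookie */
        detected++;
}

/* Calls victim() once and returns the number of detected overruns */
int run(int unwind)
{
    detected = 0;

    if (setjmp(handler) == 0)
        victim(unwind);

    return detected;
}
```

With a normal return the overwritten canary is noticed; when control leaves through longjmp(), the check is skipped and the overrun goes unreported.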

Implementation in ANSI C

Similar to the public symbol __tls_used to locate the IMAGE_TLS_DIRECTORY on the i386 platform and _tls_used on other platforms, the linker locates the IMAGE_LOAD_CONFIG_DIRECTORY structure via the public symbol __load_config_used on the i386 platform and _load_config_used on other platforms.

// Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#define STRICT
#define WIN32_LEAN_AND_MEAN

#include <windows.h>

#ifndef LOAD_LIBRARY_SEARCH_SYSTEM32
#define LOAD_LIBRARY_SEARCH_SYSTEM32	0x00000800UL
#endif

#ifndef _WIN64
#if 0
	DWORD	__security_cookie = 0xBB40E64EUL;
		               // = 3141592654 = 10**9 * pi
#else
const	DWORD	__security_cookie = 2654435769UL;
		               // = 0x9E3779B9UL
		               // = 2**32 / phi
#endif
extern	LPVOID	__safe_se_handler_table[];
extern	BYTE	__safe_se_handler_count;

const	struct	_IMAGE_LOAD_CONFIG_DIRECTORY_32
{
	DWORD	Size;
	DWORD	TimeDateStamp;
	WORD	MajorVersion;
	WORD	MinorVersion;
	DWORD	GlobalFlagsClear;
	DWORD	GlobalFlagsSet;
	DWORD	CriticalSectionDefaultTimeout;
	DWORD	DeCommitFreeBlockThreshold;
	DWORD	DeCommitTotalFreeThreshold;
	DWORD	LockPrefixTable;
	DWORD	MaximumAllocationSize;
	DWORD	VirtualMemoryThreshold;
	DWORD	ProcessHeapFlags;
	DWORD	ProcessAffinityMask;
	WORD	CSDVersion;
#if LCU > 2				// Redstone 1 (1607)
	WORD	DependentLoadFlags;
#else
	WORD	Reserved1;
#endif
	DWORD	EditList;
	DWORD	SecurityCookie;
	DWORD	SEHandlerTable;
	DWORD	SEHandlerCount;
#if LCU > 0				// Threshold 1 (1507)
	DWORD	GuardCFCheckFunctionPointer;
	DWORD	GuardCFDispatchFunctionPointer;
	DWORD	GuardCFFunctionTable;
	DWORD	GuardCFFunctionCount;
	DWORD	GuardFlags;
#if LCU > 1				// Threshold 2 (1511)
	struct	// _IMAGE_LOAD_CONFIG_CODE_INTEGRITY
	{
		WORD	Flags;
		WORD	Catalog;
		DWORD	CatalogOffset;
		DWORD	Reserved;
	} CodeIntegrity;
#if LCU > 2				// Redstone 1 (1607)
	DWORD	GuardAddressTakenIatEntryTable;
	DWORD	GuardAddressTakenIatEntryCount;
	DWORD	GuardLongJumpTargetTable;
	DWORD	GuardLongJumpTargetCount;
	DWORD	DynamicValueRelocTable;
	DWORD	CHPEMetadataPointer;
#if LCU > 3				// Redstone 2 (1703)
	DWORD	GuardRFFailureRoutine;
	DWORD	GuardRFFailureRoutineFunctionPointer;
	DWORD	DynamicValueRelocTableOffset;
	WORD	DynamicValueRelocTableSection;
	WORD	Reserved2;
	DWORD	GuardRFVerifyStackPointerFunctionPointer;
	DWORD	HotPatchTableOffset;
#if LCU > 4				// Redstone 3 (1709)
	DWORD	Reserved3;
	DWORD	EnclaveConfigurationPointer;
#if LCU > 5				// Redstone 4 (1803)
	DWORD	VolatileMetadataPointer;
#if LCU > 6				// Redstone 5 (1809)
	DWORD	GuardEHContinuationTable;
	DWORD	GuardEHContinuationCount;
					// Titanium (1903)
					// Vanadium (1909)
					// Vibranium 1 (2004)
					// Vibranium 2 (20H2)
#if LCU > 7				// Vibranium 3 (21H1)
	DWORD	GuardXFGCheckFunctionPointer;
	DWORD	GuardXFGDispatchFunctionPointer;
	DWORD	GuardXFGTableDispatchFunctionPointer;
#if LCU > 8				// Vibranium 4 (21H2)
	DWORD	CastGuardOsDeterminedFailureMode;
#if LCU > 9				// Vibranium 5 (22H2)
	DWORD	GuardMemcpyFunctionPointer;
#endif
#endif
#endif
#endif
#endif
#endif
#endif
#endif
#endif
#endif
} _load_config_used = {sizeof(_load_config_used),
                       'DATE',		// = 2006-04-15 20:15:01 UTC
                       _MSC_VER / 100,
                       _MSC_VER % 100,
                       0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL,
                       0U,
                       LOAD_LIBRARY_SEARCH_SYSTEM32,
                       0UL,
                       &__security_cookie,
                       __safe_se_handler_table,
                       &__safe_se_handler_count,
                       0UL, 0UL, 0UL, 0UL, 0UL};
#else // _WIN64
#if 0
	DWORD64	__security_cookie = 0x00002B992DDFA232ULL;
		               // = 3141592653589793241 >> 16
		               // = 10**18 / 2**16 * pi
#else
const	DWORD64	__security_cookie = 173961102589770ULL;
		               // = 0x00009E3779B97F4AULL
		               // = 2**48 / phi
#endif
const	struct	_IMAGE_LOAD_CONFIG_DIRECTORY_64
{
	DWORD	Size;
	DWORD	TimeDateStamp;
	WORD	MajorVersion;
	WORD	MinorVersion;
	DWORD	GlobalFlagsClear;
	DWORD	GlobalFlagsSet;
	DWORD	CriticalSectionDefaultTimeout;
	DWORD64	DeCommitFreeBlockThreshold;
	DWORD64	DeCommitTotalFreeThreshold;
	DWORD64	LockPrefixTable;
	DWORD64	MaximumAllocationSize;
	DWORD64	VirtualMemoryThreshold;
	DWORD64	ProcessAffinityMask;
	DWORD	ProcessHeapFlags;
	WORD	CSDVersion;
#if LCU > 2				// Redstone 1 (1607)
	WORD	DependentLoadFlags;
#else
	WORD	Reserved1;
#endif
	DWORD64	EditList;
	DWORD64	SecurityCookie;
	DWORD64	SEHandlerTable;
	DWORD64	SEHandlerCount;
#if LCU > 0				// Threshold 1 (1507)
	DWORD64	GuardCFCheckFunctionPointer;
	DWORD64	GuardCFDispatchFunctionPointer;
	DWORD64	GuardCFFunctionTable;
	DWORD64	GuardCFFunctionCount;
	DWORD	GuardFlags;
#if LCU > 1				// Threshold 2 (1511)
	struct	// _IMAGE_LOAD_CONFIG_CODE_INTEGRITY
	{
		WORD	Flags;
		WORD	Catalog;
		DWORD	CatalogOffset;
		DWORD	Reserved;
	} CodeIntegrity;
#if LCU > 2				// Redstone 1 (1607)
	DWORD64	GuardAddressTakenIatEntryTable;
	DWORD64	GuardAddressTakenIatEntryCount;
	DWORD64	GuardLongJumpTargetTable;
	DWORD64	GuardLongJumpTargetCount;
	DWORD64	DynamicValueRelocTable;
	DWORD64	CHPEMetadataPointer;
#if LCU > 3				// Redstone 2 (1703)
	DWORD64	GuardRFFailureRoutine;
	DWORD64	GuardRFFailureRoutineFunctionPointer;
	DWORD	DynamicValueRelocTableOffset;
	WORD	DynamicValueRelocTableSection;
	WORD	Reserved2;
	DWORD64	GuardRFVerifyStackPointerFunctionPointer;
	DWORD	HotPatchTableOffset;
#if LCU > 4				// Redstone 3 (1709)
	DWORD	Reserved3;
	DWORD64	EnclaveConfigurationPointer;
#if LCU > 5				// Redstone 4 (1803)
	DWORD64	VolatileMetadataPointer;
#if LCU > 6				// Redstone 5 (1809)
	DWORD64	GuardEHContinuationTable;
	DWORD64	GuardEHContinuationCount;
					// Titanium (1903)
					// Vanadium (1909)
					// Vibranium 1 (2004)
					// Vibranium 2 (20H2)
#if LCU > 7				// Vibranium 3 (21H1)
	DWORD64	GuardXFGCheckFunctionPointer;
	DWORD64	GuardXFGDispatchFunctionPointer;
	DWORD64	GuardXFGTableDispatchFunctionPointer;
#if LCU > 8				// Vibranium 4 (21H2)
	DWORD64	CastGuardOsDeterminedFailureMode;
#if LCU > 9				// Vibranium 5 (22H2)
	DWORD64	GuardMemcpyFunctionPointer;
#endif
#endif
#endif
#endif
#endif
#endif
#endif
#endif
#endif
#endif
} _load_config_used = {sizeof(_load_config_used),
                       'TIME',		// = 2014-10-23 18:47:33 UTC
                       _MSC_VER / 100,
                       _MSC_VER % 100,
                       0UL, 0UL, 0UL,
                       0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL,
                       0UL,
                       0U,
                       LOAD_LIBRARY_SEARCH_SYSTEM32,
                       0ULL,
                       &__security_cookie,
                       0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL,
                       0UL};
#endif // _WIN64

__declspec(noreturn)
#ifdef _WIN64
VOID	__security_check_cookie(DWORD64 qwCookie)
{
	if (qwCookie == __security_cookie)
		return;
#else // _WIN64
VOID	__security_check_cookie(DWORD dwCookie)
{
	if (dwCookie == __security_cookie)
		return;
#endif // _WIN64
#ifdef FAST_FAIL_STACK_COOKIE_CHECK_FAILURE
	__fastfail(FAST_FAIL_STACK_COOKIE_CHECK_FAILURE);
#else
#ifdef FAIL_FAST_GENERATE_EXCEPTION_ADDRESS
	RaiseFailFastException((EXCEPTION_RECORD *) NULL, (CONTEXT *) NULL, FAIL_FAST_GENERATE_EXCEPTION_ADDRESS);
#else
	SetUnhandledExceptionFilter(NULL);
	RaiseException(EXCEPTION_STACK_BUFFER_OVERRUN, EXCEPTION_NONCONTINUABLE, 1UL, _AddressOfReturnAddress());
#endif
#pragma comment(lib, "kernel32")
#endif
}

#pragma comment(user, "(C)opyright 2004-2024, Stefan Kanthak")

Note: the __fastfail() intrinsic function is supported since Windows 8, the RaiseFailFastException() function is supported since Windows 7.

Note: see the MSDN articles LoadLibraryEx() function or SetDefaultDllDirectories() function for the values of the DependentLoadFlags member, the articles HeapCreate() function, HeapAlloc() function or HeapReAlloc() function for the values of the ProcessHeapFlags member, and the article Gflags Flag Reference for the values of the GlobalFlagsClear as well as the GlobalFlagsSet member; see also the articles Managing Heap Memory and Global Flag Reference.
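The TimeDateStamp initialisers 'DATE' and 'TIME' of the C source are four-character constants; with the first character in the most significant byte they yield the dates given in its comments. A minimal sketch that avoids the implementation-defined value of a multi-character constant; pack4() and stamp_to_utc() are hypothetical helper names, and time_t is assumed to count seconds since 1970-01-01 00:00:00 UTC:

```c
#include <stdint.h>
#include <time.h>

/* Pack four characters, first character in the most significant byte */
uint32_t pack4(const char *s)
{
    return ((uint32_t) (unsigned char) s[0] << 24)
         | ((uint32_t) (unsigned char) s[1] << 16)
         | ((uint32_t) (unsigned char) s[2] << 8)
         |  (uint32_t) (unsigned char) s[3];
}

/* Format a time stamp as "YYYY-MM-DD hh:mm:ss" (UTC) into out;
   returns out, or NULL if the conversion fails */
const char *stamp_to_utc(uint32_t stamp, char out[20])
{
    time_t    t = (time_t) stamp;
    struct tm *utc = gmtime(&t);

    if (utc == NULL || strftime(out, 20, "%Y-%m-%d %H:%M:%S", utc) == 0)
        return NULL;

    return out;
}
```

For example, pack4("DATE") gives 1145132101, which stamp_to_utc() formats as 2006-04-15 20:15:01.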

Save the ANSI C source presented above as lcu.c in the directory where you created the object library i386.lib before, then execute the following 3 command lines to compile it, write the assembly to the text file lcu.cod and add the generated object file i386-lcu.obj to the existing object library i386.lib:

SET CL=/c /DLCU /FAsc /GAFry /Oxy /W4 /Zl
CL.EXE /Foi386-lcu.obj lcu.c
LINK.EXE /LIB /OUT:i386.lib i386.lib i386-lcu.obj

For details and reference see the MSDN articles Compiler Options and Linker Options.

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: the command lines can be copied and pasted as block into a Command Processor window!

Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.

lcu.c
lcu.c(117) : warning C4047: 'initializing' : 'DWORD' differs in levels of indirection from 'const DWORD *'
lcu.c(118) : warning C4047: 'initializing' : 'DWORD' differs in levels of indirection from 'LPVOID *'
lcu.c(119) : warning C4047: 'initializing' : 'DWORD' differs in levels of indirection from 'BYTE *'

Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

CAVEAT: verify in the assembly written to the text file lcu.cod that the __security_check_cookie() function clobbers at most register ECX upon return to the caller when the stack cookie is intact!

Move the ANSI C source file lcu.c into the directory where you created the object library amd64.lib before, then execute the following 3 command lines to compile it, write the assembly to the text file lcu.cod and add the generated object file amd64-lcu.obj to the object library amd64.lib:

SET CL=/c /DLCU /FAsc /GAFy /Oxy /W4 /Zl
CL.EXE /Foamd64-lcu.obj lcu.c
LINK.EXE /LIB /OUT:amd64.lib amd64.lib amd64-lcu.obj

Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

lcu.c
lcu.c(227) : warning C4047: 'initializing' : 'DWORD64' differs in levels of indirection from 'const DWORD64 *'

Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

CAVEAT: verify in the assembly written to the text file lcu.cod that the __security_check_cookie() function clobbers no register except RCX, R8, R9, R10 and R11 upon return to the caller when the stack cookie is intact!

Implementation in i386 Assembler

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.686
	.model	flat; C

	extern	___safe_se_handler_count :abs
	extern	___safe_se_handler_table :ptr proc

_lcu_32	struct	4
	dword	sizeof _lcu_32
	dword	'VOID'		; 2006-04-21 21:32:06 UTC
	word	@Version / 100
	word	@Version mod 100
	dword	10 dup (0)
	word	0, 2048		; LOAD_LIBRARY_SEARCH_SYSTEM32
	dword	0
	dword	offset ___security_cookie
	dword	offset ___safe_se_handler_table
	dword	offset ___safe_se_handler_count
	dword	5 dup (0)
_lcu_32	ends

	.const

	public	__load_config_used
__load_config_used \
	_lcu_32	<>		; IMAGE_LOAD_CONFIG_DIRECTORY32

	.data

	public	___security_cookie
___security_cookie \
	dword	3141592654

	.code

@__security_check_cookie@4 \
	proc	public		; void __fastcall __security_check_cookie(dword cookie)

	cmp	ecx, ___security_cookie
	jne	short fastfail
	ret
fastfail:
	mov	ecx, 2		; ecx = FAST_FAIL_STACK_COOKIE_CHECK_FAILURE
	int	41
	ud2

@__security_check_cookie@4 \
	endp

___security_init_cookie \
	proc	public		; void __cdecl __security_init_cookie(void)

	mov	eax, ___security_cookie
	cmp	eax, 3141592654
	je	short init

	test	eax, eax
	jne	short exit
init:
	rdtsc			; eax = low dword of time stamp counter,
				; edx = high dword of time stamp counter
	xor	eax, edx	; eax = random number
	mov	___security_cookie, eax
exit:
	ret

___security_init_cookie	\
	endp
	end

Save the i386 assembler source presented above as i386-lcu.asm in the directory where you created the object library i386.lib before, then execute the following 3 command lines to generate the object file i386-lcu.obj and add it to the existing object library i386.lib:

SET ML=/c /safeseh /W3 /X
ML.EXE i386-lcu.asm
LINK.EXE /LIB /OUT:i386.lib i386.lib i386-lcu.obj

For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: the command lines can be copied and pasted as block into a Command Processor window!

Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: i386-lcu.asm

Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

Implementation in AMD64 Assembler

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

_lcu_64	struct	8
	dword	sizeof _lcu_64
	dword	'VOID'		; 2006-04-21 21:32:06 UTC
	word	@Version / 100
	word	@Version mod 100
	dword	0, 0, 0
	qword	0, 0, 0, 0, 0, 0
	dword	0
	word	0, 2048		; LOAD_LIBRARY_SEARCH_SYSTEM32
	qword	0
	qword	offset __security_cookie
	qword	0, 0, 0, 0, 0, 0
	dword	0
_lcu_64	ends

	.const

	public	_load_config_used
_load_config_used \
	_lcu_64	<>		; IMAGE_LOAD_CONFIG_DIRECTORY64

	.data

	public	__security_cookie
__security_cookie \
	qword	3141592653589793241 shr 16

	.code

__security_check_cookie \
	proc	public		; void __security_check_cookie(qword cookie)

	cmp	rcx, __security_cookie
	jne	short fastfail

;;	shr	rcx, 48
;;	jnz	short fastfail

	ret
fastfail:
	mov	ecx, 2		; rcx = FAST_FAIL_STACK_COOKIE_CHECK_FAILURE
	int	41
	ud2

__security_check_cookie \
	endp

__security_init_cookie \
	proc	public		; void __security_init_cookie(void)

	mov	rax, __security_cookie
	mov	rcx, 3141592653589793241 shr 16
	cmp	rcx, rax
	je	short init

	test	rax, rax
	jne	short exit
init:
	rdtsc			; rax = low dword of time stamp counter,
				; rdx = high dword of time stamp counter
	mov	ecx, edx	; rcx = high dword of time stamp counter
	bswap	edx
	imul	rcx, rax	; rcx = high dword of time stamp counter
				;     * low dword of time stamp counter
	bswap	rax
	xor	rax, rdx	; rax = byte-swapped time stamp counter
	mul	rcx
	xor	rax, rdx	; rax = random number
	shr	rax, 16
	mov	__security_cookie, rax
exit:
	ret

__security_init_cookie \
	endp

__GSHandlerCheck \
	proc	private		; int __GSHandlerCheck(void *, void *, void *, void *)

	xor	eax, eax
	inc	eax		; rax = ExceptionContinueSearch
	ret

__GSHandlerCheck \
	endp
	end

Save the AMD64 assembler source presented above as amd64-lcu.asm in the directory where you created the object library amd64.lib before, then execute the following 3 command lines to generate the object file amd64-lcu.obj and add it to the existing object library amd64.lib:

SET ML=/c /W3 /X
ML64.EXE amd64-lcu.asm
LINK.EXE /LIB /OUT:amd64.lib amd64.lib amd64-lcu.obj

For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: the command lines can be copied and pasted as block into a Command Processor window!

Microsoft (R) Macro Assembler (x64) Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: amd64-lcu.asm

Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.
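
The instruction sequence between the labels init and exit of the AMD64 __security_init_cookie routine is easier to follow in C. A minimal sketch, assuming the value of the time stamp counter is passed in as a parameter; mix_cookie(), mul_fold() and bswap32() are hypothetical names, and the 64×64-bit multiplication is spelled out with 32-bit halves to stay within standard C:

```c
#include <stdint.h>

/* Reverse the byte order of a 32-bit value (BSWAP on a 32-bit register) */
static uint32_t bswap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00UL)
         | ((x << 8) & 0x00FF0000UL) | (x << 24);
}

/* Full 64 x 64 -> 128 bit product (MUL), low half XORed with high half */
static uint64_t mul_fold(uint64_t a, uint64_t b)
{
    uint64_t a_lo = (uint32_t) a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t) b, b_hi = b >> 32;
    uint64_t p0 = a_lo * b_lo, p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo, p3 = a_hi * b_hi;
    uint64_t mid = (p0 >> 32) + (uint32_t) p1 + (uint32_t) p2;
    uint64_t lo = (mid << 32) | (uint32_t) p0;
    uint64_t hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);

    return lo ^ hi;
}

/* Derive a cookie from a time stamp counter value, step by step as in
   the assembler source */
uint64_t mix_cookie(uint64_t tsc)
{
    uint32_t low  = (uint32_t) tsc;             /* eax after RDTSC */
    uint32_t high = (uint32_t) (tsc >> 32);     /* edx after RDTSC */
    uint64_t rcx  = (uint64_t) high * low;      /* imul rcx, rax */
    uint64_t swapped = ((uint64_t) bswap32(low) << 32) | bswap32(high);
                                                /* byte-swapped time stamp counter */
    return mul_fold(swapped, rcx) >> 16;        /* mul, xor, shr 16 */
}
```

Because of the final shift, the upper 16 bits of the result are always 0; this is the property the commented-out shr rcx, 48 test in __security_check_cookie would verify.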

Delay Load Support

The following MSDN articles provide details: /DELAYLOAD; Linker Support for Delay-Loaded DLLs; Specifying DLLs to Delay Load; Constraints of Delay Loading DLLs; Binding Imports; Explicitly Unloading a Delay-Loaded DLL; Understanding the Helper Function; Error Handling and Notification; Exceptions; Failure Hooks; Notification Hooks; Structure and Constant Definitions; Developing Your Own Helper Function; Calling Conventions, Parameters, and Return Type; Calculating Necessary Values; Unloading a Delay-Loaded DLL.
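
The three exception codes defined in the C source of this section follow one pattern: severity error, the facility code 0x6D (FACILITY_VISUALCPP in the Visual C++ header delayimp.h) and a Win32 error code in the low 16 bits. A minimal sketch; the helper name delay_load_exception() and the two macro names are hypothetical:

```c
#include <stdint.h>

#define SEVERITY_ERROR_BITS 0xC0000000UL    /* value of ERROR_SEVERITY_ERROR */
#define FACILITY_VCPP       0x6DUL          /* value of FACILITY_VISUALCPP */

/* Compose the exception code raised for a given Win32 error code */
uint32_t delay_load_exception(uint32_t win32_error)
{
    return (uint32_t) (SEVERITY_ERROR_BITS
                     | (FACILITY_VCPP << 16)
                     | (win32_error & 0xFFFFUL));
}
```

With ERROR_INVALID_PARAMETER (87), ERROR_MOD_NOT_FOUND (126) and ERROR_PROC_NOT_FOUND (127) this gives 0xC06D0057, 0xC06D007E and 0xC06D007F.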

Implementation in ANSI C

// Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#define STRICT
#define WIN32_LEAN_AND_MEAN

#include <windows.h>

#pragma comment(lib, "kernel32")

#ifndef EXCEPTION_DELAY_LOAD_INVALID_PARAMETER
#define EXCEPTION_DELAY_LOAD_INVALID_PARAMETER	0xC06D0057
#endif

#ifndef EXCEPTION_DELAY_LOAD_MODULE_NOT_FOUND
#define EXCEPTION_DELAY_LOAD_MODULE_NOT_FOUND	0xC06D007E
#endif

#ifndef EXCEPTION_DELAY_LOAD_ENTRY_NOT_FOUND
#define EXCEPTION_DELAY_LOAD_ENTRY_NOT_FOUND	0xC06D007F
#endif

extern	const	IMAGE_DOS_HEADER	__ImageBase;

typedef	DWORD	RVA;

typedef	enum	dliNotify
{
	dliStartProcessing,
	dliNotePreLoadLibrary,
	dliNotePreGetProcAddress,
	dliFailLoadLib,
	dliFailGetProc,
	dliNoteEndProcessing
} dliNotify;

typedef	struct	DelayLoadDescr
{
	DWORD	dwAttributes;		// 1UL = all members are RVAs
	DWORD	dwDllName;
	DWORD	dwHMODULE;		// RVA of module handle
	DWORD	dwIAT;			// RVA of import address table
	DWORD	dwINT;			// RVA of import name table
	DWORD	dwBoundIAT;		// RVA of optional bound import address table
	DWORD	dwUnloadIAT;		// RVA of optional copy of original import address table
	DWORD	dwTimeStamp;
} DelayLoadDescr;

typedef	struct	DelayLoadProc
{
	BOOL	fImportByName;
	union
	{
		LPCSTR	szProcName;
		DWORD	dwOrdinal;
	};
} DelayLoadProc;

typedef	struct	DelayLoadInfo
{
	DWORD		cb;		// size of structure
	DelayLoadDescr	*pidd;		// raw form of data (everything is there)
	FARPROC		*ppfn;		// points to address of function to load
	LPCSTR		szDll;		// name of DLL
	DelayLoadProc	dlp;		// name or ordinal of function to load
	HMODULE		hmodCur;	// handle of DLL
	FARPROC		pfnCur;		// actual function that will be called
	DWORD		dwLastError;	// error received (if an error notification)
} DelayLoadInfo;

typedef	FARPROC	(WINAPI *PfnDliHook) (dliNotify, DelayLoadInfo *);

BOOL	WINAPI	__FUnloadDelayLoadedDLL2(LPCSTR szDll);

FARPROC	WINAPI	__delayLoadHelper2(DelayLoadDescr *lpDLD, FARPROC *lpfnIATEntry)
{
	HMODULE	hModule;
	HMODULE			*lpHMODULE = (HMODULE *) ((LPBYTE) &__ImageBase + lpDLD->dwHMODULE);
	IMAGE_THUNK_DATA	*lpINT = (IMAGE_THUNK_DATA *) ((LPBYTE) &__ImageBase + lpDLD->dwINT);
	IMAGE_THUNK_DATA	*lpIAT = (IMAGE_THUNK_DATA *) ((LPBYTE) &__ImageBase + lpDLD->dwIAT);
	IMAGE_THUNK_DATA	*lpBoundIAT = (IMAGE_THUNK_DATA *) ((LPBYTE) &__ImageBase + lpDLD->dwBoundIAT);
	DWORD			dwEntry = (IMAGE_THUNK_DATA *) lpfnIATEntry - lpIAT;

	// NOTE: *lpfnIATEntry == lpIAT[dwEntry].u1.Function

	DelayLoadInfo	dli = {sizeof(DelayLoadInfo),
			       lpDLD,
			       lpfnIATEntry,
			       (LPCSTR) &__ImageBase + lpDLD->dwDllName,
			       {!IMAGE_SNAP_BY_ORDINAL(lpINT[dwEntry].u1.Ordinal),
			         IMAGE_SNAP_BY_ORDINAL(lpINT[dwEntry].u1.Ordinal)
			       ? IMAGE_ORDINAL(lpINT[dwEntry].u1.Ordinal)
			       : ((IMAGE_IMPORT_BY_NAME *) ((LPBYTE) &__ImageBase + lpINT[dwEntry].u1.AddressOfData))->Name},
			       *lpHMODULE,
			       (FARPROC) NULL,
			       ERROR_SUCCESS};

	if ((lpDLD->dwAttributes & 1UL) == 0UL)	// members are not RVAs?
	{
		dli.dwLastError = ERROR_INVALID_PARAMETER;

		RaiseException(EXCEPTION_DELAY_LOAD_INVALID_PARAMETER,
		               EXCEPTION_NONCONTINUABLE,
		               1UL,
		               (DWORD_PTR *) &dli);

		return (FARPROC) NULL;
	}

	if (dli.hmodCur == NULL)	// module not yet loaded?
	{
#ifndef LOAD_LIBRARY_SEARCH_SYSTEM32
		dli.hmodCur = LoadLibraryA(dli.szDll);
#else
		dli.hmodCur = LoadLibraryExA(dli.szDll, NULL, LOAD_LIBRARY_SEARCH_SYSTEM32);
#endif
		if (dli.hmodCur == NULL)
		{
			dli.dwLastError = GetLastError();

			RaiseException(EXCEPTION_DELAY_LOAD_MODULE_NOT_FOUND,
			               EXCEPTION_NONCONTINUABLE,
			               1UL,
			               (DWORD_PTR *) &dli);

			return (FARPROC) NULL;
		}
#ifndef _WIN64
		hModule = (HMODULE) InterlockedExchangePointer((LPVOID *) lpHMODULE, dli.hmodCur);
#else
		hModule = (HMODULE) _InterlockedExchangePointer((LPVOID *) lpHMODULE, dli.hmodCur);
#endif
		if (hModule == dli.hmodCur)
			FreeLibrary(dli.hmodCur);
		else
			if (lpDLD->dwUnloadIAT != 0UL)
			{
				// ...
			}
	}

	if ((lpDLD->dwBoundIAT != 0UL) && (lpDLD->dwTimeStamp != 0UL))
	{
		IMAGE_NT_HEADERS	*lpModule = (IMAGE_NT_HEADERS *) ((LPBYTE) dli.hmodCur + ((IMAGE_DOS_HEADER *) dli.hmodCur)->e_lfanew);

		if ((lpModule->Signature == IMAGE_NT_SIGNATURE)
		 && (lpModule->FileHeader.TimeDateStamp == lpDLD->dwTimeStamp)
		 && (lpModule->OptionalHeader.ImageBase == dli.hmodCur))
		{
			dli.pfnCur = (FARPROC) lpBoundIAT[dwEntry].u1.Function;

			if (dli.pfnCur != NULL)
				return *lpfnIATEntry = dli.pfnCur;
		}
	}

	dli.pfnCur = GetProcAddress(dli.hmodCur, dli.dlp.szProcName);

	if (dli.pfnCur != NULL)		// function address resolved?
		return *lpfnIATEntry = dli.pfnCur;

	dli.dwLastError = GetLastError();

	RaiseException(EXCEPTION_DELAY_LOAD_ENTRY_NOT_FOUND,
	               EXCEPTION_NONCONTINUABLE,
	               1UL,
	               (DWORD_PTR *) &dli);

	return (FARPROC) NULL;
}

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.model	flat, C

	extern	__pfnDliDefaultHook2 :ptr proc
	extern	__pfnDliFailureHook2 (__pfnDliDefaultHook2) :ptr proc
	extern	__pfnDliNotifyHook2 (__pfnDliDefaultHook2) :ptr proc
	end

; Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	extern	__pfnDliDefaultHook2 :ptr proc
	extern	__pfnDliFailureHook2 (__pfnDliDefaultHook2) :ptr proc
	extern	__pfnDliNotifyHook2 (__pfnDliDefaultHook2) :ptr proc
	end

// Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#define STRICT
#define WIN32_LEAN_AND_MEAN

#include <windows.h>

#pragma comment(lib, "kernel32")

#ifdef _WIN64
#pragma comment(linker, "/ALTERNATENAME:__pfnDliFailureHook2=__pfnDliDefaultHook2")
#pragma comment(linker, "/ALTERNATENAME:__pfnDliNotifyHook2=__pfnDliDefaultHook2")
#else
#pragma comment(linker, "/ALTERNATENAME:___pfnDliFailureHook2=___pfnDliDefaultHook2")
#pragma comment(linker, "/ALTERNATENAME:___pfnDliNotifyHook2=___pfnDliDefaultHook2")
#endif

#ifndef EXCEPTION_DELAY_LOAD_INVALID_PARAMETER
#define EXCEPTION_DELAY_LOAD_INVALID_PARAMETER	0xC06D0057
#endif

#ifndef EXCEPTION_DELAY_LOAD_MODULE_NOT_FOUND
#define EXCEPTION_DELAY_LOAD_MODULE_NOT_FOUND	0xC06D007E
#endif

#ifndef EXCEPTION_DELAY_LOAD_ENTRY_NOT_FOUND
#define EXCEPTION_DELAY_LOAD_ENTRY_NOT_FOUND	0xC06D007F
#endif

extern	const	IMAGE_DOS_HEADER	__ImageBase;

typedef	DWORD	RVA;

typedef	enum	dliNotify
{
	dliStartProcessing,
	dliNotePreLoadLibrary,
	dliNotePreGetProcAddress,
	dliFailLoadLib,
	dliFailGetProc,
	dliNoteEndProcessing
} dliNotify;

typedef	struct	DelayLoadDescr
{
	DWORD	dwAttributes;		// 1UL = all members are RVAs
	DWORD	dwDllName;
	DWORD	dwHMODULE;		// RVA of module handle
	DWORD	dwIAT;			// RVA of import address table
	DWORD	dwINT;			// RVA of import name table
	DWORD	dwBoundIAT;		// RVA of optional bound import address table
	DWORD	dwUnloadIAT;		// RVA of optional copy of original import address table
	DWORD	dwTimeStamp;
} DelayLoadDescr;

typedef	struct	DelayLoadProc
{
	BOOL	fImportByName;
	union
	{
		LPCSTR	szProcName;
		DWORD	dwOrdinal;
	};
} DelayLoadProc;

typedef	struct	DelayLoadInfo
{
	DWORD		cb;		// size of structure
	DelayLoadDescr	*pidd;		// raw form of data (everything is there)
	FARPROC		*ppfn;		// points to address of function to load
	LPCSTR		szDll;		// name of DLL
	DelayLoadProc	dlp;		// name or ordinal of function to load
	HMODULE		hmodCur;	// handle of DLL
	FARPROC		pfnCur;		// actual function that will be called
	DWORD		dwLastError;	// error received (if an error notification)
} DelayLoadInfo;

typedef	FARPROC	(WINAPI *PfnDliHook) (dliNotify, DelayLoadInfo *);

extern	PfnDliHook	__pfnDliNotifyHook2;
extern	PfnDliHook	__pfnDliFailureHook2;
const	PfnDliHook	__pfnDliDefaultHook2 = NULL;

BOOL	WINAPI	__FUnloadDelayLoadedDLL2(LPCSTR szDll);

FARPROC	WINAPI	__delayLoadHelper2(DelayLoadDescr *lpDLD, FARPROC *lpfnIATEntry)
{
	HMODULE	hModule;
	HMODULE			*lpHMODULE = (HMODULE *) ((LPBYTE) &__ImageBase + lpDLD->dwHMODULE);
	IMAGE_THUNK_DATA	*lpINT = (IMAGE_THUNK_DATA *) ((LPBYTE) &__ImageBase + lpDLD->dwINT);
	IMAGE_THUNK_DATA	*lpIAT = (IMAGE_THUNK_DATA *) ((LPBYTE) &__ImageBase + lpDLD->dwIAT);
	IMAGE_THUNK_DATA	*lpBoundIAT = (IMAGE_THUNK_DATA *) ((LPBYTE) &__ImageBase + lpDLD->dwBoundIAT);
	DWORD			dwEntry = (IMAGE_THUNK_DATA *) lpfnIATEntry - lpIAT;

	// NOTE: *lpfnIATEntry == lpIAT[dwEntry].u1.Function

	DelayLoadInfo	dli = {sizeof(DelayLoadInfo),
			       lpDLD,
			       lpfnIATEntry,
			       (LPCSTR) &__ImageBase + lpDLD->dwDllName,
			       {!IMAGE_SNAP_BY_ORDINAL(lpINT[dwEntry].u1.Ordinal),
			         IMAGE_SNAP_BY_ORDINAL(lpINT[dwEntry].u1.Ordinal)
			       ? IMAGE_ORDINAL(lpINT[dwEntry].u1.Ordinal)
			       : ((IMAGE_IMPORT_BY_NAME *) ((LPBYTE) &__ImageBase + lpINT[dwEntry].u1.AddressOfData))->Name},
			       *lpHMODULE,
			       (FARPROC) NULL,
			       ERROR_SUCCESS};

	if (__pfnDliNotifyHook2 != NULL)
	{
		dli.pfnCur = (*__pfnDliNotifyHook2)(dliStartProcessing, &dli);

		if (dli.pfnCur != NULL)
			goto SUCCESS;
	}

	if ((lpDLD->dwAttributes & 1UL) == 0UL)	// members are not RVAs?
	{
		dli.dwLastError = ERROR_INVALID_PARAMETER;

		RaiseException(EXCEPTION_DELAY_LOAD_INVALID_PARAMETER,
		               EXCEPTION_NONCONTINUABLE,
		               1UL,
		               (DWORD_PTR *) &dli);

		goto FAILURE;
	}

	if (dli.hmodCur != NULL)	// module already loaded?
		goto ADDRESS;

	if (__pfnDliNotifyHook2 != NULL)
		dli.hmodCur = (HMODULE) (*__pfnDliNotifyHook2)(dliNotePreLoadLibrary, &dli);

	if (dli.hmodCur != NULL)	// module handle resolved by notification routine?
		goto ADDRESS;
#ifndef LOAD_LIBRARY_SEARCH_SYSTEM32
	dli.hmodCur = LoadLibraryA(dli.szDll);
#else
	dli.hmodCur = LoadLibraryExA(dli.szDll, NULL, LOAD_LIBRARY_SEARCH_SYSTEM32);
#endif
	if (dli.hmodCur == NULL)	// module not loaded?
	{
		dli.dwLastError = GetLastError();

		if (__pfnDliFailureHook2 != NULL)
			dli.hmodCur = (HMODULE) (*__pfnDliFailureHook2)(dliFailLoadLib, &dli);

		if (dli.hmodCur == NULL)
		{
			RaiseException(EXCEPTION_DELAY_LOAD_MODULE_NOT_FOUND,
			               EXCEPTION_NONCONTINUABLE,
			               1UL,
			               (DWORD_PTR *) &dli);

			goto FAILURE;
		}
	}
#ifndef _WIN64
	hModule = (HMODULE) InterlockedExchangePointer((LPVOID *) lpHMODULE, dli.hmodCur);
#else
	hModule = (HMODULE) _InterlockedExchangePointer((LPVOID *) lpHMODULE, dli.hmodCur);
#endif
	if (hModule == dli.hmodCur)
		FreeLibrary(dli.hmodCur);
	else
		if (lpDLD->dwUnloadIAT != 0UL)
		{
			// ...
		}
ADDRESS:
	if (__pfnDliNotifyHook2 != NULL)
		dli.pfnCur = (*__pfnDliNotifyHook2)(dliNotePreGetProcAddress, &dli);

	if (dli.pfnCur != NULL)		// function address resolved by notification routine?
		goto SUCCESS;

	if ((lpDLD->dwBoundIAT != 0UL) && (lpDLD->dwTimeStamp != 0UL))
	{
		IMAGE_NT_HEADERS	*lpModule = (IMAGE_NT_HEADERS *) ((LPBYTE) dli.hmodCur + ((IMAGE_DOS_HEADER *) dli.hmodCur)->e_lfanew);

		if ((lpModule->Signature == IMAGE_NT_SIGNATURE)
		 && (lpModule->FileHeader.TimeDateStamp == lpDLD->dwTimeStamp)
		 && (lpModule->OptionalHeader.ImageBase == dli.hmodCur))
		{
			dli.pfnCur = (FARPROC) lpBoundIAT[dwEntry].u1.Function;

			if (dli.pfnCur != NULL)
				goto SUCCESS;
		}
	}

	dli.pfnCur = GetProcAddress(dli.hmodCur, dli.dlp.szProcName);

	if (dli.pfnCur != NULL)		// function address resolved?
		goto SUCCESS;

	dli.dwLastError = GetLastError();

	if (__pfnDliFailureHook2 != NULL)
		dli.pfnCur = (*__pfnDliFailureHook2)(dliFailGetProc, &dli);

	if (dli.pfnCur != NULL)		// function address resolved by failure routine?
		goto SUCCESS;

	RaiseException(EXCEPTION_DELAY_LOAD_ENTRY_NOT_FOUND,
	               EXCEPTION_NONCONTINUABLE,
	               1UL,
	               (DWORD_PTR *) &dli);

	goto FAILURE;

SUCCESS:
	*lpfnIATEntry = dli.pfnCur;
FAILURE:
	if (__pfnDliNotifyHook2 != NULL)
		(*__pfnDliNotifyHook2)(dliNoteEndProcessing, &dli);

	return dli.pfnCur;
}

Save the ANSI C source presented above as dli.c in the directory where you created the object library i386.lib before, then execute the following 3 command lines to compile it and add the generated object file i386-dli.obj to the existing object library i386.lib:

SET CL=/c /GAFyz /Oxy /W4 /Zl
CL.EXE /Foi386-dli.obj dli.c
LINK.EXE /LIB /OUT:i386.lib i386.lib i386-dli.obj

For details and reference see the MSDN articles Compiler Options and Linker Options.

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: the command lines can be copied and pasted as block into a Command Processor window!

Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.

dli.c
dli.c(63) : warning C4201: nonstandard extension used : nameless struct/union
dli.c(98) : warning C4204: nonstandard extension used : non-constant aggregate initializer
dli.c(99) : warning C4204: nonstandard extension used : non-constant aggregate initializer
dli.c(100) : warning C4204: nonstandard extension used : non-constant aggregate initializer
dli.c(101) : warning C4204: nonstandard extension used : non-constant aggregate initializer
dli.c(104) : warning C4047: ':' : 'DWORD' differs in levels of indirection from 'BYTE *'
dli.c(104) : warning C4204: nonstandard extension used : non-constant aggregate initializer
dli.c(104) : warning C4057: 'initializing' : 'LPCSTR' differs in indirection to slightly different base types from 'BYTE *'
dli.c(105) : warning C4204: nonstandard extension used : non-constant aggregate initializer
dli.c(133) : warning C4054: 'type cast' : from function pointer 'FARPROC' to data pointer 'HMODULE'
dli.c(147) : warning C4054: 'type cast' : from function pointer 'FARPROC' to data pointer 'HMODULE'
dli.c(184) : warning C4047: '==' : 'DWORD' differs in levels of indirection from 'HMODULE'

Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

Move the ANSI C source file dli.c into the directory where you created the object library amd64.lib before, then execute the following 3 command lines to compile it and add the generated object file amd64-dli.obj to the object library amd64.lib:

SET CL=/c /GAFy /Oxy /W4 /Zl
CL.EXE /Foamd64-dli.obj dli.c
LINK.EXE /LIB /OUT:amd64.lib amd64.lib amd64-dli.obj

Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

dli.c
dli.c(63) : warning C4201: nonstandard extension used : nameless struct/union
dli.c(93) : warning C4244: 'initializing' : conversion from '__int64' to 'DWORD', possible loss of data
dli.c(98) : warning C4204: nonstandard extension used : non-constant aggregate initializer
dli.c(99) : warning C4204: nonstandard extension used : non-constant aggregate initializer
dli.c(100) : warning C4204: nonstandard extension used : non-constant aggregate initializer
dli.c(101) : warning C4204: nonstandard extension used : non-constant aggregate initializer
dli.c(104) : warning C4047: ':' : 'ULONGLONG' differs in levels of indirection from 'BYTE *'
dli.c(104) : warning C4204: nonstandard extension used : non-constant aggregate initializer
dli.c(104) : warning C4057: 'initializing' : 'LPCSTR' differs in indirection to slightly different base types from 'BYTE *'
dli.c(105) : warning C4204: nonstandard extension used : non-constant aggregate initializer
dli.c(133) : warning C4054: 'type cast' : from function pointer 'FARPROC' to data pointer 'HMODULE'
dli.c(147) : warning C4054: 'type cast' : from function pointer 'FARPROC' to data pointer 'HMODULE'
dli.c(184) : warning C4047: '==' : 'ULONGLONG' differs in levels of indirection from 'HMODULE'

Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

`main()` and `wmain()` Support

Under the heading Remarks, the documentation of the linker option /ENTRY:‹symbol› states:

Remarks
The /ENTRY option specifies an entry point function as the starting address for an .exe file or DLL.
The function must be defined to use the __stdcall calling convention. The parameters and return value depend on if the program is a console application, a windows application or a DLL. It is recommended that you let the linker set the entry point so that the C run-time library is initialized correctly, and C++ constructors for static objects are executed.
By default, the starting address is a function name from the C run-time library. The linker selects it according to the attributes of the program, as shown in the following table.

Function name	Default for
mainCRTStartup (or wmainCRTStartup)	An application that uses /SUBSYSTEM:CONSOLE; calls `main` (or `wmain`)
WinMainCRTStartup (or wWinMainCRTStartup)	An application that uses /SUBSYSTEM:WINDOWS; calls `WinMain` (or `wWinMain`), which must be defined to use `__stdcall`
_DllMainCRTStartup	A DLL; calls `DllMain` if it exists, which must be defined to use `__stdcall`

If the /DLL or /SUBSYSTEM option is not specified, the linker selects a subsystem and entry point depending on whether main or WinMain is defined.
The functions main, WinMain, and DllMain are the three forms of the user-defined entry point.

OUCH: mainCRTStartup(), wmainCRTStartup(), WinMainCRTStartup() and wWinMainCRTStartup(), the 4 entry point functions for applications, use the __cdecl calling and naming convention instead; they take the address of the Process Environment Block as argument and return a 32-bit integer as exit code of the thread or the process, respectively.

The MSDN article Format of a C Decorated Name specifies:

The form of decoration for a C function depends on the calling convention used in its declaration, as shown below. Note that in a 64-bit environment, functions are not decorated.

Calling convention	Decoration
__cdecl (the default)	Leading underscore (_)
__stdcall	Leading underscore (_) and a trailing at sign (@) followed by a number representing the number of bytes in the parameter list
__fastcall	Same as __stdcall, but prepended by an at sign instead of an underscore
__vectorcall	Two trailing at signs (@@) followed by the decimal number of bytes in the parameter list.

The MSDN articles __cdecl and __stdcall provide more details.
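The decoration rules quoted above can be sketched in portable C. The helper decorate() and the enumeration are hypothetical names introduced only for this illustration; they are not part of any Microsoft tool, and the sketch covers just the four rows of the table for the i386 platform.

```c
#include <stdio.h>

/* Hypothetical names, introduced for this illustration only */
enum convention { CONV_CDECL, CONV_STDCALL, CONV_FASTCALL, CONV_VECTORCALL };

/* Writes the decorated i386 name of a C function to out, following the
   quoted table; bytes is the size of the parameter list in bytes and is
   ignored for __cdecl. Returns the result of snprintf(), or -1 for an
   unknown convention. */
static int decorate(char *out, size_t size, const char *name,
                    enum convention conv, unsigned bytes)
{
    switch (conv) {
    case CONV_CDECL:      /* leading underscore */
        return snprintf(out, size, "_%s", name);
    case CONV_STDCALL:    /* leading underscore, trailing @bytes */
        return snprintf(out, size, "_%s@%u", name, bytes);
    case CONV_FASTCALL:   /* leading at sign, trailing @bytes */
        return snprintf(out, size, "@%s@%u", name, bytes);
    case CONV_VECTORCALL: /* trailing @@bytes */
        return snprintf(out, size, "%s@@%u", name, bytes);
    }
    return -1;
}
```

With these rules a __stdcall function mainCRTStartup(void *) would be named _mainCRTStartup@4, while the __cdecl variant is named _mainCRTStartup.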

Falsification in ANSI C

// Copyleft © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

typedef	unsigned short	wchar_t;

#ifdef CONSOLE
#ifdef UNICODE
int wmain(int argc, wchar_t *argv[], wchar_t *envp[])
#else
int main(int argc, char *argv[], char *envp[])
#endif
{
    return *envp != argv[argc];
}
#else // WINDOWS
#ifdef UNICODE
int wWinMain(void *current, void *previous, wchar_t *cmdline, int show)
#else
int WinMain(void *current, void *previous, char *cmdline, int show)
#endif
{
    return cmdline[current == previous] != show;
}
#endif // WINDOWS

Save the ANSI C source presented above as i386-sys.c in an arbitrary, preferably empty directory, then execute the following 5 command lines to compile and (attempt to) link it:

SET CL=/W4 /X /Zl
CL.EXE /DUNICODE /Gz i386-sys.c
CL.EXE /Gz i386-sys.c
CL.EXE /DCONSOLE /Gd i386-sys.c
CL.EXE /DCONSOLE /DUNICODE /Gd i386-sys.c

For details and reference see the MSDN articles Compiler Options and Linker Options.

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: the command lines can be copied and pasted as block into a Command Processor window!

Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.

i386-sys.c

Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

/out:i386-sys.exe
i386-sys.obj
LINK : error LNK2001: unresolved external symbol _wWinMainCRTStartup
i386-sys.exe : fatal error LNK1120: 1 unresolved externals

Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.

i386-sys.c

Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

/out:i386-sys.exe
i386-sys.obj
LINK : error LNK2001: unresolved external symbol _WinMainCRTStartup
i386-sys.exe : fatal error LNK1120: 1 unresolved externals

Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.

i386-sys.c

Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

/out:i386-sys.exe
i386-sys.obj
LINK : error LNK2001: unresolved external symbol _mainCRTStartup
i386-sys.exe : fatal error LNK1120: 1 unresolved externals

Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.

i386-sys.c

Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

/out:i386-sys.exe
i386-sys.obj
LINK : error LNK2001: unresolved external symbol _wmainCRTStartup
i386-sys.exe : fatal error LNK1120: 1 unresolved externals

OUCH: the linker expects the 4 entry point functions for applications, mainCRTStartup(), wmainCRTStartup(), WinMainCRTStartup() and wWinMainCRTStartup(), to be defined using the __cdecl naming convention, i.e. without the decoration appended to the name of functions defined using the __stdcall naming convention!

Demonstration in i386 Assembler

; Copyleft © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.386
	.model	flat, C
	.code

main	proc	public

	assume	fs :flat	; fs = address of TEB
	mov	eax, fs:[48]	; eax = address of PEB
	xor	eax, [esp+4]	; eax = 0
	ret

main	endp
	end	main		; writes "/ENTRY:main" to '.drectve' section

Save the i386 assembler source presented above as i386-sys.asm in an arbitrary, preferably empty directory, then execute the following 5 command lines to build the console application i386-sys.exe, execute it and display its exit code:

SET ML=/safeseh /W3 /X
SET LINK=
ML.EXE i386-sys.asm
.\i386-sys.exe
ECHO %ERRORLEVEL%

Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: i386-sys.asm

Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

/OUT:i386-sys.exe
i386-sys.obj

0

Demonstration in AMD64 Assembler

; Copyleft © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

	.code

wmain	proc	public
				; gs = address of TEB
				; rcx = address of PEB
	xor	eax, eax	; rax = 0
	cmp	rcx, gs:[96]
	sete	al		; rax = 1
	ret

wmain	endp
	end

Save the AMD64 assembler source presented above as amd64-sys.asm in an arbitrary, preferably empty directory, then execute the following 5 command lines to build the console application amd64-sys.exe, execute it and display its exit code:

SET ML=/W3 /X
SET LINK=/ENTRY:wmain
ML64.EXE amd64-sys.asm
.\amd64-sys.exe
ECHO %ERRORLEVEL%

Microsoft (R) Macro Assembler (x64) Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: amd64-sys.asm

Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

/ENTRY:wmain
/OUT:amd64-sys.exe
amd64-sys.obj

1

Implementation in ANSI C

The MSDN articles main Function and Program Execution, main Function and Command-Line Arguments, Using wmain, Argument Description and Parsing C Command-Line Arguments specify the main() and wmain() functions, their arguments, how to parse the command line returned from the GetCommandLine() function and how to split the environment block returned from the GetEnvironmentStrings() function to derive these arguments.
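Splitting the environment block can be sketched in portable C, independent of the Windows API. The helper split_environment() is a hypothetical name; the sketch assumes the block is a sequence of NUL-terminated strings closed by an additional NUL, and it skips strings that start with an equal sign, as the implementation presented below does.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: stores the addresses of the strings of a
   double-NUL terminated environment block in envp[], which offers room
   for limit pointers, skips strings that start with '=' and appends a
   terminating NULL pointer; returns the number of strings stored. */
static size_t split_environment(const char *block, const char **envp,
                                size_t limit)
{
    size_t count = 0;

    for (; *block != '\0'; block += strlen(block) + 1)
        if (*block != '=' && count + 1 < limit)
            envp[count++] = block;

    if (limit > 0)
        envp[count] = NULL;

    return count;
}
```

The pointers refer to the block itself; no string is copied, so the block must stay valid as long as envp[] is used.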

Note: the CommandLineToArgvW() function uses the same algorithm, but supports only UTF-16LE encoding.
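The backslash and double quote rules can be illustrated with a minimal sketch in portable C that extracts one argument per call. The function next_argument() is a hypothetical name; the sketch applies the same rules to every argument, as the implementation presented below does, and is not claimed to match the Microsoft runtime in every corner case.

```c
#include <stddef.h>

/* Sketch: copies the next argument from *cursor to out, which must offer
   room for strlen(*cursor) + 1 characters, and advances *cursor past it;
   returns 0 when no argument is left, else 1. */
static int next_argument(const char **cursor, char *out)
{
    const char *in = *cursor;
    size_t backslashes = 0;
    int quoted = 0;

    while (*in == ' ' || *in == '\t')       /* skip leading whitespace */
        in++;

    if (*in == '\0')
        return 0;

    while (*in != '\0' && (quoted || (*in != ' ' && *in != '\t'))) {
        if (*in == '\\') {                  /* copy and count backslash */
            *out++ = *in++;
            backslashes++;
            continue;
        }
        if (*in == '"') {
            out -= (backslashes + 1) / 2;   /* keep half of the backslashes */
            if (backslashes & 1)            /* odd number: escaped quote */
                *out++ = *in++;
            else if (*++in == '"' && quoted)
                *out++ = *in++;             /* "" inside quotes: one quote */
            else
                quoted = !quoted;           /* delimiter: toggle state */
        } else
            *out++ = *in++;                 /* regular character */
        backslashes = 0;
    }

    *out = '\0';
    *cursor = in;
    return 1;
}
```

An even number 2n of backslashes before a double quote yields n backslashes and a delimiter, an odd number 2n+1 yields n backslashes and a literal double quote; backslashes not followed by a double quote are copied unchanged.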

The following ANSI C program provides the glue between the mainCRTStartup() or wmainCRTStartup() entry point functions and the main() or wmain() functions:

// Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#define STRICT
#define WIN32_LEAN_AND_MEAN

#include <windows.h>

#pragma comment(lib, "kernel32")

extern	int	main(int argc, char const *argv[], char const *envp[]);
extern	int	wmain(int argc, wchar_t const *argv[], wchar_t const *envp[]);

__declspec(noreturn)
__declspec(safebuffers)
VOID	CDECL	mainCRTStartup(VOID)
{
	LPSTR	lpArgument;
	LPCSTR	lpCmdLine = GetCommandLineA();
	LPCSTR	lpBlock = GetEnvironmentStringsA();
	LPCSTR	lpCount = lpCmdLine;
	UINT	uiCount = 0U;
	UINT	uiQuote = 0U;
	UINT	argc;
	LPCSTR	*argv;
	LPCSTR	*envp;

	argc = 1U;			// argv[0] always exists, even when it is empty

	while (*lpCount != '\0')	// count arguments
	{
		if (!uiQuote		// whitespace outside double quotes?
		 && (*lpCount == ' ' || *lpCount == '\t'))
		{
			do		// skip unquoted whitespace
				lpCount++;
			while (*lpCount == ' ' || *lpCount == '\t');

			argc += (*lpCount != '\0');

			uiCount = 0U;

			continue;
		}
		else if (*lpCount == '\\')
			uiCount ^= ~0U;
		else if (!uiCount	// unescaped double quote?
		      && *lpCount == '"')
			uiQuote ^= ~0U;
		else			// regular character
			uiCount = 0U;

		lpCount++;
	}

	if (uiQuote)			// unpaired double quote?
		SetLastError(ERROR_BAD_ARGUMENTS);

	argv = (LPCSTR *) _alloca((argc + 1U) * sizeof(*argv) + (lpCount + 1U - lpCmdLine) * sizeof(**argv));

	argv[0] = lpArgument = (LPSTR) (argv + argc + 1U);

	argc = uiCount = uiQuote = 0U;

	while (*lpCmdLine != '\0')	// process arguments
	{
		if (!uiQuote		// whitespace outside double quotes?
		 && (*lpCmdLine == ' ' || *lpCmdLine == '\t'))
		{			// terminate current argument
			*lpArgument = '\0';

			do		// skip unquoted whitespace
				lpCmdLine++;
			while (*lpCmdLine == ' ' || *lpCmdLine == '\t');

			if (*lpCmdLine != '\0')
					// store address of next argument
				argv[++argc] = lpArgument = (LPSTR) lpCmdLine;

			uiCount = 0U;
		}
		else if (*lpCmdLine == '\\')
		{
			*lpArgument++ = *lpCmdLine++;
			uiCount++;	// count backslash
		}
		else if (*lpCmdLine == '"')
		{
			lpArgument -= (uiCount + 1U) / 2U;

			if (uiCount & 1U)
					// double quote preceded by odd number
					//  of backslashes: keep half of them
					//   and the (escaped) double quote
				*lpArgument++ = *lpCmdLine++;
			else		// double quote preceded by even number
					//  of backslashes: keep half of them
				if (*++lpCmdLine == '"' && uiQuote)
					// double quote inside double quotes and
					//  followed by another double quote:
					//   keep one double quote
					*lpArgument++ = *lpCmdLine++;
				else	// skip double quote and toggle state
					uiQuote ^= ~0U;
			uiCount = 0U;
		}
		else			// regular character
		{
			*lpArgument++ = *lpCmdLine++;
			uiCount = 0U;
		}
	}

	*lpArgument = '\0';		// terminate (last) argument

	argv[++argc] = NULL;		// store terminating NULL pointer
	envp = argv + argc;

	if (lpBlock != NULL)
	{
		for (uiCount = 0U,	// count environment strings
		     lpCount = lpBlock; *lpCount != '\0'; lpCount += strlen(lpCount) + 1U)
			if (*lpCount != '=')
				uiCount++;

		if (uiCount > 0U)	// process environment strings
		{
			envp = (LPCSTR *) _alloca((uiCount + 1U) * sizeof(*envp));

			for (uiCount = 0U,
			     lpCount = lpBlock; *lpCount != '\0'; lpCount += strlen(lpCount) + 1U)
				if (*lpCount != '=')
					envp[uiCount++] = lpCount;

			envp[uiCount] = (LPCSTR) NULL;
		}
	}

	ExitProcess(main(argc, argv, envp));
}

__declspec(noreturn)
__declspec(safebuffers)
VOID	CDECL	wmainCRTStartup(VOID)
{
	LPWSTR	lpArgument;
	LPCWSTR	lpCmdLine = GetCommandLineW();
	LPCWSTR	lpBlock = GetEnvironmentStringsW();
	LPCWSTR	lpCount = lpCmdLine;
	UINT	uiCount = 0U;
	UINT	uiQuote = 0U;
	UINT	argc;
	LPCWSTR	*argv;
	LPCWSTR	*envp;

	argc = 1U;			// argv[0] always exists, even when it is empty

	while (*lpCount != L'\0')	// count arguments
	{
		if (!uiQuote		// whitespace outside double quotes?
		 && (*lpCount == L' ' || *lpCount == L'\t'))
		{
			do		// skip unquoted whitespace
				lpCount++;
			while (*lpCount == L' ' || *lpCount == L'\t');

			argc += (*lpCount != L'\0');

			uiCount = 0U;

			continue;
		}
		else if (*lpCount == L'\\')
			uiCount ^= ~0U;
		else if (!uiCount	// unescaped double quote?
		      && *lpCount == L'"')
			uiQuote ^= ~0U;
		else			// regular character
			uiCount = 0U;

		lpCount++;
	}

	if (uiQuote)			// unpaired double quote?
		SetLastError(ERROR_BAD_ARGUMENTS);

	argv = (LPCWSTR *) _alloca((argc + 1U) * sizeof(*argv) + (lpCount + 1U - lpCmdLine) * sizeof(**argv));

	argv[0] = lpArgument = (LPWSTR) (argv + argc + 1U);

	argc = uiCount = uiQuote = 0U;

	while (*lpCmdLine != L'\0')	// process arguments
	{
		if (!uiQuote		// whitespace outside double quotes?
		 && (*lpCmdLine == L' ' || *lpCmdLine == L'\t'))
		{			// terminate current argument
			*lpArgument = L'\0';

			do		// skip unquoted whitespace
				lpCmdLine++;
			while (*lpCmdLine == L' ' || *lpCmdLine == L'\t');

			if (*lpCmdLine != L'\0')
					// store address of next argument
				argv[++argc] = lpArgument = (LPWSTR) lpCmdLine;

			uiCount = 0U;
		}
		else if (*lpCmdLine == L'\\')
		{
			*lpArgument++ = *lpCmdLine++;
			uiCount++;	// count backslash
		}
		else if (*lpCmdLine == L'"')
		{
			lpArgument -= (uiCount + 1U) / 2U;

			if (uiCount & 1U)
					// double quote preceded by odd number
					//  of backslashes: keep half of them
					//   and the (escaped) double quote
				*lpArgument++ = *lpCmdLine++;
			else		// double quote preceded by even number
					//  of backslashes: keep half of them
				if (*++lpCmdLine == L'"' && uiQuote)
					// double quote inside double quotes and
					//  followed by another double quote:
					//   keep one double quote
					*lpArgument++ = *lpCmdLine++;
				else	// skip double quote and toggle state
					uiQuote ^= ~0U;
			uiCount = 0U;
		}
		else			// regular character
		{
			*lpArgument++ = *lpCmdLine++;
			uiCount = 0U;
		}
	}

	*lpArgument = L'\0';		// terminate (last) argument

	argv[++argc] = NULL;		// store terminating NULL pointer
	envp = argv + argc;

	if (lpBlock != NULL)
	{
		for (uiCount = 0U,	// count environment strings
		     lpCount = lpBlock; *lpCount != L'\0'; lpCount += wcslen(lpCount) + 1U)
			if (*lpCount != L'=')
				uiCount++;

		if (uiCount > 0U)	// process environment strings
		{
			envp = (LPCWSTR *) _alloca((uiCount + 1U) * sizeof(*envp));

			for (uiCount = 0U,
			     lpCount = lpBlock; *lpCount != L'\0'; lpCount += wcslen(lpCount) + 1U)
				if (*lpCount != L'=')
					envp[uiCount++] = lpCount;

			envp[uiCount] = (LPCWSTR) NULL;
		}
	}

	ExitProcess(wmain(argc, argv, envp));
}

Note: the mainCRTStartup() function allocates up to 32768 bytes for the command line plus 16384×4 bytes (32-bit platforms) or 16384×8 bytes (64-bit platforms) for the argv[] array on the stack, i.e. at most 96 kiB on 32-bit platforms and 160 kiB on 64-bit platforms; the wmainCRTStartup() function allocates up to 65536 bytes for the command line plus 16384×4 bytes (32-bit platforms) or 16384×8 bytes (64-bit platforms) for the argv[] array on the stack, i.e. at most 128 kiB on 32-bit platforms and 192 kiB on 64-bit platforms.
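The worst-case figures of this note can be checked with a few lines of C. The function worst_case_stack() is a hypothetical name used only for this check; the limit of 16384 arguments is taken from the note, presumably because every argument needs at least one character plus one separator.

```c
#include <stddef.h>

/* Worst case from the note: the bytes of the command line plus 16384
   argv[] pointers of the given size. */
static size_t worst_case_stack(size_t cmdline_bytes, size_t pointer_bytes)
{
    return cmdline_bytes + (size_t) 16384 * pointer_bytes;
}
```

For example, worst_case_stack(32768, 4) gives 98304 bytes, i.e. 96 kiB, and worst_case_stack(65536, 8) gives 196608 bytes, i.e. 192 kiB.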

Save the ANSI C source presented above as sys.c in the directory where you created the object library i386.lib before, then execute the following 3 command lines to compile it and add the generated object file i386-sys.obj to the existing object library i386.lib:

SET CL=/c /GAFdy /J /Oxy /W4 /Zl
CL.EXE /Foi386-sys.obj sys.c
LINK.EXE /LIB /OUT:i386.lib i386.lib i386-sys.obj

For details and reference see the MSDN articles Compiler Options and Linker Options.

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: the command lines can be copied and pasted as block into a Command Processor window!

Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.

sys.c

Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

Move the ANSI C source file sys.c into the directory where you created the object library amd64.lib before, then execute the following 3 command lines to compile it and add the generated object file amd64-sys.obj to the object library amd64.lib:

SET CL=/c /GAFy /J /Oxy /W4 /Zl
CL.EXE /Foamd64-sys.obj sys.c
LINK.EXE /LIB /OUT:amd64.lib amd64.lib amd64-sys.obj

Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

sys.c

Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

Usage Instructions

Link the object library amd64.lib or i386.lib, depending on the target platform, before or instead of the MSVCRT libraries.

Footnote: Bloat

The MSDN article for the linker options /LD, /MD and /MT specifies:

Option Description

/LD Creates a DLL.
Passes the /DLL option to the linker. The linker looks for, but does not require, a DllMain function. If you do not write a DllMain function, the linker inserts a DllMain function that returns TRUE.
Links the DLL startup code.
Creates an import library (.lib), if an export (.exp) file is not specified on the command line. You link the import library to applications that call your DLL.
Interprets /Fe (Name EXE File) as naming a DLL rather than an .exe file. By default, the program name becomes basename.dll instead of basename.exe.
[…]
Implies /MT unless you explicitly specify /MD.

Demonstration

Execute the following 4 command lines to create an empty source file .c in an arbitrary, preferably empty directory, then compile and (attempt to) link it with the object library msvcrt.lib against the Visual C runtime DLL:

COPY NUL: .c
SET CL=/LD /W4 /X
SET LINK=/MACHINE:I386 /MAP /OPT:ICF,REF
CL.EXE /MD .c

For details and reference see the MSDN articles Compiler Options and Linker Options.

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: the command lines can be copied and pasted as block into a Command Processor window!

Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.

.c(1) : warning C4206: nonstandard extension used : translation unit is empty

Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

/MACHINE:I386 /MAP /OPT:ICF,REF
/out:.dll
/dll
/implib:.lib
.obj
LINK : error LNK2001: unresolved external symbol __DllMainCRTStartup@12
.dll : fatal error LNK1120: 1 unresolved externals

OUCH: the combined import and object library msvcrt.lib shipped with the Visual C compiler does not provide the entry point function _DllMainCRTStartup() required to build DLLs!

Repeat the last command line without the compiler option /MD to link with the object library libcmt.lib shipped with the Visual C compiler, then display the size of the generated empty DLL .dll:

CL.EXE .c
DIR .dll

Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

.c(1) : warning C4206: nonstandard extension used : translation unit is empty

Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

/MACHINE:I386 /MAP /OPT:ICF,REF
/out:.dll
/dll
/implib:.lib
.obj

 Volume in drive C has no label.
 Volume Serial Number is 1957-0427

 Directory of C:\Users\Stefan\Desktop

04/27/2015  08:15 PM            32,256 .dll
               1 File(s)         32,256 bytes
               0 Dir(s)    9,876,543,210 bytes free

OOPS: an empty DLL is 31.5 kiB (in words: thirty-one and a half kibibytes) small!

Note: the inspection of the generated text file .map to determine what the linker included is left as an exercise to the reader.

Note: the corresponding demonstration for console applications with empty main() and wmain() functions as well as Windows applications with empty WinMain() and wWinMain() functions is also left as an exercise to the reader.

Note: the repetition of these demonstrations using the 64-bit build environment is left as an exercise to the reader too.

Appendix

`.CRT` Section Usage

The MSDN article CRT initialization gives an example for the use of the .CRT section.

The following table shows how the Visual C compiler and its runtime use the .CRT section:

Visual C compiler `.CRT` section usage
Section$Group	Public Name	Purpose and Usage
.CRT$XCA	__xc_a	`NULL` pointer before array of C++ constructor and initialiser function pointers.
.CRT$XCAA		`pre_cpp_init()` function pointer.
.CRT$XCU		Dynamic initialiser function pointers.
.CRT$XCZ	__xc_z	Terminating `NULL` pointer after array of C++ constructor and initialiser function pointers.
.CRT$XDA	__xd_a	`NULL` pointer before array of C++ TLS initialiser callback function pointers.
.CRT$XDC		C++ TLS initialiser callback function pointers.
.CRT$XDL		C++ TLS initialiser callback function pointers.
.CRT$XDU		C++ TLS initialiser callback function pointers.
.CRT$XDZ	__xd_z	Terminating `NULL` pointer after array of C++ TLS initialiser callback function pointers.
.CRT$XIA	__xi_a	`NULL` pointer before array of C initialiser function pointers.
.CRT$XIAA		`pre_c_init()` and `_mixed_pre_c_init()` function pointers.
.CRT$XIC		`__initmbctable()`, `__initstdio()`, `__inittime()`, `__lconv_init()` and `__onexitinit()` function pointers.
.CRT$XID		`__set_emptyinvalidparamhandler()`, `__set_loosefpmath()` and `_InitCPLocHash()` function pointers.
.CRT$XIY		`__CxxSetUnhandledExceptionFilter()` function pointer.
.CRT$XIZ	__xi_z	Terminating `NULL` pointer after array of C initialiser function pointers.
.CRT$XLA	__xl_a	`NULL` pointer before array of TLS callback function pointers.
.CRT$XLC		`__dyn_tls_dtor()` function pointer.
.CRT$XLD		`__dyn_tls_init()` function pointer.
.CRT$XLZ	__xl_z	Terminating `NULL` pointer after array of TLS callback function pointers.
.CRT$XPA	__xp_a	`NULL` pointer before array of C pre-termination function pointers.
.CRT$XPB		`_concrt_static_cleanup()` function pointer.
.CRT$XPX		`__termconin()`, `__termconout()`, `_locterm()` and `_rmtmp()` function pointers.
.CRT$XPXA		`__endstdio()` function pointer.
.CRT$XPZ	__xp_z	Terminating `NULL` pointer after array of C pre-termination function pointers.
.CRT$XTA	__xt_a	`NULL` pointer before array of C termination function pointers.
.CRT$XTZ	__xt_z	Terminating `NULL` pointer after array of C termination function pointers.

CAVEAT: all symbols have global scope and pollute the name space without necessity!
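How such a table of function pointers is processed can be sketched in portable C. The linker sorts the groups of a section alphabetically by the part of their name after the dollar sign, so the pointers end up as one array between the two bracketing symbols. In the sketch below an ordinary array stands in for the merged groups .CRT$XCA to .CRT$XCZ; walk_table() and all other names are hypothetical and only resemble what the runtime is understood to do.

```c
#include <stddef.h>

typedef void (*initialiser)(void);

static int calls;                       /* records which functions ran */

static void first_init(void)  { calls += 1; }
static void second_init(void) { calls += 10; }

/* Ordinary array standing in for the merged section: NULL pointer
   before, function pointers, terminating NULL pointer after */
static initialiser const table[] = { NULL, first_init, second_init, NULL };

/* Hypothetical table walker: calls every non-NULL function pointer in
   the half-open range [first, last) in ascending order. */
static void walk_table(initialiser const *first, initialiser const *last)
{
    for (; first < last; first++)
        if (*first != NULL)
            (*first)();
}
```

Calling walk_table(table, table + 4) runs first_init() before second_init(); the NULL pointers at both ends are skipped.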

`.rtc` Section Usage

The MSDN articles Run-Time Error Checking and Native Run-Time Checks Customization give an introduction; see also the _RTC_Initialize() and _RTC_Terminate() functions.

The following table shows how the Visual C compiler and its runtime use the .rtc section:

Visual C compiler `.rtc` section usage
Section$Group	Public Name	Purpose and Usage
.rtc$IAA	__rtc_iaa	`NULL` pointer before array of RTC initialisation function pointers.
.rtc$IZZ	__rtc_izz	Terminating `NULL` pointer after array of RTC initialisation function pointers.
.rtc$TAA	__rtc_taa	`NULL` pointer before array of RTC termination function pointers.
.rtc$TZZ	__rtc_tzz	Terminating `NULL` pointer after array of RTC termination function pointers.

CAVEAT: all symbols have global scope and pollute the name space without necessity!

Contact

If you miss anything here, have additions, comments, corrections, criticism or questions, want to give feedback, hints or tips, report broken links, bugs, deficiencies, errors, inaccuracies, misrepresentations, omissions, shortcomings, vulnerabilities or weaknesses, …: don’t hesitate to contact me and feel free to ask, comment, criticise, flame, notify or report!

Use the X.509 certificate to send S/MIME encrypted mail.

Note: email in weird format and without a proper sender name is likely to be discarded!

I dislike HTML (and even weirder formats too) in email, I prefer to receive plain text.
I also expect to see your full (real) name as sender, not your nickname.
I abhor top posts and expect inline quotes in replies.

Terms and Conditions

By using this site, you signify your agreement to these terms and conditions. If you do not agree to these terms and conditions, do not use this site!

The software and the documentation on this site are provided as is without any warranty, neither express nor implied.
In no event will the author be held liable for any damage(s) arising from the use of the software or the documentation.
Permission is granted to use the current version of the software and the current version of the documentation solely for personal private and non-commercial purposes.
An individual's use of the software or the documentation in his or her capacity or function as an agent, (independent) contractor, employee, member or officer of a business, corporation or organisation (commercial or non-commercial) does not qualify as personal private and non-commercial purpose.
Without written approval from the author the software or the documentation must not be used for a business, for commercial, corporate, governmental, military or organisational purposes of any kind, or in a commercial, corporate, governmental, military or organisational environment of any kind.
Redistribution of the software and the documentation is allowed only in unmodified form of its current version and free of charge.

Data Protection Declaration

This web page records no (personal) data and stores no cookies in the web browser.

The web service is operated and provided by

Telekom Deutschland GmbH
Business Center
D-64306 Darmstadt
Germany
<‍hosting‍@‍telekom‍.‍de‍>
+49 800 5252033

The web service provider stores a session cookie in the web browser and records every visit of this web site with the following data in an access log on their server(s):

the (pseudonymised) IP address;
the date and time of the request;
the URL of the requested web page or file;
the Referer and User-Agent HTTP headers sent by the web browser;
the result (success or failure) of the request;
the amount of data received and sent.

acos	acosf	acosl
asin	asinf	asinl
atan	atanf	atanl
atan2	atan2f	atan2l
ceil	ceilf	ceill
cosh	coshf	coshl
cos	cosf	cosl
exp	expf	expl
floor	floorf	floorl
fmod	fmodf	fmodl
log	logf	logl
log10	log10f	log10l
pow	powf	powl
sin	sinf	sinl
sinh	sinhf	sinhl
sqrt	sqrtf	sqrtl
tan	tanf	tanl
tanh	tanhf	tanhl

Me, myself & IT

Microsoft® Visual C Compiler Helper Routines: Poor and Stupid Implementation

Purpose

Introduction

__chkstk Routine

Implementation in AMD64 Assembler

_alloca Routine

_chkstk Routine

Implementation in i386 Assembler

_allmul Routine

Implementations in i386 Assembler

_alldiv Routine

Implementations in i386 Assembler

_alldvrm Routine

Implementation in i386 Assembler

_allrem Routine

Implementation in i386 Assembler

_aulldiv Routine

Implementation in i386 Assembler

_aulldvrm Routine

Implementation in i386 Assembler

_aullrem Routine

Implementation in i386 Assembler

_aullshr Routine

Implementation in i386 Assembler

_allshl Routine

Implementation in i386 Assembler

_allshr Routine

Implementation in i386 Assembler

Revision History of _all* and _aull* Routines in Leaked Source

Runtime Measurement of _all* and _aull* Routines

_rotl64() and _rotr64() Intrinsic Functions for i386 Platform

Implementation of _allrol() and _allror() Functions in i386 Assembler

_abs64() Intrinsic Function for i386 Platform

Implementation of _allabs() Function in i386 Assembler

64-bit Integer Negation for i386 Platform

Implementation of _allneg() Function in i386 Assembler

64-bit Integer Negation for i386 Platform (Call by Reference)

64-bit Integer Signum for i386 Platform

Implementation of _allsgn() Function in i386 Assembler

64-bit Integer Comparison for i386 Platform

Implementation of _allcmp() and _aullcmp() Functions in i386 Assembler

64-bit Integer Maximum for i386 Platform

Implementation of _allmax() and _aullmax() Functions in i386 Assembler

64-bit Integer Minimum for i386 Platform

Implementation of _allmin() and _aullmin() Functions in i386 Assembler

acos(), asin(), atan(), atan2(), cos(), cosh(), exp(), fmod(), log(), log10(), pow(), sin(), sinh(), sqrt(), tan() and tanh() Standard Functions for i386 Platform

_CI* and _ftol* Routines

memchr() Standard Function for i386 Platform

Naïve Implementation in i386 Assembler

Smart Implementation in i386 Assembler

Implementation with SSE2 Instructions in i386 Assembler

Implementation with SSSE3 Instructions in i386 Assembler

Implementation with AVX Instructions in i386 Assembler

Implementation with AVX2 Instructions in i386 Assembler

Smart Implementation in AMD64 Assembler

mem*() Standard Functions

Implementation in ANSI C

Implementation in i386 Assembler

Implementation in AMD64 Assembler

Inline Implementation of memcpy() and memset() with Intrinsic Functions

strchr() Standard Function for i386 Platform

Implementation in i386 Assembler

Implementation with SSE2 Instructions in i386 Assembler

Implementation with SSSE3 Instructions in i386 Assembler

Implementation with AVX Instructions in i386 Assembler

Implementation with AVX2 Instructions in i386 Assembler

strlen() Standard Function for i386 Platform

Implementation in i386 Assembler

Implementation with SSE2 Instructions in i386 Assembler

Implementation with AVX Instructions in i386 Assembler

Implementation with AVX2 Instructions in i386 Assembler

Implementation in AMD64 Assembler

strrchr() and strstr() Standard Functions for i386 Platform

Implementation with SSE4.2 Instructions in i386 Assembler

str*() Standard Functions

Implementation in ANSI C

Implementation in i386 Assembler

Implementation in AMD64 Assembler

wcs*() Standard Functions

Microsoft® Visual C Compiler Helper Routines: Poor and Stupid Implementation

`__chkstk` Routine

`_alloca` Routine

`_chkstk` Routine

`_allmul` Routine

`_alldiv` Routine

`_alldvrm` Routine

`_allrem` Routine

`_aulldiv` Routine

`_aulldvrm` Routine

`_aullrem` Routine

`_aullshr` Routine

`_allshl` Routine

`_allshr` Routine

Revision History of `_all*` and `_aull*` Routines in Leaked Source

Runtime Measurement of `_all*` and `_aull*` Routines

`_rotl64()` and `_rotr64()` Intrinsic Functions for i386 Platform

Implementation of `_allrol()` and `_allror()` Functions in i386 Assembler

`_abs64()` Intrinsic Function for i386 Platform

Implementation of `_allabs()` Function in i386 Assembler

Implementation of `_allneg()` Function in i386 Assembler

Implementation of `_allsgn()` Function in i386 Assembler

Implementation of `_allcmp()` and `_aullcmp()` Functions in i386 Assembler

Implementation of `_allmax()` and `_aullmax()` Functions in i386 Assembler

Implementation of `_allmin()` and `_aullmin()` Functions in i386 Assembler

`acos()`, `asin()`, `atan()`, `atan2()`, `cos()`, `cosh()`, `exp()`, `fmod()`, `log()`, `log10()`, `pow()`, `sin()`, `sinh()`, `sqrt()`, `tan()` and `tanh()` Standard Functions for i386 Platform

`_CI*` and `_ftol*` Routines

`memchr()` Standard Function for i386 Platform

`mem*()` Standard Functions

Inline Implementation of `memcpy()` and `memset()` with Intrinsic Functions

`strchr()` Standard Function for i386 Platform

`strlen()` Standard Function for i386 Platform

`strrchr()` and `strstr()` Standard Functions for i386 Platform

`str*()` Standard Functions

`wcs*()` Standard Functions

`_load_config_used` and `__security_check_cookie()` Function (`/GS` Support)

`main()` and `wmain()` Support

`.CRT` Section Usage

`.rtc` Section Usage