System call optimization with the SYSENTER instruction

My previous article "How do Windows NT system calls REALLY work?" explains how Windows NT calls system services by using an 'int 2e' software interrupt. Newer platforms such as Windows XP and 2003 normally use another method to call system services. As explained in my previous article, the 'int 2e' instruction uses both an interrupt gate and a code segment descriptor to find the interrupt service routine (KiSystemService) which services the 'int 2e' software interrupt. Since the CPU has to load one interrupt gate and one segment descriptor from memory in order to know which interrupt service routine to call, significant overhead is involved in making an 'int 2e' system call. The SYSENTER instruction drastically reduces this overhead.

By John Gulbrandsen
John.Gulbrandsen@SummitSoftConsulting.com

Why is SYSENTER faster?

As explained in my previous article, the interrupt gate (entry 2e in the Interrupt Descriptor Table) identifies the entry in the Global Descriptor Table which in turn identifies the code segment that contains the KiSystemService function. Loading the 8-byte interrupt gate and segment descriptor from memory is sped up by keeping them cached in the processor's on-chip (level 1) or off-chip (level 2) cache. The CPU is very likely to find them cached since every Windows NT system call uses the same interrupt gate and code segment descriptor when making a system call via the 'int 2e' software interrupt. However, the CPU must still perform memory read cycles to read from the cache, make access privilege checks etc. every time it switches privilege level via the 'int 2e' software interrupt.

After having analyzed the whole sequence of events involved in switching to kernel-mode, it is clear that it would be much faster if the CPU could be hard-coded to always switch to the same location in a kernel-mode segment when a system call is issued. With the destination hard-coded, no memory reads are necessary to find out where the system call should end up, which speeds up system calls significantly. This is exactly what is done by the Intel SYSENTER and the AMD SYSCALL instructions, which are present in the Pentium II, AMD K7 and newer CPUs. These instructions are collectively referred to as "Fast System Call" instructions.

SYSENTER or SYSCALL?

Why are there two different instructions to make a fast system call? Most likely Intel and AMD simultaneously and independently developed their versions of the Fast System Call instructions. They are functionally very similar but they use somewhat different configuration registers in the CPU to set up the destination segment and the offset within the destination segment where the system call function resides. Because they are so similar I will mainly describe the SYSENTER version below and point out differences where they matter.

How does a system call via the SYSENTER instruction work?

As explained above, the SYSENTER call uses hard-coded code segment descriptors to describe the target code segment. Instead of setting up the CPU according to a specification in memory described by a code segment descriptor (segment base, segment size, segment privilege level etc.), the CPU always sets the target segment's base to 0, its size to 4GB and its privilege level to 0 (kernel-mode). What is NOT hard-coded is the exact target location within the target segment, i.e. the address of the function being called in the kernel-mode code segment. This function is called 'KiFastCallEntry' in Windows XP and newer platforms.

So if the address of the KiFastCallEntry function is not hard-coded, how does the CPU know where to jump after switching to the target code segment? The answer is that the CPU uses the "Model Specific Registers" (MSRs). MSRs are configuration registers that are only used by the operating system; application programs never use them. The content of the MSRs defines how the CPU will behave. The RDMSR (Read MSR) and WRMSR (Write MSR) instructions are used to read and modify the MSRs. The CPU uses an MSR called SYSENTER_EIP_MSR in order to know where to jump when the SYSENTER instruction is executed. In other words, the SYSENTER_EIP_MSR register contains the address of the KiFastCallEntry function. This MSR must be set up by the operating system very early in the boot process in order for system calls via the SYSENTER instruction to work.

As explained in my previous article, the operating system switches to the kernel-mode stack when an operating system call is made. This behavior must be the same when making a SYSENTER call or else the stability of the system will be compromised (the whole point of switching to a kernel-mode stack is to ensure that the integrity of the stack used in kernel-mode can be trusted). So how does the CPU switch to the kernel-mode stack? Again, it uses Model Specific Registers. Like the Code Segment, the Stack Segment is loaded with hard-coded values when the CPU executes a SYSENTER instruction. It is loaded with exactly the same values that a system call via an 'int 2e' instruction would result in, i.e. a flat model where the base is 0 and the size is 4GB. Like the EIP, the ESP is not hard-coded. Its value is taken from the SYSENTER_ESP_MSR, which is also set up by the operating system at boot time.

The mechanics of SYSENTER

All Model Specific Registers are 64-bit registers. They are loaded from EDX:EAX using the WRMSR instruction. The MSR index in the ECX register tells the WRMSR instruction which MSR to load. The RDMSR instruction works the same way but it stores the current value of an MSR into EDX:EAX. The programming manual for the CPU in use specifies what index to use for any given MSR. The table below lists the MSRs used by the SYSENTER/SYSEXIT instructions.

Model Specific Register name    Index    Usage
SYSENTER_CS_MSR                 174h     CS Selector of the target segment
SYSENTER_ESP_MSR                175h     Target ESP
SYSENTER_EIP_MSR                176h     Target EIP

Table 1. The Model Specific Registers used by the SYSENTER instruction.
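To make the EDX:EAX convention concrete, here is a minimal sketch in C that models the register operands of WRMSR and RDMSR. It does not execute the real instructions (they are privileged and only usable in kernel-mode); the MsrOperands type and the msr_pack/msr_unpack helpers are hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* MSR indices taken from Table 1. */
enum {
    SYSENTER_CS_MSR  = 0x174,
    SYSENTER_ESP_MSR = 0x175,
    SYSENTER_EIP_MSR = 0x176
};

/* Hypothetical model of the registers involved in RDMSR/WRMSR. */
typedef struct {
    uint32_t ecx; /* MSR index */
    uint32_t edx; /* upper 32 bits of the 64-bit MSR value */
    uint32_t eax; /* lower 32 bits of the 64-bit MSR value */
} MsrOperands;

/* Split a 64-bit value into the EDX:EAX pair that WRMSR consumes. */
MsrOperands msr_pack(uint32_t index, uint64_t value)
{
    MsrOperands ops;
    ops.ecx = index;
    ops.edx = (uint32_t)(value >> 32);
    ops.eax = (uint32_t)(value & 0xFFFFFFFFu);
    return ops;
}

/* Rebuild the 64-bit value from the EDX:EAX pair that RDMSR produces. */
uint64_t msr_unpack(const MsrOperands *ops)
{
    assert(ops != NULL);
    return ((uint64_t)ops->edx << 32) | ops->eax;
}
```

Calling msr_pack(SYSENTER_EIP_MSR, address) yields the ECX, EDX and EAX values a kernel would load before executing WRMSR; on a 32-bit system the upper half (EDX) is simply zero.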

Note that SYSENTER_CS_MSR contains the Code Segment Selector of the target code segment (the segment that contains the KiFastCallEntry function). This value is loaded into the visible part of the CS register but it is in fact never used by the SYSENTER or SYSEXIT instructions! Remember that all information related to the target code segment is hard-coded by the SYSENTER instruction and that the Segment Selector loaded into CS is therefore not used to find the target code segment in the GDT, as it is with the 'int 2e' method of making system calls. In order to keep the value in the CS Segment Register consistent with the Descriptor it points to, the operating system must however set up a real Code Segment Descriptor in the GDT. In fact, the operating system must set up four Segment Descriptors in the Global Descriptor Table in order to keep the Segment Registers consistent with the content of the GDT. Intel specifies that these GDT descriptors must reside contiguously in the GDT. Figure 1 below illustrates this.

As figure 1 shows, the operating system sets up four segment descriptors in the GDT. The "CS Enter Descriptor" at index 1 in the GDT describes the kernel-mode code segment that contains the KiFastCallEntry routine. The "SS Enter Descriptor" describes the kernel-mode stack segment that will be switched to when calling into kernel-mode via a SYSENTER instruction. The "CS Exit Descriptor" and "SS Exit Descriptor" are used when switching back from kernel-mode to user-mode via the SYSEXIT instruction. The details involved in switching back into user-mode will be covered in detail later in this article.

To summarize, the steps taken when executing the SYSENTER instruction are:

1)      The CPU loads the Segment Selector in the SYSENTER_CS_MSR into the visible part of the CS register.

2)      The hidden part of the CS register is loaded with hard-coded values as previously described.

3)      The SS register is loaded with a segment selector that points to the entry in the GDT after the CS Enter Descriptor, i.e. to the SS Enter Descriptor. Since the SYSENTER_CS_MSR (and also the CS register) contains the binary value 00001000 or hexadecimal 0x08, the SS will be loaded with a binary value of 00010000 or hexadecimal 0x10. The Intel Programmer's manual simply says that "the SS register is set to the sum of 8 plus the value in SYSENTER_CS_MSR" which results in a segment selector with an index one higher than the segment selector in SYSENTER_CS_MSR.

4)      The hidden part of the SS register is loaded with hard-coded values as previously described.

5)      The ESP register is loaded from the SYSENTER_ESP_MSR and the EIP register is loaded from the SYSENTER_EIP_MSR. The CPU then starts executing code in kernel-mode (KiFastCallEntry).
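The steps above can be sketched as a small C model of the register state after SYSENTER. This is only an illustration of the selector arithmetic and the MSR-to-register copies described in this article, not an emulation of the real instruction; the SysenterMsrs and CpuState types and the simulate_sysenter and selector_index functions are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The three MSRs consulted by SYSENTER (see Table 1). */
typedef struct {
    uint32_t sysenter_cs;
    uint32_t sysenter_esp;
    uint32_t sysenter_eip;
} SysenterMsrs;

/* The parts of the CPU state that SYSENTER changes. */
typedef struct {
    uint16_t cs;  /* visible part of CS */
    uint16_t ss;  /* visible part of SS */
    uint32_t esp;
    uint32_t eip;
    unsigned cpl; /* current privilege level */
} CpuState;

/* A selector keeps its table index in bits 3 and up. */
unsigned selector_index(uint16_t selector)
{
    return (unsigned)(selector >> 3);
}

CpuState simulate_sysenter(const SysenterMsrs *msrs)
{
    CpuState s;
    assert(msrs != NULL);
    s.cs  = (uint16_t)msrs->sysenter_cs; /* step 1: CS from SYSENTER_CS_MSR */
    s.ss  = (uint16_t)(s.cs + 8);        /* step 3: SS = CS + 8, next GDT entry */
    s.esp = msrs->sysenter_esp;          /* step 5: stack pointer from MSR 175h */
    s.eip = msrs->sysenter_eip;          /* step 5: entry point from MSR 176h */
    s.cpl = 0;                           /* steps 2 and 4: hard-coded ring 0 */
    return s;
}
```

With SYSENTER_CS_MSR set to 0x08, the model gives CS = 0x08 (GDT index 1) and SS = 0x10 (GDT index 2), matching step 3 above.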

The mechanics of SYSEXIT

The SYSEXIT instruction is very similar to the SYSENTER instruction, with the main difference that the hidden part of the CS Register is now set to a privilege level of 3 (user-mode) instead of 0 (kernel-mode). As shown in figure 1 above, the GDT contains the CS Exit Descriptor and SS Exit Descriptor at index 3 and 4. As in the case of the SYSENTER instruction, the CS and SS Exit Descriptors are not used at all by the SYSEXIT instruction. These descriptors are only there to create consistency between the selectors loaded into the CS and SS registers and the corresponding CS and SS Exit Descriptors when returning to user-mode. The selectors loaded into the CS and SS Registers by the SYSEXIT instruction correctly point to the unused Exit CS and SS Descriptors in the GDT. These selectors are:

Selector (binary and hexadecimal)    Usage
00011000b = 18h                      Points to the CS Exit Descriptor (Index 3 in GDT)
00100000b = 20h                      Points to the SS Exit Descriptor (Index 4 in GDT)

Table 2. The CS and SS Exit Selectors used by the SYSEXIT instruction.

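As in the case of loading the SS selector during the SYSENTER instruction, the SYSEXIT instruction loads CS and SS with selectors whose indices into the GDT are 2 and 3 higher than the index in the segment selector in the SYSENTER_CS_MSR register (the Intel manual expresses this as adding 16 and 24 to the value in SYSENTER_CS_MSR). Because SYSEXIT returns to privilege level 3, the Intel manual also specifies that the Requested Privilege Level field (the two lowest bits) of both selectors is set to 3, so the values actually seen in CS and SS after the return are 1Bh and 23h. The index fields still select entries 3 and 4 in the GDT.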

If you have paid close attention so far you might have noticed that there are no "SYSEXIT_EIP_MSR" or "SYSEXIT_ESP_MSR" registers. So how does the SYSEXIT instruction know where to return to in the user-mode code that initially called SYSENTER? When you think about it, such information could not be fixed in an MSR because each system call can potentially originate from a completely different location in user-mode. Instead, the SYSEXIT instruction takes the new EIP from the EDX register and the new ESP from the ECX register. SYSENTER itself saves neither the return address nor the user-mode stack pointer, so the caller (the code that executes SYSENTER) must hand this information to the kernel when the call is made, and the kernel-mode exit code must make sure that EDX holds the return address and ECX holds the user-mode stack pointer before it executes SYSEXIT. The Windows XP stub shown later in this article does its part with a 'mov edx,esp' just before the SYSENTER instruction. Execution then continues in user-mode right after the original SYSENTER instruction.
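The return path can be sketched in the same way as the entry path. The C model below is a hedged illustration, not an emulation: the UserState type and the simulate_sysexit function are hypothetical names. It also includes one detail taken from the Intel manual rather than from the tables above, namely that SYSEXIT sets the two lowest selector bits (the Requested Privilege Level) to 3.

```c
#include <assert.h>
#include <stdint.h>

/* The parts of the CPU state that SYSEXIT changes. */
typedef struct {
    uint16_t cs;
    uint16_t ss;
    uint32_t eip;
    uint32_t esp;
    unsigned cpl;
} UserState;

/*
 * sysenter_cs: content of SYSENTER_CS_MSR (0x08 on the systems in this article)
 * edx:         user-mode address to continue at
 * ecx:         user-mode stack pointer to restore
 */
UserState simulate_sysexit(uint16_t sysenter_cs, uint32_t edx, uint32_t ecx)
{
    UserState s;
    assert((sysenter_cs & 7u) == 0); /* sketch assumes a plain GDT selector, RPL 0 */
    s.cs  = (uint16_t)((sysenter_cs + 16) | 3); /* GDT index + 2, RPL = 3 */
    s.ss  = (uint16_t)((sysenter_cs + 24) | 3); /* GDT index + 3, RPL = 3 */
    s.eip = edx;                                /* return address comes from EDX */
    s.esp = ecx;                                /* stack pointer comes from ECX */
    s.cpl = 3;                                  /* back in user-mode */
    return s;
}
```

For SYSENTER_CS_MSR = 0x08 this gives CS = 1Bh and SS = 23h. Masking off the two RPL bits leaves 18h and 20h, the selector values listed in Table 2.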

SYSENTER or 'int 2e'?

How does the operating system (XP or newer) know if it should use the new SYSENTER instruction when calling a kernel-mode function? The answer is that the operating system uses the CPUID instruction to ask the CPU whether the SYSENTER instruction is supported. If the SEP (SysEnter Present) bit is set, the operating system will use the SYSENTER instruction instead of 'int 2e'. This information is cached by the operating system so that once it has been determined that SYSENTER is supported it will always be used instead of 'int 2e'. The same is true for the SYSCALL instruction on AMD CPUs.
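The decision described above boils down to a single bit test. The sketch below assumes the feature word has already been obtained (for example from the EDX result of CPUID function 1, where SEP is bit 11) and only shows the test and the resulting choice; the function and type names are hypothetical and the real kernel logic is certainly more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FEATURE_SEP_BIT 11u /* SEP: SYSENTER/SYSEXIT present */

typedef enum { CALL_INT2E, CALL_SYSENTER } CallMethod;

/* Returns true if the given bit is set in a 32-bit feature word. */
bool feature_bit_set(uint32_t features, unsigned bit)
{
    assert(bit < 32u);
    return ((features >> bit) & 1u) != 0;
}

/* Chooses the system call method once, based on the SEP bit. */
CallMethod choose_call_method(uint32_t features)
{
    return feature_bit_set(features, FEATURE_SEP_BIT) ? CALL_SYSENTER
                                                      : CALL_INT2E;
}
```

An operating system would evaluate something like choose_call_method once at boot and remember the answer, rather than repeating the CPUID query for every system call.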

Are there different operating system binaries for SYSENTER and 'int 2e'?

As described in my previous article, the NTDLL.dll system call stub DLL is responsible for executing the 'int 2e' instruction whenever calls into the kernel were made on Windows NT (Windows 2000 and older, not including Windows 9x which has a completely different architecture). Since Windows XP now has three different ways to call a kernel-mode function, will the operating system have to check which method to use before each and every system call? The answer is no. Instead, NTDLL calls into a special page of memory that is mapped into all processes, called the "SharedUserData" page, which contains a function called "SystemCallStub". NTDLL calls the SystemCallStub for each system call. Since the SystemCallStub calls a kernel-mode function differently depending on whether SYSENTER, SYSCALL or 'int 2e' is used, the operating system binaries are identical regardless of the capabilities of the CPU.
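The indirection through SystemCallStub is essentially a function pointer that is filled in once. The following C sketch illustrates that design under loud assumptions: it runs entirely in user-mode, the three stub functions merely report which method they stand for instead of entering the kernel, and every name in it is hypothetical rather than taken from Windows.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { CALL_INT2E, CALL_SYSENTER, CALL_SYSCALL } CallMethod;
typedef CallMethod (*SystemCallStubFn)(void);

/* Stand-ins for the three real stubs; each only reports its own method. */
static CallMethod stub_int2e(void)    { return CALL_INT2E; }
static CallMethod stub_sysenter(void) { return CALL_SYSENTER; }
static CallMethod stub_syscall(void)  { return CALL_SYSCALL; }

/* Plays the role of the stub in the shared page: set once, used by all. */
static SystemCallStubFn g_system_call_stub = NULL;

/* Done once by the "operating system", after it has examined the CPU. */
void install_system_call_stub(CallMethod method)
{
    switch (method) {
    case CALL_SYSENTER: g_system_call_stub = stub_sysenter; break;
    case CALL_SYSCALL:  g_system_call_stub = stub_syscall;  break;
    default:            g_system_call_stub = stub_int2e;    break;
    }
}

/* Plays the role of an NTDLL stub: no per-call CPU check, just a call. */
CallMethod example_nt_stub(void)
{
    assert(g_system_call_stub != NULL);
    return g_system_call_stub();
}
```

The point of the design is that example_nt_stub is the same code on every machine. Only the content behind the pointer differs, which is why one set of binaries can serve all three call styles.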

KiFastCallEntry reuses the good old KiSystemService function

KiSystemService still does all the hard work involved in the actual dispatching of the system call once kernel-mode has been reached. KiFastCallEntry simply calls the implementation of KiSystemService after first having prepared a stack image identical to one produced by an 'int 2e' style system call (see my previous article for the details of how KiSystemService expects the stack to be set up). The question now is: how does KiSystemService know if SYSEXIT, SYSRET or 'iretd' should be used to return to user-mode? For this to work, the end of the KiSystemService function has been modified to handle any of the three system call types. In fact, there are three different exit routines depending on what call style was used to enter kernel-mode:

Kernel Function Name    Call style    Exit instruction
KiSystemCallExit        'int 2e'      iretd
KiSystemCallExit2       SYSENTER      SYSEXIT
KiSystemCallExit3       SYSCALL       SYSRET

Table 3. The three different ways to exit a system call.

The really interested reader can disassemble these functions to see what is really going on but this is not done in this article. The bottom line is that the choice of which of these three functions to use to return to user-mode is made in the "KiSystemServiceExit" function based on the feature-bits of the CPU (returned from the CPUID instruction).

Windows 2000 Experiment

We can confirm that the information presented in this article is correct through a couple of debugging sessions with WinDbg on Windows 2000 and Windows XP systems. Let's first see what the contents of the MSRs are on our Windows 2000 OS running on a dual Pentium III machine:

0: kd> rdmsr 174
msr[174] = 00000000:00000000
0: kd> rdmsr 175
msr[175] = 00000000:00000000
0: kd> rdmsr 176
msr[176] = 00000000:00000000

The MSRs are all zero as expected since Windows 2000 is not aware of the SYSENTER instruction. It therefore does not initialize the SYSENTER_CS_MSR, SYSENTER_EIP_MSR or SYSENTER_ESP_MSR Model Specific Registers. Let's confirm that the SEP bit is set in the result returned from the CPUID instruction:

0: kd> !cpuinfo
CP F/M/S Manufacturer MHz Update Signature Features
0 6,8,3 GenuineIntel 797>0000001300000000<00002fff
1 6,8,3 GenuineIntel 797 0000000c00000000 00002fff

The feature bits (00002fff) translated into binary are 0010 1111 1111 1111. As can be seen, the SEP bit (bit 11) is set which tells us that the CPU supports the SYSENTER and SYSEXIT instructions but Windows 2000 doesn't (since the MSRs were not set up).

We can confirm that Windows 2000 uses the 'int 2e' method of calling system functions by disassembling an arbitrary system call. Let's pick CreateMutex, which ultimately ends up in the user-mode stub ZwCreateMutant in NTDLL.dll:

ntdll!ZwCreateMutant:
77f853b8 b825000000 mov eax,0x25
77f853bd 8d542404 lea edx,[esp+0x4]
77f853c1 cd2e int 2e
77f853c3 c21000 ret 0x10

As can be seen, our Windows 2000 system indeed uses 'int 2e' to make the system call.

Windows XP Experiment

If we run the exact same tests on a Windows XP OS running on our Pentium III machine, we should be able to verify that the system uses SYSENTER instead of 'int 2e' when system calls are made. Let's first check the MSRs:

0: kd> RDMSR 174
msr[174] = 00000000:00000008
0: kd> RDMSR 175
msr[175] = 00000000:00000000
0: kd> RDMSR 176
msr[176] = 00000000:804fa1e0

As expected, the MSRs are set up by Windows XP. As previously explained, the MSR with index 174 is the SYSENTER_CS_MSR. It contains the selector that points to the Code Segment Descriptor in the GDT that describes the kernel-mode segment that contains the system call function (KiFastCallEntry). The selector value 0x0008 has an index field of 1 (bits 3 and up), a table indicator of 0 (GDT) and a requested privilege level of 0, so it refers to entry 1 in the GDT.

If we peek into the GDT at index 1 with the "ProtMode" WinDbg debugger extension DLL presented in my previous article, we see the following information:

0: kd> !ProtMode.Descriptor GDT 1
----------------- Code Segment Descriptor -----------------
GDT base = 0x8003F000, Index = 0x01, Descriptor @ 0x8003f008
8003f008 ff ff 00 00 00 9b cf 00
Segment size is in 4KB pages, 32-bit default operand and data size
Segment is present, DPL = 0, Not system segment, Code segment
Segment is not conforming, Segment is readable, Segment is accessed
Target code segment base address = 0x00000000
Target code segment size = 0x000fffff

As can be seen, this is the same descriptor that was described in my previous article (the single 4GB kernel-mode segment that contains the system address space). The descriptor table base is however different on the Windows XP system (0x8003F000) compared to (0x80036000) on the Windows 2000 system used in my previous article. The MSR with MSR index 176 (SYSENTER_EIP_MSR) contains the address of the kernel-mode function that will be called when a SYSENTER instruction is executed. Let's verify that the address 804fa1e0 indeed is the address of KiFastCallEntry:

0: kd> u 804fa1e0
nt!KiFastCallEntry:
804fa1e0 b930000000 mov ecx,0x30
804fa1e5 8ee1 mov fs,ecx
804fa1e7 648b0d40000000 mov ecx,fs:[00000040]
804fa1ee 368b6104 mov esp,ss:[ecx+0x4]
804fa1f2 b90403fe7f mov ecx,0x7ffe0304

Let's finally see what our CreateMutex call looks like on our Windows XP system:

ntdll!ZwCreateMutant:
77f7e663 b82b000000 mov eax,0x2b
77f7e668 ba0003fe7f mov edx,0x7ffe0300
77f7e66d ffd2 call edx {SharedUserData!SystemCallStub (7ffe0300)}
77f7e66f c21000 ret 0x10
77f7e672 90 nop

Here we see that the ZwCreateMutant stub function in NTDLL no longer calls directly into kernel-mode but instead calls the SystemCallStub function that resides in the SharedUserData page, as described above. Below is a disassembly of the SystemCallStub itself:

SharedUserData!SystemCallStub:
7ffe0300 8bd4 mov edx,esp
7ffe0302 0f34 sysenter
7ffe0304 c3 ret

Ah, finally we reach the SYSENTER instruction!

How much faster is SYSENTER than 'int 2e'?

The test program below calls CreateMutex approximately 16.7 million times (0x00FFFFFF iterations) and then displays the times at which the application started and finished. The results are displayed in table 4 below.

Platform                                Time
Windows 2000 SP4 on PIII Dual 800MHz    4 minutes
Windows XP SP0 on PIII Dual 800MHz      1 minute 30 seconds

Table 4. The SYSENTER system call performance improvement over 'int 2e'.

As table 4 shows, the test completes roughly 2.7 times as fast with SYSENTER system calls as with 'int 2e' (240 seconds versus 90 seconds). This is quite impressive and it may be a hidden but very good reason to upgrade to Windows XP. Of course, very few applications call system services with this frequency but the SYSENTER instruction still does a very good optimization job.

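#include <windows.h>
#include <crtdbg.h>
#include <stdio.h>   // sprintf

// Shows a time stamp (hh:mm:ss) in a message box with the given title.
void DisplaySystemTime(const SYSTEMTIME * pSystemTime, const char * pszHdr)
{
     char szBuf[1024];
     sprintf(szBuf, "%02d:%02d:%02d",
        pSystemTime->wHour,
        pSystemTime->wMinute,
        pSystemTime->wSecond);

     // The strings are plain char strings, so use the ANSI variant
     // explicitly. This also compiles when UNICODE is defined.
     MessageBoxA(NULL, szBuf, pszHdr, MB_OK);
}

int main(int argc, char* argv[])
{
     SYSTEMTIME stStart;
     GetSystemTime(&stStart);

     // Each iteration makes system calls: one to create the mutex
     // (ZwCreateMutant) and one to close its handle.
     for(DWORD dwCount = 0; dwCount < 0x00FFFFFF; dwCount++)
     {
          HANDLE hMutex = CreateMutex(
               NULL, // SD.
               FALSE, // Initial owner?
               NULL); // Name.
          _ASSERTE(hMutex != NULL);

          CloseHandle(hMutex);
     }

     SYSTEMTIME stEnd;
     GetSystemTime(&stEnd);

     // Display both times after the loop so that the message boxes
     // do not affect the measurement.
     DisplaySystemTime(&stStart, "Start time");
     DisplaySystemTime(&stEnd, "End time");

     return 0;
}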

Further Reading

For information on the Protected Mode of the Intel x86 CPU there are two great sources:

1)      "Intel Architecture Software Developers Manual, Volume 3 - System Programming Guide". Available from Intel's web site in PDF format.

2)      "Protected Mode Software Architecture" by Tom Shanley. Available from Amazon.com (published by Addison Wesley).

For more programming details about the x86 CPU, must-haves are:

1)      "Intel Architecture Software Developers Manual, Volume 1 - Basic Architecture".

2)      "Intel Architecture Software Developers Manual, Volume 2 - Instruction Set Reference Manual".

Both these books are available in PDF format on the Intel web site (you can also get a free hardcopy of these two books; Volume 3 is however only available in PDF format).

About the Author

John Gulbrandsen is the founder and president of Summit Soft Consulting. John has a formal background in microprocessor, digital and analog electronics design as well as in embedded and Windows systems development. John has programmed Windows since 1992 (Windows 3.0). He is as comfortable with programming Windows applications and web systems in C++, C# and VB as he is writing and debugging Windows kernel mode device drivers in SoftIce.

To contact John drop him an email: John.Gulbrandsen@SummitSoftConsulting.com

About Summit Soft Consulting

Summit Soft Consulting is a Southern California-based consulting firm specializing in Microsoft's operating systems and core technologies. Our specialty is Windows Systems Development including kernel mode and NT internals programming.

To visit Summit Soft Consulting on the web: http://www.summitsoftconsulting.com