Chapter 1. Windows NT: An Inside Look

Abstract

This chapter begins with an evaluation of Windows NT and then examines the overall architecture of the operating system.

This book is an exploration of the internals of the Windows NT operating system. Before entering the jungle of Windows NT internals, an overview of the topic is necessary. In this chapter, we explain the overall structure of the Windows NT operating system.

EVALUATING WINDOWS NT

The qualities of an operating system are the result of the way in which the operating system is designed and implemented. For an operating system to be portable, extensible, and compatible with previous releases, the basic architecture has to be well designed. In the following sections, we evaluate Windows NT in light of these issues.

Portability

As you know, Windows NT is available on several platforms, namely, Intel, MIPS, PowerPC, and DEC Alpha. Many factors contribute to Windows NT portability. Probably the most important factor of all is the language used for implementation. Windows NT is mostly coded in C, with some parts coded in C++. Assembly language, which is platform specific, is used only where necessary. The Windows NT team also isolated the hardware-dependent sections of the operating system in HAL.DLL. As a result, the hardware-independent portions of Windows NT can be coded in a high-level language, such as C, and easily ported across platforms.

Extensibility

Windows NT is highly extensible, but because of a lack of documentation, its extensibility features are rarely explored. The list of undocumented features starts with the subsystems. The subsystems provide multiple operating system interfaces in one operating system. You can extend Windows NT to have a new operating system interface simply by adding a new subsystem program. Windows NT provides Win32, OS/2, POSIX, Win16, and DOS interfaces using the subsystems concept, but Microsoft keeps mum when it comes to documenting the procedure to add a new subsystem.
The Windows NT kernel is highly extensible because of dynamically loadable kernel modules that are loaded as device drivers. In Windows NT, Microsoft provides enough documentation for you to write hardware device drivers, that is, hard disk device drivers, network card device drivers, tape drive device drivers, and so on. You can also write device drivers that do not control any hardware device. Even file systems are loaded as device drivers under Windows NT.

Another example of Windows NT extensibility is its implementation of the system call interface. Developers commonly modify operating system behavior by hooking or adding system calls. The Windows NT development team designed the system call interface to facilitate easy hooking and adding of system calls, but again Microsoft has not documented these mechanisms.

Compatibility

Downward compatibility has been a long-standing characteristic of Intel microprocessors and Microsoft operating systems, and a key to the success of these two giants. Windows NT had to allow programs for DOS, Win16, and OS/2 to run unaltered. Compatibility is another reason the NT development team went for the subsystem concept. Apart from binary compatibility, where the executable has to be allowed to run unaltered, Windows NT also provides source compatibility for POSIX-compliant applications. In another attempt to increase compatibility, Windows NT supports other file systems, such as the file allocation table (FAT) file system from DOS and the High Performance File System (HPFS) from OS/2, in addition to the native NT file system (NTFS).

Maintainability

Windows NT is a big piece of code, and maintaining it is a big job. The NT development team has achieved maintainability through an object-oriented design. The breakup of the operating system functionality into various layers also improves maintainability. The topmost layer, which is the one seen by the users of the operating system, is the subsystems layer. The subsystems use the system call interface to provide the application programming interface (API) to the outside world. Below the system call interface layer lies the NT executive, which in turn rests on the kernel, which ultimately relies on the hardware abstraction layer (HAL) that talks directly with the hardware.

The NT development team's choice of programming language also contributes to Windows NT maintainability. As we stated previously, the entire operating system has been coded in C and C++, except for a few portions where the use of assembly language was inevitable.

Plus Points over Windows 95/98

Microsoft has come up with two 32-bit operating systems: Windows 95/98 and Windows NT.
Windows NT is a high-end operating system that offers features beyond the process management, memory management, and storage management provided by conventional PC or desktop operating systems.

Security

Windows NT is a secure operating system based on the following characteristics: A user needs to log in to the system before he or she can access it. The resources in the system are treated as objects, and every object has a security descriptor associated with it. A security descriptor has access control lists attached to it that dictate which users can access the object.

All this being said, a secure operating system cannot be complete without a secure file system, and the FAT file system from the days of DOS does not have any provision for security. DOS, being a single-user operating system, did not care about security. In response to this shortcoming, the Windows NT team came up with a new file system based on the HPFS, which is the native file system for OS/2. This new native file system for Windows NT, known as NTFS, has support for access control. A user can specify the access rights for a file or directory being created under NTFS, and NTFS allows only the processes with proper access rights to access that file or directory.

Caution: Keep in mind that no system is 100 percent secure. Windows NT, although remarkably secure, is not DoD compliant. (For the latest news on DoD compliance, check out …)

Multiprocessing

Windows NT supports symmetric multiprocessing. The workstation version of Windows NT can support two processors, and the server version of Windows NT can support up to four processors. The operating system needs special synchronization constructs for supporting multiprocessing. On a single-processor system, critical portions of code can be executed without interruption by disabling all the hardware interrupts. This is required to maintain the integrity of the kernel data structures. In a multiprocessor environment, it is not possible to disable the interrupts on all processors. Windows NT uses spin locks to protect kernel data structures in a multiprocessor environment.

Note: Multiprocessing can be classified as asymmetric and symmetric. In asymmetric multiprocessing, a single processor acts as the master processor and the other processors act as slaves. Only the master processor runs the kernel code, while the slaves can run only the user threads. Whenever a thread running on a slave processor invokes a system service, the master processor takes over the thread and executes the requested kernel service. The scheduler, being kernel code, runs only on the master processor. Thus, the master processor acts as the scheduler, dispatching user mode threads to the slave processors. Naturally, the master processor is heavily loaded and the system is not scalable.
Compare this with symmetric multiprocessing, where any processor can run the kernel code as well as the user code.

International Language Support

A significant portion of PC users today use languages other than English. The key to reaching these users is to have the operating system support their languages. Windows NT achieves this by adopting the Unicode standard for character sets. Unicode, as adopted by Windows NT, uses 16-bit characters, while ASCII-based character sets use 8-bit characters. The first 128 characters in Unicode match the ASCII character set (and the first 256 match ISO 8859-1, Latin-1). This leaves enough space for representing characters from non-Latin scripts and languages.

The Win32 API allows Unicode as well as ASCII character sets, but the Windows NT kernel uses and understands only Unicode. Although the application programmer can get away without knowing Unicode, device driver developers need to be familiar with Unicode because the kernel interface functions accept only Unicode strings and the driver entry points are supplied with Unicode strings.
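
To make the 16-bit point concrete, here is a minimal sketch in Python (chosen only for illustration; the kernel interfaces themselves are C). It widens a string into 16-bit code units and keeps it as a counted string, that is, a byte length plus a buffer. The `CountedWideString` class, its field names, and the helper functions are hypothetical and are not part of any Windows NT interface.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CountedWideString:
    """Hypothetical counted string: a byte length plus 16-bit code units."""
    length_in_bytes: int
    buffer: bytes


def widen(text: str) -> CountedWideString:
    """Encode text as little-endian 16-bit code units (UTF-16LE)."""
    raw = text.encode("utf-16-le")
    return CountedWideString(length_in_bytes=len(raw), buffer=raw)


def code_units(s: CountedWideString) -> List[int]:
    """Split the buffer back into 16-bit integers."""
    return [int.from_bytes(s.buffer[i:i + 2], "little")
            for i in range(0, s.length_in_bytes, 2)]


ascii_name = widen("NT")   # ASCII letters keep their values, widened to 16 bits
greek_name = widen("Ω")    # a non-Latin character also fits in one 16-bit unit
```

In this sketch each ASCII letter occupies two bytes and keeps its ASCII value as the low byte, which is why converting between the two representations is cheap for plain English text.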

Multiprogramming

Windows NT 3.51 and Windows NT 4.0 lack an important feature of a server operating system, namely, support for remote login or Telnet. Both these versions of Windows NT can operate as file servers because they support the Common Internet File System (CIFS) protocol. But they cannot act as CPU servers because logging into a Windows NT machine over the network is not possible. Consequently, only one user can access a Windows NT machine at a time. Microsoft plans to overcome this deficiency in Windows 2000 by providing a Telnet server along with the operating system. This will enable multiple programmers to log in on the machine at the same time, making Windows 2000 a true server operating system.

Note: Third-party Telnet servers are available for Windows NT 3.51 and Windows NT 4.0. However, Microsoft's own Telnet server comes only with Windows 2000.

DELVING INTO THE WINDOWS NT ARCHITECTURE

Windows NT borrows its core architecture from the MACH operating system, which was developed at Carnegie Mellon University. The basic approach of the MACH operating system is to reduce the kernel size to the minimum by pushing complex operating system functionality outside the kernel onto user-level server processes. This client-server architecture of the operating system serves yet another purpose: It allows multiple APIs for the same operating system. This is achieved by implementing the APIs through the server processes.

The MACH operating system kernel provides a very simple set of interface functions. A server process implementing a particular API uses these interface functions to provide a more complex set of interface functions. Windows NT borrows this idea from the MACH operating system. The server processes in Windows NT are called subsystems. NT's choice of the client-server architecture shows its commitment to good software management principles such as modularity and structured programming. Windows NT had the option to implement the required APIs in the kernel.
Also, the NT team could have added different layers on top of the Windows NT kernel to implement different APIs. The NT team voted in favor of the subsystem approach for purposes of maintainability and extensibility.

The Subsystems

There are two types of subsystems in Windows NT: integral subsystems and environment subsystems. The integral subsystems, such as the security manager subsystem, perform some essential operating system task. The environment subsystems enable different types of APIs to be used on a Windows NT machine. Windows NT comes with subsystems to support the following APIs:

Win32 Subsystem. The Win32 subsystem provides the Win32 API. The applications conforming to the Win32 API are supposed to run unaltered on all the 32-bit platforms provided by Microsoft, that is, Windows NT, Windows 95, and Win32s. Unfortunately, as you will see later in this book, this is not always the case.

WOW Subsystem. The Windows on Windows (WOW) subsystem provides backward compatibility to 16-bit Windows applications, enabling Win16 applications to run on Windows NT. These applications can run on Windows NT unless they use some of the undocumented API functions from Windows 3.1 that are not defined in Windows NT.

NTVDM Subsystem. The NT Virtual DOS Machine (NTVDM) provides a text-based environment where DOS applications can run.

OS/2 Subsystem. The OS/2 subsystem enables OS/2 applications to run. WOW, NTVDM, and OS/2 are available only on Intel platforms because they provide binary compatibility to applications. One cannot run the executable files or binary files created for one type of processor on another type of processor because of the differences in machine code format.

POSIX Subsystem. The POSIX subsystem provides API compliance to the POSIX standard.

The applications are unaware of the fact that the API calls invoked are processed by the corresponding subsystem. This is hidden from the applications by the respective client-side DLLs for each subsystem. The client-side DLL translates the API call into a local procedure call (LPC). LPC is similar to the remote procedure call (RPC) facility available on networked Unix machines. Using RPC, a client application can invoke a function residing in a server process running on another machine over the network. LPC is optimized for the client and the server running on the same machine.

THE WIN32 SUBSYSTEM

The Win32 subsystem is the most important subsystem. Other subsystems such as WOW and OS/2 are provided mainly for backward compatibility, while the POSIX subsystem is very restrictive in functionality. (For example, POSIX applications do not have access to any network that exists.) The Win32 subsystem is important because it controls access to the graphics device.
In addition, the other subsystems are actually Win32 applications that use the Win32 API to provide their own different APIs. In essence, all the subsystems are based on the core Win32 subsystem. The Win32 subsystem in Windows NT 3.51 contains the following components:

CSRSS.EXE. This is the user mode server process that serves the USER and GDI calls.

Note: Traditionally, Windows API calls are classified as USER/GDI calls and kernel calls. The majority of USER/GDI functions are related to the graphical user interface (GUI) and reside in USER.DLL under Windows 3.x. The kernel functions are related to non-GUI operating system services such as file system management and process management and reside in KERNEL.EXE under Windows 3.x.

KERNEL32.DLL. The KERNEL.EXE in Windows 3.1 has changed to KERNEL32.DLL in Windows NT. This is more than a change in name. KERNEL.EXE contained all the kernel code for Windows 3.1, while KERNEL32.DLL contains just the stub functions. These stub functions call the corresponding NTDLL.DLL functions, which in turn invoke system call code in the kernel.

USER32.DLL. This is another client-side DLL for the Win32 subsystem. The majority of the functions in USER32.DLL are stub functions that convert the function call to an LPC for the server process.

GDI32.DLL. The function calls related to the graphical device interface are handled by another client-side DLL for the Win32 subsystem. The functions in GDI32.DLL are similar to those in USER32.DLL in that they are just stubs invoking LPCs for the server process.

Under Windows NT 4.0 and Windows 2000, the functionality of CSRSS is moved into a kernel mode driver (WIN32K.SYS), and USER32 and GDI32 use the system call interface to call the services in WIN32K.SYS.

The Core

We have to resort to new terminology for explaining the kernel component of the Windows NT operating system. Generally, the part of an operating system that runs in privileged mode is called the kernel. The Windows NT design team strove to achieve a structured design for the operating system. The privileged-mode component of Windows NT is also designed in a layered fashion. A layer uses only the functions provided by the layer below itself. The main layers in the Windows NT core are the HAL, the kernel, and the NT executive. Because one of the layers running in privileged mode is itself called the kernel, we had to come up with a new term that refers to all these layers together. We'll refer to it as the core of Windows NT.

Note: Most modern microprocessors run in at least two modes: normal and privileged. Some machine instructions can be executed only when the processor is in privileged mode. Also, some memory areas can be marked as accessible in privileged mode only. Operating systems use this feature of the processors to implement a secure operating environment for multitasking. The user processes run in normal (nonprivileged) mode, and the operating system kernel runs in privileged mode. Thus, the operating system ensures that user processes cannot harm the operating system.
This division of the Windows NT core into layers is logical. Physically, only the HAL comes as a separate module. The kernel, NT executive, and the system call layer are all packed in a single NTOSKRNL.EXE (or NTKRNLMP.EXE, for multiprocessor systems). Though they are considered part of the NT executive in this chapter, the device drivers (including the file system drivers) are separate driver modules and are loaded dynamically.

THE HAL

The lowest of the aforementioned layers is the hardware abstraction layer, which deals directly with the hardware of the machine. The HAL, as its name suggests, hides hardware idiosyncrasies from the layers above it. As we mentioned previously, Windows NT is a highly portable operating system that runs on DEC Alpha, MIPS, and PowerPC, in addition to Intel machines. Along with the processor, the other aspects of a machine, such as the bus architecture, interrupt handling, and DMA management, also change. The HAL.DLL file contains the code that hides the processor- and machine-specific details from other parts of the core. The kernel component of the core and the device drivers use the HAL interface functions. Thus, only the HAL code changes from platform to platform; the rest of the core code that uses the HAL interface is highly portable.

THE KERNEL

The kernel of Windows NT offers very primitive but essential services such as multiprocessor synchronization, thread scheduling, interrupt dispatching, and so on. The kernel is the only core component that cannot be preempted or paged out. All the other components of the Windows NT core are preemptive. Hence, under Windows NT, one can find more than one thread running in privileged mode. Windows NT is one of the few operating systems in which the core is also multithreaded.

A very natural question to ask is, "Why is the kernel nonpreemptive and nonpageable?" Actually, you can page out the kernel, but a problem arises when you page in. The kernel is responsible for handling page faults and bringing in the required pages in memory from secondary storage. Hence, the kernel itself cannot be paged out, or rather, it cannot be paged in if it is paged out. The same problem prevents the disk drivers supporting the swap space from being pageable. As the kernel and the device drivers use the HAL services, naturally, the HAL is also nonpreemptive.

THE NT EXECUTIVE

The NT executive constitutes the majority of the Windows NT core. It sits on top of the kernel and provides a complex interface to the outside world. The executive is designed in an object-oriented manner. The NT executive forms the part of the Windows NT core that is fully preemptive. Generally, the core components added by developers form a part of the NT executive, or rather, the I/O Manager. Hence, driver developers should always keep in mind that their code has to be fully preemptive.
The NT executive can further be subdivided into separate components that implement different operating system functionality. The various components of the executive are described in the following sections.

THE OBJECT MANAGER

Windows NT is designed in an object-oriented fashion. Windows, devices, drivers, files, mutexes, processes, and threads have one thing in common: All of them are treated as objects. In simpler terms, an object is the data bundled with the set of methods that operate on this data. The Object Manager makes the task of handling objects much easier by implementing the common functionality required to manage any type of object. The main tasks of the Object Manager are as follows:

Memory allocation/deallocation for objects.

Object name space maintenance. The Windows NT object name space is structured as a tree, just like a file system directory structure. An object name is composed of the entire directory path, starting from the root directory. The Object Manager is responsible for maintaining this object name space. Unrelated processes can access an object by getting a handle to it using the object name.

Handle maintenance. To use an object, a process opens the object and gets back a handle. The process can use this handle to perform further operations on the object. Each process has a handle table that is maintained by the Object Manager. A handle table is nothing more than an array of pointers to objects; a handle is just an index in this array. When a process refers to a handle, the Object Manager gets hold of the actual object by indexing the handle in the handle table.

Reference count maintenance. The Object Manager maintains a reference count for objects, and automatically deletes an object when the corresponding reference count drops to zero. The user mode code accesses objects via handles, while the kernel mode code uses pointers to directly access objects. The Object Manager increments the object reference count for every handle pointing to the particular object. The reference count is decremented whenever a handle to the object is closed. Whenever the kernel mode code references an object, the reference count for that object is incremented. The reference count is decremented as soon as the kernel mode code is finished accessing the object.

Object security. The Object Manager also checks whether a process is allowed to perform a certain operation on an object. When a process creates an object, it specifies the security descriptor for that object. When another process tries to open the object, the Object Manager verifies whether the process is allowed to open the object in the specified mode. The Object Manager returns a handle to the object if the open request succeeds.
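
The handle and reference count bookkeeping described above can be modeled in a few lines. The following is a simplified sketch in Python, not the Object Manager's actual data structures: the class names, the `reference` and `dereference` helpers, and the object name are all hypothetical, and security checks are left out.

```python
class KernelObject:
    """Hypothetical object header: a name plus a reference count."""

    def __init__(self, name):
        self.name = name
        self.ref_count = 0
        self.deleted = False


def reference(obj):
    """Model of kernel mode code taking a direct pointer reference."""
    obj.ref_count += 1
    return obj


def dereference(obj):
    """Drop one reference; the object is deleted when the count hits zero."""
    obj.ref_count -= 1
    if obj.ref_count == 0:
        obj.deleted = True


class HandleTable:
    """Per-process table: a handle is just an index into this list."""

    def __init__(self):
        self._slots = []

    def open(self, obj):
        reference(obj)                      # every open handle holds one reference
        if None in self._slots:             # reuse a closed slot if one exists
            handle = self._slots.index(None)
            self._slots[handle] = obj
        else:
            self._slots.append(obj)
            handle = len(self._slots) - 1
        return handle

    def lookup(self, handle):
        if not 0 <= handle < len(self._slots) or self._slots[handle] is None:
            raise KeyError(f"invalid handle {handle}")
        return self._slots[handle]

    def close(self, handle):
        obj = self.lookup(handle)
        self._slots[handle] = None
        dereference(obj)


# Two unrelated processes open the same named object.
event = KernelObject("\\Demo\\Event")       # hypothetical object name
process_a, process_b = HandleTable(), HandleTable()
handle_a = process_a.open(event)
handle_b = process_b.open(event)
```

Note that both processes may receive the same handle value for the same object, because each handle is meaningful only within its own process's table. The object survives until both handles are closed and no kernel mode reference remains.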
As described earlier, a handle is simply an index in a per-process table that has pointers to actual objects. The mode in which the open request on an object is granted is stored in the handle table along with the object pointers. Later, when the process tries to access the object using the handle, the Object Manager ensures that proper access rights are associated with the handle.

THE I/O MANAGER

The I/O Manager controls everything related to input and output. It provides a framework that all the I/O-related modules (device drivers, file systems, Cache Manager, and network drivers) must adhere to.

Device Drivers. Windows NT supports a layered device driver model. The I/O Manager defines a common interface that all the device drivers need to provide. This ensures that the I/O Manager can treat all the devices in the same manner. Also, device drivers can be layered, and a device driver can expect the same interface from the driver sitting below it. A typical example of layering is the device driver stack to access a hard disk. The lowest-level driver can talk in terms of sectors, tracks, and sides. There may be a second layer that can deal with hard disk partitions and provide an interface for dealing with logical block numbers. The third layer can be a volume manager driver that can club several partitions into volumes. Finally, a file system driver that provides an interface to the outside world can sit on top of the volume manager.

File Systems. File systems are also coded as loadable device drivers under Windows NT. Consequently, a file system can be stacked on top of a disk device driver. Also, multiple file systems can be layered in such a manner that each layer adds to the functionality. For example, a replication file system can be layered on top of a normal disk file system. The replication file system need not implement the code for on-disk structure modifications.

Cache Manager. In her book Inside Windows NT, Helen Custer considers the Cache Manager part of the I/O Manager, though the Cache Manager does not adhere to the device driver interface. The Cache Manager is responsible for ensuring faster file read/write response. Though hard disk speeds are increasing, reading/writing to a hard disk is much slower than reading/writing to RAM. Hence, most operating systems cache the file data in RAM to satisfy the read requests without needing to read the actual disk block. Also, a write request can be satisfied without actually writing to the disk. The actual block write happens when system activity is low. This technique is called delayed write. Another technique, called read ahead, improves response time. In this technique, the operating system guesses the disk blocks that will be read in the future, depending on the access patterns. These blocks are read even before they are requested. The Cache Manager uses the memory mapping features of the Virtual Memory Manager to implement caching.

Network Drivers. The network drivers have an interface standard different from regular device drivers. The network card drivers stick to the network driver interface specification (NDIS) standard. The drivers providing the transport-level interface are layered above the network card drivers and provide the transport driver interface (TDI).

THE SECURITY REFERENCE MONITOR

The Security Reference Monitor is responsible for validating a process's access permissions against the security descriptor of an object. The Object Manager uses the services of the Security Reference Monitor while validating a process's request to access any object.

THE VIRTUAL MEMORY MANAGER

An operating system performs two essential tasks:

1. It provides a virtual machine, which is easy to program, on top of raw hardware, which is cumbersome to program. For example, an operating system provides services to access and manipulate files. Maintaining data in files is much easier than maintaining data on a raw hard disk.

2. It allows the applications to share the hardware in a transparent way. For example, an operating system provides applications with a virtual view of the CPU, where the CPU is exclusively allotted to the application. In reality, the CPU is shared by various applications, and the operating system acts as an arbitrator.

These two tasks are performed by the Virtual Memory Manager component of the operating system when it comes to the hardware memory. Modern microprocessors need an intricate data structure setup (for example, the segment table setup or the page table setup) for accessing the memory. The Virtual Memory Manager performs this task for you, which makes life easier. Furthermore, the Virtual Memory Manager enables the applications to share the physical memory transparently. It presents each application with a virtual address space where the entire address space is owned by the application.
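
As a rough illustration of how separate page tables give each application its own view of memory, the sketch below models two address spaces in Python. It is a toy model under stated assumptions, not the Virtual Memory Manager's real structures: the `AddressSpace` class, the 4 KB page size, and the frame numbers are all chosen for the example.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


class AddressSpace:
    """Toy per-process page table: virtual page number -> physical frame number."""

    def __init__(self, name):
        self.name = name
        self.page_table = {}

    def map_page(self, virtual_page, frame):
        self.page_table[virtual_page] = frame

    def translate(self, virtual_address):
        """Return the physical address behind a virtual address."""
        page, offset = divmod(virtual_address, PAGE_SIZE)
        if page not in self.page_table:
            raise LookupError(f"{self.name}: virtual page {page:#x} is not mapped")
        return self.page_table[page] * PAGE_SIZE + offset


# Both applications use the same virtual page, backed by different frames.
editor = AddressSpace("editor")
compiler = AddressSpace("compiler")
editor.map_page(0x10, frame=7)
compiler.map_page(0x10, frame=3)
```

Here `editor.translate(0x10123)` and `compiler.translate(0x10123)` resolve to different physical addresses (`0x7123` and `0x3123`), so each application can behave as if it owned the whole address range without touching the other's memory.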

The virtual memory concept is one of the key concepts in modern operating systems. The idea behind it is as follows. If the operating system loads the entire program in memory while executing it, the size of the program is severely constrained by the size of physical memory. A very straightforward solution to the problem is not to load the entire program in memory at one time, but to load portions of it as and when required. A fact that supports this solution is the locality of reference phenomenon.

Note: A process accesses only a small number of adjacent memory locations, if one considers a small time frame. This is even more pronounced because of the presence of looping constructs. In other words, the access is localized to a small number of memory pages, which is the reason it is called locality of reference.

The operating system needs to keep only the working set of a process in memory. The rest of the address space of the process is supported by the swap space on the secondary storage. The Virtual Memory Manager is responsible for bringing in the pages from the secondary storage to the main memory when the process accesses a paged-out memory location. The Virtual Memory Manager is also responsible for providing a separate address space for every process so that no process can hamper the behavior of any other process. In addition, it provides shared memory support and memory-mapped files. The Cache Manager uses the memory-mapping interface of the Virtual Memory Manager.

Note: A working set is the set of memory pages that needs to be in memory for a process to execute without incurring too many page faults. A page fault is the hardware exception received by the operating system when an attempt is made to access a paged-out memory location.

THE PROCESS MANAGER

The Process Manager is responsible for creating processes and threads. Windows NT makes a very clear distinction between processes and threads. A process is composed of the memory space along with various objects (such as files, mutexes, and others) opened by the process and the threads running in the process. A thread is simply an execution context, that is, the CPU state (especially the register contents). A process has one or more threads running in it.

THE LOCAL PROCEDURE CALL FACILITY

The local procedure call (LPC) facility is specially designed for the subsystem communication. LPC is based on remote procedure call (RPC), which is the de facto Unix standard for communication between processes running on two different machines. LPC has been optimized for communication between processes running on the same machine. As discussed earlier, the LPC facility is used as the communication mechanism between the subsystems and their client processes. A client thread invokes LPC when it needs some service from the subsystem. The LPC mechanism passes on the parameters for the service invocation to the server thread. The server thread executes the service and passes the results back to the client thread using the LPC facility.

WIN32K.SYS: A Core Architecture Modification

In Windows NT 3.51, the KERNEL32.DLL calls are translated to system calls via NTDLL.DLL, while the GDI and user calls are passed on to the Win32 subsystem process. Windows NT 4.0 has maintained more or less the same architecture as Version 3.51. However, there is a major modification in the core architecture (apart from the completely revamped GUI). In Windows NT 4.0, Microsoft moved the entire Win32 subsystem to the kernel space in an attempt to improve performance. A new device driver, WIN32K.SYS, implements the Win32 API, and API calls are translated as system calls instead of LPCs. These system calls invoke the functions in the new WIN32K.SYS driver.

Moving the services out of the subsystem process avoids the context switches required to process a service request. In Windows NT 3.51, each call to the Win32 subsystem involves two context switches: one from the client thread to the subsystem thread, and the second from the subsystem thread back to the client thread. Windows 2000 also continues with the kernel implementation of the Win32 subsystem. As you will see in Chapter 8, in Windows NT 3.51 the Win32 subsystem uses quick LPC, which is supposed to be much faster than regular LPC. Still, two context switches per GDI/user call is quite a bit of overhead. In Windows NT 4.0 and Windows 2000, the GDI/user calls are processed by the kernel mode driver in the context of the calling thread, thus avoiding the context switching overheads.

THE SYSTEM CALL INTERFACE

The system call interface is a very thin layer whose only job is to direct the system call requests from the user mode processes to appropriate functions in the Windows NT core. Though the layer is quite thin, it is very important because it is the face of the core (kernel mode) component of Windows NT that the outside user-mode world sees. The system call interface defines the services offered by the core.
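
One common way to build such a thin dispatch layer is a table of core routines indexed by a numeric service ID. The sketch below models that idea in Python; it is an illustration only, and the service routines, their IDs, and the `dispatch` and `stub_close` functions are hypothetical names rather than real Windows NT system services. A plain function call stands in for the switch from user mode to privileged mode.

```python
def core_close(handle):
    """Hypothetical core routine."""
    return f"closed handle {handle}"


def core_query_time():
    """Hypothetical core routine that takes no arguments."""
    return "time queried"


# The position of a routine in this table is its service ID.
SERVICE_TABLE = [core_close, core_query_time]


def dispatch(service_id, user_stack_args):
    """Model of the dispatcher: validate the ID, copy the arguments, call the core."""
    if not 0 <= service_id < len(SERVICE_TABLE):
        raise ValueError(f"invalid system service ID {service_id}")
    kernel_args = tuple(user_stack_args)    # copy the arguments before using them
    return SERVICE_TABLE[service_id](*kernel_args)


def stub_close(handle):
    """Model of a user mode stub: it supplies the service ID and the arguments."""
    return dispatch(0, [handle])
```

Because callers name a service only by its ID, the set of entries in the table is exactly the set of services the core exposes, and the routines behind the IDs can change without affecting user mode code.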
The key portion of the system call interface is to change the processor mode from user mode to privileged mode. On Intel platforms, this can be achieved through software interrupts. Windows NT uses the software interrupt 2Eh to implement the system call interface. The handing routine for interrupt 2Eh passes on the control to the appropriate routine in the core component, depending on the requested system service ID. NTDLL.DLL is the user mode component of the system call interface. The user mode programs call NTDLL.DLL functions (through KERNEL32.DLL functions). The NTDLL.DLL functions are stub routines that set up appropriate parameters and trigger interrupt 2Eh.. The stub functions in NTDLL.DLL also pass the system service ID to the interrupt 2Eh handler. The interrupt handler indexes the service ID in the system call table to get to the core function that fulfills the requested system service. The interrupt handler calls this core function after copying the required parameters from the user mode stack to the kernel mode stack. SUMMARY In this chapter, we discussed the overall architecture of Windows NT. Windows NT architecture is robust in the areas of portability, extensibility, compatibility, and maintainability. Features such as security, symmetric multiprocessor support, and international language support

13 position the Windows NT operating system on the high end of the scale compared to Windows 95. The subsystems that run in user mode and the Windows NT core that runs in kernel mode make up the operating system environment. The Win32 subsystem is the most important of the environment subsystems. The Win32 subsystem comprises the client-side DLLs and the CSRSS process. The Win32 subsystem implements the Win32 API atop the native services provided by the Windows NT core. The Windows NT core comprises the hardware abstraction layer (HAL), the kernel, the Windows NT executive, and the system call interface. The NT executive, which forms a major portion of the NT core, consists of the Object Manager, the I/O Manager, the Security Reference Monitor, the Virtual Memory Manager, the Process Manager, and the local procedure call (LPC) facility. The chapters that follow cover the main components of the Windows NT operating system in detail.
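
The table-driven dispatch performed by the interrupt 2Eh handler, as described under THE SYSTEM CALL INTERFACE, can be illustrated with a small user-mode simulation. This is only a sketch under stated assumptions: the sim_ names, the two sample services, and the status values are all hypothetical stand-ins and do not correspond to the real NT service table or its entries. The sketch shows the three steps the text names: validate the service ID, copy the parameters from the caller's stack to a separate stack, and call the function found in the table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status codes; not the real NTSTATUS values. */
typedef int32_t sim_status;
#define SIM_STATUS_SUCCESS          0
#define SIM_STATUS_INVALID_SERVICE  (-1)
#define SIM_MAX_ARGS                4

/* Every simulated core service receives its parameters as an array. */
typedef sim_status (*sim_service_fn)(const uint32_t *args);

typedef struct {
    sim_service_fn handler;   /* core function that fulfills the service */
    size_t         arg_count; /* number of parameters to copy            */
} sim_service_entry;

static sim_status sim_service_noop(const uint32_t *args)
{
    (void)args;
    return SIM_STATUS_SUCCESS;
}

static sim_status sim_service_add(const uint32_t *args)
{
    return (sim_status)(args[0] + args[1]);
}

/* The "system call table": the service ID is an index into this array. */
static const sim_service_entry sim_service_table[] = {
    { sim_service_noop, 0 },  /* service ID 0 */
    { sim_service_add,  2 },  /* service ID 1 */
};
#define SIM_SERVICE_COUNT \
    (sizeof sim_service_table / sizeof sim_service_table[0])

/* Plays the role of the interrupt handler: look up, copy, call. */
sim_status sim_dispatch(uint32_t service_id, const uint32_t *user_stack)
{
    uint32_t kernel_stack[SIM_MAX_ARGS] = { 0 };
    const sim_service_entry *entry;
    size_t i;

    if (service_id >= SIM_SERVICE_COUNT)
        return SIM_STATUS_INVALID_SERVICE;

    entry = &sim_service_table[service_id];
    assert(entry->arg_count <= SIM_MAX_ARGS);

    /* Copy the parameters from the caller's stack before the call. */
    for (i = 0; i < entry->arg_count; i++)
        kernel_stack[i] = user_stack[i];

    return entry->handler(kernel_stack);
}
```

A stub in the style of the NTDLL.DLL routines would place its parameters in an array and call sim_dispatch with the service ID, for example sim_dispatch(1, args) for the sample add service.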

Chapter 2 Writing Windows NT Device Driver

Abstract

This chapter covers the software requirements for building Windows NT device drivers, the procedure for building device drivers, and the structure of a typical device driver.

MOST OF THE SAMPLES IN this book are Windows NT kernel mode device drivers. This chapter contains the information you need to build device drivers and understand the samples in this book. This chapter is not a complete guide to writing device drivers. The best sources of information for detailed coverage of the topic are Art Baker's The Windows NT Device Driver Book: A Guide for Programmers and the documentation that ships with the Windows NT Device Driver Kit (DDK).

PREREQUISITES TO WRITING NT DEVICE DRIVERS

You must install the following tools to create a working development environment for Windows NT kernel mode device drivers:

Windows NT Device Driver Kit (DDK) from Microsoft: For the development of device drivers, you need to install the Device Driver Kit on your machine. The Device Driver Kit is available with the MSDN Level 2 subscription. The kit consists of sets of header files, libraries, and tools that enable easy development of device drivers.

32-bit compiler: You need a 32-bit compiler to compile the device drivers. We strongly recommend using the Microsoft compiler to build the samples in this book.

Win32 Software Development Kit (SDK): Although it is not necessary for compiling the samples from this book, we recommend installing the latest version of the Win32 SDK on your machine. Also, when you build device drivers using the DDK tools, you should set the environment variable MSTOOLS to point to the location where the Win32 SDK is installed. You can fake the installation of the Win32 SDK by adding the environment variable MSTOOLS with the System applet in the Control Panel.

DRIVER BUILD PROCEDURE

The Windows NT 4.0 Device Driver Kit installation adds four shortcuts to the Start menu: Free Build Environment, Checked Build Environment, DDK Help, and Getting Started.
The Free Build Environment and Checked Build Environment shortcuts both refer to a batch file called SETENV.BAT, but have different command line arguments. Assuming that the DDK is installed in directory E:\DDK40, the Free Build Environment shortcut refers to this command line:

    %SystemRoot%\System32\cmd.exe /k E:\DDK40\bin\setenv.bat E:\DDK40 free

The Checked Build Environment shortcut, on the other hand, refers to this command line:

    %SystemRoot%\System32\cmd.exe /k E:\DDK40\bin\setenv.bat E:\DDK40 checked

Both shortcuts spawn CMD.EXE and ask it to execute the SETENV.BAT file with the appropriate parameters. After executing the command, CMD.EXE keeps running because of the /k switch. The SETENV.BAT file sets environment variables, including BUILD_DEFAULT, BUILD_DEFAULT_TARGETS, BUILD_MAKE_PROGRAM, and DDKBUILDENV, which are added to the CMD.EXE process's environment variable list. The DDK tools, which are spawned from CMD.EXE, refer to these environment variables.

The drivers are compiled using the utility called BUILD.EXE, which is shipped with the DDK. This utility takes as input a file named SOURCES. This file contains the list of source files to be compiled to build the driver. It also contains the name of the target executable, the type of the target executable (for example, DRIVER or PROGRAM), and the path of the directory where the target executable is to be created.

Each sample device driver included with the DDK contains a makefile. However, this is not the actual makefile for the device driver sample. Instead, the makefile for each sample device driver includes a common makefile, named MAKEFILE.DEF, which is present in the INC directory of the DDK installation directory. Here is the makefile from a DDK sample:

    #
    # DO NOT EDIT THIS FILE!!! Edit .\sources. if you want to add a new source
    # file to this component. This file merely indirects to the real make file
    # that is shared by all the driver components of the Windows NT DDK
    #
    !INCLUDE $(NTMAKEENV)\makefile.def

Some of the driver samples in this book have Assembly language files (.ASM files). You cannot refer to the .asm files directly in the SOURCES file. Instead, you have to create a directory called I386 in the directory where the source files for the driver are kept. All the .asm files for the driver must be kept in the I386 directory. The BUILD.EXE utility automatically uses ML.EXE to compile these .asm files.
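
To make the role of the SOURCES file concrete, here is a minimal sketch of what such a file might look like for a single-file driver. The driver name, the source file name, and the path values are placeholders chosen for illustration and are not taken from any sample in this book; only the keywords themselves (TARGETNAME, TARGETPATH, TARGETTYPE, INCLUDES, and SOURCES) are the ones BUILD.EXE reads.

```text
# Hypothetical SOURCES file for a driver built from one C file.
# "mydriver", "mydriver.c", and the paths are illustrative placeholders.
TARGETNAME=mydriver
TARGETPATH=obj
TARGETTYPE=DRIVER
INCLUDES=..\include
SOURCES=mydriver.c
```

The entries correspond to the items listed above: the name of the target executable, the directory where it is created, its type, and the list of source files to compile.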
BUILD.EXE generates the appropriate driver or application based on the settings specified in the SOURCES file and the platform-dependent environment variables. If there are any errors during the BUILD process, the errors are logged to a file called BUILD.ERR. If there are any warnings, they are logged to the BUILD.WRN file. Also, the BUILD utility generates a file called BUILD.LOG, which contains the list of commands invoked by the BUILD utility and the messages given by these tools.

STRUCTURE OF A DEVICE DRIVER

Just as every Win32 application has an entry point (main/WinMain), every kernel mode device driver has an entry point called DriverEntry. A special process called SYSTEM loads the device drivers. Hence, the DriverEntry of each device driver is called in the context of the SYSTEM process. Each device driver is represented by a device name in the system, so each driver has to create a device name for its device. This is done with the IoCreateDevice function. If Win32 applications need to open a handle to a device driver, the driver needs to create a symbolic link for its device in the DosDevices object directory. This is done using a call to IoCreateSymbolicLink. Typically, in the DriverEntry routine of a device driver, the device object and the symbolic link object are created for a device and some driver-specific or device-specific initialization is performed.

Most of the device driver samples in this book involve pseudo device drivers. These drivers do not control any physical device. Instead, they complete tasks that can be performed only from a device driver. (The device driver runs at the most privileged mode of the processor, Ring 0 in Intel processors.) In addition, DriverEntry is supposed to provide sets of entry points for other functions, such as OPEN, CLOSE, DEVICEIOCONTROL, and so on. These entry points are provided by filling in some fields in the driver object, which is passed as a parameter to the DriverEntry function.

Because most of the drivers in this book are pseudo device drivers, the DriverEntry routine is the same for all of them. Only the driver-specific initialization is different. Instead of repeating the same piece of code in each of the driver samples, a macro is written.
The macro is called MYDRIVERENTRY:

    #define MYDRIVERENTRY(DriverName,DeviceId,DriverSpecificInit) \
        PDEVICE_OBJECT deviceObject=NULL; \
        NTSTATUS ntStatus; \
        WCHAR deviceNameBuffer[]=L"\\Device\\"##DriverName; \
        UNICODE_STRING deviceNameUnicodeString; \
        WCHAR deviceLinkBuffer[]=L"\\DosDevices\\"##DriverName; \
        UNICODE_STRING deviceLinkUnicodeString; \
        RtlInitUnicodeString(&deviceNameUnicodeString, deviceNameBuffer); \
        ntStatus = IoCreateDevice(DriverObject, 0, &deviceNameUnicodeString, \
                                  ##DeviceId, 0, TRUE, &deviceObject); \
        if (NT_SUCCESS(ntStatus)) { \
            RtlInitUnicodeString(&deviceLinkUnicodeString, deviceLinkBuffer); \
            ntStatus = IoCreateSymbolicLink(&deviceLinkUnicodeString, \
                                            &deviceNameUnicodeString); \
            if (!NT_SUCCESS(ntStatus)) { \
                IoDeleteDevice(deviceObject); \
                return ntStatus; \
            } \
            ntStatus=##DriverSpecificInit; \
            if (!NT_SUCCESS(ntStatus)) { \
                IoDeleteDevice(deviceObject); \
                IoDeleteSymbolicLink(&deviceLinkUnicodeString); \
                return ntStatus; \
            } \
            DriverObject->MajorFunction[IRP_MJ_CREATE] = \
            DriverObject->MajorFunction[IRP_MJ_CLOSE] = \
            DriverObject->MajorFunction[IRP_MJ_DEVICE_CONTROL] = DriverDispatch; \
            DriverObject->DriverUnload=DriverUnload; \
            return STATUS_SUCCESS; \
        } else { \
            return ntStatus; \
        }

The macro takes the following three parameters:

The first parameter is the name of the driver, which will be used for creating the device name and symbolic link.

The second parameter is the device ID, which uniquely identifies the device.

The third parameter is the name of the function that contains the driver-specific initialization.

The macro expands into calls to the necessary functions, such as IoCreateDevice and IoCreateSymbolicLink. If these functions succeed, the driver calls the driver-specific initialization function specified by the third parameter. If that function fails, the macro returns the error code of the driver-specific initialization function. If the function succeeds, the macro fills in the DriverObject with function pointers for the other functions supported by the driver. Once this macro is used in the DriverEntry function, you need to write the DriverDispatch and DriverUnload functions, as the macro refers to these functions. The macro definition can be found in UNDOCNT.H on the included CD-ROM.

All requests to a device driver are sent in the form of an I/O Request Packet (IRP). The driver expects the system to call the specific driver function for each device driver request based on the function pointers filled in during DriverEntry. In the following discussion, we assume that all the driver functions are filled in with the address of the DriverDispatch function.

The DriverDispatch function is called with an IRP containing the command code of IRP_MJ_CREATE whenever an application opens a handle to a device driver using the CreateFile API call. The DriverDispatch function is called with an IRP containing the command code of IRP_MJ_CLOSE whenever an application closes its handle to a device driver using the CloseHandle API function. The DriverDispatch function is called with an IRP containing the command code of IRP_MJ_DEVICE_CONTROL whenever the application uses the DeviceIoControl API function to send data to or receive data from a device driver. If the driver functionality is being used by multiple processes, the driver can use the CREATE and CLOSE entry points to perform per-process initialization.

Because all these requests end up calling DriverDispatch, you need a way to identify the actual function requested. You can accomplish this by looking at the MajorFunction field in the I/O Request Packet (IRP). The request packet contains the function code and any additional parameters required to complete the request.

The DriverUnload routine is called when the device driver is unloaded from the system. Just like DriverEntry, the DriverUnload function is called in the context of the SYSTEM process. Typically, in a DriverUnload routine, the device driver deletes the symbolic link and the device name created during DriverEntry and performs some device-specific uninitialization.

SUMMARY

In this chapter, we covered the software requirements for building Windows NT device drivers, the procedure for building device drivers, and the structure of a typical device driver. Along the way, we explained a simple macro that you can use to generate the driver entry code for a typical device driver.
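
The way a single DriverDispatch routine tells its requests apart, as described in this chapter, can be illustrated with a user-mode simulation. This is a sketch under stated assumptions, not DDK code: the sim_ types, the SIM_MJ_ codes, and the status values are hypothetical stand-ins for the real IRP structure, the IRP_MJ_ constants, and NTSTATUS. In a real driver the function code is read from the IRP's current I/O stack location rather than from a plain structure field.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the IRP_MJ_* codes defined by the DDK headers. */
enum sim_major {
    SIM_MJ_CREATE,
    SIM_MJ_CLOSE,
    SIM_MJ_DEVICE_CONTROL,
    SIM_MJ_OTHER
};

#define SIM_STATUS_SUCCESS        0
#define SIM_STATUS_NOT_SUPPORTED  (-1)

/* A drastically reduced request packet. */
typedef struct {
    enum sim_major major_function;  /* which entry point was requested */
    unsigned long  io_control_code; /* used only by DEVICE_CONTROL     */
    int            status;          /* completion status               */
} sim_irp;

/* State that a pseudo driver might keep. */
typedef struct {
    int           open_handles;
    unsigned long last_ioctl;
} sim_driver_state;

/* One routine serves all requests and switches on the function code. */
int sim_driver_dispatch(sim_driver_state *state, sim_irp *irp)
{
    assert(state != NULL && irp != NULL);

    switch (irp->major_function) {
    case SIM_MJ_CREATE:          /* application called CreateFile */
        state->open_handles++;
        irp->status = SIM_STATUS_SUCCESS;
        break;
    case SIM_MJ_CLOSE:           /* application called CloseHandle */
        state->open_handles--;
        irp->status = SIM_STATUS_SUCCESS;
        break;
    case SIM_MJ_DEVICE_CONTROL:  /* application called DeviceIoControl */
        state->last_ioctl = irp->io_control_code;
        irp->status = SIM_STATUS_SUCCESS;
        break;
    default:
        irp->status = SIM_STATUS_NOT_SUPPORTED;
        break;
    }
    return irp->status;
}
```

The CREATE and CLOSE cases are also where the per-process initialization and cleanup mentioned above would go.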

Chapter 3 Win32 Implementations: A Comparative Look

Abstract

This chapter covers the Win32 implementation on Windows 95/98 and Windows NT. The authors discuss the differences between these two implementations with respect to address space, process startup, toolhelp functions, multitasking, thunking, device drivers, security, and API calls.

EACH OPERATING SYSTEM provides sets of services, referred to as an application programming interface (API), to developers in some form or another. The developers write software applications using this API. For example, DOS provides this interface in the form of the famous INT 21h interface. Microsoft's newer 32-bit operating systems, such as Windows 95 and Windows NT, provide the interface in the form of the Win32 API. Presently, there are four Win32 API implementations available from Microsoft:

Windows 95/98
Windows NT
Win32S
Windows CE

Of these, Win32S is very limited due to bugs and the restrictions of the underlying operating system. Presently, the Win32 API implementations on Windows 95/98 and Windows NT are very popular among developers. Windows CE is meant for palmtop computers.

The Win32 API was first implemented on the Windows NT operating system. Later, the same API was made available in Windows 95. Ideally, an application written using the standard Win32 API should work on any operating system that supports the Win32 API implementation. (However, this is not necessarily true due to the differences between the implementations.) The Win32 API should hide all the details of the underlying implementations and provide a consistent view to the outside world.

In this chapter, we focus on the differences between the implementations of the Win32 API under Windows NT and Windows 95. As a developer, you should be aware of these differences while you develop applications that can run on both of these operating systems.

WIN32 API IMPLEMENTATION ON WINDOWS 95

The Win32 API is provided in the form of the famous trio of the KERNEL32, USER32, and GDI32 dynamic link libraries (DLLs).
However, in most cases, these DLLs are just wrappers that use generic thunking to call the 16-bit functions. Note: Generic thunking is a way of calling 16-bit functions from a 32-bit application. (More on thunking later in this chapter.)

The major design goal for Windows 95 was backward compatibility. Hence, instead of porting all the 16-bit functions to 32-bit, Microsoft decided to reuse the existing 16-bit code (from the Windows 3.x operating system) by wrapping it in 32-bit code. This 32-bit code would in turn call the 16-bit functions. This was a good approach because the tried-and-true 16-bit code was already running on millions of machines all over the world. In this Win32 API implementation, most of the functions from KERNEL32 thunk down to KRNL386, USER32 thunks down to USER.EXE, and GDI32 thunks down to GDI.EXE.

WIN32 API IMPLEMENTATION ON WINDOWS NT

On Windows NT also, the Win32 API is provided in the form of the famous trio of the KERNEL32, USER32, and GDI32 DLLs. However, this implementation is done completely from scratch without using any existing 16-bit code, so it is purely a 32-bit implementation of the Win32 API. Even 16-bit applications end up calling this 32-bit API. Windows NT's 16-bit subsystem uses universal thunking to achieve this.

Note: Universal thunking is a way of calling 32-bit functions from 16-bit applications. (More on thunking later in this chapter.)

KRNL386.EXE, USER.EXE, and GDI.EXE, which are used to support 16-bit applications, thunk up to KERNEL32, USER32, and GDI32 through the WOW (Windows on Windows) layer. Most of the functions provided by KERNEL32.DLL call one or more native system services to do the actual work. The native system services are available through a DLL called NTDLL.DLL.

XREF: All these system services are discussed in Chapter 6.

As far as USER32 and GDI32 are concerned, the implementation differs between NT version 3.51 and later versions. Under Windows NT 3.51, a separate subsystem process implements the USER32 and GDI32 calls. The DLLs USER32 and GDI32 contain stubs, which pass the function parameters to the Win32 subsystem (CSRSS.EXE) and get the results back.
The communication between the client application and the Win32 subsystem is achieved by using the local procedure call facility provided by the NT executive.

XREF: Chapter 8 covers the details of the local procedure call (LPC) mechanism.

Under Windows NT 4.0 and Windows 2000, USER32 and GDI32 call the system services provided by a kernel-mode device driver called WIN32K.SYS. USER32 and GDI32 contain stubs that call these system services using the 2Eh interrupt. Hence, most of the functionality of the Win32 subsystem process (CSRSS.EXE) is taken over by the kernel-mode driver (WIN32K.SYS). The CSRSS process still exists in NT 4.0 and Windows 2000; however, its role is limited mainly to supporting console I/O.

It is interesting to note that the Win32 API completely hides NTDLL.DLL from the developer. Actually, most of the functions provided by the Win32 API ultimately call one or more system services. This system service layer is very powerful and often contains functions that do not have equivalent Win32 API functions. Most of the Windows NT Resource Kit utilities link to this DLL implicitly.

WIN32 IMPLEMENTATION DIFFERENCES

Now we will consider a few aspects of the Win32 API implementation on Windows NT and Windows 95 that might affect the way developers program using this so-called standard Win32 API.

Address Space

Both Windows 95 and Windows NT deal with flat, 32-bit linear addresses that give 4GB of virtual address space. Of this, the upper 2GB (hereafter referred to as the shared address space) is reserved for operating system use, and the lower 2GB (hereafter referred to as the private address space) is used by the running process. Each process has its own private address space. Although the virtual addresses in the private address space of all processes are the same, they may point to different physical pages. The addresses in the shared address space of all the processes point to the same physical pages.

Under Windows 95/98, the operating system DLLs, such as KERNEL32, USER32, and GDI32, reside in the shared address space, whereas in Windows NT these DLLs are loaded in the process's private address space. Hence, under Windows 95/98, it is possible for one application to interfere with the working of another application. For example, one application can accidentally overwrite memory areas occupied by these DLLs and affect the working of all the other processes.

Note: Although the shared address space is protected at the page table level, a kernel-mode component (for example, a VXD) is able to write at any location in the 4GB address space.
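
The 2GB/2GB split described above can be expressed as a simple test on a 32-bit linear address. The following is a minimal sketch that assumes exactly the simple split given in the text, with the boundary at 80000000h; the type and function names are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* First address of the upper 2GB, per the split described in the text. */
#define SHARED_SPACE_BASE 0x80000000u

typedef enum { REGION_PRIVATE, REGION_SHARED } address_region;

/* Lower 2GB: private to the process. Upper 2GB: shared. */
address_region classify_address(uint32_t linear_address)
{
    return linear_address >= SHARED_SPACE_BASE ? REGION_SHARED
                                               : REGION_PRIVATE;
}

/* Human-readable name for a region, for logging or diagnostics. */
const char *region_name(address_region region)
{
    switch (region) {
    case REGION_PRIVATE: return "private";
    case REGION_SHARED:  return "shared";
    }
    assert(!"unknown address_region value");
    return "unknown";
}
```

Under this model, an address at which a system DLL is loaded would classify as shared on Windows 95/98 and as private on Windows NT, which is the difference the surrounding discussion turns on.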
In addition, under Windows 95/98, it is possible to load a dynamic link library in the shared address space. Such a DLL will have the same problem described previously if it is used by multiple applications in the system.

Windows NT loads all the system DLLs, such as KERNEL32, USER32, and GDI32, in the private address space. As a result, it is never possible for one application to interfere with the other applications in the system without intending to do so. If one application accidentally overwrites these DLLs, it will affect only that application. Other applications will continue to run without any problems.

Memory-mapped files are loaded in the shared address space under Windows 95/98, whereas they are loaded in the private address space in Windows NT. In Windows 95/98, it is possible for one application to create and map a memory-mapped file, pass its address to another application, and have the other application use this address to share memory. This is not possible under Windows NT. You have to explicitly create and map a named memory-mapped file in one application and open and map the memory-mapped file in another application in order to share it.

The address space differences have a strong impact on global API hooking. The topic of global API hooking has been covered many times in different articles and books. There is still no common API hooking solution for both Windows NT and Windows 95/98. Global API hooking is straightforward under Windows 95/98 because it is possible to load a DLL in shared memory, and all the system DLLs reside in shared memory. Hooking an API call amounts to patching the first few instructions at the start of the function and routing execution to a function in a shared DLL using a simple JMP instruction. This does not work under Windows NT because if you patch the bytes at the start of the function, they will be patched only in your address space, as the function resides in the private address space. To do any kind of global API hooking under Windows NT, you have to make sure that the hooking is performed in each of the running processes. For this, you need to play with the address space of other processes. In addition, the same hooking also needs to be done in newly started processes. Windows NT provides a way to automatically load a particular DLL in each process through the AppInit_DLLs registry value.

Process Startup

There are several differences in the way a process is started under Windows 95/98 and Windows NT. Although the same CreateProcess API call is used in Windows 95/98 and Windows NT, the implementation is quite different. In this chapter, we look only at the CreateProcess API call as an example.
Ideally, both of the CreateProcess implementations should give the same view to the outside world. When somebody says that a particular API call is standard, this means that given a specific set of parameters to a function, the function should behave exactly the same on all the implementations of this API call. In addition, the function should return the same error codes based on the type of error.

Consider a simple problem such as detecting the successful start of an application. If you try to spawn a program that has some startup problem (for example, implicitly linked DLLs are missing), it should return an appropriate error code. The Windows 95/98 implementation returns an appropriate error code such as STATUS_DLL_NOT_FOUND, whereas Windows NT does not return any error. Windows NT's implementation will return an error only if the file spawned is not present at the expected location.

This happens mainly because of the way the CreateProcess call is implemented under Windows NT and Windows 95/98. When you spawn a process in Windows 95/98, the complete loading and startup of the process is performed as part of the CreateProcess call itself. That is, when the CreateProcess call returns, the spawned process is already running.

It is interesting to see Windows NT's implementation of the CreateProcess call. Windows NT's CreateProcess calls the native system service (NtCreateProcess) to create a process object. As part of this call, NTDLL.DLL is mapped in the process's address space. Then, the CreateProcess API calls the native system service to create the primary thread in the process (NtCreateThread). The loading of implicitly linked DLLs does not happen as part of the CreateProcess API call. Instead, the primary thread of the process starts at a function in NTDLL.DLL. This function in turn loads the implicitly linked DLLs. As a result, there is no way for the caller to know whether the process has started properly or not. Of course, for GUI applications, you can use WaitForInputIdle to synchronize with the startup of a process. However, for non-GUI applications, there is no standard way to achieve this.

Toolhelp Functions

The Win32 implementation on Windows 95/98 provides some functions that enable you to enumerate the processes running in the system, the module list, and so on. These functions are provided by KERNEL32.DLL. The functions are CreateToolhelp32Snapshot, Process32First, Process32Next, and others. These functions are not implemented under Windows NT's implementation of KERNEL32. Programs that use these functions implicitly will not start at all under Windows NT. The Windows NT 4.0 SDK comes with a new DLL called PSAPI.DLL, which provides the equivalent functionality. The header file for this DLL, PSAPI.H, is also included with the Windows NT 4.0 SDK. Windows 2000 has this toolhelp functionality built into KERNEL32.DLL.

Note: A function is implicitly linked if the program calls the function directly by name and includes the appropriate .lib file in the project. That is, it does not use GetProcAddress to get the address of the function.

Multitasking

Both Windows 95 and Windows NT use time-slice-based preemptive multitasking.
However, because the Windows 95 implementation of the WIN32 API depends largely on 16-bit code, it has a few inherent drawbacks. The major one is the Win16Mutex. Because the existing 16-bit code is not well suited for multitasking, the easiest choice for Microsoft was to ensure that the 16-bit code is not entered from multiple tasks. To achieve this, Microsoft came up with the Win16Mutex solution. Before entering the 16-bit code, the operating system acquires the Win16Mutex, and it releases the Win16Mutex while returning from 16-bit code. The Win16Mutex is always acquired when a 16-bit application is running, which results in reduced multitasking. Windows NT does not have this problem because the entire code is 32-bit and is well suited for time-slice-based preemptive multitasking. Also, the 16-bit code thunks up to 32-bit code in the case of Windows NT.

Thunking

Thunking enables 16-bit applications to run in a 32-bit environment and vice versa. It is a way of calling a function written in one bitness from code running at a different bitness. Bitness is a property of the processor, and you can program the processor to adjust the bitness. Bitness decides the way instructions are decoded by the processor. There are two different types of thunking available:

Universal thunking
Generic thunking

Universal thunking enables you to call a 32-bit function from 16-bit code, whereas generic thunking enables you to call a 16-bit function from 32-bit code. Windows 95/98 supports both generic and universal thunking, but Windows NT supports only universal thunking. As you saw earlier in this chapter, generic thunking is used extensively in the WIN32 API implementation of Windows 95/98. For example, a 32-bit USER32.DLL calls functions from a 16-bit USER.EXE, and a 32-bit GDI32.DLL calls functions from a 16-bit GDI.EXE.

Various issues are involved in thunking, such as converting 16:16 far pointers in 16-bit code to flat 32-bit addresses and manipulating the stack to make a proper call from code running at one bitness to code running at a different bitness. Microsoft provides tools such as thunk compilers to automate most of these tasks.

Many vendors who write code for Windows 95/98 use generic thunking to avoid a major redesign of their applications. For example, say a particular vendor has a product for Windows 3.1 and would like to port it to Windows 95. Instead of rewriting the code for Windows 95, an easier solution is to keep the majority of the existing 16-bit code and use generic thunking as a way of calling this code from 32-bit applications. However, these applications need to be rewritten for Windows NT, as Windows NT does not support generic thunking.

Device Drivers

Device drivers are trusted components of the operating system that have full access to the entire hardware. There are no restrictions on what device drivers can do. Each operating system provides some way of adding new device drivers to the system. The device drivers need to be written according to the semantics imposed by the operating system.
The device drivers are called virtual device drivers (VXDs) in Windows 95/98, and they are called kernel-mode device drivers in Windows NT. Windows 95 uses the LE file format for virtual device drivers, whereas Windows NT uses the PE format. As a result, applications that use VXDs cannot be run on Windows NT. The VXDs need to be ported to Windows NT (kernel-mode) device drivers.

XREF: Chapter 2 explains how to write device drivers.

Microsoft has come up with a Common Driver Model in Windows 98 and Windows 2000. At this point, however, you need to port all the applications that use VXDs to Windows NT by writing an equivalent kernel-mode driver.

Security

The major WIN32 API implementation difference between Windows 95/98 and Windows NT is security. Windows 95/98's implementation does not have any support for security. In all the Win32 API functions that have SECURITY_ATTRIBUTES as one of the parameters, Windows 95/98's implementation just ignores these parameters. This has some impact on the way a developer programs. Registry APIs such as RegSaveKey and RegRestoreKey work fine under Windows 95/98. However, under Windows NT, you need to do a few things before you can use these functions.

In Windows NT, there is a concept of privileges. There are different kinds of privileges, such as Shutdown, Backup, and Restore. Before using a function such as RegSaveKey, you need to acquire the Backup privilege. To use RegRestoreKey, you need to acquire the Restore privilege, and to use the InitiateSystemShutdown function, you need to acquire the Shutdown privilege.

Under Windows 95/98, anybody can install a VXD. To install a kernel-mode device driver under Windows NT, you need administrator privilege for security purposes. As mentioned previously, device drivers are trusted components of the operating system and have access to the entire hardware. By requiring privileges to install a device driver, Windows NT restricts the possibility that a guest account holder will install a device driver, which could potentially bring the whole system to its knees.

Newly Added API Calls

With each version of Windows NT, new APIs are being added to the WIN32 API set. Most of these APIs do not have an equivalent API under Windows 95/98. Also, there are a few APIs, such as CreateRemoteThread, that do not have a real implementation under Windows 95/98. Under Windows 95/98, this function returns ERROR_CALL_NOT_IMPLEMENTED. As a result, there will always be a few API calls that are not available on Windows 95/98 or are not implemented on Windows 95/98. At this point, one can only hope that Microsoft will implement the API in Windows 95/98 when they add a new API to Windows NT, unless the API is architecture dependent.
SUMMARY

This chapter covered the WIN32 API implementation on Windows 95/98 and Windows NT. We discussed the differences between these two implementations with respect to address space, process startup, toolhelp functions, multitasking, thunking, device drivers, security, and newly added API calls.
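
As a closing illustration of the JMP-based patch mentioned in the Address Space discussion, the following sketch builds the bytes that such a patch would consist of on 32-bit x86. It assumes the near-jump encoding: opcode E9h followed by a 32-bit little-endian displacement measured from the end of the 5-byte instruction. The function names are hypothetical, and the code only computes bytes in a caller-supplied buffer; it does not write to any function's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define JMP_REL32_OPCODE 0xE9u
#define JMP_REL32_SIZE   5u   /* one opcode byte plus a 32-bit displacement */

/* Builds the 5-byte near jump that, placed at hooked_addr, would
 * transfer control to hook_addr. Both are 32-bit linear addresses. */
void build_jmp_patch(uint32_t hooked_addr, uint32_t hook_addr,
                     uint8_t patch[JMP_REL32_SIZE])
{
    /* Displacement is relative to the instruction after the jump;
     * unsigned arithmetic wraps modulo 2^32, which covers backward jumps. */
    uint32_t rel = hook_addr - (hooked_addr + JMP_REL32_SIZE);

    assert(patch != NULL);
    patch[0] = (uint8_t)JMP_REL32_OPCODE;
    patch[1] = (uint8_t)(rel & 0xFFu);
    patch[2] = (uint8_t)((rel >> 8) & 0xFFu);
    patch[3] = (uint8_t)((rel >> 16) & 0xFFu);
    patch[4] = (uint8_t)((rel >> 24) & 0xFFu);
}

/* Decodes a patch built above and returns the address it jumps to. */
uint32_t jmp_patch_target(uint32_t hooked_addr,
                          const uint8_t patch[JMP_REL32_SIZE])
{
    uint32_t rel;

    assert(patch != NULL && patch[0] == JMP_REL32_OPCODE);
    rel = (uint32_t)patch[1]
        | ((uint32_t)patch[2] << 8)
        | ((uint32_t)patch[3] << 16)
        | ((uint32_t)patch[4] << 24);
    return hooked_addr + JMP_REL32_SIZE + rel;
}
```

Because the displacement depends on where the patched function sits, the same patch bytes are valid in every process only when the function and the hook are at the same addresses everywhere, which is the situation the shared address space of Windows 95/98 provides.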

Chapter 4 Memory Management

Abstract

This chapter examines memory models in Microsoft operating systems, examines how Windows NT uses features of the processor's architecture, and explores the function of virtual memory.

MEMORY MANAGEMENT HAS ALWAYS been one of the most important and interesting aspects of any operating system for serious developers. It is an aspect that kernel developers cannot ignore. Memory management, in essence, provides a thumbnail impression of any operating system. Microsoft has introduced major changes in the memory management of each new operating system it has produced. Microsoft had to make these changes because it developed all of its operating systems for Intel microprocessors, and Intel introduced major changes in memory management support with each new microprocessor. This chapter is a journey through the various Intel microprocessors and the memory management changes each one brought to the operating system that used it.

MEMORY MODELS IN MICROSOFT OPERATING SYSTEMS

Early PCs based on Intel 8086/8088 microprocessors could access only 640K of RAM and used the segmented memory model. Consequently, good old DOS allows only 640K of RAM and restricts the programmer to the segmented memory model. In the segmented model, the address space is divided into segments. Proponents of the segmented model claim that it matches the programmer's view of memory. They claim that a programmer views memory as different segments containing code, data, stack, and heap. The Intel 8086 supports very primitive segmentation. A segment, in the 8086 memory model, has a predefined base address. The length of each segment is also fixed and is equal to 64K. Some programs find a single segment insufficient. Hence, there are a number of memory models under DOS: for example, the tiny model, which supports a single segment for code, data, and stack together, or the small model, which allows two segments (one for code and the other for data plus stack), and so on.
This example shows how the memory management provided by an operating system directly affects the programming environment. The Intel 80286 (which followed the Intel 8086) could support more than 640K of RAM. Hence, programmers got new interface standards for accessing extended and expanded memory from DOS. Microsoft's second-generation operating system, Windows 3.1, could run on the 80286 in standard mode and used the segmented model of the 80286. The 80286 provided better segmentation than the 8086. In the 80286's model, segments can have a programmable base address and size limit. Windows 3.1 had another mode of operation, the enhanced mode, which required the Intel 80386 processor. In the enhanced mode, Windows 3.1 used the paging mechanisms of the 80386 to provide additional performance. The virtual 8086 mode was also used to implement multiple DOS boxes on which DOS programs could run.
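To make the fixed 64K segments of the 8086 concrete, here is a minimal sketch of how a DOS program's segment and offset combine into a physical address. The rule that the segment base is the 16-bit segment value multiplied by 16, and that the result wraps at 20 bits, comes from the 8086 architecture rather than from this chapter; the function name is purely illustrative.

```c
#include <stdint.h>

/* Illustrative helper (not from the book): forms an 8086 real-mode
 * physical address. The segment base is the segment value times 16,
 * and the 8086 has only 20 address lines, so the sum wraps at 1MB. */
uint32_t real_mode_address(uint16_t segment, uint16_t offset)
{
    uint32_t base = (uint32_t)segment << 4;   /* predefined base: segment * 16 */
    return (base + offset) & 0xFFFFFu;        /* keep the low 20 bits */
}
```

Because the offset is only 16 bits wide, one segment value reaches at most 64K of memory, which is why programs needing more had to juggle several segments and why DOS compilers offered several memory models.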

Windows 3.1 does not make full use of the 80386's capabilities. Windows 3.1 is a 16-bit operating system, meaning that 16-bit addresses are used to access the memory and the default data size is also 16 bits. To make full use of the 80386's capabilities, a 32-bit operating system is necessary. Microsoft came up with a 32-bit operating system, Windows NT. The rest of this chapter examines the details of Windows NT memory management. Microsoft also developed Windows 95 after Windows NT. Since both these operating systems run on the 80386 and compatibles, their memory management schemes have a lot in common. However, you can best appreciate the differences between Windows NT and Windows 95/98 after we review Windows NT memory management. Therefore, we defer this discussion until a later section of this chapter.

WINDOWS NT MEMORY MANAGEMENT OVERVIEW

We'll first cover the view Windows NT memory management presents to the outside world. In the next section, we explain the special features provided by Intel microprocessors to implement memory management. Finally, we discuss how Windows NT uses these features to implement the interface provided to the outside world.

Memory Management Interface: Programmer's View

Windows NT offers programmers a 32-bit flat address space. The memory is not segmented; rather, it is 4GB of contiguous address space. (Windows NT marked the end of segmented architecture; programmers clearly preferred flat models to segmented ones.) Possibly, with languages such as COBOL, where you need to declare data and code separately, programmers view memory as segments. However, with newer languages such as C and C++, data variables and code can be freely mixed, and the segmented memory model is no longer attractive. Whatever the reason, Microsoft decided to do away with the segmented memory model with Windows NT. The programmer need not worry whether the code or data fits in 64K segments. With the segmented memory model becoming extinct, the programmer can breathe freely.
At last, there is a single memory model, the 32-bit flat address space. Windows NT is a protected operating system; that is, the behavior (or misbehavior) of one process should not affect another process. This requires that no two processes be able to see each other's address space. Thus, Windows NT should provide each process with a separate address space. Out of the 4GB address space available to each process, Windows NT reserves the upper 2GB as kernel address space and the lower 2GB as user address space, which holds the user-mode code and data. Not all of the address space is separate for each process, however. The kernel code and kernel data space (the upper 2GB) is common to all processes; that is, the kernel-mode address space is shared by all processes. The kernel-mode address space is protected from being accessed by user-mode code. The system DLLs (for example, KERNEL32.DLL, USER32.DLL, and so on) and other DLLs are mapped in user-mode space. It is inefficient to have a separate copy of a DLL for each process. Hence, all processes using the DLL or executable module share the DLL code and incidentally the executable module

code. Such a shared code region is protected from being modified because a process modifying shared code can adversely affect other processes using the code. Sharing of the kernel address space and the DLL code can be called implicit sharing. Sometimes two processes need to share data explicitly. Windows NT enables explicit sharing of address space through memory-mapped files. A developer can map a named file onto some address space, and further accesses to this memory area are transparently directed to the underlying file. If two or more processes want to share some data, they can map the same file in their respective address spaces. To simply share memory between processes, no file needs to be created on the hard disk.

BELOW THE OPERATING SYSTEM

In her book Inside Windows NT, Helen Custer discusses memory management in the context of the MIPS processor. Considering that a large number of readers would be interested in a similar discussion that focuses on Intel processors, we discuss the topic in the context of the Intel 80386 processor (whose memory management architecture is mimicked by the later 80486 and Pentium series). If you are already conversant with the memory management features of the 80386 processor, you may skip this section entirely. We now examine the 80386's addressing capabilities and the fit that Windows NT memory management provides for it. The Intel 80386 is a 32-bit processor; this implies that the address bus is 32 bits wide, and the default data size is as well. Hence, 4GB (2^32 bytes) of physical RAM can be addressed by the microprocessor. The microprocessor supports segmentation as well as paging. To access a memory location, you need to specify a 16-bit segment selector and a 32-bit offset within the segment. The segmentation scheme is more advanced than that in the 8086. The 8086 segments start at a fixed location and are always 64K in size. With the 80386, you can specify the starting location and the segment size separately for each segment.
Segments may overlap; that is, two segments can share address space. The necessary information (the starting offset, size, and so forth) is conveyed to the processor via segment tables. A segment selector is an index into the segment table. At any time, only two segment tables can be active: a Global Descriptor Table (GDT) and a Local Descriptor Table (LDT). A bit in the selector indicates whether the processor should refer to the LDT or the GDT. Two special registers, GDTR and LDTR, point to the GDT and the LDT, respectively. The instructions to load these registers are privileged, which means that only the operating system code can execute them. A segment table is an array of segment descriptors. A segment descriptor specifies the starting address and the size of the segment. You can also specify some access permission bits with a segment descriptor. These bits specify whether a particular segment is read-only, read-write, executable, and so on. Each segment descriptor has 2 bits specifying its privilege level, called the descriptor privilege level (DPL).
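The selector and descriptor mechanics described so far can be sketched in a few lines of C. This is a simplified model, not the processor's actual descriptor format: the Descriptor type keeps only a base and a limit, the selector bit positions (index in bits 15..3, table indicator in bit 2, privilege bits in bits 1..0) follow the Intel architecture manuals rather than this chapter, and all names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified segment descriptor: only the fields discussed in the text. */
typedef struct {
    uint32_t base;   /* starting address of the segment */
    uint32_t limit;  /* highest valid offset inside the segment */
} Descriptor;

/* Fields packed into a 16-bit segment selector (Intel layout, assumed). */
typedef struct {
    unsigned index;    /* bits 15..3: index into the descriptor table */
    unsigned use_ldt;  /* bit 2: 0 selects the GDT, 1 selects the LDT */
    unsigned rpl;      /* bits 1..0: privilege bits carried by the selector */
} Selector;

Selector decode_selector(uint16_t raw)
{
    Selector s;
    s.index   = raw >> 3;
    s.use_ldt = (raw >> 2) & 1u;
    s.rpl     = raw & 3u;
    return s;
}

/* Translates selector:offset into base + offset. Returns false when the
 * selector indexes past the end of its table or the offset exceeds the
 * segment limit (the cases in which the real processor raises a fault). */
bool segment_to_linear(const Descriptor *gdt, unsigned gdt_len,
                       const Descriptor *ldt, unsigned ldt_len,
                       uint16_t raw_selector, uint32_t offset,
                       uint32_t *linear)
{
    Selector s = decode_selector(raw_selector);
    const Descriptor *table = s.use_ldt ? ldt : gdt;
    unsigned len = s.use_ldt ? ldt_len : gdt_len;

    if (s.index >= len || offset > table[s.index].limit)
        return false;
    *linear = table[s.index].base + offset;
    return true;
}
```

With a table in which entry 1 has base 0x10000 and entry 2 has base 0x18000, both with a 64K limit, the selectors 0x0008 and 0x0010 describe overlapping segments: offset 0x9000 in the first and offset 0x1000 in the second reach the same address. A descriptor with base 0 and the maximum limit makes the resulting address equal to the offset.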

The processor compares the DPL with the Requested Privilege Level (RPL) before granting access to a segment. The RPL is dictated by 2 bits in the segment selector used while specifying the address. The Current Privilege Level (CPL) also plays an important role here. The CPL is the DPL of the code selector being executed. The processor grants access to a particular segment only if the DPL of the segment is numerically greater than or equal to both the RPL and the CPL; in other words, the segment must not be more privileged than the code requesting access. This serves as a protection mechanism for the operating system. The CPL of the processor can vary between 0 and 3 (because 2 bits are assigned for CPL). The operating system code generally runs at CPL=0, also called ring 0, while the user processes run at ring 3. In addition, all the segments belonging to the operating system are allotted DPL=0. This arrangement ensures that user-mode code cannot access the operating system memory segments. It is very damaging to performance to consult the segment tables, which are stored in main memory, for every memory access. The processor solves this problem by caching the segment descriptor whenever a selector is loaded into one of its segment registers, namely, CS (Code Selector), DS (Data Selector), SS (Stack Selector), and the general-purpose selectors ES, FS, and GS. The first three selector registers in this list (that is, CS, DS, and SS) act as default registers for code access, data access, and stack access, respectively. To access a memory location, you specify the segment and the offset within that segment. The first step in address translation is to add the base address of the segment to the offset. This 32-bit address is the physical memory address if paging is not enabled. Otherwise, this address is called the logical or linear address and is converted to a physical RAM address using the page address translation mechanism (refer to Figure 4-1).

Figure 4-1: Linear to physical address translation

The memory management scheme is popularly known as paging because the memory is divided into fixed-size regions called pages.
On Intel processors (80386 and higher), the size

of one page is 4 kilobytes. The 32-bit address bus can access up to 4GB of RAM. Hence, there are one million (4GB/4K) pages. Page address translation is a logical to physical address mapping. Some bits in the logical/linear address are used as an index in the page table, which provides a logical to physical mapping for pages. The page translation mechanism on Intel platforms has two levels, with a structure called the page table directory at the second level. As the name suggests, a page table directory is an array of pointers to page tables. Some bits in the linear address are used as an index in the page table directory to get the appropriate page table to be used for address translation. The page address translation mechanism in the 80386 requires two important data structures to be maintained by the operating system, namely, the page table directory and the page tables. A special register, CR3, points to the current page table directory. This register is also called the Page Directory Base Register (PDBR). A page table directory is a 4096-byte page with 1024 entries of 4 bytes each. Each entry in the page table directory points to a page table. A page table is a 4096-byte page with 1024 entries of 4 bytes (32 bits) each. Each Page Table Entry (PTE) points to a physical page. Since there are 1 million pages to be addressed, 20 of the 32 bits in a PTE act as the upper 20 bits of the physical address. The remaining 12 bits are used to maintain attributes of the page. Some of these attributes are access permissions. For example, you can denote a page as read-write or read-only. A page also has an associated security bit, called the supervisor bit, which specifies whether a page can be accessed from user-mode code or only from kernel-mode code. A page marked as a supervisor page (on the 80386, a page whose user/supervisor bit is clear) can be accessed only from kernel mode, which is ring 0 under Windows NT. Two other bits, namely, the accessed bit and the dirty bit, indicate the status of the page. The processor sets the accessed bit whenever the page is accessed.
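As a rough illustration of how the 20 address bits and the attribute bits share one 32-bit entry, the following sketch unpacks a PTE. The bit positions (present in bit 0, read-write in bit 1, user/supervisor in bit 2, accessed in bit 5, dirty in bit 6) are taken from the Intel architecture manuals, not from this chapter, and the type and function names are hypothetical. In this hardware encoding, the user/supervisor bit is set for pages that user-mode code may touch and clear for supervisor-only pages.

```c
#include <stdbool.h>
#include <stdint.h>

/* Attribute bits of an 80386 page table entry (Intel layout, assumed). */
#define PTE_PRESENT   0x001u       /* entry is valid */
#define PTE_WRITABLE  0x002u       /* read-write when set, read-only when clear */
#define PTE_USER      0x004u       /* user-mode access allowed when set */
#define PTE_ACCESSED  0x020u       /* set by the processor on any access */
#define PTE_DIRTY     0x040u       /* set by the processor on a write */
#define PTE_FRAME     0xFFFFF000u  /* upper 20 bits: physical page address */

/* Hypothetical unpacked view of one entry. */
typedef struct {
    uint32_t frame_address;   /* physical address of the 4K page */
    bool present;
    bool writable;
    bool supervisor_only;     /* true when user-mode code may not access it */
    bool accessed;
    bool dirty;
} PteInfo;

PteInfo decode_pte(uint32_t pte)
{
    PteInfo info;
    info.frame_address   = pte & PTE_FRAME;
    info.present         = (pte & PTE_PRESENT)  != 0;
    info.writable        = (pte & PTE_WRITABLE) != 0;
    info.supervisor_only = (pte & PTE_USER)     == 0;
    info.accessed        = (pte & PTE_ACCESSED) != 0;
    info.dirty           = (pte & PTE_DIRTY)    != 0;
    return info;
}
```

For example, an entry value of 0x0153D067 would decode to the physical page at 0x0153D000, present, read-write, accessible from user mode, and already accessed and written to.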
The processor sets the dirty bit whenever the page is written to. Some bits are available for operating system use. For example, Windows NT uses one such bit for implementing copy-on-write protection. You can also mark a page as invalid, in which case you need not specify the physical page address. Accessing such a page generates a page fault exception. An exception is similar to a software interrupt. The operating system can install an exception handler and service the page faults. You'll read more about this in the following sections. A 32-bit linear address breaks down as follows. The upper 10 bits of the linear address are used as the page directory index, and a pointer to the corresponding page table is obtained. The next 10 bits of the linear address are used as an index in this page table to get the base address of the required physical page. The remaining 12 bits are used as the offset within the page and are added to the page base address to get the physical address.

THE INSIDE LOOK

In this section, we examine how Windows NT has selectively utilized existing features of the

80386 processor's architecture to achieve its goals.

Flat Address Space

First, let's see how Windows NT provides a 32-bit flat address space to the processes. As we know from the previous section, Intel offers segmentation as well as paging. So how does Windows NT provide a flat memory instead of a segmented one? Turn off segmentation? You cannot turn off segmentation on the 80386. However, the processor enables the operating system to load the segment registers once and then specify only 32-bit offsets for subsequent instructions. This is exactly what Windows NT does. Windows NT initializes all the segment registers to point to memory locations from 0 to 4GB; that is, the base is set to 0 and the limit is set to 4GB. The CS, SS, DS, and ES registers are initialized with separate segment descriptors, all pointing to locations from 0 to 4GB. So now the applications can use only 32-bit offsets and hence see a 32-bit flat address space. A 32-bit application running under Windows NT is not supposed to change any of its segment registers.

Process Isolation

The next question that comes to mind is, "How does Windows NT keep processes from seeing each other's address space?" Again, the mechanism for achieving this design goal is simple. Windows NT maintains a separate page table directory for each process and, based on the process in execution, switches to the corresponding page table directory. Because the page table directories for different processes point to different page tables, these page tables point to different physical pages, and only one directory is active at a time, no process can see any other process's memory. When Windows NT switches the execution context, it also sets the CR3 register to point to the appropriate page table directory. The kernel-mode address space is mapped for all processes, and all page table directories have entries for the kernel address space. However, another feature of the 80386 is used to disallow user-mode code from accessing kernel address space.
All the kernel pages are marked as supervisor pages; therefore, user-mode code cannot access them.

Code Page Sharing in DLLs

For sharing code pages of a DLL, Windows NT maps the corresponding page table entries for all processes sharing the DLL onto the same set of physical pages. For example, if process A loads X.DLL at address xxxx and process B loads the same X.DLL at address yyyy, then the PTE for xxxx in process A's page table and the PTE for yyyy in process B's page table point to the same physical page. Figure 4-2 shows two processes sharing a page via the same page table entries. The DLL pages are marked as read-only so that a process inadvertently attempting to write to this area will not cause other processes to crash.

Figure 4-2: Sharing pages via same page table entries

Note: This is guaranteed to be the case when xxxx==yyyy. However, if xxxx!=yyyy, the physical page might not be the same. We will discuss the reason behind this later in the chapter.

Kernel address space is shared using a similar technique. Because the entire kernel space is common for all processes, Windows NT can share page tables directly. Figure 4-3 shows how processes share physical pages by using the same page tables. Consequently, the upper half of the page table directory entries are the same for all processes.
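The two-level translation and the sharing of a physical page between processes can be modeled with a toy simulation. The sketch below is only an illustration under simplifying assumptions: physical memory is an array of eight 4K frames, an entry holds just a frame number and a present bit, each "process" is identified by the frame that holds its page directory (the value CR3 would carry), and all names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define ENTRIES 1024u   /* entries per page directory or page table */
#define FRAMES  8u      /* toy physical memory: eight 4K frames */
#define PRESENT 0x1u    /* valid bit of a directory or table entry */

/* Each frame is viewed as 1024 32-bit entries. */
static uint32_t phys_mem[FRAMES][ENTRIES];

/* Installs va -> data_frame in the directory held in dir_frame,
 * using table_frame as the page table for that 4MB region. */
static void map_page(uint32_t dir_frame, uint32_t table_frame,
                     uint32_t va, uint32_t data_frame)
{
    phys_mem[dir_frame][va >> 22] = (table_frame << 12) | PRESENT;
    phys_mem[table_frame][(va >> 12) & 0x3FFu] = (data_frame << 12) | PRESENT;
}

/* Walks the two levels: upper 10 bits index the directory, the next
 * 10 bits index the page table, the low 12 bits are the page offset. */
bool translate(uint32_t cr3_frame, uint32_t va, uint32_t *pa)
{
    uint32_t pde = phys_mem[cr3_frame][va >> 22];
    if (!(pde & PRESENT) || (pde >> 12) >= FRAMES)
        return false;                       /* would be a page fault */
    uint32_t pte = phys_mem[pde >> 12][(va >> 12) & 0x3FFu];
    if (!(pte & PRESENT))
        return false;                       /* would be a page fault */
    *pa = (pte & 0xFFFFF000u) | (va & 0xFFFu);
    return true;
}

/* Process A: directory in frame 0, page table in frame 1.
 * Process B: directory in frame 2, page table in frame 3.
 * Frame 5 plays the role of a shared DLL code page mapped at
 * different virtual addresses; frame 6 is private to process A. */
void build_demo(void)
{
    memset(phys_mem, 0, sizeof phys_mem);
    map_page(0, 1, 0x10000000u, 5);
    map_page(2, 3, 0x20000000u, 5);
    map_page(0, 1, 0x10001000u, 6);
}
```

After build_demo(), translating 0x10000123 with directory frame 0 and 0x20000123 with directory frame 2 both yield physical address 0x5123, while process B has no translation at all for process A's private page at 0x10001000. Switching the directory frame passed to translate() corresponds to reloading CR3 on a context switch.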

Figure 4-3: Sharing pages via same page directory entries

Listing 4-1 shows the sample program that demonstrates this.

Listing 4-1: SHOWDIR.C

This initial portion of the SHOWDIR.C file contains, apart from the header inclusion, the global definition for the array to hold the page directory. The inclusion of the header file GATE.H is of interest. This header file prototypes the functions for using the callgate mechanism. Using the callgate mechanism, you can execute your code in kernel mode without writing a new device driver.

XREF: We discuss the callgate mechanism in Chapter 10.

For this sample program, we need this mechanism because the page directory is not accessible to user-mode code. For now, it's sufficient to know that the mechanism allows a function inside a normal executable to be executed in kernel mode. Turning to the definition of the page directory, we have already described that the size of each directory entry is 4 bytes and a page directory contains 1024 entries. Hence, PageDirectory is an array of 1024 DWORDs. Each DWORD in the array represents the corresponding directory entry.

    /* C function called from the assembly stub */
    void _stdcall CFuncGetPageDirectory()
    {
        DWORD *PageDir=(DWORD *)0xC0300000;
        int i=0;

        for (i=0; i<1024; i++) {
            PageDirectory[i] = PageDir[i];
        }
    }

CFuncGetPageDirectory() is the function that is executed in kernel mode using the callgate mechanism. This function simply makes a copy of the page directory in the user-mode memory area so that the other user-mode parts of the program can access it. The page directory is mapped at virtual address 0xC0300000 in every process's address space. This address is not accessible from user mode. The CFuncGetPageDirectory() function copies

1024 DWORDs from the 0xC0300000 address to the global PageDirectory variable that is accessible to the user-mode code in the program.

    /* Displays the contents of page directory. Starting
     * virtual address represented by the page directory
     * entry is shown followed by the physical page
     * address of the page table
     */
    void DisplayPageDirectory()
    {
        int i;
        int ctr=0;

        printf("Page directory for the process, pid=%x\n", GetCurrentProcessId());
        for (i=0; i<1024; i++) {
            if (PageDirectory[i]&0x01) {
                if ((ctr%3)==0) {
                    printf("\n");
                }
                printf("%08x:%08x ", i << 22, PageDirectory[i] & 0xFFFFF000);
                ctr++;
            }
        }
        printf("\n");
    }

The DisplayPageDirectory() function operates in user mode and prints the PageDirectory array that is initialized by the CFuncGetPageDirectory() function. The function checks the Least Significant Bit (LSB) of each of the entries. A page directory entry is valid only if the last bit, or LSB, is set. The function skips printing invalid entries. The function prints three entries on every line; in other words, it prints a newline character before every third entry. Each directory entry is printed as the logical address and the address of the corresponding page table as obtained from the page directory. As described earlier, the first 10 bits (or the 10 Most Significant Bits [MSB]) of the logical address are used as an index in the page directory. In other words, a directory entry at index i represents the logical addresses that have i as the first 10 bits. The function prints the base of the logical address range for each directory entry. The base address (that is, the lowest address in the range) has the last 22 bits (or 22 LSBs) as zeros. The function obtains this base address by shifting i to the first 10 bits. The address of the page table corresponding to the logical address is stored in the first 20 bits (or 20 MSBs) of the page directory entry. The 12 LSBs are the flags for the entry. The function calculates the page table address by masking off the flag bits.

    main()

    {
        WORD CallGateSelector;
        int rc;
        static short farcall[3];

        /* Assembly stub that is called through callgate */
        extern void GetPageDirectory(void);

        /* Creates a callgate to read the page directory from Ring 3 */
        rc = CreateCallGate(GetPageDirectory, 0, &CallGateSelector);
        if (rc == SUCCESS) {
            farcall[2] = CallGateSelector;
            _asm {
                call fword ptr [farcall]
            }
            DisplayPageDirectory();
            getchar();

            /* Releases the callgate */
            rc=FreeCallGate(CallGateSelector);
            if (rc!=SUCCESS) {
                printf("FreeCallGate failed, CallGateSelector=%x, rc=%x\n", CallGateSelector, rc);
            }
        } else {
            printf("CreateCallGate failed, rc=%x\n", rc);
        }
        return 0;
    }

The main() function starts by creating a callgate that sets up the GetPageDirectory() function to be executed in kernel mode. The GetPageDirectory() function is written in assembly language and is a part of the RING0.ASM file. The CreateCallGate() function, used by the program to create the callgate, is provided by CALLGATE.DLL. The function returns a callgate selector.

XREF: The mechanism of calling the desired function through a callgate is explained in Chapter 10.

We'll quickly mention a few important points here. The callgate selector returned by CreateCallGate() is a segment selector for the given function: in this case, GetPageDirectory(). To invoke the function pointed to by the callgate selector, you need to issue a far call instruction. The far call instruction expects a 16-bit segment selector and a 32-bit offset within the segment. When you are calling through a callgate, the offset does not matter; the processor always jumps to the start of the function pointed to by the callgate. Hence, the program initializes only the third member of the farcall array, which corresponds to the segment selector. Issuing a call through the callgate transfers execution control to the GetPageDirectory() function. This

function calls the CFuncGetPageDirectory() function that copies the page directory into the PageDirectory array. After the callgate call returns, the program prints the page directory copied into PageDirectory by calling the DisplayPageDirectory() function. The program frees the callgate before exiting.

Listing 4-2: RING0.ASM

    .386
    .model small
    .code

    include ..\include\undocnt.inc

    public _GetPageDirectory
    extrn

    ;Assembly stub called from callgate
    _GetPageDirectory proc
        Ring0Prolog
        call
        Ring0Epilog
        retf
    _GetPageDirectory endp

    END

The function to be called from the callgate needs to be written in assembly language for a couple of reasons. First, the function needs to execute a prolog and an epilog, both of which are assembly macros, to allow paging in kernel mode. Second, the function needs to issue a far return at the end. The function leaves the rest of the job to the CFuncGetPageDirectory() function written in C. If you compare the output of the SHOWDIR program for two different processes, you find that the upper halves of the page table directories for the two processes are exactly the same except for two entries. In other words, the corresponding kernel address space for these two entries is not shared by the two processes.

Listing 4-3: First instance of SHOWDIR

Page directory for the process, pid=6f : :00f :0152f000 5f800000:00e c00000:0076b000 7f400000:012cb000 7fc00000:0007e : : : c00000:00c : : : c00000:01c00000

40 : : : c00000:02c : : : c00000:03c : : : c00000:04c : : : c00000:05c : : : c00000:06c : : : c00000:07c00000 a :0153d000 c :00e5d000 c :00c9e000 c0c00000: c : c : c : c1c00000: c : c : c : c2c00000: c :0004a000 c :0004b000 c :0004c000 c3c00000:0004d000 c :0004e000 c :0000f000 c : c4c00000: c : c : c : c5c00000: c : c : c : c6c00000: c :0005a000 c :0005b000 c :0005c000 c7c00000:0005d000 c :0005e000 c :0005f000 c : c8c00000: c : c : c : c9c00000: ca000000: ca400000: ca800000: cac00000: cb000000:0002a000 cb400000:0002b000 cb800000:0002c000 cbc00000:0002d000 cc000000:0002e000 cc400000:0002f000 cc800000:002f0000 ccc00000:002f1000 cd000000:002f2000 cd400000:002f3000 cd800000:002f4000 cdc00000:002f5000 ce000000:002f6000 ce400000: ce800000: cec00000: cf000000:0003a000 cf400000:0003b000 cf800000:0003c000 cfc00000:0003d000 d :0003e000 d :0003f000 d : d0c00000: d : d : d : d1c00000: d : d : d : d2c00000: d :0030a000 d :0030b000 d :0030c000 d3c00000:0030d000 d :0030e000 d :0004f000 d : d4c00000: e : e :010fe000 fc400000:0038d000 fc800000:0038e000 fcc00000:0038f000 fd000000: fd400000: fd800000: fdc00000: fe000000: fe400000: fe800000: fec00000: ff000000: ff400000:

41 ff800000:0039a000 ffc00000: Listing 4-4: Second instance of SHOWDIR Page directory for the process, pid=7d :00fa :00fa :0110a000 5f800000:015ac000 77c00000:01a f400000:013ac000 7fc00000:0145e : : : c00000:00c : : : c00000:01c : : : c00000:02c : : : c00000:03c : : : c00000:04c : : : c00000:05c : : : c00000:06c : : : c00000:07c00000 a :0153d000 c :00d94000 c : c0c00000: c : c : c : c1c00000: c : c : c : c2c00000: c :0004a000 c :0004b000 c :0004c000 c3c00000:0004d000 c :0004e000 c :0000f000 c : c4c00000: c : c : c : c5c00000: c : c : c : c6c00000: c :0005a000 c :0005b000 c :0005c000 c7c00000:0005d000 c :0005e000 c :0005f000 c : c8c00000: c : c : c : c9c00000: ca000000: ca400000: ca800000: cac00000: cb000000:0002a000 cb400000:0002b000 cb800000:0002c000 cbc00000:0002d000 cc000000:0002e000 cc400000:0002f000 cc800000:002f0000 ccc00000:002f1000 cd000000:002f2000 cd400000:002f3000 cd800000:002f4000 cdc00000:002f5000 ce000000:002f6000 ce400000: ce800000: cec00000: cf000000:0003a000 cf400000:0003b000 cf800000:0003c000 cfc00000:0003d000 d :0003e000 d :0003f000 d : d0c00000: d : d : d : d1c00000: d : d : d : d2c00000: d :0030a000 d :0030b000

d :0030c000 d3c00000:0030d000 d :0030e000 d :0004f000 d : d4c00000: e : e :010fe000 fc400000:0038d000 fc800000:0038e000 fcc00000:0038f000 fd000000: fd400000: fd800000: fdc00000: fe000000: fe400000: fe800000: fec00000: ff000000: ff400000: ff800000:0039a000 ffc00000:

Let's analyze, one step at a time, why the two entries are different. The page tables themselves need to be mapped onto some linear address. When Windows NT needs to access the page tables, it uses this linear address range. To represent 4GB of memory divided into 1M pages of 4K each, we need 1K page tables, each having 1K entries. To map these 1K page tables, Windows NT reserves 4MB of linear address space in each process. As we saw earlier, each process has a different set of page tables. Whatever the process, Windows NT maps the page tables on the linear address range from 0xC0000000 to 0xC03FFFFF. Let's call this linear address range the page table address range. In other words, the page table address range maps to different page tables (that is, to different physical pages) for different processes. As you may have noticed, the page table address range falls in the kernel address space. Windows NT cannot map this crucial system data structure in the user address space and allow user-mode processes to play with the memory. Ultimately, the result is that two processes cannot share pages in the page table address range although the addresses lie in the kernel-mode address range. Exactly one page table is required to map a 4MB address space because each page table has 1K entries and each entry corresponds to a 4K page. Consequently, Windows NT cannot share the page table corresponding to the page table address range. This accounts for one of the two mysterious entries in the page table directory. However, the entry's mystery does not end here; there is one more subtle twist to this story. The physical address specified in this entry matches the physical address of the page table directory.
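A short calculation makes the layout of the page table address range easier to follow. The sketch below assumes that the range starts at 0xC0000000 (the 4MB range that ends at 0xC03FFFFF) and that the 4-byte entries appear in it in order of the virtual pages they describe; the constant and function names are hypothetical.

```c
#include <stdint.h>

/* Assumed start of the 4MB page table address range. */
#define PAGE_TABLE_BASE 0xC0000000u

/* Returns the virtual address at which the page table entry for va
 * is visible: one 4-byte entry per 4K page, laid out in order. */
uint32_t pte_virtual_address(uint32_t va)
{
    return PAGE_TABLE_BASE + (va >> 12) * 4u;
}
```

The entry for address 0 sits at 0xC0000000 and the entry for the last page, 0xFFFFF000, at 0xC03FFFFC, so the whole range is exactly 4MB. Applying the function to 0xC0000000 itself gives 0xC0300000: the entries that describe the page table address range lie inside that same range.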
The obvious conclusion is that the page table directory also acts as the page table for the page table address range. This is possible because the formats of the page table directory entry and the PTE are the same on the 80386. The processor carries out an interesting sequence of actions when a linear address within the page table address range is translated to a physical address. Let's say that the CR3 register points to page X. As the first step in the address translation process, the processor treats page X as the page table directory and finds the page table for the given linear address. The page table happens to be page X again. The processor now treats page X as the required page table and finds the physical address from it. A more interesting case occurs when the operating system is accessing the page table directory itself. In this case, the physical address also falls in page X! Let's now turn to the second mysterious entry. The 4MB area covered by this page directory

entry is internally referred to as hyperspace. This area is used for mapping the physical pages belonging to other processes into virtual address space. For example, a function such as MmMapPageInHyperSpace() uses the virtual addresses in this range. This area is also used during the early stages of process creation. For example, when a parent process such as PROGMAN.EXE spawns a child process such as NOTEPAD.EXE, PROGMAN.EXE has to create the address space for NOTEPAD.EXE. This is done as a part of the MmCreateProcessAddressSpace() function. To start any process, an address space must be created for the process. An address space is nothing but a page directory. Also, the upper-half entries of the page directory are common for all processes except for the two entries that we have already discussed. These entries need to be created for the process being spawned. The MmCreateProcessAddressSpace() function allocates three pages of memory: the first page for the page directory, the second page for holding the hyperspace page table entries, and the third page for holding the working set information for the process being spawned. Once these pages are allocated, the function maps the first physical page in the address space using the MmMapPageInHyperSpace() function. Note that the MmMapPageInHyperSpace() function runs in the context of PROGMAN.EXE. Now the function copies the page directory entries in the upper half of the page directory to the mapped hyperspace virtual address. In short, PROGMAN.EXE creates the page directory for NOTEPAD.EXE.

Windows NT supports memory-mapped files. When two processes map the same file, they share the same set of physical pages. Hence, memory-mapped files can be used for sharing memory. In fact, Windows NT itself uses memory-mapped files to load DLLs and executables. If two processes map the same DLL, they automatically share the DLL pages. Memory-mapped files are implemented using the section object under Windows NT.
A data structure called PROTOPTE is associated with each section object. This is a variable-length structure whose size depends on the size of the section. It contains a 4-byte entry for each page in the virtual address space mapped by the section object. Each 4-byte entry has the same structure as that of the PTE. When the page is not being used by any of the processes, the PROTOPTE entry is invalid and contains enough information to get the page back. In this case, the CPU PTE contains a fixed value, 0xFFFFF480, which indicates that accessing this page will be considered a PROTOPTE fault. Now comes the toughest of all questions: "How can Windows NT give away 4GB of memory to each process when there is far less physical RAM available on the board?" Windows NT, like all other operating systems that allow more address space than actual physical memory, uses a technique called virtual memory to achieve this. In the next section, we discuss virtual memory management in Windows NT.

VIRTUAL MEMORY MANAGEMENT

The basic idea behind virtual memory is very simple. For each process, the operating system

maps only a few addresses to real physical memory because RAM is expensive and relatively scarce. The remaining memory for each process is maintained on secondary storage (usually a hard disk). That is why it is called virtual memory. The addresses that are not mapped to physical RAM are marked as such. Whenever a process accesses such an address, the operating system brings the data into memory from secondary storage. If the operating system runs out of physical RAM, some data is thrown out to make space. We can always get this data back because a copy is maintained on secondary storage. The replacement policy decides which data is thrown out. Windows NT uses a First-In-First-Out (FIFO) replacement policy. According to this policy, the oldest data (that is, the data that was brought into RAM first) is thrown out whenever there is a space crunch.

To implement virtual memory management, Windows NT needs to maintain a lot of data. First, it needs to record whether each address is mapped to physical RAM or whether the data has to be brought in from secondary storage when a request for the address arrives. Maintaining this information for each byte would itself take a lot of space (actually, more space than the address space being described). So Windows NT breaks the address space into 4KB pages and maintains this information in page tables. As we saw earlier, a page table entry (PTE) consists of the address of the physical page (if the page is mapped to physical RAM) and the attributes of the page. Since the processor depends heavily on PTEs for address translation, the structure of a PTE is processor dependent.

If a page is not mapped to physical RAM, Windows NT marks the page as invalid. Any access to this page causes a page fault, and the page fault handler can bring in the page from secondary storage. To be more specific, when the page contains DLL code or executable module code, the page is brought in from the DLL or executable file.
When the page contains data, it is brought in from the swap file. When the page represents a memory-mapped file area, it is brought in from the corresponding file.

Windows NT needs to keep track of free physical RAM so that it can allocate space for a page brought in from secondary storage in case of a page fault. This information is maintained in a kernel data structure called the Page Frame Database (PFD). The PFD also maintains a FIFO list of in-memory pages so that it can decide which pages to throw out in case of a space crunch. Before throwing out a page, Windows NT must check whether the page is dirty. If it is, the page has to be written to secondary storage before it is thrown out. If the page is not shared, the PFD contains a pointer to the PTE so that if the operating system decides to throw out a particular page, it can go back and mark the PTE as invalid. If the page is shared, the PFD contains a pointer to the corresponding PROTOPTE entry. In this case, the PFD also contains a reference count for the page. A page can be thrown out only if its reference count is 0.

In general, the PFD maintains the status of every physical page. The PFD is an array of 24-byte entries, one for each physical page. Hence, the number of entries in this array equals the number of physical pages, which is stored in the kernel variable MmNumberOfPhysicalPages. The pointer to this array is stored in the kernel variable MmPfnDatabase. A physical page can be in several states; for example, it can be in use, free,

free but dirty, and so on. A PFD entry is linked in a doubly linked list, depending on the state of the physical page represented by it. For example, the PFD entry representing a free page is linked in the free pages list. Figure 4-4 shows these lists linked through the PFD. The forward links are shown on the left side of the PFD, and the backward links are shown on the right side. There are six kinds of lists in all. The heads of these lists are stored in the following kernel variables:

    MmStandbyPageListHead
    MmModifiedNoWritePageListHead
    MmModifiedPageListHead
    MmFreePageListHead
    MmBadPageListHead
    MmZeroedPageListHead

Figure 4-4: Various lists linked through PFD

All these list heads are actually structures of 16 bytes each. Here is the structure definition:

    typedef struct PageListHead {
        DWORD NumberOfPagesInList;
        DWORD TypeOfList;
        DWORD FirstPage;
        DWORD LastPage;
    } PageListHead_t;

The FirstPage field can be used as an index into the PFD. The PFD entry contains a pointer to the next page. Using this, you can traverse any of the lists. Here is the structure definition for the PFD entry:

    typedef struct PfdEntry {
        DWORD NextPage;
        void *PteEntry;           /* or PpteEntry: the PROTOPTE entry if the page is shared */
        DWORD PrevPage;
        DWORD PteReferenceCount;
        void *OriginalPte;
        DWORD Flags;
    } PfdEntry_t;

Using this, you can easily write a program to dump the PFD. However, there is one problem: kernel variables, such as the list heads, MmPfnDatabase, and MmNumberOfPhysicalPages, are not exported. Therefore, you have to deal with absolute addresses, which makes the program dependent on the Windows NT version and build type.

VIRTUAL ADDRESS DESCRIPTORS

Along with the free physical pages, Windows NT also needs to keep track of the virtual address space allocation for each process. Whenever a process allocates a memory block (for example, to load a DLL), Windows NT checks for a free block in the virtual address space, allocates virtual address space, and updates the virtual address map accordingly. The most obvious place to maintain this information is the page tables. For each process, Windows NT maintains separate page tables. A 4GB address space contains 1 million 4KB pages, and each page table entry is 4 bytes. Hence, full page tables for a single process would take 4MB of RAM! There is a solution to this: page tables themselves can be swapped out. However, it is inefficient to swap in entire page tables whenever a process wants to allocate memory. Hence, Windows NT maintains a separate binary search tree containing the information about the current virtual space allocation for each process. A node in this binary search tree is called a Virtual Address Descriptor (VAD). For each block of memory allocated to a process, Windows NT adds a VAD entry to the binary search tree. Each VAD entry contains the allocated address range (that is, the start address and the end address of the allocated block), pointers to the left and right children VADs, and a pointer to the parent VAD. The process environment block (PEB) contains a pointer, namely, VadRoot, to the root of this tree.
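The bookkeeping just described, one tree node per allocated block, can be sketched in ordinary user-mode C. The following is a toy model under stated assumptions, not the Windows NT implementation: VadSketchNode and vad_sketch_insert() are hypothetical names, addresses are held as unsigned long values, and the tree is a plain unbalanced binary search tree keyed on the address range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a VAD node: an address range plus tree links. */
typedef struct VadSketchNode {
    unsigned long start;                /* first address of the block */
    unsigned long end;                  /* last address of the block  */
    struct VadSketchNode *parent;
    struct VadSketchNode *left;         /* lower address ranges       */
    struct VadSketchNode *right;        /* higher address ranges      */
} VadSketchNode;

/*
 * Record a newly allocated block [start, end] in the tree rooted at *root.
 * Returns the new node, or NULL if the range overlaps an existing block
 * (or if malloc fails). No rebalancing is attempted in this sketch.
 */
static VadSketchNode *vad_sketch_insert(VadSketchNode **root,
                                        unsigned long start,
                                        unsigned long end)
{
    VadSketchNode *parent = NULL;
    VadSketchNode **link = root;
    VadSketchNode *node;

    while (*link != NULL) {
        parent = *link;
        if (end < parent->start) {
            link = &parent->left;       /* entirely below this block  */
        } else if (start > parent->end) {
            link = &parent->right;      /* entirely above this block  */
        } else {
            return NULL;                /* overlaps an allocated block */
        }
    }

    node = (VadSketchNode *)malloc(sizeof(*node));
    if (node == NULL) {
        return NULL;
    }
    node->start = start;
    node->end = end;
    node->parent = parent;
    node->left = NULL;
    node->right = NULL;
    *link = node;
    return node;
}
```

Because the walk visits only one node per tree level, deciding whether a range is free does not require touching any page table, which is the point the text makes about avoiding page-table swap-ins.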
Listing 4-5: VADDUMP.C

    /* Should be compiled in release mode */
    #define _X86_
    #include <ntddk.h>
    #include <string.h>
    #include <stdio.h>
    #include "undocnt.h"
    #include "gate.h"

    /* Define the WIN32 calls we are using, since we cannot include both
       NTDDK.H and WINDOWS.H in the same C file. */
    typedef struct _OSVERSIONINFO {
        ULONG dwOSVersionInfoSize;
        ULONG dwMajorVersion;
        ULONG dwMinorVersion;
        ULONG dwBuildNumber;
        ULONG dwPlatformId;
        CCHAR szCSDVersion[128];
    } OSVERSIONINFO, *LPOSVERSIONINFO;

    BOOLEAN _stdcall GetVersionExA(LPOSVERSIONINFO);
    PVOID _stdcall VirtualAlloc(PVOID, ULONG, ULONG, ULONG);

    /* Max vad entries */
    #define MAX_VAD_ENTRIES 0x200

    /* Following variables are accessed in RING0.ASM */
    ULONG NtVersion;
    ULONG PebOffset;
    ULONG VadRootOffset;

    #pragma pack(1)
    typedef struct VadInfo {
        void *VadLocation;
        VAD Vad;
    } VADINFO, *PVADINFO;
    #pragma pack()

    VADINFO VadInfoArray[MAX_VAD_ENTRIES];
    int VadInfoArrayIndex;
    PVAD VadTreeRoot;

The initial portion of the VADDUMP.C file has a few definitions apart from the header inclusion. In this program, we use the callgate mechanism as we did in the showdir program; hence the inclusion of the GATE.H header file. After the header inclusion, the file defines the maximum number of VAD entries that we'll process. There is no limit on the number of nodes in a VAD tree. We use the callgate mechanism for kernel-mode execution of a function that dumps the VAD tree into an array accessible from user mode. This array can hold up to MAX_VAD_ENTRIES entries. Each entry in the array is of type VADINFO. The VADINFO structure has two members: the address of the VAD tree node and a copy of the VAD tree node itself. The VAD tree node structure is defined in the UNDOCNT.H file as follows:

    typedef struct vad {
        void *StartingAddress;
        void *EndingAddress;
        struct vad *ParentLink;
        struct vad *LeftLink;
        struct vad *RightLink;
        DWORD Flags;
    } VAD, *PVAD;

The first two members give the address range represented by the VAD node. Each VAD tree node maintains a pointer to the parent node and pointers to the left child and the right child. The VAD tree is a binary tree. For every node in the tree, the left subtree consists of nodes representing lower address ranges, and the right subtree consists of nodes representing higher address ranges. The last member in the VAD node holds the flags for the address range.

The VADDUMP.C file has a few other global variables apart from the VadInfoArray. A couple of global variables are used while locating the root of the VAD tree. The PEB of a process points to the VAD tree root for that process. The offset of this pointer inside the PEB varies with the Windows NT version. We set VadRootOffset to the appropriate offset of the VAD root pointer depending on the Windows NT version. There is a similar version dependency in accessing the PEB for the process. We use the Thread Environment Block (TEB) to get to the PEB. One field in the TEB points to the PEB, but the offset of this field inside the TEB structure varies with the Windows NT version. We set the PebOffset variable to the appropriate offset of the PEB pointer inside the TEB structure depending on the Windows NT version. Another global variable, NtVersion, stores the version of Windows NT running on the machine.

That leaves us with two more global variables, namely, VadInfoArrayIndex and VadTreeRoot. VadInfoArrayIndex is the number of initialized entries in the VadInfoArray; the entries from index VadInfoArrayIndex onward are free. The VadTreeRoot variable stores the root of the VAD tree.

The sample has been tested on Windows NT 3.51, Windows NT 4.0, and Windows 2000 beta 2. The sample will run on other versions of Windows 2000, provided the offsets of VadRoot and PEB remain the same.
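The ordering property of the tree (lower ranges on the left, higher ranges on the right) is what makes address lookups cheap. As an illustration only, here is a minimal user-mode sketch of such a lookup. VadLookupNode and vad_lookup() are hypothetical names; the node mirrors the field names of the VAD structure above but stores addresses as unsigned long values so that it compiles outside the DDK environment.

```c
#include <assert.h>
#include <stddef.h>

/* Pared-down, hypothetical copy of the VAD node layout for illustration. */
typedef struct VadLookupNode {
    unsigned long StartingAddress;
    unsigned long EndingAddress;
    struct VadLookupNode *ParentLink;
    struct VadLookupNode *LeftLink;
    struct VadLookupNode *RightLink;
} VadLookupNode;

/*
 * Return the node whose [StartingAddress, EndingAddress] range contains
 * the given address, or NULL if the address is not inside any block.
 * Each step discards one subtree, exactly as in a binary search tree.
 */
static const VadLookupNode *vad_lookup(const VadLookupNode *node,
                                       unsigned long address)
{
    while (node != NULL) {
        if (address < node->StartingAddress) {
            node = node->LeftLink;      /* only lower ranges can match  */
        } else if (address > node->EndingAddress) {
            node = node->RightLink;     /* only higher ranges can match */
        } else {
            return node;                /* address falls in this block  */
        }
    }
    return NULL;
}
```

A caller would pass the tree root and an address of interest; a NULL result means the address lies in unallocated virtual address space.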
    /* Recursive function which walks the vad tree and
     * fills up the global VadInfoArray with the Vad
     * entries. Function is limited by the
     * MAX_VAD_ENTRIES. Other VADs after this are not
     * stored */
    void _stdcall VadTreeWalk(PVAD VadNode)
    {
        if (VadNode == NULL) {
            return;
        }
        if (VadInfoArrayIndex >= MAX_VAD_ENTRIES) {
            return;
        }
        VadTreeWalk(VadNode->LeftLink);

        /* The left subtree may have filled the array; check again before storing */
        if (VadInfoArrayIndex >= MAX_VAD_ENTRIES) {
            return;
        }
        VadInfoArray[VadInfoArrayIndex].VadLocation = VadNode;
        VadInfoArray[VadInfoArrayIndex].Vad.StartingAddress = VadNode->StartingAddress;
        VadInfoArray[VadInfoArrayIndex].Vad.EndingAddress = VadNode->EndingAddress;
        if (NtVersion == 5) {
            /* Windows 2000 stores page numbers; convert them to addresses */
            VadInfoArray[VadInfoArrayIndex].Vad.StartingAddress =
                (void *)((DWORD)VadInfoArray[VadInfoArrayIndex].Vad.StartingAddress << 12);
            VadInfoArray[VadInfoArrayIndex].Vad.EndingAddress =
                (void *)((((DWORD)VadInfoArray[VadInfoArrayIndex].Vad.EndingAddress + 1) << 12) - 1);
        }
        VadInfoArray[VadInfoArrayIndex].Vad.ParentLink = VadNode->ParentLink;
        VadInfoArray[VadInfoArrayIndex].Vad.LeftLink = VadNode->LeftLink;
        VadInfoArray[VadInfoArrayIndex].Vad.RightLink = VadNode->RightLink;
        VadInfoArray[VadInfoArrayIndex].Vad.Flags = VadNode->Flags;
        VadInfoArrayIndex++;
        VadTreeWalk(VadNode->RightLink);
    }

The VadTreeWalk() function is executed in kernel mode using the callgate mechanism. The function traverses the VAD tree in in-order fashion and fills up the VadInfoArray. The function simply returns if the node pointer parameter is NULL or the VadInfoArray is full. Otherwise, the function recursively calls itself for the left subtree. The recursion terminates when the left child pointer is NULL. The function then fills up the next free entry in the VadInfoArray and increments VadInfoArrayIndex to point to the next free entry. Windows 2000 stores page numbers instead of actual addresses in the VAD. Hence, for Windows 2000, we need to calculate the starting address and the ending address from the page numbers stored in these fields. As the last step in the in-order traversal, the function calls itself recursively to process the right subtree.

    /* C function called through assembly stub */
    void _stdcall CFuncDumpVad(PVAD VadRoot)
    {
        VadTreeRoot = VadRoot;

        VadInfoArrayIndex = 0;
        VadTreeWalk(VadRoot);
    }

The CFuncDumpVad() function is the caller of the VadTreeWalk() function. It just initializes the global variables used by the VadTreeWalk() function and calls the VadTreeWalk() function for the root of the VAD tree.

    /* Displays the Vad tree */
    void VadTreeDisplay()
    {
        int i;

        /* The tail of this format string is garbled in the source; %x is an assumption */
        printf("VadRoot is %x\n\n", VadTreeRoot);
        printf("Vad@\t Starting\t Ending\t Parent\t LeftLink\t RightLink\n");
        for (i = 0; i < VadInfoArrayIndex; i++) {
            printf("%08x %08x %08x %8x %08x %08x\n",
                   VadInfoArray[i].VadLocation,
                   VadInfoArray[i].Vad.StartingAddress,
                   VadInfoArray[i].Vad.EndingAddress,
                   VadInfoArray[i].Vad.ParentLink,
                   VadInfoArray[i].Vad.LeftLink,
                   VadInfoArray[i].Vad.RightLink);
        }
        printf("\n\n");
    }

The VadTreeDisplay() function is a very simple function that is executed in user mode. The function iterates through all the entries initialized by the VadTreeWalk() function and prints them. Essentially, the function prints the VAD tree in infix order because the VadTreeWalk() function dumps the VAD tree in infix order.

    void SetDataStructureOffsets()
    {
        switch (NtVersion) {
        case 3:
            PebOffset = 0x40;
            VadRootOffset = 0x170;
            break;
        case 4:
            PebOffset = 0x44;
            VadRootOffset = 0x170;
            break;

        case 5:
            PebOffset = 0x44;
            VadRootOffset = 0x194;
            break;
        }
    }

As we described earlier, the offset of the PEB pointer within the TEB and the offset of the VAD root pointer within the PEB depend on the Windows NT version. The SetDataStructureOffsets() function sets the global variables indicating these offsets according to the Windows NT version.

    int main()
    {
        WORD CallGateSelector;
        int rc;
        short farcall[3];
        void DumpVad(void);
        void *ptr;
        OSVERSIONINFO VersionInfo;

        VersionInfo.dwOSVersionInfoSize = sizeof(VersionInfo);
        if (GetVersionExA(&VersionInfo) == TRUE) {
            NtVersion = VersionInfo.dwMajorVersion;
        }
        if ((NtVersion < 3) || (NtVersion > 5)) {
            printf("Unsupported NT version, exiting...");
            return 0;
        }
        SetDataStructureOffsets();

        /* Creates call gate to read vad tree from Ring 3 */
        rc = CreateCallGate(DumpVad, 0, &CallGateSelector);
        if (rc != SUCCESS) {
            printf("CreateCallGate failed, rc=%x\n", rc);
            return 1;
        }
        farcall[2] = CallGateSelector;
        _asm {
            call fword ptr [farcall]
        }
        printf("Dumping the Vad tree...\n\n");

        VadTreeDisplay();

        printf("Allocating memory using VirtualAlloc");
        ptr = VirtualAlloc(NULL, 4096, MEM_COMMIT, PAGE_READONLY);
        if (ptr == NULL) {
            printf("Unable to allocate memory\n");
            goto Quit;
        }
        /* The rest of this format string is garbled in the source; %x is an assumption */
        printf("\nMemory %x\n", ptr);
        _asm {
            call fword ptr [farcall]
        }
        printf("\n\nDumping the Vad tree again...\n\n");
        VadTreeDisplay();

    Quit:
        rc = FreeCallGate(CallGateSelector);
        if (rc != SUCCESS) {
            printf("FreeCallGate failed, Selector=%x, rc=%x\n", CallGateSelector, rc);
        }
        return 0;
    }

The main() function starts by getting the Windows NT version and calling SetDataStructureOffsets() to set the global variables storing the offsets for the PEB and the VAD tree root. It then creates a callgate in the same manner as in the SHOWDIR sample program. Issuing a call through this callgate ultimately results in the execution of the VadTreeWalk() function, which fills up the VadInfoArray. The main() function then calls the VadTreeDisplay() function to print the VadInfoArray entries.

This sample program also shows the change in the VAD tree caused by a memory allocation. After printing the VAD tree once, the program allocates a chunk of memory. Then, the program issues the callgate call again and prints the VAD tree after returning from the call. You can observe the updates that happened to the VAD tree because of the memory allocation. The program frees the callgate before exiting.

Listing 4-6: RING0.ASM

    .386
    .model small
    .code

            public _DumpVad
            extrn _CFuncDumpVad@4:near
            extrn _PebOffset:near
            extrn _VadRootOffset:near

            include ..\include\undocnt.inc

    _DumpVad proc
            Ring0Prolog

            ;Gets the current thread
            MOV EAX,FS:[ h]

            ;Gets the current process
            ADD EAX, DWORD PTR [_PebOffset]
            MOV EAX,[EAX]

            ;Push Vad Tree root
            ADD EAX, DWORD PTR [_VadRootOffset]
            MOV EAX, [EAX]
            PUSH EAX
            CALL _CFuncDumpVad@4

            Ring0Epilog
            RETF
    _DumpVad endp
    END

The function called through the callgate needs to be written in assembly language for reasons already described. The DumpVad() function gets hold of the VAD root pointer and calls the CFuncDumpVad() function, which dumps the VAD tree into the VadInfoArray. The function obtains the PEB from the TEB and then obtains the VAD root from the PEB. The TEB of the currently executing thread is always pointed to by FS:128h. As described earlier, the offset of the VAD root pointer inside the PEB and the offset of the PEB pointer inside the TEB vary with the Windows NT version. The DumpVad() function uses the offset values stored in the global variables by the SetDataStructureOffsets() function.

Listing 4-7 presents the output from an invocation of the VADDUMP program. Note that the VAD tree printed after allocating memory at address 0x shows an additional entry for that address range.

Listing 4-7: Program output

Dumping the Vad tree... VadRoot is Starting Ending Parent LeftLink RightLink fe216b fff fe21a9c fe25a0e8 fe25a0e fff fe216b fe275da8 fe275da ffff fe25a0e fe22a428 fe22a fff fe275da fe26b328 fe26b ffff fe22a fe210fc8 fe210fc ffff fe26b fe21a8c8

54 fe21a8c fff fe210fc fe21be68 fe21be dfff fe21a8c fe215dc8 fe215dc b0fff fe21be fe231e88 fe231e88 002c c0fff fe215dc fe2449e8 fe2449e8 002d dffff fe231e fe21cb48 fe21cb48 002e e0fff fe2449e fe23b7a8 fe23b7a8 002f fffff fe21cb fe21a9c cfff 0 fe216b08 fe23c488 fe21b3e dfff fe2333e fe fe2176c8 77e e4bfff fe fe2326e8 fe2152c8 77e e54fff fe2326e fe2326e8 77e e9bfff fe2176c8 fe2152c fe ea ed7fff fe21b3e8 fe2176c8 fe2197c8 fe2197c8 77ee f12fff fe fe2333e8 77f f73fff fe23c488 fe21b3e fe23c488 77f fcdfff fe21a9c8 fe2333e8 fe25aa88 fe22b408 7f2d0000 7f5cffff fe25aa fe22c4a8 fe22c4a8 7f5f0000 7f7effff fe22b fe23f5e8 fe23f5e8 7ff ffaffff fe22c4a fe25aa88 7ffb0000 7ffd3fff fe23c488 fe22b408 fe fe21da88 7ffde000 7ffdefff fe fe ffdf000 7ffdffff fe25aa88 fe21da Allocating memory using VirtualAlloc Memory Dumping the Vad tree again... VadRoot is Vad@ Starting Ending Parent LeftLink RightLink fe216b fff fe21a9c fe25a0e8 fe25a0e fff fe216b fe275da8 fe275da ffff fe25a0e fe22a428 fe22a fff fe275da fe26b328 fe26b ffff fe22a fe210fc8 fe210fc ffff fe26b fe21a8c8 fe21a8c fff fe210fc fe21be68 fe21be dfff fe21a8c fe215dc8 fe215dc b0fff fe21be fe231e88 fe231e88 002c c0fff fe215dc fe2449e8 fe2449e8 002d dffff fe231e fe21cb48 fe21cb48 002e e0fff fe2449e fe23b7a8 fe23b7a8 002f fffff fe21cb fe27b628 fe27b fff fe23b7a

fe21a9c cfff 0 fe216b08 fe23c488 fe21b3e dfff fe2333e fe fe2176c8 77e e4bfff fe fe2326e8 fe2152c8 77e e54fff fe2326e fe2326e8 77e e9bfff fe2176c8 fe2152c fe ea ed7fff fe21b3e8 fe2176c8 fe2197c8 fe2197c8 77ee f12fff fe fe2333e8 77f f73fff fe23c488 fe21b3e fe23c488 77f fcdfff fe21a9c8 fe2333e8 fe25aa88 fe22b408 7f2d0000 7f5cffff fe25aa fe22c4a8 fe22c4a8 7f5f0000 7f7effff fe22b fe23f5e8 fe23f5e8 7ff ffaffff fe22c4a fe25aa88 7ffb0000 7ffd3fff fe23c488 fe22b408 fe fe21da88 7ffde000 7ffdefff fe fe ffdf000 7ffdffff fe25aa88 fe21da

The output of the VADDUMP program does not really look like a tree. You have to trace through the output to recover the tree structure. The entry with a null parent link is the root of the tree. Once you find the root, you can follow the child pointers. To follow a child pointer, search for the pointer in the first column, named Vad@, in the output. The Vad entry with the same Vad@ value is the entry for the child that you are looking for. An all-zero entry for a left or right child pointer indicates that the node has no subtree on that side. Figure 4-5 shows a partial tree constructed from the output shown previously.

Figure 4-5: VAD tree

IMPACT ON HOOKING

Now we'll look at the impact of the memory management scheme explained in the last section on hooking DLL API calls. To hook a function from a DLL, you need to change the first few bytes of the function code. As you saw earlier, the DLL code is shared by all processes and is write protected so that a misbehaving process cannot affect other processes. Does this mean that you cannot hook a function in Windows NT? The answer is that hooking is possible under Windows NT, but you need to do a bit more work to comply with stability requirements. Windows NT provides a system call, VirtualProtect, that you can use to change page attributes. Hence, hooking is now a two-step process: change the attributes of the page containing the DLL code to read-write, and then change the code bytes.

Copy-on-Write

"Eureka!" you might say, "I violated Windows NT security. I wrote to a shared page that is also used by other processes." No! You did not do that. You changed only your copy of the DLL code. The DLL code page was shared only as long as you did not write to it. The moment you wrote to that page, a separate copy of it was made, and the writes went to this copy. All other processes are safely using the original copy of the page. This is how Windows NT protects processes from each other while consuming as few resources as possible.

The VirtualProtect() function does not mark the page as read-write; it keeps the page as read-only. Nevertheless, to distinguish this page from normal read-only pages, it is marked for copy-on-write. Windows NT uses one of the available PTE bits for doing this. When this page is written to, the processor raises a page fault exception because it is a read-only page. The page fault handler makes a copy of the page and modifies the page table of the faulting process accordingly. The new copy is marked as read-write so that the process can write to it.

Windows NT itself uses the copy-on-write mechanism for various purposes.
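The fault-handling sequence for a copy-on-write page can be imitated in a few lines of portable C. What follows is a toy simulation under stated assumptions, not how the Windows NT page fault handler is written: ToyFrame, ToyPte, toy_alloc_frame(), and toy_write() are hypothetical names, "physical memory" is a small static array of 16-byte frames, and the sketch always copies on the first write, even when the writer happens to be the last user of the frame.

```c
#include <assert.h>
#include <string.h>

#define TOY_PAGE_SIZE 16
#define TOY_FRAMES    8

/* Hypothetical physical frame: contents plus a count of mappings to it. */
typedef struct ToyFrame {
    unsigned char data[TOY_PAGE_SIZE];
    int refcount;
} ToyFrame;

/* Hypothetical per-process page table entry. */
typedef struct ToyPte {
    int frame;          /* index of the frame this page maps to */
    int writable;       /* 1 if writes are allowed directly     */
    int copy_on_write;  /* 1 if a write should trigger a copy   */
} ToyPte;

static ToyFrame toy_frames[TOY_FRAMES];

/* Find an unused frame and claim it; returns -1 if none is left. */
static int toy_alloc_frame(void)
{
    int i;
    for (i = 0; i < TOY_FRAMES; i++) {
        if (toy_frames[i].refcount == 0) {
            toy_frames[i].refcount = 1;
            return i;
        }
    }
    return -1;
}

/*
 * Write one byte through a PTE. A write to a read-only page "faults":
 * if the page is marked copy-on-write, the handler copies the frame,
 * repoints the PTE at the private copy, and marks it read-write.
 * Returns 0 on success and -1 on an access violation or when out of frames.
 */
static int toy_write(ToyPte *pte, int offset, unsigned char value)
{
    if (offset < 0 || offset >= TOY_PAGE_SIZE) {
        return -1;
    }
    if (!pte->writable) {
        int copy;
        if (!pte->copy_on_write) {
            return -1;                  /* plain read-only page */
        }
        copy = toy_alloc_frame();
        if (copy < 0) {
            return -1;
        }
        memcpy(toy_frames[copy].data, toy_frames[pte->frame].data, TOY_PAGE_SIZE);
        toy_frames[pte->frame].refcount--;   /* one sharer fewer */
        pte->frame = copy;
        pte->writable = 1;
        pte->copy_on_write = 0;
    }
    toy_frames[pte->frame].data[offset] = value;
    return 0;
}
```

If two ToyPte values start out pointing at the same frame and one of them is written through toy_write(), only that PTE is redirected to a new frame; the other keeps seeing the original bytes, which is the behavior described above for a hooked DLL code page.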
The DLL data pages are shared with the copy-on-write mark. Hence, whenever a process writes to a data page, it gets a personal copy of it. Other processes keep sharing the original copy, thus maximizing sharing and improving memory usage.

A DLL may be loaded in memory at a different linear address in different processes. The memory references in the DLL (for example, the address for a call instruction, the address for a memory-to-register move instruction, and so on) need to be adjusted (patched) depending on the linear address where the DLL gets loaded. This process is called relocating the DLL. Obviously, relocation has to be done separately for each process. While relocating, Windows NT temporarily marks the DLL code pages as copy-on-write. Thus, only the pages requiring relocation are copied per process. Other pages that do not have memory references in them are shared by all processes. This is the reason Microsoft recommends that a DLL be given a preferred base address and be loaded at that address. The binding of the DLL to a specific base address ensures that the DLL need not be relocated if it is loaded at the specified base address. Hence, if all processes load

the DLL at the preferred base address, all can share the same copy of the DLL code.

The POSIX subsystem of Windows NT uses the copy-on-write mechanism to implement the fork system call. The fork system call creates a new process as a child of the calling process. The child process is a replica of the parent process, and it has the same state of code and data pages as the parent. Since these are two different processes, the data pages should not be shared by them. However, it is generally wasteful to make a copy of the parent's data pages because in most cases the child immediately invokes the exec system call. The exec system call discards the current memory image of the process, loads a new executable module, and starts executing the new executable module. To avoid copying the data pages, the fork system call marks the data pages as copy-on-write. Hence, a data page is copied only if the parent or the child writes to it.

Copy-on-write is an extremely important concept contributing to the efficiency of NT memory management. The following sample program demonstrates how copy-on-write works. By running two instances of the program, you can see how the concepts described in this section work. The application loads a DLL, which contains two functions and two data variables. One function does not refer to the outside world, so no relocations are required for it. The other function accesses one global variable, so it contains instructions that need relocation. One data variable is put in a shared data section so that it will be shared across multiple instances of the DLL. The other variable is put in a default data section. The two functions are put in separate code sections just to make them page aligned.

When you run the first instance of the application, the application loads the DLL and prints the physical addresses of the two functions and two data variables. After this, you run the second instance of the same application.
In the second instance, the application arranges to load the DLL at a different base address than that of the first instance. Then it prints the physical addresses of the two functions and two data variables. Next, the application arranges to load the DLL at the same base address as that of the first instance. In this case, all physical pages are seen to be shared. Next, the application modifies the shared and nonshared variables and the first few bytes of one function, and it prints the physical addresses of the two functions and two variables again. We first discuss the code for this sample program and then describe how the output from the sample program demonstrates memory sharing and the effects of the copy-on-write mechanism.

Listing 4-8: SHOWPHYS.C

    #include <windows.h>
    #include <stdio.h>
    #include "gate.h"
    #include "getphys.h"

    HANDLE hFileMapping;

    /* Imported function/variable addresses */
    static void *NonRelocatableFunction = NULL;
    static void *RelocatableFunction = NULL;
    static void *SharedVariable = NULL;
    static void *NonSharedVariable = NULL;

    HINSTANCE hDllInstance;

The initial portion of the file contains the header inclusion and global variable definitions. The program demonstrates the use of various page attributes, especially those used to implement the copy-on-write mechanism. As described earlier, the program uses four different types of memory sections. The pointers to the four different types of memory sections are defined as global variables. The hDllInstance variable stores the instance handle of the DLL that contains the different kinds of memory sections used in this demonstration.

    /* Loads MYDLL.DLL and initializes addresses of
       imported functions/variables from MYDLL.DLL and
       locks the imported areas */
    int LoadDllAndInitializeVirtualAddresses()
    {
        hDllInstance = LoadLibrary("MYDLL.DLL");
        if (hDllInstance == NULL) {
            printf("Unable to load MYDLL.DLL\n");
            return -1;
        }
        printf("MYDLL.DLL loaded at base address = %x\n", hDllInstance);

        NonRelocatableFunction =
            GetProcAddress(GetModuleHandle("MYDLL"), "_NonRelocatableFunction@0");
        RelocatableFunction =
            GetProcAddress(GetModuleHandle("MYDLL"), "_RelocatableFunction@0");
        SharedVariable =
            GetProcAddress(GetModuleHandle("MYDLL"), "SharedVariable");
        NonSharedVariable =
            GetProcAddress(GetModuleHandle("MYDLL"), "NonSharedVariable");

        if ((!NonRelocatableFunction) || (!RelocatableFunction) ||
            (!SharedVariable) || (!NonSharedVariable)) {
            printf("Unable to get the virtual addresses for imports from MYDLL.DLL\n");
            FreeLibrary(hDllInstance);
            hDllInstance = 0;
            return -1;
        }

        VirtualLock(NonRelocatableFunction, 1);
        VirtualLock(RelocatableFunction, 1);
        VirtualLock(SharedVariable, 1);
        VirtualLock(NonSharedVariable, 1);
        return 0;
    }

The four different types of memory sections that we use for the demonstration reside in MYDLL.DLL. The LoadDllAndInitializeVirtualAddresses() function loads MYDLL.DLL into the calling process's address space and initializes the global variables to point to the different types of memory sections in the DLL. The function uses the GetProcAddress() function to get hold of pointers to the exported functions and variables in MYDLL.DLL. The function stores the instance handle for MYDLL.DLL in a global variable so that the FreeDll() function can later use it to unload the DLL.

The function also locks the different memory sections so that the pages are loaded in memory and the page table entries are valid. Generally, Windows NT does not fill in the page table entries until the virtual address is actually accessed. In other words, the memory won't be paged in unless it is accessed. Also, the system can page out memory that has not been used for some time, again marking the page table entries as invalid. We use the VirtualLock() function to ensure that the pages of interest are always loaded and the corresponding page table entries remain valid.

    /* Unlocks the imported areas and frees the MYDLL.DLL */
    void FreeDll()
    {
        VirtualUnlock(NonRelocatableFunction, 1);
        VirtualUnlock(RelocatableFunction, 1);
        VirtualUnlock(SharedVariable, 1);
        VirtualUnlock(NonSharedVariable, 1);
        FreeLibrary(hDllInstance);
        hDllInstance = 0;
        NonRelocatableFunction = NULL;
        RelocatableFunction = NULL;
        SharedVariable = NULL;
        NonSharedVariable = NULL;
    }

The FreeDll() function uses the VirtualUnlock() function to unlock the memory locations locked by the LoadDllAndInitializeVirtualAddresses() function. The function unloads MYDLL.DLL after unlocking the memory locations from the DLL. As the DLL is unloaded, the global pointers to the memory sections in the DLL become invalid.
The function sets all these pointers to NULL according to good programming practice.

    /* Converts the page attributes in readable form */
    char *GetPageAttributesString(unsigned int PageAttr)

    {
        static char buffer[100];

        strcpy(buffer, "");
        strcat(buffer, (PageAttr & 0x01) ? "P " : "NP ");
        strcat(buffer, (PageAttr & 0x02) ? "RW " : "R ");
        strcat(buffer, (PageAttr & 0x04) ? "U " : "S ");
        strcat(buffer, (PageAttr & 0x40) ? "D " : " ");
        return buffer;
    }

The GetPageAttributesString() function takes the page attribute flags and returns a string describing them. The least significant bit of the page attributes indicates whether the page is present in memory or the page table entry is invalid. This information is printed as P or NP, which stands for present or not present. Similarly, R or RW means a read-only or read-write page; S or U means a supervisor-mode or a user-mode page; and D means a dirty page. The various page attributes are represented by different bits in the PageAttr parameter to this function. The function checks these bits to determine which attributes the page possesses.

    /* Displays virtual to physical address mapping */
    int DisplayVirtualAndPhysicalAddresses()
    {
        DWORD pNonRelocatableFunction = 0;
        DWORD pRelocatableFunction = 0;
        DWORD pSharedVariable = 0;
        DWORD pNonSharedVariable = 0;
        DWORD aNonRelocatableFunction = 0;
        DWORD aRelocatableFunction = 0;
        DWORD aSharedVariable = 0;
        DWORD aNonSharedVariable = 0;

        printf("\nVirtual to Physical address mapping\n");
        printf("\n \n");
        printf("Variable/function Virtual Physical Page\n");
        printf(" Address Address Attributes\n");
        printf(" \n");

        GetPhysicalAddressAndPageAttributes(NonRelocatableFunction,
            &pNonRelocatableFunction, &aNonRelocatableFunction);
        GetPhysicalAddressAndPageAttributes(RelocatableFunction,
            &pRelocatableFunction, &aRelocatableFunction);
        GetPhysicalAddressAndPageAttributes(SharedVariable,
            &pSharedVariable, &aSharedVariable);
        GetPhysicalAddressAndPageAttributes(NonSharedVariable, &pNonSharedVariable,

            &aNonSharedVariable);

        printf("NonRelocatableFunction\t %8x\t %8x\t %s\n", NonRelocatableFunction,
               pNonRelocatableFunction, GetPageAttributesString(aNonRelocatableFunction));
        printf("RelocatableFunction\t %8x\t %8x\t %s\n", RelocatableFunction,
               pRelocatableFunction, GetPageAttributesString(aRelocatableFunction));
        printf("SharedVariable\t %8x\t %8x\t %s\n", SharedVariable,
               pSharedVariable, GetPageAttributesString(aSharedVariable));
        printf("NonSharedVariable\t %8x\t %8x\t %s\n", NonSharedVariable,
               pNonSharedVariable, GetPageAttributesString(aNonSharedVariable));
        printf(" \n\n");
        return 0;
    }

The DisplayVirtualAndPhysicalAddresses() function is a utility function that displays the virtual address, the physical address, and the page attributes for the different memory sections. It uses the global pointers to the different sections in MYDLL.DLL initialized by the LoadDllAndInitializeVirtualAddresses() function. It uses the GetPhysicalAddressAndPageAttributes() function to get hold of the physical page address and the page attributes for a given virtual address. The first parameter to the GetPhysicalAddressAndPageAttributes() function is the input virtual address. The function stores the physical address for the input virtual address in the memory location pointed to by the second parameter and the page attributes in the location pointed to by the third parameter.

    int FirstInstance()
    {
        printf("***This is the first instance of the showphys program***\n\n");
        printf("Loading DLL MYDLL.DLL\n");
        if (LoadDllAndInitializeVirtualAddresses() != 0) {
            return -1;
        }
        DisplayVirtualAndPhysicalAddresses();
        printf("Now Run another copy of showphys...\n");
        getchar();
        FreeDll();
        return 0;
    }

We want to demonstrate the sharing of memory sections by a DLL loaded by two different processes. You need to run two instances of the demonstration program. The FirstInstance() function is executed when you run the first instance of the program.
The first instance loads the DLL and prints the physical addresses and page attributes for the various memory sections in the DLL. Then, the function asks you to run another instance of the program. At that point, two processes have loaded MYDLL.DLL. You can compare the outputs from these two instances to check how the memory sections are shared. More on this when we explain the output from this sample program.

int NonFirstInstance()
{
    DWORD OldAttr;
    HINSTANCE hJunk;

    printf("***This is another instance of the showphys program***\n\n");
    printf("Loading DLL MYDLL.DLL at diffrent base address than that of the first instance\n");
    CopyFile("MYDLL.DLL", "JUNK.DLL", FALSE);
    hJunk=LoadLibrary("JUNK.DLL");
    if (hJunk==NULL) {
        printf("could not find JUNK.DLL\n");
        return -1;
    }
    if (LoadDllAndInitializeVirtualAddresses()!=0) {
        FreeLibrary(hJunk);
        return -1;
    }
    FreeLibrary(hJunk);
    DisplayVirtualAndPhysicalAddresses();
    FreeDll();

    printf("Loading DLL MYDLL.DLL at same base address as that of the first instance\n");
    if (LoadDllAndInitializeVirtualAddresses()!=0) {
        return -1;
    }
    DisplayVirtualAndPhysicalAddresses();

    printf("...Modifying the code bytes at the start of NonRelocatableFunction\n");
    VirtualProtect(NonRelocatableFunction, 1, PAGE_READWRITE, &OldAttr);
    *(unsigned char *)NonRelocatableFunction=0xE9;
    printf("...Modifying the value of SharedVariable\n");
    *(char *)SharedVariable=0x10;
    printf("...Modifying the NonSharedVariable's value\n\n");
    *(char *)NonSharedVariable=0x10;
    DisplayVirtualAndPhysicalAddresses();
    FreeDll();
    return 0;
}

The second instance of the program does a lot more work than the first instance. The sharing of the DLL memory sections depends on how the instance loads the DLL and how it accesses the memory locations in the DLL. In more concrete terms, the sharing depends on whether the second instance loads the DLL at the same base address as the first instance, and on whether the instances only read the memory sections or any of them writes to the sections.

To demonstrate this, the NonFirstInstance() function first loads the DLL at a base address different from the one used by the first instance. It forces this by loading JUNK.DLL before loading MYDLL.DLL. JUNK.DLL is a copy of MYDLL.DLL and therefore has the same preferred base address. The first instance loads MYDLL.DLL at its preferred base address by default. In the second instance, MYDLL.DLL cannot be loaded at its preferred base address because that address range is already occupied by JUNK.DLL. Once MYDLL.DLL has been loaded at a different base address, there is no reason to keep JUNK.DLL loaded, so the program frees it. Next, the function prints the physical addresses and page attributes of the memory sections in MYDLL.DLL using the DisplayVirtualAndPhysicalAddresses() function. You can compare this information with the output of the first instance to see how DLLs loaded at different base addresses share the memory sections.

The NonFirstInstance() function also demonstrates the sharing of memory sections when MYDLL.DLL is loaded at the same base address by two processes. It unloads MYDLL.DLL and loads it again. This time MYDLL.DLL is loaded at its preferred base address because JUNK.DLL is no longer loaded and nothing else occupies that part of the virtual address space. Thus, MYDLL.DLL is loaded at the same base address in both the first and the second instance of the program.
The physical addresses and the page attributes printed here demonstrate the memory sharing by MYDLL.DLL when it is loaded at the same base address in two processes.

Next, the NonFirstInstance() function writes to some of the memory locations in MYDLL.DLL. As we explain soon, this action affects the memory sharing between the instances. As described earlier, the code sections are marked read-only by Windows NT. The function uses the VirtualProtect() API function to change the attributes of the page containing NonRelocatableFunction() so that it can modify a byte at the start of this function. You can modify the data variables from MYDLL.DLL without any such hassle because the data variables have the read-write attribute.

int DecideTheInstanceAndAct()
{
    hFileMapping = CreateFileMapping(
        (HANDLE)0xFFFFFFFF, NULL, PAGE_READWRITE, 0, 0x1000, "MyFileMapping");
    if (hFileMapping == NULL) {
        printf("unable to create file mapping\n");
        FreeDll();
        return -1;
    }
    if (GetLastError() == ERROR_ALREADY_EXISTS) {
        NonFirstInstance();
    } else {
        FirstInstance();
    }
    return 0;
}

The sample program does not accept any parameter to indicate whether it's the first instance. It uses a simple trick to decide: it creates a named file mapping. The call to the CreateFileMapping() API function sets the last error to ERROR_ALREADY_EXISTS if a mapping with the same name already exists, which indicates that an instance that created the file mapping is already running. In other words, if the last error is not ERROR_ALREADY_EXISTS after a successful call, this is the first instance of the program. Otherwise, another instance (that is, the first instance) is already running and the current instance is the second one. Depending on the result, the DecideTheInstanceAndAct() function calls either the FirstInstance() function or the NonFirstInstance() function.

A file mapping is automatically destroyed by the operating system when its reference count drops to zero. The sample program does not explicitly close the handle to the mapping. The handle is closed, and the reference count for the mapping is decremented, when the program exits. The mapping is freed when the last instance of the program exits.

main()
{
    int rc;

    /* Creates callgate to get PTE entries from ring 3 application */
    if ((rc = CreateRing0CallGate())!= SUCCESS) {
        printf("unable to create callgate, rc=%x\n",rc);
        return -1;
    }
    DecideTheInstanceAndAct();
    /* Releases the callgate */
    FreeRing0CallGate();
}

The main() function starts with a call to the CreateRing0CallGate() function, which is located in the GETPHYS.C file. The sample program uses the callgate mechanism to access the page tables. As described earlier, the page tables reside in kernel memory and are not accessible to user-mode code.
The CreateRing0CallGate() function arranges for a function that reads the page tables to be executed in kernel mode. The DisplayVirtualAndPhysicalAddresses() function later uses this function to obtain the physical address and the page attributes for a given virtual address. After creating the callgate, the main() function passes control to the DecideTheInstanceAndAct() function. The program frees the callgate before exiting.

Listing 4-9: GETPHYS.C

#include <windows.h>
#include <stdio.h>
#include "..\cgate\dll\gate.h"

static short CallGateSelector;

The GETPHYS.C file implements the function that accesses the page table using the callgate mechanism. The GATE.H file is included because it contains the prototypes for the functions that manipulate callgates. The segment selector of the callgate used by the program is stored in the global variable CallGateSelector.

/* C function called from assembly language stub */
BOOL _stdcall CFuncGetPhysicalAddressAndPageAttributes(
    unsigned int VirtualAddress,
    unsigned int *PhysicalAddress,
    unsigned int *PageAttributes)
{
    unsigned int *PageTableEntry;

    *PhysicalAddress = 0;
    *PageAttributes = 0;
    PageTableEntry = (unsigned int *)0xC0000000U + (VirtualAddress >> 0x0CU);
    if ((*PageTableEntry)&0x01) {
        *PhysicalAddress = ((*PageTableEntry)&0xFFFFF000U)
                           + (VirtualAddress&0x00000FFFU);
        *PageAttributes = (*PageTableEntry)&0x00000FFFU;
        return TRUE;
    } else {
        return FALSE;
    }
}

The CFuncGetPhysicalAddressAndPageAttributes() function executes in kernel mode using the callgate mechanism. The function depends on the fact that the page tables for a process are always mapped at the virtual address 0xC0000000. This area is an array of 1024 page tables, where each page table is an array of 1024 page table entries. You can access the memory area as if it were a single contiguous array of page table entries. The first entry in this big array corresponds to the virtual address range 0 to 4095, the second entry corresponds to the virtual address range 4096 to 8191, and so on. The function calculates the index into the big PTE array by dividing the given virtual address by 4096, that is, by shifting the virtual address right by 12 bits. Adding the index to the base address of the PTE array gives the required PTE.

Each PTE is 4 bytes (32 bits) long. The upper 20 bits of the PTE denote the address of the physical page, and the lower 12 bits denote the page attributes. The physical address and the page attributes are valid only if the LSB is set. The function checks the LSB and, if the bit is set, separates the physical page address and the page attributes by masking the appropriate bits of the PTE. It then adds the offset within the page to the physical page address to get the physical address for the given virtual address.

BOOL GetPhysicalAddressAndPageAttributes(
    void *VirtualAddress,
    unsigned int *PhysicalAddress,
    unsigned int *PageAttributes)
{
    BOOL rc;
    static short farcall[3];

    if (!CallGateSelector) {
        return FALSE;
    }
    farcall[2] = CallGateSelector;
    _asm {
        mov eax, PageAttributes
        mov ecx, PhysicalAddress
        mov edx, VirtualAddress
        call fword ptr [farcall]
        mov rc, eax
    }
    return rc;
}

The GetPhysicalAddressAndPageAttributes() function runs in user mode and invokes the CFuncGetPhysicalAddressAndPageAttributes() function in kernel mode using the callgate mechanism. It uses the callgate initialized by the call to the CreateRing0CallGate() function. The parameters to the kernel-mode function are passed through the processor registers. An intermediate Assembly language function, namely, GetPhysicalAddressAndPageAttributes(), converts the register parameters to stack parameters.

int CreateRing0CallGate()
{
    DWORD rc;

    rc = CreateCallGate(
        _GetPhysicalAddressAndPageAttributes,
        0,
        &CallGateSelector);
    return rc;
}

The CreateRing0CallGate() function is a utility function that uses the CreateCallGate() function provided by GATE.DLL to create a callgate that executes the GetPhysicalAddressAndPageAttributes() function in kernel mode. It stores the segment selector of the created callgate in the CallGateSelector global variable, which the GetPhysicalAddressAndPageAttributes() function uses later when invoking the kernel-mode function.

int FreeRing0CallGate()
{
    DWORD rc;

    rc = FreeCallGate(CallGateSelector);
    if (rc == SUCCESS) {
        CallGateSelector = 0;
    }
    return rc;
}

The FreeRing0CallGate() function is another utility function. It destroys the callgate created by the CreateCallGate() function, using the FreeCallGate() interface function provided by GATE.DLL.

Listing 4-10: RING0.ASM

.386
.model small
.code

public GetPhysicalAddressAndPageAttributes
extrn _CFuncGetPhysicalAddressAndPageAttributes@12:near

include ..\include\undocnt.inc

GetPhysicalAddressAndPageAttributes proc
    Ring0Prolog
    push eax
    push ecx
    push edx
    call _CFuncGetPhysicalAddressAndPageAttributes@12
    Ring0Epilog
    retf
GetPhysicalAddressAndPageAttributes endp

END
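The address arithmetic performed by CFuncGetPhysicalAddressAndPageAttributes() can be tried in user mode on plain integers, without touching real page tables. The following is only a sketch under stated assumptions: the helper names (pte_address_for, decode_pte, attr_string) are hypothetical and not part of the sample program, the PTE array base is assumed to be 0xC0000000 as on 32-bit x86 Windows NT, and the PTE value passed to decode_pte() is supplied by the caller rather than read from memory.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumption: base of the linear PTE array on 32-bit x86 Windows NT. */
#define PTE_ARRAY_BASE 0xC0000000u

/* Virtual address of the PTE that maps 'va'.
 * The index is va / 4096 and each PTE is 4 bytes long. */
uint32_t pte_address_for(uint32_t va)
{
    return PTE_ARRAY_BASE + (va >> 12) * 4u;
}

/* Splits a PTE value into physical address and attribute bits.
 * Returns 1 if the present bit (LSB) is set, 0 otherwise. */
int decode_pte(uint32_t pte, uint32_t va, uint32_t *phys, uint32_t *attrs)
{
    *phys = 0;
    *attrs = 0;
    if (!(pte & 0x01u)) {
        return 0;                       /* entry is not valid */
    }
    *phys = (pte & 0xFFFFF000u)         /* upper 20 bits: physical page */
          + (va & 0x00000FFFu);         /* lower 12 bits of va: offset  */
    *attrs = pte & 0x00000FFFu;         /* lower 12 bits: attributes    */
    return 1;
}

/* Same decoding as GetPageAttributesString(), but into a caller-supplied
 * buffer instead of a static one. */
const char *attr_string(uint32_t attrs, char *buf, size_t len)
{
    snprintf(buf, len, "%s%s%s%s",
             (attrs & 0x01u) ? "P "  : "NP ",
             (attrs & 0x02u) ? "RW " : "R ",
             (attrs & 0x04u) ? "U "  : "S ",
             (attrs & 0x40u) ? "D "  : " ");
    return buf;
}
```

For instance, calling decode_pte() with a made-up PTE value of 0x00E44067 and the virtual address 0x2000C123 yields the physical address 0x00E44123 and the attribute bits 0x067. Writing into a caller-supplied buffer is a design choice of this sketch; it avoids the shared static buffer used by the original helper.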

The GetPhysicalAddressAndPageAttributes() function gets control through the callgate. The function executes the Ring0Prolog macro just after entry to enable paging in kernel mode. It converts the register parameters to stack parameters because CFuncGetPhysicalAddressAndPageAttributes() is a C function that expects its parameters on the stack.

Listing 4-11 presents the output from the previous sample program for two instances of the showphys program. Note the differences between the physical addresses and page attributes printed by the first instance and the second instance. See if you can explain the output, and then compare your findings with our description that follows the listing.

Listing 4-11: showphys program

Loading DLL MYDLL.DLL
MYDLL.DLL loaded at base address = 20000000

Virtual address to Physical address mapping

Variable/function       Virtual   Physical  Page
                        Address   Address   Attributes
NonRelocatableFunction  20001000  d8b000    P R U
RelocatableFunction     20002000  d8a000    P R U
SharedVariable          2000c000  e44000    P RW U
NonSharedVariable       2000b000  6b7000    P R U

Now Run another copy of showphys...

This is another instance of the showphys program:

Loading DLL MYDLL.DLL at diffrent base address than that of the first instance
MYDLL.DLL loaded at base address = 7e0000

Virtual address to Physical address mapping

Variable/function       Virtual   Physical  Page
                        Address   Address   Attributes
NonRelocatableFunction  7e1000    d8b000    P R U
RelocatableFunction     7e2000    1d6c000   P R U
SharedVariable          7ec000    e44000    P RW U
NonSharedVariable       7eb000    6b7000    P R U

Loading DLL MYDLL.DLL at same base address as that of the first instance
MYDLL.DLL loaded at base address = 20000000

Virtual address to Physical address mapping

Variable/function       Virtual   Physical  Page
                        Address   Address   Attributes
NonRelocatableFunction  20001000  d8b000    P R U
RelocatableFunction     20002000  d8a000    P R U
SharedVariable          2000c000  e44000    P RW U
NonSharedVariable       2000b000  6b7000    P R U

...Modifying the code bytes at the start of NonRelocatableFunction
...Modifying the value of SharedVariable
...Modifying the NonSharedVariable's value

Virtual address to Physical address mapping

Variable/function       Virtual   Physical  Page
                        Address   Address   Attributes
NonRelocatableFunction  20001000  e000      P RW U D
RelocatableFunction     20002000  d8a000    P R U
SharedVariable          2000c000  e44000    P RW U D
NonSharedVariable       2000b000  1ceb000   P RW U D

Note the page attributes in the output of the first instance. The functions are marked read-only, as expected. The nonshared variable is also marked read-only. This is because Windows NT tries to share the data space as well. As described earlier, such pages are marked for copy-on-write, and as soon as the process modifies any location in the page, the process gets a private copy of the page to write to. The other page attributes show that the PTE is valid, the page is a user-mode page, and nobody has modified the page so far.

Now, compare the output from the first instance with the output from the second instance when it loaded MYDLL.DLL at a base address different from that in the first instance. As expected, the virtual addresses of all the memory sections differ from those in the first instance. The physical addresses are the same except for the physical address of the relocatable function. This demonstrates that the code pages are marked as copy-on-write, and when the loader modifies the code pages while performing relocation, the process gets a private writable copy. Our nonrelocatable function does not need any relocation; hence, the corresponding pages are not modified. The second instance can share these pages with the first instance and hence has the same physical page address.

To cancel out the effects of relocation, the second instance loads MYDLL.DLL at the same base address as the first instance. Now the virtual addresses match the ones from the first instance. Note that the physical address of the relocatable function also matches that in the output from the first instance. The loader need not relocate the function because the DLL is loaded at its preferred base address. This allows more memory sharing and provides optimal performance. It's reason enough to allocate proper, nonclashing preferred base addresses for your DLLs.

This ideal share-all situation ceases to exist as soon as a process modifies some memory location. Other processes cannot be allowed to view these modifications. Hence, the modifying process gets its own copy of the page. The second instance of the sample program demonstrates this by modifying the data variables and a byte at the start of the nonrelocatable function. The output shows that the physical address of the nonrelocatable function no longer matches that in the first instance. The loader did not modify the nonrelocatable function, but our own modification had the same effect on sharing. The shared variable remains shared. Its physical address matches that in the first instance because all the processes accessing a shared variable are allowed to see the modifications made by other processes. But the nonshared variable has a different physical address now. The second instance cannot share the variable with the first instance and gets its own copy.
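The copy-on-write behavior seen in the output can be modeled in a few lines of ordinary C. This is a toy simulation, not how the Windows NT memory manager is implemented: the Frame and Mapping structures, the helper names, and the 16-byte "page" are all hypothetical, and the page fault is represented by an explicit check inside write_byte().

```c
#include <stdlib.h>
#include <string.h>

#define PAGE_BYTES 16               /* tiny "page", for illustration only */

/* A physical page frame with a count of mappings that reference it. */
typedef struct Frame {
    unsigned char data[PAGE_BYTES];
    int refs;
} Frame;

/* One process's view of a page: roughly the role a PTE plays. */
typedef struct Mapping {
    Frame *frame;
    int writable;                   /* RW attribute */
    int copy_on_write;              /* read-only now, private copy on write */
    int dirty;                      /* D attribute */
} Mapping;

Frame *frame_new(void)
{
    return calloc(1, sizeof(Frame));    /* zero-filled, refs == 0 */
}

/* Maps a frame either as truly shared (writable by everyone) or as
 * read-only with copy-on-write. */
void map_frame(Mapping *m, Frame *f, int shared)
{
    m->frame = f;
    f->refs++;
    m->writable = shared;
    m->copy_on_write = !shared;
    m->dirty = 0;
}

/* Writes one byte, resolving a simulated page fault if necessary.
 * Returns 0 on success, -1 on an access violation or allocation failure. */
int write_byte(Mapping *m, size_t offset, unsigned char value)
{
    if (offset >= PAGE_BYTES) {
        return -1;
    }
    if (!m->writable) {
        if (!m->copy_on_write) {
            return -1;              /* genuinely read-only page */
        }
        if (m->frame->refs > 1) {   /* others still use the frame: copy it */
            Frame *copy = malloc(sizeof(Frame));
            if (copy == NULL) {
                return -1;
            }
            memcpy(copy->data, m->frame->data, PAGE_BYTES);
            copy->refs = 1;
            m->frame->refs--;
            m->frame = copy;
        }
        m->writable = 1;            /* later writes cause no further faults */
        m->copy_on_write = 0;
    }
    m->frame->data[offset] = value;
    m->dirty = 1;
    return 0;
}
```

If two mappings reference the same frame as shared, a write through one is visible through the other and the frame pointer stays the same, which corresponds to SharedVariable keeping its physical address. If both reference it as copy-on-write, the first write gives the writer a different frame and marks the mapping writable and dirty, which corresponds to the new physical address and the P RW U D attributes of NonSharedVariable.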
The private copy of the nonshared variable's page was created by the system page fault handler when we tried to write to a read-only page that was also marked for copy-on-write. Note that the page is now marked read-write. Hence, further writes go through without the operating system getting any page faults. Also, note that the modified pages are marked as dirty by the processor.

SWITCHING CONTEXT

As we saw earlier, Windows NT can switch the memory context to another process by setting the appropriate page table directory. The processor requires that the pointer to the current page table directory be maintained in the CR3 register. Therefore, when the Windows NT scheduler wants to perform a context switch to another process, it simply sets the CR3 register to the page table directory of the process concerned.

For some API calls, such as VirtualAllocEx(), Windows NT needs to change only the memory context. The VirtualAllocEx() API call allocates memory in the memory space of a process other than the calling process. Other system calls that require a memory context switch are ReadProcessMemory() and WriteProcessMemory(), which read from and write to, respectively, memory blocks of a process other than the calling process. These functions are used by debuggers to access the memory of the process being debugged. The subsystem server processes also use these functions to access the client process's memory.

The undocumented KeAttachProcess() function from the NTOSKRNL module switches the memory context to the specified process. The undocumented KeDetachProcess() function switches it back. In addition to switching the memory context, attaching also establishes the notion of the current process. For example, if you attach to a particular process and create a mutex, the mutex is created in the context of that process. The prototypes for KeAttachProcess() and KeDetachProcess() are as follows:

NTSTATUS KeAttachProcess(PEB *);
NTSTATUS KeDetachProcess();

Another place where a call to the KeAttachProcess() function appears is the NtCreateProcess() system call. This system call is executed in the context of the parent process. As a part of this system call, Windows NT needs to map the system DLL (NTDLL.DLL) in the child process's address space. Windows NT achieves this by calling KeAttachProcess() to switch the memory context to the child process. After mapping the DLL, Windows NT switches back to the parent process's memory context by calling the KeDetachProcess() function.

The following sample demonstrates how you can use the KeAttachProcess() and KeDetachProcess() functions. The sample prints the page directories for all the processes running in the system. The complete source code is not included; only the relevant portion is given. Because these functions can be called only from a device driver, we have written a device driver that provides an IOCTL demonstrating their use. We show the function that is called in response to DeviceIoControl() from the application. The output of the program appears in the window of a kernel-mode debugger (such as SoftICE). Getting the information back to the application is left as an exercise for the reader.
void DisplayPageDirectory(void *Peb)
{
    unsigned int *PageDirectory = (unsigned int *)0xC0300000;
    int i;
    int ctr=0;

    KeAttachProcess(Peb);
    for (i = 0; i < 1024; i++) {
        if (PageDirectory[i]&0x01) {
            if ((ctr%8) == 0)
                DbgPrint(" \n");
            DbgPrint("%08x ", PageDirectory[i]&0xFFFFF000);
            ctr++;
        }
    }
    DbgPrint("\n\n");
    KeDetachProcess();
}

The DisplayPageDirectory() function accepts the PEB for the process whose page directory is to be printed. The function first calls the KeAttachProcess() function with the given PEB as the parameter. This switches the page directory to the desired one. The function can still access its local variables because the kernel address space is shared by all the processes. Once the address space is switched, the address 0xC0300000 points to the page directory to be printed. The function scans the 1024 entries of the page directory, prints the valid ones, and then switches back to the original address space using the KeDetachProcess() function.

void DisplayPageDirectoryForAllProcesses()
{
    PLIST_ENTRY ProcessListHead, ProcessListPtr;
    ULONG BuildNumber;
    ULONG ListEntryOffset;
    ULONG NameOffset;

    BuildNumber=NtBuildNumber & 0x0000FFFF;
    if ((BuildNumber==0x421) || (BuildNumber==0x565)) { // NT 3.51 or NT 4.0
        ListEntryOffset=0x98;
        NameOffset=0x1DC;
    } else if (BuildNumber==0x755) { // Windows 2000 beta2
        ListEntryOffset=0xA0;
        NameOffset=0x1FC;
    } else {
        DbgPrint("Unsupported NT Version\n");
        return;
    }

    ProcessListHead=ProcessListPtr=
        (PLIST_ENTRY)(((char *)PsInitialSystemProcess)+ListEntryOffset);
    while (ProcessListPtr->Flink!=ProcessListHead) {
        void *Peb;
        char ProcessName[16];

        Peb=(void *)(((char *)ProcessListPtr)-ListEntryOffset);
        memset(ProcessName, 0, sizeof(ProcessName));
        memcpy(ProcessName, ((char *)Peb)+NameOffset, 16);
        DbgPrint("**%s ", ProcessName, Peb);
        DisplayPageDirectory(Peb);
        ProcessListPtr=ProcessListPtr->Flink;
    }
}

The DisplayPageDirectoryForAllProcesses() function calls the DisplayPageDirectory() function for each process in the system. All the processes running in a system are linked in a list. The function obtains the list of processes from the PEB of the initial system process. The PsInitialSystemProcess variable in NTOSKRNL holds the PEB for the initial system process. The process list node is located at an offset of 0x98 (0xA0 for Windows NT 5.0) inside the PEB. The process list is a circular linked list, so once you have any node in the list, you can traverse the entire list. The function traverses the process list by following the Flink member, printing the page directory for the next PEB in the list each time, until it arrives back at the PEB it started with. For every process, the function first prints the process name, which is stored at a version-dependent offset within the PEB, and then calls the DisplayPageDirectory() function to print the page directory.

Here, we list partial output from the sample program. Please note a couple of things in the output. First, every page directory has 50-odd valid entries, while the page directory size is 1024. The remaining entries are invalid, meaning that the corresponding page tables are either not used or are swapped out. In other words, the main memory overhead of storing page tables is negligible because the page tables themselves can be swapped out. Also, note that the page directories have the same entries in the later portion of the page directory. This part represents the kernel portion, which is shared across all processes by using the same set of page tables for the kernel address range.

Listing 4-12: Displaying page directories: output

**System, **smss.exe, **winlogon.exe (hexadecimal page directory entries not recoverable)
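The traversal technique used by DisplayPageDirectoryForAllProcesses(), walking a circular doubly linked list and subtracting the offset of the embedded list node to recover the enclosing structure, can be tried on ordinary user-mode data. The sketch below is illustrative only: FakeProcess, ListEntry, and the helper functions are hypothetical stand-ins, and the field offset is computed with offsetof() rather than hard-coded per build as in the driver code.

```c
#include <stddef.h>
#include <string.h>

/* Same shape as the kernel's list node: forward and backward links. */
typedef struct ListEntry {
    struct ListEntry *Flink;
    struct ListEntry *Blink;
} ListEntry;

/* Hypothetical record with the list node embedded at a nonzero offset. */
typedef struct FakeProcess {
    int id;
    ListEntry links;
    char name[16];
} FakeProcess;

void list_init(ListEntry *head)
{
    head->Flink = head->Blink = head;
}

void list_insert_tail(ListEntry *head, ListEntry *entry)
{
    entry->Flink = head;
    entry->Blink = head->Blink;
    head->Blink->Flink = entry;
    head->Blink = entry;
}

/* Walks the ring once, starting at any node, and copies the name of each
 * record into 'out'. The bare list head ('sentinel') is skipped because
 * it is not embedded in a record. Returns the number of names stored. */
size_t collect_names(ListEntry *start, const ListEntry *sentinel,
                     char out[][16], size_t max)
{
    size_t count = 0;
    ListEntry *node = start;

    do {
        if (node != sentinel && count < max) {
            /* Step back from the embedded node to the enclosing record. */
            FakeProcess *rec =
                (FakeProcess *)((char *)node - offsetof(FakeProcess, links));
            memcpy(out[count], rec->name, 16);
            out[count][15] = '\0';
            count++;
        }
        node = node->Flink;
    } while (node != start);
    return count;
}
```

Starting from a node inside any record visits every record exactly once, which is why holding a single node is enough to enumerate the whole list. Using offsetof() keeps the sketch independent of structure layout; the driver code has to hard-code the offsets because the kernel structure is undocumented.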

DIFFERENCES BETWEEN WINDOWS NT AND WINDOWS 95/98

Generally, the memory management features offered by Windows 95/98 are the same as those in Windows NT. Windows 95/98 also offers a separate 32-bit flat address space for each process. Features such as shared memory are still available. However, there are some differences, which stem from the fact that Windows 95/98 is not as secure as Windows NT. Windows 95/98 often trades security for performance.

Windows 95/98 still has the concept of user-mode and kernel-mode code. The bottom 3GB is user-mode space, and the top 1GB is kernel-mode space. But in Windows 95/98 the 3GB user-mode space is further divided into shared space and private space. The 2GB to 3GB region is the shared address space for Windows 95/98 processes. For all processes, the page tables for this shared region point to the same set of physical pages.

All the shared DLLs are loaded in the shared region. All the system DLLs (for example, KERNEL32.DLL and USER32.DLL) are shared DLLs. Also, a DLL's code or data segment can be declared shared when the DLL is built, and the DLL is then loaded in the shared region. Shared memory blocks are also allocated space in the shared region. In Windows 95/98, once a process maps a shared section, the section is visible to all processes. Because the section is mapped in the shared region, other processes need not map it separately.

Such a shared region has advantages as well as disadvantages. Windows 95/98 need not map the system DLLs separately for each process; the corresponding entries of the page table directory can simply be copied for each process. Also, the system DLLs loaded in the shared region can maintain global data about all the processes, so separate subsystem processes are not required. Finally, most system calls turn out to be simple function calls into the system DLLs and, as a result, are very fast.
In Windows NT, most system calls cause either a context switch to kernel mode or a context switch to the subsystem process, both of which are costly operations. For developers, loading system DLLs in a shared region means that they can install global hooks for functions in system DLLs.

Windows 95/98 pays for all these advantages with security. In Windows 95/98, any process can access all the shared data even if it has not mapped it. A process can also corrupt the system DLLs and thereby affect all processes.

SUMMARY

In this chapter, we discussed the memory management of Windows NT from three different perspectives. Memory management offers programmers a 32-bit flat address space for every process. A process cannot access another process's memory or tamper with it, but two processes can share memory if they need to.

Windows NT builds its memory management on top of the memory management facilities provided by the microprocessor. The 386 (and above) family of Intel microprocessors provides support for segmentation plus paging. The address translation mechanism first calculates the virtual address from the segment descriptor and the specified offset within the segment. The virtual address is then converted to a physical address using the page tables. The operating system can restrict access to certain memory regions by using the security mechanisms that are provided at both the segment level and the page level.

Windows NT memory management provides the programmer with a flat address space, data sharing, and so forth by selectively using the memory management features of the microprocessor. The virtual memory manager takes care of paging and allows 4GB of virtual address space for each process, even when the entire system has much less physical memory at its disposal. The virtual memory manager keeps track of all the physical pages in the system through the page frame database (PFD). The system also keeps track of the virtual address space for each process using the virtual address descriptor (VAD) tree. Windows NT uses the copy-on-write mechanism for various purposes, especially for sharing the DLL data pages. The memory manager plays an important part in switching the processor context when a process is scheduled for execution.

Windows 95/98 memory management is similar to Windows NT memory management; the differences arise because Windows 95/98 is not as security conscious as Windows NT.

Chapter 5 Reverse Engineering Techniques

Abstract

This chapter teaches you how to reverse engineer Windows NT given the raw Assembly code and the useful symbolic information provided by Microsoft in the form of .dbg files. With this knowledge, you can explore the undocumented Windows NT world on your own.

THIS CHAPTER DIFFERS greatly from other chapters in the book. It does not contain any undocumented Windows NT information. Instead, it provides some general tips on how to reverse engineer on your own to explore the undocumented Windows NT world. This chapter teaches you how to reverse engineer Windows NT given the raw Assembly code and the useful symbolic information provided by Microsoft in the form of .dbg files. You can find these .dbg files on the Windows NT distribution CD-ROM.

This chapter does not provide a complete guide to reverse engineering, for the simple reason that there is no clearly defined way of approaching the problem. Reverse engineering is like panning for gold; you have to sift through tons of Assembly code to find a little information. But this chapter contains some useful tricks that we used to uncover the undocumented parts of Windows NT. Reverse engineering is an art, and it requires a lot of intuition, patience, and logical deduction.

We divided this chapter into sections, each describing a step in reverse engineering. We conclude the chapter by illustrating the reverse engineering of a sample undocumented function. The best tool for reverse engineering is NuMega's excellent SoftICE. This book would not have been possible without SoftICE. This chapter assumes that the reader has used debuggers. We recommend trying out SoftICE to get the most out of this chapter. Although the concepts explained here apply specifically to reverse engineering NTOSKRNL (the NT Executive image) using SoftICE, they can be applied to reverse engineering any piece of operating system code.
HOW TO PREPARE FOR REVERSE ENGINEERING

First, install SoftICE on your machine with Boot time as the option. Next, locate the .dbg files in the SUPPORT directory on the Windows NT CD-ROM. There are many .dbg files in this directory, categorized according to the type of the file (for example, .dll, .sys, or .exe). The .dbg files you require depend on the Windows NT component you want to explore.

XREF: See the NuMega Web site for up-to-date version information on SoftICE.

You need the following .dbg files to explore the KERNEL component:

KERNEL32.DBG
NTDLL.DBG
NTOSKRNL.DBG

You need the following .dbg files to explore the USER and GDI components:

USER32.DBG
GDI32.DBG
CSRSS.DBG
CSRSRV.DBG
WIN32K.DBG

Copy these .dbg files onto your hard drive, and then, using the symbol loader, convert the .dbg files into .nms files (the native symbol format of SoftICE). Then, add these files to the SoftICE initialization settings using the SoftICE Initialize Settings/Symbols option in the symbol loader. This ensures that the symbols are loaded when SoftICE loads. Now, reboot the machine. SoftICE now shows symbolic information rather than hex addresses, which makes the Assembly code more readable.

The Windows 2000 symbolic information comes in .dbg and .pdb files instead of just .dbg files. You need the MSPDB60.DLL file from Visual C++ to convert these files into the native symbol format of SoftICE (.nms).

HOW TO REVERSE ENGINEER

Because most of the Windows NT components are written in C, you must understand how the C compiler generates the Assembly code that corresponds to a C function. You must also understand how a compiler generates the code to call a particular function, how the parameters are passed, how the compiler implements local variables, and so on.

Compilers follow different function calling conventions. We will not get into the details of each and every calling convention. Instead, we will cover only the stdcall and fastcall calling conventions because most of the functions in Windows NT follow one of these two. NTOSKRNL.EXE contains a lot of functions with the fastcall calling convention, the fastest of all the calling conventions.

In the stdcall calling convention, the caller pushes the parameters from right to left, and the called function pops the parameters off the stack. The advantage of the stdcall calling convention is that it generates compact code because the code for popping the parameters off the stack resides in only one place (in the function itself).
The disadvantage is that because the function always pops a fixed number of parameters, this calling convention cannot support a variable number of arguments. To have a variable number of arguments, you must follow the cdecl calling convention, in which the caller removes the parameters from the stack. The fastcall calling convention resembles stdcall, except that its first two parameters are passed in registers (ECX and EDX) instead of on the stack. This results in faster code because register access is much faster than memory access.
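The difference between the three conventions comes down to who removes the parameters from the stack and where the first parameters travel. The following is a minimal sketch in portable C that models this with a toy stack; it is not compiler output. The Machine structure, the helper names, and the FAKE_RETURN value are all hypothetical and exist only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_WORDS 32           /* size of the toy stack, in 32-bit words */
#define FAKE_RETURN 0xDEADBEEFu  /* stands in for the address pushed by CALL */

/* Toy machine: a downward-growing stack plus the two fastcall registers. */
typedef struct {
    uint32_t mem[STACK_WORDS];
    size_t   sp;                 /* index of the current top of stack */
    uint32_t ecx, edx;
} Machine;

void machine_init(Machine *m) { m->sp = STACK_WORDS; m->ecx = 0; m->edx = 0; }
void push(Machine *m, uint32_t v) { m->mem[--m->sp] = v; }

/* Return from a callee: drop the return address plus `args` argument words.
 * args == 0 models a plain RET; args == 3 models RET 000C. */
void ret_n(Machine *m, size_t args) { m->sp += 1 + args; }

/* Callees: mem[sp] is the return address, stack arguments start at mem[sp+1]. */
uint32_t sum_stdcall(Machine *m)
{
    uint32_t r = m->mem[m->sp + 1] + m->mem[m->sp + 2] + m->mem[m->sp + 3];
    ret_n(m, 3);                 /* callee removes all three arguments */
    return r;
}

uint32_t sum_cdecl(Machine *m)
{
    uint32_t r = m->mem[m->sp + 1] + m->mem[m->sp + 2] + m->mem[m->sp + 3];
    ret_n(m, 0);                 /* callee leaves the arguments in place */
    return r;
}

uint32_t sum_fastcall(Machine *m)
{
    uint32_t r = m->ecx + m->edx + m->mem[m->sp + 1];
    ret_n(m, 1);                 /* only the third argument was on the stack */
    return r;
}

/* Callers: parameters are pushed from right to left in every case. */
uint32_t call_stdcall(Machine *m, uint32_t x, uint32_t y, uint32_t z)
{
    push(m, z); push(m, y); push(m, x);
    push(m, FAKE_RETURN);        /* CALL pushes the return address */
    return sum_stdcall(m);       /* nothing left to clean up here */
}

uint32_t call_cdecl(Machine *m, uint32_t x, uint32_t y, uint32_t z)
{
    uint32_t r;
    push(m, z); push(m, y); push(m, x);
    push(m, FAKE_RETURN);
    r = sum_cdecl(m);
    m->sp += 3;                  /* ADD ESP,0C in the caller */
    return r;
}

uint32_t call_fastcall(Machine *m, uint32_t x, uint32_t y, uint32_t z)
{
    m->ecx = x;                  /* first parameter in ECX */
    m->edx = y;                  /* second parameter in EDX */
    push(m, z);
    push(m, FAKE_RETURN);
    return sum_fastcall(m);
}
```

After any of the three call_ functions returns, m->sp is back at STACK_WORDS. In the cdecl case the cleanup sits in the caller, which is why only that convention can cope with a variable number of arguments.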

Let us take one sample C function that follows the stdcall calling convention and look at the corresponding Assembly code generated by the compiler. In this example, we will also see how compiler-generated Assembly code accesses the parameters passed to the function and how local variables are implemented. The concepts explained here form the basis for the reverse engineering discussed later in this chapter.

Listing 5-1: C function

    int _stdcall sum(int x, int y, int z)
    {
        int sum;
        sum=x+y+z;
        return sum;
    }

    main()
    {
        sum(10, 20, 30);
    }

Listing 5-2: Compiler-generated Assembly code for the C function in Listing 5-1

    ;sum
    PUSH EBP
    MOV EBP,ESP
    SUB ESP,04
    PUSH EBX
    PUSH ESI
    PUSH EDI
    MOV EAX,[EBP+10]
    ADD EAX,[EBP+0C]
    ADD EAX,[EBP+08]
    MOV [EBP-04],EAX
    MOV EAX,[EBP-04]
    POP EDI
    POP ESI
    POP EBX
    LEAVE
    RET 000C

    ;main
    PUSH 30
    PUSH 20
    PUSH 10
    CALL _sum@12

If you take a look at the Assembly code, the compiler generates the code to set the EBP register to the start of the stack frame. (The stack frame for the function starts from EBP+8 since the compiler pushes the EBP register to maintain the stack frame set up by the caller

function.) Hence, the parameters passed to the function start at EBP+8: [EBP] holds the saved EBP and [EBP+4] holds the return address. Therefore, the first parameter x is accessed as [EBP+8] by the generated Assembly code. The parameters y and z are accessed as [EBP+C] and [EBP+10].

For implementing local variables, compilers typically generate code that decrements the ESP register by the total number of bytes required to hold all the local variables defined in the function. In the previous code, there is only one local variable, sum; therefore, the compiler allocates space for 4 bytes (1 DWORD) on the stack by generating the instruction SUB ESP, 4. All such local variables are accessed as negative offsets from the EBP register. The variable sum is accessed as [EBP-4] in the code. The LEAVE instruction used at the end restores the contents of the EBP register and cleans up the local variables.

Let us demonstrate the preceding mechanism in tables. When the function sum is called, the stack frame looks like:

    30                                                <- Last parameter
    20                                                <- Second parameter
    10                                                <- First parameter
    Return address (address of the instruction
    following the CALL _sum@12 instruction)           <- ESP

After setting up the standard stack frame of PUSH EBP, MOV EBP, ESP and creating space for local variables, the stack looks like:

    30                                                <- Last parameter (EBP+10)
    20                                                <- Second parameter (EBP+C)
    10                                                <- First parameter (EBP+8)
    Return address (address of the instruction
    following the CALL _sum@12 instruction)
    Original contents of EBP register, holding
    the stack frame base for function main            <- EBP
    Local variable (sum)                              <- ESP (EBP-4)

Most of the functions in NTOSKRNL access the parameters and local variables in the same way (by setting up the frame using the EBP register and accessing the local variables using negative offsets from the EBP register). But a few functions do not set up this standard stack frame; instead, the parameters are accessed directly using the ESP register (such as ESP+8). In this case, reverse engineering becomes very difficult because ESP changes with every PUSH and POP, so the same parameter is accessed using different offsets from the ESP register at different places. The advantage is that it results in faster and more compact code.

UNDERSTANDING CODE GENERATION PATTERNS

Because compilers are themselves software programs, they follow certain patterns when generating the Assembly code. Here are a few patterns you will encounter often.

    LEA EDI, [EBP-24]
    MOV ECX, 6
    REP STOSD

This piece of code initializes 6 DWORDs (0x18 bytes) of memory starting at location EBP-24 with the value in the EAX register. This also suggests that a structure of size 0x18 bytes is probably defined locally and initialized in the function.

    MOV EAX, [EBP+18]
    TEST EAX,
    JZ BitNotSet
    ....
    BitNotSet:

This piece of code tests the fifteenth bit of the fifth parameter passed to the function (assuming the standard stack frame is generated for the function) and does the processing based on the result of the bit test.

    MOV EAX, [FS:124]

This statement fills in the EAX register with a pointer to the current thread object. Note that the FS register points to a Processor Control Region (PCR) in kernel mode.

    MOV EAX, [FS:124]

    MOV EAX, [EAX+40]

This piece of code fills in the EAX register with a pointer to the current process object under Windows NT 3.51. Under Windows NT 4.0 and Windows 2000, this instruction looks like MOV EAX, [EAX+44], since the offset of the pointer to the process object changed between versions 3.51 and 4.0.

HOW WINDOWS NT PROVIDES DEBUGGING INFORMATION

Various kernel data variables control the output of debug messages. By turning on a few bits in these variables, you can get more debugging messages from the operating system in addition to the messages given by default. Some of these bits are already turned on in checked builds of the operating system, although some of them are not. We strongly suspect that Microsoft turns on bits of these variables whenever it receives a bug report and wants to diagnose the problem, and probably turns them off again before a release ships. By doing this, Microsoft hides a wealth of information from anyone reverse engineering the operating system. We expose a part of this wealth here. There could be many other such flags.

Pieces of hidden debug message code inside NTOSKRNL appear like this:

    TEST [DebugVariable], 0x80
    JZ HideFromReverseEngineering
    PUSH ..
    PUSH ..
    PUSH ..
    CALL DbgPrint
    HideFromReverseEngineering:

Whenever you come across such a piece of code, just set the required bit from SoftICE, and you will see all the messages that are otherwise hidden.

Here are some of the known variables in NTOSKRNL and the debug messages shown by the operating system when these variables, or bits of these variables, are turned on. Most of the variables appear only in the checked builds of the operating system.

ExpEchoPoolCalls

By setting this variable to 1, you can get information about each memory allocation/deallocation performed using functions such as ExAllocatePoolWithTag and ExFreePool. The information shown includes the address where the memory was allocated, the size of the region allocated, the type of the pool used (paged/nonpaged), and the type of memory (cache, aligned, and so on). The information displays as follows:

    0xe EXALLOC: from Paged size 284 Callers:0, 0
    0xe EXDEALLOC: from Paged Callers:0, 0

ObpShowAllocAndFree

By setting this variable to 1, you can get information about each executive object when it is created/destroyed. The information includes the memory address where the object was

created and the type of the object (Key, Semaphore, and so on). The information appears like this:

    OB: Alloc e (e ) Key
    OB: Free e (e ) - Type: Key

LpcpTraceMessages

This variable proves very useful in reverse engineering the local procedure call (LPC) mechanism used by Windows NT for implementing various subsystems. By setting this variable to 1, you can get tons of information about how LPC functions. The information displays as follows:

    LPC[ ]: Allocate Msg e1118b08
    LPC[ ]: Explorer.exe Send Request (LPC_REQUEST) Msg e1118b08 (853) [ c21e63 77b9da6b] to Port e11a6dc0 (csrss.exe)
    LPC[ 1a.52 ]: csrss.exe Receive Msg e1118b08 (853) from Port e11a6dc0 (csrss.exe )
    LPC[ 1a.52 ]: Free Msg e1118b08
    LPC[ 1a.52 ]: Allocate Msg e1118b08
    LPC[ 1a.52 ]: csrss.exe Sending Reply Msg e1118b08 (853.0, 0) [ c21e63 77b9da6b] to Thread ff (Explorer.exe)
    LPC[ 1a.52 ]: csrss.exe Waiting for message to Port e11a6dc0 (csrss.exe)
    LPC[ ]: Explorer.exe Got Reply Msg e1118b08 (853) [ b9da6b] for Thread ff (Explorer.exe)
    LPC[ ]: Free Msg e1118b08

MmDebug

By setting different bits of this variable, you can see different messages generated by the memory management module. Below, we list the bits of this variable that you can set, along with the messages the operating system then generates.

Bit 2

    MM:actual fault c01dfc38 va 77f0e9db
    ***DumpPTE at c01dfc38 contains 3d0450 protoaddr e10f40a0 subsect ffba2fc0
    inserting element 51 77f0e001
    MM:actual fault c0307b00 va c1ec0000
    ***DumpPTE at c0307b00 contains 75c434 protoaddr e11d7068 subsect ffb6a3b0
    ***DumpPTE at e11d7068 contains 1c1d44c2 protoaddr e subsect fdfc2bf8
    MM:actual fault c030d500 va c
    ***DumpPTE at c030d500 contains 7f4434 protoaddr e11fd068 subsect ffb60bb0
    removing wsle 313 c

Bit 3

    ***WSLE cursize 79 frstfree 11a Min 1e Max 91 quota 88 firstdyn 3 last ent 25a next slot 3

    index 0 c
    index 1 c
    index 2 c
    index 3 c01ff
    index f43401

Bit 4

    csrss.exe file: \
    MMFAULT: va: 8018cd7e size: 1000 process: SystemVa file: \
    MMFAULT: va: 77d9bd10 size: 1000 process: progman.exe file: \
    MMFAULT: va: c1ec0000 size: 1000 process: SystemVa file: (null)
    MMFAULT: va: c size: 1000 process: SystemVa file: (null)
    MMFAULT: va: c size: 1000 process: SystemVa file: (null)

Bit 10

    allocated 0x1 Ptes at c03f308c
    releasing 0x1 system PTEs at location c03f308c
    System Pte at c03f308c for 1 entries (c03f308c)
    System Pte at c03f3108 for 2 entries (c03f310c)
    System Pte at c03f31b0 for 1 entries (c03f31b0)
    System Pte at c03f31d0 for 1 entries (c03f31d0)

Bit 28

    crea sect access mask f001f maxsize 0 page prot 10. allocation attributes file handle a0
    return crea sect handle a4 status 0
    crea sect access mask f001f maxsize page prot 4 allocation attributes file handle 0
    return crea sect handle 1f0 status 0
    mapview process handle ffffffff section 1f0 base address 0 zero bits 0 view size 0 offset 0 commitsize protect 4 Inheritdisp 2 Allocation type 0

Bit 30

    MM:**access fault - va 77ea1d17 process fdf787a0 thread fdf77020
    MM:**access fault - va 77ea31ba process fdf787a0 thread fdf77020

ObDebugFlags

Two bits of this variable (the fifth and sixth bits) control the operating system debug messages. These bits control the security descriptor-related messages.

Bit 6

    Reference Index = 20, New RefCount = 5
    Referencing index #20, Refcount = 5
    Dereferencing SecurityDescriptor e11cc778, index #20, refcount = 6
    Reference Index = 252, New RefCount = 8
    Referencing index #252, Refcount = 8

Bit 7

    Deassigning security descriptor e11cea98, Index = 252
    Deassigning security descriptor e11cc778, Index = 20
    Deassigning security descriptor e11d0ed8, Index = 214
    Deassigning security descriptor e11d89d8, Index = 250

NtGlobalFlag

One bit of this variable enables the debug messages. Other bits control the validations performed by the operating system and the general operation of the operating system. Take a look at the GFLAGS utility in the resource kit for a description of the individual bits of NtGlobalFlag.

The value of this variable is inherited by a variable in NTDLL.DLL during process startup. NTDLL.DLL uses the second bit of this variable to show the loading of a process. During process startup, NTDLL gets the value of this flag and sets its internal variable ShowSnap to 1 if the second bit is set. Once this bit is set, you can watch the behavior of the PE executable/DLL loader. Windows NT shows the names of all the imported DLLs as well as the real set of DLLs required to start an application. It also shows the address of the initialization function of each of these DLLs, along with a lot of other information.

Look at the following messages displayed by the operating system when just one bit of the NtGlobalFlag variable is turned on. Here, we started pstat.exe and terminated it immediately:

    LDR: PID: 0x47 started - 'pstat'
    LDR: NEW PROCESS
    Image Path: C:\MSTOOLS\bin\PSTAT.EXE (PSTAT.EXE)
    Current Directory: C:\MSTOOLS\bin
    Search Path: C:\MSTOOLS\bin;.;C:\WINNT40\System32;C:\WINNT40\system;C:\WINNT40;C:\WINNT40\system32;c:\winnt40;c:\winnt35;c:\winnt35\system32;c:\msdev\bin;c:\dos
    LDR: PSTAT.EXE bound to USER32.dll
    NTICE: Load32 START=77E10000 SIZE=62000 KPEB=FF925DE0 MOD=user32
    LDR: ntdll.dll used by USER32.dll
    LDR: Snapping imports for USER32.dll from ntdll.dll
    LDR: KERNEL32.dll used by USER32.dll
    NTICE: Load32 START=77ED0000 SIZE=5E000 KPEB=FF925DE0 MOD=kernel32
    LDR: ntdll.dll used by KERNEL32.dll
    LDR: Snapping imports for KERNEL32.dll from ntdll.dll

    LDR: Snapping imports for USER32.dll from KERNEL32.dll
    LDR: LdrLoadDll, loading NTDLL.dll from
    LDR: LdrGetProcedureAddress by NAME - RtlReAllocateHeap
    LDR: LdrLoadDll, loading NTDLL.dll from
    LDR: LdrGetProcedureAddress by NAME - RtlSizeHeap
    LDR: LdrLoadDll, loading NTDLL.dll from
    LDR: LdrGetProcedureAddress by NAME - RtlUnwind
    LDR: LdrLoadDll, loading NTDLL.dll from
    LDR: LdrGetProcedureAddress by NAME - RtlAllocateHeap
    LDR: LdrLoadDll, loading NTDLL.dll from
    LDR: LdrGetProcedureAddress by NAME - RtlFreeHeap
    LDR: Refcount USER32.dll (1)
    LDR: Refcount KERNEL32.dll (1)
    LDR: Refcount GDI32.dll (1)
    LDR: Refcount KERNEL32.dll (2)
    LDR: Real INIT LIST
    C:\WINNT40\system32\KERNEL32.dll init routine 77ed47a0
    C:\WINNT40\system32\RPCRT4.dll init routine 77dc060d
    C:\WINNT40\system32\ADVAPI32.dll init routine 77d38650
    C:\WINNT40\system32\USER32.dll init routine 77e23890
    LDR: KERNEL32.dll loaded. - Calling init routine at 77ed47a0
    LDR: RPCRT4.dll loaded. - Calling init routine at 77dc060d
    LDR: ADVAPI32.dll loaded. - Calling init routine at 77d38650
    LDR: USER32.dll loaded. - Calling init routine at 77e23890
    LDR: PID: 0x47 finished - 'pstat'
    NTICE: Exit32 PID=47 MOD=PSTAT

SepDumpSD

By setting this variable to 1, you make the operating system dump the security descriptor in the security handling-related code.

    SECURITY DESCRIPTOR
    Revision = 1
    Dacl present
    Self relative
    Owner S
    Group SYSTEM S
    Sacl@ 0
    Dacl@ e11f71fc
    Revision: 02 Size: 0044 AceCount: 0002

    AceHeader: Access Allowed
    Access Mask: 001f03ff
    AceSize = 20
    Ace Flags =
    Sid = SYSTEM S
    AceHeader: Access Allowed
    Access Mask:
    AceSize = 24
    Ace Flags =
    Sid = S

TokenGlobalFlag

By setting this variable to 1, you make the operating system dump the security token-related messages.

    SE (Token): Acquiring Token READ Lock for access to token 0xe11826f0
    SE (Token): Releasing Token Lock for access to token 0xe11826f0
    SE (Token): Acquiring Token READ Lock for access to token 0xe11826f0
    SE (Token): Releasing Token Lock for access to token 0xe11826f0

CmLogLevel and CmLogSelect

These variables control the debugging messages given by the registry handling code. Different log levels serve as debug levels. By setting the individual bits in CmLogSelect, you can control the volume of messages generated by the operating system. The maximum value of CmLogLevel is 7. By default, the individual bits in CmLogSelect are set to produce the most verbose output.

    NtOpenKey DesiredAccess= RootHandle=
    Name='\Registry\Machine\Software\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\notepad.exe'
    CmpParseKey:
    CompleteName = '\Registry\Machine\Software\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\notepad.exe'
    RemainingName = '\Machine\Software\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\notepad.exe'
    CmpFindSubKeyByName:
    Hive=e10025c8 Parent= SearchName=fdd0bd08
    CmpFindSubKeyInLeaf:
    Hive=e10025c8 Index=e10091f4 SearchName=fdd0bd08
    CmpFindSubKeyByName:

    Hive=e10025c8 Parent= SearchName=fdd0bd08
    CmpFindSubKeyInLeaf:
    Hive=e10025c8 Index=e100943c SearchName=fdd0bd08
    CmpFindSubKeyByName:
    Hive=e10c8988 Parent= SearchName=fdd0bd08

HOW TO DECIPHER THE PARAMETERS PASSED TO AN UNDOCUMENTED FUNCTION

This section describes how you can find out the parameters to be passed to an undocumented function. The first step in deciphering parameters is to set a breakpoint on the function using SoftICE. If you know an application that uses this undocumented function (from its import dump), start the application. For example, Dr. Watson (DRWTSN32.EXE) uses an undocumented NTDLL function, NtOpenThread().

XREF: You can find a complete list of functions (documented as well as undocumented) imported by an application using the DUMPBIN utility. For example, DUMPBIN PROGMAN.EXE /IMPORTS will display all the functions imported by the program manager.

To make DRWTSN32 start, run an application that faults (for example, with a general protection fault), or write one that faults deliberately. If you do not know an application that uses the undocumented function, try to find an equivalent Win32 API call. If you find such a call, write an application that calls it. For instance, if you want to decipher the parameters passed to the NtAllocateVirtualMemory system service, you may write an application that calls VirtualAlloc().

Once the breakpoint for the function that you want to decipher is triggered, you can look at the details of the function implementation. You can use some general tricks to decipher the parameters. We discuss a few of them in the sections that follow.

Examining the Error Handling Code

Many times a function checks the value of a particular parameter and, if the value is not appropriate, returns an error code. By looking up the error code in the NTSTATUS.H file from the DDK, you can find out what the error means and, from that, infer the type of the parameter involved.
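To see why error-handling code is so revealing, it helps to picture the kind of C source that produces it. The following is a minimal sketch, not taken from the Windows NT sources: the function query_object, its parameter names, the status_t typedef, and the accepted class and length values are hypothetical. Only the numeric status values are the ones defined in NTSTATUS.H (where NTSTATUS itself is a signed 32-bit type; an unsigned type is used here purely to keep the sketch portable).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t status_t;   /* hypothetical stand-in for NTSTATUS */

/* Numeric values as defined in NTSTATUS.H. */
#define STATUS_SUCCESS              0x00000000u
#define STATUS_INVALID_INFO_CLASS   0xC0000003u
#define STATUS_INFO_LENGTH_MISMATCH 0xC0000004u

/* Hypothetical query routine. It accepts only information class 0 and
 * requires the caller's buffer to be exactly 8 bytes long. Each check
 * compiles to a CMP against a constant followed by a conditional jump
 * around the code that stores the error status. */
status_t query_object(void *handle, uint32_t info_class,
                      void *buffer, uint32_t buffer_length)
{
    (void)handle;                           /* unused in this sketch */

    if (info_class != 0)
        return STATUS_INVALID_INFO_CLASS;   /* reveals: parameter 2 is a class */

    if (buffer_length != 8)
        return STATUS_INFO_LENGTH_MISMATCH; /* reveals: parameter 4 is a length */

    if (buffer != NULL)
        memset(buffer, 0, 8);               /* pretend to fill in the result */

    return STATUS_SUCCESS;
}
```

When you meet the disassembly of such a function, the status constant stored after each comparison tells you what the compared parameter means, even though the parameter names themselves are gone.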
Consider the following piece of code in the undocumented NtQueryMutant system service:

    CMP DWORD PTR [EBP+C], 0
    JZ 8019D397
    MOV DWORD PTR [EBP-34], C (STATUS_INVALID_INFO_CLASS)
    8019D397:
    CMP DWORD [EBP+14], 8
    JZ 8019D3B3
    MOV DWORD PTR [EBP-34], C (STATUS_INFO_LENGTH_MISMATCH)

From this Assembly code, you can easily see that [EBP+C], the second parameter, contains the InfoClass, and [EBP+14], the fourth parameter, contains the size of the buffer that holds the mutant information.

Use in the Function

Sometimes, a particular parameter of an undocumented function is passed as a parameter to some documented function. In this case, by looking at the documented function, you can easily find out the meaning of the parameter passed to the undocumented function. Consider the following piece of code in the NtQueryMutant() function:

    PUSH 00
    LEA EAX,[EBP-20]
    PUSH EAX
    PUSH DWORD PTR [EBP-19]
    MOV EAX,[_ExMutantObjectType]
    PUSH EAX
    PUSH 01
    PUSH DWORD PTR [EBP+08]
    CALL _ObReferenceObjectByHandle
    MOV [EBP-24],EAX
    TEST EAX,EAX
    JL 8019D435
    PUSH DWORD PTR [EBP-20]
    CALL _KeReadStateMutant

Looking at this code, you can clearly see that the first parameter to the NtQueryMutant() function is an object handle: [EBP+08] is pushed last, which makes it the first parameter to the documented ObReferenceObjectByHandle() function, and the first parameter to ObReferenceObjectByHandle() is the object handle. Hence, knowing that the name of the function is NtQueryMutant and that its first parameter is passed as is to ObReferenceObjectByHandle() as an object handle, we can conclude that the first parameter is most likely a handle to a mutex object.

Checking the Validation Code

Sometimes, a piece of code checks the value of a parameter and displays a message if it has a particular value. By looking at the message provided by the operating system, you can find out what the parameter is. Asserts are used extensively, especially in checked builds, and the messages in these asserts often identify the parameters. For example, a function that expects a PEB as a parameter contains a piece of code that checks whether the type field of the object is a Process object.

TYPICAL ASSEMBLY LANGUAGE PATTERNS AND THEIR MEANINGS

This piece of code gets the current process object pointer (PEB) in the EAX register:

    MOV EAX, FS:[124]
    MOV EAX, [EAX+40]

While executing in kernel mode, FS:[124] always points to the currently executing thread (TEB). Under Windows NT 3.51, [TEB+40] points to the current process; under Windows NT 4.0 and Windows 2000, [TEB+44] points to the current process.

    MOV EAX, ESI
    AND EAX, 0xFFFFF3FF
    SHR EAX, 0A
    SUB EAX,

    MOV EAX, ESI
    AND EAX, FFCFFFFF
    SHR EAX, 14
    SUB ECX, 3FD00000

The preceding two pieces of code compute the addresses of the page table entry and the page directory entry, respectively, for the virtual address present in the ESI register. The registers used might change; however, the pattern remains the same. You may have seen this code in many memory management-related functions. At first it looks odd; however, it is highly optimized using the 2's complement method. As an exercise, try to determine how this works. Hint: Page tables are mapped starting at the virtual address 0xC , and Page directory is mapped starting at the virtual address 0xC

    PUSH 00
    LEA EAX,[EBP-20]
    PUSH EAX
    PUSH ECX
    PUSH DWORD PTR [_PsProcessType]
    PUSH 08
    PUSH DWORD PTR [EBP+08]
    CALL _ObReferenceObjectByHandle
    MOV [EBP-24],EAX
    TEST EAX,EAX
    JL ...
    MOV EAX,FS:[124]
    MOV ECX,[EBP-20]
    CMP [EAX+40],ECX
    JZ ...
    PUSH ECX

    CALL _KeAttachProcess

Here, the code prepares to operate inside another process: it wants to perform some work on behalf of a process other than the current one. This piece of code gets the handle to the Process object as a parameter. Using this handle, the code obtains the actual object and then compares the address of that Process object with the address of the current Process object, stored at [TEB+40] in Windows NT 3.51 and at [TEB+44] in Windows NT 4.0 and Windows 2000. If the Process object dealt with is not the current Process object, then the code attaches to the desired Process object using KeAttachProcess(). The code following this will execute in the context of the attached process.

You can see similar code in the system services that are able to operate on other processes. The system service NtAllocateVirtualMemory enables allocation of memory for a process other than the current one, so you will find this kind of code in the NtAllocateVirtualMemory() function. Other places where you can find this code are NtFreeVirtualMemory() and NtLockVirtualMemory().

THE PRACTICAL APPLICATION OF REVERSE ENGINEERING

Now, let's observe the practical application of the reverse engineering techniques discussed in this chapter. We will show clearly how you can arrive at pseudocode given the raw assembler listing.

XREF: You can study the example we chose in Chapter 10, Adding New Software Interrupts.

In Chapter 10, we discuss the callgate implementation on Windows NT (for running ring 0 code from a ring 3 application). When we decided to design the callgate mechanism, we were in search of some mechanism to allocate selectors, the basic requirement for creating callgates. We knew that a Win32 application does not have a Local Descriptor Table (LDT). Therefore, we wanted to allocate selectors from the Global Descriptor Table (GDT). First, we looked at the symbols of NTOSKRNL by using SoftICE's command SYM *Selector*. We received some entries matching the wildcard pattern *Selector*.
One symbol we found interesting was KeI386AllocateGdtSelectors. We deduced from the name that this function must allocate GDT selectors. Next, we took the export dump of NTOSKRNL to see whether the function is exported. You can conveniently make use of an undocumented function only if the function is exported. If the function is not exported, then you have to deal with hard-coded addresses, which binds the program to a specific version of Windows NT (for example, NT 3.51/4.0/2000, free builds/checked builds/service packs). Luckily, we found that the function was exported.

Our next step was to put a breakpoint on this function. Unfortunately, we found that this breakpoint was never triggered on our configuration, so we decided to reverse engineer the function ourselves. We extracted the Assembly output of the function using the SoftICE history buffer. Here is the raw Assembly code for the function:

    _KeI386AllocateGdtSelectors

    0008:80125D00 PUSH EBP
    0008:80125D01 MOV EBP,ESP
    0008:80125D03 PUSH ESI
    0008:80125D04 MOV SI,[EBP+0C]
    0008:80125D08 PUSH EDI
    0008:80125D09 CMP [_KiNumberFreeSelectors$S10229],SI
    0008:80125D10 JB 80125D5E
    0008:80125D12 MOV ECX,_KiAbiosGdtLock
    0008:80125D17 CALL [ imp_@kfacquirespinlock]
    0008:80125D1D SUB [_KiNumberFreeSelectors$S10229],SI
    0008:80125D24 MOV EDX,[_KiFreeGdtListHead$S10230]
    0008:80125D2A TEST SI,SI
    0008:80125D2D JZ 80125D47
    0008:80125D2F MOV ECX,[EBP+08]
    0008:80125D32 MOV EDI,EDX
    0008:80125D34 SUB DI,[_KiAbiosGdt]
    0008:80125D3B MOV [ECX],DI
    0008:80125D3E ADD ECX,02
    0008:80125D41 DEC SI
    0008:80125D43 MOV EDX,[EDX]
    0008:80125D45 JNZ 80125D32
    0008:80125D47 MOV ECX,_KiAbiosGdtLock
    0008:80125D4C MOV [_KiFreeGdtListHead$S10230],EDX
    0008:80125D52 MOV EDX,EAX
    0008:80125D54 CALL [ imp_@kfreleasespinlock]
    0008:80125D5A XOR EAX,EAX
    0008:80125D5C JMP 80125D
    0008:80125D5E MOV EAX,C ; STATUS_ABIOS_SELECTOR_NOT_AVAILABLE
    0008:80125D63 POP EDI
    0008:80125D64 POP ESI
    0008:80125D65 POP EBP
    0008:80125D66 RET 0008

Looking at the last instruction, RET 8, we concluded that the function follows the _stdcall calling convention and takes two parameters (8 bytes of parameters, 4 bytes each). We next had to decipher what those parameters were. Because the compiler generated the standard stack frame (PUSH EBP, MOV EBP, ESP), clearly EBP+8 referred to the first parameter, and EBP+C referred to the second parameter. The following instruction sequence suggests that the second parameter represents the number of selectors to be allocated:

    0008:80125D03 PUSH ESI

    0008:80125D04 MOV SI,[EBP+0C]
    0008:80125D08 PUSH EDI
    0008:80125D09 CMP [_KiNumberFreeSelectors$S10229],SI
    0008:80125D10 JB 80125D5E
    0008:80125D5E MOV EAX,C ; STATUS_ABIOS_SELECTOR_NOT_AVAILABLE

This code moves the second parameter into the SI register and compares the kernel variable KiNumberFreeSelectors$S10229 with the SI register. If the value of KiNumberFreeSelectors$S10229 is less than the value in the SI register (that is, fewer selectors are free than were requested), then the code jumps to a label and from there fills in the EAX register with the error code STATUS_ABIOS_SELECTOR_NOT_AVAILABLE. Clearly, the second parameter to the function is the number of selectors to allocate.

Next, we looked at the code, assuming some number x of available selectors. We assumed that the JB condition evaluated to false. The next two instructions acquire the GDT lock. Locks are used extensively at various places to prevent multiple threads from accessing a shared kernel data structure at the same time. Most of the time, you can ignore these pieces of code, because they have nothing to do with the actual logic of the function.

    0008:80125D12 MOV ECX,_KiAbiosGdtLock
    0008:80125D17 CALL [ imp_@kfacquirespinlock]

The next instruction decrements the value of the kernel variable _KiNumberFreeSelectors$S10229 by the number of selectors to be allocated.

    0008:80125D1D SUB [_KiNumberFreeSelectors$S10229],SI

Then, the function loads the EDX register with the value of the kernel variable _KiFreeGdtListHead$S10230. Looking at the instruction, you can see that the free selectors are kept in a free list.

    0008:80125D24 MOV EDX,[_KiFreeGdtListHead$S10230]

Next, the function checks to see if the number of selectors to be allocated is zero. In that case, the function jumps to a label where the list head is stored back, the lock is released, and the EAX register is zeroed out to indicate success before the function returns.

    0008:80125D2A TEST SI,SI
    0008:80125D2D JZ 80125D47
    ...

    0008:80125D47 MOV ECX,_KiAbiosGdtLock
    0008:80125D4C MOV [_KiFreeGdtListHead$S10230],EDX
    0008:80125D52 MOV EDX,EAX
    0008:80125D54 CALL [ imp_@kfreleasespinlock]
    0008:80125D5A XOR EAX,EAX
    0008:80125D5C JMP 80125D
    0008:80125D5E MOV EAX,C ; STATUS_ABIOS_SELECTOR_NOT_AVAILABLE
    0008:80125D63 POP EDI
    0008:80125D64 POP ESI
    0008:80125D65 POP EBP
    0008:80125D66 RET 0008

Now, let's see what happens when the number of selectors to be allocated is nonzero:

    0008:80125D2F MOV ECX,[EBP+08]
    0008:80125D32 MOV EDI,EDX
    0008:80125D34 SUB DI,[_KiAbiosGdt]
    0008:80125D3B MOV [ECX],DI
    0008:80125D3E ADD ECX,02
    0008:80125D41 DEC SI
    0008:80125D43 MOV EDX,[EDX]
    0008:80125D45 JNZ 80125D32

The code fills the ECX register with the first parameter. Then, it loads the EDI register with the value of the EDX register (_KiFreeGdtListHead$S10230). Next, it subtracts the value of the kernel variable KiAbiosGdt. We found that the value of KiAbiosGdt matched the base address of the Global Descriptor Table. Because a selector is the byte offset of its descriptor from the base of the GDT, this subtraction leaves the selector value in the DI register. Next, the code copies the selector value to the location pointed to by the ECX register. The code then adds 2 to the ECX register. From this, we deduced that the first parameter points to a buffer that receives the allocated selector values, with each entry consisting of 2 bytes. Therefore, the first parameter must be an array of short integers. The code reaches the next free selector using the instruction:

    MOV EDX,[EDX]

From this, we can see that the free selectors are maintained in a linked list, and the descriptors themselves are used for keeping track of the next free selector in the list. The SI register is decremented each time through the loop. Initially, the SI register contains the number of selectors to be allocated. In the end, the SI register reaches 0. At this point, the buffer pointed to by the first parameter contains the list of selectors allocated. Now, we'll write the pseudocode for the function:

    NTSTATUS _stdcall KeI386AllocateGdtSelectors(
        unsigned short *SelectorArray,
        unsigned short nSelectors)
    {
        register int i=0;
        register int *DescriptorEntry;

        if (_KiNumberFreeSelectors$S10229<nSelectors) {
            return STATUS_ABIOS_SELECTOR_NOT_AVAILABLE;
        }
        KfAcquireSpinLock(_KiAbiosGdtLock);
        _KiNumberFreeSelectors$S10229-=nSelectors;
        /* Load the list head before the zero check, as the Assembly code does */
        DescriptorEntry=_KiFreeGdtListHead$S10230;
        if (nSelectors==0) {
            goto CommonExit;
        }
        while (nSelectors!=0) {
            /* Selector = byte offset of the descriptor from the GDT base */
            SelectorArray[i]=DescriptorEntry-KiAbiosGdt;
            i++;
            nSelectors--;
            DescriptorEntry=(int *)*DescriptorEntry;
        }
    CommonExit:
        /* Store the new list head (MOV [_KiFreeGdtListHead$S10230],EDX) */
        _KiFreeGdtListHead$S10230=DescriptorEntry;
        KfReleaseSpinLock(_KiAbiosGdtLock);
        return 0;
    }

SUMMARY

In this chapter, we described how to use the symbolic information supplied with Windows NT using SoftICE. We also discussed some general techniques used for reverse engineering, such as how to understand the compiler's code generation patterns. Next, we showed how Windows NT can assist in reverse engineering when you enable some debugging flags in the kernel. We also discussed various ways of deciphering the parameters of undocumented functions. Next, we reviewed some typical Assembly language patterns found throughout the Windows NT kernel code. The chapter concluded with an example showing the deciphering of an undocumented function called KeI386AllocateGdtSelectors from NTOSKRNL.EXE.
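As a closing illustration of the KeI386AllocateGdtSelectors analysis, the recovered allocation logic can be exercised outside the kernel. The following is a minimal user-mode sketch in portable C, under several assumptions: the GdtModel structure, the table size, and the function names are hypothetical, the spin lock is left out, and an int return value stands in for the NTSTATUS codes. It keeps the two properties deduced from the Assembly code: free descriptors are chained through their own storage, and a selector is the byte offset of its descriptor from the base of the table.

```c
#include <stddef.h>
#include <stdint.h>

#define GDT_ENTRIES 16   /* hypothetical table size for the model */

/* One 8-byte descriptor slot. While a slot is free, its storage is reused
 * to hold the link to the next free slot, as MOV EDX,[EDX] implies. */
typedef union Descriptor {
    uint64_t          bits;
    union Descriptor *next;
} Descriptor;

typedef struct {
    Descriptor  gdt[GDT_ENTRIES];
    Descriptor *free_head;    /* models the free list head variable */
    uint16_t    free_count;   /* models the free selector counter */
} GdtModel;

/* Put slots first_free..GDT_ENTRIES-1 on the free list, lowest slot first. */
void gdt_init(GdtModel *g, unsigned first_free)
{
    unsigned i;
    g->free_head = NULL;
    g->free_count = 0;
    for (i = GDT_ENTRIES; i > first_free; i--) {
        g->gdt[i - 1].next = g->free_head;
        g->free_head = &g->gdt[i - 1];
        g->free_count++;
    }
}

/* Returns 0 on success, -1 if fewer than n selectors are free. */
int gdt_alloc_selectors(GdtModel *g, uint16_t *selectors, uint16_t n)
{
    Descriptor *d;
    uint16_t i;

    if (g->free_count < n)
        return -1;                       /* the JB branch in the listing */

    g->free_count = (uint16_t)(g->free_count - n);
    d = g->free_head;
    for (i = 0; i < n; i++) {
        /* selector = byte offset of the descriptor from the table base */
        selectors[i] = (uint16_t)((unsigned char *)d - (unsigned char *)g->gdt);
        d = d->next;                     /* follow the free list */
    }
    g->free_head = d;                    /* store the new list head */
    return 0;
}
```

With slots 3 through 15 free, a request for two selectors yields 0x18 and 0x20, the byte offsets of slots 3 and 4 in a table of 8-byte entries. A request for zero selectors succeeds and leaves the list untouched, matching the JZ path in the listing.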

Chapter 6 Hooking Windows NT System Services

Abstract

This chapter explores system services under DOS, Windows 3.x, Windows 95/98, and Windows NT. The authors discuss the need for hooking these system services.

THIS CHAPTER DISCUSSES hooking Windows NT system services. Before we begin, let's first review what we mean by a system service. A system service refers to a set of functions (primitive or elaborate) provided by the operating system. Application programming interfaces (APIs) enable developers to call several system services, directly or indirectly. The operating system provides APIs in the form of a dynamic link library (DLL) or a static compiler library. These APIs are often based on system services provided by the operating system. Some of the API calls are directly based on a corresponding system service, and some depend on making multiple system service calls. Also, some of the API calls may not make any calls to system services. In short, there need not be a one-to-one mapping between API functions and system services. Figure 6-1 demonstrates this in the context of Windows NT.

Figure 6-1: Mappings between API functions and system services

SYSTEM SERVICES: THE LONG VIEW

System services and the APIs calling these system services have come a long way from DOS to Windows NT.

System Services under DOS

Under DOS, system services comprise part of the MS-DOS kernel (including MSDOS.SYS and IO.SYS). These system services are available to users in the form of Interrupt Service Routines (ISRs). ISRs can be invoked by calling the appropriate interrupt handlers using the

98 INT instruction. API functions, provided by compiler libraries, call the interrupt handler for system services (the INT 21h interrupt). For example, to open a file, MS-DOS provides a system service for which you have to specify the function number 0x3D in the AH register, attribute mask in the CL register, filename in the DS:DX register, as well as issue the INT 21h instruction. Compilers typically provide wrappers around this and provide a nice API function for this purpose. System Services under Windows 3.x and Windows 95/98 Under Windows 3.x or Windows 95/98, the core system services take the form of VXDs and DLLs and some real-mode DOS code. The APIs are provided in the form of dynamic link libraries. These dynamic link libraries call the system services to implement the APIs. For example, to open a file, applications call an API function from KERNEL32.DLL such as OpenFile() or CreateFile(). These APIs, in turn, call a system service. System Services under Windows NT Under Windows NT, the NT executive (part of NTOSKRNL.EXE) provides core system services. These services are rather generic and primitive. Various APIs such as Win32, OS/2, and POSIX are provided in the form of DLLs. These APIs, in turn, call services provided by the NT executive. The name of the API function to call differs for users calling from different subsystems even though the same system service is invoked. For example, to open a file from the Win32 API, applications call CreateFile() and to open a file from the POSIX API, applications call the open() function. Both of these applications ultimately call the NtCreateFile() system service from the NT executive. Note: Under Windows NT 3.51, the system services are provided by a kernel-mode component called NTOSKRNL.EXE. Most of the KERNEL32.DLL calls such as those related to memory management and kernel objects management are handled by these system services. 
The USER32 and GDI32 calls are handled by a separate subsystem process called CSRSS. Starting with Windows NT 4.0, Microsoft moved most of the functionality of CSRSS into a kernel-mode driver called WIN32K.SYS. The functionality moved into WIN32K.SYS is made available to applications in the form of system services. These system services are not truly part of the native system services, since they are specific to the user interface and are not used by all subsystems. This chapter and the next focus only on the system services provided by NTOSKRNL.EXE.

NEED FOR HOOKING SYSTEM SERVICES

Hooking is a very common mechanism for intercepting a particular section of executing code, and it provides a useful way of modifying the behavior of the operating system. Developers are often more concerned with how to hook a system service or an API call than with why to hook it. Nevertheless, the following sections examine the situations in which the need to hook a system service arises and explain how hooking can help the developer.

Trapping Events at Occurrence

Developers trap events such as the creation of a file (CreateFile()), the creation of a mutex (CreateMutex()), or Registry accesses (RegCreateKey()) for specific purposes. Synchronously hooking the API or system service call related to a particular event lets you trap that event as it occurs. Applications that monitor the system will find this kind of hooking invaluable. The hooks act like interrupts triggered by the occurrence of these events: a developer can write a routine that handles each occurrence and takes appropriate action.

Modifying System Behavior to Suit User Needs

Diverting the normal flow of control by introducing hooks can modify operating system behavior. At the time of the hook, the developer can change data structures and context enough to induce new behavior. For example, you can protect a sensitive file from being opened by hooking the NtCreateFile() system service. Although NTFS provides user-level security for files, this security is not available on FAT partitions. You should ensure that hooking does not have any undesirable side effects on the operating system. Protecting Registry keys from modification is also easy once you hook the Registry system services. This has several applications, since little protection is provided for Registry settings created by applications.

Studying the Behavior of the System

Studying the behavior of the system to get a better idea of the internal workings of the operating system is something most debuggers and system hackers will relate to. Understanding undocumented operating system functionality requires a lot of hacking, which goes hand in hand with hooking.

Debugging

Complex programs can use system-service hooking to debug the stickiest problems. For example, a few days ago, we had a problem with the installation of a piece of software: we had difficulty creating folders and shortcuts for the application. Using a systemwide hook, we quickly figured out that the installation program was looking for a Registry value that indicated where to install the folders (which happened to be the Start menu). We hooked the NtQueryValueKey() call and obtained the value the installation program was looking for. We created that value and solved our problem.

Getting Performance Data for Specific Tasks and Generating Statistics

Hooking can prove very useful to those writing benchmarks and applications that critically measure system performance under specific conditions. Even measuring the frequency of certain system services becomes very easy with this type of hooking. Measuring file system performance by hooking the file system-related system services exemplifies this procedure.

Life without hooking is unthinkable for most Windows developers in today's Microsoft-dominated world of operating systems. Windows NT system services lie at the center of the NT universe, and having the ability to hook them can prove extremely handy.
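The idea of measuring how often a service is called can be tried out entirely in user mode. The following is a minimal sketch, not NT code: it assumes a toy dispatch table of function pointers, and every name in it (service_table, svc_open, counting_open, and so on) is hypothetical. The hook saves the original pointer, replaces the table entry, counts each call, and forwards to the original so callers see unchanged behavior.

```c
#include <assert.h>

/* Hypothetical service signature: every "service" takes one int argument. */
typedef int (*service_fn)(int);

enum { SVC_OPEN = 0, SVC_CLOSE = 1, SVC_COUNT = 2 };

int svc_open(int arg)  { return arg + 100; }   /* stand-in for real work */
int svc_close(int arg) { return arg - 100; }

/* Toy dispatch table: callers reach a service only through this array. */
service_fn service_table[SVC_COUNT] = { svc_open, svc_close };

service_fn original_open;    /* saved pointer, needed to chain and to unhook */
unsigned long open_calls;    /* statistic gathered by the hook */

/* Hook: record the call, then forward to the original service unchanged. */
int counting_open(int arg)
{
    open_calls++;
    return original_open(arg);
}

void install_counting_hook(void)
{
    original_open = service_table[SVC_OPEN];
    service_table[SVC_OPEN] = counting_open;
}

void remove_counting_hook(void)
{
    service_table[SVC_OPEN] = original_open;
}

/* Callers never name a service function directly; they index the table. */
int call_service(int id, int arg)
{
    assert(id >= 0 && id < SVC_COUNT);
    return service_table[id](arg);
}
```

Because callers go through call_service(), they need no changes when the hook is installed or removed. The same save, replace, and forward pattern underlies the kernel-mode hook developed later in this chapter, where the table is owned by the operating system rather than by the program.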

TYPES OF HOOKS

The following sections explore two types of hooking.

Kernel-Level Hooking

You can achieve kernel-level hooking by writing a VXD or device driver. In this method, essential functions provided by the kernel are hooked. The advantage of this type of hooking is that you get one central place from which you can monitor the events occurring as a result of a user-mode call or a kernel-mode call. The disadvantage is that you need to decipher the parameters of the call passed from kernel mode, since these services are often undocumented. Also, the data passed to the kernel-mode call might differ from the data passed in the user-mode call, and a single user-level API call might be implemented using multiple calls to the kernel, which makes hooking far more difficult. In general, this type of hooking is more difficult to achieve, but it can produce more rewarding results.

User-Level Hooking

You can perform this type of hooking with some help from a VXD or device driver. In this method, the functions provided by the user-mode DLLs are hooked. The advantage of this method is that these functions are usually well documented, so you know the parameters to expect, which makes it easy to write the hook function. The disadvantage is that this type of hooking limits your field of vision to user mode only and does not extend to kernel mode.

IMPLEMENTATIONS OF HOOKS

The following sections detail the implementation of hooks under various Microsoft platforms.

DOS

In the DOS world, system services are implemented as an interrupt handler routine (INT 21h). The compiler library routines typically call this interrupt handler to provide an API function to the programmer. It is trivial to hook this handler using the GetVect (INT 21h, AH=35h) and SetVect (INT 21h, AH=25h) services. Hence, hooking system services is fairly straightforward. DOS does not have separate user and kernel modes.

Windows 3.x

In the Windows 3.x world, system services are implemented in DLLs. The compiler library routines are stubs that jump to the DLL code (this is called dynamic linking). Also, because the address space is common to all applications, hooking amounts to getting the address of the particular system service and changing a few bytes at that address. Changing these bytes sometimes requires simple aliasing of selectors.

XREF: Refer to the MSDN article in Microsoft Systems Journal (Vol. 9, No. 1) entitled "Hook and Monitor Any 16-bit Windows(tm) Function With Our ProcHook DLL," by James Finnegan.

Windows 95 and 98

In the Windows 95/98 world, system services are implemented in DLLs, as in Windows 3.1. However, under Windows 95/98, all 32-bit applications run in separate address spaces. Because of this, you cannot easily hook an unshared DLL. It is fairly easy to hook a shared DLL such as KERNEL32.DLL: you modify a few code bytes at the start of the system service you want to hook and write your hook function in a DLL that is loaded in shared memory. Modifying the code bytes may involve writing a VXD, because KERNEL32.DLL is loaded in the upper 2GB of the address space and is protected by the operating system.

Windows NT

In the Windows NT world, system services are implemented in the kernel component of NT (NTOSKRNL.EXE). The APIs supported by the various subsystems (Win32, OS/2, and POSIX) are implemented using these system services. There is no documented way of hooking these system services from kernel mode. There are several documented ways of hooking user-level API calls.

XREF: Refer to the MSDN articles in Microsoft Systems Journal entitled "Learn System-Level Win32(r) Coding Techniques by Writing an API Spy Program," by Matt Pietrek (Vol. 9, No. 12), and "Load Your 32-bit DLL into Another Process's Address Space Using INJLIB," by Jeffrey Richter (Vol. 9, No. 5). Refer to CyberSensor on

In this chapter, we present one way of hooking NT system services in kernel mode. We also provide the code for this on the CD-ROM accompanying this book.

WINDOWS NT SYSTEM SERVICES

Windows NT was designed with several goals in mind. Support for multiple (popular) APIs, extensibility, isolation of the various APIs from each other, and security are some of the most important ones. The present design incorporates several protected subsystems (for example, the Win32 subsystem, the POSIX subsystem, and others) that reside in user space, isolated from each other. The NT executive runs in kernel mode and provides native support to all the subsystems. All subsystems use the NT system services provided by the NT executive to implement most of their core functionality.

Windows programmers who link with the KERNEL32, USER32, and GDI32 DLLs are completely unaware of the existence of the NT system services supporting the various Win32 calls they make. Similarly, POSIX clients using the POSIX API end up using more or less the same set of NT system services to get what they want from the kernel. Thus, NT system services represent the fundamental interface for any user-mode application or subsystem to the kernel.

For example, when a Win32 application calls CreateProcess() or a POSIX application calls fork(), both ultimately call the NtCreateProcess() system service in the NT executive.

NT system services are routines that run entirely in kernel mode. For those familiar with the Unix world, NT system services can be considered the equivalent of system calls in Unix.

Figure 6-2: A caller program invoking an NT system service

Currently, Windows NT system services are not completely documented. The only place where you can find some documentation of the NT system services is the Windows NT DDK CD-ROM from Microsoft. The DDK discusses about 25 different system services and covers the parameters passed to them in some detail. You'll see from Appendix A that this is only the tip of the iceberg: 0xC4 different system services exist in Windows NT 3.51, 0xD3 in Windows NT 4.0, and 0xF4 in Windows 2000 Beta-2. We deciphered the parameters of 90% of the system services. Prototypes for all these system services can be found in UNDOCNT.H on the CD-ROM included with this book. We also provide detailed documentation of some of the system services in Appendix A.

In the following section, you will learn how to hook these system services.

HOOKING NT SYSTEM SERVICES

Let's first look at how NT system services are implemented in the Windows NT operating system. We will then discuss the exact mechanics of hooking an NT system service. In addition, we'll explore the kernel data structures involved and provide sample code to aid hooking of system services.

On the CD: Check out hookdrv.c on the accompanying CD-ROM.

Implementation of a System Service in Windows NT

The user-mode interface to the system services of NTOSKRNL is provided in the form of wrapper functions in a DLL called NTDLL.DLL. These wrappers use the INT 2Eh instruction to switch to kernel mode and execute the requested system service. The Win32 API functions (mainly in KERNEL32.DLL and ADVAPI32.DLL) use these wrappers to call a system service. A Win32 API function validates the parameters passed to it and translates everything to Unicode. It then calls the wrapper function in NTDLL that corresponds to the required service.

Each system service in NTOSKRNL is identified by a service ID. The wrapper function in NTDLL loads the service ID of the requested system service into the EAX register, loads a pointer to the stack frame of the parameters into the EDX register, and issues the INT 2Eh instruction. This instruction switches the processor to kernel mode, and the processor starts executing the handler specified for INT 2Eh in the Interrupt Descriptor Table (IDT). The Windows NT executive sets up this handler. The INT 2Eh handler copies the parameters from the user-mode stack to the kernel-mode stack. The base of the stack frame is identified by the contents of the EDX register. The INT 2Eh handler provided by the NT executive is named KiSystemService() internally.
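The wrapper layout can be made concrete with a small user-mode sketch. It assumes that an x86 wrapper begins with a "mov eax, imm32" instruction (opcode 0xB8 followed by the 32-bit service ID), which is consistent with the SYSTEMSERVICE macro in this chapter's sample reading the ID at byte offset 1 of a stub. The byte sequence, the service ID value 0x17, and the function name stub_service_id are illustrative assumptions and were not dumped from a real NTDLL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stub bytes (an assumption, not taken from a real NTDLL):
 *   B8 17 00 00 00    mov eax, 0x17      ; service ID
 *   8D 54 24 04       lea edx, [esp+4]   ; pointer to the parameters
 *   CD 2E             int 0x2E           ; enter kernel mode
 *   C2 2C 00          ret 0x2C           ; pop 44 parameter bytes
 */
const uint8_t sample_stub[] = {
    0xB8, 0x17, 0x00, 0x00, 0x00,
    0x8D, 0x54, 0x24, 0x04,
    0xCD, 0x2E,
    0xC2, 0x2C, 0x00
};

/* Returns 1 and stores the ID if the stub starts with "mov eax, imm32";
 * otherwise returns 0 and leaves *service_id untouched. */
int stub_service_id(const uint8_t *stub, size_t len, uint32_t *service_id)
{
    assert(stub != NULL && service_id != NULL);
    if (len < 5 || stub[0] != 0xB8)
        return 0;
    /* The immediate is little-endian; assemble it byte by byte so the
     * result does not depend on host endianness or alignment. */
    *service_id = (uint32_t)stub[1]
                | ((uint32_t)stub[2] << 8)
                | ((uint32_t)stub[3] << 16)
                | ((uint32_t)stub[4] << 24);
    return 1;
}
```

Checking the opcode before trusting the immediate is a deliberate choice: a stub with a different prologue is reported as unrecognized instead of yielding a bogus service ID.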
During its initialization, NTOSKRNL creates a function table, hereafter referred to as the System Service Dispatch Table (SSDT), for the different services it provides (see Figure 6-3). Each entry in the table contains the address of the function to be executed for a given service ID. The INT 2Eh handler indexes this table with the service ID passed in the EAX register and calls the corresponding system service. The code for each function resides in the kernel. Similarly, another table called the ParamTable (hereafter referred to as the System Service Parameter Table [SSPT]) tells the handler the number of parameter bytes to expect for a particular service.
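The interplay of the two tables can be modelled in a few lines of ordinary C. This is a user-mode sketch under stated assumptions, not the kernel's code: dispatch_table plays the role of the SSDT, param_bytes plays the role of the SSPT, and dispatch() stands in for the handler. All names, the two sample services, and the status values are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define STATUS_OK               0
#define STATUS_INVALID_SERVICE  (-1)   /* stand-in for an NTSTATUS error code */

/* In this model every service receives a pointer to its copied parameters. */
typedef int32_t (*service_fn)(const int32_t *args);

int32_t svc_zero(const int32_t *args) { (void)args; return 7; }
int32_t svc_add(const int32_t *args)  { return args[0] + args[1]; }

/* Dispatch table (SSDT analogue): one function pointer per service ID. */
const service_fn dispatch_table[] = { svc_zero, svc_add };
/* Parameter table (SSPT analogue): parameter bytes expected per service. */
const uint8_t param_bytes[] = { 0, 8 };
enum { NUM_SERVICES = sizeof dispatch_table / sizeof dispatch_table[0] };

/* Handler analogue: validate the ID, copy the parameters into a private
 * "kernel" buffer, then call through the table. */
int32_t dispatch(uint32_t service_id, const void *user_stack, int32_t *result)
{
    int32_t kernel_stack[16] = { 0 };

    assert(result != NULL);
    if (service_id >= (uint32_t)NUM_SERVICES)
        return STATUS_INVALID_SERVICE;
    if (param_bytes[service_id] > sizeof kernel_stack)
        return STATUS_INVALID_SERVICE;
    if (param_bytes[service_id] != 0)
        memcpy(kernel_stack, user_stack, param_bytes[service_id]);

    *result = dispatch_table[service_id](kernel_stack);
    return STATUS_OK;
}
```

Copying the parameters before the call means the service works on a private snapshot, which mirrors the copy from the user-mode stack to the kernel-mode stack described above. The parameter table is what makes that copy possible without the handler knowing any service's prototype.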

Figure 6-3: System Service Dispatch Table and Parameter Table

Hooking NT System Services

The easiest way to hook the system services is to locate the System Service Dispatch Table used by the operating system and change the function pointers to point to functions supplied by the developer. You can do this only from a kernel-mode device driver, because the operating system protects this table at the page table level: the page attributes are set so that only kernel-mode components can read from and write to the table. User-level applications cannot read or write these memory locations.

LOCATING THE SYSTEM SERVICE DISPATCH TABLE IN THE NTOSKRNL

There is one undocumented entry in the export list of NTOSKRNL called KeServiceDescriptorTable. This entry is the key to accessing the System Service Dispatch Table. The structure of this entry looks like this:

typedef struct ServiceDescriptorTable {
    PVOID ServiceTableBase;
    PVOID ServiceCounterTable;      /* used only in checked builds */
    unsigned int NumberOfServices;
    PVOID ParamTableBase;
} ServiceDescriptorTable;

where

ServiceTableBase     Base address of the System Service Dispatch Table.
NumberOfServices     Number of services described by ServiceTableBase.
ServiceCounterTable  This field is used only in checked builds of the operating system and contains a counter of how many times each service in the SSDT is called. The counter is updated by the INT 2Eh handler (KiSystemService).
ParamTableBase       Base address of the table containing the number of parameter bytes for each of the system services.

The tables at ServiceTableBase and ParamTableBase each contain NumberOfServices entries. Each ServiceTableBase entry is a pointer to the function implementing the corresponding system service; each ParamTableBase entry holds the number of parameter bytes for that service.

The following program provides an example of hooking system services under Windows NT. The sample hooks the NtCreateFile() system service and prints the name of the file being created whenever the hook is invoked. We encourage you to insert code for hooking any other system service of your choice. Note the proper places for inserting new hooks in the following code.

Here are the steps to try out the sample (assuming that the sample binaries are copied into the C:\SAMPLES directory):

1. Run instdrv hooksys c:\samples\hooksys.sys. This installs the hooksys.sys driver. The driver hooks the NtCreateFile system service.
2. Try to access the files on your hard disk. For each accessed file, hooksys.sys traps the call and displays the name of the file in the debugger window. These messages can be seen in SoftICE or with a debug message-capturing tool.

#include "ntddk.h"
#include "stdarg.h"
#include "stdio.h"
#include "hooksys.h"

#define DRIVER_SOURCE
#include "..\..\include\wintype.h"
#include "..\..\include\undocnt.h"

/* Pointer type matching the prototype of NtCreateFile */
typedef NTSTATUS (*NTCREATEFILE)(
    PHANDLE FileHandle,
    ACCESS_MASK DesiredAccess,
    POBJECT_ATTRIBUTES ObjectAttributes,
    PIO_STATUS_BLOCK IoStatusBlock,
    PLARGE_INTEGER AllocationSize OPTIONAL,
    ULONG FileAttributes,
    ULONG ShareAccess,
    ULONG CreateDisposition,
    ULONG CreateOptions,
    PVOID EaBuffer OPTIONAL,
    ULONG EaLength
    );

/* Yields the SSDT slot for a service. The service ID is read as the
 * ULONG stored at byte offset 1 of the corresponding Zw function. */
#define SYSTEMSERVICE(_function) \
    KeServiceDescriptorTable.ServiceTableBase[*(PULONG)((PUCHAR)_function+1)]

NTCREATEFILE OldNtCreateFile;

NTSTATUS NewNtCreateFile(
    PHANDLE FileHandle,
    ACCESS_MASK DesiredAccess,
    POBJECT_ATTRIBUTES ObjectAttributes,
    PIO_STATUS_BLOCK IoStatusBlock,
    PLARGE_INTEGER AllocationSize OPTIONAL,
    ULONG FileAttributes,
    ULONG ShareAccess,
    ULONG CreateDisposition,
    ULONG CreateOptions,
    PVOID EaBuffer OPTIONAL,
    ULONG EaLength)
{
    int rc;
    char ParentDirectory[1024];
    PUNICODE_STRING Parent=NULL;

    ParentDirectory[0]='\0';
    if (ObjectAttributes->RootDirectory!=0) {
        /* The name is relative to a directory handle: resolve the
         * handle to an object and query the object's name. */
        PVOID Object;

        Parent=(PUNICODE_STRING)ParentDirectory;
        rc=ObReferenceObjectByHandle(ObjectAttributes->RootDirectory,
                                     0, 0, KernelMode, &Object, NULL);
        if (rc==STATUS_SUCCESS) {
            extern NTSTATUS ObQueryNameString(void *, void *, int size, int *);
            int BytesReturned;

            rc=ObQueryNameString(Object, ParentDirectory,
                                 sizeof(ParentDirectory), &BytesReturned);
            ObDereferenceObject(Object);
            if (rc!=STATUS_SUCCESS)
                RtlInitUnicodeString(Parent, L"Unknown\\");
        } else {
            RtlInitUnicodeString(Parent, L"Unknown\\");
        }
    }

    DbgPrint("NtCreateFile : Filename = %S%S%S\n",
             Parent?Parent->Buffer:L"", Parent?L"\\":L"",
             ObjectAttributes->ObjectName->Buffer);

    /* Forward the call, unchanged, to the original system service */
    rc=((NTCREATEFILE)(OldNtCreateFile)) (
        FileHandle, DesiredAccess, ObjectAttributes, IoStatusBlock,
        AllocationSize, FileAttributes, ShareAccess, CreateDisposition,
        CreateOptions, EaBuffer, EaLength);
    DbgPrint("NtCreateFile : rc = %x\n", rc);
    return rc;
}

NTSTATUS HookServices()
{
    /* Save the original entry, then swap in the hook with interrupts off */
    OldNtCreateFile=(NTCREATEFILE)(SYSTEMSERVICE(ZwCreateFile));
    _asm cli
    (NTCREATEFILE)(SYSTEMSERVICE(ZwCreateFile))=NewNtCreateFile;
    _asm sti
    return STATUS_SUCCESS;
}

void UnHookServices()
{
    /* Restore the original SSDT entry */
    _asm cli
    (NTCREATEFILE)(SYSTEMSERVICE(ZwCreateFile))=OldNtCreateFile;
    _asm sti
    return;
}

NTSTATUS DriverEntry(
    IN PDRIVER_OBJECT DriverObject,
    IN PUNICODE_STRING RegistryPath
    )
{
    MYDRIVERENTRY(DRIVER_DEVICE_NAME, FILE_DEVICE_HOOKSYS, HookServices());
    return ntstatus;
}

NTSTATUS DriverDispatch(
    IN PDEVICE_OBJECT DeviceObject,
    IN PIRP Irp
    )
{
    Irp->IoStatus.Status = STATUS_SUCCESS;
    IoCompleteRequest (Irp, IO_NO_INCREMENT);
    return Irp->IoStatus.Status;
}

VOID DriverUnload(
    IN PDRIVER_OBJECT DriverObject
    )
{
    WCHAR devicelinkbuffer[] =L"\\DosDevices\\"DRIVER_DEVICE_NAME;
    UNICODE_STRING devicelinkunicodestring;

    /* Remove the hook before the driver's code is unloaded */
    UnHookServices();
    RtlInitUnicodeString (&devicelinkunicodestring, devicelinkbuffer);
    IoDeleteSymbolicLink (&devicelinkunicodestring);
    IoDeleteDevice (DriverObject->DeviceObject);
}

SUMMARY

In this chapter, we explored system services under DOS, Windows 3.x, Windows 95/98, and Windows NT. We discussed the need for hooking these system services and the differences between kernel-level and user-level hooks. We described the data structures used during a system call and the mechanism used for hooking Windows NT system services. The chapter concluded with an example that hooked the NtCreateFile() system service.

Chapter 7 Adding New System Services to the Windows NT Kernel

Abstract

This chapter explores in detail the system service implementation of Windows NT. The authors explain the mechanism for adding new system services to the Windows NT kernel and provide an example that adds three new system services.

CUSTOMIZING THE KERNEL for specific purposes was popular among developers long before Windows NT. Ancient Unix gurus and developers alike practiced the art. In Unix, for example, kernel developers can modify the kernel in several ways, such as adding new device drivers, kernel extensions, system calls, and kernel processes. In Windows NT, the DDK provides the means to add new device drivers. However, one of the most effective ways of modifying the kernel, adding new system services to it, is not documented. This method proves more efficient than adding device drivers for several reasons discussed later in this chapter. Here, we focus on the detailed implementation of a system service inside the Windows NT kernel and explain, with examples, how new system services can be added to the Windows NT kernel.

In Inside Windows NT, Helen Custer mentions the design of system services and the possibility of adding new system services to the kernel:

    Using a system service dispatch table provides an opportunity to make native NT system services extensible. The kernel can support new system services simply by expanding the table without requiring changes to the system or to applications. After a code is written for a new system service, a system administrator could simply run a utility program that dynamically creates a new dispatch table. The new table will contain another entry that points to a new system service.

The capability to add new system services exists in Windows NT, but it is not documented. Very little changed between NT 3.51 and later versions of Windows NT in this area.
The only change is that some of the data structures involved in the implementation of a system service are located at different offsets in the later versions of the operating system. We feel that our method of adding new system services may hold, possibly with very minor modifications, in future releases of Windows NT. At the end of this chapter, we try to shed some light on the thinking that may have gone into the design of this portion of the operating system.

DETAILED IMPLEMENTATION OF A SYSTEM SERVICE IN WINDOWS NT

In Chapter 6, we discussed how NTDLL.DLL invokes a system service at the request of the application. The SSDT (System Service Dispatch Table) and SSPT (System Service Parameter Table) help the kernel locate the right system service for a given service ID. The implementation of the SSDT and SSPT is similar in all versions of Windows NT to date. We present the two implementations separately for clarity: one for Windows NT 3.51 and one for the later versions of the operating system such as Windows NT 4.0 and Windows 2000. The table below contains the service ID mappings for all versions of Windows NT to date.

TABLE 7-1 SERVICE ID MAPPINGS

Windows NT 3.51
    KERNEL32 and ADVAPI32 calls: Mapped to service IDs 0x0 through 0xC3 inside NTOSKRNL.
    USER32 and GDI32 calls: Processed by the Win32 subsystem, a user-mode process. No system services are provided in the kernel for handling these directly. These calls reach the Win32 subsystem through the kernel's LPC system services.

Windows NT 4.0 (up to Service Pack 5)
    KERNEL32 and ADVAPI32 calls: Mapped to service IDs 0x0 through 0xD2 inside NTOSKRNL.
    USER32 and GDI32 calls: Mapped to service IDs 0x1000 through 0x120A inside WIN32K.SYS. The kernel-mode driver WIN32K.SYS takes over the functionality of the Win32 subsystem and supports these services.

Windows 2000 (beta-2)
    KERNEL32 and ADVAPI32 calls: Mapped to service IDs 0x0 through 0xF3 inside NTOSKRNL.
    USER32 and GDI32 calls: Mapped to service IDs 0x1000 through 0x1285 inside WIN32K.SYS. The kernel-mode driver WIN32K.SYS takes over the functionality of the Win32 subsystem and supports these services.

In Windows NT 3.51, only the KERNEL32 and ADVAPI32 functions of the operating system route through NTDLL.DLL to NTOSKRNL. The USER32 and GDI32 functions are implemented as part of the Win32 subsystem process (CSRSS). USER32.DLL and GDI32.DLL provide wrappers that call the CSRSS process using the local procedure call (LPC) facility.

The functionality of USER32.DLL and GDI32.DLL is implemented differently in Windows NT 4.0 and Windows 2000. The functionality of the USER32 and GDI32 components is moved into the kernel-mode driver WIN32K.SYS. The workhorse routines of NT 3.51's Win32 subsystem have handed their load over to the system services added by the WIN32K.SYS component. This explains why we see more system services in versions later than Windows NT 3.51. This new set of system services corresponds to the USER32 and GDI32 components of the operating system.

Figure 7-1: System service tables

Windows NT System Service Implementation

Here, we discuss the implementation of a system service under Windows NT. System services are invoked through the INT 2Eh instruction. The INT 2Eh handler is named KiSystemService internally; hereafter we refer to it as the handler. Before the handler is entered, the EAX register is loaded with the service ID and the EDX register with a pointer to the stack frame that holds the parameters of the service. The handler gets to the current TEB (Thread Environment Block) by looking at the Processor Control Region (PCR). The current TEB is stored at an offset of 0x124 in the Processor Control Region. The handler gets the address of the System Service Descriptor Table from the TEB. You can locate the address of the Service Descriptor Table at 0x124 offset in the TEB. Chapter 6 explains the format of the Service Descriptor Table. The handler refers to the first entry in the Service Descriptor Table for service IDs less than 0x1000 and to the second entry for service IDs greater than or equal to 0x1000. The handler then checks the validity of the service ID. If the service ID is valid, the handler extracts the addresses of the SSDT and SSPT.
The handler copies the number of bytes that the SSPT specifies for the service (the total size of the parameter list) from the user-mode stack to the kernel-mode stack and then calls the function that the SSDT points to for that service.
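The choice between the first and second descriptor entry can be sketched as follows. This is a simplified user-mode model, not the handler's actual code: the structure keeps only the service count, the function name resolve_service is hypothetical, and using the low 12 bits of the service ID as the index into the selected table is an assumption made for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified descriptor entry: only the field needed for range checks. */
typedef struct {
    uint32_t number_of_services;   /* 0 means the entry is unused */
} table_entry;

/* Two entries are modelled: [0] for NTOSKRNL, [1] for WIN32K.SYS. */
typedef struct {
    table_entry entry[2];
} descriptor_table;

/* Splits a service ID into (table, index). Returns 1 if the ID is valid
 * for the given descriptor table, 0 otherwise. */
int resolve_service(const descriptor_table *dt, uint32_t service_id,
                    unsigned *table, uint32_t *index)
{
    unsigned t;
    uint32_t i;

    assert(dt != NULL && table != NULL && index != NULL);
    if (service_id >= 0x2000)            /* beyond the two modelled tables */
        return 0;
    t = (service_id >= 0x1000) ? 1u : 0u;
    i = service_id & 0x0FFFu;            /* assumption: low 12 bits index the table */
    if (i >= dt->entry[t].number_of_services)
        return 0;
    *table = t;
    *index = i;
    return 1;
}
```

With the NT 4.0 counts from Table 7-1 (0xD3 services in the first table and 0x20B in the second), ID 0xD2 resolves to the last slot of the first table and ID 0x120A to the last slot of the second. A descriptor table whose second entry is empty rejects every ID at or above 0x1000.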

Initially, when any thread is started, the TEB contains a pointer to the Service Descriptor Table identified internally as KeServiceDescriptorTable. KeServiceDescriptorTable contains four entries. Only the first entry in this table is used; it describes the service IDs for some of the KERNEL32 and ADVAPI32 calls. Another Service Descriptor Table, internally named KeServiceDescriptorTableShadow, is identical to KeServiceDescriptorTable under NT 3.51. Under later versions of the operating system, however, the second entry in the shadow table is not NULL. It points to another SSDT and SSPT, which are part of the WIN32K.SYS driver. The WIN32K.SYS driver creates this entry during its initialization (in its DriverEntry routine) by calling the function KeAddSystemServiceTable. (We provide more information on this later in this chapter.) This second entry describes the services exported by WIN32K.SYS for the USER32 and GDI32 modules.

You should note that in all versions of Windows NT, KeServiceDescriptorTable has only one entry in use and that all newly started threads point their TEBs to KeServiceDescriptorTable. This continues as long as the threads call only services belonging to the first entry in KeServiceDescriptorTable. When a thread calls a service above this limit (unlikely in 3.51, but very likely in later versions of Windows NT, because USER and GDI service IDs start at 0x1000), KiSystemService jumps to a label named _KiEndUnexpectedRange under NT 3.51, _KiErrorMode under NT 4.0, and KiBBTEndUnexpectedRange under Windows 2000. Let's see what role the code at each label plays.
_KiEndUnexpectedRange (NT 3.51)

The following pseudocode shows the role of the code at the _KiEndUnexpectedRange label:

if (serviceid < 0x1000) {
    /* It means if service id > 0xC3 and
     * service id < 0x1000 */
    return STATUS_INVALID_SYSTEM_SERVICE;
}
if (PsConvertToGuiThread()!= STATUS_SUCCESS) {
    return STATUS_INVALID_SYSTEM_SERVICE;
}

PsConvertToGuiThread()
{
    if (PspW32ProcessCallout) {
        /* In case of NT 3.51 this code is never
         * invoked, since PspW32ProcessCallout is
         * always = 0 */
        /* This is only invoked for the later versions of the operating system.
         * Please refer to the next section for details */
    } else {
        return STATUS_ACCESS_DENIED;
    }
}

_KiErrorMode (NT 4.0) and KiBBTEndUnexpectedRange (Windows 2000)

The code resembles _KiEndUnexpectedRange, except that the PspW32ProcessCallout variable is always nonzero. Hence, the code in PsConvertToGuiThread proceeds further. It performs several tasks; we describe only the one of immediate interest. PsConvertToGuiThread allocates a block of memory and copies KeServiceDescriptorTableShadow to the allocated block. Note that under NT 4.0 and Windows 2000, KeServiceDescriptorTableShadow contains two entries: one for KERNEL32 calls and one for USER32 and GDI32 calls. After copying the table, the code updates the TEB of the current thread to point to this copy of KeServiceDescriptorTableShadow and then returns. This happens only the first time a USER32 or GDI32 service is invoked. After this, all system services, including those of the KERNEL32 module, route through the new table, since the first entry in this table already points to the SSDT and SSPT for the KERNEL32 functions.

KeServiceDescriptorTableShadow is not exported by NTOSKRNL and therefore is not directly accessible. Under Windows NT 3.51, both KeServiceDescriptorTable and the Shadow Table point to the same SSDT and SSPT and contain only one entry. Now, ask yourself this logical question: Why do we have the Shadow Table at all when apparently it does not provide much help in NT 3.51? We attempt to answer this question later in the chapter.

Note: Once a thread makes a USER32/GDI32 call, it permanently stops using the original KeServiceDescriptorTable and switches entirely to a copy of KeServiceDescriptorTableShadow.

ADDING NEW SYSTEM SERVICES

Adding new system services involves the following steps:

1. Allocate a block of memory large enough to hold the existing SSDT and SSPT and the extensions to each table.
2. Copy the existing SSDT and SSPT into this block of memory.
3. Append the new entries to the new copies of the two tables, as shown in Figure 7-2.
4. Update KeServiceDescriptorTable and KeServiceDescriptorTableShadow to point to the newly allocated SSDT and SSPT.

In NT 3.51, because the Shadow Table is never used, you could get away without updating it. In NT 4.0 and Windows 2000, however, the Shadow Table takes a leading role once a GDI32 or USER32 call has been made. Therefore, it is important that you update both KeServiceDescriptorTable and KeServiceDescriptorTableShadow. If you fail to update KeServiceDescriptorTableShadow in NT 4.0 or Windows 2000, the newly added services will stop working once a GDI32 or USER32 call is made. We recommend that you update both tables in all versions of Windows NT so that you can use the same piece of code with all versions of the operating system.

Figure 7-2: Adding new system services

One implementation issue in updating KeServiceDescriptorTableShadow is that NTOSKRNL does not export this table. However, NTOSKRNL does export KeServiceDescriptorTable. So, how can you get the address of KeServiceDescriptorTableShadow? The method we used is as follows. There is a function in NTOSKRNL called KeAddSystemServiceTable, which the WIN32K.SYS driver uses to add the USER32 and GDI32 related functions. This function refers to KeServiceDescriptorTableShadow. The first entry in KeServiceDescriptorTable and the first entry in KeServiceDescriptorTableShadow are the same. We therefore iterate through each DWORD in the KeAddSystemServiceTable code, and for every valid address found in this function, we compare the 16 bytes at that address (the size of one entry in a descriptor table) with the first entry in KeServiceDescriptorTable. If we find a match, we take that address to be the address of KeServiceDescriptorTableShadow. This method seems to work in all Windows NT versions.
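The search for the Shadow Table can be rehearsed in user mode before trying it in a driver. The sketch below is a model under stated assumptions, not the driver code: an ordinary byte buffer stands in for the code of KeAddSystemServiceTable, a list of known object addresses stands in for MmIsAddressValid(), and all names (table_entry, is_known_address, find_shadow) are hypothetical. The comparison uses sizeof(table_entry), which corresponds to the 16-byte entry only on a 32-bit build.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One descriptor entry; four pointer-sized fields, 16 bytes on 32-bit NT. */
typedef struct {
    const void *service_table_base;
    const void *service_counter_table;
    uintptr_t   number_of_services;
    const void *param_table_base;
} table_entry;

/* Stand-in for MmIsAddressValid(): an address counts as valid only if it
 * is the start of one of the objects listed in known[]. */
static int is_known_address(uintptr_t addr, const void *const *known, size_t n)
{
    size_t k;
    for (k = 0; k < n; k++)
        if (addr == (uintptr_t)known[k])
            return 1;
    return 0;
}

/* Scans code[0..len) one byte at a time, treating each position as a
 * possible embedded pointer. Returns the first valid pointer, other than
 * primary itself, whose first entry equals *primary; NULL if none. */
const table_entry *find_shadow(const unsigned char *code, size_t len,
                               const table_entry *primary,
                               const void *const *known, size_t n_known)
{
    size_t i;

    assert(code != NULL && primary != NULL);
    for (i = 0; i + sizeof(uintptr_t) <= len; i++) {
        uintptr_t candidate;
        memcpy(&candidate, code + i, sizeof candidate);  /* unaligned-safe read */
        if (!is_known_address(candidate, known, n_known))
            continue;
        if (candidate == (uintptr_t)primary)
            continue;                                    /* skip the exported table */
        if (memcmp((const void *)candidate, primary, sizeof *primary) == 0)
            return (const table_entry *)candidate;
    }
    return NULL;
}
```

Skipping the exported table itself matters because the scanned code may reference both tables, and the exported one trivially matches its own first entry. Stepping one byte at a time, instead of one pointer at a time, is needed because operands embedded in machine code are not aligned.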

116 EXAMPLE OF ADDING A NEW SYSTEM SERVICE This example consists of three modules. One device driver contains the code for new system services and the mechanism of adding new system services to a Windows NT kernel. One DLL represents an interface to new system services (just as NTDLL.DLL provides interface for services called by KERNEL32.DLL). And one application links to this wrapper DLL and calls the newly added services. The newly added services print a debug message saying, kernel service... Called and print the parameters passed to the services. Each service returns values 0, 1, and 2. The function AddServices() isolates the code for the mechanism of adding new system services. Assuming first that the sample binaries are copied in C:\SAMPLES directory, here are the steps to try out the sample: 1. Run instdrv extndsys c:\samples \extndsys.sys. This will install the extndsys.sys driver. The driver will add three new system services to Windows NT Kernel. 2. Run MYAPP.EXE. This will call wrapper functions in MYNTDLL.DLL to call newly added system services in EXTNDSYS.SYS. #include "ntddk.h" #include "stdarg.h" #include "stdio.h" #include "extnddrv.h" #define DRIVER_SOURCE #include "..\..\include\wintype.h" #include "..\..\include\undocnt.h" /* Prototypes for the services to be added */ NTSTATUS SampleService0(void); NTSTATUS SampleService1(int param1); NTSTATUS SampleService2(int param1, int param2); /* TODO TODO TODO TODO Add more to this list to add more services */ /* Table describing the new services */ unsigned int ServiceTableBase[]={(unsigned int)sampleservice0, (unsigned int)sampleservice1, (unsigned int)sampleservice2, /* TODO TODO TODO TODO......

       Add more to this list to add more services */
};

/* Table describing the parameter bytes required for the new services */
unsigned char ParamTableBase[]={
    0,
    4,
    8,
    /* TODO TODO TODO TODO
       Add more parameter bytes to this list to add more services */
};

unsigned int *NewServiceTableBase;   /* Pointer to new SSDT */
unsigned char *NewParamTableBase;    /* Pointer to new SSPT */
unsigned int NewNumberOfServices;    /* New number of services */
unsigned int StartingServiceId;

NTSTATUS SampleService0(void)
{
    trace(("kernel service with 0 parameters called\n"));
    return STATUS_SUCCESS;
}

NTSTATUS SampleService1(int param1)
{
    trace(("kernel service with 1 parameters called\n"));
    trace(("param1=%x\n", param1));
    return STATUS_SUCCESS+1;
}

NTSTATUS SampleService2(int param1, int param2)
{
    trace(("kernel service with 2 parameters called\n"));
    trace(("param1=%x param2=%x\n", param1, param2));
    return STATUS_SUCCESS+2;
}

/* TODO TODO TODO TODO
   Add implementations of other services here */

unsigned int GetAddrssofShadowTable()
{
    int i;

    unsigned char *p;
    unsigned int dwordatbyte;

    p=(unsigned char *)KeAddSystemServiceTable;
    for (i=0; i<4096; i++, p++) {
        /* Guard the read: p may run past the end of mapped code */
        __try {
            dwordatbyte=*(unsigned int *)p;
        } __except (EXCEPTION_EXECUTE_HANDLER) {
            return 0;
        }
        if (MmIsAddressValid((PVOID)dwordatbyte)) {
            if (memcmp((PVOID)dwordatbyte, &KeServiceDescriptorTable, 16)==0) {
                /* Skip references to the exported table itself */
                if ((PVOID)dwordatbyte==&KeServiceDescriptorTable) {
                    continue;
                }
                trace(("%x\n", dwordatbyte));
                return dwordatbyte;
            }
        }
    }
    return 0;
}

NTSTATUS AddServices()
{
    PServiceDescriptorTableEntry_t KeServiceDescriptorTableShadow;
    unsigned int NumberOfServices;

    NumberOfServices=sizeof(ServiceTableBase)/sizeof(ServiceTableBase[0]);
    trace(("KeServiceDescriptorTable=%x\n", &KeServiceDescriptorTable));
    KeServiceDescriptorTableShadow=
        (PServiceDescriptorTableEntry_t)GetAddrssofShadowTable();
    if (KeServiceDescriptorTableShadow==NULL) {
        return STATUS_UNSUCCESSFUL;
    }
    trace(("KeServiceDescriptorTableShadow=%x\n", KeServiceDescriptorTableShadow));
    NewNumberOfServices=KeServiceDescriptorTable.NumberOfServices
                        +NumberOfServices;

    StartingServiceId=KeServiceDescriptorTable.NumberOfServices;

    /* Allocate sufficient memory to hold the existing services
       as well as the services you want to add */
    NewServiceTableBase=(unsigned int *)ExAllocatePool(PagedPool,
                            NewNumberOfServices*sizeof(unsigned int));
    if (NewServiceTableBase==NULL) {
        return STATUS_INSUFFICIENT_RESOURCES;
    }
    NewParamTableBase=(unsigned char *)ExAllocatePool(PagedPool,
                            NewNumberOfServices);
    if (NewParamTableBase==NULL) {
        ExFreePool(NewServiceTableBase);
        return STATUS_INSUFFICIENT_RESOURCES;
    }

    /* Back up the existing SSDT and SSPT */
    memcpy(NewServiceTableBase, KeServiceDescriptorTable.ServiceTableBase,
           KeServiceDescriptorTable.NumberOfServices*sizeof(unsigned int));
    memcpy(NewParamTableBase, KeServiceDescriptorTable.ParamTableBase,
           KeServiceDescriptorTable.NumberOfServices);

    /* Append the new SSDT and SSPT entries */
    memcpy(NewServiceTableBase+KeServiceDescriptorTable.NumberOfServices,
           ServiceTableBase, sizeof(ServiceTableBase));
    memcpy(NewParamTableBase+KeServiceDescriptorTable.NumberOfServices,
           ParamTableBase, sizeof(ParamTableBase));

    /* Modify the KeServiceDescriptorTable entry to point to
       the new SSDT and SSPT */
    KeServiceDescriptorTable.ServiceTableBase=NewServiceTableBase;
    KeServiceDescriptorTable.ParamTableBase=NewParamTableBase;
    KeServiceDescriptorTable.NumberOfServices=NewNumberOfServices;

    /* Also update the KeServiceDescriptorTableShadow to point to
       the new SSDT and SSPT */
    KeServiceDescriptorTableShadow->ServiceTableBase=NewServiceTableBase;
    KeServiceDescriptorTableShadow->ParamTableBase=NewParamTableBase;
    KeServiceDescriptorTableShadow->NumberOfServices=NewNumberOfServices;

    /* Return Success */
    DbgPrint("Returning success\n");
    return STATUS_SUCCESS;
}

NTSTATUS DriverDispatch(
    IN PDEVICE_OBJECT DeviceObject,

    IN PIRP Irp
    );

VOID DriverUnload(
    IN PDRIVER_OBJECT DriverObject
    );

NTSTATUS DriverEntry(
    IN PDRIVER_OBJECT DriverObject,
    IN PUNICODE_STRING RegistryPath
    )
{
    MYDRIVERENTRY(L"extnddrv", FILE_DEVICE_EXTNDDRV, AddServices());
    return ntstatus;
}

NTSTATUS DriverDispatch(
    IN PDEVICE_OBJECT DeviceObject,
    IN PIRP Irp
    )
{
    PIO_STACK_LOCATION irpstack;
    PVOID iobuffer;
    ULONG inputbufferlength;
    ULONG outputbufferlength;
    NTSTATUS ntstatus;

    Irp->IoStatus.Status = STATUS_SUCCESS;
    Irp->IoStatus.Information = 0;

    irpstack = IoGetCurrentIrpStackLocation(Irp);

    switch (irpstack->MajorFunction) {
    case IRP_MJ_DEVICE_CONTROL:
        trace(("extnddrv.sys: IRP_MJ_DEVICE_CONTROL\n"));
        switch (irpstack->Parameters.DeviceIoControl.IoControlCode) {
        case IOCTL_EXTNDDRV_GET_STARTING_SERVICEID:
            trace(("extnddrv.sys: IOCTL_EXTNDDRV_GET_STARTING_SERVICEID\n"));
            outputbufferlength =
                irpstack->Parameters.DeviceIoControl.OutputBufferLength;
            if (outputbufferlength<sizeof(StartingServiceId)) {

                Irp->IoStatus.Status = STATUS_INSUFFICIENT_RESOURCES;
            } else {
                iobuffer = (PULONG)Irp->AssociatedIrp.SystemBuffer;
                memcpy(iobuffer, &StartingServiceId, sizeof(StartingServiceId));
                Irp->IoStatus.Information = sizeof(StartingServiceId);
            }
            break;
        }
        break;
    }

    ntstatus = Irp->IoStatus.Status;
    IoCompleteRequest(Irp, IO_NO_INCREMENT);
    return ntstatus;
}

VOID DriverUnload(
    IN PDRIVER_OBJECT DriverObject
    )
{
    WCHAR devicelinkbuffer[] = L"\\DosDevices\\EXTNDDRV";
    UNICODE_STRING devicelinkunicodestring;

    RtlInitUnicodeString(&devicelinkunicodestring, devicelinkbuffer);
    IoDeleteSymbolicLink(&devicelinkunicodestring);
    IoDeleteDevice(DriverObject->DeviceObject);
    trace(("extnddrv.sys: unloading\n"));
}

/* MYNTDLL.C
 * This DLL is a wrapper around the new services
 * added by the device driver. It plays the role that
 * NTDLL.DLL plays for the services called by KERNEL32.DLL
 */
#include <windows.h>
#include <stdio.h>
#include <winioctl.h>
#include "..\sys\extnddrv.h"

typedef int NTSTATUS;
int ServiceStart;

/* Each wrapper loads the service ID into EAX and a pointer to the
   parameters into EDX, then traps into the kernel through INT 2Eh.
   The kernel leaves the return value in EAX. */
__declspec(dllexport) NTSTATUS SampleService0(void)
{
    _asm {
        mov eax, ServiceStart
        int 2eh
    }
}

__declspec(dllexport) NTSTATUS SampleService1(int param)
{
    void **stackframe=(void **)&param;
    _asm {
        mov eax, ServiceStart
        add eax, 1
        mov edx, stackframe
        int 2eh
    }
}

__declspec(dllexport) NTSTATUS SampleService2(int param1, int param2)
{
    char **stackframe=(char **)&param1;
    _asm {
        mov eax, ServiceStart
        add eax, 2
        mov edx, stackframe
        int 2eh
    }
}

__declspec(dllexport) NTSTATUS SampleService3(int param1, int param2, int param3)
{
    char **stackframe=(char **)&param1;
    _asm {
        mov eax, ServiceStart
        add eax, 3

        mov edx, stackframe
        int 2eh
    }
}

__declspec(dllexport) NTSTATUS SampleService4(int param1, int param2,
                                              int param3, int param4)
{
    char **stackframe=(char **)&param1;
    _asm {
        mov eax, ServiceStart
        add eax, 4
        mov edx, stackframe
        int 2eh
    }
}

__declspec(dllexport) NTSTATUS SampleService5(int param1, int param2,
                                              int param3, int param4,
                                              int param5)
{
    char **stackframe=(char **)&param1;
    _asm {
        mov eax, ServiceStart
        add eax, 5
        mov edx, stackframe
        int 2eh
    }
}

__declspec(dllexport) NTSTATUS SampleService6(int param1, int param2,
                                              int param3, int param4,
                                              int param5, int param6)
{
    char **stackframe=(char **)&param1;
    _asm {
        mov eax, ServiceStart
        add eax, 6
        mov edx, stackframe
        int 2eh
    }
}

/* Ask the driver for the ID of the first newly added service */
BOOL SetStartingServiceId()
{
    HANDLE hdevice;
    BOOL ret;

    hdevice = CreateFile(
        "\\\\.\\extnddrv",
        GENERIC_READ | GENERIC_WRITE,
        0,
        NULL,
        OPEN_EXISTING,
        FILE_ATTRIBUTE_NORMAL,
        NULL
        );

    if (hdevice == ((HANDLE)-1)) {
        MessageBox(0, "Unable to open handle to driver", "Error", MB_OK);
        ret = FALSE;
    } else {
        DWORD BytesReturned;

        ret=DeviceIoControl(
            hdevice,
            IOCTL_EXTNDDRV_GET_STARTING_SERVICEID,
            NULL,
            NULL,
            &ServiceStart,
            sizeof(ServiceStart),
            &BytesReturned,
            NULL);
        if (ret) {
            if (BytesReturned!=sizeof(ServiceStart)) {
                MessageBox(0, "DeviceIoControl failed", "Error", MB_OK);
                ret=FALSE;
            } else {
                ret = TRUE;
            }
        } else {
            MessageBox(0, "DeviceIoControl failed", "Error", MB_OK);
        }
        CloseHandle(hdevice);

    }
    return ret;
}

BOOL WINAPI DllMain(HANDLE hmodule, DWORD Reason, LPVOID lpreserved)
{
    switch (Reason) {
    case DLL_PROCESS_ATTACH:
        //
        // We're being loaded - save our handle
        //
        return SetStartingServiceId();
    default:
        return TRUE;
    }
}

/* This is a sample console application that calls
   the newly added services. The services are called
   through a wrapper DLL. The application simply
   prints the return values from the newly added
   system services.
*/
#include <windows.h>
#include <stdio.h>
#include "..\dll\myntdll.h"

int main(void)
{
    printf("sampleservice0 returned = %x\n", SampleService0());
    printf("sampleservice1 returned = %x\n", SampleService1(0x10));
    printf("sampleservice2 returned = %x\n", SampleService2(0x10, 0x20));
    return 0;
}

Device Drivers as a Means of Extending the Kernel versus Adding New System Services

You can also extend the kernel by writing pseudo device drivers and exposing their functionality to applications through DeviceIoControl. However, in this case, each application that wants to use DeviceIoControl has to open a handle to the device, issue the DeviceIoControl, and close the device. Extending the kernel by means of system services has distinct advantages; first and foremost, applications need not be aware of the device driver. Applications will just link to

a DLL that provides an interface for the system services (just as NTDLL.DLL provides an interface for KERNEL32.DLL). Further, DeviceIoControl proves much slower, especially if the DeviceIoControl requires a large amount of data to be transferred between the application and the device driver. By using this technique of adding system services, you might write a set of system services and provide a user-level interface DLL that everybody can use. This implementation looks cleaner and more standardized than the DeviceIoControl method.

KeAddSystemServiceTable

The WIN32K.SYS driver calls this function during its DriverEntry under Windows NT 4.0 and Windows 2000. This function looks somewhat odd. It expects five parameters: an index in the Service Descriptor Table where the new entry is to be added, the SSDT, the SSPT, the number of services, and one parameter for use only in checked build versions. This last parameter points to a DWORD table that holds the number of times each service has been called.

NT 3.51 Design versus NT 4.0 and Windows 2000 Design: Microsoft's Options

You might find it interesting to discover that the code manipulating KeServiceDescriptorTableShadow resides in all versions of Windows NT; the only difference is that the code for allocating and copying the Shadow Table is not triggered under NT 3.51, based on the value of the PspW32ProcessCallout variable. This information might convince you that the relocation of the USER32 and GDI32 components into the NT 4.0 and Windows 2000 kernel (as contrasted with the NT 3.51 kernel) is not only performance based, as Microsoft now claims, but something well thought out as an option when NT 3.51 was designed. This leads us to believe that Microsoft had two solutions implemented for the USER32 and GDI32 modules: the LPC-based solution of using the Win32 subsystem and the INT 2Eh-based system service solution.
Microsoft attempted the first solution under NT 3.51 and settled for the second solution in later versions of Windows NT. The partial code for both solutions exists in NT 3.51, but there is no trace of the LPC solution for the Win32 subsystem in versions later than NT 3.51. So, we can also conclude that future releases of NT, unless drastically different, will continue to use the INT 2Eh-based solution for WIN32K.SYS system services.

SUMMARY

In this chapter, we discussed in detail the system service implementation of Windows NT. We explored some code fragments from a system service interrupt handler, using KiSystemService() as an example. Next, we detailed the mechanism for adding new system services to the Windows NT kernel and worked through an example that adds three new system services. Finally, we compared extending the kernel through device drivers with extending it by adding system services.
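The table-extension step at the heart of AddServices() can be modelled in ordinary user-mode C. This is a minimal sketch, assuming a simplified descriptor with only three fields. The ServiceDescriptor type and the append_services() function are hypothetical illustrations, not kernel definitions, and handler addresses are replaced by plain integers.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified, hypothetical stand-in for one service descriptor entry. */
typedef struct {
    unsigned int  *service_table;   /* SSDT: one handler slot per service */
    unsigned char *param_table;     /* SSPT: parameter bytes per service */
    unsigned int   count;           /* number of services */
} ServiceDescriptor;

/*
 * Append `extra` services to `d`. On success the descriptor points at
 * the enlarged tables and the ID of the first new service is stored in
 * *first_id. Returns 0 on success, -1 on allocation failure (in which
 * case `d` is left untouched).
 */
int append_services(ServiceDescriptor *d, const unsigned int *handlers,
                    const unsigned char *param_bytes, unsigned int extra,
                    unsigned int *first_id)
{
    unsigned int new_count;
    unsigned int *new_services;
    unsigned char *new_params;

    assert(d != NULL && first_id != NULL);
    new_count = d->count + extra;
    new_services = malloc(new_count * sizeof *new_services);
    new_params = malloc(new_count);
    if (new_services == NULL || new_params == NULL) {
        free(new_services);
        free(new_params);
        return -1;
    }

    /* Copy the existing tables, then append the new entries. */
    memcpy(new_services, d->service_table, d->count * sizeof *new_services);
    memcpy(new_params, d->param_table, d->count);
    memcpy(new_services + d->count, handlers, extra * sizeof *new_services);
    memcpy(new_params + d->count, param_bytes, extra);

    *first_id = d->count;

    /* Publish the new tables. The old ones are not freed here, because
       this code does not own them. */
    d->service_table = new_services;
    d->param_table = new_params;
    d->count = new_count;
    return 0;
}

/* Extend a two-entry table by three entries and check the result. */
int demo_append_services(void)
{
    unsigned int  old_services[] = { 100, 101 };
    unsigned char old_params[]   = { 4, 12 };
    unsigned int  handlers[]     = { 200, 201, 202 };
    unsigned char param_bytes[]  = { 0, 4, 8 };
    ServiceDescriptor d;
    unsigned int first_id = 0;
    int ok;

    d.service_table = old_services;
    d.param_table = old_params;
    d.count = 2;
    if (append_services(&d, handlers, param_bytes, 3, &first_id) != 0)
        return 0;
    ok = first_id == 2 && d.count == 5 &&
         d.service_table[1] == 101 && d.service_table[4] == 202 &&
         d.param_table[3] == 4;
    free(d.service_table);
    free(d.param_table);
    return ok;
}
```

The ID of the first appended service equals the old service count, which is why the driver records StartingServiceId before it swaps the tables and why the wrapper DLL adds small offsets to that value.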

Chapter 8 Local Procedure Call

Abstract

A local procedure call (LPC) is the communication mechanism used by Windows NT subsystems. This chapter introduces subsystems and then provides a detailed discussion on the undocumented LPC mechanism.

MICROSOFT DESIGNED THE local procedure call (LPC) facility to enable efficient communication with what Windows NT calls the subsystems. Although you do not need to know about subsystems before understanding the LPC mechanism, it is certainly interesting and advisable. In this chapter, we discuss the subsystems and then shed some light on the undocumented LPC mechanism.

THE ORIGIN OF THE SUBSYSTEMS

Although Microsoft never stated what NT stood for, one popular theory suggests that it refers to New Technology. That's not to say everything that goes inside Windows NT is new. Windows NT has borrowed several concepts from earlier operating systems. For example, the NTFS (New Technology File System) borrows a lot from the HPFS (High Performance File System) of IBM's OS/2. The Win32 API itself is an extension of the Windows 16-bit API. The Windows NT 3.51 user interface comes from Windows 3.1, and Windows NT 4.0 inherits its interface from Windows 95. Windows 2000 (Beta 3) maintains more or less the same user interface as Windows NT 4.0. In this section, we discuss the overall architecture of Windows NT, which Microsoft borrowed from the MACH operating system, originally developed at Carnegie Mellon University.

DOS and Unix variants dominated the operating systems world in the 1980s. DOS has a monolithic architecture, composed of a single lump of code. Unix follows the layered architecture, where the operating system divides into layers such that each layer uses only the interface provided by the lower layers. The MACH operating system follows a new client-server approach. The initial versions of MACH were based on BSD Unix 4.3. The MACH team focused on two major goals. First, they wanted to have a more structured code than BSD 4.3.
Second, they wanted to support different variants of the Unix API. They achieved both these goals by pushing the execution of kernel code to user-mode processes, which acted as servers. The MACH kernel appears very small, providing only the basic system services common to all Unix APIs. Therefore, we call it a micro-kernel. The server processes run in user mode and provide a sophisticated API interface. The normal application processes are clients of these server processes. When a client process invokes an API function, the emulation library, which links with the client code, transparently passes on the call to the server process. You can accomplish this using a facility similar to RPC (remote procedure call). The server process, after carrying out any necessary processing, returns the results to the client. To support a new API in the MACH environment, you need to write a server process and

emulation library, which support the new API. Not all server processes provide a different API. Some provide generic functionality such as memory management or TTY management.

The Windows NT design team sought goals similar to those of MACH's developers. They wanted to support the Win32, OS/2, and POSIX APIs, while keeping room for future APIs. Client-server architecture proved a natural choice. The servers are called protected subsystems in Windows NT. Subsystems are user-mode processes running in a local system security context. We call them protected subsystems because they are separate processes operating in separate address spaces and hence are protected from client access and modification. There are two types of subsystems:

  Integral subsystems
  Environment subsystems

Integral Subsystems

An integral subsystem performs some essential operating system task. For Windows NT, this group includes the Local Security Authority (lsass.exe), the Security Accounts Manager, the Session Manager (smss.exe), and the network server. The Local Security Authority (LSA) subsystem manages security access tokens for users. The Security Accounts Manager (SAM) subsystem maintains a database of information on user accounts, including passwords, any account groups a given user belongs to, the access rights each user is allowed, and any special privileges a given user has. The Session Manager subsystem starts and keeps track of NT logon sessions and serves as an intermediary among protected subsystems.

Environment Subsystems

An environment subsystem is a server that appears to perform operating system functions for its native applications by calling system services. An environment subsystem runs in user mode, and its interface to end users emulates another operating system, such as OS/2 or POSIX, on top of Windows NT.
Even the Win32 API is implemented through a subsystem process under Windows NT 3.51.

Note: Not all the API functions in the client-side DLLs need to pass the call to the subsystem process. For example, most of the KERNEL32.DLL calls can directly map onto the system services provided by the kernel. Such API functions invoke the system services via NTDLL.DLL. Most of the USER32.DLL functions and GDI32.DLL functions pass on the call to the subsystem process. (In Windows NT 4.0, Microsoft moved the Win32 subsystem inside the kernel for performance reasons.)

The system call interface provided by the Windows NT kernel is called the native API. The Win32 subsystem uses the native API for implementing the Win32 API. Generally, user programs make calls to an API provided by some subsystem, avoiding the use of the cumbersome native API. We refer to the user programs as the clients of the subsystem that provides the API used by these programs.

The communication between the client processes and the subsystem happens through a mechanism called local procedure call (LPC), specially designed by Microsoft for that purpose. For unknown reasons, Microsoft prefers to keep the LPC interface undocumented. There is no reason why LPC cannot function as an Inter-Process Communication (IPC) mechanism. Microsoft provides an RPC kit for client-server communication across machines. Windows NT optimizes the RPCs by converting them to LPCs when the client and the server reside on the same machine. However, RPC has its own overheads. LPC proves most efficient in its raw form, and the subsystems use it in that form only. Apart from that, RPC does not provide access to the fastest form of LPC, the Quick LPC. For these reasons, we provide you with useful information on the LPC interface.

LOCAL PROCEDURE CALL

In Windows NT, client-subsystem communication happens in a fashion similar to that in the MACH operating system. Each subsystem contains a client-side DLL that links with the client executable. The DLL contains stub functions for the subsystem's API. Whenever a client process (an application using the subsystem interface) makes an API call, the corresponding stub function in the DLL passes on the call to the subsystem process. The subsystem process, after the necessary processing, returns the results to the client DLL. The stub function in the DLL waits for the subsystem to return the results and, in turn, passes the results to the caller. To the client process, this simply resembles calling a normal procedure in its own code. In the case of RPC, the client actually calls a procedure sitting in some remote server over the network; hence the name remote procedure call. In Windows NT, the server runs on the same machine; hence the mechanism is called a local procedure call.

There are three types of LPC. The first type sends small messages up to 304 bytes. The second type sends larger messages.
The third type of LPC is called Quick LPC and is used by the Win32 subsystem in Windows NT 3.51.

The first two types of LPC use port objects for communication. Ports resemble the sockets or named pipes in Unix. A port is a bidirectional communication channel between two processes. However, unlike sockets, the data passed through ports is not streamed. The ports preserve the message boundaries. Simply put, you can send and receive messages using ports. The subsystems create ports with well-known names. The client processes that need to invoke services from the subsystems open the corresponding port using the well-known name. After opening the port, the client can communicate with the server over the port.

Short Message Communication

The client-subsystem communication via a port happens as follows. The server/subsystem creates a port using the NtCreatePort() function. The name of the port is well published and known to the clients (or, rather, to the client-side DLL). The NtCreatePort() function returns a port handle used by the subsystem to wait for and accept requests using the NtListenPort()

function. Any process can send connection requests on this port and get a port handle for communication. The subsystem receives the request messages, processes them, and sends back the replies over the port to the client.

The client sends a connection request to a waiting subsystem using the NtConnectPort() function. When the subsystem receives the connection request, it comes out of the NtListenPort() function and accepts the connection using the NtAcceptConnectPort() function. The NtAcceptConnectPort() function returns a new port handle specific to the client requesting the connection. The server can break the communication link with that particular client by closing this handle. The subsystem completes the connection protocol using the NtCompleteConnectPort() function. Now, the client also returns from the NtConnectPort() function and gets a handle to the communication port. This handle is private to the client process. Child processes do not inherit the port handles, so the children need to open the subsystem port again.

After completing this connection protocol, the client and the subsystem can start communicating over this port. The client sends a request to the subsystem using the NtRequestPort() function. The NtRequestPort() function sends datagram messages to the subsystem; the client does not receive any acknowledgment for the sent messages. If the client expects a reply to its request, it can use the NtRequestWaitReplyPort() function, which sends the request to the subsystem and waits for a reply from the subsystem. The subsystem receives request messages using the NtReplyWaitReceivePort() function and sends reply messages using the NtReplyPort() function. The subsystem can optimize by replying to the previous request and waiting for the next request using a single call to the NtReplyWaitReceivePort() function. Figure 8-1 displays this entire process of communication.
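The order of the connection protocol and the message-oriented nature of ports can be illustrated with a small single-threaded model. This is only a sketch under stated assumptions: the SimPort type and the sim_* functions are hypothetical names invented for illustration, they do not mirror the signatures of the undocumented Nt*Port() functions, and real ports block and queue messages rather than holding one message per direction.

```c
#include <assert.h>
#include <string.h>

#define SHORT_MSG_MAX 304   /* short-message limit quoted in the text */

typedef enum { IDLE, CONNECT_PENDING, ACCEPTED, CONNECTED } PortState;

typedef struct {
    char   data[SHORT_MSG_MAX];
    size_t len;
    int    full;                /* 1 while a message is waiting */
} Slot;

typedef struct {
    PortState state;
    Slot request;               /* client -> server */
    Slot reply;                 /* server -> client */
} SimPort;

/* Advance the handshake by one step; -1 if the call is out of order. */
static int step(SimPort *p, PortState from, PortState to)
{
    if (p->state != from) return -1;
    p->state = to;
    return 0;
}

void sim_create(SimPort *p)
{
    assert(p != NULL);
    memset(p, 0, sizeof *p);
    p->state = IDLE;
}
int sim_connect(SimPort *p)  { return step(p, IDLE, CONNECT_PENDING); }   /* client */
int sim_accept(SimPort *p)   { return step(p, CONNECT_PENDING, ACCEPTED); } /* server */
int sim_complete(SimPort *p) { return step(p, ACCEPTED, CONNECTED); }     /* server */

/* Store one whole message in a slot; message boundaries are preserved. */
static int put(SimPort *p, Slot *s, const void *msg, size_t len)
{
    if (p->state != CONNECTED || len > SHORT_MSG_MAX || s->full) return -1;
    memcpy(s->data, msg, len);
    s->len = len;
    s->full = 1;
    return 0;
}

/* Remove the waiting message from a slot; returns its length or -1. */
static int take(Slot *s, void *buf, size_t cap)
{
    if (!s->full || s->len > cap) return -1;
    memcpy(buf, s->data, s->len);
    s->full = 0;
    return (int)s->len;
}

int sim_request(SimPort *p, const void *m, size_t n) { return put(p, &p->request, m, n); }
int sim_receive(SimPort *p, void *b, size_t cap)     { return take(&p->request, b, cap); }
int sim_reply(SimPort *p, const void *m, size_t n)   { return put(p, &p->reply, m, n); }
int sim_get_reply(SimPort *p, void *b, size_t cap)   { return take(&p->reply, b, cap); }

/* Run the handshake and one request/reply exchange; 1 on success. */
int demo_port_exchange(void)
{
    SimPort port;
    char buf[SHORT_MSG_MAX];
    char big[SHORT_MSG_MAX + 1] = {0};

    sim_create(&port);
    if (sim_request(&port, "ping", 4) != -1) return 0;  /* not connected yet */
    if (sim_accept(&port) != -1) return 0;              /* nothing to accept */
    if (sim_connect(&port) || sim_accept(&port) || sim_complete(&port)) return 0;
    if (sim_request(&port, big, sizeof big) != -1) return 0;  /* too large */
    if (sim_request(&port, "ping", 4) != 0) return 0;
    if (sim_receive(&port, buf, sizeof buf) != 4 || memcmp(buf, "ping", 4)) return 0;
    if (sim_reply(&port, "pong", 4) != 0) return 0;
    if (sim_get_reply(&port, buf, sizeof buf) != 4 || memcmp(buf, "pong", 4)) return 0;
    return 1;
}
```

The model rejects a request sent before the connection completes and a message larger than the short-message limit. Larger payloads need the second type of LPC.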
