EVOLUTION OF THE WINDOWS KERNEL ARCHITECTURE

Dave Probert, Ph.D. - Windows Kernel Architect
Microsoft Windows Division

08.10.2009
Buenos Aires

Copyright Microsoft Corporation

About Me

Ph.D. in Computer Engineering (Operating Systems w/o Kernels)
Kernel Architect at Microsoft for over 13 years
  Managed platform-independent kernel development in Win2K/XP
  Working on multi-core & heterogeneous parallel computing support
  Architect for UMS in Windows 7 / Windows Server 2008 R2
Co-instigator of the Windows Academic Program
  Providing kernel source and curriculum materials to universities
  http://microsoft.com/WindowsAcademic or compsci@microsoft.com
Wrote the Windows material for leading OS textbooks
  Tanenbaum, Silberschatz, Stallings
  Consulted on others, including a successful OS textbook in China

UNIX vs NT Design Environments

Environment which influenced fundamental design decisions

UNIX [1969]                            Windows (NT) [1989]
16-bit program address space           32-bit program address space
Kbytes of physical memory              Mbytes of physical memory
Swapping system with memory mapping    Virtual memory
Kbytes of disk, fixed disks            Mbytes of disk, removable disks
Uniprocessor                           Multiprocessor (4-way)
State-machine based I/O devices        Micro-controller based I/O devices
Standalone interactive systems         Client/Server distributed computing
Small number of friendly users         Large, diverse user populations

Effect on OS Design: NT vs UNIX

Although both Windows and Linux have adapted to changes in the environment, the original design environments (i.e. in 1989 and 1969) heavily influenced the design choices:

Design choice          NT vs UNIX                   Influenced by
Unit of concurrency    Threads vs processes         Addr space, uniproc
Process creation       CreateProcess() vs fork()    Addr space, swapping
I/O                    Async vs sync                Swapping, I/O devices
Namespace root         Virtual vs Filesystem        Removable storage
Security               ACLs vs uid/gid              User populations
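
As a rough illustration of the process-creation row (not part of the original slides), the following Python sketch contrasts the two models. The helper names `spawn_style` and `fork_style` are hypothetical. `subprocess.run` stands in for a CreateProcess-like call that starts a fresh program image and must be handed its inputs explicitly, while `os.fork` (POSIX only; the sketch returns None where it is unavailable) produces a child that is a copy of the parent and already sees its variables.

```python
import os
import subprocess
import sys


def spawn_style(value):
    """CreateProcess-like: start a new program image; state is passed explicitly."""
    completed = subprocess.run(
        [sys.executable, "-c", "import sys; print(int(sys.argv[1]) + 1)", str(value)],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(completed.stdout)


def fork_style(value):
    """fork-like: the child is a copy of the parent and already sees `value`."""
    if not hasattr(os, "fork"):
        return None  # assumption: no fork() on this platform (e.g. Windows)
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: `value` was inherited through the copied address space.
        os.close(read_fd)
        os.write(write_fd, str(value + 1).encode())
        os._exit(0)
    # Parent: read the child's answer, then reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        data = reader.read()
    os.waitpid(pid, 0)
    return int(data)
```

Calling `spawn_style(41)` and `fork_style(41)` yields the same answer on a POSIX system, but only the first had to marshal the input onto a command line.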


Today’s Environment [2009]

64-bit addresses
GBytes of physical memory
TBytes of rotational disk
New Storage hierarchies (SSDs)
Hypervisors, virtual processors
Multi-core/Many-core
Heterogeneous CPU architectures, Fixed function hardware
High-speed internet/intranet, Web Services
Media-rich applications
Single user, but vulnerable to hackers worldwide
Convergence: Smartphone / Netbook / Laptop / Desktop / TV / Web / Cloud

Windows Architecture

[Diagram: Windows architecture, top to bottom]
User Mode
  System Processes: Service Control Mgr., LSASS, WinLogon, Session Manager
  Services: Services.Exe, SvcHost.Exe, WinMgt.Exe, SpoolSv.Exe
  Applications: Task Manager, Explorer, User Application
  Environment Subsystems: Windows, POSIX, OS/2
  Subsystem DLLs (Windows DLLs)
  NTDLL.DLL
Kernel Mode
  System Threads
  System Service Dispatcher (kernel mode callable interfaces)
  I/O Mgr, File System Cache, Object Mgr., Plug and Play Mgr., Power Mgr., Security Reference Monitor, Virtual Memory, Processes & Threads, Configuration Mgr (registry), Local Procedure Call
  Windows USER, GDI
  Device & File Sys. Drivers, Graphics Drivers
  Kernel
  Hardware Abstraction Layer (HAL)
Hardware interfaces (buses, I/O devices, interrupts, interval timers, DMA, memory cache control, etc.)

Kernel-mode Architecture of Windows

[Diagram: layers from user mode down to hardware]
user mode
  NT API stubs (wrap sysenter) -- system library (ntdll.dll)
kernel mode
  NTOS executive layer: Procs/Threads, IPC, Object Mgr, Virtual Memory, glue, Security, Caching Mgr, I/O, Registry
  NTOS kernel layer: Trap/Exception/Interrupt Dispatch; CPU mgmt: scheduling, synchr, ISRs/DPCs/APCs
  Drivers: Devices, Filters, Volumes, Networking, Graphics
  Hardware Abstraction Layer (HAL): BIOS/chipset details
firmware/hardware
  CPU, MMU, APIC, BIOS/ACPI, memory, devices

Kernel/Executive layers

Kernel layer: ntos/ke (~5% of NTOS source)
  Abstracts the CPU
    Threads, Asynchronous Procedure Calls (APCs)
    Interrupt Service Routines (ISRs)
    Deferred Procedure Calls (DPCs, aka Software Interrupts)
  Provides low-level synchronization
Executive layer
  OS Services running in a multithreaded environment
  Full virtual memory, heap, handles
Extensions to NTOS: drivers, file systems, network, …

NT (Native) API examples

NtCreateProcess(&ProcHandle, Access, SectionHandle, DebugPort, ExceptionPort, …)
NtCreateThread(&ThreadHandle, ProcHandle, Access, ThreadContext, bCreateSuspended, …)
NtAllocateVirtualMemory(ProcHandle, Addr, Size, Type, Protection, …)
NtMapViewOfSection(SectionHandle, ProcHandle, Addr, Size, Protection, …)
NtReadVirtualMemory(ProcHandle, Addr, Size, …)
NtDuplicateObject(srcProcHandle, srcObjHandle, dstProcHandle, dstHandle, Access, Attributes, Options)

Windows Vista Kernel Changes

Kernel changes mostly minor improvements
  Algorithms, scalability, code maintainability
CPU timing: Uses Time Stamp Counter (TSC)
  Interrupts not charged to threads
  Timing and quanta are more accurate
Communication
  ALPC: Advanced Lightweight Procedure Calls
  Kernel-mode RPC
  New TCP/IP stack (integrated IPv4 and IPv6)
I/O
  Remove a context switch from I/O Completion Ports
  I/O cancellation improvements
Memory management
  Address space randomization (DLLs, stacks)
  Kernel address space dynamically configured
Security: BitLocker, DRM, UAC, Integrity Levels

Windows 7 Kernel Changes

Miscellaneous kernel changes
MinWin
  Change how Windows is built
  Lots of DLL refactoring
  API Sets (virtual DLLs)
Working-set management
  Runaway processes quickly start reusing own pages
  Break up kernel working-set into multiple working-sets
    System cache, paged pool, pageable system code
Security
  Better UAC, new account types, fewer BitLocker blockers
Energy efficiency
  Trigger-started background services
  Core Parking
  Timer-coalescing, tick skipping
Major scalability improvements for large server apps
  Broke apart last two major kernel locks, >64 processors
  Kernel support for ConcRT
    User-Mode Scheduling (UMS)

MinWin

MinWin is a first step at creating architectural partitions
  Can be built, booted and tested separately from the rest of the system
  Higher layers can evolve independently
  An engineering process improvement, not a microkernel NT!
MinWin was defined as the set of components required to boot and access the network
  Kernel, file system driver, TCP/IP stack, device drivers, services
  No servicing, WMI, graphics, audio or shell, etc.
MinWin footprint:
  150 binaries, 25MB on disk, 40MB in-memory

MinWin Layering

[Diagram: two layers]
Upper layers: Shell, Graphics, Multimedia, Layered Services, Applets, Etc.
MinWin: Kernel, HAL, TCP/IP, File Systems, Drivers, Core System Services

Timer Coalescing

Secret of energy efficiency: Go idle and Stay idle
Staying idle requires minimizing timer interrupts
Before, periodic timers had independent cycles even when period was the same
New timer APIs permit timer coalescing
  Application or driver specifies tolerable delay
  Timer system shifts timer firing
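
The idea of shifting timers within a tolerable delay can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Windows timer implementation: `coalesce_timers` is a hypothetical function, each timer is modeled as a `(due, tolerance)` pair meaning it may fire anywhere in `[due, due + tolerance]`, and a simple greedy pass groups timers so the processor wakes as few times as possible.

```python
def coalesce_timers(timers):
    """Group timers into as few wakeups as possible.

    timers: list of (due, tolerance) pairs; a timer may fire at any time in
    the window [due, due + tolerance].
    Returns a list of (fire_time, [timer indices]) in firing order.
    """
    # Visit timers in order of their latest allowed firing time.
    order = sorted(range(len(timers)), key=lambda i: timers[i][0] + timers[i][1])
    wakeups = []
    for i in order:
        due, tolerance = timers[i]
        if wakeups and due <= wakeups[-1][0]:
            # The last planned wakeup already falls inside this timer's window.
            wakeups[-1][1].append(i)
        else:
            # Otherwise wake as late as this timer allows, so that later
            # timers have the best chance of sharing the same wakeup.
            wakeups.append((due + tolerance, [i]))
    return wakeups


# Four timers (times in ms): three with overlapping windows, one strict.
example = coalesce_timers([(100, 50), (120, 40), (140, 15), (300, 0)])
```

In the example, the first three timers share a single wakeup and the strict timer gets its own, so four timer interrupts become two.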


Broke apart the Dispatcher Lock

Scheduler Dispatcher lock hottest on server workloads
  Lock protects all thread state changes (wait, unwait)
  Very hot lock at >64x
Dispatcher lock broken up in Windows 7 / Server 2008 R2
  Each object protected by its own lock
  Many operations are lock-free

Removed PFN Lock

Windows tracks the state of pages in physical memory
  In use: in working sets
  Not assigned: on paging lists: free, modified, standby, …
Before, all page state changes protected by global PFN (Page Frame Number) lock
As of Windows 7 the PFN lock is gone
  Pages are now locked individually
  Improves scalability for large memory applications
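
A minimal Python sketch of the "pages are locked individually" idea follows. It is an illustration only: the slides do not describe how the memory manager implements per-page locking, and `PageState`, `Page`, `PageFrameDatabase` and `claim_concurrently` are hypothetical names. The point is that a state change on one page takes only that page's lock, so threads working on different pages never contend.

```python
import threading
from enum import Enum


class PageState(Enum):
    FREE = "free"
    MODIFIED = "modified"
    STANDBY = "standby"
    IN_USE = "in use"


class Page:
    def __init__(self):
        self.state = PageState.FREE
        self.lock = threading.Lock()  # one lock per page, no global lock


class PageFrameDatabase:
    def __init__(self, page_count):
        self.pages = [Page() for _ in range(page_count)]

    def transition(self, pfn, expected, new_state):
        """Atomically move page `pfn` from `expected` to `new_state`."""
        page = self.pages[pfn]
        with page.lock:  # only this page is locked
            if page.state is not expected:
                return False
            page.state = new_state
            return True


def claim_concurrently(db, pfn, workers=8):
    """Have several threads race to claim the same free page."""
    results = []

    def worker():
        results.append(db.transition(pfn, PageState.FREE, PageState.IN_USE))

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

When several threads race for the same page, exactly one transition succeeds; transitions on other pages proceed without waiting on that lock.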


The Silicon Power Wall

The situation:
  Power² ∝ Clock frequency
  Voltage ∝ Power²
  Clock frequency and Voltage offset each other
  Clock frequency inversely proportional to logic path length
Bad News:
  Power is about as low as it can go
  Logic paths between clocked elements are pretty short
Good News:
  Moore’s Law continues (# transistors doubles every ~22 months)
  All that parallel computational theory is going into practice
Transistors going into more cores, not faster cores!
Software subject to Amdahl’s Law, not Moore’s Law
(or Gustafson’s Law, if my wife can find large enough datasets she cares about)
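
To put numbers on the two laws named above, here is a small Python sketch using their standard textbook forms. The function names are hypothetical; `parallel_fraction` is the share of the work that can run in parallel and `cores` is the processor count. Amdahl's Law fixes the problem size, giving speedup 1 / ((1 - p) + p / n); Gustafson's Law lets the problem grow with the machine, giving (1 - p) + p * n.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Fixed problem size: the serial part limits the achievable speedup."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


def gustafson_speedup(parallel_fraction, cores):
    """Problem size grows with the core count: scaled speedup."""
    serial_fraction = 1.0 - parallel_fraction
    return serial_fraction + parallel_fraction * cores


# A program that is 95% parallel, run on 64 cores.
fixed_size = amdahl_speedup(0.95, 64)
scaled_size = gustafson_speedup(0.95, 64)
```

With 95% parallel work on 64 cores, the fixed-size speedup is only about 15.4x and can never exceed 20x however many cores are added, while the scaled speedup is 60.85x, which is why large enough datasets matter.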


Approaches to HW parallelism

Homogeneous
  More big superscalar cores
    Extend with private (or shared) SIMD engines (SSE on steroids)
    (Maybe) not very energy efficient
  A few more big cores and lots of smaller, slower, cooler cores
    Use SIMD for performance
    Shut off idle small cores for energy efficiency (but leakage?)
  Lots of little fully programmable cores, all the same
    Nobody has ever gotten this to work (more on this later)
Heterogeneous
  Programmable Accelerators (e.g. GPUs)
    Attach loosely-coupled, specialized (non-x86), energy-efficient cores
  Fixed-function Accelerators
    Very energy-efficient, device-like computational units for very specific tasks

User Mode Scheduling (UMS)

Improve support for efficient cooperative multithreaded scheduling of small tasks (over-decomposition)
  Want to schedule tasks in user-mode
  Use NT threads to simulate CPUs, multiplex tasks onto these threads
When a task calls into the kernel and blocks, the CPU may get scheduled to a different app
  If a single NT thread per CPU, when it blocks it blocks.
  Could have extra threads, but then kernel and user-mode are competing to schedule the CPU
Tasks run arbitrary Win32 code (but only x64/IA64)
  Assumes running on an NT thread (TEB, kernel thread)
Used by ConcRT (Visual Studio 2010’s Concurrency Run-Time)

Windows 7 User-Mode Scheduling

UMS breaks NT thread into two parts:
  UT: user-mode portion (TEB, ustack, registers)
  KT: kernel-mode portion (ETHREAD, kstack, registers)
Three key properties:
  User-mode scheduler switches UTs w/o ring crossing
  KT switch is lazy: at kernel entry (e.g. syscall, pagefault)
  CPU returned to user-mode scheduler when KT blocks
KT “returns” to user-mode by queuing completion
  User-mode scheduler schedules corresponding UT
(similar to scheduler activations, etc)

Normal NT Threading

[Diagram: an x86 core. User side: UT0, UT1, UT2. Kernel side: trap code, NTOS executive, Kernel-mode Scheduler, KT0, KT1, KT2.]

NT Thread is Kernel Thread (KT) and User Thread (UT)
UT/KT form a single logical thread representing NT thread in user or kernel
KT: ETHREAD, KSTACK, link to EPROCESS
UT: TEB, USTACK

User-Mode Scheduling (UMS)

[Diagram: User side: User-mode Scheduler, Primary Thread, UT0, UT1, UT Completion list. Kernel side: trap code, NTOS executive, Thread Parking with KT0, KT1, KT2; KT0 blocks.]

Only primary thread runs in user-mode
Trap code switches to parked KT
KT blocks: primary returns to user-mode
KT unblocks & parks: queue UT completion

UMS

Based on NT threads
  Each NT thread has user & kernel parts (UT & KT)
  When a thread becomes UMS, KT never returns to UT
    (Well, sort of)
  Instead, the primary thread calls the USched
USched
  Switches between UTs, all in user-mode
  When a UT enters kernel and blocks, the primary thread will hand CPU back to the USched declaring UT blocked
  When UT unblocks, kernel queues notification
  USched consumes notifications, marks UT runnable
Primary Thread
  Self-identified by entering kernel with wrong TEB
  So UTs can migrate between threads
  Affinities of primaries and KTs are orthogonal issues
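
The USched control flow described above can be mimicked with a toy Python simulation. This is a sketch under loud assumptions, not the Windows UMS API: Python generators stand in for UTs, a tick countdown stands in for a KT blocked in the kernel, and `Block`, `user_mode_scheduler`, `io_task` and `cpu_task` are hypothetical names. It shows the three moves from the slide: the scheduler switches between tasks without leaving "user mode", a task that blocks hands the CPU back to the scheduler, and unblocked tasks come back through a completion list.

```python
from collections import deque


class Block:
    """Yielded by a task to model a system call that blocks in the kernel."""

    def __init__(self, ticks):
        self.ticks = ticks


def user_mode_scheduler(tasks):
    runnable = deque(tasks)    # (name, generator) pairs ready to run
    parked = []                # [ticks_left, name, gen]: blocked in the "kernel"
    completion_list = deque()  # unblocked tasks handed back by the "kernel"
    trace = []
    while runnable or parked or completion_list:
        # "Kernel" side: advance blocked work and queue completions.
        still_parked = []
        for ticks_left, name, gen in parked:
            if ticks_left - 1 <= 0:
                completion_list.append((name, gen))
            else:
                still_parked.append([ticks_left - 1, name, gen])
        parked = still_parked
        # Scheduler side: consume notifications, mark tasks runnable.
        while completion_list:
            runnable.append(completion_list.popleft())
        if not runnable:
            continue  # idle tick: everything is blocked
        name, gen = runnable.popleft()
        trace.append(name)
        try:
            request = next(gen)  # run the task until it yields
        except StopIteration:
            continue  # task finished
        if isinstance(request, Block):
            parked.append([request.ticks, name, gen])  # CPU returns to scheduler
        else:
            runnable.append((name, gen))  # cooperative yield
    return trace


def io_task():
    yield Block(2)  # a blocking "system call"


def cpu_task():
    yield  # two cooperative yields
    yield


trace = user_mode_scheduler([("io", io_task()), ("cpu", cpu_task())])
```

While "io" is blocked, the scheduler keeps running "cpu" on the same simulated CPU instead of idling, which is the behavior UMS is meant to provide.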


UMS Thread Roles

Primary threads: represent CPUs; normal app threads enter the USched world and become primaries; primaries can also be created by UScheds to allow parallel execution
  Primaries represent concurrent execution
UMS threads (UT/KTs): allow blocking in the kernel without losing the CPU
  UMS threads represent concurrent blocking in kernel

Thread Scheduling vs UMS

[Diagram: two panels, each with Core 1 and Core 2. Thread scheduling: Threads 1-6, some running on the cores and the rest shown as non-running threads. UMS: each of Threads 1-6 shown as a User Thread / Kernel Thread pair.]

Win32 compat considerations

Why not Win32 fibers?
TEB issues
  Contains TLS and Win32-specific fields (incl LastError)
  Fibers run on multiple threads, so TEB state doesn’t track
Kernel thread issues
  Visibility to TEB
  I/O is queued to thread
  Mutexes record thread owner
  Impersonation
  Cross-thread operations expect to find threads and IDs
  Win32 code has thread and affinity awareness

Futures: Master/Slave UMS?

[Diagram: Remote side: Remote Scheduler with UT0, UT1, UT2. x86 core ("Remote x86", remote kernel): trap code, NTOS executive, Kernel-mode Scheduler, Thread Parking with KT0, KT1, KT2. The two sides are connected by a Syscall Request Queue and a Syscall Completion Queue.]

UTs (can) run on accelerators or x86s
KTs run on x86s, syscalls remoted/batched
Pagefaults are just like syscalls
Accelerator never “loses the CPU” (implicit primary)

Operating Systems Futures

Many-core challenge
  New driving force in software innovation:
    Amdahl’s Law overtakes Moore’s Law as high-order bit
  Heterogeneous cores?
OS Scalability
  Loosely coupled OS: mem + cpu + services?
Energy efficiency
Shrink-wrap and Freeze-dry applications?
Hypervisor/Kernel/Runtime relationships
  Move kernel scheduling (cpu/memory) into run-times?
  Move kernel resource management into Hypervisor?

Windows Academic Program

Windows Kernel Internals
  Windows kernel in source (Windows Research Kernel - WRK)
  Windows kernel in PowerPoint (Curriculum Resource Kit - CRK)
Based on Windows Server 2003 Service Pack 1
  Latest kernel at time of release
  First kernel release with AMD64 support
Joint program between Windows Product Group and MS Academic Groups
  Program directed by Arkady Retik (Need a DVD? Have questions?)
Information available at
  http://microsoft.com/WindowsAcademic OR compsci@microsoft.com
Microsoft Academic Contacts in Buenos Aires
  Miguel Saez (masaez@microsoft.com) or Ezequiel Glinsky (eglinsky@microsoft.com)


Thank you very much