Introduction to KVA-Shadow
When Google Project Zero disclosed the calamity called Meltdown in early 2018, all OS vendors worked on mitigations against it. Microsoft developed a mechanism called KVA-Shadow. To understand what’s going on in KVA-Shadow, we should observe the following:
- There is a new section called KVASCODE in ntoskrnl.exe.
- A process has two sets of page tables. One is the traditional page table that maps both user-mode and kernel-mode memory space. The other also maps user-mode memory space, but maps only a small part of kernel-mode memory space. For example, the KVASCODE section is mapped in this page table whereas the .text section is not.
With this knowledge, we may observe that if we change the value of the LSTAR MSR to the address of our own proxy handler, a syscall instruction would of course trigger a #PF exception, because the proxy handler is not mapped in the page table that is active in user mode. This means the KiPageFaultShadow procedure would be invoked. This becomes interesting in that most procedures in the KVASCODE section do not necessarily switch the page table. Let’s observe the KiPageFaultShadow function:
nt!KiPageFaultShadow:
fffff801`23211840 f644241001 test byte ptr [rsp+10h],1
fffff801`23211845 746a je nt!KiPageFaultShadow+0x71 (fffff801`232118b1)
nt!KiPageFaultShadow+0x7:
fffff801`23211847 0f01f8 swapgs
fffff801`2321184a 0faee8 lfence
fffff801`2321184d 650fba24251890000001 bt dword ptr gs:[9018h],1
fffff801`23211857 720c jb nt!KiPageFaultShadow+0x25 (fffff801`23211865)
nt!KiPageFaultShadow+0x19:
fffff801`23211859 65488b242500900000 mov rsp,qword ptr gs:[9000h]
fffff801`23211862 0f22dc mov tmm,rsp ; Why would WinDbg disassemble cr3 register as tmm?
nt!KiPageFaultShadow+0x25:
fffff801`23211865 65488b242508900000 mov rsp,qword ptr gs:[9008h]
fffff801`2321186e 654889342510000000 mov qword ptr gs:[10h],rsi
fffff801`23211877 65488b342538000000 mov rsi,qword ptr gs:[38h]
fffff801`23211880 4881c600420000 add rsi,4200h
fffff801`23211887 ff76f8 push qword ptr [rsi-8]
fffff801`2321188a ff76f0 push qword ptr [rsi-10h]
fffff801`2321188d ff76e8 push qword ptr [rsi-18h]
fffff801`23211890 ff76e0 push qword ptr [rsi-20h]
fffff801`23211893 ff76d8 push qword ptr [rsi-28h]
fffff801`23211896 ff76d0 push qword ptr [rsi-30h]
fffff801`23211899 65488b342510000000 mov rsi,qword ptr gs:[10h]
fffff801`232118a2 65488324251000000000 and qword ptr gs:[10h],0
fffff801`232118ac e94f379fff jmp nt!KiPageFault (fffff801`22c05000)
nt!KiPageFaultShadow+0x71:
fffff801`232118b1 0faee8 lfence
fffff801`232118b4 e947379fff jmp nt!KiPageFault (fffff801`22c05000)
The first instruction is worth mentioning: it tests whether the byte located at [rsp+10h] has bit 0 set. Because #PF exceptions push an error code, [rsp+10h] points to the saved cs selector, whose lowest 2 bits indicate the CPL at which the #PF exception occurred. This means that if the page fault occurs in kernel mode, the je instruction jumps to KiPageFaultShadow+0x71, which in turn jumps to the KiPageFault procedure. With a hooked LSTAR, the page fault occurs in kernel mode, yet the page table remains unswitched because the exception is raised immediately, before the handler could execute a single instruction. Therefore, this jump triggers a page fault again, resulting in a #DF failure. Interestingly, KiDoubleFaultAbortShadow is not such a simpleton. It checks whether cr3 matches the page table that maps the entire kernel space. Therefore, it does not trigger a triple fault.
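To make the privilege test concrete, here is a minimal sketch in C of what the first instruction checks. The structure and function names are hypothetical; the layout is simply the frame the processor pushes in 64-bit mode for an exception that carries an error code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Frame pushed by the CPU for an exception with an error code (64-bit mode),
 * lowest address first. The type name is hypothetical. */
typedef struct {
    uint64_t error_code; /* [rsp+00h] */
    uint64_t rip;        /* [rsp+08h] */
    uint64_t cs;         /* [rsp+10h] */
    uint64_t rflags;     /* [rsp+18h] */
    uint64_t rsp;        /* [rsp+20h] */
    uint64_t ss;         /* [rsp+28h] */
} pf_frame_t;

/* Mirrors `test byte ptr [rsp+10h],1`. Windows only uses ring 0 and ring 3,
 * so bit 0 of the saved CS selector is enough to tell them apart. */
static bool fault_from_kernel(const pf_frame_t *frame)
{
    return (frame->cs & 1u) == 0;
}
```

On x64 Windows the kernel code selector is 0x10 and the 64-bit user code selector is 0x33, so only the latter has bit 0 set and takes the path that switches the page table.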
Similarly, if you set a breakpoint at the beginning of the KiSystemCall64Shadow function, it also induces a #DF failure. This is because a breakpoint causes either a #DB or a #BP exception. Such an exception is, however, treated as a kernel-mode exception, so the RPL field of the saved cs selector is zero. The KiDebugTrapOrFaultShadow function considers it unnecessary to switch the page table, so when it jumps to the KiDebugTrapOrFault function, it triggers a #PF exception. Once again, because the exception is taken in kernel mode, recurrence of #PF is inevitable, and this recurrence results in a #DF failure.
How to Hook System Call MSR with Hypervisor and Be Compatible with KVA-Shadow
With the power of virtualization, we may set up the VMCS to intercept page-fault exceptions. Upon interception of a page fault, read the Exit Qualification field in the VMCS. This is the linear address at which the page fault occurred. Compare it with the address of our proxy function. If they match, switch the guest CR3 to the proper page table. The “proper page table” I am referring to is located at offset +0x28 of the KPROCESS structure. In addition, don’t forget to invalidate the TLB via the invvpid instruction if VPID is enabled for the guest, because the guest will now be running with a different set of address mappings. If the guest’s TLB is not invalidated, all the effort spent on #PF interception would be in vain.
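The decision logic described above can be sketched as a small C function. This is only an illustration under stated assumptions: guest_state_t and handle_pf_exit are hypothetical names, the caller is assumed to have already read the Exit Qualification from the VMCS and the directory table base from KPROCESS+0x28, and the actual VMCS write and invvpid are left to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-vCPU bookkeeping; a real hypervisor would write the
 * guest CR3 field in the VMCS and execute invvpid instead. */
typedef struct {
    uint64_t guest_cr3;  /* value to load into the guest CR3 field */
    bool     flush_vpid; /* caller must issue invvpid before VM-Entry */
} guest_state_t;

/* Returns true if the fault hit our proxy handler and was resolved by
 * switching CR3; false means the #PF should be reinjected to the guest. */
static bool handle_pf_exit(guest_state_t *gs,
                           uint64_t exit_qualification, /* faulting linear address */
                           uint64_t proxy_handler,      /* address written to LSTAR */
                           uint64_t kernel_dir_base,    /* read from KPROCESS+0x28 */
                           bool vpid_enabled)
{
    if (exit_qualification != proxy_handler)
        return false;

    gs->guest_cr3 = kernel_dir_base;
    /* Stale translations from the shadow page table must not survive. */
    gs->flush_vpid = vpid_enabled;
    return true;
}
```

When the function returns true, the guest can simply be resumed at the same rip: the instruction fetch is retried under the page table that maps the whole kernel.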
In terms of hardware-accelerated virtualization, the fewer VM-Exits the guest triggers, the better the guest performs. With Intel VT-x, we may filter out unwanted page-fault interceptions by setting the Page-Fault Error-Code Mask and Page-Fault Error-Code Match fields in the VMCS. When a page fault occurs, the processor computes a masked error code by performing a logical AND of the page-fault error code with the mask set in the VMCS, then compares the masked error code with the match value set in the VMCS. If they are equal and #PF interception is set, a VM-Exit occurs. If they are not equal and #PF interception is cleared, a VM-Exit also occurs. We should therefore identify the exact traits of the page fault we intend to intercept: it comes from kernel mode, and it is caused by an instruction fetch. Accordingly, we set both the I/D and U/S bits in the Page-Fault Error-Code Mask field, but set only the I/D bit in the Page-Fault Error-Code Match field. In this way, interception of common page faults (memory swapping, writes to read-only pages, etc.) is avoided, which effectively improves performance.
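The filtering rule is easy to model in C. The sketch below is not hypervisor code; it only reproduces the comparison the processor performs, using the architectural bit positions of the page-fault error code. The macro and function names are my own.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural bits of the page-fault error code. */
#define PFEC_P   (1u << 0) /* fault on a present page */
#define PFEC_WR  (1u << 1) /* write access */
#define PFEC_US  (1u << 2) /* access originated in user mode */
#define PFEC_ID  (1u << 4) /* instruction fetch */

#define PF_VECTOR 14u      /* #PF is vector 14 in the exception bitmap */

/* Models the VT-x rule: with bit 14 of the exception bitmap set, a #PF
 * exits only if (pfec & mask) == match; with it clear, only if they differ. */
static bool pf_causes_vmexit(uint32_t pfec, uint32_t exception_bitmap,
                             uint32_t mask, uint32_t match)
{
    bool intercept = ((exception_bitmap >> PF_VECTOR) & 1u) != 0;
    bool equal = (pfec & mask) == match;
    return intercept ? equal : !equal;
}
```

With mask = PFEC_US | PFEC_ID (0x14) and match = PFEC_ID (0x10), a kernel-mode instruction fetch exits, while user-mode fetches and ordinary data faults stay inside the guest.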
Does this roll back the Meltdown mitigation? No. User-mode programs still run with the page table that maps only limited kernel-mode memory space, so the mitigation remains in effect.
What about AMD-V? As a matter of fact, Windows disables the KVA-Shadow mechanism by default if the machine has an AMD CPU, so perhaps we don’t have to worry. However, AMD-V lacks the filtering feature for page-fault interceptions, so all #PF exceptions will be intercepted. The optimization technique used on Intel VT-x is infeasible on AMD-V.
Similar Approach of KVA-Shadow-Compatible System Call Hook
Aidan Khoury introduced a method that purposefully disables the syscall and sysret instructions through the EFER MSR:
https://revers.engineering/syscall-hooking-via-extended-feature-enable-register-efer/
To do this, you will have to intercept #UD exceptions and emulate these instructions on your own. Such emulation is not arcane, but it is of course mundane: do what is described in Intel’s manual. In addition, you will also have to intercept reads of the EFER MSR. Otherwise, PatchGuard could detect this modification and trigger a BSOD.
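As a rough illustration of the #UD path, the following C sketch classifies the faulting instruction and emulates the register effects of a 64-bit syscall as described in Intel’s manual. The vcpu_t structure and the function names are hypothetical, the instruction bytes are assumed to have been fetched from guest memory already, and loading the hidden segment state (base, limit, access rights) is omitted.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of guest state touched by the emulation. */
typedef struct {
    uint64_t rip, rcx, r11, rflags;
    uint16_t cs, ss;
} vcpu_t;

typedef enum { INSN_OTHER, INSN_SYSCALL, INSN_SYSRET64 } ud_insn_t;

/* syscall is 0F 05; sysret with REX.W (return to 64-bit mode) is 48 0F 07. */
static ud_insn_t classify_ud(const uint8_t *code, size_t len)
{
    if (len >= 2 && code[0] == 0x0F && code[1] == 0x05)
        return INSN_SYSCALL;
    if (len >= 3 && code[0] == 0x48 && code[1] == 0x0F && code[2] == 0x07)
        return INSN_SYSRET64;
    return INSN_OTHER;
}

/* Register effects of syscall in 64-bit mode. `target` is where the guest
 * should land: the hook handler, or the original LSTAR value. */
static void emulate_syscall(vcpu_t *v, uint64_t star, uint64_t target,
                            uint64_t fmask)
{
    v->rcx = v->rip + 2;   /* return address: the syscall opcode is 2 bytes */
    v->r11 = v->rflags;    /* saved flags */
    v->rip = target;
    v->rflags &= ~fmask;   /* clear the bits selected by IA32_FMASK */
    v->cs = (uint16_t)((star >> 32) & 0xFFFC);
    v->ss = (uint16_t)(((star >> 32) & 0xFFFF) + 8);
}
```

Any instruction classified as INSN_OTHER should get its #UD reinjected into the guest unchanged.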
You may do something special and favorable with your hypervisor: mitigate the CVE-2012-0217 vulnerability in your handler, although your system should already have this vulnerability mitigated, given that it was disclosed nearly ten years ago. Nonetheless, the mitigation could be easier than you might imagine: omit the canonicality check on the return address, because Intel VT-x does not require a canonical address to be loaded into the rip register. Upon VM-Entry, the processor detects the canonicality violation and triggers #GP with CPL=3, and the vulnerability is thereby mitigated.
Beware of Supervisor-Mode Access Prevention
Is the game over? The answer is yes and no. Half of the answer is yes because our MSR hook should now be compatible with the KVA-Shadow mechanism, but please note that there is something more. If you observe the syscall handler in later versions of Windows 10 (at least version 1903), you should see an interesting instruction: the stac instruction.
Let’s trace backward. We may observe that control flow reaches the stac instruction if the byte KeSmapEnabled is not zero. This brings up an interesting x86 feature called Supervisor-Mode Access Prevention, often abbreviated as SMAP. This feature prevents kernel code from accidentally accessing rogue user-mode memory. For instance, OS vendors may mitigate the CVE-2018-8897 vulnerability by utilizing the SMAP feature so that user-mode memory will not be accidentally treated as kernel-mode memory. The general rule of SMAP is that if the CR4.SMAP bit is set, accessing user-mode memory in kernel mode while RFLAGS.AC is cleared results in a #PF exception. This, if not properly considered, would prevent the system call handler from accessing parameters saved in user-mode memory. The solution is easy: just execute the stac instruction before you access user-mode memory in your proxy system call handler.
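The SMAP rule can be written down as a small predicate. This is a simplified model in C, not handler code: it covers only explicit data accesses (implicit supervisor accesses are blocked regardless of AC) and the function name is hypothetical. The bit positions of CR4.SMAP and RFLAGS.AC are architectural.

```c
#include <stdbool.h>
#include <stdint.h>

#define CR4_SMAP  (1ull << 21) /* CR4.SMAP */
#define RFLAGS_AC (1ull << 18) /* RFLAGS.AC, set by stac and cleared by clac */

/* Returns true if an explicit data access would raise #PF because of SMAP. */
static bool smap_blocks_access(uint64_t cr4, uint64_t rflags,
                               unsigned cpl, bool user_page)
{
    if (!user_page || cpl == 3)
        return false;               /* SMAP only guards supervisor access to user pages */
    if ((cr4 & CR4_SMAP) == 0)
        return false;               /* feature disabled */
    return (rflags & RFLAGS_AC) == 0;
}
```

In a proxy handler this means the reads of user-mode parameters should sit between a stac and a matching clac, so that AC is set only for as long as it is needed.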
Novel PatchGuard Trick
This novel PatchGuard trick was already introduced by Aidan Khoury:
https://revers.engineering/patchguard-detection-of-hypervisor-based-instrospection-p2/
A method of countering this trick was also introduced in that article: we may unhook the system call when the guest writes something else to LSTAR, and rehook it once the guest writes the original system call handler back to LSTAR. Make sure that the unhooked state is reflected during rdmsr interception so that PatchGuard will not trigger a BSOD.
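The unhook and rehook behavior can be sketched as a tiny state machine in C. All names here are hypothetical, and a real hypervisor would apply real_lstar to the guest MSR state (for example through the MSR load area or by executing wrmsr) rather than keep it in a plain variable.

```c
#include <stdint.h>

/* Hypothetical bookkeeping for the LSTAR hook. */
typedef struct {
    uint64_t original_handler; /* value Windows normally writes, e.g. KiSystemCall64Shadow */
    uint64_t proxy_handler;    /* our own system call entry */
    uint64_t guest_view;       /* what the guest believes LSTAR contains */
    uint64_t real_lstar;       /* what the processor actually uses */
} lstar_hook_t;

/* wrmsr interception: hook only while the guest installs the original handler. */
static void on_wrmsr_lstar(lstar_hook_t *h, uint64_t value)
{
    h->guest_view = value;
    h->real_lstar = (value == h->original_handler) ? h->proxy_handler : value;
}

/* rdmsr interception: always report what the guest last wrote. */
static uint64_t on_rdmsr_lstar(const lstar_hook_t *h)
{
    return h->guest_view;
}
```

Because the guest-visible value is tracked separately from the value in effect, a check that writes a temporary handler to LSTAR, executes syscall, and reads the MSR back sees exactly what it wrote.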
Summary
This blog post covered three things relevant to MSR hooking in the latest versions of Windows:
- How to make your MSR-Hook compatible with KVA-Shadow mechanism.
- How to make your MSR-Hook compatible with SMAP processor feature.
- How to make your MSR-Hook compatible with novel PatchGuard trick.
By the way, feel free to visit Project NoirVisor on GitHub: https://github.com/Zero-Tang/NoirVisor