Posted on 26 Jan 2023 by Paranoid Ninja
NOTE: This is a PART I blog on Stack Tracing evasion. PART II can be found here.
Been a while since I actually wrote any blog on Dark Vortex (not counting the Brute Ratel ones, just raw research), thus I decided to add the post here. This blog provides a high level overview on stack tracing, how EDR/AVs use it for detections, the usage of ETWTI telemetry and what can be done to evade it. Last year, I posted a blog on Brute Ratel which was the first Command & Control to provide built-in proxying of DLL loads to avoid detections, which was later on adopted by other C2s like nighthawk with a different set of APIs (RtlQueueWorkItem
) to avoid detections. Thus, before we discuss evasion, lets first understand why stack tracing is important for EDRs.
The simplest way to describe a ‘Stack’ in computer science, is a temporary memory space where local variables and function arguments are stored with non-executable permissions. This stack can contain several information about a thread and the function in which it is being executed. Whenever your process executes a new thread, a new stack is created. Stack grows from bottom to top and works in linear fashion, which means it follows the Last In, First Out principal. The ‘RSP’ (x64) or ‘ESP’ (x86) stores the current stack pointer of the thread. Each new default stack size for a thread in windows is of 1 Megabyte unless explicitly changed by the developer during the creation of the thread. This means, if the developer does not calculate and increase the stack size while coding, the stack might end up hitting the stack boundary (alternative known as stack canary) and raise an exception. Usually, it is the task of the _chkstk
routine within msvcrt.dll to probe the stack, and raise an exception if more stack is required. Thus if you write a position independent shellcode which requires a large stack (as everything in PIC is stored on stack), your shellcode will crash raising an exception since your PIC will not be linked to the _chkstk
routine within msvcrt.dll. When your thread starts, your thread might contain execution of several functions and usage of various different types of variables. Unlike heap, which needs to be allocated and freed manually, we dont have to manually calculate the stack. When the compiler (mingw gcc or clang) compiles the C/C++ code, it auto calculates the stack required and adds the required instruction in the code. Thus when your thread is run, it will first allocate the ‘x’ size on stack from the reserved stack of 1 MB. Take the below example for this instance:
void samplefunction() {
char test[8192];
}
In the above function, we are simply creating a variable of 8192 bytes, but this will not be stored within the PE as it will unnecessarily end up eating space on disk. Thus such variables are optimized by compilers and converted to instructions such as:
sub rsp, 0x2000
The above assembly code subtracts 0x2000 bytes (8192 decimal) from stack which will be utilized by the function during runtime. In short, if your code needs to clean up some stack space, it will add bytes to stack, whereas if it requires some stack space, it will subtract from the stack. Each function’s stack within the thread will be converted to a block which is called as stack frame. Stack frames provide a clear and concise view of which function was last called, from which area in memory, how much stack is being used by that frame, what are the variables stored in the frame and where the current function needs to return to. Everytime your function calls another function, your current function’s address is pushed to stack, so that when the next function calls ‘ret’ or return, it returns to the current function’s address to continue execution. Once your current function returns to the previous function, the stack frame of the current function gets destroyed, not completely though, it can still be accessed, but mostly ends up being overwritten by the next function which gets called. To explain it like I would to a 5 year old, it would go like this:
void func3() {
char test[2048];
// do something
return;
}
void func2() {
char test[4096];
func3();
}
void func1() {
char test[8192];
func2();
}
The above code gets converted to assembly like this:
func3:
sub rsp, 0x800
; do something
add rsp, 0x800
ret
func2:
sub rsp, 0x1000
call func3
add rsp, 0x1000
ret
func1:
sub rsp, 0x2000
call func2
add rsp, 0x2000
ret
Well, a 5 year old wont understand it, but when do you find a 5 year old writing a malware right? XD! Thus, each stack frame will contain the number of bytes to allocate for variables, return address which pushed to stack by the previous function and information about current function’s local variables (in a nut shell).
The technique for detection is extremely smart here. Some EDRs use userland hooks, whereas some use ETW to capture the stack telemetry. For example, say you want to execute your shellcode without module stomping. So, you allocate some memory via VirtualAlloc or the relative NTAPI NtAllocateVirtualMemory, then copy your shellcode and execute it. Now your shellcode might have its own dependencies and it might call LoadLibraryA
or LdrLoadDll
to load a dll from disk into memory. If your EDR uses userland hooks, they might have already hooked LoadLibrary
and LdrLoadDll
, in which case they can check the return address pushed to stack by your RX shellcode region. This is specific to some EDRs like Sentinel One, Crowdstrike etc. which will instantly kill your payload. Other EDRs like Microsoft Defender ATP (MDATP), Elastic, FortiEDR will use ETW or kernel callbacks to check where the LoadLibrary
call originated from. The stack trace will provide a complete stack frame of return address and all the functions from where the call to LoadLibrary
started. In short, if you execute a DLL Sideload which executes your shellcode which called LoadLibrary
, it would look like this:
|-----------Top Of The Stack-----------|
| |
| |
|--------------------------------------|
|------Stack Frame of LoadLibrary------|
| Return address of RX on disk |
| |
|----------Stack Frame of RX-----------| <- Detection (An unbacked RX region should never call LoadLibraryA)
| Return address of PE on disk |
| |
|-----------Stack Frame of PE----------|
| Return address of RtlUserThreadStart |
| |
|---------Bottom Of The Stack----------|
This means any EDR which hooks LoadLibrary
in usermode or via kernel callbacks/ETW, can check the last return address region or where the call came from. In the v1.1 release of BRc4, I started using the RtlRegisterWait
API which can request a worker thread in thread pool to execute LoadLibraryA
in a seperate thread to load the library. Once the library is loaded, we can extract its base address by simply walking the PEB (Process Environment Block). Nighthawk later adopted this technique to RtlQueueWorkItem
API which is the main NTAPI behind QueueUserWorkItem
which can also queue a request to a worker thread to load a library with a clean stack. However this was researched by Proofpoint sometime last year in their blog, and lately Joe Desimone from Elastic also posted a tweet about the RtlRegisterWait
API being used by BRc4. This meant sooner or later, detections would come around it and there were need of more such APIs which can be used for further evasion. Thus I decided to spend some time reversing some undocumented APIs from ntdll and found atleast 27 different callbacks
which, with a little tweaking and hacking can be exploited to load our DLL with a clean stack.
Callback functions are pointers to a function which can be passed on to other functions to be executed inside them. Microsoft provides an insane amount of callbacks for software developers to execute code via other functions. A lot of these functions can be found in this github repository which have been exploited quite widely since the past two years. However there is a major issue with all those callbacks. When you execute a callback, you dont want the callback to be in the same thread as of your caller thread. Which means, you dont want stack trace to follow a trail like: LoadLibrary returns to -> Callback Function returns to -> RX region
. In order to have a clean stack, we need to make sure our LoadLibrary executes in a seperate thread independent of our RX region, and if we use callbacks, we need the callbacks to be able to pass proper parameters to LoadLibraryA
. Most callbacks in Windows, either dont have parameters, or dont forward the parameters ‘as is’ to our target function ‘LoadLibrary’. Take an example of the below code:
#include <windows.h>
#include <stdio.h>
int main() {
CHAR *libName = "wininet.dll";
PTP_WORK WorkReturn = NULL;
TpAllocWork(&WorkReturn, LoadLibraryA, libName, NULL); // pass `LoadLibraryA` as a callback to TpAllocWork
TpPostWork(WorkReturn); // request Allocated Worker Thread Execution
TpReleaseWork(WorkReturn); // worker thread cleanup
WaitForSingleObject((HANDLE)-1, 1000);
printf("hWininet: %p\n", GetModuleHandleA(libName)); //check if library is loaded
return 0;
}
If you compile and run the above code, it will crash. The reason being the definition of TpAllocWork is:
NTSTATUS NTAPI TpAllocWork(
PTP_WORK* ptpWrk,
PTP_WORK_CALLBACK pfnwkCallback,
PVOID OptionalArg,
PTP_CALLBACK_ENVIRON CallbackEnvironment
);
This means our callback function LoadLibraryA
should be of type PTP_WORK_CALLBACK. This type expands to:
VOID CALLBACK WorkCallback(
PTP_CALLBACK_INSTANCE Instance,
PVOID Context,
PTP_WORK Work
);
As can be seen in the above figure, our PVOID OptionalArg
from TpAllocWork
API gets forwarded as secondary argument to our Callback (PVOID Context
). So if our hypothesis is correct, the argument libName (wininet.dll)
that we passed to TpAllocWork
will end up as a second argument to our LoadLibraryA
. But LoadLibraryA
DOES NOT have a second argument. Checking this in debugger leads to the following image:
So this indeed created a clean stack like: LoadLibraryA returns to -> TpPostWork returns to -> RtlUserThreadStart
, but our argument for LoadLibrary gets sent as the second argument, whereas the first argument is a pointer to a TP_CALLBACK_INSTANCE
structure sent by the TpPostWork
API. After a bit more reversing, I found that this structure is dynamically generated by the TppWorkPost
(NOT TpPostWork
), which as expected is an internal function of ntdll.dll and nothing much can be done without having the debug symbols for this API.
However, all hope is not yet lost. One of the dirty tricks we can try is to replace a Callback function from LoadLibrary
to a custom function in TpAllocWork
which then calls LoadLibraryA
via our callback. Something like this:
#include <windows.h>
#include <stdio.h>
VOID CALLBACK WorkCallback(
_Inout_ PTP_CALLBACK_INSTANCE Instance,
_Inout_opt_ PVOID Context,
_Inout_ PTP_WORK Work
) {
LoadLibraryA(Context);
}
int main() {
CHAR *libName = "wininet.dll";
PTP_WORK WorkReturn = NULL;
TpAllocWork(&WorkReturn, WorkerCallback, libName, NULL); // pass `LoadLibraryA` as a callback to TpAllocWork
TpPostWork(WorkReturn); // request Allocated Worker Thread Execution
TpReleaseWork(WorkReturn); // worker thread cleanup
WaitForSingleObject((HANDLE)-1, 1000);
printf("hWininet: %p\n", GetModuleHandleA(libName)); //check if library is loaded
return 0;
}
However this means, the callback will be in our RX region and the stack would become: LoadLibraryA returns to -> Callback in RX Region returns to -> RtlUserThreadStart -> TpPostWork
which is not good as we ended up doing the same thing we were trying to avoid. The reason for this is stack frame. Because when we call LoadLibraryA
from our Callback in RX Region
, we end up pushing the return address of the Callback in RX Region
on stack which ends up becoming a part of the stack frame. However, what if we manipulate the stack to NOT PUSH THE RETURN ADDRESS? Sure, we will have to write a few lines in assembly, but this should solve our issue entirely and we can have a direct call from TpPostWork
to LoadLibrary
without having the intricacies in between.
#include <windows.h>
#include <stdio.h>
typedef NTSTATUS (NTAPI* TPALLOCWORK)(PTP_WORK* ptpWrk, PTP_WORK_CALLBACK pfnwkCallback, PVOID OptionalArg, PTP_CALLBACK_ENVIRON CallbackEnvironment);
typedef VOID (NTAPI* TPPOSTWORK)(PTP_WORK);
typedef VOID (NTAPI* TPRELEASEWORK)(PTP_WORK);
FARPROC pLoadLibraryA;
UINT_PTR getLoadLibraryA() {
return (UINT_PTR)pLoadLibraryA;
}
extern VOID CALLBACK WorkCallback(PTP_CALLBACK_INSTANCE Instance, PVOID Context, PTP_WORK Work);
int main() {
pLoadLibraryA = GetProcAddress(GetModuleHandleA("kernel32"), "LoadLibraryA");
FARPROC pTpAllocWork = GetProcAddress(GetModuleHandleA("ntdll"), "TpAllocWork");
FARPROC pTpPostWork = GetProcAddress(GetModuleHandleA("ntdll"), "TpPostWork");
FARPROC pTpReleaseWork = GetProcAddress(GetModuleHandleA("ntdll"), "TpReleaseWork");
CHAR *libName = "wininet.dll";
PTP_WORK WorkReturn = NULL;
((TPALLOCWORK)pTpAllocWork)(&WorkReturn, (PTP_WORK_CALLBACK)WorkCallback, libName, NULL);
((TPPOSTWORK)pTpPostWork)(WorkReturn);
((TPRELEASEWORK)pTpReleaseWork)(WorkReturn);
WaitForSingleObject((HANDLE)-1, 0x1000);
printf("hWininet: %p\n", GetModuleHandleA(libName));
return 0;
}
section .text
extern getLoadLibraryA
global WorkCallback
WorkCallback:
mov rcx, rdx
xor rdx, rdx
call getLoadLibraryA
jmp rax
Now if you compile both of them together, our TpPostWork
calls WorkCallback
, but WorkCallback
does not call LoadLibraryA
, it instead jumps to its pointer. WorkCallback
simply moves the library name in the RDX
register to RCX
, erases RDX
, gets the address of LoadLibraryA
from an adhoc function and then jumps to LoadLibraryA
which ends up rearranging the whole stack frame without adding our return address. This ends up making the stack frame look like this:
The stack is clear as crystal with no signs of anything malevolent. After finding this technique, I started hunting similar other APIs which can be manipulated, and found that with just a little bit of similar tweaks, you can actually implement proxy DLL loads with 27 other Callbacks
residing in kernel32, kernelbase and ntdll. I will leave it out as an exercise for the readers of this blog to figure that out. For the users of Brute Ratel, you will find these updates in the next release v1.5. That would be all for this blog and the full code can be found in my github repository.
Tagged with: red-team blogs brute-ratel