Let me tell you about a problem I ran into a couple of years ago, and the solution I ended up with. If you’ve ever heard of ArmInline, then this is the story behind its Nanomites tool.
If you’re not already aware, Armadillo is a commercial anti-cracking software scheme for Windows: you buy a license, throw your exe (or DLL) at it, and you end up with a new, protected file. This new program does just what the old one did, but it’s far harder to reverse-engineer. As the attacker, our goal is to remove the protection so that we can have our wicked way with the program inside.
Among other things, Armadillo employs a system known as Debug Blocker. Briefly put, this causes the program to create two instances whenever it is run - we call them the ‘parent’ and ‘child’ processes. The parent acts as a user-mode debugger, nannying the child (which does all the real work) to make sure that no bad guys can get too close. This system was fairly easy to defeat - all you needed to do was detach the parent process’s debugger at an appropriate moment and attach your own.
So to prevent this happening, the developers of Armadillo invented what they call Nanomites. When the protector is installed on the program, user-marked parts of the code section are scanned for jump instructions (JZ, JNZ, JBE and so on), and a database is created containing the address, type and offset of each. These jump instructions are patched over with ‘INT 3’s (user-mode breakpoint interrupt) and the database is put in the hands of the debugger. The idea is that the child process will raise a debug-break exception whenever one of these instructions fires, whence the parent steps in, grabs the thread context, looks up the appropriate jump in the database and sets the child process on its merry way.
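To make the parent's job concrete, here is a minimal Python sketch of the lookup step. The table layout, the `Nanomite` fields (in particular the stored instruction `size`) and the `resolve` helper are illustrative assumptions, not Armadillo's actual format, and only the three jump types named above are modelled.

```python
from dataclasses import dataclass

CF, ZF = 0x001, 0x040  # carry and zero flag bits in EFLAGS

# When each of the three example jumps is taken, as a test on EFLAGS.
CONDITIONS = {
    "JZ":  lambda f: bool(f & ZF),
    "JNZ": lambda f: not f & ZF,
    "JBE": lambda f: bool(f & (CF | ZF)),
}

@dataclass(frozen=True)
class Nanomite:
    kind: str    # which conditional jump was removed
    offset: int  # displacement to the target, relative to the end of the jump
    size: int    # length of the original instruction (assumed to be stored)

def resolve(table, address, eflags):
    """Return where the child should resume after hitting the INT 3 at `address`."""
    entry = table[address]
    fall_through = address + entry.size
    if CONDITIONS[entry.kind](eflags):
        return fall_through + entry.offset  # jump taken
    return fall_through                     # jump not taken

# Hypothetical one-entry table: a 2-byte JZ at 0x401000 jumping 0x10 bytes ahead.
TABLE = {0x401000: Nanomite("JZ", 0x10, 2)}
```

With the zero flag set, `resolve(TABLE, 0x401000, ZF)` gives the jump target; with it clear, the address just past the original instruction.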
This works very well. If the Nanomite-enabled code regions are chosen carefully then performance is virtually unaffected, and any attempt to sever the child-parent bond results in an immediate and unrecoverable crash. Even worse for the would-be cracker, the information needed to recover the code to a working state is locked up in this database, which is encrypted several times over and accessed only by heavily obfuscated, anti-debug-ridden routines. Reverse-engineering this would be a royal pain.
Getting the table
Many successful efforts had been made to reverse this encryption process and produce a working Nanomite table, but with each offence from the crackers came a counter-offence from the developers, and pretty soon there were several variants of the Nanomite system floating around. It was time for a unified approach. Being as lazy as I am, I insisted on making the computer do as much of the work as possible. So the plan was this:
Write a program to debug the parent process. That is, debug the debugger. With this level of control, it would be reasonably easy to fool the parent into processing Nanomites at our will. Three function hooks need to be created in the parent process:
WaitForDebugEvent - This is the primary source of information for any debugger. With a hook in here, we could forge any conceivable exception and let the parent attempt to handle it.
GetThreadContext - When alerted of an INT 3 exception, the parent calls this to find out where the Nanomite was struck. Another hook and we can feign a Nanomite hit at an arbitrary address.
SetThreadContext - After ploughing through that obfuscated code, the parent will have decided where execution should continue from, and enforces its will by setting the thread context. This last hook lets us read off the parent’s decision, and with it the details of any given Nanomite.
From here the algorithm writes itself. We find all instances of the byte 0xCC (INT 3) in the code section, spoof an INT 3 exception at each of these points and watch how the parent responds. By setting the EFlags register to take different values for the same Nanomite address, we can determine under which circumstances the jump occurs and hence exactly which conditional jump is being emulated. A few switch-statements later and we have a complete Nanomite table, without having to step through a single instruction of Armadillo’s code.
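The flag-probing idea can be sketched in a few lines of Python. This is a simulation under stated assumptions rather than the real tool: `probe(address, eflags)` is a hypothetical callback standing in for the three hooks, returning True when the parent redirected execution to a jump target. Working that out from the EIP handed to SetThreadContext is left out here.

```python
from itertools import product

# Status-flag bits in the x86 EFLAGS register.
CF, PF, ZF, SF, OF = 0x001, 0x004, 0x040, 0x080, 0x800

def _less(f):
    """Signed 'less than' as the CPU sees it: SF differs from OF."""
    return bool(f & SF) != bool(f & OF)

# The sixteen x86 conditional jumps and when each one is taken.
CONDITIONS = {
    "JO":  lambda f: bool(f & OF),        "JNO": lambda f: not f & OF,
    "JB":  lambda f: bool(f & CF),        "JAE": lambda f: not f & CF,
    "JZ":  lambda f: bool(f & ZF),        "JNZ": lambda f: not f & ZF,
    "JBE": lambda f: bool(f & (CF | ZF)), "JA":  lambda f: not f & (CF | ZF),
    "JS":  lambda f: bool(f & SF),        "JNS": lambda f: not f & SF,
    "JP":  lambda f: bool(f & PF),        "JNP": lambda f: not f & PF,
    "JL":  _less,                         "JGE": lambda f: not _less(f),
    "JLE": lambda f: bool(f & ZF) or _less(f),
    "JG":  lambda f: not (f & ZF or _less(f)),
}

# Every combination of the five status flags: 32 EFLAGS values to try.
PROBES = [sum(bits) for bits in product(*((0, bit) for bit in (CF, PF, ZF, SF, OF)))]

# Map each jump's taken/not-taken pattern over PROBES back to its name.
SIGNATURES = {tuple(test(f) for f in PROBES): name for name, test in CONDITIONS.items()}

def find_candidates(code, base):
    """Addresses of every 0xCC byte in the code section."""
    return [base + i for i, byte in enumerate(code) if byte == 0xCC]

def classify(probe, address):
    """Spoof a hit at `address` under each flag combination and name the jump."""
    observed = tuple(bool(probe(address, f)) for f in PROBES)
    return SIGNATURES.get(observed)  # None if no known jump behaves this way

def build_table(code, base, probe):
    return {addr: classify(probe, addr) for addr in find_candidates(code, base)}
```

Because the sixteen conditions all give different patterns across the 32 probes, a single pass per address is enough to tell them apart in this model.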
The Real Problem
After all that work, we can just assemble all the jumps from the database into place and dump the process. That’ll be sure to remove all the Nanomites, right? Well, yes, but it turns out that something far nastier happens in the process. See, when Armadillo creates the table in the first place, it doesn’t just store the addresses of the jumps but also creates some false entries at addresses that happen to legitimately contain a 0xCC byte. This means that a completely unrelated ‘CALL DWORD PTR:[0043CC7A]’, for instance, will produce a false entry in the table. This entry will never be needed, as the 0xCC is in the middle of an instruction and can’t trigger an exception under normal circumstances, but those clever developers have put us in a real dilly of a pickle.
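A few bytes show why a raw scan cannot tell the two apart. The Python sketch below builds the CALL from the example by hand, puts a made-up Nanomite after it (the 0x90 filler byte is an assumption), and scans for 0xCC.

```python
import struct

INT3 = 0xCC

# CALL DWORD PTR [0043CC7A]: opcode FF 15 followed by the little-endian address.
call = b"\xFF\x15" + struct.pack("<I", 0x0043CC7A)

# A genuine Nanomite: INT 3 plus one filler byte where a 2-byte jump used to be.
nanomite = bytes([INT3, 0x90])

code = call + nanomite

def int3_offsets(data):
    """Offsets of every 0xCC byte, whether or not it starts an instruction."""
    return [i for i, byte in enumerate(data) if byte == INT3]

hits = int3_offsets(code)

# We know the instruction boundaries only because we built the bytes ourselves.
instruction_starts = {0, len(call)}
real = [i for i in hits if i in instruction_starts]
```

Here `hits` contains two offsets but only one of them begins an instruction. On a dumped program the boundaries are not known up front, which is exactly the difficulty.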
There is simply no sure-fire way to weed out the ‘false Nanomites’ from the real ones. Without defeating the object of our endeavour and writing a purpose-built debugger to do exactly what we didn’t want the parent process doing, how can we fix this?
It took a little bit of brainstorming, but this is where vectored exception-handling comes to the rescue. This little-used feature of the Win32 API allows for installation of a process-wide exception-handler that doesn’t depend on stack-frames. Vectored handlers are of limited use in the real world, but just perfect for our needs, for the sole reason that the VEH chain is triggered before the SEH chain.
Suppose that we’ve managed to dump and patch the program (and fixed the imports, encrypted pages, code-splicing) so that it runs without the parent. Suppose further that the original program didn’t use any VEH. Then everything works great until a Nanomite triggers: a debug-break fires, promptly falls through all the structured exception-handlers and the process crashes and burns. But if we had a VEH installed, we’d be given a chance to deal with it.
So by adding a new section to the exe containing the Nanomite table along with some code, we can save the day:
Redirect the entry-point to our code, which installs the VEH and jumps straight to the original entry-point.
Have the VEH handle only INT 3 exceptions, searching the database and patching in the appropriate jump instruction when necessary.
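The handler's logic can be modelled in Python as below. This is a simulation on a bytearray, not the injected machine code. The table layout `(kind, offset, size)` and the helper names are assumptions, and details a real handler needs, such as making the code page writable, are omitted. The three constants are the standard Win32 values.

```python
import struct

EXCEPTION_BREAKPOINT = 0x80000003
EXCEPTION_CONTINUE_EXECUTION = -1
EXCEPTION_CONTINUE_SEARCH = 0

# Low nibble of the opcode for each x86 conditional jump.
CONDITION_CODE = {
    "JO": 0x0, "JNO": 0x1, "JB": 0x2, "JAE": 0x3, "JZ": 0x4, "JNZ": 0x5,
    "JBE": 0x6, "JA": 0x7, "JS": 0x8, "JNS": 0x9, "JP": 0xA, "JNP": 0xB,
    "JL": 0xC, "JGE": 0xD, "JLE": 0xE, "JG": 0xF,
}

def encode_jcc(kind, offset, size):
    """Re-assemble the jump in its original length (2-byte short or 6-byte near)."""
    cc = CONDITION_CODE[kind]
    if size == 2:
        return bytes([0x70 | cc]) + struct.pack("<b", offset)
    if size == 6:
        return bytes([0x0F, 0x80 | cc]) + struct.pack("<i", offset)
    raise ValueError("unsupported jump size")

def nanomite_handler(code, base, table, exception_code, exception_address):
    """Model of the vectored handler: patch known Nanomites, ignore everything else."""
    if exception_code != EXCEPTION_BREAKPOINT:
        return EXCEPTION_CONTINUE_SEARCH
    entry = table.get(exception_address)
    if entry is None:
        return EXCEPTION_CONTINUE_SEARCH   # a breakpoint, but not one of ours
    kind, offset, size = entry
    start = exception_address - base
    code[start:start + size] = encode_jcc(kind, offset, size)
    return EXCEPTION_CONTINUE_EXECUTION    # resume at the freshly patched jump
```

Patching on demand means a false table entry is never written into the code, since its 0xCC byte never raises an exception in the first place.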
That nearly takes care of everything. The only remaining problem is programs that use VEHs of their own. It’s unlikely that anybody would implement their own exception handler to deal with breakpoints, but a catch-all handler could conceivably ruin our best-laid plans. So the last piece of the puzzle is to hook RtlAddVectoredExceptionHandler, telling it to remove our handler before installing the client’s, then re-install ours at the head of the chain afterwards. In this way, the Nanomite-handler is guaranteed to be the first exception-handler on the scene (be it structured or vectored), and existing functionality is unaffected.
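The ordering trick is easy to check with a toy model of the handler chain. In this Python sketch, `VectoredChain` and `hooked_add` are illustrative stand-ins for the OS chain and the hook. The `first` argument mirrors the First parameter of AddVectoredExceptionHandler.

```python
class VectoredChain:
    """Toy model of the process-wide list of vectored exception handlers."""

    def __init__(self):
        self.handlers = []

    def add(self, first, handler):
        # first=True puts the handler at the head, otherwise at the tail.
        if first:
            self.handlers.insert(0, handler)
        else:
            self.handlers.append(handler)

    def remove(self, handler):
        self.handlers.remove(handler)

def hooked_add(chain, ours, first, client_handler):
    """What the hook does when the program registers a handler of its own."""
    chain.remove(ours)                 # step aside
    chain.add(first, client_handler)   # let the client install as it asked
    chain.add(True, ours)              # then put ours back at the head

def nanomite_handler():
    pass  # placeholder for our handler

def client_catch_all():
    pass  # placeholder for a handler registered by the program

chain = VectoredChain()
chain.add(True, nanomite_handler)
hooked_add(chain, nanomite_handler, True, client_catch_all)
```

Even though the client asked to be first, `chain.handlers[0]` is still our handler, and the client's handlers keep their order relative to one another.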