Run-time determination of VC++ 2005 virtual member function addresses

November 29th, 2007 Greg

I was recently somewhat surprised to find that there is really no C++ way to resolve a virtual function to its address at run-time. Admittedly, there is no good reason why anybody would morally need to do this, but when you’ve already lowered yourself to patching another process’s own code without consent, it seems like a very small crime.

Pioneers of such hackery have already established concrete methods for calling virtual functions from inline assembly, but these methods don’t quite stretch to getting the address in pointer form. So, if for no reason other than to convince you that it’s a lot of hassle, I present a miserable bit-chop hack to do just this.


Drawing on another Direct3D program’s viewport

November 27th, 2007 Greg

Update: See the post for the new version.

The theme of the moment is DLL hooking, and so I thought I’d present an applied example. I already explained how Fraps works, and since I’ve recently been roped into writing a similar tool for a stranger, I thought I’d share the wealth. There isn’t much new material here, but people like examples with source code, so you can download the DLL source (C++) from the project page.

Bioshock Hook Screenshot

If you don’t know how to inject this DLL into a foreign process, then you’ll need to read my previous post or wait for the injection framework I’m working on. But once it’s injected, call the Initialise method, via CreateRemoteThread or otherwise, to install the hooks. It works with any program that uses IDirect3DDevice9::Present (or IDirect3DSwapChain9::Present) to render, which is probably all of the DirectX 9 games. Similarly, invoke Release to remove the hooks. The source is fairly self-explanatory, with a few exceptions.

  • It’s not safe for 64-bit consumption, though this should be obvious.
  • While there’s no reason it can’t be made to work with Unicode, I’ve written everything in ASCII, for simplicity.
  • By default, the DLL will increase its own reference count to prevent it being unloaded prior to termination of the host process. This is because there is a small risk of the DLL being unloaded by one thread, while a hooked function in another returns to the now dead memory. I figured that it’s better to waste a little bit of everybody’s memory than to crash unnecessarily.
  • The d3d9.dll function addresses (and prologues) are hard-coded, or at least their offsets are. While this may look very unprofessional and rather risky, I can assure you that it’s quite safe. The alternative would be to hack up some virtual-function tables and that’s a whole other story for a whole other post.
  • You may notice that the compiled DLL is dependent upon D3DX. This isn’t necessary for the hook itself, but I used ID3DXFont in my example for demonstrative purposes. The only reason I mention this is that there is no way to guarantee the existence of any D3DX DLLs on a DirectX 9 machine, and distributing them yourself is in violation of the DirectX Runtime EULA. So if you happen to need to distribute this code, you’ll either need to carry the huge runtime installer around, or avoid using D3DX altogether.

Update:

  • The soft-hooks used here will cause problems with PunkBuster if applied to any of its monitored functions. If you need to do this then you’ll have to be a bit cleverer.
  • The source assumes that the graphics device will never become invalid. If you suspect that this isn’t the case (which will be true for any full-screen game at a minimum) then you’ll need to add the appropriate sanity checks (see IDirect3DDevice9::TestCooperativeLevel) before attempting to render anything, lest you want to crash and burn.

RCE essentials: PEiD

November 24th, 2007 Greg

When I mention my reverse-engineering feats or failures to technically-minded friends, I tend to get one of a few responses. Not uncommon is ‘I wouldn’t know where to start.’ Well, I know it’s just a figure of speech, but I always start in the same place: PEiD.

PEiD

Many programs are built with third-party post-applied protection schemes, or are compressed with a packer to reduce the file size. The basic workings are the same - you run what you think is the program, but unknowingly execute the unpacker’s code, which decompresses or decrypts the original exe in memory and executes that once it’s done. The fact that most people are completely unaware of this process goes to show that these protectors and packers do at least half of their job well. While some protection schemes are better than others, any such packer will have the effect of turning a trivial hack, crack or patching job into a relative pain in the neck.

Rather distinctly, the odd occasion comes up where you’d like to know which compiler and/or linker was used to produce a binary, as the different options have their own quirks and particulars. Differentiating your Borland C++ Builder 5 from your Microsoft Visual C++ 6 can save you a little time and effort, if you need to fiddle with the ins and outs of stack-frame prologues or function indirection tables, for example.

Any tool that modifies a PE (exe or DLL) has to conform to strict standards, so as to keep the program functional, but will also have the effect of leaving behind a mark. These tell-tale marks are aptly known as PE fingerprints, and PEiD is designed to sniff out these fingerprints and give you the lowdown. So if I decide that I want to tweak the interface of my PostScript viewer, or to investigate how my anti-spyware tool enumerates processes, I only need to drag-drop the respective exe files into PEiD and I immediately know that GhostScript 4.7’s gsview32.exe was built in Microsoft Visual C++ 7.0 and that AdAware SE Personal 6.20 is compressed using ASPack 2.12. This tells me that the former will be very easy to analyse, whereas the latter will put up something of a fight, and that I’d perhaps be better off spending my time on Google.
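To make the fingerprint idea concrete, here is a minimal sketch of matching a byte signature with wildcards against the start of a code buffer (say, the bytes at the entry point). I'm assuming this is roughly how such a scanner compares bytes; the SigByte type, the function names and the example pattern are all made up for illustration and are not taken from PEiD or its database.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One element of a signature: either a fixed byte or a wildcard.
struct SigByte {
    std::uint8_t value;
    bool wildcard;
};

// Returns true if 'code' begins with the given signature.
// Wildcard positions match any byte.
bool MatchesSignature(const std::vector<std::uint8_t>& code,
                      const std::vector<SigByte>& signature) {
    assert(!signature.empty());
    if (code.size() < signature.size()) return false;
    for (std::size_t i = 0; i < signature.size(); ++i) {
        if (!signature[i].wildcard && code[i] != signature[i].value) return false;
    }
    return true;
}

// A hypothetical signature: PUSHAD; MOV ESI, imm32 (60 BE ?? ?? ?? ??).
// The four wildcards cover the immediate, which varies between files.
std::vector<SigByte> ExampleSignature() {
    return {
        {0x60, false}, {0xBE, false},
        {0x00, true}, {0x00, true}, {0x00, true}, {0x00, true},
    };
}
```

The wildcards are the important part: addresses baked into a packer's stub change from file to file, so only the opcodes around them are stable enough to act as a fingerprint.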

So PEiD is something of an unsung hero, in that it only ever runs for five seconds at a time, perhaps once a week (at least on my computer), but yet when used properly it can have a profound effect on the development of any RCE project. And it is for this reason that I hereby sing its heroism for all to hear.

Case study: Fraps

November 22nd, 2007 Greg

One of the topics that I often find myself bluffing through on GameDev is Direct3D hooking. In particular, how to display an overlay of your own on the window of another Direct3D program, often a commercial game. It’s pretty clear that the simplest method would involve somehow hooking the call to IDirect3DDevice8/9/10::Present, but the details are a little sketchy, particularly when you throw anti-hack systems into the mix. To be quite honest, I wasn’t sure I’d have been able to write a scalable hook that wouldn’t cause any incompatibilities - at least not without doing some very immoral things. So when I found out that Fraps has been doing exactly this for years, and that it somehow manages to avoid angering PunkBuster and other such systems, I decided to investigate.

What is Fraps?

Simply put, it’s a profiling and video-capture tool for PC games. As well as providing the ability to capture a video stream from any Direct3D 8/9/10, DirectDraw or OpenGL program, it can display real-time performance statistics (frame-rate and such) by means of an overlay on the game’s window (or full-screen display).

Remarkably, Fraps seems to handle the hacky side of all this automatically, with no fuss. It will dig its claws into any compatible game, whether it was started before or after Fraps. Yet if you investigate the state of DirectX and OpenGL’s DLLs on disk (in system32), they remain untouched at all times.

So how does it work?

Well it could be a whole lot worse, but I wasn’t thrilled to find out that every process running on my system had an instance of Fraps.dll loaded. It’s not so much the 106kB footprint that bothers me, but the performance and stability concerns. Anyway, my suspicions that this DLL was ‘infecting’ all processes via a system-wide hook were soon confirmed when OllyDbg caught Fraps.exe making a call to SetWindowsHookEx.

SetWindowsHookEx(WH_CBT, &FrapsProcCBT, hModuleFrapsDLL, 0);

So in one fell swoop, Fraps guarantees that Windows will load a copy of Fraps.dll into every process currently running, as well as those created in future. Moreover, the function FrapsProcCBT will be called by that process each time it attempts to create, destroy, show, hide, move or resize a window. Now this may seem like the perfect solution to a difficult problem, but it’s rather wasteful considering that most processes run window-driven interfaces, and only very few involve DirectX or OpenGL.

How should it work?

Now I haven’t tested this, but a much cleaner way to achieve the same goal would be for Fraps.exe to periodically poll EnumProcesses and EnumProcessModules, so as to determine the processes that actually need hooking. Installing the hooks into these specific processes would require no more work but would save the OS some effort and limit the worst-case-scenario disaster-zone to DirectX and OpenGL applications, which is a considerably smaller domain than almost everything. In Fraps’s defence, the code makes extensive use of IsBadReadPtr and suchlike, and I’ve never heard of it causing any trouble, but nevertheless, the best way to prevent your DLL from crashing someone else’s program is to make sure it never gets loaded.

What does the hook do?

All events but window activation and focus-acquisition (HCBT_ACTIVATE, HCBT_SETFOCUS) fall through the hook chain (CallNextHookEx). But in either of these cases, Fraps.dll goes on to look for a supported graphical interface.

Rather achronologically, the first thing it does at this point is a bunch of string-processing to capitalise and isolate the executable image’s file name. Presumably, the writers would have used GetProcessImageFileName, but bringing psapi.dll along to the party for this reason alone would be borderline-criminal.

Next, GetModuleHandleA is called on opengl32.dll, d3d8.dll, d3d9.dll, dxgi.dll and ddraw.dll. If all return NULL, then there is no work to do and the function returns. But if any of these modules are found, Fraps.dll gets straight to installing its function hooks. The hooks are simply JMP operations assembled ad-hoc at the beginning of IDirect3DDevice9::Release and Present (and presumably the equivalent functions belonging to the other APIs). Now, I was rather surprised to find that PunkBuster has no problem with such crude, unsubtle behaviour, but it’s possible that there is some agreement between the two developers.

That’s almost everything, but one problem remains. Nothing will be drawn to the screen unless d3d9!Present is called, and installation of the patch renders the original functions useless. It is for this reason IAT hooking is preferable to the patch method being used here, but from what I gather PunkBuster periodically ‘fixes’ the IAT of its client process, so that’s no-go. Fraps gets around this little inconvenience in the messy, but reliable way that you’d expect: each time the proxy Present function fires, it removes the patch, calls the original function, and restores the patch.

Here’s some untested C++ concept-code I threw together for the IDirect3DDevice9::Present case. The unbraced snippet installs a patch at address_d3d9_Present to redirect it to PresentHook. Note that Present is a COM method rather than a DLL export, so GetProcAddress won’t find it; read the address from the device’s virtual-function table or use a known offset into d3d9.dll. I’ve omitted the patch-removal code, along with a whole load of sanity-checking that really shouldn’t be left out in such a risky situation. Don’t use my laziness as an excuse.

// Calculate offset
DWORD from_int = reinterpret_cast <DWORD> (address_d3d9_Present);
DWORD to_int = reinterpret_cast <DWORD> (&PresentHook);
 
// This version of the JMP instruction takes an address relative
// to the end of the instruction, and it is 5 bytes long
// So the relative offset is 'to - from - 5'
// Don't worry about the unsigned DWORD underflowing
DWORD offset = to_int - from_int - 5;
 
// Assemble the patch at the beginning of Present
const unsigned char jmp = 0xE9; // The opcode for a 32-bit rel JMP
 
unsigned char* ip = reinterpret_cast <unsigned char*> (address_d3d9_Present);
 
// Code pages are normally read-only, so make these 5 bytes writable first
DWORD old_protect = 0;
VirtualProtect(ip, 5, PAGE_EXECUTE_READWRITE, &old_protect);
*ip = jmp;
*(reinterpret_cast <DWORD*> (ip + 1)) = offset;
VirtualProtect(ip, 5, old_protect, &old_protect);
 
// COM methods are __stdcall and receive 'this' as a hidden first argument
// on the stack (not in ECX), so the hook declares it explicitly
HRESULT __stdcall PresentHook(IDirect3DDevice9* device, const RECT* pSourceRect, const RECT* pDestRect, HWND hDestWindowOverride, const RGNDATA* pDirtyRegion) {
    // Do anything that needs to be done before Present gets called
 
    // Remove the Present hook
    HRESULT return_value = device->Present(pSourceRect, pDestRect, hDestWindowOverride, pDirtyRegion);
    // Reinstall the Present hook
 
    // Do anything that needs to be done after Present gets called
    return return_value;
}
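The remove-and-restore step is left out of the concept-code above, so here is a rough sketch of how it might look. To keep it self-contained it works on a plain byte buffer instead of live code, and takes the 32-bit addresses as numbers; JmpPatch, InstallJmp and RemoveJmp are names I've invented for the example. In a real process you would also need to make the page writable (VirtualProtect) and probably call FlushInstructionCache after each change.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

const std::size_t kJmpLength = 5; // E9 opcode + 32-bit relative offset

// Holds the bytes overwritten by the patch so they can be put back.
struct JmpPatch {
    std::uint8_t original[kJmpLength];
};

// Writes 'JMP rel32' over code[0..4], as if 'code' were located at the
// 32-bit address 'from' and the hook lived at 'to'. Returns the saved bytes.
JmpPatch InstallJmp(std::uint8_t* code, std::uint32_t from, std::uint32_t to) {
    assert(code != nullptr);
    JmpPatch patch;
    std::memcpy(patch.original, code, kJmpLength);

    // The offset is relative to the instruction after the JMP. Unsigned
    // wrap-around gives the right two's-complement value for backward jumps.
    const std::uint32_t offset = to - from - static_cast<std::uint32_t>(kJmpLength);

    code[0] = 0xE9;
    std::memcpy(code + 1, &offset, sizeof(offset)); // little-endian on x86
    return patch;
}

// Restores the original prologue bytes.
void RemoveJmp(std::uint8_t* code, const JmpPatch& patch) {
    assert(code != nullptr);
    std::memcpy(code, patch.original, kJmpLength);
}
```

Inside the hook, the sequence would then be RemoveJmp, call the original, InstallJmp again. Bear in mind that another thread calling Present during that window will sail straight past the hook.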

How I cracked the iTunes 7 DRM, Pt III

November 20th, 2007 Greg

After last time’s failure, things started to become personal. I started exploring all kinds of new avenues and employing many techniques that aren’t so commonly used. In parallel, I drew up a map of the inner-workings of iTunes 7.0.2.16 and began coding up a framework from which to launch a full-scale attack once I knew how.

The program map was an uninteresting flow-chart full of hex strings (mostly addresses of key points) and crossings-out, but the framework was quite a sight to behold. I had no idea that it was such heavy overkill at the time, but so as to keep all bases covered I ended up with two distinct but inseparably linked programs, codenamed DRMBugger and DLLBugger.

DLLBugger is a tool-kit DLL, designed for injection into iTunes so I could execute arbitrary code from within its address-space and hook functions to my heart’s content.

DRMBugger is a purpose-built debugger. When you fire it up, it locates and attaches to any running instance of iTunes as a user-mode debugger, keeping track of all internal goings-on. It wouldn’t take much effort to convert this into a full-blown ring3 debugger, as it has support for module and memory-page enumeration, hardware breakpoints and run-tracing, along with a rather unreliable disassembler.

With the combined power of these two, iTunes was truly at the mercy of my twisted will. If only I knew what I needed to make it do.

My First Stream

I can’t emphasise enough just how much code is executed in the seemingly idle state of iTunes’s AAC playback, but for days on end I would see OllyDbg’s familiar disassembly window whenever I closed my eyes. After tracing backwards and forwards, using stochastic pausing to find the bottlenecks (like a primitive manual code-profiler), comparing run-traces and hit-traces and analysing dead-listings of programs known to use QuickTime’s DRM code, I finally struck gold.

At 0x0062914D, iTunes.exe (7.0.2.16 for Windows XP) calls a function at 0x005C1B20, passing a pointer to a structure containing the address of an encrypted chunk of AAC, along with a pointer to the pseudo-1024-bit decryption key. It was pretty clear that the AAC data was an entire chunk as defined in the stbl atom directory. It was also evident that the key is retrieved from ‘Documents and Settings\All Users\Application Data\Apple Computer\iTunes\SC Info\SC Info.sidb’ using some of the udta atom’s values as an index. I could have investigated this further, but decided that the details are unimportant if DLLBugger can just use this function black-box-style.

So I got straight to writing an inline patch. The idea was simple: Each time this ‘decrypt’ function is called, dump the decrypted buffer to a previously VirtualAlloced array. And sure enough, after waiting for my protected track to play beginning-to-end, I ended up with an encryption-free copy of the AAC stream. I was confident that this had worked, as the data chunks looked the part, according to my limited knowledge of AAC format, but it wasn’t so straightforward to verify. Decrypting the stream is the hard part, sure, but this stream is useless without the remainder of the MP4 file to house it. I spent a little while in a hex-editor piecing things back together and wasn’t surprised when iTunes refused to play the new file. As far as iTunes was concerned, the MP4 file (even with its new m4a extension) still had all the descriptors of an encrypted file, and as you’d expect, decrypting the stream twice didn’t produce anything meaningful.

Fortunately, this frustrating period didn’t last too long. I had to remove all evidence of DRM-related atoms before things got up-and-running, but that means only overwriting a contiguous block of the file with zeros. As it happened, I had already written a fairly complete atom parsing engine into DRMBugger, so rather than let this go to waste, I took things a step further and removed & reshuffled some of these atoms so that the resulting file was indistinguishable from a true ‘m4a’. Reassuringly, the file played just fine in WMP, WinAmp and VLC, as well as iTunes, so things were really starting to look up. The only problem is that it had taken me two and a half hours to remove the DRM from this file (a little trivia: it was ‘The Rejection’ by Dangerous Muse). The next thing on the agenda, after a night of celebration, was to automate the process.

RST decomposition of a general skew-free 3D transformation

November 18th, 2007 Greg

First of all, I refer you to D3DXMatrixDecompose. If you want to break a standard 3D transformation matrix into its rotational, translational and scaling parts, without caring how it’s done, then look no further. If your needs are a little more specific and you’re sure you aren’t reinventing this wheel, then read on.

There is nothing clever about this decomposition, but it’s a question that comes up more often than I’d expect, so here’s the lowdown. I assume that your matrix is in DirectX-standard row-vector representation (so v_out = v_in · M) and that the skew components are zero (M14 = M24 = M34 = 0). You should visualise the matrix like this:

RST Matrix

First, we extract the translation:

v_translation = (M41, M42, M43)

Then the scaling factors:

sx = √(M11² + M21² + M31²)
sy = √(M12² + M22² + M32²)
sz = √(M13² + M23² + M33²)

The rotation matrix is then the upper-left 3×3 minor, after scaling back to unity.

Mrotation = M;

// Remove translation
Mrotation41 = 0;
Mrotation42 = 0;
Mrotation43 = 0;

// Normalise
Mrotation11 /= sx;
Mrotation21 /= sx;
Mrotation31 /= sx;

Mrotation12 /= sy;
Mrotation22 /= sy;
Mrotation32 /= sy;

Mrotation13 /= sz;
Mrotation23 /= sz;
Mrotation33 /= sz;
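For anyone who prefers something they can paste, the steps above can be sketched in plain C++ as follows. Matrix4, RSTParts and DecomposeRST are stand-in names of my own (swap in D3DXMATRIX or whatever you use), and the sketch follows the same convention as the formulas: row vectors, with each scale factor taken as the length of a column of the upper-left 3×3.

```cpp
#include <cassert>
#include <cmath>

struct Matrix4 {
    float m[4][4]; // m[row][col], zero-based: m[0][0] is M11, m[3][0] is M41
};

struct RSTParts {
    Matrix4 rotation;     // pure rotation, translation row cleared
    float scale[3];       // sx, sy, sz
    float translation[3]; // (M41, M42, M43)
};

RSTParts DecomposeRST(const Matrix4& M) {
    RSTParts out;
    out.rotation = M;

    for (int c = 0; c < 3; ++c) {
        // Translation lives in the fourth row
        out.translation[c] = M.m[3][c];
        out.rotation.m[3][c] = 0.0f;

        // Scale factor is the length of column c of the upper-left 3x3
        out.scale[c] = std::sqrt(M.m[0][c] * M.m[0][c] +
                                 M.m[1][c] * M.m[1][c] +
                                 M.m[2][c] * M.m[2][c]);
        assert(out.scale[c] > 0.0f); // a zero scale cannot be normalised

        // Normalise the column back to unit length
        for (int r = 0; r < 3; ++r) out.rotation.m[r][c] /= out.scale[c];
    }
    return out;
}
```

Because the scale factors come out of a square root they are always positive, so a mirrored (negatively scaled) axis ends up folded into the rotation part instead.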

DLL injection via CreateRemoteThread

November 15th, 2007 Greg

This isn’t exactly news, but I thought I’d briefly run through the now standard method of injecting a DLL of your choice into an arbitrary process under 32-bit Windows. It will serve as a foundation for the upcoming post on function hooking via DLL injection.

So you have analysed a target program, know how it works and what you’d like to make it do, but there isn’t enough room for an inline patch. Or maybe you need to hook an existing function in order to spy on or modify data being passed back and forth. Perhaps add some functionality or a status window? If any of this sounds familiar then you should consider DLL-based code-injection. It is the cleanest method of injection (if there ever were such a thing) and it offers a lot of scalability and maintainability that a quick-and-dirty inline/diversion patch lacks.

The idea is simple, but very clever. We abuse CreateRemoteThread to call LoadLibrary on a string that was injected by means of WriteProcessMemory. The method hinges on the fact that LoadLibrary takes a single parameter. In eight steps:

  1. Locate the target process in terms of its process-id. Depending on the situation, any of EnumProcesses, CreateProcess, FindWindow and GetWindowThreadProcessId may come in handy.
  2. Acquire a handle to it by calling OpenProcess. This isn’t necessary in certain cases, such as when the target is spawned using CreateProcess, which returns a handle automatically.
  3. Allocate some space in the target process for a string containing the path of the DLL, using VirtualAllocEx. Remember to make enough room if using Unicode strings.
  4. Fill the new buffer out with the path of the DLL by means of a call to WriteProcessMemory.
  5. Call GetProcAddress to get a pointer to LoadLibrary. While there are no guarantees in the Win32 process specification that this function will be in the same place for every process, a little reverse-engineering and common-sense makes it a very safe bet. If the next XP Hotfix were to change the base of kernel32, user32 or ntdll - none of which are relocatable - then a considerable amount of both Microsoft and non-Microsoft code would break and all hell would break loose.
  6. Call CreateRemoteThread on the target process, passing the address of LoadLibrary as the LPTHREAD_START_ROUTINE and the address of the newly initialised string buffer as the parameter. Technically, CreateRemoteThread wasn’t designed to deal with this, but in terms of implementation everything is kosher.
  7. When LoadLibrary returns, the thread terminates with the return-value as its exit code. So if we call GetExitCodeThread, the result will be the base of the DLL in the target process, or NULL if the call failed.
  8. VirtualFreeEx will take care of the now useless string buffer and the process handle should be freed if no more magic needs doing.

This seems like quite a lot of work for such a common operation, so it’s a good idea to code it up in a reusable way. Here’s the implementation as pasted from the DLL-hooking interface I’m working on:

HMODULE DLLInjection::InjectDLL(DWORD process_id) {
    // Open Process
    process_handle = OpenProcess(PROCESS_ALL_ACCESS, FALSE, process_id);
    if (process_handle == NULL) return NULL;
 
    // Allocate space for string to contain the DLL Path
    SIZE_T path_length = dll_path.size() + 1;
    void* remote_buffer = VirtualAllocEx(process_handle, NULL, path_length * sizeof(char), MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
 
    bool success = false;
    if (remote_buffer != NULL) {
        SIZE_T bytes_written = 0;
        WriteProcessMemory(process_handle, remote_buffer, dll_path.c_str(), path_length, &bytes_written);
        if (bytes_written == path_length) {
            DWORD thread_id = 0;
            HMODULE kernel32 = GetModuleHandleA("Kernel32");
            LPTHREAD_START_ROUTINE remote_lla = reinterpret_cast <LPTHREAD_START_ROUTINE> (GetProcAddress(kernel32, "LoadLibraryA"));
            if (remote_lla != NULL) {
                HANDLE thread = CreateRemoteThread(process_handle, NULL, 0, remote_lla, remote_buffer, 0, &thread_id);
                if (thread != NULL && thread_id != 0) {
                    // Only trust the exit code if the thread really finished;
                    // after a timeout it would read as STILL_ACTIVE (non-zero)
                    if (WaitForSingleObject(thread, 5000) == WAIT_OBJECT_0) {
                        DWORD exit_code = 0;
                        GetExitCodeThread(thread, &exit_code);
                        if (exit_code != 0) {
                            success = true;
                            remote_base = reinterpret_cast <void*> (exit_code);
                        }
                    }
                    CloseHandle(thread);
                }
            }
        }
        // MEM_RELEASE requires a size of zero
        VirtualFreeEx(process_handle, remote_buffer, 0, MEM_RELEASE);
    }
 
    if (success) return reinterpret_cast <HMODULE> (remote_base);
    return NULL; // Fail
}

LDR tone-mapping and how to do it properly

November 13th, 2007 Greg

I’m a huge fan of post-processing in games. It seems that no matter what I’m writing, I can’t resist the temptation to install an over-the-top bloom effect and some tone-mapping. And that’s me being conservative. The great thing about tone-mapping is that you can throw it on the end of just about any rendering pipeline and instantly glitz up the visuals, giving it that ‘digitally remastered’ feel.

Tone-Mapping

So what is tone-mapping? Well, it’s a post-processing effect that remaps the render’s colour-dynamics to change the overall appearance of the game. Some of the more common tone-mapping operations are contrast & brightness, saturation and HDR exposure. HDR tone-mapping is an art unto itself and it can only be used for HDR pipelines (with floating-point render-targets and HDR textures), so I’ll restrict the conversation to the more universal LDR tone-mapping.

The Conjugate Transform

If you plan on doing anything interesting in a tone-mapping pass, then it’s rather necessary, for the sakes of performance, readability and maintainability, to convert to a more suitable colour-space than RGB. The first such space that springs to mind is HSL, and indeed tone-mapping in HSL is like a gentle walk in the park, but it’s wise to look a little further afield to YCC. But why YCC? Sure, it does offer a luma component for brightness & contrast mapping, but the saturation is tied up in the two chroma components. Granted, this is a bad thing, but it’s not nearly as bad as the cost of a full RGB->HSL->RGB conversion.

The Problem With HSL

I spent a fair while trying to optimise this HSL-detour code in HLSL, hoping that I could make it viable for small shaders, but came out rather disappointed. Despite the availability of vector SIMD instructions, the piecewise-linear nature of the transformation demands a worryingly large number of conditional branches, and unless you have the luxury of Shader Model 4’s true branching, this amounts to a horror story of register-juggling and lerp operations. I didn’t try too hard, but believe it’s impossible to complete the transformation-and-back through HSL in under 100 ps_3_0 operations, which immediately rules out the possibility of assembling on a Shader Model 2 target platform.

YCC To The Rescue

Contrast this with the simplicity of the truly linear RGB->YCC->RGB transformation. If there’s one thing that the GPU does best, it’s vector-matrix multiplication, and that’s exactly what this boils down to:

float4x4 RGBToYCC = 
{ 0.299,  0.587,  0.114,  0.000,
  0.701, -0.587, -0.114,  0.000,
 -0.299, -0.587,  0.886,  0.000,
  0.000,  0.000,  0.000,  1.000};
 
float4x4 YCCToRGB = 
{ 1.000,  1.000,  0.000,  0.000,
  1.000, -0.509, -0.194,  0.000,
  1.000,  0.000,  1.000,  0.000,
  0.000,  0.000,  0.000,  1.000};
 
float4 PS_LDRToneMap(float2 tex_coord : TEXCOORD0) : COLOR
{
    float4 RGBA = tex2D(linear_sampler, tex_coord);
    float4 YCCA = mul(RGBToYCC, RGBA);
 
    // Work goes here
 
    RGBA = mul(YCCToRGB, YCCA);
    return saturate(RGBA);
}

This assembles to a handsome 9 instructions, leaving plenty of room even with the arcane ps_1_4’s instruction limit.
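Since the inverse matrix is rounded to three decimal places, it's worth convincing yourself that the round trip really does hand back the colour you started with. The following CPU-side sketch does that using the upper-left 3×3 of each matrix above; Vec3 and Mul are throwaway names for the example, and Mul treats the vector as a column, which is what mul(matrix, vector) does in HLSL.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

// Same coefficients as the upper-left 3x3 of the HLSL matrices
const float kRGBToYCC[3][3] = {
    { 0.299f,  0.587f,  0.114f},
    { 0.701f, -0.587f, -0.114f},
    {-0.299f, -0.587f,  0.886f}};

const float kYCCToRGB[3][3] = {
    {1.000f,  1.000f,  0.000f},
    {1.000f, -0.509f, -0.194f},
    {1.000f,  0.000f,  1.000f}};

// Matrix times column vector
Vec3 Mul(const float m[3][3], const Vec3& v) {
    assert(m != nullptr);
    return {m[0][0] * v.x + m[0][1] * v.y + m[0][2] * v.z,
            m[1][0] * v.x + m[1][1] * v.y + m[1][2] * v.z,
            m[2][0] * v.x + m[2][1] * v.y + m[2][2] * v.z};
}

// RGB -> YCC -> RGB with no work done in between
Vec3 RoundTrip(const Vec3& rgb) {
    return Mul(kYCCToRGB, Mul(kRGBToYCC, rgb));
}
```

The two chroma rows are simply R - Y and B - Y, so any grey input comes out with both chroma components at (very nearly) zero. The rounding in the green row of the inverse costs well under one part in a thousand, which is invisible at 8 bits per channel.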

The Prize

My current project makes use of this code to ramp the contrast and saturation up and down, according to the scene. The code is simple, and the results rather dramatic.

// Contrast
YCCA.x -= contrast_midpoint;
YCCA.x *= contrast_gain;
YCCA.x += contrast_midpoint;
 
// Chroma
YCCA.y -= chroma_red_midpoint;
YCCA.y *= chroma_red_gain;
YCCA.y += chroma_red_midpoint;
 
YCCA.z -= chroma_blue_midpoint;
YCCA.z *= chroma_blue_gain;
YCCA.z += chroma_blue_midpoint;

LDR Tone-Mapping

How I cracked the iTunes 7 DRM, Pt II

November 11th, 2007 Greg

So I had the motivation; it was time for action. The first step in undertaking such a large project is to research: Research like a maniac until Google dries up. It took me three days (on top of work) until I was happy that there was no more pre-invented wheel to take advantage of. The fruits of my labour boiled down to two busy flowcharts, each occupying a full page, which I kept pinned at head-level on the notice-board next to me for the following six weeks. Unfortunately, they seem to have been lost in the process of moving back from university, which isn’t surprising considering how much paper is ‘archived’ after finals.

Just to make sure everyone’s on the same page, here’s a quick overview of some of my findings:

  • The iTunes Store’s music is distributed exclusively under their ‘m4p’ file extension, as opposed to ‘m4a’ which is the default for music encoded by iTunes. Presumably, ‘a’ stands for ‘audio’ and ‘p’ for ‘protected audio’.
  • There is nothing special about an m4a file other than the extension. It is just a common or garden MP4 file with a single AAC audio stream.
  • MP4 files consist of a nested structure of ‘atoms’, which are variable-sized general-purpose binary data structures whose contents are identified by a four-character string. There was very little available information about the atomic structure of m4p files, so I had to throw together a C# program to display the tree of identifiers (96kB exe download).
  • Deep inside this tree-structure lies the ‘mdat’ atom containing the AAC audio stream, which occupies the majority of the file.
  • The only consistent differences between an ‘m4a’ and an ‘m4p’ are:
    • The AAC stream is encrypted for DRMed files, using a proprietary variant of the Rijndael cipher.
    • An ‘iods’ (Initial Object Descriptor) atom is present only for protected files (at moov/mvhd/iods). This turns out to be insignificant for our purposes, but from what I gather it is generated by the software that installs the protection scheme, to store some information about the original file.
    • The seemingly innocent ‘udta’ (User Data) atom contains a rogue null dword to trick most atom viewers into thinking it has no contents.
    • The DRMed ‘moov/udta/meta/ilst’ atom actually contains a whole new set of undocumented atoms that tell iTunes all about your iTunes Store account, encryption keys and so on.
New udta atoms
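To give a feel for the atom structure described in the list above, here is a small sketch of a tree-lister. It is written in C++ for consistency with the rest of the code on this site and is not the C# viewer mentioned earlier. It assumes the basic layout of a 32-bit big-endian size (which includes the 8-byte header) followed by the four-character type; the list of container atoms is a guess that is far from complete, and the special sizes 0 and 1 are deliberately not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Reads a 32-bit big-endian integer, as used for atom sizes.
std::uint32_t ReadBE32(const std::uint8_t* p) {
    return (std::uint32_t(p[0]) << 24) | (std::uint32_t(p[1]) << 16) |
           (std::uint32_t(p[2]) << 8) | std::uint32_t(p[3]);
}

// Assumed (incomplete) list of atoms that contain nothing but other atoms.
bool IsContainer(const std::string& type) {
    return type == "moov" || type == "trak" || type == "mdia" ||
           type == "minf" || type == "stbl" || type == "udta";
}

// Appends one indented line per atom found in data[begin, end).
void ListAtoms(const std::vector<std::uint8_t>& data, std::size_t begin,
               std::size_t end, int depth, std::vector<std::string>& out) {
    assert(end <= data.size());
    std::size_t pos = begin;
    while (pos + 8 <= end) {
        const std::uint32_t size = ReadBE32(&data[pos]);
        const std::string type(reinterpret_cast<const char*>(&data[pos + 4]), 4);

        // Stop on anything malformed; sizes 0 and 1 are not handled here
        if (size < 8 || size > end - pos) break;

        out.push_back(std::string(depth * 2, ' ') + type);
        if (IsContainer(type)) {
            ListAtoms(data, pos + 8, pos + size, depth + 1, out);
        }
        pos += size;
    }
}
```

Calling ListAtoms(file_bytes, 0, file_bytes.size(), 0, lines) on a whole file fills lines with an indented outline. A viewer built this way is exactly the kind that gets fooled by the rogue null dword in ‘udta’: the first child appears to have a size of zero, so the walk gives up and the atom looks empty.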

The first hurdle was convincing iTunes to run under a debugger. Many released programs don’t like the idea of being debugged, ’cause it almost always means bad news. Fortunately, iTunes didn’t put up much of a fight and after NOPing a couple of calls to IsDebuggerPresent, iTunes and OllyDbg became like peas and carrots.

In the absence of any meaningful variable or function names (other than imported DLL functions), my first thought was to trace backwards from the opening of the media file towards the decryption routine. From here, I could analyse the algorithm and attempt to rewrite it to decrypt the files on disc. So I started out with a breakpoint on CreateFileW, to catch the point where iTunes loads in the ‘m4p’ file, waited for kernel32.dll to do its thing, then set a conditional breakpoint on ReadFileEx (the extended version is used because the file is read asynchronously). This landed me immediately in QuickTime’s audio-playback engine. I spent a while flicking back-and-forth between QuickTime and iTunes code before I found a very loop-heavy function full of SSE instructions. It could only be the decryption routine or, more likely, the AAC decoder. After an hour or so of manual stepping and comparing run traces of m4p and m4a playback, I determined that it was indeed the AAC decoder, and began to get a good feel for the intricate details of the playback engine.

This is the part where I omit a day’s worth of results, because the minute I realised that what I’d been analysing for what felt like forever was actually just a tiny fraction of the system I’d set out to understand, I fled like a little girl. This monster was clearly a lot more complex than I had initially anticipated. So I returned to the drawing board, not feeling defeated, but optimistically empowered. This DRM wasn’t going to fall very easily. I simply didn’t have the man-power or insight for the usual divide-and-conquer approach (nobody knows how DVD Jon does it - he’s a glitch in the matrix). If I was ever to succeed then I’d need to work smarter, not harder. So that’s exactly what I did, but not before taking a day off.

Compatible X-file HLSL-based vertex-blending with D3DX

November 9th, 2007 Greg

On the whole, D3DX does a great job of making our lives easier, us Direct3D 9 programmers. But one topic that has generated a lot of confusion yet very little documentation is the correct usage of the BLENDINDICES shader semantic.

If you’re having trouble getting your CPU and GPU to communicate blend-indices correctly, the first thing to do is make sure that your source data is valid. If you’re using a custom mesh-loading routine then you’re on your own, but certainly under D3DXLoadMeshHierarchyFromX you can save yourself a lot of potential grief by prototyping with a reliable X-file. Support for X-file exporting is very shaky for most 3D modelling suites, but the SDK’s tiny.x sample is as close as you’ll get to a standard model. Once Tiny is happily running around in her fighter-pilot’s uniform you can start thinking about trying out your own models.

Nobody seems to be sure just what went wrong when the late ARB drafted up the usage semantics of blending parameters, but many current video cards lack hardware support for the UBYTE data type needed to store the indices. For this reason, developers have found themselves hammering unsigned-byte-quadruplets into the shape of D3DCOLORs, lying about the data content and praying that the vertex pipeline will have magically transformed the output into something that can be jimmied back into a tuple of UBYTEs. As far as I can see, this is the mother of all DirectX hacks, but it also seems to be the universal standard. That’s the only reason I’ll sleep at night after posting this.

After successfully calling ID3DXSkinInfo::ConvertToIndexBlendedMesh, the blend-index data is all present and correct. The challenge is to pipe it to the shader without misinterpretation. For that, I found it necessary to set all index data in the vertex declaration to be passed as D3DCOLOR data (DWORDS in disguise). By the way, if you are still using FVFs for something as finicky as GPU skinning then you don’t deserve any help: bite the bullet and write out those overly verbose, but oh-so-flexible vertex declarations.

// Fix UBYTE4 Support
D3DVERTEXELEMENT9 decl[MAX_FVF_DECL_SIZE];
skinned_mesh->GetDeclaration(decl);
{
    int i = 0;
    while (decl[i].Stream != 0xFF) // D3DDECL_END() marks the end with Stream == 0xFF
    {
        if (decl[i].Usage == D3DDECLUSAGE_BLENDINDICES)
        {
            decl[i].Type = D3DDECLTYPE_D3DCOLOR;
            break;
        }
        ++i;
    }
}
skinned_mesh->UpdateSemantics(decl);

This takes care of the program code. All that remains is to interpret the data accordingly in the shader. For this, Microsoft generously provided us with the D3DCOLORtoUBYTE4 intrinsic. HLSL (as of SM3) doesn’t have true integer support, let alone unsigned byte support, so we use their emulated int4 vector.

VS_OUTPUT VS_SkeletalBlend
(float3 pos           : POSITION0,
 float1 blend_weight  : BLENDWEIGHT0,
 float4 blend_indices : BLENDINDICES0,
 float3 normal        : NORMAL0,
 float2 tex_coord     : TEXCOORD0)
{
    int4 indices = D3DCOLORtoUBYTE4(blend_indices);
 
    float weight0 = blend_weight;
    float weight1 = 1 - weight0;
    int ind0 = indices.x;
    int ind1 = indices.y;
 
    // ...
}

Notice that the semantic here is BLENDINDICES, even though the vertex declaration says D3DCOLOR. Note further that the type becomes, somewhat wastefully, a float4 as all colour values are passed as such. The sample here is for a two-weights-per-vertex animation, but it’s easy enough to scale this up to four. If you need more than four weights then you need to have a word/fight with your artist, but the idea is the same, only you pass another parameter just like the first. Now I’m not certain that this is the best configuration - it certainly looks like a horrid mess - but it’s the only combination I know to work, after hours and hours of painful Googling. Suggestions for improvement are welcome.
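If the round trip through D3DCOLOR feels like black magic, the following CPU-side sketch emulates it. This is my understanding of the behaviour, not anything lifted from the runtime: a D3DCOLOR element is four bytes laid out B, G, R, A in memory and expanded to a normalised (r, g, b, a) float4, and D3DCOLORtoUBYTE4 undoes this by swizzling to zyxw and scaling by 255.001953. The type and function names are made up for the example.

```cpp
#include <cassert>
#include <cstdint>

struct Float4 { float x, y, z, w; };
struct Int4 { int x, y, z, w; };

// How a D3DCOLOR vertex element reaches the shader: the bytes in memory
// are B, G, R, A, and each is normalised to the range [0, 1].
Float4 ExpandD3DColor(const std::uint8_t bytes[4]) {
    assert(bytes != nullptr);
    return {bytes[2] / 255.0f, bytes[1] / 255.0f,
            bytes[0] / 255.0f, bytes[3] / 255.0f};
}

// Emulates D3DCOLORtoUBYTE4: undo the swizzle and rescale to 0..255.
// The factor is slightly more than 255 so that truncation towards zero
// lands on the original byte despite floating-point rounding.
Int4 ColorToUByte4(const Float4& c) {
    const float k = 255.001953f;
    return {int(c.z * k), int(c.y * k), int(c.x * k), int(c.w * k)};
}
```

So four blend indices written to the vertex buffer as plain bytes come back out of ColorToUByte4(ExpandD3DColor(bytes)) in their original order, which is why indices.x and indices.y in the shader above refer to the first and second bones.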