Hey!
The basic principle behind an emulator is to act as a host for an unknowing guest application. From the application's point of view, it should not be able to tell that it's not running on the actual system. There are a few basic considerations:
Speed
The clock speed of the Xbox360 was 3.2 GHz, which is more than most current PCs. Any kind of emulator should be able to keep up with that. The good thing is that the Xbox's CPU uses an in-order RISC architecture, which in practice means lots of stalls, so that was giving me some hope. As always with CPU emulation, there are a few choices:
Runtime interpretation - which means instructions for the CPU would be dynamically decoded and executed, probably using a big "switch" statement, a fancy jump table, etc. While straightforward to implement, this method has huge overhead. Besides adding a runtime decoding cost to every instruction, it blocks A LOT of optimization possibilities - for example, I would have to treat all immediate values as dynamic data, and I wouldn't be able to do any optimizations between instructions, as I would be getting only one instruction at a time.
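To make the interpreter option a bit more concrete, here is a minimal sketch of the "big switch" approach. It only handles two PowerPC instructions (addi and ori), and the names CpuState, encode_d_form and interpret_one are made up for this illustration, not taken from any real emulator.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical guest CPU state: just the 32 general-purpose registers.
struct CpuState {
    uint64_t gpr[32] = {};
};

// Hypothetical helper that builds an instruction word with a 6-bit primary
// opcode, two 5-bit register fields and a 16-bit immediate.
uint32_t encode_d_form(uint32_t opcode, uint32_t a, uint32_t b, uint16_t imm) {
    assert(opcode < 64 && a < 32 && b < 32);
    return (opcode << 26) | (a << 21) | (b << 16) | imm;
}

// Decode and execute a single instruction word.
// Returns false for any opcode this sketch does not handle.
bool interpret_one(CpuState& cpu, uint32_t instr) {
    const uint32_t opcode = instr >> 26;         // primary opcode (top 6 bits)
    const uint32_t field1 = (instr >> 21) & 31;  // rD for addi, rS for ori
    const uint32_t field2 = (instr >> 16) & 31;  // rA
    const uint16_t imm = static_cast<uint16_t>(instr & 0xFFFF);

    switch (opcode) {
    case 14: {  // addi rD, rA, SIMM (rA == 0 means the literal value 0)
        const int64_t simm = static_cast<int16_t>(imm);  // sign-extend
        const uint64_t base = (field2 == 0) ? 0 : cpu.gpr[field2];
        cpu.gpr[field1] = base + static_cast<uint64_t>(simm);
        return true;
    }
    case 24:  // ori rA, rS, UIMM
        cpu.gpr[field2] = cpu.gpr[field1] | imm;
        return true;
    default:
        return false;
    }
}
```

Note how the immediate has to be extracted and sign-extended at runtime for every single instruction, which is exactly the "immediates as dynamic data" cost mentioned above.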
JIT compiler - (Just-In-Time) initially seems like a solution to the previous problems - whenever you enter a new block of code and you don't have an optimal version of it prepared, then compile it on the spot and run the optimal version. In general this works well, but it has two disadvantages:
- there's a cost every time a new code block is visited (that can cause hitches)
- any kind of global (cross code-blocks) optimization is not easily achievable
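The "compile on first visit" idea can be sketched as a cache of compiled blocks keyed by guest address. This is only a rough illustration: BlockCache, HostBlock and BlockCompiler are hypothetical names, and the actual compiler is passed in as a callback, since generating real machine code is far beyond a short example.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>

// Hypothetical types: a compiled block is just a callable here.
using HostBlock = std::function<void()>;
using BlockCompiler = std::function<HostBlock(uint32_t guest_address)>;

class BlockCache {
public:
    explicit BlockCache(BlockCompiler compiler) : compiler_(std::move(compiler)) {}

    // Run the block at guest_address, compiling it first if it was never seen.
    void run(uint32_t guest_address) {
        auto it = blocks_.find(guest_address);
        if (it == blocks_.end()) {
            // First visit: this is where the compilation cost (the hitch) is paid.
            HostBlock compiled = compiler_(guest_address);
            assert(compiled != nullptr);
            ++compile_count_;
            it = blocks_.emplace(guest_address, std::move(compiled)).first;
        }
        it->second();
    }

    int compile_count() const { return compile_count_; }

private:
    BlockCompiler compiler_;
    std::unordered_map<uint32_t, HostBlock> blocks_;
    int compile_count_ = 0;
};
```

Each block is compiled in isolation when it is first reached, which is also why optimizing across blocks is awkward in this scheme.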
The lack of global optimization seemed important due to the shittiness of the ABI (Application Binary Interface) used on Xbox360 (and on PowerPC in general), especially the number of non-volatile registers that have to be stored/restored in function calls + the general performance of argument passing (when emulated). Being able to optimize around that may make or break the whole system. So, I decided to go with...
Recompilation - which means decompiling all existing code and recompiling it into a new form. This is done in one big step for the whole executable; there's no JIT and no runtime interpretation. The major disadvantage of this solution is that it's the most complex one and requires, in general, a LOT of code to be written before you can even see the first Hello World. Well, I like writing code :)
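As a rough idea of what recompiled code could look like, here is a hand-written sketch of the C++ a recompiler might emit for a three-instruction guest sequence. The function and struct names are hypothetical, and the real generated code would of course look different.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical guest register file.
struct GuestRegs {
    uint64_t r[32] = {};
};

// Hypothetical recompiler output for the guest sequence:
//   li   r3, 5
//   addi r4, r3, -1
//   ori  r5, r4, 0x10
// The immediates are now compile-time constants, so the host compiler
// is free to fold and reorder across the three instructions.
void recompiled_block(GuestRegs& regs) {
    regs.r[3] = 5;
    regs.r[4] = regs.r[3] - 1;
    regs.r[5] = regs.r[4] | 0x10;
}

// Small self-check showing the expected register contents.
void check_recompiled_block() {
    GuestRegs regs;
    recompiled_block(regs);
    assert(regs.r[3] == 5);
    assert(regs.r[4] == 4);
    assert(regs.r[5] == 0x14);
}
```

There is no decoding left at runtime, and nothing stops the host compiler from collapsing the whole block into three constant stores.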
Memory
Two things here were puzzling me:
Endianness - The PowerPC used in Xbox360 is configured to work in Big Endian mode, which is different from the normal Intel architecture, which is Little Endian. This means that the ordering of bytes in memory will be different.
The question was: could I emulate a Big Endian (BE) system on a Little Endian (LE) CPU using recompilation (so with generated code) without wasting too much performance? I would like to use native operations (addition, multiplication, etc.) on the host system in every case instead of reimplementing BE equivalents.
On x86 Intel there's a bswap instruction (exposed in MSVC through the intrinsics _byteswap_ulong, _byteswap_uint64, etc.: https://msdn.microsoft.com/en-us/library/a3140177.aspx ). It has only one cycle of latency and it hides itself really well among all the other things happening around it anyway - I couldn't see any measurable difference with or without it.
Secondly, the size of every memory access on PowerPC is always explicit (byte, half word, word, etc.), so you know exactly how many bytes to swap every time.
So, it would seem that in principle it's possible, by keeping the memory layout consistent with the Xbox (so it's BE) while all the data on the host side (like values in registers) is LE. Every time there's a load/store or any other instruction accessing memory, we will have to swap the byte order.
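The scheme above can be sketched roughly like this: guest memory stays big-endian, and every load/store swaps bytes on the way in or out. GuestMemory, swap32, load32 and store32 are hypothetical names, the swap is written portably instead of using the MSVC intrinsics, and the code assumes a little-endian host.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Portable 32-bit byte swap. On MSVC, _byteswap_ulong could be used instead.
inline uint32_t swap32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

// Hypothetical guest memory: the bytes are kept in big-endian (guest) order.
// Assumes the host is little-endian, so every access needs a swap.
struct GuestMemory {
    std::vector<uint8_t> bytes;

    explicit GuestMemory(size_t size) : bytes(size) {}

    // Load a 32-bit word and convert it to host byte order.
    uint32_t load32(uint32_t addr) const {
        assert(addr + 4ull <= bytes.size());
        uint32_t raw;
        std::memcpy(&raw, &bytes[addr], sizeof raw);
        return swap32(raw);
    }

    // Convert a host-order value to guest byte order and store it.
    void store32(uint32_t addr, uint32_t value) {
        assert(addr + 4ull <= bytes.size());
        const uint32_t raw = swap32(value);
        std::memcpy(&bytes[addr], &raw, sizeof raw);
    }
};
```

Once a value is loaded it is a plain host integer, so something like mem.store32(4, mem.load32(0) + 1) uses the native add with no BE-specific arithmetic.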
Consistency - The question here is more low-level: basically, does PowerPC have any special memory consistency model that would NOT be easily achievable on my host PC?
So far, the answer seems to be "no" - mostly due to the in-order nature of the CPU, the explicit memory fences, and the explicit and very simple atomic operations on the PowerPC. Also, the memory consistency model on a typical x64 machine is not that relaxed by default.
There's another issue of GPU/CPU memory coherency, but I will write about this later.
GPU
In general, the GPU used in Xbox360 was an ATI Radeon from the R500 family. There is source code for a Linux driver for that GPU available on the internet and A LOT can be taken from that. I don't think that the whole project would be doable without it.
Other minor things consist of emulating a slightly different memory organization and, of course, emulating the whole GPU functionality using DX11. In the future, memory management may get simpler with DX12, but honestly I'm a little bit afraid about performance with DX12 in this particular case.
I will not get into details here as they are hairy, and I would say that the GPU is the most complicated part of the whole thing so far. I will elaborate more in another post.
OS
Any application (including games from Xbox360) will call lots of OS functions. Again, the question is: will I be able to emulate them all? Do I need to emulate them all? In general - no, a lot of the functions can be faked and they work just fine.
Luckily, almost all of the important OS functions are cleanly imported (similarly to function imports in DLLs), and creating fake/substitute implementations for them is very easy in practice. Also, most of the functions are named exactly like (or very close to) the "normal" Windows ones, so at least I didn't feel like I was in a totally alien environment. The rest can be found again via Google (and in the DDK if necessary).
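Substituting imports could be sketched as a simple name-to-handler table. Everything here is hypothetical (ImportTable, GuestCpu, and any import names you register are placeholders, not real entries from the Xbox360 import table). The only thing taken from the PowerPC calling convention is that integer arguments start in r3 and the return value goes back in r3.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical guest CPU state; arguments arrive in r3, r4, ... and the
// result is returned in r3.
struct GuestCpu {
    uint64_t r[32] = {};
};

using ImportHandler = std::function<void(GuestCpu&)>;

class ImportTable {
public:
    // Register a host-side replacement for an imported OS function.
    void add(const std::string& name, ImportHandler handler) {
        assert(handler != nullptr);
        handlers_[name] = std::move(handler);
    }

    // Look up an import by name. Anything not registered gets a "fake"
    // stub that just reports 0 back to the guest.
    ImportHandler resolve(const std::string& name) const {
        const auto it = handlers_.find(name);
        if (it != handlers_.end()) {
            return it->second;
        }
        return [](GuestCpu& cpu) { cpu.r[3] = 0; };
    }

private:
    std::unordered_map<std::string, ImportHandler> handlers_;
};
```

Whether returning 0 is a sensible default depends entirely on the function being faked; in practice each stub would need to be checked against what the caller expects.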
To sum up
Surprisingly, there's no major roadblock that, to my knowledge, would prevent this from working (in principle). The open-ended question was (and still is): will the whole thing run fast enough to be practical? :D
Thanks for reading,
Dex