Projects‎ > ‎

LLBoy

Overview

LLBoy is a Nintendo® GameBoy™ emulator based upon Imran Nazar's jsGB. Its distinguishing feature is incremental translation of the GameBoy cart to native instructions with LLVM. Source Code is currently available on GitHub.

Goals

LLBoy doesn't aim to be a complete GameBoy emulator. Its core purpose is as a vehicle for becoming familiar with LLVM, C/C++, CMake, and QT. The core functional goal is to play a mostly native version of a single-banked game (eg, Tetris) with user input, video, and no sound. Subsequent goals would include memory banking, audio, aggressive optimization, save states, and a friendly debugger.

Piracy Concerns

As with any emulator, concerns will be raised about the possibility of piracy. To alleviate these concerns, the following policies and design goals are in effect:
  • Games, ROMs, or information pointing to them will not be distributed by the project
  • Software will be designed with a requirement for an original version of the game cartridge
  • It is up to the end user to handle the format shifting of their original cartridge (fair use)
  • Hardware emulation shall be based upon publicly available information on the assumption that it has been legitimately published

Implementation Roadmap

  1. Core Ops - Build enough opcodes to execute the BIOS (complete)
  2. Naïve Block Generator - Build the Game Loop's switch statement out of straight runs with no splitting, block chaining, or advanced optimization. (in progress)
  3. Remaining Ops - Finish building any ops that have not been implemented.
  4. Block Splitter - Split blocks that are being jumped in to, reducing duplicate code and allowing block chaining.
  5. UI - A limited User Interface
  6. Jump to Block - Set blocks as jump targets after splitting, if possible.
  7. Benchmark - Create a performance benchmark that can be run across emulators
  8. Optimization - Play with other optimization passes

Underlying Technologies

The primary selection criteria for the involved technologies and libraries were personal interest and ease of use. These also seemed to coincide with the "best of breed" and "most popular" choices in the software ecosystem, though there are many other alternative choices that would be more appropriate given different constraints for memory, performance, or licensing.

jsGB

Porting jsGB is the simplest start as it is implemented in JavaScript, with which I am reasonably familiar. The system is also accompanied by a large amount of high-level documentation, which saves a great deal of time on trying to understand low-level hardware documentation. The JavaScript basis also rules out a large number of additional complexity to do with platform dependencies, multimedia, and build processes. 

LLVM

LLVM is a particularly interesting tool, in that it can be used to generate and optimize native bytecode with a reasonably friendly C++ API. It was selected on the basis of its relative popularity and ability to store its intermediate bitcode representation (generated by a simple call to clang -c --emit-lvm) for future use with other dynamically-generated functions. LLVM has also become somewhat "trendy" in compiler circles thanks to its inclusion on mainstream Apple hardware. The llvm-qemu project has also done some experimental work on  dynamic binary translation in the context of the QEmu x86 emulator, indicating that this is a somewhat viable approach despite their observed slowdown and memory/compilation overhead.

QT

The QT Framework is a very solid cross-platform library that supports all of the major desktop environments (plus Linux) and mobile. It also offers a rich library and abstractions for native system functions, file management, threading, and user interface, possibly rivalling the Java class library. Only a small subset of the available features will see any use, as most of the complexity lies in the emulation backend. QT will be used to handle basic user input and to provide a drawing surface to the emulator.

C/C++ and CMake

The choice to use QT and LLVM necessitates usage of C/C++ as there aren't any Java bindings, and OCaml is not on my (current) list of interests. It is also nice to work with something other than Java for a change -- a variety in development languages breeds adaptability in the face of change, and allows for hedging one's bets given the uncertainty of the Java platform at the hands of Oracle. It would still be interesting to build a Java port with BCEL or ASM, though.

To support C/C++ with unusual intermediary build steps, CMake proves to be an interesting cross-platform build and testing system. It has the advantage of being something other than the Autotools toolchain. It also integrates reasonably well with QT thanks to its heavy involvement in KDE. Profiling the application will prove interesting, possibly requiring some time with DTrace.

Technical Overview

Simplifying Assumptions

To reduce initial implementation complexity and duration, several aspects of the system are being ignored as they are not core to the main goals:
  • Immediate-mode instructions and their data are on the same memory segment (eg, a bank switch will not change an instruction's arguments)
  • Multi-byte instructions will not lie on two halves of a memory segment, and will not be altered by a bank switch
  • Multi-byte instructions are atomic, and the system will never jump into the middle of one (eg, 0xCB00 RLC B cannot be used as a 0x00 NOP)
  • Bytecode is not self-modifying
  • Sound hardware is not being emulated

Core Design

The system is fundamentally designed in the same manner as a traditional emulator, calling state-altering micro-ops from a fetch-decode-execute loop. LLVM will be used to assemble sequences of these micro-ops (and their arguments where possible) into a large switch statement that can be inlined, optimized, and JITed for native execution. The switch statement should make use of fall through to allow for execution of continuous runs and jumps within runs. If the program counter lands at an unknown location, the switch can fall through to simply calling the function and adding an element to the cache, or splitting an existing block where possible.

The instruction cache is a simple array of metadata that matches the Cart ROM layout, providing information about where jump targets land and where sequences must be broken. This metadata can be utilized to generate an LLVM representation of the game as needed.

Mockup of unoptimized generated code

Expressing the generated code as a (surprisingly legal) C fragment, several key properties can be observed. Case 0x0 begins an uninterrupted sequence of instructions, continuing on to case 0x4 and ultimately ending the sequence. A future execution around case 0x10 would fall through to the default case, signalling that no native implementation is available, allowing some other function to handle updating the game cache and calling the unoptimized micro_op directly.

// foo.rom.o
// boolean translation_cache(state_t* state)
switch(state->cpu.registers.pc)
{
    case 0x0: block_0x0:
        addr_A(1);
        addr_B(2);
    case 0x4: block_0x4:
        nop();
        addr_C(fetch_memory(pc++)); // addrn_C(0x50)
        addr_C(1); // immediate-mode add
        break;
    // ... (unvisited memory. May be data, or still to be traversed) ...
    case 0x27: block_0x27
        subr_C(5);
        jmp(0x04); // updates CPU state, does not perform actual jump
        goto block_0x04;
    default:
        return false;
}
return true;

A primitive loop or branch instruction can also be observed here as part of case 0x27, which uses goto as a means of jumping to a known block. If implemented with care, LLVM branching instructions can be emitted by the code generator for JIT and subsequent native execution. Further gains can also be realized by making MMU access logic available to the optimizer, as some constants can be known, allowing for removal of untraversed branches, or possible replacement of MMU address calculations with a constant address or value.

This will be output as LLVM Bitcode and stored alongside the ROM image. The LLVM Cache will be loaded and executed on next run or on-demand.

Game Loop

The game loop should attempt to run cached operations, falling back to instruction generation and one-off execution.
void game_loop()
{
    // Grab the cache from the module
    boolean (*translation_cache)(state_t* state) = (boolean (*)(state_t*)) module->getOrCreateFunction("translation_cache");
    while(running)
    {
        if(!translation_cache->execute(state))
        {
            interpreted_dispatch(state);
        }
    }
}

Micro Ops (under heavy revision)

It would be ideal to support straight interpretation with pre-translation and dynamic translation to work around the compilation overhead experienced by llvm-qemu. The core cases for implementation of micro ops vary based upon the addressing mode used by the instruction. The items requiring the most attention are Immediate Data instructions, as the operation can be read and the arguments may be emitted directly into the native instruction stream.

The possibility of direct emission of immediate-mode instructions with constant data may allow the LLVM Optimizer to partially or fully replace the micro-op during arithmetic optimization and instruction combining. This can turn multiple function calls, all of which with their own program counter increments and memory accesses, into a few native add and set instructions.
  • Read instruction and increment PC.
  • Emit PC increment.
  • Read argument(s), incrementing PC if necessary
  • Emit function call.

// Push the "fetch" simulation down into function.
void op(state_t* state)
{
    state->pc++; // Twice for double-byte instructions
    mutate state;
}

// Pull args out of ROM and emit: call @op(state, arg1)
Function* emit_op(Module* m, state_t* state)
{
    Function* op = m->getFunction("op_0xf123");
    IRBuilder* builder = new IRBuilder<>(BasicBlock::create(m->getContext(), "entry", op);)

    // Increment the Program Counter
    Value* STATE = m->getGlobalVariable("state");
    Value* PC = builder->CreateGEP(....?); // How to address the Program Counter?
    builder->CreateAdd(PC, one);

    // Manipulate the state
    
    IntegerType* BYTE = IntegerType::get(builder->getContext(), 8);
    Value* arg1 = ConstantInt::get(BYTE, mmu_read_byte(state, state->cpu.pc));
    builder->CreateCall3(op, STATE, arg1);
    builder->CreateRetVoid();
    delete builder;

    return op;
}
Core cases: Immediate Mode operation (NOP, LDHLnn), Indirect Mode operation (LDHLIA)

Initially, treat all ops as equal, allowing them to go through the MMU to get their arguments. Immediate-mode instructions can be specialized later to hard-code their arguments.

Comments