Thursday, 8 January 2009

Blocked Pipes

I couldn't resist it. Despite the knowledge that I really should be working on the IDE, I had to see what effect there would be on the core execution speed if I implemented a rudimentary pipelining mechanism into the instruction decode routine.

So over the last couple of nights, I've added a small 'BurstData' routine to the MemoryMap class, which populates a pre-defined shared array with the next 'n' bytes of data from the primary memory array and makes it available to the CPUCore class. I then added a FIFO queue to CPUCore (using the System.Collections Queue class) and wrote the logic to call BurstData when the queue was less than half-full, enqueuing data from that shared array. Tweaking the fetch code to pull its data from the queue instead of memory was a matter of seconds to change, and then I went through all the 6502 instruction implementations adjusting them to use queued data instead of memory where appropriate.

Once that was done, I set the queue capacity to six bytes and the refill threshold to three bytes, and fired it up. And the results are stunning.

Stunningly bad, that is - where I was getting 40MHz on the older box, I now get 10MHz.

I experimented with some other configurations of the queue; after 6/3 proved disasterous, I played with 9/6, 12/3, 12/6, and even some silly values like 30/15. Performance varied, but in general either stayed in the region of 10MHz or dropped - the worst was just 4.1MHz. Clearly, this is Not Working.

So why is this? Well, the basic problem is that whenever the Program Counter changes due to a Branch, Jump, Break or Return instruction, the queue has to be flushed and re-populated from the new PC location. And a characteristic of 6502 code is that routines and loops tend to be short, so the queue gets flushed a lot. Since we read 'n' bytes ahead to populate the queue, we're now doing a lot MORE memory access than ever before, and in the majority of cases it's a complete waste of effort. Look at this example from the VIC-20 ROM, which executes during the boot process and is checking to see if a short sequence of bytes is present in memory which would indicate that a plug-in auto-start cartridge has been connected:
; CHKA0CBM: FD3F  [2]  A2 05     LDX #$05       (- Check For A-ROM)
; FD41 [4] BD 4C FD LDA $FD4C,X
; FD44 [4] DD 03 A0 CMP $A003,X
; FD47 [2] D0 03 BNE $FD4C
; FD49 [2] CA DEX
; FD4A [2] D0 F5 BNE $FD41
; FD4C [6] 60 RTS
This is a perfect example of the nature of the problem; let's assume we have a queue set to 6 bytes long, with a refill threshold set to 3. When we jump to this routine at $FD3F, the queue is flushed and we read 6 bytes of memory from the PC location into it. Instruction execution commences, reading the first byte (LDX) and then the second as the operand. We then read the third byte (LDA), and two more as operands. On the next fetch, the queue level has dropped to one byte, so a refill occurs and puts five further bytes into the queue. We then execute the next instruction (CMP), pulling its' two bytes of operand data off the queue; the queue level is now three bytes, which is not lower than the refill threshold, so we execute the next instruction (BNE) and pull the operand byte off the queue too.

Now at this point (in our example) the branch is taken - the CMP instruction didn't find a match and we're skipping out of the loop. So the PC changes to $FD4C, and the queue is flushed and populated with six bytes. We execute the first instruction, which is RTS, and that also changes the PC to set it back to the address this routine was called from; so the queue is flushed again, and we read another six bytes. In total, to execute this loop in the pre-pipeline model, we would read 12 bytes of memory (11 in the routine, plus the first instruction after the calling JSR when we RTS); with the pipeline enabled, we read 6+5+6+6 (flush and fill the queue, refill the queue, flush and fill, flush and fill) or 23 bytes. Discounting how many bytes we make use of after the RTS, we've definitely read 6 bytes more than we needed during execution of the routine.

That doesn't sound so bad, but the problem is endemic - every time we hit a Branch or Jump, even if the target address is only a few bytes away, we flush and refill the queue. And longer queues just exacerbate the problem; if we were to simply double the queue length and the refill threshold (thinking that having more data to hand with fewer refills might improve the situation) we'd read 12+8+12+12 bytes, or 44 altogether - which doesn't help at all!

So I need to have a bit of a re-think on this. I have some ideas:

1. Ensure the queue manipulation logic is as fast as it can be - there might be some optimisation I can do to the mechanics of the code.

2. Tweak the PC-changing instructions to look and see if the target address is ahead by just a few bytes, and don't flush the queue if so.

3. Abandon the FIFO queue and use a hand-rolled routine to manage a pipeline in which data isn't popped-off the top when consumed, but instead we have a pointer that we can move up and down - good for short loops where the branch is a negative one.

4. Put queue population on a separate thread, so that it handles flushes and refills concurrently to the core instruction fetch/decode logic.

5. Replicate 20-odd-years-worth of industrial CPU research and design a mechanism that does pipeline caching and branch prediction with a high degree of accuracy.

I'm not giving-up on this yet, because I think I can crack it. The first attempt has been hugely disappointing, but also very informative, and I'm pretty confident that I can improve upon my first design in enough ways that it evolves from a massive drag on performance into something that improves it.

I'll keep you posted.

6 comments:

The Cowboy Online said...

I'm wondering if you're ever going to realise the benefits of pipelining when implemented in managed code. I find it fascinating that you're doing this, and I loved the previous post on this subject, but to my - possibly naive - understanding it seems like you're on a bit of a 'hiding to nothing' trying to replicate hardware pipelining in software, unless you could put the pipelining in to another thread and execute in parallel?

Jonners said...

It's entirely possible that the Managed Code CLR is already doing the lions' share of pipelining and branch-prediction at the IL level; but I reckon I can help it along a bit if it has to do fewer context switches during execution.

So although my pipelining code might not help much in the long run, it might get me a few MHz more out of the core - and it'll pay off when I move to the V2 core, which works very differently in the way it emulates the 6502 and will be slower. Any speed advantage I've gained from pipelining will be an asset at that point.

If I can get it working at all, of course. ;)

The Cowboy Online said...

Well good luck, either way it's fascinating to read about what you're trying to achieve and I can't wait until something comes out we can have a play with. :-)

t0ne said...

Sounds like a lot of fun. You'll end up with a JIT compiler before you know it ;)

Jonners said...

That's spooky. As I was thinking about the V2 core during a quiet smoke last night, I considered the idea of 'pre-compiling' the entire 64K memory map contents into an intermediate byte-code that would run through the core 'parser' in double-quick time... ;)

One step at a time though - let's get pipelining working (or not) on the V1 core, and get that IDE up and running. :)

thomas said...

Very interesting... I will definitely look into the one-word-prefetch that the 68K actually has implemented anyway (Thanks t0ne!), but will for now stay away from a more complex pipelining scheme.

There's still a bunch of opcodes I have to implement first anyway...

Thanks for sharing your results!