Monday, 5 January 2009

New Years' Resolution

Well, here we are in 2009 - in London it's cold, dark and unpleasant outside, and of course Christmas is now just a fading memory as we return to the daily grind. Travel expenses have risen, and things seem pretty gloomy all round; the come-down after the yuletide high is a bitch. But it's not all bad, as the need to get my brain back into work-mode has also stimulated me into returning to the project after two or three weeks away from it during the festive period. My initial target was to get a stable core emulation by Christmas, which I achieved; my New Year Resolution is to have the whole core complete, together with the IDE, by Easter.

Of course, being something of a geek (surely not!) I did spend a few idle minutes over the holidays thinking about Sharp6502. Naturally my mind wandered from the task in hand, and instead of thinking about GUI coding for the IDE, my thoughts turned back to the CPU Core and entertained some ideas on how to speed it up.

It's running at about 75MHz at the moment, which is plenty fast enough for my requirements; I don't need to make it any faster at this juncture, and in fact the ideas I've had for it are staying on the 'Future Enhancements' list for a good while because I want to focus on the IDE until it does at least as much as the command-line interface. But even though it's not a priority, a geeks' mind loves to tinker and tweak. :)

The biggest bottleneck in the way this version of the core runs is memory access - every instruction needs between one and three accesses to get the OpCode and any requisite operands, plus another access or two if the addressing mode is indirect, and there may then be further reads and/or writes if the instruction is one which does any memory access itself. The core only accesses memory when the instruction demands it, so we read one byte initially (for the OpCode) and then do further accesses only if the instruction decode requires one or two operands, any indirection access, and then whatever memory access is actually performed by the instruction.

All well and good - but memory access is expensive, relatively, because it requires the core instruction decode logic to make requests of the MemoryMap class and is thus a branch out of the sequence of IL instructions that are being executed by the CLR as it emulates each 6502 mnemonic. So my mental meanderings began to wonder if there was any way to reduce the number of times that happened - or at least to reduce the number of times we had to break the flow of IL instructions to fetch data, even if the number of accesses remained the same.

Of course, Intel (and AMD, and probably others) were thinking about this many years ago, which is why we have pipelined CPUs in our PCs today; to put it in simple terms, when a pipelined CPU reads an instruction byte it also reads-ahead a few bytes (perhaps quite a few, in current processor families) and fills a pipeline with that data. As the instruction executes, its operands will, if it's lucky, be available in the pipeline instead of having to do another memory access. Equally, successive instructions may also be in the pipeline, so you could end up executing several instructions very quickly from the pipeline instead of doing memory accesses for each. Filling the pipeline is a fast operation, because it's a burst of sequential data from memory, and can happen concurrently with the first instruction being decoded; even if it takes a bit longer than a straight-forward memory access for the next byte or two, you still recoup that time as subsequent instructions and data come from the pipeline instead of memory. Coupled with on-chip cache and efficient pipelining and branch-prediction algorithms, this has a huge impact on execution speed. The only time the model falls apart is when the pipeline is populated with the wrong data (perhaps because the branch-predictor got it wrong) and needs to be re-populated; the trick is to try to minimise the number of times this happens, and a great deal of time and effort has been invested in doing just that.

Funnily enough, even the humble 6502 had a primitive version of pipelining; the next byte would often be read whilst instruction decode was still occuring on the previous byte, so there was a shorter wait-state when that next byte was needed. In CPUCore version 2 (which I've alluded to in previous posts, and which will emulate the core at a lower level) this simple read-ahead will be faithfully replicated as part of the fetch logic. I started to wonder if a pipeline mechanism of sorts might exhibit some interesting performance gains even in the current version, and worked-out a way to do it.

What we'll have is a six-byte pipeline implemented as a FIFO queue; when the first OpCode fetch happens, we'll actually return six consecutive bytes from memory instead of just one. The instruction decode will then use the first pipeline entry, and any successive operand(s) will also be fed from there. As each byte is consumed, the pipeline queue will pop it off the top; when there's only three bytes left in the queue, we'll initiate another memory access to fill it to capacity again. Any instruction that causes a non-incremental change to happen to the PC (i.e. Bxx, JSR, JMP, and BRK) will of course incur a full pipeline refresh; but all other instructions will get a performance boost as they read their operands from the pipeline instead of memory. Also, subsequent instructions will more often than not be available in the pipeline, so we'll score a bonus on the instruction fetch as well.

In a best-case situation, the pipeline will be filled with six one-byte instructions; we'd do a memory access to get the pipeline filled, and then the next two instructions are cost-free. With the queue now down to 50% full we initiate another memory access to get three bytes; so we're essentially doing one MemoryMap access for every three instructions executed. In the real world, of course, we'll see instructions varying from one to three bytes and that theoretical 3:1 ratio will drop; but we'll still execute at better than the current 1:1 level. And all this is without doing anything at all sophisticated about branch prediction...

Why six bytes for the pipeline length? Purely as a starting-point, and knowing that an instruction can be one, two or three bytes long, I took the worst-case length and doubled it so that the pipeline could potentially hold two complete three-byte instructions. Once the logic is proved to be viable, I'll experiment with other multiples of three to see whether there's a 'sweet spot' at which the pipeline length performs better. I'll also play with the pipeline-fill trigger, so we can modify the circumstances under which a full or partial pipeline refresh will occur - the first cut will top-up the pipeline when it drops to 50% full, and completely repopulate it when PC does something other than increment by one. But it already feels like there might be ways of tuning or improving that.

It'll be fun playing with it to see what happens. ;)

3 comments:

thomas said...

That's a very interesting idea - it didn't even occur to me yet to look into existing CPU performance gain methods for ideas that can be used for an emulator.

Creating a pipeline for the opcode fetch would make sense in my case too - I'm currently writing a 68000 emulation and almost all 68K opcodes are followed by 16bit of data. Currently I only fetch the data if a instruction needs it, but it will probably speed up things by just looking ahead and filling up a pipeline. I'll have to look into that.

Have you already implemented enough of the pipeline to see a performance difference?

t0ne said...

The 68000 does prefetch the next word. I've been pondering whether to add this into Miggy's CPU emulation, but I haven't investigated exactly what happens in the prefetching as yet.

Interesting idea for the emulation code, I'm looking forward to seeing how it works out.

Jonners said...

I've succumbed to temptation and started putting pipelining in place - see the next post for details. ;)