Thursday, 29 January 2009

Dirty Pretty Thing

Real life has a nasty habit of unravelling the best-laid of plans, and the last couple of weeks have seen me with very little time to spend on the project - but I have done a little as the opportunity has arisen, so as promised here's a screenshot of what the IDE looks like right now. It's not as far along as I would have liked, but it's getting there - and there's another panel (the memory-dump view) which is nearing completion but not shown, so progress is slow but steady:



As you can see, we have a disassembly panel (which does what it says on the tin) and an address select/convert tool which lets me convert addresses from Hex to Decimal (and back) and also makes use of the annotation labels (if present) to find and select addresses by name. There's a little more I want to do to this tool, but it works so far.

In the meantime, when I've not been actively coding, I've been mulling-over the second version of the core, which emulates the 6502 at the T-state level. I've got a reasonable idea of how to make it work, so once I've got the IDE where I want it, I'll be in a position where I can instantiate both the current V1 core and the new V2 core side-by-side, and watch the behaviour of each simultaneously as they step through instructions. Which should be cool. :)

Friday, 16 January 2009

He Who Fights And Runs Away...

OK, I concede temporary defeat. Pipelining is proving too large a drain on the progress of the project - it's working, and performance is better than it was after the first attempt, but it's still nowhere near as good as the non-pipelined version. So, for now, this goes to the back-burner and I'm returning to the IDE. I have a pretty good idea for how to do some branch prediction and 'loop unrolling' which will make the pipeline quicker, but that's another week or two's effort; I really don't want to get bogged-down in this and stall the project further, so a tactical retreat is in order.

Accordingly, I've reverted the codebase to the pre-pipeline version and rolled-in the few bugfixes I'd made elsewhere in the code as I implemented and tested the pipeline constructs. So we're now back to where we were a fortnight ago, and the pipelined version is tucked away waiting for the day I return to it. In the meantime, on with the IDE!

In a few days I'll post some screenshots so you can see how it's shaping up.

Tuesday, 13 January 2009

Unblocking The Pipes

Getting pipelining working is now approaching the point at which it's becoming a major distraction from pushing the project forward, and I'm thinking I'm going to try one last thing before giving it up as a bad job and reverting to the non-pipeline version of the code. I've fixed a couple of other little bugs I've found whilst I've been fiddling with the instruction fetch logic, so lifting those fixes out and dropping them into the non-pipeline version will be a matter of seconds.

I spent a good while making use of JetBrains' "dotTrace" to identify hotspots in the pipeline code - as expected, there was some redundant object creation going on which was easy to nullify, and I changed a byte-by-byte copy from one array to another with the Array.Copy method; but apart from that there wasn't much to do in the way of performance tuning - there isn't much code to tune, after all, as it's pretty simple. And the big performance hit remained the over-read of memory triggered by the frequent pipeline flushes.

So it was time to do something a little more radical, to try to minimise the number of times the pipeline got flushed. A flush occurs when an instruction causes the PC to change; the PC increments by the instruction length after each instruction anyway, but by 'change' I mean that it's set to another value instead of the natural increment. This occurs after a BRK, BxS, BxC, JMP, JSR, RTS or RTI instruction when the PC is reloaded with the address to continue execution from. We also do pipeline 'top-ups' when we've consumed more than 50% of the pipeline contents, but the frequency of that happening is dictated by pipeline length, and is a negligible cost compared to the flush.

The trick to mitigating flushes is to try to arrange matters so that the new PC value is one that is already in the pipeline. For a lot of 6502 code this can be achieved without having a huge pipeline; branches, for example, can only be -128/+127 locations from the PC. And JSR instructions tend to be to relatively close-by code, often within the same page (256 bytes) which means the corresponding RTS could easily be within the reach of the pipeline too. JMP instructions tend to be to farther-off locations, and RTI by its very nature could be returning from a long way off. BRK occurs infrequently as an instruction, and when it does occur it'll likely be doing a very long jump indeed, and anyway will route through the NMI vector so there's no pipeline optimisation that'll help with that.

So the first thing to do is make the pipeline big enough to cope with the common scenarios - branch instructions and JSR/RTS combinations. We don't want a huge pipeline, because although we're trying to reduce flushes, whenever we DO flush we'll be refilling that pipe - and the bigger it is, the longer it'll take to fill. Equally, a FIFO queue pipeline isn't a smart idea if we want to accomodate 'negative' changes to the PC and jump backwards; as we consume instructions from the FIFO queue, they are lost, so negative jumps would demand a flush.

I've written a pipeline that is 512 bytes long, arranged as two complete pages of 256 bytes either side of the PC; when a flush occurs, we fill the pipeline with 256 bytes from behind the new PC value, and 256 bytes from ahead of it. This gives us a pretty high chance that any jump or branch forwards or backwards will be into the pipeline, and because it contains data from behind the PC we'll have a moving window as we do top-ups that will also keep the chances high that a negative jump will also still be in the pipeline too. For this to work properly, we also have to stop popping instructions out of the pipeline as we execute them; we need to be able to jump back to an instruction we've already processed for negative jumps to have a chance of being satisfied by what's in there. So instead of a FIFO queue, the new pipeline is a simple array with a pointer maintained by the emulator code; when we flush the queue, the pointer is set to 256 and we execute from there. As we proceed through the pipeline, we leave the contents intact but modify the pointer; branches and jumps that can be satisfied from the pipeline are manifested by simply changing the pointer to the appropriate pipeline slot.

Although this sounds like a second Program Counter (and in a sense it is) the key thing to remember is that the pipelined data is right there in the CPUCore class, and doesn't require a context switch to get data from MemoryMap every time we fetch and decode an instruction. There are still inline MemoryMap references, such as when an indirect load, store or jump occurs; in those circumstances, the operand data in the pipeline is a pointer to a memory location which in turn contains the real target address, and we have to read that address out of memory before we can operate on it. But the majority of instructions and addressing modes can be satisfied directly from the pipeline. At least, that's the theory.

I've made the changes to the pipeline structure and logic in CPUCore, and right now I'm just finishing the change to MemoryMaps' BurstData method, so that the pipeline is filled with 256 bytes from either side of the current PC when we ask for data. The minor complication here is that the data should 'wrap' when the PC is less than 256 or greater than [Memory.Size - 256], so once I add that little bit of functionality we'll be good to go. Hopefully, the ability to jump and branch within the pipeline without necessarily triggering a flush will make all the difference.

If that doesn't deliver the desired results, the next thing to do will be to make pipeline population an asynchronous task on another thread - but I've already decided that I'm not going to persue that at this time, and it can wait until I write the V2 CPUCore after Easter because that will be a multi-threaded entity anyway. Abandoning the pipeline scheme (if I have to) won't be a total waste of effort, as I've learned a lot both about pipeline mechanics and 6502 code profiling - and that is good foundation knowledge for the V2 core.

Anyway, we'll see.

Thursday, 8 January 2009

Blocked Pipes

I couldn't resist it. Despite the knowledge that I really should be working on the IDE, I had to see what effect there would be on the core execution speed if I implemented a rudimentary pipelining mechanism into the instruction decode routine.

So over the last couple of nights, I've added a small 'BurstData' routine to the MemoryMap class, which populates a pre-defined shared array with the next 'n' bytes of data from the primary memory array and makes it available to the CPUCore class. I then added a FIFO queue to CPUCore (using the System.Collections Queue class) and wrote the logic to call BurstData when the queue was less than half-full, enqueuing data from that shared array. Tweaking the fetch code to pull its data from the queue instead of memory was a matter of seconds to change, and then I went through all the 6502 instruction implementations adjusting them to use queued data instead of memory where appropriate.

Once that was done, I set the queue capacity to six bytes and the refill threshold to three bytes, and fired it up. And the results are stunning.

Stunningly bad, that is - where I was getting 40MHz on the older box, I now get 10MHz.

I experimented with some other configurations of the queue; after 6/3 proved disasterous, I played with 9/6, 12/3, 12/6, and even some silly values like 30/15. Performance varied, but in general either stayed in the region of 10MHz or dropped - the worst was just 4.1MHz. Clearly, this is Not Working.

So why is this? Well, the basic problem is that whenever the Program Counter changes due to a Branch, Jump, Break or Return instruction, the queue has to be flushed and re-populated from the new PC location. And a characteristic of 6502 code is that routines and loops tend to be short, so the queue gets flushed a lot. Since we read 'n' bytes ahead to populate the queue, we're now doing a lot MORE memory access than ever before, and in the majority of cases it's a complete waste of effort. Look at this example from the VIC-20 ROM, which executes during the boot process and is checking to see if a short sequence of bytes is present in memory which would indicate that a plug-in auto-start cartridge has been connected:
; CHKA0CBM: FD3F  [2]  A2 05     LDX #$05       (- Check For A-ROM)
; FD41 [4] BD 4C FD LDA $FD4C,X
; FD44 [4] DD 03 A0 CMP $A003,X
; FD47 [2] D0 03 BNE $FD4C
; FD49 [2] CA DEX
; FD4A [2] D0 F5 BNE $FD41
; FD4C [6] 60 RTS
This is a perfect example of the nature of the problem; let's assume we have a queue set to 6 bytes long, with a refill threshold set to 3. When we jump to this routine at $FD3F, the queue is flushed and we read 6 bytes of memory from the PC location into it. Instruction execution commences, reading the first byte (LDX) and then the second as the operand. We then read the third byte (LDA), and two more as operands. On the next fetch, the queue level has dropped to one byte, so a refill occurs and puts five further bytes into the queue. We then execute the next instruction (CMP), pulling its' two bytes of operand data off the queue; the queue level is now three bytes, which is not lower than the refill threshold, so we execute the next instruction (BNE) and pull the operand byte off the queue too.

Now at this point (in our example) the branch is taken - the CMP instruction didn't find a match and we're skipping out of the loop. So the PC changes to $FD4C, and the queue is flushed and populated with six bytes. We execute the first instruction, which is RTS, and that also changes the PC to set it back to the address this routine was called from; so the queue is flushed again, and we read another six bytes. In total, to execute this loop in the pre-pipeline model, we would read 12 bytes of memory (11 in the routine, plus the first instruction after the calling JSR when we RTS); with the pipeline enabled, we read 6+5+6+6 (flush and fill the queue, refill the queue, flush and fill, flush and fill) or 23 bytes. Discounting how many bytes we make use of after the RTS, we've definitely read 6 bytes more than we needed during execution of the routine.

That doesn't sound so bad, but the problem is endemic - every time we hit a Branch or Jump, even if the target address is only a few bytes away, we flush and refill the queue. And longer queues just exacerbate the problem; if we were to simply double the queue length and the refill threshold (thinking that having more data to hand with fewer refills might improve the situation) we'd read 12+8+12+12 bytes, or 44 altogether - which doesn't help at all!

So I need to have a bit of a re-think on this. I have some ideas:

1. Ensure the queue manipulation logic is as fast as it can be - there might be some optimisation I can do to the mechanics of the code.

2. Tweak the PC-changing instructions to look and see if the target address is ahead by just a few bytes, and don't flush the queue if so.

3. Abandon the FIFO queue and use a hand-rolled routine to manage a pipeline in which data isn't popped-off the top when consumed, but instead we have a pointer that we can move up and down - good for short loops where the branch is a negative one.

4. Put queue population on a separate thread, so that it handles flushes and refills concurrently to the core instruction fetch/decode logic.

5. Replicate 20-odd-years-worth of industrial CPU research and design a mechanism that does pipeline caching and branch prediction with a high degree of accuracy.

I'm not giving-up on this yet, because I think I can crack it. The first attempt has been hugely disappointing, but also very informative, and I'm pretty confident that I can improve upon my first design in enough ways that it evolves from a massive drag on performance into something that improves it.

I'll keep you posted.

Monday, 5 January 2009

New Years' Resolution

Well, here we are in 2009 - in London it's cold, dark and unpleasant outside, and of course Christmas is now just a fading memory as we return to the daily grind. Travel expenses have risen, and things seem pretty gloomy all round; the come-down after the yuletide high is a bitch. But it's not all bad, as the need to get my brain back into work-mode has also stimulated me into returning to the project after two or three weeks away from it during the festive period. My initial target was to get a stable core emulation by Christmas, which I achieved; my New Year Resolution is to have the whole core complete, together with the IDE, by Easter.

Of course, being something of a geek (surely not!) I did spend a few idle minutes over the holidays thinking about Sharp6502. Naturally my mind wandered from the task in hand, and instead of thinking about GUI coding for the IDE, my thoughts turned back to the CPU Core and entertained some ideas on how to speed it up.

It's running at about 75MHz at the moment, which is plenty fast enough for my requirements; I don't need to make it any faster at this juncture, and in fact the ideas I've had for it are staying on the 'Future Enhancements' list for a good while because I want to focus on the IDE until it does at least as much as the command-line interface. But even though it's not a priority, a geeks' mind loves to tinker and tweak. :)

The biggest bottleneck in the way this version of the core runs is memory access - every instruction needs between one and three accesses to get the OpCode and any requisite operands, plus another access or two if the addressing mode is indirect, and there may then be further reads and/or writes if the instruction is one which does any memory access itself. The core only accesses memory when the instruction demands it, so we read one byte initially (for the OpCode) and then do further accesses only if the instruction decode requires one or two operands, any indirection access, and then whatever memory access is actually performed by the instruction.

All well and good - but memory access is expensive, relatively, because it requires the core instruction decode logic to make requests of the MemoryMap class and is thus a branch out of the sequence of IL instructions that are being executed by the CLR as it emulates each 6502 mnemonic. So my mental meanderings began to wonder if there was any way to reduce the number of times that happened - or at least to reduce the number of times we had to break the flow of IL instructions to fetch data, even if the number of accesses remained the same.

Of course, Intel (and AMD, and probably others) were thinking about this many years ago, which is why we have pipelined CPUs in our PCs today; to put it in simple terms, when a pipelined CPU reads an instruction byte it also reads-ahead a few bytes (perhaps quite a few, in current processor families) and fills a pipeline with that data. As the instruction executes, its operands will, if it's lucky, be available in the pipeline instead of having to do another memory access. Equally, successive instructions may also be in the pipeline, so you could end up executing several instructions very quickly from the pipeline instead of doing memory accesses for each. Filling the pipeline is a fast operation, because it's a burst of sequential data from memory, and can happen concurrently with the first instruction being decoded; even if it takes a bit longer than a straight-forward memory access for the next byte or two, you still recoup that time as subsequent instructions and data come from the pipeline instead of memory. Coupled with on-chip cache and efficient pipelining and branch-prediction algorithms, this has a huge impact on execution speed. The only time the model falls apart is when the pipeline is populated with the wrong data (perhaps because the branch-predictor got it wrong) and needs to be re-populated; the trick is to try to minimise the number of times this happens, and a great deal of time and effort has been invested in doing just that.

Funnily enough, even the humble 6502 had a primitive version of pipelining; the next byte would often be read whilst instruction decode was still occuring on the previous byte, so there was a shorter wait-state when that next byte was needed. In CPUCore version 2 (which I've alluded to in previous posts, and which will emulate the core at a lower level) this simple read-ahead will be faithfully replicated as part of the fetch logic. I started to wonder if a pipeline mechanism of sorts might exhibit some interesting performance gains even in the current version, and worked-out a way to do it.

What we'll have is a six-byte pipeline implemented as a FIFO queue; when the first OpCode fetch happens, we'll actually return six consecutive bytes from memory instead of just one. The instruction decode will then use the first pipeline entry, and any successive operand(s) will also be fed from there. As each byte is consumed, the pipeline queue will pop it off the top; when there's only three bytes left in the queue, we'll initiate another memory access to fill it to capacity again. Any instruction that causes a non-incremental change to happen to the PC (i.e. Bxx, JSR, JMP, and BRK) will of course incur a full pipeline refresh; but all other instructions will get a performance boost as they read their operands from the pipeline instead of memory. Also, subsequent instructions will more often than not be available in the pipeline, so we'll score a bonus on the instruction fetch as well.

In a best-case situation, the pipeline will be filled with six one-byte instructions; we'd do a memory access to get the pipeline filled, and then the next two instructions are cost-free. With the queue now down to 50% full we initiate another memory access to get three bytes; so we're essentially doing one MemoryMap access for every three instructions executed. In the real world, of course, we'll see instructions varying from one to three bytes and that theoretical 3:1 ratio will drop; but we'll still execute at better than the current 1:1 level. And all this is without doing anything at all sophisticated about branch prediction...

Why six bytes for the pipeline length? Purely as a starting-point, and knowing that an instruction can be one, two or three bytes long, I took the worst-case length and doubled it so that the pipeline could potentially hold two complete three-byte instructions. Once the logic is proved to be viable, I'll experiment with other multiples of three to see whether there's a 'sweet spot' at which the pipeline length performs better. I'll also play with the pipeline-fill trigger, so we can modify the circumstances under which a full or partial pipeline refresh will occur - the first cut will top-up the pipeline when it drops to 50% full, and completely repopulate it when PC does something other than increment by one. But it already feels like there might be ways of tuning or improving that.

It'll be fun playing with it to see what happens. ;)