Getting pipelining working is now approaching the point at which it's becoming a major distraction from pushing the project forward, and I'm thinking I'm going to try one last thing before giving it up as a bad job and reverting to the non-pipeline version of the code. I've fixed a couple of other little bugs I've found whilst I've been fiddling with the instruction fetch logic, so lifting those fixes out and dropping them into the non-pipeline version will be a matter of seconds.
I spent a good while making use of JetBrains' "dotTrace" to identify hotspots in the pipeline code - as expected, there was some redundant object creation going on which was easy to nullify, and I changed a byte-by-byte copy from one array to another with the Array.Copy method; but apart from that there wasn't much to do in the way of performance tuning - there isn't much code to tune, after all, as it's pretty simple. And the big performance hit remained the over-read of memory triggered by the frequent pipeline flushes.
So it was time to do something a little more radical, to try to minimise the number of times the pipeline got flushed. A flush occurs when an instruction causes the PC to change; the PC increments by the instruction length after each instruction anyway, but by 'change' I mean that it's set to another value instead of the natural increment. This occurs after a BRK, BxS, BxC, JMP, JSR, RTS or RTI instruction when the PC is reloaded with the address to continue execution from. We also do pipeline 'top-ups' when we've consumed more than 50% of the pipeline contents, but the frequency of that happening is dictated by pipeline length, and is a negligible cost compared to the flush.
The trick to mitigating flushes is to try to arrange matters so that the new PC value is one that is already in the pipeline. For a lot of 6502 code this can be achieved without having a huge pipeline; branches, for example, can only be -128/+127 locations from the PC. And JSR instructions tend to be to relatively close-by code, often within the same page (256 bytes) which means the corresponding RTS could easily be within the reach of the pipeline too. JMP instructions tend to be to farther-off locations, and RTI by its very nature could be returning from a long way off. BRK occurs infrequently as an instruction, and when it does occur it'll likely be doing a very long jump indeed, and anyway will route through the NMI vector so there's no pipeline optimisation that'll help with that.
So the first thing to do is make the pipeline big enough to cope with the common scenarios - branch instructions and JSR/RTS combinations. We don't want a huge pipeline, because although we're trying to reduce flushes, whenever we DO flush we'll be refilling that pipe - and the bigger it is, the longer it'll take to fill. Equally, a FIFO queue pipeline isn't a smart idea if we want to accomodate 'negative' changes to the PC and jump backwards; as we consume instructions from the FIFO queue, they are lost, so negative jumps would demand a flush.
I've written a pipeline that is 512 bytes long, arranged as two complete pages of 256 bytes either side of the PC; when a flush occurs, we fill the pipeline with 256 bytes from behind the new PC value, and 256 bytes from ahead of it. This gives us a pretty high chance that any jump or branch forwards or backwards will be into the pipeline, and because it contains data from behind the PC we'll have a moving window as we do top-ups that will also keep the chances high that a negative jump will also still be in the pipeline too. For this to work properly, we also have to stop popping instructions out of the pipeline as we execute them; we need to be able to jump back to an instruction we've already processed for negative jumps to have a chance of being satisfied by what's in there. So instead of a FIFO queue, the new pipeline is a simple array with a pointer maintained by the emulator code; when we flush the queue, the pointer is set to 256 and we execute from there. As we proceed through the pipeline, we leave the contents intact but modify the pointer; branches and jumps that can be satisfied from the pipeline are manifested by simply changing the pointer to the appropriate pipeline slot.
I spent a good while making use of JetBrains' "dotTrace" to identify hotspots in the pipeline code - as expected, there was some redundant object creation going on which was easy to nullify, and I changed a byte-by-byte copy from one array to another with the Array.Copy method; but apart from that there wasn't much to do in the way of performance tuning - there isn't much code to tune, after all, as it's pretty simple. And the big performance hit remained the over-read of memory triggered by the frequent pipeline flushes.
So it was time to do something a little more radical, to try to minimise the number of times the pipeline got flushed. A flush occurs when an instruction causes the PC to change; the PC increments by the instruction length after each instruction anyway, but by 'change' I mean that it's set to another value instead of the natural increment. This occurs after a BRK, BxS, BxC, JMP, JSR, RTS or RTI instruction when the PC is reloaded with the address to continue execution from. We also do pipeline 'top-ups' when we've consumed more than 50% of the pipeline contents, but the frequency of that happening is dictated by pipeline length, and is a negligible cost compared to the flush.
The trick to mitigating flushes is to try to arrange matters so that the new PC value is one that is already in the pipeline. For a lot of 6502 code this can be achieved without having a huge pipeline; branches, for example, can only be -128/+127 locations from the PC. And JSR instructions tend to be to relatively close-by code, often within the same page (256 bytes) which means the corresponding RTS could easily be within the reach of the pipeline too. JMP instructions tend to be to farther-off locations, and RTI by its very nature could be returning from a long way off. BRK occurs infrequently as an instruction, and when it does occur it'll likely be doing a very long jump indeed, and anyway will route through the NMI vector so there's no pipeline optimisation that'll help with that.
So the first thing to do is make the pipeline big enough to cope with the common scenarios - branch instructions and JSR/RTS combinations. We don't want a huge pipeline, because although we're trying to reduce flushes, whenever we DO flush we'll be refilling that pipe - and the bigger it is, the longer it'll take to fill. Equally, a FIFO queue pipeline isn't a smart idea if we want to accomodate 'negative' changes to the PC and jump backwards; as we consume instructions from the FIFO queue, they are lost, so negative jumps would demand a flush.
I've written a pipeline that is 512 bytes long, arranged as two complete pages of 256 bytes either side of the PC; when a flush occurs, we fill the pipeline with 256 bytes from behind the new PC value, and 256 bytes from ahead of it. This gives us a pretty high chance that any jump or branch forwards or backwards will be into the pipeline, and because it contains data from behind the PC we'll have a moving window as we do top-ups that will also keep the chances high that a negative jump will also still be in the pipeline too. For this to work properly, we also have to stop popping instructions out of the pipeline as we execute them; we need to be able to jump back to an instruction we've already processed for negative jumps to have a chance of being satisfied by what's in there. So instead of a FIFO queue, the new pipeline is a simple array with a pointer maintained by the emulator code; when we flush the queue, the pointer is set to 256 and we execute from there. As we proceed through the pipeline, we leave the contents intact but modify the pointer; branches and jumps that can be satisfied from the pipeline are manifested by simply changing the pointer to the appropriate pipeline slot.
Although this sounds like a second Program Counter (and in a sense it is) the key thing to remember is that the pipelined data is right there in the CPUCore class, and doesn't require a context switch to get data from MemoryMap every time we fetch and decode an instruction. There are still inline MemoryMap references, such as when an indirect load, store or jump occurs; in those circumstances, the operand data in the pipeline is a pointer to a memory location which in turn contains the real target address, and we have to read that address out of memory before we can operate on it. But the majority of instructions and addressing modes can be satisfied directly from the pipeline. At least, that's the theory.
I've made the changes to the pipeline structure and logic in CPUCore, and right now I'm just finishing the change to MemoryMaps' BurstData method, so that the pipeline is filled with 256 bytes from either side of the current PC when we ask for data. The minor complication here is that the data should 'wrap' when the PC is less than 256 or greater than [Memory.Size - 256], so once I add that little bit of functionality we'll be good to go. Hopefully, the ability to jump and branch within the pipeline without necessarily triggering a flush will make all the difference.
If that doesn't deliver the desired results, the next thing to do will be to make pipeline population an asynchronous task on another thread - but I've already decided that I'm not going to persue that at this time, and it can wait until I write the V2 CPUCore after Easter because that will be a multi-threaded entity anyway. Abandoning the pipeline scheme (if I have to) won't be a total waste of effort, as I've learned a lot both about pipeline mechanics and 6502 code profiling - and that is good foundation knowledge for the V2 core.
Anyway, we'll see.

0 comments:
Post a Comment