Tuesday, 4 November 2008

Spin Cycle

Having analysed the deepest and darkest corners of the emulator code with a profiling tool, it's looking increasingly like the speed degradation I've seen recently is more due to the way I count instruction metrics rather than the actual implementation of the code. The biggest hotspot is, unsurprisingly, the ExecuteInstruction() method, and within that the bits that take the most time are fairly evenly scattered around in the form of calls to Properties in the various Register classes. I'll have a look at optimising that at a later date, because what the analysis showed as a bigger issue was a degree of imprecision in the way core speed was being measured.

How was metrics counting working?

As each instruction executed, the basic cycle-count it specified was added to a running total of cycles the core had executed so far. This did not account for the instances where an instruction would need additional cycles, such as when a branch instruction is true, or at page boundaries for many instructions - those additions would be incrementally added by the instruction implementation logic itself, as it decided at run-time whether further cycles were needed.

Every 100ms a timer fired which took the current total cycle count and stashed it in the next slot of a 10-element array (looping around to the beginning when full). It also then zeroed the total, so that it would begin to accrue again over the next 100ms. At the end of execution, these 10 most-recent stashed cycle totals were added together to give the total number of cycles executed over the previous second of elapsed time. This total was then divided by one million to give a megahertz value.

One problem with this was latency - during the interval that the timer was stashing the total cycles for the last 100ms period, but before it reset the total, the core would continue to execute instructions, but of course any further additions to the total would be lost. There is also the natural imprecision of timer events at these frequencies - the timer is not guaranteed to fire on an exact 100ms boundary, and indeed observations showed it firing at varying intervals depending on overall operating system load. So there would be a small but measurable skew on the count added to the array, because it would be a variable amount over 100ms since the last one.

Equally, the process worked over only the last second of elapsed execution time; so if the core had just completed a large amount of work involving instructions with small cycle times, the average would go up. In the reverse scenario, where the core had executed a bunch of instructions with large cycle times, the average would go down. So the reported clock speed could be highly variable depending on what the core had just been doing.

How does metrics counting work now?

Instead of counting the number of cycles executed, the core now simply increments a counter for each instruction every time that instruction is executed. These counters continue to accrue for the duration of the core execution period, and at the end of execution a simple calculation multiplies the number of each instruction execution by its' baseline cycle requirement. These results are summed to give a total number of cycles executed overall, and then divided by the number of seconds for which the core ran, giving a cycles-per-second result. Dividing this by one million then provides the megahertz clock speed.

There are a couple of points to note about this approach. Firstly, we're not taking any account of additional cycle requirements for those instructions which vary according to circumstance (like branch or page boundaries). I'm not sure if I like that, but at the moment the main objective is to get a representative indicator of core speed, even if it's a little imprecise. Secondly, those instruction counters will eventually overflow in long-run situations, so I'm going to need some way of managing that.

On the flip side, we've eliminated the slightly flakey timer-based measurement technique. Doing continuous counting also means we tend to smooth out those peculiarities where the core is busy doing lots of small- or large-cycle-count instruction executions, so when we come to do the speed calculation we get the average cycles-per-second over the entire run interval, and not a (potentially) biased view based on just the last second. This means our view of core speed is more representative of overall performance, and isn't skewed by whatever the core was just doing.

What Next?

I'm still not entirely satisfied with the way this works - under the covers, the core knows exactly how many cycles an instruction took to execute as it ran, so the fact that we don't see this knowledge exposed in the metrics (because we just multiply instruction counts by base cycles) irritates me. In any case, it feels like what I should be doing is figuring-out how long a real 6502 takes to execute one cycle, and using that time period as a coefficient to figure out what percentage over 100% the emulator is executing cycles at.

The other issue is that the 6502 does a limited form of pipelining - as each instruction is decoded and executed, activities relating to the next instruction can sometimes be happening in parallel. To properly emulate that, I need to take instruction decoding and execution up to the next level of complexity, where we're emulating T-states rather than instruction units.

However, for now the clockspeed results I'm seeing are stable, and that means I can finish implementing and testing all the instructions I don't yet have in the core. I'll also do the undocumented ones as well, and have a toggle on the core instancing code to switch support for those on and off. In a few days I should have a fully-implemented instruction set, including known bugs (like the 'JMP Indirect' Page Boundary glitch), and then I can publish the code at that point as a working entity, and move on to the next stage...

2 comments:

The Cowboy Online said...

This is continuining to be a fascinating blog to follow.

Was there any case where 6502 code would work out the clock speed of a CPU? I'm wondering if there's some code out there which does this, that you could then use to compare results with the inbuilt clock speed measurer?

Jonners said...

I'm not sure the 6502 can figure-out what clock speed it is running at without reference to some external timebase. You can certainly write comparison code which has a known number of cycles, run it for 'n' loops on a real 6502 and the same in an emulator, and see what the elapsed wall-clock times are.

Or, if you tied a 6502 to a 6522 and set the timer to a predetermined trigger value, it would be possible to use that as a synchroniser to measure how many instructions executed in a fixed time period.

I'm not too worried about my core clock rate at the moment, as it's still significantly up on a 'stock' 6502. It may become an issue later, particularly when it comes to SLOWING DOWN the core to run at 1MHz so it can be accurately used in larger emulation projects.