Some optimizations #1

Open · wants to merge 4 commits into base: main
Conversation


@jerch jerch commented Aug 28, 2015

Here are some quick optimizations to speed up the code in CPython. Most bottlenecks are cascades of function calls, especially the setters/getters of the register objects (commits 9b73477, fac8dbc and partly 038475b). Direct memory access in the cpu code also shows some benefit (234d571).

Benchmark results: ~2 million cycles/s under CPython 2.7 and ~17 million cycles/s under PyPy.
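The setter/getter overhead mentioned above is easy to reproduce with a micro-benchmark. This is a hedged sketch, not the PR's actual code: `PropertyRegister` and `PlainRegister` are invented stand-ins for the register objects, but they show the cost of routing every access through a property with a fixed-width conversion versus a plain attribute.

```python
import timeit

# Hypothetical minimal register models (names are illustrative,
# not from the actual MC6809 code base).

class PropertyRegister:
    def __init__(self):
        self._value = 0

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, v):
        # fixed-width conversion on every write, behind a function call
        self._value = v & 0xFF

class PlainRegister:
    __slots__ = ("value",)
    def __init__(self):
        self.value = 0

def bench_property(reg=PropertyRegister()):
    reg.value = reg.value + 1

def bench_plain(reg=PlainRegister()):
    # same masking, but inlined at the call site
    reg.value = (reg.value + 1) & 0xFF

t_prop = timeit.timeit(bench_property, number=100_000)
t_plain = timeit.timeit(bench_plain, number=100_000)
print(f"property: {t_prop:.4f}s  plain: {t_plain:.4f}s")
```

On CPython the plain-attribute version is typically noticeably faster, since each property access is an extra Python-level function call.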


jerch commented Aug 29, 2015

Some more speed tests:

  • Using ctypes c_ubyte and c_ushort types for the registers, to avoid the handwritten fixed-width conversions, is slightly slower than pure Python; I guess the ctypes conversion layer is heavier than the handcrafted version.
  • Dotname reduction: I only tried this with the CC register, moving all of its logic into the cpu class. Around 15% speedup (~2.3 million cycles/s). The main difference in the code is self.Z instead of self.cc.Z. The speed gain is somewhat impressive given that it is only one lookup shorter. This might be promising ground for refactoring; I am not sure how mixins and the MRO will behave here.
  • Implementing standard registers with fast builtin types: Replacing the register objects with a list like [value, name, width] shows around a 20% speedup (~2.4 million cycles/s). The downside is ugly code, with all those magic index numbers and extra instructions in the cpu code to replace .set, .increment and .decrement.
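The "registers as plain lists" idea from the last bullet could be sketched roughly as follows. All names here are my own, not taken from the PR's commits; index constants tame the magic numbers somewhat, and storing the wrap-around mask instead of the bit width saves a shift on every access.

```python
# index constants instead of bare magic numbers
VALUE, NAME, MASK = 0, 1, 2

def make_register(name, width_bits):
    # store the wrap-around mask instead of the bit width
    return [0, name, (1 << width_bits) - 1]

# inlined replacements for the old .set/.increment/.decrement methods
def reg_set(reg, value):
    reg[VALUE] = value & reg[MASK]

def reg_increment(reg, n=1):
    reg[VALUE] = (reg[VALUE] + n) & reg[MASK]

def reg_decrement(reg, n=1):
    reg[VALUE] = (reg[VALUE] - n) & reg[MASK]

accu_a = make_register("A", 8)
reg_set(accu_a, 0x1FF)    # wraps to 0xFF
reg_increment(accu_a)     # wraps to 0x00
```

In the hottest paths one would inline the `reg[VALUE] = ... & reg[MASK]` expressions directly into the cpu code, which is exactly where the "ugly code" complaint comes from.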

In summary, the biggest problems for your emulator in CPython are function calls, followed by dotname lookups. Speed-wise the best option would be one big cpu loop with all the state in the local namespace. Since Python has no low-level jumps (goto, switch) this is hardly doable, and any attempt will only lead to really ugly code. The best we have for code jumps are function mappings, at the cost of all the function-call overhead around them.
By the way, your elif cascade in .get_ea_indexed is O(n). With a mapping it would be O(1), but you would need a function as the jump target, which adds more constant overhead than you can gain. As a workaround you could stick with the if cascade and restructure it as a binary search.
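The elif-versus-mapping trade-off can be sketched like this. The post-byte values and offsets are invented for illustration, not taken from the real .get_ea_indexed; the point is only the shape of the two dispatch styles.

```python
def get_offset_elif(postbyte):
    # O(n) comparisons in the worst case, but each hit is just an
    # inline expression with no extra function call
    if postbyte == 0x00:
        return 0
    elif postbyte == 0x01:
        return 1
    elif postbyte == 0x02:
        return 2
    raise ValueError(hex(postbyte))

# O(1) hash lookup, but every target must be a callable, which adds a
# constant function-call overhead per dispatched instruction
_OFFSETS = {
    0x00: lambda cpu: 0,
    0x01: lambda cpu: 1,
    0x02: lambda cpu: 2,
}

def get_offset_mapped(cpu, postbyte):
    return _OFFSETS[postbyte](cpu)
```

With only a handful of cases the elif cascade usually wins in CPython; the dict only pays off once the cascade is long and the per-case work is substantial enough to amortize the call overhead.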


jedie commented Aug 31, 2015

Thanks for your contribution here!

Please add yourself to AUTHORS ;)

> Dotname reduction: Only tried with the CC-register by moving all its logic into the cpu class. Around 15% speedup (~2.3 Mio cycles/s). Main difference in code is self.cc.Z vs. self.Z. The speed gain is somewhat impressive regarding the fact that it is only one lookup shorter. This might be a promising refactoring ground, not sure how mixins with the MRO will do here.

That sounds like a nice idea!

> Implementing standard registers with fast builtin types: Replacing the register objects with a list like [value, name, width] shows around 20% speedup (~2.4 Mio. cycles/s). Downside is the ugly code with all those magic index numbers and more instructions in the cpu code for the .set, .increment and .decrement replacements.

Yes, ugly code for more speed is possible, but that is not my goal. If I wanted speed, I would not use Python :P


jerch commented Aug 31, 2015

Haha, yeah, Python should not be the first choice for number crunching. Nevertheless I started a new branch as a playground, just to see how far CPython can be pushed :D The benchmark is at 2.8 million cycles/s (23 million for PyPy) and the code is already quite unpythonic and degraded. Welcome to the big monster loop ;)
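The "big monster loop" pattern boils down to hoisting hot attributes into locals before the loop, since local variable access in CPython is an array index while every `self.x` is a dict lookup. A hedged sketch with invented names and a single made-up opcode, not the actual branch code:

```python
class Memory:
    """Tiny stub memory, just enough to demo the pattern."""
    def __init__(self, data):
        self._data = data

    def read(self, addr):
        return self._data[addr & 0xFFFF]

class CPU:
    def __init__(self, memory):
        self.memory = memory
        self.program_counter = 0
        self.accu_a = 0

def run(cpu, cycles):
    # hoist hot state and bound methods into locals once ...
    read = cpu.memory.read
    pc = cpu.program_counter
    a = cpu.accu_a
    while cycles > 0:
        # ... then only touch locals inside the loop
        opcode = read(pc)
        pc = (pc + 1) & 0xFFFF
        if opcode == 0x4C:        # hypothetical INCA handling
            a = (a + 1) & 0xFF
        cycles -= 1
    # write the mutated state back at the end
    cpu.program_counter = pc
    cpu.accu_a = a
```

The downside is exactly the degradation mentioned above: the state lives in two places, every exit path must write it back, and the loop body grows into one giant dispatch block.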


jedie commented Sep 1, 2015

> Welcome to the big monster loop ;)

Yes, that will be the fastest way ... with all variables local ;) ... Maybe a better idea is to generate the code. Look at:
https://github.com/6809/MC6809/blob/master/MC6809/components/MC6809data/MC6809_op_data.py

I used this data to generate the CPU skeleton, and I use it to generate: https://github.com/6809/MC6809/blob/master/MC6809/components/cpu_utils/instruction_call.py

The Instruction_generator.py is here: https://github.com/6809/MC6809/blob/master/MC6809/components/cpu_utils/Instruction_generator.py

Maybe it's possible to copy the real op-code implementations from https://github.com/6809/MC6809/blob/master/MC6809/components/cpu6809.py into MC6809_op_data.py.
Then all the information needed to generate a CPU class would be in one place.
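The code-generation idea could look roughly like this. The op-data table and template below are invented for illustration and are not the real MC6809_op_data format or Instruction_generator.py output; the point is only the mechanism of building method source from a data table and assembling a class from it.

```python
# invented mini op-data table (the real MC6809_op_data.py is far richer)
OP_DATA = [
    {"mnemonic": "NOP", "opcode": 0x12, "cycles": 2},
    {"mnemonic": "SYNC", "opcode": 0x13, "cycles": 4},
]

# source template for one generated instruction handler
TEMPLATE = '''\
def instruction_{mnemonic}(self):
    "generated handler for {mnemonic} (opcode 0x{opcode:02X})"
    self.cycles += {cycles}
'''

def generate_cpu_class(op_data):
    namespace = {}
    for op in op_data:
        # compile each handler's source into the class namespace
        exec(TEMPLATE.format(**op), namespace)
    namespace["__init__"] = lambda self: setattr(self, "cycles", 0)
    return type("GeneratedCPU", (), namespace)

GeneratedCPU = generate_cpu_class(OP_DATA)
```

In the real project the generator writes the source to instruction_call.py instead of exec-ing it, which keeps the generated code inspectable and importable.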

But this is not my intention ;)

Btw. around 880,000 CPU cycles/s is real-time. See: https://github.com/jedie/DragonPy/blob/master/dragonpy/Dragon32/gui_config.py#L77-L84

Btw. please add yourself to AUTHORS, so I can merge.

jedie added a commit that referenced this pull request Sep 3, 2015
ysei pushed a commit to ysei/MC6809-1 that referenced this pull request May 22, 2016