Some optimizations #1

Open · wants to merge 4 commits into base: main
Conversation


@jerch jerch commented Aug 28, 2015

Here are some quick optimizations to speed up the code in CPython. Most bottlenecks are cascades of function calls, especially the setters/getters of the register objects (commits 9b73477, fac8dbc and partly 038475b). Direct memory access in the cpu code also shows some benefit (234d571).

Benchmark results: ~2 million cycles/s under CPython 2.7 and ~17 million cycles/s under PyPy.
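The setter/getter overhead mentioned above is easy to reproduce with a micro-benchmark. This is a hedged sketch, not the PR's actual code: `PropertyRegister` and `PlainRegister` are invented stand-ins for the register objects, but they show the cost of routing every access through a property with a fixed-width conversion versus a plain attribute.

```python
import timeit

# Hypothetical minimal register models (names are illustrative,
# not from the actual MC6809 code base).

class PropertyRegister:
    def __init__(self):
        self._value = 0

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, v):
        # fixed-width conversion on every write, behind a function call
        self._value = v & 0xFF

class PlainRegister:
    __slots__ = ("value",)
    def __init__(self):
        self.value = 0

def bench_property(reg=PropertyRegister()):
    reg.value = reg.value + 1

def bench_plain(reg=PlainRegister()):
    # same masking, but inlined at the call site
    reg.value = (reg.value + 1) & 0xFF

t_prop = timeit.timeit(bench_property, number=100_000)
t_plain = timeit.timeit(bench_plain, number=100_000)
print(f"property: {t_prop:.4f}s  plain: {t_plain:.4f}s")
```

On CPython the plain-attribute version is typically noticeably faster, since each property access is an extra Python-level function call.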


jerch commented Aug 29, 2015

Some more speed tests:

  • Using ctypes c_ubyte and c_ushort types for the registers, to avoid the handwritten fixed-width conversions, is slightly slower than pure Python; I guess the ctypes conversion layer is heavier than the handcrafted version.
  • Dotname reduction: I only tried this with the CC register, moving all of its logic into the cpu class. Around 15% speedup (~2.3 million cycles/s). The main difference in the code is self.Z instead of self.cc.Z. The speed gain is somewhat impressive given that it is only one lookup shorter. This might be promising ground for refactoring; I am not sure how mixins and the MRO will behave here.
  • Implementing standard registers with fast builtin types: Replacing the register objects with a list like [value, name, width] shows around a 20% speedup (~2.4 million cycles/s). The downside is ugly code, with all those magic index numbers and extra instructions in the cpu code to replace .set, .increment and .decrement.
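The "registers as plain lists" idea from the last bullet could be sketched roughly as follows. All names here are my own, not taken from the PR's commits; index constants tame the magic numbers somewhat, and storing the wrap-around mask instead of the bit width saves a shift on every access.

```python
# index constants instead of bare magic numbers
VALUE, NAME, MASK = 0, 1, 2

def make_register(name, width_bits):
    # store the wrap-around mask instead of the bit width
    return [0, name, (1 << width_bits) - 1]

# inlined replacements for the old .set/.increment/.decrement methods
def reg_set(reg, value):
    reg[VALUE] = value & reg[MASK]

def reg_increment(reg, n=1):
    reg[VALUE] = (reg[VALUE] + n) & reg[MASK]

def reg_decrement(reg, n=1):
    reg[VALUE] = (reg[VALUE] - n) & reg[MASK]

accu_a = make_register("A", 8)
reg_set(accu_a, 0x1FF)    # wraps to 0xFF
reg_increment(accu_a)     # wraps to 0x00
```

In the hottest paths one would inline the `reg[VALUE] = ... & reg[MASK]` expressions directly into the cpu code, which is exactly where the "ugly code" complaint comes from.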

In summary, the biggest problems for your emulator in CPython are function calls, followed by dotname lookups. Speed-wise the best option would be one big cpu loop with all the state in the local namespace. Since Python has no low-level jumps (goto, switch) this is hardly doable, and any attempt will only lead to really ugly code. The best we have for code jumps are function mappings, at the cost of all the function-call overhead around them.
By the way, your elif cascade in .get_ea_indexed is O(n). With a mapping it would be O(1), but you would need a function as the jump target, which adds more constant overhead than you can gain. As a workaround you could stick with the if cascade and restructure it as a binary search.
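The elif-versus-mapping trade-off can be sketched like this. The post-byte values and offsets are invented for illustration, not taken from the real .get_ea_indexed; the point is only the shape of the two dispatch styles.

```python
def get_offset_elif(postbyte):
    # O(n) comparisons in the worst case, but each hit is just an
    # inline expression with no extra function call
    if postbyte == 0x00:
        return 0
    elif postbyte == 0x01:
        return 1
    elif postbyte == 0x02:
        return 2
    raise ValueError(hex(postbyte))

# O(1) hash lookup, but every target must be a callable, which adds a
# constant function-call overhead per dispatched instruction
_OFFSETS = {
    0x00: lambda cpu: 0,
    0x01: lambda cpu: 1,
    0x02: lambda cpu: 2,
}

def get_offset_mapped(cpu, postbyte):
    return _OFFSETS[postbyte](cpu)
```

With only a handful of cases the elif cascade usually wins in CPython; the dict only pays off once the cascade is long and the per-case work is substantial enough to amortize the call overhead.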


jedie commented Aug 31, 2015

Thanks for your contribution here!

Please add yourself to AUTHORS ;)

> Dotname reduction: Only tried with the CC-register by moving all its logic into the cpu class. Around 15% speedup (~2.3 Mio cycles/s). Main difference in code is self.cc.Z vs. self.Z. The speed gain is somewhat impressive regarding the fact that it is only one lookup shorter. This might be a promising refactoring ground, not sure how mixins with the MRO will do here.

That sounds like a nice idea!

> Implementing standard registers with fast builtin types: Replacing the register objects with a list like [value, name, width] shows around 20% speedup (~2.4 Mio. cycles/s). Downside is the ugly code with all those magic index numbers and more instructions in the cpu code for the .set, .increment and .decrement replacements.

Yes, ugly code for more speed is possible, but that is not my goal. If I wanted speed, I would not use Python :P


jerch commented Aug 31, 2015

Haha, yeah, Python should not be the first choice for number crunching. Nevertheless I started a new branch as a playground, just to see how far CPython can be pushed :D The benchmark is at 2.8 million cycles/s (23 million for PyPy) and the code is already quite unpythonic and degraded. Welcome to the big monster loop ;)
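The "big monster loop" pattern boils down to hoisting hot attributes into locals before the loop, since local variable access in CPython is an array index while every `self.x` is a dict lookup. A hedged sketch with invented names and a single made-up opcode, not the actual branch code:

```python
class Memory:
    """Tiny stub memory, just enough to demo the pattern."""
    def __init__(self, data):
        self._data = data

    def read(self, addr):
        return self._data[addr & 0xFFFF]

class CPU:
    def __init__(self, memory):
        self.memory = memory
        self.program_counter = 0
        self.accu_a = 0

def run(cpu, cycles):
    # hoist hot state and bound methods into locals once ...
    read = cpu.memory.read
    pc = cpu.program_counter
    a = cpu.accu_a
    while cycles > 0:
        # ... then only touch locals inside the loop
        opcode = read(pc)
        pc = (pc + 1) & 0xFFFF
        if opcode == 0x4C:        # hypothetical INCA handling
            a = (a + 1) & 0xFF
        cycles -= 1
    # write the mutated state back at the end
    cpu.program_counter = pc
    cpu.accu_a = a
```

The downside is exactly the degradation mentioned above: the state lives in two places, every exit path must write it back, and the loop body grows into one giant dispatch block.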


jedie commented Sep 1, 2015

> Welcome to the big monster loop ;)

Yes, that will be the fastest way ... with all variables local ;) ... Maybe a better idea is to generate the code. Look at:
https://github.com/6809/MC6809/blob/master/MC6809/components/MC6809data/MC6809_op_data.py

I used this data to generate the CPU skeleton, and I use it to generate: https://github.com/6809/MC6809/blob/master/MC6809/components/cpu_utils/instruction_call.py

The Instruction_generator.py is here: https://github.com/6809/MC6809/blob/master/MC6809/components/cpu_utils/Instruction_generator.py

Maybe it's possible to copy the real op-code implementations from https://github.com/6809/MC6809/blob/master/MC6809/components/cpu6809.py into MC6809_op_data.py.
Then all the information needed to generate a CPU class would be in one place.
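The code-generation idea could look roughly like this. The op-data table and template below are invented for illustration and are not the real MC6809_op_data format or Instruction_generator.py output; the point is only the mechanism of building method source from a data table and assembling a class from it.

```python
# invented mini op-data table (the real MC6809_op_data.py is far richer)
OP_DATA = [
    {"mnemonic": "NOP", "opcode": 0x12, "cycles": 2},
    {"mnemonic": "SYNC", "opcode": 0x13, "cycles": 4},
]

# source template for one generated instruction handler
TEMPLATE = '''\
def instruction_{mnemonic}(self):
    "generated handler for {mnemonic} (opcode 0x{opcode:02X})"
    self.cycles += {cycles}
'''

def generate_cpu_class(op_data):
    namespace = {}
    for op in op_data:
        # compile each handler's source into the class namespace
        exec(TEMPLATE.format(**op), namespace)
    namespace["__init__"] = lambda self: setattr(self, "cycles", 0)
    return type("GeneratedCPU", (), namespace)

GeneratedCPU = generate_cpu_class(OP_DATA)
```

In the real project the generator writes the source to instruction_call.py instead of exec-ing it, which keeps the generated code inspectable and importable.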

But this is not my intention ;)

Btw. around 880,000 CPU cycles/s is real-time. See: https://github.com/jedie/DragonPy/blob/master/dragonpy/Dragon32/gui_config.py#L77-L84

Btw. please add yourself to AUTHORS, so I can merge.

jedie added a commit that referenced this pull request Sep 3, 2015
ysei pushed a commit to ysei/MC6809-1 that referenced this pull request May 22, 2016