Optimise my Planar2Chunky Algorithm #18
If I understand it correctly, your planar2chunky algorithm performs bit-slicing. Starting from the values that have been read via bitplane DMA, you compute the color register indices by taking one bit from each value. Or, theoretically speaking: you are transposing a bit matrix. Transposing bit matrices seems to be a simple task, but it is not. The problem is that a large amount of bit shuffling is needed to get the bits into the right positions. To my knowledge, there are two ways to speed things up:
1) "Hacker's Delight" covers all kinds of low-level algorithms similar to the one that is needed here. If I remember correctly, the algorithm in this book beats the naive approach by a factor of 2.

2) With some trickery, bit matrices can be sliced quickly by using the SSE instruction set. It's a little tricky, because there is no slicing instruction that exactly fits our needs. However, this can be fixed by applying some SSE byte shuffling as a pre- and post-processing step. I demonstrated such an algorithm in my Embedded Software class last semester, and I also did some timing measurements to show the benefit. The result was that the SSE approach is about 4 times faster than the naive approach. The algorithm itself is already part of vAmiga (in sse_utils.cpp), but it needs to be tweaked a bit to fit exactly what is needed for the Amiga.

For Omega, variant 2) is not an option, I guess, because it contradicts the bare-metal philosophy. Hence, you should definitely have a look into "Hacker's Delight".
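For reference, here is a minimal C sketch of the classic 8x8 transpose in the style of "Hacker's Delight", next to the naive bit-by-bit loop. This is only an illustration, not the code from vAmiga or Omega, and the real Amiga case works on 16-bit bitplane words and up to 6 planes, so an 8x8 transpose is just the building block:

```c
#include <stdint.h>
#include <stdio.h>

/* Naive transpose: for each output column, pick one bit out of every row.
 * Convention: bit 7 of in[0] is the top-left element, so out[c] holds
 * column c of the matrix with its bits stored MSB-first. */
static void transpose8_naive(const uint8_t in[8], uint8_t out[8])
{
    for (int c = 0; c < 8; c++) {
        uint8_t col = 0;
        for (int r = 0; r < 8; r++)
            col |= (uint8_t)(((in[r] >> (7 - c)) & 1u) << (7 - r));
        out[c] = col;
    }
}

/* "Hacker's Delight"-style transpose: pack the 8 rows into one 64-bit word
 * and exchange progressively larger blocks of bits with three masked swaps
 * (2x2 elements, then 2x2 blocks, then 4x4 blocks). Same bit convention
 * as above, so both functions produce identical output. */
static void transpose8_hd(const uint8_t in[8], uint8_t out[8])
{
    uint64_t x = 0, t;
    for (int i = 0; i < 8; i++)
        x = (x << 8) | in[i];                 /* in[0] ends up in the top byte */

    t = (x ^ (x >> 7))  & 0x00AA00AA00AA00AAULL; x ^= t ^ (t << 7);
    t = (x ^ (x >> 14)) & 0x0000CCCC0000CCCCULL; x ^= t ^ (t << 14);
    t = (x ^ (x >> 28)) & 0x00000000F0F0F0F0ULL; x ^= t ^ (t << 28);

    for (int i = 7; i >= 0; i--) {            /* unpack the transposed rows */
        out[i] = (uint8_t)x;
        x >>= 8;
    }
}

int main(void)
{
    const uint8_t planes[8] = { 0x80, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC, 0xFE, 0xFF };
    uint8_t a[8], b[8];
    transpose8_naive(planes, a);
    transpose8_hd(planes, b);
    for (int i = 0; i < 8; i++)
        printf("%02X %02X\n", a[i], b[i]);    /* the two columns should match */
    return 0;
}
```

The word-wide version replaces 64 single-bit moves with a handful of shifts and masks, which is where the speed-up over the naive loop comes from.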
BTW, thanks a lot for pointing me to the Compiler Explorer website!! I didn't know that such a site existed. It's a brilliant tool for quickly figuring out the impact of code modifications (especially those that are meant to speed things up).
😃 👍
As I said, I'm not a professional programmer, so I have to use such sites when doing bare-metal coding so I can understand what the compiler is doing. This site is the reason why you will notice my sometimes strange use of variables, and why most of my switch()/case code has been replaced with function tables: I managed to halve the code size and increase performance this way. I was able to test both approaches and see what the various compilers I was using were outputting.
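To illustrate the kind of change meant here, a minimal sketch of the switch-versus-table trade-off might look like this (the register names and handlers are made-up examples, not the actual Omega code):

```c
#include <stdint.h>
#include <stdio.h>

typedef void (*reg_handler_t)(uint16_t value);

/* Hypothetical handlers for two chip registers. */
static void write_bplcon0(uint16_t value) { printf("BPLCON0 <- %04X\n", value); }
static void write_bplcon1(uint16_t value) { printf("BPLCON1 <- %04X\n", value); }
static void write_ignore(uint16_t value)  { (void)value; /* unmapped register */ }

/* switch-based dispatch: the compiler emits compare/branch or jump-table code. */
static void dispatch_switch(unsigned reg, uint16_t value)
{
    switch (reg) {
        case 0:  write_bplcon0(value); break;
        case 1:  write_bplcon1(value); break;
        default: write_ignore(value);  break;
    }
}

/* Table-based dispatch: one indexed load plus an indirect call, which can be
 * smaller and faster than a long switch with many cases. */
static reg_handler_t const handlers[2] = { write_bplcon0, write_bplcon1 };

static void dispatch_table(unsigned reg, uint16_t value)
{
    reg_handler_t h = (reg < 2) ? handlers[reg] : write_ignore;
    h(value);
}

int main(void)
{
    dispatch_switch(0, 0x1200);
    dispatch_table(1, 0x0030);
    return 0;
}
```

Compiler Explorer makes it easy to compare the generated code for both dispatch functions side by side.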
I really need to revisit this: now that I look at it again, it seems to be a relatively simple 2 by 8 -> 1 by 16 matrix transform... -edit- no, since it's to do with bit positions... it's a complex transform.
The Excel sheet looks OK to me. It computes the transposed bit matrix as it is supposed to.
Well, I used that sheet to compare the output of my code with my inputs! But I never got it to work right.
I've uploaded a zip file containing my bit matrix transposition case study. It runs three algorithms consecutively and compares their performance; on my machine, the SSE variant is more than 4 times faster than the standard approach. Please feel free to use it. The numbers in the output are the test-case values; they are printed so you can see that the algorithms do what they are supposed to do function-wise.
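One common way to slice a bit matrix with SSE2 is built around _mm_movemask_epi8. Below is a minimal sketch of that building block only; it is not the code from the zip or from vAmiga's sse_utils.cpp, which additionally need the byte-shuffle pre-/post-processing described above:

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* Transpose a 16x8 bit matrix: rows[i] holds one 8-bit row, and cols[c]
 * receives the 16 bits of column c (bit i of cols[c] = bit c of rows[i]). */
static void transpose_16x8_sse2(const uint8_t rows[16], uint16_t cols[8])
{
    __m128i v = _mm_loadu_si128((const __m128i *)rows);

    for (int c = 7; c >= 0; c--) {
        /* _mm_movemask_epi8 gathers the MSB of each of the 16 bytes into a
         * single 16-bit integer: that is one output column. */
        cols[c] = (uint16_t)_mm_movemask_epi8(v);

        /* Shift every byte left by one (byte-wise, no cross-byte carry) so
         * the next column bit moves into the MSB position. */
        v = _mm_add_epi8(v, v);
    }
}
```

Eight movemask/add pairs replace the per-bit shuffling of the scalar approach, which is roughly where the 4x speed-up comes from.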
Hey Dirk, this is pretty exciting stuff! I'm going to have to find the time to really sit down and see what's going on. Many thanks for sharing this.
My P2C code was written sometime in the evening of Christmas Day, 2017... Sitting on the floor of my girlfriend's parents' living room... After having eaten and drunk far too much 😄
It needs to be reviewed.
I've been using this to try and work out an optimisation...
https://godbolt.org/z/MJgMv2