
Optimise my Planar2Chunky Algorithm #18

Open
h5n1xp opened this issue Mar 3, 2019 · 9 comments

@h5n1xp
Owner

h5n1xp commented Mar 3, 2019

My P2C code was written sometime in the evening of Christmas Day, 2017... sitting on the floor of my girlfriend's parents' living room... after having eaten and drunk far too much 😄

It needs to be reviewed.

I've been using this to try and work out an optimisation...
https://godbolt.org/z/MJgMv2

@dirkwhoffmann

If I understand it correctly, your planar2chunky algorithm performs bit-slicing. Starting from the values that have been read via bitplane DMA, you compute the color register indices by taking a bit from each value. Or, theoretically speaking: you are transposing a bit matrix.

Transposing bit matrices seems to be a simple task, but it is not. The problem here is that a large amount of bit shuffling is needed to get the bits at the right position.
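
For concreteness, the naive approach gathers one bit per plane for every pixel. A minimal sketch (function and parameter names are illustrative, not taken from the emulator):

```c
#include <stdint.h>

// Naive planar-to-chunky: six planar 16-bit words in, sixteen chunky
// colour register indices out. Bit 15 of each word is the leftmost pixel.
void p2c_naive(const uint16_t plane[6], uint8_t chunky[16])
{
    for (int pixel = 0; pixel < 16; pixel++) {
        uint8_t index = 0;
        for (int p = 0; p < 6; p++) {
            // Plane p contributes bit p of the colour index.
            index |= ((plane[p] >> (15 - pixel)) & 1) << p;
        }
        chunky[pixel] = index;
    }
}
```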

To my knowledge there are two ways to speed things up:

  1. Use a conventional algorithm that is faster than the naive approach. Such an algorithm is described in this brilliant book:

[screenshot: “Hacker’s Delight” book cover]

The book covers all kinds of low-level algorithms similar to the one needed here.

If I remember correctly, the algorithm in this book beats the naive approach by a factor of 2 (a sketch follows below).

  2. Vectorise the task using SSE instructions.

With some trickery, bit matrices can be sliced quickly by using the SSE instruction set. It’s a little tricky, because there is no slicing instruction that exactly fits our needs. However, this can be fixed by applying some SSE byte shuffling as a pre- and post-processing step. I’ve demonstrated such an algorithm in my Embedded Software class last semester and I also did some timing measurements to show the benefit. The result was that the SSE approach is about 4 times faster than the naive approach. The algorithm itself is already part of vAmiga (in sse_utils.cpp), but it needs to be tweaked a bit to fit exactly what is needed for the Amiga.

For Omega, I guess variant 2) is not an option, because it contradicts the bare-metal philosophy. Hence, you should definitely have a look at “Hacker’s Delight”.
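
For reference, the algorithm in question is the 8×8 bit-matrix transpose from the "Rearranging Bits and Bytes" chapter. Typed from memory, so treat it as a sketch and double-check against the printed version:

```c
// 8x8 bit-matrix transpose in the style of "Hacker's Delight".
// A is the input matrix (one byte per row, row stride m),
// B the output (row stride n). It swaps 1-, 2- and 4-bit blocks
// instead of moving single bits around.
void transpose8(unsigned char A[8], int m, int n, unsigned char B[8])
{
    unsigned x, y, t;

    // Pack the eight input rows into two 32-bit words.
    x = (A[0] << 24)   | (A[m] << 16)   | (A[2*m] << 8) | A[3*m];
    y = (A[4*m] << 24) | (A[5*m] << 16) | (A[6*m] << 8) | A[7*m];

    t = (x ^ (x >> 7)) & 0x00AA00AA;  x = x ^ t ^ (t << 7);
    t = (y ^ (y >> 7)) & 0x00AA00AA;  y = y ^ t ^ (t << 7);

    t = (x ^ (x >> 14)) & 0x0000CCCC; x = x ^ t ^ (t << 14);
    t = (y ^ (y >> 14)) & 0x0000CCCC; y = y ^ t ^ (t << 14);

    t = (x & 0xF0F0F0F0) | ((y >> 4) & 0x0F0F0F0F);
    y = ((x << 4) & 0xF0F0F0F0) | (y & 0x0F0F0F0F);
    x = t;

    B[0]   = x >> 24; B[n]   = x >> 16; B[2*n] = x >> 8; B[3*n] = x;
    B[4*n] = y >> 24; B[5*n] = y >> 16; B[6*n] = y >> 8; B[7*n] = y;
}
```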

@dirkwhoffmann

BTW, thanks a lot for pointing me to the Compiler Explorer web site!!

I didn't know such a site existed. It's a brilliant tool for quickly figuring out the impact of code modifications (especially those meant to speed things up).

@h5n1xp
Owner Author

h5n1xp commented Mar 4, 2019

> If I understand it correctly, your planar2chunky algorithm performs bit-slicing. Starting from the values that have been read via bitplane DMA, you compute the color register indices by taking a bit from each value. Or, theoretically speaking: you are transposing a bit matrix.

😅 Yes, this is how I first envisioned it... I will try and dig out my original spreadsheets, where I tried to work out the matrix transformation involved.

> Transposing bit matrices seems to be a simple task, but it is not. The problem here is that a large amount of bit shuffling is needed to get the bits at the right position.

Yes, matrix maths in binary proved to be beyond my capabilities, so I just worked out a shifting-and-masking approach. Originally I used 64-bit variables, as I wanted to keep the whole loop in registers... but it became really messy, and looking through the compiler output showed I wasn't achieving the result I wanted. I put this down to my inexperience; the idea is a good one on 64-bit CPUs.

I'm an analyst by trade, not a professional programmer. I wish I had decided to become a programmer 20 years ago 😞

> To my knowledge there are two ways to speed things up:
>
>   1. Use a conventional algorithm that is faster than the naive approach. Such an algorithm is described in this brilliant book:
>
> [screenshot: “Hacker’s Delight” book cover]
>
> The book covers all kinds of low-level algorithms similar to the one needed here.

I have sourced a copy! Many thanks for the heads up. I now just need to find time to read it!

> If I remember correctly, the algorithm in this book beats the naive approach by a factor of 2.
>
>   2. Vectorise the task using SSE instructions.
>
> With some trickery, bit matrices can be sliced quickly by using the SSE instruction set. It’s a little tricky, because there is no slicing instruction that exactly fits our needs. However, this can be fixed by applying some SSE byte shuffling as a pre- and post-processing step. I’ve demonstrated such an algorithm in my Embedded Software class last semester and I also did some timing measurements to show the benefit. The result was that the SSE approach is about 4 times faster than the naive approach. The algorithm itself is already part of vAmiga (in sse_utils.cpp), but it needs to be tweaked a bit to fit exactly what is needed for the Amiga.
>
> For Omega, I guess variant 2) is not an option, because it contradicts the bare-metal philosophy. Hence, you should definitely have a look at “Hacker’s Delight”.

This is actually perfectly valid in the context of Omega, as long as any platform-specific optimisations exist only in functions found in the Host.c file.

I probably didn't make it very clear in my architecture document, but the Host.c file can include anything needed for the emulator to interact with the host.

Only host interactions can be found in that file; the internal operation of the emulator cannot depend upon anything found in the Host.c file. The emulator can therefore be thought of as a black box (a single class, as it were), where the only way to get anything in or out is via functions found in the Host.c file. There are a few violations, mostly around the floppy, at this time, but I consider this acceptable while I'm getting parts working; by V1.0 I hope to have this locked down. 😃 (*see below)

Since my two current targets, x86-64 and ARM64, both have vector instructions in their ISAs (coupled with the fact that I have a naive fallback function), I see no reason not to try and create better display code!
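
Something like this is what I have in mind for keeping the fast path behind the Host.c boundary (a sketch with made-up names, not actual Omega code):

```c
#include <stdint.h>

// The core sees only this function-pointer type; it never knows
// which implementation is behind it.
typedef void (*P2CFunc)(const uint16_t plane[6], uint8_t chunky[16]);

void planar2chunkyNaive(const uint16_t plane[6], uint8_t chunky[16]);
#ifdef __SSE2__
void planar2chunkySSE(const uint16_t plane[6], uint8_t chunky[16]);
#endif

P2CFunc hostP2C = planar2chunkyNaive;  // safe default: the naive fallback

// Host.c decides once, at start-up, which implementation to install.
void HostInitP2C(void)
{
#ifdef __SSE2__
    hostP2C = planar2chunkySSE;  // platform-specific optimisation lives here
#endif
}
```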

*OK, my dream of "encapsulating the emulator" gets a bit tricky when we start to think about sprites. I have worked out an optimisation where sprites are generated as OGL textures and composited with the bitplanes by the gfx hardware (the playfields would also be constructed from textures)... but this would need to be an architectural decision that would make the emulator dependent upon having 3D hardware, and violate my rules... so we are stuck with crap software sprite emulation 😭

@h5n1xp
Owner Author

h5n1xp commented Mar 4, 2019

> BTW, thanks a lot for pointing me to the Compiler Explorer web site!!

😃 👍

> I didn't know such a site existed. It's a brilliant tool for quickly figuring out the impact of code modifications (especially those meant to speed things up).

As I said, I'm not a professional programmer, so I have to use such sites when doing bare-metal coding, so I can understand what the compiler is doing. This site is the reason why you will notice my sometimes strange use of variables, and why most of my switch()/case code has been replaced with function tables; I managed to halve the code size and increase performance this way. I was able to test both approaches and see what the various compilers I was using were outputting.
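
As an illustration of the switch()-to-function-table change (made-up register handlers, not the real Omega code):

```c
#include <stdint.h>

typedef uint16_t (*ReadHandler)(void);

uint16_t readDMACONR(void);
uint16_t readINTENAR(void);
uint16_t readInvalid(void);

// One entry per register slot replaces a big switch(): the compiler
// emits a single indexed load and call instead of a compare/jump ladder.
static const ReadHandler readTable[256] = {
    [0x01] = readDMACONR,   // slot numbers are illustrative
    [0x0E] = readINTENAR,
};

uint16_t readRegister(uint8_t slot)
{
    ReadHandler h = readTable[slot];
    return h ? h() : readInvalid();  // unpopulated slots fall back
}
```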

@h5n1xp
Owner Author

h5n1xp commented Mar 4, 2019

I really need to revisit this:

p2cMatrix.xlsx

Now I look at it again, it seems to be a relatively simple 2×8 -> 1×16 matrix transform...

-edit- no, since it's to do with bit positions... it's a complex transform.
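
Concretely, even the 2-plane case turns into a bit interleave rather than an ordinary matrix product, which is what makes it awkward. A quick illustration (throwaway code, not from the emulator):

```c
#include <stdint.h>

// Two 8-pixel planar rows in, one 16-bit chunky row out:
// plane a = a7..a0, plane b = b7..b0, result = b7 a7 b6 a6 ... b0 a0,
// i.e. one 2-bit colour index per pixel.
uint16_t interleave2(uint8_t a, uint8_t b)
{
    uint16_t out = 0;
    for (int i = 0; i < 8; i++) {
        out |= (uint16_t)((a >> i) & 1) << (2 * i);      // plane 0 bit
        out |= (uint16_t)((b >> i) & 1) << (2 * i + 1);  // plane 1 bit
    }
    return out;
}
```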

@dirkwhoffmann

The Excel sheet looks OK to me. It computes the transposed bit matrix, as it is supposed to.

@h5n1xp
Owner Author

h5n1xp commented Mar 4, 2019

Well, I used that sheet to compare the output of my code with my inputs! But I never got it to work right.

@dirkwhoffmann

I've uploaded a zip file containing my bit matrix transposition case study:

SSE.zip

It runs three algorithms consecutively:

  • A naive bit slicing algorithm
  • The optimised version from "Hacker's Delight"
  • My SSE code

It compares performance and produces the following output on my machine:

```
SSE Experiments. Dirk Hoffmann 2019
Starting timer
Elapsed time: 0.390108 msec
31 7 11 3 13 5 9 17 30 6 10 2 12 4 8 16
Starting timer
Elapsed time: 0.22547 msec
31 7 11 3 13 5 9 17 30 6 10 2 12 4 8 16
Starting timer
Elapsed time: 0.0837 msec
31 7 11 3 13 5 9 17 30 6 10 2 12 4 8 16
Program ended with exit code: 0
```

The SSE variant is more than 4 times faster than the standard approach. Please feel free to use it.

The numbers are the test-case values; they are printed so that one can see that the algorithms do what they are supposed to do, function-wise.
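
The heart of the SSE variant is the movemask idiom. A stripped-down sketch of the idea (not a copy of the code in SSE.zip or sse_utils.cpp):

```c
#include <emmintrin.h>  // SSE2
#include <stdint.h>

// Transpose a 16x8 bit matrix: rows[i] is an 8-bit row; out[b] collects
// bit b of all sixteen rows. _mm_movemask_epi8 gathers the MSB of each
// byte, and a left shift exposes the next bit as the new MSB.
void slice16x8(const uint8_t rows[16], uint16_t out[8])
{
    __m128i v = _mm_loadu_si128((const __m128i *)rows);
    for (int bit = 7; bit >= 0; bit--) {
        out[bit] = (uint16_t)_mm_movemask_epi8(v);
        v = _mm_slli_epi64(v, 1);  // bits crossing byte borders never
                                   // reach an MSB within seven shifts
    }
}
```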

@h5n1xp
Owner Author

h5n1xp commented Mar 5, 2019

Hey Dirk, this is pretty exciting stuff! I'm going to have to find the time to really sit down and see what's going on.

Many thanks for sharing this.
