CP013: Create a motivational example (P1795) #129

AerialMantis · 2020-05-24T01:10:45Z

In our last discussion, we decided that we should create a motivational example of how a developer could use the topology discovery design proposed in P1795 to optimise an algorithm such as matrix multiply based on different system architectures.

mhoemmen · 2020-05-26T17:55:24Z

Do we plan on being able to look up cache sizes at different levels of the memory hierarchy? We could use Strassen for a simple example, and have it use the lowest-level cache size to decide when to stop recursing.

AerialMantis · 2020-05-27T20:08:47Z

I would like to have a property which reflects the various caches levels and their sizes. I'm not entirely sure how best to represent those in a generic way yet, perhaps through a hierarchy of managed memory resources which provide constructive/destructive interference.

Yeah, I like that idea, it would be a good example of using the topology information. So we would recursively divide the matrices into blocks until they fit into the lowest level cache and then compute one at a time, per group of threads sharing the cache.

It would be interesting to then further generalize this so that larger matrices could be subdivided across NUMA regions as well.

AerialMantis · 2020-06-08T14:45:43Z

I am working on a pseudo generic algorithm for this incorporating the various architecture agnostic information that we will need to be able to query, and this is a summary of what I have so far:

Establish where the matrices should be stored initially:
- Is there a global memory region representing a single unified address space that all high-level execution resources can share?
- Are there distinct memory regions for various execution resources, how can memory be migrated between them (copy, map, network communication)?
- What is the latency in moving between memory regions?
- Are the kind of memory regions and their latencies uniform, and can we hide latency through double buffering?
Establish the matrix subdivision size:
- What is the lowest latency memory resources available throughout the execution resources?
- What kind of memory regions are these, are they OS/kernel level managed, such as a cache or are they user-managed such as scratch-pad memory?
- How much memory is available to these memory resources?
- Is this uniform or do we need to look for the lowest common denominator?
Establish how much work to allocate to the various execution resources:
- What is the total available concurrency of the execution resources which share these memory resources?
- What is the peak performance of the execution resources which share these memory resources, say in FLOP/s?
- Is this uniform or does this need to be more dynamically assigned?

AerialMantis · 2020-06-12T18:05:01Z

From last heterogeneous C++ call:

We should focus first on generalizing for the CPU.
Another part of this algorithm should be to compare the computational complexity, in this case, matrix-matrix multiply and the performance of each execution resource against the cost of moving data to each execution resource, this will give us the necessary trade-off of whether or not to offload to accelerators such a GPU.
Another factor to consider in the future is the overhead invoking a callable on various executions resources, such as launching kernels on accelerators.

AerialMantis added the REM20 ISO C++ meeting (remote) 2020 label May 24, 2020

AerialMantis self-assigned this May 24, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CP013: Create a motivational example (P1795) #129

CP013: Create a motivational example (P1795) #129

AerialMantis commented May 24, 2020

mhoemmen commented May 26, 2020

AerialMantis commented May 27, 2020

AerialMantis commented Jun 8, 2020

AerialMantis commented Jun 12, 2020 •

edited

Loading

CP013: Create a motivational example (P1795) #129

CP013: Create a motivational example (P1795) #129

Comments

AerialMantis commented May 24, 2020

mhoemmen commented May 26, 2020

AerialMantis commented May 27, 2020

AerialMantis commented Jun 8, 2020

AerialMantis commented Jun 12, 2020 • edited Loading

AerialMantis commented Jun 12, 2020 •

edited

Loading