[L0 v2] implement native handle API for event #2354

igchor · 2024-11-19T21:44:17Z

No description provided.

github-actions · 2024-11-22T01:03:57Z

Compute Benchmarks level_zero_v2 run (with params: --compare baseline-v2):
https://github.com/oneapi-src/unified-runtime/actions/runs/11964457434

github-actions · 2024-11-22T01:34:20Z

Compute Benchmarks level_zero_v2 run (--compare baseline-v2):
https://github.com/oneapi-src/unified-runtime/actions/runs/11964457434
Job status: success. Test status: success.

Summary

No diffs to calculate performance change

(result is better)

Performance change in benchmark groups

Relative perf in group api (6): cannot calculate

Benchmark	This PR	baseline	baseline-v2
api_overhead_benchmark_sycl SubmitKernel out of order	21.371000 μs	23.101 μs	21.594 μs
api_overhead_benchmark_sycl SubmitKernel in order	24.461 μs	22.982000 μs	24.376 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024	2.097 μs	2.223 μs	1.836000 μs
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024	1.895 μs	1.646000 μs	1.846 μs
api_overhead_benchmark_ur SubmitKernel out of order	15.282 μs	14.164000 μs	14.588 μs
api_overhead_benchmark_ur SubmitKernel in order	14.703 μs	16.205 μs	11.955000 μs

Relative perf in group memory (3): cannot calculate

Benchmark	This PR	baseline	baseline-v2
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024	198.549 μs	225.948 μs	197.150000 μs
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024	91.086 μs	113.348 μs	85.506000 μs
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024	5.885 μs	5.839000 μs	5.881 μs

Relative perf in group miscellaneous (1): cannot calculate

Benchmark	This PR	baseline	baseline-v2
miscellaneous_benchmark_sycl VectorSum	804.435 μs	802.425000 μs	803.695 μs

Relative perf in group multithread (8): cannot calculate

Benchmark	This PR	baseline	baseline-v2
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1	3582.565 μs	6700.861 μs	3567.437000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1	8012.706000 μs	17388.983 μs	8045.880 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1	25003.635 μs	24916.792000 μs	25356.152 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1	1082.170 μs	1050.397000 μs	1075.116 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1	4587.532 μs	7557.060 μs	4547.306000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1	6426.939 μs	8417.786 μs	6415.747000 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1	24944.560000 μs	25260.363 μs	25211.590 μs
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1	1077.801 μs	1039.281000 μs	1071.033 μs

Relative perf in group Velocity-Bench (6): cannot calculate

Benchmark	This PR	baseline	baseline-v2
Velocity-Bench Hashtable	381.965 M keys/sec	379.536 M keys/sec	382.270813 M keys/sec
Velocity-Bench Bitcracker	35.235 s	35.219200 s	35.264 s
Velocity-Bench CudaSift	202.207 ms	204.502 ms	201.122000 ms
Velocity-Bench Easywave	250.000 ms	237.000 ms	232.000000 ms
Velocity-Bench QuickSilver	121.300 MMS/CTT	118.880 MMS/CTT	121.420000 MMS/CTT
Velocity-Bench Sobel Filter	516.153 ms	535.518 ms	514.658000 ms

Relative perf in group Runtime (8): cannot calculate

Benchmark	This PR	baseline	baseline-v2
Runtime_IndependentDAGTaskThroughput_SingleTask	176.380000 ms	260.413 ms	177.099 ms
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor	182.626000 ms	270.075 ms	184.523 ms
Runtime_IndependentDAGTaskThroughput_BasicParallelFor	181.516000 ms	276.832 ms	183.959 ms
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor	186.463000 ms	275.064 ms	188.117 ms
Runtime_DAGTaskThroughput_NDRangeParallelFor	1334.975 ms	1686.954 ms	1200.051000 ms
Runtime_DAGTaskThroughput_HierarchicalParallelFor	1363.164 ms	1708.538 ms	1220.445000 ms
Runtime_DAGTaskThroughput_SingleTask	1305.914 ms	1684.488 ms	1174.054000 ms
Runtime_DAGTaskThroughput_BasicParallelFor	1377.810 ms	1735.179 ms	1241.248000 ms

Relative perf in group MicroBench (14): cannot calculate

Benchmark	This PR	baseline	baseline-v2
MicroBench_HostDeviceBandwidth_2D_H2D_Strided	4.530000 ms	4.575 ms	4.547 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous	3.709000 ms	4.628 ms	3.716 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous	4.373 ms	4.373 ms	4.206000 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous	618.116000 ms	618.221 ms	618.154 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Strided	617.411 ms	617.428 ms	617.409000 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous	4.340000 ms	4.515 ms	4.455 ms
MicroBench_HostDeviceBandwidth_2D_D2H_Strided	617.395 ms	617.447 ms	617.380000 ms
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous	4.331000 ms	4.570 ms	4.457 ms
MicroBench_HostDeviceBandwidth_3D_H2D_Strided	4.553 ms	4.639 ms	4.515000 ms
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous	618.067000 ms	618.220 ms	618.092 ms
MicroBench_HostDeviceBandwidth_1D_H2D_Strided	4.473 ms	4.397000 ms	4.439 ms
MicroBench_HostDeviceBandwidth_1D_D2H_Strided	3.754000 ms	4.632 ms	3.773 ms
MicroBench_LocalMem_fp32_4096	29.854 ms	29.908 ms	29.832000 ms
MicroBench_LocalMem_int32_4096	29.936 ms	29.865 ms	29.849000 ms

Relative perf in group Pattern (10): cannot calculate

Benchmark	This PR	baseline	baseline-v2
Pattern_Reduction_Hierarchical_int32	17.002 ms	16.675000 ms	16.895 ms
Pattern_Reduction_NDRange_int32	16.952 ms	16.906 ms	16.867000 ms
Pattern_SegmentedReduction_NDRange_int64	2.347 ms	2.342000 ms	2.346 ms
Pattern_SegmentedReduction_NDRange_int32	2.167000 ms	2.170 ms	2.169 ms
Pattern_SegmentedReduction_Hierarchical_int64	11.777 ms	11.772000 ms	11.778 ms
Pattern_SegmentedReduction_Hierarchical_int32	11.605 ms	11.592000 ms	11.598 ms
Pattern_SegmentedReduction_Hierarchical_fp32	11.597 ms	11.589000 ms	11.595 ms
Pattern_SegmentedReduction_NDRange_fp32	2.164 ms	2.166 ms	2.163000 ms
Pattern_SegmentedReduction_NDRange_int16	2.254 ms	2.270 ms	2.252000 ms
Pattern_SegmentedReduction_Hierarchical_int16	11.793000 ms	11.803 ms	11.799 ms

Relative perf in group ScalarProduct (6): cannot calculate

Benchmark	This PR	baseline	baseline-v2
ScalarProduct_NDRange_fp32	3.809 ms	3.752000 ms	3.812 ms
ScalarProduct_Hierarchical_int32	10.321000 ms	10.331 ms	10.326 ms
ScalarProduct_Hierarchical_fp32	9.959 ms	9.943000 ms	9.973 ms
ScalarProduct_NDRange_int64	5.489 ms	5.435000 ms	5.496 ms
ScalarProduct_Hierarchical_int64	11.326 ms	11.314000 ms	11.361 ms
ScalarProduct_NDRange_int32	3.823 ms	3.753000 ms	3.809 ms

Relative perf in group USM (7): cannot calculate

Benchmark	This PR	baseline	baseline-v2
USM_Allocation_latency_fp32_host	37.388000 ms	37.410 ms	37.643 ms
USM_Allocation_latency_fp32_shared	0.061 ms	0.053000 ms	0.061 ms
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch	1.007000 ms	1.029 ms	1.039 ms
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch	1.313000 ms	1.643 ms	1.321 ms
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch	1.166000 ms	1.190 ms	1.196 ms
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch	1.561000 ms	1.804 ms	1.572 ms
USM_Allocation_latency_fp32_device	-	0.066000 ms	-

Relative perf in group VectorAddition (3): cannot calculate

Benchmark	This PR	baseline	baseline-v2
VectorAddition_int32	1.498 ms	1.454000 ms	1.495 ms
VectorAddition_int64	3.095 ms	3.063000 ms	3.103 ms
VectorAddition_fp32	1.494 ms	1.462000 ms	1.494 ms

Relative perf in group Polybench (3): cannot calculate

Benchmark	This PR	baseline	baseline-v2
Polybench_2mm	1.210000 ms	1.212 ms	1.213 ms
Polybench_3mm	1.804 ms	1.737000 ms	1.814 ms
Polybench_Atax	6.704000 ms	6.884 ms	6.818 ms

Relative perf in group Kmeans (1): cannot calculate

Benchmark	This PR	baseline	baseline-v2
Kmeans_fp32	16.057 ms	16.052000 ms	16.062 ms

Relative perf in group LinearRegressionCoeff (1): cannot calculate

Benchmark	This PR	baseline	baseline-v2
LinearRegressionCoeff_fp32	702.790 ms	859.455 ms	688.918000 ms

Relative perf in group MolecularDynamics (1): cannot calculate

Benchmark	This PR	baseline	baseline-v2
MolecularDynamics	0.027 ms	0.026000 ms	0.026 ms

Relative perf in group llama.cpp (6): cannot calculate

Benchmark	This PR	baseline	baseline-v2
llama.cpp Prompt Processing Batched 256	934.185 token/s	900.321 token/s	951.883120 token/s
llama.cpp Text Generation Batched 256	65.293023 token/s	62.640 token/s	65.291 token/s
llama.cpp Text Generation Batched 512	65.289 token/s	62.698 token/s	65.498832 token/s
llama.cpp Prompt Processing Batched 512	486.495 token/s	459.584 token/s	487.952260 token/s
llama.cpp Prompt Processing Batched 128	818.424187 token/s	781.978 token/s	709.163 token/s
llama.cpp Text Generation Batched 128	65.120 token/s	62.634 token/s	65.337827 token/s
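
Every group above reports "cannot calculate", so a per-row ratio has to be worked out by hand. The following is a minimal sketch of one way to do that, not the benchmark scripts' own method. It assumes rows are tab-separated as in the tables above, that time units (μs, ms, s) are lower-is-better, and that every other unit is higher-is-better. The names `parse_cell` and `relative_perf` are hypothetical.

```python
import re

# Assumption: for these units a smaller value means better performance.
TIME_UNITS = {"μs", "ms", "s"}

def parse_cell(cell):
    """Split a cell such as '21.371000 μs' into (value, unit).

    Returns None for cells without a measurement, e.g. '-'.
    """
    m = re.fullmatch(r"\s*([0-9.]+)\s+(.+?)\s*", cell)
    if m is None:
        return None
    return float(m.group(1)), m.group(2)

def relative_perf(row, column=3):
    """Ratio of 'This PR' (column 1) against another column of the same row.

    column=2 compares against 'baseline', column=3 against 'baseline-v2'.
    A result above 1.0 means this PR is better under the unit assumption
    stated above. Returns None when either cell has no measurement.
    """
    cells = row.split("\t")
    pr = parse_cell(cells[1])
    ref = parse_cell(cells[column])
    if pr is None or ref is None:
        return None
    pr_val, unit = pr
    ref_val, _ = ref
    if unit in TIME_UNITS:
        return ref_val / pr_val   # lower time is better
    return pr_val / ref_val       # higher throughput is better

row = "api_overhead_benchmark_sycl SubmitKernel out of order\t21.371000 μs\t23.101 μs\t21.594 μs"
ratio_vs_baseline = relative_perf(row, column=2)
```

For the `SubmitKernel out of order` row this gives 23.101 / 21.371 against `baseline`. Rows with a missing measurement, such as `USM_Allocation_latency_fp32_device`, yield `None` instead of a ratio.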

Output:

---------> BitCracker: BitLocker password cracking tool <---------

==================================
Retrieving Info

Reading hash file "/home/pmdk/bench_workdir/velocity-bench-repo/bitcracker/hash_pass/img_win8_user_hash.txt"

              Attack

================================================
Type of attack: User Password
Psw per thread: 1
max_num_pswd_per_read: 60000
Dictionary: /home/pmdk/bench_workdir/velocity-bench-repo/bitcracker/hash_pass/user_passwords_60000.txt
MAC Comparison (-m): Yes

Iter: 1, num passwords read: 60000
Kernel execution:
Effective passwords: 60000
Passwords Range:
npknpByH7N2m3OnLNH1X9DJxLrzIFWk
.....
dL_7uuf3QCz-c6K3xDu0

================================================
Bitcracker attack completed
Total passwords evaluated: 60000
Password not found!

time to subtract from total: 0.00406402 s
bitcracker - total time for whole calculation: 35.2354 s

Velocity-Bench CudaSift

Environment Variables:

Command:

/home/pmdk/bench_workdir/cudaSift/cudaSift

Output:

UNKN:

UNKN: ==================================================
UNKN: User input parameters:
UNKN: Trace: ../../inputData
UNKN: ==================================================
UNKN:

Image size = (1920,1080)
Initializing data...
Number of original features: 3683 3933
Number of matching features: 1105 1262 30.0027% 1 2