[CI][Failure] shared_ptr_base.h:199:9: runtime error: member call on address which does not point to an object of type 'std::_Sp_counted_base<>' #3192
Comments
@BrianHarrisonAMD @atamazov I suspect it must be one of these:
Update: confirmed it is due to |
@junliume @amberhassaan @DrizztDoUrden AFAICS, UB is related to hipFree. I recommend checking if reverting #2524 resolves the issue. |
Unfortunately, in my short experiment reverting #2524 does not resolve this issue. We do see lots of warning messages like:
|
IIRC, we sometimes need to know the amount of free GPU memory and use hipMemGetInfo to query it. In some cases this HIP function fails, and I have no idea why. The workaround (which issues a warning and simply returns a fixed value) was introduced in #2333, 6477e68. I suspect that the HIP runtime failure is caused by a combination of a severely outdated base driver, a new ROCm in Docker, and some target ASICs. I think we need some assistance from the HIP runtime team. |
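For readers unfamiliar with the workaround described above, here is a minimal sketch of the "warn and return a fixed value" pattern. It is not the code from #2333: `MemInfoQuery`, `FreeMemoryOrFallback`, and `kFallbackFreeBytes` are hypothetical names, the fallback size is an arbitrary assumption, and the query is passed in as a callable so the sketch compiles without HIP. With HIP available, the callable would wrap `hipMemGetInfo(size_t* free, size_t* total)` and treat any result other than `hipSuccess` as a failure.

```cpp
#include <cstddef>
#include <functional>
#include <iostream>

// Hypothetical stand-in for the HIP query: fills both outputs and returns 0
// on success, non-zero on failure. Not part of any real API.
using MemInfoQuery = std::function<int(std::size_t* free_bytes, std::size_t* total_bytes)>;

// Assumed fallback value (1 GiB); the thread does not say which value is used.
constexpr std::size_t kFallbackFreeBytes = std::size_t{1} << 30;

// Returns the reported free memory, or the fixed fallback (with a warning on
// stderr) when the query fails.
std::size_t FreeMemoryOrFallback(const MemInfoQuery& query)
{
    std::size_t free_bytes  = 0;
    std::size_t total_bytes = 0;
    const int status        = query(&free_bytes, &total_bytes);
    if(status != 0)
    {
        std::cerr << "Warning: memory query failed with status " << status << ", assuming "
                  << kFallbackFreeBytes << " bytes free\n";
        return kFallbackFreeBytes;
    }
    return free_bytes;
}
```

The repeated warnings mentioned earlier in the thread would correspond to the failure branch here: each failed query logs once and the caller proceeds with the assumed value.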
@junliume ...but I do not think this is related to this specific issue with UB. |
Not sure if this was already known, but I tracked it down to the test_find_db test suite, and it appears to come from the following call in solver_finders.cpp:
std::transform(
    finders.begin(), finders.end(), std::inserter(solutions, solutions.end()), [&](auto&& f) {
        return std::make_pair(f->GetAlgorithmName(problem),
                              f->Find(ctx, problem, invoke_ctx, parameters, options));
    });
Calling Find on the finders seems to be what causes this issue in the test. |
Update: for the find_db.cpp test, I changed it to run only the forward test, and narrowed the issue down to miopen::solver::conv::ConvMlirIgemmFwdXdlops. This change to mlo_dir_conv.cpp fixes the forward test for me:
static auto GetImplicitGemmSolvers()
{
return miopen::solver::SolverContainer<
miopen::solver::conv::ConvHipImplicitGemmForwardV4R5Xdlops,
miopen::solver::conv::ConvHipImplicitGemmForwardV4R4Xdlops,
miopen::solver::conv::ConvHipImplicitGemmForwardV4R4Xdlops_Padded_Gemm,
miopen::solver::conv::ConvHipImplicitGemmBwdDataV4R1Xdlops,
miopen::solver::conv::ConvHipImplicitGemmBwdDataV1R1Xdlops,
miopen::solver::conv::ConvHipImplicitGemmV4R1Fwd,
miopen::solver::conv::ConvHipImplicitGemmV4R4Fwd,
// miopen::solver::conv::ConvMlirIgemmFwdXdlops,
miopen::solver::conv::ConvMlirIgemmFwd,
miopen::solver::conv::ConvMlirIgemmBwdXdlops,
miopen::solver::conv::ConvMlirIgemmBwd,
miopen::solver::conv::ConvHipImplicitGemmBwdDataV1R1,
miopen::solver::conv::ConvHipImplicitGemmBwdDataV4R1,
miopen::solver::conv::ConvAsmImplicitGemmV4R1DynamicFwd_1x1,
miopen::solver::conv::ConvAsmImplicitGemmV4R1DynamicFwd,
miopen::solver::conv::ConvAsmImplicitGemmV4R1DynamicBwd,
miopen::solver::conv::ConvAsmImplicitGemmGTCDynamicFwdXdlops,
miopen::solver::conv::ConvAsmImplicitGemmGTCDynamicBwdXdlops,
miopen::solver::conv::ConvAsmImplicitGemmGTCDynamicFwdXdlopsNHWC,
miopen::solver::conv::ConvAsmImplicitGemmGTCDynamicBwdXdlopsNHWC,
miopen::solver::conv::ConvCkIgemmFwdV6r1DlopsNchw,
#if MIOPEN_BACKEND_HIP && MIOPEN_USE_COMPOSABLEKERNEL
miopen::solver::conv::ConvHipImplicitGemmFwdXdlops,
miopen::solver::conv::ConvHipImplicitGemmBwdXdlops,
miopen::solver::conv::ConvHipImplicitGemmGroupFwdXdlops,
miopen::solver::conv::ConvHipImplicitGemmGroupBwdXdlops,
miopen::solver::conv::ConvHipImplicitGemm3DGroupFwdXdlops,
miopen::solver::conv::ConvHipImplicitGemm3DGroupBwdXdlops,
miopen::solver::conv::ConvHipImplicitGemmF16F8F16FwdXdlops,
miopen::solver::conv::ConvHipImplicitGemmF16F8F16BwdXdlops,
#endif // MIOPEN_BACKEND_HIP && MIOPEN_USE_COMPOSABLEKERNEL
miopen::solver::conv::ConvAsmImplicitGemmGTCDynamicFwdDlopsNCHWC>{};
}
Going to dig a bit deeper to see what the issue is with that one solver.
Edit: Looks like the issue happens for me if I call any of the miir APIs, and goes away if I prevent those calls from happening. This line alone is enough to trigger the issue for me:
miirCreateHandle(params.c_str());
Looks like it's due to the params the handle is created with, but I'm not sure yet what caused this to become an issue now. |
Adding a branch to suppress the UBSan errors, since they come from MLIR handle creation and our options are limited because we are using an older version. The PR with the suppression changes is up: #3198 |
@BrianHarrisonAMD , @junliume : Do we know what causes the error? It can't be that |
@amberhassaan shared_ptr_base.h isn't the issue, but it's where the UBSan error is reported during teardown of the application, and suppressing it at that location is the only way I could find to silence the error. The issue can be narrowed down to just creating an MLIR handle with nothing else happening (I made a reproducer for that), and appears to be due to something in MLIR cleaning up static memory during exit. |
Another byproduct of #3181
LastTest.log
The error message:
[How to reproduce]:
cmake command:
and then