You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I first noticed this issue when I saw that diagnostics messages from arm64 machines sometimes arrive, but only infrequently (between 0% and 10% of the time), and eventually the following message comes from the Diagnostic Aggregator running on amd64 and it appears all messages are dropped.
[ERROR] [1564529964.093868659 /diag_agg] [/tmp/binarydeb/ros-kinetic-roscpp-1.12.14/src/libros/transport_publisher_link.cpp:TransportPublisherLink::onMessageLength:175]: a message of over a gigabyte was predicted in tcpros. that seems highly unlikely, so I'll assume protocol synchronization is lost.
At first, I though it was Endianness, but all machines are Little Endian. There are also C++ nodes which are able to communicate with each other properly on all machines. It also appears as though this only happens with the Diagnostic Updater, not any other topic.
After this, I started running a test: just running roscore and a single test node with the following code.
Everything works fine with amd64, but on arm64, the above issues happen. Additionally, the node just eventually crashes with std::bad_alloc. Here is the backtrace and relevant message that was published (as it looks like the error was with serialization)
#13 0x0000000000419084 in diagnostic_updater::Updater::publish (this=this@entry=0x7fffffecc0, status_vec=std::vector of length 1, capacity 1 = {...})
at /opt/ros/kinetic/include/diagnostic_updater/diagnostic_updater.h:547
547 publisher_.publish(msg);
(gdb) list
542 node_name_.substr(1) + std::string(": ") + iter->name;
543 }
544 diagnostic_msgs::DiagnosticArray msg;
545 msg.status = status_vec;
546 msg.header.stamp = ros::Time::now(); // Add timestamp for ROS 0.10
547 publisher_.publish(msg);
548 }
549
550 /**
551 * Publishes on /diagnostics and reads the diagnostic_period parameter.
(gdb) p msg
$7 = {header = {seq = 0, stamp = {<ros::TimeBase<ros::Time, ros::Duration>> = {sec = 1564612119, nsec = 323701618}, <No data fields>}, frame_id = ""}, status = std::vector of length 1, capacity 1 = {{level = 0 '\000', name = "updater_node_1564612117153179612: Frequency Status", message = "Desired frequency met", hardware_id = "none",
values = std::vector of length 7, capacity 7 = {{key = "Events in window", value = "22"}, {key = "Events since startup", value = "22"}, {key = "Duration of window (s)", value = "2.100290"}, {key = "Actual frequency (Hz)", value = "10.474742"}, {key = "Target frequency (Hz)", value = "10.000000"}, {
key = "Minimum acceptable frequency (Hz)", value = "9.000000"}, {key = "Maximum acceptable frequency (Hz)", value = "11.000000"}}}}}
(gdb) bt
#0 0x0000007fb7a04528 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1 0x0000007fb7a059e0 in __GI_abort () at abort.c:89
#2 0x0000007fb7bde254 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/aarch64-linux-gnu/libstdc++.so.6
#3 0x0000007fb7bdbdc4 in ?? () from /usr/lib/aarch64-linux-gnu/libstdc++.so.6
#4 0x0000007fb7bdbe10 in std::terminate() () from /usr/lib/aarch64-linux-gnu/libstdc++.so.6
#5 0x0000007fb7bdc0d4 in __cxa_throw () from /usr/lib/aarch64-linux-gnu/libstdc++.so.6
#6 0x0000007fb7bdc6d8 in operator new(unsigned long) () from /usr/lib/aarch64-linux-gnu/libstdc++.so.6
#7 0x000000000040f840 in ros::serialization::serializeMessage<diagnostic_msgs::DiagnosticArray_<std::allocator<void> > > (message=...) at /opt/ros/kinetic/include/ros/serialization.h:795
#8 0x000000000040d60c in boost::_bi::list1<boost::reference_wrapper<diagnostic_msgs::DiagnosticArray_<std::allocator<void> > const> >::operator()<ros::SerializedMessage, ros::SerializedMessage (*)(diagnostic_msgs::DiagnosticArray_<std::allocator<void> > const&), boost::_bi::list0> (f=<optimized out>, a=<synthetic pointer>...,
this=<optimized out>) at /usr/include/boost/function/function_template.hpp:129
#9 boost::_bi::bind_t<ros::SerializedMessage, ros::SerializedMessage (*)(diagnostic_msgs::DiagnosticArray_<std::allocator<void> > const&), boost::_bi::list1<boost::reference_wrapper<diagnostic_msgs::DiagnosticArray_<std::allocator<void> > const> > >::operator() (this=<optimized out>) at /usr/include/boost/bind/bind.hpp:893
#10 boost::detail::function::function_obj_invoker0<boost::_bi::bind_t<ros::SerializedMessage, ros::SerializedMessage (*)(diagnostic_msgs::DiagnosticArray_<std::allocator<void> > const&), boost::_bi::list1<boost::reference_wrapper<diagnostic_msgs::DiagnosticArray_<std::allocator<void> > const> > >, ros::SerializedMessage>::invoke (
function_obj_ptr=...) at /usr/include/boost/function/function_template.hpp:138
#11 0x0000007fb7efc8b0 in ros::TopicManager::publish(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, boost::function<ros::SerializedMessage ()> const&, ros::SerializedMessage&) () from /opt/ros/kinetic/lib/libroscpp.so
#12 0x000000000040fe14 in ros::Publisher::publish<diagnostic_msgs::DiagnosticArray_<std::allocator<void> > > (this=this@entry=0x7fffffee88, message=...) at /usr/include/c++/8/new:169
#13 0x0000000000419084 in diagnostic_updater::Updater::publish (this=this@entry=0x7fffffecc0, status_vec=std::vector of length 1, capacity 1 = {...}) at /opt/ros/kinetic/include/diagnostic_updater/diagnostic_updater.h:547
#14 0x000000000040c5ac in diagnostic_updater::Updater::force_update (this=0x7fffffecc0) at /opt/ros/kinetic/include/diagnostic_updater/diagnostic_updater.h:440
#15 diagnostic_updater::Updater::update (this=0x7fffffecc0) at /opt/ros/kinetic/include/diagnostic_updater/diagnostic_updater.h:390
#16 main (argc=<optimized out>, argv=<optimized out>) at /ws/src/test/src/updater_node.cpp:20
It crashes after ~30 seconds but seems to do so more quickly if multiple of the nodes are running.
As specified in the title, this only happens with GCC optimization level 3 (compiling with -O3 or CMAKE_BUILD_TYPE=Release). I tried with -O2 and it appears to work fine.
I doubt this can be easily fixed and I'm not 100% sure if the issue is within this repo or ros_comm, but any ideas would be greatly appreciated. We really would like to use -O3 throughout our code for improved performance.
The text was updated successfully, but these errors were encountered:
PR ros/roscpp_core#136 should fix this. I'm looking for someone who could verify. Just please notice that Focal now has GCC 9.4 by default where I could not reproduce the issue. So the test would need to be done with GCC 9.3 installed explicitly.
The issue is a complicated one, but here goes.
I first noticed this issue when I saw that diagnostics messages from arm64 machines sometimes arrive, but only infrequently (between 0% and 10% of the time), and eventually the following message comes from the Diagnostic Aggregator running on amd64 and it appears all messages are dropped.
At first, I though it was Endianness, but all machines are Little Endian. There are also C++ nodes which are able to communicate with each other properly on all machines. It also appears as though this only happens with the Diagnostic Updater, not any other topic.
After this, I started running a test: just running roscore and a single test node with the following code.
Everything works fine with amd64, but on arm64, the above issues happen. Additionally, the node just eventually crashes with
std::bad_alloc
. Here is the backtrace and relevant message that was published (as it looks like the error was with serialization)It crashes after ~30 seconds but seems to do so more quickly if multiple of the nodes are running.
As specified in the title, this only happens with GCC optimization level 3 (compiling with
-O3
orCMAKE_BUILD_TYPE=Release
). I tried with-O2
and it appears to work fine.I doubt this can be easily fixed and I'm not 100% sure if the issue is within this repo or ros_comm, but any ideas would be greatly appreciated. We really would like to use
-O3
throughout our code for improved performance.The text was updated successfully, but these errors were encountered: