Added functions Nodelet::ok() and Nodelet::requestStop() #116

Open · wants to merge 7 commits into noetic-devel from stateful
Conversation


@peci1 peci1 commented Feb 4, 2022

Could help resolve ros/geometry2#381.

This function tells whether it is okay to use the nodelet. The state changes as follows:

  1. initially false
  2. just before onInit() is called -> true
  3. when requestStop() is called or the destructor starts -> false

This helps client code recognize that the nodelet was asked to be unloaded and that it should stop all work. It is basically a nodelet variant of ros::ok().
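
For illustration, here is a minimal sketch of what such a pair of functions could look like (a sketch under assumed internals; the mutex and member names are assumptions, not necessarily the actual code of this PR):

#include <mutex>

class Nodelet
{
public:
  // True between the point just before onInit() and the first call to
  // requestStop() (or the start of destruction).
  bool ok() const
  {
    std::lock_guard<std::mutex> lock(okMutex_);
    return ok_;
  }

  // Called e.g. by the nodelet manager to tell the nodelet it is being
  // unloaded and should stop all work.
  void requestStop()
  {
    std::lock_guard<std::mutex> lock(okMutex_);
    ok_ = false;
  }

private:
  mutable std::mutex okMutex_;
  bool ok_ {false};  // flipped to true right before onInit() is called
};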

The ManagedNodelet class from loader.cpp calls requestStop() in its destructor to tell the nodelet it is being stopped. I verified the sequence of destructor calls when using the standard nodelet loader: ManagedNodelet is destroyed immediately when unload() is called, while an additional copy of the NodeletPtr it contains is created here and kept until all running callbacks have finished. So callbacks that are running while the nodelet is unloaded will see the change from ok() == true to ok() == false. The nodelet destructor will not be called until these callbacks finish, so it is safe to access the nodelet's methods and state in them.

The call to requestStop() in ~Nodelet() is maybe superfluous, as it might be too late to set ok() = false when all derived-class destructors have already run. The call is definitely harmless, but it might give the impression that it is okay not to call requestStop() and to rely on its being called from the nodelet's destructor. It was mainly meant as a safeguard against forgetting to call requestStop(), but maybe it's the wrong safeguard. I can remove the call from the destructor if more people agree on that.

One of the cases where an ok() function is painfully missing is the above-linked issue. If the TF buffer knew when the nodelet is asked to be unloaded, it could stop its lookup wait loop even when ROS time is paused. Currently, when time is paused, the loop keeps running indefinitely with no way of knowing that the nodelet was asked to unload...
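
For illustration, client code in a nodelet could then poll for the transform in short wall-clock steps and bail out when ok() turns false (a hedged sketch assuming a tf2_ros::Buffer member named buffer_; this is not code from geometry2):

// Waits for a transform without blocking forever when ROS time is paused.
bool waitForTransform(const std::string& target, const std::string& source,
                      const ros::Time& stamp)
{
  while (this->ok())
  {
    if (buffer_.canTransform(target, source, stamp))  // non-blocking check
      return true;
    ros::WallDuration(0.05).sleep();  // wall-clock sleep works even with paused ROS time
  }
  return false;  // the nodelet was asked to stop
}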

This PR doesn't break ABI, and the API is only extended, not changed or removed.

This PR comes with no tests yet, but I will add them once there is a 👍 from the maintainers that this functionality is desired and could be merged.

@peci1 peci1 force-pushed the stateful branch 2 times, most recently from c3b4c55 to ab95aa5 on February 5, 2022 13:57
@peci1 peci1 changed the title Added function Nodelet::ok() Added functions Nodelet::ok() and Nodelet::requestStop() Feb 5, 2022

peci1 commented Feb 5, 2022

This is how I tested the sequence of destructor calls (together with the basic functioning of this PR):

// Includes added for completeness. gettid() needs glibc >= 2.30;
// on older systems use syscall(SYS_gettid) instead.
#include <unistd.h>

#include <boost/function.hpp>
#include <boost/make_shared.hpp>
#include <boost/shared_ptr.hpp>

#include <nodelet/loader.h>
#include <nodelet/nodelet.h>
#include <ros/ros.h>
#include <std_msgs/Header.h>

class MyNodelet : public nodelet::Nodelet
{
public:
  ~MyNodelet()
  {
    ROS_INFO("~MyNodelet start %d", gettid());
    ros::WallDuration(3).sleep();
    ROS_INFO("~MyNodelet end %d", gettid());
  }

  void onInit() override
  {
    ROS_INFO("onInit start");
    this->sub = this->getNodeHandle().subscribe("/a", 1, &MyNodelet::cb, this);
    ros::WallDuration(1).sleep();
    ROS_INFO("onInit end");
  }

  void cb(const std_msgs::Header& m)
  {
    ROS_INFO("cb start %d %s", gettid(), this->ok() ? "1" : "0");
    ros::WallDuration(10).sleep();
    ROS_INFO("cb end %d %s", gettid(), this->ok() ? "1" : "0");
  }

  ros::Subscriber sub;
};

int main(int argc, char **argv)
{
  ros::init(argc, argv, "test");
  ros::NodeHandle nh;
  {
    // The loader runs its own callback queue threads, so no ros::spin() is needed.
    nodelet::Loader l(boost::function<boost::shared_ptr<nodelet::Nodelet> (const std::string&)>(
      [](const std::string& lookup_name)
      {
        return boost::make_shared<MyNodelet>();
      }));

    if (!l.load("my_nodelet", "MyNodelet", {}, {}))
    {
      ROS_ERROR("Failed to load nodelet");
      return 1;
    }

    auto pub = nh.advertise<std_msgs::Header>("/a", 1);
    ros::WallDuration(1).sleep();

    ROS_INFO("%u", pub.getNumSubscribers());

    ROS_INFO("pub %d", gettid());
    pub.publish(std_msgs::Header());

    ros::WallDuration(2).sleep();

    ROS_INFO("before unload %d", gettid());
    l.unload("my_nodelet");
    ROS_INFO("after unload %d", gettid());

    ros::WallDuration(10).sleep();
    ROS_INFO("~l start %d %lu", gettid(), l.listLoadedNodelets().size());
  }
  ROS_INFO("~l end %d", gettid());
}

This is what is printed to the console, showing that the MyNodelet destructor waits for the callback to finish and is called from a worker thread:

[ INFO] [1644070254.067059733] [ros.cras_cpp_common]: onInit start
[ INFO] [1644070255.069695258] [ros.cras_cpp_common]: onInit end
[ INFO] [1644070256.071366624] [ros.cras_cpp_common]: 1
[ INFO] [1644070256.071402383] [ros.cras_cpp_common]: pub 32255
[ INFO] [1644070256.071821151] [ros.cras_cpp_common]: cb start 32262 1
[ INFO] [1644070258.071548549] [ros.cras_cpp_common]: before unload 32255
[ INFO] [1644070258.071664136] [ros.nodelet]: ~ManagedNodlet start
[ INFO] [1644070258.071724619] [ros.nodelet]: ~ManagedNodelet end
[ INFO] [1644070258.071794670] [ros.cras_cpp_common]: after unload 32255
[ INFO] [1644070266.071987316] [ros.cras_cpp_common]: cb end 32262 0
[ INFO] [1644070266.072064560] [ros.cras_cpp_common]: ~MyNodelet start 32262
[ INFO] [1644070268.071997780] [ros.cras_cpp_common]: ~l start 32255 0
[ INFO] [1644070269.072254651] [ros.cras_cpp_common]: ~MyNodelet end 32262
[ INFO] [1644070269.073794651] [ros.cras_cpp_common]: ~l end 32255

When I comment out the publish() call so that there is no outstanding callback while the nodelet is unloaded, the following sequence happens instead, showing that the MyNodelet destructor is called from the main thread in that case:

[ INFO] [1644070500.700232067] [ros.cras_cpp_common]: onInit start
[ INFO] [1644070501.703203878] [ros.cras_cpp_common]: onInit end
[ INFO] [1644070502.704723130] [ros.cras_cpp_common]: 1
[ INFO] [1644070504.704879283] [ros.cras_cpp_common]: before unload 5187
[ INFO] [1644070504.704982160] [ros.nodelet]: ~ManagedNodlet start
[ INFO] [1644070504.705037195] [ros.nodelet]: ~ManagedNodelet end
[ INFO] [1644070504.705053956] [ros.cras_cpp_common]: ~MyNodelet start 5187
[ INFO] [1644070507.705186707] [ros.cras_cpp_common]: ~MyNodelet end 5187
[ INFO] [1644070507.706226714] [ros.cras_cpp_common]: after unload 5187
[ INFO] [1644070517.706408732] [ros.cras_cpp_common]: ~l start 5187 0
[ INFO] [1644070517.707532828] [ros.cras_cpp_common]: ~l end 5187


peci1 commented Feb 5, 2022

Does anybody know of other published ways to run nodelets besides the standard Loader class? If so, those implementations would also need to add the requestStop() calls in order for ok() to work correctly.


peci1 commented Feb 5, 2022

A workaround until this PR is merged:

// Pre-include standard headers normally, then make everything public in the
// nodelet headers so the private internals become reachable.
#include <sstream>
#include <ros/callback_queue_interface.h>
#define private public
#include <nodelet/nodelet.h>
#include <nodelet/detail/callback_queue.h>
#include <nodelet/detail/callback_queue_manager.h>
#undef private

namespace cras { namespace impl {

bool isCallbackQueueValid(ros::CallbackQueueInterface* queue)
{
  auto nodeletQueue = dynamic_cast<nodelet::detail::CallbackQueue*>(queue);
  // If it is not a nodelet callback queue, we don't know what to do, so we rather report it as valid.
  if (nodeletQueue == nullptr)
    return true;
  const auto& queues = nodeletQueue->parent_->queues_;
  return queues.find(nodeletQueue) != queues.end();
}

}}

class MyNodelet : public nodelet::Nodelet
{
  bool ok() const
  {
    // inited_ is a private member of Nodelet, reachable thanks to the #define hack above.
    return inited_ &&
      ::cras::impl::isCallbackQueueValid(this->getNodeHandle().getCallbackQueue()) &&
      ::cras::impl::isCallbackQueueValid(this->getMTNodeHandle().getCallbackQueue());
  }
};

This works because the nodelet's callback queue is removed from the CallbackQueueManager as soon as the nodelet is unloaded, even while callbacks are still running. However, this approach is probably pretty fragile, and it needs to access private and detail-namespaced parts of the nodelet code.


gbiggs commented Feb 19, 2022

@peci1 Is it possible to create an automated test to verify the correct functionality?


peci1 commented Apr 29, 2022

@gbiggs I'm sorry it took so long, but the tests are here! I hope they test the behavior thoroughly.

There is a requirement that was not obvious to me at first - the destructor cannot finish as long as any callbacks querying ok() are still running; otherwise, they would be calling the function on an already destroyed object (if the destructor is called e.g. in the nodelet manager thread). If each callback of the nodelet is expected to run at most once at a time, the synchronization can be done fairly easily using some booleans, possibly mutex-protected. If each callback can run multiple times concurrently, the situation gets more difficult, and the best sync structure I found is a reverse semaphore (the destructor can wait until its value reaches zero). Unfortunately, this cannot be handled from the Nodelet class itself.

Alternatively, there could be a shared pointer to some independent bool object that would be set to false by requestStop() and the destructor; each callback would copy this shared pointer and query it for the ok() value. This would (hopefully) largely remove the need for sync primitives, but it would require binary-incompatible changes (which can always be worked around by the global map of Nodelet instances).
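
A sketch of that alternative, for illustration (an assumption of how it could look, not code from this PR):

#include <atomic>
#include <boost/make_shared.hpp>
#include <boost/shared_ptr.hpp>

class Nodelet
{
public:
  bool ok() const { return *ok_; }
  void requestStop() { *ok_ = false; }
  ~Nodelet() { *ok_ = false; }

protected:
  // A long-running callback copies this shared pointer once at its start;
  // the copy stays valid even after the Nodelet object itself is destroyed.
  boost::shared_ptr<std::atomic<bool>> getOkFlag() const { return ok_; }

private:
  boost::shared_ptr<std::atomic<bool>> ok_ {boost::make_shared<std::atomic<bool>>(false)};
};

// Usage inside a callback:
//   const auto okFlag = this->getOkFlag();  // copy before any long wait
//   while (*okFlag) { /* ... do work ... */ }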


gbiggs commented May 1, 2022

Have you solved that problem?


peci1 commented May 2, 2022

You can see one of the possible solutions in the added unit test - manually adding locking into downstream classes.

The alternative I described above would be better in that it wouldn't require explicit support from downstream code, but it would require ABI-difficult changes. And I'm still not sure it is actually needed. The thing that was not obvious (and that I'll maybe add to the nodelet docs on the wiki) is that any callback accessing this-> should make sure the instance hasn't already been destroyed from another thread (when the nodelet is unloaded). This is not only related to the ok() method; it holds generally. Code that accesses instance members without knowing whether the instance is valid is broken - but it is so easy to write such code with nodelets...

I guess strictly checking the instance validity would resolve some of the mysterious crashes on Ctrl-C with (not only?) nodelets. The bad thing is that the best generally usable primitive for this kind of synchronization is a reverse semaphore, which has no easily available implementations in any libraries rosdep knows about (I'll publish my implementation soon as a part of our utility library).

Based on the above-mentioned thoughts, I think the implementation of ok() provided in this PR should be treated independently of any possible support for knowing whether the nodelet has been destroyed (the semaphore could be added directly to nodelet::Nodelet, and all subclasses would be instructed to lock it when executing async callbacks).


peci1 commented Apr 10, 2023

I wrote the reverse semaphore implementation - here and here. It could be easily copy-pasted into this library.

In a class where an async destructor call is expected, I add this as the last statements in the destructor:

this->callbackSemaphore.disable();
this->callbackSemaphore.waitZero();

This tells the semaphore to stop giving out "leases" and wait until all "leases" are returned before destroying the instance.

In ALL callbacks accessing this->, I add this piece of code at the beginning:

SemaphoreGuard<ReverseSemaphore> guard(this->callbackSemaphore);
if (!guard.acquired())
  return false;  // the semaphore is disabled - the instance is being destroyed

Everything after the guard is guaranteed to have a valid this-> pointer.
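
For readers without the linked implementation at hand, here is a minimal sketch of what such a reverse semaphore and guard could look like (an assumption based on the description above, not the author's actual code):

#include <condition_variable>
#include <cstddef>
#include <mutex>

class ReverseSemaphore
{
public:
  // Takes one "lease"; returns false once the semaphore has been disabled.
  bool acquire()
  {
    std::lock_guard<std::mutex> lock(mutex_);
    if (disabled_)
      return false;
    ++count_;
    return true;
  }

  void release()
  {
    std::lock_guard<std::mutex> lock(mutex_);
    if (--count_ == 0)
      cv_.notify_all();
  }

  void disable()  // stop giving out new leases
  {
    std::lock_guard<std::mutex> lock(mutex_);
    disabled_ = true;
  }

  void waitZero()  // block until all leases have been returned
  {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return count_ == 0; });
  }

private:
  std::mutex mutex_;
  std::condition_variable cv_;
  size_t count_ {0};
  bool disabled_ {false};
};

template<typename Semaphore>
class SemaphoreGuard
{
public:
  explicit SemaphoreGuard(Semaphore& sem) : sem_(sem), acquired_(sem.acquire()) {}
  ~SemaphoreGuard() { if (acquired_) sem_.release(); }
  bool acquired() const { return acquired_; }

private:
  Semaphore& sem_;
  bool acquired_;
};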

I still see two problems:

  1. The destructor code needs to be repeated in each leaf class. It is not enough to put it in a base class destructor (if the leaf class uses member variables defined in itself and not only members of the base).
  2. Having to put the callback guard into each callback accessing this-> is very error-prone. Nobody will do that in practice. And the global variable workaround I was talking about earlier would not work, as it needs to access this-> to get a copy of the shared pointer with the sync structure. We would need something like Python decorators for all child class methods...

I'm still thinking about whether this has a nice solution that would not require downstream classes to do anything special. Maybe some wrapped NodeHandle returned by Nodelet::getNodeHandle() etc. could automatically manage the semaphore, taking a lease each time a callback is about to be executed? Right now I can't even figure out an ABI-breaking solution to this problem that would work in 100% of cases...


peci1 commented Apr 10, 2023

Oh man, please forget everything I wrote about this-> being accessed after the nodelet is destroyed. This won't happen as long as all async callbacks run on the nodelet's callback queues (which is probably the case 99% of the time).

The callback queue makes sure the nodelet is not destroyed while a callback is running:

uint32_t CallbackQueue::callOne()
{
  // Don't try to call the callback after its nodelet has been destroyed!
  ros::VoidConstPtr tracker;
  if (has_tracked_object_)
  {
    tracker = tracked_object_.lock();
    if (!tracker)
      return ros::CallbackQueue::Disabled;
  }
  return queue_.callOne();
}

The tracked object in this case is a weak pointer taken from the NodeletPtr of the running nodelet, so the destructor can't be called while a callback is running.

If the user spins some custom threads, they should likewise take a weak pointer from this->shared_from_this() and lock it before accessing member variables and methods.
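
A hedged sketch of that pattern (assuming the nodelet class additionally derives from boost::enable_shared_from_this and is owned via a boost::shared_ptr of the derived type, as with the standard loader; MyNodelet and doWork() are made-up names):

#include <boost/enable_shared_from_this.hpp>
#include <boost/thread.hpp>
#include <boost/weak_ptr.hpp>
#include <nodelet/nodelet.h>
#include <ros/ros.h>

class MyNodelet : public nodelet::Nodelet,
                  public boost::enable_shared_from_this<MyNodelet>
{
  void onInit() override
  {
    // Hand the thread only a weak pointer, never a raw `this`.
    boost::weak_ptr<MyNodelet> weak(this->shared_from_this());
    boost::thread([weak]
    {
      while (true)
      {
        // Lock before every access to the instance; stop once the nodelet
        // has been destroyed or asked to stop. The shared_ptr keeps the
        // instance alive only for the duration of one loop iteration.
        const auto self = weak.lock();
        if (!self || !self->ok())
          return;
        self->doWork();
        ros::WallDuration(0.1).sleep();
      }
    }).detach();
  }

  void doWork() {}  // hypothetical member function
};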

I've faced the problem with a destroyed this-> in the test case written for this PR (the last two tests), but that is because I'm actually doing "custom spinning" (calling the callbacks manually). I'll try to rewrite the tests using callback queues.


peci1 commented Apr 11, 2023

Now I'm satisfied with the state of this PR and pretty confident it is usable and useful. It is ready to be reviewed.

I added further integration tests that verify the standard functionality inside a nodelet manager and how it interacts with explicit unloading of the nodelet, killing the nodelet loader node, and killing the nodelet manager.

Along the way, I discovered a bug in the bond package which could get the test nodes into an infinite wait - ros/bond_core#93. This is worked around by making sure at least two bond heartbeat messages have been sent by each bond end.

There was also a bug in the LoaderROS class - during destruction, it called the destructors of the bonds, which, however, expect to be called in a mutex-protected environment, and a destructor cannot ensure that. So I made the bond map cleanup explicit and protected by a mutex. I also changed the names of two methods to be more explicit about what they are actually doing. As this class is not publicly exported, it should not change public ABI or API.

Lastly, a change of the order of member variables was needed in the Loader::Impl class to make sure LoaderROS is destroyed before the list of loaded nodelets (because it might access this list in its destructor). I'm not sure whether changing the member order of a non-exported class changes public ABI or not; I'd guess not.
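
That ordering relies on the C++ rule that non-static data members are destroyed in reverse order of declaration; schematically (the member names here are made up, not the actual Loader::Impl layout):

struct Impl
{
  // Declared first => destroyed last.
  std::map<std::string, ManagedNodeletPtr> nodelets_;
  // Declared last => destroyed first, so its destructor may still safely
  // touch the nodelet list above.
  boost::shared_ptr<LoaderROS> services_;
};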


peci1 commented Jun 15, 2024

A year has passed with this PR being ready and without a review. Could I please ask for a few eyes on this?

Our downstream shim, which basically reimplements this PR as a custom Nodelet child class, has been used for a few years already and seems to be working well.
