Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent file descriptor leak in util.just_run #237

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

mariokostelac
Copy link

@mariokostelac mariokostelac commented Jun 28, 2022

While working with papermill I've found that repeated running of notebooks is eventually bumping into the limit of number of opened files.

Following he stacktrace led me to util.just_run function that creates asyncio loop in some cases, but doesn't close them, which leaves some eventpoll file descriptors opened forever.

This PR consists of two commits

  • The first commit creates a failing test, showing that the current implementation is leaving file descriptors opened after each run. On my machine (unix selectors), it leaves 3 FDs per just_run call. It also explicitly declares psutil as a dependency. It was already a transitive dependency, but I'm promoting to be explicit since I'm calling it directly.
  • The second commit fixes the issue by stopping and closing used loop, but just in cases where the method created the loop.

Happy to discuss the approach, missing context, and next steps to fix this bug.

@mariokostelac
Copy link
Author

mariokostelac commented Jun 29, 2022

I see some type checks are failing. I'll probably need help to get around it. I've tried using MagicMock to swap loop object, but some code asserts that loop objects is AbstractEventLoop, which makes it fail.

Copy link
Member

@davidbrochart davidbrochart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @mariokostelac, that looks good.

nbclient/tests/test_util.py Outdated Show resolved Hide resolved
Comment on lines 105 to 106
loop.stop = MagicMock(wraps=loop.stop)
loop.close = MagicMock(wraps=loop.close)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's just ignore type checks on these.

Suggested change
loop.stop = MagicMock(wraps=loop.stop)
loop.close = MagicMock(wraps=loop.close)
loop.stop = MagicMock(wraps=loop.stop) # type: ignore
loop.close = MagicMock(wraps=loop.close) # type: ignore

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I might have a better solution for these, without ignoring type checks.

I can't run tests locally, though.

(venv) dev-envs-ad\mario@a-2zurb4h4bbny ➜  nbclient git:(mario/fix_fd_leak) ls
binder  CHANGELOG.md  CONTRIBUTING.md  docs  LICENSE  MANIFEST.in  nbclient  nbclient.egg-info  pyproject.toml  pytest.ini  README.md  RELEASING.md  requirements-dev.txt  requirements.txt  setup.cfg  setup.py  venv
(venv) dev-envs-ad\mario@a-2zurb4h4bbny ➜  nbclient git:(mario/fix_fd_leak) tox -e py38
ERROR: tox config file (either pyproject.toml, tox.ini, setup.cfg) not found

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're not using tox, you should just run pytest.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see, I was running pytest, but it was missing the type checking piece that failed. I think https://github.com/jupyter/nbclient/blob/main/CONTRIBUTING.md might need an update.

Re: type checking. Would it make sense to put it as pre-commit check?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see, I was running pytest, but it was missing the type checking piece that failed. I think https://github.com/jupyter/nbclient/blob/main/CONTRIBUTING.md might need an update.

Indeed!

Re: type checking. Would it make sense to put it as pre-commit check?

Definitely, if you want to open a PR for that, it would be great.

@mariokostelac
Copy link
Author

@davidbrochart I think this is in better shape now.

@davidbrochart
Copy link
Member

Trigger CI.

@davidbrochart davidbrochart reopened this Jun 29, 2022
@mariokostelac
Copy link
Author

mariokostelac commented Jun 29, 2022

This is interesting. It seems that something in jupyter_client is now failing.
I've taken a look at the stacktrace, it might be that https://github.com/jupyter/jupyter_client/blob/3697a0d5e528be4c61fdfaaec0923acbfe0c1ea8/jupyter_client/utils.py#14 or https://github.com/jupyter/jupyter_client/blob/3697a0d5e528be4c61fdfaaec0923acbfe0c1ea8/jupyter_client/utils.py#18 is getting our closed loop, and trying to run on it.

It's also strange that the function I'm changing looks remarkably similar to the one that's in the stacktrace. Why is that? @davidbrochart I see that you've authored both of them (at least parts).

I'll do some debugging tomorrow or on Friday. If you have some context that might help, I'm all ears.

Also sorry for not running the whole testsuite and delaying this bug discovery till CI run.

@mariokostelac
Copy link
Author

@davidbrochart there's some discussion in jupyter/jupyter_client#772 (comment).

@mariokostelac
Copy link
Author

I've tried hiding our loop from jupyter_client entirely by running asyncio.set_event_loop(None) after https://github.com/jupyter/nbclient/pull/237/files#diff-ce3eb5ea15b9f32af7e1dc032980287831900dc5538d389fc9875d7133c52220R63.

Unfortunately,

(and test after) then fail because kc.shutdown eventually tries to send something to control channel, which expect some running loop.
I'm not 100% sure I can locate the bug. Is the bug in the test, running kc.shutdown outside of the loop?

@mariokostelac
Copy link
Author

I've spent more time investigating this and I think this solution might be dangerous. It's changing the "current loop", and then stopping it. jupyter_client expects that the loop returned by get_event_loop is not closed, and there might be other libraries that expect the same. Shipping it could break client's processes in unexpected ways, and it might be very difficult for them to figure out why it happened.

There are several approaches we could take:

  1. At the end of just_run, if there wasn't a running loop, set it to none with set_event_loop(None). Drawback of that approach is that, prior to just_run call, there might had been a non-running (but open!) loop attached to the running thread. In that case, just_run would create a new loop, overwrite the previous loop, and nothing would close the old loop. That'd leak FDs as well.
  2. We could try harder to get some loop to run our coro, effectively do the same thing as https://github.com/jupyter/jupyter_client/blob/3697a0d5e528be4c61fdfaaec0923acbfe0c1ea8/jupyter_client/utils.py#L18. In cases of getting a non-running, non-closed loop, we'd reuse it (by running again). That solves the leaking problem, and doesn't leave the system in some possibly broken state (closed loop attached to the thread).
  3. We could import jupyter_client.utils.run_sync method and use it directly, instead of reimplementing the same functionality.

Happy to hear your thoughts on these approaches, or new approaches entirely.

@blink1073
Copy link
Contributor

I like the idea of using jupyter_client.utils.run_sync here, and fixing it properly upstream.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants