HorovodRunner is stuck: training does not get past the first epoch. #250

Open
camposwalacy opened this issue Jul 1, 2024 · 0 comments

camposwalacy commented Jul 1, 2024

Hello, folks!

I am using HorovodRunner (through sparkdl) on Databricks Runtime LTS 14.2 ML with TensorFlow 14.0. My data is in TFRecords format. The issue started after June 25th. I have migrated my workload to Unity Catalog. I am still debugging on my side to see whether something changed, but I haven't found a fix yet.
