Model training job fails immediately #206
-
Can you check the App Engine logs? They should be at:
console.cloud.google.com/logs/viewer?project=<your_project>&resource=gae_app
One thing to note is that it apparently couldn't even create the training job, because the error occurred while the dialog box was still up. That's unusual and makes me suspect that something went wrong during project setup.
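If it helps, here is a minimal sketch that fills your project ID into that logs viewer URL. The path and query parameters come from the comment above; the https:// scheme, the function name, and the "my-project" placeholder are my assumptions.

```python
from urllib.parse import urlencode

# Base path taken from the comment above; the https:// scheme is an assumption.
LOGS_VIEWER = "https://console.cloud.google.com/logs/viewer"


def app_engine_logs_url(project_id: str) -> str:
    """Return the logs viewer URL filtered to App Engine (gae_app) for a project."""
    query = urlencode({"project": project_id, "resource": "gae_app"})
    return f"{LOGS_VIEWER}?{query}"


if __name__ == "__main__":
    # "my-project" is a placeholder; substitute your own project ID.
    print(app_engine_logs_url("my-project"))
```

Open the printed URL in a browser where you are signed in to the account that owns the project.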
-
Another thing to check is whether there are any errors in the browser console.
-
Can you double-check that you performed all the items in the "Grant the ml.serviceAgent role to your TPU service account" section of the readme?
-
Not that it really matters given the error message Google is returning, but the repo's default config uses TPU instances, yet your log indicates it's trying to train on a GPU. Did you change any of the software to override the default, or did you set a use_tpu configuration item to false?
Are you associated with an FTC team and do you have a FIRST dashboard account?
-
It's the eval job that's failing to be created. It's using GPU for the eval job. (The eval job is created before the training job so that it is ready to run the evaluation on the first checkpoint produced by the training job, which may happen quickly if the training job is running on TPU.)
-
The IAM screenshot shows that the App Engine default service account has the Editor role. That's the same role that I have on my project. However, I think your Editor role might have different permissions than mine. Does your Editor role include the permissions ml.jobs.create and ml.jobs.get? I think this is the URL to find out... Also, can you just try it again? I wonder if it is a transient issue with getting access to a GPU and the error message is not quite accurate.
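One way to check those two permissions without clicking through the console is a rough sketch like the one below. It assumes you have saved a role description as JSON, for example from `gcloud iam roles describe roles/editor --format=json`, and that the permissions are listed under `includedPermissions`. The function name and the sample data are made up for illustration.

```python
import json

# The two permissions asked about in the comment above.
REQUIRED = ("ml.jobs.create", "ml.jobs.get")


def missing_permissions(role_json: str, required=REQUIRED) -> list:
    """Return the required permissions that the role description does not include."""
    role = json.loads(role_json)
    included = set(role.get("includedPermissions", []))
    return [perm for perm in required if perm not in included]


if __name__ == "__main__":
    # Made-up sample role that only grants ml.jobs.get.
    sample = json.dumps({
        "name": "roles/example",
        "includedPermissions": ["ml.jobs.get"],
    })
    print(missing_permissions(sample))  # ['ml.jobs.create']
```

An empty list means both permissions are present in the role you inspected, in which case the problem is more likely elsewhere.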
-
I downloaded the repo and finished self-hosting the FMLTC app on a personal Google account. I can't get the model training job to run; it fails immediately. I would appreciate pointers for triaging this, as I cannot proceed further.