Troubleshooting a FAILED training job?

I pulled down the repo and followed the instructions to install it on Google Cloud. Everything seems to work fine, but the model training ended with a FAILED status. How do I diagnose why it failed? I'm trying to work out whether I have a configuration issue or a bad dataset.

Just FYI, this forum isn't meant for questions about self-hosted instances of fmltc. You should go to the repo and ask in the Discussions area instead.

However, in the vast majority of cases a job fails because it ran out of memory while trying to train your model. This happens when your video frames are too large. For example, if you create 4K video frames, 32 frames at 4K each is too much for a GPU's memory buffer.
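To get a feel for the numbers, here is a rough back-of-the-envelope sketch in Python. It assumes that "4K" means 3840x2160 pixels with 3 color channels, and it only counts the raw input frames. The 32-frame count comes from the example above; the resolution, channel count, and data types are assumptions, and the real memory use during training is higher once the model's weights and intermediate activations are included.

```python
def batch_bytes(width, height, channels, frames, bytes_per_value):
    """Raw size in bytes of a batch of uncompressed video frames."""
    return width * height * channels * frames * bytes_per_value


GIB = 1024 ** 3

# Assumed frame geometry: 3840x2160 "4K" with 3 channels (RGB).
WIDTH, HEIGHT, CHANNELS = 3840, 2160, 3
FRAMES = 32  # frame count taken from the example in the post

# 1 byte per value if frames stay as 8-bit integers,
# 4 bytes per value if they are converted to 32-bit floats.
as_uint8 = batch_bytes(WIDTH, HEIGHT, CHANNELS, FRAMES, 1)
as_float32 = batch_bytes(WIDTH, HEIGHT, CHANNELS, FRAMES, 4)

print(f"uint8 input:   {as_uint8 / GIB:.2f} GiB")
print(f"float32 input: {as_float32 / GIB:.2f} GiB")
```

Under these assumptions the input frames alone come to roughly 0.74 GiB as 8-bit integers, or about 2.97 GiB as 32-bit floats, before the model itself uses any memory. Smaller frames shrink this quickly, since the size scales with width times height.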

To look for the error, open your list of jobs at:
https://console.cloud.google.com/ai-platform/jobs
Be sure to select your cloud instance first. Then check whether the failed job in that list shows an error.
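If you would rather check from a terminal than the console, a possible approach is to save the job description as JSON (for example with `gcloud ai-platform jobs describe JOB_ID --format=json`, assuming the gcloud CLI is installed and pointed at your project) and pull out the state and error message. The sketch below is only an illustration: `summarize_job` is a hypothetical helper, and the sample data is made up rather than real output. It relies on the `jobId`, `state`, and `errorMessage` fields of the AI Platform job resource.

```python
import json


def summarize_job(job: dict) -> str:
    """Return a one-line summary of an AI Platform training job description."""
    job_id = job.get("jobId", "<unknown>")
    state = job.get("state", "<unknown>")
    # errorMessage is only expected to be present when something went wrong.
    message = job.get("errorMessage", "").strip()
    if state == "FAILED" and message:
        return f"{job_id}: FAILED - {message}"
    return f"{job_id}: {state}"


# Made-up sample for illustration only; real output will have more fields.
sample = json.loads(
    '{"jobId": "example_job", "state": "FAILED", '
    '"errorMessage": "placeholder error text"}'
)
print(summarize_job(sample))
```

In practice you would load the saved JSON file with `json.load` instead of the inline sample and read whatever message the job actually reported.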

-Danny
