For Centerstage we have been training models and getting a model file generated. Recently we had a run where the model was generated successfully, but when looking at the evaluation step, it says it failed. Does anyone know what this means?
We are experiencing significant resource issues within the Google Machine Learning server farm, which is affecting our ability to provision training resources (in case you didn’t know, it’s a “first-come, first-served” model where resources aren’t always guaranteed for everyone allocated to a given server pool; think of it like airlines overbooking seats on a flight). Training and Evaluation jobs run separately for each of a model’s steps, and the “Job State” for the most recent Training and Evaluation job is shown in the Details tab. The current “failures” we’re experiencing are all related to not being able to provision training and evaluation resources.
What does this mean for your model? It could mean that any number of steps out of thousands were not able to execute properly. Does this mean the model is doomed? Not necessarily. You should still look at your model metrics to see whether the model trained appropriately, and test the model for accuracy.
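If you want to sanity-check the generated model file yourself, here’s a rough sketch using the standard TensorFlow Lite interpreter in Python. Treat it as a starting point only: the file names are placeholders, and the assumption that the model takes a single image tensor is mine, so adjust it to match your model.

```python
# Rough sketch: run one test frame through a generated .tflite model.
# File names are placeholders; input shape/dtype come from the model itself.
import numpy as np
import tensorflow as tf
from PIL import Image

interpreter = tf.lite.Interpreter(model_path="model.tflite")  # placeholder path
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Resize a test frame to the model's expected input shape.
_, height, width, _ = input_details[0]["shape"]
frame = Image.open("test_frame.png").convert("RGB")  # placeholder path
frame = frame.resize((width, height))
input_data = np.expand_dims(
    np.asarray(frame, dtype=input_details[0]["dtype"]), axis=0)

interpreter.set_tensor(input_details[0]["index"], input_data)
interpreter.invoke()

# Output tensor layout varies by model; print everything and inspect.
for detail in output_details:
    print(detail["name"], interpreter.get_tensor(detail["index"]))
```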
-Danny
Thank you for the response. Also, on the Models page I see a tab called “More Training”. Our team did not use this; instead we created new models every time. Maybe what we should have done is train a model, then use the “More Training” button to train the same model with more frames. Looks like this is a learning for us this year. Is there a way to combine two models after the fact? Just asking.
That is correct. You can perform “more training” with an additional dataset so long as the additional dataset has exactly the same labels (no more, no less) as the dataset you originally trained the model on.
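Conceptually, that requirement is just set equality between the two datasets’ label lists. A throwaway sketch (the label names here are made up; substitute your own):

```python
# Made-up label names for illustration; use your datasets' actual labels.
original_labels = {"Pixel", "TeamProp"}
additional_labels = {"Pixel", "TeamProp"}

# "More training" requires an exact match: no missing and no extra labels.
if additional_labels == original_labels:
    print("Label sets match; the dataset is eligible for more training.")
else:
    print("Missing:", original_labels - additional_labels,
          "Extra:", additional_labels - original_labels)
```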
If you use “more training” with additional data, the benefits of doing that versus starting from scratch and training with both datasets are unclear (only unclear because I haven’t done an exhaustive study). If you use “more training” just to get more training done, the benefits over starting a model from scratch are much clearer.
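To make the distinction concrete in plain TensorFlow/Keras terms (this is only an analogy; the tool’s internals aren’t exposed, and the function and variable names below are hypothetical): “more training” resumes from the existing weights, while starting from scratch reinitializes them.

```python
# Analogy only; names are hypothetical, not the tool's actual internals.
import tensorflow as tf

def more_training(saved_model_path, extra_ds, epochs=10):
    # Resume from previously trained weights and keep fitting.
    model = tf.keras.models.load_model(saved_model_path)
    model.fit(extra_ds, epochs=epochs)
    return model

def from_scratch(build_model, original_ds, extra_ds, epochs=10):
    # Reinitialize weights and train on both datasets combined.
    model = build_model()  # hypothetical model factory
    model.fit(original_ds.concatenate(extra_ds), epochs=epochs)
    return model
```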
As for combining two models after the fact: no, I’m afraid not.
-Danny