My team and I are currently working on training an object recognition model for our FTC robot using the ftc-ml toolchain. Whenever we try to train the model with our dataset, the training always ends in a "Failed" state.
Our dataset includes videos with the appropriate labels for training the model. We have explored different options for base models, but the result stays the same. We've made seven attempts without success.
If anyone has run into this problem and managed to solve it, we would appreciate any kind of help.
My team number is 4897. The attempts are spread over the last week and a half; the most recent one was on 25/09/2023 at 10:31. I also see a new model called Administrative_test from 40 minutes ago that I didn't make, so I suppose that was you.
Edit: The model Administrative_test is not there anymore
One of the big things I love about the ftc-ml platform is that I have administrative access to all of ftc-ml, so not only do I not have to guess what you may have done (or not done), I can go in and see for myself. I have a few odds and ends I'd like to cover with you - I hope it's okay that I use this as a learning moment for the community. In SUMMARY, PLEASE READ THE FTC-ML DOCUMENTATION.
Your Resolution is Not Compatible
PROBLEM: I noticed that the video you uploaded has dimensions of 2160x3840. That's a 4K resolution that eventually has to be downscaled to 320x320 pixels within the model, and I know for a fact that at that resolution your model training (when you get that far) is going to fail. It doesn't help that the video is rotated for portrait mode rather than landscape mode - your camera will most likely be operating in landscape mode, so the stretching artifacts will very likely cause the model to not recognize the object properly.
SOLUTION: To fix this, determine what resolution and orientation the camera on your robot will be operating at. If you're not going to take the video with the same camera you're using on your robot, at least ensure both cameras (the one taking the video and the one on the robot) are set to the same resolution. It's also recommended to use the lowest, most square-ish resolution possible. In my testing, I've gotten the most consistent results with 720p (1280x720, in landscape orientation), since that's the resolution used by the vast majority of supported webcams.
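For what it's worth, here's a minimal robot-side sketch of matching the webcam to the training resolution, assuming the current SDK's VisionPortal/TfodProcessor API; the webcam name "Webcam 1", the OpMode name, and the use of the default processor are just placeholders, not anything pulled from your configuration:

```java
import android.util.Size;

import com.qualcomm.robotcore.eventloop.opmode.LinearOpMode;
import com.qualcomm.robotcore.eventloop.opmode.TeleOp;

import org.firstinspires.ftc.robotcore.external.hardware.camera.WebcamName;
import org.firstinspires.ftc.vision.VisionPortal;
import org.firstinspires.ftc.vision.tfod.TfodProcessor;

@TeleOp(name = "ResolutionMatchExample")
public class ResolutionMatchExample extends LinearOpMode {
    @Override
    public void runOpMode() {
        // TensorFlow processor (default model here, just as a stand-in).
        TfodProcessor tfod = TfodProcessor.easyCreateWithDefaults();

        // Request the same landscape resolution the training video was recorded
        // at (1280x720 in this sketch), so the robot sees what the model saw.
        VisionPortal portal = new VisionPortal.Builder()
                .setCamera(hardwareMap.get(WebcamName.class, "Webcam 1"))
                .setCameraResolution(new Size(1280, 720))
                .addProcessor(tfod)
                .build();

        waitForStart();
        while (opModeIsActive()) {
            telemetry.addData("Detections", tfod.getRecognitions().size());
            telemetry.update();
        }
    }
}
```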
Your Object Orientation is Not Practical
PROBLEM: Okay, so you want to train your model on a Pixel, and that's fine. However, you've oriented the Pixel on its side (similar to this public image). When on the FTC field will you ever see a Pixel in this orientation? This orientation introduces shadows and contrast on the object that will not be present when it's lying on its back with the lighting coming from above. It's very important to train the model with the object presented exactly the way it will be seen - that's why we don't take pictures of cars from underneath when all the cameras are up above.
SOLUTION: Train the model as your robot will see the object.
Image Continuity Problems
PROBLEM: Every single image that TensorFlow is presented with should be considered a real and valid image for TensorFlow to train with - that is, you always expect TensorFlow to recognize the object in that image and orientation as the object. Here is a sampling of the images you're providing:
There is no way TensorFlow is going to produce a good model with such variance - if this is all a human ever sees of the object, there's no way they could recognize it and differentiate it from other objects. With these images, TensorFlow is guaranteed not to train properly.
SOLUTION: When you have images you don’t want TensorFlow to see, use the “Ignore Frame” checkbox on the image frame within the video to exclude the frame from the TensorFlow training data set.
Please learn from my mistakes
PROBLEM: I've already spent over a hundred hours training TensorFlow models to recognize a Pixel. That's what the default model shipped with the FTC SDK is trained on (when you use any of the standard sample TensorFlow programs, that default model is loaded). I gave a pretty thorough writeup of my findings here: TensorFlow for CENTERSTAGE. Because of the properties/geometries of the Pixel, I found that I could not train TensorFlow to recognize the object from specific angles. Please read that document incredibly thoroughly.
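To make the "default model" point concrete, here's a hedged sketch (not copied from the SDK samples) of the two ways a TfodProcessor gets its model: the standard samples use the built-in default, while an ftc-ml model is loaded from a .tflite file you copy onto the Robot Controller. The file path and label below are placeholders for whatever you name your own model:

```java
import org.firstinspires.ftc.vision.tfod.TfodProcessor;

public class TfodModelChoices {
    // What the standard sample OpModes effectively do: load the default model.
    TfodProcessor defaultModel = TfodProcessor.easyCreateWithDefaults();

    // Loading a custom ftc-ml model instead; file name and labels are placeholders.
    TfodProcessor customModel = new TfodProcessor.Builder()
            .setModelFileName("/sdcard/FIRST/tflitemodels/MyTeamProp.tflite")
            .setModelLabels(new String[] { "TeamProp" })
            .build();
}
```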
PROBLEM: When you have very large images, like the 4K video that you provided, you run the risk of blowing out the GPU memory used to train a model. As discussed in the documentation, training on ftc-ml uses 32 images in a training step - that means 32 images have to be loaded into GPU memory at once to be trained on. Each 4K image is about 8MB, so 32 images is around 256MB, which happens to be the memory capacity of the GPUs we're currently using. However, more than just the images are stored for a training run, so you exceed the GPU's memory capacity and the training fails because the training system under the hood runs out of memory. Unfortunately we cannot programmatically detect this and throw an appropriate error; I had to run the Administrative_test model that you saw and look at the Google Cloud Project log files to determine what was happening.
SOLUTION: Use a smaller resolution.
I hope this information helps you in your next attempt at training a model.
Thank you, ddiaz.
We'll upload a video with a lower resolution. Our object orientation may not be practical on the field, but we just wanted to test how the tool worked so we'd have the model ready as fast as possible once we make our team prop, and we decided to test it with a Pixel in the meantime. We'll also try to keep only the good frames.
Also, how is it that the high-resolution images use that much GPU memory if they are resized before training the model?
The scaling happens during the training process. When we build a machine learning dataset and prepare the training to use a pre-built model (like the 320x320 standard model size), none of the component systems knows (or cares) exactly what size either dataset is. We also don't know the exact process by which the images are downscaled; that's part of the "black magic" once we hand your images off to the Google Machine Learning APIs, which eventually spit the model back out. All we're told is that the input images are processed to match the resolution of the model during training.