Proper Process for Object Detection

This year is the second year our team has used the ml-toolchain. Last year we had great success with the application, but this year not so much, likely due to our lack of knowledge of the proper procedure for best results. We have created a number of images, datasets, and training models, but we are frequently unable to correctly identify the objects.

In our process, we mount a webcam on the robot in the position it will be in when viewing the field during competition, and we take short video bursts of the object at multiple places on the field. We use the maximum resolution of the camera, a C170 webcam (768p at 4:3), at its lowest frame rate of 30 fps. When the video is imported into the ml-toolchain, we have tried drawing bounding boxes around the entire image, and we have also tried drawing the bounding box around the individual repeating elements in the image. We do not label any other objects on the field, and we take a few frames of negative images where the object is not on the field and then do not label anything in those frames. We have set our recognition level at 70 and have tried multiple magnification levels. In the ml-toolchain modeling, we typically set the number of steps so that about 125 epochs are completed.

Below are some examples of our images and how they are bounded in the application. Any help you can provide in identifying what we are doing wrong (is it the images, the labeling we are doing, or something else?), or any tips in general, would be greatly appreciated.

In this example, we typically get good object detection of this element.

Hey Richard,

I took a look at your models in your team’s account (I hope that was okay since you asked for help) - your labeling is clean, your object poses are varied, and your objects are fairly distinct. Congratulations, you’ve really nailed what most people have a difficult time getting the hang of. What you’re missing is what makes THIS year so different from LAST year. I will give you props, because you’re doing almost everything I did when training the default model this year. I stumbled in the EXACT place you’re stumbling, and it took me a couple of days to figure it out. So bravo, don’t feel bad, but let me explain.

The crux of the problem is that we’re not detecting OBJECTS, we’re detecting PATTERNS ON OBJECTS with TensorFlow this year. Not a crazy difference, but how we approach the training MUST compensate for it. Last season we were detecting OBJECTS - imagine, for instance, that we were to train a model THIS YEAR to detect a RED CONE. The RED CONE is the object, and the rest of the “scene” is the background. If you remove the RED CONE, the background is what’s left in the image, and that’s your negative. With the negative, we can teach TensorFlow what is background and what is the object we’re trying to detect. Alternatively (or in addition), if the background changes enough times within the training data, TensorFlow will ultimately begin to ignore the background on its own WITHOUT employing a negative, because TensorFlow eventually realizes that it cannot rely on the background as a way to identify the OBJECT (it says, “Oh, the thing I was keying off of isn’t there this time, so maybe I need to go back and find some other element to key off of for my object classification”).

Okay, now that we understand TensorFlow for a RED CONE, let’s switch our brains to thinking about TensorFlow with PATTERNS ON CONES. So what is the “PATTERN”? Well, that’s the image you’re trying to identify. That PATTERN is very different from the OBJECT we were trying to identify previously. TensorFlow doesn’t realize the PATTERN is an IMAGE that effectively has a TRANSPARENT BACKGROUND, and that the white background color on the sleeve IS NOT actually supposed to be part of our image. That’s right, your BACKGROUND is actually the WHITE parts of the Signal Sleeve. If you were going to provide a real NEGATIVE for the PATTERN, you’d provide a scene where you have a Signal with a Signal Sleeve containing NO IMAGES on it - you would NOT remove the Signal!

But that’s not the worst part of it yet. The worst part is that your background color is WHITE. I’m sorry, but what you’re seeing here are the effects that professional photographers and videographers must account for with WHITE BALANCE. See, the problem is that in different lighting conditions, WHITE takes on different colors - it may take on a blue shade, it may take on an orange tint, it may take on a light brown. This is the bane of photographers everywhere. And what’s worse, TensorFlow gets confused by it too. If you notice, in almost all images in your video where you use the green-ish “shields”, your white signal sleeve takes on a blue-ish tint. Almost everywhere you use the purple dragon, your white signal sleeve takes on a warmer color and appears with an orange-ish tint. TensorFlow looks at the bounding box and says, “Wow, in all these ‘objects within my bounding box’ where a ‘shield’ is present there’s this nice blue-ish background, so that must be part of the ‘object’. I always see this blue-ish background, so if the background isn’t blue-ish it must not be a ‘shield’.” And so on for the other images. This bit the heck out of me, too.
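If you want to see this effect for yourself, here’s a minimal desktop-side sketch (plain Java, nothing FTC-specific; the file names and crop coordinates are just placeholders for your own frames) that averages the pixels inside a bounding box, so you can compare what “white” actually measures as across frames taken near different images and in different lighting:

```java
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

public class WhiteBalanceCheck {
    // Average the RGB channels inside a crop (the "white" part of the sleeve)
    // so you can compare the measured tint across frames from different lighting.
    static void printAverageColor(String path, int x, int y, int w, int h) throws Exception {
        BufferedImage img = ImageIO.read(new File(path));
        long r = 0, g = 0, b = 0;
        for (int yy = y; yy < y + h; yy++) {
            for (int xx = x; xx < x + w; xx++) {
                int rgb = img.getRGB(xx, yy);
                r += (rgb >> 16) & 0xFF;
                g += (rgb >> 8) & 0xFF;
                b += rgb & 0xFF;
            }
        }
        long n = (long) w * h;
        System.out.printf("%s -> avg R=%d G=%d B=%d%n", path, r / n, g / n, b / n);
    }

    public static void main(String[] args) throws Exception {
        // Placeholder file names and crop coordinates - point these at the "white"
        // region of the sleeve in a few of your own training frames.
        printAverageColor("frame_near_shield.png", 100, 100, 50, 50);
        printAverageColor("frame_near_dragon.png", 100, 100, 50, 50);
    }
}
```

If the averages drift toward blue near one of your images and toward orange near another, that’s exactly the cue TensorFlow is latching onto instead of the pattern itself.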

What I had to do was reproduce several different lighting conditions and get short videos of all my images in those lighting conditions. That trained the model not to rely on the background color and to find something else to key off of - namely the images I provided, or patterns within those images. You have to treat TensorFlow like a child sometimes. I documented this for teams in the ftc-docs page “TensorFlow for PowerPlay”; look especially closely at the section marked “Understanding Backgrounds for Signal Sleeves” for more info.

If you have more questions or need further clarification, let me know. My models looked and trained just like yours - okay, maybe mine had slightly better training metrics and detections within the evaluation data, but they were absolutely garbage once they hit “real life.” However, once I trained TensorFlow to stop paying attention to the background, my results got better instantly. Just FYI, providing a “blank” sticker/sleeve (without the actual images on it) as a negative didn’t occur to me when I did my training, but it may possibly save you from having to shoot lots of different lighting scenarios.


Thanks for such a quick and detailed response. We’ll go back over the documents, apply the tips you listed above, and, as I tell the kids, “try again.” However, I have one other question for you. The documentation states, “This means the image is downscaled from the high definition webcam image to one that is only 300x300 pixels.” So should we shoot our video at a lower resolution like 360p, or should we stick to the highest quality when taking video, such as 1080p?

That’s a fantastic question. In a perfect world, you’d take the video at the lowest/squarish resolution possible - your camera is going to downscale images better than any downscale algorithm can, so the images will be better. But the REAL answer is to train the model using video at THE EXACT SAME RESOLUTION as the robot will use during competition. If you train the model with 1080p frames but take 640x480 images while using the model for inference (during a match), the results may be anywhere from sub-optimal to downright terrible. Depending on the programming language and the features of the camera, it may or may not be easy to change the resolution during the match, so we standardized on 720p (1280x720) for the default model since most cameras operate at 720p.
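And just so we’re talking about the same thing on the robot side, here’s roughly how a custom model and that recognition level of 70 get wired up in this season’s SDK. This is only a sketch based on the ConceptTensorFlowObjectDetection sample - the webcam name (“Webcam 1”), the Vuforia key, the model file path, and the labels are placeholders you’d swap for whatever your team actually uses, and the labels must match the ones from your ftc-ml model exactly:

```java
import com.qualcomm.robotcore.eventloop.opmode.Autonomous;
import com.qualcomm.robotcore.eventloop.opmode.LinearOpMode;
import java.util.List;
import org.firstinspires.ftc.robotcore.external.ClassFactory;
import org.firstinspires.ftc.robotcore.external.hardware.camera.WebcamName;
import org.firstinspires.ftc.robotcore.external.navigation.VuforiaLocalizer;
import org.firstinspires.ftc.robotcore.external.tfod.Recognition;
import org.firstinspires.ftc.robotcore.external.tfod.TFObjectDetector;

@Autonomous(name = "CustomSleeveDetection")
public class CustomSleeveDetection extends LinearOpMode {
    // Placeholders - use your own key, your uploaded model file, and the labels
    // exactly as they appear in your ftc-ml model.
    private static final String VUFORIA_KEY = " -- YOUR KEY HERE -- ";
    private static final String TFOD_MODEL_FILE = "/sdcard/FIRST/tflitemodels/CustomSleeve.tflite";
    private static final String[] LABELS = { "Shield", "Dragon" };

    private VuforiaLocalizer vuforia;
    private TFObjectDetector tfod;

    @Override
    public void runOpMode() {
        // Vuforia supplies the webcam frames that TensorFlow consumes.
        VuforiaLocalizer.Parameters vuParams = new VuforiaLocalizer.Parameters();
        vuParams.vuforiaLicenseKey = VUFORIA_KEY;
        vuParams.cameraName = hardwareMap.get(WebcamName.class, "Webcam 1");
        vuforia = ClassFactory.getInstance().createVuforia(vuParams);

        int monitorViewId = hardwareMap.appContext.getResources().getIdentifier(
                "tfodMonitorViewId", "id", hardwareMap.appContext.getPackageName());
        TFObjectDetector.Parameters tfodParams = new TFObjectDetector.Parameters(monitorViewId);
        tfodParams.minResultConfidence = 0.70f;  // the "recognition level" of 70
        tfodParams.isModelTensorFlow2 = true;
        tfodParams.inputSize = 300;              // frames are downscaled to 300x300 for the model
        tfod = ClassFactory.getInstance().createTFObjectDetector(tfodParams, vuforia);
        tfod.loadModelFromFile(TFOD_MODEL_FILE, LABELS);
        tfod.activate();

        waitForStart();
        while (opModeIsActive()) {
            List<Recognition> recognitions = tfod.getUpdatedRecognitions();
            if (recognitions != null) {
                for (Recognition r : recognitions) {
                    telemetry.addData(r.getLabel(), "%.0f%%", r.getConfidence() * 100);
                }
                telemetry.update();
            }
        }
    }
}
```

That inputSize of 300 is where the 300x300 downscale from the documentation happens, which is why matching your training video resolution to the resolution the camera actually delivers during the match matters so much.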


Just wanted to give an update and once again thank you. We updated our images, took new videos in more lighting conditions, and were able to successfully use our new models to detect which image was being presented. Thanks again!!!
