Our Model's Recognition Is Really Bad

Hello. I am training a TensorFlow model to detect our props (red and blue crowns). This is my third model, and recognition is still poor: the bounding boxes are inaccurate, and the model sometimes identifies the gray tile as the prop. Before this attempt I did some research and implemented two changes: 1) I switched the background between red, blue, and gray tiles, and 2) I added roughly 70 negative frames of the background. One key difference is that this model only got 25 minutes of training at 1,500 steps, because I did not want to waste training minutes. I have attached images of a large detection box on the blue crown and a detection box on the tile.

Here is the other image:

There are a few things that appear to be causing this issue. First, make your training video using the same camera your robot will use. Second, do not add boxes to your background; instead, take roughly equal amounts of footage against backgrounds other than the different colored tiles. Additionally, take footage close up, looking at the identifying features; capture some of these close-ups with the camera set down rather than hand-held (so you do not get motion blur). Lastly, do not make giant boxes; only draw a box around the entire item (anything more will confuse TensorFlow). I would recommend using the same number of steps. Good luck and happy training!

Thank you for responding. I sincerely appreciate it. We made the video using the camera that is mounted on our robot, moving the robot around by hand. I’m also confused by:

  1. “Do not add boxes to your background” because we didn’t add any boxes to the background; and

  2. “Only draw a box around the entire item” because we only drew boxes immediately around the crown prop (i.e. we didn’t draw large boxes that contained any content outside of the crown prop).

If I understand correctly, the changes we should implement are as follows (suggestions not listed below are omitted because we have already been following them):

  1. Include footage of crown props with a variety of backgrounds (equal amount of footage per background); and

  2. Include close-up footage of crowns where focus is on the identifying features.

Is my understanding correct?

Sorry, my bad, I must not have understood your original post correctly. It sounds like your understanding is correct; best wishes!

Greetings @TRATOON,

I looked at your team workspace in order to provide some comments regarding your training:

  1. I’m glad that you decimated your video (to cut down on frames), but your video resolution (1920x1080) is very high for the ultimate model resolution (300x300). You will likely be using 640x480 for the video resolution on the robot, unless you are modifying the default resolution (and if you are, WHY? It just takes longer because of the mountains of data you’re now forcing the system to process). If you’re not modifying the default resolution, consider taking video at a smaller resolution - the effects of scaling will be less severe. Taking video at 1920x1080 almost completely wipes out small objects after scaling; for example, in Frames 62 and 373 of video “2023120303_Crown_Props_V3” the camera is just way too far away to be of any use to the model. The model will have difficulty training on those objects. (There’s a short robot-side camera-configuration sketch after this list for reference.)

  2. Your video labeling (“2023120303_Crown_Props_V3”) seems rushed - you have several frames with unlabeled objects where the frame was not set to ignore. This will cause big problems with training; the model will have a hard time settling on training parameters. If objects that are supposed to be recognized appear in a frame but you don’t label them, the model is told its correct detections are mistakes, which makes the training poor. For some examples, look at frames 286, 273, 129, 514, 576, and others.

  3. Motion blur is a huge problem in your videos. Consider taking video REALLY slowly, and then decimate (change the fps of) the video afterwards using a free tool (it looks like you’re skilled at that already). When I take video, I try to get the cleanest, crispest images for my model and let TensorFlow handle motion blur from the camera later. You don’t have to “train” the model to recognize motion blur; if anything, it hurts the model.

  4. If you look at your training, you can see that the training still has a ways to go for your model. When I see artifacts like what you’re showing, it usually tells me one of two things - (A) the model still has more training to go, or (B) what your model is trained for is conflicting with what the model is seeing. Your training images need more “mat puzzle pieces” if you’re trying to ignore them, because the “puzzle piece connections” on the mat are clearly conflicting with the top of your “crown.” You do a good job of showing edges of the red and blue mats (which I assume are being used to differentiate between the objects, to train TensorFlow that “not all red and blue objects are the crown”), but you don’t do the same for the gray tiles. Your training-data Loss graph (look at the TRAINING Loss graph, NOT the Evaluation Loss graph) looks like the image below. You want to keep training until the TRAINING loss ideally reaches below 0.2 - though to be honest I train everything with 3,000 steps. Sure, it takes extra time, but the training metrics have proven time and time again that 3,000 steps is a “magic number” for FTC-ML training:

[image: TRAINING Loss graph]

  5. Lighting differences will hurt you. The training video you have is lit incredibly well, but the sample you show in your original post is lit incredibly poorly. TensorFlow has no idea about color - nope, none, don’t even try. Color is just a red herring (haha). TensorFlow will use CONTRASTS to differentiate objects. What you’ve done in the training data is teach TensorFlow that lightly contrasted objects are to be ignored (the well-lit gray tiles) but the darker contrasted objects are, well, objects. Then in your samples you have a very poorly lit room. The gray tiles have a MUCH darker contrast, and so TensorFlow seems to be confusing those tiles with the object. Your training data needs to represent the full breadth of lighting conditions that your model will be used in.

  6. Given (5), and that in CENTERSTAGE there is no physical need to differentiate Blue or Red props, I recommend labeling all props the same. Allow the differing contrasts to let TensorFlow understand that the props can be a range of contrasts. You still need to have differing lighting conditions so that TensorFlow doesn’t use contrast as the ONLY differentiator, which it will do if it thinks it can get away with it (from the training data).
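Tying (1) and (6) together, here is a minimal robot-side sketch, assuming the SDK's VisionPortal/TfodProcessor API; the model name, the "Webcam 1" configuration name, and the single "Prop" label are placeholders for whatever your FTC-ML export and robot configuration actually use:

```java
import android.util.Size;
import com.qualcomm.robotcore.eventloop.opmode.Autonomous;
import com.qualcomm.robotcore.eventloop.opmode.LinearOpMode;
import org.firstinspires.ftc.robotcore.external.hardware.camera.WebcamName;
import org.firstinspires.ftc.vision.VisionPortal;
import org.firstinspires.ftc.vision.tfod.TfodProcessor;

@Autonomous(name = "Crown TFOD Sketch")
public class CrownTfodSketch extends LinearOpMode {
    @Override
    public void runOpMode() {
        // Hypothetical model name and label - substitute your own FTC-ML export.
        TfodProcessor tfod = new TfodProcessor.Builder()
                .setModelAssetName("crown_model.tflite")   // or setModelFileName(...) for a file on the robot
                .setModelLabels(new String[] {"Prop"})     // one shared label for red and blue crowns
                .build();

        VisionPortal portal = new VisionPortal.Builder()
                .setCamera(hardwareMap.get(WebcamName.class, "Webcam 1"))
                .setCameraResolution(new Size(640, 480))   // the default robot-side resolution from (1)
                .addProcessor(tfod)
                .build();

        waitForStart();
        // ...use tfod.getRecognitions() in your loop...
        portal.close();
    }
}
```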

If you have more questions, let me know. Good luck!

-Danny

Hello @ddiaz, and thank you for your response. Here is what I understand:

  1. TensorFlow can’t detect color; therefore, we should continue training on both props but label them with the same label.

  2. We should train to 3,000 steps.

  3. We haven’t taken lighting into account and should next time. Would images in a variety of lighting conditions solve this?

  4. We should focus on training the model with clear, crisp images and not worry about motion blur.

Additionally, I had some questions:

  1. If the ultimate model resolution is 300x300, should we record at that resolution or some other resolution?

  2. Our goal is to scan all 3 tape locations in one go and determine the prop’s location based on its x-coordinate. Is this practical? I know you mentioned the distance being too much for the model.

  3. Is it better to train the model only on viewpoints the robot will see it from, or to record it from as many angles as possible, even if the robot won’t see it from those perspectives?

  4. Regarding “puzzle pieces” in point 4, are you saying we should include negative frames of mat connection pieces of all colors in order to tell TensorFlow to ignore them?

  5. Should we record the crown not only on mats but also on things like sheets, paper, tables, etc.?

  6. If an object is partially off-frame (e.g. 25%, 50%, 75% visible), should the frame be included or ignored? Or should only frames showing the full object be used for training?

Thanks,

-Kavi

Correct. TensorFlow does not use color as a primary basis for identification; therefore, color won’t provide a consistent basis for identifying two objects that are otherwise the same. You really shouldn’t rely on color being the differentiator and should instead embrace it by labeling the objects with a single common label (unless for some reason you MUST identify RED vs BLUE, in which case I recommend physically different objects).

Your target should be to train to 100 epochs, as stated in the FTC-ML manual, but 3,000 steps has a very specific training curve that is incredibly optimized for the way we use Google Machine Learning to create TensorFlow inference models. I recommend a single 3,000 step training process, rather than training for 1,500 steps and then another 1,500 steps.
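For context on how those two numbers relate: a step trains on one batch of frames, and an epoch is one full pass through your labeled frames, so epochs ≈ (steps × batch size) / number of labeled frames. As a purely illustrative example, with a batch size of 32 and 960 labeled frames, 3,000 steps works out to about 100 epochs; your actual numbers depend on your dataset size and the tool's batch size.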

Yes, indeed! When I created the default model used for PowerPlay, training in multiple lighting conditions was instrumental in getting a good model that was mostly impervious to weird lighting conditions. You can see how I did that in this TensorFlow for PowerPlay article.

Yes.

I personally record in 1280x720 resolution, but I am extremely conscious about scaling - I like to keep the object filling AT LEAST 50% of the image frame at all times. This means that when I create videos, I tend not to pull very far away from the object (depending on the lens of the camera).

Sure. Remember that training and final use are different - you don’t necessarily have to train at the same distance the object will be at during final use. Train the model with the object, give the model variations in pose (distance and orientation) so it sees how the object changes as the distance changes, but keep the object filling 50% of the image. The model will learn the object. Just make sure your object is large, and keep the specific details general - small details will not be seen very well at a distance.
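On the x-coordinate half of that question, here is a minimal sketch of how that logic often looks, assuming the SDK's Recognition class; the label handling and the thresholds (which assume a 640-pixel-wide frame) are purely illustrative and would need tuning for your camera and mounting:

```java
import java.util.List;
import org.firstinspires.ftc.robotcore.external.tfod.Recognition;

public class PropLocator {
    // Illustrative thresholds splitting a 640-pixel-wide frame into thirds; tune for your setup.
    private static final double LEFT_EDGE = 640.0 / 3.0;
    private static final double RIGHT_EDGE = 2.0 * 640.0 / 3.0;

    /** Buckets the highest-confidence detection into LEFT, CENTER, or RIGHT. */
    public static String locate(List<Recognition> recognitions) {
        Recognition best = null;
        for (Recognition r : recognitions) {
            if (best == null || r.getConfidence() > best.getConfidence()) {
                best = r;
            }
        }
        if (best == null) {
            return "NONE";  // nothing detected this cycle
        }
        double centerX = (best.getLeft() + best.getRight()) / 2.0;
        if (centerX < LEFT_EDGE) {
            return "LEFT";
        } else if (centerX < RIGHT_EDGE) {
            return "CENTER";
        } else {
            return "RIGHT";
        }
    }
}
```

You would call something like PropLocator.locate(tfod.getRecognitions()) in your init or run loop and pick your autonomous path from the returned value.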

No, don’t confuse the model. If the model won’t see the object from a specific perspective, don’t train the model with that perspective - for instance, if the model will see a car from above, don’t train it with images of the underside of the car. That’s just confusing for the model. Train it with any/all perspectives that you want the model to recognize the object from, but keep it simple.

In all honesty, negative frames are not necessary unless your object is obscuring some important or confusing element of the background that you want to ensure the model doesn’t accidentally detect, or if the background is incredibly uniform and you don’t want the model to accidentally think the background is part of the object. When you label objects in an image, you must label ALL objects that will be detectable in the image. Not labeling objects in an image automatically tells TensorFlow that anything it detects in the image that is NOT labeled is a bad detection, and TensorFlow will add those objects to the ignore list. That’s why it’s SO VERY IMPORTANT that ALL images you use be fully and properly labeled.

Varying the background is important, but that’s also what negative frames are for. See, when it comes to your objects, TensorFlow only sees what’s inside the bounding box of your label. If there’s ALWAYS a gray background surrounding your object, TensorFlow cannot know if the gray is ACTUALLY part of the object or if the gray is the background. TensorFlow does keep a list of “interesting detections,” and if they’re never labeled they go into a special bucket for “background,” but portions of the background that always seem to be seen within bounding boxes don’t normally go into that “background” bucket. Negative frames can help with that, but only to a certain extent - varying the background can be of vital importance to teach the model what IS and IS NOT part of the actual object. That’s why I loved that you used the red and blue tiles: it provides different backgrounds so TensorFlow will NOT use the gray tiles as a key for recognizing the object. However, the only real difference there is color, and we’ve already talked about TensorFlow not really being able to “see color.” Some more variation in background would go a loooooong way.

Back in Freight Frenzy we actually trained the model with game objects on top of the gray tiles, and we did not vary the background - TensorFlow keyed in on that, and the game pieces would detect incredibly poorly without being on a gray tile. Oooof!

I personally tend to use the 33% rule - if AT LEAST 33% of the object is in the frame, label whatever portion of the object is visible. I find models tend to train better when I do that, and the models often recognize the objects better because they’re being trained to key in on multiple different key patterns in the same object. Just make sure that over half your training data uses full objects. If most of your training data is partial objects, that can lead to lots of false positives.

-Danny