Hi! I’m a rookie programmer. My camera can identify the object with my model, but I don’t know how to use the identified result in my Blocks code.
Which sample are you using as the base of your Blocks code, and which parts of the code are you having trouble understanding?
I’m perfectly willing to help you understand the sample code, but I need to know which elements are confusing you.
-Danny
I followed the instructions on https://ftc-docs.firstinspires.org/, making my model and uploading it. However, my camera has a hard time identifying the item correctly (it keeps detecting other, unrelated objects). Could you please give me some advice on how to shoot the training video so the recognition is more precise? Also, I don’t know what to do next after my camera identifies the item.
I’m so grateful and surprised that someone can help me. Thank you very much!
Greetings, @alex_hume!
Okay, so it seems as if you’ve got two primary issues going on here. The first issue is that your model doesn’t seem to be trained properly, and the second is understanding how the sample code works so that you can write your own program that does what you want.
I’m going to focus this post on training your model using FTC-ML. I have access to your team’s FTC-ML workspace, so I can see all of your model training - since you posted asking for help, I took that as permission to take a peek and see if there was anything I could identify. I saw the following areas of interest:
Homogeneous Poses
Your videos are well-shot, non-blurry, and clean. You’re also rotating around the object to capture multiple poses. You’re using a “standard” resolution and you’re keeping the object within ~50% of the frame. So far you’ve got a great dataset for training.
However, your object is uniform all the way around, so you don’t technically need a full 360 degrees of video around it (essentially 75% of your video is just repeated frames because of that uniformity). But that’s secondary to the real problem - your poses are too homogeneous (meaning they’re all the same), and the poses do not appear to match how your robot will actually see the object. What you’ve done is shoot video with what is likely a handheld camera (don’t get me wrong, that’s perfectly fine - I do it too) while standing near the object. But a few degrees of pose difference can mean a HUGE difference to a model. Your handheld camera appears to be about 24-30 inches off the floor looking down at your object, yet your robot camera appears to be about 12-16 inches off the floor.
Recommendations for Correction:
- Take video of the object at different heights. You can use the current video as an “up high” sample, but then take video exactly at the robot camera height, and then take some video below the robot camera height. The more examples the model gets, the better it can adjust to changing pose angles.
- Take video with the camera closer to the object than you currently are, and then take video with the camera slightly farther away. You can do this in one smooth shot or in multiple videos. This gives your model multiple “sizes” of the same object, and helps it train on details (and lack of details) in the object.
Inconsistent Labeling
As I looked through your training videos, I saw inconsistent labeling throughout. Here are some examples:
The first image is a good label, but the labeling gets progressively worse. Inconsistent labeling can wreck your model’s ability to distinguish the object from the background. Even though your model finished with relatively good training metrics, I think those high metrics are simply a product of the homogeneous poses of your object (and the incredible number of effectively “duplicate” frames).
Recommendations for Correction:
- Watch your automated tracking process more carefully. When the labeling box shifts, stop the tracking immediately, go back and correct the bad label boxes, and then resume tracking from your last good label. Machine Learning is a Garbage-In, Garbage-Out process; if you don’t give the model the best inputs, you’re not going to get the best outputs.
Backgrounds and Lighting
What’s up with the tile floor in the training video? That tile looks much brighter/lighter than the tile floor in your Driver Station snapshots. Is that just an artifact of the lighting and/or filters you’re using on your camera? If so, you need to vary the background so that the model doesn’t key in on it when training. See, the model cannot easily differentiate the Blue Cone from the light-colored background if the light-colored background is ALWAYS present. When we trained the DUCK model in Freight Frenzy, we made the mistake of training the model with the DUCK always on the gray tiles, and the model eventually trained to recognize the DUCK only when it was on a gray tile - it didn’t detect the DUCK very well when it wasn’t on a gray tile. You need to vary the background so that the model will eventually say, “Oh, I see now, the contents within the label box don’t always include the same background, so I should stop using the background (or its color, contrast, or texture) as a primary/secondary/tertiary way to identify the object.”
Recommendations for Correction:
- Take videos of the object on multiple different backgrounds. Try adjusting the brightness/lighting so that the model doesn’t assume the object will always be lit the same way, or that textures on the object/background will always be highlighted the same way. Heterogeneous training data makes for better training!
I think you’re doing a great job with the videography so far - you just need to incorporate some of these additional tips to help train a model that is more robust and better able to recognize the object accurately.
-Danny
I’ll focus this post on the Custom TensorFlow sample OpMode.
I created my OpMode using the ConceptTensorFlowObjectDetectionCustomModel sample (just highlighting that here for others who might be following along).
The sample program has three functions (there’s a sketch after this list showing how they fit together):
- runOpMode - This is the standard starting function. It handles all of the program coordination:
  - Calling the initTfod function during the initialization part of the program
  - Calling waitForStart to wait for the Start button to be pressed
  - Continuously calling telemetryTfod in a loop while opModeIsActive returns true (it will return false once you hit the Stop button or the program exits)
  - Optionally stopping or resuming the VisionPortal streaming using the DPAD buttons
- initTfod - This is the standard VisionPortal TFOD initialization function. It sets up, configures, and launches TensorFlow for the program, and it must be called during initialization if you want the Three-Dot Camera Preview to show anything:
  - Creates the myTfodProcessor necessary for a VisionPortal
  - Configures the myTfodProcessor with the model parameters (you can also optionally add global settings using blocks from the Vision->TensorFlow->TfodProcessor toolbox)
  - Builds a myVisionPortal instance using the myTfodProcessor so that TensorFlow will run within the VisionPortal instance
- telemetryTfod - Each time this function is called, it asks TensorFlow for the most recent set of object detections. It also calculates the centroid of each found object (useful for locating the object within the image frame):
  - Calls TfodProcessor.getRecognitions, which asks TensorFlow for the most recent list of found objects (called “recognitions”). You can see how many objects were recognized by using the “length of” block on that list.
  - Uses a “for each item” block to iterate through the list of recognitions. Inside the loop, each recognition object is represented by the variable “myTfodRecognition”. If the “length of” block tells you there is only one detection, or if you only care about the first detection (always check that there is at least one object in the list!), you can use the “in list … get … #” block (in the Lists toolbox) with the number 1 to get the first recognition object.
  - Uses the TfodRecognition.Label block (found in the Vision->TensorFlow->Recognition toolbox) to determine which label the object was detected as. This is useful if there are multiple labels in the model - if there’s only one label, like “cone” in your case, it may not be necessary because the only possible label will be “cone”.
  - Uses the TfodRecognition.Left and TfodRecognition.Right blocks to calculate the horizontal center of the object (it adds the two values and divides by 2).
  - Uses the TfodRecognition.Top and TfodRecognition.Bottom blocks to calculate the vertical center of the object (it also adds the two values and divides by 2).
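If it helps to see those three functions side by side in text form, here is a rough sketch of the same structure written in Java (the same FTC SDK API that the Blocks sit on top of). Treat it as an illustration rather than a drop-in program - the webcam configuration name (“Webcam 1”), the model file path, and the single “cone” label are assumptions you would replace with your own values.

```java
// ROUGH Java equivalent of the Blocks structure described above (not your actual Blocks code).
// Assumed placeholders: a webcam configured as "Webcam 1", a model file uploaded to the
// Robot Controller, and a single trained label "cone".
import com.qualcomm.robotcore.eventloop.opmode.LinearOpMode;
import com.qualcomm.robotcore.eventloop.opmode.TeleOp;

import org.firstinspires.ftc.robotcore.external.hardware.camera.WebcamName;
import org.firstinspires.ftc.robotcore.external.tfod.Recognition;
import org.firstinspires.ftc.vision.VisionPortal;
import org.firstinspires.ftc.vision.tfod.TfodProcessor;

import java.util.List;

@TeleOp(name = "TFOD Custom Model Sketch")
public class TfodCustomModelSketch extends LinearOpMode {

    private TfodProcessor myTfodProcessor;
    private VisionPortal myVisionPortal;

    @Override
    public void runOpMode() {
        // Initialization phase: build the TFOD processor and VisionPortal so the
        // Three-Dot Camera Preview has something to show before Start is pressed.
        initTfod();

        waitForStart();                    // wait for the Start button

        while (opModeIsActive()) {         // returns false once Stop is pressed
            telemetryTfod();               // report the latest recognitions
            telemetry.update();

            // Optional: pause/resume the camera stream with the DPAD buttons.
            if (gamepad1.dpad_down) {
                myVisionPortal.stopStreaming();
            } else if (gamepad1.dpad_up) {
                myVisionPortal.resumeStreaming();
            }
            sleep(20);
        }
    }

    // Sets up, configures, and launches TensorFlow inside a VisionPortal.
    private void initTfod() {
        myTfodProcessor = new TfodProcessor.Builder()
                .setModelFileName("/sdcard/FIRST/tflitemodels/myCustomModel.tflite") // assumed path
                .setModelLabels(new String[] { "cone" })                             // assumed label
                .build();

        myVisionPortal = new VisionPortal.Builder()
                .setCamera(hardwareMap.get(WebcamName.class, "Webcam 1"))  // assumed config name
                .addProcessor(myTfodProcessor)      // TensorFlow runs inside this portal
                .build();
    }

    // Asks TensorFlow for the most recent detections and reports each one,
    // including the centroid of its bounding box.
    private void telemetryTfod() {
        List<Recognition> recognitions = myTfodProcessor.getRecognitions();
        telemetry.addData("# Objects Detected", recognitions.size());

        for (Recognition myTfodRecognition : recognitions) {
            double xCenter = (myTfodRecognition.getLeft() + myTfodRecognition.getRight()) / 2;
            double yCenter = (myTfodRecognition.getTop() + myTfodRecognition.getBottom()) / 2;
            telemetry.addData("Label", myTfodRecognition.getLabel());
            telemetry.addData("Center (px)", "%.0f, %.0f", xCenter, yCenter);
        }
    }
}
```

The Blocks you drag in (TfodProcessor.getRecognitions, TfodRecognition.Label, .Left/.Right/.Top/.Bottom) correspond directly to the getRecognitions(), getLabel(), and getLeft()/getRight()/getTop()/getBottom() calls above.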
So, building on this sample, you can minimally write code like this to determine whether TensorFlow has found any objects, and the ROUGH horizontal location of the first object’s detection box:
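Since that minimal example is built from Blocks, here is a rough Java-form sketch of the same logic in case it helps to read it as text; it reuses the assumed myTfodProcessor from the earlier sketch, and the method name reportFirstDetection is just an illustration.

```java
// Could replace telemetryTfod() in the sketch above: reports whether anything was found
// and the ROUGH horizontal center of the first detection's bounding box.
private void reportFirstDetection() {
    List<Recognition> recognitions = myTfodProcessor.getRecognitions();
    if (recognitions.size() > 0) {                // always make sure the list isn't empty first!
        Recognition first = recognitions.get(0);  // Blocks: "in list ... get ... #1" (Blocks lists are 1-based)
        double xCenter = (first.getLeft() + first.getRight()) / 2;
        telemetry.addData("Found", first.getLabel());
        telemetry.addData("X center (pixels)", xCenter);
    } else {
        telemetry.addData("Found", "nothing");
    }
}
```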
This example is crude in order to keep it simple, but I think it shows the general concept. If you have any questions about this, feel free to ask.
-Danny