Crash with TensorFlow

Hi,

Our team trained a custom TensorFlow model and is having decent results with it, except that ~5% of the time when initializing the autonomous op mode, the app on the Control Hub crashes. We can tell because, with an HDMI cable plugged into the Control Hub, we see the Android launcher. Sometimes the app restarts automatically and sometimes not, so we have to power cycle the Control Hub.

I removed all the blocks from the code except the TensorFlow init and the telemetry that shows detected objects during init. The op mode is called CrashRepro, and it still crashes occasionally. A rough Java equivalent of what it does is below.
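In case it helps anyone reading along, here is roughly what that stripped-down op mode boils down to in Java (we actually use Blocks; the webcam name, model file name, and label below are placeholders, not our real values):

import com.qualcomm.robotcore.eventloop.opmode.Autonomous;
import com.qualcomm.robotcore.eventloop.opmode.LinearOpMode;
import org.firstinspires.ftc.robotcore.external.hardware.camera.WebcamName;
import org.firstinspires.ftc.vision.VisionPortal;
import org.firstinspires.ftc.vision.tfod.TfodProcessor;

@Autonomous(name = "CrashRepro")
public class CrashRepro extends LinearOpMode {
    @Override
    public void runOpMode() {
        // Build the TFOD processor with the custom model (file name and label are placeholders).
        TfodProcessor tfod = new TfodProcessor.Builder()
                .setModelFileName("teamPropModel.tflite")
                .setModelLabels(new String[] { "TeamProp" })
                .build();

        // Attach the processor to the webcam through a VisionPortal.
        VisionPortal portal = new VisionPortal.Builder()
                .setCamera(hardwareMap.get(WebcamName.class, "Webcam 1"))
                .addProcessor(tfod)
                .build();

        // During init, just report how many objects the model currently sees.
        while (opModeInInit()) {
            telemetry.addData("Objects detected", tfod.getRecognitions().size());
            telemetry.update();
            sleep(100);
        }

        waitForStart();

        // Nothing else happens; the op mode exists only to reproduce the crash.
        portal.close();
    }
}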

I’m attaching a log where a crash occurred at 00:24:00. I was pressing Init, waiting a few seconds to see one object, hitting Stop, and repeating, so there are several short but successful inits before the crash.

The hardware is a Control Hub, new this year, running SDK 9.0.1, with a Logitech C270 webcam.

Does anything stand out in this block or in the logs?
https://drive.google.com/drive/folders/1Q2nzvAIuJH6FtaoQbowOiqXIjByEy9y8?usp=sharing

This is a pretty great log, thank you for that. I think you’re stopping and restarting the program in quick succession - if that’s the case, does it still crash if you slow down and wait 30-35 seconds between restarts of the OpMode? I wonder if the underlying drivers are trying to reset and perform cleanup, and the restarts are coming too fast for them - after several executions that cleanup takes longer and longer, until it finally trips a watchdog.

I notice in the logs that you start and stop the OpMode ~6 times. The log calls out that prior to the crash there’s possibly too much happening in the main thread (I presume non-user code) - this is reported by the “Choreographer” immediately after the 5th time you stopped the program, and then twice more during the 6th execution (the crash happens when you stop the 6th execution). All six OpMode executions proceed the same way, but in the last one a watchdog catches the code “stuck” in stop() at the end of the crashing execution. This happens on the scriptFinishedLock synchronization object, which forces the OpMode to wait for cleanup to complete - which tells me that something behind the scenes really is struggling to keep up and is taking progressively longer than expected to do its work. During the 5th execution we get a warning that things are overloading, and in the 6th execution that overload finally trips the “stopped code” watchdog (that doesn’t necessarily mean the code is “stopped”, it just means it took too long to complete).

I’m just trying to characterize the behavior of the crash, not necessarily making any determination yet about what’s actually causing it. But it looks like either you’re starting and stopping the execution too fast, or something gets “backed up” and eventually takes too long, triggering a watchdog that crashes the system so we can see what’s going on. It would be nice to know whether it’s easier or harder to reproduce if you “take it easy” on the code.

-Danny

Thanks for the detailed response, Danny!

I tried again this morning, pressing Init, waiting for the camera to see an object, pressing Start, waiting 15 seconds, pressing Stop, then waiting 30 seconds before pressing Init again.

This log shows many successes; then at 08:43:50 something different happened when I pressed Stop: the “Status: Robot is stopped” line blinked a couple of times, though the Driver Hub stayed connected to the Robot Controller. I downloaded the logs and looked at them for a couple of minutes, then pressed Init at 08:51, and at that point the app crashed on the Robot Controller. I waited for it to restart, then downloaded the logs above.

Could the “too much happening” be due to the live preview on the Robot Controller HDMI port? When the team is done developing the vision model, do you recommend turning that off? Disabling live preview doesn’t affect the ability to use the Camera Stream on the Driver Hub during init, right?

Since I haven’t seen any crashes during the program run, only when pressing Init or Stop, I think at this point I’ll tell the driver coach to init the auton program, check the camera stream, and if it’s stable for 10-15 seconds, they’re good to go.

We are having a similar issue, but ours fails about 60 to 75% of the time. Sometimes it crashes and says “Robot stopped responding to commands while op mode was running”, and other times it restarts the robot. We have a competition Saturday and we are not sure how to correct it.
Chris

Sorry to hear your team is having this issue as well.

It still occurs for us, mostly when starting and stopping the auton program quickly. One mitigation that works for us is to scan for the team prop only during init; once auton starts, we stop interacting with the camera so a crash can’t happen while auton is running (rough sketch below). If it crashes during init, our students signal the FTA to get time to restart the robot, and the team prop randomization is redone if that happens.
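If it helps, in Java terms the mitigation looks roughly like this (a sketch only; we actually use Blocks, and the webcam name and default TFOD setup below are placeholders rather than our real configuration):

import com.qualcomm.robotcore.eventloop.opmode.Autonomous;
import com.qualcomm.robotcore.eventloop.opmode.LinearOpMode;
import org.firstinspires.ftc.robotcore.external.hardware.camera.WebcamName;
import org.firstinspires.ftc.vision.VisionPortal;
import org.firstinspires.ftc.vision.tfod.TfodProcessor;

@Autonomous(name = "AutonScanDuringInitOnly")
public class AutonScanDuringInitOnly extends LinearOpMode {
    @Override
    public void runOpMode() {
        // Placeholder TFOD setup; the real op mode loads the custom team prop model.
        TfodProcessor tfod = TfodProcessor.easyCreateWithDefaults();
        VisionPortal portal = new VisionPortal.Builder()
                .setCamera(hardwareMap.get(WebcamName.class, "Webcam 1"))
                .addProcessor(tfod)
                .build();

        // Scan for the team prop only while sitting in init, and remember the result.
        int objectsSeen = 0;
        while (opModeInInit()) {
            objectsSeen = tfod.getRecognitions().size();
            telemetry.addData("Objects detected", objectsSeen);
            telemetry.update();
            sleep(50);
        }

        waitForStart();

        // Once the match starts, stop interacting with the camera entirely so no
        // camera/TFOD work happens while auton is running.
        portal.close();

        // ...the rest of auton runs off the detection captured during init,
        // without touching the camera again.
    }
}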

I don’t know if that’s an option for your team but sharing just in case.

Thanks for the advice. Our programmer just told me that it was also crashing when the op mode wasn’t running, too.

I’m hoping that Danny can look at our logs to see if anything is obvious too.

If you have a chance, plug an HDMI monitor into the Control Hub to see what it displays when the crash happens. We see a blue square instead of the camera preview just before it crashes.

It may help to disable the live view of the camera on the Control Hub (the one shown when you have HDMI plugged into the Control Hub, not the one you see on the Driver Hub when you click Camera Stream).
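If your vision setup is in Java (I think Blocks exposes the same builder option), the live view switch is roughly the enableLiveView setting on the VisionPortal builder; this is just a sketch, and “Webcam 1” and the surrounding helper are placeholders, not code from our robot:

import com.qualcomm.robotcore.eventloop.opmode.OpMode;
import org.firstinspires.ftc.robotcore.external.hardware.camera.WebcamName;
import org.firstinspires.ftc.vision.VisionPortal;
import org.firstinspires.ftc.vision.tfod.TfodProcessor;

// Hypothetical helper just to show where the switch lives.
public class VisionSetup {
    public static VisionPortal buildPortal(OpMode opMode, TfodProcessor tfod) {
        return new VisionPortal.Builder()
                .setCamera(opMode.hardwareMap.get(WebcamName.class, "Webcam 1"))
                .addProcessor(tfod)
                // Turns off the preview drawn on the Control Hub's HDMI output / RC screen.
                // The Camera Stream opened from the Driver Hub menu is a separate feature.
                .enableLiveView(false)
                .build();
    }
}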

We can try that tonight, but we actually have not been running the live stream on the Driver Station and still get the crashes.