Our team trained a custom TensorFlow model and is getting decent results with it, except that ~5% of the time when initializing the autonomous op mode, the app on the Control Hub crashes. We can tell because with an HDMI cable plugged into the Control Hub, we see the Android launcher. Sometimes the app restarts automatically; when it doesn’t, we have to power cycle the Control Hub.
I removed all the blocks from the code except the init for TensorFlow and the telemetry to show detected objects during init. The op mode is called CrashRepro. It still crashes occasionally.
I’m attaching a log where a crash occurred at 00:24:00. I was pressing Init, then waiting a few seconds to see 1 object, then hitting stop, and repeating. So there are several short but successful inits before the crash.
The hardware is a Control Hub new from this year, running SDK 9.0.1, and a Logitech C270 webcam.
Does anything stand out in this block or in the logs?
This is a pretty great log, thank you for that. I think you’re stopping and restarting the program in quick succession. If that’s true, does it still crash if you slow down and wait 30-35 seconds between stopping and restarting the OpMode? I wonder if the underlying drivers are trying to reset and perform cleanup, and the restarts are coming too fast for them: after several executions that cleanup takes longer and longer, until it finally trips a watchdog.
I notice in the logs it appears you’re starting and stopping the OpMode ~6 times. The log calls out that prior to the crash there’s possibly too much happening in the main thread (I presume non user code) - this is being reported by the “Choreographer” immediately after the 5th time you stopped the program, and then again twice during the 6th execution (the crash happens when you stopped the 6th execution). The program execution in all 6 OpMode cases executes the same, but in the last OpMode execution a watchdog catches the code “stuck” in stop() at the end of the crashing execution. This is apparently happening on the scriptFinishedLock synchronization object, which forces the OpMode to wait for cleanup to complete - this tells me that something behind the scenes is indeed struggling to keep up and is taking progressively longer than expected to do some work. It appears that during the 5th execution we get a warning that things are overloading, and in the 6th execution that overload finally trips the “stopped code” watchdog (that doesn’t necessarily mean the code is “stopped”, it just means it took too long to complete).
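To make the "took too long" idea concrete, here is a minimal Java sketch of a stop call that waits on a "finished" signal with a time limit. This is not the SDK's actual implementation: the class name, the method stopFinishedInTime, and the millisecond values are all made up for illustration, and a CountDownLatch stands in for whatever synchronization object the real code waits on. It only shows how cleanup that gets progressively slower can eventually exceed a fixed watchdog limit even though nothing is truly hung.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; none of these names come from the FTC SDK.
class StopWatchdogSketch {

    /**
     * Simulates stop() waiting for background cleanup to finish.
     * Returns true if cleanup completed within watchdogMillis,
     * false if the (simulated) watchdog limit was exceeded.
     */
    static boolean stopFinishedInTime(long cleanupMillis, long watchdogMillis) {
        // Stand-in for the "finished" lock that stop() has to wait on.
        CountDownLatch cleanupDone = new CountDownLatch(1);

        Thread cleanup = new Thread(() -> {
            try {
                Thread.sleep(cleanupMillis); // pretend to release camera, buffers, etc.
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return; // abandoned: never signals completion
            }
            cleanupDone.countDown(); // signal that cleanup is complete
        });
        cleanup.setDaemon(true);
        cleanup.start();

        try {
            // Wait for cleanup, but only up to the watchdog limit.
            boolean finished = cleanupDone.await(watchdogMillis, TimeUnit.MILLISECONDS);
            if (!finished) {
                cleanup.interrupt(); // give up on the slow cleanup
            }
            return finished;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        long watchdogMillis = 200; // assumed limit, chosen only for the demo
        // Each run's cleanup takes twice as long as the previous one.
        for (long cleanupMillis = 20; cleanupMillis <= 320; cleanupMillis *= 2) {
            boolean ok = stopFinishedInTime(cleanupMillis, watchdogMillis);
            System.out.println("cleanup " + cleanupMillis + " ms: "
                    + (ok ? "stopped cleanly" : "watchdog tripped"));
        }
    }
}
```

In the demo loop the cleanup time doubles on each run, so the early runs finish well inside the limit and the last one exceeds it. That is the pattern I suspect in your log: several clean stops, then one that takes too long.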
I’m just trying to characterize the behavior of the crash, not necessarily making any determination yet about what’s actually causing it. But it looks like either you’re starting and stopping the execution too fast, or something gets “backed up” and eventually takes too long, triggering a watchdog that crashes the system so we can see what’s going on. It would be nice to know whether the crash is easier or harder to reproduce when you “take it easy” on the code.
Thanks for the detailed response, Danny!
I tried again this morning: pressing init, waiting for the camera to see an object, pressing start, waiting for 15 seconds, pressing stop, then waiting for 30 seconds before pressing init again.
This log shows many successes, then at 08:43:50 something different happened when I pressed stop. The “Status: Robot is stopped” line blinked a couple of times. The Driver Hub stayed connected to the robot controller, though. I downloaded the logs and looked at them for a couple of minutes, then I pressed init at 08:51, and at that point the app crashed on the robot controller. I waited for it to restart, then downloaded the logs above.
Could the “too much happening” be due to the live preview on the robot controller HDMI port? When the team is done developing the vision model, do you recommend turning that off? Disabling live preview doesn’t affect the ability to use the camera stream on the driver hub during init, right?
Since I haven’t seen any crashes during the program run, only when pressing init or stop, I think at this point I’ll give the guidance to the driver coach to init the auton program, check the camera stream and if it’s stable for 10-15 seconds, they are good to go.