Problem with the Steps

Bret4886 · January 14, 2022, 5:28pm

Hello, I’m Bret from 4886 Robojunkies.

I have followed the steps in the user manual to the best of my ability: I created videos, I selected them to be trained into a dataset, and with the trained dataset, I moved on to creating my first model. For context, I labeled 2 objects, Duck (a yellow rubber ducky) and BlueSphere (our team shipping element).

The dataset I used contains 265 training frames and 66 evaluation frames. With this in mind, I followed the formula provided to get 100 epochs, as the manual suggested (I realize this is a rough suggestion, but I wanted to understand my limits before going higher in epoch counts). Following the formula, I ended up with 830 steps:

100 (recommended epochs) x 265 (training frames) / 32 (the recommended batch size) = about 830 (working out the math gives 828.125 steps, so I just rounded up)
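As a quick sanity check of the arithmetic above, here is a minimal Python sketch of the same formula. The function name `steps_for_epochs` is made up for illustration and is not part of the training tool; the inputs are the numbers quoted in this post.

```python
import math


def steps_for_epochs(epochs: int, training_frames: int, batch_size: int) -> int:
    """Smallest whole number of steps that covers `epochs` passes over the data.

    Hypothetical helper: it only restates the formula
    steps = epochs * training_frames / batch_size, rounded up.
    """
    return math.ceil(epochs * training_frames / batch_size)


exact_steps = 100 * 265 / 32                 # unrounded result of the formula
minimum_steps = steps_for_epochs(100, 265, 32)

print(exact_steps)     # 828.125
print(minimum_steps)   # 829
```

Rounding up to 830 just leaves a small margin over the 829 whole steps the formula requires.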

The problem I have encountered is this: if it takes 60 minutes to train 3000 steps, then, if my math is right, I could train my 830 steps in about 17 minutes. I settled for 20 minutes so that if for some reason the training lagged “behind schedule” I’d have an extra 3 minutes or so to account for it.
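The time estimate can be sketched the same way. This assumes training time scales linearly with step count at the rate quoted above (3000 steps in 60 minutes), which is only a rough approximation, since it ignores any startup overhead. The name `minutes_for_steps` is hypothetical.

```python
def minutes_for_steps(steps: int,
                      reference_steps: int = 3000,
                      reference_minutes: float = 60.0) -> float:
    """Estimate training minutes, assuming time is proportional to steps.

    The default reference rate (3000 steps per 60 minutes) is the figure
    quoted in this post; real jobs may run faster or slower.
    """
    return steps * reference_minutes / reference_steps


estimate = minutes_for_steps(830)
print(round(estimate, 1))   # 16.6
```

A 20-minute allocation therefore leaves a little over 3 minutes of headroom under this assumption.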

Sadly, after 20 minutes, I discovered that none of my steps were trained. No checkpoints, nothing. I ran the same training job again, this time allocating 25 minutes, but to no avail. (You’ll notice the timer is a little over these values because it took some time to stop the training.)

My question to anyone who knows what might be going on is, is this a bug in the User Interface? Or have I made a miscalculation somewhere?

Thank you so much,
Bret Bodily 4886 Robojunkies
(See attached:)

(I would love to send more to show you what I mean, but I’m limited to 1 image since I’m a new user. Hopefully this makes sense)

Bret4886 · January 14, 2022, 8:46pm

I created only one dataset just to test. I figured that once I had seen I could produce a model with it, I would use more videos to get about 1000 frames per dataset and then make a better model. (I know that seems silly, but in hindsight, if I had followed through I would’ve spent more of the limited training time and might’ve run into the same problem I’m having now.) I doubt there could be a problem with the dataset, though, since I wouldn’t have been able to make it in the first place if I had any labels misspelled or errors like that. I’m providing it anyway in case someone sees something I’m not.

Bret4886 · January 14, 2022, 8:51pm

Also, I forgot to mention: in the manual, the code for “STOPPED” says “The user cancelled the job after checkpoints were created, can train more.”, but I never cancelled the job. It just stopped after my training time had run out. I can still select either of the two models, but it only gives me the option to delete them, not to do more training.

lizlooney · January 14, 2022, 9:12pm

I’ll take a look and see if I can figure out what happened.

-Liz

ddiaz · January 14, 2022, 9:53pm

@lizlooney took a look at our internal logs, and for whatever reason your training job never started - we don’t see any errors, it’s just that the Google worker never started the training job. We’ve seen this once before a couple months ago, and our Google Cloud contacts asked us to let them know if we ever see it again so we’ll let them know it’s happened again. Other jobs have been running great the past couple weeks, and Liz looked for other jobs that may have had this problem too (based on our logs) but we don’t really see any others. I’m going to credit your training time back to your account, and ask you to try running your job again.

By the way, you’re right - I need to update the STOPPED message in the manual specifically for this corner case (it’s not supposed to happen! LOL). You cannot continue training because continuing requires a saved checkpoint. Since the training job under the hood never started, it never completed the initial 100 steps required to produce a checkpoint, so there is nothing to continue from.

-Danny

Bret4886 · January 15, 2022, 12:59am

Thank you both so much! I’ll run it again, and I’ll let you know what happens! Just so you can see from my screen, this is exactly what I ran last time, and what I’ll run right now: