Message boards : Number crunching : Problems with Rosetta version 5.93
Author | Message |
---|---|
csbyseti Joined: 24 Dec 05 Posts: 11 Credit: 7,038,618 RAC: 34 |
"Ended by watchdog, and running beyond their runtime target are two rather different things." I think I mean the same thing as FalconFly. The 2h4o WUs have a problem. I restarted one WU on the Quad; the CPU time jumped down to 1h:xx (last working checkpoint?) and it seemed to be running. It finished with a wrong CPU time of 11337 sec (3h:8), but with a heartbeat error. https://boinc.bakerlab.org/rosetta/result.php?resultid=135428650 On the X2 the CPU time jumps down to 0h:0x after a restart (from 6h:59); the WU seems to run but does no more work until the watchdog stops it. This WU would consume 4x3h + 6h:59 = 19h of CPU time. If such a WU is stopped and restarted by the scheduler and its CPU time is reset, it will be a never-ending loop. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
"If such a WU is stopped and restarted by the scheduler and its CPU time is reset, it will be a never-ending loop." The watchdog will catch such a thing and abort it for you. In this case, it would notice that the task was restarted 5 times from the exact same point. In other words, "I've started this thing 5 times and never reached a checkpoint, so I'm going to abort it". The basic idea is that something about that task is not well suited to how you are using your computer, so the watchdog ends it, reports it back, and gets another task, which will tend to behave differently. Rosetta Moderator: Mod.Sense |
Astro Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
Here's a snapshot of what others might be describing about the 2h4o WUs. This is on my wife's laptop, which was set to a 1 hour run time pref, but I changed it at some point last night to 6 hours (note: I changed it before I knew about this one; her laptop is WAAAAY out in the dining room, which never sees meals on the table, so I'm seldom there). Either way, we're way past that. Its longest recorded decoy (out of 912 recorded) was a 1gida which lasted 16627 seconds (4.61 hrs). I suspended the other projects to see what happens with it. [image] [edit] After 1 hour of run time the CPU time has progressed one hour, and the "% complete" has progressed from 98.558 to 98.664, but the "to comp" has remained unchanged. |
Astro Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
I can't edit after 60 min. After 3 hours the cpu time seems right, % comp has progressed up to 98.846, and "to comp" has gone up one second to 00:09:54. Hmmm, at .1%/hour there's just 11 more hours to go making it 25 hours/decoy...gotta be a record. I'll not post again until it's nearly over. (yes...I know 98.848% seems like it'd be nearly over...LOL) |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Astro, the estimates are based on the time spent as compared to the target runtime... except for the final 10 min to completion. There the client makes increasingly fine adjustments to show that things are still moving forward, but it really doesn't know how long that model is going to take. Once the model completes, your time to completion will zip from wherever it was near 10 min to zero. You can't take a .1% adjustment and extrapolate that into a prediction of the final time to completion. The last 10-12 min of the time to completion do not work that way. And the time prior to that is just based on the time spent, as compared to the target runtime. So, until you've completed a model on that task, there really isn't a great method to arrive at a true predicted time to completion. For most tasks, which take less than an hour per model, this method works fairly well. These 6+hr per model tasks are basically the worst case for the time estimate calculations. Rosetta Moderator: Mod.Sense |
Astro Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
"These 6+hr per model tasks are basically the worst case for the time estimate calculations." I do think something's wrong. For the last 3 hours, it's progressed .1%/hour. If that were true for the full length of the WU, and given I'm 15 hours into it, then I should only be at 1.5% complete. At some point the "% comp" had to have progressed faster, and then at some point it went into slow motion mode. I'm aware of how the "to comp" works and have NO issue with that. Also, if I'd had a 1 hr, 2 hr, 3 hr, or shortly a 4 hour run time preference, then this one would have been ended by the watchdog. Of course, I'm assuming it'll finish at all. If the .1%/hour holds, then a 6 hour pref would also have been ended by the watchdog (I'll have to wait and see the total run time before I can say that definitively). I guess, if these are really that long, then admin should change the % comp mechanism, and say something about having some "unusually LARGE" WUs in the system ATM. Otherwise, you're going to get a lot of questions, and who knows how many users will "abort" just because they don't know it might be "normal". Heck, I feel that I'm doing them a favor even running it, as my gut feeling (without admin acknowledgement that this is normal) is that I'm going to get nada for a day's work. |
Joined: 15 Jul 06 Posts: 76 Credit: 5,334,441 RAC: 0 |
Another 2h4o oddity - I had this one, work unit 123329393: 2h4o__BOINC_TWIST_RINGS_TWIST_ANGLE_SYMM_FOLD_AND_DOCK-2h4o_-native__2668_12846 that ran 12.53 hours of CPU time! My outcome says "Success" and "Done" (and I got plenty of credit for it) - but when I look at the details, I see: <core_client_version>5.10.20</core_client_version> <![CDATA[ <stderr_txt> # cpu_run_time_pref: 10800 # random seed: 1745555 # cpu_run_time_pref: 10800 # random seed: 1745555 # cpu_run_time_pref: 10800 ********************************************************************** Rosetta score is stuck or going too long. Watchdog is ending the run! CPU time: 45123.3 seconds. Greater than 4X preferred time: 10800 seconds ********************************************************************** GZIP SILENT FILE: .xx2h4o.out But it shows a "Validate state" of VALID. I certainly am not complaining about the credit, but how can it be done and valid if the watchdog shut it down?? --hedera Never be afraid to try something new. Remember that amateurs built the ark. Professionals built the Titanic. |
Astro Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
That's kind of my point. It was ended by the watchdog at 4X his/her 3 hour run pref, at 12 hours + a bit. The Task ID shows NO decoy info at all. Was any scientifically worthwhile work performed? Or is it just credit for time served?? This is going to be very typical for all participants except those with a "cpu run time pref" exceeding 8-12 hours (depending on processor, etc.). And now that I think about my one WU and my 6 hour run time, it looks like I should bump that up to the next step above 6 hrs or suffer the same fate as everyone else. It's at 99.005 percent after 16:38:00, so it's holding to the .1%/hour. [edit] Moved up to an 8 hour pref. |
Joined: 15 Jul 06 Posts: 76 Credit: 5,334,441 RAC: 0 |
What's the default on CPU runtime limits? I looked at my settings and target CPU runtime is "not selected", so I have whatever you get when you don't specify. Do we have a consensus on what it should be?? I'm a little confused. --hedera Never be afraid to try something new. Remember that amateurs built the ark. Professionals built the Titanic. |
Joined: 2 Jul 06 Posts: 2842 Credit: 2,020,043 RAC: 0 |
"What's the default on CPU runtime limits? I looked at my settings and target CPU runtime is "not selected", so I have whatever you get when you don't specify. Do we have a consensus on what it should be?? I'm a little confused." I've got mine set to 1 day; however, I think it defaults to 3 hours. |
Astro Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
Yes, "not selected" means the default of 3 hours. Also, that's set on the "web", so your client must "call home" in order to see and apply the change. This happens when it gets/reports work. But if you want it to change in the middle of a run, you must do a "project update". You can manually update the projects from the "projects" tab on the manager. Highlight the project name in the right hand box by clicking on it. Then click the "update" button to the left. |
Astro Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
Here's an updated pic of the same WU, taken 17+ hours after the first pic I posted. I changed my runtime pref to 8 last night, but that only gets me up to 32 hours before the watchdog kicks in. Perhaps I'll go to a 12 hour pref so it'll be able to finish normally, as long as it doesn't take more than "48 HOURS to do ONE decoy" on a Mobile AMD64 3700 w/1 G RAM. Boy, if the others I have in cache take anywhere near this long... all my Boincsimap, Einstein, and the rest of my Rosetta will be past the deadline. Also notice the rate of completion seems to be continually slowing (at least I assume so), since it's only progressed .3% overnight while I slept, instead of the .1%/hour I was seeing. At 28+ hours, this decoy has already taken more than 6 times its previous "longest decoy". [image] [Edit] I went to a 12 hour "run time pref", so hopefully it'll finish in the next 19 hours. |
Joined: 12 Dec 06 Posts: 45 Credit: 4,428,086 RAC: 0 |
You really are on a mission to find out how long it will take. Go for it, Astro! |
Astro Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
Someone's got to be the guinea pig. "Show me the little plastic wheel, and I'll take her for a spin". I'll take this posting opportunity for an update: CPU time 30:21:01, 99.453% complete, 00:09:56 remaining. Using the benchmark claiming method, this WU is worth 429.75 credits so far. I wonder.... |
Astro Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
Strange. hedera received 88 of his 98 claimed for his watchdog ended task resultid=135513724. I wonder what the difference was? |
Joined: 12 Dec 06 Posts: 45 Credit: 4,428,086 RAC: 0 |
"Strange. hedera received 88 of his 98 claimed for his watchdog ended task resultid=135513724. I wonder what the difference was?" And I received 92 of 94 claimed for resultid 135481414. I hope Astro gets more than 20 credits for his job, but it probably won't be 400+. |
Joined: 12 Dec 06 Posts: 45 Credit: 4,428,086 RAC: 0 |
sorry for the triple-post. I had some problems with my connection. |
Astro Joined: 2 Oct 05 Posts: 987 Credit: 500,253 RAC: 0 |
"sorry for the triple-post. I had some problems with my connection." Up to 461 credits now. LOL. Say, you do know that you can "edit" your posted messages as long as you do so within 60 min of the original post? You should see an "edit" box on each of your previous posts. You could (only if you wanna) delete everything and just put "deleted" or some other message into all but the intended one. At that point a nice moderator might come along and hide those extra posts. Anyway, just wanted you to know. Hope you enjoy the rest of the weekend. 32:49:08 CPU time, 99.494% complete with 00:09:57 remaining. [edit] Made a progress chart. Given the curve, I doubt it'll ever finish. [image] |
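
The two watchdog rules described in this thread can be sketched in a few lines of Python. This is an illustration only, not Rosetta's actual code: the function names, the constant names, and the way restarts are tracked are all assumptions. The 4x factor is taken from the stderr output hedera quoted, and the five-restart figure from Mod.Sense's post.

```python
# Illustrative sketch only; names and structure are hypothetical and are
# not taken from the Rosetta or BOINC source code.

WATCHDOG_FACTOR = 4                  # "Greater than 4X preferred time" in the quoted stderr
MAX_RESTARTS_WITHOUT_CHECKPOINT = 5  # figure given by Mod.Sense; exact rule is assumed


def exceeds_runtime_limit(cpu_seconds: float, run_time_pref_seconds: float) -> bool:
    """Return True when CPU time has passed 4x the run-time preference."""
    return cpu_seconds > WATCHDOG_FACTOR * run_time_pref_seconds


def stuck_in_restart_loop(restart_checkpoints: list) -> bool:
    """Return True when the last five restarts all resumed from the same point.

    restart_checkpoints holds, for each restart, a hypothetical id of the
    checkpoint the task resumed from.
    """
    recent = restart_checkpoints[-MAX_RESTARTS_WITHOUT_CHECKPOINT:]
    return len(recent) == MAX_RESTARTS_WITHOUT_CHECKPOINT and len(set(recent)) == 1


if __name__ == "__main__":
    # hedera's task: 45123.3 s of CPU time against a 10800 s (3 hour) preference.
    print(exceeds_runtime_limit(45123.3, 10800))
    # Five restarts from the same checkpoint versus restarts that made progress.
    print(stuck_in_restart_loop([0, 0, 0, 0, 0]))
    print(stuck_in_restart_loop([0, 0, 1, 1, 1]))
```

With a 10800 second preference the limit works out to 43200 seconds, which is why a task at 45123.3 seconds was ended.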
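
Astro's arithmetic on run-time preferences follows from the same 4x rule: an 8 hour preference allows up to 32 hours of CPU time, and a 12 hour preference up to 48. Below is a minimal sketch, assuming the watchdog limit is exactly four times the preference. The list of selectable values is hypothetical; it contains only the preferences mentioned in the thread (1, 2, 3, 4, 6, 8, and 12 hours, and 1 day), and the real settings page may offer others.

```python
# Illustrative sketch only; the preference list and function names are
# assumptions, not the project's actual settings or code.

WATCHDOG_FACTOR = 4
PREF_HOURS = [1, 2, 3, 4, 6, 8, 12, 24]  # values mentioned in the thread


def watchdog_limit_hours(pref_hours: float) -> float:
    """CPU hours a task may use before the watchdog ends it (assumed 4x rule)."""
    return WATCHDOG_FACTOR * pref_hours


def smallest_pref_for(needed_hours: float):
    """Smallest listed preference whose limit covers needed_hours, or None."""
    for pref in sorted(PREF_HOURS):
        if watchdog_limit_hours(pref) >= needed_hours:
            return pref
    return None


if __name__ == "__main__":
    for pref in (3, 6, 8, 12):
        print(pref, "hour pref ->", watchdog_limit_hours(pref), "hour limit")
    # A decoy that needs a little over 32 hours requires the 12 hour setting.
    print(smallest_pref_for(33))
```

This only models the ceiling; it says nothing about how long a given decoy will actually take, which the client cannot know until the first model completes.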
©2025 University of Washington
https://www.bakerlab.org