Problems and Technical Issues with Rosetta@home

Message boards : Number crunching : Problems and Technical Issues with Rosetta@home

Rocco Moretti
Joined: 18 May 10
Posts: 66
Credit: 585,745
RAC: 0
Message 70834 - Posted: 31 Jul 2011, 23:53:24 UTC - in response to Message 70824.  

Oh yeah, once again someone said a while back that they would be monitoring the boards for discussions like this one. Once again the system fails and no one sees or says anything about it.


It's not really an issue about the system failing - it's simply that we don't (currently) have jobs that are ready to run right this minute. The running of the jobs on R@h is only one step in the process - it takes a while to figure out what sorts of jobs will give usable scientific results, to set up the jobs, test them to make sure they won't cause a huge failure rate, and then at the end of the runs to process the results to figure out what the next round should do. Usually we have enough things going on that the computational lull in one project will be covered by the compute phase of a different one. We just happen to have hit a point where none of the currently active projects is in an active compute phase. (And it doesn't help that we're maximally distant from both the previous and next CASP - as you've probably noticed, activity seems to ramp up before [mad rush to finalize improvements], during, and after [post-analysis] CASP.)

We're aware that the queue is empty - a message has been sent out on the appropriate internal mailing list. While we want to provide you with work units, we don't want to waste your time with scientifically pointless make-work. - It's somewhat trivial to re-run old jobs, but is that worth doing if no one is going to look at the results?

I hesitate to say this, as I don't want it to sound like we're chasing you away(*), but I'd agree with the implicit recommendation stated above to crunch other projects while we have this momentary lull. You can increase your stats on other projects secure in the knowledge that no one will gain on you with Rosetta@home. With any luck, we'll have new jobs for you early next week. (e.g. "We apologize for the inconvenience - Regular service should resume shortly.")

*) We really do appreciate your efforts. Having access to the computational resources of R@h allows us to do things we couldn't do otherwise. Frankly speaking, I was surprised how quickly and easily R@h handled my recent jobs. I would have monopolized our local computational resources, but R@h crunched through it like it was nothing. - It's prompted me to think about possible process improvement experiments that I probably wouldn't have otherwise considered due to the computational cost. (Unfortunately, it's in the very preliminary stages and nowhere near the point where I could actually launch any jobs.)
ID: 70834 · Rating: 0 · rate: Rate + / Rate - Report as offensive
googloo
Joined: 15 Sep 06
Posts: 135
Credit: 23,893,657
RAC: 30
Message 70836 - Posted: 1 Aug 2011, 2:53:22 UTC

Please give us a reason for not letting us know sooner. This behavior is rude and insulting to your volunteers. You are not showing any consideration, much less appreciation.
ID: 70836 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Michael G.R.
Joined: 11 Nov 05
Posts: 264
Credit: 11,247,510
RAC: 0
Message 70838 - Posted: 1 Aug 2011, 4:59:33 UTC

Calm down everybody. We're trying to cure diseases and make science progress. A little patience. The project is doing excellent science, and while communication could be better sometimes, they're allowed to miss a few days here and there. We might not see it on our end, but I'm sure things are pretty hectic on the other side of the screen...
ID: 70838 · Rating: 0 · rate: Rate + / Rate - Report as offensive
TPCBF
Joined: 29 Nov 10
Posts: 111
Credit: 5,930,751
RAC: 236
Message 70839 - Posted: 1 Aug 2011, 6:23:10 UTC - in response to Message 70838.  

Calm down everybody. We're trying to cure diseases and make science progress. A little patience. The project is doing excellent science, and while communication could be better sometimes, they're allowed to miss a few days here and there. We might not see it on our end, but I'm sure things are pretty hectic on the other side of the screen...
Sorry, but basic communication from the side of the researchers/sysadmins is an absolute must...
At least for me (and I am sure quite a few others will think similarly), the problem is not that there is an outage of WU's, whether knowingly (like this time) or by "accident" (as was the case the last few times since the change of the year), but that there is an absolute lack of communication from their side. They need the collaboration of the people running the WU's, but time and time again they don't seem to bother to keep those people informed. As I already said, a simple message on the home page or a quick note here in the forum, up front or within reasonable time, is all that it takes...

Ralf
ID: 70839 · Rating: 0 · rate: Rate + / Rate - Report as offensive
HiFiTubeGuy
Joined: 12 Jan 10
Posts: 22
Credit: 6,291,999
RAC: 0
Message 70840 - Posted: 1 Aug 2011, 6:49:31 UTC

Personally,
I don't believe the people at Rosetta owe me anything for the crunching I do (I'm not doing it as a favor to them), nor do I crunch to earn the most credits, or to compete.
I volunteer to crunch for the goal of curing / preventing disease, and I believe that the people at Rosetta volunteer their time and energy for the same goal. I think we're all on the same team here.
In times of temporary lack of WU's, or other problems with Rosetta, better communication would be nice, but it's no big deal to me if it takes a few days; it's no loss for me.
As long as there continues to be SOME communication; and so far there is, even if a little slow.

Just my personal feelings.

ID: 70840 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Sid Celery
Joined: 11 Feb 08
Posts: 2474
Credit: 46,506,558
RAC: 3,757
Message 70842 - Posted: 1 Aug 2011, 11:58:06 UTC - in response to Message 70834.  

While we want to provide you with work units, we don't want to waste your time with scientifically pointless make-work.

Quite right.

Sorry, but it absolutely doesn't look like there is any appreciation.

Get over yourself. Volunteering is 100% about giving and 0% about receiving, even appreciation. Needy much?

I switched to 24hour job runs for Rosetta a few days ago to eke out my remaining jobs & selected a back-up project set at low priority for just these eventualities. The only guarantee is there are no guarantees. All projects have downtime.
ID: 70842 · Rating: 0 · rate: Rate + / Rate - Report as offensive
muddocktor
Joined: 11 May 07
Posts: 17
Credit: 14,543,886
RAC: 0
Message 70843 - Posted: 1 Aug 2011, 13:56:16 UTC

That's all fine and dandy, Sid, but taking 5 minutes out of their day to post a message like Rocco Moretti posted before running out of work is not asking a whole lot and it would help keep people informed and they could make plans in advance to set up another project to switch to in case of times such as this. In my situation for example, I am at work on a drilling rig hundreds of miles away from my computers, which I presently have set to run 100% Rosetta. If I knew this was coming up I could have set them up to run another project too, in case something happened with Rosetta (like now). My machines still seem to be crunching work units for now, but when they run out the 5 systems at the house will not be doing any work, just burning my electricity for no return. And that could simply be minimized or eliminated by a little better communication from the Baker Labs. That is what gets people upset, not the point that they ran out of work.
ID: 70843 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Greg_BE
Joined: 30 May 06
Posts: 5770
Credit: 6,139,760
RAC: 1
Message 70844 - Posted: 1 Aug 2011, 14:01:16 UTC - in response to Message 70842.  

While we want to provide you with work units, we don't want to waste your time with scientifically pointless make-work.

Quite right.

Sorry, but it absolutely doesn't look like there is any appreciation.

Get over yourself. Volunteering is 100% about giving and 0% about receiving, even appreciation. Needy much?

I switched to 24hour job runs for Rosetta a few days ago to eke out my remaining jobs & selected a back-up project set at low priority for just these eventualities. The only guarantee is there are no guarantees. All projects have downtime.


For example, Milkyway went down due to their main A/C unit failing and having to get a portable system installed to keep their servers cool. They were completely offline for a few days if not a week I think it was. I run a total of 4 projects including R@H to keep my system busy.
ID: 70844 · Rating: 0 · rate: Rate + / Rate - Report as offensive
GarageFarm.net
Joined: 21 Apr 10
Posts: 19
Credit: 17,915,923
RAC: 0
Message 70845 - Posted: 1 Aug 2011, 17:12:03 UTC
Last modified: 1 Aug 2011, 17:16:42 UTC

The HDD in one of my computers died last night and I had to replace the whole system, no new numbers for this guy... :)

Just wondering, how will you redistribute lost jobs, like those from crashed disc on one of my machines?
ID: 70845 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Greg_BE
Joined: 30 May 06
Posts: 5770
Credit: 6,139,760
RAC: 1
Message 70846 - Posted: 1 Aug 2011, 19:27:06 UTC - in response to Message 70844.  

While we want to provide you with work units, we don't want to waste your time with scientifically pointless make-work.

Quite right.

Sorry, but it absolutely doesn't look like there is any appreciation.

Get over yourself. Volunteering is 100% about giving and 0% about receiving, even appreciation. Needy much?

I switched to 24hour job runs for Rosetta a few days ago to eke out my remaining jobs & selected a back-up project set at low priority for just these eventualities. The only guarantee is there are no guarantees. All projects have downtime.


For example, Milkyway went down due to their main A/C unit failing and having to get a portable system installed to keep their servers cool. They were completely offline for a few days if not a week I think it was. I run a total of 4 projects including R@H to keep my system busy.



Well now the R@H infection is spreading.
Milkyway is offline.
Poem is out of work.
So just Einstein is sending out work.
1/4 projects?!?!? weird!!
ID: 70846 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Tex1954
Joined: 3 Apr 11
Posts: 9
Credit: 3,394,752
RAC: 0
Message 70847 - Posted: 1 Aug 2011, 19:51:11 UTC - in response to Message 70846.  

Well now the R@H infection is spreading.
Milkyway is offline.
Poem is out of work.
So just Einstein is sending out work.
1/4 projects?!?!? weird!!


Good Grief, all projects have downtime at some point. Sometimes a "master task" is finishing up, sends out some stray WU's to complete, then it's all processed before the next "master task" generates a gazillion WU's for us.

I know I shouldn't be surprised at the impatience of folks since I see it everywhere in life, but can't help it. BOINC-SIMAP has been out of work forever and Orbit is off until they get more funding in who knows how long.

Sometimes projects just end! Look at Archived projects on Boincstats!!!

But, there never seems to be an end to complaining... and to those that do, I say "Join the other millions of us and do something else!"

Patience is required... these folks are researchers... can't rush good research!

And complaining about hardware and A/C failures... well hell, complain to the wall so the rest of us don't have to hear about it. I spent the night in a motel room when my apartment A/C went out the other day, big deal! Stuff happens! Get a life and be an adult!!

JMHO

Tex1954
ID: 70847 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Chilean
Joined: 16 Oct 05
Posts: 711
Credit: 26,694,507
RAC: 0
Message 70849 - Posted: 1 Aug 2011, 22:13:02 UTC - in response to Message 70845.  

The HDD in one of my computers died last night and I had to replace the whole system, no new numbers for this guy... :)

Just wondering, how will you redistribute lost jobs, like those from crashed disc on one of my machines?


Once your WUs are past their deadline, the same WU is sent to another host.
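
For anyone curious, the deadline rule above can be sketched roughly like this. This is only an illustration under my own assumptions, not the actual BOINC or Rosetta@home server code: the `Task` and `WorkUnit` classes, the `reissue_overdue` function, and the host names are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


@dataclass
class Task:
    """One copy of a work unit sent to a single host (hypothetical model)."""
    host: str
    deadline: datetime
    reported: bool = False


@dataclass
class WorkUnit:
    """A work unit plus every copy of it that has been sent out so far."""
    name: str
    tasks: List[Task] = field(default_factory=list)

    def needs_reissue(self, now: datetime) -> bool:
        # A result already came back, so nothing is lost.
        if any(t.reported for t in self.tasks):
            return False
        # Reissue only when at least one copy exists and all copies are overdue.
        return bool(self.tasks) and all(t.deadline < now for t in self.tasks)


def reissue_overdue(
    work_units: Iterable[WorkUnit],
    idle_hosts: Iterable[str],
    now: datetime,
    turnaround: timedelta,
) -> List[Tuple[str, str]]:
    """Send a fresh copy of each overdue work unit to the next idle host."""
    hosts = iter(idle_hosts)
    reissued = []
    for wu in work_units:
        if not wu.needs_reissue(now):
            continue
        host = next(hosts, None)
        if host is None:
            break  # no idle hosts left; remaining work units wait
        wu.tasks.append(Task(host=host, deadline=now + turnaround))
        reissued.append((wu.name, host))
    return reissued
```

In this sketch, a WU that was sitting on a dead disk simply never reports, so once its deadline passes it gets a new copy with a fresh deadline on some other host. Nothing has to be done on the crashed machine's side.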
ID: 70849 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Greg_BE
Joined: 30 May 06
Posts: 5770
Credit: 6,139,760
RAC: 1
Message 70852 - Posted: 2 Aug 2011, 3:03:11 UTC - in response to Message 70847.  

Well now the R@H infection is spreading.
Milkyway is offline.
Poem is out of work.
So just Einstein is sending out work.
1/4 projects?!?!? weird!!


Good Grief, all projects have downtime at some point. Sometimes a "master task" is finishing up, sends out some stray WU's to complete, then it's all processed before the next "master task" generates a gazillion WU's for us.

I know I shouldn't be surprised at the impatience of folks since I see it everywhere in life, but can't help it. BOINC-SIMAP has been out of work forever and Orbit is off until they get more funding in who knows how long.

Sometimes projects just end! Look at Archived projects on Boincstats!!!

But, there never seems to be an end to complaining... and to those that do, I say "Join the other millions of us and do something else!"

Patience is required... these folks are researchers... can't rush good research!

And complaining about hardware and A/C failures... well hell, complain to the wall so the rest of us don't have to hear about it. I spent the night in a motel room when my apartment A/C went out the other day, big deal! Stuff happens! Get a life and be an adult!!

JMHO

Tex1954



That was not a complaint, it was an observation, though it could be seen as a complaint. It was not intended that way. Just saying that out of 4 projects 3 of them went dark at the same time. That's just weird.
ID: 70852 · Rating: 0 · rate: Rate + / Rate - Report as offensive
GarageFarm.net
Joined: 21 Apr 10
Posts: 19
Credit: 17,915,923
RAC: 0
Message 70853 - Posted: 2 Aug 2011, 3:37:21 UTC - in response to Message 70852.  


That was not a complaint, it was an observation, though it could be seen as a complaint. It was not intended that way. Just saying that out of 4 projects 3 of them went dark at the same time. That's just weird.


End of the world is coming, possibly.

...rapture it was, right? :D
ID: 70853 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Sid Celery
Joined: 11 Feb 08
Posts: 2474
Credit: 46,506,558
RAC: 3,757
Message 70854 - Posted: 2 Aug 2011, 4:48:09 UTC - in response to Message 70843.  

That's all fine and dandy, Sid, but taking 5 minutes out of their day to post a message like Rocco Moretti posted before running out of work is not asking a whole lot and it would help keep people informed and they could make plans in advance to set up another project to switch to in case of times such as this. In my situation for example, I am at work on a drilling rig hundreds of miles away from my computers, which I presently have set to run 100% Rosetta. If I knew this was coming up I could have set them up to run another project too, in case something happened with Rosetta (like now). My machines still seem to be crunching work units for now, but when they run out the 5 systems at the house will not be doing any work, just burning my electricity for no return. And that could simply be minimized or eliminated by a little better communication from the Baker Labs. That is what gets people upset, not the point that they ran out of work.

I was going to appreciate your position given your circumstances, but then I had a look at how close you came to running out of work. Holy Moly! I think I just found out where all the work was getting sucked to. Is that a full 10 days you've got there on some pretty hefty machines? Maybe you slipped down to your last 7-8 days at worst, with a couple of jobs on one machine timing out?

I can't even believe you said anything at all. "All for one and I hope it's me..." would be a fine motto.

As it happens, some new work has slipped through this morning and I think we're both up to a full complement now. I'll be sticking to 24hr runs until the supply is more assured so I don't hog WUs.
ID: 70854 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Greg_BE
Joined: 30 May 06
Posts: 5770
Credit: 6,139,760
RAC: 1
Message 70857 - Posted: 2 Aug 2011, 12:05:55 UTC - in response to Message 70853.  


That was not a complaint, it was an observation, though it could be seen as a complaint. It was not intended that way. Just saying that out of 4 projects 3 of them went dark at the same time. That's just weird.


End of the world is coming, possibly.

...rapture it was, right? :D



mutters something about an improbability drive and a manically depressed robot while having dinner at the Restaurant at the End of the Universe.
ID: 70857 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Greg_BE
Joined: 30 May 06
Posts: 5770
Credit: 6,139,760
RAC: 1
Message 70858 - Posted: 2 Aug 2011, 12:13:07 UTC

As of 12:12 GMT there are 50,304 tasks ready to send!!
ID: 70858 · Rating: 0 · rate: Rate + / Rate - Report as offensive
googloo
Joined: 15 Sep 06
Posts: 135
Credit: 23,893,657
RAC: 30
Message 70859 - Posted: 2 Aug 2011, 17:30:51 UTC

Ready to send 3
ID: 70859 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Brett Collins
Joined: 13 Feb 11
Posts: 2
Credit: 147,888
RAC: 0
Message 70862 - Posted: 2 Aug 2011, 20:44:57 UTC

I do not understand what is causing these errors


Task ID   Work unit ID  Sent                     Time reported            Server state  Outcome       Client state   CPU time (sec)  Claimed credit  Granted credit
4.4E+08   4.01E+08      2 Aug 2011 12:03:34 UTC  2 Aug 2011 17:34:27 UTC  Over          Client error  Compute error  0.44            0               ---
4.4E+08   4.01E+08      2 Aug 2011 12:03:34 UTC  2 Aug 2011 17:34:27 UTC  Over          Client error  Compute error  0               0               ---
4.4E+08   4.01E+08      2 Aug 2011 12:03:34 UTC  2 Aug 2011 17:34:27 UTC  Over          Client error  Compute error  0.48            0               ---
4.4E+08   4.01E+08      2 Aug 2011 12:03:34 UTC  2 Aug 2011 17:34:27 UTC  Over          Client error  Downloading    0               0               ---
4.4E+08   4.01E+08      2 Aug 2011 12:03:34 UTC  2 Aug 2011 17:34:27 UTC  Over          Client error  Compute error  0               0               ---
4.4E+08   4.01E+08      2 Aug 2011 12:03:34 UTC  2 Aug 2011 17:34:27 UTC  Over          Client error  Downloading    0               0               ---
4.4E+08   4.01E+08      2 Aug 2011 12:03:34 UTC  2 Aug 2011 17:34:27 UTC  Over          Client error  Compute error  0.36            0               ---
4.4E+08   4.01E+08      2 Aug 2011 12:03:34 UTC  2 Aug 2011 17:34:27 UTC  Over          Client error  Downloading    0               0               ---
4.4E+08   4.01E+08      2 Aug 2011 11:44:35 UTC  2 Aug 2011 17:34:27 UTC  Over          Client error  Compute error  8.28            0.03            ---
4.39E+08  4.01E+08      2 Aug 2011 5:19:23 UTC   2 Aug 2011 7:07:14 UTC   Over          Client error  Compute error  3.96            0.01

Do you have any suggestions?
ID: 70862 · Rating: 0 · rate: Rate + / Rate - Report as offensive
dango
Joined: 22 Dec 08
Posts: 3
Credit: 75,820
RAC: 0
Message 70863 - Posted: 2 Aug 2011, 20:48:34 UTC

Join WCG! There is still work ;)
ID: 70863 · Rating: 0 · rate: Rate + / Rate - Report as offensive


©2025 University of Washington
https://www.bakerlab.org