Problems and Technical Issues with Rosetta@home

Sid Celery
Joined: 11 Feb 08
Posts: 2117
Credit: 41,139,251
RAC: 16,277
Message 109193 - Posted: 26 Apr 2024, 12:27:26 UTC - in response to Message 109190.  
Last modified: 26 Apr 2024, 12:31:22 UTC

On my systems, yes but it's a little deceptive.
I have the 5800X at home which is my only PC now. I'm there half the week and do my daily stuff with Boinc in the gaps in the background. When I'm away it runs Boinc 100% which is what you're looking at.
When I'm away (eg right now) I have the i5-9600K at the place I stay. It's 100% Boinc when I'm at home or work, but when I'm using it at night it usually takes 12h30 to 13h to run a 12hr task. I'm fine with that.
And I've set up another PC at work with a different user name as part of my team - an old i7-4770 that does what it does and gets turned off each night.

Losing 20, 30mins per task is fine by me. Boinc is, by definition, an occasional background job making use of downtime, not hogging uptime.
Only losing seconds per task is mental. Losing 20/30/50mins per task shows they're working machines - just as it should be.

Other GPU tasks with other projects may be different, but I don't run them.
I understand the theoretical point. I just don't see any practical difference.
The practical difference is if a GPU application needs a full CPU core/thread for each running Task, and it has to share that core/thread with another Task being processed on the CPU, not only will the CPU processing times suffer, but the GPU output can tank massively.
I can't remember the actual numbers, but a GPU that can process a Task in 4 min with a full core/thread supporting it (if it needs it of course), may take 40min (or more) if it has to share that core/thread with another CPU heavy load.
Only doing 36 Tasks per day when 360 is possible is a pretty poor choice to make.
That is the practical difference.
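
As a rough illustration of the scale being described, the sketch below turns a per-Task run time into Tasks per day. It assumes a single GPU working on one Task at a time unless `concurrent_tasks` says otherwise, which the post does not state; the function name is only for illustration. Whatever the absolute counts, a tenfold longer run time means a tenfold drop in daily output.

```python
MINUTES_PER_DAY = 24 * 60


def tasks_per_day(minutes_per_task: float, concurrent_tasks: int = 1) -> float:
    """Tasks completed in 24 hours at a fixed per-Task run time."""
    return concurrent_tasks * MINUTES_PER_DAY / minutes_per_task


dedicated = tasks_per_day(4)    # GPU Task with a full core/thread supporting it
contended = tasks_per_day(40)   # same Task sharing that core/thread
print(dedicated, contended, dedicated / contended)
```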

Only in the self-contained context of that individual task.
If another task is also completing work over those same cores/threads in the same time, you have to add their processing together, not view them separately as, say, two badly running tasks.
Thousands of tasks only sounds bad, because they're 4mins v 40mins each.
I'd want to know the system-wide totals over 24hrs, with contention and without, to know if there were any losses at all.

Are you trying to tell me it's 90% losses? Because I don't believe that at all.
Is it rather single-digit %? I'd guess that's much nearer the mark.
If someone worked it out and it was 1% losses or less, because every second of the day is doing something, I wouldn't automatically disbelieve it before going through the workings.

In short, I think you're making a mountain out of a molehill.
I wouldn't take an entire core out of Boinc processing to give CPU support to a GPU task, because that's 12.5% of an 8C/T or 6.25% of a 16C/T machine and I wouldn't guess the losses from contention would be as high as that.
I'd let them all run - overcommit the CPU in your terms - and let the PC fight it out, knowing it'll do as much as it possibly can without me making assumptions about what it can or can't do that I'll never know in advance.
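
The trade-off argued over here can be put in numbers. A minimal sketch, assuming the contention loss is something you have measured yourself (the `contention_loss_pct` value below is hypothetical, not a figure from this thread): reserving one core/thread costs 100/N percent of an N-thread machine, so overcommitting only loses out if contention costs more than that.

```python
def reserve_cost_pct(threads: int, reserved: int = 1) -> float:
    """Share of CPU capacity taken away from BOINC by reserving cores/threads."""
    return 100.0 * reserved / threads


def overcommit_wins(threads: int, contention_loss_pct: float) -> bool:
    """True if letting everything run loses less than reserving one core/thread."""
    return contention_loss_pct < reserve_cost_pct(threads)


for n in (8, 16):
    print(n, reserve_cost_pct(n))   # the 12.5% and 6.25% quoted in the post

# Hypothetical measurement: 5% of daily output lost to contention on 16 threads.
print(overcommit_wins(16, contention_loss_pct=5.0))
```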

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,523,781
RAC: 8,309
Message 109194 - Posted: 26 Apr 2024, 18:54:04 UTC - in response to Message 109193.  

The screensaver of "ROSETTAVS_SAVE_ALL_OUT" WUs crashes every time on my Win11 machines...

kotenok2000
Joined: 22 Feb 11
Posts: 258
Credit: 483,503
RAC: 133
Message 109195 - Posted: 26 Apr 2024, 18:56:20 UTC - in response to Message 109194.  

It resolves the database path incorrectly.

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1673
Credit: 17,589,473
RAC: 22,408
Message 109196 - Posted: 26 Apr 2024, 23:08:03 UTC - in response to Message 109191.  
Last modified: 26 Apr 2024, 23:46:53 UTC

I have to tell you, I'm absolutely amazed that you think Boinc scheduling being wrong by 50% one way or 2-300% the other way for the bulk of the time a task is processing - and 100% of the time it's sitting waiting in the cache - is no kind of problem,
And i am absolutely amazed & astounded that you would attribute to me something that at no stage have i ever said or suggested.

Nowhere have i said it is not a problem.
What i have said is that it is not as big a problem as you make it out to be. What i have said is that it is not the root cause of the High Priority issues. It contributes to them, but it is not the cause.
How on earth do you turn "it is not as big a problem as you make it out to be" into "is no kind of a problem"?
Seriously? How on earth can you think that???

It is a problem for Scheduling.
But as i keep on repeating because you don't appear to be listening, it's not the cause of the High Priority issue. It's a contributing factor, but not the cause. The cause is the huge discrepancy between CPU time and Run time.



but losing the odd few seconds or minutes during processing is a big issue. (Talking about my PCs here).
Again, seriously??? Did you actually read what i posted there? I'll repost it.


Your two systems
...
A bigger gap between CPU time and Run time, but still not large. Which indicates the systems are getting some non-BOINC use, but not a lot.
How the hell does "but still not large" become "a big issue"? Seriously- how???

Please do not try putting words in my mouth or attributing to me things that i have not said in any way, shape or form.
Grant
Darwin NT

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1673
Credit: 17,589,473
RAC: 22,408
Message 109197 - Posted: 26 Apr 2024, 23:43:35 UTC - in response to Message 109192.  
Last modified: 26 Apr 2024, 23:49:26 UTC

First thing to say is I didn't appreciate Folding runs at a different (normal compared to low I assume) priority to Rosetta or other projects. I assumed they were all low priority.
But to hear folding runs at a higher priority - nominally 1 to 1 CPU to wallclock time - makes me think that's massively better than I thought.
Yes, Denis is particularly bad, Asteroids isn't great - but their tasks are very short so bygones - but Sidock looks pretty good by my standards in that context. If I was getting 1-to-1 for Folding on top of that, I'd be pretty happy.
On the proviso they all meet their respective deadlines.

This is all self-evident. But you've missed out where the problem is.
All the things you've pointed out are things you've chosen to do.
And from the outset we all understand that Boinc runs in the gaps when we're not fully utilising our computers, not ever 100% of the time.
And if you chose to do one thing you're prioritising that over Boinc.
Personally I insist on that because if I ever got bogged down in writing or viewing or whatever I'd consider that a big problem.
So if I <only> got the "losses" in task processing time that you later point me to, the first thing I'd think is I'm wasting my time having a computer because I'm not doing anything with it but donating it to distributed computing.
Frankly, I'm not that rich nor generous.
I like distributed computing, but not that much. If I didn't already have a computer for my own needs, I wouldn't buy one to run Boinc (or non-Boinc) tasks.
I hereby give up: as you've said above, how efficient your system is (ie how many Tasks are done each day) is of no importance. All that matters is not missing the deadlines.
It's obvious you don't understand what i'm saying, no matter how many ways i try to present it. And as the other posts you have quoted show, you misinterpret what is posted (i cannot for the life of me understand how "it is not as big a problem as you make it out to be" could be interpreted to mean it "is no kind of a problem", or how "but still not large" becomes "a big issue").

But for anyone else that's been reading these posts-

BOINC makes use of unused computing time.
Running other heavy CPU usage programmes isn't an issue, and if you set BOINC to recognise that there are other heavy usage processes running it won't have an adverse impact on your BOINC processing.
If you limit the number of cores/threads available to BOINC, you will maximise your BOINC processing. You will get the maximum possible amount of work done each day that your system is capable of, you won't have issues with deadlines (unless of course you have inappropriate cache settings), or Panic Mode or any of those types of issues.

So whether you have 2 cores/threads or 256, if you're running other CPU intensive Tasks then set your "Use at most 100 % of the CPUs" to an appropriate value.
With 2 cores/threads, one core/thread is 50% of the total, so set it to 50%. With 256 cores/threads one is about 0.5% (or 1% if it won't accept 0.5), and with 16 cores/threads about 7%, so lower the setting from 100% by that much for each core/thread you want to keep free. It's not hard to work out.
Then your Tasks will run for as long as they need to- ie Run time will match (or be damn close to) CPU time, not 1.5, 2 or 4 or more times longer than they need to.
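
Reading the percentages above as the share of the machine taken by one core/thread, the value to enter can be worked out as follows. This is a sketch only: the function name is made up, and how the BOINC client rounds a fractional result to whole cores/threads is not covered here, so check the number of running Tasks after changing the setting.

```python
def use_at_most_pct(total_threads: int, reserved_threads: int) -> float:
    """Percentage that leaves `reserved_threads` free for non-BOINC work."""
    if not 0 <= reserved_threads < total_threads:
        raise ValueError("reserve at least 0 and fewer than all threads")
    in_use = total_threads - reserved_threads
    return round(100.0 * in_use / total_threads, 2)


# Reserve one core/thread on machines of different sizes.
for total in (2, 16, 256):
    print(total, use_at_most_pct(total, 1))
```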

If you don't do much CPU heavy stuff with your system, then there's no need to reserve some cores/threads.
If you're doing considerable non-BOINC work, and how efficient your system is at doing BOINC work is of no importance at all (ie how many Tasks you actually process each day), along with the occasional missed deadline, then don't bother with reserving any cores/threads for non-BOINC work.
Grant
Darwin NT

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1673
Credit: 17,589,473
RAC: 22,408
Message 109198 - Posted: 26 Apr 2024, 23:47:29 UTC

And now i've done all that, i'll just wait for the next Rosetta server crash to occur.
Grant
Darwin NT

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1673
Credit: 17,589,473
RAC: 22,408
Message 109199 - Posted: 27 Apr 2024, 1:04:27 UTC
Last modified: 27 Apr 2024, 1:04:55 UTC

Oh, and in case no one had noticed, we now have a batch of Beta work that is running for 8 hours, and takes roughly 1GB of RAM per Task, the RosettaVS_ Tasks.
So those with large multicore/thread systems & low amounts of system RAM may have some issues if they get a full load of them.
Grant
Darwin NT

Link
Joined: 4 May 07
Posts: 356
Credit: 382,349
RAC: 0
Message 109200 - Posted: 27 Apr 2024, 9:43:19 UTC - in response to Message 109183.  

Given panic-mode means Boinc realises tasks can't be completed within deadline, preventing Panic mode occurring is the entire solution.
Eliminating the reason for the panic mode is the entire solution; everything else is a workaround, which might fail as soon as something changes (new WU type, new project, whatever) or even before.

The root cause reason for panic mode is holding too large an offline cache.
The root cause for the panic mode is a highly misconfigured client; too large a cache is just a small part of it.


This isn't a problem, because Adrian (in this case) said both projects are important to him.
Then he should configure BOINC properly so it can coexist with Folding without any issues. Currently it seems he doesn't really care whether BOINC works properly or not.

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1673
Credit: 17,589,473
RAC: 22,408
Message 109201 - Posted: 28 Apr 2024, 0:16:35 UTC - in response to Message 109199.  

Oh, and in case no one had noticed, we now have a batch of Beta work that is running for 8 hours, and takes roughly 1GB of RAM per Task, the RosettaVS_ Tasks.
So those with large multicore/thread systems & low amounts of system RAM may have some issues if they get a full load of them.
Getting a few of those Tasks using 2GB of RAM each.
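
A quick way to see whether a machine is at risk, assuming every BOINC thread ends up running one of these Tasks at the per-Task figures reported here (roughly 1GB, with some at 2GB). The thread count and RAM size in the example are hypothetical, and the sketch ignores memory used by the OS and other programs.

```python
def full_load_ram_gb(threads: int, gb_per_task: float) -> float:
    """RAM needed if every thread runs one Task of the given size."""
    return threads * gb_per_task


def fits(threads: int, gb_per_task: float, system_ram_gb: float) -> bool:
    """True if a full load of Tasks fits in system RAM."""
    return full_load_ram_gb(threads, gb_per_task) <= system_ram_gb


# Hypothetical machine: 32 threads, 32 GB of RAM.
print(fits(32, 1.0, 32))   # every Task at roughly 1GB
print(fits(32, 2.0, 32))   # every Task at 2GB
```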
Grant
Darwin NT

Jean-David Beyer
Joined: 2 Nov 05
Posts: 187
Credit: 6,367,770
RAC: 5,706
Message 109203 - Posted: 28 Apr 2024, 4:31:24 UTC - in response to Message 109199.  

in case no one had noticed, we now have a batch of Beta work that is running for 8 hours, and takes roughly 1GB of RAM per Task, the RosettaVS_ Tasks.


Mine look like this. (This is one of them.) Is it one of the ones to which you refer? RosettaVS_ Tasks If not, how are the ones to which you refer identified?

Application: Rosetta Beta 6.05
Name: 7a_hal_l_hal_7aa_391_d694_ce_0001_SAVE_ALL_OUT_2977935_67
State: Running
Received: Fri 26 Apr 2024 02:37:53 AM EDT
Report deadline: Mon 29 Apr 2024 02:37:53 AM EDT
Estimated computation size: 80,000 GFLOPs
CPU time: 05:15:37
CPU time since checkpoint: 00:17:21
Elapsed time: 05:19:11
Estimated time remaining: 02:44:47
Fraction done: 65.667%
Virtual memory size: 468.18 MB
Working set size: 364.18 MB
Directory: slots/11
Process ID: 2777585
Progress rate: 12.240% per hour
Executable: rosetta_beta_6.05_x86_64-pc-linux-gnu
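
The CPU time and Elapsed time fields in a listing like this give the CPU time to Run time ratio discussed earlier in the thread. A small sketch using the two values shown above; the helper names are illustrative only.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cpu_efficiency(cpu_time: str, elapsed_time: str) -> float:
    """Fraction of the elapsed (Run) time the Task actually spent on the CPU."""
    return hms_to_seconds(cpu_time) / hms_to_seconds(elapsed_time)


eff = cpu_efficiency("05:15:37", "05:19:11")
print(f"{eff:.1%}")
```

A value close to 100% means the Task is hardly waiting for the CPU; the further it falls, the more the Task is sharing its core/thread with other work.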

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1673
Credit: 17,589,473
RAC: 22,408
Message 109205 - Posted: 28 Apr 2024, 6:01:03 UTC - in response to Message 109203.  

Mine look like this. (This is one of them.) Is it one of the ones to which you refer? RosettaVS_ Tasks If not, how are the ones to which you refer identified?
Exactly the way i posted- they start with RosettaVS_
The one you posted starts with 7a_hal_l_hal_

Grant
Darwin NT

kotenok2000
Joined: 22 Feb 11
Posts: 258
Credit: 483,503
RAC: 133
Message 109206 - Posted: 28 Apr 2024, 9:46:30 UTC - in response to Message 109205.  

Mine look like this. (This is really one of them.)


Project Rosetta@home

Name RosettaVS_SAVE_ALL_OUT_NOJRAN_CHIP_8EHZ_fulldb_IGNORE_THE_REST_9GjqZI_5_4100_2977977_2_0

Application Rosetta Beta 6.05
Workunit name RosettaVS_SAVE_ALL_OUT_NOJRAN_CHIP_8EHZ_fulldb_IGNORE_THE_REST_9GjqZI_5_4100_2977977_2
State Running
Received 4/28/2024 8:42:31 AM
Report deadline 5/1/2024 8:42:30 AM
Estimated app speed 2.78 GFLOPs/sec
Estimated task size 80 000 GFLOPs
CPU time at last checkpoint 00:00:00
CPU time 02:43:13
Elapsed time 02:46:22
Estimated time remaining 06:19:37
Fraction done 20.911%
Virtual memory size 2 200.04 MB
Working set size 1 973.22 MB
Directory slots/4
Process ID 193635

Debug State: 2 - Scheduler: 2

Jean-David Beyer
Joined: 2 Nov 05
Posts: 187
Credit: 6,367,770
RAC: 5,706
Message 109207 - Posted: 28 Apr 2024, 12:06:35 UTC - in response to Message 109205.  

OK. I now have three of the RosettaVS_ Tasks and they are as you say.
Since I have 128 GBytes of RAM, I do not expect problems.

Application: Rosetta Beta 6.05
Name: RosettaVS_SAVE_ALL_OUT_NOJRAN_KCa2_homology_fulldb_IGNORE_THE_REST_vF8nFW_8_1999_2977959_2

Estimated computation size: 80,000 GFLOPs

Virtual memory size: 1.19 GB
Working set size: 1.03 GB

Progress rate: 10.440% per hour
Executable: rosetta_beta_6.05_x86_64-pc-linux-gnu




[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,523,781
RAC: 8,309
Message 109211 - Posted: 1 May 2024, 10:52:06 UTC - in response to Message 109207.  

The validation server is down...

mrchips
Joined: 11 Nov 09
Posts: 10
Credit: 14,603,528
RAC: 16,717
Message 109212 - Posted: 1 May 2024, 20:16:42 UTC

issues

State: All (3339) · In progress (163) · Validation pending (154) · Validation inconclusive (0) · Valid (2933)

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1673
Credit: 17,589,473
RAC: 22,408
Message 109214 - Posted: 2 May 2024, 0:28:14 UTC - in response to Message 109211.  

The validation server is down...
Not again...
At least the rest are still up (for now).

Yep, boinc-process is down again.
It wouldn't be a big ask to run a Cron job on a system remote from the servers to check if they're there & running or not, and send an email and text to someone to let them know if they've gone MIA...
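
A check of that kind could be as small as the sketch below, run from cron on a machine outside the project's network. Everything specific in it is an assumption for illustration: the URL to probe and the mail addresses are passed in by whoever sets it up, and a mail server is assumed to be listening on localhost. It only tests that a web page answers and only sends email (no text message); checking individual daemons such as the validator would need the project's own status information.

```python
import smtplib
import sys
import urllib.request
from email.message import EmailMessage


def is_up(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return True if the URL answers with an HTTP status below 400."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status < 400
    except OSError:  # covers URLError, HTTPError, timeouts and socket errors
        return False


def build_alert(url, sender, recipient):
    """Build the notification email for an unreachable URL."""
    msg = EmailMessage()
    msg["Subject"] = f"DOWN: {url}"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(f"No valid response from {url}.")
    return msg


def main(url, sender, recipient):
    if not is_up(url):
        # Assumes a local mail server; replace with your own relay.
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(build_alert(url, sender, recipient))


# Only runs when called with: URL SENDER RECIPIENT
if __name__ == "__main__" and len(sys.argv) == 4:
    main(*sys.argv[1:])
```

Saved under a hypothetical name such as check_server.py, it could be scheduled with a crontab line like `*/10 * * * * python3 check_server.py <url> <sender> <recipient>`.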


Looking at the hardware list, it is getting on (and the OS is 8 years old!).
Even a single socket mid-range CPU of the lower end EPYC systems could replace all of the existing systems, with not only significantly more performance, but all while using way, way, way less power.
Price wise they're a bargain for what they can do, but they're still not exactly cheap in absolute terms.
Grant
Darwin NT

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,523,781
RAC: 8,309
Message 109215 - Posted: 2 May 2024, 10:15:03 UTC - in response to Message 109214.  
Last modified: 2 May 2024, 10:16:12 UTC

Yep, boinc-process is down again.
It wouldn't be a big ask to run a Cron job on a system remote from the servers to check if they're there & running or not, and send an email and text to someone to let them know if they've gone MIA...


Insert, during the BOINC project server creation/configuration, a MANDATORY e-mail address to use for emergencies (daemon crash, problem with queues, etc)
But i think it needs to be done by the boinc developers...


Looking at the hardware list, it is getting on (and the OS is 8 years old!).

I also noticed that the OS and HW are old.
But another volunteer told me that maybe the server status page is out of date and the HW and OS have in fact been updated.
I don't think so.


P.S. Now, over 200k WUs pending validation!!

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,523,781
RAC: 8,309
Message 109221 - Posted: 2 May 2024, 18:46:59 UTC - in response to Message 109215.  

P.S. Now, over 200k WUs pending validation!!


Now 270k
And no news from admins

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1673
Credit: 17,589,473
RAC: 22,408
Message 109223 - Posted: 2 May 2024, 21:51:14 UTC

Server is still dead.
Grant
Darwin NT

Jean-David Beyer
Joined: 2 Nov 05
Posts: 187
Credit: 6,367,770
RAC: 5,706
Message 109224 - Posted: 3 May 2024, 0:53:43 UTC - in response to Message 109223.  

Server is still dead.

It seems mostly up for me.

top - 20:51:09 up 2 days, 12:17,  2 users,  load average: 13.33, 13.65, 13.72
Tasks: 474 total,  14 running, 460 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.9 us,  0.2 sy, 80.3 ni, 18.4 id,  0.0 wa,  0.2 hi,  0.0 si,  0.0 st
MiB Mem : 128074.1 total,  33544.1 free,   6219.7 used,  88310.2 buff/cache
MiB Swap:  15992.0 total,  15992.0 free,      0.0 used. 120200.2 avail Mem 

    PID    PPID USER      PR  NI S    RES  %MEM  %CPU  P     TIME+ COMMAND                                                                   
 469545    2039 boinc     39  19 R   1.4g   1.2  98.8 15 287:51.62 ../../projects/boinc.bakerlab.org_rosetta/rosetta_beta_6.05_x86_64-pc-li+ 
 504299    2039 boinc     39  19 R 444456   0.3  98.8  5  26:25.33 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-g+ 
 482867    2039 boinc     39  19 R 213072   0.2  98.6 13 208:50.81 ../../projects/einstein.phys.uwm.edu/einsteinbinary_BRP4G_1.33_x86_64-pc+ 
 504592    2039 boinc     39  19 R 212384   0.2  99.1  6  24:10.34 ../../projects/einstein.phys.uwm.edu/einsteinbinary_BRP4G_1.33_x86_64-pc+ 
   2039       1 boinc     30  10 S  73336   0.1   0.1  6  44900:08 /usr/bin/boinc   


©2024 University of Washington
https://www.bakerlab.org