Problems and Technical Issues with Rosetta@home

Author	Message
Falconet Joined: 9 Mar 09 Posts: 355 Credit: 1,669,337 RAC: 0 Message 104378 - Posted: 21 Jan 2022, 22:45:26 UTC - in response to Message 104376. Sid Celery posted something a few months ago that he received from Admin or someone like that who said that the Python job that had been submitted by one of the IPD researchers was "huge". It's not that big, it's only a few million tasks. I've seen the queue at 15 million. But maybe that was several projects at once. You are correct, but these 2 million tasks will take a long time to finish at the current rate because only 15,000 or so are running at any given point. That's why it's "huge".

Grant (SSSF) Joined: 28 Mar 20 Posts: 1935 Credit: 18,534,891 RAC: 0 Message 104379 - Posted: 21 Jan 2022, 23:00:30 UTC - in response to Message 104378. Sid Celery posted something a few months ago that he received from Admin or someone like that who said that the Python job that had been submitted by one of the IPD researchers was "huge". It's not that big, it's only a few million tasks. I've seen the queue at 15 million. But maybe that was several projects at once. You are correct, but these 2 million tasks will take a long time to finish at the current rate because only 15,000 or so are running at any given point. That's why it's "huge". Yep. Roughly 1 in 133 is being processed. Compare that to Rosetta 4.20 at its peak (20 million queued up, 400k in progress): 1 in 50 being processed. And given the huge issues with Python tasks, such as those that sit there not actually using any CPU time so they're not actually being processed, I'd suggest that the 1 in 133 value is in reality way, way, waaaay worse than that. Grant Darwin NT
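The "1 in 133" and "1 in 50" figures above are simply queued tasks divided by tasks in progress. A minimal sketch of that arithmetic, using only the round numbers quoted in the posts (the function name `one_in_n` is just an illustrative choice):

```python
def one_in_n(queued: int, in_progress: int) -> float:
    """Return N such that roughly 1 in N queued tasks is being processed."""
    return queued / in_progress


# Round figures quoted in the thread, not live server statistics.
python_ratio = one_in_n(2_000_000, 15_000)      # Python tasks
rosetta_ratio = one_in_n(20_000_000, 400_000)   # Rosetta 4.20 at its peak

print(f"Python tasks: about 1 in {python_ratio:.0f}")
print(f"Rosetta 4.20: about 1 in {rosetta_ratio:.0f}")
```

Tasks that sit at 0 CPU still count as "in progress", so the effective ratio for the Python tasks would be worse than this calculation suggests.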

Mr P Hucker Joined: 12 Aug 06 Posts: 1603 Credit: 13,015,132 RAC: 18 Message 104380 - Posted: 21 Jan 2022, 23:26:40 UTC - in response to Message 104378. You are correct but these 2 million tasks will take a long time to finish at the current rate because only 15,000 or so are running at any given point. That's why it's "huge" In the same way as this microwave is huge compared to that sofa, because I'm carrying it on my bicycle instead of the car.

Greg_BE Joined: 30 May 06 Posts: 5770 Credit: 6,139,760 RAC: 0 Message 104381 - Posted: 22 Jan 2022, 0:02:19 UTC - in response to Message 104371. Pythons for 12 hours? They average 2 hours here. As for only being able to run a couple at a time, you need a lot of RAM. I can run 5 but it won't do 6 in 16GB. No, not an individual run for 12 hours. After running a series of them continually. I didn't say anything about running only two. I usually run at least eight, and am presently running twenty on a Ryzen 3900X with 80 GB of memory. Holy cow! 80 gigs?!?! That's more than my budget can afford! It hurt enough to put in 32 on top of the 16 I put in about 4 years ago. I just can't see investing much more for being a volunteer. At best another 1080 or better, but that's it. I have a Ryzen with 64GB, but it's my main computer. Less than that is pitiful by today's standards. It will take 128GB. I have two BOINC-only machines with 36GB in them. I upped them just enough to run LHC. Once I get my new drive installed this weekend, I should be able to undo the restriction I have right now on Python tasks. With the current memory, I should be able to run a few more pythons plus all my other projects, or a full load of pythons (16), and have a little bit of memory left over.

Greg_BE Joined: 30 May 06 Posts: 5770 Credit: 6,139,760 RAC: 0 Message 104382 - Posted: 22 Jan 2022, 0:06:43 UTC - in response to Message 104379. Sid Celery posted something a few months ago that he received from Admin or someone like that who said that the Python job that had been submitted by one of the IPD researchers was "huge". It's not that big, it's only a few million tasks. I've seen the queue at 15 million. But maybe that was several projects at once. You are correct, but these 2 million tasks will take a long time to finish at the current rate because only 15,000 or so are running at any given point. That's why it's "huge". Yep. Roughly 1 in 133 is being processed. Compare that to Rosetta 4.20 at its peak (20 million queued up, 400k in progress): 1 in 50 being processed. And given the huge issues with Python tasks, such as those that sit there not actually using any CPU time so they're not actually being processed, I'd suggest that the 1 in 133 value is in reality way, way, waaaay worse than that. Well, you've seen the numbers. People come and try it out and leave. Others can't get it to work and leave. Without the staff taking notice or caring, the number of systems will trend downward or stay flat instead of going up. But again, they don't care about numbers, just as long as the work gets done eventually.

Greg_BE Joined: 30 May 06 Posts: 5770 Credit: 6,139,760 RAC: 0 Message 104383 - Posted: 22 Jan 2022, 0:10:59 UTC - in response to Message 104375. Most projects don't take nearly as much as the pythons or LHC of course. I like memory, but beyond 64 GB you have stability problems, since you have to use all four slots. Sometimes it works, but you often have to juggle memory around. You may have to spend more than you anticipated. Two slots is a lot safer. Everything works better with more memory, if you're not using it you get a massive disk cache. Using all four slots does not cause stability problems. Always test your new memory with memtest before use, even quality stuff has duds. I have 49 and change spread out over 4 slots. Everything works as it should. The new drive is 500 gigs and it will be dedicated to BOINC, so there is more than enough room for swap or whatever else BOINC wants to do.

Greg_BE Joined: 30 May 06 Posts: 5770 Credit: 6,139,760 RAC: 0 Message 104384 - Posted: 22 Jan 2022, 0:12:54 UTC Total queued jobs: 2,589,661. In progress: 53,882. Successes last 24h: 34,678. That's what the page says. Pretty small numbers against the 2 million.
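To put those server-status numbers in perspective, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that the 24-hour success count stays constant and that no new work is added to the queue; neither assumption comes from the project.

```python
# Figures copied from the server status page as quoted in the post above.
queued = 2_589_661
in_progress = 53_882
successes_per_day = 34_678

# Share of the queue that is being worked on right now.
ratio = queued / in_progress

# Days to drain the queue, assuming a constant completion rate and
# no newly submitted work (both are simplifying assumptions).
days_to_drain = queued / successes_per_day

print(f"About 1 in {ratio:.0f} queued jobs is in progress")
print(f"Roughly {days_to_drain:.0f} days to clear the queue at this rate")
```

With these inputs the queue works out to about two and a half months of work at the current completion rate.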

Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 Message 104385 - Posted: 22 Jan 2022, 3:22:59 UTC - in response to Message 104375. Everything works better with more memory, if you're not using it you get a massive disk cache. Using all four slots does not cause stability problems. Always test your new memory with memtest before use, even quality stuff has duds. You get a write cache only if you install one. PrimoCache is the only one that I know of for Windows, which I use to protect my SSDs. Memtest really doesn't have much to do with stability. It is mainly for errors, which might cause crashes, but more likely failures in work units. With large amounts of memory, especially the two-sided memory modules, you will see many more crashes using four slots. Check the forums.

Grant (SSSF) Joined: 28 Mar 20 Posts: 1935 Credit: 18,534,891 RAC: 0 Message 104386 - Posted: 22 Jan 2022, 3:57:28 UTC - in response to Message 104385. Last modified: 22 Jan 2022, 4:06:35 UTC You get a write cache only if you install one. PrimoCache is the only one that I know of for Windows, which I use to protect my SSDs. ? The default setting for Windows is write caching enabled. If you want to set its size (other than doing registry hacks), then you'd need a 3rd party one. With large amounts of memory, especially the two-sided memory modules, you will see many more crashes using four slots. Check the forums. I've had systems with only 2 slots used & memory problems. I've had systems with all slots used & no problems. While the more components there are, the greater the likelihood of failure, the biggest cause of issues with more than 2 modules is people pushing the RAM too hard. Yes, 2 modules allow you tighter timings and higher clocks. But as long as you use modules of the same brand & model, and don't push them beyond their rated clocks & timings, you won't have any issues. Look at server systems that may have 32 (or more) DIMM slots. Grant Darwin NT

Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 Message 104387 - Posted: 22 Jan 2022, 6:37:57 UTC - in response to Message 104386. Last modified: 22 Jan 2022, 6:53:17 UTC The default setting for Windows is write caching enabled. If you want to set its size (other than doing registry hacks), then you'd need a 3rd party one. That is just the cache on the disk drive itself, and is relatively small. These days, it is often just a faster section of the flash memory (e.g., two-level instead of four-level or more). Therefore, it is subject to the same wearout mechanism, just a bit more slowly. But using a portion of main memory as the write cache is much faster, and will protect the SSD from the high level of writes, such as on the pythons. And it can be very large. I usually use at least 8 GB. I posted on it in another topic. I've had systems with only 2 slots used & memory problems. I've had systems with all slots used & no problems. While the more components there are, the greater the likelihood of failure, the biggest cause of issues with more than 2 modules is people pushing the RAM too hard. Yes, 2 modules allow you tighter timings and higher clocks. But as long as you use modules of the same brand & model, and don't push them beyond their rated clocks & timings, you won't have any issues. Look at server systems that may have 32 (or more) DIMM slots. I have had much more experience. And the larger the CPU, the worse the problems. With two Ryzen 3900X and two Ryzen 3950X, I have seen them all. It saved me some grief with the 5900 series.

Grant (SSSF) Joined: 28 Mar 20 Posts: 1935 Credit: 18,534,891 RAC: 0 Message 104388 - Posted: 22 Jan 2022, 8:07:15 UTC - in response to Message 104387. The default setting for Windows is write caching enabled. If you want to set its size (other than doing registry hacks), then you'd need a 3rd party one. That is just the cache on the disk drive itself, and is relatively small. These days, it is often just a faster section of the flash memory (e.g., two-level instead of four-level or more). Therefore, it is subject to the same wearout mechanism, just a bit more slowly. But using a portion of main memory as the write cache is much faster, and will protect the SSD from the high level of writes, such as on the pythons. And it can be very large. I usually use at least 8 GB. I posted on it in another topic. Every article I've seen about the Win10 write caching says it is using system RAM to cache writes; it has nothing to do with the drive's own onboard buffering. Grant Darwin NT

Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 Message 104393 - Posted: 22 Jan 2022, 13:06:00 UTC - in response to Message 104388. Last modified: 22 Jan 2022, 13:11:14 UTC Every article I've seen about the Win10 write caching says it is using system RAM to cache writes; it has nothing to do with the drive's own onboard buffering. I think you are confusing that with read caches, but I will look. If it were caching writes, you would probably know it. If the cached writes were saved to disk, it would take a long time to shut down, for example. And the programs that show the writes to disk would indicate it. I don't see it. Read caches are easier to implement, but less necessary. They don't save the SSD from excessive writes. And the reads from SSDs are fast anyway, so the read caches are not all that necessary. EDIT: The only thing I see is this. https://www.windowscentral.com/how-manage-disk-write-caching-external-storage-windows-10 That is just disk write caching, as I previously discussed. It uses only a small amount of memory, not the GB that you need to protect the SSDs from the pythons. The write rates on the pythons are horrendous. I am getting well over 1 TB/day (almost 2 TB) when running 20 pythons, even with a huge 26 GB write cache. That is too much. I will do something else with this machine.

Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 Message 104394 - Posted: 22 Jan 2022, 14:31:13 UTC - in response to Message 104393. Last modified: 22 Jan 2022, 14:34:01 UTC By the way, I used to just put projects with high write rates on a ramdisk, and have all the writes go to main memory. That really solves the problem. But on the Ryzen 3900X with all the pythons, the BOINC data folder is 107 GB; too much. I might be able to pull it off on a Ryzen 3600 though; 12 virtual cores might work. But I think they really need to develop the pythons a bit and call back when they are ready.

gbayler Joined: 10 Apr 20 Posts: 14 Credit: 3,069,484 RAC: 0 Message 104400 - Posted: 22 Jan 2022, 18:09:09 UTC For the Linux users out there: I have written a Perl script, boinc_watchdog.pl, that checks for "0 CPU" tasks (tasks with very low CPU utilization that likely won't terminate) and whether there is at least one task executing. If it finds "0 CPU" tasks, it aborts them, and if there is not a single task executing, it restarts the boinc-client. I run it every 30 minutes as a cron job; for me, it works quite well. I am perfectly aware that this doesn't solve the root cause of the current problems; it is merely a workaround. Still, I think it is an improvement over having to manually abort tasks or restart the PC every other day. Here you can find it: https://github.com/gbayler/boinc_watchdog Hope that it is useful for someone else too! :) Günther
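The decision logic described in the post above can be sketched roughly as follows. This is not the linked Perl script and makes no claim about how it works internally; it is an illustrative Python sketch in which the `Task` fields, the `"EXECUTING"` state label, and both thresholds are assumptions chosen for the example. It only decides what should be aborted or restarted and does not talk to the BOINC client.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical snapshot of one task; field names are illustrative."""
    name: str
    state: str        # assumed label, e.g. "EXECUTING" or "SUSPENDED"
    elapsed_s: float  # wall-clock seconds since the task started
    cpu_s: float      # CPU seconds the task has consumed


def find_stalled(tasks, min_elapsed_s=1800.0, min_cpu_fraction=0.05):
    """Return names of running tasks that have used almost no CPU time."""
    stalled = []
    for t in tasks:
        # Skip tasks that are not running or are too young to judge.
        if t.state != "EXECUTING" or t.elapsed_s <= 0:
            continue
        if t.elapsed_s < min_elapsed_s:
            continue
        if t.cpu_s / t.elapsed_s < min_cpu_fraction:
            stalled.append(t.name)
    return stalled


def needs_client_restart(tasks):
    """True when not a single task is executing."""
    return not any(t.state == "EXECUTING" for t in tasks)


sample = [
    Task("python_a", "EXECUTING", 7200.0, 6900.0),  # healthy
    Task("python_b", "EXECUTING", 7200.0, 3.0),     # "0 CPU" task
    Task("python_c", "EXECUTING", 600.0, 0.0),      # too young to judge
]
print(find_stalled(sample))
print(needs_client_restart(sample))
```

A real watchdog would feed this from the client's task list and then abort the returned task names, for example from a cron job every 30 minutes as described in the post.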

Mr P Hucker Joined: 12 Aug 06 Posts: 1603 Credit: 13,015,132 RAC: 18 Message 104401 - Posted: 22 Jan 2022, 18:25:34 UTC - in response to Message 104383. Last modified: 22 Jan 2022, 18:27:59 UTC I have 49 and change spread out over 4 slots. Everything works as it should. The new drive is 500 gigs and it will be dedicated to BOINC, so there is more than enough room for swap or whatever else BOINC wants to do. Swap files are for poor people without enough RAM :-) If you don't have matched pairs of RAM, things can slow down. Dual channel is a great benefit for some things but not others. It depends on whether they're accessing the memory a lot. I changed my Ryzen to dual channel to make my game faster. It didn't help, but half the BOINC projects sped up a lot.

Mr P Hucker Joined: 12 Aug 06 Posts: 1603 Credit: 13,015,132 RAC: 18 Message 104402 - Posted: 22 Jan 2022, 18:30:46 UTC - in response to Message 104385. Last modified: 22 Jan 2022, 18:34:41 UTC Everything works better with more memory, if you're not using it you get a massive disk cache. Using all four slots does not cause stability problems. Always test your new memory with memtest before use, even quality stuff has duds. You get a write cache only if you install one. PrimoCache is the only one that I know of for Windows, which I use to protect my SSDs. AFAIK Windows has a write cache unless it's a removable drive. In fact I know it does, because I've copied a huge number of files from an SSD to a rotary drive, and the rotary drive kept being accessed long after things looked like they'd copied. Here's a cite: https://www.tenforums.com/tutorials/21904-enable-disable-disk-write-caching-windows-10-a.html Memtest really doesn't have much to do with stability. It is mainly for errors, which might cause crashes, but more likely failures in work units. Memtest has everything to do with stability. Every single time someone has come to me with a crashing computer, I've found dodgy memory using Memtest. With large amounts of memory, especially the two-sided memory modules, you will see many more crashes using four slots. Check the forums. Not in my experience. Must be dodgy memory. I can find nothing on Google suggesting 4 sticks cause problems.

Mr P Hucker Joined: 12 Aug 06 Posts: 1603 Credit: 13,015,132 RAC: 18 Message 104403 - Posted: 22 Jan 2022, 18:37:24 UTC - in response to Message 104393. The write rates on the pythons are horrendous. I am getting well over 1 TB/day (almost 2 TB) when running 20 pythons, even with a huge 26 GB write cache. That is too much. I will do something else with this machine. SSDs have a longer life than rotary drives nowadays. Look up the expected writes allowed for your SSD model and see how long the Pythons would take to wear it out. And caching the writes won't help anyway, since they have to be done at some point.
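The suggestion above (compare the drive's rated write endurance with the observed daily writes) is a one-line division. A minimal sketch follows; the 300 TBW rating is a hypothetical placeholder and not a figure from this thread, so substitute the value from your own drive's datasheet. The 1 and 2 TB/day rates are the ones reported earlier for 20 Python tasks.

```python
def days_until_rated_writes(rated_tbw: float, written_tb: float,
                            tb_per_day: float) -> float:
    """Days until total writes reach the drive's rated endurance (TBW)."""
    remaining_tb = max(rated_tbw - written_tb, 0.0)
    return remaining_tb / tb_per_day


RATED_TBW = 300.0  # hypothetical rating; look up your own SSD model

for rate in (1.0, 2.0):  # TB/day, the range reported in the thread
    days = days_until_rated_writes(RATED_TBW, 0.0, rate)
    print(f"{rate:.0f} TB/day -> rated writes reached in {days:.0f} days")
```

Reaching the rated figure does not mean the drive fails on that day; the rating is an endurance and warranty number, so treat the result as a rough planning estimate.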

Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 Message 104404 - Posted: 22 Jan 2022, 18:50:00 UTC - in response to Message 104403. SSDs have a longer life than rotary drives nowadays. Look up the expected writes allowed for your SSD model and see how long the Pythons would take to wear it out. And caching the writes won't help anyway, since they have to be done at some point. You can find out the hard way about SSD lifetimes. They usually don't publish the figures now, probably because they have been going down as the chip geometries shrink. The caching for science projects works differently than if you are copying a video file, which would all have to be transferred. But in a scientific algorithm, you usually read from a location, do a calculation, and then store the value back, either into the original location or a related one. Therefore, by storing the information in DRAM, most of the writes are done to the memory. You transfer to the SSD only the residual writes remaining at the end of the cache latency period. In fact, if you made the cache latency (write delay) long enough, you would never have to transfer any of the writes to the SSD. That is effectively what a ramdisk does, but it requires a lot more memory. You would have to store the entire BOINC data folder.
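The argument in the post above is that a delayed write-back cache coalesces repeated writes to the same location, so only the last value per location reaches the SSD at each flush. A toy simulation of that idea is sketched below. The block count, access pattern, and flush intervals are invented for illustration and say nothing about how PrimoCache or the Python tasks actually behave.

```python
import random


def ssd_writes(accesses, flush_every):
    """Count block writes that reach the SSD.

    accesses:    sequence of block ids written by the application,
                 one write per time step.
    flush_every: steps between cache flushes; 1 behaves like
                 write-through (every write goes to the SSD).
    """
    dirty = set()   # blocks modified since the last flush
    written = 0
    for step, block in enumerate(accesses, start=1):
        dirty.add(block)            # repeated writes coalesce here
        if step % flush_every == 0:
            written += len(dirty)   # one SSD write per dirty block
            dirty.clear()
    written += len(dirty)           # final flush at shutdown
    return written


rng = random.Random(0)
# Hypothetical workload: 10,000 writes spread over only 100 blocks.
accesses = [rng.randrange(100) for _ in range(10_000)]

no_cache = ssd_writes(accesses, flush_every=1)
cached = ssd_writes(accesses, flush_every=1000)
print(f"write-through: {no_cache} block writes")
print(f"write-back, long delay: {cached} block writes")
```

The longer the flush interval relative to how often the same blocks are rewritten, the fewer writes reach the SSD, which is the effect being described; a workload that never rewrites the same block (such as copying a video file) gains nothing.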

.clair. Joined: 2 Jan 07 Posts: 274 Credit: 26,399,595 RAC: 0 Message 104405 - Posted: 22 Jan 2022, 18:55:35 UTC - in response to Message 104394. Last modified: 22 Jan 2022, 19:52:33 UTC By the way, I used to just put projects with high write rates on a ramdisk, and have all the writes go to main memory. That really solves the problem. But on the Ryzen 3900X with all the pythons, the BOINC data folder is 107 GB; too much. I might be able to pull it off on a Ryzen 3600 though; 12 virtual cores might work. But I think they really need to develop the pythons a bit and call back when they are ready. Yes, some of the pythons need a kick in the compilers. Amazingly I have 31 python and 7 R4,2 tasks running ATM, and I have been through them to clear out two 0 cpu dud work units; it is a pain having to do that at least once a day. Rosetta is using 235GB of disk space, though the most I have seen was 266GB. RAM use right now is 59GB total system use, 71GB on `standby` and only 40MB `free` of 128GB fitted in 8 slots [crashes ?? wot crashes !! . . . . tic tic tic . . . BOOM :)]. SSD write bombardment by pythons: following an idea by [Greg I think] I have put in a 500GB SATA SSD, a Samsung 870 evo [£58 on ebay, new, still sealed]. I will see how long it lasts, though I haven't installed the additional "Samsung Magician" apps yet to keep an eye on the write rate, trim, garbage clean up etc. I installed only boinc on it, to speed up python work unit loading times. It looked like the fastest kid on the block in benchmarks at a low price; there is faster stuff out there at a high cost. I did look at M2 NVME drives, but getting them to work in win7 looks like a pain of magical incantations on the command line to load the drivers; win8.1 onwards has them in already [I checked MS forum]. OK, time to post this drivel on the forum and see what happens :)

Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 Message 104406 - Posted: 22 Jan 2022, 18:59:27 UTC - in response to Message 104405. Ram use right now is 59GB total system use 71GB on `standby` and only 40MB `free` of 128GB fitted in 8 slots [crashes ?? wot crashes !! . . . . tic tic tic . . . BOOM :), SSD write bombardment by pythons , following an idea by [Grant I think] I have put in a 500GB SSD Samsung 870 evo [£58 on ebay new still sealed] I will see how long it lasts , though I haven't installed the additional "Samsung Magician" apps yet to keep an eye on the write rate , trim, garbage clean up etc Good. I was hoping that someone would do some real-world tests. I don't want to do them myself.