Rosetta Python floods disks with snapshots

Message boards : Number crunching : Rosetta Python floods disks with snapshots

computezrmle

Joined: 9 Dec 11
Posts: 63
Credit: 9,680,103
RAC: 0
Message 104262 - Posted: 15 Jan 2022, 19:28:49 UTC

I don't run Rosetta's python (vbox) tasks.
Nonetheless, a couple of comments might be interesting for those who do.


Found this in the logfiles:
2022-01-15 21:08:35 (2464): Setting Memory Size for VM. (6144MB)

This means the VM allocates RAM up to 6144 MB.
RAM that is really in use becomes important looking at the next entries.
2022-01-15 21:21:13 (2464): Creating new snapshot for VM.
2022-01-15 21:21:19 (2464): Checkpoint completed.
2022-01-15 21:31:13 (2464): Creating new snapshot for VM.
2022-01-15 21:31:19 (2464): Deleting stale snapshot.
2022-01-15 21:31:20 (2464): Checkpoint completed.
2022-01-15 21:41:14 (2464): Creating new snapshot for VM.
2022-01-15 21:41:20 (2464): Deleting stale snapshot.
2022-01-15 21:41:21 (2464): Checkpoint completed.
.
.
.

This means vboxwrapper writes a snapshot to disk (to the snapshot directory below .../slots/n) every 10 minutes.
This snapshot includes an image of the RAM used by the VM at that moment.
Hence, the size could be small, but it could also be the 6144 MB mentioned above.
Your disks are happy about that, especially SSDs.
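To put a rough number on that, here is a minimal sketch of the arithmetic. It assumes every snapshot has the same fixed size and that the 10-minute interval holds all day; the 1024 MB figure is only an illustrative lower value, while 6144 MB is the cap from the log above. The "Deleting stale snapshot" entries suggest disk usage does not pile up, but each snapshot still has to be written.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_snapshot_writes_mb(snapshot_mb: float, interval_s: int) -> float:
    """Approximate MB written per day by one VM taking fixed-size snapshots."""
    snapshots_per_day = SECONDS_PER_DAY / interval_s
    return snapshots_per_day * snapshot_mb


if __name__ == "__main__":
    # 1024 MB is a hypothetical size; 6144 MB is the VM's memory cap.
    for size_mb in (1024, 6144):
        written = daily_snapshot_writes_mb(size_mb, 600)
        print(f"{size_mb} MB per snapshot -> {written / 1024:.0f} GB/day")
```

Under these assumptions a single task writes somewhere between about 144 GB and 864 GB per day. Real snapshots will vary in size, so treat this as an upper-bound style estimate rather than a measurement.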

It might be worth testing whether those snapshots are really required for Rosetta tasks.
If not, the project admins should add "<disable_automatic_checkpoints/>" to the vbox_job.xml delivered as part of the app.

Volunteers who want to test it should do the following steps:
1. Shut down BOINC
2. Insert "<dont_check_file_sizes>1</dont_check_file_sizes>" in cc_config.xml (remove it after the test)
3. Insert "<disable_automatic_checkpoints/>" in Rosetta's vbox_job.xml (I don't know which filename they use, but it is named in the soft link you find in the slots directory)
4. Start BOINC
5. Run a new Rosetta python task and check its stderr.txt as well as the corresponding VM's snapshot folder.
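Steps 2 and 3 above could be sketched roughly like this. It works on XML text rather than on real files, because the actual filename and location of Rosetta's vbox_job.xml are not known here. The helper name set_option and both sample documents are made up for illustration; it is assumed that the job file's root element is <vbox_job> and that the cc_config.xml setting lives under <options>.

```python
import xml.etree.ElementTree as ET


def set_option(xml_text, parent_tag, option_tag, value=None):
    """Return xml_text with <option_tag> present under <parent_tag>.

    parent_tag may be the root element itself or a direct child of it.
    Note: ElementTree drops XML comments when parsing.
    """
    root = ET.fromstring(xml_text)
    parent = root if root.tag == parent_tag else root.find(parent_tag)
    if parent is None:
        parent = ET.SubElement(root, parent_tag)
    option = parent.find(option_tag)
    if option is None:
        option = ET.SubElement(parent, option_tag)
    option.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical minimal file contents, for illustration only.
sample_job = "<vbox_job><memory_size_mb>6144</memory_size_mb></vbox_job>"
sample_cc = "<cc_config><options></options></cc_config>"

patched_job = set_option(sample_job, "vbox_job", "disable_automatic_checkpoints")
patched_cc = set_option(sample_cc, "options", "dont_check_file_sizes", "1")
print(patched_job)
print(patched_cc)
```

Editing the two files by hand is just as good for a one-off test. Either way, do it while BOINC is shut down and keep a backup of both files so the change can be reverted afterwards.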
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1670
Credit: 17,496,658
RAC: 24,495
Message 104267 - Posted: 15 Jan 2022, 21:18:19 UTC - in response to Message 104262.  

computezrmle wrote:
"This means the VM allocates RAM up to 6144 MB.
RAM that is really in use becomes important looking at the next entries."

RAM in use is generally only a GB or so.
The problem with having no snapshots: if BOINC has to restart at any time, all the work done up to that point on any Tasks not yet completed is lost.

As it is, the RAM requirement was meant to have been changed to 3GB. Looks like it might have reverted to its previous value, which would explain the drop in the number of Python Tasks being processed at any given time, even with the lack of Rosetta 4.20 work.
Grant
Darwin NT
computezrmle

Joined: 9 Dec 11
Posts: 63
Credit: 9,680,103
RAC: 0
Message 104270 - Posted: 15 Jan 2022, 21:42:12 UTC - in response to Message 104267.  

Might be worth checking this.

My experience with CMS from LHC@home:
- They use "<disable_automatic_checkpoints/>"
- I set my BOINC client's checkpoint interval to ~3200 s
- If I suspend/resume a task before that point it starts from scratch
- If I suspend/resume after that point it writes a snapshot and continues from there
.clair.

Joined: 2 Jan 07
Posts: 274
Credit: 26,399,595
RAC: 0
Message 104274 - Posted: 16 Jan 2022, 1:39:18 UTC

Looks like they checkpoint every ten minutes.
This line is from one of my valid work units.

2022-01-15 18:25:40 (3920): Setting checkpoint interval to 600 seconds. (Higher value of (Preference: 60 seconds) or (Vbox_job.xml: 600 seconds))

I have set the checkpoint interval to 3600 seconds [one hour] in BOINC Manager [I don't reboot etc. unless I have to].
I will have a play with this idea and see what I can break :)
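Going by the wording of that log line, vboxwrapper appears to pick the higher of the two values. A small sketch to check a stderr line against that reading follows; the regular expression is written to match the exact wording quoted above, and the function name is made up for illustration.

```python
import re

# Pattern for the "Setting checkpoint interval" line as quoted in this thread.
INTERVAL_LINE = re.compile(
    r"Setting checkpoint interval to (\d+) seconds\. "
    r"\(Higher value of \(Preference: (\d+) seconds\) "
    r"or \(Vbox_job\.xml: (\d+) seconds\)\)"
)


def check_interval_line(line):
    """Return (chosen_seconds, matches_max) for one stderr line."""
    match = INTERVAL_LINE.search(line)
    if match is None:
        raise ValueError("not a checkpoint interval line")
    chosen, preference, job_file = (int(g) for g in match.groups())
    return chosen, chosen == max(preference, job_file)


line = (
    "2022-01-15 18:25:40 (3920): Setting checkpoint interval to 600 seconds. "
    "(Higher value of (Preference: 60 seconds) or (Vbox_job.xml: 600 seconds))"
)
print(check_interval_line(line))  # -> (600, True)
```

If that reading is right, raising the preference above 600 seconds should stretch the snapshot interval, while lowering it below 600 seconds should change nothing.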
JAMES DORISIO

Joined: 25 Dec 05
Posts: 15
Credit: 200,782,615
RAC: 101,372
Message 104280 - Posted: 16 Jan 2022, 15:30:07 UTC

I have had the checkpoint interval set to 3600 seconds. Below is the stderr output from a python task:

2022-01-16 01:57:49 (9405): Setting checkpoint interval to 3600 seconds. (Higher value of (Preference: 3600 seconds) or (Vbox_job.xml: 600 seconds))

2022-01-16 02:58:03 (9405): Creating new snapshot for VM.
2022-01-16 02:58:12 (9405): Checkpoint completed.
2022-01-16 03:36:54 (9405): Status Report: Elapsed Time: '6000.749009'
2022-01-16 03:36:54 (9405): Status Report: CPU Time: '5950.310000'
2022-01-16 03:58:29 (9405): Creating new snapshot for VM.
2022-01-16 03:58:38 (9405): Deleting stale snapshot.
2022-01-16 03:58:38 (9405): Checkpoint completed.
2022-01-16 04:58:55 (9405): Creating new snapshot for VM.
2022-01-16 04:59:04 (9405): Deleting stale snapshot.
2022-01-16 04:59:04 (9405): Checkpoint completed.
2022-01-16 05:15:58 (9405): Status Report: Elapsed Time: '12001.543604'
2022-01-16 05:15:58 (9405): Status Report: CPU Time: '11930.110000'

It looks like it is using the higher value of 3600 seconds (1 hour).
I am going to try changing this to 7200 seconds to see what happens as these computers are on 24 hours a day and rarely reboot.
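For anyone who wants to check the spacing in their own stderr.txt, here is a minimal sketch that measures the time between "Creating new snapshot for VM." entries. It assumes every line starts with a timestamp in the format shown above; the function name is made up for illustration.

```python
from datetime import datetime


def snapshot_intervals(log_text):
    """Seconds between consecutive 'Creating new snapshot' log entries."""
    stamps = []
    for line in log_text.splitlines():
        line = line.strip()
        if "Creating new snapshot for VM." in line:
            # The first 19 characters hold 'YYYY-MM-DD HH:MM:SS'.
            stamps.append(datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S"))
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]


log = """
2022-01-16 02:58:03 (9405): Creating new snapshot for VM.
2022-01-16 02:58:12 (9405): Checkpoint completed.
2022-01-16 03:58:29 (9405): Creating new snapshot for VM.
2022-01-16 04:58:55 (9405): Creating new snapshot for VM.
"""
print(snapshot_intervals(log))  # -> [3626, 3626]
```

On the lines quoted above the gap is 3626 seconds each time, so the snapshots follow the 3600-second setting with a small drift.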

Jim
Jim1348

Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 104281 - Posted: 16 Jan 2022, 15:50:44 UTC - in response to Message 104262.  
Last modified: 16 Jan 2022, 16:39:05 UTC

Very nice. I have now set the write interval to 3600 seconds. Thanks.

EDIT: But I have to reboot this Ubuntu machine at least once a day to restart the "Vm job unmanageable" jobs, and would take a big hit in lost work.
So I will go back to 600 seconds and rely on my large write cache to protect my SSD. The writes by the pythons are much larger than the checkpoints it seems.
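The trade-off can be put into rough numbers. This is only a sketch: it assumes a restart throws away everything since the last snapshot, so the loss per task is at most one interval, and the figure of four running tasks is hypothetical.

```python
def worst_case_loss_s(interval_s, restarts_per_day=1, tasks=1):
    """Upper bound on seconds of work lost per day to restarts."""
    return interval_s * restarts_per_day * tasks


if __name__ == "__main__":
    # tasks=4 is a made-up example value, not taken from any host above.
    for interval in (600, 3600):
        lost = worst_case_loss_s(interval, restarts_per_day=1, tasks=4)
        print(f"{interval} s interval -> up to {lost // 60} min lost per day")
```

With one reboot a day and four tasks, that is up to 40 minutes of lost work at 600 seconds versus up to 240 minutes at 3600 seconds, set against six times as many snapshot writes at the shorter interval.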



©2024 University of Washington
https://www.bakerlab.org