Re: [climateprediction.net] Linux work *perhaps* coming up in quantity

Posted: Wed Dec 21, 2022 4:16 pm
by StefanR5R
Of my 20 + 2 newest completed tasks, the 20 on Epyc all had a peak working set size of 4.8 GB (varies by just a few MB), and the 2 on the Xeon E3 had a peak working set size of 4.4 GB (varying by just a few MB as well).

Added in 12 minutes 35 seconds:
PS:
I watched the running tasks with htop for a little while, process table sorted by resident memory size. This confirms what has been said about the current OpenIFS models earlier: The memory footprint fluctuates during the run time, and peak memory usage is only reached during brief moments.

This fluctuation makes it difficult for the boinc client to react in time, to keep some of the work waiting when memory gets too tight. It may be a good idea to set "when the computer is [not] in use, use at most [...] %" of the memory to a percentage which is not too high, i.e. leave some margin.
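
As a rough illustration of the margin idea above, here is a back-of-the-envelope sketch that estimates how many OpenIFS tasks fit under a given memory percentage. It is not something BOINC itself runs: the function name and its parameters are hypothetical, 1 GB is treated as 1000 MB for simplicity, and the 4.8 GB peak is just the figure reported earlier in this thread.

```python
def max_concurrent_tasks(total_ram_mb, usable_percent, peak_task_mb, other_load_mb=0):
    """Estimate how many tasks fit if every task hits its peak at the same time.

    total_ram_mb   -- installed RAM in MB
    usable_percent -- the "use at most ... %" memory preference (0-100)
    peak_task_mb   -- peak working set of one task in MB
    other_load_mb  -- memory already claimed by other work in MB (assumption)
    """
    if peak_task_mb <= 0:
        raise ValueError("peak_task_mb must be positive")
    # Integer arithmetic avoids floating-point rounding surprises.
    budget_mb = total_ram_mb * usable_percent // 100 - other_load_mb
    return max(0, budget_mb // peak_task_mb)


if __name__ == "__main__":
    # Hypothetical host: 64 GB RAM, 90 % limit, 8 GB used by other projects,
    # 4.8 GB peak per OpenIFS task.
    print(max_concurrent_tasks(64_000, 90, 4_800, 8_000))
```

For the hypothetical host in the example this gives 10 tasks. Sizing for the simultaneous-peak case is deliberately pessimistic, since the peaks are brief and rarely line up, but that pessimism is exactly the margin being suggested.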

Re: [climateprediction.net] Linux work *perhaps* coming up in quantity

Posted: Wed Dec 21, 2022 4:32 pm
by crashtech
Looks like the instance that successfully returned work was set to run 8 CPDN tasks and has 64 GB, though it's running 8 other tasks that use close to 1 GB each. So there was plenty of headroom in that instance. It's likely I was not diligent enough in allocating resources on the others; maybe I will try adding them back one at a time with reduced task numbers and see if I can get them to complete some work.

Re: [climateprediction.net] Linux work *perhaps* coming up in quantity

Posted: Thu Dec 22, 2022 5:04 pm
by biodoc
I've got 100 tasks validated and 7 tasks with errors so far. I was looking at the syslog on one computer with 4 errors early on and discovered a cron job that was spamming the seti server for work. Oops!

Re: [climateprediction.net] Linux work *perhaps* coming up in quantity

Posted: Fri Dec 23, 2022 12:49 am
by StefanR5R
In this week so far: 88 valid, 0 errors.
The available CPUs, RAM, disk, and heating demand of the flat would allow me to do much more, but my Internet uplink doesn't.

Re: [climateprediction.net] Linux work *perhaps* coming up in quantity

Posted: Fri Dec 23, 2022 8:15 am
by biodoc
When I started processing this latest batch, I set it up to run 6 tasks simultaneously on each 16-core Zen processor (2× 3950X, 2× 5950X). By default, I had one core feeding a GPU on each computer (FAH and MW). Eventually, I started seeing a few errors, so I checked the clock speed (cat /proc/cpuinfo) and found the idle cores at 2.2 GHz as expected, but the OpenIFS cores were running at 4.5 GHz! I increased the tasks to 7 simultaneous and still saw the same clock speed. I decided to run Universe tasks with the formula Universe + OpenIFS + GPU support = 16 tasks and found the cores running universe tasks @3.4 GHz and the cores running OpenIFS dropped to 4.1 GHz. I have the PPT set to 110-115 on these computers. I have seen variability in clock speed before depending on the project/tasks I am processing, but I guess I was a bit shocked at what I was seeing with the OpenIFS tasks.

Re: [climateprediction.net] Linux work *perhaps* coming up in quantity

Posted: Fri Dec 23, 2022 9:39 am
by crashtech
Hmm, interesting.

Re: [climateprediction.net] Linux work *perhaps* coming up in quantity

Posted: Fri Dec 23, 2022 10:10 am
by crashtech
I just found a somewhat opposite condition on the 2x 2690v4 PC. I'd been running 8 tasks of CPDN and 8 tasks of Milkyway, and CPU-X was reporting 1200 MHz(!). Loading the system with a further 8 threads of ODLK has allowed it to run at 3200 MHz. I'm led to suspect that I need to get into the BIOS and figure out how to lock it into max turbo.

Re: [climateprediction.net] Linux work *perhaps* coming up in quantity

Posted: Fri Dec 23, 2022 11:47 am
by biodoc
biodoc wrote: Fri Dec 23, 2022 8:15 am found the cores running universe tasks @3.4 GHz and the cores running OpenIFS dropped to 4.1 GHz.
I'm pretty sure I'm wrong on this observation. The 3.4 GHz must be part of the "snapshot". Maybe an average #?

Code:

mark@x32-linux4:~$ cat /proc/cpuinfo | grep 'cpu MHz'
cpu MHz         : 2200.000
cpu MHz         : 4158.032
cpu MHz         : 4147.321
cpu MHz         : 2200.000
cpu MHz         : 2200.000
cpu MHz         : 4147.419
cpu MHz         : 2200.000
cpu MHz         : 4147.474
cpu MHz         : 4147.445
cpu MHz         : 4126.589
cpu MHz         : 4147.501
cpu MHz         : 3400.000
cpu MHz         : 2200.000
cpu MHz         : 4147.581
cpu MHz         : 4147.600
cpu MHz         : 2200.000
cpu MHz         : 3591.799
cpu MHz         : 4147.729
cpu MHz         : 2200.000
cpu MHz         : 4147.770
cpu MHz         : 4147.785
cpu MHz         : 4132.831
cpu MHz         : 4147.828
cpu MHz         : 2200.000
cpu MHz         : 2200.000
cpu MHz         : 4135.679
cpu MHz         : 2200.000
cpu MHz         : 4147.893
cpu MHz         : 4147.912
cpu MHz         : 2200.000
cpu MHz         : 2200.000
cpu MHz         : 4147.985
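
A listing like the one above is easier to judge when summarised. The following is a minimal sketch that parses the `cpu MHz` lines and splits the logical CPUs into "busy" and "idle" groups. The function names and the 4000 MHz threshold are assumptions chosen for this example, and the sample text is just a few lines copied from the output above. Note that each value is a momentary snapshot, so isolated readings such as 3400.000 may simply be cores caught mid-transition.

```python
import re
from statistics import mean


def parse_cpu_mhz(cpuinfo_text):
    """Return the per-logical-CPU clock readings (MHz) found in /proc/cpuinfo text."""
    pattern = r"^cpu MHz\s*:\s*([0-9.]+)"
    return [float(m) for m in re.findall(pattern, cpuinfo_text, flags=re.MULTILINE)]


def summarize_clocks(clocks, busy_threshold_mhz=4000.0):
    """Split readings at an assumed threshold and report simple statistics."""
    busy = [c for c in clocks if c >= busy_threshold_mhz]
    idle = [c for c in clocks if c < busy_threshold_mhz]
    return {
        "logical_cpus": len(clocks),
        "busy": len(busy),
        "idle": len(idle),
        "busy_mean_mhz": mean(busy) if busy else None,
        "idle_mean_mhz": mean(idle) if idle else None,
    }


# A few lines taken from the listing above, used as sample input.
SAMPLE = """\
cpu MHz         : 2200.000
cpu MHz         : 4158.032
cpu MHz         : 4147.321
cpu MHz         : 3400.000
"""

if __name__ == "__main__":
    print(summarize_clocks(parse_cpu_mhz(SAMPLE)))
```

To run it against a live system, pass the contents of /proc/cpuinfo to `parse_cpu_mhz` instead of `SAMPLE`. Sampling several times and averaging would give a better picture than a single snapshot.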

Re: [climateprediction.net] Linux work *perhaps* coming up in quantity

Posted: Fri Dec 23, 2022 12:13 pm
by crashtech
I thought I read that AMD has instituted a pretty fine-grained clock gating scheme; perhaps that is just evidence of such?

Re: [climateprediction.net] Linux work *perhaps* coming up in quantity

Posted: Sat Dec 24, 2022 7:19 pm
by crashtech
https://www.cpdn.org/result.php?resultid=22284162
Contains:

Code:

double free or corruption (out)

Might not be my computer this time, but I have reduced the number of running CPDN tasks on it by 1.

Re: [climateprediction.net] Linux work *perhaps* coming up in quantity

Posted: Sun Dec 25, 2022 1:55 am
by StefanR5R
The upload server has been down for more than half a day.

Take care that boinc's disk quota isn't reached. In theory, boinc will stop tasks when it is. But if that doesn't work as designed, tasks would error out.
On Dec 24 Dave Jackson wrote: Email sent but I do not see why Andy should sort this on Christmas day! it may well be the new year before it gets sorted.
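
One way to keep an eye on the disk quota mentioned above, while finished results pile up waiting for the upload server, is to measure the data directory yourself. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, the quota value has to be copied by hand from your own disk preferences, and BOINC's internal accounting may differ somewhat from a plain sum of file sizes.

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of all regular files below path (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.islink(file_path):
                continue
            try:
                total += os.path.getsize(file_path)
            except OSError:
                # The file vanished between listing and stat; ignore it.
                pass
    return total


def quota_headroom(used_bytes, quota_bytes, warn_fraction=0.9):
    """Return (fraction_used, warn) where warn is True at or above warn_fraction."""
    if quota_bytes <= 0:
        raise ValueError("quota_bytes must be positive")
    fraction = used_bytes / quota_bytes
    return fraction, fraction >= warn_fraction
```

A typical call would be `quota_headroom(dir_size_bytes(data_dir), quota_bytes)`, where `data_dir` is your BOINC data directory (its location depends on the installation) and `quota_bytes` is whatever limit you have configured.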

Re: [climateprediction.net] Linux work *perhaps* coming up in quantity

Posted: Wed Dec 28, 2022 2:51 pm
by biodoc
Their new upload server is located at JASMIN, which is a cloud computing provider.

https://www.ceda.ac.uk/blog/ceda-and-ja ... -period-1/

From the link above:
"CEDA and JASMIN services should therefore be considered to be running “at-risk” from 17:00 on Friday 16th December to 9:00 on Tuesday 3rd January 2023, with little or no helpdesk service available during this period"
To summarise:
  • Friday 16th December 2022 17:00: change freeze starts; limited helpdesk support available.
  • Friday 23rd December 2022: site closure at 15:00 GMT; services unsupported.
  • Tuesday 3rd January 2023: site reopens; normal service resumes.

Re: [climateprediction.net] Linux work *perhaps* coming up in quantity

Posted: Tue Jan 03, 2023 7:00 pm
by StefanR5R
According to the CPDN forum, they managed to revive the upload server today… only to have it fail again shortly thereafter. (Again no ssh access possible for the project admin.) Next try happens tomorrow. If it stays up, their network may be bottlenecked for a while when all the backlogged files come in.

Re: [climateprediction.net] Linux work *perhaps* coming up in quantity

Posted: Wed Jan 04, 2023 8:10 am
by biodoc
Now they are saying there still may be a problem with the block storage device. "We are migrating VMs off that device while we investigate the problem. I'll let you know when we have migrated the VMs."

I had to google "block storage". :) https://aws.amazon.com/what-is/block-storage/

Re: [climateprediction.net] Linux work *perhaps* coming up in quantity

Posted: Wed Jan 04, 2023 1:57 pm
by StefanR5R
Floppy disk drives are examples of block storage devices.
Perhaps they are migrating from 5.25" to 3.5" floppies now.
Glenn Carver wrote: The upload server has about 25Tb of storage,
OK, admittedly that'd be a *lot* of floppies.

Re: [climateprediction.net] Linux work *perhaps* coming up in quantity

Posted: Wed Jan 04, 2023 2:02 pm
by crashtech
I wonder how many floppy discs in the world are still in shape to read/write. Maybe not enough!

Re: [climateprediction.net] Linux work *perhaps* coming up in quantity

Posted: Wed Jan 04, 2023 5:36 pm
by Skillz
crashtech wrote: Wed Jan 04, 2023 2:02 pm I wonder how many floppy discs in the world are still in shape to read/write. Maybe not enough!
I think you'd be surprised. A lot of governments and industries like the airline industry still use floppy discs today. They're extremely reliable and upgrading those systems to use modern technologies would cost a ton.

Re: [climateprediction.net] Linux work *perhaps* coming up in quantity

Posted: Wed Jan 04, 2023 7:20 pm
by crashtech
Well, we only need about 18 million to get the job done.
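
For anyone who wants to check the floppy count, here is a quick sketch. It assumes that the "25Tb" quoted earlier means 25 terabytes and that the disks are 3.5" high-density floppies with 1,474,560 bytes each; the function name is made up for this example.

```python
FLOPPY_BYTES = 1_474_560  # formatted capacity of a "1.44 MB" 3.5" HD floppy


def floppies_needed(storage_bytes, floppy_bytes=FLOPPY_BYTES):
    """Number of floppies needed to hold storage_bytes (ceiling division)."""
    return -(-storage_bytes // floppy_bytes)


if __name__ == "__main__":
    print("25 TB  (decimal):", floppies_needed(25 * 10**12))
    print("25 TiB (binary): ", floppies_needed(25 * 2**40))
```

The decimal reading comes out a little under 17 million and the binary reading a little under 19 million, so "about 18 million" sits comfortably in between.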

Re: [climateprediction.net] Linux work *perhaps* coming up in quantity

Posted: Wed Jan 04, 2023 7:28 pm
by biodoc
I started seeing uploads earlier so hopefully that will continue.

Re: [climateprediction.net] Linux work *perhaps* coming up in quantity

Posted: Wed Jan 04, 2023 9:38 pm
by crashtech
I saw some going, too.

Re: [climateprediction.net] Linux work *perhaps* coming up in quantity

Posted: Thu Jan 05, 2023 4:41 am
by biodoc
About half of my 340 tasks uploaded, but at about 11 PM EST the uploads stopped again.

Re: [climateprediction.net] Linux work *perhaps* coming up in quantity

Posted: Thu Jan 05, 2023 2:02 pm
by StefanR5R
A proposed fix is still in the making.
On January 5 Glenn Carver wrote: Just out of a meeting with CPDN. They understand what the problem is. The upload server will be restarted but with reduced max http connections to keep it stable as best as possible. At some point very soon (today/tomorrow) they will move the upload server to a new RAID disk array (which is what's causing the problem). They may decide to do the move before restarting depending on how quickly the JASMIN cloud provider can give them the temporary space they need to move the files whilst setting up the new server.

Regarding JASMIN's capacity for no. of simultaneous connections, I'm told these OpenIFS batches are not the highest load CPDN have ever seen and JASMIN has plenty of capacity. It's just the underlying disk system that was the issue (it's not the only boinc project to suffer raid disk array issues).
My two active hosts were able to report 34 results yesterday and 5 results today. 57 results are waiting to be uploaded. (They may have made partial progress, I haven't checked in detail.) For context, if the upload server were available 24 hours a day, my Internet link would allow me to upload a little less than 50 results per day.

Re: [climateprediction.net] Linux work *perhaps* coming up in quantity

Posted: Thu Jan 05, 2023 2:18 pm
by crashtech
We're playing with a lot of data, guess that finds the weak links. First upload speeds, now server disk space, speed, or both.

Re: [climateprediction.net] Linux work *perhaps* coming up in quantity

Posted: Fri Jan 06, 2023 11:35 am
by StefanR5R
New ETA: Monday.

Based on their progress so far, I am not making any guess as to which Monday. :-|

Re: [climateprediction.net] Linux work *perhaps* coming up in quantity

Posted: Wed Jan 11, 2023 3:28 pm
by biodoc
The upload server has been up for several hours. Hopefully it's stable now.

Re: [climateprediction.net] Linux work *perhaps* coming up in quantity

Posted: Wed Jan 11, 2023 3:54 pm
by StefanR5R
It's not.
xii5ku could upload some bits a few hours ago, but now 100% of the transfer attempts fail.
parsnip could upload, but has an increasing rate of failed transfers too now and therefore shut everything off again (due to disk space constraints).

I realize that the server has to catch up, but as poor as its performance is now, this will take a long while or may even never recover entirely.

Re: [climateprediction.net] Linux work *perhaps* coming up in quantity

Posted: Wed Jan 11, 2023 6:22 pm
by crashtech
Never recover entirely? That's sad.

Re: [climateprediction.net] Linux work *perhaps* coming up in quantity

Posted: Wed Jan 11, 2023 7:50 pm
by biodoc
Now the upload server is out of disk space. Ugh.

Re: [climateprediction.net] Linux work *perhaps* coming up in quantity

Posted: Thu Jan 12, 2023 12:42 am
by StefanR5R
StefanR5R wrote: Wed Jan 11, 2023 3:54 pm I realize that the server has to catch up, but as poor as its performance is now, this will take a long while or may even never recover entirely.
That's just my own pessimistic opinion, not fact, based on seeing server availability going down (and finally vanishing entirely) during the few hours yesterday when I was at my computers, combined with parsnip's similar experience.
biodoc wrote: Wed Jan 11, 2023 7:50 pm Now the upload server is out of disk space. Ugh.
Indeed. Unbelievable.

They have a rather well defined workload: Every task has the same number of files and the same file sizes, they know how many hosts are active and how many tasks are in progress, and the scientific team had a precise plan for how many tasks to get done in the current project, and in which timeframe. Yet it looks like somebody continues to ignore all that and believes they can seriously cut corners in the sizing of the upload server.

Re: [climateprediction.net] Linux work *perhaps* coming up in quantity

Posted: Fri Jan 13, 2023 1:08 pm
by StefanR5R
The upload server outage continues — sort of — for another few days. And yes, the server remains seriously undersized.
On January 13 David Wallom wrote: Hello All,

Brief update on status.

The upload server is back running and we are currently in the process of transferring ~24TB of built up project results from that system to the analysis datastores. This process is going to take ~5 days running 5 parallel streams (the files are all OpenIFS workunits).

I have asked Andy to restart uploads but to throttle to ensure that our total stored volume does keep decreasing, i.e. our upload rate doesn't exceed our transfer rate. As such we'll be slow for a while but will gradually increase the upload server bandwidth to you guys as we clear batches.

The issue was caused by an initial instability bought about because the system disks for the VMs that run the upload server and the data storage volumes are all actually hosted in the same physical data system. When the data volumes fill they affect the performance of the other disks as well.... This was exasperated because they allowed us to create extremely large volumes that were really beyond the capability of the storage system so we have to move the data internally as well. Not an idea solution and we've told JASMIN this.

Thank you for your understanding in whats been a difficult few days.

David
(source)
Typo fix: …difficult days → weeks
On January 13 David Wallom wrote: Hi,

The current limit is 50 concurrent connections.

Cheers

David
(source)
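
To put the quoted figures into perspective (~24 TB in ~5 days over 5 parallel streams), the sketch below works out the implied transfer rate. It assumes decimal units (1 TB = 10^12 bytes) and an even split between streams; the function name is hypothetical, and nothing here comes from CPDN's actual tooling.

```python
SECONDS_PER_DAY = 86_400


def implied_throughput_mb_s(volume_tb, days, streams):
    """Return (aggregate, per_stream) throughput in MB/s for moving volume_tb in days."""
    if days <= 0 or streams <= 0:
        raise ValueError("days and streams must be positive")
    total_bytes = volume_tb * 10**12
    aggregate = total_bytes / (days * SECONDS_PER_DAY) / 10**6
    return aggregate, aggregate / streams


if __name__ == "__main__":
    aggregate, per_stream = implied_throughput_mb_s(24, 5, 5)
    print(f"aggregate: {aggregate:.1f} MB/s, per stream: {per_stream:.1f} MB/s")
```

That works out to roughly 56 MB/s in total, or about 11 MB/s per stream. For the stored volume to keep decreasing, as the quoted message requires, the combined upload rate from all volunteers has to stay below that aggregate figure, which explains why uploads are throttled for now.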

A discussion of result reporting deadlines came up. In this context, it was reiterated that the whole set of the current workunits needs to get done by the end of February.
On January 13 Glenn Carver wrote: We don't hit the first batch deadlines for 950 & 951 until 19th January. I think that'll be fine, if a little tight. I'd prefer to keep it like that so we know what tasks are never coming back sooner rather than later as this project must finished end Feb.
(source)