[climateprediction.net] Linux work *perhaps* coming up in quantity

StefanR5R
TAAT Member
Posts: 1661
Joined: Wed Sep 25, 2019 4:32 pm

Post by StefanR5R »

Of my 20 + 2 newest completed tasks, the 20 on Epyc all had a peak working set size of 4.8 GB (varies by just a few MB), and the 2 on the Xeon E3 had a peak working set size of 4.4 GB (varying by just a few MB as well).

Added in 12 minutes 35 seconds:
PS:
I watched the running tasks with htop for a little while, process table sorted by resident memory size. This confirms what has been said about the current OpenIFS models earlier: The memory footprint fluctuates during the run time, and peak memory usage is only reached during brief moments.

This fluctuation makes it difficult for the BOINC client to react in time and keep some of the work waiting when memory gets too tight. It may be a good idea to set the memory preference "when the computer is [not] in use, use at most [...] %" to a percentage which is not too high, i.e. to leave some margin.
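
As a rough way to pick that margin, the peak working set reported above can be turned into a task count. The following is only a minimal Python sketch and not part of BOINC: the function name, the 64 GB host, and the percentage settings are assumptions chosen for illustration, while 4.8 GB is the peak reported for the Epyc tasks. It deliberately sizes for the worst case in which all tasks reach their peak at the same moment.

```python
import math


def max_concurrent_tasks(total_ram_gb, usable_fraction, peak_task_gb, other_load_gb=0.0):
    """Number of tasks that fit if every task hits its peak working set at once.

    total_ram_gb    -- installed RAM (assumed value, adjust per host)
    usable_fraction -- the "use at most ... %" memory preference, as a fraction
    peak_task_gb    -- peak working set of one task (4.8 GB was seen on Epyc)
    other_load_gb   -- memory taken by other projects or the OS (assumed)
    """
    budget_gb = total_ram_gb * usable_fraction - other_load_gb
    if budget_gb <= 0:
        return 0
    return math.floor(budget_gb / peak_task_gb)


if __name__ == "__main__":
    # Hypothetical 64 GB host, with and without 10 GB of other load.
    for fraction in (0.70, 0.80, 0.85):
        print(fraction,
              max_concurrent_tasks(64, fraction, 4.8),
              max_concurrent_tasks(64, fraction, 4.8, other_load_gb=10))
```

Because the peaks are brief, running one or two tasks more than this estimate will often work, but then it is the operating system rather than BOINC that has to absorb a coincidence of peaks.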
crashtech
TAAT Member
Posts: 1544
Joined: Sun Sep 15, 2019 4:45 pm
Location: Idaho, USA

Post by crashtech »

Looks like the instance that successfully returned work was set to run 8 CPDN tasks and has 64 GB, though it's running 8 other tasks that use close to 1 GB each. So there was plenty of headroom in that instance. It's likely I was not diligent enough in allocating resources. Maybe I will try adding the others one at a time with reduced task numbers and see if I can get them to complete some work.
biodoc
TAAT Member
Posts: 1014
Joined: Sun Sep 15, 2019 3:22 pm
Location: Massachusetts, USA

Post by biodoc »

I've got 100 tasks validated and 7 tasks with errors so far. I was looking at the syslog on one computer with 4 errors early on and discovered a cron job that was spamming the seti server for work. Oops!

Post by StefanR5R »

In this week so far: 88 valid, 0 errors.
The available CPUs, RAM, disk, and heating demand of the flat would allow me to do much more, but my Internet uplink doesn't.

Post by biodoc »

When I started processing this latest batch, I set it up to run 6 tasks simultaneously on each 16-core Zen processor (2× 3950X, 2× 5950X). By default, I had one core feeding a GPU on each computer (FAH and MW). Eventually, I started seeing a few errors, so I checked clock speed (cat /proc/cpuinfo) and found the idle cores at 2.2 GHz as expected, but the OpenIFS cores were running at 4.5 GHz! I increased the tasks to 7 simultaneous and still saw the same clock speed. I decided to run Universe tasks with the formula Universe + OpenIFS + GPU support = 16 tasks and found the cores running universe tasks @3.4 GHz and the cores running OpenIFS dropped to 4.1 GHz. I have the PPT set to 110-115 on these computers. I have seen variability in clock speed before depending on the project and tasks I am processing, but I was a bit shocked at what I was seeing with the OpenIFS tasks.

Post by crashtech »

Hmm, interesting.

Post by crashtech »

I just found a somewhat opposite condition on the 2x 2690v4 PC. I'd been running 8 tasks of CPDN and 8 tasks of Milkyway, and CPU-X was reporting 1200 MHz(!). Loading the system with a further 8 threads of ODLK has allowed it to run at 3200 MHz. I suspect that I need to get into the BIOS and figure out how to lock it into max turbo.

Post by biodoc »

biodoc wrote: Fri Dec 23, 2022 8:15 am found the cores running universe tasks @3.4 GHz and the cores running OpenIFS dropped to 4.1 GHz.
I'm pretty sure I'm wrong on this observation. The 3.4 GHz must be part of the "snapshot". Maybe an average #?

mark@x32-linux4:~$ cat /proc/cpuinfo | grep 'cpu MHz'
cpu MHz         : 2200.000
cpu MHz         : 4158.032
cpu MHz         : 4147.321
cpu MHz         : 2200.000
cpu MHz         : 2200.000
cpu MHz         : 4147.419
cpu MHz         : 2200.000
cpu MHz         : 4147.474
cpu MHz         : 4147.445
cpu MHz         : 4126.589
cpu MHz         : 4147.501
cpu MHz         : 3400.000
cpu MHz         : 2200.000
cpu MHz         : 4147.581
cpu MHz         : 4147.600
cpu MHz         : 2200.000
cpu MHz         : 3591.799
cpu MHz         : 4147.729
cpu MHz         : 2200.000
cpu MHz         : 4147.770
cpu MHz         : 4147.785
cpu MHz         : 4132.831
cpu MHz         : 4147.828
cpu MHz         : 2200.000
cpu MHz         : 2200.000
cpu MHz         : 4135.679
cpu MHz         : 2200.000
cpu MHz         : 4147.893
cpu MHz         : 4147.912
cpu MHz         : 2200.000
cpu MHz         : 2200.000
cpu MHz         : 4147.985
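
A listing like this is easier to read once it is summarised. The following is a minimal Python sketch that parses the "cpu MHz" lines of /proc/cpuinfo text and separates threads sitting at the idle clock from the rest. The function names, the 2200 MHz idle value (taken from the listing above), and the tolerance are assumptions made for illustration. As far as I know, each "cpu MHz" value is a momentary reading per logical CPU, so an occasional in-between value such as 3400.000 may simply be a thread caught mid-transition.

```python
import re


def read_core_mhz(cpuinfo_text):
    """Return the 'cpu MHz' value of every logical CPU found in the text."""
    pattern = r"^cpu MHz\s*:\s*([0-9.]+)"
    return [float(m) for m in re.findall(pattern, cpuinfo_text, flags=re.MULTILINE)]


def summarize(mhz_values, idle_mhz=2200.0, tolerance_mhz=1.0):
    """Split threads into idle and busy; idle_mhz is an assumed idle clock."""
    busy = [v for v in mhz_values if abs(v - idle_mhz) > tolerance_mhz]
    return {
        "threads": len(mhz_values),
        "idle": len(mhz_values) - len(busy),
        "busy": len(busy),
        "busy_min_mhz": min(busy) if busy else None,
        "busy_max_mhz": max(busy) if busy else None,
    }


if __name__ == "__main__":
    # Three lines copied from the listing above; on a Linux host you could
    # instead pass in open("/proc/cpuinfo").read().
    sample = (
        "cpu MHz         : 2200.000\n"
        "cpu MHz         : 4158.032\n"
        "cpu MHz         : 3400.000\n"
    )
    print(summarize(read_core_mhz(sample)))
```

Running this a few times in a row, rather than once, gives a better idea of whether an odd value is persistent or just a momentary reading.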

Post by crashtech »

I thought I read that AMD has instituted a pretty fine grained clock gating scheme, perhaps that is just evidence of such?

Post by crashtech »

https://www.cpdn.org/result.php?resultid=22284162
Contains:

double free or corruption (out)
Might not be my computer this time, but I have reduced the number of running CPDN tasks on it by one.

Post by StefanR5R »

The upload server has been down for more than half a day.

Take care that BOINC's disk quota isn't reached. In theory, BOINC will stop tasks when that happens, but if this doesn't work as designed, tasks would error out.
On Dec 24 Dave Jackson wrote: Email sent but I do not see why Andy should sort this on Christmas day! it may well be the new year before it gets sorted.
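
While uploads are stuck, finished results pile up in the BOINC data directory, so it can help to watch how close that directory is to the configured quota. The following is a minimal Python sketch, not a BOINC tool: the helper names, the data directory path (a common location on Debian-style installs, but an assumption here), and the 100 GB quota are all hypothetical and need to be replaced with your own values.

```python
import os


def dir_size_bytes(path):
    """Total size of all regular files below path (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            try:
                if not os.path.islink(file_path):
                    total += os.path.getsize(file_path)
            except OSError:
                # File vanished or is unreadable; ignore it in this estimate.
                pass
    return total


def quota_used_fraction(path, quota_gb):
    """Fraction of a quota (in decimal GB) currently used by path."""
    return dir_size_bytes(path) / (quota_gb * 1e9)


if __name__ == "__main__":
    data_dir = "/var/lib/boinc-client"  # assumed location, adjust as needed
    quota_gb = 100                      # hypothetical disk quota
    if os.path.isdir(data_dir):
        print(f"{quota_used_fraction(data_dir, quota_gb):.0%} of quota used")
    else:
        print("data directory not found, set data_dir to your BOINC directory")
```

If the fraction approaches 100 %, suspending the project or reducing the number of running tasks before the quota is hit avoids relying on the client to stop work at the right moment.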

Post by biodoc »

Their new upload server is located at JASMIN, which is a cloud computing provider.

https://www.ceda.ac.uk/blog/ceda-and-ja ... -period-1/

From the link above:
"CEDA and JASMIN services should therefore be considered to be running “at-risk” from 17:00 on Friday 16th December to 9:00 on Tuesday 3rd January 2023, with little or no helpdesk service available during this period"
To summarise:
  • Friday 16th December 2022, 17:00: change freeze starts; limited helpdesk support available.
  • Friday 23rd December 2022: site closure at 15:00 GMT; services unsupported.
  • Tuesday 3rd January 2023: site reopens; normal service resumes.

Post by StefanR5R »

According to the CPDN forum, they managed to revive the upload server today… only to have it fail again shortly thereafter. (Again no ssh access possible for the project admin.) Next try happens tomorrow. If it stays up, their network may be bottlenecked for a while when all the backlogged files come in.

Post by biodoc »

Now they are saying there still may be a problem with the block storage device. "We are migrating VMs off that device while we investigate the problem. I'll let you know when we have migrated the VMs."

I had to google "block storage". :) https://aws.amazon.com/what-is/block-storage/

Post by StefanR5R »

Floppy disk drives are examples of block storage devices.
Perhaps they are migrating from 5.25" to 3.5" floppies now.
Glenn Carver wrote: The upload server has about 25Tb of storage,
OK, admittedly that'd be a *lot* of floppies.

Post by crashtech »

I wonder how many floppy discs in the world are still in shape to read/write. Maybe not enough!
Skillz
Site Admin
Posts: 1854
Joined: Sun Sep 15, 2019 3:03 pm

Post by Skillz »

crashtech wrote: Wed Jan 04, 2023 2:02 pm I wonder how many floppy discs in the world are still in shape to read/write. Maybe not enough!
I think you'd be surprised. A lot of governments and industries like the airline industry still use floppy discs today. They're extremely reliable and upgrading those systems to use modern technologies would cost a ton.

Post by crashtech »

Well, we only need about 18 million to get the job done.
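
For anyone who wants to check that figure, the arithmetic is short. The following is a minimal Python sketch under stated assumptions: a 3.5" high-density floppy holds 1,474,560 bytes (2880 sectors of 512 bytes), and the quoted "25Tb" is read as 25 terabytes, either decimal (10^12 bytes) or binary (2^40 bytes). The function name is made up for the example.

```python
FLOPPY_BYTES = 2880 * 512  # 3.5" HD floppy: 1,474,560 bytes


def floppies_needed(total_bytes, floppy_bytes=FLOPPY_BYTES):
    """Whole floppies needed to hold total_bytes (ceiling division)."""
    return -(-total_bytes // floppy_bytes)


if __name__ == "__main__":
    decimal_tb = 25 * 10**12  # 25 TB read as decimal terabytes
    binary_tb = 25 * 2**40    # 25 TB read as binary tebibytes
    print(f"decimal: {floppies_needed(decimal_tb):,}")
    print(f"binary:  {floppies_needed(binary_tb):,}")
```

The two readings come out at roughly 17 million and a bit under 19 million disks, so "about 18 million" sits between them.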

Post by biodoc »

I started seeing uploads earlier so hopefully that will continue.

Post by crashtech »

I saw some going, too.

Post by biodoc »

About half of my 340 tasks uploaded, but at about 11 PM EST the uploads stopped again.

Post by StefanR5R »

A proposed fix is still in the making.
On January 5 Glenn Carver wrote: Just out of a meeting with CPDN. They understand what the problem is. The upload server will be restarted but with reduced max http connections to keep it stable as best as possible. At some point very soon (today/tomorrow) they will move the upload server to a new RAID disk array (which is what's causing the problem). They may decide to do the move before restarting depending on how quickly the JASMIN cloud provider can give them the temporary space they need to move the files whilst setting up the new server.

Regarding JASMIN's capacity for no. of simultaneous connections, I'm told these OpenIFS batches are not the highest load CPDN have ever seen and JASMIN has plenty of capacity. It's just the underlying disk system that was the issue (it's not the only boinc project to suffer raid disk array issues).
My two active hosts were able to report 34 results yesterday and 5 results today. 57 results are waiting to be uploaded. (They may have made partial progress; I haven't checked in detail.) For context, if the upload server were available around the clock, my Internet link would allow me to upload a little less than 50 results per day.

Post by crashtech »

We're playing with a lot of data, guess that finds the weak links. First upload speeds, now server disk space, speed, or both.

Post by StefanR5R »

New ETA: Monday.

Based on their progress so far, I am not making any guess which Monday. :-|

Post by biodoc »

The upload server has been up for several hours now. Hopefully it's stable now.

Post by StefanR5R »

It's not.
xii5ku could upload some bits a few hours ago, but now 100% of the transfer attempts fail.
parsnip could upload, but has an increasing rate of failed transfers too now and therefore shut everything off again (due to disk space constraints).

I realize that the server has to catch up, but as poor as its performance is now, this will take a long while or may even never recover entirely.

Post by crashtech »

Never recover entirely? That's sad.

Post by biodoc »

Now the upload server is out of disk space. Ugh.

Post by StefanR5R »

StefanR5R wrote: Wed Jan 11, 2023 3:54 pm I realize that the server has to catch up, but as poor as its performance is now, this will take a long while or may even never recover entirely.
That's just my own pessimistic opinion, not fact, based on seeing server availability going down (and finally vanishing entirely) during the few hours yesterday when I was at my computers, combined with parsnip's similar experience.
biodoc wrote: Wed Jan 11, 2023 7:50 pm Now the upload server is out of disk space. Ugh.
Indeed. Unbelievable.

They have a rather well-defined workload: every task has the same number of files and the same file sizes, they know how many hosts are active and how many tasks are in progress, and the scientific team had a precise plan for how many tasks to get done in the current project and in which timeframe. Yet it looks like somebody continues to ignore all that and believes they can seriously cut corners in sizing the upload server.

Post by StefanR5R »

The upload server outage continues — sort of — for another few days. And yes, the server remains seriously undersized.
On January 13 David Wallom wrote: Hello All,

Brief update on status.

The upload server is back running and we are currently in the process of transferring ~24TB of built up project results from that system to the analysis datastores. This process is going to take ~5 days running 5 parallel streams (the files are all OpenIFS workunits).

I have asked Andy to restart uploads but to throttle to ensure that our total stored volume does keep decreasing, i.e. our upload rate doesn't exceed our transfer rate. As such we'll be slow for a while but will gradually increase the upload server bandwidth to you guys as we clear batches.

The issue was caused by an initial instability brought about because the system disks for the VMs that run the upload server and the data storage volumes are all actually hosted in the same physical data system. When the data volumes fill they affect the performance of the other disks as well.... This was exacerbated because they allowed us to create extremely large volumes that were really beyond the capability of the storage system so we have to move the data internally as well. Not an ideal solution and we've told JASMIN this.

Thank you for your understanding in what's been a difficult few days.

David
(source)
Typo fix: …difficult few weeks, not days.
On January 13 David Wallom wrote: Hi,

The current limit is 50 concurrent connections.

Cheers

David
(source)
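
For scale, the figures in the first update above (about 24 TB moved in about 5 days over 5 parallel streams) imply a sustained transfer rate that is easy to work out. The following is a minimal Python sketch; it assumes decimal terabytes, that the 5 days cover all 5 streams running in parallel, and an even split between streams. The function name is made up for the example.

```python
def required_rate_mb_s(total_tb, days, streams=1):
    """Return (aggregate, per-stream) sustained rate in MB/s.

    Assumes decimal units (1 TB = 1e12 bytes, 1 MB = 1e6 bytes) and that
    the load is spread evenly across the parallel streams.
    """
    seconds = days * 86_400
    aggregate = total_tb * 1e12 / seconds / 1e6
    return aggregate, aggregate / streams


if __name__ == "__main__":
    aggregate, per_stream = required_rate_mb_s(24, 5, streams=5)
    print(f"aggregate:  {aggregate:.1f} MB/s")
    print(f"per stream: {per_stream:.1f} MB/s")
```

Under those assumptions the backlog drains at roughly 56 MB/s in total. Incoming uploads have to be throttled below whatever the real drain rate is, otherwise the stored volume on the upload server would not shrink.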

A discussion of result reporting deadlines came up. In this context, it was reiterated that the whole set of current workunits needs to be done by the end of February.
On January 13 Glenn Carver wrote: We don't hit the first batch deadlines for 950 & 951 until 19th January. I think that'll be fine, if a little tight. I'd prefer to keep it like that so we know what tasks are never coming back sooner rather than later as this project must be finished by end Feb.
(source)