[climateprediction.net] Linux work *perhaps* coming up in quantity

Post by **StefanR5R** » Thu Nov 17, 2022 4:22 pm

From thread "New work discussion" at the CPDN message board, message 66477 and later. This discussion branched off of the age-old question of whether the reporting deadlines shouldn't be shortened a little nowadays.

On November 16 SolarSyonyk wrote:My understanding of the CPDN tasks is that there are no "urgent tasks," just "We'd like some percentage of them back before we go forward with analysis." They just sweep the parameters across the range of interest for whatever parameters are being studied, and as long as you get a good percentage back, you have the data you need - if you have 100 tasks with 0.001 differences between parameters, you can still get the shape of the curve without getting every task back.

On November 16 Glenn Carver wrote:That was probably the case back in the early days when CPDN was running very large ensembles of climate length forecasts - I wasn't involved back then. That's not the case for the current projects which are looking at extreme weather, changes in weather patterns due to changing climate etc. Such forecasts are much shorter; weeks/months. In these cases, the experiments are designed requiring 100% return. Hence the need to make sure we rerun failures quicker.

The project I'm most closely associated with wanted to put out 125,000 OpenIFS tasks, but a quick calculation showed it would be nearly Christmas before we had 80% back. The scientist doing the work has a contract that finishes in Feb 2023 and needs analysis done by end of the year in order to write reports/papers/etc. So we've had to compromise and will probably only put out ~60,000. But we'll watch the return rate and if we think more can be done in time, they will get sent out.

Hope that gives a picture of how things look from the scientist's side.

On November 17 Glenn Carver wrote:Small batch of OpenIFS tasks about to go onto the test site next week. If all ok, expect 60,000 workunits (linux only) to appear soon after. I'm sure moderators will announce nearer the time. I understand this project will take priority so no Hadley model tasks for a while.

And from the New Work Announcements thread:

On November 16 Dave Jackson wrote:#941, the last of this current run [of HadSM4 tasks /StefanR5R], at least that I can see in the pipeline has been poured into the hopper. I hope someone has strong arms when it comes to the OpenIFS tasks.

Post by **crashtech** » Thu Nov 17, 2022 11:18 pm

Last time I ran CPDN, the tasks took weeks to complete. I'd like to start helping again if the new tasks are not so long.

Post by **StefanR5R** » Fri Nov 18, 2022 1:10 am

The current batch of "UK Met Office HadSM4 at N144 resolution" takes 2 days per task. (All tasks of this batch were sent to hosts already.) I don't recall how long OpenIFS work used to take, and whether or not suspend-to-disk works reliably with those. In general, there is a bit of a risk of CPDN tasks failing when they are resumed-from-disk.¹ You'll still have credit for such failed tasks up until the last trickle-up message, but the project needs to resend such tasks of course to get them done eventually.

¹) Edit, I happen to have scripts which I called cpdn_suspend_gently.sh and cpdn_resume_gently.sh, but am unclear whether or not they actually help. (After all, I don't know anything at all about the cause of the occasional failures at resumption.) I should polish these scripts a little and put them up at the CPDN message board for discussion eventually…

Post by **crashtech** » Fri Nov 18, 2022 11:53 am

I've set one host to ask for work. CPDN doesn't like to be bothered too often, they've set their server to ask for a 3636 second comm deferral.

Post by **mmonnin** » Fri Nov 18, 2022 6:49 pm

I usually get some tasks but only let a few run at once. Then I gradually release some to run as others finish. The 216 tasks took a while. Even on my 5950x the SM4 N144 tasks are taking 2.5 days.

Post by **StefanR5R** » Thu Nov 24, 2022 7:20 pm

The OpenIFS application is for Linux x86-64. So, unlike other CPDN applications, 32 bit libraries are not required. Tasks which are currently tested by the project take ~half a day on modern cores, up to 8 GB RAM, and are presumably cache-intensive.

Post by **StefanR5R** » Tue Nov 29, 2022 1:07 am

The first little batch of 1000 "OpenIFS 43r3 Perturbed Surface" tasks was released a few hours ago. All of these have been sent to hosts by now. (There is talk of 3000 tasks in the message board, but server status shows just 1000.)

On Zen 2 @ 2.9 GHz, no SMT, projected task duration is 2.6…3.5 days — oops, 16…18 hours, based on elapsed time / completion percentage. (The client's "estimated time remaining" is way off.) After these first few hours of run time, working set size is 3.1…4.1 GB (typically 3.6 GB).
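The "elapsed time / completion percentage" estimate above can be sketched as a one-line extrapolation. This is only a rough sketch, assuming progress is about linear over the task's run; the function name and the sample reading (4 hours elapsed at 24 % done) are made up for illustration and are not taken from the client.

```python
def projected_total_hours(elapsed_hours: float, fraction_done: float) -> float:
    """Extrapolate total run time linearly from elapsed time and progress."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_hours / fraction_done


if __name__ == "__main__":
    # Hypothetical reading: 4 h elapsed, 24 % complete.
    total = projected_total_hours(4.0, 0.24)
    print(f"projected total: {total:.1f} h, remaining: {total - 4.0:.1f} h")
```

The estimate is only as good as the linearity assumption; early in a task it can be noticeably off.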

Edit: Each of these tasks emits up to ~40 file uploads per each trickle-up message, with 14 MB per file. In comparison, HadSM4 tasks had about ~10 files or so but ~140 MB per file. — Edit 2: According to the message board, each task produces one of such 14 MB files every 7…11 minutes. Probably depends on the execution speed on the host. It'll supposedly be 123 output files per task. (That is 1.7 GB total upload per task if file size stays like this.)

I haven't done the math yet, but it looks like my own OpenIFS productivity will not be determined by core count or RAM capacity or heating demand of the flat, but by upload bandwidth.

Post by **biodoc** » Tue Nov 29, 2022 4:23 am

StefanR5R wrote: ↑Tue Nov 29, 2022 1:07 am The first little batch of 1000 "OpenIFS 43r3 Perturbed Surface" tasks was released a few hours ago.

I picked up 4 of these tasks on my dual ivy bridge.

Post by **StefanR5R** » Tue Nov 29, 2022 7:32 am

Glenn Carver wrote:There's another 2000 ready to go as soon as Andy gets to it. And then there's plenty more after, the scientist needs to run a minimum of ~42000.

(post 66625)

Post by **biodoc** » Tue Nov 29, 2022 8:37 am

I've picked up a few more on my other computers.

Post by **StefanR5R** » Tue Nov 29, 2022 2:47 pm

In a few hours I'll have the first tasks completed but still uploading for a while. I will then know the real rate of result data generation and can adjust my logical CPU count at this project, based on my Internet uplink speed. (Or uplink slowness, for a better fitting term.) I already reduced the CPUs a lot from what I had attached in the morning. :-(

On the other hand, I shall take into account that there will be 10 days soon during which I will compute at PrimeGrid only. That one has got some bigger result files now too, but there should still be ample room to get pending CPDN uploads out during these 10 days.

Added in 5 hours 6 minutes 18 seconds:
Notes to self: 16.5…17 h task duration, result files 0…121 = ~14.25 MB, final file 122 =~24.4 MB
total: 1.72 GB uploads per result, 105 MB/h results data rate per core (2.5 GB/day per core).

Let's say somebody had a modem of Alexander Graham Bell's original design, with 8 Mbit/s = 84 GByte/day uplink rate. That blessed person could run 33 OpenIFS tasks in parallel in steady state.
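The steady-state arithmetic above can be redone for any uplink. The sketch below is a rough estimate under stated assumptions: decimal units (1 GB = 1000 MB), no protocol overhead, and an uplink used for nothing else. The function names are made up for illustration. With these conventions the 8 Mbit/s example comes out at 34 tasks rather than 33, a difference that is down to unit conventions and rounding.

```python
def upload_gb_per_core_day(upload_gb_per_task: float, task_hours: float) -> float:
    """Result data one core produces per day when it runs tasks back to back."""
    return upload_gb_per_task / task_hours * 24.0


def sustainable_tasks(uplink_mbit_s: float,
                      upload_gb_per_task: float,
                      task_hours: float) -> int:
    """Number of concurrently running tasks whose uploads the uplink can absorb."""
    # Mbit/s -> MB/s -> MB/day -> GB/day (decimal units, no overhead assumed)
    uplink_gb_per_day = uplink_mbit_s / 8.0 * 86_400 / 1000.0
    per_core = upload_gb_per_core_day(upload_gb_per_task, task_hours)
    return int(uplink_gb_per_day // per_core)


if __name__ == "__main__":
    # Figures from the notes above: 1.72 GB per result, 16.5 h per task.
    print(f"{upload_gb_per_core_day(1.72, 16.5):.2f} GB/day per core")
    print(sustainable_tasks(8.0, 1.72, 16.5), "tasks on an 8 Mbit/s uplink")
```

In practice the usable number is lower, since transient upload errors and other traffic on the same line eat into the budget.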

Post by **crashtech** » Tue Nov 29, 2022 4:16 pm

I have only about 10Mbps upload at the shop, so I should be mindful of task number since the connection is not dedicated to just CPDN. Right now I have 22 tasks running.

Post by **mmonnin** » Wed Nov 30, 2022 6:21 pm

I picked up 6x of these. Initial ETA is 18 days. After 11:38 it is at 19% done so nowhere close to 18 days on a 1950x fully loaded CPU on all threads. Some other HadSM4 tasks also running. 3.7GB mem 4.2GB virtual.

Post by **StefanR5R** » Thu Dec 01, 2022 12:21 am

As far as I have seen, 1000 tasks came out late Monday, another 2000 tasks some time on Tuesday, and no other batch since. Therefore "tasks ready to send" went back to the usual 0 at some point yesterday. server_status.php shows runtimes of avg (min - max) of 16.4 (9 - 47) h, and of those, the avg and min are representative and the max a large outlier. — Edit, I bet this maximum came from a RAM starved host which was swapping to disk. — Runtime variability on a given host should be very low; the variability according to server_status.php is certainly caused simply by the variety of hosts which are attached.

If they really want to get >42000 done before the holidays, they'd better avoid any more gaps with 0 work availability, and they need a decent number of participants with good Internet uplinks.

Post by **biodoc** » Sat Dec 03, 2022 10:32 am

I know a dedicated boinc instance for CPDN makes the most sense but I used an app_config.xml file in the CPDN project directory to control the number of OpenIFS tasks running simultaneously on each computer. I started out with 8 tasks on my 16 core Zen3 and Zen4 computers each with 64 GB of RAM and eventually reduced that to 6 tasks to be absolutely sure there was enough RAM to support 6 tasks running simultaneously. My experience with my old dual Ivy bridge was poor at best. I ran 10 tasks simultaneously (1 task for each real core). This computer has 96 GB of ECC RAM so I assumed that would be enough. My internet connection is rated at 110 Mbits/sec down and upload so that should be enough to handle the trickle and final uploads of 34-42 total tasks running simultaneously. As posted by @StefanR5R in the CPDN forum, rebooting the computer is a bad idea since all tasks running at the time will eventually fail. I lost 10 tasks on my dual ivy testing this. Another observation made by @StefanR5R was that enabling "leave nonGPU tasks in memory while suspended" in boinc options is a necessity for successfully completing tasks that have been temporarily suspended for any reason.
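For readers who have not used app_config.xml before, here is a minimal sketch that generates such a file with a cap on concurrently running tasks. The `<app_config>`, `<app>`, `<name>` and `<max_concurrent>` elements are standard BOINC client configuration. The application short name `oifs_43r3_ps` is an assumption and should be checked against the host's client_state.xml or the project's applications page. The helper name `build_app_config` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, max_concurrent: int) -> str:
    """Return app_config.xml text limiting one app to max_concurrent running tasks."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    ET.indent(root)  # pretty-print; requires Python 3.9 or newer
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # "oifs_43r3_ps" is an assumed app name; 6 is the limit settled on above.
    print(build_app_config("oifs_43r3_ps", 6))
```

The printed text would be saved as app_config.xml in the project's directory under the BOINC data directory, as described in the post.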

Summary of OpenIFS results by computer:

3950X, 64 GB RAM, mint 20.3. MW running on 4 instances (Radeon VII)
33 tasks completed successfully, no errors.
This was the only computer at the start I had with "leave nonGPU tasks in memory while suspended" enabled in boinc options.

3950X, 64 GB RAM, mint 20.3. F@H running on nvidia GPU.
32 successful tasks, 1 error (suspended and restarted without "leave nonGPU tasks in memory while suspended" enabled in boinc options).

5950X, 64 GB ECC RAM, mint 21. F@H running on nvidia GPU.
29 successful tasks, 2 errors. One error was suspended and restarted without "leave nonGPU tasks in memory while suspended" enabled in boinc options. Another error was 194 (0x000000C2) EXIT_ABORTED_BY_CLIENT. It's not clear what the problem was in the stderr output.

5950X, 64 GB ECC RAM, kubuntu 20.04. F@H running on nvidia GPU.
41 successful tasks. 7 errors.
4 "double free or corruption (out)" at end of stderr output.
2 "free(): invalid pointer" at end of stderr output.
1 "194 (0x000000C2) EXIT_ABORTED_BY_CLIENT"

Note: On the 3 computers with ECC RAM, I never recorded any correctable or uncorrectable errors.

This computer is my main computer so I have other crap running including chrome, boinctasks-js, discord and boinc manager from time to time. 8 simultaneous tasks were not sustainable due to RAM limits (90% available to boinc), 7 simultaneous tasks were borderline and 6 seemed about right. Also, FAH core 22 tasks reserve 2.5 GB of system RAM. The other Zens are headless dedicated DC rigs.

The dual ivy was a comedy of errors mostly my fault: 6 completed tasks, 16 errors (10 errors due to system reboot and 2 due to "leave nonGPU tasks in memory while suspended" disabled in boinc options).

Post by **crashtech** » Sat Dec 03, 2022 11:59 am

Long run times and the inability to survive a reboot is a recipe for failure. I keep wanting to like this project, but they make it tough. With the high RAM and network requirements, it would be really great if multithreading could be implemented.

Post by **StefanR5R** » Sat Dec 03, 2022 1:44 pm

For what it's worth, here is a snapshot of the RAM usage on my computers:
26x oifs + 38x llrSGS = 94 GB RAM in use
29x oifs + 35x llrSGS = 110 GB
53x oifs + 11x llrSGS = 184 GB
llrSGS's RAM footprint is actually negligible. There is nothing else running on these computers, e.g. no desktop environment, no GPU jobs. That is, the average RAM footprint of oifs was ~3.6 GB at the time when I looked, and the largest currently running task had 4.3 GB working set size.

(As mentioned, that's many more simultaneously running OpenIFS tasks than my Internet uplink can sustain. But the PrimeGrid challenge is afoot, during which CPUs will be taken off of CPDN and opportunity exists to clear pending uploads.)

A nice aspect of concentrating the workload on few wide computers instead of more but narrower computers is that RAM can be provisioned according to average demand, rather than according to peak demand. BTW, when I watch "top", there is several GB difference in used RAM at each display update of top.
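The average footprint quoted above, and the difference between provisioning for average versus peak demand, can be checked with a few lines. This is only a sketch: it attributes all RAM in use to oifs (llrSGS being negligible, as stated), ignores the OS, and uses a hypothetical 256 GB host for the comparison. Provisioning by average is only reasonable on wide hosts where many tasks are unlikely to peak at once.

```python
# (running oifs tasks, GB RAM in use) from the snapshot above
SNAPSHOTS = [(26, 94), (29, 110), (53, 184)]

AVG_GB = 3.6   # average footprint reported above
PEAK_GB = 4.3  # largest working set reported above


def tasks_that_fit(ram_gb: float, per_task_gb: float) -> int:
    """Whole number of tasks that fit into ram_gb at per_task_gb each."""
    return int(ram_gb // per_task_gb)


if __name__ == "__main__":
    for tasks, ram in SNAPSHOTS:
        print(f"{tasks} tasks, {ram} GB -> {ram / tasks:.2f} GB per task")
    total_tasks = sum(t for t, _ in SNAPSHOTS)
    total_ram = sum(r for _, r in SNAPSHOTS)
    print(f"overall: {total_ram / total_tasks:.2f} GB per task")

    ram_gb = 256  # hypothetical host size, not one of the hosts above
    print("by peak:   ", tasks_that_fit(ram_gb, PEAK_GB))
    print("by average:", tasks_that_fit(ram_gb, AVG_GB))
```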

The OpenIFS model is a multiphysics model, and its separate sub-models are not solved all at once. I read that even the width of the time steps can differ between the sub-models; notably, that the radiation sub-model can be solved with coarser time steps than the rest. As it happens, maximum RAM usage supposedly occurs when the radiation code runs.

This is where the grandfather of this code runs:
https://www.ecmwf.int/en/computing/our- ... r-facility

Added in 7 minutes 51 seconds:

crashtech wrote: ↑Sat Dec 03, 2022 11:59 am Long run times and the inability to survive a reboot is a recipe for failure. I keep wanting to like this project, but they make it tough. With the high RAM and network requirements, it would be really great if multithreading could be implemented.

They will use multithreading in order to be able to increase the resolution of the models. Finer resolution of course means more CPU time per time step, and bigger RAM footprint. I don't know what the consequences for result data rate will be; at least the CPDN operators are well aware that many contributors have very limited bandwidth at their disposal.

The HadAM/ HadSM models (i.e., UK Met Office code) already had a certain failure rate when they were suspended to disk/ resumed from disk. With OpenIFS, the suspend-resume failure rate has apparently increased to 100.0 %. Which the developers hopefully will be working on.

I'm under the impression that the various errors which were observed so far are the reason why we haven't seen any new work batches since the first 3000 tasks were submitted during Monday and Tuesday.

Added in 8 minutes 14 seconds:
PS, multithreading is already present in the OpenIFS code, via OpenMP like e.g. the new Genefer version is using. Multithreading is just not activated yet in OpenIFS's adaptation to the BOINC infrastructure.

Added in 8 minutes 48 seconds:

StefanR5R wrote: ↑Sat Dec 03, 2022 12:46 pm They will use multithreading in order to be able to increase the resolution of the models. Finer resolution of course means more CPU time per time step, and bigger RAM footprint.

Though if the outcome is that instead of one core working on 4 GB of data, that maybe four cores work on 8 GB data, that'd obviously be an improvement to computer utilization for many contributors.

Added in 49 minutes 29 seconds:
Sorry for the spam. :-)

On Dec 3 Glenn Carver wrote:'Fraid I don't have time to read all these posts right now as I'm busy looking into the various problems. I think they have all been pretty much captured by posts here. And the detailed forensic reports are extremely helpful, so thanks very much for that. I have fed the list back to CPDN.

There *is* an issue with restarts. The model process itself restarts just fine if the client/machine is shutdown and restarted. However, the controlling wrapper code then appears to be miscalculating where the model is in the forecast and this leads to the 'missing file' problem that's been reported.

So if you can manage to keep the tasks running uninterrupted they *should* work (famous last words), or at least fail less often. I have not tried 'keep non-gpu tasks in memory' option, that might help. And I know I said OpenIFS shouldn't have restart problems, but it's not the fault of the model ;)

(from the message board)

Post by **StefanR5R** » Thu Dec 15, 2022 1:15 am

During the last several days,
– an application update was installed which should now survive suspend-to-disk/ resume from disk,
– the upload server was upgraded (it's a cloud server),
– a tiny batch of "OpenIFS 43r3 Baroclinic Lifecycle" was sent. Run time and RAM footprint are similar to the previous "OpenIFS 43r3 Perturbed Surface" batch, I don't know about upload size. This small batch is of course already fully assigned to hosts, but it seems to be to test the waters for some more OIFS BL/PS work afterwards.

Post by **crashtech** » Thu Dec 15, 2022 11:02 am

I'd be pleased to get some CPDN work.

Post by **StefanR5R** » Thu Dec 15, 2022 3:12 pm

StefanR5R wrote: ↑Thu Dec 15, 2022 1:15 am – the upload server was upgraded (it's a cloud server),

There is at least one user on their message board who reports high upload speeds. I on the other hand am still getting transient HTTP errors, like before.

Post by **crashtech** » Thu Dec 15, 2022 4:07 pm

I am reminded to be grateful for the fiber optic network that my telco has recently installed at both of my locations, even though we are considered to be in a rural area, they are apparently forward-looking and willing to invest (and also to receive government subsidies, presumably).

Post by **StefanR5R** » Tue Dec 20, 2022 2:44 pm

Glenn Carver wrote:OpenIFS Perturbed Surface batches

The next set of batches for the OpenIFS Perturbed Surface application are going out today/tomorrow. CPDN have asked me to write a short item on them as the scientist running the project is away. He will write something more complete for the CPDN website in due course.

The aim of this project is to understand how predictable the weather events over Europe were in 2021. The forecasts for 2021 have already gone out. The next batches will cover the 39 yrs from 2020 backwards. These are called 'hindcasts' and are used to establish a baseline in order to compare with the forecasts of 2021 to establish whether differences in the forecasts are significant or not, in a statistical sense.

For each of the 39 years, there will be 1000 forecasts. Each forecast modifies the way the Earth's surface is treated in the model by small amounts. This means each forecast creates a slightly different outcome. By using 39 years, this allows for variability in the climate (interannual variability). In total, there will be 39,000 tasks.

These 'perturbed surface' forecasts will be compared with a set of control forecasts without perturbations. These are being run offline outside of CPDN.

From a technical point of view, each batch will be the same as has been sent out before for the year 2021; same model runtime, same output sizes. There were a few of these forecasts that failed due to the model being perturbed in a way that was no longer physically realizable. Although a 'fail', this is still useful information for the scientist.

(from the New work discussion 2 thread)

Post by **mmonnin** » Tue Dec 20, 2022 11:02 pm

I got some of these tasks today. Still a good chunk of memory usage.

Post by **crashtech** » Tue Dec 20, 2022 11:48 pm

"Tue 20 Dec 2022 09:13:08 PM MST | climateprediction.net | This computer has finished a daily quota of 1 tasks"

Post by **StefanR5R** » Wed Dec 21, 2022 1:53 am

The usual reason for this is a series of error returns from the client. It will clear only if the client then delivers a few successful results (or just one?), or if the server admin manually removes this block.

Due to the suspend bug of the earlier version of the application I had a series of failures on one of my computers too. But this computer did not encounter the quota block then because it still had a cache of work stored at that time, fetched before the errors happened. The cached work got done successfully and must have cleared a quota block if there was one.

Post by **crashtech** » Wed Dec 21, 2022 8:31 am

All of the computers I am running this on are experiencing very high error rates at this time.
In fact, not one of the new units has been completed successfully.
https://www.cpdn.org/results.php?userid ... e=6&appid=
I'm going to stop running it.

Post by **StefanR5R** » Wed Dec 21, 2022 3:06 pm

results.php isn't working with a userid filter (except if logged in as this user). A hostid filter would work for other users though.

FWIW, by now I completed 22 tasks of the newest batch which was issued yesterday, 20 on an Epyc with Opensuse, 2 on a Xeon E3 with Gentoo Linux, and they are all valid. Three days earlier the Epyc grabbed 7 tasks with the same application version which is now current (OpenIFS 43r3 Perturbed Surface v1.05 x86_64-pc-linux-gnu), and these advance tasks all completed valid too.

There had been some workunits in which the perturbation input parameters were too high, such that the model failed to compute. Evidently I did not receive any of those unlucky workunits so far.

I also did not suspend-to-disk/resume-from-disk yet, although this is supposed to work now. (I suspended-to-RAM/resumed-from-RAM several of the tasks, but this has always been harmless.) Furthermore, I have plenty of RAM and plenty of disk space for what I am running concurrently (which is not very much due to my upload bandwidth bottleneck).

Post by **crashtech** » Wed Dec 21, 2022 3:24 pm

Like this, maybe?

https://www.cpdn.org/hosts_user.php?sor ... id=1139080

Here's individual links just in case:

https://www.cpdn.org/show_host_detail.p ... id=1530048
https://www.cpdn.org/show_host_detail.p ... id=1530051
https://www.cpdn.org/show_host_detail.p ... id=1530059
https://www.cpdn.org/show_host_detail.p ... id=1535089

Post by **StefanR5R** » Wed Dec 21, 2022 3:40 pm

Hosts 1535089 and 1530059 do have some successful tasks at least. (Credit will be applied later.) But indeed, overall that's a *lot* of failures.

As far as I looked through them, all the failed tasks quit at some more or less early timestep with "The child process terminated with status: 0", "..Failed, model did not complete successfully". Sounds like it could be bad input parameters. But why were there so many bad workunits sent to you in particular? That's a lot of bad luck, if so.

Also, as far as I looked, there are no returns from replica tasks of the same workunits yet. Should be worth it to check back on them later, if another computer managed to complete its task or failed in the same way. (If other computers can't get them done either, then the reason is most certainly bad input.)
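When going through many failed tasks by hand, it can help to bucket them by the message at the end of stderr. The sketch below is a hypothetical helper, not CPDN or BOINC tooling. The search strings are the ones quoted in this thread, while the bucket labels and function names are made up for illustration. It assumes the stderr text of each task has already been saved locally.

```python
from collections import Counter

# (substring quoted in this thread, made-up bucket label)
SIGNATURES = [
    ("double free or corruption", "heap corruption"),
    ("free(): invalid pointer", "invalid free"),
    ("EXIT_ABORTED_BY_CLIENT", "aborted by client"),
    ("model did not complete successfully", "model failed"),
]


def classify(stderr_text: str) -> str:
    """Return the label of the first known signature found in stderr_text."""
    for needle, label in SIGNATURES:
        if needle in stderr_text:
            return label
    return "unknown"


def tally(stderr_texts) -> Counter:
    """Count failed tasks per bucket."""
    return Counter(classify(text) for text in stderr_texts)


if __name__ == "__main__":
    samples = [
        "..Failed, model did not complete successfully",
        "double free or corruption (out)",
        "something else entirely",
    ]
    for label, count in tally(samples).items():
        print(f"{count} x {label}")
```

A bucket that shows up on one host only points at that host. One that also shows up in replica tasks on other computers points at the workunit input.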

Post by **crashtech** » Wed Dec 21, 2022 3:45 pm

I'd really like to be able to run CPDN, if there's anything I can do to facilitate it. Perhaps I should try fewer tasks; most of the times I checked in on them they were close to being out of memory.

EDIT: Actually I notice that one host seems to be doing okay, so I will let that one run:
https://www.cpdn.org/show_host_detail.p ... id=1535089

teamanandtech.org
