For what it's worth, here is a snapshot of the RAM usage on my computers:
26x oifs + 38x llrSGS = 94 GB RAM in use
29x oifs + 35x llrSGS = 110 GB
53x oifs + 11x llrSGS = 184 GB
llrSGS's RAM footprint is actually negligible, and nothing else is running on these computers (no desktop environment, no GPU jobs). That is, the average RAM footprint of oifs was ~3.6 GB when I looked, and the largest running task had a working set of 4.3 GB.
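Spelled out, the averaging above looks like this (a sketch using the three snapshots quoted in this post, treating llrSGS's footprint as ~0):

```python
# Average RAM per OpenIFS task, treating llrSGS's footprint as negligible.
# Each tuple: (oifs tasks, llrSGS tasks, GB of RAM in use) from the snapshots above.
snapshots = [(26, 38, 94), (29, 35, 110), (53, 11, 184)]

for oifs, llr, ram_gb in snapshots:
    print(f"{oifs} oifs tasks, {ram_gb} GB in use -> ~{ram_gb / oifs:.1f} GB per task")
```

The three snapshots come out at roughly 3.6, 3.8, and 3.5 GB per task, consistent with the ~3.6 GB average mentioned above.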
(As mentioned, that's many more simultaneously running OpenIFS tasks than my Internet uplink can sustain. But the PrimeGrid challenge is afoot, during which CPUs will be taken off CPDN, creating an opportunity to clear the pending uploads.)
A nice aspect of concentrating the workload on a few wide computers, instead of many narrower ones, is that RAM can be provisioned for the average demand rather than the peak demand. BTW, when I watch "top", used RAM swings by several GB between display updates.
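The average-vs-peak point can be illustrated with a toy simulation (all numbers hypothetical, not measured from OpenIFS): when many tasks fluctuate independently, their combined footprint stays well below the sum of the individual peaks.

```python
import random

random.seed(1)

TASKS = 50            # tasks on one wide machine (hypothetical count)
LOW, PEAK = 3.0, 6.0  # GB per task, fluctuating between phases (hypothetical)

def task_ram():
    # A task spends most of its time near LOW and only briefly at PEAK.
    return PEAK if random.random() < 0.1 else random.uniform(LOW, LOW + 1)

# Sample the combined footprint of all tasks many times.
total = [sum(task_ram() for _ in range(TASKS)) for _ in range(1000)]

print(f"sum of per-task peaks:  {TASKS * PEAK:.0f} GB")
print(f"observed maximum total: {max(total):.0f} GB")
```

On a narrow machine running one or two tasks, you must provision close to the per-task peak; on the wide machine, the uncorrelated peaks average out.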
The OpenIFS model is a multiphysics model, and its sub-models are not all solved at once. I read that even the time step length can differ between the sub-models; notably, the radiation sub-model can be solved with coarser time steps than the rest. As it happens, maximum RAM usage supposedly occurs while the radiation code runs.
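A toy sketch of that multi-rate idea (the step ratio and sub-model names are hypothetical; OpenIFS's actual scheduling is more involved):

```python
# Toy multi-rate loop: a "radiation" sub-model runs on a coarser time step
# than the rest of the model, so it is only solved every Nth base step.
RAD_EVERY = 4   # radiation solved every 4th base step (hypothetical ratio)
STEPS = 12      # number of base time steps to simulate

log = []
for step in range(STEPS):
    log.append("dynamics+physics")   # solved on every base step
    if step % RAD_EVERY == 0:
        log.append("radiation")      # coarser step; RAM would peak here

print(log.count("radiation"), "radiation calls in", STEPS, "base steps")
```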
This is the facility where the grandfather of this code runs:
https://www.ecmwf.int/en/computing/our- ... r-facility
Added in 7 minutes 51 seconds:
crashtech wrote: ↑Sat Dec 03, 2022 11:59 am
Long run times and the inability to survive a reboot is a recipe for failure. I keep wanting to like this project, but they make it tough. With the high RAM and network requirements, it would be really great if multithreading could be implemented.
They will use multithreading in order to be able to increase the resolution of the models. Finer resolution of course means more CPU time per time step and a bigger RAM footprint. I don't know what the consequences for the result data rate will be; at least the CPDN operators are well aware that many contributors have very limited bandwidth at their disposal.
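As a rough rule of thumb (a back-of-envelope sketch, not OpenIFS's actual cost model): halving the horizontal grid spacing quadruples the number of grid points, and stability (CFL) roughly halves the usable time step on top of that, so RAM grows ~4x while CPU cost per simulated period grows ~8x.

```python
# Back-of-envelope scaling when horizontal grid spacing is halved.
# (Rough rule of thumb only; real model costs depend on many more factors.)
refine = 2                # factor by which the grid spacing shrinks
points = refine ** 2      # horizontal grid points grow quadratically
timesteps = refine        # CFL: stable time step shrinks roughly linearly

print(f"RAM footprint:            ~{points}x")
print(f"CPU per simulated period: ~{points * timesteps}x")
```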
The HadAM/HadSM models (i.e., UK Met Office code) already had a certain failure rate when suspended to disk and resumed from disk. With OpenIFS, the suspend-resume failure rate has apparently increased to 100.0 %. Hopefully the developers will be working on that.
I'm under the impression that the various errors which were observed so far are the reason why we haven't seen any new work batches since the first 3000 tasks were submitted during Monday and Tuesday.
Added in 8 minutes 14 seconds:
PS: multithreading is already present in the OpenIFS code, via OpenMP, like e.g. the new Genefer version uses. It just hasn't been activated yet in OpenIFS's adaptation to the BOINC infrastructure.
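OpenMP itself is a pragma/directive mechanism for C and Fortran; purely as an analogy, here is the same fork-join loop pattern sketched with Python threads (the workload function is hypothetical, just a stand-in for per-grid-column physics):

```python
from concurrent.futures import ThreadPoolExecutor

# OpenMP-style fork-join: split one loop's iterations across worker threads,
# then join the results. (Analogy only; OpenIFS uses OpenMP in its own code.)
def column_work(i):
    return i * i   # stand-in for the work done on one grid column

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(column_work, range(100)))

# Same result as the serial loop, with the iterations shared among threads.
print(sum(results))
```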
Added in 8 minutes 48 seconds:
StefanR5R wrote: ↑Sat Dec 03, 2022 12:46 pm
They will use multithreading in order to be able to increase the resolution of the models. Finer resolution of course means more CPU time per time step, and bigger RAM footprint.
Though if the outcome is that, instead of one core working on 4 GB of data, four cores work on 8 GB of data, that would obviously improve computer utilization for many contributors.
Added in 49 minutes 29 seconds:
Sorry for the spam. :-)
On Dec 3 Glenn Carver wrote:
'Fraid I don't have time to read all these posts right now as I'm busy looking into the various problems. I think they have all been pretty much captured by posts here. And the detailed forensic reports are extremely helpful, so thanks very much for that. I have fed the list back to CPDN.
There *is* an issue with restarts. The model process itself restarts just fine if the client/machine is shutdown and restarted. However, the controlling wrapper code then appears to be miscalculating where the model is in the forecast and this leads to the 'missing file' problem that's been reported.
So if you can manage to keep the tasks running uninterrupted they *should* work (famous last words), or at least fail less often. I have not tried 'keep non-gpu tasks in memory' option, that might help. And I know I said OpenIFS shouldn't have restart problems, but it's not the fault of the model ;)
(from the message board)