[MLC@Home] [TWIM Notes] Nov 2 2020


Post by BOINC_News »

This Week in MLC@Home Notes for Nov 2 2020
A weekly summary of news and notes for MLC@Home

If you're in the US, and you haven't already taken advantage of early voting, please VOTE tomorrow.

Summary
GPU week(s), part 2. The downside of giving weekly updates is that sometimes you don't have a lot to report. Sadly, most of this week was lost attempting to debug a strange crash that only affects the client when compiled for Linux/CUDA. It turns out it has nothing to do with our code; it's a strange interaction between the threading library used in the pre-compiled PyTorch libraries and the BOINC library. It took several days for us to realize it wasn't our code causing the issue. Now it'll take a few more to find a solution, which will likely require a custom-compiled PyTorch, which in turn will take a few more days to debug.

GPU speedups are worth it, but please bear with us as we work through these issues in the testing queue, and keep churning away on the main research queue!

News:
  • We expect to have most CUDA issues ironed out and have them in general (non-beta) use by next week. ROCm support would be next, but it is a lower priority.
  • Datasets 1, 2, and 3 continue crunching away. GREAT progress so far!
  • We will back off on the WU FLOPS estimates for any newly issued WUs starting this week. This should solve the problems with overestimated run times.
  • Spent a little time working on DS4 and the DS3 paper, but GPU debugging again took up the majority of free time.
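For anyone wondering why the FLOPS estimate matters: as a rough mental model, the client's expected run time for a WU scales with the WU's estimated floating-point operations divided by the speed it assumes for the host. The sketch below is a simplification of that idea, not BOINC's actual scheduler logic, and the function name and every number in it are made up for illustration.

```python
def estimated_runtime_seconds(fpops_est: float, host_flops: float) -> float:
    """Simplified model: estimated work divided by assumed sustained speed.

    fpops_est  -- estimated floating-point operations for one WU (hypothetical)
    host_flops -- assumed sustained speed of the host in FLOPS (hypothetical)
    """
    if host_flops <= 0:
        raise ValueError("host_flops must be positive")
    return fpops_est / host_flops


# Hypothetical numbers, not MLC@Home's real estimates.
old_estimate = estimated_runtime_seconds(4.0e13, 2.0e9)  # inflated estimate
new_estimate = estimated_runtime_seconds(1.0e13, 2.0e9)  # after backing off

print(f"old: {old_estimate:.0f} s, new: {new_estimate:.0f} s")
```

In this simplified picture, cutting the estimate by some factor cuts the predicted run time by the same factor, which is why lowering it should bring the displayed times closer to reality.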


Project status snapshot:
(note these numbers are approximations)

Tasks
  Tasks ready to send: 41034
  Tasks in progress: 9231
Users
  With credit: 1080
  Registered in past 24 hours: 39
Hosts
  With recent credit: 2106
  Registered in past 24 hours: 24
Current GigaFLOPS: 29333.56

Dataset 1 and 2 progress:
  SingleDirectMachine: 10002/10004
  EightBitMachine: 10001/10006
  SingleInvertMachine: 10001/10003
  SimpleXORMachine: 10000/10002
  ParityMachine: 912/10005
  ParityModified: 289/10005
  EightBitModified: 6597/10006
  SimpleXORModified: 10005/10005
  SingleDirectModified: 10004/10004
  SingleInvertModified: 10002/10002

Dataset 3 progress:
  Overall (so far): 44466/50557
  Milestone 1, 100x100: 10000/10000
  Milestone 2, 100x1000: 44466/100000
  Milestone 3, 100x10000: 44466/1000000

Last week's TWIM Notes: Oct 27 2020

Thanks again to all our volunteers!

-- The MLC@Home Admins

Source: https://www.mlcathome.org/mlcathome/for ... php?id=113