Conversation
Hi @SarahAlidoost and @meiertgrootes, I created an example training process on a subset of the two-year data and ran it on Levante. In this PR I included an example SLURM training process, with a README on how to configure the jobs on Levante. A copy of the example run can be found on
SarahAlidoost
left a comment
@rogerkuou thanks for the script. Since PR #29 fixed a few issues, we need to merge main into this branch. I also left some comments, mainly about the structure of example.py and the code that should be run with SLURM. If something is unclear, please let me know. In the meantime, I will work on issue #33.
Co-authored-by: SarahAlidoost <55081872+SarahAlidoost@users.noreply.github.com>
Hi @SarahAlidoost, thanks for the review! I implemented most of your comments:
I did not implement the training utility function and will leave it to #33. Can you give it another look?
SarahAlidoost
left a comment
@rogerkuou thanks for addressing the comments 👍 . Here some more suggestions:
- I see that the example notebook has been changed in this PR. I cannot see exactly what changed, but since this PR is about testing large data on HPC, let's not change the example notebook.
- No need to add an inference script in this PR; we can skip that for now. Let's focus on setting up the training on HPC in this PR. We can add the inference script later when fixing #32.
- After implementing these suggestions and re-running the SLURM job, can you please add the SLURM logfile to the PR as well? Also, can you give an indication of how many resources were used to complete the job?
If something is not clear, please let me know.
Hi @SarahAlidoost, I have addressed your comments. A SLURM logfile has been included, with documentation in the README on how to retrieve the resource usage. However, I did not find a convenient function to calculate the CPU efficiency of a SLURM job; the current approach requires running an extra command after the SLURM job has finished. Below is a short summary of my small experiment:
Looking at these lines in the SLURM output, it seems that the training has not been done correctly. There are some issues with how the training workflow is configured, see my comments below.

```python
dataset = STDataset(
    daily_da=daily_subset["ts"],
    monthly_da=monthly_subset["ts"],
    land_mask=lsm_subset["lsm"],
    # scale the spatial patch size by a factor of 20, based on the patch_size in the model
    patch_size=(patch_size[1] * 20, patch_size[2] * 20),
)
```

It seems that the training has not been done correctly, see my comment above.
SarahAlidoost
left a comment
@rogerkuou thanks for addressing the suggestions. Please see my comments and let me know if something isn't clear.
Hi @SarahAlidoost @meiertgrootes, I updated the example training script with your utility functions and performed two experiments, both with 128 cores:
Based on the two experiments I have several findings, which also address @SarahAlidoost's comments in the previous review:
Please feel free to have another look.
- example_training.py: example training script
- example.slurm: example SLURM script to execute the training script on a SLURM system
- eso4clima_24438134_subset.out: example SLURM job output file of a run on a subset of the global dataset. The dataset has two years of data (2020-2021) and the spatial coverage is from 30S to 30N and from 30W to 30E.
- eso4clima_24449471_full.out: example SLURM job output file of a run on the full dataset: two years of data (2020-2021) and almost global coverage (from 80S to 80N and from 179.99W to 179.99E). The training ran for only 1 hour and was cut off by the SLURM time limit.
Why was the job cancelled after 1 hour while `#SBATCH --time=04:00:00` was set?
```
23743544.extern|extern||256|00:02:44|00:00.001|3752K|COMPLETED|0:0
```

The efficiency of resource usage can be calculated as `TotalCPU / (AllocCPUS * Elapsed)`. In the example above, the CPU time is `04:21:01`, the allocated CPU count is `256`, and the elapsed time is `00:02:44`, so the efficiency of resource usage is `4:21:01 / (256 * 00:02:44) = 0.37`.
Why were 256 CPUs allocated while `#SBATCH --ntasks-per-node=128` was set?
It seems that the reason is that each physical CPU core shows up as 2 virtual CPU cores, see https://docs.dkrz.de/doc/levante/configuration.html. We can add the explanation and the link to the README.
I checked two months of data and it looks like the values at lon = 179.9 are NaN, which might have happened during data processing. I didn’t find any issues with lat > 80 though, since the target still has data there. I made issue #41
In the eso4clima_24449471_full.out file, why is example_subset.slurm used? See:
* Command : /home/b/b383704/eso4clima/train_twoyears/example_subset.slurm
Looking at the eso4clima_24449471_full.out file, there is the warning: `UserWarning: Patch size (120, 120) does not evenly divide image dimensions (H=720, W=640). Uncovered pixels: 0 in height, 40 in width. Consider adjusting patch_size or image dimensions for full coverage.` I remember we have discussed this offline. I made #42.
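The warning quoted above boils down to a divisibility check on the image dimensions. A minimal sketch of that check (the function name `check_patch_coverage` is hypothetical, not part of the codebase):

```python
def check_patch_coverage(height, width, patch_size):
    """Report pixels left uncovered when tiling an image with non-overlapping patches."""
    ph, pw = patch_size
    uncovered_h = height % ph  # rows not covered by a full patch
    uncovered_w = width % pw   # columns not covered by a full patch
    if uncovered_h or uncovered_w:
        print(
            f"Patch size {patch_size} does not evenly divide "
            f"image dimensions (H={height}, W={width}). "
            f"Uncovered pixels: {uncovered_h} in height, {uncovered_w} in width."
        )
    return uncovered_h, uncovered_w

# The dimensions from the warning in the log file:
check_patch_coverage(720, 640, (120, 120))  # 40 columns uncovered in width
```

For the full dataset this suggests either cropping the grid to a multiple of the patch size or choosing a patch size that divides both 720 and 640 (e.g. 80).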
Suggested change:
The efficiency of resource usage can be calculated as `TotalCPU / (AllocCPUS * Elapsed)`. In the example above, the CPU time is `04:21:01`, the allocated CPU count is `256`, and the elapsed time is `00:02:44`, so the resource usage is `4:21:01 / (256 * 00:02:44) = 0.37`.
Based on the calculation, the job used only ~37% of the available CPU resources, so the task may not scale efficiently to 256 cores; there is room for optimization. I made #44.
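The ~37% figure above can be reproduced from the `sacct` fields. A minimal sketch, assuming `TotalCPU` and `Elapsed` come as plain `HH:MM:SS` strings (sacct can also emit `D-HH:MM:SS` or fractional seconds, which this sketch does not handle; the helper names are hypothetical):

```python
def hms_to_seconds(hms: str) -> int:
    """Convert a SLURM time string like '04:21:01' to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def cpu_efficiency(total_cpu: str, alloc_cpus: int, elapsed: str) -> float:
    """CPU efficiency = TotalCPU / (AllocCPUS * Elapsed)."""
    return hms_to_seconds(total_cpu) / (alloc_cpus * hms_to_seconds(elapsed))

# Values from the sacct output quoted in the README:
print(round(cpu_efficiency("04:21:01", 256, "00:02:44"), 2))  # 0.37
```

This makes the interpretation explicit: 15661 CPU-seconds were consumed out of 256 * 164 = 41984 CPU-seconds allocated.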
Hi @rogerkuou Thanks! From your comment, I understood that you ran the job for 1 hour on purpose. I don't think that's the best way to assess performance, but we can leave it for now. I've added some comments and made a few issues; no need to re-run the job at this point. We can improve the SLURM job step by step later. For now, let's keep the PR open.
fix #25
I did not finish the train-validation-test split in this PR, but made a new issue #28.