25 test two year data#27

Open
rogerkuou wants to merge 36 commits into main from 25_test_two_year_data

Conversation

@rogerkuou
Collaborator

@rogerkuou rogerkuou commented Feb 27, 2026

fix #25

I did not finish the train-validation-test split in this PR, but made a new issue: #28

@rogerkuou rogerkuou mentioned this pull request Feb 27, 2026
@rogerkuou rogerkuou marked this pull request as ready for review February 27, 2026 14:00
@rogerkuou
Collaborator Author

Hi @SarahAlidoost and @meiertgrootes, I created an example training process on a subset of the two-year data and ran it on Levante. In this PR I included an example SLURM training setup, with a README on how to configure the jobs on Levante.

A copy of the example run can be found at /work/bd0854/b380854/eso4clima. I executed the SLURM task in my home directory and copied the entire experiment to this location.

Member

@SarahAlidoost SarahAlidoost left a comment


@rogerkuou thanks for the script. Since PR #29 fixed a few issues, we need to merge main into this branch. I also left some comments, mainly about the structure of example.py and the code that should be run with slurm. If something is unclear, please let me know. In the meantime, I will work on issue #33.

Comment thread scripts/example.py Outdated
Comment thread scripts/example_training.py
Comment thread scripts/example.py Outdated
Comment thread scripts/example.py Outdated
Comment thread scripts/example_training.py Outdated
Comment thread scripts/example.py Outdated
Comment thread scripts/example.py Outdated
Comment thread scripts/example.py Outdated
Comment thread scripts/example.py Outdated
Comment thread scripts/example.py Outdated
@rogerkuou
Collaborator Author

Hi @SarahAlidoost, thanks for the review! I implemented most of your comments:

  1. I separated the example script into two: training and inference.
  2. The training script now exports models with checkpoints. I slightly modified the class to make it return the config.
  3. I replaced the print statements with logging.
  4. The plotting part has been removed.

I did not implement the training utility function and will leave it to #33.

Can you give another look?

Comment thread scripts/example_training.py Outdated
Comment thread scripts/example_training.py Outdated
Comment thread scripts/example_training.py Outdated
Comment thread scripts/example_training.py Outdated
Comment thread climanet/st_encoder_decoder.py Outdated
Member

@SarahAlidoost SarahAlidoost left a comment


@rogerkuou thanks for addressing the comments 👍 . Here some more suggestions:

  • I see that the example notebook has been changed in this PR. I cannot see exactly what has changed, but since this PR is about testing large data on HPC, let's not change the example notebook.
  • No need to add an inference script in this PR; we can skip that for now. Let's focus on setting up the training on HPC in this PR. We can add the inference script later when fixing #32.
  • After implementing these suggestions and re-running the slurm job, can you please add the slurm logfile to the PR as well? Could you also give an indication of how many resources were used to complete the job?

If something is not clear, please let me know.

@rogerkuou
Collaborator Author

Hi @SarahAlidoost, I have addressed your comments. A SLURM logfile has been included, with documentation in the README on how to get the resource usage. However, I did not find a convenient function to calculate the CPU efficiency of the SLURM job; the current approach requires running a command after the SLURM job has finished.

Below is a short summary of my small experiment:

  1. Two months of day->month training on a global scale
  2. Reserved 1 node (256 CPUs)
  3. The job finished in <3 minutes
  4. Peak memory usage ~4 GB
  5. CPU efficiency ~37% (only 9 epochs; I expect this to increase for a longer training run)

@rogerkuou rogerkuou requested a review from SarahAlidoost March 27, 2026 10:34
@SarahAlidoost
Member

SarahAlidoost commented Apr 1, 2026

Hi @SarahAlidoost, I have addressed your comments. A SLURM logfile has been included, with documentation in the README on how to get the resource usage. However, I did not find a convenient function to calculate the CPU efficiency of the SLURM job; the current approach requires running a command after the SLURM job has finished.

Looking at these lines in the slurm output, it seems that the training has not been done correctly. There are some issues with how the training workflow is configured; see my comments below.

Below is a short summary of my small experiment:

  1. Two months of day->month training on a global scale
  • In this case, num_months should be set to 24 and not 2 (see the model creation in the script).
  • I'm not sure how chunking the data and the dataloader work together. Does STDataset create a lazy dataset, and does the dataloader load the data lazily?
  • patch_size_training=80 is not large enough considering the global scale. A larger chunk should fit in memory.
  • In creating the dataloader, batch_size is still 2; I wonder how the entire global dataset is used in training?
  • Please move creating the dataset after creating the model, because the patch_size of the dataset should be compatible with the model patch_size. Please see the documentation and the example in the example notebook:
```python
dataset = STDataset(
    daily_da=daily_subset["ts"],
    monthly_da=monthly_subset["ts"],
    land_mask=lsm_subset["lsm"],
    patch_size=(patch_size[1]*20, patch_size[2]*20),  # based on the patch_size in model
)
```
  2. Reserved 1 node (256 CPUs)
  3. The job finished in <3 minutes

It seems that the training has not been done correctly, see my comment above.

  4. Peak memory usage ~4 GB
  5. CPU efficiency ~37% (only 9 epochs; I expect this to increase for a longer training run)

Member

@SarahAlidoost SarahAlidoost left a comment


@rogerkuou thanks for addressing the suggestions. Please see my comments and let me know if something isn't clear.

@rogerkuou
Collaborator Author

Hi @SarahAlidoost @meiertgrootes, I updated the example training script with your utility functions, and performed two experiments, both with 128 cores:

  • A training run on the two-year dataset with a spatial subset from 30S to 30N and from 30W to 30E. This experiment finished successfully in ~2 hrs. Log file: eso4clima_24438134_subset.out.
  • A training run on the two-year dataset with ALMOST global coverage, from 80S to 80N and from 179.99W to 179.99E. Since this experiment was only for evaluating the performance, I ran a SLURM job of just 1 hr. The training finished >20 epochs and was killed by the SLURM time limit. Log file: eso4clima_24449471_full.out.

Based on the two experiments I have several findings, which also address @SarahAlidoost's comments in the previous review:

  1. When including high-latitude data (>80N/S), the training keeps giving loss=inf even after 40 epochs. I have not investigated the cause, nor the latitude threshold, but I got the impression that the current coverage is something we want to work with. Therefore I left this spatial subsetting in the example script. (Maybe we should create an issue?)
  2. In the current log file there are many warnings when using open_mfdataset to lazily load multiple nc files. I did another experiment with Zarr and these warnings are gone. I would recommend converting the files to Zarr if possible. This might be easy for the daily datasets but hard for the hourly ones.
  3. STDataset seems to handle the lazily loaded data well without the need to intervene.
  4. Following @SarahAlidoost's previous comments, I changed to batch_size = 10 with accumulation_steps = 2. This seems to work well for global datasets. I used patch_size_training = 120 for the patch size.
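The batch_size = 10 with accumulation_steps = 2 combination mentioned above follows the standard gradient-accumulation pattern: gradients from several micro-batches are summed before a single parameter update, so the effective batch size becomes batch_size × accumulation_steps without the memory cost of a larger batch. A minimal, framework-agnostic sketch of the pattern (the toy loss (w - x)² and all names here are illustrative, not from this repository):

```python
# Gradient accumulation: accumulate gradients over `accumulation_steps`
# micro-batches, then take one optimizer step with their average.

def grad(w, batch):
    # d/dw of the mean of (w - x)^2 over one micro-batch
    return sum(2 * (w - x) for x in batch) / len(batch)

def train(data, batch_size=10, accumulation_steps=2, lr=0.1, epochs=50):
    w = 0.0  # single scalar parameter stands in for the model weights
    micro_batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
    for _ in range(epochs):
        acc, n = 0.0, 0
        for batch in micro_batches:
            acc += grad(w, batch)  # accumulate instead of stepping
            n += 1
            if n == accumulation_steps:
                # one update per `accumulation_steps` micro-batches
                w -= lr * acc / accumulation_steps
                acc, n = 0.0, 0
        if n:  # flush leftover micro-batches at the end of the epoch
            w -= lr * acc / n
    return w

data = [1.0, 2.0, 3.0, 4.0] * 10  # optimum of the toy loss is the mean, 2.5
w_final = train(data)
print(w_final)  # converges to ~2.5
```

In a real PyTorch training loop the same structure appears as `loss = loss_fn(...) / accumulation_steps; loss.backward()` on every micro-batch, with `optimizer.step()` and `optimizer.zero_grad()` only every `accumulation_steps` iterations.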

Please feel free to have another look.

Comment thread scripts/README.md
- example_training.py: example training script
- example.slurm: example SLURM script to execute the training script on SLURM system
- eso4clima_24438134_subset.out: example SLURM job output file of an execution on a subset of the global dataset. The dataset has two years of data (2020-2021) and the spatial coverage is from 30S to 30N and from 30W to 30E.
- eso4clima_24449471_full.out: example SLURM job output file of an execution on the full dataset, two years of data (2020-2021) and almost global coverage (from 80S to 80N and from 179.99W to 179.99E). The training executed for only 1 hour and was cut off by the SLURM time limit.
Member


Why was the job cancelled after 1 hour while #SBATCH --time=04:00:00 was set?

Comment thread scripts/README.md
23743544.extern|extern||256|00:02:44|00:00.001|3752K|COMPLETED|0:0
```

The efficiency of resource usage can be calculated as `TotalCPU / (AllocCPUS × Elapsed)`. In the example above, the CPU time is `04:21:01`, the allocated CPU count is `256`, and the elapsed time is `00:02:44`, so the efficiency of resource usage is `4:21:01 / (256 × 00:02:44) ≈ 0.37`.
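As a sanity check of that formula, the numbers from the sacct line above can be plugged in directly. A small stand-alone sketch (the time strings are copied from the example output; `to_seconds` is a helper written here, not part of any SLURM tooling):

```python
def to_seconds(t):
    """Convert a sacct-style time string (D-HH:MM:SS, HH:MM:SS, or MM:SS) to seconds."""
    days = 0
    if "-" in t:
        d, t = t.split("-")
        days = int(d)
    parts = [int(float(p)) for p in t.split(":")]  # tolerate fractional seconds
    while len(parts) < 3:
        parts.insert(0, 0)  # pad missing hour/minute fields
    h, m, s = parts
    return ((days * 24 + h) * 60 + m) * 60 + s

total_cpu = to_seconds("04:21:01")  # TotalCPU column
alloc_cpus = 256                    # AllocCPUS column
elapsed = to_seconds("00:02:44")    # Elapsed column
efficiency = total_cpu / (alloc_cpus * elapsed)
print(f"{efficiency:.2f}")  # 0.37
```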
Member


Why were 256 CPUs allocated while #SBATCH --ntasks-per-node=128?

Member


It seems that the reason is that each physical CPU core shows up as 2 virtual CPU cores, see https://docs.dkrz.de/doc/levante/configuration.html. We can add the explanation and the link to the README.

Comment thread scripts/README.md
- example_training.py: example training script
- example.slurm: example SLURM script to execute the training script on SLURM system
- eso4clima_24438134_subset.out: example SLURM job output file of an execution on a subset of the global dataset. The dataset has two years of data (2020-2021) and the spatial coverage is from 30S to 30N and from 30W to 30E.
- eso4clima_24449471_full.out: example SLURM job output file of an execution on the full dataset, two years of data (2020-2021) and almost global coverage (from 80S to 80N and from 179.99W to 179.99E). The training executed for only 1 hour and was cut off by the SLURM time limit.
Member


I checked two months of data and it looks like the values at lon = 179.9 are NaN, which might have happened during data processing. I didn’t find any issues with lat > 80 though, since the target still has data there. I made issue #41

Comment thread scripts/README.md
- example_training.py: example training script
- example.slurm: example SLURM script to execute the training script on SLURM system
- eso4clima_24438134_subset.out: example SLURM job output file of an execution on a subset of the global dataset. The dataset has two years of data (2020-2021) and the spatial coverage is from 30S to 30N and from 30W to 30E.
- eso4clima_24449471_full.out: example SLURM job output file of an execution on the full dataset, two years of data (2020-2021) and almost global coverage (from 80S to 80N and from 179.99W to 179.99E). The training executed for only 1 hour and was cut off by the SLURM time limit.
Member


In the eso4clima_24449471_full.out file, why is example_subset.slurm used? See:

* Command          : /home/b/b383704/eso4clima/train_twoyears/
*                    example_subset.slurm

Comment thread scripts/README.md
- example_training.py: example training script
- example.slurm: example SLURM script to execute the training script on SLURM system
- eso4clima_24438134_subset.out: example SLURM job output file of an execution on a subset of the global dataset. The dataset has two years of data (2020-2021) and the spatial coverage is from 30S to 30N and from 30W to 30E.
- eso4clima_24449471_full.out: example SLURM job output file of an execution on the full dataset, two years of data (2020-2021) and almost global coverage (from 80S to 80N and from 179.99W to 179.99E). The training executed for only 1 hour and was cut off by the SLURM time limit.
Member

@SarahAlidoost SarahAlidoost Apr 24, 2026


Looking at the eso4clima_24449471_full.out file, there is the warning `UserWarning: Patch size (120, 120) does not evenly divide image dimensions (H=720, W=640). Uncovered pixels: 0 in height, 40 in width. Consider adjusting patch_size or image dimensions for full coverage.` I remember we discussed this offline. I made #42
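The uncovered-pixel counts in that warning follow directly from the remainders of the image dimensions modulo the patch size. A stand-alone check (the dimensions are taken from the warning itself; `uncovered` is a helper written here, not from the repository):

```python
def uncovered(image_hw, patch_hw):
    """Pixels left uncovered when tiling an image with non-overlapping patches."""
    return tuple(dim % patch for dim, patch in zip(image_hw, patch_hw))

# Dimensions from the warning: H=720, W=640, patch_size=(120, 120)
h_left, w_left = uncovered((720, 640), (120, 120))
print(h_left, w_left)  # 0 40: 720 = 6*120 exactly, but 640 = 5*120 + 40
```

Choosing a patch size that divides both dimensions (e.g. 80 divides both 720 and 640) would avoid the uncovered strip entirely.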

Comment thread scripts/README.md
23743544.extern|extern||256|00:02:44|00:00.001|3752K|COMPLETED|0:0
```

The efficiency of resource usage can be calculated as `TotalCPU / (AllocCPUS × Elapsed)`. In the example above, the CPU time is `04:21:01`, the allocated CPU count is `256`, and the elapsed time is `00:02:44`, so the efficiency of resource usage is `4:21:01 / (256 × 00:02:44) ≈ 0.37`.
Member


Suggested change
The efficiency of resource usage can be calculated as `TotalCPU / (AllocCPUS × Elapsed)`. In the example above, the CPU time is `04:21:01`, the allocated CPU count is `256`, and the elapsed time is `00:02:44`, so the efficiency of resource usage is `4:21:01 / (256 × 00:02:44) ≈ 0.37`.
The efficiency of resource usage can be calculated as `TotalCPU / (AllocCPUS × Elapsed)`. In the example above, the CPU time is `04:21:01`, the allocated CPU count is `256`, and the elapsed time is `00:02:44`, so the resource usage is `4:21:01 / (256 × 00:02:44) ≈ 0.37`.

Comment thread scripts/README.md
23743544.extern|extern||256|00:02:44|00:00.001|3752K|COMPLETED|0:0
```

The efficiency of resource usage can be calculated as `TotalCPU / (AllocCPUS × Elapsed)`. In the example above, the CPU time is `04:21:01`, the allocated CPU count is `256`, and the elapsed time is `00:02:44`, so the efficiency of resource usage is `4:21:01 / (256 × 00:02:44) ≈ 0.37`.
Member


Based on the calculation, the job used only ~37% of the available CPU resources; the task may not scale efficiently to 256 cores, so there is room for optimization. I made #44

@SarahAlidoost
Member

SarahAlidoost commented Apr 24, 2026

Hi @SarahAlidoost @meiertgrootes, I updated the example training script with your utility functions, and performed two experiments, both with 128 cores:

  • A training run on the two-year dataset with a spatial subset from 30S to 30N and from 30W to 30E. This experiment finished successfully in ~2 hrs. Log file: eso4clima_24438134_subset.out.
  • A training run on the two-year dataset with ALMOST global coverage, from 80S to 80N and from 179.99W to 179.99E. Since this experiment was only for evaluating the performance, I ran a SLURM job of just 1 hr. The training finished >20 epochs and was killed by the SLURM time limit. Log file: eso4clima_24449471_full.out.

Based on the two experiments I have several findings, which also address @SarahAlidoost's comments in the previous review:

  1. When including high-latitude data (>80N/S), the training keeps giving loss=inf even after 40 epochs. I have not investigated the cause, nor the latitude threshold, but I got the impression that the current coverage is something we want to work with. Therefore I left this spatial subsetting in the example script. (Maybe we should create an issue?)
  2. In the current log file there are many warnings when using open_mfdataset to lazily load multiple nc files. I did another experiment with Zarr and these warnings are gone. I would recommend converting the files to Zarr if possible. This might be easy for the daily datasets but hard for the hourly ones.
  3. STDataset seems to handle the lazily loaded data well without the need to intervene.
  4. Following @SarahAlidoost's previous comments, I changed to batch_size = 10 with accumulation_steps = 2. This seems to work well for global datasets. I used patch_size_training = 120 for the patch size.

Please feel free to have another look.

Hi @rogerkuou, thanks! From your comment, I understood that you ran the job for 1 hour on purpose. I don't think that's the best way to assess performance, but we can leave it for now. About the open_mfdataset warnings and Zarr vs. nc: let's first make sure we understand the warnings; are they actually related to the file format? If so, could you make an issue for it?

I’ve added some comments and made a few issues, no need to run the job at this point. We can improve the slurm job step by step later. For now, let’s keep the PR open.


Development

Successfully merging this pull request may close these issues.

Test training on the two year dataset

2 participants