25 test two year data#27

Open
rogerkuou wants to merge 36 commits into main from 25_test_two_year_data

Conversation

@rogerkuou
Collaborator

@rogerkuou rogerkuou commented Feb 27, 2026

fix #25

I did not finish the train-validation-test split in this PR, but made a new issue: #28

@rogerkuou rogerkuou mentioned this pull request Feb 27, 2026
@rogerkuou rogerkuou marked this pull request as ready for review February 27, 2026 14:00
@rogerkuou
Collaborator Author

Hi @SarahAlidoost and @meiertgrootes, I created an example training process on a subset of the two-year data and ran it on Levante. In this PR I included an example SLURM training setup, with a README on how to configure the jobs on Levante.

A copy of the example run can be found at /work/bd0854/b380854/eso4clima. I executed the SLURM task in my home directory and copied the entire experiment to this location.

Member

@SarahAlidoost SarahAlidoost left a comment


@rogerkuou thanks for the script. Since PR #29 fixed a few issues, we need to merge main into this branch. I also left some comments, mainly about the structure of example.py and the code that should be run with slurm. If something is unclear, please let me know. In the meantime, I will work on issue #33.

Comment thread scripts/example.py Outdated
Comment thread scripts/example_training.py
Comment thread scripts/example.py Outdated
Comment thread scripts/example.py Outdated
Comment thread scripts/example_training.py Outdated
Comment thread scripts/example.py Outdated
Comment thread scripts/example.py Outdated
Comment thread scripts/example.py Outdated
Comment thread scripts/example.py Outdated
Comment thread scripts/example.py Outdated
@rogerkuou
Collaborator Author

Hi @SarahAlidoost, thanks for the review! I implemented most of your comments:

  1. I separated the example script into two: training and inference.
  2. The training script now exports models with checkpoints. I slightly modified the class to make it return the config.
  3. I replaced the print statements with logging.
  4. The plotting part has been removed.

I did not implement the training utility function and will leave it to #33.

Can you give another look?

Comment thread scripts/example_training.py Outdated
Comment thread scripts/example_training.py Outdated
Comment thread scripts/example_training.py Outdated
Comment thread scripts/example_training.py Outdated
Comment thread climanet/st_encoder_decoder.py Outdated
Member

@SarahAlidoost SarahAlidoost left a comment


@rogerkuou thanks for addressing the comments 👍 . Here some more suggestions:

  • I see that the example notebook has been changed in this PR. I cannot see exactly what has changed, but since this PR is about testing large data on HPC, let's not change the example notebook.
  • No need to add an inference script in this PR; we can skip that for now. Let's focus on setting up the training on HPC in this PR. We can add the inference script later when fixing #32.
  • After implementing these suggestions and re-running the slurm job, can you please add the slurm logfile to the PR as well? Could you also give an indication of how many resources were used to complete the job?

If something is not clear, please let me know.

@rogerkuou
Collaborator Author

Hi @SarahAlidoost, I have addressed your comments. A SLURM logfile has been included, with documentation in the README on how to get the resource usage. However, I did not find a convenient function to calculate the CPU efficiency of the SLURM job; the current approach requires running a command after the SLURM job has finished.

Below is a short summary of my small experiment:

  1. Two months of day->month training on a global scale
  2. Reserved 1 node (256 CPUs)
  3. The job finished in <3 minutes
  4. Peak memory usage ~4 GB
  5. CPU efficiency ~37% (only 9 epochs; I expect this to increase for a longer training run)

@rogerkuou rogerkuou requested a review from SarahAlidoost March 27, 2026 10:34
@SarahAlidoost
Member

SarahAlidoost commented Apr 1, 2026

Hi @SarahAlidoost, I have addressed your comments. A SLURM logfile has been included, with documentation in the README on how to get the resource usage. However, I did not find a convenient function to calculate the CPU efficiency of the SLURM job; the current approach requires running a command after the SLURM job has finished.

Looking at these lines in the slurm output, it seems that the training has not been done correctly. There are some issues with how the training workflow is configured; see my comments below.

Below is a short summary of my small experiment:

  1. Two months of day->month training on a global scale
  • In this case, num_months should be set to 24 and not 2 (see the model creation in the script).
  • I'm not sure how chunking the data and the dataloader work together. Does STDataset create a lazy dataset, and does the dataloader load the data lazily?
  • patch_size_training=80 is not large enough considering the global scale. A larger chunk should fit in memory.
  • In creating the dataloader, batch_size is still 2; I wonder how the entire global dataset is used in training?
  • Please move creating the dataset after creating the model, because the patch_size of the dataset should be compatible with the model patch_size. Please see the documentation and the example in the example notebook:
```python
dataset = STDataset(
    daily_da=daily_subset["ts"],
    monthly_da=monthly_subset["ts"],
    land_mask=lsm_subset["lsm"],
    patch_size=(patch_size[1]*20, patch_size[2]*20),  # based on the patch_size in model
)
```
  2. Reserved 1 node (256 CPUs)
  3. The job finished in <3 minutes

It seems that the training has not been done correctly, see my comment above.

  4. Peak memory usage ~4 GB
  5. CPU efficiency ~37% (only 9 epochs; I expect this to increase for a longer training run)

Member

@SarahAlidoost SarahAlidoost left a comment


@rogerkuou thanks for addressing the suggestions. Please see my comments and let me know if something isn't clear.

@rogerkuou
Collaborator Author

Hi @SarahAlidoost @meiertgrootes, I updated the example training script with your utility functions, and performed two experiments, both with 128 cores:

  • A training run on the two-year dataset with a spatial subset from 30S to 30N and from 30W to 30E. This experiment finished successfully in ~2 hrs. Log file: eso4clima_24438134_subset.out.
  • A training run on the two-year dataset with ALMOST global coverage, from 80S to 80N and from 179.99W to 179.99E. Since this experiment was only for evaluating the performance, I ran a SLURM job of just 1 hr. The training finished >20 epochs and was killed by the SLURM time limit. Log file: eso4clima_24449471_full.out.

Based on the two experiments I have several findings, which also address @SarahAlidoost's comments in the previous review:

  1. When including high-latitude data (>80N/S), the training keeps giving loss=inf even after 40 epochs. I have not investigated the cause, nor the latitude threshold, but I got the impression that the current coverage is something we want to work with. Therefore I left this spatial subsetting in the example script. (Maybe we should create an issue?)
  2. In the current log file there are many warnings when using open_mfdataset to lazily load multiple nc files. I did another experiment with Zarr and these warnings are gone. I would recommend converting the files to Zarr if possible. This might be easy for the daily datasets but hard for the hourly ones.
  3. STDataset seems to handle the lazily loaded data well without the need to intervene.
  4. Following @SarahAlidoost's previous comments, I changed to batch_size = 10 with accumulation_steps = 2. This seems to work well for global datasets. I used patch_size_training = 120 for the patch size.
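The batch_size = 10 with accumulation_steps = 2 combination mentioned above follows the standard gradient-accumulation pattern: gradients from several micro-batches are summed before a single parameter update, so the effective batch size becomes batch_size × accumulation_steps without the memory cost of a larger batch. A minimal, framework-agnostic sketch of the pattern (the toy loss (w - x)² and all names here are illustrative, not from this repository):

```python
# Gradient accumulation: accumulate gradients over `accumulation_steps`
# micro-batches, then take one optimizer step with their average.

def grad(w, batch):
    # d/dw of the mean of (w - x)^2 over one micro-batch
    return sum(2 * (w - x) for x in batch) / len(batch)

def train(data, batch_size=10, accumulation_steps=2, lr=0.1, epochs=50):
    w = 0.0  # single scalar parameter stands in for the model weights
    micro_batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
    for _ in range(epochs):
        acc, n = 0.0, 0
        for batch in micro_batches:
            acc += grad(w, batch)  # accumulate instead of stepping
            n += 1
            if n == accumulation_steps:
                # one update per `accumulation_steps` micro-batches
                w -= lr * acc / accumulation_steps
                acc, n = 0.0, 0
        if n:  # flush leftover micro-batches at the end of the epoch
            w -= lr * acc / n
    return w

data = [1.0, 2.0, 3.0, 4.0] * 10  # optimum of the toy loss is the mean, 2.5
w_final = train(data)
print(w_final)  # converges to ~2.5
```

In a real PyTorch training loop the same structure appears as `loss = loss_fn(...) / accumulation_steps; loss.backward()` on every micro-batch, with `optimizer.step()` and `optimizer.zero_grad()` only every `accumulation_steps` iterations.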

Please feel free to have another look.

Comment thread scripts/README.md
- example_training.py: example training script
- example.slurm: example SLURM script to execute the training script on SLURM system
- eso4clima_24438134_subset.out: example SLURM job output file of an execution on a subset of the global dataset. The dataset has two years of data (2020-2021) and the spatial coverage is from 30S to 30N and from 30W to 30E.
- eso4clima_24449471_full.out: example SLURM job output file of an execution on the full dataset, two years of data (2020-2021) and almost global coverage (from 80S to 80N and from 179.99W to 179.99E). The training executed for only 1 hour and was cut off by the SLURM time limit.
Member


Why was the job cancelled after 1 hour while #SBATCH --time=04:00:00 was set?

Comment thread scripts/README.md
23743544.extern|extern||256|00:02:44|00:00.001|3752K|COMPLETED|0:0
```

The efficiency of resource usage can be calculated as `TotalCPU / (AllocCPUS × Elapsed)`. In the example above, the CPU time is `04:21:01`, the allocated CPU count is `256`, and the elapsed time is `00:02:44`, so the efficiency of resource usage is `4:21:01 / (256 × 00:02:44) ≈ 0.37`.
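As a sanity check of that formula, the numbers from the sacct line above can be plugged in directly. A small stand-alone sketch (the time strings are copied from the example output; `to_seconds` is a helper written here, not part of any SLURM tooling):

```python
def to_seconds(t):
    """Convert a sacct-style time string (D-HH:MM:SS, HH:MM:SS, or MM:SS) to seconds."""
    days = 0
    if "-" in t:
        d, t = t.split("-")
        days = int(d)
    parts = [int(float(p)) for p in t.split(":")]  # tolerate fractional seconds
    while len(parts) < 3:
        parts.insert(0, 0)  # pad missing hour/minute fields
    h, m, s = parts
    return ((days * 24 + h) * 60 + m) * 60 + s

total_cpu = to_seconds("04:21:01")  # TotalCPU column
alloc_cpus = 256                    # AllocCPUS column
elapsed = to_seconds("00:02:44")    # Elapsed column
efficiency = total_cpu / (alloc_cpus * elapsed)
print(f"{efficiency:.2f}")  # 0.37
```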
Member


Why were 256 CPUs allocated while #SBATCH --ntasks-per-node=128?

Member


It seems that the reason is that each physical CPU core shows up as 2 virtual CPU cores, see https://docs.dkrz.de/doc/levante/configuration.html. We can add the explanation and the link to the README.

Comment thread scripts/README.md
- example_training.py: example training script
- example.slurm: example SLURM script to execute the training script on SLURM system
- eso4clima_24438134_subset.out: example SLURM job output file of an execution on a subset of the global dataset. The dataset has two years of data (2020-2021) and the spatial coverage is from 30S to 30N and from 30W to 30E.
- eso4clima_24449471_full.out: example SLURM job output file of an execution on the full dataset, two years of data (2020-2021) and almost global coverage (from 80S to 80N and from 179.99W to 179.99E). The training executed for only 1 hour and was cut off by the SLURM time limit.
Member


I checked two months of data and it looks like the values at lon = 179.9 are NaN, which might have happened during data processing. I didn’t find any issues with lat > 80 though, since the target still has data there. I made issue #41

Comment thread scripts/README.md
- example_training.py: example training script
- example.slurm: example SLURM script to execute the training script on SLURM system
- eso4clima_24438134_subset.out: example SLURM job output file of an execution on a subset of the global dataset. The dataset has two years of data (2020-2021) and the spatial coverage is from 30S to 30N and from 30W to 30E.
- eso4clima_24449471_full.out: example SLURM job output file of an execution on the full dataset, two years of data (2020-2021) and almost global coverage (from 80S to 80N and from 179.99W to 179.99E). The training executed for only 1 hour and was cut off by the SLURM time limit.
Member


In the eso4clima_24449471_full.out file, why is example_subset.slurm used? See:

* Command          : /home/b/b383704/eso4clima/train_twoyears/
*                    example_subset.slurm

Comment thread scripts/README.md
- example_training.py: example training script
- example.slurm: example SLURM script to execute the training script on SLURM system
- eso4clima_24438134_subset.out: example SLURM job output file of an execution on a subset of the global dataset. The dataset has two years of data (2020-2021) and the spatial coverage is from 30S to 30N and from 30W to 30E.
- eso4clima_24449471_full.out: example SLURM job output file of an execution on the full dataset, two years of data (2020-2021) and almost global coverage (from 80S to 80N and from 179.99W to 179.99E). The training executed for only 1 hour and was cut off by the SLURM time limit.
Member

@SarahAlidoost SarahAlidoost Apr 24, 2026


Looking at the eso4clima_24449471_full.out file, there is the warning `UserWarning: Patch size (120, 120) does not evenly divide image dimensions (H=720, W=640). Uncovered pixels: 0 in height, 40 in width. Consider adjusting patch_size or image dimensions for full coverage.` I remember we discussed this offline. I made #42
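The uncovered-pixel counts in that warning follow directly from the remainders of the image dimensions modulo the patch size. A stand-alone check (the dimensions are taken from the warning itself; `uncovered` is a helper written here, not from the repository):

```python
def uncovered(image_hw, patch_hw):
    """Pixels left uncovered when tiling an image with non-overlapping patches."""
    return tuple(dim % patch for dim, patch in zip(image_hw, patch_hw))

# Dimensions from the warning: H=720, W=640, patch_size=(120, 120)
h_left, w_left = uncovered((720, 640), (120, 120))
print(h_left, w_left)  # 0 40: 720 = 6*120 exactly, but 640 = 5*120 + 40
```

Choosing a patch size that divides both dimensions (e.g. 80 divides both 720 and 640) would avoid the uncovered strip entirely.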

Comment thread scripts/README.md
23743544.extern|extern||256|00:02:44|00:00.001|3752K|COMPLETED|0:0
```

The efficiency of resource usage can be calculated as `TotalCPU / (AllocCPUS × Elapsed)`. In the example above, the CPU time is `04:21:01`, the allocated CPU count is `256`, and the elapsed time is `00:02:44`, so the efficiency of resource usage is `4:21:01 / (256 × 00:02:44) ≈ 0.37`.
Member


Suggested change
The efficiency of resource usage can be calculated as `TotalCPU / (AllocCPUS × Elapsed)`. In the example above, the CPU time is `04:21:01`, the allocated CPU count is `256`, and the elapsed time is `00:02:44`, so the efficiency of resource usage is `4:21:01 / (256 × 00:02:44) ≈ 0.37`.
The efficiency of resource usage can be calculated as `TotalCPU / (AllocCPUS × Elapsed)`. In the example above, the CPU time is `04:21:01`, the allocated CPU count is `256`, and the elapsed time is `00:02:44`, so the resource usage is `4:21:01 / (256 × 00:02:44) ≈ 0.37`.

Comment thread scripts/README.md
23743544.extern|extern||256|00:02:44|00:00.001|3752K|COMPLETED|0:0
```

The efficiency of resource usage can be calculated as `TotalCPU / (AllocCPUS × Elapsed)`. In the example above, the CPU time is `04:21:01`, the allocated CPU count is `256`, and the elapsed time is `00:02:44`, so the efficiency of resource usage is `4:21:01 / (256 × 00:02:44) ≈ 0.37`.
Member


Based on the calculation, the job used only ~37% of the available CPU resources; the task may not scale efficiently to 256 cores, so there is room for optimization. I made #44

@SarahAlidoost
Member

SarahAlidoost commented Apr 24, 2026

Hi @SarahAlidoost @meiertgrootes, I updated the example training script with your utility functions, and performed two experiments, both with 128 cores:

  • A training run on the two-year dataset with a spatial subset from 30S to 30N and from 30W to 30E. This experiment finished successfully in ~2 hrs. Log file: eso4clima_24438134_subset.out.
  • A training run on the two-year dataset with ALMOST global coverage, from 80S to 80N and from 179.99W to 179.99E. Since this experiment was only for evaluating the performance, I ran a SLURM job of just 1 hr. The training finished >20 epochs and was killed by the SLURM time limit. Log file: eso4clima_24449471_full.out.

Based on the two experiments I have several findings, which also address @SarahAlidoost's comments in the previous review:

  1. When including high-latitude data (>80N/S), the training keeps giving loss=inf even after 40 epochs. I have not investigated the cause, nor the latitude threshold, but I got the impression that the current coverage is something we want to work with. Therefore I left this spatial subsetting in the example script. (Maybe we should create an issue?)
  2. In the current log file there are many warnings when using open_mfdataset to lazily load multiple nc files. I did another experiment with Zarr and these warnings are gone. I would recommend converting the files to Zarr if possible. This might be easy for the daily datasets but hard for the hourly ones.
  3. STDataset seems to handle the lazily loaded data well without the need to intervene.
  4. Following @SarahAlidoost's previous comments, I changed to batch_size = 10 with accumulation_steps = 2. This seems to work well for global datasets. I used patch_size_training = 120 for the patch size.

Please feel free to have another look.

Hi @rogerkuou, thanks! From your comment, I understood that you ran the job for 1 hour on purpose. I don't think that's the best way to assess performance, but we can leave it for now. About the open_mfdataset warnings and Zarr vs. nc: let's first make sure we understand the warnings; are they actually related to the file format? If so, could you make an issue for it?

I’ve added some comments and made a few issues, no need to run the job at this point. We can improve the slurm job step by step later. For now, let’s keep the PR open.


Development

Successfully merging this pull request may close these issues.

Test training on the two year dataset

2 participants