PLMs for general lab usage.
- Multiple Input Formats: Support for both FASTA and CSV input files
- Flexible Configuration: Customizable model parameters, batch sizes, and output options
- Optimized for HPC: Ready-to-use SLURM scripts for Stanford's Sherlock cluster
- Multiple ESM Models: Support for ESM2 and ESM3 architectures
In each of the model-specific folders, you can find an sbatch script that you can submit to Slurm. Follow the README in each folder to run it with the correct arguments.
If you want to run the Python scripts on your machine directly, run the .py files in each of the model-specific folders.
You can set up a Conda/Mamba environment with all the dependencies using the requirements file requirements.txt:
micromamba create -n plms python=3.12
micromamba activate plms
pip install -r requirements.txt
Note: Python 3.12 is required for full feature support, including the ESM SDK >= 3.2.3. Currently (Oct 2025), CUDA <= 12.4 is required so that CUDA works with the NVIDIA drivers available on Sherlock (v520.61.05).
To install Flash Attention (recommended for speed, only for Ampere, Ada, or Hopper GPUs), run (should take <10 min):
# Compilation may require >100GB RAM on machines with many CPU cores
pip install flash-attn --no-build-isolation
If your machine has less than 96GB of RAM and many CPU cores, ninja might run too many parallel compilation jobs and exhaust the available RAM. To limit the number of parallel compilation jobs, set the environment variable MAX_JOBS:
MAX_JOBS=4 pip install flash-attn --no-build-isolation
All scripts support both FASTA and CSV input formats with automatic format detection.
Standard FASTA files with sequence headers and sequences:
>sequence_id_1
ACDEFGHIKLMNPQRSTVWY
>sequence_id_2
MKTAYIAKQRQISFVKSHFSRQLE
CSV files with customizable column names (defaults: id and sequence):
id,sequence,description
seq1,ACDEFGHIKLMNPQRSTVWY,My first protein
seq2,MKTAYIAKQRQISFVKSHFSRQLE,My second protein
Usage examples:
# FASTA input (auto-detected)
python ESM2/run_batch_esm2.py --input_file sequences.fasta --output_file output.pt
# CSV input with default columns (id, sequence)
python ESM2/run_batch_esm2.py --input_file sequences.csv --output_file output.pt
# CSV input with custom column names
python ESM2/run_batch_esm2.py \
--input_file data.csv \
--output_file output.pt \
--label_col "protein_id" \
--seq_col "aa_sequence"
The repository uses a flexible dataset architecture in the utils package:
- BatchedDataset: Abstract base class for all sequence datasets with efficient batching logic
- FastaBatchedDataset: Loads sequences from FASTA files
- CSVBatchedDataset: Loads sequences from CSV files with configurable columns
- load_dataset(): Convenience function that automatically detects the file format and returns the appropriate dataset
This architecture makes it easy to extend support to additional formats (JSON, Parquet, HDF5, etc.) by simply subclassing BatchedDataset.
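As an illustration of the subclassing pattern, here is a minimal, self-contained sketch of what a JSON-backed dataset could look like. The real BatchedDataset in utils/ may expose a different interface; the class names JSONBatchedDataset and the methods from_file and get_batch_indices shown here are assumptions, and the token-budget batching is a generic greedy scheme, not necessarily the repository's exact logic.

```python
import json
from abc import ABC


class BatchedDataset(ABC):
    """Minimal stand-in for the abstract base class in utils/ (interface assumed)."""

    def __init__(self, labels, sequences):
        self.labels = list(labels)
        self.sequences = list(sequences)

    def get_batch_indices(self, toks_per_batch):
        # Greedy length-sorted batching: group sequence indices so that each
        # batch stays at or under roughly toks_per_batch residues.
        order = sorted(range(len(self.sequences)),
                       key=lambda i: len(self.sequences[i]))
        batches, batch, toks = [], [], 0
        for i in order:
            n = len(self.sequences[i])
            if batch and toks + n > toks_per_batch:
                batches.append(batch)
                batch, toks = [], 0
            batch.append(i)
            toks += n
        if batch:
            batches.append(batch)
        return batches


class JSONBatchedDataset(BatchedDataset):
    """Hypothetical subclass: loads records like
    [{"id": ..., "sequence": ...}, ...] from a JSON file."""

    @classmethod
    def from_file(cls, path, label_key="id", seq_key="sequence"):
        with open(path) as f:
            records = json.load(f)
        return cls([r[label_key] for r in records],
                   [r[seq_key] for r in records])
```

A new format only needs to implement the loading step; the batching logic is inherited from the base class.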
All utility modules are organized in the utils/ directory for easy maintenance and extensibility.
Supported file extensions:
- FASTA: .fasta, .fa, .faa, .fna
- CSV: .csv
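The extension-based detection described above can be sketched as follows. This is a hypothetical mirror of the behavior attributed to load_dataset(), not the actual utils code; the function name detect_format is an assumption.

```python
from pathlib import Path

# Supported extensions, per the list above.
FASTA_EXTS = {".fasta", ".fa", ".faa", ".fna"}
CSV_EXTS = {".csv"}


def detect_format(path):
    """Return "fasta" or "csv" based on the file extension (case-insensitive)."""
    ext = Path(path).suffix.lower()
    if ext in FASTA_EXTS:
        return "fasta"
    if ext in CSV_EXTS:
        return "csv"
    raise ValueError(f"Unsupported file extension: {ext!r}")
```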
To create a new version, one must use a non-Sherlock machine, because Sherlock does not grant root access, which is needed to build intermediate Docker images. To release a new software version: tag the repo with the version, run make, and copy the output to our group space. Versioning uses the CalVer YYYY.MM.MICRO scheme: year, month, and monthly release number.
Example release, on a non-Sherlock machine:
git tag 2025.10.0 # YYYY.MM.MICRO change accordingly
make
scp /tmp/drorlabplms/drorlabplms_2025.10.0.sif <USERNAME>@dtn.sherlock.stanford.edu:/oak/stanford/groups/rondror/software/plms/singularity
To make the new version the default, update the symlink on a Sherlock machine:
cd /oak/stanford/groups/rondror/software/plms/singularity
ln -sf drorlabplms_2025.10.0.sif drorlabplms.sif