PLMs for general lab usage.
- Multiple Input Formats: Support for both FASTA and CSV input files
- Flexible Configuration: Customizable model parameters, batch sizes, and output options
- Optimized for HPC: Ready-to-use SLURM scripts for Stanford's Sherlock cluster
- Multiple ESM Models: Support for ESM2 and ESM3 architectures
In each of the model-specific folders, you can find an sbatch script that you can submit to Slurm. Follow the README in each folder to run it with the correct arguments.
If you want to run the Python scripts on your machine directly, run the .py files in each of the model-specific folders.
You can set up a Conda/Mamba environment with all the dependencies using the requirements file requirements.txt:
micromamba create -n plms python=3.12
micromamba activate plms
pip install -r requirements.txt
Note: Python 3.12 is required for full feature support, including the ESM SDK >= 3.2.3. Currently (Oct 2025), CUDA <= 12.4 is required so that CUDA works with the NVIDIA drivers available on Sherlock (v520.61.05).
To install Flash Attention (recommended for speed, only for Ampere, Ada, or Hopper GPUs), run (should take <10 min):
# Compilation may require >100GB RAM on machines with many CPU cores
pip install flash-attn --no-build-isolation
If your machine has less than 96GB of RAM and many CPU cores, ninja might run too many parallel compilation jobs and exhaust the available RAM. To limit the number of parallel compilation jobs, set the environment variable MAX_JOBS:
MAX_JOBS=4 pip install flash-attn --no-build-isolation
All scripts support both FASTA and CSV input formats with automatic format detection.
Standard FASTA files with sequence headers and sequences:
>sequence_id_1
ACDEFGHIKLMNPQRSTVWY
>sequence_id_2
MKTAYIAKQRQISFVKSHFSRQLE
CSV files with customizable column names (defaults: id and sequence):
id,sequence,description
seq1,ACDEFGHIKLMNPQRSTVWY,My first protein
seq2,MKTAYIAKQRQISFVKSHFSRQLE,My second protein
Usage examples:
# FASTA input (auto-detected)
python ESM2/run_batch_esm2.py --input_file sequences.fasta --output_file output.pt
# CSV input with default columns (id, sequence)
python ESM2/run_batch_esm2.py --input_file sequences.csv --output_file output.pt
# CSV input with custom column names
python ESM2/run_batch_esm2.py \
--input_file data.csv \
--output_file output.pt \
--label_col "protein_id" \
--seq_col "aa_sequence"
The repository uses a flexible dataset architecture in the utils package:
- BatchedDataset: Abstract base class for all sequence datasets with efficient batching logic
- FastaBatchedDataset: Loads sequences from FASTA files
- CSVBatchedDataset: Loads sequences from CSV files with configurable columns
- load_dataset(): Convenience function that automatically detects the file format and returns the appropriate dataset
This architecture makes it easy to extend support to additional formats (JSON, Parquet, HDF5, etc.) by simply subclassing BatchedDataset.
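As an illustration of the subclassing pattern, here is a minimal, self-contained sketch of what a JSON-backed dataset could look like. The real BatchedDataset in utils/ may expose a different interface; the class names JSONBatchedDataset and the methods from_file and get_batch_indices shown here are assumptions, and the token-budget batching is a generic greedy scheme, not necessarily the repository's exact logic.

```python
import json
from abc import ABC


class BatchedDataset(ABC):
    """Minimal stand-in for the abstract base class in utils/ (interface assumed)."""

    def __init__(self, labels, sequences):
        self.labels = list(labels)
        self.sequences = list(sequences)

    def get_batch_indices(self, toks_per_batch):
        # Greedy length-sorted batching: group sequence indices so that each
        # batch stays at or under roughly toks_per_batch residues.
        order = sorted(range(len(self.sequences)),
                       key=lambda i: len(self.sequences[i]))
        batches, batch, toks = [], [], 0
        for i in order:
            n = len(self.sequences[i])
            if batch and toks + n > toks_per_batch:
                batches.append(batch)
                batch, toks = [], 0
            batch.append(i)
            toks += n
        if batch:
            batches.append(batch)
        return batches


class JSONBatchedDataset(BatchedDataset):
    """Hypothetical subclass: loads records like
    [{"id": ..., "sequence": ...}, ...] from a JSON file."""

    @classmethod
    def from_file(cls, path, label_key="id", seq_key="sequence"):
        with open(path) as f:
            records = json.load(f)
        return cls([r[label_key] for r in records],
                   [r[seq_key] for r in records])
```

A new format only needs to implement the loading step; the batching logic is inherited from the base class.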
All utility modules are organized in the utils/ directory for easy maintenance and extensibility.
Supported file extensions:
- FASTA: .fasta, .fa, .faa, .fna
- CSV: .csv
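The extension-based detection described above can be sketched as follows. This is a hypothetical mirror of the behavior attributed to load_dataset(), not the actual utils code; the function name detect_format is an assumption.

```python
from pathlib import Path

# Supported extensions, per the list above.
FASTA_EXTS = {".fasta", ".fa", ".faa", ".fna"}
CSV_EXTS = {".csv"}


def detect_format(path):
    """Return "fasta" or "csv" based on the file extension (case-insensitive)."""
    ext = Path(path).suffix.lower()
    if ext in FASTA_EXTS:
        return "fasta"
    if ext in CSV_EXTS:
        return "csv"
    raise ValueError(f"Unsupported file extension: {ext!r}")
```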
To create a new version, one must use a non-Sherlock machine, because Sherlock does not grant root access, which is needed to build intermediate Docker images. To release a new software version: tag the repo with the version, run make, and copy the output to our group space. Versioning uses the CalVer YYYY.MM.MICRO scheme: year, month, and monthly release number.
Example release, on a non-Sherlock machine:
git tag 2025.10.0 # YYYY.MM.MICRO change accordingly
make
scp /tmp/drorlabplms/drorlabplms_2025.10.0.sif <USERNAME>@dtn.sherlock.stanford.edu:/oak/stanford/groups/rondror/software/plms/singularity
To make the new version the default, update the symlink on a Sherlock machine:
cd /oak/stanford/groups/rondror/software/plms/singularity
ln -sf drorlabplms_2025.10.0.sif drorlabplms.sif