drorlab_PLMs

PLMs for general lab usage.

Features

  • Multiple Input Formats: Support for both FASTA and CSV input files
  • Flexible Configuration: Customizable model parameters, batch sizes, and output options
  • Optimized for HPC: Ready-to-use SLURM scripts for Stanford's Sherlock cluster
  • Multiple ESM Models: Support for ESM2 and ESM3 architectures

How to run

Using a container (recommended; no need to install anything)

Each of the LLM-specific folders contains an sbatch script that you can submit to Slurm. Follow the README in each folder to run it with the correct arguments.

Using Python / Conda

To run the Python scripts directly on your machine, use the .py files in each of the LLM-specific folders.

You can set up a Conda/Mamba environment and install all the dependencies from requirements.txt:

micromamba create -n plms python=3.12
micromamba activate plms
pip install -r requirements.txt

Note: Python 3.12 is required for full feature support, including ESM SDK >= 3.2.3. As of Oct 2025, use CUDA <= 12.4 so that CUDA works with the NVIDIA drivers available on Sherlock (v520.61.05).

To install Flash Attention (recommended for speed, only for Ampere, Ada, or Hopper GPUs), run (should take <10 min):

# Requires >100GB RAM in case of many CPUs
pip install flash-attn --no-build-isolation

If your machine has less than 96GB of RAM and many CPU cores, ninja may launch too many parallel compilation jobs and exhaust the available RAM. To limit the number of parallel compilation jobs, set the environment variable MAX_JOBS:

MAX_JOBS=4 pip install flash-attn --no-build-isolation

Input Formats

All scripts support both FASTA and CSV input formats with automatic format detection.

FASTA Format

Standard FASTA files with sequence headers and sequences:

>sequence_id_1
ACDEFGHIKLMNPQRSTVWY
>sequence_id_2
MKTAYIAKQRQISFVKSHFSRQLE
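
For readers unfamiliar with the format, the following is a minimal sketch of how such a file maps to (id, sequence) pairs. It is not the repository's loader; the function name `parse_fasta` is hypothetical, and it assumes that sequences may wrap over several lines and that blank lines can be ignored.

```python
from typing import Iterable, List, Optional, Tuple


def parse_fasta(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (header, sequence) pairs from an iterable of FASTA lines."""
    records: List[Tuple[str, str]] = []
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines (assumption)
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Sequence lines are concatenated until the next header.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>sequence_id_1
ACDEFGHIKLMNPQRSTVWY
>sequence_id_2
MKTAYIAKQRQISF
VKSHFSRQLE
"""
records = parse_fasta(example.splitlines())
```

In this sketch the second record is deliberately wrapped over two lines to show that the pieces are joined back into one sequence.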

CSV Format

CSV files with customizable column names (defaults: id and sequence):

id,sequence,description
seq1,ACDEFGHIKLMNPQRSTVWY,My first protein
seq2,MKTAYIAKQRQISFVKSHFSRQLE,My second protein

Usage examples:

# FASTA input (auto-detected)
python ESM2/run_batch_esm2.py --input_file sequences.fasta --output_file output.pt

# CSV input with default columns (id, sequence)
python ESM2/run_batch_esm2.py --input_file sequences.csv --output_file output.pt

# CSV input with custom column names
python ESM2/run_batch_esm2.py \
  --input_file data.csv \
  --output_file output.pt \
  --label_col "protein_id" \
  --seq_col "aa_sequence"
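
To illustrate what the column options select, here is a rough sketch of reading such a CSV with Python's standard `csv` module. The helper `read_csv_sequences` and its parameter names are hypothetical (chosen to mirror `--label_col` and `--seq_col`); the repository's own CSV loader may behave differently, for example in how it reports missing columns.

```python
import csv
import io
from typing import List, TextIO, Tuple


def read_csv_sequences(
    handle: TextIO, label_col: str = "id", seq_col: str = "sequence"
) -> List[Tuple[str, str]]:
    """Return (label, sequence) pairs from a CSV with a header row."""
    reader = csv.DictReader(handle)
    fieldnames = reader.fieldnames or []
    missing = [c for c in (label_col, seq_col) if c not in fieldnames]
    if missing:
        # Failing early gives a clearer message than a KeyError per row.
        raise ValueError(f"CSV is missing column(s): {missing}")
    # Extra columns (e.g. a description) are simply ignored.
    return [(row[label_col], row[seq_col]) for row in reader]


data = io.StringIO(
    "protein_id,aa_sequence,description\n"
    "seq1,ACDEFGHIKLMNPQRSTVWY,My first protein\n"
    "seq2,MKTAYIAKQRQISFVKSHFSRQLE,My second protein\n"
)
pairs = read_csv_sequences(data, label_col="protein_id", seq_col="aa_sequence")
```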

Batched Dataset Classes

The repository uses a flexible dataset architecture in the utils package:

  • BatchedDataset: Abstract base class for all sequence datasets with efficient batching logic
  • FastaBatchedDataset: Loads sequences from FASTA files
  • CSVBatchedDataset: Loads sequences from CSV files with configurable columns
  • load_dataset(): Convenience function that automatically detects file format and returns the appropriate dataset

This architecture makes it easy to extend support to additional formats (JSON, Parquet, HDF5, etc.) by simply subclassing BatchedDataset.

All utility modules are organized in the utils/ directory for easy maintenance and extensibility.
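
As an illustration of the subclassing idea, the sketch below adds a JSON-backed dataset on top of a stand-in base class. Everything here is an assumption: `SketchBatchedDataset`, its `load_records` hook, and the length-sorted `batches` method are hypothetical and are not the interface of the real `BatchedDataset` in utils. Check the actual base class for the methods a subclass must implement.

```python
import json
from abc import ABC, abstractmethod
from typing import Iterator, List, Tuple

Record = Tuple[str, str]  # (label, sequence)


class SketchBatchedDataset(ABC):
    """Hypothetical stand-in base class, NOT the repository's BatchedDataset."""

    def __init__(self) -> None:
        self.records: List[Record] = self.load_records()

    @abstractmethod
    def load_records(self) -> List[Record]:
        """Format-specific loading; the only method a subclass provides."""

    def batches(self, batch_size: int) -> Iterator[List[Record]]:
        # Sorting by length keeps similarly sized sequences together,
        # which reduces padding within a batch (an assumed strategy).
        ordered = sorted(self.records, key=lambda r: len(r[1]))
        for start in range(0, len(ordered), batch_size):
            yield ordered[start:start + batch_size]


class JSONBatchedDataset(SketchBatchedDataset):
    """Loads records from a JSON list of {"id": ..., "sequence": ...} objects."""

    def __init__(self, text: str) -> None:
        self.text = text
        super().__init__()

    def load_records(self) -> List[Record]:
        return [(item["id"], item["sequence"]) for item in json.loads(self.text)]


dataset = JSONBatchedDataset(
    '[{"id": "a", "sequence": "ACDEF"},'
    ' {"id": "b", "sequence": "AC"},'
    ' {"id": "c", "sequence": "ACD"}]'
)
batches = list(dataset.batches(batch_size=2))
```

The point of the design is that the batching logic lives once in the base class, so a new format only has to say how to produce (label, sequence) pairs.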

Supported file extensions:

  • FASTA: .fasta, .fa, .faa, .fna
  • CSV: .csv
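
The extension list above suggests a simple dispatch on the file suffix. The following is a sketch under that assumption; `detect_format` is a hypothetical helper, not the repository's `load_dataset()`, and the case-insensitive matching is also an assumption.

```python
from pathlib import Path

# Extensions taken from the list above.
FASTA_EXTENSIONS = {".fasta", ".fa", ".faa", ".fna"}
CSV_EXTENSIONS = {".csv"}


def detect_format(path: str) -> str:
    """Return "fasta" or "csv" based on the file extension."""
    suffix = Path(path).suffix.lower()  # case-insensitive (assumption)
    if suffix in FASTA_EXTENSIONS:
        return "fasta"
    if suffix in CSV_EXTENSIONS:
        return "csv"
    raise ValueError(f"Unsupported file extension: {suffix!r}")


formats = [detect_format(p) for p in ("sequences.fasta", "proteins.faa", "data.csv")]
```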

Creating a new Singularity image

To create a new version, one must use a non-Sherlock machine. This is because Sherlock does not give root access, which is needed to build intermediate Docker images. To release a new software version: tag the repo with the version, run make, and then copy the output to our group space. Versioning uses the CalVer YYYY.MM.MICRO syntax: year, month, and monthly release number.
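
To make the versioning scheme concrete, here is a small sketch that derives the next tag from a list of existing tags. The helper `next_calver_tag` is hypothetical and not part of the repository. It assumes that MICRO starts at 0 each month (as the 2025.10.0 example suggests) and that the month is written without zero padding, which is how the CalVer convention defines MM; neither detail is confirmed by this README.

```python
import datetime
import re
from typing import Iterable


def next_calver_tag(existing_tags: Iterable[str], today: datetime.date) -> str:
    """Return the next YYYY.MM.MICRO tag for the month of `today`."""
    prefix = f"{today.year}.{today.month}"  # no zero padding (assumption)
    pattern = re.compile(rf"^{re.escape(prefix)}\.(\d+)$")
    # Collect the MICRO numbers already used in this year and month.
    micros = [int(m.group(1)) for tag in existing_tags if (m := pattern.match(tag))]
    micro = max(micros) + 1 if micros else 0  # first release of a month is 0 (assumption)
    return f"{prefix}.{micro}"


tags = ["2025.9.3", "2025.10.0", "2025.10.1"]
tag = next_calver_tag(tags, datetime.date(2025, 10, 15))
```

In practice the list of existing tags could come from the output of `git tag`.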

Example release, on a non-Sherlock machine:

git tag 2025.10.0 # YYYY.MM.MICRO change accordingly
make
scp /tmp/drorlabplms/drorlabplms_2025.10.0.sif <USERNAME>@dtn.sherlock.stanford.edu:/oak/stanford/groups/rondror/software/plms/singularity

To make the new version the default, update the symlink on a Sherlock machine:

cd /oak/stanford/groups/rondror/software/plms/singularity
ln -sf drorlabplms_2025.10.0.sif drorlabplms.sif
