YanFangCS/GenLIP

Let ViT Speak: Generative Language-Image Pre-training

Yan Fang1,2,* · Mengcheng Lan2,3,* · Zilong Huang2,† · Weixian Lei2 · Yunqing Zhao2 · Yujie Zhong2 · Yingchen Yu2 · Qi She2 · Yao Zhao1 · Yunchao Wei1,†

Beijing Jiaotong University1 & ByteDance2 & Nanyang Technological University3

TL;DR: GenLIP lets ViT speak. We show that a strong MLLM vision encoder can be pretrained with just one Transformer and one autoregressive language modeling objective: no contrastive loss, no dual-tower architecture, and no extra text decoder. Despite its simplicity, GenLIP scales effectively and performs well as a vision encoder in MLLMs, with particularly strong gains on Doc & OCR tasks.

News

  • 2025-05-03: Code released. [✔]

Getting Started

Installation

# Clone the repository
git clone https://github.com/YanFangCS/GenLIP
cd GenLIP

# Install dependencies
pip install -r requirements.txt
pip install -e .   # install veomni from this repo

Note: If you are using PyTorch >= 2.6.0, you need to install ByteCheckpoint manually:

git clone https://github.com/ByteDance-Seed/ByteCheckpoint.git
cd ByteCheckpoint
# Modify the torch version assert statement in bytecheckpoint/checkpointer/fsdp_checkpointer.py#L232-L234 to support torch >= 2.6.0
# assert "2.1.0" <= torch.__version__.strip()
pip install -e .
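
The assert quoted above compares version strings directly, and Python orders strings character by character, so "2.10.0" sorts before "2.6.0". As a minimal sketch, assuming you only want to decide whether the manual ByteCheckpoint install applies to your environment, the check can be made on parsed integer components instead. The helper names `parse_version` and `needs_manual_bytecheckpoint` are hypothetical and are not part of GenLIP or ByteCheckpoint.

```python
import re


def parse_version(version: str) -> tuple:
    """Turn a version string such as '2.6.0+cu124' into a tuple of ints, e.g. (2, 6, 0)."""
    # Keep only the leading dotted-number part; local tags like '+cu124' are ignored.
    match = re.match(r"\d+(?:\.\d+)*", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    parts = tuple(int(part) for part in match.group(0).split("."))
    # Pad short versions ('2.6' -> (2, 6, 0)) so tuple comparison behaves as expected.
    return parts + (0,) * max(0, 3 - len(parts))


def needs_manual_bytecheckpoint(torch_version: str) -> bool:
    """Return True when the note above applies, i.e. PyTorch >= 2.6.0 (hypothetical helper)."""
    return parse_version(torch_version) >= (2, 6, 0)
```

With PyTorch installed, `needs_manual_bytecheckpoint(torch.__version__)` reports whether the manual install step is relevant.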

Datasets

Data Source

We use several caption datasets during pretraining:

Stage 1:

Stage 2:

  • Infinity-MM
  • BLIP3o-Long-Caption

Optional for Stage 2:

  • CapRL
  • PLM-Image-Auto

For Stage 1, training GenLIP with 1B seen samples is sufficient to obtain a strong vision encoder. For Stage 2, training GenLIP with Infinity-MM and BLIP3o-Long-Caption using NaViT is sufficient. Training with the two additional datasets (CapRL and PLM-Image-Auto) does not bring further performance gains, but we list them here as potential alternatives.

Data Format

All datasets need to be downloaded and processed into suitable formats for pretraining. Please ensure your preprocessing function can correctly consume your data.

Below are example data formats:

# Stage 1 caption data
# sample keys: ['__key__', '__url__', 'jpg', '__local_path__', 'json']
json_content = {
  'caption': 'A modern coffee machine with a digital display and two white coffee cups filled with coffee is shown. The machine has a stainless steel finish and is accompanied by a milk frothing pitcher with a white liquid inside. The coffee machine is placed on a surface with a white background.'
}

# Stage 2 caption data
# sample keys: ['__key__', '__url__', 'jpg', '__local_path__', 'json']
json_content = {
  'conversation': [
    {
      'from': 'user',
      'value': '<image>Describe this image in detail.'
    },
    {
      'from': 'assistant',
      'value': 'The image depicts a serene waterfront scene with calm, slightly rippled water in the foreground...'
    }
  ]
}

You can also process the datasets into other formats as needed. To ensure training runs smoothly, check and modify the process_sample function implementation to match your data format.
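
To illustrate the two layouts above, the following minimal sketch pulls the caption text out of either format. It assumes the `json` field has already been decoded into a Python dict. `extract_caption` and `extract_prompt` are hypothetical helpers, not the repository's `process_sample`, whose actual signature and return value should be checked in the code.

```python
IMAGE_TOKEN = "<image>"  # placeholder seen in the Stage 2 example above


def extract_caption(json_content: dict) -> str:
    """Return the caption text from a Stage 1 or Stage 2 style record."""
    # Stage 1: a flat record with a single 'caption' string.
    if "caption" in json_content:
        return json_content["caption"].strip()

    # Stage 2: a conversation; the first assistant turn is taken as the caption.
    for turn in json_content.get("conversation", []):
        if turn.get("from") == "assistant":
            return turn["value"].strip()

    raise KeyError("record has neither 'caption' nor an assistant turn")


def extract_prompt(json_content: dict) -> str:
    """Return the first user prompt of a Stage 2 record, without the image placeholder."""
    for turn in json_content.get("conversation", []):
        if turn.get("from") == "user":
            return turn["value"].replace(IMAGE_TOKEN, "").strip()
    return ""  # Stage 1 records carry no prompt
```

If your data uses different keys, only the two lookups need to change; the same idea applies inside your own `process_sample`.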

Configuration

We provide three model configurations in configs/model_configs/genlip/:

  • genlip_l16_224.json
  • genlip_so16_224.json
  • genlip_g16_224.json

Along with corresponding training configurations in configs/pretrain/genlip/:

  • stage1/train_genlip_*_recap.yaml
  • stage2/train_genlip_*_navit.yaml

You may need to modify model.config_path in the YAML config files to point to the correct model configuration.

Remember to update the dataset paths in the config files before starting training.
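
Before launching a run, it can help to confirm that the files named above are where the configs expect them. The sketch below is a hypothetical preflight check, not part of the repository. It uses only the directory names and file patterns listed in this section and assumes it is given the path to your GenLIP checkout.

```python
from pathlib import Path

MODEL_CONFIG_DIR = Path("configs/model_configs/genlip")
TRAIN_CONFIG_DIR = Path("configs/pretrain/genlip")
MODEL_CONFIGS = ["genlip_l16_224.json", "genlip_so16_224.json", "genlip_g16_224.json"]
# Glob patterns taken from the list of training configurations above.
TRAIN_PATTERNS = {
    1: "stage1/train_genlip_*_recap.yaml",
    2: "stage2/train_genlip_*_navit.yaml",
}


def missing_model_configs(repo_root) -> list:
    """Return the model config file names that are not present in the checkout."""
    base = Path(repo_root) / MODEL_CONFIG_DIR
    return [name for name in MODEL_CONFIGS if not (base / name).is_file()]


def find_training_configs(repo_root, stage: int) -> list:
    """Return the sorted training config paths found for the given stage (1 or 2)."""
    if stage not in TRAIN_PATTERNS:
        raise ValueError(f"stage must be 1 or 2, got {stage!r}")
    return sorted((Path(repo_root) / TRAIN_CONFIG_DIR).glob(TRAIN_PATTERNS[stage]))
```

This only checks that files exist; the `model.config_path` value and the dataset paths inside each YAML file still have to be reviewed by hand.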

Training

A training script is provided in jobs/train.sh. You can start training with:

bash jobs/train.sh <main_func> <train_config>

# Stage 1 example:
bash jobs/train.sh tasks/train_genlip_stage1.py configs/pretrain/genlip/stage1/train_genlip_so16_224_recap.yaml

# Stage 2 example:
bash jobs/train.sh tasks/train_genlip_navit.py configs/pretrain/genlip/stage2/train_genlip_so16_navit.yaml

The arguments are:

  • <main_func>: the training script to execute (e.g., tasks/train_genlip_stage1.py for Stage 1, tasks/train_genlip_navit.py for Stage 2).
  • <train_config>: the training configuration file to use.

All you need to do is set the paths and appropriate hyperparameters in the config files, then launch the script and wait for training to complete.
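
If you drive several runs from Python, for example a sweep over configs, the launch command can be assembled programmatically. This is a minimal sketch that only wraps the `bash jobs/train.sh <main_func> <train_config>` call shown above. `build_train_command` and `launch` are hypothetical helpers, and the script paths are the two named in this section.

```python
import shlex
import subprocess

# Training entry points named above, keyed by stage.
STAGE_SCRIPTS = {
    1: "tasks/train_genlip_stage1.py",
    2: "tasks/train_genlip_navit.py",
}


def build_train_command(stage: int, train_config: str) -> list:
    """Build the argument list for `bash jobs/train.sh <main_func> <train_config>`."""
    if stage not in STAGE_SCRIPTS:
        raise ValueError(f"stage must be 1 or 2, got {stage!r}")
    return ["bash", "jobs/train.sh", STAGE_SCRIPTS[stage], train_config]


def launch(stage: int, train_config: str, repo_root: str = ".") -> None:
    """Run the training script from the repository root and raise if it fails."""
    cmd = build_train_command(stage, train_config)
    print("Launching:", shlex.join(cmd))
    subprocess.run(cmd, cwd=repo_root, check=True)
```

For instance, `launch(1, "configs/pretrain/genlip/stage1/train_genlip_so16_224_recap.yaml")` issues the same command as the Stage 1 example above.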

For multi-node training, we also provide jobs/train_multinode.sh and jobs/train_slurm_multinode.sh. You can modify them to fit your cluster setup and launch distributed training across multiple nodes.

Model Checkpoints

The pretrained models are available on HuggingFace.

Acknowledgments

Our codebase is built upon:

  • VeOmni: A simple and high-performance multi-modal model training framework developed by the ByteDance Seed team.

License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.

Citation

If you find this project helpful, please give us a star and cite our paper:

@article{fang2026letvitspeakgenerative,
  title={Let ViT Speak: Generative Language-Image Pre-training}, 
  author={Yan Fang and Mengcheng Lan and Zilong Huang and Weixian Lei and Yunqing Zhao and Yujie Zhong and Yingchen Yu and Qi She and Yao Zhao and Yunchao Wei},
  journal={arXiv preprint arXiv:2605.00809},
  year={2026}
}

About

Official repo for "Let ViT Speak: Generative Language-Image Pre-training"
