Let ViT Speak: Generative Language-Image Pre-training

Yan Fang^1,2,* · Mengcheng Lan^2,3,* · Zilong Huang^2,† · Weixian Lei² · Yunqing Zhao² · Yujie Zhong² · Yingchen Yu² · Qi She² · Yao Zhao¹ · Yunchao Wei^1,†

Beijing Jiaotong University¹ & ByteDance² & Nanyang Technological University³

TL;DR: GenLIP lets ViT speak. We show that a strong MLLM vision encoder can be pretrained with just one Transformer and one autoregressive language modeling objective: no contrastive loss, no dual-tower architecture, and no extra text decoder. Despite its simplicity, GenLIP scales effectively and performs well as a vision encoder in MLLMs, with particularly strong gains on Doc & OCR tasks.

News

2025-05-03: Code released. [✔]

Getting Started

Installation

# Clone the repository
git clone https://github.com/YanFangCS/GenLIP
cd GenLIP

# Install dependencies
pip install -r requirements.txt
pip install -e .   # install veomni from this repo

Note: If you are using PyTorch >= 2.6.0, you need to install ByteCheckpoint manually:

git clone https://github.com/ByteDance-Seed/ByteCheckpoint.git
cd ByteCheckpoint
# Modify the torch version assert statement in bytecheckpoint/checkpointer/fsdp_checkpointer.py#L232-L234 to support torch >= 2.6.0
# assert "2.1.0" <= torch.__version__.strip()
pip install -e .
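The assert shown in the comment above compares version strings lexically, and lexical order is not numeric order (for example, "2.10.0" sorts before "2.6.0"). The sketch below is one way to decide whether the note applies to your environment by comparing numeric tuples instead. The helper names `version_tuple` and `needs_manual_bytecheckpoint` are hypothetical and are not part of ByteCheckpoint or this repo.

```python
import re


def version_tuple(version: str) -> tuple:
    """Turn a string such as '2.6.0+cu124' into (2, 6, 0).

    Hypothetical helper: local build tags and pre-release suffixes are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def needs_manual_bytecheckpoint(torch_version: str) -> bool:
    """True when the note above applies, i.e. PyTorch >= 2.6.0."""
    return version_tuple(torch_version) >= (2, 6, 0)
```

In a real environment you would pass `torch.__version__` to `needs_manual_bytecheckpoint`.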

Datasets

Data Source

We use several caption datasets during pretraining:

Stage 1:

Recap-DataComp-1B

Stage 2:

Infinity-MM (stage1 subset)
BLIP3o-Pretrain-Long-Caption

Optional for Stage 2:

CapRL-2M
PLM-Image-Auto (caption subset only)

For Stage 1, training GenLIP on 1B seen samples is sufficient to obtain a strong vision encoder. For Stage 2, training GenLIP on Infinity-MM and BLIP3o-Pretrain-Long-Caption using NaViT is sufficient. Adding the two optional datasets (CapRL-2M and PLM-Image-Auto) does not bring further performance gains, but we list them here as potential alternatives.

Data Format

All datasets need to be downloaded and processed into suitable formats for pretraining. Please ensure your preprocessing function can correctly consume your data.

Below are example data formats:

# Stage 1 caption data
# sample keys: ['__key__', '__url__', 'jpg', '__local_path__', 'json']
json_content = {
  'caption': 'A modern coffee machine with a digital display and two white coffee cups filled with coffee is shown. The machine has a stainless steel finish and is accompanied by a milk frothing pitcher with a white liquid inside. The coffee machine is placed on a surface with a white background.'
}

# Stage 2 caption data
# sample keys: ['__key__', '__url__', 'jpg', '__local_path__', 'json']
json_content = {
  'conversation': [
    {
      'from': 'user',
      'value': '<image>Describe this image in detail.'
    },
    {
      'from': 'assistant',
      'value': 'The image depicts a serene waterfront scene with calm, slightly rippled water in the foreground...'
    }
  ]
}

You can also process the datasets into other formats as needed. To ensure training runs smoothly, check and modify the process_sample function implementation to match your data format.
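As a rough illustration of handling both example formats, the sketch below pulls the caption text out of a sample. It is not the repo's `process_sample`. The helper name `extract_caption` is hypothetical, and it assumes the `json` field holds either a parsed dict or serialized JSON (str or bytes).

```python
import json


def extract_caption(sample: dict) -> str:
    """Return the caption text from a Stage 1 or Stage 2 style sample."""
    content = sample["json"]
    # Assumption: some loaders hand over raw bytes or str instead of a dict.
    if isinstance(content, (bytes, bytearray)):
        content = content.decode("utf-8")
    if isinstance(content, str):
        content = json.loads(content)

    if "caption" in content:  # Stage 1 layout
        return content["caption"].strip()

    if "conversation" in content:  # Stage 2 layout
        for turn in content["conversation"]:
            if turn.get("from") == "assistant":
                return turn["value"].strip()
        raise ValueError("conversation has no assistant turn")

    raise KeyError("expected a 'caption' or 'conversation' field")
```

Failing loudly on an unknown layout makes a format mismatch visible at the first bad sample rather than silently training on empty captions.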

Configuration

We provide three model configurations in configs/model_configs/genlip/:

genlip_l16_224.json
genlip_so16_224.json
genlip_g16_224.json

Along with corresponding training configurations in configs/pretrain/genlip/:

stage1/train_genlip_*_recap.yaml
stage2/train_genlip_*_navit.yaml

You may need to modify model.config_path in the YAML config files to point to the correct model configuration.

Remember to update the dataset paths in the config files before starting training.
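To see which training configs exist before editing them, a small helper like the one below can expand the two filename patterns listed above. This is a sketch using only the standard library. The function name `list_train_configs` and the `repo_root` argument are hypothetical, and the directory layout is taken from the paths in this README.

```python
from pathlib import Path

# Filename patterns as listed in this README, relative to configs/pretrain/genlip/.
STAGE_PATTERNS = {
    "stage1": "stage1/train_genlip_*_recap.yaml",
    "stage2": "stage2/train_genlip_*_navit.yaml",
}


def list_train_configs(repo_root, stage):
    """Return the sorted training config paths found for one stage."""
    config_dir = Path(repo_root) / "configs" / "pretrain" / "genlip"
    return sorted(config_dir.glob(STAGE_PATTERNS[stage]))
```

Calling `list_train_configs(".", "stage1")` from the repository root returns the Stage 1 configs as `Path` objects, or an empty list if none match.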

Training

A training script is provided in jobs/train.sh. You can start training with:

bash jobs/train.sh <main_func> <train_config>

# Stage 1 example:
bash jobs/train.sh tasks/train_genlip_stage1.py configs/pretrain/genlip/stage1/train_genlip_so16_224_recap.yaml

# Stage 2 example:
bash jobs/train.sh tasks/train_genlip_navit.py configs/pretrain/genlip/stage2/train_genlip_so16_navit.yaml

<main_func>: the training script to execute (e.g., tasks/train_genlip_stage1.py for Stage 1, tasks/train_genlip_navit.py for Stage 2).
<train_config>: the training configuration file to use.

All you need to do is set the paths and appropriate hyperparameters in the config files, then launch the script and wait for training to complete.
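If you launch runs from Python (for example, in a sweep script), the command line above can be assembled programmatically. The sketch below only builds the argument list. The stage-to-script mapping is copied from the examples above, while the helper name `build_train_command` is hypothetical.

```python
import shlex

# Entry points as given in the Stage 1 and Stage 2 examples.
STAGE_ENTRYPOINTS = {
    "stage1": "tasks/train_genlip_stage1.py",
    "stage2": "tasks/train_genlip_navit.py",
}


def build_train_command(stage, train_config):
    """Build the `bash jobs/train.sh <main_func> <train_config>` argument list."""
    return ["bash", "jobs/train.sh", STAGE_ENTRYPOINTS[stage], train_config]


cmd = build_train_command(
    "stage1",
    "configs/pretrain/genlip/stage1/train_genlip_so16_224_recap.yaml",
)
print(shlex.join(cmd))  # shell-ready form of the command, for logging
```

The list can be passed to `subprocess.run(cmd, check=True)` from the repository root. Passing a list rather than a single string avoids shell-quoting problems when config paths contain spaces.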

For multi-node training, we also provide jobs/train_multinode.sh and jobs/train_slurm_multinode.sh. You can modify them to fit your cluster setup and launch distributed training across multiple nodes.

Model Checkpoints

The pretrained models are available on HuggingFace.

Acknowledgments

Our codebase is built upon:

VeOmni: A simple and high-performance multi-modal model training framework developed by the ByteDance Seed team.

License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.

Citation

If you find this project helpful, please give us a star and cite our paper:

@article{fang2026letvitspeakgenerative,
  title={Let ViT Speak: Generative Language-Image Pre-training}, 
  author={Yan Fang and Mengcheng Lan and Zilong Huang and Weixian Lei and Yunqing Zhao and Yujie Zhong and Yingchen Yu and Qi She and Yao Zhao and Yunchao Wei},
  journal={arXiv preprint arXiv:2605.00809},
  year={2026}
}
