Yan Fang1,2,* · Mengcheng Lan2,3,* · Zilong Huang2,† · Weixian Lei2 · Yunqing Zhao2 · Yujie Zhong2 · Yingchen Yu2 · Qi She2 · Yao Zhao1 · Yunchao Wei1,†
Beijing Jiaotong University1 & ByteDance2 & Nanyang Technological University3
TL;DR: GenLIP -- lets ViT speak. We show that a strong MLLM vision encoder can be pretrained with just one Transformer and one autoregressive language modeling objective -- no contrastive loss, no dual-tower architecture, and no extra text decoder. Despite its simplicity, GenLIP scales effectively and performs well as a vision encoder in MLLMs, with particularly strong gains on Doc & OCR tasks.
- 2025-05-03: Code released. [✔]
# Clone the repository
git clone https://github.com/YanFangCS/GenLIP
cd GenLIP
# Install dependencies
pip install -r requirements.txt
pip install -e . # install veomni from this repo
Note: If you are using PyTorch >= 2.6.0, you need to install ByteCheckpoint manually:
git clone https://github.com/ByteDance-Seed/ByteCheckpoint.git
cd ByteCheckpoint
# Modify the torch version assert statement in bytecheckpoint/checkpointer/fsdp_checkpointer.py#L232-L234 to support torch >= 2.6.0
# assert "2.1.0" <= torch.__version__.strip()
pip install -e .
We use several caption datasets during pretraining:
Stage 1:
Stage 2:
- Infinity-MM (stage1 subset)
- BLIP3o-Pretrain-Long-Caption
Optional for Stage 2:
- CapRL-2M
- PLM-Image-Auto (caption subset only)
For Stage 1, training GenLIP with 1B seen samples is sufficient to obtain a strong vision encoder. For Stage 2, training GenLIP with Infinity-MM and BLIP3o-Long-Caption using NaViT is sufficient. Training with the two additional datasets (CapRL and PLM-Image-Auto) does not bring further performance gains, but we list them here as potential alternatives.
All datasets need to be downloaded and processed into suitable formats for pretraining. Please ensure your preprocessing function can correctly consume your data.
Below are example data formats:
# Stage 1 caption data
# sample keys: ['__key__', '__url__', 'jpg', '__local_path__', 'json']
json_content = {
    'caption': 'A modern coffee machine with a digital display and two white coffee cups filled with coffee is shown. The machine has a stainless steel finish and is accompanied by a milk frothing pitcher with a white liquid inside. The coffee machine is placed on a surface with a white background.'
}
# Stage 2 caption data
# sample keys: ['__key__', '__url__', 'jpg', '__local_path__', 'json']
json_content = {
    'conversation': [
        {
            'from': 'user',
            'value': '<image>Describe this image in detail.'
        },
        {
            'from': 'assistant',
            'value': 'The image depicts a serene waterfront scene with calm, slightly rippled water in the foreground...'
        }
    ]
}
You can also process the datasets into other formats as needed. To ensure training runs smoothly, check and modify the process_sample function implementation to match your data format.
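As a rough illustration of what adapting the preprocessing to the two example formats might involve, here is a minimal sketch that pulls the caption text out of either a Stage 1 or a Stage 2 `json` payload. The helper name `extract_caption` is hypothetical and is not the repository's `process_sample`; the real function's signature and return value may differ, so treat this only as a starting point.

```python
import json
from typing import Any, Dict, Union


def extract_caption(json_content: Union[bytes, str, Dict[str, Any]]) -> str:
    """Hypothetical helper: return the caption text of one sample.

    Accepts the 'json' field either as raw bytes/str or as an already
    decoded dict, and supports the two example layouts shown above.
    """
    if isinstance(json_content, (bytes, str)):
        json_content = json.loads(json_content)

    # Stage 1 layout: {'caption': '...'}
    if "caption" in json_content:
        return json_content["caption"].strip()

    # Stage 2 layout: {'conversation': [{'from': ..., 'value': ...}, ...]}
    if "conversation" in json_content:
        for turn in json_content["conversation"]:
            if turn.get("from") == "assistant":
                return turn["value"].strip()
        raise ValueError("conversation has no assistant turn")

    raise KeyError("expected a 'caption' or 'conversation' key")
```

If your data uses different keys or role names, this is the kind of place where the mapping would need to change.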
We provide three model configurations in configs/model_configs/genlip/:
- genlip_l16_224.json
- genlip_so16_224.json
- genlip_g16_224.json
Along with corresponding training configurations in configs/pretrain/genlip/:
- stage1/train_genlip_*_recap.yaml
- stage2/train_genlip_*_navit.yaml
You may need to modify model.config_path in the YAML config files to point to the correct model configuration.
Remember to update the dataset paths in the config files before starting training.
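Before editing paths, it can help to see which training configs are actually present in your checkout. The sketch below simply globs the two filename patterns listed above; the function name `list_train_configs` and the `STAGE_PATTERNS` table are illustrative assumptions, not part of the repository.

```python
from pathlib import Path
from typing import Dict, List

# Glob patterns copied from this README; everything else here is assumed.
STAGE_PATTERNS: Dict[str, str] = {
    "stage1": "stage1/train_genlip_*_recap.yaml",
    "stage2": "stage2/train_genlip_*_navit.yaml",
}


def list_train_configs(stage: str, root: str = "configs/pretrain/genlip") -> List[str]:
    """Hypothetical helper: return sorted config paths for one stage."""
    if stage not in STAGE_PATTERNS:
        raise ValueError(f"unknown stage: {stage!r}")
    matches = Path(root).glob(STAGE_PATTERNS[stage])
    return sorted(p.as_posix() for p in matches)
```

Running `list_train_configs("stage1")` from the repository root should list the Stage 1 YAML files whose `model.config_path` and dataset paths you may need to edit.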
A training script is provided in jobs/train.sh. You can start training with:
bash jobs/train.sh <main_func> <train_config>
# Stage 1 example:
bash jobs/train.sh tasks/train_genlip_stage1.py configs/pretrain/genlip/stage1/train_genlip_so16_224_recap.yaml
# Stage 2 example:
bash jobs/train.sh tasks/train_genlip_navit.py configs/pretrain/genlip/stage2/train_genlip_so16_navit.yaml
- <main_func>: the training script to execute (e.g., tasks/train_genlip_stage1.py for Stage 1, tasks/train_genlip_navit.py for Stage 2).
- <train_config>: the training configuration file to use.
All you need to do is set the paths and appropriate hyperparameters in the config files, then launch the script and wait for training to complete.
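If you prefer to launch runs from Python (for example, to queue several configs), the command line above can be assembled programmatically. This is only a sketch: `build_train_command` is a hypothetical helper, and the suffix checks are a light sanity test that we assume is reasonable, not something the training script itself enforces.

```python
import shlex
from typing import List


def build_train_command(
    main_func: str,
    train_config: str,
    launcher: str = "jobs/train.sh",
) -> List[str]:
    """Hypothetical helper: build `bash jobs/train.sh <main_func> <train_config>`."""
    if not main_func.endswith(".py"):
        raise ValueError(f"main_func should be a Python script: {main_func!r}")
    if not train_config.endswith((".yaml", ".yml")):
        raise ValueError(f"train_config should be a YAML file: {train_config!r}")
    return ["bash", launcher, main_func, train_config]


cmd = build_train_command(
    "tasks/train_genlip_stage1.py",
    "configs/pretrain/genlip/stage1/train_genlip_so16_224_recap.yaml",
)
# Print the shell-equivalent command for inspection before running it.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root once the config paths have been set.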
For multi-node training, we also provide jobs/train_multinode.sh and jobs/train_slurm_multinode.sh. You can modify them to fit your cluster setup and launch distributed training across multiple nodes.
The pretrained models are available on HuggingFace.
Our codebase is built upon:
- VeOmni: A simple and high-performance multi-modal model training framework developed by the ByteDance Seed team.
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
If you find this project helpful, please give us a star and cite our paper:
@article{fang2026letvitspeakgenerative,
title={Let ViT Speak: Generative Language-Image Pre-training},
author={Yan Fang and Mengcheng Lan and Zilong Huang and Weixian Lei and Yunqing Zhao and Yujie Zhong and Yingchen Yu and Qi She and Yao Zhao and Yunchao Wei},
journal={arXiv preprint arXiv:2605.00809},
year={2026}
}