Official implementation of our CVPR'26 paper, "A Multi-Agent Perception-Action Alliance for Efficient Long Video Reasoning".
- Python 3.9.19
Required modules tested in this repo:
torch==2.4.0+cu121
torchvision==0.15.2+cu117
transformers==4.51.3
accelerate==1.7.0
safetensors==0.4.5
tokenizers==0.21.1
sentencepiece==0.1.99
decord==0.6.0
pandas==2.2.3
numpy==1.26.4
Pillow==9.5.0
tqdm==4.66.5
opencv-python==4.11.0
ruptures==1.1.10
requests==2.31.0
packaging==24.1
ftfy==6.3.1
regex==2.5.147
einops==0.6.1
boto3==1.42.35
botocore==1.42.35
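To confirm that an environment matches the pins above, a small check like the following may help. It is only a sketch: the `PINNED` dictionary and the `check_pins` helper are illustrative names that are not part of this repo, and only a few of the pins are copied in.

```python
from importlib.metadata import PackageNotFoundError, version

# A few pins copied from the list above; extend as needed.
# (PINNED and check_pins are hypothetical helpers, not part of this repo.)
PINNED = {
    "transformers": "4.51.3",
    "accelerate": "1.7.0",
    "decord": "0.6.0",
    "numpy": "1.26.4",
}


def check_pins(pins, get_version=version):
    """Return {package: (expected, found)} for each mismatch; found is None if missing."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            found = None
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

The comparison is an exact string match, so a build with a local suffix (for example `+cu121`) must be pinned with that suffix.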
- `nextqa_pipeline.py`
  - Update dataset annotation path: `../vidagent/dataset/NExTVideo/test.csv`
  - Update video root: `video_root="../vidagent/dataset/NExTVideo/videos"`
  - Update output file if needed: `result_path="nextqa_results.json"`
- `egoschema_pipeline.py`
  - Update dataset annotation path: `../vidagent/dataset/egoschema/test.csv`
  - Update video root: `video_root="../vidagent/dataset/egoschema/videos"`
  - Update output file if needed: `result_path="egoschema_results.json"`
- `mlvu_pipeline.py`
  - Update dataset annotation path: `../vidagent/dataset/MLVU/test.csv`
  - Update video root: `video_root="../vidagent/dataset/MLVU/videos"`
  - Update output file if needed: `result_path="mlvu_results.json"`
- `cutting_points.py`
  - Replace the absolute cache/env defaults with paths valid on your machine: `TORCH_HOME`, `HUGGINGFACE_HUB_CACHE`, `KERAS_HOME`
- `pipeline_runtime.py`
  - Ensure the ASP-CLIP checkpoint path is valid: `model_path="pytorch_model_0.0011.bin.25"` in `AspClipSelector`.
  - If the checkpoint is stored elsewhere, update that path.
- `run.sbatch` (if using SLURM)
  - Update cluster-specific fields: GPU type/count, partition/QOS, walltime, memory.
  - Update environment name: `conda activate py39`.
  - Update cache path export: `HF_HOME=...`.
- `eval.py`
  - Update input JSON path.
- Dataset files: questions are usually stored in `test.csv` (except for LongVideoBench), and video files are stored in the `videos` folder.
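Before launching a long run, it can be useful to verify that the paths described above actually exist. The sketch below assumes the `test.csv` plus `videos` layout (so it does not apply to LongVideoBench) and uses a hypothetical helper, `check_layout`, that is not part of this repo.

```python
from pathlib import Path


def check_layout(dataset_root, checkpoint=None):
    """Return a list of problems; an empty list means the layout looks usable.

    Hypothetical helper: checks for <root>/test.csv, a non-empty <root>/videos
    folder, and (optionally) a checkpoint file.
    """
    root = Path(dataset_root)
    problems = []
    if not (root / "test.csv").is_file():
        problems.append(f"missing annotation file: {root / 'test.csv'}")
    videos = root / "videos"
    if not videos.is_dir():
        problems.append(f"missing video folder: {videos}")
    elif not any(videos.iterdir()):
        problems.append(f"video folder is empty: {videos}")
    if checkpoint is not None and not Path(checkpoint).is_file():
        problems.append(f"missing checkpoint: {checkpoint}")
    return problems


if __name__ == "__main__":
    # Default paths taken from the configuration notes above.
    for name in ("NExTVideo", "egoschema", "MLVU"):
        root = f"../vidagent/dataset/{name}"
        for problem in check_layout(root, "pytorch_model_0.0011.bin.25"):
            print(problem)
```

Returning a list of messages rather than raising on the first failure lets all missing paths be fixed in one pass.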
- `nextqa_pipeline.py`
- `egoschema_pipeline.py`
- `mlvu_pipeline.py`
- `pipeline_runtime.py`: shared multi-round debate pipeline engine (all benchmark logic)
- `model_backends.py`: model client classes (InternVL and Qwen2.5-VL)
- `prompt_templates.py`: benchmark prompt templates and history prompts
- `vision_io.py`: image/video preprocessing and `process_vision_info`
- `cutting_points.py`: event-based video partition and output cutting points
- `modules/` and `pytorch_model_0.0011.bin.25`: ASP-CLIP code + weights
- `llava/`: local LLaVA implementation used by backends
Our runtime results are stored in `results/`. Because the code was optimized after the experiments were run, the output structure produced by the current code may differ slightly from the JSON files in `results/`; the accuracy, however, should match. The JSON files in `results/` are exactly those analyzed in the paper's experiments. At present, `nextqa_results.json` is the output of the current code, and we are still validating the other benchmarks.
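To compare a fresh run against a stored file when the two structures differ, one option is to reduce both to a single accuracy number. The sketch below is not the repo's `eval.py`. The record schema (a list, or a dict keyed by question id, of entries with `pred` and `answer` fields) is an assumption, so adjust the key names to whatever the JSON files actually contain.

```python
import json


def load_records(path):
    """Load result records from a JSON file (schema assumed, see lead-in)."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # Accept either a list of records or a dict keyed by question id.
    return list(data.values()) if isinstance(data, dict) else data


def accuracy(records, pred_key="pred", answer_key="answer"):
    """Fraction of records whose prediction equals the ground-truth answer.

    pred_key and answer_key are hypothetical field names.
    """
    if not records:
        return 0.0
    correct = sum(1 for r in records if r[pred_key] == r[answer_key])
    return correct / len(records)
```

For example, `accuracy(load_records("nextqa_results.json"))` would give the fraction of correct answers if the file follows the assumed schema.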
cd A4VL
python main.py --list
python main.py nextqa
python main.py ego
python main.py mlvu
python main.py all
python eval.py