Authors: Yifan Jiang, Dae Yon Hwang, Jesse C. Cresswell, Freda Shi
Chartographer is a counterfactual chart generation pipeline for evaluating whether vision-language models answer chart questions through visual reasoning rather than shortcuts or prior familiarity with a chart.
It converts chart QA examples into counterfactual chart-question families: the original chart, a base reconstruction, and seed-controlled counterfactual variants whose answers are recomputed with executable QA logic.
Python 3.10+ is recommended.
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Set API keys for the providers you plan to use:
export OPENAI_API_KEY=your_api_key_here
export ANTHROPIC_API_KEY=your_api_key_here
For local Hugging Face VLMs, install any hardware-specific packages separately. To prefer local model weights, set:
export CHARTOGRAPHER_MODEL_WEIGHTS_DIR=/path/to/model-weights
Datasets are configured with a JSON file. Set CHARTOGRAPHER_DATASETS_FILE before running Chartographer. See examples/datasets.example.json for the datasets used in the paper and templates for custom datasets.
Minimal local dataset config:
{
  "datasets": {
    "my_dataset": {
      "local_file_template": "{repo_root}/data/my_dataset/{split}.json",
      "local_dir": "my_dataset",
      "question_col": "question",
      "image_col": "image",
      "answer_col": "answer"
    }
  }
}
export CHARTOGRAPHER_DATASETS_FILE=/path/to/datasets.json
For datasets with reconstruction or counterfactual variants, set variant_col and family_id_col in the config. Datasets exported by Chartographer include those fields automatically.
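To sanity-check a config before running the pipeline, a small script along these lines may help. It is a sketch, not Chartographer's own loader: the function name load_dataset_entry is hypothetical, and it assumes that the three column keys from the minimal config are required and that the {repo_root} and {split} placeholders are filled by plain str.format-style substitution.

```python
import json
import os
from pathlib import Path

# Assumption: these keys, taken from the minimal config above, are required.
REQUIRED_KEYS = ("question_col", "image_col", "answer_col")


def load_dataset_entry(name, split, repo_root, config_path=None):
    """Return (entry, resolved_path) for one dataset in the config file."""
    # Fall back to the environment variable described in this README.
    config_path = config_path or os.environ["CHARTOGRAPHER_DATASETS_FILE"]
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)

    entry = config["datasets"][name]
    missing = [key for key in REQUIRED_KEYS if key not in entry]
    if missing:
        raise KeyError(f"{name} is missing keys: {missing}")

    # Assumption: placeholders are expanded with str.format-style substitution.
    template = entry["local_file_template"]
    path = Path(template.format(repo_root=repo_root, split=split))
    return entry, path
```

With the minimal config above, load_dataset_entry("my_dataset", "dev", "/repo") would resolve to /repo/data/my_dataset/dev.json under these assumptions.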
Use the make targets for the standard workflow. RECONSTRUCTION_MODEL is used for chart reconstruction and QA regeneration. PREDICTION_MODEL is the VLM being evaluated. JUDGE_MODEL checks prediction correctness.
make reconstruction-workflow DATASET=my_dataset SPLIT=dev RECONSTRUCTION_MODEL=reconstruction-model REVISION_ROUNDS=2
make qa-workflow DATASET=my_dataset SPLIT=dev RECONSTRUCTION_MODEL=reconstruction-model SEED=0
make seed-workflow DATASET=my_dataset SPLIT=dev SEED_START=0 SEED_END=9
REVISION_ROUNDS=N is optional on reconstruction-workflow. It runs N self-refinement turns: diagnose the current render, revise the reconstruction, and render it again. The last revision becomes the active reconstruction, temporary revision files are cleaned, and the files needed by QA/export are rebuilt from the promoted reconstruction.
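The control flow of the self-refinement turns can be pictured roughly as follows. This is an illustrative sketch only: refine_reconstruction and the render, diagnose, and revise callables are hypothetical stand-ins, not functions from the Chartographer codebase, and file cleanup and the rebuild of QA/export files are left out.

```python
from typing import Callable


def refine_reconstruction(
    reconstruction: str,
    render: Callable[[str], bytes],
    diagnose: Callable[[str, bytes], str],
    revise: Callable[[str, str], str],
    revision_rounds: int = 0,
) -> tuple[str, bytes]:
    """Run N diagnose -> revise -> render turns and return the last revision."""
    image = render(reconstruction)
    for _ in range(revision_rounds):
        # Inspect the current render and describe what looks wrong.
        feedback = diagnose(reconstruction, image)
        # Produce a revised reconstruction from that feedback.
        reconstruction = revise(reconstruction, feedback)
        # Render again so the next turn sees the updated chart.
        image = render(reconstruction)
    # The last revision is the one that gets promoted.
    return reconstruction, image
```

With revision_rounds=0 the loop body never runs and the initial reconstruction is returned unchanged, which matches the flag being optional.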
Export the original chart, base reconstruction, and seed-controlled counterfactual variants as a local evaluation dataset:
make export-family-dataset DATASET=my_dataset SPLIT=dev OUTPUT_DATASET=my_dataset_families FAMILY_SEEDS=0-9
export CHARTOGRAPHER_DATASETS_FILE=$PWD/data/my_dataset_families/datasets.json
Run prediction and evaluation on the exported family dataset:
make prediction-workflow DATASET=my_dataset_families SPLIT=dev PREDICTION_MODEL=prediction-model JUDGE_MODEL=judge-model
Use make help for individual steps. See docs/workflow.md for direct python -m ... commands and output locations.
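Once predictions have been judged, one way to compare performance on original charts against reconstructions and counterfactual variants is to group correctness by variant. The snippet below is a minimal sketch under assumptions: the record layout and the field names "variant" and "correct" are hypothetical and may not match the files Chartographer actually writes, so check docs/workflow.md for the real output format.

```python
from collections import defaultdict


def accuracy_by_variant(records, variant_col="variant", correct_col="correct"):
    """Return {variant: accuracy} from an iterable of judged prediction rows."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in records:
        variant = row[variant_col]
        totals[variant] += 1
        # Treat any truthy judge verdict as a correct prediction.
        hits[variant] += bool(row[correct_col])
    return {variant: hits[variant] / totals[variant] for variant in totals}
```

A large gap between the accuracy on original charts and on their counterfactual variants would be consistent with a model relying on shortcuts or prior familiarity rather than on the chart itself.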
src/clients/                  API and local VLM clients
src/common/                   dataset, answer, and prediction I/O utilities
src/config/                   model aliases and task prompts
src/pipeline/reconstruction/  chart reconstruction and counterfactual rendering
src/pipeline/qa/              QA regeneration and execution
src/pipeline/datasets/        chart-question family dataset export
src/pipeline/prediction/      VLM prediction, evaluation, and visualization
Generated outputs are written under results/; local generated datasets are written under data/.
If you use Chartographer, please cite the paper:
@misc{jiang2026chartographer,
title={Chartographer: Counterfactual Chart Generation for Evaluating Vision-Language Models},
author={Yifan Jiang and Dae Yon Hwang and Jesse C. Cresswell and Freda Shi},
year={2026},
eprint={2605.27311},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2605.27311}
}
See LICENSE.
