Double Visual Defense: Adversarial Pre-training and Instruction Tuning for Improving Vision-Language Model Robustness


1UC Santa Cruz, 2Lawrence Livermore National Laboratory

Comparison of clean performance and robustness of our Delta-CLIP model with previous robust and non-robust CLIP models.

Abstract

This paper investigates the robustness of vision-language models against adversarial visual perturbations and introduces a novel "double visual defense" to enhance it. Unlike previous approaches, which resort to lightweight adversarial fine-tuning of a pre-trained CLIP model, we perform large-scale adversarial vision-language pre-training from scratch on web-scale data. We then strengthen the defense by incorporating adversarial visual instruction tuning. The resulting models from the two stages, Delta-CLIP and Delta2-LLaVA, show substantially enhanced zero-shot robustness and set a new state of the art in adversarial defense for vision-language models. For example, the adversarial robustness of Delta-CLIP surpasses that of the previous best models on ImageNet-1k by ~20%. Similarly, compared to prior art, Delta2-LLaVA brings a ~30% robustness improvement to the image captioning task and a ~20% robustness improvement to the visual question answering task. Furthermore, our models exhibit stronger zero-shot recognition capability, fewer hallucinations, and superior reasoning performance compared to baselines.

Double Visual Defense Pipeline

Our Double Visual Defense framework, which involves an adversarial contrastive pre-training stage and an adversarial visual instruction tuning stage.
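Both stages rely on generating adversarial examples under an l-inf perturbation budget during training. As a minimal illustration of the kind of attack involved, here is a sketch of l-inf PGD (projected gradient descent) in NumPy; the function name, hyperparameter defaults, and the user-supplied `grad_fn` are our own illustrative choices, not details taken from the paper:

```python
import numpy as np

def pgd_linf(x, grad_fn, eps=8 / 255, alpha=2 / 255, steps=10, rng=None):
    """Sketch of an l-inf PGD attack on an image x in [0, 1].

    grad_fn(x_adv) must return the gradient of the training loss
    w.r.t. x_adv; in practice this comes from backprop through the
    model (e.g. a CLIP-style contrastive loss), abstracted away here.
    """
    rng = rng or np.random.default_rng(0)
    # Random start inside the eps-ball, as is standard for PGD.
    x_adv = np.clip(x + rng.uniform(-eps, eps, size=x.shape), 0.0, 1.0)
    for _ in range(steps):
        g = grad_fn(x_adv)
        x_adv = x_adv + alpha * np.sign(g)        # gradient-sign ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project back into eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)          # keep valid pixel range
    return x_adv

# Toy usage with an analytic loss L(x) = ||x - t||^2, whose gradient
# is 2 * (x - t); a real run would use model gradients instead.
x = np.full((3, 4, 4), 0.5)
t = np.zeros_like(x)
adv = pgd_linf(x, lambda z: 2 * (z - t), eps=0.03)
```

In adversarial training, each clean image in a batch would be replaced (or augmented) by such an `adv` before computing the training loss.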

Less Hallucination

Delta2-LLaVA exhibits a lower degree of hallucination than LLaVA models built on previous robust CLIP models.

Emergence of Typographic Attacks

We observe an intriguing phenomenon: typographic attacks naturally emerge from naive l-inf adversarial attacks when applied to our adversarially trained Delta2-LLaVA models.

Acknowledgement

We would like to thank the TPU Research Cloud (TRC) program, the Google Cloud Research Credits program, and the AWS Cloud Credit for Research program for partially supporting our computing needs. Cihang Xie is partially supported by a gift from Open Philanthropy. This work is partially based upon work supported by the National Center for Transportation Cybersecurity and Resiliency (TraCR) (a U.S. Department of Transportation National University Transportation Center) headquartered at Clemson University, Clemson, South Carolina, USA. Any opinions, findings, conclusions, and recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of TraCR, and the U.S. Government assumes no liability for the contents or use thereof.

Prepared by LLNL under Contract DE-AC52-07NA27344 and supported by the LLNL-LDRD Program under Project No. 24-ERD-010 and 24-ERD-058 (LLNL-CONF-2001211). This manuscript has been authored by Lawrence Livermore National Security, LLC under Contract No. DE-AC52-07NA27344 with the U.S. Department of Energy. The United States Government retains, and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes.

BibTeX


    @article{wang2025double,
      title   = {Double Visual Defense: Adversarial Pre-training and Instruction Tuning for Improving Vision-Language Model Robustness},
      author  = {Wang, Zeyu and Xie, Cihang and Bartoldson, Brian and Kailkhura, Bhavya},
      journal = {arXiv preprint arXiv:2501.09446},
      year    = {2025}
    }