MMKE-Bench

A Multimodal Editing Benchmark for Diverse Visual Knowledge

1 BIGAI 2 USTC 3 PKU 4 BIT

*Core Contributors
†Corresponding to: liqing@bigai.ai | zlzheng@bigai.ai

TL;DR: We propose MMKE-Bench, a challenging benchmark for evaluating diverse semantic editing in real-world scenarios.


Overview of the MMKE-Bench dataset. Our contributions can be summarized as follows:

1) Overview of MMKE-Bench: We introduce MMKE-Bench, a benchmark designed to test diverse semantic editing capabilities in realistic scenarios. It uses free-form natural language for knowledge representation and covers three editing types aligned with practical contexts.

2) Development of the Benchmark Pipeline: We develop a novel pipeline to build the benchmark, which collects original knowledge, generates editable (counterfactual) knowledge, and crafts evaluation questions following a set of evaluation principles.

3) Experimental Analysis and Challenges: We conduct extensive experiments with representative knowledge editing methods and large multimodal models, revealing several limitations of existing approaches in both single- and multi-edit scenarios.

MMKE-Bench

Abstract

Knowledge editing techniques have emerged as essential tools for updating the factual knowledge of large language models (LLMs) and large multimodal models (LMMs), allowing them to correct outdated or inaccurate information without retraining from scratch. However, existing benchmarks for multimodal knowledge editing primarily focus on entity-level knowledge represented as simple triplets, which fail to capture the complexity of real-world multimodal information. To address this issue, we introduce MMKE-Bench, a comprehensive MultiModal Knowledge Editing Benchmark designed to evaluate the ability of LMMs to edit diverse visual knowledge in real-world scenarios. MMKE-Bench addresses these limitations by incorporating three types of editing tasks: visual entity editing, visual semantic editing, and user-specific editing. In addition, MMKE-Bench uses free-form natural language to represent and edit knowledge, offering a more flexible and effective format. The benchmark consists of 2,940 pieces of knowledge and 8,363 images across 33 broad categories, with evaluation questions automatically generated and human-verified. We assess five state-of-the-art knowledge editing methods on three prominent LMMs, revealing that no method excels across all criteria, and that visual and user-specific edits are particularly challenging. MMKE-Bench sets a new standard for evaluating the robustness of multimodal knowledge editing techniques, driving progress in this rapidly evolving field.

Statistics

Method

We construct the benchmark through four steps: i) Original Knowledge Collection; ii) Editing Knowledge Generation; iii) Evaluation Question Generation; and iv) Human Verification.

Original Knowledge Collection

In gathering original knowledge, we first list candidate fine-grained entities, visual semantics, or user-specific items, and then collect their corresponding images and descriptions. For visual entity editing, we source candidates from two datasets: the multimodal knowledge graph MMpedia and the visual entity recognition dataset OVEN. For visual semantic editing, we define candidates across 14 broad categories of semantic knowledge, including single-person behaviors, single-object behaviors or attributes, object relationships, and global structures. For user-specific editing, we consider 9 broad categories of personalized information sources, such as favorite singers, owned pets, and alma maters. The types of samples are shown in Figure 1.
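To make the collected knowledge concrete, here is a minimal Python sketch of how one original knowledge item could be organized. The field names and example values are our own illustration and do not reflect the released data schema.

from dataclasses import dataclass, field
from typing import List

@dataclass
class KnowledgeItem:
    # One piece of original knowledge before editing (illustrative schema only).
    edit_type: str                                   # "visual_entity" | "visual_semantic" | "user_specific"
    category: str                                    # e.g., "single-person behavior", "owned pet"
    name: str                                        # entity name, behavior name, or personalized item
    description: str                                 # free-form natural-language description
    images: List[str] = field(default_factory=list)  # paths or URLs of the collected images

# Example: a visual-semantic candidate with its images and description.
item = KnowledgeItem(
    edit_type="visual_semantic",
    category="single-person behavior",
    name="thumbs-up gesture",
    description="A thumbs-up gesture generally signals approval or agreement.",
    images=["images/thumbs_up_01.jpg", "images/thumbs_up_02.jpg"],
)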


Editing Knowledge Generation

Considering the multimodal nature of large multimodal models (LMMs), we propose editing both text and visual modalities when constructing the benchmark. Specifically, we focus on editing visual entities and visual semantic knowledge while leaving user-specific knowledge unchanged. The former is treated as knowledge editing, while the latter is regarded as knowledge insertion.
For the visual modality, we follow the image-replacement-based editing approach from previous work, where an image of the entity or semantic action is randomly replaced with another of the same type. For the text modality, we rewrite the key information about the entity (for visual entity editing) or the rule or meaning of the semantic knowledge (for visual semantic editing) into counterfactual content. Additionally, we update the action description to align with the new visual content.
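The two operations above can be sketched in a few lines of Python. This is our paraphrase of the procedure rather than the official construction code, and rewrite_fn stands in for a hypothetical LLM call that produces the counterfactual text.

import random

def generate_edited_knowledge(item, same_type_pool, rewrite_fn):
    # Visual modality: image-replacement-based editing -- swap in an image
    # randomly drawn from another item of the same type.
    donor = random.choice([x for x in same_type_pool if x is not item])
    edited_image = random.choice(donor.images)

    # Text modality: rewrite the key information (entity facts, or the rule /
    # meaning of the semantic knowledge) into counterfactual content that also
    # matches the new visual content. rewrite_fn is a hypothetical LLM helper.
    edited_description = rewrite_fn(item.description)

    return {"edited_image": edited_image, "edited_description": edited_description}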

Evaluation Question Generation

We adhere to four key evaluation principles when generating questions and answers: Reliability, Locality, Generalization, and Portability. The Reliability, Generalization, and Portability questions are generated by an LLM, while the Locality questions are collected from an existing benchmark.
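A minimal sketch of how such questions could be generated is shown below, assuming a generic llm_generate helper and a locality_benchmark object with a sample() method; the prompt templates are illustrative and are not the exact prompts used to build MMKE-Bench.

PROMPTS = {
    "reliability": "Given the edited knowledge \"{k}\", write a question that directly tests whether a model has adopted the edit.",
    "generalization": "Given the edited knowledge \"{k}\", write a rephrased question that still requires the edited knowledge to answer.",
    "portability": "Given the edited knowledge \"{k}\", write a question that requires reasoning one step beyond the edit.",
}

def build_eval_questions(edited_description, llm_generate, locality_benchmark):
    questions = {
        qtype: llm_generate(template.format(k=edited_description))
        for qtype, template in PROMPTS.items()
    }
    # Locality questions probe unrelated knowledge that should stay unchanged,
    # so they are drawn from an existing benchmark rather than generated.
    questions["locality"] = locality_benchmark.sample()
    return questions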

Human Check & Benchmark Statistics

During benchmark construction, we manually collected, reviewed, and filtered the samples multiple times. In the original knowledge collection stage, we conducted a thorough manual review of the images associated with each entity, behavior, and object to ensure the quality of the collected visuals. Furthermore, after counterfactual editing and question generation, we manually reviewed the questions, revised unsuitable questions, and corrected wrong answers.
The statistics of MMKE-Bench are shown in Table 1. MMKE-Bench encompasses 3 classes of edited knowledge, totaling 2,940 knowledge pieces and 7,229 images. The knowledge spans 110 fine-grained types, highlighting the diversity of MMKE-Bench. We split the dataset into training and validation sets at a 4:6 ratio, with the training set reserved solely for knowledge editing methods that require training (e.g., SERAC and MEND).
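For reference, the 4:6 split could be reproduced along the following lines (illustrative only; the released benchmark ships with a fixed split).

import random

def split_train_val(knowledge_items, train_ratio=0.4, seed=0):
    # Shuffle reproducibly, then cut at the 4:6 boundary. The training portion
    # is only used by methods that need training data (e.g., SERAC and MEND).
    rng = random.Random(seed)
    shuffled = list(knowledge_items)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]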

Experiment Results

Single Editing

To evaluate our benchmark, we conduct experiments on three representative large multimodal models (LMMs): BLIP-2, MiniGPT-4, and LLaVA-1.5. Following previous benchmarks, we select five representative multimodal knowledge editing methods: 1) Fine-tuning (FT): we fine-tune either the LLM (FT-LLM) or the vision-language alignment module (FT-Alignment), where only the last layer of the LLM is fine-tuned. 2) Knowledge Editor (KE): KE uses a hyper-network with constrained optimization to predict the weight update at test time. 3) MEND: MEND learns a low-rank decomposition of the gradient of standard fine-tuning. 4) SERAC: SERAC is a memory-based method that stores edits in an explicit memory. 5) In-context Knowledge Editing (IKE): IKE is inspired by in-context learning and incorporates new demonstration formatting and organization strategies to guide knowledge editing.
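The single-editing protocol behind the table below can be summarized with the following sketch. Here editor, model, and metric_fn are assumed interfaces rather than the actual evaluation code; the six criteria correspond to the metric columns reported in the table.

CRITERIA = ["T-Rel", "I-Rel", "T-Loc", "I-Loc", "I-Gen", "Port"]

def evaluate_single_edit(editor, model, sample, metric_fn):
    # Apply exactly one edit, then query the edited model with the evaluation
    # questions for each criterion (text-only questions carry image=None).
    edited_model = editor.apply_edit(model, sample["edit"])
    scores = {}
    for criterion in CRITERIA:
        q = sample["questions"][criterion]
        prediction = edited_model.answer(image=q.get("image"), text=q["question"])
        # metric_fn typically computes token-level accuracy against the target.
        scores[criterion] = metric_fn(prediction, q["answer"])
    return scores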

Model Name Size Editing Method T-Loc I-Loc T-Rel I-Rel I-Gen Port
(T-/I- denote the text and image modalities; Loc = Locality, Rel = Reliability, Gen = Generalization, Port = Portability.)
LLaVA-1.5 7B FT-Alignment 100.00 9.15 35.72 38.65 39.74 37.62
LLaVA-1.5 7B IKE 68.25 17.43 63.49 59.98 59.98 51.30
LLaVA-1.5 7B SERAC 99.87 99.26 35.70 35.02 34.98 40.24
LLaVA-1.5 7B MEND 97.32 75.29 51.30 47.21 46.58 41.83
LLaVA-1.5 7B KE 79.89 18.73 46.45 46.19 46.29 48.77
LLaVA-1.5 7B FT-LLM 77.71 17.58 53.89 49.54 49.30 41.23
LLaVA-1.5 7B FT-Alignment 100.00 11.45 28.92 51.41 40.72 27.84
LLaVA-1.5 7B IKE 64.11 19.44 63.54 61.92 61.31 26.08
LLaVA-1.5 7B SERAC 99.90 99.98 29.01 29.97 29.17 20.73
LLaVA-1.5 7B MEND 98.27 82.90 41.21 46.64 45.90 23.29
LLaVA-1.5 7B KE 74.61 7.95 47.82 38.78 37.49 24.07
LLaVA-1.5 7B FT-LLM 77.81 16.11 49.18 48.28 47.49 14.48
LLaVA-1.5 7B FT-Alignment 100.00 10.87 42.40 40.21 43.65 23.65
LLaVA-1.5 7B IKE 63.48 18.93 75.65 62.73 62.79 22.87
LLaVA-1.5 7B SERAC 99.99 99.81 42.24 36.29 36.67 13.63
LLaVA-1.5 7B MEND 98.49 85.41 50.92 45.14 44.86 14.49
LLaVA-1.5 7B KE 79.51 10.80 54.85 48.65 49.46 23.67
LLaVA-1.5 7B FT-LLM 75.08 20.41 58.18 47.80 48.56 13.11
LLaVA-1.5 7B FT-Alignment 100.00 10.49 35.68 43.42 41.37 29.60
LLaVA-1.5 7B IKE 65.28 18.60 67.56 61.54 61.36 33.42
LLaVA-1.5 7B SERAC 99.92 99.68 35.65 33.76 33.61 24.87
LLaVA-1.5 7B MEND 98.03 81.20 47.81 46.33 45.78 26.54
LLaVA-1.5 7B KE 78.00 12.49 49.71 44.54 44.41 32.17
LLaVA-1.5 7B FT-LLM 76.87 18.03 53.75 48.54 48.45 22.94
MiniGPT-4 7.8B FT-LLM 84.13 31.53 49.22 41.13 41.40 31.25
MiniGPT-4 7.8B FT-Alignment 100.00 24.85 31.89 33.87 33.93 30.79
MiniGPT-4 7.8B IKE 75.50 15.25 56.42 53.80 53.72 41.09
MiniGPT-4 7.8B SERAC 99.97 99.76 31.88 30.53 30.35 33.43
MiniGPT-4 7.8B MEND 97.49 77.70 47.26 42.20 41.82 34.43
MiniGPT-4 7.8B KE 76.44 18.47 41.28 40.03 40.44 41.55
MiniGPT-4 7.8B FT-LLM 83.96 31.54 44.45 44.85 43.91 8.16
MiniGPT-4 7.8B FT-Alignment 100.00 25.20 24.93 46.45 42.29 11.43
MiniGPT-4 7.8B IKE 66.45 12.79 55.44 54.85 53.01 10.50
MiniGPT-4 7.8B SERAC 98.70 98.80 27.08 29.65 28.33 10.35
MiniGPT-4 7.8B MEND 97.34 77.16 37.45 42.17 42.62 8.65
MiniGPT-4 7.8B KE 84.14 21.25 38.14 35.23 33.94 14.72
MiniGPT-4 7.8B FT-LLM 83.13 34.04 39.74 38.94 38.60 10.53
MiniGPT-4 7.8B FT-Alignment 100.00 25.30 21.07 33.25 33.40 12.33
MiniGPT-4 7.8B IKE 75.35 14.56 61.55 54.86 54.81 11.85
MiniGPT-4 7.8B SERAC 100.00 99.90 21.09 30.63 30.27 10.50
MiniGPT-4 7.8B MEND 97.47 79.19 28.70 40.94 40.25 11.34
MiniGPT-4 7.8B KE 78.46 20.12 22.60 37.91 37.72 19.92
MiniGPT-4 7.8B FT-LLM 83.74 32.37 44.47 41.64 41.30 16.65
MiniGPT-4 7.8B FT-Alignment 100.00 25.12 25.96 37.86 36.54 18.18
MiniGPT-4 7.8B IKE 72.43 14.20 57.80 54.50 53.85 21.15
MiniGPT-4 7.8B SERAC 99.56 99.49 26.68 30.27 29.65 18.09
MiniGPT-4 7.8B MEND 97.43 78.02 37.80 41.77 41.56 18.14
MiniGPT-4 7.8B KE 79.68 19.95 34.01 37.72 37.37 25.40
BLIP2-OPT 3.8B FT-LLM 69.76 21.47 39.21 35.76 36.21 18.11
BLIP2-OPT 3.8B FT-Alignment 100.00 8.83 20.89 27.51 27.02 19.25
BLIP2-OPT 3.8B IKE 55.77 13.19 41.88 41.80 41.76 25.93
BLIP2-OPT 3.8B SERAC 99.99 99.69 20.90 20.27 20.49 19.76
BLIP2-OPT 3.8B MEND 96.02 69.37 35.67 34.41 34.48 21.31
BLIP2-OPT 3.8B KE 83.61 18.02 28.14 28.25 28.46 30.76
BLIP2-OPT 3.8B FT-LLM 64.11 19.25 33.42 30.79 30.71 2.76
BLIP2-OPT 3.8B FT-Alignment 100.00 9.48 18.17 35.81 32.67 5.15
BLIP2-OPT 3.8B IKE 47.10 13.92 35.56 42.07 41.10 5.03
BLIP2-OPT 3.8B SERAC 99.90 99.16 18.26 18.61 17.96 3.18
BLIP2-OPT 3.8B MEND 97.37 75.02 26.38 27.18 27.56 3.64
BLIP2-OPT 3.8B KE 69.15 15.68 27.57 20.55 21.30 5.76
BLIP2-OPT 3.8B FT-LLM 61.28 20.49 12.52 27.33 27.80 5.46
BLIP2-OPT 3.8B FT-Alignment 100.00 8.74 7.46 17.19 17.31 6.17
BLIP2-OPT 3.8B IKE 47.39 12.25 13.25 31.04 30.71 6.03
BLIP2-OPT 3.8B SERAC 100.00 99.76 7.46 14.20 14.50 5.10
BLIP2-OPT 3.8B MEND 96.95 76.21 11.06 25.21 25.19 5.22
BLIP2-OPT 3.8B KE 65.70 15.73 12.79 19.83 19.71 10.80
BLIP2-OPT 3.8B FT-LLM 65.05 20.40 28.38 31.29 31.57 8.78
BLIP2-OPT 3.8B FT-Alignment 100.00 9.02 15.51 26.84 25.67 10.19
BLIP2-OPT 3.8B IKE 50.09 13.12 30.23 38.30 37.86 12.33
BLIP2-OPT 3.8B SERAC 99.96 99.54 15.54 17.69 17.65 9.56
BLIP2-OPT 3.8B MEND 96.75 73.31 25.00 30.14 30.26 10.13
BLIP2-OPT 3.8B KE 72.39 16.05 23.97 24.19 24.34 16.09


Comparison with Existing Benchmarks

Evaluation comparison of IKE for BLIP-2 with existing benchmarks. I-Gen and Port for MMEdit, along with Port for MIKE, are set to 1, as these criteria are not evaluated by those benchmarks.
Evaluation comparison of IKE for MiniGPT-4 with existing benchmarks. I-Gen and Port for MMEdit, along with Port for MIKE, are set to 1, as these criteria are not evaluated by those benchmarks.

Data Examples

Qualitative Examples

BibTeX


@article{du2024mmke_bench,
  title  = {MMKE-Bench: A Multimodal Editing Benchmark for Diverse Visual Knowledge},
  author = {Yuntao Du and Kailin Jiang and Zhi Gao and Chenrui Shi and Zilong Zheng and Siyuan Qi and Qing Li},
  year   = {2024}
}