MMKE-Bench

A Multimodal Editing Benchmark for Diverse Visual Knowledge

1 BIGAI 2 USTC 3 PKU 4 BIT

*Core Contributors
†Corresponding to: liqing@bigai.ai | zlzheng@bigai.ai

TL;DR: We propose MMKE-Bench, a challenging benchmark for evaluating diverse semantic editing in real-world scenarios.


Overview of the MMKE-Bench dataset. Our contributions can be summarized as follows:

1) Overview of MMKE-Bench: We introduce MMKE-Bench, a benchmark designed to test diverse semantic editing capabilities in realistic scenarios. It uses free-form natural language for knowledge representation and covers three editing types aligned with practical contexts.

2) Development of the Benchmark Pipeline: We develop a novel pipeline to build the benchmark, which collects original knowledge, generates editable (counterfactual) knowledge, and crafts evaluation questions following a set of evaluation principles.

3) Experimental Analysis and Challenges: We conduct extensive experiments with representative knowledge editing methods and large multimodal models, revealing several limitations of existing approaches in both single- and multi-edit scenarios.

MMKE-Bench

Abstract

Knowledge editing techniques have emerged as essential tools for updating the factual knowledge of large language models (LLMs) and large multimodal models (LMMs), allowing them to correct outdated or inaccurate information without retraining from scratch. However, existing benchmarks for multimodal knowledge editing primarily focus on entity-level knowledge represented as simple triplets, which fail to capture the complexity of real-world multimodal information. To address this issue, we introduce MMKE-Bench, a comprehensive MultiModal Knowledge Editing Benchmark designed to evaluate the ability of LMMs to edit diverse visual knowledge in real-world scenarios. MMKE-Bench addresses these limitations by incorporating three types of editing tasks: visual entity editing, visual semantic editing, and user-specific editing. In addition, MMKE-Bench uses free-form natural language to represent and edit knowledge, offering a more flexible and effective format. The benchmark consists of 2,940 pieces of knowledge and 8,363 images across 33 broad categories, with evaluation questions automatically generated and human-verified. We assess five state-of-the-art knowledge editing methods on three prominent LMMs, revealing that no method excels across all criteria, and that visual and user-specific edits are particularly challenging. MMKE-Bench sets a new standard for evaluating the robustness of multimodal knowledge editing techniques, driving progress in this rapidly evolving field.

Statistics

Method

We construct the benchmark through four steps: i) Original Knowledge Collection; ii) Editing Knowledge Generation; iii) Evaluation Question Generation; and iv) Human Verification.

Original Knowledge Collection

In gathering original knowledge, we first list candidate fine-grained entities, visual semantics, or user-specific items, and then collect their corresponding images and descriptions. For visual entity editing, we source candidates from two datasets: the multimodal knowledge graph MMpedia and the visual entity recognition dataset OVEN. For visual semantic editing, we define candidates across 14 broad categories of semantic knowledge, including single-person behaviors, single-object behaviors or attributes, object relationships, and global structures. For user-specific editing, we consider 9 broad categories of personalized information sources, such as favorite singers, owned pets, and alma maters. The types of samples are shown in Figure 1.
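To make the collected knowledge concrete, here is a minimal Python sketch of how one original knowledge item could be organized. The field names and example values are our own illustration and do not reflect the released data schema.

from dataclasses import dataclass, field
from typing import List

@dataclass
class KnowledgeItem:
    # One piece of original knowledge before editing (illustrative schema only).
    edit_type: str                                   # "visual_entity" | "visual_semantic" | "user_specific"
    category: str                                    # e.g., "single-person behavior", "owned pet"
    name: str                                        # entity name, behavior name, or personalized item
    description: str                                 # free-form natural-language description
    images: List[str] = field(default_factory=list)  # paths or URLs of the collected images

# Example: a visual-semantic candidate with its images and description.
item = KnowledgeItem(
    edit_type="visual_semantic",
    category="single-person behavior",
    name="thumbs-up gesture",
    description="A thumbs-up gesture generally signals approval or agreement.",
    images=["images/thumbs_up_01.jpg", "images/thumbs_up_02.jpg"],
)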


Editing Knowledge Generation

Considering the multimodal nature of large multimodal models (LMMs), we propose editing both text and visual modalities when constructing the benchmark. Specifically, we focus on editing visual entities and visual semantic knowledge while leaving user-specific knowledge unchanged. The former is treated as knowledge editing, while the latter is regarded as knowledge insertion.
For the visual modality, we follow the image-replacement-based editing approach from previous work, where an image of the entity or semantic action is randomly replaced with another of the same type. For the text modality, we rewrite the key information about the entity (for visual entity editing) or the rule or meaning of the semantic knowledge (for visual semantic editing) into counterfactual content. Additionally, we update the action description to align with the new visual content.
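The two operations above can be sketched in a few lines of Python. This is our paraphrase of the procedure rather than the official construction code, and rewrite_fn stands in for a hypothetical LLM call that produces the counterfactual text.

import random

def generate_edited_knowledge(item, same_type_pool, rewrite_fn):
    # Visual modality: image-replacement-based editing -- swap in an image
    # randomly drawn from another item of the same type.
    donor = random.choice([x for x in same_type_pool if x is not item])
    edited_image = random.choice(donor.images)

    # Text modality: rewrite the key information (entity facts, or the rule /
    # meaning of the semantic knowledge) into counterfactual content that also
    # matches the new visual content. rewrite_fn is a hypothetical LLM helper.
    edited_description = rewrite_fn(item.description)

    return {"edited_image": edited_image, "edited_description": edited_description}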

Evaluation Question Generation

We adhere to four key evaluation principles when generating questions and answers: Reliability, Locality, Generalization, and Portability. The Reliability, Generalization, and Portability questions are generated by an LLM, while the Locality questions are collected from an existing benchmark.
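A minimal sketch of how such questions could be generated is shown below, assuming a generic llm_generate helper and a locality_benchmark object with a sample() method; the prompt templates are illustrative and are not the exact prompts used to build MMKE-Bench.

PROMPTS = {
    "reliability": "Given the edited knowledge \"{k}\", write a question that directly tests whether a model has adopted the edit.",
    "generalization": "Given the edited knowledge \"{k}\", write a rephrased question that still requires the edited knowledge to answer.",
    "portability": "Given the edited knowledge \"{k}\", write a question that requires reasoning one step beyond the edit.",
}

def build_eval_questions(edited_description, llm_generate, locality_benchmark):
    questions = {
        qtype: llm_generate(template.format(k=edited_description))
        for qtype, template in PROMPTS.items()
    }
    # Locality questions probe unrelated knowledge that should stay unchanged,
    # so they are drawn from an existing benchmark rather than generated.
    questions["locality"] = locality_benchmark.sample()
    return questions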

Human Check & Benchmark Statistics

During benchmark construction, we manually collected, reviewed, and filtered the samples multiple times. In the original knowledge collection stage, we conducted a thorough manual review of the images associated with each entity, behavior, and object to ensure the quality of the collected visuals. Furthermore, after counterfactual editing and question generation, we manually reviewed the questions, revised unsuitable questions, and corrected wrong answers.
The statistics of MMKE-Bench are shown in Table 1. MMKE-Bench encompasses 3 classes of edited knowledge, totaling 2,940 knowledge pieces and 7,229 images. The knowledge spans 110 fine-grained types, highlighting the diversity of MMKE-Bench. We split the dataset into training and validation sets at a 4:6 ratio, with the training set reserved solely for knowledge editing methods that require training (e.g., SERAC and MEND).
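For reference, the 4:6 split could be reproduced along the following lines (illustrative only; the released benchmark ships with a fixed split).

import random

def split_train_val(knowledge_items, train_ratio=0.4, seed=0):
    # Shuffle reproducibly, then cut at the 4:6 boundary. The training portion
    # is only used by methods that need training data (e.g., SERAC and MEND).
    rng = random.Random(seed)
    shuffled = list(knowledge_items)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]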

Experiment Results

Single Editing

To evaluate our benchmark, we conduct experiments on three representative large multimodal models (LMMs): BLIP-2, MiniGPT-4, and LLaVA-1.5. Following previous benchmarks, we select five representative multimodal knowledge editing methods: 1) Fine-tuning (FT): we fine-tune either the LLM (FT-LLM) or the vision-language alignment module (FT-Alignment), where only the last layer of the LLM is fine-tuned. 2) Knowledge Editor (KE): KE uses a hyper-network with constrained optimization to predict the weight update at test time. 3) MEND: MEND learns a low-rank decomposition of the gradient of standard fine-tuning. 4) SERAC: SERAC is a memory-based method that stores edits in an explicit memory. 5) In-context Knowledge Editing (IKE): IKE is inspired by in-context learning and incorporates new demonstration formatting and organization strategies to guide knowledge editing.
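The single-editing protocol behind the table below can be summarized with the following sketch. Here editor, model, and metric_fn are assumed interfaces rather than the actual evaluation code; the six criteria correspond to the metric columns reported in the table.

CRITERIA = ["T-Rel", "I-Rel", "T-Loc", "I-Loc", "I-Gen", "Port"]

def evaluate_single_edit(editor, model, sample, metric_fn):
    # Apply exactly one edit, then query the edited model with the evaluation
    # questions for each criterion (text-only questions carry image=None).
    edited_model = editor.apply_edit(model, sample["edit"])
    scores = {}
    for criterion in CRITERIA:
        q = sample["questions"][criterion]
        prediction = edited_model.answer(image=q.get("image"), text=q["question"])
        # metric_fn typically computes token-level accuracy against the target.
        scores[criterion] = metric_fn(prediction, q["answer"])
    return scores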

Model Name Size Editing Method T-Loc I-Loc T-Rel I-Rel I-Gen Port
(T-/I- denote the text and image modalities; Loc = Locality, Rel = Reliability, Gen = Generalization, Port = Portability.)
LLaVA-1.5 7B FT-Alignment 100.00 9.15 35.72 38.65 39.74 37.62
LLaVA-1.5 7B IKE 68.25 17.43 63.49 59.98 59.98 51.30
LLaVA-1.5 7B SERAC 99.87 99.26 35.70 35.02 34.98 40.24
LLaVA-1.5 7B MEND 97.32 75.29 51.30 47.21 46.58 41.83
LLaVA-1.5 7B KE 79.89 18.73 46.45 46.19 46.29 48.77
LLaVA-1.5 7B FT-LLM 77.71 17.58 53.89 49.54 49.30 41.23
LLaVA-1.5 7B FT-Alignment 100.00 11.45 28.92 51.41 40.72 27.84
LLaVA-1.5 7B IKE 64.11 19.44 63.54 61.92 61.31 26.08
LLaVA-1.5 7B SERAC 99.90 99.98 29.01 29.97 29.17 20.73
LLaVA-1.5 7B MEND 98.27 82.90 41.21 46.64 45.90 23.29
LLaVA-1.5 7B KE 74.61 7.95 47.82 38.78 37.49 24.07
LLaVA-1.5 7B FT-LLM 77.81 16.11 49.18 48.28 47.49 14.48
LLaVA-1.5 7B FT-Alignment 100.00 10.87 42.40 40.21 43.65 23.65
LLaVA-1.5 7B IKE 63.48 18.93 75.65 62.73 62.79 22.87
LLaVA-1.5 7B SERAC 99.99 99.81 42.24 36.29 36.67 13.63
LLaVA-1.5 7B MEND 98.49 85.41 50.92 45.14 44.86 14.49
LLaVA-1.5 7B KE 79.51 10.80 54.85 48.65 49.46 23.67
LLaVA-1.5 7B FT-LLM 75.08 20.41 58.18 47.80 48.56 13.11
LLaVA-1.5 7B FT-Alignment 100.00 10.49 35.68 43.42 41.37 29.60
LLaVA-1.5 7B IKE 65.28 18.60 67.56 61.54 61.36 33.42
LLaVA-1.5 7B SERAC 99.92 99.68 35.65 33.76 33.61 24.87
LLaVA-1.5 7B MEND 98.03 81.20 47.81 46.33 45.78 26.54
LLaVA-1.5 7B KE 78.00 12.49 49.71 44.54 44.41 32.17
LLaVA-1.5 7B FT-LLM 76.87 18.03 53.75 48.54 48.45 22.94
MiniGPT-4 7.8B FT-LLM 84.13 31.53 49.22 41.13 41.40 31.25
MiniGPT-4 7.8B FT-Alignment 100.00 24.85 31.89 33.87 33.93 30.79
MiniGPT-4 7.8B IKE 75.50 15.25 56.42 53.80 53.72 41.09
MiniGPT-4 7.8B SERAC 99.97 99.76 31.88 30.53 30.35 33.43
MiniGPT-4 7.8B MEND 97.49 77.70 47.26 42.20 41.82 34.43
MiniGPT-4 7.8B KE 76.44 18.47 41.28 40.03 40.44 41.55
MiniGPT-4 7.8B FT-LLM 83.96 31.54 44.45 44.85 43.91 8.16
MiniGPT-4 7.8B FT-Alignment 100.00 25.20 24.93 46.45 42.29 11.43
MiniGPT-4 7.8B IKE 66.45 12.79 55.44 54.85 53.01 10.50
MiniGPT-4 7.8B SERAC 98.70 98.80 27.08 29.65 28.33 10.35
MiniGPT-4 7.8B MEND 97.34 77.16 37.45 42.17 42.62 8.65
MiniGPT-4 7.8B KE 84.14 21.25 38.14 35.23 33.94 14.72
MiniGPT-4 7.8B FT-LLM 83.13 34.04 39.74 38.94 38.60 10.53
MiniGPT-4 7.8B FT-Alignment 100.00 25.30 21.07 33.25 33.40 12.33
MiniGPT-4 7.8B IKE 75.35 14.56 61.55 54.86 54.81 11.85
MiniGPT-4 7.8B SERAC 100.00 99.90 21.09 30.63 30.27 10.50
MiniGPT-4 7.8B MEND 97.47 79.19 28.70 40.94 40.25 11.34
MiniGPT-4 7.8B KE 78.46 20.12 22.60 37.91 37.72 19.92
MiniGPT-4 7.8B FT-LLM 83.74 32.37 44.47 41.64 41.30 16.65
MiniGPT-4 7.8B FT-Alignment 100.00 25.12 25.96 37.86 36.54 18.18
MiniGPT-4 7.8B IKE 72.43 14.20 57.80 54.50 53.85 21.15
MiniGPT-4 7.8B SERAC 99.56 99.49 26.68 30.27 29.65 18.09
MiniGPT-4 7.8B MEND 97.43 78.02 37.80 41.77 41.56 18.14
MiniGPT-4 7.8B KE 79.68 19.95 34.01 37.72 37.37 25.40
BLIP2-OPT 3.8B FT-LLM 69.76 21.47 39.21 35.76 36.21 18.11
BLIP2-OPT 3.8B FT-Alignment 100.00 8.83 20.89 27.51 27.02 19.25
BLIP2-OPT 3.8B IKE 55.77 13.19 41.88 41.80 41.76 25.93
BLIP2-OPT 3.8B SERAC 99.99 99.69 20.90 20.27 20.49 19.76
BLIP2-OPT 3.8B MEND 96.02 69.37 35.67 34.41 34.48 21.31
BLIP2-OPT 3.8B KE 83.61 18.02 28.14 28.25 28.46 30.76
BLIP2-OPT 3.8B FT-LLM 64.11 19.25 33.42 30.79 30.71 2.76
BLIP2-OPT 3.8B FT-Alignment 100.00 9.48 18.17 35.81 32.67 5.15
BLIP2-OPT 3.8B IKE 47.10 13.92 35.56 42.07 41.10 5.03
BLIP2-OPT 3.8B SERAC 99.90 99.16 18.26 18.61 17.96 3.18
BLIP2-OPT 3.8B MEND 97.37 75.02 26.38 27.18 27.56 3.64
BLIP2-OPT 3.8B KE 69.15 15.68 27.57 20.55 21.30 5.76
BLIP2-OPT 3.8B FT-LLM 61.28 20.49 12.52 27.33 27.80 5.46
BLIP2-OPT 3.8B FT-Alignment 100.00 8.74 7.46 17.19 17.31 6.17
BLIP2-OPT 3.8B IKE 47.39 12.25 13.25 31.04 30.71 6.03
BLIP2-OPT 3.8B SERAC 100.00 99.76 7.46 14.20 14.50 5.10
BLIP2-OPT 3.8B MEND 96.95 76.21 11.06 25.21 25.19 5.22
BLIP2-OPT 3.8B KE 65.70 15.73 12.79 19.83 19.71 10.80
BLIP2-OPT 3.8B FT-LLM 65.05 20.40 28.38 31.29 31.57 8.78
BLIP2-OPT 3.8B FT-Alignment 100.00 9.02 15.51 26.84 25.67 10.19
BLIP2-OPT 3.8B IKE 50.09 13.12 30.23 38.30 37.86 12.33
BLIP2-OPT 3.8B SERAC 99.96 99.54 15.54 17.69 17.65 9.56
BLIP2-OPT 3.8B MEND 96.75 73.31 25.00 30.14 30.26 10.13
BLIP2-OPT 3.8B KE 72.39 16.05 23.97 24.19 24.34 16.09


Comparison with Existing Benchmarks

Evaluation comparison of IKE for BLIP-2 with existing benchmarks. I-Gen and Port for MMEdit, along with Port for MIKE, are set to 1, as these criteria are not evaluated by those benchmarks.
Evaluation comparison of IKE for MiniGPT-4 with existing benchmarks. I-Gen and Port for MMEdit, along with Port for MIKE, are set to 1, as these criteria are not evaluated by those benchmarks.

Data Examples

Qualitative Examples

BibTeX


@article{du2024mmke_bench,
  title  = {MMKE-Bench: A Multimodal Editing Benchmark for Diverse Visual Knowledge},
  author = {Yuntao Du and Kailin Jiang and Zhi Gao and Chenrui Shi and Zilong Zheng and Siyuan Qi and Qing Li},
  year   = {2024}
}