Are LLMs Reasoning or Ranking? Diagnosing and Mitigating Order Bias in Knowledge Graph Completion

Kamal Hamaz, Assia Tebib, Mohamed Ali Bouanaka

Abstract

Introduction: Knowledge graphs power modern question-answering, recommendation, and entity-aware language models, but they are perennially incomplete. Knowledge graph completion (KGC) infers the missing triples and is increasingly approached by finetuning a large language model (LLM) to select the correct entity from a candidate list produced by a lightweight embedding retriever. Such systems report strong Hits@1 improvements over the retriever, and these gains are interpreted as evidence that the LLM reasons about which candidate matches the query.


Objectives: This paper tests that interpretation directly. We ask two questions. First, are LLM-based discrimination rerankers actually reasoning about candidate identity, or are they exploiting candidate position in the prompt as a shortcut? Second, if the shortcut is present, can it be removed by a simple change to the training procedure?


Methods: We finetune Mistral-7B with QLoRA on FB15K-237 candidate lists produced by a TransE retriever under two training regimes: candidates always in retriever-score order, and candidates independently shuffled in every batch. Each trained model is evaluated under three inference-time orderings of the same candidate set: identity, uniform random permutation, and adversarial reversal. Differences across orderings isolate the reranker's dependence on candidate position rather than candidate content.
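The three inference-time orderings can be illustrated with a minimal Python sketch. The function names `reorder` and `hits_at_1`, the mode strings, and the toy candidate list are hypothetical and are not taken from the authors' code; the sketch assumes the candidate list arrives in retriever-score order.

```python
import random
from typing import List, Sequence


def reorder(candidates: Sequence[str], mode: str, rng: random.Random) -> List[str]:
    """Return the same candidate set under one inference-time ordering.

    `candidates` is assumed to be sorted by retriever score (best first).
    """
    if mode == "identity":   # keep retriever-score order
        return list(candidates)
    if mode == "shuffle":    # uniform random permutation
        permuted = list(candidates)
        rng.shuffle(permuted)
        return permuted
    if mode == "reverse":    # adversarial reversal
        return list(reversed(candidates))
    raise ValueError(f"unknown ordering mode: {mode}")


def hits_at_1(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of queries whose top prediction equals the gold entity."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Toy example: the candidate set is identical in all three conditions,
# only the positions presented to the reranker change.
rng = random.Random(0)
toy_candidates = ["e1", "e2", "e3", "e4"]
views = {mode: reorder(toy_candidates, mode, rng)
         for mode in ("identity", "shuffle", "reverse")}
```

Because each view contains exactly the same entities, any change in Hits@1 across the views can only come from position. Applying the `shuffle` mode afresh each time a candidate list is drawn during training would correspond to the shuffled training regime, under the same assumptions.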


Results: The ordered-trained reranker reaches a Hits@1 of 0.310 in distribution but collapses to 0.059 under shuffled inference and to 0.011 under reversal, the latter below the 0.05 chance level of uniform guessing over the candidate list; the residual accuracy falls below that of the TransE retriever the reranker builds on. The shuffled-trained reranker is essentially flat across the three conditions (Hits@1 of 0.249, 0.247, and 0.244), reducing the across-condition spread by a factor of about sixty at a cost of approximately six absolute Hits@1 points.
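The spread and cost figures follow directly from the reported Hits@1 values, as the short Python check below shows. The variable names are hypothetical, and "spread" is taken here to mean the maximum minus the minimum Hits@1 across the three orderings, which is an assumption about the definition rather than something the abstract states.

```python
# Reported Hits@1 values for each training regime across the three orderings.
ordered_trained = [0.310, 0.059, 0.011]
shuffled_trained = [0.249, 0.247, 0.244]


def spread(scores):
    """Max-minus-min Hits@1 across orderings (assumed definition of spread)."""
    return max(scores) - min(scores)


# Ratio of the two spreads: 0.299 / 0.005, i.e. roughly sixty.
spread_ratio = spread(ordered_trained) / spread(shuffled_trained)

# Accuracy given up by shuffled training, comparing the best score of each regime.
cost_in_points = (max(ordered_trained) - max(shuffled_trained)) * 100

# A 0.05 chance level under uniform guessing implies this many candidates per list.
implied_list_size = round(1 / 0.05)
```

Under these assumptions the ratio comes out just under sixty and the cost at roughly six points, consistent with the figures quoted above; the implied list size of 20 is inferred from the stated chance level and is not given explicitly in the abstract.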


Conclusions: Apparent reasoning by LLM-based KGC rerankers is largely an artifact of candidate-position exploitation. Once the shortcut is blocked, the LLM contributes little beyond the retriever it sits on top of. Robustness to candidate ordering should be a routine evaluation criterion for any LLM-augmented retrieval system, and claims that such systems are reasoning models should be tempered and accompanied by a corresponding ordering control.
