Given the recent explosion of large language models (LLMs) that can make convincingly human-like statements, it's no surprise that there's been a growing focus on getting these models to explain how they make decisions. But how can we be sure that what they're saying is the truth?
In a new paper, researchers from Microsoft and MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) propose a novel method for measuring the "faithfulness" of LLM explanations—that is, how accurately an explanation represents the reasoning process behind the model's answer.
As lead author and Ph.D. student Katie Matton explains, faithfulness is no minor concern: if an LLM produces explanations that are plausible but unfaithful, users might develop false confidence in its responses and fail to recognize when recommendations are misaligned with their own values, like avoiding bias in hiring.