Large, pretrained language models have led to a flurry of new state-of-the-art results in many areas of natural language processing. However, recent work has also shown that such models tend to solve language tasks by relying on superficial cues found in benchmark datasets, rather than by acquiring the capabilities envisioned by the task designers. In this short opinion piece, I review a report by Niven & Kao (2019) of this so-called Clever Hans effect on an argument reasoning task and discuss possible measures for preventing it.