Coping with dumb LLMs using classic ML (softwaredoug.com)
In previous posts I used a local LLM to choose which of two products was more relevant for a search query (see this github repo). Using human labels from an open e-commerce search dataset (WANDS from Wayfair) as a baseline, I measure whether the LLM's preference between two products matches the preference of the human raters. If it does, then I can use my laptop as the search relevance judge.
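To make the setup concrete, here is a minimal sketch of that comparison. It is not the code from the repo: the `Pair` layout, the numeric human grades, and the `llm_prefers` callable are all assumptions made for illustration. The idea is just to count how often the LLM picks the same product the human labels favor, skipping pairs where the humans gave both products the same grade.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical record layout (an assumption, not the WANDS schema):
# (query, product_a, product_b, human_grade_a, human_grade_b)
# where a higher grade means humans judged the product more relevant.
Pair = Tuple[str, str, str, int, int]


def human_preference(grade_a: int, grade_b: int) -> Optional[str]:
    """Return "a" or "b" for the product humans graded higher, or None on a tie."""
    if grade_a > grade_b:
        return "a"
    if grade_b > grade_a:
        return "b"
    return None


def agreement(pairs: List[Pair], llm_prefers: Callable[[str, str, str], str]) -> float:
    """Fraction of non-tied pairs where the LLM's pick matches the human labels.

    `llm_prefers(query, product_a, product_b)` is a stand-in for the local LLM
    call and is expected to return "a" or "b".
    """
    agree = 0
    total = 0
    for query, product_a, product_b, grade_a, grade_b in pairs:
        expected = human_preference(grade_a, grade_b)
        if expected is None:
            continue  # humans expressed no preference, so there is nothing to match
        total += 1
        if llm_prefers(query, product_a, product_b) == expected:
            agree += 1
    return agree / total if total else 0.0
```

In practice `llm_prefers` would wrap a prompt to the local model. Swapping in a trivial stub (for example, one that always answers "a") gives a quick baseline to compare the LLM's agreement against.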