Evaluating Large Language Models (LLMs) like ChatGPT is tricky compared to traditional machine learning models. This is mainly due to the lack of clear "ground truth" data that models can be objectively measured against.
In traditional ML/DL systems, ground truths act as the benchmark or "correct answers" that models are evaluated against to determine their performance and accuracy. However, for generative LLMs that produce free-form text, there is no single correct output to compare predictions to. This makes evaluating LLMs more like judging a creative writing essay: subjective and nuanced.
So while ground truths and metrics like accuracy work well for ML models with definitive right/wrong outputs, they are ill-suited for generative models like LLMs, whose evaluation criteria are fuzzier. New, customized methods are needed to properly assess LLMs beyond precision and recall.
In traditional ML/DL, ground truths are an essential part of model building: they are used to measure the quality of a model's predictions or classifications. These evaluations determine which model experiments should be deployed to production, and they help teams sample and annotate production data to identify cohorts of low model performance to improve upon.
LLMs, on the other hand, operate in realms where defining a clear ground truth is challenging. They are often tasked with producing human-like text for which there is no single right answer to do a word-for-word comparison against. There is no "correct" marketing copy or "correct" sales email to benchmark your application against.
One way to evaluate LLM-generated responses is to involve humans and ask them to rate the quality of the responses. This is essentially what most of us do when we manually look at the responses from two models or two prompts and decide which one looks better. Although quick to start with, this process becomes time-consuming and highly subjective very quickly. The fact that RLHF trains an LLM-based reward model to score responses during training, rather than relying on humans to rate every output, underscores this limitation and the need to explore other options.
LLMs can be used to evaluate their own outputs by having them check responses against specific criteria rather than grading overall correctness. This involves breaking down evaluation into smaller, simpler tasks that are more concrete for the LLM to assess. For example, the LLM can check whether a response utilizes context appropriately or follows certain guidelines. Rather than asking if a marketing email is good overall, the LLM can verify whether the email matches the brand's tone.
By evaluating on specific dimensions like being grounded in the provided context and adhering to rules, LLMs can produce reliable scores that help developers gauge performance. The key is decomposing a complex, subjective assessment into discrete objectives an LLM can accurately judge. This allows developers to aggregate scores on multiple factors to determine overall quality.
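As a rough sketch of this decomposed approach, the snippet below asks a judge model one narrow yes/no question per criterion and averages the results. It assumes an OpenAI-style chat completions client; the model name, criteria, and strict-JSON prompt are illustrative choices rather than a prescribed setup.

```python
# Minimal LLM-as-judge sketch: score a response on narrow, concrete criteria
# and aggregate the results. Model name and criteria are illustrative only.
import json
from openai import OpenAI  # assumes the `openai` package and an OPENAI_API_KEY

client = OpenAI()

CRITERIA = {
    "grounded_in_context": "Does the response only use facts present in the provided context?",
    "matches_brand_tone": "Is the response friendly and informal, matching the brand's tone?",
    "follows_guidelines": "Does the response avoid mentioning pricing information?",
}

def judge(context: str, response: str) -> dict:
    """Ask the judge LLM one yes/no question per criterion and return 0/1 scores."""
    scores = {}
    for name, question in CRITERIA.items():
        prompt = (
            f"Context:\n{context}\n\nResponse:\n{response}\n\n"
            f"Question: {question}\n"
            'Answer strictly with JSON: {"pass": true} or {"pass": false}'
        )
        completion = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative choice of judge model
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        # A production version would parse the output more defensively.
        scores[name] = 1 if json.loads(completion.choices[0].message.content)["pass"] else 0
    # Aggregate the per-criterion scores into a single quality signal.
    scores["overall"] = sum(scores.values()) / len(CRITERIA)
    return scores
```

Keeping each judgment narrow and running it at temperature 0 makes the individual scores easier to trust, and the aggregate can be tracked across prompt or model versions.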
The evaluation of LLMs varies based on the application. For example, marketing messages may consider creativity, brand tone, appropriateness, and conciseness as key metrics. Customer support chatbots focus on hallucinations, politeness, and response completeness. Code generation applications emphasize syntax accuracy, complexity, and code quality.
Some frameworks have categorized evaluation dimensions into areas such as logical thinking, background knowledge, and user alignment. The LLM-Eval paper introduces a framework with dimensions like correctness, engagement, and fluency that leverages the reasoning capabilities of LLMs. Similarly, FLASK breaks down evaluations into logical thinking, background knowledge, problem handling, and user alignment.
Performance on these dimensions depends on the underlying LLM architecture, fine-tuning techniques, and prompt engineering. But having clear evaluation criteria tailored to the application is key to benchmarking LLMs.
Evaluating whether an LLM understands the given task and makes use of the provided context is critical. This involves assessing the response's appropriateness, completeness, and grounding in the given information.
Response appropriateness evaluates whether the LLM's response correctly follows the user's intent, answers all aspects of the query, and avoids irrelevant information. Completeness checks that the response fully addresses the entire user input. Relevancy examines whether the response contains tangential or unnecessary details. Structural integrity verifies that the response has the required format or schema. Semantic similarity assesses the alignment between the user's query and the LLM's response.
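As one concrete example, the semantic similarity dimension can be approximated with sentence embeddings. The sketch below uses the sentence-transformers library; the model name and any pass/fail threshold you apply to the score are assumptions, not recommendations.

```python
# Rough semantic-similarity check between a user query and an LLM response
# using sentence embeddings (model choice and threshold are illustrative).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(query: str, response: str) -> float:
    """Return the cosine similarity between embeddings of the query and the response."""
    query_emb, response_emb = model.encode([query, response], convert_to_tensor=True)
    return util.cos_sim(query_emb, response_emb).item()

score = semantic_similarity(
    "How do I reset my password?",
    "You can reset your password from the account settings page.",
)
print(f"semantic similarity: {score:.2f}")  # e.g., flag responses below ~0.5 for review
```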
Context awareness metrics evaluate whether the LLM response is grounded in the given context rather than hallucinating facts. Key dimensions include factual accuracy, context alignment, quality of the retrieved context, adequacy of contextual information, and a context utilization score. These evaluations are vital for retrieval-augmented LLM applications.
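A very crude way to approximate grounding without another model call is to measure how much of the response's content also appears in the retrieved context. The sketch below is only a lexical proxy (an LLM judge or an entailment model is usually more reliable), and the stop-word list is a stand-in.

```python
# Crude lexical proxy for context grounding: what fraction of the response's
# content words also appear in the retrieved context? Low values can flag
# possible hallucinations for closer (LLM- or human-based) review.
import re

STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "on", "for", "it"}

def _content_words(text: str) -> set[str]:
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS}

def grounding_score(response: str, context: str) -> float:
    """Fraction of the response's content words that occur in the retrieved context."""
    response_words = _content_words(response)
    if not response_words:
        return 1.0  # nothing substantive to check
    return len(response_words & _content_words(context)) / len(response_words)
```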
Ensuring high-quality language output is a key aspect of LLM evaluation. This category of metrics focuses on assessing language use at both a general and a task-specific level.
On a general level, metrics like grammatical correctness, fluency, and coherence evaluate universal attributes of language quality. Since LLMs are trained on massive amounts of natural text data, they often perform well on these baseline language quality metrics. However, other universal dimensions such as toxicity and fairness require careful monitoring and refinement via techniques like reinforcement learning from human feedback.
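For the toxicity dimension specifically, open-source classifiers can provide a first-pass signal before any human review. The sketch below assumes the detoxify package is installed; the 0.5 flagging threshold is an arbitrary cutoff.

```python
# First-pass toxicity screening with an open-source classifier.
# Assumes `pip install detoxify`; the 0.5 threshold is an arbitrary cutoff.
from detoxify import Detoxify

detector = Detoxify("original")  # downloads a pretrained toxicity model

def flag_toxic(responses: list[str], threshold: float = 0.5) -> list[bool]:
    """Return True for responses whose predicted toxicity exceeds the threshold."""
    scores = detector.predict(responses)["toxicity"]
    return [score > threshold for score in scores]

print(flag_toxic(["Thanks for reaching out, happy to help!"]))  # expect [False]
```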
There are also task-dependent language quality factors that relate specifically to the application context. For conversational agents, important metrics include the tonality, creativity, and interestingness of responses. Ensuring an appropriate tone that fits the brand persona is critical. Metrics need to align with goals, so a friendly sales chatbot requires different assessments than a formal transactional system. Defining the right custom language quality metrics is key for LLMs.
LLMs excel at complex reasoning tasks involving understanding user intent, extracting context-based knowledge, analyzing response options, and crafting concise, correct replies. Models with enhanced reasoning capabilities outperform others. Relevant dimensions include logical correctness (reaching the right conclusions and staying consistent under minor input changes), logical efficiency (taking the shortest solution path), and common-sense understanding (grasping everyday concepts).
Typically, starting with a strong model and using prompt techniques like chain-of-thought is the best one can do here. Chain-of-thought prompting asks the LLM to explain its reasoning step by step, which improves logical coherence and common-sense reasoning. Assessing reasoning allows developers to identify gaps in knowledge or flaws in logic before deployment.
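For illustration, one common chain-of-thought pattern is to ask for numbered reasoning steps followed by a clearly marked final answer, which also makes the conclusion easy to extract for scoring. The template and parsing below are just one possible convention.

```python
# One common chain-of-thought prompt pattern: ask for step-by-step reasoning
# before the final answer, then evaluate only the final line.
COT_TEMPLATE = (
    "Question: {question}\n"
    "Think through the problem step by step, numbering each step.\n"
    "Finish with a single line in the form 'Final answer: <answer>'."
)

def extract_final_answer(llm_output: str) -> str:
    """Pull the final answer out of a chain-of-thought response for scoring."""
    for line in reversed(llm_output.strip().splitlines()):
        if line.lower().startswith("final answer:"):
            return line.split(":", 1)[1].strip()
    return llm_output.strip()  # fall back to the whole output if no marker is found
```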
Many applications require customized metrics tailored to their specific needs. For instance, a customer support chatbot might need a custom metric to check whether it follows a developer-defined guideline of not mentioning pricing-related information in the response. An LLM-based link-sharing app might require verifying trusted domains. An article summarization bot may want to assess the presence of specific article elements.
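Checks like these are often simple deterministic functions rather than model calls. The sketch below implements the first two examples (a no-pricing guideline and a trusted-domain check); the regular expression and the domain allowlist are purely hypothetical.

```python
# Two illustrative custom metrics: a "no pricing talk" guideline check and a
# trusted-domain check for links. The pattern and allowlist are assumptions.
import re
from urllib.parse import urlparse

PRICING_PATTERN = re.compile(r"(\$\s?\d|price|pricing|discount|per month)", re.IGNORECASE)
TRUSTED_DOMAINS = {"docs.example.com", "support.example.com"}  # hypothetical allowlist

def mentions_pricing(response: str) -> bool:
    """Return True if the response appears to discuss pricing."""
    return bool(PRICING_PATTERN.search(response))

def links_are_trusted(response: str) -> bool:
    """Return True only if every URL in the response points at an allowed domain."""
    urls = re.findall(r"https?://\S+", response)
    return all(urlparse(url).netloc in TRUSTED_DOMAINS for url in urls)
```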
In LLM-powered applications, as with any machine learning system, there is no universal solution. It's vital to define custom evaluation criteria that closely align with the business goals to accurately measure performance. Custom metrics allow developers to precisely evaluate how well the LLM is performing for their particular use case. Rather than relying solely on generic metrics, custom evaluations provide key insights into whether the LLM is generating high-quality, business-relevant outputs.
Evaluating LLMs is a complex yet necessary task that involves using both predefined metrics and custom evaluations tailored to your specific application. While traditional NLP metrics like BLEU, ROUGE, and perplexity provide useful signals, the unique generative capabilities of large language models require going beyond these and defining new metrics that capture different aspects of model performance.
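When reference outputs do exist (for example, human-written summaries), the traditional overlap metrics are straightforward to compute. The sketch below uses the rouge-score package; the reference and candidate strings are placeholders.

```python
# Computing a traditional overlap metric (ROUGE) against a reference text.
# Only applicable when a reference output exists, e.g., human-written summaries.
from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)

reference = "The model struggled with long documents and missed key details."
candidate = "The model missed key details when summarizing long documents."

scores = scorer.score(reference, candidate)  # signature: score(target, prediction)
print(scores["rouge1"].fmeasure, scores["rougeL"].fmeasure)
```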
Evaluating linguistic quality through metrics for fluency, coherence, and grammar is a good start. Additionally, given LLMs' tendency to hallucinate, it is critical to assess factual accuracy, grounding in context, and avoidance of false information. Custom criteria closely aligned with business goals are key to accurately measuring real-world performance.
By combining a set of universal metrics with custom evaluations for your particular use case, you can build a rigorous framework to thoroughly assess your LLM applications. This multifaceted approach helps identify areas needing improvement and paves the path to unlocking the full potential of your large language model. Monitoring all aspects of model behavior is complex yet necessary for deploying high-quality LLM applications.