Evaluating LLM generated SQL

An analysis of current approaches
Mohammadreza Pourreza
August 17, 2023

Turning data into knowledge

Dataherald helps business teams query and understand data using plain English – no SQL required. If you are looking for an AI tool that enables question answering directly from your company’s databases, we encourage you to join our waitlist and explore our open-source engine on GitHub.
Background

For a text-to-SQL engine, evaluation plays a critical role as a post-processing step following SQL query generation. Its primary objective is to provide insight into how confident the agent is in the generated query. The Evaluation module takes the question and the corresponding generated SQL query as inputs, and its output is a confidence score in the range of 0 to 1, signifying the level of certainty associated with the generated SQL query.

Evaluation serves as a mechanism to assess the reliability and quality of the generated SQL query. By leveraging various techniques and methodologies, an Evaluation module aims to measure the accuracy, coherence, and validity of the generated query in relation to the input question. It helps gauge the agent’s understanding of the user’s intent and its ability to convert natural language questions into accurate SQL representations.
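To make this contract concrete, here is a minimal sketch of what such an Evaluation module might look like in Python. The class and method names are illustrative only, not the actual Dataherald engine interface.

```python
class Evaluator:
    """Post-processing step that scores a generated SQL query against the original question."""

    def evaluate(self, question: str, generated_sql: str) -> float:
        """Return a confidence score in [0, 1] for the (question, SQL) pair.

        Concrete evaluators would implement this with one of the white-box or
        black-box techniques discussed below.
        """
        raise NotImplementedError
```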

Uncertainty vs confidence

In the machine learning literature, there are two closely related yet distinct concepts: uncertainty and confidence. The two are often mistakenly used interchangeably, so let’s first clarify the difference between them.

Given a machine learning model and a specific input X = x, uncertainty U(x) is a score that depends on the input alone: it quantifies how unsure the model is about what to predict for that input, without reference to any particular output.

Confidence scores C(x, y), on the other hand, are associated with both the input and a corresponding prediction. They reflect how much the model trusts the specific answer y it generated for the input x.
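In code, the distinction is simply a matter of what the scoring function takes as input. A hypothetical sketch (the interface names are ours, not from any library):

```python
from typing import Protocol


class UncertaintyEstimator(Protocol):
    def uncertainty(self, question: str) -> float:
        """U(x): depends on the input alone -- how unsure the model is about answering it."""
        ...


class ConfidenceEstimator(Protocol):
    def confidence(self, question: str, generated_sql: str) -> float:
        """C(x, y): depends on the input and a specific prediction -- how much to trust this answer."""
        ...
```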

In the rest of this article, we will delve into various methods for measuring both uncertainty and confidence, specifically in the context of translating natural language to SQL with large language models (LLMs).

Black-box vs white-box methods

When it comes to measuring uncertainty and confidence in LLMs, two distinct evaluation approaches exist:

  • Black-box approaches: These methods view the model as a black box, assuming limited access to internal hidden states or logits. Black-box approaches are particularly relevant for closed-source LLMs, such as GPT-4, where internal weights and parameters are not accessible. In such cases, evaluation relies on external observations or inputs to estimate uncertainty and confidence.
  • White-box approaches: These methods rely on strong mathematical theories and leverage internal parameters of the models to measure uncertainty and confidence. By utilizing the model’s internal mechanisms, white-box approaches provide deeper insights into the uncertainty and confidence levels. These approaches are commonly employed when access to the internal workings of the LLM, including weights and parameters, is available.

By distinguishing between black-box and white-box evaluation approaches, we can adopt suitable methods based on the availability of internal model details and the specific LLM being analyzed. Each approach offers its unique strengths for assessing uncertainty and confidence in LLMs.

White-box methods for LLMs

With the emergence of closed-source LLMs (such as commercialized LLM APIs), white-box methods have become harder to apply. With limited access to these models’ internal logits and weights, traditional white-box approaches cannot be used effectively, so the set of available white-box methods is significantly restricted.

1) Average logprobs

This method applies to OpenAI LLMs that are accessible through the Completions API (it is still not applicable to chat models such as GPT-4). The API has a logprobs parameter, described in the documentation as follows:

logprobs: Include the log probabilities on the logprobs most likely tokens, as well as the chosen tokens. For example, if logprobs is 5, the API will return a list of the 5 most likely tokens. The API will always return the logprob of the sample token, so there may be up to logprobs+1 elements in the response.


This parameter can be used to estimate the probability of the generated response from the average of the token log probabilities:

\[
P(s) = \exp\left(\frac{1}{n_s} \sum_{i=1}^{n_s} \log p\left(w_{s,i} \mid w_{s,<i}\right)\right)
\]

where $s$ is the response generated by the model, $n_s$ is the number of tokens in $s$, and $w_{s,i}$ represents the $i$-th token in $s$.

This probability associated with the generated response can serve as a confidence score for assessing its quality and reliability.
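Below is a minimal sketch of the method using the legacy (pre-1.0) openai Python SDK’s Completions endpoint. The model name and prompt are placeholders, and the exponential simply maps the average log probability back to a score in (0, 1].

```python
import math

import openai  # legacy (pre-1.0) SDK exposing the Completions endpoint


def average_logprob_confidence(prompt: str, model: str = "text-davinci-003") -> float:
    """Generate a completion and score it with the exponentiated average token logprob."""
    response = openai.Completion.create(
        model=model,
        prompt=prompt,
        max_tokens=256,
        temperature=0,
        logprobs=1,  # ask the API to return log probabilities for the sampled tokens
    )
    token_logprobs = response["choices"][0]["logprobs"]["token_logprobs"]
    logprobs = [lp for lp in token_logprobs if lp is not None]  # the first entry can be None when echoing the prompt
    mean_logprob = sum(logprobs) / len(logprobs)
    return math.exp(mean_logprob)  # geometric-mean token probability, in (0, 1]
```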

Limitations:
  1. The score can be unfairly impacted by the length of the generated response: a long response containing many predictable tokens will have its average logprob skewed toward higher values. Consider, for instance, the token avg appearing in a SQL query generated by an LLM. The next token will be “(” with very high probability, and this predictable token substantially increases the average logprob of the response. A response containing many such predictable tokens will receive a higher score whether or not it actually reflects the intended question.
  2. This approach is only applicable to LLMs such as text-davinci-003 that are accessible through the Completions API; the Chat Completions API does not expose the logprobs parameter.

Black-box methods

In recent years, approaches have emerged for measuring the confidence and uncertainty of LLMs without access to their internal states. These black-box methods fall into three primary categories:

  • Consistency methods: Consistency methods are inspired by ensemble learning techniques and are built upon their underlying theories. These methods generate a set of responses for a given question using ensemble techniques. Subsequently, post-processing techniques are employed to analyze the consistency and uncertainty metrics derived from this response set.
  • Verbal approaches: Verbal approaches capitalize on the LLMs’ inherent capabilities to assess the confidence of generated answers. These methods involve providing both the input and the corresponding generated response to the LLM and explicitly requesting the model to provide a confidence score for the generated answer.
  • Hybrid methods: These methods combine the verbalized-confidence signal with the consistency signal to improve accuracy. By leveraging the strengths of both, hybrid methods have demonstrated better performance and reliability, though the combination comes with increased computational cost and latency.

In general, consistency methods measure confidence and uncertainty more accurately than verbal approaches, while verbal approaches are faster and more cost-effective.

Consistency Methods
1) Clustering based on Jaccard similarity

This approach involves categorizing sample responses into distinct clusters based on Jaccard similarity. Jaccard similarity quantifies the similarity between two sets of data by identifying shared and distinct members. In this context, the sets represent the generated samples, while the items within the sets represent the words or tokens used in those samples.

By setting a threshold for the similarity of the generated samples, we can cluster samples with similarity above the threshold together. The number of resulting clusters can serve as an indication of uncertainty. Additionally, the ratio between the number of members within each cluster and the total number of generated samples represents the confidence score for the members of that specific cluster.
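A minimal sketch of this clustering scheme, treating each generated SQL string as a bag of whitespace-separated tokens; the greedy single-pass clustering and the 0.8 threshold are illustrative choices, not prescribed by the method.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two generated SQL strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    return len(set_a & set_b) / len(set_a | set_b)


def cluster_by_jaccard(samples: list[str], threshold: float = 0.8) -> list[list[str]]:
    """Greedily assign each sample to the first cluster whose representative is similar enough."""
    clusters: list[list[str]] = []
    for sample in samples:
        for cluster in clusters:
            if jaccard(sample, cluster[0]) >= threshold:
                cluster.append(sample)
                break
        else:
            clusters.append([sample])
    return clusters


samples = [
    "SELECT avg(price) FROM cars",
    "SELECT AVG(price) FROM cars",
    "SELECT max(price) FROM cars",
]
clusters = cluster_by_jaccard(samples)
uncertainty = len(clusters)                   # more clusters => higher uncertainty
confidence = len(clusters[0]) / len(samples)  # confidence for members of the first cluster
```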

Limitations:
  1. The major limitation of this approach is that it ignores word and token order.
  2. Another problem is its inability to capture crucial expressions such as negation (this is less of an issue for SQL queries).

2) Traditional clustering approaches

Another approach for identifying similar samples leverages traditional clustering methods, such as spectral clustering, on the generated samples. Recent research papers have explored spectral clustering by treating each generated response as a node in a graph and constructing a symmetric weighted adjacency matrix that captures the relationships between the responses.

To define the uncertainty and confidence, these papers utilize the normalized graph Laplacian and degree matrix derived from the adjacency matrix. By analyzing the structure of the graph, they can quantify the level of uncertainty and confidence associated with the generated responses.

To determine the weights in the adjacency matrix, natural language inference (NLI) models, such as an off-the-shelf DeBERTa-large model, are employed. NLI classifiers are designed to predict scores (logits) for three distinct classes: entailment, neutral, and contradiction. In this context, the predicted probabilities from the NLI classifier serve as a measure of similarity between the generated responses, and subsequently, these probabilities are used to compute weights in the adjacency matrix.

By leveraging spectral clustering and NLI models, this approach offers a robust methodology for quantifying similarity, uncertainty, and confidence among the generated samples. The combination of graph-based analysis and NLI techniques allows for a comprehensive understanding of the relationships and confidence levels within the generated response set.
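The sketch below follows that recipe at a high level: pairwise entailment probabilities from an off-the-shelf NLI model (microsoft/deberta-large-mnli is an assumed checkpoint here) form a symmetric adjacency matrix, and the eigenvalues of the normalized graph Laplacian yield an uncertainty score roughly equal to the effective number of semantic clusters. The exact scoring formulas in the cited work differ in details; treat this as an approximation, not the reference implementation.

```python
import numpy as np
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "microsoft/deberta-large-mnli"  # assumed off-the-shelf NLI checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
nli = AutoModelForSequenceClassification.from_pretrained(MODEL)


def entailment_prob(premise: str, hypothesis: str) -> float:
    """P(entailment) from the NLI classifier, used as a pairwise similarity weight."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(nli(**inputs).logits, dim=-1)[0]
    return probs[nli.config.label2id["ENTAILMENT"]].item()  # label name per this checkpoint's config


def spectral_uncertainty(samples: list[str]) -> float:
    """Effective number of semantic clusters among sampled responses (higher = more uncertain)."""
    n = len(samples)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            # Symmetrize by averaging entailment probabilities in both directions
            W[i, j] = 0.5 * (entailment_prob(samples[i], samples[j])
                             + entailment_prob(samples[j], samples[i]))
    degrees = W.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(degrees))
    laplacian = np.eye(n) - d_inv_sqrt @ W @ d_inv_sqrt  # normalized graph Laplacian
    eigvals = np.linalg.eigvalsh(laplacian)
    return float(np.clip(1.0 - eigvals, 0.0, None).sum())  # sum of (1 - eigenvalue)_+
```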

Verbal approaches
1) Vanilla verbalized confidence

The most direct approach to obtain verbalized confidence from LLMs is by explicitly requesting them to produce a confidence score within a range of 0 to 1.

However, it has been observed that this approach is often ineffective in accurately estimating confidence. LLMs tend to exhibit overconfidence in their generated answers when relying solely on this method.
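An illustrative prompt for eliciting a vanilla verbalized confidence score (the wording is an assumption, not a prescribed template):

```python
VANILLA_CONFIDENCE_PROMPT = """\
Question: {question}
Generated SQL: {generated_sql}

How confident are you that the SQL query above correctly answers the question?
Reply with a single number between 0 and 1.
"""
```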

2) Reasoning-Enhanced Verbalized Confidence (CoT-based)

The effectiveness of the chain-of-thought (CoT) approach in stimulating reasoning in LLMs and improving accuracy across various datasets is well documented: LLMs can be prompted to articulate their reasoning process by including the phrase “Let’s think step by step” in the prompt. A possible remedy for the overconfidence of the vanilla approach above is therefore to use the zero-shot CoT method, letting the model break the problem down and reason its way to a confidence score.
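A sketch of how the previous prompt might be extended with zero-shot CoT; the exact wording and the checks listed are assumptions:

```python
COT_CONFIDENCE_PROMPT = """\
Question: {question}
Generated SQL: {generated_sql}

Let's think step by step about whether this SQL query answers the question:
check the selected columns, the joined tables, the filters, and the aggregations.
After your reasoning, output a final confidence score between 0 and 1 on its own line.
"""
```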

3) Top-K Verbalized Confidence

This method involves prompting LLMs to generate the top K guesses for a given question. Each guess is accompanied by a corresponding probability, representing the confidence level of the model’s prediction. The probabilities are expressed as percentages ranging from 0% to 100%.

Among the generated guesses, the one assigned the highest probability is selected as the final answer. Additionally, the associated probability serves as the verbalized confidence for the final answer. This probability score provides an indication of the model’s confidence in the accuracy and correctness of the chosen response.

By leveraging this approach, we can obtain a single response along with a quantitative measure of confidence. The higher the probability score, the greater the model’s confidence in the selected answer.
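A sketch of a Top-K prompt (here K = 3) together with a helper that picks the highest-probability guess; the prompt wording and the expected output format are assumptions:

```python
import re

TOP_K_PROMPT = """\
Question: {question}

Provide your 3 best guesses for the SQL query that answers the question, each with
the probability (0% to 100%) that it is correct, in the format:
G1: <sql> -- P1: <probability>%
G2: <sql> -- P2: <probability>%
G3: <sql> -- P3: <probability>%
"""


def pick_best_guess(llm_output: str) -> tuple[str, float]:
    """Parse the guesses and return (sql, confidence) for the highest-probability one."""
    pattern = r"G\d+:\s*(.+?)\s*--\s*P\d+:\s*(\d+(?:\.\d+)?)%"
    guesses = [(sql, float(p) / 100.0) for sql, p in re.findall(pattern, llm_output)]
    return max(guesses, key=lambda g: g[1])
```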

Hybrid methods

Verbalized confidence and consistency-based methods provide valuable insights into estimating a model’s confidence in its outputs. However, relying solely on either verbalized confidence or answer consistency is not comprehensive enough to capture the true underlying uncertainty of the model’s predictions. There are two main reasons for this limitation:

  1. Verbalized confidences given by LLMs tend to be highly overconfident and may not accurately reflect the true uncertainty associated with the model’s predictions. Relying solely on verbalized confidence can therefore misrepresent the actual confidence level.
  2. Consistency-based methods, while useful in capturing certain aspects of uncertainty, may not fully capture fine-grained variations and can still suffer from miscalibration. They might overlook subtle nuances and uncertainties present in the model’s predictions, thus limiting their ability to provide a comprehensive confidence estimation.

To address these limitations, integrating both verbalized confidence and consistency-based approaches can offer a more robust estimation of the model’s confidence. By combining these orthogonal perspectives, we can leverage the strengths of each method and mitigate their individual weaknesses. This integration allows for a more comprehensive and nuanced understanding of the model’s confidence levels, capturing both the overconfidence tendencies of verbalized confidence and the nuanced uncertainties that consistency-based methods may overlook.
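The simplest way to express this combination is a weighted interpolation of the two signals. The sketch below is illustrative, not the specific calibration scheme proposed in the cited papers.

```python
def hybrid_confidence(verbalized: float, consistency: float, weight: float = 0.5) -> float:
    """Blend the verbalized confidence with the consistency-based score.

    `weight` trades off the two signals and would normally be tuned on a
    held-out set of questions with known-correct SQL.
    """
    return weight * verbalized + (1.0 - weight) * consistency
```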

References

From words to code

Just ask for calibration

Can LLMs Express Their Uncertainty?

The Internal State of an LLM Knows When It’s Lying

Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models
