Text-to-SQL Benchmarks and the Current State-of-the-Art
Technical blog on evaluation and benchmarking of Text-to-SQL
Turning data into knowledge
The sudden leap forward in the natural language capabilities of AI models like GPT-4 and Bard has opened the door to rapid progress on many NLP use cases. One such use case is text-to-SQL: generating SQL queries to answer questions posed in natural language.
With multiple approaches and solutions flooding the market, we are left with the problem of evaluation. Which approach is most efficient? Which one produces accurate answers more reliably? Which one adapts best to different datasets? To help answer these questions, the open-source community and academia have put forth several benchmarks, and the three most widely used today are WikiSQL, Spider, and BIRD.
Introduced by Salesforce in late 2017, WikiSQL was the first truly large compendium of data built for the text-to-SQL use case. However, it has one major drawback: simplicity.
All of the provided SQL queries are exceedingly simple, with only SELECT, FROM, and WHERE clauses. Furthermore, the tables in the dataset have no linkages to other tables. Although models trained on WikiSQL can still work on new databases, they can only answer simple natural language questions that then translate into simple SQL queries.
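To make that limitation concrete, here is a WikiSQL-style question/query pair run against a single standalone table. The table and question are invented for illustration, not drawn from the actual dataset:

```python
import sqlite3

# A single, standalone table -- WikiSQL databases have no foreign keys
# linking tables together.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE olympics (athlete TEXT, country TEXT, medals INTEGER)")
conn.executemany("INSERT INTO olympics VALUES (?, ?, ?)",
                 [("A. Runner", "Kenya", 3), ("B. Swimmer", "USA", 5)])

# Question: "How many medals does B. Swimmer have?"
# A WikiSQL-style answer stays within SELECT, FROM, and WHERE --
# no joins, no grouping, no nesting.
sql = "SELECT medals FROM olympics WHERE athlete = 'B. Swimmer'"
result = conn.execute(sql).fetchone()[0]
print(result)  # 5
```

Every query in the dataset follows this same shallow shape, which is exactly why models trained on it plateau.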
For this reason, most of the recent research in the world of text-to-SQL focuses on more complex benchmarks. In fact, the WikiSQL leaderboard only has submissions from 2021 or earlier. With multiple submissions achieving a test accuracy of over 90% (with the best-performing submission reaching 93%), practitioners are now shifting focus to much more complex query generation, for which WikiSQL falls woefully short.
- Size: 154.74 MB
- Data points: 87,726 unique question-SQL pairs
- Databases: 24,241 tables from Wikipedia
- Domains: 1
The Spider dataset aims to address the shortcomings of WikiSQL. Developed by 11 Yale students over more than 1,000 hours of work, Spider introduces two critical elements: complexity and cross-domain generalization.
- Complexity: The SQL queries go beyond the straightforward SELECT and WHERE clauses that WikiSQL is limited to, covering the more complex GROUP BY, ORDER BY, and HAVING clauses along with nested queries. Furthermore, all databases have multiple tables linked through foreign keys, allowing for complicated queries that join across tables.
- Cross-domain generalization: With 200 complex databases spanning 138 domains, Spider is able to include unseen databases in the test set, allowing us to test a model's generalizability.
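A Spider-style query, by contrast, can span several linked tables at once. The schema below is a made-up illustration (not one of Spider's 200 databases) showing the kind of JOIN, GROUP BY, and HAVING combination WikiSQL never exercises:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE departments (dept_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE employees (
    emp_id INTEGER PRIMARY KEY,
    dept_id INTEGER REFERENCES departments(dept_id),  -- foreign key link
    salary REAL
);
INSERT INTO departments VALUES (1, 'Sales'), (2, 'Engineering');
INSERT INTO employees VALUES (1, 1, 50000), (2, 2, 90000), (3, 2, 110000);
""")

# Question: "Which departments have an average salary above 60,000?"
# Answering requires a JOIN across the foreign key plus GROUP BY and HAVING.
sql = """
SELECT d.name
FROM departments d JOIN employees e ON d.dept_id = e.dept_id
GROUP BY d.name
HAVING AVG(e.salary) > 60000
"""
rows = [r[0] for r in conn.execute(sql)]
print(rows)  # ['Engineering']
```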
Evaluation of different submissions incorporates the following:
- Component Matching: Do the individual components of the SELECT, WHERE, GROUP BY, and ORDER BY clauses match? Are the extracted keywords correct?
- Exact Matching: Do all of the above components match exactly?
- Execution Accuracy: Is the answer correct?
- SQL Hardness: Queries are divided into four levels (easy, medium, hard, and extra hard) and weighted accordingly for the final evaluation.
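Of these, execution accuracy is the simplest to implement yourself: run both queries and compare their results. The sketch below is our own simplification (the helper name is invented, and comparing unordered row sets glosses over column-ordering subtleties that Spider's official evaluator handles):

```python
import sqlite3

def execution_match(conn, gold_sql, pred_sql):
    """Return True if the gold and predicted queries yield the same rows."""
    try:
        gold = sorted(conn.execute(gold_sql).fetchall())
        pred = sorted(conn.execute(pred_sql).fetchall())
    except sqlite3.Error:
        return False  # a query that fails to execute cannot match
    return gold == pred

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])

# Semantically equivalent queries written differently still match:
print(execution_match(conn, "SELECT x FROM t WHERE x > 1",
                      "SELECT x FROM t WHERE x >= 2"))  # True
```

Note that execution accuracy can reward a query that happens to return the right answer on one database instance for the wrong reasons, which is why Spider reports it alongside the matching-based metrics.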
There are a few variations of the Spider dataset that are used to evaluate the robustness and generalizability of models under different perturbations, such as Spider-Syn (used to test how well text-to-SQL models adapt to synonym substitution) and Spider-DK (tests how well text-to-SQL models incorporate added domain knowledge).
- Size: 919.2 MB
- Data points: 10,181 questions and 5,693 unique complex SQL queries
- Databases: 200
- Domains: 138
Spider does not include the performance of out-of-the-box LLMs on its leaderboard. The best-performing model comes from Mohammed Pourreza, a University of Alberta postgraduate student who has recently joined the Dataherald team! Mo's approach leverages a range of prompting techniques, including decomposed prompting and chain-of-thought prompting. By employing these strategies alongside GPT-4, his approach outperformed all other submissions, achieving an 85.3% execution accuracy on the hold-out test set of Spider.
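The exact prompts live in the paper, but the core idea of decomposed prompting is easy to sketch: rather than asking the model for SQL in one shot, the task is split into stages, each with its own prompt. The stage names and wording below are illustrative only, not Mo's actual prompts:

```python
# Toy sketch of decomposed prompting for text-to-SQL. Each stage gets its
# own prompt; a real pipeline would feed each stage's LLM output into the
# next stage's prompt.

def build_prompts(question, schema):
    return {
        "schema_linking": (
            f"Schema: {schema}\nQuestion: {question}\n"
            "List the tables and columns needed to answer the question."
        ),
        "decomposition": (
            f"Question: {question}\n"
            "Break the question into simpler sub-questions, one per line."
        ),
        "generation": (
            f"Schema: {schema}\nQuestion: {question}\n"
            "Using the linked schema and sub-questions above, think step "
            "by step, then write the final SQL query."
        ),
    }

prompts = build_prompts(
    "Which departments have more than 10 employees?",
    "departments(dept_id, name); employees(emp_id, dept_id)",
)
for stage, prompt in prompts.items():
    print(f"--- {stage} ---\n{prompt}\n")
```

The "think step by step" instruction in the final stage is where chain-of-thought prompting enters: the model is nudged to reason aloud before committing to SQL.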
The BIRD dataset was compiled by researchers from multiple global institutions to be more realistic than WikiSQL and Spider. Two features set it apart:
- Because the data was collected from real-world scenarios, the databases retain their original, "dirty" format.
- It also provides external knowledge alongside each question, much as real-world developers draw on metadata, documentation, or other existing context stores.
The BIRD dataset also underscores the importance of efficient queries. Its evaluation method is the first to include a Valid Efficiency Score (VES), a new metric designed to measure the efficiency of a provided SQL query alongside its usual execution correctness.
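Roughly, VES weights each correct prediction by its runtime relative to the gold query and gives incorrect predictions a score of zero. The sketch below is our own simplified reading of the metric, not BIRD's official evaluator (which averages runtimes over repeated runs, among other refinements):

```python
import math

def valid_efficiency_score(examples):
    """Simplified sketch of BIRD's Valid Efficiency Score (VES).

    Each example is (correct, gold_runtime, pred_runtime). Incorrect
    predictions contribute 0; correct ones are weighted by the relative
    efficiency sqrt(gold_runtime / pred_runtime), so a prediction that
    runs faster than the gold query can score above 1.
    """
    total = 0.0
    for correct, gold_t, pred_t in examples:
        if correct:
            total += math.sqrt(gold_t / pred_t)
    return total / len(examples)

# One wrong answer, one as fast as gold, one 4x slower than gold:
score = valid_efficiency_score([(False, 1.0, 1.0),
                                (True, 1.0, 1.0),
                                (True, 1.0, 4.0)])
print(round(score, 3))  # 0.5
```

The square root dampens the reward for raw speed, so a correct-but-slow query is penalized without letting efficiency dominate correctness.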
- Size: 33.4 GB
- Data points: 12,751 unique question-SQL pairs
- Databases: 95 (80 for training, 15 for evaluation)
- Domains: 37
For both EX (execution accuracy) and VES (Valid Efficiency Score), GPT-4 tops the current leaderboard. However, the famous LLM’s performance still leaves a lot to be desired, trailing far behind human performance.
The only other approach that even cracks the top five is a collaboration between Alibaba's DAMO (Academy for Discovery, Adventure, Momentum and Outlook), The University of Hong Kong, and the University of Illinois Urbana-Champaign, which builds on ChatGPT with chain-of-thought prompting. Even this solution, however, trails GPT-4 by a wide margin.
Advancements in LLMs have opened the door, but only ever so slightly, for complex problems like text-to-SQL. Sure, simple questions with simple answers can be handled quite reliably, but no solution currently on the market comes close to human performance on even slightly more complex queries. When we DO develop one, though, these benchmarks will be the first to let us know.