How to connect LLM to SQL database with LangChain SQLAgent

Technical tutorial for Natural Language to SQL connections
Dishen Wang
August 17, 2023

Turning data into knowledge

Dataherald helps business teams query and understand data using plain English – no SQL required. If you are looking for an AI tool that would enable question answering directly from your company’s databases, we encourage you to join our waitlist and explore our open source engine on Github” 
Introduction

In our last blog post we discussed the topic of connecting a PostGres database to Large Language Model (LLM) and provided an example of how to use LangChain SQLChain to connect and ask questions about a database. In this post we will do the same with a LangChain SQLAgent and compare the performance of the two approaches.

Why SQLAgent

Before jumping into the tutorial, let us first understand what is an agent and why it might be preferred over a simple SQLChain. An agent is a component that has access to a suite of tools, including a Large Language Model (LLM). Its distinguishing characteristic lies in its ability to make informed decisions based on user input, utilizing the appropriate tools until it achieves a satisfactory answer. For example in the context of text-to-SQL, the LangChain SQLAgent will not give up if there is an error in executing the generated SQL. Instead, it will attempt to recover by interpreting the error in a subsequent LLM call and rectify the issue. Therefore, in theory, SQLAgent should outperform SQLChain in productivity and accuracy.

Let us see how SQLAgent does against our real_estate database.

1. Getting started

If you read our first blog on how to use LangChain SQLChain to connect a LLM to a database, then you might already be familiar with some of the steps below. Feel free to skip around and follow the steps as needed.

In this tutorial we will be using OpenAI’s gpt-3.5-turbo model for our LLM model and Dataherald’s real_estate for our database. This tutorial will be using postgres as the sql dialect. If you are using a different sql dialect please check out the SQLAlchemy documentation on how to setup your database connection.

Let us first install the required packages. For postgres, make sure you have installed postgreSQL on your machine. To utilize OpenAI’s API, be sure to have an OpenAI account and its API key ready.

Install the following packages:

pip install langchain
pip install openai
pip install psycopg2

Next create a python file called main.py and import the following:

from langchain.agents import create_sql_agent
from langchain.agents.agent_toolkits import SQLDatabaseToolkit
from langchain.sql_database import SQLDatabase
from langchain.llms.openai import OpenAI
from langchain.agents import AgentExecutor
from langchain.agents.agent_types import AgentType
from langchain.chat_models import ChatOpenAI

2. Connect the database

For postgres databases, use the following format string for the database URI and declare the required variables above:

pg_uri = f"postgresql+psycopg2://{username}:{password}@{host}:{port}/{mydatabase}"

Now let’s setup our database connection:

db = SQLDatabase.from_uri(pg_uri)

3. Setup LLM

Since we will be using OpenAI’s gpt-3.5-turbo model, let us use our OpenAI account’s API key:

OPENAI_API_KEY = "your OpenAI key"

Let’s define our LLM model:

gpt = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY, model_name='gpt-3.5-turbo')

4. Setup Agent

Let us define our agent’s toolkit which will be used to answer the user question:

toolkit = SQLDatabaseToolkit(db=db, llm=gpt)


Then we can create our agent executor with agent type ZERO_SHOT_REACT_DESCRIPTION:

agent_executor = create_sql_agent(    
llm=gpt,    
toolkit=toolkit,    
verbose=True,    
agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION,)



For more information on available agent types and their characteristics, please checkout LangChain’s documentation here.

5. Ask a query

Finally, let us define a question and execute the agent with the question:

question = "Average rent in Chicago from Oct 2022 till Dec 2022"agent_executor.run(question)

Here are the results after we ran our program:

> Entering new  chain...Action: sql_db_list_tablesAction Input: ""Observation: redfin_homes_sold, zillow_home_value_index, redfin_median_list_price, redfin_median_ppsf, renthub_neighborhood_listings, zillow_observed_rent_index, zillow_rent, redfin_avg_sale_to_list, renthub_median_rent, zillow_home_value, redfin_inventory, redfin_new_listings, renthub_num_rental_listings, redfin_median_sale_price, redfin_pending_sales, renthub_average_rentThought:I can query the tables "renthub_average_rent" and "zillow_rent" to get the average rent price in Chicago. I will query the schema of these tables to determine the relevant columns.Action: sql_db_schemaAction Input: "renthub_average_rent, zillow_rent"Observation: CREATE TABLE renthub_average_rent (        period_start DATE NOT NULL,         period_end DATE NOT NULL,         period_type TEXT NOT NULL,         ...)/*3 rows from renthub_average_rent table:period_start    period_end      period_type     geo_type        property_type   location_name   dh_state_fips   dh_state_name   dh_state_abbr   dh_county_fips  dh_county_name  dh_county_fullname      dh_place_fips      dh_place_name   dh_place_fullname       dh_city_name    dh_city_fullname        dh_neighborhood_id      dh_neighborhood_name    dh_zip_code     is_seasonally_adjusted  metric_value    hg_date_updated2022-02-01      2022-02-28      monthly neighborhood    House   Far North       48      Texas   TX      085     Collin  Collin County   19000   Dallas  Dallas city     Dallas  Dallas city     273698  Far North  -       False   2543.752941176471       2023-05-30 14:57:38.8123532020-12-01      2020-12-31      monthly neighborhood    Apartment Building      Aurora Highlands        08      Colorado        CO      001     Adams   Adams County    04000   Aurora  Aurora city     Aurora     Aurora city     342766  Aurora Highlands        -       False   1306.857142857143       2023-05-30 14:57:38.8123532021-12-01      2021-12-31      monthly neighborhood    Apartment Building      Penrose 51      Virginia        VA      013     Arlington       Arlington County        03000   Arlington       Arlington CDP      Arlington       Arlington CDP   271765  Penrose -       False   2009.9245283018868      2023-05-30 14:57:38.812353*/CREATE TABLE zillow_rent (        period_start DATE NOT NULL,         period_end DATE NOT NULL,         period_type TEXT NOT NULL,         ...)/*3 rows from zillow_rent table:period_start    period_end      period_type     geo_type        is_seasonally_adjusted  property_type   location_name   location_fullname       dh_state_fips   dh_state_name   dh_state_abbr   dh_county_fips     dh_county_name  dh_county_fullname      dh_place_fips   dh_place_name   dh_place_fullname       dh_city_name    dh_city_fullname        dh_zip_code     metric_value    hg_date_updated*/Thought:I can query the "renthub_average_rent" and "zillow_rent" tables to get the average rent price in Chicago. The relevant columns are "metric_value" in the "renthub_average_rent" table and "metric_value" in the "zillow_rent" table. I will write a query to calculate the average rent price from October 2022 to December 2022 in Chicago.Action: sql_db_query_checkerAction Input: "SELECT AVG(metric_value) FROM renthub_average_rent WHERE location_name = 'Chicago' AND period_start >= '2022-10-01' AND period_end <= '2022-12-31'"Observation: The original query does not contain any common mistakes. Here is the reproduced original query:```sqlSELECT AVG(metric_value) FROM renthub_average_rent WHERE location_name = 'Chicago' AND period_start >= '2022-10-01' AND period_end <= '2022-12-31';```Thought:The query is correct and ready to be executed to get the average rent price in Chicago from October 2022 to December 2022.Action: sql_db_queryAction Input: "SELECT AVG(metric_value) FROM renthub_average_rent WHERE location_name = 'Chicago' AND period_start >= '2022-10-01' AND period_end <= '2022-12-31'"Observation: [(2344.4993233104487,)]Thought:The average rent price in Chicago from October 2022 to December 2022 is $2344.50.Final Answer: $2344.50


voilà

Discussion

During our tests, we ran multiple questions on both SQLChain and SQLAgent using GPT-3.5 and compared their respective results. Our findings revealed that SQLAgent outperformed SQLChain by answering a greater number of questions. For instance, when asked “Is the average rent price in Chicago increasing from June 2022 till March 2023”, the output we got from SQLChain is:

There doesn't seem to be a table specifically for average rent in Chicago. Since I don't have access to the relevant data, I don't know the answer to the question.


and the output for SQLAgent is:

Final Answer: The average rent price in Chicago is increasing from June 2022 till March 2023.


For accuracy, however, our findings indicate a higher incidence of incorrect responses from SQLAgent. Besides the general shortcomings (see conclusion in How to — LangChain SQLChain) of using LLM to query database, we hypothesize that SQLAgent will occasionally make its best attempt to answer a question even when concrete results cannot be obtained from the SQL query. For example, when asked the question: “Are home prices increasing in San Diego from Nov 2022 till Dec 2022?”, the SQLAgent produces the following chain of thought and action:

Action: list_tables_sql_dbAction Input: ""Thought:I can query the tables "zillow_home_value_index" and "zillow_home_value" to get information about home prices. I should query the schema of these tables to see what columns are available.Action: schema_sql_dbAction Input: "zillow_home_value_index, zillow_home_value"Thought:I can query the tables "zillow_home_value_index" and "zillow_home_value" to get information about home prices. The relevant columns are "period_start", "period_end", "location_name", and "metric_value". I can filter the results based on the location name "San Diego" and the period between November 2022 and December 2022. I should use the "zillow_home_value" table because it has more specific property types.Action: query_sql_dbAction Input: "SELECT period_start, period_end, location_name, metric_value FROM zillow_home_value WHERE location_name = 'San Diego' AND period_start >= '2022-11-01' AND period_end <= '2022-12-31' LIMIT 10"Observation: []Thought:There are no results for the query. This means that there is no data available for home prices in San Diego from November 2022 till December 2022.Final Answer: No, home prices are not increasing in San Diego from Nov 2022 till Dec 2022.


As we can see, SQLAgent did not observe any data from the sql query but chose to provide an answer anyways.

Hence, if you find yourself in need of SQLAgent’s capability to construct intricate chains of calls to language models and other tools, it would be prudent to formulate well-defined and precise questions while also cross-verifying the results with your database.

Conclusion

LangChain SQLAgent is a powerful tool that creates complex LLM chain calls for answering user questions. Although it return a response for most relevant questions, it fails to prevent LLM hallucination.

If you haven’t read the blog about how to use LangChain SQLChain, please check it out here. We will also be covering how to use LlamaIndex for connecting LLMs to databases later this month so if you’re interested in learning more about using LLM to answer database questions please consider subscribing, thank you.

Join Our Waitlist Today

GitHub