Well, I think the next logical step would be to use a local database as a RAG source. I.E. additional private data to allow the LLM to better answer business specific questions. In the real world a business would likely not want that data made available to an LLM that could conceivably make it public. So, they likely would be using a private implementation of an LLM. But for this experiment not too concerned about that.
I expect, where possible, vector or graph databases would be the preferred choice. And I may get there. But, I have never used a vector or graph database, so I will start with a SQL database and the appropriate LangChain toolkit(s).
However, not yet sure what I will use for the source database. It’s not like I am currently running one full of some sort of suitable data.
Chinook.db
Well, the LangChain tutorial I will be following uses the Chinook.db database; a sample database that represents a digital media store. So I will do the same, with the SQLite version. I downloaded the zip from GitHub and extracted the database file into my working directory.
Let’s have a quick look at the database.
... ...
from langchain_community.utilities import SQLDatabase
... ...
db = SQLDatabase.from_uri("sqlite:///chinook.db")
if True:
print(db.dialect)
print(db.get_usable_table_names())
print(db.get_table_info(["artists"]))
print(db.run("SELECT * FROM artists LIMIT 10;"))
exit(0)
And, in the terminal window I got the following.
(agnt-3.12) PS R:\learn\ds_agent> python chat_bot.py
sqlite
['albums', 'artists', 'customers', 'employees', 'genres', 'invoice_items', 'invoices', 'media_types', 'playlist_track', 'playlists', 'tracks']
CREATE TABLE artists (
"ArtistId" INTEGER NOT NULL,
"Name" NVARCHAR(120),
PRIMARY KEY ("ArtistId")
)
[(1, 'AC/DC'), (2, 'Accept'), (3, 'Aerosmith'), (4, 'Alanis Morissette'), (5, 'Alice In Chains'), (6, 'Antônio Carlos Jobim'), (7, 'Apocalyptica'), (8, 'Audioslave'), (9, 'BackBeat'), (10, 'Billy Cobham')]
You will note that the query returned a list of tuples. SQLite represents the data for returned rows as tuples. Let’s see what happens if I only ask for artist names not whole rows. That is:
print(db.run("SELECT Name FROM artists LIMIT 10;"))
[('AC/DC',), ('Accept',), ('Aerosmith',), ('Alanis Morissette',), ('Alice In Chains',), ('Antônio Carlos Jobim',), ('Apocalyptica',), ('Audioslave',), ('BackBeat',), ('Billy Cobham',)]
In Python, a specific syntax is required for a tuple with one element to distinguish it from a single value enclosed in parentheses. The single element has to be followed by a comma. So the tuple effectively has an empty second element. Something one may need to remember when processing query results.
Build Our Agent
Okay, let’s move on and get our agent instantiated.
We could in fact do this using a chain of choreographed steps. Which is what the above mentioned tutorial did to start. But, I am more interested in using an agentic approach; which is the approach covered later the tutorial.
Toolkit
We will provide the LLM with a set of tools allowing it to take the necessary steps to generate and execute a SQL query on our database. The
The SQLDatabaseToolkit includes tools that can:
Create and execute queries
Check query syntax
Retrieve table descriptions
… and more
Build a Question/Answering System Over SQL Data
One of the parameters, in addition to the database instance, needed to instantiate the toolkit is a reference to the LLM model we are using.
db = SQLDatabase.from_uri("sqlite:///chinook.db")
sql_mdl = ChatOpenAI(model="gpt-4o-mini")
db_tool = SQLDatabaseToolkit(db=db, llm=sql_mdl)
db_tools = db_tool.get_tools()
if True:
dbts = [f"{tool.name}: {tool.description}" for tool in db_tools]
for dbt in dbts:
print(f"\n{dbt}")
exit(0)
And, the output of that bit of test code is as follows.
(agnt-3.12) PS R:\learn\ds_agent> python chat_bot.py
sql_db_query: Input to this tool is a detailed and correct SQL query, output is a result from the database. If the query is not correct, an error message will be returned. If an error is returned, rewrite the query, check the query, and try again. If you encounter an issue with Unknown column 'xxxx' in 'field list', use sql_db_schema to query the correct table fields.
sql_db_schema: Input to this tool is a comma-separated list of tables, output is the schema and sample rows for those tables. Be sure that the tables actually exist by calling sql_db_list_tables first! Example Input: table1, table2, table3
sql_db_list_tables: Input is an empty string, output is a comma-separated list of tables in the database.
sql_db_query_checker: Use this tool to double check if your query is correct before executing it. Always use this tool before executing a query with sql_db_query!
Try the Tools in the Kit
I thought I’d try using the tools provided by the kit in some test code, rather than through the LLM model. Figured it would give me some idea of what the model itself would do. That meant importing some additional classes. And a bit of reading to sort it all out.
... ...
from langchain_openai import ChatOpenAI, OpenAI
... ...
from langchain_community.tools.sql_database.tool import (
InfoSQLDatabaseTool,
ListSQLDatabaseTool,
QuerySQLCheckerTool,
QuerySQLDatabaseTool,
)
... ...
if True:
tbls = ListSQLDatabaseTool(db=db).run("")
print(f"\nListSQLDatabaseTool: {tbls}")
i_tool = InfoSQLDatabaseTool(db=db)
i_tbl = i_tool.run("employees")
print(f'\nInfoSQLDatabaseTool("employees"): {i_tbl}')
q_chkr = QuerySQLCheckerTool(db=db, llm=OpenAI(), verbose=True)
q_tst = "SELECT COUNT(EmpId) AS EmployeeCount FROM employees;"
print(f"\ncheck query: {q_tst}")
q_chkr.run(q_tst)
q_exc = QuerySQLDatabaseTool(db=db)
print(f'\n\nbad query: QuerySQLDatabaseTool({q_tst}) ->\n \
{QuerySQLDatabaseTool(db=db).invoke({"dialect":"SQLite", "query":q_tst})}')
print(f'\ngood query: QuerySQLDatabaseTool("SELECT COUNT(EmployeeId) AS EmployeeCount FROM employees;") ->:\n\t \
{QuerySQLDatabaseTool(db=db).invoke({"dialect":"SQLite", "query":"SELECT COUNT(EmployeeId) AS EmployeeCount FROM employees;"})}')
exit(0)
Note that I have purposely included an error in the query I passed to the query checker. And, as you can see in the output below, it did not find an error (query was displayed in green) or return a corrected query. I expect that is because it does not, as coded, have access to the other tool it would need to check that the column name was valid. And, I could not find a way to pass it that information. I tried to use bind to connect the two runnables; but, could not get that to work. However, running the bad query did return an error. So I guess I could write a function to check the query by executing it and looking for an error message.
(agnt-3.12) PS R:\learn\ds_agent> python chat_bot.py
ListSQLDatabaseTool: albums, artists, customers, employees, genres, invoice_items, invoices, media_types, playlist_track, playlists, tracks
InfoSQLDatabaseTool("employees"):
CREATE TABLE employees (
"EmployeeId" INTEGER NOT NULL,
"LastName" NVARCHAR(20) NOT NULL,
"FirstName" NVARCHAR(20) NOT NULL,
"Title" NVARCHAR(30),
"ReportsTo" INTEGER,
"BirthDate" DATETIME,
"HireDate" DATETIME,
"Address" NVARCHAR(70),
"City" NVARCHAR(40),
"State" NVARCHAR(40),
"Country" NVARCHAR(40),
"PostalCode" NVARCHAR(10),
"Phone" NVARCHAR(24),
"Fax" NVARCHAR(24),
"Email" NVARCHAR(60),
PRIMARY KEY ("EmployeeId"),
FOREIGN KEY("ReportsTo") REFERENCES employees ("EmployeeId")
)
/*
3 rows from employees table:
EmployeeId LastName FirstName Title ReportsTo BirthDate HireDate Address City State Country PostalCode Phone Fax Email
1 Adams Andrew General Manager None 1962-02-18 00:00:00 2002-08-14 00:00:00 11120 Jasper Ave NW Edmonton AB Canada T5K 2N1 +1 (780) 428-9482 +1 (780) 428-3457 andrew@chinookcorp.com
2 Edwards Nancy Sales Manager 1 1958-12-08 00:00:00 2002-05-01 00:00:00 825 8 Ave SW CalgaryAB Canada T2P 2T3 +1 (403) 262-3443 +1 (403) 262-3322 nancy@chinookcorp.com
3 Peacock Jane Sales Support Agent 2 1973-08-29 00:00:00 2002-04-01 00:00:00 1111 6 Ave SW Calgary AB Canada T2P 5M5 +1 (403) 262-3443 +1 (403) 262-6712 jane@chinookcorp.com
*/
check query: SELECT COUNT(EmpId) AS EmployeeCount FROM employees;
SELECT COUNT(EmpId) AS EmployeeCount FROM employees;
bad query: QuerySQLDatabaseTool(SELECT COUNT(EmpId) AS EmployeeCount FROM employees;) ->
Error: (sqlite3.OperationalError) no such column: EmpId
[SQL: SELECT COUNT(EmpId) AS EmployeeCount FROM employees;]
(Background on this error at: https://sqlalche.me/e/20/e3q8)
good query: QuerySQLDatabaseTool("SELECT COUNT(EmployeeId) AS EmployeeCount FROM employees;") ->:
[(8,)]
Set-up Agent
Okay, let’s move along. One thing we are going to need is a very explicit template telling the LLM how to make use of our toolkit. I basically just copied the prompt code from the LangChain tutorial. Though I saw a comment on another post that suggested you could try using a chatbot to generate the prompt template. Most of the rest of the code we have seen before.
model = ChatOpenAI(model="gpt-4o-mini").bind_tools(db_tools)
call_tool = ToolNode(db_tools)
sys_msg = """
You are an agent designed to interact with a SQL database.
Given an input question, create a syntactically correct {dialect} query to run,
then look at the results of the query and return the answer. Unless the user
specifies a specific number of examples they wish to obtain, always limit your
query to at most {top_k} results.
You can order the results by a relevant column to return the most interesting
examples in the database. Never query for all the columns from a specific table,
only ask for the relevant columns given the question.
You MUST double check your query before executing it. If you get an error while
executing a query, rewrite the query and try again.
DO NOT make any DML statements (INSERT, UPDATE, DELETE, DROP etc.) to the
database.
To start you should ALWAYS look at the tables in the database to see what you
can query. Do NOT skip this step.
Then you should query the schema of the most relevant tables.
""".format(
dialect="SQLite",
top_k=5,
)
prompt = ChatPromptTemplate(
[("system", sys_msg),
("placeholder", "{messages}")
]
)
def call_model(state: MessagesState):
chain = prompt | model
# chain = LLMChain(model=model, prompt=prompt)
# updated_messages = model.invoke(state["messages"])
updated_messages = chain.invoke(state)
return {"messages": updated_messages}
workflow = StateGraph(MessagesState)
workflow.add_node("model_node", call_model)
workflow.add_node("tools", call_tool)
workflow.add_edge(START, "model_node")
workflow.add_conditional_edges("model_node", tools_condition)
workflow.add_edge("tools", "model_node")
memory = MemorySaver()
app = workflow.compile(memory)
def chatbot(chat_id: int):
config = {"configurable": {"thread_id": chat_id}}
while True:
user_input = input("User: ")
if user_input in ["exit", "quit", "x", "q"]:
print("AI: See you later!")
break
else:
print("AI: ", end="")
for chunk, metadata in app.stream({"messages": user_input}, config, stream_mode="messages"):
if isinstance(chunk, AIMessage):
print(chunk.content, end="", flush=True)
print("\n")
if __name__ == "__main__":
chatbot(4)
Sample Run
User: Which country's customers spent the most?
AI: sql
SELECT c.Country, SUM(i.Total) AS TotalSpent
FROM customers c
JOIN invoices i ON c.CustomerId = i.CustomerId
GROUP BY c.Country
ORDER BY TotalSpent DESC
LIMIT 5;
The countries where customers spent the most are:
1. **USA**: $523.06
2. **Canada**: $303.96
3. **France**: $195.10
4. **Brazil**: $190.10
5. **Germany**: $156.48
User: Which genre had the most spent on it.
AI: sql
SELECT g.Name, SUM(ii.UnitPrice * ii.Quantity) AS TotalSpent
FROM genres g
JOIN tracks t ON g.GenreId = t.GenreId
JOIN invoice_items ii ON t.TrackId = ii.TrackId
GROUP BY g.Name
ORDER BY TotalSpent DESC
LIMIT 5;
The genre with the most spent on it is:
1. **Rock**: $826.65
2. **Latin**: $382.14
3. **Metal**: $261.36
4. **Alternative & Punk**: $241.56
5. **TV Shows**: $93.53
User: Which customer purchased the most albums?
AI: sql
SELECT c.FirstName, c.LastName, COUNT(ii.TrackId) AS AlbumsPurchased
FROM customers c
JOIN invoices i ON c.CustomerId = i.CustomerId
JOIN invoice_items ii ON i.InvoiceId = ii.InvoiceId
GROUP BY c.CustomerId, c.FirstName, c.LastName
ORDER BY AlbumsPurchased DESC
LIMIT 5;
The customers who purchased the most albums are:
1. **Luís Gonçalves**: 38 albums
2. **Leonie Köhler**: 38 albums
3. **François Tremblay**: 38 albums
4. **Bjørn Hansen**: 38 albums
5. **František Wichterlová**: 38 albums
These customers are tied with the same number of purchases.
User: Who were the top 10 customers based on total album purchases?
AI: sql
SELECT c.FirstName, c.LastName, COUNT(ii.TrackId) AS AlbumsPurchased
FROM customers c
JOIN invoices i ON c.CustomerId = i.CustomerId
JOIN invoice_items ii ON i.InvoiceId = ii.InvoiceId
GROUP BY c.CustomerId, c.FirstName, c.LastName
ORDER BY AlbumsPurchased DESC
LIMIT 10;
The top 10 customers based on total album purchases are:
1. **Luís Gonçalves**: 38 albums
2. **Leonie Köhler**: 38 albums
3. **François Tremblay**: 38 albums
4. **Bjørn Hansen**: 38 albums
5. **František Wichterlová**: 38 albums
6. **Helena Holý**: 38 albums
7. **Astrid Gruber**: 38 albums
8. **Daan Peeters**: 38 albums
9. **Kara Nielsen**: 38 albums
10. **Eduardo Martins**: 38 albums
All listed customers have made the same number of purchases.
User: What is the average number of album purchases by customer?
AI: sql
SELECT AVG(AlbumsPurchased) AS AveragePurchases
FROM (SELECT COUNT(ii.TrackId) AS AlbumsPurchased
FROM customers c
JOIN invoices i ON c.CustomerId = i.CustomerId
JOIN invoice_items ii ON i.InvoiceId = ii.InvoiceId
GROUP BY c.CustomerId) AS Subquery;
The average number of album purchases by customer is approximately **37.97** albums.
User: Which customers purchased less than 38 albums?
AI: sql
SELECT c.FirstName, c.LastName, COUNT(ii.TrackId) AS AlbumsPurchased
FROM customers c
JOIN invoices i ON c.CustomerId = i.CustomerId
JOIN invoice_items ii ON i.InvoiceId = ii.InvoiceId
GROUP BY c.CustomerId, c.FirstName, c.LastName
HAVING AlbumsPurchased < 38
ORDER BY AlbumsPurchased;
The customer who purchased less than 38 albums is:
- **Puja Srivastava**: 36 albums.
User: Which customers purchased less than 38 albums?
AI: The customer who purchased less than 38 albums is:
- **Puja Srivastava**: 36 albums.
User: What is the average length of time employees have been working for the company give today is June 5, 2020?
AI: sql
SELECT AVG(JULIANDAY('2020-06-05') - JULIANDAY(HireDate)) AS AverageDaysWorked FROM employees;
The average length of time employees have been working for the company is approximately **6260.75 days**, which is about **17.15 years**.
User: x
AI: See you later!
So that seems to work just fine.
Done
That’s it for this one. As for the next post, not too sure where I will go next. More chatbot or back to coding the transformer model I appear to have completely forgotten about. The latter will require a major mental refresh. I was also thinking I would code a module to run the queries in the above chatbot output to confirm that they in fact work as expected.
Until next, do enjoy your time coding—whatever it be that you are working on.
Resources
- langchain_community.utilities.sql_database.SQLDatabase
- Components | Tools | Toolkits | SQLDatabase Toolkit
- LangChain Python API Reference | tools | InfoSQLDatabaseTool
- LangChain Python API Reference | tools | ListSQLDatabaseTool
- LangChain Python API Reference | tools | QuerySQLCheckerTool
- LangChain Python API Reference | tools | QuerySQLDatabaseTool
- LangChain Python API Reference | Build a Question/Answering system over SQL data
- Sample Databases for SQLite