Metadata-Version: 2.4
Name: sqlshield
Version: 0.0.17
Summary: A Shield for your LLM generated SQL Queries. It provides an application level control for securing the database from SQL generated by LLM.
Author: Sandeep Giri
Author-email: sandeepgiri@gmail.com
Project-URL: Documentation, https://github.com/terno-ai/llm-sql-shield/README.md
Project-URL: Source, https://github.com/terno-ai/llm-sql-shield
Project-URL: Tracker, https://github.com/terno-ai/llm-sql-shield/issues
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Operating System :: OS Independent
Classifier: License :: OSI Approved :: Apache Software License
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: license-file
Dynamic: project-url
Dynamic: requires-python
Dynamic: summary

# LLM SQL Shield

A shield for your database that neutralizes the SQL queries generated by Large Language Models (LLMs).

## Background

Large Language Models (LLMs) have proven to be extraordinarily effective at generating SQL queries.

However, there is a significant challenge: a threat to the database. The queries generated by LLMs can access data that they aren't supposed to. No matter how constrained the prompt is, it is always possible to jailbreak. Given the nature of LLMs, they can never be controlled deterministically.

Securing the database is difficult because:

1. Databases are usually controlled and managed by a different team. Therefore, any changes to the databases would take time.
2. Databases are usually accessed and modified by multiple different services. Therefore, making any change to the database is difficult.
3. Database security is designed for a handful of roles and users. Achieving security for thousands of users is difficult.
4. Row-based security can be achieved with predefined views, but databases do not currently support dynamic or parameterized views.
5. LLMs need table and column names to be self-explanatory. To achieve that in the database itself, you would have to clone the database and modify the clone, and keeping the cloned tables in sync then becomes a difficult task.

## LLM SQL Shield - Features

With SQL Shield, you can:

1. Limit the tables that you expose to the LLM.
2. Rename the tables as you wish by setting a proper `pub_name`.
3. Limit the columns from each table.
4. Rename the columns to make them more meaningful by setting a proper `pub_name`.
5. Limit the rows that can be accessed by providing `filters`.
6. The `filters` can have variables that you can fill in at the time of query generation and execution.

## Getting started or How does it work?

### Install
It is available as a pip package. You can install it like this:

```bash
pip install sqlshield
```

### Prepare the Schema

Once installed, you can import the following:
```python
# Models contain the models for Table, Column, and Database.
from sqlshield.models import *

# The main work of handling and generating SQL is done by this
# The entry point is Session object.
from sqlshield.shield import *
```

The basic idea is that you create a pseudo-schema of your database that keeps track of the original table and column names. You expose this pseudo-schema to the LLM to generate a query, and then translate that query back to the internal tables.

Further, in the pseudo-schema, you create the row filters or parameterized views that basically filter rows per user.
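
To make the idea concrete, here is a minimal sketch of a pseudo-schema built from plain dataclasses. `PseudoTable`, `PseudoColumn`, `describe_for_llm`, and `internal_name` are hypothetical names used only for illustration; they are not SQL Shield's classes, and the public column names are made up.

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for illustration only; these are NOT sqlshield's classes.
@dataclass
class PseudoColumn:
    name: str      # real column name in the database
    pub_name: str  # name shown to the LLM

@dataclass
class PseudoTable:
    name: str      # real table name in the database
    pub_name: str  # name shown to the LLM
    columns: list = field(default_factory=list)
    row_filter: str = ""  # parameterized row filter, e.g. "where company = {company}"

customer = PseudoTable(
    name="Customer",
    pub_name="Customers",
    columns=[
        PseudoColumn("FirstName", "first_name"),
        PseudoColumn("Company", "company"),
    ],
    row_filter="where company = {company}",
)

def describe_for_llm(table):
    """Render only the public names, so internal names never reach the prompt."""
    cols = ", ".join(c.pub_name for c in table.columns)
    return f"{table.pub_name}({cols})"

def internal_name(table, pub_column):
    """Translate a public column name back to the real column name."""
    lookup = {c.pub_name: c.name for c in table.columns}
    return lookup[pub_column]

print(describe_for_llm(customer))
print(internal_name(customer, "first_name"))
```

The point of keeping both names on every object is that the mapping works in both directions: public names go out to the prompt, and internal names come back when the query is rewritten.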

The first step is to create the schema. You can either build it by hand by creating instances of `MDatabase`, `MTable`, and `MColumn`, or load the existing tables from your database using SQLAlchemy. I would recommend the second approach because it is easier and more maintainable, especially as new tables are introduced.

Let's go ahead and load the database objects using SQLAlchemy.
```bash
pip install sqlalchemy
```

```python
# Using SQLAlchemy 
import sqlalchemy
from sqlalchemy import inspect
from sqlalchemy import text

engine = sqlalchemy.create_engine('sqlite:///chinook.db')
inspector = inspect(engine)
```

Now, using the inspector, you can create the SQL Shield's models: MTable, MColumn, and MDatabase automatically.

```python
mDb = MDatabase.from_inspector(inspector)
```

Once you have the MDatabase instance ready, you can start modifying it to ensure that only minimal information is exposed to the LLMs.

The following code removes all the tables except for four tables. This also reduces the prompt size.

```python
mDb.keep_only_tables(set(['Customer', 'Employee', 'Invoice', 'InvoiceLine']))
```

`MDatabase.keep_only_tables` is a handy utility method. To suit your business needs, you can also modify the `MDatabase.tables` object directly; it is a `set()`.

If you want to modify an individual table, you can access it directly from the `MDatabase.tables` `set()` object. Since it is easier to look tables up in a dictionary, we have provided `get_table_dict()`, as shown below:

```python
# A handy method to get the tables as a dictionary keyed by table name.
# To add or remove tables, use MDatabase.tables itself.
tables = mDb.get_table_dict()
```

Now, say we want to modify a table named `Customer`. We can change its name to `Customers`. Table names in your organization can often be messy, and LLMs will not generate good queries against them. You can give each table a good `pub_name`; `pub_name` is what is shown to the LLM.

```python
customer_table = tables['Customer']
# Change Name of table
customer_table.pub_name = 'Customers'
```

*Filters* are the most powerful feature of SQL Shield. Filters let you create subspaces for users by limiting the rows shown to each user. Say there is a table that contains the data for all 500 teams in your organization, and you want each team to see only its own data. You would define a filter up front and, at runtime, supply the team as one of the parameters.

Here in this example, we want users to access only the rows of the `Customers` table that belong to their company.

```python
# Add a filter
customer_table.filters = 'where company = {company}'
```
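
The example later in this README passes the parameter value with its SQL quotes already included (`'\'Telus\''`), which suggests the placeholder is replaced as plain text. Under that assumption, a minimal sketch of preparing a string parameter safely might look like this. `sql_string_literal` is a hypothetical helper, not part of SQL Shield, and `str.format` is used here only to preview the result, not to describe how the library fills the filter.

```python
def sql_string_literal(value: str) -> str:
    """Quote a Python string as a SQL string literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"

filter_template = "where company = {company}"

# Assumption: the placeholder is filled by plain text substitution,
# so the value must already be a valid SQL literal.
params = {"company": sql_string_literal("Telus")}
filled = filter_template.format(**params)
print(filled)

# A value containing a quote stays inside the literal instead of ending it.
print(sql_string_literal("O'Brien"))
```

If the parameter values come from user input, quoting them like this before they reach the filter matters as much as the filter itself.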

You can access the columns of an `MTable` through `columns`, which is a `set()`. `MTable` also provides a handy method, `drop_columns`. I will be adding more handy methods based on user requests.

You can also access the columns directly from the set and modify them. Each column is an `MColumn`, which has an important field, `pub_name`. If you want to change the column name that is visible to the LLM, change `pub_name`. Don't change `name`.

```python
# Drop some columns
customer_table.drop_columns(set(['Address']))
```

That's it. Your database is ready!

### Generate Schema for LLM
With the MDatabase `mDb` prepared in the previous steps, we can generate the schema to augment the prompt as follows:

```python
schema_generated = mDb.generate_schema()
```

We can send this schema along with the question in the prompt to any LLM, and once the LLM generates SQL, we can rewrite it. Say an LLM has generated a query `aSql`.

We can now generate the safe query from `aSql` as follows:

```python
d = {'company':'\'Telus\''}
sess = Session(mDb, d)
gSQL = sess.generateNativeSQL(aSql)
```

That's it. `gSQL` now holds a query that is safe. Please note that we created a dictionary of parameters and passed it to the constructor of `Session` because we had a parameterized filter that limits the rows to `company = 'Telus'`.

Now, you can execute the `gSQL` on your actual database peacefully.
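
To see the effect of such a rewrite without an LLM or the library, here is a self-contained sketch using the standard-library `sqlite3` module and an in-memory table with made-up rows. The `safe_sql` string is written by hand as one plausible shape for a restricted query; the SQL that `generateNativeSQL` actually produces may look different.

```python
import sqlite3

# In-memory stand-in for the real database, with made-up sample rows.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Customer (FirstName TEXT, Company TEXT, Address TEXT)")
con.executemany(
    "INSERT INTO Customer VALUES (?, ?, ?)",
    [
        ("Mark", "Telus", "1 Example Street"),
        ("Ann", "Other Corp", "2 Example Street"),
    ],
)

# What an LLM might write against the public schema (table exposed as Customers).
llm_sql = "SELECT * FROM Customers"

# Hand-written illustration of a restricted equivalent: the public name becomes
# a subquery over the real table that keeps only the allowed columns and rows.
safe_sql = (
    "SELECT * FROM ("
    "SELECT FirstName, Company FROM Customer WHERE Company = 'Telus'"
    ") AS Customers"
)

rows = con.execute(safe_sql).fetchall()
print(rows)
```

Even though the LLM asked for every column of every customer, the query that reaches the database can only return the permitted columns of the permitted rows.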

## Complete Example with OpenAI

Here is an example:

```python
import sqlalchemy
from sqlalchemy import inspect
from sqlalchemy import text

from sqlshield.models import *
from sqlshield.shield import *
import os

# TODO: Specify correct OpenAI key
os.environ["OPENAI_API_KEY"] = 'sk-XXXXXXXXX'

# TODO: You can download this SQLite3 DB file: https://github.com/terno-ai/llm-sql-shield/raw/main/tests/chinook.db
# And save it in your current directory
# Connect to DB
engine = sqlalchemy.create_engine('sqlite:///chinook.db')
inspector = inspect(engine)

# Load default DB
mDb = MDatabase.from_inspector(inspector)

mDb.keep_only_tables(set(['Customer', 'Employee', 'Invoice', 'InvoiceLine']))

tables = mDb.get_table_dict()
customer_table = tables['Customer']

# Change Name of table
customer_table.pub_name = 'Customers'

# Add a filter
customer_table.filters = 'where company = {company}'

# Drop some columns
customer_table.drop_columns(set(['Address']))

# Column renaming

question = "Show me all customers."

from openai import OpenAI
client = OpenAI()

schema_generated = mDb.generate_schema()
print('The following schema was generated: ', schema_generated)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are an SQL Analyst. Your role is to generate the SQL given a question. Only generate SQL nothing else."},
        {"role": "user", "content": question},
        {"role": "assistant", "content": "The table schema is as follows: " + schema_generated},
    ]
)

aSql = response.choices[0].message.content
print('SQL Generated by LLM: ', aSql)

d = {'company':'\'Telus\''}
sess = Session(mDb, d)
gSQL = sess.generateNativeSQL(aSql)
print("Native SQL: ", gSQL)

print(" ===================== ")
with engine.connect() as con:
    rs = con.execute(text(gSQL))
    for row in rs:
        print(row)

```
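
Chat models often wrap their answer in a Markdown code fence even when asked for SQL only, and this README does not say whether `generateNativeSQL` tolerates that. As a precaution, a small hypothetical helper such as `extract_sql` below (not part of SQL Shield) could clean `aSql` before it is passed to the session.

```python
import re

# Built programmatically so this example nests cleanly inside Markdown.
FENCE = "`" * 3

def extract_sql(llm_output: str) -> str:
    """Return the text inside the first fenced block, or the whole text if unfenced."""
    pattern = re.escape(FENCE) + r"(?:sql)?\s*(.*?)" + re.escape(FENCE)
    match = re.search(pattern, llm_output, flags=re.DOTALL | re.IGNORECASE)
    sql = match.group(1) if match else llm_output
    return sql.strip()

fenced = FENCE + "sql\nSELECT * FROM Customers\n" + FENCE
print(extract_sql(fenced))
print(extract_sql("  SELECT * FROM Customers  "))
```

In the complete example this would be used as `aSql = extract_sql(response.choices[0].message.content)`.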

## Testing
### Comment out ext_modules and cmdclass from setup.py first
`pip install -e .`

### Usage
```python
import sqlshield
shield = sqlshield.SQLShield(...)

from sqlshield import SQLShield
shield = SQLShield(...)
```

### Run inside tests directory
`coverage run test.py`
