An introduction to Marvin and Ibis

LLMs and data
Author: Cody Peterson

Published: October 12, 2023

Introduction

In this “LLMs and data” series, we’ll explore how to apply large language models (LLMs) to data analytics. We’ll walk through the steps to build Ibis Birdbrain.

Throughout the series, we’ll be using Marvin and Ibis. A brief introduction to each is provided below.

Marvin

Marvin is an AI engineering framework that makes it easy to build up to an interactive conversational application.

Marvin makes calls to an AI platform. You typically authenticate with an API key set as an environment variable. In this case, we’ll load a .env file that contains secrets for the AI platform that Marvin will use. We also set the large language model to use.

# Import the libraries we need.
import marvin

from rich import print
from time import sleep
from dotenv import load_dotenv

# Load the environment variables to set up Marvin to call our OpenAI account.
load_dotenv()

# Configure the LLM to use.
# increase accuracy
marvin.settings.llm_model = "openai/gpt-4"
# decrease cost
# marvin.settings.llm_model = "openai/gpt-3.5-turbo"

# Some text to test on.
test_str = "working with data and LLMs on 18+ data platforms is easy!"
test_str
'working with data and LLMs on 18+ data platforms is easy!'
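
For readers curious what loading a .env file amounts to, here is a minimal standard-library sketch. It is not python-dotenv's implementation: `parse_env_text` and `load_env_text` are hypothetical helpers, `EXAMPLE_API_KEY` is a made-up variable name, and only simple `KEY=value` lines are handled.

```python
import os


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    pairs = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        pairs[key.strip()] = value.strip().strip("'\"")
    return pairs


def load_env_text(text: str, environ=os.environ) -> None:
    """Copy parsed pairs into `environ` without overriding existing keys."""
    for key, value in parse_env_text(text).items():
        environ.setdefault(key, value)


# Hypothetical file contents; a real .env file would hold your actual secret.
example = '# secrets for the AI platform\nEXAMPLE_API_KEY="not-a-real-key"\n'
fake_environ = {}  # a plain dict stands in for os.environ in this demo
load_env_text(example, environ=fake_environ)
print(fake_environ)
```

Passing a plain dictionary instead of `os.environ` keeps the demo from touching your real environment.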

Functions

AI functions are one of the building blocks in Marvin. They allow you to specify a typed Python function with no code, only a docstring, to achieve a wide variety of tasks.

We’ll demonstrate this with an AI function that translates text:

@marvin.ai_fn
def translate(text: str, from_: str = "English", to: str = "Spanish") -> str:
    """translates the text"""

translate(test_str)
'trabajar con datos y LLMs en más de 18 plataformas de datos es fácil!'
sleep(1)  # avoid rate-limiting by waiting
translate(translate(test_str), from_="Spanish", to="English")
'Working with data and LLMs on more than 18 data platforms is easy!'
sleep(3)  # avoid rate-limiting by waiting
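
To build some intuition for how a docstring-only function can do useful work, here is a rough sketch of the general idea: a decorator that turns the function's signature, docstring, and arguments into a prompt. This is not Marvin's implementation. `prompt_fn`, `echo_llm`, and `translate_sketch` are hypothetical names, and the stand-in "model" simply echoes the prompt so the example runs without an API key.

```python
import functools
import inspect
from typing import Callable


def prompt_fn(llm: Callable[[str], str]):
    """Build a decorator that sends a function's signature and docstring to `llm`."""

    def decorator(fn):
        signature = inspect.signature(fn)

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            bound = signature.bind(*args, **kwargs)
            bound.apply_defaults()  # include defaults such as to="Spanish"
            prompt = (
                f"Function: {fn.__name__}{signature}\n"
                f"Description: {inspect.getdoc(fn)}\n"
                f"Arguments: {dict(bound.arguments)}\n"
                "Reply with the return value only."
            )
            return llm(prompt)

        return wrapper

    return decorator


def echo_llm(prompt: str) -> str:
    """Stand-in for a real model call: returns the prompt it was given."""
    return prompt


@prompt_fn(echo_llm)
def translate_sketch(text: str, from_: str = "English", to: str = "Spanish") -> str:
    """translates the text"""


print(translate_sketch("hello"))
```

Swapping `echo_llm` for a function that calls a real model would make the decorated function return the model's reply instead of the prompt.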

Models

AI models are another building block for generating Python classes from input text. They’re a great way to build structured data from unstructured data, and can be customized for your needs.

We’ll demonstrate this with an AI model that extracts the parts of a sentence:

from pydantic import BaseModel, Field

# decrease cost
marvin.settings.llm_model = "openai/gpt-3.5-turbo"

@marvin.ai_model
class ExtractParts(BaseModel):
    """Extracts parts of a sentence"""
    subject: str = Field(..., description="The subject of the sentence.")
    objects: list[str] = Field(..., description="The objects of the sentence.")
    predicate: str = Field(..., description="The predicate of the sentence.")
    modifiers: list[str] = Field(..., description="The modifiers of the sentence.")

ExtractParts(test_str)
ExtractParts(subject='working', objects=['data', 'LLMs'], predicate='is', modifiers=['on 18+ data platforms', 'easy'])
sleep(1)  # avoid rate-limiting by waiting
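
The "structured data from unstructured data" step can be pictured as parsing a model's reply into a typed class. The sketch below uses only the standard library and is an illustration under assumptions, not how Marvin or Pydantic work internally: `SentenceParts` and `parse_parts` are hypothetical, and `reply` is a hand-written stand-in for a model response, shaped like the output shown above.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class SentenceParts:
    subject: str
    objects: list[str]
    predicate: str
    modifiers: list[str]


def parse_parts(reply: str) -> SentenceParts:
    """Parse a JSON reply and check that every expected field is present."""
    data = json.loads(reply)
    expected = {field.name for field in fields(SentenceParts)}
    missing = expected - set(data)
    if missing:
        raise ValueError(f"reply is missing fields: {sorted(missing)}")
    return SentenceParts(**{name: data[name] for name in expected})


# Hand-written stand-in for a model reply (no API call is made here).
reply = (
    '{"subject": "working", "objects": ["data", "LLMs"], '
    '"predicate": "is", "modifiers": ["on 18+ data platforms", "easy"]}'
)
parts = parse_parts(reply)
print(parts)
```

Failing loudly on missing fields is a deliberate choice in this sketch: a malformed reply is easier to debug at the parsing step than downstream.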

Classifiers

AI classifiers are another building block, this time for classifying input text into the members of an Enum. It’s the most efficient method (in time and cost) for applying LLMs, as it results in only a single output token that selects one of the values of the specified Enum.

We’ll demonstrate this by classifying the language of some text:

from enum import Enum

# increase accuracy
marvin.settings.llm_model = "openai/gpt-4"

@marvin.ai_classifier
class IdentifyLanguage(Enum):
    """Identifies the language of the text"""

    english = "English"
    spanish = "Spanish"


IdentifyLanguage(test_str).value
'English'
sleep(1)  # avoid rate-limiting by waiting
IdentifyLanguage(translate(test_str)).value
'Spanish'
sleep(3)  # avoid rate-limiting by waiting
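
One plausible way to get a single-token answer is to number the Enum members and ask the model to reply with just a number. The sketch below illustrates that idea only; whether Marvin works this way internally is an assumption, and `Language`, `build_choice_prompt`, and `parse_choice` are hypothetical names.

```python
from enum import Enum


class Language(Enum):
    english = "English"
    spanish = "Spanish"


def build_choice_prompt(text: str, choices: type[Enum]) -> str:
    """List each Enum member with an index the model can answer with."""
    options = "\n".join(f"{i}: {member.value}" for i, member in enumerate(choices))
    return f"Classify the text. Reply with one number only.\n{options}\nText: {text}"


def parse_choice(reply: str, choices: type[Enum]) -> Enum:
    """Map a numeric reply back to the corresponding Enum member."""
    members = list(choices)
    index = int(reply.strip())
    if not 0 <= index < len(members):
        raise ValueError(f"reply {reply!r} is not a valid option")
    return members[index]


print(build_choice_prompt("hola, mundo", Language))
# "1" is a hand-written stand-in for the model's reply.
print(parse_choice("1", Language))
```

Because the reply is a short index rather than free text, parsing is trivial and the model has little room to wander off the allowed choices.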

Ibis

Ibis is the portable Python dataframe library that enables Ibis Birdbrain to work on many data platforms at native scale.

Ibis makes calls to a data platform, providing an API but pushing the compute to (local or remote) query engines and storage. DuckDB is the default and we’ll typically use it for demo purposes. You can work with an in-memory instance, but we’ll often create a database file from example data:

# Import the libraries we need.
import ibis

# Set up the demo data in an Ibis backend.
con = ibis.connect("duckdb://penguins.ddb")
t = ibis.examples.penguins.fetch()
t = con.create_table("penguins", t.to_pyarrow(), overwrite=True)

You will typically connect to an existing data platform via the corresponding Ibis backend and have access to a number of tables:

# Import Ibis.
import ibis

# Configure Ibis (interactive).
ibis.options.interactive = True

# Connect to the data and load a table into a variable.
con = ibis.connect("duckdb://penguins.ddb")
t = con.table("penguins")
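
The idea of "providing an API but pushing the compute to the engine" can be sketched with a toy example: build a description of a query in Python, compile it to SQL, and let the engine run it. This is not Ibis's API or implementation. `Query` is a hypothetical class, the standard library's `sqlite3` stands in for the query engine, and the three rows are made-up sample data.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    """A tiny deferred query: nothing runs until `execute` is called."""

    table: str
    columns: tuple[str, ...] = ("*",)
    distinct: bool = False

    def select(self, *columns: str) -> "Query":
        return Query(self.table, columns, self.distinct)

    def unique(self) -> "Query":
        return Query(self.table, self.columns, True)

    def to_sql(self) -> str:
        # Identifiers are interpolated as-is; fine for a demo, unsafe for user input.
        keyword = "SELECT DISTINCT" if self.distinct else "SELECT"
        return f"{keyword} {', '.join(self.columns)} FROM {self.table}"

    def execute(self, con: sqlite3.Connection) -> list[tuple]:
        # The engine, not Python, does the actual work.
        return con.execute(self.to_sql()).fetchall()


demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE penguins (species TEXT, island TEXT)")
demo_con.executemany(
    "INSERT INTO penguins VALUES (?, ?)",
    [("Adelie", "Torgersen"), ("Adelie", "Torgersen"), ("Gentoo", "Biscoe")],
)

demo_query = Query("penguins").select("species", "island").unique()
print(demo_query.to_sql())
print(sorted(demo_query.execute(demo_con)))
```

Building the query and running it are separate steps, which is what lets the same Python code target different engines in a real library.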

Backend

A backend provides the connection to, and basic management of, the data platform. Above, we created the con variable, which is an instance of a DuckDB backend:

con
<ibis.backends.duckdb.Backend at 0x16a17af10>

It usually contains some tables:

con.list_tables()
['penguins']

We can access some internals of Ibis to see what backends are available:

Tip

Don’t rely on accessing internals of Ibis in production.

backends = [entrypoint.name for entrypoint in ibis.util.backend_entry_points()]
backends
['bigquery',
 'clickhouse',
 'dask',
 'datafusion',
 'druid',
 'duckdb',
 'flink',
 'impala',
 'mssql',
 'mysql',
 'oracle',
 'pandas',
 'polars',
 'postgres',
 'pyspark',
 'snowflake',
 'sqlite',
 'trino']
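
If you'd rather not touch Ibis internals, the standard library's `importlib.metadata` offers a public way to list installed plugins. The sketch below assumes Ibis registers its backends under an entry-point group named `"ibis.backends"`; that group name is an assumption here, and `list_entry_point_names` is a hypothetical helper. If Ibis isn't installed, or the group name differs, it returns an empty list.

```python
from importlib.metadata import entry_points


def list_entry_point_names(group: str) -> list[str]:
    """Return the sorted names of installed entry points in `group`."""
    try:
        selected = entry_points(group=group)  # Python 3.10+
    except TypeError:
        # Older Pythons return a dict of group name -> entry points.
        selected = entry_points().get(group, [])
    return sorted(entry_point.name for entry_point in selected)


# Assumed group name; the result depends on what is installed locally.
print(list_entry_point_names("ibis.backends"))
```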

Table

You typically work with a table, conventionally named t for demo or exploratory purposes:

t
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species ┃ island    ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex    ┃ year  ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ string  │ string    │ float64        │ float64       │ int64             │ int64       │ string │ int64 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Adelie  │ Torgersen │           39.1 │          18.7 │               181 │        3750 │ male   │  2007 │
│ Adelie  │ Torgersen │           39.5 │          17.4 │               186 │        3800 │ female │  2007 │
│ Adelie  │ Torgersen │           40.3 │          18.0 │               195 │        3250 │ female │  2007 │
│ Adelie  │ Torgersen │            nan │           nan │              NULL │        NULL │ NULL   │  2007 │
│ Adelie  │ Torgersen │           36.7 │          19.3 │               193 │        3450 │ female │  2007 │
│ Adelie  │ Torgersen │           39.3 │          20.6 │               190 │        3650 │ male   │  2007 │
│ Adelie  │ Torgersen │           38.9 │          17.8 │               181 │        3625 │ female │  2007 │
│ Adelie  │ Torgersen │           39.2 │          19.6 │               195 │        4675 │ male   │  2007 │
│ Adelie  │ Torgersen │           34.1 │          18.1 │               193 │        3475 │ NULL   │  2007 │
│ Adelie  │ Torgersen │           42.0 │          20.2 │               190 │        4250 │ NULL   │  2007 │
│ …       │ …         │              … │             … │                 … │           … │ …      │     … │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘

When working with many tables, you should name them descriptively.

Schema

A table has a schema that Ibis maps to the data platform’s data types:

t.schema()
ibis.Schema {
  species            string
  island             string
  bill_length_mm     float64
  bill_depth_mm      float64
  flipper_length_mm  int64
  body_mass_g        int64
  sex                string
  year               int64
}

LLMs and data: Marvin and Ibis

You can use Marvin and Ibis together to easily apply LLMs to data.

from ibis.expr.schema import Schema
from ibis.expr.types.relations import Table

@marvin.ai_fn
def sql_select(
    text: str, table_name: str = t.get_name(), schema: Schema = t.schema()
) -> str:
    """writes the SQL SELECT statement to query the table according to the text"""


query = "the unique combination of species and islands"
sql = sql_select(query).strip(";")
sql
'SELECT DISTINCT species, island FROM penguins'
t.sql(sql)
┏━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ species   ┃ island    ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━┩
│ string    │ string    │
├───────────┼───────────┤
│ Adelie    │ Torgersen │
│ Adelie    │ Biscoe    │
│ Adelie    │ Dream     │
│ Gentoo    │ Biscoe    │
│ Chinstrap │ Dream     │
└───────────┴───────────┘
sleep(3)  # avoid rate-limiting by waiting
t.sql(sql_select(query + " and include their counts, from highest to lowest").strip(";"))
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┓
┃ species   ┃ island    ┃ count ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━┩
│ string    │ string    │ int64 │
├───────────┼───────────┼───────┤
│ Gentoo    │ Biscoe    │   124 │
│ Chinstrap │ Dream     │    68 │
│ Adelie    │ Dream     │    56 │
│ Adelie    │ Torgersen │    52 │
│ Adelie    │ Biscoe    │    44 │
└───────────┴───────────┴───────┘
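
Since the SQL here is written by a model, it can be worth checking it before running it. The sketch below is one conservative heuristic, not a security guarantee and not part of Marvin or Ibis: `clean_select` is a hypothetical helper that strips a trailing semicolon (as the `.strip(";")` calls above do) and rejects anything that isn't a single statement starting with SELECT. It will also reject valid queries that contain a semicolon inside a string literal or that start with WITH.

```python
def clean_select(sql: str) -> str:
    """Return `sql` without a trailing semicolon if it looks like one SELECT."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:
        raise ValueError("expected a single statement")
    words = statement.split()
    if not words or words[0].lower() != "select":
        raise ValueError("expected a SELECT statement")
    return statement


print(clean_select("SELECT DISTINCT species, island FROM penguins;"))
```

With a helper like this, `t.sql(clean_select(sql_select(query)))` would raise an error on unexpected output instead of passing it to the database.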

Next steps

You can get involved with Ibis Birdbrain, our open-source data & AI project for building next-generation natural language interfaces to data.

Read the next post in this series.
