More Value Expressions¶
Setup¶
[1]:
import os
import ibis
ibis.options.interactive = True
connection = ibis.sqlite.connect(os.path.join('data', 'geography.db'))
Type casting¶
The Ibis type system is pretty basic and will get better (and more documented over time). It maps directly onto the current Impala type system
int8
int16
int32
int64
boolean
float
double
string
timestamp
decimal($precision, $scale)
These type names can be used to cast from one type to another
[2]:
countries = connection.table('countries')
countries
connection.table('gdp')
[2]:
country_code year value
0 ABW 1986 4.054634e+08
1 ABW 1987 4.876025e+08
2 ABW 1988 5.964236e+08
3 ABW 1989 6.953044e+08
4 ABW 1990 7.648871e+08
... ... ... ...
9995 SVK 2002 3.513034e+10
9996 SVK 2003 4.681659e+10
9997 SVK 2004 5.733202e+10
9998 SVK 2005 6.278531e+10
9999 SVK 2006 7.070810e+10
[10000 rows x 3 columns]
[3]:
countries = connection.table('countries')
countries.population.cast('float').sum()
[3]:
6878963738.0
[4]:
countries.area_km2.cast('int32').sum()
[4]:
150012536
Case / if-then-else expressions¶
We support a number of variants of the SQL-equivalent CASE
expression, and will add more API functions over time to meet different use cases and enhance the expressiveness of any branching-based value logic.
[5]:
expr = (countries.continent
.case()
.when('AF', 'Africa')
.when('AN', 'Antarctica')
.when('AS', 'Asia')
.when('EU', 'Europe')
.when('NA', 'North America')
.when('OC', 'Oceania')
.when('SA', 'South America')
.else_(countries.continent)
.end()
.name('continent_name'))
expr.value_counts()
[5]:
continent_name count
0 Africa 58
1 Antarctica 5
2 Asia 51
3 Europe 54
4 North America 42
5 Oceania 28
6 South America 14
If the else_
default condition is not provided, any values not matching one of the conditions will be NULL
.
[6]:
expr = (countries.continent
.case()
.when('AF', 'Africa')
.when('AS', 'Asia')
.when('EU', 'Europe')
.when('NA', 'North America')
.when('OC', 'Oceania')
.when('SA', 'South America')
.end()
.name('continent_name_with_nulls'))
expr.value_counts()
[6]:
continent_name_with_nulls count
0 None 5
1 Africa 58
2 Asia 51
3 Europe 54
4 North America 42
5 Oceania 28
6 South America 14
To test for an arbitrary series of boolean conditions, use the case
API method and pass any boolean expressions potentially involving columns of the table:
[7]:
expr = (ibis.case()
.when(countries.population > 25_000_000, 'big')
.when(countries.population < 5_000_000, 'small')
.else_('medium')
.end()
.name('size'))
countries['name', 'population', expr].limit(10)
[7]:
name population size
0 Andorra 84000 small
1 United Arab Emirates 4975593 small
2 Afghanistan 29121286 big
3 Antigua and Barbuda 86754 small
4 Anguilla 13254 small
5 Albania 2986952 small
6 Armenia 2968000 small
7 Netherlands Antilles 300000 small
8 Angola 13068161 medium
9 Antarctica 0 small
Simple ternary-cases (like the Python X if COND else Y
) can be written using the ifelse
function:
[8]:
expr = ((countries.continent == 'AS')
.ifelse('Asia', 'Not Asia')
.name('is_asia'))
countries['name', 'continent', expr].limit(10)
[8]:
name continent is_asia
0 Andorra EU Not Asia
1 United Arab Emirates AS Asia
2 Afghanistan AS Asia
3 Antigua and Barbuda NA Not Asia
4 Anguilla NA Not Asia
5 Albania EU Not Asia
6 Armenia AS Asia
7 Netherlands Antilles NA Not Asia
8 Angola AF Not Asia
9 Antarctica AN Not Asia
Set membership¶
The isin
and notin
functions are like their pandas counterparts. These can take:
A list of value expressions, either literal values or other column expressions
An array/column expression of some kind
[9]:
is_america = countries.continent.isin(['NA', 'SA'])
countries[is_america].continent.value_counts()
[9]:
continent count
0 NA 42
1 SA 14
You can also check for membership in an array. Here is an example of filtering based on the top 3 (ignoring ties) most frequently-occurring values in the string_col
column of alltypes:
[10]:
top_continents = countries.continent.value_counts().limit(3).continent
top_continents_filter = countries.continent.isin(top_continents)
expr = countries[top_continents_filter]
expr.count()
[10]:
114
This is a common enough operation that we provide a special analytical filter function topk
:
[11]:
countries.continent.topk(3)
[11]:
continent count
0 AF 58
1 EU 54
2 AS 51
Cool, huh? More on topk
later.
Null-ness¶
Like their pandas equivalents, the isnull
and notnull
functions return True values if the values are null, or non-null, respectively. For example:
[12]:
expr = (countries.continent
.case()
.when('AF', 'Africa')
.when('EU', 'Europe')
.when('AS', 'Asia')
.end()
.name('top_continent_name'))
expr.isnull().value_counts()
[12]:
unnamed count
0 False 163
1 True 89
Functions like isnull
can be combined with case
expressions or functions like ifelse
to replace null values with some other value. ifelse
here will use the first value supplied for any True
value and the second value for any False
value. Either value can be a scalar or array.
[13]:
expr2 = expr.isnull().ifelse('Other continent', expr).name('continent')
expr2.value_counts()
[13]:
continent count
0 Africa 58
1 Asia 51
2 Europe 54
3 Other continent 89
Distinct-based operations¶
Ibis supports using distinct
to remove duplicate rows or values on tables or arrays. For example:
[14]:
countries['continent'].distinct()
[14]:
0 EU
1 AS
2 NA
3 AF
4 AN
5 SA
6 OC
Name: continent, dtype: object
[15]:
countries.continent.distinct()
[15]:
0 EU
1 AS
2 NA
3 AF
4 AN
5 SA
6 OC
Name: continent, dtype: object
This can be combined with count
to form a reduction metric:
[16]:
metric = (countries.continent
.distinct().count()
.name('num_continents'))
metric
[16]:
7
This is common enough to have a shortcut nunique
:
[17]:
countries.continent.nunique()
[17]:
7
String operations¶
What’s supported is pretty basic right now. We intend to support the full gamut of regular expression munging with a nice API, though in some cases some work will be required on SQLite’s backend to support everything.
[18]:
countries[['name']].limit(5)
[18]:
name
0 Andorra
1 United Arab Emirates
2 Afghanistan
3 Antigua and Barbuda
4 Anguilla
At the moment, basic substring operations (substr
, with conveniences left
and right
) and Python-like APIs such as lower
and upper
(for case normalization) are supported. So you could count first letter occurrences in a string column like so:
[19]:
expr = countries.name.lower().left(1).name('first_letter')
expr.value_counts().sort_by(('count', False)).limit(10)
[19]:
first_letter count
0 s 34
1 b 22
2 m 22
3 c 21
4 g 16
5 n 16
6 a 15
7 t 14
8 p 12
9 i 10
For fuzzy and regex filtering/searching, you can use one of the following
like
, works as the SQLLIKE
keywordrlike
, likere.search
or SQLRLIKE
contains
, likex in str_value
in Python
[20]:
countries[countries.name.like('%GE%')].name
[20]:
0 Argentina
1 Germany
2 Algeria
3 Georgia
4 South Georgia and South Sandwich Islands
5 Niger
6 Nigeria
Name: name, dtype: object
[21]:
countries[countries.name.lower().rlike('.*ge.*')].name
[21]:
0 Argentina
1 Germany
2 Algeria
3 Georgia
4 South Georgia and South Sandwich Islands
5 Niger
6 Nigeria
Name: name, dtype: object
[22]:
countries[countries.name.lower().contains('ge')].name
[22]:
0 Argentina
1 Germany
2 Algeria
3 Georgia
4 South Georgia and South Sandwich Islands
5 Niger
6 Nigeria
Name: name, dtype: object
Timestamp operations¶
Date and time functionality is relatively limited at present compared with pandas, but we’ll get there. The main things we have right now are
Field access (year, month, day, …)
Timedeltas
Comparisons with fixed timestamps
[23]:
independence = connection.table('independence')
independence[independence.independence_date, independence.independence_date.month().name('month')].limit(10)
[23]:
independence_date month
0 1919-08-19 8
1 1912-11-28 11
2 1962-07-05 7
3 1975-11-11 11
4 1981-11-01 11
5 1816-07-09 7
6 1918-05-28 5
7 1991-09-21 9
8 1955-10-26 10
9 1918-05-28 5
Somewhat more comprehensively
[24]:
def get_field(f):
return getattr(independence.independence_date, f)().name(f)
fields = ['year', 'month', 'day'] # datetime fields can also use: 'hour', 'minute', 'second', 'millisecond'
projection = [independence.independence_date] + [get_field(x) for x in fields]
independence[projection].limit(10)
[24]:
independence_date year month day
0 1919-08-19 1919 8 19
1 1912-11-28 1912 11 28
2 1962-07-05 1962 7 5
3 1975-11-11 1975 11 11
4 1981-11-01 1981 11 1
5 1816-07-09 1816 7 9
6 1918-05-28 1918 5 28
7 1991-09-21 1991 9 21
8 1955-10-26 1955 10 26
9 1918-05-28 1918 5 28
For timestamp arithmetic and comparisons, check out functions in the top level ibis
namespace. This include things like day
and second
, but also the ibis.timestamp
function:
[25]:
independence[independence.independence_date.min(),
independence.independence_date.max(),
independence.count().name('nrows')].distinct()
[25]:
min max nrows
0 1291-08-01 2011-07-09 186
[26]:
independence[independence.independence_date > '2000-01-01'].count()
[26]:
3
Some backends support adding offsets, for example independence.independence_date + ibis.interval(days=1)
or ibis.now() - independence.independence_date
.