Lakota

A git-like storage for timeseries

How does git keep track of your data ?

Checksums

A checkum is a function that transform a message into a number.

Example

from hashlib import sha1

important_info = b"Hello world"
print(sha1(important_info).hexdigest())
# -> 7b502c3a1f48c8609ae212cdfb639dee39673f5e

another_info = b"Hello world!"
print(sha1(another_info).hexdigest())
# -> d3486ae9136e7856bc42212385ea797094475802

Properties:

Reverting is impossible but we can store the information:

db = {
    b'7b502c3a1f48c8609ae212cdfb639dee39673f5e': important_info, # "Hello world"
    b'd3486ae9136e7856bc42212385ea797094475802': another_info,  # "Hello world!"
}

So we can query it:

print(db[b'7b502c3a1f48c8609ae212cdfb639dee39673f5e'])
# -> b'Hello world'

Moreover we can compute a digest of digests

cheksum = sha1(b'7b502c3a1f48c8609ae212cdfb639dee39673f5e')
cheksum.update(b'd3486ae9136e7856bc42212385ea797094475802')
print(cheksum.hexdigest())
# -> 4812b3c515b3c453d399fb95bb5b0a261c3542c9

db[b'4812b3c515b3c453d399fb95bb5b0a261c3542c9'] = [
    b'7b502c3a1f48c8609ae212cdfb639dee39673f5e',
    b'd3486ae9136e7856bc42212385ea797094475802'
]

(and store it)

Let's add a new message

some_new_info = b'Bye'
print(sha1(some_new_info).hexdigest())
# -> f792424064d0ca1a7d14efe0588f10c052d28e69
db[b'f792424064d0ca1a7d14efe0588f10c052d28e69'] = some_new_info

And update our checksum of checksum

cheksum.update(b'f792424064d0ca1a7d14efe0588f10c052d28e69')
print(cheksum.hexdigest())
# -> 87230955960e9c54c6cc43db35ad91602bdfdd73

db[b'87230955960e9c54c6cc43db35ad91602bdfdd73'] = [
    b'7b502c3a1f48c8609ae212cdfb639dee39673f5e',
    b'd3486ae9136e7856bc42212385ea797094475802',
    b'f792424064d0ca1a7d14efe0588f10c052d28e69',
]

Our minimal git implementation is nearly complete.

log = [
    b'4812b3c515b3c453d399fb95bb5b0a261c3542c9', # First commit
    b'87230955960e9c54c6cc43db35ad91602bdfdd73', # Second commit
]

The log contains only the digests of digests (not the original messages)

For example the equivalent of a checkout is:

commit = b'87230955960e9c54c6cc43db35ad91602bdfdd73'
for key in db[commit]:
    msg = db[key]
    print(f'{key}: {msg}')

# -> b'7b502c3a1f48c8609ae212cdfb639dee39673f5e': b'Hello world'
#    b'd3486ae9136e7856bc42212385ea797094475802': b'Hello world!'
#    b'f792424064d0ca1a7d14efe0588f10c052d28e69': b'Bye'

So with the above concept, we see how we can:

Merkle tree

This checksum of checksums (or checksum of checksums of checksums ...) is called a Merkle Tree.

What about timeseries ?

A timeseries is a dataframe with at least two columns

df = DataFrame({
    'timestamp': ['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04', '2020-01-05'],
    'value': [1, 2, 3, 4, 5],
})
print(df)

# ->
#        timestamp  value
#    0  2020-01-01      1
#    1  2020-01-02      2
#    2  2020-01-03      3
#    3  2020-01-04      4
#    4  2020-01-05      5

We compute one digest per column:

print(sha1(df['timestamp'].to_numpy()).hexdigest())
# -> 35926b0130f74f08c6020af537e765258deeef03

print(sha1(df['value'].to_numpy()).hexdigest())
# -> 7bfa1c0042237357f6eb89ec07d5e8a89e2d1d0e

And combine those to have a digest of our dataframe:

sha1(b'3592...').update(b'7bf...')

And then do the same operations for a collection of dataframe:

second_df = DataFrame({
    'timestamp': ['2020-01-06', '2020-01-07', '2020-01-08', '2020-01-09', '2020-01-10'],
    'value': [6, 7, 8, 9, 10],
})

(rince and repeat)

Remarks:

Conclusions

If we add two ingredients:

We get a timeseries database that offer the following properties:

All those ideas and more have been implement in Lakota

Demo Time

Web UI

Pull dataset from CLI

Thank you. Questions ?