WEBVTT
00:00.000 --> 00:18.000 I first of all want to talk about how this conference is possible. It's not any sponsor, it's not any big company organizing this, it's not me, it is volunteers.
00:18.000 --> 00:32.000 And Dylan is volunteering today to help me organize this day. First of all, a round of applause for organizing.
00:32.000 --> 00:38.000 Otherwise I will forget it later. I am introducing Dylan, our speaker today.
00:38.000 --> 00:49.000 We actually have been working together on some amazing open source projects, which are also super complex. And I was always complaining about 500 variables being thrown around everywhere in our codebase.
00:49.000 --> 00:59.000 And somebody very smart said, let's fix this and let's build something called StateDB. And it was so interesting that I invited Dylan here on stage to talk about it.
00:59.000 --> 01:01.000 A round of applause for Dylan.
01:10.000 --> 01:13.000 No, it's not the green one. Other side? Yeah, that's it.
01:13.000 --> 01:18.000 All right, thank you all. How is everyone doing? Yeah, how are you doing? Great.
01:19.000 --> 01:26.000 Come on, you can be a bit more excited than that. We have to keep this up for the rest of the day.
01:26.000 --> 01:32.000 So I would like to talk about StateDB. Before I do, a little "who am I": my name is Dylan.
01:32.000 --> 01:39.000 I work for Isovalent, which is now part of Cisco. I'm a Cilium contributor.
01:40.000 --> 01:49.000 That's my main job, and I'm a reviewer for the eBPF library, which Cilium uses.
01:49.000 --> 01:56.000 And for those of you who don't know, those are both very Go-heavy projects.
01:56.000 --> 02:01.000 To explain why StateDB, I need to explain a little bit about Cilium.
02:01.000 --> 02:09.000 So Cilium is a cloud networking product that does a lot. I won't go into it very deeply.
02:09.000 --> 02:17.000 But I like to draw a very basic diagram of all of the things that go in and out of Cilium.
02:17.000 --> 02:27.000 The Kubernetes API server, etcd, a bunch of things in the kernel, and even inter-process communication on the same node.
02:27.000 --> 02:36.000 And what tends to happen is a lot of information comes in, a lot of events happen, and we need to reconcile all of that information.
02:36.000 --> 02:42.000 And we need to do it quite often, with a lot of different data.
02:42.000 --> 02:57.000 So if we look at how you store program state, a very naive approach, which is always a good one to start with, is: you take a map.
02:57.000 --> 03:01.000 I put my data in a map or an array or some sort of container.
03:01.000 --> 03:08.000 Now I need to make it safe for concurrent use, so I slap a lock in there.
03:08.000 --> 03:17.000 And then we replicate this pattern. So I made an apple manager and an orange manager, and they manage their own little bits of data.
03:17.000 --> 03:26.000 And these have consumers. So what tends to happen is that these managers are created by different people with different goals.
03:26.000 --> 03:34.000 They take maybe the same data, maybe different data, indexed differently. And you have to create some sort of API for this.
03:34.000 --> 03:39.000 So with a normal map, you would just make a lookup function, an add function and a delete function.
03:39.000 --> 03:46.000 But when you add a lock, you lock and defer the release in every one of those functions.
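As a concrete illustration of the pattern just described, a minimal Go sketch (all names here are hypothetical; this is not code from Cilium):

```go
package fruitstate

import "sync"

// Apple is a stand-in for whatever piece of state a manager owns.
type Apple struct {
	ID    string
	Grade byte
}

// AppleManager is the naive pattern: a map guarded by a lock,
// wrapped in a small lookup/add/delete API.
type AppleManager struct {
	mu     sync.RWMutex
	apples map[string]Apple
}

func NewAppleManager() *AppleManager {
	return &AppleManager{apples: make(map[string]Apple)}
}

// Lookup takes the read lock and defers the release.
func (m *AppleManager) Lookup(id string) (Apple, bool) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	a, ok := m.apples[id]
	return a, ok
}

// Add takes the write lock and defers the release.
func (m *AppleManager) Add(a Apple) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.apples[a.ID] = a
}

// Delete takes the write lock and defers the release.
func (m *AppleManager) Delete(id string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	delete(m.apples, id)
}
```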
03:46.000 --> 03:50.000 But this doesn't always work, if you need to operate on bits of the data.
03:50.000 --> 03:55.000 So what we often have is we say, okay, we need to loop over this map.
03:55.000 --> 04:01.000 To do that, we need to hold the lock the entire time, because we can't have changes in the middle of looking at it.
04:01.000 --> 04:12.000 So what ends up happening is that you need to expose that lock somehow to your consumer, which is bad practice and shouldn't be done.
04:12.000 --> 04:21.000 Or you create some sort of callback mechanism, where you say: okay, let me loop over all of the state that I have and call some callback function while that lock is held.
04:21.000 --> 04:29.000 But in the callback function, your consumer can do anything, including, for example, calling into some other manager and taking another lock.
04:29.000 --> 04:34.000 And this causes problems. So this way of handling state, the way we just discussed it, has some problems.
04:34.000 --> 04:40.000 And the biggest one that I found, which is actually a real showstopper, is deadlocks.
04:40.000 --> 04:47.000 Let's say consumer one takes the lock from the apple manager and then the orange manager, and consumer two happens to do it the other way around.
04:47.000 --> 04:57.000 If they both happen to take their own preferred lock first, then they each have to wait for the other, and your application simply stops working.
04:57.000 --> 05:02.000 No one can proceed and there's no way to recover from that.
05:02.000 --> 05:10.000 So to summarize: we have deadlocks; we have data duplication, which is also a thing: you have a store, you have a map, but you need to pick a key.
05:10.000 --> 05:22.000 And not every consumer wants to index on the same data. One wants to, for example, take a Kubernetes resource and index it on the name, but another one wants to do it on the namespace or label set or something else.
05:22.000 --> 05:30.000 And we have a bunch of these API variations, which are really frustrating for developers, because you do it one way here and you have to do it another way there.
05:30.000 --> 05:36.000 Every time you want to use some data, you have to relearn how this specific bit of code did it.
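A minimal sketch of that lock-ordering deadlock (hypothetical managers reduced to two mutexes; with the unlucky interleaving described above, both goroutines block forever and the Go runtime reports a fatal deadlock):

```go
package main

import "sync"

// Two independent managers, each guarding its own state with its own lock.
var (
	appleMu  sync.Mutex
	orangeMu sync.Mutex
)

func consumerOne(done chan<- struct{}) {
	appleMu.Lock()  // takes the apple lock first ...
	orangeMu.Lock() // ... then the orange lock
	orangeMu.Unlock()
	appleMu.Unlock()
	done <- struct{}{}
}

func consumerTwo(done chan<- struct{}) {
	orangeMu.Lock() // takes the orange lock first ...
	appleMu.Lock()  // ... then the apple lock: the opposite order
	appleMu.Unlock()
	orangeMu.Unlock()
	done <- struct{}{}
}

func main() {
	done := make(chan struct{})
	go consumerOne(done)
	go consumerTwo(done)
	// If each consumer grabs its first lock before the other grabs its
	// second, both wait for a lock that will never be released.
	<-done
	<-done
}
```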
05:36.000 --> 05:43.000 So we started looking into ways to fix this. And the big inspiration was HashiCorp Nomad.
05:43.000 --> 05:48.000 HashiCorp Nomad is also a sort of cloud orchestration platform.
05:48.000 --> 05:54.000 And the way they dealt with distributing state is to use something called go-memdb.
05:54.000 --> 06:02.000 So they have a Raft protocol to basically pick leaders, and those leaders are where you insert data, and it gets distributed everywhere.
06:02.000 --> 06:14.000 And this distributed data is picked up by a single thread in their system, which writes it into this in-memory database.
06:14.000 --> 06:20.000 So we looked at this, and it was like 90% of what we want, but there were a few issues.
06:20.000 --> 06:29.000 Namely, it has a single database-wide write lock, and we typically have not one place where we insert data, but a lot of producers of data.
06:29.000 --> 06:37.000 It's an older codebase, so they use plain interface values, and we wanted type safety.
06:37.000 --> 06:44.000 And there were just a number of small little features that we wanted that it just didn't have yet.
06:44.000 --> 06:55.000 We actually first started with this as our basis, but then we wanted to add some things.
06:55.000 --> 07:01.000 So we basically took it and rewrote it to suit our needs.
07:01.000 --> 07:06.000 So we added generics, so all of our tables have generic data types.
07:06.000 --> 07:11.000 Independent write locks, so you can lock multiple tables independently, or sets of tables.
07:11.000 --> 07:15.000 And also a few other little features.
07:15.000 --> 07:27.000 So we can initialize tables, we can watch tables, so whenever things change we can see what happened; metrics, nice iterators.
07:27.000 --> 07:30.000 Optimistic concurrency, that sort of stuff.
07:30.000 --> 07:36.000 I won't be talking about all of this, but these are the things that we added.
07:36.000 --> 07:41.000 So to summarize: what is this in-memory database, StateDB?
07:41.000 --> 07:47.000 It's in memory, it doesn't persist to disk. When your application starts up, you get an empty database.
07:47.000 --> 07:53.000 You need to populate it yourself. Technically, you can persist it yourself, but the library just doesn't do it for you.
07:53.000 --> 08:02.000 And because it's all in memory, you can actually have very complicated data types in there, such as maps, channels, pointers, you name it.
08:02.000 --> 08:09.000 If it's a Go type, you can store it as a value in the store. Something you wouldn't be able to do on disk easily.
08:10.000 --> 08:19.000 It has multi-version concurrency control, so multiple readers and writers can work concurrently and they don't interfere with each other.
08:19.000 --> 08:27.000 So you can have a read transaction and a write transaction going, and the read transaction will still see all of the data consistently.
08:27.000 --> 08:38.000 We can lock across tables, so we can lock a set of tables together and then update them together in one transaction, or not.
08:38.000 --> 08:53.000 We have multiple indexes: you have a primary index, and you can make as many secondary indexes as you want to index the same data. And watch channels, which is a concept we took from HashiCorp.
08:53.000 --> 09:03.000 And the idea is that you do a query, and you get a channel back that tells you when your query result changed, when the content changed.
09:03.000 --> 09:10.000 So how would you use something like that? Back to the apple, the fruit example. I create an Apple.
09:10.000 --> 09:15.000 I define an AppleID; I will come back to that in a bit.
09:15.000 --> 09:22.000 That will be the ID part, my key. And I assign it a grade, which is an enum, a letter A to F.
09:22.000 --> 09:31.000 And then we create a table. In this case, I'm creating a global object for the table, because that's easier for a standalone example.
09:31.000 --> 09:39.000 In our case, we use it via a dependency injection mechanism. But for a standalone example, this is best.
09:39.000 --> 09:46.000 So we define it as a readable and writable table. You can technically also cast it to a read-only table if you want to.
09:46.000 --> 09:52.000 And we say it stores an Apple, and we give it two indexes.
09:52.000 --> 10:01.000 So the ID index is the primary index, and the apple grade index is a secondary index: it's not on the key, but on this value, the grade value that it has.
10:01.000 --> 10:10.000 The primary index looks something like this. I have this AppleID; it has a ToKey method.
10:10.000 --> 10:17.000 I defined this myself. It just produces a byte slice: it takes the three numbers and makes one key out of them.
10:17.000 --> 10:19.000 We put it in a key set.
10:19.000 --> 10:25.000 And for the query side I take the AppleID itself and call the same method.
10:25.000 --> 10:29.000 What you'll notice is that I give this index a name; this is just for observability.
10:29.000 --> 10:37.000 FromObject: whenever you insert an object into the database, it will call this function to get a key from your object, whatever its shape is.
10:37.000 --> 10:46.000 And FromKey is for when you do a query: you give a key to the query, and it will go into the database to actually find it.
10:46.000 --> 10:52.000 The unique part: because it's a primary index, it has to be unique, but you can also have non-unique indexes.
10:53.000 --> 11:01.000 For example, the grade one is a non-unique index: I can have multiple apples with the A grade, for example.
11:01.000 --> 11:10.000 And what I did here is I took the grade and, because it's an enum, I cast it into a uint32.
11:10.000 --> 11:19.000 And index.Uint32 is just predefined; we have a few predefined functions that take Go data types and turn them into keys.
11:19.000 --> 11:27.000 And again, a key is just a byte slice. Anything that can be represented as a byte slice can be indexed in the database.
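The next few sketches approximate this walkthrough in Go. The package paths, type names and signatures are my assumptions about the API described in the talk (they may differ from the real cilium/statedb), and the orchard/tree/number fields of AppleID are taken from the prefix example later in the talk:

```go
package fruit

import (
	"encoding/binary"

	"github.com/cilium/statedb"       // assumed import path
	"github.com/cilium/statedb/index" // assumed import path
)

// AppleID is the primary key: an orchard, a tree and a number.
type AppleID struct {
	Orchard uint32
	Tree    uint32
	Number  uint32
}

// ToKey packs the three numbers into a single byte-slice key.
// Anything that can be represented as a byte slice can be indexed.
func (id AppleID) ToKey() index.Key {
	k := make([]byte, 0, 12)
	k = binary.BigEndian.AppendUint32(k, id.Orchard)
	k = binary.BigEndian.AppendUint32(k, id.Tree)
	k = binary.BigEndian.AppendUint32(k, id.Number)
	return index.Key(k)
}

// Grade is an enum, a letter A to F.
type Grade uint32

const (
	GradeA Grade = iota
	GradeB
	GradeC
	GradeD
	GradeE
	GradeF
)

type Apple struct {
	ID    AppleID
	Grade Grade
}

// IDIndex is the primary index: unique, keyed by the AppleID.
var IDIndex = statedb.Index[Apple, AppleID]{
	Name: "id", // just for observability
	FromObject: func(a Apple) index.KeySet {
		// Called on insert to derive the key from the object.
		return index.NewKeySet(a.ID.ToKey())
	},
	// Called on queries to turn the given key into bytes.
	FromKey: AppleID.ToKey,
	Unique:  true,
}

// GradeIndex is a secondary, non-unique index on the grade value.
var GradeIndex = statedb.Index[Apple, Grade]{
	Name: "grade",
	FromObject: func(a Apple) index.KeySet {
		return index.NewKeySet(index.Uint32(uint32(a.Grade)))
	},
	FromKey: func(g Grade) index.Key { return index.Uint32(uint32(g)) },
	Unique:  false,
}

// Apples is the read-write table, defined as a global for a standalone
// example (error ignored for brevity).
var Apples, _ = statedb.NewTable("apples", IDIndex, GradeIndex)
```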
11:27.000 --> 11:32.000 So we create our database, we register our table, and we start it.
11:32.000 --> 11:40.000 This Start call starts a background worker, which is important for a few things.
11:40.000 --> 11:47.000 Now let's produce a few apples. So when I produce something, I take a write transaction.
11:47.000 --> 11:51.000 I defer the commit, and I insert my apple.
11:51.000 --> 11:55.000 And if something errors, I abort my transaction.
11:55.000 --> 11:59.000 And you can imagine that you can do this with multiple objects.
11:59.000 --> 12:05.000 I didn't use the first and second values it returns here, but it actually returns the old object.
12:05.000 --> 12:10.000 So if there was already an apple with the same key in there, it would return me the old apple that it replaced.
12:10.000 --> 12:13.000 And the boolean tells me whether it did replace something or not.
12:13.000 --> 12:18.000 Basically an insert-or-update.
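Continuing the sketch, with the same caveat that the construction and transaction calls (New, RegisterTable, Start, WriteTxn, Insert, Commit, Abort) are approximations of what the talk describes rather than guaranteed signatures:

```go
package fruit

import (
	"log"

	"github.com/cilium/statedb" // assumed import path
)

// NewFruitDB creates the database, registers the apples table and
// starts the background worker mentioned in the talk.
func NewFruitDB() (*statedb.DB, error) {
	db := statedb.New()
	if err := db.RegisterTable(Apples); err != nil {
		return nil, err
	}
	db.Start() // background work, e.g. cleaning up deleted objects
	return db, nil
}

// Produce inserts a few apples in one write transaction that locks
// only the apples table.
func Produce(db *statedb.DB) error {
	wtxn := db.WriteTxn(Apples)
	for _, a := range []Apple{
		{ID: AppleID{Orchard: 3, Tree: 5, Number: 1}, Grade: GradeA},
		{ID: AppleID{Orchard: 3, Tree: 5, Number: 2}, Grade: GradeB},
	} {
		// Insert is an insert-or-update: it returns the object it
		// replaced (if any) and whether it replaced something.
		old, replaced, err := Apples.Insert(wtxn, a)
		if err != nil {
			wtxn.Abort() // throw the whole transaction away
			return err
		}
		if replaced {
			log.Printf("replaced %v", old)
		}
	}
	wtxn.Commit() // publish the new version to readers
	return nil
}
```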
12:18.000 --> 12:26.000 Getting it back is very simple. We do a Get with a read transaction.
12:26.000 --> 12:34.000 We create a query from this ID, and we get our apple back.
12:34.000 --> 12:41.000 To read the whole table, we use a different call: we tell the table, hey, we want to get all and watch.
12:41.000 --> 12:48.000 Every API call, or most API calls, have this watch variant. This returns an iterator.
12:48.000 --> 12:55.000 These are the new Go iterator types, and it just returns me an iterator of apples.
12:55.000 --> 13:01.000 And the watch is a channel, a channel of empty structs.
13:01.000 --> 13:09.000 And it gets closed by the database when, in this case, because we query the whole table, anything happens to the table.
13:09.000 --> 13:17.000 It closes the channel and notifies us: hey, there's something new you can observe.
13:17.000 --> 13:26.000 We can also do a prefix watch. If you notice, here I have the orchard, the tree and a number; here I only have the orchard and the tree.
13:26.000 --> 13:33.000 So this is only the first part of my key, and I can basically say: give me every apple from this orchard and this tree.
13:33.000 --> 13:36.000 And it will give me everything with that prefix in the key.
13:36.000 --> 13:44.000 And you can use this for IP addresses, strings, stuff like that.
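A sketch of that read side, again with approximate names and signatures; the prefix watch just shown follows the same pattern, querying with only the leading part of the key:

```go
package fruit

import (
	"context"
	"fmt"

	"github.com/cilium/statedb" // assumed import path
)

// ReadExamples shows a point lookup, a full-table read with a watch
// channel, and waiting for the table to change.
func ReadExamples(ctx context.Context, db *statedb.DB) {
	txn := db.ReadTxn() // a consistent snapshot; never blocks writers

	// Point lookup via the primary index.
	id := AppleID{Orchard: 3, Tree: 5, Number: 1}
	if apple, _, found := Apples.Get(txn, IDIndex.Query(id)); found {
		fmt.Println("found:", apple)
	}

	// Read the whole table. The returned watch channel is closed by
	// the database as soon as anything in this table changes.
	apples, watch := Apples.AllWatch(txn)
	for apple := range apples {
		fmt.Println(apple)
	}

	// Block until the table changes (or we are cancelled); a caller
	// would then take a fresh read transaction and query again.
	select {
	case <-watch:
	case <-ctx.Done():
	}
}
```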
13:44.000 --> 13:52.000 So what is a practical example of this? This is, sort of, how we use it. We have a bunch of resources at the top.
13:52.000 --> 13:55.000 They live in the Kubernetes API server.
13:55.000 --> 14:01.000 We have a gRPC watch that goes into a component called a reflector.
14:01.000 --> 14:04.000 That's not part of the library, but something we built in Cilium.
14:04.000 --> 14:08.000 They go into tables, and these tables can then be observed.
14:08.000 --> 14:14.000 And the cool thing about this infrastructure is that you have the tables as an abstraction layer between all of your components.
14:14.000 --> 14:24.000 Meaning that if I wanted to test my load-balancer component, I can, in my test, leave out the reflectors and the maps.
14:24.000 --> 14:27.000 And I can just create a bunch of information in my tables.
14:27.000 --> 14:38.000 That is my input, and then I can assert that, after it has reconciled, it produced exactly the output in my lower tables that I expected.
14:38.000 --> 14:41.000 So I can test my components in isolation.
14:41.000 --> 14:49.000 It's also extensible. We've got a few arrows going off to other components that need to use the same information.
14:49.000 --> 14:54.000 And if some component needs to index it differently, then we just add a new index to the existing table.
14:54.000 --> 14:58.000 And we don't have to create a completely separate data store.
14:58.000 --> 15:06.000 So how does this work internally? There's actually a really cool data structure behind it, called an immutable adaptive radix tree.
15:06.000 --> 15:09.000 To explain it, we'll first start with the radix tree.
15:09.000 --> 15:16.000 For those who don't know: I added six objects in here, and this is, in effect, what a radix tree does with them.
15:16.000 --> 15:32.000 It's essentially the idea of a binary tree, where you sort your keys, but instead of having two children, you can have 256 children.
15:32.000 --> 15:38.000 So it's basically a tree where your nodes can get very, very big.
15:38.000 --> 15:48.000 The yellow ones are my nodes. So this is what it would look like: everything with a common prefix shares nodes in this way.
15:48.000 --> 15:53.000 The adaptive part has to do with how the data gets represented.
15:53.000 --> 16:03.000 A normal radix tree would be the picture above, where you would use 256 pointers for every node.
16:03.000 --> 16:07.000 Every time you get a byte, you can basically look it up in an array and get your next node.
16:07.000 --> 16:11.000 But 256 is very big if you have a sparse set.
16:11.000 --> 16:16.000 So what the adaptive radix tree does is pick smaller nodes if you have less data in there.
16:16.000 --> 16:21.000 This is essentially an implementation detail. There are four different node sizes.
16:21.000 --> 16:28.000 The small node gets represented like this: you just have an array of your keys and your pointers.
16:28.000 --> 16:34.000 And you simply loop over them to find the next one, because that's just fast on modern CPUs.
16:34.000 --> 16:43.000 And only when you actually have very wide trees do you scale up internally to use bigger nodes.
16:43.000 --> 16:48.000 And this is both faster in lookups and better in memory use.
16:48.000 --> 16:52.000 The immutable part is the thing that makes the database actually work.
16:52.000 --> 16:57.000 So if we go back to our radix tree: what if you want to update this?
16:57.000 --> 17:01.000 Let's say I'm going to add "testing" to this.
17:01.000 --> 17:08.000 The steps for an immutable radix tree would be: we cannot change the existing nodes.
17:08.000 --> 17:14.000 We create a new root. We know that "t" is a shared node.
17:14.000 --> 17:21.000 So we basically take the existing "t" and copy it, and we also copy the existing links it has.
17:21.000 --> 17:28.000 We know that "est" is a node; it's both a leaf node, so it has a value in it, and it also goes further.
17:28.000 --> 17:31.000 So we copy it, and then we put our "ing" under it.
17:31.000 --> 17:35.000 So now we have two copies of this bit.
17:36.000 --> 17:48.000 And if you look at this from the new root, I can still access all of the same nodes that I could from the old one, except it also has "testing".
17:48.000 --> 17:57.000 So you basically create a sort of diff: compared to just modifying the tree in place, I only add the difference.
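StateDB's indexes are adaptive radix trees; as a much smaller illustration of the same path-copying idea, here is a sketch using a plain binary search tree (my own example, not the actual implementation):

```go
package pathcopy

// node is an immutable binary-search-tree node: once created it is never
// modified, so existing readers can keep traversing old versions safely.
type node struct {
	key         string
	value       any
	left, right *node
}

// insert returns the root of a NEW tree that contains (key, value).
// Only the nodes on the path from the root to the insertion point are
// copied; everything else is shared with the old tree (the "diff").
func insert(n *node, key string, value any) *node {
	if n == nil {
		return &node{key: key, value: value}
	}
	cp := *n // copy this node, leave the original untouched
	switch {
	case key < n.key:
		cp.left = insert(n.left, key, value)
	case key > n.key:
		cp.right = insert(n.right, key, value)
	default:
		cp.value = value // update in the copy only
	}
	return &cp
}

// Example: old and updated roots coexist; readers holding `old` still
// see the tree without "testing", readers of `updated` see it with.
//
//	old := insert(insert(nil, "test", 1), "toast", 2)
//	updated := insert(old, "testing", 3)
```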
17:57.000 --> 18:04.000 With our immutable radix tree, the way this would work is: we have a consumer that says, give me the root node.
18:04.000 --> 18:06.000 So it gets the old one.
18:06.000 --> 18:11.000 Now the update happens, so these green nodes get added.
18:11.000 --> 18:16.000 We switch the pointer over, so it's a sort of RCU (read-copy-update) change.
18:16.000 --> 18:21.000 And then a new consumer comes along and says, give me the root of the tree.
18:21.000 --> 18:23.000 It gets the new one.
18:23.000 --> 18:27.000 And this is how our versioning works, at a basic level.
18:27.000 --> 18:33.000 So the new consumer sees the new version, and the old consumer doesn't see the new changes.
18:33.000 --> 18:36.000 And they can both independently operate on this.
18:36.000 --> 18:42.000 And the moment consumer one goes away, there are no longer any pointers to this root.
18:42.000 --> 18:48.000 So it actually gets garbage collected, by the Go runtime itself; it just automatically goes away.
18:48.000 --> 18:55.000 And from then on, it just seems as if the tree always looked like this.
18:55.000 --> 19:00.000 So we have a sort of generational radix tree that keeps going and going.
19:00.000 --> 19:08.000 And when you want to use this in a database, we extend this out.
19:08.000 --> 19:14.000 So you start with a database root, and the root has a bunch of tables.
19:14.000 --> 19:20.000 Every table has multiple indexes, and every index is actually one of these immutable radix trees.
19:20.000 --> 19:26.000 This upper part is not itself a radix tree, but it uses the same principle.
19:26.000 --> 19:38.000 So every time we update a table or an index, we make a copy of the header of that index, of that table; we do a sort of generational copy.
19:38.000 --> 19:42.000 And then we just switch the root database pointer.
19:42.000 --> 19:49.000 The way queries work: because we have this tree, we can exploit its properties for the queries.
19:49.000 --> 19:59.000 If we want to query a very specific object, because we know the keys are sorted from less to more,
19:59.000 --> 20:04.000 we can basically do a sort of binary search down the tree.
20:04.000 --> 20:12.000 But we can also do a prefix... sorry, this is a lower-bound search.
20:12.000 --> 20:18.000 So you basically say: I want this key and everything that is greater than this key.
20:18.000 --> 20:28.000 We basically find the split point, and from that point on you can traverse the tree from left to right, all the way until you have found all of the leaves.
20:28.000 --> 20:35.000 You can also do a prefix search. You say, okay, this is my prefix, and then you traverse that whole subtree.
20:35.000 --> 20:39.000 So this way, we can do a bunch of these queries.
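The root-pointer switch described here can be sketched on top of the path-copying `insert` from the previous example, using an atomic pointer. This is illustrative only; StateDB serializes writers with its table locks rather than the retry loop shown here:

```go
package pathcopy

import "sync/atomic"

// Tree publishes an immutable root; readers grab a snapshot, writers
// build a new version and atomically switch the pointer (RCU style).
type Tree struct {
	root atomic.Pointer[node]
}

// Snapshot gives a reader a consistent view; the nodes it references
// are never modified, so it stays valid even while writers commit.
func (t *Tree) Snapshot() *node {
	return t.root.Load()
}

// Insert builds a new version via path copying and publishes it.
// Old snapshots keep working; once the last reader drops its root,
// the unreferenced nodes are reclaimed by the Go garbage collector.
func (t *Tree) Insert(key string, value any) {
	for {
		old := t.root.Load()
		next := insert(old, key, value)
		if t.root.CompareAndSwap(old, next) {
			return
		}
		// Another writer got in between; retry on top of its version.
	}
}
```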
20:39.000 --> 20:42.000 Secondary indexes work like this.
20:42.000 --> 20:51.000 I just made a primary index here, a primary index with one-byte keys for now.
20:51.000 --> 20:58.000 The keys point to objects, and this is a unique index, so every key points to exactly one object.
20:58.000 --> 21:07.000 Now, our objects in the example below have this Foo property, which is a slice, so we store a slice in there.
21:07.000 --> 21:10.000 And we can actually index on that slice.
21:10.000 --> 21:18.000 So we create a Foo index. We take that value and we index it, so we get "bar" or "baz".
21:18.000 --> 21:31.000 And then every "bar" or "baz" points to one or more of the actual objects.
21:31.000 --> 21:39.000 But they're still the same objects: whenever we did this copying, we never actually duplicated the objects themselves.
21:39.000 --> 21:42.000 That is just something to remember about this.
21:42.000 --> 21:53.000 Because if you have objects that don't like shallow clones, they aren't ideal for storage in here, because then you get very weird issues.
21:53.000 --> 21:57.000 And I told you about the watches; let me show you where this channel comes from.
21:57.000 --> 22:05.000 The way this works is that every header, so this is the header of every node, never mind which of the four sizes it is,
22:05.000 --> 22:10.000 has some flags, and it has the prefix, so we can compress a prefix.
22:10.000 --> 22:16.000 So we don't just have one byte per node, but we can have path compression.
22:16.000 --> 22:18.000 And it has this watch channel.
22:18.000 --> 22:30.000 And what happens is: whenever we do the cloning, we have a transaction, and our transaction keeps track of which nodes were replaced by a new version.
22:30.000 --> 22:39.000 And whenever we commit that transaction, we basically just loop over that list of everything that changed, and we close the watch channels.
22:39.000 --> 22:49.000 And because we do prefix searches, or a search for a specific key, we can just pick one of the nodes and return that watch channel.
22:49.000 --> 22:57.000 And when anything changes that is relevant to your query, that is the way you get notified.
22:57.000 --> 23:03.000 So, transactions. I talked about transactions, about commits and aborts.
23:03.000 --> 23:11.000 On abort, because we have this generational copying: whenever you make a copy
23:11.000 --> 23:16.000 and you decide, oh, I don't want to use this copy, you can just remove all references to the copy
23:16.000 --> 23:19.000 and let it be collected by the garbage collector.
23:19.000 --> 23:30.000 There's no additional work in doing an abort: whenever you say, I don't want this transaction, it just gets collected.
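At its core the notification relies on a standard Go idiom: closing a channel wakes every current waiter at once. A small sketch of that idiom (not the actual node-header layout):

```go
package notify

import "sync"

// watchable hands out a channel that is closed on the next change.
// Each change swaps in a fresh channel for future watchers, which is
// roughly what happens per node header at commit time.
type watchable struct {
	mu sync.Mutex
	ch chan struct{}
}

func newWatchable() *watchable {
	return &watchable{ch: make(chan struct{})}
}

// Watch returns a channel that is closed when the next change commits.
func (w *watchable) Watch() <-chan struct{} {
	w.mu.Lock()
	defer w.mu.Unlock()
	return w.ch
}

// NotifyChanged is called at commit time for every replaced node:
// closing the old channel wakes all watchers at once.
func (w *watchable) NotifyChanged() {
	w.mu.Lock()
	defer w.mu.Unlock()
	close(w.ch)
	w.ch = make(chan struct{})
}
```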
23:30.000 --> 23:40.000 And now the interesting bit: I talked in the beginning about deadlocks. We use something called a sortable mutex.
23:40.000 --> 23:43.000 And I'm going to explain it with a sort of mini animation.
23:43.000 --> 23:52.000 Let's say we have five tables, and I have some write transaction that wants to lock tables two, four, and five.
23:52.000 --> 23:55.000 That's possible: they are available, so they get taken.
23:55.000 --> 24:01.000 And the rule with sortable mutexes is that you always take them from left to right.
24:01.000 --> 24:05.000 So you first take two, then four, then five.
24:05.000 --> 24:12.000 Now let's say a second transaction comes along that wants to lock three tables.
24:12.000 --> 24:15.000 It can lock table one, that's no problem.
24:15.000 --> 24:20.000 But two is already taken, so it will block.
24:20.000 --> 24:30.000 But because we always go from left to right, and the first transaction unlocks in the same way, this will eventually be okay.
24:30.000 --> 24:38.000 Let's say another one comes along: table three is still available, so it can be taken, and that thread just goes on.
24:38.000 --> 24:45.000 So only the middle transaction is waiting until it can access its tables.
24:46.000 --> 24:51.000 Our first one goes away, and then we can start locking table two.
24:51.000 --> 24:55.000 We might still have to wait, but we make progress.
24:55.000 --> 25:01.000 So every write transaction will always progress, and it also ensures a level of fairness.
25:01.000 --> 25:04.000 It's not perfect, scheduling-wise.
25:04.000 --> 25:16.000 But because you always lock and unlock in this ordering, you can basically guarantee that you're never going in the opposite direction.
25:16.000 --> 25:20.000 You never have two parties waiting for each other's held locks.
25:20.000 --> 25:23.000 And that solves our deadlock issue.
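A minimal sketch of that idea (not StateDB's actual implementation): give every table lock a fixed position and always acquire any set of them in ascending order, so no two writers can ever wait on each other in a cycle.

```go
package sortedlock

import (
	"sort"
	"sync"
)

// TableLock is a mutex with a fixed position, e.g. table registration order.
type TableLock struct {
	pos int
	mu  sync.Mutex
}

// LockAll acquires an arbitrary set of table locks, but always in
// ascending position order ("left to right"). Because every writer
// follows the same global order, a writer can only ever wait on locks
// held by writers that are strictly ahead of it, so no cycle of
// waiters, and therefore no deadlock, can form.
func LockAll(locks ...*TableLock) {
	sort.Slice(locks, func(i, j int) bool { return locks[i].pos < locks[j].pos })
	for _, l := range locks {
		l.mu.Lock()
	}
}

// UnlockAll releases the locks; release order does not matter for safety.
func UnlockAll(locks ...*TableLock) {
	for _, l := range locks {
		l.mu.Unlock()
	}
}
```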
25:23.000 --> 25:32.000 The last thing I wanted to show is the change tracking, which I find one of the best features in here.
25:32.000 --> 25:41.000 What you can do is say: I have a table, and I want to watch all of the changes that happened since the last time I checked.
25:41.000 --> 25:47.000 So you can basically do one big query, process everything, and from then on only process the differences.
25:47.000 --> 25:54.000 You need to take a write transaction to create this change iterator.
25:54.000 --> 26:00.000 And that's because it does a few things in the background which need a lock on the table.
26:00.000 --> 26:09.000 But after you've done that, we take a write... sorry, a read transaction.
26:09.000 --> 26:17.000 And with the read transaction, we just say: okay, I'm at this point.
26:17.000 --> 26:20.000 What happened between the last time I checked and this new point?
26:20.000 --> 26:22.000 And you can iterate over all of the changes.
26:22.000 --> 26:31.000 And it tells you: was it deleted, and what was the object that was there; or what is the new object that has been inserted or updated?
26:31.000 --> 26:38.000 To do this, we have a hidden index called the revision index, which always exists.
26:38.000 --> 26:47.000 Every transaction gets an incrementing number, and all changes made in that transaction are tracked under it.
26:47.000 --> 26:57.000 So we can do this lower-bound search on the revision index, where you say: give me everything that is above this number,
26:57.000 --> 27:02.000 to basically give us a sort of history from a certain revision.
27:02.000 --> 27:07.000 This works for new objects, but not for the ones that are deleted.
27:07.000 --> 27:15.000 So whenever you are watching changes on a table, we mark that.
27:15.000 --> 27:19.000 Whenever something then gets deleted, it goes into a sort of soft-delete state:
27:19.000 --> 27:25.000 the graveyard index, which is another primary index, but for all objects that have been deleted.
27:25.000 --> 27:35.000 And once everyone has observed that something was deleted, the background worker that we started with db.Start will remove these old ones.
27:36.000 --> 27:44.000 So it allows you to temporarily keep the deleted objects until everyone that's interested in deletions has observed them,
27:44.000 --> 27:47.000 and then they will automatically go away.
27:47.000 --> 27:52.000 So that was all I have time for; I have so much more I could show you.
27:52.000 --> 27:55.000 Please have a look at StateDB.
27:55.000 --> 28:04.000 And I would like to add: I am here speaking, but actually my colleague Jussi did most of the legwork on this.
28:04.000 --> 28:10.000 I was just his professional user, so I had some influence on the design,
28:10.000 --> 28:16.000 and I work on it a bit, but Jussi is the real hero here.
28:16.000 --> 28:20.000 Questions? Of course.
28:20.000 --> 28:26.000 I don't know if we have time for any questions, but I'm ready. We don't have time for questions.
28:26.000 --> 28:28.000 If you have questions, Dylan is here.
28:28.000 --> 28:32.000 Okay, just don't bother him during the busiest moments of today. Thank you.