WEBVTT
00:00.000 --> 00:18.000 I first of all want to talk about how this conference is possible. It's not any sponsor, it's not any big company organizing this, it's not me, it is volunteers.
00:18.000 --> 00:32.000 And Dylan is volunteering today to help me organize this day. First of all, a round of applause for organizing.
00:32.000 --> 00:38.000 Otherwise I will forget it later. I am introducing Dylan, our speaker today.
00:38.000 --> 00:49.000 We actually have been working together on some amazing open source projects, which are also super complex. And I was always complaining about 500 variables being thrown around everywhere in our codebase.
00:49.000 --> 00:59.000 And somebody very smart said, let's fix this and let's build something called StateDB. And it was so interesting that I invited Dylan here on stage to talk about it.
00:59.000 --> 01:01.000 A round of applause for Dylan.
01:10.000 --> 01:13.000 No, it's not the green one. Other side? Yeah, that's it.
01:13.000 --> 01:18.000 All right, thank you all. How is everyone doing? Yeah, how are you doing? Great.
01:19.000 --> 01:26.000 Come on, you can be a bit more excited than that. We have to keep this up for the rest of the day.
01:26.000 --> 01:32.000 So I would like to talk about StateDB. Before I do, a little "who am I": my name is Dylan.
01:32.000 --> 01:39.000 I work for Isovalent, which is now part of Cisco. I'm a Cilium contributor.
01:40.000 --> 01:49.000 That's my main job, and I'm a reviewer for the eBPF library, which Cilium uses.
01:49.000 --> 01:56.000 And for those of you who don't know, those are both very Go-heavy projects.
01:56.000 --> 02:01.000 To explain why StateDB, I need to explain a little bit about Cilium.
02:01.000 --> 02:09.000 So Cilium is a cloud networking product that does a lot. I won't go into it very deeply.
02:09.000 --> 02:17.000 But I like to draw a very basic diagram of all of the things that go in and out of Cilium.
02:17.000 --> 02:27.000 The Kubernetes API server, etcd, a bunch of things in the kernel, and even inter-process communication on the same node.
02:27.000 --> 02:36.000 And what tends to happen is a lot of information comes in, a lot of events happen, and we need to reconcile all of that information.
02:36.000 --> 02:42.000 And we need to do it quite often, with a lot of different data.
02:42.000 --> 02:57.000 So if we look at how you store program state, a very naive approach, which is always a good one to start with, is: you take a map.
02:57.000 --> 03:01.000 I put my data in a map or an array or some sort of container.
03:01.000 --> 03:08.000 Now I need to make it safe for concurrent use, so I slap a lock in there.
03:08.000 --> 03:17.000 And then we replicate this pattern. So I made an apple manager and an orange manager, and they manage their own little bits of data.
03:17.000 --> 03:26.000 And these have consumers. So what tends to happen is that these managers are created by different people with different goals.
03:26.000 --> 03:34.000 They take maybe the same data, maybe different data, indexed differently. And you have to create some sort of API for this.
03:34.000 --> 03:39.000 So with a normal map, you would just make a lookup function, an add function and a delete function.
03:39.000 --> 03:46.000 But when you add a lock, you lock and defer the release in every one of those functions.
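As a concrete illustration of the pattern just described, a minimal Go sketch (all names here are hypothetical; this is not code from Cilium):

```go
package fruitstate

import "sync"

// Apple is a stand-in for whatever piece of state a manager owns.
type Apple struct {
	ID    string
	Grade byte
}

// AppleManager is the naive pattern: a map guarded by a lock,
// wrapped in a small lookup/add/delete API.
type AppleManager struct {
	mu     sync.RWMutex
	apples map[string]Apple
}

func NewAppleManager() *AppleManager {
	return &AppleManager{apples: make(map[string]Apple)}
}

// Lookup takes the read lock and defers the release.
func (m *AppleManager) Lookup(id string) (Apple, bool) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	a, ok := m.apples[id]
	return a, ok
}

// Add takes the write lock and defers the release.
func (m *AppleManager) Add(a Apple) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.apples[a.ID] = a
}

// Delete takes the write lock and defers the release.
func (m *AppleManager) Delete(id string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	delete(m.apples, id)
}
```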
03:46.000 --> 03:50.000 But this doesn't always work, if you need to operate on bits of the data.
03:50.000 --> 03:55.000 So what we often have is we say, okay, we need to loop over this map.
03:55.000 --> 04:01.000 To do that, we need to hold the lock the entire time, because we can't have changes in the middle of looking at it.
04:01.000 --> 04:12.000 So what ends up happening is that you need to expose that lock somehow to your consumer, which is bad practice and shouldn't be done.
04:12.000 --> 04:21.000 Or you create some sort of callback mechanism, where you say: okay, let me loop over all of the state that I have and call some callback function while that lock is held.
04:21.000 --> 04:29.000 But in the callback function, your consumer can do anything, including, for example, calling into some other manager and taking another lock.
04:29.000 --> 04:34.000 And this causes problems. So this way of handling state, the way we just discussed it, has some problems.
04:34.000 --> 04:40.000 And the biggest one that I found, which is actually a real showstopper, is deadlocks.
04:40.000 --> 04:47.000 Let's say consumer one takes the lock from the apple manager and then the orange manager, and consumer two happens to do it the other way around.
04:47.000 --> 04:57.000 If they both happen to take their own preferred lock first, then they each have to wait for the other, and your application simply stops working.
04:57.000 --> 05:02.000 No one can proceed and there's no way to recover from that.
05:02.000 --> 05:10.000 So to summarize: we have deadlocks; we have data duplication, which is also a thing: you have a store, you have a map, but you need to pick a key.
05:10.000 --> 05:22.000 And not every consumer wants to index on the same data. One wants to, for example, take a Kubernetes resource and index it on the name, but another one wants to do it on the namespace or label set or something else.
05:22.000 --> 05:30.000 And we have a bunch of these API variations, which are really frustrating for developers, because you do it one way here and you have to do it another way there.
05:30.000 --> 05:36.000 Every time you want to use some data, you have to relearn how this specific bit of code did it.
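A minimal sketch of that lock-ordering deadlock (hypothetical managers reduced to two mutexes; with the unlucky interleaving described above, both goroutines block forever and the Go runtime reports a fatal deadlock):

```go
package main

import "sync"

// Two independent managers, each guarding its own state with its own lock.
var (
	appleMu  sync.Mutex
	orangeMu sync.Mutex
)

func consumerOne(done chan<- struct{}) {
	appleMu.Lock()  // takes the apple lock first ...
	orangeMu.Lock() // ... then the orange lock
	orangeMu.Unlock()
	appleMu.Unlock()
	done <- struct{}{}
}

func consumerTwo(done chan<- struct{}) {
	orangeMu.Lock() // takes the orange lock first ...
	appleMu.Lock()  // ... then the apple lock: the opposite order
	appleMu.Unlock()
	orangeMu.Unlock()
	done <- struct{}{}
}

func main() {
	done := make(chan struct{})
	go consumerOne(done)
	go consumerTwo(done)
	// If each consumer grabs its first lock before the other grabs its
	// second, both wait for a lock that will never be released.
	<-done
	<-done
}
```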
05:36.000 --> 05:43.000 So we started looking into ways to fix this. And the big inspiration was HashiCorp Nomad.
05:43.000 --> 05:48.000 HashiCorp Nomad is also a sort of cloud orchestration platform.
05:48.000 --> 05:54.000 And the way they dealt with distributing state is to use something called go-memdb.
05:54.000 --> 06:02.000 So they have a Raft protocol to basically pick leaders, and those leaders are where you insert data, and it gets distributed everywhere.
06:02.000 --> 06:14.000 And this distributed data is picked up by a single thread in their system, which writes it into this in-memory database.
06:14.000 --> 06:20.000 So we looked at this, and it was like 90% of what we want, but there were a few issues.
06:20.000 --> 06:29.000 Namely, it has a single database-wide write lock, and we typically have not one place where we insert data, but a lot of producers of data.
06:29.000 --> 06:37.000 It's an older codebase, so they use plain interface values, and we wanted type safety.
06:37.000 --> 06:44.000 And there were just a number of small little features that we wanted that it just didn't have yet.
06:44.000 --> 06:55.000 We actually first started with this as our basis, but then we wanted to add some things.
06:55.000 --> 07:01.000 So we basically took it and rewrote it to suit our needs.
07:01.000 --> 07:06.000 So we added generics, so all of our tables have generic data types.
07:06.000 --> 07:11.000 Independent write locks, so you can lock multiple tables independently, or sets of tables.
07:11.000 --> 07:15.000 And also a few other little features.
07:15.000 --> 07:27.000 So we can initialize tables, we can watch tables, so whenever things change we can see what happened; metrics, nice iterators.
07:27.000 --> 07:30.000 Optimistic concurrency, that sort of stuff.
07:30.000 --> 07:36.000 I won't be talking about all of this, but these are the things that we added.
07:36.000 --> 07:41.000 So to summarize: what is this in-memory database, StateDB?
07:41.000 --> 07:47.000 It's in memory, it doesn't persist to disk. When your application starts up, you get an empty database.
07:47.000 --> 07:53.000 You need to populate it yourself. Technically, you can persist it yourself, but the library just doesn't do it for you.
07:53.000 --> 08:02.000 And because it's all in memory, you can actually have very complicated data types in there, such as maps, channels, pointers, you name it.
08:02.000 --> 08:09.000 If it's a Go type, you can store it as a value in the store. Something you wouldn't be able to do on disk easily.
08:10.000 --> 08:19.000 It has multi-version concurrency control, so multiple readers and writers can work concurrently and they don't interfere with each other.
08:19.000 --> 08:27.000 So you can have a read transaction and a write transaction going, and the read transaction will still see all of the data consistently.
08:27.000 --> 08:38.000 We can lock across tables, so we can lock a set of tables together and then update them together in one transaction, or not.
08:38.000 --> 08:53.000 We have multiple indexes: you have a primary index, and you can make as many secondary indexes as you want to index the same data. And watch channels, which is a concept we took from HashiCorp.
08:53.000 --> 09:03.000 And the idea is that you do a query, and you get a channel back that tells you when your query result changed, when the content changed.
09:03.000 --> 09:10.000 So how would you use something like that? Back to the apple, the fruit example. I create an Apple.
09:10.000 --> 09:15.000 I define an AppleID; I will come back to that in a bit.
09:15.000 --> 09:22.000 That will be the ID part, my key. And I assign it a grade, which is an enum, a letter A to F.
09:22.000 --> 09:31.000 And then we create a table. In this case, I'm creating a global object for the table, because that's easier for a standalone example.
09:31.000 --> 09:39.000 In our case, we use it via a dependency injection mechanism. But for a standalone example, this is best.
09:39.000 --> 09:46.000 So we define it as a readable and writable table. You can technically also cast it to a read-only table if you want to.
09:46.000 --> 09:52.000 And we say it stores an Apple, and we give it two indexes.
09:52.000 --> 10:01.000 So the ID index is the primary index, and the apple grade index is a secondary index: it's not on the key, but on this value, the grade value that it has.
10:01.000 --> 10:10.000 The primary index looks something like this. I have this AppleID; it has a ToKey method.
10:10.000 --> 10:17.000 I defined this myself. It just produces a byte slice: it takes the three numbers and makes one key out of them.
10:17.000 --> 10:19.000 We put it in a key set.
10:19.000 --> 10:25.000 And for the query side I take the AppleID itself and call the same method.
10:25.000 --> 10:29.000 What you'll notice is that I give this index a name; this is just for observability.
10:29.000 --> 10:37.000 FromObject: whenever you insert an object into the database, it will call this function to get a key from your object, whatever its shape is.
10:37.000 --> 10:46.000 And FromKey is for when you do a query: you give a key to the query, and it will go into the database to actually find it.
10:46.000 --> 10:52.000 The unique part: because it's a primary index, it has to be unique, but you can also have non-unique indexes.
10:53.000 --> 11:01.000 For example, the grade one is a non-unique index: I can have multiple apples with the A grade, for example.
11:01.000 --> 11:10.000 And what I did here is I took the grade and, because it's an enum, I cast it into a uint32.
11:10.000 --> 11:19.000 And index.Uint32 is just predefined; we have a few predefined functions that take Go data types and turn them into keys.
11:19.000 --> 11:27.000 And again, a key is just a byte slice. Anything that can be represented as a byte slice can be indexed in the database.
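The next few sketches approximate this walkthrough in Go. The package paths, type names and signatures are my assumptions about the API described in the talk (they may differ from the real cilium/statedb), and the orchard/tree/number fields of AppleID are taken from the prefix example later in the talk:

```go
package fruit

import (
	"encoding/binary"

	"github.com/cilium/statedb"       // assumed import path
	"github.com/cilium/statedb/index" // assumed import path
)

// AppleID is the primary key: an orchard, a tree and a number.
type AppleID struct {
	Orchard uint32
	Tree    uint32
	Number  uint32
}

// ToKey packs the three numbers into a single byte-slice key.
// Anything that can be represented as a byte slice can be indexed.
func (id AppleID) ToKey() index.Key {
	k := make([]byte, 0, 12)
	k = binary.BigEndian.AppendUint32(k, id.Orchard)
	k = binary.BigEndian.AppendUint32(k, id.Tree)
	k = binary.BigEndian.AppendUint32(k, id.Number)
	return index.Key(k)
}

// Grade is an enum, a letter A to F.
type Grade uint32

const (
	GradeA Grade = iota
	GradeB
	GradeC
	GradeD
	GradeE
	GradeF
)

type Apple struct {
	ID    AppleID
	Grade Grade
}

// IDIndex is the primary index: unique, keyed by the AppleID.
var IDIndex = statedb.Index[Apple, AppleID]{
	Name: "id", // just for observability
	FromObject: func(a Apple) index.KeySet {
		// Called on insert to derive the key from the object.
		return index.NewKeySet(a.ID.ToKey())
	},
	// Called on queries to turn the given key into bytes.
	FromKey: AppleID.ToKey,
	Unique:  true,
}

// GradeIndex is a secondary, non-unique index on the grade value.
var GradeIndex = statedb.Index[Apple, Grade]{
	Name: "grade",
	FromObject: func(a Apple) index.KeySet {
		return index.NewKeySet(index.Uint32(uint32(a.Grade)))
	},
	FromKey: func(g Grade) index.Key { return index.Uint32(uint32(g)) },
	Unique:  false,
}

// Apples is the read-write table, defined as a global for a standalone
// example (error ignored for brevity).
var Apples, _ = statedb.NewTable("apples", IDIndex, GradeIndex)
```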
11:27.000 --> 11:32.000 So we create our database, we register our table, and we start it.
11:32.000 --> 11:40.000 This Start call starts a background worker, which is important for a few things.
11:40.000 --> 11:47.000 Now let's produce a few apples. So when I produce something, I take a write transaction.
11:47.000 --> 11:51.000 I defer the commit, and I insert my apple.
11:51.000 --> 11:55.000 And if something errors, I abort my transaction.
11:55.000 --> 11:59.000 And you can imagine that you can do this with multiple objects.
11:59.000 --> 12:05.000 I didn't use the first and second values it returns here, but it actually returns the old object.
12:05.000 --> 12:10.000 So if there was already an apple with the same key in there, it would return me the old apple that it replaced.
12:10.000 --> 12:13.000 And the boolean tells me whether it did replace something or not.
12:13.000 --> 12:18.000 Basically an insert-or-update.
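Continuing the sketch, with the same caveat that the construction and transaction calls (New, RegisterTable, Start, WriteTxn, Insert, Commit, Abort) are approximations of what the talk describes rather than guaranteed signatures:

```go
package fruit

import (
	"log"

	"github.com/cilium/statedb" // assumed import path
)

// NewFruitDB creates the database, registers the apples table and
// starts the background worker mentioned in the talk.
func NewFruitDB() (*statedb.DB, error) {
	db := statedb.New()
	if err := db.RegisterTable(Apples); err != nil {
		return nil, err
	}
	db.Start() // background work, e.g. cleaning up deleted objects
	return db, nil
}

// Produce inserts a few apples in one write transaction that locks
// only the apples table.
func Produce(db *statedb.DB) error {
	wtxn := db.WriteTxn(Apples)
	for _, a := range []Apple{
		{ID: AppleID{Orchard: 3, Tree: 5, Number: 1}, Grade: GradeA},
		{ID: AppleID{Orchard: 3, Tree: 5, Number: 2}, Grade: GradeB},
	} {
		// Insert is an insert-or-update: it returns the object it
		// replaced (if any) and whether it replaced something.
		old, replaced, err := Apples.Insert(wtxn, a)
		if err != nil {
			wtxn.Abort() // throw the whole transaction away
			return err
		}
		if replaced {
			log.Printf("replaced %v", old)
		}
	}
	wtxn.Commit() // publish the new version to readers
	return nil
}
```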
12:18.000 --> 12:26.000 Getting it back is very simple. We do a Get with a read transaction.
12:26.000 --> 12:34.000 We create a query from this ID, and we get our apple back.
12:34.000 --> 12:41.000 To read the whole table, we use a different call: we tell the table, hey, we want to get all and watch.
12:41.000 --> 12:48.000 Every API call, or most API calls, have this watch variant. This returns an iterator.
12:48.000 --> 12:55.000 These are the new Go iterator types, and it just returns me an iterator of apples.
12:55.000 --> 13:01.000 And the watch is a channel, a channel of empty structs.
13:01.000 --> 13:09.000 And it gets closed by the database when, in this case, because we query the whole table, anything happens to the table.
13:09.000 --> 13:17.000 It closes the channel and notifies us: hey, there's something new you can observe.
13:17.000 --> 13:26.000 We can also do a prefix watch. If you notice, here I have the orchard, the tree and a number; here I only have the orchard and the tree.
13:26.000 --> 13:33.000 So this is only the first part of my key, and I can basically say: give me every apple from this orchard and this tree.
13:33.000 --> 13:36.000 And it will give me everything with that prefix in the key.
13:36.000 --> 13:44.000 And you can use this for IP addresses, strings, stuff like that.
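A sketch of that read side, again with approximate names and signatures; the prefix watch just shown follows the same pattern, querying with only the leading part of the key:

```go
package fruit

import (
	"context"
	"fmt"

	"github.com/cilium/statedb" // assumed import path
)

// ReadExamples shows a point lookup, a full-table read with a watch
// channel, and waiting for the table to change.
func ReadExamples(ctx context.Context, db *statedb.DB) {
	txn := db.ReadTxn() // a consistent snapshot; never blocks writers

	// Point lookup via the primary index.
	id := AppleID{Orchard: 3, Tree: 5, Number: 1}
	if apple, _, found := Apples.Get(txn, IDIndex.Query(id)); found {
		fmt.Println("found:", apple)
	}

	// Read the whole table. The returned watch channel is closed by
	// the database as soon as anything in this table changes.
	apples, watch := Apples.AllWatch(txn)
	for apple := range apples {
		fmt.Println(apple)
	}

	// Block until the table changes (or we are cancelled); a caller
	// would then take a fresh read transaction and query again.
	select {
	case <-watch:
	case <-ctx.Done():
	}
}
```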
13:44.000 --> 13:52.000 So what is a practical example of this? This is, sort of, how we use it. We have a bunch of resources at the top.
13:52.000 --> 13:55.000 They live in the Kubernetes API server.
13:55.000 --> 14:01.000 We have a gRPC watch that goes into a component called a reflector.
14:01.000 --> 14:04.000 That's not part of the library, but something we built in Cilium.
14:04.000 --> 14:08.000 They go into tables, and these tables can then be observed.
14:08.000 --> 14:14.000 And the cool thing about this infrastructure is that you have the tables as an abstraction layer between all of your components.
14:14.000 --> 14:24.000 Meaning that if I wanted to test my load-balancer component, I can, in my test, leave out the reflectors and the maps.
14:24.000 --> 14:27.000 And I can just create a bunch of information in my tables.
14:27.000 --> 14:38.000 That is my input, and then I can assert that, after it has reconciled, it produced exactly the output in my lower tables that I expected.
14:38.000 --> 14:41.000 So I can test my components in isolation.
14:41.000 --> 14:49.000 It's also extensible. We've got a few arrows going off to other components that need to use the same information.
14:49.000 --> 14:54.000 And if some component needs to index it differently, then we just add a new index to the existing table.
14:54.000 --> 14:58.000 And we don't have to create a completely separate data store.
14:58.000 --> 15:06.000 So how does this work internally? There's actually a really cool data structure behind it, called an immutable adaptive radix tree.
15:06.000 --> 15:09.000 To explain it, we'll first start with the radix tree.
15:09.000 --> 15:16.000 For those who don't know: I added six objects in here, and this is, in effect, what a radix tree does with them.
15:16.000 --> 15:32.000 It's essentially the idea of a binary tree, where you sort your keys, but instead of having two children, you can have 256 children.
15:32.000 --> 15:38.000 So it's basically a tree where your nodes can get very, very big.
15:38.000 --> 15:48.000 The yellow ones are my nodes. So this is what it would look like: everything with a common prefix shares nodes in this way.
15:48.000 --> 15:53.000 The adaptive part has to do with how the data gets represented.
15:53.000 --> 16:03.000 A normal radix tree would be the picture above, where you would use 256 pointers for every node.
16:03.000 --> 16:07.000 Every time you get a byte, you can basically look it up in an array and get your next node.
16:07.000 --> 16:11.000 But 256 is very big if you have a sparse set.
16:11.000 --> 16:16.000 So what the adaptive radix tree does is pick smaller nodes if you have less data in there.
16:16.000 --> 16:21.000 This is essentially an implementation detail. There are four different node sizes.
16:21.000 --> 16:28.000 The small node gets represented like this: you just have an array of your keys and your pointers.
16:28.000 --> 16:34.000 And you simply loop over them to find the next one, because that's just fast on modern CPUs.
16:34.000 --> 16:43.000 And only when you actually have very wide trees do you scale up internally to use bigger nodes.
16:43.000 --> 16:48.000 And this is both faster in lookups and better in memory use.
16:48.000 --> 16:52.000 The immutable part is the thing that makes the database actually work.
16:52.000 --> 16:57.000 So if we go back to our radix tree: what if you want to update this?
16:57.000 --> 17:01.000 Let's say I'm going to add "testing" to this.
17:01.000 --> 17:08.000 The steps for an immutable radix tree would be: we cannot change the existing nodes.
17:08.000 --> 17:14.000 We create a new root. We know that "t" is a shared node.
17:14.000 --> 17:21.000 So we basically take the existing "t" and copy it, and we also copy the existing links it has.
17:21.000 --> 17:28.000 We know that "est" is a node; it's both a leaf node, so it has a value in it, and it also goes further.
17:28.000 --> 17:31.000 So we copy it, and then we put our "ing" under it.
17:31.000 --> 17:35.000 So now we have two copies of this bit.
17:36.000 --> 17:48.000 And if you look at this from the new root, I can still access all of the same nodes that I could from the old one, except it also has "testing".
17:48.000 --> 17:57.000 So you basically create a sort of diff: compared to just modifying the tree in place, I only add the difference.
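StateDB's indexes are adaptive radix trees; as a much smaller illustration of the same path-copying idea, here is a sketch using a plain binary search tree (my own example, not the actual implementation):

```go
package pathcopy

// node is an immutable binary-search-tree node: once created it is never
// modified, so existing readers can keep traversing old versions safely.
type node struct {
	key         string
	value       any
	left, right *node
}

// insert returns the root of a NEW tree that contains (key, value).
// Only the nodes on the path from the root to the insertion point are
// copied; everything else is shared with the old tree (the "diff").
func insert(n *node, key string, value any) *node {
	if n == nil {
		return &node{key: key, value: value}
	}
	cp := *n // copy this node, leave the original untouched
	switch {
	case key < n.key:
		cp.left = insert(n.left, key, value)
	case key > n.key:
		cp.right = insert(n.right, key, value)
	default:
		cp.value = value // update in the copy only
	}
	return &cp
}

// Example: old and updated roots coexist; readers holding `old` still
// see the tree without "testing", readers of `updated` see it with.
//
//	old := insert(insert(nil, "test", 1), "toast", 2)
//	updated := insert(old, "testing", 3)
```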
17:57.000 --> 18:04.000 With our immutable radix tree, the way this would work is: we have a consumer that says, give me the root node.
18:04.000 --> 18:06.000 So it gets the old one.
18:06.000 --> 18:11.000 Now the update happens, so these green nodes get added.
18:11.000 --> 18:16.000 We switch the pointer over, so it's a sort of RCU (read-copy-update) change.
18:16.000 --> 18:21.000 And then a new consumer comes along and says, give me the root of the tree.
18:21.000 --> 18:23.000 It gets the new one.
18:23.000 --> 18:27.000 And this is how our versioning works, at a basic level.
18:27.000 --> 18:33.000 So the new consumer sees the new version, and the old consumer doesn't see the new changes.
18:33.000 --> 18:36.000 And they can both independently operate on this.
18:36.000 --> 18:42.000 And the moment consumer one goes away, there are no longer any pointers to this root.
18:42.000 --> 18:48.000 So it actually gets garbage collected, by the Go runtime itself; it just automatically goes away.
18:48.000 --> 18:55.000 And from then on, it just seems as if the tree always looked like this.
18:55.000 --> 19:00.000 So we have a sort of generational radix tree that keeps going and going.
19:00.000 --> 19:08.000 And when you want to use this in a database, we extend this out.
19:08.000 --> 19:14.000 So you start with a database root, and the root has a bunch of tables.
19:14.000 --> 19:20.000 Every table has multiple indexes, and every index is actually one of these immutable radix trees.
19:20.000 --> 19:26.000 This upper part is not itself a radix tree, but it uses the same principle.
19:26.000 --> 19:38.000 So every time we update a table or an index, we make a copy of the header of that index, of that table; we do a sort of generational copy.
19:38.000 --> 19:42.000 And then we just switch the root database pointer.
19:42.000 --> 19:49.000 The way queries work: because we have this tree, we can exploit its properties for the queries.
19:49.000 --> 19:59.000 If we want to query a very specific object, because we know the keys are sorted from less to more,
19:59.000 --> 20:04.000 we can basically do a sort of binary search down the tree.
20:04.000 --> 20:12.000 But we can also do a prefix... sorry, this is a lower-bound search.
20:12.000 --> 20:18.000 So you basically say: I want this key and everything that is greater than this key.
20:18.000 --> 20:28.000 We basically find the split point, and from that point on you can traverse the tree from left to right, all the way until you have found all of the leaves.
20:28.000 --> 20:35.000 You can also do a prefix search. You say, okay, this is my prefix, and then you traverse that whole subtree.
20:35.000 --> 20:39.000 So this way, we can do a bunch of these queries.
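The root-pointer switch described here can be sketched on top of the path-copying `insert` from the previous example, using an atomic pointer. This is illustrative only; StateDB serializes writers with its table locks rather than the retry loop shown here:

```go
package pathcopy

import "sync/atomic"

// Tree publishes an immutable root; readers grab a snapshot, writers
// build a new version and atomically switch the pointer (RCU style).
type Tree struct {
	root atomic.Pointer[node]
}

// Snapshot gives a reader a consistent view; the nodes it references
// are never modified, so it stays valid even while writers commit.
func (t *Tree) Snapshot() *node {
	return t.root.Load()
}

// Insert builds a new version via path copying and publishes it.
// Old snapshots keep working; once the last reader drops its root,
// the unreferenced nodes are reclaimed by the Go garbage collector.
func (t *Tree) Insert(key string, value any) {
	for {
		old := t.root.Load()
		next := insert(old, key, value)
		if t.root.CompareAndSwap(old, next) {
			return
		}
		// Another writer got in between; retry on top of its version.
	}
}
```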
20:39.000 --> 20:42.000 Secondary indexes work like this.
20:42.000 --> 20:51.000 I just made a primary index here, a primary index with one-byte keys for now.
20:51.000 --> 20:58.000 The keys point to objects, and this is a unique index, so every key points to exactly one object.
20:58.000 --> 21:07.000 Now, our objects in the example below have this Foo property, which is a slice, so we store a slice in there.
21:07.000 --> 21:10.000 And we can actually index on that slice.
21:10.000 --> 21:18.000 So we create a Foo index. We take that value and we index it, so we get "bar" or "baz".
21:18.000 --> 21:31.000 And then every "bar" or "baz" points to one or more of the actual objects.
21:31.000 --> 21:39.000 But they're still the same objects: whenever we did this copying, we never actually duplicated the objects themselves.
21:39.000 --> 21:42.000 That is just something to remember about this.
21:42.000 --> 21:53.000 Because if you have objects that don't like shallow clones, they aren't ideal for storage in here, because then you get very weird issues.
21:53.000 --> 21:57.000 And I told you about the watches; let me show you where this channel comes from.
21:57.000 --> 22:05.000 The way this works is that every header, so this is the header of every node, never mind which of the four sizes it is,
22:05.000 --> 22:10.000 has some flags, and it has the prefix, so we can compress a prefix.
22:10.000 --> 22:16.000 So we don't just have one byte per node, but we can have path compression.
22:16.000 --> 22:18.000 And it has this watch channel.
22:18.000 --> 22:30.000 And what happens is: whenever we do the cloning, we have a transaction, and our transaction keeps track of which nodes were replaced by a new version.
22:30.000 --> 22:39.000 And whenever we commit that transaction, we basically just loop over that list of everything that changed, and we close the watch channels.
22:39.000 --> 22:49.000 And because we do prefix searches, or a search for a specific key, we can just pick one of the nodes and return that watch channel.
22:49.000 --> 22:57.000 And when anything changes that is relevant to your query, that is the way you get notified.
22:57.000 --> 23:03.000 So, transactions. I talked about transactions, about commits and aborts.
23:03.000 --> 23:11.000 On abort, because we have this generational copying: whenever you make a copy
23:11.000 --> 23:16.000 and you decide, oh, I don't want to use this copy, you can just remove all references to the copy
23:16.000 --> 23:19.000 and let it be collected by the garbage collector.
23:19.000 --> 23:30.000 There's no additional work in doing an abort: whenever you say, I don't want this transaction, it just gets collected.
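At its core the notification relies on a standard Go idiom: closing a channel wakes every current waiter at once. A small sketch of that idiom (not the actual node-header layout):

```go
package notify

import "sync"

// watchable hands out a channel that is closed on the next change.
// Each change swaps in a fresh channel for future watchers, which is
// roughly what happens per node header at commit time.
type watchable struct {
	mu sync.Mutex
	ch chan struct{}
}

func newWatchable() *watchable {
	return &watchable{ch: make(chan struct{})}
}

// Watch returns a channel that is closed when the next change commits.
func (w *watchable) Watch() <-chan struct{} {
	w.mu.Lock()
	defer w.mu.Unlock()
	return w.ch
}

// NotifyChanged is called at commit time for every replaced node:
// closing the old channel wakes all watchers at once.
func (w *watchable) NotifyChanged() {
	w.mu.Lock()
	defer w.mu.Unlock()
	close(w.ch)
	w.ch = make(chan struct{})
}
```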
23:30.000 --> 23:40.000 And now the interesting bit: I talked in the beginning about deadlocks. We use something called a sortable mutex.
23:40.000 --> 23:43.000 And I'm going to explain it with a sort of mini animation.
23:43.000 --> 23:52.000 Let's say we have five tables, and I have some write transaction that wants to lock tables two, four, and five.
23:52.000 --> 23:55.000 That's possible: they are available, so they get taken.
23:55.000 --> 24:01.000 And the rule with sortable mutexes is that you always take them from left to right.
24:01.000 --> 24:05.000 So you first take two, then four, then five.
24:05.000 --> 24:12.000 Now let's say a second transaction comes along that wants to lock three tables.
24:12.000 --> 24:15.000 It can lock table one, that's no problem.
24:15.000 --> 24:20.000 But two is already taken, so it will block.
24:20.000 --> 24:30.000 But because we always go from left to right, and the first transaction unlocks in the same way, this will eventually be okay.
24:30.000 --> 24:38.000 Let's say another one comes along: table three is still available, so it can be taken, and that thread just goes on.
24:38.000 --> 24:45.000 So only the middle transaction is waiting until it can access its tables.
24:46.000 --> 24:51.000 Our first one goes away, and then we can start locking table two.
24:51.000 --> 24:55.000 We might still have to wait, but we make progress.
24:55.000 --> 25:01.000 So every write transaction will always progress, and it also ensures a level of fairness.
25:01.000 --> 25:04.000 It's not perfect, scheduling-wise.
25:04.000 --> 25:16.000 But because you always lock and unlock in this ordering, you can basically guarantee that you're never going in the opposite direction.
25:16.000 --> 25:20.000 You never have two parties waiting for each other's held locks.
25:20.000 --> 25:23.000 And that solves our deadlock issue.
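A minimal sketch of that idea (not StateDB's actual implementation): give every table lock a fixed position and always acquire any set of them in ascending order, so no two writers can ever wait on each other in a cycle.

```go
package sortedlock

import (
	"sort"
	"sync"
)

// TableLock is a mutex with a fixed position, e.g. table registration order.
type TableLock struct {
	pos int
	mu  sync.Mutex
}

// LockAll acquires an arbitrary set of table locks, but always in
// ascending position order ("left to right"). Because every writer
// follows the same global order, a writer can only ever wait on locks
// held by writers that are strictly ahead of it, so no cycle of
// waiters, and therefore no deadlock, can form.
func LockAll(locks ...*TableLock) {
	sort.Slice(locks, func(i, j int) bool { return locks[i].pos < locks[j].pos })
	for _, l := range locks {
		l.mu.Lock()
	}
}

// UnlockAll releases the locks; release order does not matter for safety.
func UnlockAll(locks ...*TableLock) {
	for _, l := range locks {
		l.mu.Unlock()
	}
}
```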
25:23.000 --> 25:32.000 The last thing I wanted to show is the change tracking, which I find one of the best features in here.
25:32.000 --> 25:41.000 What you can do is say: I have a table, and I want to watch all of the changes that happened since the last time I checked.
25:41.000 --> 25:47.000 So you can basically do one big query, process everything, and from then on only process the differences.
25:47.000 --> 25:54.000 You need to take a write transaction to create this change iterator.
25:54.000 --> 26:00.000 And that's because it does a few things in the background which need a lock on the table.
26:00.000 --> 26:09.000 But after you've done that, we take a write... sorry, a read transaction.
26:09.000 --> 26:17.000 And with the read transaction, we just say: okay, I'm at this point.
26:17.000 --> 26:20.000 What happened between the last time I checked and this new point?
26:20.000 --> 26:22.000 And you can iterate over all of the changes.
26:22.000 --> 26:31.000 And it tells you: was it deleted, and what was the object that was there; or what is the new object that has been inserted or updated?
26:31.000 --> 26:38.000 To do this, we have a hidden index called the revision index, which always exists.
26:38.000 --> 26:47.000 Every transaction gets an incrementing number, and all changes made in that transaction are tracked under it.
26:47.000 --> 26:57.000 So we can do this lower-bound search on the revision index, where you say: give me everything that is above this number,
26:57.000 --> 27:02.000 to basically give us a sort of history from a certain revision.
27:02.000 --> 27:07.000 This works for new objects, but not for the ones that are deleted.
27:07.000 --> 27:15.000 So whenever you are watching changes on a table, we mark that.
27:15.000 --> 27:19.000 Whenever something then gets deleted, it goes into a sort of soft-delete state:
27:19.000 --> 27:25.000 the graveyard index, which is another primary index, but for all objects that have been deleted.
27:25.000 --> 27:35.000 And once everyone has observed that something was deleted, the background worker that we started with db.Start will remove these old ones.
27:36.000 --> 27:44.000 So it allows you to temporarily keep the deleted objects until everyone that's interested in deletions has observed them,
27:44.000 --> 27:47.000 and then they will automatically go away.
27:47.000 --> 27:52.000 So that was all I have time for; I have so much more I could show you.
27:52.000 --> 27:55.000 Please have a look at StateDB.
27:55.000 --> 28:04.000 And I would like to add: I am here speaking, but actually my colleague Jussi did most of the legwork on this.
28:04.000 --> 28:10.000 I was just his professional user, so I had some influence on the design,
28:10.000 --> 28:16.000 and I work on it a bit, but Jussi is the real hero here.
28:16.000 --> 28:20.000 Questions? Of course.
28:20.000 --> 28:26.000 I don't know if we have time for any questions, but I'm ready. We don't have time for questions.
28:26.000 --> 28:28.000 If you have questions, Dylan is here.
28:28.000 --> 28:32.000 Okay, just don't bother him during the busiest moments of today. Thank you.