WEBVTT 00:00.000 --> 00:13.000 It's a super fast filtering and aggregation operators to JSON pass. 00:13.000 --> 00:18.680 So that's very tempting to do, but it comes with three challenges, which I like to explain. 00:18.680 --> 00:23.320 Challenge one is when documents share no common pass. 00:23.320 --> 00:28.120 So let's look at the orange, the red, and the green box. 00:28.120 --> 00:31.320 These are three JSON documents, but they shall not come in pass, right? 00:31.320 --> 00:36.720 So we don't have two or three pass, we have seven pass in total. 00:36.720 --> 00:42.920 There's a general risk that the number of pass explodes and we'll get an avalanche of columns. 00:42.920 --> 00:49.720 Challenge two is when we actually have shared pass, but they do have different types. 00:49.720 --> 00:56.320 So if you look at the green box and the blue box, these two documents have the same structure. 00:56.320 --> 01:02.120 But the pass heads go, and one document is an integer, and the other document is a string. 01:02.120 --> 01:08.120 And that's a problem, because in relational databases, every column needs to be of exactly one type. 01:08.120 --> 01:13.520 And finally, challenge three is that the pass can have very different frequencies across JSON documents. 01:13.520 --> 01:19.320 So imagine a pass that occurs in just one document. 01:19.320 --> 01:24.720 Its column will be virtually empty except for that single one value, right? 01:24.720 --> 01:33.520 We call this pass column, and that's not really great for storage, because it contains literally your hot air. 01:33.520 --> 01:42.720 Okay, now to solve these three challenges, we designed two new data types in Clickhouse as building blocks. 01:42.720 --> 01:49.320 And eventually we added a dedicated JSON type on top of these two building blocks. 01:49.320 --> 01:57.920 Both building blocks can be used standalone and outside of a JSON context, and that would be a totally valid use case. 01:57.920 --> 02:02.120 First building block here is the variant data type. 02:02.120 --> 02:06.320 You can imagine variant as a union of multiple sub types. 02:06.320 --> 02:12.920 So basically, every value has one of the sub types, or it is null. 02:12.920 --> 02:14.320 How does it work? 02:14.320 --> 02:19.120 Let's start with the yellow example column on the left side. 02:19.120 --> 02:23.120 You can see it contains various values of different types. 02:23.120 --> 02:28.120 For example, integer strings, areas, there's also a null there. 02:28.120 --> 02:35.120 And the type of the column is variant of in 64 string and area of in 64. 02:35.120 --> 02:44.120 That's a little bit hard to read. It's in light gray on top of the yellow box in the middle. 02:44.120 --> 02:54.120 Okay, now the big black box on the right shows how their variant column is internally represented. 02:54.120 --> 02:58.120 And there are really two important data structures here. 02:58.120 --> 03:03.120 The first one is the one in light blue, and we call it discriminator column. 03:03.120 --> 03:07.120 It encodes for every value the value is type. 03:07.120 --> 03:13.120 So in the example of one in the first row of the discriminator column, means that the value is in integer. 03:14.120 --> 03:20.120 Two in the second row means it's a string, and zero in the third row means it's an area. 03:20.120 --> 03:26.120 Now you may ask, how do the sub types map to these type integers? 03:26.120 --> 03:29.120 And the mapping is actually implicit. 03:29.120 --> 03:35.120 So what happens is that we sort all the names to use a specified in the variant type. 03:35.120 --> 03:42.120 We sort them, and then the relative position, that's the type integer. 03:43.120 --> 03:55.120 The other structure here is an offset column, and that's the light gray box right next to the blue box. 03:55.120 --> 04:02.120 So this one represents for every value it's position in a sub type column. 04:02.120 --> 04:10.120 So if you look at the bottom right, you see three yellow boxes, and there's a box for every sub type of the variant, 04:10.120 --> 04:15.120 and every box stores the values of a sub type consecutively. 04:15.120 --> 04:21.120 So for example, the leftmost yellow box on the bottom right stores all the integers. 04:21.120 --> 04:24.120 And that's 42, 43, and 44. 04:24.120 --> 04:27.120 And it stores them one after another. 04:27.120 --> 04:33.120 Now the offset column is used to reconstruct the original positions of the sub type values. 04:33.120 --> 04:39.120 And it stores for every value in the original column, the position of the value in its sub type column. 04:39.120 --> 04:44.120 It's not a little complicated, but long story short, the variant data type in click house. 04:44.120 --> 04:48.120 Sort the second and the third challenge that we had on the previous slide. 04:48.120 --> 04:54.120 The second challenge, if you remember, was that JSON pass may contain different types. 04:54.120 --> 05:01.120 And the variant columns in the variant columns you can have values of different types in the same column. 05:01.120 --> 05:06.120 And finally the third challenge was that was about sparse columns. 05:06.120 --> 05:13.120 The variant column, the data is always stored in a dense fashion. 05:13.120 --> 05:21.120 Okay, now the second building block is the dynamic data type, and again it can be used standalone outside JSON context. 05:21.120 --> 05:26.120 The dynamic data type is exactly the same as the variant data type. 05:26.120 --> 05:33.120 The difference is that it does not require you to specify all the possible sub types upfront. 05:33.120 --> 05:39.120 The dynamic type can store whatever data type you throw at it at runtime. 05:39.120 --> 05:43.120 And if you look at the yellow box again, that's the original column. 05:43.120 --> 05:49.120 You see that it has now a data type dynamic, but like the sub types are gone. 05:49.120 --> 05:52.120 They are not specified explicitly. 05:52.120 --> 06:00.120 Because we no longer specify the possible sub types in the type itself, we still need to remember them somehow. 06:00.120 --> 06:04.120 And that is done by an additional file, which is called dynamic structure bin. 06:04.120 --> 06:07.120 And that's highlighted in red on the right. 06:07.120 --> 06:13.120 It contains the contained sub types, and it stores some additional statistics. 06:13.120 --> 06:19.120 So in this example, we store an area of N64, N64, and string a sub types. 06:19.120 --> 06:26.120 And besides that, as I mentioned, there's some more statistics which are needed to optimize the access later on. 06:27.120 --> 06:35.120 Now, remember there was this first challenge, right, where which was about exploding paths and column counts. 06:35.120 --> 06:37.120 And we will now address this. 06:37.120 --> 06:46.120 And the way that it's done is that we allow to specify an optional upper limit on the number of sub types in a dynamic column. 06:46.120 --> 06:54.120 If more types are, if more types than the limit are added, they are not rejected, 06:54.120 --> 07:02.120 but they are stored in a fallback column, which is going to be a little bit slower to access. 07:02.120 --> 07:09.120 So to be more precise, all values of the types which exceed the limit are stored together in a single column. 07:09.120 --> 07:17.120 And we use a very simple pair of data type and value to represent them. 07:18.120 --> 07:25.120 And this is how it looks like. Let's make an example in the yellow box on the right. You see that we used a dynamic data type. 07:25.120 --> 07:36.120 But this time, we also specify the parameter max types S3, which means that at most three types should be stored in dedicated columns. 07:36.120 --> 07:43.120 And the values of the force, the fifth, the sixth, and so on, types should go into a fallback column. 07:44.120 --> 07:53.120 Unfortunately, this column contains more than three types. So all the excess types are stored in the green box on the bottom right. 07:53.120 --> 07:58.120 Each value in the green box is a pair of data type and value. 07:58.120 --> 08:10.120 So in the example, the first value there is a pair of data and 2020-11 followed by a pair of pool and falls and so on. 08:10.120 --> 08:18.120 Now that I introduced the dynamic type and the variant type, I can finally present the JSON data type. 08:18.120 --> 08:26.120 Generally speaking, each pass across all the stored JSON documents go into a dynamic column. 08:26.120 --> 08:32.120 And this is how you formally declare the JSON type in Clickhouse for your data. 08:33.120 --> 08:42.120 There are five parameters in green, max dynamic pass. That specifies the maximum number of pass, which are stored in sub columns. 08:42.120 --> 08:47.120 In the interest of time, I will not explain this mechanism detail. 08:47.120 --> 08:56.120 And then there is in brown, max dynamic types, which specifies the number of types for a single pass, which are stored as sub columns. 08:56.120 --> 09:04.120 If the limit is exceeded, new types are stored together in a fallback column, and that's what you saw on the previous slide. 09:04.120 --> 09:14.120 And then in blue, there is some pass type name and that states which pass should never be stored or should not be stored in a dynamic column. 09:14.120 --> 09:17.120 And there is instead stored as regular columns. 09:18.120 --> 09:27.120 That's quite nice. That's, again, like a special case, and it gives you performance guarantees for certain paths that you want to access very quickly. 09:27.120 --> 09:30.120 Like that, that's without compression, more or less. 09:30.120 --> 09:41.120 And then in purple, skip and skip records, these refer to JSON paths, which should be completely ignored on other words, they will never be stored at all. 09:41.120 --> 09:45.120 Okay, now there's this intent example for the JSON data type. 09:45.120 --> 09:53.120 I will probably not explain it in detail, because the font is quite small, and you will throw things at me if I go too much into the data. 09:53.120 --> 10:10.120 The just of this figure is that there are six different JSON documents, and five pass, and we store every pass by default in a dynamic column, or in a dedicated column, if the user requested that via a parameter in the JSON type declaration. 10:11.120 --> 10:21.120 If you'd like to know more, you can see the nice bookpost about JSON support and kickhouse that we published recently, it's shown at the bottom of the slide. 10:21.120 --> 10:27.120 The last thing I like to show is a benchmark for JSON documents, which we did recently. 10:27.120 --> 10:39.120 This benchmark compares native JSON support in analytics databases, and it runs queries over a data set with one billion blue sky events modeled as JSON documents. 10:39.120 --> 10:50.120 It's total compressed size is 125 gigabyte. Everyone's free to grab the data from our public as free bucket and reproduce the measurements. 10:50.120 --> 11:06.120 So we compared the query performance of the new JSON type in clickhouse versus elastic search, ductDB, MongoDB and Postgres, and you can see clickhouse runs the queries by far as the fastest database ahead of all other systems. 11:07.120 --> 11:13.120 Okay, to sum up, clickhouse now speaks JSON, the implementation is easy to use, it's flexible, it's extremely fast. 11:13.120 --> 11:20.120 Chasing supports currently better and we plan to make it GA during 2025. 11:21.120 --> 11:30.120 And now really the last thing we're going to have a dinner, so a clickhouse dinner tonight with free drinks and free food. 11:30.120 --> 11:37.120 And you're invited to head over and we can talk about any topics about clickhouse database in general. 11:37.120 --> 11:41.120 I would be very happy to meet you there. Thank you. 11:51.120 --> 11:54.120 Any questions? 11:54.120 --> 11:56.120 Please. 11:56.120 --> 12:04.120 Yeah, so the question was about data compression and elastic search. 12:04.120 --> 12:23.120 To be honest, I don't know exactly like from the top of my mind. 12:23.120 --> 12:34.120 Like the exact steps are available in the GitHub repository, you can check it out and see what settings specifically used. 12:34.120 --> 12:36.120 Please. 12:53.120 --> 13:08.120 That's a good question. 13:08.120 --> 13:16.120 I guess I guess there is not as far as I know there are not too many benchmarks about JSON, 13:16.120 --> 13:24.120 it's important data basis for now and it's like still like an emerging thing. 13:24.120 --> 13:29.120 Yeah, but of course every benchmark has weaknesses. 13:29.120 --> 13:36.120 It's always informed by the launch of the benchmark in a balance, right? 13:36.120 --> 13:37.120 Okay, thank you. 13:37.120 --> 13:38.120 Still 30 of you, thank you. 13:38.120 --> 13:39.120 I'm on this. 13:53.120 --> 13:55.120 Yeah, I guess here. 13:59.120 --> 14:02.120 Thank you. 14:13.120 --> 14:15.120 Lovely. Thank you. 14:15.120 --> 14:17.120 Yeah. 14:17.120 --> 14:20.120 I could be. 14:20.120 --> 14:24.120 I could be. 14:24.120 --> 14:29.120 If any of you have a lot of columns of files, they've been at a lot of the properties. 14:29.120 --> 14:30.120 We've done a few things. 14:30.120 --> 14:31.120 Oh, yeah. 14:31.120 --> 14:32.120 Where are you? 14:32.120 --> 14:33.120 You need to pass them. 14:33.120 --> 14:35.120 They only have a single one. 14:35.120 --> 14:37.120 So it's possible to implement. 14:37.120 --> 14:38.120 Right? 14:38.120 --> 14:44.120 And with this dense solution, you mentioned that now is possible from the site. 14:44.120 --> 14:46.120 Would you just reach out and quiet? 14:46.120 --> 14:47.120 Yeah. 14:47.120 --> 14:49.120 You don't have to solve this at your Nicole. 14:49.120 --> 14:51.120 It's very interesting. 14:51.120 --> 14:53.120 And I didn't touch on that because I was telling you all the time. 14:53.120 --> 14:54.120 Okay. 14:54.120 --> 14:55.120 Okay. 14:55.120 --> 14:56.120 Okay. 14:56.120 --> 15:03.120 That was the first parameter in the chase in the declaration that I said, 15:03.120 --> 15:04.120 No, I marked it. 15:04.120 --> 15:05.120 Yeah, yes. 15:05.120 --> 15:07.120 Yes, another for that mechanism. 15:07.120 --> 15:08.120 Okay. 15:08.120 --> 15:14.120 Which, like, compresses, type of files which go beyond this limit. 15:14.120 --> 15:17.120 And they are stored consecutively in the same place. 15:17.120 --> 15:18.120 They're empty. 15:18.120 --> 15:21.120 But it's, but it's all described in the blog. 15:22.120 --> 15:26.120 They have to somehow, in the sense that if I have, I don't know, 15:26.120 --> 15:30.120 I don't know, one hundred parts, like a file and a column. 15:30.120 --> 15:33.120 And then a lot of parts, but then you need. 15:33.120 --> 15:35.120 Do you have any guarantees, though? 15:35.120 --> 15:37.120 At least they have spot you in the course storage. 15:37.120 --> 15:38.120 Yeah, yes. 15:38.120 --> 15:39.120 Yeah. 15:39.120 --> 15:43.120 So yes, it's, like, an upper limit on the number of files. 15:43.120 --> 15:45.120 And then an upper limit on the number of files. 15:45.120 --> 15:47.120 And it's the, but it's the same. 15:47.120 --> 15:48.120 Yeah. 15:48.120 --> 15:49.120 Yeah. 15:49.120 --> 15:50.120 Yeah. 15:50.120 --> 15:51.120 Yeah. 15:51.120 --> 15:52.120 Yeah. 15:52.120 --> 15:53.120 Okay. 15:53.120 --> 15:55.120 So I might have been like it, but I have a, yeah. 15:55.120 --> 15:56.120 Excuse me. 15:56.120 --> 15:57.120 Yes. 15:57.120 --> 15:58.120 Okay. 15:58.120 --> 15:59.120 Thank you. 15:59.120 --> 16:04.120 And this event after the post office is going to be just a form of thing. 16:04.120 --> 16:05.120 Very important. 16:05.120 --> 16:07.120 Gonna have to be here. 16:07.120 --> 16:08.120 Okay. 16:08.120 --> 16:09.120 Thank you. 16:09.120 --> 16:10.120 Yeah. 16:10.120 --> 16:11.120 And yes. 16:11.120 --> 16:14.120 I put that on the backstage room. 16:14.120 --> 16:15.120 Yeah. 16:15.120 --> 16:16.120 Yeah. 16:16.120 --> 16:17.120 Yeah. 16:17.120 --> 16:18.120 Yeah. 16:18.120 --> 16:19.120 Yeah. 16:19.120 --> 16:20.120 Yeah. 16:20.120 --> 16:21.120 Yeah. 16:21.120 --> 16:22.120 Yeah. 16:22.120 --> 16:23.120 Yeah. 16:23.120 --> 16:24.120 Yeah. 16:24.120 --> 16:25.120 Yeah. 16:25.120 --> 16:26.120 Yeah. 16:26.120 --> 16:27.120 Yeah. 16:27.120 --> 16:28.120 Yeah. 16:28.120 --> 16:29.120 Yeah. 16:29.120 --> 16:30.120 Yeah. 16:30.120 --> 16:31.120 Yeah. 16:31.120 --> 16:32.120 Yeah. 16:32.120 --> 16:33.120 Yeah. 16:33.120 --> 16:34.120 Yeah. 16:34.120 --> 16:35.120 Yeah. 16:35.120 --> 16:36.120 Yeah. 16:36.120 --> 16:37.120 Yeah. 16:37.120 --> 16:38.120 Yeah. 16:38.120 --> 16:39.120 Yeah. 16:39.120 --> 16:40.120 Yeah. 16:40.120 --> 16:41.120 Yeah. 16:41.120 --> 16:42.120 Yeah. 16:42.120 --> 16:43.120 Yeah. 16:43.120 --> 16:44.120 Yeah. 16:44.120 --> 16:46.120 He's. 16:46.120 --> 16:47.120 Yeah. 16:47.120 --> 16:49.120 He's a big team there too. 16:49.120 --> 16:50.120 No, yeah. 16:50.120 --> 16:51.120 Yeah. 16:51.120 --> 16:52.120 Yeah. 16:52.120 --> 17:06.120 We're going to have a fight. 17:06.120 --> 17:08.120 So we should be our team here. 17:08.120 --> 17:09.120 Wow. 17:09.120 --> 17:10.120 We know. 17:10.120 --> 17:11.120 We do. 17:11.120 --> 17:41.120 I don't know, I don't know, whatever is working here, I don't know, I don't know, whatever is working here, I don't know, I don't know, I don't know, whatever is working here, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't 17:41.120 --> 17:50.280 Well it's for audio, right? 17:50.280 --> 18:05.860 or if audio, right? 18:05.860 --> 18:15.860 Or if you've got audio, right? You can have this on if you want or you can use that. 18:15.860 --> 18:20.860 But this is only for the string. It doesn't work because you have to speak loud. 18:20.860 --> 18:25.860 If it's too loud, maybe the string will be... 18:25.860 --> 18:30.860 So if you think you're going to speak loud, put it on your shirt on your shirt. 18:30.860 --> 18:33.860 So then that will compensate for it. 18:33.860 --> 18:36.860 I can check the notes anyway. 18:36.860 --> 18:40.860 So when you start talking on, you'll make sure the levels are okay. 18:40.860 --> 18:42.860 Not sure if it's fine, but... 18:42.860 --> 18:43.860 Yes, confusing it. 18:43.860 --> 18:45.860 I always find the right. 18:48.860 --> 18:50.860 Is it perfect? Perfect. 18:50.860 --> 18:51.860 Yeah, I think I'm... 18:51.860 --> 18:53.860 Good, good to wear a shirt. 18:53.860 --> 18:54.860 Yeah, that's that's that. 18:54.860 --> 18:55.860 I mean, actually the hoodie is... 18:55.860 --> 18:56.860 This is... 18:56.860 --> 18:57.860 Yeah, that's that. 18:58.860 --> 18:59.860 Yeah. 18:59.860 --> 19:01.860 I just want to go to the market. 19:01.860 --> 19:03.860 If you're looking for something to do, 19:03.860 --> 19:07.860 there's something to do with the price. 19:07.860 --> 19:11.860 Yeah, that's what I do. 19:11.860 --> 19:13.860 I just want to go first. 19:13.860 --> 19:15.860 Yeah, that's what I do. 19:15.860 --> 19:17.860 I don't want to think about this at the end. 19:17.860 --> 19:19.860 Yeah, that's what I do. 19:19.860 --> 19:20.860 Oh, that's what I do. 19:20.860 --> 19:22.860 Oh, that's what I do. 19:22.860 --> 19:23.860 Yeah, that's what I do. 19:23.860 --> 19:24.860 Yeah, that's what I do. 19:24.860 --> 19:26.860 And it's a good set for me. 19:26.860 --> 19:27.860 I don't know what it kind of... 19:27.860 --> 19:29.860 I think country's more traditional... 19:29.860 --> 19:32.860 I think country's more traditional, I think... 19:32.860 --> 19:33.860 Yeah, yeah, yeah. 19:33.860 --> 19:34.860 I'm sure we can go. 19:34.860 --> 19:38.860 Yeah, that's what I do. 19:38.860 --> 19:40.860 Yeah, I'm sure that. 19:40.860 --> 19:41.860 Thank you for that. 19:41.860 --> 19:42.860 Yes, thank you for that. 19:42.860 --> 19:43.860 Thank you. 19:45.860 --> 19:48.860 I think, I don't know about the price. 19:48.860 --> 19:49.860 I think that's what I do. 19:49.860 --> 19:50.860 I think that's what I do. 19:50.860 --> 19:51.860 I want to ask. 19:51.860 --> 19:52.860 Thank you. 19:52.860 --> 19:53.860 It's all right. 19:53.860 --> 19:54.860 Thank you. 20:24.860 --> 20:30.420 And I'm not mistaken. 20:30.420 --> 20:52.960 Can I ask you what... 20:52.960 --> 21:22.960 I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't 21:22.960 --> 21:26.400 Jerry glitter around the Jeremy will give you that, he will give you your five minutes at the beginning, 21:26.400 --> 21:32.560 You see me, you don't trouble because I come in right at the end so I'll be outside making balls 21:32.560 --> 21:35.840 The moment you appear that's your time. You should be literally wrap up 21:36.480 --> 21:40.160 Okay, okay, so if you see me because I will I will switch to 21:41.520 --> 21:46.320 That's okay, but so you'll fit him he'll give you the five minute mark if you go to late and you see me 21:46.320 --> 21:48.320 You know you've got a finish 21:49.200 --> 21:53.200 Okay, okay, yeah, yeah, because I have more content than that 21:53.200 --> 21:56.080 Oh, yeah, absolutely, but I will tell you I will tell you 21:56.080 --> 22:00.560 Yeah, I will start waving. Oh, sorry. Oh, you know what? We're going to start a fling the earlier 22:00.560 --> 22:03.680 Yeah, we can do that if you want to start five minutes early. We can do that as well 22:04.080 --> 22:06.080 Do you give you five minutes more 22:06.720 --> 22:08.720 Okay, yeah 22:09.600 --> 22:11.600 So just just to not 22:12.240 --> 22:16.480 So we'll start at 20 past so that's the six minutes time. We'll start at six minutes 22:17.360 --> 22:19.360 Okay, yeah, I think 22:20.000 --> 22:22.400 Yeah, so if there is no problem with a synchronization 22:22.400 --> 22:25.120 No, because it was the newscales like a lot of people 22:26.320 --> 22:31.280 People yeah people either I'm going to be here already or if I'm going to be female it's like because I'm not at all 22:31.280 --> 22:33.280 I'm going to be female it's like anyway 22:33.840 --> 22:36.880 So this window talk with no other the basis 22:41.520 --> 22:43.520 It's the kind of the time they still see you 22:47.120 --> 22:49.120 Thanks 22:49.120 --> 22:51.280 Senior