WEBVTT
00:00.000 --> 00:12.000 I hope you just moved on.
00:12.000 --> 00:18.000 So many people contributed to this project; the ones on the bottom line are people from Software
00:18.000 --> 00:23.000 Heritage and my students; the three main people you see are the ones who invested
00:23.000 --> 00:27.000 probably the most time in the project.
00:27.000 --> 00:34.000 What we want to do is graph analytics at scale, on huge graphs. Huge means hundreds of billions
00:34.000 --> 00:39.000 of arcs, okay — hundreds of billions of edges, not millions or billions.
00:39.000 --> 00:44.000 Graphs are everywhere, okay; I assume you know what a graph is, or we won't get very far
00:44.000 --> 00:51.000 with this: web snapshots, social networks, biological graphs, you name it. Oh sorry,
00:51.000 --> 00:54.000 wrong key, I'll start over here.
00:55.000 --> 01:03.000 Okay, there is some delay between what I do and what my Mac does.
01:03.000 --> 01:06.000 Okay, sorry.
01:06.000 --> 01:10.000 Okay, standard representations like adjacency
01:10.000 --> 01:14.000 lists will not work if you have hundreds of billions of edges.
01:14.000 --> 01:19.000 And you can use distributed approaches, which usually spend most of the time
01:19.000 --> 01:25.000 moving data around. If you follow Frank McSherry's COST idea — the cost over a single thread:
01:25.000 --> 01:32.000 you take any graph task, you run it on a single thread, and the cost is the number of
01:32.000 --> 01:37.000 MapReduce servers you need to replicate the speed of a laptop.
01:37.000 --> 01:40.000 And the COST of some algorithms is infinite:
01:40.000 --> 01:46.000 no matter how many servers you add to your MapReduce cluster, it's slower than a laptop.
01:46.000 --> 01:49.000 So that doesn't really work.
01:49.000 --> 01:53.000 What we have been doing for some time instead is compression, okay.
01:53.000 --> 01:59.000 So WebGraph is a framework for compressed representations of graphs.
01:59.000 --> 02:03.000 It's a very long-lived project — it's more than 20 years now.
02:03.000 --> 02:09.000 And there are hundreds of publications in major conferences and journals that have
02:09.000 --> 02:12.000 used WebGraph or the data we distribute.
02:12.000 --> 02:13.000 It's all open.
02:13.000 --> 02:18.000 I mean open source, open data: we gather web crawls, we gather data from, you know,
02:18.000 --> 02:23.000 the structure of Twitter or DBLP or Wikipedia, and we make it public.
02:23.000 --> 02:26.000 In 2011 — that was a long time ago,
02:26.000 --> 02:28.000 when you think of it —
02:28.000 --> 02:33.000 there was the news that Facebook had four degrees of separation — wow —
02:33.000 --> 02:38.000 (there is a long delay here) which was actually featured in the New York Times,
02:38.000 --> 02:42.000 and the number was 3.74, not four.
02:42.000 --> 02:46.000 It's a long story, because degrees of separation and distances in a graph are off by one,
02:46.000 --> 02:48.000 and the New York Times got the off-by-one wrong.
02:48.000 --> 02:51.000 Okay, I don't want to talk about it, it's very sad.
02:51.000 --> 02:58.000 But at that time we measured it at Facebook — I mean, people at Facebook did the measurement, with our software.
02:58.000 --> 03:04.000 And at that time Facebook was 700 million nodes and 70 billion links,
03:04.000 --> 03:09.000 and it was represented in just 200 gigabytes.
03:09.000 --> 03:21.000 Like Common Crawl — if you know it, the nonprofit that distributes web crawls — which does all their internal analytics using WebGraph.
03:21.000 --> 03:28.000 So the motivation for this was the Software Heritage history graph.
03:28.000 --> 03:30.000 So what is this?
03:30.000 --> 03:35.000 It's the largest public archive of software and its version control history.
03:35.000 --> 03:39.000 This is a UNESCO-sponsored project to create a library of software —
03:39.000 --> 03:41.000 all the software ever created.
03:41.000 --> 03:45.000 And the data model is a Merkle directed acyclic graph.
03:45.000 --> 03:52.000 So basically you have like a gigantic git repository with all the development of all public code.
03:52.000 --> 03:57.000 Even very old code — I mean, they gather code from a number of sources.
03:57.000 --> 04:02.000 And this is one of the largest graphs of human activity available.
04:02.000 --> 04:13.000 Presently we are at 44 billion nodes and more than 700 billion arcs, which are basically commit dependencies,
04:13.000 --> 04:20.000 and which WebGraph presently represents in 150 gigabytes instead of more than six terabytes,
04:20.000 --> 04:24.000 which makes it possible to do things in memory.
04:24.000 --> 04:36.000 Now, previously Software Heritage was using a pipeline based on our historical Java implementation of WebGraph — 20 years of Java.
04:36.000 --> 04:45.000 Being able to store the graph explicitly means you can do visits, you can run any kind of normal in-memory algorithm you have in mind,
04:45.000 --> 04:50.000 because you don't have to write something specific for a distributed implementation.
04:50.000 --> 04:55.000 But clearly, at this scale, Java was getting in the way.
04:55.000 --> 04:59.000 I was looking for an excuse to learn Rust.
04:59.000 --> 05:02.000 Tommaso was into Rust.
05:02.000 --> 05:09.000 Stefano wanted to do this, so essentially that was the time to move to Rust.
05:09.000 --> 05:12.000 Of course, it's performant and safe.
05:12.000 --> 05:15.000 There are zero-cost abstractions — you know this.
05:16.000 --> 05:25.000 Now, this might look like a minor issue to you, but if you ever tried to manage something with hundreds of billions of items in Java,
05:25.000 --> 05:29.000 having arrays limited to two billion elements is a major problem, okay?
05:29.000 --> 05:33.000 Not even two billion: two billion minus one, in theory.
05:33.000 --> 05:38.000 In reality, two billion minus eight — don't get me into that, it's very sad.
05:38.000 --> 05:44.000 So I shouldn't complain, because I used Java happily for at least a decade.
05:44.000 --> 05:52.000 It was just around 2010 that it became almost impossible to do this properly, sensibly, in Java.
05:52.000 --> 06:00.000 Another thing that might be obvious: lazy, truly lazy iterators.
06:00.000 --> 06:03.000 The whole framework is based on lazy iterators.
06:03.000 --> 06:08.000 If you ever wrote an iterator in Java, you know what I'm talking about: a nightmare.
06:08.000 --> 06:13.000 In Rust, fortunately, we have actual lazy iterators.
06:13.000 --> 06:18.000 And moving to Rust required porting, or rethinking completely, several key ideas,
06:18.000 --> 06:21.000 which resulted in a number of crates.
06:21.000 --> 06:25.000 So what I'm trying to do today is to show you what we did,
06:25.000 --> 06:28.000 and maybe it can be useful for you too,
06:29.000 --> 06:34.000 and a couple of ideas that we came across along the way.
06:34.000 --> 06:40.000 So, ε-serde is an epsilon-copy serialization/deserialization framework,
06:40.000 --> 06:45.000 and I will tell you what that means, because we invented the word for this thing.
06:45.000 --> 06:51.000 mem_dbg is a system for fast memory footprint analysis.
06:51.000 --> 06:54.000 sux is a crate with succinct data structures.
06:54.000 --> 06:57.000 Who knows what a succinct data structure is? Raise your hands.
06:57.000 --> 06:59.000 Nobody.
06:59.000 --> 07:03.000 Seriously. Okay, there will be a two-line introduction.
07:03.000 --> 07:06.000 dsi-bitstream, which is bit streams for instantaneous codes —
07:06.000 --> 07:11.000 instantaneous codes: gamma codes, compression, variable byte.
07:11.000 --> 07:15.000 Okay, now someone knows, I see.
07:15.000 --> 07:18.000 Well, and of course, webgraph.
07:18.000 --> 07:23.000 Let's start with the first idea, which took a lot of time to work out.
07:23.000 --> 07:28.000 So, our purpose — keep it in mind, because that's why it's called epsilon-copy, blah, blah, blah —
07:28.000 --> 07:34.000 the purpose is to memory-map very large data structures directly from disk.
07:34.000 --> 07:38.000 You have data structures that are 300 gigabytes large.
07:38.000 --> 07:42.000 Loading them into memory takes half an hour — literally, half an hour.
07:42.000 --> 07:44.000 What can we do to have fast access?
07:44.000 --> 07:48.000 I mean, the thing they do at Facebook, Google — name the data center —
07:48.000 --> 07:52.000 is to have them in a specific format on disk, memory-map them,
07:52.000 --> 07:56.000 and access them directly from the memory mapping.
07:56.000 --> 07:59.000 Okay, so we wanted to do something like that in Rust.
07:59.000 --> 08:06.000 And essentially, to do that, what you do is something called zero-copy.
08:06.000 --> 08:10.000 Zero-copy means that instead of deserializing something,
08:10.000 --> 08:16.000 you load the bytes in memory, and then those bytes are already the object you want.
08:16.000 --> 08:20.000 More or less — so there is no deserialization happening.
08:20.000 --> 08:24.000 And in that case, you can even directly memory-map the file.
08:24.000 --> 08:28.000 So instead of deserializing an object, you memory-map the file,
08:28.000 --> 08:30.000 and it's already the object you want.
08:30.000 --> 08:35.000 There are a few crates to do this, but they didn't do what we needed.
08:35.000 --> 08:38.000 One is Abomonation, from Frank McSherry.
08:38.000 --> 08:43.000 Abomonation rewrites all the references in memory to make it work,
08:43.000 --> 08:46.000 because you don't know where the object will be loaded.
08:46.000 --> 08:51.000 We cannot do that, because we want to do memory mapping of an immutable
08:51.000 --> 08:53.000 data structure: we cannot write.
08:53.000 --> 08:56.000 zerovec is a popular crate to do that.
08:56.000 --> 09:00.000 It has a special type that you use instead of Vec,
09:00.000 --> 09:03.000 which essentially does relative indexing.
09:03.000 --> 09:05.000 Huge performance impact.
09:05.000 --> 09:09.000 We are shaving every cycle; not possible.
09:09.000 --> 09:13.000 Another one that is very popular is rkyv.
09:13.000 --> 09:17.000 rkyv has some impact, not very clear, on performance,
09:17.000 --> 09:23.000 but the main problem with rkyv is that what you deserialize — zero-copy deserialize —
09:23.000 --> 09:26.000 is not what you serialized; it's a different structure.
09:26.000 --> 09:31.000 So you have to re-implement every single method on the deserialized structure,
09:31.000 --> 09:35.000 and we have a lot of algorithmic things going on.
09:35.000 --> 09:37.000 That was not working for us.
09:37.000 --> 09:40.000 So we developed this idea,
09:40.000 --> 09:43.000 which I'll show you now.
09:43.000 --> 09:48.000 The concept of our idea is that you need some collaboration
09:48.000 --> 09:49.000 from the data structure.
09:49.000 --> 09:53.000 You cannot just add a derive and have it work for you.
09:53.000 --> 09:58.000 In exchange, we get something that we didn't get from those crates.
09:58.000 --> 10:02.000 So there are two types of data: zero-copy and deep-copy.
10:02.000 --> 10:04.000 There are two traits.
10:04.000 --> 10:08.000 And what we do is, if you have any kind of sequence —
10:08.000 --> 10:11.000 vectors, boxes, slices of zero-copy types —
10:11.000 --> 10:14.000 when you deserialize, instead of the vector,
10:14.000 --> 10:16.000 you get a reference to the memory.
10:16.000 --> 10:19.000 So this is the zero-copy part.
10:19.000 --> 10:22.000 But the point is that the rest of the structure,
10:22.000 --> 10:26.000 which is very small with respect to the large arrays,
10:26.000 --> 10:28.000 is allocated normally.
10:28.000 --> 10:32.000 So we copy a little bit — an epsilon of the structure.
10:32.000 --> 10:35.000 It's not entirely zero-copy.
10:35.000 --> 10:38.000 Zero-copy means we don't allocate anything:
10:38.000 --> 10:41.000 the memory is your structure.
10:41.000 --> 10:45.000 We take the memory, we use like 99% of it,
10:45.000 --> 10:48.000 and then we allocate a little bit.
10:48.000 --> 10:52.000 And to do that, we use disjoint trait implementations
10:52.000 --> 10:55.000 based on an associated type, which I'll show you in a moment.
10:55.000 --> 10:58.000 Maybe — how many people have met this problem, like,
10:59.000 --> 11:02.000 when programming in Rust? Let's see hands here.
11:02.000 --> 11:05.000 Okay, so there are some people who probably met the problem.
11:05.000 --> 11:10.000 But again, the idea is that you get something that is right for us.
11:10.000 --> 11:12.000 Let me just take a second.
11:12.000 --> 11:14.000 This — I'll show you, first of all,
11:14.000 --> 11:17.000 how you do these disjoint implementations.
11:17.000 --> 11:22.000 So the idea is that you want to implement a trait differently,
11:22.000 --> 11:26.000 depending on another trait, which you cannot do, of course,
11:26.000 --> 11:27.000 because of the coherence rules.
11:27.000 --> 11:30.000 So the way you do it is like this:
11:30.000 --> 11:33.000 you create a selector, basically.
11:33.000 --> 11:36.000 So, a trait with an associated type
11:36.000 --> 11:39.000 that has, in this case, just two possible values.
11:39.000 --> 11:40.000 Step one.
11:40.000 --> 11:44.000 Step two, you parameterize your trait
11:44.000 --> 11:47.000 with a specific instance of the first trait.
11:47.000 --> 11:50.000 So ZeroCopy depends on being a CopyType
11:50.000 --> 11:52.000 with Copy equal to Zero,
11:52.000 --> 11:54.000 and anything that is a CopyType
11:54.000 --> 11:56.000 with Copy equal to Zero will be ZeroCopy.
11:56.000 --> 11:59.000 So now ZeroCopy and the selector trait,
11:59.000 --> 12:04.000 with internal value Zero, are the same thing.
12:04.000 --> 12:09.000 Now, what we would like to write is this.
12:09.000 --> 12:14.000 Oh, this is — okay, my keyboard, okay.
12:14.000 --> 12:17.000 What we would like to write is this:
12:17.000 --> 12:21.000 I can write two different implementations of my trait,
12:21.000 --> 12:24.000 like this, for T, depending on what T is —
12:24.000 --> 12:27.000 and this will not compile, because it won't.
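What follows is a minimal sketch of the selector technique just described, together with the helper-trait workaround discussed next. The names (CopyType, Zero, Deep, ZeroCopy) mirror the talk, but the code is illustrative rather than the exact ε-serde source:

    // Step one: the selector, a trait with an associated type that has
    // exactly two possible "values", represented by marker types.
    pub struct Zero;
    pub struct Deep;
    pub trait CopyType { type Copy; }

    // Step two: tie each case to a specific instance of the selector.
    // Now ZeroCopy and "CopyType with Copy = Zero" are the same thing.
    pub trait ZeroCopy: CopyType<Copy = Zero> {}
    impl<T: CopyType<Copy = Zero>> ZeroCopy for T {}
    pub trait DeepCopy: CopyType<Copy = Deep> {}
    impl<T: CopyType<Copy = Deep>> DeepCopy for T {}

    // Writing `impl<T: ZeroCopy> Serialize for Vec<T>` and
    // `impl<T: DeepCopy> Serialize for Vec<T>` directly is rejected as
    // overlapping. The helper trait is parameterized on the selector,
    // so the two impls below are impls of *different* trait
    // instantiations and cannot clash.
    pub trait SerializeHelper<Selector> { fn write_body(&self); }

    impl<T: ZeroCopy> SerializeHelper<Zero> for Vec<T> {
        fn write_body(&self) { /* dump the raw bytes in one shot */ }
    }
    impl<T: DeepCopy> SerializeHelper<Deep> for Vec<T> {
        fn write_body(&self) { /* recurse element by element */ }
    }

    // The public trait delegates, selecting the impl through T::Copy.
    pub trait Serialize { fn serialize(&self); }
    impl<T: CopyType> Serialize for Vec<T>
    where
        Vec<T>: SerializeHelper<T::Copy>,
    {
        fn serialize(&self) { self.write_body(); }
    }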
12:27.000 --> 12:29.000 It used to, a while ago,
12:29.000 --> 12:30.000 but now it doesn't anymore;
12:30.000 --> 12:32.000 but with the small helper trait,
12:32.000 --> 12:33.000 you can make this work.
12:33.000 --> 12:35.000 So you can have disjoint implementations of a trait,
12:35.000 --> 12:37.000 depending on another trait.
12:37.000 --> 12:39.000 And this is how we manage the whole
12:39.000 --> 12:43.000 "something is deep-copy, something is zero-copy" thing.
12:43.000 --> 12:45.000 Okay, if you are interested,
12:45.000 --> 12:49.000 there is a PR going on to make this work in the compiler,
12:49.000 --> 12:53.000 and the PR contains the associated-type trick we're using here.
12:54.000 --> 12:56.000 So this is a very interesting technique,
12:56.000 --> 12:59.000 without which we would never have been able to write,
12:59.000 --> 13:01.000 say, this example.
13:01.000 --> 13:06.000 Okay, so what happens here is that you declare the structure
13:06.000 --> 13:09.000 with an ID, an integer, and some data.
13:09.000 --> 13:11.000 Okay, and you derive ε-serde.
13:11.000 --> 13:14.000 Fortunately, there is a procedural macro.
13:14.000 --> 13:17.000 Now, you create a structure with a vector inside,
13:17.000 --> 13:20.000 and you serialize it somewhere.
13:20.000 --> 13:22.000 Okay, you store it, easy.
13:22.000 --> 13:26.000 Now, what you do is you load the serialized form
13:26.000 --> 13:28.000 into a buffer in memory.
13:28.000 --> 13:30.000 So now it's a piece of memory.
13:30.000 --> 13:33.000 Okay, and at that point, you can
13:33.000 --> 13:35.000 ε-deserialize the structure,
13:35.000 --> 13:38.000 and what you get is a structure that, instead of a vector,
13:38.000 --> 13:40.000 contains a reference to a slice
13:40.000 --> 13:42.000 that is in your serialized data.
13:42.000 --> 13:46.000 Okay, you can also do a full deserialization,
13:46.000 --> 13:48.000 like a normal serialization framework,
13:48.000 --> 13:50.000 and it will be very, very fast,
13:50.000 --> 13:53.000 because all zero-copy parts are just one read_exact.
13:53.000 --> 13:55.000 There's no looping.
13:55.000 --> 13:58.000 Or you can even directly memory-map it.
13:58.000 --> 14:01.000 And — okay, let me show what's happening here.
14:01.000 --> 14:04.000 You have your structure, step one.
14:04.000 --> 14:08.000 The structure is an ID, okay.
14:08.000 --> 14:12.000 I don't know why there is like a two-second delay
14:12.000 --> 14:15.000 between — okay, whatever.
14:15.000 --> 14:18.000 At construction time, you have your structure:
14:18.000 --> 14:21.000 there is an ID, a vector — a pointer,
14:21.000 --> 14:23.000 a length, a capacity — okay.
14:23.000 --> 14:26.000 I think this is pretty obvious.
14:26.000 --> 14:29.000 Then we serialize, and you just have the ID, the length,
14:29.000 --> 14:32.000 and the data, plus a lot of headers, checks,
14:32.000 --> 14:34.000 some verification, okay.
14:34.000 --> 14:36.000 But this is the data we write.
14:36.000 --> 14:39.000 When you do the epsilon-deserialization,
14:39.000 --> 14:42.000 you get a structure that contains
14:42.000 --> 14:44.000 just a reference to the data.
14:44.000 --> 14:47.000 So you see, the ID and the length are effectively copied,
14:48.000 --> 14:50.000 because we allocate space for them.
14:50.000 --> 14:52.000 But the huge amount of data in the vector
14:52.000 --> 14:55.000 is not copied; it's just referenced.
14:55.000 --> 14:56.000 Okay.
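The workflow just described, in sketch form. The derive name and the serialize / deserialize_eps methods are how I recall the epserde crate's API, so treat them as approximate and check the crate documentation:

    use epserde::prelude::*;

    // usize is zero-copy; Vec<usize> is a sequence of a zero-copy type.
    #[derive(Epserde)]
    struct Data {
        id: usize,
        data: Vec<usize>,
    }

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let d = Data { id: 42, data: vec![0_usize; 1_000_000] };

        // Serialization writes the id, the length, and then the raw
        // bytes of the vector (plus headers and consistency checks).
        let mut file = std::fs::File::create("data.eps")?;
        d.serialize(&mut file)?;

        // ε-deserialization: load (or memory-map) the bytes, then get
        // back a structure of the same shape in which `data` is a
        // &[usize] pointing straight into `bytes`. Only the id and the
        // slice header — the epsilon — are actually allocated.
        let bytes = std::fs::read("data.eps")?;
        let view = <Data>::deserialize_eps(&bytes)?;
        assert_eq!(view.id, 42);
        assert_eq!(view.data.len(), 1_000_000);
        Ok(())
    }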
14:56.000 --> 14:59.000 This is basically what, like, Facebook or Google does
14:59.000 --> 15:04.000 in C++, in a — let's say — less safe manner.
15:04.000 --> 15:06.000 But for us, it worked very well,
15:06.000 --> 15:09.000 because now the serialized data is on disk,
15:09.000 --> 15:13.000 memory mapping takes nothing — it's just a memory map —
15:13.000 --> 15:15.000 and we allocate only that very small,
15:15.000 --> 15:18.000 epsilon part of the structure.
15:18.000 --> 15:20.000 Okay.
15:20.000 --> 15:23.000 Of course, you have to write your methods
15:23.000 --> 15:26.000 so they work both with a vector and a slice,
15:26.000 --> 15:28.000 but that's what everybody does,
15:28.000 --> 15:31.000 so that shouldn't be too difficult.
15:31.000 --> 15:33.000 And once you have things supporting
15:33.000 --> 15:36.000 ε-serde, you can put them in as a type parameter,
15:36.000 --> 15:40.000 and they will work flawlessly, recursively, down to the leaves.
15:40.000 --> 15:43.000 The only problem is that if you have a lot of fields,
15:43.000 --> 15:46.000 all of different types, and they all should support
15:46.000 --> 15:49.000 ε-serde, you will have a lot of parameter types —
15:49.000 --> 15:51.000 type parameters.
15:51.000 --> 15:52.000 Sorry.
15:52.000 --> 15:53.000 Okay.
15:53.000 --> 15:54.000 So that's it.
15:54.000 --> 15:56.000 If you need to memory-map large structures,
15:56.000 --> 15:58.000 or serialize and deserialize things very quickly,
15:58.000 --> 15:59.000 have a look at ε-serde.
15:59.000 --> 16:01.000 I think we did a nice job.
16:01.000 --> 16:03.000 And notice that since the structure
16:03.000 --> 16:06.000 you deserialize is exactly the structure
16:06.000 --> 16:09.000 you serialize, you can basically
16:09.000 --> 16:11.000 exchange the two things without problems —
16:11.000 --> 16:13.000 it's not a different structure —
16:13.000 --> 16:16.000 and the impact on performance is zero,
16:16.000 --> 16:18.000 because it's exactly the same structure;
16:18.000 --> 16:20.000 simply, instead of a vector,
16:20.000 --> 16:22.000 you have a reference to a slice.
16:22.000 --> 16:24.000 Contrary to the other crates.
16:24.000 --> 16:25.000 Okay.
16:25.000 --> 16:26.000 Very quickly:
16:26.000 --> 16:30.000 mem_dbg is a high-performance memory
16:30.000 --> 16:31.000 occupancy detector,
16:31.000 --> 16:32.000 and the question is,
16:32.000 --> 16:35.000 how can a memory detector be slow?
16:35.000 --> 16:37.000 Very easy.
16:37.000 --> 16:40.000 These are some of the most common
16:40.000 --> 16:42.000 crates for memory detection.
16:42.000 --> 16:45.000 I mean, I want to know how much memory
16:45.000 --> 16:48.000 an object uses, including the memory
16:48.000 --> 16:51.000 it references, not just its size_of.
16:51.000 --> 16:56.000 Here there is a very large HashMap — two billion elements — allocated,
16:56.000 --> 16:59.000 and the first three frameworks
16:59.000 --> 17:02.000 are what you can find now on crates.io.
17:02.000 --> 17:04.000 The measure will be
17:04.000 --> 17:05.000 off for various reasons,
17:05.000 --> 17:07.000 but the point is that they take
17:07.000 --> 17:11.000 almost two-tenths of a second to establish the size,
17:11.000 --> 17:15.000 because they have to loop through two billion elements.
17:15.000 --> 17:18.000 MemSize leverages the disjoint-trait thing,
17:18.000 --> 17:20.000 and will give you the answer in,
17:20.000 --> 17:22.000 I mean, 200 nanoseconds — nothing.
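Here is why a constant-time answer is even possible — a sketch reusing the illustrative selector types from the earlier disjoint-implementation example (the real mem_dbg crate is more general than this):

    use std::mem::size_of;

    pub trait MemSize { fn mem_size(&self) -> usize; }

    // Helper parameterized on the selector, exactly as before.
    pub trait MemSizeHelper<Selector> { fn body_size(&self) -> usize; }

    impl<T: ZeroCopy> MemSizeHelper<Zero> for Vec<T> {
        // No loop: every element has the same statically known size,
        // so two billion elements cost the same as two.
        fn body_size(&self) -> usize {
            size_of::<Self>() + self.len() * size_of::<T>()
        }
    }

    impl<T: DeepCopy + MemSize> MemSizeHelper<Deep> for Vec<T> {
        // Deep-copy elements may own further memory: here we must loop.
        fn body_size(&self) -> usize {
            size_of::<Self>()
                + self.iter().map(MemSize::mem_size).sum::<usize>()
        }
    }

    impl<T: CopyType> MemSize for Vec<T>
    where
        Vec<T>: MemSizeHelper<T::Copy>,
    {
        fn mem_size(&self) -> usize { self.body_size() }
    }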
17:22.000 --> 17:24.000 And when you start managing structures
17:24.000 --> 17:26.000 that are 100 gigabytes, waiting
17:26.000 --> 17:28.000 30 seconds to get an answer
17:28.000 --> 17:29.000 is not very nice.
17:29.000 --> 17:31.000 Also, we can print memory layouts,
17:31.000 --> 17:32.000 including padding,
17:32.000 --> 17:34.000 so you can have a complete breakdown
17:34.000 --> 17:36.000 of a data structure,
17:36.000 --> 17:39.000 knowing which part occupies more and which occupies less.
17:39.000 --> 17:42.000 We can even see padding in enums, using
17:42.000 --> 17:45.000 the unstable offset_of_enum feature.
17:45.000 --> 17:48.000 So, if you're interested in looking inside
17:48.000 --> 17:51.000 your data structures, that is useful.
17:51.000 --> 17:53.000 sux — I will skip a little bit on this,
17:53.000 --> 17:55.000 because nobody here knows anything about
17:55.000 --> 17:56.000 succinct data structures,
17:56.000 --> 17:59.000 but let me just expose the idea.
17:59.000 --> 18:01.000 Succinct data structures:
18:01.000 --> 18:03.000 they use space close to the information-
18:03.000 --> 18:05.000 theoretic lower bound,
18:05.000 --> 18:08.000 but with operations that asymptotically
18:08.000 --> 18:10.000 are the same as standard structures.
18:10.000 --> 18:13.000 Let me just make an example.
18:13.000 --> 18:16.000 There are about four to the power of n binary trees on n nodes.
18:16.000 --> 18:18.000 That's a number.
18:18.000 --> 18:21.000 So it should be possible to represent
18:21.000 --> 18:25.000 a binary tree using log of four to the n — that is, two n bits.
18:25.000 --> 18:28.000 But we use two n log n bits,
18:28.000 --> 18:30.000 because we have n nodes,
18:30.000 --> 18:32.000 each node has two pointers,
18:32.000 --> 18:35.000 and the two pointers are at least log n bits each.
18:35.000 --> 18:38.000 So, Jacobson's representation
18:38.000 --> 18:40.000 uses two n bits,
18:40.000 --> 18:43.000 and gives you parent and child in constant time.
18:43.000 --> 18:46.000 So you shave off a log n factor.
18:46.000 --> 18:49.000 This is very important because — oh,
18:49.000 --> 18:52.000 but, okay.
18:52.000 --> 18:54.000 Now — sorry,
18:54.000 --> 18:59.000 I was seeing things you were not seeing.
18:59.000 --> 19:02.000 So there is something really bad going on.
19:02.000 --> 19:06.000 I'm so sorry.
19:06.000 --> 19:08.000 Okay, now it's on the screen,
19:08.000 --> 19:10.000 but not on my screen.
19:10.000 --> 19:14.000 This never happened before with Keynote in my life.
19:14.000 --> 19:17.000 I'll try to make do. WebGraph represents everything
19:17.000 --> 19:19.000 internally using Elias–Fano;
19:19.000 --> 19:20.000 it indexes the graph.
19:20.000 --> 19:23.000 It's pervasive throughout the infrastructure.
19:23.000 --> 19:25.000 We also have other things, like sorting,
19:25.000 --> 19:28.000 compression, blah blah, but I'm going to skip them
19:28.000 --> 19:31.000 because I want to have some time for WebGraph.
19:31.000 --> 19:35.000 Okay, let me just skip this, because I don't think it's
19:35.000 --> 19:38.000 super interesting in this context.
19:38.000 --> 19:41.000 Let me just make one complaint.
19:41.000 --> 19:42.000 Okay.
19:42.000 --> 19:46.000 We have a comprehensive set of traits for indexed dictionaries.
19:46.000 --> 19:49.000 An indexed dictionary is a set of integers.
19:49.000 --> 19:52.000 You can get the i-th integer in increasing order,
19:52.000 --> 19:55.000 and you can do predecessor and successor queries.
19:55.000 --> 19:57.000 That is, upper-bound and lower-bound computations.
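An illustrative shape for such a dictionary — not the exact sux API — using the vector-of-squares example that comes up in a moment. Note that every operation returns a value, which is precisely what triggers the complaint that follows:

    /// A set of integers, queried by rank and by predecessor/successor.
    trait IndexedDict {
        /// The i-th smallest element, returned BY VALUE: in a succinct
        /// structure there may be no stored integer to point into.
        fn get(&self, i: usize) -> usize;
        /// Greatest element <= x.
        fn pred(&self, x: usize) -> Option<usize>;
        /// Least element >= x.
        fn succ(&self, x: usize) -> Option<usize>;
    }

    /// The set of perfect squares, represented by no data at all.
    struct Squares;

    fn isqrt(x: usize) -> usize {
        // Largest r with r * r <= x (a slow loop is fine for a sketch).
        let mut r = 0;
        while (r + 1) * (r + 1) <= x { r += 1; }
        r
    }

    impl IndexedDict for Squares {
        fn get(&self, i: usize) -> usize { i * i }
        fn pred(&self, x: usize) -> Option<usize> {
            let r = isqrt(x);
            Some(r * r) // 0 is a square, so a predecessor always exists
        }
        fn succ(&self, x: usize) -> Option<usize> {
            let r = isqrt(x);
            Some(if r * r == x { x } else { (r + 1) * (r + 1) })
        }
    }

    // impl std::ops::Index<usize> for Squares is impossible: index()
    // must return &usize, and no usize exists anywhere in memory.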
19:57.000 --> 20:00.000 The classic lower-bound computations, if you like.
20:00.000 --> 20:03.000 Our main implementation is Elias–Fano.
20:03.000 --> 20:09.000 The main problem we have is that there is no IndexGet in Rust.
20:09.000 --> 20:12.000 There is a PR going on, but it's stalled.
20:12.000 --> 20:14.000 What I mean is — you know how it works —
20:14.000 --> 20:17.000 if you want to overload the square-bracket operator,
20:17.000 --> 20:19.000 you have to implement Index.
20:19.000 --> 20:20.000 Okay.
20:20.000 --> 20:25.000 Index must return a reference.
20:25.000 --> 20:27.000 That's not good.
20:27.000 --> 20:30.000 Imagine you have to implement, functionally, a vector that
20:30.000 --> 20:33.000 returns i squared for index i.
20:33.000 --> 20:34.000 You can't.
20:34.000 --> 20:39.000 So, in general, Rust and intensional representations do not work
20:39.000 --> 20:41.000 well together at the moment.
20:41.000 --> 20:44.000 If your representation of the data is extensional —
20:44.000 --> 20:46.000 so you actually have the data,
20:46.000 --> 20:51.000 like in a binary tree of integers, where the integers are in the tree —
20:51.000 --> 20:54.000 if you need to expose one of the integers,
20:54.000 --> 20:56.000 you can offer a reference.
20:56.000 --> 21:00.000 But anything that is implicit, succinct, compressed — anything
21:00.000 --> 21:04.000 that is intensional, in which the description of the data is
21:04.000 --> 21:06.000 in the data structure,
21:06.000 --> 21:08.000 but the data itself is not in the data structure —
21:08.000 --> 21:12.000 it's impossible to use any kind of operator overloading in Rust.
21:12.000 --> 21:15.000 We really hope someone at some point makes
21:15.000 --> 21:17.000 at least IndexGet work;
21:17.000 --> 21:21.000 that would be the ideal trait that overloads square brackets
21:21.000 --> 21:25.000 but returns a value, not a reference.
21:25.000 --> 21:26.000 Okay.
21:26.000 --> 21:30.000 If you can push this through — anybody on the compiler team —
21:30.000 --> 21:32.000 that would be very nice for us.
21:32.000 --> 21:35.000 Or the other PR — either one.
21:35.000 --> 21:37.000 Okay.
21:37.000 --> 21:40.000 Another component that might be useful to you: bit streams.
21:40.000 --> 21:44.000 Bit streams let you read, instead of a stream of bytes,
21:44.000 --> 21:46.000 a stream of bits.
21:46.000 --> 21:48.000 Why would you want to do that?
21:48.000 --> 21:52.000 Well, because of compression: a lot of compression happens
21:52.000 --> 21:56.000 in bit streams — even JPEG and MPEG, whatever.
21:56.000 --> 22:00.000 Essentially, we have codes that let you write small integers with few bits
22:00.000 --> 22:03.000 and large integers with many bits.
22:03.000 --> 22:06.000 These are called instantaneous codes.
22:06.000 --> 22:11.000 And dsi-bitstream implements many of them.
22:11.000 --> 22:15.000 The main thing that is different from other crates: we read by word.
22:15.000 --> 22:18.000 This is like a major change in efficiency.
22:18.000 --> 22:21.000 So, we don't read byte by byte and then decode.
22:21.000 --> 22:25.000 We read — you can set your own read word,
22:25.000 --> 22:28.000 from 8 to 64 bits.
22:28.000 --> 22:31.000 And that speeds things up enormously.
22:31.000 --> 22:35.000 Doing this is very easy in Rust; in Java, impossible.
22:36.000 --> 22:41.000 We have a lot of instantaneous codes, and it's very easy to add your own.
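To make "instantaneous code" concrete, here is a toy Elias γ coder over a one-bit-at-a-time buffer. This only sketches the code itself: dsi-bitstream reads a whole word at a time, which is where the speed discussed next comes from.

    /// Toy bit buffer: one bool per bit — nothing like the real thing.
    struct BitVecIo { bits: Vec<bool>, pos: usize }

    impl BitVecIo {
        fn new() -> Self { Self { bits: Vec::new(), pos: 0 } }
        fn write_bit(&mut self, b: bool) { self.bits.push(b); }
        fn read_bit(&mut self) -> bool {
            let b = self.bits[self.pos];
            self.pos += 1;
            b
        }

        /// Elias γ: for x >= 1, write floor(log2 x) zeros, then x in
        /// binary. Small integers use few bits: 1 -> "1", 5 -> "00101".
        fn write_gamma(&mut self, x: u64) {
            assert!(x >= 1);
            let n = 63 - x.leading_zeros(); // floor(log2 x)
            for _ in 0..n { self.write_bit(false); }
            for i in (0..=n).rev() { self.write_bit((x >> i) & 1 == 1); }
        }

        fn read_gamma(&mut self) -> u64 {
            let mut n = 0; // count the unary run of zeros
            while !self.read_bit() { n += 1; }
            let mut x = 1u64; // the 1 ending the run is the top bit
            for _ in 0..n { x = (x << 1) | self.read_bit() as u64; }
            x
        }
    }

    // let mut io = BitVecIo::new();
    // io.write_gamma(5); io.pos = 0; assert_eq!(io.read_gamma(), 5);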
22:41.000 --> 22:45.000 Like a gamma code, if you're familiar with that: we can read one in less than two
22:45.000 --> 22:50.000 nanoseconds, which is, believe me, pretty good.
22:50.000 --> 22:55.000 You have two basic traits, BitRead and BitWrite, which, as you can imagine,
22:55.000 --> 22:59.000 can read bits and write bits, not surprisingly.
22:59.000 --> 23:02.000 And then we have extension traits for the codes.
23:02.000 --> 23:08.000 You can write your own extension traits for the code you want, using the infrastructure.
23:08.000 --> 23:13.000 And we have implementations of a reader and a writer that are buffered,
23:13.000 --> 23:17.000 with a selectable read/write word.
23:17.000 --> 23:21.000 Okay, this is a little bit too much.
23:21.000 --> 23:27.000 But one observation: presently, the best read word is u32,
23:28.000 --> 23:32.000 and the best write word is u64, for implementation reasons.
23:32.000 --> 23:36.000 When you want to read bits and you want to decode using tables,
23:36.000 --> 23:39.000 you need to be able to peek at bits in the stream.
23:39.000 --> 23:45.000 To do that, your internal bit buffer must be at least twice the word you're reading from the file.
23:45.000 --> 23:49.000 Usually, people read 8 bits at a time, so that is not an issue.
23:49.000 --> 23:52.000 But we are reading a word at a time.
23:52.000 --> 23:55.000 If we were to read 64 bits at a time,
23:55.000 --> 24:02.000 the internal buffer would be 128 bits, and 128-bit
24:02.000 --> 24:04.000 integers are very slow.
24:04.000 --> 24:09.000 Now, I guess in two or three years we'll be able to have a read word of u64,
24:09.000 --> 24:16.000 but for the time being, on Intel at least, 128-bit arithmetic is very slow.
24:16.000 --> 24:23.000 Okay, so if you're interested in bit streams, have a look at dsi-bitstream.
24:23.000 --> 24:26.000 Now, more interestingly — finally — WebGraph.
24:26.000 --> 24:29.000 What can it do for you?
24:29.000 --> 24:34.000 Okay, graphs here are represented as bit streams.
24:34.000 --> 24:37.000 Obviously — otherwise we wouldn't have built all those crates —
24:37.000 --> 24:40.000 we use dsi-bitstream for the codes,
24:40.000 --> 24:43.000 and then we use sux for the pointers into the bit stream.
24:43.000 --> 24:48.000 Because using Elias–Fano, we can store the pointers into the bit stream in very little space.
24:48.000 --> 24:51.000 That's a classical usage of Elias–Fano.
24:51.000 --> 24:57.000 On an old Software Heritage graph with 34 billion nodes
24:57.000 --> 25:00.000 and more or less half a trillion arcs,
25:00.000 --> 25:02.000 a BFS completes in three hours.
25:02.000 --> 25:06.000 On standard hardware — I mean, half a terabyte of RAM, of course,
25:06.000 --> 25:08.000 but fairly standard hardware.
25:08.000 --> 25:11.000 And it's three times faster than it was in Java.
25:11.000 --> 25:15.000 We expected 50% faster, maybe 100% faster;
25:15.000 --> 25:18.000 three times faster was really nice.
25:18.000 --> 25:20.000 Consider that everything else is faster, too.
25:20.000 --> 25:23.000 I mean, in Java, for instance, when you have an array
25:23.000 --> 25:26.000 with, like, 40 or 50 billion elements,
25:26.000 --> 25:29.000 you have to simulate it through many small arrays.
25:29.000 --> 25:31.000 That slows down things a lot.
25:31.000 --> 25:35.000 In Rust, all these things are fairly easy.
25:35.000 --> 25:41.000 What I also found — maybe it's obvious — is that the ergonomics of the API
25:41.000 --> 25:44.000 are unbelievably better with respect to Java,
25:44.000 --> 25:47.000 and just to let you have a look:
25:47.000 --> 25:51.000 graphs have n nodes, numbered from zero to n minus one, period.
25:51.000 --> 25:55.000 There is no specific data structure for nodes or arcs.
25:55.000 --> 25:57.000 Everything is done with integers.
25:57.000 --> 25:59.000 If you have your own index space,
25:59.000 --> 26:02.000 you have to map it bijectively into zero to n.
26:02.000 --> 26:03.000 It's your job.
26:03.000 --> 26:08.000 We work with indices in a compact space, zero to n.
26:08.000 --> 26:12.000 And when you want to access the graph structure,
26:12.000 --> 26:13.000 you get successors.
26:13.000 --> 26:17.000 So you take a node, and you get the successors of the node —
26:17.000 --> 26:20.000 the other nodes that are connected to it by an arc.
26:20.000 --> 26:23.000 So there is no data structure
26:23.000 --> 26:28.000 materializing a node or an edge — arc, however you call it.
26:28.000 --> 26:30.000 Everything is done with usize.
26:30.000 --> 26:36.000 And this is what you must do if you want to work with a graph with half a trillion edges.
26:36.000 --> 26:41.000 Otherwise, anything with which you try to represent sets of edges, sets of nodes,
26:41.000 --> 26:46.000 stacks — it becomes immense, because structures
26:46.000 --> 26:49.000 materializing nodes or edges usually
26:49.000 --> 26:54.000 use much more space than just a usize.
26:54.000 --> 26:57.000 Okay.
26:57.000 --> 27:00.000 So how does compression happen?
27:00.000 --> 27:02.000 These are complicated topics, but just as an idea:
27:02.000 --> 27:03.000 gap compression.
27:03.000 --> 27:05.000 So if you have a list of successors,
27:05.000 --> 27:08.000 you take the gaps — the differences.
27:08.000 --> 27:11.000 These are small numbers, because they are just gaps,
27:11.000 --> 27:13.000 and you use instantaneous codes.
27:13.000 --> 27:16.000 If you're familiar with inverted indices —
27:16.000 --> 27:20.000 I mean, these have been standard techniques for 30 years — this is how you do it.
27:20.000 --> 27:24.000 But with graphs, we can do a little bit more, because these large graphs
27:24.000 --> 27:27.000 are often very redundant.
27:27.000 --> 27:29.000 Like, the successor list of a node
27:29.000 --> 27:32.000 is very similar to the successor list of another node.
27:32.000 --> 27:37.000 Imagine a web snapshot: two pages from the same level of a website —
27:37.000 --> 27:40.000 they have very similar links.
27:40.000 --> 27:44.000 So what we do is, we compress by reference.
27:44.000 --> 27:49.000 So instead of storing my successor list — my out-links — entirely,
27:49.000 --> 27:52.000 first I check whether I can copy most of it from another node
27:52.000 --> 27:56.000 that I already stored, and then I store the difference.
27:56.000 --> 27:59.000 This is a major game changer,
27:59.000 --> 28:02.000 in particular for web snapshots,
28:02.000 --> 28:06.000 where you can easily get to one, one and a half bits per link,
28:06.000 --> 28:08.000 instead of 64.
28:08.000 --> 28:12.000 Because web snapshots are impressively redundant.
28:12.000 --> 28:14.000 You can also intervalize:
28:14.000 --> 28:16.000 if you have consecutive successors,
28:16.000 --> 28:20.000 maybe you can store them as intervals.
28:20.000 --> 28:24.000 We also have labels on edges.
28:24.000 --> 28:27.000 I'll show you how we do them next.
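Before the labels, a sketch of the gap idea just described, reusing the toy BitVecIo γ-coder from before. This is illustrative only: the real WebGraph scheme also handles the first gap relative to the source node, references, and intervals.

    /// Gap-compress a strictly increasing successor list: code the
    /// differences, which are small and thus cheap in gamma.
    fn compress_successors(successors: &[u64], io: &mut BitVecIo) {
        let mut prev = 0;
        for &s in successors {
            io.write_gamma(s - prev + 1); // +1 keeps the argument >= 1
            prev = s;
        }
    }

    fn decompress_successors(count: usize, io: &mut BitVecIo) -> Vec<u64> {
        let mut out = Vec::with_capacity(count);
        let mut prev = 0;
        for _ in 0..count {
            prev += io.read_gamma() - 1;
            out.push(prev);
        }
        out
    }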
28:27.000 --> 28:30.000 And the way we do them is very elegant — I can say that because it's Tommaso's idea, not mine.
28:30.000 --> 28:32.000 So I can say a lot of nice things about it.
28:32.000 --> 28:38.000 So, a labeling is like a graph that, instead of returning successors,
28:38.000 --> 28:39.000 returns labels.
28:39.000 --> 28:42.000 And then we have zip operators: like with iterators,
28:42.000 --> 28:44.000 you can zip your graph with your labeling,
28:44.000 --> 28:46.000 and now you have a labeled graph.
28:46.000 --> 28:49.000 And this is very nice, because if you have several sets of labels,
28:49.000 --> 28:51.000 you can zip them and create your own custom labels,
28:51.000 --> 28:56.000 zip that with a graph, and get your custom labeled graph without any programming.
28:56.000 --> 28:59.000 You just have to write "zip this and this",
28:59.000 --> 29:01.000 and you will get a labeled graph.
29:01.000 --> 29:04.000 The whole thing is based on lenders.
29:04.000 --> 29:09.000 And this is the ten minutes of the talk that will be pure Rust problems.
29:09.000 --> 29:12.000 It took us a while to get this working.
29:12.000 --> 29:16.000 So — do you know what a lender is in Rust,
29:16.000 --> 29:18.000 or a lending iterator?
29:18.000 --> 29:20.000 Anybody? Never heard of it?
29:20.000 --> 29:21.000 Okay.
29:21.000 --> 29:23.000 Okay, iterators give you things.
29:23.000 --> 29:25.000 You know what an iterator is, right?
29:25.000 --> 29:28.000 Okay, iterators give you things, right?
29:28.000 --> 29:32.000 And you can take two things from the iterator and do things
29:32.000 --> 29:35.000 with the two things you got from the iterator.
29:35.000 --> 29:37.000 With a lender, you can't.
29:37.000 --> 29:41.000 With a lender, you take a thing, you do something with the thing,
29:41.000 --> 29:45.000 you throw it away, and then you can take another thing.
29:45.000 --> 29:50.000 Okay: lenders are iterators that have relevant internal state.
29:50.000 --> 29:54.000 They cannot provide you with two items at the same time —
29:55.000 --> 29:58.000 for instance, because they go over a file,
29:58.000 --> 30:02.000 and they move the pointer as they return items.
30:02.000 --> 30:04.000 Okay.
30:04.000 --> 30:06.000 Now imagine the situation:
30:06.000 --> 30:09.000 we are returning an iterator on the graph,
30:09.000 --> 30:12.000 and we are scanning a bit stream.
30:12.000 --> 30:14.000 We cannot use an iterator.
30:14.000 --> 30:17.000 I mean, we could return a lender and call it an iterator,
30:17.000 --> 30:19.000 but it has to be a lender.
30:20.000 --> 30:23.000 And these computations are complicated.
30:23.000 --> 30:25.000 Okay, let me just skip this.
30:25.000 --> 30:28.000 Let me just go to the traits,
30:28.000 --> 30:30.000 because we don't have much time.
30:30.000 --> 30:32.000 I thought I would have been faster.
30:32.000 --> 30:36.000 So, the basic trait you access is a sequential labeling.
30:36.000 --> 30:39.000 You have labels, and you have a lender —
30:39.000 --> 30:43.000 a NodeLabelsLender; lenders are the thing I'm going to talk about later.
30:43.000 --> 30:47.000 It's something that gives you nodes and an iterator over their labels.
30:47.000 --> 30:50.000 So you iterate on a graph, and you get node zero
30:50.000 --> 30:54.000 with its successors, node one with its successors, and so on.
30:54.000 --> 30:56.000 And these are labelings.
30:56.000 --> 30:59.000 A graph is just a labeling with usize labels.
30:59.000 --> 31:00.000 That's step one.
31:00.000 --> 31:02.000 Step two: random access.
31:02.000 --> 31:03.000 You also want random access.
31:03.000 --> 31:06.000 So what you do is something like this.
31:06.000 --> 31:12.000 You get a labels method that tells you what the labels of a node are,
31:12.000 --> 31:14.000 and an out-degree method.
31:14.000 --> 31:19.000 And then a random-access graph is a random-access labeling with usize labels.
31:19.000 --> 31:24.000 If you have an actual labeling and you zip it with a graph, then you get a labeled graph.
31:24.000 --> 31:26.000 So everything is a labeling.
31:26.000 --> 31:29.000 If your labeling returns usize, you get a graph.
31:29.000 --> 31:30.000 You can zip labelings together.
31:30.000 --> 31:35.000 If you zip together a graph and a labeling, you have a labeled graph.
31:35.000 --> 31:38.000 Everything is composable, like Lego.
31:38.000 --> 31:43.000 Now, the main thing to make this work properly is the infamous lender.
31:43.000 --> 31:47.000 Okay, let me just explain the idea.
31:47.000 --> 31:52.000 GATs — you know what a generic associated type is?
31:52.000 --> 31:55.000 Raise your hand if you know what a generic associated type is.
31:55.000 --> 31:56.000 Okay, someone knows.
31:56.000 --> 31:57.000 Okay.
31:57.000 --> 32:03.000 If you look back at when generic associated types were introduced,
32:03.000 --> 32:06.000 this was the example everywhere.
32:06.000 --> 32:08.000 So, this is a lender.
32:08.000 --> 32:12.000 You see, the item has a lifetime.
32:12.000 --> 32:16.000 Sorry, this should be an 'a — I mean, a mistake in the slide.
32:16.000 --> 32:20.000 Now, what happens is that when you get an element from the lender,
32:20.000 --> 32:23.000 it borrows exclusively from self.
32:23.000 --> 32:27.000 So until the item you got goes out of scope,
32:27.000 --> 32:30.000 you cannot call next, which is exactly what you want.
32:30.000 --> 32:33.000 You want an iterator that gives you things,
32:33.000 --> 32:37.000 but until you get rid of the thing, you cannot get another thing.
32:37.000 --> 32:38.000 Okay.
32:38.000 --> 32:41.000 That seems very nice, and it's totally not working.
32:41.000 --> 32:42.000 Why?
32:42.000 --> 32:49.000 Because as soon as you start to use it, you need to pass a lender to a function.
32:49.000 --> 32:56.000 And if you want to constrain the type of the item, you have to write
32:56.000 --> 32:59.000 a trait bound on the item, with a lifetime.
32:59.000 --> 33:04.000 The only way to do that is with a for — with a higher-ranked trait bound.
33:04.000 --> 33:08.000 And at that point, that for<'a> can be instantiated with 'static,
33:08.000 --> 33:11.000 which means that the lender must be 'static.
33:11.000 --> 33:16.000 So the only way to use this is with a 'static lender.
33:16.000 --> 33:18.000 And of course, our lenders are not 'static.
33:18.000 --> 33:20.000 We need lenders that give you the graph
33:20.000 --> 33:25.000 and read the graph from another source to which they hold a reference.
33:25.000 --> 33:30.000 So Sabrina Jewson came up with this idea, which actually works.
33:30.000 --> 33:38.000 And the idea is — if I can make it work, because, sorry guys,
33:38.000 --> 33:40.000 that's okay.
33:40.000 --> 33:44.000 I'll wait three seconds, and maybe this will happen.
33:44.000 --> 33:46.000 Okay.
33:46.000 --> 33:52.000 Okay, this is only for very interested people.
33:52.000 --> 33:56.000 So here, what happens is that we have a trait, Lending.
33:56.000 --> 34:02.000 There is a type parameter with a default value.
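In code, the two stages look roughly like this: first the plain-GAT lending iterator that was "the example everywhere", then a simplified version of Sabrina Jewson's workaround. The real lender crate additionally seals the defaulted parameter; this sketch elides that.

    // The GAT version: each item borrows exclusively from the lender,
    // so you cannot call next() again until the item goes out of scope.
    trait LendingIterator {
        type Item<'a> where Self: 'a;
        fn next(&mut self) -> Option<Self::Item<'_>>;
    }
    // Problem: bounding the item in a generic function needs a
    // higher-ranked `for<'a> ...` bound, and 'a may be instantiated
    // with 'static — effectively forcing the lender to be 'static.

    // The workaround: the defaulted type parameter `&'lend Self` makes
    // every use of Lending<'lend> carry an implied `Self: 'lend` bound,
    // so the for<'lend> quantification only ranges over lifetimes that
    // Self outlives.
    trait Lending<'lend, ImplBound = &'lend Self> {
        type Lend;
    }
    trait Lender: for<'lend> Lending<'lend> {
        fn next(&mut self) -> Option<<Self as Lending<'_>>::Lend>;
    }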
34:02.000 --> 34:04.000 You will never write this parameter.
34:04.000 --> 34:06.000 It is a reference to Self,
34:06.000 --> 34:10.000 and the trait contains the thing you will lend — the item — with a dependency on the lifetime.
34:10.000 --> 34:14.000 Now, to create the Lender trait,
34:14.000 --> 34:19.000 we write this for<'lend> Lending bound, next, blah, blah, like before.
34:19.000 --> 34:23.000 What happens now — the interesting part — is that the compiler
34:23.000 --> 34:27.000 sees a higher-ranked trait bound for 'a,
34:27.000 --> 34:33.000 but, because of the reference, there is an implicit where Self: 'a.
34:33.000 --> 34:34.000 Okay.
34:34.000 --> 34:39.000 So, implicitly, we're saying that the lifetime is outlived by Self.
34:39.000 --> 34:42.000 And it would be very nice to have this syntax directly.
34:42.000 --> 34:47.000 One day, maybe, we will have this syntax, and all this will be easier.
34:47.000 --> 34:51.000 But this is the way you create a higher-ranked trait bound
34:51.000 --> 34:54.000 with a limitation on the lifetime.
34:54.000 --> 34:57.000 Once you do this, everything works.
34:57.000 --> 35:02.000 And you can pass the lender to every function you want,
35:02.000 --> 35:05.000 just writing — yeah, perfect.
35:05.000 --> 35:08.000 Just writing that. Three minutes now?
35:08.000 --> 35:10.000 I'm over anyway.
35:10.000 --> 35:13.000 Anyway, this is an interesting technique.
35:13.000 --> 35:16.000 If you need to do lenders, have a look at Sabrina Jewson's blog,
35:16.000 --> 35:20.000 or at the lender crate, which we are maintaining now,
35:20.000 --> 35:24.000 because the original author is no longer interested.
35:24.000 --> 35:27.000 We have a slightly more complicated problem,
35:27.000 --> 35:32.000 because we need the iterators
35:32.000 --> 35:35.000 returned by the lender to reference the data.
35:35.000 --> 35:37.000 So we need something horrible like this,
35:37.000 --> 35:39.000 which I'm not going to explain.
35:39.000 --> 35:42.000 But thanks to quinedot and the Rust language forum,
35:42.000 --> 35:44.000 we were able to write this, which is horrible.
35:44.000 --> 35:49.000 I never remember why it works, but it actually works and does the job.
35:49.000 --> 35:52.000 Okay, performance: amazing.
35:52.000 --> 35:56.000 These are various kinds of graphs and their sizes.
35:56.000 --> 36:00.000 And you can see the speedup of Rust with respect to the Java implementation:
36:00.000 --> 36:02.000 on the order of two, three times.
36:02.000 --> 36:05.000 They are fairly small, except the last one,
36:05.000 --> 36:06.000 because they are our graphs:
36:06.000 --> 36:10.000 we stopped doing crawls in 2015,
36:10.000 --> 36:12.000 because Common Crawl was taking the lead,
36:12.000 --> 36:15.000 and it was not really necessary to continue this activity.
36:15.000 --> 36:17.000 But anyway, conclusions.
36:18.000 --> 36:22.000 A fantastic journey, since April 2023.
36:22.000 --> 36:26.000 It was very surprising that we could redo all this —
36:26.000 --> 36:30.000 an enormous code base developed over 20 years —
36:30.000 --> 36:32.000 with graph traversals up and running.
36:32.000 --> 36:35.000 I would like to thank Valentin Lorentz
36:35.000 --> 36:38.000 at Software Heritage, who suggested a lot of improvements
36:38.000 --> 36:42.000 and generally was our guinea pig for everything.
36:42.000 --> 36:44.000 The Rust language forum,
36:44.000 --> 36:46.000 without which none of this would have happened.
36:46.000 --> 36:48.000 I mean, we got so many suggestions
36:48.000 --> 36:50.000 we would never have thought about
36:50.000 --> 36:52.000 that I cannot even count them.
36:52.000 --> 36:56.000 And my students, who also wrote a lot of code.
36:56.000 --> 36:58.000 And thank you — if you have any questions,
36:58.000 --> 36:59.000 the talk is over.
37:00.000 --> 37:01.000 Thank you.
37:07.000 --> 37:09.000 Thirty seconds for questions.
37:15.000 --> 37:16.000 Hi.
37:16.000 --> 37:18.000 So you have this huge data set,
37:18.000 --> 37:20.000 221 gigabytes.
37:20.000 --> 37:22.000 Yeah, I know.
37:24.000 --> 37:27.000 Was your thing bit-compatible
37:27.000 --> 37:30.000 with the Java version, or was it a different
37:30.000 --> 37:33.000 disk or memory representation?
37:38.000 --> 37:40.000 That's probably also a little bit dangerous.
37:42.000 --> 37:44.000 Oh, yeah.
37:44.000 --> 37:48.000 The Rust version is totally compatible with Java.
37:48.000 --> 37:51.000 It generates the same format, bit by bit.