WEBVTT

00:00.000 --> 00:07.000
Good morning everybody.

00:07.000 --> 00:24.000
I will show you how I programmed a bzip2 data compression encoder in Ada, from scratch, during two rainy weekends.

00:24.000 --> 00:35.000
So, my motivation was fun, first of all; a challenge; a bit of a warm-up for the latest Advent of Code.

00:35.000 --> 00:47.000
On the more serious side, I wanted to fill a gap in the Zip-Ada library: a bzip2 encoder was missing.

00:47.000 --> 00:52.000
There was already a decoder, but there was no encoder.

00:52.000 --> 01:00.000
This gap absolutely needed to be filled.

01:00.000 --> 01:17.000
And my expectations were quite low, because from experience this bzip2 scheme compresses only a few kinds of files better than, for instance, LZMA.

01:18.000 --> 01:29.000
The algorithm is mostly mechanical: on most steps, you have only one way to encode the data.

01:29.000 --> 01:32.000
But not on all steps.

01:32.000 --> 01:39.000
And bzip2 is a weakened version of the first bzip, so it's rather old.

01:40.000 --> 01:50.000
At the time, there were patents on arithmetic coding, and the author had to change to Huffman trees.

01:50.000 --> 02:02.000
But the results of my programming exercise were absolutely surprising. I got two very good surprises.

02:02.000 --> 02:07.000
Of course, I keep them for the end.

02:07.000 --> 02:14.000
So, I will explain the bzip2 compression scheme to you in 15 minutes.

02:14.000 --> 02:17.000
Some people are skeptical.

02:17.000 --> 02:28.000
Now, actually, bzip2 is completely unusual as a compression scheme. It's super simple. I promise you, it's simple.

02:28.000 --> 02:37.000
First of all, it's not a streaming, adaptive scheme. It doesn't use sophisticated techniques.

02:37.000 --> 02:51.000
You have run-length encoding twice, which is the most trivial way of compressing data: just repeated sequences of exactly the same symbol.

02:51.000 --> 02:54.000
And a few other techniques.

02:54.000 --> 03:09.000
So, once you have treated your block, on step 2, you output everything in one go. It's basically an offline algorithm.

03:09.000 --> 03:18.000
You see the Ada code in the green box; of course, the part in the red ellipse is missing, a few hundred lines.

03:18.000 --> 03:27.000
But basically, you have all the steps in this code. It's super simple.

03:27.000 --> 03:40.000
For instance, the first run-length encoding is not even very smart: you don't even have a special symbol to say, ah, now there is a run.

03:40.000 --> 03:51.000
Automatically, after four equal symbols, you have an extra byte saying: you have a run, and here is the extra length.

03:51.000 --> 03:59.000
So it's very simple, and you have a gain from the sixth symbol on.
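NOTE
Editor's sketch, not the Zip-Ada source: a minimal Ada rendering of the
first run-length encoding step as just described. After four equal bytes,
one extra byte automatically carries the count of further repetitions, so
a run of five breaks even and the gain starts with the sixth byte. All
names and the toy input are invented for illustration.

with Ada.Text_IO; use Ada.Text_IO;
procedure RLE_Sketch is
   Input  : constant String := "aaaaaaaabcc";
   Output : String (1 .. 2 * Input'Length);  --  safe upper bound
   Last   : Natural  := 0;
   I      : Positive := Input'First;
begin
   while I <= Input'Last loop
      declare
         Run : Natural := 1;
      begin
         --  Measure the run, capped at 4 literals + 255 extra repeats.
         while I + Run <= Input'Last
           and then Input (I + Run) = Input (I)
           and then Run < 4 + 255
         loop
            Run := Run + 1;
         end loop;
         --  Emit up to four literal copies ...
         for K in 1 .. Natural'Min (Run, 4) loop
            Last := Last + 1;
            Output (Last) := Input (I);
         end loop;
         --  ... then, after four equal bytes, a mandatory length byte.
         if Run >= 4 then
            Last := Last + 1;
            Output (Last) := Character'Val (Run - 4);
         end if;
         I := I + Run;
      end;
   end loop;
   Put_Line ("Encoded" & Natural'Image (Last) & " bytes from"
             & Natural'Image (Input'Length));
end RLE_Sketch;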
03:59.000 --> 04:14.000
Now, the core of the bzip2 algorithm is a mysterious transform called Burrows-Wheeler, which is, yeah, the history is also funny.

04:14.000 --> 04:27.000
I won't go into details, but basically you take the blue message on the top left, and you shift it, as you see in the left box.

04:27.000 --> 04:41.000
You sort all the lines, the red box contains the result of the sorting, and you output the red column, the rightmost column.

04:41.000 --> 04:47.000
That's the output of this transform. It's as simple as that.

04:48.000 --> 05:04.000
The fascinating thing is that you can reverse this transform. So this letter soup on the right can be reversed, if you know the index of the original sentence.

05:04.000 --> 05:14.000
So basically, you don't compress anything, and you even have a piece of extra information to carry, which is an integer.

05:14.000 --> 05:23.000
And you wonder: what does it have to do with data compression?

05:23.000 --> 05:34.000
The data on the right, the red thing, doesn't look, in this example, more compressed or compressible or whatever than the blue message.

05:34.000 --> 05:39.000
But let's have a look at a larger example.

05:39.000 --> 05:54.000
Here I take the source code of the encoder itself as test data, and you see there are lots of repetitions, which this transform creates.

05:54.000 --> 06:06.000
And that's, of course, the reason you can do compression with it: you have lots of repetitions, so you can do run-length encoding again.

06:06.000 --> 06:22.000
And before that run-length encoding, you do a move-to-front transform, which is again super simple.

06:22.000 --> 06:27.000
Imagine you have a card deck which is sorted at the beginning.

06:27.000 --> 06:35.000
If you take card number six, you output index number six, because of the original sorting.

06:35.000 --> 06:54.000
After that it becomes a bit more complicated, because the cards are progressively shuffled; but the interesting feature is that the cards that appear more often have a lower index, and that's very prone to compression.

06:54.000 --> 07:06.000
For instance, when card number six appears for the third time, its index is lower.
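NOTE
Editor's sketch, not the Zip-Ada source: the two middle steps just
described, on a toy string. The Burrows-Wheeler part sorts all rotations
naively (real encoders, bzip2 and Zip-Ada included, use much faster
specialized sorting), and the move-to-front part is the "card deck"
picture: each symbol is output as its current position in a deck that
starts sorted, so recently seen symbols get small indices.

with Ada.Text_IO;                       use Ada.Text_IO;
with Ada.Containers.Generic_Array_Sort;
procedure BWT_MTF_Sketch is
   Msg : constant String := "bananas_and_ananas";
   N   : constant Positive := Msg'Length;
   type Index_Array is array (Positive range <>) of Positive;
   Rot : Index_Array (1 .. N);
   --  Compare two rotations of Msg lexicographically.
   function Less (A, B : Positive) return Boolean is
   begin
      for K in 0 .. N - 1 loop
         declare
            Ca : constant Character := Msg ((A + K - 1) mod N + 1);
            Cb : constant Character := Msg ((B + K - 1) mod N + 1);
         begin
            if Ca /= Cb then
               return Ca < Cb;
            end if;
         end;
      end loop;
      return False;
   end Less;
   procedure Sort is new Ada.Containers.Generic_Array_Sort
     (Positive, Positive, Index_Array, "<" => Less);
   BWT : String (1 .. N);
begin
   for I in 1 .. N loop
      Rot (I) := I;
   end loop;
   Sort (Rot);
   --  The BWT output is the rightmost column: the character just
   --  before each sorted rotation's starting position.
   for I in 1 .. N loop
      BWT (I) := Msg ((Rot (I) + N - 2) mod N + 1);
   end loop;
   Put_Line ("BWT: " & BWT);
   --  Move-to-front over the BWT output.
   declare
      Deck : String (1 .. 256);
   begin
      for I in Deck'Range loop
         Deck (I) := Character'Val (I - 1);
      end loop;
      for C of BWT loop
         for P in Deck'Range loop
            if Deck (P) = C then
               Put (Integer'Image (P - 1));
               Deck (2 .. P) := Deck (1 .. P - 1);  --  shift up
               Deck (1) := C;                       --  move to front
               exit;
            end if;
         end loop;
      end loop;
      New_Line;
   end;
end BWT_MTF_Sketch;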
07:06.000 --> 07:18.000
And the final step, here it's a bit smarter, and it's more difficult: it's the entropy coding, so it's compression with Huffman trees.

07:18.000 --> 07:38.000
The author of this scheme had the bright idea to not have a single entropy code: you can choose between two to six Huffman trees, which are completely freely defined; of course, you try to define them efficiently.

07:38.000 --> 07:48.000
And also, for each group of 50 symbols, you can allocate one of these two to six trees.

07:49.000 --> 08:03.000
So here, you have lots of room for optimization. This is where the lemons are squeezed properly.
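NOTE
Editor's sketch of the table-allocation idea only, with hypothetical
hand-made tables rather than real Huffman trees built from the data:
every group of 50 symbols is simply given whichever of the available
tables encodes it in the fewest bits. A real encoder additionally
constructs the trees and iterates between assigning groups and
refitting the tables; that is the "room for optimization" mentioned
above.

with Ada.Text_IO; use Ada.Text_IO;
procedure Table_Choice_Sketch is
   subtype Symbol is Natural range 0 .. 255;
   type Bit_Length is range 1 .. 20;
   --  A table here is just the code length of each symbol.
   type Length_Table is array (Symbol) of Bit_Length;
   type Table_Id is range 1 .. 6;
   type Table_Set is array (Table_Id) of Length_Table;
   Group_Size : constant := 50;
   type Group is array (1 .. Group_Size) of Symbol;
   --  Bits needed to encode group G with table T.
   function Cost (G : Group; T : Length_Table) return Natural is
      Bits : Natural := 0;
   begin
      for S of G loop
         Bits := Bits + Natural (T (S));
      end loop;
      return Bits;
   end Cost;
   function Best_Table (G : Group; Tables : Table_Set) return Table_Id is
      Best : Table_Id := 1;
   begin
      for T in Table_Id loop
         if Cost (G, Tables (T)) < Cost (G, Tables (Best)) then
            Best := T;
         end if;
      end loop;
      return Best;
   end Best_Table;
   Tables : Table_Set;
   G      : Group := (others => 0);
begin
   --  Toy tables: table 1 favours small symbols, the others are flat.
   for T in Table_Id loop
      for S in Symbol loop
         Tables (T) (S) := (if T = 1 and then S < 16 then 4 else 8);
      end loop;
   end loop;
   Put_Line ("Group of zeros goes to table"
             & Table_Id'Image (Best_Table (G, Tables)));
end Table_Choice_Sketch;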
08:03.000 --> 08:23.000
So, the results of my exercise, after the last rainy Sunday. To say it politely, I would say: "what the...", okay, but there's a camera, so, like:

08:23.000 --> 08:40.000
"Holy cow." On the third bar, you see the compression of the original bzip2 scheme, the one that you find everywhere, especially on Linux.

08:40.000 --> 08:49.000
So here, the zip utility is using this algorithm, if you take the right switch.

08:49.000 --> 08:57.000
In the middle, you have 7-Zip. You know, everybody uses 7-Zip.

08:57.000 --> 09:12.000
So you tell it: ah, take bzip2 encoding, here you go. And on the left, here is the compression with the new encoder I just wrote.

09:12.000 --> 09:25.000
So I was totally surprised that it compresses better than these tools that have been developed for decades.

09:25.000 --> 09:27.000
Yeah.

09:27.000 --> 09:46.000
To be mentioned, again: bzip2 is very good for human-written text, like e-books and also source code; of course not automatically generated source code, but human-written source code.

09:46.000 --> 09:59.000
It's less good for other things, but for these things it's very good in general, and surprisingly, my implementation is better.

09:59.000 --> 10:10.000
And the second surprise: what happens if you have a mix of different data types?

10:10.000 --> 10:31.000
There is a lot of benchmark data, test data, called a corpus, like the classic Canterbury corpus, where you have images and other kinds of things, and also source code and e-books.

10:31.000 --> 10:48.000
And the second surprise was that even in this context, by using the Zip-Ada preselection option, which picks the right algorithm depending on the data type,

10:48.000 --> 11:00.000
the overall archive size is smaller not only than 7-Zip with the maximum compression for a zip archive,

11:00.000 --> 11:15.000
but also, the second bar in the bottom box, than the 7z archive, which combines all the files together.

11:15.000 --> 11:22.000
So that was also a super rejoicing surprise.

11:22.000 --> 11:32.000
And so, since we are in the Ada devroom: what are the benefits of Ada?

11:32.000 --> 11:45.000
Apart from the fact that you can make a winning encoder in two days, in three days, so quickly.

11:45.000 --> 11:53.000
I would say, regarding data compression: it's very difficult to debug.

11:53.000 --> 12:14.000
If you have a bug, you almost have to start from scratch, because the output looks random, and wrong random-looking data doesn't look any better or worse than correct random-looking data.

12:14.000 --> 12:22.000
And Ada really does its best to help you do things right the first time.

12:22.000 --> 12:29.000
It finds tons of dumb mistakes, programming mistakes.

12:29.000 --> 12:43.000
And you can create as many custom types as you want; I've listed the range types in the box here.

12:43.000 --> 13:01.000
Also, if you use these types, you cannot mix an index with an element too easily, and that saves you sometimes.

13:01.000 --> 13:15.000
Plus, you can define array types over range types, which are very much tailor-made ranges.

13:15.000 --> 13:27.000
And in compression, you always have very strictly delimited ranges, so it's perfect for that.

13:27.000 --> 13:39.000
And two of the types listed there are data-dependent: they depend on the data you are compressing.

13:39.000 --> 13:54.000
So in a way you actually have dynamic types in Ada: they depend on some features of the data.

13:55.000 --> 14:01.000
So that's it. Questions?

14:12.000 --> 14:27.000
[Question about runtime and memory usage.] So, yeah, runtime and memory usage. I didn't try to save too much on memory.

14:27.000 --> 14:38.000
So you have a few megabytes of memory usage. You could squeeze it to perhaps one megabyte by reusing things.

14:38.000 --> 14:44.000
But nowadays, a few megabytes is not so much, yeah.

14:48.000 --> 14:58.000
[Question:] "So, how does it scale?" So, how does it scale.

14:58.000 --> 15:09.000
This format is by design limited to blocks of 900 kilobytes.

15:09.000 --> 15:22.000
There is a reason: you remember the strings that were sorted; the sorting of these strings explodes with the size.

15:22.000 --> 15:29.000
So you have to slice your data into blocks of maximum 900 kilobytes.

15:29.000 --> 15:41.000
So it's very easy with memory: you can allocate as much as you want, you don't need to be too sparing.

15:41.000 --> 15:54.000
But on runtime, there are some improvements to do: I'm using the standard Ada sorting, and I should use a more specialized sorting for that.

15:54.000 --> 16:06.000
But that's for later. It runs conveniently fast on most data, I would say. Similar to 7-Zip.

16:06.000 --> 16:10.000
[Question:] "When you say 900 kilobytes, is it a window?"

16:11.000 --> 16:21.000
No, it's not a window: you take the first 900 kilobytes, you process it, you take the next one, and so on.

16:21.000 --> 16:31.000
It's very special: it's not a streaming compression scheme.

16:31.000 --> 16:39.000
You really take a block, you process it offline, you output it, and so on and so on.

16:40.000 --> 16:43.000
Yeah, question?

16:43.000 --> 17:10.000
[Mostly inaudible question about the first chart: what do the bars show, the size?]

17:10.000 --> 17:20.000
So this one, it's just bytes, compressed bytes. Of course, the scale doesn't start from zero, but you cannot compress to zero anyway.

17:20.000 --> 17:29.000
Oh, one minute left.

17:29.000 --> 17:39.000
So I have to wrap it up, and we can start with the next one.
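NOTE
Editor's sketch of the "data-dependent range types" point made near the
end of the talk (around 13:30); the names are invented. An Ada subtype
bound may be computed from the block being compressed, so arrays indexed
by it are exactly as large as the data allows, and out-of-range indices
are caught at runtime.

with Ada.Text_IO; use Ada.Text_IO;
procedure Dynamic_Ranges_Sketch is
   Block : constant String := "abracadabra";
   --  Count the distinct symbols actually present in the block.
   function Distinct return Positive is
      Seen  : array (Character) of Boolean := (others => False);
      Count : Natural := 0;
   begin
      for C of Block loop
         if not Seen (C) then
            Seen (C) := True;
            Count := Count + 1;
         end if;
      end loop;
      return Count;
   end Distinct;
   --  This subtype's upper bound depends on the actual data: an MTF
   --  index can never reach the distinct-symbol count of the block.
   subtype MTF_Value is Natural range 0 .. Distinct - 1;
   type Histogram is array (MTF_Value) of Natural;
   H : Histogram := (others => 0);
begin
   Put_Line ("MTF values for this block fit in 0 .."
             & Natural'Image (MTF_Value'Last));
   H (0) := H (0) + 1;
   --  H (5) would raise Constraint_Error for this particular block.
end Dynamic_Ranges_Sketch;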