WEBVTT 00:00.000 --> 00:11.400 Now that it is, you know what I'm doing, it is time for the last talk today. Remy is 00:11.400 --> 00:16.120 unfortunately sick, but Baker is going to take over because he worked with Remy on this 00:16.120 --> 00:22.400 project, making dnsdist actually fit on tiny devices with not a lot of CPU, and 00:23.400 --> 00:34.200 so enjoy. Yes, thank you. Thank you. As Peter said, I worked with Remy on this project. This slide 00:34.200 --> 00:40.280 deck was written by him and focuses on the bits he did. I will just try to get through them as 00:40.280 --> 00:52.320 well as I can. What is dnsdist? Who here knows what dnsdist is? Who doesn't? Okay, a couple 00:52.320 --> 00:58.120 of people. dnsdist is a DNS proxy. I think there's a diagram. Right, so most people 00:58.120 --> 01:02.520 deploy it like this. You have a bunch of clients. These might be laptops or resolvers on the 01:02.520 --> 01:08.840 internet. They speak to dnsdist, which load-balances traffic to a bunch of backends. This is 01:08.840 --> 01:16.560 the common dnsdist deployment model at ISPs and hosters, etc. However, dnsdist can 01:16.560 --> 01:23.160 also sit on the client end. Imagine in this case dnsdist living on your laptop or on your 01:23.160 --> 01:28.760 home router or whatever, talking to a backend resolver somewhere out there, your ISP's or 01:28.760 --> 01:36.320 Quad9 or whatever. And it is that use case for which dnsdist on your router might 01:36.320 --> 01:44.960 make sense. So in 2021, a bunch of people sat down and said that dnsdist would 01:45.040 --> 01:52.520 be great because it would enable DNS encryption from the router to the provider or Quad9, 01:52.520 --> 01:58.180 Quad8, whatever. And dnsdist already supports all the encrypted protocols, unlike 01:58.180 --> 02:04.000 dnsmasq, which doesn't support as many. 
So we said, well, it's not that big, so we can 02:04.000 --> 02:13.240 just do that. Well, we were wrong. So in 2022, OpenWrt supported devices as small as this: 02:13.280 --> 02:21.720 four megabytes of flash, 32 megabytes of RAM. In 2025, four megabytes is unlikely to even fit 02:21.720 --> 02:29.000 a useful kernel, but back then they really tried. So we bought a bunch of these TP-Link boxes. 02:29.000 --> 02:34.760 There was a picture, but it looks like any TP-Link router you've seen: it's black, with three antennas. 02:35.640 --> 02:40.840 It's slightly better specced. It has like four times the numbers I mentioned. 02:42.680 --> 02:49.400 So would we be able to boot a kernel, run OpenWrt on it, have a web interface, and fit 02:49.400 --> 03:01.640 dnsdist in with that? Turns out dnsdist is not small. The binary size in a compilation that relies on 03:01.720 --> 03:09.080 shared libraries is like 9 megabytes today, plus the shared libraries, which may not be present 03:09.080 --> 03:21.320 on your router. So we realized that we were underestimating, or rather overestimating, our ability to fit 03:21.400 --> 03:28.680 inside a router, and also we didn't know what we were doing to begin with. On 03:28.680 --> 03:36.600 OpenWrt, two things matter: the size of your binary in the flash file system, 03:36.600 --> 03:42.840 which uses compression, and the amount of memory you use that is only yours. 03:42.840 --> 03:54.520 So this memory definition is called the proportional set size, and it turns out it is 03:54.520 --> 03:59.240 super hard to measure. Like all things in memory, you would think if you have a device with 03:59.240 --> 04:04.520 16 megabytes of memory, counting a few bytes here and there would become easier, and it is easier, 04:04.920 --> 04:12.760 but it's still not easy. 
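As a rough illustration of the measurement problem described above: on Linux, the kernel reports a proportional set size per mapping in /proc/<pid>/smaps, where pages shared with other processes are counted fractionally. The Python sketch below simply sums those Pss fields. It is not the counting method used in the project, which the talk does not spell out; the helper names and the sample excerpt are made up for illustration.

```python
import re

# Matches lines such as "Pss:                 12 kB" in /proc/<pid>/smaps.
# Anchoring at the line start skips related fields like "SwapPss:".
_PSS_LINE = re.compile(r"^Pss:\s+(\d+)\s+kB$", re.MULTILINE)


def total_pss_kb(smaps_text: str) -> int:
    """Sum the Pss field of every mapping in a smaps dump, in kB."""
    return sum(int(kb) for kb in _PSS_LINE.findall(smaps_text))


def read_pss_kb(pid: int) -> int:
    """Read /proc/<pid>/smaps (Linux only) and return the total PSS in kB."""
    with open(f"/proc/{pid}/smaps", encoding="ascii") as handle:
        return total_pss_kb(handle.read())


# Hypothetical two-mapping excerpt, trimmed to the fields used here.
SAMPLE = """\
00400000-00452000 r-xp 00000000 08:02 173521 /usr/bin/example
Size:                328 kB
Rss:                 200 kB
Pss:                 100 kB
7f0000000000-7f0000021000 rw-p 00000000 00:00 0 [heap]
Size:                132 kB
Rss:                  64 kB
Pss:                  64 kB
"""

print(total_pss_kb(SAMPLE))  # 100 + 64 = 164
```

In the sample, the binary's text mapping has 200 kB resident but only 100 kB of PSS because its pages are assumed to be shared with a second process, while the heap is counted in full. On a real system, calling read_pss_kb with a process ID gives the live figure.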
Your memory is split up into a bunch of areas that all have different properties: 04:13.000 --> 04:17.960 some are just yours, some are swapped in from your compressed binary, some are 04:19.240 --> 04:25.160 swapped in from a shared library that you may or may not be the only user of, or that part of the 04:25.160 --> 04:31.480 library may or may not be just for you. So it turns out it's quite hard to actually count these things. 04:33.160 --> 04:40.680 But we managed to define a way of counting, and at that point, this is what we found. 04:40.680 --> 04:46.120 dnsdist itself needed about two megabytes of memory, libcrypto also needed about two 04:46.120 --> 04:52.760 megabytes of memory, and then we needed another one and a half megabytes for other libraries we 04:52.760 --> 04:59.480 needed, including libssl. You'll notice that OpenSSL is actually the biggest memory user 04:59.480 --> 05:07.080 in this scenario. And then we needed two more megabytes just for storing our data, state, etc. 05:07.560 --> 05:15.400 So that's quite a bit more than four megabytes. So we tried a few easy things. 05:16.520 --> 05:22.760 The dnsdist defaults are for big setups, machines with four gigabytes of RAM, handling thousands of 05:22.760 --> 05:28.040 clients, etc. Your home router is not like that. So you take all these numbers that are defaults 05:28.040 --> 05:34.120 and tune them down. Only 50 outgoing queries at a single time, a single TCP thread, a single 05:34.120 --> 05:42.040 DoH thread, very small buffers in which we keep track of recent queries, etc. This helps some. 05:43.320 --> 05:51.240 Next up, dnsdist has a shit ton of features, many of which you might not need on your home 05:51.240 --> 05:58.040 router, like the CDB or LMDB support. Those were built for doing million-entry block lists, 05:59.000 --> 06:05.880 which you may want on your router, but not everybody would want it. 
So we took the OpenWrt 06:06.440 --> 06:13.080 dnsdist package and added a whole bunch of extra flags to it to allow you to get rid of features 06:13.080 --> 06:20.040 completely at compile time. Then there's the compiler and linker flags. Those influence binary 06:20.040 --> 06:27.640 size and memory usage a lot. As most of you would know, you can tell GCC or Clang or whatever to 06:27.720 --> 06:33.560 optimize at level zero, one, or two (-O0, -O1, -O2). There's a funny little other one called optimize 06:33.560 --> 06:41.160 for size (-Os), which does something between one and two, I think, except when it would make the binary 06:41.160 --> 06:48.520 bigger. And this helps some. Hiding symbols just reduces the size of symbol tables, which is helpful. 06:48.520 --> 06:54.200 Link-time optimization is quite an interesting one because it removes dead code, 06:54.840 --> 07:00.920 even from libraries you might be linking statically. And I'll get to it later, but linking libraries 07:00.920 --> 07:07.800 statically might actually reduce disk usage because you can get rid of code that nobody's using. 07:09.880 --> 07:17.320 The bottom one was a bit of a pity. Disabling position-independent, I think it's executables (PIE), 07:17.320 --> 07:28.040 because PIE offers more security, but it makes the binary bigger. So that was quite good. 07:28.040 --> 07:34.520 The binary dropped below two megabytes, compressed by the file system choices OpenWrt makes, 07:35.240 --> 07:39.320 and memory dropped a bit as well. So that was decent. 07:41.160 --> 07:46.120 Okay, right. So then let's figure out what is happening: where is all this memory going? 07:47.400 --> 07:52.920 The heap is the most important bit because it can't be swapped out or swapped in: most routers do not have 07:52.920 --> 07:57.960 swap configured. And of course, the binary and the libraries are basically swapped in and out 07:57.960 --> 08:02.280 when necessary, but heap memory will just sit there in physical memory. 
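The split between memory that the kernel can drop and re-read from the binary or a library, and memory that has to stay resident when no swap is configured, can be approximated from /proc/<pid>/smaps. The Python sketch below is a simplification under stated assumptions: it treats every mapping with a non-zero inode as file-backed and everything else (heap, stacks, anonymous mmap) as anonymous, and it ignores that pages written to in a private file mapping also behave like anonymous memory. The function name and the sample are hypothetical.

```python
import re

# Header line of one mapping in /proc/<pid>/smaps:
# address range, permissions, offset, device, inode, optional path.
_HEADER = re.compile(
    r"^[0-9a-f]+-[0-9a-f]+\s+\S+\s+[0-9a-f]+\s+\S+\s+(\d+)\s*(.*)$"
)
_RSS = re.compile(r"^Rss:\s+(\d+)\s+kB$")


def rss_by_backing(smaps_text: str) -> dict:
    """Return resident kB split into file-backed and anonymous mappings."""
    totals = {"file": 0, "anonymous": 0}
    kind = None  # backing type of the mapping currently being read
    for line in smaps_text.splitlines():
        header = _HEADER.match(line)
        if header:
            # Assumption: inode 0 means there is no backing file.
            kind = "file" if int(header.group(1)) != 0 else "anonymous"
            continue
        rss = _RSS.match(line)
        if rss and kind is not None:
            totals[kind] += int(rss.group(1))
    return totals


# Hypothetical excerpt: one mapping of the binary, one heap mapping.
SAMPLE = """\
00400000-00452000 r-xp 00000000 08:02 173521 /usr/bin/example
Size:                328 kB
Rss:                 200 kB
7f0000000000-7f0000021000 rw-p 00000000 00:00 0 [heap]
Size:                132 kB
Rss:                  64 kB
"""

print(rss_by_backing(SAMPLE))  # {'file': 200, 'anonymous': 64}
```

On a router without swap, the anonymous total is the part that cannot be reclaimed under memory pressure, which is why the heap gets the attention in what follows.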
08:06.040 --> 08:10.440 So Remy used Valgrind to investigate some of that. 08:11.400 --> 08:21.000 I have to say I haven't seen the picture before myself, and it also might be hard to read, but the 08:21.000 --> 08:29.160 big reddish thing at the bottom is CRYPTO_malloc. So a lot of memory is really going to 08:29.160 --> 08:36.120 libcrypto allocating things, keeping state around for whatever reason. Then there's heaptrack, 08:37.000 --> 08:41.320 which makes these nice flame graphs of where memory is being allocated, 08:43.320 --> 08:46.120 which I also have yet to use myself, so I can't tell you much about it. 08:48.440 --> 08:55.160 But realizing that OpenSSL was doing a lot of the allocating helped us strip some more 08:57.080 --> 09:04.600 unused stuff from the binary. So we don't load ciphers and digests we don't need, nor error messages. 09:04.680 --> 09:09.480 OpenSSL, although they're not quite good, can give quite extensive error messages, 09:09.480 --> 09:16.360 and we figured we don't really need those. Apparently, some things would allocate big and then 09:16.360 --> 09:23.240 be shrunk, which we could skip. And there's links to all the code for these changes down there. 09:24.840 --> 09:30.600 Then there's libh2o, which at the time we used to offer DoH. That library is dead, 09:30.600 --> 09:35.880 it's not being maintained anymore in any useful capacity, but back then it is what we had. 09:36.920 --> 09:42.760 It turned out that library also contains a bunch of things that we didn't need for running on a 09:42.760 --> 09:51.400 router. And indeed, as the slide also mentions, we no longer use h2o. We now use nghttp2, which sadly is 09:51.400 --> 09:57.880 what everybody uses for DoH. So there's a bit of an ecosystem problem there in that if nghttp2 09:57.960 --> 10:02.440 has a bug, then all open source DoH implementations out there will have that bug. 
10:04.040 --> 10:09.480 So I hope something else arrives at some point, but right now this is the state of things. 10:10.920 --> 10:16.680 And again, we reduced a buffer because 8 kilobytes is more than enough for most DoH requests. 10:18.760 --> 10:25.640 We tried using wolfSSL, which is a nice project, and it has an OpenSSL compatibility layer. 10:26.520 --> 10:36.040 But it did not really reduce memory, and adding this extra dependency does not help the moment 10:36.040 --> 10:41.160 some other program does want OpenSSL on the same system, because then you have both libraries. 10:44.120 --> 10:47.480 So that switch did not do anything for us, but it was worth a shot. 10:47.640 --> 10:58.920 We tried UPX, which compresses binaries, not in the file system layer, but on a 10:58.920 --> 11:04.040 different layer. The problem with that is that it decompresses into memory, or if you're 11:04.040 --> 11:10.520 unlucky even onto a temp file system, which means you actually lose the benefits of demand paging. 11:10.520 --> 11:18.520 All right, next step is to try even harder. There's this tool called Bloaty, I think there's 11:18.520 --> 11:27.320 output here, yeah. So we build a binary with the features we want, we copy it, we strip it, 11:27.320 --> 11:33.000 and we run Bloaty on the stripped copy, because that's the one we want to measure sizes in, 11:33.000 --> 11:36.120 but we use the original binary on the side to steal the debug symbols. 11:37.080 --> 11:46.040 And we found out that a lot of our memory was going to Lua. Lua is the programming language we use for 11:46.040 --> 11:50.920 writing dnsdist configurations in, so it cannot just go, but still perhaps there was some 11:52.840 --> 12:00.120 room we could get back. 
So we realized that some of the structures in memory were padded 12:00.120 --> 12:11.640 inefficiently for memory purposes, and we realized that preventing false sharing, which means 12:11.640 --> 12:19.000 having unrelated variables not living close together in memory, costs a lot of memory. 12:19.000 --> 12:25.560 So if you put those variables together, you can save memory at the cost of some performance on 12:25.640 --> 12:31.800 big multi-threaded machines, which your router is not. Then there's the number of threads. 12:31.800 --> 12:37.880 The config I showed earlier did some of this, but it turns out we could strip even more threads 12:37.880 --> 12:42.600 from the process, which saves a lot of memory, because each thread comes with a stack, 12:42.600 --> 12:46.440 and the things it is doing also include a lot of state, of course. 12:46.920 --> 12:58.040 So for OpenWrt, we implemented a simpler thread model, where for many things, we didn't even have 12:58.040 --> 13:08.520 multiple threads capable of doing the handling, but just one. Then we found out that we were 13:08.680 --> 13:18.360 fragmenting memory, and as Andre said, the best way to synchronize threads is to not synchronize 13:18.360 --> 13:24.760 them; the best way to allocate memory is to not allocate it. So it turned out there were a bunch 13:24.760 --> 13:31.480 of allocations we could get rid of. The Lua garbage collector is not very aggressive by default, 13:31.480 --> 13:37.320 so we now trigger it a couple of times during startup, which just goes and cleans up all those temporary 13:37.400 --> 13:43.960 objects that we don't need anymore. And of course, the number of threads helps a lot. 13:45.080 --> 13:52.760 We also tried linking the C++ standard library in. 
This made memory usage slightly smaller, 13:52.760 --> 13:59.800 but it means we now are the distributor of libstdc++, and again, if a second program wants to use 13:59.800 --> 14:08.120 the same library, then we are causing more memory usage, which is a waste. So that was roughly where 14:08.120 --> 14:15.560 this ended up. We were slightly over target, but way smaller than before. The return on investment 14:15.560 --> 14:26.280 of other tricks we thought we could do would be small. So we set up an OpenWrt feed on our website, 14:26.360 --> 14:35.080 and we have contributed all of this upstream to the OpenWrt project, and this is the current status 14:35.640 --> 14:42.280 of dnsdist for the current OpenWrt stable release. The binary compressed is one and a half 14:42.280 --> 14:48.120 megabytes, or five and a half uncompressed, which means five and a half maximum memory use for mapping 14:48.120 --> 14:54.440 the binary in. That's with all the features. Yes, so we now have two builds, a full one 14:54.520 --> 15:01.400 and a not-full one. The difference is roughly the list I showed before: CDB, LMDB, etc. 15:02.760 --> 15:09.480 And the memory usage total, so that's binary, libraries and heap, is around four megabytes now, 15:10.760 --> 15:16.280 which is still a lot on your home router, but it is a lot better than it was before. 15:17.480 --> 15:23.240 While we were doing this, we also added UCI integration. If you ever worked with OpenWrt, 15:23.400 --> 15:29.160 you will have seen UCI or LuCI, which allows you to configure the software, the daemons on your 15:29.160 --> 15:34.600 system, and before we did that all you could do was edit the dnsdist config, and there was no integration. 15:35.480 --> 15:41.960 So this will make things a lot better for OpenWrt users with dnsdist. It's quite a big PR, 15:41.960 --> 15:45.800 so they haven't gotten around to reviewing it yet, but I'm sure they will soon. 15:46.760 --> 15:53.560 Here's an example of the UCI config. 
I'm not even sure that's big enough to read here, 15:53.560 --> 16:02.360 but you can read it later. Oh, this is a fun one. We also added DDR, designated... 16:03.400 --> 16:04.200 Help me. 16:04.440 --> 16:06.440 Okay. 16:12.280 --> 16:18.120 Designated resolvers... it's Discovery of Designated Resolvers, right? Right. Yeah. 16:18.120 --> 16:22.760 This allows dnsdist to tell your clients, like your iPhones and your Android devices, 16:23.400 --> 16:29.160 that they can use DoT or DoH to talk to your dnsdist, and you get a situation where 16:29.960 --> 16:36.440 DNS is encrypted from your mobile to your router, and then encrypted from your router to some upstream, 16:37.960 --> 16:44.040 which is better than no encryption, but it is also a bit weird to have that re-encryption step in between. 16:44.040 --> 16:47.320 However, this allows you to do filtering on that device. 16:51.400 --> 16:52.520 And that is the last slide. 16:52.920 --> 16:54.520 Yes. 16:55.560 --> 16:57.560 Question. 17:01.080 --> 17:02.280 Thank you. Yes. 17:10.280 --> 17:13.320 If I could remove the Lua part, how much memory would I save? 17:14.600 --> 17:21.800 Not a lot, because OpenWrt already ships Lua, although they're getting rid of that. 17:21.880 --> 17:27.480 So it might be an interesting experiment for the next OpenWrt version. 17:36.920 --> 17:41.080 Have we tried linking statically against libssl with LTO enabled? 17:43.480 --> 17:51.000 I think we did, but given that OpenSSL is already installed, this will never give us benefits. 17:51.480 --> 17:57.000 If OpenSSL is not installed, and we knew we were the only user, this would probably be the right 17:57.000 --> 17:59.240 course of action. 18:03.240 --> 18:09.240 If I use dnsdist on OpenWrt, am I still using dnsmasq, so dnsdist 18:09.240 --> 18:15.960 just sits between dnsmasq and my provider, or do you replace dnsmasq? 
18:16.440 --> 18:22.760 So the way I've been running this at home for quite some time is I, sorry, the question is, does this 18:22.760 --> 18:30.120 replace dnsmasq, or sit beside it? It could do either. You need a DHCP server. 18:30.840 --> 18:35.400 However, you can also use odhcpd, which is actually a better DHCP server, I found. 18:36.200 --> 18:41.400 So I've been running it at home. My office has been behind an OpenWrt box 18:41.480 --> 18:46.760 with no dnsmasq on it for quite some time, using odhcpd and dnsdist. 18:48.360 --> 18:53.720 And some of the UCI integration we did also offers hostname 18:53.720 --> 18:58.600 resolving for the LAN, where dnsdist pulls that information from the DHCP server. 19:02.280 --> 19:08.040 I guess that was the question earlier, but from Matrix, the question is: is the DHCP integration, 19:08.360 --> 19:12.360 which is in dnsmasq, implemented in dnsdist somehow? 19:12.360 --> 19:17.480 Right, I did indeed just answer exactly that, very good. Anybody else? 19:20.360 --> 19:21.880 Okay, thank you all.