WEBVTT 00:00.000 --> 00:19.760 So, my presentation is about DNS, but not only about DNS, and actually 00:19.760 --> 00:24.360 this idea originated as an internal company contest. 00:24.360 --> 00:29.340 We had a contest for every employee of our company: take any data 00:29.340 --> 00:35.600 that you like and make an application out of it. It was just for one or two days, and 00:35.600 --> 00:42.040 apparently I was the only participant in this contest, so I had a lot of fun, and I have 00:42.040 --> 00:43.040 this presentation. 00:43.040 --> 00:56.280 So, I need an idea, I need a nice data set, and I need a way to visualize and analyze 00:56.280 --> 01:02.080 this data set. The easiest way to get a data set is to get it from the internet, and 01:02.080 --> 01:10.040 we will get an internet-scale data set. So how do we get it from DNS? First, the basics. 01:10.040 --> 01:14.280 You can do forward DNS requests, and the easiest way to do it from the command line is to 01:14.280 --> 01:20.760 just use the host command-line tool. For Google it will output an IPv4 address, an IPv6 address, 01:20.760 --> 01:26.940 and something else. You can also do a reverse DNS request for an IP address, and it will output something, 01:26.940 --> 01:36.520 for example, the name. But for the same Google IP address it outputs something 01:36.520 --> 01:44.240 strange: 1e100.net. A question for you: what does 1e100 mean? 01:44.240 --> 01:53.860 1e100 is the scientific notation for the number called a googol. So the point is, reverse 01:53.860 --> 02:00.460 DNS records don't have to point to the original name. They actually cannot, because 02:00.460 --> 02:08.860 a lot of names can point to a single address, and actually they can contain just about anything.
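As a rough illustration of the forward and reverse requests described above, here is a minimal Python sketch using only the standard library. The helper names `ptr_name` and `reverse_lookup` are made up for this example; the talk itself uses the `host` command-line tool, not Python.

```python
import ipaddress
import socket
from typing import Optional


def ptr_name(ip: str) -> str:
    """Return the DNS name that a reverse (PTR) request for `ip` queries."""
    # For IPv4 this reverses the octets and appends ".in-addr.arpa".
    return ipaddress.ip_address(ip).reverse_pointer


def reverse_lookup(ip: str) -> Optional[str]:
    """Ask the system resolver for the PTR name of `ip`, or None if there is none."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
        return hostname
    except OSError:
        # No PTR record, or the resolver could not be reached.
        return None


if __name__ == "__main__":
    print(ptr_name("8.8.8.8"))
    # Needs network access; the answer is whatever the address owner published.
    print(reverse_lookup("8.8.8.8"))
```

Note that `reverse_lookup` returns whatever string the owner of the address put into the PTR record, which is exactly why the result for a Google address can look unrelated to google.com.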
02:08.860 --> 02:15.580 You can also use the more advanced tool dig, where you specify the type of record 02:15.580 --> 02:22.340 you want to get, and you will get the answer from your DNS server. 02:22.340 --> 02:29.500 And now, a question for you: if PTR records don't have to be precise, if they can contain 02:29.500 --> 02:37.100 anything, an arbitrary string, and you can't count on these records to contain what you need, 02:37.100 --> 02:38.540 why do they even exist? 02:38.540 --> 02:47.340 For mail servers, mostly, for sending emails, and also for nice infrastructure 02:47.340 --> 02:48.340 observability. 02:48.340 --> 02:56.420 So when you run something like mtr, which is like traceroute, you will also get these names from 02:56.420 --> 03:00.140 reverse DNS records. 03:00.140 --> 03:09.820 And the idea is to take IPv4, which contains slightly less than 4 billion 03:09.820 --> 03:17.100 IP addresses, and just do reverse DNS requests for all of them. 03:17.100 --> 03:19.780 So how to do it? 03:19.780 --> 03:28.140 There are a lot of tools ready for that, like MassDNS, or there is a library, GNU adns. 03:28.140 --> 03:34.900 They are fast. In the README of MassDNS they say that if you run this tool, you will 03:34.900 --> 03:38.660 get all the results in less than 1 hour. 03:38.660 --> 03:47.420 The problem is, if you do it from home, most likely your internet will break, or your 03:47.420 --> 03:53.780 internet provider will cut you off, and you will have to call them. 03:53.780 --> 04:02.460 If you do it from your hosting provider, from AWS, from DigitalOcean, whatever, it is also 04:02.460 --> 04:06.940 not a good idea. They don't like it. 04:06.940 --> 04:16.140 Even if they don't restrict it, it is better to communicate in advance. 04:16.140 --> 04:20.940 If you contact them, explain your use case, and explain what exactly you will do, they will 04:20.940 --> 04:23.580 not worry.
04:23.580 --> 04:28.700 But I am an introvert, I don't like communicating with anyone. 04:28.700 --> 04:33.660 So I wanted to find another way to do it. 04:34.660 --> 04:39.460 Yes, so from home, not a good idea. From a data center, not a good idea: my friend was 04:39.460 --> 04:45.700 blocked, because I asked him to do it, and he was blocked, not me. From the cloud, also 04:45.700 --> 04:47.700 not a good idea. 04:47.700 --> 04:52.140 But let's take a look at how DNS can work. 04:52.140 --> 04:54.940 What protocols exist? 04:54.940 --> 05:01.540 It can work over UDP, port 53: it is unreliable, it is not secure. It can work over 05:01.540 --> 05:09.940 TCP: it is not secure, but more reliable. There is DNS over TLS: reliable, secure, but 05:09.940 --> 05:17.060 heavyweight. There is also an interesting protocol, DNS over HTTPS. 05:17.060 --> 05:29.260 It is just an HTTPS API, where you connect and request a record, and it will give you the result 05:29.260 --> 05:37.300 in one of several formats, including the original binary format, or JSON for convenience. 05:37.300 --> 05:43.700 One of those is the Cloudflare DNS over HTTPS server, there is a Google DNS over HTTPS 05:43.700 --> 05:50.420 server, probably many more, and it is interesting that you can even use it without 05:50.420 --> 05:51.420 the domain name. 05:51.420 --> 05:59.940 If you use a Cloudflare domain name, you have to resolve this name in advance using 05:59.940 --> 06:04.100 a different method, like writing it in the hosts file. 06:04.100 --> 06:10.060 But you can write https://1.1.1.1, which is the address of this DNS 06:10.060 --> 06:13.860 server, and it looks quite nice. 06:13.860 --> 06:23.100 There is also DNSCrypt over TCP and over UDP, and it is actually a mess, because 06:23.100 --> 06:28.860 even for encrypted DNS, you can use more than three different protocols. 06:28.860 --> 06:37.540 I decided to check if I can use Cloudflare DNS over HTTPS.
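To make the DNS over HTTPS idea concrete, here is a minimal Python sketch of a JSON request against the 1.1.1.1 endpoint mentioned above. It assumes the `/dns-query` path with `name` and `type` query parameters and the `accept: application/dns-json` header, as in Cloudflare's public JSON API; the function names `doh_url` and `doh_query` are made up for this example and are not from the talk.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Cloudflare's JSON endpoint, addressed by IP as described in the talk.
DOH_ENDPOINT = "https://1.1.1.1/dns-query"


def doh_url(name: str, record_type: str = "PTR", endpoint: str = DOH_ENDPOINT) -> str:
    """Build the URL for one DNS over HTTPS request in JSON mode."""
    query = urllib.parse.urlencode({"name": name, "type": record_type})
    return f"{endpoint}?{query}"


def doh_query(name: str, record_type: str = "PTR") -> dict:
    """Send the request and return the decoded JSON answer (needs network access)."""
    request = urllib.request.Request(
        doh_url(name, record_type),
        headers={"accept": "application/dns-json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


if __name__ == "__main__":
    # A reverse request for 8.8.8.8 is a PTR request for 8.8.8.8.in-addr.arpa.
    print(json.dumps(doh_query("8.8.8.8.in-addr.arpa"), indent=2))
```

Because the endpoint is addressed by IP, no ordinary DNS request is needed before the first HTTPS request.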
06:37.540 --> 06:43.020 So the API, if I use the most convenient one, with JSON, looks like this: you specify 06:43.020 --> 06:49.420 your request, the type of record you want to get, and you will get JSON with all 06:49.420 --> 06:52.340 this information. 06:52.340 --> 07:01.740 It can work over HTTP/2, even HTTP/3, and HTTP/1.1, so everything works. With HTTP/2, 07:01.740 --> 07:12.180 I can use pipelining with multiple requests, and it also works. You can use it from 07:12.180 --> 07:13.180 curl. 07:13.180 --> 07:20.660 I carefully read the documentation: if I just do several billion requests in one 07:20.660 --> 07:21.660 day, 07:21.660 --> 07:24.780 will I violate their terms of service? 07:24.780 --> 07:31.260 Apparently I did not find any mention of it, and I thought: Cloudflare is such a 07:31.260 --> 07:40.060 big infrastructure provider, if I add just a billion requests, no one will notice. 07:40.060 --> 07:46.380 I suppose they are processing trillions of requests every day, if not hundreds of trillions. 07:46.380 --> 07:53.140 So I did it. I also prepared a table for the results in my favorite database, 07:53.140 --> 08:00.740 which is ClickHouse, the best open-source database, and this table is simple: a DateTime 08:00.740 --> 08:04.580 and the JSON with a String data type. 08:04.580 --> 08:11.140 And then I wrote a simple shell script. 08:11.140 --> 08:14.020 Let's read this simple shell script. 08:14.020 --> 08:20.900 So first of all, there is a SQL query inside, and the SQL query will skip all the reserved 08:20.900 --> 08:31.300 IPv4 addresses, like 127.anything, link-local, etc. 08:31.300 --> 08:39.900 And it will generate ranges of the first three numbers.
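The range generation described above can be sketched in Python as follows. In the talk this step is a SQL query inside the shell script; the code below is only a stand-in, and both the list of skipped networks and the function names are assumptions for illustration (the talk names only 127.x and link-local explicitly).

```python
import ipaddress
from typing import Iterator

# Assumption: an illustrative list of reserved IPv4 blocks to skip.
SKIPPED = [
    ipaddress.ip_network(block)
    for block in (
        "0.0.0.0/8",       # "this network"
        "10.0.0.0/8",      # private
        "127.0.0.0/8",     # loopback
        "169.254.0.0/16",  # link-local
        "172.16.0.0/12",   # private
        "192.168.0.0/16",  # private
        "224.0.0.0/4",     # multicast
        "240.0.0.0/4",     # reserved
    )
]


def is_skipped(net: ipaddress.IPv4Network) -> bool:
    """True if the /24 lies entirely inside one of the skipped blocks."""
    return any(net.subnet_of(block) for block in SKIPPED)


def scannable_slash24s(block: str) -> Iterator[ipaddress.IPv4Network]:
    """Yield every /24 (one value of the first three numbers) in `block` that is not skipped."""
    for net in ipaddress.ip_network(block).subnets(new_prefix=24):
        if not is_skipped(net):
            yield net


if __name__ == "__main__":
    # 169.254.0.0/15 covers 169.254.x.x (link-local, skipped) and 169.255.x.x (kept).
    kept = list(scannable_slash24s("169.254.0.0/15"))
    print(len(kept), kept[0])
```

Each yielded /24 corresponds to one unit of work: the 256 addresses that differ only in the last octet.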
08:39.900 --> 08:46.260 And then I will take these ranges and also parallelize by the last number, the last 08:46.260 --> 08:53.340 octet. I will parallelize using curl, and I will generate a command that will 08:53.340 --> 09:05.740 query these 256 addresses with a single HTTPS request using curl. 09:05.740 --> 09:11.820 I pipe it to bash, because I have just generated this command, and then I pipe the output into 09:11.820 --> 09:17.940 clickhouse-client to finally insert it into the database with another SQL query, 09:17.940 --> 09:21.500 INSERT INTO. 09:21.500 --> 09:30.820 But it is just a one-line shell script, you can see it. Okay, so what happened? 09:30.820 --> 09:38.460 I ran it, I left it for a day, I went to sleep, and then I woke up, and actually, 09:38.460 --> 09:42.700 you know, nothing happened. 09:42.700 --> 09:48.860 I did it from a single machine, I did not parallelize across many machines, and I tried to 09:48.860 --> 09:57.620 be gentle, without using too much parallelism. The question is, how long did it take? 09:57.620 --> 10:09.780 A day? Okay, any others? Two weeks? A couple of hours? Two months? No, actually it took about 10:09.780 --> 10:22.100 ten days. So it was not fast, due to these HTTPS requests, but at least no one complained. 10:22.100 --> 10:29.820 I was a little bit paranoid that no one complained. So yeah, around ten days. So I went 10:29.820 --> 10:36.620 to this service. This service is named GreyNoise, and it represents an interesting kind 10:36.620 --> 10:45.540 of observability for the internet: it is called an internet telescope. What is an internet telescope? 10:45.540 --> 10:58.580 This is a bunch of servers across the world that have some ranges of IP addresses (the ranges 10:58.580 --> 11:05.260 could be quite big), and they just listen to anything that comes to these servers.
11:05.260 --> 11:14.820 They open every port, they log every UDP packet, and they just listen and collect 11:14.820 --> 11:22.420 this data. If someone does a massive scan of the internet, these internet telescopes 11:22.420 --> 11:31.940 are likely to get this noise, so the service is named GreyNoise. I found 11:31.940 --> 11:40.580 that it detected a DNS over HTTPS scanner, so probably me. But then I checked the 11:40.820 --> 11:46.060 information, and apparently it is not me; this is something that happens all the time, and 11:46.060 --> 11:51.460 I don't have to worry. And even if I was detected, it does not mean that I'm a bad guy. 11:51.460 --> 12:00.180 I'm not a bad guy, trust me. Actually I did not have to do this scan, because there 12:00.180 --> 12:07.540 are open datasets of historical DNS scans. One of them was available 12:07.620 --> 12:14.980 from Project Sonar, from the company Rapid7. Unfortunately they no longer host 12:14.980 --> 12:23.700 this dataset, or at least they no longer give it for download. So at least I did an interesting 12:23.700 --> 12:33.060 exercise. So let's take a look at this dataset: 3.69 billion records. It is not 2 to the power of 32, 12:34.020 --> 12:41.540 because some addresses are reserved. The dataset size is 13 gigabytes, and this is not even 12:41.540 --> 12:52.020 parsed. The compression ratio is 65. The raw dataset, if we take just these JSONs, is almost 12:52.020 --> 13:03.620 a terabyte. To parse it I used this simple SQL query, let's not read it: regular expressions, 13:03.620 --> 13:13.020 JSON, etc. And here is the parsed dataset: time, status, flags, IP address, and domain. 13:13.100 --> 13:22.860 Let's take a look inside. The columns are compressed quite well, and the parsed dataset is 13:22.860 --> 13:33.420 even smaller, not 10 gigabytes, but 5. So I can look at it and it looks plausible. But now a question for you: 13:34.380 --> 13:43.740 do you know what is the
most popular TLD, top-level domain? You say .com? How many people will 13:43.740 --> 13:59.980 say .com? .arpa, maybe? How many people will say .net? Let's take a look at this. Easy: 14:00.060 --> 14:08.860 .net, .com, and .arpa, either I just removed it, or probably it is not that popular. 14:08.860 --> 14:16.060 I'm not sure why it is not here, it should be here, we will take a look. So .net is the most popular. 14:16.060 --> 14:23.340 Now the question for you: what is the most popular second-level subdomain, or first-level, it depends 14:23.340 --> 14:44.460 on how you count? co? What? co.uk? co.jp? Any other options? net? What? Maybe. Let's take a look. 14:44.460 --> 14:53.660 I think you are all wrong. The second, for some reason, is comcast.net, and then 14:53.660 --> 15:11.020 there is bbtec. What is it? I don't know, but probably you know. Okay, okay, now the next thing I did 15:11.100 --> 15:16.540 is I tried to find all the reverse DNS records containing "clickhouse", the name of my favorite 15:16.540 --> 15:25.500 database, and I found a lot of them, a lot of them also with company names. And what I did: 15:25.500 --> 15:37.740 I sent this list to our sales team. Actually I could use a different tool. What is it? 15:38.380 --> 15:47.260 Shodan. It shows that 23,000 ClickHouse servers are visible on the internet, probably because they are 15:47.260 --> 15:54.780 not secured, like it was with DeepSeek, but probably for some other reason, like they are secured, 15:54.780 --> 16:04.540 or they are honeypots. According to them, currently it is 33,000. And I want to get more from this data, 16:04.540 --> 16:13.020 that's why I need visualization. The most obvious example is to make this map. This picture is 16:13.020 --> 16:23.260 from 2006, from XKCD, about representing the whole range of IPv4 using a space-filling curve, 16:23.260 --> 16:30.460 so the curve does not go like the scan line, but the curve that
goes like this, filling all 16:30.540 --> 16:39.100 the space. It is difficult to describe, but easier to visualize. And there is another example, also not from 16:39.100 --> 16:49.980 me, it is from 2018, almost exactly what I want, but I want it bigger and better. So I have 16:50.460 --> 17:00.700 made another one-line shell script. It is this: I will draw a picture with SQL. This SQL query 17:01.340 --> 17:10.460 will calculate the first significant subdomain (for abc.co.uk, abc will be the first significant subdomain), 17:11.100 --> 17:19.420 then calculate a hash, sipHash64, then from this hash extract 3 bytes, RGB, 17:22.300 --> 17:28.700 and then use another function, mortonDecode, that represents a space-filling curve, 17:30.300 --> 17:40.380 and it will generate these pixels. I will output them as text in the format named 17:40.380 --> 17:53.820 portable anymap, PNM, and after all of this, I convert it to PNG, and it looks nice. 17:53.820 --> 18:01.580 Here is the picture like I wanted, and we will dive deep into this picture in a moment. But now, 18:02.540 --> 18:15.660 yeah, this one is zoomed in even more, but I wanted one zoomed out, so I generated it with 4K resolution 18:15.660 --> 18:21.340 using that script. Actually I want a full picture, and the full picture will be 4 billion pixels, 18:22.220 --> 18:34.940 65K by 65K, so not 4K, but 65K, and 65K displays don't exist as of this moment. But I can do it 18:34.940 --> 18:47.180 interactively. Yes, we can generate a 4-gigapixel PNG, and actually I tried it, but when I tried to 18:47.180 --> 18:55.100 open it, most of the viewers were failing. Actually, some were not failing but were slow, 18:56.060 --> 19:03.020 but most were failing. Then I decided to generate an interactive page, an HTML page, 19:04.060 --> 19:11.340 like Google Maps, allowing you to zoom inside a lot of tiles. There are different tools for that, 19:11.340 --> 19:18.860 like OpenSeadragon and Leaflet. Leaflet is more used for maps, and OpenSeadragon is more
used for maps, and OpenSeedRagon is more used for 19:18.860 --> 19:28.780 something like scans of museum, museum pictures, drawings, and it appeared not too difficult, so I 19:29.740 --> 19:37.020 have written another script to generate tiles in a different zoom levels, and here is 19:37.980 --> 19:48.300 several lines of a loop for zoom levels, a loop for x coordinate of a tile, for y coordinate of a 19:48.300 --> 19:56.460 tile, fine name will be like here, and here is the same SQL query to generate a picture and 19:56.460 --> 20:06.940 convert to PNG, done, done, done, I left it for a day, and it finished, now I have a bunch of tiles, 20:06.940 --> 20:16.140 and I want to arrange them into a source, and the source is here, let's take a look, so here it is, 20:16.140 --> 20:25.260 here is our map, we can zoom inside, we can zoom even more, let me try, what will happen if 20:25.260 --> 20:35.340 to this, why let's, something, do you know what it is, could it be, could it be tried to guess, 20:36.060 --> 20:47.100 come cast, let's take a look, no it's Amazon, but what is interesting if I zoom even more 20:47.100 --> 20:56.700 inside this violet thing, I can see a lot of different dots, and the question is why inside 20:56.700 --> 21:07.420 Amazon, I have this different dots with a reverse name, defined it specifically, and the answer is, 21:09.660 --> 21:20.300 yeah for sending emails, for sending emails, okay let's zoom back, and it is actually so beautiful, 21:20.300 --> 21:36.780 I spent a whole day just looking at this picture, now let's go back, and see, so every click also 21:36.780 --> 21:42.860 generates a SQL query to this table and from JavaScript it is easy to do, just use this fetch 21:42.860 --> 21:52.300 request with select query inside the post, HTTP post, and it works because I have the primary 21:52.300 --> 22:01.020 king's table, it is a pointer request, it is fast, and to make a source I used, I created a user, 22:01.020 --> 22:09.660 I set up 
limits, quotas, read-only access, and this HTML page, from JavaScript, directly queries 22:09.740 --> 22:19.340 this service. Takeaways: please don't be afraid. You can use ClickHouse for data analytics, 22:19.340 --> 22:25.820 you can use ClickHouse for DNS analytics, and you will find something new, you will have a lot of fun, 22:25.820 --> 22:31.420 maybe you will find a creative way to test and break something. The source code is available, it is 22:31.420 --> 22:38.300 really easy, an HTML page with JavaScript, and that's it. Ah, by the way, we have a dinner today, 22:38.300 --> 22:49.580 but unfortunately it is full, so please don't look at this one. Thank you.
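The pixel generation described earlier in the talk (hash the first significant subdomain, take three bytes as RGB, place the address on a space-filling curve) can be sketched in Python. This is only an approximation under stated assumptions: the talk does this in SQL, `hashlib.blake2b` stands in for the hash used there, and the bit convention in `morton_decode` (even bits to x, odd bits to y) is chosen for illustration and may not match the database function.

```python
import hashlib
from typing import Tuple


def morton_decode(code: int) -> Tuple[int, int]:
    """Split a 32-bit Z-order (Morton) code into 16-bit x and y coordinates."""
    x = y = 0
    for bit in range(16):
        x |= ((code >> (2 * bit)) & 1) << bit      # even bits form x (assumed convention)
        y |= ((code >> (2 * bit + 1)) & 1) << bit  # odd bits form y
    return x, y


def domain_color(domain: str) -> Tuple[int, int, int]:
    """Derive a stable RGB color from a domain label via a 64-bit hash."""
    # Assumption: blake2b with an 8-byte digest replaces the hash from the talk.
    digest = hashlib.blake2b(domain.encode("utf-8"), digest_size=8).digest()
    return digest[0], digest[1], digest[2]


def pixel_for(ip_as_int: int, domain: str) -> Tuple[int, int, Tuple[int, int, int]]:
    """Map one IPv4 address (as an integer) and its domain label to (x, y, rgb)."""
    x, y = morton_decode(ip_as_int)
    return x, y, domain_color(domain)


if __name__ == "__main__":
    # 8.8.8.8 as an integer; "google" is just a sample label.
    print(pixel_for(0x08080808, "google"))
```

With 32-bit addresses the coordinates span 65,536 by 65,536, which matches the full-picture size mentioned in the talk, and addresses that are close numerically tend to land close together on the map.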