[03:54] Thank you. Okay, good. So let's start from the beginning then. Thank you very much.

So this is about a form of multi-precision arithmetic. And the problem with multi-precision arithmetic is that it is more cost intensive: you have a cost overhead. That's where the parallel computing comes in. Multiple double arithmetic allows you to work more precisely, while you still work with floating-point arithmetic. So we are looking into giving the numerical algorithms a broader range. Now, to make it efficient in parallel, I'm trying to vectorize the multiple double arithmetic, and I'm going to explain how you can do that. There are two types of parallel computations: the first one is high level and very straightforward; the harder one is not really straightforward, and I will just give some hints about that.

[05:05] Okay, so I'm also teaching computational geometry this semester, so here is the definition of a robust algorithm: a robust algorithm performs in the same way if you start to tweak the input a little bit. Typically we work with 64-bit doubles, but we can extend that. The idea of extending 64-bit floating-point arithmetic using floating-point arithmetic itself goes back to the late 60s and early 70s, when people were still forced to work with 32 bits but figured out how to create more precision. I was using the QD package a lot, and there is also the CAMPARY library. My goal is to go to GPU acceleration. I'm presenting this on a gaming laptop, which is already about as powerful as the graphics cards and the big servers that I was using 10, 12 years ago. So computers have actually become too fast. For users, that's not a problem. But if you are a software developer and you really want to exploit your computer, it's a big problem.

[06:31] Okay, what is a multiple double? It's simply a sequence of several doubles. If you work with complex arithmetic, you have a real part and you have an imaginary part: two doubles.
Well, a double double is very much the same. Now, if your result can be represented exactly in one double — like you want to compute the number eight here — then your second double is going to give you the accuracy at which you computed it. With double double you have about 32 decimal places; then you double again and you have about 64 decimal places, and so on. That's the good thing: you can still do numerical analysis, you kind of have an error estimate, if your result can be represented exactly by a 64-bit double. The drawback — well, you're not going to notice it all that much with double double arithmetic, but your problems are going to become compute bound, so your processor is going to work harder. But when you get to quad double, you may actually want to do something in parallel. And then, of course, you can double up again. So here you see the average number of floating-point operations you have to do for an addition, a multiplication, and a division.

[08:01] Here is why I am doing this: I'm working with Taylor series. You may remember Taylor series from calculus. Here is a very nice series: the exponential starts with the leading coefficient one, but then the coefficients go down very, very fast, as a factorial. If you only need eight terms, then double precision will do. But if you need more — for example, if you need 32 terms in your Taylor series to represent the exponential correctly — then, to also capture these tiny coefficients, you will actually need quad double arithmetic. And 32 is a good number if you think about the warp size of a GPU. Typically we would like to work with 64 or 128.

[08:57] Okay, so here is the idea. What happens is that when you do an operation with a double double, it is going to do the operation and then renormalize again. If you do a quad double, it is also going to renormalize, to make sure that all the doubles that represent your quad double are non-overlapping. That renormalization has a lot of tests, ifs, and things like that. The idea is that if you vectorize, you would actually like to postpone the renormalization.
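For reference, here is a minimal sketch in C++ of the kind of building block being described: an error-free addition of two doubles, followed by the renormalization step that a vectorized scheme would want to postpone. The names dd, two_sum, and dd_add are illustrative only, not the QD library's API, and the code must be compiled without -ffast-math.

```cpp
#include <cstdio>

// A double double: an unevaluated sum hi + lo, with lo far smaller than hi.
struct dd { double hi, lo; };

// Error-free transformation (Knuth's TwoSum): a + b = s + e exactly.
static inline void two_sum(double a, double b, double &s, double &e) {
    s = a + b;
    double v = s - a;
    e = (a - (s - v)) + (b - v);
}

// Double double addition: add the high parts exactly, fold in the low
// parts, then renormalize so the result is again a non-overlapping pair.
static inline dd dd_add(dd x, dd y) {
    double s, e;
    two_sum(x.hi, y.hi, s, e);   // exact sum of the high parts
    e += x.lo + y.lo;            // accumulate the low parts
    dd r;
    two_sum(s, e, r.hi, r.lo);   // the renormalization step
    return r;
}

int main() {
    dd a{1.0, 1.0e-18}, b{1.0, -3.0e-18};
    dd c = dd_add(a, b);
    std::printf("hi = %.17g  lo = %.17g\n", c.hi, c.lo);
    return 0;
}
```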
[09:37] So here is what the splitting looks like. If you want to add a sequence — you have 52 bits of precision in your fraction — you simply cut those 52 bits in half and you insert zeros. These zeros are then going to fill up as you accumulate. You could also think of it as exact integer arithmetic: you would split your 64-bit integers into halves, add zeros up front, and those numbers would fill up.

[10:09] Okay, so double double arithmetic also relies on so-called error-free operations. I want to do inner products — I want to multiply matrices, for example. The reference code just computes a sequence of sums of products. Now, for the products: if you multiply two numbers of, say, 13 bits each, you can get a 26-bit number. That's why a double double, which consists of a high double and a low double, is going to be split into 8 parts; the 8 parts all have 13 nonzero bits and are filled up with zeros. It sounds very complicated, but what we are doing is that one inner product in double double arithmetic is going to be replaced by 8 inner products, and those 8 inner products happen in double precision arithmetic. Computers are very good at doing matrix-matrix multiplication in double precision arithmetic. So inner products in multiple double precision are replaced by inner products in double arithmetic. That's what I mean by vectorizing.

[11:44] So here is the setup. It's important to remember that this is still floating point, and the numbers can still float; you have to prevent this by renormalizing intermittently. Sometimes people say that with a 64-bit double you have 53 bits of precision, because of the hidden normalized bit. Now, when we split a double into four pieces, we also have to make sure that the exponents of those pieces are 0, -13, -26, -39.
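To make the splitting concrete, here is a small C++ sketch of the classical Veltkamp/Dekker trick that such schemes build on: cutting a double's fraction so that products of the parts are exact doubles. The names split and two_prod are illustrative; this cuts the fraction into two halves, whereas the vectorized inner products described above use finer pieces with fixed exponents.

```cpp
#include <cstdio>

// Veltkamp split: write x as xhi + xlo, where each part has at most
// 26 significant bits, so that products of parts are exact in double.
// The constant 2^27 + 1 does the splitting of the 53-bit significand.
static inline void split(double x, double &xhi, double &xlo) {
    const double splitter = 134217729.0;   // 2^27 + 1
    double t = splitter * x;
    xhi = t - (t - x);
    xlo = x - xhi;
}

// Dekker's error-free product: a * b = p + e exactly.
static inline void two_prod(double a, double b, double &p, double &e) {
    p = a * b;
    double ahi, alo, bhi, blo;
    split(a, ahi, alo);
    split(b, bhi, blo);
    e = ((ahi * bhi - p) + ahi * blo + alo * bhi) + alo * blo;
}

int main() {
    double p, e;
    two_prod(1.0 / 3.0, 3.0, p, e);
    std::printf("p = %.17g  e = %.17g\n", p, e);  // e holds the rounding error of p
    return 0;
}
```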
[12:24] Okay, so here is the computational result. In some sense this is already a parallel computation, because inside the processor there are pipelines: if you do a double precision floating-point operation, there is the denormalization — making sure that the exponents are the same — adding the fractions, and then storing again.

[12:45] By rewriting that first big inner product as many inner products of doubles, we actually get a lot more speedup, and a much more gradual progression. If you look at the left column there — the ordinary, what I call the normal, double double arithmetic — you immediately take a penalty factor of 13, and then if you go to quad double you take another factor of 12. But if you do it vectorized, it goes much more gradually. And this was, by the way, done on a very recent computer, with a recent chip, compiled with full optimization.

[13:38] Okay, now the two levels of parallelism. The big point is that going from, say, 12 milliseconds to nine seconds is something you are going to notice. That is actually the whole big problem of using multiple precision arithmetic: things that used to take milliseconds suddenly take seconds, seconds become minutes, and minutes become hours. Now, those nine seconds you can reduce in a very straightforward way — I had 1,000 inner products to do — by using all the cores of that CPU, down to about 298 milliseconds. I like speedup, but we also like quality up: in some sense, if you use all the cores in that computer, you can do the same computation four times more accurately within the same time.

[14:52] Okay, so that is the nice type of parallelism, here as a reference. This is what I call the work crew model: every thread just picks up an inner product, and they work through a job queue. The only thing you need to be careful about is incrementing the job counter — every thread, when it's done, asks for the next job — and they can only do that in a critical section. And this can be implemented very nicely; I think this is classic Ada.
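Here is a minimal sketch of that work crew model, in C++ rather than in Ada, using an atomic job counter in place of an explicit critical section (which also anticipates the question about atomic variables further down). All names are illustrative and the inner product itself is left as a placeholder.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// Work crew model: a fixed crew of threads works through a queue of jobs
// (here: 1,000 placeholder inner products).  The only shared state is the
// job counter; the atomic increment plays the role of the critical section.
int main() {
    const int njobs = 1000;
    unsigned nthreads = std::thread::hardware_concurrency();
    if (nthreads == 0) nthreads = 4;      // fallback if the count is unknown
    std::atomic<int> next_job{0};

    auto worker = [&]() {
        for (;;) {
            int job = next_job.fetch_add(1);   // ask for the next job
            if (job >= njobs) break;           // job queue exhausted
            // ... compute inner product number `job` here ...
        }
    };

    std::vector<std::thread> crew;
    for (unsigned t = 0; t < nthreads; ++t) crew.emplace_back(worker);
    for (auto &th : crew) th.join();
    std::printf("done: %d jobs on %u threads\n", njobs, nthreads);
    return 0;
}
```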
[15:36] Okay, now comes the hard part. With GPUs you can do the multiple double arithmetic, that's fine, but then there are the tensor cores. The computer that I mentioned actually has a high-end Ampere graphics card. And what this experiment shows is that these convolutions can actually be reformulated as matrix-matrix products. This is what we call data staging.

[16:17] We have to make sure that the data is all properly arranged in vectors, and that we know very carefully where the beginning is, where the end is, and where we are going to write. GPU code is typically very short — you just launch some kernel — but the preparation of the data takes a very long time. And what I have presented to you here is really a first step in making the tensor cores useful for higher precision matrix-matrix multiplications. Thank you for your attention. On time.

[17:10] Thank you. So, questions and answers.

You said you were using a 96-core CPU? — Yes. — Then you mentioned that you had only about a six-times speedup, or something like that?

Yeah, so these speedups are actually measured on only one core. Okay, I'm sorry — the question is: I'm using 96 cores, so how come these are the speedups here? On this slide I am only using one core, and I'm using the 96 cores to reduce the 9 seconds over there. So that is the highest time here, and I want to reduce it, compared to the 318. With quad doubles that is about 10 times slower. What I can do is this: I have 1,000 inner products to do, and they are distributed among the 96 cores in that computer. And that reduces the time to — 318, I'm sorry — to 298 milliseconds.

Yeah, so the question is, I'm using semaphores here, and have I considered using atomic variables — and I think I do. There are these types that I can use, yes.

So the question is: are the exponents useful? So the exponents — in some sense, here is how you can see it. That is also a drawback of multiple double arithmetic: you still only have 11 bits in your exponent, so at some point you are going to have overflow or underflow, and you still have to do numerical analysis. So, are these exponents useful? Here they are not, because they all stay fixed — and that is actually what you want. In some sense, for your numbers, you are only going to use — well, here I generated all the numbers in a very nice range, but it is that leading exponent that is going to matter, and all the other parts will somehow need to be carefully rebalanced.
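As a small illustration of that last answer, here is a tiny C++ snippet, with values chosen purely for demonstration, of the point that multiple doubles widen the precision but not the 11-bit exponent range: once the leading component gets small enough, the trailing components go subnormal or underflow to zero.

```cpp
#include <cstdio>

// Multiple doubles extend the precision, not the exponent range: every
// component keeps the same 11-bit exponent field.  Once the leading part
// is small enough, the trailing parts go subnormal or underflow to zero,
// and the extra precision silently disappears.
int main() {
    const double cases[] = {1.0, 1.0e-292, 1.0e-300, 1.0e-310};
    for (double hi : cases) {
        double lo = hi * 0x1p-53;   // magnitude of a typical low component
        std::printf("hi = %-9g  lo = %g%s\n", hi, lo,
                    lo == 0.0 ? "   (underflowed to zero)" : "");
    }
    return 0;
}
```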
[20:35] [Audience question, partly inaudible, about whether the temperature of the processor and thermal throttling affect how the computation runs and how the load gets balanced.]

So your question is about the load balancing? — Yes. — And about the impact of the temperature on how the process is going to run? — Yes.

So the question is about the temperature. Well, the machine starts to get a lot more noisy when I run this, and the fans really start to blow very hard. That is how I see — or rather hear — that it is working. But I have no sensors; I mean, there could be sensors there, but I do not monitor the temperature.

[Audience follow-up, partly inaudible, about running the fans at maximum speed to reduce the heating of the processor.]

No, I never used that. Yeah. Okay, thank you. Thank you.