Thank you for coming to my talk. I am unmuted, I think, and there's no red light. It's not red. All right.

So I work on Futhark, a data-parallel functional programming language. It looks like this. I'm not really here to talk about the language itself. The idea is that you write in this fairly conventional functional style, the compiler chews on your program, transforms it and optimizes it, and generates code for a parallel platform, in particular GPUs, which is what I will talk about today, and then hopefully it runs pretty fast. That is at least the goal.

So while Futhark is not in itself a GPU language, we have a compiler that generates GPU code, and it has three production-quality GPU backends that are supposed to work for every program. One of them targets OpenCL, an open standard for GPU programming that has been around for a while and is implemented, to varying degrees, by Nvidia, AMD, Intel and so on, so it is pretty widely supported. In principle we should not need anything else, but we do. We also target CUDA, which is Nvidia's proprietary API for running code on GPUs; it is pretty dominant, and it works well on Nvidia GPUs, but only on Nvidia GPUs. And then we also target HIP, which is essentially AMD taking CUDA and filing off the serial numbers - you run a preprocessing step that renames CUDA to HIP - and that works pretty well on AMD GPUs.
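To make the idea of "equivalent code through three APIs" concrete, here is a rough sketch - not the Futhark compiler's actual generated code - of how the same logical kernel launch looks through each API. All handles (`queue`, `kernel`, the `CUfunction`/`hipFunction_t`, streams, kernel arguments) are assumed to have been set up elsewhere, error handling is omitted, and compiling this one file would of course require all three vendor SDKs:

```c
#include <CL/cl.h>            /* OpenCL          */
#include <cuda.h>             /* CUDA driver API */
#include <hip/hip_runtime.h>  /* HIP             */

/* Sketch: launch "the same" kernel over `groups` blocks of `block` threads
   through each of the three APIs.  Contexts, modules, kernels and argument
   buffers are assumed to have been created elsewhere; errors are ignored. */

void launch_opencl(cl_command_queue queue, cl_kernel kernel,
                   size_t groups, size_t block) {
  size_t global = groups * block;   /* OpenCL wants the total work-items */
  clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &block,
                         0, NULL, NULL);
}

void launch_cuda(CUfunction f, CUstream s, void **args,
                 unsigned groups, unsigned block) {
  cuLaunchKernel(f, groups, 1, 1, block, 1, 1, 0, s, args, NULL);
}

void launch_hip(hipFunction_t f, hipStream_t s, void **args,
                unsigned groups, unsigned block) {
  hipModuleLaunchKernel(f, groups, 1, 1, block, 1, 1, 0, s, args, NULL);
}
```

The CUDA and HIP calls are essentially the same function under two names, which is part of why those two backends behave so similarly; OpenCL mostly differs in how sizes are expressed (total work-items rather than a block count) and, as discussed below, in what you can query about the device.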
So ideally, when we generate equivalent code from the compiler targeting each of these three APIs, we would expect identical performance. Is that actually the case - do we get identical performance across these backends? As a matter of fact, we don't. The idea is that we take equivalent benchmark programs written in this language - just imagine they are ordinary GPU programs, the language is not really the point - and we take these 48 benchmark programs, compile them with the different backends, run them on different GPUs, and see how fast they run, where the only thing that changes is which API we target. That is a lot of data, because we have 48 programs, two or three backends, and two different GPUs. But this somewhat unreadable graph shows how fast OpenCL is compared to CUDA and HIP respectively, on an AMD GPU and an Nvidia GPU, and a number greater than one means that OpenCL is faster than CUDA or HIP respectively. You can see that in many cases the difference is less than 20%, but there are some significant outliers.

I would like to briefly talk about the reasons for these differences. The compiler largely generates equivalent code, but there are some exceptions, and the actual drivers also have an impact.

One of the most tedious, or least interesting, reasons is that by default OpenCL allows single-precision floating-point division and square root operations to be incorrectly rounded. On the other hand, they will be faster, because they are wrong - it is usually easy to make things run faster if you don't care about the results. This explains, for example, why on the A100 GPU one benchmark is 1.43 times faster with OpenCL than with CUDA: that is almost entirely because if you allow the GPU to round incorrectly, it goes really fast. It still behaves fairly nicely - the results actually still validate - but it is kind of an unfair comparison. If you tell OpenCL to please give me correctly rounded numbers, then many of these differences go away. I don't actually know why that is the default in OpenCL; it seems a bit weird to me.

Another, more interesting, difference is that some of the benchmark programs require an algorithm called a prefix sum, or scan. The best algorithm for computing prefix sums on a GPU is the so-called decoupled look-back scan, developed by some Nvidia researchers, and it really goes against how you are supposed to program GPUs. It depends on some very sophisticated communication between thread blocks running at the same time: they write to some shared state in memory and read each other's results, so you need progress guarantees, guarantees about memory behaviour and so on that are not provided by the OpenCL spec, and are sort of only provided by CUDA - these Nvidia researchers came up with the algorithm, it worked in practice, so Nvidia says if you do exactly this, then it's okay. They formalized it a bit better later, but the point is that on OpenCL you cannot rely on being able to run this fancy scan algorithm, so we have to fall back to a slower one, and that explains the performance difference for the benchmark programs that rely heavily on scans. Which is unfortunate - in principle you should also be able to do this in OpenCL, but I don't know of anyone who has actually managed to get it to work reliably.
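To show why this needs guarantees the OpenCL spec does not give, here is a minimal CPU-side sketch in C of the look-back phase of a decoupled look-back scan (after Merrill and Garland). It uses plain `+` as the operator, and the names (`descriptor`, `publish`, `lookback_exclusive_prefix`) are made up for this sketch rather than taken from the paper or from Futhark's generated code:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Per-partition descriptor for a decoupled look-back scan.
   FLAG_X: nothing published yet; FLAG_A: `value` is this partition's local
   aggregate; FLAG_P: `value` is the inclusive prefix up to and including
   this partition. */
enum { FLAG_X = 0, FLAG_A = 1, FLAG_P = 2 };

typedef struct {
  _Atomic int     flag;
  _Atomic int64_t value;
} descriptor;

/* Publish a value for one partition; the flag is written last so that a
   reader who sees FLAG_A/FLAG_P also sees the matching value. */
void publish(descriptor *d, int flag, int64_t value) {
  atomic_store(&d->value, value);
  atomic_store(&d->flag, flag);
}

/* Compute the exclusive prefix of partition `p` by walking backwards over
   earlier partitions' descriptors.  The inner loop spin-waits until the
   predecessor has published *something* -- this is the step that needs a
   forward-progress guarantee between concurrently running thread blocks. */
int64_t lookback_exclusive_prefix(descriptor *desc, int p) {
  int64_t acc = 0;
  for (int q = p - 1; q >= 0; q--) {
    int f;
    while ((f = atomic_load(&desc[q].flag)) == FLAG_X) {
      /* If partition q is never scheduled, this loop never terminates. */
    }
    acc += atomic_load(&desc[q].value);
    if (f == FLAG_P)   /* q already knows its full prefix: stop looking. */
      break;
    /* f == FLAG_A: keep accumulating aggregates further back. */
  }
  return acc;
}
```

The spin-wait only terminates if the earlier partitions actually get to run far enough to publish something. CUDA and HIP effectively give you that on their own hardware; the OpenCL spec does not promise it, which is why the OpenCL backend has to fall back to a slower scan.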
Another problem is that on AMD GPUs, the OpenCL implementation limits thread blocks to 256 threads - the thread block being the unit you group your threads into. That is not a hardware limit, because it does not exist with HIP on the same hardware; it is just a kind of weird driver software limit, with some fine print attached. That explains the performance difference for some benchmarks that are very sensitive to the block size - again, kind of a strange software limit in AMD's stack.

Another similarly weird thing is that you cannot reliably query the level-two cache size. OpenCL does let you query a cache size, but it is kind of unclear whether it is level one, level two or something else, and it does not give the same answer everywhere: Nvidia gives you the level-two cache size and AMD gives you the level one. This affects algorithms that query the cache size to make intelligent choices, because the way we wrote them we assume it is the level-two cache size, and if you get the level one on AMD, some of the automatic tuning does not really work so well.

Yet another instance of it being difficult to query hardware information is that you cannot ask how many threads you need to fully saturate your GPU, or how many will fit on it. We do generate code that tries to query the GPU to figure out how much we can fit and then makes choices based on that. Since we cannot get that information in OpenCL, we make a heuristic guess instead, which is generally smaller than the precise number. But it is actually not a given that the right number produces faster code: tuning how many threads you give to your GPU program is a pretty complicated process, very hard to predict, and in some cases the heuristic actually runs faster than the correct number. So this explains some differences, but it is not really a big problem - or at least the heuristic is not necessarily worse than the right information.
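For reference, the hardware information in question corresponds to real device queries. A rough illustration (error handling omitted; the CUDA half assumes the runtime API and a kernel pointer with which to estimate occupancy): OpenCL only exposes a maximum work-group size and "a" cache size, while CUDA can report the L2 size explicitly and how many blocks of a given kernel fit per multiprocessor.

```c
#include <stdio.h>
#include <CL/cl.h>
#include <cuda_runtime.h>

/* What OpenCL will tell us: a maximum work-group size (256 on AMD's
   OpenCL, even though HIP allows larger blocks on the same hardware) and
   "the" global memory cache size, without saying which cache level. */
void query_opencl(cl_device_id dev) {
  size_t max_wg;
  cl_ulong cache_bytes;
  clGetDeviceInfo(dev, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                  sizeof max_wg, &max_wg, NULL);
  clGetDeviceInfo(dev, CL_DEVICE_GLOBAL_MEM_CACHE_SIZE,
                  sizeof cache_bytes, &cache_bytes, NULL);
  printf("max work-group size: %zu, some cache: %llu bytes\n",
         max_wg, (unsigned long long)cache_bytes);
}

/* What CUDA will tell us: the L2 size explicitly, plus an occupancy
   estimate for a specific kernel and block size, from which we can work
   out roughly how many threads saturate the device. */
void query_cuda(int dev, const void *kernel, int block_size) {
  int l2_bytes, sm_count, blocks_per_sm;
  cudaDeviceGetAttribute(&l2_bytes, cudaDevAttrL2CacheSize, dev);
  cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, dev);
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, kernel,
                                                block_size, 0);
  printf("L2: %d bytes, threads to fill the GPU: %d\n",
         l2_bytes, sm_count * blocks_per_sm * block_size);
}
```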
Another problem is that when you are talking to a GPU you are going through an API, which has some overhead on the host, and in some cases we can see that a program is slower for no reason we can attribute to the code actually running on the GPU; it is purely due to some host-side overhead, where even just touching the GPU is slower for a reason we cannot really determine. Generally we see that OpenCL is slower: if you just run the minimal program that touches the GPU and does nothing, we observe that OpenCL is a little bit slower than both HIP and CUDA. This usually does not matter, because usually when you touch the GPU you tell it to do something costly, but there are some programs, or benchmarks with degenerate inputs, that do almost no work, and then you are basically just measuring the host-side overhead. One of the small data sets for the trace benchmark is like that, and nbody has some very large data sets and some very small ones, so obviously we get a big variance there. It is difficult to find the exact reason for this one.

Then we also have bounds checking. The language supports full bounds checking on the GPU, which is done with a code transformation that generates some slightly odd code, and we suspect that the code generators inside the OpenCL and HIP implementations are sometimes a little bit bad at handling it. So for no reason other than the bounds checking, which by itself is not expensive, we see that the code is turned into much worse code with OpenCL than with CUDA on the Nvidia stack - and on the AMD stack it is the opposite: the OpenCL code actually becomes faster than HIP. That is just because there is a black-box compiler after us that does something to the code we generate - I am never quite sure what - and in this case there is a difference we cannot really attribute to anything else.
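As an illustration of the kind of "slightly odd code" such a transformation produces - a hand-written sketch of the general idea, not Futhark's actual generated code, with names like `checked_read` invented for the example - every checked access turns into a flag write plus a dummy value instead of a trap, so control flow stays uniform and the host can report the failure after the kernel finishes:

```c
#include <stdint.h>

/* Sketch of a bounds-checked array read as a GPU compiler might emit it:
   a GPU thread cannot simply abort the program, so an out-of-bounds access
   records a failure code in (global) memory and returns a harmless dummy
   value; the host checks `*failure` once the kernel is done and discards
   the poisoned results if it is set. */
static inline int64_t checked_read(const int64_t *arr, int64_t n,
                                   int64_t i, int32_t *failure) {
  if (i < 0 || i >= n) {
    *failure = 1;   /* remember that some check failed */
    return 0;       /* dummy value; the final result will be discarded */
  }
  return arr[i];
}

/* A kernel body then becomes a chain of such guarded operations, which is
   the "slightly odd" control flow that the downstream OpenCL/HIP compilers
   sometimes optimize worse (or, on AMD, better) than plain unchecked code. */
int64_t sum_with_checks(const int64_t *arr, int64_t n,
                        const int64_t *idx, int64_t m, int32_t *failure) {
  int64_t s = 0;
  for (int64_t j = 0; j < m; j++)
    s += checked_read(arr, n, checked_read(idx, m, j, failure), failure);
  return s;
}
```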
And then there are things we simply cannot figure out. There are some programs where there is a performance difference between OpenCL and HIP and we cannot see why: we see that a GPU kernel runs faster or slower, and even inspecting the generated machine code I cannot figure out exactly what the reason is. I suspect it is something like register allocation - some similarly low-level thing - because it tends to happen for kernels that are compute-bound rather than memory-bound, where such things might have a major impact. But I really have not figured out why.

We can solve some of these issues, and they really fall into two categories: one is telling OpenCL to round properly, and the other is manually providing all the hardware information that OpenCL does not allow us to query otherwise. If we do that, then this becomes the summary table, and now for the vast majority of the benchmarks there is less than 10% difference between OpenCL and CUDA, or OpenCL and HIP. Most of the remaining significant cases where OpenCL is slower are because of the issue where we cannot run the best known scan implementation, since it requires memory and progress guarantees that we do not have in OpenCL.

So my conclusion is that it is not really difficult to target all of these APIs with a single code generator, but there are some performance differences. Performance portability is tricky, and most of it comes down to missing hardware information. I would say that if you only care about Nvidia and AMD, targeting OpenCL is probably not worth it, because CUDA and HIP are pretty similar and overall less trouble.

All right, that's my talk.

Perfect timing, so we have time for one question. Who has a question?

"Yeah, so there are different versions of OpenCL, and they differ in what they support. Which version did you use across the board?"

We deliberately stay close to the baseline - we do not rely on anything beyond the old core versions, apart from querying a few AMD-specific things. [remainder inaudible]