Thank you for coming to my talk. I am unmuted, I think, and there's no red light. It's not red. All right.

So I work on Futhark, a data-parallel functional programming language. It looks like this. I'm not really here to talk about the language itself. The idea is that you write in this fairly conventional functional style, the compiler chews on your program, transforms it and optimizes it, and generates code for a parallel platform, in particular GPUs, which is what I will talk about today, and then hopefully it runs pretty fast. That is at least the goal.

So while Futhark is not in itself a GPU language, we have a compiler that generates GPU code, and it has three production-quality GPU backends that are supposed to work for every program. One of them targets OpenCL, an open standard for GPU programming that has been around for a while and is implemented, to varying degrees, by Nvidia, AMD, Intel and so on, so it is pretty widely supported. In principle we should not need anything else, but we do. We also target CUDA, which is Nvidia's proprietary API for running code on GPUs; it is pretty dominant, and it works well on Nvidia GPUs, but only on Nvidia GPUs. And then we also target HIP, which is essentially AMD taking CUDA and filing off the serial numbers - you run a preprocessing step that renames CUDA to HIP - and that works pretty well on AMD GPUs.
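To make the idea of "equivalent code through three APIs" concrete, here is a rough sketch - not the Futhark compiler's actual generated code - of how the same logical kernel launch looks through each API. All handles (`queue`, `kernel`, the `CUfunction`/`hipFunction_t`, streams, kernel arguments) are assumed to have been set up elsewhere, error handling is omitted, and compiling this one file would of course require all three vendor SDKs:

```c
#include <CL/cl.h>            /* OpenCL          */
#include <cuda.h>             /* CUDA driver API */
#include <hip/hip_runtime.h>  /* HIP             */

/* Sketch: launch "the same" kernel over `groups` blocks of `block` threads
   through each of the three APIs.  Contexts, modules, kernels and argument
   buffers are assumed to have been created elsewhere; errors are ignored. */

void launch_opencl(cl_command_queue queue, cl_kernel kernel,
                   size_t groups, size_t block) {
  size_t global = groups * block;   /* OpenCL wants the total work-items */
  clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &block,
                         0, NULL, NULL);
}

void launch_cuda(CUfunction f, CUstream s, void **args,
                 unsigned groups, unsigned block) {
  cuLaunchKernel(f, groups, 1, 1, block, 1, 1, 0, s, args, NULL);
}

void launch_hip(hipFunction_t f, hipStream_t s, void **args,
                unsigned groups, unsigned block) {
  hipModuleLaunchKernel(f, groups, 1, 1, block, 1, 1, 0, s, args, NULL);
}
```

The CUDA and HIP calls are essentially the same function under two names, which is part of why those two backends behave so similarly; OpenCL mostly differs in how sizes are expressed (total work-items rather than a block count) and, as discussed below, in what you can query about the device.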
So ideally, when we generate equivalent code from the compiler targeting each of these three APIs, we would expect identical performance. Is that actually the case - do we get identical performance across these backends? As a matter of fact, we don't. The idea is that we take equivalent benchmark programs written in this language - just imagine they are ordinary GPU programs, the language is not really the point - and we take these 48 benchmark programs, compile them with the different backends, run them on different GPUs, and see how fast they run, where the only thing that changes is which API we target. That is a lot of data, because we have 48 programs, two or three backends, and two different GPUs. But this somewhat unreadable graph shows how fast OpenCL is compared to CUDA and HIP respectively, on an AMD GPU and an Nvidia GPU, and a number greater than one means that OpenCL is faster than CUDA or HIP respectively. You can see that in many cases the difference is less than 20%, but there are some significant outliers.

I would like to briefly talk about the reasons for these differences. The compiler largely generates equivalent code, but there are some exceptions, and the actual drivers also have an impact.

One of the most tedious, or least interesting, reasons is that by default OpenCL allows single-precision floating-point division and square root operations to be incorrectly rounded. On the other hand, they will be faster, because they are wrong - it is usually easy to make things run faster if you don't care about the results. This explains, for example, why on the A100 GPU one benchmark is 1.43 times faster with OpenCL than with CUDA: that is almost entirely because if you allow the GPU to round incorrectly, it goes really fast. It still behaves fairly nicely - the results actually still validate - but it is kind of an unfair comparison. If you tell OpenCL to please give me correctly rounded numbers, then many of these differences go away. I don't actually know why that is the default in OpenCL; it seems a bit weird to me.

Another, more interesting, difference is that some of the benchmark programs require an algorithm called a prefix sum, or scan. The best algorithm for computing prefix sums on a GPU is the so-called decoupled look-back scan, developed by some Nvidia researchers, and it really goes against how you are supposed to program GPUs. It depends on some very sophisticated communication between thread blocks running at the same time: they write to some shared state in memory and read each other's results, so you need progress guarantees, guarantees about memory behaviour and so on that are not provided by the OpenCL spec, and are sort of only provided by CUDA - these Nvidia researchers came up with the algorithm, it worked in practice, so Nvidia says if you do exactly this, then it's okay. They formalized it a bit better later, but the point is that on OpenCL you cannot rely on being able to run this fancy scan algorithm, so we have to fall back to a slower one, and that explains the performance difference for the benchmark programs that rely heavily on scans. Which is unfortunate - in principle you should also be able to do this in OpenCL, but I don't know of anyone who has actually managed to get it to work reliably.
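To show why this needs guarantees the OpenCL spec does not give, here is a minimal CPU-side sketch in C of the look-back phase of a decoupled look-back scan (after Merrill and Garland). It uses plain `+` as the operator, and the names (`descriptor`, `publish`, `lookback_exclusive_prefix`) are made up for this sketch rather than taken from the paper or from Futhark's generated code:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Per-partition descriptor for a decoupled look-back scan.
   FLAG_X: nothing published yet; FLAG_A: `value` is this partition's local
   aggregate; FLAG_P: `value` is the inclusive prefix up to and including
   this partition. */
enum { FLAG_X = 0, FLAG_A = 1, FLAG_P = 2 };

typedef struct {
  _Atomic int     flag;
  _Atomic int64_t value;
} descriptor;

/* Publish a value for one partition; the flag is written last so that a
   reader who sees FLAG_A/FLAG_P also sees the matching value. */
void publish(descriptor *d, int flag, int64_t value) {
  atomic_store(&d->value, value);
  atomic_store(&d->flag, flag);
}

/* Compute the exclusive prefix of partition `p` by walking backwards over
   earlier partitions' descriptors.  The inner loop spin-waits until the
   predecessor has published *something* -- this is the step that needs a
   forward-progress guarantee between concurrently running thread blocks. */
int64_t lookback_exclusive_prefix(descriptor *desc, int p) {
  int64_t acc = 0;
  for (int q = p - 1; q >= 0; q--) {
    int f;
    while ((f = atomic_load(&desc[q].flag)) == FLAG_X) {
      /* If partition q is never scheduled, this loop never terminates. */
    }
    acc += atomic_load(&desc[q].value);
    if (f == FLAG_P)   /* q already knows its full prefix: stop looking. */
      break;
    /* f == FLAG_A: keep accumulating aggregates further back. */
  }
  return acc;
}
```

The spin-wait only terminates if the earlier partitions actually get to run far enough to publish something. CUDA and HIP effectively give you that on their own hardware; the OpenCL spec does not promise it, which is why the OpenCL backend has to fall back to a slower scan.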
Another problem is that on AMD GPUs, the OpenCL implementation limits thread blocks to 256 threads - the thread block being the unit you group your threads into. That is not a hardware limit, because it does not exist with HIP on the same hardware; it is just a kind of weird driver software limit, with some fine print attached. That explains the performance difference for some benchmarks that are very sensitive to the block size - again, kind of a strange software limit in AMD's stack.

Another similarly weird thing is that you cannot reliably query the level-two cache size. OpenCL does let you query a cache size, but it is kind of unclear whether it is level one, level two or something else, and it does not give the same answer everywhere: Nvidia gives you the level-two cache size and AMD gives you the level one. This affects algorithms that query the cache size to make intelligent choices, because the way we wrote them we assume it is the level-two cache size, and if you get the level one on AMD, some of the automatic tuning does not really work so well.

Yet another instance of it being difficult to query hardware information is that you cannot ask how many threads you need to fully saturate your GPU, or how many will fit on it. We do generate code that tries to query the GPU to figure out how much we can fit and then makes choices based on that. Since we cannot get that information in OpenCL, we make a heuristic guess instead, which is generally smaller than the precise number. But it is actually not a given that the right number produces faster code: tuning how many threads you give to your GPU program is a pretty complicated process, very hard to predict, and in some cases the heuristic actually runs faster than the correct number. So this explains some differences, but it is not really a big problem - or at least the heuristic is not necessarily worse than the right information.
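For reference, the hardware information in question corresponds to real device queries. A rough illustration (error handling omitted; the CUDA half assumes the runtime API and a kernel pointer with which to estimate occupancy): OpenCL only exposes a maximum work-group size and "a" cache size, while CUDA can report the L2 size explicitly and how many blocks of a given kernel fit per multiprocessor.

```c
#include <stdio.h>
#include <CL/cl.h>
#include <cuda_runtime.h>

/* What OpenCL will tell us: a maximum work-group size (256 on AMD's
   OpenCL, even though HIP allows larger blocks on the same hardware) and
   "the" global memory cache size, without saying which cache level. */
void query_opencl(cl_device_id dev) {
  size_t max_wg;
  cl_ulong cache_bytes;
  clGetDeviceInfo(dev, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                  sizeof max_wg, &max_wg, NULL);
  clGetDeviceInfo(dev, CL_DEVICE_GLOBAL_MEM_CACHE_SIZE,
                  sizeof cache_bytes, &cache_bytes, NULL);
  printf("max work-group size: %zu, some cache: %llu bytes\n",
         max_wg, (unsigned long long)cache_bytes);
}

/* What CUDA will tell us: the L2 size explicitly, plus an occupancy
   estimate for a specific kernel and block size, from which we can work
   out roughly how many threads saturate the device. */
void query_cuda(int dev, const void *kernel, int block_size) {
  int l2_bytes, sm_count, blocks_per_sm;
  cudaDeviceGetAttribute(&l2_bytes, cudaDevAttrL2CacheSize, dev);
  cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, dev);
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, kernel,
                                                block_size, 0);
  printf("L2: %d bytes, threads to fill the GPU: %d\n",
         l2_bytes, sm_count * blocks_per_sm * block_size);
}
```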
Another problem is that when you are talking to a GPU you are going through an API, which has some overhead on the host, and in some cases we can see that a program is slower for no reason we can attribute to the code actually running on the GPU; it is purely due to some host-side overhead, where even just touching the GPU is slower for a reason we cannot really determine. Generally we see that OpenCL is slower: if you just run the minimal program that touches the GPU and does nothing, we observe that OpenCL is a little bit slower than both HIP and CUDA. This usually does not matter, because usually when you touch the GPU you tell it to do something costly, but there are some programs, or benchmarks with degenerate inputs, that do almost no work, and then you are basically just measuring the host-side overhead. One of the small data sets for the trace benchmark is like that, and nbody has some very large data sets and some very small ones, so obviously we get a big variance there. It is difficult to find the exact reason for this one.

Then we also have bounds checking. The language supports full bounds checking on the GPU, which is done with a code transformation that generates some slightly odd code, and we suspect that the code generators inside the OpenCL and HIP implementations are sometimes a little bit bad at handling it. So for no reason other than the bounds checking, which by itself is not expensive, we see that the code is turned into much worse code with OpenCL than with CUDA on the Nvidia stack - and on the AMD stack it is the opposite: the OpenCL code actually becomes faster than HIP. That is just because there is a black-box compiler after us that does something to the code we generate - I am never quite sure what - and in this case there is a difference we cannot really attribute to anything else.
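As an illustration of the kind of "slightly odd code" such a transformation produces - a hand-written sketch of the general idea, not Futhark's actual generated code, with names like `checked_read` invented for the example - every checked access turns into a flag write plus a dummy value instead of a trap, so control flow stays uniform and the host can report the failure after the kernel finishes:

```c
#include <stdint.h>

/* Sketch of a bounds-checked array read as a GPU compiler might emit it:
   a GPU thread cannot simply abort the program, so an out-of-bounds access
   records a failure code in (global) memory and returns a harmless dummy
   value; the host checks `*failure` once the kernel is done and discards
   the poisoned results if it is set. */
static inline int64_t checked_read(const int64_t *arr, int64_t n,
                                   int64_t i, int32_t *failure) {
  if (i < 0 || i >= n) {
    *failure = 1;   /* remember that some check failed */
    return 0;       /* dummy value; the final result will be discarded */
  }
  return arr[i];
}

/* A kernel body then becomes a chain of such guarded operations, which is
   the "slightly odd" control flow that the downstream OpenCL/HIP compilers
   sometimes optimize worse (or, on AMD, better) than plain unchecked code. */
int64_t sum_with_checks(const int64_t *arr, int64_t n,
                        const int64_t *idx, int64_t m, int32_t *failure) {
  int64_t s = 0;
  for (int64_t j = 0; j < m; j++)
    s += checked_read(arr, n, checked_read(idx, m, j, failure), failure);
  return s;
}
```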
And then there are things we simply cannot figure out. There are some programs where there is a performance difference between OpenCL and HIP and we cannot see why: we see that a GPU kernel runs faster or slower, and even inspecting the generated machine code I cannot figure out exactly what the reason is. I suspect it is something like register allocation - some similarly low-level thing - because it tends to happen for kernels that are compute-bound rather than memory-bound, where such things might have a major impact. But I really have not figured out why.

We can solve some of these issues, and they really fall into two categories: one is telling OpenCL to round properly, and the other is manually providing all the hardware information that OpenCL does not allow us to query otherwise. If we do that, then this becomes the summary table, and now for the vast majority of the benchmarks there is less than 10% difference between OpenCL and CUDA, or OpenCL and HIP. Most of the remaining significant cases where OpenCL is slower are because of the issue where we cannot run the best known scan implementation, since it requires memory and progress guarantees that we do not have in OpenCL.

So my conclusion is that it is not really difficult to target all of these APIs with a single code generator, but there are some performance differences. Performance portability is tricky, and most of it comes down to missing hardware information. I would say that if you only care about Nvidia and AMD, targeting OpenCL is probably not worth it, because CUDA and HIP are pretty similar and overall less trouble.

All right, that's my talk.

Perfect timing, so we have time for one question. Who has a question?

"Yeah, so there are different versions of OpenCL, and they differ in what they support. Which version did you use across the board?"

We deliberately stay close to the baseline - we do not rely on anything beyond the old core versions, apart from querying a few AMD-specific things. [remainder inaudible]