Thank you very much to everybody for attending this talk, and to the organisers of the conference and of this track. I'm Chema Casanova, and this is Maíra Canal. We are both presenting this talk about getting more juice out of the Raspberry Pi GPU.

A brief introduction: we work together in the graphics team at Igalia, and we have been working on the graphics stack for the Raspberry Pi for the last years. I mainly work on the user-space side, related to Mesa, and she has been working on the kernel side of the project, supporting the different parts.

So, what are we talking about? We are going to focus on the Raspberry Pi 5, which launched in 2023, around October, I think. It uses the next generation of the Broadcom VideoCore architecture, VideoCore 7. It is the next step after the Raspberry Pi 4, with several improvements: mainly, support for up to eight render targets, which allows us to get support for desktop OpenGL 3.1; it improves several operations, and there is more parallelization. We also have more flexibility in the way we handle the registers. And from the moment the hardware was available on the market, the source code was available too, on the kernel side and on the Mesa side, completely upstream.

I'm going to give a brief introduction to what the graphics stack looks like across the Raspberry Pi generations, because sometimes it is hard to identify which driver you are using. If you are using a Raspberry Pi 1, 2 or 3, which are based on the VideoCore 4 generation, on the kernel side we have the vc4 driver, which handles both the display and the rendering, and in user space we have a vc4 driver that supports OpenGL ES 2.0. When we change to the newer generations, the Raspberry Pi 4 and 5, which have VideoCore 6 and 7, we keep the vc4 name on the kernel side, but only for the display, the things you put on the screen in the end. On the rendering side there is a different kernel driver, v3d, which has the same name as the user-space driver, v3d, that we use for OpenGL and OpenGL ES. We also develop the Vulkan driver, which is v3dv. Those are the names of the different modules and drivers.
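To make that naming a bit more concrete, here is a minimal sketch (not shown in the talk) that asks the kernel which DRM driver backs each device node; the device paths are assumptions and may differ on your board.

```c
/* Minimal sketch: print the kernel DRM driver behind each device node.
 * Build with: gcc probe.c -o probe $(pkg-config --cflags --libs libdrm)
 * The /dev/dri/* paths below are assumptions; adjust for your system. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <xf86drm.h>   /* drmGetVersion(), drmFreeVersion() */

static void print_driver(const char *path)
{
    int fd = open(path, O_RDWR);
    if (fd < 0) {
        perror(path);
        return;
    }
    drmVersionPtr v = drmGetVersion(fd);
    if (v) {
        printf("%s -> %s (%s)\n", path, v->name, v->desc);
        drmFreeVersion(v);
    }
    close(fd);
}

int main(void)
{
    print_driver("/dev/dri/card0");
    print_driver("/dev/dri/card1");
    print_driver("/dev/dri/renderD128");
    return 0;
}
```

On a Raspberry Pi 4 or 5 you should see one node reporting vc4 and another reporting v3d, matching the display/render split described above.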
Now I will move to the part that we thought was the most interesting for this presentation: performance. On the user-space side we use the v3d driver for OpenGL and the v3dv driver for Vulkan, and both are part of Mesa, so all the common infrastructure that Mesa has available for its drivers is there for us.

In our case, the OpenGL driver supports OpenGL 3.1, although because of limitations of the hardware some features have to be emulated in some way, and we also support OpenGL ES 3.1, the version of OpenGL adapted for embedded devices. That was the situation when the Raspberry Pi 5 was launched, and it was the same on the Raspberry Pi 4. On the Vulkan side, when the product was launched we supported Vulkan 1.2, and more recently we got conformance for the new version, Vulkan 1.3. If you are interested in what we support at the API level, we went into detail in a previous conference presentation about the new extensions and what you can do with the v3d APIs.
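As a small aside (not part of the talk), here is a minimal sketch of how an application can check which Vulkan version the installed driver advertises, for example to confirm the jump from 1.2 to 1.3 mentioned above; it assumes a Vulkan loader and the v3dv ICD are installed.

```c
/* Minimal sketch: print the loader's instance version and the Vulkan
 * version reported by each physical device. */
#include <stdio.h>
#include <vulkan/vulkan.h>

int main(void)
{
    uint32_t api = 0;
    vkEnumerateInstanceVersion(&api);
    printf("Instance API version: %u.%u.%u\n",
           VK_API_VERSION_MAJOR(api), VK_API_VERSION_MINOR(api),
           VK_API_VERSION_PATCH(api));

    VkInstanceCreateInfo ici = { .sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO };
    VkInstance inst;
    if (vkCreateInstance(&ici, NULL, &inst) != VK_SUCCESS)
        return 1;

    uint32_t count = 0;
    vkEnumeratePhysicalDevices(inst, &count, NULL);
    VkPhysicalDevice devs[8];
    if (count > 8)
        count = 8;
    vkEnumeratePhysicalDevices(inst, &count, devs);

    for (uint32_t i = 0; i < count; i++) {
        VkPhysicalDeviceProperties p;
        vkGetPhysicalDeviceProperties(devs[i], &p);
        printf("%s: Vulkan %u.%u\n", p.deviceName,
               VK_API_VERSION_MAJOR(p.apiVersion),
               VK_API_VERSION_MINOR(p.apiVersion));
    }
    vkDestroyInstance(inst, NULL);
    return 0;
}
```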
Today we would like to focus on performance. Over the last year we worked on scenarios that are GPU-limited: working at high resolution, so that the limiting factor is the GPU rather than the CPU. We have been working with GFXBench, which is a common industry benchmark for analyzing the performance of mobile GPUs, and from December 2023 to the end of 2024 we got, on average, a 100% performance increase. We did this analysis on Android 15, because GFXBench is not available for ARM Linux, so we needed to use Android, reusing the work that lets us run our user-space driver there. Comparing a commit from the end of 2023 with one from just a month ago, we see that across the different demos we are getting, in some cases, improvements that take us from basically a slideshow to 13 frames per second.

Now we are going to go into detail about the main things we have been doing. First we need to understand a bit how a tile-based renderer works, which is the kind of architecture we find in the Broadcom GPUs. In our case, when you prepare a job to submit to the GPU, there are two stages. The first one is the bin job, which is in charge of the binning: it analyzes the geometry and identifies which parts of the framebuffer each draw call affects, so at the end it creates, for each tile (a small piece of the image), a list of the draw calls that affect it. With that list we go to the render job, which reads the tile lists and works tile by tile: it loads a tile, knows exactly which draw calls affect it, executes only those, does all the operations, and then stores the result. So the main way of getting more performance is to avoid these loads and stores: the more you split the work into different jobs, the more loads and stores you will do.

The first optimization that gave a big performance increase, around 40% on average, was one we discovered almost by chance. An extension that was being tested exposed an issue with an overly conservative implementation in the driver: if you were writing to a texture and then going to sample from it, the driver was splitting the work, finishing one job to store the result and starting another one to load it again. But if the framebuffer configuration was the same, you could reuse it: the texture contents were still available in the tile buffer of the GPU, so the results were already there. Removing that split gave us really nice results. You can see it here in a demo: on the left side the old version, on the right side the current one; you can feel the difference, going from 16 to 24 frames per second.

We also did a lot of compiler optimizations, maybe around fifteen merge requests: reducing stalls, improving the scheduling of the instructions, reducing the number of instructions. With that work we reduced the instruction count by almost 5% and got a performance improvement of about 3.5% on average. That was a lot of work, and not as spectacular as the previous case, which was a single commit once the issue was identified.

We also took advantage of the fact that, in a tile-based renderer, you can skip the tile store if you do not need the results at the end of the render. This usually happens with depth or stencil buffers: you need them while rendering, but in many cases you can avoid storing them afterwards. Applying some heuristics we could improve that behavior, and we got another 1% of improvement. It is not very visible in this demo, but Google Chrome uses this pattern a lot, so it improves the results there.
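To make the load/store discussion above concrete, here is a minimal Vulkan sketch (assumed example code, not taken from the talk or from the driver) of how an application declares that a depth attachment does not need to be loaded or stored, which is exactly what lets a tile-based driver skip those per-tile memory transfers.

```c
/* Minimal sketch: a depth attachment that is cleared on load and thrown
 * away after the render pass, so a tiler never has to read it from or
 * write it back to memory. */
#include <vulkan/vulkan.h>

VkAttachmentDescription make_depth_attachment(VkFormat depth_format)
{
    VkAttachmentDescription depth = {
        .format         = depth_format,
        .samples        = VK_SAMPLE_COUNT_1_BIT,
        /* Cleared at the start of the pass: no tile load from memory. */
        .loadOp         = VK_ATTACHMENT_LOAD_OP_CLEAR,
        /* Not needed after the pass: no tile store back to memory. */
        .storeOp        = VK_ATTACHMENT_STORE_OP_DONT_CARE,
        .stencilLoadOp  = VK_ATTACHMENT_LOAD_OP_DONT_CARE,
        .stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE,
        .initialLayout  = VK_IMAGE_LAYOUT_UNDEFINED,
        .finalLayout    = VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL,
    };
    return depth;
}
```

When the application does not express this explicitly, the driver-side heuristics described above try to reach the same result.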
Another interesting improvement was the work on the early fragment test optimization, which is supported by the hardware, but there are situations where you cannot use it. One of them is a draw call whose shader has a discard operation, which means it may end up not writing anything to the framebuffer; in that case you normally need to disable the optimization. But there is a scenario that still allows you to use it: when depth writes are disabled. Handling that case gave us a 14% performance improvement. Everything I am listing accumulates, although not in a linear way: 40% in one case, 14% in another, and so on.

The last one I would like to comment on affects some kinds of jobs, usually in scenarios with transform feedback, where we are only interested in the results of the geometry processing. The application can disable rasterization, so you do not need to execute the fragment shading at the end, because you are not interested in that result. In the case of Manhattan this happens a lot: for a transform feedback operation with rasterization disabled, none of the draw calls produce fragments, so you can also skip the load and the store of the tile buffers. You may have five transform feedback calls per frame, which would otherwise imply five loads and five stores per frame that you are never going to use. Taking that into account improved performance a lot. Manhattan is one of the demos that improved the most: on the left side you can see it before all the optimizations I have commented on today, and on the right side the latest result, which in this case reaches 230% of the original performance.

These results are taken from traces of the GFXBench execution, because it is easier to compare the same trace, and we have the guarantee that the same frames are drawn. In this chart you can see the different commits and the performance improvement at each moment, and in the case of Manhattan the improvement is huge: together with what I showed on the previous slide, we are getting close to 300% of the performance we had at the end of 2023.
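As an illustration of the early fragment test case above, here is a small hypothetical sketch of the kind of decision involved; it is not the actual v3d/v3dv code, just the reasoning expressed as a function.

```c
/* Hypothetical sketch of the decision described in the talk: a fragment
 * shader with discard normally forces the driver to disable early
 * fragment tests, but if depth writes are off, a discarded fragment can
 * no longer corrupt the depth buffer, so the early test can stay on. */
#include <stdbool.h>

bool can_keep_early_fragment_tests(bool shader_has_discard,
                                   bool shader_writes_depth,
                                   bool depth_writes_enabled)
{
    if (shader_writes_depth)
        return false;   /* the shader replaces the tested depth value */
    if (shader_has_discard && depth_writes_enabled)
        return false;   /* a discarded fragment must not update depth */
    return true;        /* covers the discard + no-depth-write case */
}
```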
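And to show the transform feedback pattern that the Manhattan case relies on, here is a minimal OpenGL ES sketch (program and buffer setup omitted; xfb_prog and num_points are placeholders): with rasterizer discard enabled nothing reaches the framebuffer, which is what lets the driver drop the tile loads and stores for these draws.

```c
/* Minimal sketch: capture transformed geometry without producing any
 * fragments, the case where tile loads/stores can be skipped entirely. */
#include <GLES3/gl3.h>

void capture_geometry(GLuint xfb_prog, GLsizei num_points)
{
    glUseProgram(xfb_prog);
    glEnable(GL_RASTERIZER_DISCARD);      /* geometry only, no fragments */
    glBeginTransformFeedback(GL_POINTS);
    glDrawArrays(GL_POINTS, 0, num_points);
    glEndTransformFeedback();
    glDisable(GL_RASTERIZER_DISCARD);
}
```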
We also did a lot of work on performance tooling, which Maíra is going to explain now; I hand her the floor.

Okay. So, the thing about studying performance is that we also need tools that help us measure the current performance and find the scenarios where we would like to do better. The first thing we did was CPU jobs and timestamp queries. At the last FOSDEM, Chema and I talked about how we implemented CPU jobs, because there are some Vulkan commands that we cannot perform on the GPU alone, so we have CPU jobs, and we moved them from user space to kernel space in order to avoid GPU flushes and CPU stalls. In 2023 we landed timestamp queries and the CPU jobs in the kernel, for the Vulkan driver only, but last year we were able to support timestamp queries in Mesa for the GL driver as well. Using timestamp queries is really useful when analyzing performance, because it helps us identify jobs that are taking longer; if a job is taking longer, we can analyze it and think about new ways to improve it, and that is probably a scenario that also shows up in other applications. With timestamp queries we get timestamps that are accurately synchronized with the graphics pipeline, so they are really helpful for evaluating how long the jobs take.

We also implemented Perfetto support. Perfetto is an open-source stack for performance instrumentation, which means we can access system-level information and also application-level traces, so we can analyze data from basically the whole system. Mesa has Perfetto data sources as well, which means we can add producers for GPU information, such as frequency, utilization and performance counters, and this gives us a unified timeline to work on performance debugging and performance tuning. You can see in this slide that you get a system-wide view on a single timeline, which is very useful: at the top you have CPU information, like the CPU frequency, and more that could not fit here. I only expanded the CPU information and the DRM fences, because we use fences to synchronize the jobs in the kernel, and you can see all the fences being used by the jobs. Below that we have information from Mesa coming from glmark2, the well-known OpenGL benchmark, so you can see when a job was submitted, and from the fences you can tell when a fence was signalled at the end of a job, and you can see the operations where we are waiting for a fence in user space. This is great, because with this system-wide view we can spot the places where we should start thinking about improvements.
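For reference, this is roughly what the timestamp queries mentioned above look like from the application side; a minimal Vulkan sketch with device and command buffer creation omitted, and all names being placeholders.

```c
/* Minimal sketch: bracket a piece of GPU work with two timestamps and
 * convert the delta to milliseconds after the work has finished. */
#include <stdio.h>
#include <vulkan/vulkan.h>

void measure_job(VkDevice dev, VkCommandBuffer cmd, VkQueryPool pool,
                 float timestamp_period /* VkPhysicalDeviceLimits::timestampPeriod */)
{
    vkCmdResetQueryPool(cmd, pool, 0, 2);
    vkCmdWriteTimestamp(cmd, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, pool, 0);
    /* ... record the work we want to time ... */
    vkCmdWriteTimestamp(cmd, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, pool, 1);

    /* After submitting the command buffer and waiting for it: */
    uint64_t ticks[2];
    vkGetQueryPoolResults(dev, pool, 0, 2, sizeof(ticks), ticks,
                          sizeof(uint64_t),
                          VK_QUERY_RESULT_64_BIT | VK_QUERY_RESULT_WAIT_BIT);
    printf("job took %.3f ms\n",
           (ticks[1] - ticks[0]) * timestamp_period / 1e6);
}
```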
Now, jumping to the kernel work. Last year we started enabling a feature of V3D that had historically gone unused. The V3D GPU has support for 4-kilobyte pages, for 64-kilobyte pages that are called "big pages", and for 1-megabyte pages that are called "super pages". Enabling them looks very simple: you just need a contiguous memory block of, for example, one megabyte, and you add the page table entries. But the Linux driver didn't have support for it. You can imagine why it is beneficial, because it is just like on the CPU: we can improve performance with huge pages by reducing MMU translation misses, especially in memory-intensive applications, and nowadays shaders work with large buffer objects, so this is important.

But we had a very important issue: we couldn't get a contiguous block of memory using shmem, which is what DRM, the GPU subsystem in the kernel, uses by default. So we had to think about a solution. By default, tmpfs and shmem allocate memory in page-size chunks, which means that if your page size is 4 kilobytes, you are going to allocate 4-kilobyte chunks, but we needed a contiguous block of memory bigger than one page. So we decided to create a tmpfs mount point with the huge=within_size option, which means we enable transparent huge page support on that mount point. Transparent huge pages (THP) have existed in the kernel for a while; they are basically an abstraction that helps us take advantage of huge pages without explicitly managing huge pages in virtual memory. The application just sees that memory as a contiguous block and does not need to understand what is going on underneath, which is a bit different from using explicit huge pages. With that contiguous block of memory, it is just a matter of placing the page table entries in the right places, setting the right bits, and it is done.

We also worked on reducing the virtual address alignment to 4 kilobytes. This helps us reduce memory pressure on the Raspberry Pi, because memory is very limited on embedded devices; we were using a 128-kilobyte virtual address alignment, which was wasting part of our virtual address space. Reducing it, and reducing the memory pressure, was really useful: we got an average improvement of 1.33%, which is not that impressive, but we saw a significant performance boost in some emulation cases. Just remember that on an embedded device it is important to use the madvise policy when enabling transparent huge pages, otherwise they can consume a lot of memory, which is something we really want to avoid. This is a demo running Burnout 3 in a PS2 emulator, and you can see the difference: just this small feature in the kernel makes a huge difference in applications that use big buffers.
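As a rough user-space analogue of the shmem change described above (this is not how the kernel driver does it internally, just a way to see the same tmpfs option in action), a mount point with huge=within_size can be created like this; the path and size are placeholders and the call needs root.

```c
/* Minimal sketch: mount a tmpfs whose files are backed by transparent
 * huge pages once they are large enough (huge=within_size). */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    if (mount("tmpfs", "/mnt/thp-tmpfs", "tmpfs", 0,
              "huge=within_size,size=256M") != 0) {
        perror("mount");
        return 1;
    }
    printf("THP-backed tmpfs mounted at /mnt/thp-tmpfs\n");
    return 0;
}
```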
Apart from that, using huge pages is really important, but we had an issue with them: by default, THP uses huge pages of PMD size, which on arm64 means 2 megabytes. As you can see, we are only interested in 4 kilobytes, 64 kilobytes and 1 megabyte, so we didn't really need 2-megabyte pages, and they were leading to unnecessary fragmentation in our system. So we decided to use multi-size THP (mTHP), which allows us to use huge pages only from 64 kilobytes up to 1 megabyte; we just select the range of page sizes we want to support. mTHP is something that already exists in the kernel and gives us the ability to allocate memory in blocks that are bigger than the base page size but smaller than the traditional PMD size.

We also created two kernel command-line parameters to set the policies we want for these pages, because with transparent huge pages and multi-size THP on shmem we had the problem that we also had to configure THP through sysfs, and as you know, every time you reboot, sysfs goes back to the defaults, which is not great. When you want to turn this into a product for a customer, you need the configuration with the best performance to be set for them already. So we created these kernel command-line parameters where you can set the policy for transparent huge pages, just as you can do for tmpfs, for example, but for shmem, and you can use different policies for different page sizes: you can configure, for example, 16 kilobytes to 64 kilobytes to always use THP. This is really useful whenever you have an application backed by shmem; in our case we use shmem to back our buffer objects, so it is really useful for us, but it can have other applications.
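To give an idea of the knobs involved (a sketch assuming a recent kernel with mTHP support; the shmem-specific control and the new boot parameters mentioned above are newer still, so check Documentation/admin-guide/mm/transhuge.rst for the exact names on your kernel), the per-size policies can be set through sysfs like this.

```c
/* Minimal sketch: prefer 64 kB mTHP allocations over the 2 MB PMD size. */
#include <stdio.h>

static int write_sysfs(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return -1; }
    fprintf(f, "%s\n", value);
    fclose(f);
    return 0;
}

int main(void)
{
    /* Keep PMD-sized (2 MB on arm64) THP limited to madvise... */
    write_sysfs("/sys/kernel/mm/transparent_hugepage/enabled", "madvise");
    /* ...but allow 64 kB mTHP allocations unconditionally. */
    write_sysfs("/sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled",
                "always");
    /* Shmem-backed memory (like the GPU buffer objects discussed here) has
     * its own knob on newer kernels, e.g.
     * /sys/kernel/mm/transparent_hugepage/hugepages-64kB/shmem_enabled */
    return 0;
}
```

Setting the same policies on the kernel command line, as described in the talk, has the advantage that it survives every reboot without having to be reapplied.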
So, that's all. Questions?

[APPLAUSE]

Q: Hi, did you have to make any trade-offs while writing the compiler optimizations, say longer compilation times, or higher register pressure, or something of the sort?

A: Well, there are a lot of different heuristics there. It depends: you can tune the compiler to be faster at compiling and not try different strategies, or you can go for the best performance in some cases. The default is that you try to get the maximum number of threads running, and the system already has a shader cache, so once a shader has been built you get the cached version and you don't pay that cost again.

Q: [Partly inaudible.] The question is about the benchmarks: the CPU revision changes between boards, and the 16-gigabyte model changes the defaults. Have you tried this on a 16-gigabyte device?

A: No, we haven't tried that product yet.

Q: I think there is a difference with respect to the 8-gigabyte one, because of the memory handling.

A: Maybe; I don't know about that. We know that there is a difference in the memory handling there and that there is work on that, but I don't know if it is already available.

Q: Hi, how much support, if any, did you get from Broadcom or the Raspberry Pi Foundation to implement this?

A: We are working for Raspberry Pi, in the end. Broadcom provides the documentation; we can read the specs.

Q: Hello, thank you, nice presentation. If you compare the proprietary drivers that Broadcom gives to their customers with the open-source drivers in Mesa that you use here, are they the same? Do you see a big difference in performance?

A: We cannot check the difference, because the drivers are not for the same kernel; they are different things in that case, so we don't have a comparison.

Q: Are they on par, or do you not know at all?

A: We don't know. We know from their numbers that there is room for improvement, but we cannot run the same benchmark on both configurations, because we don't have access to the other one and we would need to work on the same platform.

Q: One more question: do you have a way to ask the GPU driver what resources each process, per process ID, is using on the GPU? The use case is an application manager running on the system.

A: We landed, I think, one or two years ago, the GPU stats, so you can get that information from fdinfo. That was kernel 6.8; it is available upstream, and downstream in Raspberry Pi OS it is already available. You can use gputop, I think; it shows that information.

Q: So those tools are arriving. Thank you.
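As a pointer for that last answer, this is roughly where tools like gputop get their per-client data from: every DRM file descriptor exposes drm-* statistics through fdinfo. A minimal sketch (the device path is an assumption, and the exact keys depend on the kernel driver):

```c
/* Minimal sketch: open a DRM node and dump the drm-* statistics that the
 * kernel exposes for this client through /proc/<pid>/fdinfo. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/dri/renderD128", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    char path[64], line[256];
    snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", fd);
    FILE *f = fopen(path, "r");
    if (!f) { perror(path); close(fd); return 1; }

    /* Lines like drm-driver, drm-client-id and per-engine/memory counters. */
    while (fgets(line, sizeof(line), f))
        if (strncmp(line, "drm-", 4) == 0)
            fputs(line, stdout);

    fclose(f);
    close(fd);
    return 0;
}
```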