WEBVTT

00:00.000 --> 00:28.200
Nvidia, with its GPUs and its CUDA software ecosystem, has between 70 and 95% of the world's AI

00:28.200 --> 00:31.600
chip market.

00:31.600 --> 00:41.760
If AI is going to thrive, we need a wider ecosystem of both hardware and software.

00:41.760 --> 00:46.760
And the question I put to you today is: how?

00:46.760 --> 00:50.200
I'm Jeremy Bennett.

00:50.200 --> 00:57.200
Today we're going to give you a step towards the answer, using open source.

00:57.200 --> 01:02.560
I hope that you'll come away from this with an understanding of how you can take a new chip

01:02.560 --> 01:09.680
design for AI and bring up the software ecosystem you need so that you can run

01:09.680 --> 01:12.920
all your favorite AI systems.

01:12.920 --> 01:17.720
I'm joined by my colleague William Jones, who will take you through the practical, real-world

01:17.720 --> 01:18.720
side of this.

01:18.720 --> 01:22.360
I'm going to give you an overview to start.

01:22.360 --> 01:26.520
We're focusing on neural networks, and I am fully aware that neural networks are not

01:26.520 --> 01:30.640
the whole of machine learning and AI, but they're the big one.

01:30.640 --> 01:37.040
The way all these systems work is that the neural network is represented as a graph, and the software,

01:37.040 --> 01:43.920
whether it's PyTorch, TensorFlow or whatever you're using, may do a bit of graph-level

01:43.920 --> 01:47.920
transformation to make the graph a bit more efficient, but fundamentally sits there

01:47.920 --> 01:52.840
walking over that graph, looking at the nodes, and the nodes tell it what the arguments

01:52.840 --> 01:57.560
are, the tensors, the glorified matrices, which are the data, and what the operation

01:57.560 --> 02:02.920
to perform is, whether it's an add or a matrix multiplication or a convolution.

02:02.920 --> 02:07.560
These sit in a world that is a host-and-accelerator-based system.

02:07.560 --> 02:13.120
The host may be an x86 today; the accelerator almost certainly, unfortunately, is Nvidia,

02:13.120 --> 02:18.760
not because Nvidia is bad, but because they've got 90% of the market.

02:18.760 --> 02:23.800
The dispatcher sits in there and works out which of those to run on.

02:23.800 --> 02:24.920
Do I run on the host?

02:24.920 --> 02:26.560
Do I run on the accelerator?

02:26.560 --> 02:30.960
Or do I run across both of them?

02:30.960 --> 02:35.920
And we're focused particularly today on the dispatcher and how it makes

02:35.920 --> 02:40.680
that decision and pushes software onto the accelerator.

02:40.680 --> 02:47.440
And that works for microcontrollers and standard software: for PyTorch you've

02:47.440 --> 02:53.720
got ExecuTorch, for TensorFlow you've got LiteRT, and there it's probably handing single nodes off to be accelerated

02:53.720 --> 02:58.760
by a special accelerator, and it's probably only doing that with some operations.

02:58.760 --> 03:03.280
But it goes right up to huge co-processors.

03:03.280 --> 03:05.080
And we've worked on both these scenarios.

03:05.080 --> 03:06.080
We've worked on ExecuTorch.

03:06.080 --> 03:12.120
We've worked on one of these, which is a RISC-V chip with more than 1,000 cores on

03:12.120 --> 03:13.120
it.

03:13.120 --> 03:17.040
And in this case, the dispatcher's got to work out how to get work across all those cores.

03:17.040 --> 03:21.400
It's probably trying to handle multiple operations at one time, postponing delivery of

03:21.400 --> 03:25.480
results, and so forth.
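NOTE (editor)
The passage above describes a dispatcher walking the graph and choosing, node by node, between host and accelerator. Below is a minimal sketch of that decision, not from the talk: the Node type, the op names and the accel_ops set are all invented for illustration (C++).

#include <iostream>
#include <set>
#include <string>
#include <vector>

// A node names an operation and the tensors (here just IDs) it consumes.
struct Node {
    std::string op;        // e.g. "add", "matmul", "conv2d"
    std::vector<int> args; // tensor IDs feeding this node
};

enum class Device { Host, Accelerator };

// The dispatch decision: the accelerator only implements a few operations,
// and everything else falls back to the host.
Device dispatch(const Node &n, const std::set<std::string> &accel_ops) {
    return accel_ops.count(n.op) ? Device::Accelerator : Device::Host;
}

int main() {
    std::set<std::string> accel_ops{"add", "matmul"};
    std::vector<Node> graph{{"add", {0, 1}}, {"conv2d", {2, 3}}, {"matmul", {4, 5}}};
    for (const Node &n : graph) {
        Device d = dispatch(n, accel_ops);
        std::cout << n.op << " -> "
                  << (d == Device::Accelerator ? "accelerator" : "host") << "\n";
    }
}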
03:25.480 --> 03:31.080
So William is now going to talk you through a real open-source example. William is my head

03:31.080 --> 03:32.080
of AI.

03:32.080 --> 03:37.320
He is also responsible for the UK's national guidelines on best practice in AI for the

03:37.320 --> 03:40.080
electronic systems industry.

03:40.080 --> 03:46.680
William... there's now going to be a brief pause while we change over the microphone.

03:46.680 --> 03:57.960
I'm going to find which point I'm at, but I think you can...

03:57.960 --> 03:58.960
Cool.

03:58.960 --> 04:00.960
It's the earlier one, OK.

04:00.960 --> 04:04.520
Yes, so the project I want to talk you through is a student project that we've

04:04.520 --> 04:08.280
been doing for the last four or five months, I think.

04:08.280 --> 04:13.520
So every year, we do a student project with Southampton University in the UK, and we host

04:13.520 --> 04:19.880
a set of students to do some sort of relevant industrial project to help

04:19.880 --> 04:21.800
them along with their degrees.

04:21.800 --> 04:26.800
And this year, we had six students for ten weeks, and we asked them to go away and integrate,

04:26.800 --> 04:32.160
well, go through the process and create a demonstrator project of integrating a new accelerator

04:32.160 --> 04:34.520
into the PyTorch framework.

04:34.520 --> 04:39.320
And, by the way, we were hoping to have pictures of the students up here, but we

04:39.320 --> 04:41.280
didn't get that sorted in time, so we just have their names.

04:41.280 --> 04:45.080
And they deserve a lot of credit, because this was an incredibly good project.

04:45.080 --> 04:51.600
So, particularly, what we wanted to do with this project is: we asked the students

04:51.600 --> 04:56.560
to bring up a RISC-V core as the accelerator, on an FPGA of their choice.

04:56.560 --> 05:01.560
We asked them to go into PyTorch and create a new device in PyTorch and modify the dispatcher

05:01.560 --> 05:03.360
to dispatch to this device.

05:03.360 --> 05:06.880
But we also, most importantly, asked them to go away and create a toolchain between

05:06.880 --> 05:10.040
the two that would let this dispatch to any hardware

05:10.040 --> 05:12.520
and sort of work in a hardware-agnostic way.

05:12.520 --> 05:14.400
And this is in many ways the tricky bit.

05:14.400 --> 05:18.000
And in terms of the slides that Jeremy put up earlier, we're sort of looking at this

05:18.000 --> 05:19.000
bottom bit here.

05:19.000 --> 05:20.000
The students had to go in.

05:20.000 --> 05:23.080
They had to modify the dispatcher in PyTorch and create a toolchain to connect it

05:23.080 --> 05:27.320
to, well, it's an accelerator, but it's obviously not much of an accelerator, because

05:27.320 --> 05:33.640
it's just a RISC-V core, because the ultimate goal of this is just a demonstrator.

05:33.640 --> 05:37.680
And the sort of end goal of this for the students, or their stretch goal at the end, was

05:37.680 --> 05:43.040
to try and get ResNet-18 working on this RISC-V core as an accelerator.

05:43.040 --> 05:48.760
Now, we asked the students to do this with the oneAPI ecosystem and the oneAPI Construction

05:48.760 --> 05:49.760
Kit.

05:49.760 --> 05:54.200
So if you haven't met this, this is a construction kit that's largely based on the efforts

05:54.200 --> 05:59.680
of SYCL and OpenCL, which implement a model of heterogeneous computing.
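NOTE (editor)
A hedged sketch of what the "create a new device in PyTorch and modify the dispatcher" step above can look like, using PyTorch's out-of-tree backend key (PrivateUse1) and its C++ registration API. This is not the students' code; my_accel_add is a hypothetical placeholder, and a real backend would also need an allocator, device guard and copy kernels registered before tensors could actually live on the device.

#include <ATen/ATen.h>
#include <torch/library.h>

// Hypothetical kernel for aten::add.Tensor on the new device. As a
// placeholder it just computes on the CPU; a real version would hand the
// tensors to the toolchain that talks to the accelerator.
at::Tensor my_accel_add(const at::Tensor &a, const at::Tensor &b,
                        const at::Scalar &alpha) {
    return at::add(a.cpu(), b.cpu(), alpha);
}

// Hook the kernel into the dispatcher under the out-of-tree backend key.
TORCH_LIBRARY_IMPL(aten, PrivateUse1, m) {
    m.impl("add.Tensor", &my_accel_add);
}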
05:59.680 --> 06:03.040
And the way that this would work, and the way that we were expecting the students to solve

06:03.040 --> 06:04.040
this project,

06:04.040 --> 06:07.840
though they did something a little bit different, is that they would go into PyTorch.

06:07.840 --> 06:13.960
They would modify the dispatcher to dispatch to this new piece of hardware for the operations that

06:13.960 --> 06:14.960
they cared about.

06:14.960 --> 06:18.360
For the ResNet-18 ones, they would provide SYCL implementations.

06:18.360 --> 06:24.000
I think this was going to be add, batch norm, and 2D convolution.

06:24.000 --> 06:30.000
And then with the SYCL compiler, which is the DPC++ compiler in the oneAPI Construction

06:30.000 --> 06:34.000
Kit, they would produce a multi-architecture

06:34.000 --> 06:39.840
binary: they'd have the code on the host that is driving what is happening and the code

06:39.840 --> 06:44.520
on the accelerator, which is obviously what is actually doing things.

06:44.520 --> 06:50.000
And this multi-architecture binary would call out to the OpenCL API, which would implement

06:50.000 --> 06:52.280
this heterogeneous computing.

06:52.280 --> 06:57.160
And in the oneAPI Construction Kit, there is an extremely generic and simple implementation

06:57.160 --> 07:03.960
of this OpenCL API, which calls out to a really basic low-level hardware abstraction layer.

07:03.960 --> 07:08.440
This just defines, I think, six functions, covering things like writing

07:08.440 --> 07:12.680
to the device, reading from the device, and things like this.

07:12.680 --> 07:15.160
And the scope of the student project was essentially that they'd have to go in, they'd have

07:15.160 --> 07:21.320
to do a little bit of work on producing the multi-architecture binary, and probably

07:21.320 --> 07:23.840
a bit more work on the hardware abstraction layer.

07:23.840 --> 07:26.800
And then they'd be able to go through this, demonstrate it could all work, stitch everything

07:26.800 --> 07:30.400
together, and it would be a nice project.

07:30.400 --> 07:35.080
And ultimately, the goal of this, if the students had time, or if we were doing this

07:35.080 --> 07:39.160
for a real project, because this is, you know, a familiar way for us to

07:39.160 --> 07:40.160
approach a project like this:

07:40.160 --> 07:44.560
SYCL is a very mature toolchain for doing this type of thing.

07:44.560 --> 07:47.880
You start with this generic, simple implementation of OpenCL, and you develop something

07:47.880 --> 07:53.520
more target-specific and rich over time.

07:53.520 --> 07:56.680
Now, our students actually ended up doing something a little bit different to this. About

07:56.680 --> 07:59.320
four weeks into the project, the students came to us.

07:59.840 --> 08:02.400
And I think they felt they were running out of time a little bit.

08:02.400 --> 08:06.760
It's a tough thing to get done in 60 engineer-weeks when you're still

08:06.760 --> 08:10.240
a young student and you haven't done this type of thing before.

08:10.240 --> 08:14.600
And they basically said, look, we think we can do this in an even quicker and simpler way,

08:14.600 --> 08:18.600
which gives us an even better chance of success, and we were very proud of them for doing this.

08:18.600 --> 08:22.200
It's not easy to have conversations like this with your sort of friendly industrial customers

08:22.200 --> 08:24.040
at the best of times.
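NOTE (editor)
A minimal SYCL sketch of the kind of operation implementation described above, an element-wise add; the project's real kernels (add, batch norm, 2D convolution) are the students' own. Compiled with a SYCL compiler such as DPC++, the host-side driving code and the device-side kernel end up in one multi-architecture binary, with the runtime calling out through OpenCL underneath.

#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
    sycl::queue q; // the SYCL runtime picks a device (e.g. via OpenCL)

    std::vector<float> a(1024, 1.0f), b(1024, 2.0f), c(1024, 0.0f);
    {
        sycl::buffer<float> A(a.data(), sycl::range<1>(a.size()));
        sycl::buffer<float> B(b.data(), sycl::range<1>(b.size()));
        sycl::buffer<float> C(c.data(), sycl::range<1>(c.size()));

        q.submit([&](sycl::handler &h) {
            sycl::accessor ra(A, h, sycl::read_only);
            sycl::accessor rb(B, h, sycl::read_only);
            sycl::accessor wc(C, h, sycl::write_only);
            // The device-side part of the "add" operation.
            h.parallel_for(sycl::range<1>(1024), [=](sycl::id<1> i) {
                wc[i] = ra[i] + rb[i];
            });
        });
    } // buffer destruction copies the result back into c

    std::cout << "c[0] = " << c[0] << "\n"; // prints 3
}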
08:24.040 --> 08:31.120
And they did a really good job of not just explaining what they needed,

08:31.120 --> 08:33.960
but what they proposed to do about the problem.

08:33.960 --> 08:39.040
And what the students ended up doing is writing a sort of micro hardware

08:39.040 --> 08:42.640
abstraction layer that sidestepped quite a lot of what we'd originally expected

08:42.640 --> 08:44.560
them to have to do.

08:44.560 --> 08:50.760
So the flow that I described before, how we were expecting them to solve the problem,

08:50.760 --> 08:55.800
was that there would be implementations of the operations in SYCL.

08:55.800 --> 09:01.280
The DPC++ compiler would take the OpenCL library and compile this into the multi-architecture

09:01.280 --> 09:05.440
binary, where things would happen.

09:05.440 --> 09:08.680
And this multi-architecture binary would make calls into the OpenCL API, which would

09:08.680 --> 09:11.840
drive how things happened.

09:11.840 --> 09:15.320
But what the students did, instead of going through this whole process

09:15.320 --> 09:19.960
and implementing this whole hardware abstraction layer and way of interfacing with the hardware

09:20.040 --> 09:24.160
through the ecosystem, is they just wrote a little interposer on the OpenCL library that

09:24.160 --> 09:29.080
captured the two calls that they were actually interested in dealing with, which

09:29.080 --> 09:34.480
are setting the arguments for an operation and invoking the operation.

09:34.480 --> 09:38.400
And they just captured those and then went away, dispatched to the hardware on their own

09:38.400 --> 09:40.560
and got the results back.

09:40.560 --> 09:44.480
And this was a really good solution to what we'd asked them to do.

09:44.480 --> 09:48.960
And in many ways it was better than what we'd asked them to do,

09:48.960 --> 09:52.560
or it was a better demonstrator, because it's even more minimal than what we'd originally

09:52.560 --> 09:56.720
expected, and that is what this was supposed to be: a sort of minimum viable

09:56.720 --> 10:02.080
demonstrator.

10:02.080 --> 10:05.560
In a bit more detail, because their solution was interesting:

10:05.560 --> 10:12.520
the students actually ended up using TCP to implement the communication with their

10:12.520 --> 10:13.520
FPGA accelerator.

10:13.520 --> 10:15.960
And it was quite a nice little system.

10:15.960 --> 10:23.600
And the students ended up using this Xilinx Zynq board as the FPGA host of the RISC-V

10:23.600 --> 10:24.600
core.

10:24.600 --> 10:28.120
And this actually is one of these FPGA boards that comes with a little processing system

10:28.120 --> 10:31.400
on it; it had a few Arm cores and a few peripherals, and it meant that the students were

10:31.400 --> 10:36.760
able to, almost overnight, set up TCP communication with it and use

10:36.760 --> 10:39.400
that to offload things to the core.

10:39.400 --> 10:42.960
Now, obviously, this isn't a realistic thing that you'd do with a real chip.

10:42.960 --> 10:48.240
You'd probably use PCIe or something, but it was a good demonstrator.

10:48.240 --> 10:53.000
It meant that the flow for the students' work was that they have these SYCL implementations

10:53.000 --> 10:59.720
of the ResNet operations; their OpenCL interposer would capture the arguments and the data

10:59.720 --> 11:00.720
from these.
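NOTE (editor)
A hedged sketch of the interposer idea just described: intercept the two OpenCL calls the students cared about, clSetKernelArg and clEnqueueNDRangeKernel, and divert them to your own offload path instead of a real OpenCL driver. It assumes the shim is built as a shared library loaded ahead of the real OpenCL library (e.g. via LD_PRELOAD on Linux); offload_to_fpga is a hypothetical hook standing in for their TCP transport.

#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h>
#include <cstdio>
#include <vector>

static std::vector<std::vector<char>> g_args; // captured argument bytes

// Capture the argument bytes instead of passing them on to a driver.
extern "C" cl_int clSetKernelArg(cl_kernel, cl_uint index, size_t size,
                                 const void *value) {
    if (g_args.size() <= index) g_args.resize(index + 1);
    if (value)
        g_args[index].assign(static_cast<const char *>(value),
                             static_cast<const char *>(value) + size);
    return CL_SUCCESS;
}

// Instead of enqueueing on an OpenCL device, ship the captured arguments
// off to our own backend and fetch the results ourselves.
extern "C" cl_int clEnqueueNDRangeKernel(cl_command_queue, cl_kernel, cl_uint,
                                         const size_t *, const size_t *global,
                                         const size_t *, cl_uint,
                                         const cl_event *, cl_event *) {
    std::printf("offloading kernel: %zu args, global size %zu\n",
                g_args.size(), global ? global[0] : 0);
    // offload_to_fpga(g_args); // hypothetical transport hook goes here
    return CL_SUCCESS;
}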
11:00.720 --> 11:06.640
It would send these over TCP to the FPGA board's processing system.

11:06.640 --> 11:08.200
That would put them into shared memory.

11:08.200 --> 11:11.440
The shared memory would be operated on by the RISC-V core.

11:11.800 --> 11:15.680
The output went back into shared memory and everything was sent all the way back.

11:16.680 --> 11:21.880
And yeah, it's a really nice, elegant solution that the students came up with for this problem.

11:21.880 --> 11:27.440
And it actually sidesteps one of the issues with doing things the way we'd

11:27.440 --> 11:30.560
originally proposed a few slides back.

11:30.560 --> 11:34.960
In that way of doing things, building things up via the sort of simple,

11:34.960 --> 11:38.240
very basic hardware abstraction layer that's provided in the Construction Kit,

11:38.360 --> 11:44.240
you actually end up compiling your whole program down for your RISC-V core,

11:44.240 --> 11:47.880
sending that to the accelerator and executing it there. With the students' solution,

11:47.880 --> 11:49.080
you don't have to do that.

11:49.080 --> 11:51.920
You can just send the data, which is a nice improvement.

11:54.120 --> 11:59.360
So in terms of the overall success of the project, to sort of sum up,

11:59.360 --> 12:04.120
we set the students the task of creating a minimum viable product, basically integrating

12:04.160 --> 12:10.320
a new piece of hardware and a toolchain with an accelerator framework, which was quite

12:10.320 --> 12:14.920
an ask, and we asked them to see if they could get ResNet-18 working.

12:14.920 --> 12:18.760
So obviously, they achieved what we'd asked them to do in getting a demo to work.

12:18.760 --> 12:19.960
That was fantastic.

12:19.960 --> 12:22.320
All of this is available for you.

12:22.320 --> 12:28.160
If not now, it will shortly be available as open source, after their work has been marked by their examiners.

12:28.160 --> 12:29.640
So they've achieved that.

12:29.720 --> 12:32.320
They very nearly got ResNet-18 working as well.

12:32.320 --> 12:36.360
We initially wanted to get the three dominant operations in ResNet-18 working,

12:36.360 --> 12:39.000
which were add, batch norm and 2D convolution.

12:40.720 --> 12:48.440
The students got add and batch norm done very well, and didn't quite finish off 2D convolution

12:48.440 --> 12:50.920
by the time they had to stop work and write up.

12:50.920 --> 12:55.760
They got a very, very long way towards finishing it.

12:55.760 --> 12:57.120
So that was very good.

12:57.240 --> 13:00.520
And I spoke earlier about making a hardware-agnostic solution as well,

13:00.520 --> 13:06.360
so that they'd be able to basically substitute out any piece of hardware for any other piece of hardware

13:06.360 --> 13:09.800
that was supported by the SYCL toolchain that they designed.

13:09.800 --> 13:10.720
And that worked as well.

13:10.720 --> 13:16.880
The students initially started working with, I think, the Xilinx MicroBlaze-V core for their FPGA,

13:16.880 --> 13:21.960
a RISC-V core that Xilinx provide that does actually work particularly well on their FPGAs.

13:21.960 --> 13:26.720
And they were able to demonstrate that by just swapping in a new piece of hardware,

13:26.720 --> 13:29.560
even something with a completely different instruction set.
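NOTE (editor)
A sketch of the TCP offload path described above: the host packs an operation's arguments into a message and sends it to a server on the board's processing system, which is assumed to place the data in shared memory for the RISC-V core and stream the result back. The address, port and wire format here are invented for illustration; it also assumes both ends share endianness, which a real protocol would pin down.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(5555);                        // hypothetical port
    inet_pton(AF_INET, "192.168.1.50", &addr.sin_addr); // board's address

    if (connect(fd, reinterpret_cast<sockaddr *>(&addr), sizeof addr) != 0) {
        perror("connect");
        return 1;
    }

    // Invented wire format: opcode, element count, then the operand payloads.
    uint32_t opcode = 1; // say 1 == "add"
    std::vector<float> a(256, 1.0f), b(256, 2.0f), out(256);
    uint32_t n = static_cast<uint32_t>(a.size());
    send(fd, &opcode, sizeof opcode, 0);
    send(fd, &n, sizeof n, 0);
    send(fd, a.data(), n * sizeof(float), 0);
    send(fd, b.data(), n * sizeof(float), 0);

    // The processing system runs the op on the core via shared memory and
    // streams the result back; read until we have all the bytes.
    size_t want = n * sizeof(float), got = 0;
    while (got < want) {
        ssize_t r = recv(fd, reinterpret_cast<char *>(out.data()) + got,
                         want - got, 0);
        if (r <= 0) break;
        got += static_cast<size_t>(r);
    }
    std::printf("out[0] = %f\n", out[0]); // expect 3.0
    close(fd);
}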
13:29.560 --> 13:35.320
I think it still basically worked out of the box, and yeah, it was very good.

13:35.320 --> 13:37.600
And the students deserve a lot of praise for the work they've done.

13:37.600 --> 13:40.280
Anyway, I think I'm now handing back to Jeremy to finish off.

13:52.200 --> 13:56.120
So William has just described to you that we're not talking about theory.

13:56.120 --> 13:56.960
This is practical.

13:56.960 --> 13:59.320
We're able to talk about this one because it was a student project.

13:59.320 --> 14:02.760
We do this commercially for real customers as well.

14:02.760 --> 14:07.840
And that work is all freely available for students to use as a starting point.

14:07.840 --> 14:11.720
And I think it says something that six students in ten weeks,

14:11.720 --> 14:19.560
who had no previous exposure to AI software and infrastructure, were able to bring that up successfully.

14:19.560 --> 14:22.080
So how do you do it?

14:22.160 --> 14:25.920
Let's look at what we mean by AI.

14:25.920 --> 14:27.640
And there is a pyramid here.

14:27.640 --> 14:33.400
We've all played with ChatGPT, as millions of people around the world have, or these days DeepSeek.

14:34.400 --> 14:36.040
Then there's a lot of professional engineers.

14:36.040 --> 14:38.160
This is actually the big revolution in AI.

14:38.160 --> 14:42.600
It's people using standard models to make businesses run better.

14:42.600 --> 14:44.000
Getting rid of the drudgery.

14:44.000 --> 14:48.280
Helping the lawyers find all their cases automatically.

14:48.280 --> 14:53.440
Helping people buying homes find all the legal documents they need automatically.

14:53.440 --> 14:57.960
Automating all the paperwork you need if you're in England and want to take a

14:57.960 --> 15:00.520
lorry load of goods into Europe.

15:00.520 --> 15:03.560
That's where the big revolution is.

15:03.560 --> 15:07.880
Actually, the number of people developing models is much smaller: people developing

15:07.880 --> 15:11.120
ResNet-18 and all the new models, that's a smaller group.

15:11.120 --> 15:15.960
And sitting on the top are people like us who actually develop the AI tools.

15:15.960 --> 15:18.040
And it's not just people.

15:18.040 --> 15:21.760
Some of that development is itself done by AIs.

15:21.760 --> 15:25.520
So how do you get involved?

15:25.520 --> 15:28.280
ExecuTorch is part of PyTorch.

15:28.280 --> 15:29.240
LiteRT,

15:29.240 --> 15:32.840
what used to be called TensorFlow Lite for Microcontrollers, is part of TensorFlow.

15:32.840 --> 15:34.320
And they've got their official tutorials.

15:34.320 --> 15:37.720
These slides will all be on the FOSDEM site.

15:37.720 --> 15:39.960
So you'll get the links.

15:39.960 --> 15:43.320
SYCL and OpenCL are already in PyTorch 2.4.

15:43.320 --> 15:45.160
So you don't have to write the implementations.

15:45.160 --> 15:48.600
You've got implementations already available.

15:48.600 --> 15:54.600
The oneAPI Construction Kit is freely available on Codeplay's GitHub.

15:54.600 --> 15:57.640
The work done in Southampton, and other work we're doing elsewhere,

15:57.640 --> 15:59.920
we're turning into some more of our how-tos.

15:59.920 --> 16:02.440
And I know many of you will have used our how-tos before.

16:02.440 --> 16:04.720
They're coming this year.

16:04.720 --> 16:06.960
And ultimately it's what we do for our day job.

16:06.960 --> 16:10.560
So if you'd like to do more and you want some help,

16:10.560 --> 16:11.680
come and ask us.
16:11.680 --> 16:14.120
William and I will be here all day.

16:14.120 --> 16:19.400
And we're at the AI Plumbers conference tomorrow as well.

16:19.400 --> 16:20.560
So thank you all very much.

16:20.560 --> 16:22.560
I asked a question at the beginning:

16:22.560 --> 16:27.560
how do we get AI working on new hardware?

16:27.560 --> 16:29.840
And I hope we've given you a bit of insight

16:29.840 --> 16:31.400
into how that can be done.

16:31.400 --> 16:34.000
And I think we have a minute or two for a few questions.

16:34.000 --> 16:35.200
Thank you.