WEBVTT 00:00.000 --> 00:08.320 All right, okay, I don't want to swallow it, all right. 00:08.320 --> 00:13.520 So we are a young startup, like five months out of stealth, I am Rene, and this is my 00:13.520 --> 00:21.320 colleague, Guillaume, and yeah, we want to show you what we've been cooking. 00:21.320 --> 00:29.920 So before we get started, let's have a look at the AI game from a special point of view. 00:29.920 --> 00:33.680 We have the world divided into two sections, basically. 00:33.680 --> 00:39.320 We have training, we have the training guy on one hand, and we have the inference guy 00:39.320 --> 00:44.040 on the other hand, so let's introduce our players. 00:44.040 --> 00:50.200 We have training; training is typically done as a research endeavor. 00:50.200 --> 00:56.080 You have one of something, which means basically you train your next greatest model 00:56.080 --> 00:58.640 of all time. 00:58.640 --> 01:04.760 More is better, you want bigger models, you want more modalities, and everything, and yeah, 01:04.760 --> 01:12.080 obviously you want to iterate fast, and yeah, you love Python. 01:12.080 --> 01:16.720 On the other hand, we have inference. 01:16.720 --> 01:23.640 Inference is run in production, so this guy operates in production mode, running thousands 01:23.640 --> 01:27.280 and millions of models doing billions of requests. 01:27.360 --> 01:33.600 In this scenario, less is actually better, you want to consume fewer resources per model, 01:33.600 --> 01:41.000 and smaller containers, and everything, you want predictable latency, and Python there 01:41.000 --> 01:46.680 is, yeah, that is some story, yeah. 01:46.680 --> 01:54.640 And so when you write a framework, any AI framework, you should obviously prioritize 01:54.720 --> 02:01.280 training, right, because when you write a training framework, you get inference for free, 02:01.280 --> 02:08.720 and this is what most of the frameworks have been doing, but the experience is often like 02:08.720 --> 02:10.080 this. 02:10.080 --> 02:18.760 So the Python ecosystem is sometimes not the most friendly one to get started running 02:18.760 --> 02:19.760 inference. 02:19.760 --> 02:27.080 So when we talk to potential customers, they are typically AI-flavored backend engineers, 02:27.080 --> 02:34.080 and they have very strong demands, like accelerator agnosticity. 02:34.080 --> 02:40.160 They don't want to run only on, or be forced to only run on, Nvidia. 02:40.160 --> 02:48.240 They want compiled models with static typing, cross compiling, runtime sandboxing, 02:48.240 --> 02:49.720 and so on. 02:49.720 --> 02:54.440 Parallel IO, async IO, Kubernetes, and all the good stuff. 02:54.440 --> 02:59.440 And this is why we created ZML. 02:59.440 --> 03:03.840 So how can we best describe ZML? 03:03.840 --> 03:12.800 It's resting on four pillars: Zig as the programming language, producing MLIR, using Open 03:12.800 --> 03:20.280 XLA to compile the models for the accelerators, and yes, we use Bazel to orchestrate all of 03:20.280 --> 03:22.920 this. 03:22.920 --> 03:31.160 So we use Zig as a front end, so you write your model source code in Zig, and Guillaume will show some 03:31.160 --> 03:37.660 examples of that later, and your Zig program then produces MLIR, which is passed to Open 03:37.700 --> 03:46.700 XLA, which then targets your target platform, and you have a handy Bazel interface to do 03:46.700 --> 03:48.900 all that. 03:48.900 --> 03:54.500 Speaking of Zig, our framework is for inference only.
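NOTE
The speakers describe the pipeline as: your Zig program produces MLIR, and OpenXLA compiles that MLIR for the target accelerator. As a minimal, hedged illustration of what "produce MLIR and hand it to OpenXLA" means in practice, here is the same flow done from Python with JAX, which also sits on top of OpenXLA. This is only a familiar point of reference, not ZML's API (ZML does the equivalent from Zig, with no Python in the stack), and the tiny model below is made up for the example.

    import jax
    import jax.numpy as jnp

    def layer(x, w):
        # a tiny stand-in "model": one matmul followed by a ReLU
        return jax.nn.relu(x @ w)

    x = jnp.ones((1, 4))
    w = jnp.ones((4, 8))

    lowered = jax.jit(layer).lower(x, w)  # trace to StableHLO (MLIR)
    print(lowered.as_text())              # the MLIR text that is handed to XLA
    compiled = lowered.compile()          # XLA compiles it for the local backend
    print(compiled(x, w))                 # run the compiled executable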
03:54.500 --> 03:59.580 At the moment, we don't even bother with training because the inference demand is so high, 03:59.580 --> 04:03.780 so it's actually okay to specialize in that. 04:03.780 --> 04:11.620 So as I said, you write your models in Zig, and what we achieved was zero Python in our 04:11.620 --> 04:12.620 stack. 04:12.620 --> 04:19.060 We can still load PyTorch models, we have a Zig implementation of that, so that's good. 04:19.060 --> 04:26.380 And the focus is on producing readable, maintainable, modular code, that's statically compiled, 04:26.380 --> 04:30.580 and where it feels more like, if you are a systems engineer, it feels more like you're 04:30.580 --> 04:33.340 doing proper programming, right? 04:33.340 --> 04:37.460 Nothing against Python though, so. 04:37.460 --> 04:39.460 Is that the handover to you? 04:39.460 --> 04:40.460 Okay. 04:40.460 --> 04:41.460 Okay. 04:41.460 --> 04:52.500 So this is like the main dot zig file, I'm not sure it's readable from there, anyway, the main 04:52.500 --> 04:57.060 point of this slide is to show that it looks like regular Zig code, so we have fine 04:57.060 --> 05:03.020 grained control over allocations, but we also bring in a few modern features that we use, like 05:03.060 --> 05:09.460 async, so typically what we do here is that we first open the PyTorch files, we 05:09.460 --> 05:15.060 extract all the shapes, but we don't load the weights, then we kick off the compilation 05:15.060 --> 05:23.020 of the model, based on the shapes, and yeah, asynchronously we also load the weights onto 05:23.020 --> 05:26.900 the device. 05:26.900 --> 05:32.140 But now let's look a bit more at what the model code looks like, so the idea is to make 05:32.140 --> 05:38.140 something which is familiar if you're coming from PyTorch or other high-level frameworks, 05:38.140 --> 05:42.420 and even though it's still Zig, statically compiled and so on, we made it so that 05:42.420 --> 05:49.740 you don't need to handle allocations, so it feels very high-level, but we still try 05:49.740 --> 05:57.220 to add a few goodies to make you want to write this code, and one 05:57.220 --> 06:01.300 of those is axis tagging; basically what we do is we give names to the different 06:01.300 --> 06:08.260 axes of a tensor, and we can propagate them through the different operations, so a matrix multiplication 06:08.260 --> 06:13.780 and so on, and in practice it simplifies a lot of the model code because 06:13.780 --> 06:20.660 you need fewer transpositions, especially around matrix multiplications, so to be concrete, 06:20.660 --> 06:24.740 like if you have an image tensor, so it's a tensor with three axes, you have the width, 06:24.740 --> 06:33.700 the height, and the channels, so usually you would refer to the width by its offset, so offset 06:33.700 --> 06:37.780 zero, and you have to remember in your code that offset zero is the width, and sometimes it becomes 06:37.780 --> 06:43.780 offset one or offset two, depending on what you do, but if you give it names, then even after a 06:43.780 --> 06:49.940 transpose, you can just use the name to refer to the width and always be consistent 06:49.940 --> 06:55.460 across the rest of the program; it also means you don't need to write transpose one zero two, 06:55.460 --> 07:01.540 you just say I want the height, then the width, then the channels, and it gives you 07:01.540 --> 07:08.980 that, and for matrix multiplication, here we have A and B, which are two matrices, but
they are 07:08.980 --> 07:14.580 not laid out like in the textbook, they are not ready for matrix multiplication, so we cannot just use the 07:14.580 --> 07:22.100 matmul operator that you usually find in frameworks, so what we do is we give them names, 07:22.100 --> 07:29.140 and since A and B both have an axis named K, we can say I want to multiply A with B and contract 07:29.140 --> 07:36.340 over the K axis, and it just does what you want, and it scales particularly well to even more 07:36.420 --> 07:45.380 complicated operations like self-attention; here we just say we want to multiply the queries 07:45.380 --> 07:52.820 with the keys, over the axis named head dimension, then we want to compute the softmax over 07:52.820 --> 07:58.340 the keys axis, and then we aggregate all the values, weighted by the attention weights, 07:59.380 --> 08:05.860 and if you wrote this code elsewhere, you would have a lot of transposes everywhere to make 08:05.860 --> 08:11.780 sure you can do the matmuls; here that just goes away. If you want to learn more, 08:12.500 --> 08:19.860 you can check out the docs, we have tutorials, built with the open source tool 08:19.860 --> 08:25.780 Zine, and now I'm going to give the mic back to Rene for the rest of the talk. 08:26.260 --> 08:40.100 Thank you, so that was Zig, let's come to OpenXLA; well, OpenXLA is a huge ecosystem, 08:40.100 --> 08:46.180 and it's also backed by the who's who in AI, what else can you say? 08:46.180 --> 09:02.020 Yeah, so, with the MLIR we produce, you can use OpenXLA, or rather ZML uses 09:02.020 --> 09:11.060 OpenXLA, to produce highly optimized code for your target, which could be an Nvidia GPU, 09:11.060 --> 09:18.900 it could be a TPU, it could be an AMD GPU, yeah, so it supports the important things, 09:18.900 --> 09:25.140 like kernel fusion, collectives, memory allocation, it's all highly optimized, auto tuning, 09:27.700 --> 09:36.340 and it produces very mature and stable MLIR, so this is not experimental code, this is 09:36.980 --> 09:44.100 like industry-grade MLIR, you see a picture of it here, actually you don't see it, 09:44.100 --> 09:50.740 but it will be in the slides, you can download the PDF, it's very colorful and very nice, 09:50.740 --> 09:59.620 and very professional, yeah, and we also add Bazel to the mix, it's like the user interface 09:59.700 --> 10:07.300 for you on the command line, and you can do some amazing things, for example cross compiling; 10:07.300 --> 10:13.140 this is something that Zig does out of the box, but Bazel also supports it, and we pull in so many 10:13.140 --> 10:23.540 third-party libraries, so the whole cross-compilation story is covered by this, and you can do it 10:23.620 --> 10:29.460 from Bazel, so that means, for example, if you're on a MacBook, you can compile your 10:29.460 --> 10:39.700 model, cross compile it for a Linux amd64 server, and then just copy it to the server and it will run.
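NOTE
The axis-tagging part above (contract A and B over the shared axis named K, multiply queries with keys over the head dimension, softmax over the keys axis, then aggregate the values) can be sketched with numpy's einsum, which also refers to axes by name instead of by position. This is only a conceptual analogue for readers who know Python, not ZML's Zig API, and the shapes below are made up for the example.

    import numpy as np

    # "Contract over the axis named k", no transposes needed:
    A = np.random.rand(5, 3)           # axes (m, k)
    B = np.random.rand(7, 3)           # axes (n, k) -- not the textbook (k, n) layout
    C = np.einsum("mk,nk->mn", A, B)   # result has axes (m, n)

    # Same idea for self-attention, naming batch, heads, queries, keys, head_dim:
    q = np.random.rand(2, 4, 10, 8)    # (b, h, q, d)
    k = np.random.rand(2, 4, 12, 8)    # (b, h, k, d)
    v = np.random.rand(2, 4, 12, 8)    # (b, h, k, d)
    scores = np.einsum("bhqd,bhkd->bhqk", q, k) / np.sqrt(8)               # contract over head_dim
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax over the keys axis
    out = np.einsum("bhqk,bhkd->bhqd", weights, v)                         # aggregate the values
    print(C.shape, out.shape)          # (5, 7) (2, 4, 10, 8)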
10:41.860 --> 10:51.380 We do an awful lot of runtime trimming and sandboxing, so some of those runtimes, like 10:51.380 --> 10:59.060 CUDA, or the runtime you need for ROCm, they can get pretty large, and especially for the AMD 10:59.060 --> 11:08.340 ecosystem, we managed to reduce that by roughly 90%, so we take only the required shared 11:08.340 --> 11:15.940 object files and whatever is needed, and bundle it together with your executable, so you can 11:15.940 --> 11:20.820 create self-deployable archives; like I just said, even with cross compiling, you can for example 11:20.820 --> 11:26.500 create a tar archive from any machine, that you then just copy to your server, 11:26.500 --> 11:36.020 and run the executable, and it will start doing what you want. Obviously, we can produce OCI images 11:36.020 --> 11:50.020 with Bazel, and yeah, ready for Kubernetes deployments, and speaking about CUDA, yeah, all of the 11:50.100 --> 12:00.260 things I just said are probably well illustrated by this meme here. There are many such cases where 12:01.300 --> 12:07.620 you get version incompatibilities: you have your server or your machine, you installed the 12:07.940 --> 12:18.580 CUDA driver, and then the user-space libraries, your TensorFlow, your PyTorch stack, they 12:18.580 --> 12:24.180 have a conflict, and it's sometimes really, really painful, and this is what we basically 12:25.620 --> 12:34.660 get rid of with ZML by doing our sandboxing, so actually you don't need to do any provisioning 12:34.820 --> 12:40.340 for ZML models, you just need to copy them, so it's just a deploy stage, there's no special 12:40.340 --> 12:48.180 provisioning, no special setup on the server required, and this is what it looks like, okay, 12:48.980 --> 12:55.620 I'm not sure if you can read it, but it's in the slides anyway; basically we show three 12:56.420 --> 13:00.180 Bazel command line examples here. The first one just says bazel run, 13:01.140 --> 13:11.780 with optimization, the MNIST target, and produces an executable that works with the CUDA runtime. The second one is a 13:11.780 --> 13:20.420 bit more involved, we tell the bazel build command that we want to create an archive, which means 13:20.500 --> 13:32.820 a tar archive, that supports the ROCm platform for AMD, and also the Linux amd64 13:32.820 --> 13:40.260 host architecture, and you can run this from any machine, you can run it from an ARM or whatever 13:40.260 --> 13:46.900 M3 MacBook, and it will still cross compile everything, and you go from your development machine, 13:46.900 --> 13:54.340 just copy the stuff over to your server, and it will work. Copying stuff over, why would you want 13:54.340 --> 13:59.140 to do that if you can just push an image, so that's the third example, you just say bazel run, 13:59.140 --> 14:07.140 MNIST push, for CUDA and for TPU, so what you get is a container you can pull on a machine 14:08.420 --> 14:14.740 that has an Nvidia GPU, but you can also pull it on a machine that has a TPU inside, and the 14:14.740 --> 14:21.460 container will auto-detect and start up and run just fine, all with just one bazel command. 14:24.100 --> 14:29.300 Yeah, so speaking of open source, we are on GitHub, very easy to find, ZML, 14:29.300 --> 14:37.860 ZML, if you want to check out the code, please do so.
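NOTE
The talk mentions that the same container will "auto-detect" whether it is running on an Nvidia GPU or a TPU and just start up. The transcript does not say how ZML does this, so the sketch below is only a guess at one plausible mechanism: probe for the device nodes each runtime typically exposes on Linux. The device paths and the function name are assumptions for illustration, not ZML's actual logic.

    import os

    def detect_accelerator() -> str:
        """Guess which accelerator runtime to load from typical Linux device nodes.
        These paths are assumptions about common setups, not ZML's implementation."""
        if any(os.path.exists(p) for p in ("/dev/nvidiactl", "/dev/nvidia0")):
            return "cuda"   # the NVIDIA driver exposes /dev/nvidia* nodes
        if os.path.exists("/dev/kfd"):
            return "rocm"   # AMD ROCm exposes the /dev/kfd compute node
        if os.path.exists("/dev/accel0"):
            return "tpu"    # Cloud TPU VMs typically expose /dev/accel* nodes
        return "cpu"        # otherwise fall back to the CPU runtime

    if __name__ == "__main__":
        print("selected runtime:", detect_accelerator())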
14:38.260 --> 14:47.380 And something we are quite proud of is that Yann LeCun, who is also often referred to as 14:47.380 --> 14:58.660 the Godfather of AI, thinks that ZML is, well, impressive, and this man is a genius, 14:58.740 --> 15:11.380 so I won't argue with him. So, what's next? Obviously, we are young, so we want to support 15:11.380 --> 15:23.380 more chips, more models, we just want more, more modalities, more integrations, and we are also 15:23.380 --> 15:30.980 working on an LLM server, we are pretty far with it, we call it LLMD, and it's super fast, super small, 15:30.980 --> 15:37.140 and super cool, and we hope that soon we will also give you a chance to play with it. 15:38.980 --> 15:49.860 And speaking of playing with it, yeah, this was a short intro to ZML, and this is basically 15:49.860 --> 15:58.580 at least how you can run models on your machine, and basically that's it, and now we have 15:58.580 --> 16:12.100 a lot of time for questions. So, I would suggest, if you have a question and don't mind, 16:12.740 --> 16:18.100 come closer, so I can hear you, and then I can repeat the question for you. 16:20.260 --> 16:23.380 And if you don't have any questions, we'll have a longer break, okay. 16:42.100 --> 16:56.020 Yeah, so I'll try to repeat a very long question, and it went along the lines of how do we find, 16:56.020 --> 16:59.940 so for example, we showed the MNIST example, and the question was how do we find 16:59.940 --> 17:07.460 the symbolic dimensions, and I'll try to answer the question, maybe I misunderstood it, 17:07.460 --> 17:13.780 but we have enough time. So basically, what we do is, so one thing that's very important, 17:13.780 --> 17:27.060 Zig is a compiled language, and in the compilation step, before you run it, 17:28.100 --> 17:33.220 you already need to know the shapes to produce the MLIR, right? And so you need to get those shape dimensions 17:33.300 --> 17:37.620 from somewhere. And in this case, like in the MNIST case, or in the Llama case, we get them 17:37.620 --> 17:42.420 by loading the weights from disk. So if they are safetensors, they have meta information, 17:42.420 --> 17:48.100 we just grab the meta information first, compile the code to MLIR, then keep loading 17:48.100 --> 17:54.660 the rest of the weights asynchronously, and then it's done. Do you want to add to that? Or... 17:54.980 --> 18:06.500 Typically the batch size is an input to the compilation, so yeah. So we get the actual dimensions 18:06.500 --> 18:19.460 by loading the weights, rather than using symbolic ones. So for the follow-up question, I hand over to 18:19.460 --> 18:24.340 Guillaume. For sequence lengths, typically what you do is you have a max sequence length, 18:24.340 --> 18:29.780 and then you compile for that. And if you really want more dynamism, you can compile several 18:29.780 --> 18:36.340 batch-size versions of the kernel. Yeah, and there are some approaches where 18:36.340 --> 18:40.900 the sequence length is not that much of a problem, so you can just compile for very long sequence lengths, 18:40.900 --> 18:46.500 and you can trade off at runtime between batch size and sequence length, but yeah. 18:50.340 --> 18:53.620 Okay. Yes.
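NOTE
The answer above (grab the metadata first, compile from the shapes, stream the rest of the weights in) works because a safetensors file starts with an 8-byte little-endian length followed by a JSON header that lists every tensor's dtype and shape, so the shapes can be read without touching the weight data. A minimal Python sketch of that header read follows; the file name is a placeholder, and ZML's actual loader is of course written in Zig, not Python.

    import json
    import struct

    def read_safetensors_shapes(path: str) -> dict:
        """Return {tensor_name: shape} from a .safetensors file, reading only the header."""
        with open(path, "rb") as f:
            (header_len,) = struct.unpack("<Q", f.read(8))  # 8-byte little-endian header size
            header = json.loads(f.read(header_len))         # JSON: name -> {dtype, shape, data_offsets}
        # "__metadata__" is an optional free-form entry, not a tensor
        return {name: info["shape"] for name, info in header.items() if name != "__metadata__"}

    # Usage, with a hypothetical file name:
    # print(read_safetensors_shapes("model.safetensors"))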
19:09.300 --> 19:17.460 Okay, so the question is, since we are using Zig, why are we not using the build, yeah, the Zig build system; 19:18.420 --> 19:24.820 the reason for that is that Bazel is more mature, and some of the dependencies are also using 19:24.820 --> 19:33.140 Bazel, typically OpenXLA, and we compile parts of LLVM, so with Bazel it's easier to do this. 19:34.900 --> 19:41.460 We also generate, like, the container images, and we build the containers with that for now, so 19:41.860 --> 19:50.820 we need to find a, we are working on a way to make it easier to call our stuff from build.zig, 19:50.820 --> 19:57.460 but it's not ready yet, so yeah, it's a frequent request from people using Zig, but yeah, we are 19:57.460 --> 20:14.580 working on it. So I'm not sure on the, I think the question is, how do we import 20:14.580 --> 20:23.060 PyTorch models? Yeah, so for PyTorch models, you have two things: you have the weights, 20:23.060 --> 20:28.660 and we can load the weight formats typically used for PyTorch models, so either that, I mean, the PyTorch official 20:28.660 --> 20:35.940 formats, or also safetensors; but then you also need to parse the code, which we don't, I mean, yeah, you need to 20:36.020 --> 20:44.980 rewrite the inference code. We tried to make that easy, it's a bit of work, but usually, 20:44.980 --> 20:51.620 I mean, you have probably already seen, like, a Llama implementation in one file, so it's not that much 20:51.620 --> 20:57.780 work if you know what you're doing, and also sometimes the inference code is different from the 20:57.860 --> 21:06.660 training code anyway, because you want to optimize for varying sequence lengths or prefill versus 21:06.660 --> 21:14.660 generation, so yeah, this rewrite is often needed anyway when you go to production. 21:22.980 --> 21:25.860 Thank you very much. We start in a three, 21:27.780 --> 21:33.140 three.