So this one is going to be done by somebody who needs no introduction, because it would be weird for me to introduce myself. My lightning talk is tinygrad on microcontrollers.

So we had one talk about people trying to run AI on NPUs. I figured I need AI on an MCU, because I happen to be part of the community that is about to tape out a pretty interesting microcontroller-level NPU. It's one of the first fully open-source tape-outs at 16 nanometers; if you're interested, go to the foundry's GitHub, you can find all about it. It's a pretty capable NPU at about one TOPS with 16 megabytes of SRAM, so it's not super tiny, and you can actually do a lot of interesting things with it, like depth models, YOLO models, stuff like that.

It actually came from a much bigger architecture that we are now dealing with at an AI foundry; it's called ET. It used to be done by this company called Esperanto Technologies, but now we're sort of stewarding it. Back then, it was scaled out to 1000 cores, and if you want to play with those types of CPUs, I actually have some in my pocket and in my lab. But again, the microcontroller that we're taping out is just a much more scaled-down version of that 1000-core CPU.

So, if I want to run something anywhere, what are my usual suspects for a small, constrained device? Well, there is emlearn, which had a great talk at FOSDEM last year, so if you're curious, go check it out.
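Those specs (about one TOPS and 16 MB of on-chip SRAM) are what decide which models are even in play, so a quick back-of-envelope memory check is worth sketching. The parameter counts and the activation scratch budget below are illustrative assumptions, not measurements of the models mentioned in the talk:

```python
# Back-of-envelope check: does an INT8 model fit in the chip's 16 MB of SRAM?
# The parameter counts below are illustrative assumptions, not measurements
# of any specific model from the talk.

SRAM_BYTES = 16 * 1024 * 1024  # 16 MB on-chip SRAM

def fits_in_sram(params: int, bytes_per_param: int = 1,
                 activation_budget: int = 2 * 1024 * 1024) -> bool:
    """Weights (1 byte/param for INT8) plus a rough activation scratch budget."""
    return params * bytes_per_param + activation_budget <= SRAM_BYTES

# A YOLO-nano-class detector (roughly 3M params) fits comfortably in INT8 ...
print(fits_in_sram(3_000_000))        # True
# ... while a ~20M-param model kept at FP32 (4 bytes/param) does not.
print(fits_in_sram(20_000_000, 4))    # False
```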
It's very Zephyr-friendly, and you can have a lot of fun using it, but it's really constrained in the types of models it can run; it's not usable for the kinds of models I was interested in. Then there's obviously LiteRT for Microcontrollers. Its support for Zephyr is, I mean, nobody knows, and Bazel is not my favorite thing. And then, of course, there's ExecuTorch. ExecuTorch, we actually had a talk about it; it's this weird combination of small things and big things, so I'm not quite sure what to think about it yet. It's pretty young, so maybe it will develop into something that I would actually find a joy to use.

But it's FOSDEM, right? So why should we constrain ourselves to the things that are pre-canned and given to us by big vendors, like the PyTorch community? What I considered then was microTVM, but apparently that project died, which I didn't know. And then I'm like, okay, fine, I know GGML, I know tinygrad. And there was actually this other project from the Eclipse Foundation, called Aidge, that kind of tries to bridge the gap between big accelerators and small ones. The way I look at them, all of them are basically taking a compute graph and trying to lower it onto big devices or small devices. Daisytuner, again, there was a talk about that; it's one of the cool ones that I really would like to play with, and even push it to the micro side of devices.
Aidge is something that did get lowered to a lot of NPUs, and actually even to ASICs, so that one definitely has a backend, but again, I'm unfamiliar with it. IREE and MLIR, just don't talk to me about those.

So, tinygrad. Tinygrad is this framework that, for some reason, not a lot of people know about. It's basically this idea that if we have a really optimized internal representation of a compute graph, one that is very compact and contains only a small number of operators, we can do a lot of good things with it, and it gives you a toolbox of transformations that can be applied to it. So that's what I chose. My additional constraint was to basically make sure that the work could be done by Claude Code, because I'm lazy these days, and I just want Claude to do everything for me. And, by the way, tinygrad is brought to you by the same person who hacked the PlayStation way back when, and that blog post back in 2010 was such a breath of fresh air for me. It's the same guy. As my friend Luca likes to say, tinygrad is small enough to actually fit into Claude Code's context window. That is literally my CLAUDE.md; well, not quite, but that's what I wanted Claude to do. And, amazingly, that actually went pretty well. There's a really good series of blog posts, and it's not just for Claude to read them; you're welcome to read them as well, they were written for humans. They kind of introduce you to tinygrad.
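The "compact IR with a small operator set" idea can be sketched in a few lines. This is not tinygrad's actual API (its real graph nodes are `UOp`s with their own ops enum); it's a toy illustration of how chaining a handful of operators builds a whole expression graph in memory:

```python
# Toy sketch of a compact compute-graph IR in the spirit of tinygrad's UOps.
# This is NOT tinygrad's actual API -- just an illustration of how a small,
# fixed operator set plus chained nodes can describe a whole kernel.
from dataclasses import dataclass
from enum import Enum, auto

class Op(Enum):          # deliberately tiny operator set
    LOAD = auto()
    CONST = auto()
    MUL = auto()
    ADD = auto()
    STORE = auto()

@dataclass(frozen=True)
class Node:
    op: Op
    src: tuple["Node", ...] = ()
    arg: object = None

# y = x * w + b, built by chaining nodes into an in-memory graph
x = Node(Op.LOAD, arg="x")
w = Node(Op.LOAD, arg="w")
b = Node(Op.LOAD, arg="b")
y = Node(Op.STORE, (Node(Op.ADD, (Node(Op.MUL, (x, w)), b)),), arg="y")

def count_nodes(n: Node, seen=None) -> int:
    """Walk the graph once, skipping shared subexpressions."""
    seen = set() if seen is None else seen
    if id(n) in seen:
        return 0
    seen.add(id(n))
    return 1 + sum(count_nodes(s, seen) for s in n.src)

print(count_nodes(y))  # 6 nodes describe the whole expression
```

Once everything a model does is expressed in a representation this small, transformations (fusion, rewrites, code generation) only need to handle a handful of cases, which is what makes the "toolbox" framing work.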
What I decided to do was to look at how they implemented the WebGPU backend, because tinygrad has one, and I kind of went from there. It actually worked amazingly well with Claude Code.

So, takeaways. Takeaways from basically a week of experimenting with Claude Code: I actually ended up generating some semblance of an ELF file. Tinygrad is all predicated on taking a graph, expressed somehow, and producing a bunch of things that you can then push onto a device itself; you can literally produce an ELF file, and all of that is done through the renderers. With a renderer, you basically either have a graph that's given to you, or you can construct a graph, like I'm doing here with UOps: you chain them and build that graph in memory. And then you can ask it to render, and it will render into something like an actual CUDA kernel. So that's what I did.

One of the things that didn't work out: I didn't actually quite manage to generate C and C++ code that would be compact, so I need to look into tinygrad's pattern matching and optimizations like that. But other than that, I'm actually on the way to using tinygrad to produce code for microcontroller compute. Again, my models are YOLO and depth perception. So I recommend you all try it, because, again, it just happens to be one of those toolboxes.
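A renderer in the sense used above can also be sketched as a toy: walk an expression graph and emit C source, with one peephole rewrite (a*b + c into a fused multiply-add) standing in for the kind of pattern matching that makes the emitted code compact. None of this is tinygrad's real renderer API; it's an illustrative sketch:

```python
# Toy "renderer": walk a small expression graph and emit C source.
# The a*b + c -> fmaf rewrite stands in for tinygrad's pattern matching.
# This is NOT tinygrad's real renderer -- it's an illustrative sketch.

def render(expr) -> str:
    """expr is either a C symbol string or a nested tuple like
    ("add", ("mul", "x", "w"), "b")."""
    if isinstance(expr, str):
        return expr
    op, *src = expr
    if op == "add" and isinstance(src[0], tuple) and src[0][0] == "mul":
        # peephole pattern match: a*b + c  ->  fused multiply-add
        a, b = src[0][1], src[0][2]
        return f"fmaf({render(a)}, {render(b)}, {render(src[1])})"
    sym = {"add": "+", "mul": "*"}[op]
    return f"({render(src[0])} {sym} {render(src[1])})"

def render_kernel(name: str, out: str, expr) -> str:
    """Wrap the rendered expression in a C loop, element-wise over n."""
    return (f"void {name}(float *{out}, const float *x, "
            f"const float *w, const float *b, int n) {{\n"
            f"  for (int i = 0; i < n; i++)\n"
            f"    {out}[i] = {render(expr)};\n"
            f"}}\n")

expr = ("add", ("mul", "x[i]", "w[i]"), "b[i]")
print(render_kernel("linear", "y", expr))
```

The real thing obviously handles shapes, memory planning, and many more rewrite rules, but the shape of the pipeline is the same: graph in, plain compilable source out, which is exactly what a microcontroller toolchain wants.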
It's not a product per se, but it's so easy to combine and work with that you can run a lot of experiments, especially if you use Claude, in a really short amount of time. So, that's it. Thank you so much.