It's always very nice to see what you have been working on, and this year it's parallelism, which as you understand is very difficult in the microcontroller space. Do your earrings also run TinyGo? And can you turn on the mic for me? The microphone is still muted. Let me... It was a hardware issue. With that, do you want to give a round of applause to a person with TinyGo earrings?

Well, hello, everyone. So today I'm going to talk about implementing parallelism. As you all know, concurrency is not the same as parallelism. Concurrency is running multiple goroutines, having them run independently of each other and communicate with each other. This is all built into the Go language itself, and into most popular languages. But parallelism is actually something different: that's actually running multiple things at the same time, on different CPU cores. Go is well known for doing parallelism very well. In the beginning it was a single-threaded system; starting with Go 1.5, parallelism was enabled by default. In TinyGo, so far, we haven't had any parallelism, because basically most microcontrollers are single-core — almost all of them. But this is slowly changing.

So why are we adding parallelism now? First of all, because TinyGo is not only used for microcontrollers, but also for Linux. For example, the u-root project, which is something like BusyBox but written in Go — they like to push the boundaries, and so they try to use TinyGo. And I'm sure there are many other projects that use TinyGo for a similar purpose on Linux.

Then WebAssembly: we do have concurrency on WebAssembly, for example, but the way it works is that it unwinds the stack and rewinds the stack on every goroutine switch, which is not efficient at all — it's pretty slow and has some issues. I would like to try whether using web workers, for instance, works better. It's also probably going to improve performance, but that's more of a side benefit. And of course, bare metal support.

So, there are very few microcontrollers that have more than one core; basically all of them are single-core. But there are a few exceptions. One is the RP2040, which is actually what this Gopher Badge runs on. But that chip didn't exist back when I started TinyGo. That's why we are adding multicore support right now: because it's starting to get useful.
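To make the concurrency-versus-parallelism distinction concrete before the implementation details, here is a small illustrative Go program (not from the talk itself): the goroutines below are concurrent either way, but they only run in parallel when the runtime is allowed to use more than one core, which is exactly what TinyGo has been missing so far.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	// With GOMAXPROCS(1) the goroutines merely interleave on a single core:
	// concurrency without parallelism. Remove this line (or raise the value)
	// and, on a multicore CPU, they can truly run at the same time.
	runtime.GOMAXPROCS(1)

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			fmt.Println("goroutine", id, "running")
		}(i)
	}
	wg.Wait()
}
```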
The rest of the talk will focus mostly on Linux, but most of the things I'll be talking about are applicable to the other targets as well.

On Linux, I decided to use one-to-one threading — or one-to-one scheduling. That means that every goroutine runs in a separate thread. Actually, that's how gccgo also works, if I remember correctly. This has a few downsides. For example, it will use a little bit more RAM, but not that much more — most of it is virtual; only about eight kilobytes of physical RAM at the start of a goroutine. And it will be a little bit slower: switches between goroutines will be slower. So there are a few downsides, but the benefits are that it's way easier to implement — I just use pthreads, basically. Cgo calls are very easy: you just call them. In standard Go, by contrast, you need to switch to a different C stack to run cgo, and then switch back; that's one of the major reasons why it's slow there. System calls can just be made directly — there's no need to use epoll and things like that. And also, normally, when a goroutine does some numeric computation, for example, that doesn't block, it can just lock up the OS thread without the OS knowing, so the scheduler has to preempt it at some point. I think there are checks put in at some points so one goroutine doesn't hold up the whole system. But when you're using threads, the operating system is already doing this by default, so it's free, basically.

So there are a few trade-offs, and here on the slide you can see, for comparison, the different performance overheads. Thread launch takes about five microseconds — well, that's not that much time.

To implement all this, a lot of things needed to change. First of all, of course, the scheduler: instead of using a single thread, it now uses multiple threads, so obviously that needed to change. The garbage collector needed to change. Before, we could just allocate memory, and as soon as the heap ran out, we ran a garbage collection cycle — and by definition, there was nothing else running, because that was the only thing running at the time. But now we need to stop the whole world, every thread that's running. And we do this basically by sending each goroutine — or each thread, actually — a POSIX signal.
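As a rough, hypothetical sketch of that signal-based stop-the-world — this is not TinyGo's actual code; it's Linux-only, and the thread-ID bookkeeping, the spinning handler, and the choice of SIGUSR1 are all simplifications — the idea is roughly this:

```go
package main

import (
	"sync/atomic"
	"syscall"
)

var worldStopped atomic.Uint32

// stopTheWorld asks every other thread to pause by sending it a signal.
// A real collector would then wait for acknowledgements before scanning.
func stopTheWorld(threadIDs []int, selfTID int) {
	worldStopped.Store(1)
	for _, tid := range threadIDs {
		if tid != selfTID {
			// The SIGUSR1 handler (registered elsewhere) parks the thread.
			syscall.Tgkill(syscall.Getpid(), tid, syscall.SIGUSR1)
		}
	}
	// ...scan all stacks and globals here...
	worldStopped.Store(0) // let the paused threads continue
}

// parkUntilRestarted is what each thread's signal handler would run.
func parkUntilRestarted() {
	for worldStopped.Load() == 1 {
		// A real implementation would block on a futex instead of spinning.
	}
}

func main() {}
```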
Basically, each thread pauses, we scan all the stacks and the globals, and then everything can start running again and all the garbage can be freed.

I'll come back to channels and select later, if there's enough time. The sync package — I'll also come back to that, but I can already say that it's using futexes. I think sync/atomic actually didn't need any changes at all: we were already using LLVM intrinsics for atomic instructions, so that just worked, no change needed. And a few other things, like println: you don't want multiple goroutines printing at the same time and interleaving on the same line — that's just going to be a mess — so it locks the output, prints the whole line, then unlocks the output.

To explain everything, I need to explain futexes first. So what's a futex? It stands for "fast userspace mutex", but it's a system call, so it's not that fast, and not in userspace, and it's also not really a mutex. So I'm not sure where the name came from, but you can build mutexes out of them. They are basically the building block of all the concurrency primitives: mutexes, semaphores, wait groups, and in Go also channels — basically everything you want to synchronize, you can use a futex for.

The way it works, basically: if a lock, for example, is uncontended — there's no other thread using it at the same time — it's just an atomic instruction, so it's really, really fast. Only when it's contended, when there are multiple threads trying to use the same resource, only then is a system call made. This is a big reason why locks nowadays are super fast, as long as they're not contended.

The API looks something like this. It varies depending on the operating system — I'll get to that later — but this is basically the lowest common denominator across all operating systems. The way you can imagine this is that the operating system has an internal hash map from addresses, or pointers, to a list of threads that are waiting. And there are two important calls, or three depending on how you count.
Wait: in the kernel, something like the pseudocode you can see here happens. It takes the address that is passed, which should point to a 32-bit atomic variable, and checks whether it holds the value that it expects. If so, it adds the calling thread to the threads waiting on this specific address, in the hash map I mentioned — and if not, it just returns. But if it matches, it also pauses the thread, because that's the whole point of it. And to unpause it, another thread can call wake — or wake-one, in this case — and say to the kernel: please wake one thread that's waiting on this address. Note that the operating system doesn't watch the value itself: if you just change the value at the address, nothing happens. Remember, the address — the pointer — is basically just a key into the hash map inside the kernel. So when you call wake, it'll just wake one thread, and when you call wake-all, it will wake all waiting threads. And that's basically all you need to implement basically all the concurrency primitives.

This is how it looks on different operating systems. On Linux, it's a single, heavily overloaded system call that does a lot of different things — I'm not going to go into it, check the man page. On macOS there are two internal functions that you can call; they're not really documented anywhere, but LLVM's C++ standard library uses them internally, so I guess they're never going to change. Hopefully. Unless Apple does something stupid. On Windows, there are three separate functions for this. And WebAssembly has atomic wait and notify as well; the WebAssembly version is basically a subset of all the others, which makes sense, because WebAssembly has to run on top of them.

In TinyGo, I've wrapped all these different APIs in one simple API. A futex is basically just a superset of an atomic integer — it's basically the same API as before: wait, wake, and wake-all, just in a nice package. And I found that on bare metal I can use almost the same API. Well, the API is all internal, but the exported functions are exactly the same.
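Roughly, a wrapper like the one described could look like the following sketch. The method names follow the talk, but the bodies here are placeholders rather than TinyGo's real implementation, which would call into the OS primitives listed above:

```go
package main

import "sync/atomic"

// Futex is a superset of an atomic 32-bit integer: the embedded value gives
// it Load/Store/Add/CompareAndSwap, and Wait/Wake/WakeAll are added on top.
type Futex struct {
	atomic.Uint32
}

// Wait blocks the calling thread as long as the value still equals cmp.
// On Linux this would be a FUTEX_WAIT system call; here it only spins as a
// placeholder so the sketch stays self-contained.
func (f *Futex) Wait(cmp uint32) {
	for f.Load() == cmp {
		// real code: ask the OS to put this thread to sleep
	}
}

// Wake wakes a single thread waiting on this futex (FUTEX_WAKE on Linux).
func (f *Futex) Wake() {
	// real code: system call; the spinning placeholder above needs nothing
}

// WakeAll wakes every thread waiting on this futex.
func (f *Futex) WakeAll() {}

func main() {}
```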
So it still has an atomic value, but because TinyGo is basically the operating system — it's running bare metal, so it's the only thing running there — it can implement it however it wants. So the futex type itself just has a linked list of waiting goroutines. You can see here that, for example, the wait function is basically the same code as before, but now it uses a spinlock for locking. The RP2040, for example, has hardware spinlocks, for some reason. It does, yeah.

And this is how you actually use a futex. To be clear, this is not the same implementation as standard Go — it's similar, but not the same. Also, I skipped all the overflow checking here, so this is not super safe, but it should work. I didn't test it, though. These are the same methods as you would normally see on a futex — sorry, on a wait group: Add, Wait, and Done. Done is basically just adding a negative one, so pretty simple.

Add works because a futex is also an atomic variable: it can just atomically add the delta and check whether the result of the addition is zero — for example, when the delta is negative. If it's zero, that means it's done, and it can wake up all the waiting goroutines; so it signals every other goroutine that it has finished.

The Wait function loads the current value of the atomic variable and checks whether it's zero. If it's zero, it's done, it's ready, it can return. If it's not, it will wait. But as you can see, it passes the current counter value; if the counter has already changed by then, the wait just returns immediately. However, one thing to note is that this is a for loop. That's because the wait can spuriously return — it's an operating system limitation. I think it can spuriously return on interrupts and things like that, so it can sometimes happen. So it needs to stay in a loop until it actually finishes, but that's very low overhead.

Now, channels. Channels may seem a lot more difficult, but channels in Go are actually mostly implemented by just using a lock, basically. For example, to send a value on a channel, what happens is that the sender first locks the channel. It checks whether there's a receiver ready; if so, it can proceed with the channel operation. If not, it checks whether there's space in the buffer.
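A hedged reconstruction of the wait group just described — built on the hypothetical Futex sketched earlier, with the overflow checks omitted exactly as in the talk, and not TinyGo's exact code — could look like this:

```go
// WaitGroup is a sketch of the behaviour described above: the futex itself
// doubles as the counter.
type WaitGroup struct {
	counter Futex
}

// Add atomically adds delta to the counter (negative deltas simply wrap
// around, which is fine for an atomic add); when the counter reaches zero,
// every waiting goroutine is released.
func (wg *WaitGroup) Add(delta int) {
	if wg.counter.Add(uint32(delta)) == 0 {
		wg.counter.WakeAll()
	}
}

// Done is just Add(-1).
func (wg *WaitGroup) Done() { wg.Add(-1) }

// Wait blocks until the counter is zero. The loop handles spurious wakeups,
// and passing the value we just loaded makes the wait return immediately if
// the counter changed in the meantime.
func (wg *WaitGroup) Wait() {
	for {
		v := wg.counter.Load()
		if v == 0 {
			return
		}
		wg.counter.Wait(v)
	}
}
```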
If it's a buffered channel and there is space, it can store the value and return. And if not, it will need to wait: it adds itself to the linked list of waiting senders, unlocks the channel, and pauses itself to wait for the next receiver to use this channel. So that's relatively straightforward — it's still low-level synchronization, so not super straightforward, but relatively straightforward.

Select, however, is a lot more complicated. What you can see here is an example of what can go wrong. Say you have two goroutines, foo and bar, and they both have a select statement. If you know how select works: it basically locks every channel that is in the select statement. It has to do this, and once everything is locked, it can check whether any of them can proceed, or add itself to the senders list, et cetera, et cetera. But if you do this too naively, you can see the problem here. Goroutine foo locks channel 1, and goroutine bar locks channel 2 at the same time. Then goroutine foo tries to lock channel 2, but it can't, because it's already locked, and goroutine bar tries to lock channel 1, but that's also locked — so that's not going to work. The main Go implementation solves this by basically sorting all the channels. Channels are a reference type — internally a channel is a pointer — so it can just sort them by address. That's one solution, and if you have many cores and many select statements running at the same time, it's probably the fastest. But for TinyGo we have different trade-offs, and I basically use a global mutex for this: whenever a select is going on, it just takes that lock for a short while during processing.

So that's mostly it. I could go into mutexes, but mutexes are surprisingly complicated — they are way more complicated than wait groups, for some reason. You want to optimize the case where there's no contention; that's a separate thing, I'm not going to go into that.

So that's all the Linux stuff so far. But I also ported this to the Gopher Badge — it's actually running on the badge right now, and you can see here what's happening. This is a Mandelbrot fractal, and I managed to run two different goroutines doing the work, so the work is spread across two different cores.
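As an illustration of that kind of work splitting — this is not the badge demo's actual code, and the display size and the mandelbrot helper are made up — rendering can be divided between two goroutines like this, which is what makes the second core useful:

```go
package main

import "sync"

const width, height = 160, 128

// mandelbrot is a placeholder for the real per-pixel iteration count.
func mandelbrot(x, y int) uint8 {
	return uint8((x * y) % 256)
}

// render splits the rows of the frame between two goroutines; with multicore
// support, the scheduler can run each half on its own core.
func render(pixels []uint8) {
	var wg sync.WaitGroup
	half := height / 2
	for core := 0; core < 2; core++ {
		wg.Add(1)
		go func(startRow int) {
			defer wg.Done()
			for y := startRow; y < startRow+half; y++ {
				for x := 0; x < width; x++ {
					pixels[y*width+x] = mandelbrot(x, y)
				}
			}
		}(core * half)
	}
	wg.Wait() // both halves are done; the frame can be displayed
}

func main() {
	pixels := make([]uint8, width*height)
	render(pixels)
}
```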
And so the calculations are done in parallel. Mandelbrot calculations are quite heavy to compute — and yeah, this chip is not that fast — so by running them in parallel you can see the difference. I measured it to be about 30% faster; there's probably some overhead, I don't know. So, that works. It's still alpha quality, but I should be making a pull request somewhere in the next week or so. The way I implemented this is basically by running the scheduler — the old scheduler that's used for single-core operation — on both cores at the same time. So both cores are basically waiting for the next goroutine to become ready. There was some stuff around sleeping, but that's fixed.

So that's where we are so far. Next up, we need to clean up everything and merge it. Linux is almost ready — there are some review comments, but it's basically ready to be merged. macOS I added support for, but there are some CI issues; it also uses pthreads, so it's very similar to Linux. Bare metal needs a little bit more cleanup, because the code is still quite rough. And then maybe I'll look into WebAssembly support — not sure how many people are interested in that — and maybe ESP32 support; the ESP32 is a chip that has two cores. And of course performance improvements, but that can happen after the initial implementation is done. So that's it.

[applause]

Sadly, we have no time for questions, I'm really sorry, but we need to catch up because we have the lightning talks coming up, and we're usually famous for running over. So if you have questions for her, there is an amazing hallway track — just get her a free drink of her choice. Thank you for this amazing work on TinyGo. Thank you. Thank you.