It's always very nice to see what you have been working on, and this year it's parallelism, which as you understand is very difficult in the microcontroller space. Do your earrings also run TinyGo? And can you turn on the mic for me? The microphone is still muted. Let me... It was a hardware issue. With that, do you want to give a round of applause to a person with TinyGo earrings?

Well, hello, everyone. So today I'm going to talk about implementing parallelism. As you all know, concurrency is not the same as parallelism. Concurrency is running multiple goroutines, having them run independently of each other and communicate with each other. This is all built into the Go language itself, and into most popular languages. But parallelism is actually something different: that's actually running multiple things at the same time, on different CPU cores. Go is well known for doing parallelism very well. In the beginning it was a single-threaded system; starting with Go 1.5, parallelism was enabled by default. In TinyGo, so far, we haven't had any parallelism, because basically most microcontrollers are single-core — almost all of them. But this is slowly changing.

So why are we adding parallelism now? First of all, because TinyGo is not only used for microcontrollers, but also for Linux. For example, the u-root project, which is something like BusyBox but written in Go — they like to push the boundaries, and so they try to use TinyGo. And I'm sure there are many other projects that use TinyGo for a similar purpose on Linux.

Then WebAssembly: we do have concurrency on WebAssembly, for example, but the way it works is that it unwinds the stack and rewinds the stack on every goroutine switch, which is not efficient at all — it's pretty slow and has some issues. I would like to try whether using web workers, for instance, works better. It's also probably going to improve performance, but that's more of a side benefit. And of course, bare metal support.

So, there are very few microcontrollers that have more than one core; basically all of them are single-core. But there are a few exceptions. One is the RP2040, which is actually what this Gopher Badge runs on. But that chip didn't exist back when I started TinyGo. That's why we are adding multicore support right now: because it's starting to get useful.
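To make the concurrency-versus-parallelism distinction concrete before the implementation details, here is a small illustrative Go program (not from the talk itself): the goroutines below are concurrent either way, but they only run in parallel when the runtime is allowed to use more than one core, which is exactly what TinyGo has been missing so far.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	// With GOMAXPROCS(1) the goroutines merely interleave on a single core:
	// concurrency without parallelism. Remove this line (or raise the value)
	// and, on a multicore CPU, they can truly run at the same time.
	runtime.GOMAXPROCS(1)

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			fmt.Println("goroutine", id, "running")
		}(i)
	}
	wg.Wait()
}
```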
The rest of the talk will focus mostly on Linux, but most of the things I'll be talking about are applicable to the other targets as well.

On Linux, I decided to use one-to-one threading — or one-to-one scheduling. That means that every goroutine runs in a separate thread. Actually, that's how gccgo also works, if I remember correctly. This has a few downsides. For example, it will use a little bit more RAM, but not that much more — most of it is virtual; only about eight kilobytes of physical RAM at the start of a goroutine. And it will be a little bit slower: switches between goroutines will be slower. So there are a few downsides, but the benefits are that it's way easier to implement — I just use pthreads, basically. Cgo calls are very easy: you just call them. In standard Go, by contrast, you need to switch to a different C stack to run cgo, and then switch back; that's one of the major reasons why it's slow there. System calls can just be made directly — there's no need to use epoll and things like that. And also, normally, when a goroutine does some numeric computation, for example, that doesn't block, it can just lock up the OS thread without the OS knowing, so the scheduler has to preempt it at some point. I think there are checks put in at some points so one goroutine doesn't hold up the whole system. But when you're using threads, the operating system is already doing this by default, so it's free, basically.

So there are a few trade-offs, and here on the slide you can see, for comparison, the different performance overheads. Thread launch takes about five microseconds — well, that's not that much time.

To implement all this, a lot of things needed to change. First of all, of course, the scheduler: instead of using a single thread, it now uses multiple threads, so obviously that needed to change. The garbage collector needed to change. Before, we could just allocate memory, and as soon as the heap ran out, we ran a garbage collection cycle — and by definition, there was nothing else running, because that was the only thing running at the time. But now we need to stop the whole world, every thread that's running. And we do this basically by sending each goroutine — or each thread, actually — a POSIX signal.
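As a rough, hypothetical sketch of that signal-based stop-the-world — this is not TinyGo's actual code; it's Linux-only, and the thread-ID bookkeeping, the spinning handler, and the choice of SIGUSR1 are all simplifications — the idea is roughly this:

```go
package main

import (
	"sync/atomic"
	"syscall"
)

var worldStopped atomic.Uint32

// stopTheWorld asks every other thread to pause by sending it a signal.
// A real collector would then wait for acknowledgements before scanning.
func stopTheWorld(threadIDs []int, selfTID int) {
	worldStopped.Store(1)
	for _, tid := range threadIDs {
		if tid != selfTID {
			// The SIGUSR1 handler (registered elsewhere) parks the thread.
			syscall.Tgkill(syscall.Getpid(), tid, syscall.SIGUSR1)
		}
	}
	// ...scan all stacks and globals here...
	worldStopped.Store(0) // let the paused threads continue
}

// parkUntilRestarted is what each thread's signal handler would run.
func parkUntilRestarted() {
	for worldStopped.Load() == 1 {
		// A real implementation would block on a futex instead of spinning.
	}
}

func main() {}
```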
Basically, each thread pauses, we scan all the stacks and the globals, and then everything can start running again and all the garbage can be freed.

I'll come back to channels and select later, if there's enough time. The sync package — I'll also come back to that, but I can already say that it's using futexes. I think sync/atomic actually didn't need any changes at all: we were already using LLVM intrinsics for atomic instructions, so that just worked, no change needed. And a few other things, like println: you don't want multiple goroutines printing at the same time and interleaving on the same line — that's just going to be a mess — so it locks the output, prints the whole line, then unlocks the output.

To explain everything, I need to explain futexes first. So what's a futex? It stands for "fast userspace mutex", but it's a system call, so it's not that fast, and not in userspace, and it's also not really a mutex. So I'm not sure where the name came from, but you can build mutexes out of them. They are basically the building block of all the concurrency primitives: mutexes, semaphores, wait groups, and in Go also channels — basically everything you want to synchronize, you can use a futex for.

The way it works, basically: if a lock, for example, is uncontended — there's no other thread using it at the same time — it's just an atomic instruction, so it's really, really fast. Only when it's contended, when there are multiple threads trying to use the same resource, only then is a system call made. This is a big reason why locks nowadays are super fast, as long as they're not contended.

The API looks something like this. It varies depending on the operating system — I'll get to that later — but this is basically the lowest common denominator across all operating systems. The way you can imagine this is that the operating system has an internal hash map from addresses, or pointers, to a list of threads that are waiting. And there are two important calls, or three depending on how you count.
Wait: in the kernel, something like the pseudocode you can see here happens. It takes the address that is passed, which should point to a 32-bit atomic variable, and checks whether it holds the value that it expects. If so, it adds the calling thread to the threads waiting on this specific address, in the hash map I mentioned — and if not, it just returns. But if it matches, it also pauses the thread, because that's the whole point of it. And to unpause it, another thread can call wake — or wake-one, in this case — and say to the kernel: please wake one thread that's waiting on this address. Note that the operating system doesn't watch the value itself: if you just change the value at the address, nothing happens. Remember, the address — the pointer — is basically just a key into the hash map inside the kernel. So when you call wake, it'll just wake one thread, and when you call wake-all, it will wake all waiting threads. And that's basically all you need to implement basically all the concurrency primitives.

This is how it looks on different operating systems. On Linux, it's a single, heavily overloaded system call that does a lot of different things — I'm not going to go into it, check the man page. On macOS there are two internal functions that you can call; they're not really documented anywhere, but LLVM's C++ standard library uses them internally, so I guess they're never going to change. Hopefully. Unless Apple does something stupid. On Windows, there are three separate functions for this. And WebAssembly has atomic wait and notify as well; the WebAssembly version is basically a subset of all the others, which makes sense, because WebAssembly has to run on top of them.

In TinyGo, I've wrapped all these different APIs in one simple API. A futex is basically just a superset of an atomic integer — it's basically the same API as before: wait, wake, and wake-all, just in a nice package. And I found that on bare metal I can use almost the same API. Well, the API is all internal, but the exported functions are exactly the same.
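Roughly, a wrapper like the one described could look like the following sketch. The method names follow the talk, but the bodies here are placeholders rather than TinyGo's real implementation, which would call into the OS primitives listed above:

```go
package main

import "sync/atomic"

// Futex is a superset of an atomic 32-bit integer: the embedded value gives
// it Load/Store/Add/CompareAndSwap, and Wait/Wake/WakeAll are added on top.
type Futex struct {
	atomic.Uint32
}

// Wait blocks the calling thread as long as the value still equals cmp.
// On Linux this would be a FUTEX_WAIT system call; here it only spins as a
// placeholder so the sketch stays self-contained.
func (f *Futex) Wait(cmp uint32) {
	for f.Load() == cmp {
		// real code: ask the OS to put this thread to sleep
	}
}

// Wake wakes a single thread waiting on this futex (FUTEX_WAKE on Linux).
func (f *Futex) Wake() {
	// real code: system call; the spinning placeholder above needs nothing
}

// WakeAll wakes every thread waiting on this futex.
func (f *Futex) WakeAll() {}

func main() {}
```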
So it still has an atomic value, but because TinyGo is basically the operating system — it's running bare metal, so it's the only thing running there — it can implement it however it wants. So the futex type itself just has a linked list of waiting goroutines. You can see here that, for example, the wait function is basically the same code as before, but now it uses a spinlock for locking. The RP2040, for example, has hardware spinlocks, for some reason. It does, yeah.

And this is how you actually use a futex. To be clear, this is not the same implementation as standard Go — it's similar, but not the same. Also, I skipped all the overflow checking here, so this is not super safe, but it should work. I didn't test it, though. These are the same methods as you would normally see on a futex — sorry, on a wait group: Add, Wait, and Done. Done is basically just adding a negative one, so pretty simple.

Add works because a futex is also an atomic variable: it can just atomically add the delta and check whether the result of the addition is zero — for example, when the delta is negative. If it's zero, that means it's done, and it can wake up all the waiting goroutines; so it signals every other goroutine that it has finished.

The Wait function loads the current value of the atomic variable and checks whether it's zero. If it's zero, it's done, it's ready, it can return. If it's not, it will wait. But as you can see, it passes the current counter value; if the counter has already changed by then, the wait just returns immediately. However, one thing to note is that this is a for loop. That's because the wait can spuriously return — it's an operating system limitation. I think it can spuriously return on interrupts and things like that, so it can sometimes happen. So it needs to stay in a loop until it actually finishes, but that's very low overhead.

Now, channels. Channels may seem a lot more difficult, but channels in Go are actually mostly implemented by just using a lock, basically. For example, to send a value on a channel, what happens is that the sender first locks the channel. It checks whether there's a receiver ready; if so, it can proceed with the channel operation. If not, it checks whether there's space in the buffer.
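A hedged reconstruction of the wait group just described — built on the hypothetical Futex sketched earlier, with the overflow checks omitted exactly as in the talk, and not TinyGo's exact code — could look like this:

```go
// WaitGroup is a sketch of the behaviour described above: the futex itself
// doubles as the counter.
type WaitGroup struct {
	counter Futex
}

// Add atomically adds delta to the counter (negative deltas simply wrap
// around, which is fine for an atomic add); when the counter reaches zero,
// every waiting goroutine is released.
func (wg *WaitGroup) Add(delta int) {
	if wg.counter.Add(uint32(delta)) == 0 {
		wg.counter.WakeAll()
	}
}

// Done is just Add(-1).
func (wg *WaitGroup) Done() { wg.Add(-1) }

// Wait blocks until the counter is zero. The loop handles spurious wakeups,
// and passing the value we just loaded makes the wait return immediately if
// the counter changed in the meantime.
func (wg *WaitGroup) Wait() {
	for {
		v := wg.counter.Load()
		if v == 0 {
			return
		}
		wg.counter.Wait(v)
	}
}
```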
If it's a buffered channel and there is space, it can store the value and return. And if not, it will need to wait: it adds itself to the linked list of waiting senders, unlocks the channel, and pauses itself to wait for the next receiver to use this channel. So that's relatively straightforward — it's still low-level synchronization, so not super straightforward, but relatively straightforward.

Select, however, is a lot more complicated. What you can see here is an example of what can go wrong. Say you have two goroutines, foo and bar, and they both have a select statement. If you know how select works: it basically locks every channel that is in the select statement. It has to do this, and once everything is locked, it can check whether any of them can proceed, or add itself to the senders list, et cetera, et cetera. But if you do this too naively, you can see the problem here. Goroutine foo locks channel 1, and goroutine bar locks channel 2 at the same time. Then goroutine foo tries to lock channel 2, but it can't, because it's already locked, and goroutine bar tries to lock channel 1, but that's also locked — so that's not going to work. The main Go implementation solves this by basically sorting all the channels. Channels are a reference type — internally a channel is a pointer — so it can just sort them by address. That's one solution, and if you have many cores and many select statements running at the same time, it's probably the fastest. But for TinyGo we have different trade-offs, and I basically use a global mutex for this: whenever a select is going on, it just takes that lock for a short while during processing.

So that's mostly it. I could go into mutexes, but mutexes are surprisingly complicated — they are way more complicated than wait groups, for some reason. You want to optimize the case where there's no contention; that's a separate thing, I'm not going to go into that.

So that's all the Linux stuff so far. But I also ported this to the Gopher Badge — it's actually running on the badge right now, and you can see here what's happening. This is a Mandelbrot fractal, and I managed to run two different goroutines doing the work, so the work is spread across two different cores.
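As an illustration of that kind of work splitting — this is not the badge demo's actual code, and the display size and the mandelbrot helper are made up — rendering can be divided between two goroutines like this, which is what makes the second core useful:

```go
package main

import "sync"

const width, height = 160, 128

// mandelbrot is a placeholder for the real per-pixel iteration count.
func mandelbrot(x, y int) uint8 {
	return uint8((x * y) % 256)
}

// render splits the rows of the frame between two goroutines; with multicore
// support, the scheduler can run each half on its own core.
func render(pixels []uint8) {
	var wg sync.WaitGroup
	half := height / 2
	for core := 0; core < 2; core++ {
		wg.Add(1)
		go func(startRow int) {
			defer wg.Done()
			for y := startRow; y < startRow+half; y++ {
				for x := 0; x < width; x++ {
					pixels[y*width+x] = mandelbrot(x, y)
				}
			}
		}(core * half)
	}
	wg.Wait() // both halves are done; the frame can be displayed
}

func main() {
	pixels := make([]uint8, width*height)
	render(pixels)
}
```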
And so the calculations are done in parallel. Mandelbrot calculations are quite heavy to compute — and yeah, this chip is not that fast — so by running them in parallel you can see the difference. I measured it to be about 30% faster; there's probably some overhead, I don't know. So, that works. It's still alpha quality, but I should be making a pull request somewhere in the next week or so. The way I implemented this is basically by running the scheduler — the old scheduler that's used for single-core operation — on both cores at the same time. So both cores are basically waiting for the next goroutine to become ready. There was some stuff around sleeping, but that's fixed.

So that's where we are so far. Next up, we need to clean up everything and merge it. Linux is almost ready — there are some review comments, but it's basically ready to be merged. macOS I added support for, but there are some CI issues; it also uses pthreads, so it's very similar to Linux. Bare metal needs a little bit more cleanup, because the code is still quite rough. And then maybe I'll look into WebAssembly support — not sure how many people are interested in that — and maybe ESP32 support; the ESP32 is a chip that has two cores. And of course performance improvements, but that can happen after the initial implementation is done. So that's it.

[applause]

Sadly, we have no time for questions, I'm really sorry, but we need to catch up because we have the lightning talks coming up, and we're usually famous for running over. So if you have questions for her, there is an amazing hallway track — just get her a free drink of her choice. Thank you for this amazing work on TinyGo. Thank you. Thank you.