WEBVTT

02:15.200 --> 04:19.200
I'm just going to ask you to be quiet and positive, and we'll just see how we get along. Okay. Let's think about the idea of mechanical sympathy. It's when we use a tool with an understanding of how it operates. And that's what this talk is about: the amazingly complicated machine that your program actually runs on, and understanding that the machine underneath is really very capable.
04:19.200 --> 08:48.760
It presents itself to you as a sequential machine, but that is an abstraction, and some knowledge of how it is actually structured can help you understand why your program performs the way it does and what you might do about it. This is, of course, a slight oversimplification.

What we're actually looking at here is, in effect, a resistor in series with a capacitor: a logic gate is really just a little bundle of wires and switches, but it has resistance and capacitance. Now, if you do a few calculations (and I hope there are no electrical engineers in the room), you assume that the reactance of the capacitance is substantially greater than the resistance in series with it, and you do a few calculations based on the impedance. You know the power being dissipated, you do a few sums, and you realise very quickly that the power increases with the square of the frequency: double the frequency and the power roughly quadruples. And for every single calculation you do, there is a factor like that: it takes twice as much energy per operation when you're running at high speed as it does at a lower one. So every operation costs roughly twice as much when you're going at high frequency.

This is why the clock speeds of high-performance computers topped out at five gigahertz or so maybe ten years ago, and they've even gone down, because we're limited by power. That statement is again slightly untrue, but it's close enough. If you look at the heat transfer coefficient that a CPU cooler can manage, it's not madly different from the heat exchanger in the primary cooling circuit of a nuclear reactor. That's astonishing, but true. Of course, nuclear reactors shift far more heat in total, because they're huge. But a modern CPU is a tremendously dense source of energy, and we can't simply keep pushing more power through it. This is particularly true in data centres, where, I think in every data centre I've visited, there's an overall design budget for power and they're running at a high percentage of it. So we're up against physical limits; the clock rate isn't going anywhere, and there are very good physical reasons for believing that. If we want to be able to go any faster, at least with the sort of computers we have in our boxes today, we're going to have to do something else. Instead of doing things twice as fast, we're going to have to do a lot more things in parallel.
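To put rough numbers on that frequency and power relationship, here is a back-of-the-envelope sketch (my own illustration with made-up ballpark figures, not measurements from the talk), using the standard dynamic-power approximation P = C * V^2 * f, where the supply voltage also has to rise to sustain a higher clock:

    // Back-of-the-envelope sketch of CMOS dynamic power scaling.
    // The numbers are illustrative, not measurements of any real chip.
    public class PowerScaling {
        // Dynamic switching power: P = C * V^2 * f (activity factor folded into C).
        static double dynamicPower(double switchedCapacitance, double volts, double hertz) {
            return switchedCapacitance * volts * volts * hertz;
        }

        public static void main(String[] args) {
            double c = 1e-9;                  // effective switched capacitance, made up
            double f1 = 2.5e9, v1 = 0.9;      // baseline: 2.5 GHz at 0.9 V
            double f2 = 5.0e9, v2 = 1.2;      // a doubled clock usually needs a higher voltage too

            double p1 = dynamicPower(c, v1, f1);
            double p2 = dynamicPower(c, v2, f2);

            System.out.printf("power ratio:         %.1fx%n", p2 / p1);
            // Energy per operation is power divided by operations per second, i.e. P / f.
            System.out.printf("energy-per-op ratio: %.1fx%n", (p2 / f2) / (p1 / f1));
        }
    }

With those made-up figures, doubling the clock costs between three and four times the power and roughly twice the energy per operation, which is the shape of the argument above.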
08:48.760 --> 12:38.760
And this is something that has been going on for a very long time. In fact, very wide parallel computers that pretend to be sequential machines go back to the end of the 1960s, when T. J. Watson famously asked his engineers why, with hundreds of people working on the System/360, it was still slower than this computer over here, designed by a dozen people working in a shed. One of the answers was: well, it's because you've got hundreds of people, obviously. The guy with the dozen people in the shed was Seymour Cray, and perhaps that's part of it. But the response in the end was to invent something like the modern superscalar machine, with very wide issue.

This is what the computer of today actually looks like, and if there is one thing I'd like you to take away from this talk, it's just thinking about what this actually is. Up here is instruction decode: on every single clock cycle, ten instructions are decoded in parallel. Now, this is a diagram of an Arm core. There are similar diagrams for Intel x86 computers, but their decoders are much, much more complicated, because Intel has variable-length instructions. On Intel, an instruction is just a stream of bytes: you don't know where it's going to start, you don't know where it's going to end, and sometimes you don't know where it ends until you actually start decoding it. So decoding instructions in parallel is very, very much harder. And the reason Intel lives with this is that its instruction set was designed a long time ago; when your architecture was invented only ten or so years ago, you can get this right.

So, all of these instructions are decoded. And this is interesting: what it's telling you is that although the 64-bit architecture only has 32 or so architectural registers, the physical machine has nine hundred physical registers. You probably didn't expect that. So what happens at the front end here, when you start decoding instructions, is that there's what's called a rename table, which basically says: okay, when the next instruction asks for register 16, it actually gets physical register 571. It's basically a massive lookup table.
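Register renaming is done entirely in hardware, but as a conceptual sketch (purely illustrative, nothing to do with how any real core is built) you can picture the rename table as a lookup from architectural register names to physical ones:

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Conceptual sketch of register renaming: each architectural register name
    // is remapped to a fresh physical register as instructions are decoded.
    // (A real core also recycles physical registers as instructions retire.)
    class RenameTable {
        private final int[] map = new int[32];          // architectural -> physical
        private final Deque<Integer> freeList = new ArrayDeque<>();

        RenameTable(int physicalRegisters) {
            for (int a = 0; a < 32; a++) map[a] = a;    // initial identity mapping
            for (int p = 32; p < physicalRegisters; p++) freeList.push(p);
        }

        // A source operand just reads the current mapping.
        int readSource(int archReg) { return map[archReg]; }

        // A destination gets a brand-new physical register, so this result
        // cannot clash with older, still-in-flight uses of the same name.
        int allocDest(int archReg) {
            int phys = freeList.pop();
            map[archReg] = phys;
            return phys;
        }
    }

So when the decoder sees something like add x16, x16, x3, it reads the current mappings for x16 and x3, allocates a fresh physical register for the result (571, say), and records that later reads of x16 now mean that physical register.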
12:38.760 --> 15:54.760
But of course, if you're going to execute all of these instructions in a different order from the order of the program, you've got to do a tremendous amount of work keeping track of where all of the results are going to end up, and that's the reorder buffer here, which keeps track of it all. So this is what's actually going on, and you can see here the different execution pipelines where the actual work gets done. So this is what the machine actually looks like.

Now, most people, even people who work on compilers for a living, tend not to think in terms of the amazing machine that it really is. Most compilers, even sophisticated optimising ones (I don't know of any exceptions), are actually based around the model of a straightforward, in-order, single-issue register machine. But conditional branches are very, very common; some people reckon there's a branch every five or six instructions, something like that. So in order to use all of that machinery, you have to predict where the program is going to go. This is speculation; this is branch prediction. But memory loads are predicted as well, not just branches. Sometimes the machine will simply say: okay, I've been reading memory sequentially, so in the absence of any other information let's just predict that we're going to keep reading sequentially. And this works a lot of the time, because almost all memory reads are sequential.

So it's worth thinking about what a clock cycle actually is. It's roughly one simple operation, a logical operation, an AND or an OR or an add or something like that, which is typically limited by some number of gate delays, something like 16 or 20. A load from the level-one cache takes a lot more than that; that's a whole nanosecond. Multiplication takes multiple cycles, et cetera, et cetera. And reading data from main memory takes forever and ever and ever; it's a huge, huge latency.

So what you actually want to do is try to arrange your program so that it works well on that sort of machine. But I imagine most people who program aren't really aware of any of this, and that's partly because the CPU designers are trying very hard to hide what's going on. They really don't want you to have to know; they're trying to preserve the illusion.
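You can feel the cost of misprediction from plain Java without any special tools. Here is a crude, self-contained sketch (my own example, not from the talk, and a rough timing loop rather than a proper benchmark) that runs the same branchy loop over shuffled and then sorted data, so the only thing that changes is how predictable the branch is:

    import java.util.Arrays;
    import java.util.Random;

    // Rough demonstration of branch prediction: the identical loop tends to run
    // faster when the branch is predictable (sorted input) than when it isn't.
    // The JIT can sometimes turn the branch into a conditional move, which hides
    // the effect; use JMH for real measurements. This is only an illustration.
    public class BranchDemo {
        static long sumAboveThreshold(int[] data) {
            long sum = 0;
            for (int v : data) {
                if (v >= 128) {        // this branch is the whole story
                    sum += v;
                }
            }
            return sum;
        }

        public static void main(String[] args) {
            int[] shuffled = new Random(42).ints(1_000_000, 0, 256).toArray();
            int[] sorted = shuffled.clone();
            Arrays.sort(sorted);

            for (int i = 0; i < 20; i++) {  // let the JIT warm up first
                sumAboveThreshold(shuffled);
                sumAboveThreshold(sorted);
            }
            long t0 = System.nanoTime();
            sumAboveThreshold(shuffled);    // unpredictable branch
            long t1 = System.nanoTime();
            sumAboveThreshold(sorted);      // predictable branch
            long t2 = System.nanoTime();
            System.out.printf("shuffled: %d us, sorted: %d us%n",
                    (t1 - t0) / 1_000, (t2 - t1) / 1_000);
        }
    }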
15:54.760 --> 19:04.760
But occasionally, as we saw with the Spectre and Meltdown bugs, this very carefully hidden abstraction leaks out. And it also affects how fast your program is going to run. So here are three things that I've worked on which are affected by this.

The first is ScopedValues, which I mentioned earlier, a new Java feature. Reading a ScopedValue conceptually walks up the Java stack to find where it was bound, and then the search can stop. We looked at various approaches in the proposal, and in the end I wrote a 16-entry software cache, entirely software managed, which is local to each thread. So whenever you look up the value of a ScopedValue, rather than walking down the stack, you just look it up in this cache. Right? Now, lookups hit the cache nearly every time, and prediction works nicely here, so it turns what was a slow search down the stack into something that is almost unbelievably cheap; in many cases there's no cost at all. The fallback, when the cache misses, is simply the slow path that walks down through the stack. So the idea is just: if you've got an expensive lookup, cache it. The cache itself is a tiny little thing; the one that's actually in the ScopedValue source is not very complicated at all, but it makes lookups quite a bit faster. It's particularly good because I spent a lot of time looking at the actual assembly code that the C2 compiler was producing for ScopedValue accesses. So a small software cache can be highly effective.
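Just to make the shape of that concrete, here is a toy sketch of the idea. This is emphatically not the OpenJDK ScopedValue code (the real cache lives inside the runtime and is managed together with the JIT); it only shows the general pattern of a small, per-thread, direct-mapped table consulted before a slow fallback:

    import java.util.function.Function;

    // Toy sketch of a per-thread lookup cache: check a tiny, software-managed
    // table before doing the expensive search. Not the real ScopedValue code.
    final class ToyLookupCache {
        private static final int SLOTS = 16;

        // One small table per thread, so there is no synchronization at all.
        private static final ThreadLocal<Object[]> CACHE =
                ThreadLocal.withInitial(() -> new Object[SLOTS * 2]);  // key, value pairs

        static Object get(Object key, Function<Object, Object> slowLookup) {
            Object[] cache = CACHE.get();
            int slot = (System.identityHashCode(key) & (SLOTS - 1)) * 2;
            if (cache[slot] == key) {
                return cache[slot + 1];            // fast path: cache hit
            }
            Object value = slowLookup.apply(key);  // slow path, e.g. walk the stack
            cache[slot] = key;                     // remember it for next time
            cache[slot + 1] = value;
            return value;
        }
    }

Because the table is tiny and thread-local, the fast path is a couple of loads and a compare, which is exactly the sort of thing the machine described above handles for free.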
19:04.760 --> 22:01.760
Okay, so the second thing. Last year I spent, I think, a good chunk of the year on this: there was a problem with the old way that interface membership checks are done. The old algorithm is very fast, but it really doesn't work well in parallel, in the multi-threaded case. When it was first designed (this is twenty-year-old code) it was fine. But now, as soon as different threads started asking the same question, the performance just died, because you've got a tremendous amount of cache-line ping-ponging going on between the various processors, and it was extremely slow.

So, this algorithm, in a benchmark, comes in at about 1.4 nanoseconds. From what I said earlier you can see that there's no way you can possibly check anything in 1.4 nanoseconds; that's essentially nothing. I mean, if you think in terms of the speed of light, it's about that far. So it can't really be doing the check at all. At this point I got interested: I didn't believe it, so what's actually going on? And the answer is this: the processor simply predicts that the check will always succeed, because it notices that it always does. So unless the guarded code uses all of the processor's resources by itself, which is very unlikely, especially on a monster like this, the check costs essentially nothing.

And here it is. You probably can't see this at the back, and you probably can't see it from the front either. But this first part is the checking code; it's, I think, nine instructions. And you'll see this isn't the Arm processor that I showed you earlier, because I can't get a pipeline model for that one. You can see that the actual code that's guarded by the check begins executing one cycle after the checking code is first issued. Completing the check takes until there, so it actually takes thirty-odd cycles. But it doesn't matter, because the processor executes it all speculatively anyway. If the check had failed, it would throw away all of this, rewind everything, throw away all that work and start again.
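For reference, the Java-level pattern behind that checking code is just an ordinary type test. This hypothetical snippet (the names are invented for illustration) is the sort of thing that compiles down to a handful of check instructions, which the processor then speculates straight past:

    // The sort of guarded code a subtype check protects. The check compiles to
    // a few instructions, and because it almost always succeeds here, the CPU
    // predicts it and starts the guarded work before the check has resolved.
    class GuardedCall {
        interface Sink { void accept(String s); }

        static void dispatch(Object candidate, String payload) {
            if (candidate instanceof Sink sink) {   // the checking code
                sink.accept(payload);               // the guarded code, started speculatively
            }
        }

        public static void main(String[] args) {
            Sink printer = s -> { /* pretend to do some real work */ };
            for (int i = 0; i < 1_000_000; i++) {
                dispatch(printer, "payload");       // the check always succeeds, so it predicts well
            }
        }
    }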
22:01.760 --> 25:06.760
So now, this is the failure. I wanted to do more of the same by improving the way invokeinterface is done; I thought we could do something similar, a nice hash-table lookup. There's a benchmark in OpenJDK, and when it runs on the machine I was testing on, it runs in less than four nanoseconds. That is completely impossible, so clearly it's doing something. And what it's actually doing is predicting the memory loads. So I changed the benchmark to scramble the order in which the targets are called, and all of a sudden this is what happened. This thing is actually executing 75 instructions, and once I scrambled the order, the performance went right, right down. So it was all about prediction.

Now, I'm running out of time, and this is perhaps a talk in itself. But the point is that invokeinterface in Java is legendarily supposed to be a slow operation. It really isn't a slow operation if it's well predicted. If it's poorly predicted, then not only is the code that does the lookup likely to be predicted badly, but all of the target method is going to be predicted badly too. And simply scrambling the order hurt the performance so badly that it went from executing six instructions per cycle down to less than one, because it's essentially executing everything twice: once with the wrong information and once with the right information. Spectacular.

So, if you want a takeaway from this: invokeinterface is nothing to be scared of. Interfaces are fine in general, but if you want calls through them to be executed quickly, that will only happen if prediction works accurately. Now, we don't know exactly how prediction really works, but it tells you: keep it simple. Keep all of your stuff simple. If you try to be clever, with elaborate lookups and long chains of tests and so on, you're probably going to lose the prediction, which you need. And that's exactly what happened here: everything I tried was worse than a simple linear search. Keep it simple, Steven.

Thank you.
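The OpenJDK benchmark itself isn't reproduced here, but the effect described above is easy to sketch with a hypothetical JMH benchmark (the names, types and sizes are invented for illustration, and the exact numbers will vary by machine): the same invokeinterface call site is driven with receivers either in a repeating, learnable pattern or in a shuffled order.

    import org.openjdk.jmh.annotations.*;

    import java.util.Arrays;
    import java.util.Collections;
    import java.util.Random;
    import java.util.concurrent.TimeUnit;

    // Hypothetical JMH sketch: the same invokeinterface call site is fast when
    // the sequence of receiver types is predictable and much slower when the
    // order is scrambled, even though the work done is identical.
    @State(Scope.Thread)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public class InterfaceCallOrder {
        public interface Op { int apply(int x); }

        static final class Add implements Op { public int apply(int x) { return x + 1; } }
        static final class Mul implements Op { public int apply(int x) { return x * 3; } }
        static final class Xor implements Op { public int apply(int x) { return x ^ 7; } }

        @Param({"false", "true"})
        boolean scrambled;

        Op[] ops = new Op[1024];

        @Setup
        public void setup() {
            for (int i = 0; i < ops.length; i++) {
                ops[i] = switch (i % 3) {       // a regular, learnable pattern
                    case 0 -> new Add();
                    case 1 -> new Mul();
                    default -> new Xor();
                };
            }
            if (scrambled) {                    // destroy the pattern
                Collections.shuffle(Arrays.asList(ops), new Random(42));
            }
        }

        @Benchmark
        public int callThroughInterface() {
            int acc = 0;
            for (Op op : ops) {
                acc = op.apply(acc);            // invokeinterface at this call site
            }
            return acc;
        }
    }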