Hello everyone, I'm Eric, I'm an engineer on the Android performance team, and I'd like to talk to you a little bit about the things we did in Chromium and Android to make it nice and fast. To give you a little sneak peek, this is what the graph of Speedometer performance on Android and Chromium looks like over roughly the last two years.

It's a graph I'm quite proud of. It's not just me that accomplished this, of course; it's a large group of people that helped us get here. But yeah, we basically doubled web performance on Android, measured as Speedometer score, through a lot of work.

To briefly take a step back: how many of you know what Speedometer is? That's a good number. It's really a web interaction benchmark; it tries to measure how fast a browser responds when users interact with a website. It does that using synthetic workloads. What you see here a lot is a to-do application where we add items to a to-do list and then remove them, and we measure how long that takes.

Why does this matter? Speedometer improvements mean interactions on the web get faster, but they also mean that page loads get faster. This is an example of loading a Google Doc in Chrome on Android two years ago versus now, and you can see that two years ago it was almost 50% slower to do, and that's all due to these kinds of improvements.

So with all of that out of the way, let me say thank you to the many people who helped us get here, and tell you what I want to talk about today. First, I'd like to break down this timeline: where does each of these improvements come from, how did we get to 2x? Then I'd like to dive deep into a rather nerdy, geeky area, the build optimizations that we added. And then talk a little bit about the tooling that enables us to understand this kind of browser performance at a lower level, close to the hardware, and how we can identify where there are optimization opportunities. So it's very much a view from a browser developer close to the bare-bones hardware, which might not be quite the same as a web developer's view, but some of the tooling, some of the insights, some of the processes might actually also be quite relevant for a web developer. Lastly, I also want to talk a little bit about what we are doing to get workloads in the lab closer to real interactions and page loads on real websites.

So yeah, this timeline. A lot of the improvements here came out of three areas. First, improvements to Chrome's build to make it faster to execute on modern Android hardware.
Second, there were a bunch of improvements to the rendering and JavaScript engines in Chrome that contributed mostly the other half of these improvements. And lastly, we had to work quite closely with OEMs and Android partners to make sure that browsing is actually scheduled correctly on the hardware.

So digging into the build first: the main thing that enabled us to make a change here was the fact that we split Chrome's build into two halves for Android. Previously, we were shipping the same APK, the same binary, to all Android devices. And you might know Android devices range from a $100 phone to a $2,000 flip phone, and these devices perform very differently. For the low end, it's really important that we ship an APK, a build, that's really small in size and has low memory overhead, because these devices ship with poorer flash, very little flash, and a lot less memory than the high-end phones.

So for these poorer, more economy-style phones, we can't really ship a very well-optimized Chrome binary, because a lot of the optimizations that we can land in Chrome increase our binary size and increase the memory footprint. Splitting Chrome's build into these two halves, shipping one APK to lower-end phones and another APK to more premium phones, made a lot of these improvements possible.

What are we doing differently for this high-end build? The first thing is that we don't optimize for size in the compiler; we optimize for speed instead. We also don't have to target 32-bit anymore; we can say, well, this build, we are only going to ship it to 64-bit devices. That's actually also something that increases the memory footprint slightly: 64-bit means every pointer becomes double the size. We can make some tweaks to that; we can use pointer compression, for example, to reduce pointer sizes in V8's garbage-collected JavaScript heaps and potentially the Blink heaps, but overall there's still going to be a memory impact. Even on low-end phones that are 64-bit capable, we wouldn't ship a 64-bit build; we would ship only a 32-bit build because of the memory impact.

And finally, what we can also now start to do is profile-guided optimization, PGO. This is basically a mechanism that runs workloads on Chrome in the lab, figures out which code is hot, which code is less hot, which code is cold, and applies different optimizations to hot and cold code. We'll get into more details of what that actually means later. So initially, we just enabled all of these things.
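To make the PGO mechanism a bit more concrete, here is a minimal, hedged sketch of what a generic profile-guided optimization flow looks like with clang. This only illustrates the general mechanism; Chrome's actual build wires PGO up through its own build system, and the file name and functions below (pgo_demo.cc, HandleCommonCase, HandleRareCase) are made up for the example.

// pgo_demo.cc - toy illustration of a generic clang PGO flow (not Chrome's
// actual build setup):
//
//   1. Build an instrumented binary:
//        clang++ -O2 -fprofile-generate pgo_demo.cc -o pgo_demo
//   2. Run a representative workload (Chrome runs Speedometer in the lab);
//      the instrumented binary writes a raw profile (*.profraw).
//   3. Merge the raw profiles into an indexed profile:
//        llvm-profdata merge -output=pgo.profdata *.profraw
//   4. Rebuild with the profile so the compiler knows which code is hot:
//        clang++ -O2 -fprofile-use=pgo.profdata pgo_demo.cc -o pgo_demo
//
// With the profile, the compiler can inline and lay out the hot path
// (HandleCommonCase) for speed and keep the cold path (HandleRareCase)
// out of the way.
#include <cstdio>

int HandleCommonCase(int x) { return x * 2 + 1; }                      // hot in the profile
void HandleRareCase(int x) { std::fprintf(stderr, "rare: %d\n", x); }  // cold in the profile

int main(int argc, char**) {
  long long sum = 0;
  for (int i = 0; i < 10000000; ++i) {
    if (i == argc - 1) HandleRareCase(i);  // taken almost never; PGO records that fact
    sum += HandleCommonCase(i);
  }
  std::printf("%lld\n", sum);
  return 0;
}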
Then we dug deeper into how we can improve on that. We switched the generation of these profile-guided optimizations to use different profiling data. Previously, we were reusing profiles that were already present for Mac 64-bit Arm devices. Now we are switching to 64-bit profiles collected on actual Android phones, and that improves performance.

We also made sure that the PGO profiles used when the binary is built later are profiles that were computed very recently. If you let a lot of time pass between generating the profile and doing your build, the profiles become stale and might not apply the correct optimizations to the correct code.

Another thing that we discovered later is that we can increase inlining in the compiler, so that the compiler prefers to pull more code into inlined functions. This increases binary size, but it's actually beneficial on modern hardware.

And lastly, we also added improvements to the order file. That's a Chrome-specific compiler feature, or really more of a linker feature, that tries to arrange functions across the binary in a sensible order. That's now based on Speedometer, and that helps a lot too. So that's all build.
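As a rough sketch of the mechanism an order file relies on, here is what function ordering looks like with plain clang and lld. This is not Chrome's actual order-file pipeline, which has its own tooling and is generated from Speedometer profiles; the function names and commands below are an assumed illustration of the general idea.

// order_demo.cc - sketch of ordering functions in the binary so that
// functions which run close together in time also sit close together in
// memory (fewer instruction-cache and TLB misses in the front end).
//
//   # Compile with each function in its own section so the linker can move it:
//   clang++ -O2 -ffunction-sections -c order_demo.cc
//   # Hypothetical "order file": symbols listed in the desired layout order.
//   printf 'ParseToken\nEmitBytes\n' > order.txt
//   # Let lld place the listed symbols first, in that order:
//   clang++ -fuse-ld=lld -Wl,--symbol-ordering-file=order.txt order_demo.o -o order_demo
//
// ParseToken and EmitBytes stand in for two functions that a profile shows
// are usually executed back to back.
extern "C" void ParseToken() { /* ...tokenize input... */ }
extern "C" void EmitBytes() { /* ...emit output... */ }
extern "C" void RarelyUsedErrorPath() { /* cold code, fine to place far away */ }

int main() {
  ParseToken();
  EmitBytes();
  return 0;
}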
Beyond the build, I mentioned there were a lot of improvements in the Chromium engine itself, like the Blink rendering engine. There were a bunch of small improvements that landed across the engine, a little bit here, a little bit there, and it adds up. Even 11, 13-plus percent improvements just from adding up lots of tiny little things over a year or two; that adds up.

But there were also a couple of bigger changes that landed. The first one called out here is an improved parser that makes it faster to parse HTML when it is inserted dynamically via the innerHTML attribute. Again, something that we didn't ship on Android before because of binary size: adding this extra parser means we would regress low-end devices.

V8 also added a new baseline compiler, Sparkplug. It's basically a tier that sits in between the really quick-to-generate code of the Ignition interpreter and the next-level-up optimizing compiler that crunches out really well-optimized code using a JIT. Sparkplug is a baseline compiler that is really quick to spit out somewhat better code, and adding that into V8 improves Speedometer but also improves other workloads significantly.

The last thing to call out here is garbage collection. I think there are more opportunities in that space, but in the last few releases last year we landed a few improvements to make sure that garbage collection happens at a better moment in time, rather than triggering at moments when it would negatively affect Speedometer scores or interactions on pages.

The last area is scheduling in the operating system. It turns out that if you don't tweak your kernel to prioritize the right threads, your performance suffers. Who would have guessed? This is an active area for us. I think we landed a couple of initial wins here with some of the OEMs and Android, but Android is really fragmented in this area. A lot of OEMs use very different scheduling heuristics and policies in their platforms, and making sure that web browsing is effectively prioritized there is a very tricky topic.

All right, so to dig a little bit deeper into this build thing, I wanted to show you a little bit of data. This is data that we collect during Speedometer on an Android device, in this case a Pixel 8, before we landed all these PGO improvements. What you can see here is that in Speedometer, execution in the CPU is often stalled in the front end of the CPU. The front end is the piece of the CPU that tries to fetch instructions from memory and pass them on to the execution units in the back end to execute. What this really means is that the front end is having trouble identifying where to fetch instructions from. Stalling here means we have to wait for the data to come in so that we can actually take these instructions and put them into the back end. And typically what that means is that you are waiting for memory: you're waiting for these instructions to come out of your cache hierarchy or out of your DRAM.

What we discovered is that in Chrome, before all of these profile-guided optimizations, a lot of the stalls we were seeing were due to branches in some form or another: cache misses that happened because we were mispredicting branches.

Speedometer is a workload that is very different from many other workloads that CPU engineers typically use to benchmark their CPUs. A very quick statistic here is on branches: in Speedometer, about 20% of the instructions are branches. That's every fifth instruction being a conditional branch that wants to go somewhere. In the benchmark workloads that compiler engineers or even CPU engineers would be more familiar with, that number is about half that. Given this many branches, it's really important that these branches are predicted correctly in the CPU. When you mispredict a branch on an Arm CPU, you end up paying not only for the mispredict itself.
Mispredicting the branch means you have to roll back all the execution to the beginning of that branch, throw away all the instructions that you executed in a predicted fashion, and then restart instruction execution there. But when you mispredicted, you also mispredicted what memory to fetch into your cache hierarchy to load your instructions from. So you end up polluting your cache hierarchy with the wrong instructions and not having the correct instructions in the cache hierarchy, which again increases the time needed to fetch data from memory in the front end.

The way you solve this is by making sure that in the code of your application, you align your branches, at the assembly level, in such a way that the fall-through path through a branch is the one that is most often taken. That's what profile-guided optimization attempts to do. It tracks which path through each branch is taken while executing a workload. It then sees that maybe 80% of the time you go this way and 20% of the time you go that way. It takes that information and, during compilation, makes sure that the 80% path is the one that falls through the branch. So in 80% of cases the branch does not have to be taken; you just fall through. This helps CPUs enormously.

What also helps is to have a CPU with bigger branch predictors. And another reason for making sure that the fall-through path is the most common one is that fall-through branches, where the CPU predicts that a branch will not be taken, don't have to take up any space in the branch predictor's caches. Branch predictor caches only store the taken branches, the branches that you have to follow.

So, jumping forward one generation in both software and hardware, to a CPU that is a little bit better in the Pixel 9, and to a Chrome that uses the PGO optimizations: you can see that the bottleneck in the CPU moves. We are no longer as front-end bound; instead we are now back-end bound. That means the instruction bottleneck, fetching the instructions and doing correct branch prediction, is a lot more optimized now. But what you can also still see is that the instructions per cycle, the efficiency of executing this workload on the CPU, is still lower when the front-end stalls are higher. So it's still very important for us to continue optimizing the front end.
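To make the fall-through idea concrete, here is a small hedged C++ sketch. C++20's [[likely]]/[[unlikely]] attributes (or __builtin_expect) are a manual way of giving the compiler the same kind of hint that PGO derives automatically from profiles, so that the common side of a branch becomes the straight-line fall-through path. The function names are made up for illustration.

#include <cstdio>

void HandleError(int code) { std::fprintf(stderr, "error %d\n", code); }

int ProcessItem(int value) {
  if (value < 0) [[unlikely]] {
    // Cold side: becomes the taken branch, moved out of the straight-line path.
    HandleError(value);
    return 0;
  }
  // Hot side: with a correct hint (or a PGO profile) this is the fall-through,
  // so the predicted-not-taken branch costs no branch-predictor cache entry and
  // keeps useful instructions flowing into the front end.
  return value * 2 + 1;
}

int main() {
  long long sum = 0;
  for (int i = 0; i < 1000; ++i) sum += ProcessItem(i);
  std::printf("%lld\n", sum);
  return 0;
}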
I mentioned the order file earlier, and the quick call-out here is that the order file improves on this in a separate way. The order file tries to make sure that, across different functions, we pull functions together into a more contiguous space in memory when those functions are often executed in temporal proximity. If function B is often executed right after function A, you want to make sure that function B is close to function A in the binary. This helps reduce pressure on the TLB. Again, improving a different bottleneck, but still in the front end.

And of course, we now also need to look into the back-end stalls a little bit more: what's causing all these back-end stalls? Our early insight here is that these back-end stalls are also bound on cache hierarchy lookups in these CPUs. Most of the time when we are stalling in the back end, it's because we have to go beyond the L3 cache in the CPU, so beyond the caches in the CPU, into DRAM itself. This means we need to understand why the cache hierarchy doesn't work for the back-end accesses either. Likely something there again scatters memory accesses in a way that the CPU finds hard to predict.

So I wanted to take a step at explaining how we might go about doing that, understanding why data accesses are scattered as well, and give you a little bit of insight into the tooling we used to get the insights I just showed you.

Most of these insights come from profiling, and I know many of you are probably familiar with DevTools in Chrome and its profiling options. As Chrome engineers, we use another tool that gives us a little bit more insight into the browser's inner workings, and that's performance tracing based on Perfetto. Perfetto gives us an attribution of all the execution in Chrome to browser tasks, and also allows us to combine that information with system profiling data like scheduling information or other system counters.

You might be familiar with the performance.mark and performance.measure APIs on the web as well. These allow us to annotate these workloads with user-journey information. For Speedometer, we added instrumentation to annotate the different sub-tests, to be able to break down whether they all behave the same way, or whether there are specific tests in Speedometer that have different bottlenecks than others.

We can then also bring additional data into the traces on top of that. For example, we can bring in CPU PMU counters, performance counters where the CPU tells us how many cycles elapsed at this moment in time, how many instructions were executed, and how many cycles that took. And we can also bring in call stack samples from Chrome and from JavaScript.
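For a feel of what those PMU counters are, here is a minimal, hedged Linux sketch that reads cycle and instruction counts around a piece of work using the perf_event_open syscall and derives instructions per cycle. It is not how Chrome or Perfetto actually integrates these counters into traces, just the raw mechanism, and it needs permission to read performance counters (the perf_event_paranoid setting).

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Open one hardware counter for the calling thread; group_fd == -1 makes it
// the group leader, otherwise it joins the leader's group.
static int OpenCounter(uint64_t config, int group_fd) {
  perf_event_attr attr;
  std::memset(&attr, 0, sizeof(attr));
  attr.type = PERF_TYPE_HARDWARE;
  attr.size = sizeof(attr);
  attr.config = config;
  attr.disabled = (group_fd == -1);  // start the whole group disabled
  attr.exclude_kernel = 1;           // count user-space work only
  attr.exclude_hv = 1;
  return static_cast<int>(syscall(__NR_perf_event_open, &attr,
                                  /*pid=*/0, /*cpu=*/-1, group_fd, /*flags=*/0));
}

int main() {
  int cycles = OpenCounter(PERF_COUNT_HW_CPU_CYCLES, -1);
  int instructions = OpenCounter(PERF_COUNT_HW_INSTRUCTIONS, cycles);
  if (cycles < 0 || instructions < 0) {
    std::perror("perf_event_open");
    return 1;
  }

  ioctl(cycles, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP);
  ioctl(cycles, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);

  // The workload under measurement (placeholder loop).
  volatile uint64_t sum = 0;
  for (uint64_t i = 0; i < 10000000; ++i) sum += i;

  ioctl(cycles, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);

  uint64_t c = 0, n = 0;
  read(cycles, &c, sizeof(c));
  read(instructions, &n, sizeof(n));
  std::printf("cycles=%llu instructions=%llu IPC=%.2f\n",
              (unsigned long long)c, (unsigned long long)n,
              c ? static_cast<double>(n) / c : 0.0);
  return 0;
}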
Together, all of this should allow us to effectively find functions in JavaScript or functions in the browser that have very poor instruction throughput, for example because they often miss in caches. So this is really tooling that lets us go from the big picture of "I'm executing Speedometer and it takes me 30 seconds" down to "this particular function in this particular sub-test is executing instructions really, really slowly, and you'd better look into why."

If you want to go even further, you can also... actually, let me skip this slide. If you want to go even further, you can go down one more level and try to understand which instructions within these functions cause the bottlenecks. We can add in data from low-level sources on Arm chips called ETM and SPE. ETM is a tracing mechanism on Arm CPUs that basically gives you the whole instruction stream and lets you identify ranges in the instruction stream where instructions were taking a very long time or where there were hiccups. SPE is a statistical tool to do something very similar: it samples loads or branches, and it can try to identify branches that were often mispredicted, or loads that were often missing in caches. For example, in this screenshot there's a load that took thousands of cycles to fulfill, and that's because we had to go down to DRAM to fulfill it. With all the symbolization data, we can actually go and understand which load that is, which instruction that is, and where in the JavaScript or in the page resources this is happening.

Speedometer is a good workload for us to do all of this with. But Speedometer is quite a synthetic workload, and it really only helps us look at a small piece of whatever a browser has to do. For that I've got a nice diagram that shows which areas of the browser are exercised by Speedometer. Speedometer is this benchmark here, and I've contrasted it with page load and with scrolling for now. In page load, we see that a bunch of other browser components are exercised. Page load also affects other pieces of the browser: for example, we have to use the network, we have to prepare requests, send them out, and get responses back. We have to parse a lot more content. We also have to do a lot more rendering during page load; there's a lot more rasterization of new resources than there is in Speedometer. And some pieces, while they are exercised by Speedometer, don't actually affect the score that is ultimately computed, so they are less relevant there either.
So overall, for us that means that when we are talking about this kind of low-level data and the low-level optimizations that we can attack together with partners and OEMs, there's a lot of the browser that we need to have better coverage for in the lab.

Chromium in the past has mainly focused on using field data to optimize for these use cases. That's why you see all these Web Vitals metrics like INP and FCP and LCP have so much prominence in the browser world. But when you want to look at instruction-level bottlenecks, you can't get this data from the field. So we need a good workload to approximate page load and scrolling in the lab. And we set out to create one, because we really didn't have one that was up to date in Chromium. We call it Loadline, and Gemini helped us generate a logo for it.

To talk a little bit about how we approached this problem: we wanted to make sure that this was a maintainable workload. When Chromium has attempted a benchmark like this in the past, it often became irrelevant because it was very hard to update the pages that were part of the benchmark, or the metrics used. So we made the choice to focus on a small workload that we could maintain and that we are happy to update in the future. We limited ourselves to only five sites and chose those based on product needs, but also based on the performance characteristics of those sites. We wanted a little bit of coverage of both fast websites and slow websites, of websites that exercise the JavaScript engine very heavily and websites that exercise the layout engine a lot more heavily, and so on. We did this by analyzing a bunch of popular websites and then selecting ones with different characteristics, in a way that maximizes coverage across a bunch of dimensions.

We also noticed that many of the metrics we use in the field, like FCP or LCP, don't work that well when you try to apply them to only five websites in the lab. So instead we made use of the fact that we've chosen only five sites by building site-specific metrics that use some knowledge of, well, in this case it actually matters when this element is shown, or when this element becomes interactable. Another aspect of this custom instrumentation, these custom metrics, is that we can build metrics that actually behave well in terms of statistical properties in the lab. If you have metrics that are bimodal, for example, that's really problematic for a lab benchmark; it introduces a lot of noise.
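As a small illustration of why bimodal metrics are a problem for a lab benchmark, here is a hedged sketch (not Loadline's actual tooling) that takes repeated page-load timings, reports the median, and flags an obvious gap between two clusters as a hint that the chosen measurement point is bimodal and will be noisy.

#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

double Median(std::vector<double> samples) {
  std::sort(samples.begin(), samples.end());
  const std::size_t n = samples.size();
  return n % 2 ? samples[n / 2] : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);
}

// Crude bimodality hint: bucket the samples into a few bins and look for an
// empty bin sitting between two populated ones.
bool LooksBimodal(const std::vector<double>& samples, int bins = 8) {
  const auto [lo, hi] = std::minmax_element(samples.begin(), samples.end());
  if (*hi == *lo) return false;
  std::vector<int> hist(bins, 0);
  for (double s : samples)
    ++hist[static_cast<int>((s - *lo) / (*hi - *lo) * (bins - 1))];
  int first = 0, last = bins - 1;
  while (hist[first] == 0) ++first;
  while (hist[last] == 0) --last;
  for (int b = first; b <= last; ++b)
    if (hist[b] == 0) return true;  // a hole between two modes
  return false;
}

int main() {
  // Hypothetical page-load times (ms) clustering around ~480ms and ~620ms,
  // depending on whether a long script happens to run before the paint.
  std::vector<double> runs = {478, 483, 490, 615, 481, 624, 488, 619, 477, 630};
  std::printf("median=%.1fms bimodal=%s\n", Median(runs),
              LooksBimodal(runs) ? "yes" : "no");
  return 0;
}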
We also have to make sure that when we measure today and we measure tomorrow, we get somewhat consistent results. So while we were using real websites, we had to make sure that they stay fixed at one point in time. For that we use a tool called Web Page Replay, or WPR, that takes a recording of a website and later replays it. It's not perfect, but what we're really looking for here is a workload we can use that is reasonably relevant.

To give you an example: this is a page load on one of our pages where we are looking at LCP, and right before LCP finishes we have a very long script execution. The browser is non-deterministic: sometimes that script execution happens before the paint, sometimes it happens after the paint. This creates bimodality in the metric, so you see a little bit of a bump around an earlier page-load time and a little bit of a bump around a later page-load time. That's something we can't have, so instead we had to choose a better moment in time for the metric for this page.

Another example: LCP doesn't always track a moment that's actually relevant. This is a page load of a CNN page. It turns out that LCP happens when this image is shown, but from an end-user perspective that's not really relevant: there's only the image, there's no text. I can't really look at this page yet, I can't really scroll it yet, I can't interact with any items on the page yet. So instead we built a metric for this particular page that waits for the main pieces of the content to be loaded and for that content to start to be interactable.

That's how we approach these things, and then you end up with a set of pages. For us this is the initial set, for the first version of this benchmark that we're using. We have one configuration for phones that mainly strives to create good coverage over a wide variety of websites, but website types that are somewhat popular in the grand scheme of things. So you have fast pages like Wikipedia, you have pages that are really slow like a news article, and you have pages that are more on the average side, like a product page on Amazon. And for each of those pages we develop metrics that either wait for a piece of the content to be ready, or in some cases we can actually use LCP because it does the right thing for that page.

We also looked at tablets; tablets are up and coming for us. On tablets, the use cases that are relevant for browser performance are a little bit different. There's a lot more focus on productivity and challenging content, also from a competitive standpoint. And so we chose a slightly different set of websites here, skewed towards larger, more challenging content for the browser.

This is, at the moment, an internal benchmark, really for Chromium engineers.
It's also built primarily for Android, not so much for desktop, and it only covers fundamental browser performance. It doesn't cover absolutely every browser feature. It doesn't cover the networking part of the browser very well, given that we are replaying network responses, as opposed to using a live server that behaves in quite different ways. So there's a lot you have to take into account when using this, but it is available and it's quite easy to run. So give it a try if you're interested. With that, let me open it up for questions.

Hi, I was just curious if you could explain why the CNN page didn't show the text initially. Is it that they lazy-load the text itself, or do they have some font problem, or why doesn't the metric work for CNN?

It's not that the metric doesn't work; why the image appears before the text, I don't know. It's something that a CNN engineer might want to look at.

Okay, one other quick question. Do you have any recommendations for framework developers to increase the chances of branch predictions being correct? There are mechanisms in C++ where you can annotate code to guide the CPU, but it's not going to look like that in JavaScript. Are there certain best practices or certain patterns?

It's a very good question, and the way I would answer it is that that's really a problem for the JavaScript engine to deal with, probably. I mean, yes, you can try to reduce branches even in JavaScript code; you might want to avoid conditions on the hot path if you can. But primarily, the JavaScript engine should take care of all of this for you, because while the JavaScript engine doesn't have profile-guided optimization the way we use it for native code, it does have all the runtime JIT information. So it effectively builds up the same data: which branches are often taken, which branches are not often taken, which functions are hot, which functions are cold. It has all of this data so that it's able to optimize very hot code in higher compiler tiers. It should already be doing these optimizations to a degree, making sure that the hot path through a branch is the fall-through, for example, or removing branches from the hot path. But that doesn't really cover the time spent executing a function before it is optimized by the JIT.

I think I know why the image arrives first: because they're optimized for LCP.

Yes, it's a great idea to take away all the other content for LCP, right? So yeah, another aspect of these metrics is that developers start to game them.

I do have a question. I've seen you mention both Speedometer 2 and 3.
Is that because you waited for 3 to be released before doing the work on it, or how did you work with Speedometer while it was being developed? I worked on that, that's why it's so interesting to me.

Yes, I have some data that is from Speedometer 3 and some data that is from Speedometer 2. That's mainly because, over this time period, at the beginning of this graph we only had Speedometer 2 available to us, and towards the end of this graph we had Speedometer 3 available, so we eventually switched to tracking the newer benchmark. Speedometer 3 is not significantly different from Speedometer 2 from the low-level perspective of what works well in the CPU versus what doesn't. A lot of the workloads in Speedometer 3 are the same ones as in Speedometer 2, just slightly updated from the framework perspective. And the newer workloads are probably a little bit more stressful for a device overall; they're slightly larger workloads, maybe with a little bit more GPU work. So there's a little bit more there for us to look at now. But overall they don't look too different.

For Loadline, you talked about these custom metrics. Can you say a little bit more about what they are, and if or how that might scale to more than the five sites?

Yeah, it doesn't scale to more sites; that is very clear to us. At the moment, they really wait for specific elements on the page. For the pages where we need a custom metric, like CNN, we make sure that the headline element is there too. On some other pages, we might interact with an element via JavaScript and then measure the time taken up to that point. For example, we wait until the menu icon appears, then we click on the menu icon and wait until the menu appears. That's a proxy that lets us say the content is actually all there and you can interact with it. It's not the case that the content is shown but the big chunk of JavaScript that makes the button interactive hasn't run yet, so you can't actually do anything on the page. So no, that's not something we can generalize to any website, at least not without a lot of work.

How many runs of the tests do you do on a single website in order to get stable results?

About a hundred. But there's work in progress to try and reduce that. The goal we set out with initially was to be able to detect a 1% difference in score, in the time taken to render a website, within one hour. Running this benchmark a hundred times on each of these pages takes about an hour at the moment, so we are roughly there for this goal, and I think there's opportunity for us to optimize this. Most of the reason it takes so long to run is that, even though the page load itself maybe takes about 500 milliseconds for a median site,
we have to tear down the browser and bring it all back up in between each iteration, to make sure that things like caching and process creation are all taken into account correctly. But there's work underway to try and identify how much of that can be emulated rather than having to re-initialize everything from scratch.

Do you have tests related to late CSS loading? Because sometimes the CSS comes in late and after that the whole website has to be re-styled. I think e-commerce websites can get quite some issues from that.

It's a good point. I don't think we cover that particular case, at least not from what I've seen. It's possible that one of these sites does have that characteristic, but I'm not sure.

All right. Thank you very much.

Thank you.