WEBVTT 00:00.000 --> 00:18.720 Hello, my name is Marco Dacos. I'm a software engineer at Google. I was going to give 00:18.720 --> 00:26.680 this presentation with Brandon Lum. Unfortunately, he fell sick. We had originally planned 00:26.680 --> 00:32.360 for, well, Brandon loves SBOMs so much that he really wanted to give the talk. So, he did 00:32.360 --> 00:39.000 a recording of the first part, but we didn't check, we didn't realize that that's 00:39.000 --> 00:45.360 not supported here at FOSDEM. So, I'm just going to give the whole thing. So, bear with 00:45.360 --> 00:54.040 me for the first part. Let's begin. Today, we're going to talk about Google's journey 00:54.040 --> 01:04.280 with SBOMs. We'll start out with how we implemented various aspects of this and then some 01:04.280 --> 01:13.880 lessons learned through this process. We will go through the SBOM lifecycle of generation, 01:13.880 --> 01:28.880 storage, retrieval, and then various applications. So, before we start off with this lifecycle, 01:28.880 --> 01:39.880 I guess first, to give some context, our initial motivation to dive deep into SBOMs 01:39.880 --> 01:47.960 was a mix of responding to EO 14028 and security and license use cases as well. Before we 01:47.960 --> 01:54.040 get into that lifecycle, let's talk about some design principles that we developed. The first 01:54.040 --> 02:01.280 question we asked is, what are we looking for in an SBOM? We rallied around two properties, 02:01.280 --> 02:08.880 accuracy and trustworthiness. First, accuracy: is the dependency information in the SBOM correct? 02:09.880 --> 02:15.120 And then trustworthiness: can we trust this SBOM and use it for important security and 02:15.120 --> 02:23.040 compliance decisions? Based on these properties, we developed a series of best practices that 02:23.040 --> 02:33.440 we'll go through in this talk. 
There's a link here as well to the document that goes through these 02:33.440 --> 02:41.600 design principles that we recently made public. So, feel free to check that out. Another 02:41.600 --> 02:48.960 question, clearly, that we ran into is SPDX or CycloneDX or both. But more generically, 02:48.960 --> 02:57.800 the question is, how opinionated do we want to be throughout this process? The scope for us 02:57.880 --> 03:02.920 was very large. It spanned organizations, different products, different 03:02.920 --> 03:12.680 tech stacks. There were a lot of moving pieces as a result. We decided that less is more: that 03:12.680 --> 03:20.440 for all of the moving pieces to come together in a better way, being more opinionated would 03:20.440 --> 03:27.000 be helpful. So, we decided to only use one SBOM standard. Based on the experience we had 03:27.080 --> 03:43.560 at the time, we went with SPDX. So, let's get into SBOM generation and how we approached 03:43.560 --> 03:51.320 this. A question that we faced here is when and where to generate SBOMs. We can do 03:51.320 --> 03:57.480 it at the source phase, during the build, or by analysis after the fact, after the artifact has been 03:57.480 --> 04:06.680 generated. On one hand, if we generate SBOMs at the source, we 04:06.680 --> 04:14.440 found that things like test and plugin dependencies would end up in the SBOM. We found 04:14.440 --> 04:21.160 that there is perhaps ambiguous dependency resolution. And on the other side, for analysis SBOMs, 04:21.240 --> 04:31.880 builds are lossy. There could be missing context. In 2022, we did some work and showed that 04:33.720 --> 04:39.480 if we, for example, import a binary into a container image, then, as we mentioned, builds are lossy, 04:39.480 --> 04:44.920 so the dependencies of that binary won't show up in the SBOM. 
04:45.560 --> 04:58.840 This is a Goldilocks situation: at the source, SBOMs are inaccurate, and at the analysis of 04:58.840 --> 05:07.960 the artifact stage, they can be incomplete. We found that the best approach is to meet in the 05:07.960 --> 05:14.360 middle, at build time, where there is a good trade-off between accuracy and completeness. We 05:14.360 --> 05:23.560 found that to really know what goes into the artifact that's being produced, 05:23.560 --> 05:32.920 the builder and the build process know best. So, we recommend that only the builders 05:32.920 --> 05:40.600 generate SBOMs. But this is again also a nuanced question, because during build time 05:41.240 --> 05:48.680 there can be an SBOM generated by the build tooling itself or an SBOM generated by an SCA tool. 05:50.760 --> 05:57.560 We recommend that wherever possible, the build tool should generate the SBOM. 05:59.400 --> 06:09.080 This lists some cases where we do that. For Android, we created a 06:10.040 --> 06:18.040 Gradle plugin that generates an SPDX SBOM during the build. We have a similar approach for 06:18.040 --> 06:28.680 google3, which is our monorepo. But outside of that, we use SCA tools such as Syft and OSV 06:28.760 --> 06:46.760 to generate the SBOM during the build process. So, now we are faced with SBOM storage. We 06:46.760 --> 06:51.800 have the SBOMs, so let's just put them into a database and be done. Well, it's a little bit 06:51.880 --> 06:59.800 more complicated than that. One of the important questions here was: to be able 06:59.800 --> 07:05.320 to use SBOMs for important security and compliance decisions, to be able to give them to the 07:05.320 --> 07:10.440 government, for example, how can we create an SBOM database that we can trust? 07:10.760 --> 07:22.600 So, for this, we developed a system called SILO, the Supply Chain Integrity Log, 07:23.800 --> 07:30.920 contrary to its name. 
It's supposed to break up metadata silos. For those familiar with GUAC, 07:31.960 --> 07:38.280 it serves a similar purpose. GUAC, in the open, is a project in the OpenSSF. It serves a similar 07:38.360 --> 07:44.200 purpose, and it's under the same team, or rather it's worked on by members of the SILO team as well. 07:47.800 --> 07:57.160 So, SILO, for some context, collects metadata from software supply chain events that occur 07:57.160 --> 08:05.960 in Google. When builders produce artifacts, they produce build provenance, they sign it, 08:06.600 --> 08:14.360 and then they send it over to SILO. And then SILO can verify the signature, 08:15.480 --> 08:21.800 can verify that it was produced by a trusted builder, and then can ingest it, store it, 08:21.800 --> 08:31.880 and make that information usable. For SBOMs, we took a similar approach. 08:32.760 --> 08:42.040 We used in-toto, and an in-toto predicate type called the reference 08:42.040 --> 08:50.200 attestation, which we have now upstreamed to the attestation repository, that associates 08:50.200 --> 08:59.880 an SBOM, or in this case SBOMs, with a software artifact. And so here, when builders produce 08:59.960 --> 09:09.400 SBOMs, they generate and sign an in-toto attestation and send it over to SILO, and SILO 09:10.200 --> 09:18.760 verifies this to ensure the integrity of the SBOM. And this also supports only accepting SBOMs 09:18.760 --> 09:37.320 from trusted builders. So now, in theory, we have a table mapping artifacts to SBOMs and 09:37.320 --> 09:51.880 retrieval should be easy, but famous last words. So what we want is to have a simple input 09:51.880 --> 10:02.520 and output of looking up an SBOM by an artifact, such as by a URI or a digest, but what we got was 10:02.520 --> 10:10.600 nothing when we did this operation. Why is that? So the context is that we're working in a 10:10.600 --> 10:18.440 supply chain. It's a graph and it has to be reasoned about as such. 
When we search for the SBOM 10:18.440 --> 10:27.320 of gcr.io/abcd, we didn't get anything because no SBOM was produced for that artifact, 10:27.400 --> 10:33.640 because that last stage when that artifact was produced was, say, a promotion process. 10:35.080 --> 10:43.160 And that's not a build. So there is no SBOM generated there. So what we can do is go back 10:43.160 --> 10:52.520 in the graph. We can look one level lower. And here, in this case, we did not find the SBOM. 10:52.760 --> 11:02.200 And then we go even further back, and it turns out that this artifact in staging is a multi-arch 11:02.200 --> 11:14.440 image that was itself assembled from different container images. And here, we found SBOMs. 11:14.440 --> 11:21.160 Great. But we can go even further back, all the way back through the chain, to collect all the 11:21.240 --> 11:29.560 SBOMs that were in some way associated with the production of this final artifact. 11:30.920 --> 11:36.440 So now, when we look up the SBOMs of an artifact, we get a collection of SBOMs. 11:37.880 --> 11:45.960 Awesome. Here are some edge cases that we ran into throughout this process. 11:45.960 --> 11:55.720 Well, going back a second, what we did was implement a transitive 11:55.720 --> 12:02.280 search through the software supply chain graph whenever looking for SBOMs, to collect all of these. 12:03.800 --> 12:09.560 And here are some edge cases that we ran into through this process. I won't go into all of them here. 12:09.560 --> 12:15.240 But in general, the principle here is that to get good quality SBOMs 12:16.440 --> 12:23.400 that are accurate and complete, we try to compose them to arrive at completeness. 12:29.400 --> 12:35.400 So we've left out a little bit of detail here. Where does this graph come from? 12:36.120 --> 12:43.320 Is this an idealized state or is it real? So let's go back to the build, which we mentioned earlier. 
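The transitive search described above can be sketched roughly as follows. This is a minimal illustration, not Google's implementation: the function name `collect_sboms`, the two dictionaries, and every artifact and SBOM identifier other than gcr.io/abcd are hypothetical, and the graph is assumed to be a simple mapping from each artifact to the artifacts it was produced from.

```python
from collections import deque


def collect_sboms(artifact, produced_from, sboms_by_artifact):
    """Walk breadth-first from `artifact` back through its inputs and
    return every SBOM id attached to any artifact reached on the way."""
    seen = {artifact}
    queue = deque([artifact])
    found = []
    while queue:
        current = queue.popleft()
        # An artifact may have zero SBOMs (e.g. a promotion step is not a build).
        found.extend(sboms_by_artifact.get(current, []))
        for parent in produced_from.get(current, []):
            if parent not in seen:  # shared inputs are visited only once
                seen.add(parent)
                queue.append(parent)
    return found


# Hypothetical supply chain: promotion <- multi-arch image <- per-arch images <- base image.
produced_from = {
    "gcr.io/abcd": ["staging/abcd"],
    "staging/abcd": ["staging/abcd-amd64", "staging/abcd-arm64"],
    "staging/abcd-amd64": ["base-image"],
    "staging/abcd-arm64": ["base-image"],
}
# Hypothetical SBOM table: only real builds produced SBOMs.
sboms_by_artifact = {
    "staging/abcd-amd64": ["sbom-amd64"],
    "staging/abcd-arm64": ["sbom-arm64"],
    "base-image": ["sbom-base"],
}

direct = sboms_by_artifact.get("gcr.io/abcd", [])
collected = collect_sboms("gcr.io/abcd", produced_from, sboms_by_artifact)
print(direct)
print(collected)
```

In this sketch the direct lookup for the promoted image is empty, while the walk back through the graph returns the SBOMs of the per-architecture images and their shared base image.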
12:43.960 --> 12:50.680 When the builder produces an artifact, it creates build provenance, which describes not only 12:50.680 --> 12:54.680 how the build was conducted, but also the dependencies and the materials that went into the build. 12:57.400 --> 13:02.520 So this provenance can be used to construct the graph. 13:04.120 --> 13:10.440 And it can be used to glue together these lost pieces of SBOMs, creating more complete SBOMs 13:10.440 --> 13:16.760 by composing them together. And this is a blog post, partly by Brandon, that goes into 13:16.760 --> 13:21.240 more detail about how SLSA and SBOMs work together. 13:27.640 --> 13:35.640 Another challenge we ran into here is that often requests for SBOMs are not at the artifact granularity. 13:36.600 --> 13:42.680 Somebody will ask for SBOMs for a product, say Pixel OS. And translating this 13:45.480 --> 13:48.840 to something that can be interpreted with the mechanisms that we developed 13:49.960 --> 13:56.120 is challenging. Maintaining software inventories is hard. And in these cases we relied on the expertise of 13:56.120 --> 14:03.800 product teams. This is an example of why. Part of this is challenging because different 14:03.800 --> 14:08.840 products can be composed of artifacts in varied ways. 14:15.320 --> 14:22.680 Okay, so we've responded to the EO, but compliance isn't fun, so let's talk about some other 14:22.680 --> 14:32.280 things that we've done with SBOMs. SBOM blobs are cool, but what they contain is even better. 14:32.360 --> 14:38.040 We used them to develop a dependency inventory across the organization. 14:38.920 --> 14:45.640 This involved parsing and storing SBOM data at scale. For now, we only store 14:46.680 --> 14:51.640 flat lists of dependencies that are in the SBOMs. 
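As a rough illustration of what a flat dependency list can look like, the sketch below reads an SPDX 2.x JSON document and keeps one row per package, ignoring relationships. It assumes the SPDX 2.x JSON field names `packages`, `name`, `versionInfo`, and `externalRefs` (with `referenceType` and `referenceLocator`); the function name `flat_dependencies` and the sample document are hypothetical and are not taken from the talk.

```python
import json


def flat_dependencies(spdx_json_text):
    """Return one {name, version, purl} row per package in an SPDX 2.x JSON
    document. Relationships between packages are deliberately ignored."""
    doc = json.loads(spdx_json_text)
    rows = []
    for pkg in doc.get("packages", []):
        purl = None
        for ref in pkg.get("externalRefs", []):
            # A purl, when present, is carried as an external reference.
            if ref.get("referenceType") == "purl":
                purl = ref.get("referenceLocator")
                break
        rows.append({
            "name": pkg.get("name"),
            "version": pkg.get("versionInfo"),
            "purl": purl,  # stays None when the generator emitted no purl
        })
    return rows


# Hypothetical two-package SBOM: one package with a purl, one without.
SAMPLE = json.dumps({
    "spdxVersion": "SPDX-2.3",
    "packages": [
        {
            "name": "left-pad",
            "versionInfo": "1.3.0",
            "externalRefs": [{
                "referenceCategory": "PACKAGE-MANAGER",
                "referenceType": "purl",
                "referenceLocator": "pkg:npm/left-pad@1.3.0",
            }],
        },
        {"name": "vendored-lib", "versionInfo": "0.1"},
    ],
})

deps = flat_dependencies(SAMPLE)
for row in deps:
    print(row)
```

A flat list like this loses the dependency graph but is simple to store and query at scale, which is the trade-off the talk describes.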
We found that the graph structure that can be 14:51.640 --> 15:02.040 encoded in SBOM relationships was useful, but that it wasn't mature enough to really be 15:02.120 --> 15:10.120 depended on. And developing such a dependency inventory with SBOMs in a large 15:10.120 --> 15:15.960 software-producing machine like Google is difficult, where it spans different products, 15:15.960 --> 15:21.880 tech stacks, different CI/CD flows, different SBOM generators. But there are some properties of SBOMs 15:21.880 --> 15:30.280 that really helped us succeed. This included the fact that SBOMs are a common format across 15:30.280 --> 15:37.880 ecosystems. That's enabled us to reason about otherwise siloed systems through a single pane of glass. 15:39.480 --> 15:45.960 And another thing that really helped here was that SBOMs are flexible both in their generation 15:45.960 --> 15:55.000 and in their contents. We did, as we talked about, put forth recommendations on how they 15:55.000 --> 16:01.080 should be produced in Google, but within that each ecosystem could choose to generate SBOMs 16:01.080 --> 16:07.640 in a way that produced the best SBOMs for them. But this flexibility also led to some challenges 16:07.640 --> 16:15.240 that we'll talk about in a bit. So we have this dependency inventory. We can combine it with 16:15.240 --> 16:21.160 things like threat intelligence and organizational metadata using systems like SILO and GUAC 16:22.120 --> 16:30.360 and use that for various use cases such as incident response, measuring risk from upstream 16:30.360 --> 16:35.800 open source dependencies, and also for general business and housekeeping purposes. 16:39.640 --> 16:46.520 This inventory enables fast identification of where packages are used. This is critical for 16:46.520 --> 16:57.560 incident response. 
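One way to picture "where is this package used" is an inverted index from package identifier to the artifacts whose SBOMs list it. The sketch below is only an assumption about how such a lookup could be structured; the names `build_index` and `affected_artifacts`, the inventory rows, and the artifact names are all hypothetical, and matching is by exact purl string.

```python
from collections import defaultdict


def build_index(inventory):
    """Build an inverted index {purl: set of artifacts} from flat
    (artifact, purl) rows taken from parsed SBOMs."""
    index = defaultdict(set)
    for artifact, purl in inventory:
        index[purl].add(artifact)
    return index


def affected_artifacts(index, purl):
    """Return the artifacts whose SBOMs list exactly this purl, sorted."""
    return sorted(index.get(purl, ()))


# Hypothetical inventory rows drawn from SBOMs of three artifacts.
inventory = [
    ("service-a", "pkg:maven/org.apache.logging.log4j/log4j-core@2.14.1"),
    ("service-b", "pkg:maven/org.apache.logging.log4j/log4j-core@2.14.1"),
    ("service-b", "pkg:npm/left-pad@1.3.0"),
    ("tool-c", "pkg:npm/left-pad@1.3.0"),
]

index = build_index(inventory)
hits = affected_artifacts(index, "pkg:maven/org.apache.logging.log4j/log4j-core@2.14.1")
misses = affected_artifacts(index, "pkg:pypi/requests@2.31.0")
print(hits)
print(misses)
```

An empty result is as useful as a hit during an incident, since it lets a team rule itself out quickly, provided the underlying SBOMs are complete and the identifiers are accurate.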
I think of this as a Log4Shell or xz type capability. A quote from 16:57.560 --> 17:01.000 a product team is that they were able to figure out that they weren't affected by an incident 17:01.000 --> 17:09.560 within 10 minutes using SBOMs. What's shown on the screen here is a sample dashboard that 17:09.560 --> 17:17.960 can search across this inventory. As you can see, searching for a single dependency 17:20.360 --> 17:28.600 searches through various siloed systems. That's part of the value that the SBOMs provide. 17:30.520 --> 17:34.760 This slide also calls out the importance of point of contact information, or 17:34.760 --> 17:43.320 attribution of artifacts to artifacts, and artifacts to products and to teams. This organizational 17:43.320 --> 17:51.000 metadata is important because otherwise the actionability of this inventory is limited and 17:51.000 --> 18:03.960 it creates toil when responding to incidents. Five minutes. It's not until 10:30. 18:05.400 --> 18:13.720 Yeah, we'll have the questions. Okay, so another use case that we 18:14.920 --> 18:22.360 applied it to is measuring the risk from our dependency usage. This graph shows 18:24.280 --> 18:29.160 the dependencies of a subset of the fleet, 18:30.040 --> 18:37.400 mapped to their OpenSSF Scorecard scores, and this identifies a danger zone here. 18:40.840 --> 18:45.800 But this really highlights the need for additional metadata such as criticality and 18:45.800 --> 18:53.800 attribution to make this more actionable. Now, I'll run through some lessons that we learned through 18:53.880 --> 19:00.680 this process of operationalizing a dependency inventory using SBOMs. As you might have noticed, 19:00.680 --> 19:07.160 we decided to use purls for the identification scheme for this inventory. We decided to use purls 19:07.160 --> 19:16.120 because they're common in SBOMs. 
There are other identification schemes as well, and 19:17.640 --> 19:21.640 we didn't want to develop a new one. However, we quickly noticed that many SBOMs, or 19:21.640 --> 19:28.200 packages in SBOMs, didn't have purls. In these cases, we generated fake purls with the 19:28.200 --> 19:32.120 information that was available in the SBOM, but this still had limited utility. 19:34.440 --> 19:40.920 And often this happened due to things like build-time generation of SBOMs from artifacts 19:40.920 --> 19:47.880 that contain third-party vendored code, where the proper identification metadata hadn't been 19:47.960 --> 19:58.360 propagated through. Other shortcomings that we ran into related to the SCA tools that we used. 19:58.360 --> 20:03.480 And these are really a fundamental result of how they operate. SCA tools attempt to 20:03.480 --> 20:08.600 read files on disk and match them to packages in a registry based only on what the file 20:08.600 --> 20:14.440 on disk says, without access to the registry. Understandably, this can lead to some 20:14.440 --> 20:23.560 issues. The problem is that different 20:23.560 --> 20:28.760 packages can have the same names or very similar manifest files within or outside of the 20:28.760 --> 20:35.240 same ecosystem, and the scanner has to choose one. For example, Unity, which is a game engine, 20:35.240 --> 20:45.960 has the same package.json file as npm. And in this case npm purls were generated for 20:46.600 --> 20:55.000 something that wasn't npm at all. Other cases include private packages. This is an example 20:55.000 --> 21:00.840 shown on the screen. This is a manifest file of a private package in a repository that 21:00.920 --> 21:09.800 was not published to npm. But an npm purl was generated for it. And by the purl spec, 21:09.800 --> 21:14.920 this purl indicates that this package is on the registry. But in reality, it's not. 
21:17.240 --> 21:22.440 So why is this important? Why do I flag this? It's because some other people, not with the best 21:22.440 --> 21:29.880 intentions, also realized this. For a lot of these packages, such as the private packages, for example, 21:31.720 --> 21:40.600 malware was submitted to the registry under that package name. The name was squatted. And then 21:42.040 --> 21:49.080 this creates a lot of toil, because whenever somebody uses a project that contains one of these, 21:49.080 --> 21:56.760 say, private packages and runs a scanner on it, it will flag it as malware, when in 21:56.760 --> 22:02.920 reality it isn't. This is an example of a thread of a project that contains such a package, and it 22:02.920 --> 22:08.360 receives a lot of complaints. Understandably, the maintainers didn't sympathize. 22:12.920 --> 22:20.120 So we also ran into some challenges with identifiers. I'm going to go through this section quickly 22:20.360 --> 22:30.600 because I don't have much time left. But the SCA problems that I mentioned are 22:30.600 --> 22:36.280 compounded by the lack of expressivity of the identifiers that we used. General identification is hard. 22:37.560 --> 22:44.040 Even if we could identify that a package is private, for example, purl doesn't support creating an 22:44.040 --> 22:50.680 identifier for it. SCA tools could extract more information such as hashes, but how to do this 22:50.680 --> 22:57.960 isn't standardized. And how to compare packages with that, say, supplemental information isn't 22:57.960 --> 23:07.240 standardized either. So I'd like to also just call out a project that 23:07.240 --> 23:11.720 I think is making a lot of progress in this space: GUAC, which we mentioned earlier. 23:12.680 --> 23:19.880 GUAC is looking at a strategy of disambiguating and correlating identifiers 23:21.800 --> 23:28.120 to help solve this issue. So these issues all fall under 23:28.760 --> 23:36.760 SBOM quality. 
Other issues have also highlighted this. 23:36.840 --> 23:41.560 The importance of SBOMs has meant that this year we're really going to focus on SBOM quality, 23:42.280 --> 23:46.840 both on syntactic elements and also semantics such as completeness and accuracy. 23:49.480 --> 23:55.560 This is an example of SBOM quality that's about accuracy, but I don't think I have to 23:55.560 --> 24:04.840 go into that here. So what now? We went from almost zero SBOMs to four million SBOMs a week, 24:04.920 --> 24:11.960 totaling over 200 million SBOMs. Security and compliance teams are using SBOMs to triage 24:14.600 --> 24:21.400 security and compliance issues. SBOMs are part of the security posture of several organizations, 24:22.440 --> 24:29.000 and we've been through a few SBOM tools along the way. 24:29.240 --> 24:35.400 This is just a list of some things that we're looking forward to, 24:35.400 --> 24:40.680 that we're focusing on this year. And yeah, that's just on the screen. 24:41.960 --> 24:45.960 Yeah, so I'm ready to take any questions, if anybody has any. 24:47.160 --> 24:54.680 Which tool do you use for recognising vulnerabilities, for example? 24:55.640 --> 24:59.640 Is that completely different? Is that from the SBOMs? 25:07.640 --> 25:17.960 Oh, yes, sorry. So the question is what do we use to detect vulnerabilities in 25:17.960 --> 25:24.360 SBOMs, and is that in scope of SBOMs or is it out of scope? Is that 25:24.360 --> 25:26.840 done separately? And in this case, it's out of scope. 25:29.880 --> 25:36.280 I also have a question. Yes. Since you mentioned the supply chain, do you reuse parts of the SBOMs, 25:36.280 --> 25:43.960 or which one results in the end SBOM that is being used as a stand-alone unit? 
25:44.760 --> 25:49.000 As opposed to the intermediate SBOMs that are being... 25:49.960 --> 25:56.280 If I understand your question correctly, we use provenance to... sorry. 25:57.000 --> 26:01.640 The question is, do we, I mean, just very 26:02.920 --> 26:12.040 explicitly, do we merge SBOMs together or do we compose them? We compose them with the mechanism that we mentioned. 26:13.320 --> 26:16.520 Yes. You've got a big data set. Really good. 26:17.160 --> 26:22.680 How are you contributing the data you're finding upstream, to make the metadata better for everybody else? 26:24.280 --> 26:32.440 Well, a good question. Yes. We have a big data set. How are we contributing this back 26:32.440 --> 26:41.960 upstream to help the community? This year we're looking at developing 26:41.960 --> 26:45.880 an SBOM quality library. This is something that we may be able to open source, 26:47.480 --> 26:53.480 and OSV-SCALIBR will be a part of the SBOM quality that we look at, and this is open source as well. 26:57.560 --> 26:59.560 So, yeah, thank you very much. 27:11.960 --> 27:20.600 Actually, the co-presenter, Brandon, was online on Matrix and was answering questions there.