WEBVTT 00:00.000 --> 00:09.000 All right, I'm Vlad, and I'm going to be talking about binary dependencies. 00:09.000 --> 00:10.000 Just doing it. 00:10.000 --> 00:11.000 Just doing it. 00:11.000 --> 00:12.000 All right. 00:12.000 --> 00:18.000 So we want to be able to identify the dependency graphs of all packages generally, ideally, but specifically 00:18.000 --> 00:19.000 keystone packages. 00:19.000 --> 00:21.000 What are keystone packages? 00:21.000 --> 00:27.000 Nadia, in her book and in her report, has this great definition, where it comes from sort of this 00:28.000 --> 00:29.000 biology definition. 00:29.000 --> 00:34.000 And a keystone package, we want to say, is a package that has a widespread effect on its ecosystem, 00:34.000 --> 00:38.000 or something that is disproportionately useful or depended upon by other things or whatever. 00:38.000 --> 00:42.000 So especially for the most important packages, we really want to know their dependency graphs. 00:42.000 --> 00:45.000 There's two reasons for this that I'm going to talk about. 00:45.000 --> 00:47.000 The first reason is open source financial sustainability. 00:47.000 --> 00:52.000 So obviously open source maintainers don't get paid enough or at all, 00:52.000 --> 00:54.000 burnout is a big issue, and when you have burnout, 00:54.000 --> 00:57.000 packages don't get maintained, you know the deal. 00:57.000 --> 00:59.000 And open source supply chain security. 00:59.000 --> 01:04.000 So if we don't know any particular package's dependency graph, we can't know what security issues 01:04.000 --> 01:06.000 it might be vulnerable to. 01:06.000 --> 01:10.000 So the first thing, open source is not financially sustainable. 01:10.000 --> 01:12.000 Companies charge money for their products. 01:12.000 --> 01:14.000 Their products incorporate open source software. 01:14.000 --> 01:17.000 The money goes to the companies; it doesn't flow to the open source product, 01:17.000 --> 01:18.000 the open source packages. 
01:18.000 --> 01:20.000 And so maintainers can't sustain 01:20.000 --> 01:21.000 their packages. 01:21.000 --> 01:23.000 This is a problem. 01:23.000 --> 01:25.000 Technically, this is according to the rules. 01:25.000 --> 01:31.000 Yes, open source licenses allow this, but also this can still be bad even if the rules say that it's fine. 01:31.000 --> 01:33.000 Both things can be true. 01:33.000 --> 01:36.000 I'm not going to go into a lot of detail about that. 01:36.000 --> 01:37.000 [name unclear] is doing a great talk tomorrow 01:37.000 --> 01:40.000 in the community track about burnout in open source. 01:40.000 --> 01:42.000 It has really powerful points about that. 01:42.000 --> 01:44.000 And then you can also read Chad Whitacre's 01:44.000 --> 01:47.000 blog post, The Open Source Sustainability Crisis. 01:47.000 --> 01:50.000 So here's an example from my work 01:50.000 --> 01:51.000 on the Open Source Pledge. 01:51.000 --> 01:55.000 This is an initiative where we try to get companies to pay the maintainers whose work 01:55.000 --> 01:56.000 they depend on. 01:56.000 --> 02:00.000 We ask companies to pay $2,000 per full-time-equivalent developer that they 02:00.000 --> 02:03.000 employ, per year, to any open source maintainers really, 02:03.000 --> 02:07.000 but particularly to the maintainers of the software 02:07.000 --> 02:10.000 they depend on. The money goes directly to maintainers. 02:10.000 --> 02:11.000 We don't handle any funds. 02:11.000 --> 02:15.000 We had our one-year anniversary last year. 02:15.000 --> 02:18.000 And so far, our members have raised $6.1 million for maintainers. 02:18.000 --> 02:21.000 We're very happy about it. 02:21.000 --> 02:22.000 But. 02:22.000 --> 02:23.000 Thank you. 02:23.000 --> 02:27.000 Thank you. 02:27.000 --> 02:29.000 But where should the money go? 02:29.000 --> 02:34.000 So obviously companies don't understand 100% of their dependency tree. 
02:34.000 --> 02:38.000 And so they don't necessarily know what the best way is to allocate all of that money 02:38.000 --> 02:39.000 that they're paying. 02:39.000 --> 02:43.000 And like I said, we struggle to identify especially the most important packages that 02:43.000 --> 02:45.000 keep our global infrastructure running. 02:45.000 --> 02:46.000 So that's one problem. 02:46.000 --> 02:50.000 Another problem is open source supply chain security. 02:50.000 --> 02:53.000 So let's say you have a thing and it depends on some library, 02:53.000 --> 02:55.000 which depends on some other lib and some other thing. 02:55.000 --> 02:58.000 And there's a security problem in that last one. 02:58.000 --> 03:03.000 You might want to know, because that means that there might be a security problem in your project. 03:03.000 --> 03:08.000 But if you don't know about that package, you can't do anything about it. 03:08.000 --> 03:11.000 So that's bad and we want to solve that. 03:12.000 --> 03:16.000 Yeah, so we want to be able to identify packages' dependency graphs. 03:16.000 --> 03:19.000 Now you'd think: look in the manifest, right? 03:19.000 --> 03:23.000 Because it says in package.json what things depend on. 03:23.000 --> 03:26.000 The problem is some things are not in the manifest. 03:26.000 --> 03:32.000 The dreaded phantom dependencies, which are sort of described in this really cool post. 03:32.000 --> 03:38.000 Yeah, so if the dependency is not in your manifest, you're not going to know you depend on it. 03:38.000 --> 03:42.000 And then you can't fund the people you depend on if you wanted to do that. 03:42.000 --> 03:46.000 And also you can't spot security issues from the stuff that you depend on. 03:46.000 --> 03:47.000 So that's a problem. 03:47.000 --> 03:49.000 Now, there are different kinds of phantom dependencies. 03:49.000 --> 03:52.000 I'm going to talk about one kind, which is binary dependencies. 
03:52.000 --> 03:56.000 So this is when you have something and then it calls into this dynamic library, 03:56.000 --> 03:58.000 which is usually written in C. 03:58.000 --> 04:03.000 And this kind of dependency is usually not recorded. 04:03.000 --> 04:07.000 So for example, you might have some Python code and your Python code depends on NumPy. 04:07.000 --> 04:09.000 But NumPy depends on OpenBLAS. 04:09.000 --> 04:11.000 But it doesn't say that anywhere. 04:11.000 --> 04:12.000 So we don't know that. 04:12.000 --> 04:13.000 So that's bad. 04:13.000 --> 04:15.000 It's not in pyproject.toml. 04:15.000 --> 04:22.000 So you can't support the developers of OpenBLAS or any of the other sort of binary dependencies that you rely on. 04:22.000 --> 04:25.000 And you can't track security issues in those packages. 04:25.000 --> 04:26.000 So that's not good. 04:26.000 --> 04:28.000 Now, how do binary dependencies work? 04:28.000 --> 04:29.000 I wrote a post. 04:29.000 --> 04:30.000 It's on vlad.website. 04:30.000 --> 04:31.000 You can read it. 04:31.000 --> 04:32.000 I'm not going to get into it. 04:32.000 --> 04:34.000 But I will get into this specific thing. 04:34.000 --> 04:40.000 So, how you usually call into C code in a lot of different ecosystems. 04:40.000 --> 04:43.000 So this is true for Python and Ruby and JavaScript. 04:43.000 --> 04:44.000 The details are a bit different. 04:44.000 --> 04:50.000 But basically, you want to be able to have some kind of code that you write that calls into the C library. 04:50.000 --> 04:56.000 But you also want to be able to manage type conversions between the C types and, you know, 04:56.000 --> 05:01.000 your Python data structures in a way that is, you know, maximally flexible. 05:01.000 --> 05:05.000 And so what a lot of ecosystems do is you have this thing called an extension module, 05:05.000 --> 05:07.000 which is a bunch of C code. 05:07.000 --> 05:08.000 You write a little bit. 
05:08.000 --> 05:10.000 You write a little bit of C code. 05:10.000 --> 05:15.000 And then that C code includes some headers, such as Python.h or whatever. 05:15.000 --> 05:18.000 And those headers tell it about the types in Python. 05:18.000 --> 05:22.000 And so that means that in your extension module, 05:22.000 --> 05:27.000 you can convert things, you know, back and forth whenever it's best for you. 05:27.000 --> 05:29.000 And you know, it's a performance decision. 05:29.000 --> 05:37.000 But anyway, the point is your Python code and your extension module are going to communicate in whatever way is defined by your ecosystem. 05:37.000 --> 05:42.000 But then your extension module itself gets compiled to a dynamic library. 05:42.000 --> 05:44.000 This is the important sort of point, right? 05:44.000 --> 05:48.000 And your dynamic library is going to get linked dynamically to the C thing you're calling. 05:48.000 --> 05:51.000 And so that means that your compiled extension module, 05:51.000 --> 05:56.000 if we can get a hold of that, is going to say in it what libraries it needs, right? 05:56.000 --> 06:00.000 It's cool because then we can figure out what binary dependencies you depend on. 06:00.000 --> 06:05.000 So, Python wheels, right, include vendored binaries. 06:05.000 --> 06:09.000 So, you know, if you depend on OpenBLAS, it's going to include the dynamic library, 06:09.000 --> 06:12.000 the compiled version of OpenBLAS, in your wheel. 06:12.000 --> 06:13.000 So that's good. 06:13.000 --> 06:14.000 And we can detect that. 06:14.000 --> 06:15.000 So let's investigate. 06:15.000 --> 06:19.000 I looked at the 15,000 most downloaded Python packages. 06:19.000 --> 06:23.000 I could only download 13,074 of them for various reasons. 06:23.000 --> 06:29.000 And of those wheels, 1,531 wheels contain dynamic libraries. 06:29.000 --> 06:31.000 So that's already interesting. 06:31.000 --> 06:35.000 And I found a total of 12,137 .so files. 
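A minimal sketch of the scan described above, assuming the wheels are already available locally (a wheel is a zip archive, so the standard library can list its contents). The function name `shared_objects_in_wheel` is hypothetical, and this is not the script behind the numbers in the talk; it only matches on file names, so it cannot by itself tell extension modules apart from bundled libraries.

```python
import zipfile


def shared_objects_in_wheel(wheel) -> list[str]:
    """Return the archive paths of shared objects (.so, .so.N) in a wheel.

    `wheel` may be a filesystem path or a binary file-like object.
    """
    found = []
    with zipfile.ZipFile(wheel) as archive:
        for name in archive.namelist():
            basename = name.rsplit("/", 1)[-1]
            # Matches "foo.so" as well as versioned names like "libfoo.so.1".
            if basename.endswith(".so") or ".so." in basename:
                found.append(name)
    return sorted(found)
```

Counting the wheels for which this list is non-empty, and summing the list lengths, would give figures of the same kind as the two quoted above.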
06:35.000 --> 06:40.000 Now, those .so files, some of them are bundled binary dependencies. 06:40.000 --> 06:41.000 So stuff like OpenBLAS. 06:41.000 --> 06:45.000 But some of them are just those extension modules that we wrote, you know, 06:45.000 --> 06:47.000 that we just looked at. 06:47.000 --> 06:51.000 So, you know, it's either the extension modules or the phantom dependencies. 06:51.000 --> 06:53.000 So we want to find the phantom dependencies. 06:53.000 --> 06:55.000 Okay, how do we find them? 06:55.000 --> 06:59.000 Your extension module, on Linux at least, is an ELF file. 06:59.000 --> 07:01.000 Your ELF file has a section, the dynamic section. 07:01.000 --> 07:03.000 And in there, it has a bunch of information. 07:03.000 --> 07:05.000 But some of that information has a tag. 07:05.000 --> 07:06.000 And the tag is DT_NEEDED. 07:06.000 --> 07:10.000 So it describes a library that your extension module needs. 07:10.000 --> 07:15.000 And in there, it's going to tell you the name of that dynamic library. 07:15.000 --> 07:20.000 And it's that name that the runtime linker uses to find that thing on the computer. 07:20.000 --> 07:23.000 And so in theory, the name is sufficient for us to find it. 07:23.000 --> 07:24.000 Right? 07:24.000 --> 07:26.000 So we can look inside the dynamic. 07:26.000 --> 07:29.000 We can look inside those dynamic libraries and find those names. 07:29.000 --> 07:36.000 And so I decided to figure out what the most common dynamic libraries are in our 13,000-odd wheels. 07:36.000 --> 07:38.000 Some of them are obvious. 07:38.000 --> 07:42.000 I think, you know, libm, whatever. 07:42.000 --> 07:44.000 So that's fine. 07:44.000 --> 07:47.000 But some of them, I don't, I mean, maybe you all know all of these things. 07:47.000 --> 07:49.000 In which case, that's super cool. 07:49.000 --> 07:50.000 I don't. 07:50.000 --> 07:53.000 I mean, I know like a few of them, but I also wasn't expecting them to be this common. 
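A rough sketch of reading those DT_NEEDED names using only the standard library. It assumes a 64-bit little-endian ELF file and locates the dynamic entries through the PT_DYNAMIC program header; the function name `needed_libraries` is hypothetical, and an established ELF parser would be a more robust choice than this hand-rolled reader.

```python
import struct

PT_LOAD, PT_DYNAMIC = 1, 2               # program header types
DT_NULL, DT_NEEDED, DT_STRTAB = 0, 1, 5  # dynamic entry tags


def needed_libraries(data: bytes) -> list[str]:
    """Return the DT_NEEDED library names of a 64-bit little-endian ELF."""
    if data[:4] != b"\x7fELF" or data[4] != 2 or data[5] != 1:
        raise ValueError("this sketch only handles 64-bit little-endian ELF")
    (e_phoff,) = struct.unpack_from("<Q", data, 32)
    e_phentsize, e_phnum = struct.unpack_from("<HH", data, 54)

    loads, dynamic = [], None
    for i in range(e_phnum):
        p_type, _, p_offset, p_vaddr, _, p_filesz, _, _ = struct.unpack_from(
            "<IIQQQQQQ", data, e_phoff + i * e_phentsize
        )
        if p_type == PT_LOAD:
            loads.append((p_vaddr, p_offset, p_filesz))
        elif p_type == PT_DYNAMIC:
            dynamic = (p_offset, p_filesz)
    if dynamic is None:
        return []  # statically linked, or not a dynamic object

    name_offsets, strtab_addr = [], None
    start, size = dynamic
    for pos in range(start, start + size, 16):
        tag, value = struct.unpack_from("<qQ", data, pos)
        if tag == DT_NULL:
            break
        if tag == DT_NEEDED:
            name_offsets.append(value)  # offset into the string table
        elif tag == DT_STRTAB:
            strtab_addr = value         # virtual address of the string table
    if strtab_addr is None:
        return []

    # DT_STRTAB holds a virtual address; map it to a file offset via PT_LOAD.
    for vaddr, offset, filesz in loads:
        if vaddr <= strtab_addr < vaddr + filesz:
            strtab_off = strtab_addr - vaddr + offset
            break
    else:
        raise ValueError("string table is not inside any loadable segment")

    names = []
    for off in name_offsets:
        begin = strtab_off + off
        names.append(data[begin:data.index(b"\0", begin)].decode())
    return names
```

Feeding it the bytes of each `.so` file found in a wheel and tallying the returned names is one way to arrive at a "most common libraries" ranking like the one discussed here.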
07:53.000 --> 07:55.000 Other ones, I just don't know what they are at all. 07:55.000 --> 07:59.000 So anyway, I don't sort of have a concrete breakdown. 07:59.000 --> 08:03.000 But it's interesting because there's a bunch of stuff that's really depended upon that we don't know. 08:03.000 --> 08:05.000 I don't know what it is. 08:05.000 --> 08:10.000 And so I wanted to come up with a number for how common binary dependencies are in the Python ecosystem 08:10.000 --> 08:12.000 generally. It's going to be a very inaccurate number. 08:12.000 --> 08:13.000 But let's give it a try. 08:13.000 --> 08:15.000 So we have 13,000-odd packages. 08:15.000 --> 08:19.000 We have 1,531 wheels that contain .so files. 08:19.000 --> 08:24.000 We have 1,303 other wheels that depend on those first wheels, right? 08:24.000 --> 08:27.000 So that gives us 2,834 wheels. 08:27.000 --> 08:32.000 Now, when we generalize this, we're going to keep in mind that this is obviously a small sample, 08:32.000 --> 08:34.000 even though it is the most popular packages. 08:34.000 --> 08:37.000 And also, this is only one way of doing binary dependencies, right? 08:37.000 --> 08:39.000 There's other ways. 08:39.000 --> 08:42.000 But, you know, taking that into account, if we were to average that out, 08:42.000 --> 08:46.000 we can say that around 20% of Python packages have phantom binary dependencies. 08:46.000 --> 08:51.000 Which means that for those packages, which is a lot of packages, we don't have the full picture of the dependencies, 08:51.000 --> 08:54.000 especially since, frequently, 08:54.000 --> 08:59.000 you know, you're not going to have a binary dependency for something that's like a left-pad type situation, I hope. 08:59.000 --> 09:02.000 So it's going to be a thing that you really rely on, right? 09:02.000 --> 09:03.000 So we want to know about those things. 09:03.000 --> 09:05.000 And this affects pretty much everyone. 
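The arithmetic behind that estimate, written out as a small worked example. The three input numbers are the ones quoted in the talk; the variable names are only illustrative.

```python
# Figures quoted in the talk.
wheels_downloaded = 13_074   # wheels that could be downloaded
wheels_with_so = 1_531       # wheels that contain .so files
dependent_wheels = 1_303     # other wheels that depend on those wheels

# Wheels that carry a binary dependency directly or through a dependency.
affected = wheels_with_so + dependent_wheels
share = affected / wheels_downloaded

print(affected)                # 2834
print(round(share * 100, 1))   # 21.7
```

So the share for this sample works out to roughly 21.7%, which is where the "around 20%" figure comes from.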
09:05.000 --> 09:07.000 It affects users of the open source software. 09:07.000 --> 09:10.000 And it also harms maintainers because the maintainers can't get supported. 09:10.000 --> 09:13.000 Yeah, so I'd like to continue working on this. 09:13.000 --> 09:14.000 This was just for Python. 09:14.000 --> 09:16.000 We can do this in more detail for Python. 09:16.000 --> 09:18.000 We can do it for all of these other ecosystems. 09:18.000 --> 09:22.000 We could look at the symbols inside the dynamic libraries, 09:22.000 --> 09:26.000 in case we have some trouble identifying what library a file belongs to. 09:26.000 --> 09:31.000 We can look at the symbols and build a database of, you know, which symbols are provided by what package. 09:31.000 --> 09:37.000 I would like to publish some kind of tool that makes it easy to explore these binary dependencies for any given package. 09:37.000 --> 09:40.000 And integrating into services like ecosyste.ms would be great. 09:40.000 --> 09:44.000 I'd hope to get some funding to work on this. If you know of anyone that might be willing to fund this, 09:44.000 --> 09:45.000 let me know. 09:45.000 --> 09:46.000 And that's it. 09:46.000 --> 09:47.000 Thank you very much. 09:47.000 --> 09:49.000 There's more details on my website. 09:57.000 --> 09:58.000 All right, let's do some questions. 09:58.000 --> 09:59.000 I'll fund you. 09:59.000 --> 10:00.000 Amazing. 10:00.000 --> 10:01.000 Easy. 10:01.000 --> 10:02.000 Thank you very much. 10:02.000 --> 10:05.000 It's not a question, but I welcome it. 10:05.000 --> 10:07.000 I'm kind of a question. 10:07.000 --> 10:08.000 Sorry. 10:08.000 --> 10:10.000 The answer is yes to your question. 10:10.000 --> 10:11.000 Let's do a different question. 10:11.000 --> 10:13.000 We have a question over there. 10:13.000 --> 10:20.000 Do you think the future culture should just debate these kind of binary dependencies? 10:20.000 --> 10:24.000 I don't. 10:24.000 --> 10:25.000 Yes. 
10:25.000 --> 10:30.000 So do I think that package managers, or sort of package managers or package 10:30.000 --> 10:31.000 Yeah. 10:31.000 --> 10:32.000 Yeah. 10:32.000 --> 10:35.000 registries should not allow these kinds of binary dependencies? 10:35.000 --> 10:36.000 No. 10:36.000 --> 10:37.000 I don't think so. 10:37.000 --> 10:42.000 I think if anything, we need to get better at handling them because, you know, a lot of people 10:42.000 --> 10:43.000 use them for very good reason. 10:43.000 --> 10:47.000 If anything, we don't have the infrastructure right now to do that in a reliable way. 10:47.000 --> 10:48.000 Yeah. 10:48.000 --> 10:49.000 Just one word. 10:49.000 --> 10:51.000 There's a button on this thing. 10:51.000 --> 10:53.000 We'll close it. 10:53.000 --> 10:54.000 Yeah. 10:54.000 --> 10:55.000 Two. 10:55.000 --> 10:56.000 Four. 10:56.000 --> 10:57.000 Four. 10:57.000 --> 10:58.000 Girls. 10:58.000 --> 10:59.000 And we have the offer right there. 10:59.000 --> 11:01.000 Raise your hand. 11:01.000 --> 11:02.000 Hello. 11:02.000 --> 11:03.000 The truck. 11:03.000 --> 11:04.000 Binaries. 11:04.000 --> 11:06.000 That's our vendor. 11:06.000 --> 11:07.000 There's another thing. 11:07.000 --> 11:09.000 Not my son, please. 11:09.000 --> 11:10.000 Raise your hand. 11:10.000 --> 11:12.000 So main things of package. 11:13.000 --> 11:14.000 Test truck. 11:14.000 --> 11:16.000 S-S-T-R-E. 11:16.000 --> 11:18.000 Which instruments. 11:18.000 --> 11:21.000 G-C-C and L-D-A-B-E-L. 11:21.000 --> 11:25.000 To inject all of the dependencies in an ELF section. 11:25.000 --> 11:27.000 So there are people who believe that you said there's two. 11:27.000 --> 11:28.000 So there's PEP. 11:28.000 --> 11:29.000 Haven't you haven't happened then? 11:29.000 --> 11:34.000 You know that the dependencies because they were bound up at the up time. 11:34.000 --> 11:35.000 So. 11:35.000 --> 11:36.000 We isn't as bound. 11:36.000 --> 11:37.000 For the time we spread. 
11:37.000 --> 11:38.000 Yes. 11:38.000 --> 11:39.000 Yes. 11:39.000 --> 11:40.000 So just to repeat this. 11:41.000 --> 11:42.000 Yeah. 11:42.000 --> 11:43.000 You said there's PEP. 11:43.000 --> 11:44.000 Seven. 11:44.000 --> 11:45.000 Seven. 11:45.000 --> 11:46.000 Twenty-five. 11:46.000 --> 11:47.000 Okay. 11:47.000 --> 11:48.000 So I will definitely look into that. 11:48.000 --> 11:49.000 Hello. 11:49.000 --> 11:50.000 And then the other thing was called. 11:50.000 --> 11:52.000 S-T-R-E. 11:52.000 --> 11:53.000 S-T-R-E. 11:53.000 --> 11:54.000 Okay. 11:54.000 --> 11:55.000 Cool. 11:55.000 --> 11:56.000 So I'll look at those things. 11:56.000 --> 11:57.000 Thank you very much. 11:57.000 --> 11:59.000 You can go to github.com slash sunny. 11:59.000 --> 12:00.000 Yeah. 12:00.000 --> 12:01.000 Yes. 12:01.000 --> 12:02.000 Amazing. 12:02.000 --> 12:03.000 Thank you. 12:03.000 --> 12:04.000 Any. 12:04.000 --> 12:05.000 Raise your hand. 12:05.000 --> 12:06.000 I may sound. 12:06.000 --> 12:08.000 You can go to this guy there. 12:08.000 --> 12:09.000 Amazing. 12:10.000 --> 12:11.000 We have one question over there. 12:11.000 --> 12:12.000 If I still have time for one question. 12:12.000 --> 12:13.000 Yes. 12:13.000 --> 12:14.000 I'll get a curse. 12:14.000 --> 12:15.000 It's not so much a question. 12:15.000 --> 12:19.000 It occurs to me that if you said this is called phrase or run. 12:19.000 --> 12:20.000 You will discover every day. 12:20.000 --> 12:21.000 Yeah. 12:21.000 --> 12:22.000 Yeah. 12:22.000 --> 12:23.000 I've built a blue. 12:23.000 --> 12:24.000 Exactly. 12:24.000 --> 12:25.000 For that. 12:25.000 --> 12:26.000 He's a trace. 12:26.000 --> 12:27.000 It's painful. 12:27.000 --> 12:28.000 Because. 12:28.000 --> 12:29.000 Yeah. 12:29.000 --> 12:30.000 Interesting. 12:30.000 --> 12:31.000 And doesn't know. 12:31.000 --> 12:33.000 Because you don't find exactly this. 12:33.000 --> 12:34.000 Okay. 12:34.000 --> 12:35.000 Then you don't know. 
12:35.000 --> 12:36.000 All right. 12:36.000 --> 12:37.000 All right. 12:37.000 --> 12:38.000 Thank you very much. 12:39.000 --> 12:40.000 Good job. 12:40.000 --> 12:41.000 Thank you. 12:41.000 --> 12:42.000 Thank you. 12:42.000 --> 12:43.000 Good job. 12:43.000 --> 12:44.000 Good job.