WEBVTT 00:00.000 --> 00:14.400 So, this is a short version of a talk that I usually give about my journey from old-school 00:14.400 --> 00:15.400 cloud developer. 00:15.400 --> 00:20.080 I contributed a lot of code to OpenStack, then moved to Kubernetes, and so on. 00:20.080 --> 00:25.920 Recently I got bitten by the molecular biology bug, and that got me started in bioinformatics 00:25.920 --> 00:30.200 and doing a Master of Research in London recently, right? 00:30.200 --> 00:35.320 Of course, I don't pretend or expect to become an expert in biology, but I like 00:35.320 --> 00:41.600 to bring the old expertise I have in building clouds and things like that into this domain, right? 00:41.600 --> 00:48.120 So the project I did as part of this is what I'm going to highlight very quickly here. 00:48.120 --> 00:53.240 To begin with, we have two problems, which are, let's say, only slightly relevant for this room, 00:53.240 --> 00:56.160 but they give an idea of what we're trying to achieve. 00:56.160 --> 01:01.680 One of them is antigen specificity, meaning: given an antibody sequence, will it bind 01:01.680 --> 01:03.800 to a given antigen? 01:03.800 --> 01:08.200 For simplicity, take for example the SARS-CoV-2 spike protein, which all of us are 01:08.200 --> 01:13.920 familiar with from the pandemic. It's a binary classification problem in machine learning 01:13.920 --> 01:19.120 terms, meaning: is it going to bind, yes or no? Okay, in biological terms it's not binary, 01:19.120 --> 01:20.120 but we simplify here. 01:20.520 --> 01:25.360 The other one is paratope prediction: given the whole antibody, which positions 01:25.360 --> 01:31.600 in that sequence of amino acids are going to bind with the antigen, meaning 01:31.600 --> 01:37.240 the foreign body that we are trying to prevent from causing trouble, right?
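As a toy sketch of the two label shapes described above: one label for the whole sequence versus one label per amino acid. The sequence and labels below are invented purely for illustration, not real data:

```python
# Hypothetical antibody sequence (amino acids as one-letter codes).
antibody = "EVQLVESGGGLVQPG"

# Problem 1: antigen specificity = sequence classification.
# A single label for the whole sequence: binds the antigen (1) or not (0).
binds_antigen = 1

# Problem 2: paratope prediction = token classification.
# One label per amino acid: is this position part of the binding site?
paratope_labels = [0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
assert len(paratope_labels) == len(antibody)

# Positions predicted to contact the antigen.
binding_positions = [i for i, lab in enumerate(paratope_labels) if lab == 1]
print(binding_positions)  # prints: [2, 3, 7]
```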
01:37.240 --> 01:42.160 And this one is a token classification problem, meaning that given a sequence of tokens, 01:42.240 --> 01:48.560 in this case one for each amino acid, which of them have a positive or negative 01:48.560 --> 01:52.960 label, okay? 01:52.960 --> 01:59.360 In doing this, we are comparing a pretty large number of existing 01:59.360 --> 02:05.480 models that we fine-tune for this particular task, okay? 02:05.480 --> 02:10.920 Well, you know, that resulted in around 600 tasks, okay? 02:10.920 --> 02:16.040 For both sequence and token classification, most of them requiring GPUs, so it's not something 02:16.040 --> 02:21.000 that you'll just run on your laptop. It's not just that it requires a good amount of processing 02:21.000 --> 02:27.160 power; it also requires a good amount of orchestration, because all those tasks 02:27.160 --> 02:34.560 are chained, right? Some of them are parallelized, of course, and we need to pay a lot of attention 02:34.560 --> 02:38.400 to the order in which they run. So if you do it manually, besides taking forever, there is 02:38.400 --> 02:42.400 a very, very high chance that you introduce human error. 02:42.400 --> 02:47.000 So both in terms of repeatability, but also reproducibility for other researchers, it becomes 02:47.000 --> 02:48.520 a nightmare. 02:48.520 --> 02:55.560 So I went out and looked: hey, what can I use to automate all these things? 02:55.560 --> 03:00.160 And I looked, of course, for directed acyclic graph (DAG) solutions, which are very common, right? 03:00.160 --> 03:04.840 Two very common ones out there are, for example, Apache Airflow, which I used here, 03:04.840 --> 03:09.760 and another one, very common in the scientific world, is Nextflow. 03:09.760 --> 03:13.400 Who has heard of Apache Airflow until now? 03:13.400 --> 03:15.160 How many of you have heard of Nextflow? 03:15.160 --> 03:18.920 Okay, so both of them are very popular.
03:18.920 --> 03:21.800 And then, of course, that thing is just the orchestration layer. 03:21.800 --> 03:26.480 You need to choose an underlying platform to run it on, right? 03:26.480 --> 03:31.960 Slurm, for example, is something very common in the scientific world, 03:31.960 --> 03:34.080 in the HPC world. 03:34.080 --> 03:38.800 But coming from OpenStack, Kubernetes and so on, for me of course Kubernetes is the significantly 03:38.800 --> 03:43.480 more natural way to do this, and I wanted to see if it can work very well also for this 03:43.480 --> 03:47.440 type of HPC use case. 03:47.440 --> 03:53.160 So here is a quick overview of what the antigen affinity pipeline looks like. 03:53.160 --> 03:54.960 There are two of them, right? 03:55.040 --> 03:59.400 Again, this is a very short version of the talk, so I'm just focusing on one of the two. 03:59.400 --> 04:05.040 But you can see we begin with a lot of tasks that are needed to prepare the dataset. 04:05.040 --> 04:11.440 So we start with a raw file, a raw archive file, basically with all the sequences. 04:11.440 --> 04:17.840 We start by clustering them, very important because we don't want to have duplicates, 04:17.840 --> 04:23.280 which in terms of sequences also gets quite complicated, so I'm not going into the details. 04:23.280 --> 04:27.480 We need to split training and validation sets; then we have a completely separate 04:27.480 --> 04:32.640 test dataset, which has to be independent from the others for data-leakage considerations, 04:32.640 --> 04:37.120 and which has to be clustered as well, and potentially undersampled if needed. 04:37.120 --> 04:41.760 And then we want to have a control in which we simply shuffle all the labels, because we want 04:41.760 --> 04:45.000 to see what predictions we also get on that.
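The preparation steps just listed form a dependency graph. As a toy illustration (task names invented, not the real pipeline code), Python's standard-library graphlib can order such a graph the way an orchestrator like Airflow would:

```python
from graphlib import TopologicalSorter

# Hypothetical task graph, loosely modeled on the steps described in the talk.
# Each key maps a task to the tasks it depends on.
deps = {
    "cluster":         {"download_raw"},
    "split_train_val": {"cluster"},
    "cluster_test":    {"download_raw"},      # independent test set
    "shuffle_labels":  {"split_train_val"},   # label-shuffled control
    "fine_tune":       {"split_train_val"},   # GPU task, repeated per model
    "evaluate":        {"fine_tune", "cluster_test", "shuffle_labels"},
    "report_email":    {"evaluate"},
}

# A valid execution order: every task appears after its dependencies.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Airflow does the same ordering, plus scheduling and retries, across hundreds of tasks; this sketch only shows why a DAG is the right abstraction.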
04:45.000 --> 04:49.680 And then for each of the models, and we have a lot of them, for those of you familiar with 04:49.680 --> 04:55.120 the domain: AntiBERTa, AntiBERTa2, ESM-2, and so on; ESM-2 may be familiar to those of you 04:55.120 --> 04:57.360 who come from the Meta group. 04:57.360 --> 05:03.760 Some of them are very large models; the biggest one I used here has 15 billion parameters. 05:03.760 --> 05:08.360 And for all of them, you repeat the steps which are in the blue box; the yellow, smaller 05:08.360 --> 05:10.920 boxes are the tasks which require GPUs. 05:10.920 --> 05:16.520 So the core of that involves fine-tuning the models, getting the fine-tuned weights, and 05:16.520 --> 05:20.040 things like that, okay. 05:20.040 --> 05:24.920 At the end of it, we have completely separate tasks which generate the reports, and 05:24.920 --> 05:27.680 if everything goes well, you get an email with the results. 05:27.680 --> 05:31.560 So the idea is that you start this thing in the evening; it takes around six hours on the server 05:31.560 --> 05:36.120 that I used, with two NVIDIA A100s on it. 05:36.120 --> 05:39.200 And by the morning, you get an email, right? 05:39.200 --> 05:43.960 It's also made in a smart way, so that if a task doesn't need to be run, because I have already 05:43.960 --> 05:47.160 processed, for example, that particular part of the pipeline, it doesn't repeat 05:47.160 --> 05:50.000 the whole thing, right? 05:50.000 --> 05:53.200 And the pipeline is made to run automatically whenever there are changes in the data 05:53.200 --> 05:57.520 sets, changes in the code, and whatnot. 05:57.520 --> 06:06.000 Now I'm going to try a live demo, which is something relatively rare in lightning talks, 06:06.000 --> 06:10.520 so this is the Apache Airflow interface, okay. 06:10.560 --> 06:18.120 I take, for example, one of the prediction tasks, and I clear it, okay.
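The "don't redo what is already processed" behavior mentioned a moment ago boils down to an artifact-existence check before each task. A minimal, hypothetical sketch (the file names are invented, and Airflow has its own mechanisms for this):

```python
import tempfile
from pathlib import Path

def run_if_missing(output: Path, task) -> bool:
    """Run `task` only when its output artifact does not exist yet."""
    if output.exists():
        return False          # already processed: skip this step
    task(output)
    return True

# Demonstrate the skip: the second call finds the artifact and does nothing.
with tempfile.TemporaryDirectory() as d:
    out = Path(d) / "clustered.parquet"   # hypothetical artifact name
    ran_first = run_if_missing(out, lambda p: p.write_text("done"))
    ran_second = run_if_missing(out, lambda p: p.write_text("done"))
    print(ran_first, ran_second)  # prints: True False
```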
06:18.120 --> 06:25.720 This will tell, if the network works, it will tell Airflow to start rescheduling that 06:25.720 --> 06:29.320 particular task, so not all of them, just that one, because we don't have the 06:29.320 --> 06:30.320 time. 06:30.320 --> 06:35.520 Airflow will then, of course, contact Kubernetes through the operators behind it, 06:35.520 --> 06:41.160 and Kubernetes will start scheduling a container, right? 06:41.160 --> 06:45.320 The big difference here is that, compared to Slurm, for example, Kubernetes doesn't really 06:45.320 --> 06:50.160 have a scheduling solution which works very well for this type of task, right? 06:50.160 --> 06:54.080 That part of the work is, of course, entirely offloaded in this case to Airflow, which 06:54.080 --> 06:58.720 does a pretty good job with that: it keeps retrying, basically, and schedules a maximum number 06:58.720 --> 07:02.360 of tasks based on what your configuration is. 07:03.280 --> 07:08.040 It has already finished, and I can look here. This is a big advantage, for example, compared 07:08.040 --> 07:12.720 to things like Nextflow, because all the output from my containers comes out here, 07:12.720 --> 07:13.720 right? 07:13.720 --> 07:22.080 When that is finished, it goes here, and it sees that it has already collected 07:22.080 --> 07:28.160 everything else, so it's going to go through the next task, which 07:28.160 --> 07:33.040 involves, for example, running all the reports and everything. Here, this is the R code 07:33.040 --> 07:38.960 generating an RMarkdown report, and at the end of it, it's going to send an email, okay? 07:38.960 --> 07:46.680 So if all went well, I'm going to have an email here, which just arrived, with all my results. 07:46.680 --> 07:52.480 I also simplified it, running only two models here instead of all of them.
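The scheduling work just described, capping how many tasks run at once and retrying failures, can be sketched in plain Python. This is an illustration of the idea only, not Airflow's actual implementation; the numbers and the fake "flaky" task are invented:

```python
import concurrent.futures

MAX_ACTIVE_TASKS = 2   # e.g. one slot per available GPU
MAX_RETRIES = 3

def with_retries(task, attempts=MAX_RETRIES):
    """Re-run a failing task up to `attempts` times, like an orchestrator would."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except RuntimeError:
            if attempt == attempts:
                raise

flaky_calls = {"n": 0}
def flaky_task():
    # Pretend the first two runs hit a transient failure (e.g. pod evicted).
    flaky_calls["n"] += 1
    if flaky_calls["n"] < 3:
        raise RuntimeError("pod evicted")
    return "ok"

# The pool caps concurrency; the wrapper provides the retries.
with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_ACTIVE_TASKS) as pool:
    result = pool.submit(with_retries, flaky_task).result()
print(result, flaky_calls["n"])  # prints: ok 3
```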
07:52.520 --> 07:58.880 I click on it, and I get all these nice metrics, you know, which show me comparisons 07:58.880 --> 08:00.960 between all the models, okay? 08:00.960 --> 08:11.680 In this case, only two models, for simplicity, but this is what the whole thing 08:11.680 --> 08:16.800 looks like when comparing all of them: you see all the models on the X-axis, 08:16.840 --> 08:23.040 and the metrics, so for example recall, FPR, F1, average precision, and so on. It's very 08:23.040 --> 08:27.760 useful for comparing these runs. 08:27.760 --> 08:38.220 Moving on, in terms of architecture, you have three main components: you have the Airflow 08:38.220 --> 08:42.000 REST API and your DAGs, which are running; you have a scheduler and a queue, which feed 08:42.000 --> 08:44.360 the workers, which are the third component. 08:44.360 --> 08:48.760 If you compare it with Nextflow, Nextflow would basically be the scheduler-plus-worker 08:48.760 --> 08:51.400 part, right? 08:51.400 --> 08:52.400 We saw this already. 08:52.400 --> 08:55.320 An important consideration is related to storage. 08:55.320 --> 09:02.840 Every single container that gets spawned here, running each individual task, needs 09:02.840 --> 09:05.120 to have access to a single shared storage. 09:05.120 --> 09:06.120 How do you do that? 09:06.120 --> 09:11.160 Well, in Kubernetes you have so-called CSIs, you know, the drivers specific to storage; 09:11.160 --> 09:16.760 we need to pick one that has ReadWriteMany, meaning that you can share the same PVC, 09:16.760 --> 09:17.760 right, 09:17.760 --> 09:22.120 the persistent volume claim, with all of those containers which are running in parallel, 09:22.120 --> 09:27.160 so that they can share both the code, the storage, and so on. 09:27.160 --> 09:35.480 Two great examples in that case are Ceph for production, or NFS for POCs, right?
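As an illustration of the ReadWriteMany claim just mentioned, a shared PVC might look like the following. The name, size, and storage class are hypothetical; the only requirement is that the storage class maps to a CSI driver that supports RWX, such as CephFS or an NFS provisioner:

```yaml
# Hypothetical PersistentVolumeClaim shared by all pipeline pods.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pipeline-shared-data
spec:
  accessModes:
    - ReadWriteMany          # many pods, same volume, read and write
  resources:
    requests:
      storage: 500Gi
  storageClassName: cephfs   # or an NFS class for a proof of concept
```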
09:35.560 --> 09:42.800 So, both of them work pretty well, and they solve the problem; of course, Ceph is more 09:42.800 --> 09:46.640 complex, but it's better for serious production use cases. 09:46.640 --> 09:48.720 Optimizations: there are a few of them. 09:48.720 --> 09:52.720 Again, here I'm not going into the details, but I'm using time-slicing for this particular 09:52.720 --> 09:53.720 case. 09:53.720 --> 09:56.640 I'm sharing my two GPUs between all those models. 09:56.640 --> 10:00.600 But you can do MIG if you want more security, more multi-tenancy, or you can use 10:00.600 --> 10:08.240 MPS, but that is currently at an experimental stage in the Kubernetes device plugin, right? 10:08.240 --> 10:12.880 And then for scaling, since you want to shard the model across multiple GPUs, you want 10:12.880 --> 10:17.400 to have, of course, Hugging Face Accelerate, which I'm using for all models, and DeepSpeed, 10:17.400 --> 10:23.000 which basically allows you to split your models across multiple GPUs and across multiple 10:23.000 --> 10:24.000 nodes. 10:24.000 --> 10:30.920 There are three ZeRO stages there, one, two, three: stage one gives you less sharding, but 10:30.920 --> 10:35.480 it's less problematic to configure, and stage three allows you to do more distribution 10:35.480 --> 10:39.280 of the data, but it's more complicated to configure. 10:39.280 --> 10:43.560 What you saw here uses stage three. 10:43.560 --> 10:48.840 Last but not least, something I think I will explore more in a future talk, because 10:48.840 --> 10:52.320 I'm currently porting this pipeline also to Nextflow: 10:52.320 --> 10:55.080 people often ask me which is better, Airflow or Nextflow. 10:55.080 --> 10:58.760 In reality, there is no better option. 10:58.760 --> 11:05.880 Very shortly: one is Python, the other is Groovy; one uses a, let's say, very 11:05.880 --> 11:10.000 opinionated DSL, the other one is just Python code.
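Going back to DeepSpeed for a moment: a minimal configuration selecting ZeRO stage 3 could look like the fragment below. The batch size, offload target, and precision setting are hypothetical examples, not the values actually used in the pipeline:

```json
{
  "train_micro_batch_size_per_gpu": 4,
  "zero_optimization": {
    "stage": 3,
    "offload_param": { "device": "cpu" }
  },
  "bf16": { "enabled": true }
}
```

Stage 3 shards parameters, gradients, and optimizer states across GPUs and nodes, which is what makes a 15-billion-parameter model fit in the first place.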
11:10.000 --> 11:15.040 One has an easy learning curve; Nextflow's is a bit steeper. 11:15.040 --> 11:20.640 Nextflow is more HPC-oriented, if you want, but most importantly, Nextflow has an extremely 11:20.720 --> 11:25.560 strong community, which is nf-core, on the biology side, which is of course not present 11:25.560 --> 11:30.720 for Airflow. And in my opinion, coming also from the OpenStack 11:30.720 --> 11:35.200 experience, community is way more important than code: if there is a community, you 11:35.200 --> 11:36.720 can fix the code, right? 11:36.720 --> 11:41.400 So that's why I think it's very interesting to port this pipeline now to Nextflow, 11:41.400 --> 11:43.600 and see what comes next. 11:43.600 --> 11:44.600 Thank you. 11:44.600 --> 11:46.600 I'm done with this. 11:47.520 --> 11:52.520 We have time for one question. 11:52.520 --> 12:08.520 Yeah, while the next speaker can line up, one question, yes, that's the next step. 12:08.520 --> 12:14.920 Okay, that was a quick question, I should repeat it: am I going to port this afterwards 12:14.960 --> 12:18.120 to Nextflow, so to nf-core? Yes, that is the plan, okay? 12:18.120 --> 12:19.400 All right, thank you so much, guys.