Hello everybody, thanks for being here. I'm Simone Tiraboschi. I work at Red Hat on the KubeVirt project.

Okay, can you hear me better now? Thank you.

Today we are going to talk about scheduling. Scheduling is the process of matching workloads to nodes. By default, KubeVirt uses the standard Kubernetes scheduler, kube-scheduler. As you know, KubeVirt is about virtual machines, but a virtual machine in the end is executed in a pod, and that pod has to be scheduled. The pod is scheduled by the standard Kubernetes scheduler.

Let's make a small digression about how we define the resource needs of our workloads. In Kubernetes, on a pod, and likewise on virtual machines, we have requests and limits. Requests are the amount of resources that are allowed to be used with a strong guarantee of availability: you can request CPU and you can request memory, and the scheduler is never going to overcommit on a request. You ask for a certain amount of resources, and you are going to get it. A limit is the maximum amount of resources that can be used, without any guarantee; the scheduler completely ignores limits. What's the implication of this? The request is typically less than the limit. This means that you are overcommitting a bit, counting on the resources being available on the node. If usage goes over the limit, in the case of CPU you are going to be throttled: you cannot go over the limit. In the case of memory, you can eventually be killed by the out-of-memory killer.

In KubeVirt we manage virtual machines. On your virtual machine you define what you need: a certain number of CPU cores, a certain amount of memory. The KubeVirt controller translates that into the resources of a pod. In general, we are not setting limits. We are not setting limits on memory because we don't want our virtual machines to be killed by the out-of-memory killer, and we are not setting limits on CPU because we want to take full advantage of the resources available on the node.

Normally, KubeVirt does not overcommit in terms of memory. If you require two gigabytes of memory for your virtual machine, KubeVirt is going to render a pod requesting two gigabytes of memory, plus something more as a safety threshold for the ancillary services running inside the virt-launcher pod. On CPU, by default, KubeVirt overcommits by a factor of 10. It means that if you require 4 CPU cores, the KubeVirt controller is going to render a pod configured to request 0.4 CPU, or 400 millicores. From the scheduler's point of view, a virtual machine with 4 cores is just a pod requesting 0.4 CPU to be scheduled.
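To make it concrete, here is a minimal sketch of that translation; the VM name is illustrative, the exact memory overhead varies, and the factor of 10 corresponds to the default cpuAllocationRatio setting in the KubeVirt configuration.

    # Sketch: what the user asks for on the VirtualMachine (illustrative name):
    apiVersion: kubevirt.io/v1
    kind: VirtualMachine
    metadata:
      name: my-vm
    spec:
      template:
        spec:
          domain:
            cpu:
              cores: 4
            memory:
              guest: 2Gi
    # ...and roughly what the KubeVirt controller renders on the
    # virt-launcher pod. Note: no limits are set by default.
    #   resources:
    #     requests:
    #       cpu: 400m   # 4 cores divided by the default overcommit factor of 10
    #       memory: 2Gi plus a safety overhead for the ancillary
    #               services in the pod (exact value varies)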
Scheduling is the process of binding a virtual machine, and then a pod, to a node. The scheduler should find a node that is capable of executing that workload, ensuring that the node has a fair amount of resources, matching at least the resources that we require. The process of scheduling is a bit more complex than that. We have predicates: a predicate, for the scheduler, is something that must hold on a node for it to be able to handle the workload that is going to be scheduled. The scheduler takes the definition of the pod and starts looking at all the nodes in the system, scanning all of them. The first step is filtering: nodes that do not match the predicates are filtered out, because they are not good candidates. Then there is a scoring mechanism, which is not mandatory: it is just a priority, a weight. Nodes are weighted, and the scheduler chooses the best-fitting node.

Now, we have virtual machines and we have pods, but virtual machines and pods are different. On a virtual machine you have a different operating system, and it usually requires more resources than a pod: you probably need to allocate a few cores, and probably a few gigabytes of RAM. The boot and start-up time is slow, because you need to boot a full operating system. Usually a virtual machine is stateful: we have data on a volume and it has to be kept safe, so we have persistent volumes. Virtual machines can be live-migrated between nodes without downtime; you cannot do that with pods. On the other side, you can easily restart a pod on a different node, because its start-up time is short. In order to scale out a virtual machine, you probably need to reconfigure it. The user expectations are also different, because the user is not supposed to see his virtual machine continuously rebooting.

When you define a virtual machine, you will be surprised by how close it is to Kubernetes. Basically, you can use the same semantics, using node selectors or affinity. We tried to keep it as close as possible to Kubernetes, basically one-to-one.
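For example, a minimal sketch with illustrative names: the same nodeSelector and affinity stanzas that you would put on a pod go on the virtual machine template, and KubeVirt copies them one-to-one onto the virt-launcher pod.

    apiVersion: kubevirt.io/v1
    kind: VirtualMachine
    metadata:
      name: my-vm
    spec:
      template:
        spec:
          nodeSelector:                    # plain pod semantics
            topology.kubernetes.io/zone: zone-a
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: kubernetes.io/arch
                    operator: In
                    values:
                    - amd64
          domain:
            devices: {}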
But what do users really want? I worked on traditional virtualization before working on Kubernetes. We know that when cluster admins migrate a virtual machine, they like to be able to select the node where they want the virtual machine to land. Coming from traditional virtualization systems, that is the semantics you expect. On the other side, in KubeVirt, a live migration is just an instance of the VirtualMachineInstanceMigration object. It's a namespaced object, so a namespace admin can create it. In the spec of this object, you can only specify the name of the virtual machine that you want to migrate. This means that you have no control at all: it's up to the scheduler to select a node that is going to fit this virtual machine. It's quite different from what the users of traditional virtualization systems are used to.

This has been debated a lot in the community. We had a few proposals in the past, and a lot of discussions here and there. It's not a new idea: we know that users are expecting something different, but up to now we have still not been able to converge on this. It's controversial; let me try to explain why. We know from experience that cluster admins are used to controlling where their traditional workloads are going to be moved, just because they are used to doing that: maybe they are relying on existing patterns or automation, or they are planning maintenance on one set of nodes after the other. Whatever the reason, they know what they want to do. On the other side, as a virtual machine owner, you don't want to see your object updated or amended just because a cluster admin needs to schedule that virtual machine on a different node.

The goal of what we are talking about is to allow a cluster admin to trigger a live migration of a virtual machine while limiting the set of candidate nodes. The target constraint that is explicitly required for one live migration should not stay there: it should not influence the future of the virtual machine, it is just for a one-off migration attempt, and it should not bypass the constraints that are already set on the virtual machine. You should not be able to bypass what is there by overriding it.

We are now proposing a really simple design. The idea is that directly on the VirtualMachineInstanceMigration object we can add an additional node selector that is going to be merged with, preserving, all the node selectors and all the affinity rules that are already set on the virtual machine. From a CLI point of view, it's just about passing an additional parameter. At that point you can inject, on the fly, a set of additional constraints that will be respected for that migration attempt.
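A sketch of the proposed API follows; the addedNodeSelector field and the CLI flag come from the design proposal and may still change. The added selector can only narrow the candidate set: it is ANDed with whatever is already on the virtual machine, and it applies to this one migration attempt only.

    apiVersion: kubevirt.io/v1
    kind: VirtualMachineInstanceMigration
    metadata:
      name: migrate-my-vm
    spec:
      vmiName: my-vm                   # the only field you can set today
      addedNodeSelector:               # the proposed addition
        kubernetes.io/hostname: node02
    # From the CLI, something like:
    #   virtctl migrate my-vm --addedNodeSelector kubernetes.io/hostname=node02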
The proposal is quite simple, but it got a lot of criticism. The first one: we are on Kubernetes, a cloud-native solution; Kubernetes has the scheduler, and the user should not interfere with it, should not take it over. Okay, but on plain Kubernetes we cannot live-migrate a pod at all: the concept is simply not there. And without KubeVirt we don't have virtual machines at all; but here we have KubeVirt, we have virtual machines, so we can somehow handle this.

Then, of course, on Kubernetes we have a native paradigm to individually address some, if not all, of the use cases that I presented before: adding taints and tolerations, cordoning and uncordoning nodes. There is a native way of doing that. But it is probably not as intuitive as it should be for experienced cluster admins who simply want to live-migrate a virtual machine from this node to that one, for whatever reason.

Another criticism that got raised in the community is that live migrations are resource-expensive operations. We know that they consume a lot of bandwidth within the cluster, so they are capped at a certain number. The concern is that if we allow users to freely manage live migrations, they could start abusing that, introducing too much load on the cluster.

We solved that; let me try to quickly explain. On Kubernetes we have two different admin roles: the cluster-admin role, which is allowed to do whatever on any resource in the cluster, and the admin ClusterRole, which is normally supposed to be bound to a user inside a single namespace. The KubeVirt admin role is aggregated to that admin role. I think it's a common practice to grant the admin role to selected users inside a namespace: the namespace owner is a kind of tenant, owning the resources there. Right now, by default, those users are allowed to create and delete virtual machines, and also VirtualMachineInstanceMigration objects. And the Kubernetes RBAC model is purely additive: you cannot deny anything, you can only add. Since this is granted by default installations, it means that all of your namespace owners can trigger live migrations for the virtual machines in their namespace. What's the issue? We have only a single migration queue. It means that namespace users can affect cluster-critical operations like node drains or upgrades, because their live-migration requests end up in the same queue.

So, in the next version of KubeVirt, we decided that the KubeVirt admin role is, by default, no longer going to be allowed to create and delete VirtualMachineInstanceMigration objects. This is going to be granted only with an additional role, named kubevirt.io:migrate. As a cluster admin, you will be able to grant it to selected users, or eventually label it to be aggregated, as in the past, to the admin cluster role to get back the previous behavior. It's not an API change; it's purely additive.
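As a sketch, with illustrative namespace and user names, granting the new role to a single user could look like this:

    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: allow-vm-migrations
      namespace: tenant-a
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: kubevirt.io:migrate
    subjects:
    - kind: User
      name: alice
      apiGroup: rbac.authorization.k8s.io
    # Or, to restore the previous behavior cluster-wide, aggregate the role
    # back into the built-in admin role:
    #   kubectl label clusterrole kubevirt.io:migrate rbac.authorization.k8s.io/aggregate-to-admin=true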
Then, back to our initial problem. The mantra that has guided us over the years is the KubeVirt razor: if something is useful for pods, we should not implement it only for virtual machines. The point is that here we are talking about live migrations, and live migration is something that is not relevant for pods. So this is a virtual-machine-specific topic, and we should address it in KubeVirt.

In the proposal we also evaluated a few alternatives. One of them is something that you can already do today, without any change in the KubeVirt code. You can set a temporary node selector or node affinity on the virtual machine and wait for it to be propagated down to the virtual machine instance; this requires the virtual machine to be configured with the LiveUpdate rollout strategy, which is the tricky part. Only at that point can you trigger a live migration with the existing API, nothing special there, wait for the migration to complete, and then, only then, remove the additional constraints that you set on the virtual machine object. Why we don't like it, or at least why I don't like it: it's an imperative flow on a declarative platform. It still has to be orchestrated somehow, and that is completely up to the user. It can also interfere with the DevOps and infrastructure-as-code tools that are managing the virtual machines on your behalf.

Another possible option: in Kubernetes you can configure more than one scheduler, so you can add a second scheduler. We know that there are also load-aware scheduling plugins, so you can configure a second scheduler that is load-aware: it is going to take into consideration the actual resource consumption of virtual machines. But still, you have to configure the second scheduler, and then each individual virtual machine has to be configured to be scheduled by that second scheduler. And even if the scheduler is load-aware, so it knows the actual CPU consumption on the nodes, it is still going to schedule according to the static reservations that we set on our virtual machines. If you remember, by default we are requesting one-tenth of the allocated cores. It means that if you asked for four cores, the scheduler is not aware of that, and it is still going to account only for 0.4. This also affects only scheduling: it is not going to watch the actual consumption on your cluster over time and react by rebalancing the cluster. It's still up to you to monitor the cluster and eventually inject migration objects just to get the scheduler to do something.

Another option is to use the kube-descheduler for automatic workload rebalancing, eventually combining it with a load-aware scheduler. The descheduler, the opposite of the scheduler, is a tool that monitors your nodes and can decide to deschedule something. If the descheduler evicts a virtual machine, KubeVirt is going to react and live-migrate it automatically.
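As a sketch, a descheduler policy built around its LowNodeUtilization strategy could look like this; the thresholds are illustrative, and the exact shape of the load-aware options varies across descheduler versions.

    apiVersion: "descheduler/v1alpha2"
    kind: DeschedulerPolicy
    profiles:
    - name: rebalance
      pluginConfig:
      - name: LowNodeUtilization
        args:
          thresholds:          # nodes below this are underutilized...
            cpu: 20
            memory: 20
          targetThresholds:    # ...and nodes above this are overutilized
            cpu: 70
            memory: 70
      plugins:
        balance:
          enabled:
          - LowNodeUtilization
    # Recent descheduler releases can also base this decision on actual
    # utilization from the metrics server rather than on static requests.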
Since November, the descheduler is also load-aware; it's a new feature. It can deschedule virtual machines according to the real CPU consumption. It's a really good option, a really good idea to continuously rebalance your cluster. On the other side, this is just about deciding what to evict: it does not affect how the migration is going to complete, because when the descheduler triggers a live migration, it is then up to the scheduler to select the target node, and the scheduler is still only going to account for what it knows.

But this is an interesting approach, and we are continuously working on it. One more thing: we are trying to enhance it with pressure stall information. PSI is a metric that has been supported by the Linux kernel since version 4.20, so it's not even that new. It's reported at the node level and at the cgroup slice level. It's not a metric about CPU utilization: it is exactly measuring the actual productivity loss caused by the scarcity of resources, and we have it for memory, CPU, and I/O. The kernel measures the amount of time in which your cgroup slice is stuck because it's waiting for a CPU that is not available at that moment. We did some experiments, and the results are really convincing. Unfortunately, the PSI metrics are still not exported by cAdvisor, which is the component that reports these metrics to the kubelet. There is an open PR for that, refreshed just a week ago, so we are going to have it in the future. This is going to be a really interesting way to automatically balance the cluster.

So, we presented a few options. We have a design proposal, and the design proposal is still not accepted. We are a community, we have users, so please make your voice heard. If you think that you need this feature, or even another feature, if you think that you need something, please talk. I think that, as developers, we have a vision of the cluster and a vision of user needs. Maybe we are right, maybe we are wrong. We also want to get your feedback. Thank you.

Okay, first question. The question: there is a problem with this, because if I trigger a migration and my virtual machine moves from node A to node B, and later it goes back to node C, that is completely unexpected; so the constraint must be stored somewhere, saying it should stay on node B or wherever.

Okay. So he said that if we simply add the additional constraints to the VirtualMachineInstanceMigration object, they are not going to stay on the virtual machine. Yes, it's true, and it's absolutely expected: if you want a persistent change, please set it on the virtual machine object. But that's a problem again,
because if I don't have the permission to migrate, do I have the permission to cold-migrate instead? It seems I should be allowed to do both, or neither.

So: the constraints are set on an object, and if you are the owner of the virtual machine object, you are allowed to edit that object. It's up to your cluster admin to decide whether you are entitled to trigger a VirtualMachineInstanceMigration right now. If not, you can only specify where you want to have your virtual machine, and sooner or later it will happen; but it's not up to you to force it. We can talk about it later.

So, he's asking about the use of pod disruption budgets in KubeVirt. Yes, we are using them. We are using them to protect the virtual machine, to be sure that it's not going to be killed, and we are using a second PDB to protect the target pod of the live migration. So yes, we are using them. Next question.

So, the disk of the virtual machine is stored on the storage of the system or on external storage, depending on how you configure the virtual machine. The virtual machine could eventually be restarted automatically: if you declare that the virtual machine should be automatically restarted, you can configure that. We also have additional operators that keep monitoring the nodes to speed up the recovery process, if you need high availability for a virtual machine.

What if the node loses connectivity to the cluster, but not to the persistent storage? Then I would have two machines touching the same disk. That's why we have the additional operators; normally, we wait.

So, he's asking what is going to happen if the node that was hosting the virtual machine loses network connectivity but is still able to write to the disk: potentially, it could corrupt the virtual machine's disk. We have locking mechanisms, and we have additional operators that use fencing mechanisms to be sure that the node is really dead, if you need them. Normally we have a long timeout, to be on the safe side.

Thank you very much.

Thank you so much.