Hello, hello, everyone. We are starting the next talk, now with Tanya and Ivan. Ivan will tell us about quantization, and this is the talk you want to listen to if you want to learn about quantization, because Ivan is, like, the wild boar of quantization, or the father of quantization, or the witcher of quantization; he looks a little bit like a witcher, yeah. So I give it into your hands; just talk about quantization now.

Thank you. So, as we go into the quantization portion: we actually don't have a prepared talk, so this is a Q&A session. I'll be asking questions, but feel free to write your own questions through this QR code, for Ivan and for the other speakers as well; by the end of the talk we'll look at who submitted what and ask those too. But to start, for us it's an amazing opportunity to put a real face to a GitHub profile picture. Who has seen this GitHub profile picture? Yeah, basically everybody who has looked into any kind of quantization, right? So now this is the real face, but it's really hard to find your pictures on the internet.

Well, I'm the kind of guy who doesn't participate in social networks. You will not find me anywhere except this GitHub profile. I'm not on Twitter, I'm not on Reddit, I'm nowhere.

But before working on llama.cpp you came from a completely different industry. You are a physicist, right?

Yes, I'm a physicist. I graduated in high-energy physics many, many years ago. Then, the way my life evolved, I ended up working in medical physics, specifically doing research on radiation therapy of cancer, basically the physics tools related to that.

So how is that similar to the current quantization work, and what's different?

Well, when you do that sort of work, it's basically numerical methods of various kinds. In my previous field I was most famous for the Monte Carlo work that I have done, but I have done lots of other things. One of the problems you have to solve in radiation therapy is to optimize the treatment. When you are irradiating the patient, you come in from many different angles, and you can also modulate the fluence, the amount of radiation you are imparting, from each angle and from every position of your beam. So there is this large-scale optimization problem where you want to optimize the fluence from the different angles, and from the different positions at each angle, such that you achieve an optimal dose distribution in the patient. This problem is associated with the so-called system matrix; I don't know if anyone here knows about the system matrix. Many optimization problems have a system matrix.
Basically, the system matrix here is how much dose a small beamlet j imparts into a small voxel i in the patient. It's a giant thing, billions of elements. While working on that, at some point I realized that the calculation is 99 percent dominated by the matrix multiplication of the system matrix with your gradients relative to the dose, or of the same matrix with your changes of the fluence. And the best way to speed up this calculation is to actually quantize it. Now, it's a bit different from LLMs: there you're happy if you can make it work without numerical instabilities in 32-bit floats, and going lower than 16 bits would be pushing it too far, so it was a quantization from 32-bit floats to 16 bits. But basically what happens when you do this is that you realize you have a big bulk of the matrix which you can represent at 8 bits, and an extremely sparse part which really requires the full 16 bits. So that was one of my first encounters with quantization.

Having gone through this, one evening I was having beers with Georgi. I live in Sofia part of the time, and he also lives in Sofia; Georgi was the first member of the science team that I built in Sofia, the first person I hired, I don't know, maybe 12 or 15 years ago, I don't remember. We go out regularly for beers, and then he started talking about language models and llama.cpp and whisper.cpp and quantization, and that's how I actually got into this business of quantizing language models.

And since you got in, you have implemented a lot of different methods. Actually, if you try to track some of the older methods in llama.cpp, sometimes they are no longer even supported. So what was the history of the evolution of the quantization methods there? Did you just go lower in the number of bits, or what was your thinking in the evolution of the methods?

Well, when I started working on this, the only quantizations in llama.cpp were Q4_0 and Q4_1. At the time there was no GPU support, so your best option for actually running a perplexity calculation, to see how well your quantization works, was to use the BLAS back end. Doing that, we realized that the result with BLAS was very different, and much better, than the result without BLAS. The reason was that with BLAS you dequantized and did the multiplication in floats, while in the other path, this is now two years ago, llama.cpp was quantizing the activations to four bits as well, and that led to a significant precision loss.
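To make the block-wise idea behind those early quants concrete, here is a minimal sketch in the spirit of Q4_0. It is illustrative only, not llama.cpp's actual code, and the block size, struct name and scale convention are assumptions: a block of 32 floats is stored as one float scale plus 32 small integers, and the round-trip error of such blocks is the kind of loss the perplexity comparison above was exposing when the activations were quantized the same way.

```cpp
// Minimal sketch of Q4_0-style block quantization (illustrative, not llama.cpp's code):
// a block of 32 floats becomes one float scale plus 32 signed 4-bit values in [-8, 7].
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

struct BlockQ4 {
    float  scale;
    int8_t q[32];   // a real implementation packs two 4-bit values per byte
};

BlockQ4 quantize_block(const float *x) {
    // Find the element with the largest magnitude and map it to -8.
    float amax = 0.0f, max = 0.0f;
    for (int i = 0; i < 32; ++i) {
        if (std::fabs(x[i]) > amax) { amax = std::fabs(x[i]); max = x[i]; }
    }
    BlockQ4 b;
    b.scale = max / -8.0f;
    const float inv = b.scale != 0.0f ? 1.0f / b.scale : 0.0f;
    for (int i = 0; i < 32; ++i) {
        b.q[i] = (int8_t)std::clamp((int)std::lround(x[i] * inv), -8, 7);
    }
    return b;
}

int main() {
    std::vector<float> x(32);
    for (int i = 0; i < 32; ++i) x[i] = std::sin(0.37f * i);  // stand-in weights
    const BlockQ4 b = quantize_block(x.data());
    double err = 0.0;
    for (int i = 0; i < 32; ++i) {
        const double d = (double)b.scale * b.q[i] - x[i];
        err += d * d;
    }
    std::printf("rms round-trip error of the 4-bit block: %g\n", std::sqrt(err / 32));
    return 0;
}
```

With the scale stored in 16 bits and two 4-bit values packed per byte, this comes out to roughly 4.5 bits per weight, which is where the bits-per-weight figures mentioned later come from.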
So that initiated the addition of higher-bit quants in llama.cpp: Georgi added Q8_0 and Q5_0, and then I did the k-quants. The k-quants haven't changed that much since; maybe I have done a little bit of tweaking here and there on the quantization algorithm, but not too much. Of course, a lot of this is driven by the hype waves of the internet. Users start coming and asking, "Ah, I saw this, can I have a two-bit quantization, can I have a one-bit quantization?" So you respond to these people. The initial k-quants had a two-bit variant, but it wasn't a very good two-bit quantization, so later the i-quants came along, which are much better at low bits per weight.

Since we are talking about going below two-bit quantization: there is a bunch of research these days, and everybody talks about which layers to quantize to how many bits and how to do that kind of quantization, but there is also new research on quantization-aware training, so that you don't lose as much quality when you quantize to two bits or below. So what's your take on it? Are you interested, are you trying to implement methods like that, or is everything on the training side separate for you?

No. I have never seriously thought about getting involved in training, because I'm just a lonely guy sitting in a corner and hacking; I don't have the resources to train serious models. But my personal opinion is that people designing and training models should go to integer models and train directly in integers. Microsoft has demonstrated that you can train ternary models, and that, to me, indicates for sure that it will be possible to do it with 4-bit integers.

Coming to this question of floating point versus integers: there is a lot of debate, especially on the hardware side, about whether you actually need floating point on your chip, especially on accelerator chips. So what's your take? If you just train on integers and you just run inference on integers, do you still need floats at all?

I think, but I may be biased, I'm old school, that integers are much nicer than floats. Back in the day, some of the computers I worked on didn't even have a floating-point unit, so you had to do software emulation, and any calculation that you could convert to integer math, you did, just for the speed. It's different today, but even today integer math has some very nice properties: your calculation is 100 percent reproducible. It doesn't matter if you run it on this CPU or on that CPU or on that GPU, you always get the same answer. That is not true for floats, which makes it kind of hard to compare your results from one architecture to another.
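A tiny, self-contained toy, not code from llama.cpp, that illustrates the reproducibility point: summing the same values in a different order will generally change a float result, while an integer accumulation is bit-exact regardless of order or hardware.

```cpp
// Toy demonstration: float sums depend on evaluation order, integer sums do not.
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    std::mt19937 rng(42);
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);

    std::vector<float>  xf(1 << 20);
    std::vector<int8_t> xi(xf.size());
    for (size_t i = 0; i < xf.size(); ++i) {
        xf[i] = dist(rng);
        xi[i] = (int8_t)std::lround(127.0f * xf[i]);  // crude 8-bit quantization
    }

    float   f_fwd = 0.0f, f_rev = 0.0f;
    int64_t i_fwd = 0,    i_rev = 0;
    for (size_t i = 0; i < xf.size(); ++i) { f_fwd += xf[i]; i_fwd += xi[i]; }
    for (size_t i = xf.size(); i-- > 0; )  { f_rev += xf[i]; i_rev += xi[i]; }

    std::printf("float : forward %.9g, reverse %.9g (usually differ)\n", f_fwd, f_rev);
    std::printf("int   : forward %lld, reverse %lld (always identical)\n",
                (long long)i_fwd, (long long)i_rev);
    return 0;
}
```

This is also why a quantized dot product accumulated in 32-bit integers gives identical answers on every backend, whereas float accumulation can legitimately differ between CPU, CUDA and Metal builds.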
So I think, yes, going the full Monty to integers would be really nice.

One of our engineers who has looked a lot at your work said that it is a really interesting combination of engineering optimization tricks and numerical design. Can you comment a bit on your approach to quantization? Is it trial and error, does it come from some mathematical paper or mathematical work? How do you approach it, and what is the combination of engineering tricks and numerical design?

I had a friend at the University of Sofia who used to call this "numerology", and it is numerology. Basically, when you do quantization you are facing a mixed-integer optimization problem, which is notoriously hard to solve. Now, it's not impossible to solve, especially if you do block-wise quantization as is typically done in llama.cpp, so it's doable, and in fact in the very early work I had done an exact solution of this mixed-integer quantization problem to determine the quants. The issue is that you don't really know what you actually want to minimize. When you do optimization you always need a cost function, as we called it back in the day; now people call it a loss. Okay, so you need a loss, and it isn't immediately clear what the loss is that you want to minimize when you are converting floating-point values to low-bit integers. Typically you use root-mean-square error, because, lacking any other insight, it is something that is nice to work with. Back then, when I started working on it, it turned out that finding the exact solution of this minimization problem led to a worse result than doing an empirical minimization that stops half way. Now, maybe today, with the imatrix, which gives a little bit of an indication of which weights are more important than others, it would probably be a bit more stable than it was back then without the imatrix. But this empirical way of trying a number of steps, and trying modifications to the imatrix weights, to improve your quantization results, works pretty well. Some of the quantization methods do solve the minimization problem exactly: for instance, the one-bit IQ1_S does in fact use an exact solution of the minimization; it's very easy when you only have three possible states for your quants.

Yep, and that is actually, I think, one of the reasons people are interested in it, especially on the hardware side, because hardware for that is just way easier to build.

For ternary models, yes, yes.
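To give the empirical, importance-weighted minimization described above a concrete shape, here is a hypothetical sketch: for a single block, a handful of candidate scales around the naive choice are tried, and the one with the smallest weighted squared error wins. The weights stand in for imatrix importances, and the function name, the number of trial steps and the scan range are invented for illustration.

```cpp
// Hypothetical sketch of an empirical, importance-weighted scale search for one
// block of weights. The importances w[i] play the role of imatrix weights; the
// candidate range and number of trials are arbitrary illustration choices.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

float best_scale(const std::vector<float> &x, const std::vector<float> &w,
                 int nmax /* 7 for a signed 4-bit grid */, int ntrials = 20) {
    float amax = 0.0f;
    for (float v : x) amax = std::max(amax, std::fabs(v));
    if (amax == 0.0f) return 0.0f;

    float best = amax / nmax;
    float best_err = HUGE_VALF;
    for (int t = 0; t < ntrials; ++t) {
        // scan scales in a band around the naive amax/nmax choice
        const float d = amax / nmax * (0.7f + 0.6f * t / (ntrials - 1));
        float err = 0.0f;
        for (size_t i = 0; i < x.size(); ++i) {
            const int   q = std::clamp((int)std::lround(x[i] / d), -nmax - 1, nmax);
            const float e = x[i] - d * q;
            err += w[i] * e * e;  // importance-weighted squared error
        }
        if (err < best_err) { best_err = err; best = d; }
    }
    return best;
}

int main() {
    std::vector<float> x = {0.9f, -0.1f, 0.05f, -0.7f, 0.3f, 0.2f, -0.25f, 0.6f};
    std::vector<float> w(x.size(), 1.0f);
    w[0] = 10.0f;  // pretend the first weight is much more important
    std::printf("chosen scale: %g\n", best_scale(x, w, 7));
    return 0;
}
```

An exact mixed-integer solution would jointly re-optimize the integer assignments together with the scale; as noted above, doing that against a plain RMSE objective does not necessarily produce a better model.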
So what is the latest quantization technique, or what have you worked on recently, and where can we find it? Is it your own llama.cpp fork that you contribute to these days? If somebody wants to join in or track your work, what is the place?

Yeah, I have my own llama.cpp fork where I'm quietly hacking away in my corner, the best-kept secret on the internet. There are quite a few things that I have done in that repository, and some of the work is new quantization types. I call them IQK quants, which is the combination of IQ and K, and those, especially the four-, five- and six-bit ones, are quite a bit better than what is in llama.cpp. The lower-bit ones are not really better than IQ1_S, but they are more efficient on the CPU, and I also did those out of curiosity: IQ1_S uses this kind of group approach where, instead of converting individual model weights into quantized values, you take groups of weights and try to put them onto a given grid. For me it was a curiosity to see whether I could achieve the same quantization quality without actually doing the grid thing, and yes, it turns out it is possible. The most recent thing related to quantization: some time ago there was the QTIP paper, I don't know how many of you have heard of it, where they use trellises to basically generate a series of quantized values. I did this out of curiosity; in my repo it sits on a branch and is not merged into my main development branch, because it only has CUDA inference and doesn't have inference for the other platforms that are supported there.

Speaking about different backends and CUDA: are you interested in writing specialized compute kernels to get better levels of optimization, or is that a level you don't go to, specialized kernels for different hardware backends? Even for NVIDIA you can load kernels without CUDA; there is a nice hack that tinygrad came up with, or maybe somebody else, but I saw it in tinygrad first, so that you can actually load kernels without even...

Well, I'm not a CUDA person. In fact, the kernels that I wrote for llama.cpp were the very first CUDA code of my life, so I'm not an expert there and cannot really comment on that. On the CPU it's quite interesting to optimize: it turns out you can speed up the llama.cpp quantized implementation on the CPU by up to a factor of 7 for some quantization types. Some of this went into llamafile, but in the meantime I have made more progress on the CPU side. I find the CPU kind of more cool. I'm older than almost everybody in this room; I remember living through various hardware types. There was a mention of transputers in the previous talk; yes, I went through porting high-energy physics code to transputers in the 90s.
And then those transputers died away, while the CPU will stay around, and it is much nicer to actually work with. So I'm looking forward to hardware vendors, the CPU manufacturers, finally giving their CPUs more memory bandwidth, and then they will not be so bad to work with, especially for very large models which you cannot run on a consumer-level GPU.

And from what's currently available off the shelf, what is your choice of hardware?

What's my choice? I haven't updated my hardware for a while, so I'm working on a Ryzen 7950X, a Zen 4 16-core. I also have a slightly older one from the previous Zen generation, a 5975. I have an RTX 4080 to test the CUDA code, and then I have an M2 Max to test the ARM NEON implementation and the Metal implementation.

A lot of interest now goes in the direction of different architectures, like mixture-of-experts architectures; they are all still transformer-based and can be run with llama.cpp, but if we talk about quantization techniques, is there any advice on what the go-to method is these days, or should you try different ones and compare? What is your recommendation for people who are just starting to choose?

So, different tensors in an LLM have different importance: the quantization of different tensors has a different impact on the accuracy loss that you get from quantization. Some tensors are more important than others. In the standard transformer attention mechanism, the V tensor is the most important one, then comes the output tensor of the attention layer, and then comes the FFN down tensor in the feed-forward part of the layer; these are the most important ones. Now, in more recent models the importance of FFN down appears to have gone down. Then we have the DeepSeek R1 model, the whale in the room, and their attention mechanism is a little bit different, and I have to admit that I haven't actually studied in detail how these new tensors affect the quantization error.

I think they still use basically your quantization methods, just applied differently to different parts.

Yeah. It's interesting, the Unsloth thing that made this huge hype around the internet by quantizing DeepSeek R1 to IQ1_S. I actually went yesterday to check my original PR in llama.cpp, and yes, quantizing the attention tensors to four bits is in that PR. The reason it doesn't work in llama.cpp when you are using a mixture of experts, and Unsloth needed to come along, is that the tensor names are slightly different, so the heuristic that detects which tensors to assign four bits to just didn't work. But it's basically a three-line code change to do what Unsloth did.
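As a hypothetical illustration of such a name-based heuristic (not the actual llama.cpp logic; the tensor names, bit widths and matching rule here are invented), note how a mixture-of-experts naming variant can silently fall through to the aggressive default, which is the kind of mismatch being described.

```cpp
// Illustrative name-based mixed quantization (not llama.cpp's real heuristic).
// Sensitive tensors get more bits; note how a MoE naming variant falls through.
#include <cstdio>
#include <string>
#include <vector>

enum class QType { Q1, Q4, Q6 };

static const char *qname(QType t) {
    switch (t) {
        case QType::Q1: return "1-bit";
        case QType::Q4: return "4-bit";
        case QType::Q6: return "6-bit";
    }
    return "?";
}

QType pick_qtype(const std::string &name) {
    auto ends_with = [&](const std::string &s) {
        return name.size() >= s.size() &&
               name.compare(name.size() - s.size(), s.size(), s) == 0;
    };
    if (ends_with("attn_v.weight") || ends_with("attn_output.weight")) return QType::Q6;
    if (ends_with("ffn_down.weight"))                                  return QType::Q4;
    return QType::Q1;  // everything else goes to the aggressive default
}

int main() {
    const std::vector<std::string> tensors = {
        "blk.0.attn_q.weight",
        "blk.0.attn_v.weight",
        "blk.0.ffn_down.weight",
        "blk.0.ffn_down_exps.weight",  // MoE variant: not matched, drops to 1-bit
    };
    for (const auto &t : tensors) {
        std::printf("%-28s -> %s\n", t.c_str(), qname(pick_qtype(t)));
    }
    return 0;
}
```

Adding the expert-tensor names to the match, a change of a few lines, is the sort of fix being referred to.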
That's funny. So, are you interested in any other machine learning frameworks besides forks of llama.cpp? I know you contributed a bit to llamafile; by the way, Justine says hi, she was not able to come. Are there any others that you at least track, maybe without contributing, or is it just llama.cpp and the C++ world?

I work in my little fork. I occasionally look around, but no, there is nothing in particular. I always check, out of curiosity, what issues people are entering in llamafile, but I don't track anything specifically very closely.

What about papers? Is there any collective or research that you like, that you are interested in? I saw some papers about the importance of different layers in models, or, more on the mathematical side, there is the quantization-aware training research we mentioned. Any collective that you track, that you can recommend people keep an eye on, like the HQQ collection, which I think we will also discuss at some point?

I'll tell you a little anecdote from my life that will hopefully explain my approach to papers. Many years ago, I was much younger back then, I was invited as a consultant to the International Atomic Energy Agency, and everybody else in the group were luminaries in the field, much older and much more accomplished than me. At some point we were discussing some topic and the discussion became quite heated, and then somebody said to me, "But didn't you read such-and-such paper?" And I answered, "I don't read papers." There was silence in the room, and then everybody started laughing. This has always been my approach to papers. Back when I was a researcher I would go to a conference and learn in one or two days what people were doing, instead of sitting for hours looking at the papers that had been published in the journals in the last month. Now, being semi-retired, I don't actually need to go and study papers; mostly, when I go and look at something, it is because somebody I know told me about it.

And you also don't write papers.

I also don't write papers, yes. I prefer not to.

That's why we're having this session, by the way. Let's actually go to the open questions; I'm going to check the website, but if somebody wants to send questions, this is your last chance, or you can use Matrix as well. Okay, some questions. Do all execution backends in llama.cpp / GGML, beyond the CPU backend, support all quantizations, and is the performance similar?

I'm not tracking this very closely, but my memory is that, if I remember correctly, all of the quantization types were supported; I wrote kernels for them in Metal and in CUDA. It is possible that CUDA support was not quite 100 percent, so I don't really know if that is still the case.

We saw from Llama 2 to Llama 3 that the ability to quantize the model was impacted by massive overtraining.
We have also seen the bit-scaling laws from Christopher Raston. Do you think quantization will still be used as models become more overtrained?

I think yes. This is a real effect. I don't quite follow the term "overtrained": the models have been trained on more tokens, and it seems to me the model weights actually contain more information than the model weights of previous-generation models, and that's why they are more difficult to quantize. Or rather, they are not more difficult to quantize, but the quantization error is higher when you quantize them using the same number of bits. This was the main motivation for developing those four-, five- and six-bit IQK quants, because they quantize Llama 3 much better than what is in mainline llama.cpp.

The other question asks whether you think dynamic quantization approaches will be integrated into llama.cpp, since llama.cpp has a more rigid quantization strategy, and what challenges there might be in doing so.

The person who asked this question needs to explain it to me, because, as I mentioned earlier, I don't hang around on Twitter and Reddit, so I don't know what people mean by dynamic quantization. If dynamic quantization means assigning different bits to different layers, it has basically always been like that in llama.cpp, since the beginning. In fact, if you go to llama.cpp and search for issues entered by ikawrakow, you will see an issue that I entered somewhere around March or April of 2023 talking about variable-bit quantization. There were, of course, professors coming in to teach me that LLM quantization has nothing to do with video compression and that therefore what I was talking about is nonsense, but yeah, this has been in llama.cpp forever.

Thank you so much. We are actually out of time, and I haven't been able to read all the questions, but feel free to ask Ivan in the hallway track, at the dinner, or tomorrow at the event. So thank you very much, Ivan. On time, yeah.