Hello, hello, everyone. We are starting the next talk, now with Tanya and Ivan. Ivan will tell us about quantization, and this is the talk you want to listen to if you want to learn about quantization, because Ivan is, like, the wild boar of quantization, or the father of quantization, or the witcher of quantization; he looks a little bit like a witcher, yeah. So I give it into your hands; just talk about quantization now.

Thank you. So, as we go into the quantization portion: we actually don't have a prepared talk, so this is a Q&A session. I'll be asking questions, but feel free to write your own questions through this QR code, for Ivan and for the other speakers as well; by the end of the talk we'll look at who submitted what and ask those too. But to start, for us it's an amazing opportunity to put a real face to a GitHub profile picture. Who has seen this GitHub profile picture? Yeah, basically everybody who has looked into any kind of quantization, right? So now this is the real face, but it's really hard to find your pictures on the internet.

Well, I'm the kind of guy who doesn't participate in social networks. You will not find me anywhere except this GitHub profile. I'm not on Twitter, I'm not on Reddit, I'm nowhere.

But before working on llama.cpp you came from a completely different industry. You are a physicist, right?

Yes, I'm a physicist. I graduated in high-energy physics many, many years ago. Then, the way my life evolved, I ended up working in medical physics, specifically doing research on radiation therapy of cancer, basically the physics tools related to that.

So how is that similar to the current quantization work, and what's different?

Well, when you do that sort of work, it's basically numerical methods of various kinds. In my previous field I was most famous for the Monte Carlo work that I have done, but I have done lots of other things. One of the problems you have to solve in radiation therapy is to optimize the treatment. When you are irradiating the patient, you come in from many different angles, and you can also modulate the fluence, the amount of radiation you are imparting, from each angle and from every position of your beam. So there is this large-scale optimization problem where you want to optimize the fluence from the different angles, and from the different positions at each angle, such that you achieve an optimal dose distribution in the patient. This problem is associated with the so-called system matrix; I don't know if anyone here knows about the system matrix. Many optimization problems have a system matrix.
Basically, the system matrix here is how much dose a small beamlet j imparts into a small voxel i in the patient. It's a giant thing, billions of elements. While working on that, at some point I realized that the calculation is 99 percent dominated by the matrix multiplication of the system matrix with your gradients relative to the dose, or of the same matrix with your changes of the fluence. And the best way to speed up this calculation is to actually quantize it. Now, it's a bit different from LLMs: there you're happy if you can make it work without numerical instabilities in 32-bit floats, and going lower than 16 bits would be pushing it too far, so it was a quantization from 32-bit floats to 16 bits. But basically what happens when you do this is that you realize you have a big bulk of the matrix which you can represent at 8 bits, and an extremely sparse part which really requires the full 16 bits. So that was one of my first encounters with quantization.

Having gone through this, one evening I was having beers with Georgi. I live in Sofia part of the time, and he also lives in Sofia; Georgi was the first member of the science team that I built in Sofia, the first person I hired, I don't know, maybe 12 or 15 years ago, I don't remember. We go out regularly for beers, and then he started talking about language models and llama.cpp and whisper.cpp and quantization, and that's how I actually got into this business of quantizing language models.

And since you got in, you have implemented a lot of different methods. Actually, if you try to track some of the older methods in llama.cpp, sometimes they are no longer even supported. So what was the history of the evolution of the quantization methods there? Did you just go lower in the number of bits, or what was your thinking in the evolution of the methods?

Well, when I started working on this, the only quantizations in llama.cpp were Q4_0 and Q4_1. At the time there was no GPU support, so your best option for actually running a perplexity calculation, to see how well your quantization works, was to use the BLAS back end. Doing that, we realized that the result with BLAS was very different, and much better, than the result without BLAS. The reason was that with BLAS you dequantized and did the multiplication in floats, while in the other path, this is now two years ago, llama.cpp was quantizing the activations to four bits as well, and that led to a significant precision loss.
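To make the block-wise idea behind those early quants concrete, here is a minimal sketch in the spirit of Q4_0. It is illustrative only, not llama.cpp's actual code, and the block size, struct name and scale convention are assumptions: a block of 32 floats is stored as one float scale plus 32 small integers, and the round-trip error of such blocks is the kind of loss the perplexity comparison above was exposing when the activations were quantized the same way.

```cpp
// Minimal sketch of Q4_0-style block quantization (illustrative, not llama.cpp's code):
// a block of 32 floats becomes one float scale plus 32 signed 4-bit values in [-8, 7].
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

struct BlockQ4 {
    float  scale;
    int8_t q[32];   // a real implementation packs two 4-bit values per byte
};

BlockQ4 quantize_block(const float *x) {
    // Find the element with the largest magnitude and map it to -8.
    float amax = 0.0f, max = 0.0f;
    for (int i = 0; i < 32; ++i) {
        if (std::fabs(x[i]) > amax) { amax = std::fabs(x[i]); max = x[i]; }
    }
    BlockQ4 b;
    b.scale = max / -8.0f;
    const float inv = b.scale != 0.0f ? 1.0f / b.scale : 0.0f;
    for (int i = 0; i < 32; ++i) {
        b.q[i] = (int8_t)std::clamp((int)std::lround(x[i] * inv), -8, 7);
    }
    return b;
}

int main() {
    std::vector<float> x(32);
    for (int i = 0; i < 32; ++i) x[i] = std::sin(0.37f * i);  // stand-in weights
    const BlockQ4 b = quantize_block(x.data());
    double err = 0.0;
    for (int i = 0; i < 32; ++i) {
        const double d = (double)b.scale * b.q[i] - x[i];
        err += d * d;
    }
    std::printf("rms round-trip error of the 4-bit block: %g\n", std::sqrt(err / 32));
    return 0;
}
```

With the scale stored in 16 bits and two 4-bit values packed per byte, this comes out to roughly 4.5 bits per weight, which is where the bits-per-weight figures mentioned later come from.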
So that initiated the addition of higher-bit quants in llama.cpp: Georgi added Q8_0 and Q5_0, and then I did the k-quants. The k-quants haven't changed that much since; maybe I have done a little bit of tweaking here and there on the quantization algorithm, but not too much. Of course, a lot of this is driven by the hype waves of the internet. Users start coming and asking, "Ah, I saw this, can I have a two-bit quantization, can I have a one-bit quantization?" So you respond to these people. The initial k-quants had a two-bit variant, but it wasn't a very good two-bit quantization, so later the i-quants came along, which are much better at low bits per weight.

Since we are talking about going below two-bit quantization: there is a bunch of research these days, and everybody talks about which layers to quantize to how many bits and how to do that kind of quantization, but there is also new research on quantization-aware training, so that you don't lose as much quality when you quantize to two bits or below. So what's your take on it? Are you interested, are you trying to implement methods like that, or is everything on the training side separate for you?

No. I have never seriously thought about getting involved in training, because I'm just a lonely guy sitting in a corner and hacking; I don't have the resources to train serious models. But my personal opinion is that people designing and training models should go to integer models and train directly in integers. Microsoft has demonstrated that you can train ternary models, and that, to me, indicates for sure that it will be possible to do it with 4-bit integers.

Coming to this question of floating point versus integers: there is a lot of debate, especially on the hardware side, about whether you actually need floating point on your chip, especially on accelerator chips. So what's your take? If you just train on integers and you just run inference on integers, do you still need floats at all?

I think, but I may be biased, I'm old school, that integers are much nicer than floats. Back in the day, some of the computers I worked on didn't even have a floating-point unit, so you had to do software emulation, and any calculation that you could convert to integer math, you did, just for the speed. It's different today, but even today integer math has some very nice properties: your calculation is 100 percent reproducible. It doesn't matter if you run it on this CPU or on that CPU or on that GPU, you always get the same answer. That is not true for floats, which makes it kind of hard to compare your results from one architecture to another.
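A tiny, self-contained toy, not code from llama.cpp, that illustrates the reproducibility point: summing the same values in a different order will generally change a float result, while an integer accumulation is bit-exact regardless of order or hardware.

```cpp
// Toy demonstration: float sums depend on evaluation order, integer sums do not.
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    std::mt19937 rng(42);
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);

    std::vector<float>  xf(1 << 20);
    std::vector<int8_t> xi(xf.size());
    for (size_t i = 0; i < xf.size(); ++i) {
        xf[i] = dist(rng);
        xi[i] = (int8_t)std::lround(127.0f * xf[i]);  // crude 8-bit quantization
    }

    float   f_fwd = 0.0f, f_rev = 0.0f;
    int64_t i_fwd = 0,    i_rev = 0;
    for (size_t i = 0; i < xf.size(); ++i) { f_fwd += xf[i]; i_fwd += xi[i]; }
    for (size_t i = xf.size(); i-- > 0; )  { f_rev += xf[i]; i_rev += xi[i]; }

    std::printf("float : forward %.9g, reverse %.9g (usually differ)\n", f_fwd, f_rev);
    std::printf("int   : forward %lld, reverse %lld (always identical)\n",
                (long long)i_fwd, (long long)i_rev);
    return 0;
}
```

This is also why a quantized dot product accumulated in 32-bit integers gives identical answers on every backend, whereas float accumulation can legitimately differ between CPU, CUDA and Metal builds.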
So I think, yes, going the full Monty to integers would be really nice.

One of our engineers who has looked a lot at your work said that it is a really interesting combination of engineering optimization tricks and numerical design. Can you comment a bit on your approach to quantization? Is it trial and error, does it come from some mathematical paper or mathematical work? How do you approach it, and what is the combination of engineering tricks and numerical design?

I had a friend at the University of Sofia who used to call this "numerology", and it is numerology. Basically, when you do quantization you are facing a mixed-integer optimization problem, which is notoriously hard to solve. Now, it's not impossible to solve, especially if you do block-wise quantization as is typically done in llama.cpp, so it's doable, and in fact in the very early work I had done an exact solution of this mixed-integer quantization problem to determine the quants. The issue is that you don't really know what you actually want to minimize. When you do optimization you always need a cost function, as we called it back in the day; now people call it a loss. Okay, so you need a loss, and it isn't immediately clear what the loss is that you want to minimize when you are converting floating-point values to low-bit integers. Typically you use root-mean-square error, because, lacking any other insight, it is something that is nice to work with. Back then, when I started working on it, it turned out that finding the exact solution of this minimization problem led to a worse result than doing an empirical minimization that stops half way. Now, maybe today, with the imatrix, which gives a little bit of an indication of which weights are more important than others, it would probably be a bit more stable than it was back then without the imatrix. But this empirical way of trying a number of steps, and trying modifications to the imatrix weights, to improve your quantization results, works pretty well. Some of the quantization methods do solve the minimization problem exactly: for instance, the one-bit IQ1_S does in fact use an exact solution of the minimization; it's very easy when you only have three possible states for your quants.

Yep, and that is actually, I think, one of the reasons people are interested in it, especially on the hardware side, because hardware for that is just way easier to build.

For ternary models, yes, yes.
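To give the empirical, importance-weighted minimization described above a concrete shape, here is a hypothetical sketch: for a single block, a handful of candidate scales around the naive choice are tried, and the one with the smallest weighted squared error wins. The weights stand in for imatrix importances, and the function name, the number of trial steps and the scan range are invented for illustration.

```cpp
// Hypothetical sketch of an empirical, importance-weighted scale search for one
// block of weights. The importances w[i] play the role of imatrix weights; the
// candidate range and number of trials are arbitrary illustration choices.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

float best_scale(const std::vector<float> &x, const std::vector<float> &w,
                 int nmax /* 7 for a signed 4-bit grid */, int ntrials = 20) {
    float amax = 0.0f;
    for (float v : x) amax = std::max(amax, std::fabs(v));
    if (amax == 0.0f) return 0.0f;

    float best = amax / nmax;
    float best_err = HUGE_VALF;
    for (int t = 0; t < ntrials; ++t) {
        // scan scales in a band around the naive amax/nmax choice
        const float d = amax / nmax * (0.7f + 0.6f * t / (ntrials - 1));
        float err = 0.0f;
        for (size_t i = 0; i < x.size(); ++i) {
            const int   q = std::clamp((int)std::lround(x[i] / d), -nmax - 1, nmax);
            const float e = x[i] - d * q;
            err += w[i] * e * e;  // importance-weighted squared error
        }
        if (err < best_err) { best_err = err; best = d; }
    }
    return best;
}

int main() {
    std::vector<float> x = {0.9f, -0.1f, 0.05f, -0.7f, 0.3f, 0.2f, -0.25f, 0.6f};
    std::vector<float> w(x.size(), 1.0f);
    w[0] = 10.0f;  // pretend the first weight is much more important
    std::printf("chosen scale: %g\n", best_scale(x, w, 7));
    return 0;
}
```

An exact mixed-integer solution would jointly re-optimize the integer assignments together with the scale; as noted above, doing that against a plain RMSE objective does not necessarily produce a better model.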
So what is the latest quantization technique, or what have you worked on recently, and where can we find it? Is it your own llama.cpp fork that you contribute to these days? If somebody wants to join in or track your work, what is the place?

Yeah, I have my own llama.cpp fork where I'm quietly hacking away in my corner, the best-kept secret on the internet. There are quite a few things that I have done in that repository, and some of the work is new quantization types. I call them IQK quants, which is the combination of IQ and K, and those, especially the four-, five- and six-bit ones, are quite a bit better than what is in llama.cpp. The lower-bit ones are not really better than IQ1_S, but they are more efficient on the CPU, and I also did those out of curiosity: IQ1_S uses this kind of group approach where, instead of converting individual model weights into quantized values, you take groups of weights and try to put them onto a given grid. For me it was a curiosity to see whether I could achieve the same quantization quality without actually doing the grid thing, and yes, it turns out it is possible. The most recent thing related to quantization: some time ago there was the QTIP paper, I don't know how many of you have heard of it, where they use trellises to basically generate a series of quantized values. I did this out of curiosity; in my repo it sits on a branch and is not merged into my main development branch, because it only has CUDA inference and doesn't have inference for the other platforms that are supported there.

Speaking about different backends and CUDA: are you interested in writing specialized compute kernels to get better levels of optimization, or is that a level you don't go to, specialized kernels for different hardware backends? Even for NVIDIA you can load kernels without CUDA; there is a nice hack that tinygrad came up with, or maybe somebody else, but I saw it in tinygrad first, so that you can actually load kernels without even...

Well, I'm not a CUDA person. In fact, the kernels that I wrote for llama.cpp were the very first CUDA code of my life, so I'm not an expert there and cannot really comment on that. On the CPU it's quite interesting to optimize: it turns out you can speed up the llama.cpp quantized implementation on the CPU by up to a factor of 7 for some quantization types. Some of this went into llamafile, but in the meantime I have made more progress on the CPU side. I find the CPU kind of more cool. I'm older than almost everybody in this room; I remember living through various hardware types. There was a mention of transputers in the previous talk; yes, I went through porting high-energy physics code to transputers in the 90s.
And then those transputers died away, while the CPU will stay around, and it is much nicer to actually work with. So I'm looking forward to hardware vendors, the CPU manufacturers, finally giving their CPUs more memory bandwidth, and then they will not be so bad to work with, especially for very large models which you cannot run on a consumer-level GPU.

And from what's currently available off the shelf, what is your choice of hardware?

What's my choice? I haven't updated my hardware for a while, so I'm working on a Ryzen 7950X, a Zen 4 16-core. I also have a slightly older one from the previous Zen generation, a 5975. I have an RTX 4080 to test the CUDA code, and then I have an M2 Max to test the ARM NEON implementation and the Metal implementation.

A lot of interest now goes in the direction of different architectures, like mixture-of-experts architectures; they are all still transformer-based and can be run with llama.cpp, but if we talk about quantization techniques, is there any advice on what the go-to method is these days, or should you try different ones and compare? What is your recommendation for people who are just starting to choose?

So, different tensors in an LLM have different importance: the quantization of different tensors has a different impact on the accuracy loss that you get from quantization. Some tensors are more important than others. In the standard transformer attention mechanism, the V tensor is the most important one, then comes the output tensor of the attention layer, and then comes the FFN down tensor in the feed-forward part of the layer; these are the most important ones. Now, in more recent models the importance of FFN down appears to have gone down. Then we have the DeepSeek R1 model, the whale in the room, and their attention mechanism is a little bit different, and I have to admit that I haven't actually studied in detail how these new tensors affect the quantization error.

I think they still use basically your quantization methods, just applied differently to different parts.

Yeah. It's interesting, the Unsloth thing that made this huge hype around the internet by quantizing DeepSeek R1 to IQ1_S. I actually went yesterday to check my original PR in llama.cpp, and yes, quantizing the attention tensors to four bits is in that PR. The reason it doesn't work in llama.cpp when you are using a mixture of experts, and Unsloth needed to come along, is that the tensor names are slightly different, so the heuristic that detects which tensors to assign four bits to just didn't work. But it's basically a three-line code change to do what Unsloth did.
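As a hypothetical illustration of such a name-based heuristic (not the actual llama.cpp logic; the tensor names, bit widths and matching rule here are invented), note how a mixture-of-experts naming variant can silently fall through to the aggressive default, which is the kind of mismatch being described.

```cpp
// Illustrative name-based mixed quantization (not llama.cpp's real heuristic).
// Sensitive tensors get more bits; note how a MoE naming variant falls through.
#include <cstdio>
#include <string>
#include <vector>

enum class QType { Q1, Q4, Q6 };

static const char *qname(QType t) {
    switch (t) {
        case QType::Q1: return "1-bit";
        case QType::Q4: return "4-bit";
        case QType::Q6: return "6-bit";
    }
    return "?";
}

QType pick_qtype(const std::string &name) {
    auto ends_with = [&](const std::string &s) {
        return name.size() >= s.size() &&
               name.compare(name.size() - s.size(), s.size(), s) == 0;
    };
    if (ends_with("attn_v.weight") || ends_with("attn_output.weight")) return QType::Q6;
    if (ends_with("ffn_down.weight"))                                  return QType::Q4;
    return QType::Q1;  // everything else goes to the aggressive default
}

int main() {
    const std::vector<std::string> tensors = {
        "blk.0.attn_q.weight",
        "blk.0.attn_v.weight",
        "blk.0.ffn_down.weight",
        "blk.0.ffn_down_exps.weight",  // MoE variant: not matched, drops to 1-bit
    };
    for (const auto &t : tensors) {
        std::printf("%-28s -> %s\n", t.c_str(), qname(pick_qtype(t)));
    }
    return 0;
}
```

Adding the expert-tensor names to the match, a change of a few lines, is the sort of fix being referred to.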
That's funny. So, are you interested in any other machine learning frameworks besides forks of llama.cpp? I know you contributed a bit to llamafile; by the way, Justine says hi, she was not able to come. Are there any others that you at least track, maybe without contributing, or is it just llama.cpp and the C++ world?

I work in my little fork. I occasionally look around, but no, there is nothing in particular. I always check, out of curiosity, what issues people are entering in llamafile, but I don't track anything specifically very closely.

What about papers? Is there any collective or research that you like, that you are interested in? I saw some papers about the importance of different layers in models, or, more on the mathematical side, there is the quantization-aware training research we mentioned. Any collective that you track, that you can recommend people keep an eye on, like the HQQ collection, which I think we will also discuss at some point?

I'll tell you a little anecdote from my life that will hopefully explain my approach to papers. Many years ago, I was much younger back then, I was invited as a consultant to the International Atomic Energy Agency, and everybody else in the group were luminaries in the field, much older and much more accomplished than me. At some point we were discussing some topic and the discussion became quite heated, and then somebody said to me, "But didn't you read such-and-such paper?" And I answered, "I don't read papers." There was silence in the room, and then everybody started laughing. This has always been my approach to papers. Back when I was a researcher I would go to a conference and learn in one or two days what people were doing, instead of sitting for hours looking at the papers that had been published in the journals in the last month. Now, being semi-retired, I don't actually need to go and study papers; mostly, when I go and look at something, it is because somebody I know told me about it.

And you also don't write papers.

I also don't write papers, yes. I prefer not to.

That's why we're having this session, by the way. Let's actually go to the open questions; I'm going to check the website, but if somebody wants to send questions, this is your last chance, or you can use Matrix as well. Okay, some questions. Do all execution backends in llama.cpp / GGML, beyond the CPU backend, support all quantizations, and is the performance similar?

I'm not tracking this very closely, but my memory is that, if I remember correctly, all of the quantization types were supported; I wrote kernels for them in Metal and in CUDA. It is possible that CUDA support was not quite 100 percent, so I don't really know if that is still the case.

We saw from Llama 2 to Llama 3 that the ability to quantize the model was impacted by massive overtraining.
We have also seen the bit-scaling laws from Christopher Raston. Do you think quantization will still be used as models become more overtrained?

I think yes. This is a real effect. I don't quite follow the term "overtrained": the models have been trained on more tokens, and it seems to me the model weights actually contain more information than the model weights of previous-generation models, and that's why they are more difficult to quantize. Or rather, they are not more difficult to quantize, but the quantization error is higher when you quantize them using the same number of bits. This was the main motivation for developing those four-, five- and six-bit IQK quants, because they quantize Llama 3 much better than what is in mainline llama.cpp.

The other question asks whether you think dynamic quantization approaches will be integrated into llama.cpp, since llama.cpp has a more rigid quantization strategy, and what challenges there might be in doing so.

The person who asked this question needs to explain it to me, because, as I mentioned earlier, I don't hang around on Twitter and Reddit, so I don't know what people mean by dynamic quantization. If dynamic quantization means assigning different bits to different layers, it has basically always been like that in llama.cpp, since the beginning. In fact, if you go to llama.cpp and search for issues entered by ikawrakow, you will see an issue that I entered somewhere around March or April of 2023 talking about variable-bit quantization. There were, of course, professors coming in to teach me that LLM quantization has nothing to do with video compression and that therefore what I was talking about is nonsense, but yeah, this has been in llama.cpp forever.

Thank you so much. We are actually out of time, and I haven't been able to read all the questions, but feel free to ask Ivan in the hallway track, at the dinner, or tomorrow at the event. So thank you very much, Ivan. On time, yeah.