WEBVTT

00:00.000 --> 00:12.000 "Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision."
00:12.000 --> 00:18.000 And when we read the paper, we kind of found out that the architecture that they presented
00:18.000 --> 00:19.000 is simple.
00:19.000 --> 00:25.000 It's basically just two transformer models stitched together end to end.
00:25.000 --> 00:30.000 The samples that they released also sounded really, really good.
00:30.000 --> 00:34.000 Unfortunately, they only released the paper and the samples.
00:34.000 --> 00:39.000 And we thought, like, we'd give it a shot to basically replicate the paper.
00:39.000 --> 00:47.000 We had no previous experience with TTS, speech, or even with training big transformer models.
00:47.000 --> 00:50.000 And as I said, this is basically our journey.
00:50.000 --> 00:52.000 Let's see if that works.
00:53.000 --> 00:56.000 Hello, Fauston. This is the first demo of WhisperSpeech.
00:56.000 --> 01:03.000 A fully open source text-to-speech model trained by Collabora and LAION on the JUWELS supercomputer.
01:03.000 --> 01:09.000 This is basically just one example that WhisperSpeech is able to produce.
01:09.000 --> 01:19.000 And we think it's basically on par with the text-to-speech that you get from Amazon, Microsoft or Google.
01:20.000 --> 01:27.000 Previously, I basically said that we built WhisperSpeech on top of SPEAR-TTS.
01:27.000 --> 01:30.000 That's only half of the story.
01:30.000 --> 01:35.000 We actually built WhisperSpeech on top of some amazing open source projects.
01:35.000 --> 01:37.000 Mainly Whisper from OpenAI.
01:37.000 --> 01:40.000 That's also where the name WhisperSpeech comes from.
01:40.000 --> 01:47.000 And EnCodec from Meta, which is the neural codec, and Vocos from Gemelo AI.
01:47.000 --> 01:51.000 Also, in the process of implementing WhisperSpeech,
01:51.000 --> 01:56.000 we read a bunch of papers from different AI labs.
01:56.000 --> 01:59.000 Namely, of course, SPEAR-TTS from Google,
01:59.000 --> 02:07.000 MusicGen from Meta, and Tensor Programs V from Microsoft and OpenAI.
02:07.000 --> 02:13.000 So this basically shows you a rough overview of the architecture of WhisperSpeech.
02:13.000 --> 02:16.000 So on the left, of course, since it's a text-to-speech model,
02:16.000 --> 02:27.000 you have the input text, which is then fed into the first transformer model, which kind of creates a phonetic representation of your input
02:27.000 --> 02:28.000 text.
02:28.000 --> 02:35.000 And since it's kind of difficult to figure out emotions or prosody from the text itself,
02:35.000 --> 02:42.000 we also embed this into the phonetic representation as part of the first transformer model.
02:42.000 --> 02:49.000 Then the phonetic representation is fed into the next one, which then creates the actual speech.
02:49.000 --> 02:52.000 It's still compressed speech, but it's speech.
02:52.000 --> 03:00.000 And since the text doesn't give you anything about the speaker's voice, emotions, whatever,
03:00.000 --> 03:03.000 we also add speaker embeddings on top of it.
03:03.000 --> 03:09.000 And then we apply the Vocos vocoder to actually get the audio out of it.
03:10.000 --> 03:14.000 Now, I mentioned that we kind of get a phonetic representation.
03:14.000 --> 03:18.000 And the phonetic representation actually comes from Whisper itself.
03:18.000 --> 03:22.000 And Whisper is, very simply, just an encoder-decoder model.
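To make the architecture walkthrough above concrete, here is a minimal sketch of the two-stage pipeline in Python. The names (text_to_semantic, semantic_to_acoustic, vocoder) are hypothetical stand-ins and the interfaces are assumptions for illustration, not the actual WhisperSpeech API.

```python
# A minimal sketch of the WhisperSpeech-style two-stage pipeline described in the talk.
# All names and signatures below are illustrative assumptions, not the real modules.

def synthesize(text_tokens, speaker_embedding, text_to_semantic, semantic_to_acoustic, vocoder):
    # Stage 1: the first transformer turns input text into a phonetic/semantic representation.
    # Prosody and emotion have to be inferred here, since the text alone does not carry them.
    semantic_tokens = text_to_semantic(text_tokens)

    # Stage 2: the second transformer turns the phonetic representation into compressed speech
    # (neural-codec tokens), conditioned on a speaker embedding for voice identity.
    acoustic_tokens = semantic_to_acoustic(semantic_tokens, speaker_embedding)

    # Finally, a Vocos-style vocoder decodes the compressed representation into a waveform.
    return vocoder(acoustic_tokens)
```

Following the description above, voice identity only enters at the second stage, while prosody is handled as part of the phonetic representation in the first.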
03:22.000 --> 03:29.000 And if you just look at the right, the Whisper decoder just takes the Whisper features and creates the text tokens,
03:29.000 --> 03:34.000 with varying speed, because you don't really have an idea from the text itself,
03:34.000 --> 03:38.000 but only from the audio sample, how fast the speaker is.
03:38.000 --> 03:43.000 And on the left, of course, you have the input speech, which is like an audio file,
03:43.000 --> 03:49.000 which goes into the encoder, and this actually creates the phonetic representation.
03:49.000 --> 03:59.000 And what we did is we trained a quantizer in between the encoder and decoder, to not only kind of reduce the number of features,
03:59.000 --> 04:06.000 but also to kind of force the encoder to focus on the phonetic representation.
04:07.000 --> 04:15.000 Now we basically have our architecture in place, and this was basically our plan for how we could train WhisperSpeech.
04:15.000 --> 04:20.000 Starting from the right here, we have the speech waveform, which is just audio files,
04:20.000 --> 04:24.000 and you can find a bunch of audio files on the internet.
04:24.000 --> 04:30.000 The problem is, if you want to train a model, like a text-to-speech model, you also need the transcriptions.
04:31.000 --> 04:38.000 Luckily, we have the OpenAI Whisper model that can take audio and create the transcription files.
04:38.000 --> 04:43.000 And then we just do the reverse: we have the transcriptions, and in the training process,
04:43.000 --> 04:46.000 we just use the audio file and the transcription to train it.
04:46.000 --> 04:53.000 But when we actually started the training process, we kind of figured out that it failed us on all fronts,
04:53.000 --> 04:58.000 and yeah, let's just go through the different parts.
04:59.000 --> 05:02.000 Starting with the speech waveform.
05:02.000 --> 05:07.000 I mentioned that we just used audio data.
05:07.000 --> 05:11.000 Luckily for us, there's this really great dataset out there.
05:11.000 --> 05:13.000 It's called LibriLight.
05:13.000 --> 05:14.000 It was released by Meta.
05:14.000 --> 05:19.000 It has 60,000 hours of English speech in the public domain.
05:19.000 --> 05:25.000 But then the first problem that we encountered was that it comes as a single 3.6 terabyte archive,
05:25.000 --> 05:28.000 which is kind of dumb.
05:28.000 --> 05:33.000 Just if you want to download this, even if you can saturate your connection,
05:33.000 --> 05:34.000 it takes you like five hours.
05:34.000 --> 05:37.000 But if you're like us, you want to try out different things.
05:37.000 --> 05:41.000 You don't really download it just once, you download it multiple times.
05:41.000 --> 05:43.000 So it's a problem.
05:43.000 --> 05:49.000 If you want to unpack the dataset, it takes over eight terabytes on your SSD,
05:49.000 --> 05:52.000 just so you have the data available.
05:53.000 --> 05:56.000 The other problem is read amplification.
05:56.000 --> 06:01.000 So the 3.6 terabytes are actually 220,000 files.
06:01.000 --> 06:06.000 And the LibriLight dataset, as I mentioned, is mainly audiobooks.
06:06.000 --> 06:09.000 So they're kind of split up into chapters.
06:09.000 --> 06:14.000 And one chapter has around 16 megabytes on average.
06:14.000 --> 06:19.000 And in machine learning, what you want to do is you kind of want to randomly access your data,
06:19.000 --> 06:21.000 you want to shuffle it.
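Going back to the quantizer mentioned above: a rough sketch of what a vector-quantization bottleneck between the Whisper encoder and decoder could look like in PyTorch. The codebook size, feature dimension and module structure are illustrative assumptions, not the actual WhisperSpeech code, and a real training setup would also need a commitment/codebook loss or EMA codebook updates, which is omitted here.

```python
import torch

class VQBottleneck(torch.nn.Module):
    """Illustrative vector-quantization bottleneck: snaps each encoder frame to its
    nearest codebook entry, turning the feature stream into discrete token ids."""

    def __init__(self, codebook_size=512, dim=384):   # sizes are assumptions, not WhisperSpeech's
        super().__init__()
        self.codebook = torch.nn.Embedding(codebook_size, dim)

    def forward(self, features):                                 # features: (batch, time, dim)
        flat = features.reshape(-1, features.size(-1))           # (batch*time, dim)
        dists = torch.cdist(flat, self.codebook.weight)          # distance to every codebook entry
        ids = dists.argmin(dim=-1).reshape(features.shape[:-1])  # discrete "phonetic" tokens
        quantized = self.codebook(ids)                           # nearest codebook vectors
        # straight-through estimator so gradients still flow back into the encoder
        quantized = features + (quantized - features).detach()
        return quantized, ids
```

The discrete token ids are what the second-stage model is trained on, while the reduced, quantized features are what the Whisper decoder has to work from, which is what pushes the encoder toward a phonetic representation.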
06:21.000 --> 06:27.000 You might want to just extract, like, 30 seconds out of a single chapter.
06:27.000 --> 06:32.000 So on average, you kind of have to read like eight megabytes, and the rest is just waste.
06:32.000 --> 06:39.000 And if you do the math, if you want to train on this, you kind of have an SSD that can read like six gigabytes per second.
06:39.000 --> 06:43.000 You have like eight GPUs and use a batch size of like 32.
06:43.000 --> 06:45.000 You arrive at like three iterations per second.
06:45.000 --> 06:50.000 And in practice, what we have seen is more like 0.5 iterations per second,
06:50.000 --> 06:55.000 because you get file system overhead and just system overhead in general.
06:55.000 --> 07:03.000 Which is really, really bad, because you would wait like months to just finish a single epoch.
07:03.000 --> 07:11.000 Luckily, there's this really cool project, webdataset, from tmbdev, that's the GitHub user handle.
07:11.000 --> 07:20.000 And what webdataset actually allows us to do is split up our 3.6 terabyte file into shards.
07:20.000 --> 07:24.000 It's just like splitting it up into multiple tar files.
07:24.000 --> 07:31.000 And we split up the whole dataset into 623 five-gigabyte shard files.
07:31.000 --> 07:37.000 And webdataset then allowed us to read random shards concurrently and also shuffle.
07:37.000 --> 07:45.000 And if you do the math again, you basically arrive at 330 iterations per second on the same system.
07:45.000 --> 07:52.000 And that is something that you don't really see normally, because most people maybe just deal with like a gigabyte of data.
07:52.000 --> 07:53.000 And that's it.
07:53.000 --> 07:58.000 It just fits in RAM, and you don't really have a problem like this whatsoever.
07:58.000 --> 08:06.000 But if you deal with a lot of data, large data, the disk actually becomes the bottleneck.
08:06.000 --> 08:13.000 So now we have a way to read our audio files, our audiobooks.
08:13.000 --> 08:18.000 The next thing is that we need the actual transcriptions, right?
08:18.000 --> 08:21.000 We need them to train the whole text-to-speech model.
08:21.000 --> 08:26.000 And luckily, as I said, we can just use OpenAI Whisper.
08:26.000 --> 08:29.000 It's a great state-of-the-art model.
08:29.000 --> 08:35.000 But if we actually use just the plain Whisper model that OpenAI released,
08:35.000 --> 08:38.000 it's like 50 times faster than real time.
08:38.000 --> 08:45.000 Which, if you want to process the 60,000 hours, takes you 1,200 GPU hours, or 50 days.
08:45.000 --> 08:49.000 I'm not waiting over a month to kind of get the transcriptions,
08:49.000 --> 08:57.000 especially since, if you mess something up along the way, you'll probably do this multiple times.
08:58.000 --> 09:00.000 So, make it faster.
09:00.000 --> 09:02.000 Yeah, you can use batching.
09:02.000 --> 09:06.000 You can also use the faster-whisper implementation, which is using the same model,
09:06.000 --> 09:08.000 but it's just a faster implementation.
09:08.000 --> 09:16.000 You can also switch to a smaller model, which then basically brings it down to like 78 GPU hours,
09:16.000 --> 09:18.000 which is still like three days.
09:18.000 --> 09:26.000 But remember, we are using webdataset, which allows us to parallelize this across multiple GPUs.
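A minimal sketch of the sharded loading just described, using the webdataset package. The shard naming pattern, the field keys ("flac", "json") and the buffer sizes are assumptions for illustration, not the exact WhisperSpeech setup.

```python
# Sketch of sharded, shuffled reading with webdataset (https://github.com/webdataset/webdataset).
# Shard pattern, field keys and buffer sizes are illustrative assumptions.
import webdataset as wds

shards = "librilight-{000000..000622}.tar"            # 623 tar shards of roughly 5 GB each

dataset = (
    wds.WebDataset(shards, shardshuffle=True,         # visit shards in random order
                   nodesplitter=wds.split_by_node)    # each GPU/node gets its own subset of shards
    .shuffle(1000)                                    # shuffle samples within an in-memory buffer
    .decode(wds.torch_audio)                          # decode audio entries to tensors
    .to_tuple("flac", "json")                         # (waveform, metadata) pairs
)

for waveform, metadata in dataset:
    ...  # transcribe, compute speaker embeddings, or feed the training loop
```

Because each shard is just a tar file that is read sequentially, the randomness happens at shard and buffer granularity rather than per file, which is what removes the read amplification and also makes it trivial to hand different shards to different GPUs.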
09:26.000 --> 09:34.000 So if you do this across, like, 100 GPUs, for example, you can process the dataset in under an hour.
09:34.000 --> 09:40.000 And that not only works for the transcription, but also for voice activity detection, speaker embeddings, and so on.
09:40.000 --> 09:43.000 Now we have the transcriptions.
09:43.000 --> 09:47.000 But yeah, Whisper is kind of a state-of-the-art model, right?
09:47.000 --> 09:49.000 But we encountered several problems.
09:49.000 --> 09:53.000 One of them is that you also get timestamps from Whisper.
09:53.000 --> 09:57.000 But what we have seen is that they're off by several seconds.
09:57.000 --> 10:02.000 Luckily, it's kind of consistent, so we just applied a constant offset.
10:02.000 --> 10:08.000 What was even more puzzling was that in the transcription itself, there were some parts missing.
10:08.000 --> 10:11.000 And as I said, we are using the LibriLight dataset.
10:11.000 --> 10:16.000 And this is an example: if I'm listening to one of the chapters,
10:16.000 --> 10:19.000 it says something like, "Chapter five of" the book title, by the author,
10:19.000 --> 10:22.000 "this LibriVox recording is in the public domain", and so on.
10:22.000 --> 10:25.000 And what Whisper heard was basically this.
10:25.000 --> 10:28.000 So it ignored the first part completely.
10:28.000 --> 10:34.000 And we were kind of like, what is this? Because you want to train something on top of it.
10:34.000 --> 10:37.000 And we have a really good idea of how this happened.
10:37.000 --> 10:42.000 So basically, OpenAI, it's very likely that they used the LibriLight dataset.
10:42.000 --> 10:45.000 It's about 10% of their whole dataset.
10:45.000 --> 10:48.000 So it's not likely that they ignored it.
10:48.000 --> 10:51.000 But they also needed the transcriptions, right?
10:51.000 --> 10:55.000 What they did was basically forced alignment,
10:55.000 --> 11:00.000 with the actual e-books from Project Gutenberg.
11:00.000 --> 11:03.000 So they aligned the text with the audio.
11:03.000 --> 11:06.000 But in the e-book, there's no spoken "chapter five of..." intro.
11:06.000 --> 11:08.000 You don't really see this.
11:08.000 --> 11:10.000 And it's the same for footnotes as well.
11:10.000 --> 11:15.000 So what we did to basically fix this is we just ignore the first 30 seconds.
11:16.000 --> 11:19.000 Now to the training itself.
11:19.000 --> 11:24.000 Just, for me, what's the most important thing when you need to solve a difficult problem?
11:24.000 --> 11:27.000 It's basically iteration speed.
11:27.000 --> 11:30.000 I'm just skipping this because I'm running out of time.
11:30.000 --> 11:33.000 But yeah, what does it mean in our case?
11:33.000 --> 11:40.000 So if we want to train a large model on, like, 80,000 hours and a thousand speakers on 96 GPUs,
11:40.000 --> 11:43.000 it takes like six hours to train.
11:43.000 --> 11:48.000 Which allows me to basically run two experiments per day.
11:48.000 --> 11:51.000 One in the morning, one after lunch.
11:51.000 --> 11:59.000 Which is not really great, because if you're like us, you want to try out a lot of different ideas, and like 99% of the ideas are garbage.
11:59.000 --> 12:03.000 So what you do is you just train it on a very small dataset.
12:03.000 --> 12:06.000 Single speaker; you can train it on an RTX 4090 in 50 minutes.
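Before the compute discussion continues below, here is a rough sketch of the transcript cleanup described earlier: applying a constant offset to Whisper's segment timestamps and dropping everything inside the first 30 seconds of a chapter. The offset value, the cutoff handling and the segment format are illustrative assumptions, not the exact values or code used.

```python
# Sketch of the transcript cleanup described above: shift Whisper's segment timestamps
# by a constant offset and drop everything in the first 30 seconds of a chapter
# (the spoken intro that Whisper tends to skip). Values below are assumptions.

SKIP_SECONDS = 30.0   # assumed cutoff covering the chapter intro
TIME_OFFSET = -2.0    # assumed constant correction, tuned by listening to samples

def clean_segments(segments):
    """segments: iterable of dicts with 'start', 'end', 'text' (Whisper-style output)."""
    cleaned = []
    for seg in segments:
        start = seg["start"] + TIME_OFFSET
        end = seg["end"] + TIME_OFFSET
        if end <= SKIP_SECONDS:      # segment lies entirely inside the intro: drop it
            continue
        cleaned.append({"start": max(start, SKIP_SECONDS), "end": end, "text": seg["text"]})
    return cleaned
```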
12:06.000 --> 12:10.000 And training on the small dataset like that allows you to run basically 48 experiments per day.
12:10.000 --> 12:15.000 And then you just take the little model, add some more depth, and profit, right?
12:15.000 --> 12:17.000 Now the last part.
12:17.000 --> 12:20.000 I mentioned, like, oh, you need GPUs.
12:20.000 --> 12:22.000 96 GPUs, 100 GPUs.
12:22.000 --> 12:26.000 So during our journey, we basically explored different companies.
12:26.000 --> 12:28.000 We started with, like, DataCrunch and the likes,
12:28.000 --> 12:31.000 which kind of give you reasonable pricing.
12:31.000 --> 12:35.000 But the problem is that usually you can only get like one or two GPUs.
12:35.000 --> 12:37.000 The rest are all booked.
12:37.000 --> 12:40.000 Another interesting platform is Vast.ai,
12:40.000 --> 12:45.000 where basically users provide the GPUs that they have under their desk.
12:45.000 --> 12:46.000 It's a lot cheaper.
12:46.000 --> 12:48.000 But the problem is bandwidth.
12:48.000 --> 12:50.000 And they also don't support network drives.
12:50.000 --> 12:53.000 And again, we had to download the dataset multiple times,
12:53.000 --> 12:56.000 like when you have some machine crashes and all of this.
12:56.000 --> 13:01.000 And something that we figured out later was, if you do open source work,
13:01.000 --> 13:03.000 you can talk to the LAION community.
13:03.000 --> 13:06.000 And they have access to the JUWELS supercomputer,
13:06.000 --> 13:07.000 with a lot of GPUs.
13:07.000 --> 13:09.000 And then, yeah, you basically profit.
13:09.000 --> 13:10.000 There you go.
13:10.000 --> 13:12.000 Give them a round of applause, everybody.
13:12.000 --> 13:13.000 Good morning.