WEBVTT 00:00.000 --> 00:11.080 I'm Jonathan Clark, a developer from The Document Foundation, working on improving language 00:11.080 --> 00:15.120 support in LibreOffice. 00:15.120 --> 00:20.400 Language support is very important for our project and for the foundation. 00:20.400 --> 00:23.200 Many languages are endangered, many are going extinct. 00:23.200 --> 00:27.880 I think the saddest thing is when people are forced to use a different language because their 00:27.880 --> 00:32.440 software doesn't work in the language they want to use. 00:32.440 --> 00:37.760 As developers, I think we have a tremendous opportunity and responsibility to make sure 00:37.760 --> 00:42.720 that our project supports everybody's languages. 00:42.720 --> 00:48.880 So, what we've been doing is fixing bugs. 00:48.880 --> 00:55.160 In the last year, LibreOffice developers have fixed more than 100 bugs directly related 00:55.160 --> 00:57.160 to language support. 00:57.160 --> 01:02.840 So, 34 of them were very, very old. 01:02.840 --> 01:09.160 Speak up. 01:09.160 --> 01:13.920 If there's one takeaway from this talk, it's that if you tried using LibreOffice a while 01:13.920 --> 01:18.920 ago and it didn't work well for you because of your language, because of how we support 01:18.920 --> 01:22.000 it, please try a new version. 01:22.000 --> 01:25.440 The situation might be much better now. 01:25.440 --> 01:31.000 There are too many bug fixes to talk about, so I just wanted to focus on one bug that 01:31.000 --> 01:38.560 can help to kind of motivate and explain some of the challenges behind retrofitting good 01:38.560 --> 01:45.600 language support into a mature code base. 01:45.600 --> 01:53.840 This is a very old bug in LibreOffice that's been around since the very beginning. 01:53.840 --> 02:01.160 When you change the color of text inside an English word, the spacing of the letters gets 02:01.160 --> 02:02.880 a little bit weird. 
02:02.880 --> 02:12.720 It's kind of hard to see, so I blew it up and overlaid the two over there. 02:12.720 --> 02:18.840 Now, this bug sat on the back burner for a really long time, because it's kind of hard 02:18.840 --> 02:21.960 to understand why it's important. 02:21.960 --> 02:26.960 It seems like a very minor difference in text, and you could always just tell a user not to do 02:26.960 --> 02:28.680 this. 02:28.680 --> 02:33.920 But it turns out that LibreOffice actually does this automatically sometimes, for instance, 02:33.920 --> 02:37.160 when you're using the track changes feature. 02:37.200 --> 02:43.800 So we can't just tell users not to do it, because we do it. 02:43.800 --> 02:45.800 Why does this happen? 02:45.800 --> 02:53.040 It's due to a very common assumption made by software developers, especially developers 02:53.040 --> 02:57.080 who are most used to the Latin script. 02:57.080 --> 03:02.440 We kind of tend to view computers as high-concept typewriters that just arrange letters from 03:02.480 --> 03:06.080 left to right across the screen. 03:06.080 --> 03:16.760 And so if that's what you believe about text, it leads you to an obvious abstraction. 03:16.760 --> 03:22.880 Your graphics library, your graphics code, is responsible for atomically rendering text, pieces 03:22.880 --> 03:26.840 of text, in a particular style using a particular font. 03:26.880 --> 03:33.400 The application code is responsible for chopping text up into regions of interest, into stretches 03:33.400 --> 03:39.000 of text that share the same style, whatever you have. 03:39.000 --> 03:42.120 And this is what this ends up looking like. 03:42.120 --> 03:47.240 So we start with a virtual pen at a particular position on screen. 03:47.240 --> 03:54.800 We tell our graphics library to draw a string, in this case T, the first letter. 03:54.840 --> 04:00.440 We advance our pen to the right by the width of the text that we drew. 
04:00.440 --> 04:04.680 Set the color, draw the rest of the text. 04:04.680 --> 04:08.400 In this case another complete string, and advance the pen again. 04:08.400 --> 04:14.080 So now we're ready to draw whatever text comes after this. 04:14.080 --> 04:18.240 The problem is that that's not the way that text works. 04:18.240 --> 04:21.920 This turns out to be a very bad assumption. 04:21.920 --> 04:27.720 Even in English, we use lots of special font features to adjust the positions of characters 04:27.720 --> 04:33.600 on a screen to make the text look more attractive. 04:33.600 --> 04:38.160 So in software, this is something we call an architecture bug. 04:38.160 --> 04:44.160 It's an assumption or a design decision made very early in the project that turns out 04:44.160 --> 04:45.960 to be incorrect. 04:45.960 --> 04:52.960 When you have an architecture bug, it's very difficult, very expensive to fix. 04:52.960 --> 04:57.640 And in this case, fortunately for English text, it's only a minor irritation. 04:57.640 --> 04:59.200 You can still understand the text. 04:59.200 --> 05:03.000 It just looks a little bit funny. 05:03.000 --> 05:08.000 But in software, we have this category of languages called CTL languages. 05:08.000 --> 05:10.400 It stands for complex text layout. 05:10.960 --> 05:16.760 The dictionary definition is writing systems where the shape of a character, the appearance 05:16.760 --> 05:21.760 of the character, changes depending on the context in which the character appears. 05:21.760 --> 05:29.280 CTL languages are sadly very often neglected by software projects, 05:29.280 --> 05:36.480 treated as an afterthought, but the group includes a number of heavy hitters. 05:36.480 --> 05:39.680 If your project doesn't consider the needs of CTL languages, 05:39.680 --> 05:44.880 you're giving up billions of potential users. 05:44.880 --> 05:49.520 So the canonical example of a CTL language is Arabic script. 
05:49.520 --> 05:54.520 In Arabic, letters can change shape pretty dramatically depending on where they appear 05:54.520 --> 05:55.520 in a word. 05:55.520 --> 06:00.080 In this case, you can see here this is the isolated form. 06:00.080 --> 06:06.080 And that's when it appears at the start of a word, in the middle, or at the end. 06:06.080 --> 06:09.840 But this isn't an unusual phenomenon. 06:09.840 --> 06:14.880 Even if you're only familiar with European languages, see this example of English cursive. 06:14.880 --> 06:20.560 The shape of the letter E changes significantly depending on whether it's connected high 06:20.560 --> 06:22.560 or low. 06:22.560 --> 06:31.840 If you tried to render the other E in a different word, it would look pretty odd. 06:31.920 --> 06:37.760 I might even confuse it for a space in the middle of a word. 06:37.760 --> 06:47.840 So the algorithm that I described previously really doesn't handle CTL languages well. 06:47.840 --> 06:56.880 I can't read Tamil, but those don't look very similar to me. 06:56.880 --> 07:00.560 So the definition I gave before is kind of vague. 07:00.560 --> 07:06.880 It's languages that are heavily affected by context. 07:06.880 --> 07:13.360 I think an alternate definition might be: a CTL language is one where, when you try to use 07:13.360 --> 07:23.280 the algorithm that we're talking about, it fails to substantially preserve meaning. 07:23.360 --> 07:27.920 Just to make it clear, LibreOffice isn't alone in making this assumption. 07:27.920 --> 07:31.360 It's very, very common. 07:32.160 --> 07:34.000 May I have the time? 07:34.000 --> 07:35.360 You have two minutes. 07:35.360 --> 07:37.200 Two minutes, okay. 07:37.200 --> 07:44.880 Many other software projects make the same assumption, including some very popular operating systems 07:44.880 --> 07:46.080 made by very big companies. 07:47.440 --> 07:51.680 Anyway, it's too expensive to rewrite all of our application code to fix this bug. 
07:52.640 --> 07:57.280 So instead, what we can do is just pass more information when we're figuring out what characters 07:57.280 --> 08:01.120 to render. In this case, once again, we start at the origin. 08:01.120 --> 08:08.160 This time, we're going to draw the entire text and just drop the characters that aren't included in this 08:08.160 --> 08:09.520 section of the text. 08:10.320 --> 08:11.600 Now we repeat the algorithm. 08:13.040 --> 08:16.640 At the end, we've successfully laid out the text correctly. 08:16.960 --> 08:20.720 Our approach isn't perfect. 08:22.560 --> 08:29.840 It can't handle cases where characters are actually rearranged in words, which is something that can 08:29.840 --> 08:35.760 happen in CTL languages, but our results are much closer to what someone would expect when they're 08:35.760 --> 08:36.640 using these languages. 08:38.880 --> 08:44.720 And all this is to say that if you're writing a new software project today, it's very helpful to keep 08:44.720 --> 08:51.680 these languages in mind because it's very difficult to reverse course once you've written a lot of application code.
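
NOTE The two drawing strategies described in the talk can be compared with a small toy model. This is only a sketch and not LibreOffice's actual code: the advance widths, the single kerning pair, and the function names (shape, draw_runs_naive, draw_runs_with_context) are all invented for illustration. It lays out the word "Text" with the first letter in a different color, once run by run and once using the whole string as context.

```python
# Toy font: these advance widths and the kerning pair are made-up values.
ADVANCE = {"T": 10, "e": 8, "x": 8, "t": 6}
KERNING = {("T", "e"): -3}  # tuck the "e" under the arm of the "T"


def shape(text):
    """Lay out text as one unit; return per-character x offsets and total width."""
    xs, pen = [], 0
    for i, ch in enumerate(text):
        if i > 0:
            pen += KERNING.get((text[i - 1], ch), 0)
        xs.append(pen)
        pen += ADVANCE[ch]
    return xs, pen


def draw_runs_naive(runs):
    """Shape each styled run in isolation, advancing the pen by the run's width."""
    placed, pen = [], 0
    for text, color in runs:
        xs, width = shape(text)
        placed += [(ch, pen + x, color) for ch, x in zip(text, xs)]
        pen += width
    return placed


def draw_runs_with_context(runs):
    """Shape the entire string once, then keep only each run's characters."""
    full = "".join(text for text, _ in runs)
    xs, _ = shape(full)
    placed, start = [], 0
    for text, color in runs:
        end = start + len(text)
        placed += [(full[i], xs[i], color) for i in range(start, end)]
        start = end
    return placed


runs = [("T", "red"), ("ext", "black")]
naive = draw_runs_naive(runs)
ctx = draw_runs_with_context(runs)
print([x for _, x, _ in naive])  # the kerning between "T" and "e" is lost
print([x for _, x, _ in ctx])    # same positions as the uncolored word
```

In this toy model the naive version places the "e" at the full advance of "T" because the kerning pair straddles the run boundary, while the context version reproduces the positions of the unstyled word. Like the approach in the talk, it assumes characters keep their order, so it would not cover scripts where shaping reorders glyphs.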