Okay, hi everyone, happy to be here. It will be a quick presentation, only 10 minutes, and I will try to both present a new tool and make a quick demo.

So, I'm a research engineer in a team in the south of Paris, at CREST, and we have been developing for one year a tool to accelerate text annotation for the social sciences.

This is the basic plot. The context is the following: we have a shedload of textual data for the social sciences, ranging from newspapers to social media to speech-to-text transcripts, and it comes in sizes from a hundred to millions of documents, depending on the kind of research topic. And there is this hype around AI, with new tools that are just emerging, so we have a lot of new horizons with foundation models and LLMs to embed in our research. On the other side, in a lot of fields, not all researchers work with code. So there is a gap, and we have today another example of the need for tools that open up the possibilities of those new developments for research.

So this is the basic context, and the tool I'm going to showcase today is called ActiveTigger, a tiger to annotate text, and it is an open-source web application for text annotation. The rationale is the following. First, to be able to collaborate on text annotation, because it is common to work on it together. Second, to accelerate the annotation of specific elements: the tool integrates an active learning loop, which basically means selecting specific elements in order to speed up the annotation of specific patterns in the dataset. And the last point is to scale from a few annotations, in the thousands, to the whole dataset, which can be millions of elements.

It is based on the observation that, even with all the zero-shot models we have right now, it remains the case that, even on complex tasks, a well-trained supervised model can come close to human annotation. So there is a lot of room to replace human annotators with well-trained supervised models to annotate the data. That is basically the rationale of the tool. I will get back to that once I've done the demo, but first let's see what it looks like.

Just to give you a very quick technical overview, I don't want to go too deep into it: the backend is developed, like Panoptic I think, in Python with FastAPI, a REST API for a client-server organization.
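As a rough illustration of that client-server idea, here is a minimal sketch of what such a REST backend can look like. The endpoint names and data model are invented for the example and are not ActiveTigger's actual API.

```python
# Minimal sketch of a REST annotation backend in the spirit described above.
# Endpoint names and data model are hypothetical, not ActiveTigger's real API.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# In-memory store standing in for the real project database
ANNOTATIONS: dict[str, str] = {}

class Annotation(BaseModel):
    element_id: str  # id of the text element being coded
    label: str       # label chosen by the annotator

@app.post("/annotations")
def add_annotation(annotation: Annotation):
    """Record one annotation sent by the web client or the Python client."""
    ANNOTATIONS[annotation.element_id] = annotation.label
    return {"status": "ok", "total": len(ANNOTATIONS)}

@app.get("/annotations/count")
def count_annotations():
    """Let clients check how many elements have been coded so far."""
    return {"total": len(ANNOTATIONS)}
```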
The front end is a collaboration, coming out of this open research devroom, with Paul Girard, so it is something that was also born here, somehow; it is built with React. There is also a Python client to interact with the application. And we are using one specific family of models, which are kind of old models compared to the newest ones: BERT models. BERT models were developed and trained at the end of the 2010s, and they are having a comeback, because just a few months, a few weeks ago, a new version dropped, ModernBERT, which extends the context window. So it is older stuff that is still very relevant for social science and for text analysis.

The tool is built on three main values. First, to be task-specific: we don't intend to be a platform for all of text analysis. Second, the idea is to integrate, as much as possible, the best practices of our fields, sociology, political science and so forth, on how to annotate correctly; by that I mean helping people follow those best practices. And third, it is still a research tool, so in the end it needs to be simple, simple enough to be modified by each team to add the elements they need for their research. This is a trade-off, but we try to keep it.

Just to give a few elements of the current milestones: we are in beta testing, so we already have research teams that are starting to use it for their research, so please join if you want. And to give you an idea of what is scheduled: we are starting to integrate a better interface for calling external models, like Hugging Face or OpenAI models, to use them inside the application; we want to improve the management of rights; and we aim for a stable version by mid-June, as a Dockerized version.
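To make the BERT / ModernBERT point concrete, here is a minimal sketch of loading such a model for classification. It assumes the Hugging Face transformers library in a version recent enough to ship ModernBERT; the two labels are just the ones used in the demo below.

```python
# Minimal sketch: loading a ModernBERT checkpoint for binary text classification.
# Assumes a recent `transformers` release with ModernBERT support.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "answerdotai/ModernBERT-base"  # extended context window compared to older BERTs

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=2,  # e.g. "I will attend" / "I won't attend" in the demo below
)

# Short documents (abstracts, paragraphs...) are tokenized before fine-tuning
batch = tokenizer(
    ["An abstract about text annotation tools for the social sciences."],
    truncation=True,
    return_tensors="pt",
)
print(model(**batch).logits.shape)  # (batch_size, num_labels)
```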
Okay, so I want to jump to a quick demo for the five minutes that are left. Do I need a classifier right now? Maybe not, maybe yes. We all need to choose the next talk to attend, so why not use our preferences to train a classifier on the FOSDEM dataset? We have all the abstracts of the conference, so it is easy to get the data. It is not a real social science task, of course, but you can translate it into a real research activity, like trying to analyse the way journalists talk about a specific topic and then training a classifier to do so.

So, how to proceed? If you want to do it, please join on the interface. We need the FOSDEM dataset, and you can get it, it is already online, all the FOSDEM abstracts. Then, what are the next steps? First you create a project. Then you decide how you want to code, so you create the labels you want to use and, if you are a bit more sophisticated, a codebook that describes how you want to code; this is the huge part of a serious project. Then you start to annotate, first randomly: you just take a few elements and say, okay, I want to attend this one, not this one, and so forth. At some point you can decide to switch to active learning, that is, trying to predict the next label for each element, so as to increase the number of elements from one class or the other. Once you have enough, you train the model and check whether it is good enough: if it predicts well enough what you want, if the F1, for instance, is high enough, you say, okay, I'm done with that, and you can just generalize the prediction to the rest of the dataset.

So let's do it very quickly. You can go to the project, if you were able to connect it is the same interface, and I have already created the project, but if you want, you can create it this way. Inside a dataset, you have a very simple interface. You can prepare a project, like adding labels; I already created them, "I will attend" and "I won't attend". And, interestingly, you create features, which means you transform the text into a vector that is used to help the model in the active learning loop. Then you can explore, and the main part is to annotate. Now comes, you know, the human part: you need to read and say, okay, is this something I want to attend? No, it's not very interesting. Do I want to attend this one? Oh yes, it's nice. Okay, so with what I just annotated, I can train a simple model based on the 62 elements I annotated. If I do that, I get a first loop of prediction, based on what I have already annotated, to predict the next ones. So I have information about whether this is something I can predict, and it helps me to look for specific elements, for instance to loop over those that are not very easy to predict, the ones with a high entropy.
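As a very small illustration of what that step does conceptually, here is a generic uncertainty-based selection with scikit-learn; it is not the actual code running behind the interface, and the example texts are made up.

```python
# Generic illustration of uncertainty-based (entropy) selection for active learning,
# not ActiveTigger's internal implementation.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# A handful of already-annotated abstracts (toy example)
texts_labeled = [
    "a talk on text annotation for sociology",
    "database internals and query planning",
    "analysing media discourse with language models",
]
labels = [1, 0, 1]  # 1 = I will attend, 0 = I won't attend
texts_pool = [
    "kernel scheduling deep dive",
    "LLMs for qualitative research",
    "packaging tools for a compiled language",
]

# Feature step: turn the text into vectors (the "features" mentioned above)
vectorizer = TfidfVectorizer()
X_labeled = vectorizer.fit_transform(texts_labeled)
X_pool = vectorizer.transform(texts_pool)

# Quick model trained on what has been annotated so far
clf = LogisticRegression().fit(X_labeled, labels)

# Entropy of the predicted probabilities: high entropy = hard to predict
proba = clf.predict_proba(X_pool)
entropy = -(proba * np.log(proba + 1e-12)).sum(axis=1)

# Show the pool sorted by uncertainty, most uncertain first: annotate these next
for idx in np.argsort(entropy)[::-1]:
    print(f"{entropy[idx]:.2f}  {texts_pool[idx]}")
```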
So you can select and focus on those that are the most difficult for the model to predict, and then improve your training set; this is basically the principle. Okay, so once you have enough, you can switch to training a model, and here you can fine-tune a BERT model. A BERT model has on the order of a hundred million parameters, so it is from five hundred megabytes to one gigabyte. You can pick the model by name; let's try the latest ModernBERT model, and you can run it on all the elements. The point is that you need a GPU for fine-tuning those models, around 20 gigabytes of GPU memory if you want to do it, and that is why we are running them on our servers. So you can train, you can predict. Okay, I'll just move on so I can finish. Once you have trained one, you can look at it: you can see the parameters, but you can also see the prediction scores. For this one I have an F1 around 0.7, which is not very good; I should annotate more elements. But now I can iterate and start to get a better model, and then I can just extend the prediction to the whole FOSDEM corpus, it is computing right now, and export the predictions. I already did that, so here are the predictions, and I will look at the whole FOSDEM programme to see which talks I will attend in the next sessions. So I'm done, thank you.

Thank you. I will need to look at the probabilities, because I'm not sure the prediction is always good enough to pick one, and I will stay here anyway, because I'm here to support the team.

The question is whether it can handle things like transcripts and interview data, and the answer is: what do you want to do with that? If your idea is to detect a specific pattern in the interviews, and you know what you want to do and you can write a codebook, it works very well. You just have to prepare your dataset: maybe the whole interview is too big to fit in the context window of a BERT model, so you would split it by paragraphs, then start to annotate and deploy the model on the whole dataset. So it will work as well; it is basically designed to also do this kind of, you know, small job of building specific annotators for finding a given pattern, or mentions of a given topic, in the dataset. But the idea is really to extend your own, you know, view and way of reading the data, along the specific criteria of what you want to do.

Where is the training happening? Right now the server is in our laboratory, so we have a few GPUs. In the end you will be able to install it wherever you want, but for the moment it runs on the servers inside the laboratory, near Paris.

Thank you.
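To close with a concrete sketch of the fine-tune, evaluate (F1), then predict-on-everything step walked through in the demo, here is generic Hugging Face Trainer code. It is not the tool's internal pipeline, and the file and column names are placeholders.

```python
# Generic sketch of the fine-tune / evaluate / export step described in the demo.
import numpy as np
import pandas as pd
import torch
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Placeholder files: a small annotated subset and the full corpus to generalize to
annotated = pd.read_csv("annotated_abstracts.csv")   # columns: text, label (0/1)
full_corpus = pd.read_csv("all_abstracts.csv")       # column: text

class TextDataset(torch.utils.data.Dataset):
    """Tokenized texts, with optional labels for training/evaluation."""
    def __init__(self, texts, labels=None):
        self.enc = tokenizer(list(texts), truncation=True, padding=True)
        self.labels = None if labels is None else [int(l) for l in labels]
    def __len__(self):
        return len(self.enc["input_ids"])
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        if self.labels is not None:
            item["labels"] = torch.tensor(self.labels[i])
        return item

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return {"f1": f1_score(labels, np.argmax(logits, axis=1))}

train_df, eval_df = train_test_split(annotated, test_size=0.2, random_state=0)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetune_out", num_train_epochs=3),
    train_dataset=TextDataset(train_df["text"], train_df["label"]),
    compute_metrics=compute_metrics,
)
trainer.train()  # needs a GPU in practice, as mentioned in the talk

# The F1 used to decide whether the model is "good enough"
print(trainer.evaluate(eval_dataset=TextDataset(eval_df["text"], eval_df["label"])))

# Generalize the prediction to the whole dataset and export it
preds = trainer.predict(TextDataset(full_corpus["text"]))
full_corpus["predicted_label"] = preds.predictions.argmax(axis=1)
full_corpus.to_csv("predictions.csv", index=False)
```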
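And a small sketch of the preparation step mentioned in the question about interview data: split each interview into paragraphs so that every unit fits the model's context window. The 512-token limit below is an illustrative assumption, not a fixed rule.

```python
# Split long interviews into paragraph-sized annotation units.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
MAX_TOKENS = 512  # pick a limit suited to the model you plan to fine-tune

def split_interview(text: str, interview_id: str):
    """Yield (unit_id, paragraph) pairs that can be annotated one by one."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    for i, paragraph in enumerate(paragraphs):
        n_tokens = len(tokenizer.encode(paragraph))
        if n_tokens > MAX_TOKENS:
            print(f"unit {interview_id}-{i} has {n_tokens} tokens, consider splitting it further")
        yield f"{interview_id}-{i}", paragraph

units = list(split_interview("First answer...\n\nSecond answer...", "interview-01"))
print(units)
```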