Okay, hi everyone, happy to be here. It will be a quick presentation, only 10 minutes, and I will try to both present a new tool and make a quick demo.

So, I'm a research engineer in a team in the south of Paris, at CREST, and we have been developing for one year a tool to accelerate text annotation for the social sciences.

This is the basic plot. The context is the following: we have a shedload of textual data for the social sciences, ranging from newspapers to social media to speech-to-text transcripts, and it comes in sizes from a hundred to millions of documents, depending on the kind of research topic. And there is this hype around AI, with new tools that are just emerging, so we have a lot of new horizons with foundation models and LLMs to embed in our research. On the other side, in a lot of fields, not all researchers work with code. So there is a gap, and we have today another example of the need for tools that open up the possibilities of those new developments for research.

So this is the basic context, and the tool I'm going to showcase today is called ActiveTigger, a tiger to annotate text, and it is an open-source web application for text annotation. The rationale is the following. First, to be able to collaborate on text annotation, because it is common to work on it together. Second, to accelerate the annotation of specific elements: the tool integrates an active learning loop, which basically means selecting specific elements in order to speed up the annotation of specific patterns in the dataset. And the last point is to scale from a few annotations, in the thousands, to the whole dataset, which can be millions of elements.

It is based on the observation that, even with all the zero-shot models we have right now, it remains the case that, even on complex tasks, a well-trained supervised model can come close to human annotation. So there is a lot of room to replace human annotators with well-trained supervised models to annotate the data. That is basically the rationale of the tool. I will get back to that once I've done the demo, but first let's see what it looks like.

Just to give you a very quick technical overview, I don't want to go too deep into it: the backend is developed, like Panoptic I think, in Python with FastAPI, a REST API for a client-server organization.
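As a rough illustration of that client-server idea, here is a minimal sketch of what such a REST backend can look like. The endpoint names and data model are invented for the example and are not ActiveTigger's actual API.

```python
# Minimal sketch of a REST annotation backend in the spirit described above.
# Endpoint names and data model are hypothetical, not ActiveTigger's real API.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# In-memory store standing in for the real project database
ANNOTATIONS: dict[str, str] = {}

class Annotation(BaseModel):
    element_id: str  # id of the text element being coded
    label: str       # label chosen by the annotator

@app.post("/annotations")
def add_annotation(annotation: Annotation):
    """Record one annotation sent by the web client or the Python client."""
    ANNOTATIONS[annotation.element_id] = annotation.label
    return {"status": "ok", "total": len(ANNOTATIONS)}

@app.get("/annotations/count")
def count_annotations():
    """Let clients check how many elements have been coded so far."""
    return {"total": len(ANNOTATIONS)}
```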
The front end is a collaboration, coming out of this open research devroom, with Paul Girard, so it is something that was also born here, somehow; it is built with React. There is also a Python client to interact with the application. And we are using one specific family of models, which are kind of old models compared to the newest ones: BERT models. BERT models were developed and trained at the end of the 2010s, and they are having a comeback, because just a few months, a few weeks ago, a new version dropped, ModernBERT, which extends the context window. So it is older stuff that is still very relevant for social science and for text analysis.

The tool is built on three main values. First, to be task-specific: we don't intend to be a platform for all of text analysis. Second, the idea is to integrate, as much as possible, the best practices of our fields, sociology, political science and so forth, on how to annotate correctly; by that I mean helping people follow those best practices. And third, it is still a research tool, so in the end it needs to be simple, simple enough to be modified by each team to add the elements they need for their research. This is a trade-off, but we try to keep it.

Just to give a few elements of the current milestones: we are in beta testing, so we already have research teams that are starting to use it for their research, so please join if you want. And to give you an idea of what is scheduled: we are starting to integrate a better interface for calling external models, like Hugging Face or OpenAI models, to use them inside the application; we want to improve the management of rights; and we aim for a stable version by mid-June, as a Dockerized version.
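To make the BERT / ModernBERT point concrete, here is a minimal sketch of loading such a model for classification. It assumes the Hugging Face transformers library in a version recent enough to ship ModernBERT; the two labels are just the ones used in the demo below.

```python
# Minimal sketch: loading a ModernBERT checkpoint for binary text classification.
# Assumes a recent `transformers` release with ModernBERT support.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "answerdotai/ModernBERT-base"  # extended context window compared to older BERTs

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=2,  # e.g. "I will attend" / "I won't attend" in the demo below
)

# Short documents (abstracts, paragraphs...) are tokenized before fine-tuning
batch = tokenizer(
    ["An abstract about text annotation tools for the social sciences."],
    truncation=True,
    return_tensors="pt",
)
print(model(**batch).logits.shape)  # (batch_size, num_labels)
```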
Okay, so I want to jump to a quick demo for the five minutes that are left. Do I need a classifier right now? Maybe not, maybe yes. We all need to choose the next talk to attend, so why not use our preferences to train a classifier on the FOSDEM dataset? We have all the abstracts of the conference, so it is easy to get the data. It is not a real social science task, of course, but you can translate it into a real research activity, like trying to analyse the way journalists talk about a specific topic and then training a classifier to do so.

So, how to proceed? If you want to do it, please join on the interface. We need the FOSDEM dataset, and you can get it, it is already online, all the FOSDEM abstracts. Then, what are the next steps? First you create a project. Then you decide how you want to code, so you create the labels you want to use and, if you are a bit more sophisticated, a codebook that describes how you want to code; this is the huge part of a serious project. Then you start to annotate, first randomly: you just take a few elements and say, okay, I want to attend this one, not this one, and so forth. At some point you can decide to switch to active learning, that is, trying to predict the next label for each element, so as to increase the number of elements from one class or the other. Once you have enough, you train the model and check whether it is good enough: if it predicts well enough what you want, if the F1, for instance, is high enough, you say, okay, I'm done with that, and you can just generalize the prediction to the rest of the dataset.

So let's do it very quickly. You can go to the project, if you were able to connect it is the same interface, and I have already created the project, but if you want, you can create it this way. Inside a dataset, you have a very simple interface. You can prepare a project, like adding labels; I already created them, "I will attend" and "I won't attend". And, interestingly, you create features, which means you transform the text into a vector that is used to help the model in the active learning loop. Then you can explore, and the main part is to annotate. Now comes, you know, the human part: you need to read and say, okay, is this something I want to attend? No, it's not very interesting. Do I want to attend this one? Oh yes, it's nice. Okay, so with what I just annotated, I can train a simple model based on the 62 elements I annotated. If I do that, I get a first loop of prediction, based on what I have already annotated, to predict the next ones. So I have information about whether this is something I can predict, and it helps me to look for specific elements, for instance to loop over those that are not very easy to predict, the ones with a high entropy.
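As a very small illustration of what that step does conceptually, here is a generic uncertainty-based selection with scikit-learn; it is not the actual code running behind the interface, and the example texts are made up.

```python
# Generic illustration of uncertainty-based (entropy) selection for active learning,
# not ActiveTigger's internal implementation.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# A handful of already-annotated abstracts (toy example)
texts_labeled = [
    "a talk on text annotation for sociology",
    "database internals and query planning",
    "analysing media discourse with language models",
]
labels = [1, 0, 1]  # 1 = I will attend, 0 = I won't attend
texts_pool = [
    "kernel scheduling deep dive",
    "LLMs for qualitative research",
    "packaging tools for a compiled language",
]

# Feature step: turn the text into vectors (the "features" mentioned above)
vectorizer = TfidfVectorizer()
X_labeled = vectorizer.fit_transform(texts_labeled)
X_pool = vectorizer.transform(texts_pool)

# Quick model trained on what has been annotated so far
clf = LogisticRegression().fit(X_labeled, labels)

# Entropy of the predicted probabilities: high entropy = hard to predict
proba = clf.predict_proba(X_pool)
entropy = -(proba * np.log(proba + 1e-12)).sum(axis=1)

# Show the pool sorted by uncertainty, most uncertain first: annotate these next
for idx in np.argsort(entropy)[::-1]:
    print(f"{entropy[idx]:.2f}  {texts_pool[idx]}")
```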
So you can select and focus on those that are the most difficult for the model to predict, and then improve your training set; this is basically the principle. Okay, so once you have enough, you can switch to training a model, and here you can fine-tune a BERT model. A BERT model has on the order of a hundred million parameters, so it is from five hundred megabytes to one gigabyte. You can pick the model by name; let's try the latest ModernBERT model, and you can run it on all the elements. The point is that you need a GPU for fine-tuning those models, around 20 gigabytes of GPU memory if you want to do it, and that is why we are running them on our servers. So you can train, you can predict. Okay, I'll just move on so I can finish. Once you have trained one, you can look at it: you can see the parameters, but you can also see the prediction scores. For this one I have an F1 around 0.7, which is not very good; I should annotate more elements. But now I can iterate and start to get a better model, and then I can just extend the prediction to the whole FOSDEM corpus, it is computing right now, and export the predictions. I already did that, so here are the predictions, and I will look at the whole FOSDEM programme to see which talks I will attend in the next sessions. So I'm done, thank you.

Thank you. I will need to look at the probabilities, because I'm not sure the prediction is always good enough to pick one, and I will stay here anyway, because I'm here to support the team.

The question is whether it can handle things like transcripts and interview data, and the answer is: what do you want to do with that? If your idea is to detect a specific pattern in the interviews, and you know what you want to do and you can write a codebook, it works very well. You just have to prepare your dataset: maybe the whole interview is too big to fit in the context window of a BERT model, so you would split it by paragraphs, then start to annotate and deploy the model on the whole dataset. So it will work as well; it is basically designed to also do this kind of, you know, small job of building specific annotators for finding a given pattern, or mentions of a given topic, in the dataset. But the idea is really to extend your own, you know, view and way of reading the data, along the specific criteria of what you want to do.

Where is the training happening? Right now the server is in our laboratory, so we have a few GPUs. In the end you will be able to install it wherever you want, but for the moment it runs on the servers inside the laboratory, near Paris.

Thank you.
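To close with a concrete sketch of the fine-tune, evaluate (F1), then predict-on-everything step walked through in the demo, here is generic Hugging Face Trainer code. It is not the tool's internal pipeline, and the file and column names are placeholders.

```python
# Generic sketch of the fine-tune / evaluate / export step described in the demo.
import numpy as np
import pandas as pd
import torch
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Placeholder files: a small annotated subset and the full corpus to generalize to
annotated = pd.read_csv("annotated_abstracts.csv")   # columns: text, label (0/1)
full_corpus = pd.read_csv("all_abstracts.csv")       # column: text

class TextDataset(torch.utils.data.Dataset):
    """Tokenized texts, with optional labels for training/evaluation."""
    def __init__(self, texts, labels=None):
        self.enc = tokenizer(list(texts), truncation=True, padding=True)
        self.labels = None if labels is None else [int(l) for l in labels]
    def __len__(self):
        return len(self.enc["input_ids"])
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        if self.labels is not None:
            item["labels"] = torch.tensor(self.labels[i])
        return item

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return {"f1": f1_score(labels, np.argmax(logits, axis=1))}

train_df, eval_df = train_test_split(annotated, test_size=0.2, random_state=0)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetune_out", num_train_epochs=3),
    train_dataset=TextDataset(train_df["text"], train_df["label"]),
    compute_metrics=compute_metrics,
)
trainer.train()  # needs a GPU in practice, as mentioned in the talk

# The F1 used to decide whether the model is "good enough"
print(trainer.evaluate(eval_dataset=TextDataset(eval_df["text"], eval_df["label"])))

# Generalize the prediction to the whole dataset and export it
preds = trainer.predict(TextDataset(full_corpus["text"]))
full_corpus["predicted_label"] = preds.predictions.argmax(axis=1)
full_corpus.to_csv("predictions.csv", index=False)
```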
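And a small sketch of the preparation step mentioned in the question about interview data: split each interview into paragraphs so that every unit fits the model's context window. The 512-token limit below is an illustrative assumption, not a fixed rule.

```python
# Split long interviews into paragraph-sized annotation units.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
MAX_TOKENS = 512  # pick a limit suited to the model you plan to fine-tune

def split_interview(text: str, interview_id: str):
    """Yield (unit_id, paragraph) pairs that can be annotated one by one."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    for i, paragraph in enumerate(paragraphs):
        n_tokens = len(tokenizer.encode(paragraph))
        if n_tokens > MAX_TOKENS:
            print(f"unit {interview_id}-{i} has {n_tokens} tokens, consider splitting it further")
        yield f"{interview_id}-{i}", paragraph

units = list(split_interview("First answer...\n\nSecond answer...", "interview-01"))
print(units)
```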