WEBVTT 00:00.000 --> 00:12.720 Okay, so I think we are good to start, please welcome Damien Clochard. 00:12.720 --> 00:18.400 He will talk about PostgreSQL Anonymizer and privacy. 00:18.400 --> 00:22.560 This is a very current topic, so yes, welcome, and thank you. 00:22.560 --> 00:32.560 Thanks a lot, are you here with me? 00:32.560 --> 00:35.960 All right, let's go. 00:35.960 --> 00:43.840 So my name is Damien Clochard, and I'm one of the co-founders of Dalibo, which is a French PostgreSQL 00:43.840 --> 00:44.840 company, basically. 00:44.840 --> 00:50.120 And I'm also an active member of the French PostgreSQL community; if you've done some PostgreSQL 00:50.120 --> 00:55.040 in France, you probably know me from that. 00:55.040 --> 00:56.880 So what brings me here? 00:56.880 --> 01:00.080 Why am I here in front of you? 01:00.080 --> 01:04.040 Like five years, seven years ago, actually. 01:04.040 --> 01:08.760 Someone asked me to anonymize a database, and I said, well, easy, quick. 01:08.760 --> 01:10.920 I'm going to do it for next week. 01:10.920 --> 01:16.560 So I just wrote a bunch of PL/pgSQL scripts to do it. 01:16.560 --> 01:22.400 I just named it PostgreSQL Anonymizer, because that was the topic. 01:22.400 --> 01:27.560 And somehow I'm still there, seven years later, still working on it. 01:27.560 --> 01:32.520 The funny thing is, I never actually anonymized the database, in fact, but I built a tool, 01:32.520 --> 01:35.440 an anonymizer, to do it. 01:35.440 --> 01:38.280 So this is my story. 01:38.280 --> 01:43.160 And I'm here to talk about privacy and how you can protect it.
01:43.160 --> 01:51.720 So I'm going to go a little bit into the definition of what privacy is before going to a 01:51.720 --> 01:59.160 more concrete example, and yeah, the principles of data protection and how to apply them 01:59.160 --> 02:05.160 with PostgreSQL Anonymizer, because that's the thing I did, but those principles could 02:05.160 --> 02:09.720 be used with other tools, too, okay? 02:10.680 --> 02:18.520 So the paradox of privacy, so yeah, basically, it's a strange concept, basically, privacy, 02:18.520 --> 02:26.520 because compared to other concepts like private property, private property is quite 02:26.520 --> 02:34.800 well defined, I think we all know that this computer is mine, but this is not mine, okay? 02:34.800 --> 02:42.000 This is very clear, but privacy is not so clear, actually. 02:42.000 --> 02:49.880 So yeah, we could each do a 45-minute essay about what privacy is, and I'm pretty sure we would all 02:49.880 --> 02:55.560 write different things about what is private, what is intimate for you. 02:55.560 --> 03:01.280 And actually, it's also, so it's different between people, it's different between 03:01.360 --> 03:05.640 eras and also it's different between regions, okay? 03:05.640 --> 03:13.360 Someone in Asia may think privacy means something else, all right? 03:13.360 --> 03:21.800 So yeah, and basically, it's quite new, it's new in the history of humanity, it's a new 03:21.800 --> 03:31.240 thing, because basically, 200 years ago, everybody slept in the same bedroom, all right? 03:31.240 --> 03:36.800 So there was no intimacy in the sense that we talk about it now, you know, in the villages 03:36.800 --> 03:41.320 everybody knew everything about everyone. 03:41.320 --> 03:49.560 If you had a secret, maybe the only way to talk about it was to go to the local priest. 03:49.560 --> 03:53.960 And even that, do you remember that thing?
03:53.960 --> 04:01.160 So that was, yeah, 40, 50 years ago, try to explain that to a teenager now, you know? 04:01.160 --> 04:08.160 We had, in our house, this book, and in this book, there were the names and the phone 04:08.160 --> 04:13.040 numbers of everyone in the area, okay? 04:13.040 --> 04:15.840 Asked two pieces of that. 04:15.840 --> 04:22.200 I mean, I could give you a piece of paper now in this room, ask you for your phone numbers 04:22.200 --> 04:24.040 and your names. 04:24.040 --> 04:29.040 Would you give me your names and your phone numbers? 04:29.040 --> 04:34.120 Yeah, maybe some of you, but probably not all. 04:34.120 --> 04:42.640 So you see, it's relatively new, it's evolving, and basically, this is just the beginning. 04:42.640 --> 04:51.480 As our life is getting more and more digitized, in return, we want more 04:51.480 --> 04:52.480 and more privacy. 04:52.720 --> 05:00.880 It's never going to stop, I would say that the grandchildren of our 05:00.880 --> 05:06.920 grandchildren will look at us just like we look at those poor guys. 05:06.920 --> 05:11.840 They will see privacy as a basic human right, actually. 05:11.840 --> 05:20.120 Okay, so, but no, not everyone agrees with me about that, and there are some guys, some people, 05:20.160 --> 05:25.160 like this one, Eric Schmidt, who would say something like, if you have something that you 05:25.160 --> 05:31.040 don't want anyone to know, maybe you shouldn't be doing it in the first place. 05:31.040 --> 05:32.040 Right? 05:32.040 --> 05:39.560 Yes, in the comments, and now, well, this guy is the CEO of Google, was the CEO, he's not the CEO at 05:39.560 --> 05:47.240 this time, but he was, and this is the stupidest argument about privacy 05:47.320 --> 05:52.280 that you can ever bring into a discussion, right?
05:52.280 --> 06:01.000 We all have the right to privacy, it's not, yeah, you can publish anything you want, but 06:01.000 --> 06:03.640 you also have the right to hide things. 06:03.640 --> 06:05.640 All right? 06:05.640 --> 06:12.840 So, yeah, of course, Big Tech companies, they don't want this definition 06:12.920 --> 06:19.320 of privacy, because basically, their business model is collecting and selling your data. 06:19.320 --> 06:25.760 So, of course, they want to redefine what the meaning of privacy is, right? 06:25.760 --> 06:35.800 And so, there's a market for personal data that is booming currently, so at this point 06:35.880 --> 06:46.040 right now, in 2025, the data broker industry is worth 2,000 billion, yeah, and this 06:46.040 --> 06:50.600 is just ungary, okay? 06:50.600 --> 06:56.040 And you've got these companies, do you know any of these companies, does anyone here know 06:56.040 --> 06:57.040 one of them? 06:57.040 --> 06:58.040 Oh, yes. 06:58.040 --> 07:01.520 Do some of you work for them? 07:02.480 --> 07:10.960 I can ask, just asking, just asking, well, these guys know about you, you're probably 07:10.960 --> 07:18.960 in one of the five of them, they sell your data on a daily basis, and they make a lot, 07:18.960 --> 07:26.560 a lot of money, and you probably never gave your consent for that, all right? 07:26.560 --> 07:36.480 And so, the thing is, that's just the top five, there are about 200 of them, okay? 07:36.480 --> 07:41.720 And the top five are each bigger than Red Hat on their own. 07:41.720 --> 07:51.400 So, yeah, in theory, legally, you can send a letter to each one of them and ask them 07:51.400 --> 07:56.040 to remove your data, and probably, probably, they will do it.
07:56.960 --> 08:03.640 But then next month, you will have to do it again, and you will have to do it for all the 200 08:03.640 --> 08:07.640 companies, the data brokers that have it, then you will have to search, to look for new ones 08:07.640 --> 08:10.440 that will appear every day. 08:10.440 --> 08:18.600 So, of course, capitalism has a response for that, so you can hire companies that will 08:18.600 --> 08:25.000 write these letters for you, and it will only cost you 10 dollars per month. 08:25.320 --> 08:30.920 So, this is a great example of how capitalism will sell you the disease, and then sell 08:30.920 --> 08:32.520 you the cure. 08:32.520 --> 08:39.160 All right, so the black market is rising, because these guys, they need data, it's 08:39.160 --> 08:43.760 hard to get, and it's cheaper on the black market, basically. 08:43.760 --> 08:52.280 All right, of course, I have no way to prove that they're taking your data on the black market, 08:52.360 --> 08:56.200 they can sue me for saying that, but I'll take the risk. 08:58.200 --> 09:05.960 So, yeah, this market is absolutely not regulated, and GDPR does not change anything about that. 09:07.080 --> 09:11.880 And so, logically, the number of data breaches is just exploding. 09:13.080 --> 09:20.520 Last year, the number of notifications people received went up by three hundred percent. 09:21.080 --> 09:25.240 All right, you know those notifications, where you receive an email, where they say, 09:25.240 --> 09:33.240 oh, we value your data privacy, and we may have lost some of your data, but yeah, please stay with us. 09:33.240 --> 09:36.280 All right, we've all had these emails. 09:39.080 --> 09:46.200 So, this is a battle, basically, people want more privacy and tech companies want more data. 09:47.160 --> 09:52.520 All right, and so, if you are a Postgres DBA, or any kind of DBA, actually, 09:53.320 --> 09:56.440 you're right in the middle of this right now.
09:57.640 --> 10:08.520 So, you're a target for the data leaks, and you have the responsibility for the data you hold. 10:10.200 --> 10:11.720 All right, so we're going to war. 10:12.040 --> 10:20.040 So we're going to try to apply... you have to have a method if you want to win this war. 10:21.880 --> 10:27.880 So, we're going to check six basic principles for this battle. 10:28.520 --> 10:33.160 So, first one is privacy by design, which basically means 10:34.520 --> 10:40.200 your masking policy should be written by the application developers. 10:40.280 --> 10:43.000 It's something you should do at the start of the project. 10:43.560 --> 10:48.120 You don't do that at the end of the project, once the database is in production, it's too late. 10:48.920 --> 10:52.120 You need to think about it right now, right? 10:53.160 --> 11:00.200 When you add a column, when you add a new feature, just ask yourself, okay, what impact does it have 11:00.200 --> 11:02.600 on the privacy of our users? 11:03.880 --> 11:04.200 Right. 11:05.160 --> 11:07.320 Then you need role separation. 11:07.320 --> 11:15.240 As a DBA, you're going to give access to the data to a lot of people. 11:15.240 --> 11:24.040 So, don't give them all just one role, separate the roles and give different rules to different roles. 11:26.200 --> 11:27.800 Attack surface reduction, it's a quick introduction. 11:27.800 --> 11:32.200 I'm going to go fast for now, but 11:32.600 --> 11:36.120 this is a basic principle in security. 11:36.120 --> 11:37.640 It's not just for databases. 11:37.640 --> 11:46.280 It's just the idea that you should reduce the number of places where your data is, okay? 11:47.560 --> 11:50.680 Data minimization is basically the opposite of big data. 11:51.240 --> 11:52.920 Big data is completely dead. 11:52.920 --> 11:55.560 Big data is officially illegal. 11:55.560 --> 12:01.800 You should not collect data if you don't have a real usage for it, right?
12:02.440 --> 12:07.880 Which means, I don't know, maybe in your forms, 12:07.880 --> 12:11.960 you collect the birth date of your users. 12:11.960 --> 12:13.720 What do you use that for? 12:13.720 --> 12:18.680 Is it useful to have the birth date of every user you have? 12:18.680 --> 12:19.720 Probably not. 12:19.720 --> 12:21.080 Maybe, I don't know. 12:21.080 --> 12:22.920 But you have to think about it. 12:22.920 --> 12:28.280 And you can't have this philosophy of, let's collect everything and maybe 12:28.280 --> 12:32.120 in two years or four years, we will find something useful to do with this. 12:32.120 --> 12:33.080 It's not possible. 12:35.720 --> 12:40.680 Risk evaluation: like I said, the concept of privacy itself is always evolving. 12:41.720 --> 12:44.040 The technologies are evolving. 12:44.040 --> 12:47.240 You have new kinds of attacks, new ways to... 12:48.120 --> 12:51.840 so you need to constantly 12:53.320 --> 12:56.480 re-evaluate your policy. 12:57.720 --> 13:01.240 And the last one is privacy by default, which is pretty simple. 13:01.240 --> 13:04.280 If you don't know if the column contains 13:04.280 --> 13:09.720 private data, you should treat it as if it contains private data, okay? 13:09.720 --> 13:11.240 If you don't know, you know. 13:12.360 --> 13:15.640 Okay, so again, six principles, 13:15.640 --> 13:18.240 I'm going to show you how you can implement them 13:18.240 --> 13:20.360 with PostgreSQL Anonymizer. 13:20.360 --> 13:25.120 But again, you could do that with any other tool. 13:25.120 --> 13:26.040 So what is this? 13:26.040 --> 13:28.840 This is an open source Postgres extension. 13:31.920 --> 13:34.200 And it's been in production for the last five years, 13:34.200 --> 13:37.840 I guess. It's written in Rust and PL/pgSQL. 13:37.840 --> 13:41.760 I did a talk yesterday about Rust and Rust extensions 13:41.760 --> 13:42.680 in Postgres.
13:42.680 --> 13:44.880 It should be on the FOSDEM website 13:44.880 --> 13:49.880 in a few hours, I guess. 13:49.880 --> 13:54.280 You can install it through RPM, DEB, Docker, Ansible, 13:54.280 --> 13:55.080 whatever. 13:55.080 --> 14:00.080 It's also available on most cloud platforms, 14:00.080 --> 14:04.480 such as Google Cloud, Azure, Crunchy Data, et cetera. 14:04.480 --> 14:07.560 And we do have a lot of institutional sponsors. 14:07.560 --> 14:11.280 So DGFiP, for the non-French people, 14:11.280 --> 14:13.320 it's the Ministry of Finance. 14:13.320 --> 14:15.680 And ANR 14:15.680 --> 14:22.040 is the French National Research Agency. 14:22.040 --> 14:24.680 Well, they both gave us money 14:24.680 --> 14:26.240 to develop this technology. 14:29.160 --> 14:29.800 So what is it? 14:29.800 --> 14:31.960 It's a masking engine. 14:31.960 --> 14:36.400 So you have different ways to mask your data. 14:36.400 --> 14:40.400 You can do static masking, dynamic masking, dumps, et cetera. 14:40.400 --> 14:45.160 I'm just going to talk about the first three. 14:45.160 --> 14:46.960 Because that's enough for you to understand, 14:46.960 --> 14:52.920 but there are a lot of ways to do that, actually. 14:52.920 --> 14:54.520 And it's also a masking toolbox. 14:54.520 --> 15:00.560 So it's, at the same time, what you want to mask 15:00.560 --> 15:03.320 and how you want to mask the data. 15:03.320 --> 15:06.800 And there are a lot of different ways to mask the data, 15:06.800 --> 15:09.000 depending on what you want to do. 15:09.000 --> 15:10.680 You've got pseudonymization. 15:10.680 --> 15:11.560 You have noise. 15:11.560 --> 15:13.880 You have fake data. 15:13.880 --> 15:17.640 You have partial destruction, generalization. 15:17.640 --> 15:21.640 You can also manipulate images. 15:21.640 --> 15:24.600 So again, I'm not going to go into each one of them, 15:24.600 --> 15:28.360 but you get the idea.
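Three of the masking strategies named here (pseudonymization, noise, generalization) can be illustrated outside the database. What follows is a minimal sketch in plain Python, not the extension's SQL API; the function names, the salt and the pool of replacement names are hypothetical and chosen only for illustration.

```python
import hashlib
import random


def pseudonymize(value: str, salt: str, choices: list) -> str:
    """Deterministically map a value to one of `choices` (same input, same alias)."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).digest()
    return choices[int.from_bytes(digest[:8], "big") % len(choices)]


def add_noise(value: float, ratio: float, rng: random.Random) -> float:
    """Shift a number by a random amount of at most +/- ratio * value."""
    return value * (1 + rng.uniform(-ratio, ratio))


def generalize(value: int, step: int) -> tuple:
    """Replace an exact number with the [low, high) range that contains it."""
    low = (value // step) * step
    return (low, low + step)


rng = random.Random(42)
names = ["Smith", "Jones", "Brown", "Taylor"]  # hypothetical replacement pool

alias = pseudonymize("Connor", salt="s3cret", choices=names)
noisy_salary = add_noise(50_000, ratio=0.10, rng=rng)
age_range = generalize(37, step=10)
```

Pseudonymization keeps joins and repeated lookups consistent because the same input always yields the same alias, while noise and generalization trade precision for a lower chance of re-identification.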
15:28.360 --> 15:29.320 OK, let's go. 15:29.320 --> 15:33.320 So I have installed the extension. 15:33.320 --> 15:34.760 It's a binary extension. 15:34.760 --> 15:36.520 I need to load it into my database. 15:36.520 --> 15:40.280 So the best way to do it would be to use 15:40.280 --> 15:42.120 session_preload_libraries. 15:42.120 --> 15:45.160 All right, that's fairly simple. 15:45.160 --> 15:47.560 And then connect to the database again 15:47.560 --> 15:50.600 and just create the extension. 15:50.600 --> 15:51.800 And that's it. 15:51.800 --> 15:54.600 Let's go. 15:54.600 --> 15:55.240 Let's go. 15:55.240 --> 15:57.720 And let's implement privacy by design. 15:57.720 --> 16:01.960 So again, privacy by design is a declarative approach 16:01.960 --> 16:04.600 to anonymization. 16:04.600 --> 16:09.480 We are going to write our masking rules inside the database 16:09.480 --> 16:10.520 model. 16:10.520 --> 16:11.640 All right? 16:11.640 --> 16:15.880 But how can we do that? 16:15.880 --> 16:20.520 How can we add metadata inside the tables 16:20.520 --> 16:23.000 and inside the database model? 16:23.000 --> 16:25.800 Well, there is this thing called security labels. 16:25.800 --> 16:28.520 That probably most of you don't know. 16:28.520 --> 16:30.280 It's a feature of PostgreSQL itself. 16:30.280 --> 16:32.440 It's SQL. 16:32.440 --> 16:36.600 And with this, you can attach label metadata 16:36.600 --> 16:39.080 to objects in your database. 16:39.080 --> 16:41.080 All right. 16:41.080 --> 16:42.680 So let's get to an example. 16:42.680 --> 16:43.960 So I've got the table 16:43.960 --> 16:47.480 people, and it has an ID, first name, last name, and phone number. 16:47.480 --> 16:49.000 All right. 16:49.000 --> 16:50.200 Let's go. 16:50.200 --> 16:56.760 So what I want to do is put, in my database model, 16:56.760 --> 17:01.720 this line that is just going to say how this 17:01.800 --> 17:04.920 column, so in this case, the column is last name.
17:04.920 --> 17:08.920 And this is how I'm going to replace it for the masked user. 17:08.920 --> 17:10.360 So I'm going to replace it. 17:10.360 --> 17:14.120 I'm going to mask it with the function dummy_last_name, 17:14.120 --> 17:18.120 and the function dummy_last_name gives a generic last name, 17:18.120 --> 17:20.040 any kind of last name. 17:20.040 --> 17:28.280 So this is just like putting a constraint on your column. 17:28.280 --> 17:31.080 You would say that last name, for example, 17:31.080 --> 17:35.240 is NOT NULL, or maybe there's a, I don't know, 17:35.240 --> 17:39.320 some kind of check on this. 17:39.320 --> 17:41.800 And it's quite the same thing, actually. 17:41.800 --> 17:43.960 You're just saying, oh, this is the column. 17:43.960 --> 17:52.200 This is how I'm going to transform this column when I need to. 17:52.200 --> 17:57.080 So you can also destroy the data. 17:57.080 --> 18:00.680 Here, I'm replacing the data with a function, 18:00.680 --> 18:03.800 but I can also just replace it with a value, 18:03.800 --> 18:05.640 with a static value. 18:05.640 --> 18:13.240 And so if there's just one thing to keep from this talk, 18:13.240 --> 18:17.480 it's that destruction is the best anonymization. 18:17.480 --> 18:21.640 Yeah, nobody can steal from you something you have destroyed. 18:21.640 --> 18:25.880 So again, with data minimization, we'll talk about it later. 18:25.880 --> 18:30.360 But yeah, if you want to be sure that some data will not leak, 18:30.440 --> 18:34.920 just replace it with a static value. 18:34.920 --> 18:38.920 And a third example is, for example, 18:38.920 --> 18:42.600 we want the phone to be partially destroyed. 18:42.600 --> 18:49.720 So we're going to just destroy the digits in the middle of the phone number. 18:49.720 --> 18:57.560 And then I'm going to apply those rules on the table people. 18:57.560 --> 18:59.080 And here I am. 18:59.080 --> 19:02.120 So the table has been statically masked.
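The idea of declaring one masking rule per column and then applying the rules once, permanently, can be sketched in a few lines. This is plain Python for illustration only, not the extension's syntax; the rule table, the pool of fake names and the choice of which column gets which rule are assumptions.

```python
import random


def partial(value: str, prefix: int, padding: str, suffix: int) -> str:
    """Keep the first `prefix` and last `suffix` characters, replace the middle."""
    return value[:prefix] + padding + value[len(value) - suffix:]


FAKE_LAST_NAMES = ["Smith", "Jones", "Brown"]  # hypothetical pool of fake names

# Declarative rules: column name -> function producing the masked value.
# Columns without a rule (here first_name) are left untouched.
MASKING_RULES = {
    "id": lambda value, rng: None,                                # static value
    "last_name": lambda value, rng: rng.choice(FAKE_LAST_NAMES),  # fake data
    "phone": lambda value, rng: partial(value, 2, "******", 2),   # partial destruction
}


def mask_row(row: dict, rules: dict, rng: random.Random) -> dict:
    """Return a copy of `row` with every ruled column replaced."""
    return {col: rules[col](val, rng) if col in rules else val
            for col, val in row.items()}


def anonymize_table(table: list, rules: dict, rng: random.Random) -> None:
    """Static masking: overwrite the rows in place, the originals are gone."""
    for index, row in enumerate(table):
        table[index] = mask_row(row, rules, rng)


people = [{"id": 1, "first_name": "Sarah", "last_name": "Connor",
           "phone": "0609110911"}]
anonymize_table(people, MASKING_RULES, random.Random(0))
```

Because the rules live next to the data model rather than in a separate script, adding a column naturally raises the question of which rule it should get.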
19:02.120 --> 19:06.440 It's now, it's done forever. 19:06.440 --> 19:11.720 And as you see, the ID has disappeared, it's NULL. 19:11.720 --> 19:13.480 The first name is the same. 19:13.480 --> 19:15.480 I didn't mask it. 19:15.480 --> 19:24.680 The last name is a fake last name, but it's the same schema, the same model. 19:24.680 --> 19:26.760 OK, let's go with role separation. 19:26.760 --> 19:30.920 Again, you're going to have some roles that will be masked. 19:30.920 --> 19:36.520 So the masking rules will be applied automatically 19:36.520 --> 19:38.680 for these people. 19:38.680 --> 19:42.280 And by definition, a masked role will be read-only. 19:42.280 --> 19:45.880 But while the masked role will be read-only, 19:45.880 --> 19:52.120 the other roles will be able to read the data and write the data. 19:52.120 --> 19:53.320 So let's go. 19:53.320 --> 19:56.360 We're going to create a new role, which is skynet. 19:56.360 --> 19:57.880 And it's allowed to connect. 19:57.880 --> 20:02.440 We're going to activate transparent dynamic masking for him. 20:02.440 --> 20:05.800 And we're going to say, again, with the security label, 20:05.800 --> 20:08.200 that this role is masked. 20:08.200 --> 20:12.520 And we're going to, actually, I'm going to give it 20:12.520 --> 20:16.760 the pg_read_all_data, read all data, privilege 20:16.760 --> 20:20.360 too, because it's easier. I could do things 20:20.360 --> 20:26.360 a little more subtle, but let's go with this. 20:26.360 --> 20:29.800 So now, when skynet connects, 20:29.800 --> 20:31.880 it will connect to the people table. 20:31.880 --> 20:37.480 It will try to read the people table, and it will be masked. 20:37.480 --> 20:41.000 And as you see, the last name has changed, because every time 20:41.000 --> 20:46.440 it queries the data, the function will be called again. 20:46.440 --> 20:49.800 So for the fake data generator, 20:49.800 --> 20:53.960 we will see different fake data every time.
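The behaviour described for masked roles (masked on read, read-only, underlying data untouched) can be modelled in a small sketch. It is plain Python written for illustration, with hypothetical names; it does not show how the extension intercepts queries.

```python
import random

PEOPLE = [{"first_name": "Sarah", "last_name": "Connor", "phone": "0609110911"}]
FAKE_LAST_NAMES = ["Smith", "Jones", "Brown", "Taylor", "Wilson"]  # hypothetical pool
MASKED_ROLES = {"skynet"}  # roles that have been declared as masked


def mask(row: dict, rng: random.Random) -> dict:
    """Build a masked copy of one row; the stored row is not modified."""
    masked = dict(row)
    masked["last_name"] = rng.choice(FAKE_LAST_NAMES)  # re-drawn on every read
    masked["phone"] = row["phone"][:2] + "******" + row["phone"][-2:]  # stable
    return masked


def select_people(role: str, rng: random.Random) -> list:
    """Regular roles get the real rows, masked roles get masked copies."""
    if role in MASKED_ROLES:
        return [mask(row, rng) for row in PEOPLE]
    return [dict(row) for row in PEOPLE]


def update_phone(role: str, index: int, phone: str) -> None:
    """Only non-masked roles may write."""
    if role in MASKED_ROLES:
        raise PermissionError(f"{role} is masked and therefore read-only")
    PEOPLE[index]["phone"] = phone


rng = random.Random(1)
first_read = select_people("skynet", rng)[0]
second_read = select_people("skynet", rng)[0]
real_read = select_people("postgres", rng)[0]
```

The fake last name is drawn again on each read, so two reads by the masked role may differ, whereas the partially destroyed phone number is a pure function of the stored value and stays the same.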
20:53.960 --> 20:59.240 But for the phone, it will always be the same result. 20:59.240 --> 21:03.240 But now, if I connect back as a normal user, as postgres, 21:03.240 --> 21:04.840 sorry. 21:04.840 --> 21:07.000 Is the data actually changed for the people, 21:07.000 --> 21:08.920 or is it a view? 21:08.920 --> 21:11.640 It's called masked. 21:11.640 --> 21:14.600 So is it really changed in the row, 21:14.600 --> 21:18.760 or is it that when you read it, it just changes on the fly? 21:18.760 --> 21:22.600 Yes, sorry. 21:22.600 --> 21:26.680 So the question is, is the data changed in a view 21:26.680 --> 21:28.520 or changed on the fly? 21:28.520 --> 21:30.520 It's changed on the fly, actually. 21:30.520 --> 21:35.800 Basically, what happens is that the SELECT query 21:35.800 --> 21:38.760 will be intercepted. 21:38.760 --> 21:43.720 And the extension will rewrite the query 21:43.720 --> 21:46.760 to display this. 21:46.760 --> 21:50.520 But actually, the masked role doesn't even know 21:50.520 --> 21:51.480 it's masked. 21:51.480 --> 21:53.480 All right? 21:53.480 --> 21:54.920 What if you have it in the index? 21:54.920 --> 21:57.000 Will the mask be available? 21:57.000 --> 21:58.920 No, if you... 21:58.920 --> 22:00.280 Can you repeat the question? 22:00.280 --> 22:01.960 Sorry. 22:01.960 --> 22:04.040 What about the index? 22:04.040 --> 22:06.040 Yes, the index is not masked. 22:06.040 --> 22:10.200 So it's still useful in that case. 22:10.200 --> 22:11.960 So basically, this is it. 22:11.960 --> 22:16.600 You have masking rules written by the Postgres administrator. 22:16.600 --> 22:20.360 Some guy, a regular user, can read and write the data. 22:20.360 --> 22:24.200 And the masked role can only read the masked data. 22:24.200 --> 22:26.200 All right, let's go. 22:26.200 --> 22:28.920 We're going to talk about attack surface reduction.
22:28.920 --> 22:34.040 OK, this one, I think it's going to be a clear example 22:34.040 --> 22:35.240 for you all. 22:35.240 --> 22:37.320 Let's say we have a production database 22:37.320 --> 22:40.840 with you, the DBA, and you've got the developer 22:40.840 --> 22:43.960 and the data scientist, and they want the data. 22:43.960 --> 22:48.040 The developer wants to run some tests with realistic data. 22:48.040 --> 22:51.720 And the data scientist needs to, I don't know, 22:51.720 --> 22:57.080 run a weekly reporting big query about the stats, 22:57.080 --> 22:59.640 about the company and everything. 22:59.640 --> 23:00.520 All right. 23:00.520 --> 23:04.920 So the worst case scenario is this. 23:04.920 --> 23:08.120 You're going to send the real data to both of them, 23:08.120 --> 23:13.320 and they are going to have a copy of the data on their desktop. 23:13.320 --> 23:18.520 Maybe one of them has a Windows laptop, all right? 23:18.520 --> 23:21.560 Just saying. 23:21.560 --> 23:23.320 And so yeah, your attack surface... 23:23.320 --> 23:26.920 The attack surface is everyone. 23:26.920 --> 23:31.640 So if I want to steal your data, I'm not going to target this, 23:31.640 --> 23:35.960 because this is very hard, because you're a very good DBA, 23:35.960 --> 23:38.840 and you have protected this area very well. 23:38.840 --> 23:42.200 And I'm going to attack this guy, or maybe this guy. 23:42.200 --> 23:45.800 Maybe I'm just going to steal this laptop on the bus. 23:45.800 --> 23:50.440 Or maybe I'm just going to put some kind of trojan 23:50.440 --> 23:55.960 horse on this guy's PC, all right? 23:55.960 --> 23:57.640 So we don't want that. 23:57.640 --> 24:00.520 This is the worst scenario. 24:00.520 --> 24:03.640 So what most people would do is this, 24:03.640 --> 24:08.120 extract the data, transform it, and then push it 24:08.120 --> 24:11.880 to these two environments. 24:11.880 --> 24:12.880 This is nice.
24:12.880 --> 24:17.320 This is good, I'm not judging this. 24:17.320 --> 24:18.520 It's correct. 24:18.520 --> 24:22.120 But as you see, the attack surface is not 24:22.120 --> 24:23.520 reduced that much. 24:23.520 --> 24:28.760 So those two guys are not an attack vector now, 24:28.760 --> 24:33.440 but you've got a new guy in the loop, in the pipeline. 24:33.440 --> 24:38.560 And this guy is also a vector of attack, right? 24:38.560 --> 24:41.880 So in some ways, you've reduced the surface, 24:41.880 --> 24:47.000 but the whole pipeline is a bit more complicated. 24:47.000 --> 24:49.520 And of course, in the age of AI. 24:49.520 --> 24:50.880 So you have this now. 24:50.880 --> 24:55.280 A lot of new startups saying, oh, just send us your data, 24:55.280 --> 24:59.080 and we have a new AI thing that will analyze this. 24:59.080 --> 25:02.840 So this is the worst idea in the entire history 25:02.840 --> 25:08.240 of data privacy, first because there's someone 25:08.240 --> 25:12.040 in this cloud who will basically be 25:12.040 --> 25:15.240 able to see the logs of this AI thing. 25:15.240 --> 25:19.440 And you don't even know how the AI has been trained. 25:19.440 --> 25:22.400 You don't even know what they do with it. 25:22.400 --> 25:24.240 So yeah, this is a nightmare. 25:27.120 --> 25:32.240 So what we want to do is this. 25:32.240 --> 25:35.080 We reduce the attack surface to this, 25:35.080 --> 25:38.240 which is, as I said earlier, very secure, 25:38.240 --> 25:40.520 because you're a very good DBA. 25:40.520 --> 25:46.080 And we're going to either push the data with an anonymized 25:46.080 --> 25:51.000 dump, which we'll push, right, to this environment. 25:51.000 --> 25:54.880 Or we're just going to give access to this guy. 25:54.880 --> 25:57.040 So he doesn't have a copy anymore. 25:57.040 --> 26:00.000 He's just allowed to connect to the production. 26:00.000 --> 26:01.800 But he is masked.
26:01.800 --> 26:06.240 So he won't see any personal data. 26:06.240 --> 26:08.680 All right. 26:08.680 --> 26:10.320 So let's do it. 26:10.320 --> 26:15.080 How can we do this pg_dump thing, the anonymous dump thing? 26:15.080 --> 26:18.120 Well, again, we're going to create a new user, 26:18.120 --> 26:22.200 a new role, which will be used just for the dumps. 26:22.200 --> 26:25.880 And we will activate transparent dynamic masking. 26:25.880 --> 26:29.800 We're going to say it's masked, and let's go. 26:29.800 --> 26:33.120 Now I'm just going to use pg_dump with this user. 26:33.120 --> 26:35.880 And I'm going to get a masked dump. 26:35.880 --> 26:38.680 And now with this anonymous dump, 26:38.680 --> 26:42.280 I can share it everywhere on my network. 26:42.280 --> 26:46.240 I can send it by email to someone. 26:46.240 --> 26:50.800 It's completely out of the attack surface, 26:50.800 --> 26:54.880 because there's no personal data in the dump. 26:54.880 --> 26:57.920 Of course, if you use pg_dump with a regular user, 26:57.920 --> 27:03.040 you will get a dump with the real data. 27:07.040 --> 27:10.240 So we're going to use this dump to refresh the environments. 27:10.240 --> 27:13.040 And again, this is a regular pg_dump. 27:13.040 --> 27:17.360 It's not a wrapper, and you can use any kind of option you 27:17.360 --> 27:20.720 use with a classic pg_dump. 27:20.720 --> 27:26.600 And notably, the custom format works. 27:26.600 --> 27:27.720 OK. 27:27.720 --> 27:33.520 So let's go again, the next principle is data minimization. 27:33.520 --> 27:38.160 So in most cases, when you anonymize something for someone, 27:38.160 --> 27:41.400 they don't really need all of the data. 27:41.400 --> 27:44.280 A sample of the data is sufficient for tests, 27:44.280 --> 27:47.520 for analytics, for demos, for training data. 27:47.520 --> 27:51.640 Most of the time, just one small part of the entire data 27:51.640 --> 27:53.440 set is enough.
27:53.440 --> 27:55.920 So this is called sampling. 27:55.920 --> 27:57.960 Maybe you knew that Postgres already 27:57.960 --> 27:59.160 has this? 27:59.160 --> 28:02.760 It's a clause called TABLESAMPLE. Has anyone ever 28:02.760 --> 28:04.800 used it here? 28:04.800 --> 28:06.080 Oh, two guys. 28:06.080 --> 28:07.080 Great. 28:07.080 --> 28:09.240 So yeah, Postgres has had this already, 28:09.240 --> 28:11.920 for, I don't know, years and years. 28:11.920 --> 28:16.280 And you're able to say, I just want this fraction 28:16.280 --> 28:19.080 of the result. 28:19.080 --> 28:23.520 So let's say we have a big HTTP log table with the date, 28:23.520 --> 28:28.040 with the IP address, the URL, everything. 28:28.040 --> 28:31.800 So we're going to put a security label on the IP address 28:31.800 --> 28:34.440 thing, so we're just going to destroy the IP address. 28:34.440 --> 28:37.240 We don't need it for the stats, for example. 28:37.240 --> 28:41.120 And we're going to just send 10% of the table 28:41.120 --> 28:42.520 to the masked users. 28:42.520 --> 28:46.600 So we'll see 100 percent of the table. 28:46.600 --> 28:51.040 The masked users will see only 10% of the table. 28:51.040 --> 28:53.640 All right. 28:53.640 --> 28:58.880 And we can also sample with RLS policies. 28:58.880 --> 29:01.640 Another great feature of Postgres. 29:01.640 --> 29:07.240 Again, who has used row-level security policies here? 29:07.240 --> 29:10.000 OK, yeah, good. 29:10.000 --> 29:12.760 So basically, row-level security policies 29:12.760 --> 29:14.840 are like filters you're going to put 29:14.840 --> 29:18.080 at the row level of each table. 29:18.080 --> 29:25.400 And so with this one, again, we're going to apply those rules. 29:25.400 --> 29:30.280 So this is, again, pure SQL, nothing fancy. 29:30.280 --> 29:33.320 So we're going to create a policy on the logs.
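The sampling idea described above (masked users see roughly 10 percent of the log table, with the IP address destroyed, while regular users see everything) can be sketched as follows. This is a plain-Python illustration with hypothetical names; it assumes Bernoulli-style sampling, where each row is kept independently with the given probability.

```python
import random

MASKED_ROLES = {"skynet"}   # hypothetical masked role
SAMPLE_FRACTION = 0.10      # share of rows a masked role may see


def read_logs(logs: list, role: str, rng: random.Random) -> list:
    """Return all rows to regular roles, a masked ~10% sample to masked roles."""
    if role not in MASKED_ROLES:
        return list(logs)
    # Bernoulli sampling: keep each row independently with probability 0.10.
    sample = [row for row in logs if rng.random() < SAMPLE_FRACTION]
    # The IP address is not needed for statistics, so it is destroyed.
    return [{**row, "ip_address": None} for row in sample]


http_logs = [
    {"id": i, "ip_address": f"10.0.{i // 256}.{i % 256}", "url": "/index.html"}
    for i in range(10_000)
]

full = read_logs(http_logs, "postgres", random.Random(7))
sample = read_logs(http_logs, "skynet", random.Random(7))
```

The sample size is only approximately 10 percent because every row is decided independently; a consumer that needs an exact row count would need a different sampling method.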
29:33.320 --> 29:38.800 And we're going to say that we're going to apply this rule 29:38.800 --> 29:40.400 only for the masked users. 29:40.400 --> 29:45.080 That is, if the current user has a mask, 29:45.080 --> 29:50.040 and we're going to only allow the values 29:50.040 --> 29:52.840 that are less than six months old. 29:52.840 --> 29:57.200 So the masked user will only see the latest data. 29:57.200 --> 30:01.000 Only the data of the last six months. 30:01.000 --> 30:03.120 All right? 30:03.120 --> 30:04.720 OK. 30:04.720 --> 30:08.880 I really like this example, actually. 30:08.880 --> 30:11.280 OK, let's go with risk evaluation. 30:11.280 --> 30:12.280 This one is hard. 30:12.280 --> 30:16.680 It's where we still have a lot of work to do, in this area. 30:16.680 --> 30:19.560 But we have two features for that. 30:19.560 --> 30:21.680 One is called k-anonymity. 30:21.680 --> 30:23.640 And the other is a detection function. 30:23.640 --> 30:24.800 So let's go. 30:24.800 --> 30:28.040 And talk about k-anonymity. 30:28.040 --> 30:29.760 It's an industry standard, basically. 30:29.760 --> 30:34.160 You will find it in other tools too, hopefully. 30:34.160 --> 30:40.920 And it's a factor that measures the risk 30:40.920 --> 30:45.200 of re-identifying someone within your data set. 30:45.200 --> 30:48.160 So maybe you applied your masking rules, 30:48.160 --> 30:51.640 but there's still some guy that is an edge case 30:51.640 --> 30:55.240 and you're still able to find one unique person 30:55.240 --> 30:56.520 within your data set. 30:56.520 --> 30:57.560 Right? 30:57.560 --> 31:00.240 So it's just a function, actually, 31:00.240 --> 31:02.520 that you can run on your table. 31:02.520 --> 31:09.680 And it will try to guess how many single people you can find. 31:09.680 --> 31:14.760 No, not how many single, but how difficult it 31:14.760 --> 31:17.160 would be to identify someone.
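The k-anonymity factor can be computed by hand on a small table: group the rows by the combination of indirectly identifying columns and take the size of the smallest group. The sketch below is plain Python for illustration; the sample table and the choice of indirect identifiers are assumptions, not data from the talk.

```python
from collections import Counter


def k_anonymity(rows: list, indirect_identifiers: list) -> int:
    """Size of the smallest group of rows sharing the same indirect identifiers."""
    if not rows:
        raise ValueError("k-anonymity is undefined for an empty table")
    groups = Counter(
        tuple(row[column] for column in indirect_identifiers) for row in rows
    )
    return min(groups.values())


# Hypothetical, already generalized data set.
patients = [
    {"zipcode": "75***", "age_range": "30-39", "disease": "flu"},
    {"zipcode": "75***", "age_range": "30-39", "disease": "asthma"},
    {"zipcode": "75***", "age_range": "30-39", "disease": "flu"},
    {"zipcode": "69***", "age_range": "40-49", "disease": "diabetes"},
    {"zipcode": "69***", "age_range": "40-49", "disease": "flu"},
    {"zipcode": "69***", "age_range": "40-49", "disease": "asthma"},
]

k = k_anonymity(patients, ["zipcode", "age_range"])

# One edge-case row is enough to bring the factor down to 1.
with_outlier = patients + [
    {"zipcode": "13***", "age_range": "90-99", "disease": "flu"}
]
k_outlier = k_anonymity(with_outlier, ["zipcode", "age_range"])
```

A factor of 1 means at least one person is unique on those columns, which is exactly the edge case mentioned above; stronger generalization or removing the outlier raises the factor.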
31:17.160 --> 31:19.960 So here the factor is three, which is good, 31:19.960 --> 31:22.720 but not the best. 31:22.720 --> 31:24.760 The higher the value is, the better. 31:24.760 --> 31:25.840 All right? 31:25.840 --> 31:28.120 If you have one, it means that there's one guy 31:28.120 --> 31:32.920 that is unique in your data set. 31:32.920 --> 31:36.560 And we also have a detection function. 31:36.560 --> 31:43.080 So we're going to scan all your tables and all the columns. 31:43.080 --> 31:46.080 And with heuristics, we're going to try to guess 31:46.080 --> 31:51.680 that maybe this customer first name column 31:51.680 --> 31:53.400 doesn't have a masking rule. 31:53.400 --> 31:59.640 And maybe this one is a direct identifier. 31:59.640 --> 32:01.880 So it's not perfect. 32:01.880 --> 32:05.720 I think we have a lot of room to improve this one. 32:05.720 --> 32:09.160 But this one is like a checker to avoid 32:09.160 --> 32:10.760 missing one column. 32:10.760 --> 32:18.200 And finally, privacy by default. Like I said, 32:18.200 --> 32:24.800 if you don't know whether a column holds personal data, 32:24.800 --> 32:28.360 just mask it by default. 32:28.360 --> 32:33.360 So again, we're going to take the HTTP log table. 32:33.360 --> 32:38.680 And so yeah, we have a date, the IP address, et cetera, 32:38.840 --> 32:47.200 and we're going to just activate the privacy by default parameter. 32:47.200 --> 32:50.600 And instead of masking the data, we're going to unmask the data. 32:50.600 --> 32:54.040 So now, we've activated privacy by default. 32:54.040 --> 32:56.840 So everything is going to be masked. 32:56.840 --> 33:00.480 And we're going to unmask the thing we want to see. 33:00.480 --> 33:08.280 So maybe, yeah, maybe the log URL, we want to see it. 33:08.280 --> 33:10.240 So we're going to unmask it. 33:10.240 --> 33:12.520 I'm going to say it's not masked. 33:12.520 --> 33:17.320 And the date, we're going to just keep the year.
33:17.320 --> 33:20.680 And we remove the month and the day. 33:20.680 --> 33:25.160 So we generalize the date. 33:25.160 --> 33:31.560 And so with this, once I anonymize the table, 33:31.560 --> 33:35.760 what I get is, OK, the date has been changed. 33:35.760 --> 33:40.320 I just keep the year. 33:40.320 --> 33:42.560 The IP address has disappeared. 33:42.560 --> 33:45.320 For the rest, I took the default value. 33:45.320 --> 33:48.120 So I have the default value of this column. 33:48.120 --> 33:52.160 And I unmasked the URL. 33:52.160 --> 33:55.000 Right. 33:55.000 --> 33:57.080 OK, I'm almost finished now. 33:57.080 --> 34:01.800 So yeah, the battle for privacy is happening right now. 34:01.800 --> 34:05.360 Whether you want it or not, you're in the middle of it. 34:05.360 --> 34:07.000 So you have a responsibility. 34:07.000 --> 34:08.800 And you're also a target for data leaks. 34:12.240 --> 34:17.920 So you need to step up and take action. 34:17.920 --> 34:23.520 But you don't have to take all actions at once. 34:23.520 --> 34:28.480 You can improve things in iterations. 34:28.480 --> 34:31.080 And so first things first, the one you need 34:31.080 --> 34:33.720 to do is privacy by design. 34:33.720 --> 34:40.720 So you need to go talk to the developers about privacy 34:40.720 --> 34:45.760 and ask them which columns hold private data, which one 34:45.760 --> 34:49.080 is OK to publish, et cetera, et cetera. 34:49.080 --> 34:54.720 And maybe, if this is software that you 34:54.720 --> 35:00.720 bought, you need to go and reach the editor of the software. 35:00.720 --> 35:04.880 And say it's their responsibility, their software. 35:04.880 --> 35:09.120 They have designed a database model. 35:09.120 --> 35:14.120 It's up to them to tell you how you should apply masking 35:14.120 --> 35:15.200 upon this thing.
35:15.200 --> 35:17.920 It's not up to you to reverse engineer 35:17.920 --> 35:21.120 that database model to guess which columns should 35:21.120 --> 35:22.400 be masked or not. 35:22.400 --> 35:23.680 It's their responsibility. 35:23.680 --> 35:27.760 Don't let them avoid that responsibility. 35:27.760 --> 35:30.000 And once again, they can do it with PostgreSQL Anonymizer. 35:30.000 --> 35:31.560 It's free, it's open source. 35:31.560 --> 35:34.760 Or they can do it with another tool, maybe they 35:34.760 --> 35:37.440 would want to develop their own tooling. 35:37.440 --> 35:40.400 Maybe a lot of editors, right now, 35:40.400 --> 35:45.800 would provide you with their own scripts. 35:45.800 --> 35:49.120 I would say that it's just like rolling your own crypto 35:49.120 --> 35:53.080 or rolling your own database backup scripts. 35:53.080 --> 35:53.720 It's nice. 35:53.720 --> 35:57.040 It's fun to do, but basically, just don't reinvent 35:57.040 --> 36:00.240 the wheel, use proper masking tools. 36:00.240 --> 36:04.360 Do not reinvent your own masking tools. 36:04.360 --> 36:05.440 I did it. 36:05.440 --> 36:06.840 I know what it costs. 36:10.840 --> 36:13.040 Then once you've done this, yeah, 36:13.040 --> 36:16.080 once you have this philosophical mindset 36:16.080 --> 36:20.080 of doing privacy by design, everything else is easier. 36:20.080 --> 36:22.320 Once you've said, every time we add something, 36:22.320 --> 36:24.800 we think about privacy at the very beginning, 36:24.800 --> 36:26.800 everything else is easier. 36:26.800 --> 36:31.440 And so the next steps are role separation, 36:31.440 --> 36:34.480 attack surface reduction, 36:34.480 --> 36:37.400 and finally, anonymization, privacy 36:37.400 --> 36:39.040 by default, et cetera. 36:39.040 --> 36:42.240 So of course, you can do it in another order, 36:42.240 --> 36:45.800 but that's my advice. 36:45.800 --> 36:48.120 And do not fight alone.
36:48.120 --> 36:49.920 Get a team. 36:49.920 --> 36:56.040 Because again, it involves a lot of different talents 36:56.040 --> 36:56.880 to do this. 36:56.880 --> 36:59.760 So you have to have the developers with you. 36:59.760 --> 37:01.840 You have to have the sysadmin, the DPO, 37:01.840 --> 37:06.440 the software editor, if that's software that you bought, 37:06.440 --> 37:08.880 etc., etc., yeah. 37:08.880 --> 37:12.640 Do not do this by yourself on your own. 37:12.640 --> 37:14.040 It's teamwork. 37:16.200 --> 37:19.920 And again, different strategies for different use cases. 37:19.920 --> 37:24.920 There's not one single strategy that works for everyone. 37:24.920 --> 37:27.240 You even have some databases where you 37:27.240 --> 37:29.320 have multiple masking policies. 37:29.320 --> 37:33.440 So maybe one policy just for GDPR. 37:33.440 --> 37:37.520 And another policy for, I don't know, commercial secrets. 37:37.520 --> 37:39.960 And commercial secrets are not covered 37:39.960 --> 37:42.600 by GDPR or anything like that. 37:42.600 --> 37:45.320 So different tools, different techniques 37:45.320 --> 37:48.440 for different contexts. 37:48.440 --> 37:51.040 And that's about it. 37:51.040 --> 37:56.640 So yeah, the extension is available on GitLab. 37:56.640 --> 37:58.720 I've written a four-hour tutorial, so 37:58.720 --> 38:00.720 if you want to try it, there's documentation, 38:00.720 --> 38:03.280 you can try the tutorial. 38:03.280 --> 38:06.320 And you can contact me if you have any questions. 38:06.320 --> 38:09.680 And also, if some people already use the extension 38:09.680 --> 38:13.840 in the room, I should have asked at the beginning. 38:13.840 --> 38:16.720 Does anyone already use the extension? 38:16.720 --> 38:19.000 OK, one guy, yeah, good.
38:19.000 --> 38:20.960 So if anyone's using it, on GitLab, 38:20.960 --> 38:23.360 there's a survey right now where 38:23.360 --> 38:26.760 I try to get feedback from the users to know 38:26.760 --> 38:28.280 what they do with the extension. 38:28.280 --> 38:34.240 So yeah, you can answer that if you're using it. 38:34.240 --> 38:36.240 And that's it for me. 38:36.240 --> 38:39.760 So please do not move, wait for the end of the questions 38:39.760 --> 38:42.960 to move, it's easier for me to answer that way. 38:42.960 --> 38:43.960 OK. 38:43.960 --> 38:54.640 That was a very interesting talk. 38:54.640 --> 38:55.840 Thank you. 38:55.840 --> 38:56.680 Any questions? 38:56.680 --> 38:57.680 Ooh. 39:06.000 --> 39:07.640 All right, yeah, sure. 39:07.640 --> 39:08.800 I have a question. 39:08.800 --> 39:09.840 One more. 39:09.840 --> 39:10.640 One more. 39:10.640 --> 39:13.480 OK, so just wait a bit for the microphone. 39:13.480 --> 39:17.080 And after this one, OK. 39:17.080 --> 39:18.440 Hey, thanks for the great talk. 39:18.440 --> 39:19.680 I have a question. 39:19.680 --> 39:24.840 Can you use logical replication with the masked data? 39:24.840 --> 39:29.280 I thought, because sometimes dumping a big database 39:29.280 --> 39:31.480 is very time consuming and still we 39:31.480 --> 39:35.160 want to have masked data somewhere 39:35.160 --> 39:38.200 on the staging environment, let's say. 39:38.200 --> 39:41.240 Yeah, that's a great question. 39:41.240 --> 39:44.720 Actually, so currently, we are at v2. 39:44.720 --> 39:48.040 And we're working on v3, version 3. 39:48.040 --> 39:50.480 And so we are working on something 39:50.480 --> 39:55.360 that we are going to call the masking logical decoder, 39:55.360 --> 39:58.520 where you can subscribe to the publication 39:58.520 --> 40:02.480 and have your logical replication, but with the masking 40:02.480 --> 40:05.080 rules applied on the fly.
40:05.080 --> 40:08.880 So yeah, it's not ready at all, but yeah, 40:08.880 --> 40:12.560 we're really thinking about that, because a lot of people, 40:12.560 --> 40:15.000 well, some people 40:15.000 --> 40:20.960 cannot, well, as I said, where is this? 40:20.960 --> 40:23.320 Yeah, a lot of people cannot do this, 40:23.320 --> 40:24.840 because there's too much data. 40:24.840 --> 40:28.040 It takes too long, OK? 40:28.040 --> 40:32.080 So yeah, again, because maybe the regular dump 40:32.080 --> 40:33.760 will take one hour. 40:33.760 --> 40:38.400 But depending on how many rules you have on your data set, 40:38.400 --> 40:41.320 the anonymized dump will take maybe three hours. 40:41.320 --> 40:42.080 All right? 40:42.080 --> 40:43.960 It has a cost. 40:43.960 --> 40:44.320 OK. 40:47.960 --> 40:49.320 Yeah, thanks. 40:49.320 --> 40:51.640 I was wondering about determinism in the masking. 40:51.640 --> 40:55.400 So is it deterministic also across different databases? 40:55.400 --> 40:58.760 If I have the same input values in one of my databases, 40:58.760 --> 41:01.480 anonymized according to the same rule, will it end up 41:01.480 --> 41:03.200 fully consistent across my data? 41:03.200 --> 41:05.960 Yeah, actually, it depends. 41:05.960 --> 41:06.840 You choose. 41:06.840 --> 41:11.400 So you have some, where is it? 41:11.400 --> 41:12.200 All right. 41:12.200 --> 41:13.040 Yeah. 41:13.040 --> 41:17.520 So among these types of masking rules, 41:17.520 --> 41:19.480 some of them are deterministic. 41:19.480 --> 41:22.000 Meaning that we always return the same result 41:22.000 --> 41:24.520 for the same value, all right? 41:24.520 --> 41:27.800 But some are completely random. 41:27.800 --> 41:33.320 So basically, random and faking are always different. 41:33.320 --> 41:38.520 But if you want to have the same fake value across 41:38.520 --> 41:46.000 different databases, this is called pseudonymization.
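The difference between a deterministic fake value and a random one can be sketched in a few lines of Python. This is an assumption-laden illustration, not how the extension does it: the `pseudonymize` helper, the `FAKE_NAMES` list and the salt strings are hypothetical. A keyed hash of the real value picks the fake value, so the same input with the same salt gives the same output in every database.

```python
import hashlib
import hmac

# Hypothetical pool of fake values to draw from.
FAKE_NAMES = ["Alice", "Bob", "Carol", "Dave", "Erin", "Frank"]


def pseudonymize(value: str, salt: str) -> str:
    """Map a real value to a fake one, deterministically for a given salt."""
    digest = hmac.new(salt.encode(), value.encode(), hashlib.sha256).digest()
    # Use the first 8 bytes of the keyed hash as an index into the pool.
    index = int.from_bytes(digest[:8], "big") % len(FAKE_NAMES)
    return FAKE_NAMES[index]


# The same input and the same salt always give the same fake name,
# so two databases masked with the same salt stay consistent.
first = pseudonymize("Sarah Connor", "shared-secret-salt")
second = pseudonymize("Sarah Connor", "shared-secret-salt")
print(first == second)
```

Anyone who knows the salt and can enumerate plausible inputs can rebuild the mapping, which is one reason pseudonymized data is still treated as personal data.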
41:46.000 --> 41:50.640 And so pseudonymization is not anonymization. 41:50.640 --> 41:52.240 It's not the same thing. 41:52.240 --> 41:57.760 Because there's a way to go back, because it's deterministic, 41:57.760 --> 42:00.120 once there's a deterministic transformation, 42:00.120 --> 42:01.760 you can always go back. 42:01.760 --> 42:04.240 So you need to be very careful about that. 42:04.240 --> 42:10.360 And the GDPR regulation is very clear that if you have 42:10.360 --> 42:14.760 done pseudonymization, your data is still 42:14.760 --> 42:18.640 considered personal data, actually. 42:22.440 --> 42:24.960 My question is a bit of a follow-up. 42:24.960 --> 42:29.240 The TABLESAMPLE functionality. 42:29.480 --> 42:33.000 Will that always return the same n elements, 42:33.000 --> 42:37.400 or is it a completely random sample on each and every call? 42:37.400 --> 42:39.720 No, it's a random sample. 42:39.720 --> 42:40.880 It's random by default. 42:40.880 --> 42:43.840 You can't know which ones you'll get. 42:43.840 --> 42:45.400 If you want something deterministic, 42:45.400 --> 42:47.320 you need something like that. 42:47.320 --> 42:51.560 So if you want to use that function to limit the access 42:51.560 --> 42:55.600 of some analyst, if he calls the function often 42:55.600 --> 42:57.680 enough, he can still get your full data set. 43:00.960 --> 43:03.160 Yeah, that's a good question. 43:03.160 --> 43:07.200 So this is linked to another topic 43:07.200 --> 43:10.240 called differential privacy, where at some point, 43:10.240 --> 43:13.080 with some of these, actually this one, you have it also 43:13.080 --> 43:14.200 when you have noise. 43:14.200 --> 43:22.200 If you put noise upon, for example, a date of birth, 43:22.200 --> 43:28.320 the masked role can ask for the date of birth multiple times. 43:28.320 --> 43:32.520 And the noise will be applied each time with the same 43:32.520 --> 43:34.360 random function.
43:34.360 --> 43:38.800 So he's going to find out the truth at some point, 43:38.800 --> 43:42.440 just by taking the mean of all the results he got. 43:42.440 --> 43:46.120 So this is an area we're working on, which is called 43:46.120 --> 43:51.360 differential privacy, but for now, yes, 43:51.360 --> 43:56.040 you need to somehow track how many times your masked roles 43:56.040 --> 44:00.640 access the data and how close they are getting 44:00.640 --> 44:03.760 to the truth. 44:03.760 --> 44:06.760 Just a little addition on TABLESAMPLE. 44:06.760 --> 44:08.560 It supports salting? 44:08.560 --> 44:13.080 So maybe it's possible to repeat 44:13.080 --> 44:14.600 at least the same data set. 44:14.600 --> 44:15.280 Yeah, sure. 44:15.280 --> 44:19.240 Yeah, you can define a salt, and so the salt will be 44:19.240 --> 44:23.200 used. Good point. 44:23.200 --> 44:25.640 Great talk. 44:25.640 --> 44:29.840 I wanted to ask, if someone has an already established database 44:29.840 --> 44:32.080 that is not anonymized at all, 44:32.080 --> 44:35.640 what's the best approach to go about that? 44:35.640 --> 44:39.960 Yeah, so the question is what to do when the database 44:39.960 --> 44:41.160 is already in production? 44:41.160 --> 44:45.440 Yeah, it's the worst scenario, because you need to look 44:45.440 --> 44:48.440 at the database model and try to guess what 44:48.440 --> 44:50.360 the developers tried to do. 44:50.360 --> 44:56.560 So if you can, just try to find the people that made it. 44:56.560 --> 44:59.280 Otherwise, yeah, just privacy by default, 44:59.280 --> 45:02.960 just turn it on and everything is private. 45:02.960 --> 45:06.360 Can I ask another question? 45:06.360 --> 45:11.680 The sampling of the database, how does it work with foreign keys 45:11.680 --> 45:12.360 and relationships? 45:12.360 --> 45:14.680 Yeah, that's a very, very good question.
45:14.680 --> 45:18.320 So basically, the extension does not guarantee 45:18.320 --> 45:20.840 you that foreign keys will be respected. 45:20.840 --> 45:22.640 So you can break them. 45:22.640 --> 45:23.640 Yeah, it's up to you. 45:23.640 --> 45:25.920 You can keep it or you can destroy it. 45:25.920 --> 45:29.000 But yeah, it's up to you to decide whether or not 45:29.000 --> 45:33.240 you want to keep referential integrity, right? 45:33.240 --> 45:35.400 Because in some cases, it's important to break 45:35.400 --> 45:36.880 referential integrity. 45:36.880 --> 45:40.080 But for some use cases, you want to keep it. 45:40.080 --> 45:42.160 It really depends, actually. 45:42.160 --> 45:44.160 So there's no hard rule about that. 45:44.160 --> 45:49.760 It's like a NOT NULL. Yeah, in this example, 45:49.760 --> 45:54.480 for example, there's a NOT NULL constraint on the ID. 45:54.480 --> 45:58.600 But I'm putting a NULL value, all right? 45:58.600 --> 46:02.400 And so the user will get a result that does 46:02.400 --> 46:04.800 not respect the database schema. 46:04.800 --> 46:07.000 But I do it on purpose. 46:07.000 --> 46:09.240 Of course, yes. 46:09.240 --> 46:11.080 OK, thank you for the talk. Over here. 46:12.160 --> 46:16.600 Did you measure the overhead and performance impact 46:16.600 --> 46:18.040 of the extension? 46:18.040 --> 46:21.120 Because if you do it on the fly, it has to. 46:21.120 --> 46:22.320 Yeah. 46:22.320 --> 46:26.720 So the overhead, it really depends on different factors. 46:26.720 --> 46:29.600 So there's the number of lines, the number of rules you have, 46:29.600 --> 46:31.560 and the complexity of the rules. 46:31.560 --> 46:35.440 Of course, this kind of rule is very fast. 46:35.440 --> 46:36.880 Destroying is fast. 46:36.880 --> 46:40.040 But generating a random value is very slow.
46:42.520 --> 46:47.920 But the thing to understand is that with different strategies, 46:47.920 --> 46:50.560 the cost is not paid by the same users. 46:50.560 --> 46:54.280 With dynamic masking, the cost is paid by the masked role. 46:54.280 --> 46:56.280 So you can say, OK, you're a masked role, you're 46:56.280 --> 46:58.120 going to pay for this. 46:58.120 --> 47:00.960 So you're going to have a very, very slow result, 47:00.960 --> 47:04.480 whereas the normal users will have regular performance. 47:04.480 --> 47:06.800 Whereas, if you do anonymized dumps, 47:06.800 --> 47:10.720 the price is paid by everyone, right? 47:10.720 --> 47:14.800 So it depends on who pays the cost. 47:14.800 --> 47:18.960 In the use case of a masked role, 47:18.960 --> 47:23.120 when do I need to use a masked role instead of simply 47:23.120 --> 47:26.840 restricting access to some columns? 47:26.840 --> 47:30.880 So I could grant the data scientist from your example 47:30.880 --> 47:34.280 access to the required columns only. 47:34.280 --> 47:37.760 And just restrict access to the columns which you actually 47:37.760 --> 47:39.440 mask. 47:39.440 --> 47:43.120 Maybe I don't see the whole picture here. 47:43.120 --> 47:47.520 When do I need to mask the data instead of simply 47:47.520 --> 47:50.480 restricting access to this data? 47:50.480 --> 47:55.720 Oh, but maybe because this data scientist, for example, 47:55.720 --> 48:02.600 he needs to run some queries upon, sorry, I went too fast. 48:02.600 --> 48:04.320 No, this one, yeah. 48:04.320 --> 48:09.000 So maybe he wants to run stats over the years. 48:09.000 --> 48:13.160 And he wants to make some stats about the HTTP logs. 48:13.160 --> 48:15.560 So he needs the year. 48:15.560 --> 48:18.160 He's going to group by year. 48:18.160 --> 48:24.160 So you can just erase the day and the month, 48:24.160 --> 48:25.840 but you keep the year, right? 48:25.840 --> 48:26.680 I see.
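The "keep the year, erase the day and the month" idea from this answer can be sketched in Python. This is a hypothetical illustration only: the `generalize_to_year` helper and the sample log rows are made up for the example. It shows that per-year statistics still work on the generalized dates.

```python
from collections import Counter
from datetime import date


def generalize_to_year(d: date) -> date:
    """Erase the day and the month, keep only the year."""
    return date(d.year, 1, 1)


# Hypothetical HTTP log rows.
logs = [
    {"date": date(2021, 11, 30), "url": "/home"},
    {"date": date(2022, 3, 14), "url": "/home"},
    {"date": date(2022, 7, 2), "url": "/pricing"},
]

# What the masked user would see: same rows, generalized dates.
masked = [
    {"date": generalize_to_year(row["date"]), "url": row["url"]}
    for row in logs
]

# Grouping by year still gives the same counts as on the real data.
hits_per_year = Counter(row["date"].year for row in masked)
print(hits_per_year)
```

Simply revoking access to the date column would make this kind of aggregate impossible, whereas the generalized column keeps it available.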
48:26.680 --> 48:28.920 So that's called generalization. 48:28.920 --> 48:30.840 And it's enough for him. 48:30.840 --> 48:35.000 So he has access, but he has a different access. 48:35.000 --> 48:38.480 So when even more granular control is required. 48:38.480 --> 48:40.680 Yes, and inside the data, for example, one part 48:40.680 --> 48:42.560 is visible, but another is not. 48:42.560 --> 48:43.280 Yeah. 48:43.280 --> 48:44.120 OK, thank you. 48:44.120 --> 48:45.920 And you can write your own, of course. 48:45.920 --> 48:51.360 So the main use case for this is NoSQL and JSON data. 48:51.360 --> 48:53.520 So because if you have JSON data, 48:53.520 --> 48:55.640 it's going to be a nightmare to mask, 48:55.640 --> 48:59.160 because there's no schema. 48:59.160 --> 49:05.400 And so probably you're going to have to write your own functions 49:05.400 --> 49:11.680 to go deep inside the JSON values to modify them. 49:11.680 --> 49:14.440 You mentioned that indexes are not masked. 49:14.440 --> 49:18.320 But if I'm a masked user, and I do a query, 49:18.320 --> 49:21.840 say, SELECT FROM users WHERE last name 49:21.840 --> 49:26.520 is Connor, would I get Sara Spellman or would 49:26.520 --> 49:28.280 the query actually run against the anonymized data? 49:28.280 --> 49:32.080 OK, the result, if you have a non-deterministic value, 49:32.080 --> 49:33.600 the query will change every time. 49:33.600 --> 49:35.080 Yes. 49:35.080 --> 49:37.320 The result will change every time. 49:37.320 --> 49:38.520 Again and again. 49:38.520 --> 49:42.480 No, I mean, the queries are actually going to match 49:42.480 --> 49:45.520 against the masked values, not against the original value. 49:45.520 --> 49:47.720 So right. 49:47.720 --> 49:50.800 Yeah, no, you're going to match it against the masked value, 49:50.800 --> 49:51.760 not the real value. 49:51.760 --> 49:52.280 OK. 49:52.280 --> 49:53.200 So it won't work.
49:53.240 --> 49:57.960 You cannot do WHERE name is Connor, no, it won't work. 49:57.960 --> 49:59.840 But I need to go actually. 49:59.840 --> 50:01.000 Just take a picture. 50:01.000 --> 50:02.000 Yes. 50:02.000 --> 50:03.360 Thank you very much. 50:08.360 --> 50:12.440 I'll be outside if you have questions. 50:12.440 --> 50:13.040 Sorry. 50:13.040 --> 50:13.840 Privacy. 50:13.840 --> 50:15.080 It's amazing.