All righty. Good morning. I'm going to start by saying I agree with the previous speakers: we shouldn't be talking about digital wallets. It's fundamentally the wrong abstraction for the development and standardization layer. It might be good for end users, though.

So today I'm going to be talking about verifiable credentials in relation to query, and how we can do zero knowledge over that query. I'm going to start with a bit about myself. I'm Jesse Wright, and I wear many hats. One of those hats is as a researcher at Oxford, where I'm working on neuro-symbolic AI systems and trying to build AI agents that faithfully represent human entities, work faithfully with data on your behalf, and help you manage your life.

I'm also lead of the Solid project over at the Open Data Institute, which is a project that Tim Berners-Lee created back in 2014 at MIT to give people back control of their personal data. Solid is a personal data store solution that allows you to build applications that don't store data on corporate servers, but instead store your personal data in cloud storage such that you can reuse it as you go across the web. We talk about being able to use digital wallets for things like sharing your address and proving your address. With Solid, I have an authoritative source of my address. That means I update it once and my insurance company gets it straight away; my university gets it straight away. I have authority over my data.

I've been working on Solid in particular for many years. I used to work for Tim's company, Inrupt, as an enterprise software engineer and data architect, and Solid is already very heavily involved in the data wallets space - in fact, one of the donated open wallet contributions (thank you) came from Inrupt. And I work a lot on standards at the moment: I'm in 14 standards bodies, mainly through the W3C, mainly coming from the semantic web perspective, as is going to become obvious throughout my talk.

So what is a wallet? What are we talking about when we have a digital wallet? My suggestion is that it's a way to prove that someone said something. Fundamentally, that is what we're doing. When I get issued a credential that holds my driver's licence, all that is saying is that a driving authority - the driving authority that gave you this credential - says that you're accredited to drive. So to me, a digital wallet is a way of saying that someone said something. A generalization of that concept is to say that digital wallets help supply some evidence to help someone believe something. When I get my digital driver's licence, that is evidence that I can supply to a car rental company to help them establish that I am accredited to drive. And this idea has been around forever.
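To make that abstraction concrete, here is a minimal sketch in Rust of what "evidence that someone said something" boils down to: a statement, the party who made it, and a signature that a relying party can check. All of the names, and the placeholder verification function, are mine for illustration; they don't follow any wallet or credential standard.

```rust
/// A minimal sketch of the "someone said something" abstraction.
/// All names are illustrative; nothing here follows a credential standard.

/// The statement being made, e.g. "Jesse is accredited to drive".
struct Statement {
    subject: String,
    predicate: String,
    object: String,
}

/// Evidence that a particular party made the statement.
struct Claim {
    statement: Statement,
    claimed_by: String, // e.g. an identifier for the driving authority
    signature: Vec<u8>, // the claimant's signature over the statement
}

/// Placeholder check: a real wallet would verify `signature` against the
/// claimant's public key (for instance an Ed25519 or BBS signature).
fn is_believable(claim: &Claim, trusted_parties: &[String]) -> bool {
    trusted_parties.contains(&claim.claimed_by) && !claim.signature.is_empty()
}

fn main() {
    let claim = Claim {
        statement: Statement {
            subject: "Jesse".into(),
            predicate: "isAccreditedToDrive".into(),
            object: "true".into(),
        },
        claimed_by: "Driving Authority".into(),
        signature: vec![0u8; 64], // stand-in bytes; a real credential carries a real signature
    };
    let trusted = vec!["Driving Authority".to_string()];
    println!("believable: {}", is_believable(&claim, &trusted));
}
```

The point of the sketch is only that the relying party's question is always "do I trust the party who said this, and did they really say it?", which is exactly the provenance framing that follows.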
We call it provenance, if you've been in semantic web spaces for a while. So if we look at the Verifiable Credentials work from the W3C, for instance, there are two core features to it that support this. The first is signatures: the signature is the way of having the driving authority prove that they stated something, with their public-private key signatures. And then the other useful feature that I believe comes from W3C Verifiable Credentials is this notion of selective disclosure, which means that once I've been issued this digital driver's licence, I don't have to tell everyone everything from it; I can instead just reveal the date of birth that's present on this digital credential.

However, only selective disclosure is possible, not derivation of facts. So if it is my date of birth stated in that verifiable credential, I must tell you my birth date; I can't prove that I'm over 18. I can't prove that I'm a Commonwealth citizen; I have to say that I'm Australian. And that is all the standards support today, through these BBS signatures, which I believe the other group mentioned.

So if we look at the use cases where these kinds of verifiable credentials are being used, and where we want to use them, I put it to you: are the current standards enough? One common use case that we're seeing now is Directs, which is doing a lot of secondary data reuse in the medical system in places like Luxembourg. But when it comes to health, and I'm thinking about studies that I might want to do, a very naive one might be to get an anonymized data set that has correlations between age and BMI for everyone in a particular country. Now, if I'm doing this for research purposes and I want to publish a study, I want to be sure that that data is true. I don't want to have to trust some third party to have aggregated it correctly, and then claim to me that they aggregated it correctly. I actually want proof that all of the values that I've been given were issued by valid hospitals throughout the country that I come from.

And what about on-demand data integrity? I mentioned driving authorities earlier. What if I want to prove that I'm eligible to hire a car, based on the credentials from the country that I've come from, but I only want to provide that boolean answer? I don't want to tell them anything about my visa status or my driving credentials in the country that I come from. I don't even want to reveal where I was born. Can the current standards make this integration possible, so that I can integrate this data and provide it to a company, and can they help make it more privacy-preserving?

So this led me to the question: can we support zero-knowledge proofs over SPARQL queries, so that we can base verifiable credential standards around SPARQL instead of around fixed credential structures?
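As a rough illustration of that "boolean answer" idea, here is the kind of SPARQL ASK query a car-hire company might pose: it yields only true or false, never the underlying licence details. The ex: vocabulary and the data layout are invented for this sketch, and the query is wrapped in a Rust string constant only to keep all of the code examples here in one language.

```rust
// Sketch only: the ex: vocabulary and data layout are hypothetical.
// An ASK query returns a single boolean - exactly the shape of answer the
// car-hire company needs, with no birth date, visa status or licence details.
const ELIGIBLE_TO_HIRE: &str = r#"
PREFIX ex:  <http://example.org/vocab#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

ASK {
  ?licence a ex:DrivingLicence ;
           ex:holder     ex:jesse ;
           ex:class      ?class ;
           ex:validUntil ?until .
  ?class ex:permitsHireCar true .
  FILTER (?until >= "2025-01-01"^^xsd:date)   # licence not yet expired (cutoff is illustrative)
}
"#;

fn main() {
    // In practice a SPARQL engine would evaluate this over the holder's
    // credentials, and only the boolean result would be disclosed.
    println!("{}", ELIGIBLE_TO_HIRE);
}
```

The interesting question, and the subject of the rest of the talk, is whether that boolean can come with a proof that it was computed honestly over genuinely issued credentials.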
Now, as a show of hands, who here has heard of sparkle and knows what the 07:02.160 --> 07:08.720 hell I'm talking about? Okay, half the crowd. Who likes sparkle or somatic web standards? 07:10.560 --> 07:19.360 Ooh, tough crowd. Okay, this is going to be fun. Let's start with why I think 07:19.840 --> 07:25.760 we should be using these kind of standards. So what we've got up here is a set of data in an 07:25.760 --> 07:34.080 emergent RDF standard called RDF 1.2. And what this allows you to do is talk about claims. 07:34.880 --> 07:41.280 So instead of saying that that is my data buff, I'm going to say that the driving authority 07:41.280 --> 07:48.800 claims that my data buff is 6th of April 2000 and instead of just in a database stating that my 07:48.800 --> 07:53.280 citizenship is Australia, I'm going to state that the UK immigration authority claims 07:53.280 --> 07:57.760 that my country is Australia. Now you might be going, well isn't this what the VC standards are 07:57.760 --> 08:02.320 already doing? We have the data in this credential and then we add a signature to it. 08:03.440 --> 08:08.560 One caveat there. Where in there areifiable credential standards at the moment, 08:08.560 --> 08:12.240 actually stating all this data at the top level. So you're saying there's a credential. 08:12.240 --> 08:16.560 Jessie does have this data buff. Jessie does have that citizenship. Oh, and by the way, 08:16.560 --> 08:21.760 here's a signature to prove that. We're not fundamentally modeling the fact that these are all 08:21.760 --> 08:27.280 claims that anyone can make and the data inside those angle brackets is not necessarily ground 08:27.280 --> 08:34.080 truth. That's just stuff claimed by these parties. And then what we want to be able to do is not 08:34.080 --> 08:39.280 have these custom procedures of integrating credentials and deriving proofs. We want to be able 08:39.280 --> 08:45.600 to have standards so that we can write queries like this one. Where I can just ask, is that Jessie's 08:46.560 --> 08:51.760 or is Jessie's data buff over 2006, so I'm over 18, and does he have citizenship of a 08:51.760 --> 08:56.320 commonwealth country? There's a standard to write this. We don't need to write custom business 08:56.320 --> 09:01.040 logic in programming languages of choice. We can do this all in standards. 09:03.040 --> 09:10.080 So I had about half the hands come up for sparkle and idea. So I'm going to very, very quickly 09:10.080 --> 09:17.440 do the breakneck tour of what is a sparkle query. So I'm hoping more of you have interacted 09:17.440 --> 09:25.040 with SQL. But sparkle is just a query language. It's nothing to be scared of that allows you 09:25.760 --> 09:34.320 to query over data on the web in a standardized format. So here we've got a query that's asking 09:34.320 --> 09:42.880 is Rubin. So we're wanting to see all the people that are born in the same place as Rubin 09:42.880 --> 09:51.760 and younger than him. And we're wanting to construct some derived facts that generate the fact 09:51.760 --> 09:56.240 that that person shares the place of birth and is older than. But we're no longer revealing the 09:56.240 --> 10:02.960 data birth and the birth place. And in DBPedia, which is an open data set, there's actually hundreds 10:02.960 --> 10:08.640 of people that share this data birth with, well over what's my mind to say. The share this data 10:08.640 --> 10:13.280 birth was Rubin and younger with him. So we can get this set of query results. 
And on an open web of data, where I trust that all these facts are true, this works quite well already; SPARQL has been around for decades. But we haven't had ways of talking about facts about facts until now, and this is where the RDF-star - the RDF 1.2 reification I was talking about - comes in. So now we can talk about statements according to people. If we look at that query we just tried for Ruben, what we really want to say is: is there a way of asserting that Ruben shares the same birthplace as someone and is older than that person, according to some derived proof? That's what we want to get to. And moreover, I don't want that to be just any old proof. I want that proof to reveal, at most, which entities stated the ground truth used to derive this fact. So I want to state that this is true, and all you need to trust to get there is the Belgian government.

Can we do this? Well, when I pitched this talk, I didn't know. And I can tell you quite confidently that the answer now is yes. And remarkably - I think someone else here was quite skeptical of blockchain before, and I share that skepticism - the blockchain space has actually managed to produce a very useful privacy-enhancing technology called zero-knowledge virtual machines. RISC Zero is one of them; it had its major release, I believe, late last year, but don't quote me on that. And the very naive way you can get everything I was just talking about working is to take one of these zero-knowledge virtual machines, take a SPARQL query engine that's written in Rust, and run that inside the zero-knowledge virtual machine - because this thing can prove correct execution of arbitrary Rust code and then reveal the outputs of a function.

So concretely, if we look at the implementation - I just need to show how short the code for this is, and then I'll check the time; I've got a few minutes - it's 72 lines of code. Most of it is just data pipelining: converting data and parsing data. Nothing really interesting is happening in here, but we're able to do the zero-knowledge proof in, as I say, 72 lines of code. (A rough sketch of the shape of that code follows below.) So that's useful - and it's a very naive approach.

There are some gotchas to this approach. If we look at the proving time, it did take 62 seconds to show that I was over 18, which isn't really what we want. So yes, the gotchas: first, proving time. Secondly, the proof is dependent on this particular implementation of a query engine, which isn't great from a security perspective, because you have to audit the whole query engine - we can't do standards-based security audits. And the modelling of this could be improved, because we've just chucked an existing query engine in there: we don't have what I wanted at the start, which is the derived proof as part of a query result; instead, the proof comes outside of the query result.
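Here is a rough sketch, under my own assumptions, of the shape of that naive guest program. I'm assuming the RISC Zero guest conventions (env::read / env::commit and the entry macro) roughly as they stood in an earlier release, so the exact macros and APIs may differ; and run_ask is a placeholder standing in for an embedded Rust SPARQL engine such as Oxigraph, whose real API I've deliberately elided. The reified Turtle and the vocabulary in the comments are illustrative, not any standard.

```rust
// Zero-knowledge VM guest sketch (RISC Zero style). Assumptions: the
// env::read/env::commit guest API and entry macro as in an earlier release,
// plus a placeholder run_ask() standing in for a real embedded SPARQL engine
// (e.g. Oxigraph). Not production code.
#![no_main]
risc0_zkvm::guest::entry!(main);

use risc0_zkvm::guest::env;

/// Placeholder for evaluating a SPARQL ASK query over RDF 1.2 data with an
/// embedded query engine; the signature is illustrative only.
fn run_ask(turtle_data: &str, ask_query: &str) -> bool {
    let _ = (turtle_data, ask_query);
    unimplemented!("delegate to an embedded SPARQL engine here")
}

fn main() {
    // Private inputs: the holder's signed claims (reified RDF 1.2 Turtle,
    // e.g.  << :jesse :dateOfBirth "2000-04-06" >> :claimedBy :DrivingAuthority . )
    // and the verifier's ASK query. Neither is revealed by the proof.
    let claims: String = env::read();
    let query: String = env::read();

    let answer = run_ask(&claims, &query);

    // Only the boolean answer (and, in a fuller design, the identifiers of the
    // issuers whose claims were used) is committed to the public journal. A
    // host program runs the prover over this guest and hands the resulting
    // receipt, with that journal, to the verifier.
    env::commit(&answer);
}
```

The real 72-line version will differ in its details; the point of the sketch is just the shape: read private inputs, run an ordinary Rust query engine, and commit only what the verifier needs to see.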
So let's address the 62 seconds first: can we make it faster? (And I'm going to start speaking faster too, because my time's running out.) The answer is yes. There are ways of building custom circuits, and there are also things like zero-knowledge theorem provers where you can implement this. Based on comparisons to other projects that have been implemented in zero-knowledge virtual machines and then in SAT solvers, we can expect a 100 to 1000x speed-up, optimistically. So that brings us down to under a second - perhaps 200-ish milliseconds - which is within what's a reasonable latency for a request on the web anyway.

Then, standardization. What we need to do is move from having the zero-knowledge virtual machine prove things over Rust code to instead proving things over SPARQL operations, so that we can have the proof expressed in a standard manner that can have multiple implementations.

And then the last thing: I had a very small number of hands up when I asked how much people love SPARQL. I love SPARQL, so maybe that's just me. But abstractions are important for both users and developers, and we can hide most of what I've just been talking about from everyone. For instance - I'm going to skip that slide - there are data shape languages like SHACL, where you can define the form you want the resulting credential to take, and just say: okay, go to the database, do the queries, get the data, and then build the credential for me; this is what I want the shape of the credential to look like - this one would be a social security number credential (a rough sketch of such a shape follows below). That way, you don't have to deal with all the data query layers, all the provenance, and all the stuff that is only fun for people like me.

So where do I want to go from here? Well, this is a very small part of everything I'm doing. Fundamentally, I'm looking for people who are also interested in zero knowledge over verifiable credentials. If you want to work on the performance optimization, or put resourcing into that - please. If you want to work on the modelling, the data modelling around this - also good. And to ground this work, I really want to build it out in the context of the Gamma trust framework in the UK, and eIDAS. So if anyone wants to actually build this out in applications, I would love that.
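Coming back to that SHACL idea for a moment, here is a rough sketch of the kind of shape I mean for the social security number example. The ex: terms are invented; sh: and xsd: are the standard SHACL and XML Schema namespaces, and the Turtle sits in a Rust constant only to keep these sketches in a single language.

```rust
// Sketch: a SHACL node shape describing the credential we want assembled.
// The ex: terms are placeholders; sh: and xsd: are standard namespaces.
const SSN_CREDENTIAL_SHAPE: &str = r#"
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/vocab#> .

ex:SsnCredentialShape
    a sh:NodeShape ;
    sh:targetClass ex:SsnCredential ;
    sh:property [
        sh:path     ex:socialSecurityNumber ;
        sh:datatype xsd:string ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
    ] ;
    sh:property [
        sh:path     ex:issuedBy ;
        sh:nodeKind sh:IRI ;
        sh:minCount 1 ;
    ] .
"#;

fn main() {
    // A wallet or agent could take a shape like this, run the necessary
    // queries over the holder's data, and assemble a credential that
    // conforms to it - hiding the query and provenance layers from the user.
    println!("{}", SSN_CREDENTIAL_SHAPE);
}
```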
And then, in terms of future work: the other reason I find this all exciting is that, with the query engine I just showed, you can have some really cool applications come out of it, like emergent multi-party computation. We can do things like not revealing my salary to anyone, but still having an aggregate salary for my workplace derived, with a proof that that aggregate salary is true. So we have a lot more privacy-enhancing technologies available once we turn this infrastructure into multiple query engines and query planning. And when we integrate this with things like ODRL, we can start to pull data on demand in compliance with user consent - we talked before about consent and user data-sharing options - very nicely.

Last thing: someone asked about a reading list before; that's my recommended reading list, so if you want to take a photo and look it up, go for it. I'll leave that there while I answer some questions. Please raise your hand if you have a question, and please repeat the question. Back there - what happened? No, you. Yeah.

[Audience comment, partly inaudible:] ...and basically, if you have one of those, then there are also other tools for people who are put off by that side of things. So I think that even if you don't like blockchain, you should look at this.

Plus one on that. Any others? Sorry - it was more a comment than a question, but I will repeat it, which was to say that, yes, the whole zero-knowledge space is a privacy-enhancing technology that happens to be used within blockchain, but is not a blockchain thing in and of itself, and it's used in many other contexts. The reason I referenced blockchain in my talk is that it is, I think, a blockchain company that developed the zero-knowledge virtual machine that I happen to use. But that's again just because it's a core technology; they also build tools for caching and web libraries, and I would use those in the same way. Any other questions?

[Audience question:] Maybe a bit of a fragmented question, but if you want to implement this kind of zero-knowledge proof - for example, I want to hire a car in another country, and I want to prove the fact that I have a valid Belgian driving licence - I can imagine that in different countries driving licences are structured differently. We have certain classes in Belgium - motorcycle, normal car, truck - and maybe a different country doesn't distinguish them. How do you handle, in this structure, the problem that different authorities can provide competing structures for the data, in terms of how we interpret it?

There are a few ways to go about it.
So my immediate answer to that is to have the driving authority of each country define - again in structured RDF, or Notation3 if you've ever heard of that, which is less likely - what the requirements are in a given country for you to be considered a driver of a class C car, or that kind of vehicle. And that can also be signed. Again, it's all just about someone proving that a statement is true. So if you have the Australian government state that a UK class C vehicle is equivalent to an Australian class E vehicle, and publish that and sign that, then you can do the derivations to do that data integration.

[Audience follow-up, inaudible.]

I like to live with a slight amount of optimism. Fair point. But at the end of the day, you need to trust someone to write the business logic, whether that's an authority or a trusted intermediary. In the UK, with the DVS framework, they're certifying 50 or so companies as digital verifiers - there are about 50 companies, I think, that are going to be certified as part of the Gamma trust framework - and I see it as quite a valid possibility that those companies are also trusted for issuing data about how to align certain schemas and how to do that kind of data integration, because they are already trusted within this framework. Any other questions?

[Audience question, partly inaudible:] Yes - first of all, thank you for a very interesting talk; I found it very interesting. But how do you ensure that the organizations that hold the data put limitations on the kinds of queries that are allowed to take knowledge out of the data? For instance, I want to know all the people in this data set that are above a certain age, male, living in this area - at some point the returned set is so specific that you're losing anonymity.

I'm going to go back a bit - I very much agree, and this is where I come at it from the Solid perspective. In the Solid view of the world, you have your own personal data store, and within that data store you can put a set of credentials, amongst other data. And what you have living on top of that store is a set of access and usage control policies. So I, as the data subject, am defining who I permit to access my data and for what purposes I permit them to use it. In the context of healthcare, it might be that I permit trusted research institutions in the UK and Europe to access my data for the purpose of aggregate health studies. The query, then, is not necessarily done on the NHS's servers.
It's a query that's done either through communication between my pod and other pods, or by some third-party aggregation service that is trusted to access this data and has a policy engine implemented to ensure that it's only pulling and re-sharing data within the bounds of express user consent. Does that answer the question?

[Audience follow-up, inaudible.]

Yeah - well, there's ODRL as a policy language, for instance.

[Audience follow-up, inaudible.]

Well, that's a problem outside of this context. If I have a credential about, you know, health data here, and a credential about health data there, and a credential about my address here, and I choose to reveal all three to a particular agency, they can do exactly the same kind of correlative analysis you're talking about.

[Audience member:] So you get more and more specific data - maybe not to the point of a specific person, but you can get a lot of knowledge regarding a very limited group of people.

So, to make sure I'm understanding your question correctly, to take it to an extreme example...