All righty. Good morning. I'm going to start by saying I agree with the previous speakers: we shouldn't be talking about digital wallets. It's fundamentally the wrong abstraction for the development and standardization layer. It might be good for end users, though.

So today I'm going to be talking about verifiable credentials in relation to query, and how we can do zero knowledge over that query. I'm going to start with a bit about myself. I'm Jesse Wright, and I wear many hats. One of those hats is as a researcher at Oxford, where I'm working on neuro-symbolic AI systems and trying to build AI agents that faithfully represent human entities, work faithfully with data on your behalf, and help you manage your life.

I'm also lead of the Solid project over at the Open Data Institute, which is a project that Tim Berners-Lee created back in 2014 at MIT to give people back control of their personal data. Solid is a personal data store solution that allows you to build applications that don't store data on corporate servers, but instead store your personal data in cloud storage such that you can reuse it as you go across the web. We talk about being able to use digital wallets for things like sharing your address and proving your address. With Solid, I have an authoritative source of my address. That means I update it once and my insurance company gets it straight away; my university gets it straight away. I have authority over my data.

I've been working on Solid in particular for many years. I used to work for Tim's company, Inrupt, as an enterprise software engineer and data architect, and Solid is already very heavily involved in the data wallets space - in fact, one of the donated open wallet contributions (thank you) came from Inrupt. And I work a lot on standards at the moment: I'm in 14 standards bodies, mainly through the W3C, mainly coming from the semantic web perspective, as is going to become obvious throughout my talk.

So what is a wallet? What are we talking about when we have a digital wallet? My suggestion is that it's a way to prove that someone said something. Fundamentally, that is what we're doing. When I get issued a credential that holds my driver's licence, all that is saying is that a driving authority - the driving authority that gave you this credential - says that you're accredited to drive. So to me, a digital wallet is a way of saying that someone said something. A generalization of that concept is to say that digital wallets help supply some evidence to help someone believe something. When I get my digital driver's licence, that is evidence that I can supply to a car rental company to help them establish that I am accredited to drive. And this idea has been around forever.
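To make that abstraction concrete, here is a minimal sketch in Rust of what "evidence that someone said something" boils down to: a statement, the party who made it, and a signature that a relying party can check. All of the names, and the placeholder verification function, are mine for illustration; they don't follow any wallet or credential standard.

```rust
/// A minimal sketch of the "someone said something" abstraction.
/// All names are illustrative; nothing here follows a credential standard.

/// The statement being made, e.g. "Jesse is accredited to drive".
struct Statement {
    subject: String,
    predicate: String,
    object: String,
}

/// Evidence that a particular party made the statement.
struct Claim {
    statement: Statement,
    claimed_by: String, // e.g. an identifier for the driving authority
    signature: Vec<u8>, // the claimant's signature over the statement
}

/// Placeholder check: a real wallet would verify `signature` against the
/// claimant's public key (for instance an Ed25519 or BBS signature).
fn is_believable(claim: &Claim, trusted_parties: &[String]) -> bool {
    trusted_parties.contains(&claim.claimed_by) && !claim.signature.is_empty()
}

fn main() {
    let claim = Claim {
        statement: Statement {
            subject: "Jesse".into(),
            predicate: "isAccreditedToDrive".into(),
            object: "true".into(),
        },
        claimed_by: "Driving Authority".into(),
        signature: vec![0u8; 64], // stand-in bytes; a real credential carries a real signature
    };
    let trusted = vec!["Driving Authority".to_string()];
    println!("believable: {}", is_believable(&claim, &trusted));
}
```

The point of the sketch is only that the relying party's question is always "do I trust the party who said this, and did they really say it?", which is exactly the provenance framing that follows.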
We call it provenance, if you've been in semantic web spaces for a while. So if we look at the Verifiable Credentials work from the W3C, for instance, there are two core features to it that support this. The first is signatures: the signature is the way of having the driving authority prove that they stated something, with their public-private key signatures. And then the other useful feature that I believe comes from W3C Verifiable Credentials is this notion of selective disclosure, which means that once I've been issued this digital driver's licence, I don't have to tell everyone everything from it; I can instead just reveal the date of birth that's present on this digital credential.

However, only selective disclosure is possible, not derivation of facts. So if it is my date of birth stated in that verifiable credential, I must tell you my birth date; I can't prove that I'm over 18. I can't prove that I'm a Commonwealth citizen; I have to say that I'm Australian. And that is all the standards support today, through these BBS signatures, which I believe the other group mentioned.

So if we look at the use cases where these kinds of verifiable credentials are being used, and where we want to use them, I put it to you: are the current standards enough? One common use case that we're seeing now is Directs, which is doing a lot of secondary data reuse in the medical system in places like Luxembourg. But when it comes to health, and I'm thinking about studies that I might want to do, a very naive one might be to get an anonymized data set that has correlations between age and BMI for everyone in a particular country. Now, if I'm doing this for research purposes and I want to publish a study, I want to be sure that that data is true. I don't want to have to trust some third party to have aggregated it correctly, and then claim to me that they aggregated it correctly. I actually want proof that all of the values that I've been given were issued by valid hospitals throughout the country that I come from.

And what about on-demand data integrity? I mentioned driving authorities earlier. What if I want to prove that I'm eligible to hire a car, based on the credentials from the country that I've come from, but I only want to provide that boolean answer? I don't want to tell them anything about my visa status or my driving credentials in the country that I come from. I don't even want to reveal where I was born. Can the current standards make this integration possible, so that I can integrate this data and provide it to a company, and can they help make it more privacy-preserving?

So this led me to the question: can we support zero-knowledge proofs over SPARQL queries, so that we can base verifiable credential standards around SPARQL instead of around fixed credential structures?
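As a rough illustration of that "boolean answer" idea, here is the kind of SPARQL ASK query a car-hire company might pose: it yields only true or false, never the underlying licence details. The ex: vocabulary and the data layout are invented for this sketch, and the query is wrapped in a Rust string constant only to keep all of the code examples here in one language.

```rust
// Sketch only: the ex: vocabulary and data layout are hypothetical.
// An ASK query returns a single boolean - exactly the shape of answer the
// car-hire company needs, with no birth date, visa status or licence details.
const ELIGIBLE_TO_HIRE: &str = r#"
PREFIX ex:  <http://example.org/vocab#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

ASK {
  ?licence a ex:DrivingLicence ;
           ex:holder     ex:jesse ;
           ex:class      ?class ;
           ex:validUntil ?until .
  ?class ex:permitsHireCar true .
  FILTER (?until >= "2025-01-01"^^xsd:date)   # licence not yet expired (cutoff is illustrative)
}
"#;

fn main() {
    // In practice a SPARQL engine would evaluate this over the holder's
    // credentials, and only the boolean result would be disclosed.
    println!("{}", ELIGIBLE_TO_HIRE);
}
```

The interesting question, and the subject of the rest of the talk, is whether that boolean can come with a proof that it was computed honestly over genuinely issued credentials.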
Now, as a show of hands, who here has heard of sparkle and knows what the 07:02.160 --> 07:08.720 hell I'm talking about? Okay, half the crowd. Who likes sparkle or somatic web standards? 07:10.560 --> 07:19.360 Ooh, tough crowd. Okay, this is going to be fun. Let's start with why I think 07:19.840 --> 07:25.760 we should be using these kind of standards. So what we've got up here is a set of data in an 07:25.760 --> 07:34.080 emergent RDF standard called RDF 1.2. And what this allows you to do is talk about claims. 07:34.880 --> 07:41.280 So instead of saying that that is my data buff, I'm going to say that the driving authority 07:41.280 --> 07:48.800 claims that my data buff is 6th of April 2000 and instead of just in a database stating that my 07:48.800 --> 07:53.280 citizenship is Australia, I'm going to state that the UK immigration authority claims 07:53.280 --> 07:57.760 that my country is Australia. Now you might be going, well isn't this what the VC standards are 07:57.760 --> 08:02.320 already doing? We have the data in this credential and then we add a signature to it. 08:03.440 --> 08:08.560 One caveat there. Where in there areifiable credential standards at the moment, 08:08.560 --> 08:12.240 actually stating all this data at the top level. So you're saying there's a credential. 08:12.240 --> 08:16.560 Jessie does have this data buff. Jessie does have that citizenship. Oh, and by the way, 08:16.560 --> 08:21.760 here's a signature to prove that. We're not fundamentally modeling the fact that these are all 08:21.760 --> 08:27.280 claims that anyone can make and the data inside those angle brackets is not necessarily ground 08:27.280 --> 08:34.080 truth. That's just stuff claimed by these parties. And then what we want to be able to do is not 08:34.080 --> 08:39.280 have these custom procedures of integrating credentials and deriving proofs. We want to be able 08:39.280 --> 08:45.600 to have standards so that we can write queries like this one. Where I can just ask, is that Jessie's 08:46.560 --> 08:51.760 or is Jessie's data buff over 2006, so I'm over 18, and does he have citizenship of a 08:51.760 --> 08:56.320 commonwealth country? There's a standard to write this. We don't need to write custom business 08:56.320 --> 09:01.040 logic in programming languages of choice. We can do this all in standards. 09:03.040 --> 09:10.080 So I had about half the hands come up for sparkle and idea. So I'm going to very, very quickly 09:10.080 --> 09:17.440 do the breakneck tour of what is a sparkle query. So I'm hoping more of you have interacted 09:17.440 --> 09:25.040 with SQL. But sparkle is just a query language. It's nothing to be scared of that allows you 09:25.760 --> 09:34.320 to query over data on the web in a standardized format. So here we've got a query that's asking 09:34.320 --> 09:42.880 is Rubin. So we're wanting to see all the people that are born in the same place as Rubin 09:42.880 --> 09:51.760 and younger than him. And we're wanting to construct some derived facts that generate the fact 09:51.760 --> 09:56.240 that that person shares the place of birth and is older than. But we're no longer revealing the 09:56.240 --> 10:02.960 data birth and the birth place. And in DBPedia, which is an open data set, there's actually hundreds 10:02.960 --> 10:08.640 of people that share this data birth with, well over what's my mind to say. The share this data 10:08.640 --> 10:13.280 birth was Rubin and younger with him. So we can get this set of query results. 
And on an open web of data, where I trust that all these facts are true, this works quite well already; SPARQL has been around for decades. But we haven't had ways of talking about facts about facts until now, and this is where the RDF-star - the RDF 1.2 reification I was talking about - comes in. So now we can talk about statements according to people. If we look at that query we just tried for Ruben, what we really want to say is: is there a way of asserting that Ruben shares the same birthplace as someone and is older than that person, according to some derived proof? That's what we want to get to. And moreover, I don't want that to be just any old proof. I want that proof to reveal, at most, which entities stated the ground truth used to derive this fact. So I want to state that this is true, and all you need to trust to get there is the Belgian government.

Can we do this? Well, when I pitched this talk, I didn't know. And I can tell you quite confidently that the answer now is yes. And remarkably - I think someone else here was quite skeptical of blockchain before, and I share that skepticism - the blockchain space has actually managed to produce a very useful privacy-enhancing technology called zero-knowledge virtual machines. RISC Zero is one of them; it had its major release, I believe, late last year, but don't quote me on that. And the very naive way you can get everything I was just talking about working is to take one of these zero-knowledge virtual machines, take a SPARQL query engine that's written in Rust, and run that inside the zero-knowledge virtual machine - because this thing can prove correct execution of arbitrary Rust code and then reveal the outputs of a function.

So concretely, if we look at the implementation - I just need to show how short the code for this is, and then I'll check the time; I've got a few minutes - it's 72 lines of code. Most of it is just data pipelining: converting data and parsing data. Nothing really interesting is happening in here, but we're able to do the zero-knowledge proof in, as I say, 72 lines of code. (A rough sketch of the shape of that code follows below.) So that's useful - and it's a very naive approach.

There are some gotchas to this approach. If we look at the proving time, it did take 62 seconds to show that I was over 18, which isn't really what we want. So yes, the gotchas: first, proving time. Secondly, the proof is dependent on this particular implementation of a query engine, which isn't great from a security perspective, because you have to audit the whole query engine - we can't do standards-based security audits. And the modelling of this could be improved, because we've just chucked an existing query engine in there: we don't have what I wanted at the start, which is the derived proof as part of a query result; instead, the proof comes outside of the query result.
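Here is a rough sketch, under my own assumptions, of the shape of that naive guest program. I'm assuming the RISC Zero guest conventions (env::read / env::commit and the entry macro) roughly as they stood in an earlier release, so the exact macros and APIs may differ; and run_ask is a placeholder standing in for an embedded Rust SPARQL engine such as Oxigraph, whose real API I've deliberately elided. The reified Turtle and the vocabulary in the comments are illustrative, not any standard.

```rust
// Zero-knowledge VM guest sketch (RISC Zero style). Assumptions: the
// env::read/env::commit guest API and entry macro as in an earlier release,
// plus a placeholder run_ask() standing in for a real embedded SPARQL engine
// (e.g. Oxigraph). Not production code.
#![no_main]
risc0_zkvm::guest::entry!(main);

use risc0_zkvm::guest::env;

/// Placeholder for evaluating a SPARQL ASK query over RDF 1.2 data with an
/// embedded query engine; the signature is illustrative only.
fn run_ask(turtle_data: &str, ask_query: &str) -> bool {
    let _ = (turtle_data, ask_query);
    unimplemented!("delegate to an embedded SPARQL engine here")
}

fn main() {
    // Private inputs: the holder's signed claims (reified RDF 1.2 Turtle,
    // e.g.  << :jesse :dateOfBirth "2000-04-06" >> :claimedBy :DrivingAuthority . )
    // and the verifier's ASK query. Neither is revealed by the proof.
    let claims: String = env::read();
    let query: String = env::read();

    let answer = run_ask(&claims, &query);

    // Only the boolean answer (and, in a fuller design, the identifiers of the
    // issuers whose claims were used) is committed to the public journal. A
    // host program runs the prover over this guest and hands the resulting
    // receipt, with that journal, to the verifier.
    env::commit(&answer);
}
```

The real 72-line version will differ in its details; the point of the sketch is just the shape: read private inputs, run an ordinary Rust query engine, and commit only what the verifier needs to see.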
So let's address the 62 seconds first: can we make it faster? (And I'm going to start speaking faster too, because my time's running out.) The answer is yes. There are ways of building custom circuits, and there are also things like zero-knowledge theorem provers where you can implement this. Based on comparisons to other projects that have been implemented in zero-knowledge virtual machines and then in SAT solvers, we can expect a 100 to 1000x speed-up, optimistically. So that brings us down to under a second - perhaps 200-ish milliseconds - which is within what's a reasonable latency for a request on the web anyway.

Then, standardization. What we need to do is move from having the zero-knowledge virtual machine prove things over Rust code to instead proving things over SPARQL operations, so that we can have the proof expressed in a standard manner that can have multiple implementations.

And then the last thing: I had a very small number of hands up when I asked how much people love SPARQL. I love SPARQL, so maybe that's just me. But abstractions are important for both users and developers, and we can hide most of what I've just been talking about from everyone. For instance - I'm going to skip that slide - there are data shape languages like SHACL, where you can define the form you want the resulting credential to take, and just say: okay, go to the database, do the queries, get the data, and then build the credential for me; this is what I want the shape of the credential to look like - this one would be a social security number credential (a rough sketch of such a shape follows below). That way, you don't have to deal with all the data query layers, all the provenance, and all the stuff that is only fun for people like me.

So where do I want to go from here? Well, this is a very small part of everything I'm doing. Fundamentally, I'm looking for people who are also interested in zero knowledge over verifiable credentials. If you want to work on the performance optimization, or put resourcing into that - please. If you want to work on the modelling, the data modelling around this - also good. And to ground this work, I really want to build it out in the context of the Gamma trust framework in the UK, and eIDAS. So if anyone wants to actually build this out in applications, I would love that.
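Coming back to that SHACL idea for a moment, here is a rough sketch of the kind of shape I mean for the social security number example. The ex: terms are invented; sh: and xsd: are the standard SHACL and XML Schema namespaces, and the Turtle sits in a Rust constant only to keep these sketches in a single language.

```rust
// Sketch: a SHACL node shape describing the credential we want assembled.
// The ex: terms are placeholders; sh: and xsd: are standard namespaces.
const SSN_CREDENTIAL_SHAPE: &str = r#"
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/vocab#> .

ex:SsnCredentialShape
    a sh:NodeShape ;
    sh:targetClass ex:SsnCredential ;
    sh:property [
        sh:path     ex:socialSecurityNumber ;
        sh:datatype xsd:string ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
    ] ;
    sh:property [
        sh:path     ex:issuedBy ;
        sh:nodeKind sh:IRI ;
        sh:minCount 1 ;
    ] .
"#;

fn main() {
    // A wallet or agent could take a shape like this, run the necessary
    // queries over the holder's data, and assemble a credential that
    // conforms to it - hiding the query and provenance layers from the user.
    println!("{}", SSN_CREDENTIAL_SHAPE);
}
```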
And then, in terms of future work: the other reason I find this all exciting is that, with the query engine I just showed, you can have some really cool applications come out of it, like emergent multi-party computation. We can do things like not revealing my salary to anyone, but still having an aggregate salary for my workplace derived, with a proof that that aggregate salary is true. So we have a lot more privacy-enhancing technologies available once we turn this infrastructure into multiple query engines and query planning. And when we integrate this with things like ODRL, we can start to pull data on demand in compliance with user consent - we talked before about consent and user data-sharing options - very nicely.

Last thing: someone asked about a reading list before; that's my recommended reading list, so if you want to take a photo and look it up, go for it. I'll leave that there while I answer some questions. Please raise your hand if you have a question, and please repeat the question. Back there - what happened? No, you. Yeah.

[Audience comment, partly inaudible:] ...and basically, if you have one of those, then there are also other tools for people who are put off by that side of things. So I think that even if you don't like blockchain, you should look at this.

Plus one on that. Any others? Sorry - it was more a comment than a question, but I will repeat it, which was to say that, yes, the whole zero-knowledge space is a privacy-enhancing technology that happens to be used within blockchain, but is not a blockchain thing in and of itself, and it's used in many other contexts. The reason I referenced blockchain in my talk is that it is, I think, a blockchain company that developed the zero-knowledge virtual machine that I happen to use. But that's again just because it's a core technology; they also build tools for caching and web libraries, and I would use those in the same way. Any other questions?

[Audience question:] Maybe a bit of a fragmented question, but if you want to implement this kind of zero-knowledge proof - for example, I want to hire a car in another country, and I want to prove the fact that I have a valid Belgian driving licence - I can imagine that in different countries driving licences are structured differently. We have certain classes in Belgium - motorcycle, normal car, truck - and maybe a different country doesn't distinguish them. How do you handle, in this structure, the problem that different authorities can provide competing structures for the data, in terms of how we interpret it?

There are a few ways to go about it.
So my immediate answer to that is to have the driving authority of each country define - again in structured RDF, or Notation3 if you've ever heard of that, which is less likely - what the requirements are in a given country for you to be considered a driver of a class C car, or that kind of vehicle. And that can also be signed. Again, it's all just about someone proving that a statement is true. So if you have the Australian government state that a UK class C vehicle is equivalent to an Australian class E vehicle, and publish that and sign that, then you can do the derivations to do that data integration.

[Audience follow-up, inaudible.]

I like to live with a slight amount of optimism. Fair point. But at the end of the day, you need to trust someone to write the business logic, whether that's an authority or a trusted intermediary. In the UK, with the DVS framework, they're certifying 50 or so companies as digital verifiers - there are about 50 companies, I think, that are going to be certified as part of the Gamma trust framework - and I see it as quite a valid possibility that those companies are also trusted for issuing data about how to align certain schemas and how to do that kind of data integration, because they are already trusted within this framework. Any other questions?

[Audience question, partly inaudible:] Yes - first of all, thank you for a very interesting talk; I found it very interesting. But how do you ensure that the organizations that hold the data put limitations on the kinds of queries that are allowed to take knowledge out of the data? For instance, I want to know all the people in this data set that are above a certain age, male, living in this area - at some point the returned set is so specific that you're losing anonymity.

I'm going to go back a bit - I very much agree, and this is where I come at it from the Solid perspective. In the Solid view of the world, you have your own personal data store, and within that data store you can put a set of credentials, amongst other data. And what you have living on top of that store is a set of access and usage control policies. So I, as the data subject, am defining who I permit to access my data and for what purposes I permit them to use it. In the context of healthcare, it might be that I permit trusted research institutions in the UK and Europe to access my data for the purpose of aggregate health studies. The query, then, is not necessarily done on the NHS's servers.
It's a query that's done either through communication between my pod and other pods, or by some third-party aggregation service that is trusted to access this data and has a policy engine implemented to ensure that it's only pulling and re-sharing data within the bounds of express user consent. Does that answer the question?

[Audience follow-up, inaudible.]

Yeah - well, there's ODRL as a policy language, for instance.

[Audience follow-up, inaudible.]

Well, that's a problem outside of this context. If I have a credential about, you know, health data here, and a credential about health data there, and a credential about my address here, and I choose to reveal all three to a particular agency, they can do exactly the same kind of correlative analysis you're talking about.

[Audience member:] So you get more and more specific data - maybe not to the point of a specific person, but you can get a lot of knowledge regarding a very limited group of people.

So, to make sure I'm understanding your question correctly, to take it to an extreme example...