Hello everybody, thank you very much for being here today. I am Dominik Tornow, the founder and CEO of Resonate, where we focus on Distributed Async Await, a programming model to make distributed applications dead simple. I am also the author of Think Distributed Systems, which is currently part of Manning Publications' early access program. The book focuses on dependable models that empower you to reason confidently about distributed systems.

I spend most of my time on distributed systems, and distributed systems are known for their complexity. We all just know that distributed systems are really, really complex. But what, at its core, makes distributed systems so complex? There are two reasons, two fundamental sources of complexity. The first is non-deterministic partial order, introduced by concurrency: in a distributed system, we do not know what happens next. The second is non-deterministic partial failure, introduced by distribution: in a distributed system, we simply do not know what will fail next.

As an aside, feel free to take pictures throughout the presentation. I have marked some slides as an aha moment to highlight what I think are the most insightful slides of the presentation. This is one of my favorites.

Now let's consider a simple, very abstract example to develop an intuition. We have two processes, P and Q, each consisting of three atomic actions, three steps. First, let's introduce sequentiality. The defining characteristic of sequentiality is total order: a sequential system executes one step at a time in a predetermined sequence. Consequently, the sequential composition of P and Q leads to exactly one possible execution. There is no room for alternative behavior.

But then we introduce concurrency, and the defining characteristic of concurrency is non-deterministic partial order. Consequently, the concurrent composition of P and Q already leads to 20 distinct possible executions. And lastly, let's introduce distribution. The defining characteristic of distribution is non-deterministic partial failure. Consequently, the distributed composition of P and Q leads to 69 distinct possible executions.

In a real system, some of these executions are very common while others are exceedingly rare. For example, imagine a scenario where a server accepts a connection at the exact moment a configuration reload is triggered, a highly unlikely sequence of events.

So now let's do a simple thought experiment, given that we have 69 distinct traces. Let's assume 10% of these traces are the happy path: they occur 90% of the time. And 90% of these traces are edge cases: they only occur 10% of the time. It turns out that if we non-deterministically pick a trace according to this probability distribution one thousand times, the chances of missing at least one trace are almost 100%.
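As a quick sanity check of these numbers, here is a minimal Python sketch. It assumes the happy-path probability is spread evenly over 7 traces and the edge-case probability evenly over the remaining 62; the talk does not specify the exact split, so that is my assumption.

```python
from math import comb

# Concurrent composition: an execution interleaves P's three steps with
# Q's three steps, i.e. it chooses 3 of 6 positions for P's steps.
print(comb(6, 3))   # 20 distinct concurrent executions

# Thought experiment (assumed split: 7 happy-path traces share 90% of the
# probability, 62 edge-case traces share the remaining 10%).
p_edge = 0.10 / 62                    # probability of one edge-case trace
p_never = (1 - p_edge) ** 1000        # a given edge trace is never sampled
p_miss_any = 1 - (1 - p_never) ** 62  # at least one trace missed (approx.)
print(f"{p_miss_any:.6f}")            # ~0.999999: almost 100%
```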
These rare traces provide the ideal breeding ground for the dreaded heisenbug. The term heisenbug is inspired by the Heisenberg uncertainty principle from quantum mechanics. Heisenbugs are elusive: the rarity of the traces that hide them makes them really hard to produce. And on the off chance we actually produce one, it's almost impossible to reproduce.

Now take a moment to imagine the behavior space of a real-world distributed system, where dozens or even hundreds of processes interact and each of these processes has dozens or hundreds of steps. The number of possible executions grows combinatorially, a phenomenon aptly called state space explosion. But concurrency and distribution don't just expand the behavior space; they make reasoning about a system almost incomprehensible. Concurrency and distribution introduce daunting uncertainty, and this uncertainty is the true challenge we face when we design distributed systems.

There is a direct correlation between certainty and confidence. When I am certain about my application, I am confident in my application's correctness. So making changes, adding new features, or even deploying to production on a Friday afternoon: no problem. With deterministic simulation testing you have a key component to achieving this level of certainty and confidence, by methodically exploring the behavior space, scaling across space and time. Throughout the rest of this presentation we will build a foundational understanding of deterministic simulation testing step by step, and a dependable mental model, so you can decide with confidence whether deterministic simulation testing is the right approach for you.

So let's begin by grounding our discussion with a mental model of software systems. For our discussion we will think of a software system as being partitioned into two distinct entities: the environment and the application. That is the highest-level system model, and it is universally applicable. What's in? What's out? The environment represents everything that is not the object that we are developing and testing, and the application represents everything that is the object that we are developing and testing.

Now, our goal is to test the application. To achieve this, the application under testing conditions must closely resemble the application under production conditions. Any deviation must be non-material, otherwise we risk testing something fundamentally different. However, our goal is not to test the environment. The environment is only a means to facilitate the test. Therefore the environment under testing conditions does not have to resemble the environment under production conditions at all. We are free to replace or completely mock any component in the environment as we see fit.
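To make this concrete, here is a minimal sketch, with hypothetical names rather than code from the talk, of an application that depends on an abstract environment so a test can swap in a mock:

```python
from typing import Protocol

class Storage(Protocol):          # part of the environment
    def put(self, key: str, value: str) -> None: ...
    def get(self, key: str) -> str | None: ...

class Application:                # the object under test
    def __init__(self, storage: Storage):
        self.storage = storage

    def register(self, user: str) -> bool:
        if self.storage.get(user) is not None:
            return False          # user already exists
        self.storage.put(user, "registered")
        return True

class InMemoryStorage:            # test environment: a trivial mock
    def __init__(self): self.data = {}
    def put(self, key, value): self.data[key] = value
    def get(self, key): return self.data.get(key)

app = Application(InMemoryStorage())  # production would inject a real store
assert app.register("ada") and not app.register("ada")
```

The application is identical under test and under production; only the injected environment changes.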
Now the question is: how do we delineate between the environment and the application? There is no universal answer; where you delineate depends entirely on your requirements. Here we see an example of a web application composed of three distinct components: a client on the left, a web server in the center, and a database server on the right. A reasonable choice is to assign the client to the environment while treating the web server and the database server as part of the application. The focus is on testing the web server and database server in composition. But a reasonable alternative is to assign both the client and the database server to the environment and treat the web server as the application. The focus then is on testing the web server in isolation.

Now that we have a mental model of software systems, when is a software system correct, and how do we establish the correctness of a software system? Correctness is not absolute; correctness is relative to a specification. For example, think of your parents. I bet most of you had, at least I had, one parent who was strict and one parent who was more lenient. For the very same behavior, my behavior, one parent, usually my mother, would scold me, whereas my dad would cut me some slack. So the same implementation can be considered correct or incorrect, depending entirely on the specification that we choose to apply.

Now, as Leslie Lamport explores in his foundational work Specifying Systems, software systems can be understood as state machines. The state machine is a remarkably simple but very powerful reasoning tool. A state machine consists of three components: a set of states, a set of initial states, and a next-state relation that determines how we transition from one state to the next. When we execute a state machine, the state machine generates a trace. A trace is the sequence of states that represents the system's behavior over time. Traces are at the heart of how we reason about software systems. Each trace represents a unique execution of the system.

Now, a state machine is non-deterministic if it presents a choice, or in more sinister terms, if it presents some ambiguity. In this example, from S1 we may transition to S2 or to S3. This is reflected in the set of possible traces given the initial state S1: we may generate two distinct traces, S1 to S2 or S1 to S3. So given the same initial state, we may witness different traces. A state machine is deterministic if the next-state relation is a function, that is, if the state machine does not present any choice. In a deterministic state machine there is no ambiguity, just a single predictable path forward. And this is also reflected in the possible traces given the initial state S1: there is only one trace that we can possibly generate, from S1 to S2 to S3. So given the same initial state, we will always witness the same trace.
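Here is a small Python sketch of this state machine view, using the S1, S2, S3 example from the slides; the encoding of the next-state relation as a dictionary of successor sets is my own:

```python
from typing import Dict, List, Set

Next = Dict[str, Set[str]]  # next-state relation: state -> set of successors

def traces(initial: str, next_rel: Next) -> List[List[str]]:
    """Enumerate all traces from `initial` until no successor exists."""
    result: List[List[str]] = []
    def walk(trace: List[str]) -> None:
        successors = next_rel.get(trace[-1], set())
        if not successors:
            result.append(trace)
            return
        for s in sorted(successors):  # more than one successor = a choice
            walk(trace + [s])
    walk([initial])
    return result

nondet = {"S1": {"S2", "S3"}}          # a relation: S1 may go to S2 or S3
det    = {"S1": {"S2"}, "S2": {"S3"}}  # a function: exactly one path

print(traces("S1", nondet))  # [['S1', 'S2'], ['S1', 'S3']]
print(traces("S1", det))     # [['S1', 'S2', 'S3']]
```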
Coming back to the specification and implementation relationship that we discussed earlier: in effect, a specification is a set of traces. These traces represent all the valid behaviors allowed by the specification. An implementation is correct with respect to a specification if every possible trace of the implementation is also a trace of the specification. And consequently, an implementation is incorrect with respect to the specification if some traces of the implementation are not traces of the specification.

So how can we establish correctness? There are two possibilities: software verification and software testing. Software verification aims to prove that all traces of the implementation are within the specification. Software testing aims to find some traces outside of the specification. In summary: if the verification asserts the system is correct, the verification asserts that no correctness violation will ever occur. If the test asserts the system is correct, the test merely asserts that no correctness violation has yet occurred; we may find one later. So testing is essentially a search problem: the goal is to find traces of the implementation that violate the specification, that violate correctness.
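To make the trace-inclusion view concrete, a tiny sketch with made-up traces: verification would prove the subset relation over all traces, while testing searches for a counterexample.

```python
# A specification and an implementation, each as a set of traces.
spec = {("S1", "S2"), ("S1", "S3")}   # all valid behaviors
impl = {("S1", "S2"), ("S1", "S4")}   # all possible behaviors

# Verification: prove that every implementation trace is a spec trace.
print(impl <= spec)   # False: the implementation is incorrect

# Testing: search for some trace outside the specification.
print(impl - spec)    # {('S1', 'S4')}: a correctness violation
```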
Now, how can we increase the confidence in a system's correctness? By increasing test coverage, and test coverage has two key dimensions: space and time. To scale across space, we extend our test by extending the number of distinct traces covered by the test. And to scale across time, we extend our test by increasing the length of distinct traces.

However, extending the number of distinct traces is difficult. Remember our initial example and our thought experiment: randomly stimulating the system will just produce more of the same. If we run the test a thousand times, about 900 times we will just test the happy path, and we will omit a lot of the edge cases, the ones that hide the bugs, especially the heisenbugs. Additionally, extending the length of distinct traces is also difficult. Many systems have actual timing constraints. There are timeouts around crash detection, for example, and timeouts around message loss detection: we allow the other side to respond within 10 seconds, within 20 seconds. Or, for example, the time intervals driving rate limiting algorithms: 1 minute, 5 minutes, 10 minutes. So increasing the length of a trace increases actual testing time.

Now, deterministic simulation testing enables you to scale across space by controlling the scheduling and the failure injection, and therefore increasing the number of distinct traces. Instead of blindly picking from the happy path, we are intentionally picking from the edge cases. On top of that, deterministic simulation testing enables you to scale across time by virtualizing time: when we increase the length of the trace, we do not actually increase the wall-clock time it takes to run the trace. And lastly, and this is one of the key achievements of deterministic simulation testing and also its defining characteristic, deterministic simulation testing enables you to reproduce a trace, by ensuring that for every single starting state there is only one possible trace through the system. So once you find a bug, you can reproduce that trace by putting the system in the same starting state.

Now, to be methodical and to ensure reproducibility, we construct the system, that is, we construct both the environment and the application, so that the behavior of the system only depends on the initial state of the environment and the initial state of the application. The core idea is to substitute the environment with a simulator. Remember, we are allowed to replace any component of the environment as we see fit. Deterministic simulation testing repeatedly executes an application in a simulated environment under changing initial conditions, monitoring that the correctness constraints are maintained across all executions. In practice, that means constructing the system to control otherwise non-deterministic factors: for example incoming requests, but also crashes, message losses, and of course time.

On a high level, when implementing this simulator, we have to think about two core interactions: requests from the environment to the application and requests from the application to the environment. The first represents the application's input and output; think, for example, of the web client in our initial example. The second represents the application's side effects, like talking to the network or reading and writing to the file system; here, think of the database server in the initial example. Now, by controlling scheduling and failure injection in the simulated environment, given a deterministic application, we can deterministically generate an arbitrary number of distinct traces of an arbitrary length.
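Here is a heavily simplified Python sketch of such a simulation loop, my own illustration rather than any of the simulators mentioned in this talk: a seeded random number generator is the single source of non-determinism, driving failure injection over a virtual clock, so no real time passes.

```python
import random

class Simulation:
    def __init__(self, seed: int):
        self.rng = random.Random(seed)   # the only source of randomness
        self.clock = 0.0                 # virtual time: no real sleeping
        self.queue = []                  # pending (due, name, action) events
        self.trace = []                  # the one trace this seed produces

    def schedule(self, delay: float, name: str, action) -> None:
        self.queue.append((self.clock + delay, name, action))

    def run(self, until: float):
        while self.queue and self.clock <= until:
            self.queue.sort(key=lambda e: e[0])
            event = self.queue.pop(0)        # deterministic scheduling order
            self.clock = event[0]            # jump virtual time forward
            if self.rng.random() < 0.2:      # failure injection: drop event
                self.trace.append((self.clock, event[1], "dropped"))
                continue
            self.trace.append((self.clock, event[1], "delivered"))
            event[2](self)                   # let the application react
        return self.trace

def client(sim):                 # environment -> application: a request
    sim.schedule(1.0, "response", lambda s: None)  # the application replies

sim = Simulation(seed=42)
sim.schedule(1.0, "request", client)
print(sim.run(until=10.0))       # same seed, same trace, every single time
```

Because the seed determines everything, re-running with the same seed replays the exact same trace, which is what makes a found bug reproducible; sweeping over seeds explores distinct traces.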
Now, who is doing deterministic simulation testing today? Deterministic simulation testing is still fairly new, and it is also still fairly involved, because making sure that your application itself is deterministic and then building a deterministic environment for it is quite an investment. But a few companies have adopted deterministic simulation testing. For example FoundationDB, widely regarded as the trailblazer in deterministic simulation testing. TigerBeetle, widely regarded as the one that spread the word about deterministic simulation testing. And then we have others, for example Polar Signals, Turso, or my company, Resonate. Each project, though, took a different approach in designing and developing its application and its simulator. So there is very little common ground; there are no ready-made libraries. Additionally, there is Antithesis, the first company that offers deterministic simulation testing as a service, by implementing their own deterministic hypervisor.

Now, if you're curious about deterministic simulation testing and would like to dive deeper into its nuances, and also how to structure a simulator: Joran, the CEO of TigerBeetle, and I will host a webinar this Thursday discussing deterministic simulation testing in depth. So if you're interested, please join us.

As a conclusion: deterministic simulation testing requires a real investment. However, the payoff, based on personal experience, is significant. Deterministic simulation testing enables you to methodically test the behavior space of your system by scaling across space and time, making tests and traces reproducible. So then you get to squash the heisenbugs. And with that, thank you very much for attending my talk. I just want to let you know that Manning Publications offers a discount on my book, if you are interested. Once again, thank you very much, and I hope you enjoy the rest of the conference.

Yes, please. [Audience question, partly inaudible, about Leslie Lamport's work on TLA+.]

That is a very nice question. The question, for anybody who couldn't hear it, was about TLA+, the Temporal Logic of Actions, which is discussed in the book I referred to, Specifying Systems. TLA+ is a language to specify systems, and it comes with a tool for model checking. So it can also methodically test the behavior space of your model. But that is where the difference lies: in TLA+ you specify your system, but it is a model, not the implementation of your system. So there, you model check the model.
With deterministic simulation testing, you model check, if you will, the implementation. So that is, I think, where one can draw a good line, a good mental model for the difference. Yes, please.

[Audience question, inaudible, about how model checkers explore the state space.]

So, for model checking: I am not an expert in implementing model checkers, but as far as I know they have a very methodical exploration of the state space. And on top of that, they can calculate what I believe are called symmetry sets, where one state is equivalent to another, so the checker can completely skip it. For the simulator that we implement, we do not actually implement a search algorithm. Instead, what we fall back on is the small world, or small data, hypothesis: the most interesting side effects come about when you have a lot of need for coordination and contention in your system. So instead of firing off a thousand requests over a thousand accounts, for example, we fire off a thousand requests over ten accounts, to make sure that there is a lot of conflict in the system. With that, we were able to find a lot of the really interesting bugs, even without a guided search. But I know that Antithesis, and I'm not an expert in Antithesis, has a few APIs that you can include in your application to actually guide their search. So with that, you can more methodically find the interesting traces.

Thanks a lot. Thank you.
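To illustrate the small-data idea from that last answer, here is a hypothetical sketch of a workload generator that concentrates a fixed number of requests on very few accounts to maximize contention; the names and numbers are illustrative, not Resonate's actual test harness.

```python
import random

def workload(rng: random.Random, requests: int, accounts: int):
    """Generate `requests` random transfers over `accounts` accounts."""
    return [("transfer", rng.randrange(accounts), rng.randrange(accounts),
             rng.randint(1, 100)) for _ in range(requests)]

rng = random.Random(1)  # seeded, so the generated workload is reproducible
high_contention = workload(rng, requests=1000, accounts=10)
low_contention  = workload(rng, requests=1000, accounts=100_000)
# 1000 requests over 10 accounts conflict constantly and force coordination;
# the same 1000 requests over 100,000 accounts rarely touch the same account.
```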