WEBVTT 00:00.000 --> 00:13.000 What I'm going to show is how we can analyze transactions, do some replication, and do some time travel. 00:13.000 --> 00:15.000 Right? 00:15.000 --> 00:24.000 So, the agenda for today: introduction, some requirements, some design and implementation, and of course open source. 00:24.000 --> 00:29.000 And maybe, if we have time, some questions. I wanted to do a demo, but it's impossible here. 00:29.000 --> 00:32.000 I don't have the Wi-Fi. 00:32.000 --> 00:34.000 So, who am I? 00:34.000 --> 00:35.000 I'm Arno. 00:35.000 --> 00:39.000 I was born in Belgium, so I'm a local here. 00:39.000 --> 00:45.000 And I actually work in Chicago as a database team lead for Jump Trading. 00:45.000 --> 00:48.000 Jump Trading is a trading firm. 00:48.000 --> 00:56.000 And so we trade all kinds of assets on all kinds of horizons. 00:56.000 --> 01:09.000 And so we do research, so we have research facilities like HPC and also cutting-edge technology databases. 01:10.000 --> 01:21.000 So, my talk is about a transactional database talking to ClickHouse, as an example of an OLAP database. 01:21.000 --> 01:26.000 And what we're going to do today is agnostic, basically. 01:26.000 --> 01:33.000 You can do it with MySQL, Postgres, MariaDB; those are very well-known examples of transactional databases. 01:33.000 --> 01:40.000 And we're going to do a heterogeneous replication to a very different kind of database. 01:40.000 --> 01:45.000 ClickHouse is completely different. 01:45.000 --> 01:50.000 And so we're going to try to replicate to a different paradigm today. 01:50.000 --> 01:57.000 So, what is ClickHouse? Do any of you know what ClickHouse is here in the room? Oh yeah, a lot of people. 01:57.000 --> 02:03.000 So, first of all, it's also open source, it's an open source database. 02:03.000 --> 02:09.000 It's different from the transactional ones, which are row-based; it's columnar.
02:09.000 --> 02:23.000 And the most fundamental data structure that is under ClickHouse is the MergeTree data structure, the MergeTree engine. 02:23.000 --> 02:27.000 And ClickHouse was created in Russia in 2009. 02:27.000 --> 02:32.000 It currently has 45.5 thousand stars. 02:32.000 --> 02:37.000 As of now; I verified today, so if someone in the room checks and sees 45.6, that's possible. 02:37.000 --> 02:45.000 So it's extremely popular and it's very good at doing real-time analytics. 02:45.000 --> 02:49.000 If you want to know more about it, you can just scan this QR code. 02:49.000 --> 02:52.000 I will leave you a few seconds to scan it. 02:52.000 --> 02:59.000 But it has over 2,000 contributors and it's a very successful project. 02:59.000 --> 03:03.000 So, that's why I chose it as an example for replication. 03:03.000 --> 03:13.000 And this is an example, a query that I ran today on some ClickHouse instances running on our HPC environment. 03:13.000 --> 03:20.000 I'm not sure everybody sees, but it's a SELECT COUNT(*) with some data aggregation. 03:20.000 --> 03:28.000 And it's on a 100-terabyte dataset; we manage to get 11.3 terabytes per second. 03:28.000 --> 03:34.000 Obviously, it's not reading that much data, but it's an illustration. 03:34.000 --> 03:41.000 Of how fast a columnar database can go if it only reads a subset of the data, just the columns you need. 03:41.000 --> 03:46.000 So, it's pretty impressive. It has 20 nodes and it runs on Apache as well. 03:46.000 --> 03:49.000 So, it's pretty cool stuff. 03:49.000 --> 03:51.000 So, what is replication? 03:51.000 --> 03:55.000 Replication is synchronization between a primary and a replica. 03:55.000 --> 03:57.000 Everybody knows what it is in this room. 03:57.000 --> 04:03.000 Usually, homogeneous replication is easier, and that's the mainstream.
04:03.000 --> 04:10.000 And heterogeneous replication is really challenging because, you know, you need to translate from one paradigm to the other. 04:10.000 --> 04:15.000 From one database to the other, they may not understand the same data types. 04:15.000 --> 04:18.000 They may not have the same time zones. 04:18.000 --> 04:20.000 And so, it's really a challenge. 04:20.000 --> 04:25.000 And what we want to leverage here, there was a talk before on logical replication, 04:25.000 --> 04:28.000 that's really key for this operation. 04:28.000 --> 04:30.000 And of course, it's going to be log-based. 04:30.000 --> 04:37.000 So, we're going to read the binary log in MySQL and we're going to read the WAL, for example, in Postgres. 04:37.000 --> 04:44.000 And this is known as CDC, change data capture. 04:44.000 --> 04:49.000 So, you capture the changes of the database and basically you replicate them to another system. 04:49.000 --> 04:52.000 It can be Kafka, it can be anything. 04:52.000 --> 04:56.000 So, why is replication so useful and so important? 04:56.000 --> 05:00.000 It's because it scales out the reads. 05:00.000 --> 05:06.000 So, you can have one primary and maybe 100 read replicas, and with that you can achieve it. 05:06.000 --> 05:09.000 You can achieve very high read throughput. 05:09.000 --> 05:13.000 It's useful to migrate from one system to another; if you want to migrate from, say, 05:13.000 --> 05:19.000 MariaDB to MySQL, you can just use replication to do that and fail over and not lose any data. 05:19.000 --> 05:24.000 It's good for continuous synchronization between systems, fault tolerance. 05:24.000 --> 05:29.000 If, for example, you have two regions and one is failing, you can fail over to the other region. 05:29.000 --> 05:32.000 And this is impossible without replication. 05:32.000 --> 05:36.000 And so, it's a very important building block. 05:36.000 --> 05:38.000 For this project, what do we need?
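[Editor's note] The change-data-capture flow described above can be sketched in a few lines of Python. This is a simplified, hypothetical model, not Debezium's real event envelope (which carries source metadata, before/after images, and schemas): here each log event is just an operation, a key, and an after-image, applied in log order.

```python
# Minimal sketch of applying ordered CDC events to a replica's state.
# The event shape is hypothetical and illustrative only.

def apply_cdc_events(events):
    """Replay ordered change events into a dict keyed by primary key."""
    state = {}
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            state[key] = event["after"]   # upsert the after-image
        elif op == "delete":
            state.pop(key, None)          # drop the row
    return state

events = [
    {"op": "insert", "key": 1, "after": {"id": 1, "qty": 10}},
    {"op": "update", "key": 1, "after": {"id": 1, "qty": 15}},
    {"op": "insert", "key": 2, "after": {"id": 2, "qty": 7}},
    {"op": "delete", "key": 2, "after": None},
]
print(apply_cdc_events(events))  # {1: {'id': 1, 'qty': 15}}
```

The point is only that replaying the log in order reproduces the source state, which is why log position matters so much later in the talk.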
05:38.000 --> 05:42.000 We really want to not lose any bit of data. 05:42.000 --> 05:47.000 We want to be low latency, and under one second is nice. 05:47.000 --> 05:51.000 We don't want duplicates, so exactly-once delivery. 05:51.000 --> 05:53.000 We want something really simple. 05:53.000 --> 05:58.000 And we want to be able to do fault tolerance, so we want to be able to run that in several 05:58.000 --> 05:59.000 data centers. 05:59.000 --> 06:04.000 And of course, as I said first, fully open source, freedom of choice. 06:04.000 --> 06:06.000 So how did we design this? 06:06.000 --> 06:09.000 We wanted to keep it simple, stupid. 06:09.000 --> 06:13.000 So we just have one container and a YAML file. 06:13.000 --> 06:16.000 And we leverage things that are really working well. 06:16.000 --> 06:18.000 The binary logs. 06:18.000 --> 06:19.000 ANTLR is a nice project. 06:19.000 --> 06:24.000 I saw that Postgres still uses bison, which is like very old. 06:24.000 --> 06:27.000 ANTLR is fantastic. 06:27.000 --> 06:31.000 Also, we use Postgres logical replication. 06:31.000 --> 06:32.000 Yeah. 06:32.000 --> 06:34.000 And then that's it. 06:34.000 --> 06:39.000 And we didn't want to reinvent the wheel doing this, because there are already projects that do that. 06:39.000 --> 06:40.000 And this project is Debezium. 06:40.000 --> 06:42.000 I saw some Debezium stickers. 06:42.000 --> 06:45.000 And I put one on my laptop. 06:45.000 --> 06:48.000 So Debezium is developed by Red Hat. 06:48.000 --> 06:51.000 And it's a very nice open source project for change data 06:51.000 --> 06:52.000 capture. 06:52.000 --> 06:59.000 And why we picked it is because we get access to a lot of sources. 06:59.000 --> 07:07.000 And so for this design, we also wanted to have one-to-one table mapping. 07:07.000 --> 07:11.000 We wanted to leverage the primary key. 07:11.000 --> 07:12.000 Yeah. 07:12.000 --> 07:14.000 To replicate, it's better to have primary keys.
07:14.000 --> 07:19.000 We wanted to convert the data types without data loss. 07:19.000 --> 07:23.000 And I mentioned already the MergeTree engine. 07:23.000 --> 07:30.000 But in this particular context, it turns out that the ReplacingMergeTree engine was the best. 07:30.000 --> 07:31.000 Why? 07:31.000 --> 07:37.000 Because it turns all operations into inserts. 07:37.000 --> 07:40.000 So if you have an insert, it inserts. 07:40.000 --> 07:45.000 If you have an update, you can just increase the version. 07:45.000 --> 07:54.000 And the ReplacingMergeTree will only show you the latest version for the primary key when you're reading. 07:54.000 --> 07:59.000 By default it doesn't; you can use a setting called final. 07:59.000 --> 08:02.000 And you can have that by default in your profile. 08:02.000 --> 08:08.000 And so it does, similar to what Iceberg does on some things, merge-on-read. 08:08.000 --> 08:10.000 And it does it very quickly. 08:10.000 --> 08:15.000 And over time, those operations have improved. 08:15.000 --> 08:19.000 The performance has improved dramatically in the latest versions of ClickHouse. 08:19.000 --> 08:28.000 And also, for the story, I was the one who introduced the final setting 08:28.000 --> 08:32.000 a few years back when we worked on this project. 08:32.000 --> 08:39.000 And one other thing that we wanted is that the same query should run and return the same result. 08:39.000 --> 08:45.000 Also something that's not very common with CDC is we wanted to have full DDL support. 08:45.000 --> 08:48.000 So it means that if you create a table, it should work. 08:48.000 --> 08:49.000 If you rename a table, it should work. 08:49.000 --> 08:52.000 If you drop it, etc., which is not easy. 08:52.000 --> 08:55.000 So I can explain how we did that. 08:55.000 --> 09:00.000 And of course, it's challenging because of the time zones: you may have a server that is in UTC 09:00.000 --> 09:02.000 and other servers in Chicago.
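[Editor's note] The merge-on-read behavior described above can be modeled simply: every write is an insert carrying a version, a read with final keeps only the highest-version row per key, and rows flagged as deleted are hidden. This Python sketch is illustrative (column names like is_deleted follow the talk; it is not ClickHouse itself).

```python
# Sketch of ReplacingMergeTree semantics with a version column and an
# is_deleted flag: inserts, updates, and deletes all become inserts;
# reading "FINAL" resolves them to the latest visible row per key.

def read_final(rows):
    """Simulate SELECT ... FINAL over (key, version, is_deleted, value) rows."""
    latest = {}
    for row in rows:
        key = row["key"]
        if key not in latest or row["version"] > latest[key]["version"]:
            latest[key] = row
    # Highest-version rows marked deleted are hidden from the result.
    return {k: r["value"] for k, r in latest.items() if not r["is_deleted"]}

rows = [
    {"key": "a", "version": 1, "is_deleted": False, "value": 100},  # insert
    {"key": "a", "version": 2, "is_deleted": False, "value": 150},  # update
    {"key": "b", "version": 1, "is_deleted": False, "value": 42},   # insert
    {"key": "b", "version": 2, "is_deleted": True,  "value": 42},   # delete
]
print(read_final(rows))  # {'a': 150}
```

This is exactly why an OLTP stream of updates and deletes can be stored as an insert-only workload and still read back consistently.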
09:02.000 --> 09:03.000 And you don't want to lose. 09:03.000 --> 09:08.000 You don't want to corrupt the data types there, especially timestamps. 09:08.000 --> 09:11.000 And dates. 09:11.000 --> 09:16.000 So to design this, we decided to replicate the state in the target. 09:16.000 --> 09:21.000 As it's done usually for MySQL or for Postgres. 09:21.000 --> 09:27.000 And we wanted to have an interface that would be super simple, like MySQL replication. 09:27.000 --> 09:36.000 So show replica status, start replica and stop replica, and so on, as a command line. 09:36.000 --> 09:44.000 We wanted it to retry on failure, because if there's a failure, you don't want to corrupt your data. 09:44.000 --> 09:49.000 So you retry until the error is fixed. 09:49.000 --> 10:00.000 We have replication filters: we can choose a table, we can choose a database, we can choose several tables in the database. 10:00.000 --> 10:08.000 Very importantly too, we have checks on this project to verify that what we do doesn't lose data, because that's one of the goals. 10:08.000 --> 10:20.000 And also we developed efficient dumpers and loaders that can dump terabytes of data and restore terabytes of data in sometimes minutes. 10:20.000 --> 10:26.000 And so we did that using tools that already exist, like the MySQL Shell. 10:26.000 --> 10:29.000 I think it's possible with pg_dump as well. 10:29.000 --> 10:35.000 So in terms of architecture, what I'm going to present is the component in the middle. 10:35.000 --> 10:43.000 So I already told you, transactional databases on the left: MySQL is one of them, MariaDB, etc. 10:43.000 --> 10:56.000 It works also with Oracle, it works with SQL Server, because as long as Debezium is going to understand it, it can turn it into events that the connector can apply. 10:56.000 --> 11:06.000 And so, keep it simple, stupid: a single binary, a single JAR, because it's developed in Java, and it goes to ClickHouse. 11:06.000 --> 11:08.000 So the project is there.
11:08.000 --> 11:16.000 It's called the Altinity Sink Connector because we developed it with Altinity, a provider of support for ClickHouse. 11:16.000 --> 11:22.000 And so, as I said, only one container; Docker Compose can run it. 11:22.000 --> 11:28.000 You can also run it as a service. 11:28.000 --> 11:36.000 It works with multiple threads, just like MySQL replication. 11:36.000 --> 11:41.000 What we also wanted to have is something that was not fully transactional, because it's impossible. 11:41.000 --> 11:45.000 ClickHouse is super performant, but by default it's not transactional. 11:45.000 --> 11:54.000 So what you want is that eventually, if you wait for replication to catch up, it gives you the same data. 11:54.000 --> 12:03.000 And also what we wanted is low latency, and we achieved less-than-a-second replication. 12:04.000 --> 12:15.000 So in terms of table structure, you have MySQL on one side, MySQL as an example, it could be another database. 12:15.000 --> 12:23.000 And the way we translate it is, of course, we need to translate the name, the data types. 12:23.000 --> 12:31.000 I already mentioned the ReplacingMergeTree engine as the target, which here replaces InnoDB, with a primary key. 12:31.000 --> 12:41.000 The system, because of ANTLR and the parsing, understands the partitions and can turn the partition list here into a partitioning clause. 12:41.000 --> 12:53.000 The uniqueness is achieved with the ORDER BY here; so ORDER BY is the equivalent of the primary key, but there can be duplicates in ClickHouse. 12:53.000 --> 13:03.000 There are no duplicates if you run final, knowing that you have a version and knowing that you can delete data with is_deleted. 13:03.000 --> 13:09.000 So that's how we did it, and that's how I think the competition and similar projects are doing it.
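[Editor's note] The one-to-one table mapping described above can be sketched as a small translation step. This is a hedged illustration, not the connector's actual code: the type map is deliberately tiny, and the generated DDL (with _version and is_deleted columns feeding ReplacingMergeTree, and ORDER BY standing in for the primary key) only shows the shape of the idea.

```python
# Hypothetical sketch: translate a simplified MySQL column list into a
# ClickHouse ReplacingMergeTree DDL string. The type map and the exact
# generated DDL are illustrative assumptions, not the Altinity Sink
# Connector's real output.

TYPE_MAP = {"int": "Int32", "bigint": "Int64", "varchar": "String",
            "datetime": "DateTime64(3)"}

def to_clickhouse_ddl(table, columns, primary_key):
    """Build a ReplacingMergeTree CREATE TABLE from (name, mysql_type) pairs."""
    cols = ",\n  ".join(f"`{name}` {TYPE_MAP[t]}" for name, t in columns)
    return (
        f"CREATE TABLE `{table}` (\n  {cols},\n"
        "  `_version` UInt64,\n  `is_deleted` UInt8\n"
        ") ENGINE = ReplacingMergeTree(_version, is_deleted)\n"
        f"ORDER BY ({', '.join(primary_key)})"
    )

ddl = to_clickhouse_ddl(
    "orders",
    [("id", "bigint"), ("ts", "datetime"), ("sym", "varchar")],
    ["id"],
)
print(ddl)
```

The real connector does much more (ANTLR-based DDL parsing, partition clauses, nullable and decimal handling), but the mapping of source primary key to ORDER BY plus version and delete-flag columns is the core of the design.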
13:09.000 --> 13:21.000 There are not many different ways to do this; especially if you want high performance, you need to turn an OLTP workload into an insert workload, an insert-only workload, 13:21.000 --> 13:29.000 and then resolve the differences at query time, but also at merge time, because as the name indicates, it's a merge tree. 13:29.000 --> 13:38.000 So if you have two parts, or two partitions, or two parts, they can merge together and they can remove the duplicates for you. 13:38.000 --> 13:47.000 So if you finalize that, a merge is going to trigger that, and then you are guaranteed to have zero duplicates eventually. 13:47.000 --> 14:05.000 So the talk would not be complete if I didn't mention what we did recently with the history, because you have all the changes coming through the transaction log, which is like the binary log or the WAL. 14:05.000 --> 14:25.000 And so we figured out that we could use that, we could add a history mode, because if you replicate into a different schema, not like-for-like but into a history schema, you can have the full transaction log as a big fat table, and this is very useful for analytics. 14:25.000 --> 14:39.000 And you can also create one history table per table in a separate schema, and this gives you the possibility to know what the state of the table was yesterday or a week ago. 14:39.000 --> 14:51.000 And you still have the current version, so it's still synchronized as in the previous design, and this is called slowly changing dimension type 2. 14:51.000 --> 15:02.000 So I'm going to show you the transaction log here, so just a simple table with all the events, and you see the before image and the after image. 15:02.000 --> 15:08.000 It's also ReplacingMergeTree, because if we rewind, we don't want duplicates in there.
15:08.000 --> 15:23.000 And then we figured out, for this particular database, what would be the unique key, and so here it's the server ID, the log file, the position, some sequence number, and we have the time, because this is a time series. 15:23.000 --> 15:44.000 And yesterday I presented the graphs and dashboards on that; you can run AI on top of this table, and so you can learn a lot about your workload by analyzing it, by turning something that is just transient in a regular system into something that you can persist, and where you can run analysis on top of it. 15:44.000 --> 15:50.000 And of course you can have a TTL, because it has a TTL, so after 30 days you can discard it. 15:50.000 --> 15:57.000 Okay, so how do we do time travel? 15:57.000 --> 16:05.000 We do time travel this way; well, this is another slide that shows that you can even compress the data. 16:05.000 --> 16:14.000 So if you want to compress using Zstandard, you can add a recompression clause, and the slide on time travel is just here. 16:14.000 --> 16:23.000 For the history table, I already mentioned the version and is_deleted for the RMT to get zero duplicates on the primary key. 16:23.000 --> 16:31.000 But if you add just three columns, actually two, valid_from and valid_to, you can rebuild the state of the table over time. 16:31.000 --> 16:40.000 And so that's called a temporal table; some databases have them, you know, within the source database itself. 16:40.000 --> 16:46.000 Here we show that it was smarter to put them where we had no choice; the applications were already there. 16:46.000 --> 16:57.000 But if you treat them as an analytics table that you can have in an analytics database, then you can reconstruct the state of the table. 16:57.000 --> 17:03.000 And so we have the same primary key, but we use valid_to in the primary key. 17:03.000 --> 17:10.000 And also, we partition not by the original partitioning order, but we partition by valid_to.
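[Editor's note] The slowly-changing-dimension type 2 time travel described above can be sketched directly: each history row carries valid_from and valid_to, and the state of the table at time t is the set of rows whose validity interval contains t. Column names and the timestamp representation here are illustrative.

```python
# Sketch of SCD type 2 time travel: reconstruct the table's state at
# any point in time from valid_from / valid_to intervals. The current
# version of a row has an open-ended valid_to (here, infinity).

def state_as_of(history, t):
    """Return {key: value} as the table looked at time t."""
    return {r["key"]: r["value"]
            for r in history
            if r["valid_from"] <= t < r["valid_to"]}

INF = float("inf")
history = [
    {"key": 1, "value": "v1", "valid_from": 10, "valid_to": 20},   # old version
    {"key": 1, "value": "v2", "valid_from": 20, "valid_to": INF},  # current
    {"key": 2, "value": "w1", "valid_from": 15, "valid_to": INF},  # current
]
print(state_as_of(history, 12))  # {1: 'v1'}
print(state_as_of(history, 25))  # {1: 'v2', 2: 'w1'}
```

Because current rows all share the open-ended valid_to, partitioning by valid_to keeps them together in the last partition, and a TTL on valid_to expires the historical partitions, which is the space-control trick mentioned next.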
17:10.000 --> 17:18.000 This way, valid_to being the future, we always have the current version in the last partition. 17:18.000 --> 17:27.000 And then, as the data updates, it trickles down into the past partitions, which will have different valid_to values. 17:27.000 --> 17:32.000 And then you can run a TTL on that, which is very, very interesting. 17:32.000 --> 17:36.000 So you're not going to run out of space with those kinds of history tables. 17:36.000 --> 17:52.000 And of course, this is fantastic: the TTL, after one day, not only saves you space, but also removes all the intermediate versions that you may have and deletes the data. 17:52.000 --> 18:04.000 Okay, so, last slide about the development status of this; there's a QR code if you want to know more about the project, it has 300 stars. 18:04.000 --> 18:08.000 And so I'm glad that I can present it here. 18:08.000 --> 18:18.000 One-to-one table mapping works, it's GA; development started four years ago, so it's bulletproof, we use it in production. 18:18.000 --> 18:24.000 The history table is quite new and it's in beta state at the moment. 18:24.000 --> 18:37.000 And I wanted to mention that there are similar open source projects that appeared later; PeerDB, for example, was launched last year. 18:37.000 --> 18:42.000 So if it doesn't work for you, you have open source alternatives as well. 18:42.000 --> 18:49.000 So, no demo, and I have time for some questions; I think I have two minutes. 18:49.000 --> 18:52.000 [Audience] How do you deal with the initial table sync? 18:56.000 --> 18:58.000 [inaudible] 18:58.000 --> 19:06.000 Oh yeah, I mentioned that; well, you can take a consistent snapshot in any database, like Postgres or MySQL. 19:06.000 --> 19:14.000 And then you restore that, and then you start replication from the position that's within that dump.
19:14.000 --> 19:26.000 MySQL would give you that position, and you would have the LSN or the transaction number in Postgres; that's a very important building block. 19:26.000 --> 19:33.000 If you don't want to lose data, you need to know exactly where to start. 19:33.000 --> 19:35.000 So yeah, it's supported, of course. 19:35.000 --> 19:38.000 And for a very big database, another question? 19:38.000 --> 19:43.000 [Audience] Don't you then overload the source database? 19:43.000 --> 19:47.000 No, because you can use a replica to do that. 19:47.000 --> 19:49.000 Yeah, you can use a replica, so. 19:49.000 --> 19:51.000 Yeah, replication is super important. 19:51.000 --> 19:52.000 Yes. 19:52.000 --> 19:54.000 No other questions? 19:54.000 --> 19:55.000 Oh, you? 19:55.000 --> 19:57.000 Yeah. 19:57.000 --> 20:00.000 Sorry. 20:00.000 --> 20:05.000 [Audience question, inaudible.] I completely discard them, because it's eventually consistent. 20:08.000 --> 20:10.000 Thank you.