WEBVTT 00:00.000 --> 00:13.000 What I'm going to show is how we can analyze transactions, do some replication, and do some time travel. 00:13.000 --> 00:15.000 Right? 00:15.000 --> 00:24.000 So, the agenda for today: introduction, some requirements, some design and implementation, and of course open source. 00:24.000 --> 00:29.000 And maybe, if we have time, some questions. I wanted to do a demo, but it's impossible here. 00:29.000 --> 00:32.000 I don't have the Wi-Fi. 00:32.000 --> 00:34.000 So, who am I? 00:34.000 --> 00:35.000 I'm Arno. 00:35.000 --> 00:39.000 I was born in Belgium, so I'm a local here. 00:39.000 --> 00:45.000 And I actually work in Chicago as a database team lead for Jump Trading. 00:45.000 --> 00:48.000 Jump Trading is a trading firm. 00:48.000 --> 00:56.000 And so we trade all kinds of assets on all kinds of horizons. 00:56.000 --> 01:09.000 And so we do research, so we have research facilities like HPC and also cutting-edge technology databases. 01:10.000 --> 01:21.000 So, my talk is about a transactional database talking to ClickHouse, as an example of an OLAP database. 01:21.000 --> 01:26.000 And what we're going to do today is agnostic, basically. 01:26.000 --> 01:33.000 You can do it with MySQL, Postgres, MariaDB; those are very well-known examples of transactional databases. 01:33.000 --> 01:40.000 And we're going to do a heterogeneous replication to a very different kind of database. 01:40.000 --> 01:45.000 ClickHouse is completely different. 01:45.000 --> 01:50.000 And so we're going to try to replicate to a different paradigm today. 01:50.000 --> 01:57.000 So, what is ClickHouse? Do any of you know what ClickHouse is here in the room? Oh yeah, a lot of people. 01:57.000 --> 02:03.000 So, first of all, it's also open source, it's an open source database. 02:03.000 --> 02:09.000 It's different from the transactional ones, which are row-based; it's columnar.
02:09.000 --> 02:23.000 And the most fundamental data structure that is under ClickHouse is the MergeTree data structure, the MergeTree engine. 02:23.000 --> 02:27.000 And ClickHouse was created in Russia in 2009. 02:27.000 --> 02:32.000 It currently has 45.5 thousand stars. 02:32.000 --> 02:37.000 As of now; I verified today, so if someone in the room checks and sees 45.6, that's possible. 02:37.000 --> 02:45.000 So it's extremely popular and it's very good at doing real-time analytics. 02:45.000 --> 02:49.000 If you want to know more about it, you can just scan this QR code. 02:49.000 --> 02:52.000 I will leave you a few seconds to scan it. 02:52.000 --> 02:59.000 But it has over 2,000 contributors and it's a very successful project. 02:59.000 --> 03:03.000 So, that's why I chose it as an example for replication. 03:03.000 --> 03:13.000 And this is an example, a query that I ran today on some ClickHouse instances running on our HPC environment. 03:13.000 --> 03:20.000 I'm not sure everybody sees, but it's a SELECT COUNT(*) with some data aggregation. 03:20.000 --> 03:28.000 And it's on a 100-terabyte dataset; we manage to get 11.3 terabytes per second. 03:28.000 --> 03:34.000 Obviously, it's not reading that much data, but it's an illustration. 03:34.000 --> 03:41.000 Of how fast a columnar database can go if it only reads a subset of the data, just the columns you need. 03:41.000 --> 03:46.000 So, it's pretty impressive. It has 20 nodes and it runs on Apache as well. 03:46.000 --> 03:49.000 So, it's pretty cool stuff. 03:49.000 --> 03:51.000 So, what is replication? 03:51.000 --> 03:55.000 Replication is synchronization between a primary and a replica. 03:55.000 --> 03:57.000 Everybody knows what it is in this room. 03:57.000 --> 04:03.000 Usually, homogeneous replication is easier, and that's the mainstream.
04:03.000 --> 04:10.000 And heterogeneous replication is really challenging because, you know, you need to translate from one paradigm to the other. 04:10.000 --> 04:15.000 From one database to the other, they may not understand the same data types. 04:15.000 --> 04:18.000 They may not have the same time zones. 04:18.000 --> 04:20.000 And so, it's really a challenge. 04:20.000 --> 04:25.000 And what we want to leverage here, there was a talk before on logical replication, 04:25.000 --> 04:28.000 that's really key for this operation. 04:28.000 --> 04:30.000 And of course, it's going to be log-based. 04:30.000 --> 04:37.000 So, we're going to read the binary log in MySQL and we're going to read the WAL, for example, in Postgres. 04:37.000 --> 04:44.000 And this is known as CDC, change data capture. 04:44.000 --> 04:49.000 So, you capture the changes of the database and basically you replicate them to another system. 04:49.000 --> 04:52.000 It can be Kafka, it can be anything. 04:52.000 --> 04:56.000 So, why is replication so useful and so important? 04:56.000 --> 05:00.000 It's because it scales out the reads. 05:00.000 --> 05:06.000 So, you can have one primary and maybe 100 read replicas, and with that you can achieve it. 05:06.000 --> 05:09.000 You can achieve very high read throughput. 05:09.000 --> 05:13.000 It's useful to migrate from one system to another; if you want to migrate from, say, 05:13.000 --> 05:19.000 MariaDB to MySQL, you can just use replication to do that and fail over and not lose any data. 05:19.000 --> 05:24.000 It's good for continuous synchronization between systems, fault tolerance. 05:24.000 --> 05:29.000 If, for example, you have two regions and one is failing, you can fail over to the other region. 05:29.000 --> 05:32.000 And this is impossible without replication. 05:32.000 --> 05:36.000 And so, it's a very important building block. 05:36.000 --> 05:38.000 For this project, what do we need?
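[Editor's note] The change-data-capture flow described above can be sketched in a few lines of Python. This is a simplified, hypothetical model, not Debezium's real event envelope (which carries source metadata, before/after images, and schemas): here each log event is just an operation, a key, and an after-image, applied in log order.

```python
# Minimal sketch of applying ordered CDC events to a replica's state.
# The event shape is hypothetical and illustrative only.

def apply_cdc_events(events):
    """Replay ordered change events into a dict keyed by primary key."""
    state = {}
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            state[key] = event["after"]   # upsert the after-image
        elif op == "delete":
            state.pop(key, None)          # drop the row
    return state

events = [
    {"op": "insert", "key": 1, "after": {"id": 1, "qty": 10}},
    {"op": "update", "key": 1, "after": {"id": 1, "qty": 15}},
    {"op": "insert", "key": 2, "after": {"id": 2, "qty": 7}},
    {"op": "delete", "key": 2, "after": None},
]
print(apply_cdc_events(events))  # {1: {'id': 1, 'qty': 15}}
```

The point is only that replaying the log in order reproduces the source state, which is why log position matters so much later in the talk.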
05:38.000 --> 05:42.000 We really want to not lose any bit of data. 05:42.000 --> 05:47.000 We want to be low latency, and under one second is nice. 05:47.000 --> 05:51.000 We don't want duplicates, so exactly-once delivery. 05:51.000 --> 05:53.000 We want something really simple. 05:53.000 --> 05:58.000 And we want to be able to do fault tolerance, so we want to be able to run that in several 05:58.000 --> 05:59.000 data centers. 05:59.000 --> 06:04.000 And of course, as I said first, fully open source, freedom of choice. 06:04.000 --> 06:06.000 So how did we design this? 06:06.000 --> 06:09.000 We wanted to keep it simple, stupid. 06:09.000 --> 06:13.000 So we just have one container and a YAML file. 06:13.000 --> 06:16.000 And we leverage things that are really working well. 06:16.000 --> 06:18.000 The binary logs. 06:18.000 --> 06:19.000 ANTLR is a nice project. 06:19.000 --> 06:24.000 I saw that Postgres still uses bison, which is like very old. 06:24.000 --> 06:27.000 ANTLR is fantastic. 06:27.000 --> 06:31.000 Also, we use Postgres logical replication. 06:31.000 --> 06:32.000 Yeah. 06:32.000 --> 06:34.000 And then that's it. 06:34.000 --> 06:39.000 And we didn't want to reinvent the wheel doing this, because there are already projects that do that. 06:39.000 --> 06:40.000 And this project is Debezium. 06:40.000 --> 06:42.000 I saw some Debezium stickers. 06:42.000 --> 06:45.000 And I put one on my laptop. 06:45.000 --> 06:48.000 So Debezium is developed by Red Hat. 06:48.000 --> 06:51.000 And it's a very nice open source project for change data 06:51.000 --> 06:52.000 capture. 06:52.000 --> 06:59.000 And why we picked it is because we get access to a lot of sources. 06:59.000 --> 07:07.000 And so for this design, we also wanted to have one-to-one table mapping. 07:07.000 --> 07:11.000 We wanted to leverage the primary key. 07:11.000 --> 07:12.000 Yeah. 07:12.000 --> 07:14.000 To replicate, it's better to have primary keys.
07:14.000 --> 07:19.000 We wanted to convert the data types without data loss. 07:19.000 --> 07:23.000 And I mentioned already the MergeTree engine. 07:23.000 --> 07:30.000 But in this particular context, it turns out that the ReplacingMergeTree engine was the best. 07:30.000 --> 07:31.000 Why? 07:31.000 --> 07:37.000 Because it turns all operations into inserts. 07:37.000 --> 07:40.000 So if you have an insert, it inserts. 07:40.000 --> 07:45.000 If you have an update, you can just increase the version. 07:45.000 --> 07:54.000 And the ReplacingMergeTree will only show you the latest version for the primary key when you're reading. 07:54.000 --> 07:59.000 By default it doesn't; you can use a setting called final. 07:59.000 --> 08:02.000 And you can have that by default in your profile. 08:02.000 --> 08:08.000 And so it does, similar to what Iceberg does on some things, merge-on-read. 08:08.000 --> 08:10.000 And it does it very quickly. 08:10.000 --> 08:15.000 And over time, those operations have improved. 08:15.000 --> 08:19.000 The performance has improved dramatically in the latest versions of ClickHouse. 08:19.000 --> 08:28.000 And also, for the story, I was the one who introduced the final setting 08:28.000 --> 08:32.000 a few years back when we worked on this project. 08:32.000 --> 08:39.000 And one other thing that we wanted is that the same query should run and return the same result. 08:39.000 --> 08:45.000 Also something that's not very common with CDC is we wanted to have full DDL support. 08:45.000 --> 08:48.000 So it means that if you create a table, it should work. 08:48.000 --> 08:49.000 If you rename a table, it should work. 08:49.000 --> 08:52.000 If you drop it, etc., which is not easy. 08:52.000 --> 08:55.000 So I can explain how we did that. 08:55.000 --> 09:00.000 And of course, it's challenging because of the time zones: you may have a server that is in UTC 09:00.000 --> 09:02.000 and other servers in Chicago.
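[Editor's note] The merge-on-read behavior described above can be modeled simply: every write is an insert carrying a version, a read with final keeps only the highest-version row per key, and rows flagged as deleted are hidden. This Python sketch is illustrative (column names like is_deleted follow the talk; it is not ClickHouse itself).

```python
# Sketch of ReplacingMergeTree semantics with a version column and an
# is_deleted flag: inserts, updates, and deletes all become inserts;
# reading "FINAL" resolves them to the latest visible row per key.

def read_final(rows):
    """Simulate SELECT ... FINAL over (key, version, is_deleted, value) rows."""
    latest = {}
    for row in rows:
        key = row["key"]
        if key not in latest or row["version"] > latest[key]["version"]:
            latest[key] = row
    # Highest-version rows marked deleted are hidden from the result.
    return {k: r["value"] for k, r in latest.items() if not r["is_deleted"]}

rows = [
    {"key": "a", "version": 1, "is_deleted": False, "value": 100},  # insert
    {"key": "a", "version": 2, "is_deleted": False, "value": 150},  # update
    {"key": "b", "version": 1, "is_deleted": False, "value": 42},   # insert
    {"key": "b", "version": 2, "is_deleted": True,  "value": 42},   # delete
]
print(read_final(rows))  # {'a': 150}
```

This is exactly why an OLTP stream of updates and deletes can be stored as an insert-only workload and still read back consistently.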
09:02.000 --> 09:03.000 And you don't want to lose. 09:03.000 --> 09:08.000 You don't want to corrupt the data types there, especially timestamps. 09:08.000 --> 09:11.000 And dates. 09:11.000 --> 09:16.000 So to design this, we decided to replicate the state in the target. 09:16.000 --> 09:21.000 As it's done usually for MySQL or for Postgres. 09:21.000 --> 09:27.000 And we wanted to have an interface that would be super simple, like MySQL replication. 09:27.000 --> 09:36.000 So show replica status, start replica and stop replica, and so on, as a command line. 09:36.000 --> 09:44.000 We wanted it to retry on failure, because if there's a failure, you don't want to corrupt your data. 09:44.000 --> 09:49.000 So you retry until the error is fixed. 09:49.000 --> 10:00.000 We have replication filters: we can choose a table, we can choose a database, we can choose several tables in the database. 10:00.000 --> 10:08.000 Very importantly too, we have checks on this project to verify that what we do doesn't lose data, because that's one of the goals. 10:08.000 --> 10:20.000 And also we developed efficient dumpers and loaders that can dump terabytes of data and restore terabytes of data in sometimes minutes. 10:20.000 --> 10:26.000 And so we did that using tools that already exist, like the MySQL Shell. 10:26.000 --> 10:29.000 I think it's possible with pg_dump as well. 10:29.000 --> 10:35.000 So in terms of architecture, what I'm going to present is the component in the middle. 10:35.000 --> 10:43.000 So I already told you, transactional databases on the left: MySQL is one of them, MariaDB, etc. 10:43.000 --> 10:56.000 It works also with Oracle, it works with SQL Server, because as long as Debezium is going to understand it, it can turn it into events that the connector can apply. 10:56.000 --> 11:06.000 And so, keep it simple, stupid: a single binary, a single JAR, because it's developed in Java, and it goes to ClickHouse. 11:06.000 --> 11:08.000 So the project is there.
11:08.000 --> 11:16.000 It's called the Altinity Sink Connector because we developed it with Altinity, a provider of support for ClickHouse. 11:16.000 --> 11:22.000 And so, as I said, only one container; Docker Compose can run it. 11:22.000 --> 11:28.000 You can also run it as a service. 11:28.000 --> 11:36.000 It works with multiple threads, just like MySQL replication. 11:36.000 --> 11:41.000 What we also wanted to have is something that was not fully transactional, because it's impossible. 11:41.000 --> 11:45.000 ClickHouse is super performant, but by default it's not transactional. 11:45.000 --> 11:54.000 So what you want is that eventually, if you wait for replication to catch up, it gives you the same data. 11:54.000 --> 12:03.000 And also what we wanted is low latency, and we achieved less-than-a-second replication. 12:04.000 --> 12:15.000 So in terms of table structure, you have MySQL on one side, MySQL as an example, it could be another database. 12:15.000 --> 12:23.000 And the way we translate it is, of course, we need to translate the name, the data types. 12:23.000 --> 12:31.000 I already mentioned the ReplacingMergeTree engine as the target, which here replaces InnoDB, with a primary key. 12:31.000 --> 12:41.000 The system, because of ANTLR and the parsing, understands the partitions and can turn the partition list here into a partitioning clause. 12:41.000 --> 12:53.000 The uniqueness is achieved with the ORDER BY here; so ORDER BY is the equivalent of the primary key, but there can be duplicates in ClickHouse. 12:53.000 --> 13:03.000 There are no duplicates if you run final, knowing that you have a version and knowing that you can delete data with is_deleted. 13:03.000 --> 13:09.000 So that's how we did it, and that's how I think the competition and similar projects are doing it.
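[Editor's note] The one-to-one table mapping described above can be sketched as a small translation step. This is a hedged illustration, not the connector's actual code: the type map is deliberately tiny, and the generated DDL (with _version and is_deleted columns feeding ReplacingMergeTree, and ORDER BY standing in for the primary key) only shows the shape of the idea.

```python
# Hypothetical sketch: translate a simplified MySQL column list into a
# ClickHouse ReplacingMergeTree DDL string. The type map and the exact
# generated DDL are illustrative assumptions, not the Altinity Sink
# Connector's real output.

TYPE_MAP = {"int": "Int32", "bigint": "Int64", "varchar": "String",
            "datetime": "DateTime64(3)"}

def to_clickhouse_ddl(table, columns, primary_key):
    """Build a ReplacingMergeTree CREATE TABLE from (name, mysql_type) pairs."""
    cols = ",\n  ".join(f"`{name}` {TYPE_MAP[t]}" for name, t in columns)
    return (
        f"CREATE TABLE `{table}` (\n  {cols},\n"
        "  `_version` UInt64,\n  `is_deleted` UInt8\n"
        ") ENGINE = ReplacingMergeTree(_version, is_deleted)\n"
        f"ORDER BY ({', '.join(primary_key)})"
    )

ddl = to_clickhouse_ddl(
    "orders",
    [("id", "bigint"), ("ts", "datetime"), ("sym", "varchar")],
    ["id"],
)
print(ddl)
```

The real connector does much more (ANTLR-based DDL parsing, partition clauses, nullable and decimal handling), but the mapping of source primary key to ORDER BY plus version and delete-flag columns is the core of the design.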
13:09.000 --> 13:21.000 There are not many different ways to do this; especially if you want high performance, you need to turn an OLTP workload into an insert workload, an insert-only workload, 13:21.000 --> 13:29.000 and then resolve the differences at query time, but also at merge time, because as the name indicates, it's a merge tree. 13:29.000 --> 13:38.000 So if you have two parts, or two partitions, or two parts, they can merge together and they can remove the duplicates for you. 13:38.000 --> 13:47.000 So if you finalize that, a merge is going to trigger that, and then you are guaranteed to have zero duplicates eventually. 13:47.000 --> 14:05.000 So the talk would not be complete if I didn't mention what we did recently with the history, because you have all the changes coming through the transaction log, which is like the binary log or the WAL. 14:05.000 --> 14:25.000 And so we figured out that we could use that, we could add a history mode, because if you replicate into a different schema, not like-for-like but into a history schema, you can have the full transaction log as a big fat table, and this is very useful for analytics. 14:25.000 --> 14:39.000 And you can also create one history table per table in a separate schema, and this gives you the possibility to know what the state of the table was yesterday or a week ago. 14:39.000 --> 14:51.000 And you still have the current version, so it's still synchronized as in the previous design, and this is called slowly changing dimension type 2. 14:51.000 --> 15:02.000 So I'm going to show you the transaction log here, so just a simple table with all the events, and you see the before image and the after image. 15:02.000 --> 15:08.000 It's also ReplacingMergeTree, because if we rewind, we don't want duplicates in there.
15:08.000 --> 15:23.000 And then we figured out, for this particular database, what would be the unique key, and so here it's the server ID, the log file, the position, some sequence number, and we have the time, because this is a time series. 15:23.000 --> 15:44.000 And yesterday I presented the graphs and dashboards on that; you can run AI on top of this table, and so you can learn a lot about your workload by analyzing it, by turning something that is just transient in a regular system into something that you can persist, and where you can run analysis on top of it. 15:44.000 --> 15:50.000 And of course you can have a TTL, because it has a TTL, so after 30 days you can discard it. 15:50.000 --> 15:57.000 Okay, so how do we do time travel? 15:57.000 --> 16:05.000 We do time travel this way; well, this is another slide that shows that you can even compress the data. 16:05.000 --> 16:14.000 So if you want to compress using Zstandard, you can add a recompression clause, and the slide on time travel is just here. 16:14.000 --> 16:23.000 For the history table, I already mentioned the version and is_deleted for the RMT to get zero duplicates on the primary key. 16:23.000 --> 16:31.000 But if you add just three columns, actually two, valid_from and valid_to, you can rebuild the state of the table over time. 16:31.000 --> 16:40.000 And so that's called a temporal table; some databases have them, you know, within the source database itself. 16:40.000 --> 16:46.000 Here we show that it was smarter to put them where we had no choice; the applications were already there. 16:46.000 --> 16:57.000 But if you treat them as an analytics table that you can have in an analytics database, then you can reconstruct the state of the table. 16:57.000 --> 17:03.000 And so we have the same primary key, but we use valid_to in the primary key. 17:03.000 --> 17:10.000 And also, we partition not by the original partitioning order, but we partition by valid_to.
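[Editor's note] The slowly-changing-dimension type 2 time travel described above can be sketched directly: each history row carries valid_from and valid_to, and the state of the table at time t is the set of rows whose validity interval contains t. Column names and the timestamp representation here are illustrative.

```python
# Sketch of SCD type 2 time travel: reconstruct the table's state at
# any point in time from valid_from / valid_to intervals. The current
# version of a row has an open-ended valid_to (here, infinity).

def state_as_of(history, t):
    """Return {key: value} as the table looked at time t."""
    return {r["key"]: r["value"]
            for r in history
            if r["valid_from"] <= t < r["valid_to"]}

INF = float("inf")
history = [
    {"key": 1, "value": "v1", "valid_from": 10, "valid_to": 20},   # old version
    {"key": 1, "value": "v2", "valid_from": 20, "valid_to": INF},  # current
    {"key": 2, "value": "w1", "valid_from": 15, "valid_to": INF},  # current
]
print(state_as_of(history, 12))  # {1: 'v1'}
print(state_as_of(history, 25))  # {1: 'v2', 2: 'w1'}
```

Because current rows all share the open-ended valid_to, partitioning by valid_to keeps them together in the last partition, and a TTL on valid_to expires the historical partitions, which is the space-control trick mentioned next.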
17:10.000 --> 17:18.000 This way, valid_to being the future, we always have the current version in the last partition. 17:18.000 --> 17:27.000 And then, as the data updates, it trickles down into the past partitions, which will have different valid_to values. 17:27.000 --> 17:32.000 And then you can run a TTL on that, which is very, very interesting. 17:32.000 --> 17:36.000 So you're not going to run out of space with those kinds of history tables. 17:36.000 --> 17:52.000 And of course, this is fantastic: the TTL, after one day, not only saves you space, but also removes all the intermediate versions that you may have and deletes the data. 17:52.000 --> 18:04.000 Okay, so, last slide about the development status of this; there's a QR code if you want to know more about the project, it has 300 stars. 18:04.000 --> 18:08.000 And so I'm glad that I can present it here. 18:08.000 --> 18:18.000 One-to-one table mapping works, it's GA; development started four years ago, so it's bulletproof, we use it in production. 18:18.000 --> 18:24.000 The history table is quite new and it's in beta state at the moment. 18:24.000 --> 18:37.000 And I wanted to mention that there are similar open source projects that appeared later; PeerDB, for example, was launched last year. 18:37.000 --> 18:42.000 So if it doesn't work for you, you have open source alternatives as well. 18:42.000 --> 18:49.000 So, no demo, and I have time for some questions; I think I have two minutes. 18:49.000 --> 18:52.000 [Audience] How do you deal with the initial table sync? 18:56.000 --> 18:58.000 [inaudible] 18:58.000 --> 19:06.000 Oh yeah, I mentioned that; well, you can take a consistent snapshot in any database, like Postgres or MySQL. 19:06.000 --> 19:14.000 And then you restore that, and then you start replication from the position that's within that dump.
19:14.000 --> 19:26.000 MySQL would give you that position, and you would have the LSN or the transaction number in Postgres; that's a very important building block. 19:26.000 --> 19:33.000 If you don't want to lose data, you need to know exactly where to start. 19:33.000 --> 19:35.000 So yeah, it's supported, of course. 19:35.000 --> 19:38.000 And for a very big database, another question? 19:38.000 --> 19:43.000 [Audience] Don't you then overload the source database? 19:43.000 --> 19:47.000 No, because you can use a replica to do that. 19:47.000 --> 19:49.000 Yeah, you can use a replica, so. 19:49.000 --> 19:51.000 Yeah, replication is super important. 19:51.000 --> 19:52.000 Yes. 19:52.000 --> 19:54.000 No other questions? 19:54.000 --> 19:55.000 Oh, you? 19:55.000 --> 19:57.000 Yeah. 19:57.000 --> 20:00.000 Sorry. 20:00.000 --> 20:05.000 [Audience question, inaudible.] I completely discard them, because it's eventually consistent. 20:08.000 --> 20:10.000 Thank you.