WEBVTT

00:00.000 --> 00:22.000
Okay, my name is Joelson. I'm going to talk about a new open source data engineering framework called Data Prep Kit.

00:23.000 --> 00:31.000
Okay, so Data Prep Kit was released by IBM under the Apache 2.0 license last week.

00:31.000 --> 00:38.000
It was used internally by IBM to prepare their Granite LLM family.

00:38.000 --> 00:42.000
So it was a data engineering tool that they used. They released it as open source last week.

00:42.000 --> 00:46.000
There are three value props with this framework.

00:47.000 --> 00:53.000
The first is data engineering built on Kubeflow Pipelines. So it's workflow based, which means you can define steps, or flows made up of steps.

00:53.000 --> 00:58.000
You take the output of one step and you send it into another depending on what the output is.

00:58.000 --> 01:05.000
So it's a little easier to use and a little more flexible than just raw Python.

01:05.000 --> 01:15.000
The second value prop is scalable compute. So you can build and test your workflows locally and then easily migrate them up to much larger cloud clusters.

01:16.000 --> 01:20.000
And then the third value prop is community.

01:20.000 --> 01:30.000
Since we're workflow based, we can collaborate and solve complex data engineering problems facing GenAI, such as determining licensing and copyrights,

01:30.000 --> 01:36.000
GDPR compliance, and identifying personal information, hate speech, and bias.

01:36.000 --> 01:39.000
So instead of just shoveling all our data into our LLM,

01:39.000 --> 01:45.000
we can figure out and design pipelines to try to detect these things before they go into the LLM.

01:46.000 --> 01:49.000
The potential user base for this:

01:49.000 --> 01:58.000
GenAI value creators. So if you don't want to get bogged down in data engineering, and you're looking simply to maybe set up a RAG with a bunch of documents in it,

01:58.000 --> 02:04.000
this would be a perfect tool to do this.
There are many examples and tutorials for processing that kind of data.

02:04.000 --> 02:08.000
Maybe you're a professional data engineer and you are bogged down in data engineering.

02:08.000 --> 02:13.000
You can develop workflows in Python locally and easily migrate them up to Spark, Ray, or Kubeflow Pipelines.

02:13.000 --> 02:23.000
There's a catalog of existing transforms that operate on big data. So you can come out of the gates with a bunch of stuff you can already leverage without having to rewrite it yourself.

02:23.000 --> 02:27.000
And then the third potential user of this is AI researchers.

02:27.000 --> 02:32.000
If you want to collaborate with other researchers on solving some of these data problems, this is a good framework to do it.

02:32.000 --> 02:38.000
And I'll just give a quick shout out to the AI Alliance. The AI Alliance was started in December of 2023.

02:38.000 --> 02:49.000
It has about 100 members right now. It's a nonprofit organization that is promoting open and trusted ML/AI data engineering.

02:49.000 --> 02:51.000
So thank you.

02:51.000 --> 02:53.000
Thank you.