Hello, everyone. It's a really great pleasure for me to invite Matt Topol to talk about Arrow. Matt is my friend; we meet quite often at Apache Software Foundation conferences and meetup events. We now have the opportunity for Matt to talk about Arrow in the context of AI. So, it's all yours.

Hey, everybody. As was just said, I'm Matt Topol, a PMC member on Apache Arrow and the author of the book "In-Memory Analytics with Apache Arrow." A small confession here: AI and ML are not my expertise, and I hope that doesn't bring out the pitchforks and torches. My expertise is more in the data analytics pipeline area. But I've talked with a lot of AI and ML engineers and people focused on it, and I've been told that my little diagram here is surprisingly accurate.

The point is that, when you think about it, a lot of AI and ML can be thought of as data pipelines. You have all of these libraries, tools, and utilities in the ecosystem: ONNX, JAX, pickles and GGUF for formats; TensorFlow, PyTorch, Llama, RAPIDS. And then you have all of the sources you're pulling the data from: your various data frame libraries, your feature stores, Parquet files. It's an enormous ecosystem that keeps growing, and of course there are different pros and cons to the various libraries.

So when you're looking at building out all of these machine learning models and inference pipelines, here's a potentially controversial statement: it's a specialized version of a data pipeline. That's where I came at this from, and I've thought about it a lot. When you think about AI and machine learning models in those terms, as a specialized version of data pipelines, you come up with interesting ideas.

Now, I'm aware this doesn't represent the entire space of models. But a lot of models are basically a pre-processing step on your inputs; then you send the data through various layers that do transformations and interesting computations; eventually you get your output of a prediction, maybe with probabilities; and then you do a little bit of post-processing and send it out somewhere else. In many cases, you're passing tensors between those layers: from your pre-processing into the layers, from layer to layer, all the way through to your prediction, possibly even all the way to the post-processed output. At that point you convert your tensors into whatever you actually want to output to your user or your inference client, converting it back into reasonable text, and so on.
And of course, with this entire giant ecosystem of tools and libraries, everyone wonders: which one should I use? Because it's expensive to tie them together. They all have pros and cons; they have different benefits, different drawbacks, different functionality, different algorithms that the library or framework supports, different performance characteristics. And the cost of linking these things together, to mix and match what you need, can often be very high.

They may not have the same layout in memory, so you have to copy and convert from one format to another to pass data between the libraries. Or one library may only work on CPUs while another supports GPUs. Or the library just doesn't expose a way of getting the data out while keeping it on the device, possibly because of streams and events and other very low-level manipulations.

Then you have a lot of new tools and utilities coming up. You've got LanceDB being built as a vector database with a new format. You've got Nimble as another option for improving upon formats like Parquet, optimized towards ML workflows and point lookups and random lookups. And a lot of these systems are consolidating around using the Apache Arrow format internally for how they represent their data in memory and on devices. This is the first point where I'm mentioning Apache Arrow.

The point I'm making here is interoperability: being able to get data between all these tools and all these libraries without a high cost. You want to be able to mix and match everything without an expensive slowdown. Currently, when you're passing your data between TensorFlow and PyTorch, or from Hugging Face to TensorFlow, or anywhere else, you're typically going to use NumPy. You might use pandas and do some data frame manipulation for your pre-processing, maybe CuPy; DLPack is often a big one; XGBoost is another, where you're finding ways of passing your data in reasonable formats between systems and between libraries. And then you have the intermediate scenarios where your data source is CSV files and Parquet, or just raw JSON data because you hate yourself.

The result is that there's a ton of unnecessary copying, because the underlying tools and libraries can't operate on the data as it is where you're starting from, and you need to copy and transform it to usefully use it.

So before I continue, and you can obviously see where my answer is going, let me actually explain what Apache Arrow is, in case you're not familiar with it. Arrow is an in-memory, column-oriented data format.
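To make "column-oriented" concrete, here's a minimal PyArrow sketch (my own illustration, not from the slides): each column lives in its own contiguous buffers, and those raw buffers are exactly what Arrow-aware libraries can share without conversion.

```python
import pyarrow as pa

# A small Arrow table: every column is stored in its own contiguous,
# column-oriented buffers (a validity bitmap plus the values).
table = pa.table({
    "id": pa.array([1, 2, 3], type=pa.int64()),
    "score": pa.array([0.1, 0.5, None], type=pa.float32()),
})

print(table.schema)

# The raw buffers backing one column -- these bytes are what other
# Arrow-aware libraries consume directly, without any conversion step.
print(table.column("score").chunks[0].buffers())
```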
At its core, it's a spec, a spec that's been implemented in a whole mess of languages. You've got C++, Go, Rust, Python (PyArrow, on top of the C++ library), Julia, MATLAB, Java; it's implemented in a ton of languages. The point is that by consolidating around a column-oriented in-memory format, you eliminate a lot of the cost of serialization and deserialization when passing data between libraries, between nodes, between systems, transporting it over the wire, and so on. The data transported over the wire is identical to the bytes in memory, which means there's no serialize or deserialize step when you cross systems, or cross process boundaries, or even just runtime and library boundaries within the same process. We'll get to that in a moment. And because it's column-oriented, it's highly efficient and friendly for vectorization and SIMD.

Chances are you're probably already interacting with Arrow, or using it under the hood somewhere, and don't even realize it, because it's already everywhere. If you've ever used libcudf in NVIDIA RAPIDS and that whole ecosystem (cuML, cuSpatial, all of it): cuDF is using the Apache Arrow data format, just on the GPU. It's the same Arrow data format, the raw bytes laid out in GPU memory identically to the way they would be laid out in CPU memory. Because of the efficiency of a column-oriented data format and the benefit of passing it around without the cost of serializing and deserializing, it becomes super efficient, and NVIDIA adopted it for that entire ecosystem.

If you're using Polars, the Polars data frame library in Python or Rust: the entire underlying memory representation backing Polars is Apache Arrow. It's just Arrow. Which means you can go from one library to another, for example from pandas to Polars and so on, zero-copy, because they share that format. Pandas has a PyArrow backend.

Earlier on, we had a talk from a developer at Hugging Face. If you go under the hood of Hugging Face's datasets library: when you load a dataset, it downloads a bunch of Parquet files, and then Hugging Face has a local caching system so that you can operate on larger-than-memory data without blowing out your RAM. How does it do that? It stores the data as a series of Apache Arrow files in your cache and then memory-maps them, so it loads only the bits it needs into memory as it reads. You can interact with a data set that's larger than memory without requiring a large amount of RAM, because the properties of Apache Arrow allow you to write out an Arrow IPC file, memory-map it to read your data, and only actually allocate memory for the bits that you use.
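Going back to the Polars and pandas point for a second, here's a hedged sketch of that zero-copy hand-off, assuming recent versions of polars and pandas with PyArrow-backed dtypes:

```python
import pyarrow as pa
import polars as pl
import pandas as pd

# One Arrow table of features...
arrow_table = pa.table({
    "id": pa.array([1, 2, 3], type=pa.int64()),
    "score": pa.array([0.25, 0.5, 0.75], type=pa.float64()),
})

# ...viewed as a Polars DataFrame. Polars is backed by Arrow memory,
# so for most types this does not copy the underlying buffers...
df = pl.from_arrow(arrow_table)

# ...and as a pandas DataFrame with PyArrow-backed dtypes, again reusing
# the Arrow buffers instead of converting everything to NumPy objects.
pdf = arrow_table.to_pandas(types_mapper=pd.ArrowDtype)

# Handing the data back to Arrow from Polars is just as cheap.
print(df.to_arrow().schema)
print(pdf.dtypes)
```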
And then we get to the compute side of things, the actual computational tools. DuckDB's internal memory representation is about 95% identical to Arrow, and DuckDB implements the interfaces so you can get zero-copy results out of DuckDB in the Arrow format. Snowflake has an Arrow interface for getting your result data back. BigQuery has an Arrow output implementation. If you're using Apache DataFusion: DataFusion started as a subproject of the Apache Arrow project, is now a top-level project of its own, and is entirely built on Apache Arrow under the hood for its memory representation.

As a result, all of these systems support Arrow in and Arrow out for interoperability. Any tool that can output Arrow can be used to send data to one of these systems for ingestion, and any tool that can receive Arrow data can get data out of those systems zero-copy, super efficiently.

Now, the point of talking about all this zero-copy and interoperability is that it's often very hidden where the copies are. I'm going to bring up the Hugging Face datasets library again. You can see that when you load a dataset, the underlying .data for the dataset has a PyArrow schema, and the underlying data columns are PyArrow chunked arrays. But when you index into the dataset itself to retrieve a column, it converts it to a Python list. That's a copy. It has to be in Python memory, not the raw underlying C++ memory, so there's a copy there. You may not realize it; these are very simple, hidden situations where you don't notice that copies are happening, simply because the libraries you're interacting with aren't necessarily exposing Arrow directly.
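To ground that hidden-copy point, here's a hedged sketch using the datasets library ("imdb" is just a stand-in; any Hub dataset behaves the same way, and the exact wrapper types can vary by version):

```python
from datasets import load_dataset

ds = load_dataset("imdb", split="train")

# Under the hood the dataset is Arrow: a schema plus chunked-array
# columns, memory-mapped from the local cache of .arrow files.
print(ds.data.schema)
print(type(ds.data.column("text")))

# Plain indexing materializes Python objects -- that's the hidden copy.
texts = ds["text"]
print(type(texts))          # a Python list, copied out of Arrow memory

# Asking for Arrow-formatted output keeps you on the Arrow buffers instead.
ds_arrow = ds.with_format("arrow")
batch = ds_arrow[:8]
print(type(batch))          # a pyarrow Table view over the same data
```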
So with everything I've talked about, everything I've brought up, the meat of this talk is: what exactly am I proposing? Why should you care what I have to say here? Well, you have this whole multiverse of Arrow pieces across systems and libraries. You've got the IPC format for interprocess data, for interoperability between libraries and systems. You've got the C data interface, which is a stable ABI to pass data across runtimes within the same process without having to copy anything; it just passes pointers to the buffers. And you've got this implemented in, well, not every language under the sun, but most languages have an implementation of the Arrow spec. Not only that, but Arrow also specifies interfaces and protocols for data transport and database connections, with ADBC instead of ODBC, to get column-oriented communication of data.

So with all of this connectivity and all of this interoperability, how can the AI and ML pipelines, and the data tools and libraries, really use it to their fullest? The result of doing that is an increase in your ability to mix and match libraries, to get exactly the right tool for whatever operations, computations, and calculations you need in your model.

When you actually leverage that zero-copy, you can see the performance benefits. There was a white paper published last year by members of one of the projects I mentioned earlier, which develops a Python runtime that's heavily used for AI and ML workflows. As one piece of that paper, they tested reading data frames on a c5.9xlarge AWS machine, depending on where the source of the raw data was coming from. The interesting part is this: if you're building a distributed system, how do you get data from node to node, from piece to piece? A common way of doing it is to write the resulting intermediate data to a Parquet file in S3. Another common way is writing a Parquet file onto a shared SSD, because you want to bypass the network cost of going to S3. You can see those comparisons here, with 10 million or 50 million rows, roughly 6 or 30 gigabytes of data.

If you use one of the frameworks the Arrow project specifies, Arrow Flight, a gRPC-based protocol for building system-to-system communication around streams of Arrow record batches, you can see that you still have the network overhead, and Arrow isn't compressed the way Parquet is, so that may not be the best option. But Arrow IPC: if you write the Arrow data to shared memory and then just memory-map it into your system, which is exactly how Hugging Face's caching works, look at the difference in performance. Right there you can see where removing unnecessary copies results in enormous potential performance gains, which in turn leads to enormous potential for expanding model computation and adding more operations without a massive increase in cost.
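As a rough sketch of that shared-memory pattern (the path and sizes are made up; /dev/shm is just a convenient Linux shared-memory location):

```python
import pyarrow as pa
import pyarrow.ipc as ipc

# Producer side: write intermediate results out as an Arrow IPC file
# on shared memory (or a shared SSD).
table = pa.table({"feature": pa.array(range(1_000_000), type=pa.float64())})
with pa.OSFile("/dev/shm/features.arrow", "wb") as sink:
    with ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Consumer side: memory-map the file instead of reading it. Nothing is
# deserialized and nothing is eagerly copied into RAM; pages are only
# faulted in for the parts of the data you actually touch.
with pa.memory_map("/dev/shm/features.arrow", "r") as source:
    mapped = ipc.open_file(source).read_all()

print(mapped.num_rows, pa.total_allocated_bytes())
```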
Now, one thing you might wonder about, since I keep talking about Arrow and how we can use it, is: okay, can it support everything I need to represent? And the answer is most likely yes. You have all of your standard primitive types: floats (float32, float64), integers of every signedness and bit width, decimals, strings, and so on. But you also have complex types: maps, structs, lists, all efficiently represented in column-oriented, organized memory.

The Arrow spec also includes a mechanism for extension types. An extension type is just an already existing Arrow type with some metadata layered on top of it, so that an individual library, runtime, or instance can register a custom class to add semantic utilities to that data. For example, you could register a JSON type that's just a string column, but when you read the data in, it gets wrapped in a custom class that gives you extra semantics and extra functions on that string column, because it's supposed to be JSON. It doesn't change the underlying storage, and it means the data can flow through other Arrow-supporting systems that don't have to care about the extension type, because it's just Arrow data with some metadata keys.

The community got together, especially around this AI and ML work, and put together some canonical extension types specifically to help interoperability with ML workflows. That created the fixed-shape and variable-shape tensor extension types for Arrow, so you can represent tensors with any number of dimensions, permutations, and so on, in the Arrow format in a standardized way that's very easy to work with.

How does it work? I can give credit here to a couple of my colleagues, Rok and Alenka, who gave a big PyArrow talk and pushed through a lot of the work on how Arrow's tensor support works. You can see here that, taking something like a NumPy ndarray, the way it's represented in raw memory is just a continuous values buffer of all the data in one chunk, plus a logical view on top that lets you interact with it as a list of multi-dimensional tensors. The raw underlying bytes in memory are one continuous chunk, exactly the way you would expect them in NumPy or any other library. So you get highly efficient, constant-time random access to the individual tensor you want, but you also have a list of tensors that you can pass around very efficiently.
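Before moving on, here's what the extension-type mechanism from the JSON example a moment ago looks like, as a minimal hand-rolled sketch in PyArrow (the "example.json" name and the class are mine, purely illustrative):

```python
import json
import pyarrow as pa

# Storage is a plain Arrow string column; the extension type just layers
# a name (and optionally metadata) on top to add JSON semantics.
class JsonType(pa.ExtensionType):
    def __init__(self):
        super().__init__(pa.string(), "example.json")

    def __arrow_ext_serialize__(self):
        return b""  # no parameters to persist

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()

pa.register_extension_type(JsonType())

storage = pa.array(['{"a": 1}', '{"a": 2}'], type=pa.string())
json_arr = pa.ExtensionArray.from_storage(JsonType(), storage)

# Arrow-aware systems that don't know this type still see a string column
# plus metadata; ones that do can add semantics, e.g. parse on access.
print(json_arr.type)
print(json.loads(json_arr.storage[0].as_py()))
```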
And as I mentioned, there's also a variable-shape tensor. While most AI and ML work is probably using fixed-shape tensors, there's also a lot of use for variable-shape tensors, especially once you start talking about LLMs, where your input data, the tokens from your strings and your text, is not going to be a fixed shape. Each individual input may be a different length, so you need to handle that variable shape and still have it handled using the underlying Arrow format: raw bytes, contiguous, easy to use, and interoperable.

And just so you know I'm not blowing smoke, so you know I'm not lying to you, some proof. If you look at the actual code here: I generate a random NumPy array of data and flatten it out so that I have my nice list of a thousand 32-by-32 tensors, and then I create an Arrow tensor array from that. Then, for the original NumPy array, the Arrow tensor array, and the NumPy result when I go from the tensor array back to NumPy, you can see that, one, the shape is retained (a thousand 32-by-32 tensors), and two, on the left-hand side, all three versions, whether I'm accessing the data through Arrow, looking at the original NumPy array, or going from Arrow back to NumPy, have the same memory address for the raw data. Through all of these conversions, interacting with it through Arrow, through NumPy, through the extension tensor type, there was never a copy of the data. All three of them are still pointing at the exact same location in memory; at no point was the data actually copied.
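That slide, roughly reconstructed as a hedged sketch (poking at storage.values.buffers() is just one way to get the address of the Arrow values buffer):

```python
import numpy as np
import pyarrow as pa

# 1,000 tensors of shape 32x32, as one contiguous float32 ndarray.
arr = np.random.default_rng(0).random((1000, 32, 32)).astype(np.float32)

# Wrap it in Arrow's canonical fixed-shape tensor extension type.
tensor_array = pa.FixedShapeTensorArray.from_numpy_ndarray(arr)
print(tensor_array.type)

# ...and convert back to NumPy: the shape is retained.
back = tensor_array.to_numpy_ndarray()
print(back.shape)                                        # (1000, 32, 32)

# All three views share the same underlying bytes; for a contiguous
# float buffer like this, no step in the round trip makes a copy.
print(arr.ctypes.data)                                   # original ndarray
print(tensor_array.storage.values.buffers()[1].address)  # Arrow values buffer
print(back.ctypes.data)                                  # round-tripped ndarray
```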
Another example, which I got from another colleague, is storing tensors in Parquet. That can happen when you want to store weights, or the result of tokenization, or input data; you want to store it in Parquet, and it's extremely easy to do. In this case I create my shapes and do my transformations: I'm loading GeoTIFF data for geospatial handling, doing some manipulations for the inputs, and then storing it in a Parquet file as a whole bunch of tensors representing the GeoTIFF data. By converting it to a fixed-shape tensor with PyArrow, I can use the Parquet library that's part of PyArrow and just write my table to Parquet, and it maintains the tensors; it's just a single write-table call. Then I can read that back out as a PyArrow table, and now I have my list of tensors again, extremely efficiently. I can read the table, convert it to a NumPy ndarray, and guess what, there's no copy, it's zero-copy. I can pass the resulting list of tensors along, pre-process them, put them into my model for prediction, get the result, take the output as a NumPy ndarray, make my PyArrow table, and write the results to a Parquet file the same way, so I can read the tensors back out easily.

Okay, so many of you might be familiar with this slide, and the big question is: okay, but we have DLPack, we have NumPy, why do we need another interoperability standard here? The big thing that came out of my investigation and research was the complex type support. For example, DLPack and NumPy only support numeric data. DLPack has a little escape hatch where you can put opaque stuff in, but you lose your semantics. You can't represent, say, strings with it, or null data. That leads to the habit of using NaN to represent nulls, instead of actually having nulls via Arrow's validity bitmap. And you can see that a while back there was a big talk showing a huge performance benefit of building a feature store with DuckDB and Arrow Flight. The other thing is that, like I mentioned before, cuDF and NVIDIA RAPIDS are already Arrow on the device, which means that if more libraries implement support for the Arrow device API, you can do pre-processing on a GPU and pass your results to your model without having to copy them back to the CPU and then back to the GPU. You can keep the data on the device for the whole workflow, whether it's for training or inference. The pre-processing and the post-processing can be on the device and not force those copies. And that's it: you're removing the unnecessary copies.

Now, I had a contrived quick demo, but I'm being told I have no time, so I'm going to skip that for now; if you want to see it, come find me. The ultimate goal here is really just mixing and matching, picking the best tools for the job as you need them. If you want to learn more, you can go to the Arrow documentation, you can go to the Arrow website, and you can check out the second edition of my book, which contains a whole bunch of material, including the example I was going to show of using Arrow with GPUs and things like that, among other things outside the context of this talk as well. So thank you very much, and I hope to see you all around.
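For reference, the Parquet tensor round trip from that example looks roughly like this, as a hedged sketch assuming a recent PyArrow (the file name, shapes, and the random "tiles" standing in for the GeoTIFF data are all made up for illustration):

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Pretend these are preprocessed tiles cut out of a GeoTIFF.
tiles = np.random.default_rng(0).random((256, 64, 64)).astype(np.float32)

table = pa.table({"tile": pa.FixedShapeTensorArray.from_numpy_ndarray(tiles)})

# Writing stores the Arrow schema (including the tensor extension type)
# in the Parquet file metadata, so the tensors survive the round trip.
pq.write_table(table, "tiles.parquet")

# Reading it back gives the list of tensors again; converting the column
# to an ndarray is then zero-copy from the in-memory Arrow buffers.
read_back = pq.read_table("tiles.parquet")
restored = read_back.column("tile").combine_chunks().to_numpy_ndarray()
print(restored.shape)   # (256, 64, 64)
```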