Pandas 2.0: A Game-Changer for Data Scientists?

The Top 5 Features for Efficient Data Manipulation

This April, pandas 2.0.0 was officially launched, making huge waves across the data science community.

Photo by Yancy Min on Unsplash.

Due to its extensive functionality and versatility, pandas has secured a place in every data scientist’s heart.

From data input/output to data cleaning and transformation, it’s nearly impossible to think about data manipulation without import pandas as pd, right?

Now, bear with me: with such a buzz around LLMs over the past months, I have somehow let slide the fact that pandas has just undergone a major release! Yep, pandas 2.0 is out and came with guns blazing!

Although I wasn’t aware of all the hype, the Data-Centric AI Community promptly came to the rescue:

The 2.0 release seems to have created quite an impact in the data science community, with a lot of users praising the modifications added in the new version. Screenshot by Author.

Fun fact: Were you aware this release was in the making for an astonishing 3 years? Now that’s what I call “commitment to the community”!

So what does pandas 2.0 bring to the table? Let’s dive right into it!

1. Performance, Speed, and Memory-Efficiency

As we all know, pandas was built on top of numpy, which was never intentionally designed as a backend for dataframe libraries. For that reason, one of the major limitations of pandas was handling in-memory processing for larger datasets.

In this release, the big change comes from the introduction of the Apache Arrow backend for pandas data.

Essentially, Arrow is a standardized in-memory columnar data format with available libraries for several programming languages (C, C++, R, Python, among others). For Python there is PyArrow, which is based on the C++ implementation of Arrow, and therefore, fast!

So, long story short, PyArrow takes care of our previous memory constraints of versions 1.X and allows us to conduct faster and more memory-efficient data operations, especially for larger datasets.

Here’s a comparison between reading the data without and with the pyarrow backend, using the Hacker News dataset, which is around 650 MB (License CC BY-NC-SA 4.0):


As you can see, using the new backend makes reading the data nearly 35x faster. Other aspects worth pointing out:

- Without the pyarrow backend, each column/feature is stored as its own unique data type: numeric features are stored as int64 or float64, while string values are stored as objects;
- With pyarrow, all features use Arrow dtypes: note the [pyarrow] annotation and the different types of data: int64, float64, string, timestamp, and double.

2. Arrow Data Types and Numpy Indices

Beyond reading data, which is the simplest case, you can expect additional improvements for a series of other operations, especially those involving string operations, since pyarrow’s implementation of the string datatype is quite efficient:


In fact, Arrow has more (and better support for) data types than numpy, which are needed outside the scientific (numerical) scope: dates and times, duration, binary, decimals, lists, and maps. Skimming through the equivalence between pyarrow-backed and numpy data types might actually be a good exercise in case you want to learn how to leverage them.

It is also now possible to hold more numpy numeric types in indices.
The traditional int64, uint64, and float64 have opened up space for all numpy numeric dtypes as Index values, so we can, for instance, specify their 32-bit versions instead:

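A quick sketch of the new index dtypes (in pandas 1.x, both of these would be silently upcast to 64-bit):

```python
import numpy as np
import pandas as pd

# 32-bit dtypes now survive in an Index
idx32 = pd.Index([1, 2, 3], dtype="int32")
f32 = pd.Index(np.array([0.0, 1.0, 2.0], dtype=np.float32))

print(idx32.dtype, f32.dtype)  # int32 float32
```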

This is a welcome change since indices are one of the most used functionalities in pandas, allowing users to filter, join, and shuffle data, among other data operations. Essentially, the lighter the Index is, the more efficient those processes will be!

3. Easier Handling of Missing Values

Being built on top of numpy made it hard for pandas to handle missing values in a hassle-free, flexible way, since numpy does not support null values for some data types.

For instance, integers are automatically converted to floats, which is not ideal:

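A minimal reproduction of the upcast (assuming a 64-bit platform, where pandas defaults integer columns to int64):

```python
import pandas as pd

s = pd.Series([1, 2, 3])
print(s.dtype)  # int64

# numpy has no integer NaN, so a single missing value upcasts the whole column
s_missing = pd.Series([1, None, 3])
print(s_missing.dtype)  # float64
```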

Note how points automatically changes from int64 to float64 after the introduction of a single None value.

There is nothing worse for a data flow than wrong data types, especially within a data-centric AI paradigm.

Erroneous data types directly impact data preparation decisions, cause incompatibilities between different chunks of data, and, even when passing silently, may compromise certain operations so that they output nonsensical results in return.

As an example, at the Data-Centric AI Community, we’re currently working on a project around synthetic data for data privacy. One of the features, NOC (number of children), has missing values and therefore is automatically converted to float when the data is loaded. Then, when passing the data into a generative model as a float, we might get output values as decimals such as 2.5 — unless you’re a mathematician with 2 kids, a newborn, and a weird sense of humor, having 2.5 children is not OK.

In pandas 2.0, we can leverage dtype_backend='numpy_nullable', where missing values are accounted for without any dtype changes, so we can keep our original data types (int64 in this case):

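A hedged sketch mirroring the NOC example, with a hypothetical `points` column playing the role of a count-like integer feature with one missing entry:

```python
import io

import pandas as pd

csv_bytes = b"name,points\nAlice,3\nBob,\nCarol,5\n"

# Default behavior: the missing value forces an upcast to float64
df_default = pd.read_csv(io.BytesIO(csv_bytes))
print(df_default["points"].dtype)  # float64

# Nullable backend: the column stays integer, the gap becomes <NA>
df_nullable = pd.read_csv(io.BytesIO(csv_bytes), dtype_backend="numpy_nullable")
print(df_nullable["points"].dtype)  # Int64
```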

It might seem like a subtle change, but under the hood it means that now pandas can natively use Arrow’s implementation of dealing with missing values. This makes operations much more efficient, since pandas doesn’t have to implement its own version for handling null values for each data type.

4. Copy-On-Write Optimization

Pandas 2.0 also adds a new lazy copy mechanism that defers copying DataFrames and Series objects until they are modified.

This means that certain methods will return views rather than copies when copy-on-write is enabled, which improves memory efficiency by minimizing unnecessary data duplication.

It also means you need to be extra careful when using chained assignments.

If the copy-on-write mode is enabled, chained assignments will not work because they point to a temporary object that is the result of an indexing operation (which under copy-on-write behaves as a copy).

When copy_on_write is disabled, operations like slicing may return views, so changing the new dataframe may also change the original df:


When copy_on_write is enabled, a copy is created at assignment, and therefore the original dataframe is never changed. Pandas 2.0 will raise a ChainedAssignmentError in these situations to avoid silent bugs:

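Both behaviors can be sketched in a few lines. Hedged note: the option call below targets pandas 2.x, where copy-on-write is opt-in; in later versions it is simply the default, hence the guard.

```python
import warnings

import pandas as pd

try:
    pd.set_option("mode.copy_on_write", True)
except Exception:
    pass  # newer versions where the option is gone and CoW is always on

df = pd.DataFrame({"points": [1, 2, 3]})

# Slicing behaves as a copy: mutating the subset leaves df untouched
subset = df.iloc[:2]
subset.loc[0, "points"] = 100
print(df.loc[0, "points"])  # still 1

# Chained assignment is flagged instead of silently half-working
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # pandas emits a ChainedAssignmentError warning here
    df["points"][0] = 100            # no effect under copy-on-write
print(df.loc[0, "points"])  # still 1
```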

5. Optional Dependencies

When using pip, version 2.0 gives us the flexibility to install optional dependencies, which is a plus in terms of customization and optimization of resources.

We can tailor the installation to our specific requirements, without spending disk space on what we don’t really need.

Plus, it saves a lot of “dependency headaches”, reducing the likelihood of compatibility issues or conflicts with other packages we may have in our development environments:

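For illustration, here are a few of the optional extras defined by pandas 2.0 (extra names as listed in the pandas installation guide; the version pin is only an example):

```shell
# Core install only
pip install "pandas==2.0.*"

# Or pull in just the extras you need:
pip install "pandas[performance]"    # numexpr, bottleneck, numba
pip install "pandas[aws,excel]"      # s3fs plus the Excel readers/writers
pip install "pandas[all]"            # every optional dependency at once
```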

Taking it for a spin!

Yet, the question lingered: is the buzz really justified? I was curious to see whether pandas 2.0 provided significant improvements with respect to some packages I use on a daily basis: ydata-profiling, matplotlib, seaborn, scikit-learn.

From those, I decided to take ydata-profiling for a spin — it has just added support for pandas 2.0, which seemed like a must-have for the community! In the new release, users can rest assured that their pipelines won’t break if they’re using pandas 2.0, and that’s a major plus! But what else?

Truth be told, ydata-profiling has been one of my top favorite tools for exploratory data analysis, and it makes for a nice and quick benchmark too: one line of code on my side, but under the hood it is full of computations that, as a data scientist, I would otherwise need to work out myself (descriptive statistics, histogram plotting, correlation analysis, and so on).

So what better way than testing the impact of the pyarrow engine on all of those at once with minimal effort?


Again, reading the data is definitely better with the pyarrow engine, although creating the data profile has not changed significantly in terms of speed.

Yet, the differences may lie in memory efficiency, for which we’d have to run a different analysis. Also, we could further investigate the type of analysis being conducted on the data: for some operations, the difference between versions 1.5.2 and 2.0 seems negligible.

But the main thing I noticed that might make a difference in this regard is that ydata-profiling is not yet leveraging the pyarrow data types. This update could have a great impact on both speed and memory, and is something I look forward to in future developments!

The Verdict: Performance, Flexibility, Interoperability!

This new pandas 2.0 release brings a lot of flexibility and performance optimization with subtle, yet crucial modifications “under the hood”.

Maybe they are not “flashy” for newcomers to the field of data manipulation, but they sure as hell are like water in the desert for veteran data scientists who used to jump through hoops to overcome the limitations of the previous versions.

Wrapping it up, these are the top main advantages introduced in the new release:

- Performance Optimization: with the introduction of the Apache Arrow backend, more numpy dtype indices, and copy-on-write mode;
- Added flexibility and customization: allowing users to control optional dependencies and take advantage of the Apache Arrow data types (including nullability from the get-go!);
- Interoperability: perhaps a less “acclaimed” advantage of the new version, but with huge impact. Since Arrow is language-independent, in-memory data can be transferred between programs built not only on Python, but also R, Spark, and others using the Apache Arrow backend!

And there you have it, folks! I hope this wrap-up has quieted down some of your questions around pandas 2.0 and its applicability to our data manipulation tasks.

I’m still curious whether you have found major differences in your daily coding with the introduction of pandas 2.0 as well! If you’re up to it, come and find me at the Data-Centric AI Community and let me know your thoughts! See you there?

About me

Ph.D., Machine Learning Researcher, Educator, Data Advocate, and overall “jack-of-all-trades”. Here on Medium, I write about Data-Centric AI and Data Quality, educating the Data Science & Machine Learning communities on how to move from imperfect to intelligent data.

Developer Relations @ YData | Data-Centric AI Community | GitHub | Instagram | Google Scholar | LinkedIn

Pandas 2.0: A Game-Changer for Data Scientists? was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
