Pandas DataFrames Tips and Tricks

Published on Jul 31, 2022 (edited on Aug 18, 2022)

Pandas is without question one of the most popular data science packages for Python. It is hard to image performing data analysis in Python without using Pandas. The DataFrame (and, of course, Series) data structure is fundamental for working with tabular data.

Data scientists who are new to Python tend to poorly utilize Pandas. Some use it predominately for reading and writing csv files. Others use it extensively but perform operations on DataFrames in ways that are not necessarily computationally performant.

This article is not a Pandas tutorial. We assume you are familiar with the basics such as creating DataFrames and indexing columns and rows. In this article, we outline a small number of operations that show Pandas' true power and usefulness for data science tasks.

Iterating over rows and columns

Iterating over a DataFrame's rows and columns is a basic operation. Pandas offers convenient methods for iterating over rows and columns.

Iterating over rows

Pandas provides the DataFrame method iterrows() for this purpose. Consider the following example,

import pandas as pd
import numpy as np    # we are going to need this one later

df = pd.DataFrame(
    {
        'name': ['Frank','Kate','Mia'],
        'age': [32,18,65]
    }
)
for index, row in df.iterrows():
  print("index: ", index, "\nrow:\n", row, "\ntype(row): ", type(row))
index:  0
row:
 name    Frank
age        32
Name: 0, dtype: object
type(row):  
index:  1
row:
 name    Kate
age       18
Name: 1, dtype: object
type(row):  
index:  2
row:
 name    Mia
age      65
Name: 2, dtype: object
type(row):  

As we can see from the above example, iterrows() returns an index and a Pandas Series object for each row in the DataFrame. Be careful of the operations you perform on the returned Series objects. As the documentation states, you should never modify something you are iterating over and since the iterator returns a copy of the data and not a view, changing the values of the returned row will not modify the original DataFrame. See the Pandas documentation for details: pandas.DataFrame.iterrows

There is another method called itertuples that we can use to iterate over DataFrame rows. We will discuss this method in a later section. First, let's have a look at methods for iterating over a DataFrame's columns.

Iterating over columns

We can iterate over DataFrame columns in a couple of ways. First, consider the following approach,

column_names = df.columns
print(column_names)
for column in column_names:
  print(df[column])
Index(['name', 'age'], dtype='object')
0    Frank
1     Kate
2      Mia
Name: name, dtype: object
0    32
1    18
2    65
Name: age, dtype: int64

In the above example, we first get a reference to the DataFrame's column Index object. We then iterate over the Index values and access each column using the index. One of the benefits of this approach is that we can iterate over the columns in any order we want using slice notation. Consider the following example where we iterate over the columns in reverse order,

for column in df.columns[::-1]:
  print(df[column])
0    32
1    18
2    65
Name: age, dtype: int64
0    Frank
1     Kate
2      Mia
Name: name, dtype: object

One issue with this approach is that it might be a bit too verbose if we just want to iterate over the columns based on the index order. In this case, we can use the items() method as follows,

for column_name, column_data in df.items():
  print("Column name: ", column_name, "\n", column_data)
Column name:  name
0    Frank
1     Kate
2      Mia
Name: name, dtype: object
Column name:  age
0    32
1    18
2    65
Name: age, dtype: int64

The method items() returns two values. The first one is the column name or label. The second is the column data as a Pandas Series object.

Accessing the underlying data

Oftentimes, we want to access the data stored in a DataFrame but in a format suitable for use with other packages such as Numpy or plain Python. One example is using data stored in a DataFrame as input to a TensorFlow or PyTorch model.

Conveniently, Pandas provides us with the DataFrame method to_numpy(). Consider the following example of a 2-column DataFrame.

df = pd.DataFrame(
    {
        'a': [1,2,3],
        'b': [-1,-2,-3]
    }
)
arr = df.to_numpy()
print("Shape: ", arr.shape, "\n", arr, "\n", "Type: ", arr.dtype)

That worked nicely. What should we be careful about when using this method?

First, we must understand that by default the method does not necessarily return a copy of the data in the DataFrame but instead might return a view. The documentation does not say explicitly under what conditions the returned data is a view or a copy. In others words, we best be cautious and assume a view. This means, that if we change the data via the arr reference we are also changing the data in df and vice versa. If we really want a copy, assuming that our computer has enough memory to store two copies of the data, then we must be explicit by calling to_numpy(copy=True).

arr_view = df.to_numpy()
arr_view[0, 0 ]= 100
print(arr_view)
print("--------")
print(df)
[[100  -1]
 [  2  -2]
 [  3  -3]]
--------
     a  b
0  100 -1
1    2 -2
2    3 -3

The second but also very important consideration when using to_numpy() is data conversion. What will the type of the Numpy array be? In the above example we might expect it to be integer; but would that be 16, 32, or 64-bit integer? What if the values were floating point? Would it be 32 or 64-bit float? The way Pandas handles this is to convert the data into the most common type among all the values. That is, if we have both integers and floats, the Numpy array will be of type float. The actual type will depend on the type of data in the DataFrame. So, if some values are 32-bit integers and the rest are 64-bit floats, then the Numpy array will have type float64.

One last word of caution with regard to data conversion. If your DataFrame has non-numeric entries, then the resulting Numpy array will be of type object. This may not be what you want. Be mindful of the type of data stored in your DataFrame and the type of data your downstream task requires! For more information see the Pandas documentation: pandas.DataFrame.to_numpy

Another very useful method for accessing the values in a DataFrame is the method itertuples(). When calling this method on a DataFrame object, we get back an iterator. We can use it to iterate over the DataFrame rows which are returned not as Pandas Series objects but rather as tuples. We can easily access the data in a DataFrame as a list of tuples as follows,

list_of_tuples = list(df.itertuples())
list_of_tuples
[Pandas(Index=0, a=100, b=-1),
 Pandas(Index=1, a=2, b=-2),
 Pandas(Index=2, a=3, b=-3)]

Notice that the result is a list of named tuples where the name is the default Pandas. Also notice that the first entry in each tuple is the row index. We can specify our custom name and that we do not want the index as follows (we can do either or both of these),

list(df.itertuples(index=False, name='MyIntegers'))
[MyIntegers(a=100, b=-1), MyIntegers(a=2, b=-2), MyIntegers(a=3, b=-3)]

For more information see the Pandas documentation: pandas.DataFrame.itertuples

One-hot Encoding of Categorical Data

One of the most common data pre-processing operations for categorical data is one-hot encoding. Machine Learning models cannot correctly handle categorical data without this operation. Both dependent and independent variables can be categorical.

Consider the simple example of predicting a person's income based on their gender identity and age. Age is a continuous variable but gender identity is categorical. For simplicity, let us assume that gender identity can be one of "male", "female", and "trans".

For our machine learning model, we want to use the following binary codes 001, 010, 100, for "male", "female", and "trans" respectively. Given a Pandas DataFrame with one column for gender identity and one for income, we can one-hot encode the former using the get_dummies() method.

df = pd.DataFrame(
    {
        'gender': ['male', 'female', 'trans', 'male'],
        'age': [42, 39, 54, 21],
        'income': [109, 98, 124, 67]   # in thousands of US dollars
    }
)
pd.get_dummies(df)
age	income	gender_female	gender_male	gender_trans
0	42	109	0	1	0
1	39	98	1	0	0
2	54	124	0	0	1
3	21	67	0	1	0

In the above example, calling get_dummies() has returned a new DataFrame with new columns. All the columns with categorical data were one-hot encoded. In the example only the column gender was one-hot encoded since it is the only column with categorical data. We see that the new DataFrame has one column for each category value. Column names are a composite of the original column name, gender, and the value, one of "male", "female", "trans". We can customise the columns names by passing our desired value to get_dummies(). See the Pandas documentation for details: pandas.get_dummies

So, we had a brief look at a small number of important operations we can perform on Pandas DataFrames. Iterating over DataFrame rows and columns as well as accessing the stored data and one-hot encoding categorical variables are some of the most common operations a data scientist performs on a daily basis.

Go forth and calculate!


Read more from our Blog.