How to check if any value is NaN in a pandas DataFrame
Posted by: AJ Welch
The official documentation for pandas refers to what most developers would know as null values as missing or missing data. Within pandas, a missing value is denoted by NaN.
In most cases, the terms missing and null are interchangeable, but to abide by the standards of pandas, we’ll continue using missing throughout this tutorial.
Evaluating for missing data
At the base level, pandas offers two functions to test for missing data, isnull() and notnull(). As you may suspect, these are simple functions that return a boolean value indicating whether the passed-in argument is in fact missing data.
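As a quick illustration of the two top-level functions, here is a minimal sketch. The variable names and sample values are made up for this example and are not part of the original walkthrough.

```python
import numpy as np
import pandas as pd

# The top-level functions accept scalars as well as array-like objects.
scalar_missing = pd.isnull(np.nan)   # NaN is treated as missing
scalar_none = pd.isnull(None)        # None is treated as missing too
scalar_present = pd.notnull(7)       # an ordinary number is not missing

# Passing a list returns an element-wise boolean NumPy array.
mask = pd.isnull([2.0, np.nan, 7.0])

print(scalar_missing, scalar_none, scalar_present)
print(mask.tolist())
```

Note that notnull() is simply the logical opposite of isnull(), so either one can be used depending on which reads more naturally.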
In addition to the above functions, pandas also provides two methods of the same names to check for missing data on Series and DataFrame objects. These methods evaluate each element in the Series or DataFrame and return a boolean value for each one indicating whether it is missing or not.
For example, let’s create a simple Series in pandas:
import pandas as pd
import numpy as np
s = pd.Series([2,3,np.nan,7,"The Hobbit"])
Now evaluating the Series s, the output shows each value as expected, including index 2, which we explicitly set as missing.
In [2]: s
Out[2]:
0             2
1             3
2           NaN
3             7
4    The Hobbit
dtype: object
To test the isnull() method on this series, we can use s.isnull() and view the output:
In [3]: s.isnull()
Out[3]:
0    False
1    False
2     True
3    False
4    False
dtype: bool
As expected, the only value evaluated as missing is index 2.
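The notnull() method mentioned earlier works the same way in reverse. The sketch below rebuilds the same Series and compares the two masks; the names present and missing are chosen only for this illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 3, np.nan, 7, "The Hobbit"])

present = s.notnull()   # True wherever a value exists
missing = s.isnull()    # True wherever a value is missing

# The two masks are exact complements of each other.
print(present.tolist())
print(bool((present == ~missing).all()))
```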
Determine if ANY value in a Series is missing
While the isnull() method is useful, sometimes we may wish to evaluate whether any value is missing in a Series.
There are a few possibilities involving chaining multiple methods together.
The fastest of these is generally to chain .values.any(), which runs the check on the underlying NumPy array:
In [4]: s.isnull().values.any()
Out[4]:
True
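For comparison, here is a hedged sketch of a few ways to ask the same question. All three should agree on the result; which one is quickest can depend on the pandas version and the size of the data, so treat any speed ranking as something to measure rather than assume.

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 3, np.nan, 7, "The Hobbit"])
clean = pd.Series([1, 2, 3])  # illustrative Series with nothing missing

via_values = bool(s.isnull().values.any())  # reduce the underlying NumPy array
via_series = bool(s.isnull().any())         # reduce the boolean Series directly
via_hasnans = bool(s.hasnans)               # Series property that reports NaNs

clean_result = bool(clean.isnull().values.any())

print(via_values, via_series, via_hasnans, clean_result)
```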
In some cases, you may wish to determine how many missing values exist in the collection, in which case you can use .sum() chained on:
In [5]: s.isnull().sum()
Out[5]:
1
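Summing works here because each True in the boolean mask counts as 1 and each False as 0. As a rough cross-check, the sketch below also uses the count() method, which returns the number of non-missing values; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 3, np.nan, 7, "The Hobbit"])

n_missing = int(s.isnull().sum())  # True values are summed as 1
n_present = int(s.count())         # count() ignores missing values

# Missing plus present should account for every element.
print(n_missing, n_present, len(s))
```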
Count missing values in DataFrame
While the chain of .isnull().values.any() will work for a DataFrame object to indicate if any value is missing, in some cases it may be useful to also count the number of missing values across the entire DataFrame. Since a DataFrame is two-dimensional, we must invoke .sum() twice: once to total each column and once more to total those column counts.
For example, first we need to create a simple DataFrame with a few missing values:
In [6]: df = pd.DataFrame(np.random.randn(5,5))
df[df > 0.9] = np.nan
Now if we chain a single .sum() method on, instead of getting the total number of missing values, we’re given the number of missing values in each column:
In [7]: df.isnull().sum()
Out[7]:
0    3
1    0
2    1
3    1
4    0
dtype: int64
We can see in this example that our first column (column 0) contains three missing values, along with one each in columns 2 and 3.
In order to get the total number of missing values in the DataFrame, we chain two .sum() methods together:
In [8]: df.isnull().sum().sum()
Out[8]:
5
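Because the DataFrame above is filled with random numbers, your counts will differ from run to run. The following sketch uses a small hand-written DataFrame instead (the column names a, b, and c are made up for this example) so the per-column, per-row, and total counts can be checked by eye.

```python
import numpy as np
import pandas as pd

# A small, fixed DataFrame with three missing values.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7.0, 8.0, 9.0],
})

per_column = df.isnull().sum()         # missing values in each column
per_row = df.isnull().sum(axis=1)      # missing values in each row
total = int(df.isnull().sum().sum())   # missing values in the whole DataFrame
any_missing = bool(df.isnull().values.any())

print(per_column.to_dict())
print(per_row.tolist())
print(total, any_missing)
```

Passing axis=1 to the first .sum() switches the count from columns to rows, which can help when you want to find the records that are incomplete rather than the fields.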