In this article I would like to describe how to find NaN values in a pandas DataFrame. This kind of operation can be very useful given that is common to find datasets with missing or incorrect data values.
I will be using the numpy package to generate some data with NaN values.
Import necessary packages
Generate data with NaN values
num_nan = 25 # number of NaN values wanted in the generated data
np.random.seed(6765431) # set a seed for reproducibility
A = np.random.randn(10, 10)
print(A)
[[-1.56132314 -0.16954058 -0.17845422 -1.33689111 -0.19185078 -1.18617765
0.44499302 -0.61209568 0.31170935 1.4127548 ]
[ 0.85330488 0.68517546 -1.10140989 0.84918019 0.72802961 -0.35161197
0.73519152 1.13145412 0.53231247 0.78103143]
[-0.81614324 0.15906898 0.49940119 -0.09319255 -1.07837721 -0.76053341
0.73622083 -0.45518154 -0.69194032 1.02550409]
[-1.96339975 0.07593331 -0.16798377 -1.20398958 0.88333656 1.17908422
0.26324698 -2.65442248 -0.31583796 -0.16065732]
[-1.24321376 -0.89816898 0.02824671 0.15304093 0.56505667 -0.78115883
0.74504467 1.14025258 -0.04518221 -0.83908358]
[ 1.00967019 0.84240102 1.15043436 -0.40120489 0.00664105 -1.23247563
0.64738343 1.66096762 -0.92556683 0.47575796]
[ 0.96516278 1.11158059 -0.82155143 0.88900313 2.16943761 -2.05250161
2.40156233 0.92453867 -0.24437783 -2.91029265]
[-0.86492662 0.82443151 -0.48246862 -1.05183143 -1.15272524 -0.77170733
0.07177233 1.02820181 -2.08947076 0.89859677]
[-0.07263982 -0.56840867 1.30910275 -0.52846822 0.06019191 -0.61000727
0.40782356 -0.36124333 -1.54522486 -0.07891861]
[-1.96361682 -1.06315325 -0.45582138 -0.74566868 1.27579529 -2.46306005
0.57022673 -0.02793746 0.78652775 1.27690195]]
# Set random values to nan
A.ravel()[np.random.choice(A.size, num_nan, replace=False)] = np.nan
print(A)
[[-1.56132314 -0.16954058 -0.17845422 -1.33689111 -0.19185078 -1.18617765
nan -0.61209568 0.31170935 1.4127548 ]
[ 0.85330488 0.68517546 nan 0.84918019 nan -0.35161197
0.73519152 nan 0.53231247 0.78103143]
[-0.81614324 0.15906898 0.49940119 nan -1.07837721 -0.76053341
0.73622083 nan -0.69194032 1.02550409]
[-1.96339975 0.07593331 nan -1.20398958 0.88333656 nan
0.26324698 nan -0.31583796 -0.16065732]
[-1.24321376 -0.89816898 0.02824671 0.15304093 0.56505667 -0.78115883
0.74504467 1.14025258 -0.04518221 -0.83908358]
[ 1.00967019 0.84240102 nan -0.40120489 0.00664105 nan
0.64738343 1.66096762 -0.92556683 0.47575796]
[ 0.96516278 nan -0.82155143 0.88900313 2.16943761 nan
2.40156233 nan -0.24437783 nan]
[-0.86492662 0.82443151 -0.48246862 -1.05183143 -1.15272524 -0.77170733
0.07177233 1.02820181 -2.08947076 nan]
[-0.07263982 nan 1.30910275 -0.52846822 0.06019191 -0.61000727
0.40782356 -0.36124333 nan nan]
[ nan nan nan nan 1.27579529 -2.46306005
nan nan 0.78652775 1.27690195]]
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | -1.561323 | -0.169541 | -0.178454 | -1.336891 | -0.191851 | -1.186178 | NaN | -0.612096 | 0.311709 | 1.412755 |
1 | 0.853305 | 0.685175 | NaN | 0.849180 | NaN | -0.351612 | 0.735192 | NaN | 0.532312 | 0.781031 |
2 | -0.816143 | 0.159069 | 0.499401 | NaN | -1.078377 | -0.760533 | 0.736221 | NaN | -0.691940 | 1.025504 |
3 | -1.963400 | 0.075933 | NaN | -1.203990 | 0.883337 | NaN | 0.263247 | NaN | -0.315838 | -0.160657 |
4 | -1.243214 | -0.898169 | 0.028247 | 0.153041 | 0.565057 | -0.781159 | 0.745045 | 1.140253 | -0.045182 | -0.839084 |
5 | 1.009670 | 0.842401 | NaN | -0.401205 | 0.006641 | NaN | 0.647383 | 1.660968 | -0.925567 | 0.475758 |
6 | 0.965163 | NaN | -0.821551 | 0.889003 | 2.169438 | NaN | 2.401562 | NaN | -0.244378 | NaN |
7 | -0.864927 | 0.824432 | -0.482469 | -1.051831 | -1.152725 | -0.771707 | 0.071772 | 1.028202 | -2.089471 | NaN |
8 | -0.072640 | NaN | 1.309103 | -0.528468 | 0.060192 | -0.610007 | 0.407824 | -0.361243 | NaN | NaN |
9 | NaN | NaN | NaN | NaN | 1.275795 | -2.463060 | NaN | NaN | 0.786528 | 1.276902 |
Check for NaN values
Now that we have some data to operate on let’s see the different ways we can check for missing values.
There are two methods of the DataFrame object that can be used: DataFrame#isna()
and DataFrame#isnull()
. But if you check the source code it seems that isnull()
is only an alias for the isna()
method. To keep it simple I will only use the isna()
method as we would get the same result using isnull()
.
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | True | False | False | False |
1 | False | False | True | False | True | False | False | True | False | False |
2 | False | False | False | True | False | False | False | True | False | False |
3 | False | False | True | False | False | True | False | True | False | False |
4 | False | False | False | False | False | False | False | False | False | False |
5 | False | False | True | False | False | True | False | False | False | False |
6 | False | True | False | False | False | True | False | True | False | True |
7 | False | False | False | False | False | False | False | False | False | True |
8 | False | True | False | False | False | False | False | False | True | True |
9 | True | True | True | True | False | False | True | True | False | False |
As it can be seen above when we use the isna()
method it returns a DataFrame with boolean values, where True
indicates NaN values and False
otherwise.
If we wanted to know how many missing values there are on each row or column we could use the DataFrame#sum()
method:
To simply know the total number of missing values we can call sum()
again:
If we simply wanna know if there is any missing value with no care for the quantity we can simply use the any()
method:
0 True
1 True
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
dtype: bool
Calling it again we have a single boolean output:
Besides the isna()
method we also have the notna()
method which is its boolean inverse. Applying it we can get the number of values that are not missing or simply if all values are not missing (but using the all()
method instead of any()
).
Note 1: in the examples, it was used the DataFrame methods to check for missing values, but the pandas package has its own functions with the same purpose that can be applied to other objects. Example:
[False False True]
[ True True False]
Note 2: the methods applied here on DataFrame objects are also available for Series and Index objects.
Time comparison
Comparing the time taken by the two methods we can see that using any()
is faster but sum()
will give us the additional information about how many missing values there are.
274 µs ± 3.76 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Dealing with missing values
Two easy ways to deal with missing values are removing them or filling them with some value. These can be achieved with the dropna()
and fillna()
methods.
The dropna()
method will return a DataFrame without the rows and columns containing missing values.
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
---|---|---|---|---|---|---|---|---|---|---|
4 | -1.243214 | -0.898169 | 0.028247 | 0.153041 | 0.565057 | -0.781159 | 0.745045 | 1.140253 | -0.045182 | -0.839084 |
The fillna()
method will return a DataFrame with the missing values filled with a specified value.
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | -1.561323 | -0.169541 | -0.178454 | -1.336891 | -0.191851 | -1.186178 | 5.000000 | -0.612096 | 0.311709 | 1.412755 |
1 | 0.853305 | 0.685175 | 5.000000 | 0.849180 | 5.000000 | -0.351612 | 0.735192 | 5.000000 | 0.532312 | 0.781031 |
2 | -0.816143 | 0.159069 | 0.499401 | 5.000000 | -1.078377 | -0.760533 | 0.736221 | 5.000000 | -0.691940 | 1.025504 |
3 | -1.963400 | 0.075933 | 5.000000 | -1.203990 | 0.883337 | 5.000000 | 0.263247 | 5.000000 | -0.315838 | -0.160657 |
4 | -1.243214 | -0.898169 | 0.028247 | 0.153041 | 0.565057 | -0.781159 | 0.745045 | 1.140253 | -0.045182 | -0.839084 |
5 | 1.009670 | 0.842401 | 5.000000 | -0.401205 | 0.006641 | 5.000000 | 0.647383 | 1.660968 | -0.925567 | 0.475758 |
6 | 0.965163 | 5.000000 | -0.821551 | 0.889003 | 2.169438 | 5.000000 | 2.401562 | 5.000000 | -0.244378 | 5.000000 |
7 | -0.864927 | 0.824432 | -0.482469 | -1.051831 | -1.152725 | -0.771707 | 0.071772 | 1.028202 | -2.089471 | 5.000000 |
8 | -0.072640 | 5.000000 | 1.309103 | -0.528468 | 0.060192 | -0.610007 | 0.407824 | -0.361243 | 5.000000 | 5.000000 |
9 | 5.000000 | 5.000000 | 5.000000 | 5.000000 | 1.275795 | -2.463060 | 5.000000 | 5.000000 | 0.786528 | 1.276902 |