7 minutes

In this article I would like to describe how to find NaN values in a pandas DataFrame. This kind of operation can be very useful given that is common to find datasets with missing or incorrect data values.

I will be using the numpy package to generate some data with NaN values.

Import necessary packages
In [1]:
import pandas as pd
import numpy as np
import platform
In [2]:
print(f'Python version: {platform.python_version()} ({platform.python_implementation()})')
print(f'Pandas version: {pd.__version__}')
print(f'Numpy version: {np.__version__}')
Python version: 3.6.4 (CPython)
Pandas version: 0.23.1
Numpy version: 1.14.5

Generate data with NaN values

In [3]:
num_nan = 25 # number of NaN values wanted in the generated data
np.random.seed(6765431)  # set a seed for reproducibility
A = np.random.randn(10, 10)
print(A)
[[-1.56132314 -0.16954058 -0.17845422 -1.33689111 -0.19185078 -1.18617765
   0.44499302 -0.61209568  0.31170935  1.4127548 ]
 [ 0.85330488  0.68517546 -1.10140989  0.84918019  0.72802961 -0.35161197
   0.73519152  1.13145412  0.53231247  0.78103143]
 [-0.81614324  0.15906898  0.49940119 -0.09319255 -1.07837721 -0.76053341
   0.73622083 -0.45518154 -0.69194032  1.02550409]
 [-1.96339975  0.07593331 -0.16798377 -1.20398958  0.88333656  1.17908422
   0.26324698 -2.65442248 -0.31583796 -0.16065732]
 [-1.24321376 -0.89816898  0.02824671  0.15304093  0.56505667 -0.78115883
   0.74504467  1.14025258 -0.04518221 -0.83908358]
 [ 1.00967019  0.84240102  1.15043436 -0.40120489  0.00664105 -1.23247563
   0.64738343  1.66096762 -0.92556683  0.47575796]
 [ 0.96516278  1.11158059 -0.82155143  0.88900313  2.16943761 -2.05250161
   2.40156233  0.92453867 -0.24437783 -2.91029265]
 [-0.86492662  0.82443151 -0.48246862 -1.05183143 -1.15272524 -0.77170733
   0.07177233  1.02820181 -2.08947076  0.89859677]
 [-0.07263982 -0.56840867  1.30910275 -0.52846822  0.06019191 -0.61000727
   0.40782356 -0.36124333 -1.54522486 -0.07891861]
 [-1.96361682 -1.06315325 -0.45582138 -0.74566868  1.27579529 -2.46306005
   0.57022673 -0.02793746  0.78652775  1.27690195]]
In [4]:
# Set random values to nan
A.ravel()[np.random.choice(A.size, num_nan, replace=False)] = np.nan
print(A)
[[-1.56132314 -0.16954058 -0.17845422 -1.33689111 -0.19185078 -1.18617765
          nan -0.61209568  0.31170935  1.4127548 ]
 [ 0.85330488  0.68517546         nan  0.84918019         nan -0.35161197
   0.73519152         nan  0.53231247  0.78103143]
 [-0.81614324  0.15906898  0.49940119         nan -1.07837721 -0.76053341
   0.73622083         nan -0.69194032  1.02550409]
 [-1.96339975  0.07593331         nan -1.20398958  0.88333656         nan
   0.26324698         nan -0.31583796 -0.16065732]
 [-1.24321376 -0.89816898  0.02824671  0.15304093  0.56505667 -0.78115883
   0.74504467  1.14025258 -0.04518221 -0.83908358]
 [ 1.00967019  0.84240102         nan -0.40120489  0.00664105         nan
   0.64738343  1.66096762 -0.92556683  0.47575796]
 [ 0.96516278         nan -0.82155143  0.88900313  2.16943761         nan
   2.40156233         nan -0.24437783         nan]
 [-0.86492662  0.82443151 -0.48246862 -1.05183143 -1.15272524 -0.77170733
   0.07177233  1.02820181 -2.08947076         nan]
 [-0.07263982         nan  1.30910275 -0.52846822  0.06019191 -0.61000727
   0.40782356 -0.36124333         nan         nan]
 [        nan         nan         nan         nan  1.27579529 -2.46306005
          nan         nan  0.78652775  1.27690195]]
In [5]:
# Create a DataFrame from the generated data
df = pd.DataFrame(A)
df
Out[5]:
0 1 2 3 4 5 6 7 8 9
0 -1.561323 -0.169541 -0.178454 -1.336891 -0.191851 -1.186178 NaN -0.612096 0.311709 1.412755
1 0.853305 0.685175 NaN 0.849180 NaN -0.351612 0.735192 NaN 0.532312 0.781031
2 -0.816143 0.159069 0.499401 NaN -1.078377 -0.760533 0.736221 NaN -0.691940 1.025504
3 -1.963400 0.075933 NaN -1.203990 0.883337 NaN 0.263247 NaN -0.315838 -0.160657
4 -1.243214 -0.898169 0.028247 0.153041 0.565057 -0.781159 0.745045 1.140253 -0.045182 -0.839084
5 1.009670 0.842401 NaN -0.401205 0.006641 NaN 0.647383 1.660968 -0.925567 0.475758
6 0.965163 NaN -0.821551 0.889003 2.169438 NaN 2.401562 NaN -0.244378 NaN
7 -0.864927 0.824432 -0.482469 -1.051831 -1.152725 -0.771707 0.071772 1.028202 -2.089471 NaN
8 -0.072640 NaN 1.309103 -0.528468 0.060192 -0.610007 0.407824 -0.361243 NaN NaN
9 NaN NaN NaN NaN 1.275795 -2.463060 NaN NaN 0.786528 1.276902

Check for NaN values

Now that we have some data to operate on let's see the different ways we can check for missing values.

There are two methods of the DataFrame object that can be used: DataFrame#isna() and DataFrame#isnull(). But if you check the source code it seems that isnull() is only an alias for the isna() method. To keep it simple I will only use the isna() method as we would get the same result using isnull().

In [6]:
df.isna()
Out[6]:
0 1 2 3 4 5 6 7 8 9
0 False False False False False False True False False False
1 False False True False True False False True False False
2 False False False True False False False True False False
3 False False True False False True False True False False
4 False False False False False False False False False False
5 False False True False False True False False False False
6 False True False False False True False True False True
7 False False False False False False False False False True
8 False True False False False False False False True True
9 True True True True False False True True False False

As it can be seen above when we use the isna() method it returns a DataFrame with boolean values, where True indicates NaN values and False otherwise.

If we wanted to know how many missing values there are on each row or column we could use the DataFrame#sum() method:

In [7]:
df.isna().sum(axis='rows')  # 'rows' or 0
Out[7]:
0    1
1    3
2    4
3    2
4    1
5    3
6    2
7    5
8    1
9    3
dtype: int64
In [8]:
df.isna().sum(axis='columns')  # 'columns' or 1
Out[8]:
0    1
1    3
2    2
3    3
4    0
5    2
6    4
7    1
8    3
9    6
dtype: int64

To simply know the total number of missing values we can call sum() again:

In [9]:
df.isna().sum().sum()
Out[9]:
25

If we simply wanna know if there is any missing value with no care for the quantity we can simply use the any() method:

In [10]:
df.isna().any()  # can also receive axis='rows' or 'columns'
Out[10]:
0    True
1    True
2    True
3    True
4    True
5    True
6    True
7    True
8    True
9    True
dtype: bool

Calling it again we have a single boolean output:

In [11]:
df.isna().any().any()
Out[11]:
True

Besides the isna() method we also have the notna() method which is its boolean inverse. Applying it we can get the number of values that are not missing or simply if all values are not missing (but using the all() method instead of any()).

In [12]:
print(df.notna().sum().sum())  # not missing
print(df.notna().all().all())
75
False

Note 1: in the examples, it was used the DataFrame methods to check for missing values, but the pandas package has its own functions with the same purpose that can be applied to other objects. Example:

In [13]:
print(pd.isna([1, 2, np.nan]))
print(pd.notna([1, 2, np.nan]))
[False False  True]
[ True  True False]

Note 2: the methods applied here on DataFrame objects are also available for Series and Index objects.

Time comparison

Comparing the time taken by the two methods we can see that using any() is faster but sum() will give us the additional information about how many missing values there are.

In [14]:
%timeit df.isna().any().any()
333 µs ± 33.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [15]:
%timeit df.isna().sum().sum()
561 µs ± 97.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Dealing with missing values

Two easy ways to deal with missing values are removing them or filling them with some value. These can be achieved with the dropna() and fillna() methods.

The dropna() method will return a DataFrame without the rows and columns containing missing values.

In [16]:
df.dropna()
Out[16]:
0 1 2 3 4 5 6 7 8 9
4 -1.243214 -0.898169 0.028247 0.153041 0.565057 -0.781159 0.745045 1.140253 -0.045182 -0.839084

The fillna() method will return a DataFrame with the missing values filled with a specified value.

In [17]:
df.fillna(value=5)
Out[17]:
0 1 2 3 4 5 6 7 8 9
0 -1.561323 -0.169541 -0.178454 -1.336891 -0.191851 -1.186178 5.000000 -0.612096 0.311709 1.412755
1 0.853305 0.685175 5.000000 0.849180 5.000000 -0.351612 0.735192 5.000000 0.532312 0.781031
2 -0.816143 0.159069 0.499401 5.000000 -1.078377 -0.760533 0.736221 5.000000 -0.691940 1.025504
3 -1.963400 0.075933 5.000000 -1.203990 0.883337 5.000000 0.263247 5.000000 -0.315838 -0.160657
4 -1.243214 -0.898169 0.028247 0.153041 0.565057 -0.781159 0.745045 1.140253 -0.045182 -0.839084
5 1.009670 0.842401 5.000000 -0.401205 0.006641 5.000000 0.647383 1.660968 -0.925567 0.475758
6 0.965163 5.000000 -0.821551 0.889003 2.169438 5.000000 2.401562 5.000000 -0.244378 5.000000
7 -0.864927 0.824432 -0.482469 -1.051831 -1.152725 -0.771707 0.071772 1.028202 -2.089471 5.000000
8 -0.072640 5.000000 1.309103 -0.528468 0.060192 -0.610007 0.407824 -0.361243 5.000000 5.000000
9 5.000000 5.000000 5.000000 5.000000 1.275795 -2.463060 5.000000 5.000000 0.786528 1.276902

Comments

Comments powered by Disqus