How to Fill In Missing Data Using Python pandas

February 17, 2022 | May 11, 2022

Data cleaning undoubtedly takes a ton of time in data science, and missing data is one of the challenges you’ll face often. pandas is a valuable Python data manipulation tool that helps you fix missing values in your dataset, among other things.

You can fix missing data by either dropping it or filling it with other values. In this article, we’ll explore the different ways to fill missing data using pandas.

1. Use the fillna() Method

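The fillna() function scans your dataset and fills every missing value with a value you specify. It accepts some optional arguments; take note of the following ones:

value: The value you want to insert in place of the missing data. It can be a single value, or a dictionary or Series that maps each column to its own fill value.

method: Lets you fill missing values forward or backward. It accepts either 'ffill' or 'bfill'. Note that pandas 2.1 and later deprecate this argument in favor of the dedicated ffill() and bfill() methods.

inplace: This accepts a Boolean. If True, fillna() modifies the existing DataFrame directly and returns None. Otherwise, it leaves the original untouched and returns a filled copy.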
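To make the difference between the default behavior and inplace=True concrete, here is a minimal sketch. The demo DataFrame and its score column are hypothetical toy data, not part of the practice DataFrame created later in this article.

```python
import pandas

# Hypothetical toy data used only for this illustration
demo = pandas.DataFrame({"score": [1.0, None, 3.0]})

# Without inplace=True, fillna() returns a new DataFrame and leaves demo untouched
filled = demo.fillna(value=0)
print(demo["score"].isna().sum())    # 1 missing value is still in demo
print(filled["score"].isna().sum())  # 0 missing values in the returned copy

# With inplace=True, demo itself is modified and the call returns None
demo.fillna(value=0, inplace=True)
print(demo["score"].tolist())        # [1.0, 0.0, 3.0]
```

Assigning the returned copy to a new name, as in the first call, is usually easier to debug because the original data stays available for comparison.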

Before we start, make sure you install pandas into your Python virtual environment using pip in your terminal:

pip install pandas

Next, inside the Python script, we’ll create a practice DataFrame and insert null values (NaN) into some rows:

import pandas
df = pandas.DataFrame({'A': [0, 3, None, 10, 3, None],
                       'B': [None, None, 7.13, 13.82, 7, 7],
                       'C': [None, "Pandas", None, "Pandas", "Python", "JavaScript"]})
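Before filling anything, it can help to confirm where the gaps are. The sketch below rebuilds the same practice data under a separate, hypothetical name (df_check) so that it runs on its own, then counts the missing values per column with isna().

```python
import pandas

# Same values as the practice DataFrame, under a separate name for this check
df_check = pandas.DataFrame({'A': [0, 3, None, 10, 3, None],
                             'B': [None, None, 7.13, 13.82, 7, 7],
                             'C': [None, "Pandas", None, "Pandas", "Python", "JavaScript"]})

# isna() marks each missing cell as True; sum() counts the True values per column
missing_per_column = df_check.isna().sum()
print(missing_per_column)  # each of A, B, and C has 2 missing values
```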

Now, check out how you can fill these missing values using the various available methods in pandas.

Fill Missing Values With Mean, Median, or Mode

This method involves replacing missing values with computed averages. Filling missing data with a mean or median value is applicable when the columns involved have integer or float data types.

You can also fill missing data with the mode, which is the most frequently occurring value. This also works for integers and floats, but it’s handier when the columns in question contain strings.

Here’s how to insert the mean and median into the missing rows in the DataFrame you created earlier:

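# To insert the mean of each numeric column into its missing rows
# (numeric_only=True skips the string column C, which has no mean):
df.fillna(df.mean(numeric_only=True).round(1), inplace=True)
# For the median, use this line instead of the one above;
# once the mean has filled the gaps, this call changes nothing:
df.fillna(df.median(numeric_only=True).round(1), inplace=True)
print(df)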
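To compare the two statistics side by side without overwriting anything, one option is to skip inplace=True and keep each result in its own variable. This is a sketch on a fresh copy of the practice data; the names df_demo, mean_filled, and median_filled are illustrative.

```python
import pandas

df_demo = pandas.DataFrame({'A': [0, 3, None, 10, 3, None],
                            'B': [None, None, 7.13, 13.82, 7, 7],
                            'C': [None, "Pandas", None, "Pandas", "Python", "JavaScript"]})

# numeric_only=True restricts the statistics to columns A and B
mean_filled = df_demo.fillna(df_demo.mean(numeric_only=True).round(1))
median_filled = df_demo.fillna(df_demo.median(numeric_only=True).round(1))

print(mean_filled['A'].tolist())    # [0.0, 3.0, 4.0, 10.0, 3.0, 4.0]
print(median_filled['A'].tolist())  # [0.0, 3.0, 3.0, 10.0, 3.0, 3.0]

# The string column C is untouched by either call
print(mean_filled['C'].isna().sum())  # 2
```

Column A shows why the choice matters: the single large value 10 pulls the mean up to 4.0, while the median stays at 3.0.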

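Inserting the modal value as you did for the mean and median above doesn’t fill the entire DataFrame, because mode() returns a DataFrame (a column can have more than one mode) rather than a single value per column. But you can insert it into a specific column instead, say, column C: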

# Assign the result back; inplace=True on a selected column may not update df in newer pandas
df['C'] = df['C'].fillna(df['C'].mode()[0])

With that said, it’s still possible to insert the modal value of each column across its missing rows at once using a for loop:

for col in df.columns:
    # Fill each column with its own first modal value and assign it back
    df[col] = df[col].fillna(df[col].mode()[0])
print(df)
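As an alternative to the loop, the first row of df.mode() already holds one modal value per column, so it can be passed to fillna() as a Series. The following is a sketch of that idea on a fresh copy of the practice data; the names df_modes and mode_filled are illustrative.

```python
import pandas

df_modes = pandas.DataFrame({'A': [0, 3, None, 10, 3, None],
                             'B': [None, None, 7.13, 13.82, 7, 7],
                             'C': [None, "Pandas", None, "Pandas", "Python", "JavaScript"]})

# mode() returns a DataFrame; its first row is a Series of one mode per column
first_modes = df_modes.mode().iloc[0]

# fillna() matches the Series labels to the column names
mode_filled = df_modes.fillna(first_modes)
print(mode_filled)
```

If a column has several equally frequent values, this picks the first one that mode() lists, just as mode()[0] does in the loop.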

If you want a different statistic for each column, pass fillna() a dictionary that maps each column name to its fill value:

df.fillna({"A":df['A'].mean(), 
 "B": df['B'].median(), 
 "C": df['C'].mode()[0]}, 
 inplace=True)
print(df)

Fill Null Rows With Values Using ffill

This involves specifying the fill method inside the fillna() function or, in newer pandas versions, calling the equivalent ffill() method. It fills each missing value with the nearest valid value above it.

You could also call it forward-filling:

# Older pandas versions: df.fillna(method='ffill', inplace=True)
df.ffill(inplace=True)
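The effect is easiest to see on a small Series. This sketch uses the ffill() method that pandas provides alongside fillna(); the Series s is hypothetical toy data.

```python
import pandas

# Hypothetical toy data: one gap at the start and one in the middle
s = pandas.Series([None, 1.0, None, 4.0])

forward = s.ffill()
print(forward.tolist())  # [nan, 1.0, 1.0, 4.0]
```

The first value stays missing because there is nothing above it to copy, so forward-filling alone may leave gaps at the top of a column.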

Fill Missing Rows With Values Using bfill

Here, you’ll replace the ffill method mentioned above with bfill. It fills each missing value in the DataFrame with the nearest valid value below it.

This one is called backward-filling:

# Older pandas versions: df.fillna(method='bfill', inplace=True)
df.bfill(inplace=True)
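Backward-filling has the mirror-image limitation: a gap at the end of a column has nothing below it. The sketch below, on hypothetical toy data, shows bfill() on its own and then chained with ffill() to cover the trailing gap.

```python
import pandas

# Hypothetical toy data: gaps at the start, in the middle, and at the end
s = pandas.Series([None, 1.0, None, 4.0, None])

backward = s.bfill()
print(backward.tolist())  # [1.0, 1.0, 4.0, 4.0, nan]

# Chaining the two directions fills the trailing gap as well
both = s.bfill().ffill()
print(both.tolist())      # [1.0, 1.0, 4.0, 4.0, 4.0]
```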

2. The replace() Method

You can replace the NaN values in a specific column with the mean, median, mode, or any other value.

See how this works by replacing the null rows in a named column with its mean, median, or mode:

import pandas
import numpy  # numpy is installed alongside pandas as a dependency
# Replace the null values in column A with its mean:
df['A'] = df['A'].replace([numpy.nan], df['A'].mean())
# Replace the null values in column B with its median:
df['B'] = df['B'].replace([numpy.nan], df['B'].median())
# Use the modal value for column C:
df['C'] = df['C'].replace([numpy.nan], df['C'].mode()[0])
print(df)
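One situation where replace() is useful is when missing data is recorded as a placeholder number rather than NaN. The sketch below assumes a hypothetical placeholder of -999; it first converts the placeholder to NaN, then replaces every NaN with the mean of the remaining values.

```python
import numpy
import pandas

# Hypothetical toy data: -999 is an assumed placeholder for "missing"
s = pandas.Series([2.0, -999, 4.0, numpy.nan])

# Step 1: turn the placeholder into a real NaN so it is excluded from the mean
cleaned = s.replace(-999, numpy.nan)

# Step 2: replace every NaN with the mean of the valid values (2.0 and 4.0)
filled = cleaned.replace(numpy.nan, cleaned.mean())
print(filled.tolist())  # [2.0, 3.0, 4.0, 3.0]
```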

3. Fill Missing Data With interpolate()

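The interpolate() function uses the existing values in a column to estimate the missing ones. With the linear method, each gap is filled with evenly spaced values between its nearest valid neighbors.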

Run the following code to see how this works:

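num_cols = ['A', 'B']  # interpolation only applies to numeric columns
# Interpolate linearly, filling consecutive gaps in the backward direction:
df[num_cols] = df[num_cols].interpolate(method='linear', limit_direction='backward')
# Interpolate linearly, filling consecutive gaps in the forward direction:
df[num_cols] = df[num_cols].interpolate(method='linear', limit_direction='forward')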
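A small Series makes the linear estimate easy to check by hand. In this sketch, on hypothetical toy data, the two missing values sit between 0 and 3, so linear interpolation spaces them evenly.

```python
import pandas

# Hypothetical toy data: two missing values between 0.0 and 3.0
s = pandas.Series([0.0, None, None, 3.0])

estimated = s.interpolate(method='linear')
print(estimated.tolist())  # [0.0, 1.0, 2.0, 3.0]
```

Unlike forward- or backward-filling, which copy a neighboring value, interpolation produces new values, so it tends to suit ordered data such as time series better than unordered records.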

Deal With Missing Rows Carefully

While we’ve only considered filling missing data with computed values such as the mean, median, and mode, or with neighboring and interpolated values, other techniques exist for fixing missing values. Data scientists, for instance, sometimes remove the rows that contain missing values, depending on the case.
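For completeness, here is a brief sketch of the removal approach using dropna() on a fresh copy of the practice data. The names df_drop, complete_rows, and rows_with_a are illustrative.

```python
import pandas

df_drop = pandas.DataFrame({'A': [0, 3, None, 10, 3, None],
                            'B': [None, None, 7.13, 13.82, 7, 7],
                            'C': [None, "Pandas", None, "Pandas", "Python", "JavaScript"]})

# Drop every row that has at least one missing value
complete_rows = df_drop.dropna()
print(len(complete_rows))  # 2 of the 6 rows are fully populated

# Drop rows only when column A is missing
rows_with_a = df_drop.dropna(subset=['A'])
print(len(rows_with_a))    # 4
```

Losing four of six rows in the first call shows why dropping is a trade-off: it avoids inventing values but can discard a large share of a small dataset.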

Moreover, it’s essential to think critically about your strategy before using it. Otherwise, you might get undesirable analysis or prediction results. Some initial data visualization strategies might help.