Technology: Python, sklearn, pandas
Tags: Machine Learning, Statistics, Data Engineering, Analytics
Image Credits: Ryoji Iwata on Unsplash

Imputation Module for Python

Impyte is a lightweight Python module created with one goal in mind: to simplify the way researchers deal with missing values. It leverages machine learning algorithms to complete missing information and offers intuitive visualization methods to help with data engineering and analytics.

The module is aimed at anyone dealing with incomplete data: impyte was designed to gather additional insights about different NaN patterns and to impute missing values in an easy way. The full documentation can be found here. Since the module is still in beta, you can install the latest version from its GitHub repository via pip:

pip install git+git://github.com/andirs/impyte.git

There are two essential features of impyte:

  1. Visualization of Patterns
  2. Imputation of missing information

This report highlights how to use both features in just a few lines of code.

Pattern Visualization

One problem I face a lot when dealing with datasets is uncertainty about how missing data is distributed across variables. Before, for example, discarding all observations with missing values, it's helpful to gather more information about the patterns that can be detected just by analyzing the absence of information. I therefore wanted to create a method that makes these first investigation steps easier. In my research I came across the beautiful mice package by Stef van Buuren, which inspired the first function of impyte. Below is a first example of how to visualize information about missing data. To visualize NaN values within the iconic Titanic dataset, you can do the following:

import pandas as pd
from impyte import impyte

# download the dataset (when in a Jupyter notebook)
!wget http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls

df = pd.read_excel('titanic3.xls')
df.drop(['body', 'boat', 'cabin'], axis=1, inplace=True)  # remove for clarity

# load impyte module and show patterns
imp = impyte.Impyter()
imp.load_data(df)
imp.pattern()
   pclass  survived  name  sex  age  sibsp  parch  ticket  fare  embarked  home.dest  Count
0       1         1     1    1    1      1      1       1     1         1          1    684
1       1         1     1    1    1      1      1       1     1         1        NaN    359
2       1         1     1    1  NaN      1      1       1     1         1        NaN    203
3       1         1     1    1  NaN      1      1       1     1         1          1     60
4       1         1     1    1    1      1      1       1     1       NaN        NaN      1
5       1         1     1    1    1      1      1       1     1       NaN          1      1
6       1         1     1    1    1      1      1       1   NaN         1        NaN      1

The overview lists all NaN patterns present in the data; each row represents one specific pattern. A NaN token marks every column in which the pattern has missing data points, and a 1 marks columns where the data is complete. A fully complete dataset would show only 1s in all columns. The Count column on the right indicates how often each pattern was found. Patterns are always sorted by count, so pattern 0 is not necessarily the pattern containing only complete cases.
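If you want to reproduce this kind of overview without impyte, the same pattern counting can be sketched in plain pandas. The helper name nan_pattern_table below is my own illustrative choice, and the snippet reuses the df loaded above:

import pandas as pd

def nan_pattern_table(df):
    # mark each cell: '1' if a value is present, 'NaN' if it is missing
    patterns = df.notnull().replace({True: '1', False: 'NaN'})
    # group identical row-wise patterns and count their occurrences
    counts = patterns.groupby(list(patterns.columns)).size()
    return counts.sort_values(ascending=False).reset_index(name='Count')

nan_pattern_table(df)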

This overview can already hint at co-occurrences of NaN values. Is it a coincidence that age values are missing almost exclusively for passengers without a home destination record? Insights like these can offer a first indication of how to refine the data collection process, how to weigh certain variables, or how to deal with uncertainty.
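One quick way to check such a suspicion is a contingency table of the two missingness indicators; a minimal plain-pandas sketch, again using the df from above:

# cross-tabulate the missingness of age against the missingness of home.dest
pd.crosstab(df['age'].isnull(), df['home.dest'].isnull(),
            rownames=['age missing'], colnames=['home.dest missing'])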

To dive deeper into the analysis, the get_summary() function offers insights into the completeness of each variable. When the importance_filter flag is set to True, only variables with NaN or empty values are listed.

# load feature summary
imp.get_summary(importance_filter=True)
           Complete  Missing  Percentage  Unique
fare           1308        1      0.08 %     282
embarked       1307        2      0.15 %       4
age            1046      263     20.09 %      99
home.dest       745      564     43.09 %     370

Next to statistics on the amount of missing data per feature, the summary counts each variable's unique values in the data. This gives a first glimpse of the character of a variable and indicates whether it has a continuous or discrete structure. This distinction is made under the hood, using thresholds to detect continuous variables.
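The exact thresholds impyte uses are internal to the module, but the idea can be sketched with a simple heuristic. The cutoff of 10 unique values below is an illustrative assumption, not impyte's actual default:

import pandas as pd

def guess_type(series, max_discrete_unique=10):
    # numeric columns with many distinct values are treated as continuous
    if pd.api.types.is_numeric_dtype(series) and series.nunique() > max_discrete_unique:
        return 'continuous'
    return 'discrete'

for col in ['fare', 'embarked', 'age', 'home.dest']:
    print(col, guess_type(df[col]))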

Imputation

There are multiple ways to handle missing data, from dropping observations with missing input to computing average values or fitting regression models. impyte leverages machine learning algorithms to complete the dataset, turning the completion problem into a classification and/or regression task. While training the chosen machine learning model, impyte computes cross-validated F1 or R2 scores to estimate the model's predictive power. All of this can be done in only a few lines of code.

# read in data
df = pd.read_excel('titanic3.xls')

# drop variables that won't inform the models
df.drop(['body', 'boat', 'cabin', 'name', 'ticket', 'fare', 'home.dest'], axis=1, inplace=True)

# load impyte module and impute
imp = impyte.Impyter(df)
imp.impute(estimator='rf', threshold={'f1_macro': .7, 'r2': .75})

In the example above, any variable whose model scores below the chosen threshold (an F1 score of .7 for classification models, an R2 score of .75 for regression models) is left untouched; all other variables are completed.
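Conceptually, the threshold check corresponds to something like the following scikit-learn sketch for a single numeric target column. This is a simplified illustration of the idea, not impyte's internal code, and it reuses the df from the snippet above:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# use complete cases of 'age' as training data; one-hot encode the predictors
complete = df.dropna(subset=['age'])
X = pd.get_dummies(complete.drop(columns=['age']))
y = complete['age']

# the cross-validated R2 score estimates the predictive power of the model
score = cross_val_score(RandomForestRegressor(), X, y, scoring='r2', cv=5).mean()
print('impute age' if score >= .75 else 'leave age untouched')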

impyte is only one piece of the equation. To get the most out of any imputation process, a deep understanding of the data is needed, and thorough pre-processing and cleaning will further improve the results. impyte helps with some of these challenges but was designed to work in concert with the rest of the data science workflow.

As mentioned earlier, the module is still under development. Any feedback or comments regarding its functionality are highly appreciated.