Dawn of the Denoisers: Multi-Output ML Models for Tabular Data Imputation

Dealing with missing values in tabular data is a fundamental problem in data science. If the missing values cannot be ignored or omitted for whatever reason, then we can try to impute them, i.e., replace the missing values with some other values. There are a few simple (yet simplistic) approaches to imputation and a few advanced ones (more accurate but complex and potentially resource-intensive). This article presents a novel approach to tabular data imputation that seeks to achieve a balance between simplicity and usefulness.

Specifically, we will see how the concept of denoising (typically associated with unstructured data) can be used to quickly turn just about any multi-output ML algorithm into a tabular data imputer that is fit for use in practice. We will first cover some basic concepts around denoising, imputation and multi-output algorithms, and subsequently dive into the details of how to turn multi-output algorithms into imputers using denoising. We will then briefly look at how this novel approach can be applied in practice with an example from industry. Finally, we will discuss the future relevance of denoising-based imputation of tabular data in the age of generative AI and foundation models. For ease of explication, code examples will only be shown in Python, although the conceptual approach itself is language-agnostic.

From Denoising to Imputation

Denoising is about removing noise from data. Denoising algorithms take noisy data as input, do some clever processing to reduce the noise as much as possible, and return the de-noised data. Typical use cases for denoising include removing noise from audio data and sharpening blurry images. Denoising algorithms can be built using several approaches, ranging from Gaussian and median filters to autoencoders.

While the concept of denoising tends to be primarily associated with use cases involving unstructured data (e.g., audio, images), imputation of structured tabular data is a closely related problem. There are many ways to replace (or impute) missing values in tabular data. For example, they could simply be replaced by zeros (or some equivalent value in the given context), or by some statistic of the relevant row or column for numerical data (e.g., mean, median, mode, min, max) — but doing this can distort the data and, if used as a pre-processing step in an ML training workflow, such simplistic imputation could adversely affect predictive performance. Other approaches like k-nearest neighbors (KNN) or association rule mining may perform better, but since they have no separate training phase and instead operate directly on the test data, they can become slow as the test data grows large; this is especially problematic for use cases that require fast online inference.
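
For reference, both the simple statistic-based strategy and the KNN-based strategy are available off-the-shelf in Scikit-learn. The following minimal sketch (with a tiny made-up array, using np.nan to mark the missing values) is only meant to illustrate these baseline approaches, not the method developed in this article:

import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# A tiny table with missing values (np.nan marks the gaps)
X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 6.0]])

# Simple strategy: replace each gap with the mean of its column
X_mean_imputed = SimpleImputer(strategy="mean").fit_transform(X)

# KNN strategy: replace each gap based on the 2 most similar rows
X_knn_imputed = KNNImputer(n_neighbors=2).fit_transform(X)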

Now, one could simply train an ML model that sets the feature with the missing values as the output and uses the rest of the features as predictors (or inputs). If we have several features with missing values, building single-output models for each of them might be cumbersome, not to mention expensive, so we could try to build one multi-output model that predicts missing values for all the affected features at once. Crucially, if missing values can be thought of as noise, then we may be able to apply denoising concepts to impute tabular data — and this is the key insight that we will build on in the following sections.

Multi-Output ML Algorithms

As the name suggests, multi-output (or multi-target) algorithms can be used to train models for predicting multiple output/target features simultaneously. The Scikit-learn user guide provides a good overview of multi-output algorithms for classification and regression (see the section on multiclass and multioutput algorithms).

While some ML algorithms support multi-output modeling out-of-the-box, others natively support single-output modeling only. Libraries such as Scikit-learn let single-output algorithms be used for multi-output modeling by providing wrappers that expose the usual functions like fit and predict while, under the hood, fitting a separate single-output model for each target independently. The following example code shows how the implementation of a Linear Support Vector Regression (Linear SVR) in Scikit-learn, which natively supports single-output modeling only, can be wrapped into a multi-output regressor using the MultiOutputRegressor wrapper.

from sklearn.datasets import make_regression
from sklearn.svm import LinearSVR
from sklearn.multioutput import MultiOutputRegressor

# Construct a toy dataset
RANDOM_STATE = 100
xs, ys = make_regression(
    n_samples=2000, n_features=7, n_informative=5,
    n_targets=3, random_state=RANDOM_STATE, noise=0.2
)

# Wrap the Linear SVR to enable multi-output modeling
wrapped_model = MultiOutputRegressor(
    LinearSVR(random_state=RANDOM_STATE)
).fit(xs, ys)

While such a wrapping strategy at least makes it possible to use single-output algorithms in multi-output use cases, it may not account for correlations or dependencies between the output features (i.e., whether a predicted set of output features makes sense as a whole). By contrast, some ML algorithms that natively support multi-output modeling do seem to account for inter-output relationships. For example, when a decision tree in Scikit-learn is used to model n outputs based on some input data, all n output values are stored in the leaves and the splitting criteria consider all n output values as a set, e.g., by averaging over them (see the "Multi-output problems" section of the Scikit-learn decision tree documentation). The following example code shows how a multi-output decision tree regressor can be built — you will notice that, on the surface, the steps are quite similar to those shown earlier for training the wrapped Linear SVR.

from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Construct a toy dataset
RANDOM_STATE = 100
xs, ys = make_regression(
    n_samples=2000, n_features=7, n_informative=5,
    n_targets=3, random_state=RANDOM_STATE, noise=0.2
)

# Train a multi-output model directly using a decision tree
model = DecisionTreeRegressor(random_state=RANDOM_STATE).fit(xs, ys)

Training Multi-Output ML Models as Denoisers for Tabular Data Imputation

Now that we have covered the basics of denoising, imputation and multi-output ML algorithms, we are ready to put all of these building blocks together. In general, training multi-output ML models to impute tabular data using denoising consists of the steps outlined below. Note that, unlike the code examples in the previous section, we will not explicitly differentiate between predictors and targets in the following — this is because, in the context of tabular data imputation, features can serve as predictors if they are present in the data, and as targets if they are missing.

Step 1: Create training and validation datasets

Split the data into a training and validation set, e.g., using an 80:20 split ratio. Let us call these sets df_training and df_validation, respectively.

Step 2: Create noisy/masked copies of the training and validation datasets

Make a copy of df_training and df_validation and add noise to the data in these copies, e.g., by randomly masking values. Let us call the masked copies df_training_masked and df_validation_masked, respectively. The choice of the masking function can have an impact on the predictive accuracy of the imputer that is trained in the end, so we will look at some masking strategies in the next section. Also, if the size of df_training is small, it may make sense to up-sample the rows by some factor k, such that if df_training has n rows and m columns, then the up-sampled df_training_masked dataset will have n*k rows (and m columns as before).

Step 3: Train a multi-output model as a denoising-based imputer

Pick a multi-output algorithm of your choice and train a model that predicts the original training data using the noisy/masked copy. Conceptually, you would do something like model.fit(predictors = df_training_masked, targets = df_training).

Step 4: Apply the imputer to the masked validation dataset

Pass df_validation_masked to the trained model to predict df_validation. Conceptually, this will look something like df_validation_imputed = model.predict(df_validation_masked). Note that some fitting functions may directly take the validation datasets as arguments to compute the validation error during the fitting process (e.g., for neural nets in TensorFlow) — if so, then remember to use the noisy/masked validation set (df_validation_masked) for the predictors and the original validation set (df_validation) for the targets when computing the validation error.
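
As a minimal sketch of this pairing (assuming TensorFlow/Keras and an arbitrary small dense network; the layer sizes are illustrative, not a recommendation):

import tensorflow as tf

n_features = df_training.shape[1]
net = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(n_features),
])
net.compile(optimizer="adam", loss="mse")

# Masked data serves as predictors and original data as targets,
# both for the training pair and for the validation pair
net.fit(
    df_training_masked, df_training,
    validation_data=(df_validation_masked, df_validation),
    epochs=10,
)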

Step 5: Evaluate the imputation accuracy for the validation dataset

Evaluate the imputation accuracy by comparing df_validation_imputed (what the model predicted) to df_validation (the ground truth). The evaluation can be done by column (to determine the imputation accuracy by feature) or by row (to check accuracy by prediction instance). To avoid getting inflated accuracy results per column, rows where the to-be-predicted column value is not masked in df_validation_masked can be filtered out before computing accuracy.

Finally, experiment with the above steps to optimize the model (e.g., use another masking strategy or pick a different multi-output ML algorithm).

The following code shows a toy example of how Steps 1–5 could be implemented.

import pandas as pd
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Construct a toy dataset
RANDOM_STATE = 100
data = make_classification(
    n_samples=2000, n_features=7, n_classes=1,
    random_state=RANDOM_STATE, class_sep=2, n_informative=3
)
# Map the continuous features to small non-negative integers
df = pd.DataFrame(data[0]).applymap(lambda x: int(abs(x)))

#####
# Step 1: Create training and validation datasets
#####

TRAIN_TEST_SPLIT_FRAC = 0.8
n = int(df.shape[0]*TRAIN_TEST_SPLIT_FRAC)

df_training, df_validation = df.iloc[:n, :], df.iloc[n:, :].reset_index(drop=True)

#####
# Step 2: Create noisy/masked copies of training and validation datasets
#####

# Example of random masking where each decision to mask a value is framed
# as a coin toss (Bernoulli event with p = 0.5)
np.random.seed(RANDOM_STATE)  # fix the seed for reproducible masking

def random_masking(value):
    return -1 if np.random.binomial(n=1, p=0.5) else value

df_training_masked = df_training.applymap(random_masking)
df_validation_masked = df_validation.applymap(random_masking)

#####
# Step 3: Train a multi-output model to be used as a denoising-based imputer
#####

# Notice that the masked data is used to model the original data
model = DecisionTreeClassifier(random_state=RANDOM_STATE).fit(X=df_training_masked, y=df_training)

#####
# Step 4: Apply imputer to masked validation dataset
#####

df_validation_imputed = pd.DataFrame(model.predict(df_validation_masked))

#####
# Step 5: Evaluate imputation accuracy on validation dataset
#####

# Check basic top-1 accuracy metric, accounting for inflated results
feature_accuracy_dict = {}
for i in range(df_validation_masked.shape[1]):
    # Get the row indexes where feature i was masked, i.e., needed to be imputed
    masked_indexes = df_validation_masked.index[df_validation_masked[i] == -1]
    # Compute the imputation accuracy only for those rows for feature i
    feature_accuracy_dict[i] = (
        df_validation_imputed.iloc[masked_indexes, i]
        == df_validation.iloc[masked_indexes, i]
    ).mean()
print(feature_accuracy_dict)

Data Masking Strategies

In general, several strategies could be employed to mask the training and validation data. At a high level, we can distinguish between three masking strategies: exhaustive, random, and domain-driven.

Exhaustive masking

This strategy involves generating all possible masking combinations for each row in the dataset. Suppose we have a dataset with n rows and m columns. Then exhaustive masking would expand each row into at most 2^m rows, one for each masking combination of the m values in the row; this maximum number of combinations per row is equivalent to the sum of the entries in row m of Pascal's triangle, although we may choose to omit some combinations that are not useful for a given use case (e.g., the combination where all values are masked). The final masked dataset would therefore have at most n*(2^m) rows and m columns. While the exhaustive strategy has the benefit of being comprehensive, it may not be practicable when m is large, since the resulting masked dataset could be too large for most computers to handle today. For instance, if the original dataset has just 1,000 rows and 50 columns, the exhaustively masked dataset would have roughly 10¹⁸ rows (one quintillion rows).
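
As an illustrative sketch (assuming, as in the toy example above, that -1 is the mask value), exhaustive masking could be implemented by expanding each row into all of its masked variants:

from itertools import product

import pandas as pd

def exhaustive_masking(row, mask_value=-1):
    # Expand one row (a pandas Series) into its 2^m - 1 masked variants,
    # omitting the all-masked combination, which is not useful
    variants = [
        [mask_value if flag else value for flag, value in zip(flags, row)]
        for flags in product([False, True], repeat=len(row))
        if not all(flags)
    ]
    return pd.DataFrame(variants, columns=row.index)

# Apply to every row of a (small!) dataset df_training
df_training_masked = pd.concat(
    [exhaustive_masking(row) for _, row in df_training.iterrows()],
    ignore_index=True,
)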

Random masking

As the name suggests, this strategy works by masking values using some random function. In a simple implementation, for example, the decision to mask each value in the dataset could be framed as an independent Bernoulli event with probability p of masking. The obvious benefit of the random masking strategy is that, unlike with exhaustive masking, the size of the masked data remains manageable. However, especially for small datasets, it may be necessary to up-sample the rows of the training dataset before applying random masking, so that more masking combinations are reflected in the resulting masked dataset and a sufficiently high imputation accuracy can be achieved; a vectorized sketch of this idea follows below.
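
A brief sketch of up-sampling combined with random masking (the up-sampling factor, masking probability, and mask value are illustrative settings):

import numpy as np
import pandas as pd

K_UPSAMPLE, P_MASK, MASK_VALUE = 5, 0.5, -1  # illustrative settings

# Repeat the training rows k times, then mask each value independently
# with probability p (one Bernoulli draw per cell)
df_upsampled = pd.concat([df_training] * K_UPSAMPLE, ignore_index=True)
bernoulli = np.random.binomial(n=1, p=P_MASK, size=df_upsampled.shape)
df_training_masked = df_upsampled.mask(bernoulli.astype(bool), MASK_VALUE)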

Domain-driven masking

This strategy aims to apply masking in a way that approximates the pattern of missing values in real life, i.e., within the domain or use case where the imputer will be utilized. To spot these patterns, it can be useful to analyze quantitative, observational data, as well as to incorporate insights from domain experts.
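
For example, if some fields are left empty far more often than others in production, per-column masking probabilities could be estimated from the observed rates of missingness. A brief sketch (the column labels and probabilities are made up for illustration):

import numpy as np
import pandas as pd

# Hypothetical per-column masking probabilities, e.g., estimated from
# the observed share of missing values per field in production data
column_mask_probs = {0: 0.1, 1: 0.6, 2: 0.3}

def domain_driven_masking(df, mask_value=-1):
    masked = df.copy()
    for col, p in column_mask_probs.items():
        # Mask each value in this column independently with its own probability
        to_mask = np.random.binomial(n=1, p=p, size=len(df)).astype(bool)
        masked.loc[to_mask, col] = mask_value
    return masked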

Practical Applications

Denoising-based imputers of the kind discussed in this article can offer a pragmatic “middle way” in practice, where other approaches might be too simplistic or too complex and resource-intensive. Beyond its use in data cleaning as a pre-processing step in larger ML workflows, denoising-based imputation of tabular data can potentially be used to drive core product functionality in certain practical use cases.

AI-assisted completion of online forms is one such example from industry. With the increasing digitization of various business processes, paper-based forms are being replaced by digital, online versions. Processes such as submitting a job application, creating a purchase requisition, booking corporate travel, and registering for events typically involve filling in an online form of some kind. Manually completing such a form can be tedious, time-consuming, and potentially error-prone, especially if the form has several fields that need to be filled. With the help of an AI assistant, however, the task of completing such an online form can be made a lot easier, faster, and more accurate by providing input recommendations to users based on available contextual information. For example, as a user starts filling in some fields on the form, the AI assistant could infer the most likely values for the remaining fields and suggest these to the user in real time. Such a use case can readily be framed as a denoising-based, multi-output imputation problem, where the noisy/masked data is given by the current state of the form (with some fields filled in and others empty/missing), and the goal is to predict the missing fields. The model can be tuned as needed to satisfy various use case requirements, including predictive accuracy and end-to-end response time (as perceived by the user).
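
Framed in code, a single recommendation request might look as follows (a brief sketch; the field values and the -1 mask marker are made up for illustration):

import pandas as pd

# Current form state as one masked row: filled fields carry the user's
# input, still-empty fields carry the mask value -1
form_state = pd.DataFrame([[2, 0, -1, 1, -1, 3, -1]])

# The trained imputer suggests values for the remaining fields
suggestions = model.predict(form_state)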

Relevance in the Age of Generative AI and Foundation Models

With recent advancements in generative AI and foundation models — and the growing awareness of their potential, even among non-technical audiences, ever since ChatGPT burst onto the scene in late 2022 — it is fair to ask what relevance denoising-based imputers will have in the future. For example, large language models (LLMs) could conceivably handle imputation tasks for tabular data. After all, predicting missing tokens in sentences is a typical learning objective used for training LLMs like Bidirectional Encoder Representations from Transformers (BERT).

Yet, it is unlikely that denoising-based imputers — or, for that matter, the other, simpler approaches to tabular data imputation that exist today — will become obsolete in the age of generative AI and foundation models any time soon. The reasons for this can be appreciated by considering the situation in the late 2010s, by which point neural nets had become technically feasible and economically viable options for several use cases that had previously relied on simpler algorithms such as logistic regression, decision trees, and random forests. While neural nets did replace these other algorithms for some high-end use cases where sufficiently large training data was available and the cost of training and maintaining neural nets was deemed justifiable, many other use cases remained unaffected. In fact, the increasing ease of access to cheaper storage and computational resources that spurred the adoption of neural nets also benefited the other, simpler algorithms. From this standpoint, considerations such as cost, complexity, the need for explainability, fast response times for real-time use cases, and the threat of lock-in to a potentially oligopolistic set of external providers of pre-trained models all seem to point towards a future in which pragmatic innovations such as denoising-based imputers for tabular data meaningfully co-exist with generative AI and foundation models rather than being replaced by them.
