ColumnTransformer: Why and How to Use It to Preprocess Data
Introduction
The columns in your dataset may need special handling. That’s especially true when you transform them to train a machine learning model.
For example, numerical columns need scaling, categorical columns should be encoded, and columns with text data should be vectorized.
Some columns are fine just the way they are. You don’t want to change them in any way.
How can you apply these different transformations to different types of columns?
We’ll look at three approaches:
- Apply Transformations Sequentially
- Split, Transform and Combine
- Use ColumnTransformer
By the end of this post, you’ll see why you should always use the third approach.
A Motivating Example
We’ll use a simplified version of the `titanic` dataset. Here are a few sample rows:
CLASS | GENDER | AGE | FARE | RELATIVES | SURVIVED |
---|---|---|---|---|---|
3 | male | 26.51 | 7.75 | 0 | 0 |
3 | male | 40.50 | 7.75 | 0 | 0 |
1 | female | 18.00 | 227.52 | 1 | 1 |
3 | female | 5.00 | 12.48 | 0 | 1 |
2 | male | 32.00 | 10.50 | 0 | 0 |
1 | female | 36.00 | 120.00 | 3 | 1 |
1 | female | 22.00 | 66.60 | 1 | 1 |
3 | male | 25.00 | 7.22 | 0 | 0 |
1 | female | 24.00 | 263.00 | 5 | 1 |
3 | female | 4.00 | 16.70 | 2 | 1 |
Let’s have a closer look at the columns:
Categorical Columns: `CLASS` & `GENDER`
There are only three possible values for a passenger’s `CLASS`: 1, 2, and 3. Similarly, the column `GENDER` has only two unique values: male and female.
Both columns contain categorical data and, therefore, should be encoded.
Numerical Columns: AGE
, FARE
& RELATIVES
Some machine learning algorithms perform better if numerical inputs have roughly similar mean, standard deviation, and range.
These statistics differ significantly for our numerical columns:
Column | Mean | Standard Deviation | Range |
---|---|---|---|
AGE | 29.32 | 13.28 | 79.58 |
FARE | 32.20 | 49.69 | 512.33 |
RELATIVES | 0.90 | 1.61 | 10.00 |
We’ll need to scale these columns. Scaling will set their mean to 0 and standard deviation to 1.
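If you’d like to reproduce these statistics yourself, here’s a small sketch using pandas (it assumes the same titanic.csv file that we load later in this post):
import pandas as pd

# Load the dataset (the same file used throughout this post)
titanic = pd.read_csv('titanic.csv')

# Mean, standard deviation, and range (max - min) of the numerical columns
numerical = titanic[['AGE', 'FARE', 'RELATIVES']]
stats = numerical.agg(['mean', 'std'])
stats.loc['range'] = numerical.max() - numerical.min()
print(stats.round(2))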
‘Skip’ Column: SURVIVED
The column `SURVIVED` has only two possible values: 0 and 1. Therefore, it’s already in encoded form. We should leave it as is.
In summary, here are the columns and how we want to transform them:
Column(s) | Transformation |
---|---|
CLASS, GENDER | Encode |
AGE, FARE, RELATIVES | Scale |
SURVIVED | Skip |
How do we apply these transformations?
Let’s begin with the first of our three approaches.
1. Apply Transformations Sequentially
In this approach, the dataset undergoes a series of transformations, applied one after another.
The output of one transformation becomes the input for the next transformation.
For the `titanic` dataset, we’ll first encode it. Then we’ll take the encoded dataset and scale it.
Let’s see this in action. First, we load the dataset:
import pandas as pd
titanic = pd.read_csv('titanic.csv')
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CLASS 891 non-null int64
1 GENDER 891 non-null object
2 AGE 891 non-null float64
3 FARE 891 non-null float64
4 RELATIVES 891 non-null int64
5 SURVIVED 891 non-null int64
dtypes: float64(2), int64(3), object(1)
memory usage: 41.9+ KB
titanic.head()
| | CLASS | GENDER | AGE | FARE | RELATIVES | SURVIVED |
---|---|---|---|---|---|---|
0 | 3 | male | 22.0 | 7.25 | 1 | 0 |
1 | 1 | female | 38.0 | 71.28 | 1 | 1 |
2 | 3 | female | 26.0 | 7.92 | 0 | 1 |
3 | 1 | female | 35.0 | 53.10 | 1 | 1 |
4 | 3 | male | 35.0 | 8.05 | 0 | 0 |
1.1 Apply Encoding
We encode columns `CLASS` and `GENDER` using pandas’ method `get_dummies()`.
It’s standard practice to drop the first level of each encoded categorical column. We’ll do that using the `drop_first` argument.
titanic = pd.get_dummies(titanic, columns=['CLASS', 'GENDER'],
drop_first=True)
titanic.head()
| | AGE | FARE | RELATIVES | SURVIVED | CLASS_2 | CLASS_3 | GENDER_male |
---|---|---|---|---|---|---|---|
0 | 22.0 | 7.25 | 1 | 0 | 0 | 1 | 1 |
1 | 38.0 | 71.28 | 1 | 1 | 0 | 0 | 0 |
2 | 26.0 | 7.92 | 0 | 1 | 0 | 1 | 0 |
3 | 35.0 | 53.10 | 1 | 1 | 0 | 0 | 0 |
4 | 35.0 | 8.05 | 0 | 0 | 0 | 1 | 1 |
The new encoded columns (CLASS_2, CLASS_3, and GENDER_male) look good.
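Why drop the first level? Without it, the dummy columns are redundant: each can be derived from the others. Here’s a small illustrative check (a sketch that reloads the raw data under a hypothetical name `raw` so it doesn’t disturb the encoded DataFrame above):
# Encode GENDER without dropping any level
raw = pd.read_csv('titanic.csv')
both = pd.get_dummies(raw[['GENDER']], dtype=int)

# GENDER_female and GENDER_male always sum to 1, so either column can be
# reconstructed from the other; that's why we drop the first level
print((both['GENDER_female'] + both['GENDER_male']).unique())   # -> [1]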
1.2 Apply Scaling
Next, we scale using sklearn’s `StandardScaler`:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# fit_transform() returns a NumPy array, so wrap it back into a DataFrame
scaled_titanic = scaler.fit_transform(titanic)
titanic = pd.DataFrame(scaled_titanic, columns=titanic.columns)
titanic.head()
| | AGE | FARE | RELATIVES | SURVIVED | CLASS_2 | CLASS_3 | GENDER_male |
---|---|---|---|---|---|---|---|
0 | -0.55 | -0.50 | 0.06 | -0.79 | -0.51 | 0.90 | 0.74 |
1 | 0.65 | 0.79 | 0.06 | 1.27 | -0.51 | -1.11 | -1.36 |
2 | -0.25 | -0.49 | -0.56 | 1.27 | -0.51 | 0.90 | -1.36 |
3 | 0.43 | 0.42 | 0.06 | 1.27 | -0.51 | -1.11 | -1.36 |
4 | 0.43 | -0.49 | -0.56 | -0.79 | -0.51 | 0.90 | 0.74 |
As expected, the numerical columns `AGE`, `FARE`, and `RELATIVES` are scaled.
But we have a problem. All the other columns were scaled as well.
That’s not what we intended. We encoded the categorical columns in the previous step, but they were scaled in this step and no longer contain only 0s and 1s.
The column `SURVIVED` suffered a similar fate.
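A quick check makes the problem concrete (an illustrative sketch, continuing from the scaled DataFrame above):
# SURVIVED should contain only 0 and 1, but scaling replaced those values
print(titanic['SURVIVED'].unique())      # roughly [-0.79  1.27]

# the encoded GENDER column suffered the same fate
print(titanic['GENDER_male'].unique())   # roughly [ 0.74 -1.36]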
1.3 The Fatal Flaw
The sequential approach applies every transformation, one after another, to every column.
For example, we wanted to apply only one transformation, encoding, to the column `GENDER`. However, we encoded and then scaled it.
A better technique will ensure that we don’t apply unwanted transformations to any column.
That’s what our second approach will do.
2. Split, Transform and Combine
This approach works in 3 steps:
1. Split the input dataset into subsets based on the column types. For example, we’ll split the `titanic` dataset into three subsets. The first will have numerical columns. The second will contain categorical columns. The columns we don’t want to transform will go in the third ‘Skip’ subset.
2. Transform each subset based on its column type. Each subset is independent and only gets the transformations it needs. That means we’ll scale the numerical subset and encode the categorical subset.
3. Combine all the subsets to create our final, transformed dataset.
Let’s see these steps in action.
titanic = pd.read_csv('titanic.csv')
titanic.head()
| | CLASS | GENDER | AGE | FARE | RELATIVES | SURVIVED |
---|---|---|---|---|---|---|
0 | 3 | male | 22.00 | 7.25 | 1 | 0 |
1 | 1 | female | 38.00 | 71.28 | 1 | 1 |
2 | 3 | female | 26.00 | 7.92 | 0 | 1 |
3 | 1 | female | 35.00 | 53.10 | 1 | 1 |
4 | 3 | male | 35.00 | 8.05 | 0 | 0 |
2.1 Split Dataset
We split the `titanic` dataset into three subsets:
- Categorical columns: `CLASS` and `GENDER`
- Numerical columns: `AGE`, `FARE`, and `RELATIVES`
- ‘Skip’ columns: `SURVIVED`
categorical_columns = titanic.loc[:, ['CLASS', 'GENDER']]
numerical_columns = titanic.loc[:, ['AGE', 'FARE', 'RELATIVES']]
skip_columns = titanic.loc[:, ['SURVIVED']]
2.2 Apply Transformations
Let’s transform each subset according to its column type.
Encode Categorical Columns
We encode using `OneHotEncoder`:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(dtype='int', drop='first')
encoded_raw = encoder.fit_transform(categorical_columns)
# the encoder returns a sparse matrix, so convert it with toarray()
categorical_columns = pd.DataFrame(
    encoded_raw.toarray(),
    columns=encoder.get_feature_names_out()
)
categorical_columns.head()
| | CLASS_2 | CLASS_3 | GENDER_male |
---|---|---|---|
0 | 0 | 1 | 1 |
1 | 0 | 0 | 0 |
2 | 0 | 1 | 0 |
3 | 0 | 0 | 0 |
4 | 0 | 1 | 1 |
Scale Numerical Columns
Next, we scale using `StandardScaler`:
scaler = StandardScaler()
scaled_raw = scaler.fit_transform(numerical_columns)
numerical_columns = pd.DataFrame(
    scaled_raw,
    columns=numerical_columns.columns
)
numerical_columns.head()
| | AGE | FARE | RELATIVES |
---|---|---|---|
0 | -0.55 | -0.50 | 0.06 |
1 | 0.65 | 0.79 | 0.06 |
2 | -0.25 | -0.49 | -0.56 |
3 | 0.43 | 0.42 | 0.06 |
4 | 0.43 | -0.49 | -0.56 |
2.3 Combine Subsets
Lastly, we use pandas’ method `concat()` to combine the subsets. The argument `axis='columns'` ensures that the subsets are stitched together horizontally, across columns.
titanic = pd.concat(
    [categorical_columns, numerical_columns, skip_columns],
    axis='columns'
)
titanic.head()
| | CLASS_2 | CLASS_3 | GENDER_male | AGE | FARE | RELATIVES | SURVIVED |
---|---|---|---|---|---|---|---|
0 | 0 | 1 | 1 | -0.55 | -0.50 | 0.06 | 0 |
1 | 0 | 0 | 0 | 0.65 | 0.79 | 0.06 | 1 |
2 | 0 | 1 | 0 | -0.25 | -0.49 | -0.56 | 1 |
3 | 0 | 0 | 0 | 0.43 | 0.42 | 0.06 | 1 |
4 | 0 | 1 | 1 | 0.43 | -0.49 | -0.56 | 0 |
The final dataset looks good. We didn’t change the column `SURVIVED`. Other columns were either encoded or scaled, but not both.
Unlike the sequential approach, we didn’t apply any unintended transformation to any column.
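To double-check, here’s a small sanity check (an illustrative sketch, continuing from the combined DataFrame above):
# SURVIVED still contains only its original values ...
print(titanic['SURVIVED'].unique())      # expected: only 0 and 1

# ... and the encoded columns were never scaled
print(titanic['GENDER_male'].unique())   # expected: only 0 and 1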
Now, let’s look at `ColumnTransformer`, which does everything we want with virtually no effort.
3. Use ColumnTransformer
We got the desired results using the last approach. But it required a lot of work. We had to split the dataset, transform each subset, and combine all the subsets.
Wouldn’t it be nice if you could simply list the transformations required for each column and have the system do all the heavy lifting?
Sklearn’s `ColumnTransformer` is the answer to your prayers.
Let’s see it in action. We start by loading the `titanic` dataset:
titanic = pd.read_csv('titanic.csv')
titanic.head()
| | CLASS | GENDER | AGE | FARE | RELATIVES | SURVIVED |
---|---|---|---|---|---|---|
0 | 3 | male | 22.00 | 7.25 | 1 | 0 |
1 | 1 | female | 38.00 | 71.28 | 1 | 1 |
2 | 3 | female | 26.00 | 7.92 | 0 | 1 |
3 | 1 | female | 35.00 | 53.10 | 1 | 1 |
4 | 3 | male | 35.00 | 8.05 | 0 | 0 |
To use `ColumnTransformer`, we need to pass the columns and their transformations as a list of tuples. Each tuple should have:
- Name: A unique name.
- Transformer: An instance of a transformer like `OneHotEncoder` or `StandardScaler`. Set it to `'passthrough'` if you want to skip the columns.
- Columns: The list of columns to transform.
Here’s a quick review of the columns and how we want to transform them:
Column(s) | Transformation |
---|---|
CLASS, GENDER | Encode |
AGE, FARE, RELATIVES | Scale |
SURVIVED | Skip |
And here’s the corresponding list of tuples:
transformers_list = [
    ('encode', OneHotEncoder(dtype='int', drop='first'), ['CLASS', 'GENDER']),
    ('scale', StandardScaler(), ['AGE', 'FARE', 'RELATIVES']),
    ('skip', 'passthrough', ['SURVIVED'])
]
Next, let’s create an instance of `ColumnTransformer` using the list of tuples:
from sklearn.compose import ColumnTransformer
column_transformer = ColumnTransformer(transformers_list)
Then we call the `fit_transform()` method. This one line of code performs all the steps from the previous approach.
transformed_raw = column_transformer.fit_transform(titanic)
Let’s see the result as a `DataFrame`:
titanic = pd.DataFrame(
    transformed_raw,
    columns=column_transformer.get_feature_names_out()
)
titanic.head()
| | encode__CLASS_2 | encode__CLASS_3 | encode__GENDER_male | scale__AGE | scale__FARE | scale__RELATIVES | skip__SURVIVED |
---|---|---|---|---|---|---|---|
0 | 0.00 | 1.00 | 1.00 | -0.55 | -0.50 | 0.06 | 0.00 |
1 | 0.00 | 0.00 | 0.00 | 0.65 | 0.79 | 0.06 | 1.00 |
2 | 0.00 | 1.00 | 0.00 | -0.25 | -0.49 | -0.56 | 1.00 |
3 | 0.00 | 0.00 | 0.00 | 0.43 | 0.42 | 0.06 | 1.00 |
4 | 0.00 | 1.00 | 1.00 | 0.43 | -0.49 | -0.56 | 0.00 |
The final dataset looks good. We transformed (or skipped) every column as expected.
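One cosmetic note: the `encode__`/`scale__`/`skip__` prefixes come from `ColumnTransformer`’s default feature naming, and `fit_transform()` here returns a single NumPy array, which is why every column shows up as a float. If you’re on a recent scikit-learn, you can get tidier names and a `DataFrame` directly. This is a sketch assuming scikit-learn 1.2 or newer (for `set_output()` and the `sparse_output` parameter):
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

transformers_list = [
    # sparse_output=False keeps the encoded output dense,
    # which pandas output requires
    ('encode', OneHotEncoder(dtype='int', drop='first', sparse_output=False),
     ['CLASS', 'GENDER']),
    ('scale', StandardScaler(), ['AGE', 'FARE', 'RELATIVES']),
    ('skip', 'passthrough', ['SURVIVED'])
]

column_transformer = ColumnTransformer(
    transformers_list,
    verbose_feature_names_out=False  # drop the 'encode__' / 'scale__' prefixes
)
column_transformer.set_output(transform='pandas')  # return a DataFrame directly

titanic = pd.read_csv('titanic.csv')
titanic = column_transformer.fit_transform(titanic)
titanic.head()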
Conclusion
We explored three approaches to preparing dataset columns.
The first approach gave us incorrect results. We cannot use it to apply different transformations to different columns.
The second approach isolated the columns and transformed them independently. However, it required us to do the grunt work. We had to write code for the tricky ‘split, transform, combine’ steps.
The final approach, `ColumnTransformer`, performed these steps behind the scenes. It required minimal work and delivered the results we wanted.
There’s one more reason why you should always use `ColumnTransformer`. The other approaches can lead to data leakage, sabotaging your machine learning model.
`ColumnTransformer`, when used with `Pipeline`, can help you fix data leakage. I’ll cover this critical topic in another post.
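In the meantime, here’s a minimal preview sketch of that combination (assuming the same titanic.csv file): the `ColumnTransformer` sits inside a `Pipeline`, so the encoder and scaler learn their parameters from the training split only. Note that `SURVIVED` is treated as the prediction target here rather than a passthrough column, and `LogisticRegression` is just a placeholder model.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

titanic = pd.read_csv('titanic.csv')
X = titanic.drop(columns=['SURVIVED'])
y = titanic['SURVIVED']

preprocess = ColumnTransformer([
    ('encode', OneHotEncoder(drop='first'), ['CLASS', 'GENDER']),
    ('scale', StandardScaler(), ['AGE', 'FARE', 'RELATIVES']),
])

model = Pipeline([
    ('preprocess', preprocess),
    ('classify', LogisticRegression()),
])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# fit() learns the encoding/scaling parameters from the training split only,
# so nothing from the test split leaks into preprocessing
model.fit(X_train, y_train)
print(model.score(X_test, y_test))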