Keras - Regression with categorical variable embeddings
The purpose of this blog post:¶
- To show how to implement (technically) a feature vector with both continuous and categorical features.
- To use a Regression head to predict continuous values
We would like to predict the housing prices in 3 different suburbs of Tel Aviv (Mercaz, Old North and Florentine). Let's assume that a deterministic function $f(a, s, n)$ exists that determines the price of a house, where:
- $n$: Number of rooms
- $s$: Size (in square meters)
- $a$: Area (TLV suburb)
For each area $a$ we define a price per square meter $s\_price(a)$ :
- Mercaz: \$500
- Old North: \$350
- Florentine: \$230
And an additional price per room $n\_price(a)$:
- Mercaz: \$150,000
- Old North: \$80,000
- Florentine: \$50,000
The price of a house in area $a$ with size $s$ and $n$ rooms will be: $$f(a, s, n) = s * s\_price(a) + n * n\_price(a)$$
This function is what we will later try to predict with a Regression DNN.
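For example, a 30 square meter, 3-room house in Mercaz costs $30 \cdot 500 + 3 \cdot 150{,}000 = 465{,}000$ dollars.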
Imports and helper functions¶
import numpy as np
import pandas as pd
import keras
from keras.models import Model
from keras.layers import Input, Dense, Embedding
from keras.callbacks import Callback
import matplotlib.pyplot as plt
# Bayesian Methods for Hackers style sheet
plt.style.use('bmh')
np.random.seed(1234567890)
class PeriodicLogger(Callback):
    """
    A helper callback class that only prints the losses once every 'display' epochs
    """
    def __init__(self, display=100):
        self.display = display

    def on_train_begin(self, logs={}):
        self.epochs = 0

    def on_epoch_end(self, epoch, logs={}):
        self.epochs += 1
        if self.epochs % self.display == 0:
            print("Epoch: %d - loss: %f - val_loss: %f" % (self.epochs, logs['loss'], logs['val_loss']))
periodic_logger_250 = PeriodicLogger(250)
Define the mapping and a function that computes the house price for each example¶
per_meter_mapping = {
'Mercaz': 500,
'Old North': 350,
'Florentine': 230
}
per_room_additional_price = {
'Mercaz': 15. * 10**4,
'Old North': 8. * 10**4,
'Florentine': 5. * 10**4
}
def house_price_func(row):
    """
    house_price_func is the function f(a, s, n).
    :param row: dict (contains the keys: ['area', 'size', 'n_rooms'])
    :return: float
    """
    area, size, n_rooms = row['area'], row['size'], row['n_rooms']
    return size * per_meter_mapping[area] + n_rooms * per_room_additional_price[area]
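For example, calling it on a 20 square meter, 2-room apartment in Florentine should return 20 * 230 + 2 * 50,000 = \$104,600:

house_price_func({'area': 'Florentine', 'size': 20, 'n_rooms': 2})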
Create toy data¶
AREAS = ['Mercaz', 'Old North', 'Florentine']
def create_samples(n_samples):
    """
    Helper method that creates the dataset DataFrames.
    Note that the random draws only determine the number of rooms and the size of the house
    (the price, which we calculate later, is deterministic).
    :param n_samples: int (number of samples for each area (suburb))
    :return: pd.DataFrame
    """
    samples = []
    for n_rooms in np.random.choice(range(1, 6), n_samples):
        samples += [(area, int(np.random.normal(25, 5)), n_rooms) for area in AREAS]
    return pd.DataFrame(samples, columns=['area', 'size', 'n_rooms'])
Create the train and validation sets¶
train = create_samples(n_samples=1000)
val = create_samples(n_samples=100)
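Note that create_samples builds one row per area for every room-count draw, so the train set has 3,000 rows (1,000 draws × 3 areas) and the validation set has 300:

print(len(train), len(val))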
Calculate the prices for each set¶
train['price'] = train.apply(house_price_func, axis=1)
val['price'] = val.apply(house_price_func, axis=1)
Here is the structure of the train/val DataFrame¶
train.head()
Define the features and the y vectors¶
We will separate the continuous and categorical variables
continuous_cols = ['size', 'n_rooms']
categorical_cols = ['area']
y_col = ['price']
X_train_continuous = train[continuous_cols]
X_train_categorical = train[categorical_cols]
y_train = train[y_col]
X_val_continuous = val[continuous_cols]
X_val_categorical = val[categorical_cols]
y_val = val[y_col]
Normalization¶
# Normalizing both the train and validation sets to have 0 mean and std. of 1, using the train set's mean and std.
# This will give each feature an equal initial importance and speed up the training time
train_mean = X_train_continuous.mean(axis=0)
train_std = X_train_continuous.std(axis=0)
X_train_continuous = X_train_continuous - train_mean
X_train_continuous /= train_std
X_val_continuous = X_val_continuous - train_mean
X_val_continuous /= train_std
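A quick sanity check: after normalization the train set should have mean 0 and std 1 (the validation set only approximately so, since it is scaled with the train set statistics).

X_train_continuous.describe()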
Build a model using a categorical variable¶
First let's define a helper class for the categorical variable¶
class EmbeddingMapping():
    """
    Helper class for handling categorical variables.
    An instance of this class should be defined for each categorical variable we want to use.
    """
    def __init__(self, series):
        # Get a list of unique values
        values = series.unique().tolist()

        # Set a dictionary mapping from values to integer values (starting at 1)
        # In our example this will be {'Mercaz': 1, 'Old North': 2, 'Florentine': 3}
        self.embedding_dict = {value: int_value + 1 for int_value, value in enumerate(values)}

        # num_values will be used as the input_dim when defining the embedding layer;
        # it is len(values) + 1 because index 0 is reserved for unseen values
        self.num_values = len(values) + 1

    def get_mapping(self, value):
        # If the value was seen in the training set, return its integer mapping
        if value in self.embedding_dict:
            return self.embedding_dict[value]
        # Else, return 0 - the index reserved for unseen values
        else:
            return 0
Create an embedding column for the train/validation sets¶
area_mapping = EmbeddingMapping(X_train_categorical['area'])
X_train_categorical = X_train_categorical.assign(area_mapping=X_train_categorical['area'].apply(area_mapping.get_mapping))
X_val_categorical = X_val_categorical.assign(area_mapping=X_val_categorical['area'].apply(area_mapping.get_mapping))
A corresponding 'area_mapping' column was added to the train/validation sets¶
X_train_categorical.head()
Define the input layers¶
# Define the embedding input
area_input = Input(shape=(1,), dtype='int32')
# Decide to what vector size we want to map our 'area' variable.
# I'll use 1 here because we only have three areas
embeddings_output = 1
# Let's define the embedding layer and flatten (reshape) its output
area_embeddings = Embedding(output_dim=embeddings_output, input_dim=area_mapping.num_values, input_length=1)(area_input)
area_embeddings = keras.layers.Reshape((embeddings_output,))(area_embeddings)
# Define the continuous variables input (just like before)
continuous_input = Input(shape=(X_train_continuous.shape[1], ))
# Concatenate the continuous input with the embedding output
all_input = keras.layers.concatenate([continuous_input, area_embeddings])
To merge the two inputs together we use the Keras Functional API¶
We will define a simple model with 2 hidden layers, each with 25 neurons.
# Define the model
units=25
dense1 = Dense(units=units, activation='relu')(all_input)
dense2 = Dense(units, activation='relu')(dense1)
predictions = Dense(1)(dense2)
# Note: we pass the input object 'area_input', not 'area_embeddings'
model = Model(inputs=[continuous_input, area_input], outputs=predictions)
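At this point it can help to print the model summary and confirm that the two input branches are wired into the concatenation as intended:

model.summary()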
Let's train the model¶
epochs = 10000
model.compile(loss='mse', optimizer=keras.optimizers.Adam(lr=.8, beta_1=0.9, beta_2=0.999, decay=1e-03, amsgrad=True))
# Note: the continuous and categorical inputs are passed in the same order as defined in the model's inputs
history = model.fit([X_train_continuous, X_train_categorical['area_mapping']], y_train,
epochs=epochs, batch_size=128,
callbacks=[periodic_logger_250], verbose=0,
validation_data=([X_val_continuous, X_val_categorical['area_mapping']], y_val))
# Plot the train/validation loss values
plt.figure(figsize=(20,10))
_loss = history.history['loss'][250:]
_val_loss = history.history['val_loss'][250:]
train_loss_plot, = plt.plot(range(1, len(_loss)+1), _loss, label='Train Loss')
val_loss_plot, = plt.plot(range(1, len(_val_loss)+1), _val_loss, label='Validation Loss')
_ = plt.legend(handles=[train_loss_plot, val_loss_plot])
After ~500 epochs the validation loss is very low. This would have been a very good result if this were a "real world problem", particularly because we are using a quadratic loss function (Mean Squared Error) while trying to predict a large value:
print ("This is the average value we are trying to predict: %d" % y_val.mean().iloc[0])
How good are the model's predictions?¶
df = y_val.copy()
# Add a column for the model's predicted values
df['pred'] = model.predict([X_val_continuous, X_val_categorical['area_mapping']])
# Calculate the difference between the predicted and the actual price
df['diff'] = df['pred'] - df['price']
# Calculate the absolute difference between the predicted and the actual price
df['abs_diff'] = np.abs(df['diff'])
# Calculate the percentage of the difference from the actual price
df['%diff'] = 100 * (df['diff'] / df['price'])
# Calculate the absolute percentage difference from the actual price
df['abs_%diff'] = np.abs(df['%diff'])
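Before looking at the worst cases, here is a quick summary of the average errors:

# Mean absolute error (in dollars) and mean absolute percentage error
print("MAE:  $%.2f" % df['abs_diff'].mean())
print("MAPE: %.4f%%" % df['abs_%diff'].mean())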
What is the biggest difference in absolute values?¶
# Sort by the 'abs_diff' field and show the 5 largest mistakes in absolute values
df.sort_values("abs_diff", ascending=False).head(5)
The biggest absolute difference is ~\$537, which is less than 0.25% of the ~\$240K house price.
# Calculate the mean and std. of the diff field
diff_mean, diff_std = df['diff'].mean(), df['diff'].std()
print("The mean is very close to 0 ({mean}) with std. {std}.".format(mean=round(diff_mean, 2), std=round(diff_std, 2)))
# Here is the histogram of the differences
plt.figure(figsize=(20,10))
plt.hist(df['diff'], bins=100)
plt.xlabel("$")
plt.ylabel("# samples")
_ = plt.title("Difference between predicted and actual price")
What is the biggest difference in percentage?¶
# Sort by the '%diff' field and show the 5 largest proportional mistakes
df.sort_values("abs_%diff", ascending=False).head(5)
The biggest absolute difference and the biggest percentage difference come from the same validation sample.
# Also, plot the histogram
plt.figure(figsize=(20,10))
plt.hist(df['%diff'], bins=100)
plt.xlabel("%")
plt.ylabel("# samples")
_ = plt.title("% of difference between predicted and actual price")
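As a side note, we can also peek at what the embedding learned. The Embedding layer was not given a name above, so the small sketch below looks it up by type and prints the scalar learned for each area:

# Find the Embedding layer and pull out its learned weight matrix.
# Row i holds the 1-dimensional vector learned for integer index i (see EmbeddingMapping above).
embedding_layer = [layer for layer in model.layers if isinstance(layer, Embedding)][0]
area_vectors = embedding_layer.get_weights()[0]
for area, idx in area_mapping.embedding_dict.items():
    print("%s -> %.4f" % (area, area_vectors[idx][0]))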
Let's try and define a model with only continuous variables¶
It is clear from the design of the problem that the housing prices cannot be correctly predicted without the categorical variable (area). Therefore, this model will surely fail to converge to a low loss, and we expect the loss to stay high for both the train and validation sets.
We will use the same simple architecture used with the embeddings model.
continuous_input = Input(shape=(X_train_continuous.shape[1], ))
# Define the model
units=25
dense1 = Dense(units=units, activation='relu')(continuous_input)
dense2 = Dense(units, activation='relu')(dense1)
predictions = Dense(1)(dense2)
model_cont = Model(inputs=[continuous_input], outputs=predictions)
Train the model¶
epochs = 10000
model_cont.compile(loss='mse', optimizer=keras.optimizers.Adam(lr=.8, beta_1=0.9, beta_2=0.999, decay=1e-03, amsgrad=True))
history_cont = model_cont.fit([X_train_continuous], y_train,
epochs=epochs, batch_size=128,
callbacks=[periodic_logger_250], verbose=0,
validation_data=([X_val_continuous], y_val))
# Plot the train/validation loss values
plt.figure(figsize=(20,10))
_loss = history_cont.history['loss'][250:]
_val_loss = history_cont.history['val_loss'][250:]
train_loss_plot, = plt.plot(range(1, len(_loss)+1), _loss, label='Train Loss')
val_loss_plot, = plt.plot(range(1, len(_val_loss)+1), _val_loss, label='Validation Loss')
_ = plt.legend(handles=[train_loss_plot, val_loss_plot])
As expected, the network failed to converge.
Note: To actually prove the network failed to converge, a much more rigorous test should be performed (grid of different architectures, learning rates and optimizers).
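A quick (non-rigorous) sanity check is to compare the final validation loss of the two runs:

# Final validation MSE for each model (lower is better)
print("With embeddings: %.2f" % history.history['val_loss'][-1])
print("Continuous only: %.2f" % history_cont.history['val_loss'][-1])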