# Productionising a model using Pandas Categorical variables.

## November 23, 2015

One of the major limitations of scikit-learn is its lack of native support for categorical variables in tree models. One way around this is to use `pd.get_dummies` to one-hot encode the categorical variables; however, this doesn’t work well when a column has many levels. Instead, it can be more effective to convert the column to a numeric id and treat it as a continuous variable: as long as your trees have enough depth, they will naturally re-segment the variable into its natural categories. (I have routinely achieved up to a 5% improvement in AUROC using this technique compared to a model of the same size/complexity using dummy variables.) The introduction of the `categorical` data type in Pandas version 0.15 makes this a breeze: the column is stored as a vector of NumPy integer codes, together with a list of levels which is used as a codebook.
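To make the codebook idea concrete, here is a minimal sketch with a made-up `weather` column (the values are purely illustrative). It shows the levels, the integer ids that would be fed to a tree model as a single numeric column, and the one-column-per-level layout that `pd.get_dummies` produces instead.

```python
import pandas as pd

# Hypothetical column with three distinct levels.
weather = pd.Series(['sunny', 'cloudy', 'rainy', 'cloudy'], dtype='category')

# The codebook: levels are sorted, and each is identified by its position.
levels = list(weather.cat.categories)   # ['cloudy', 'rainy', 'sunny']

# The integer ids, usable as a single numeric feature.
codes = weather.cat.codes.tolist()      # [2, 0, 1, 0]

# One-hot encoding needs one column per level instead.
dummies = pd.get_dummies(weather)

print(levels, codes, dummies.shape)
```

With three levels the difference is small, but the dummy matrix grows by one column for every additional level while the coded version stays a single column.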

I’ve been trying to work out a natural way to save these encodings for productionising a model. I’ve settled on pickling an empty dataframe with the categorical predictor columns. Then, when I append the data that is to be scored against the model, the categorical encoding is automatically applied. As a bonus, if a new value is encountered which is not present in the original dataframe, it will simply be set as `NaN`, which is exactly what I would hope for (this is a bit trickier to deal with when using dummy variables).
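The unseen-value behaviour can be checked in isolation. The sketch below assumes a codebook with two levels and scores a made-up value, `'snowy'`, that is not among them. Note that a missing entry has the integer code `-1`, so whatever consumes the codes needs to be able to handle that value.

```python
import pandas as pd

# Levels seen at training time (the saved codebook); illustrative values.
trained_levels = ['cloudy', 'sunny']

# New data containing 'snowy', which the codebook does not know about.
new_values = pd.Categorical(['sunny', 'snowy'], categories=trained_levels)

is_missing = pd.isna(new_values).tolist()   # [False, True]
new_codes = new_values.codes.tolist()       # [1, -1]

print(is_missing, new_codes)
```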

Here is a toy example to show what I mean:

```
import pickle

import pandas as pd

# toy training data; every column is stored as a pandas categorical
df = pd.DataFrame({'happy': [1, 1, 0],
                   'weather': ['sunny', 'cloudy', 'cloudy'],
                   'snack': ['fruit', 'cake', 'fruit']})\
    .apply(lambda x: x.astype('category'))

# toy "model": a lookup from the two predictor codes to the predicted value of happy
# (the original listed the key (0,0) twice; (1,1) is the combination that was missing)
model = {(0, 0): 1, (1, 0): 0, (1, 1): 1, (0, 1): 0}

# create the template as a slice with no rows and only the predictor columns
input_template = df[['weather', 'snack']].iloc[:0]

model_and_input_template = {'model': model, 'input_template': input_template}
with open('model.pkl', 'wb') as f:
    pickle.dump(model_and_input_template, f)


def score_data(input_data, model, input_template):
    # appending to the empty template applies the stored categorical encoding
    # (DataFrame.append was removed in pandas 2.0, so this needs an older release)
    categorical_data = input_template.append(input_data, ignore_index=True)
    encoded_data = categorical_data.apply(lambda x: x.cat.codes if x.dtype == 'category' else x)
    return model[tuple(encoded_data.iloc[0].values)]


# at scoring time, load the model and the template back from disk
with open('model.pkl', 'rb') as f:
    model_and_input_template = pickle.load(f)
model = model_and_input_template['model']
input_template = model_and_input_template['input_template']

input_data = {'weather': 'cloudy', 'snack': 'cake'}
print(score_data(input_data, model, input_template))
#1
#it's cloudy outside but the birthday cake must have cheered me up!
```
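`DataFrame.append`, which the example relies on, was removed in pandas 2.0. On newer releases, one possible way to get the same effect is to cast the incoming record to the dtypes stored in the template. The following is only a sketch under that assumption: `encode_input` is a hypothetical helper name, and the template is rebuilt inline here as a stand-in for the unpickled one.

```python
import pandas as pd

# Stand-in for the unpickled template: no rows, categorical predictor columns.
input_template = pd.DataFrame({
    'weather': pd.Categorical([], categories=['cloudy', 'sunny']),
    'snack': pd.Categorical([], categories=['cake', 'fruit']),
})


def encode_input(input_data, input_template):
    """Encode one record (a dict) using the dtypes stored in the template."""
    row = pd.DataFrame([input_data], columns=input_template.columns)
    # Casting to the template's dtypes reuses its categories;
    # values outside the categories become NaN (code -1).
    row = row.astype(input_template.dtypes.to_dict())
    return tuple(int(row[col].cat.codes.iloc[0]) for col in row.columns)


encoded = encode_input({'weather': 'cloudy', 'snack': 'cake'}, input_template)
unseen = encode_input({'weather': 'snowy', 'snack': 'fruit'}, input_template)
print(encoded, unseen)
```

The first tuple can be looked up in the toy model directly. The second contains `-1` for the unseen weather value, which the lookup table does not cover, so a real model would need an explicit rule for missing codes.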