Data Engineering for Dummies

Using XCS and ANN | still in progress

Artificial Neural Networks (ANN) are used for several purposes, of which classification has been one of the most famous in the past years. As a supervised learning method, an ANN must be trained before performing any real operation. The aim of such learning is to adjust the values of its weights, meaning that the training data largely determine its success. On the other hand, XCS uses reinforcement cycles to collect its knowledge, which plays a similar role to an ANN's weights. However, there are at least two main differences here:

learning paradigm: XCS's environment vs ANN's data
XCS owns a human-readable set of knowledge

This is not an attempt to say which one is better, because every algorithm has its own purpose. Many scientists believe that a combination of ANN (more precisely, Deep Learning) and RL is the next algorithmic advancement, which, for now, is still under investigation. You get the chance to learn them both here. 😎

What to Compare?

Churn Modelling is a dataset available on Kaggle, consisting of 10,000 rows in a CSV file. Both algorithms try to maximize their correctness rate when predicting the test set. Get XCS-RC with pip install xcs-rc or download it here, while the ANN code can be obtained here.

The full source is also available at the end, under Full Code.

1. Start with obtaining the data

# read data file
import pandas as pd
dataset = pd.read_csv('Churn_Modelling.csv')

# split columns, X as input, y as (expected) output
X = dataset.iloc[:, 3:13].copy()  # copy, so the encoding step can assign columns safely
y = dataset.iloc[:, 13]
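
To make the positional slices above concrete, here is a small sketch on a one-row frame. The column names are an assumption based on the header of the Kaggle file, and the row values are purely illustrative; only the positions matter.

```python
import pandas as pd

# Assumed header of Churn_Modelling.csv (14 columns); verify against your copy.
columns = [
    "RowNumber", "CustomerId", "Surname", "CreditScore", "Geography",
    "Gender", "Age", "Tenure", "Balance", "NumOfProducts",
    "HasCrCard", "IsActiveMember", "EstimatedSalary", "Exited",
]

# One made-up row, just enough to demonstrate the slicing.
row = [1, 0, "Example", 619, "France", "Female", 42, 2, 0.0, 1, 1, 1, 101348.88, 1]
demo = pd.DataFrame([row], columns=columns)

X_demo = demo.iloc[:, 3:13]  # positions 3..12: the ten input attributes
y_demo = demo.iloc[:, 13]    # position 13: the expected output

print(list(X_demo.columns))
print(y_demo.name)
```

The first three columns are identifiers rather than customer attributes, which is why the slice starts at position 3.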

2. Check the contents of X and y

This is performed on the console, not included in the code.

In[1]: X
Out[1]:
array([[619, 'France', 'Female', ..., 1, 1, 101348.88],
       [608, 'Spain', 'Female', ..., 0, 1, 112542.58],
       [502, 'France', 'Female', ..., 1, 0, 113931.57],
       ...,
       [709, 'France', 'Female', ..., 0, 1, 42085.58],
       [772, 'Germany', 'Male', ..., 1, 0, 92888.52],
       [792, 'France', 'Female', ..., 1, 0, 38190.78]], dtype=object)

In[2]: y
Out[2]: array([1, 0, 1, ..., 1, 1, 0], dtype=int64)

3. Encode and split

Use the scikit-learn library to encode the columns "Geography" and "Gender" as integers, then split the data into a training set and a testing set.

from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
X["Geography"] = encoder.fit_transform(X["Geography"])
X["Gender"] = encoder.fit_transform(X["Gender"])

# Split to 80% training and 20% testing set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
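
Since the rules produced later refer to the encoded numbers, it helps to know which number stands for which label. A minimal sketch, using a small made-up list rather than the real column: LabelEncoder numbers the labels in sorted order and exposes them through its classes_ attribute.

```python
from sklearn.preprocessing import LabelEncoder

# Small illustrative sample, not the real dataset column.
geography = ["France", "Spain", "France", "Germany", "Spain"]

encoder = LabelEncoder()
codes = encoder.fit_transform(geography)

# classes_ holds the labels in sorted order; the index is the assigned code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}

print(mapping)         # label -> code
print(codes.tolist())  # the encoded sample
```

Printing the mapping right after encoding is a cheap way to double-check any label order used when reading the rules later on.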

4. Load XCS-RC instance

XCS-RC can be installed using pip install xcs-rc at the command line. Afterwards, some parameter adjustment might be useful.

import xcs_rc
agent = xcs_rc.Agent()
agent.predtol = 10.0
agent.prederrtol = 0.0
agent.maxpopsize = 100

5. Train XCS-RC

Feed the training data to XCS and then save training results to a file.

print("Training starts.")
agent.train(X_train, y_train, show_progress=True)
agent.save('final_pop.csv', title="Final Population")

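The train call boils down to a reward loop along these lines (simplified; next_action, apply_reward, maxreward and pop belong to the agent):

def train(X_train, y_train, show_progress=True):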
    for i in range(len(X_train)):  # iterate over training data
        answer = next_action(X_train[i], True)  # get XCS response
        reward = int(answer == y_train[i]) * maxreward  # assign reward
        apply_reward(reward)  # update classifiers

        # simple progress visualization
        if show_progress:
            print('.', end='')
            if i % 100 == 99:
                print()

    return pop
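
To see the answer, reward, update cycle in isolation, here is a toy stand-in. It is not XCS-RC: ToyAgent is a hypothetical class that keeps one reward estimate per (input, action) pair and only borrows the method names next_action and apply_reward from the loop above. The learning rate, exploration rate and toy data are assumptions made for the illustration.

```python
import random


class ToyAgent:
    """Hypothetical stand-in for a reward-driven learner; NOT the XCS-RC code."""

    def __init__(self, actions=(0, 1), maxreward=100.0, beta=0.2, seed=0):
        self.actions = actions
        self.maxreward = maxreward
        self.beta = beta        # learning rate of the estimate update
        self.estimates = {}     # (state, action) -> estimated reward
        self.rng = random.Random(seed)
        self.last = None        # most recent (state, action) pair

    def next_action(self, state, explore=True):
        if explore and self.rng.random() < 0.5:
            action = self.rng.choice(self.actions)  # try something at random
        else:
            # pick the action with the highest estimated reward
            action = max(self.actions,
                         key=lambda a: self.estimates.get((state, a), 0.0))
        self.last = (state, action)
        return action

    def apply_reward(self, reward):
        # move the estimate a fraction beta towards the received reward
        old = self.estimates.get(self.last, 0.0)
        self.estimates[self.last] = old + self.beta * (reward - old)


# Toy data: the expected output simply equals the single binary input.
X_toy = [0, 1] * 50
y_toy = [0, 1] * 50

toy = ToyAgent()
for x, y in zip(X_toy, y_toy):
    answer = toy.next_action(x, True)            # get a response
    reward = int(answer == y) * toy.maxreward    # full reward only if correct
    toy.apply_reward(reward)                     # update the estimates

predictions = [toy.next_action(x, explore=False) for x in (0, 1)]
print(predictions)
```

Wrong answers earn a reward of zero, so only the estimates of correct answers grow, and the greedy choice settles on them once exploration is switched off.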

The training produces a set of human-readable rules (in this case, 71 classifiers), displayed in a simplified version for ease of understanding.

How to Read

Square brackets denote a range, e.g., [2, 8] covers any value between and including 2 and 8.
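
The inclusive range check can be sketched in a few lines. The function name matches and the three-attribute rule are hypothetical, chosen only for illustration; the real rules carry one range per input column.

```python
def matches(rule, inputs):
    """Return True if every input lies inside its [low, high] range, bounds included."""
    return len(rule) == len(inputs) and all(
        low <= value <= high for (low, high), value in zip(rule, inputs)
    )


# Hypothetical three-attribute rule, for illustration only.
rule = [[2, 8], [0, 1], [18, 40]]

print(matches(rule, [2, 1, 30]))  # True: 2 sits on the lower bound, which is included
print(matches(rule, [9, 1, 30]))  # False: 9 is outside [2, 8]
```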

Encoded columns (see Phase A, Section 3):

Geography: ['France', 'Spain', 'Germany']
Gender: ['Female', 'Male']

Each rule covers the set of attribute values that fall inside its ranges. For example, rule #11 can be read as: a customer with

a credit score in the range [544, 628]
coming from France or Spain
either male or female
in age range 26 to 44
length of tenure up to 5 years
with a balance no more than 153538
getting products #1 and #2
either having a credit card or not
not active
and within such range of estimated salary
has a 95.8% likelihood of staying

The classifier itself has:

a fitness value of 0.424 compared to other rules
a record that such a case has occurred 111 times

How to Use

Let's start with an example, predicting a customer with a set of attributes:

output, prob = agent.predict([500, 1, 1, 30, 6, 90000, 1, 0, 1, 100000])

Only two steps are required to get the prediction, plus a final normalization:

Collect all matching classifiers: id [1, 2, 4] for stay; [5, 6] for exit
Calculate the likelihood of staying (0) and exiting (1) with the formula:

\( P_a = \cfrac{ \sum_j \mathcal{L}_j \times \mathcal{F}_j }{ \sum_j \mathcal{F}_j }, \) where the sums run over the matching rules \( j \) that advocate outcome \( a \), \( \mathcal{L}_j = \) rule likelihood and \( \mathcal{F}_j = \) rule fitness

making \( P_0 = \cfrac { 0.987 \times 0.449 + 0.992 \times 0.931 + 0.986 \times 0.313 } { 0.449 + 0.931 + 0.313 } = \cfrac{ 1.675333 }{ 1.693 } = 0.98956 \)

and \( P_1 = \cfrac{ 0.264 \times 0.736 + 0.399 \times 0.287 } { 0.736 + 0.287 } = \cfrac{ 0.308817 }{ 1.023 } = 0.30187 \)

Finally, after normalizing P with the softmax function, XCS-RC returns two values:
```
output = 0
prob = [0.6654528588932179, 0.3345471411067821]
```
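
The arithmetic above can be reproduced with a short sketch. The helper names weighted_likelihood and softmax are chosen here for illustration and are not part of the XCS-RC API; the likelihood and fitness values are the ones quoted for classifiers 1, 2, 4 (stay) and 5, 6 (exit).

```python
import math


def weighted_likelihood(rules):
    """Fitness-weighted mean of likelihoods; rules is a list of (likelihood, fitness)."""
    total_fitness = sum(fitness for _, fitness in rules)
    return sum(lik * fitness for lik, fitness in rules) / total_fitness


def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


stay_rules = [(0.987, 0.449), (0.992, 0.931), (0.986, 0.313)]  # classifiers 1, 2, 4
exit_rules = [(0.264, 0.736), (0.399, 0.287)]                  # classifiers 5, 6

p0 = weighted_likelihood(stay_rules)
p1 = weighted_likelihood(exit_rules)
prob = softmax([p0, p1])
output = prob.index(max(prob))  # the outcome with the highest probability

print(round(p0, 5), round(p1, 5))
print(output, [round(p, 4) for p in prob])
```

Note that softmax compresses the gap between the two scores: raw values of roughly 0.99 and 0.30 become probabilities of about 0.67 and 0.33.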

This leads to the conclusion:
the customer is predicted to stay with a probability of 67% and to exit with 33%, based on the training data. So, it should be quite safe for now. 😏

6. Test XCS-RC

Test the learning results using the prepared data (Phase A).

print("Testing.")
cm = agent.test(X_test, y_test, show_progress=False)
print("Confusion matrix:", cm)
print("Correctness rate: {0:.2f}%".format((cm[0][0] + cm[1][1]) / len(y_test) * 100))

Then the console gives an output like this:

Testing.
Confusion matrix: [[1582, 13], [345, 60]]
Correctness rate: 82.10%
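
The correctness rate alone hides how the two classes fare, so here is a sketch that reads a little more out of the matrix printed above. It assumes that rows are the actual classes and columns the predicted ones (0 = stay, 1 = exit); if the library uses the opposite orientation, recall and precision swap roles.

```python
cm = [[1582, 13], [345, 60]]  # confusion matrix from the console output

# Assumption: rows = actual class, columns = predicted class (0 = stay, 1 = exit).
(tn, fp), (fn, tp) = cm
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
exit_recall = tp / (tp + fn)     # share of actual leavers that were caught
exit_precision = tp / (tp + fp)  # share of predicted leavers that really left

print(f"accuracy:       {accuracy:.2%}")
print(f"exit recall:    {exit_recall:.2%}")
print(f"exit precision: {exit_precision:.2%}")
```

Under that assumption, most of the errors are leavers predicted as staying (345 of 405).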

7. Evaluation

...

Full Code

# read data file
import pandas as pd
dataset = pd.read_csv('Churn_Modelling.csv')

# split columns, X as input, y as (expected) output
X = dataset.iloc[:, 3:13].copy()  # copy, so the encoding step can assign columns safely
y = dataset.iloc[:, 13]

from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
X["Geography"] = encoder.fit_transform(X["Geography"])
X["Gender"] = encoder.fit_transform(X["Gender"])

# Split to 80% train and 20% test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

import xcs_rc
agent = xcs_rc.Agent()
agent.predtol = 10.0
agent.prederrtol = 0.0
agent.maxpopsize = 100

print("Training starts.")
agent.train(X_train, y_train, show_progress=True)
agent.save('final_pop.csv', title="Final Population")

print("Testing.")
cm = agent.test(X_test, y_test, show_progress=False)
print("Confusion matrix:", cm)
print("Correctness rate: {0:.2f}%".format((cm[0][0] + cm[1][1]) / len(y_test) * 100))