# Data Engineering for Dummies

###### Using XCS and ANN | still in progress

An Artificial Neural Network (ANN) is used for several purposes, and classification has been one of the most famous in recent years. As a supervised learning model, it must be trained before performing any real operation. The aim of that training is to adjust the values of its weights, which means the training data largely determine its success. XCS, on the other hand, uses reinforcement cycles to collect its knowledge, which plays a role similar to an ANN's weights. However, there are at least two main differences:

1. learning paradigm: XCS's environment vs ANN's data
2. XCS owns a human-readable set of knowledge

This is not trying to say which one is better, because every algorithm has its own purpose. Many scientists believe that combining ANN (more precisely, Deep Learning) with RL is the next algorithmic advancement, which is still under investigation for now. You get the chance to learn them both here. 😎

What to Compare?
Churn Modeling is a dataset available on Kaggle, consisting of 10,000 rows in a CSV file. Both algorithms should maximize their correctness rate when predicting the test set. Get XCS-RC with pip install xcs-rc or download it here, while the ANN code can be obtained here.

The full source is also available under Full Code.
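1. Read the data
# read data file (file name assumed here; use the name of your downloaded CSV)
import pandas as pd
dataset = pd.read_csv("Churn_Modelling.csv")

# split columns, X as input, y as (expected) output
X = dataset.iloc[:, 3:13].copy()  # copy, so the encoding in step 3 does not write into a view
y = dataset.iloc[:, 13]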

2. Check the contents of X and y
This is performed on the console, not included in the code.
In[1]: X
Out[1]:
array([[619, 'France', 'Female', ..., 1, 1, 101348.88],
[608, 'Spain', 'Female', ..., 0, 1, 112542.58],
[502, 'France', 'Female', ..., 1, 0, 113931.57],
...,
[709, 'France', 'Female', ..., 0, 1, 42085.58],
[772, 'Germany', 'Male', ..., 1, 0, 92888.52],
[792, 'France', 'Female', ..., 1, 0, 38190.78]], dtype=object)

In[2]: y
Out[2]: array([1, 0, 1, ..., 1, 1, 0], dtype=int64)

3. Encode and split
Use the scikit-learn library to encode the columns "Geography" and "Gender", then split the data into a training set and a testing set.
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
X["Geography"] = encoder.fit_transform(X["Geography"])
X["Gender"] = encoder.fit_transform(X["Gender"])

# Split to 80% training and 20% testing set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
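To see which integer each label ends up with, it helps to know that scikit-learn's LabelEncoder numbers the labels in sorted order. The pure-Python sketch below mimics that behaviour on a few sample values; the helper `label_codes` is hypothetical and exists only for illustration.

```python
def label_codes(values):
    """Map each distinct label to an integer, following sorted label order."""
    classes = sorted(set(values))
    return {label: code for code, label in enumerate(classes)}


# a few sample values taken from the two encoded columns
geography = ["France", "Spain", "France", "Germany"]
gender = ["Female", "Female", "Male"]

geo_codes = label_codes(geography)
gender_codes = label_codes(gender)

print(geo_codes)     # {'France': 0, 'Germany': 1, 'Spain': 2}
print(gender_codes)  # {'Female': 0, 'Male': 1}
```

With the real encoder, the same mapping can be read from encoder.classes_ right after each fit_transform call.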

4. Initialize XCS-RC
XCS-RC can be installed using pip install xcs-rc at the command line. Afterwards, some parameter adjustments might be useful.
import xcs_rc
agent = xcs_rc.Agent()
agent.predtol = 10.0
agent.prederrtol = 0.0
agent.maxpopsize = 100

5. Train XCS-RC
Feed the training data to XCS and then save training results to a file.
print("Training starts.")
agent.train(X_train, y_train, show_progress=True)
agent.save('final_pop.csv', title="Final Population")

def train(X_train, y_train, show_progress=True):

    for i in range(len(X_train)):  # iterate over training data
        answer = next_action(X_train[i], True)  # get XCS response
        reward = int(answer == y_train[i]) * maxreward  # assign reward
        apply_reward(reward)  # update classifiers

        # simple progress visualization
        if show_progress:
            print('.', end='')
            if i % 100 == 99:
                print()

    return pop
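The reward logic in that loop can be tried out without the library. Below is a minimal sketch with a hypothetical StubAgent that always answers 0; the class, its maxreward value of 1000, and the toy data are assumptions made for illustration, not part of XCS-RC.

```python
class StubAgent:
    """Hypothetical stand-in for an XCS agent (not the xcs_rc API)."""

    def __init__(self, maxreward=1000.0):  # reward size is an assumption
        self.maxreward = maxreward
        self.rewards = []

    def next_action(self, state, explore):
        return 0  # a real agent would consult its matching classifiers

    def apply_reward(self, reward):
        self.rewards.append(reward)  # a real agent would update classifiers


def train(agent, X_train, y_train):
    """Give the full reward for a correct answer and zero otherwise."""
    for state, expected in zip(X_train, y_train):
        answer = agent.next_action(state, True)
        reward = int(answer == expected) * agent.maxreward
        agent.apply_reward(reward)
    return agent.rewards


X_toy = [[619, 0], [608, 2], [502, 0]]
y_toy = [1, 0, 1]
rewards = train(StubAgent(), X_toy, y_toy)
print(rewards)  # [0.0, 1000.0, 0.0]
```

Only the second sample is answered correctly, so it is the only one that earns the full reward.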

The training produces a set of human-readable rules (71 classifiers in this case), displayed in a simplified version for ease of understanding.

Square brackets denote the range, e.g., [2, 8] covers any value between and including 2 and 8.

Encoded columns (see Phase A, Section 3):
• Geography: ['France', 'Spain', 'Germany']
• Gender: ['Female', 'Male']
Each rule represents a set of attributes covered by its value ranges. For example, rule #11 says that a customer:
1. with a credit score in range [544, 628]
2. coming from France or Spain
3. either male or female
4. in age range 26 to 44
5. with a tenure of up to 5 years
6. with a balance of no more than 153538
7. getting products #1 and #2
8. either having a credit card or not
9. not active
10. and within such range of estimated salary

has a 95.8% likelihood of staying.
The classifier itself has:
• a fitness value of 0.424 compared to other rules
• a record that such a case has occurred 111 times
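How a range-based rule decides whether it covers a customer can be sketched in a few lines. The example below assumes a rule is stored as a list of (low, high) pairs, one per attribute; that layout, the function name matches, and the three-attribute rule are illustrative assumptions, with the credit score and age ranges borrowed from rule #11.

```python
def matches(rule, customer):
    """Return True if every attribute lies inside the rule's [low, high] range."""
    return all(low <= value <= high
               for (low, high), value in zip(rule, customer))


# hypothetical rule over three attributes: credit score, gender code, age
rule = [(544, 628), (0, 1), (26, 44)]

print(matches(rule, [600, 1, 30]))  # True: all three values are in range
print(matches(rule, [700, 0, 30]))  # False: credit score 700 is above 628
```

A range such as (0, 1) on an encoded column covers both codes, which is how "either male or female" is expressed.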
How to Use
Let's start with an example, predicting a customer with a set of attributes:
output, prob = agent.predict([500, 1, 1, 30, 6, 90000, 1, 0, 1, 100000])
Only two steps are required to get the prediction, plus a final beautification step:
1. Collect all matching classifiers: ids [1, 2, 4] for stay; [5, 6] for exit
2. Calculate the likelihood of staying (0) and exiting (1) with the formula:

$$P_i = \cfrac{ \sum_j \mathcal{L}_j \times \mathcal{F}_j }{ \sum_j \mathcal{F}_j },$$ where $$\mathcal{L}_j =$$ likelihood and $$\mathcal{F}_j =$$ fitness of matching rule $$j$$, summed over the matching rules that predict class $$i$$

making $$P_0 = \cfrac { 0.987 \times 0.449 + 0.992 \times 0.931 + 0.986 \times 0.313 } { 0.449 + 0.931 + 0.313 } = \cfrac{ 1.675333 }{ 1.693 } = 0.98956$$

and $$P_1 = \cfrac{ 0.264 \times 0.736 + 0.399 \times 0.287 } { 0.736 + 0.287 } = \cfrac{ 0.308817 }{ 1.023 } = 0.30187$$

3. And finally, after normalizing P with the softmax function, XCS-RC returns two responses:
output = 0
prob = [0.6654528588932179, 0.3345471411067821]
This leads to the conclusion:
the customer is predicted to stay with a probability of 67% and to exit with 33%, based on the training data. So, it should be quite safe for now. 😏
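The arithmetic of the steps above can be reproduced with plain Python. This is a sketch of the calculation only, not the library's internal code; the function names are made up here, and the (likelihood, fitness) pairs are the rounded values from the worked example.

```python
import math


def weighted_likelihood(rules):
    """Fitness-weighted mean likelihood over (likelihood, fitness) pairs."""
    total_fitness = sum(fitness for _, fitness in rules)
    return sum(lik * fitness for lik, fitness in rules) / total_fitness


def softmax(values):
    """Normalize the values so that they are positive and sum to 1."""
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# (likelihood, fitness) of the matching classifiers
stay_rules = [(0.987, 0.449), (0.992, 0.931), (0.986, 0.313)]
exit_rules = [(0.264, 0.736), (0.399, 0.287)]

p = [weighted_likelihood(stay_rules), weighted_likelihood(exit_rules)]
prob = softmax(p)
output = prob.index(max(prob))  # class with the highest probability

print([round(v, 5) for v in p])     # [0.98956, 0.30187]
print(output)                       # 0
print([round(v, 3) for v in prob])  # [0.665, 0.335]
```

Because the inputs are rounded, the last digits of prob may differ slightly from the values returned by agent.predict.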

6. Test XCS-RC
Test the learning results using the prepared data (Phase A).
print("Testing.")
cm = agent.test(X_test, y_test, show_progress=False)
print("Confusion matrix:", cm)
print("Correctness rate: {0:.2f}%".format((cm[0][0] + cm[1][1]) / len(y_test) * 100))
Then the console gives an output like this:
Testing.
Confusion matrix: [[1582, 13], [345, 60]]
Correctness rate: 82.10%
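The correctness rate can be checked by hand from the confusion matrix printed above. The sketch below also breaks the result down per row, under the assumption that each row holds one actual class (0 for stay, 1 for exit) and each column one predicted class; that layout is not stated in the output itself.

```python
cm = [[1582, 13], [345, 60]]  # confusion matrix from the console output

total = sum(sum(row) for row in cm)  # number of test samples
correct = cm[0][0] + cm[1][1]        # diagonal: correct predictions
accuracy = correct / total * 100

print("Correctness rate: {0:.2f}%".format(accuracy))  # Correctness rate: 82.10%

# share of each row that sits on the diagonal
# (assumes rows are the actual classes)
per_class = [row[i] / sum(row) for i, row in enumerate(cm)]
print([round(v, 3) for v in per_class])  # [0.992, 0.148]
```

Under that assumption, nearly all of the error comes from the second row.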

7. Evaluation
...

Full Code
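# read data file (file name assumed here; use the name of your downloaded CSV)
import pandas as pd
dataset = pd.read_csv("Churn_Modelling.csv")

# split columns, X as input, y as (expected) output
X = dataset.iloc[:, 3:13].copy()  # copy, so the encoding below does not write into a view
y = dataset.iloc[:, 13]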

from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
X["Geography"] = encoder.fit_transform(X["Geography"])
X["Gender"] = encoder.fit_transform(X["Gender"])

# Split to 80% train and 20% test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

import xcs_rc
agent = xcs_rc.Agent()
agent.predtol = 10.0
agent.prederrtol = 0.0
agent.maxpopsize = 100

print("Training starts.")
agent.train(X_train, y_train, show_progress=True)
agent.save('final_pop.csv', title="Final Population")

print("Testing.")
cm = agent.test(X_test, y_test, show_progress=False)
print("Confusion matrix:", cm)
print("Correctness rate: {0:.2f}%".format((cm[0][0] + cm[1][1]) / len(y_test) * 100))