Data Engineering for Dummies
Using XCS and ANN | still in progress
Artificial Neural Network (ANN) is used for several purposes, and classification has been one of its most famous applications in the past years. As a supervised learning method, it must be trained before performing any real operation. The aim of such learning is to adjust the values of its weights, meaning that the training data highly determine its success. On the other hand, XCS uses reinforcement cycles to collect its knowledge, which plays a similar role to ANN's weights. However, there are at least two main differences here:
- learning paradigm: XCS's environment vs ANN's data
- XCS owns a human-readable set of knowledge
Install XCS-RC with pip install xcs-rc or download it here, while the ANN lines can be obtained here. The full source is also available in Conclusions.
# read data file
import pandas as pd
dataset = pd.read_csv('Churn_Modelling.csv')
# split columns, X as input, y as (expected) output
X = dataset.iloc[:, 3:13]
y = dataset.iloc[:, 13]
In[1]: X.values
Out[1]:
array([[619, 'France', 'Female', ..., 1, 1, 101348.88],
[608, 'Spain', 'Female', ..., 0, 1, 112542.58],
[502, 'France', 'Female', ..., 1, 0, 113931.57],
...,
[709, 'France', 'Female', ..., 0, 1, 42085.58],
[772, 'Germany', 'Male', ..., 1, 0, 92888.52],
[792, 'France', 'Female', ..., 1, 0, 38190.78]], dtype=object)
In[2]: y.values
Out[2]: array([1, 0, 1, ..., 1, 1, 0], dtype=int64)
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
X["Geography"] = encoder.fit_transform(X["Geography"])
X["Gender"] = encoder.fit_transform(X["Gender"])
# Split to 80% training and 20% testing set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
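As a side note, LabelEncoder stores the category names in sorted order and assigns integer codes accordingly, so it is worth keeping encoder.classes_ around if codes need to be mapped back to names. A quick sketch of its behavior:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["France", "Spain", "Germany", "France"])
print(list(encoder.classes_))  # categories are stored in sorted order
print(list(codes))             # → [0, 2, 1, 0]
```

Here France, Germany, and Spain receive codes 0, 1, and 2 respectively, since the classes are sorted alphabetically.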
First, install the package by typing pip install xcs-rc at the command line.
Afterwards, some parameter adjustment might be useful.
import xcs_rc
agent = xcs_rc.Agent()
agent.predtol = 10.0
agent.prederrtol = 0.0
agent.maxpopsize = 100
print("Training starts.")
agent.train(X_train, y_train, show_progress=True)
agent.save('final_pop.csv', title="Final Population")
def train(X_train, y_train, show_progress=True):
    for i in range(len(X_train)):  # iterate over training data
        answer = next_action(X_train[i], True)  # get XCS response
        reward = int(answer == y_train[i]) * maxreward  # assign reward
        apply_reward(reward)  # update classifiers
        # simple progress visualization
        if show_progress:
            print('.', end='')
            if i % 100 == 99:
                print()
    return pop
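The reward step above is all-or-nothing: a correct answer earns the full maxreward, a wrong one earns zero. A minimal sketch of that line, assuming maxreward = 1000 (the actual value is an agent parameter, not fixed here):

```python
maxreward = 1000  # assumed value; in XCS-RC this is an agent parameter

def reward_for(answer, label):
    # full reward for a correct prediction, zero otherwise
    return int(answer == label) * maxreward

print(reward_for(1, 1))  # correct prediction → 1000
print(reward_for(0, 1))  # wrong prediction  → 0
```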
The training produces a set of human-readable rules (in this case, 71 classifiers), displayed here in a simplified version for ease of understanding.
An interval [2, 8] covers any value between and including 2 and 8. Encoded columns (see Phase A, Section 3):
- Geography: ['France', 'Spain', 'Germany']
- Gender: ['Female', 'Male']
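The inclusive-interval matching can be sketched in a few lines (covers is a hypothetical helper for illustration, not part of XCS-RC):

```python
def covers(interval, value):
    lo, hi = interval
    return lo <= value <= hi  # both bounds are inclusive

print(covers((2, 8), 2))  # → True
print(covers((2, 8), 8))  # → True
print(covers((2, 8), 9))  # → False
```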
Rule #11 can be translated as follows: a customer with
- a credit score in range [544, 628]
- coming from France or Spain
- either male or female
- in age range 26 to 44
- a length of tenure up to 5 years
- a balance of no more than 153538
- getting products #1 and #2
- either having a credit card or not
- active or not
- and an estimated salary within the covered range

has a 95.8% likelihood of staying, with
- a fitness value of 0.424 compared to other rules
- a record that such a case has occurred 111 times
output, prob = agent.predict([500, 1, 1, 30, 6, 90000, 1, 0, 1, 100000])
Only two steps are required to get the prediction, plus a beautification:
- Collect all matching classifiers: id [1, 2, 4] for stay; [5, 6] for exit
- Calculate the likelihood of staying (0) and exiting (1) with the formula:
\( P_a = \cfrac{ \sum_i \mathcal{L}_i \times \mathcal{F}_i }{ \sum_i \mathcal{F}_i }, \) where \( \mathcal{L}_i = \) likelihood and \( \mathcal{F}_i = \) fitness of each rule proposing action \( a \)
making \( P_0 = \cfrac { 0.987 \times 0.449 + 0.992 \times 0.931 + 0.986 \times 0.313 } { 0.449 + 0.931 + 0.313 } = \cfrac{ 1.675333 }{ 1.693 } = 0.98956 \)
and \( P_1 = \cfrac{ 0.264 \times 0.736 + 0.399 \times 0.287 } { 0.736 + 0.287 } = 0.30187 \)
- And finally, after normalizing \( P \) with the softmax function, XCS-RC sends two responses:
output = 0
prob = [0.6654528588932179, 0.3345471411067821]
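The two steps can be reproduced by hand; a sketch using the likelihood and fitness values of the matching classifiers above (weighted_likelihood is an illustrative helper, not the library's API):

```python
import math

def weighted_likelihood(likelihoods, fitnesses):
    # fitness-weighted average of the matching rules' likelihoods
    return sum(l * f for l, f in zip(likelihoods, fitnesses)) / sum(fitnesses)

p = [
    weighted_likelihood([0.987, 0.992, 0.986], [0.449, 0.931, 0.313]),  # stay (0)
    weighted_likelihood([0.264, 0.399], [0.736, 0.287]),                # exit (1)
]

# softmax turns the two scores into probabilities summing to 1
exps = [math.exp(v) for v in p]
prob = [v / sum(exps) for v in exps]
print([round(v, 5) for v in prob])  # → [0.66545, 0.33455]
```

This reproduces the prob pair returned by agent.predict above.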
The customer is predicted to stay with 67% probability and to exit with 33%, based on the training data. So, it should be quite safe for now.
😏
print("Testing.")
cm = agent.test(X_test, y_test, show_progress=False)
print("Confusion matrix:", cm)
print("Correctness rate: {0:.2f}%".format((cm[0][0] + cm[1][1]) / len(y_test) * 100))
Then the console gives an output like this:
Testing.
Confusion matrix: [[1582, 13], [345, 60]]
Correctness rate: 82.10%
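The correctness rate follows directly from the matrix: the diagonal holds the correct predictions. Recomputing it from the printed values, and (assuming rows are actual classes and columns are predictions) also checking recall on the exit class:

```python
cm = [[1582, 13], [345, 60]]  # confusion matrix from the run above
correct = cm[0][0] + cm[1][1]
total = sum(sum(row) for row in cm)
print("accuracy: {0:.2f}%".format(correct / total * 100))  # → accuracy: 82.10%

# assuming rows = actual class, columns = predicted class,
# recall on the 'exit' class is much lower than the overall accuracy
recall_exit = cm[1][1] / sum(cm[1])
print("exit recall: {0:.2f}%".format(recall_exit * 100))   # → exit recall: 14.81%
```

The high overall rate thus hides that most exiting customers are missed, which is worth keeping in mind when comparing against the ANN.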
# read data file
import pandas as pd
dataset = pd.read_csv('Churn_Modelling.csv')
# split columns, X as input, y as (expected) output
X = dataset.iloc[:, 3:13]
y = dataset.iloc[:, 13]
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
X["Geography"] = encoder.fit_transform(X["Geography"])
X["Gender"] = encoder.fit_transform(X["Gender"])
# Split to 80% train and 20% test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
import xcs_rc
agent = xcs_rc.Agent()
agent.predtol = 10.0
agent.prederrtol = 0.0
agent.maxpopsize = 100
print("Training starts.")
agent.train(X_train, y_train, show_progress=True)
agent.save('final_pop.csv', title="Final Population")
print("Testing.")
cm = agent.test(X_test, y_test, show_progress=False)
print("Confusion matrix:", cm)
print("Correctness rate: {0:.2f}%".format((cm[0][0] + cm[1][1]) / len(y_test) * 100))