Data Engineering for Dummies
Using XCS and ANN | still in progress
An Artificial Neural Network (ANN) serves several purposes, and classification has been one of its most prominent uses in recent years. As a supervised learning model, it must be trained before performing any real operation. The aim of such training is to adjust the values of its weights, meaning that the training data largely determine its success. XCS, on the other hand, uses reinforcement cycles to collect its knowledge, which plays a similar role to an ANN's weights. However, there are at least two main differences here:
- learning paradigm: XCS learns from an environment, while an ANN learns from data
- XCS produces a human-readable set of knowledge
pip install xcs-rc
or download it here, while the ANN code can be obtained here. The full source is also available in Conclusions.
# read data file
import pandas as pd
dataset = pd.read_csv('Churn_Modelling.csv')

# split columns, X as input, y as (expected) output
X = dataset.iloc[:, 3:13]
y = dataset.iloc[:, 13]
In[1]: X
Out[1]:
array([[619, 'France', 'Female', ..., 1, 1, 101348.88],
[608, 'Spain', 'Female', ..., 0, 1, 112542.58],
[502, 'France', 'Female', ..., 1, 0, 113931.57],
...,
[709, 'France', 'Female', ..., 0, 1, 42085.58],
[772, 'Germany', 'Male', ..., 1, 0, 92888.52],
[792, 'France', 'Female', ..., 1, 0, 38190.78]], dtype=object)
In[2]: y
Out[2]: array([1, 0, 1, ..., 1, 1, 0], dtype=int64)
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
X["Geography"] = encoder.fit_transform(X["Geography"])
X["Gender"] = encoder.fit_transform(X["Gender"])

# split to 80% training and 20% testing set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
pip install xcs-rc
at the command line.
Afterwards, some parameter adjustment might be useful.

import xcs_rc
agent = xcs_rc.Agent()
agent.predtol = 10.0
agent.prederrtol = 0.0
agent.maxpopsize = 100
print("Training starts.")
agent.train(X_train, y_train, show_progress=True)
agent.save('final_pop.csv', title="Final Population")
def train(X_train, y_train, show_progress=True):
    for i in range(len(X_train)):                        # iterate over training data
        answer = next_action(X_train[i], True)           # get XCS response
        reward = int(answer == y_train[i]) * maxreward   # assign reward
        apply_reward(reward)                             # update classifiers

        # simple progress visualization
        if show_progress:
            print('.', end='')
            if i % 100 == 99:
                print()
    return pop
The training produces a set of human-readable rules (in this case, 71 classifiers), displayed in a simplified version for ease of understanding.
An interval such as [2, 8] covers any value between and including 2 and 8. Encoded columns (see Phase A, Section 3):
- Geography: ['France', 'Spain', 'Germany']
- Gender: ['Female', 'Male']
Rule #11 can be translated as follows: a customer with
- a credit score in range [544, 628]
- coming from France or Spain
- either male or female
- in age range 26 to 44
- a length of tenure up to 5 years
- a balance no more than 153538
- getting products #1 and #2
- either having a credit card or not
- not active
- and within such a range of estimated salary

has a 95.8% likelihood of staying, a fitness value of 0.424 compared to other rules, and a record that such a case has occurred 111 times.
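To make the interval notation concrete, matching a rule like #11 against an input can be sketched as a simple per-column check. This is an illustrative sketch, not the actual XCS-RC implementation; the intervals below loosely mirror the translation above, and the salary upper bound is an assumed placeholder.

```python
# a condition is a list of [low, high] intervals, one per input column;
# a rule matches when every input value falls inside its interval
def matches(condition, inputs):
    return all(low <= v <= high for (low, high), v in zip(condition, inputs))

# intervals loosely based on the translation of rule #11 above
# (columns: credit score, geography, gender, age, tenure, balance,
#  products, credit card, active, estimated salary)
rule11 = [[544, 628], [0, 1], [0, 1], [26, 44], [0, 5],
          [0, 153538], [1, 2], [0, 1], [0, 0], [0, 200000]]

print(matches(rule11, [560, 0, 1, 30, 3, 90000, 1, 0, 0, 100000]))  # True
print(matches(rule11, [700, 0, 1, 30, 3, 90000, 1, 0, 0, 100000]))  # False: credit score outside [544, 628]
```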
output, prob = agent.predict([500, 1, 1, 30, 6, 90000, 1, 0, 1, 100000])

Only two steps are required to get the prediction, plus a beautification:
- Collect all matching classifiers: id [1, 2, 4] for stay; [5, 6] for exit.
- Calculate the likelihood of staying (0) and exiting (1) with the formula:
\( P_i = \cfrac{ \sum \mathcal{L_i} \times \mathcal{F_i} }{ \sum \mathcal{F_i} }, \) where \( \mathcal{L} = \) likelihood and \( \mathcal{F} = \) rule fitness
making \( P_0 = \cfrac { 0.987 \times 0.449 + 0.992 \times 0.931 + 0.986 \times 0.313 } { 0.449 + 0.931 + 0.313 } = \cfrac{ 1.675333 }{ 1.693 } = 0.98956 \)
and \( P_1 = \cfrac{ 0.264 \times 0.736 + 0.399 \times 0.287 } { 0.736 + 0.287 } = 0.30187 \)
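The arithmetic above can be verified in a few lines; this is a direct sketch of the fitness-weighted formula, using the likelihood/fitness pairs of the matching rules listed above.

```python
def weighted_likelihood(rules):
    """Fitness-weighted average of rule likelihoods: sum(L * F) / sum(F)."""
    return sum(l * f for l, f in rules) / sum(f for _, f in rules)

# (likelihood, fitness) pairs of the matching rules
p_stay = weighted_likelihood([(0.987, 0.449), (0.992, 0.931), (0.986, 0.313)])
p_exit = weighted_likelihood([(0.264, 0.736), (0.399, 0.287)])
print(round(p_stay, 5))  # 0.98956
print(round(p_exit, 5))  # 0.30187
```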
- And finally, after normalizing P with the softmax function, XCS-RC sends two responses:

output = 0
prob = [0.6654528588932179, 0.3345471411067821]
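Assuming a standard softmax is applied (a sketch of the normalization step; the library's internals may differ in detail), the reported probabilities can be reproduced from \( P_0 \) and \( P_1 \):

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]   # exponentiate each score
    total = sum(exps)
    return [e / total for e in exps]   # normalize so the values sum to 1

prob = softmax([0.98956, 0.30187])
print([round(p, 4) for p in prob])  # [0.6655, 0.3345]
```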
The customer is predicted to stay at 67% and to exit at 33%, based on the training data. So, it should be quite safe for now. 😏
print("Testing.")
cm = agent.test(X_test, y_test, show_progress=False)
print("Confusion matrix:", cm)
print("Correctness rate: {0:.2f}%".format((cm[0][0] + cm[1][1]) / len(y_test) * 100))

Then the console gives an output like this:
Testing.
Confusion matrix: [[1582, 13], [345, 60]]
Correctness rate: 82.10%
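The correctness rate follows directly from the confusion matrix: the diagonal entries are the correct predictions. A quick check of the figures above (a sketch, assuming rows are actual classes and columns are predicted classes):

```python
cm = [[1582, 13], [345, 60]]          # confusion matrix from the run above
correct = cm[0][0] + cm[1][1]         # diagonal: 1582 + 60 = 1642 correct
total = sum(sum(row) for row in cm)   # 2000 test samples (20% of 10000)
print("{0:.2f}%".format(correct / total * 100))  # 82.10%
```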
# read data file
import pandas as pd
dataset = pd.read_csv('Churn_Modelling.csv')

# split columns, X as input, y as (expected) output
X = dataset.iloc[:, 3:13]
y = dataset.iloc[:, 13]

# encode categorical columns
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
X["Geography"] = encoder.fit_transform(X["Geography"])
X["Gender"] = encoder.fit_transform(X["Gender"])

# split to 80% train and 20% test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# create the agent and adjust its parameters
import xcs_rc
agent = xcs_rc.Agent()
agent.predtol = 10.0
agent.prederrtol = 0.0
agent.maxpopsize = 100

# train and save the resulting rules
print("Training starts.")
agent.train(X_train, y_train, show_progress=True)
agent.save('final_pop.csv', title="Final Population")

# test
print("Testing.")
cm = agent.test(X_test, y_test, show_progress=False)
print("Confusion matrix:", cm)
print("Correctness rate: {0:.2f}%".format((cm[0][0] + cm[1][1]) / len(y_test) * 100))