Implementing a Naive Bayes Classifier from Scratch (Discrete Case): Student Class Assignment as an Example

Published 2020-05-13 · Category: Algorithm · Reading time ≈ 3 min

A bit of algorithm candy for an idle day.

1️⃣ Task requirements

1️⃣.1️⃣ Implement five functions:

  • load_data(): read the data and convert it into a usable form;
  • split_data(): split the dataset into a training set and a test set;
  • train(): train a model on the current training set;
  • predict(): use the model produced by train() to assign each student in the test set to a class;
  • evaluate(): report the model's accuracy.

1️⃣.2️⃣ train() and predict() must not use any third-party libraries;

1️⃣.3️⃣ Dataset (download link: student.csv)

  • one row per student: the student's attributes and the class they are finally assigned to;
  • 649 rows of data (instances);
  • 30 categorical features;
  • 6 classes: {A+, A, B, C, D, F} (a quick sanity check follows below).
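
These numbers are easy to sanity-check once the file is in hand; a minimal sketch, assuming student.csv sits in the working directory:

import pandas as pd

data = pd.read_csv('student.csv')
print(data.shape)                    # expect (649, 31): 30 features plus the Grade column
print(data['Grade'].value_counts())  # row counts for each of the six classes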

1️⃣.4️⃣ Dataset description:

1 school - student's school (binary: “GP” - Gabriel Pereira or “MS” - Mousinho da Silveira)

2 sex - student's sex (binary: “F” - female or “M” - male)

3 address - student's home address type (binary: “U” - urban or “R” - rural)

4 famsize - family size (binary: “LE3” - less or equal to 3 or “GT3” - greater than 3)

5 Pstatus - parents' cohabitation status (binary: “T” - living together or “A” - apart)

6 Medu - mother's education (nominal: low, none, mid, high)

7 Fedu - father's education (nominal: low, none, mid, high)

8 Mjob - mother's job (nominal: “teacher”, “health” care related, civil “services” (e.g. administrative or police), “at_home” or “other”)

9 Fjob - father's job (nominal: “teacher”, “health” care related, civil “services” (e.g. administrative or police), “at_home” or “other”)

10 reason - reason to choose this school (nominal: close to “home”, school “reputation”, “course” preference or “other”)

11 guardian - student's guardian (nominal: “mother”, “father” or “other”)

12 traveltime - home to school travel time (nominal: none, low, medium, high, very_high)

13 studytime - weekly study time (nominal: none, low, medium, high, very_high)

14 failures - number of past class failures (nominal: none, low, medium, high, very_high)

15 schoolsup - extra educational support (binary: yes or no)

16 famsup - family educational support (binary: yes or no)

17 paid - extra paid classes within the course subject (binary: yes or no)

18 activities - extra-curricular activities (binary: yes or no)

19 nursery - attended nursery school (binary: yes or no)

20 higher - wants to take higher education (binary: yes or no)

21 internet - Internet access at home (binary: yes or no)

22 romantic - in a romantic relationship (binary: yes or no)

23 famrel - quality of family relationships (nominal: very_bad, bad, mediocre, good, excellent)

24 freetime - free time after school (nominal: very_low, low, mediocre, high, very_high)

25 goout - going out with friends (nominal: very_low, low, mediocre, high, very_high)

26 Dalc - workday alcohol consumption (nominal: very_low, low, mediocre, high, very_high)

27 Walc - weekend alcohol consumption (nominal: very_low, low, mediocre, high, very_high)

28 health - current health status (nominal: very_bad, bad, mediocre, good, excellent)

29 absences - number of school absences (nominal: none, one_to_three, four_to_six, seven_to_ten, more_than_ten)

30 Grade - final grade (A+, A, B, C, D, F)

2️⃣ Code

2️⃣.1️⃣ load_data()

# This function should open a data file in csv, and transform it into a usable format
def load_data():
    import pandas as pd
    data = pd.read_csv('student.csv', sep=',')
    return data

2️⃣.2️⃣ split_data()

# This function should split a data set into a training set and hold-out test set
def split_data(data, test_size):
    """
    split the data into train set and test set
    :param data: DataFrame from pd.read_csv
    :param test_size: float, fraction of rows that go into the training set
                      (despite the name, X[:split_point] below is the training set)
    :return: X_train, X_test, y_train, y_test
    """
    import numpy as np
    X = data[list(data.columns[:-1])].values  # instance matrix (all columns except the class)
    y = data['Grade'].values                  # class vector
    index = np.arange(data.shape[0])          # row indices 0..n-1
    np.random.shuffle(index)                  # shuffle the order of the data
    X = X[index]                              # reorder the instance matrix
    y = y[index]                              # reorder the class vector

    split_point = int(X.shape[0] * test_size)  # position at which to split into train and test
    X_train, X_test = X[:split_point], X[split_point:]
    y_train, y_test = y[:split_point], y[split_point:]

    return X_train, X_test, y_train, y_test
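
As a concrete example: with the 649-row dataset and test_size=0.7 as used in the main program below, split_point = int(649 * 0.7) = 454, so the first 454 shuffled rows become the training set and the remaining 195 become the hold-out test set.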

2️⃣.3️⃣ train()

# This function should build a supervised NB model
def train(X, y, alpha):
    """
    train, i.e. estimate the probability tables of the Naive Bayes classifier
    :param X: numpy array, training instances
    :param y: numpy array, training classes
    :param alpha: Laplace smoothing parameter
    :return:
    """
    y_class_count = {}
    feature_dimension = len(X[0])  # number of features

    # count the occurrences of each label
    for c in y:
        y_class_count[c] = y_class_count.get(c, 0) + 1  # dict of class counts, e.g. {'A': 69, ...}
    y_class_tuple = sorted(y_class_count.items(), reverse=False)  # (class, count) pairs, sorted by class label for a stable order, e.g. [('A', 69), ...]
    K = len(y_class_tuple)  # number of distinct classes
    N = len(y)              # number of instances

    # get the prior probability
    prior_prob = {}
    for key in range(len(y_class_tuple)):
        prior_prob[y_class_tuple[key][0]] = (y_class_tuple[key][1] + alpha) / (N + K * alpha)  # Laplace smoothing

    # get the value set of each feature
    feature_value = []         # unique values of each feature
    feature_value_number = []  # number of unique values of each feature
    for feature in range(feature_dimension):
        unique_feature = list(set(X[:, feature]))  # use `set` to get the unique values
        feature_value_number.append(len(unique_feature))
        feature_value.append(unique_feature)

    # calculate the conditional probability
    conditional_prob = []
    y_list = list(y)  # convert once, instead of on every loop iteration
    # count the co-occurrences (x = a & y = c)
    for j in range(feature_dimension):
        count = [[0 for _ in range(len(y_class_count))] for _ in range(feature_value_number[j])]
        # zero matrix with feature_value_number[j] rows and one column per class
        for i in range(len(X[:, j])):
            for k in range(len(feature_value[j])):
                for t in range(len(y_class_count)):
                    if X[:, j][i] == feature_value[j][k] and y_list[i] == y_class_tuple[t][0]:
                        # x = value and y = class, increment the count
                        count[k][t] += 1
        # turn the counts into conditional probabilities
        for m in range(len(y_class_tuple)):
            for r in range(len(count)):
                count[r][m] = (count[r][m] + alpha) / (y_class_tuple[m][1] + alpha * feature_value_number[j])  # Laplace smoothing
        conditional_prob.append(count)
    return y_class_tuple, prior_prob, feature_value, feature_value_number, conditional_prob
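
For reference, with Laplace smoothing parameter $\alpha$, $N$ training instances of which $N_c$ belong to class $c$, $K$ classes, and $V_j$ distinct values for feature $j$, the two quantities train() estimates are

$$
\hat{P}(y=c)=\frac{N_c+\alpha}{N+K\alpha},
\qquad
\hat{P}(x_j=a\mid y=c)=\frac{\#\{i: x_{ij}=a,\ y_i=c\}+\alpha}{N_c+\alpha V_j}.
$$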

2️⃣.4️⃣ predict()

def classify(y_class_tuple, prior_prob, feature_value, conditional_prob, feature_value_number, alpha, instance):
    """
    score one instance against every class
    :param y_class_tuple: list of (class, count) pairs, sorted by class label
    :param prior_prob: dict, prior probability of each class
    :param feature_value: list, unique values of every feature
    :param conditional_prob: list, conditional probabilities P(x|y)
    :param feature_value_number: list, number of unique values of each feature
    :param alpha: float, Laplace smoothing parameter, default 1
    :param instance: one row of the test set
    :return: dict mapping each class to its log-probability score
    """
    import math
    scores = {}
    for m in range(len(y_class_tuple)):
        # start from the log prior of the m-th class
        yhat = math.log(prior_prob[y_class_tuple[m][0]])  # work in log space to avoid floating-point underflow
        for n in range(len(instance)):
            if instance[n] in feature_value[n]:
                index = feature_value[n].index(instance[n])  # locate the value in feature_value
                yhat = yhat + math.log(conditional_prob[n][index][m])  # accumulate the log-probability
            else:
                # the value never appeared in the training set: use the
                # Laplace-smoothed zero-count probability instead of
                # overwriting the accumulated score
                yhat = yhat + math.log(alpha / (y_class_tuple[m][1] + alpha * feature_value_number[n]))
        scores[y_class_tuple[m][0]] = yhat
    return scores
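
classify() therefore assigns each class $c$ the log-space score

$$
\operatorname{score}(c)=\log\hat{P}(y=c)+\sum_{j=1}^{30}\log\hat{P}(x_j\mid y=c),
$$

and predict() below returns the class with the largest score. Summing logs rather than multiplying 30 small probabilities is what keeps the computation from underflowing.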
# This function should predict the class for an instance or a set of instances, based on a trained model
def predict(y_class_tuple, prior_prob, feature_value, feature_value_number, conditional_prob, X, alpha, flag=0):
    """
    predict the class for an instance or a set of instances, based on a trained model
    :param y_class_tuple: list of (class, count) pairs, sorted by class label
    :param prior_prob: dict, prior probability of each class
    :param feature_value: list, unique values of every feature
    :param feature_value_number: list, number of unique values of each feature
    :param conditional_prob: list, conditional probabilities P(x|y)
    :param X: numpy array, test set
    :param alpha: float, Laplace smoothing parameter, default 1
    :param flag: set 1 to return the winning log-scores, 0 to return the predictions, default 0
    :return:
    """
    import operator as op
    test_num = len(X)
    prediction = [0 for _ in range(test_num)]
    probability = [0 for _ in range(test_num)]
    for i in range(test_num):
        result = classify(y_class_tuple, prior_prob, feature_value, conditional_prob, feature_value_number, alpha, X[i, :])
        # result maps each class to its log-score
        result = sorted(result.items(), key=op.itemgetter(1), reverse=True)  # the class with the largest score wins
        prediction[i] = result[0][0]   # the predicted class
        probability[i] = result[0][1]  # its log-score

    if flag:
        return probability
    else:
        return prediction

2️⃣.5️⃣ evaluate()

# This function should evaluate a set of predictions in terms of accuracy
def evaluate(p, y_test):
    import numpy as np
    accuracy = np.mean(np.asarray(p) == np.asarray(y_test))  # fraction of correct predictions
    return accuracy

2️⃣.6️⃣ Main program

data = load_data()
X_train, X_test, y_train, y_test = split_data(data, 0.7)
y_class_tuple, prior_prob, feature_value, feature_value_number, conditional_prob = train(X_train, y_train, 1)
p = predict(y_class_tuple, prior_prob, feature_value, feature_value_number, conditional_prob, X_test, 1)
evaluate(p, y_test)
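
Run as a script, the last line computes the accuracy but prints nothing. A minimal wrapper that prints the result; the fixed seed is my addition (it pins down the random shuffle in split_data so runs are comparable), not part of the original:

import numpy as np

np.random.seed(0)  # assumption: fix the shuffle so repeated runs are comparable
data = load_data()
X_train, X_test, y_train, y_test = split_data(data, 0.7)
model = train(X_train, y_train, 1)
y_class_tuple, prior_prob, feature_value, feature_value_number, conditional_prob = model
p = predict(y_class_tuple, prior_prob, feature_value, feature_value_number, conditional_prob, X_test, 1)
print('accuracy =', evaluate(p, y_test))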

3️⃣ Full code (for easy copy-and-run)

# This function should open a data file in csv, and transform it into a usable format
def load_data():
    import pandas as pd
    data = pd.read_csv('student.csv', sep=',')
    return data


# This function should split a data set into a training set and hold-out test set
def split_data(data, test_size):
    """
    split the data into train set and test set
    :param data: DataFrame from pd.read_csv
    :param test_size: float, fraction of rows that go into the training set
                      (despite the name, X[:split_point] below is the training set)
    :return: X_train, X_test, y_train, y_test
    """
    import numpy as np
    X = data[list(data.columns[:-1])].values  # instance matrix (all columns except the class)
    y = data['Grade'].values                  # class vector
    index = np.arange(data.shape[0])          # row indices 0..n-1
    np.random.shuffle(index)                  # shuffle the order of the data
    X = X[index]                              # reorder the instance matrix
    y = y[index]                              # reorder the class vector

    split_point = int(X.shape[0] * test_size)  # position at which to split into train and test
    X_train, X_test = X[:split_point], X[split_point:]
    y_train, y_test = y[:split_point], y[split_point:]

    return X_train, X_test, y_train, y_test


# This function should build a supervised NB model
def train(X, y, alpha):
    """
    train, i.e. estimate the probability tables of the Naive Bayes classifier
    :param X: numpy array, training instances
    :param y: numpy array, training classes
    :param alpha: Laplace smoothing parameter
    :return:
    """
    y_class_count = {}
    feature_dimension = len(X[0])  # number of features

    # count the occurrences of each label
    for c in y:
        y_class_count[c] = y_class_count.get(c, 0) + 1  # dict of class counts, e.g. {'A': 69, ...}
    y_class_tuple = sorted(y_class_count.items(), reverse=False)  # (class, count) pairs, sorted by class label for a stable order, e.g. [('A', 69), ...]
    K = len(y_class_tuple)  # number of distinct classes
    N = len(y)              # number of instances

    # get the prior probability
    prior_prob = {}
    for key in range(len(y_class_tuple)):
        prior_prob[y_class_tuple[key][0]] = (y_class_tuple[key][1] + alpha) / (N + K * alpha)  # Laplace smoothing

    # get the value set of each feature
    feature_value = []         # unique values of each feature
    feature_value_number = []  # number of unique values of each feature
    for feature in range(feature_dimension):
        unique_feature = list(set(X[:, feature]))  # use `set` to get the unique values
        feature_value_number.append(len(unique_feature))
        feature_value.append(unique_feature)

    # calculate the conditional probability
    conditional_prob = []
    y_list = list(y)  # convert once, instead of on every loop iteration
    # count the co-occurrences (x = a & y = c)
    for j in range(feature_dimension):
        count = [[0 for _ in range(len(y_class_count))] for _ in range(feature_value_number[j])]
        # zero matrix with feature_value_number[j] rows and one column per class
        for i in range(len(X[:, j])):
            for k in range(len(feature_value[j])):
                for t in range(len(y_class_count)):
                    if X[:, j][i] == feature_value[j][k] and y_list[i] == y_class_tuple[t][0]:
                        # x = value and y = class, increment the count
                        count[k][t] += 1
        # turn the counts into conditional probabilities
        for m in range(len(y_class_tuple)):
            for r in range(len(count)):
                count[r][m] = (count[r][m] + alpha) / (y_class_tuple[m][1] + alpha * feature_value_number[j])  # Laplace smoothing
        conditional_prob.append(count)
    return y_class_tuple, prior_prob, feature_value, feature_value_number, conditional_prob


def classify(y_class_tuple, prior_prob, feature_value, conditional_prob, feature_value_number, alpha, instance):
    """
    score one instance against every class
    :param y_class_tuple: list of (class, count) pairs, sorted by class label
    :param prior_prob: dict, prior probability of each class
    :param feature_value: list, unique values of every feature
    :param conditional_prob: list, conditional probabilities P(x|y)
    :param feature_value_number: list, number of unique values of each feature
    :param alpha: float, Laplace smoothing parameter, default 1
    :param instance: one row of the test set
    :return: dict mapping each class to its log-probability score
    """
    import math
    scores = {}
    for m in range(len(y_class_tuple)):
        # start from the log prior of the m-th class
        yhat = math.log(prior_prob[y_class_tuple[m][0]])  # work in log space to avoid floating-point underflow
        for n in range(len(instance)):
            if instance[n] in feature_value[n]:
                index = feature_value[n].index(instance[n])  # locate the value in feature_value
                yhat = yhat + math.log(conditional_prob[n][index][m])  # accumulate the log-probability
            else:
                # the value never appeared in the training set: use the
                # Laplace-smoothed zero-count probability instead of
                # overwriting the accumulated score
                yhat = yhat + math.log(alpha / (y_class_tuple[m][1] + alpha * feature_value_number[n]))
        scores[y_class_tuple[m][0]] = yhat
    return scores


# This function should predict the class for an instance or a set of instances, based on a trained model
def predict(y_class_tuple, prior_prob, feature_value, feature_value_number, conditional_prob, X, alpha, flag=0):
    """
    predict the class for an instance or a set of instances, based on a trained model
    :param y_class_tuple: list of (class, count) pairs, sorted by class label
    :param prior_prob: dict, prior probability of each class
    :param feature_value: list, unique values of every feature
    :param feature_value_number: list, number of unique values of each feature
    :param conditional_prob: list, conditional probabilities P(x|y)
    :param X: numpy array, test set
    :param alpha: float, Laplace smoothing parameter, default 1
    :param flag: set 1 to return the winning log-scores, 0 to return the predictions, default 0
    :return:
    """
    import operator as op
    test_num = len(X)
    prediction = [0 for _ in range(test_num)]
    probability = [0 for _ in range(test_num)]
    for i in range(test_num):
        result = classify(y_class_tuple, prior_prob, feature_value, conditional_prob, feature_value_number, alpha, X[i, :])
        # result maps each class to its log-score
        result = sorted(result.items(), key=op.itemgetter(1), reverse=True)  # the class with the largest score wins
        prediction[i] = result[0][0]   # the predicted class
        probability[i] = result[0][1]  # its log-score

    if flag:
        return probability
    else:
        return prediction


# This function should evaluate a set of predictions in terms of accuracy
def evaluate(p, y_test):
    import numpy as np
    accuracy = np.mean(np.asarray(p) == np.asarray(y_test))  # fraction of correct predictions
    return accuracy


data = load_data()
X_train, X_test, y_train, y_test = split_data(data, 0.7)
y_class_tuple, prior_prob, feature_value, feature_value_number, conditional_prob = train(X_train, y_train, 1)
p = predict(y_class_tuple, prior_prob, feature_value, feature_value_number, conditional_prob, X_test, 1)
evaluate(p, y_test)
Output of one run (the split is random, so the exact figure varies):

0.4358974358974359
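
For context, guessing uniformly at random among the 6 grades would score about 1/6 ≈ 0.167, so the model does clearly better than chance.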