NLP Dataset Exploration

The English IMDB Dataset

This section uses the IMDB dataset, which has already been preprocessed so that each movie review is represented as a sequence of integers.

Code to load the IMDB dataset:

# TensorFlow / Keras imports needed for the rest of this section
import tensorflow as tf
from tensorflow import keras

imdb = keras.datasets.imdb

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

num_words=10000 keeps only the 10,000 most frequently occurring words in the training data.

==Note:== Loading the data directly may raise the error Object arrays cannot be loaded when allow_pickle=False. This happens because NumPy changed the default of allow_pickle to False when patching a security vulnerability, so allow_pickle has to be set to True.

Ways to fix this error:

  1. Downgrade numpy to version 1.16.1 [1]
  2. Edit the imdb source code, adding allow_pickle=True at line 85 [2] (a third common workaround is sketched below)
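
Besides the two fixes above, np.load can be temporarily wrapped so that allow_pickle defaults to True while the dataset is loaded. This is only a sketch of that common workaround, not one of the original two options:

import numpy as np
from tensorflow import keras

# keep a reference to the original np.load, then re-enable allow_pickle by default
np_load_old = np.load
np.load = lambda *args, **kwargs: np_load_old(*args, allow_pickle=True, **kwargs)

imdb = keras.datasets.imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

# restore the original behaviour of np.load
np.load = np_load_old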

Inspecting the data

View the first review (as an encoded integer sequence):

print(train_data[0])

The output is:

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]

However, each review has a different length, so the lengths need to be unified before the data is fed into the neural network.
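
A quick check of the first two reviews illustrates this (with the standard IMDB split the lengths are typically 218 and 189, but the exact values are not important):

print(len(train_data[0]), len(train_data[1]))  # e.g. 218 189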

Converting the integers back to words

# A dictionary mapping words to an integer index
word_index = imdb.get_word_index()

# The first indices are reserved
word_index = {k:(v+3) for k,v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2 # unknown
word_index["<UNUSED>"] = 3

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

Now decode_review can be used to display the text of a review:

decode_review(train_data[0])
" this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert  is an amazing actor and now the same being director  father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for  and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also  to the two little boy's that played the  of norman and paul they were just brilliant children are often left out of the  list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all"

Preparing the data for training

Because the reviews differ in length, all of the data must be brought to a uniform length before training.

There are two ways to unify the lengths:

  • One-hot encode the arrays, converting them into vectors of 0s and 1s. For example, the sequence [3, 5] would become a 10,000-dimensional vector that is all zeros except at indices 3 and 5, which are set to 1. This is then used as the first layer of the network, a Dense layer that can handle floating-point vector data. However, this approach is memory-intensive, requiring a matrix of size num_words * num_reviews. (A minimal sketch of this option follows the list.)
  • Pad the arrays so that they all have the same length, then create an integer tensor of shape max_length * num_reviews. An Embedding layer capable of handling this shape can be used as the first layer of the network.
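
For reference, a minimal sketch of the one-hot (multi-hot) encoding described in the first option, assuming the resulting matrix fits in memory:

import numpy as np

def multi_hot_encode(sequences, dimension=10000):
    # build a (num_reviews, dimension) matrix of zeros, then set the positions
    # of the word indices that occur in each review to 1
    results = np.zeros((len(sequences), dimension))
    for i, word_indices in enumerate(sequences):
        results[i, word_indices] = 1.0
    return results

# train_data_onehot = multi_hot_encode(train_data)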

Due to limited machine resources, we follow the official tutorial and use the second approach, standardizing the lengths with the pad_sequences function:

train_data = keras.preprocessing.sequence.pad_sequences(train_data,
                                                        value=word_index["<PAD>"],
                                                        padding='post',
                                                        maxlen=256)

test_data = keras.preprocessing.sequence.pad_sequences(test_data,
                                                       value=word_index["<PAD>"],
                                                       padding='post',
                                                       maxlen=256)

Now every example in the dataset has been padded to a length of 256:

print(train_data[0])

The output is:

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
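
A quick shape check confirms the padding (the 25,000 below assumes the standard IMDB training split):

print(train_data.shape)                        # (25000, 256)
print(len(train_data[0]), len(train_data[1]))  # 256 256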

Building the model

# input shape is the vocabulary count used for the movie reviews (10,000 words)
vocab_size = 10000

model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, 16))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(16, activation=tf.nn.relu))
model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))

model.summary()

The summary output is:

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, None, 16)          160000
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0
_________________________________________________________________
dense (Dense)                (None, 16)                272
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17
=================================================================
Total params: 160,289
Trainable params: 160,289
Non-trainable params: 0

  1. The first layer is an Embedding layer. This layer takes the integer-encoded vocabulary and looks up the embedding vector for each word index. These vectors are learned as the model trains, and they add a dimension to the output array; the resulting dimensions are (batch, sequence, embedding).
  2. Next, a GlobalAveragePooling1D layer returns a fixed-length output vector for each example by averaging over the sequence dimension. This allows the model to handle input of varying length in the simplest possible way.
  3. This fixed-length output vector is fed into a fully connected (Dense) layer with 16 hidden units.
  4. The last layer is densely connected to a single output node. With the sigmoid activation function, the result is a float between 0 and 1, representing a probability or confidence level.
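
As a quick check, the parameter counts in the summary can be reproduced by hand:

  • Embedding: 10,000 words × 16 dimensions = 160,000 parameters
  • Dense(16): 16 inputs × 16 units + 16 biases = 272
  • Dense(1): 16 inputs × 1 unit + 1 bias = 17
  • Total: 160,000 + 272 + 17 = 160,289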

Loss function and optimizer

A model needs a loss function and an optimizer for training. Since this is a binary classification problem and the model outputs a probability (a single-unit layer with a sigmoid activation), we use the binary_crossentropy loss function.

Similarly, for a regression problem, mean_squared_error could be used.

model.compile(optimizer=tf.train.AdamOptimizer(),
              loss='binary_crossentropy',
              metrics=['accuracy'])

Creating a validation set

We create a validation set by setting apart 10,000 examples from the original training data.
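
The split itself is not shown in these notes; a sketch following the official tutorial, using the variable names that appear in the training call below:

x_val = train_data[:10000]
partial_x_train = train_data[10000:]

y_val = train_labels[:10000]
partial_y_train = train_labels[10000:]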

Training the model

Train the model for 40 epochs with mini-batches of 512 samples, i.e. 40 iterations over all the samples in the x_train and y_train tensors. While training, monitor the model's loss and accuracy on the 10,000 samples of the validation set:

history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=40,
                    batch_size=512,
                    validation_data=(x_val, y_val),
                    verbose=1)

The training output is:

Epoch 1/40
15000/15000 [==============================] - 12s 784us/sample - loss: 0.6916 - acc: 0.6300 - val_loss: 0.6892 - val_acc: 0.7099
Epoch 2/40
15000/15000 [==============================] - 1s 55us/sample - loss: 0.6850 - acc: 0.7375 - val_loss: 0.6803 - val_acc: 0.7570
Epoch 3/40
15000/15000 [==============================] - 1s 54us/sample - loss: 0.6714 - acc: 0.7671 - val_loss: 0.6636 - val_acc: 0.7501
Epoch 4/40
15000/15000 [==============================] - 1s 59us/sample - loss: 0.6487 - acc: 0.7743 - val_loss: 0.6380 - val_acc: 0.7524
Epoch 5/40
15000/15000 [==============================] - 1s 55us/sample - loss: 0.6161 - acc: 0.7965 - val_loss: 0.6039 - val_acc: 0.7826
Epoch 6/40
15000/15000 [==============================] - 1s 49us/sample - loss: 0.5741 - acc: 0.8156 - val_loss: 0.5629 - val_acc: 0.8085
Epoch 7/40
15000/15000 [==============================] - 1s 45us/sample - loss: 0.5267 - acc: 0.8345 - val_loss: 0.5189 - val_acc: 0.8215
Epoch 8/40
15000/15000 [==============================] - 1s 47us/sample - loss: 0.4795 - acc: 0.8493 - val_loss: 0.4782 - val_acc: 0.8360
Epoch 9/40
15000/15000 [==============================] - 1s 44us/sample - loss: 0.4357 - acc: 0.8647 - val_loss: 0.4417 - val_acc: 0.8449
Epoch 10/40
15000/15000 [==============================] - 1s 47us/sample - loss: 0.3970 - acc: 0.8756 - val_loss: 0.4113 - val_acc: 0.8516
Epoch 11/40
15000/15000 [==============================] - 1s 45us/sample - loss: 0.3641 - acc: 0.8836 - val_loss: 0.3862 - val_acc: 0.8617
Epoch 12/40
15000/15000 [==============================] - 1s 44us/sample - loss: 0.3363 - acc: 0.8916 - val_loss: 0.3665 - val_acc: 0.8638
Epoch 13/40
15000/15000 [==============================] - 1s 44us/sample - loss: 0.3133 - acc: 0.8970 - val_loss: 0.3490 - val_acc: 0.8698
Epoch 14/40
15000/15000 [==============================] - 1s 45us/sample - loss: 0.2922 - acc: 0.9023 - val_loss: 0.3358 - val_acc: 0.8733
Epoch 15/40
15000/15000 [==============================] - 1s 43us/sample - loss: 0.2744 - acc: 0.9062 - val_loss: 0.3252 - val_acc: 0.8743
Epoch 16/40
15000/15000 [==============================] - 1s 45us/sample - loss: 0.2585 - acc: 0.9117 - val_loss: 0.3162 - val_acc: 0.8745
Epoch 17/40
15000/15000 [==============================] - 1s 49us/sample - loss: 0.2439 - acc: 0.9155 - val_loss: 0.3088 - val_acc: 0.8788
Epoch 18/40
15000/15000 [==============================] - 1s 46us/sample - loss: 0.2311 - acc: 0.9211 - val_loss: 0.3027 - val_acc: 0.8807
Epoch 19/40
15000/15000 [==============================] - 1s 48us/sample - loss: 0.2196 - acc: 0.9232 - val_loss: 0.2975 - val_acc: 0.8824
Epoch 20/40
15000/15000 [==============================] - 1s 44us/sample - loss: 0.2093 - acc: 0.9269 - val_loss: 0.2940 - val_acc: 0.8828
Epoch 21/40
15000/15000 [==============================] - 1s 45us/sample - loss: 0.1988 - acc: 0.9333 - val_loss: 0.2910 - val_acc: 0.8833
Epoch 22/40
15000/15000 [==============================] - 1s 45us/sample - loss: 0.1901 - acc: 0.9356 - val_loss: 0.2888 - val_acc: 0.8841
Epoch 23/40
15000/15000 [==============================] - 1s 47us/sample - loss: 0.1813 - acc: 0.9407 - val_loss: 0.2878 - val_acc: 0.8829
Epoch 24/40
15000/15000 [==============================] - 1s 44us/sample - loss: 0.1737 - acc: 0.9435 - val_loss: 0.2861 - val_acc: 0.8841
Epoch 25/40
15000/15000 [==============================] - 1s 46us/sample - loss: 0.1660 - acc: 0.9469 - val_loss: 0.2850 - val_acc: 0.8852
Epoch 26/40
15000/15000 [==============================] - 1s 44us/sample - loss: 0.1591 - acc: 0.9499 - val_loss: 0.2858 - val_acc: 0.8838
Epoch 27/40
15000/15000 [==============================] - 1s 44us/sample - loss: 0.1526 - acc: 0.9526 - val_loss: 0.2854 - val_acc: 0.8854
Epoch 28/40
15000/15000 [==============================] - 1s 45us/sample - loss: 0.1464 - acc: 0.9549 - val_loss: 0.2858 - val_acc: 0.8858
Epoch 29/40
15000/15000 [==============================] - 1s 45us/sample - loss: 0.1410 - acc: 0.9579 - val_loss: 0.2881 - val_acc: 0.8834
Epoch 30/40
15000/15000 [==============================] - 1s 44us/sample - loss: 0.1354 - acc: 0.9594 - val_loss: 0.2874 - val_acc: 0.8863
Epoch 31/40
15000/15000 [==============================] - 1s 46us/sample - loss: 0.1297 - acc: 0.9615 - val_loss: 0.2888 - val_acc: 0.8864
Epoch 32/40
15000/15000 [==============================] - 1s 47us/sample - loss: 0.1245 - acc: 0.9647 - val_loss: 0.2905 - val_acc: 0.8860
Epoch 33/40
15000/15000 [==============================] - 1s 49us/sample - loss: 0.1196 - acc: 0.9662 - val_loss: 0.2930 - val_acc: 0.8838
Epoch 34/40
15000/15000 [==============================] - 1s 52us/sample - loss: 0.1153 - acc: 0.9671 - val_loss: 0.2949 - val_acc: 0.8855
Epoch 35/40
15000/15000 [==============================] - 1s 52us/sample - loss: 0.1111 - acc: 0.9685 - val_loss: 0.2982 - val_acc: 0.8848
Epoch 36/40
15000/15000 [==============================] - 1s 46us/sample - loss: 0.1068 - acc: 0.9711 - val_loss: 0.3000 - val_acc: 0.8847
Epoch 37/40
15000/15000 [==============================] - 1s 45us/sample - loss: 0.1026 - acc: 0.9719 - val_loss: 0.3026 - val_acc: 0.8842
Epoch 38/40
15000/15000 [==============================] - 1s 66us/sample - loss: 0.0987 - acc: 0.9730 - val_loss: 0.3063 - val_acc: 0.8829
Epoch 39/40
15000/15000 [==============================] - 1s 50us/sample - loss: 0.0956 - acc: 0.9745 - val_loss: 0.3098 - val_acc: 0.8820
Epoch 40/40
15000/15000 [==============================] - 1s 47us/sample - loss: 0.0916 - acc: 0.9769 - val_loss: 0.3122 - val_acc: 0.8818

Evaluating the model

The model returns two values: the loss (a number representing the error; lower is better) and the accuracy.

results = model.evaluate(test_data, test_labels)

print(results)

The output is:

25000/25000 [==============================] - 2s 68us/sample - loss: 0.3419 - acc: 0.8696
[0.3418755183649063, 0.86956]

Tracking accuracy and loss

history_dict = history.history
history_dict.keys()

model.fit() returns a History object whose history attribute is a dictionary recording everything that was monitored during training; the two lines above retrieve its keys.
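
With the metrics configured above, the dictionary typically contains four entries (the exact key names vary between TensorFlow versions, e.g. 'acc' vs 'accuracy'):

print(history_dict.keys())
# dict_keys(['loss', 'acc', 'val_loss', 'val_acc'])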

import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(1, len(acc) + 1)

# "bo" is for "blue dot"
plt.plot(epochs, loss, 'bo', label='Training loss')
# b is for "solid blue line"
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

plt.show()

This shows the trend of the loss: in the training log above, the validation loss stops improving after roughly epoch 25 while the training loss keeps falling, an indication of overfitting.

plt.clf()   # clear figure
acc_values = history_dict['acc']
val_acc_values = history_dict['val_acc']

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.show()

This shows the trend of the accuracy.

The Chinese THUCNews Dataset

Dataset source: THUCTC: An Efficient Chinese Text Classification Toolkit

The original dataset is very large, so this exercise uses 10 of its categories with 6,500 news articles each, 65,000 articles in total.

The categories are: 体育 (sports), 财经 (finance), 房产 (real estate), 家居 (home), 教育 (education), 科技 (technology), 时尚 (fashion), 时政 (politics), 游戏 (games), 娱乐 (entertainment).

The data is split as follows:

  • Training set: 5000 * 10
  • Validation set: 500 * 10
  • Test set: 1000 * 10

Preprocessing

data/cnews_loader.py is the data preprocessing module.

  • read_file(): reads a data file;
  • build_vocab(): builds the vocabulary using a character-level representation and saves it to disk so that it does not have to be rebuilt every time (an illustrative sketch follows the list);
  • read_vocab(): reads the stored vocabulary and converts it into a {word: id} mapping;
  • read_category(): fixes the set of categories and converts it into a {category: id} mapping;
  • to_words(): converts a piece of id-encoded data back into text;
  • process_file(): converts the dataset from text into fixed-length id sequences;
  • batch_iter(): prepares shuffled batches of data for training the neural network.
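
The real implementations live in cnews_loader.py; purely as an illustration, a character-level vocabulary builder in the spirit of build_vocab() might look like the sketch below (the function and parameter names here are assumptions, not the repository's actual code):

from collections import Counter

def build_char_vocab(texts, vocab_size=5000):
    # count every character across all texts and keep the most frequent ones
    counter = Counter()
    for text in texts:
        counter.update(text)
    most_common = [char for char, _ in counter.most_common(vocab_size - 1)]
    # reserve index 0 for a <PAD> token, as in the IMDB example above
    chars = ['<PAD>'] + most_common
    char_to_id = {char: idx for idx, char in enumerate(chars)}
    return chars, char_to_id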

After preprocessing, the data has the following shapes:

Data     Shape           Data     Shape
x_train  [50000, 600]    y_train  [50000, 10]
x_val    [5000, 600]     y_val    [5000, 10]
x_test   [10000, 600]    y_test   [10000, 10]

Plugging this dataset into the model used for the English dataset still yields an accuracy of around 90%.

Recall, Precision, ROC Curve, AUC, and PR Curve

Recall, precision, and accuracy

Four basic concepts:
True Positive: the true class is positive and the predicted class is positive;
False Negative: the true class is positive and the predicted class is negative;
False Positive: the true class is negative and the predicted class is positive;
True Negative: the true class is negative and the predicted class is negative.

From these, the following metrics are derived:
Accuracy: the number of correctly classified samples divided by the total number of samples

$$ACC=\frac{TP+TN}{TP+FN+FP+TN}$$

Precision: the number of correctly predicted positive samples divided by the number of samples predicted as positive

$$P=\frac{TP}{TP+FP}$$

Recall: the proportion of actual positive samples that are correctly classified as positive

$$R=\frac{TP}{FN+TP}$$
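
A small sketch computing these three metrics directly from the confusion-matrix counts; the label arrays are made-up examples:

import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # hypothetical ground-truth labels
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])  # hypothetical predictions

tp = np.sum((y_true == 1) & (y_pred == 1))
tn = np.sum((y_true == 0) & (y_pred == 0))
fp = np.sum((y_true == 0) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(accuracy, precision, recall)  # 0.75 0.75 0.75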

ROC Curve

For a binary classification problem, the classifier usually does not output 0 or 1 directly but a number between 0 and 1. A threshold is then set manually, say 0.4: anything below 0.4 is classified as 0, and anything greater than or equal to 0.4 is classified as 1.

Picture the score distributions of the two classes: blue for the samples whose true class is negative, red for the positive class. Drawing a vertical line through the scores splits them in two: everything to the left of the line is classified as negative, everything to the right as positive. That line is the threshold.

This is the motivation behind the ROC curve: as the threshold is varied, the ROC curve plots the true positive rate TPR = TP / (TP + FN) (identical to recall) on the y-axis against the false positive rate FPR = FP / (FP + TN) on the x-axis.

To understand these two metrics in a concrete setting, consider medical diagnosis, where the goal is to find the patients who are ill.
Catching as many of the truly ill patients as possible is the main task, so the first metric, TPR, should be as high as possible.
Misdiagnosing healthy patients as ill corresponds to the second metric, FPR, which should be as low as possible.

From this we can see that the top-left point of the ROC plot (TPR = 1, FPR = 0) corresponds to perfect classification.

AUC

The AUC is the area under the ROC curve; clearly, the larger the AUC, the better the classifier.

AUC = 1: a perfect classifier. With such a model, a perfect prediction is obtained no matter which threshold is chosen. In the vast majority of real prediction settings, no perfect classifier exists.

0.5 < AUC < 1: better than random guessing. If its threshold is set properly, this classifier (model) has predictive value.

AUC = 0.5: the same as random guessing (e.g. flipping a coin); the model has no predictive value.

AUC < 0.5: worse than random guessing; however, simply inverting every prediction makes it better than random guessing.

The physical meaning of AUC:
Suppose the classifier outputs a score (confidence) that a sample belongs to the positive class. The AUC is then the probability that, for a randomly chosen pair of (positive, negative) samples, the positive sample's score is higher than the negative sample's score.
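
A sketch of this interpretation, estimating the AUC directly as the fraction of (positive, negative) pairs in which the positive sample scores higher (ties counted as 0.5) and comparing it with scikit-learn's roc_auc_score; the labels and scores are made-up examples:

import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 1, 1, 0, 0, 0, 0])                 # hypothetical labels
scores = np.array([0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1])   # hypothetical positive-class scores

pos = scores[y_true == 1]
neg = scores[y_true == 0]
# probability that a random positive sample outscores a random negative sample
pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
auc_by_definition = np.mean(pairs)

print(auc_by_definition, roc_auc_score(y_true, scores))  # the two values agree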

PR Curve

What the PR curve has in common with the ROC curve is that both use TPR (recall), and the area under either curve can be used to measure a classifier's performance. The difference is that the ROC curve uses FPR, while the PR curve uses precision.

mAP corresponds to the area under the PR curve.
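
A sketch of plotting a PR curve and computing the average precision (a common area-under-the-PR-curve summary) with scikit-learn; the labels and scores are the same made-up examples as in the AUC sketch:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, average_precision_score

y_true = np.array([1, 1, 1, 0, 0, 0, 0])                 # hypothetical labels
scores = np.array([0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1])   # hypothetical positive-class scores

precision, recall, _ = precision_recall_curve(y_true, scores)
ap = average_precision_score(y_true, scores)

plt.plot(recall, precision, 'b', label='AP = %.3f' % ap)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall curve')
plt.legend()
plt.show()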

https://tensorflow.google.cn/tutorials/keras/basic_text_classification#create_a_graph_of_accuracy_and_loss_over_time

https://blog.csdn.net/u011439796/article/details/77692621