The Principle of Attention

The attention mechanism is, at its core, a weighted sum: the parts of the input that deserve more attention are given larger weights. This helps address some of the limitations of RNNs.

Attention is actually quite simple. Consider the translation:
我喜欢游泳 -> I like swimming
When generating each target word, the source words that correspond more closely to it receive more attention and therefore have a larger influence, for example:
I = f(0.7 · "我", 0.2 · "喜欢", 0.1 · "游泳")
like = f(0.2 · "我", 0.6 · "喜欢", 0.2 · "游泳")
swimming = f(0.1 · "我", 0.2 · "喜欢", 0.7 · "游泳")
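
Written more formally (this is the generic formulation of attention, not tied to this particular example): for a decoder state $s_i$ and encoder hidden states $h_j$, a score is computed for every pair, the scores are normalized with a softmax, and the context vector is the resulting weighted sum:

$$e_{ij} = \mathrm{score}(s_i, h_j), \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k}\exp(e_{ik})}, \qquad c_i = \sum_{j} \alpha_{ij} h_j$$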

Attention was introduced for two purposes:

  1. To reduce the computational burden of processing high-dimensional input data: by selecting a structured subset of the input, the data dimensionality is lowered.

  2. To "separate the true from the false": to let the system focus on the parts of the input that are salient and relevant to the current output, thereby improving output quality.

The ultimate goal of an attention model is to help frameworks such as encoder-decoders learn the relationships between different content modalities and thus represent that information better, overcoming the drawback that such relationships are hard to interpret and therefore hard to design by hand. As the problems above suggest, the attention mechanism is well suited to inferring the mapping between different modalities of data; such mappings are hidden, complex, and hard to explain. This is exactly where attention shines: it needs no extra supervision signal, which makes it particularly effective for problems with very little prior knowledge.

Hierarchical Attention Networks

The architecture of the Hierarchical Attention Network is shown in Figure 1. The network can be viewed as two parts: a word-level attention part and a sentence-level attention part. The network splits a document into sentences (for example, a long sentence can be split into shorter ones at commas). Each sentence is mapped to a vector by a bidirectional RNN combined with an attention mechanism, and the resulting sequence of sentence vectors is then passed through another bidirectional RNN with attention to classify the text.

[Figure 1: architecture of the Hierarchical Attention Network]
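
To make the two-level data flow concrete, here is a minimal NumPy sketch of the structure described above. It is only an illustration under simplifying assumptions: the BiRNN encoders are replaced by random arrays standing in for hidden states, and the function and variable names (attention_pool, w_word, u_sent, ...) are ours, not the paper's.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(h, w, u):
    # h: (T, d) hidden states, w: (d, a) projection, u: (a,) context vector
    scores = np.tanh(h @ w) @ u        # one unnormalized score per time step, shape (T,)
    alpha = softmax(scores)            # attention weights
    return alpha @ h                   # weighted sum of the hidden states, shape (d,)

rng = np.random.default_rng(0)
d, a = 8, 4                            # hidden size and attention size (toy values)
doc = [rng.normal(size=(5, d)),        # "sentence 1": 5 word-level hidden states
       rng.normal(size=(7, d))]        # "sentence 2": 7 word-level hidden states

w_word, u_word = rng.normal(size=(d, a)), rng.normal(size=a)
w_sent, u_sent = rng.normal(size=(d, a)), rng.normal(size=a)

# word level: pool each sentence's word states into a sentence vector
sent_vecs = np.stack([attention_pool(h, w_word, u_word) for h in doc])  # (2, d)
# sentence level: pool the sentence vectors into a single document vector
doc_vec = attention_pool(sent_vecs, w_sent, u_sent)                     # (d,)
print(doc_vec.shape)                                                    # (8,)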

Word-Level Attention

The task addressed in the paper is document classification, where each document to be classified can be divided into several sentences. The first part of the hierarchical attention model therefore processes each sentence separately. The input to the first bidirectional RNN is each word $w_{it}$ of a sentence, and the computation is as follows:

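For word $w_{it}$ (the $t$-th word of sentence $i$), the paper embeds the word, encodes it with a bidirectional RNN (GRUs in the paper; the code below uses LSTMs instead), and applies word-level attention:

$$x_{it} = W_e w_{it}, \qquad h_{it} = [\overrightarrow{h}_{it}; \overleftarrow{h}_{it}]$$

$$u_{it} = \tanh(W_w h_{it} + b_w), \qquad \alpha_{it} = \frac{\exp(u_{it}^\top u_w)}{\sum_t \exp(u_{it}^\top u_w)}, \qquad s_i = \sum_t \alpha_{it} h_{it}$$

Here $u_w$ is a learned word-level context vector and $s_i$ is the resulting sentence vector.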

That is: first, the outputs of the bidirectional RNN are transformed by a linear layer; then a softmax computes the importance of each word; finally, the representation of the sentence is obtained as the weighted average of the bidirectional RNN outputs.

Sentence-Level Attention

The sentence-level attention model works in much the same way as the word-level one.

Finally, the most common softmax classifier is used to classify the whole document.
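
Concretely, in the paper the sentence vectors $s_i$ are encoded by a second bidirectional RNN into hidden states $h_i$, a sentence-level context vector $u_s$ plays the same role as $u_w$ above, and the resulting document vector $v$ is fed to the softmax classifier:

$$u_i = \tanh(W_s h_i + b_s), \qquad \alpha_i = \frac{\exp(u_i^\top u_s)}{\sum_i \exp(u_i^\top u_s)}, \qquad v = \sum_i \alpha_i h_i, \qquad p = \mathrm{softmax}(W_c v + b_c)$$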

Text Classification

The BiRNN (bidirectional RNN) in the code uses LSTM cells.

The code is based on TensorFlow 1.0.

#-*- coding:utf-8 -*-

import tensorflow as tf
import numpy as np

class BiRNN(object):
    """
    Bidirectional RNN with attention for text classification.
    """
    def __init__(self, embedding_size, rnn_size, layer_size,
                 vocab_size, attn_size, sequence_length, n_classes, grad_clip, learning_rate):
        """
        - embedding_size: word embedding dimension
        - rnn_size : hidden state dimension
        - layer_size : number of rnn layers
        - vocab_size : vocabulary size
        - attn_size : attention layer dimension
        - sequence_length : max sequence length
        - n_classes : number of target labels
        - grad_clip : gradient clipping threshold
        - learning_rate : initial learning rate
        """

        self.output_keep_prob = tf.placeholder(tf.float32, name='output_keep_prob')
        self.input_data = tf.placeholder(tf.int32, shape=[None, sequence_length], name='input_data')
        self.targets = tf.placeholder(tf.float32, shape=[None, n_classes], name='targets')

        # forward RNN cell
        with tf.name_scope('fw_rnn'), tf.variable_scope('fw_rnn'):
            print(tf.get_variable_scope().name)
            lstm_fw_cell_list = [tf.contrib.rnn.LSTMCell(rnn_size) for _ in range(layer_size)]
            lstm_fw_cell_m = tf.contrib.rnn.DropoutWrapper(tf.contrib.rnn.MultiRNNCell(lstm_fw_cell_list), output_keep_prob=self.output_keep_prob)

        # backward RNN cell
        with tf.name_scope('bw_rnn'), tf.variable_scope('bw_rnn'):
            print(tf.get_variable_scope().name)
            lstm_bw_cell_list = [tf.contrib.rnn.LSTMCell(rnn_size) for _ in range(layer_size)]
            # note: wrap the backward cell list here (the original snippet reused the forward list by mistake)
            lstm_bw_cell_m = tf.contrib.rnn.DropoutWrapper(tf.contrib.rnn.MultiRNNCell(lstm_bw_cell_list), output_keep_prob=self.output_keep_prob)

        with tf.device('/cpu:0'):
            embedding = tf.Variable(tf.truncated_normal([vocab_size, embedding_size], stddev=0.1), name='embedding')
            inputs = tf.nn.embedding_lookup(embedding, self.input_data)

        # self.input_data shape: (batch_size, sequence_length)
        # inputs shape         : (batch_size, sequence_length, embedding_size)

        # static_bidirectional_rnn expects a list of sequence_length tensors,
        # each of shape (batch_size, embedding_size), so the inputs are reshaped:
        # (batch_size, sequence_length, embedding_size) -> (sequence_length, batch_size, embedding_size)
        inputs = tf.transpose(inputs, [1, 0, 2])
        # -> (sequence_length * batch_size, embedding_size)
        inputs = tf.reshape(inputs, [-1, embedding_size])
        # -> list of sequence_length tensors, each (batch_size, embedding_size)
        inputs = tf.split(inputs, sequence_length, 0)

        with tf.name_scope('bi_rnn'), tf.variable_scope('bi_rnn'):
            outputs, _, _ = tf.contrib.rnn.static_bidirectional_rnn(lstm_fw_cell_m, lstm_bw_cell_m, inputs, dtype=tf.float32)

        # attention layer
        attention_size = attn_size
        with tf.name_scope('attention'), tf.variable_scope('attention'):
            attention_w = tf.Variable(tf.truncated_normal([2 * rnn_size, attention_size], stddev=0.1), name='attention_w')
            attention_b = tf.Variable(tf.constant(0.1, shape=[attention_size]), name='attention_b')
            u_list = []
            for t in range(sequence_length):
                # u_t = tanh(W * h_t + b)
                u_t = tf.tanh(tf.matmul(outputs[t], attention_w) + attention_b)
                u_list.append(u_t)
            u_w = tf.Variable(tf.truncated_normal([attention_size, 1], stddev=0.1), name='attention_uw')
            attn_z = []
            for t in range(sequence_length):
                # unnormalized score z_t = u_t . u_w
                z_t = tf.matmul(u_list[t], u_w)
                attn_z.append(z_t)
            # (batch_size, sequence_length)
            attn_zconcat = tf.concat(attn_z, axis=1)
            self.alpha = tf.nn.softmax(attn_zconcat)
            # reshape to (sequence_length, batch_size, 1), same rank as outputs
            alpha_trans = tf.reshape(tf.transpose(self.alpha, [1, 0]), [sequence_length, -1, 1])
            # outputs (a list) is implicitly stacked to (sequence_length, batch_size, 2*rnn_size);
            # the weighted sum over time gives (batch_size, 2*rnn_size)
            self.final_output = tf.reduce_sum(outputs * alpha_trans, 0)

        print(self.final_output.shape)
        fc_w = tf.Variable(tf.truncated_normal([2 * rnn_size, n_classes], stddev=0.1), name='fc_w')
        fc_b = tf.Variable(tf.zeros([n_classes]), name='fc_b')

        # without attention one could instead take the output at the last time step:
        # self.final_output = outputs[-1]

        # fully connected layer for classification
        self.logits = tf.matmul(self.final_output, fc_w) + fc_b
        self.prob = tf.nn.softmax(self.logits)

        self.cost = tf.losses.softmax_cross_entropy(self.targets, self.logits)
        tvars = tf.trainable_variables()
        grads, _ = tf.clip_by_global_norm(tf.gradients(self.cost, tvars), grad_clip)

        optimizer = tf.train.AdamOptimizer(learning_rate)
        self.train_op = optimizer.apply_gradients(zip(grads, tvars))
        self.accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(self.targets, axis=1), tf.argmax(self.prob, axis=1)), tf.float32))

    def inference(self, sess, labels, inputs):
        prob = sess.run(self.prob, feed_dict={self.input_data: inputs, self.output_keep_prob: 1.0})
        ret = np.argmax(prob, 1)
        ret = [labels[i] for i in ret]
        return ret


if __name__ == '__main__':
    model = BiRNN(128, 128, 2, 100, 256, 50, 30, 5, 0.001)
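
To show how this graph might be used, here is a minimal, hypothetical training sketch (TF 1.x style), meant to be run as a standalone script so it builds its own graph. The random batches, the batch size of 32, and the names x_batch / y_batch are illustrative stand-ins for a real data pipeline and are not part of the original code; the hyperparameters mirror the toy values in __main__ above.

import numpy as np
import tensorflow as tf

# toy hyperparameters matching the __main__ example above
model = BiRNN(128, 128, 2, 100, 256, 50, 30, 5, 0.001)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(100):
        # random dummy batch purely for illustration; replace with real data
        x_batch = np.random.randint(0, 100, size=(32, 50))                           # word ids
        y_batch = np.eye(30)[np.random.randint(0, 30, size=32)].astype(np.float32)   # one-hot labels
        _, loss, acc = sess.run(
            [model.train_op, model.cost, model.accuracy],
            feed_dict={model.input_data: x_batch,
                       model.targets: y_batch,
                       model.output_keep_prob: 0.5})
        if step % 10 == 0:
            print(step, loss, acc)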
