Transformer

12 분 소요

[논문 리뷰 스터디 1] Natural Language Process

처음으로 리뷰할 논문은 자연어 처리 분야의 “Attention is all you need”이다.

1. Attention Mechanism 개요

Attention은 기계 번역을 할 때 문장에서 번역하고자 하는 언어의 모든 단어를 주목하는 것이 아니라, 어떤 특정 단어들을 중요하게 취급(주목, Attention)해야 하는지를 결정하기 위하여 개발되었다.

1.1. Attention Map

기존에는 순서가 존재하는 RNN(순차적으로 학습)의 Encoder와 Decoder를 이용하여 진행하였기 때문에 번역하고자 하는 언어의 단어와 번역된 언어의 단어간의 직접적인 관계를 파악하기 어려웠다.

Attention은 단어와 단어간의 직접적인 관계를 파악할 수 있으며, Attention Map 은 이것을 시각화 한 그림으로 단어의 흐름을 파악할 수 있다(흰색 -> Attention이 더 많이 들어가 있는 것).

출처: Bahdanau et al. 2015 “Neural machine translation by jointly learning to align and translate”

1.2. Mechanism

기존 모델에서는 Encoder에서 추출된 하나의 vector(전체 문장의 정보가 함축되어 있음)가 Decoder의 hidden state로 작동하였다. 수식적으로는 argmax(p(y|x))로 I am a student라는 문장이 입력되었을 때 Je suis etudiant라는 문장이 나올 확률을 가장 높일 수 있도록 y에 대한 parameter를 최적화하며 학습을 진행한다.

end vector 하나로 Decoder를 진행함으로써 발생하는 문제를 long-range dependency problem라고 하는데, 현재의 time step과 멀어지면 멀어질수록 정보의 손실이 커짐을 의미한다.

이러한 문제점을 해결하기 위하여 Attention 방법에서는 context vector를 만들어 전체 정보를 활용할 수 있도록 보완하였다. 아래 그림에서 볼 수 있듯 I am a student 문장에서의 모든 단어를 할용할 수 있도록 hidden state를 구성한다.

예를 들면 Je라는 단어에서 suis라는 단어를 예측할 때와 그리고 suis라는 단어에서 etudiant라는 단어를 예측할 때마다 context vector를 새롭게 계산하여 사용한다.

출처: https://www.tensorflow.org/images/seq2seq/attention_mechanism.jpg?hl=ko

Context vector를 계산하는 방법은 다음과 같다. 먼저 Ju라는 단어에 대하여 I, am, a, student 단어들(key)과 attention weight(유사도 개념으로 보통 Bahdanau Attention을 사용)을 계산한다.

계산된 weight를 이용하여 I, am, a, student 단어들(value)과 다시 가중합을 구하고 이를 convext vector로 사용하여 다음에 올 단어가 무엇인지 예측한다. 그리고 이 과정을 반복하여 수행한다.

Attention 방법은 long-range dependency problem을 해결하여 문장이 길더라도 좋은 성능을 보여주었으며, 해석가능한 모델이라는 점에 그 장점이 있다.

1.3. Luong vs Bahdanau

출처: https://stackoverflow.com/questions/44238154/what-is-the-difference-between-luong-attention-and-bahdanau-attention

2. Transformer Mechanism 개요

Attention 모델을 발전시켜 만들어진 것이 Transformaer 모델인데, 이는 자연어를 이해하는 부분인 Encoder와 자연어를 생성하는 부분인 Decoder로 구성되어 있으며 RNN구조를 사용하는 것이 아닌 행렬 연산으로 계산을 진행한다.

따라서 Trnasformer 모델은 거리에 영향을 받지 않는다. 이러한 구조에서 Encoder를 이용한 모델이 Bert이고 Decoder를 이용한 모델이 GPT이다.

Attention 개념만 가저온 모델로서 병렬 연산이 가능해짐(문장 전체를 한 번에 학습)에 따라 학습 속도가 매우 빠르고, 단어간의 관계를 Attention 모델보다 정확하게 파악할 수 있다는 장점이 존재한다.

2.1. Attention map

input과 output이 동일한 self attention을 수행해며, 한 단어에 대하여 여러 특징을 추출하기 위하여 다양하게 map을 생성한다.

출처: https://data-science-blog.com/wp-content/uploads/2021/12/Transformer_head_img-1030x161.png

2.2. Transformer의 기본 구조

Attention과 동일하게 seq2seq(Encoder - Decoder)구조를 가지고 있지만, Attention Map을 생성할 때 재귀적으로 학습이 이뤄지는 것이 아니라 한번에 연산을 수행한다.

아래 모델은 2개의 sub-layer를 가진 Encoder와 3개의 sub-layer를 가진 Decoder로 이루어져 있으며, 해당 연산을 수행할 때 input과 output의 shape(tensor;batch size, seq of length, embedding size)는 모두 동일하다.

출처: https://mlwhiz.com/images/transformers/4.png

실질적으로 Attention을 수행하는 것은 오른편의 Multi-Head Attention과 Scaled Dot-product Attention에서 진행되며, Scaled Dot-product Attention의 결과에서 Attention Map을 얻을 수 있다.

또한, Transformer를 이해하기 위해서는 Attention Mask, Positive-wise FFNN, Encoder구조, Residual Connection & Layer Normalization, Decoder의 구조, Encoder-decoder Self Attention의 개념을 이해하여야 한다.

출처: https://deepfrench.gitlab.io/deep-learning-project/resources/transformer.png

3. Transformer

3.1. Input 데이터

데이터에 대하여 self-attention을 진행하기에 앞서, 순서나 위치 정보를 파악하기 위하여 아래와 같이 position encoding을 진행한다.

출처: https://www.scaler.com/topics/images/positionalencodingexampletable.webp

고유한 위치 정보를 보존하기 위해 d차원의 벡터를 생성하는데, 이때 i 값을 그대로 사용하게 되면 문장이 길어질 수록 값이 너무 커지기 때문에, 주기함수(sin, cos)를 사용하여 개선하였다.

출처: https://i.stack.imgur.com/PxeeE.png

각 포지션 마다 다른 주기를 부여(i가 커질수록 주기가 길어지도록)하여 동일한 위치 embedding이 입력되더라도 다른 값을 출력할 수 있도록 하였다.

출처: https://www.scaler.com/topics/images/sin-curve1.webp

def get_angles(pos, i, d_model):
  angle_rates = 1 / np.power(10000, (2 * (i//2)) / np.float32(d_model))
  return pos * angle_rates

def positional_encoding(position, d_model):
  angle_rads = get_angles(np.arange(position)[:, np.newaxis],
                          np.arange(d_model)[np.newaxis, :],
                          d_model)

  # apply sin to even indices in the array; 2i
  angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])

  # apply cos to odd indices in the array; 2i+1
  angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])

  pos_encoding = angle_rads[np.newaxis, ...]

  return tf.cast(pos_encoding, dtype=tf.float32)

n, d = 2048, 512
pos_encoding = positional_encoding(n, d)
print(pos_encoding.shape)
pos_encoding = pos_encoding[0]

# Juggle the dimensions for the plot
pos_encoding = tf.reshape(pos_encoding, (n, d//2, 2))
pos_encoding = tf.transpose(pos_encoding, (2, 1, 0))
pos_encoding = tf.reshape(pos_encoding, (d, n))

plt.pcolormesh(pos_encoding, cmap='RdBu')
plt.ylabel('Depth')
plt.xlabel('Position')
plt.colorbar()
plt.show()

#출처: https://www.tensorflow.org/text/tutorials/transformer?hl=ko

Attention Map 은 단어가 지니고 있는 여러 특징들을 파악할 수 있도록 다수의 map을 만들어 내는데 이를 위하여 Input 데이터를 head 수에 맞춰 아래와 같이 4차원의 벡터로 변형해 주어야 한다.

(batch size, seq of length, embedding size) -> (batch size, head size, seq of length, embedding size/head size)

한번에 진행되는 것은 아니고 resahpe과정과 transform 과정이 수반된다.

1) original: (batch size, seq of length, embedding size/head size)

2) reshape: (batch size, seq of length, head size, query size)

3) transform: (batch size, head size, seq of length, query size)

출처: https://jalammar.github.io/images/t/transformer_multi-headed_self-attention-recap.png

def create_padding_mask(seq):
  seq = tf.cast(tf.math.equal(seq, 0), tf.float32)

  # add extra dimensions to add the padding
  # to the attention logits.
  return seq[:, tf.newaxis, tf.newaxis, :]  # (batch_size, 1, 1, seq_len)

x = tf.constant([[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]])
create_padding_mask(x)

def create_look_ahead_mask(size):
  mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
  return mask  # (seq_len, seq_len)

x = tf.random.uniform((1, 3))
temp = create_look_ahead_mask(x.shape[1])
temp

#출처: https://www.tensorflow.org/text/tutorials/transformer?hl=ko

3.2. Transformer의 구조 상세

3.2.1. Multi-Head Attention

4차원 벡터를 V(Vlaue), K(Key), Query(Q)로 복사하여 아래와 같이 self-attetntion을 진행한다. 3)에서 4)로 넘어가는 과정에서 각 가중치 W0V, W0K, W0Q Linear 연산을 거치며 V, K, Q는 각기 다른 값을 지닌 벡터가 된다.

출처: https://production-media.paperswithcode.com/methods/multi-head-attention_l1A3G7a.png

scaled dot-product attention 과정에서는 1.2에서 적용했던 방식이 그대로 적용되며, head별로 각 과정을 수행한 후에 다시 결과 값을 합쳐주는 과정을 수행한다.

마지막에 Linear 모델을 적용하여 데이터의 shape를 입력에서의 shape과 동일하게 유지시켜준다.

class MultiHeadAttention(tf.keras.layers.Layer):
  def __init__(self, d_model, num_heads):
    super(MultiHeadAttention, self).__init__()
    self.num_heads = num_heads
    self.d_model = d_model

    assert d_model % self.num_heads == 0

    self.depth = d_model // self.num_heads

    self.wq = tf.keras.layers.Dense(d_model)
    self.wk = tf.keras.layers.Dense(d_model)
    self.wv = tf.keras.layers.Dense(d_model)

    self.dense = tf.keras.layers.Dense(d_model)

  def split_heads(self, x, batch_size):
    """Split the last dimension into (num_heads, depth).
    Transpose the result such that the shape is (batch_size, num_heads, seq_len, depth)
    """
    x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
    return tf.transpose(x, perm=[0, 2, 1, 3])

  def call(self, v, k, q, mask):
    batch_size = tf.shape(q)[0]

    q = self.wq(q)  # (batch_size, seq_len, d_model)
    k = self.wk(k)  # (batch_size, seq_len, d_model)
    v = self.wv(v)  # (batch_size, seq_len, d_model)

    q = self.split_heads(q, batch_size)  # (batch_size, num_heads, seq_len_q, depth)
    k = self.split_heads(k, batch_size)  # (batch_size, num_heads, seq_len_k, depth)
    v = self.split_heads(v, batch_size)  # (batch_size, num_heads, seq_len_v, depth)

    # scaled_attention.shape == (batch_size, num_heads, seq_len_q, depth)
    # attention_weights.shape == (batch_size, num_heads, seq_len_q, seq_len_k)
    scaled_attention, attention_weights = scaled_dot_product_attention(
        q, k, v, mask)

    scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])  # (batch_size, seq_len_q, num_heads, depth)

    concat_attention = tf.reshape(scaled_attention,
                                  (batch_size, -1, self.d_model))  # (batch_size, seq_len_q, d_model)

    output = self.dense(concat_attention)  # (batch_size, seq_len_q, d_model)

    return output, attention_weights

temp_mha = MultiHeadAttention(d_model=512, num_heads=8)
y = tf.random.uniform((1, 60, 512))  # (batch_size, encoder_sequence, d_model)
out, attn = temp_mha(y, k=y, q=y, mask=None)
out.shape, attn.shape

#출처: https://www.tensorflow.org/text/tutorials/transformer?hl=ko

3.2.2. Scaled Dot-product Attention

Multi-Head Attention 내에서 Scaled Dot-priduct Attention 연산이 진행되는데, 연산의 과정은 다음과 같다. 먼저 Q와 K를 이용한 연산을 통해 SoftMax를 거쳐 weight를 구하는데, 여기서 Attention map을 얻을 수 있다. V와 MatMul 연산을 통해 최종 output을 계산한다.

중간에 Mask라는 연산을 수행하는데, 이는 문장의 길이가 서로 달라 생기는 차이를 padding을 이용하여 동일하게 맞춰주는 과정이다.

출처: https://production-media.paperswithcode.com/methods/35184258-10f5-4cd0-8de3-bd9bc8f88dc3.png!

def scaled_dot_product_attention(q, k, v, mask):
  """Calculate the attention weights.
  q, k, v must have matching leading dimensions.
  k, v must have matching penultimate dimension, i.e.: seq_len_k = seq_len_v.
  The mask has different shapes depending on its type(padding or look ahead)
  but it must be broadcastable for addition.

  Args:
    q: query shape == (..., seq_len_q, depth)
    k: key shape == (..., seq_len_k, depth)
    v: value shape == (..., seq_len_v, depth_v)
    mask: Float tensor with shape broadcastable
          to (..., seq_len_q, seq_len_k). Defaults to None.

  Returns:
    output, attention_weights
  """

  matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)

  # scale matmul_qk
  dk = tf.cast(tf.shape(k)[-1], tf.float32)
  scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

  # add the mask to the scaled tensor.
  if mask is not None:
    scaled_attention_logits += (mask * -1e9)

  # softmax is normalized on the last axis (seq_len_k) so that the scores
  # add up to 1.
  attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)  # (..., seq_len_q, seq_len_k)

  output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth_v)

  return output, attention_weights

def print_out(q, k, v):
  temp_out, temp_attn = scaled_dot_product_attention(
      q, k, v, None)
  print('Attention weights are:')
  print(temp_attn)
  print('Output is:')
  print(temp_out)

np.set_printoptions(suppress=True)

temp_k = tf.constant([[10, 0, 0],
                      [0, 10, 0],
                      [0, 0, 10],
                      [0, 0, 10]], dtype=tf.float32)  # (4, 3)

temp_v = tf.constant([[1, 0],
                      [10, 0],
                      [100, 5],
                      [1000, 6]], dtype=tf.float32)  # (4, 2)

# This `query` aligns with the second `key`,
# so the second `value` is returned.
temp_q = tf.constant([[0, 10, 0]], dtype=tf.float32)  # (1, 3)
print_out(temp_q, temp_k, temp_v)

# This query aligns with a repeated key (third and fourth),
# so all associated values get averaged.
temp_q = tf.constant([[0, 0, 10]], dtype=tf.float32)  # (1, 3)
print_out(temp_q, temp_k, temp_v)

# This query aligns equally with the first and second key,
# so their values get averaged.
temp_q = tf.constant([[10, 10, 0]], dtype=tf.float32)  # (1, 3)
print_out(temp_q, temp_k, temp_v)

temp_q = tf.constant([[0, 0, 10],
                      [0, 10, 0],
                      [10, 10, 0]], dtype=tf.float32)  # (3, 3)
print_out(temp_q, temp_k, temp_v)

#출처: https://www.tensorflow.org/text/tutorials/transformer?hl=ko

3.2.3. Attention Mask

Masking은 아래와 같이 padding이 필요한 영역에 -inf값을 더해주는 것이다. 해당 영역은 다음 단계에서 SoftMax 연산을 거치며, 0의 값을 지니게 된다.

출처: https://cdn.hashnode.com/res/hashnode/image/upload/v1638824585791/vkXCmdGyw.png?auto=compress,format&format=webp

전체 구조에서 Masking은 총 세 번 수행된다.

1) Encoder의 첫 번째 sub-layer(padding): 문장의 길이가 다른 것을 맞춰주기 위하여 사용한다.

2) Decoder의 첫 번째 sub-layer(look a head): 해당 순서 이후의 단어는 고려하지 않기(정답가리기) 위하여 사용한다.

3) Decoder의 두 번째 sub-layer(padding): Encoder에서 사용한 것과 마찬가지로 문장의 길이가 다른 것을 맞춰주기 위하여 사용한다.

3.2.4. Position-wise Feed Forward Network(FFNN)

FFNN은 아래 그림과 같이 병렬적으로 적용된다.

출처: https://mlwhiz.com/images/transformers/8_huc65a66df5c091b1250a15290be6a8151_100130_1200x0_resize_box_2.png

def point_wise_feed_forward_network(d_model, dff):
  return tf.keras.Sequential([
      tf.keras.layers.Dense(dff, activation='relu'),  # (batch_size, seq_len, dff)
      tf.keras.layers.Dense(d_model)  # (batch_size, seq_len, d_model)
  ])
sample_ffn = point_wise_feed_forward_network(512, 2048)
sample_ffn(tf.random.uniform((64, 50, 512))).shape
#출처: https://www.tensorflow.org/text/tutorials/transformer?hl=ko

3.2.5. Residual Connection

첫번째 입력을 마지막 출력에 더해주는 Residual Connection 과정은 그래디언트가 소실되는 것을 방지해 주는 것으로layer 수를 늘려도 학습이 가능하도록 하고, 다양한 조합이 만들어지면서 앙상블의 효과를 볼 수 있게 한다.

출처: https://mlwhiz.com/images/transformers/15_hu08fef8dba1d9273c744d53408caeff4b_39526_1200x0_resize_box_2.png

3.2.6. Layer Normalization

Layer Normalization은 학습 데이터 분포와 검정 데이터 분포 차이에서 나오는 covariate shift 문제와 학습 layer내의 데이터 분포차이로 인해 발생하는 internal covariate shift 문제를 해결하여 과적합을 방지한다.

출처: https://towardsdatascience.com/understanding-dataset-shift-f2a5a262a766

Batch Normalization과 비슷하며 Layer Normalization에서는 normalize 기준이 layer가 된다.

출처: https://production-media.paperswithcode.com/methods/Screen_Shot_2020-05-19_at_4.24.42_PM.png

3.3. Encoder와 Decoder

3.3.1. Encoder layer

def point_wise_feed_forward_network(d_model, dff):
  return tf.keras.Sequential([
      tf.keras.layers.Dense(dff, activation='relu'),  # (batch_size, seq_len, dff)
      tf.keras.layers.Dense(d_model)  # (batch_size, seq_len, d_model)
  ])
sample_ffn = point_wise_feed_forward_network(512, 2048)
sample_ffn(tf.random.uniform((64, 50, 512))).shape
#출처: https://www.tensorflow.org/text/tutorials/transformer?hl=ko

3.3.2. Encoder

class EncoderLayer(tf.keras.layers.Layer):
  def __init__(self, d_model, num_heads, dff, rate=0.1):
    super(EncoderLayer, self).__init__()

    self.mha = MultiHeadAttention(d_model, num_heads)
    self.ffn = point_wise_feed_forward_network(d_model, dff)

    self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
    self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

    self.dropout1 = tf.keras.layers.Dropout(rate)
    self.dropout2 = tf.keras.layers.Dropout(rate)

  def call(self, x, training, mask):

    attn_output, _ = self.mha(x, x, x, mask)  # (batch_size, input_seq_len, d_model)
    attn_output = self.dropout1(attn_output, training=training)
    out1 = self.layernorm1(x + attn_output)  # (batch_size, input_seq_len, d_model)

    ffn_output = self.ffn(out1)  # (batch_size, input_seq_len, d_model)
    ffn_output = self.dropout2(ffn_output, training=training)
    out2 = self.layernorm2(out1 + ffn_output)  # (batch_size, input_seq_len, d_model)

    return out2

sample_encoder_layer = EncoderLayer(512, 8, 2048)

sample_encoder_layer_output = sample_encoder_layer(
    tf.random.uniform((64, 43, 512)), False, None)

sample_encoder_layer_output.shape  # (batch_size, input_seq_len, d_model)
#출처: https://www.tensorflow.org/text/tutorials/transformer?hl=ko

3.3.3. Decoder layer

class DecoderLayer(tf.keras.layers.Layer):
  def __init__(self, d_model, num_heads, dff, rate=0.1):
    super(DecoderLayer, self).__init__()

    self.mha1 = MultiHeadAttention(d_model, num_heads)
    self.mha2 = MultiHeadAttention(d_model, num_heads)

    self.ffn = point_wise_feed_forward_network(d_model, dff)

    self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
    self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
    self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

    self.dropout1 = tf.keras.layers.Dropout(rate)
    self.dropout2 = tf.keras.layers.Dropout(rate)
    self.dropout3 = tf.keras.layers.Dropout(rate)

  def call(self, x, enc_output, training,
           look_ahead_mask, padding_mask):
    # enc_output.shape == (batch_size, input_seq_len, d_model)

    attn1, attn_weights_block1 = self.mha1(x, x, x, look_ahead_mask)  # (batch_size, target_seq_len, d_model)
    attn1 = self.dropout1(attn1, training=training)
    out1 = self.layernorm1(attn1 + x)

    attn2, attn_weights_block2 = self.mha2(
        enc_output, enc_output, out1, padding_mask)  # (batch_size, target_seq_len, d_model)
    attn2 = self.dropout2(attn2, training=training)
    out2 = self.layernorm2(attn2 + out1)  # (batch_size, target_seq_len, d_model)

    ffn_output = self.ffn(out2)  # (batch_size, target_seq_len, d_model)
    ffn_output = self.dropout3(ffn_output, training=training)
    out3 = self.layernorm3(ffn_output + out2)  # (batch_size, target_seq_len, d_model)

    return out3, attn_weights_block1, attn_weights_block2

sample_decoder_layer = DecoderLayer(512, 8, 2048)

sample_decoder_layer_output, _, _ = sample_decoder_layer(
    tf.random.uniform((64, 50, 512)), sample_encoder_layer_output,
    False, None, None)

sample_decoder_layer_output.shape  # (batch_size, target_seq_len, d_model)

class Encoder(tf.keras.layers.Layer):
  def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size,
               maximum_position_encoding, rate=0.1):
    super(Encoder, self).__init__()

    self.d_model = d_model
    self.num_layers = num_layers

    self.embedding = tf.keras.layers.Embedding(input_vocab_size, d_model)
    self.pos_encoding = positional_encoding(maximum_position_encoding,
                                            self.d_model)

    self.enc_layers = [EncoderLayer(d_model, num_heads, dff, rate)
                       for _ in range(num_layers)]

    self.dropout = tf.keras.layers.Dropout(rate)

  def call(self, x, training, mask):

    seq_len = tf.shape(x)[1]

    # adding embedding and position encoding.
    x = self.embedding(x)  # (batch_size, input_seq_len, d_model)
    x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
    x += self.pos_encoding[:, :seq_len, :]

    x = self.dropout(x, training=training)

    for i in range(self.num_layers):
      x = self.enc_layers[i](x, training, mask)

    return x  # (batch_size, input_seq_len, d_model)

sample_encoder = Encoder(num_layers=2, d_model=512, num_heads=8,
                         dff=2048, input_vocab_size=8500,
                         maximum_position_encoding=10000)
temp_input = tf.random.uniform((64, 62), dtype=tf.int64, minval=0, maxval=200)

sample_encoder_output = sample_encoder(temp_input, training=False, mask=None)

print(sample_encoder_output.shape)  # (batch_size, input_seq_len, d_model)
#출처: https://www.tensorflow.org/text/tutorials/transformer?hl=ko

3.3.4. Decoder

class Decoder(tf.keras.layers.Layer):
  def __init__(self, num_layers, d_model, num_heads, dff, target_vocab_size,
               maximum_position_encoding, rate=0.1):
    super(Decoder, self).__init__()

    self.d_model = d_model
    self.num_layers = num_layers

    self.embedding = tf.keras.layers.Embedding(target_vocab_size, d_model)
    self.pos_encoding = positional_encoding(maximum_position_encoding, d_model)

    self.dec_layers = [DecoderLayer(d_model, num_heads, dff, rate)
                       for _ in range(num_layers)]
    self.dropout = tf.keras.layers.Dropout(rate)

  def call(self, x, enc_output, training,
           look_ahead_mask, padding_mask):

    seq_len = tf.shape(x)[1]
    attention_weights = {}

    x = self.embedding(x)  # (batch_size, target_seq_len, d_model)
    x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
    x += self.pos_encoding[:, :seq_len, :]

    x = self.dropout(x, training=training)

    for i in range(self.num_layers):
      x, block1, block2 = self.dec_layers[i](x, enc_output, training,
                                             look_ahead_mask, padding_mask)

      attention_weights[f'decoder_layer{i+1}_block1'] = block1
      attention_weights[f'decoder_layer{i+1}_block2'] = block2

    # x.shape == (batch_size, target_seq_len, d_model)
    return x, attention_weights

sample_decoder = Decoder(num_layers=2, d_model=512, num_heads=8,
                         dff=2048, target_vocab_size=8000,
                         maximum_position_encoding=5000)
temp_input = tf.random.uniform((64, 26), dtype=tf.int64, minval=0, maxval=200)

output, attn = sample_decoder(temp_input,
                              enc_output=sample_encoder_output,
                              training=False,
                              look_ahead_mask=None,
                              padding_mask=None)

output.shape, attn['decoder_layer2_block2'].shape
#출처: https://www.tensorflow.org/text/tutorials/transformer?hl=ko

3.4. Transformer

class Transformer(tf.keras.Model):
  def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size,
               target_vocab_size, pe_input, pe_target, rate=0.1):
    super().__init__()
    self.encoder = Encoder(num_layers, d_model, num_heads, dff,
                             input_vocab_size, pe_input, rate)

    self.decoder = Decoder(num_layers, d_model, num_heads, dff,
                           target_vocab_size, pe_target, rate)

    self.final_layer = tf.keras.layers.Dense(target_vocab_size)

  def call(self, inputs, training):
    # Keras models prefer if you pass all your inputs in the first argument
    inp, tar = inputs

    enc_padding_mask, look_ahead_mask, dec_padding_mask = self.create_masks(inp, tar)

    enc_output = self.encoder(inp, training, enc_padding_mask)  # (batch_size, inp_seq_len, d_model)

    # dec_output.shape == (batch_size, tar_seq_len, d_model)
    dec_output, attention_weights = self.decoder(
        tar, enc_output, training, look_ahead_mask, dec_padding_mask)

    final_output = self.final_layer(dec_output)  # (batch_size, tar_seq_len, target_vocab_size)

    return final_output, attention_weights

  def create_masks(self, inp, tar):
    # Encoder padding mask
    enc_padding_mask = create_padding_mask(inp)

    # Used in the 2nd attention block in the decoder.
    # This padding mask is used to mask the encoder outputs.
    dec_padding_mask = create_padding_mask(inp)

    # Used in the 1st attention block in the decoder.
    # It is used to pad and mask future tokens in the input received by
    # the decoder.
    look_ahead_mask = create_look_ahead_mask(tf.shape(tar)[1])
    dec_target_padding_mask = create_padding_mask(tar)
    look_ahead_mask = tf.maximum(dec_target_padding_mask, look_ahead_mask)

    return enc_padding_mask, look_ahead_mask, dec_padding_mask

sample_transformer = Transformer(
    num_layers=2, d_model=512, num_heads=8, dff=2048,
    input_vocab_size=8500, target_vocab_size=8000,
    pe_input=10000, pe_target=6000)

temp_input = tf.random.uniform((64, 38), dtype=tf.int64, minval=0, maxval=200)
temp_target = tf.random.uniform((64, 36), dtype=tf.int64, minval=0, maxval=200)

fn_out, _ = sample_transformer([temp_input, temp_target], training=False)

fn_out.shape  # (batch_size, tar_seq_len, target_vocab_size)
#출처: https://www.tensorflow.org/text/tutorials/transformer?hl=ko

num_layers = 4
d_model = 128
dff = 512
num_heads = 8
dropout_rate = 0.1
#출처: https://www.tensorflow.org/text/tutorials/transformer?hl=ko

class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
  def __init__(self, d_model, warmup_steps=4000):
    super(CustomSchedule, self).__init__()

    self.d_model = d_model
    self.d_model = tf.cast(self.d_model, tf.float32)

    self.warmup_steps = warmup_steps

  def __call__(self, step):
    arg1 = tf.math.rsqrt(step)
    arg2 = step * (self.warmup_steps ** -1.5)

    return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)

learning_rate = CustomSchedule(d_model)

optimizer = tf.keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.98,
                                     epsilon=1e-9)

temp_learning_rate_schedule = CustomSchedule(d_model)

plt.plot(temp_learning_rate_schedule(tf.range(40000, dtype=tf.float32)))
plt.ylabel("Learning Rate")
plt.xlabel("Train Step")

#출처: https://www.tensorflow.org/text/tutorials/transformer?hl=ko

3.5. loss 계산

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def loss_function(real, pred):
  mask = tf.math.logical_not(tf.math.equal(real, 0))
  loss_ = loss_object(real, pred)

  mask = tf.cast(mask, dtype=loss_.dtype)
  loss_ *= mask

  return tf.reduce_sum(loss_)/tf.reduce_sum(mask)


def accuracy_function(real, pred):
  accuracies = tf.equal(real, tf.argmax(pred, axis=2))

  mask = tf.math.logical_not(tf.math.equal(real, 0))
  accuracies = tf.math.logical_and(mask, accuracies)

  accuracies = tf.cast(accuracies, dtype=tf.float32)
  mask = tf.cast(mask, dtype=tf.float32)
  return tf.reduce_sum(accuracies)/tf.reduce_sum(mask)

train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.Mean(name='train_accuracy')

transformer = Transformer(
    num_layers=num_layers,
    d_model=d_model,
    num_heads=num_heads,
    dff=dff,
    input_vocab_size=tokenizers.pt.get_vocab_size().numpy(),
    target_vocab_size=tokenizers.en.get_vocab_size().numpy(),
    pe_input=1000,
    pe_target=1000,
    rate=dropout_rate)

#출처: https://www.tensorflow.org/text/tutorials/transformer?hl=ko

4. 전체 Code

https://www.tensorflow.org/text/tutorials/transformer?hl=ko

Twitter Facebook LinkedIn