example: character-level english to french
### Data
Dataset format: english sentence \t french sentence \t ...
input: english
target: french
start & end tokens: '\t' + decoder_input_sentence + '\n'
character vocabulary: letters, digits, spaces, punctuation; decoder_input_sentence also carries the start & end tokens, while decoder_target_sentence is one step ahead of it (no start token, but it keeps the end token)
one-hot: one-hot for each character within the sentence; the remaining padding positions use the one-hot of the space character
### data dim
encoder_input_data:
[batch_size, max_sentence_length, num_eng_characters]
each element is a one-hot vec (standing for a specific charac)
decoder_input_data:
[batch_size, max_sentence_length, num_fra_characters]
each element is a one-hot vec
decoder_target_data:
[batch_size, max_sentence_length, num_fra_characters]
offset the decoder_input_data by one step:
decoder_target is one step ahead of decoder_input: the decoder's prediction/gt at this step serves as the input of the next step
decoder_target_data[:,t,:] = decoder_input_data[:,t+1,:]
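A minimal numpy sketch of how the three arrays are built (the toy sentences and vocabularies here are assumptions, just to make the padding and the one-step offset concrete):

```python
import numpy as np

# toy stand-ins for the parsed dataset (assumption, only for shape illustration)
input_texts = ["hi", "go"]                    # english
target_texts = ["\tsalut\n", "\tva\n"]        # french, wrapped with start/end tokens
eng_chars = sorted(set("".join(input_texts)) | {" "})
fra_chars = sorted(set("".join(target_texts)) | {" "})
eng_index = {c: i for i, c in enumerate(eng_chars)}
fra_index = {c: i for i, c in enumerate(fra_chars)}
max_enc_len = max(len(t) for t in input_texts)
max_dec_len = max(len(t) for t in target_texts)

encoder_input_data = np.zeros((len(input_texts), max_enc_len, len(eng_chars)), dtype="float32")
decoder_input_data = np.zeros((len(input_texts), max_dec_len, len(fra_chars)), dtype="float32")
decoder_target_data = np.zeros((len(input_texts), max_dec_len, len(fra_chars)), dtype="float32")

for i, (inp, tgt) in enumerate(zip(input_texts, target_texts)):
    for t, ch in enumerate(inp):
        encoder_input_data[i, t, eng_index[ch]] = 1.0
    encoder_input_data[i, len(inp):, eng_index[" "]] = 1.0       # pad with the space one-hot
    for t, ch in enumerate(tgt):
        decoder_input_data[i, t, fra_index[ch]] = 1.0
        if t > 0:
            decoder_target_data[i, t - 1, fra_index[ch]] = 1.0   # target is one step ahead
    decoder_input_data[i, len(tgt):, fra_index[" "]] = 1.0
    decoder_target_data[i, len(tgt) - 1:, fra_index[" "]] = 1.0
```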
### model
model input:
encoder input: time-distributed input sequence
decoder input: target sequence starting with \t
expected input shape: (batch_size, timesteps, data_dim)
model output: predict sequence, (batch_size, timesteps, data_dim)
model target: target sequence, (batch_size, timesteps, data_dim)
encoder_states: an LSTM has two of them (the hidden state and the cell state), each with the same dimension as the per-timestep output, (batch_size, latent_dim)
initial_state: used to specify the initial state of an RNN layer;
at the first decoding step the input is the start token, and the hidden state comes from the information encoded by the encoder;
at every later step the input can be the previous step's prediction/gt, and the recurrent state is the gradually accumulated context
* if we use the functional-API Model class, we define the input layer with shape=(timesteps, emb_dim) (a full seq2seq sketch built this way follows this list)
* if we use the Sequential model, we must explicitly set input_shape=(timesteps, data_dim) on the first LSTM layer
* batch_size is set for axis-0 by default in both of the methods above
* but in a stateful LSTM layer, batch_size must be declared explicitly:
model.add(LSTM(32, return_sequences=True, stateful=True, batch_input_shape=(batch_size, timesteps, data_dim)))
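A minimal functional-API sketch of the training-time model described above (following the keras lstm_seq2seq example; latent_dim and the vocabulary sizes are assumptions):

```python
from tensorflow import keras
from tensorflow.keras import layers

latent_dim = 256            # assumed hidden size
num_eng_characters = 71     # assumed encoder vocabulary size
num_fra_characters = 93     # assumed decoder vocabulary size

# encoder: keep only the final hidden/cell states as the context
encoder_inputs = keras.Input(shape=(None, num_eng_characters))
_, state_h, state_c = layers.LSTM(latent_dim, return_state=True)(encoder_inputs)
encoder_states = [state_h, state_c]

# decoder: consumes the target sequence (teacher forcing) and starts from encoder_states
decoder_inputs = keras.Input(shape=(None, num_fra_characters))
decoder_lstm = layers.LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = layers.Dense(num_fra_characters, activation="softmax")
decoder_outputs = decoder_dense(decoder_outputs)

model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy")
```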
### teacher-forcing or reinjection
the decoder's output is the predicted sequence (its gt is the corresponding target sequence); its input is a sequence lagging one step behind the prediction: you can only obtain and reuse what has already been produced
with teacher forcing, the decoder input is the target sequence;
with reinjection, the decoder input is the predicted sequence
[QUESTION] why is teacher forcing used more often?
### inference
the decoder model changes quite a bit at inference time:
1. the hidden states of every time step must be passed around explicitly
2. step-by-step prediction must be implemented explicitly
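A sketch of that inference-time decomposition, reusing the layer and vocabulary names from the sketches above (greedy decoding; the max length of 60 is an assumption):

```python
import numpy as np

# 1. explicit states: a standalone encoder model maps the input sequence to [h, c]
encoder_model = keras.Model(encoder_inputs, encoder_states)

# a one-step decoder model: (previous char, previous states) -> (next-char probs, new states)
state_h_in = keras.Input(shape=(latent_dim,))
state_c_in = keras.Input(shape=(latent_dim,))
dec_seq, h, c = decoder_lstm(decoder_inputs, initial_state=[state_h_in, state_c_in])
dec_seq = decoder_dense(dec_seq)
decoder_model = keras.Model([decoder_inputs, state_h_in, state_c_in], [dec_seq, h, c])

# 2. explicit step-by-step prediction
def decode_sequence(input_seq, max_len=60):
    states = encoder_model.predict(input_seq)
    target = np.zeros((1, 1, num_fra_characters))
    target[0, 0, fra_index["\t"]] = 1.0                 # start token
    decoded = ""
    for _ in range(max_len):
        probs, h, c = decoder_model.predict([target] + states)
        idx = int(np.argmax(probs[0, -1, :]))
        if fra_chars[idx] == "\n":                      # end token
            break
        decoded += fra_chars[idx]
        target = np.zeros((1, 1, num_fra_characters))
        target[0, 0, idx] = 1.0                         # reinject the prediction
        states = [h, c]
    return decoded
```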
what if:
* GRU: using a GRU just drops the cell state; its internal gating is based on the LSTM's, and its output behaves like a plain RNN's
* word-level:
the word vocabulary is much larger than the character vocabulary, so one-hot embeddings become too high-dimensional and far too sparse;
we can add an Embedding layer after the input layer to turn word indices into fixed-size dense vectors (see the sketch below)
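A small sketch of the word-level encoder side, reusing the imports and latent_dim from the sketches above (vocab_size and emb_dim are assumptions; the decoder side changes in the same way):

```python
vocab_size, emb_dim = 10000, 128                               # assumed
word_inputs = keras.Input(shape=(None,), dtype="int32")        # integer word ids instead of one-hots
word_emb = layers.Embedding(vocab_size, emb_dim)(word_inputs)  # fixed-size dense vectors
_, h, c = layers.LSTM(latent_dim, return_state=True)(word_emb)
```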
further improvements:
* attention
* bi-rnn
* deeper: stacking layers
motivation:
* the LSTM alleviates the RNN's long-range dependency problem to some extent, but does not fully solve it
* a very common scenario, sentence fill-in-the-blank, needs both the preceding and the following context; a unidirectional RNN cannot handle it
* likewise an image task that needs global information for classification; a unidirectional RNN only captures incomplete information
implementation: parse the sequence a second time in reverse order
example: MNIST classification
keras layer Bidirectional:
* https://github.com/keras-team/keras/blob/d71247dcd805e58110a784b03cf2fcbaa1c837c8/keras/layers/wrappers.py
* the fw and bw layers take the same inputs
* the fw and bw layers can be the same kind of layer or two different layers
* the fw and bw outputs are concatenated along the emb_dim axis (other fusion modes exist too); in this example the LSTM out dim is 128, so the bi-LSTM out dim is 256
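A minimal sketch of the MNIST classifier with Bidirectional, treating each 28x28 image as 28 timesteps of 28 features (merge_mode defaults to concat; reuses the keras imports above):

```python
bi_model = keras.Sequential([
    keras.Input(shape=(28, 28)),                 # 28 timesteps x 28 features per row
    layers.Bidirectional(layers.LSTM(128)),      # fw 128 + bw 128 -> output dim 256
    layers.Dense(10, activation="softmax"),
])
bi_model.compile(optimizer="adam",
                 loss="sparse_categorical_crossentropy",
                 metrics=["accuracy"])
```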
motivation:
in the basic seq2seq model, the encoder encodes the input sequence into a single context vector ([b, dim]); during decoding,
this fixed context vector, which summarizes the whole input, is fed to the decoder as initial_state
consider machine translation: the word currently being translated is not strongly related to every element of the input sequence; in most cases it mainly relates to the corresponding input word
implementation: score how well the j-th input word matches the i-th output word, and use a weighted context vec at each step
* for each decoding step
* s is the decoder output, [1, dim], the current word
* h is the encoder output, [N, dim], all input word vectors
* $e_{ij} = a(s_{i-1},h_j)$
* $\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}$
* $c_i = \sum_j \alpha_{ij} h_j$
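A numpy sketch of one decoding step of this additive attention, using the shapes from the notation above (the scoring MLP weights are random placeholders, only for illustration):

```python
import numpy as np

N, dim = 10, 64                          # input length, hidden size (assumed)
h = np.random.randn(N, dim)              # encoder outputs h_j, one per input word
s_prev = np.random.randn(dim)            # previous decoder state s_{i-1}

# a(s_{i-1}, h_j): a small learned feed-forward scorer (random weights here)
W_s, W_h = np.random.randn(dim, dim), np.random.randn(dim, dim)
v = np.random.randn(dim)

e = np.tanh(s_prev @ W_s + h @ W_h) @ v              # e_ij, shape [N]
alpha = np.exp(e) / np.exp(e).sum()                  # softmax over input positions
c = (alpha[:, None] * h).sum(axis=0)                 # context vector c_i, shape [dim]
```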
example: character-level english to french
the attention here is learnable attention, similar to an SE block: compute a linearly-mapped score of each embedding against the other embeddings, then softmax
tf 2.4.1; keras probably needs to be >= 2.3
https://github.com/keras-team/keras/blob/70d7d07bd186b929d81f7a8ceafff5d78d8bd701/keras/layers/multi_head_attention.py
given sequence length N, batch size B, key dim d, num_heads m, value_dim dv:
step1: projects `query`, `key` and `value`,
* each is a list of tensors of length `num_attention_heads`
* each tensor [B, N, d]
* trainable variables Wq, Wk, Wv (plus biases)
step2: compute attention
* dot(Q,K)
* scaled by 1/sqrt(d) (the key dim), [B,N,N]
* softmax to obtain attention probabilities, [B,N,N]
* dropout: in our implementation the dropout is placed after the MSA layer; since whole feature dimensions are dropped, either placement works
* reweight the value vectors V, [B,N,m*dv]
step3: final dense
* concat multi-heads along the d-axis, [B,N,m*dv]
* linear projection back to d, [B,N,d]
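A quick usage sketch of the keras MultiHeadAttention layer linked above (self-attention, so query = key = value; the shapes are assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

B, N, d_model = 2, 16, 64
x = tf.random.normal((B, N, d_model))

mha = layers.MultiHeadAttention(num_heads=8, key_dim=8)    # key_dim = d_model / num_heads
out, attn = mha(query=x, value=x, key=x, return_attention_scores=True)
print(out.shape)    # (2, 16, 64): projected back to the query dim
print(attn.shape)   # (2, 8, 16, 16): per-head attention probabilities
```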
the attention here is transformer attention, established on Q, K, V,
which can be seen as representing different aspects of a token embedding: https://zhuanlan.zhihu.com/p/158952064
* Q: query, the vector a word uses to query others
* K: key, the vector by which a word is looked up
* V: value, the vector carrying the word's content
multi-head self-attention:
* dq = dk = dv = d_model/h = 64
* h=8
encoder:
N=6: the encoder consists of 6 identical attention blocks
each attention block contains: MSA, FF, LN, residual, dropout
MSA: MultiHeadAttention + add & norm
FF: dense + relu/gelu + dense
input: input embedding + positional embedding
self-attention: single input, x=q=k=v
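A functional sketch of one such encoder block (post-norm layout as in the original Transformer; the d_model/d_ff defaults are assumptions, reusing the keras imports above):

```python
def transformer_encoder_block(x, num_heads=8, d_model=512, d_ff=2048, drop=0.1):
    # MSA sub-layer: MultiHeadAttention + dropout + add & norm (self-attention: x = q = k = v)
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model // num_heads)(x, x)
    attn = layers.Dropout(drop)(attn)
    x = layers.LayerNormalization(epsilon=1e-6)(x + attn)
    # FF sub-layer: dense + relu/gelu + dense, then add & norm
    ff = layers.Dense(d_ff, activation="relu")(x)
    ff = layers.Dense(d_model)(ff)
    ff = layers.Dropout(drop)(ff)
    return layers.LayerNormalization(epsilon=1e-6)(x + ff)

tokens = keras.Input(shape=(None, 512))          # (input + positional) embeddings
encoded = tokens
for _ in range(6):                               # N=6 identical blocks
    encoded = transformer_encoder_block(encoded)
```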
official repo: https://github.com/google-research/vision_transformer
third-party repo: https://github.com/lucidrains/vit-pytorch
task: supervised classification
inputs: cut the image into non-overlapping 16x16 patches, flatten them, project them down with a learnable linear layer, prepend the cls token, then add the PE
* image patch sequence & trainable linear projection
* PE: trainable 1d embedding
* x0: trainable prepended 1d embedding
* [x0, patch_embeddings, ] + PEs
in the implementation this is a single conv layer whose kernel size and stride both equal the patch size, linearly mapping each 3-channel patch to an emb-dim vector (sketch below)
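A sketch of that patch-embedding conv plus the prepended cls token and the PE (ViT-Base-ish numbers are assumptions: 224 input, 16x16 patches, emb_dim 768):

```python
import tensorflow as tf
from tensorflow.keras import layers

image_size, patch_size, emb_dim = 224, 16, 768
num_patches = (image_size // patch_size) ** 2          # 14 * 14 = 196

images = tf.random.normal((2, image_size, image_size, 3))
# conv with kernel = stride = patch size == a linear projection of each flattened 3-ch patch
patches = layers.Conv2D(emb_dim, kernel_size=patch_size, strides=patch_size)(images)
patches = tf.reshape(patches, (-1, num_patches, emb_dim))            # [B, 196, 768]

cls_token = tf.Variable(tf.zeros((1, 1, emb_dim)))                   # trainable prepended x0
pos_emb = tf.Variable(tf.zeros((1, num_patches + 1, emb_dim)))       # trainable 1d PE
cls = tf.tile(cls_token, [tf.shape(patches)[0], 1, 1])
tokens = tf.concat([cls, patches], axis=1) + pos_emb                 # [B, 197, 768]
```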
model: transformer encoder
* patch_size
* hidden_size: kept the same through all blocks
* MSA layer: no mask, the simplest version
MLP head:
GeLU:
Gaussian error linear unit: x * P(X <= x), where P(X) ~ N(0, 1)
if approx:
y = 0.5 * x * (1 + tanh(sqrt(2 / pi) * (x + 0.044715 * x^3)))
else:
y = 0.5 * x * (1 + erf(x / sqrt(2)))
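A quick check that the two forms above match tf's built-in gelu (exact and approximate):

```python
import numpy as np
import tensorflow as tf

x = tf.constant(np.linspace(-3.0, 3.0, 7), dtype=tf.float32)
exact = 0.5 * x * (1.0 + tf.math.erf(x / tf.sqrt(2.0)))
approx = 0.5 * x * (1.0 + tf.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))
print(tf.reduce_max(tf.abs(exact - tf.nn.gelu(x))))                     # ~0
print(tf.reduce_max(tf.abs(approx - tf.nn.gelu(x, approximate=True))))  # ~0
```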
LN:
https://www.geek-book.com/src/docs/keras/keras/keras.io/api/layers/normalization_layers/layer_normalization/index.html
when trainable, given inputs [b,(hwd),c] and normalizing over all non-batch axes, the parameter count is 2*(hwd)*c, so using it on 1D (token) inputs is the more natural fit
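A quick check of that parameter count with the keras layer, reusing the layers import above (assuming the normalization spans both the (hwd) and c axes):

```python
ln = layers.LayerNormalization(axis=[1, 2])       # normalize over (hwd) and c
ln.build((None, 196, 768))                        # e.g. 196 tokens x 768 channels
print(ln.count_params())                          # 2 * 196 * 768 = 301056 (gamma + beta)
```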
model subclassing: class ***(keras.Model)
* layers defined in __init__ must not be reused across call sites
* when layers are defined in bulk in a list, each layer must still be declared under the self namespace (so its weights are tracked)
* checkpoints can only use save_weights, not save_model, because get_config() and serialization are not supported
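A minimal subclassing sketch illustrating those constraints (a hypothetical MLPHead, reusing the imports above):

```python
class MLPHead(keras.Model):
    """Layers are declared under self in __init__ so their weights get tracked."""
    def __init__(self, num_classes, hidden=256):
        super().__init__()
        self.hidden = layers.Dense(hidden, activation="gelu")   # used exactly once in call()
        self.out = layers.Dense(num_classes)

    def call(self, x, training=False):
        return self.out(self.hidden(x))

head = MLPHead(num_classes=10)
_ = head(tf.zeros((1, 768)))          # build by calling once
head.save_weights("mlp_head_ckpt")    # OK; a full save_model would need get_config()/serialization
```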
training details:
* cosine learning rate
* Adam + L2 reg: momentum=0.9, wd=1e-5
main drawbacks:
* the model is very large, the batch size is large, TPU-scale training
* the training data must be huge; with less data the accuracy drops, and training from scratch on your own dataset is basically infeasible
improvements built on ViT include: DeiT, T2T-ViT, LV-ViT
official repo: https://github.com/zihangJiang/TokenLabeling
patch embedding
4-layer conv, kernel size [7,3,3,8], stride [2,1,1,8], filters 64
[conv-bn-relu]-[conv-bn-relu]-[conv-bn-relu]-[conv-bias]
re-labeling
run another model over the training set to produce a K-dim dense score map
when training our model, after random cropping, recompute the label from the cropped score map
token labeling
based on the re-labeling dense score map, we can go further and give every token its own K-dim label
each token's label and prediction yield an independent CE: the auxiliary token labeling loss
mixtoken
applied to the token grid in CutMix style (a crop box), rather than as noisy drops
the crop box height/width follow a beta distribution (most of the mass falls on small values, which keeps the overall label mixing beta-distributed)
token labels are per-token, so MixToken does not affect each token's label learning; the source code therefore restores the cropped patches before computing the token loss, so the token gt labels need no conversion
MixToken is essentially still augmenting the original image, so it only affects the cls token's prediction; the cls token target must be recomputed from the random mask (sketch below)
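A numpy sketch of the MixToken box sampling on the token grid (CutMix-style; the beta parameters and the 14x14 grid are assumptions):

```python
import numpy as np

def rand_bbox(h, w, lam):
    """CutMix-style box whose area fraction is roughly (1 - lam)."""
    cut = np.sqrt(1.0 - lam)
    ch, cw = int(h * cut), int(w * cut)
    cy, cx = np.random.randint(h), np.random.randint(w)
    y1, y2 = np.clip(cy - ch // 2, 0, h), np.clip(cy + ch // 2, 0, h)
    x1, x2 = np.clip(cx - cw // 2, 0, w), np.clip(cx + cw // 2, 0, w)
    return y1, y2, x1, x2

lam = np.random.beta(1.0, 1.0)                 # mixing ratio drawn from a beta distribution
y1, y2, x1, x2 = rand_bbox(14, 14, lam)
mask = np.zeros((14, 14), dtype=bool)
mask[y1:y2, x1:x2] = True                      # these tokens come from the other image
lam_cls = 1.0 - mask.mean()                    # cls-token target is re-mixed with this weight
```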
loss
the out embedding of the cls_token, [b,D], goes through an MLP prediction head to predict the overall class probabilities
the out embeddings of the other tokens, [b,N,D], go through a shared MLP prediction head that learns a per-token class prediction; average over all tokens
then a weighted sum: cls_loss + 0.5*token_loss
encoder block
* stochastic depth (drop path): randomly drop the residual branch per sample
* residual_scale: downscaling the residual branch gives a boost, scale=2
training details
* lr: linear scaling by batch 1e-3*batch_size/1024, 5 warmup epochs + cosine decay
* AdamW: weight_decay=5e-2
* batch_size: 1024
* dropout = 0.
* dropconnect = .1
* randAug, mixup
official repo: https://github.com/microsoft/Swin-Transformer
keras version: https://github.com/keras-team/keras-io/blob/master/examples/vision/swin_transformers.py
related papers:
origin Swin: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Swin for object detection: End-to-End Semi-Supervised Object Detection with Soft Teacher
Swin for segmentation: Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation
swin family:
swin-T: (224,224), C=96, num_layers=[2,2,6,2], num_heads=[3,6,12,24]
swin-S: (224,224), C=96, num_layers=[2,2,18,2], num_heads=[3,6,12,24]
swin-B: (224,224) / (384,384), C=128, num_layers=[2,2,18,2], num_heads=[4,8,16,32]
swin-L: (224,224) / (384,384), C=192, num_layers=[2,2,18,2], num_heads=[6,12,24,48]
what's new in swin:
* hierarchical: typical ViTs are columnar, keeping the resolution unchanged through the forward pass, which is computationally unfriendly in the shallow layers and hard to plug into FPN and downstream dense tasks
* window attention: a window is one level above a patch; attention is decomposed into window-based global attention and local attention, reducing computation
positional embeddings:
* no pos: so far only Google's MLP-Mixer has been found to drop the PE, claiming position is learned implicitly
* abs pos: what most ViTs do, a set of fixed 1D/2D values computed from the input size
* rel pos: this paper's approach, no abs PE, but a relative position bias is added to the MSA's QK attention logits before the softmax
classification head:
* swin has no cls token; stage 4 finally outputs token embeddings of shape [b, H/32*W/32, 8C]
* average all token embeddings, similar to GAP, [b, 8C]
* then feed them into a linear classifier, [b, n_classes]
patch merging:
* first the spatial dims are folded into the feature dim (neighboring patches are concatenated channel-wise)
* then a linear projection
* keeps more information than pooling
relative position index:
* describes the relative position between any two points in a window: [wh, wh], where each wh axis indexes an arbitrary point on the window map
* the raw relative distances are measured along the h and w axes separately, each taking 2h-1 / 2w-1 possible values after shifting to start at 0
* the 2-dim coords can be merged into 1-dim, the same way two digits form a two-digit number
* shared among windows
* constant (not learned)
relative position bias:
* stores the position bias for every possible relative offset: [2h-1, 2w-1, n_heads]
* truncated normal distribution: initialized by filling with a truncated normal
* every relative distance stored in relative position index can look up a set of biases in relative position bias: [wh, wh, n_heads]
* learnable
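A numpy sketch of how the constant index and the learnable bias table fit together (mirroring the official Swin logic; the 7x7 window and 3 heads are assumptions):

```python
import numpy as np

wh, ww, n_heads = 7, 7, 3
coords = np.stack(np.meshgrid(np.arange(wh), np.arange(ww), indexing="ij"))   # [2, wh, ww]
coords = coords.reshape(2, -1)                                                # [2, wh*ww]
rel = coords[:, :, None] - coords[:, None, :]          # pairwise offsets, range [-(wh-1), wh-1]
rel = rel.transpose(1, 2, 0)                           # [wh*ww, wh*ww, 2]
rel[:, :, 0] += wh - 1                                 # shift both axes to start at 0
rel[:, :, 1] += ww - 1
rel[:, :, 0] *= 2 * ww - 1                             # merge the two axes into one index
relative_position_index = rel.sum(-1)                  # constant, [wh*ww, wh*ww]

# learnable table: one bias per relative offset per head ((2wh-1)*(2ww-1) offsets)
bias_table = np.random.randn((2 * wh - 1) * (2 * ww - 1), n_heads) * 0.02     # ~trunc-normal init
bias = bias_table[relative_position_index.reshape(-1)]                        # look-up
bias = bias.reshape(wh * ww, wh * ww, n_heads)         # added to the attention logits per head
```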
window attention:
* split the feature map into non-overlapping windows, each containing M*M patches
* do self-attention inside each window, with parameters shared across windows: window-based local attention
* window_size=7: the feature map size must be divisible by 7, otherwise pooling is applied
* shifted-window:
without shifted windows, the receptive field would only double per stage (via patch merging) and otherwise stay unchanged
given window_size=M: the window partition does not start from the top-left corner but is shifted by M//2 along both h and w
equivalent to shifting (rolling) the feature map and then partitioning normally
tf.manip.roll / torch.roll
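A tiny tf.roll illustration of the cyclic shift (toy 4x4 feature map, window_size M=2 assumed):

```python
import tensorflow as tf

x = tf.reshape(tf.range(16), (1, 4, 4, 1))      # [B, H, W, C] toy feature map
shift = 2 // 2                                  # M // 2
shifted = tf.roll(x, shift=[-shift, -shift], axis=[1, 2])        # shift before partitioning
restored = tf.roll(shifted, shift=[shift, shift], axis=[1, 2])   # reverse shift afterwards
print(bool(tf.reduce_all(tf.equal(x, restored))))                # True
```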