Deep Learning in R: Text Classification

This section works through an example of binary classification on text, using the IMDB dataset to classify movie reviews as positive or negative:

library(keras)
library(dplyr)
library(ggplot2)
library(purrr)

Download the dataset (the %<-% multi-assignment operator comes from the zeallot package and is re-exported by keras):

imdb <- dataset_imdb(num_words = 10000)

c(train_data, train_labels) %<-% imdb$train
c(test_data, test_labels) %<-% imdb$test

The dataset also ships with a dictionary mapping each word to an integer index:

word_index <- dataset_imdb_word_index()
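
A single word can be looked up directly to see how the dictionary works; for example, "the" is the most frequent word and, as far as I know, carries raw index 1 in the Keras IMDB word index (an offset of 3 is applied further below):

word_index[["the"]]

[1] 1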

Before modeling, first get a feel for the data:

paste0("Training entries: ", length(train_data), ", labels: ", length(train_labels))

[1] "Training entries: 25000, labels: 25000"

Every review has been converted to a sequence of integers, each of which represents one word in the dictionary:

train_data[[1]]

[1] 1 14 22 16 43 530 973 1622 1385 65 458 4468 66 3941 4 173
[17] 36 256 5 25 100 43 838 112 50 670 2 9 35 480 284 5
[33] 150 4 172 112 167 2 336 385 39 4 172 4536 1111 17 546 38
[49] 13 447 4 192 50 16 6 147 2025 19 14 22 4 1920 4613 469
[65] 4 22 71 87 12 16 43 530 38 76 15 13 1247 4 22 17
[81] 515 17 12 16 626 18 2 5 62 386 12 8 316 8 106 5
[97] 4 2223 5244 16 480 66 3785 33 4 130 12 16 38 619 5 25
[113] 124 51 36 135 48 25 1415 33 6 22 12 215 28 77 52 5
[129] 14 407 16 82 2 8 4 107 117 5952 15 256 4 2 7 3766
[145] 5 723 36 71 43 530 476 26 400 317 46 7 4 2 1029 13
[161] 104 88 4 381 15 297 98 32 2071 56 26 141 6 194 7486 18
[177] 4 226 22 21 134 476 26 480 5 144 30 5535 18 51 36 28
[193] 224 92 25 104 4 226 65 16 38 1334 88 12 16 283 5 16
[209] 4472 113 103 32 15 16 5345 19 178 32

However, the inputs to a neural network must all have the same length, so the sequences will need further processing later:

length(train_data[[1]])
length(train_data[[2]])

[1] 218
[1] 189

Build a lookup table so the integers can be converted back to words:

word_index_df <- data.frame(
  word = names(word_index),
  idx = unlist(word_index, use.names = FALSE),
  stringsAsFactors = FALSE
)

# The first indices are reserved
word_index_df <- word_index_df %>% mutate(idx = idx + 3)
word_index_df <- word_index_df %>%
  add_row(word = "<PAD>", idx = 0) %>%
  add_row(word = "<START>", idx = 1) %>%
  add_row(word = "<UNK>", idx = 2) %>%
  add_row(word = "<UNUSED>", idx = 3)

word_index_df <- word_index_df %>% arrange(idx)

decode_review <- function(text){
  # Look up the word for each integer in the sequence and paste them back together
  paste(map(text, function(number) word_index_df %>%
              filter(idx == number) %>%
              select(word) %>%
              pull()),
        collapse = " ")
}
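
As a side note, the per-word filter above is slow for long reviews; an equivalent lookup with a plain named character vector (an alternative sketch, not part of the original walkthrough) could look like this:

# Named vector: names are the (character) indices, values are the words
rev_index <- setNames(word_index_df$word, word_index_df$idx)
decode_review_fast <- function(text) paste(rev_index[as.character(text)], collapse = " ")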

With this, the text of any training example can be recovered:

decode_review(train_data[[1]])

[1] " this film was just brilliant casting location scenery story direction
everyone’s really suited the part they played and you could just imagine being there
robert is an amazing actor and now the same being director father came from
the same scottish island as myself so i loved the fact there was a real connection with
this film the witty remarks throughout the film were great it was just brilliant so much
that i bought the film as soon as it was released for and would recommend it to
everyone to watch and the fly fishing was amazing really cried at the end it was so sad
and you know what they say if you cry at a film it must have been good and this
definitely was also to the two little boy’s that played the of norman and
paul they were just brilliant children are often left out of the list i think
because the stars that play them all grown up are such a big profile for the whole film
but these children are amazing and should be praised for what they have done don’t you
think the whole story was so lovely because it was true and was someone’s life after all
that was shared with us all"

Next, prepare the data. The input to the network must be tensors. One option is to multi-hot encode each review as a matrix of 0s and 1s; another is to pad all reviews to a common length (up to a chosen maximum) and let that length define the shape of the first layer. The latter approach is used here, via the pad_sequences function:
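
For comparison, the 0/1-matrix approach mentioned above could be sketched as follows; vectorize_sequences is a hypothetical helper name, and this encoding is not used in the rest of this section:

vectorize_sequences <- function(sequences, dimension = 10000) {
  # One row per review, one column per word in the vocabulary
  results <- matrix(0, nrow = length(sequences), ncol = dimension)
  for (i in seq_along(sequences)) {
    # Mark every word index that occurs in review i with a 1
    results[i, sequences[[i]]] <- 1
  }
  results
}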

train_data <- pad_sequences(
  train_data,
  value = word_index_df %>% filter(word == "<PAD>") %>% select(idx) %>% pull(),
  padding = "post",
  maxlen = 256
)

test_data <- pad_sequences(
  test_data,
  value = word_index_df %>% filter(word == "<PAD>") %>% select(idx) %>% pull(),
  padding = "post",
  maxlen = 256
)

Check the lengths of the padded training examples:

length(train_data[1, ])
length(train_data[2, ])

[1] 256
[1] 256

And look at the first review again, now padded with zeros to length 256:

train_data[1, ]

[1] 1 14 22 16 43 530 973 1622 1385 65 458 4468 66 3941 4
[16] 173 36 256 5 25 100 43 838 112 50 670 2 9 35 480
[31] 284 5 150 4 172 112 167 2 336 385 39 4 172 4536 1111
[46] 17 546 38 13 447 4 192 50 16 6 147 2025 19 14 22
[61] 4 1920 4613 469 4 22 71 87 12 16 43 530 38 76 15
[76] 13 1247 4 22 17 515 17 12 16 626 18 2 5 62 386
[91] 12 8 316 8 106 5 4 2223 5244 16 480 66 3785 33 4
[106] 130 12 16 38 619 5 25 124 51 36 135 48 25 1415 33
[121] 6 22 12 215 28 77 52 5 14 407 16 82 2 8 4
[136] 107 117 5952 15 256 4 2 7 3766 5 723 36 71 43 530
[151] 476 26 400 317 46 7 4 2 1029 13 104 88 4 381 15
[166] 297 98 32 2071 56 26 141 6 194 7486 18 4 226 22 21
[181] 134 476 26 480 5 144 30 5535 18 51 36 28 224 92 25
[196] 104 4 226 65 16 38 1334 88 12 16 283 5 16 4472 113
[211] 103 32 15 16 5345 19 178 32 0 0 0 0 0 0 0
[226] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[241] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[256] 0

Build the model:

# input shape is the vocabulary count used for the movie reviews (10,000 words)
vocab_size <- 10000

model <- keras_model_sequential()
model %>% 
  layer_embedding(input_dim = vocab_size, output_dim = 16) %>%
  layer_global_average_pooling_1d() %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% summary()
  • The first layer is an embedding layer, which turns each integer word index into a dense 16-dimensional vector
  • The second layer, global_average_pooling_1d, averages over the sequence dimension to produce a fixed-length output vector, which lets the model handle inputs of varying length (see the small numeric sketch after this list)
  • The next dense layer is a fully connected layer with 16 units
  • The final layer has a single output unit; its sigmoid activation produces a confidence score between 0 and 1
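
The pooling step can be illustrated with plain R (a toy example, not Keras code): averaging the embedding vectors of a 3-word sequence collapses it into one fixed-length vector, regardless of how many words went in.

# Three 4-dimensional embedding vectors, one row per word
emb <- rbind(c(1, 0, 2, 1),
             c(3, 2, 0, 1),
             c(2, 4, 4, 1))
colMeans(emb)

[1] 2 2 2 1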

Compile the model:

model %>% compile(
  optimizer = 'adam',
  loss = 'binary_crossentropy',
  metrics = list('accuracy')
)

binary_crossentropy is used here because it is well suited to a model that outputs probabilities.
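
To see why, binary cross-entropy heavily penalizes confident wrong answers; a quick manual check (plain R, for illustration only):

# loss = -(y * log(p) + (1 - y) * log(1 - p)) for true label y and predicted probability p
bce <- function(y, p) -(y * log(p) + (1 - y) * log(1 - p))
bce(1, 0.9)  # ~0.105, confident and correct: small loss
bce(1, 0.1)  # ~2.303, confident and wrong: large loss
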
Next, set aside a validation set: the first 10,000 training reviews become the validation set, and the remaining 15,000 are used for training:

x_val <- train_data[1:10000, ]
partial_x_train <- train_data[10001:nrow(train_data), ]

y_val <- train_labels[1:10000]
partial_y_train <- train_labels[10001:length(train_labels)]

Train the model:

history <- model %>% fit(
  partial_x_train,
  partial_y_train,
  epochs = 40,
  batch_size = 512,
  validation_data = list(x_val, y_val),
  verbose = 1
)
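
The history object returned by fit holds the per-epoch metrics that the plotting call at the end of this section draws from; as far as I recall, the R interface exposes them as a list:

str(history$metrics)  # per-epoch loss, acc, val_loss, val_acc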

Evaluate the model:

results <- model %>% evaluate(test_data, test_labels)
results

25000/25000 [==============================] - 0s 15us/step
$loss
[1] 0.34057
$acc
[1] 0.8696

Plot the training history:

plot(history)
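
Since ggplot2 is loaded, plot(history) should return a ggplot object, so the figure can be customized with the usual grammar, e.g. (assuming the standard keras plotting method):

plot(history) + theme_minimal()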
