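Creating a similarity matrix from raw card-sort data

Problem description:

I have a dataset from an online card-sorting activity. Participants were given a random subset of cards (drawn from a larger set) and asked to create groups of cards they felt were similar to each other. Participants could create as many groups as they liked and name the groups however they wanted.

The sample dataset looks like this: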

Data <- structure(list(Subject = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L), Card = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 
7L, 8L, 9L, 10L, 2L, 3L, 5L, 7L, 9L, 10L, 11L, 12L, 13L, 14L, 
1L, 3L, 4L, 5L, 6L, 7L, 8L, 12L, 13L, 14L), .Label = c("A", "B", 
"C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N"), class = "factor"), 
    Group = structure(c(1L, 2L, 3L, 4L, 1L, 3L, 3L, 5L, 2L, 5L, 
    1L, 2L, 1L, 3L, 1L, 4L, 4L, 2L, 3L, 1L, 1L, 2L, 1L, 2L, 3L, 
    2L, 1L, 2L, 2L, 3L), .Label = c("Cat1", "Cat2", "Cat3", "Cat4", 
    "Cat5"), class = "factor")), .Names = c("Subject", "Card", 
"Group"), class = "data.frame", row.names = c(NA, -30L)) 

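From this data I would like to create a similarity matrix, ideally giving the proportion or percentage of the total count in which items were grouped together.

Something like this:

Counts: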

  A B C D E F G H I J K L M N
A - 0 0 1 1 0 0 1 0 0 0 0 0 0
B 0 - 0 0 1 0 0 0 2 0 0 0 0 1
C 0 0 - 0 0 1 2 0 0 0 0 2 1 0
D 1 0 0 - 0 0 0 1 0 0 0 0 0 0
E 1 1 0 0 - 0 1 0 1 0 0 1 1 1
F 0 0 1 0 0 - 1 0 0 0 0 0 0 1
G 0 0 2 0 1 1 - 0 0 0 0 1 2 0
H 1 0 0 1 0 0 0 - 0 1 0 0 0 0
I 0 2 0 0 1 0 0 0 - 0 0 0 0 1
J 0 0 0 0 0 0 0 1 0 - 1 0 0 0
K 0 0 0 0 0 0 0 0 0 1 - 0 0 0
L 0 0 2 0 1 0 1 0 0 0 0 - 1 0
M 0 0 1 0 1 0 2 0 0 0 0 1 - 0
N 0 1 0 0 1 1 0 0 1 0 0 0 0 -

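Every subject named their groups differently, so it is not possible to index by group name.

In addition to counts, I would also like to generate a similarity matrix that reports, for each pair of cards, the percentage of participants presented with that pair who grouped the two cards together.

From the sample dataset, this would be the result: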

    A   B   C   D   E   F   G   H   I   J   K   L   M   N
A   -   0   0  50  50   0   0  50   0   0   0   0   0   0
B   0   -   0   0  50   0   0   0 100   0   0   0   0 100
C   0   0   -   0   0  50  67   0   0   0   0 100  50   0
D  50   0   0   -   0   0   0  50   0   0   0   0   0   0
E  50  50  33   0   -   0  33   0  50   0   0  33  50  50
F   0   0  50   0   0   -  50   0   0   0   0   0   0 100
G   0   0  67   0  33  50   -   0   0   0   0  50 100   0
H  50   0   0  50   0   0   0   -   0 100   0   0   0   0
I   0 100   0   0  50   0   0   0   -   0   0   0   0 100
J   0   0   0   0   0   0   0 100   0   - 100   0   0   0
K   0   0   0   0   0   0   0   0   0 100   -   0   0   0
L   0   0 100   0  33   0  50   0   0   0   0   -  50   0
M   0   0  50   0  50   0 100   0   0   0   0  50   -   0
N   0 100   0   0  50 100   0   0 100   0   0   0   0   -

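Any suggestions would be greatly appreciated!

Edit: The answer below works for the sample data. It does not seem to work for my actual data, which I have posted here: https://www.dropbox.com/s/mhqwyok0nmvt3g9/Sim_Example.csv?dl=0

For example, in that data I manually counted 22 pairings of "airplane" and "airport", which is about 55%. However, the answer below gives a count of 12 and 60%.

Edited solution, based on the OP's clarification of requirements:

Step 1. Process the data to create card pairs and record whether each pair has been grouped together by any user: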

library(tidyverse); library(data.table) 

Data.matrix <- Data %>% 

    # convert data into list of data frames by subject 
    split(Data$Subject) %>% 

    # for each subject, we create all pair combinations based on the subset cards he 
    # received, & note down whether he grouped the pair into the same group 
    # (assume INTERNAL group naming consistency. i.e. if subject 1 uses group names such 
    # as "cat", "dog", "rat", they are all named exactly so, & we don't worry about 
    # variations/typos such as "cat1.5", "dgo", etc.) 
    lapply(function(x){ 
    # all unordered card pairs for this subject, computed once 
    pairs <- t(combn(as.character(x$Card), 2)) 
    data.frame(V1 = pairs[,1], 
       V2 = pairs[,2], 
       G1 = x$Group[match(pairs[,1], x$Card)], 
       G2 = x$Group[match(pairs[,2], x$Card)], 
       stringsAsFactors = FALSE) %>% 
     mutate(co.occurrence = 1, 
      same.group = G1==G2) %>% 
     # dplyr:: prefix guards against masking, e.g. by MASS::select 
     dplyr::select(-G1, -G2)}) %>% 

    # combine the list of data frames back into one, now that we don't worry about group 
    # names, & calculate the proportion of times each pair is assigned the same group, 
    # based on the total number of times they occurred together in any subject's 
    # subset. 
    rbindlist() %>% 
    rowwise() %>% 
    mutate(V1.sorted = min(V1, V2), 
     V2.sorted = max(V1, V2)) %>% 
    ungroup() %>% 
    group_by(V1.sorted, V2.sorted) %>% 
    summarise(co.occurrence = sum(co.occurrence), 
      same.group = sum(same.group)) %>% 
    ungroup() %>% 
    rename(V1 = V1.sorted, V2 = V2.sorted) %>% 
    mutate(same.group.perc = same.group/co.occurrence * 100) %>% 

    # now V1 ranges from A:M, where V2 ranges from B:N. let's complete all combinations 
    mutate(V1 = factor(V1, levels = sort(unique(Data$Card))), 
     V2 = factor(V2, levels = sort(unique(Data$Card)))) %>% 
    complete(V1, V2) 

> Data.matrix 
# A tibble: 196 x 5 
     V1  V2 co.occurrence same.group same.group.perc 
    <fctr> <fctr>   <dbl>  <int>   <dbl> 
1  A  A   NA   NA    NA 
2  A  B    1   0    0 
3  A  C    2   0    0 
4  A  D    2   1    50 
5  A  E    2   1    50 
6  A  F    2   0    0 
7  A  G    2   0    0 
8  A  H    2   1    50 
9  A  I    1   0    0 
10  A  J    1   0    0 
# ... with 186 more rows 

# co.occurrence is the number of users who were shown both cards of the pair. 
# same.group is the number of times a card pair has been grouped together. 
# same.group.perc is the percentage of users shown both cards who grouped the pair together. 
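
As a cross-check on Step 1, the same two quantities can be computed in base R without reshaping into pairs. This is only a sketch, not part of the original answer: the names `same`, `seen` and `perc` are made up for illustration, the sample data is rebuilt compactly from the question, and it assumes each card appears at most once per subject.

```r
# Sample data from the question, rebuilt compactly
Data <- data.frame(
  Subject = rep(1:3, each = 10),
  Card = LETTERS[c(1:10, 2, 3, 5, 7, 9, 10, 11:14, 1, 3:8, 12:14)],
  Group = paste0("Cat", c(1, 2, 3, 4, 1, 3, 3, 5, 2, 5,
                          1, 2, 1, 3, 1, 4, 4, 2, 3, 1,
                          1, 2, 1, 2, 3, 2, 1, 2, 2, 3))
)

cards <- sort(unique(as.character(Data$Card)))
n <- length(cards)
same <- matrix(0, n, n, dimnames = list(cards, cards))  # times grouped together
seen <- same                                            # times shown to the same subject

for (s in split(Data, Data$Subject)) {
  # this subject's group for every card; NA for cards the subject was not shown
  g <- as.character(s$Group)[match(cards, as.character(s$Card))]
  both <- outer(!is.na(g), !is.na(g), "&")   # pair was shown to this subject
  together <- outer(g, g, "==")              # pair shares a group name
  together[is.na(together)] <- FALSE
  same <- same + together
  seen <- seen + both
}

diag(same) <- NA
diag(seen) <- NA
perc <- 100 * same / seen   # NaN where no subject was shown both cards
```

Because group names are only compared within a single subject, differences in naming between subjects do not matter here either. Pairs that no subject saw together come out as `NaN` in `perc` rather than 0, which may or may not be what you want in a report.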

Step 2. Create separate matrices for counts & percentages:

# spread count/percentage respectively into wide form 

Data.count <- Data.matrix %>% 
    select(V1, V2, same.group) %>% 
    spread(V2, same.group, fill = 0) %>% 
    remove_rownames() %>% 
    column_to_rownames("V1") %>% 
    as.matrix() 

Data.perc <- Data.matrix %>% 
    select(V1, V2, same.group.perc) %>% 
    spread(V2, same.group.perc, fill = 0) %>% 
    remove_rownames() %>% 
    column_to_rownames("V1") %>% 
    as.matrix() 

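Step 3. Convert the upper triangular matrices into symmetric matrices (note: I have since found a shorter & neater solution, added as the alternate below):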

# fill up lower triangle to create symmetric matrices 
Data.count[lower.tri(Data.count)] <- t(Data.count)[lower.tri(t(Data.count))] 
Data.perc[lower.tri(Data.perc)] <- t(Data.perc)[lower.tri(t(Data.perc))] 

# ALTERNATE to previous step 
Data.count <- pmax(Data.count, t(Data.count)) 
Data.perc <- pmax(Data.perc, t(Data.perc)) 
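
To see why both variants of Step 3 give the same result, here is a small sketch on a made-up 3x3 matrix (the matrix `m` and the names `sym1`, `sym2` are hypothetical, not from the data above). The `pmax` shortcut relies on the values being non-negative and on the lower triangle being 0, which holds here because `spread(..., fill = 0)` was used.

```r
# Hypothetical upper-triangular matrix of non-negative counts
m <- matrix(c(0, 0, 0,
              2, 0, 0,
              1, 3, 0),
            nrow = 3,
            dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

# Variant 1: copy the upper triangle into the lower triangle
sym1 <- m
sym1[lower.tri(sym1)] <- t(sym1)[lower.tri(sym1)]

# Variant 2: element-wise maximum of the matrix and its transpose
sym2 <- pmax(m, t(m))

isSymmetric(sym2)
all(sym1 == sym2)
```

If the matrix could contain negative values, or NA instead of 0 in the lower triangle, the two variants would no longer be interchangeable, so the first one is the safer default.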

Step 4. Get rid of the diagonals, since there is no point in pairing a card with itself:

# convert diagonals to NA since you don't really need them 
diag(Data.count) <- NA 
diag(Data.perc) <- NA 

Step 5. Verify the results:

> Data.count 
    A B C D E F G H I J K L M N 
A NA 0 0 1 1 0 0 1 0 0 0 0 0 0 
B 0 NA 0 0 1 0 0 0 2 0 0 0 0 1 
C 0 0 NA 0 1 1 2 0 0 0 0 2 1 0 
D 1 0 0 NA 0 0 0 1 0 0 0 0 0 0 
E 1 1 1 0 NA 0 1 0 1 0 0 1 1 1 
F 0 0 1 0 0 NA 1 0 0 0 0 0 0 1 
G 0 0 2 0 1 1 NA 0 0 0 0 1 2 0 
H 1 0 0 1 0 0 0 NA 0 1 0 0 0 0 
I 0 2 0 0 1 0 0 0 NA 0 0 0 0 1 
J 0 0 0 0 0 0 0 1 0 NA 1 0 0 0 
K 0 0 0 0 0 0 0 0 0 1 NA 0 0 0 
L 0 0 2 0 1 0 1 0 0 0 0 NA 1 0 
M 0 0 1 0 1 0 2 0 0 0 0 1 NA 0 
N 0 1 0 0 1 1 0 0 1 0 0 0 0 NA 

> Data.perc 
    A B C D E F G H I J K L M N 
A NA 0 0 50 50 0 0 50 0 0 0 0 0 0 
B 0 NA 0 0 50 0 0 0 100 0 0 0 0 100 
C 0 0 NA 0 33 50 67 0 0 0 0 100 50 0 
D 50 0 0 NA 0 0 0 50 0 0 0 0 0 0 
E 50 50 33 0 NA 0 33 0 50 0 0 50 50 50 
F 0 0 50 0 0 NA 50 0 0 0 0 0 0 100 
G 0 0 67 0 33 50 NA 0 0 0 0 50 100 0 
H 50 0 0 50 0 0 0 NA 0 100 0 0 0 0 
I 0 100 0 0 50 0 0 0 NA 0 0 0 0 100 
J 0 0 0 0 0 0 0 100 0 NA 100 0 0 0 
K 0 0 0 0 0 0 0 0 0 100 NA 0 0 0 
L 0 0 100 0 50 0 50 0 0 0 0 NA 50 0 
M 0 0 50 0 50 0 100 0 0 0 0 50 NA 0 
N 0 100 0 0 50 100 0 0 100 0 0 0 0 NA 

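I get an error with the above: 'Error in select(., -G1, -G2) : unused arguments (-G1, -G2)'. Also, the percentages appear to be off. I think the percentage should be relative to the total count. – JLC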

@JLC: Change it to 'dplyr::select(., -G1, -G2)'. I suspect you have another package loaded that contains a similarly named function? I get this error myself when 'MASS::select' masks 'dplyr::select'.
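
A quick way to check for this kind of masking, as a sketch that assumes dplyr is installed (the one-row data frame `df` is made up for illustration):

```r
library(dplyr)

# Which package does an unqualified `select` resolve to in this session?
# This reports another package (e.g. "MASS") if one was attached after dplyr.
environmentName(environment(select))

# Qualifying the call makes it independent of the order packages were attached
df <- data.frame(V1 = "A", V2 = "B", G1 = "Cat1", G2 = "Cat1")
res <- dplyr::select(df, -G1, -G2)
names(res)
```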

Thanks! That works, but the percentages still seem off – JLC
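
For comparing a hand count against the computed matrices, a small helper along these lines may help. It is a sketch only: `pair_summary` is a hypothetical name, the column names `Subject`, `Card` and `Group` are assumed to match the sample data (the linked CSV may differ), and it is shown on the sample data rather than on the real file.

```r
# Sample data from the question, rebuilt compactly
Data <- data.frame(
  Subject = rep(1:3, each = 10),
  Card = LETTERS[c(1:10, 2, 3, 5, 7, 9, 10, 11:14, 1, 3:8, 12:14)],
  Group = paste0("Cat", c(1, 2, 3, 4, 1, 3, 3, 5, 2, 5,
                          1, 2, 1, 3, 1, 4, 4, 2, 3, 1,
                          1, 2, 1, 2, 3, 2, 1, 2, 2, 3))
)

# For one pair of cards: how many subjects were shown both, and how many
# of those put the two cards into the same group
pair_summary <- function(d, a, b) {
  d <- unique(d[, c("Subject", "Card", "Group")])  # drop exact duplicate rows
  ga <- d[d$Card == a, c("Subject", "Group")]
  gb <- d[d$Card == b, c("Subject", "Group")]
  both <- merge(ga, gb, by = "Subject", suffixes = c(".a", ".b"))
  together <- as.character(both$Group.a) == as.character(both$Group.b)
  c(shown = nrow(both),
    together = sum(together),
    percent = 100 * mean(together))
}

res_cg <- pair_summary(Data, "C", "G")
res_bi <- pair_summary(Data, "B", "I")
```

If a subject has the same card in more than one row with different groups, `merge` returns several rows for that subject, so `shown` would exceed the number of subjects. That is one thing worth checking when a manual count and the computed count disagree.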