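How do I create a table or data frame of group members (from long-format data, grouped by group)?

Question:

I am working with the results of some cluster analyses, and I am trying to build a cluster membership table for each analysis I run. How do I create a table or data frame of group members from long-format data that is grouped by group?

For example: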

test_data <- data.frame(
     Cluster = sample(1:5, 100, replace = TRUE),
     Item = sample(LETTERS[1:20], 5, replace = FALSE))

head(test_data)
  Cluster Item
1       2    R
2       5    F
3       1    T
4       5    Q
5       3    B
6       3    J

I would like to produce something like this:

Cluster_1 Cluster_2 Cluster_3 Cluster_4 Cluster_5 
     T   R   C   P   L 
     K   O   J   M   Q 
     I   H   B   N   F 
     D         G   E 
     S            A

I first tried spread, but it did not work with this data:

spread(test_data, item,group)

Error: Duplicate identifiers for rows

spread(test_data, group,item)

Error: Duplicate identifiers for rows

Then I tried:

test_frame <- split.data.frame(test_data,test_data$group)

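However, this produces a list of data frames, one per group, and I have not managed to turn that into what I want.

I tried unnest and unlist, but because the groups have different numbers of members, these functions throw errors.

Introducing NAs is fine.

Is there a simple way to do this that I am overlooking?

Answer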

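test_data <- data.frame(
     Cluster = sample(1:5, 100, replace = TRUE),
     Item = sample(LETTERS[1:20], 5, replace = TRUE), stringsAsFactors = FALSE)

# Collect the items of each cluster into a named list
m <- with(test_data, tapply(Item, paste("Cluster", Cluster, sep = "_"), I))
# Pad each vector with NA to the length of the longest one, then bind as columns
e <- data.frame(sapply(m, `length<-`, max(lengths(m))))
# Print NA entries as blanks
print(e, na.print = "")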

Concise and does the job - thanks! – JLC

Answer

I have reworked my answer. Everything is in base R and reasonably concise:
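
The answer above relies on `length<-`, which extends a vector to a given length and fills the new positions with NA. A minimal sketch of just that step, using a small hand-made list (the `members` list and its contents are hypothetical, not taken from the question's data):

```r
# Hypothetical cluster membership vectors of unequal length
members <- list(Cluster_1 = c("T", "K", "I"), Cluster_2 = c("R", "O"))

# The longest group decides how many rows the table needs
n_max <- max(lengths(members))

# `length<-`(x, n) returns x extended to length n, padded with NA
padded <- lapply(members, `length<-`, n_max)

# The equal-length character vectors now fit into one data frame
result <- data.frame(padded, stringsAsFactors = FALSE)
print(result, na.print = "")
```

Because the padding happens on the character vectors themselves, the columns stay character and nothing is coerced to a number.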

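test_data <- data.frame(
    Cluster = sample(1:5, 100, replace = TRUE),
    Item = sample(LETTERS[1:20], 5, replace = TRUE), stringsAsFactors = FALSE)

# Cluster labels, in order of first appearance
clusters <- unique(test_data$Cluster)

# One character vector of items per cluster
test_data <- lapply(clusters, function(i) {
    test_data[test_data$Cluster == i, ]$Item })

# Size of the largest cluster
n_max <- Reduce(f = max, x = lapply(test_data, FUN = length))

# Pad every vector with NA up to n_max
test_data <- lapply(test_data, function(i) { length(i) <- n_max; i })

# Bind the padded vectors column-wise, keeping the items as character
test_data <- Reduce(x = test_data, f = cbind)
test_data <- as.data.frame(test_data, stringsAsFactors = FALSE)
names(test_data) <- paste0('Cluster_', clusters)

test_data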

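Thanks! Unfortunately, I get an error with the code above: "Error in Reduce(, cbind): argument 'init' is missing, with no default" – JLC

Now it is losing data: str(test_data3) gives Named int 1 - attr(*, "names")= chr "Cluster_" – JLC

The latest edit is close, but it coerces every element to an integer. My data are actually strings (item names), so I need them to stay character. – JLC

Answer

Here is a solution using the tidyverse. test_final is the final output.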

# Load package
library(tidyverse)

# Set seed for reproducibility
set.seed(123)

# Create example data frame
test_data <- data.frame(
    Cluster = sample(1:5, 100, replace = TRUE),
    Item = sample(LETTERS[1:20], 5, replace = TRUE))

# Split the data frame into a list of data frames, one per cluster
test_list <- test_data %>%
    mutate(Item = as.character(Item)) %>%
    arrange(Cluster) %>%
    split(f = .$Cluster)

# Find the largest number of rows among the data frames
max_row <- max(map_int(test_list, nrow))

# Function to process each data frame in test_list
process_fun <- function(dt, max_row){
    # Pad the Item column with NA up to max_row
    dt_vec <- dt$Item
    dt_vec2 <- c(dt_vec, rep(NA, max_row - nrow(dt)))
    # Get the cluster number
    clusterNum <- unique(dt$Cluster)
    # Create a new data frame (tibble() replaces the deprecated data_frame())
    dt2 <- tibble(Item = dt_vec2)
    # Rename the column after the cluster
    colnames(dt2) <- paste("Cluster", clusterNum, sep = "_")
    return(dt2)
}

# Process the data
test_final <- test_list %>%
    map(process_fun, max_row = max_row) %>%
    bind_cols()

This works - thanks! – JLC

I'm glad it works. If you find this answer useful, please accept it by clicking the green check mark at the top left of this post. – www
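
As a possible alternative to splitting and padding by hand, newer versions of tidyr provide pivot_wider(), which fills missing cells with NA on its own. The following is only a sketch, assuming dplyr and tidyr 1.0.0 or later are installed; the small `long` data frame and the helper column `row` are made up for illustration and are not part of the answers above:

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format data: one row per (Cluster, Item) pair
long <- data.frame(
  Cluster = c(1, 1, 1, 2, 2, 3),
  Item = c("T", "K", "I", "R", "O", "C"),
  stringsAsFactors = FALSE
)

wide <- long %>%
  arrange(Cluster) %>%
  group_by(Cluster) %>%
  mutate(row = row_number()) %>%  # position of each item within its cluster
  ungroup() %>%
  pivot_wider(names_from = Cluster, values_from = Item,
              names_prefix = "Cluster_") %>%
  select(-row)                    # drop the helper column

wide
```

The helper column gives every item a unique row position within its cluster, which is what avoids the duplicate-identifier error that spread reported in the question.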
