Is there no syntactic sugar for dynamically creating columns in a pandas DataFrame with MultiIndex columns?

Problem description:

First, here is a pandas DataFrame to clarify my question.

import pandas as pd 
mi = pd.MultiIndex.from_product([["A","B"],["c","d"]], names=['lv1', 'lv2']) 
df1 = pd.DataFrame([[1,2,3,4],[5,6,7,8],[9,10,11,12]],columns=mi) 

This Python code creates a DataFrame (df1) that looks like this:

#input dataframe 
lv1  A       B    
lv2  c   d   c   d
0    1   2   3   4
1    5   6   7   8
2    9  10  11  12

I want to create a 'c*d' column at level lv2 using the data in df1, like this:

#output dataframe after calculation 
lv1  A            B          
lv2  c   d  c*d   c   d  c*d
0    1   2    2   3   4   12
1    5   6   30   7   8   56
2    9  10   90  11  12  132

To solve this, I wrote code like this:

for l1 in mi.levels[0]: 
    df1.loc[:, (l1, "c*d")] = df1.loc[:,(l1,"c")]*df1.loc[:,(l1,"d")] 
df1.sort_index(axis=1, inplace=True)

Although this code almost solves my problem, I would really like to write it without a 'for' statement, like this:

df1.loc[:,(slice(None),"c*d")]=df1.loc[:,(slice(None),"c")]*df1.loc[:,(slice(None),"d")] 

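With this statement, I get a KeyError saying that 'c*d' is missing. Is there no syntactic sugar for this calculation? Or can I get better performance with other code?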

This is possibly the most idiomatic way to do it in pandas, with the explanation in the comments:

output = (df1 
      # "Stack" data, by moving the top level ('lv1') of the 
      # column MultiIndex into row index, 
      # now the rows are a MultiIndex and the columns 
      # are a regular Index. 
      .stack(0) 

      # Since we only have 2 columns now, 'lv2' ('c' & 'd') 
      # we can multiply them together along the row axis. 
      # The assign method takes key=value pairs mapping new column 
      # names to the function used to calculate them. Here we're 
      # wrapping them in a dictionary and unpacking them using ** 
      .assign(**{'c*d': lambda x: x.product(axis=1)}) 

      # Undoes the stack operation, moving 'lv1' back to the 
      # column index, but now as the bottom level of the column index 
      .unstack() 

      # This sets the order of the column index MultiIndex levels. 
      # Since they are named we can use the names, you can also use 
      # their integer positions instead. Here axis=1 references 
      # the column index 
      .swaplevel('lv1', 'lv2', axis=1) 

      # Sort the values in both levels of the column MultiIndex. 
      # This will order them as c, c*d, d which is not what you 
      # specified above, however having a sorted MultiIndex is required 
      # for indexing via .loc[:, (...)] to work properly 
      .sort_index(axis=1) 
     ) 
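As a quick check, the chain above can be run end to end on the question's df1. This is a minimal sketch rather than part of the original answer: it uses prod, which should be equivalent to product here, and the final reindex step (with the hypothetical name ordered_cols) is an added assumption about wanting the c, d, c*d order from the question.

```python
import pandas as pd

# Same input as in the question.
mi = pd.MultiIndex.from_product([["A", "B"], ["c", "d"]], names=["lv1", "lv2"])
df1 = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]], columns=mi)

output = (
    df1.stack(0)                                   # move 'lv1' into the row index
    .assign(**{"c*d": lambda x: x.prod(axis=1)})   # row-wise product of c and d
    .unstack()                                     # move 'lv1' back to the columns
    .swaplevel("lv1", "lv2", axis=1)               # put 'lv1' on top again
    .sort_index(axis=1)                            # sorted order: c, c*d, d
)

# Assumption: restore the c, d, c*d order shown in the question.
ordered_cols = pd.MultiIndex.from_product(
    [["A", "B"], ["c", "d", "c*d"]], names=["lv1", "lv2"]
)
ordered = output.reindex(columns=ordered_cols)
print(ordered)
```

Note that after the reindex the columns are no longer lexically sorted, so keep the sorted `output` if you plan to slice with .loc[:, (...)].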

A slight improvement to your solution:

for l1 in mi.levels[0]: 
    df1.loc[:, (l1, "c*d")] = df1.loc[:,(l1,"c")]*df1.loc[:,(l1,"d")] 
mux = pd.MultiIndex.from_product([df1.columns.levels[0], ['c','d','c*d']]) 
df1 = df1.reindex(columns=mux) 
print(df1)
   A            B          
   c   d  c*d   c   d  c*d
0  1   2    2   3   4   12
1  5   6   30   7   8   56
2  9  10   90  11  12  132

Another solution from jezrael's answer, using stack and unstack:

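mux = pd.MultiIndex.from_product([df1.columns.levels[0], ['c','d','c_d']]) 
# The chain is wrapped in parentheses so it can span several lines.
# Note: sum(axis=1) yields c + d; use x.prod(axis=1) for the product.
df1 = (df1.stack(0) 
          .assign(c_d=lambda x: x.sum(axis=1)) 
          .unstack() 
          .swaplevel(0, 1, axis=1) 
          .reindex(columns=mux)) 
print(df1) 
   A           B         
   c   d c_d   c   d c_d
0  1   2   3   3   4   7
1  5   6  11   7   8  15
2  9  10  19  11  12  23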

Another solution, with xs and mul, starting again from the original df1:

df2 = df1.xs("c", axis=1, level=1).mul(df1.xs("d", axis=1, level=1)) 
df2.columns = pd.MultiIndex.from_product([df2.columns, ['c*d']]) 
print(df2) 
    A    B
  c*d  c*d
0   2   12
1  30   56
2  90  132

mux = pd.MultiIndex.from_product([df2.columns.levels[0], ['c','d','c*d']]) 
df = df1.join(df2).reindex(columns=mux) 
print(df) 
   A            B          
   c   d  c*d   c   d  c*d
0  1   2    2   3   4   12
1  5   6   30   7   8   56
2  9  10   90  11  12  132
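
The xs and mul steps above avoid an explicit loop, so they can be wrapped into a small reusable function. The sketch below is one possible way to do that and is not taken from the answers: the helper name add_product_level and its parameters are hypothetical, and it uses pd.concat plus sort_index instead of join plus reindex, so the columns come back in sorted order (c, c*d, d).

```python
import pandas as pd


def add_product_level(df, left="c", right="d", new="c*d"):
    """Hypothetical helper: add a `new` column under every top-level group."""
    # Cross-sections drop level 1, leaving one column per top-level label.
    prod = df.xs(left, axis=1, level=1) * df.xs(right, axis=1, level=1)
    # Rebuild a two-level column index so the result lines up with df.
    prod.columns = pd.MultiIndex.from_product(
        [prod.columns, [new]], names=list(df.columns.names)
    )
    # Sorting keeps the column MultiIndex usable with .loc slicing.
    return pd.concat([df, prod], axis=1).sort_index(axis=1)


mi = pd.MultiIndex.from_product([["A", "B"], ["c", "d"]], names=["lv1", "lv2"])
df1 = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]], columns=mi)

result = add_product_level(df1)
print(result)
```

The input frame is left unchanged, which makes the helper safe to call repeatedly with different column pairs.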