Is there no syntactic sugar for dynamically creating columns in a MultiIndex pandas DataFrame?
Question:
First, let me show a pandas DataFrame to clarify my question.
import pandas as pd
mi = pd.MultiIndex.from_product([["A","B"],["c","d"]], names=['lv1', 'lv2'])
df1 = pd.DataFrame([[1,2,3,4],[5,6,7,8],[9,10,11,12]],columns=mi)
This Python code creates a DataFrame (df1) like this:
#input dataframe
lv1  A       B
lv2  c   d   c   d
0    1   2   3   4
1    5   6   7   8
2    9  10  11  12
I want to create a 'c*d' column at the lv2 level from the data in df1, like this:
#output dataframe after calculation
lv1  A            B
lv2  c   d  c*d   c   d  c*d
0    1   2    2   3   4   12
1    5   6   30   7   8   56
2    9  10   90  11  12  132
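For this, I wrote code like this:
for l1 in mi.levels[0]:
    df1.loc[:, (l1, "c*d")] = df1.loc[:, (l1, "c")] * df1.loc[:, (l1, "d")]
df1.sort_index(axis=1, inplace=True)
Although this code nearly solves my problem, I would really like to avoid the for loop and write a single statement like this: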
df1.loc[:,(slice(None),"c*d")]=df1.loc[:,(slice(None),"c")]*df1.loc[:,(slice(None),"d")]
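With this statement I get a KeyError saying that 'c*d' is missing. Is there no syntactic sugar for this calculation? Or can I get better performance with some other code?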
Answer
The approach explained below is probably the most idiomatic way to do this in pandas.
output = (df1
          # "Stack" the data by moving the top level ('lv1') of the
          # column MultiIndex into the row index.
          # Now the rows are a MultiIndex and the columns
          # are a regular Index.
          .stack(0)
          # Since we only have 2 columns now, the 'lv2' values ('c' & 'd'),
          # we can multiply them together along the row axis.
          # The assign method takes key=value pairs mapping new column
          # names to the function used to calculate them. Here we're
          # wrapping them in a dictionary and unpacking it with **,
          # because 'c*d' is not a valid keyword argument name.
          .assign(**{'c*d': lambda x: x.product(axis=1)})
          # Undo the stack operation, moving 'lv1' back to the
          # column index, but now as the bottom level of the column index.
          .unstack()
          # This sets the order of the column MultiIndex levels.
          # Since they are named we can use the names; you can also use
          # their integer positions instead. Here axis=1 references
          # the column index.
          .swaplevel('lv1', 'lv2', axis=1)
          # Sort the values in both levels of the column MultiIndex.
          # This will order them as c, c*d, d, which is not what you
          # specified above; however, having a sorted MultiIndex is required
          # for indexing via .loc[:, (...)] to work properly.
          .sort_index(axis=1)
          )
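The last comment above, about sorted columns and .loc, can be illustrated with a small sketch. It rebuilds df1 from the question, adds the 'c*d' columns with the question's own loop (only to keep the sketch independent of the stack/unstack chain), sorts the columns, and then selects 'c*d' across every lv1 group with pd.IndexSlice. The names idx and cd are illustrative and not part of the original answer.

```python
import pandas as pd

mi = pd.MultiIndex.from_product([["A", "B"], ["c", "d"]], names=["lv1", "lv2"])
df1 = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]], columns=mi)

# Add the product column per top-level group (same idea as the question's loop).
for l1 in mi.levels[0]:
    df1[(l1, "c*d")] = df1[(l1, "c")] * df1[(l1, "d")]

# Sort so the column MultiIndex is ordered: c, c*d, d within each lv1 group.
df1 = df1.sort_index(axis=1)

# With sorted columns, a cross-group selection on the second level works.
idx = pd.IndexSlice
cd = df1.loc[:, idx[:, "c*d"]]
print(cd)
```

Here idx[:, "c*d"] is shorthand for the tuple (slice(None), "c*d") used in the question.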
Answer
A slight improvement on your solution:
for l1 in mi.levels[0]:
    df1.loc[:, (l1, "c*d")] = df1.loc[:, (l1, "c")] * df1.loc[:, (l1, "d")]
mux = pd.MultiIndex.from_product([df1.columns.levels[0], ['c','d','c*d']])
df1 = df1.reindex(columns=mux)
print (df1)
   A            B
   c   d  c*d   c   d  c*d
0  1   2    2   3   4   12
1  5   6   30   7   8   56
2  9  10   90  11  12  132
Another solution, based on jezrael's answer, using stack and unstack:
# Assumes df1 is the original frame with only the 'c' and 'd' columns.
mux = pd.MultiIndex.from_product([df1.columns.levels[0], ['c','d','c_d']])
df1 = (df1.stack(0)
          .assign(c_d=lambda x: x.prod(axis=1))
          .unstack()
          .swaplevel(0, 1, axis=1)
          .reindex(columns=mux))
print (df1)
   A            B
   c   d  c_d   c   d  c_d
0  1   2    2   3   4   12
1  5   6   30   7   8   56
2  9  10   90  11  12  132
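A third option uses xs to select the 'c' and 'd' columns across lv1, multiplies them, and joins the result back:
df2 = df1.xs("c", axis=1, level=1).mul(df1.xs("d", axis=1, level=1))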
df2.columns = pd.MultiIndex.from_product([df2.columns, ['c*d']])
print (df2)
     A    B
   c*d  c*d
0    2   12
1   30   56
2   90  132
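One way to see why xs helps here: pandas aligns DataFrames on their column labels before multiplying. The sketch below, which rebuilds df1 from the question, compares the two selections. The names left, right, misaligned and aligned are illustrative only.

```python
import pandas as pd

mi = pd.MultiIndex.from_product([["A", "B"], ["c", "d"]], names=["lv1", "lv2"])
df1 = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]], columns=mi)

# .loc keeps both column levels, so the operands carry different labels:
# (A, c), (B, c) on the left and (A, d), (B, d) on the right.
left = df1.loc[:, (slice(None), "c")]
right = df1.loc[:, (slice(None), "d")]
misaligned = left * right  # union of the four labels, none shared, so all NaN

# xs drops the lv2 level, so both operands are labelled A, B and line up.
aligned = df1.xs("c", axis=1, level="lv2") * df1.xs("d", axis=1, level="lv2")

print(misaligned.isna().all().all())
print(aligned)
```

So even apart from the KeyError on the left-hand side, the right-hand side of the one-liner in the question would not yield the products, because no column label is shared between the two slices.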
mux = pd.MultiIndex.from_product([df2.columns.levels[0], ['c','d','c*d']])
df = df1.join(df2).reindex(columns=mux)
print (df)
   A            B
   c   d  c*d   c   d  c*d
0  1   2    2   3   4   12
1  5   6   30   7   8   56
2  9  10   90  11  12  132