熊猫创建powerset和平均数据
问题描述:
对于每个日期date
我想要从品牌的每个组合中获得金额的平均值。熊猫创建powerset和平均数据
例如,我有一个数据帧:
df1 =
Company Brand Date Amount
A 1 01/01/2015 3
A 1 01/02/2015 4
A 1 01/03/2015 2
A 2 01/01/2015 7
A 2 01/02/2015 2
A 2 01/03/2015 1
A 3 01/01/2015 6
A 3 01/02/2015 3
A 3 01/03/2015 1
和我想要的结果是以下DF,其中所述的量是组合组的平均值:
result =
Company Brand Date Amount
A 1 01/01/2015 3
A 1 01/02/2015 4
A 1 01/03/2015 2
A 2 01/01/2015 7
A 2 01/02/2015 2
A 2 01/03/2015 1
A 3 01/01/2015 6
A 3 01/02/2015 3
A 3 01/03/2015 1
A 1_2 01/01/2015 5
A 1_2 01/02/2015 3
A 1_2 01/03/2015 1.5
A 2_3 01/01/2015 6.5
A 2_3 01/02/2015 2.5
A 2_3 01/03/2015 1
A 1_3 01/01/2015 4.5
A 1_3 01/02/2015 3.5
A 1_3 01/03/2015 1.5
A 1_2_3 01/01/2015 5.33
A 1_2_3 01/02/2015 3
A 1_2_3 01/03/2015 1.33
目前,我用groupby来做这个循环,但速度很慢。
d = pd.DataFrame()
comb = ['1_2','1_3','2_3','1_2_3']
for c in comb:
new = df1.loc[(df1.Brand.isin(map(int,c.split('_')))].groupby(['Company','Date'])['Amount'].mean().reset_index()
new.insert(1,'Group',c)
d = d.append(new)
df = df.append(d)
但是,我正在与千家独特的公司和数百万行,所以这是非常缓慢的。有没有办法加快这一点?
答
from itertools import combinations
# define a generator to use combinations
# (iterate through combinations of a specific length)
# and iterate through all combinations
def combo(iterable):
for r in range(1, len(iterable) + 1):
for c in combinations(iterable, r):
yield c
df.groupby(['Company', 'Date']).Brand.unique().apply(
lambda x: pd.Series(
{'_'.join(map(str, blist)): df.query('Brand in @blist').Amount.sum() for blist in combo(x)},
)
).rename_axis('Combo', 1).stack().reset_index(name='Amount')
答
import pandas as pd
from itertools import chain, combinations
def powerset(iterable):
"powerset([1,2,3]) -->() (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
s = list(iterable)
return chain.from_iterable(combinations(s, r) for r in range(1, len(s)+1))
a = [['A', 'b1', '01/01/2015', 3],
['A', 'b1', '01/02/2015', 4],
['A', 'b1', '01/03/2015', 2],
['A', 'b2', '01/01/2015', 7],
['A', 'b2', '01/02/2015', 2],
['A', 'b2', '01/03/2015', 1],
['A', 'b3', '01/01/2015', 6],
['A', 'b3', '01/02/2015', 3],
['A', 'b3', '01/03/2015', 1]]
df = pd.DataFrame(a, columns=['Company', 'Brand', 'Date', 'Amount'])
ps = powerset(['b1', 'b2', 'b3'])
# create new dataframe to append to
new_df = pd.DataFrame()
for s in ps:
view = df[df.Brand.isin(s)].groupby(['Company', 'Date']).mean()
view['Brand'] = '_'.join(s)
new_df = new_df.append(view)
输出看起来像:
Amount Brand
Company Date
A 01/01/2015 3.000000 b1
01/02/2015 4.000000 b1
01/03/2015 2.000000 b1
01/01/2015 7.000000 b2
01/02/2015 2.000000 b2
01/03/2015 1.000000 b2
01/01/2015 6.000000 b3
01/02/2015 3.000000 b3
01/03/2015 1.000000 b3
01/01/2015 5.000000 b1_b2
01/02/2015 3.000000 b1_b2
01/03/2015 1.500000 b1_b2
01/01/2015 4.500000 b1_b3
01/02/2015 3.500000 b1_b3
01/03/2015 1.500000 b1_b3
01/01/2015 6.500000 b2_b3
01/02/2015 2.500000 b2_b3
01/03/2015 1.000000 b2_b3
01/01/2015 5.333333 b1_b2_b3
01/02/2015 3.000000 b1_b2_b3
01/03/2015 1.333333 b1_b2_b3