Setting pandas DataFrame values based on row and column conditions

Question:

Basically, I have a DataFrame like this:

    month  taken  score
1       1      2     23
2       1      1     34
3       1      2     12
4       1      2     59
5       2      1     12
6       2      2     23
7       2      1     43
8       2      2     45
9       3      1     43
10      3      2     43
11      4      1     23
12      4      2     94
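
To reproduce the example, the table above can be rebuilt as follows. This is only a minimal sketch: the name df_report follows the code later in the question, and the 1-based index is read off the table rather than stated anywhere.

```python
import pandas as pd

# Sample data copied from the table above; the index starts at 1 to match it.
df_report = pd.DataFrame(
    {
        "month": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4],
        "taken": [2, 1, 2, 2, 1, 2, 1, 2, 1, 2, 1, 2],
        "score": [23, 34, 12, 59, 12, 23, 43, 45, 43, 43, 23, 94],
    },
    index=range(1, 13),
)
print(df_report)
```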

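I want the 'score' column changed to 100 where taken == 2 persists until the last day of the month. So not every occurrence of taken == 2 gets its score set to 100: the score stays unchanged if any later day in that month has taken == 1.

So the result I want is: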

    month  taken  score
1       1      2     23
2       1      1     34
3       1      2    100
4       1      2    100
5       2      1     12
6       2      2     23
7       2      1     43
8       2      2    100
9       3      1     43
10      3      2     43
11      3      1     23
12      3      2    100
13      4      1     32
14      4      2    100

I wrote this code, which I feel should do it:

# iterate through months
for month in range(12):
    # iterate through scores
    for score in range(len(df_report.loc[df_report['month'] == month+1])):
        # starting from the bottom of that month, if 'taken' == 2...
        if df_report.loc[df_report.month==month+1, 'taken'].iloc[-score-1] == 2:
            # then set the score to 100
            df_report.loc[df_report.month==month+1, 'score'].iloc[-score-2] = 100
        # if you run into a 'taken' == 1, move on to next month
        else: break

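However, this doesn't seem to change any values, even though it raises no error... it also doesn't give me the warning about setting a value on a copy of a DataFrame.

Could anyone explain what I'm doing wrong?

If I had to guess, it would be that you're setting the new values on a copy. Chaining loc calls isn't the best idea. –

I think you're right, but how do I fix it? Also, if .loc isn't a copy and .iloc isn't a copy, then why is the .iloc of a .loc a copy?! – James

Answer

The reason your values aren't updated is that the assignment through iloc updates the copy returned by the preceding loc call, so the original is left untouched.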
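
To make the difference concrete, here is a small sketch on a hypothetical toy frame (not the data from the question). It contrasts the chained form with a single .loc assignment; the variable names after_chained and last_label are only for illustration.

```python
import warnings

import pandas as pd

# Hypothetical toy frame, not the asker's data.
df = pd.DataFrame({"month": [1, 1, 2], "score": [10, 20, 30]}, index=[1, 2, 3])

# Chained form: .loc with a boolean mask builds a new Series, and .iloc
# then writes into that temporary object instead of into df.
with warnings.catch_warnings():
    # Whether pandas warns here depends on the version, so silence it.
    warnings.simplefilter("ignore")
    df.loc[df.month == 1, "score"].iloc[-1] = 100
after_chained = df["score"].tolist()
print(after_chained)  # [10, 20, 30] -> the original is unchanged

# Single-step form: find the row label first, then assign through one .loc call.
last_label = df[df.month == 1].index[-1]
df.loc[last_label, "score"] = 100
print(df["score"].tolist())  # [10, 100, 30]
```

The boolean mask is what forces the copy: selecting rows with a mask always produces a new object, so anything chained onto it never reaches the original frame.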

Here's how I would tackle this. First, define a function foo.

def foo(df):
    # Walk the group's rows from the last one upwards.
    for i in reversed(df.index):
        # Stop at the first row whose 'taken' is not 2.
        if df.loc[i, 'taken'] != 2:
            break
        df.loc[i, 'score'] = 100
    return df

Now, group by month and call foo:

# group_keys=False keeps the original flat index on recent pandas versions
df = df.groupby('month', group_keys=False).apply(foo)
print(df)
    month  taken  score
1       1      2     23
2       1      1     34
3       1      2    100
4       1      2    100
5       2      1     12
6       2      2     23
7       2      1     43
8       2      2    100
9       3      1     43
10      3      2    100
11      4      1     23
12      4      2    100

Obviously, apply has its drawbacks, but I can't think of a vectorized way to handle this problem.
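
For what it's worth, one possible route without apply is a reversed cumulative minimum per month. The following is a rough sketch rather than a tested replacement for foo: it assumes the index is unique and that rows are already ordered by day within each month, and the names is_two and trailing are made up for illustration.

```python
import pandas as pd

# Sample data from the question, rebuilt here so the sketch is self-contained.
df = pd.DataFrame(
    {
        "month": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4],
        "taken": [2, 1, 2, 2, 1, 2, 1, 2, 1, 2, 1, 2],
        "score": [23, 34, 12, 59, 12, 23, 43, 45, 43, 43, 23, 94],
    },
    index=range(1, 13),
)

# 1 where taken == 2, else 0, with the rows in reverse order.
is_two = df["taken"].eq(2).astype(int).iloc[::-1]

# Within each month, the cumulative minimum (read from the last row upwards)
# stays 1 only until the first row that is not 2: that is the trailing run.
trailing = is_two.groupby(df["month"].iloc[::-1]).cummin().astype(bool)

# Put the flags back in the frame's row order and assign in one step.
trailing = trailing.reindex(df.index)
df.loc[trailing, "score"] = 100
print(df)
```

On the sample data this marks rows 3, 4, 8, 10 and 12, the same rows that foo changes.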

Me neither. I can get rid of the for loop, but not the apply with groupby – Dark

Thanks, this works perfectly – James

Answer

You can do

import numpy as np
import pandas as pd

def get_value(x):
    s = x['taken']
    # Mask every run of two or more equal consecutive values and set those
    # scores to 100 using np.where (in the sample data such runs only occur
    # at the end of a month)
    mask = s.ne(s.shift()).cumsum().duplicated(keep=False)
    news = np.where(mask, 100, x['score'])

    # If the last value in the group is 2, set the last score to 100
    if s.iloc[-1] == 2:
        news[-1] = 100
    return pd.Series(news)

df['score'] = df.groupby('month').apply(get_value).values

Output:

 
    month  taken  score
1       1      2     23
2       1      1     34
3       1      2    100
4       1      2    100
5       2      1     12
6       2      2     23
7       2      1     43
8       2      2    100
9       3      1     43
10      3      2    100
11      4      1     23
12      4      2    100

Almost the same speed, but @coldspeed is the winner

ndf = pd.concat([df]*10000).reset_index(drop=True) 

%%timeit 
ndf['score'] = ndf.groupby('month').apply(foo) 
10 loops, best of 3: 40.8 ms per loop 


%%timeit 
ndf['score'] = ndf.groupby('month').apply(get_value).values 
10 loops, best of 3: 42.6 ms per loop

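Surely this is better than simple iteration? –

I don't know, I need to check the speed – Dark

@cᴏʟᴅsᴘᴇᴇᴅ It's very, very strange. No matter how big the DataFrame is on my PC, the difference is 2 ms. – Dark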
