Python Cohort Analysis in Practice (Personal Notes)

WeChat official account: 小程在线

Blog: 程志伟的博客

The complete script is available on the official account.

Python 3.7.6 (default, Jan 8 2020, 20:23:39) [MSC v.1916 64 bit (AMD64)]
Type "copyright", "credits" or "license" for more information.

IPython 7.12.0 -- An enhanced Interactive Python.

Load the data

import pandas as pd
df = pd.read_excel(r'F:\Python\同期群订单数据.xlsx')
df.head()
Out[1]:
   平台          店铺名称   客户昵称     付款时间               订单状态   支付金额  购买数量  省份
0  程志伟的博客  小程在线  入倩出入深  2019-09-01 00:10:04  交易成功  15.2      1     江苏省
1  程志伟的博客  小程在线  愛hya爱     2019-09-01 00:14:52  交易成功   8.4      1     广东省
2  程志伟的博客  小程在线  象95象大    2019-09-01 02:17:15  交易成功   8.4      1     辽宁省
3  程志伟的博客  小程在线  卡哇伊氛十  2019-09-01 03:37:28  交易成功  22.0      1     广西壮族自治区
4  程志伟的博客  小程在线  一只羊哈阿  2019-09-01 08:53:50  交易成功  85.0      1     辽宁省

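# Inspect the data. Columns: 平台 = platform, 店铺名称 = store name, 客户昵称 = customer nickname, 付款时间 = payment time, 订单状态 = order status, 支付金额 = amount paid, 购买数量 = quantity, 省份 = province.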

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42713 entries, 0 to 42712
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 平台 42713 non-null object
1 店铺名称 42713 non-null object
2 客户昵称 42713 non-null object
3 付款时间 40339 non-null datetime64[ns]
4 订单状态 42713 non-null object
5 支付金额 42713 non-null float64
6 购买数量 42713 non-null int64
7 省份 42713 non-null object
dtypes: datetime64[ns](1), float64(1), int64(1), object(5)
memory usage: 2.6+ MB

Rows with a missing 付款时间 (payment time) mostly have the order status "交易失败" (transaction failed):
df.loc[df['付款时间'].isnull(),:].head()
Out[3]:
       平台          店铺名称   客户昵称     付款时间  订单状态   支付金额  购买数量  省份
40339  程志伟的博客  小程在线  爱购物nx     NaT      交易失败   97.8      1     浙江省
40340  程志伟的博客  小程在线  975ay        NaT      交易失败  117.3      2     浙江省
40341  程志伟的博客  小程在线  101呆阿      NaT      交易失败  144.5      2     新疆维吾尔自治区
40342  程志伟的博客  小程在线  489bt        NaT      交易失败   92.7      1     江苏省
40343  程志伟的博客  小程在线  姚琳儿姚姚   NaT      交易失败    8.4      1     广东省

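Counting the statuses confirms it: every order with a missing payment time is "交易失败" (transaction failed), and every order with a payment time is "交易成功" (transaction succeeded):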
df.loc[df['付款时间'].isnull(),:]['订单状态'].value_counts()
Out[4]:
交易失败 2374
Name: 订单状态, dtype: int64

df.loc[df['付款时间'].isnull()==False,:]['订单状态'].value_counts()
Out[5]:
交易成功 40339
Name: 订单状态, dtype: int64
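
The two value_counts() calls above can be folded into a single cross-tabulation. The following is a minimal sketch on invented toy rows (only the column names 付款时间 and 订单状态 come from these notes; the values are made up for illustration), not the check the notes actually ran:

```python
import pandas as pd

# Toy rows invented for illustration; only the column names match the notes.
toy = pd.DataFrame({
    '付款时间': pd.to_datetime(['2019-09-01', None, '2019-09-02', None]),
    '订单状态': ['交易成功', '交易失败', '交易成功', '交易失败'],
})

# Rows: is the payment time missing? Columns: order status.
check = pd.crosstab(toy['付款时间'].isnull(), toy['订单状态'])
print(check)

# True only if "payment time missing" and "transaction failed" coincide row by row.
consistent = bool((toy['付款时间'].isnull() == (toy['订单状态'] == '交易失败')).all())
print(consistent)
```

If the off-diagonal cells of the cross-tabulation are zero, filtering on the payment time and filtering on the order status select the same rows.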

We only need the successful orders. Taking a .copy() makes order an independent DataFrame, so adding a column to it later does not raise SettingWithCopyWarning:
order = df.loc[df['付款时间'].isnull()==False,:].copy()
order.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 40339 entries, 0 to 40338
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 平台 40339 non-null object
1 店铺名称 40339 non-null object
2 客户昵称 40339 non-null object
3 付款时间 40339 non-null datetime64[ns]
4 订单状态 40339 non-null object
5 支付金额 40339 non-null float64
6 购买数量 40339 non-null int64
7 省份 40339 non-null object
dtypes: datetime64[ns](1), float64(1), int64(1), object(5)
memory usage: 2.8+ MB

A "year-month" string label (时间标签) is more convenient to work with:

order['时间标签'] = order['付款时间'].astype(str).str[:7]
order['时间标签'].value_counts().sort_index()
Out[7]:
2019-09 2201
2019-10 8096
2019-11 6050
2019-12 6760
2020-01 7443
2020-02 9789
Name: 时间标签, dtype: int64
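
The month label is built above by slicing the string form of the timestamp. As a hedged aside, the same label can also be derived through the datetime accessor. The sketch below uses three invented timestamps and compares the slicing approach with `dt.strftime` and `dt.to_period`:

```python
import pandas as pd

# Invented timestamps, for illustration only.
paid = pd.Series(pd.to_datetime([
    '2019-09-01 00:10:04',
    '2019-10-15 12:00:00',
    '2020-02-29 23:59:59',
]))

labels_slice = paid.astype(str).str[:7]              # approach used in the notes
labels_fmt = paid.dt.strftime('%Y-%m')               # explicit format string
labels_period = paid.dt.to_period('M').astype(str)   # monthly period, then string

print(labels_slice.tolist())
print(labels_fmt.tolist())
print(labels_period.tolist())
```

All three give zero-padded 'YYYY-MM' strings, so they sort chronologically when sorted as text.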

The order data runs from September 2019 to February 2020. We use October 2019 as a template and build a single row of the cohort table first.

month = '2019-10'
sample = order.loc[order['时间标签']==month,:]
print('Orders in October:',len(sample))
sample.head()
# Total amount paid per customer; one row per distinct customer
sample_c = sample.groupby('客户昵称')['支付金额'].sum().reset_index()
print('Customers in October:',len(sample_c))
sample_c.head()
Orders in October: 8096
Customers in October: 7336
Out[8]:
   客户昵称  支付金额
0  0000栗     4.2
1  000ab     16.8
2  000al     16.8
3  000bb     16.8
4  000il     16.8

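So in October 2019 a total of 7336 customers placed 8096 orders.
Next we need the number of new customers in each month. Whether a customer is new has to be verified by matching against every earlier month. For October 2019 the only earlier data is September 2019: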

history = order.loc[order['时间标签']=='2019-09']
history.head()
Out[9]:
   平台          店铺名称   客户昵称     付款时间               订单状态   支付金额  购买数量  省份              时间标签
0  程志伟的博客  小程在线  入倩出入深  2019-09-01 00:10:04  交易成功  15.2      1     江苏省            2019-09
1  程志伟的博客  小程在线  愛hya爱     2019-09-01 00:14:52  交易成功   8.4      1     广东省            2019-09
2  程志伟的博客  小程在线  象95象大    2019-09-01 02:17:15  交易成功   8.4      1     辽宁省            2019-09
3  程志伟的博客  小程在线  卡哇伊氛十  2019-09-01 03:37:28  交易成功  22.0      1     广西壮族自治区    2019-09
4  程志伟的博客  小程在线  一只羊哈阿  2019-09-01 08:53:50  交易成功  85.0      1     辽宁省            2019-09

Match against the history and keep only the customers who are new in October 2019:
sample_c = sample_c.loc[sample_c['客户昵称'].isin(history['客户昵称'])==False,:]
print('New customers in 2019-10:',len(sample_c))
sample_c.head()
New customers in 2019-10: 7083
Out[10]:
   客户昵称  支付金额
0  0000栗     4.2
1  000ab     16.8
2  000al     16.8
3  000bb     16.8
4  000il     16.8
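
The history match above tests one month against the months before it. An equivalent way to find new customers for every month at once is to take each customer's first purchase month. This is a sketch on invented nicknames (a, b, c, d), assuming the 'YYYY-MM' labels sort chronologically as strings; it is not the code used in these notes:

```python
import pandas as pd

# Invented orders, for illustration; column names follow the notes.
toy = pd.DataFrame({
    '客户昵称': ['a', 'b', 'a', 'c', 'b', 'd'],
    '时间标签': ['2019-09', '2019-09', '2019-10', '2019-10', '2019-11', '2019-10'],
})

# The earliest label per customer is the month in which that customer is "new".
first_month = toy.groupby('客户昵称')['时间标签'].min()

# Number of new customers per month.
new_per_month = first_month.value_counts().sort_index()
print(new_per_month)

# Nicknames that are new in October 2019.
new_oct = sorted(first_month[first_month == '2019-10'].index)
print(new_oct)
```

Customer a buys in both September and October but counts as new only in September, which mirrors the isin() filter against history.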

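Then match these new customers against the customer nicknames of each month after October to count how many came back:
re=[]  # note: this name shadows the standard-library re module
for i in ['2019-11','2019-12','2020-01','2020-02']:
    next_month = order.loc[order['时间标签']==i,:]
    target_user = sample_c.loc[sample_c['客户昵称'].isin(next_month['客户昵称'])==True,:]
    re.append([i+' retained:',len(target_user)])
re

Insert the month's new-customer count at the front of the list:
re.insert(0,['2019-10 new customers:',len(sample_c)])
re
Out[11]:
[['2019-10 new customers:', 7083],
 ['2019-11 retained:', 539],
 ['2019-12 retained:', 428],
 ['2020-01 retained:', 414],
 ['2020-02 retained:', 426]]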

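Conclusion: October 2019 brought in 7083 new customers. In the following month (November) 539 of them were retained, and the count then fell to 428 in December and 414 in January.
In February 2020 the number of returning customers rose slightly over the previous month, from 414 to 426.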
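
The single-row procedure above can be wrapped in a function so that it works for any month. The sketch below is one possible packaging, not the author's script: the function name `cohort_row` and the toy orders are hypothetical, and "earlier months" is taken to mean every label that sorts before the cohort month.

```python
import pandas as pd

def cohort_row(orders, month, later_months):
    """Return [new customers in `month`, retained in each later month]."""
    this_month = set(orders.loc[orders['时间标签'] == month, '客户昵称'])
    # Anyone who bought in an earlier month is not new ('YYYY-MM' sorts as text).
    earlier = set(orders.loc[orders['时间标签'] < month, '客户昵称'])
    new_customers = this_month - earlier
    row = [len(new_customers)]
    for m in later_months:
        buyers = set(orders.loc[orders['时间标签'] == m, '客户昵称'])
        row.append(len(new_customers & buyers))
    return row

# Invented orders, for illustration only.
toy = pd.DataFrame({
    '客户昵称': ['a', 'b', 'a', 'c', 'd', 'c', 'b', 'd', 'c'],
    '时间标签': ['2019-09', '2019-09', '2019-10', '2019-10', '2019-10',
                 '2019-11', '2019-11', '2019-12', '2019-12'],
})

row = cohort_row(toy, '2019-10', ['2019-11', '2019-12'])
print(row)
```

In the toy data, c and d are new in October; c returns in November, and both return in December.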

month_lst = order['时间标签'].unique()
month_lst
Out[12]:
array(['2019-09', '2019-10', '2019-11', '2019-12', '2020-01', '2020-02'],
dtype=object)

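Loop over every month in month_lst and combine the rows to finish the script. The code itself is not reproduced here (the complete script is on the official account); final holds the result:
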
final
Out[13]:
         当月新增  +1月  +2月  +3月  +4月  +5月
2019-09      2042   253   219   167   159   165
2019-10       253    89    69    76    67     0
2019-11       758   193   195   193     0     0
2019-12      1043   334   299     0     0     0
2020-01      1434   429     0     0     0     0
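
Since the loop itself is not shown in these notes, here is a hedged sketch of one way to build a table of this shape. It is not the author's script: the function name `cohort_table` and the toy orders are hypothetical, it keeps a row for the last month as well, and it leaves months that have not happened yet as NaN rather than 0.

```python
import pandas as pd

def cohort_table(orders, user_col='客户昵称', month_col='时间标签'):
    """Rows: cohort month. Columns: new customers, then retained after +1, +2, ... months."""
    months = sorted(orders[month_col].unique())
    seen = set()   # customers who bought in any earlier month
    rows = []
    for k, month in enumerate(months):
        buyers = set(orders.loc[orders[month_col] == month, user_col])
        new_customers = buyers - seen
        row = [len(new_customers)]
        for later in months[k + 1:]:
            later_buyers = set(orders.loc[orders[month_col] == later, user_col])
            row.append(len(new_customers & later_buyers))
        # Pad offsets that lie beyond the last month of data.
        row += [None] * (len(months) - len(row))
        rows.append(row)
        seen |= buyers
    columns = ['当月新增'] + ['+%d月' % n for n in range(1, len(months))]
    return pd.DataFrame(rows, index=months, columns=columns)

# Invented orders, for illustration only.
toy = pd.DataFrame({
    '客户昵称': ['a', 'b', 'a', 'c', 'd', 'c', 'b', 'd', 'c'],
    '时间标签': ['2019-09', '2019-09', '2019-10', '2019-10', '2019-10',
                 '2019-11', '2019-11', '2019-12', '2019-12'],
})

table = cohort_table(toy)
print(table)
```

Tracking `seen` across iterations means each cohort is checked against all earlier months, not just the one immediately before it.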

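In practice the result is reported as retention rates, which takes only a little more processing (当月新增 is the month's new-customer count; +1月 means one month later, and so on):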
result = final.divide(final['当月新增'],axis=0).iloc[:,1:]
result['当月新增'] = final['当月新增']
result
Out[14]:
             +1月      +2月      +3月      +4月      +5月  当月新增
2019-09  0.123898  0.107248  0.081783  0.077865  0.080803      2042
2019-10  0.351779  0.272727  0.300395  0.264822  0.000000       253
2019-11  0.254617  0.257256  0.254617  0.000000  0.000000       758
2019-12  0.320230  0.286673  0.000000  0.000000  0.000000      1043
2020-01  0.299163  0.000000  0.000000  0.000000  0.000000      1434
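
The zeros in the lower-right corner are months that had not happened yet, not true 0% retention. As a sketch, the counts copied from the table `final` above can be divided in the same way and the unobserved cells blanked out. The masking step and the variable `n_months` (6 months of data, 2019-09 to 2020-02) are additions for illustration, not part of the original script:

```python
import numpy as np
import pandas as pd

# Counts copied from the table `final` shown above.
final = pd.DataFrame(
    [[2042, 253, 219, 167, 159, 165],
     [253, 89, 69, 76, 67, 0],
     [758, 193, 195, 193, 0, 0],
     [1043, 334, 299, 0, 0, 0],
     [1434, 429, 0, 0, 0, 0]],
    index=['2019-09', '2019-10', '2019-11', '2019-12', '2020-01'],
    columns=['当月新增', '+1月', '+2月', '+3月', '+4月', '+5月'],
)

# Retention rate = retained customers / new customers of the cohort.
rates = final.iloc[:, 1:].div(final['当月新增'], axis=0)

# Row i can only be observed for (n_months - 1 - i) later months.
n_months = 6
offsets = np.arange(1, rates.shape[1] + 1)
observed = (n_months - 1) - np.arange(len(rates))
unobserved = offsets[None, :] > observed[:, None]
rates = rates.mask(unobserved)   # unobserved cells become NaN

print((rates * 100).round(1))
```

Masking by position rather than by value keeps a genuine 0% retention distinguishable from a month with no data.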

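• Reading across a row, churn in the first month is severe: even the best cohort (September 2019, 253 of 2042) retains only about 12% in the following month. Retention then declines steadily and settles around 6%, as in the October 2019 walkthrough (539 of 7083 is about 7.6% in November, then roughly 6% in the months after).
• Reading down the columns, September 2019 has the fewest new customers, only 2042, but the audience is relatively well targeted and its retention beats the other months.
• These observations follow the step-by-step figures. The 2019-10 row of the table above (253 new customers) does not match the 7083 new customers found in the walkthrough, so the table should be rechecked against the script.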
