Movie Data Analysis------
1. Import packages:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
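2. Load the user data:
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
# engine='python' is needed because the multi-character separator '::' is parsed as a regular expression
users = pd.read_table('ml-1m/users.dat', sep='::', header=None, names=unames, engine='python')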
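3. Load the movie ratings table:
rating_name = ['user_id', 'movie_id', 'rating', 'timestamp']
# engine='python' is needed because the multi-character separator '::' is parsed as a regular expression
ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header=None, names=rating_name, engine='python')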
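The loading steps rely on pandas treating a separator longer than one character as a regular expression, which only the Python parsing engine supports. A minimal sketch of that behaviour follows, using a made-up three-line sample in the same layout instead of the real file. The names `sample` and `sample_ratings` are illustrative only.

```python
import io
import pandas as pd

# Hypothetical sample in the ratings.dat layout (not real MovieLens rows).
sample = io.StringIO(
    "1::10::5::978300760\n"
    "1::20::3::978302109\n"
    "2::10::4::978301968\n"
)

rating_cols = ['user_id', 'movie_id', 'rating', 'timestamp']

# '::' is longer than one character, so pandas interprets it as a regex;
# passing engine='python' selects the parser that supports this explicitly.
sample_ratings = pd.read_csv(sample, sep='::', header=None,
                             names=rating_cols, engine='python')
print(sample_ratings)
```

Without `engine='python'`, pandas falls back to the Python engine on its own but emits a `ParserWarning`.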
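4. Load the movie information table:
movie_name = ['movie_id', 'title', 'genres']
movies = pd.read_table('ml-1m/movies.dat', sep='::', header=None, names=movie_name, engine='python')
5. Merge the tables:
# With no explicit key, merge joins on the columns the two tables share (user_id, then movie_id)
data = pd.merge(pd.merge(users, ratings), movies)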
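The nested merge works because `pd.merge`, when given no `on=` argument, joins on the column names the two tables have in common and uses an inner join by default. A small sketch with made-up tables (all `*_demo` names are hypothetical) shows that naming the keys explicitly gives the same result:

```python
import pandas as pd

# Toy stand-ins for the three tables (hypothetical values).
users_demo = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
ratings_demo = pd.DataFrame({'user_id': [1, 1, 2],
                             'movie_id': [10, 20, 10],
                             'rating': [5, 3, 4]})
movies_demo = pd.DataFrame({'movie_id': [10, 20],
                            'title': ['Movie A', 'Movie B']})

# Implicit keys: the shared column is found automatically at each step.
implicit = pd.merge(pd.merge(users_demo, ratings_demo), movies_demo)

# Explicit keys: the same joins, with the key columns spelled out.
explicit = pd.merge(pd.merge(users_demo, ratings_demo, on='user_id'),
                    movies_demo, on='movie_id')

print(explicit)
```

Spelling out `on=` is slightly longer but guards against an accidental join on an unrelated column that happens to share a name.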
6. Compute each movie's mean rating, split by gender:
ratings_by_gender = data.pivot_table(values='rating', index='title', columns='gender', aggfunc='mean')
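To make the shape of the pivoted table concrete, here is a sketch on a few made-up ratings (`data_demo` and `demo_means` are hypothetical names). Each title becomes a row, each gender a column, and a title with no ratings from one gender gets `NaN` in that column.

```python
import pandas as pd

# Hypothetical merged rows: one line per rating.
data_demo = pd.DataFrame({
    'title':  ['Movie A', 'Movie A', 'Movie A', 'Movie B'],
    'gender': ['F', 'F', 'M', 'M'],
    'rating': [5, 3, 4, 2],
})

# Rows are titles, columns are genders, cells are mean ratings.
demo_means = data_demo.pivot_table(values='rating', index='title',
                                   columns='gender', aggfunc='mean')
print(demo_means)
```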
7. Split the ratings into male and female users (in this dataset 'M' is male and 'F' is female):
by_boy_movies = data[data.gender == 'M']
by_girl_movies = data[data.gender == 'F']
8. Count how many male users rated each movie and store the counts in a new table:
by_boy_movies_sum = by_boy_movies.groupby('title').size()
df_by_boy_movies_sum = pd.DataFrame({'M_sum': by_boy_movies_sum})
9. From the table in step 8 (df_by_boy_movies_sum), keep only the movies with more than 250 ratings, so that movies rated by too few male users do not distort the results:
df_by_boy_movies_hot = df_by_boy_movies_sum.loc[df_by_boy_movies_sum.M_sum > 250]
10. Select the ten movies most popular with male users:
df_by_boy_movies_hot.sort_values(by='M_sum', ascending=False).head(10)
11. In the same way, select the ten movies most popular with female users:
by_girl_movies_sum = by_girl_movies.groupby('title').size()
df_by_girl_movies_sum = pd.DataFrame({'F_sum': by_girl_movies_sum})
df_by_girl_movies_hot = df_by_girl_movies_sum.loc[df_by_girl_movies_sum.F_sum > 250]
df_by_girl_movies_hot.sort_values(by='F_sum', ascending=False).head(10)
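Steps 7 to 11 repeat the same sequence for each gender: filter, count per title, apply a threshold, sort. One possible way to avoid the duplication is a small helper. The function `top_titles` below is a hypothetical sketch, not part of the original analysis, and it is exercised on made-up rows with a threshold of 1 instead of 250.

```python
import pandas as pd

def top_titles(frame, gender, min_count=250, n=10):
    """Return the n most-rated titles for one gender, as a count Series.

    Only titles with more than `min_count` ratings from that gender are kept.
    """
    counts = frame[frame.gender == gender].groupby('title').size()
    return counts[counts > min_count].sort_values(ascending=False).head(n)

# Hypothetical merged rows: one line per rating.
demo = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'B', 'C', 'C', 'A'],
    'gender': ['F', 'F', 'F', 'F', 'F', 'F', 'M'],
})

# With the threshold lowered to 1, title B (a single rating) is dropped.
print(top_titles(demo, 'F', min_count=1))
```

The helper returns a Series, whereas the steps above wrap the counts in a DataFrame; `to_frame(name=...)` converts one into the other if a named column is needed.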
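12. Select the movies that are popular with both groups and look up their mean ratings by gender:
# Joining the two count tables side by side leaves NaN where a movie passed the threshold for only one group
b = pd.concat([df_by_boy_movies_hot, df_by_girl_movies_hot], axis=1)
by_hot_movies = b.dropna()
by_movies_hot = ratings_by_gender.loc[ratings_by_gender.index.isin(by_hot_movies.index)]
by_movies_hot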
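The `concat` plus `dropna` combination in step 12 amounts to intersecting the two sets of titles. The sketch below uses made-up counts (all names ending in `_demo` are hypothetical) to show the intermediate table and to check that `Index.intersection` selects the same titles.

```python
import pandas as pd

# Hypothetical per-gender count tables after the threshold has been applied.
m_hot_demo = pd.DataFrame({'M_sum': [300, 400]}, index=['A', 'B'])
f_hot_demo = pd.DataFrame({'F_sum': [260, 500]}, index=['B', 'C'])

# Side-by-side join: titles missing from one table get NaN in its column.
side_by_side = pd.concat([m_hot_demo, f_hot_demo], axis=1)

# Dropping rows with NaN keeps only titles present in both tables.
both_demo = side_by_side.dropna()

# The same titles, obtained directly from the two indexes.
common = m_hot_demo.index.intersection(f_hot_demo.index)

# Use the surviving titles to filter a table of mean ratings.
means_demo = pd.DataFrame({'F': [4.5, 3.0, 4.0], 'M': [3.5, 4.2, 4.1]},
                          index=['A', 'B', 'C'])
popular_means = means_demo.loc[means_demo.index.isin(both_demo.index)]
print(popular_means)
```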
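13. Rank the popular movies by the `diff` column and take the top ten:
# Assumption: 'diff' is not defined in the steps above; it is taken here to be the male mean rating minus the female mean rating
by_movies_hot = by_movies_hot.assign(diff=by_movies_hot['M'] - by_movies_hot['F'])
by_movies_hot_plot = by_movies_hot.sort_values(by='diff', ascending=False).abs().head(10)
by_movies_hot_plot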
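Step 13 sorts on a `diff` column and applies `abs()` only after sorting, so the ranking follows the signed values. The sketch below assumes `diff` means the male mean minus the female mean (an assumption, since the steps above do not define it) and uses made-up ratings to contrast that signed ranking with a ranking by the size of the gap.

```python
import pandas as pd

# Hypothetical mean ratings by gender.
means = pd.DataFrame({'F': [4.5, 3.0, 4.0], 'M': [3.5, 4.2, 4.1]},
                     index=['A', 'B', 'C'])

# Assumed definition: male mean minus female mean.
with_diff = means.assign(diff=means['M'] - means['F'])

# Signed ranking: movies that male users rate higher come first.
signed = with_diff.sort_values(by='diff', ascending=False)

# Gap ranking: take the absolute value first, then sort, then reorder the rows.
gap_order = with_diff['diff'].abs().sort_values(ascending=False).index
by_gap = with_diff.reindex(gap_order)

print(signed.index.tolist())
print(by_gap.index.tolist())
```

Which ranking is appropriate depends on whether the direction of the disagreement matters or only its size.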