Data Preprocessing: Data Cleaning and Generating Sample Data <3>

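Analyzing fund data with sklearn <1>
Fetching fund data with a Python crawler <2>
Data preprocessing: data cleaning and generating sample data <3>
Training on the sample data with sklearn <4>
Predicting with the model and improving it <5>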

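The data we collected is only raw material. Before training, it has to be preprocessed to ensure data quality and deal with outliers, which reduces the error in the final result. Our data is fairly uniform, so quality is not a big problem, but a few things still need to be handled: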

The first is NaN (empty) values, for example on 2017-10-18.

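The fix is simple: substitute the value from the previous or the next day. The growth rate for that day then becomes 0, but the growth rate for the month is not affected.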

Add column names to the historical data, save it as a CSV file, and then process it with the following code:

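# Curprice and records are built as in the loading code further below.
# Replace each NaN price with the previous row's price.
for i in range(1, len(Curprice)):
    if np.isnan(Curprice.iloc[i]):
        Curprice.iloc[i]=Curprice.iloc[i-1]
        print(i)

result = pd.concat([records,Curprice], axis=1)
outputfile = 'checkedalldata.csv'
result.to_csv(outputfile)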

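With several million records this loop is slow. It is much faster to first find the row numbers of the records that contain NaN, which UltraEdit, Notepad++ or a shell command can do quickly, and then fix only those rows. Loading the data into a database with an index would also be fast.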

import pandas as pd
import numpy as np
Allrecords=pd.read_csv('alldata.csv')
Allrecords.index=Allrecords.iloc[:,1]
Allrecords.index=pd.to_datetime(Allrecords.index, format='%Y-%m-%d')
#just get date,price,fund-code
records = Allrecords.iloc[:,0:3]
Curprice=records.cur_price
Curprice.name='new_price'
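# row numbers of the records containing NaN, found beforehand
ModifCodes=pd.read_csv('includeNANcodes.csv')
ModifCodes=ModifCodes.code
for code in ModifCodes:
    # assuming these are 1-based file line numbers with a header line,
    # line N is row N-2; copy the price from the row before it
    Curprice.iloc[code-2]=Curprice.iloc[code-3]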
result = pd.concat([records,Curprice], axis=1)
outputfile = 'checkedalldata.csv'
result.to_csv(outputfile)
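
As an alternative to the loops above, pandas can locate and fill the gaps in one pass. The following is only a minimal sketch on made-up data, not the code used in this series: the prices and fund codes are invented, and it assumes the rows are sorted by fund and then by date, so that the previous row belongs to the same fund.

```python
import numpy as np
import pandas as pd

# Toy data: the prices and fund codes are invented for illustration
prices = pd.DataFrame({
    'trade_code': [1, 1, 1, 2, 2, 2],
    'cur_price': [1.00, np.nan, 1.02, 2.00, 2.10, np.nan],
})

# Row positions of the missing prices (what the editor search was used for)
nan_rows = np.flatnonzero(prices['cur_price'].isna())

# Fill each gap with the previous row's price, separately for each fund
prices['new_price'] = prices.groupby('trade_code')['cur_price'].ffill()
```

Grouping by fund keeps a missing first record from being filled with another fund's price. Such a record stays NaN and would need a backward fill or manual handling.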

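Besides NaN values, some records have no daily price and carry a seven-day annualized yield instead. These belong to money-management funds, which are similar to bank wealth-management products and only publish that figure. This does not matter, because money-management funds are filtered out, and we already have the fund type data.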


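Since we are analyzing fund returns, fund type is an important attribute: different types of fund differ greatly in risk and return. As mentioned above, money-management funds are not worth analyzing. This analysis focuses on medium- to high-risk fund types: hybrid (混合型), stock index (股票指数), equity (股票型) and bond (债券型).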

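Funds with too few historical records in the month are also dropped, for example a fund that was listed only at the end of the month.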
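
The two filters can be sketched as follows. This is an illustration on invented data, not the script used in this series: the column names `trade_code`, `fund_type` and `cur_price` follow the script, while the fund codes, prices and the `min_records` threshold are assumptions.

```python
import pandas as pd

# Toy fund list: codes and types are invented for illustration
funds = pd.DataFrame({
    'trade_code': [1, 2, 3, 4],
    'fund_type': ['混合型', '理财型', '股票型', '债券型'],
})
keep_types = ['混合型', '股票指数', '股票型', '债券型']
selected = funds[funds['fund_type'].isin(keep_types)]

# Toy daily records for one month; fund 3 has a single record
daily = pd.DataFrame({
    'trade_code': [1, 1, 1, 3, 4, 4],
    'cur_price': [1.0, 1.1, 1.2, 2.0, 3.0, 3.1],
})
min_records = 2  # hypothetical threshold for "too little history"
counts = daily.groupby('trade_code')['cur_price'].count()
enough = counts[counts >= min_records].index
usable = selected[selected['trade_code'].isin(enough)]
```

Here `isin` replaces a chain of `==` comparisons joined with `|`, which is easier to extend when another fund type is added.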

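Next comes the design of the sample data. The input X is each fund's performance indicators for month N, and the output Y is its monthly return for month N+1.
To simplify the problem, it can be framed as a binary classification model: a Y greater than 5% is a positive example, and anything else is a negative example.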
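
The labeling rule can be sketched in a couple of lines. The return values below are invented, and the strict comparison follows the wording above; note that the script later in this post uses `>= 5`, so a return of exactly 5% is treated differently there.

```python
import pandas as pd

# Next-month returns in percent, indexed by fund code; values are invented
next_month_ret = pd.Series([7.2, 5.0, -1.3, 0.4], index=[1, 2, 3, 4])

threshold = 5.0  # percent
# Strictly greater than 5% is a positive example (1), otherwise negative (0)
label = (next_month_ret > threshold).astype(int)
```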

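The performance indicators used as input, i.e. the features, should reflect how the fund performed during the month. There are five: the month's growth rate, the month's average daily growth rate, the month's highest daily growth rate, the month's lowest daily growth rate, and the month's upward strength. The historical data has two dimensions: one is the individual fund, the other is time, spanning one year (2017 in this example). The values are each fund's daily settlement price.
Although the raw data contains only prices, it is a time series, so each fund's daily growth rate is quick to compute, and the five features follow from it month by month. Monthly growth rate: (month-end price - month-start price) / month-start price. Average growth rate: the mean of the month's daily growth rates. Highest growth rate: the maximum of the month's daily growth rates. Lowest growth rate: the minimum of the month's daily growth rates. Upward strength: number of up days / total number of days.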
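
To make the five definitions concrete, here is a small sketch for a single fund and a single month. The helper `month_features`, its key names and the prices are invented for illustration, and unlike the script in this post it assumes the prices are in chronological order (oldest first).

```python
import pandas as pd

def month_features(prices):
    """Compute the five monthly features from one fund's daily prices,
    given in chronological order (oldest first). All values are in percent."""
    # Daily growth rate; the first day has no previous price and is dropped
    daily_ret = (prices / prices.shift(1) - 1).dropna() * 100
    return {
        'month_ret': (prices.iloc[-1] - prices.iloc[0]) / prices.iloc[0] * 100,
        'mean_ret': daily_ret.mean(),
        'max_ret': daily_ret.max(),
        'min_ret': daily_ret.min(),
        # Share of days on which the price went up
        'up_strength': (daily_ret > 0).sum() / len(daily_ret) * 100,
    }

# Invented prices for five trading days
prices = pd.Series([1.00, 1.10, 1.10, 0.99, 1.20])
features = month_features(prices)
```

With these prices the fund rises on two of the four day-to-day changes, so the upward strength is 50%, and the monthly growth rate is 20%.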

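This sample design has an inherent limitation, though: a sample is complete only after a full month of data is available. So last month's data can only be used for prediction, not for training, and training actually uses the data from the month before last.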
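
The month alignment can be sketched like this, assuming the data covers 2017-01 through 2018-01. The variable names are illustrative only.

```python
import pandas as pd

# Months covered by the data, 2017-01 through 2018-01
months = pd.period_range('2017-01', '2018-01', freq='M')

# Each training pair is (feature month N, label month N+1)
train_pairs = [(str(m), str(m + 1)) for m in months[:-1]]

# The last month has features but no following month yet: prediction only
predict_month = str(months[-1])
```

Thirteen months of prices therefore yield twelve labeled samples per fund, and the most recent month serves only as prediction input.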

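Following this approach, the sample data is built for the period 201701 to 201801, with one sample file generated per month. The code is as follows: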

import pandas as pd
import numpy as np
#get all data
Allrecords=pd.read_csv('checkedalldata.csv')
Allrecords.index=Allrecords.iloc[:,0]
Allrecords.index=pd.to_datetime(Allrecords.index, format='%Y-%m-%d')
#just get date,price,fund-code
records = Allrecords.iloc[:,0:4]
Allfund=pd.read_table('Leixingall.txt',encoding='utf-8',sep=',')
Allfund.index=Allfund.iloc[:,1]
Allfund=Allfund[(Allfund.fund_type=='混合型') | (Allfund.fund_type=='股票指数')|(Allfund.fund_type=='股票型') |(Allfund.fund_type=='债券型') ]
codes=Allfund.iloc[:,1]
datelist =[['2017-12','2018-01'],
           ['2017-11','2017-12'],
           ['2017-10','2017-11'],
           ['2017-09','2017-10'],
           ['2017-08','2017-09'],
           ['2017-07','2017-08'],
           ['2017-06','2017-07'],
           ['2017-05','2017-06'],
           ['2017-04','2017-05'],
           ['2017-03','2017-04'],
           ['2017-02','2017-03'],
           ['2017-01','2017-02']]

def preparedata(currdt,nextdt,filename):
    for currcode in codes:
        upcount =0
        downcount =0
        try:
            currrecords=records[records.trade_code==currcode]
            Curprice=currrecords.cur_price
            ret=(Curprice-Curprice.shift(-1))/Curprice.shift(-1)*100
            ret.name='Ret'
            retTM=ret[currdt]
            counts=int(retTM.describe()['count'])
            CurpriceTM=Curprice[currdt]

            if counts <2:
                print('count<2',currcode)
                continue
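            # monthly return; the shift(-1) above implies rows are newest first,
            # so iloc[0] is the month-end price and iloc[-1] the month-start price
            avgret = (CurpriceTM.iloc[0]-CurpriceTM.iloc[-1])/CurpriceTM.iloc[-1]*100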
            for rets in retTM:
                if rets >0:
                    upcount +=1
                else:
                    downcount+=1
            upret=upcount/counts*100
            maxret=retTM.describe()['max']
            minret=retTM.describe()['min']
            meanret=retTM.describe()['mean']
            #get next month data (select rows by month with .loc)
            nextrecordsTM=currrecords.loc[nextdt]
            #get price
            Nextprice=nextrecordsTM.cur_price
            counts=int(Nextprice.describe()['count'])
            nextavgret = (Nextprice.iloc[0]-Nextprice.iloc[counts-1])/Nextprice.iloc[counts-1]*100
            calsstype = 0
            if nextavgret >= 5 : calsstype = 1
            with open(filename,'ab') as files:
                items = str(currcode) +','+str(avgret) +','+ str(maxret) +','+ str(minret) + ','+ str(meanret)+ ','+ str(upret) + ','+ str(upcount) +','+ str(downcount) + ','+ str(calsstype) +','+ str(nextavgret) + ','+ str(currdt) + '\r\n'
                items = items.encode('utf-8')
                files.write(items)
        except Exception as e:
            # report the fund that failed and why, then move on to the next fund
            print(currcode, e)
for date in datelist:
    currdt =date[0]
    nextdt =date[1]
    filename=str(currdt)+'_data.csv'
    preparedata(currdt,nextdt,filename)

This generates 12 sample data files in total.
