Study Notes, Part 5: Clustering Algorithms

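Earlier this year I worked through the algorithms in the book 《机器学习》 (Machine Learning) and implemented some of them. I am now writing them up as notes so that I can find them again when I need them.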

Today's note covers a simple clustering algorithm.

1. K-means clustering

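K-means is a typical distance-based clustering algorithm. It uses distance as the measure of similarity: the closer two objects are, the more similar they are considered to be. The algorithm treats a cluster as a group of objects that lie close together, so its goal is to produce clusters that are compact and well separated. The choice of the k initial cluster centers has a large influence on the result, because the first step picks k objects at random as the initial centers, each initially representing one cluster. In every iteration, each object in the data set is reassigned to the nearest cluster according to its distance from each cluster center. Once all objects have been examined, the iteration is complete and new cluster centers are computed. If the value of the objective function J does not change from one iteration to the next, the algorithm has converged.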
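One common way to write the objective J, assumed here because the notes do not spell it out, is the within-cluster sum of squared distances, where $C_i$ is the $i$-th cluster and $\mu_i$ is its center:

```latex
J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2,
\qquad
\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x
```

With this definition, neither the assignment step nor the center update can increase J, which is why an unchanged J signals convergence. Note that the script later in this post sums plain (unsquared) distances as its cost instead.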

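The procedure is as follows:

1) Randomly pick K of the N documents as the initial centroids.

2) For each remaining document, measure its distance to every centroid and assign it to the cluster of the nearest centroid.

3) Recompute the centroid of each cluster obtained so far.

4) Repeat steps 2 and 3 until the new centroids equal the previous ones or move by less than a given threshold, at which point the algorithm stops.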
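The four steps above can be sketched with NumPy roughly as follows. This is a minimal illustration written for Python 3, not the script used later in this post; the function name `kmeans` and the parameters `tol`, `max_iter` and `seed` are assumptions introduced here, and an empty cluster simply keeps its previous centroid.

```python
import numpy as np

def kmeans(points, k, tol=1e-6, max_iter=100, seed=0):
    """Cluster the rows of `points` into k groups; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Step 1: pick k distinct rows at random as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: distance from every point to every centroid, then take the nearest.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its members
        # (assumption: an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Step 4: stop once no centroid moved by more than tol.
        shift = np.linalg.norm(new_centroids - centroids, axis=1).max()
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids, labels

# Toy input: two obvious groups of two points each.
centroids, labels = kmeans([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]], k=2)
print(centroids)
print(labels)
```

The `max_iter` cap is only a safety net; on well-separated data the threshold test in step 4 usually ends the loop after a few passes.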

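In more detail:

Input: k, data[n];

(1) Choose k initial centers, for example c[0]=data[0], …, c[k-1]=data[k-1];

(2) Compare each of data[0], …, data[n-1] with c[0], …, c[k-1]; if the difference from c[i] is the smallest, label the point i;

(3) For all points labelled i, recompute c[i] = {sum of all data[j] labelled i} / {number of points labelled i};

(4) Repeat (2) and (3) until the change in every c[i] is below the given threshold.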
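For scalar data, steps (1) to (4) can be transcribed almost literally. The sketch below is written for Python 3; the function name `kmeans_1d`, the `threshold` and `max_iter` parameters and the toy input are assumptions made for illustration.

```python
def kmeans_1d(data, k, threshold=1e-9, max_iter=100):
    """Literal transcription of steps (1)-(4) for scalar data."""
    # (1) Use the first k values as the initial centers c[0..k-1].
    c = [float(x) for x in data[:k]]
    labels = [0] * len(data)
    for _ in range(max_iter):
        # (2) Label every value with the index of the closest center.
        for j, x in enumerate(data):
            labels[j] = min(range(k), key=lambda i: abs(x - c[i]))
        # (3) Recompute each center as the mean of the values carrying its label
        #     (assumption: a center with no members is left where it is).
        new_c = []
        for i in range(k):
            members = [x for j, x in enumerate(data) if labels[j] == i]
            new_c.append(sum(members) / len(members) if members else c[i])
        # (4) Stop once every center moved by less than the threshold.
        moved = max(abs(a - b) for a, b in zip(new_c, c))
        c = new_c
        if moved < threshold:
            break
    return c, labels

centers, labels = kmeans_1d([1, 2, 10, 11, 12], k=2)
print(centers)  # [1.5, 11.0]
print(labels)   # [0, 0, 1, 1, 1]
```

Starting from c = [1, 2], the first pass puts only the value 1 in cluster 0; the second pass moves 2 over as well, and the third pass changes nothing, so the loop stops.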

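Advantages

The main advantages of K-means are:

1. The algorithm is fast and simple;

2. It is efficient and scalable on large data sets;

3. Its time complexity is close to linear, which makes it suitable for mining large data sets. The time complexity is O(nkt), where n is the number of objects in the data set, k is the number of clusters, and t is the number of iterations.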
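The O(nkt) figure comes from the assignment step: in each of the t iterations, every one of the n objects is compared with every one of the k centers. A small counting sketch (the function name and the values of n, k and t are illustrative assumptions) makes the product explicit:

```python
def count_distance_evaluations(n, k, t):
    """Count point-to-center distance computations over t assignment passes."""
    evaluations = 0
    for _ in range(t):          # t iterations
        for _ in range(n):      # every object ...
            for _ in range(k):  # ... is compared with every center
                evaluations += 1
    return evaluations

print(count_distance_evaluations(n=36, k=3, t=10))  # 1080
```

Each distance itself costs time proportional to the number of dimensions, which O(nkt) treats as a constant.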

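The example below uses the gross domestic product (GDP) of major Chinese cities from 2011 to 2015 and clusters the cities into three groups: first-tier, second-tier and third-tier cities.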


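# coding: utf-8
# Note: this is a Python 2 script (it relies on reload(sys) and print statements).
from __future__ import division
from math import sqrt
import xlrd
import random
import numpy
import sys

# Python 2 only: switch the default string encoding to utf-8
reload(sys)
sys.setdefaultencoding('utf-8')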

# Load the data; returns a dict mapping each city name to its list of five yearly values
def storage():
    table = xlrd.open_workbook('E:/citydata.xls')
    sheet = table.sheets()[0]
    dict1 = {}
    # Rows 4..39 hold the city records: first cell is the city name, the rest is the data
    for i in range(4, 40):
        row_data = sheet.row_values(i)
        dict1[row_data[0]] = row_data[1:]
    return dict1

# Euclidean distance between a city vector and a cluster center
def distance(elem, K_mean):
    total = 0
    for i in range(len(elem)):
        total = total + (elem[i] - K_mean[i])**2
    return sqrt(total)

# Center (component-wise mean) of a cluster; the cluster must not be empty
def mean(cluster):
    S = numpy.zeros(len(cluster[0]))
    for every in cluster:
        S = S + numpy.array(every)
    return list(S / len(cluster))

# Comparison helper: True if the two lists are equal element by element
def compare(list1, list2):
    if len(list1) != len(list2):
        return False
    for i in range(len(list1)):
        if list1[i] != list2[i]:
            return False
    return True

# Cost function: returns the sum of the distances from every city to the center of its own cluster
def Cost_function(K1_cluster, K2_cluster, K3_cluster, K1_mean, K2_mean, K3_mean):
    cost = 0
    for each in K1_cluster:
        cost = cost + distance(each, K1_mean)
    for every in K2_cluster:
        cost = cost + distance(every, K2_mean)
    for single in K3_cluster:
        cost = cost + distance(single, K3_mean)
    return cost

# Assignment step: put every city into the cluster of its nearest center; returns the three clusters
def Step_one(dict1, K1, K2, K3):
    K1_cluster = []
    K2_cluster = []
    K3_cluster = []
    for each in dict1:
        dist1 = distance(dict1[each], K1)
        dist2 = distance(dict1[each], K2)
        dist3 = distance(dict1[each], K3)
        Min = min(dist1, dist2, dist3)
        # elif/else ensures a city equally close to two centers joins only one cluster
        if Min == dist1:
            K1_cluster.append(dict1[each])
        elif Min == dist2:
            K2_cluster.append(dict1[each])
        else:
            K3_cluster.append(dict1[each])
    return K1_cluster, K2_cluster, K3_cluster

# Update step: compute the center vector of each cluster
def Step_two(K1_cluster, K2_cluster, K3_cluster):
    K1_mean = mean(K1_cluster)
    K2_mean = mean(K2_cluster)
    K3_mean = mean(K3_cluster)
    return K1_mean, K2_mean, K3_mean

# Clustering function; returns the three clusters (the code is written for K = 3 only)
def K_means(dict1, K):
    # Pick K distinct cities at random as the initial centers
    list1 = random.sample(list(dict1), K)
    K1 = dict1[list1[0]]
    K2 = dict1[list1[1]]
    K3 = dict1[list1[2]]
    clu1, clu2, clu3 = Step_one(dict1, K1, K2, K3)     # first assignment
    K1_mean, K2_mean, K3_mean = Step_two(clu1, clu2, clu3)
    cost1 = Cost_function(clu1, clu2, clu3, K1_mean, K2_mean, K3_mean)
    new_clu1, new_clu2, new_clu3 = Step_one(dict1, K1_mean, K2_mean, K3_mean)   # second assignment
    K11_mean, K22_mean, K33_mean = Step_two(new_clu1, new_clu2, new_clu3)
    cost2 = Cost_function(new_clu1, new_clu2, new_clu3, K11_mean, K22_mean, K33_mean)
    error = cost1 - cost2
    # Keep iterating (while, not a single if) until the cost improves by no more than 0.5
    while error > 0.5:
        cost1 = cost2
        new_clu1, new_clu2, new_clu3 = Step_one(dict1, K11_mean, K22_mean, K33_mean)
        K11_mean, K22_mean, K33_mean = Step_two(new_clu1, new_clu2, new_clu3)
        cost2 = Cost_function(new_clu1, new_clu2, new_clu3, K11_mean, K22_mean, K33_mean)
        error = cost1 - cost2
    return new_clu1, new_clu2, new_clu3

if __name__ == '__main__':
    dict1 = storage()
    K = 3
    K1_cluster, K2_cluster, K3_cluster = K_means(dict1, K)
    list1 = []
    list2 = []
    list3 = []
    # Map every clustered vector back to its city name by comparing the data
    for each in dict1.keys():
        for i in K1_cluster:
            if compare(dict1[each], i):
                list1.append(each)
        for j in K2_cluster:
            if compare(dict1[each], j):
                list2.append(each)
        for k in K3_cluster:
            if compare(dict1[each], k):
                list3.append(each)
    # Print the city names of each cluster on one line
    for i in list1:
        print i,
    print
    for j in list2:
        print j,
    print
    for k in list3:
        print k,
    print
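Running the script prints the three clusters, one per line. I used an error threshold of 0.5 as the stopping condition; a fixed number of iterations would work as well.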
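The two stopping rules mentioned above can also be combined. The following is a minimal sketch written for Python 3 and independent of the script; the function name `kmeans_named`, its parameters and the toy input are all hypothetical placeholders, not the city data. It works for any k and keeps the names attached to the vectors, so no matching step is needed afterwards.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans_named(vectors, k, threshold=0.5, max_iter=100, seed=None):
    """Cluster a {name: vector} dict into k lists of names."""
    rng = random.Random(seed)
    names = sorted(vectors)
    # Start from k distinct items chosen at random.
    centers = [list(vectors[name]) for name in rng.sample(names, k)]
    clusters = [[] for _ in range(k)]
    prev_cost = None
    for _ in range(max_iter):                      # stopping rule 1: iteration cap
        clusters = [[] for _ in range(k)]
        cost = 0.0
        for name in names:
            dists = [euclidean(vectors[name], c) for c in centers]
            nearest = dists.index(min(dists))      # ties go to the lowest index
            clusters[nearest].append(name)
            cost += dists[nearest]
        # Move each center to the mean of its members (empty clusters stay put).
        for i, members in enumerate(clusters):
            if members:
                dim = len(centers[i])
                centers[i] = [sum(vectors[m][d] for m in members) / len(members)
                              for d in range(dim)]
        # Stopping rule 2: the cost improved by no more than the threshold.
        if prev_cost is not None and prev_cost - cost <= threshold:
            break
        prev_cost = cost
    return clusters

# Made-up toy vectors, purely for illustration.
toy = {"a": [0, 0], "b": [0, 1], "c": [10, 10], "d": [10, 11]}
print(kmeans_named(toy, k=2, seed=1))
```

Here the cost is measured against the centers used for the assignment, which differs slightly from the script, where the cost is computed after the centers have been updated.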
