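k-Nearest Neighbors (kNN) Algorithm

Pros: high accuracy, insensitive to outliers, no assumptions about the input data
Cons: computationally expensive, requires a lot of memory
Works with: numeric and nominal values

General workflow of the k-nearest neighbors algorithm

Collect data: any method can be used
Prepare data: numeric values are needed for the distance calculation; a structured data format is best
Analyze data: any method can be used
Train the algorithm: this step does not apply to k-nearest neighbors
Test the algorithm: calculate the error rate
Use the algorithm: first supply sample data and structured output, then run the k-nearest neighbors algorithm to determine which class each input belongs to, and finally carry out any follow-up processing on the computed classes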
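The "test the algorithm" step comes down to counting mismatches between predicted and true labels. A minimal sketch of that calculation follows; the function name error_rate and the sample labels are made up for illustration.

```python
def error_rate(predicted, actual):
    """Fraction of predictions that differ from the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / len(actual)

# Hypothetical predictions: one of four is wrong
rate = error_rate(['A', 'B', 'B', 'A'], ['A', 'B', 'A', 'A'])
print(rate)   # 0.25
```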
tile(A, reps): repeats A the number of times given by reps along each dimension

from numpy import *
import operator
def createDataSet():
    group=array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])
    labels=['A','B','C','D']
    return group,labels

def classify0(inX,dataSet,labels,k):
    dataSetSize=dataSet.shape[0]
    # Distance calculation
    diffMat=tile(inX,(dataSetSize,1))-dataSet
    sqDoffMat=diffMat**2
    sqDistances=sqDoffMat.sum(axis=1)
    distances=sqDistances**0.5
    sortedDistIndices=distances.argsort()
    classCount={}
    # Pick the k points with the smallest distances
    for i in range(k):
        voteIlabel=labels[sortedDistIndices[i]]
        classCount[voteIlabel]=classCount.get(voteIlabel,0)+1
    sortedClassCount=sorted(classCount.items(),
                           key=operator.itemgetter(1),reverse=True)
    return sortedClassCount[0][0]   

group,labels=createDataSet()
classify0([0,0],group,labels,3)

'C'
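The same idea can be written without tile, because NumPy broadcasting subtracts a single point from every row automatically. The following is a minimal sketch and not the author's code: the name knn_classify, the use of collections.Counter for the vote, and the two-points-per-class labels are assumptions chosen so that the majority vote is visible.

```python
import numpy as np
from collections import Counter

def knn_classify(in_x, data_set, labels, k):
    """Return the majority label among the k training points nearest to in_x."""
    # Broadcasting subtracts in_x from every row, so tile() is not needed
    diffs = np.asarray(data_set, dtype=float) - np.asarray(in_x, dtype=float)
    distances = np.sqrt((diffs ** 2).sum(axis=1))
    nearest = distances.argsort(kind="stable")[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
labels = ['A', 'A', 'B', 'B']   # hypothetical labels: two points per class
result = knn_classify([0, 0], group, labels, 3)
print(result)   # B
```

With k=3 the neighbors of [0, 0] are the two 'B' points and one 'A' point, so 'B' wins the vote two to one.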

Function details

tile([1,2],2)

array([1, 2, 1, 2])

tile([1,2],(2,2))

array([[1, 2, 1, 2],
       [1, 2, 1, 2]])

x=array([[1,2,3],[2,3,4]])
print(x.shape)
print(x.shape[0])

(2, 3)
2

x=array([[1,2,3],[2,3,4]])
print(x**2)

[[ 1  4  9]
 [ 4  9 16]]

import numpy as np
x = np.array([[0, 3], [2, 2]])
np.argsort(x, axis=0)
np.argsort(x, axis=1)

array([[0, 1],
       [1, 0]])
array([[0, 1],
       [0, 1]])
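In classify0, argsort is applied to a one-dimensional array of distances, and the first k indices it returns identify the nearest training points. A small sketch with made-up distance values:

```python
import numpy as np

# Made-up distances from one query point to four training points
distances = np.array([1.49, 1.41, 0.0, 0.1])
order = distances.argsort()   # indices from nearest to farthest
k = 3
nearest_k = order[:k]         # indices of the k nearest points
print(order)       # [2 3 1 0]
print(nearest_k)   # [2 3 1]
```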

info = {'Name': 'Zara', 'Age': 27}
print("Value : %s" % info.get('Age'))
print("Value : %s" % info.get('Sex', "Never"))

Value : 27
Value : Never

# In Python 3, dict.items() returns an iterable view of (key, value) tuples; list() turns it into a list
sites = {'Google': 'www.google.com', 'Runoob': 'www.runoob.com', 'taobao': 'www.taobao.com'}

print("Dictionary items : %s" % list(sites.items()))

# Iterate over the key-value pairs
for key, value in sites.items():
    print(key, value)

Dictionary items : [('Google', 'www.google.com'), ('Runoob', 'www.runoob.com'), ('taobao', 'www.taobao.com')]
Google www.google.com
Runoob www.runoob.com
taobao www.taobao.com

# The operator module provides functions that correspond to Python's operators. For example, operator.add(x, y) is equivalent to x+y
abs(...)
        abs(a) -- Same as abs(a).
and_(...)
        and_(a, b) -- Same as a & b.
contains(...)
        contains(a, b) -- Same as b in a (note reversed operands).
eq(...)
        eq(a, b) -- Same as a==b.

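The itemgetter function in the operator module fetches items from an object at the given positions; its arguments are the indices to fetch. Note that operator.itemgetter does not return a value itself: it returns a function, and the values are obtained only when that function is applied to an object.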

>>> a = [1,2,3]
>>> b=operator.itemgetter(1)      # define function b, which fetches item 1 of its argument
>>> b(a)
2

>>> b=operator.itemgetter(1,0)  # define function b, which fetches item 1 and item 0 of its argument
>>> b(a)
(2, 1)
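This is how classify0 uses itemgetter: each entry of classCount.items() is a (label, count) pair, and itemgetter(1) selects the count as the sort key. A short sketch with a made-up vote tally:

```python
import operator

class_count = {'A': 1, 'B': 2}        # hypothetical vote tally
get_count = operator.itemgetter(1)    # picks the count out of each (label, count) pair
ranked = sorted(class_count.items(), key=get_count, reverse=True)
winner = ranked[0][0]                 # label with the most votes
print(ranked)   # [('B', 2), ('A', 1)]
print(winner)   # B
```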

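The sorted function sorts an iterable. The Python 2 signature was sorted(iterable[, cmp[, key[, reverse]]]); Python 3 dropped the cmp argument.

The key argument is a function (often a lambda), so itemgetter can be used as the key.

students = [('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]

Sort by the second field, then by the third:

sorted(students, key=operator.itemgetter(1,2))
Any iterable can be passed to sorted.

sorted(iterable, key=None, reverse=False)

The values after the = signs are defaults. The result is in ascending order by default; pass reverse=True for descending order.

The sorted result is placed in a new list; the iterable itself is not modified.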
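A runnable sketch of the points above: sorting on two fields with itemgetter, reversing the order, and confirming that the input list is left untouched. The variable names are chosen here for illustration.

```python
import operator

students = [('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]

# Sort by field 1, breaking ties with field 2
by_fields_1_2 = sorted(students, key=operator.itemgetter(1, 2))
print(by_fields_1_2)
# [('john', 'A', 15), ('dave', 'B', 10), ('jane', 'B', 12)]

# Sort by field 2 in descending order
by_field_2_desc = sorted(students, key=operator.itemgetter(2), reverse=True)
print(by_field_2_desc)
# [('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)]

# sorted() returned new lists; the original order is unchanged
print(students[1])   # ('jane', 'B', 12)
```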

1. Simple sorting

sorted('123456')  # string

['1', '2', '3', '4', '5', '6']

sorted([1,4,5,2,3,6])  # list
[1, 2, 3, 4, 5, 6]

sorted({1:'q',3:'c',2:'g'})  # dict: the keys are sorted by default
[1, 2, 3]

sorted({1:'q',3:'c',2:'g'}.keys())  # the dict's keys
[1, 2, 3]

sorted({1:'q',3:'c',2:'g'}.values())  # the dict's values
['c', 'g', 'q']

sorted({1:'q',3:'c',2:'g'}.items())  # list of (key, value) tuples
[(1, 'q'), (2, 'g'), (3, 'c')]

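Custom comparison function

from functools import cmp_to_key

def comp(x, y):
    # Reversed comparison: larger values sort first
    if x < y:
        return 1
    elif x > y:
        return -1
    else:
        return 0

nums = [3, 2, 8, 0, 1]
nums.sort(key=cmp_to_key(comp))  # Python 3 has no cmp argument; wrap the function with cmp_to_key
print(nums)  # descending order: [8, 3, 2, 1, 0]
nums.sort()  # default ascending order (Python 2 code would pass the built-in cmp here)
print(nums)  # ascending order: [0, 1, 2, 3, 8]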

When key is used, it must be a function that is called on each element during sorting

x = ['mmm', 'mm', 'mm', 'm']
x.sort(key=len)
print(x)  # ['m', 'mm', 'mm', 'mmm']

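Using the k-nearest neighbors algorithm on a dating site

Collect data: a text file is provided
Prepare data: parse the text file with Python
Analyze data: draw 2D scatter plots with matplotlib
Train the algorithm: does not apply to k-nearest neighbors
Test the algorithm: use part of the data Helen provided as test samples.
Test samples differ from the other samples in that they have already been classified; if the predicted class differs from the actual class, it is counted as an error
Use the algorithm: build a simple command-line program into which Helen can enter a person's features to find out whether that person is someone she would like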
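The "use the algorithm" step could look roughly like the sketch below, which skips the command-line input and classifies one new sample directly. It is an illustration under stated assumptions, not the article's code: the function name classify_person, the six training rows, and their labels are all made up. The key point is that a new sample has to be normalized with the same minimum and range as the training data before distances are computed.

```python
import numpy as np
from collections import Counter

def classify_person(features, data, labels, k=3):
    """Normalize a new sample with the training min/range, then vote among k neighbors."""
    min_vals = data.min(axis=0)
    ranges = data.max(axis=0) - min_vals
    norm_data = (data - min_vals) / ranges
    norm_x = (np.asarray(features, dtype=float) - min_vals) / ranges
    distances = np.sqrt(((norm_data - norm_x) ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

# Made-up training rows: three numeric features per person, integer class labels
data = np.array([[40000.0, 8.0, 1.0],
                 [42000.0, 9.0, 0.9],
                 [38000.0, 7.5, 1.1],
                 [5000.0, 1.0, 0.2],
                 [6000.0, 0.5, 0.3],
                 [4000.0, 1.5, 0.1]])
labels = [3, 3, 3, 1, 1, 1]
prediction = classify_person([41000.0, 8.5, 1.0], data, labels)
print(prediction)   # 3
```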

Complete code:

from numpy import *
import operator

def classify0(inX,dataSet,labels,k):
    dataSetSize=dataSet.shape[0]
    # Distance calculation
    diffMat=tile(inX,(dataSetSize,1))-dataSet
    sqDoffMat=diffMat**2
    sqDistances=sqDoffMat.sum(axis=1)
    distances=sqDistances**0.5
    sortedDistIndices=distances.argsort()
    classCount={}
    # Pick the k points with the smallest distances
    for i in range(k):
        voteIlabel=labels[sortedDistIndices[i]]
        classCount[voteIlabel]=classCount.get(voteIlabel,0)+1
    sortedClassCount=sorted(classCount.items(),
                           key=operator.itemgetter(1),reverse=True)
    return sortedClassCount[0][0]

def file2matrix(filename):
    fr=open(filename)
    arrayOLines=fr.readlines()
    numberOfLines=len(arrayOLines)
    returnMat=zeros((numberOfLines,3))
    classLabelVector=[]
    index=0
    for line in arrayOLines:
        # Strip leading and trailing whitespace, then split the line on tab characters '\t'
        line=line.strip()
        listFromLine=line.split('\t')
        returnMat[index,:]=listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index+=1
    return returnMat,classLabelVector
data_path='E:/dataset/machinelearninginaction/Ch02/'
datMat,datLabel=file2matrix(data_path+'datingTestSet2.txt')
print(datMat)
print(datLabel[0:20])
# Analyze data: create a scatter plot with matplotlib
import matplotlib
import matplotlib.pyplot as plt
fig=plt.figure()
ax=fig.add_subplot(111)
ax.scatter(datMat[:,0],datMat[:,1],
           15.0*array(datLabel),15.0*array(datLabel))
plt.show()
# Prepare data: normalize the values
def autoNorm(dataSet):
    minVals=dataSet.min(0)
    maxVals=dataSet.max(0)
    ranges=maxVals-minVals
    m=dataSet.shape[0]
    normData=dataSet-tile(minVals,(m,1))
    normData=normData/tile(ranges,(m,1))
    return normData,ranges,minVals
normData,ranges,minVals=autoNorm(datMat)
print(normData)
print(ranges)
# Test the algorithm: validate the classifier as a complete program
def datingClassTest():
    hoRatio=0.10
    datingDataMat,datingDatalabel=file2matrix(data_path+\
                                              'datingTestSet2.txt')
    norm,range1,minVals=autoNorm(datingDataMat)
    m=norm.shape[0]
    numTest=int(m*hoRatio)
    errorCount=0
    for i in range(numTest):
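        # Use the labels loaded inside this function rather than the module-level datLabel
        classResult=classify0(norm[i,:],norm[numTest:m,:],\
                              datingDatalabel[numTest:m],3)
        print('the classifier came back with: %d, the real answer is: %d'%(classResult,datingDatalabel[i]))
        if(classResult!=datingDatalabel[i]):errorCount+=1
    print("the total error rate is: %f"%(errorCount/numTest))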
if __name__ == '__main__':
    datingClassTest()
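The autoNorm function in the listing above applies min-max scaling, newValue = (oldValue - min) / (max - min), so that every feature falls in [0, 1] and no single feature dominates the distance. A minimal sketch of the same calculation using broadcasting instead of tile follows; the function name and the sample matrix are made up for illustration.

```python
import numpy as np

def min_max_normalize(data_set):
    """Rescale every column to the range [0, 1]."""
    min_vals = data_set.min(axis=0)
    ranges = data_set.max(axis=0) - min_vals
    # Broadcasting applies the per-column min and range to every row
    norm = (data_set - min_vals) / ranges
    return norm, ranges, min_vals

# Made-up sample: three features on very different scales
sample = np.array([[40000.0, 8.0, 1.0],
                   [10000.0, 2.0, 0.5],
                   [70000.0, 5.0, 1.5]])
norm, ranges, min_vals = min_max_normalize(sample)
print(norm)
```

A column whose values are all identical has a range of zero and would cause a division by zero, so such columns need separate handling.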

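Example: a handwriting recognition system

Collect data: text files are provided
Prepare data: write the function img2vector to convert the image format into the vector form used by the classifier
Analyze data: inspect the data at the Python prompt to make sure it meets the requirements
Train the algorithm: this step does not apply to kNN
Test the algorithm: write a function that uses part of the provided dataset as test samples. Test samples differ from the other samples in that they have already been classified; if the predicted class differs from the actual class, it is counted as an error
Use the algorithm: this example does not complete this step. If you are interested, you can build a complete application that extracts digits from images and recognizes them; the mail-sorting system in the United States is a real-world system of this kind

Prepare data: convert images into test vectors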

from numpy import *
import operator
import os

# Directories holding the 32x32 text images
train_dir="E:/dataset/machinelearninginaction/Ch02/digits/trainingDigits/"
test_dir="E:/dataset/machinelearninginaction/Ch02/digits/testDigits/"

def img2vector(file_path,filename):
    # Flatten a 32x32 text image of 0/1 characters into a 1x1024 vector
    returnVec=zeros((1,1024))
    with open(file_path+filename) as fr:
        for i2 in range(32):
            lineStr=fr.readline()
            for j2 in range(32):
                returnVec[0,32*i2+j2]=int(lineStr[j2])
    return returnVec

def classify0(inX,dataSet,labels,k):
    dataSetSize=dataSet.shape[0]
    # Distance calculation
    diffMat=tile(inX,(dataSetSize,1))-dataSet
    sqDoffMat=diffMat**2
    sqDistances=sqDoffMat.sum(axis=1)
    distances=sqDistances**0.5
    sortedDistIndices=distances.argsort()
    classCount={}
    # Pick the k points with the smallest distances
    for i in range(k):
        voteIlabel=labels[sortedDistIndices[i]]
        classCount[voteIlabel]=classCount.get(voteIlabel,0)+1
    sortedClassCount=sorted(classCount.items(),
                           key=operator.itemgetter(1),reverse=True)
    return sortedClassCount[0][0]

def handWritingClassTest():
    trainingFileList=os.listdir(train_dir)
    m=len(trainingFileList)
    print(m)
    trainVec=zeros((m,1024))
    traingLabel=[]
    i=0
    for filename in trainingFileList:
        trainVec[i,:]=img2vector(train_dir,filename)
        # The class is the part of the file name before '_'
        label=filename.split('_')[0]
        traingLabel.append(int(label))
        i+=1
    testFileList=os.listdir(test_dir)
    errorCount=0
    for filename in testFileList:
        # Read test images from the test directory, not the training one
        img=img2vector(test_dir,filename)
        resu=classify0(img,trainVec,traingLabel,3)
        label=int(filename.split('_')[0])
        print("predict:%d    the true value:%d"%(resu,label))
        # Compare integers; an int never equals the string form of the label
        if(resu!=label):
            errorCount+=1
    print("the error rate is %f"%(errorCount/len(testFileList)))

handWritingClassTest()
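The image-to-vector conversion can be tried without the dataset by reading from an in-memory text stream. The sketch below mirrors the row-by-row flattening used in img2vector; the function name text_image_to_vector and the synthetic image (a single vertical stroke) are made up for illustration.

```python
import io
import numpy as np

def text_image_to_vector(stream, size=32):
    """Read size lines of size 0/1 characters and flatten them into one row."""
    vec = np.zeros((1, size * size))
    for row in range(size):
        line = stream.readline()
        for col in range(size):
            vec[0, size * row + col] = int(line[col])
    return vec

# Synthetic image: all zeros except a vertical stroke in column 16
rows = ["0" * 16 + "1" + "0" * 15 for _ in range(32)]
fake_file = io.StringIO("\n".join(rows) + "\n")
vec = text_image_to_vector(fake_file)
print(vec.shape)        # (1, 1024)
print(int(vec.sum()))   # 32
```

Pixel (row, col) lands at index 32*row + col, so the stroke in column 16 sets indices 16, 48, 80 and so on.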

Article link: https://blog.csdn.net/u012778718/article/details/78681504
