首页 > 学术百科

KNN算法之KD树（K-dimensionTree）实现K近邻查询

KD树是⼀种分割k维数据空间的数据结构，主要应⽤于多维空间关键数据的搜索，如范围搜索和最近邻搜索。

KD树使⽤了分治的思想，对⽐⼆叉搜索树（BST），KD树解决的是多维空间内的最近点（K近点）问题。（思想与之前见过的最近点对问题很相似，将所有点分为两边，对于可能横跨划分线的点对再进⼀步讨论）

KD树⽤来优化KNN算法中的查询复杂度。

⼀、建树

建⽴KDtree，主要有两步操作：选择合适的分割维度，选择中值节点作为分割节点。

分割维度的选择遵循的原则是，选择范围最⼤的纬度，也即是⽅差最⼤的纬度作为分割维度，为什么⽅差最⼤的适合作为特征呢？

因为⽅差⼤，数据相对“分散”，选取该特征来对数据集进⾏分割，数据散得更“开”⼀些。

分割节点的选择原则是，将这⼀维度的数据进⾏排序，选择正中间的节点作为分割节点，确保节点左边的点的维度值⼩于节点的维度值，节点右边的点的维度值⼤于节点的维度值。

这两步步骤影响搜索效率，⾮常关键。

⼆、搜索K近点

需要的数据结构：最⼤堆（此处我对距离取负从⽽⽤最⼩堆实现的最⼤堆，因为python的heapq模块只有最⼩堆）、堆栈（⽤列表实现）

a.利⽤⼆叉搜索到叶⼦节点并将搜索的结点路径压⼊堆栈stack中

b.通过保存在堆栈中的搜索路径回溯，直⾄堆栈中没有结点了

对于b步骤，需要区分叶⼦结点和⾮叶结点：

1、叶⼦结点：

叶⼦结点：计算与⽬标点的距离。若候选堆中不⾜K个点，将叶⼦结点加⼊候选堆中；如果K个点够了，判断是否⽐候选堆中距离最⼩的结点（因为距离取了相反数）还要⼤，说明应当加⼊候选堆；

2、⾮叶结点：

对于⾮叶结点，处理步骤和叶⼦结点差不多，只是需要额外考虑以⽬标点为圆⼼，最⼤堆的堆顶元素为半径的超球体是否和划分当前空间的超平⾯相交，如果相交说明未访问的另⼀边的空间有可能包含⽐当前已有的K个近点更近的点，需要搜索另⼀边的空间；此外，当候选堆中没有K个点，那么不管有没有相交，都应当搜索未访问的另⼀边空间，因为候选堆的点不够K个。

步骤：计算与⽬标点的距离

1、若不⾜K个点，将结点加⼊候选堆中；

如果K个点够了，判断是否⽐候选堆中距离最⼩的结点（因为距离取了相反数）还要⼤。

2、判断候选堆中的最⼩距离是否⼩于Xi离当前超平⾯的距离（即是否需要判断未访问的另⼀边要不要搜索）当然如果不⾜K个点，虽然超平⾯不相交，依旧要搜索另⼀边，直到到叶⼦结点，并且把路径加⼊回溯栈中。

三、预测

KNN通常⽤来分类或者回归问题，笔者已经封装好了两种预测的⽅法。

python代码实现：

1import heapq

2class KDNode(object):

3def__init__(self,feature=None,father=None,left=None,right=None,split=None):

4 self.feature=feature # dimension index (按第⼏个维度的特征进⾏划分的)

5 self.father=father #并没有⽤到

6 self.left=left

7 self.right=right

8 self.split=split # X value and Y value （元组，包含特征X和真实值Y）

10class KDTree(object):

斩波器11def__init__(self):

12 =KDNode()

13pass

15def _get_variance(self,X,row_indexes,feature_index):

16# X (2D list): samples * dimension

17# row_indexes (1D list): choose which row can be calculated

18# feature_index (int): calculate which dimension

19 n = len(row_indexes)

20 sum1 = 0

21 sum2 = 0

22for id in row_indexes:

23 sum1 = sum1 + X[id][feature_index]

24 sum2 = sum2 + X[id][feature_index]**2

26return sum2/n - (sum1/n)**2

28def _get_max_variance_feature(self,X,row_indexes):

29 mx_var = -1

30 dim_index = -1

31for dim in range(len(X[0])):

32 dim_var = self._get_variance(X,row_indexes,dim)

33if dim_var>mx_var:

34 mx_var=dim_var

35 dim_index=dim

36# return max variance feature index (int)

37return dim_index

39def _get_median_index(self,X,row_indexes,feature_index):

40 median_index = len(row_indexes)//2

小毛孩夺宝奇缘

41 select_X = [(idx,X[idx][feature_index]) for idx in row_indexes]

42 sorted_X = select_X

43 sorted(sorted_X,key= lambda x:x[1])

44#return median index in feature_index dimension (int)

45return sorted_X[median_index][0]

47def _split_feature(self,X,row_indexes,feature_index,median_index):

48 left_ids = []

49 right_ids = []

50 median_val = X[median_index][feature_index]

51for id in row_indexes:

52if id==median_index:

53continue

54 val = X[id][feature_index]

55if val < median_val:

56 left_ids.append(id)

57else:

58 right_ids.append(id)

59# return (left points index and right points index)(list,list)

60# 把当前的样本按feature维度进⾏划分为两份

61return left_ids, right_ids

63def build_tree(self,X,Y):

64 row_indexes =[i for i in range(len(X))]

65 node =

乐视云视频

66 queue = [(node,row_indexes)]

67# BFS创建KD树

68while queue:

69 root,ids = queue.pop(0)

70if len(ids)==1:

71 root.feature = 0 #如果是叶⼦结点，维度赋0

72 root.split = (X[ids[0]],Y[ids[0]])

73continue

74# 选取⽅差最⼤的特征维度划分，取样本的中位数作为median

75 feature_index = self._get_max_variance_feature(X,ids)

76 median_index = self._get_median_index(X,ids,feature_index)

77 left_ids,right_ids = self._split_feature(X,ids,feature_index,median_index)

78 root.feature = feature_index

79 root.split = (X[median_index],Y[median_index])

80if left_ids:

81 root.left = KDNode()

82 root.left.father = root

83 queue.append((root.left,left_ids))

84if right_ids:

85 root.right = KDNode()

86 root.right.father = root

87 queue.append((root.right,right_ids))

89def _get_distance(self,Xi,node,p=2):

90# p=2 default Euclidean distance

91 nx = node.split[0]

92 dist = 0

93for i in range(len(Xi)):

94 dist=dist + (abs(Xi[i]-nx[i])**p)

95 dist = dist**(1/p)

96return dist

98def _get_hyperplane_distance(self,Xi,node):

99 xx = node.split[0]

100 dist = abs(Xi[node.feature] - xx[node.feature])

101return dist

102

103def _is_leaf(self,node):

104if node.left is None and node.right is None:

105return True

106else:

107return False

灭菌检测

108

109def get_nearest_neighbour(self,Xi,K=1):

110 search_paths = []

111 max_heap = [] #use min heap achieve max heap （因为python只有最⼩堆）

112 priority_num = 0 # remove same distance

113 heapq.heappush(max_heap,(float('-inf'),priority_num,None))

114 priority_num +=1

115 node =

116# 到离Xi最近的叶⼦结点

117while node is not None:

118 search_paths.append(node)

119if Xi[node.feature] < node.split[0][node.feature]:

120 node = node.left

121else:

122 node = node.right

123

124while search_paths:

125 now = search_paths.pop()

126# 叶⼦结点：计算与Xi的距离，若不⾜K个点，将叶⼦结点加⼊候选堆中；

127# 如果K个点够了，判断是否⽐候选堆中距离最⼩的结点（因为距离取了相反数）还要⼤，

128# 说明应当加⼊候选堆；

129if self._is_leaf(now):

130 dist = self._get_distance(Xi,now)

131 dist = -dist

132 mini_dist = max_heap[0][0]

133if len(max_heap) < K :

134 heapq.heappush(max_heap,(dist,priority_num,now))

135 priority_num+=1

136elif dist > mini_dist:

137 _ = heapq.heappop(max_heap)

138 heapq.heappush(max_heap,(dist,priority_num,now))

139 priority_num+=1

140# ⾮叶结点：计算与Xi的距离

141# 1、若不⾜K个点，将结点加⼊候选堆中；

142# 如果K个点够了，判断是否⽐候选堆中距离最⼩的结点（因为距离取了相反数）还要⼤，

143# 2、判断候选堆中的最⼩距离是否⼩于Xi离当前超平⾯的距离（即是否需要判断另⼀边要不要搜索）144# 当然如果不⾜K个点，虽然超平⾯不相交，依旧要搜索另⼀边，

145# 直到到叶⼦结点，并且把路径加⼊回溯栈中

146else :

147 dist = self._get_distance(Xi, now)

148 dist = -dist

149 mini_dist = max_heap[0][0]

150if len(max_heap) < K:

151 heapq.heappush(max_heap, (dist, priority_num, now))

152 priority_num += 1

153elif dist > mini_dist:

154 _ = heapq.heappop(max_heap)

155 heapq.heappush(max_heap, (dist, priority_num, now))

156 priority_num += 1

157

158 mini_dist = max_heap[0][0]

159if len(max_heap)<K or -(self._get_hyperplane_distance(Xi,now)) > mini_dist: 160# search another child tree

161if Xi[now.feature] >= now.split[0][now.feature]:

162 child_node = now.left

春天里农民工163else:

164 child_node = now.right

165# record path until find child leaf node

166while child_node is not None:

167 search_paths.append(child_node)

168if Xi[child_node.feature] < child_node.split[0][child_node.feature]:

169 child_node = child_node.left

170else:

171 child_node = child_node.right

172return max_heap

173

174def predict_classification(self,Xi,K=1):

175# 多分类问题预测

176 y =_nearest_neighbour(Xi,K)

177 mp = {}

178for i in y:

179if i[2].split[1] in mp:

180 mp[i[2].split[1]]+=1

181else:

182 mp[i[2].split[1]]=1

183 pre_y = -1

184 max_cnt =-1

185for k,v in mp.items():

186if v>max_cnt:

187 max_cnt=v

叶小纲大地之歌188 pre_y=k

189return pre_y

190

191def predict_regression(self,Xi,K=1):

192#回归问题预测

193 pre_y = _nearest_neighbour(Xi,K)

194return sum([i[2].split[1] for i in pre_y])/K

195

196

197# t =KDTree()

198# xx = [[3,3],[1,2],[5,6],[999,999],[5,5]]

199# z = [1,0,1,1,1]

200# t.build_tree(xx,z)

201# y=t.predict_regression([4,5.5],K=5)

202# print(y)

参考资料：

《统计学习⽅法》——李航著

本文发布于:2024-09-22 01:40:40，感谢您对本站的认可！

本文链接：https://www.17tex.com/xueshu/326072.html

上一篇：机器学习之近邻算法模型（KNN）

下一篇：KNN算法及KNN的优化算法-加权KNN

标签：结点节点是否搜索分割

留言与评论（共有 0 条评论）