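[Time Series Clustering] KMedoids Clustering + DTW
Preface
KMedoids sometimes clusters better than KMeans. I happen to have a batch of time-series data on hand, so today I'll try KMedoids on it and see how the clustering turns out.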
Installation
KMedoids is available in scikit-learn-extra, an extension package for sklearn's clustering tools. The package requires:
- Python (>=3.6)
- scikit-learn (>=0.22)
Install scikit-learn-extra
PyPi: pip install scikit-learn-extra
Conda: conda install -c conda-forge scikit-learn-extra
Git: pip install https://github.com/scikit-learn-contrib/scikit-learn-extra/archive/master.zip
Install tslearn
PyPi: python -m pip install tslearn
Conda: conda install -c conda-forge tslearn
Git: python -m pip install https://github.com/tslearn-team/tslearn/archive/master.zip
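Why use both scikit-learn-extra and tslearn? sklearn itself ships neither KMedoids nor a time-series metric such as DTW. Combining the two packages fills the gap: tslearn computes the pairwise DTW distance matrix, and KMedoids from scikit-learn-extra clusters on that precomputed matrix.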
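To make concrete what a DTW-style metric measures, here is a minimal pure-Python sketch. It assumes the common formulation with a squared-difference local cost and a square root over the accumulated cost. The names `dtw_distance` and `pairwise_dtw` are illustrative only, and this is not tslearn's implementation, which is optimized and also handles multivariate series.

```python
import math


def dtw_distance(a, b):
    """DTW distance between two 1-D sequences (squared-difference local cost)."""
    n, m = len(a), len(b)
    # acc[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            # extend the cheapest of the three admissible predecessor alignments
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return math.sqrt(acc[n][m])


def pairwise_dtw(series):
    """Symmetric matrix of DTW distances, usable as a 'precomputed' metric."""
    k = len(series)
    dists = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            dists[i][j] = dists[j][i] = dtw_distance(series[i], series[j])
    return dists
```

Because the alignment may stretch one series against the other, `[1, 2, 3]` and `[1, 2, 2, 3]` end up at distance zero even though their lengths differ. Euclidean distance cannot express that, which is why a DTW matrix is handed to KMedoids.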
Test
import numpy as np
from sklearn_extra.cluster import KMedoids
import tslearn.metrics as metrics
# custom data-processing module (not part of any library)
import data_process
from tslearn.clustering import silhouette_score
from tslearn.preprocessing import TimeSeriesScalerMeanVariance
from tslearn.generators import random_walks
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# shape(X) = (100, 2800+)
# np.float was removed from recent NumPy releases, so use the builtin float
X = np.loadtxt("top100.txt", dtype=float, delimiter=",")
# downsample each series to 30 dimensions
X = data_process.downsample(X, 30)
seed = 0


# elbow method to find the best number of clusters; result: elbow = 5
def test_elbow():
    distortions = []
    dists = metrics.cdist_dtw(X)  # dba + dtw
    # dists = metrics.cdist_soft_dtw_normalized(X, gamma=.5)  # softdtw
    for i in range(2, 15):
        km = KMedoids(n_clusters=i, random_state=seed, metric="precomputed")
        km.fit(dists)
        # record the sum of errors
        distortions.append(km.inertia_)
    plt.plot(range(2, 15), distortions, marker="o")
    plt.xlabel("Number of clusters")
    plt.ylabel("Distortion")
    plt.show()


def test_kmedoids():
    num_cluster = 5
    # metric="precomputed" tells KMedoids that a distance matrix is passed in
    km = KMedoids(n_clusters=num_cluster, random_state=seed, metric="precomputed")
    # compute similarity with DTW or one of its variants from tslearn,
    # producing the distance matrix dists
    dists = metrics.cdist_dtw(X)  # dba + dtw
    # dists = metrics.cdist_soft_dtw_normalized(X, gamma=0.5)  # softdtw
    y_pred = km.fit_predict(dists)
    # the silhouette computation expects a zero diagonal
    np.fill_diagonal(dists, 0)
    score = silhouette_score(dists, y_pred, metric="precomputed")
    print(X.shape)
    print(y_pred.shape)
    print("silhouette_score: " + str(score))
    for yi in range(num_cluster):
        plt.subplot(3, 2, yi + 1)
        for xx in X[y_pred == yi]:
            plt.plot(xx.ravel(), "k-", alpha=.3)
        # Note: use X[km.medoid_indices_[yi]] here rather than cluster_centers_,
        # because with metric="precomputed" the source sets cluster_centers_ to None
        plt.plot(X[km.medoid_indices_[yi]], "r-")
        plt.text(0.55, 0.85, "Cluster %d" % (yi + 1), transform=plt.gca().transAxes)
        if yi == 1:
            plt.title("KMedoids" + " + DBA-DTW")
    plt.tight_layout()
    plt.show()


# test_elbow()
test_kmedoids()
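The script relies on `data_process.downsample`, a custom helper whose source is not shown in this post. As a purely hypothetical stand-in, the sketch below reduces each series to a fixed number of points by averaging nearly equal-length segments (a piecewise aggregate approximation). The author's actual helper may work differently, so treat the function body as an assumption.

```python
import numpy as np


def downsample(X, n_points):
    """Hypothetical stand-in for data_process.downsample: per-segment means.

    X has shape (n_series, series_length); the result has shape
    (n_series, n_points). Assumes n_points <= series_length.
    """
    X = np.asarray(X, dtype=float)
    # split the time axis into n_points nearly equal segments
    segments = np.array_split(X, n_points, axis=1)
    # replace every segment by its mean value, one column per segment
    return np.stack([seg.mean(axis=1) for seg in segments], axis=1)
```

Applied to the data above, `downsample(X, 30)` would turn the (100, 2800+) array into a (100, 30) array, which keeps the pairwise DTW computation cheap.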
Clustering result with KMedoids + DBA-DTW # silhouette_score: 0.5465097470777784
Clustering result with KMedoids + SoftDTW # silhouette_score: 0.6528261125440392
Clustering result with plain Euclidean distance # silhouette_score: 0.5209641775604567
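Compared with KMeans, the cluster centers found by KMedoids (the red lines) seem to look better visually (for the KMeans + DTW clustering results, see the linked post), yet the silhouette score is lower than with KMeans. With Euclidean distance alone, however, KMedoids scores slightly better on the silhouette coefficient than KMeans.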
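Since the comparison rests on the silhouette coefficient, it may help to see how that number is derived from a precomputed distance matrix. The sketch below follows the standard definition: for each sample, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and the score is (b - a) / max(a, b). The function name `silhouette_precomputed` is illustrative; it assumes a zero diagonal and at least two clusters, and it is not the tslearn implementation.

```python
import numpy as np


def silhouette_precomputed(dists, labels):
    """Mean silhouette coefficient from a square distance matrix."""
    dists = np.asarray(dists, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    scores = np.zeros(len(labels))
    for i, li in enumerate(labels):
        same = labels == li
        n_same = same.sum()
        if n_same == 1:
            continue  # singleton cluster: score 0 by convention
        # a: mean distance to the other members of the sample's own cluster
        # (the zero diagonal entry contributes nothing to the sum)
        a = dists[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dists[i, labels == lj].mean() for lj in clusters if lj != li)
        scores[i] = (b - a) / max(a, b)
    return scores.mean()
```

Values close to 1 mean samples sit much nearer to their own cluster than to the next one. The zero-diagonal assumption matches the `np.fill_diagonal(dists, 0)` step in the script, which matters because soft-DTW variants do not guarantee a zero self-distance.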