Data discretization is mostly applied to continuous data; after processing, the variable's value distribution changes from a continuous attribute to a discrete attribute.
1. Discretizing time data: (1) discretize within a single day, i.e., convert a timestamp to seconds, minutes, hours, or AM/PM;
(2) discretize at day granularity or above, i.e., convert a date to week number, month, year, etc.
2. Discretizing data that is already discrete: re-partition the original intervals into new ones.
3. Discretizing continuous data:
Common methods: (1) Quantile method: split at quartiles, quintiles, etc.;
(2) Interval method: split into equal-width or custom intervals;
(3) Frequency method: sort the data, then discretize into equal-frequency or specified-frequency bins;
(4) Clustering method: e.g., use K-means to group the samples into several discrete clusters;
(5) Chi-square method: use chi-square-based discretization to find the most similar adjacent intervals in the data and merge them into larger ones (see the sketch after this list).
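The chi-square approach is the only method in this list without a worked example below, so here is a minimal sketch of the classic ChiMerge idea, assuming a numeric feature x and a class label y (the function name, arguments, and max_intervals threshold are illustrative, not from a library): start with one interval per distinct value, then repeatedly merge the adjacent pair whose class distributions are most similar, i.e., whose 2 x n_classes contingency table has the lowest chi-square statistic.

import numpy as np
import pandas as pd

def chimerge(x, y, max_intervals=5):
    # One initial interval per distinct value of x;
    # rows of `counts` are intervals, columns are classes of y.
    freq = (pd.DataFrame({'x': x, 'y': y})
              .groupby('x')['y']
              .value_counts()
              .unstack(fill_value=0)
              .sort_index())
    cut_points = list(freq.index)          # lower bound of each interval
    counts = freq.to_numpy().astype(float)
    while len(cut_points) > max_intervals:
        chi2 = []
        for i in range(len(cut_points) - 1):
            obs = counts[i:i + 2]          # 2 x n_classes observed table
            exp = (obs.sum(axis=1, keepdims=True)
                   @ obs.sum(axis=0, keepdims=True)) / obs.sum()
            exp[exp == 0] = 1e-6           # guard against division by zero
            chi2.append(((obs - exp) ** 2 / exp).sum())
        k = int(np.argmin(chi2))           # most similar adjacent pair
        counts[k] += counts[k + 1]         # merge intervals k and k+1
        counts = np.delete(counts, k + 1, axis=0)
        del cut_points[k + 1]
    return cut_points                      # lower bounds of merged intervals

With the data loaded below, something like chimerge(df['var1'], df['age'], max_intervals=4) would return the lower bounds of four merged intervals for var1, supervised by the age label.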
# Read in the data
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_table(r'E:\data analysis\test\test4.txt',
                   names=['id', 'var1', 'var2', 'datetime', 'age'])
print(df.head(3))
id var1 var2 datetime age
0 15093 1390 10.40 2017-04-30 19:24:13 0-10
1 15062 4024 4.68 2017-04-27 22:44:59 70-80
2 15028 6359 3.84 2017-04-27 10:07:55 40-50
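As a side note, the timestamp column could also be parsed during loading instead of being converted afterwards; a sketch using the standard parse_dates parameter of read_table/read_csv:

df = pd.read_table(r'E:\data analysis\test\test4.txt',
                   names=['id', 'var1', 'var2', 'datetime', 'age'],
                   parse_dates=['datetime'])   # datetime column arrives as datetime64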
# Discretizing the time data
df_weekday = df.copy()
# Vectorized conversion: parse the timestamps, then map each one to its
# day of week (Monday=0 ... Sunday=6) via the .dt accessor.
df_weekday['datetime'] = pd.to_datetime(df_weekday['datetime']).dt.weekday
print(df_weekday.head())
id var1 var2 datetime age
0 15093 1390 10.40 6 0-10
1 15062 4024 4.68 3 70-80
2 15028 6359 3.84 3 40-50
3 15012 7759 3.70 1 30-40
4 15021 331 4.25 5 70-80
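The other granularities listed at the top work the same way through the .dt accessor; a minimal sketch, assuming df from above (the new column names are illustrative):

ts = pd.to_datetime(df['datetime'])
df_time = df.copy()
df_time['hour'] = ts.dt.hour                             # hour of day, 0-23
df_time['ampm'] = np.where(ts.dt.hour < 12, 'AM', 'PM')  # AM/PM split
df_time['month'] = ts.dt.month                           # month, 1-12
df_time['week'] = ts.dt.isocalendar().week               # ISO week number (pandas >= 1.1)
print(df_time.head())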
# Discretizing discrete data
print(pd.unique(df['age']))
# Lookup table mapping the original age bands to coarser ones.
map_df = pd.DataFrame([['0-10', '0-40'], ['10-20', '0-40'], ['20-30', '0-40'], ['30-40', '0-40'],
                       ['40-50', '40-80'], ['50-60', '40-80'], ['60-70', '40-80'], ['70-80', '40-80'],
                       ['80-90', '>80'], ['>90', '>80']],
                      columns=['age', 'age_new'])
df_new = df.merge(map_df, on='age', how='inner')
df_new = df_new.drop('age', axis=1)   # keep only the coarser age_new column
print(df_new.head())
['0-10' '70-80' '40-50' '30-40' '10-20' '20-30' '>90' '50-60' '60-70'
'80-90']
id var1 var2 datetime age_new
0 15093 1390 10.40 2017-04-30 19:24:13 0-40
1 15064 7952 4.40 2017-04-03 14:45:29 0-40
2 15080 503 5.72 2017-04-22 12:34:54 0-40
3 15068 1668 3.19 2017-04-15 21:56:31 0-40
4 15019 6710 3.20 2017-04-03 22:22:28 0-40
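A merge is not strictly required for a one-to-one recode; a dict passed to Series.map produces the same result while preserving the original row order (a sketch; age_map and df_new2 are illustrative names):

age_map = {'0-10': '0-40', '10-20': '0-40', '20-30': '0-40', '30-40': '0-40',
           '40-50': '40-80', '50-60': '40-80', '60-70': '40-80', '70-80': '40-80',
           '80-90': '>80', '>90': '>80'}
df_new2 = df.copy()
df_new2['age_new'] = df_new2['age'].map(age_map)   # NaN for any unmapped band
df_new2 = df_new2.drop('age', axis=1)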
# Discretizing continuous data -- custom bin intervals
df_cut = df.copy()
print(df['var1'].max(), df['var1'].min())
bins = [0, 500, 1000, 5000, 10000]              # custom bin edges
df_cut['var1_cut'] = pd.cut(df['var1'], bins)   # right-closed intervals by default
print(df_cut.head())
print(df_cut['var1_cut'].value_counts())
7952 176
id var1 var2 datetime age var1_cut
0 15093 1390 10.40 2017-04-30 19:24:13 0-10 (1000, 5000]
1 15062 4024 4.68 2017-04-27 22:44:59 70-80 (1000, 5000]
2 15028 6359 3.84 2017-04-27 10:07:55 40-50 (5000, 10000]
3 15012 7759 3.70 2017-04-04 07:28:18 30-40 (5000, 10000]
4 15021 331 4.25 2017-04-08 11:14:00 70-80 (0, 500]
(1000, 5000] 48
(5000, 10000] 38
(500, 1000] 7
(0, 500] 7
Name: var1_cut, dtype: int64
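If a downstream model needs numeric bin indices rather than Interval objects, pd.cut can return them directly; a small sketch reusing the same bins (var1_code is an illustrative name):

df_cut['var1_code'] = pd.cut(df['var1'], bins, labels=False)   # 0-based index of each bin
print(df_cut[['var1', 'var1_cut', 'var1_code']].head())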
# Discretizing continuous data -- four equal-width bins
Note that pd.cut with an integer argument splits the value range into equal-width bins; it does not split at the quartiles (see the pd.qcut sketch below for that).
df_cut['var1_cut2'] = pd.cut(df['var1'], 4, labels=['excellent', 'good', 'mediate', 'bad'])
print(df_cut.head())
id var1 var2 datetime age var1_cut var1_cut2
0 15093 1390 10.40 2017-04-30 19:24:13 0-10 (1000, 5000] excellent
1 15062 4024 4.68 2017-04-27 22:44:59 70-80 (1000, 5000] good
2 15028 6359 3.84 2017-04-27 10:07:55 40-50 (5000, 10000] bad
3 15012 7759 3.70 2017-04-04 07:28:18 30-40 (5000, 10000] bad
4 15021 331 4.25 2017-04-08 11:14:00 70-80 (0, 500] excellent
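For genuine quartile binning, where each bin holds roughly a quarter of the rows, pd.qcut splits at the quantiles instead of the value range; a sketch reusing the labels above (var1_qcut is an illustrative name):

# Equal-frequency bins: cut points are the 25th/50th/75th percentiles of var1.
df_cut['var1_qcut'] = pd.qcut(df['var1'], 4, labels=['excellent', 'good', 'mediate', 'bad'])
print(df_cut['var1_qcut'].value_counts())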
# Discretizing continuous data -- K-means
data = df['var1']
data_reshape = data.values.reshape((len(data), 1))   # KMeans expects a 2-D array
model = KMeans(n_clusters=4, random_state=0)
res = model.fit_predict(data_reshape)                # cluster label for each sample
df_cut['kmeans'] = res
print(df_cut.head())
print(df_cut['kmeans'].unique())
id var1 var2 datetime age var1_cut var1_cut2 kmeans
0 15093 1390 10.40 2017-04-30 19:24:13 0-10 (1000, 5000] excellent 0
1 15062 4024 4.68 2017-04-27 22:44:59 70-80 (1000, 5000] good 2
2 15028 6359 3.84 2017-04-27 10:07:55 40-50 (5000, 10000] bad 3
3 15012 7759 3.70 2017-04-04 07:28:18 30-40 (5000, 10000] bad 3
4 15021 331 4.25 2017-04-08 11:14:00 70-80 (0, 500] excellent 0
[0 2 3 1]
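As the [0 2 3 1] output shows, K-means labels are arbitrary integers, not ordered bins. When an ordinal encoding is wanted, the fitted cluster centers can be sorted; a sketch, assuming model and df_cut from above (kmeans_bin and boundaries are illustrative names):

import numpy as np

centers = model.cluster_centers_.flatten()
order = centers.argsort()                        # cluster ids, smallest center first
rank = {cid: r for r, cid in enumerate(order)}   # cluster id -> ordinal bin 0..3
df_cut['kmeans_bin'] = df_cut['kmeans'].map(rank)
# Implied boundaries between bins: midpoints of consecutive sorted centers.
boundaries = (np.sort(centers)[:-1] + np.sort(centers)[1:]) / 2
print(boundaries)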