1. Standardization (Mean Removal and Variance Scaling)
After standardization, each feature (column) of the data has zero mean and unit variance.
from sklearn import preprocessing
X = [[1., -1., 2.],
[2., 0., 0.],
[0., 1., -1.]]
X_scaled = preprocessing.scale(X)
print(X_scaled)
#[[ 0. -1.22474487 1.33630621]
# [ 1.22474487 0. -0.26726124]
# [-1.22474487 1.22474487 -1.06904497]]
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
#[ 0. 0. 0.]
#[ 1. 1. 1.]
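As a rough check of what scale computes, the sketch below standardizes the same matrix by hand with NumPy and compares the result with the library call. It assumes the population standard deviation (ddof=0) is used; the names X_manual, mean, and std are illustrative only.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Column-wise mean and population standard deviation (ddof=0)
mean = X.mean(axis=0)
std = X.std(axis=0)

# Standardize by hand: subtract the mean, then divide by the standard deviation
X_manual = (X - mean) / std

# Compare with the library result
X_scaled = preprocessing.scale(X)
print(np.allclose(X_manual, X_scaled))  # True
```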
We can also do this with the Scaler utility class from the preprocessing module (named StandardScaler in version 0.15 and later):
scaler = preprocessing.StandardScaler().fit(X)
print(scaler)
#StandardScaler(copy=True, with_mean=True, with_std=True)
print(scaler.mean_)
#[ 1. 0. 0.33333333]
print(scaler.scale_)  # scaler.std_ in earlier versions
#[ 0.81649658 0.81649658 1.24721913]
print(scaler.transform(X))
#[[ 0. -1.22474487 1.33630621]
# [ 1.22474487 0. -0.26726124]
# [-1.22474487 1.22474487 -1.06904497]]
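Note: fitting and transforming in a single step with fit_transform gives the same result. Here the return value is the transformed array, not the scaler object:
X_scaled = preprocessing.StandardScaler().fit_transform(X)
print(X_scaled)
#[[ 0. -1.22474487 1.33630621]
# [ 1.22474487 0. -0.26726124]
# [-1.22474487 1.22474487 -1.06904497]]
print(X_scaled.mean(axis=0))
#[ 0. 0. 0.]
print(X_scaled.std(axis=0))
#[ 1. 1. 1.]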
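One practical reason to prefer the class over the scale function is that a fitted scaler remembers the training mean and standard deviation and can apply them to data it has not seen. A minimal sketch, where X_new is a hypothetical unseen sample made up for illustration:

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
X_new = np.array([[-1., 1., 0.]])  # hypothetical unseen sample

# Learn the mean and standard deviation from the training data only
scaler = preprocessing.StandardScaler().fit(X_train)

# Reuse the training statistics on the new sample
X_new_scaled = scaler.transform(X_new)
print(X_new_scaled)
```

The new sample is shifted and scaled with the statistics of X_train, so its values are generally not zero-mean or unit-variance on their own.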
2. Normalization
Scales each sample (row) individually to unit norm (the L2 norm by default), so every value ends up within [-1, 1].
X = [[1., -1., 2.],
[2., 0., 0.],
[0., 1., -1.]]
X_normalized = preprocessing.normalize(X)
print(X_normalized)
#[[ 0.40824829 -0.40824829 0.81649658]
# [ 1. 0. 0. ]
# [ 0. 0.70710678 -0.70710678]]
This is equivalent to:
normalizer = preprocessing.Normalizer().fit(X)
print(normalizer)
#Normalizer(copy=True, norm='l2')
print(normalizer.transform(X))
#[[ 0.40824829 -0.40824829 0.81649658]
# [ 1. 0. 0. ]
# [ 0. 0.70710678 -0.70710678]]
Note: fit_transform gives the same result in a single step and returns the normalized array:
X_normalized = preprocessing.Normalizer().fit_transform(X)
print(X_normalized)
#[[ 0.40824829 -0.40824829 0.81649658]
# [ 1. 0. 0. ]
# [ 0. 0.70710678 -0.70710678]]
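To make the row-wise behaviour concrete, the sketch below divides each row by its own L2 norm with NumPy and compares the result with normalize. The names row_norms and X_manual are illustrative, and the example assumes no row is all zeros.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# L2 norm of each row, kept as a column so it broadcasts across the row
row_norms = np.linalg.norm(X, axis=1, keepdims=True)
X_manual = X / row_norms

X_normalized = preprocessing.normalize(X)  # norm='l2' by default
print(np.allclose(X_manual, X_normalized))  # True

# Every row of the result has unit length
print(np.linalg.norm(X_normalized, axis=1))
```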
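3. Binarization
Converts numerical data into two-valued (0/1) data according to a configurable threshold: values greater than the threshold become 1, all others become 0.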
X = [[1., -1., 2.],
[2., 0., 0.],
[0., 1., -1.]]
binarizer = preprocessing.Binarizer().fit(X)  # the default threshold is 0.0
print(binarizer)
#Binarizer(copy=True, threshold=0.0)
print(binarizer.transform(X))
#[[ 1. 0. 1.]
# [ 1. 0. 0.]
# [ 0. 1. 0.]]
binarizer = preprocessing.Binarizer(threshold=1.1)  # set the threshold to 1.1
print(binarizer.transform(X))
#[[ 0. 0. 1.]
# [ 1. 0. 0.]
# [ 0. 0. 0.]]
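The thresholding rule can be reproduced with a plain NumPy comparison. This is a minimal sketch for dense input; X_manual and X_bin are illustrative names.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])
threshold = 1.1

# Values strictly greater than the threshold become 1.0, the rest 0.0
X_manual = (X > threshold).astype(float)

X_bin = preprocessing.Binarizer(threshold=threshold).fit_transform(X)
print(np.array_equal(X_manual, X_bin))  # True
```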
4. Label preprocessing
4.1) Label binarization
LabelBinarizer is typically used to build a label indicator matrix from a list of multi-class labels.
lb = preprocessing.LabelBinarizer()
print(lb.fit([1, 2, 6, 4, 2]))
#LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)
print(lb.classes_)
#[1 2 4 6]
print(lb.transform([1, 6]))
#[[1 0 0 0]
# [0 0 0 1]]
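Each column of the indicator matrix corresponds to one entry of classes_. The sketch below rebuilds the matrix by comparing every label against classes_ and then maps it back to the original labels with inverse_transform; the names indicator, manual, and recovered are illustrative.

```python
import numpy as np
from sklearn import preprocessing

lb = preprocessing.LabelBinarizer()
lb.fit([1, 2, 6, 4, 2])

labels = np.array([1, 6])
indicator = lb.transform(labels)

# Rebuild the indicator matrix by hand: one column per class, 1 where the label matches
manual = (labels[:, None] == lb.classes_).astype(int)
print(np.array_equal(indicator, manual))  # True

# Map the indicator rows back to the original labels
recovered = lb.inverse_transform(indicator)
print(recovered)  # [1 6]
```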
4.2) Label encoding
LabelEncoder maps labels to integers from 0 to n_classes-1.
le = preprocessing.LabelEncoder()
print(le.fit([1, 2, 2, 6]))
#LabelEncoder()
print(le.classes_)
#[1 2 6]
print(le.transform([1, 1, 2, 6]))
#[0 0 1 2]
print(le.inverse_transform([0, 0, 1, 2]))
#[1 1 2 6]
It can also convert non-numerical labels into numerical labels:
le = preprocessing.LabelEncoder()
print(le.fit(["paris", "paris", "tokyo", "amsterdam"]))
#LabelEncoder()
print(list(le.classes_))
#['amsterdam', 'paris', 'tokyo']
print(le.transform(["tokyo", "tokyo", "paris"]))
#[2 2 1]
print(list(le.inverse_transform([2, 2, 1])))
#['tokyo', 'tokyo', 'paris']
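The integer codes follow the sorted order of the distinct labels. As an illustration of that ordering (not a description of how the library is implemented), np.unique with return_inverse=True yields the same classes and codes:

```python
import numpy as np
from sklearn import preprocessing

labels = ["paris", "paris", "tokyo", "amsterdam"]

# Sorted distinct labels, plus the index of each original label in that sorted list
classes, codes = np.unique(labels, return_inverse=True)

le = preprocessing.LabelEncoder().fit(labels)
print(list(classes) == list(le.classes_))        # True
print(list(codes) == list(le.transform(labels)))  # True
```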