I have a dataframe in pandas where each column has a different value range. For example:
df:
A B C
1000 10 0.5
765 5 0.35
800 7 0.09
Any idea how I can normalize the columns of this dataframe so that each value is between 0 and 1?
My desired output is:
A B C
1 1 1
0.765 0.5 0.7
0.8 0.7 0.18 (which is 0.09/0.5)
9 answers
#1
67
You can use the package sklearn and its associated preprocessing utilities to normalize the data.
import pandas as pd
from sklearn import preprocessing
x = df.values  # returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
# pass columns and index so the original labels are not lost
df = pd.DataFrame(x_scaled, columns=df.columns, index=df.index)
For more information, see the scikit-learn documentation on preprocessing data: scaling features to a range.
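Note that MinMaxScaler maps each column's minimum to 0, so 765 in column A would become 0 rather than 0.765. If the goal is the output shown in the question, one option is sklearn's MaxAbsScaler, which divides each column by its largest absolute value. A minimal sketch, assuming the sample data from the question (the variable names are only illustrative):

```python
import pandas as pd
from sklearn.preprocessing import MaxAbsScaler

# Sample data copied from the question.
df = pd.DataFrame({"A": [1000, 765, 800], "B": [10, 5, 7], "C": [0.5, 0.35, 0.09]})

scaler = MaxAbsScaler()
# fit_transform returns a numpy array, so wrap it to keep the labels.
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)

print(scaled)
```

For all-positive data this is the same as dividing each column by its maximum.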
#2
86
One easy way using pandas (here I use mean normalization; strictly speaking, subtracting the mean and dividing by the standard deviation is z-score standardization):
normalized_df = (df - df.mean()) / df.std()
To use min-max normalization instead:
normalized_df = (df - df.min()) / (df.max() - df.min())
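The two formulas give different kinds of result, which a quick check makes visible. A small sketch on the question's sample data (the names zscore and minmax are my own): the first yields columns with mean 0 and standard deviation 1, so values can be negative or greater than 1, while only the second is confined to the range 0 to 1.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"A": [1000, 765, 800], "B": [10, 5, 7], "C": [0.5, 0.35, 0.09]})

# Standardization: each column ends up with mean 0 and sample std 1.
zscore = (df - df.mean()) / df.std()

# Min-max scaling: each column's minimum maps to 0 and its maximum to 1.
minmax = (df - df.min()) / (df.max() - df.min())

print(zscore.agg(["mean", "std"]).round(6))
print(minmax.agg(["min", "max"]))
```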
#3
27
Based on this post: https://stats.stackexchange.com/questions/70801/how-to-normalize-data-to-0-1-range
You can do the following:
def normalize(df):
    result = df.copy()
    for feature_name in df.columns:
        max_value = df[feature_name].max()
        min_value = df[feature_name].min()
        result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    return result
You don't need to worry about whether your values are negative or positive: every column ends up with its minimum at 0 and its maximum at 1.
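One case the min-max formula does not handle is a column where every value is the same: max minus min is 0, and the division produces NaN. A minimal sketch of one way to guard against that, assuming you want such columns to become 0 (normalize_safe is a hypothetical helper name, not a pandas function):

```python
import pandas as pd

def normalize_safe(df):
    """Min-max scale each column; constant columns become 0.0 instead of NaN."""
    value_range = df.max() - df.min()
    # A zero range is replaced by 1 so the division yields 0 rather than NaN.
    value_range = value_range.replace(0, 1)
    return (df - df.min()) / value_range

# Column A mixes negative and positive values, column B is constant.
df = pd.DataFrame({"A": [-5.0, 0.0, 5.0], "B": [3.0, 3.0, 3.0]})
result = normalize_safe(df)

print(result)
```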
#4
12
If you like using the sklearn package, you can keep the column and index names by assigning through pandas loc, like so:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaled_values = scaler.fit_transform(df)
df.loc[:,:] = scaled_values
#5
11
Your problem is actually a simple transform acting on the columns:
def f(s):
    return s/s.max()
frame.apply(f, axis=0)
Or even more terse:
frame.apply(lambda x: x/x.max(), axis=0)
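To check this against the question, here is a small sketch that applies the same lambda to the sample data and compares the result with the desired output (the expected list is typed in from the question):

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"A": [1000, 765, 800], "B": [10, 5, 7], "C": [0.5, 0.35, 0.09]})

# Column-wise apply: each column (a Series) is divided by its own maximum.
scaled = df.apply(lambda col: col / col.max(), axis=0)

# Desired output as given in the question.
expected = pd.DataFrame(
    [[1, 1, 1], [0.765, 0.5, 0.7], [0.8, 0.7, 0.18]], columns=["A", "B", "C"]
)

# Compare with a tolerance, since these are floating-point results.
matches = bool(((scaled - expected).abs() < 1e-9).all().all())
print(matches)
```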
#6
7
I think that a better way to do that in pandas is just
import numpy as np
df = df/df.max().astype(np.float64)
Edit: if negative numbers are present in your data frame, divide by the largest absolute value of each column instead, which scales the values into the range -1 to 1:
df = df/df.abs().max().astype(np.float64)
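A small check of why negative values matter here, using made-up numbers: dividing by the column maximum can push values below -1, while dividing by the maximum absolute value keeps everything between -1 and 1. This is only a sketch, and the variable names are my own.

```python
import pandas as pd

# Made-up data: column A contains a negative value larger in magnitude than its maximum.
df = pd.DataFrame({"A": [-200.0, 50.0, 100.0], "B": [1.0, 2.0, 4.0]})

# Dividing by the plain maximum: -200 / 100 gives -2, outside [-1, 1].
by_max = df / df.max()

# Dividing by the largest absolute value: -200 / 200 gives -1.
by_abs_max = df / df.abs().max()

print(by_max)
print(by_abs_max)
```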
#7
2
The solution given by Sandman and Praveen works very well. The only problem is that if you have categorical variables in other columns of your data frame, this method needs some adjustments.
My solution to this type of issue is the following:
import pandas as pd
from sklearn import preprocessing
num_cols = ['Numerical1', 'Numerical2', 'Numerical3']
x = df[num_cols]  # select the numerical columns side by side
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
x_new = pd.DataFrame(x_scaled, columns=num_cols, index=df.index)
df = pd.concat([df.Categoricals, x_new], axis=1)  # join column-wise
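If you would rather not list the numerical columns by hand, pandas can pick them out by dtype. A minimal sketch using only pandas, assuming a made-up frame with one text column called kind (the column name and data are illustrative):

```python
import pandas as pd

# Made-up frame mixing a categorical column with numerical ones.
df = pd.DataFrame(
    {"kind": ["a", "b", "c"], "A": [1000, 765, 800], "C": [0.5, 0.35, 0.09]}
)

# Names of all numeric columns; the text column is left out.
num_cols = df.select_dtypes(include="number").columns

scaled = df.copy()
nums = df[num_cols]
# Min-max scale only the numeric columns and write them back.
scaled[num_cols] = (nums - nums.min()) / (nums.max() - nums.min())

print(scaled)
```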
#8
2
Simple is Beautiful:
df["A"] = df["A"] / df["A"].max()
df["B"] = df["B"] / df["B"].max()
df["C"] = df["C"] / df["C"].max()
#9
0
import numpy as np
import pandas as pd
def normalize(x):
    # divide each column by its L1 norm (the sum of its absolute values)
    return x/np.linalg.norm(x,ord=1)
data = data.apply(normalize)
According to the pandas documentation, a DataFrame can apply an operation (function) to itself:
DataFrame.apply(func, axis=0, broadcast=False, raw=False, reduce=None, args=(), **kwds)
Applies function along input axis of DataFrame. Objects passed to functions are Series objects having index either the DataFrame’s index (axis=0) or the columns (axis=1). Return type depends on whether passed function aggregates, or the reduce argument if the DataFrame is empty.
You can apply a custom function to operate on the DataFrame.
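Be aware that dividing by the L1 norm is a different normalization from the one in the question: each column's absolute values then sum to 1, so the largest entry is generally smaller than 1. A quick sketch on the question's sample data (the variable name l1_scaled is my own):

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"A": [1000, 765, 800], "B": [10, 5, 7], "C": [0.5, 0.35, 0.09]})

# Divide every column by its L1 norm, i.e. the sum of its absolute values.
l1_scaled = df.apply(lambda col: col / np.linalg.norm(col.to_numpy(), ord=1))

# Each column now sums to 1, and its maximum is below 1.
print(l1_scaled.sum())
print(l1_scaled.max())
```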