In R, I'm drawing a rather large boxplot from a data.frame with approximately 150 columns. I know that there are some "anomalous" columns where the distribution is too different from the rest of the data set and I want to identify which ones precisely.
Rather unsurprisingly, there is not enough room for the labels, and even if there were, checking them by hand would probably be inconvenient. So I thought I could use R's identify function to locate the offending columns. However, that function needs x and y coordinates, and so far I have been unable to get it to work.
I tried
boxplot(dd.noctr$TGS, outline=F)
identify(xy.coords(dd.noctr$TGS)$x, y=xy.coords(dd.noctr$TGS)$y)
where dd.noctr$TGS is my data (a matrix or data.frame), only to get the warning
warning: no point within 0.25 inches
meaning that no point was identified.
Is there an alternative solution to identify column names (not single points)?
2 Answers
#1
This solution seems a bit clunky, so there is probably a better solution.
Set up some example data with three columns:
TGS = data.frame(A = rnorm(100), B = rnorm(100), C=rnorm(100))
Next, plot the boxplot:
boxplot(TGS, outline = FALSE)
Now we call the identify function:
identify(x = rep(seq_len(ncol(TGS)), each = nrow(TGS)),
         y = unlist(TGS, use.names = FALSE),
         labels = rep(colnames(TGS), each = nrow(TGS)))
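The labels are the column names. identify only labels a point when the click lands within its tolerance (0.25 inches by default) of one of the supplied coordinates, and every point of a column sits at that column's integer x position, so this only works if you click near the vertical centre line of a box.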
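Since the goal is to pick out whole columns rather than individual observations, one variation is to give identify a single point per box, for instance the median that boxplot already computes. The following is a minimal sketch under that assumption: the helper name box_centres is made up for illustration, the seed is added only for reproducibility, and identify is called only in an interactive session.

```r
# Example data as above (seed added only for reproducibility)
set.seed(1)
TGS <- data.frame(A = rnorm(100), B = rnorm(100), C = rnorm(100))

# Hypothetical helper: one (x, y) point per column, placed at the median.
# boxplot(..., plot = FALSE) returns the statistics without drawing;
# row 3 of $stats holds the medians and $names holds the column names.
box_centres <- function(data) {
  bp <- boxplot(data, plot = FALSE)
  data.frame(x = seq_along(bp$names),
             y = bp$stats[3, ],
             label = bp$names,
             stringsAsFactors = FALSE)
}

centres <- box_centres(TGS)

if (interactive()) {
  boxplot(TGS, outline = FALSE)
  # Click on or near the median line of a box, then end the selection
  # the way your graphics device expects (e.g. right-click or Esc).
  picked <- identify(centres$x, centres$y, labels = centres$label)
  print(centres$label[picked])
}
```

With one candidate point per column there is no ambiguity about which label is drawn, and the indices returned by identify map directly to column names.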
#2
If you want a list of the outliers, you can use the out component of the value returned by boxplot.
Example: create a data frame with a few random values with mean 20, then add some outliers. This code displays the outliers.
df1 = data.frame(A = c(rnorm(15,20,3),7,8,35,32)) #15 rnorm and 4 extreme values
bplot=boxplot(df1)
bplot$out
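The value returned by boxplot also has a group component, which gives the index of the box each entry of out came from, and a names component with the box labels. Combining the two maps outliers back to column names, which is closer to what the question asks for. A minimal sketch follows; the data frame df2, its column names, and the planted extreme values are assumptions made purely for illustration.

```r
set.seed(42)
# Made-up data: column B gets two planted extreme values
df2 <- data.frame(A = rnorm(50),
                  B = c(rnorm(48), 15, -12),
                  C = rnorm(50))

# plot = FALSE returns the statistics without drawing anything
bp <- boxplot(df2, plot = FALSE)

# Column name for every outlier, in the same order as bp$out
outlier_cols <- bp$names[bp$group]
outliers <- data.frame(column = outlier_cols, value = bp$out)
print(outliers)

# Number of outliers per column, including columns with none
print(table(factor(outlier_cols, levels = bp$names)))
```

Columns with unusually many outliers are candidates for the "anomalous" ones, although a column whose whole distribution is shifted may show few outliers of its own, so this is only one possible criterion.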