How to subset efficiently in R using a loop?

Time: 2020-11-29 11:40:45

I have a CSV file named "table_parameter". Please download it from here. The data look like this:

           time        avg.PM10            sill        range       nugget
    1   2012030101  52.2692307692308    0.11054330  45574.072   0.0372612157
    2   2012030102  55.3142857142857    0.20250974  87306.391   0.0483153769
    3   2012030103  56.0380952380952    0.17711558  56806.827   0.0349567088
    4   2012030104  55.9047619047619    0.16466350  104767.669  0.0307528346
    .
    .
    .
    25  2012030201  67.1047619047619    0.14349774  72755.326   0.0300378129
    26  2012030202  71.6571428571429    0.11373430  72755.326   0.0320594776
    27  2012030203  73.352380952381     0.13893530  72755.326   0.0311135434
    28  2012030204  70.2095238095238    0.12642303  29594.037   0.0281416079
    .
    .

In my data frame there is a variable named time that contains hourly values from 1 March 2012 to 7 March 2012 in numeric form. For example, 1 March 2012, 1:00 a.m. is written as 2012030101, and so on.
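
As a minimal sketch of how this encoding can be unpacked (the sample values and the helper names time_chr, hour and day are made up for illustration and are not taken from the file), the hour and the date can be pulled out of the ten-digit value with substr():

```r
# Hypothetical sample values in the YYYYMMDDHH format described above
time <- c(2012030101, 2012030102, 2012030124, 2012030701)

# Convert to character without scientific notation so substr() sees all ten digits
time_chr <- sprintf("%.0f", time)

# Characters 9-10 hold the hour of day (01..24), characters 1-8 the date
hour <- as.integer(substr(time_chr, 9, 10))
day  <- as.Date(substr(time_chr, 1, 8), format = "%Y%m%d")

data.frame(time = time_chr, day = day, hour = hour)
```

Note that hour 24 is kept as 24 here, matching the numbering used in the question.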

From this dataset I want to create 24*11 subset data frames, like the table below:

[Image: table of the desired 24*11 subsets]

For example, for 1 a.m. (2012030101, 2012030201, ..., 2012030701) and avg.PM10<10, I want one data frame. Some of these data frames will probably contain no observations, but that is okay, because I will be working with a very large data set.

I can do this subsetting manually by writing 24*11 = 264 lines of code like this:

table_par<-read.csv("table_parameter.csv")
times<-as.numeric(substr(table_par$time,9,10))

par_1am_0to10 <-subset(table_par,times ==1 & avg.PM10<=10)
par_1am_10to20 <-subset(table_par,times ==1 & avg.PM10>10 & avg.PM10<=20)
par_1am_20to30 <-subset(table_par,times ==1 & avg.PM10>20 & avg.PM10<=30)
.
.
.
par_24pm_80to90 <-subset(table_par,times ==24 & avg.PM10>80 & avg.PM10<=90)
par_24pm_90to100 <-subset(table_par,times==24 & avg.PM10>90 & avg.PM10<=100)
par_24pm_100up <-subset(table_par,times  ==24 & avg.PM10>100)

But I understand this code is very inefficient. Is there any way to do it efficiently by using a loop?

FYI: In the future, I want to draw some plots using these 24*11 datasets.

Update: After this subsetting, I want to draw boxplots of range for every dataset. The problem is that I want to show all 24*11 boxplots of range [like the figure above] in one plot, arranged like a matrix. If you have any further questions, please let me know. Thanks a lot in advance.

2 Answers

#1


2  

You can do this using some plyr, dplyr and tidyr magic:

library(tidyr)
library(dplyr)
# I am not loading plyr here because it interferes with dplyr; I only need it for the round_any function

# Read data
dfData <- read.csv("table_parameter.csv")

dfData %>% 
  # Extract hour and compute the rounded Avg.PM10 using round_any
  mutate(hour = as.numeric(substr(time, 9, 10)),
         roundedPM.10 = plyr::round_any(Avg.PM10, 10, floor),
         roundedPM.10 = ifelse(roundedPM.10 > 100, 100,roundedPM.10)) %>% 
  # Keep only the relevant columns
  select(hour, roundedPM.10) %>% 
  # Count the number of occurrences per hour and rounded Avg.PM10 value
  count(roundedPM.10, hour) %>% 
  # Use spread (from tidyr) to transform it into wide format
  spread(hour, n)

If you plan on using ggplot2, you can drop tidyr and the last line of the code so that the data frame stays in long format; it will be easier to plot that way.
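
The binning step in the code above relies on plyr::round_any(x, 10, floor) followed by a cap at 100. A rough base-R sketch of the same idea, using made-up values in a hypothetical vector pm, might look like this:

```r
# Made-up Avg.PM10 values, chosen to hit the bin edges
pm <- c(0, 9.9, 10, 52.27, 100, 134.6)

# Round down to the nearest multiple of 10, then cap everything at 100
rounded <- pmin(floor(pm / 10) * 10, 100)
rounded
# 0 0 10 50 100 100
```

Rounding down produces left-closed bins such as [10, 20), so a value of exactly 10 lands in the "10" bin, whereas the subset() calls in the question use right-closed bins such as (10, 20]. If the exact edge behaviour matters, that difference is worth checking.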

EDIT: After reading your comment, I realised I misunderstood your question. This will give you a boxplot for each combination of hour and Avg.PM10 interval:

library(tidyr)
library(dplyr)
library(ggplot2)
# I am not loading plyr here because it interferes with dplyr; I only need it
# for the round_any function

# Read data (adjust the path to wherever the file is saved)
dfData <- read.csv("table_parameter.csv")

dfDataPlot <- dfData %>% 
  # Extract hour and compute the rounded Avg.PM10 using round_any
  mutate(hour = as.numeric(substr(time, 9, 10)),
         roundedPM.10 = plyr::round_any(Avg.PM10, 10, floor),
         roundedPM.10 = ifelse(roundedPM.10 > 100, 100,roundedPM.10)) %>% 
  # Keep only the relevant columns
  select(roundedPM.10, hour, range)

# Plot range as a function of hour (as a factor to have separate plots)
# and facet it according to roundedPM.10 on the y axis
ggplot(dfDataPlot, aes(factor(hour), range)) + 
  geom_boxplot() + 
  facet_grid(roundedPM.10~.)
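
If a full hour-by-interval matrix of panels is wanted, one possible variation is to facet on both variables. This is only a sketch on made-up data: plot_df, bin and the random values are illustrative assumptions, and cut() is swapped in for round_any so that the bins are right-closed like the subset() calls in the question.

```r
library(ggplot2)

# Made-up illustrative data; the real values would come from table_parameter.csv
set.seed(42)
n <- 24 * 7
plot_df <- data.frame(
  hour     = factor(rep(1:24, times = 7), levels = 1:24),
  Avg.PM10 = runif(n, min = 0, max = 130),
  range    = runif(n, min = 2e4, max = 1e5)
)

# Right-closed bins: [0,10], (10,20], ..., (90,100], (100,Inf]
plot_df$bin <- cut(plot_df$Avg.PM10,
                   breaks = c(seq(0, 100, by = 10), Inf),
                   include.lowest = TRUE)

# One panel per (bin, hour) pair; drop = FALSE keeps empty combinations in the grid
p <- ggplot(plot_df, aes(x = "", y = range)) +
  geom_boxplot() +
  facet_grid(bin ~ hour, drop = FALSE) +
  labs(x = NULL, y = "range")

print(p)
```

With 11 rows and 24 columns the panels become small, so a large output device or a subset of hours may be needed to keep the plot readable.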

#2


0  

How about a double loop like this:

table_par<-read.csv("table_parameter.csv")
times<-as.numeric(substr(table_par$time,9,10))

#create empty dataframe for output
sub.df <- data.frame(name=NA, X=NA, time=NA,Avg.PM10=NA,sill=NA,range=NA,nugget=NA)[numeric(0), ]

t_list=seq(1,24,1)
PM_list=seq(0,100,10)

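for (t in t_list){
  for (PM in PM_list){
    # Upper bound of the bin; the last bin (PM = 100) is open-ended, as in the question
    PM2 <- if (PM == 100) Inf else PM + 10
    sub <- subset(table_par, times == t & Avg.PM10 > PM & Avg.PM10 <= PM2)
    if (nrow(sub) != 0) {    # skip empty subsets
      # Label the bin like the names in the question, e.g. "20to30" or "100up"
      label <- if (PM == 100) "100up" else paste(PM, "to", PM2, sep="")
      sub$name <- paste("par_", t, "am_", label, sep="")
      sub.df <- rbind(sub.df, sub) }
  }
}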

sub.df #print data frame
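
As an alternative to the double loop, the same grouping can be sketched with cut() and split(), which returns all 24*11 subsets as one named list and keeps the empty ones. The data below are made up for illustration, and the names hour, bin, groups and subsets are assumptions, not part of the original code.

```r
# Made-up illustrative data; the real values come from table_parameter.csv
set.seed(1)
table_par <- data.frame(
  time     = rep(2012030100 + 0:6 * 100, each = 24) + rep(1:24, times = 7),
  Avg.PM10 = runif(24 * 7, min = 0, max = 130),
  range    = runif(24 * 7, min = 2e4, max = 1e5)
)

# Hour of day as a factor with all 24 levels, so empty hours are kept too
hour <- factor(as.integer(substr(sprintf("%.0f", table_par$time), 9, 10)),
               levels = 1:24)

# Eleven right-closed bins: 0to10, 10to20, ..., 90to100, 100up
bin <- cut(table_par$Avg.PM10,
           breaks = c(seq(0, 100, by = 10), Inf),
           labels = c(paste0(seq(0, 90, 10), "to", seq(10, 100, 10)), "100up"),
           include.lowest = TRUE)

# One list element per (hour, bin) combination, named like "1_0to10"
groups  <- interaction(hour, bin, sep = "_")
subsets <- split(table_par, groups)

length(subsets)   # 24 * 11 = 264 data frames, some with zero rows
```

An individual subset can then be retrieved by name, for example subsets[["1_0to10"]], and lapply() or sapply() can be used to apply the same summary or plot to every element.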
