
Time: 2021-10-16 14:58:20

I have a CSV file named "table_parameter" (please download it from here). The data look like this:


              time          avg.PM10        sill       range        nugget
     1  2012030101  52.2692307692308  0.11054330   45574.072  0.0372612157
     2  2012030102  55.3142857142857  0.20250974   87306.391  0.0483153769
     3  2012030103  56.0380952380952  0.17711558   56806.827  0.0349567088
     4  2012030104  55.9047619047619  0.16466350  104767.669  0.0307528346
    25  2012030201  67.1047619047619  0.14349774   72755.326  0.0300378129
    26  2012030202  71.6571428571429  0.11373430   72755.326  0.0320594776
    27  2012030203   73.352380952381  0.13893530   72755.326  0.0311135434
    28  2012030204  70.2095238095238  0.12642303   29594.037  0.0281416079

In my data frame, the variable time contains hourly values from 1 March 2012 to 7 March 2012 in numeric form. For example, 1 March 2012, 1:00 a.m. is written as 2012030101, and so on.
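A minimal sketch of how values in this format could be decoded, assuming the layout is always YYYYMMDDHH. The sample vector below is hypothetical and only illustrates the arithmetic:

```r
# Hypothetical sample values in the YYYYMMDDHH format described above
time <- c(2012030101, 2012030102, 2012030201, 2012030724)

# The last two digits are the hour of the day
hour <- time %% 100

# The leading eight digits are the calendar date
day <- as.Date(as.character(time %/% 100), format = "%Y%m%d")

print(hour)
print(day)
```

Integer arithmetic is used here instead of substr() so that the result is numeric straight away; either approach should work for values of this size.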


From this dataset I want to create (24*11) subset data frames, like the table below:



For example, for 1 a.m. (2012030101, 2012030201, ..., 2012030701) and avg.PM10 < 10, I want one data frame. You will probably find that some of these data frames contain no observations. That is fine, because I will be working with a very large data set.


I can do this subsetting manually by writing (24*11) = 264 lines of code like this:



par_1am_0to10    <- subset(table_par, times == 1 & avg.PM10 <= 10)
par_1am_10to20   <- subset(table_par, times == 1 & avg.PM10 > 10 & avg.PM10 <= 20)
par_1am_20to30   <- subset(table_par, times == 1 & avg.PM10 > 20 & avg.PM10 <= 30)
par_24pm_80to90  <- subset(table_par, times == 24 & avg.PM10 > 80 & avg.PM10 <= 90)
par_24pm_90to100 <- subset(table_par, times == 24 & avg.PM10 > 90 & avg.PM10 <= 100)
par_24pm_100up   <- subset(table_par, times == 24 & avg.PM10 > 100)

But I understand this code is very inefficient. Is there any way to do it efficiently by using a loop?


FYI: In the future, I want to draw some plots using these (24*11) datasets.


Update: After this subsetting, I want to draw boxplots of range for every dataset. The problem is that I want to show all (24*11) boxplots of range [like the above figure] in one plot, like a matrix! If you have any further questions, please let me know. Thanks a lot in advance.


2 solutions



You can do this using some plyr, dplyr and tidyr magic:

# plyr is not attached because it interferes with dplyr; only its round_any
# function is needed, so it is called as plyr::round_any
library(dplyr)
library(tidyr)

# Read data
dfData <- read.csv("table_parameter.csv")

dfData %>% 
  # Extract hour and compute the rounded Avg.PM10 using round_any
  mutate(hour = as.numeric(substr(time, 9, 10)),
         roundedPM.10 = plyr::round_any(Avg.PM10, 10, floor),
         roundedPM.10 = ifelse(roundedPM.10 > 100, 100,roundedPM.10)) %>% 
  # Keep only the relevant columns
  select(hour, roundedPM.10) %>% 
  # Count the number of occurrences per hour and PM10 interval
  count(roundedPM.10, hour) %>% 
  # Use spread (from tidyr) to transform it into wide format
  spread(hour, n)

If you plan on using ggplot2, you can skip tidyr and drop the last line of the code to keep the data frame in long format; it will be easier to plot that way.


EDIT: After reading your comment, I realised I misunderstood your question. This will give you a boxplot for each combination of hour and Avg.PM10 interval:


# plyr is not attached because it interferes with dplyr; only its round_any
# function is needed, so it is called as plyr::round_any
library(dplyr)
library(ggplot2)

# Read data
dfData <- read.csv("C:/Users/pformont/Desktop/table_parameter.csv")

dfDataPlot <- dfData %>% 
  # Extract hour and compute the rounded Avg.PM10 using round_any
  mutate(hour = as.numeric(substr(time, 9, 10)),
         roundedPM.10 = plyr::round_any(Avg.PM10, 10, floor),
         roundedPM.10 = ifelse(roundedPM.10 > 100, 100,roundedPM.10)) %>% 
  # Keep only the relevant columns
  select(roundedPM.10, hour, range)

# Plot range as a function of hour (as a factor to have separate plots)
# and facet it according to roundedPM.10 on the y axis
ggplot(dfDataPlot, aes(factor(hour), range)) +
  geom_boxplot() +
  facet_grid(roundedPM.10 ~ .)



How about a double loop like this:



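# Assumed loop values (not defined in the original snippet):
# hours 1-24 and the lower bounds of the PM10 intervals
t_list <- 1:24
PM_list <- seq(0, 100, by = 10)

# create empty dataframe for output
sub.df <- data.frame(name=NA, X=NA, time=NA, Avg.PM10=NA, sill=NA, range=NA, nugget=NA)[numeric(0), ]

for (t in t_list) {
  for (PM in PM_list) {
    # upper bound of the interval; the last one (>100) is open-ended (assumption)
    PM2 <- if (PM == 100) Inf else PM + 10
    sub <- subset(table_par, times == t & Avg.PM10 > PM & Avg.PM10 <= PM2)
    if (nrow(sub) != 0) {    # to avoid errors because of empty sub
      sub$name <- paste("par_", t, "am_", PM, "to", PM2, sep = "")
      sub.df <- rbind(sub.df, sub)
    }
  }
}

sub.df # print data frame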
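If separate data frames are really needed, the loop could probably be replaced by cut() and split() from base R. The following is only a sketch under assumptions: the small table_par below is made-up data in the same layout as the question, the hour and pm_bin columns and the list name subsets are hypothetical, and the intervals are right-closed as in the question's manual subset() calls.

```r
# Hypothetical data in the same layout as the question (values are made up)
table_par <- data.frame(
  time     = c(2012030101, 2012030201, 2012030102, 2012030124),
  Avg.PM10 = c(52.3, 67.1, 8.0, 120.5),
  range    = c(45574.1, 72755.3, 87306.4, 29594.0)
)

# Hour of day from the last two digits of the time stamp
table_par$hour <- table_par$time %% 100

# Right-closed PM10 intervals: (-Inf,10], (10,20], ..., (90,100], (100,Inf]
breaks <- c(-Inf, seq(10, 100, by = 10), Inf)
labels <- c(paste0(seq(0, 90, by = 10), "to", seq(10, 100, by = 10)), "100up")
table_par$pm_bin <- cut(table_par$Avg.PM10, breaks = breaks, labels = labels)

# One group per hour/interval pair; fixing the hour levels to 1:24 keeps
# combinations that have no observations
groups <- interaction(factor(table_par$hour, levels = 1:24),
                      table_par$pm_bin, sep = "_")

# Named list of data frames; empty combinations are zero-row data frames
subsets <- split(table_par, groups)

length(subsets)
subsets[["1_50to60"]]
```

Keeping the subsets in one named list, rather than in 264 separate variables, means they can be processed with lapply() later. For plotting, the single data frame with the hour and pm_bin columns is usually the more convenient input.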


