如何避免超时错误并行加载大文件到mysql

时间:2022-07-02 13:51:39

I have to cycle through several sets of data stored separately in individual csv files - about 1500 files. Each file represents the amount of activity for a given day, and the amount of data in each file varies from basically nothing to huge.

我必须循环浏览单独存储在单个csv文件中的几组数据 - 大约1500个文件。每个文件代表给定日期的活动量,每个文件中的数据量基本上从无到有。

My process is I load the data into R, preprocess and clean, and then I insert into a local mysql table. This works fine running as a single loop.

我的过程是我将数据加载到R,预处理和清理,然后我插入到本地mysql表中。这可以作为单个循环正常运行。

To increase the speed of the load, I attempted to move the loop into a parallel process, where I have several workers preprocessing separate files and then each loading them into database.

为了提高加载速度,我尝试将循环移动到并行进程中,我有几个工作程序预处理单独的文件,然后将每个文件加载到数据库中。

I run into an error when some of the files are large, and the dbwritetable call takes some time to process. This appears to put a lock on the table, not allowing the other processes to complete their writes, leading to timeout errors.

当某些文件很大时,我遇到错误,dbwritetable调用需要一些时间来处理。这似乎会锁定表,不允许其他进程完成写入,从而导致超时错误。

I am fairly new to mysql, and I am vaguely aware I could change the mysql timeout time at the database level (using conf files, I believe). Is this best/only solution? I would prefer keeping all the control at the R level, so I could run this code on different machines more easily. I was wondering if there was another approach to keeping the processes in queue and writing the loads in order as the table is available.

我对mysql相当新,我隐约知道我可以在数据库级别更改mysql超时时间(我相信使用conf文件)。这是最好/唯一的解决方案吗?我希望将所有控件保持在R级别,这样我就可以更轻松地在不同的机器上运行此代码。我想知道是否有另一种方法可以保持队列中的进程并在表可用时按顺序写入负载。

Update:

I thought this was a simple question on how to handle timeouts, but it sounds like more information is desired. Here is one example - its not complicated.

我认为这是一个关于如何处理超时的简单问题,但听起来似乎需要更多信息。这是一个例子 - 它并不复杂。

  Bluetooth.loadloop <- foreach(i=jump:LoopLength, .inorder=TRUE) %dopar% {


    Start.Time <- Sys.time()
    x <- filenames[i]  

    if(!length(readLines(x))) {
      Stats <- data.frame(
        i=i, 
        time=difftime(Sys.time(),Start.Time,units='secs'),
        DataList= 0 )

      source("/Shared_Functions/MYSQL_LOGIN.r" )
      invisible(dbWriteTable(con, name="RawData_Loadloop_Stats_Bluetooth",value=Stats, append=TRUE, row.names=FALSE))
      dbDisconnect(con)
      print( "Zero length");

      return("Zero length")
    } else {


      ### Loads Data from CSV
      print(paste(i,"Start"))
      datalist <-readLines(x,encoding="UTF-8")
      filelocation <- (paste(i,x))
      datalist.length <- length(datalist)
      print(datalist.length)


      ### Proceses the individual fields 
      deviceID <- v1 <- gsub("([0-9]+),.*", "\\1", datalist)
      Time <- v2 <- gsub("[0-9]+,([0-9]+),.*", "\\1", datalist)
      Label <- v3 <- gsub("^[0-9]+,[0-9]+,(.*),([a-zA-Z0-9:]+),([^,]+)$", "\\1", datalist)
      MacAddress <- v4 <- gsub("^[0-9]+,[0-9]+,(.*),([a-zA-Z0-9:]+),([^,]+)$", "\\2", datalist)
      Strength <- v5 <- gsub("^[0-9]+,[0-9]+,(.*),([a-zA-Z0-9:]+),([^,]+)$", "\\3", datalist)

      Label <- BlueToothFilterRules(Label)
      Encoding(Label) <- 'UTF-8'
      BlueToothTable <- data.frame(i=i, DeviceID = deviceID, Time= Time, Label= Label, Mac = MacAddress, Strength= Strength, stringsAsFactors = FALSE)

      Stats <- data.frame(
        i=i, 
        time=difftime(Sys.time(),Start.Time,units='secs'),
        DataList= datalist.length
      )

      ### Writes to the Database
      source("/Shared_Functions/MYSQL_LOGIN.r" )
      dbSendQuery(con, 'set character set "utf8"')
      invisible(dbWriteTable(con, name="RawData_Events_Bluetooth",value=BlueToothTable, append=TRUE, row.names=FALSE))
      invisible(dbWriteTable(con, name="RawData_Loadloop_Stats_Bluetooth",value=Stats, append=TRUE, row.names=FALSE))
      dbDisconnect(con)
      print(paste(i,"END"));

      return("Finished")
    }
  }

So when this is run in parallel, like I said, it bottlenecks at the write, leading to timeouts.

因此,当这个并行运行时,就像我说的那样,它在写入时会出现瓶颈,导致超时。

I am not sure how to process what user3666197 said, other then its good to be prepared for these faults and have things well engineered in handling them.

我不知道如何处理user3666197所说的内容,除此之外还有为这些故障做好准备并且在处理它们时设计得很好。

Separating the readin from the write is not a realistic option either because the resulting object will be too large to handle well.

将readin与write分离不是一个现实的选择,因为生成的对象太大而无法很好地处理。

I understand that running this in parallel will provide limited improvements, because of the bottleneck, but it will decrease the total processing time by allowing the other workers to load in and processes the raw data while one is loading everything.

我知道并行运行这将提供有限的改进,因为瓶颈,但它将通过允许其他工作人员在加载所有内容时加载和处理原始数据来减少总处理时间。

So my original question is still - how to best handle the mysql write queue so when running this in parallel it doesn't create timeout errors?

所以我的原始问题仍然是 - 如何最好地处理mysql写入队列,以便在并行运行时不会产生超时错误?

Thank you

1 个解决方案

#1


0  

Having no code in an MCVE-formulated question, let's go ahead without any code in the answer.

在MCVE制定的问题中没有代码,让我们在答案中没有任何代码。

While all the due mathematical formulations are in the tail section of this, a human-readable story might be a better approach to get the solid grasp of the principal TimeOut problem root-cause:

虽然所有应有的数学公式都在此尾部,但人类可读的故事可能是更好地掌握主要TimeOut问题根本原因的方法:

Using a parallel syntax-constructor?
Syntactically possible in many places and many fashions, but ...

The point here ( always ) is, whether it makes any sense. For answering this, one has to know a sum of all the costs of using it.

这里(总是)的一点是,它是否有意义。为了回答这个问题,我们必须知道使用它的所有成本的总和。

Costs are, as our core motivation is processing speedup, the overheads in time we need to pay, before we can enjoy to use the "accelerated" results.

在我们享受使用“加速”结果之前,成本是我们的核心动力,即处理加速,我们需要支付的时间开销。


Let's sketch up an example - a Cape Cod airport Story:

A Lord Mayor of a city decided to build a new highway, so as to allow welcome citizens to ride faster to an airport on a tip of the Cape Cod.

一个城市的市长决定建造一条新的高速公路,以便让受欢迎的公民能够更快地骑到科德角一角的机场。

This sounds logical and great. So, they started investing, 've paid hundreds of landlords for all the new plots acquired ( bought the property to raise the new road on ), stone miners to excavate all the needed rocks, millers for grinding it in the right sized stones, sands, cement and asphalt, all needed for the highway, bulldozers' operators to put the road rocks bases, concrete vendors for mixing and delivery of the cement in a Just-In-Time fashion, as the highway construction was going forwards etc, etc.

这听起来很合乎逻辑。因此,他们开始投资,为所有新购买的土地(购买房产以筹集新的道路)支付了数百名业主,石矿工挖掘所有需要的岩石,用适当大小的石头研磨它的磨坊,沙子高速公路所需的水泥和沥青,推土机的操作员将道路岩石基地,混凝*应商用于及时混合和交付水泥,公路建设正在进行中等等。

After some months or years ( depends on local economy, contractors' performance, city budget and its distance from the airport ... sure ), the highway was complete and the opening ceremony took place.

经过几个月或几年(取决于当地经济,承包商的表现,城市预算和与机场的距离......确定),高速公路已经完工并举行了开幕式。

Two things were sure at this moment -- The Cost -- how much did it take to build such Highway in money, plus -- The Latency -- how long did it take in time, before a first car could ride from city as far as to the airport.

目前有两件事情是肯定的 - 成本 - 用钱建造这样的高速公路需要多少钱,加上 - 潜伏期 - 在第一辆车从城市出发前需要多长时间才能到达至于机场。

The last thing is -- The net Speedup achieved -- so how faster is this new, 100-parallel lane wide Highway finally, if compared to the old and for ages used, lovely countryside road.

最后一点是 - 净速度达到了 - 所以这个新的,100平行的车道宽高速公路终于有多快,如果与旧的和多年来使用的,可爱的乡村道路相比。

The Result?

A known performance baseline first:

首先是已知的性能基准:

The cars A, B, C, D, ... X, Y, Z all have started its way to the airport about some months ( years ) ago, drove comfortably at a speed of ~40 mph and have all got to the airport in about one hour, one after another, in a pure [SERIAL] manner. Nothing sexy, but a solid, repeatable result, as next time, these will enjoy the almost same result again.

汽车A,B,C,D,...... X,Y,Z都已经开始大约几个月(几年)前往机场,以约40英里/小时的速度舒适地开着,都到了机场大约一个小时,一个又一个,以纯粹的[SERIAL]方式。没有什么性感,但是坚实,可重复的结果,下次,这些将再次享受几乎相同的结果。

And new parallel-riders, to compare with:

和新的并行车手相比:

The cars A1, A2, A3, ... A100, B101, ... B200, ... Z2600 all have the same goal, to arrive at the airport as fast as possible, yet had to wait the few months ( years ) for the new highway to get built ( an obvious processing setup latency ), next it took about a few minutes to enjoy the super-fast ride on this new highway, having Germany-like no speed limit, so 200 mph? 300 mph? No problem.

汽车A1,A2,A3,... A100,B101,... B200,...... Z2600都有相同的目标,尽快到达机场,但不得不等待几个月(年)为了建造新的高速公路(一个明显的处理设置延迟),接下来需要几分钟才能在这条新高速公路上享受超快速的行程,德国没有速度限制,所以200英里每小时? 300英里每小时?没问题。

Until, a thrilling 3-minute ultra-fast experience of driving cars 300 mph on a 100-parallel-lanes wide New Highway has reached the coast-line and next, each and every car had to enter in a queue and wait for the next Ferry-boat arrival ( having a capacity of just about a 30 cars, serving also all the traditional, local trafic ).

直到,在100平行车道宽的新高速公路上以300英里/小时的速度驾驶汽车的惊人的3分钟超快速体验已经到达海岸线,接下来,每辆车都必须排队等待下一个渡船抵达(可容纳约30辆汽车,也提供所有传统的本地交通)。

Given a Ferry line has a service RTT ( Round Trip Time - loading first 30 cars, going to Cape Cod, unloading cars and loading those who go back, going from Cape Cod back to the continent port ) about some 40 minutes, the game is almost over.

鉴于渡轮线路有一个服务RTT(往返时间 - 装载前30辆车,前往科德角,卸载汽车并装载那些返回,从科德角返回大陆港口)约40分钟,游戏是即将结束。

Results:
- the whole group of A:Z has made the task in T0 +01:00:00 time
- the first subset of <=30-parallel-riders has made the same in T0 +8760:23:00
- the next subset of <=30-parallel-riders has made the same in T0 +8760:43:00
- the next subset of <=30-parallel-riders has made the same in T0 +8761:03:00
- the next subset of <=30-parallel-riders has made the same in T0 +8761:23:00
- the next subset of <=30-parallel-riders has made the same in T0 +8761:43:00
- the next subset of <=30-parallel-riders has made the same in T0 +8762:03:00
...
- the last subset of <=30-parallel-riders has made the same in T0 +8788:43:00

结果: - 整个A:Z组在T0 +01:00:00时间完成任务 - <= 30-parallel-riders的第一个子集在T0 +8760:23:00 - 下一个<= 30-parallel-riders的子集在T0 +8760:43:00中相同 - <= 30-parallel-riders的下一个子集在T0 +8761:03:00 - 下一个子集<= 30-parallel-riders在T0 +8761:23:00中也是如此 - <= 30-parallel-riders的下一个子集在T0 +8761:43:00中是相同的 - <=的下一个子集30-parallel-riders在T0 +8762:03:00中也做了同样的事情 - <= 30-parallel-riders的最后一个子集在T0 +8788中也是如此:43:00

all this on conditions given:
- no local traffic ever appeared during these about 29 hours
- the first Ferry was both empty and ready to start loading the first set ~ 30 cars in port right upon theirs arrival
- no Ferry-boat service interruptions or time-table service irregularities ever appeared

在所给出的条件下所有这些: - 在这29小时内没有出现过当地交通 - 第一艘渡轮都是空的,准备开始在他们到达时装载第一套约30辆汽车 - 没有渡船服务中断或时间 - 有条不紊的服务违规行为


The End of a Story:

No "better" result is possible in real-world, no better result is possible in resources-constrained parallel-computing-graph execution on real hardware.

在现实世界中没有“更好”的结果是可能的,在真实硬件上的资源受限的并行计算图形执行中不可能有更好的结果。

Epilogue:

Having a single "just"-[CONCURRENT] element in an otherwise true-[PARALLEL] setup could be this expensive - here, fortunately all the costs of ~ x.000.000.000 USD and ~ 8760+ hours were magically just " free of charge ",
but
the real computing is never so forgiving
and the code will always pay them all, always ( design-wise, code-execution-wise ), so a due engineering efforts and proper design care is always in place not to get any << 1 speedup suprises.

在一个真正的[[PARALLEL]设置中只有一个“just” - [CONCURRENT]元素可能会很昂贵 - 幸运的是,所有成本~x.000.000.000 USD和~8760 +小时的神奇只是“免费”收费“,但真正的计算永远不会那么宽容,而且代码总是会付出所有代价,总是(设计方面,代码执行方面),所以应该做出应有的工程努力和适当的设计护理,而不是<< 1加速惊喜。

#1


0  

Having no code in an MCVE-formulated question, let's go ahead without any code in the answer.

在MCVE制定的问题中没有代码,让我们在答案中没有任何代码。

While all the due mathematical formulations are in the tail section of this, a human-readable story might be a better approach to get the solid grasp of the principal TimeOut problem root-cause:

虽然所有应有的数学公式都在此尾部,但人类可读的故事可能是更好地掌握主要TimeOut问题根本原因的方法:

Using a parallel syntax-constructor?
Syntactically possible in many places and many fashions, but ...

The point here ( always ) is, whether it makes any sense. For answering this, one has to know a sum of all the costs of using it.

这里(总是)的一点是,它是否有意义。为了回答这个问题,我们必须知道使用它的所有成本的总和。

Costs are, as our core motivation is processing speedup, the overheads in time we need to pay, before we can enjoy to use the "accelerated" results.

在我们享受使用“加速”结果之前,成本是我们的核心动力,即处理加速,我们需要支付的时间开销。


Let's sketch up an example - a Cape Cod airport Story:

A Lord Mayor of a city decided to build a new highway, so as to allow welcome citizens to ride faster to an airport on a tip of the Cape Cod.

一个城市的市长决定建造一条新的高速公路,以便让受欢迎的公民能够更快地骑到科德角一角的机场。

This sounds logical and great. So, they started investing, 've paid hundreds of landlords for all the new plots acquired ( bought the property to raise the new road on ), stone miners to excavate all the needed rocks, millers for grinding it in the right sized stones, sands, cement and asphalt, all needed for the highway, bulldozers' operators to put the road rocks bases, concrete vendors for mixing and delivery of the cement in a Just-In-Time fashion, as the highway construction was going forwards etc, etc.

这听起来很合乎逻辑。因此,他们开始投资,为所有新购买的土地(购买房产以筹集新的道路)支付了数百名业主,石矿工挖掘所有需要的岩石,用适当大小的石头研磨它的磨坊,沙子高速公路所需的水泥和沥青,推土机的操作员将道路岩石基地,混凝*应商用于及时混合和交付水泥,公路建设正在进行中等等。

After some months or years ( depends on local economy, contractors' performance, city budget and its distance from the airport ... sure ), the highway was complete and the opening ceremony took place.

经过几个月或几年(取决于当地经济,承包商的表现,城市预算和与机场的距离......确定),高速公路已经完工并举行了开幕式。

Two things were sure at this moment -- The Cost -- how much did it take to build such Highway in money, plus -- The Latency -- how long did it take in time, before a first car could ride from city as far as to the airport.

目前有两件事情是肯定的 - 成本 - 用钱建造这样的高速公路需要多少钱,加上 - 潜伏期 - 在第一辆车从城市出发前需要多长时间才能到达至于机场。

The last thing is -- The net Speedup achieved -- so how faster is this new, 100-parallel lane wide Highway finally, if compared to the old and for ages used, lovely countryside road.

最后一点是 - 净速度达到了 - 所以这个新的,100平行的车道宽高速公路终于有多快,如果与旧的和多年来使用的,可爱的乡村道路相比。

The Result?

A known performance baseline first:

首先是已知的性能基准:

The cars A, B, C, D, ... X, Y, Z all have started its way to the airport about some months ( years ) ago, drove comfortably at a speed of ~40 mph and have all got to the airport in about one hour, one after another, in a pure [SERIAL] manner. Nothing sexy, but a solid, repeatable result, as next time, these will enjoy the almost same result again.

汽车A,B,C,D,...... X,Y,Z都已经开始大约几个月(几年)前往机场,以约40英里/小时的速度舒适地开着,都到了机场大约一个小时,一个又一个,以纯粹的[SERIAL]方式。没有什么性感,但是坚实,可重复的结果,下次,这些将再次享受几乎相同的结果。

And new parallel-riders, to compare with:

和新的并行车手相比:

The cars A1, A2, A3, ... A100, B101, ... B200, ... Z2600 all have the same goal, to arrive at the airport as fast as possible, yet had to wait the few months ( years ) for the new highway to get built ( an obvious processing setup latency ), next it took about a few minutes to enjoy the super-fast ride on this new highway, having Germany-like no speed limit, so 200 mph? 300 mph? No problem.

汽车A1,A2,A3,... A100,B101,... B200,...... Z2600都有相同的目标,尽快到达机场,但不得不等待几个月(年)为了建造新的高速公路(一个明显的处理设置延迟),接下来需要几分钟才能在这条新高速公路上享受超快速的行程,德国没有速度限制,所以200英里每小时? 300英里每小时?没问题。

Until, a thrilling 3-minute ultra-fast experience of driving cars 300 mph on a 100-parallel-lanes wide New Highway has reached the coast-line and next, each and every car had to enter in a queue and wait for the next Ferry-boat arrival ( having a capacity of just about a 30 cars, serving also all the traditional, local trafic ).

直到,在100平行车道宽的新高速公路上以300英里/小时的速度驾驶汽车的惊人的3分钟超快速体验已经到达海岸线,接下来,每辆车都必须排队等待下一个渡船抵达(可容纳约30辆汽车,也提供所有传统的本地交通)。

Given a Ferry line has a service RTT ( Round Trip Time - loading first 30 cars, going to Cape Cod, unloading cars and loading those who go back, going from Cape Cod back to the continent port ) about some 40 minutes, the game is almost over.

鉴于渡轮线路有一个服务RTT(往返时间 - 装载前30辆车,前往科德角,卸载汽车并装载那些返回,从科德角返回大陆港口)约40分钟,游戏是即将结束。

Results:
- the whole group of A:Z has made the task in T0 +01:00:00 time
- the first subset of <=30-parallel-riders has made the same in T0 +8760:23:00
- the next subset of <=30-parallel-riders has made the same in T0 +8760:43:00
- the next subset of <=30-parallel-riders has made the same in T0 +8761:03:00
- the next subset of <=30-parallel-riders has made the same in T0 +8761:23:00
- the next subset of <=30-parallel-riders has made the same in T0 +8761:43:00
- the next subset of <=30-parallel-riders has made the same in T0 +8762:03:00
...
- the last subset of <=30-parallel-riders has made the same in T0 +8788:43:00

结果: - 整个A:Z组在T0 +01:00:00时间完成任务 - <= 30-parallel-riders的第一个子集在T0 +8760:23:00 - 下一个<= 30-parallel-riders的子集在T0 +8760:43:00中相同 - <= 30-parallel-riders的下一个子集在T0 +8761:03:00 - 下一个子集<= 30-parallel-riders在T0 +8761:23:00中也是如此 - <= 30-parallel-riders的下一个子集在T0 +8761:43:00中是相同的 - <=的下一个子集30-parallel-riders在T0 +8762:03:00中也做了同样的事情 - <= 30-parallel-riders的最后一个子集在T0 +8788中也是如此:43:00

all this on conditions given:
- no local traffic ever appeared during these about 29 hours
- the first Ferry was both empty and ready to start loading the first set ~ 30 cars in port right upon theirs arrival
- no Ferry-boat service interruptions or time-table service irregularities ever appeared

在所给出的条件下所有这些: - 在这29小时内没有出现过当地交通 - 第一艘渡轮都是空的,准备开始在他们到达时装载第一套约30辆汽车 - 没有渡船服务中断或时间 - 有条不紊的服务违规行为


The End of a Story:

No "better" result is possible in real-world, no better result is possible in resources-constrained parallel-computing-graph execution on real hardware.

在现实世界中没有“更好”的结果是可能的,在真实硬件上的资源受限的并行计算图形执行中不可能有更好的结果。

Epilogue:

Having a single "just"-[CONCURRENT] element in an otherwise true-[PARALLEL] setup could be this expensive - here, fortunately all the costs of ~ x.000.000.000 USD and ~ 8760+ hours were magically just " free of charge ",
but
the real computing is never so forgiving
and the code will always pay them all, always ( design-wise, code-execution-wise ), so a due engineering efforts and proper design care is always in place not to get any << 1 speedup suprises.

在一个真正的[[PARALLEL]设置中只有一个“just” - [CONCURRENT]元素可能会很昂贵 - 幸运的是,所有成本~x.000.000.000 USD和~8760 +小时的神奇只是“免费”收费“,但真正的计算永远不会那么宽容,而且代码总是会付出所有代价,总是(设计方面,代码执行方面),所以应该做出应有的工程努力和适当的设计护理,而不是<< 1加速惊喜。