Patterns for a back-off mechanism in a client-server system

Time: 2022-01-14 21:28:34

I have a system which needs to send a request to an external system whenever a user does a search on my system.

If the external system is down or is taking an unusually long time to answer, I would like my system to "back off" for a while. Instead of trying to make more requests to the external system, I want to just let the user of my system know immediately that we will not be processing their requests at the moment.

This would result in a better experience for the user (who doesn't have to wait for timeouts), less resource usage in my system (threads won't be stuck waiting for responses that never arrive, or for timeouts, from the external system), and it would spare the external system, which in that situation is probably already struggling with load.

After some time, or when my system discovers that the external system is responding again, I would like to resume normal behaviour.

Are there any patterns or standard ways of doing this kind of thing? Specifically, a mechanism for keeping track of timed-out/long-running requests, and some sort of control mechanism for deciding when we should start trying again.

1 Answer

#1


I don't remember seeing this described in the literature, but the pattern I've noticed for such tasks centers on a "scheduling queue" -- a way to make various things happen (i.e., get functions or methods called back) at certain times unless previously canceled (e.g. Python's sched standard library module). When you send an (async) request to the back-end, you also schedule a timeout event for X seconds from now; either the request object knows the ID of the scheduled timeout (so it can cancel it if the request is satisfied before then), or a set of pending requests is maintained as well (so the timeout knows when it's not really needed) -- which is a good idea anyway, as it makes handling "timeouts that really mean it" easier; see below.

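To make the scheduling-queue idea concrete, here is a minimal sketch along those lines using Python's sched module. send_to_backend() and notify_client() are hypothetical stand-ins for the real async call to the external system and for however you answer the user, and TIMEOUT_SECS plays the role of "X seconds"; the timeout and retry handlers follow in the next two snippets.

```python
import sched
import time

TIMEOUT_SECS = 5   # "X seconds from now"

scheduler = sched.scheduler(time.monotonic, time.sleep)
pending = {}       # request_id -> (timeout event, payload)


def send_to_backend(request_id, payload):
    """Hypothetical: fire the real (async) request to the external system."""


def notify_client(request_id, message):
    """Hypothetical: push a result or status message back to the waiting user."""


def send_request(request_id, payload):
    """Send the async request and schedule a matching timeout event."""
    send_to_backend(request_id, payload)
    # on_timeout is defined in the next snippet.
    event = scheduler.enter(TIMEOUT_SECS, 1, on_timeout, (request_id,))
    pending[request_id] = (event, payload)   # remember it so we can cancel it


def on_response(request_id, response):
    """The backend answered in time: cancel the timeout and answer the user."""
    entry = pending.pop(request_id, None)
    if entry is not None:
        event, _payload = entry
        try:
            scheduler.cancel(event)
        except ValueError:
            pass   # the timeout already fired; the back-off path owns it now
    notify_client(request_id, response)
```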

When a timeout does occur, it schedules a retry for Y seconds in the future, moves all pending requests from the pending container over to a container of requests to be retried later (cancelling all the other timeouts, if that's how the system is set up), and also sends the notification "backend is slow, we'll retry in Y seconds" to all waiting clients.

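Continuing the same sketch, a timeout handler along these lines could park everything still pending, cancel the remaining timeouts, schedule a single retry event, and tell the waiting clients; RETRY_SECS stands in for "Y seconds", and the other names carry over from the previous snippet.

```python
RETRY_SECS = 30    # "Y seconds in the future"

to_retry = {}      # request_id -> payload, parked until the retry event fires
suspended = False  # True while we are backing off


def on_timeout(timed_out_id):
    """A request timed out: back off, park all pending work, notify clients."""
    global suspended
    suspended = True
    for request_id, (event, payload) in list(pending.items()):
        if request_id != timed_out_id:       # our own timeout has already fired
            try:
                scheduler.cancel(event)
            except ValueError:
                pass                         # raced with another fired timeout
        to_retry[request_id] = payload
        notify_client(request_id,
                      "backend is slow, we'll retry in %d seconds" % RETRY_SECS)
    pending.clear()
    # One retry event for the whole batch; on_retry is defined in the next snippet.
    scheduler.enter(RETRY_SECS, 1, on_retry)
```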

When a retry event occurs, the parked requests are sent again, and so on. If new requests arrive while the system is suspended, they go right into the "to be retried" bin.

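Finally, a retry handler plus an entry point that routes new requests into the "to be retried" bin while the system is suspended might look roughly like this, still using the same hypothetical names:

```python
def on_retry():
    """The back-off period is over: resume and replay the parked requests."""
    global suspended
    suspended = False
    parked = dict(to_retry)
    to_retry.clear()
    for request_id, payload in parked.items():
        send_request(request_id, payload)   # each replay gets a fresh timeout


def handle_user_search(request_id, payload):
    """Entry point for new searches; while suspended, park them immediately."""
    if suspended:
        to_retry[request_id] = payload
        notify_client(request_id, "backend is slow, your request is queued for retry")
    else:
        send_request(request_id, payload)


# Something (a worker thread or the main loop) has to drive the scheduled events:
# scheduler.run()
```

If the backend is still down when the parked requests are replayed, one of them will time out again and the whole batch goes back into the bin, which gives the repeated back-off behaviour described above.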

While I can't find this pattern described anywhere, if it is documented it's probably in Schmidt's excellent book... highly recommended reading anyway!-)
