Case Study — An Online Retailer
An online retailer provides block storage to its key applications using storage arrays connected via Fibre Channel fabrics. During the peak shopping season, an FC port that provides connectivity to its tier-1 application started alerting on Invalid Transmission Words. Following their operational practices, the team investigated the issue but was not allowed to make corrective changes, such as replacing the SFP or cables, because of a no-change policy enforced during the peak season. Hence, they moved the application to another host with healthy links. This change was expensive: instead of one host, they had to provision (and were billed for) two hosts with enough resources to run the tier-1 application, while one host remained unused due to bit errors on its link.
The online retailer uses a robust monitoring infrastructure. They wanted to know whether they could have done something better to prevent this expensive change during the peak business days. For example, had they known about this condition earlier, they could have replaced the faulty component that caused the bit errors before the no-change period began, and therefore could have avoided being billed for two hosts instead of one.
Observations
The operations team of the online retailer monitors all network ports continuously. Their monitoring infrastructure periodically collects counters and plots them on graphs. The FC port under investigation was no different.
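As a rough illustration of what such periodic collection involves, the following sketch turns polled counter readings into per-interval increments suitable for plotting. It is not the retailer's tooling. It assumes the port counters are cumulative (they only grow until cleared), and the function name `counter_deltas` and the sample format are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[int, int]  # (timestamp in seconds, cumulative counter value)


def counter_deltas(samples: List[Sample]) -> List[Sample]:
    """Convert cumulative counter samples into per-interval increments.

    Assumption: samples are sorted by timestamp and the counter is
    cumulative. If a reading is lower than the previous one, we assume
    the counter was cleared and count the new value from zero.
    """
    deltas: List[Sample] = []
    for (_, prev_count), (cur_time, cur_count) in zip(samples, samples[1:]):
        if cur_count >= prev_count:
            delta = cur_count - prev_count
        else:
            # Counter went backwards: treat as a clear/reset (assumption).
            delta = cur_count
        deltas.append((cur_time, delta))
    return deltas
```

Plotting the increments rather than the raw cumulative values makes a change in the error rate visible as a change in the height of the graph, instead of a subtle change in slope.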
Figures 2-33 to 2-36 show the FEC corrected blocks, FEC uncorrected blocks, Invalid Transmission Words, and CRC errors on the investigated port from the end of October until mid-December.