Citymobil: a manual for improving availability amid startup business growth. Part 3

This is the next article in the series describing how we're improving service availability at Citymobil (you can read the previous parts here and here). In the upcoming parts, I'll talk about accidents and outages in detail. But first let me cover something I should have discussed in the first article but didn't; I found out about it from my readers' feedback. This article gives me a chance to fix that annoying omission.

1. Prologue

One reader asked me a very fair question: «What's so complicated about the backend of a ride-hailing service?» That's a good question. Last summer, I asked myself that very question before starting to work at Citymobil. I was thinking: «it's just a taxi service with a three-button app». How hard could that be? It turned out to be a very high-tech product. To clarify what I'm talking about and what a huge technological undertaking it is, I'm going to tell you about a few product directions at Citymobil:

  • Pricing. Our pricing team deals with the problem of finding the best ride price at every point and at every moment in time. The price is determined by a prediction of the supply and demand balance based on statistics and some other data. It's all done by a complicated, constantly evolving service based on machine learning. The pricing team also handles the implementation of various payment methods, extra charges upon completion of a trip, chargebacks, billing, and interaction with partners and drivers.

  • Orders dispatching. Which car completes the client's order? For example, choosing the closest vehicle isn't the best option in terms of maximizing the number of trips. A better option is to match cars and clients so as to maximize the number of trips, taking into account the probability of this specific client cancelling their order under these specific circumstances (because the wait is too long) and the probability of this specific driver cancelling or sabotaging the order (e.g. because the distance is too big or the price is too small); see the sketch after this list.

  • Geo. Everything related to address search and suggestions, pickup points, adjustment of the estimated time of arrival (our map supply partners don't always provide us with accurate ETA information that accounts for traffic), improving the accuracy of direct and reverse geocoding, and improving the accuracy of the car arrival point. There's lots of data, lots of analytics, and lots of machine-learning-based services.

  • Antifraud. The difference between the trip cost for a passenger and for a driver (for instance, on short trips) creates an economic incentive for intruders trying to steal our money. Dealing with fraud is somewhat similar to dealing with email spam: both precision and recall are very important. We need to block the maximum number of fraudulent orders (recall), but at the same time we can't mistake good users for fraudsters (precision).

  • Driver incentives. This team oversees the development of everything that can increase drivers' usage of our platform and their loyalty through different kinds of incentives. For example, complete X trips and get extra Y money. Or buy a shift for Z and drive around without commission.

  • Driver app backend. The list of orders, the demand map (it shows a driver where to go to maximize her profits), status changes, the system of communication with the drivers, and lots of other stuff.

  • Client app backend (this is probably the most obvious part and what people usually call «the taxi backend»): order placement, information on order status, showing the movement of the little cars on the map, the tips backend, etc.

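To make the dispatching objective concrete, here is a deliberately simplified sketch of the idea (the cancellation-probability models, the straight-line distance and the brute-force search are illustrative placeholders, not Citymobil's production algorithm):

```python
from itertools import permutations
from math import hypot

def pickup_km(order, driver):
    # Crude straight-line distance in km; a real system would use routing with traffic.
    return 111 * hypot(order["lat"] - driver["lat"], order["lon"] - driver["lon"])

def p_client_cancel(order, driver):
    # Hypothetical model: cancellation risk grows with the expected wait (~2 min per km here).
    return min(0.9, 0.05 * 2 * pickup_km(order, driver))

def p_driver_cancel(order, driver):
    # Hypothetical model: risk grows with pickup distance and falls as the price grows.
    return min(0.9, 0.3 * pickup_km(order, driver) / max(order["price"], 1.0))

def expected_trips(assignment):
    """Expected number of completed trips for a list of (order, driver) pairs."""
    return sum((1 - p_client_cancel(o, d)) * (1 - p_driver_cancel(o, d))
               for o, d in assignment)

def best_assignment(orders, drivers):
    """Brute-force search over driver permutations, maximizing expected completed trips.
    Production dispatching needs a scalable assignment algorithm; this only shows the objective."""
    return max((list(zip(orders, perm)) for perm in permutations(drivers, len(orders))),
               key=expected_trips)
```

Compared with «send the nearest car», such an objective will sometimes pick a slightly more distant driver if that pairing is less likely to fall apart.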
This is just the tip of the iceberg. There’s much more functionality. There’s a huge underwater part of the iceberg behind what seems to be a pretty simple interface.

And now let's go back to accidents. Six months of logging our accident history resulted in the following classification:

  • bad release: 500 internal server errors;
  • bad release: database overload;
  • unfortunate manual interaction with the live system;
  • Easter eggs;
  • external reasons;
  • bad release: broken functionality.

Below I'll go into detail about the conclusions we've drawn regarding our most common accident types.

2. Bad release: 500 internal server errors

Our backend is mostly written in PHP, a weakly typed interpreted language. We would sometimes release code that crashed because of an error in a class or function name. And that's just one example of how a 500 error occurs. It can also be caused by a logical error in the code; the wrong branch was released; the folder with the code was deleted by mistake; temporary artifacts needed for testing were left in the code; the table structure wasn't altered to match the code; necessary cron scripts weren't restarted or stopped.

We addressed this issue gradually, in stages. The trips lost due to a bad release are obviously proportional to the time it spends in production. Therefore, we should do our best to minimize the time a bad release stays in production. Any change in the development process that reduces the average lifetime of a bad release even by 1 second is good for business and must be implemented.

A bad release, and in fact any accident in production, has two states that we named «the passive stage» and «the active stage». During the passive stage we aren't aware of the accident yet. The active stage means we already know about it. An accident starts in the passive stage; in time it goes into the active stage, which is when we find out about it and start to address it: first we diagnose it and then we fix it.

To reduce the duration of any outage, we need to reduce the duration of both the active and the passive stage. The same goes for a bad release, since it's considered a kind of outage.

We started by analyzing our history of troubleshooting outages. Bad releases that we experienced when we had just started analyzing accidents caused an average of 20-25 minutes of downtime (complete or partial). The passive stage would usually take 15 minutes, and the active one about 10 minutes. During the passive stage we'd receive user complaints that were processed by our call center; after some specific threshold the call center would complain in a Slack chat. Sometimes one of our colleagues would complain about not being able to get a taxi. A colleague's complaint would signal a serious problem. After a bad release entered the active stage, we'd begin diagnosing the problem, analyzing recent releases and various graphs and logs in order to find the cause of the accident. Upon determining the cause, we'd roll back if the bad release was the latest one, or we'd perform a new deployment with the offending commit reverted.

This is the bad release handling process we set out to improve.

Passive stage: 15 minutes.

Active stage: 10 minutes.

3. Passive stage reduction

First of all, we noticed that if a bad release was accompanied by 500 errors, we could tell that a problem had occurred even without users' complaints. Luckily, all 500 errors were logged in New Relic (one of the monitoring systems we use), and all we had to do was add SMS and IVR notifications for when the number of 500 errors exceeded a specific threshold. That threshold was continuously lowered as time went on.

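Purely as an illustration of the alerting logic (the real check is configured in the monitoring setup itself; fetch_500_count_last_minute, send_sms and start_ivr_call are hypothetical stubs for the New Relic query and our notification gateways):

```python
import time

THRESHOLD = 50  # illustrative starting value; in practice it was lowered step by step

def fetch_500_count_last_minute() -> int:
    return 0  # stub: query the monitoring system (New Relic in our case) here

def send_sms(message: str) -> None:
    print("SMS:", message)  # stub for the SMS gateway

def start_ivr_call(message: str) -> None:
    print("IVR:", message)  # stub for the IVR call

def watch_500s() -> None:
    while True:
        count = fetch_500_count_last_minute()
        if count > THRESHOLD:
            send_sms(f"{count} HTTP 500s in the last minute: possible bad release")
            start_ivr_call("The backend is returning 500 errors, check the latest release")
        time.sleep(60)
```
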
In the event of an accident, the process would look like this:

  1. An engineer deploys a release.
  2. The release causes an accident (a massive number of 500s).
  3. A text message arrives.
  4. Engineers and devops start looking into it. Sometimes not right away but within 2-3 minutes: the text message could be delayed, the phone might be muted; and of course, the habit of reacting immediately to this message can't be formed overnight.
  5. The active stage of the accident begins and lasts the same 10 minutes as before.

As a result, the active stage of a «bad release: 500 internal server errors» accident would begin 3 minutes after the release. Therefore, the passive stage was reduced from 15 minutes to 3.

Result:

Passive stage: 3 minutes.

Active stage: 10 minutes.

4. Further reduction of the passive stage

Even though the passive stage had been reduced to 3 minutes, it still bothered us more than the active one: during the active stage we were at least doing something to fix the problem, while during the passive stage the service was totally or partially down and we were absolutely clueless.

To further reduce the passive stage, we decided to sacrifice 3 minutes of our engineers' time after each release. The idea was very simple: we'd deploy code and for three minutes afterwards we'd look for 500 errors in New Relic, Sentry and Kibana. As soon as we saw an issue there, we'd assume it was code related and begin troubleshooting.

We chose this three-minute period based on statistics: sometimes the issues appeared on the graphs within 1-2 minutes, but never later than 3 minutes.

This rule was added to our do's and don'ts. At first, it wasn't always followed, but over time our engineers got used to it the way they're used to basic hygiene: brushing your teeth in the morning also takes some time, but it's still necessary.

As a result, the passive stage was reduced to 1 minute (the graphs were still sometimes lagging). As a nice bonus, it also reduced the active stage: now an engineer would face the problem prepared and be ready to roll her code back right away. Even so, it didn't always help, since the problem could've been caused by a release deployed simultaneously by somebody else. That said, the active stage was reduced to five minutes on average.

Result:

Passive stage: 1 minute.

Active stage: 5 minutes.

5. Further reduction of the active stage

We were more or less satisfied with the 1-minute passive stage and started thinking about how to further reduce the active stage. First of all we focused on the history of outages (it happens to be a cornerstone of building our availability!) and found out that in most cases we don't roll a release back right away because we don't know which version to roll back to: there are many parallel releases. To solve this problem we introduced the following rule (and wrote it down in the do's and don'ts): right before a release, you should notify everyone in a Slack chat about what you're about to deploy and why; in case of an accident, you should write: «Accident, don't deploy!» We also started notifying those who don't read the chat about releases via SMS.

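The announcement itself is easy to bake into the release script. Here is a minimal sketch using a Slack incoming webhook; the environment variable name and the message format are hypothetical choices for this example:

```python
import os
import requests

# Hypothetical variable holding the incoming-webhook URL of the releases channel.
SLACK_WEBHOOK_URL = os.environ["RELEASES_SLACK_WEBHOOK"]

def announce_release(author: str, branch: str, reason: str) -> None:
    """Post the «about to deploy» message so nobody releases in parallel blindly."""
    text = f"{author} is deploying {branch}: {reason}"
    response = requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=5)
    response.raise_for_status()
```

The same webhook can carry the «Accident, don't deploy!» message.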
This simple rule drastically lowered the number of releases during ongoing accidents, decreased the duration of troubleshooting, and reduced the active stage from 5 minutes to 3.

Result:

Passive stage: 1 minute.

Active stage: 3 minutes.

6. Even bigger reduction of the active stage

Despite the fact that we posted warnings in the chat about all releases and accidents, race conditions still sometimes occurred: someone posted about a release while another engineer was deploying at that very moment; or an accident occurred and we wrote about it in the chat, but someone had just deployed her code. Such circumstances prolonged troubleshooting. In order to solve this issue, we implemented an automatic ban on parallel releases. It was a very simple idea: for 5 minutes after every release, the CI/CD system forbids another deployment for anyone but the latest release's author (so that she can roll back or deploy a hotfix if needed) and several well-experienced developers (in case of emergency). On top of that, the CI/CD system prevents deployments during accidents (that is, from the moment the notification about the accident's beginning arrives until the notification about its ending arrives).

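A sketch of the kind of gate the CI/CD pipeline can apply before each deployment; the names and the way the lock state is stored are hypothetical, the rules simply mirror the description above:

```python
import time

RELEASE_COOLDOWN_SEC = 5 * 60
EMERGENCY_DEPLOYERS = {"alice", "bob"}  # hypothetical list of well-experienced developers

def may_deploy(user: str,
               last_release_author: str,
               last_release_ts: float,
               accident_in_progress: bool) -> bool:
    """Decide whether the CI/CD system should allow this deployment right now."""
    if accident_in_progress:
        # Frozen from the «accident started» notification until the «accident over» one.
        return False
    if time.time() - last_release_ts < RELEASE_COOLDOWN_SEC:
        # Within 5 minutes of the last release, only its author (for a rollback or a
        # hotfix) and a few experienced developers may deploy.
        return user == last_release_author or user in EMERGENCY_DEPLOYERS
    return True
```
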
So, our process started looking like this: an engineer deploys a release, monitors the graphs for three minutes, and after that no one can deploy anything for another two minutes. If a problem occurs, the engineer rolls the release back. This rule drastically simplified troubleshooting, and the total duration of the active and passive stages was reduced from 3+1=4 minutes to 1+1=2 minutes.

But even a two-minute accident was too much. That’s why we kept working on our process optimization.

Result:

Passive stage: 1 minute.

Active stage: 1 minute.

7. Automatic accident detection and rollback

We'd been thinking for a while about how to reduce the duration of accidents caused by bad releases. We even tried forcing ourselves to stare at tail -f error_log | grep 500. But in the end, we opted for a drastic automatic solution.

In a nutshell, it's an automatic rollback. We set up a separate web server and routed 10 times less load to it via the balancer than to the rest of our web servers. Every release would be automatically deployed by the CI/CD system onto this separate server first (we called it preprod, but despite its name it received real load from real users). Then a script would run tail -f error_log | grep 500. If there were no 500 errors within a minute, CI/CD would deploy the new release in production onto the other web servers. If there were errors, the system rolled it all back. At the balancer level, any request that resulted in a 500 error on preprod would be re-sent to one of the production web servers.

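A simplified sketch of that pipeline step; the deploy, rollback and log-reading helpers are placeholders for whatever your CI/CD system provides, and the balancer-level retry of preprod 500s is not shown:

```python
import time

WATCH_SECONDS = 60  # how long the release runs on preprod before being promoted

def deploy(release: str, hosts: list) -> None:
    pass  # stub: ship the release to the given hosts

def rollback(hosts: list) -> None:
    pass  # stub: restore the previous release on the given hosts

def count_500s_on_preprod(since_ts: float) -> int:
    return 0  # stub: effectively "tail error_log | grep 500" since the deployment

def canary_release(release: str, preprod: str, production: list) -> bool:
    """Deploy to the lightly loaded preprod host first; promote only if it stays clean."""
    started = time.time()
    deploy(release, [preprod])
    time.sleep(WATCH_SECONDS)          # preprod serves ~1/10 of a normal host's traffic
    if count_500s_on_preprod(started) > 0:
        rollback([preprod])            # the bad release never reaches production
        return False
    deploy(release, production)        # clean for a minute: roll out everywhere
    return True
```
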
This measure reduced the impact of 500-error releases to zero. That said, just in case of bugs in the automation, we didn't abolish our three-minute graph watching rule. That's all about bad releases and 500 errors. Let's move on to the next type of accident.

Result:

Passive stage: 0 minutes.

Active stage: 0 minutes.



In the upcoming parts, I'm going to talk about other types of outages in Citymobil's experience and go into detail about every outage type; I'll also tell you about the conclusions we drew from those outages, how we modified the development process, and what automation we introduced. Stay tuned!

Original article: https://habr.com/en/company/mailru/blog/449708/