ACCIDENT HISTORY

NN | Start Date | Start Time | Resolve Date | Resolve Time | Issue | Downtime | Solution | Comments
1 | 09-12-2022 | - | - | - | Forced upgrade from MySQL 5.6 by AWS, because Amazon ended support for version 5.6 | 12 hours | Restored the databases from backup, then migrated to a supported MySQL version | Amazon will soon stop supporting version 5.7 as well. A migration to a newer version, preferably 8.0, needs to be planned together with the developers. Sergey Ershov is aware of the task; the decision to migrate has been postponed repeatedly.
2 | 23-12-2022 | - | - | - | DDoS attack on bloomex.com.au | 20 minutes | Identified the attacking IP address by analyzing the log at /var/log/apache2/www.bloomex.com.au-access.log, then banned the address in .htaccess or iptables | The easiest approach is a script that parses the log file and sorts clients by request count; anything above 10k requests can safely be blocked (see the sketch after incident 5). This type of DDoS either takes Apache down or quickly fills the disk, which makes the site unavailable.
3 | 11-02-2023 | - | - | - | DDoS attack on bloomex.com.au | 12 minutes | Same detection and blocking procedure as incident 2 | Same comments as incident 2.
4 | 12-02-2023 | - | - | - | DDoS attack on bloomex.com.au | 12 minutes | Same detection and blocking procedure as incident 2 | Same comments as incident 2.
5 | 13-02-2023 | - | - | - | DDoS attack on bloomex.com.au | 9 minutes | Same detection and blocking procedure as incident 2 | Same comments as incident 2.
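
A minimal sketch of the log-analysis approach described in incidents 2-5. The log path and the 10k threshold are the ones quoted above; the iptables ban is illustrative rather than the exact command used:

 # Top 20 client IPs by request count (field 1 of the combined log format).
 awk '{print $1}' /var/log/apache2/www.bloomex.com.au-access.log \
   | sort | uniq -c | sort -rn | head -n 20

 # Drop traffic from every IP that made more than 10,000 requests.
 awk '{print $1}' /var/log/apache2/www.bloomex.com.au-access.log \
   | sort | uniq -c | awk '$1 > 10000 {print $2}' \
   | while read -r ip; do iptables -A INPUT -s "$ip" -j DROP; done
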
6 | 13-02-2023 | - | - | - | Mailbot outage | 10 hours | Waited until a developer woke up and fixed autovacuum on PostgreSQL (see the sketch below) | We need a staff DBA to tune the databases hosted locally on the instance. Sergey and Dmitry are aware of the situation; I was not given the headcount.
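
The record does not say which autovacuum settings were changed. As an illustration only, autovacuum on a bloated PostgreSQL instance is typically inspected and tuned along these lines (the values are hypothetical, not the ones actually applied):

 # Which tables have the most dead tuples, and when were they last vacuumed?
 psql -U postgres -c "SELECT relname, n_dead_tup, last_autovacuum FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 10;"
 # Vacuum sooner and more often; reload settings without a restart.
 psql -U postgres -c "ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.05;"
 psql -U postgres -c "ALTER SYSTEM SET autovacuum_naptime = '30s';"
 psql -U postgres -c "SELECT pg_reload_conf();"
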
7 | 11.02.2023 | - | 14.02.2023 | - | Intermittent telephony failures: lines dropped mid-conversation; the problem was on the inbound lines | Intermittent, mostly during call spikes | The provider could not give us the channel bandwidth we needed; as a fix we decided to switch the inbound-line provider | Provider parameters need to be reviewed and we should move to other providers, for both inbound and outbound lines. Problems decreased over the holidays, but the fix has not been tested against a large volume of inbound calls.
8 | 01.04.2023 | - | 06.04.2023 | - | Several localshop sites had expired SSL certificates | 6 days | Certbot from Let's Encrypt updated its packages and changed the renewal policy for free certificates; we did not notice right away | The proposal to move to paid certificates found no support, so we need to watch the certificates, keep the certbot scripts updated, and keep an eye on cron (see the sketch below).
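
A hedged sketch of that certificate watching: a daily renewal attempt plus an explicit expiry check. The schedule, hostname, and reload hook are illustrative, not taken from the incident record:

 # /etc/cron.d entry: try renewal daily, reload Apache only when something renewed.
 0 3 * * * root certbot renew --quiet --deploy-hook "systemctl reload apache2"

 # Spot check: exit status is non-zero if the cert expires within 14 days.
 echo | openssl s_client -servername shop.example.ca -connect shop.example.ca:443 2>/dev/null \
   | openssl x509 -noout -checkend 1209600
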
9 | 11.05.2023 | - | 17.05.2023 | - | Intermittent telephony failures on the inbound lines, same as incident 7 | Intermittent, mostly during call spikes | Same as incident 7: the provider could not give us the needed bandwidth, so we decided to switch the inbound-line provider | Same as incident 7: provider parameters for both inbound and outbound lines need to be reviewed; the fix has not been tested against a large inbound call volume.
10 | 24.06.2023 | 10:15 UTC | 25.06.2023 | 18:00 UTC | Payment gateway experienced intermittent failures, causing transaction processing delays | 8 hours | Updated payment gateway configurations and restarted services to restore functionality | Implement automated monitoring for early detection of transaction failures.
11 | 15.09.2023 | - | 20.09.2023 | - | The mail server's IP address landed on a spam blacklist; we lost the mail service until it was restored | 9 hours of downtime | Moved the mail server to a new instance with new local and external addresses; reconfigured DNS and the MX records | The wecare@bloomexusa.com mailbox was broken into, presumably through the frontend of the US site. Using the stolen credentials, 9.5 million messages were sent through our mail server from that address in one day, so the address was flagged as compromised and blacklisted. The holes in the bloomexusa.com code need to be closed.
12 | 23.09.2023 | 11:00 PM | 24.09.2023 | 7:00 PM | At 13:08 Anastasia reported in chat that payments were not working | 20 hours | Payment confirmation was switched off within 2 hours; after the developers fixed the issue it was switched back on | -
13 | 03.10.2023 | - | - | - | https://api.bloomex.ca/dhlordermanager: Invalid SSL certificate (error code 526) | 35 minutes | Renewed the certificate | Certificate monitoring was handed over to the Maxes.
14 | 21.12.2023 | 10:00 AM | 21.12.2023 | 3:00 PM | web3-3 went down because of a DDoS attack, and the instance would not come back up after a reboot | 5 hours | Reattached the root disk from the backup held on another host that served as the backup machine | -
15 | 06.02.2023 | 7:00 AM | 06.02.2023 | 9:00 AM | bloomexusa.com no longer available | 3 hours | - | -
XX | 14.02.2023 | 1:00 PM | 14.02.2023 | 7:00 PM | chat.bloomex.ca: Watson bot dead | 6 hours | Cherepanov: deleting the queues and containers helped; the logs are available, I will take a look during the day and find the cause. Offhand, the supervisor hung about 5-6 hours earlier | -
16 | 06.03.2024 | 20:17 UTC | 06.03.2024 | 20:36 UTC | mail.necs.ca server down | 0 hrs 19 mins | Due to an urgent SSH access issue we were forced to edit configuration files manually, which led to a server stop. After the files were edited directly on the server drive, it booted back up immediately | The sshd_config file should NOT be modified in the future while a key-only login policy is in place (see the sketch below).
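
To make manual sshd_config edits less dangerous, the config can be syntax-checked before any restart, and an open session kept alive until a fresh login succeeds. A sketch using standard OpenSSH tooling (the unit name assumes Debian):

 # sshd -t exits non-zero and prints the error if the config is invalid.
 sshd -t -f /etc/ssh/sshd_config && systemctl reload ssh
 # Keep the current session open and confirm a NEW key-based login works
 # before closing it, so a bad change cannot lock everyone out.
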
17 | 07.03.2024 | 21:47 UTC | 07.03.2024 | 22:00 UTC | adm.bloomex.ca server down | 0 hrs 13 mins | No free space left in server storage because of extremely large log files. No Zabbix alerts fired because Zabbix had been turned off for planned maintenance | Important but large log files should not be kept on the production server; they are due to be moved to another location. Zabbix should not be stopped for extended periods of time.
18 | 11.03.2024 | 11:00 UTC | 11.03.2024 | 17:00 UTC | adm-eu.necs.ca down (status 500) (http://tasks.bloomex.ca/redmine/issues/16047); strange behavior of store orders and confirmation emails | 6 hrs 0 mins | Cron policies on the production server started recurring jobs more often than one run takes to complete, which let the MySQL process queue grow to its limit | Production cron mechanisms should be reconfigured so that a job waits for the previous run to finish before starting again whenever a run takes longer than its cycle; see the sketch below.
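
One standard way to get that behavior is util-linux flock, which refuses to start a second copy of a job while the previous run still holds the lock. A sketch; the job, schedule, and lock path are hypothetical:

 # /etc/cron.d entry: flock -n exits immediately instead of piling up
 # a second copy of the job while the first one is still running.
 */5 * * * * www-data flock -n /var/lock/order-jobs.lock /usr/local/bin/order-jobs.sh
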
19 | 18.03.2024 | 14:47 | 18.03.2024 | 18:05 | sip2.bloomex.ca: issue with the NZ inbound lines | 3 hours 13 mins | Created a new server from backup and restored the configuration | Appeared due to "deletion of old unnecessary numbers".
20 | 19.04.2024 | 12:41 | 19.04.2024 | 12:54 | localshops sites down | 13 min | Fixed the PHP socket name | The manual for creating new shops, "Create localshop (Laravel on dev3-2)", was updated.
21 | 27.04.2024 | 13:23 EDT | 27.04.2024 | 16:10 EDT | DDoS on bloomex.ca from an unidentified network | 2 hrs 47 min | Apache on the production server was restarted to drop the enormous backlog of requests; after that, Attack Mode was enabled on our Cloudflare account | The harmful network payload grew slowly and was not easily identifiable as a deliberate attack during the first hour of the incident. This is considered normal, as our production nodes can withstand SOME amount of harmful requests before significant service degradation occurs.
22 | 30.04.2024 | 13:16 EDT | 30.04.2024 | 13:45 EDT | DDoS on bloomex.com.au from an unidentified network | 29 min | Apache on the production server was restarted to drop the enormous backlog of requests; after that, Attack Mode was enabled on our Cloudflare account for the Australian site | The harmful network payload grew slowly and was not easily identifiable as a deliberate attack during the first hour of the incident. This is considered normal, as our production nodes can withstand SOME amount of harmful requests before significant service degradation occurs. Still, we need to tune our monitoring system to react to suspicious high-load patterns faster, at the very least BEFORE the head of the organization becomes aware of them.
23 | 9.05.2024 | 1:31 PM | 9.05.2024 | 4:03 PM | GE network issues | 2 hours 32 min | After troubleshooting and rebooting the switches and the gateway, the gateway's OpenVPN profile was reconfigured, which fixed the problem | -
24 | 11.05.2024 | 12:12 EDT | 11.05.2024 | 13:28 EDT | DDoS on the bloomex.ca main prod | 0 minutes (no downtime) | The attack was stopped by switching on Attack Mode / transparent cache (dev mode) for short periods during the high-load window. No business impact | -
25 | 21.05.2024 | 5:06 AM | 21.05.2024 | 9:08 AM | mailbot did not work | 4 h 2 min | The SSL certificate had not been renewed automatically by the crontab entry; it had to be renewed manually and mailbot restarted | -
26 | 28.05.2024 | 11:45 | 28.05.2024 | 13:00 | DDoS on the bloomex.ca main prod | Partial downtime, ~10 min total | The attack was stopped by switching on Attack Mode / transparent cache (dev mode) for short periods during the high-load window. No business impact | -
27 | 03.06.2024 | 4:24 | 03.06.2024 | 7:49 | Users could not add items to the cart | 3 h 25 min | No free inodes left on the server; fixed the root cause of the high inode consumption (http://tasks.bloomex.ca/redmine/issues/18106) | If inode usage hits 100%, find the directories holding the most files; in this case that pointed at the PHP session path, and deleting the stale session files freed the inodes (exact commands below). For a permanent fix, change the PHP session location.
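
The recovery commands recorded for this incident, separated out for readability (the paths are the ones from the incident):

 # Which directories hold the most files, i.e. consume the most inodes?
 find /mnt/storage2 -xdev -printf '%h\n' | sort | uniq -c | sort -rn | head -n 20
 # Here that pointed at the PHP session directory; deleting the stale
 # session files frees the inodes.
 find /mnt/storage2/www/stage2.bloomex.ca/php-session -type f -delete
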
28 | 05.06.2024 | 20:41 | 05.06.2024 | 21:35 | bloomex.ca wasn't available | 56 mins | Prometheus overloaded the prod DB. Changed the DB source for Prometheus to prod-replica and killed the running Prometheus queries in the prod DB (illustration below) | -
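
The record does not preserve the exact kill procedure. As an illustration only, runaway queries from a particular user are typically found and killed like this in MySQL (the prometheus user name and the connection id are hypothetical):

 # List long-running queries owned by the monitoring user, longest first.
 mysql -e "SELECT id, user, time, info FROM information_schema.processlist WHERE user = 'prometheus' ORDER BY time DESC;"
 # Kill one by its connection id (id taken from the listing above).
 mysql -e "KILL 12345;"
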
29 | 27.06.2024 | 14:30 | 27.06.2024 | 22:00 | adm-eu@bloomex.ca secret compromised and used for spam bombing; mail.necs.ca down because Zimbra could not start. Due to a weird workload issue that is now fully addressed, some of the emails sent between 08:00 and 14:00 EDT MIGHT be lost | Partial downtime through the whole day, ~4 hours | Changed the credentials for the adm-eu mailbox; blocked the attackers' IPs; added resources to the Zimbra server host; cleaned the queues | -
30 | 01.07.2024 | 19:00 EDT | 01.07.2024 | 22:00 EDT | victory.nadum@bloomex.ca mailbox compromised and used for spam messaging; uptime not affected | No downtime | Deactivated the compromised mailbox; changed the credentials | -
31 | 03.07.2024 | 11:27 EDT | 03.07.2024 | 11:30 EDT | The DNS name and resolved IP were changed in Cloudflare; after some time the prod server went down because the host instance did not have enough resources for stable operation (t2.micro instead of m5.2xlarge) | 3 minutes | Pointed the DNS name and resolved IP in Cloudflare back to the old host | -
32 | 05.07.2024 | 10:53 AM PST | 05.07.2024 | 11:13 AM PST | I turned off api2 at Eershov's request (audit letter: "Bloomex team migrated the API server from Debian 9 to Debian 11 in a new server instance around 3 weeks before engaging for PFI. The old instance is still up and running."); NZ and adm payments went down, so it was turned back on, because NZ works through it | 20 min | Created a ticket for Levon to fix the workflow; also switched adm.bloomex.ca and adm.bloomex.com.au from api2 to api-pay | -
33 | 15.07.2024 | 8:23 AM PST | 15.07.2024 | 10:37 AM PST | Local shops phone issues | 2 hours 24 min | An issue with the routes on the media server: they had been deleted | -
34 | 19.07.2024 | 06:34 AM PST | 19.07.2024 | 8:20 AM PST | Periodic downtime for the retail shops because of ticket #19186 | 5 minutes | Fixed by commenting out the email credentials, rolling back the installation, and increasing the server resources | -
35 | 24.07.2024 | 09:14 AM PST | 24.07.2024 | 10:20 AM PST | Local shops phone issues: the voice layer was not being transferred, so clients and managers could not hear each other | ~1 hour | Caused by a revoked AWS security group named Rostov_main that covered port range 10000-20000. To resolve it, the SG was recreated and renamed correctly. http://tasks.bloomex.ca/redmine/issues/19280 | -
36 | 29.07.2024 | 18:24 EDT | 29.07.2024 | 21:10 EDT | Outdated security certificate on mail.necs.ca | No actual downtime; some performance degradation | Actions were run as per the Mail server page | A certificate can be renewed in several ways; BE AWARE that in the case of OUR MAIL SERVER you should only use the procedure linked from the Solution column.
37 | 07.08.2024 | 08:30 EDT | 07.08.2024 | 09:15 EDT | Curl Error: SSL certificate problem: unable to get local issuer certificate | ~45 mins | Caused by moving the adm-eu instance from the Frankfurt region to Oregon. Added a Security Group for the instance that allows the API IP | -
38 | 11.08.2024 | 10:30 EDT | 11.08.2024 | 10:55 EDT | DDoS on bloomex.com.au | No downtime | The DDoS against the Australian site was blocked immediately by Cloudflare rules (Under Attack Mode and custom WAF rules were enabled at the time) | -
39 | 17.08.2024 | 18:20 EDT (approx.) | 18.08.2024 | 18:20 EDT | bloomex.ca malfunction: inability to perform several operations, including placing an order. An external audit recommendation to blacklist the backend server in the frontend server's firewall made the site function improperly; the following command was the cause of the accident: iptables -A INPUT -s 195.2.92.206 -j DROP && sudo iptables -A INPUT -s 37.1.213.196 -j DROP && sudo iptables -A INPUT -s 34.210.253.67 -j DROP && sudo iptables -A OUTPUT -d 195.2.92.206 -j DROP && sudo iptables -A OUTPUT -d 37.1.213.196 -j DROP && sudo iptables -A OUTPUT -d 34.210.253.67 -j DROP | No actual downtime; performance degradation for ~24 hrs | Rolling back the new firewall rules resolved the issue. Note that the rules are removed with -D, not re-added with -A: sudo iptables -D INPUT -s 37.1.213.196 -j DROP && sudo iptables -D OUTPUT -d 37.1.213.196 -j DROP && sudo iptables -D INPUT -s 195.2.92.206 -j DROP && sudo iptables -D OUTPUT -d 195.2.92.206 -j DROP (and likewise for the 34.210.253.67 rules); see also the sketch below | Giving more attention to audit requests is recommended.
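
If the exact rule specification is not known, the added rules can also be reviewed and deleted by index. A short sketch; the rule number is a placeholder:

 # Show INPUT rules with their indices.
 iptables -L INPUT --line-numbers -n
 # Delete a specific rule by its number from the listing above.
 iptables -D INPUT 3
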
40 | 06.09.2024 | 16:00 EDT | 06.09.2024 | 19:00 EDT | Problems with the internet provider for the GE office | ~3 hours | The gateway's ISP hot-switch mechanism failed; the switch was performed manually. Need to fix the hot-switch or replace the current gateway PC with a Fortigate. http://tasks.bloomex.ca/redmine/issues/20151 http://tasks.bloomex.ca/redmine/issues/20152 | -
41 | 07.09.2024 | ~10:00 EDT | 07.09.2024 | ~11:30 EDT | Retail shops phone issue: all of them were dropped from the queue several times | ~1 hour | Looks like internet connection problems for Carlton Place and some other regions as well. The problem went away by itself. http://tasks.bloomex.ca/redmine/issues/20158 | -
42 | 09.09.2024 | ~07:00 EDT | 09.09.2024 | ~11:30 EDT | Retail shops phone issue: none of the retail subnets were working: phones, cameras, gateways | ~4.5 hours | The OpenVPN service had been reloaded and lost all its routes. The routes were re-added manually and a backup file was created. http://tasks.bloomex.ca/redmine/issues/20326 | -
43 | 09.09.2024 | ~09:30 EDT | 09.09.2024 | ~23:30 EDT | Host 10.0.0.91 had performance problems, but the reason was not immediately clear | ~14 hours | A DDoS attack from an unknown AWS instance, hitting us directly and bypassing all external firewall rules. After blocking it with iptables, all performance issues were gone. http://tasks.bloomex.ca/redmine/issues/20185 | -
44 | 10.09.2024 | 09:30 EDT | 10.09.2024 | 13:30 EDT | While investigating the previous incident, more than 30 active sessions were found on the VPN server. When they were disabled, the VPN service crashed but was restored within half an hour; 20 minutes later the service dropped again for unknown reasons. It was decided to change the root password and reboot the server, after which all dynamic routes and firewall rules were lost, breaking 2 of the 3 subnets | 4 hours | All routes and firewall rules were restored from backups | A ticket is needed.
45 | 26.10.2024 | 03:26 EDT | 26.10.2024 | 08:00 EDT | Verification codes did not reach some users. The problem surfaced on the CA admin side, but the root cause was that mailsender ran out of disk space | No actual downtime; for ~4.5 hours some emails did not reach users | Because the admin systems' PHP mail logic flows through mailsender, delivery breaks whenever something happens to mailsender. http://tasks.bloomex.ca/redmine/issues/21141 | -
46 | 09.11.2024 | 00:12 EDT | 09.11.2024 | 17:00 EST | Printing labels did not work via Purolator, FedEx, or Canada Post | No actual downtime | Upgrading the server OS from Debian 11 to Debian 12 had disabled some PHP modules (soap). http://tasks.bloomex.ca/issues/21318 | -
47 | 12.11.2024 | 11:00 EDT | 12.11.2024 | 12:00 EDT | Canadian admin down | No actual downtime | The low-RAM issue was resolved by restarting the PHP-FPM 7.0 socket, which reduced the number of active connections. The increase in connections was caused by a DDoS attack | -
48 | 14.11.2024 | 05:40 EDT | 14.11.2024 | 06:00 EDT | VPN server ran out of disk space | ~20 mins | /var/log ran out of free space because logs from the EDR, Velociraptor, auditd, and rsyslog consumed all of it | Set up rotation for /var/log (see the sketch below).
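
A hedged sketch of that rotation fix: a logrotate drop-in for the noisy log directory. The file name, paths, and retention are illustrative, not the values actually deployed:

 # /etc/logrotate.d/velociraptor (hypothetical drop-in)
 /var/log/velociraptor/*.log {
     daily
     rotate 7
     compress
     missingok
     notifempty
 }
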
49 | 30.11.2024 | 07:23 EDT | 30.11.2024 | 08:23 EDT | SIP clients could not complete registration on Asterisk | 1 h | Asterisk did not accept registrations from SIP clients but was otherwise working well. Fixed by a reboot after an unsuccessful investigation; ulimit 65353 was added (see the sketch below). https://tasks.bloomex.ca/issues/21584 | -
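
The record says only that ulimit 65353 was added. One way to make such a file-descriptor limit stick for Asterisk under systemd, as an illustration (the unit name and override path are assumptions; 65353 is the value recorded above):

 mkdir -p /etc/systemd/system/asterisk.service.d
 printf '[Service]\nLimitNOFILE=65353\n' > /etc/systemd/system/asterisk.service.d/limits.conf
 systemctl daemon-reload && systemctl restart asterisk
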
50 | 12.12.2024 | 14:30 EDT | 12.12.2024 | 16:20 EDT | Huge DDoS attack on chat.bloomex.ca | No actual downtime; performance degradation for ~2 hrs | A huge DDoS attack on chat.bloomex.ca from many IPs. Added block rules and resized the instance. https://tasks.bloomex.ca/issues/21961 | -
51 | 12.01.2025 | 20:00 UTC | 12.01.2025 | 23:00 UTC | The AU admin system hung due to peak CPU load on main-db1, caused by a spike in database connections | ~2 hrs | The overload on the DBMS passed over time | The exact cause of the connection spike needs to be found; most likely these are requests from the frontend.
52 | 21.01.2025 | 13:20 UTC | 21.01.2025 | 15:50 UTC | The CA, AU, and USA admin systems hung due to peak CPU load on main-db1/main-db2, caused by a spike in database connections | ~1.5 hrs | - | -
53 | 22.01.2025 | 2:40 UTC | 22.01.2025 | 4:50 UTC | The AU admin system hung due to peak CPU load on main-db1, caused by a spike in database connections | ~1.5 hrs | A huge DDoS from msnbot against bloomex.com.au and from AndroidDownloadManager/5.1 against bloomexusa, plus a number of other unspecified addresses | The exact cause of the connection spike needs to be found; most likely these are requests from the frontend.
54 | 30.01.2025 | 1:18 UTC | 30.01.2025 | 1:57 UTC | The retail host became unavailable over the network at one point | 39 mins 30 sec | For a short period the load on the DB increased, and the retail sites simply became unavailable over the network for no apparent reason. Changing the retail host's machine plan helped: the host was moved to another resource pool, and changing the network interface during the resize may have helped | -