ACCIDENT HISTORY
{| class="wikitable"
|'''NN'''
|'''Start Date'''
|'''Start Time'''
|'''Resolve Date'''
|'''Resolve Time'''
|'''Issue'''
|'''Downtime'''
|'''Solution'''
|'''Comments'''
|-
|1
|09.12.2022
|
|
|
|Forced upgrade of MySQL 5.6 on the AWS side, because Amazon ended support for version 5.6
|12 hours
|Restored the databases from backup, then migrated to a supported MySQL version
|Version 5.7 will soon lose Amazon support as well. Together with the developers we need to plan a migration to a newer version, preferably 8.0. Sergey Ershov is aware of the task; the decision to migrate kept being postponed
|-
|2
|23.12.2022
|
|
|
|DDoS attack on bloomex.com.au
|20 minutes
|Identified the attacking IP address by analyzing the log at /var/log/apache2/www.bloomex.com.au-access.log, then banned the address in .htaccess or in iptables
|The easiest approach is a regular expression or short pipeline that parses the log file and sorts clients by request count; anything above ~10k requests can safely be blocked (a sample one-liner is sketched after the table). This type of DDoS takes Apache down or quickly fills the disk, which makes the resource unavailable
|-
|3
|11.02.2023
|
|
|
|DDoS attack on bloomex.com.au
|12 minutes
|Identified the attacking IP address by analyzing the log at /var/log/apache2/www.bloomex.com.au-access.log, then banned the address in .htaccess or in iptables
|Same approach as incident 2: parse the access log, sort clients by request count, and block anything above ~10k requests. This type of DDoS takes Apache down or quickly fills the disk, which makes the resource unavailable
|-
|4
|12.02.2023
|
|
|
|DDoS attack on bloomex.com.au
|12 minutes
|Identified the attacking IP address by analyzing the log at /var/log/apache2/www.bloomex.com.au-access.log, then banned the address in .htaccess or in iptables
|Same approach as incident 2: parse the access log, sort clients by request count, and block anything above ~10k requests. This type of DDoS takes Apache down or quickly fills the disk, which makes the resource unavailable
|-
|5
|13.02.2023
|
|
|
|DDoS attack on bloomex.com.au
|9 minutes
|Identified the attacking IP address by analyzing the log at /var/log/apache2/www.bloomex.com.au-access.log, then banned the address in .htaccess or in iptables
|Same approach as incident 2: parse the access log, sort clients by request count, and block anything above ~10k requests. This type of DDoS takes Apache down or quickly fills the disk, which makes the resource unavailable
|-
|6
|13.02.2023
|
|
|
|Mailbot outage
|10 hours
|Waited until a developer woke up and fixed autovacuum on PostgreSQL
|We need a staff DBA to tune the databases that live locally on the instance. Sergey and Dmitry are aware of the situation; I was not given the headcount
|-
|7
|11.02.2023 - 14.02.2023
|
|
|
|Telephony worked with interruptions, lines fell apart mid-call; the problem was on the inbound lines
|Intermittent problem, mostly during surges of calls
|The provider could not give us the channel bandwidth we needed. As a solution we decided to change the provider of the inbound lines
|We need to review provider parameters and switch to different ones; this applies to both inbound and outbound lines. Over the holidays there were fewer problems, but they have not been tested against a large flow of inbound calls
|-
|8
|01.04.2023 - 06.04.2023
|
|
|
|Several localshop sites had expired SSL certificates
|6 days
|Certbot from Let's Encrypt updated its packages and changed its policy for renewing free certificates; we did not notice right away
|The proposal to move to paid certificates was not supported, so we need to keep an eye on the certificates, update the certbot scripts, and watch the cron job (an expiry check is sketched after the table)
|-
|9
|11.05.2023 - 17.05.2023
|
|
|
|Telephony worked with interruptions, lines fell apart mid-call; the problem was on the inbound lines
|Intermittent problem, mostly during surges of calls
|The provider could not give us the channel bandwidth we needed. As a solution we decided to change the provider of the inbound lines
|We need to review provider parameters and switch to different ones; this applies to both inbound and outbound lines. Over the holidays there were fewer problems, but they have not been tested against a large flow of inbound calls
|-
|10
|24.06.2023
|10:15 UTC
|25.06.2023
|18:00 UTC
|Payment gateway experienced intermittent failures, causing transaction processing delays.
|8 hours
|Updated payment gateway configurations and restarted services to restore functionality.
|Implement automated monitoring for early detection of transaction failures.
|-
|11
|15.09.2023
|
|20.09.2023
|
|The mail server's IP address landed on a spam blacklist; we lost the mail service until it was restored
|9 hours of downtime
|Moved the mail server to a new instance with new addresses (local and external), reconfigured DNS and MX records
|The wecare@bloomexusa.com mailbox was broken into, presumably through the front end of the US site. Using the stolen credentials, 9.5 million emails were sent through our mail server from that address within a day, so the address was flagged as compromised and added to a stoplist. We need to close the holes in the code on bloomexusa.com
|-
|12
|23.09.2023
|11:00 PM
|24.09.2023
|7:00 PM
|At 13:08 Anastasia reported in chat that payments were not working
|20 hours
|Disabled payment confirmation after 2 hours; the developers then fixed the issue and turned it back on
|
|-
|13
|03.10.2023
|
|
|
|<nowiki>https://api.bloomex.ca/dhlordermanager</nowiki> - Invalid SSL certificate (error code 526)
|35 minutes
|Renewed the certificate
|Certificate monitoring was handed over to the Maxes
|-
|14
|21.12.2023
|10:00 AM
|21.12.2023
|3:00 PM
|web3-3 went down because of a DDoS attack, and the instance died again after a reboot
|5 hours
|Reattached the root disk from the backup kept on another host that served as a backup
|
|-
|15
|06.02.2023
|7:00 AM
|06.02.2023
|9:00 AM
|bloomexusa.com was no longer available
|3 hours
|
|
|-
|XX
|14.02.2023
|1:00 PM
|14.02.2023
|7:00 PM
|chat.bloomex.ca - Watson bot dead
|6 hours
|Cherepanov: deleting the queues and containers helped; the logs are there, I'll take a look during the day and find the cause. Offhand, the supervisor hung about 5-6 hours ago
|
|-
|16
|06.03.2024
|20:17 UTC
|06.03.2024
|20:36 UTC
|mail.necs.ca server down
|0 hrs 19 mins
|Due to an urgent SSH access issue we had to edit configuration files manually, which stopped the server. After the files were edited directly on the server drive, it was booted back up immediately.
|The sshd_config file should NOT be modified in the future while a key-only login policy is in place.
|-
|17
|07.03.2024
|21:47 UTC
|07.03.2024
|22:00 UTC
|adm.bloomex.ca server down
|0 hrs 13 mins
|No free space left on the server storage due to extremely large log files. No Zabbix alerts fired because Zabbix had been turned off for planned maintenance.
|Important but large log files should not be kept on the production server; they are due to be moved to another location. Zabbix should not be stopped for extended periods of time.
|-
|18
|11.03.2024
|11:00 UTC
|11.03.2024
|17:00 UTC
|adm-eu.necs.ca down (status 500) (<nowiki>http://tasks.bloomex.ca/redmine/issues/16047</nowiki>), odd behavior of store orders/confirmation emails
|6 hrs 0 mins
|Cron on the production server started recurring jobs more often than a single run takes to complete, which let the MySQL process queue grow until it hit its limit.
|It is recommended to reconfigure the production cron jobs so that a new run waits for the previous one to finish whenever a job takes longer than its cycle (a flock-based sketch is given after the table).
|-
|19
|18.03.2024
|14:47
|18.03.2024
|18:05
|sip2.bloomex.ca issue with NZ inbound lines
|3 hours 13 mins
|Created a new server from backup and restored the configuration
|Apparently caused by the "deletion of old unnecessary numbers"
|-
|20
|19.04.2024
|12:41
|19.04.2024
|12:54
|Localshop sites down
|13 min
|Fixed the PHP socket name
|[[Create localshop (Laravel on dev3-2)|The manual for creating new shops was updated: Create localshop (Laravel on dev3-2)]]
|-
|21
|27.04.2024
|13:23 EDT
|27.04.2024
|16:10 EDT
|DDoS on Bloomex.ca from an unidentified network
|2 hrs 47 min
|Apache on the production server was restarted to drop the enormous backlog of requests. After that, Attack mode was enabled on our Cloudflare account.
|The harmful traffic ramped up slowly and was not easily identifiable as a deliberate attack during the first hour of the incident. This is considered normal, as our production nodes can withstand some amount of harmful requests before significant service degradation occurs.
|-
|22
|30.04.2024
|13:16 EDT
|30.04.2024
|13:45 EDT
|DDoS on Bloomex.com.au from an unidentified network
|29 min
|Apache on the production server was restarted to drop the enormous backlog of requests. After that, Attack mode was enabled on our Cloudflare account for the Australian site.
|The harmful traffic ramped up slowly and was not easily identifiable as a deliberate attack during the first hour of the incident. This is considered normal, as our production nodes can withstand some amount of harmful requests before significant service degradation occurs. Still, we need to tune our monitoring to react to suspicious high-load patterns faster, at least BEFORE the head of the organization becomes aware of them.
|-
|23
|9.05.2024
|1:31 PM
|9.05.2024
|4:03 PM
|GE network issues
|2 hours 32 min
|After troubleshooting and rebooting the switches and the gateway, the gateway's OpenVPN profile was reconfigured, which fixed the problem
|
|-
|24
|11.05.2024
|12:12 EDT
|11.05.2024
|13:28 EDT
|DDoS on bloomex.ca main prod
|0 minutes (no downtime)
|The attack was stopped by switching on Attack Mode / transparent cache (dev mode) for short periods during the high-load window. No business impact.
|
|-
|25
|21.05.2024
|5:06 AM
|21.05.2024
|9:08 AM
|Mailbot did not work
|4 h 2 min
|The SSL certificate was not renewed automatically by the crontab job; it had to be renewed manually and Mailbot restarted.
|
|-
|25
|28.05.2024
|11:45
|28.05.2024
|13:00
|DDoS on bloomex.ca main prod
|Partial downtime, ~10 min total
|The attack was stopped by switching on Attack Mode / transparent cache (dev mode) for short periods during the high-load window. No business impact.
|
|-
|26
|03.06.2024
|4:24
|03.06.2024
|7:49
|Users could not add items to the cart
|3 h 25 min
|The server ran out of free inodes; freed them and fixed the root cause of the high inode consumption
|<nowiki>http://tasks.bloomex.ca/redmine/issues/18106</nowiki>. If inode usage is at 100%, run <code><nowiki>find /mnt/storage2 -xdev -printf '%h\n' | sort | uniq -c | sort -rn | head -n 20</nowiki></code> and you will get the PHP session path; after that run <code>find /mnt/storage2/www/stage2.bloomex.ca/php-session -type f -delete</code>. For the mount, change the PHP session location.
|-
|27
|05.06.2024
|20:41
|05.06.2024
|21:35
|bloomex.ca was not available
|56 mins
|Prometheus overloaded the prod DB. Changed the DB source for Prometheus to prod-replica and killed the running Prometheus queries in the prod DB
|
|-
|28
|27.06.2024
|14:30
|27.06.2024
|22:00
|adm-eu@bloomex.ca credentials compromised; mail.necs.ca down
|Partial downtime throughout the day, ~4 hours total
|Spam bombing via adm-eu; Zimbra could not start; due to a workload issue that has now been fully addressed, some emails sent between 08:00 and 14:00 EDT MIGHT be lost
|Changed the credentials for the adm-eu mailbox; blocked the attackers by IP; added resources to the Zimbra server host; cleaned the queues
|-
|29
|01.07.2024
|19:00 EDT
|01.07.2024
|22:00 EDT
|victory.nadum@bloomex.ca mailbox compromised, used for spam messaging. Uptime not affected.
|No downtime
|Compromised mailbox deactivated; creds changed
|
|-
|30
|03.07.2024
|11:27 EDT
|03.07.2024
|11:30 EDT
|The DNS name and resolved IP were changed in Cloudflare; after some time the prod server went down because the host instance did not have enough resources to run stably (t2.micro instead of m5.2xlarge)
|3 minutes
|The DNS name and resolved IP in Cloudflare were pointed back to the old host
|
|-
|31
|05.07.2024
|10:53 AM PST
|05.07.2024
|11:13 AM PST
|I turned off api2 at Eershov's request (audit letter: "Bloomex team migrated the API server from Debian 9 to Debian 11 in a new server instance around 3 weeks before engaging for PFI. The old instance is still up and running."); NZ and the adm payments went down, so I turned it back on because NZ works through it
|20 min
|Created a ticket for Levon to fix the workflow; also switched adm.bloomex.ca and adm.bloomex.com.au from api2 to api-pay
|
|-
|32
|15.07.2024
|8:23 AM PST
|15.07.2024
|10:37 AM PST
|Local shops phone issues
|2 hours 24 min
|An issue with the media routes: they had been deleted
|
|-
|33
|19.07.2024
|06:34 AM PST
|19.07.2024
|8:20 AM PST
|Periodic downtime issues with retail shops because of ticket #19186
|5 minutes
|Fixed by commenting out the email credentials, rolling back the installation, and increasing server resources
|
|-
|34
|24.07.2024
|09:14 AM PST
|24.07.2024
|10:20 AM PST
|Local shops phone issues: the voice stream was not transferred, clients and managers could not hear each other
|~ 1 hour
|Caused by a revoked AWS security group named Rostov_main that covered port range 10000 - 20000
|To resolve it, the SG was restored and renamed to the correct name. <nowiki>http://tasks.bloomex.ca/redmine/issues/19280</nowiki>
|-
|35
|29.07.2024
|18:24 EDT
|29.07.2024
|21:10 EDT
|Outdated security certificate on mail.necs.ca
|No actual downtime; some performance degradation
|[[Mail server|Actions run as per the Mail server page]]
|A certificate can be renewed in several ways. BE AWARE that for OUR MAIL SERVER you should only use the link in the cell to the left <--
|-
|36
|07.08.2024
|08:30 EDT
|07.08.2024
|09:15 EDT
|Curl Error: SSL certificate problem: unable to get local issuer certificate
|~45 mins
|Caused by moving the adm-eu instance from the Frankfurt region to the Oregon region
|Added a security group for the instance with the API IP
|-
|37
|11.08.2024
|10:30 EDT
|11.08.2024
|10:55 EDT
|DDoS on bloomex.com.au
|No downtime
|The DDoS on the Australian site was blocked immediately by Cloudflare rules (Under Attack Mode and custom WAF rules were enabled at the time)
|
|-
|38
|17.08.2024
|18:20 EDT (APPRX)
|18.08.2024
|18:20 EDT
|bloomex.ca malfunction: several operations, including placing an order, could not be performed. An external audit recommendation to blacklist the backend server in the frontend server's firewall left the site functioning improperly; the following command caused the accident: <code>iptables -A INPUT -s 195.2.92.206 -j DROP && sudo iptables -A INPUT -s 37.1.213.196 -j DROP && sudo iptables -A INPUT -s 34.210.253.67 -j DROP && sudo iptables -A OUTPUT -d 195.2.92.206 -j DROP && sudo iptables -A OUTPUT -d 37.1.213.196 -j DROP && sudo iptables -A OUTPUT -d 34.210.253.67 -j DROP</code>
|No actual downtime; performance degradation for ~24 hrs
|Rolling back the new firewall rules resolved the issue. The rollback command (deleting the rules with -D) looks as follows: <code>sudo iptables -D INPUT -s 37.1.213.196 -j DROP && sudo iptables -D OUTPUT -d 37.1.213.196 -j DROP && sudo iptables -D INPUT -s 195.2.92.206 -j DROP && sudo iptables -D OUTPUT -d 195.2.92.206 -j DROP</code>
|Giving more attention to audit requests is recommended
|-
|39
|06.09.2024
|16:00 EDT
|06.09.2024
|19:00 EDT
|Problems with the internet provider for the GE office
|~3 hours
|The gateway's ISP hot-switch mechanism failed; the switch was performed manually. Need to fix the hot-switch or replace the current gateway PC with a FortiGate
|<nowiki>http://tasks.bloomex.ca/redmine/issues/20151</nowiki> <nowiki>http://tasks.bloomex.ca/redmine/issues/20152</nowiki>
|-
|40
|07.09.2024
|~10:00 EDT
|07.09.2024
|~11:30 EDT
|Retail shops phone issue: all of them were dropped from the queue several times
|~1 hour
|Looks like internet connection problems for Carlton Place and some regions too. The problem went away by itself.
|<nowiki>http://tasks.bloomex.ca/redmine/issues/20158</nowiki>
|-
|41
|09.09.2024
|~07:00 EDT
|09.09.2024
|~11:30 EDT
|Retail shops phone issue: none of the retail subnets were working: phones, cameras, gateways
|~4.5 hours
|The OpenVPN service was reloaded and all routes were lost. Routes were re-added manually and a backup file was created.
|<nowiki>http://tasks.bloomex.ca/redmine/issues/20326</nowiki>
|-
|42
|09.09.2024
|~09:30 EDT
|09.09.2024
|~23:30 EDT
|Host 10.0.0.91 had performance problems, but the reason was not immediately clear.
|~14 hours
|DDoS attack from an unknown AWS instance; a direct attack bypassing all external firewall rules. After blocking it with iptables, all performance issues were gone.
|<nowiki>http://tasks.bloomex.ca/redmine/issues/20185</nowiki>
|-
|43
|10.09.2024
|09:30 EDT
|10.09.2024
|13:30 EDT
|As a result of the investigation of the previous incident, more than 30 active sessions were found on the VPN server. When they were disabled, the VPN service crashed, but was restored within half an hour. However, after 20 minutes the service dropped again for unknown reasons. It was decided to change the root password and reboot the server. After this, all dynamic routes and firewall rules were lost, which broke 2 of the 3 subnets.
|4 hours
|All routes and firewall rules were restored from backups
|Ticket needed
|-
|44
|26.10.2024
|03:26 EDT
|26.10.2024
|08:00 EDT
|Verification codes did not reach some users. The problem was detected on the CA admin side, but the root cause was that the mailsender ran out of disk space
|No actual downtime; for ~4.5 hours some mail did not reach users
|Because the admin systems' mail logic flows through PHP code to the mailsender, it breaks if anything happens to the latter.
|<nowiki>http://tasks.bloomex.ca/redmine/issues/21141</nowiki>
|-
|45
|09.11.2024
|00:12 EDT
|09.11.2024
|17:00 EST
|Label printing did not work via Purolator, FedEx, Canada Post
|No actual downtime
|Upgrading the server OS from Debian 11 to Debian 12 disabled some PHP modules (soap)
|<nowiki>http://tasks.bloomex.ca/issues/21318</nowiki>
|-
|46
|12.11.2024
|11:00 EDT
|12.11.2024
|12:00 EDT
|Canadian admin down
|No actual downtime
|The low-RAM issue was resolved by restarting the php-fpm 7.0 socket, which reduced the number of active connections. The spike in connections was due to a DDoS attack.
|
|-
|47
|14.11.2024
|05:40 EDT
|14.11.2024
|06:00 EDT
|The VPN server ran out of disk space
|~20 mins
|/var/log ran out of free space because logs from the EDR, Velociraptor, auditd, and rsyslog ate all of it
|Set up log rotation for /var/log (a logrotate sketch is given after the table)
|-
|48
|30.11.2024
|07:23 EDT
|30.11.2024
|08:23 EDT
|SIP clients could not finish registration on Asterisk
|1 h
|Asterisk did not accept registrations from SIP clients, although everything else was working fine.
|Fixed by a reboot after an unsuccessful investigation. Added ulimit 65353 (a persistent systemd sketch is given after the table). <nowiki>https://tasks.bloomex.ca/issues/21584</nowiki>
|-
|49
|12.12.2024
|14:30 EDT
|12.12.2024
|16:20 EDT
|Huge DDoS attack on chat.bloomex.ca
|No actual downtime; performance degradation for ~2 hrs
|Huge DDoS attack on chat.bloomex.ca from many IPs
|Added block rules and resized the instance. <nowiki>https://tasks.bloomex.ca/issues/21961</nowiki>
|-
|50
|12.01.2025
|20:00 UTC
|12.01.2025
|23:00 UTC
|The AU admin system hung due to peak CPU load on main-db1, caused by a spike in database connections.
|~2 hrs
|The overload on the DBMS subsided on its own over time
|The exact cause of the connection spike still needs to be found; most likely it is requests from the frontend
|-
|51
|21.01.2025
|13:20 UTC
|21.01.2025
|15:50 UTC
|The CA, AU, and USA admin systems hung due to peak CPU load on main-db1/main-db2, caused by a spike in database connections.
|~1.5 hrs
|
|
|-
|52
|22.01.2025
|2:40 UTC
|22.01.2025
|4:50 UTC
|The AU admin system hung due to peak CPU load on main-db1, caused by a spike in database connections.
|~1.5 hrs
|Huge DDoS from msnbot against bloomex.com.au and from AndroidDownloadManager/5.1 against bloomexusa, plus a number of other unspecified addresses.
|The exact cause of the connection spike still needs to be found; most likely it is requests from the frontend
|-
|53
|30.01.2025
|1:18 UTC
|30.01.2025
|1:57 UTC
|The retail host became unavailable over the network at one point.
|39 mins 30 sec
|For a short period of time, the load on the DB increased. The retail sites simply became unavailable over the network for no apparent reason.
|Changing the retail host's machine plan (instance size) helped. The host was moved to another resource pool. Changing the network interface during the resize may also have helped.
|} |
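
The blocking approach described in incidents 2-5 can be approximated with a short shell pipeline. This is a minimal sketch, not the exact expression used at the time: the log path comes from the table above, the ~10k-request threshold is the rule of thumb quoted there, and the IP in the iptables line is a documentation-range placeholder.

<pre>
# Top 20 client IPs by request count (combined log format: the client IP is field 1).
awk '{print $1}' /var/log/apache2/www.bloomex.com.au-access.log \
  | sort | uniq -c | sort -rn | head -n 20

# Block a single offender once it clearly exceeds the ~10k-request threshold.
# 203.0.113.10 is a placeholder address, not a real attacker.
sudo iptables -A INPUT -s 203.0.113.10 -j DROP
</pre>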
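For the certificate incidents (8, 13, 35), a simple external expiry check can run from cron alongside certbot's own renewal. This is a sketch under assumptions: the host list is illustrative, the 14-day threshold is arbitrary, and GNU date (as shipped on Debian) is assumed.

<pre>
#!/usr/bin/env bash
# Warn when a public HTTPS certificate is close to expiry.
set -u
WARN_DAYS=14
for host in api.bloomex.ca mail.necs.ca; do   # illustrative host list
  expiry=$(echo | openssl s_client -connect "$host:443" -servername "$host" 2>/dev/null \
             | openssl x509 -noout -enddate | cut -d= -f2)
  days_left=$(( ( $(date -d "$expiry" +%s) - $(date +%s) ) / 86400 ))
  if [ "$days_left" -lt "$WARN_DAYS" ]; then
    echo "WARNING: certificate for $host expires in $days_left days"
  fi
done

# A non-destructive check that certbot's own renewal path still works:
# certbot renew --dry-run
</pre>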
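The cron recommendation from incident 18 can be implemented with flock(1), so that a new run simply skips its slot while the previous run is still going. This is a sketch, not the production crontab: the job name, path, schedule, and user are placeholders.

<pre>
# /etc/cron.d/orders-sync  (hypothetical job; adjust the name, schedule, and path)
# flock -n takes an exclusive lock and exits immediately if the previous run
# still holds it, so overlapping runs can no longer pile up MySQL processes.
*/5 * * * * www-data flock -n /var/lock/orders-sync.lock /usr/local/bin/orders-sync.sh >> /var/log/orders-sync.log 2>&1
</pre>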
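For incident 47, a dedicated logrotate rule keeps the noisy agent logs under /var/log bounded. A minimal sketch, assuming the Velociraptor log path shown; rsyslog already ships its own /etc/logrotate.d/rsyslog rule, and auditd rotates itself via max_log_file / max_log_file_action in auditd.conf, so those usually need tighter limits rather than a new rule.

<pre>
# /etc/logrotate.d/velociraptor  (hypothetical file; adjust the glob to the real path)
/var/log/velociraptor/*.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}
</pre>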
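To make the ulimit change from incident 48 survive reboots, the limit can be set on the Asterisk service itself. This sketch assumes Asterisk runs under systemd as asterisk.service; the value 65353 is the one recorded in the table.

<pre>
# /etc/systemd/system/asterisk.service.d/limits.conf
# Raise the open-files limit for the Asterisk service only.
[Service]
LimitNOFILE=65353
</pre>

After creating the drop-in, apply it with <code>systemctl daemon-reload && systemctl restart asterisk</code>.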