ACCIDENT HISTORY
{| class="wikitable"
|'''NN'''
|'''Start Date'''
|'''Start Time'''
|'''Resolve Date'''
|'''Resolve Time'''
|'''Issue'''
|'''Downtime'''
|'''Solution'''
|'''Comments'''
|-
|1
|09.12.2022
|
|
|
|Forced upgrade of MySQL 5.6 on the AWS side, because Amazon ended support for version 5.6
|12 hours
|Restored the databases from backup, then migrated to a supported MySQL version
|Version 5.7 will soon lose Amazon support as well. Together with the developers we need to plan a migration to a newer version, preferably 8.0. Sergey Ershov is aware of the task; the decision to migrate kept being postponed
|-
|2
|23.12.2022
|
|
|
|DDoS attack on bloomex.com.au
|20 minutes
|Identified the attacking IP address by analyzing the log at /var/log/apache2/www.bloomex.com.au-access.log, then banned the address in .htaccess or in iptables
|The easiest approach is a regular expression or short pipeline that parses the log file and sorts clients by request count; anything above ~10k requests can safely be blocked (a sample one-liner is sketched after the table). This type of DDoS takes Apache down or quickly fills the disk, which makes the resource unavailable
|-
|3
|11.02.2023
|
|
|
|DDoS attack on bloomex.com.au
|12 minutes
|Identified the attacking IP address by analyzing the log at /var/log/apache2/www.bloomex.com.au-access.log, then banned the address in .htaccess or in iptables
|Same approach as incident 2: parse the access log, sort clients by request count, and block anything above ~10k requests. This type of DDoS takes Apache down or quickly fills the disk, which makes the resource unavailable
|-
|4
|12.02.2023
|
|
|
|DDoS attack on bloomex.com.au
|12 minutes
|Identified the attacking IP address by analyzing the log at /var/log/apache2/www.bloomex.com.au-access.log, then banned the address in .htaccess or in iptables
|Same approach as incident 2: parse the access log, sort clients by request count, and block anything above ~10k requests. This type of DDoS takes Apache down or quickly fills the disk, which makes the resource unavailable
|-
|5
|13.02.2023
|
|
|
|DDoS attack on bloomex.com.au
|9 minutes
|Identified the attacking IP address by analyzing the log at /var/log/apache2/www.bloomex.com.au-access.log, then banned the address in .htaccess or in iptables
|Same approach as incident 2: parse the access log, sort clients by request count, and block anything above ~10k requests. This type of DDoS takes Apache down or quickly fills the disk, which makes the resource unavailable
|-
|6
|13.02.2023
|
|
|
|Mailbot outage
|10 hours
|Waited until a developer woke up and fixed autovacuum on PostgreSQL
|We need a staff DBA to tune the databases that live locally on the instance. Sergey and Dmitry are aware of the situation; I was not given the headcount
|-
|7
|11.02.2023 - 14.02.2023
|
|
|
|Telephony worked with interruptions, lines fell apart mid-call; the problem was on the inbound lines
|Intermittent problem, mostly during surges of calls
|The provider could not give us the channel bandwidth we needed. As a solution we decided to change the provider of the inbound lines
|We need to review provider parameters and switch to different ones; this applies to both inbound and outbound lines. Over the holidays there were fewer problems, but they have not been tested against a large flow of inbound calls
|-
|8
|01.04.2023 - 06.04.2023
|
|
|
|Several localshop sites had expired SSL certificates
|6 days
|Certbot from Let's Encrypt updated its packages and changed its policy for renewing free certificates; we did not notice right away
|The proposal to move to paid certificates was not supported, so we need to keep an eye on the certificates, update the certbot scripts, and watch the cron job (an expiry check is sketched after the table)
|-
|9
|11.05.2023 - 17.05.2023
|
|
|
|Telephony worked with interruptions, lines fell apart mid-call; the problem was on the inbound lines
|Intermittent problem, mostly during surges of calls
|The provider could not give us the channel bandwidth we needed. As a solution we decided to change the provider of the inbound lines
|We need to review provider parameters and switch to different ones; this applies to both inbound and outbound lines. Over the holidays there were fewer problems, but they have not been tested against a large flow of inbound calls
|-
|10
|24.06.2023
|10:15 UTC
|25.06.2023
|18:00 UTC
|Payment gateway experienced intermittent failures, causing transaction processing delays.
|8 hours
|Updated payment gateway configurations and restarted services to restore functionality.
|Implement automated monitoring for early detection of transaction failures.
|-
|11
|15.09.2023
|
|20.09.2023
|
|The mail server's IP address landed on a spam blacklist; we lost the mail service until it was restored
|9 hours of downtime
|Moved the mail server to a new instance with new addresses (local and external), reconfigured DNS and MX records
|The wecare@bloomexusa.com mailbox was broken into, presumably through the front end of the US site. Using the stolen credentials, 9.5 million emails were sent through our mail server from that address within a day, so the address was flagged as compromised and added to a stoplist. We need to close the holes in the code on bloomexusa.com
|-
|12
|23.09.2023
|11:00 PM
|24.09.2023
|7:00 PM
|At 13:08 Anastasia reported in chat that payments were not working
|20 hours
|Disabled payment confirmation after 2 hours; the developers then fixed the issue and turned it back on
|
|-
|13
|03.10.2023
|
|
|
|<nowiki>https://api.bloomex.ca/dhlordermanager</nowiki> - Invalid SSL certificate (error code 526)
|35 minutes
|Renewed the certificate
|Certificate monitoring was handed over to the Maxes
|-
|14
|21.12.2023
|10:00 AM
|21.12.2023
|3:00 PM
|web3-3 went down because of a DDoS attack, and the instance died again after a reboot
|5 hours
|Reattached the root disk from the backup kept on another host that served as a backup
|
|-
|15
|06.02.2023
|7:00 AM
|06.02.2023
|9:00 AM
|bloomexusa.com was no longer available
|3 hours
|
|
|-
|XX
|14.02.2023
|1:00 PM
|14.02.2023
|7:00 PM
|chat.bloomex.ca - Watson bot dead
|6 hours
|Cherepanov: deleting the queues and containers helped; the logs are there, I'll take a look during the day and find the cause. Offhand, the supervisor hung about 5-6 hours ago
|
|-
|16
|06.03.2024
|20:17 UTC
|06.03.2024
|20:36 UTC
|mail.necs.ca server down
|0 hrs 19 mins
|Due to an urgent SSH access issue we had to edit configuration files manually, which stopped the server. After the files were edited directly on the server drive, it was booted back up immediately.
|The sshd_config file should NOT be modified in the future while a key-only login policy is in place.
|-
|17
|07.03.2024
|21:47 UTC
|07.03.2024
|22:00 UTC
|adm.bloomex.ca server down
|0 hrs 13 mins
|No free space left on the server storage due to extremely large log files. No Zabbix alerts fired because Zabbix had been turned off for planned maintenance.
|Important but large log files should not be kept on the production server; they are due to be moved to another location. Zabbix should not be stopped for extended periods of time.
|-
|18
|11.03.2024
|11:00 UTC
|11.03.2024
|17:00 UTC
|adm-eu.necs.ca down (status 500) (<nowiki>http://tasks.bloomex.ca/redmine/issues/16047</nowiki>), odd behavior of store orders/confirmation emails
|6 hrs 0 mins
|Cron on the production server started recurring jobs more often than a single run takes to complete, which let the MySQL process queue grow until it hit its limit.
|It is recommended to reconfigure the production cron jobs so that a new run waits for the previous one to finish whenever a job takes longer than its cycle (a flock-based sketch is given after the table).
|-
|19
|18.03.2024
|14:47
|18.03.2024
|18:05
|sip2.bloomex.ca issue with NZ inbound lines
|3 hours 13 mins
|Created a new server from backup and restored the configuration
|Apparently caused by the "deletion of old unnecessary numbers"
|-
|20
|19.04.2024
|12:41
|19.04.2024
|12:54
|Localshop sites down
|13 min
|Fixed the PHP socket name
|[[Create localshop (Laravel on dev3-2)|The manual for creating new shops was updated: Create localshop (Laravel on dev3-2)]]
|-
|21
|27.04.2024
|13:23 EDT
|27.04.2024
|16:10 EDT
|DDoS on Bloomex.ca from an unidentified network
|2 hrs 47 min
|Apache on the production server was restarted to drop the enormous backlog of requests. After that, Attack mode was enabled on our Cloudflare account.
|The harmful traffic ramped up slowly and was not easily identifiable as a deliberate attack during the first hour of the incident. This is considered normal, as our production nodes can withstand some amount of harmful requests before significant service degradation occurs.
|-
|22
|30.04.2024
|13:16 EDT
|30.04.2024
|13:45 EDT
|DDoS on Bloomex.com.au from an unidentified network
|29 min
|Apache on the production server was restarted to drop the enormous backlog of requests. After that, Attack mode was enabled on our Cloudflare account for the Australian site.
|The harmful traffic ramped up slowly and was not easily identifiable as a deliberate attack during the first hour of the incident. This is considered normal, as our production nodes can withstand some amount of harmful requests before significant service degradation occurs. Still, we need to tune our monitoring to react to suspicious high-load patterns faster, at least BEFORE the head of the organization becomes aware of them.
|-
|23
|9.05.2024
|1:31 PM
|9.05.2024
|4:03 PM
|GE network issues
|2 hours 32 min
|After troubleshooting and rebooting the switches and the gateway, the gateway's OpenVPN profile was reconfigured, which fixed the problem
|
|-
|24
|11.05.2024
|12:12 EDT
|11.05.2024
|13:28 EDT
|DDoS on bloomex.ca main prod
|0 minutes (no downtime)
|The attack was stopped by switching on Attack Mode / transparent cache (dev mode) for short periods during the high-load window. No business impact.
|
|-
|25
|21.05.2024
|5:06 AM
|21.05.2024
|9:08 AM
|Mailbot did not work
|4 h 2 min
|The SSL certificate was not renewed automatically by the crontab job; it had to be renewed manually and Mailbot restarted.
|
|-
|25
|28.05.2024
|11:45
|28.05.2024
|13:00
|DDoS on bloomex.ca main prod
|Partial downtime, ~10 min total
|The attack was stopped by switching on Attack Mode / transparent cache (dev mode) for short periods during the high-load window. No business impact.
|
|-
|26
|03.06.2024
|4:24
|03.06.2024
|7:49
|Users could not add items to the cart
|3 h 25 min
|The server ran out of free inodes; freed them and fixed the root cause of the high inode consumption
|<nowiki>http://tasks.bloomex.ca/redmine/issues/18106</nowiki>. If inode usage is at 100%, run <code><nowiki>find /mnt/storage2 -xdev -printf '%h\n' | sort | uniq -c | sort -rn | head -n 20</nowiki></code> and you will get the PHP session path; after that run <code>find /mnt/storage2/www/stage2.bloomex.ca/php-session -type f -delete</code>. For the mount, change the PHP session location.
|-
|27
|05.06.2024
|20:41
|05.06.2024
|21:35
|bloomex.ca was not available
|56 mins
|Prometheus overloaded the prod DB. Changed the DB source for Prometheus to prod-replica and killed the running Prometheus queries in the prod DB
|
|-
|28
|27.06.2024
|14:30
|27.06.2024
|22:00
|adm-eu@bloomex.ca credentials compromised; mail.necs.ca down
|Partial downtime throughout the day, ~4 hours total
|Spam bombing via adm-eu; Zimbra could not start; due to a workload issue that has now been fully addressed, some emails sent between 08:00 and 14:00 EDT MIGHT be lost
|Changed the credentials for the adm-eu mailbox; blocked the attackers by IP; added resources to the Zimbra server host; cleaned the queues
|-
|29
|01.07.2024
|19:00 EDT
|01.07.2024
|22:00 EDT
|victory.nadum@bloomex.ca mailbox compromised, used for spam messaging. Uptime not affected.
|No downtime
|Compromised mailbox deactivated; creds changed
|
|-
|30
|03.07.2024
|11:27 EDT
|03.07.2024
|11:30 EDT
|The DNS name and resolved IP were changed in Cloudflare; after some time the prod server went down because the host instance did not have enough resources to run stably (t2.micro instead of m5.2xlarge)
|3 minutes
|The DNS name and resolved IP in Cloudflare were pointed back to the old host
|
|-
|31
|05.07.2024
|10:53 AM PST
|05.07.2024
|11:13 AM PST
|I turned off api2 at Eershov's request (audit letter: "Bloomex team migrated the API server from Debian 9 to Debian 11 in a new server instance around 3 weeks before engaging for PFI. The old instance is still up and running."); NZ and the adm payments went down, so I turned it back on because NZ works through it
|20 min
|Created a ticket for Levon to fix the workflow; also switched adm.bloomex.ca and adm.bloomex.com.au from api2 to api-pay
|
|-
|32
|15.07.2024
|8:23 AM PST
|15.07.2024
|10:37 AM PST
|Local shops phone issues
|2 hours 24 min
|An issue with the media routes: they had been deleted
|
|-
|33
|19.07.2024
|06:34 AM PST
|19.07.2024
|8:20 AM PST
|Periodic downtime issues with retail shops because of ticket #19186
|5 minutes
|Fixed by commenting out the email credentials, rolling back the installation, and increasing server resources
|
|-
|34
|24.07.2024
|09:14 AM PST
|24.07.2024
|10:20 AM PST
|Local shops phone issues: the voice stream was not transferred, clients and managers could not hear each other
|~ 1 hour
|Caused by a revoked AWS security group named Rostov_main that covered port range 10000 - 20000
|To resolve it, the SG was restored and renamed to the correct name. <nowiki>http://tasks.bloomex.ca/redmine/issues/19280</nowiki>
|-
|35
|29.07.2024
|18:24 EDT
|29.07.2024
|21:10 EDT
|Outdated security certificate on mail.necs.ca
|No actual downtime; some performance degradation
|[[Mail server|Actions run as per the Mail server page]]
|A certificate can be renewed in several ways. BE AWARE that for OUR MAIL SERVER you should only use the link in the cell to the left <--
|-
|36
|07.08.2024
|08:30 EDT
|07.08.2024
|09:15 EDT
|Curl Error: SSL certificate problem: unable to get local issuer certificate
|~45 mins
|Caused by moving the adm-eu instance from the Frankfurt region to the Oregon region
|Added a security group for the instance with the API IP
|-
|37
|11.08.2024
|10:30 EDT
|11.08.2024
|10:55 EDT
|DDoS on bloomex.com.au
|No downtime
|The DDoS on the Australian site was blocked immediately by Cloudflare rules (Under Attack Mode and custom WAF rules were enabled at the time)
|
|-
|38
|17.08.2024
|18:20 EDT (APPRX)
|18.08.2024
|18:20 EDT
|bloomex.ca malfunction: several operations, including placing an order, could not be performed. An external audit recommendation to blacklist the backend server in the frontend server's firewall left the site functioning improperly; the following command caused the accident: <code>iptables -A INPUT -s 195.2.92.206 -j DROP && sudo iptables -A INPUT -s 37.1.213.196 -j DROP && sudo iptables -A INPUT -s 34.210.253.67 -j DROP && sudo iptables -A OUTPUT -d 195.2.92.206 -j DROP && sudo iptables -A OUTPUT -d 37.1.213.196 -j DROP && sudo iptables -A OUTPUT -d 34.210.253.67 -j DROP</code>
|No actual downtime; performance degradation for ~24 hrs
|Rolling back the new firewall rules resolved the issue. The rollback command (deleting the rules with -D) looks as follows: <code>sudo iptables -D INPUT -s 37.1.213.196 -j DROP && sudo iptables -D OUTPUT -d 37.1.213.196 -j DROP && sudo iptables -D INPUT -s 195.2.92.206 -j DROP && sudo iptables -D OUTPUT -d 195.2.92.206 -j DROP</code>
|Giving more attention to audit requests is recommended
|-
|39
|06.09.2024
|16:00 EDT
|06.09.2024
|19:00 EDT
|Problems with the internet provider for the GE office
|~3 hours
|The gateway's ISP hot-switch mechanism failed; the switch was performed manually. Need to fix the hot-switch or replace the current gateway PC with a FortiGate
|<nowiki>http://tasks.bloomex.ca/redmine/issues/20151</nowiki> <nowiki>http://tasks.bloomex.ca/redmine/issues/20152</nowiki>
|-
|40
|07.09.2024
|~10:00 EDT
|07.09.2024
|~11:30 EDT
|Retail shops phone issue: all of them were dropped from the queue several times
|~1 hour
|Looks like internet connection problems for Carlton Place and some regions too. The problem went away by itself.
|<nowiki>http://tasks.bloomex.ca/redmine/issues/20158</nowiki>
|-
|41
|09.09.2024
|~07:00 EDT
|09.09.2024
|~11:30 EDT
|Retail shops phone issue: none of the retail subnets were working: phones, cameras, gateways
|~4.5 hours
|The OpenVPN service was reloaded and all routes were lost. Routes were re-added manually and a backup file was created.
|<nowiki>http://tasks.bloomex.ca/redmine/issues/20326</nowiki>
|-
|42
|09.09.2024
|~09:30 EDT
|09.09.2024
|~23:30 EDT
|Host 10.0.0.91 had performance problems, but the reason was not immediately clear.
|~14 hours
|DDoS attack from an unknown AWS instance; a direct attack bypassing all external firewall rules. After blocking it with iptables, all performance issues were gone.
|<nowiki>http://tasks.bloomex.ca/redmine/issues/20185</nowiki>
|-
|43
|10.09.2024
|09:30 EDT
|10.09.2024
|13:30 EDT
|As a result of the investigation of the previous incident, more than 30 active sessions were found on the VPN server. When they were disabled, the VPN service crashed, but was restored within half an hour. However, after 20 minutes the service dropped again for unknown reasons. It was decided to change the root password and reboot the server. After this, all dynamic routes and firewall rules were lost, which broke 2 of the 3 subnets.
|4 hours
|All routes and firewall rules were restored from backups
|Ticket needed
|-
|44
|26.10.2024
|03:26 EDT
|26.10.2024
|08:00 EDT
|Verification codes did not reach some users. The problem was detected on the CA admin side, but the root cause was that the mailsender ran out of disk space
|No actual downtime; for ~4.5 hours some mail did not reach users
|Because the admin systems' mail logic flows through PHP code to the mailsender, it breaks if anything happens to the latter.
|<nowiki>http://tasks.bloomex.ca/redmine/issues/21141</nowiki>
|-
|45
|09.11.2024
|00:12 EDT
|09.11.2024
|17:00 EST
|Label printing did not work via Purolator, FedEx, Canada Post
|No actual downtime
|Upgrading the server OS from Debian 11 to Debian 12 disabled some PHP modules (soap)
|<nowiki>http://tasks.bloomex.ca/issues/21318</nowiki>
|-
|46
|12.11.2024
|11:00 EDT
|12.11.2024
|12:00 EDT
|Canadian admin down
|No actual downtime
|The low-RAM issue was resolved by restarting the php-fpm 7.0 socket, which reduced the number of active connections. The spike in connections was due to a DDoS attack.
|
|-
|47
|14.11.2024
|05:40 EDT
|14.11.2024
|06:00 EDT
|The VPN server ran out of disk space
|~20 mins
|/var/log ran out of free space because logs from the EDR, Velociraptor, auditd, and rsyslog ate all of it
|Set up log rotation for /var/log (a logrotate sketch is given after the table)
|-
|48
|30.11.2024
|07:23 EDT
|30.11.2024
|08:23 EDT
|SIP clients could not finish registration on Asterisk
|1 h
|Asterisk did not accept registrations from SIP clients, although everything else was working fine.
|Fixed by a reboot after an unsuccessful investigation. Added ulimit 65353 (a persistent systemd sketch is given after the table). <nowiki>https://tasks.bloomex.ca/issues/21584</nowiki>
|-
|49
|12.12.2024
|14:30 EDT
|12.12.2024
|16:20 EDT
|Huge DDoS attack on chat.bloomex.ca
|No actual downtime; performance degradation for ~2 hrs
|Huge DDoS attack on chat.bloomex.ca from many IPs
|Added block rules and resized the instance. <nowiki>https://tasks.bloomex.ca/issues/21961</nowiki>
|-
|50
|12.01.2025
|20:00 UTC
|12.01.2025
|23:00 UTC
|The AU admin system hung due to peak CPU load on main-db1, caused by a spike in database connections.
|~2 hrs
|The overload on the DBMS subsided on its own over time
|The exact cause of the connection spike still needs to be found; most likely it is requests from the frontend
|-
|51
|21.01.2025
|13:20 UTC
|21.01.2025
|15:50 UTC
|The CA, AU, and USA admin systems hung due to peak CPU load on main-db1/main-db2, caused by a spike in database connections.
|~1.5 hrs
|
|
|-
|52
|22.01.2025
|2:40 UTC
|22.01.2025
|4:50 UTC
|The AU admin system hung due to peak CPU load on main-db1, caused by a spike in database connections.
|~1.5 hrs
|Huge DDoS from msnbot against bloomex.com.au and from AndroidDownloadManager/5.1 against bloomexusa, plus a number of other unspecified addresses.
|The exact cause of the connection spike still needs to be found; most likely it is requests from the frontend
|-
|53
|30.01.2025
|1:18 UTC
|30.01.2025
|1:57 UTC
|The retail host became unavailable over the network at one point.
|39 mins 30 sec
|For a short period of time, the load on the DB increased. The retail sites simply became unavailable over the network for no apparent reason.
|Changing the retail host's machine plan (instance size) helped. The host was moved to another resource pool. Changing the network interface during the resize may also have helped.
|} |
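
The blocking approach described in incidents 2-5 can be approximated with a short shell pipeline. This is a minimal sketch, not the exact expression used at the time: the log path comes from the table above, the ~10k-request threshold is the rule of thumb quoted there, and the IP in the iptables line is a documentation-range placeholder.

<pre>
# Top 20 client IPs by request count (combined log format: the client IP is field 1).
awk '{print $1}' /var/log/apache2/www.bloomex.com.au-access.log \
  | sort | uniq -c | sort -rn | head -n 20

# Block a single offender once it clearly exceeds the ~10k-request threshold.
# 203.0.113.10 is a placeholder address, not a real attacker.
sudo iptables -A INPUT -s 203.0.113.10 -j DROP
</pre>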
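For the certificate incidents (8, 13, 35), a simple external expiry check can run from cron alongside certbot's own renewal. This is a sketch under assumptions: the host list is illustrative, the 14-day threshold is arbitrary, and GNU date (as shipped on Debian) is assumed.

<pre>
#!/usr/bin/env bash
# Warn when a public HTTPS certificate is close to expiry.
set -u
WARN_DAYS=14
for host in api.bloomex.ca mail.necs.ca; do   # illustrative host list
  expiry=$(echo | openssl s_client -connect "$host:443" -servername "$host" 2>/dev/null \
             | openssl x509 -noout -enddate | cut -d= -f2)
  days_left=$(( ( $(date -d "$expiry" +%s) - $(date +%s) ) / 86400 ))
  if [ "$days_left" -lt "$WARN_DAYS" ]; then
    echo "WARNING: certificate for $host expires in $days_left days"
  fi
done

# A non-destructive check that certbot's own renewal path still works:
# certbot renew --dry-run
</pre>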
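The cron recommendation from incident 18 can be implemented with flock(1), so that a new run simply skips its slot while the previous run is still going. This is a sketch, not the production crontab: the job name, path, schedule, and user are placeholders.

<pre>
# /etc/cron.d/orders-sync  (hypothetical job; adjust the name, schedule, and path)
# flock -n takes an exclusive lock and exits immediately if the previous run
# still holds it, so overlapping runs can no longer pile up MySQL processes.
*/5 * * * * www-data flock -n /var/lock/orders-sync.lock /usr/local/bin/orders-sync.sh >> /var/log/orders-sync.log 2>&1
</pre>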
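For incident 47, a dedicated logrotate rule keeps the noisy agent logs under /var/log bounded. A minimal sketch, assuming the Velociraptor log path shown; rsyslog already ships its own /etc/logrotate.d/rsyslog rule, and auditd rotates itself via max_log_file / max_log_file_action in auditd.conf, so those usually need tighter limits rather than a new rule.

<pre>
# /etc/logrotate.d/velociraptor  (hypothetical file; adjust the glob to the real path)
/var/log/velociraptor/*.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}
</pre>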
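To make the ulimit change from incident 48 survive reboots, the limit can be set on the Asterisk service itself. This sketch assumes Asterisk runs under systemd as asterisk.service; the value 65353 is the one recorded in the table.

<pre>
# /etc/systemd/system/asterisk.service.d/limits.conf
# Raise the open-files limit for the Asterisk service only.
[Service]
LimitNOFILE=65353
</pre>

After creating the drop-in, apply it with <code>systemctl daemon-reload && systemctl restart asterisk</code>.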