On April 3rd at 7:26 AM EDT, our application became unresponsive and was unable to serve requests. We began remediation as soon as we became aware of the issue and implemented a temporary fix by 8:25 AM EDT, at which point we believed the underlying cause had been resolved. On April 4th at 8:48 AM EDT, the application again became unresponsive. We responded quickly and deployed another temporary fix by 9:03 AM EDT on April 4th. Subsequent investigation identified the root cause as the unintended accumulation of temporary server operating logs, which exhausted disk space on the application server. A permanent fix has since been implemented to prevent further log accumulation, and no additional space consumption has been observed. No action is required from our customers regarding this incident.
Root Cause:
During runtime, our logging system was failing to send logs to our external logging tools, which caused the server to fall back to logging locally on the server itself. Because the logging directory was on ephemeral storage, each new deployment temporarily cleared it and masked the issue. By April 3rd at 7:26 AM EDT, that temporary storage had completely filled, leaving the server without any disk space for swap and rendering it unable to respond to requests.
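For illustration only, the sketch below shows the general class of safeguard the permanent fix provides: capping how much disk any local fallback logging can consume. It uses Python's standard logging module with assumed file names and limits; it is not a description of our actual application or configuration.

    import logging
    from logging.handlers import RotatingFileHandler

    # Assumed fallback path and limits -- illustrative only.
    fallback = RotatingFileHandler(
        "app-fallback.log",          # local fallback file (assumed name)
        maxBytes=50 * 1024 * 1024,   # rotate once a file reaches ~50 MB
        backupCount=5,               # keep at most 5 rotated files
    )
    # Worst-case local footprint is bounded at roughly (backupCount + 1) * maxBytes.
    logging.getLogger().addHandler(fallback)
    logging.getLogger().warning("external log sink unreachable; writing locally, but bounded")

With rotation in place, a failure of the external logging pipeline can no longer translate into unbounded growth of local logs on the server's ephemeral storage.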
Actions: