In our effort to uphold Information Security Management Systems (ISMS) and enhance resource usage and performance, we explore the foundational principles of Site Reliability Engineering (SRE), focusing on the Four Golden Signals.
Four golden signals
Implementing the four golden signals is important to WorkingMouse because it provides a comprehensive approach to monitoring and optimising system performance. The four golden signals, which include latency, traffic, errors, and saturation, offer crucial insights into the health and efficiency of a system. Monitoring the four golden signals allows us to identify potential issues, understand user experience, detect errors, and assess system capacity. This allows us to proactively address issues, enhance reliability, and ensure smooth operation of systems. Ultimately this leads to improved overall performance, user satisfaction and security.
Introducing SRE
Maintaining the stability and reliability of the infrastructure at WorkingMouse is crucial, not only for our Information Security Management Systems (ISMS) but to additionally monitor our resource usage and performance. Implementing strong monitoring systems enables us to proactively detect and address potential issues or abnormalities before they develop into significant problems. This approach ensures the continuous operations of our systems, mitigating the risk of downtime or performance decline.
The concept of Golden Signals was initially introduced by Google within the framework of Site Reliability Engineering (SRE) practices and has since become the essential framework for effective monitoring. SRE is used to continuously build and maintain more reliable services and is a functional way to apply software development solutions to IT operation problems.
An SRE’s role is to improve the overall resilience of a system and provide visibility to the health and performance of services across all applications and infrastructure. The four golden signals are the essential building blocks for an effective monitoring strategy. The fundamental idea behind Golden Signals centres on implementing Google’s four important metrics: latency, traffic, errors, and saturation. This approach promotes observability by monitoring application run times, user experiences, synthetic or black-box monitoring, and creating informative dashboards for the monitored components. The four golden signals serve as crucial metrics for understanding and measuring the health and performance of applications.
The foundations
Latency
- Refers to the time a system takes to respond to a request, it is commonly measured in milliseconds and can be segmented into components like network latency, server processing time, and client rendering time. Maintaining low latency is essential for a positive user experience, as users anticipate prompt responses from applications. Monitoring latency serves as a valuable tool in recognising bottlenecks and performance issues that could potentially affect responsiveness, ensuring a seamless user experience with the system.
Traffic
- Represents the volume of requests or transactions processed by a system, quantified through metrics, such as requests per second (RPS) or transactions per second (TPS). As traffic increases, stress on the system increases. Monitoring traffic is crucial for teams to understand usage patterns, identify peaks, and strategise for scalability. Unexpected surges in traffic pose risks of performance deterioration or outages. Additionally, the risk increases if the system is not adequately prepared to an increased load. By being vigilant and observing traffic, teams can proactively manage system resources and ensure smooth functionality even during peak periods.
Errors
- Errors signify situations in which the system fails to fulfill a request successfully, including HTTP error codes, such as 404 for not found, 500 for server error, or application-specific error messages. Monitoring errors is crucial as it plays a vital role in identifying issues that impact both user experience and system reliability. High error rates may indicate the presence of bugs, misconfigurations, or external factors that affect the system. Emphasising the importance of attentive error monitoring for overall system health.
Saturation
- Measures the extent to which a systems resources are utilised, including CPU usage, memory usage, disk input/output (I/O), and network bandwidth. Monitoring saturation is crucial for teams to gauge how close a system is to its capacity limits. High saturation levels can result in performance degradation or outages. This signal is vital for capacity planning, ensuring that the system possesses adequate resources to handle both current and future workloads.
Implementing AC
To help ensure the smooth operation of our systems and prevent any potential downtime or performance degradation, we implement acceptance criteria (AC) based on the Four Golden Signals framework.
Latency
- Successful HTTP request latency is tracked
- Failed HTTP request latency is tracked
- Successful DB request latency is tracked for the postgres database
- Alerts are configured for unusual latency values for both failed requests and successful requests
- Latency monitoring is documented in the DevOps knowledge base
Traffic
- HTTP requests per second are tracked to identify demand
- Database demand is measured by reads and writes per second
- Alters are configured for unusual demand
- Traffic monitoring is documented in the DevOps knowledge base
Errors
- Errors are logged with the frequency and amount being tracked
- Spikes in errors trigger an alert
- Error tracking is documented in the DevOps knowledge base
Saturation
- CPU utilisation is tracked
- Memory utilisation is tracked
- Persistent storage is tracked (db file system allocation and cluster file system allocation)
- Alerts are configured resource saturation (i.e. hitting memory limits) or unusual resources usage (i.e. CPU spike)
- Saturation monitoring is documented in the DevOps knowledge base
Logging & monitoring (ISMS)
At WorkingMouse, we have adopted a thorough approach to Information Security Management Systems which is proven through our ISO27001 accreditation, read our full blog on our accreditation here. Systems are configured to generate logs and alerts in response to excessive port scanning. For added security measures, when a network-based intrusion detection system is in place, it is essential to configure it to alert the network operations team about activities.
Ensuring network availability is a priority, and this is achieved through the implementation of suitable simple network management protocol (SNMP)-based network management tools, such as Nagios or WhatsUp Gold. Automation of recovery actions is seamlessly integrated where possible.
DevOps reporting
To deliver quality and reliable software, we integrate the Four Golden signals into our DevOps report. This allows us to refine our monitoring and optimisation practices to enhance the reliability of our services. We’ve been doing this by providing detailed insights into Latency, Traffic, Errors, and Saturation. Our DevOps report equips our customers with a clear understanding of their applications’ performance. This enhancement enables our DevOps team to proactively address potential issues, ensuring a smoother and more responsive user experience. This addition to our reporting framework adds significant value between WorkingMouse and our customers, supporting a more informed and engaged approach to overseeing your digital environment.
Conclusion
To wrap things up… the integration of the Four Golden Signals into our DevOps report represents the value and commitment to delivering quality and reliable software at WorkingMouse. By providing detailed insights into the Latency, Traffic, Errors, and Saturation, our reporting framework allows a deeper understanding of our customers application performance. Providing this proactive approach ensures the continual operation of our systems, minimising the risk of downtime or performance decline. This ensures the stability and reliability of WorkingMouse’s infrastructure for both our ISMS and monitoring resource usage which is valuable to our customers and ourselves.