Modern systems span numerous architectures and technologies and are becoming exponentially more modular, dynamic, and distributed in nature. These complexities also pose new challenges for developers and SRE teams that are charged with ensuring the availability, reliability, and successful performance of their systems and infrastructure. Here, you will find resources about the tools, skills, and practices to implement for a strategic, holistic approach to system-wide observability and application monitoring.
Today, we're proudly announcing the launch of Kubecost 2.0. It's available for free to all users and can be accessed in seconds. It is our most radical release yet: we're shipping more than a dozen major new features and an entirely new API backend. Let's delve into the key features and enhancements that make Kubecost 2.0 the best Kubernetes cost management solution available. Here's an overview of all the great new functionality you can find in Kubecost 2.0:

Network Monitoring Visualizes All Traffic Costs

Kubecost's Network Monitoring provides full visibility into Kubernetes and cloud network costs. By monitoring the cost of pods, namespaces, clusters, and cloud services, you can quickly determine which parts of your infrastructure are driving spend in near real time. Interacting with this feature, you can discover more about the sources of your inbound and outbound traffic costs, drag and drop icons, or home in on specific traffic routes. This functionality is especially helpful for larger organizations or teams hoping to learn more about their complex network costs. Learn more in our Network Monitoring doc.

Collections Combine Kubernetes and Cloud Costs

The new Collections page lets you create custom spend categories comprised of both Kubernetes and cloud costs while removing any overlapping or duplicate costs. This is especially helpful for teams with complex and multi-faceted cost sources that don't wish to relabel their costs in the cloud or in Kubecost. Additionally, aggregating and filtering ensure you only see the costs you want to see and nothing else. Read more in our Collections doc.

Kubecost Actions

Kubecost Actions provides users with automated workflows to optimize Kubernetes costs. It's available today with three core actions: dynamic request sizing, cluster turndown, and namespace turndown. We've made it easier to create your schedules and get the most out of our savings functionality. Learn more in our Actions doc.

Forecast Spend With Machine Learning

New machine learning-based forecasting models leverage historical Kubernetes and cloud data to provide accurate predictions, allowing teams to anticipate cost fluctuations and allocate resources efficiently. You can access forecasting through Kubecost's major monitoring dashboards (Allocations, Assets, and the Cloud Cost Explorer) by selecting any future date range. You will then see projected future costs alongside your realized spending. Learn about forecasting here.

Anomaly Detection

Anomaly Detection takes cost forecasting a step further by letting you detect when actual spend deviates from the spend predicted by Kubecost. You can quickly identify unexpected spending across key areas and address overages where appropriate, ensuring that cloud or Kubernetes spending does not significantly exceed expectations. Read more in our Anomaly Detection doc.

100X Performance Improvement at Scale

Kubecost 2.0 introduces a major upgrade with a new API backend, delivering a massive 100x performance improvement at scale, coupled with a 3x improvement in resource efficiency. This means teams can now experience significantly faster and more responsive interactions with both the Kubecost APIs and UI, especially when dealing with large-scale Kubernetes environments. The ability to query 3+ years of historical data provides engineering and FinOps teams with a comprehensive view of resource utilization trends, enabling more informed decision-making and long-term trend analysis.
Installing Kubecost

You can upgrade to Kubecost 2.0 in seconds, and it's free to install. Get started with the following helm command to begin visualizing your Kubernetes costs and identifying optimizations:

Shell
helm install kubecost cost-analyzer \
  --repo https://kubecost.github.io/cost-analyzer/ \
  --namespace kubecost --create-namespace \
  --set kubecostToken="YnJldHRAa3ViZWNvc3QuY29txm343yadf98"

Next Steps

This is only a preview of the key features now available with Kubecost 2.0. Check out our full release notes to read about all the great features available in self-managed Kubecost. Other notable features of this release include real-time cost learning, team access management, monitoring shared GPUs, and more. Want to see Kubecost 2.0 in action? Join our Kubecost 2.0 webinar on Thursday, February 15th at 1 PM ET (10 AM PT), where we will be doing a deep dive on the new functionality to show you how you can empower your team with granular, actionable insights for efficient Kubernetes operations.
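Once the chart is installed, one quick way to reach the Kubecost UI and API is to port-forward the cost-analyzer service. The commands below are a minimal sketch assuming the Helm chart's default service name (kubecost-cost-analyzer) and port (9090); the allocation query shown assumes the /model/allocation API and is illustrative only, so adjust names and parameters to your deployment.

Shell
# Port-forward the Kubecost UI/API to localhost (assumes default service name and port)
kubectl port-forward --namespace kubecost service/kubecost-cost-analyzer 9090:9090

# In another terminal, open http://localhost:9090 in a browser, or query cost
# allocation data directly, aggregated by namespace over the last 7 days
curl -s "http://localhost:9090/model/allocation?window=7d&aggregate=namespace"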
What Is Grafana?

Grafana is an open-source tool to visualize metrics and logs from different data sources. It can query those metrics, send alerts, and can be actively used for monitoring and observability, making it a popular tool for gaining insights. The metrics can be stored in various databases, and Grafana supports most of them, like Prometheus, Zabbix, Graphite, MySQL, PostgreSQL, Elasticsearch, etc. If a data source is not supported out of the box, customized plugins can be developed to integrate it. Grafana is widely used these days to monitor and visualize metrics for hundreds or thousands of servers, Kubernetes platforms, virtual machines, big data platforms, etc. The key feature of Grafana is its ability to share these metrics in visual form by creating dashboards, so that teams can collaborate on data analysis and provide support in real time.

Platforms supported by Grafana today include:

- Relational databases
- Cloud services like Google Cloud Monitoring, Amazon CloudWatch, and Azure Monitor
- Time-series databases like InfluxDB, for memory and CPU usage graphs
- Other data sources like Elasticsearch and Graphite

What Is Prometheus?

Prometheus is an open-source monitoring system and time-series database used for infrastructure monitoring and observability. It stores time-series data collected from various sources, such as applications written in different programming languages, virtual machines, databases, servers, Kubernetes clusters, etc. To query these metrics, it uses a query language called PromQL, which can be used to explore metrics over various time ranges and intervals and gain insight into the health of the systems mentioned above. To create dashboards, send alerts, and ensure observability, tools like Grafana are used alongside it.

What Is Zabbix?

Zabbix is used for comprehensive monitoring to ensure the reliability and efficiency of IT infrastructure such as networks, servers, and applications. It has three components: the Zabbix Server, the Zabbix Agent, and the Frontend. The Zabbix Server gathers the data, the Zabbix Agent collects and sends data to the Zabbix Server, and the Frontend is a web interface for configuration.

Comparison Between Zabbix and Prometheus

Primary Use Case
Prometheus: Collection of metrics from servers and services.
Zabbix: Comprehensive monitoring of network, servers, and applications.

Data Collection
Prometheus: Collects numeric metrics that can be viewed via HTTP endpoints.
Zabbix: Agents (Zabbix Agent) collect performance data, with SNMP, IPMI, and JMX support. Agentless monitoring is supported for certain scenarios.

Logging
Prometheus: Cannot collect or analyze logs.
Zabbix: Centralized log analysis is not possible with Zabbix. You can monitor log files, but no analysis capabilities are available.

Data Visualization
Prometheus: Data visualization is possible through numbers and graphs.
Zabbix: Graphical representation of monitored data through charts, graphs, maps, and dashboards.

Application Metrics
Prometheus: Application metrics can be collected if Prometheus is integrated with the web application.
Zabbix: No application-related metrics, dashboards, or alerts are available at this moment.

Service Metrics
Prometheus: Prometheus can collect metrics from web applications.
Zabbix: Zabbix can monitor services like HAProxy, databases such as MySQL and PostgreSQL, HTTP services, etc., and requires configuration/integration.

Custom Metrics
Prometheus: Custom metrics can be collected through exporters.
Zabbix: The Anomaly Detection feature is available from version 6.0.

Alerting System
Prometheus: Prometheus evaluates alerting rules itself but relies on the separate Alertmanager component to route and deliver notifications.
Zabbix: Robust alerting system with customizable triggers, actions, escalations, and notification channels. Alerts can trigger e-mail notifications, Slack, PagerDuty, Jira, etc.

Service Availability
Prometheus: Not available out of the box.
Zabbix: Zabbix has a built-in feature to generate Service Availability Reports. Planned and unplanned outage windows require manual input to the script that generates the service availability report.

Retention Policy
Prometheus: Local storage retention is configurable, from a single day up to many days (15 days by default).
Zabbix: Zabbix persists all the data in its own database.

Security Features
Prometheus: Prometheus and most exporters support TLS, including authentication of clients via TLS client certificates. Details on configuring Prometheus are here. The Go projects share the same TLS library, based on the Go crypto/tls package.
Zabbix: Robust user roles and permissions, granular control over user access, and secure communication between components.

Kubernetes Compatibility
Prometheus: Prometheus can be fully integrated with Kubernetes clusters. It takes full advantage of exporters to collect metrics and show them in the UI.
Zabbix: Zabbix is compatible with Kubernetes and can be used to monitor various aspects of the Kubernetes environment. There is no logging via a Kubernetes log forwarder: application logs cannot be forwarded to the Zabbix server. For container monitoring, the Zabbix Server can be configured to collect data from Zabbix Agents deployed on Kubernetes nodes and applications. Zabbix also scales horizontally, making it suitable for large Kubernetes deployments with a growing number of nodes and applications.

Adding a Data Source in Grafana

A data source is the location from which the metrics are sourced. These metrics are integrated into Grafana for visualization and other purposes.

Prerequisites

- You need an 'administrator' role in Grafana to make these changes.
- You need the connection details of the data source, such as the database name, login details, URL and port number of the database, and other relevant information.

The steps below can be used to add a data source:

1. Navigate to the sidebar and open the context menu, then click Configuration and then Data Sources (on newer versions, click Connections in the menu and create the data source from there).
2. Select the type of data source the metrics are sourced from. If it's a custom data source, click on the custom data source.
3. Provide the required connection details collected in the prerequisites.
4. Save and test the connectivity, and ensure there are no errors.
5. Once the data source is saved, you can explore the metrics and also create dashboards.

Demo on How To Integrate Grafana and Prometheus to Monitor the Metrics of a Server

Assumptions

- Operating system: CentOS 7 Linux virtual machine
- Internet access is available to download the packages
- Root access to the VM
- Single-machine setup: Grafana, Prometheus, and Node Exporter are installed on a single VM

Install Grafana

Disable SELinux and the firewall:

Plain Text
systemctl stop firewalld
systemctl disable firewalld

Add a yum repo for Grafana:

Plain Text
[grafana]
name=grafana
baseurl=https://packages.grafana.com/oss/rpm
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://packages.grafana.com/gpg.key
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt

Plain Text
[root@vm-grafana ~]# yum install grafana
[root@vm-grafana ~]# vim /etc/sysconfig/grafana-server

Edit the Grafana configuration file to add the port and IP where Grafana is installed.
Plain Text
[root@vm-grafana ~]# vim /etc/grafana/grafana.ini

Uncomment the following settings:

Plain Text
# The http port to use
http_port = 3000
# The public facing domain name used to access grafana from a browser
domain = 127.0.0.1

Restart the Grafana Server Service and Check the Logs

The log file location is /var/log/grafana/grafana.log.

Plain Text
[root@vm-grafana ~]# tail -f /var/log/grafana/grafana.log
[root@vm-grafana ~]# systemctl restart grafana-server
[root@vm-grafana ~]# systemctl status grafana-server

Connect to the Web UI

Grafana listens on port 3000.

Image 1: Grafana Web UI

Install Prometheus and Node Exporter

Download the Prometheus package:

Plain Text
yum install wget
wget https://github.com/prometheus/prometheus/releases/download/v2.49.1/prometheus-2.49.1.linux-amd64.tar.gz

Installation:

Plain Text
useradd --no-create-home --shell /bin/false prometheus
mkdir /etc/prometheus
mkdir /var/lib/prometheus
chown prometheus:prometheus /etc/prometheus
chown prometheus:prometheus /var/lib/prometheus

Extract the package and copy the binaries and console files:

Plain Text
[root@vm-grafana ~]# tar zxvf prometheus-2.49.1.linux-amd64.tar.gz
mv prometheus-2.49.1.linux-amd64 prometheuspackage
cp prometheuspackage/prometheus /usr/local/bin/
cp prometheuspackage/promtool /usr/local/bin/
chown prometheus:prometheus /usr/local/bin/prometheus
chown prometheus:prometheus /usr/local/bin/promtool
cp -r prometheuspackage/consoles /etc/prometheus
cp -r prometheuspackage/console_libraries /etc/prometheus
chown -R prometheus:prometheus /etc/prometheus/consoles
chown -R prometheus:prometheus /etc/prometheus/console_libraries
chown prometheus:prometheus /etc/prometheus/prometheus.yml

Edit the configuration file (/etc/prometheus/prometheus.yml):

Plain Text
global:
  scrape_interval: 10s

scrape_configs:
  - job_name: 'prometheus_master'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']

Create a Linux service file for Prometheus:

Plain Text
vim /etc/systemd/system/prometheus.service

[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --config.file /etc/prometheus/prometheus.yml \
  --storage.tsdb.path /var/lib/prometheus/ \
  --web.console.templates=/etc/prometheus/consoles \
  --web.console.libraries=/etc/prometheus/console_libraries

[Install]
WantedBy=multi-user.target

Start the service:

Plain Text
systemctl daemon-reload
systemctl start prometheus
systemctl status prometheus

Access the Prometheus Web UI on port 9090.

Image 2: Prometheus Web UI
Image 3: Prometheus Web UI

Monitor a Linux Server Using Prometheus and Integration With node_exporter

Download the Setup

Plain Text
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
[root@vm-grafana ~]# tar zxvf node_exporter-1.7.0.linux-amd64.tar.gz
[root@vm-grafana ~]# ls -ld node_exporter-1.7.0.linux-amd64
drwxr-xr-x. 2 prometheus prometheus 56 Nov 13 00:03 node_exporter-1.7.0.linux-amd64

Setup Instructions

Plain Text
useradd -rs /bin/false nodeusr
mv node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/

Create a Service File for the Node Exporter

Plain Text
vim /etc/systemd/system/node_exporter.service

[Unit]
Description=Node Exporter
After=network.target

[Service]
User=nodeusr
Group=nodeusr
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target

Reload the System Daemon and Start the Node Exporter Service

Plain Text
systemctl daemon-reload
systemctl restart node_exporter
systemctl enable node_exporter

View the metrics by browsing to the node_exporter URL (port 9100).
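If you prefer the command line, you can confirm that node_exporter is serving metrics before wiring it into Prometheus. This is a minimal check assuming the default node_exporter port (9100) on the same VM:

Shell
# Fetch the first few exposed metrics from node_exporter on its default port
curl -s http://localhost:9100/metrics | head -n 20

# Or just confirm the endpoint responds with HTTP 200
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9100/metrics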
Image 4: node_exporter Web UI

Integrate node_exporter With Prometheus

Log in to the Prometheus server and modify the prometheus.yml configuration file. Add the following job under scrape_configs:

Plain Text
vim /etc/prometheus/prometheus.yml

  - job_name: 'node_exporter_centos'
    scrape_interval: 5s
    static_configs:
      - targets: ['TARGET_SERVER_IP:9100']

Restart the Prometheus service:

Plain Text
systemctl restart prometheus

Log in to the Prometheus Server Web Interface and Check the Targets

Open the Targets page (Status > Targets) in the Prometheus web UI.

Image 5: node_exporter in Prometheus
Image 6: node_exporter in Prometheus

You can click on Graph, query any server metric, and click Execute to show the output in the console or as a graph.

Image 7: Metrics from the VM
Image 8: Building a Graph from the Query Metrics

Add Prometheus as a Data Source in Grafana

Click on Add a new data source and add Prometheus as the data source by entering the Prometheus URL.

Image 9: Add a new data source

Import a pre-built dashboard from Grafana using its ID (Dashboard ID: 10180). Click on Dashboards and go to the imported dashboard. You should now be able to see all the metrics of the server.

Image 10: Grafana Dashboard

Next Steps

As next steps, users can explore setting up alerts, adding role-based access control, importing metrics from remote servers, and Grafana administration.
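Beyond the pre-built dashboard, you can also write your own PromQL queries against the node_exporter metrics, either in the Prometheus UI or in a Grafana panel. The queries below are common examples based on standard node_exporter metric names; adjust the label filters (for example, the instance or mountpoint labels) to match your environment.

Plain Text
# Average CPU utilization per instance over the last 5 minutes (percent)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Percentage of memory currently available
(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Root filesystem usage (percent)
100 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100)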
OpenTelemetry and Prometheus are both open source, but the choice between them can make a significant difference in how you observe and operate your cloud application. While OpenTelemetry is ideal for cloud-native applications and focuses on monitoring and improving application performance, Prometheus prioritizes reliability and accuracy. So, which one is the ideal option for your observability needs?

The answer to this question is not as straightforward as you might expect. Both OpenTelemetry and Prometheus have their own strengths and weaknesses, catering to different needs and priorities. If you are unsure which option to go for, this blog post aims to be your guiding light through the intricacies of OpenTelemetry vs. Prometheus. We will unravel their architectures, dissect ease of use, delve into pricing considerations, and weigh the advantages and disadvantages.

What Is OpenTelemetry?

To decide between OpenTelemetry and Prometheus, we must first understand each option. Let's begin by decoding OpenTelemetry. OpenTelemetry is an open-source observability framework designed to provide comprehensive insights into the performance and behavior of software applications. Created as a merger of OpenTracing and OpenCensus, OpenTelemetry is now a Cloud Native Computing Foundation (CNCF) project, enjoying widespread adoption within the developer community.

OTel Architecture

The OpenTelemetry architecture reflects this multi-dimensional vision. It comprises crucial components like:

The API
It acts as a universal translator, enabling applications to "speak" the language of telemetry, regardless of language or framework. These APIs provide a standardized way to generate traces, metrics, and logs, ensuring consistency and interoperability across tools. They also offer a flexible foundation for instrumenting code and capturing telemetry data in a structured format.

The SDKs
Language-specific libraries (available for Java, Python, JavaScript, Go, and more) that implement the OpenTelemetry API. They provide convenient tools to instrument code, generate telemetry data, and send it to the collector. SDKs simplify the process of capturing telemetry data, making it easier for developers to integrate observability into their applications.

The Collector
The OTel Collector is the heart of the OpenTelemetry architecture and is responsible for receiving, processing, and exporting telemetry data to various backends. It can be deployed as an agent on each host or as a centralized service. OpenTelemetry offers a range of configurations and exporters for seamless integration with popular observability tools like Prometheus, Jaeger, Middleware, Datadog, Grafana, and more.

Exporters
Exporters are crucial in OpenTelemetry for transmitting collected telemetry data to external systems. The platform supports a variety of exporters, ensuring compatibility with popular observability platforms and backends.

Context Propagation
OpenTelemetry incorporates context propagation mechanisms to link distributed traces seamlessly. This ensures that a trace initiated in one part of your system can be followed through various interconnected services.

Benefits of OpenTelemetry

This modular design offers unmatched flexibility. You can choose the SDKs that suit your languages and environments and seamlessly integrate them with your existing observability tools. Moreover, OpenTelemetry is vendor-agnostic, meaning you're not locked into a specific platform. It's your data, your freedom.
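To make the API/SDK split concrete, here is a minimal, illustrative Python sketch using the opentelemetry-api and opentelemetry-sdk packages. It configures a tracer provider with a console exporter and records one span; the service name, span name, and attribute are placeholders, and in a real deployment you would typically swap the console exporter for an OTLP exporter pointed at a collector.

Python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# SDK: configure where telemetry goes (here, simply printed to the console)
provider = TracerProvider(resource=Resource.create({"service.name": "demo-service"}))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

# API: application code only talks to the vendor-neutral API
tracer = trace.get_tracer("demo.instrumentation")

def handle_request(user_id: str) -> str:
    # Each request is wrapped in a span so it shows up in a trace
    with tracer.start_as_current_span("handle-request") as span:
        span.set_attribute("app.user_id", user_id)
        return f"hello {user_id}"

if __name__ == "__main__":
    print(handle_request("42"))

Even this small example hints at both the flexibility and the amount of wiring the modular design gives you.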
However, this complexity comes with some trade-offs. OpenTelemetry is still evolving, and its ecosystem is less mature than that of Prometheus. Getting started might require more effort, and the instrumentation overhead can be slightly higher. It's a trade-off between a richer picture and immediate usability. So, is OpenTelemetry suitable for you? If you seek the power of complete observability, the flexibility to adapt, and the freedom to choose, then OpenTelemetry might be your ideal partner. But be prepared to invest the time and effort to leverage its full potential.

What Is Prometheus?

Now, let's look at the other side of the comparison. Prometheus, an open-source monitoring and alerting toolkit, was conceived at SoundCloud in 2012 and later donated to the Cloud Native Computing Foundation (CNCF). Praised for its simplicity and reliability, Prometheus has become a cornerstone for organizations seeking a robust solution to monitor their applications and infrastructure. Its focus is laser-sharp: collecting time-series data that paints a quantitative picture of your system's health and performance. This includes its pull-based model, in which Prometheus scrapes metrics from your exporters on its own schedule, minimizing operational overhead. The PromQL query language lets you slice and dice your metrics with surgical precision, creating insightful graphs and alerts.

Key Components of the Prometheus Architecture

To appreciate the nuances of Prometheus, it's essential to understand the underlying architecture that powers its monitoring capabilities.

- Prometheus server: At the core of Prometheus is its server, which scrapes and stores time-series data by pulling metrics over HTTP.
- Data model: Prometheus embraces a multi-dimensional data model, using key-value label pairs to uniquely identify time series.
- PromQL: A powerful query language, PromQL, enables users to retrieve and analyze the time-series data collected by Prometheus.
- Alerting rules: Prometheus incorporates a robust alerting system, allowing users to define rules based on queries and thresholds.
- Exporters: Exporters expose metrics from various sources in a format Prometheus can scrape, ensuring flexibility in monitoring diverse components.

So, when is Prometheus the perfect fit? If your primary concern is monitoring key metrics across your system, and you value operational simplicity and robust tooling, then Prometheus won't disappoint. It's ideal for situations where you need clear, quantitative insights without the complexities of multi-dimensional data collection.

OpenTelemetry vs. Prometheus

Now that we have covered both platforms, let's make a head-to-head comparison of OpenTelemetry and Prometheus to understand their strengths and weaknesses.

Ease of Use

Instrumentation
OpenTelemetry: Offers libraries for multiple languages, making it accessible to diverse ecosystems.
Prometheus: Requires exporters for instrumentation, which may be perceived as an additional step.

Configuration
OpenTelemetry: Features auto-instrumentation for common frameworks, simplifying setup.
Prometheus: Configuration can be manual, necessitating a deeper understanding of settings.

Learning Curve
OpenTelemetry: Users familiar with OpenTracing or OpenCensus may find the transition smoother.
Prometheus: PromQL and Prometheus-specific concepts may pose a learning curve for some users.

Use Case

Application Types
OpenTelemetry: Well-suited for complex, distributed microservices architectures.
Prometheus: Ideal for monitoring containerized environments and providing real-time insights.

Data Types
OpenTelemetry: Captures both traces and metrics, offering comprehensive observability.
Prometheus: Primarily focused on time-series metrics, but has some support for event-based monitoring.

Ecosystem Integration
OpenTelemetry: Widespread adoption and compatibility with various observability platforms.
Prometheus: Strong integration with Kubernetes and native support for exporters and service discovery.

Pricing

Licensing
OpenTelemetry: Open source with an Apache 2.0 license, offering flexibility.
Prometheus: Follows the open-source model with an Apache 2.0 license, providing freedom of use.

Operational Costs
OpenTelemetry: Costs may vary based on the chosen backend and hosting options.
Prometheus: Operational costs are typically associated with storage and scalability requirements.

Advantages

OpenTelemetry:
- Comprehensive observability with both traces and metrics.
- Wide language support and ecosystem integration.
- Active community support and continuous development.
- Vendor-agnostic, flexible, richer data context, future-proof.

Prometheus:
- Efficient real-time monitoring with a powerful query language (PromQL).
- Strong support for containerized environments.
- Robust alerting capabilities.
- Proven stability, efficient data collection, familiar tools, and integrations.

Disadvantages

OpenTelemetry:
- Higher instrumentation overhead and a less mature ecosystem.
- Some users may experience a learning curve.
- Exporter configuration can be complex.

Prometheus:
- Limited data scope (no traces or logs) and potential vendor lock-in for specific integrations.
- Configuration may seem manual and intricate for beginners.

Conclusion

The ultimate choice hinges on your needs. Weigh your needs, assess your resources, and listen to your system's requirements. Does it call for a multifaceted architecture or a focused, metric-driven solution? The answer will lead you to your ideal observability platform. OpenTelemetry offers a unified observability solution, while Prometheus excels in specialized scenarios. But remember, this is not a competition but a collaboration. You can integrate both OpenTelemetry and Prometheus to combine their strengths. Start by using OpenTelemetry to capture your system's observability data, and let Prometheus translate it into actionable insights through its metric-powered lens.
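One common way to realize that collaboration is to send application telemetry to the OpenTelemetry Collector over OTLP and have the Collector expose a Prometheus scrape endpoint. The configuration below is a minimal, illustrative Collector pipeline rather than a production setup; the endpoint address and port are placeholders you would adapt to your environment.

Plain Text
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"   # Prometheus scrapes the Collector here

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]

On the Prometheus side, you would then add a scrape job pointing at the Collector's 8889 endpoint, much like the node_exporter job shown earlier in this section.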
Recently, I was back at the Cloud Native London meetup, having been given the opportunity to present due to a speaker canceling at the last minute. This group has 7,000+ members and is, "...the official Cloud Native Computing Foundation (CNCF) Meetup group dedicated to building a strong, open, diverse developer community around the Cloud Native platform and technologies in London." You can also find them on their own Slack channel, so feel free to drop in for a quick chat if you like.

There were over 85 attendees who braved the cold London evening to join us for pizza, drinks, and a bit of fun, with my session having a special design this time around. I went out on a limb and tried something I'd never seen before: a sort of choose-your-own-adventure presentation. Below I've included a description of how I think it went, the feedback I got, and where you can find both the slides and recording online if you missed it.

About the Presentation

Here are the schedule details for the day:

Check out the three fantastic speakers we've got lined up for you on Wednesday 10 January:
18:00 Pizza and drinks
18:30 Welcome
18:45 Quickwit: Cloud-Native Logging and Distributed Tracing (Francois Massot, Quickwit)
19:15 3 Pitfalls Everyone Should Avoid with Cloud Native Observability (Eric D. Schabell, Chronosphere)
19:45 Break
20:00 Transcending microservices hell for Supergraph Nirvana (Tom Harding, Hasura)
20:30 Wrap up
See you there! The agenda for the January Cloud Native London Meetup is now up. If you're not able to join us, don't forget to update your RSVP before 10am on Wednesday! Or alternatively, join us via the YouTube stream without signing up.

As I mentioned, my talk is a new idea I've been working on for the last year. I want to share insights into the mistakes and pitfalls that I'm seeing customers and practitioners make repeatedly on their cloud-native observability journey. Not only was there new content, but I wanted to try something a bit more daring this time around and engage the audience with a bit of choose-your-own-adventure, in which they chose which pitfall would be covered next. I started with a generic introduction, then gave them the following six choices:

1. Ignoring costs in the application landscape
2. Focusing on The Pillars
3. Sneaky sprawling tooling mess
4. Controlling costs
5. Losing your way in the protocol jungles
6. Underestimating cardinality

For this Cloud Native London session, we ended up going in this order: pitfalls #6, #3, and #4. This meant the session recording posted online from the event contained the following content:

- Introduction to cloud-native and cloud-native observability problems (framing the topic)
- Pitfall #1 - Underestimating cardinality
- Pitfall #2 - Sneaky sprawling tooling mess
- Pitfall #3 - Controlling costs

It went pretty smoothly, and I was excited to get a lot of feedback from attendees who enjoyed the content and the takes on cloud-native observability pitfalls, and they loved the engaging style of choosing your own adventure! If you get the chance to see this talk the next time I present it, there's a good chance it will contain completely different content.

Video, Slides, and Abstract

Session video recording
Session slides: 3 Pitfalls Everyone Should Avoid with Cloud Native Observability, from Eric D. Schabell

Abstract

Are you looking at your organization's efforts to enter or expand into the cloud-native landscape and feeling a bit daunted by the vast expanse of information surrounding cloud-native observability?
When you're moving so fast with agile practices across your DevOps, SRE, and platform engineering teams, it's no wonder this can seem a bit confusing. Unfortunately, the choices being made have a great impact on your business, your budgets, and the ultimate success of your cloud-native initiatives. Those hasty decisions made up front lead to big headaches very quickly down the road. In this talk, I'll introduce the problem facing everyone with cloud-native observability, followed by 3 common mistakes that I'm seeing organizations make and how you can avoid them!

Coming Up

I am scheduled to return in May to present again and look forward to seeing everyone in London in the spring!
In the bustling digital marketplace, web applications are like vibrant cities, constantly humming with activity as users come and go. Just as cities use various systems to keep track of their inhabitants and visitors, web applications rely on user session management to maintain a smooth experience for each person navigating through them. But what exactly is user session management, and why is it so crucial for maintaining the vitality of web apps?

User session management is the mechanism by which a web application recognizes, tracks, and interacts with its users during their visit. In the quest to deliver stellar user experiences, the role of efficient user session management cannot be overstated. Imagine walking into a store where the staff remembers your name, preferences, and the last item you looked at. That personalized service makes you feel valued and understood. Similarly, when an application preserves a user's state and interactions, it allows for a more personalized and efficient experience.

From the moment a user logs in to the time they log out, their session, a series of interactions with the application, is maintained through a unique identifier, usually stored in a cookie or session token. This process involves several important functions, such as authentication (verifying who the user is), authorization (determining what the user is allowed to do), and session persistence (keeping the user 'logged in' as they navigate). Efficient user session management can significantly reduce load times, improve security, and ensure that personalization and user-specific data persist throughout the interaction, laying the groundwork for a satisfying user experience.

What Is AWS ElastiCache?

Now, let's turn our attention to the role that AWS ElastiCache plays in optimizing this process. AWS ElastiCache is a cloud-based, fully managed in-memory data store service provided by Amazon Web Services (AWS), designed to deploy, operate, and scale an in-memory cache in the cloud with ease. It's like having a turbocharger for your web applications; it accelerates data retrieval times by storing critical pieces of data in memory, which is much faster than continually reading from a disk-based database.

With AWS ElastiCache, developers have the power to enhance the performance of their applications without the complexity of managing a caching infrastructure. This service offers high availability, failover capabilities, and a multitude of metrics for monitoring, making it a robust choice for organizations looking to scale and maintain high-performance web applications. AWS ElastiCache supports two popular open-source in-memory engines: Redis and Memcached. However, Redis, with its rich set of features including data persistence, atomic operations, and various data structures, stands out as an ideal candidate for managing user sessions effectively.

When it comes to user session management, AWS ElastiCache can be a game-changer. By leveraging its capabilities, applications can access session data at lightning speed, leading to faster load times, smoother interactions, and an overall better user experience. As we delve deeper into the potential of Redis and AWS ElastiCache in subsequent sections, we'll uncover just how impactful these technologies can be for modern web applications.

Understanding Redis for User Session Management

Redis stands as a beacon of speed and efficiency in the realm of user session management.
But what exactly is this technology that's been so pivotal in enhancing web application performance? Redis, or Remote Dictionary Server, is an open-source, in-memory data structure store. It provides lightning-fast operations by keeping data in memory instead of on disk, which is akin to having a conversation with someone in the same room versus having to shout across a large field.

Redis is versatile. It's not just a simple cache; it's a multi-structured data store that supports strings, lists, sets, hashes, and more. These data structures are essential building blocks for developers, offering the flexibility to tailor data management to the application's needs. It's open source too, which means a community of dedicated experts continually refines it, ensuring it stays at the forefront of innovation.

Scalability and performance are the twin towers of modern web application architecture, and ElastiCache for Redis excels at both. By offloading session management to ElastiCache for Redis, applications can handle an increasing number of simultaneous user sessions without breaking a sweat. This scalability ensures that as your user base grows, your application can grow with it, maintaining performance and reliability. Moreover, performance isn't just about speed; it's about consistency and reliability. Redis provides sub-millisecond latency, meaning that user interactions are almost instantaneously reflected, crafting an experience that feels fluid and responsive. It's not just about making your application faster; it's about making it feel instantaneous, which, in the world of user experience, is pure gold.

ElastiCache for Redis is generally used as an in-memory cache in front of RDS or other databases to speed up data retrieval. The following are the advantages of using it in a different role, as a session manager:

Speed: The in-memory nature of Redis ensures rapid read and write operations, far outpacing traditional disk-based databases. This speed is crucial for session management, where every millisecond counts in delivering a smooth user experience.

Data Structures: Redis's support for various data structures makes it incredibly adaptable for managing sessions. You can use simple key-value pairs for straightforward cases or more complex structures for nuanced data management.

Persistence: Despite being in-memory, Redis offers configurable persistence options. This means that user sessions can be recovered after a server restart, preventing unexpected data loss and frustration.

Atomic Operations: Redis operations are atomic, allowing multiple actions to be completed in one go without interference from other processes. This atomicity ensures consistent session states across the board.

In sum, ElastiCache for Redis isn't just another tool; it's a strategic asset in the world of user session management. Its in-memory datastore capabilities ensure rapid access to data, while its support for various data structures allows for flexibility and nuanced control. Whether it's managing millions of user sessions or ensuring consistent performance during peak traffic, Redis stands ready to elevate your web application's user experience to new heights.

AWS ElastiCache for Redis Implementation

Now that you know about the speed and agility Redis brings to user session management, let's turn our focus to harnessing this power within the AWS ecosystem. First things first: setting up AWS ElastiCache for Redis.
Launching an ElastiCache Redis cluster through the AWS Management Console begins this process. You'll select a suitable instance type and configure the number of nodes required. Remember to choose a node type that aligns with your expected load and performance criteria. After you've set the parameters, AWS takes care of the heavy lifting, provisioning resources and initializing the Redis software.

Once your Redis cluster is up and running, you can connect it to your web application. Using the provided endpoint, applications can seamlessly read from and write to the Redis cache. For user session management, you'd typically serialize session objects and store them as key-value pairs in Redis. When a user revisits your application, the session can be quickly retrieved using the unique session identifier.

Best Practices for Configuring AWS ElastiCache

Configuring AWS ElastiCache correctly is vital for reaping its full benefits. Here's a distilled list of best practices to keep your Redis implementation on AWS as performant and reliable as possible:

- Enable automatic failover: Use Multi-AZ with automatic failover to ensure high availability. If a primary node fails, ElastiCache automatically switches to a replica.
- Backup and recovery: Schedule regular backups and understand the restoration process to avoid data loss.
- Monitoring and alerts: Use Amazon CloudWatch to monitor your Redis metrics, and set up alerts for unusual activity or threshold crossings to proactively manage issues.
- Choose the right node size: Select a node size that fits your memory and compute requirements to avoid unnecessary costs or performance bottlenecks.

Remember, these best practices not only enhance performance but also contribute to the robustness of your user session management system.

Integration of AWS ElastiCache With Web Applications

Integrating AWS ElastiCache with your web applications should be a breeze, but it requires careful planning. You want to ensure that sessions are reliably stored and retrieved without hiccups. Most modern web development frameworks provide Redis connectors or libraries that simplify this process. For instance, in a Node.js application you might use the 'ioredis' library, while in Python 'redis-py' is a popular choice (a short redis-py sketch appears at the end of this article). When integrating, consider encrypting sensitive session data before storage and using ElastiCache's native security features to protect your data. Also, think about session expiration policies. Redis allows you to set a Time-To-Live (TTL) for each session key, which is a neat way to handle session invalidation automatically.

The beauty of using AWS ElastiCache lies in its seamless fit into the AWS ecosystem. Services like AWS Lambda can trigger functions based on cache data changes, and Amazon Route 53 can direct traffic to the nearest ElastiCache endpoint, optimizing latency. By syncing AWS ElastiCache with your web application, you're setting up a system that's not just fast but also resilient, secure, and primed for scale. The result? A user session management system that users won't even notice, because it just works.

Enhancing User Experience Through Optimized Session Management

User satisfaction hinges on how seamlessly users can navigate your application. With optimized session management, you're looking at near-instantaneous access to user preferences, shopping carts, and game states, to name a few. This speed and precision in recalling session data translate into a smoother, more intuitive user experience.
After all, isn't delighting users the ultimate goal of any application? Speed is the currency of the digital realm. A lag in loading times or a hiccup in processing user actions can lead to frustration and, ultimately, user churn. Efficient session management, particularly through in-memory data stores like AWS ElastiCache, directly impacts the speed and reliability of these interactions. It's not just about faster data retrieval; it's also about maintaining a stateful connection with users over an otherwise stateless HTTP protocol. Consequently, the application feels more robust and responsive, creating a positive user perception and improving retention.

When your application grows, will your user session management keep up? Scalability is a vital consideration, and AWS ElastiCache for Redis shines in this regard. It allows for effortless scaling of session management capabilities to handle more users and more complex data without missing a beat. Additionally, let's not forget security. A well-implemented session management strategy also means enhanced security, safeguarding sensitive session data against threats and vulnerabilities, which further solidifies user trust in your application.

Conclusion

As we wrap up our exploration into the dynamic world of user session management with AWS ElastiCache, let's reflect on the pivotal advantages that have been highlighted throughout this post. Redis, with its lightning-fast read and write capabilities, offers a robust solution for managing user sessions at scale. When deployed via AWS ElastiCache, it provides a seamless experience that enhances both the performance and reliability of web applications.

Embracing AWS ElastiCache for Redis does not mean venturing into unknown territory alone. There are ample resources available to guide you through every step of the process. From comprehensive documentation provided by AWS to vibrant community forums where fellow developers share insights, you have a wealth of knowledge at your fingertips. Remember, the journey to optimizing your application's user experience is continuous and ever-evolving. By leveraging Redis for user session management through AWS ElastiCache, you're not just following best practices; you're setting a standard others will aspire to reach. So go ahead, take the leap, and watch as your application rises to new heights of efficiency and user satisfaction.
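To make the integration section above concrete, here is a minimal, illustrative Python sketch of session storage with redis-py against an ElastiCache for Redis endpoint. The endpoint hostname, TTL value, and session fields are placeholders, and the helper functions are hypothetical rather than part of any framework; in production you would also enable TLS and authentication as appropriate for your cluster.

Python
import json
import secrets

import redis  # redis-py

# Placeholder ElastiCache endpoint; replace with your cluster's primary endpoint.
r = redis.Redis(
    host="my-sessions.xxxxxx.use1.cache.amazonaws.com",
    port=6379,
    decode_responses=True,
)

SESSION_TTL_SECONDS = 1800  # 30-minute sliding expiration (illustrative)

def create_session(user_id: str, data: dict) -> str:
    """Store a serialized session under a random token with a TTL."""
    session_id = secrets.token_urlsafe(32)
    payload = json.dumps({"user_id": user_id, **data})
    r.setex(f"session:{session_id}", SESSION_TTL_SECONDS, payload)
    return session_id  # typically returned to the client in a secure cookie

def get_session(session_id: str):
    """Fetch a session and refresh its TTL so active users stay logged in."""
    key = f"session:{session_id}"
    payload = r.get(key)
    if payload is None:
        return None  # expired or never existed
    r.expire(key, SESSION_TTL_SECONDS)
    return json.loads(payload)

# Example usage
sid = create_session("42", {"cart": ["sku-123"], "theme": "dark"})
print(get_session(sid))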
As organizations increasingly migrate their applications to the cloud, efficient and scalable load balancing becomes pivotal for ensuring optimal performance and high availability. This article provides an overview of Azure's load balancing options, encompassing Azure Load Balancer, Azure Application Gateway, Azure Front Door Service, and Azure Traffic Manager. Each of these services addresses specific use cases, offering diverse functionalities to meet the demands of modern applications. Understanding the strengths and applications of these load-balancing services is crucial for architects and administrators seeking to design resilient and responsive solutions in the Azure cloud environment.

What Is Load Balancing?

Load balancing is a critical component in cloud architectures for various reasons. Firstly, it ensures optimized resource utilization by evenly distributing workloads across multiple servers or resources, preventing any single server from becoming a performance bottleneck. Secondly, load balancing facilitates scalability in cloud environments, allowing resources to be scaled based on demand by evenly distributing incoming traffic among available resources. Additionally, load balancers enhance high availability and reliability by redirecting traffic to healthy servers in the event of a server failure, minimizing downtime and ensuring accessibility. From a security perspective, load balancers implement features like SSL termination, protecting backend servers from direct exposure to the internet, and aid in mitigating DDoS attacks and in threat detection/protection using Web Application Firewalls. Furthermore, efficient load balancing promotes cost efficiency by optimizing resource allocation, preventing the need for excessive server capacity during peak loads. Finally, dynamic traffic management across regions or geographic locations allows load balancers to adapt to changing traffic patterns, intelligently distributing traffic during high-demand periods and scaling down resources during low-demand periods, leading to overall cost savings.

Overview of Azure's Load Balancing Options

Azure Load Balancer: Unleashing Layer 4 Power

Azure Load Balancer is a Layer 4 (TCP, UDP) load balancer that distributes incoming network traffic across multiple virtual machines or Virtual Machine Scale Sets to ensure no single server is overwhelmed with too much traffic. There are two options for the load balancer: a public load balancer, primarily used for internet traffic, which also supports outbound connections; and an internal (private) load balancer, which balances traffic within a virtual network. The load balancer distributes traffic using a five-tuple hash (source IP, source port, destination IP, destination port, protocol).

Features

- High availability and redundancy: Azure Load Balancer efficiently distributes incoming traffic across multiple virtual machines or instances in a web application deployment, ensuring high availability, redundancy, and even distribution, thereby preventing any single server from becoming a bottleneck. In the event of a server failure, the load balancer redirects traffic to healthy servers.
- Provide outbound connectivity: The frontend IPs of a public load balancer can be used to provide outbound connectivity to the internet for backend servers and VMs. This configuration uses source network address translation (SNAT) to translate the virtual machine's private IP into the load balancer's public IP address, thus preventing outside sources from having a direct address to the backend instances.
- Internal load balancing: Distribute traffic across internal servers within a virtual network (VNet); this ensures that services receive an optimal share of resources.
- Cross-region load balancing: Azure Load Balancer facilitates the distribution of traffic among virtual machines deployed in different Azure regions, optimizing performance and ensuring low-latency access for users of global applications or services with a user base spanning multiple geographic regions.
- Health probing and failover: Azure Load Balancer monitors the health of backend instances continuously, automatically redirecting traffic away from unhealthy instances, such as those experiencing application errors or server failures, to ensure seamless failover.
- Port-level load balancing: For services running on different ports within the same server, Azure Load Balancer can distribute traffic based on the specified port numbers. This is useful for applications with multiple services running on the same set of servers.
- Multiple front ends: Azure Load Balancer allows you to load balance services on multiple ports, multiple IP addresses, or both. You can use a public or internal load balancer to load balance traffic across a set of services like virtual machine scale sets or virtual machines (VMs).

High Availability (HA) ports in Azure Load Balancer play a crucial role in ensuring resilient and reliable network traffic management. These ports are designed to enhance the availability and redundancy of applications by providing failover capabilities and optimal performance. Azure Load Balancer achieves this by distributing incoming network traffic across multiple virtual machines to prevent a single point of failure.

Configuration and Optimization Strategies

- Define a well-organized backend pool, incorporating healthy and properly configured virtual machines (VMs) or instances, and consider leveraging availability sets or availability zones to enhance fault tolerance and availability.
- Define load balancing rules to specify how incoming traffic should be distributed. Consider factors such as protocol, port, and backend pool association. Use session persistence settings when necessary to ensure that requests from the same client are directed to the same backend instance.
- Configure health probes to regularly check the status of backend instances. Adjust probe settings, such as probing intervals and thresholds, based on the application's characteristics.
- Choose between the Standard SKU and the Basic SKU based on the feature set required for your application.
- Implement frontend IP configurations to define how the load balancer should handle incoming network traffic.
- Implement Azure Monitor to collect and analyze telemetry data, set up alerts based on performance thresholds for proactive issue resolution, and enable diagnostics logging to capture detailed information about the load balancer's operations.
- Adjust the idle timeout settings to optimize the connection timeout for your application. This is especially important for applications with long-lived connections.
- Enable accelerated networking on virtual machines to take advantage of high-performance networking features, which can enhance the overall efficiency of the load-balanced application.

Azure Application Gateway: Elevating To Layer 7

Azure Application Gateway is a Layer 7 load balancer that provides advanced traffic distribution and web application firewall (WAF) capabilities for web applications.
Features

- Web application routing: Azure Application Gateway allows for the routing of requests to different backend pools based on specific URL paths or host headers. This is beneficial for hosting multiple applications on the same set of servers.
- SSL termination and offloading: Improve the performance of backend servers by transferring the resource-intensive task of SSL decryption to the Application Gateway, relieving backend servers of the decryption workload.
- Session affinity: For applications that rely on session state, Azure Application Gateway supports session affinity, ensuring that subsequent requests from a client are directed to the same backend server for a consistent user experience.
- Web Application Firewall (WAF): Implement a robust security layer by integrating the Azure Web Application Firewall with the Application Gateway. This helps safeguard applications from threats such as SQL injection, cross-site scripting (XSS), and other OWASP Top Ten vulnerabilities. You can define your own custom WAF firewall rules as well.
- Auto-scaling: Application Gateway can automatically scale the number of instances to handle increased traffic and scale down during periods of lower demand, optimizing resource utilization.
- Rewriting HTTP headers: Modify HTTP headers for requests and responses; adjusting these headers is essential for reasons including adding security measures, altering caching behavior, or tailoring responses to meet client-specific requirements.
- Ingress Controller for AKS: The Application Gateway Ingress Controller (AGIC) enables the utilization of Application Gateway as the ingress for an Azure Kubernetes Service (AKS) cluster.
- WebSocket and HTTP/2 traffic: Application Gateway provides native support for the WebSocket and HTTP/2 protocols.
- Connection draining: This pivotal feature ensures the smooth and graceful removal of backend pool members during planned service updates or instances of backend health issues. This functionality promotes seamless operations and mitigates potential disruptions by allowing the system to handle ongoing connections gracefully, maintaining optimal performance and user experience during transitional periods.

Configuration and Optimization Strategies

- Deploy the instances in a zone-aware configuration, where available.
- Use Application Gateway with Web Application Firewall (WAF) within a virtual network to protect inbound HTTP/S traffic from the internet.
- Review the impact of the interval and threshold settings on health probes. Setting a shorter interval puts a higher load on your service: each Application Gateway instance sends its own health probes, so 100 instances probing every 30 seconds means 100 requests per 30 seconds.
- Use Application Gateway for TLS termination. This improves the utilization of backend servers, because they don't have to perform TLS processing, and simplifies certificate management, because the certificate only needs to be installed on Application Gateway.
- When WAF is enabled, every request gets buffered until it fully arrives and is then validated against the ruleset. For large file uploads or large requests, this can result in significant latency. The recommendation is to enable WAF with proper testing and validation.
- Having appropriate DNS and certificate management for backend pools is crucial for improved performance.
- Application Gateway is not billed while in the stopped state. Turn it off for dev/test environments.
- Take advantage of the autoscaling capability and its performance benefits, and make sure instances scale in and out based on the workload to reduce cost.
- Use Azure Monitor Network Insights to get a comprehensive view of health and metrics, which is crucial in troubleshooting issues.

Azure Front Door Service: Global-Scale Entry Management

Azure Front Door is a comprehensive content delivery network (CDN) and global application accelerator service that provides a range of capabilities to enhance the performance, security, and availability of web applications. Azure Front Door supports four different traffic routing methods: latency, priority, weighted, and session affinity. These determine how your HTTP/HTTPS traffic is distributed between different origins.

Features

- Global content delivery and acceleration: Azure Front Door leverages a global network of edge locations, employing caching mechanisms, compressing data, and utilizing smart routing algorithms to deliver content closer to end users, thereby reducing latency and enhancing overall responsiveness for an improved user experience.
- Web Application Firewall (WAF): Azure Front Door integrates with Azure Web Application Firewall, providing a robust security layer to safeguard applications from common web vulnerabilities, such as SQL injection and cross-site scripting (XSS).
- Geo-filtering: In the Azure Front Door WAF, you can define a policy using custom access rules for a specific path on your endpoint to allow or block access from specified countries or regions.
- Caching: In Azure Front Door, caching plays a pivotal role in optimizing content delivery and enhancing overall performance. By strategically storing frequently requested content closer to end users at the edge locations, Azure Front Door reduces latency, accelerates the delivery of web applications, and conserves resources across the content delivery network.
- Web application routing: Azure Front Door supports path-based routing, URL redirect/rewrite, and rule sets. These help to intelligently direct user requests to the most suitable backend based on factors such as geographic location, health of backend servers, and application-defined routing rules.
- Custom domain and SSL support: Front Door supports custom domain configurations, allowing organizations to use their own domain names and SSL certificates for secure and branded application access.

Configuration and Optimization Strategies

- Use WAF policies to provide global protection across Azure regions for inbound HTTP/S connections to a landing zone.
- Create a rule to block access to the health endpoint from the internet.
- Ensure that the connection to the back end is re-encrypted, as Front Door does not support SSL passthrough.
- Consider using geo-filtering in Azure Front Door.
- Avoid combining Traffic Manager and Front Door, as they are used for different use cases.
- Configure logs and metrics in Azure Front Door and enable WAF logs for debugging issues.
- Leverage managed TLS certificates to streamline the costs and renewal process associated with certificates. Azure Front Door issues and rotates these managed certificates, ensuring a seamless and automated approach to certificate management, thereby enhancing security while minimizing operational overhead.
- Use the same domain name on Front Door and your origin to avoid issues related to request cookies or URL redirections.
- Disable health probes when there's only one origin in an origin group.
It's recommended to monitor a webpage or location that you specifically designed for health monitoring. Regularly monitor and adjust the instance count and scaling settings to align with actual demand, preventing overprovisioning and optimizing costs.

Azure Traffic Manager: DNS-Based Traffic Distribution

Azure Traffic Manager is a global DNS-based traffic load balancer that enhances the availability and performance of applications by directing user traffic to the most optimal endpoint.

Features

Global load balancing: Distribute user traffic across multiple global endpoints to enhance application responsiveness and fault tolerance. Fault tolerance and high availability: Ensure continuous availability of applications by automatically rerouting traffic to healthy endpoints in the event of failures. Routing: Traffic Manager supports several routing methods. Performance-based routing optimizes application responsiveness by directing traffic to the endpoint with the lowest latency, geographic routing directs traffic based on the geographic location of end users, and priority-based, weighted, and other methods are also available. Endpoint monitoring: Regularly check the health of endpoints using configurable health probes, ensuring traffic is directed only to operational and healthy endpoints. Service maintenance: You can carry out planned maintenance on your applications without downtime; Traffic Manager directs traffic to alternative endpoints while the maintenance is in progress. Subnet traffic routing: Define custom routing policies based on IP address ranges, providing flexibility in directing traffic according to specific network configurations.

Configuration and Optimization Strategies

Enable automatic failover to healthy endpoints in case of endpoint failures, ensuring continuous availability and minimizing disruptions. Use the appropriate traffic routing method, such as Priority, Weighted, Performance, Geographic, or Multi-value, to tailor traffic distribution to specific application requirements. Implement a custom page to use as a health check for your Traffic Manager. If failover takes too long, consider adjusting the health probe timing or the Time to Live (TTL) of the DNS record. Consider nested Traffic Manager profiles. Nested profiles allow you to override the default Traffic Manager behavior to support larger, more complex application deployments. Integrate with Azure Monitor for real-time monitoring and logging, gaining insights into the performance and health of Traffic Manager and its endpoints.

How To Choose

When selecting a load balancing option in Azure, it is crucial to first understand the specific requirements of your application, including whether it needs layer 4 or layer 7 load balancing, SSL termination, or web application firewall capabilities. For applications requiring global distribution, options like Azure Traffic Manager or Azure Front Door are worth considering to efficiently achieve global load balancing. Additionally, it's essential to evaluate the advanced features provided by each load balancing option, such as SSL termination, URL-based routing, and application acceleration. Scalability and performance should also be taken into account, as the options vary in throughput, latency, and scaling capabilities. Cost is a key factor, and it's important to compare pricing models to align with budget constraints.
Lastly, assess how well the chosen load balancing option integrates with other Azure services and tools within your overall application architecture. This comprehensive approach ensures that the selected load balancing solution aligns with the unique needs and constraints of your application.

Service | Global/Regional | Recommended traffic
Azure Front Door | Global | HTTP(S)
Azure Traffic Manager | Global | Non-HTTP(S) and HTTPS
Azure Application Gateway | Regional | HTTP(S)
Azure Load Balancer | Regional or Global | Non-HTTP(S) and HTTPS

Here is the decision tree for load balancing from Azure. Source: Azure
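To make the selection guidance above easier to discuss in code or architecture reviews, here is a minimal, illustrative Python sketch that mirrors the summary table and the main questions (global vs. regional scope, HTTP(S) vs. non-HTTP(S) traffic, layer-7/WAF needs). It is a simplification of Azure's decision tree, not a substitute for it; the function name and parameters are assumptions made for this example.

```python
def suggest_azure_load_balancer(global_scope: bool, http_traffic: bool,
                                needs_waf_or_l7_routing: bool = False) -> str:
    """Rough mapping of requirements to an Azure load-balancing service.

    Mirrors the summary table above: Front Door and Traffic Manager for
    global scenarios, Application Gateway and Load Balancer for regional
    ones; HTTP(S) workloads favor the layer-7 services. This is an
    illustrative simplification, not official Azure guidance.
    """
    if global_scope:
        if http_traffic:
            return "Azure Front Door"          # global HTTP(S), CDN + WAF
        return "Azure Traffic Manager"         # global, DNS-based, any protocol
    if http_traffic or needs_waf_or_l7_routing:
        return "Azure Application Gateway"     # regional layer 7, WAF, path routing
    return "Azure Load Balancer"               # regional (or cross-region) layer 4


# Example: a regional HTTPS web app that needs WAF and path-based routing
print(suggest_azure_load_balancer(global_scope=False, http_traffic=True,
                                  needs_waf_or_l7_routing=True))
```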
Observability platforms are akin to the immune system. Just like immune cells are everywhere in human bodies, an observability platform patrols every corner of your devices, components, and architectures, identifying any potential threats and proactively mitigating them. That metaphor may go a bit far, because to this day we have never built a system as sophisticated as the human body, but we can always make advancements. The key to upgrading an observability platform is to increase data processing speed and reduce costs, for two reasons: The faster you can identify abnormalities in your data, the more you can contain the potential damage. An observability platform needs to store a sea of data, and low storage cost is the only way to make that sustainable. This post is about how GuanceDB, an observability platform, made progress in these two aspects by replacing Elasticsearch with Apache Doris as its query and storage engine. The result is 70% lower storage costs and 2~4 times the query performance.

GuanceDB

GuanceDB is an all-around observability solution. It provides services including data analytics, data visualization, monitoring and alerting, and security inspection. With GuanceDB, users gain an understanding of their objects, network performance, applications, user experience, system availability, and more. From the standpoint of a data pipeline, GuanceDB can be divided into two parts: data ingestion and data analysis. I will get to them one by one.

Data Integration

For data integration, GuanceDB uses its self-built tool called DataKit. It is an all-in-one data collector that extracts data from different end devices, business systems, middleware, and data infrastructure. It can also preprocess data and relate it to metadata. It provides extensive support for data, from logs and time series metrics to distributed tracing data, security events, and user behaviors from mobile apps and web browsers. To cater to diverse needs across multiple scenarios, it ensures compatibility with various open-source probes and collectors as well as data sources in custom formats.

Query & Storage Engine

Data collected by DataKit goes through the core computation layer and arrives in GuanceDB, which is a multi-model database that combines various database technologies. It consists of a query engine layer and a storage engine layer. By decoupling the query engine from the storage engine, it enables a pluggable, interchangeable architecture. For time series data, they built Metric Store, a self-developed storage engine based on VictoriaMetrics. For logs, they integrated Elasticsearch and OpenSearch. GuanceDB is performant in this architecture, while Elasticsearch demonstrates room for improvement: Data Writing: Elasticsearch consumes a big share of CPU and memory resources. It is not only costly but also disruptive to query execution. Schemaless Support: Elasticsearch provides schemaless support via Dynamic Mapping, but that's not enough to handle large numbers of user-defined fields; it can lead to field-type conflicts and thus data loss. Data Aggregation: Large aggregation tasks often trigger a timeout error in Elasticsearch. So this is where the upgrade happens: GuanceDB tried Apache Doris and replaced Elasticsearch with it.

DQL

In the GuanceDB observability platform, almost all queries involve timestamp filtering. Meanwhile, most data aggregations need to be performed within specified time windows.
Additionally, there is a need to perform rollups of time series data on individual sequences within a time window. Expressing these semantics in SQL often requires nested subqueries, resulting in complex and cumbersome statements. That's why GuanceDB developed its own Data Query Language (DQL). With simplified syntax elements and computing functions optimized for observability use cases, DQL can query metrics, logs, object data, and data from distributed tracing. This is how DQL works together with Apache Doris: GuanceDB has found a way to make full use of the analytic power of Doris while complementing its SQL functionality. Guance-Insert is the data writing component, while Guance-Select is the DQL query engine. Guance-Insert: It allows data from different tenants to be accumulated in different batches and strikes a balance between write throughput and write latency. When logs are generated in large volumes, it can maintain a low data latency of 2~3 seconds. Guance-Select: For query execution, if the SQL semantics or function of the query is supported in Doris, Guance-Select pushes the query down to the Doris frontend for computation; if not, it falls back to acquiring columnar data in Arrow format via the Thrift RPC interface and finishing the computation in Guance-Select. The catch is that it cannot push the computation logic down to the Doris backend, so this path can be slightly slower than executing queries in the Doris frontend.

Observations

Storage Cost 70% Down, Query Speed 300% Up

Previously, with Elasticsearch clusters, they used 20 cloud virtual machines (16 vCPU, 64GB) plus independent index writing services (another 20 cloud virtual machines). Now, with Apache Doris, they only need 13 cloud virtual machines of the same configuration in total, representing a 67% cost reduction. This is driven by three capabilities of Apache Doris: High Writing Throughput: Under a consistent write throughput of 1GB/s, Doris maintains a CPU usage of less than 20%, the equivalent of only 2.6 of those virtual machines. With low CPU usage, the system is more stable and better prepared for sudden write peaks. High Data Compression Ratio: Doris utilizes the ZSTD compression algorithm on top of columnar storage and can achieve a compression ratio of 8:1. Compared to 1.5:1 in Elasticsearch, Doris can reduce storage costs by around 80%. Tiered Storage: Doris allows a more cost-effective way to store data: hot data on local disks and cold data in object storage. Once the storage policy is set, Doris automatically manages the "cooldown" of hot data and moves cold data to object storage. This data lifecycle is transparent to the data application layer, which makes it user-friendly. Doris also speeds up cold-data queries with a local cache. With lower storage costs, Doris does not compromise query performance. It doubles the execution speed of queries that return a single row and of those that return a result set. For aggregation queries without sampling, Doris runs at 4 times the speed of Elasticsearch. To sum up, Apache Doris achieves 2~4 times the query performance of Elasticsearch with only about one-third of the storage cost.

Inverted Index for Full-Text Search

The inverted index is the magic potion for log analytics because it considerably increases full-text search performance and reduces query overhead. It is especially useful in these scenarios: Full-text search via MATCH_ALL, MATCH_ANY, and MATCH_PHRASE.
MATCH_PHRASE in combination with an inverted index is the alternative to Elasticsearch's full-text search functionality. Equivalence queries (=, !=, IN), range queries (>, >=, <, <=), with support for numerics, DateTime, and strings.

CREATE TABLE httplog
(
  `ts` DATETIME,
  `clientip` VARCHAR(20),
  `request` TEXT,
  INDEX idx_ip (`clientip`) USING INVERTED,
  INDEX idx_req (`request`) USING INVERTED PROPERTIES("parser" = "english")
)
DUPLICATE KEY(`ts`)
...

-- Retrieve the latest 10 records of Client IP "8.8.8.8"
SELECT * FROM httplog WHERE clientip = '8.8.8.8' ORDER BY ts DESC LIMIT 10;

-- Retrieve the latest 10 records with "error" or "404" in the "request" field
SELECT * FROM httplog WHERE request MATCH_ANY 'error 404' ORDER BY ts DESC LIMIT 10;

-- Retrieve the latest 10 records with "image" and "faq" in the "request" field
SELECT * FROM httplog WHERE request MATCH_ALL 'image faq' ORDER BY ts DESC LIMIT 10;

-- Retrieve the latest 10 records with "query error" in the "request" field
SELECT * FROM httplog WHERE request MATCH_PHRASE 'query error' ORDER BY ts DESC LIMIT 10;

As a powerful accelerator for full-text search, the inverted index in Doris is also flexible, because indexes often need on-demand adjustments. In Elasticsearch, indexes are fixed upon creation, so you need to plan carefully which fields to index; otherwise, any change to the index requires a complete rewrite. In contrast, Doris allows dynamic indexing: you can add an inverted index to a field at runtime and it takes effect immediately, and you can decide which data partitions to create indexes on.

A New Data Type for Dynamic Schema Change

By nature, an observability platform requires support for dynamic schema, because the data it collects is prone to change. Every click by a user on a webpage might add a new metric to the database. Looking around the database landscape, you will find that static schema is the norm. Some databases take a step further; for example, Elasticsearch provides dynamic schema via mapping. However, this functionality can easily be disrupted by field-type conflicts or unexpired historical fields. The Doris solution for dynamic schema is a newly introduced data type, Variant, and GuanceDB is among the first to try it out. (It will officially be available in Apache Doris V2.1.) The Variant data type is Doris' move to embrace semi-structured data analytics. It can solve many of the problems that often plague database users: JSON Data Storage: A Variant column in Doris can accommodate any legal JSON data and can automatically recognize the subfields and data types. Schema Explosion Due To Too Many Fields: Frequently occurring subfields are stored in a column-oriented manner to facilitate analysis, while less frequently seen subfields are merged into the same column to streamline the data schema. Write Failure Due To Data Type Conflicts: A Variant column allows different types of data in the same field and applies different storage for different data types.

Difference Between Variant and Dynamic Mapping

From a functional perspective, the biggest difference between Variant in Doris and Dynamic Mapping in Elasticsearch is that the scope of Dynamic Mapping extends throughout the entire lifecycle of the current table, while the scope of Variant can be limited to the current data partition.
For example, if a user has changed the business logic and renamed some Variant fields today, the old field name will remain in the partitions created before today, but it will not appear in new partitions created from tomorrow onward. So there is a lower risk of data-type conflict. In the case of field-type conflicts within the same partition, the two fields are both converted to JSON type to avoid data errors or data loss. For example, suppose there are two status fields in the user's business system, one holding strings and the other numerics; in queries, the user can decide whether to query the string field, the numeric field, or both. (For example, if you specify status = "ok" in the filters, the query will only be executed against the string field.) From the user's perspective, the Variant type is as simple to use as any other data type: fields can be added or removed based on business needs, and no extra syntax or annotation is required. Currently, the Variant type requires extra type assertion; the plan is to automate this process in future versions of Doris. GuanceDB is one step ahead in this regard: it has implemented automatic type assertion for its DQL queries. In most cases, type assertion is based on the actual data type of the Variant fields. In the rare cases where there is a type conflict, the Variant fields are upgraded to JSON fields, and type assertion is then based on the semantics of the operators in the DQL query.

Conclusion

GuanceDB's transition from Elasticsearch to Apache Doris represents a big stride in improving data processing speed and reducing costs. For these purposes, Apache Doris has optimized itself in the two major aspects of data processing: data integration and data analysis. It has expanded its schemaless support to flexibly accommodate more data types and introduced features like the inverted index and tiered storage to enable faster and more cost-effective queries. Evolution is an ongoing process, and Apache Doris has never stopped improving itself.
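As a closing illustration: because Apache Doris speaks the MySQL wire protocol, the inverted-index queries shown earlier can be issued from an ordinary MySQL client library. The following is a minimal, illustrative Python sketch using pymysql against the httplog example table; the host, port, credentials, and database name are assumptions, and in GuanceDB itself such queries would normally go through DQL and Guance-Select rather than raw SQL.

```python
# Illustrative only: querying a Doris table (the httplog example above)
# over the MySQL protocol. Connection details below are assumptions.
import pymysql

conn = pymysql.connect(
    host="doris-fe.example.internal",  # Doris frontend (FE) host - assumption
    port=9030,                         # FE MySQL-protocol port - assumption
    user="root",
    password="",
    database="demo",
)

try:
    with conn.cursor() as cur:
        # Time-window aggregation: requests per client IP over the last hour
        cur.execute(
            """
            SELECT clientip, COUNT(*) AS requests
            FROM httplog
            WHERE ts >= DATE_SUB(NOW(), INTERVAL 1 HOUR)
            GROUP BY clientip
            ORDER BY requests DESC
            LIMIT 10
            """
        )
        for clientip, requests in cur.fetchall():
            print(clientip, requests)

        # Full-text search backed by the inverted index on `request`
        cur.execute(
            "SELECT ts, request FROM httplog "
            "WHERE request MATCH_ANY 'error 404' "
            "ORDER BY ts DESC LIMIT 10"
        )
        for ts, request in cur.fetchall():
            print(ts, request)
finally:
    conn.close()
```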
Murphy's Law ("Anything that can go wrong will go wrong and at the worst possible time.") is a well-known adage, especially in engineering circles. However, its implications are often misunderstood, especially by the general public. It's not just about the universe conspiring against our systems; it's about recognizing and preparing for potential failures. Many view Murphy's Law as a blend of magic and reality. As Site Reliability Engineers (SREs), we often ponder its true nature. Is it merely a psychological bias where we emphasize failures and overlook our unnoticed successes? Psychology has identified several related biases, including Confirmation and Selection biases. The human brain tends to focus more on improbable failures than successes. Moreover, our grasp of probabilities is often flawed – the Law of Truly Large Numbers suggests that coincidences are, ironically, quite common. However, in any complex system, a multitude of possible states exist, many of which can lead to failure. While safety measures make a transition from a functioning state to a failure state less likely, over time, it's more probable for a system to fail than not. The real lesson from Murphy's Law isn't just about the omnipresence of misfortune in engineering but also how we respond to it: through redundancies, high availability systems, quality processes, testing, retries, observability, and logging. Murphy's Law makes our job more challenging and interesting! Today, however, I'd like to discuss a complementary or reciprocal aspect of Murphy's Law that I've often observed while working on large systems: Complementary Observations to Murphy's Law The Worst Possible Time Complement Often overlooked, this aspect highlights the 'magic' of Murphy's Law. Complex systems do fail, but not so frequently that we forget them. In our experience, a significant number of failures (about one-third) occur at the worst possible times, such as during important demos. For instance, over the past two months, we had a couple of important demos. In the first demo, the web application failed due to a session expiration issue, which rarely occurs. In the second, a regression embedded in a merge request caused a crash right during the demo. These were the only significant demos we had in that period, and both encountered failures. This phenomenon is often referred to as the 'Demo Effect.' The Conjunction of Events Complement The combination of events leading to a breakdown can be truly astonishing. For example, I once inadvertently caused a major breakdown in a large application responsible for sending electronic payrolls to 5 million people, coinciding with its production release day. The day before, I conducted additional benchmarks (using JMeter) on the email sending system within the development environment. Our development servers, like others in the organization, were configured to route emails through a production relay, which then sent them to the final server in the cloud. Several days prior, I had set the development server to use a mock server since my benchmark simulated email traffic peaks of several hundred thousand emails per hour. However, the day after my benchmarking, when I was off work, my boss called to inquire if I had made any special changes to email sending, as the entire system was jammed at the final mail server. 
Here’s what had happened: An automated Infrastructure as Code (IAC) tool overwrote my development server configuration, causing it to send emails to the actual relay instead of the mock server; The relay, recognized by the cloud provider, had its IP address changed a few days earlier; The whitelist on the cloud side hadn't been updated, and a throttling system blocked the final server; The operations team responsible for this configuration was unavailable to address the issue. The Squadron Complement Problems often cluster, complicating resolution efforts. These range from simultaneous issues exacerbating a situation to misleading issues that divert us from the real problem. I can categorize these issues into two types: 1. The Simple Additional Issue: This typically occurs at the worst possible moment, such as during another breakdown, adding more work, or slowing down repairs. For instance, in a current project I'm involved with, due to legacy reasons, certain specific characters inputted into one application can cause another application to crash, necessitating data cleanup. This issue arises roughly once every 3 or 4 months, often triggered by user instructions. Notably, several instances of this issue have coincided with much more severe system breakdowns. 2. The Deceitful Additional Issue: These issues, when combined with others, significantly complicate post-mortem analysis and can mislead the investigation. A recent example was an application bug in a Spring batch job that remained obscured due to a connection issue with the state-storing database caused by intermittent firewall outages. The Camouflage Complement Using ITIL's problem/incidents framework, we often find incidents that appear similar but have different causes. We apply the ITIL framework's problem/incident dichotomy to classify issues where a problem can generate one or more incidents. When an incident occurs, it's crucial to conduct a thorough analysis by carefully examining logs to figure out if this is only a new incident of a known problem or an entire new problem. Often, we identify incidents that appear similar to others, possibly occurring on the same day, exhibiting comparable effects but stemming from different causes. This is particularly true when incorrect error-catching practices are in place, such as using overly broad catch(Exception) statements in Java, which can either trap too many exceptions or, worse, obscure the root cause. The Over-Accident Complement Like chain reactions in traffic accidents, one incident in IT can lead to others, sometimes with more severe consequences. I can recall at least three recent examples illustrating our challenges: 1. Maintenance Page Caching Issue: Following a system failure, we activated a maintenance page, redirecting all API and frontend calls to this page. Unfortunately, this page lacked proper cache configuration. Consequently, when a few users made XHR calls precisely at the time the maintenance page was set up, it was cached in their browsers for the entire session. Even after maintenance ended and the web application frontend resumed normal operation, the API calls continued to retrieve the HTML maintenance page instead of the expected JSON response due to this browser caching. 2. Debug Verbosity Issue: To debug data sent by external clients, we store payloads into a database. To maintain a reasonable database size, we limited the stored payload sizes. 
However, during an issue with a partner organization, we temporarily increased the payload size limit for analysis purposes. This change was inadvertently overlooked, leading to an enormous database growth and nearly causing a complete application crash due to disk space saturation. 3. API Gateway Timeout Handling: Our API gateway was configured to replay POST calls that ended in timeouts due to network or system issues. This setup inadvertently led to catastrophic duplicate transactions. The gateway reissued requests that timed out, not realizing these transactions were still processing and would eventually complete successfully. This resulted in a conflict between robustness and data integrity requirements. The Heisenbug Complement A 'heisenbug' is a type of software bug that seems to alter or vanish when one attempts to study it. This term humorously references the Heisenberg Uncertainty Principle in quantum mechanics, which posits that the more precisely a particle's position is determined, the less precisely its momentum can be known, and vice versa. Heisenbugs commonly arise from race conditions under high loads or other factors that render the bug's behavior unpredictable and difficult to replicate in different conditions or when using debugging tools. Their elusive nature makes them particularly challenging to fix, as the process of debugging or introducing diagnostic code can change the execution environment, causing the bug to disappear. I've encountered such issues in various scenarios. For instance, while using a profiler, I observed it inadvertently slowing down threads to such an extent that it hid the race conditions. On another occasion, I demonstrated to a perplexed developer how simple it was to reproduce a race condition on non-thread-safe resources with just two or three threads running simultaneously. However, he was unable to replicate it in a single-threaded environment. The UFO Issue Complement A significant number of issues are neither fixed nor fully understood. I'm not referring to bugs that are understood but deemed too costly to fix in light of their severity or frequency. Rather, I'm talking about those perplexing issues whose occurrence is extremely rare, sometimes happening only once. Occasionally, we (partially) humorously attribute such cases to Single Event Errors caused by cosmic particles. For example, in our current application that generates and sends PDFs to end-users through various components, we encountered a peculiar issue a few months ago. A user reported, with a screenshot as evidence, a PDF where most characters appeared as gibberish symbols instead of letters. Despite thorough investigations, we were stumped and ultimately had to abandon our efforts to resolve it due to a complete lack of clues. The Non-Existing Issue Complement One particularly challenging type of issue arises when it seems like something is wrong, but in reality, there is no actual bug. These non-existent bugs are the most difficult to resolve! The misconception of a problem can come from various factors, including looking in the wrong place (such as the incorrect environment or server), misinterpreting functional requirements, or receiving incorrect inputs from end-users or partner organizations. For example, we recently had to address an issue where our system rejected an uploaded image. The partner organization assured us that the image should be accepted, claiming it was in PNG format. 
However, upon closer examination (that took us several staff-days), we discovered that our system's rejection was justified: the file was not actually a PNG. The False Hope Complement I often find Murphy's Law to be quite cruel. You spend many hours working on an issue, and everything seems to indicate that it is resolved, with the problem no longer reproducible. However, once the solution is deployed in production, the problem reoccurs. This is especially common with issues related to heavy loads or concurrency. The Anti-Murphy's Reciprocal In every organization I've worked for, I've noticed a peculiar phenomenon, which I'd call 'Anti-Murphy's Law.' Initially, during the maintenance phase of building an application, Murphy’s Law seems to apply. However, after several more years, a contrary phenomenon emerges: even subpar software appears not only immune to Murphy's Law but also more robust than expected. Many legacy applications run glitch-free for years, often with less observation and fewer robustness features, yet they still function effectively. The better the design of an application, the quicker it reaches this state, but even poorly designed ones eventually get there. I have only some leads to explain this strange phenomenon: Over time, users become familiar with the software's weaknesses and learn to avoid them by not using certain features, waiting longer, or using the software during specific hours. Legacy applications are often so difficult to update that they experience very few regressions. Such applications rarely have their technical environment (like the OS or database) altered to avoid complications. Eventually, everything that could go wrong has already occurred and been either fixed or worked around: it's as if Murphy's Law has given up. However, don't misunderstand me: I'm not advocating for the retention of such applications. Despite appearing immune to issues, they are challenging to update and increasingly fail to meet end-user requirements over time. Concurrently, they become more vulnerable to security risks. Conclusion Rather than adopting a pessimistic view of Murphy's Law, we should be thankful for it. It drives engineers to enhance their craft, compelling them to devise a multitude of solutions to counteract potential issues. These solutions include robustness, high availability, fail-over systems, redundancy, replays, integrity checking systems, anti-fragility, backups and restores, observability, and comprehensive logging. In conclusion, addressing a final query: can Murphy's Law turn against itself? A recent incident with a partner organization sheds light on this. They mistakenly sent us data and relied on a misconfiguration in their own API Gateway to prevent this erroneous transmission. However, by sheer coincidence, the API Gateway had been corrected in the meantime, thwarting their reliance on this error. Thus, the answer appears to be a resounding NO.
OpenTelemetry is a collection of APIs, SDKs, and tools. Use it to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your software's performance and behavior. OpenTelemetry is supported by many popular programming languages, including C++, C#/.NET, Erlang/Elixir, Go, Java, JavaScript, PHP, Python, Ruby, Rust, Swift, and others. This is the OpenTelemetry Python project setup documentation, designed to help you get started with OpenTelemetry Python. This guide targets a Linux environment (WSL2) running on Windows. For more information about WSL2, refer to: Windows Subsystem for Linux Documentation | Microsoft Learn

Step 1: Install WSL2 or WSL2 + Rancher Desktop

Installation steps are given in the Appendix; follow them first. Once the installation is done, go to the next steps.

Step 2: Open Ubuntu on Your Machine and Configure DNS

Open Ubuntu and configure the DNS resolvers. Inside Ubuntu, run the command: sudo nano /etc/resolv.conf Then set the nameservers as below: nameserver 8.8.8.8 nameserver 4.4.4.4 After that, save the modified buffer.

Step 3: Clone the Project Repository in WSL2

Fork the Git repository into your personal Git account, then clone the project with git clone. The cloned code will be available in the Linux (WSL2) environment.

Step 4: Create a Virtual Environment

First, create a virtual environment for the project. Virtualenv is a tool for setting up isolated Python environments; since Python 3.3, a subset of it has been integrated into the standard library as the venv module. To install virtualenv on the host Python, run this command in the terminal: pip install virtualenv For more step-by-step details, refer to: How to Set Up a Virtual Environment in Python – And Why It's Useful (freecodecamp.org) Note: Never forget to activate the virtual environment using the command below. source env/bin/activate After activation, the environment name appears in the Ubuntu terminal prompt.

Step 5: Open the Project in an Editor and Connect VS Code to WSL

To work with the project code, you need a Python editor. You can use any editor, such as PyCharm or Visual Studio Code; here, we use Visual Studio Code. Download the latest VS Code installer and install it. Once it is installed, connect VS Code to the WSL remote: click the remote button, choose Open a Remote Window, and then select Connect to WSL. Once VS Code is connected to WSL, open the project folder using the project path: \\wsl.localhost\Ubuntu\root\opentelemetry-python After that, run the project from the VS Code terminal or the Ubuntu terminal. Before running any code, never forget to activate the virtual environment: source env/bin/activate

Step 6: Run Code in the VS Code Editor

When you first try to run a file, you may face errors such as ModuleNotFoundError. In that case, install all required packages and modules as described in the repository's README.md file.
First, install the requirements files that are already available in the repository, using the commands below: pip install -r dev-requirements.txt pip install -r docs-requirements.txt (You can check the resulting environment with pip freeze.) After that, if you still face a missing-module error, install the module indicated by the error. For example, you can use: sudo apt-get install python3-opentelemetry OR pip3 install opentelemetry --force-reinstall --upgrade If the issue is still not resolved, purge the pip cache using: pip cache purge and then go through Step 6 again. Depending on the project requirements, you may have to install several modules one by one. Once that installation is complete, you can check your installed environment using: pip list Finally, pick any test case file from the project and try to run it. If it runs successfully without generating any errors, the code setup on your local machine is complete.

Appendix

Install WSL Command

You now have everything you need to run WSL with a single command. Open PowerShell or Windows Command Prompt in administrator mode by right-clicking and selecting "Run as administrator," enter the wsl --install command, then restart the machine. wsl --install For more steps to install Linux on Windows with WSL, refer to the URL: Install WSL | Microsoft Learn. After successfully installing WSL2, go to "Turn Windows features on or off" and enable the required feature checkbox. Then open PowerShell as administrator and execute the command: wsl --set-default-version 2

Steps To Install Ubuntu

Run the following command in PowerShell to check that Ubuntu is available in WSL2. Run wsl --install -d Ubuntu-22.04 to install the latest version of Ubuntu on your machine. Run Ubuntu once it's installed, then set up the Unix username and password to create an account. To enable Ubuntu to access the internet, edit the resolver file using the command: sudo nano /etc/resolv.conf Add Google DNS entries to the file to resolve the connectivity issue. Ubuntu will then have internet access and can install software like yarn, node, git, etc.

Installing Rancher Desktop and Connecting It to Ubuntu

Rancher Desktop is delivered as a desktop application. You can download it from the releases page on GitHub and install it per the instructions given in Installation | Rancher Desktop Docs. During installation, Rancher Desktop asks you to choose between two container engines, containerd and dockerd (moby); choose dockerd to access the Docker API and images. It will automatically install the Kubernetes cluster and configuration. Then go to Settings in Rancher Desktop and enable WSL integration for your Ubuntu distribution. After that, run the OpenTelemetry application in WSL2.
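Once the environment is set up, a quick way to confirm that the OpenTelemetry Python packages work (independent of the WSL/Rancher Desktop appendix above) is a minimal console-exporter script. This is a sketch based on the upstream getting-started documentation; it assumes the opentelemetry-api and opentelemetry-sdk packages are installed in the active virtual environment (pip install opentelemetry-api opentelemetry-sdk).

```python
# smoke_test.py - emit one trace span to the console to verify the setup.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Configure a tracer provider that prints finished spans to stdout.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("setup.smoke.test")

with tracer.start_as_current_span("hello-otel") as span:
    span.set_attribute("environment", "wsl2")  # illustrative attribute
    print("If a span is printed below, the OpenTelemetry setup works.")
```

Run it with python smoke_test.py inside the activated virtual environment; a span printed to the console confirms that the API and SDK are installed correctly.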
This is an article from DZone's 2023 Observability and Application Performance Trend Report.For more: Read the Report From cultural and structural challenges within an organization to balancing daily work and dividing it between teams and individuals, scaling teams of site reliability engineers (SREs) comes with many challenges. However, fostering a resilient site reliability engineering (SRE) culture can facilitate the gradual and sustainable growth of an SRE team. In this article, we explore the challenges of scaling and review a successful scaling framework. This framework is suitable for guiding emerging teams and startups as they cultivate an evolving SRE culture, as well as for established companies with firmly entrenched SRE cultures. The Challenges of Scaling SRE Teams As teams scale, complexity may increase as it can be more difficult to communicate, coordinate, and maintain a team's coherence. Below is a list of challenges to consider as your team and/or organization grows: Rapid growth – Rapid growth leads to more complex systems, which can outpace the capacity of your SRE team, leading to bottlenecks and reduced reliability. Knowledge-sharing – Maintaining a shared understanding of systems and processes may become difficult, making it challenging to onboard new team members effectively. Tooling and automation – Scaling without appropriate tooling and automation can lead to increased manual toil, reducing the efficiency of the SRE team. Incident response – Coordinating incident responses can become more challenging, and miscommunications or delays can occur. Maintaining a culture of innovation and learning – This can be challenging as SREs may become more focused on solving critical daily problems and less focused on new initiatives. Balancing operational and engineering work – Since SREs are responsible for both operational tasks and engineering work, it is important to ensure that these teams have enough time to focus on both areas. A Framework for Scaling SRE Teams Scaling may come naturally if you do the right things in the right order. First, you must identify what your current state is in terms of infrastructure. How well do you understand the systems? Determine existing SRE processes that need improvement. For the SRE processes that are necessary but are not employed yet, find the tools and the metrics necessary to start. Collaborate with the appropriate stakeholders, use feedback, iterate, and improve. Step 1: Assess Your Current State Understand your system and create a detailed map of your infrastructure, services, and dependencies. Identify all the components in your infrastructure, including servers, databases, load balancers, networking equipment, and any cloud services you utilize. It is important to understand how these components are interconnected and dependent on each other — this includes understanding which services rely on others and the flow of data between them. It's also vital to identify and evaluate existing SRE practices and assess their effectiveness: Analyze historical incident data to identify recurring issues and their resolutions. Gather feedback from your SRE team and other relevant stakeholders. Ask them about pain points, challenges, and areas where improvements are needed. Assess the performance metrics related to system reliability and availability. Identify any trends or patterns that indicate areas requiring attention. Evaluate how incidents are currently being handled. Are they being resolved efficiently? 
Are post-incident reviews being conducted effectively to prevent recurrences? Step 2: Define SLOs and Error Budgets Collaborate with stakeholders to establish clear and meaningful service-level objectives (SLOs) by determining the acceptable error rate and creating error budgets based on the SLOs. SLOs and error budgets can guide resource allocation optimization. Computing resources can be allocated to areas that directly impact the achievement of the SLOs. SLOs set clear, achievable goals for the team and provide a measurable way to assess the reliability of a service. By defining specific targets for uptime, latency, or error rates, SRE teams can objectively evaluate whether the system is meeting the desired standards of performance. Using specific targets, a team can prioritize their efforts and focus on areas that need improvement, thus fostering a culture of accountability and continuous improvement. Error budgets provide a mechanism for managing risk and making trade-offs between reliability and innovation. They allow SRE teams to determine an acceptable threshold for service disruptions or errors, enabling them to balance the need for deploying new features or making changes to maintain a reliable service. Step 3: Build and Train Your SRE Team Identify talent according to the needs of each and every step of this framework. Look for the right skillset and cultural fit, and be sure to provide comprehensive onboarding and training programs for new SREs. Beware of the golden rule that culture eats strategy for breakfast: Having the right strategy and processes is important, but without the right culture, no strategy or process will succeed in the long run. Step 4: Establish SRE Processes, Automate, Iterate, and Improve Implement incident management procedures, including incident command and post-incident reviews. Define a process for safe and efficient changes to the system. Figure 1: Basic SRE process One of the cornerstones of SRE involves how to identify and handle incidents through monitoring, alerting, remediation, and incident management. Swift incident identification and management are vital in minimizing downtime, which can prevent minor issues from escalating into major problems. By analyzing incidents and their root causes, SREs can identify patterns and make necessary improvements to prevent similar issues from occurring in the future. This continuous improvement process is crucial for enhancing the overall reliability and performance whilst ensuring the efficiency of systems at scale. Improving and scaling your team can go hand in hand. Monitoring Monitoring is the first step in ensuring the reliability and performance of a system. It involves the continuous collection of data about the system's behavior, performance, and health. This can be broken down into: Data collection – Monitoring systems collect various types of data, including metrics, logs, and traces, as shown in Figure 2. Real-time observability – Monitoring provides real-time visibility into the system's status, enabling teams to identify potential issues as they occur. Proactive vs. reactive – Effective monitoring allows for proactive problem detection and resolution, reducing the need for reactive firefighting. Figure 2: Monitoring and observability Alerting This is the process of notifying relevant parties when predefined conditions or thresholds are met. It's a critical prerequisite for incident management. 
This can be broken down into: Thresholds and conditions – Alerts are triggered based on predefined thresholds or conditions. For example, an alert might be set to trigger when CPU usage exceeds 90% for five consecutive minutes. Notification channels – Alerts can be sent via various notification channels, including email, SMS, or pager, or even integrated into incident management tools. Severity levels – Alerts should be categorized by severity levels (e.g., critical, warning, informational) to indicate the urgency and impact of the issue. Remediation This involves taking actions to address issues detected through monitoring and alerting. The goal is to mitigate or resolve problems quickly to minimize the impact on users. Automated actions – SRE teams often implement automated remediation actions for known issues. For example, an automated scaling system might add more resources to a server when CPU usage is high. Playbooks – SREs follow predefined playbooks that outline steps to troubleshoot and resolve common issues. Playbooks ensure consistency and efficiency during remediation efforts. Manual interventions – In some cases, manual intervention by SREs or other team members may be necessary for complex or unexpected issues. Incident Management Effective communication, knowledge-sharing, and training are crucial during an incident, and most incidents can be reproduced in staging environments for training purposes. Regular updates are provided to stakeholders, including users, management, and other relevant teams. Incident management includes a culture of learning and continuous improvement: The goal is not only to resolve the incident but also to prevent it from happening again. Figure 3: Handling incidents A robust incident management process ensures that service disruptions are addressed promptly, thus enhancing user trust and satisfaction. In addition, by effectively managing incidents, SREs help preserve the continuity of business operations and minimize potential revenue losses. Incident management plays a vital role in the scaling process since it establishes best practices and promotes collaboration, as shown in Figure 3. As the system scales, the frequency and complexity of incidents are likely to increase. A well-defined incident management process enables the SRE team to manage the growing workload efficiently. Conclusion SRE is an integral part of the SDLC. At the end of the day, your SRE processes should be integrated into the entire process of development, testing, and deployment, as shown in Figure 4. Figure 4: Holistic view of development, testing, and the SRE process Iterating on and improving the steps above will inevitably lead to more work for SRE teams; however, this work can pave the way for sustainable and successful scaling of SRE teams at the right pace. By following this framework and overcoming the challenges, you can effectively scale your SRE team while maintaining system reliability and fostering a culture of collaboration and innovation. Remember that SRE is an ongoing journey, and it is essential to stay committed to the principles and practices that drive reliability and performance. This is an article from DZone's 2023 Observability and Application Performance Trend Report.For more: Read the Report
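As a concrete companion to the SLO and error-budget discussion in Step 2 above, here is a small, illustrative Python sketch of the basic arithmetic: converting an SLO target into an error budget for a window and checking how much of it has been consumed. The 99.9% target and 30-day window are assumptions chosen for the example, not recommendations.

```python
# Illustrative error-budget arithmetic for an availability SLO.
MINUTES_PER_DAY = 24 * 60

def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime (in minutes) for the window given an SLO target like 0.999."""
    return (1.0 - slo_target) * window_days * MINUTES_PER_DAY

def budget_burned(downtime_minutes: float, slo_target: float, window_days: int = 30) -> float:
    """Fraction of the error budget consumed so far in the window."""
    return downtime_minutes / error_budget_minutes(slo_target, window_days)

if __name__ == "__main__":
    slo = 0.999  # 99.9% availability over a 30-day window (assumed example)
    budget = error_budget_minutes(slo)                       # ~43.2 minutes
    burned = budget_burned(downtime_minutes=12, slo_target=slo)
    print(f"Error budget: {budget:.1f} minutes; consumed: {burned:.0%}")
```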
Joana Carvalho, Site Reliability Engineering, Virtuoso
Eric D. Schabell, Director Technical Marketing & Evangelism, Chronosphere
Chris Ward, Zone Leader, DZone