Methodologies Resources

DZone's Featured Methodologies Resources

What Is Platform Engineering?

By Josephine Eskaline Joyce

Platform engineering is the creation and management of foundational infrastructure and automated processes, incorporating principles like abstraction, automation, and self-service, to empower development teams, optimize resource utilization, ensure security, and foster collaboration for efficient and scalable software development. In today's fast-paced world of software development, the evolution of "platform engineering" stands as a transformative force, reshaping the landscape of software creation and management. This comprehensive exploration aims to demystify the intricate realm of platform engineering, shedding light on its fundamental principles, multifaceted functions, and its pivotal role in revolutionizing streamlined development processes across industries. Key Concepts and Principles Platform engineering encompasses several key concepts and principles that underpin the design and implementation of internal platforms. One fundamental concept is abstraction, which involves shielding developers from the complexities of underlying infrastructure through well-defined interfaces. Automation is another crucial principle, emphasizing the use of scripting and tools to streamline repetitive tasks, enhance efficiency, and maintain consistency in development processes. Self-service is pivotal, empowering development teams to independently provision and manage resources. Scalability ensures that platforms can efficiently adapt to varying workloads, while resilience focuses on the system's ability to recover from failures. Modularity encourages breaking down complex systems into independent components, fostering flexibility and reusability. Consistency promotes uniformity in deployment and configuration, aiding troubleshooting and stability. API-first design prioritizes the development of robust interfaces, and observability ensures real-time monitoring and traceability. Lastly, security by design emphasizes integrating security measures throughout the entire development lifecycle, reinforcing the importance of a proactive approach to cybersecurity. Together, these concepts and principles guide the creation of robust, scalable, and developer-friendly internal platforms, aligning with the evolving needs of modern software development. Diving Into the Role of a Platform Engineering Team The platform engineering team operates at the intersection of software development, operational efficiency, and infrastructure management. Their primary objective revolves around sculpting scalable and efficient internal platforms that empower developers. Leveraging automation, orchestration, and innovative tooling, these teams create standardized environments for application deployment and management, catalyzing productivity and performance. Image source Elaborating further on the team's responsibilities, it's essential to highlight their continuous efforts in optimizing resource utilization, ensuring security and compliance, and establishing robust monitoring and logging mechanisms. Their role extends beyond infrastructure provisioning, encompassing the facilitation of collaboration among development, operations, and security teams to achieve a cohesive and agile software development ecosystem. Building Blocks of Internal Platforms Central to platform engineering is the concept of an Internal Developer Platform (IDP) - a tailored environment equipped with an array of tools, services, and APIs. This environment streamlines the development lifecycle, offering self-service capabilities that enable developers to expedite the build, test, deployment, and monitoring of applications. Internal platforms in the context of platform engineering encompass various components that work together to provide a unified and efficient environment for the development, deployment, and management of applications. The specific components may vary depending on the platform's design and purpose, but here are some common components: Infrastructure as Code (IaC) Containerization and orchestration Service mesh API Gateway CI/CD pipelines Monitoring and logging Security components Database and data storage Configuration management Workflow orchestration Developer tools Policy and governance Benefits of Internal Platforms Internal platforms in platform engineering offer a plethora of benefits, transforming the software development landscape within organizations. These platforms streamline and accelerate the development process by providing self-service capabilities, enabling teams to independently provision resources and reducing dependencies on dedicated operations teams. Automation through CI/CD pipelines enhances efficiency and ensures consistent, error-free deployments. Internal platforms promote scalability, allowing organizations to adapt to changing workloads and demands. The modularity of these platforms facilitates code reusability, reducing development time and effort. By abstracting underlying infrastructure complexities, internal platforms empower developers to focus on building applications rather than managing infrastructure. Collaboration is enhanced through centralized tools, fostering communication and knowledge sharing. Additionally, internal platforms contribute to improved system reliability, resilience, and observability, enabling organizations to deliver high-quality, secure software at a faster pace. Overall, these benefits make internal platforms indispensable for organizations aiming to stay agile and competitive in the ever-evolving landscape of modern software development. Challenges in Platform Engineering Platform engineering, while offering numerous benefits, presents a set of challenges that organizations must navigate. Scalability issues can arise as the demand for resources fluctuates, requiring careful design and management to ensure platforms can efficiently scale. Maintaining a balance between modularity and interdependence poses a challenge, as breaking down systems into smaller components can lead to complexity and potential integration challenges. Compatibility concerns may emerge when integrating diverse technologies, requiring meticulous planning to ensure seamless interactions. Cultural shifts within organizations may be necessary to align teams with the principles of platform engineering, and skill gaps may arise, necessitating training initiatives. Additionally, achieving consistency across distributed components and services can be challenging, impacting the reliability and predictability of the platform. Balancing security measures without hindering development speed is an ongoing challenge, and addressing these challenges demands a holistic and strategic approach to platform engineering that considers technical, organizational, and cultural aspects. Implementation Strategies in Platform Engineering Following are the top five implementation strategies: Start small and scale gradually: Begin with a focused and manageable scope, such as a pilot project or a specific team. This allows for the identification and resolution of any initial challenges in a controlled environment. Once the initial implementation proves successful, gradually scale the platform across the organization. Invest in training and skill development: Provide comprehensive training programs to ensure that development and operations teams are well-versed in the tools, processes, and concepts associated with platform engineering. Investing in skill development ensures that teams can effectively utilize the platform and maximize its benefits. Automate key processes with CI/CD: Implement Continuous Integration (CI) and Continuous Deployment (CD) pipelines to automate crucial aspects of the development lifecycle, including code building, testing, and deployment. Automation accelerates development cycles, reduces errors, and enhances overall efficiency. Cultivate DevOps practices: Embrace DevOps practices that foster collaboration and communication between development and operations teams. promotes shared responsibility, collaboration, and a holistic approach to software development, aligning with the principles of platform engineering. Iterative improvements based on feedback: Establish a feedback loop to gather insights and feedback from users and stakeholders. Regularly review performance metrics, user experiences, and any challenges faced during the implementation. Use this feedback to iteratively improve the platform, addressing issues and continuously enhancing its capabilities. These top five strategies emphasize a phased and iterative approach, coupled with a strong focus on skill development, automation, and collaborative practices. Starting small, investing in training, and embracing a DevOps culture contribute to the successful implementation and ongoing optimization of platform engineering practices within an organization. Platform Engineering Tools Various tools aid platform engineering teams in building, maintaining, and optimizing platforms. Examples include: Backstage: Developed by Spotify, it offers a unified interface for accessing essential tools and services. Kratix: An open-source tool designed for infrastructure management and streamlining development processes Crossplane: An open-source tool automating infrastructure via declarative APIs, supporting tailored platform solutions Humanitec: A comprehensive platform engineering tool facilitating easy platform building, deployment, and management Port: A platform enabling the building of developer platforms with a rich software catalog and role-based access control Case Studies of Platform Engineering Spotify Spotify is known for its adoption of a platform model to empower development teams. They use a platform called "Backstage," which acts as an internal developer portal. Backstage provides a centralized location for engineers to discover, share, and reuse services, tools, and documentation. It streamlines development processes, encourages collaboration, and improves visibility into the technology stack. Netflix Netflix is a pioneer in adopting a microservices architecture and has developed an internal platform called the Netflix Internal Platform Engineering (NIPE). The platform enables rapid application deployment, facilitates service discovery, and incorporates fault tolerance. Uber Uber has implemented an internal platform called "Michelangelo" to streamline machine learning (ML) workflows. Michelangelo provides tools and infrastructure to support end-to-end ML development, from data processing to model deployment. Salesforce Salesforce has developed an internal platform known as "Salesforce Lightning Platform." This platform enables the creation of custom applications and integrates with the Salesforce ecosystem. It emphasizes low-code development, allowing users to build applications with minimal coding, accelerating the development process, and empowering a broader range of users. Distinguishing Platform Engineering From SRE While both platform engineering and Site Reliability Engineering (SRE) share goals of ensuring system reliability and scalability, they diverge in focus and approach. Platform engineering centers on crafting foundational infrastructure and tools for development, emphasizing the establishment of internal platforms that empower developers. In contrast, SRE focuses on operational excellence, managing system reliability, incident response, and ensuring the overall reliability, availability, and performance of production systems. Further Reading: Top 10 Open Source Projects for SREs and DevOps Engineers. ACTORS Platform Engineering SRE Scope Focused on creating a development-friendly platform and environment. Focused on reliability and performance of applications and services in production. Responsibilities Platform Engineers design and maintain internal platforms, emphasizing tools and services for development teams. SREs focus on operational aspects, automating tasks, and ensuring the resilience and reliability of production systems. Abstraction Level Platform Engineering abstracts infrastructure complexities for developers, providing a high-level platform. SRE deals with lower-level infrastructure details, ensuring the reliability of the production environment. DevOps vs Platform Engineering DevOps and platform engineering are distinct methodologies addressing different aspects of software development. DevOps focuses on collaboration and automation across the entire software delivery lifecycle, while platform engineering concentrates on providing a unified and standardized platform for developers. The table below outlines the differences between DevOps and platform engineering. Factors DevOps Platform Engineering Objective Streamline development and operations Provide a unified and standardized platform for developers Principles Collaboration, Automation, CI, CD Enable collaboration, Platform as a Product, Abstraction, Standardization, Automation Scope Extends to the entire software delivery lifecycle Foster collaboration between dev and ops teams, providing a consistent environment for the entire lifecycle Tools Uses a wide range of tools at different stages in the lifecycle Integrates a diverse set of tools into the platform Benefits Faster development & deployment cycles, higher collaboration Efficient and streamlined development environment, improved productivity, and flexibility for developers Future Trends in Platform Engineering Multi-cloud and hybrid platforms: Platform engineering is expected to focus on providing solutions that seamlessly integrate and manage applications across different cloud providers and on-premises environments. Edge computing platforms: Platforms will need to address challenges related to latency, connectivity, and management of applications deployed closer to end-users. AI-driven automation: The integration of artificial intelligence (AI) and machine learning (ML) into platform engineering is expected to increase. AI-driven automation can optimize resource allocation, improve predictive analytics for performance monitoring, and enhance security measures within platforms. Serverless architectures: Serverless computing is anticipated to become more prevalent, leading to platform engineering solutions that support serverless architectures. This trend focuses on abstracting server management, allowing developers to focus solely on writing code. Observability and AIOps: Observability, including monitoring, tracing, and logging, will continue to be a key focus. AIOps (Artificial Intelligence for IT Operations) will likely play a role in automating responses to incidents and predicting potential issues within platforms. Low-code/no-code platforms: The rise of low-code/no-code platforms is likely to influence platform engineering, enabling a broader range of users to participate in application development with minimal coding. Platform engineering will need to support and integrate with these development approaches. Quantum computing integration: As quantum computing progresses, platform engineering may need to adapt to support the unique challenges and opportunities presented by quantum applications and algorithms. Zero Trust Security: Zero Trust Security models are becoming increasingly important. Future platform engineering will likely focus on implementing and enhancing security measures at every level, considering the principles of zero trust in infrastructure and application security. More

A Brief History of DevOps and the Link to Cloud Development Environments

By Laurent Balmelli, PhD

The history of DevOps is definitely worth reading in a few good books about it. On that topic, “The Phoenix Project,” self-characterized as “a novel of IT and DevOps,” is often mentioned as a must-read. Yet for practitioners like myself, a more hands-on one is “The DevOps Handbook” (which shares Kim as an author in addition to Debois, Willis, and Humble) that recounts some of the watershed moments around the evolution of software engineering and provides good references around implementation. This book actually describes how to replicate the transformation explained in the Phoenix Project and provides case studies. In this brief article, I will use my notes on this great book to regurgitate a concise history of DevOps, add my personal experience and opinion, and establish a link to Cloud Development Environments (CDEs), i.e., the practice of providing access to and running, development environments online as a service for developers. In particular, I explain how the use of CDEs concludes the effort of bringing DevOps “fully online.” Explaining the benefits of this shift in development practices, plus a few personal notes, is my main contribution in this brief article. Before clarifying the link between DevOps and CDEs, let’s first dig into the chain of events and technical contributions that led to today’s main methodology for delivering software. The Agile Manifesto The creation of the Agile Manifesto in 2001 sets forth values and principles as a response to more cumbersome software development methodologies like Waterfall and the Rational Unified Process (RUP). One of the manifesto's core principles emphasizes the importance of delivering working software frequently, ranging from a few weeks to a couple of months, with a preference for shorter timescales. The Agile movement's influence expanded in 2008 during the Agile Conference in Toronto, where Andrew Shafer suggested applying Agile principles to IT infrastructure rather than just to the application code. This idea was further propelled by a 2009 presentation at the Velocity Conference, where a paper from Flickr demonstrated the impressive feat of "10 deployments a day" using Dev and Ops collaboration. Inspired by these developments, Patrick Debois organized the first DevOps Days in Belgium, effectively coining the term "DevOps." This marked a significant milestone in the evolution of software development and operational practices, blending Agile's swift adaptability with a more inclusive approach to the entire IT infrastructure. The Three Ways of DevOps and the Principles of Flow All the concepts that I discussed so far are today incarnated into the “Three Ways of DevOps,” i.e., the foundational principles that guide the practices and processes in DevOps. In brief, these principles focus on: Improving the flow of work (First Way), i.e., the elimination of bottlenecks, reduction of batch sizes, and acceleration of workflow from development to production, Amplifying feedback loops (Second Way), i.e., quickly and accurately collect information about any issues or inefficiencies in the system and Fostering a culture of continuous learning and experimentation (Third Way), i.e., encouraging a culture of continuous learning and experimentation. Following the leads from Lean Manufacturing and Agile, it is easy to understand what led to the definition of the above three principles. I delve more deeply into each of these principles in this conference presentation. For the current discussion, though, i.e., how DevOps history leads to Cloud Development Environments, we just need to look at the First Way, the principle of flow, to understand the causative link. Chapter 9 of the DevOps Handbook explains that the technologies of version control and containerization are central to implementing DevOps flows and establishing a reliable and consistent development process. At the center of enabling the flow is the practice of incorporating all production artifacts into version control to serve as a single source of truth. This enables the recreation of the entire production environment in a repeatable and documented fashion. It ensures that production-like code development environments can be automatically generated and entirely self-serviced without requiring manual intervention from Operations. The significance of this approach becomes evident at release time, which is often the first time where an application's behavior is observed in a production-like setting, complete with realistic load and production data sets. To reduce the likelihood of issues, developers are encouraged to operate production-like environments on their workstations, created on-demand and self-serviced through mechanisms such as virtual images or containers, utilizing tools like Vagrant or Docker. Putting these environments under version control allows for the entire pre-production and build processes to be recreated. Note that production-like environments really refer to environments that, in addition to having the same infrastructure and application configuration as the real production environments, also contain additional applications and layers necessary for development. Developers are encouraged to operate production-like environments (Docker icon) on their workstations using mechanisms such as virtual images or containers to reduce the likelihood of execution issues in production. From Developer Workstations to a CDE Platform The notion of self-service is already emphasized in the DevOps Handbook as a key enabler to the principle of flow. Using 2016 technology, this is realized by downloading environments to the developers’ workstations from a registry (such as DockerHub) that provides pre-configured, production-like environments as files (dubbed infrastructure-as-code). Docker is often a tool to implement this function. Starting from this operation, developers create an application in effect as follows: They access and copy files with development environment information to their machines, Add source code to it in the local storage, and Build the application locally using their workstation computing resources. This is illustrated in the left part of the figure below. Once the application works correctly, the source code is sent (“pushed) to a central code repository, and the application is built and deployed online, i.e., using Cloud-based resources and applications such as CI/CD pipelines. The three development steps listed above are, in effect, the only operations in addition to the authoring of source code using an IDE that is “local,” i.e., they use workstations’ physical storage and computing resources. All the rest of the DevOps operations are performed using web-based applications and used as-a-service by developers and operators (even when these applications are self-hosted by the organization.). The basic goal of Cloud Development Environments is to move these development steps online as well. To do that, CDE platforms, in essence, provide the following basic services, illustrated in the right part of the figure below: Manage development environments online as containers or virtual machines such that developers can access them fully built and configured, substituting step (1) above; then Provide a mechanism for authoring source code online, i.e., inside the development environment using an IDE or a terminal, substituting step (2); and finally Provide a way to execute build commands inside the development environment (via the IDE or terminal), substituting step (3). Figure: (left) The classic development data flow requires the use of the local workstation resources. (right) The cloud development data flow replaced local storage and computing while keeping a similar developer experience. On each side, operations are (1) accessing environment information, (2) adding code, and (3) building the application. Note that the replacement of step (2) can be done in several ways. For example, for example, the IDE can be browser-based (aka a Cloud IDE), or a locally installed IDE can implement a way to remotely author the code in the remote environment. It is also possible to use a console text editor via a terminal such as vim. I cannot conclude this discussion without mentioning that, often multiple containerized environments are used for testing on the workstation, in particular in combination with the main containerized development environment. Hence, cloud IDE platforms need to reproduce the capability to run containerized environments inside the Cloud Development Environment (itself a containerized environment). If this recursive process becomes a bit complicated to grasp, don’t worry; we have reached the end of the discussion and can move to the conclusion. What Comes Out of Using Cloud Development Environments in DevOps A good way to conclude this discussion is to summarize the benefits of moving development environments from the developers’ workstations online using CDEs. As a result, the use of CDEs for DevOps leads to the following advantages: Streamlined Workflow: CDEs enhance the workflow by removing data from the developer's workstation and decoupling the hardware from the development process. This ensures the development environment is consistent and not limited by local hardware constraints. Environment Definition: With CDEs, version control becomes more robust as it can uniformize not only the environment definition but all the tools attached to the workflow, leading to a standardized development process and consistency across teams across the organization. Centralized Environments: The self-service aspect is improved by centralizing the production, maintenance, and evolution of environments based on distributed development activities. This allows developers to quickly access and manage their environments without the need for Operations manual work. Asset Utilization: Migrating the consumption of computing resources from local hardware to centralized and shared cloud resources not only lightens the load on local machines but also leads to more efficient use of organizational resources and potential cost savings. Improved Collaboration: Ubiquitous access to development environments, secured by embedded security measures in the access mechanisms, allows organizations to cater to a diverse group of developers, including internal, external, and temporary workers, fostering collaboration across various teams and geographies. Scalability and Flexibility: CDEs offer scalable cloud resources that can be adjusted to project demands, facilitating the management of multiple containerized environments for testing and development, thus supporting the distributed nature of modern software development teams. Enhanced Security and Observability: Centralizing development environments in the Cloud not only improves security (more about secure CDEs) but also provides immediate observability due to their online nature, allowing for real-time monitoring and management of development activities. By integrating these aspects, CDEs become a solution for modern, in particular cloud-native software development, and align with the principles of DevOps to improve flow, but also feedback, and continuous learning. In an upcoming article, I will discuss the contributions of CDEs across all three ways of DevOps. In the meantime, you're welcome to share your feedback with me. More

The Next Major Shift in Enterprise Software Engineering: From Platforms to “Platformless”

By Asanka Abeysinghe

Demystifying Event Storming: Design Level, Identifying Aggregates (Part 3)

By Alireza Rahmani Khalili

CORE

Understanding Site Reliability Engineering

By Kellyn Gorman

CORE

The Agile Scrum Ceremony Most Talked About but Least Paid Attention To

In this digital world where all companies want their products to have a cutting edge over others and they want faster go to market, most companies want their teams to follow Agile scrum methodology; however, we observed most teams are following Agile scrum ceremonies for the name sake only. Among all scrum ceremonies, the Sprint retrospective is the most important and most talked about ceremony but the least paid attention to. Many times, scrum masters keep doing the same canned single routine format of a retrospective, which is: what went well? What didn't go well? and What is to improve? Let us analyze what are the problems the team faces, their impact, and recommendations to overcome. Problems and impact of routine format Sprint retrospective: Doing a routine single format made teams uninterested, and they started losing interest. Either team members stopped attending this ceremony, kept silent, or didn't participate. Often, action items came out of retrospectives not being followed up during the sprint. Status of action items not discussed in next sprint retrospective The team started losing faith in the ceremony when they saw previous sprint action items still existed and kept accumulated This leads to missing key feedback and actions sprint after sprint and hampers the team's improvements. With this, even after 20-30 sprints, teams keep making the same mistakes again and again. Ultimately, the team never becomes mature. Recommendations for Efficient Sprint Retrospective We think visually. Try following fun-filled visual retrospective techniques: Speed car retrospective Speed boat retrospective Build and reflect Mad, Sad, Glad 4 Ls Retrospective One-word Retrospective Horizontal Line retrospective Continue, stop, and start-improve What went well? What didn't go well? What is to improve? Always record, publish, and track action items. Ensure leadership does not join sprint retrospectives, which will make the team uncomfortable in sharing honest feedback. Every sprint retrospective first discusses the status of action items from the previous sprint; this will give confidence to the team that their feedback is being heard and addressed. Now let us discuss these visual, fun-filled sprint retrospective techniques in detail: 1. Speed Car Retrospective This retrospective shows that the car represents the team, the engine depicts the team's strength, the Parachute represents the impediments that slow down the car's speed, the Abyss shows the danger the team foresees ahead, and the Bridge indicates the team's suggestions on how to overcome and cross this abyss without falling into it. 2. Speed Boat Retrospective This retrospective shows that the boat represents the team; the anchors represent problems that are not allowing the boat to move or slowing it down and turn these anchors into gusts of winds, which in turn represents the team's suggestions, which the team thinks will help the boat move forward. 3. Build and Reflect Bring legos set and divide teams into multiple small groups, then ask the team to build two structures. One represents how the sprint went, and one represents how it should be and then ask each group to talk about their structures and suggestions for the sprint. 4. Mad, Sad, Glad This technique discusses what makes the team mad, sad, and glad during the sprint and how we can move from mad, sad columns to glad columns. 5. Four Ls: Liked, Learned, Lacked and Longed This technique talks about four Ls. What team "Liked," What team "Learned," What team "Lacked," and What team "Longed" during the sprint, and then discuss each item with the team. 6. One-Word Retrospective Sometimes, to keep the retrospective very simple, ask the team to describe the sprint experience in "one word" and then ask why they describe sprint with this particular word and what can be improved. 7. Horizontal Line Retrospective Another simple retrospective technique is to draw a horizontal line and, above the line, put items that the team feels are "winning items" and below line items that the team feels are "failures" during the sprint. 8. Continue, Stop, Start-Improve This is another technique to capture feedback in three categories, viz. "Continue" means which team feels the team did great and needs to continue, "Stop" talks about activities the team wants to stop, and "Start-Improve" talks about activities that the team suggested to start doing or improve. 9. What Went Well? What Didn’t Go Well? And What Is To Improve? This is well well-known and most practiced retrospective technique to note down points in the mentioned three categories. We can keep reshuffling these retrospective techniques to keep the team enthusiastic to participate and share feedback in a fun, fun-filled, and constructive environment. Remember, feedback is a gift and should always be taken constructively to improve the overall team's performance. Go, Agile team!!

By SAMANT KUMAR

7 Tips for Building and Maintaining an SRE Team in Your Company

Many of today’s hottest jobs didn’t exist at the turn of the millennium. Social media managers, data scientists, and growth hackers were unheard of at the time. Another relatively new job role in demand is that of a Site Reliability Engineer or SRE. The profession is quite new. It’s noted that 64% of SRE teams are less than three years old. But despite being new, the job role adds a lot of value to an organization. SRE vs. DevOps Site reliability engineering is the merging of development and operations into one. Most people tend to mix up SRE and DevOps. By principle, the two intertwine, but DevOps serves as the principle and SRE the practice. Any company looking to implement site reliability engineering in their organization might want to start by following these seven tips to build and maintain an SRE team. 1. Start Small and Internally There is a high chance that your company needs an SRE team but doesn’t need a whole department right away. Site reliability management’s role is to ensure that an online service remains reliable through alert creation, incident investigation, root cause remediation, and incident postmortem. The average tech-based company faces a few bugs every so often. In the past, operations and development teams would come together to fix those issues in software or service. An SRE approach merges those two into one. If you’re just starting to build your SRE team, you can start by putting together some people from your operations and technical department and give them the sole responsibility of maintaining a service’s reliability. 2. Get the Right People In cases where you’re ready to scale, the time might come where you’ll need to get additional help for your site reliability engineering team. SRE professionals are in hot demand nowadays. There are more than 1,300 site reliability engineering jobs on Indeed. The key to finding the right people for your SRE team is to know what you’re looking for. Here are a few qualifications to look for in a site reliability engineer. Problem-solving and troubleshooting skills: Much of the SRE team’s responsibilities have to do with addressing incidents and issues in software. Most times, these problems have to do with systems or applications that they didn’t create themselves. So the ability to quickly debug even without in-depth knowledge of a system is a must-have skill. A knack for automation: Toil can often become a big problem in many tech-based services. The right site reliability engineer will look for ways to automate away the toil, reducing manual work to a minimum so that staff only deal with high-priority items. Constant learning: As systems evolve, so will problems. So good SREs will have to keep brushing up their knowledge on systems, codes, and processes that change with time. Teamwork: Addressing incidents will rarely be a one-man-job so SREs need to work well with teams. Collaboration and communication are the skills to look out for definitely. Bird’s eye view perspective: When addressing bugs, it can be easy to get caught up with the wrong things when you’re stuck in the middle of it. That’s why good SREs will need the ability to see the bigger picture and find solutions in larger contexts. A successful site reliability engineer will find the root cause and create an overarching solution. 3. Define Your SLOs An SRE team will most likely succeed with service level objectives in place. Service level objectives or SLOs are the key performance metrics for a site. SLOs can vary depending on the kind of service a business offers. Generally, any user-facing serving system will have to set availability, latency, and throughput as indicators. Storage-based systems will often place more emphasis on latency, availability, and durability. Setting up SLOs also involves placing values that a company would like to maintain in terms of indicators. The numbers your SLOs should show are the minimum thresholds that the system should hold on to. When setting an SLO, don’t base them on current performance as this might put you in a position to meet unrealistic targets. Keep your objectives simple and avoid placing any absolutes. The fewer SLOs you have in place, the better, so only measure what indicators matter to you most. 4. Set Holistic Systems to Handle Incident Management Incident management is one of the most important aspects of site reliability engineering. In a survey by Catchpoint, 49% of respondents said that they had worked on an incident in the last week or so. When handling incidents, a system needs to be in place to keep the debugging and maintenance process as smooth as possible. One of the most important aspects of an incident management system is keeping track of on-call responsibilities. SRE team responsibilities can get extremely exhausting without an effective means to control the flow of on-call incidents. Using the right incident management tool can help resolve incidents with more clarity and structure. 5. Accept Failure as Part of the Norm Most people don’t like experiencing failure, but if your company wants to maintain a healthy and productive SRE team, one of the themes that each member must get used to is accepting failure as a part of the profession. Perfection is rarely ever the case in any system, most especially when in the early development stages. Many SRE teams mistake setting the bar too high right away and putting up unrealistic SLO definitions and targets. The best operational practice has always been to shoot for a minimum viable product and then slowly increase the parameters once the team and company as a whole build up confidence. 6. Perform Incident Postmortems to Learn from Failures and Mistakes There’s an old saying that goes this way: “Dead men tell no tales.” But that isn’t the case with system incidents. There is much to learn from incidents even after problems have been resolved. That’s why it’s a great practice to perform incident postmortems so that SRE teams can learn from their mistakes. A proper SRE approach would take into account the best practices for postmortem. When performing post-incident analysis, there are sets of parameters that site reliability crews must analyze. First, they should look into the cause and triggers of the failure. What caused the system to fail? Secondly, the team should pinpoint as many of the effects as they can find. What did the system failure affect? For example, a payment gateway error might have caused a discrepancy in payments made or collections, which can be a headache if left unturned for even a few days. Lastly, a successful postmortem will look into possible solutions and recommendations if a similar error might occur in the future. 7. Maintain a Simple Incident Management System An SRE team structure isn’t enough to create a productive team. There also needs to be a project and incident management system in place. There are various services and different IT management software use cases available to SRE teams today. Some of the factors that team managers need to consider are ease of use, communication barriers, available integrations, and collaboration capabilities. Setting Your SRE Team Up for Success An SRE team can be likened to an aircraft maintenance crew fixing a plane while it’s 50,000 feet in the air. Setting your SRE team up for success is crucial as they will assure that your company’s service is available to your clients. While errors and bugs are inevitable in any software as a service, it can be kept to a minimum, making outages and errors a rare occasion. But for that to happen, you’ll need a solid SRE team in place, proactively finding ways to avoid errors and being ready to spring into action when duty calls.

By Vishal Padghan

Project Hygiene, Part 2: Combatting Goodhart’s Law and Other “Project Smells”

This is a continuation of the Project Hygiene series about best software project practices that started with this article. Background “It works until it doesn’t” is a phrase that sounds like a truism at first glance but can hold a lot of insight in software development. Take, for instance, the very software that gets produced. There is no shortage of jokes and memes about how the “prettiness” of what the end-user sees when running a software application is a mere façade that hides a nightmare of kludges, “temporary” fixes that have become permanent, and other less-than-ideal practices. These get bundled up into a program that works just as far as the developers have planned out; a use case that falls outside of what the application has been designed for could cause the entire rickety code base to fall apart. When a catastrophe of this kind does occur, a post-mortem is usually conducted to find out just how things went so wrong. Maybe it was some black-swan moment that simply never could’ve been predicted (and would be unlikely to occur again in the future), but it’s just as possible that there was some issue within the project that never got treated until it was too late. Code Smells... Sections of code that may indicate that there are deeper issues within the code base are called “code smells” because, like the milk carton in the fridge that’s starting to give off a bad odor, they should provoke a “there’s something off about this” reaction in a veteran developer. Sometimes, these are relatively benign items, like this Java code snippet: Java var aValue = foo.isFizz() ? foo.isFazz() ? true : false : false; This single line contains two different code smells: Multiple ternary operators are being used within the same statement: This makes the code hard to reason and needlessly increases the cognitive load of the code base. Returning hard-coded boolean values in the ternary statement, itself already being a boolean construction: This is unnecessary redundancy and suggests that there’s a fundamental misunderstanding of what the ternary statement is to be used for. In other words, the developers do not understand the tools that they are using. Both of these points can be addressed by eliminating the use of ternary statements altogether: Java var aValue = foo.isFizz() && foo.isFazz(); Some code smells, however, might indicate an issue that would require a more thorough evaluation and rewrite. For example, take this constructor for a Kotlin class: Kotlin class FooService( private val fieldOne: SomeServiceOne, private val fieldTwo: SomeServiceTwo, private val fieldThree: SomeServiceThree, private val fieldFour: SomeServiceFour, private val fieldFive: SomeServiceFive, private val fieldSix: SomeServiceSix, private val fieldSeven: SomeServiceSeven, private val fieldEight: SomeServiceEight, private val fieldNine: SomeServiceNine, private val fieldTen: SomeServiceTen, ) { Possessing a constructor that takes in ten arguments for ten different class members is an indicator that the FooService class might be conducting too much work within the application; i.e., the so-called “God Object” anti-pattern. Unfortunately, there’s no quick fix this time around. The code architecture would need to be re-evaluated to determine whether some of the functionality within FooService could be transferred to another class within the system, or whether FooService needs to be split up into multiple classes that conduct different parts of the workflow on their own. …And Their “Project” Equivalent Such a concept can be elevated to the level of the entire project around the code being developed as well: that there exist items within the project that signal that the project team is conducting practices that could lead to issues down the road. At a quick glance, all may appear to be fine in the project - no fires are popping up, the application is working as desired, and so on - but push just a bit, and the problems quickly reveal themselves, as the examples outlined below demonstrate. Victims Of Goodhart’s Law To reduce the likelihood of software quality issues causing problems for a software code base and its corresponding applications, the industry has developed tools to monitor the quality of the code that gets written like code linters, test coverage analysis, and more. Without a doubt, these are excellent tools; their effectiveness, however, depends on exactly how they’re being used. Software development departments have leveraged the reporting mechanisms of these code quality tools to produce metrics that function as gates for whether to let a development task proceed to the next stage of the software development lifecycle. For example, if a code coverage analysis service like SonarQube reports that the code within a given pull request’s branch only has testing for 75% of the code base, then the development team may be prohibited from integrating the code until the test coverage ratio improves. Note the specific wording there of whether the ratio improves, *not* whether more test cases have been added - the difference frequently comes to haunt the project’s quality. For those unfamiliar with “Goodhart’s Law,” it can be briefly summed up as, “When a measure becomes a target, it ceases to be a good measure.” What this means in software development is that teams run the risk of developing their code in exact accordance with the metrics that have been imposed upon the team and/or project. Take the aforementioned case of a hypothetical project needing to improve its test coverage ratio. Working in the spirit of the metric would compel a developer to add more test cases to the existing code so that more parts of the code base are being covered, but with Goodhart’s Law in effect, one would only need to improve the ratio however possible, and that could entail: Modifying the code coverage tool’s configuration so that it excludes whole swathes of the code base that do not have test coverage. This may have legitimate use cases - for example, excluding testing a boilerplate web request mechanism because the only necessary tests are for the generated client API that accompanies it - but it can easily be abused to essentially silence the code quality monitor. Generate classes that are ultimately untouched in actual project usage and whose only purpose is to be tested so that the code base has indeed “improved” its code coverage ratio. This has no defensible purpose, but tools like SonarQube will not know that they’re being fooled. Furthermore, there can be issues with the quality of the code quality verification itself. Code coverage signifies that code is being reached in the tests for the code base - nothing more, nothing less. Here’s a hypothetical test case for a web application (in Kotlin): Kotlin @Test fun testGetFoo() { val fooObject: FooDTO = generateRandomFoo() `when`(fooService.getFoo(1L)).thenReturn(fooObject) mockMvc.perform( MockMvcRequestBuilders.get("/foo/{fooId}", 1L) ).andExpect(status().isOk()) } This test code is minimally useful for actually verifying the behavior of the controller - the only piece of behavior being verified here is the HTTP code that the controller endpoint produces - but code coverage tools will nonetheless mark the code within the controller for this endpoint as “covered." A team that produces this type of testing is not actually checking for the quality of its code - it is merely satisfying the requirements imposed on it in the most convenient and quickest way possible - and is ultimately leaving itself open to issues with the code in the future due to not effectively validating its code. Too Much Of A Good Thing Manual execution in software testing is fraught with issues: input errors, eating up the limited time that a developer could be using for other tasks, the developer simply forgetting to conduct the testing, and so on. Just like with code quality, the software development world has produced tools (i.e., CI/CD services like Jenkins or CircleCI) that allow for tests to be executed automatically in a controlled environment, either at the behest of the developer or (ideally) entirely autonomously upon the occurrence of a configured event like the creation of a pull request within the code base. This brings enormous convenience to the developer as well as improves the ability to identify potential code quality issues within the project, but its availability can turn into a double-edged sword. It could be easy for the project team to develop an over-dependence on the service and run any and all tests only via the service and never on the developers’ local environments. In one Kotlin-based Spring Boot project that I had just joined, launching the tests for the code base on my machine would always fail due to certain tests not passing, yet: I had pulled the main code branch, hadn’t modified the code, and had followed all build instructions. The code base was passing all tests on the CI/CD server. No other coworkers were complaining about the same issue. Almost all the other coworkers on the project were using Macs, whereas I had been assigned a Windows as my work laptop. Another coworker had a Windows machine as well, yet even they weren’t complaining about the tests’ constant failures. Upon asking this coworker how they were able to get these tests to pass, I received an unsettling answer: they never ran the tests locally, instead letting the CI/CD server do all the testing for them whenever they had to change code. As bad as the answer was in terms of the quality of the project’s development - refusing to run tests locally meant a longer turnaround time between writing code and checking it, ultimately reducing programmer productivity - it at least gave me the lead to check on the difference between Windows and non-Windows environments with regards to the failing tests. Ultimately, the issue boiled down to the system clock: Instant.now() calls on *nix platforms like Linux and Mac were conducted with microsecond-level precision, whereas calls to Instant.now() on Windows were being conducted with nanosecond-level precision. Time comparisons in the failing tests were being conducted based on hard-coded values; since the developers were almost all using *nix-based Mac environments, these values were based on the lesser time precision, so when the tests ran in an environment with more precise time values, they failed. After forcing microsecond-based precision for all time values within the tests - and updating the project’s documentation to indicate as much for future developers - the tests passed, and now both my Windows-based colleague and I could run all tests locally. Ignoring Instability In software terminology, a test is deemed “unstable” if it has an inconsistent result: it might pass just as much as it might fail if one executes it, despite no changes having been done to the test or the affected code in between executions. A multitude of causes could be the fault of this condition: insufficient thread safety within the code, for example, could lead to a race condition wherein result A or result B could be produced depending on which thread is executed by the machine. What’s important is how the project development team opts to treat the issue. The “ideal” solution, of course, would be to investigate the unstable test(s) and determine what is causing the instability. However, there exist other “solutions” to test instability as well: Disable the unstable tests. Like code coverage exclusions, this does have utility under the right circumstances; e.g., when the instability is caused by an external factor that the development team has no ability to affect. Have the CI/CD service re-execute the code base’s tests until all tests eventually pass. Even this has valid applications - the CI/CD service might be subject to freak occurrences that cause testing one-off failures, for example - although it’s far likelier that the team just wants to get past the “all tests pass” gate of the software development lifecycle as quickly as possible. This second point was an issue that I came across in the same Kotlin-based Spring Boot project as above. While it was possible to execute all tests for the code base and have them pass, it was almost as likely to have different tests fail as well. Just as before, the convenience of the CI/CD testing automation meant that all that one needed to do upon receiving the failing test run notification was to hit the “relaunch process” button and then go back to one’s work on other tasks while the CI/CD service re-ran the build and testing process. These seemingly random test failures were occurring with enough frequency that it was evident that some issue was plaguing the testing code, yet the team’s lack of confronting the instability issue head-on and instead relying on what was, essentially, a roll of the dice with the CI/CD service was resulting in the software development lifecycle being prolonged unnecessarily and the team’s productivity being reduced. After conducting an investigation of the different tests that were randomly failing, I ultimately discovered that the automated database transaction rollback mechanism (via the @Transactional annotation for the tests) was not working as the developers had expected. This was due to two issues: Some code required that another database transaction be opened (via the @Transactional(propagation = Propagation.REQUIRES_NEW) annotation), and this new transaction fell outside of the reach of the “automated rollback” test transaction. Database transactions run within Kotlin coroutines were not being included within the test’s transaction mechanism, as those coroutines were being executed outside of the “automated rollback” test transaction’s thread. As a result, some data artifacts were being left in the database tables after certain tests; this “dirty” database was subsequently causing failures in other tests. Re-running the tests in CI/CD meant that these problems were being ignored in favor of simply brute-forcing an “all tests pass” outcome; in addition, the over-reliance on CI/CD for running the code base’s tests in general meant that there was no database that the team’s developers could investigate, as the CI/CD service would erase the test database layer after executing the testing job. A Proposal So, how do we go about discovering and rectifying such issues? Options might appear to be limited at first glance. Automated tools like SonarQube or CI/CD don’t have a “No, not like that!” setting where they detect where their own usefulness has been deliberately blunted by the development team’s practices. Even if breakthroughs in artificial intelligence were to produce some sort of meta-analysis capability, the ability for a team to configure exceptions would still need to exist. If a team has to fight against its tools too much, it’ll look for others that are more accommodating. Plus, Goodheart’s Law will still reign supreme, and one should never underestimate a team’s ability to work around statically-imposed guidelines to implement what it deems is necessary. Spontaneous discovery and fixes of project smells within the project team - that is, not having been brought on by the investigation that followed some catastrophe - are unlikely to occur. The development team’s main goal is going to be providing value to the company via the code that they develop; fixing project smells like unstable tests does not have the same demonstrable value as deploying new functionality to production. Besides, it’s possible that the team - being so immersed in their environment and the day-to-day practices - is simply unaware that there’s any issue at all. The fish, as they say, is the last to discover water! A better approach would be to have an outside look at the project and how it’s being developed from time to time. Providing a disinterested point of view with the dedicated task of finding project smells will have a better chance of success in rooting these issues out than people within the project who have the constraint of needing to actively develop the project as their principal work assignment. Take, say, a group of experienced developers from across the entire product development department and form a sort of “task force” that conducts an inspection of the different projects within the department at the interval of every three or six months. This team would examine, for example: What is the quality of the code coverage and whether the tests can be improved Whether tests - and the application itself! - can be run on all platforms that the project is supposed to support; i.e., both on the developer’s machine as well as in dedicated testing and production environments What is the frequency of unstable test results occurring and how these unstable tests are resolved Upon conducting this audit of the project, the review team would present its findings to the project development team and make suggestions for how to improve any issues that the review team has found. In addition, the review team would ideally conduct a follow-up with the team to understand the context of how such project smells came about. Such causes might be: A team’s lack of awareness of better practices, both project-wise and within the code that they are developing. Morale within the team is low enough that it only conducts the bare minimum to get the assigned functionality out the door. Company requirements are overburdening the team such that it can *only* conduct the bare minimum to get the assigned functionality out the door. This follow-up would be vital to help determine what changes can be made to both the team’s and the department’s practices in order to reduce the likelihood of such project smells recurring, as the three underlying causes listed above - along with other potential causes - would produce significantly different recommendations for improvement compared to one another. A Caveat As with any review process - such as code reviews or security audits - there must never be blame placed on one or more people in the team for any problematic practices that have been uncovered. The objective of the project audit is to identify what can be improved within the project, not to “name and shame” developers and project teams for supposedly “lazy” practices or other project smells. Some sense of trepidation about one’s project being audited would already be present - nobody wants to receive news that they’ve been working in a less-than-ideal way, after all - but making the process an event to be dreaded would make developer morale plummet. In addition, it’s entirely possible that the underlying reasons for the project smells existing were ultimately outside of the development team’s hands. A team that’s severely under-staffed, for example, might be fully occupied as it is frantically achieving its base objectives; anything like well-written tests or ensuring testability on a developer’s machine would be a luxury. Conclusion Preventative care can be a hard sell, as its effectiveness is measured by the lack of negative events occurring at some point in the future and cannot be effectively predicted. If a company were to be presented with this proposal to create an auditing task force to detect and handle project smells, it might object that its limited resources are better spent on continuing the development of new functionality. The Return On Investment for a brand-new widget, after all, is much easier to quantify than working towards preventing abstract (to the non-development departments in the company, at least) issues that may or may not cause problems for the company sometime down the line. Furthermore, if everything’s “working," why waste time changing it? To repeat the phrase from the beginning of this article, “It works until it doesn’t”, and that “doesn’t” could range from one team having to fix an additional set of bugs in one development sprint to a multi-day downtime that costs the company considerable revenue (or much worse!). The news is replete with stories about companies that neglected issues within their software development department until those issues blew up in their faces. While some companies that have avoided such an event so far have simply been lucky that events have played out in their favor, it would be a far safer bet to be proactive in hunting down any project smells within a company and avoid one’s fifteen minutes of infamy.

By Severn Everett

Human-Centered Approach to Service Reliability: Building Culture, Communication, and Collaboration

In the complex world of service reliability, the human element remains crucial despite the focus on digital metrics. Culture, communication, and collaboration are essential for organizations to deliver reliable services. In this article, I am going to dissect the integral role of human factors in ensuring service reliability and demonstrate the symbiotic relationship between technology and the individuals behind it. Reliability-Focused Culture First of all, let’s define what is a reliability-focused culture. Here are the key aspects and features that help build a culture of reliability and constant improvement across the organization. A culture that prioritizes reliability lies at the heart of any reliable service. It's a shared belief that reliability is not an option but a fundamental requirement. This cultural ethos is not an individual entity but a collective mindset implemented at every level of the company. Accountability should be fostered across teams in order to build a reliability-focused culture. When every team member sees themselves as a custodian of service reliability, it creates a powerful force that allows for preventing errors and resolving issues rapidly. This proactive approach, rooted in culture, becomes a shield against potential disruptions. Meta's renowned mantra, "Nothing at Meta is someone else's problem," encapsulates it perfectly. Continuous learning and adaptation are what help an organization embrace the culture of reliability. Teams are encouraged to analyze incidents, share insights, and implement improvements. This ensures that the company evolves and keeps a competitive advantage by staying ahead of potential reliability challenges and outages. The 2021 Facebook outage is a poignant example, albeit a painful one, of incident management processes and a cultural emphasis on learning and adaptation. Now that we have figured out the main features of the reliability-centered and communication-driven culture let us focus on the aspects that help build effective team organization and set processes to achieve the best results. Examples of Human-Centric Reliability Models Here are some examples of how a collaborative approach to reliability is implemented in major tech companies: Google's Site Reliability Engineering Site Reliability Engineering is a set of engineering practices Google uses to run reliable production systems and keep its vast infrastructure reliable. Google’s culture emphasizes automation, learning from incidents, and shared responsibility. It is one of the major aspects that brings the highest level of reliability to Google's services. Amazon’s Two-Pizza Teams Amazon is committed to small agile teams. This structure is known as two-pizza teams — meaning each team is small enough to be fed by two pizzas. This approach fosters effective communication and collaboration. These teams consist of employees from different disciplines who work together to ensure the reliability of the services they own. Spotify’s Squad Model Spotify's engineering culture revolves around "squads." These are small cross-functional teams that have full ownership of services throughout the whole development process. The squads model ensures that reliability is considered and accounted for from the early development phase through to operations. This approach has shown an improvement in overall service dependability. Implementing a Human-Centric Reliability Model Even though, at first glance, the ways the approach is implemented in different companies seem very different. There are some key points that any company needs to address in order to successfully switch to a collaborative approach to reliability. Here are the steps to follow if you want to improve the reliability of the service in your organization. Break Down Silos Isolated departments are a thing of the past. Collaborative approaches that appear instead recognize that reliability is a collective responsibility. For example, DevOps brings together development and operations teams. This helps create a unified mindset of these teams towards service reliability and converge the expertise from different domains, building a more robust reliability strategy. Establish Cross-Functional Incident Response Reliability challenges are rarely confined to a single domain. Collaboration across functions is essential for a comprehensive incident response. For instance, in the event of an incident, developers, operations, and customer support must work together seamlessly to identify and address the issue in the most efficient way. Set Shared Objectives To Align Teams Towards Shared Reliability Goals When developers understand how their code affects operations and operations understand the intricacies of development, it leads to more reliable services. Shared objectives lift the boundaries between the teams, creating a unified process of response to potential reliability issues. Work on Effective Communication Communication is the glue that holds these teams together. In complex technological ecosystems, different teams need to effectively collaborate to sustain service reliability. The goal is to build a web of well-interconnected teams, from developers and operations to customer support. Transparent communication and sharing knowledge about changes, updates, and potential challenges are crucial. The information flow should be seamless to enable a holistic understanding of the service throughout the company and reinforce trust among the teams. When everyone is aware of what is going on, they can anticipate and prepare, reducing the risks of miscommunication or taking the wrong steps. Teams must have clear channels for immediate communication to coordinate efforts and share crucial information. If an incident occurs, the speed and accuracy of communication determine how swiftly and effectively the issue is resolved. Challenges and Strategies To Overcome Them Organizational changes never come easy, and shifting a work paradigm requires a lot of effort from all parties involved. I am going to share some tips on how to overcome the most common challenges and point out the areas that require the most attention. Overcoming Resistance To Change Sometimes, new ideas and changes face resistance from the teams, which usually comes from the fact that the current approach already provides a decent level of reliability. Shifting towards a reliability-focused culture requires effective leadership, communication, and showcasing the benefits of the new approach. Investing in Training and Development Building effective communication and collaboration requires time and effort. Successful integration of a human-centered approach to reliability takes a significant investment in training programs. These programs should mainly focus on soft skills, such as communication, teamwork, and adaptability. Measuring and Iterating It is important to measure and iterate on collaboration effectiveness. Establish feedback loops and conduct regular retrospectives to identify areas of improvement and refine collaborative processes. Conclusion Besides the technical aspects, the key to smooth operations is the people. A workplace where everyone is committed to making things work, communicating effectively, and collaborating during challenging times sets the foundation for dependable services. I have experienced many service reliability challenges and witnessed first-hand how human touch can make all the difference. In today's world, service reliability is not just about flashy tech. It is also about everyday commitment, conversations, and teamwork. By focusing on these aspects, you can ensure that the service is rock-solid.

By Dmitry Basalai

Trunk-Based Git Model

What Is Trunk-Based Development? To create high-quality software, we must be able to trace any changes and, if necessary, roll them back. In trunk-based development, developers frequently merge minor updates into a shared repository, often referred to as the core or trunk (usually the main or master branch). Within trunk-based development, developers create short-lived branches with only a few commits. This approach helps ensure a smooth flow of production releases, even as the team size and codebase complexity increase. Main branch usage - Engineers actively collaborate on the main/master branch, integrating their changes frequently Short-lived feature branches - Goal is to complete work on these branches quickly and merge them back into the main/master branch Frequent integration - Engineers perform multiple integrations daily Reduced branching complexity - Maintain simple branching structures and naming conventions Early detection of issues - Integrations aid in identifying issues and bugs during the development phase Continuous Delivery/Deployment - Changes are always in a deployable state Feature toggles - Feature flags used to hide incomplete or work-in-progress features Trunk-Based Development (Image Source ) Benefits of Trunk-Based Development Here are some benefits of trunk-based development: Allows continuous code integration Reduces the risk of introducing bugs Makes it easy to fix and deploy code quickly Allows asynchronous code reviews Enables comprehensive automated testing Proposed Approach for a Smooth Transition Transitioning to a trunk-based Git branching model requires careful planning and consideration. Here’s a comprehensive solution addressing various aspects of the process: Current State Analysis Conduct a thorough analysis of the current version control and branching strategy. Identify pain points, bottlenecks, and areas that hinder collaboration and integration. Transition Plan Develop a phased transition plan to minimize disruptions and get it approved by the Product team. Clearly communicate the plan to the development and QA teams. Define milestones and success criteria for each phase. Trunk-Based Development Model Establish a single integration branch (e.g., “main”, “master” or “trunk”). Allow features to be developed and tested on the feature branch first without affecting the user experience. Define clear guidelines for pull requests (PRs) to maintain code quality. Encourage peer reviews and collaboration during the code review process. Develop a robust GitHub Actions pipeline to automate the build, test, and deployment processes. Add GitHub Actions to trigger automated workflows upon code changes. Automated Testing If automated tests are not currently in use, begin creating a test automation framework (choose it wisely) as it will serve as a backbone in the long run. Assuming there is currently no Test Case Management (TCM) tool like TestRail, test cases will be written either in Confluence or in Excel. Strengthen the automated testing suite to cover Smoke, Integration, Confidence, and Regression tests. Integrate automated tests into the GitHub workflow for rapid feedback. Create and schedule a nightly confidence test job that will act as a health check of the app and will run every night according to a specified schedule. The results will be posted daily on a Slack/Teams channel. Monitoring and Rollback Procedures The QA team should follow the Agile process, where for each new feature, the test plan and automated tests should be prepared before deployment. Dev and QA must go hand in hand. Implement monitoring tools to detect issues early in the development process. Establish rollback procedures to quickly revert changes in case of unexpected problems. Documentation and Training Ensure each member of the engineering team is well-versed in the GitHub/Release workflow, from branch creation to production release. Develop comprehensive documentation detailing the new branching model and associated best practices. Conduct training sessions for the development and QA teams to facilitate adaptation to these changes. Communication and Collaboration Plan Clearly communicate the benefits of the trunk-based model to the entire organization; conduct regular sessions throughout the initial year. Foster a culture of collaboration among Devs, QA, Product, and Stakeholders, encouraging shared responsibility. To enhance collaboration between Dev and QA, consider Sprint Planning, identifying Dependencies, Regular Syncs, Collaborative Automation, Learning Sessions, and holding regular retrospectives together. Key Challenges When Adopting Trunk-Based Development Testing: Trunk-based development requires a robust testing process to ensure that code changes do not break existing functionality. Inadequate automated test coverage may lead to unstable builds. Code review: With all developers working on the same codebase, it can be challenging to review all changes and ensure that they meet the necessary standards. Frequent integration might cause conflicts and integration issues. Automation: Automation is important to ensure that the testing and deployment process is efficient and error-free. In the absence of a rollback plan, teams may struggle to address issues promptly. Discipline: Trunk-based development requires a high level of discipline among team members to ensure proper adherence to the development process. Developers might fear breaking the build due to continuous integration. Collaboration: Coordinating parallel development on the main branch can be challenging. Release Flow: Step by step Branching and Commit Devs open a short-lived feature branch for any changes/improvements/features to be added to the codebase from the trunk OR master. Follow a generic format for naming : <wok_type>-<dev_name>-<issue-tracker-number>-<short-description> Follow the naming convention for feature branches, like: feature-shivam-SCT-456-user-authentication release-v2.0.1 bugfix-shivam-SCT-789-fix-header-styling While working on it, to test changes live, devs can conduct direct testing on the local environment or review instances. When a commit is pushed to the feature branch, GitHub actions for the following: Unit tests PR title validation Static code analysis (SAST) Sonarqube checks Security checks (Trivy vulnerability scan), etc. will trigger automatically Pull Request and Review Open a Pull Request; if the PR isn’t ready yet (make sure to add WIP), also add configured labels to PR to categorize it. Add CODEOWNERS into .github folder of the project. This will automatically assign (pre-configured) reviewers to your PR. Add pull_request_template.md to .github folder. This will show you a predefined PR template for every pull request. As soon as a PR is opened, a notification will be sent on Teams/Slack to inform reviewers about the PR. When a PR is raised, a smoke test will automatically trigger on the locally deployed app instance (with the latest changes as in the PR). After test completion, the test report will be sent to developers via email and notifications will be sent via Slack/Teams. Test reports and artifacts will be available to download on demand. Reviewers will review the PR and leave comments (if any), and then Approve/Request Changes on PR (if further changes are needed). *If any failure is critical or major, the team will fix it in the test Merge and Build Integration Once all GitHub Actions have passed, the developer/reviewer can merge the pull request into the trunk (master). Immediately after the PR is merged into the trunk (master), an automated build job will trigger to build the app with the latest changes in the Integration Environment. If the build job is successful, regression tests will automatically trigger in the Integration environment, and notifications will be sent on Teams/Slack. If tests fail, the QA team will examine the failures. If any issues genuinely break functionality, they will either roll back the commit OR add a HotFix. After resolving such issues, the team can proceed with promoting the changes to the staging environment. Create Release Branch and Tag At this point, cut the release branch from the trunk, such as release-v2.0.1. Add rule: If the branch name starts with release, trigger a GitHub Action to build the app instance to staging. If the build job is a success, regression tests will automatically trigger in the staging environment, and notifications will be sent on Teams/Slack. At this point, create a tag on the release branch e.g. ‘git tag -a -m “Releasing version 2.0.1” release-v2.0.1’ Add rule: If a protected tag is added matching a specific pattern, then deploy the app to production. Send the release notes to the Teams/Slack channel, notifying them about the successful production deployment. QA will perform sanity testing (manual+automation) after the prod deployment. Upon promotion to production, any issues take precedence over ongoing work for developers, QA, and the design team in general. * At any point, if the product or design teams want to conduct quick QA while changes are in testing, they can and should do so. This applies only to product/UI features/changes. They can also do the same during test reviews. The shorter the feedback loop, the better. Exceptions If any steps are not followed in either of the promotion cases (i.e., test → staging or staging → production), it must be clearly communicated why it was skipped and should only occur under necessary conditions. After promotion to staging, if the team discovers any blocker/critical UI issues, the dev team will address them following the same process as described earlier. The only exception is for non-critical issues/UI bugs, where the product team will decide whether to proceed with promotion to production or not. Exceptions can occur for smaller fixes such as copy changes, CSS fixes, or config updates. When it’s more important to roll out fast, test/QA in later iterations. Since iterations in this platform are fast and convenient, our workflow should evolve with this in mind, and to keep it that way, always! Conclusion In conclusion, the Trunk-Based Git Model serves as a valuable tool in the software development landscape, particularly for teams seeking a more straightforward, collaborative, and continuous integration-focused approach. As with any methodology, its effectiveness largely depends on the specific needs, goals, and dynamics of the development team and project at hand. If you enjoyed this story, please like and share it to help others find it! Feel free to leave a comment below. Thanks for your interest. Connect with me on LinkedIn.

By Shivam Bharadwaj

Beyond Murphy's Law

Murphy's Law ("Anything that can go wrong will go wrong and at the worst possible time.") is a well-known adage, especially in engineering circles. However, its implications are often misunderstood, especially by the general public. It's not just about the universe conspiring against our systems; it's about recognizing and preparing for potential failures. Many view Murphy's Law as a blend of magic and reality. As Site Reliability Engineers (SREs), we often ponder its true nature. Is it merely a psychological bias where we emphasize failures and overlook our unnoticed successes? Psychology has identified several related biases, including Confirmation and Selection biases. The human brain tends to focus more on improbable failures than successes. Moreover, our grasp of probabilities is often flawed – the Law of Truly Large Numbers suggests that coincidences are, ironically, quite common. However, in any complex system, a multitude of possible states exist, many of which can lead to failure. While safety measures make a transition from a functioning state to a failure state less likely, over time, it's more probable for a system to fail than not. The real lesson from Murphy's Law isn't just about the omnipresence of misfortune in engineering but also how we respond to it: through redundancies, high availability systems, quality processes, testing, retries, observability, and logging. Murphy's Law makes our job more challenging and interesting! Today, however, I'd like to discuss a complementary or reciprocal aspect of Murphy's Law that I've often observed while working on large systems: Complementary Observations to Murphy's Law The Worst Possible Time Complement Often overlooked, this aspect highlights the 'magic' of Murphy's Law. Complex systems do fail, but not so frequently that we forget them. In our experience, a significant number of failures (about one-third) occur at the worst possible times, such as during important demos. For instance, over the past two months, we had a couple of important demos. In the first demo, the web application failed due to a session expiration issue, which rarely occurs. In the second, a regression embedded in a merge request caused a crash right during the demo. These were the only significant demos we had in that period, and both encountered failures. This phenomenon is often referred to as the 'Demo Effect.' The Conjunction of Events Complement The combination of events leading to a breakdown can be truly astonishing. For example, I once inadvertently caused a major breakdown in a large application responsible for sending electronic payrolls to 5 million people, coinciding with its production release day. The day before, I conducted additional benchmarks (using JMeter) on the email sending system within the development environment. Our development servers, like others in the organization, were configured to route emails through a production relay, which then sent them to the final server in the cloud. Several days prior, I had set the development server to use a mock server since my benchmark simulated email traffic peaks of several hundred thousand emails per hour. However, the day after my benchmarking, when I was off work, my boss called to inquire if I had made any special changes to email sending, as the entire system was jammed at the final mail server. Here’s what had happened: An automated Infrastructure as Code (IAC) tool overwrote my development server configuration, causing it to send emails to the actual relay instead of the mock server; The relay, recognized by the cloud provider, had its IP address changed a few days earlier; The whitelist on the cloud side hadn't been updated, and a throttling system blocked the final server; The operations team responsible for this configuration was unavailable to address the issue. The Squadron Complement Problems often cluster, complicating resolution efforts. These range from simultaneous issues exacerbating a situation to misleading issues that divert us from the real problem. I can categorize these issues into two types: 1. The Simple Additional Issue: This typically occurs at the worst possible moment, such as during another breakdown, adding more work, or slowing down repairs. For instance, in a current project I'm involved with, due to legacy reasons, certain specific characters inputted into one application can cause another application to crash, necessitating data cleanup. This issue arises roughly once every 3 or 4 months, often triggered by user instructions. Notably, several instances of this issue have coincided with much more severe system breakdowns. 2. The Deceitful Additional Issue: These issues, when combined with others, significantly complicate post-mortem analysis and can mislead the investigation. A recent example was an application bug in a Spring batch job that remained obscured due to a connection issue with the state-storing database caused by intermittent firewall outages. The Camouflage Complement Using ITIL's problem/incidents framework, we often find incidents that appear similar but have different causes. We apply the ITIL framework's problem/incident dichotomy to classify issues where a problem can generate one or more incidents. When an incident occurs, it's crucial to conduct a thorough analysis by carefully examining logs to figure out if this is only a new incident of a known problem or an entire new problem. Often, we identify incidents that appear similar to others, possibly occurring on the same day, exhibiting comparable effects but stemming from different causes. This is particularly true when incorrect error-catching practices are in place, such as using overly broad catch(Exception) statements in Java, which can either trap too many exceptions or, worse, obscure the root cause. The Over-Accident Complement Like chain reactions in traffic accidents, one incident in IT can lead to others, sometimes with more severe consequences. I can recall at least three recent examples illustrating our challenges: 1. Maintenance Page Caching Issue: Following a system failure, we activated a maintenance page, redirecting all API and frontend calls to this page. Unfortunately, this page lacked proper cache configuration. Consequently, when a few users made XHR calls precisely at the time the maintenance page was set up, it was cached in their browsers for the entire session. Even after maintenance ended and the web application frontend resumed normal operation, the API calls continued to retrieve the HTML maintenance page instead of the expected JSON response due to this browser caching. 2. Debug Verbosity Issue: To debug data sent by external clients, we store payloads into a database. To maintain a reasonable database size, we limited the stored payload sizes. However, during an issue with a partner organization, we temporarily increased the payload size limit for analysis purposes. This change was inadvertently overlooked, leading to an enormous database growth and nearly causing a complete application crash due to disk space saturation. 3. API Gateway Timeout Handling: Our API gateway was configured to replay POST calls that ended in timeouts due to network or system issues. This setup inadvertently led to catastrophic duplicate transactions. The gateway reissued requests that timed out, not realizing these transactions were still processing and would eventually complete successfully. This resulted in a conflict between robustness and data integrity requirements. The Heisenbug Complement A 'heisenbug' is a type of software bug that seems to alter or vanish when one attempts to study it. This term humorously references the Heisenberg Uncertainty Principle in quantum mechanics, which posits that the more precisely a particle's position is determined, the less precisely its momentum can be known, and vice versa. Heisenbugs commonly arise from race conditions under high loads or other factors that render the bug's behavior unpredictable and difficult to replicate in different conditions or when using debugging tools. Their elusive nature makes them particularly challenging to fix, as the process of debugging or introducing diagnostic code can change the execution environment, causing the bug to disappear. I've encountered such issues in various scenarios. For instance, while using a profiler, I observed it inadvertently slowing down threads to such an extent that it hid the race conditions. On another occasion, I demonstrated to a perplexed developer how simple it was to reproduce a race condition on non-thread-safe resources with just two or three threads running simultaneously. However, he was unable to replicate it in a single-threaded environment. The UFO Issue Complement A significant number of issues are neither fixed nor fully understood. I'm not referring to bugs that are understood but deemed too costly to fix in light of their severity or frequency. Rather, I'm talking about those perplexing issues whose occurrence is extremely rare, sometimes happening only once. Occasionally, we (partially) humorously attribute such cases to Single Event Errors caused by cosmic particles. For example, in our current application that generates and sends PDFs to end-users through various components, we encountered a peculiar issue a few months ago. A user reported, with a screenshot as evidence, a PDF where most characters appeared as gibberish symbols instead of letters. Despite thorough investigations, we were stumped and ultimately had to abandon our efforts to resolve it due to a complete lack of clues. The Non-Existing Issue Complement One particularly challenging type of issue arises when it seems like something is wrong, but in reality, there is no actual bug. These non-existent bugs are the most difficult to resolve! The misconception of a problem can come from various factors, including looking in the wrong place (such as the incorrect environment or server), misinterpreting functional requirements, or receiving incorrect inputs from end-users or partner organizations. For example, we recently had to address an issue where our system rejected an uploaded image. The partner organization assured us that the image should be accepted, claiming it was in PNG format. However, upon closer examination (that took us several staff-days), we discovered that our system's rejection was justified: the file was not actually a PNG. The False Hope Complement I often find Murphy's Law to be quite cruel. You spend many hours working on an issue, and everything seems to indicate that it is resolved, with the problem no longer reproducible. However, once the solution is deployed in production, the problem reoccurs. This is especially common with issues related to heavy loads or concurrency. The Anti-Murphy's Reciprocal In every organization I've worked for, I've noticed a peculiar phenomenon, which I'd call 'Anti-Murphy's Law.' Initially, during the maintenance phase of building an application, Murphy’s Law seems to apply. However, after several more years, a contrary phenomenon emerges: even subpar software appears not only immune to Murphy's Law but also more robust than expected. Many legacy applications run glitch-free for years, often with less observation and fewer robustness features, yet they still function effectively. The better the design of an application, the quicker it reaches this state, but even poorly designed ones eventually get there. I have only some leads to explain this strange phenomenon: Over time, users become familiar with the software's weaknesses and learn to avoid them by not using certain features, waiting longer, or using the software during specific hours. Legacy applications are often so difficult to update that they experience very few regressions. Such applications rarely have their technical environment (like the OS or database) altered to avoid complications. Eventually, everything that could go wrong has already occurred and been either fixed or worked around: it's as if Murphy's Law has given up. However, don't misunderstand me: I'm not advocating for the retention of such applications. Despite appearing immune to issues, they are challenging to update and increasingly fail to meet end-user requirements over time. Concurrently, they become more vulnerable to security risks. Conclusion Rather than adopting a pessimistic view of Murphy's Law, we should be thankful for it. It drives engineers to enhance their craft, compelling them to devise a multitude of solutions to counteract potential issues. These solutions include robustness, high availability, fail-over systems, redundancy, replays, integrity checking systems, anti-fragility, backups and restores, observability, and comprehensive logging. In conclusion, addressing a final query: can Murphy's Law turn against itself? A recent incident with a partner organization sheds light on this. They mistakenly sent us data and relied on a misconfiguration in their own API Gateway to prevent this erroneous transmission. However, by sheer coincidence, the API Gateway had been corrected in the meantime, thwarting their reliance on this error. Thus, the answer appears to be a resounding NO.

By Bertrand Florat

The New Frontier in Cybersecurity: Embracing Security as Code

How We Used to Handle Security A few years ago, I was working on a completely new project for a Fortune 500 corporation, trying to bring a brand new cloud-based web service to life simultaneously in 4 different countries in the EMEA region, which would later serve millions of users. It took me and my team two months to handle everything: cloud infrastructure as code, state-of-the-art CI/CD workflows, containerized microservices in multiple environments, frontend distributed to CDN, and tests passing in the staging environment. We were so prepared that we could go live immediately with just one extra click of a button. And we still had a whole month before the planned release date. I know, things looked pretty good for us; until they didn't: because it was precisely at that moment a "security guy" stepped in out of nowhere and caused us two whole weeks. Of course, the security guy. I knew vaguely that they were from the same organization but maybe a different operational unit. I also had no idea that they were involved in this project before they showed up. But I could make a good guess: writing security reports and conducting security reviews, of course. What else could it be? After finishing those reports and reviews, I was optimistic: "We still have plenty of time," I told myself. But it wasn't long before my thought was rendered untrue by another unexpected development: an external QA team jumped in and started security tests. Oh, by the way, to make matters worse, the security tests were manual. It was two crazy weeks of fixing and testing and rinsing and repeating. The launch was delayed, and even months after the Big Bang release, the whole team was still miserable: busy on-call, fixing issues, etc. Later, I would continue to see many other projects like this one, and I am sure you also have similar experiences. This is actually how we used to do security. Everything is smooth until it isn't because we traditionally tend to handle the security stuff at the end of the development lifecycle, which adds cost and time to fix those discovered security issues and causes delays. Over the years, software development has evolved to agile and automatic, but how we handle security hasn't changed much: security isn't tackled until the last minute. I keep asking myself: what could've been done differently? Understanding DevSecOps and Security as Code DevSecOps: Shift Security to the Left of the SDLC Based on the experience of the project above (and many other projects), we can easily conclude why the traditional way of handling security doesn't always work: Although security is an essential aspect of software, we put that aspect at the end of the software development lifecycle (SDLC). When we only start handling the critical aspect at the very end, it's likely to cause delays because it might cause extra unexpected changes and even rework. Since we tend to do security only once (at least we hope so), in the end, we usually wouldn't bother automating our security tests. To make security great again, we make two intuitive proposals for the above problems: Shift left: why does security work at the end of the project, risking delays and rework? To change this, we want to integrate security work into every stage of the SDLC: shifting from the end (right side) to the left (beginning of the SDLC, i.e., planning, developing, etc.), so that we can discover potential issues earlier when it's a lot easier and takes much less effort to fix or even rework. Automation: why do we do security work manually, which is time-consuming, error-prone, and hard to repeat? Automation comes to the rescue. Instead of manually defining policies and security test cases, we take a code-based approach that can be automated and repeated easily. We combine the "shift left" part and the "automation" part, and bam, we get DevSecOps: a practice of integrating security at every stage, throughout the SDLC to accelerate delivery via automation, collaboration, fast feedback, and incremental, iterative improvements. What Is Security as Code (SaC)? Shifting security to the left is more of a change in the mindset; what's more important is the automation, because it's the driving force and the key to achieving a better security model: without proper automation, it's difficult, if not impossible at all, to add security checks and tests at every stage of the SDLC without introducing unnecessary costs or delays. And this is the idea of Security as Code: Security as Code (SaC) is the practice of building and integrating security into tools and workflows by identifying places where security checks, tests, and gates may be included. For those tests and checks to run automatically on every code commit, we should define security policies, tests, and scans as code in the pipelines. Hence, security "as code". From the definition, we can see that Security as Code is part of DevSecOps, or rather, it's how we can achieve DevSecOps. The key differences between Security as Code/DevSecOps and the traditional way of handling security are shifting left and automation: we try to define security in the early stages of SDLC and tackle it in every stage automatically. The Importance and Benefits of Security as Code (SaC) The biggest benefit of Security as Code, of course, is that it's accelerating the SDLC. How so? I'll talk about it from three different standpoints: First of all, efficiency-boosting: security requirements are defined early at the beginning of a project when shifting left, which means there won't be major rework in the late stage of the project with clearly defined requirements in the first place, and there won't be a dedicated security stage before the release. With automated tests, developers can make sure every single incremental code commit is secure. Secondly, codified security allows repeatability, reusability, and consistency. Development velocity is increased by shorter release cycles without manual security tests; security components can be reused at other places and even in other projects; changes to security requirements can be adopted comprehensively in a "change once, apply everywhere" manner without repeated and error-prone manual labor. Last but not least, it saves a lot of time, resources, and even money because, with automated checks, potential vulnerabilities in the development and deployment process will be caught early on in the SDLC when remediating the issues has a much smaller footprint, cost- and labor-wise. To learn more about how adding security into DevSecOps accelerates the SDLC in every stage read this blog here. Key Components of Security as Code The components of Security as Code for application development are automated security tests, automated security scans, automated security policies, and IaC security. Automated security tests: automate complex and time-consuming manual tests (and even penetration tests) via automation tools and custom scripts, making sure they can be reused across different environments and projects. Automated security scans: we can integrate security scans CI/CD pipelines so that they can be triggered automatically and can be reused across different environments and projects. We can do all kinds of scans and analyses here, for example, static code scans, dynamic analyses, and vulnerability scans against known vulnerabilities. Automated security policies: we can define different policies as code using a variety of tools, and integrate the policy checks with our pipelines. For example, we can define access control in RBAC policies for different tools; we can enforce policies in microservices, Kubernetes, and even CI/CD pipelines (for example, with the Open Policy Agent). To know more about Policy as Code and Open Policy Agent, read this blog here. IaC security: in modern times, we often define our infrastructure (especially cloud-based) as code (IaC) and deploy it automatically. We can use IaC to ensure the same security configs and best practices are applied across all environments, and we can use Security as Code measures to make sure the infrastructure code itself is secure. To do so, we integrate security tests and checks within the IaC pipeline, as with the ggshield security scanner for your Terraform code. Best Practices for Security as Code With the critical components of Security as Code sorted out, let's move on to a few best practices to follow. Security-First Mindset for Security as Code/DevSecOps First of all, since Security as Code and DevSecOps are all about shifting left, which is not only a change of how and when we do things, but more importantly, a change of the mindset, the very first best practice for Security as Code and DewvSecOps is to build (or rather, transition into) a security-first mindset. At Amazon, there is a famous saying, which is "Security is job zero". Why do we say that? Since it's important, if you only start dealing with it in the end, there will be consequences. Similar to writing tests, trying to fix issues or even rework components because of security issues found at the end of a project's development lifecycle can be orders of magnitude harder compared to when the code is still fresh, the risk just introduced, and no other components relying on it yet. Because of its importance and close relationship with other moving parts, we want to shift security to the left, and the way to achieve that is by transitioning into a security-first mindset. If you want to know more about DevSecOps and why "adding" security into your SDLC doesn't slow down, but rather speeds up things, refer blog here detailing exactly that. Code Reviews + Automated Scanning Security as Code is all about automation, so it makes sense to start writing those automated tests as early as possible, so that they can be used at the very beginning of the SDLC, acting as checks and gates, accelerating the development process. For example, we can automate SAST/DAST (Static Application Security Testing and DynamicApplication Security Testing) with well-known tools (for example, SonarQube, Synopsys, etc.) into our CI/CD pipelines so that everything runs automatically on new commits. One thing worth pointing out is that SAST + DAST isn't enough: while static and dynamic application tests are the cornerstones of security, there are blind spots. For example, one hard-coded secret in the code is more than enough to compromise the entire system. Two approaches are recommended as complements to the automated security tests. First of all, regular code reviews help. It's always nice to have another person's input on the code change because the four-eyes principle can be useful. For more tips on conducting secure code reviews, read this blog here. However, code reviews can only be so helpful to a certain extent, because, first of all, humans still tend to miss mistakes, and second of all, during code reviews, we mainly focus on the diffs rather than what's already in the code base. As a complement to code reviews, having security scanning and detection in place also helps. Continuous Monitoring, Feedback Loops, Knowledge Sharing Having automated security policies as checks can only help so much if the automated security policies themselves aren't of high quality, or worse, the results can't reach the team. Thus, creating a feedback loop to continuously deliver the results to the developers and monitor the checks is critical, too. It's even better if the monitoring can create logs automatically and display the result in a dashboard to make sure no security risk is found, no sensitive data or secret is shared, and developers can find breaches early so that they can remediate issues early. Knowledge sharing and continuous learning can also be helpful, allowing developers to learn best practices during the coding process. Security as Code: What You Should Keep in Mind Besides the best practices above, there are a few other considerations and challenges when putting Security as Code into action: First of all, we need to balance speed and security when implementing Security as Code/DevSecOps. Yes, I know, earlier that we mentioned how security is job zero and how doing DevSecOps actually speeds things up, but still, implementing Security as Code in the early stages of the SDLC still costs some time upfront, and this is one of the balances we need to consider carefully, to which, unfortunately there is no one-size-fits-all answer. In general, for big and long-lasting projects, paying those upfront costs will most likely be very beneficial in the long run, but it could be worth a second thought if it's only a one-sprint or even one-day minor task with relatively clear changes and confined attack surface. Secondly, to effectively adopt Security as Code, the skills and gaps in the team need to be identified, and continuous knowledge sharing and learning are required. This can be challenging in the DevOps/DevSecOps/Cloud world, where things evolve fast with new tools and even new methodologies emerging from time to time. Under this circumstance, it's critical to keep up with the pace, identify what could potentially push engineering productivity and security to another level, figure out what needs to be learned because we've only got limited time and can't learn everything, and learn those highly-prioritized skills quickly. Last but not least, keep a close eye on newly discovered security threats and changes regarding security regulations, which are also evolving with time. Conclusions and Next Steps Security as Code isn't just a catchphrase; it requires continuous effort to make the best out of it: a change of mindset, continuous learning, new skills, new tooling, automation, and collaboration. As a recap, here's a list of the key components: Automated security tests Automated security scans Automated security policies IaC security And here's a list of the important aspects when adopting it proactively: Security-first, mindset change, shift left Automated testing/scanning with regular code reviews Continuous feedback and learning At last, let's end this article with an FAQ list on Security as Code and DevSecOps. F.A.Q. What is Security as Code (SaC)? Security as Code (SaC) is the practice of building and integrating security into tools and workflows by defining security policies, tests, and scans as code. It identifies places where security checks, tests, and gates may be included in pipelines without adding extra overhead. What is the main purpose of Security as Code? The main purpose of Security as Code is to boost the SDLC by increasing efficiency and saving time and resources while minimizing vulnerabilities and risks. This approach integrates security measures into the development process from the start, rather than adding them at the end. What is the relationship between Security as Code and DevSecOps? DevSecOps is achieved by shifting left and automation; Security as Code handles the automation part. Security as Code is the key to DevSecOps. What is the difference between DevSecOps and secure coding? DevSecOps focuses on automated security tests and checks, whereas secure coding is the practice of developing computer software in such a way that guards against the accidental introduction of security vulnerabilities. Defects, bugs, and logic flaws are consistently the primary cause of commonly exploited software vulnerabilities. What is Infrastructure as Code (IaC) security? IaC security uses the security as a code approach to enhance infrastructure code. For example, consistent cloud security policies can be embedded into the infrastructure code itself and the pipelines to reduce security risks.

By Tiexin Guo

Navigating Software Development With Kanban

Kanban, a well-known agile framework, has significantly influenced project management and software development. With its roots in Japanese manufacturing practices, Kanban has evolved to meet the changing needs of modern software development. This article explores Kanban's principles, history, benefits, and potential drawbacks and compares it with other agile methodologies. It also discusses tools for implementing Kanban. Kanban, which means "visual signal" or "card" in Japanese, is a workflow management method designed to optimize workflow. It was developed in the late 1940s by Toyota engineer Taiichi Ohno as a scheduling system for lean manufacturing. The application of Kanban to software development began in the early 2000s, providing a more flexible approach to managing software projects. Principles of Kanban in Software Development Kanban relies on the following principles: Visualize Work Visual models help teams understand and track tasks. A visual representation of ongoing work and the workflow provides teams with the opportunity to observe task progression through stages. This enhances transparency and gives a clear overview of workload and progress, aiding in effective project management. For instance, a software development team could use a Kanban board to visualize work. The board is divided into sections like "Backlog," "In Progress," "Testing," and "Done." Each task is a card that moves from one section to another as it progresses. The card might include details like task title, description, assignee, and deadline. This way, team members can easily see task status, understand workload distribution, and monitor project progress. Limit Work in Progress (WIP) This principle involves limiting the number of tasks in a specific phase at any given time. By controlling the amount of work in progress, potential bottlenecks that could slow the project down are reduced. This approach allows team members to concentrate on fewer tasks at once, boosting productivity and efficiency. It also promotes a smoother workflow and easier project tracking and management. For instance, a software development team using a Kanban board with sections like "To Do," "In Progress," "Review," and "Done" could set a WIP limit of 3 for the "In Progress" section. This means there can be a maximum of 3 tasks in the "In Progress" stage at any time. If a developer finishes a task and moves it to the "Review" section, only then can a new task be moved from "To Do" to "In Progress." This WIP limit helps the team avoid overloading and maintain focus, improving productivity and project flow. Continuous Improvement This involves monitoring and enhancing the workflow within a project or process to reduce cycle time. Decreasing cycle time can significantly increase delivery speed, improving efficiency and making the project or process more adaptable to changes. Effective flow management can lead to better project outcomes and higher customer satisfaction. For instance, if a software development team notices that tasks often get stuck in the "Code Review" section of their Kanban board, increasing the overall cycle time, they could decide to allocate more resources to that stage and hold regular meetings to discuss any issues slowing down the review process. As a result, the "Code Review" section becomes more fluid, the overall cycle time decreases, and the speed of delivery increases. This improved efficiency makes the project more responsive to changes and leads to higher customer satisfaction. Why Kanban Is Effective in Software Development Kanban is an effective method for software development due to its inherent flexibility and a strong emphasis on continuous delivery and improvement. Implementing a Kanban system allows software development teams to swiftly adapt to changes or unexpected challenges, prioritize tasks effectively based on their importance, urgency, and impact, and significantly reduce the time-to-market. This improves overall efficiency. The key benefits of implementing a Kanban system in software development are numerous: Improved Flexibility: Kanban's non-prescriptive nature lets teams adapt and customize processes to their specific needs and requirements. This flexibility promotes innovation and encourages continual workflow improvement, enhancing productivity. Enhanced Visibility: Kanban boards provide clear and immediate insights into project status, workflow, and potential bottlenecks. This visibility facilitates communication, promotes accountability, and enables quick issue identification and resolution. Reduced Overburdening: By limiting work in progress (WIP), Kanban prevents teams from taking on too much work at once, reducing overburdening risk. This ensures team members can focus on work quality rather than quantity, leading to higher-quality outcomes. Continuous Delivery: Kanban encourages a culture of continuous work and constant improvement, resulting in a steady flow of releases. This continuous delivery ensures regular delivery of new features, updates, or fixes to customers, enhancing customer satisfaction and maintaining a competitive edge. While Kanban bears similarities to other agile frameworks like Scrum, it distinguishes itself in several crucial areas: Flexibility: A significant difference between Kanban and Scrum is their scheduling approach. Scrum operates on fixed-length sprints with work divided into discrete, time-boxed periods. Kanban, in contrast, offers a more fluid approach. Work can be added, removed, or rearranged at any point, letting teams respond quickly to changes or new priorities. Roles and Ceremonies: Another unique aspect of Kanban is its lack of prescribed roles or ceremonies. Unlike Scrum, which has specific roles (such as Scrum Master and Product Owner) and ceremonies (like the sprint planning meeting and daily scrum), Kanban allows more freedom in team structure and operation, benefiting teams requiring more flexibility or unique operational needs. Challenges With Kanban Less Predictability The lack of a structured timeline due to the absence of fixed sprints can potentially make it more challenging to accurately predict when tasks will be completed. Without a set schedule, the timeframe for task completion may be more uncertain, which can potentially disrupt project timelines and deliverables, making the management of the project a bit more complex. Overemphasis on Current Work There's a risk that when we put too much focus and priority on our current tasks, we may inadvertently overlook aspects that are crucial for our future growth and development. This could potentially lead to the neglect of long-term planning and strategic objectives. It's important to strike a balance between managing current responsibilities and paving the way for future success. Reliance on Team Discipline In order for any team to function effectively and meet its objectives, a high level of discipline is absolutely essential. Team members must show commitment, adherence to rules, and a sense of responsibility. This discipline is not just about following orders but understanding them and their importance within the larger context of the team's mission. It is about each team member comprehending their role and carrying it out with consistency and dedication. This level of discipline and comprehension from the team is not just desirable but necessary for the achievement of collective goals. Tools for Implementing Kanban Several digital tools facilitate the implementation of Kanban in software development, including Jira, Trello, and Asana. These tools offer digital Kanban boards, WIP limits, and analytics for workflow management. Conclusion Kanban, as an agile framework, offers significant advantages for software development, including flexibility, visibility, reduced overburdening, and continuous delivery. It is a versatile methodology that can be shaped to meet the specific needs of a team or project. While it presents challenges such as less predictability, potential overemphasis on current work, and reliance on team discipline, understanding these challenges allows teams to implement strategies to counteract them. Despite these challenges, the benefits of implementing a Kanban system in software development can lead to enhanced productivity, improved efficiency, and higher customer satisfaction.

By Harsha Vardhan Mudumba Venkata

5 Steps To Tame Unplanned Work

In an ideal world, there are no zero-day security patches that absolutely must be applied today. There are no system outages - shortage never becomes full; APIs don't stop accepting parameters they accepted yesterday, users don't call support desks with tricky problems and everyone else writes code as good as you do so there are no bugs. Maybe one day, but until then, there is unplanned but urgent work. Whether you call it DevOps, support, maintenance, or some other euphemism, it is work that just appears and demands to be done. The problem is this work is highly disruptive and destroys the best-laid plans. While sticking one's head in the sand and pretending the work does not exist, it does exist and it demands attention. The question then is: how best can such work be managed? Step 1: Acceptance Despite attempts by Project Managers and Scrum Masters to keep teams dedicated to shiny new things, "Stuff happens," as they say. So to start off with, everyone needs to stop wishful thinking. Put it this way: if you come up with a way of handling unplanned but urgent work and it never happens again then you have only lost a few hours of thinking. However, if you don't plan and it does happen then it is going to take up a lot more time and effort. Let's also accept that there is no Bug Fixing Fairy or DevOps Goblin. One day ChatGPT might be able to handle the unexpected, but until then, doing this work is going to require time and effort from the people who would otherwise be doing the shiny new stuff. There is only one pot of capacity and if Lenny is applying a security patch, she isn't working on new features. Step 2: Capture and Make Visible Back in the days of physical boards, I would write out a yellow card and put it on the board. These days it is probably a ticket in an electronic system; however, you do it to make sure those tickets are easily identified so people can easily see them and they can be found later. This may sound obvious, but an awful lot of unplanned work is done under the covers; because "it should not exist," some people object to it being done. Even when the work is trivial it still represents disruption and task switching, and it might fall disproportionally on some individuals. A few years ago I met a team who were constantly interrupted by support work. They felt they couldn't do anything new because of the stream of small, and large, unexpected work. I introduced the yellow card system and with that one change, the problem went away. What happened was that it quickly became clear that the team had, on average, only one unexpected urgent request a week. But because the requests were high profile, people kept asking about them and pressure was heaped on to fix it. Once the request was a clearly identifiable yellow card on a work board, everybody in the company could see it. Anyone could walk up to the board and see the status and who was working on it. Overnight the bulk of the interruptions went away. A lot of the previous interruptions had been people asking, "What is happening?" It was that simple. Once the work is captured and made visible, people can start to see that requests are respected and action taken. Now attention can turn to priority. Step 3: Meaningful Prioritization When trust is absent, nobody believes that anything will be done so everyone demands their ask has top priority. In time, as people learn requests are respected and acted on; then meaningful priority decisions can be made. I met a company where one of the founders had a habit of picking up customer requests, walking across to a dev team member, and asking, "Can you just...?" Of course, the dev wouldn't say "no" to a founder, so their work was interrupted. In addition to the yellow card, a priority discussion was needed. The team was practicing a basic Agile system with all the work for the next Sprint listed in priority order on their Kanban board. When the founder came over, the dev would write the request on a yellow card, then walk over to the Kanban board and ask, "You can see the card I'm working on right now: should I stop what I'm doing and do the new request?" If the founder said "yes," then the dev would honor the request. But seeing what would be interrupted, the founder was likely to say, "Well, finish that first." Now the dev would put the card next to the priority work queue and ask, "Is this the new priority number one?" If it was, the card went to the top of the queue and the next available person would pick it up. If not, the dev would work down the list: "The new priority two? three? four?" We had a simple feedback process. Beforehand, the founder had no idea of the consequences of asking, "Can you just. . .?" Now they could see what work would be delayed, displaced, and interrupted. These three steps alone can manage a lot of unplanned but urgent work and improve team performance quickly, yet it can still be better provided your ticketing system is capturing all these instances physically or electronically. Step 4: Use Data To Get Better At the most basic level, count the yellow cards and count the other planned work cards. Now calculate the ratio of how much planned v. unplanned work is going on over time and feed this back into planning. For example, one development manager calculated that unplanned work accounted for 20% to 25% of the team's Sprint on average (using several months of data.) So to start with, they scheduled 20% less work per Sprint. This alone made the team more predictable. You don't even need to bother with estimates here - you can if you really want to, but this is not going to be perfect, just good enough. Simply knowing the ratio is a start. Over time you may refine your calculations, but start simple. Step 5: Fix at Source Use the yellow tickets as part of your retrospective. Start by considering if the tickets really did deserve to be fast-tracked into work. The team should talk about this with the Product Owner/Manager and other stakeholders. Now everyone can appreciate the impact you might find that some requests really shouldn't be prioritised. Next, you can look at the nature of the requests and see if there is some pattern. Perhaps many originate from one stakeholder. Perhaps somebody should go and talk with them about why they generate so many unplanned requests - maybe they can start making requests before Sprint planning. Similarly, it might be one customer who is raising many tickets. Perhaps this customer has some specific problems, perhaps they have never had training or simply don't appreciate how the system works. It could be that some particular sub-system or third-party component is causing problems. Some remedial work/some refactoring could help, or maybe the component needs to be replaced entirely. When there is data arguing for the time to do such work is a lot easier. You might find that the team lacks particular skills or needs expansion. The same manager who budgeted for 20% of unplanned work later analyzed the tickets and found that the majority of the unplanned tickets were for general IT support which didn't need specialist coding skills. These findings were taken to the big boss and permission was quickly given to hire an IT support technician. Getting a 20% boost to programmer productivity for less than the cost of another coder was too good a deal to miss! Sometimes expanding the team isn't an option, and sometimes the tickets don't point to a missing skill. Another solution is the "sacrificial one." In these cases, a team of five experiencing 20% unplanned work will have one member handle the unplanned stuff. This means one team member soaks up all the interruptions which allows the others to stay focused. Few people like being the sacrificial lamb so rotate the position regularly; say, every Sprint. In one example, team members were concerned it would not work because they each specialized in different parts of the system. They agreed to try and work on any ticket that came up during their week as on-support. Even though it would take them longer than a colleague who specialized in that part of the system, it could increase overall throughput. After a few weeks, team members found they could deal with far more tickets than they expected and were learning more about the system. If they were stuck they might get just 10 minutes of the specialist and then do it themselves. Work the Problem The thing is as much as you don't want unplanned and urgent work to exist, it does. Project Managers have spent years trying to stamp it out and, without great pain, they can't. In fact, the whole DevOps movement runs counter to this school of thought. The five steps set out here might not create the ideal world but they should make the real world more controlled and manageable. One day AI systems might take over unplanned work, but it is often in unexpected situations when things should work but don't that human ingenuity is needed most.

By Allan Kelly

CORE

The Reality of Low-Code and No-Code Applications

Low-code and no-code are slowly becoming the norm today. No-code refers to the approach where you can design software or applications with virtually no code. This is possible where platforms offer ready-made or done-for-you features. Today, many website builders, including WordPress and others, are prime examples of using drag-and-drop features as well as templates. Low-code is close in nature, but it still involves knowledge and coding applications. You'll find that in the same arena, website building, low-code is still in play. These approaches signify a major paradigm shift in the industry, democratizing the development process and challenging the earlier, code-heavy approach. Not to mention, with the advent of ChatGPT, low-code is practically the norm, and you cannot deny or resist the changes taking place in this direction. As with any technological advancement, they come with a set of advantages and unique challenges. This article delves into the reality of no-code and low-code, providing a balanced perspective and shedding light on when and where they can be most effective. Advantages of No-Code and Low-Code Let's start with the benefits and advantages of working with no-code and low-code for the everyday user and developers too. Speed of Development: One of the big benefits of no-code and low-code platforms is their ability to accelerate the development process significantly. With an intuitive drag-and-drop interface, developers can quickly create prototypes and deploy functional applications in a fraction of the time it would take with traditional coding. This speed is crucial for businesses looking to stay ahead of the competition and meet ever-changing market demands. It also removes the need to reinvent the wheel, a.k.a. doing things from scratch. Accessibility: The previous benefit leads to this one — these types of applications lower the barrier of entry into application development. Individuals with varying technical backgrounds (or NO technical backgrounds) can contribute. This opens doors for non-technical users, such as business analysts or marketers, to be involved in the development process. This leads to cost savings and increased efficiency, but it also allows different team members to realize their vision even without technical knowledge. Cost Efficiency: No-code and low-code platforms have the potential to reduce development costs significantly. By removing the need for extensive coding, these methods cut down on time and resources spent on development. This also leads to savings on maintenance costs as updates and changes can be made quickly and easily. Where Low-Code and No-Code are Applied There are many areas where low-code and no-code applications are already in place. For example, in website development and app development, we see drag-and-drop builders and entire sites ready to launch. This is especially true for WordPress which powers 43% of all websites and is the foundation for several no-code and low-code plugins that work seamlessly. Also, Airtable is a project management system that virtually helps people build apps such as databases, project management tools, and even content marketing systems. Similarly, you can find tools for IoT applications, building social networks, and much more. The list is growing and largely contained within the B2B sphere, although we can expect to see more B2C applications soon. Challenges and Limitations While such applications are game-changing and impressive, they do come with their own hangups. Let's explore them. Customization Constraints: While no-code and low-code platforms offer speed and simplicity, they come with limitations. Users often experience frustration when they have to customize their experiences. Such platforms have predefined templates and modules and need coding knowledge for unique functionalities. This can be a significant limitation for businesses and individuals with specific needs and requirements. Scalability Concerns: As businesses grow, their applications must be able to scale with them. But scalability gets tricky, then. Low-code/no-code platforms are designed for quicker development and often sacrifice scalability in the process. Or scaling means a sudden increase in expense by taking on higher tiers of service that a small business or a single person cannot afford, leaving them stuck. Learning Curve: While these platforms aim to be accessible, there might still be a learning curve for users without any technical background. Very few low and no-code platforms are completely intuitive. And people still need to have a baseline knowledge of the field they want to work in to leverage such tools. In fact, some people believe that these technologies simply offer false promises without a solid foundation in reality. Integration and Interoperability: One major concern for businesses using no-code and low-code platforms is integration with other systems. These platforms can have their unique components and modules, making it challenging to integrate them with existing legacy systems or third-party applications. This can limit the functionality and compatibility of these platforms. This isn't true everywhere, as we can see that web development via WordPress ensures that the whole ecosystem evolves with it. However, it is worth carefully considering integration needs before investing in such technologies. Reconsidering the Reliance on Low- and No-Code Technologies While no-code and low-code development have undoubtedly revolutionized the software development landscape, they have their limitations. However, I think there needs to be further integration of systems across industries and operations so that as many platforms, brands, and tools as possible come to a baseline uniformity of function. l also think that the rise of AI and its everyday use is a launchpad from which no-code and low-code applications will only grow. Not only will they grow but also become more cohesive, integrated, and consistent, but we'll see a period of messy growth in the meantime. As such, there's still room for development from scratch and building tailored applications to serve current business and individual needs.

By Syed Balkhi

Methodologies

DZone's Featured Methodologies Resources

Top Methodologies Experts

The Latest Methodologies Topics