Getting Started With Large Language Models
In the rapidly evolving landscape of artificial intelligence, text generation models have emerged as a cornerstone, revolutionizing how we interact with machine learning technologies. Among these models, GPT-4 stands out, showcasing an unprecedented ability to understand and generate human-like text. This article delves into the basics of text generation using GPT-4, providing Python code examples to guide beginners in creating their own AI-driven text generation applications.

Understanding GPT-4

GPT-4, or Generative Pre-trained Transformer 4, represents the latest advancement in OpenAI's series of text generation models. It builds on the success of its predecessors by offering greater depth and a more nuanced understanding of context, making it capable of producing text that closely mimics human writing in various styles and formats. At its core, GPT-4 operates on the principles of deep learning, utilizing a transformer architecture. This architecture enables the model to weight different parts of the input text differently, allowing it to grasp the nuances of language and generate coherent, contextually relevant responses.

Getting Started With GPT-4 and Python

To experiment with GPT-4, you need access to OpenAI's API, which provides a straightforward way to use the model without training it from scratch. The following Python code snippet demonstrates how to use the OpenAI API to generate text with GPT-4:

Python

from openai import OpenAI

# Set the OpenAI API key
client = OpenAI(api_key='your_api_key_goes_here')  # Get your key at https://platform.openai.com/api-keys

response = client.chat.completions.create(
    model="gpt-4-0125-preview",  # The latest GPT-4 preview model, trained with data through the end of 2023
    messages=[{'role': 'user', 'content': "Write a short story about a robot saving Earth from aliens."}],
    max_tokens=250,       # Maximum length of the response
    temperature=0.6,      # Ranges from 0 to 2; lower values => more deterministic, higher values => more random
    top_p=1,              # Ranges from 0 to 1; lower values => narrower selection of candidate tokens
    frequency_penalty=0,  # Discourages the model from repeating the same words or phrases too frequently
    presence_penalty=0)   # Encourages the model to include a diverse range of tokens in the generated text

print(response.choices[0].message.content)

In this example, we use the client.chat.completions.create function to generate text. The model parameter specifies which version of the model to use, with "gpt-4-0125-preview" representing the latest GPT-4 preview, trained on data available up to December 2023. The messages parameter feeds the initial text to the model, serving as the basis for the generated content. Other parameters like max_tokens, temperature, and top_p allow us to control the length and creativity of the output.

Applications and Implications

The applications of GPT-4 extend far beyond simple text generation. Industries ranging from entertainment to customer service find value in its ability to create compelling narratives, generate informative content, and even converse with users in a natural manner. However, as we integrate these models more deeply into our digital experiences, ethical considerations come to the forefront. Issues such as bias, misinformation, and the potential for misuse necessitate a thoughtful approach to deployment and regulation.
Conclusion

GPT-4's capabilities represent a significant leap forward in the field of artificial intelligence, offering tools that can understand and generate human-like text with remarkable accuracy. The Python example provided herein serves as a starting point for exploring the vast potential of text generation models. As we continue to push the boundaries of what AI can achieve, it remains crucial to navigate the ethical landscape with care, ensuring that these technologies augment human creativity and knowledge rather than detract from it.

In summary, GPT-4 not only showcases the power of modern AI but also invites us to reimagine the future of human-computer interaction. With each advancement, we step closer to a world where machines understand not just the words we say but the meaning and emotion behind them, unlocking new possibilities for creativity, efficiency, and understanding.
In the Oxford Dictionary, the word agility is defined as "the ability to move quickly and easily." It is, therefore, understandable that many people relate agility to speed, which I think is unfortunate. I much prefer the description from Sheppard and Young, two academics in the field of sports science, who proposed a new definition of agility within the sports science community as "a rapid whole-body movement with a change of velocity or direction in response to a stimulus" [1].

The term "agility" is often used to describe "change of direction speed." However, there is a distinct difference between the two. Agility involves the ability to react in unpredictable environments. Change of direction speed, on the other hand, focuses purely on maintaining speed as the direction of travel is changed. Maintaining speed while changing direction is usually only possible when each change of direction is known in advance.

Using a sports analogy as an example, in soccer, we can say that a defender's reaction to an attacker's sudden movement is an agility-based movement. The defender has to react based on what the attacker is doing. Compare this to an athlete running in a zig-zag through a course of pre-positioned cones, and the reactive component is missing. There are no impulsive or unpredictable events happening. The athlete is trying to maintain speed while changing direction.

Often, when leaders of organizations want to adopt Agile, they do so for reasons such as "to deliver faster." In this case, they are thinking of agile as a way to enable change of direction speed like the athlete, and not in the sense of agility needed by the soccer defender when faced with the attacker. This may explain why agile ways of working do not always live up to expectations, even though more and more companies are adopting them.

Sticking with the sports analogy, the athlete running through the cones tries to reach each one as quickly as possible and then runs in the direction of the next until the end of the course. This works as a metaphor for defining the scope of a project and having teams work in short iterations in which they deliver each planned feature as quickly as possible and then move on to the next. This may be fine in a predictable environment, where the plan does not need to change, where requirements stay fixed, where the market stays the same, and where customer behaviors are well understood and set in stone. In many environments, however, change is a constant: customer expectations and behavior, market trends, the actions of competitors, and more. These are the VUCA environments (volatile, uncertain, complex, and ambiguous) where there is a need to react or, to put it another way, where agility is needed.

Frameworks such as Scrum are meant to support agility. Sprints are short planning horizons, and the artifacts and events in Scrum are there to provide greater transparency and opportunities to inspect and adapt based on early feedback, changes in conditions, and new information. They give an opportunity to pivot and react in a timely manner. However, Scrum is unfortunately often misunderstood as a mechanism to deliver with speed. Focusing only on speed and delivery and not investing in the practices that enable true agility is likely to actually slow things down in the long run. When the focus is only on speed, it becomes harder to maintain that speed, let alone to increase it, and any semblance of agility is a fantasy.
Let me talk about a pattern that I see again and again as an example. Company A has a goal to build an e-commerce site through which they can sell their goods. Their first slice of functionality is delivered in a 1-month Sprint and consists of a static catalog page in which the company can upload its products with a short description. The first delivery is received by happy and excited stakeholders who are hungry for more. The team keeps on building, and the product keeps growing. Stakeholders make more requests, and the team works harder to keep up and deliver at the pace that stakeholders have come to expect from them.

The team does not have time to invest in improving their practices. Manual regression testing becomes a bigger and bigger burden and is more challenging to complete within a Sprint. The codebase becomes more complex and brittle. The more bloated the product becomes, the more the team struggles to deliver at the same pace. To try to meet expectations, the team begins to cut corners. They end up carrying testing work over from one Sprint to the next. And there is no time for validating whether what is being delivered actually produces value. This is just as well, as no analytics have been set up anyway.

In the meantime, whilst the team is so busy trying to build new features and carrying out manual integration and regression testing, they do not have time either to look at improvements to their original build pipeline or to build in any automation. A release to production involves several hours of downtime, so it has to be done overnight and manually. To make matters worse, the market has been changing. The sales team has made deals with new suppliers, but this means further customizations to the site are needed for their products. Finally, the company has pushed for the platform to be available in different timezones, so the downtime for a release is a big problem and must be minimized; as a result, releases are only allowed to happen once every six months.

Progress comes to a standstill. The product is riddled with technical debt. The team has lost the ability to get early feedback and the ability to react to what their attackers are doing, i.e., customers' changing needs, competitors' actions, changing market conditions, etc.

Just implementing the mechanics of a framework like Scrum does not ensure agility and does not automatically lead to a more beneficial way of working. The Agile Manifesto includes principles such as "continuous delivery of valuable software," "continuous attention to technical excellence," and "at regular intervals, the team reflects on how to become more effective." Following these principles is a greater enabler of agility than following the Scrum framework alone. One effective way of enabling greater agility is to complement something like Scrum with agile engineering practices to get those expected benefits that organizations are looking for with agile.

Over the years, I have encountered many Agile adoptions at companies where a lot of passion, energy, focus, and budget went into training and coaching people in implementing certain frameworks such as Scrum. However, what I do not encounter so much are companies spending the same amount of passion, energy, focus, and budget on implementing good Agile engineering practices such as Pair Programming, Test-Driven Development, Continuous Integration, and Continuous Delivery.
When challenged on this, responses are typically something like "We don't have time for that now," or "Let's just deliver this next release, and we'll look at doing it at a later date when we have time." And, of course, that time usually never arrives.

Now, of course, something like Scrum can be used without using any Agile engineering practices. After all, Scrum is just a framework. However, without good engineering practices and without striving for technical excellence, a Scrum team developing software will only get so far. Agile engineering practices are essential to achieving agility, as they shorten validation cycles and provide early feedback. For example, pairing gives real-time validation and feedback on quality, as does having proper Continuous Integration in place.

Many of the Agile engineering practices are viewed as daunting and expensive to implement, and as something that will get in the way of delivery. However, I would argue that investing in engineering practices that help to build quality in or allow tasks to be automated, for example, enables sustainable development in the long run. While investing in Agile engineering practices may seem to slow things down in the short term, the aim is to be able to maintain or actually increase speed into the future while still retaining the ability to pivot.

To me, it is an obvious choice to invest in implementing Agile engineering practices, but surprisingly, many companies do not. Instead, they choose to sacrifice long-term sustainability and quality for short-term speed. Creating a shared understanding of the challenges that teams face, and of the trade-off between short-term speed and the problems that arise when good engineering practices are not in place, can help to start a conversation. It is important that everyone, including developers and a team's stakeholders, understands the importance of investing in good Agile engineering practices and the impact on agility of not doing so. These investments can also be thought of as experiments — trying out or investigating a certain practice for a few iterations to see how it might help can be a way to get started and make it less daunting.

Either way, it is questionable whether an Agile process without the underlying agile practices applied can be agile at all. For sustainability, robustness, quality, customer service, fitness for purpose, and true agility in software development teams, continuous investment in Agile engineering practices is essential.

[1] J. M. Sheppard & W. B. Young (2006). Agility literature review: Classifications, training and testing. Journal of Sports Sciences, 24:9, 919-932. DOI: 10.1080/02640410500457109
MongoDB is one of the most reliable and robust document-oriented NoSQL databases. It allows developers to provide feature-rich applications and services with various modern built-in functionalities, like machine learning, streaming, full-text search, etc. While not a classical relational database, MongoDB is nevertheless used by a wide range of different business sectors, and its use cases cover all kinds of architecture scenarios and data types.

Document-oriented databases are inherently different from traditional relational ones, where data are stored in tables and a single entity might be spread across several such tables. In contrast, document databases store data in separate, unrelated collections, which eliminates the intrinsic heaviness of the relational model. However, given that real-world domain models are never so simplistic as to consist of unrelated, separate entities, document databases (including MongoDB) provide several ways to define multi-collection connections similar to classical database relationships, but much lighter, more economical, and more efficient.

Quarkus, the "supersonic and subatomic" Java stack, is the new kid on the block that the most trendy and influential developers are desperately grabbing and fighting over. Its modern cloud-native facilities, its design (compliant with best-of-breed standard libraries), and its ability to build native executables have seduced Java developers, architects, engineers, and software designers for a couple of years.

We cannot go into further details here of either MongoDB or Quarkus: the reader interested in learning more is invited to check the documentation on the official MongoDB website or Quarkus website. What we are trying to achieve here is to implement a relatively complex use case consisting of CRUDing a customer-order-product domain model using Quarkus and its MongoDB extension. In an attempt to provide a real-world inspired solution, we're trying to avoid simplistic and caricatural examples based on a zero-connections, single-entity model (there are dozens nowadays). So, here we go!

The Domain Model

The diagram below shows our customer-order-product domain model:

As you can see, the central document of the model is Order, stored in a dedicated collection named Orders. An Order is an aggregate of OrderItem documents, each of which points to its associated Product. An Order document also references the Customer who placed it. In Java, this is implemented as follows:

Java

@MongoEntity(database = "mdb", collection = "Customers")
public class Customer {
    @BsonId
    private Long id;
    private String firstName, lastName;
    private InternetAddress email;
    private Set<Address> addresses;
    ...
}

The code above shows a fragment of the Customer class. This is a POJO (Plain Old Java Object) annotated with the @MongoEntity annotation, whose parameters define the database name and the collection name. The @BsonId annotation is used to configure the document's unique identifier. While the most common use case is to implement the document's identifier as an instance of the ObjectID class, this would introduce a useless tight coupling between the MongoDB-specific classes and our document. The other properties are the customer's first and last name, the email address, and a set of postal addresses. Let's look now at the Order document.
Java

@MongoEntity(database = "mdb", collection = "Orders")
public class Order {
    @BsonId
    private Long id;
    private DBRef customer;
    private Address shippingAddress;
    private Address billingAddress;
    private Set<DBRef> orderItemSet = new HashSet<>();
    ...
}

Here we need to create an association between an order and the customer who placed it. We could have embedded the associated Customer document in our Order document, but this would have been a poor design because it would have redundantly defined the same object twice. We need to use a reference to the associated Customer document, and we do this using the DBRef class. The same thing happens for the set of associated order items where, instead of embedding the documents, we use a set of references.

The rest of our domain model is quite similar and based on the same normalization ideas; for example, the OrderItem document:

Java

@MongoEntity(database = "mdb", collection = "OrderItems")
public class OrderItem {
    @BsonId
    private Long id;
    private DBRef product;
    private BigDecimal price;
    private int amount;
    ...
}

We need to reference the product that is the subject of the current order item. Last but not least, we have the Product document:

Java

@MongoEntity(database = "mdb", collection = "Products")
public class Product {
    @BsonId
    private Long id;
    private String name, description;
    private BigDecimal price;
    private Map<String, String> attributes = new HashMap<>();
    ...
}

That's pretty much all as far as our domain model is concerned. There are, however, some additional packages that we need to look at: serializers and codecs.

In order to be exchanged on the wire, all our objects, be they business or purely technical ones, have to be serialized and deserialized. These operations are the responsibility of specially designated components called serializers/deserializers. As we have seen, we're using the DBRef type in order to define the association between different collections. Like any other object, a DBRef instance should be able to be serialized/deserialized. The MongoDB driver provides serializers/deserializers for the majority of the data types supposed to be used in the most common cases. However, for some reason, it doesn't provide serializers/deserializers for the DBRef type. Hence, we need to implement our own, and this is what the serializers package does. Let's look at these classes:

Java

public class DBRefSerializer extends StdSerializer<DBRef> {

    public DBRefSerializer() {
        this(null);
    }

    protected DBRefSerializer(Class<DBRef> dbrefClass) {
        super(dbrefClass);
    }

    @Override
    public void serialize(DBRef dbRef, JsonGenerator jsonGenerator, SerializerProvider serializerProvider) throws IOException {
        if (dbRef != null) {
            jsonGenerator.writeStartObject();
            jsonGenerator.writeStringField("id", (String) dbRef.getId());
            jsonGenerator.writeStringField("collectionName", dbRef.getCollectionName());
            jsonGenerator.writeStringField("databaseName", dbRef.getDatabaseName());
            jsonGenerator.writeEndObject();
        }
    }
}

This is our DBRef serializer and, as you can see, it's a Jackson serializer. This is because the quarkus-mongodb-panache extension that we're using here relies on Jackson. Perhaps, in a future release, JSON-B will be used but, for now, we're stuck with Jackson. It extends the StdSerializer class as usual and serializes its associated DBRef object by using the JSON generator, passed as an input argument, to write on the output stream the DBRef components, i.e., the object ID, the collection name, and the database name.
For more information concerning the DBRef structure, please see the MongoDB documentation. The deserializer performs the complementary operation, as shown below:

Java

public class DBRefDeserializer extends StdDeserializer<DBRef> {

    public DBRefDeserializer() {
        this(null);
    }

    public DBRefDeserializer(Class<DBRef> dbrefClass) {
        super(dbrefClass);
    }

    @Override
    public DBRef deserialize(JsonParser jsonParser, DeserializationContext deserializationContext) throws IOException, JacksonException {
        JsonNode node = jsonParser.getCodec().readTree(jsonParser);
        return new DBRef(node.findValue("databaseName").asText(), node.findValue("collectionName").asText(), node.findValue("id").asText());
    }
}

This is pretty much all that may be said as far as the serializers/deserializers are concerned. Let's move on to see what the codecs package brings us.

Java objects are stored in a MongoDB database using the BSON (Binary JSON) format. In order to store information, the MongoDB driver needs the ability to map Java objects to their associated BSON representation. It does that by means of the Codec interface, which contains the required abstract methods for mapping Java objects to BSON and the other way around. By implementing this interface, one can define the conversion logic between Java and BSON, and conversely. The MongoDB driver includes the required Codec implementations for the most common types but again, for some reason, when it comes to DBRef, this implementation is only a dummy one, which raises UnsupportedOperationException. Having contacted the MongoDB driver implementers, I didn't succeed in finding any other solution than implementing my own Codec mapper, as shown by the class DocstoreDBRefCodec. For brevity reasons, we won't reproduce this class' source code here (a hedged sketch of what such a codec might look like is included at the end of this article).

Once our dedicated Codec is implemented, we need to register it with the MongoDB driver, such that it is used when mapping DBRef types to Java objects and conversely. In order to do that, we need to implement the CodecProvider interface which, as shown by the class DocstoreDBRefCodecProvider, returns via its get() method the concrete codec responsible for performing the mapping; i.e., in our case, DocstoreDBRefCodec. And that's all we need to do here, as Quarkus will automatically discover and use our customized CodecProvider implementation. Please have a look at these classes to see and understand how things are done.

The Data Repositories

Quarkus Panache greatly simplifies the data persistence process by supporting both the active record and the repository design patterns. Here, we'll be using the second one. As opposed to similar persistence stacks, Panache relies on compile-time bytecode enhancement of the entities. It includes an annotation processor that automatically performs these enhancements. All that this annotation processor needs in order to do its enhancement job is a class like the one below:

Java

@ApplicationScoped
public class CustomerRepository implements PanacheMongoRepositoryBase<Customer, Long> {}

The code above is all that you need in order to define a complete service able to persist Customer document instances. Your class needs to implement the PanacheMongoRepositoryBase interface and parameterize it with your object ID type, in our case a Long.
The Panache annotation processor will generate all the endpoints required to perform the most common CRUD operations, including but not limited to saving, updating, deleting, querying, paging, sorting, transaction handling, etc. All these details are fully explained here. Another possibility is to extend PanacheMongoRepository instead of PanacheMongoRepositoryBase and to use the provided ObjectID keys instead of customizing them as Long, as we did in our example. Whether you choose the first or the second alternative is only a matter of preference.

The REST API

In order for our Panache-generated persistence service to become effective, we need to expose it through a REST API. In the most common case, we have to manually craft this API, together with its implementation, consisting of the full set of required REST endpoints. This tedious and repetitive operation can be avoided by using the quarkus-mongodb-rest-data-panache extension, whose annotation processor is able to automatically generate the required REST endpoints out of interfaces having the following pattern:

Java

public interface CustomerResource extends PanacheMongoRepositoryResource<CustomerRepository, Customer, Long> {}

Believe it or not, this is all you need to generate a full REST API implementation with all the endpoints required to invoke the persistence service generated previously by the mongodb-panache extension annotation processor.

Now we are ready to build our REST API as a Quarkus microservice. We chose to build this microservice as a Docker image, using the quarkus-container-image-jib extension, by simply including the following Maven dependency:

XML

<dependency>
  <groupId>io.quarkus</groupId>
  <artifactId>quarkus-container-image-jib</artifactId>
</dependency>

The quarkus-maven-plugin will create a local Docker image to run our microservice. The parameters of this Docker image are defined in the application.properties file, as follows:

Properties files

quarkus.container-image.build=true
quarkus.container-image.group=quarkus-nosql-tests
quarkus.container-image.name=docstore-mongodb
quarkus.mongodb.connection-string = mongodb://admin:admin@mongo:27017
quarkus.mongodb.database = mdb
quarkus.swagger-ui.always-include=true
quarkus.jib.jvm-entrypoint=/opt/jboss/container/java/run/run-java.sh

Here we define the name of the newly created Docker image as quarkus-nosql-tests/docstore-mongodb. This is the concatenation of the parameters quarkus.container-image.group and quarkus.container-image.name separated by a "/". The property quarkus.container-image.build having the value true instructs the Quarkus plugin to bind the build operation to the package phase of Maven. This way, by simply executing a mvn package command, we generate a Docker image able to run our microservice. This may be tested by running the docker images command. The property named quarkus.jib.jvm-entrypoint defines the command to be run by the newly generated Docker image. quarkus-run.jar is the Quarkus microservice standard startup file used when the base image is ubi8/openjdk-17-runtime, as in our case. Other properties are quarkus.mongodb.connection-string and quarkus.mongodb.database = mdb, which define the MongoDB database connection string and the name of the database. Last but not least, the property quarkus.swagger-ui.always-include includes the Swagger UI interface in our microservice space such that we can test it easily. Let's see now how to run and test the whole thing.
Running and Testing Our Microservices

Now that we have looked at the details of our implementation, let's see how to run and test it. We chose to do it using the docker-compose utility. Here is the associated docker-compose.yml file:

YAML

version: "3.7"
services:
  mongo:
    image: mongo
    environment:
      MONGO_INITDB_ROOT_USERNAME: admin
      MONGO_INITDB_ROOT_PASSWORD: admin
      MONGO_INITDB_DATABASE: mdb
    hostname: mongo
    container_name: mongo
    ports:
      - "27017:27017"
    volumes:
      - ./mongo-init/:/docker-entrypoint-initdb.d/:ro
  mongo-express:
    image: mongo-express
    depends_on:
      - mongo
    hostname: mongo-express
    container_name: mongo-express
    links:
      - mongo:mongo
    ports:
      - 8081:8081
    environment:
      ME_CONFIG_MONGODB_ADMINUSERNAME: admin
      ME_CONFIG_MONGODB_ADMINPASSWORD: admin
      ME_CONFIG_MONGODB_URL: mongodb://admin:admin@mongo:27017/
  docstore:
    image: quarkus-nosql-tests/docstore-mongodb:1.0-SNAPSHOT
    depends_on:
      - mongo
      - mongo-express
    hostname: docstore
    container_name: docstore
    links:
      - mongo:mongo
      - mongo-express:mongo-express
    ports:
      - "8080:8080"
      - "5005:5005"
    environment:
      JAVA_DEBUG: "true"
      JAVA_APP_DIR: /home/jboss
      JAVA_APP_JAR: quarkus-run.jar

This file instructs the docker-compose utility to run three services:

A service named mongo running the MongoDB 7 database
A service named mongo-express running the MongoDB administrative UI
A service named docstore running our Quarkus microservice

We should note that the mongo service uses an initialization script mounted on the docker-entrypoint-initdb.d directory of the container. This initialization script creates the MongoDB database named mdb such that it can be used by the microservices.

JavaScript

db = db.getSiblingDB(process.env.MONGO_INITDB_ROOT_USERNAME);
db.auth(
  process.env.MONGO_INITDB_ROOT_USERNAME,
  process.env.MONGO_INITDB_ROOT_PASSWORD,
);
db = db.getSiblingDB(process.env.MONGO_INITDB_DATABASE);
db.createUser({
  user: "nicolas",
  pwd: "password1",
  roles: [{ role: "dbOwner", db: "mdb" }]
});
db.createCollection("Customers");
db.createCollection("Products");
db.createCollection("Orders");
db.createCollection("OrderItems");

This initialization JavaScript creates a user named nicolas and a new database named mdb. The user has administrative privileges on the database. Four new collections, respectively named Customers, Products, Orders, and OrderItems, are created as well.

In order to test the microservices, proceed as follows:

Clone the associated GitHub repository: $ git clone https://github.com/nicolasduminil/docstore.git
Go to the project: $ cd docstore
Build the project: $ mvn clean install
Check that all the required Docker containers are running: $ docker ps

CONTAINER ID   IMAGE                                               COMMAND                  CREATED         STATUS         PORTS                                                                                             NAMES
7882102d404d   quarkus-nosql-tests/docstore-mongodb:1.0-SNAPSHOT   "/opt/jboss/containe…"   8 seconds ago   Up 6 seconds   0.0.0.0:5005->5005/tcp, :::5005->5005/tcp, 0.0.0.0:8080->8080/tcp, :::8080->8080/tcp, 8443/tcp   docstore
786fa4fd39d6   mongo-express                                       "/sbin/tini -- /dock…"   8 seconds ago   Up 7 seconds   0.0.0.0:8081->8081/tcp, :::8081->8081/tcp                                                         mongo-express
2e850e3233dd   mongo                                               "docker-entrypoint.s…"   9 seconds ago   Up 7 seconds   0.0.0.0:27017->27017/tcp, :::27017->27017/tcp                                                     mongo

Run the integration tests: $ mvn -DskipTests=false failsafe:integration-test

This last command will run all the integration tests, which should succeed. These integration tests are implemented using the REST Assured library.
The listing below shows one of these integration tests, located in the docstore-domain project:

Java

@QuarkusIntegrationTest
@TestMethodOrder(MethodOrderer.OrderAnnotation.class)
public class CustomerResourceIT {

    private static Customer customer;

    @BeforeAll
    public static void beforeAll() throws AddressException {
        customer = new Customer("John", "Doe", new InternetAddress("john.doe@gmail.com"));
        customer.addAddress(new Address("Gebhard-Gerber-Allee 8", "Kornwestheim", "Germany"));
        customer.setId(10L);
    }

    @Test
    @Order(10)
    public void testCreateCustomerShouldSucceed() {
        given()
            .header("Content-type", "application/json")
            .and().body(customer)
            .when().post("/customer")
            .then()
            .statusCode(HttpStatus.SC_CREATED);
    }

    @Test
    @Order(20)
    public void testGetCustomerShouldSucceed() {
        assertThat(given()
            .header("Content-type", "application/json")
            .when().get("/customer")
            .then()
            .statusCode(HttpStatus.SC_OK)
            .extract().body().jsonPath().getString("firstName[0]")).isEqualTo("John");
    }

    @Test
    @Order(30)
    public void testUpdateCustomerShouldSucceed() {
        customer.setFirstName("Jane");
        given()
            .header("Content-type", "application/json")
            .and().body(customer)
            .when().pathParam("id", customer.getId()).put("/customer/{id}")
            .then()
            .statusCode(HttpStatus.SC_NO_CONTENT);
    }

    @Test
    @Order(40)
    public void testGetSingleCustomerShouldSucceed() {
        assertThat(given()
            .header("Content-type", "application/json")
            .when().pathParam("id", customer.getId()).get("/customer/{id}")
            .then()
            .statusCode(HttpStatus.SC_OK)
            .extract().body().jsonPath().getString("firstName")).isEqualTo("Jane");
    }

    @Test
    @Order(50)
    public void testDeleteCustomerShouldSucceed() {
        given()
            .header("Content-type", "application/json")
            .when().pathParam("id", customer.getId()).delete("/customer/{id}")
            .then()
            .statusCode(HttpStatus.SC_NO_CONTENT);
    }

    @Test
    @Order(60)
    public void testGetSingleCustomerShouldFail() {
        given()
            .header("Content-type", "application/json")
            .when().pathParam("id", customer.getId()).get("/customer/{id}")
            .then()
            .statusCode(HttpStatus.SC_NOT_FOUND);
    }
}

You can also use the Swagger UI interface for testing purposes by pointing your preferred browser at http://localhost:8080/q/swagger-ui. Then, in order to test the endpoints, you can use the payloads in the JSON files located in the src/resources/data directory of the docstore-api project. You can also use the MongoDB administrative UI by going to http://localhost:8081 and authenticating yourself with the default credentials (admin/pass). You can find the project source code in my GitHub repository. Enjoy!
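A final note on the codecs discussed above: since the DocstoreDBRefCodec and DocstoreDBRefCodecProvider sources are not reproduced in this article, the listing below is only a hedged, illustrative sketch of what such a codec pair could look like, based on the MongoDB driver's Codec and CodecProvider interfaces. The class names and the BSON field names are assumptions chosen for illustration, not the article's actual implementation.

Java

import com.mongodb.DBRef;
import org.bson.BsonReader;
import org.bson.BsonWriter;
import org.bson.codecs.Codec;
import org.bson.codecs.DecoderContext;
import org.bson.codecs.EncoderContext;
import org.bson.codecs.configuration.CodecProvider;
import org.bson.codecs.configuration.CodecRegistry;

// Illustrative codec: writes a DBRef as a small document holding its three components.
class IllustrativeDBRefCodec implements Codec<DBRef> {

    @Override
    public void encode(BsonWriter writer, DBRef value, EncoderContext encoderContext) {
        writer.writeStartDocument();
        writer.writeString("databaseName", value.getDatabaseName());
        writer.writeString("collectionName", value.getCollectionName());
        writer.writeString("id", value.getId().toString());
        writer.writeEndDocument();
    }

    @Override
    public DBRef decode(BsonReader reader, DecoderContext decoderContext) {
        // Read the three components back, in the same order they were written
        reader.readStartDocument();
        String databaseName = reader.readString("databaseName");
        String collectionName = reader.readString("collectionName");
        String id = reader.readString("id");
        reader.readEndDocument();
        return new DBRef(databaseName, collectionName, id);
    }

    @Override
    public Class<DBRef> getEncoderClass() {
        return DBRef.class;
    }
}

// Illustrative provider: tells the driver which codec handles the DBRef type.
class IllustrativeDBRefCodecProvider implements CodecProvider {

    @Override
    @SuppressWarnings("unchecked")
    public <T> Codec<T> get(Class<T> clazz, CodecRegistry registry) {
        if (DBRef.class.isAssignableFrom(clazz)) {
            return (Codec<T>) new IllustrativeDBRefCodec();
        }
        return null; // let the registry fall back to other providers
    }
}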
Unit testing has become a standard part of development. Many tools can be utilized for it in many different ways. This article demonstrates a couple of hints or, let's say, best practices that work well for me.

In This Article, You Will Learn

How to write clean and readable unit tests with the JUnit and AssertJ frameworks
How to avoid false positive tests in some cases
What to avoid when writing unit tests

Don't Overuse NPE Checks

We all tend to avoid NullPointerException as much as possible in the main code because it can lead to ugly consequences. However, I believe that avoiding NPE is not our main concern in tests. Our goal is to verify the behavior of a tested component in a clean, readable, and reliable way.

Bad Practice

Many times in the past, I've used the isNotNull assertion even when it wasn't needed, like in the example below:

Java

@Test
public void getMessage() {
    assertThat(service).isNotNull();
    assertThat(service.getMessage()).isEqualTo("Hello world!");
}

This test produces errors like this:

Plain Text

java.lang.AssertionError:
Expecting actual not to be null
	at com.github.aha.poc.junit.spring.StandardSpringTest.test(StandardSpringTest.java:19)

Good Practice

Even though the additional isNotNull assertion is not really harmful, it should be avoided for the following reasons:

It doesn't add any additional value. It's just more code to read and maintain.
The test fails anyway when service is null, and we see the real root cause of the failure. The test still fulfills its purpose.
The produced error message is even better with the AssertJ assertion.

See the modified test assertion below.

Java

@Test
public void getMessage() {
    assertThat(service.getMessage()).isEqualTo("Hello world!");
}

The modified test produces an error like this:

Plain Text

java.lang.NullPointerException: Cannot invoke "com.github.aha.poc.junit.spring.HelloService.getMessage()" because "this.service" is null
	at com.github.aha.poc.junit.spring.StandardSpringTest.test(StandardSpringTest.java:19)

Note: The example can be found in SimpleSpringTest.

Assert Values and Not the Result

From time to time, we write a correct test, but in a "bad" way. It means the test works exactly as intended and verifies our component, but the failure isn't providing enough information. Therefore, our goal is to assert the value and not the comparison result.

Bad Practice

Let's see a couple of such bad tests:

Java

// #1
assertThat(argument.contains("o")).isTrue();

// #2
var result = "Welcome to JDK 10";
assertThat(result instanceof String).isTrue();

// #3
assertThat("".isBlank()).isTrue();

// #4
Optional<Method> testMethod = testInfo.getTestMethod();
assertThat(testMethod.isPresent()).isTrue();

Some errors from the tests above are shown below.

Plain Text

#1
Expecting value to be true but was false
	at java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62)
	at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502)
	at com.github.aha.poc.junit5.params.SimpleParamTests.stringTest(SimpleParamTests.java:23)

#3
Expecting value to be true but was false
	at java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62)
	at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502)
	at com.github.aha.poc.junit5.ConditionalTests.checkJdk11Feature(ConditionalTests.java:50)

Good Practice

The solution is quite easy with AssertJ and its fluent API.
All the cases mentioned above can be easily rewritten as:

Java

// #1
assertThat(argument).contains("o");

// #2
assertThat(result).isInstanceOf(String.class);

// #3
assertThat("").isBlank();

// #4
assertThat(testMethod).isPresent();

The very same errors as mentioned before provide more value now.

Plain Text

#1
Expecting actual:
  "Hello"
to contain:
  "f"
	at com.github.aha.poc.junit5.params.SimpleParamTests.stringTest(SimpleParamTests.java:23)

#3
Expecting blank but was: "a"
	at com.github.aha.poc.junit5.ConditionalTests.checkJdk11Feature(ConditionalTests.java:50)

Note: The example can be found in SimpleParamTests.

Group Related Assertions Together

Assertion chaining and related code indentation help a lot with test clarity and readability.

Bad Practice

As we write a test, we can end up with a correct, but less readable test. Let's imagine a test where we want to find countries and do these checks:

Count the found countries.
Assert the first entry with several values.

Such a test can look like this example:

Java

@Test
void listCountries() {
    List<Country> result = ...;

    assertThat(result).hasSize(5);
    var country = result.get(0);
    assertThat(country.getName()).isEqualTo("Spain");
    assertThat(country.getCities().stream().map(City::getName)).contains("Barcelona");
}

Good Practice

Even though the previous test is correct, we can improve the readability a lot by grouping the related assertions together. The goal here is to assert result once and write as many chained assertions as needed. See the modified version below.

Java

@Test
void listCountries() {
    List<Country> result = ...;

    assertThat(result)
        .hasSize(5)
        .first()
        .satisfies(c -> {
            assertThat(c.getName()).isEqualTo("Spain");
            assertThat(c.getCities().stream().map(City::getName)).contains("Barcelona");
        });
}

Note: The example can be found in CountryRepositoryOtherTests.

Prevent False Positive Successful Tests

When any assertion method with a ThrowingConsumer argument is used, the argument has to contain assertThat in the consumer as well. Otherwise, the test would pass all the time, even when the comparison fails, which means the test is wrong. The test fails only when an assertion throws a RuntimeException or AssertionError exception. I guess it's clear, but it's easy to forget about it and write a wrong test. It happens to me from time to time.

Bad Practice

Let's imagine we have a couple of country codes and we want to verify that every code satisfies some condition. In our dummy case, we want to assert that every country code contains the "a" character. As you can see, it's nonsense: we have codes in uppercase, but we aren't applying case insensitivity in the assertion.

Java

@Test
void assertValues() throws Exception {
    var countryCodes = List.of("CZ", "AT", "CA");
    assertThat( countryCodes )
        .hasSize(3)
        .allSatisfy(countryCode -> countryCode.contains("a"));
}

Surprisingly, our test passes successfully.

Good Practice

As mentioned at the beginning of this section, our test can be corrected easily with an additional assertThat in the consumer. The correct test should be like this:

Java

@Test
void assertValues() throws Exception {
    var countryCodes = List.of("CZ", "AT", "CA");
    assertThat( countryCodes )
        .hasSize(3)
        .allSatisfy(countryCode -> assertThat( countryCode ).containsIgnoringCase("a"));
}

Now the test fails as expected with the correct error message.
Plain Text

java.lang.AssertionError:
Expecting all elements of:
  ["CZ", "AT", "CA"]
to satisfy given requirements, but these elements did not:

"CZ"
error:
Expecting actual:
  "CZ"
to contain:
  "a"
 (ignoring case)
	at com.github.aha.sat.core.clr.AppleTest.assertValues(AppleTest.java:45)

Chain Assertions

The last hint is not really a practice, but rather a recommendation. The AssertJ fluent API should be utilized in order to create more readable tests.

Non-Chaining Assertions

Let's consider the listLogs test, whose purpose is to test the logging of a component. The goal here is to:

Assert the number of collected logs
Assert the existence of the DEBUG and INFO log messages

Java

@Test
void listLogs() throws Exception {
    ListAppender<ILoggingEvent> logAppender = ...;

    assertThat( logAppender.list ).hasSize(2);
    assertThat( logAppender.list ).anySatisfy(logEntry -> {
        assertThat( logEntry.getLevel() ).isEqualTo(DEBUG);
        assertThat( logEntry.getFormattedMessage() ).startsWith("Initializing Apple");
    });
    assertThat( logAppender.list ).anySatisfy(logEntry -> {
        assertThat( logEntry.getLevel() ).isEqualTo(INFO);
        assertThat( logEntry.getFormattedMessage() ).isEqualTo("Here's Apple runner" );
    });
}

Chaining Assertions

With the mentioned fluent API and chaining, we can change the test this way:

Java

@Test
void listLogs() throws Exception {
    ListAppender<ILoggingEvent> logAppender = ...;

    assertThat( logAppender.list )
        .hasSize(2)
        .anySatisfy(logEntry -> {
            assertThat( logEntry.getLevel() ).isEqualTo(DEBUG);
            assertThat( logEntry.getFormattedMessage() ).startsWith("Initializing Apple");
        })
        .anySatisfy(logEntry -> {
            assertThat( logEntry.getLevel() ).isEqualTo(INFO);
            assertThat( logEntry.getFormattedMessage() ).isEqualTo("Here's Apple runner" );
        });
}

Note: The example can be found in AppleTest.

Summary and Source Code

The AssertJ framework provides a lot of help with its fluent API. In this article, several tips and hints were presented in order to produce clearer and more reliable tests. Please be aware that most of these recommendations are subjective; it depends on personal preferences and code style. The source code used can be found in my repositories:

spring-advanced-training
junit-poc
Think of data pipeline orchestration as the backstage crew of a theater, ensuring every scene flows seamlessly into the next. In the data world, tools like Apache Airflow and AWS Step Functions are the unsung heroes that keep the show running smoothly, especially when you're working with dbt (data build tool) to whip your data into shape and ensure that the right data is available at the right time. Both tools are often used alongside dbt, which has emerged as a powerful tool for transforming data in a warehouse.

In this article, we will introduce dbt, Apache Airflow, and AWS Step Functions and then delve into the pros and cons of using Apache Airflow and AWS Step Functions for data pipeline orchestration involving dbt. Note that dbt has a paid offering, dbt Cloud, and a free open-source version; we are focusing on dbt-core, the free version of dbt.

dbt (Data Build Tool)

dbt-core is an open-source command-line tool that enables data analysts and engineers to transform data in their warehouses more effectively. It allows users to write modular SQL queries, which it then runs on top of the data warehouse in the appropriate order with respect to their dependencies.

Key Features

Version control: It integrates with Git to help track changes, collaborate, and deploy code.
Documentation: Autogenerated documentation and a searchable data catalog are created based on the dbt project.
Modularity: Reusable SQL models can be referenced and combined to build complex transformations.

Airflow vs. AWS Step Functions for dbt Orchestration

Apache Airflow

Apache Airflow is an open-source tool that helps to create, schedule, and monitor workflows. It is used by data engineers and analysts to manage complex data pipelines.

Key Features

Extensibility: Custom operators, executors, and hooks can be written to extend Airflow’s functionality.
Scalability: Offers dynamic pipeline generation and can scale to handle multiple data pipeline workflows.

Example: DAG

Python

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.slack.operators.slack import SlackAPIPostOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime.now() - timedelta(days=1),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG('dbt_daily_job',
          default_args=default_args,
          description='A simple DAG to run dbt jobs',
          schedule_interval=timedelta(days=1))

dbt_run = BashOperator(
    task_id='dbt_run',
    bash_command='dbt build --select sales',
    dag=dag,
)

slack_notify = SlackAPIPostOperator(
    task_id='slack_notify',
    dag=dag,
    # Replace with your actual Slack notification parameters (connection, channel, text, ...)
)

dbt_run >> slack_notify

Pros

Flexibility: Apache Airflow offers unparalleled flexibility with the ability to define custom operators and is not limited to AWS resources.
Community support: A vibrant open-source community actively contributes plugins and operators that provide extended functionality.
Complex workflows: Better suited to complex task dependencies and can manage task orchestration across various systems.

Cons

Operational overhead: Requires management of underlying infrastructure unless managed services like Astronomer or Google Cloud Composer are used.
Learning curve: The rich feature set comes with a complexity that may present a steeper learning curve for some users.
AWS Step Functions

AWS Step Functions is a fully managed service provided by Amazon Web Services that makes it easier to orchestrate microservices, serverless applications, and complex workflows. It uses a state machine model to define and execute workflows, which can consist of various AWS services like Lambda, ECS, SageMaker, and more.

Key Features

Serverless operation: No need to manage infrastructure, as AWS provides a managed service.
Integration with AWS services: Seamless connection to AWS services is supported for complex orchestration.

Example: State Machine CloudFormation Template (Step Function)

YAML

AWSTemplateFormatVersion: '2010-09-09'
Description: State Machine to run a dbt job
Resources:
  DbtStateMachine:
    Type: 'AWS::StepFunctions::StateMachine'
    Properties:
      StateMachineName: DbtStateMachine
      RoleArn: !Sub 'arn:aws:iam::${AWS::AccountId}:role/service-role/StepFunctions-ECSTaskRole'
      DefinitionString: !Sub |
        Comment: "A Step Functions state machine that executes a dbt job using an ECS task."
        StartAt: RunDbtJob
        States:
          RunDbtJob:
            Type: Task
            Resource: "arn:aws:states:::ecs:runTask.sync"
            Parameters:
              Cluster: "arn:aws:ecs:${AWS::Region}:${AWS::AccountId}:cluster/MyECSCluster"
              TaskDefinition: "arn:aws:ecs:${AWS::Region}:${AWS::AccountId}:task-definition/MyDbtTaskDefinition"
              LaunchType: FARGATE
              NetworkConfiguration:
                AwsvpcConfiguration:
                  Subnets:
                    - "subnet-0193156582abfef1"
                    - "subnet-abcjkl0890456789"
                  AssignPublicIp: "ENABLED"
            End: true
Outputs:
  StateMachineArn:
    Description: The ARN of the dbt state machine
    Value: !Ref DbtStateMachine

When using AWS ECS with AWS Fargate to run dbt workflows, while you can define the dbt command in the task definition (MyDbtTaskDefinition), it's also common to create a Docker image that contains not only the dbt environment but also the specific dbt commands you wish to run.

Pros

Fully managed service: AWS manages the scaling and operation under the hood, leading to reduced operational burden.
AWS integration: A natural fit for AWS-centric environments, allowing easy integration of various AWS services.
Reliability: Step Functions provide a high level of reliability and support, backed by the AWS SLA.

Cons

Cost: Pricing might be higher for high-volume workflows compared to running a self-hosted or cloud-provider-managed Airflow instance. Step Functions incur costs based on the number of state transitions.
Lock-in with AWS: Tightly coupled with AWS services, which can be a downside if you're aiming for a cloud-agnostic architecture.
Complexity in handling large workflows: While capable, it can become difficult to manage larger, more complex workflows compared to using Airflow's DAGs. There are limitations on the number of parallel executions of a state machine.
Learning curve: The service also presents a learning curve with specific paradigms, such as the Amazon States Language.
Scheduling: AWS Step Functions needs to rely on other AWS services, like Amazon EventBridge, for scheduling.

Summary

Choosing the right tool for orchestrating dbt workflows comes down to assessing specific features and how they align with a team's needs. The main attributes that inform this decision include customization, cloud alignment, infrastructure flexibility, managed services, and cost considerations.

Customization and Extensibility

Apache Airflow is highly customizable and extends well, allowing teams to create tailored operators and workflows for complex requirements.
Integration With AWS

AWS Step Functions is the clear winner for teams operating solely within AWS, offering deep integration with the broader AWS ecosystem.

Infrastructure Flexibility

Apache Airflow supports a wide array of environments, making it ideal for multi-cloud or on-premises deployments.

Managed Services

Here, it’s a tie. For managed services, teams can opt for Amazon Managed Workflows for Apache Airflow (MWAA) for an AWS-centric approach or a vendor like Astronomer for hosting Airflow in different environments. There are also platforms like Dagster that offer similar features to Airflow and can be managed as well. This category is highly competitive and will come down to the level of integration and vendor preference.

Cost at Scale

Apache Airflow may prove more cost-effective at scale, given its open-source nature and the potential for optimized cloud or on-premises deployment. AWS Step Functions may be more economical at smaller scales or for teams with existing AWS infrastructure.

Conclusion

The choice between Apache Airflow and AWS Step Functions for orchestrating dbt workflows is nuanced. For operations deeply rooted in AWS with a preference for serverless execution and minimal maintenance, AWS Step Functions is the recommended choice. For those requiring robust customizability, diverse infrastructure support, or cost-effective scalability, Apache Airflow—whether self-managed or via a platform like Astronomer or MWAA (AWS-managed)—emerges as the optimal solution.
The cost of services is on everybody’s mind right now, with interest rates rising, economic growth slowing, and organizational budgets increasingly feeling the pinch. But I hear a special edge in people’s voices when it comes to their observability bill, and I don’t think it’s just about the cost of goods sold. I think it’s because people are beginning to correctly intuit that the value they get out of their tooling has become radically decoupled from the price they are paying.

In the happiest cases, the price you pay for your tools is “merely” rising at a rate several times faster than the value you get out of them. But that’s actually the best-case scenario. For an alarming number of people, the value they get actually decreases as their bill goes up.

Observability 1.0 and the Cost Multiplier Effect

Are you familiar with this chestnut? “Observability has three pillars: metrics, logs, and traces.” This isn’t exactly true, but it’s definitely true of a particular generation of tools—one might even say it's definitionally true of a particular generation of tools. Let’s call it “observability 1.0.”

From an evolutionary perspective, you can see how we got here. Everybody has logs… so we spin up a service for log aggregation. But logs are expensive, and everybody wants dashboards… so we buy a metrics tool. Software engineers want to instrument their applications… so we buy an APM tool. We start unbundling the monolith into microservices, and pretty soon, we can’t understand anything without traces… so we buy a tracing tool. The front-end engineers point out that they need sessions and browser data… so we buy a RUM tool. On and on it goes.

Logs, metrics, traces, APM, RUM. You’re now paying to store telemetry five different ways, in five different places, for every single request. And a 5x multiplier is on the modest side of the spectrum, given how many companies pay for multiple overlapping tools in the same category. You may also be collecting:

Profiling data
Product analytics
Business intelligence data
Database monitoring/query profiling tools
Mobile app telemetry
Behavioral analytics
Crash reporting
Language-specific profiling data
Stack traces
CloudWatch or hosting provider metrics
…and so on.

So, how many times are you paying to store data about your user requests? What’s your multiplier? (If you have one consolidated vendor bill, this may require looking at your itemized bill.)

There are many types of tools, each gathering slightly different data for a slightly different use case, but underneath the hood, there are really only three basic data types: metrics, unstructured logs, and structured logs. Each of these has its own distinctive trade-offs when it comes to how much they cost and how much value you can get out of them.

Metrics

Metrics are the great-granddaddy of telemetry formats: tiny, fast, and cheap. A “metric” consists of a single number, often with tags appended. All of the context of the request gets discarded at write time; each individual metric is emitted separately. This means you can never correlate one metric with another from the same request, or select all the metrics for a given request ID, user, or app ID, or ask arbitrary new questions about your metrics data. Metrics-based tools include vendors like Datadog and open-source projects like Prometheus. RUM tools are built on top of metrics to understand browser user sessions; APM tools are built on top of metrics to understand application performance.
When you set up a metrics tool, it generally comes prepopulated with a bunch of basic metrics, but the useful ones are typically the custom metrics you emit from your application. Your metrics bill is usually dominated by the cost of these custom metrics. At a minimum, your bill goes up linearly with the number of custom metrics you create. This is unfortunate because, to restrain your bill from unbounded growth, you have to regularly audit your metrics, do your best to guess which ones are going to be useful in the future, and prune any you think you can afford to go without. Even in the hands of experts, these tools require significant oversight.

Linear cost growth is the goal, but it’s rarely achieved. The cost of each metric varies wildly depending on how you construct it, what the values are, how often it gets hit, etc. I’ve seen a single custom metric cost $30k per month. You probably have dozens of custom metrics per service, and it’s almost impossible to tell how much each of them costs you. Metrics bills tend to be incredibly opaque (possibly by design).

Nobody can understand their software or their systems with a metrics tool alone because the metric is extremely limited in what it can do. No context, no cardinality, no strings… only basic static dashboards. For richer data, we must turn to logs.

Unstructured Logs

You can understand much more about your code with logs than you can with metrics. Logs are typically emitted multiple times throughout the execution of the request, with one or a small number of nouns per log line plus the request ID. Unstructured logs are still the default, although this is slowly changing. The cost of unstructured logs is driven by a few things:

Write amplification: If you want to capture lots of rich context about the request, you need to emit a lot of log lines. If you are printing out just 10 log lines per request, per service, and you have half a dozen services, that’s 60 log events for every request.
Noisiness: It’s extremely easy to accidentally blow up your log footprint yet add no value—e.g., by putting a print statement inside a loop instead of outside the loop. Here, the usefulness of the data goes down as the bill shoots up.
Constraints on physical resources: Due to the write amplification of log lines per request, it’s often physically impossible to log everything you want to log for all requests or all users—it would saturate your NIC or disk.

Therefore, people tend to use blunt instruments like these to blindly slash the log volume:

Log levels
Consistent hashes
Dumb sample rates

When you emit multiple log lines per request, you end up duplicating a lot of raw data; sometimes, over half the bits are consumed by request ID, process ID, and timestamp. This can be quite meaningful from a cost perspective.

All of these factors can be annoying. But the worst thing about unstructured logs is that the only thing you can do to query them is a full-text search. The more data you have, the slower it becomes to search that data, and there’s not much you can do about it. Searching your logs over any meaningful length of time can take minutes or even hours, which means experimenting and looking around for unknown unknowns is prohibitively time-consuming. You have to know what to look for in order to find it. Once again, as your logging bill goes up, the value goes down.

Structured Logs

Structured logs are gaining adoption across the industry, especially as OpenTelemetry picks up steam.
The nice thing about structured logs is that you can actually do things with the data other than slow, dumb string searches. If you’ve structured your data properly, you can perform calculations! Compute percentiles! Generate heatmaps! Tools built on structured logs are so clearly the future. But just taking your existing logs and adding structure isn’t quite good enough. If all you do is stuff your existing log lines into key/value pairs, the problems of amplification, noisiness, and physical constraints remain unchanged—you can just search more efficiently and do some math with your data. There are a number of things you can and should do to your structured logs in order to use them more effectively and efficiently. In order of achievability: Instrument your code using the principles of canonical logs, which collect all the vital characteristics of a request into one wide, dense event. It is difficult to overstate the value of doing this for reasons of usefulness and usability as well as cost control. Add trace IDs and span IDs so you can trace your code using the same events instead of having to use an entirely separate tool. Feed your data into a columnar storage engine so you don’t have to predefine a schema or indexes to decide in advance which dimensions you will be able to search or compute on in the future. Use a storage engine that supports high cardinality with an explorable interface. If you go far enough down this path of enriching your structured events, instrumenting your code with the right data, and displaying it in real time, you will reach an entirely different set of capabilities, with a cost model so distinct it can only be described as “observability 2.0.” More on that in a second. Ballooning Costs Are Baked Into Observability 1.0 To recap, high costs are baked into the observability 1.0 model. Every pillar has a price. You have to collect and store your data—and pay to store it—again and again and again for every single use case. Depending on how many tools you use, your observability bill may be growing at a rate 3x faster than your traffic is growing, or 5x, or 10x, or even more. It gets worse. As your costs go up, the value you get out of your tools goes down. Your logs get slower and slower to search. You have to know what you’re searching for in order to find it. You have to use a blunt force sampling technique to keep the log volume from blowing up. Any time you want to be able to ask a new question, you first have to commit new code and deploy it. You have to guess which custom metrics you’ll need and which fields to index in advance. As the volume goes up, your ability to find a needle in the haystack—any unknown unknowns—goes down commensurately. And nothing connects any of these tools. You cannot correlate a spike in your metrics dashboard with the same requests in your logs, nor can you trace one of the errors. It’s impossible. If your APM and metrics tools report different error rates, you have no way of resolving this confusion. The only thing connecting any of these tools is the intuition and straight-up guesses made by your most senior engineers. This means that the cognitive costs are immense, and your bus factor risks are very real. The most important connective data in your system—connecting metrics with logs and logs with traces—exists only in the heads of a few people. At the same time, the engineering overhead required to manage all these tools (and their bills) rises inexorably. 
With metrics, an engineer needs to spend time auditing your metrics, tracking people down to fix poorly constructed metrics, and reaping those that are too expensive or don’t get used. With logs, an engineer needs to spend time monitoring the log volume, watching for spammy or duplicate log lines, pruning or consolidating them, and choosing and maintaining indexes. But all this time spent wrangling observability 1.0 data types isn’t even the costliest part. The most expensive part is the unseen cost inflicted on your engineering organization as development slows down and tech debt piles up due to low visibility and, thus, low confidence. Is there an alternative? Yes. The Cost Model of Observability 2.0 Is Very Different Observability 2.0 has no three pillars; it has a single source of truth. Observability 2.0 tools are built on top of arbitrarily wide structured log events, also known as spans. From these wide, context-rich structured log events, you can derive the other data types (metrics, logs, or traces). Since there is only one data source, you can correlate and cross-correlate to your heart’s content. You can switch fluidly back and forth between slicing and dicing, breaking down or grouping by events, and viewing them as a trace waterfall. You don’t have to worry about cardinality or key space limitations. You also effectively get infinite custom metrics since you can append as many as you want to the same events. Not only does your cost not go up linearly as you add more custom metrics, but your telemetry just gets richer and more valuable the more key-value pairs you add! Nor are you limited to numbers; you can add any and all types of data, including valuable high-cardinality fields like “App Id” or “Full Name.” Observability 2.0 has its own amplification factor to consider. As you instrument your code with more spans per request, the number of events you have to send (and pay for) goes up. However, you have some very powerful tools for dealing with this: you can perform dynamic head-based sampling or even tail-based sampling, where you decide whether or not to keep the event after it’s finished, allowing you to capture 100% of slow requests and other outliers. Engineering Time Is Your Most Precious Resource But the biggest difference between observability 1.0 and 2.0 won’t show up on any invoice. The difference shows up in your engineering team’s ability to move quickly and with confidence. Modern software engineering is all about hooking up fast feedback loops. Observability 2.0 tooling is what unlocks the kind of fine-grained, exploratory experience you need in order to accelerate those feedback loops. Where observability 1.0 is about MTTR, MTTD, reliability, and operating software, observability 2.0 is what underpins the entire software development lifecycle, setting the bar for how swiftly you can build and ship software, find problems, and iterate on them. Observability 2.0 is about being in conversation with your code, understanding each user’s experience, and building the right things. Observability 2.0 isn’t exactly cheap either, although it is often less expensive. But the key difference between o11y 1.0 and o11y 2.0 has never been that either is cheap; it’s that with observability 2.0, when your bill goes up, the value you derive from your telemetry goes up too. You pay more money, and you get more out of your tools. 
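To ground the idea of arbitrarily wide structured events in something concrete, here is a minimal sketch of what one of these canonical, context-rich events might look like in code. The field names and the emitCanonicalLine helper are hypothetical rather than any specific vendor's API; the point is simply that all of the request's context, including high-cardinality fields, accumulates into a single structured event that is emitted once.
Java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

public class CanonicalLogSketch {
    public static void main(String[] args) {
        // One wide, dense event accumulated over the life of the request.
        Map<String, Object> event = new LinkedHashMap<>();
        event.put("trace_id", UUID.randomUUID().toString());
        event.put("span_id", UUID.randomUUID().toString());
        event.put("request_id", "req-8814");
        event.put("user_id", "user-42");        // high cardinality is fine here
        event.put("app_id", "checkout-service");
        event.put("endpoint", "/checkout");
        event.put("cart_items", 7);
        event.put("duration_ms", 412);
        event.put("status_code", 200);

        // Emitted exactly once, at the end of the request, instead of dozens
        // of scattered log lines. A columnar store can then slice, dice, and
        // group by any of these fields after the fact.
        emitCanonicalLine(event);
    }

    // Hypothetical emitter: in practice this would serialize the event to JSON
    // and ship it to your structured-log or tracing backend.
    static void emitCanonicalLine(Map<String, Object> event) {
        System.out.println(event);
    }
}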
Note: Earlier, I said, “Nothing connects any of these tools.” If you are using a single unified vendor for your metrics, logging, APM, RUM, and tracing tools, this is not strictly true. Vendors like New Relic or Datadog now let you define certain links between your traces and metrics, which allows you to correlate between data types in a few limited, predefined ways. This is better than nothing! But it’s very different from the kind of fluid, open-ended correlation capabilities that we describe with o11y 2.0. With o11y 2.0, you can slice and dice, break down, and group by your complex data sets, then grab a trace that matches any specific set of criteria at any level of granularity. With o11y 1.0, you can define a metric up front, then grab a random exemplar of that metric, and that’s it. All the limitations of metrics still apply; you can’t correlate any metric with any other metric from that request, app, user, etc., and you certainly can’t pull a trace matching arbitrary criteria. But it’s not nothing.
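Finally, to make the observability 2.0 amplification trade-off mentioned above concrete, here is a minimal sketch of a tail-based sampling decision. The thresholds and the shouldKeep function are hypothetical, not any particular vendor's API; the key idea is that because the keep-or-drop decision happens after the request has finished, slow requests and errors can always be kept while routine traffic is sampled down.
Java
import java.util.concurrent.ThreadLocalRandom;

public class TailSamplingSketch {
    // Keep 100% of slow or failed requests, and roughly 1 in 20 of the rest.
    static boolean shouldKeep(long durationMs, int statusCode) {
        if (durationMs > 1000 || statusCode >= 500) {
            return true; // always keep the outliers, since they are the interesting ones
        }
        return ThreadLocalRandom.current().nextInt(20) == 0;
    }

    public static void main(String[] args) {
        System.out.println(shouldKeep(1500, 200)); // slow request: always kept
        System.out.println(shouldKeep(80, 503));   // server error: always kept
        System.out.println(shouldKeep(80, 200));   // fast and healthy: kept ~5% of the time
    }
}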
In this example, we'll learn about the Strategy pattern in Spring. We'll cover different ways to inject strategies, moving from a simple list-based approach to a more efficient map-based method. To illustrate the concept, we'll use the three Unforgivable curses from the Harry Potter series — Avada Kedavra, Crucio, and Imperio. What Is the Strategy Pattern? The Strategy pattern is a design pattern that allows you to switch between different algorithms or behaviors at runtime. It helps make your code flexible and adaptable by allowing you to plug in different strategies without changing the core logic of your application. This approach is useful in scenarios where you have different implementations for a specific task or piece of functionality and want to make your system more adaptable to changes. It promotes a more modular code structure by separating the algorithmic details from the main logic of your application. Step 1: Implementing Strategy Picture yourself as a dark wizard who strives to master the power of Unforgivable curses with Spring. Our mission is to implement all three curses — Avada Kedavra, Crucio and Imperio. After that, we will switch between curses (strategies) at runtime. Let's start with our strategy interface: Java public interface CurseStrategy { String useCurse(); String curseName(); } In the next step, we need to implement all Unforgivable curses: Java @Component public class CruciatusCurseStrategy implements CurseStrategy { @Override public String useCurse() { return "Attack with Crucio!"; } @Override public String curseName() { return "Crucio"; } } @Component public class ImperiusCurseStrategy implements CurseStrategy { @Override public String useCurse() { return "Attack with Imperio!"; } @Override public String curseName() { return "Imperio"; } } @Component public class KillingCurseStrategy implements CurseStrategy { @Override public String useCurse() { return "Attack with Avada Kedavra!"; } @Override public String curseName() { return "Avada Kedavra"; } } Step 2: Inject Curses as List Spring brings a touch of magic that allows us to inject multiple implementations of an interface as a List so we can use it to inject strategies and switch between them. But let's first create the foundation: the Wizard interface. Java public interface Wizard { String castCurse(String name); } And we can inject our curses (strategies) into the Wizard and filter the desired one. Java @Service public class DarkArtsWizard implements Wizard { private final List<CurseStrategy> curses; public DarkArtsWizard(List<CurseStrategy> curses) { this.curses = curses; } @Override public String castCurse(String name) { return curses.stream() .filter(s -> name.equals(s.curseName())) .findFirst() .orElseThrow(UnsupportedCurseException::new) .useCurse(); } } We also create an UnsupportedCurseException, which is thrown if the requested curse does not exist. 
Java public class UnsupportedCurseException extends RuntimeException { } And we can verify that curse casting is working: Java @SpringBootTest class DarkArtsWizardTest { @Autowired private DarkArtsWizard wizard; @Test public void castCurseCrucio() { assertEquals("Attack with Crucio!", wizard.castCurse("Crucio")); } @Test public void castCurseImperio() { assertEquals("Attack with Imperio!", wizard.castCurse("Imperio")); } @Test public void castCurseAvadaKedavra() { assertEquals("Attack with Avada Kedavra!", wizard.castCurse("Avada Kedavra")); } @Test public void castCurseExpelliarmus() { assertThrows(UnsupportedCurseException.class, () -> wizard.castCurse("Abrakadabra")); } } Another popular approach is to define the canUse method instead of curseName. This returns a boolean and allows us to use more complex filtering, like: Java public interface CurseStrategy { String useCurse(); boolean canUse(String name, String wizardType); } @Component public class CruciatusCurseStrategy implements CurseStrategy { @Override public String useCurse() { return "Attack with Crucio!"; } @Override public boolean canUse(String name, String wizardType) { return "Crucio".equals(name) && "Dark".equals(wizardType); } } @Service public class DarkArtsWizard implements Wizard { private final List<CurseStrategy> curses; public DarkArtsWizard(List<CurseStrategy> curses) { this.curses = curses; } @Override public String castCurse(String name) { return curses.stream() .filter(s -> s.canUse(name, "Dark")) .findFirst() .orElseThrow(UnsupportedCurseException::new) .useCurse(); } } Pros: Easy to implement. Cons: Runs through a loop every time, which can lead to slower execution times and increased processing overhead. Step 3: Inject Strategies as Map We can easily address the cons from the previous section. Spring lets us inject a Map with bean names and instances. It simplifies the code and improves its efficiency. Java @Service public class DarkArtsWizard implements Wizard { private final Map<String, CurseStrategy> curses; public DarkArtsWizard(Map<String, CurseStrategy> curses) { this.curses = curses; } @Override public String castCurse(String name) { CurseStrategy curse = curses.get(name); if (curse == null) { throw new UnsupportedCurseException(); } return curse.useCurse(); } } This approach has a downside: Spring injects the bean name as the key for the Map, so strategy names are the same as the bean names, like cruciatusCurseStrategy. This dependency on Spring's internal bean names might cause problems if Spring's code or our class names change without notice. Let's check that we're still capable of casting those curses: Java @SpringBootTest class DarkArtsWizardTest { @Autowired private DarkArtsWizard wizard; @Test public void castCurseCrucio() { assertEquals("Attack with Crucio!", wizard.castCurse("cruciatusCurseStrategy")); } @Test public void castCurseImperio() { assertEquals("Attack with Imperio!", wizard.castCurse("imperiusCurseStrategy")); } @Test public void castCurseAvadaKedavra() { assertEquals("Attack with Avada Kedavra!", wizard.castCurse("killingCurseStrategy")); } @Test public void castCurseExpelliarmus() { assertThrows(UnsupportedCurseException.class, () -> wizard.castCurse("Crucio")); } } Pros: No loops. Cons: Dependency on bean names, which makes the code less maintainable and more prone to errors if names are changed or refactored. 
Step 4: Inject List and Convert to Map The cons of Map injection can be easily eliminated if we inject a List and convert it to a Map: Java @Service public class DarkArtsWizard implements Wizard { private final Map<String, CurseStrategy> curses; public DarkArtsWizard(List<CurseStrategy> curses) { this.curses = curses.stream() .collect(Collectors.toMap(CurseStrategy::curseName, Function.identity())); } @Override public String castCurse(String name) { CurseStrategy curse = curses.get(name); if (curse == null) { throw new UnsupportedCurseException(); } return curse.useCurse(); } } With this approach, we can move back to using curseName instead of Spring's bean names for the Map keys (strategy names). Step 5: @Autowire in Interface Spring supports autowiring into methods. The simplest example of autowiring into methods is setter injection. This feature allows us to use @Autowired in a default method of an interface so we can register each CurseStrategy with the Wizard without needing to implement a registration method in every strategy implementation. Let's update the Wizard interface by adding a registerCurse method: Java public interface Wizard { String castCurse(String name); void registerCurse(String curseName, CurseStrategy curse); } This is the Wizard implementation: Java @Service public class DarkArtsWizard implements Wizard { private final Map<String, CurseStrategy> curses = new HashMap<>(); @Override public String castCurse(String name) { CurseStrategy curse = curses.get(name); if (curse == null) { throw new UnsupportedCurseException(); } return curse.useCurse(); } @Override public void registerCurse(String curseName, CurseStrategy curse) { curses.put(curseName, curse); } } Now, let's update the CurseStrategy interface by adding a method with the @Autowired annotation: Java public interface CurseStrategy { String useCurse(); String curseName(); @Autowired default void registerMe(Wizard wizard) { wizard.registerCurse(curseName(), this); } } At the moment dependencies are injected, we register each curse with the Wizard. Pros: No loops, and no reliance on internal Spring bean names. Cons: No cons, pure dark magic. Conclusion In this article, we explored the Strategy pattern in the context of Spring. We assessed different strategy injection approaches and demonstrated an optimized solution using Spring's capabilities. The full source code for this article can be found on GitHub.
I started researching an article on how to add a honeytrap to a GitHub repo. The idea behind a honeypot weakness is that a hacker will follow through on it and make their presence known in the process. My plan was to place a GitHub personal access token in an Ansible vault protected by a weak password. Should an attacker crack the password and use the token to clone the private repository, a webhook would have been triggered, mailing a notification that the honeypot repo had been cloned and the password cracked. Unfortunately, GitHub does not seem to allow webhooks to be triggered by cloning, as it does for some of its higher-level actions. This set me thinking that platforms as standalone systems are not designed with Dev(Sec)Ops integration in mind. DevOps engineers have to bite the bullet and always find ways to secure pipelines end-to-end. I, therefore, instead decided to investigate how to prevent code theft using tokens or private keys gained by nefarious means. Prevention Is Better Than Detection It is not best practice to keep secret material on hard drives in the belief that root-only access is sufficient security. Any system administrator or hacker who gains root can view the secret in the open. Secrets should, rather, be kept inside Hardware Security Modules (HSMs) or, at the very least, a secret manager. Furthermore, tokens and private keys should never be passed in as command line arguments since they might be written to a log file. A way to solve this problem is to make use of a super-secret master key to initiate proceedings and to finalize them using short-lived lesser keys. This is similar to the problem of sharing the first key in applied cryptography. Once the first key has been agreed upon, successive transactions can be secured using session keys. It goes without saying that the first key has to be stored in a Hardware Security Module, and all operations against it have to happen inside the HSM. I decided to try out something similar when Ansible clones private Git repositories. Although I will illustrate this using GitHub, I am pretty sure something similar can be set up for other Git platforms as well. First Key GitHub personal access tokens can be used to perform a wide range of actions on your GitHub account and its repositories. They authenticate and authorize requests from both the command line and the GitHub API. A personal access token can therefore clearly serve as the first key. Personal access tokens are created by clicking your avatar in the top right and selecting Settings: A left nav panel should appear from where you select Developer settings: The menu for personal access tokens will display where you can create the token: I created a classic token and gave it the following scopes/permissions: repo, admin:public_key, user, and admin:gpg_key. Take care to store the token in a reputable secret manager from where it can be copied and pasted in when the Ansible play prompts for it at the start. This secret manager should clear the copy buffer after a few seconds to prevent attacks that rely on diverting your attention. vars_prompt: - name: github_token prompt: "Enter your github personal access token?" private: true Establishing the Session GitHub deployment keys give access to private repositories. They can be created by an API call or from the repo's top menu by clicking on Settings: With the personal access token as the first key, a deployment key can finish the operation as the session key. 
Specifically, Ansible authenticates itself using the token, creates the deployment key, authorizes the clone, and deletes it immediately afterward. The code from my previous post relied on adding Git URLs that contain the tokens to the Ansible vault. This has now been improved to use temporary keys as envisioned in this post. An Ansible role provided by Asif Mahmud has been amended for this as can be seen in the usual GitHub repo. The critical snippets are: - name: Add SSH public key to GitHub account ansible.builtin.uri: url: "https://api.{{ git_server_fqdn }}/repos/{{ github_account_id }}/{{ repo }}/keys" validate_certs: yes method: POST force_basic_auth: true body: title: "{{ key_title }}" key: "{{ key_content.stdout }}" read_only: true body_format: json headers: Accept: application/vnd.github+json X-GitHub-Api-Version: 2022-11-28 Authorization: "Bearer {{ github_access_token }}" status_code: - 201 - 422 register: create_result The GitHub API is used to add the deploy key to the private repository. Note the use of the access token, typed in at the start of the play, to authenticate and authorize the request. - name: Clone the repository shell: | GIT_SSH_COMMAND="ssh -i {{ key_path }} -v -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null" {{ git_executable }} clone git@{{ git_server_fqdn }}:{{ github_account_id }}/{{ repo }}.git {{ clone_dest }} - name: Switch branch shell: "{{ git_executable }} checkout {{ branch }}" args: chdir: "{{ clone_dest }}" The repo is cloned, followed by a switch to the required branch. - name: Delete SSH public key ansible.builtin.uri: url: "https://api.{{ git_server_fqdn }}/repos/{{ github_account_id }}/{{ repo }}/keys/{{ create_result.json.id }}" validate_certs: yes method: DELETE force_basic_auth: true headers: Accept: application/vnd.github+json X-GitHub-Api-Version: 2022-11-28 Authorization: "Bearer {{ github_access_token }}" status_code: - 204 Deletion of the deployment key happens directly after the clone and switch, again via the API. Conclusion The short life of the deployment key enhances the security of the DevOps pipeline tremendously. Only the token has to be kept secured at all times, as is the case for any first key. Ideally, you should integrate Ansible with a compatible HSM platform. I thank Asif Mahmud, whose code I amended to illustrate the concept of using temporary session keys when cloning private Git repositories.
Introduction to Datafaker Datafaker is a modern framework that enables JVM programmers to efficiently generate fake data for their projects using over 200 data providers, allowing for quick setup and usage. Custom providers can be written when you need some domain-specific data. In addition to providers, the generated data can be exported to popular formats like CSV, JSON, SQL, XML, and YAML. For a good introduction to the basic features, see "Datafaker: An Alternative to Using Production Data." Datafaker offers many features, such as working with sequences and collections and generating custom objects based on schemas (see "Datafaker 2.0"). Bulk Data Generation In software development and testing, the need to frequently generate data for various purposes arises, whether it's to conduct non-functional tests or to simulate burst loads. Let's consider a straightforward scenario in which we have the task of generating 10,000 messages in JSON format to be sent to RabbitMQ. From my perspective, these options are worth considering: Developing your own tool: One option is to write a custom application from scratch to generate these records (messages). If the generated data needs to be more realistic, it makes sense to use Datafaker or JavaFaker. Using specific tools: Alternatively, we could select specific tools designed for particular databases or message brokers. For example, tools like voluble for Kafka provide specialized functionalities for generating and publishing messages to Kafka topics. There is also a more modern tool, ShadowTraffic, which is currently under development and geared towards a container-based approach that may not always be necessary. Datafaker Gen: Finally, we have the option to use Datafaker Gen, which I want to consider in the current article. Datafaker Gen Overview Datafaker Gen offers a command-line generator based on the Datafaker library, which allows for the continuous generation of data in various formats and integration with different storage systems, message brokers, and backend services. Since this tool uses Datafaker, the generated data can be made quite realistic. Configuration of the schema, format type, and sink can be done without rebuilding the project. Datafaker Gen consists of the following main components that can be configured: 1. Schema Definition Users can define the schema for their records in the config.yaml file. The schema specifies the field definitions of the record based on the Datafaker provider. It also allows for the definition of embedded fields. YAML default_locale: en-EN fields: - name: lastname generators: [ Name#lastName ] - name: firstname generators: [ Name#firstName ] 2. Format Datafaker Gen allows users to specify the format in which records will be generated. Currently, there are basic implementations for CSV, JSON, SQL, XML, and YAML formats. Additionally, formats can be extended with custom implementations. The configuration for formats is specified in the output.yaml file. YAML formats: csv: quote: "@" separator: $$$$$$$ json: formattedAs: "[]" yaml: xml: pretty: true 3. Sink The sink component determines where the generated data will be stored or published. The basic implementation includes command-line output and text file sinks. Additionally, sinks can be extended with custom implementations such as RabbitMQ, as demonstrated in the current article. The configuration for sinks is specified in the output.yaml file. 
YAML sinks: rabbitmq: batchsize: 1 # when 1 message contains 1 document, when >1 message contains a batch of documents host: localhost port: 5672 username: guest password: guest exchange: test.direct.exchange routingkey: products.key Extensibility via Java SPI Datafaker Gen uses the Java SPI (Service Provider Interface) to make it easy to add new formats or sinks. This extensibility allows for customization of Datafaker Gen according to specific requirements. How To Add a New Sink in Datafaker Gen Before adding a new sink, you may want to check if it already exists in the datafaker-gen-examples repository. If it does not exist, you can refer to examples on how to add a new sink. When it comes to extending Datafaker Gen with new sink implementations, developers have two primary options to consider. The first is to use the parent project, in which case they can implement the sink interfaces for their sink extensions, similar to those available in the datafaker-gen-examples repository. The second is to include dependencies from the Maven repository to access the required interfaces; for this approach, Datafaker Gen should be built and exist in the local Maven repository. This approach provides flexibility in project structure and requirements. 1. Implementing RabbitMQ Sink To add a new RabbitMQ sink, one simply needs to implement the net.datafaker.datafaker_gen.sink.Sink interface. This interface contains two methods: getName - This method defines the sink name. run - This method triggers the generation of records and then sends or saves all the generated records to the specified destination. The method parameters include the configuration specific to this sink retrieved from the output.yaml file, as well as the data generation function and the desired number of lines to be generated. Java import java.util.Map; import java.util.function.Function; import com.google.gson.JsonArray; import com.google.gson.JsonParser; import com.rabbitmq.client.Channel; import com.rabbitmq.client.Connection; import com.rabbitmq.client.ConnectionFactory; import net.datafaker.datafaker_gen.sink.Sink; public class RabbitMqSink implements Sink { @Override public String getName() { return "rabbitmq"; } @Override public void run(Map<String, ?> config, Function<Integer, ?> function, int numberOfLines) { // Read output configuration ... int numberOfLinesToPrint = numberOfLines; String host = (String) config.get("host"); // Generate lines String lines = (String) function.apply(numberOfLinesToPrint); // Sending or saving results to the expected resource // In this case, this is connecting to RabbitMQ and sending messages. ConnectionFactory factory = getConnectionFactory(host, port, username, password); try (Connection connection = factory.newConnection()) { Channel channel = connection.createChannel(); JsonArray jsonArray = JsonParser.parseString(lines).getAsJsonArray(); jsonArray.forEach(jsonElement -> { try { channel.basicPublish(exchange, routingKey, null, jsonElement.toString().getBytes()); } catch (Exception e) { throw new RuntimeException(e); } }); } catch (Exception e) { throw new RuntimeException(e); } } } 2. Adding Configuration for the New RabbitMQ Sink As previously mentioned, the configuration for sinks or formats can be added to the output.yaml file. The specific fields may vary depending on your custom sink. Below is an example configuration for a RabbitMQ sink: YAML sinks: rabbitmq: batchsize: 1 # when 1 message contains 1 document, when >1 message contains a batch of documents host: localhost port: 5672 username: guest password: guest exchange: test.direct.exchange routingkey: products.key 
3. Adding Custom Sink via SPI Adding a custom sink via SPI (Service Provider Interface) involves including the provider configuration in the ./resources/META-INF/services/net.datafaker.datafaker_gen.sink.Sink file. This file contains paths to the sink implementations: Properties files net.datafaker.datafaker_gen.sink.RabbitMqSink Those are all three simple steps needed to extend Datafaker Gen. In this example, we have not provided a complete implementation of the sink, nor shown how to use the additional libraries. To see the complete implementations, you can refer to the datafaker-gen-rabbitmq module in the example repository. How To Run Step 1 Build a JAR file based on the new implementation: Shell ./mvnw clean verify Step 2 Define the schema for records in the config.yaml file and place this file in the appropriate location where the generator should run. Additionally, define the sinks and formats in the output.yaml file, as demonstrated previously. Step 3 Datafaker Gen can be executed in two ways: 1. Use the bash script from the bin folder in the parent project: Shell # Format json, number of lines 10000 and new RabbitMq Sink bin/datafaker_gen -f json -n 10000 -sink rabbitmq 2. Execute the JAR directly, like this: Shell java -cp [path_to_jar] net.datafaker.datafaker_gen.DatafakerGen -f json -n 10000 -sink rabbitmq How Fast Is It? The test was done based on the schema described above, which means that one document consists of two fields. Documents are written one by one to the RabbitMQ queue in JSON format. The results below show the speed for 10,000, 100,000, and 1M records on my local machine: 10,000 records in 401 ms; 100,000 records in 11,613 ms; 1,000,000 records in 121,601 ms. Conclusion The Datafaker Gen tool enables the creation of flexible and fast data generators for various types of destinations. Built on Datafaker, it facilitates realistic data generation. Developers can easily configure the content of records, formats, and sinks to suit their needs. As a simple Java application, it can be deployed anywhere you want, whether in Docker or on on-premises machines. The full source code is available here. I would like to thank Sergey Nuyanzin for reviewing this article. Thank you for reading, and I am glad to be of help.
We don't usually think of Git as a debugging tool. Surprisingly, Git shines not just as a version control system, but also as a potent debugging ally when dealing with the tricky matter of regressions. The Essence of Debugging with Git Before we tap into the advanced aspects of git bisect, it's essential to understand its foundational premise. Git is known for tracking changes and managing code history, but the git bisect tool is a hidden gem for regression detection. Regressions are distinct from generic bugs. They signify a backward step in functionality—where something that once worked flawlessly now fails. Pinpointing the exact change causing a regression can be akin to finding a needle in a haystack, particularly in extensive codebases with long commit histories. Traditionally, developers would employ a manual, binary search strategy—checking out different versions, testing them, and narrowing down the search scope. This method, while effective, is painstakingly slow and error-prone. Git bisect automates this search, transforming what used to be a marathon into a swift sprint. Setting the Stage for Debugging Imagine you're working on a project, and recent reports indicate a newly introduced bug affecting the functionality of a feature that previously worked flawlessly. You suspect a regression but are unsure which commit introduced the issue among the hundreds made since the last stable version. Initiating Bisect Mode To start, you'll enter bisect mode in your terminal within the project's Git repository: git bisect start This command signals Git to prepare for the bisect process. Marking the Known Good Revision Next, you identify a commit where the feature functioned correctly, often a commit tagged with a release number or dated before the issue was reported. Mark this commit as "good": git bisect good a1b2c3d Here, a1b2c3d represents the hash of the known good commit. Marking the Known Bad Revision Similarly, you mark the current version or a specific commit where the bug is present as "bad": git bisect bad z9y8x7w z9y8x7w is the hash of the bad commit, typically the latest commit in the repository where the issue is observed. Bisecting To Find the Culprit Upon marking the good and bad commits, Git automatically jumps to a commit roughly in the middle of the two and waits for you to test this revision. After testing (manually or with a script), you inform Git of the result: If the issue is present: git bisect bad If the issue is not present: git bisect good Git then continues to narrow down the range, selecting a new commit to test based on your feedback. Expected Output After several iterations, Git will isolate the problematic commit, displaying a message similar to: Bisecting: 0 revisions left to test after this (roughly 3 steps) [abcdef1234567890] Commit message of the problematic commit Reset and Analysis Once the offending commit is identified, you conclude the bisect session to return your repository to its initial state: git bisect reset Notice that bisect isn't linear. Bisect doesn't scan through the revisions in a sequential manner. Based on the good and bad markers, Git automatically selects a commit approximately in the middle of the range for testing. This is where the non-linear, binary search pattern starts, as Git divides the search space in half instead of examining each commit sequentially. This means fewer revisions get scanned and the process is faster. 
Advanced Usage and Tips The magic of git bisect lies in its ability to automate the binary search algorithm within your repository, systematically halving the search space until the rogue commit is identified. Git bisect offers a powerful avenue for debugging, especially for identifying regressions in a complex codebase. To elevate your use of this tool, consider delving into more advanced techniques and strategies. These tips not only enhance your debugging efficiency but also provide practical solutions to common challenges encountered during the bisecting process. Script Automation for Precision and Efficiency Automating the bisect process with a script is a game-changer, significantly reducing manual effort and minimizing the risk of human error. This script should ideally perform a quick test that directly targets the regression, returning an exit code based on the test's outcome. Example Imagine you're debugging a regression where a web application's login feature breaks. You could write a script that attempts to log in using a test account and checks if the login succeeds. The script might look something like this in a simplified form: #!/bin/bash # Attempt to log in and check for success if curl -s http://yourapplication/login -d "username=test&password=test" | grep -q "Welcome"; then exit 0 # Login succeeded, mark this commit as good else exit 1 # Login failed, mark this commit as bad fi By passing this script to git bisect run, Git automatically executes it at each step of the bisect process, effectively automating the regression hunt. Handling Flaky Tests With Strategy Flaky tests, which sometimes pass and sometimes fail under the same conditions, can complicate the bisecting process. To mitigate this, your automation script can include logic to rerun tests a certain number of times or to apply more sophisticated checks to differentiate between a true regression and a flaky failure. Example Suppose you have a test that's known to be flaky. You could adjust your script to run the test multiple times, considering the commit "bad" only if the test fails consistently: #!/bin/bash # Run the flaky test three times success_count=0 for i in {1..3}; do if ./run_flaky_test.sh; then ((success_count++)) fi done # If the test succeeds twice or more, consider it a pass if [ "$success_count" -ge 2 ]; then exit 0 else exit 1 fi This approach reduces the chances that a flaky test will lead to incorrect bisect results. Skipping Commits With Care Sometimes, you'll encounter commits that cannot be tested due to reasons like broken builds or incomplete features. git bisect skip is invaluable here, allowing you to bypass these commits. However, use this command judiciously to ensure it doesn't obscure the true source of the regression. Example If you know that commits related to database migrations temporarily break the application, you can skip testing those commits. During the bisect session, when Git lands on a commit you wish to skip, you would manually issue: git bisect skip This tells Git to exclude the current commit from the search and adjust its calculations accordingly. It's essential to only skip commits when absolutely necessary, as skipping too many can interfere with the accuracy of the bisect process. These advanced strategies enhance the utility of git bisect in your debugging toolkit. 
By automating the regression testing process, handling flaky tests intelligently, and knowing when to skip untestable commits, you can make the most out of git bisect for efficient and accurate debugging. Remember, the goal is not just to find the commit where the regression was introduced but to do so in the most time-efficient manner possible. With these tips and examples, you're well-equipped to tackle even the most elusive regressions in your projects. Unraveling a Regression Mystery In the past, we got to use git bisect when working on a large-scale web application. After a routine update, users began reporting a critical feature failure: the application's payment gateway stopped processing transactions correctly, leading to a significant business impact. We knew the feature worked in the last release but had no idea which of the hundreds of recent commits introduced the bug. Manually testing each commit was out of the question due to time constraints and the complexity of the setup required for each test. Enter git bisect. The team started by identifying a "good" commit where the payment gateway functioned correctly and a "bad" commit where the issue was observed. We then crafted a simple test script that would simulate a transaction and check if it succeeded. By running git bisect start, followed by marking the known good and bad commits, and executing the script with git bisect run, we set off on an automated process that identified the faulty commit. Git efficiently navigated through the commits, automatically running the test script on each step. In a matter of minutes, git bisect pinpointed the culprit: a seemingly innocuous change to the transaction logging mechanism that inadvertently broke the payment processing logic. Armed with this knowledge, we reverted the problematic change, restoring the payment gateway's functionality and averting further business disruption. This experience not only resolved the immediate issue but also transformed our approach to debugging, making git bisect a go-to tool in our arsenal. Final Word The story of the payment gateway regression is just one example of how git bisect can be a lifesaver in the complex world of software development. By automating the tedious process of regression hunting, git bisect not only saves precious time but also brings a high degree of precision to the debugging process. As developers continue to navigate the challenges of maintaining and improving complex codebases, tools like git bisect underscore the importance of leveraging technology to work smarter, not harder. Whether you're dealing with a mysterious regression or simply want to refine your debugging strategies, git bisect offers a powerful, yet underappreciated, solution to swiftly and accurately identify the source of regressions. Remember, the next time you're faced with a regression, git bisect might just be the debugging partner you need to uncover the truth hidden within your commit history.