I never moved away from Docker Desktop. For some time, after you use it to build an image, it prints a message:

```text
What's Next?
  View a summary of image vulnerabilities and recommendations → docker scout quickview
```

I decided to give it a try. I'll use the root commit of my OpenTelemetry tracing demo. Let's execute the proposed command:

```shell
docker scout quickview otel-catalog:1.0
```

Here's the result:

```text
✓ Image stored for indexing
✓ Indexed 272 packages

  Target               │ otel-catalog:1.0       │ 0C 2H 15M 23L
    digest             │ 7adfce68062e           │
  Base image           │ eclipse-temurin:21-jre │ 0C 0H 15M 23L
  Refreshed base image │ eclipse-temurin:21-jre │ 0C 0H 15M 23L

What's Next?
  View vulnerabilities → docker scout cves otel-catalog:1.0
  View base image update recommendations → docker scout recommendations otel-catalog:1.0
  Include policy results in your quickview by supplying an organization → docker scout quickview otel-catalog:1.0 --org <organization>
```

Docker gives out exciting bits of information:

- The base image contains 15 medium-severity vulnerabilities and 23 low-severity ones.
- The final image has two additional high-severity vulnerabilities. Ergo, our code introduced them!

Following Scout's suggestion, we can drill down into the CVEs:

```shell
docker scout cves otel-catalog:1.0
```

This is the result:

```text
✓ SBOM of image already cached, 272 packages indexed
✗ Detected 18 vulnerable packages with a total of 39 vulnerabilities

## Overview

                    │ Analyzed Image
  ──────────────────┼──────────────────────────────
  Target            │ otel-catalog:1.0
    digest          │ 7adfce68062e
    platform        │ linux/arm64
    vulnerabilities │ 0C 2H 15M 23L
    size            │ 160 MB
    packages        │ 272

## Packages and Vulnerabilities

  0C 1H 0M 0L  org.yaml/snakeyaml 1.33
  pkg:maven/org.yaml/snakeyaml@1.33

    ✗ HIGH CVE-2022-1471 [Improper Input Validation]
      https://scout.docker.com/v/CVE-2022-1471
      Affected range : <=1.33
      Fixed version  : 2.0
      CVSS Score     : 8.3
      CVSS Vector    : CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:H/I:H/A:L

  0C 1H 0M 0L  io.netty/netty-handler 4.1.100.Final
  pkg:maven/io.netty/netty-handler@4.1.100.Final

    ✗ HIGH CVE-2023-4586 [OWASP Top Ten 2017 Category A9 - Using Components with Known Vulnerabilities]
      https://scout.docker.com/v/CVE-2023-4586
      Affected range : >=4.1.0
                     : <5.0.0
      Fixed version  : not fixed
      CVSS Score     : 7.4
      CVSS Vector    : CVSS:3.1/AV:N/AC:H/PR:N/UI:N/S:U/C:H/I:H/A:N
```

The original output is much longer, but I stopped at the exciting bit: the two high-severity CVEs. First, we see the one coming from Netty still needs to be fixed — tough luck. However, SnakeYAML fixed its CVE from version 2.0 onward. I'm not using SnakeYAML directly; it's a transitive dependency brought in by Spring. Because of this, there is no guarantee that a major version upgrade will be compatible. But we can surely try. Let's bump the dependency to the latest version:

```xml
<dependency>
    <groupId>org.yaml</groupId>
    <artifactId>snakeyaml</artifactId>
    <version>2.2</version>
</dependency>
```

We can build the image again and check that it still works. Fortunately, it does. We can execute the process again:

```shell
docker scout quickview otel-catalog:1.0
```

Lo and behold, the high-severity CVE is no more!

```text
✓ Image stored for indexing
✓ Indexed 273 packages

  Target     │ local://otel-catalog:1.0-1 │ 0C 1H 15M 23L
    digest   │ 9ddc31cdd304               │
  Base image │ eclipse-temurin:21-jre     │ 0C 0H 15M 23L
```

Conclusion

In this short post, we tried Docker Scout, Docker's image vulnerability detection tool. Thanks to it, we removed one high-severity CVE that our code had introduced.
To Go Further

- Docker Scout
- 4 Free, Easy-To-Use Tools For Docker Vulnerability Scanning
In a previous blog, I demonstrated how to use Redis (ElastiCache Serverless as an example) as a chat history backend for a Streamlit app using LangChain. It was deployed to EKS and also made use of EKS Pod Identity to manage the application Pod's permissions for invoking Amazon Bedrock. The use case here is a similar one: a chat application. I will switch back to implementing things in Go using langchaingo (I used Python for the previous one) and continue to use Amazon Bedrock. But there are a few unique things you can explore in this blog post:

- The chat application is deployed as an AWS Lambda function along with a Function URL.
- It uses DynamoDB as the chat history store (aka Memory) for each conversation - I extended langchaingo to include this feature.
- Thanks to the AWS Lambda Web Adapter, the application was built as a (good old) REST/HTTP API using a familiar library (in this case, Gin).
- Another nice add-on was being able to combine the Lambda Web Adapter streaming response feature with the Amazon Bedrock streaming inference API.

Deploy Using SAM CLI (Serverless Application Model)

Make sure you have the Amazon Bedrock prerequisites taken care of and the SAM CLI installed.

```shell
git clone https://github.com/abhirockzz/chatbot-bedrock-dynamodb-lambda-langchain
cd chatbot-bedrock-dynamodb-lambda-langchain
```

Run the following commands to build the function and deploy the entire app infrastructure (including the Lambda function, DynamoDB table, etc.):

```shell
sam build
sam deploy -g
```

Once deployed, you should see the Lambda Function URL in your terminal. Open it in a web browser and start conversing with the chatbot!

Inspect the DynamoDB table to verify that the conversations are being stored (each conversation ends up as a new item in the table with a unique chat_id):

```shell
aws dynamodb scan --table-name langchain_chat_history
```

The Scan operation is used here for demonstration purposes; using Scan in production is not recommended.

Quick Peek at the Good Stuff

Using DynamoDB as the backend store for chat history: Refer to the GitHub repository if you are interested in the implementation. To summarize, I implemented the required functions of the schema.ChatMessageHistory interface.

Lambda Web Adapter streaming response + LangChain streaming: I used the chains.WithStreamingFunc option with the chains.Call call and then let Gin Stream do the heavy lifting of handling the streaming response. Here is a sneak peek of the implementation (refer to the complete code here):

```go
_, err = chains.Call(c.Request.Context(), chain, map[string]any{"human_input": message},
	chains.WithMaxTokens(8191),
	chains.WithStreamingFunc(func(ctx context.Context, chunk []byte) error {
		c.Stream(func(w io.Writer) bool {
			fmt.Fprintf(w, string(chunk))
			return false
		})
		return nil
	}))
```

Closing Thoughts

I really like the extensibility of LangChain. While I understand that langchaingo may not be as popular as the original Python version (I hope it will get there in due time), it's nice to be able to use it as a foundation and build extensions as required.

Previously, I had written about how to use the AWS Lambda Go Proxy API to run existing Go applications on AWS Lambda. The AWS Lambda Web Adapter offers similar functionality but has lots of other benefits, including response streaming and the fact that it is language agnostic.

Oh, and one more thing - I also tried a different approach to building this solution using API Gateway WebSockets. Let me know if you're interested, and I would be happy to write it up!
If you want to explore how to use Go for Generative AI solutions, you can read up on some of my earlier blogs:

- Building LangChain applications with Amazon Bedrock and Go - An introduction
- Serverless Image Generation Application Using Generative AI on AWS
- Generative AI Apps With Amazon Bedrock: Getting Started for Go Developers
- Use Amazon Bedrock and LangChain to build an application to chat with web pages

Happy building!
As organizations increasingly migrate their applications to the cloud, efficient and scalable load balancing becomes pivotal for ensuring optimal performance and high availability. This article provides an overview of Azure's load balancing options, encompassing Azure Load Balancer, Azure Application Gateway, Azure Front Door Service, and Azure Traffic Manager. Each of these services addresses specific use cases, offering diverse functionalities to meet the demands of modern applications. Understanding the strengths and applications of these load-balancing services is crucial for architects and administrators seeking to design resilient and responsive solutions in the Azure cloud environment.

What Is Load Balancing?

Load balancing is a critical component in cloud architectures for various reasons. Firstly, it ensures optimized resource utilization by evenly distributing workloads across multiple servers or resources, preventing any single server from becoming a performance bottleneck. Secondly, load balancing facilitates scalability in cloud environments, allowing resources to be scaled based on demand by evenly distributing incoming traffic among available resources. Additionally, load balancers enhance high availability and reliability by redirecting traffic to healthy servers in the event of a server failure, minimizing downtime and ensuring accessibility. From a security perspective, load balancers implement features like SSL termination, protect backend servers from direct exposure to the internet, and aid in mitigating DDoS attacks and in threat detection/protection using Web Application Firewalls. Furthermore, efficient load balancing promotes cost efficiency by optimizing resource allocation, preventing the need for excessive server capacity during peak loads. Finally, dynamic traffic management across regions or geographic locations allows load balancers to adapt to changing traffic patterns, intelligently distributing traffic during high-demand periods and scaling down resources during low-demand periods, leading to overall cost savings.

Overview of Azure's Load Balancing Options

Azure Load Balancer: Unleashing Layer 4 Power

Azure Load Balancer is a Layer 4 (TCP, UDP) load balancer that distributes incoming network traffic across multiple virtual machines or Virtual Machine Scale Sets to ensure no single server is overwhelmed with too much traffic. There are two options: a Public Load Balancer, primarily used for internet traffic, which also supports outbound connections, and a Private (internal) Load Balancer for balancing traffic within a virtual network. The load balancer uses a five-tuple hash (source IP, source port, destination IP, destination port, protocol) to map traffic to the available backend instances.

Features

- High availability and redundancy: Azure Load Balancer efficiently distributes incoming traffic across multiple virtual machines or instances in a web application deployment, ensuring high availability, redundancy, and even distribution, thereby preventing any single server from becoming a bottleneck. In the event of a server failure, the load balancer redirects traffic to healthy servers.
- Outbound connectivity: The frontend IPs of a public load balancer can be used to provide outbound connectivity to the internet for backend servers and VMs. This configuration uses source network address translation (SNAT) to translate the virtual machine's private IP into the load balancer's public IP address, thus preventing outside sources from having a direct address to the backend instances.
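To build intuition for how five-tuple hashing keeps all packets of a given flow on the same backend instance, here is a small illustrative Python sketch. It is not Azure's actual algorithm; the backend addresses and flows below are made up for the example.

```python
import hashlib

# Hypothetical backend pool behind the load balancer
backends = ["10.0.0.4", "10.0.0.5", "10.0.0.6"]

def pick_backend(src_ip, src_port, dst_ip, dst_port, protocol):
    """Map a flow's five-tuple to a backend using a stable hash."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}/{protocol}".encode()
    digest = int(hashlib.sha256(key).hexdigest(), 16)
    return backends[digest % len(backends)]

# Packets belonging to the same flow always hash to the same backend ...
print(pick_backend("203.0.113.7", 50123, "52.10.1.1", 443, "TCP"))
print(pick_backend("203.0.113.7", 50123, "52.10.1.1", 443, "TCP"))
# ... while a new flow (different source port) may land on another one.
print(pick_backend("203.0.113.7", 50124, "52.10.1.1", 443, "TCP"))
```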
- Internal load balancing: Distribute traffic across internal servers within a virtual network (VNet); this ensures that services receive an optimal share of resources.
- Cross-region load balancing: Azure Load Balancer facilitates the distribution of traffic among virtual machines deployed in different Azure regions, optimizing performance and ensuring low-latency access for users of global applications or services with a user base spanning multiple geographic regions.
- Health probing and failover: Azure Load Balancer monitors the health of backend instances continuously, automatically redirecting traffic away from unhealthy instances, such as those experiencing application errors or server failures, to ensure seamless failover.
- Port-level load balancing: For services running on different ports within the same server, Azure Load Balancer can distribute traffic based on the specified port numbers. This is useful for applications with multiple services running on the same set of servers.
- Multiple frontends: Azure Load Balancer allows you to load balance services on multiple ports, multiple IP addresses, or both. You can use a public or internal load balancer to load balance traffic across a set of services such as Virtual Machine Scale Sets or virtual machines (VMs).
- High Availability (HA) ports: HA ports play a crucial role in ensuring resilient and reliable network traffic management. They are designed to enhance the availability and redundancy of applications by providing failover capabilities and optimal performance, distributing incoming network traffic across multiple virtual machines to prevent a single point of failure.

Configuration and Optimization Strategies

- Define a well-organized backend pool, incorporating healthy and properly configured virtual machines (VMs) or instances, and consider leveraging availability sets or availability zones to enhance fault tolerance and availability.
- Define load balancing rules to specify how incoming traffic should be distributed. Consider factors such as protocol, port, and backend pool association. Use session persistence settings when necessary to ensure that requests from the same client are directed to the same backend instance.
- Configure health probes to regularly check the status of backend instances. Adjust probe settings, such as probing intervals and thresholds, based on the application's characteristics.
- Choose between the Standard SKU and the Basic SKU based on the feature set required for your application.
- Implement frontend IP configurations to define how the load balancer should handle incoming network traffic.
- Implement Azure Monitor to collect and analyze telemetry data, set up alerts based on performance thresholds for proactive issue resolution, and enable diagnostics logging to capture detailed information about the load balancer's operations.
- Adjust the idle timeout settings to optimize the connection timeout for your application. This is especially important for applications with long-lived connections.
- Enable accelerated networking on virtual machines to take advantage of high-performance networking features, which can enhance the overall efficiency of the load-balanced application.

Azure Application Gateway: Elevating To Layer 7

Azure Application Gateway is a Layer 7 load balancer that provides advanced traffic distribution and web application firewall (WAF) capabilities for web applications.
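Whereas a Layer 4 load balancer only sees the connection five-tuple, a Layer 7 gateway makes decisions on HTTP attributes such as the URL path or host header, as the feature list below describes. As a rough illustration only, here is a Python sketch with made-up pool names; it is not Application Gateway's actual routing logic.

```python
import zlib

# Hypothetical backend pools keyed by URL path prefix
pools = {
    "/images/": ["img-vm-1", "img-vm-2"],
    "/api/":    ["api-vm-1", "api-vm-2", "api-vm-3"],
}
default_pool = ["web-vm-1", "web-vm-2"]

def route(path: str, client_id: str) -> str:
    """Pick a backend pool by URL path prefix, then a server within it."""
    pool = next((servers for prefix, servers in pools.items()
                 if path.startswith(prefix)), default_pool)
    # Stable choice within the pool; a real gateway also weighs health
    # probes, session affinity, and instance weights.
    return pool[zlib.crc32(client_id.encode()) % len(pool)]

print(route("/api/orders/42", "client-abc"))    # lands in the API pool
print(route("/images/logo.png", "client-abc"))  # lands in the images pool
```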
Features

- Web application routing: Azure Application Gateway allows requests to be routed to different backend pools based on specific URL paths or host headers. This is beneficial for hosting multiple applications on the same set of servers.
- SSL termination and offloading: Improve the performance of backend servers by transferring the resource-intensive task of SSL decryption to the Application Gateway, relieving backend servers of the decryption workload.
- Session affinity: For applications that rely on session state, Azure Application Gateway supports session affinity, ensuring that subsequent requests from a client are directed to the same backend server for a consistent user experience.
- Web Application Firewall (WAF): Implement a robust security layer by integrating the Azure Web Application Firewall with the Application Gateway. This helps safeguard applications from threats such as SQL injection, cross-site scripting (XSS), and other OWASP Top Ten vulnerabilities. You can define your own custom WAF firewall rules as well.
- Auto-scaling: Application Gateway can automatically scale the number of instances to handle increased traffic and scale down during periods of lower demand, optimizing resource utilization.
- Rewriting HTTP headers: Modify HTTP headers for requests and responses; adjusting these headers is useful for adding security measures, altering caching behavior, or tailoring responses to meet client-specific requirements.
- Ingress Controller for AKS: The Application Gateway Ingress Controller (AGIC) enables the use of Application Gateway as the ingress for an Azure Kubernetes Service (AKS) cluster.
- WebSocket and HTTP/2 traffic: Application Gateway provides native support for the WebSocket and HTTP/2 protocols.
- Connection draining: This feature ensures the smooth and graceful removal of backend pool members during planned service updates or backend health issues. It promotes seamless operations and mitigates potential disruptions by allowing the system to handle ongoing connections gracefully, maintaining optimal performance and user experience during transitional periods.

Configuration and Optimization Strategies

- Deploy the instances in a zone-aware configuration, where available.
- Use Application Gateway with Web Application Firewall (WAF) within a virtual network to protect inbound HTTP/S traffic from the internet.
- Review the impact of the interval and threshold settings on health probes. A shorter probing interval puts a higher load on your service, and each Application Gateway instance sends its own health probes, so 100 instances probing every 30 seconds means 100 requests per 30 seconds.
- Use Application Gateway for TLS termination. This improves backend server utilization, because the servers don't have to perform TLS processing, and it simplifies certificate management, because the certificate only needs to be installed on the Application Gateway.
- When WAF is enabled, every request is buffered until it fully arrives and is then validated against the ruleset. For large file uploads or large requests, this can result in significant latency, so enable WAF only after proper testing and validation.
- Appropriate DNS and certificate management for backend pools is crucial for good performance.
- Application Gateway is not billed while in the stopped state, so turn it off for dev/test environments.
- Take advantage of autoscaling for performance benefits, and make sure instances scale in and out based on the workload to reduce cost.
- Use Azure Monitor Network Insights to get a comprehensive view of health and metrics, which is crucial when troubleshooting issues.

Azure Front Door Service: Global-Scale Entry Management

Azure Front Door is a comprehensive content delivery network (CDN) and global application accelerator service that provides a range of capabilities to enhance the performance, security, and availability of web applications. Azure Front Door supports four traffic routing methods (latency, priority, weighted, and session affinity) to determine how your HTTP/HTTPS traffic is distributed between different origins.

Features

- Global content delivery and acceleration: Azure Front Door leverages a global network of edge locations, employing caching mechanisms, compressing data, and utilizing smart routing algorithms to deliver content closer to end users, thereby reducing latency and enhancing overall responsiveness for an improved user experience.
- Web Application Firewall (WAF): Azure Front Door integrates with Azure Web Application Firewall, providing a robust security layer to safeguard applications from common web vulnerabilities, such as SQL injection and cross-site scripting (XSS).
- Geo-filtering: In the Azure Front Door WAF, you can define a policy using custom access rules for a specific path on your endpoint to allow or block access from specified countries or regions.
- Caching: In Azure Front Door, caching plays a pivotal role in optimizing content delivery and enhancing overall performance. By strategically storing frequently requested content closer to end users at the edge locations, Azure Front Door reduces latency, accelerates the delivery of web applications, and conserves resources across the entire content delivery network.
- Web application routing: Azure Front Door supports path-based routing, URL redirect/rewrite, and rule sets. These help intelligently direct user requests to the most suitable backend based on factors such as geographic location, health of backend servers, and application-defined routing rules.
- Custom domain and SSL support: Front Door supports custom domain configurations, allowing organizations to use their own domain names and SSL certificates for secure and branded application access.

Configuration and Optimization Strategies

- Use WAF policies to provide global protection across Azure regions for inbound HTTP/S connections to a landing zone.
- Create a rule to block access to the health endpoint from the internet.
- Ensure that the connection to the backend is re-encrypted, as Front Door does not support SSL passthrough.
- Consider using geo-filtering in Azure Front Door.
- Avoid combining Traffic Manager and Front Door, as they address different use cases.
- Configure logs and metrics in Azure Front Door and enable WAF logs for debugging issues.
- Leverage managed TLS certificates to streamline the costs and renewal process associated with certificates. Azure Front Door issues and rotates these managed certificates, ensuring a seamless and automated approach to certificate management, thereby enhancing security while minimizing operational overhead.
- Use the same domain name on Front Door and your origin to avoid issues related to request cookies or URL redirections.
- Disable health probes when there is only one origin in an origin group.
- It's recommended to monitor a webpage or location that you specifically designed for health monitoring.
- Regularly monitor and adjust the instance count and scaling settings to align with actual demand, preventing overprovisioning and optimizing costs.

Azure Traffic Manager: DNS-Based Traffic Distribution

Azure Traffic Manager is a global DNS-based traffic load balancer that enhances the availability and performance of applications by directing user traffic to the most optimal endpoint.

Features

- Global load balancing: Distribute user traffic across multiple global endpoints to enhance application responsiveness and fault tolerance.
- Fault tolerance and high availability: Ensure continuous availability of applications by automatically rerouting traffic to healthy endpoints in the event of failures.
- Routing: Several routing methods are supported globally. Performance-based routing optimizes application responsiveness by directing traffic to the endpoint with the lowest latency; geographic routing directs traffic based on the geographic location of end users; other methods include priority-based and weighted routing.
- Endpoint monitoring: Regularly check the health of endpoints using configurable health probes, ensuring traffic is directed only to operational and healthy endpoints.
- Service maintenance: You can perform planned maintenance on your applications without downtime. Traffic Manager can direct traffic to alternative endpoints while the maintenance is in progress.
- Subnet traffic routing: Define custom routing policies based on IP address ranges, providing flexibility in directing traffic according to specific network configurations.

Configuration and Optimization Strategies

- Enable automatic failover to healthy endpoints in case of endpoint failures, ensuring continuous availability and minimizing disruptions.
- Utilize the appropriate traffic routing method, such as Priority, Weighted, Performance, Geographic, or Multi-value, to tailor traffic distribution to specific application requirements.
- Implement a custom page to use as a health check for your Traffic Manager profile.
- If the Time to Live (TTL) interval of the DNS record is too long, consider adjusting the health probe timing or the DNS record TTL.
- Consider nested Traffic Manager profiles. Nested profiles allow you to override the default Traffic Manager behavior to support larger, more complex application deployments.
- Integrate with Azure Monitor for real-time monitoring and logging, gaining insights into the performance and health of Traffic Manager and its endpoints.

How To Choose

When selecting a load balancing option in Azure, it is crucial to first understand the specific requirements of your application, including whether it necessitates Layer 4 or Layer 7 load balancing, SSL termination, and web application firewall capabilities. For applications requiring global distribution, options like Azure Traffic Manager or Azure Front Door are worth considering to efficiently achieve global load balancing. Additionally, it's essential to evaluate the advanced features provided by each load balancing option, such as SSL termination, URL-based routing, and application acceleration. Scalability and performance considerations should also be taken into account, as different load balancing options may vary in terms of throughput, latency, and scaling capabilities. Cost is a key factor, and it's important to compare pricing models to align with budget constraints.
Lastly, assess how well the chosen load balancing option integrates with other Azure services and tools within your overall application architecture. This comprehensive approach ensures that the selected load balancing solution aligns with the unique needs and constraints of your application.

| Service | Global/Regional | Recommended traffic |
|---|---|---|
| Azure Front Door | Global | HTTP(S) |
| Azure Traffic Manager | Global | Non-HTTP(S) and HTTP(S) |
| Azure Application Gateway | Regional | HTTP(S) |
| Azure Load Balancer | Regional or Global | Non-HTTP(S) and HTTP(S) |

Here is the decision tree for load balancing from Azure. Source: Azure
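The table above can be condensed into a tiny decision helper. This is a simplified sketch of the logic implied by that table only; a real choice should also weigh WAF needs, SSL offload, advanced routing, and cost.

```python
def recommend_load_balancer(http_traffic: bool, global_scope: bool) -> str:
    """Simplified mapping from traffic type and scope to an Azure service,
    following the recommendation table above."""
    if global_scope:
        return "Azure Front Door" if http_traffic else "Azure Traffic Manager"
    return "Azure Application Gateway" if http_traffic else "Azure Load Balancer"

print(recommend_load_balancer(http_traffic=True, global_scope=True))    # Azure Front Door
print(recommend_load_balancer(http_traffic=False, global_scope=False))  # Azure Load Balancer
```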
AzureSignTool is a code-signing utility that organizations use to secure their software. This signing tool is compatible with all major executable files and works impeccably with all OV and EV code signing certificates. But it's mostly used with Azure DevOps due to the benefit of Azure Key Vault, and that is the setup this guide depicts. Here, you will walk through the complete procedure to sign an executable using AzureSignTool in Azure DevOps.

Prerequisites To Sign With Azure DevOps

To use AzureSignTool to sign with Azure DevOps, you will need the following components and mechanisms to be configured:

- A code signing certificate (preferably an EV Cloud Code Signing Certificate or an Azure Key Vault Code Signing Certificate)
- An Azure platform subscription
- Azure Key Vault
- An app registration in Azure Active Directory
- Azure DevOps
- AzureSignTool

Once you fulfill all the requirements, you can move forward with the signing procedure.

Complete Process To Use AzureSignTool With Azure DevOps

To ease the process, we have divided it into six parts, each with sub-steps for quick completion. So, let's start with the procedure.

Part 1: Configuring the Azure Platform

Step 1: Sign in to your Azure account and create a resource group to better manage all associated resources.

Step 2: In your resource group, add the Azure Key Vault and write down its URL, which will be used later in this process. Click "+ Add," then search for "Key Vault" and click "Create."

Step 3: Enter the Key Vault details and click "Review + Create."

Step 4: Note the vault URL for further processing.

Part 2: Importing the Certificate

The code signing certificate must be available on your machine, as you'll import it into the Azure Key Vault.

Step 1: Under the Key Vault settings, choose "Certificates" → "+ Generate/Import."

Step 2: Enter the details of your certificate. As we are importing it into the Azure Key Vault, the method of certificate creation should be "Import."

Step 3: After the import, your certificate details will be listed in the Key Vault.

Part 3: Application Principal Configuration

The application principal configuration establishes a secure way of accessing the certificate and eliminates the direct use of hard-coded credentials.

Step 1: From your Azure portal, navigate to Azure Active Directory (AD).

Step 2: Go to "App registrations" → "+ New registration."

Step 3: Register the application by entering a name and selecting an option in the supported account types section. For this tutorial, the "Accounts in this organizational directory only" option is selected.

Step 4: Click "Register." After its creation, note the application ID. This ID will be used as the "Client ID."

Part 4: Pairing the Client ID With a Secret

Step 1: Navigate to the app registration page and choose the "Certificates & secrets" option in the left panel. Then click "+ New client secret."

Step 2: Generate your secret and give it a descriptive name. In addition, copy and note the secret value.

Part 5: Configuring Key Vault Access for the Principal

Step 1: Go to the Key Vault settings → "Access Policies" → "Add Access Policy."

Step 2: Define a new access policy for the registered application with the following permissions:

| Parameter | Permissions |
|---|---|
| Key | Verify, Sign, Get, List |
| Secret | Get, List |
| Certificate | Get, List |

Step 3: Review the configured access policy.

Step 4: Save the access policy settings.
So far, you have granted the application principal (Client ID + secret) access to the Key Vault.

Part 6: Configuring Azure DevOps and Signing the Executable

To start signing executables with AzureSignTool in Azure DevOps, install the AzureSignTool .NET global tool as part of your Azure DevOps build (AzureSignTool is distributed as a .NET Core global tool).

You'll need the following information to set up the signing process:

- The Key Vault URL
- The application ID (Client ID)
- The secret associated with the app registration
- The name of the certificate imported into (or available in) the Azure Key Vault
- The list of executable files that you want to sign

Then follow the steps below for the signing process.

Step 1: Open Azure DevOps and access the pipeline.

Step 2: Go to the "Library" menu.

Step 3: Click "+ Variable group" and add variables for the client secret, code signing certificate name, client ID, and Key Vault URL.

Step 4: While hovering over a variable name, you will see a lock icon. Click that lock icon to mark the variable as sensitive.

Step 5: Save all the defined variables.

Step 6: Add the signing script to the pipeline, referencing the variable names instead of the actual certificate name, client ID, secret, and other parameters. Using variables provides an added security advantage: the logs containing signing data will only disclose the variable names instead of the original client ID, secret, and certificate name, so integrity and confidentiality are retained.

As a result, whenever your build runs, it will run the script, access the certificate and key, and use AzureSignTool with Azure DevOps to sign the executables.

Conclusion

To sign executable files with AzureSignTool while using Azure DevOps, you need a code signing certificate that is compatible with the platform; an EV code signing certificate is recommended. In addition, a Key Vault, a platform subscription, and Active Directory configuration are also needed. Once you fulfill all the requirements, you can proceed with the signing configuration.

The process begins by setting up the Azure Key Vault and then importing the code signing certificate into it. Following that, an application is registered and an associated secret is generated. Additionally, the application and Key Vault are securely connected. Lastly, variables are defined for every component, and the script that signs the executables is added to the Azure DevOps pipeline.
As anyone who has hired new developers onto an existing software team can tell you, onboarding new developers is one of the most expensive things you can do. One of the most difficult things about onboarding junior developers is that it takes your senior developers away from their work. Even the best hires might get Imposter Syndrome, since they feel like they need to know more than they do and need to depend on their peers. You might have the best documentation, but it can be difficult to figure out where to start with onboarding. Onboarding senior developers takes time and resources as well.

With the rise of LLMs, it seems like putting one on your code, documentation, chats, and ticketing systems would make sense. The ability to converse with an LLM trained on the right dataset would be like adding a team member who can make sure no one gets bogged down with sharing something that's already documented. I thought I'd check out a new service called Unblocked that does just this. In this article, we will take a spin through a code base I was completely unfamiliar with and see what it would be like to get going on a new team with this tool.

Data Sources

If you've been following conversations around LLM development, then you know that they are only as good as the data they have access to. Fortunately, Unblocked allows you to connect a bunch of data sources to train your LLM. Additionally, because this LLM will be working on your specific code base and documentation, it wouldn't even be possible to train it on another organization's data. Unblocked isn't trying to build a generic code advice bot. It's personalized to your environment, so you don't need to worry about data leaking to someone else.

Setting up is pretty straightforward, thanks to lots of integrations with developer tools. After signing up for an account, you'll be prompted to connect to the sources Unblocked supports. You'll need to wait a few minutes, or longer depending on the size of your team, while Unblocked ingests your content and trains the model.

Getting Started

I tried exploring some of the features of Unblocked. While there's a web dashboard that you'll interact with most of the time, I recommend you install the Unblocked Mac app as well. The app runs in your menu bar and allows you to ask Unblocked a question from anywhere. There are a bunch of other features for teammates interacting with Unblocked. I may write about those later, but for now, I just like that it gives me a universal shortcut (Command+Shift+U) to access Unblocked at any time. Another feature of the macOS menu bar app is that it provides a quick way to install the IDE plugins based on what I have installed on my machine. Of course, you don't have to install them this way (Unblocked does this install for you), but it takes some of the thinking out of it.

Asking Questions

Since I am working on a codebase that is already in Unblocked, I don't need to wait for anything after getting my account set up on the platform. If you set up your code and documentation, then you won't need your new developers to wait either. Let's take this for a spin and look at what questions a new developer might ask the bot. I started by asking a question about setting up the front end. This answer looks pretty good! It's enough to get me going in a local environment without contacting anyone else on my team. Unblocked kept everyone else "unblocked" on their work and pointed me in the right direction all on its own.
I decided to ask about how to get a development environment set up locally. Let's see what Unblocked says if I ask about that. This answer isn't what I was hoping for, but I can click on the README link and find that this is not really Unblocked's fault. My team just hasn't updated the README for the backend app, and Unblocked found the incorrect boilerplate setup instructions. Now that I know where to go to get the code, I'll just update it after I have finished setting up the backend on my own. In the meantime, though, I will let Unblocked know that it didn't give me the answer I hoped for. Since it isn't really the bot's fault that it's wrong, I made sure to explain that in my feedback.

I had a good start, but I wanted some more answers to my architectural questions. Let's try something a little more complicated than reading the setup instructions from a README. This is a pretty good high-level overview, especially considering that I didn't have to do anything other than type the questions in. Unblocked generated these answers with links to the relevant resources for me to investigate more as needed.

Browse the Code

I cloned the repos for the front end and back end of my app to my machine and opened them in VS Code. Let's take a look at how Unblocked works with the repos there. As soon as I open the Unblocked plugin while viewing the backend repository, I'm presented with recommended insights asked by other members of my team. There are also some references to pull requests, Slack conversations, and Jira tasks that the bot thinks are relevant before I open a single file. This is useful. As I open various files, the suggestions change with the context, too.

Browse Components

The VS Code plugin also called out some topics that it discovered about the app I'm trying out. I clicked on the Backend topic, and it took me to a dedicated page. All of this is automatically generated, as Unblocked determines the experts for each particular part of the codebase. However, experts can also update their expertise when they configure their profiles in our organization. Now, in addition to having many questions I can look at about the backend application, I also know which of my colleagues to go to for questions.

If I go to the Components page on the web dashboard, I can see a list of everything Unblocked thinks is important about this app. It also gives me a quick view of who I can talk to about these topics. Clicking on any one of them provides me with a little overview, and the experts on the system can manage these as needed. Again, all of this was automatically generated.

Conclusion

This was a great start with Unblocked. I'm looking forward to trying this out next on some of the things that I've been actively working on. Since the platform is not going to leak any of my secrets to other teams, I'm not very concerned about putting it on even the most secret of my projects, and I expect to have more to say about other use cases later. Unblocked is in public beta and free, and it's worth checking out!
In today's highly competitive landscape, businesses must be able to gather, process, and react to data in real time in order to survive and thrive. Whether it's detecting fraud, personalizing user experiences, or monitoring systems, near-instant data is now a need, not a nice-to-have. However, building and running mission-critical, real-time data pipelines is challenging. The infrastructure must be fault-tolerant, infinitely scalable, and integrated with various data sources and applications. This is where leveraging Apache Kafka, Python, and cloud platforms comes in handy.

In this comprehensive guide, we will cover:

- An overview of Apache Kafka architecture
- Running Kafka clusters on the cloud
- Building real-time data pipelines with Python
- Scaling processing using PySpark
- Real-world examples like user activity tracking, an IoT data pipeline, and support chat analysis

We will include plenty of code snippets, configuration examples, and links to documentation along the way for you to get hands-on experience with these incredibly useful technologies. Let's get started!

Apache Kafka Architecture 101

Apache Kafka is a distributed, partitioned, replicated commit log for storing streams of data reliably and at scale. At its core, Kafka provides the following capabilities:

- Publish-subscribe messaging: Kafka lets you broadcast streams of data like page views, transactions, user events, etc., from producers and consume them in real time using consumers.
- Message storage: Kafka durably persists messages on disk as they arrive and retains them for specified periods. Messages are stored and indexed by an offset indicating their position in the log.
- Fault tolerance: Data is replicated across a configurable number of servers. If a server goes down, another can ensure continuous operations.
- Horizontal scalability: Kafka clusters can be elastically scaled by simply adding more servers. This allows for virtually unlimited storage and processing capacity.

Kafka architecture consists of the following main components:

Topics

Messages are published to categories called topics. Each topic acts as a feed or queue of messages. A common scenario is a topic per message type or data stream. Each message in a Kafka topic has a unique identifier called an offset, which represents its position in the topic. A topic can be divided into multiple partitions, which are segments of the topic that can be stored on different brokers. Partitioning allows Kafka to scale and parallelize data processing by distributing the load among multiple consumers.

Producers

These are applications that publish messages to Kafka topics. They connect to the Kafka cluster, serialize data (say, to JSON or Avro), assign a key, and send it to the appropriate topic. For example, a web app can produce clickstream events, or a mobile app can produce usage stats.

Consumers

Consumers read messages from Kafka topics and process them. Processing may involve parsing data, validation, aggregation, filtering, storing to databases, etc. Consumers connect to the Kafka cluster and subscribe to one or more topics to get feeds of messages, which they then handle as per the use case requirements.

Brokers

These are the Kafka servers that receive messages from producers, assign offsets, commit messages to storage, and serve data to consumers. Kafka clusters consist of multiple brokers for scalability and fault tolerance.

ZooKeeper

ZooKeeper handles coordination and consensus between brokers, such as controller election and topic configuration. It maintains the cluster state and configuration info required for Kafka operations.
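To make topics and partitions concrete, here is a short sketch that creates a partitioned topic with the Confluent Kafka Python client's admin API. The bootstrap address and topic settings are placeholders added for illustration, not part of the original walkthrough.

```python
from confluent_kafka.admin import AdminClient, NewTopic

# Connect to the cluster (placeholder bootstrap address)
admin = AdminClient({'bootstrap.servers': 'my_kafka_cluster-xyz.cloud.provider.com:9092'})

# A 'clickstream' topic with 6 partitions and a replication factor of 3,
# so consumption can later be parallelized across up to 6 consumers per group
topic = NewTopic('clickstream', num_partitions=6, replication_factor=3)

# create_topics() is asynchronous and returns one future per topic
for name, future in admin.create_topics([topic]).items():
    try:
        future.result()  # block until the topic is created
        print(f"Created topic {name}")
    except Exception as exc:
        print(f"Failed to create topic {name}: {exc}")
```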
This covers the Kafka basics. For an in-depth understanding, refer to the excellent Kafka documentation. Now, let's look at simplifying management by running Kafka in the cloud.

Kafka in the Cloud

While Kafka is highly scalable and reliable, operating it involves significant effort related to deployment, infrastructure management, monitoring, security, failure handling, upgrades, etc. Thankfully, Kafka is now available as a fully managed service from all major cloud providers:

| Service | Description | Pricing |
|---|---|---|
| AWS MSK | Fully managed, highly available Apache Kafka clusters on AWS. Handles infrastructure, scaling, security, failure handling, etc. | Based on number of brokers |
| Google Cloud Pub/Sub | Serverless, real-time messaging service based on Kafka. Auto-scaling, at-least-once delivery guarantees. | Based on usage metrics |
| Confluent Cloud | Fully managed event streaming platform powered by Apache Kafka. Free tier available. | Tiered pricing based on features |
| Azure Event Hubs | High-throughput event ingestion service for Apache Kafka. Integrations with Azure data services. | Based on throughput units |

The managed services abstract away the complexities of Kafka operations and let you focus on your data pipelines. Next, we will build a real-time pipeline with Python, Kafka, and the cloud. You can also refer to the following guide as another example.

Building Real-Time Data Pipelines

A basic real-time pipeline with Kafka has two main components: a producer that publishes messages to Kafka and a consumer that subscribes to topics and processes the messages. The architecture follows this simple flow: events are published to a Kafka topic by the producer and read from it by the consumer. We will use the Confluent Kafka Python client library for simplicity.

1. Python Producer

The producer application gathers data from sources and publishes it to Kafka topics. As an example, let's say we have a Python service collecting user clickstream events from a web application. When a user performs an action like a page view or a product rating, we can capture these events and send them to Kafka. We can abstract away the implementation details of how the web app collects the data.

```python
from confluent_kafka import Producer
import json

# User event data
event = {
    "timestamp": "2022-01-01T12:22:25",
    "userid": "user123",
    "page": "/product123",
    "action": "view"
}

# Convert to JSON
event_json = json.dumps(event)

# Kafka producer configuration
conf = {
    'bootstrap.servers': 'my_kafka_cluster-xyz.cloud.provider.com:9092',
    'client.id': 'clickstream-producer'
}

# Create producer instance
producer = Producer(conf)

# Publish event
producer.produce(topic='clickstream', value=event_json)

# Flush outstanding messages before exiting
producer.flush()
```

This publishes the event to the clickstream topic on our cloud-hosted Kafka cluster. The confluent_kafka Python client uses an internal buffer to batch messages before sending them to Kafka. This improves efficiency compared to sending each message individually. By default, messages accumulate in the buffer until either:

- The buffer size limit is reached (default 32 MB), or
- The flush() method is called.

When flush() is called, any messages in the buffer are immediately sent to the Kafka broker. If we did not call flush() and instead relied on the buffer size limit, there would be a risk of losing events if a failure occurred before the buffered messages were sent. Calling flush() gives us greater control to minimize potential message loss. However, calling flush() after every produce() introduces additional overhead.
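A common middle ground, sketched below, is to register a delivery callback and poll the producer periodically, flushing only at shutdown; this keeps the batching benefits while still surfacing delivery failures. This pattern is an illustration added here, reusing the same Confluent Kafka Python client and the clickstream topic from above.

```python
from confluent_kafka import Producer
import json

def delivery_report(err, msg):
    """Called once per message to report delivery success or failure."""
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()} [{msg.partition()}] at offset {msg.offset()}")

producer = Producer({'bootstrap.servers': 'my_kafka_cluster-xyz.cloud.provider.com:9092'})
event_json = json.dumps({"userid": "user123", "page": "/product123", "action": "view"})

# Attach the callback to each message instead of flushing per message
producer.produce('clickstream', value=event_json, callback=delivery_report)
producer.poll(0)   # serve queued delivery callbacks without blocking

# ... keep producing events as they occur ...

producer.flush()   # drain the buffer once, at shutdown
```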
Finding the right buffering configuration depends on our specific reliability needs and throughput requirements. We can keep adding events as they occur to build a live stream, which gives downstream data consumers a continual feed of events.

2. Python Consumer

Next, we have a consumer application to ingest events from Kafka and process them. For example, we may want to parse events, filter for a certain subtype, and validate the schema.

```python
from confluent_kafka import Consumer
import json

# Kafka consumer configuration
conf = {
    'bootstrap.servers': 'my_kafka_cluster-xyz.cloud.provider.com:9092',
    'group.id': 'clickstream-processor',
    'auto.offset.reset': 'earliest'
}

# Create consumer instance
consumer = Consumer(conf)

# Subscribe to 'clickstream' topic
consumer.subscribe(['clickstream'])

# Poll Kafka for messages indefinitely
while True:
    msg = consumer.poll(1.0)
    if msg is None:
        continue

    # Parse JSON from message value
    event = json.loads(msg.value())

    # Process event based on business logic
    if event['action'] == 'view':
        print('User viewed product page')
    elif event['action'] == 'rating':
        # Validate rating, insert into DB, etc.
        pass

    print(event)  # Print event

# Close consumer
consumer.close()
```

This polls the clickstream topic for new messages, consumes them, and takes action based on the event type: prints, updates a database, etc. For a simple pipeline, this works well. But what if we get 100x more events per second? The consumer will not be able to keep up. This is where a tool like PySpark helps scale out processing.

3. Scaling With PySpark

PySpark provides a Python API for Apache Spark, a distributed computing framework optimized for large-scale data processing. With PySpark, we can leverage Spark's in-memory computing and parallel execution to consume Kafka streams faster. First, we load Kafka data into a DataFrame, which can be manipulated using Spark SQL or Python.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col

# Initialize Spark session
spark = SparkSession.builder \
    .appName('clickstream-consumer') \
    .getOrCreate()

# Read stream from the Kafka 'clickstream' topic
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092") \
    .option("subscribe", "clickstream") \
    .load()

# Parse JSON from the value column
# ('schema' is the StructType describing the event JSON)
df = df.selectExpr("CAST(value AS STRING)")
df = df.select(from_json(col("value"), schema).alias("data"))
```

Next, we can express whatever processing logic we need using DataFrame transformations:

```python
from pyspark.sql.functions import *

# Filter for 'page view' events
views = df.filter(col("data.action") == "view")

# Count views per page URL
counts = views.groupBy(col("data.page")) \
              .count() \
              .orderBy("count")

# Print the stream to the console
query = counts.writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

query.awaitTermination()
```

This applies operations like filter, aggregate, and sort on the stream in real time, leveraging Spark's distributed runtime. We can also parallelize consumption using multiple consumer groups and write the output sink to databases, cloud storage, etc. This allows us to build scalable stream processing on data from Kafka. Now that we've covered the end-to-end pipeline, let's look at some real-world examples of applying it.

Real-World Use Cases

Let's explore some practical use cases where these technologies can help process huge amounts of real-time data at scale.

User Activity Tracking

Many modern web and mobile applications track user actions like page views, button clicks, transactions, etc., to gather usage analytics.
Problem

- Data volumes can scale massively with millions of active users
- Insights are needed in real time to detect issues and personalize content
- Aggregate data must be stored for historical reporting

Solution

- Ingest clickstream events into Kafka topics using Python or any language.
- Process using PySpark for cleansing, aggregations, and analytics.
- Save output to databases like Cassandra for dashboards.
- Detect anomalies using Spark ML for real-time alerting.

IoT Data Pipeline

IoT sensors generate massive volumes of real-time telemetry like temperature, pressure, location, etc.

Problem

- Millions of sensor events per second
- Data requires cleaning, transforming, and enriching
- Real-time monitoring and historical storage are needed

Solution

- Collect sensor data in Kafka topics using language SDKs.
- Use PySpark for data wrangling and joining with external data.
- Feed the stream into ML models for real-time predictions.
- Store aggregate data in a time series database for visualization.

Customer Support Chat Analysis

Chat platforms like Zendesk capture huge amounts of customer support conversations.

Problem

- Millions of chat messages per month
- Need to understand customer pain points and agent performance
- Must detect negative sentiment and urgent issues

Solution

- Ingest chat transcripts into Kafka topics using a connector.
- Aggregate and process using PySpark SQL and DataFrames.
- Feed data into NLP models to classify sentiment and intent.
- Store insights in a database for historical reporting.
- Present real-time dashboards for contact center ops.

This demonstrates applying the technologies to real business problems involving massive, fast-moving data.

Learn More

To summarize, we looked at how Python, Kafka, and the cloud provide a great combination for building robust, scalable real-time data pipelines.
In the rapidly changing world of technology, DevOps is the vehicle that propels software development forward, making it agile, cost-effective, fast, and productive. This article focuses on key DevOps tools and practices, delving into the transformative power of technologies such as Docker and Kubernetes. By investigating them, I hope to shed light on what it takes to streamline processes from conception to deployment and ensure high product quality in a competitive technological race.

Understanding DevOps

DevOps is a software development methodology that bridges the development (Dev) and operations (Ops) teams in order to increase productivity and shorten development cycles. It is founded on principles such as continuous integration, process automation, and improved team collaboration. Adopting DevOps breaks down silos and accelerates workflows, allowing for faster iterations and faster deployment of new features and fixes. This reduces time to market, increases efficiency in software development and deployment, and improves final product quality.

The Role of Automation in DevOps

In DevOps, automation is the foundation of software development and delivery process optimization. It involves using tools and technologies to automatically handle a wide range of routine tasks, such as code integration, testing, deployment, and infrastructure management. Through automation, development teams are able to reduce human error, standardize processes, enable faster feedback and correction, improve scalability and efficiency, and bolster testing and quality assurance, eventually enhancing consistency and reliability.

Several companies have successfully leveraged automation:

- Walmart: The retail corporation has embraced automation in order to gain ground on its retail rival, Amazon. WalmartLabs, the company's innovation arm, implemented the OneOps cloud-based technology, which automates and accelerates application deployment. As a result, the company was able to quickly adapt to changing market demands and continuously optimize its operations and customer service.
- Etsy: The e-commerce platform fully automated its testing and deployment processes, resulting in fewer disruptions and an enhanced user experience. In its pipeline, Etsy developers first run about 4,500 unit tests, which takes less than a minute, before checking in code; the build then runs roughly 7,000 automated tests. The whole process takes no more than 11 minutes to complete.

These cases demonstrate how automation in DevOps not only accelerates development but also ensures stable and efficient product delivery.

Leveraging Docker for Containerization

Containerization, or packing an application's code with all of the files and libraries needed to run quickly and easily on any infrastructure, is one of today's most important software development practices. The leading platform offering a comprehensive set of tools and services for containerization is Docker. It has several advantages for containerization in the DevOps pipeline:

- Isolation: Docker containers encapsulate an application and its dependencies, ensuring consistent operation across different computing environments.
- Efficiency: Containers are lightweight, reducing overhead and improving resource utilization compared to traditional virtual machines.
- Portability: Docker containers allow applications to be easily moved between systems and cloud environments.

Many prominent corporations leverage Docker tools and services to optimize their development cycles.
Here are some examples:

- PayPal: The renowned online payment system embraced Docker for app development, migrating 700+ applications to Docker Enterprise and running over 200,000 containers. As a result, the company's productivity in developing, testing, and deploying applications increased by 50%.
- Visa: The global digital payment technology company used Docker to accelerate application development and testing by standardizing environments and streamlining operations. Six months after its implementation, the Docker-based platform was helping process 100,000 transactions per day across multiple global regions.

Orchestrating Containers With Kubernetes

Managing complex containerized applications is a difficult task that necessitates the use of a specialized tool. Kubernetes (aka K8s), an open-source container orchestration system, is one of the most popular. It organizes the containers that comprise an application into logical units to facilitate management and discovery. It then automates the distribution and scheduling of application containers across a cluster of machines, ensuring resource efficiency and high availability. Kubernetes enables easy and dynamic adjustment of application workloads, accommodating changes in demand without requiring manual intervention. This orchestration system streamlines complex tasks, allowing for more consistent and manageable deployments while optimizing resource utilization.

Setting up a Kubernetes cluster entails installing Kubernetes on a set of machines, configuring networking for pods (containers), and deploying applications using Kubernetes manifests or Helm charts. This procedure creates a stable environment in which applications can be easily scaled, updated, and maintained.

Automating Development Workflows

Continuous Integration (CI) and Continuous Deployment (CD) are critical components of DevOps software development. CI is the practice of automating the integration of code changes from multiple contributors into a single software project. It is typically implemented so that it triggers an automated build with testing, with the goals of quickly detecting and fixing bugs, improving software quality, and reducing release time. After the build stage, CD extends CI by automatically deploying all code changes to a testing and/or production environment. This means that, in addition to automated testing, the release process is also automated, allowing for a more efficient and streamlined path to delivering new features and updates to users.

Docker and Kubernetes are frequently used to improve efficiency and consistency in CI/CD workflows. The code is first built into a Docker container, which is then pushed to a registry in the CI stage. During the CD stage, Kubernetes retrieves the Docker container from the registry and deploys it to the appropriate environment, whether testing, staging, or production. This procedure automates deployment and ensures that the application runs consistently across all environments.

Many businesses use DevOps tools to automate development cycles. Among them are:

- Siemens: The German multinational technology conglomerate uses GitLab's integration with Kubernetes to set up new machines in minutes. This improves software development and deployment efficiency, resulting in faster time-to-market for its products and cost savings for the company.
Shopify: The Canadian e-commerce giant chose Buildkite to power its continuous integration (CI) systems due to its flexibility and its ability to run on the company's own infrastructure. Buildkite's lightweight agents can run in a variety of environments and are compatible with all major operating systems. Ensuring Security in DevOps Automation A lack of security in DevOps can lead to serious consequences such as data breaches, where vulnerabilities in software expose sensitive information to attackers. This can result not only in operational disruptions, such as system outages that significantly increase post-deployment costs, but also in legal repercussions linked to compliance violations. Integrating security measures into the development process is thus crucial to avoid these risks. Best practices for ensuring security include the following: for Docker containers, use official images, scan for vulnerabilities, apply least-privilege principles, and update containers regularly; for Kubernetes clusters, configure role-based access control, enable network policies, and use namespaces to isolate resources. Here are some examples of companies handling security issues: Capital One: The American bank holding company uses DevSecOps to automate security in its CI/CD pipelines, ensuring that security checks are integrated into every stage of software development and deployment. Adobe: The American multinational computer software company has integrated security into its DevOps culture. Adobe ensures that its software products meet stringent security standards by using automated tools for security testing and compliance monitoring. Overcoming Challenges and Pitfalls Implementing DevOps and automation frequently encounters common stumbling blocks, such as resistance to change, a lack of expertise, and integration issues with existing systems. Overcoming these requires clear communication, training, and demonstrating the value of DevOps to all stakeholders. Here are some examples of how businesses overcame obstacles on their way to adopting the DevOps methodology: HP: As a large, established corporation, HP encountered a number of challenges in transitioning to DevOps, including organizational resistance to a new development culture and tools. It relied on a "trust-based culture and a strong set of tools and processes" while taking a gradual transition approach: it started with small projects and scaled up, eventually demonstrating enough success to overcome skepticism. Target: While integrating new DevOps practices, the US's seventh-largest retailer had to deal with organizational silos and technology debt accumulated over 50 years in business. It introduced a set of integration APIs that broke down departmental silos while fostering a culture of learning and experimentation. It gradually improved its processes over time, resulting in a successful DevOps implementation. The Future of DevOps and Automation AI and ML are taking the world by storm, and these technologies are rapidly reshaping DevOps practices. In particular, they enable more efficient decision-making and predictive analytics, significantly optimizing the development pipeline. They also automate tasks such as code reviews, testing, and anomaly detection, which increases the speed and reliability of continuous integration and deployment processes.
To prepare for the next evolution in DevOps, it's crucial to embrace trending technologies such as AI and machine learning and integrate them into your processes for enhanced automation and efficiency. This involves investing in training and upskilling teams so they can adapt to these new tools and methodologies. Adopting flexible architectures like microservices and leveraging data analytics for predictive insights will also be key. Conclusion In this article, we have delved into the evolution of approaches to software development, with the DevOps methodology taking center stage. DevOps was created to streamline and optimize development cycles through automation, containerization, and orchestration. To reach these objectives, DevOps relies on powerful technologies like Docker and Kubernetes, which not only reshape traditional workflows but also help ensure security and compliance. As we look toward the future, the integration of AI and ML within this realm promises further advancements, ensuring that DevOps continues to evolve, adapting to the ever-changing landscape of software development and deployment. Additional Resources Read on to learn more about this topic: The official Docker documentation; The official Kubernetes documentation; "DevOps with Kubernetes"; "DevOps: Puppet, Docker, and Kubernetes"; "Introduction to DevOps with Kubernetes"; "Docker in Action".
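As a closing illustration of the Kubernetes deployment step described in the CI/CD section above, here is a hedged, command-line-only sketch; the deployment name, image reference, and ports are placeholders, and a real pipeline would typically apply versioned manifests or Helm charts rather than ad-hoc commands. Shell
# Hypothetical sketch of the CD stage: deploy a pushed image to a cluster and roll out a newer version.
kubectl create deployment myapp --image=registry.example.com/team/myapp:1.0    # create a Deployment from the pushed image
kubectl expose deployment myapp --port=80 --target-port=8080                   # expose it inside the cluster as a Service
kubectl scale deployment myapp --replicas=3                                    # scale out for availability
kubectl set image deployment/myapp myapp=registry.example.com/team/myapp:1.1   # roll out a newer image version
kubectl rollout status deployment/myapp                                        # wait for the rollout to complete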
In early 2021, I started to work on the Apache APISIX project. I have to admit that I had never heard of it before. In this post, I'd like to introduce some Apache projects that are less well-known than HTTPD or Kafka. Apache APISIX APISIX is an API Gateway. It builds upon OpenResty, a Lua layer built on top of the famous Nginx reverse proxy. APISIX adds abstractions to the mix, e.g., Route, Service, and Upstream, and offers a plugin-based architecture. Lots of plugins are provided out of the box: Transformation: response-rewrite, proxy-rewrite, gRPC, body-transformer, etc. Authentication: JWT, OPA, Keycloak, OpenID Connect, etc. Observability: metrics, logging, and traces Traffic: rate limiting, request validation, canary release, etc. Serverless: Azure Functions, AWS Lambda, OpenWhisk, etc. Messaging: Kafka, Dubbo, and MQTT Pre- and post-processing If no plugin fits your requirements, writing your own is possible (a minimal example of declaring a Route through the Admin API is sketched at the end of this post). You can leverage APISIX on Kubernetes as an Ingress Controller; APISIX provides a Helm chart for this. Apache ShardingSphere ShardingSphere claims to offer an ecosystem able to transform any database into a distributed database system. It acts as a proxy between your code and your database(s). It comes in two flavors: ShardingSphere-JDBC: a JDBC driver that acts as a proxy to your database(s); it's only available for JVM-based applications. ShardingSphere-Proxy: a technology-independent deployable component. ShardingSphere offers several core features: Data sharding is the core feature, as the project's name implies. Most use cases focus on scaling, but there are others, e.g., data residency requirements. XA transactions for distributed transactions Read/write splitting Data encryption etc. Apache SeaTunnel Apache SeaTunnel is a data integration platform that offers the three pillars of data pipelines: sources, transforms, and sinks. It offers an abstract API over three possible engines: the Zeta engine from SeaTunnel, or a wrapper around Apache Spark or Apache Flink. Be careful, as each engine comes with its own set of features. The power of SeaTunnel comes from its rich connector ecosystem. It provides traditional SQL connectors, e.g., Oracle, PostgreSQL, and MySQL, and NoSQL ones, e.g., MongoDB, Cassandra, and Elasticsearch. However, it also comes bundled with some original ones, including Jira, Google Sheets, and Notion. I have a particular fondness for the CDC connector sources over MongoDB, MySQL, and Microsoft SQL Server. SeaTunnel comes with a web UI, which provides visual management of jobs as well as scheduling, running, and monitoring capabilities. Apache SkyWalking Apache SkyWalking is an APM tool focusing on microservices, cloud-native apps, and Kubernetes architectures. It builds its architecture on four kinds of components: Probes collect telemetry data (metrics, logs, traces, and events) and support multiple output formats, including OpenTelemetry. The platform aggregates and processes the data. The storage offers an interface over a supported backend; supported backends include Elasticsearch, H2, MySQL, TiDB, and BanyanDB, a custom storage engine developed for SkyWalking. Finally, a web UI allows visualization of SkyWalking's data. SkyWalking supports a couple of formats, including OpenTelemetry; given the industry's current focus on OpenTelemetry, I recommend seriously considering this option. Apache Doris Apache Doris is a real-time data warehouse.
Doris promotes four primary scenarios: Reporting analysis Ad-hoc query Unified data warehouse construction Data lake query Doris is mostly MySQL-compatible, so you can use a regular MySQL client. Discussion The Apache Foundation hosts the projects above, but they have another thing in common: they all started in China. Have a look at the Apache project list. You'll probably be amazed at the sheer number; it's close to 300! In recent years, the number of projects entering the Apache Foundation has increased drastically. Look again at the list; I'm sure you only know a few of them, and lots come from China. The trend is only growing; it's a great move to integrate China into the open-source world! Just as I was finishing this post, my friend Stefano Fago posted about another relevant project, Apache Paimon, a streaming data lake platform.
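As promised in the APISIX section, here is a minimal, hypothetical sketch of declaring a Route through the APISIX Admin API; the admin address, API key, and upstream node are placeholders, and the exact port and key depend on your deployment (recent releases expose the Admin API on port 9180 by default). Shell
# Hypothetical sketch: create a Route that forwards /hello to a single upstream node.
# Replace the address and ${ADMIN_KEY} with the values from your own APISIX deployment.
curl -X PUT "http://127.0.0.1:9180/apisix/admin/routes/1" \
  -H "X-API-KEY: ${ADMIN_KEY}" \
  -d '{
        "uri": "/hello",
        "upstream": {
          "type": "roundrobin",
          "nodes": { "example.com:80": 1 }
        }
      }'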
The most popular trend in current IT architecture is moving from serverful to serverless design. Still, there are cases where we might need to design a service in a serverful manner, or move back to serverful to manage operational costs. In this article, we will show how to run a Kumologica flow as a Docker container. Usually, applications built on Kumologica focus on serverless computing such as AWS Lambda, Azure Functions, or Google Cloud Functions, but here we will build a service very similar to a Node.js Express app running inside a container. The Plan We will build a simple hello-world API service using low-code integration tooling and wrap it as a Docker image. We will then run a Docker container from that image on our local machine and test the API using an external client. Prerequisites To start the development, we need to have the following utilities and access ready: NodeJS installed Kumologica Designer Docker installed Implementation Building the Service First, let's start the development of the Hello World service by opening the designer. To open the designer, use the following command: kl open. Once the designer is open, drag and drop an EventListener node onto the canvas. Open its configuration and provide the details below. Plain Text Provider : NodeJS Verb : GET Path : /hello Display Name : [GET] /hello Now drag and drop a Logger node from the palette to the canvas and wire it after the EventListener node. Plain Text Display name : Log_Entry level : INFO Message : Inside the service Log Format : String Drag and drop the EventListenerEnd node onto the canvas, wire it to the Logger node, and provide the following configuration. Plain Text Display Name : Success Payload : {"status" : "HelloWorld"} ContentType : application/json The flow is now complete. Let's dockerize it. Dockerizing the Flow To dockerize the flow, open the project folder and place the following Dockerfile in the root project folder (at the same level as package.json). Plain Text FROM node:16-alpine WORKDIR /app COPY package*.json ./ RUN npm install ENV PATH /app/node_modules/.bin:$PATH COPY . . EXPOSE 1880 CMD ["node","index.js"] Note: The above Dockerfile is very basic and can be modified according to your needs. Now we need to add another file that lets the Kumologica flow run as a Node.js Express app. Create an index.js file with the following JavaScript content, replacing "your-flow.json" with the name of the flow JSON file in your project folder. JavaScript const { NodeJsFlowBuilder } = require('@kumologica/runtime'); new NodeJsFlowBuilder('your-flow.json').listen(); Now let's test the flow locally by invoking the endpoint from Postman or any REST client of your choice: curl http://localhost:1880/hello You will get the following response: JSON {"status" : "HelloWorld"} As we are done with our local testing, we will now build an image based on our Dockerfile. To build the image, go to the root of the project folder and run the following command from a command line in Windows or a terminal on Mac. Plain Text docker build . -t hello-kl-docker-app Now the Docker image is built. Let's check the image locally by running the following command. Plain Text docker images Let's test the image by running it locally with the following command. Plain Text docker run -p 1880:1880 hello-kl-docker-app Check the container by running the following command: Plain Text docker ps -a You should now see the container name and ID listed. Now we are ready to push the image to any registry of your choice.
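For completeness, here is a hedged sketch of that final push step using Docker Hub as an example registry; the username is a placeholder, and other registries differ only in the login target and image name prefix. Shell
# Hypothetical example: publish the locally built image to Docker Hub.
docker login                                                             # authenticate against the registry
docker tag hello-kl-docker-app <your-username>/hello-kl-docker-app:1.0   # give the image a registry-qualified name
docker push <your-username>/hello-kl-docker-app:1.0                      # upload the image to the registry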
Here, I am going to show the power of the SQL Loader + Unix script combination, where multiple data files can be loaded by SQL Loader through an automated shell script. This is useful when dealing with large chunks of data that need to be moved from one system to another, and it is well suited to a migration project involving a large volume of historical data. In that situation, it is not practical to run SQL Loader manually for each file and wait until it has loaded. The best option is to keep a Unix script containing the SQL Loader command running all the time: as soon as a file lands in the folder location, the script picks it up and starts processing it immediately. The Setup I built the sample program on a MacBook. Installing Oracle there differs from installing it on a Windows machine; please go through the video that contains the detailed steps for installing Oracle on a MacBook. Get SQL Developer with Java 8 compliance. Now let us demonstrate the example. Loading Multiple Data Files Into an Oracle DB Table Because it is a MacBook, I have to do all the work inside the Oracle virtual machine. Let's see the below diagram of how SQL Loader works. Use Case We need to load millions of students' records into the Student table using shell script + SQL Loader automation. The script will run all the time on the Unix server and poll for .dat files; once a .dat file is in place, it will process it. Any bad data also needs to be identified separately. This type of example is useful in a migration project where millions of historical records need to be loaded. From the old system, a live feed (.dat file) is generated periodically and sent to the new system's server. Once the file is available on the new server, it is loaded into the database by the automated Unix script. Now let's run the script. The script can run all the time on a Unix server. To achieve this, the whole code is put into the block below: Plain Text while true; do [some logic] done The Process 1. I have copied all the files and the folder structure into the folder below. /home/oracle/Desktop/example-SQLdr 2. Refer to the folder listing below (ls -lrth): Shell rwxr-xr-x. 1 oracle oinstall 147 Jul 23 2022 student.ctl -rwxr-xr-x. 1 oracle oinstall 53 Jul 23 2022 student_2.dat -rwxr-xr-x. 1 oracle oinstall 278 Dec 9 12:42 student_1.dat drwxr-xr-x. 2 oracle oinstall 48 Dec 24 09:46 BAD -rwxr-xr-x. 1 oracle oinstall 1.1K Dec 24 10:10 TestSqlLoader.sh drwxr-xr-x. 2 oracle oinstall 27 Dec 24 11:33 DISCARD -rw-------. 1 oracle oinstall 3.5K Dec 24 11:33 nohup.out drwxr-xr-x. 2 oracle oinstall 4.0K Dec 24 11:33 TASKLOG -rwxr-xr-x. 1 oracle oinstall 0 Dec 24 12:25 all_data_file_list.unx drwxr-xr-x. 2 oracle oinstall 6 Dec 24 12:29 ARCHIVE 3. As shown below, there is no data in the student table. 4. Now run the script with nohup so it keeps running in the background on the Unix server: nohup ./TestSqlLoader.sh & (its output goes to nohup.out). 5. The script will now run and load the two .dat files through SQL Loader. 6. The table should be loaded with the content of the two files. 7. Now I delete the table data again. To prove that the script is running all the time on the server, I will move the two .dat files from ARCHIVE back to the current directory. 8. Again, place the two data files in the current directory. Shell -rwxr-xr-x. 1 oracle oinstall 147 Jul 23 2022 student.ctl -rwxr-xr-x. 1 oracle oinstall 53 Jul 23 2022 student_2.dat -rwxr-xr-x. 1 oracle oinstall 278 Dec 9 12:42 student_1.dat drwxr-xr-x.
2 oracle oinstall 48 Dec 24 09:46 BAD -rwxr-xr-x. 1 oracle oinstall 1.1K Dec 24 10:10 TestSqlLoader.sh drwxr-xr-x. 2 oracle oinstall 27 Dec 24 12:53 DISCARD -rw-------. 1 oracle oinstall 4.3K Dec 24 12:53 nohup.out drwxr-xr-x. 2 oracle oinstall 4.0K Dec 24 12:53 TASKLOG -rwxr-xr-x. 1 oracle oinstall 0 Dec 24 13:02 all_data_file_list.unx drwxr-xr-x. 2 oracle oinstall 6 Dec 24 13:03 ARCHIVE 9. See that the student table is again loaded with all the data. 10. The script is running all the time on the server: Shell [oracle@localhost example-sqldr]$ ps -ef|grep Test oracle 30203 1 0 12:53? 00:00:00 /bin/bash ./TestSqlLoader.sh oracle 31284 31227 0 13:06 pts/1 00:00:00 grep --color=auto Test Full Source Code for Reference Shell
#!/bin/bash
bad_ext='.bad'
dis_ext='.dis'
data_ext='.dat'
log_ext='.log'
log_folder='TASKLOG'
arch_loc="ARCHIVE"
bad_loc="BAD"
discard_loc="DISCARD"
now=$(date +"%Y.%m.%d-%H.%M.%S")
log_file_name="$log_folder/TestSQLLoader_$now$log_ext"
while true; do
  ls -a *.dat 2>/dev/null > all_data_file_list.unx
  for i in `cat all_data_file_list.unx`
  do
    #echo "The data file name is :-- $i"
    data_file_name=`basename $i .dat`
    echo "Before executing the sql loader command ||Starting of the script" > $log_file_name
    sqlldr userid=hr/oracle@orcl control=student.ctl errors=15000 log=$i$log_ext bindsize=512000000 readsize=500000 DATA=$data_file_name$data_ext BAD=$data_file_name$bad_ext DISCARD=$data_file_name$dis_ext
    mv $data_file_name$data_ext $arch_loc 2>/dev/null
    mv $data_file_name$bad_ext $bad_loc 2>/dev/null
    mv $data_file_name$dis_ext $discard_loc 2>/dev/null
    mv $data_file_name$data_ext$log_ext $log_folder 2>/dev/null
    echo "After Executing the sql loader command||File moved successfully" >> $log_file_name
  done
  ## halt the processing for 1 minute before polling again
  sleep 1m
done
The CTL file is below. SQL
OPTIONS (SKIP=1)
LOAD DATA
APPEND
INTO TABLE student
FIELDS TERMINATED BY '|' OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
(
  id,
  name,
  dept_id
)
The SQL Loader Specification control - name of the .ctl file errors=15000 - maximum number of errors SQL Loader will allow log=$i$log_ext - name of the log file bindsize=512000000 - maximum size of the bind array readsize=500000 - maximum size of the read buffer DATA=$data_file_name$data_ext - name and location of the data file BAD=$data_file_name$bad_ext - name and location of the bad file DISCARD=$data_file_name$dis_ext - name and location of the discard file In the way stated above, millions of records can be loaded in an automated fashion through SQL Loader + Unix scripting, and the above parameters can be tuned according to need; a sample data file matching this control file is sketched below. Please let me know if you like this article.
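For reference, here is a hypothetical example of what a data file for this control file could look like; the header row is skipped because of SKIP=1, fields are separated by '|', and the names and numbers are made up. Plain Text
ID|NAME|DEPT_ID
101|John Smith|10
102|"Jane Doe"|20
103|Rahul Sharma|30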