The modern data stack has helped democratize the creation, processing, and analysis of data across organizations. However, it has also led to a new set of challenges thanks to the decentralization of the data stack. In this post, we'll discuss one of the cornerstones of the modern data stack—data catalogs—and why they fall short of overcoming the fragmentation to deliver a fully self-served data discovery experience.

If you lead the data team at a company with 200+ employees, there is a high probability that you have: started seeing data discovery issues at your company; tried one of the commercial or open-source data catalogs; or cobbled together an in-house data catalog. If that's the case, you'd definitely find this post highly relatable.

Pain Points

This post is based on our own experience of building DataHub at LinkedIn and the learnings from 100+ interviews with data leaders and practitioners at various companies. There may be many reasons why a company adopts a data catalog, but here are the pain points we often come across:

Your data team is spending a lot of time answering questions about where to find data and what datasets to use.
Your company is making bad decisions because data is inconsistent, poor in quality, delayed, or simply unavailable.
Your data team can't confidently apply changes to, migrate, or deprecate data because there's no visibility into how the data is being used.

The bottom line is that you want to empower your stakeholders to self-serve the data and, more importantly, the right data. The data team doesn't want to be bogged down by support questions any more than data consumers want to depend on the data team to answer their questions. Both share a common goal—True Self-service Data Discovery™.

First Reaction

In our research, we saw striking similarities in companies attempting to solve this problem themselves. The story often goes like this: Create a database to store metadata. Collect important metadata, such as schemas, descriptions, owners, usage, and lineage, from key data systems. Make it searchable through a web app. Voila! You now have a full self-service solution and proudly declare victory over all data discovery problems.

Initial Excitement

Let's walk through what typically happened after this shiny new data catalog was introduced. It looked great on first impression. A handful of power users were super excited about the catalog and its potential. They were thrilled about their newfound visibility into the whole data ecosystem and the endless opportunities to explore new data. They were optimistic that this was indeed The Solution they'd been looking for.

Reality Sets In

A few months after launching, you started noticing that user engagement waned quickly. Customers' questions in your data team's Slack channel didn't seem to go away either. If anything, they became even harder for the team to answer. So what happened?

People searched "revenue," hoping to find the official revenue dataset. Instead, they got hundreds of similarly named results, such as "revenue", "revenue_new", "revenue_latest", "revenue_final", and "revenue_final_final", and were at a complete loss. Even if a person knew the exact name of what they were looking for, the data catalog only provided technical information, e.g., SQL definition, column descriptions, lineage, and data profile, without any explicit instructions on how to use it for a specific use case.
Your data team has painstakingly tagged datasets as "core", "golden", "important", etc., but the customers didn't know what these tags mean or why they matter. Worse yet, they started tagging things randomly and messed up the curation effort.

Is it really that hard to find the right data, even with such advanced search capabilities and all the rich metadata? Yes! Because the answer to "what's the right data" depends on who you are and what use cases you're trying to solve. Most data catalogs only present the information from the producer's point of view but fail to cater to the data consumers.

The Missing Piece

Providing the producer's point of view through automation and integration of all the technical metadata is definitely a key part of the solution. However, the consumer's point of view—trusted tables used by my organization, common usage patterns for various business scenarios, the impact upstream changes have on my analyses—is the missing piece that completes the data discovery and understandability puzzle. Most data catalogs don't help users find the data they need; they help users find someone to pester, which is often referred to as a "tap on the shoulder." This is not true self-service.

The Solution

We believe that there are three types of information/metadata required to make data discovery truly self-serviceable.

Technical Metadata

This refers to all metadata originating from the data systems, including schemas, lineage, SQL/code, descriptions, data profiles, data quality, etc. Automation and integration would keep the information at the user's fingertips.

Challenges: There is no standard for metadata across data platforms. Worse yet, many companies build their own custom systems that hold or produce key metadata. How to integrate these systems at scale to ingest metadata accurately, reliably, and in a timely manner is an engineering challenge.

Business Metadata

Each business function operates based on a set of common business definitions, often referred to as "business terms." Examples include Active Customers, Revenue, Employees, Churn, etc. As a data-driven organization relies heavily on these definitions to make key business decisions, it is paramount for data practitioners to correctly translate between the physical data and business terms.

Challenges: Many companies lack the tools, processes, and discipline to govern and communicate these business terms. As a result, when serving a business ask, data practitioners often struggle to find the right data for a particular business term or end up producing results that contradict each other.

Behavioral Metadata

Surfacing the association between people and data is critical to effective data discovery. Users often place their trust in data based on who created or used it. They also prefer to learn how to do their analyses from more experienced "power users." To that end, we need to encourage the sharing of these data learnings/insights across the company. This would also improve your organization's data literacy, provide a better understanding of the business, and reduce inconsistencies.

Challenges: People interact with the data in different ways. Some query using the Snowflake console, notebooks, R, and Presto, while others explore using BI tools, dashboards, or even spreadsheets. As a result, the learnings and insights are often spread across multiple places, making it difficult to associate people with data.

It should be fairly clear by now that discovering the right data and understanding what it means is not a mere technical problem.
It requires bringing technical, business, and behavioral metadata together. Doing this without creating an onerous governance process will boost your organization’s data productivity significantly and bring a truly data-driven culture to your company.
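To make the three types of metadata more concrete, here is a minimal, hypothetical Python sketch of what a single catalog entry might look like once technical, business, and behavioral metadata are brought together. The class and field names are illustrative assumptions, not part of any specific catalog product.

Python

from dataclasses import dataclass, field
from typing import List

@dataclass
class TechnicalMetadata:
    schema: dict              # column name -> type, harvested from the source system
    lineage: List[str]        # upstream datasets this table is derived from
    sql_definition: str = ""  # the query or pipeline code that produces the table

@dataclass
class BusinessMetadata:
    business_terms: List[str]  # e.g., ["Revenue", "Active Customers"]
    usage_notes: str = ""      # how this dataset should be used for a given scenario

@dataclass
class BehavioralMetadata:
    owners: List[str] = field(default_factory=list)          # who produces the data
    frequent_users: List[str] = field(default_factory=list)  # who queries it most
    endorsements: int = 0      # e.g., "trusted by the finance org" signals

@dataclass
class CatalogEntry:
    name: str
    technical: TechnicalMetadata
    business: BusinessMetadata
    behavioral: BehavioralMetadata

# A catalog entry for the "official" revenue table, combining all three views.
revenue = CatalogEntry(
    name="analytics.revenue",
    technical=TechnicalMetadata(schema={"day": "date", "amount_usd": "decimal"},
                                lineage=["raw.orders", "raw.refunds"]),
    business=BusinessMetadata(business_terms=["Revenue"],
                              usage_notes="Use for monthly board reporting."),
    behavioral=BehavioralMetadata(owners=["finance-data-team"],
                                  frequent_users=["fp&a"], endorsements=12),
)
print(revenue.name, revenue.business.business_terms)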
Data is the lifeblood of the digital age. Algorithms collect, store, process, and analyze it to create new insights and value. The data life cycle is the process by which data is created, used, and disposed of. It typically includes the following stages: Data collection: Data can be collected from a variety of sources, such as sensors, user input, and public records. Data preparation: Data is often cleaned and processed before it can be analyzed. This may involve removing errors, formatting data consistently, and converting data to a common format. Data analysis: Algorithms are used to analyze data and extract insights. This may involve identifying patterns, trends, and relationships in the data. Data visualization: Data visualization techniques are used to present the results of data analysis in a clear and concise way. Data storage: Data is often stored for future use. This may involve storing data in a database, filesystem, or cloud storage service. Algorithms are used at every stage of the data life cycle. For example, algorithms can be used to: Collect data: Algorithms can be used to filter and collect data from a stream of data, such as sensor data or social media data. Prepare data: Algorithms can be used to clean and process data, such as removing errors, formatting data consistently, and converting data to a common format. Analyze data: Algorithms can be used to analyze data and extract insights, such as identifying patterns, trends, and relationships in the data. Visualize data: Algorithms can be used to create data visualizations, such as charts, graphs, and maps. Store data: Algorithms can be used to compress and encrypt data before storing it. Algorithms play a vital role in the data life cycle. They enable us to collect, store, process, and analyze data efficiently and effectively. Here are some examples of how algorithms are used in the data life cycle: Search engines: Search engines use algorithms to index and rank websites so that users can find the information they are looking for quickly and easily. Social media: Social media platforms use algorithms to recommend content to users based on their interests and past behavior. E-commerce websites: E-commerce websites use algorithms to recommend products to users based on their browsing history and purchase history. Fraud detection: Financial institutions use algorithms to detect fraudulent transactions. Medical diagnosis: Medical professionals use algorithms to diagnose diseases and recommend treatments. Data Data is the lifeblood of the digital age because it powers the technologies and innovations that shape our world. From the social media platforms we use to stay connected to the streaming services we watch to the self-driving cars that are being developed, all of these technologies rely on data to function. Data is collected from various sources, including sensors, devices, and online transactions. Once collected, data is stored and processed using specialized hardware and software. This process involves cleaning, organizing, and transforming the data into a format that can be analyzed. Algorithms Algorithms are used to analyze data and extract insights. Algorithms are mathematical formulas that can be used to perform various tasks, such as identifying patterns, making predictions, and optimizing processes. The insights gained from data analysis can be used to create new products and services, improve existing ones, and make better decisions. 
For example, companies can use data to personalize their marketing campaigns, develop new products that meet customer needs, and improve their supply chains. Data Can Be Collected From a Variety of Sources Sensors: Sensors can be used to collect data about the physical environment, such as temperature, humidity, and movement. For example, smart thermostats use sensors to collect data about the temperature in a room and adjust the thermostat accordingly. User input: Data can also be collected from users, such as through surveys, polls, and website forms. For example, e-commerce websites collect data about customer purchases and preferences in order to improve their product recommendations and marketing campaigns. Public records: Public records, such as census data and government reports, can also be used to collect data. For example, businesses can use census data to identify target markets, and government reports to track industry trends. Here Are Some Additional Examples of Data Collection Sources Social media: Social media platforms collect data about users' activity, such as the posts they like, the people they follow, and the content they share. This data is used to target users with relevant ads and to personalize their user experience. IoT devices: The Internet of Things (IoT) refers to the network of physical objects that are connected to the internet and can collect and transmit data. IoT devices, such as smart home devices and wearables, can be used to collect data about people's daily lives. Business transactions: Businesses collect data about their customers and transactions, such as purchase history and contact information. This data is used to improve customer service, develop new products and services, and target marketing campaigns. Data Can Also Be Collected From a Variety of Different Types of Data Sources Structured data: Structured data is data that is organized in a predefined format, such as a database table. Structured data is easy to store, process, and analyze. Unstructured data: Unstructured data is data that does not have a predefined format, such as text, images, and videos. Unstructured data is more difficult to store, process, and analyze than structured data, but it can contain valuable insights. Data Preparation Data preparation is the process of cleaning and processing data so that it is ready for analysis. This is an important step in any data science project, as it can have a significant impact on the quality of the results. There are a number of different data preparation tasks that may be necessary, depending on the specific data set and the desired outcome. Some common tasks include: Removing errors: Data may contain errors due to human mistakes, technical glitches, or other factors. It is important to identify and remove these errors before proceeding with the analysis. Formatting data consistently: Data may be collected from a variety of sources, and each source may have its own unique format. It is important to format the data consistently so that it can be easily processed and analyzed. Converting data to a common format: Data may be collected in various formats, such as CSV, Excel, and JSON. It is often helpful to convert the data to a common format, such as CSV so that it can be easily processed and analyzed by different tools and software. Handling missing values: Missing values are a common problem in data sets. 
There are a number of different ways to handle missing values, such as removing the rows with missing values, replacing the missing values with a default value, or estimating the missing values using a statistical model. Feature engineering: Feature engineering is the process of creating new features from existing features. This can be done to improve machine learning algorithms' performance or make the data more informative for analysis. Data preparation can be a time-consuming and challenging task, but it is essential for producing high-quality results. By carefully preparing the data, data scientists can increase the accuracy and reliability of their analyses. Here are some additional tips for data preparation: Start by understanding the data: Before you start cleaning and processing the data, it is important to understand what the data represents and how it will be used. This will help you to identify the most important tasks and to make informed decisions about how to handle the data. Use appropriate tools and techniques: There are a number of different data preparation tools and techniques available. Choose the tools and techniques that are most appropriate for your data set and your desired outcome. Document your work: It is important to document your data preparation work so that you can reproduce the results and so that others can understand how the data was prepared. This is especially important if you are working on a team or if you are sharing your data with others. How Algorithm Works An algorithm is a set of instructions that can be used to solve a problem or achieve a goal. Algorithms are used in many different fields, including computer science, mathematics, and engineering. In the context of data, algorithms are used to process and analyze data in order to extract useful information. For example, an algorithm could be used to sort a list of numbers, find the average of a set of values, or identify patterns in a dataset. Algorithms work with data by performing a series of steps on the data. These steps can include arithmetic operations, logical comparisons, and decision-making. The output of an algorithm is typically a new piece of data, such as a sorted list of numbers, a calculated average, or a set of identified patterns. Here is a simple example of an algorithm for calculating the average of a set of numbers: Initialize a variable sum to 0. Iterate over the set of numbers, adding each number to the variable sum. Divide the variable sum by the number of numbers in the set. The result is the average of the set of numbers. This algorithm can be implemented in any programming language and can be used to calculate the average of any set of numbers, regardless of size. More complex algorithms can be used to perform more sophisticated tasks, such as machine learning and natural language processing. These algorithms typically require large datasets to train, and they can be used to make predictions or generate creative text formats. Here are some examples of how algorithms are used with data in the real world: Search engines: Algorithms are used to rank the results of a search query based on the relevance of the results to the query and other factors. Social media: Algorithms are used to filter the content that users see in their feeds based on their interests and past behavior. Recommendation systems: Algorithms are used to recommend products, movies, and other content to users based on their past preferences. 
Fraud detection: Algorithms are used to identify fraudulent transactions and other suspicious activities. Medical diagnosis: Algorithms are used to assist doctors in diagnosing diseases and recommending treatments. These are just a few examples of the many ways that algorithms are used with data in the real world. As the amount of data that we collect and store continues to grow, algorithms will play an increasingly important role in helping us to make sense of that data and to use it to solve problems.
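As a minimal illustration of the averaging algorithm described above (initialize a sum, iterate over the numbers, divide by the count), here is a short Python sketch; the function name and the sample list are illustrative.

Python

def average(numbers):
    # Initialize a variable to hold the running sum.
    total = 0
    # Iterate over the set of numbers, adding each one to the running sum.
    for n in numbers:
        total += n
    # Divide the sum by the number of values to get the average.
    return total / len(numbers)

print(average([4, 8, 15, 16, 23, 42]))  # 18.0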
What data management in 2024 and beyond will look like hangs on one question: Can open data formats lead to a best-of-breed data management platform? It will take interoperability across clouds and formats, as well as at the semantics and governance layer.

Sixth Platform. Atlas. Debezium. DCAT. Egeria. Nessie. Mesh. Paimon. Transmogrification. This veritable word soup sounds like something that jumped out of a role-playing game. In reality, these are all terms related to data management. These terms may tell us something about the mindscape of people involved in data management tools and nomenclature, but that's a different story.

Data management is a long-standing and ever-evolving practice. The days of "Big Data" have long faded, and data has now taken a back seat in terms of hype and attention. Generative AI is the new toy people are excited about, shining a light on AI and machine learning for the masses. Data management may seem dull in comparison. However, there is something people who have been into AI before it was cool understand well, and people who are new to this eventually realize, too: AI is only as good as the data it operates on. There is an evolutionary chain leading from data to analytics to AI. Some organizations were well aware of this a decade ago. Others will have to learn the hard way today.

Exploring and understanding how, where, and when data is stored, managed, integrated, and used, as well as related aspects of data formats, interoperability, semantics, and governance, may be hard and unglamorous work. But it is what generates value for organizations, along with RPG-like word soups. Peter Corless and Alex Merced understand this. Their engagement with data goes back well before their current roles as Director of Product Marketing at StarTree and Developer Advocate at Dremio, respectively. We caught up to talk about the state of data management in 2024 and where it may be headed.

The Sixth Data Platform

For many organizations today, data management comes down to handing over their data to one of the "Big 5" data vendors: Amazon, Microsoft Azure, and Google, plus Snowflake and Databricks. However, analysts David Vellante and George Gilbert believe that the need for modern data applications, coupled with the evolution of open storage management, may lead to the emergence of a "sixth data platform."

The notion of the "sixth data platform" was the starting point for this conversation with Corless and Merced. The sixth data platform hypothesis is that open data formats may enable interoperability, leading the transition away from vertically integrated, vendor-controlled platforms towards independent management of data storage and permissions. It's an interesting scenario and one that would benefit users by forcing vendors to compete for every workload based on the business value delivered, irrespective of lock-in. But how close are we to realizing this?

In order to answer this question, we need to examine open data formats and their properties. In turn, in order to understand data formats, a brief historical overview is needed. We might call this table stakes, and it's not just a pun. Historically, organizations would resort to using data warehouses for their analytics needs. These were databases specifically designed for analytics, as opposed to transactional applications. Traditional data warehouses worked, but scaling them was an issue, as Merced pointed out.
Scaling traditional data warehouses was expensive and cumbersome because it meant buying new hardware with storage and compute bundled. This was the problem Hadoop was meant to solve when it was introduced in the mid-2000s, by separating storage and compute. Hadoop made scaling easier, but it was cumbersome to use. This is why a SQL interface was eventually created for Hadoop via Apache Hive, making it more accessible. Hive used metadata and introduced a protocol for mapping files, which is what Hadoop operated on, to tables, which is what SQL operates on. That was the beginning of open data formats.

Data Lakes, Data Lakehouses, and Open Data Formats

Hadoop also signified the emergence of the data lake. Since storage was now cheap and easy to add, that opened up the possibility of storing all data in one big Hadoop store: the data lake. Eventually, however, Hadoop and on-premise compute and storage gave way to the cloud, which is the realm the "Big 5" operate in.

"The idea of a data lakehouse was, hey, you know what? We love this decoupling of compute and storage. We love the cloud. But wouldn't it be nice if we didn't have to duplicate the data and we could just operate over the data store you already have, your data lake, and then just take all that data warehouse functionality and start trying to move it on there?" as Merced put it.

However, as he added, that turned out to be a tall order because there were elements of traditional data management missing. Other cloud-based data management platforms introduced notions like automated sharding, partitioning, distribution, and replication. That came as part of the move away from data centers and file systems towards the cloud and APIs to access data. But for data lakehouses, these things are not a given.

This is where open data formats such as Apache Iceberg, Apache Hudi, and Delta Lake come in. They all have some things in common, such as the use of metadata to abstract and optimize file operations and being agnostic as to storage. However, Hudi and Delta Lake both continued to build on what Hive started, while Iceberg departed from that. A detailed open data format comparison is a nuanced exercise, but perhaps the most important question is: Does interoperability exist between open data formats, and if yes, on what level?

For the sixth data platform vision to become a reality, interoperability should exist among clouds and formats, as well as on the semantics and governance level. Interoperability of the same format between different cloud storage vendors is possible but not easy. It can be done, but it's not an inherent feature of the data formats themselves. It's done on the application level, either via custom integration or by using something like Dremio. And there will be egress and network costs from transferring files outside the cloud where they reside.

Interoperability of different formats is also possible but not perfect. It can be done by using a third-party application, but there are also a couple of other options, like Delta Uniform or OneTable. Neither works 100%, as Delta Uniform batches transactions and OneTable is intended for migration purposes. As Merced noted, creating a solution that works 100% would probably incur a lot of overhead and complexity, as metadata would have to be synchronized across formats. Merced thinks these solutions are motivated at least partially by Iceberg's growth and the desire to ensure third-party tools can work with the Iceberg ecosystem.

To Federate or Not to Federate?
What's certain is that there are always going to be sprinkles of data in other systems or in multiple clouds. That's why federation and virtualization that can work at scale are needed, as Corless noted. Schema management is complicated enough on one system, and trying to manage changes across three different systems would not make things easier. Whether data is left where it originally resides and used via federated queries or ingested into one storage location is a key architectural question. As such, there are tradeoffs involved that need to be understood. Even within federation, there are different ways to go about it — Dremio, GraphQL, Pinot, Trino, and more.

"How do data consumers want to size up these problems? How can we give them predictive ways to plan for these trade-offs? They could use federated queries, or a hybrid table of real time and batch data. We don't even have a grammar to describe those kinds of hybrid or complex data products these days," Corless said.

Corless has a slightly different point of view, focusing on real-time data and processing. This is what StarTree does, as it builds on Apache Pinot, a real-time distributed OLAP datastore designed to answer OLAP queries with low latency. Corless also noted that the choice of data format depends on a number of parameters. Some formats are optimized for in-memory use, like Apache Arrow. Others, like Apache Parquet, are optimized for disk storage. There are attempts to have common representations of data both in memory and in storage, motivated by the need to leverage tiered storage. Tiered storage may utilize different media, ranging from in-memory to SSD to some sort of blob storage.

"People want flexibility in where and how they store their data. If they have to do transmogrification in real time, that largely defeats the purpose of what they're trying to do with tiered storage. Is there something like the best, most universal format? Whenever you're optimizing for something, you're optimizing away from something else. But I'm very eager to see where that kind of universal representation of data is taking us right now," Corless said.

Semantics and Governance

Regardless of where or how data is stored, however, true interoperability should also include semantics and governance. This is where the current state of affairs leaves a lot to be desired. As Merced shared, the way all table formats work is that the metadata has information about the table but not about the overall catalog of tables. That means they are not able to document the semantics of tables and how they relate to each other. A standard format for doing that doesn't exist yet.

Merced also noted that there is something that addresses this gap, but it only works for Iceberg tables. Project Nessie is an open-source platform that creates a catalog, thus making tracking and versioning Iceberg tables and views possible. Nessie also incorporates elements of governance, and Merced noted that Nessie will eventually have Delta Lake support, too. Databricks, on its part, offers Unity Catalog, which works with Delta Lake, but it's a proprietary product. There is no lack of data catalog products in the market, but none of them can really be considered the solution to semantics and data interoperability.

Corless, on his part, noted that there is a standard called DCAT. DCAT, which stands for Data Catalog Vocabulary, is an RDF vocabulary designed to facilitate interoperability between data catalogs published on the Web. DCAT was recently updated to v.3. It's been around for over a decade, and it's precisely aimed at interoperability.
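To give a flavor of what a DCAT description looks like in practice, here is a minimal, hedged sketch that builds a tiny catalog entry with Python's rdflib library (version 6 or later assumed) and serializes it as Turtle. The dataset, URIs, and properties shown are illustrative choices, not a complete or authoritative DCAT profile.

Python

from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCAT, DCTERMS, RDF

g = Graph()
g.bind("dcat", DCAT)
g.bind("dcterms", DCTERMS)

# A hypothetical dataset and one distribution of it (a Parquet file in object storage).
dataset = URIRef("https://example.com/datasets/orders")
dist = URIRef("https://example.com/datasets/orders/parquet")

g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.title, Literal("Orders")))
g.add((dataset, DCTERMS.description, Literal("Daily order facts, one row per order line.")))
g.add((dataset, DCAT.distribution, dist))

g.add((dist, RDF.type, DCAT.Distribution))
g.add((dist, DCAT.downloadURL, URIRef("s3://example-bucket/orders/")))
g.add((dist, DCAT.mediaType, Literal("application/vnd.apache.parquet")))

# Serialize the catalog entry as Turtle so another catalog could ingest it.
print(g.serialize(format="turtle"))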
The fact that DCAT is not used widely probably has more to do with vendors reinventing the wheel and/or aiming for lock-in, as well as clients not being more proactive in requiring interoperability standards, than with DCAT itself, as per Corless.

Unfortunately, it seems that governance is to data what data is to AI for most organizations: a distant second at best. Forty-two percent of data and analytics leaders do not assess, measure, or monitor their data and analytics governance, according to a 2020 Gartner survey. Those who said they measured their governance activity mainly focused on achieving compliance-oriented goals.

The onset of GDPR in 2018 marked an opportunity for governance, leveraging metadata and semantics, to rise in prominence. Apache Atlas was an effort to standardize governance for data lakes leveraging DCAT and other metadata vocabularies. Today, however, both data lakes and Atlas seem to have fallen out of fashion. A new project called Egeria was spun out of Atlas, aiming to address more than data lakes. Furthermore, there is another open standard for metadata and data lineage called OpenLineage and a number of lineage platforms that support it, including Egeria.

Conclusion

So, what is the verdict? Is the sixth data platform possible for data management in 2024, or is it a pipe dream? Merced thinks that we're already starting to see it. In his view, using open formats like Apache Iceberg, Apache Arrow, and Apache Parquet can create a greater level of interoperability. This interoperability is still imperfect, he noted, but it's a much better state of things than in the past. Dremio's thesis is to be an open data lakehouse, not just operating on the data lake, but across tools and data sources.

Corless thinks that we're going to see a drive toward clusters of clusters or systems of systems. There's going to be a reinforced drive towards the automation of integration, more like Lego bricks that snap together easily, rather than Ikea furniture that comes with a hex wrench. But for that to happen, he noted, we'll need a language and a grammar so that systems and people can both understand each other.

The ingredients for a sixth data platform are either there or close enough. As complexity and fragmentation explode, a language and grammar for interoperability sounds like a good place to start. Admittedly, interoperability and semantics are hard. Even that, however, may be more of a people and market issue than a technical one. DCAT and OpenLineage are just some of the vocabularies out there. Even things as infamously hard to define as data mesh and data products have a vocabulary of their own — the Data Product Descriptor Specification. Perhaps, then, the right stance here would be cautious optimism.
Change data capture (CDC) is a widely adopted pattern to move data across systems. While the basic principle works well on small single-table use cases, things get complicated when we need to take into account consistency when information spans multiple tables. In cases like this, creating multiple 1-1 CDC flows is not enough to guarantee a consistent view of the data in the database because each table is tracked separately. Aligning data with transaction boundaries becomes a hard and error-prone problem to solve once the data leaves the database. This tutorial shows how to use PostgreSQL logical decoding, the outbox pattern, and Debezium to propagate a consistent view of a dataset spanning multiple tables.

Use Case: A PostgreSQL-Based Online Shop

Relational databases are based on an entity-relationship model, where entities are stored in tables, with each table having a key for uniqueness. Relationships take the form of foreign keys that allow information from various tables to be joined. A practical example is the following, with the four entities users, products, orders, and order lines and the relationships between them. In this model, the orders table contains a foreign key to users (the user making the order), and the order lines table contains the foreign keys to orders and products, allowing us to understand to which order the line belongs and which products it includes.

We can recreate the above situation by signing up for an Aiven account and accessing the console, then creating a new Aiven for PostgreSQL database. When the service is up and running, we can retrieve the connection URI from the service console page's Overview tab. When you have the connection URI, connect with psql and run the following:

SQL

CREATE TABLE USERS (ID SERIAL PRIMARY KEY, USERNAME TEXT);

INSERT INTO USERS (USERNAME) VALUES ('Franco'),('Giuseppina'),('Wiltord');

CREATE TABLE ORDERS (
    ID SERIAL PRIMARY KEY,
    SHIPPING_ADDR TEXT,
    ORDER_DATE DATE,
    USER_ID INT,
    CONSTRAINT FK_USER FOREIGN KEY(USER_ID) REFERENCES USERS(ID)
);

INSERT INTO ORDERS (SHIPPING_ADDR, ORDER_DATE, USER_ID) VALUES
    ('Via Ugo 1', '02/08/2023',3),
    ('Piazza Carlo 2', '03/08/2023',1),
    ('Lincoln Street', '03/08/2023',2);

CREATE TABLE PRODUCTS (
    ID SERIAL PRIMARY KEY,
    CATEGORY TEXT,
    NAME TEXT,
    PRICE INT
);

INSERT INTO PRODUCTS (CATEGORY, NAME, PRICE) VALUES
    ('t-shirt', 'red t-shirt',5),
    ('shoes', 'Wow shoe',35),
    ('t-shirt', 'blue t-shirt',15),
    ('dress', 'white-golden dress',50);

CREATE TABLE ORDER_LINES (
    ID SERIAL PRIMARY KEY,
    ORDER_ID INT,
    PROD_ID INT,
    QTY INT,
    CONSTRAINT FK_ORDER FOREIGN KEY(ORDER_ID) REFERENCES ORDERS(ID),
    CONSTRAINT FK_PRODUCT FOREIGN KEY(PROD_ID) REFERENCES PRODUCTS(ID)
);

INSERT INTO ORDER_LINES (ORDER_ID, PROD_ID, QTY) VALUES
    (1,1,5), (1,4,1), (2,2,7), (2,4,2), (2,3,7), (2,1,1), (3,2,2);

Start the Change Data Capture Flow With the Debezium Connector

Now, if we want to send an event to Apache Kafka® every time a new order happens, we can define a Debezium CDC connector that includes all four tables defined above. To do this, navigate to the Aiven Console and create a new Aiven for Apache Kafka® service (we need at least a business plan for this example). Then, enable Kafka Connect from the service overview page. Navigate to the bottom of the same page; we can enable the kafka.auto_create_topics_enable configuration in the Advanced Parameter section for our test purposes.
Finally, when the service is up and running, create a Debezium CDC connector with the following JSON definition:

JSON

{
    "name": "mysourcedebezium",
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "<HOSTNAME>",
    "database.port": "<PORT>",
    "database.user": "avnadmin",
    "database.password": "<PASSWORD>",
    "database.dbname": "defaultdb",
    "database.server.name": "mydebprefix",
    "plugin.name": "pgoutput",
    "slot.name": "mydeb_slot",
    "publication.name": "mydeb_pub",
    "publication.autocreate.mode": "filtered",
    "table.include.list": "public.users,public.products,public.orders,public.order_lines"
}

Where:

database.hostname, database.port, and database.password point to the Aiven for PostgreSQL connection parameters that can be found in the Aiven Console's service overview tab
database.server.name is the prefix for the topic names in Aiven for Apache Kafka
plugin.name is the PostgreSQL plugin name used, pgoutput
slot.name and publication.name are the names of the replication slot and publication in PostgreSQL
"publication.autocreate.mode": "filtered" allows us to create a publication only for the tables in scope
table.include.list lists the tables for which we want to enable CDC

The connector will create four topics (one per table) and track the changes separately for each table. In Aiven for Apache Kafka, we should see four different topics named <prefix>.<schema_name>.<table_name> where:

<prefix> matches the database.server.name parameter (mydebprefix)
<schema_name> matches the name of the schema (public in our scenario)
<table_name> matches the name of the tables (users, products, orders, and order_lines)

If we check the mydebprefix.public.users log in Apache Kafka with kcat, we should see data similar to the below:

JSON

{"before":null,"after":{"id":1,"username":"Franco"},"source":{"version":"1.9.7.aiven","connector":"postgresql","name":"mydebprefix1","ts_ms":1690794104325,"snapshot":"true","db":"defaultdb","sequence":"[null,\"251723832\"]","schema":"public","table":"users","txId":2404,"lsn":251723832,"xmin":null},"op":"r","ts_ms":1690794104585,"transaction":null}
{"before":null,"after":{"id":2,"username":"Giuseppina"},"source":{"version":"1.9.7.aiven","connector":"postgresql","name":"mydebprefix1","ts_ms":1690794104325,"snapshot":"true","db":"defaultdb","sequence":"[null,\"251723832\"]","schema":"public","table":"users","txId":2404,"lsn":251723832,"xmin":null},"op":"r","ts_ms":1690794104585,"transaction":null}
{"before":null,"after":{"id":3,"username":"Wiltord"},"source":{"version":"1.9.7.aiven","connector":"postgresql","name":"mydebprefix1","ts_ms":1690794104325,"snapshot":"true","db":"defaultdb","sequence":"[null,\"251723832\"]","schema":"public","table":"users","txId":2404,"lsn":251723832,"xmin":null},"op":"r","ts_ms":1690794104585,"transaction":null}

The above is the typical Debezium data representation, with the before and after states, as well as information about the transaction (ts_ms, for example) and the data source (schema, table, and others). This rich information will be useful later.

The Consistency Problem

Now, let's say Franco, one of our users, decides to issue a new order for the white-golden dress. Just a few seconds later, our company, due to an online debate, decides that the white-golden dress is now called the blue-black dress and wants to charge $65 instead of the original $50 price.
The above two actions can be represented by the following two transactions in PostgreSQL:

SQL

--- Franco purchasing the white-golden dress
BEGIN;
INSERT INTO ORDERS (SHIPPING_ADDR, ORDER_DATE, USER_ID) VALUES ('Piazza Carlo 2', '04/08/2023',1);
INSERT INTO ORDER_LINES (ORDER_ID, PROD_ID, QTY) VALUES (4,4,1);
END;

--- Our company updating the name and the price of the white-golden dress
BEGIN;
UPDATE PRODUCTS SET NAME = 'blue-black dress', PRICE = 65 WHERE ID = 4;
END;

At any point in time, we can get the order details with the following query:

SQL

SELECT USERNAME,
       ORDERS.ID ORDER_ID,
       PRODUCTS.NAME PRODUCT_NAME,
       PRODUCTS.PRICE PRODUCT_PRICE,
       ORDER_LINES.QTY QUANTITY
FROM USERS
    JOIN ORDERS ON USERS.ID = ORDERS.USER_ID
    JOIN ORDER_LINES ON ORDERS.ID = ORDER_LINES.ORDER_ID
    JOIN PRODUCTS ON ORDER_LINES.PROD_ID = PRODUCTS.ID
WHERE ORDERS.ID = 4;

If we issue the query just after Franco's order was inserted, but before the product update, this results in the correct order details:

Plain Text

 username | order_id |    product_name    | product_price | quantity
----------+----------+--------------------+---------------+----------
 Franco   |        4 | white-golden dress |            50 |        1
(1 row)

If we issue the same query after the product update, this results in the blue-black dress being in the order and Franco being overcharged by an extra $15:

Plain Text

 username | order_id |   product_name   | product_price | quantity
----------+----------+------------------+---------------+----------
 Franco   |        4 | blue-black dress |            65 |        1
(1 row)

Recreate Consistency in Apache Kafka

When we look at the data in Apache Kafka, we can see all the changes in the topics. Browsing the mydebprefix.public.order_lines topic with kcat, we can check the new entry (the results in mydebprefix.public.orders would be similar):

JSON

{"before":null,"after":{"id":8,"order_id":4,"prod_id":4,"qty":1},"source":{"version":"1.9.7.aiven","connector":"postgresql","name":"mydebprefix1","ts_ms":1690794206740,"snapshot":"false","db":"defaultdb","sequence":"[null,\"251744424\"]","schema":"public","table":"order_lines","txId":2468,"lsn":251744424,"xmin":null},"op":"c","ts_ms":1690794207231,"transaction":null}

And in mydebprefix.public.products, we can see an entry like the following, showcasing the update from white-golden dress to blue-black dress and the related price change:

JSON

{"before":{"id":4,"category":"dress","name":"white-golden dress","price":50},"after":{"id":4,"category":"dress","name":"blue-black dress","price":65},"source":{"version":"1.9.7.aiven","connector":"postgresql","name":"mydebprefix1","ts_ms":1690794209729,"snapshot":"false","db":"defaultdb","sequence":"[\"251744720\",\"251744720\"]","schema":"public","table":"products","txId":2469,"lsn":251744720,"xmin":null},"op":"u","ts_ms":1690794210275,"transaction":null}

The question now is: How can we keep the order consistent with reality, where Franco purchased the white-golden dress for $50? As mentioned before, the Debezium format stores lots of metadata in addition to the change data. We could make use of the transaction's metadata (txId, lsn, and ts_ms, for example) and additional tools like Aiven for Apache Flink® to recreate a consistent view of the transaction via stream processing. This solution requires additional tooling that might not be in scope for us, however.

Use the Outbox Pattern in PostgreSQL

An alternative solution that doesn't require additional tooling is to propagate a consistent view of the data using an outbox pattern built in PostgreSQL.
With the outbox pattern, we store, alongside the original set of tables, an additional table that consolidates the information. With this pattern, we can update both the original tables and the outbox one within the same transaction.

Add a New Outbox Table in PostgreSQL

How do we implement the outbox pattern in PostgreSQL? The first option is to add a new dedicated table and update it within the same transaction that changes the ORDERS and ORDER_LINES tables. We can define the outbox table as follows:

SQL

CREATE TABLE ORDER_OUTBOX (
    ORDER_LINE_ID INT,
    ORDER_ID INT,
    USERNAME TEXT,
    PRODUCT_NAME TEXT,
    PRODUCT_PRICE INT,
    QUANTITY INT
);

We can then add the ORDER_OUTBOX table to the table.include.list parameter for the Debezium connector to track its changes. The last part of the equation is to update the outbox table at every order: if Giuseppina wants 5 red t-shirts, the transaction will need to change the ORDERS, ORDER_LINES, and ORDER_OUTBOX tables like the following:

SQL

BEGIN;

INSERT INTO ORDERS (ID, SHIPPING_ADDR, ORDER_DATE, USER_ID) VALUES (5, 'Lincoln Street', '05/08/2023',2);

INSERT INTO ORDER_LINES (ORDER_ID, PROD_ID, QTY) VALUES (5,1,5);

INSERT INTO ORDER_OUTBOX
    SELECT ORDER_LINES.ID,
           ORDERS.ID,
           USERNAME,
           NAME PRODUCT_NAME,
           PRICE PRODUCT_PRICE,
           QTY QUANTITY
    FROM USERS
        JOIN ORDERS ON USERS.ID = ORDERS.USER_ID
        JOIN ORDER_LINES ON ORDERS.ID = ORDER_LINES.ORDER_ID
        JOIN PRODUCTS ON ORDER_LINES.PROD_ID = PRODUCTS.ID
    WHERE ORDERS.ID=5;

END;

With this transaction and the Debezium configuration change to include the public.order_outbox table in the CDC, we end up with a new topic called mydebprefix.public.order_outbox. It has the following data, which represents the consistent situation in PostgreSQL:

JSON

{"before":null,"after":{"order_line_id":12,"order_id":5,"username":"Giuseppina","product_name":"red t-shirt","product_price":5,"quantity":5},"source":{"version":"1.9.7.aiven","connector":"postgresql","name":"mydebprefix1","ts_ms":1690798353655,"snapshot":"false","db":"defaultdb","sequence":"[\"251744920\",\"486544200\"]","schema":"public","table":"order_outbox","txId":4974,"lsn":486544200,"xmin":null},"op":"c","ts_ms":1690798354274,"transaction":null}

Avoid the Additional Table With PostgreSQL Logical Decoding

The main problem with the outbox table approach is that we're storing the same information twice: once in the original tables and once in the outbox table. This doubles the storage needs, and the original applications that use the database generally do not access the outbox table, making this an inefficient approach. A better transactional approach is to use PostgreSQL logical decoding. Created originally for replication purposes, PostgreSQL logical decoding can also write custom information to the WAL log. Instead of storing the result of the joined data in another PostgreSQL table, we can emit the result as an entry in the WAL log. By doing this within a transaction, we benefit from transaction isolation; the entry in the log is committed only if the whole transaction is.
To use PostgreSQL logical decoding messages for our outbox pattern needs, we need to execute the following:

SQL

BEGIN;

DO $$
DECLARE
    JSON_ORDER text;
BEGIN
    INSERT INTO ORDERS (ID, SHIPPING_ADDR, ORDER_DATE, USER_ID) VALUES (6, 'Via Ugo 1', '05/08/2023',3);

    INSERT INTO ORDER_LINES (ORDER_ID, PROD_ID, QTY) VALUES (6,4,2),(6,3,3);

    SELECT JSONB_BUILD_OBJECT(
            'order_id', ORDERS.ID,
            'order_lines', JSONB_AGG(
                JSONB_BUILD_OBJECT(
                    'order_line', ORDER_LINES.ID,
                    'username', USERNAME,
                    'product_name', NAME,
                    'product_price', PRICE,
                    'quantity', QTY)))
        INTO JSON_ORDER
    FROM USERS
        JOIN ORDERS ON USERS.ID = ORDERS.USER_ID
        JOIN ORDER_LINES ON ORDERS.ID = ORDER_LINES.ORDER_ID
        JOIN PRODUCTS ON ORDER_LINES.PROD_ID = PRODUCTS.ID
    WHERE ORDERS.ID=6
    GROUP BY ORDERS.ID;

    SELECT * FROM pg_logical_emit_message(true,'myprefix',JSON_ORDER) INTO JSON_ORDER;
END;
$$;

END;

Where:

The two lines below insert the new order into the original tables:

SQL

INSERT INTO ORDERS (ID, SHIPPING_ADDR, ORDER_DATE, USER_ID) VALUES (6, 'Via Ugo 1', '05/08/2023',3);
INSERT INTO ORDER_LINES (ORDER_ID, PROD_ID, QTY) VALUES (6,4,2),(6,3,3);

Next, we need to construct a SELECT that:

Gets the new order details from the source tables
Creates a unique JSON document (stored in the JSON_ORDER variable) for the entire order, storing the lines of the order in an array
Emits this as a logical message to the WAL file

The statement emitting the logical message looks like the following:

SQL

SELECT * FROM pg_logical_emit_message(true,'myprefix',JSON_ORDER) INTO JSON_ORDER;

pg_logical_emit_message has three arguments. The first, true, defines this operation as part of a transaction. myprefix defines the message prefix, and JSON_ORDER is the content of the message. The emitted JSON document should look similar to:

JSON

{"order_id": 6, "order_lines": [{"quantity": 2, "username": "Wiltord", "order_line": 19, "product_name": "blue-black dress", "product_price": 65}, {"quantity": 3, "username": "Wiltord", "order_line": 20, "product_name": "blue t-shirt", "product_price": 15}]}

If the above transaction is successful, we should see a new topic named mydebprefix.message that contains the logical message we just pushed, in the following form:

JSON

{"op":"m","ts_ms":1690804437953,"source":{"version":"1.9.7.aiven","connector":"postgresql","name":"mydebmsg","ts_ms":1690804437778,"snapshot":"false","db":"defaultdb","sequence":"[\"822085608\",\"822089728\"]","schema":"","table":"","txId":8651,"lsn":822089728,"xmin":null},"message":{"prefix":"myprefix","content":"eyJvcmRlcl9pZCI6IDYsICJvcmRlcl9saW5lcyI6IFt7InF1YW50aXR5IjogMiwgInVzZXJuYW1lIjogIldpbHRvcmQiLCAib3JkZXJfbGluZSI6IDI1LCAicHJvZHVjdF9uYW1lIjogImJsdWUtYmxhY2sgZHJlc3MiLCAicHJvZHVjdF9wcmljZSI6IDY1fSwgeyJxdWFudGl0eSI6IDMsICJ1c2VybmFtZSI6ICJXaWx0b3JkIiwgIm9yZGVyX2xpbmUiOiAyNiwgInByb2R1Y3RfbmFtZSI6ICJibHVlIHQtc2hpcnQiLCAicHJvZHVjdF9wcmljZSI6IDE1fV19"}}

Where:

"op":"m" defines that the event is a logical decoding message
"prefix":"myprefix" is the prefix we defined in the pg_logical_emit_message call
content contains the JSON document with the order details, encoded based on the binary.handling.mode defined in the connector definition
We can use a mix of kcat and jq to showcase the data included in the message.content part of the payload:

Shell

kcat -b KAFKA_HOST:KAFKA_PORT \
    -X security.protocol=SSL \
    -X ssl.ca.location=ca.pem \
    -X ssl.key.location=service.key \
    -X ssl.certificate.location=service.crt \
    -C -t mydebmsg.message -u | jq -r '.message.content | @base64d'

We see the message in JSON format as:

JSON

{"order_id": 6, "order_lines": [{"quantity": 2, "username": "Wiltord", "order_line": 25, "product_name": "blue-black dress", "product_price": 65}, {"quantity": 3, "username": "Wiltord", "order_line": 26, "product_name": "blue t-shirt", "product_price": 15}]}

Conclusion

Defining a change data capture system allows downstream technologies to make use of the information assets, which is useful only if we can provide a consistent view on top of the data. The outbox pattern allows us to join data spanning different tables and provide a consistent, up-to-date view of complex queries. PostgreSQL's logical decoding enables us to push such a consistent view to Apache Kafka without having to write changes into an extra outbox table but rather by writing directly to the WAL log.
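As an alternative to the kcat and jq pipeline above, a small Python consumer can do the same decoding programmatically. This is a hedged sketch using the confluent-kafka client; the broker address, certificate paths, group id, and topic name mirror the placeholders used above and would need to be adapted to your setup.

Python

import base64
import json

from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "KAFKA_HOST:KAFKA_PORT",
    "security.protocol": "SSL",
    "ssl.ca.location": "ca.pem",
    "ssl.key.location": "service.key",
    "ssl.certificate.location": "service.crt",
    "group.id": "outbox-reader",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["mydebmsg.message"])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())
        # Only logical decoding messages carry op = "m" and a message payload.
        if event.get("op") == "m":
            order = json.loads(base64.b64decode(event["message"]["content"]))
            print(order["order_id"], order["order_lines"])
finally:
    consumer.close()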
Artificial Intelligence (AI) and data management have a strong connection, with AI playing a significant role in enhancing and automating various data management tasks.

Data Integration and Automated Processing

AI algorithms can be used to automate data integration processes, where disparate data sources are combined and transformed into a unified format. AI can help in identifying patterns and relationships across different datasets, leading to more accurate and efficient data integration.

Data Cleansing and Quality Assurance

AI techniques such as machine learning can be employed to identify and rectify errors and inconsistencies in datasets, ensuring data quality. AI-powered algorithms can automatically flag and correct duplicate, missing, or outdated records, improving the overall reliability of data.

Data Governance and Compliance

AI can assist in ensuring compliance with data governance policies and regulations. By analyzing data usage patterns, AI can identify potential compliance risks or data breaches, enabling proactive measures to be taken. AI algorithms can also automate the enforcement of data governance policies, thereby reducing human errors.

Data Security and Privacy

AI can be used to enhance data security and privacy measures. AI algorithms can detect and flag potential security threats by monitoring network activity and data access patterns. Additionally, AI-powered tools can anonymize sensitive data to protect individual privacy while still allowing meaningful analysis.

Data Analytics and Insights

AI techniques, particularly machine learning and deep learning, can extract valuable insights and patterns from massive datasets, facilitating data analysis and decision-making processes. AI models can autonomously identify trends, correlations, and anomalies in data, thereby enabling organizations to make data-driven decisions.

Data Storage and Retrieval

AI algorithms can optimize data storage and retrieval processes. AI can analyze historical data access patterns, predict future data requirements, and automatically alter data placement strategies to improve overall data access performance.

Natural Language Processing (NLP)

NLP, a subfield of AI, enables the understanding and processing of human language. It can be applied to data management tasks like data querying and data annotation. NLP-powered tools can interpret user queries in plain language and automatically generate corresponding SQL statements or execute database searches.

Here's a simple example of Natural Language Processing using Python's NLTK library:

Python

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Uncomment on first run to fetch the required NLTK resources:
# nltk.download("punkt")
# nltk.download("stopwords")

# Example text
text = "Natural Language Processing (NLP) is a subfield of Artificial Intelligence."

# Tokenization: split the text into sentences and words
sentences = sent_tokenize(text)
words = word_tokenize(text)

# Stopword removal: remove common words that don't carry much meaning
stop_words = set(stopwords.words("english"))
filtered_words = [word for word in words if word.casefold() not in stop_words]

# Stemming: reduce words to their base or root form
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in filtered_words]

# Printing the results
print("Original text: \n", text)
print("\nSentences: \n", sentences)
print("\nWords: \n", words)
print("\nFiltered words: \n", filtered_words)
print("\nStemmed words: \n", stemmed_words)

Output:

Plain Text

Original text:
 Natural Language Processing (NLP) is a subfield of Artificial Intelligence.

Sentences:
 ['Natural Language Processing (NLP) is a subfield of Artificial Intelligence.']

Words:
 ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'subfield', 'of', 'Artificial', 'Intelligence', '.']

Filtered words:
 ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'subfield', 'Artificial', 'Intelligence', '.']

Stemmed words:
 ['natur', 'languag', 'process', '(', 'nlp', ')', 'subfield', 'artifici', 'intellig', '.']

In this example, we tokenize the text into sentences and words using NLTK's `sent_tokenize` and `word_tokenize` functions. Then, we remove stop words like "is", "a", "of", etc., from the list of words using NLTK's `stopwords` corpus. Finally, we apply stemming to reduce words to their base or root form using the `PorterStemmer` algorithm from the `stem` module. Please note that this is just a basic example to demonstrate some of the common NLP techniques. NLP can include various other tasks, such as part-of-speech tagging, named entity recognition, sentiment analysis, and more.

Conclusion

AI and data management are interconnected, with AI technologies assisting in various aspects of data integration, cleansing, governance, security, analysis, and retrieval. The adoption of AI in data management can lead to improved data quality, governance, security, and decision-making capabilities, ultimately enhancing business performance.
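To ground the data cleansing ideas above in something concrete, here is a small, hypothetical pandas sketch that flags duplicates, fills missing values, and converts a date column to a consistent type; the column names and sample records are made up for illustration.

Python

import pandas as pd

# A tiny, made-up customer table with a duplicate row and a missing value.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "country": ["IT", "FR", "FR", None],
    "signup_date": ["2023-01-05", "2023-02-05", "2023-02-05", "2023-03-20"],
})

# Flag and drop exact duplicate records.
print("Duplicates found:", df.duplicated().sum())
df = df.drop_duplicates()

# Fill missing values with an explicit placeholder so they are easy to audit.
df["country"] = df["country"].fillna("UNKNOWN")

# Convert the date column from strings to a proper datetime type.
df["signup_date"] = pd.to_datetime(df["signup_date"])

print(df)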
In today's data-driven world, organizations are grappling with managing the ever-increasing volume of data they collect. In order to derive insights and make informed decisions, businesses must break down data silos and create a unified view of their data. This is where Data Mesh comes in - a revolutionary new approach to data architecture that is changing the game for enterprises.

What Is Data Mesh?

Data Mesh is a decentralized approach to data architecture that promotes the autonomy and scalability of data domains within an organization. It is a response to the traditional centralized data architecture, which relies on a single data platform and a centralized team to manage all data-related activities. In a Data Mesh architecture, data is treated as a product and is owned by the domain that generates it. Each domain has its own team responsible for collecting, processing, and analyzing data. These domain teams operate autonomously and have the freedom to choose the tools and technologies that best fit their needs. The teams are also accountable for the quality and accuracy of their data.

Why Data Mesh and Why Now?

Traditional centralized data management approaches typically involve a single team or department responsible for managing all the data across an organization. This centralized team is often responsible for designing and managing a shared data architecture, defining data models, and ensuring data quality. However, this approach can result in siloed data because each team or department has its own unique data needs and requirements. As a result, teams may create their own data sets and systems that are not integrated with the centralized data architecture. This can lead to duplication of data, inconsistencies, and errors across different data sets. Additionally, traditional centralized approaches can lead to slow decision-making processes because data requests and analysis often need to be routed through the centralized team. This can create bottlenecks, delays, and frustration for teams who need to access data quickly to make informed decisions.

One of the main benefits of Data Mesh is that it enables organizations to scale their data infrastructure in a more efficient and effective way. With traditional centralized data architecture, scaling can become a bottleneck due to the need for a centralized team to manage everything. This can slow down data processing and limit the organization's ability to extract insights in a timely manner. Data Mesh, on the other hand, allows for independent scaling of data domains, making it easier to scale data infrastructure as the organization grows. Data Mesh also promotes collaboration and cross-functional teams within an organization. By breaking down data silos, different teams can share data more easily and collaborate on data-related projects. This can lead to faster innovation and improved decision-making.

What Changes With Data Mesh?

Data mesh revolutionizes the landscape of analytical data management, introducing multidimensional transformations in both technical and organizational aspects. This paradigm calls for a fundamental reevaluation of assumptions, architecture, technical solutions, and social structures within organizations, reshaping how analytical data is managed, utilized, and owned. From an organizational perspective, data mesh drives a shift away from centralized data ownership, previously held by specialized professionals managing data platform technologies.
Instead, it advocates for a decentralized data ownership model, empowering business domains to assume ownership and accountability for the data they generate or utilize. Architecturally, data mesh moves away from the conventional approach of collecting data in monolithic warehouses and lakes. Instead, it embraces a distributed mesh framework that interconnects data through standardized protocols, fostering a more agile and adaptable data ecosystem. Technologically, data mesh departs from treating data as a byproduct of pipeline code execution. Instead, it promotes solutions that treat data and the code responsible for its maintenance as a cohesive and dynamic unit, recognizing the intrinsic relationship between the two.

Operationally, data governance undergoes a profound transformation. The traditional top-down, centralized operational model with manual interventions is replaced by a federated model, incorporating computational policies embedded in the nodes of the data mesh. This approach ensures more efficient and autonomous data governance. At its core, data mesh redefines the value system associated with data. Rather than viewing data as a mere asset to be collected, it embraces the perspective of data as a product designed to serve and delight data users, both within and outside the organization. Furthermore, data mesh extends its influence to the infrastructure level. It transcends the fragmented and point-to-point integration of infrastructure services, previously segregated into separate realms for data and analytics on one side and applications and operational systems on the other. Instead, data mesh advocates for a comprehensive and well-integrated infrastructure that caters to both operational and data systems, promoting synergistic efficiency.

Data Mesh: What it changes

Data Mesh Building Blocks

Domain-oriented data ownership: In Data Mesh, data ownership is decentralized and distributed across different domains or business units. Each domain has its own unique data needs and requirements and is responsible for managing its own data. This ensures that data is more aligned with business needs and can be managed more efficiently.

Data products and services: Data Mesh treats data as a product that can be consumed by other teams and domains. Each domain is responsible for creating and managing its own data products and services, which can be consumed by other domains. This creates a more modular and scalable approach to data management.

Federated data governance: Data Mesh decentralizes data governance, with each domain responsible for defining its own data policies and standards. However, a federated data governance model is still maintained, which ensures that data is managed consistently across domains and is compliant with relevant regulations and policies.

Self-service data infrastructure: Data Mesh promotes a self-service approach to data infrastructure, with each domain responsible for managing its own data infrastructure needs. This includes selecting and managing relevant data storage, compute, and processing tools. This enables teams to select the tools that best fit their specific needs and requirements.

Data Mesh: Building blocks

Implementation of Data Mesh on AWS

Identify data domains: The first step is to identify the different data domains within your organization. Each domain represents a distinct area of business focus with its own unique data needs and requirements. You can use tools such as AWS Glue to discover and catalog data assets across your organization (a minimal sketch follows below).
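As a hedged illustration of this discovery step, the sketch below uses boto3 to create and run a Glue crawler for one domain's raw data. The crawler name, IAM role, database, S3 path, and region are placeholders, and the exact configuration will differ per organization.

Python
import boto3

glue = boto3.client("glue", region_name="eu-west-1")  # region is an assumption

# Hypothetical names: replace with your own role, database, and bucket.
glue.create_crawler(
    Name="sales-domain-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="sales_domain_catalog",
    Targets={"S3Targets": [{"Path": "s3://sales-domain-raw-data/"}]},
)

# Run the crawler; discovered tables land in the Glue Data Catalog,
# where other domains can find and query them (for example, via Athena).
glue.start_crawler(Name="sales-domain-crawler")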
Decentralize data ownership: In Data Mesh, each domain is responsible for managing its own data. To implement this on AWS, you can use AWS Organizations to create separate accounts for each domain, with each account having its own unique set of permissions and access controls.

Define data products and services: Each domain is responsible for creating and managing its own data products and services. To do this on AWS, you can use tools such as AWS Lambda and Amazon API Gateway to create serverless APIs that expose data products and services to other domains (a minimal sketch appears at the end of this article).

Federated data governance: Although data ownership and management are decentralized in Data Mesh, a federated data governance model is still maintained. To implement this on AWS, you can use AWS Lake Formation to define data policies and standards that are consistent across domains.

Self-service data infrastructure: Each domain is responsible for managing its own data infrastructure needs. To implement this on AWS, you can use tools such as Amazon S3, Amazon Redshift, and Amazon EMR to create scalable and flexible data storage, compute, and processing infrastructure.

Mesh architecture: In Data Mesh, data is managed through a mesh architecture that connects different domains and teams. To implement this on AWS, you can use AWS App Mesh to create a service mesh that connects different data products and services across domains.

Challenges of Data Mesh

While Data Mesh has many benefits, it is not without its challenges. One of the main challenges is that it requires a significant cultural shift within an organization. The domain teams must have a high level of autonomy and accountability, which may be difficult for some organizations to embrace. Another challenge is that Data Mesh can be more complex to set up and manage compared to traditional centralized data architecture. Each domain team may have different data requirements, tools, and technologies, which can create additional complexity.

Conclusion

In summary, Data Mesh is a new approach to data architecture that is changing the game for enterprises. By promoting autonomy and scalability of data domains, Data Mesh enables organizations to scale their data infrastructure more efficiently and effectively. However, implementing a Data Mesh architecture requires a significant cultural shift and can be more complex to manage compared to traditional centralized data architecture. Despite the challenges, the benefits of Data Mesh are compelling, and organizations that embrace it are likely to be more successful in the long run.
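As a small, hedged illustration of the "Define data products and services" step above, here is what a Lambda function behind an Amazon API Gateway proxy integration could look like for a domain-owned data product. The dataset, names, and response shape are hypothetical; a real implementation would query the domain's own store rather than return static data.

Python
import json

# Hypothetical handler for an API Gateway (proxy integration) endpoint that
# exposes a read-only "orders" data product owned by the sales domain.
def lambda_handler(event, context):
    # In a real implementation this would query the domain's store
    # (e.g., Athena, Redshift, or DynamoDB); here we return a static sample.
    orders = [
        {"order_id": "1001", "amount": 42.50, "currency": "USD"},
        {"order_id": "1002", "amount": 17.99, "currency": "EUR"},
    ]
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"product": "sales.orders", "items": orders}),
    }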
Here, I am going to show the power of the SQL Loader + Unix script combination, where multiple data files can be loaded by SQL Loader through an automated shell script. This is useful when dealing with large chunks of data that need to be moved from one system to another, and it is well suited to a migration project involving large volumes of historical data. In such cases it is not practical to run SQL Loader manually for each file and wait until it is loaded. The better option is to keep a Unix program containing the SQL Loader command running all the time: as soon as a file lands in the folder location, the script picks it up and starts processing immediately.

The Setup

I built the sample program on a MacBook, where the Oracle installation differs from the one on a Windows machine. Please go through the video that contains the detailed steps of how to install Oracle on a MacBook, and get SQL Developer with Java 8 compliance. Now let us demonstrate the example.

Loading Multiple Data Files Into an Oracle DB Table

Because it is a MacBook, I have to do all the work inside the Oracle virtual machine. The diagram below shows how SQL Loader works.

Use Case

We need to load millions of students' records into the STUDENT table using shell script + SQL Loader automation. The script will run all the time on the Unix server and poll for .dat files; once a DAT file is in place, it will process it. In addition, any bad data needs to be identified separately. This type of example is useful in a migration project where millions of historical records need to be loaded. The old system periodically generates a live feed (DAT file) and sends it to the new system's server, where the automated Unix script loads it into the database.

Now let's run the script. The script can run all the time on a Unix server. To achieve this, the whole logic is wrapped in the loop below:

Plain Text
while true
do
  [some logic]
done

The Process

1. I have copied all the files + folder structure into the folder below.

/home/oracle/Desktop/example-SQLdr

2. Refer to the file listing below (ls -lrth):

Shell
-rwxr-xr-x. 1 oracle oinstall  147 Jul 23  2022 student.ctl
-rwxr-xr-x. 1 oracle oinstall   53 Jul 23  2022 student_2.dat
-rwxr-xr-x. 1 oracle oinstall  278 Dec  9 12:42 student_1.dat
drwxr-xr-x. 2 oracle oinstall   48 Dec 24 09:46 BAD
-rwxr-xr-x. 1 oracle oinstall 1.1K Dec 24 10:10 TestSqlLoader.sh
drwxr-xr-x. 2 oracle oinstall   27 Dec 24 11:33 DISCARD
-rw-------. 1 oracle oinstall 3.5K Dec 24 11:33 nohup.out
drwxr-xr-x. 2 oracle oinstall 4.0K Dec 24 11:33 TASKLOG
-rwxr-xr-x. 1 oracle oinstall    0 Dec 24 12:25 all_data_file_list.unx
drwxr-xr-x. 2 oracle oinstall    6 Dec 24 12:29 ARCHIVE

3. As shown below, there is no data in the student table.

4. Now run the script with nohup ./TestSqlLoader.sh so that it keeps running on the Unix server (its output is captured in nohup.out).

5. The script will now run and load the two .dat files through SQL Loader.

6. The table should be loaded with the content of the two files.

7. Now I delete the table data again. Just to prove the script is running all the time on the server, I will move the two DAT files from ARCHIVE back into the current directory.

8. Again place the two data files in the current directory.

Shell
-rwxr-xr-x. 1 oracle oinstall  147 Jul 23  2022 student.ctl
-rwxr-xr-x. 1 oracle oinstall   53 Jul 23  2022 student_2.dat
-rwxr-xr-x. 1 oracle oinstall  278 Dec  9 12:42 student_1.dat
drwxr-xr-x. 2 oracle oinstall   48 Dec 24 09:46 BAD
-rwxr-xr-x. 1 oracle oinstall 1.1K Dec 24 10:10 TestSqlLoader.sh
drwxr-xr-x. 2 oracle oinstall   27 Dec 24 12:53 DISCARD
-rw-------. 1 oracle oinstall 4.3K Dec 24 12:53 nohup.out
drwxr-xr-x. 2 oracle oinstall 4.0K Dec 24 12:53 TASKLOG
-rwxr-xr-x. 1 oracle oinstall    0 Dec 24 13:02 all_data_file_list.unx
drwxr-xr-x. 2 oracle oinstall    6 Dec 24 13:03 ARCHIVE

9. See, the student table has been loaded with all the data again.

10. The script is running all the time on the server:

Shell
[oracle@localhost example-sqldr]$ ps -ef|grep Test
oracle 30203     1  0 12:53 ?        00:00:00 /bin/bash ./TestSqlLoader.sh
oracle 31284 31227  0 13:06 pts/1    00:00:00 grep --color=auto Test

Full Source Code for Reference

Shell
#!/bin/bash

bad_ext='.bad'
dis_ext='.dis'
data_ext='.dat'
log_ext='.log'
log_folder='TASKLOG'
arch_loc="ARCHIVE"
bad_loc="BAD"
discard_loc="DISCARD"
now=$(date +"%Y.%m.%d-%H.%M.%S")
log_file_name="$log_folder/TestSQLLoader_$now$log_ext"

while true; do
  # Collect the list of .dat files currently present in the folder.
  ls -a *.dat 2>/dev/null > all_data_file_list.unx
  for i in `cat all_data_file_list.unx`
  do
    #echo "The data file name is :-- $i"
    data_file_name=`basename $i .dat`
    echo "Before executing the sql loader command ||Starting of the script" > $log_file_name
    # Load the current data file through SQL Loader.
    sqlldr userid=hr/oracle@orcl control=student.ctl errors=15000 log=$i$log_ext bindsize=512000000 readsize=500000 DATA=$data_file_name$data_ext BAD=$data_file_name$bad_ext DISCARD=$data_file_name$dis_ext
    # Move the processed data, bad, discard, and log files to their folders.
    mv $data_file_name$data_ext $arch_loc 2>/dev/null
    mv $data_file_name$bad_ext $bad_loc 2>/dev/null
    mv $data_file_name$dis_ext $discard_loc 2>/dev/null
    mv $data_file_name$data_ext$log_ext $log_folder 2>/dev/null
    echo "After Executing the sql loader command||File moved successfully" >> $log_file_name
  done
  ## Halt the processing for 1 minute before polling again.
  sleep 1m
done

The CTL file is below.

SQL
OPTIONS (SKIP=1)
LOAD DATA
APPEND
INTO TABLE student
FIELDS TERMINATED BY '|' OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
(
  id,
  name,
  dept_id
)

The SQL Loader Specification

control - Name of the .ctl file
errors=15000 - Maximum number of errors SQL Loader can allow
log=$i$log_ext - Name of the log file
bindsize=512000000 - Max size of the bind array
readsize=500000 - Max size of the read buffer
DATA=$data_file_name$data_ext - Name and location of the data file
BAD=$data_file_name$bad_ext - Name and location of the bad file
DISCARD=$data_file_name$dis_ext - Name and location of the discard file

In the way stated above, millions of records can be loaded through SQL Loader + a Unix script in an automated way, and the above parameters can be set according to the need. Please let me know if you like this article.
In today's data-driven world, ensuring individual data privacy has become critical as organizations rely on extensive data for decision-making, research, and customer engagement. Data anonymization is a technique that transforms personal data to safeguard individuals' information while maintaining its utility. This balance allows organizations to leverage data without compromising privacy. The rise of Big Data and Advanced Analytics has heightened the necessity for efficient anonymization methods. In this first article of our series on ensuring data privacy using data anonymization techniques, we will explore the importance of data anonymization, its ethical and legal implications, and its challenges. The following articles will review key data anonymization techniques and their advantages and limitations.

Importance of Privacy-Preserving Techniques

The need for privacy-preserving techniques is present in various sectors. In healthcare, anonymized data is crucial for research and treatment development while protecting patient confidentiality. In finance, anonymization combats fraud while respecting customer privacy. One example is the use of anonymized mobile data during the COVID-19 pandemic, where governments tracked the virus's spread while ensuring user locations remained unidentifiable. Robust anonymization is necessary to prevent privacy breaches and serve the public good. As the world becomes more connected and data-centric, protecting personal information while utilizing data becomes complex. Data anonymization is a pathway to harness data ethically and with privacy in mind. Data privacy laws, technological advancements, and public awareness have added layers to the significance and application of anonymization techniques. This article aims to clarify these techniques, assess their effectiveness, and emphasize their critical role in the modern data ecosystem.

Understanding Data Anonymization

Data anonymization is the process of safeguarding personal information so that individuals cannot be identified. The technique aims to guarantee confidentiality while preserving the value of data for analysis and decision-making. Anonymization techniques like data masking, pseudonymization, aggregation, and data perturbation obscure identifying details. The ultimate goal is to create a version of the data where individual identities are secure, yet the data remains valuable for purposes like research, statistical analysis, and business planning.

The Balance Between Data Utility and Privacy

Balancing data utility and privacy is a nuanced and critical aspect of data anonymization. For example, a healthcare organization may anonymize patient records for research. While removing direct identifiers like names and social security numbers is essential, the data must retain enough detail (like age, gender, and medical history) to be useful for medical research. Over-anonymization can strip the data of usefulness, rendering it ineffective for the intended analysis. Conversely, insufficient anonymization risks exposing personal details, leading to privacy breaches. Hence, finding the right balance is critical to successfully applying data anonymization.

Legal and Ethical Considerations in Data Anonymization

Legal and ethical considerations are crucial in shaping data anonymization practices. GDPR in the EU and HIPAA in the US are regulatory frameworks that create the guidelines for managing personal data. HIPAA mandates rigorous data anonymization to protect patient privacy.
These frameworks ensure organizations maintain high standards of privacy and ethical conduct. Compliance with these laws is both a legal duty and an ethical responsibility to uphold the trust individuals place in organizations when sharing personal information.

Challenges in Data Anonymization

Technical Challenges in Implementation

Implementing data anonymization techniques presents many technical challenges that demand careful deliberation and expertise. One key obstacle is determining the optimal level of anonymization: a deep understanding of the data's structure and the potential for re-identification is required when employing techniques such as k-anonymity, l-diversity, or differential privacy. Scalability poses another hurdle. As data volumes continue to grow, applying anonymization techniques effectively without unduly compromising performance becomes increasingly difficult. Further difficulties arise from the varied nature of data types, from structured data in databases to unstructured data in documents and images. Additionally, keeping pace with ever-evolving data formats and sources necessitates constant updates and adaptations of anonymization strategies.

Impact on Data Quality and Utility

Data anonymization can significantly impact the quality and utility of the data. Over-anonymization can strip away too much information, rendering the data less useful for analysis or decision-making. For instance, in healthcare research, excessive anonymization might remove vital details crucial for epidemiological studies. Conversely, under-anonymization risks privacy breaches. Finding the right balance is critical but challenging. Anonymization can also introduce biases into the data, as certain attributes may be disproportionately affected. This can lead to skewed results in data analysis, particularly in machine learning models, where the quality and representativeness of data are paramount.

Future-Oriented Challenges

Data anonymization also intersects with AI and Big Data, and this intersection poses a significant challenge. AI algorithms can uncover patterns in data that compromise anonymization efforts, and the sheer volume of data in the era of Big Data amplifies the difficulty of anonymization. More sophisticated techniques are needed to withstand advanced AI algorithms, and anonymization practices must adapt to evolving technology and comply with emerging standards.

Conclusion

Data anonymization is crucial for data privacy, with both opportunities and challenges. Its role in protecting privacy and enabling data analysis cannot be overstated. Anonymizing data effectively is complex, requiring technical expertise and careful consideration of both data utility and privacy. The field continuously evolves with AI and Big Data advancements and with legal and ethical frameworks. Navigating these challenges demands expertise and awareness. Developing robust and ethical anonymization practices is essential for maximizing data potential and upholding privacy rights.
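As a brief preview of the techniques named above (masking and pseudonymization in particular), here is a minimal, hedged sketch in Python using pandas and hashlib. The column names, records, and salt are hypothetical, and a real deployment would also require key management and a re-identification risk assessment.

Python
import hashlib
import pandas as pd

# Hypothetical patient records (column names are illustrative only).
df = pd.DataFrame({
    "name": ["Alice Smith", "Bob Jones"],
    "ssn": ["123-45-6789", "987-65-4321"],
    "age": [34, 58],
    "diagnosis": ["asthma", "diabetes"],
})

SALT = "replace-with-a-secret-salt"  # assumption: the salt is managed outside the code

def pseudonymize(value: str) -> str:
    """Replace an identifier with a salted, one-way hash (pseudonymization)."""
    return hashlib.sha256((SALT + value).encode()).hexdigest()[:16]

anonymized = df.copy()
anonymized["patient_id"] = anonymized["name"].map(pseudonymize)
anonymized = anonymized.drop(columns=["name"])               # remove the direct identifier
anonymized["ssn"] = "***-**-" + anonymized["ssn"].str[-4:]   # masking: keep only the last 4 digits

print(anonymized)  # retains age/diagnosis for analysis while hiding who the patients are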
If you're familiar with the FinTech startup industry, you may have heard of Revolut, a well-known FinTech giant based in London, UK. Founded in 2015, Revolut has garnered substantial investments and become one of the fastest-growing startups in the UK, providing banking services to many European citizens. While banking operations are often shrouded in mystery when it comes to how they generate revenue, some key figures about Revolut for the years 2020 and 2021 have shed some light on their income sources. A significant portion of this neobank's revenue comes from Foreign Exchange (FX), wealth management (including cryptocurrencies), and card services. Notably, in 2021, FX became the most profitable sector.

A friend of mine, who is also a software engineer, once shared an intriguing story about his technical interview at Revolut's Software Engineering department a few years back. He was tasked with developing an algorithm to identify the most profitable way to convert two currencies using one or multiple intermediate currencies. In other words, they were looking for a strategy for Currency Arbitrage. Currency Arbitrage is a trading strategy wherein a currency trader leverages different spreads offered by brokers for a particular currency pair through multiple trades. It was explicitly mentioned in the task that the algorithm's foundation must be rooted in graph theory.

FX Basics

FX, or Foreign Exchange, plays a pivotal role in global trade, underpinning the functioning of our interconnected world. It's evident that FX also plays a substantial role in making banks some of the wealthiest organizations. The profit generated from foreign exchange is primarily the difference, or spread, between the buying (BID) and selling (ASK) prices. While this difference might appear minuscule per transaction, it can accumulate into millions of dollars in profits, given the volume of daily operations. This allows some companies to thrive solely on these highly automated financial operations.

In the realm of FX, we always work with pairs of currencies, such as EUR/USD. In most cases, these exchanges are bidirectional (i.e., EUR/USD and USD/EUR), and the exchange rate value differs in each direction. An Arbitrage Pair represents a numerical ratio between the values of two currencies (EUR and the US Dollar, for example), determining the exchange rate between them. Potentially, we can use multiple intermediate currencies for profitable trading, known as a sure bet. An arbitrage sure bet is a set of pairs to be used in a circular manner. Many providers employ mathematical modeling and analysis to secure their own profits and prevent others from profiting from them; hence, the word "potentially" is emphasized here. Sure bet length refers to the number of pairs that constitute a set of potential arbitrage opportunities.

In the real world, exchange rates can vary among different banks or exchange platforms. It's not uncommon for tourists to traverse a city to find the best possible rate. With computer software, this process can be accomplished within milliseconds when you have access to a list of providers. In practice, profitable trades might involve multiple steps, with conversions through various currencies across different exchange platforms. In other words, the Arbitrage Circle can be quite extensive. An Arbitrage Circle entails acquiring a currency, transferring it to another platform, conducting an exchange for other currencies, and ultimately returning to the original currency.
The exchange rate between two currencies via one or more intermediate currencies is calculated as the product of the exchange rates of these intermediate transactions.

An Example

For example, let's imagine we want to buy Swiss Francs for US Dollars, then exchange Francs for Japanese Yen, and then sell Yen for US Dollars again. In Autumn 2023, we have the following exchange rates:

We can buy 0.91 CHF (Swiss Franc) for 1 USD.
We can buy 163.16 Japanese Yen for 1 CHF.
We can buy 0.0067 USD for 1 Japanese Yen.

Let's present it with a table:

Shell
     1 USD      |      1 CHF        |    1 YEN
   0.91 CHF     |   163.16 YEN      |  0.0067 USD
----------------|-------------------|--------------
  1.098901099   |   0.006128953     |  149.2537313

Now, we need to find the product of those values. A sequence of transactions becomes profitable when this product yields a value of less than one:

1.098901099 * 0.006128953 * 149.2537313 = 1.005240803

As we can see, the result is larger than one, so it looks like we lost money. But how much exactly? We can work it out like this:

0.91 CHF * 163.16 (YEN per 1 CHF) * 0.0067 (USD per 1 YEN) = 0.99478652 US Dollars

So, after selling 1 US Dollar in the beginning, we end up with about 0.995 US Dollars, roughly 0.5% less than we started with. In simpler terms, an Arbitrage Cycle is profitable when one unit of currency can be obtained for less than one unit of the same currency.

Let's imagine we have found an opportunity to get 0.92 CHF per 1 US Dollar in the initial transaction instead of 0.91 CHF:

Shell
     1 USD      |      1 CHF        |    1 YEN
   0.92 CHF     |   163.16 YEN      |  0.0067 USD
----------------|-------------------|--------------
  1.086956522   |   0.006128953     |  149.2537313

The product will be less than 1:

1.086956522 * 0.006128953 * 149.2537313 = 0.994314272

This means, in real currencies, it will give us more than 1 US Dollar:

0.92 CHF * 163.16 (YEN per 1 CHF) * 0.0067 (USD per 1 YEN) = 1.00571824 US Dollars

Voila, we got some PROFIT! Now, let's see how to automate this using graph analysis. The formula to check for profit or loss in an Arbitrage Circle of 3 Arbitrage Pairs looks like this:

USD/CHF * CHF/YEN * YEN/USD < 1.0

Graph Representation

To automate these processes, we can use graphs. The tables mentioned earlier can be naturally transformed into a matrix representation of a graph, where nodes represent currencies and edges represent bidirectional exchanges. Hence, it is straightforward to represent a two-currency exchange in a matrix like this:

Shell
        EUR   USD
  EUR    1     1
  USD    1     1

Depending on the number of pairs involved, our matrix can expand:

Shell
        EUR   USD   YEN   CHF
  EUR    1     1     1     1
  USD    1     1     1     1
  YEN    1     1     1     1
  CHF    1     1     1     1

Consequently, our table can become considerably larger, even for just two currencies, if we take into account more exchange platforms and resources. To address real currency arbitrage problems, a complete graph that encompasses all relationships for currency quotes is often utilized. A three-currency exchange table might appear as follows:

          USD     CHF     YEN
  USD  {  1.0,    1.10,   0.0067 }
  CHF  {  0.91,   1.0,    0.0061 }
  YEN  {  148.84, 163.16, 1.0    }

We can employ a simple graph data structure to represent our currency pairs in memory:

C++
class GraphNode
{
public:
    string Name;
};

class Graph
{
public:
    vector<vector<double>> Matrix;
    vector<GraphNode> Nodes;
};

Now, we only need to find out how to traverse this graph and find the most profitable circle. But there is still one problem...
Math Saves Us Again

Classical graph algorithms are not well-suited for working with the product of edge lengths because they are designed to find paths defined as the sum of these lengths (see implementations of any well-known classic path-finding algorithms: BFS, DFS, Dijkstra, or even A-Star). However, to circumvent this limitation, there is a mathematical way to transition from a product to a sum: logarithms. If a product appears under a logarithm, it can be converted into a sum of logarithms. Since the product we are looking for must be less than one, the sum of its logarithms must be less than zero:

LogE(USD/CHF) + LogE(CHF/YEN) + LogE(YEN/USD) < 0.0

This simple mathematical trick allows us to shift from searching for a cycle with an edge-length product less than one to searching for a cycle where the sum of the edge lengths is less than zero. Our matrix values, converted to LogE(x) and rounded to two digits after the point, now look like this:

          USD     CHF     YEN
  USD  {  0.0,    0.1,    -5.01 }
  CHF  { -0.09,   0.0,    -5.1  }
  YEN  {  5.0,    5.09,    0.0  }

Now, this problem becomes more solvable using classical graph algorithms. What we need is to traverse the graph, looking for the most profitable path of exchange.

Graph Algorithms

Every algorithm has its limitations. I mentioned some of them in my previous article. We cannot apply classical BFS, DFS, or even Dijkstra here because our graph may contain negative weights, which can produce negative cycles as the algorithm traverses the graph. Negative cycles pose a challenge since the algorithm keeps finding better solutions on each iteration. To address this issue, the Bellman-Ford algorithm simply limits the number of iterations: it traverses each edge of the graph in a cycle and applies relaxation for all edges no more than V-1 times (where V is the number of nodes). As such, the Bellman-Ford algorithm lies at the heart of this arbitrage system, as it enables the discovery of paths between two nodes in the graph that meet two essential criteria: they contain negative weights and are not part of negative cycles. While this algorithm is theoretically straightforward (and you can find billions of videos about it), practical implementation for our needs requires some effort. Let's dig into it.

Bellman-Ford Algorithm Implementation

As the aim of this article is computer science, I will use imaginary exchange rates that have nothing to do with the real ones.
For a smoother introduction to the algorithm, let's use a graph that doesn't contain negative cycles at all:

C++
graph.Nodes.push_back({ "USD" });
graph.Nodes.push_back({ "CHF" });
graph.Nodes.push_back({ "YEN" });
graph.Nodes.push_back({ "GBP" });
graph.Nodes.push_back({ "CNY" });
graph.Nodes.push_back({ "EUR" });

// Define exchange rates for pairs of currencies below
//                  USD   CHF   YEN   GBP   CNY    EUR
graph.Matrix = { {  0.0,  0.41, INF,  INF,  INF,   0.29 },   // USD
                 {  INF,  0.0,  0.51, INF,  0.32,  INF  },   // CHF
                 {  INF,  INF,  0.0,  0.50, INF,   INF  },   // YEN
                 {  0.45, INF,  INF,  0.0,  INF,  -0.38 },   // GBP
                 {  INF,  INF,  0.32, 0.36, 0.0,   INF  },   // CNY
                 {  INF, -0.29, INF,  INF,  0.21,  0.0  } }; // EUR

The code example below finds a path between two nodes using the Bellman-Ford algorithm when the graph lacks negative cycles:

C++
vector<double> _shortestPath;
vector<int> _previousVertex;

void FindPath(Graph& graph, int start)
{
    int verticesNumber = graph.Nodes.size();

    _shortestPath.resize(verticesNumber, INF);
    _previousVertex.resize(verticesNumber, -1);
    _shortestPath[start] = 0;

    // For each vertex, apply relaxation for all the edges V - 1 times.
    for (int k = 0; k < verticesNumber - 1; k++)
        for (int from = 0; from < verticesNumber; from++)
            for (int to = 0; to < verticesNumber; to++)
                if (_shortestPath[to] > _shortestPath[from] + graph.Matrix[from][to])
                {
                    _shortestPath[to] = _shortestPath[from] + graph.Matrix[from][to];
                    _previousVertex[to] = from;
                }
}

Running this code for the Chinese Yuan fills the _previousVertex array and yields results like this:

Path from 4 to 0 is : 4(CNY) 3(GBP) 0(USD)
Path from 4 to 1 is : 4(CNY) 3(GBP) 5(EUR) 1(CHF)
Path from 4 to 2 is : 4(CNY) 3(GBP) 5(EUR) 1(CHF) 2(YEN)
Path from 4 to 3 is : 4(CNY) 3(GBP)
Path from 4 to 4 is : 4(CNY)
Path from 4 to 5 is : 4(CNY) 3(GBP) 5(EUR)

As you can observe, it identifies optimal paths between CNY and various other currencies. And again, I will not focus on finding only the single best one, as that is a relatively simple task and not the goal of this article. The above implementation works well in ideal cases but falls short when dealing with graphs containing negative cycles.

Detecting Negative Cycles

What we truly need is the ability to identify whether a graph contains negative cycles and, if so, pinpoint the problematic segments. This knowledge allows us to mitigate these issues and ultimately discover profitable chains. The number of iterations doesn't always have to reach precisely V - 1. A solution is deemed found if, on the (N+1)-th cycle, no better path than the one on the N-th cycle is discovered. Thus, there's room for slight optimization. The code mentioned earlier can be enhanced to not only find paths but also detect whether the graph contains negative cycles, including the optimization I mentioned:

C++
vector<double> _shortestPath;
vector<int> _previousVertex;

bool ContainsNegativeCycles(Graph& graph, int start)
{
    int verticesNumber = graph.Nodes.size();

    _shortestPath.resize(verticesNumber, INF);
    _previousVertex.resize(verticesNumber, -1);
    _shortestPath[start] = 0;

    bool updated = false;

    // For each vertex, apply relaxation for all the edges V - 1 times.
    for (int k = 0; k < verticesNumber - 1; k++)
    {
        updated = false;
        for (int from = 0; from < verticesNumber; from++)
        {
            for (int to = 0; to < verticesNumber; to++)
            {
                if (_shortestPath[to] > _shortestPath[from] + graph.Matrix[from][to])
                {
                    _shortestPath[to] = _shortestPath[from] + graph.Matrix[from][to];
                    _previousVertex[to] = from;
                    updated = true;
                }
            }
        }

        if (!updated) // No changes in paths means we can finish earlier.
            break;
    }

    // Run one more relaxation step to detect which nodes are part of a negative cycle.
    if (updated)
        for (int from = 0; from < verticesNumber; from++)
            for (int to = 0; to < verticesNumber; to++)
                if (_shortestPath[to] > _shortestPath[from] + graph.Matrix[from][to])
                    // A negative cycle has occurred if we can find a better path beyond the optimal solution.
                    return true;

    return false;
}

Now let's play with a more intricate graph that includes negative cycles:

C++
graph.Nodes.push_back({ "USD" }); // 1 (Index = 0)
graph.Nodes.push_back({ "CHF" });
graph.Nodes.push_back({ "YEN" });
graph.Nodes.push_back({ "GBP" });
graph.Nodes.push_back({ "CNY" });
graph.Nodes.push_back({ "EUR" });
graph.Nodes.push_back({ "XXX" });
graph.Nodes.push_back({ "YYY" }); // 8 (Index = 7)

//                  USD   CHF   YEN   GBP   CNY   EUR   XXX   YYY
graph.Matrix = { {  0.0,  1.0,  INF,  INF,  INF,  INF,  INF,  INF },   // USD
                 {  INF,  0.0,  1.0,  INF,  INF,  4.0,  4.0,  INF },   // CHF
                 {  INF,  INF,  0.0,  INF,  1.0,  INF,  INF,  INF },   // YEN
                 {  INF,  INF,  1.0,  0.0,  INF,  INF,  INF,  INF },   // GBP
                 {  INF,  INF,  INF, -3.0,  0.0,  INF,  INF,  INF },   // CNY
                 {  INF,  INF,  INF,  INF,  INF,  0.0,  5.0,  3.0 },   // EUR
                 {  INF,  INF,  INF,  INF,  INF,  INF,  0.0,  4.0 },   // XXX
                 {  INF,  INF,  INF,  INF,  INF,  INF,  INF,  0.0 } }; // YYY

Our program simply halts and displays a message:

Graph contains negative cycle.

We were able to indicate the problem; however, we still need to navigate around the problematic segments of the graph.

Avoiding Negative Cycles

To accomplish this, we'll mark vertices that are part of negative cycles with a constant value, NEG_INF:

C++
bool FindPathsAndNegativeCycles(Graph& graph, int start)
{
    int verticesNumber = graph.Nodes.size();

    _shortestPath.resize(verticesNumber, INF);
    _previousVertex.resize(verticesNumber, -1);
    _shortestPath[start] = 0;

    for (int k = 0; k < verticesNumber - 1; k++)
        for (int from = 0; from < verticesNumber; from++)
            for (int to = 0; to < verticesNumber; to++)
            {
                if (graph.Matrix[from][to] == INF) // Edge does not exist
                {
                    continue;
                }

                if (_shortestPath[to] > _shortestPath[from] + graph.Matrix[from][to])
                {
                    _shortestPath[to] = _shortestPath[from] + graph.Matrix[from][to];
                    _previousVertex[to] = from;
                }
            }

    bool negativeCycles = false;

    for (int k = 0; k < verticesNumber - 1; k++)
        for (int from = 0; from < verticesNumber; from++)
            for (int to = 0; to < verticesNumber; to++)
            {
                if (graph.Matrix[from][to] == INF) // Edge does not exist
                {
                    continue;
                }

                if (_shortestPath[to] > _shortestPath[from] + graph.Matrix[from][to])
                {
                    _shortestPath[to] = NEG_INF;
                    _previousVertex[to] = -2;
                    negativeCycles = true;
                }
            }

    return negativeCycles;
}

Now, if we encounter NEG_INF in the _shortestPath array, we can display a message and skip that segment while still identifying optimal solutions for other currencies. For example, with Node 0 (representing USD):

Graph contains negative cycle.
Path from 0 to 0 is : 0(USD)
Path from 0 to 1 is : 0(USD) 1(CHF)
Path from 0 to 2 is : Infinite number of shortest paths (negative cycle).
Path from 0 to 3 is : Infinite number of shortest paths (negative cycle).
Path from 0 to 4 is : Infinite number of shortest paths (negative cycle).
Path from 0 to 5 is : 0(USD) 1(CHF) 5(EUR)
Path from 0 to 6 is : 0(USD) 1(CHF) 6(XXX)
Path from 0 to 7 is : 0(USD) 1(CHF) 5(EUR) 7(YYY)

Voila! Our code was able to identify a number of profitable chains despite the fact that our data was "a bit dirty." All the code examples mentioned above, including test data, are shared with you on my GitHub.

Even Little Fluctuations Matter

Let's now consolidate what we've learned.
Given a list of exchange rates for three currencies, we can easily detect negative cycles:

C++
graph.Nodes.push_back({ "USD" }); // 1 (Index = 0)
graph.Nodes.push_back({ "CHF" });
graph.Nodes.push_back({ "YEN" }); // 3 (Index = 2)

// LogE(x) table:    USD     CHF     YEN
graph.Matrix = { {   0.0,    0.489, -0.402 },   // USD
                 {  -0.489,  0.0,   -0.891 },   // CHF
                 {   0.402,  0.89,   0.0   } }; // YEN

from = 0;
FindPathsAndNegativeCycles(graph, from);

Result:

Graph contains negative cycle.
Path from 0 to 0 is : Infinite number of shortest paths (negative cycle).
Path from 0 to 1 is : Infinite number of shortest paths (negative cycle).
Path from 0 to 2 is : Infinite number of shortest paths (negative cycle).

However, even slight changes in the exchange rates (i.e., adjustments to the matrix) can lead to significant differences:

C++
// LogE(x) table:    USD     CHF     YEN
graph.Matrix = { {   0.0,    0.490, -0.402 },   // USD
                 {  -0.489,  0.0,   -0.891 },   // CHF
                 {   0.403,  0.891,  0.0   } }; // YEN

from = 0;
FindPathsAndNegativeCycles(graph, from);

Look, we have found a profitable chain:

Path from 0 to 0 is : 0(USD)
Path from 0 to 1 is : 0(USD) 2(YEN) 1(CHF)
Path from 0 to 2 is : 0(USD) 2(YEN)

We can apply these concepts to much larger graphs involving multiple currencies:

C++
graph.Nodes.push_back({ "USD" }); // 1 (Index = 0)
graph.Nodes.push_back({ "CHF" });
graph.Nodes.push_back({ "YEN" });
graph.Nodes.push_back({ "GBP" });
graph.Nodes.push_back({ "CNY" }); // 5 (Index = 4)

// LogE(x) table:    USD     CHF     YEN    GBP    CNY
graph.Matrix = { {   0.0,    0.490, -0.402, 0.7,   0.413 },   // USD
                 {  -0.489,  0.0,   -0.891, 0.89,  0.360 },   // CHF
                 {   0.403,  0.891,  0.0,   0.91,  0.581 },   // YEN
                 {   0.340,  0.405,  0.607, 0.0,   0.72  },   // GBP
                 {   0.403,  0.350,  0.571, 0.71,  0.0   } }; // CNY

from = 0;
FindPathsAndNegativeCycles(graph, from);

As a result, we might find multiple candidates for profit:

Path from 0 to 0 is : 0(USD)
Path from 0 to 1 is : 0(USD) 2(YEN) 1(CHF)
Path from 0 to 2 is : 0(USD) 2(YEN)
Path from 0 to 3 is : 0(USD) 2(YEN) 3(GBP)
Path from 0 to 4 is : 0(USD) 2(YEN) 4(CNY)

There are two important factors, though:

Time is a critical factor in implementing arbitrage processes, primarily due to the rapid fluctuations in currency prices. As a result, the lifespan of a sure bet is exceedingly brief.
Platforms levy commissions for each transaction. Therefore, minimizing time costs and reducing commissions are paramount, achieved by limiting the length of the sure bet.

Empirical experience suggests that an acceptable sure bet length typically ranges from 2 to 3 pairs. Beyond this, the computational requirements escalate, and trading platforms impose larger commissions. Thus, to make an income, it is not enough to have such technologies; you also need access to low commission rates. Usually, only large financial institutions have such a resource in their hands.

Automation Using Smart Contracts

I've delved into the logic of FX operations and how to derive profits from them, but I haven't touched upon the technologies used to execute these operations. While this topic slightly veers off-course, I couldn't omit to mention smart contracts. Using smart contracts is one of the most innovative ways to conduct FX operations today. Smart contracts enable real-time FX operations without delays or human intervention (except for the creation of the smart contract). Solidity is a specialized programming language for creating smart contracts that automate financial operations involving cryptocurrencies.
The world of smart contracts is dynamic and subject to rapid technological changes and evolving regulations. It's an area with considerable hype and significant risks related to wallets and legal compliance. While there are undoubtedly talented individuals and teams profiting from this field, there are also regulatory bodies striving to ensure market rules are upheld.

Why Are We Looking Into This?

Despite the complexity, obscurity, and unpredictability of global economics, Foreign Exchange remains a hidden driving force in the financial world. It's a crucial element that enables thousands of companies and millions of individuals worldwide to collaborate, provide services, and mutually benefit one another in a peaceful manner, transcending borders. Of course, various factors, such as politics, regulations, and central banks, influence exchange rates and FX efficiency. These complexities make the financial landscape intricate. Yet, it's essential to believe that these complexities serve a greater purpose for the common good.

Numerous scientific papers delve into the existence and determination of exchange rates in the global economy, to mention a few:

Importers, Exporters, and Exchange Rate Disconnect
Currency choice and exchange rate pass-through
Exchange rate puzzles and policies

These papers shed light on some fundamental mechanisms of foreign exchange, which are still hard to understand and fit into one model. However, playing with code and trying to find a solution to a practical problem gave me a little more insight into it. I hope you enjoyed this little exploration trip as much as I did. Stay tuned!

Links

Sedgewick R. — Algorithms in C, Part 5: Graph Algorithms
Bellman Ford Algorithm Code Implementation
William Fiset's GitHub examples — Bellman Ford On Adjacency Matrix
William Fiset's GitHub examples — Bellman Ford On Edge List
In the realm of data manipulation and analysis, the terms "pivoting" and "un-pivoting" play crucial roles in transforming raw data into meaningful insights. These operations are fundamental to reshaping datasets for better visualization, analysis, and interpretation. In this blog post, let's kick back, sip some coffee, and demystify these concepts that make your data dance to the right tunes.

Pivoting Data and Why Does It Matter?

Pivoting a column means aggregating identical values in that column, resulting in a new table orientation. The initial step involves sorting the table in ascending order based on the values present in the first column. This transformative process revolves around the art of rotating or transposing data, seamlessly converting rows into columns and vice versa. The beauty of pivoting lies in its ability to bring order to the chaos of scattered information. Picture a scenario where dates are laid out in rows, and sales figures are scattered across columns in a dataset. Pivoting enables you to elegantly rearrange the data landscape. With a simple pivot, dates effortlessly transform into columns, aligning perfectly with their corresponding sales figures. Within the domain of data analysis, leveraging pivoted data empowers you to effortlessly employ aggregation and filtering techniques. Embracing the prowess of Pivot not only optimizes your data for analysis but also clears the path for a more streamlined and user-friendly data entry process.

Key Characteristics of Pivoting

Aggregate Values: Pivoting is often used to aggregate or summarize data. By grouping and restructuring information, it becomes simpler to perform calculations or generate summary statistics.

Improved Readability: Pivoting can enhance the readability of data, especially when dealing with time series or categorical information. This makes it easier for analysts and decision-makers to identify trends or patterns.

Charting and Visualization: Many data visualization tools prefer data in a pivoted format. Pivoted data is often more compatible with charting libraries and makes it straightforward to create visual representations of trends and comparisons.

Un-Pivoting Data and What Does It Even Mean?

You’ve heard of pivoting data, but have you ever heard of un-pivoting before? In data manipulation, pivoting restructures data to enhance analysis, but its counterpart, un-pivoting, performs the inverse function. Un-pivoting (sometimes called flattening the data) takes data from a summarized or aggregated state back to its original, more detailed form. This reversal is particularly valuable when dealing with pre-aggregated data designed for reporting purposes. The significance of un-pivoting becomes apparent when considering the diverse angles from which data analysis can be approached. While pivoted data provides a convenient view for specific analyses, the un-pivoted form is essential for exploring data from various perspectives. Despite its importance, working with un-pivoted data can be a messy and intricate process. To illustrate, creating charts or reports often necessitates the use of un-pivoted data. This step is crucial for delving into the nuances of information, revealing patterns, and facilitating a comprehensive understanding of the dataset. In essence, while pivoting optimizes data for specific analyses, un-pivoting is the key to unlocking the full analytical potential, offering a more granular and nuanced view of the underlying information.
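Before moving on to the key characteristics of un-pivoting, here is a minimal, hedged sketch of both operations using pandas, independent of any specific tool discussed in this post; the column names and figures are made up purely for illustration.

Python
import pandas as pd

# Hypothetical long-format sales records.
sales = pd.DataFrame({
    "date": ["2024-01", "2024-01", "2024-02", "2024-02"],
    "region": ["North", "South", "North", "South"],
    "revenue": [100, 80, 120, 90],
})

# Pivoting: rows become a summarized, cross-tabulated view (dates as columns).
pivoted = sales.pivot_table(index="region", columns="date", values="revenue", aggfunc="sum")
print(pivoted)  # North -> 100, 120; South -> 80, 90

# Un-pivoting (melting): back to the detailed long format used for flexible analysis.
unpivoted = pivoted.reset_index().melt(id_vars="region", var_name="date", value_name="revenue")
print(unpivoted)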
Key Characteristics of Un-Pivoting

Detail Retrieval: Un-pivoting allows users to retrieve detailed information from aggregated datasets. For instance, if sales data is pivoted by product categories, un-pivoting can bring back the individual transactions associated with each category.

Database Normalization: Un-pivoting is an essential step in the normalization of databases. By breaking down aggregated data into its atomic components, databases can be structured more efficiently.

Flexible Analysis: Un-pivoted data provides a level of granularity that is crucial for certain types of analysis. Users can explore individual data points and relationships that might be lost in a pivoted summary.

Why Ottava Chose To Work With Pivoted Data

By this time, we hope you have already spent some time working with Ottava and seen what we've done with the data entry method; if you are new to Ottava, you can find out more about this no-code SaaS data analysis and visualization platform at ottava.io. Ottava's unique feature allows users to input raw data directly in a pivoted format (multi-dimensional summarized tables) rather than the conventional tabular data. This choice is driven by our understanding of the benefits and efficiency that such an approach brings to data management. The key reason Ottava enables users to work with pivoted data directly is to mitigate data entry errors that can arise from manual input. Moreover, Ottava seamlessly executes the un-pivoting command in the background, a crucial step often required before diving into data analysis. By automating this process, Ottava spares users the complexities of un-pivoting data, allowing them to focus on working directly with the pivoted data and the charts we've meticulously crafted. The already aggregated data in Ottava opens up a world of possibilities, presenting users with a curated collection of ready-to-use charts and visualizations tailored to this format. This not only simplifies the creation of dashboards and reports but also ensures that users can effortlessly transform their data into impactful visual representations. This aligns perfectly with Ottava's commitment to delivering a user-friendly experience, reducing human errors, and unlocking the full potential of data by transforming raw information into actionable insights.