Data Engineering Resources

DZone's Featured Data Engineering Resources

Enterprise AI

In recent years, artificial intelligence has become less of a buzzword and more of an adopted process across the enterprise. With that, there is a growing need to increase operational efficiency as customer demands arise. AI platforms have become increasingly more sophisticated, and there has become the need to establish guidelines and ownership.In DZone's 2022 Enterprise AI Trend Report, we explore MLOps, explainability, and how to select the best AI platform for your business. We also share a tutorial on how to create a machine learning service using Spring Boot, and how to deploy AI with an event-driven platform. The goal of this Trend Report is to better inform the developer audience on practical tools and design paradigms, new technologies, and the overall operational impact of AI within the business.This is a technology space that's constantly shifting and evolving. As part of our December 2022 re-launch, we've added new articles pertaining to knowledge graphs, a solutions directory for popular AI tools, and more.

Neural Network Representations

By Boluwatife Ben-Adeola

Trained neural networks arrive at solutions that achieve superhuman performance on an increasing number of tasks. It would be at least interesting and probably important to understand these solutions. Interesting, in the spirit of curiosity and getting answers to questions like, “Are there human-understandable algorithms that capture how object-detection nets work?”[a] This would add a new modality of use to our relationship with neural nets from just querying for answers (Oracle Models) or sending on tasks (Agent Models) to acquiring an enriched understanding of our world by studying the interpretable internals of these networks’ solutions (Microscope Models). [1] And important in its use in the pursuit of the kinds of standards that we (should?) demand of increasingly powerful systems, such as operational transparency, and guarantees on behavioral bounds. A common example of an idealized capability we could hope for is “lie detection” by monitoring the model’s internal state. [2] Mechanistic interpretability (mech interp) is a subfield of interpretability research that seeks a granular understanding of these networks. One could describe two categories of mech interp inquiry: Representation interpretability: Understanding what a model sees and how it does; i.e., what information have models found important to look for in their inputs and how is this information represented internally? Algorithmic interpretability: Understanding how this information is used for computation across the model to result in some observed outcome Figure 1: “A Conscious Blackbox," the cover graphic for James C. Scott’s Seeing Like a State (1998) This post is concerned with representation interpretability. Structured as an exposition of neural network representation research [b], it discusses various qualities of model representations which range in epistemic confidence from the obvious to the speculative and the merely desired. Notes: I’ll use "Models/Neural Nets" and "Model Components" interchangeably. A model component can be thought of as a layer or some other conceptually meaningful ensemble of layers in a network. Until properly introduced with a technical definition, I use expressions like “input-properties” and “input-qualities” in place of the more colloquially used “feature.” Now, to some foundational hypotheses about neural network representations. Decomposability The representations of inputs to a model are a composition of encodings of discrete information. That is, when a model looks for different qualities in an input, the representation of the input in some component of the model can be described as a combination of its representations of these qualities. This makes (de)composability a corollary of “encoding discrete information”- the model’s ability to represent a fixed set of different qualities as seen in its inputs. Figure 2: A model layer trained on a task that needs it to care about background colors (trained on only blue and red) and center shapes (only circles and triangles) The component has dedicated a different neuron to the input qualities: "background color is composed of red," "background color is composed of blue," "center object is a circle,” and “center object is a triangle.” Consider the alternative: if a model didn't identify any predictive discrete qualities of inputs in the course of training. To do well on a task, the network would have to work like a lookup table with its keys as the bare input pixels (since it can’t glean any discrete properties more interesting than “the ordered set of input pixels”) pointing to unique identifiers. We have a name for this in practice: memorizing. Therefore, saying, “Model components learn to identify useful discrete qualities of inputs and compose them to get internal representations used for downstream computation,” is not far off from saying “Sometimes, neural nets don’t completely memorize.” Figure 3: An example of how learning discrete input qualities affords generalization or robustness This example test input, not seen in training, has a representation expressed in the learned qualities. While the model might not fully appreciate what “purple” is, it’ll be better off than if it was just trying to do a table lookup for input pixels. Revisiting the hypothesis: "The representations of inputs to a model are a composition of encodings of discrete information." While, as we’ve seen, this verges on the obvious; it provides a template for introducing stricter specifications deserving of study. The first of these specification revisits looks at “…are a composition of encodings…” What is observed, speculated, and hoped for about the nature of these compositions of the encodings? Linearity To recap decomposition, we expect (non-memorizing) neural networks to identify and encode varied information from input qualities/properties. This implies that any activation state is a composition of these encodings. Figure 4: What the decomposability hypothesis suggests What is the nature of this composition? In this context, saying a representation is linear suggests the information of discrete input qualities are encoded as directions in activation space and they are composed into a representation by a vector sum: We’ll investigate both claims. Claim #1: Encoded Qualities Are Directions in Activation Space Composability already suggests that the representation of input in some model components (a vector in activation space) is composed of discrete encodings of input qualities (other vectors in activation space). The additional thing said here is that in a given input-quality encoding, we can think of there being some core essence of the quality which is the vector’s direction. This makes any particular encoding vector just a scaled version of this direction (unit vector.) Figure 5: Various encoding vectors for the red-ness quality in the input They are all just scaled representations of some fundamental red-ness unit vector, which specifies direction. This is simply a generalization of the composability argument that says neural networks can learn to make their encodings of input qualities "intensity"-sensitive by scaling some characteristic unit vector. Alternative Impractical Encoding Regimes Figure 6a An alternative encoding scheme could be that all we can get from models are binary encodings of properties; e.g., “The Red values in this RGB input are Non-zero.” This is clearly not very robust. Figure 6b Another is that we have multiple unique directions for qualities that could be described by mere differences in scale of some more fundamental quality: “One Neuron for "kind-of-red" for 0-127 in the RGB input, another for "really-red" for 128-255 in the RGB input.” We’d run out of directions fairly quickly. Claim #2: These Encodings Are Composed as a Vector Sum Now, this is the stronger of the two claims as it is not necessarily a consequence of anything introduced thus far. Figure 7: An example of 2-property representation Note: We assume independence between properties, ignoring the degenerate case where a size of zero implies the color is not red (nothing). A vector sum might seem like the natural (if not only) thing a network could do to combine these encoding vectors. To appreciate why this claim is worth verifying, it’ll be worth investigating if alternative non-linear functions could also get the job done. Recall that the thing we want is a function that combines these encodings at some component in the model in a way that preserves information for downstream computation. So this is effectively an information compression problem. As discussed in Elhage et al [3a], the following non-linear compression scheme could get the job done: Where we seek to compress values x and y into t. The value of Z is chosen according to the required floating-point precision needed for compressions. Python # A Python Implementation from math import floor def compress_values(x1, x2, precision=1): z = 10 ** precision compressed_val = (floor(z * x1) + x2) / z return round(compressed_val, precision * 2) def recover_values(compressed_val, precision=1): z = 10 ** precision x2_recovered = (compressed_val * z) - floor(compressed_val * z) x1_recovered = compressed_val - (x2_recovered / z) return round(x1_recovered, precision), round(x2_recovered, precision) # Now to compress vectors a and b a = [0.3, 0.6] b = [0.2, 0.8] compressed_a_b = [compress_values(a[0], b[0]), compress_values(a[1], b[1])] # Returned [0.32, 0.68] recovered_a, recovered_b = ( [x, y] for x, y in zip( recover_values(compressed_a_b[0]), recover_values(compressed_a_b[1]) ) ) # Returned ([0.3, 0.6], [0.2, 0.8]) assert all([recovered_a == a, recovered_b == b]) As demonstrated, we’re able to compress and recover vectors a and b just fine, so this is also a viable way of compressing information for later computation using non-linearities like the floor() function that neural networks can approximate. While this seems a little more tedious than just adding vectors, it shows the network does have options. This calls for some evidence and further arguments in support of linearity. Evidence of Linearity The often-cited example of a model component exhibiting strong linearity is the embedding layer in language models [4], where relationships like the following exist between representations of words: This example would hint at the following relationship between the quality of $plurality$ in the input words and the rest of their representation: Okay, so that’s some evidence for one component in a type of neural network having linear representations. The broad outline of arguments for this being prevalent across networks is that linear representations are both the more natural and performant [3b][3a] option for neural networks to settle on. How Important for Interpretability Is It That This Is True? If non-linear compression is prevalent across networks, there are two alternative regimes in which networks could operate: Computation is still mostly done on linear variables: In this regime, while the information is encoded and moved between components non-linearly, the model components would still decompress the representations to run linear computations. From an interpretability standpoint, while this needs some additional work to reverse engineer the decompression operation, this wouldn't pose too high a barrier.Figure 8:Non-linear compression and propagation intervened by linear computation Computation is done in a non-linear state: The model figures out a way to do computations directly on the non-linear representation. This would pose a challenge needing new interpretability methods. However, based on arguments discussed earlier about model architecture affordances this is expected to be unlikely. Figure 9: Direct non-linear computation Features As promised in the introduction, after avoiding the word “feature” this far into the post, we’ll introduce it properly. As a quick aside, I think the engagement of the research community on the topic of defining what we mean when we use the word “feature” is one of the things that makes mech interp, as a pre-paradigmatic science, exciting. While different definitions have been proposed [3c] and the final verdict is by no means out, in this post and others to come on mech interp, I’ll be using the following: "The features of a given neural network constitute a set of all the input qualities the network would dedicate a neuron to if it could." We’ve already discussed the idea of networks necessarily encoding discrete qualities of inputs, so the most interesting part of the definition is, “...would dedicate a neuron to if it could.” What Is Meant by “...Dedicate a Neuron To...”? In a case where all quality-encoding directions are unique one-hot vectors in activation space ([0, 1] and [1, 0], for example) the neurons are said to be basis-aligned; i.e., one neuron’s activation in the network independently represents the intensity of one input quality. Figure 10: Example of a representation with basis-aligned neurons Note that while sufficient, this property is not necessary for lossless compression of encodings with vector addition. The core requirement is that these feature directions be orthogonal. The reason for this is the same as when we explored the non-linear compression method: we want to completely recover each encoded feature downstream. Basis Vectors Following the Linearity hypothesis, we expect the activation vector to be a sum of all the scaled feature directions: Given an activation vector (which is what we can directly observe when our network fires), if we want to know the activation intensity of some feature in the input, all we need is the feature’s unit vector, feature^j_d: (where the character “.” in the following expression is the vector dot product.) If all the feature unit vectors of that network component (making up the set, Features_d) are orthogonal to each other: And, for any vector: These simplify our equation to give an expression for our feature intensity feature^j_i: Allowing us to fully recover our compressed feature: All that was to establish the ideal property of orthogonality between feature directions. This means even though the idea of “one neuron firing by x-much == one feature is present by x-much” is pretty convenient to think about, there are other equally performant feature directions that don’t have their neuron-firing patterns aligning this cleanly with feature patterns. (As an aside, it turns out basis-aligned neurons don’t happen that often. [3d]) Fig 11: Orthogonal feature directions from non-basis-aligned neurons With this context, the request: ”dedicate a neuron to…” might seem arbitrarily specific. Perhaps “dedicate an extra orthogonal direction vector” would be sufficient to accommodate an additional quality. But as you probably already know, orthogonal vectors in a space don’t grow on trees. A 2-dimensional space can only have 2 orthogonal vectors at a time, for example. So to make more room, we might need an extra dimension, i.e [X X] -> [X X X] which is tantamount to having an extra neuron dedicated to this feature. How Are These Features Stored in Neural Networks? To touch grass quickly, what does it mean when a model component has learned 3 orthogonal feature directions {[1 0 0], [0 1 0], [0 0 1]} for compressing an input vector [a b c]? To get the compressed activation vector, we expect a series of dot products with each feature direction to get our feature scale. Now we just have to sum up our scaled-up feature directions to get our “compressed” activation state. In this toy example, the features are just the vector values so lossless decompressing gets us what we started with. The question is: what does this look like in a model? The above sequence of transformations of dot products followed by a sum is equivalent to the operations of the deep learning workhorse: matrix multiplication. The earlier sentence, “…a model component has learned 3 orthogonal feature directions,” should have been a giveaway. Models store their learnings in weights, and so our feature vectors are just the rows of this layer’s learned weight matrix, W. Why didn’t I just say the whole time, “Matrix multiplication. End of section.” Because we don’t always have toy problems in the real world. The learned features aren’t always stored in just one set of weights. It could (and usually does) involve an arbitrarily long sequence of linear and non-linear compositions to arrive at some feature direction (but the key insight of decompositional linearity is that this computation can be summarised by a direction used to compose some activation). The promise of linearity we discussed only has to do with how feature representations are composed. For example, some arbitrary vector is more likely to not be hanging around for discovery by just reading one row of a layer’s weight matrix, but the computation to encode that feature is spread across several weights and model components. So we had to address features as arbitrary strange directions in activation space because they often are. This point brings the proposed dichotomy between representation and algorithmic interpretability into question. Back to our working definition of features: "The features of a given neural network constitute a set of all the input qualities the network would dedicate a neuron to if it could." On the Conditional Clause: “…Would Dedicate a Neuron to if It Could...” You can think of this definition of a feature as a bit of a set-up for an introduction to a hypothesis that addresses its counterfactual: What happens when a neural network cannot provide all its input qualities with dedicated neurons? Superposition Thus far, our model has done fine on the task that required it to compress and propagate 2 learned features — “size” and “red-ness” — through a 2-dimensional layer. What happens when a new task requires the compression and propagation of an additional feature like the x-displacement of the center of the square? Figure 12 This shows our network with a new task, requiring it to propagate one more learned property of the input: center x-displacement. We’ve returned to using neuron-aligned bases for convenience. Before we go further with this toy model, it would be worth thinking through if there are analogs of this in large real-world models. Let’s take the large language model GPT2 small [5]. Do you think, if you had all week, you could think of up to 769 useful features of an arbitrary 700-word query that would help predict the next token (e.g., “is a formal letter," “contains how many verbs," “is about about ‘Chinua Achebe,’” etc.)? Even if we ignored the fact that feature discovery was one of the known superpowers of neural networks [c] and assumed GPT2-small would also end up with only 769 useful input features to encode, we’d have a situation much like our toy problem above. This is because GPT2 has —at the narrowest point in its architecture— only 768 neurons to work with, just like our toy problem has 2 neurons but needs to encode information about 3 features. [d] So this whole “model component encodes more features it has neurons” business should be worth looking into. It probably also needs a shorter name. That name is the Superposition hypothesis. Considering the above thought experiment with GPT2 Small, it would seem this hypothesis is just stating the obvious- that models are somehow able to represent more input qualities (features) than they have dimensions for. What Exactly Is Hypothetical About Superposition? There’s a reason I introduced it this late in the post: it depends on other abstractions that aren't necessarily self-evident. The most important is the prior formulation of features. It assumes linear decomposition- the expression of neural net representations as sums of scaled directions representing discrete qualities of their inputs. These definitions might seem circular, but they’re not if defined sequentially: If you conceive of neural networks as encoding discrete information of inputs called Features as directions in activation space, then when we suspect the model has more of these features than it has neurons, we call this Superposition. A Way Forward As we’ve seen, it would be convenient if the features of a model were aligned with neurons and necessary for them to be orthogonal vectors to allow lossless recovery from compressed representations. So to suggest this isn't happening poses difficulties to interpretation and raises questions on how networks can pull this off anyway. Further development of the hypothesis provides a model for thinking about why and how superposition happens, clearly exposes the phenomenon in toy problems, and develops promising methods for working around barriers to interpretability [6]. More on this in a future post. Footnotes [a] That is, algorithms more descriptive than “Take this Neural Net architecture and fill in its weights with these values, then do a forward pass.” [b] Primarily from ideas introduced in Toy Models of Superposition [c] This refers specifically to the codification of features as their superpower. Humans are pretty good at predicting the next token in human text; we’re just not good at writing programs for extracting and representing this information vector space. All of that is hidden away in the mechanics of our cognition. [d] Technically, the number to compare the 768-dimension residual stream width to is the maximum number of features we think *any* single layer would have to deal with at a time. If we assume equal computational workload between layers and assume each batch of features was built based on computations on the previous, for the 12-layer GPT2 model, this would be 12 * 768 = 9,216 features you’d need to think up. References [1] Chris Olah on Mech Interp - 80000 Hours [2] Interpretability Dreams [3] Toy Models of Superposition [3a] Nonlinear Compression [3b] Features as Directions [3c] What are Features? [3d] Definitions and Motivation [4] Linguistic regularities in continuous space word representations: Mikolov, T., Yih, W. and Zweig, G., 2013. Proceedings of the 2013 conference of the North American chapter of the Association for Computational Linguistics: Human language technologies, pp. 746--751. [5] Language Models are Unsupervised Multitask Learners: Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever [6] Towards Monosemanticity: Decomposing Language Models With Dictionary Learning More

Refcard #335

Distributed SQL Essentials

By ANDREW OLIVER

Evolution of Privacy-Preserving AI: From Protocols to Practical Implementations

By Petr Emelianov

NIST AI Risk Management Framework: Developer’s Handbook

By Josephine Eskaline Joyce

Advanced Brain-Computer Interfaces With Java

In the first part of this series, we introduced the basics of brain-computer interfaces (BCIs) and how Java can be employed in developing BCI applications. In this second part, let's delve deeper into advanced concepts and explore a real-world example of a BCI application using NeuroSky's MindWave Mobile headset and their Java SDK. Advanced Concepts in BCI Development Motor Imagery Classification: This involves the mental rehearsal of physical actions without actual execution. Advanced machine learning algorithms like deep learning models can significantly improve classification accuracy. Event-Related Potentials (ERPs): ERPs are specific patterns in brain signals that occur in response to particular events or stimuli. Developing BCI applications that exploit ERPs requires sophisticated signal processing techniques and accurate event detection algorithms. Hybrid BCI Systems: Hybrid BCI systems combine multiple signal acquisition methods or integrate BCIs with other physiological signals (like eye tracking or electromyography). Developing such systems requires expertise in multiple signal acquisition and processing techniques, as well as efficient integration of different modalities. Real-World BCI Example Developing a Java Application With NeuroSky's MindWave Mobile NeuroSky's MindWave Mobile is an EEG headset that measures brainwave signals and provides raw EEG data. The company provides a Java-based SDK called ThinkGear Connector (TGC), enabling developers to create custom applications that can receive and process the brainwave data. Step-by-Step Guide to Developing a Basic BCI Application Using the MindWave Mobile and TGC Establish Connection: Use the TGC's API to connect your Java application with the MindWave Mobile device over Bluetooth. The TGC provides straightforward methods for establishing and managing this connection. Java ThinkGearSocket neuroSocket = new ThinkGearSocket(this); neuroSocket.start(); Acquire Data: Once connected, your application will start receiving raw EEG data from the device. This data includes information about different types of brainwaves (e.g., alpha, beta, gamma), as well as attention and meditation levels. Java public void onRawDataReceived(int rawData) { // Process raw data } Process Data: Use signal processing techniques to filter out noise and extract useful features from the raw data. The TGC provides built-in methods for some basic processing tasks, but you may need to implement additional processing depending on your application's needs. Java public void onEEGPowerReceived(EEGPower eegPower) { // Process EEG power data } Interpret Data: Determine the user's mental state or intent based on the processed data. This could involve setting threshold levels for certain values or using machine learning algorithms to classify the data. For example, a high attention level might be interpreted as the user wanting to move a cursor on the screen. Java public void onAttentionReceived(int attention) { // Interpret attention data } Perform Action: Based on the interpretation of the data, have your application perform a specific action. This could be anything from moving a cursor, controlling a game character, or adjusting the difficulty level of a task. Java if (attention > ATTENTION_THRESHOLD) { // Perform action } Improving BCI Performance With Java Optimize Signal Processing: Enhance the quality of acquired brain signals by implementing advanced signal processing techniques, such as adaptive filtering or blind source separation. Employ Advanced Machine Learning Algorithms: Utilize state-of-the-art machine learning models, such as deep neural networks or ensemble methods, to improve classification accuracy and reduce user training time. Libraries like DeepLearning4j or TensorFlow Java can be employed for this purpose. Personalize BCI Models: Customize BCI models for individual users by incorporating user-specific features or adapting the model parameters during operation. This can be achieved using techniques like transfer learning or online learning. Implement Efficient Real-Time Processing: Ensure that your BCI application can process brain signals and generate output commands in real time. Optimize your code, use parallel processing techniques, and leverage Java's concurrency features to achieve low-latency performance. Evaluate and Validate Your BCI Application: Thoroughly test your BCI application on a diverse group of users and under various conditions to ensure its reliability and usability. Employ standard evaluation metrics and follow best practices for BCI validation. Conclusion Advanced BCI applications require a deep understanding of brain signal acquisition, processing, and classification techniques. Java, with its extensive libraries and robust performance, is an excellent choice for implementing such applications. By exploring advanced concepts, developing real-world examples, and continuously improving BCI performance, developers can contribute significantly to this revolutionary field.

By Arun Pandey

CORE

BigQuery DataFrames in Python

Google BigQuery is a powerful cloud-based data warehousing solution that enables users to analyze massive datasets quickly and efficiently. In Python, BigQuery DataFrames provide a Pythonic interface for interacting with BigQuery, allowing developers to leverage familiar tools and syntax for data querying and manipulation. In this comprehensive developer guide, we'll explore the usage of BigQuery DataFrames, their advantages, disadvantages, and potential performance issues. Introduction To BigQuery DataFrames BigQuery DataFrames serve as a bridge between Google BigQuery and Python, allowing seamless integration of BigQuery datasets into Python workflows. With BigQuery DataFrames, developers can use familiar libraries like Pandas to query, analyze, and manipulate BigQuery data. This Pythonic approach simplifies the development process and enhances productivity for data-driven applications. Advantages of BigQuery DataFrames Pythonic Interface: BigQuery DataFrames provide a Pythonic interface for interacting with BigQuery, enabling developers to use familiar Python syntax and libraries. Integration With Pandas: Being compatible with Pandas, BigQuery DataFrames allow developers to leverage the rich functionality of Pandas for data manipulation. Seamless Query Execution: BigQuery DataFrames handle the execution of SQL queries behind the scenes, abstracting away the complexities of query execution. Scalability: Leveraging the power of Google Cloud Platform, BigQuery DataFrames offer scalability to handle large datasets efficiently. Disadvantages of BigQuery DataFrames Limited Functionality: BigQuery DataFrames may lack certain advanced features and functionalities available in native BigQuery SQL. Data Transfer Costs: Transferring data between BigQuery and Python environments may incur data transfer costs, especially for large datasets. API Limitations: While BigQuery DataFrames provide a convenient interface, they may have limitations compared to directly using the BigQuery API for complex operations. Prerequisites Google Cloud Platform (GCP) Account: Ensure an active GCP account with BigQuery access. Python Environment: Set up a Python environment with the required libraries (pandas, pandas_gbq, and google-cloud-bigquery). Project Configuration: Configure your GCP project and authenticate your Python environment with the necessary credentials. Using BigQuery DataFrames Install Required Libraries Install the necessary libraries using pip: Python pip install pandas pandas-gbq google-cloud-bigquery Authenticate GCP Credentials Authenticate your GCP credentials to enable interaction with BigQuery: Python from google.auth import load_credentials # Load GCP credentials credentials, _ = load_credentials() Querying BigQuery DataFrames Use pandas_gbq to execute SQL queries and retrieve results as a DataFrame: Python import pandas_gbq # SQL Query query = "SELECT * FROM `your_project_id.your_dataset_id.your_table_id`" # Execute Query and Retrieve DataFrame df = pandas_gbq.read_gbq(query, project_id="your_project_id", credentials=credentials) Writing to BigQuery Write a DataFrame to a BigQuery table using pandas_gbq: Python # Write DataFrame to BigQuery pandas_gbq.to_gbq(df, destination_table="your_project_id.your_dataset_id.your_new_table", project_id="your_project_id", if_exists="replace", credentials=credentials) Advanced Features SQL Parameters Pass parameters to your SQL queries dynamically: Python params = {"param_name": "param_value"} query = "SELECT * FROM `your_project_id.your_dataset_id.your_table_id` WHERE column_name = @param_name" df = pandas_gbq.read_gbq(query, project_id="your_project_id", credentials=credentials, dialect="standard", parameters=params) Schema Customization Customize the DataFrame schema during the write operation: Python schema = [{"name": "column_name", "type": "INTEGER"}, {"name": "another_column", "type": "STRING"}] pandas_gbq.to_gbq(df, destination_table="your_project_id.your_dataset_id.your_custom_table", project_id="your_project_id", if_exists="replace", credentials=credentials, table_schema=schema) Performance Considerations Data Volume: Performance may degrade with large datasets, especially when processing and transferring data between BigQuery and Python environments. Query Complexity: Complex SQL queries may lead to longer execution times, impacting overall performance. Network Latency: Network latency between the Python environment and BigQuery servers can affect query execution time, especially for remote connections. Best Practices for Performance Optimization Use Query Filters: Apply filters to SQL queries to reduce the amount of data transferred between BigQuery and Python. Optimize SQL Queries: Write efficient SQL queries to minimize query execution time and reduce resource consumption. Cache Query Results: Cache query results in BigQuery to avoid re-executing queries for repeated requests. Conclusion BigQuery DataFrames offer a convenient and Pythonic way to interact with Google BigQuery, providing developers with flexibility and ease of use. While they offer several advantages, developers should be aware of potential limitations and performance considerations. By following best practices and optimizing query execution, developers can harness the full potential of BigQuery DataFrames for data analysis and manipulation in Python.

By Sreenath Devineni

Python Function Pipelines: Streamlining Data Processing

Function pipelines allow seamless execution of multiple functions in a sequential manner, where the output of one function serves as the input to the next. This approach helps in breaking down complex tasks into smaller, more manageable steps, making code more modular, readable, and maintainable. Function pipelines are commonly used in functional programming paradigms to transform data through a series of operations. They promote a clean and functional style of coding, emphasizing the composition of functions to achieve desired outcomes. In this article, we will explore the fundamentals of function pipelines in Python, including how to create and use them effectively. We'll discuss techniques for defining pipelines, composing functions, and applying pipelines to real-world scenarios. Creating Function Pipelines in Python In this segment, we'll explore two instances of function pipelines. In the initial example, we'll define three functions—'add', 'multiply', and 'subtract'—each designed to execute a fundamental arithmetic operation as implied by its name. Python def add(x, y): return x + y def multiply(x, y): return x * y def subtract(x, y): return x - y Next, create a pipeline function that takes any number of functions as arguments and returns a new function. This new function applies each function in the pipeline to the input data sequentially. Python # Pipeline takes multiple functions as argument and returns an inner function def pipeline(*funcs): def inner(data): result = data # Iterate thru every function for func in funcs: result = func(result) return result return inner Let’s understand the pipeline function. The pipeline function takes any number of functions (*funcs) as arguments and returns a new function (inner). The inner function accepts a single argument (data) representing the input data to be processed by the function pipeline. Inside the inner function, a loop iterates over each function in the funcs list. For each function func in the funcs list, the inner function applies func to the result variable, which initially holds the input data. The result of each function call becomes the new value of result. After all functions in the pipeline have been applied to the input data, the inner function returns the final result. Next, we create a function called ‘calculation_pipeline’ that passes the ‘add’, ‘multiply’ and ‘substract’ to the pipeline function. Python # Create function pipeline calculation_pipeline = pipeline( lambda x: add(x, 5), lambda x: multiply(x, 2), lambda x: subtract(x, 10) ) Then we can test the function pipeline by passing an input value through the pipeline. Python result = calculation_pipeline(10) print(result) # Output: 20 We can visualize the concept of a function pipeline through a simple diagram. Another example: Python def validate(text): if text is None or not text.strip(): print("String is null or empty") else: return text def remove_special_chars(text): for char in "!@#$%^&*()_+{}[]|\":;'<>?,./": text = text.replace(char, "") return text def capitalize_string(text): return text.upper() # Pipeline takes multiple functions as argument and returns an inner function def pipeline(*funcs): def inner(data): result = data # Iterate thru every function for func in funcs: result = func(result) return result return inner # Create function pipeline str_pipeline = pipeline( lambda x : validate(x), lambda x: remove_special_chars(x), lambda x: capitalize_string(x) ) Testing the pipeline by passing the correct input: Python # Test the function pipeline result = str_pipeline("Test@!!!%#Abcd") print(result) # TESTABCD In case of an empty or null string: Python result = str_pipeline("") print(result) # Error In the example, we've established a pipeline that begins by validating the input to ensure it's not empty. If the input passes this validation, it proceeds to the 'remove_special_chars' function, followed by the 'Capitalize' function. Benefits of Creating Function Pipelines Function pipelines encourage modular code design by breaking down complex tasks into smaller, composable functions. Each function in the pipeline focuses on a specific operation, making it easier to understand and modify the code. By chaining together functions in a sequential manner, function pipelines promote clean and readable code, making it easier for other developers to understand the logic and intent behind the data processing workflow. Function pipelines are flexible and adaptable, allowing developers to easily modify or extend existing pipelines to accommodate changing requirements.

By Sameer Shukla

CORE

20 Days of DynamoDB

For the next 20 days (don’t ask me why I chose that number), I will be publishing a DynamoDB quick tip per day with code snippets. The examples use the DynamoDB packages from AWS SDK for Go V2 but should be applicable to other languages as well. Day 20: Converting Between Go and DynamoDB Types Posted: 13/Feb/2024 The DynamoDB attributevalue in the AWS SDK for Go package can save you a lot of time, thanks to the Marshal and Unmarshal family of utility functions that can be used to convert between Go types (including structs) and AttributeValues. Here is an example using a Go struct: MarshalMap converts Customer struct into a map[string]types.AttributeValue that's required by PutItem UnmarshalMap converts the map[string]types.AttributeValue returned by GetItem into a Customer struct Go type Customer struct { Email string `dynamodbav:"email"` Age int `dynamodbav:"age,omitempty"` City string `dynamodbav:"city"` } customer := Customer{Email: "abhirockzz@gmail.com", City: "New Delhi"} item, _ := attributevalue.MarshalMap(customer) client.PutItem(context.Background(), &dynamodb.PutItemInput{ TableName: aws.String(tableName), Item: item, }) resp, _ := client.GetItem(context.Background(), &dynamodb.GetItemInput{ TableName: aws.String(tableName), Key: map[string]types.AttributeValue{"email": &types.AttributeValueMemberS{Value: "abhirockzz@gmail.com"}, }) var cust Customer attributevalue.UnmarshalMap(resp.Item, &cust) log.Println("item info:", cust.Email, cust.City) Recommended reading: MarshalMap API doc UnmarshalMap API doc AttributeValue API doc Day 19: PartiQL Batch Operations Posted: 12/Feb/2024 You can use batched operations with PartiQL as well, thanks to BatchExecuteStatement. It allows you to batch reads as well as write requests. Here is an example (note that you cannot mix both reads and writes in a single batch): Go //read statements client.BatchExecuteStatement(context.Background(), &dynamodb.BatchExecuteStatementInput{ Statements: []types.BatchStatementRequest{ { Statement: aws.String("SELECT * FROM url_metadata where shortcode=?"), Parameters: []types.AttributeValue{ &types.AttributeValueMemberS{Value: "abcd1234"}, }, }, { Statement: aws.String("SELECT * FROM url_metadata where shortcode=?"), Parameters: []types.AttributeValue{ &types.AttributeValueMemberS{Value: "qwer4321"}, }, }, }, }) //separate batch for write statements client.BatchExecuteStatement(context.Background(), &dynamodb.BatchExecuteStatementInput{ Statements: []types.BatchStatementRequest{ { Statement: aws.String("INSERT INTO url_metadata value {'longurl':?,'shortcode':?, 'active': true}"), Parameters: []types.AttributeValue{ &types.AttributeValueMemberS{Value: "https://github.com/abhirockzz"}, &types.AttributeValueMemberS{Value: uuid.New().String()[:8]}, }, }, { Statement: aws.String("UPDATE url_metadata SET active=? where shortcode=?"), Parameters: []types.AttributeValue{ &types.AttributeValueMemberBOOL{Value: false}, &types.AttributeValueMemberS{Value: "abcd1234"}, }, }, { Statement: aws.String("DELETE FROM url_metadata where shortcode=?"), Parameters: []types.AttributeValue{ &types.AttributeValueMemberS{Value: "qwer4321"}, }, }, }, }) Just like BatchWriteItem, BatchExecuteStatement is limited to 25 statements (operations) per batch. Recommended reading: BatchExecuteStatementAPI docs Build faster with Amazon DynamoDB and PartiQL: SQL-compatible operations (thanks, Pete Naylor !) Day 18: Using a SQL-Compatible Query Language Posted: 6/Feb/2024 DynamoDB supports PartiQL to execute SQL-like select, insert, update, and delete operations. Here is an example of how you would use PartiQL-based queries for a simple URL shortener application. Notice how it uses a (generic) ExecuteStatement API to execute INSERT, SELECT, UPDATE and DELETE: Go _, err := client.ExecuteStatement(context.Background(), &dynamodb.ExecuteStatementInput{ Statement: aws.String("INSERT INTO url_metadata value {'longurl':?,'shortcode':?, 'active': true}"), Parameters: []types.AttributeValue{ &types.AttributeValueMemberS{Value: "https://github.com/abhirockzz"}, &types.AttributeValueMemberS{Value: uuid.New().String()[:8]}, }, }) _, err := client.ExecuteStatement(context.Background(), &dynamodb.ExecuteStatementInput{ Statement: aws.String("SELECT * FROM url_metadata where shortcode=? AND active=true"), Parameters: []types.AttributeValue{ &types.AttributeValueMemberS{Value: "abcd1234"}, }, }) _, err := client.ExecuteStatement(context.Background(), &dynamodb.ExecuteStatementInput{ Statement: aws.String("UPDATE url_metadata SET active=? where shortcode=?"), Parameters: []types.AttributeValue{ &types.AttributeValueMemberBOOL{Value: false}, &types.AttributeValueMemberS{Value: "abcd1234"}, }, }) _, err := client.ExecuteStatement(context.Background(), &dynamodb.ExecuteStatementInput{ Statement: aws.String("DELETE FROM url_metadata where shortcode=?"), Parameters: []types.AttributeValue{ &types.AttributeValueMemberS{Value: "abcd1234"}, }, }) Recommended reading: Amazon DynamoDB documentation on PartiQL support ExecuteStatement API docs Day 17: BatchGetItem Operation Posted: 5/Feb/2024 You can club multiple (up to 100) GetItem requests in a single BatchGetItem operation - this can be done across multiple tables. Here is an example that fetches includes four GetItem calls across two different tables: Go resp, err := client.BatchGetItem(context.Background(), &dynamodb.BatchGetItemInput{ RequestItems: map[string]types.KeysAndAttributes{ "customer": types.KeysAndAttributes{ Keys: []map[string]types.AttributeValue{ { "email": &types.AttributeValueMemberS{Value: "c1@foo.com"}, }, { "email": &types.AttributeValueMemberS{Value: "c2@foo.com"}, }, }, }, "Thread": types.KeysAndAttributes{ Keys: []map[string]types.AttributeValue{ { "ForumName": &types.AttributeValueMemberS{Value: "Amazon DynamoDB"}, "Subject": &types.AttributeValueMemberS{Value: "DynamoDB Thread 1"}, }, { "ForumName": &types.AttributeValueMemberS{Value: "Amazon S3"}, "Subject": &types.AttributeValueMemberS{Value: "S3 Thread 1"}, }, }, ProjectionExpression: aws.String("Message"), }, }, ReturnConsumedCapacity: types.ReturnConsumedCapacityTotal, }) Just like an individual GetItem call, you can include Projection Expressions and return RCUs. Note that BatchGetItem can only retrieve up to 16 MB of data. Recommended reading: BatchGetItem API doc Day 16: Enhancing Write Performance With Batching Posted: 2/Feb/2024 The DynamoDB BatchWriteItem operation can provide a performance boost by allowing you to squeeze in 25 individual PutItem and DeleteItem requests in a single API call — this can be done across multiple tables. Here is an example that combines PutItem and DeleteItem operations for two different tables (customer, orders): Go _, err := client.BatchWriteItem(context.Background(), &dynamodb.BatchWriteItemInput{ RequestItems: map[string][]types.WriteRequest{ "customer": []types.WriteRequest{ { PutRequest: &types.PutRequest{ Item: map[string]types.AttributeValue{ "email": &types.AttributeValueMemberS{Value: "c3@foo.com"}, }, }, }, { DeleteRequest: &types.DeleteRequest{ Key: map[string]types.AttributeValue{ "email": &types.AttributeValueMemberS{Value: "c1@foo.com"}, }, }, }, }, "orders": []types.WriteRequest{ { PutRequest: &types.PutRequest{ Item: map[string]types.AttributeValue{ "order_id": &types.AttributeValueMemberS{Value: "oid_1234"}, }, }, }, { DeleteRequest: &types.DeleteRequest{ Key: map[string]types.AttributeValue{ "order_id": &types.AttributeValueMemberS{Value: "oid_4321"}, }, }, }, }, }, }) Be aware of the following constraints: The total request size cannot exceed 16 MB BatchWriteItem cannot update items Recommended reading: BatchWriteItem API doc Day 15: Using the DynamoDB Expression Package To Build Update Expressions Posted: 31/Jan/2024 The DynamoDB Go, SDK expression package, supports the programmatic creation of Update expressions. Here is an example of how you can build an expression to include execute a SET operation of the UpdateItem API and combine it with a Condition expression (update criteria): Go updateExpressionBuilder := expression.Set(expression.Name("category"), expression.Value("standard")) conditionExpressionBuilder := expression.AttributeNotExists(expression.Name("account_locked")) expr, _ := expression.NewBuilder(). WithUpdate(updateExpressionBuilder). WithCondition(conditionExpressionBuilder). Build() resp, err := client.UpdateItem(context.Background(), &dynamodb.UpdateItemInput{ TableName: aws.String(tableName), Key: map[string]types.AttributeValue{ "email": &types.AttributeValueMemberS{Value: "c1@foo.com"}, }, UpdateExpression: expr.Update(), ConditionExpression: expr.Condition(), ExpressionAttributeNames: expr.Names(), ExpressionAttributeValues: expr.Values(), ReturnValues: types.ReturnValueAllOld, }) Recommended reading: WithUpdate method in the package API docs. Day 14: Using the DynamoDB Expression Package To Build Key Condition and Filter Expressions Posted: 30/Jan/2024 You can use expression package in the AWS Go SDK for DynamoDB to programmatically build key condition and filter expressions and use them with Query API. Here is an example: Go keyConditionBuilder := expression.Key("ForumName").Equal(expression.Value("Amazon DynamoDB")) filterExpressionBuilder := expression.Name("Views").GreaterThanEqual(expression.Value(3)) expr, _ := expression.NewBuilder(). WithKeyCondition(keyConditionBuilder). WithFilter(filterExpressionBuilder). Build() _, err := client.Query(context.Background(), &dynamodb.QueryInput{ TableName: aws.String("Thread"), KeyConditionExpression: expr.KeyCondition(), FilterExpression: expr.Filter(), ExpressionAttributeNames: expr.Names(), ExpressionAttributeValues: expr.Values(), }) Recommended reading: Key and NameBuilder in the package API docs Day 13: Using the DynamoDB Expression Package To Build Condition Expressions Posted: 25/Jan/2024 Thanks to the expression package in the AWS Go SDK for DynamoDB, you can programmatically build Condition expressions and use them with write operations. Here is an example of the DeleteItem API: Go conditionExpressionBuilder := expression.Name("inactive_days").GreaterThanEqual(expression.Value(20)) conditionExpression, _ := expression.NewBuilder().WithCondition(conditionExpressionBuilder).Build() _, err := client.DeleteItem(context.Background(), &dynamodb.DeleteItemInput{ TableName: aws.String(tableName), Key: map[string]types.AttributeValue{ "email": &types.AttributeValueMemberS{Value: email}, }, ConditionExpression: conditionExpression.Condition(), ExpressionAttributeNames: conditionExpression.Names(), ExpressionAttributeValues: conditionExpression.Values(), }) Recommended reading: WithCondition method in the package API docs Day 12: Using the DynamoDB Expression Package To Build Projection Expressions Posted: 24/Jan/2024 The expression package in the AWS Go SDK for DynamoDB provides a fluent builder API with types and functions to create expression strings programmatically along with corresponding expression attribute names and values. Here is an example of how you would build a Projection Expression and use it with the GetItem API: Go projectionBuilder := expression.NamesList(expression.Name("first_name"), expression.Name("last_name")) projectionExpression, _ := expression.NewBuilder().WithProjection(projectionBuilder).Build() _, err := client.GetItem(context.Background(), &dynamodb.GetItemInput{ TableName: aws.String("customer"), Key: map[string]types.AttributeValue{ "email": &types.AttributeValueMemberS{Value: "c1@foo.com"}, }, ProjectionExpression: projectionExpression.Projection(), ExpressionAttributeNames: projectionExpression.Names(), }) Recommended reading: expression package API docs. Day 11: Using Pagination With Query API Posted: 22/Jan/2024 The Query API returns the result set size to 1 MB. Use ExclusiveStartKey and LastEvaluatedKey elements to paginate over large result sets. You can also reduce page size by limiting the number of items in the result set with the Limit parameter of the Query operation. Go func paginatedQuery(searchCriteria string, pageSize int32) { currPage := 1 var exclusiveStartKey map[string]types.AttributeValue for { resp, _ := client.Query(context.Background(), &dynamodb.QueryInput{ TableName: aws.String(tableName), KeyConditionExpression: aws.String("ForumName = :name"), ExpressionAttributeValues: map[string]types.AttributeValue{ ":name": &types.AttributeValueMemberS{Value: searchCriteria}, }, Limit: aws.Int32(pageSize), ExclusiveStartKey: exclusiveStartKey, }) if resp.LastEvaluatedKey == nil { return } currPage++ exclusiveStartKey = resp.LastEvaluatedKey } } Recommended reading: Query Pagination Day 10: Query API With Filter Expression Posted: 19/Jan/2024 With the DynamoDB Query API, you can use Filter Expressions to discard specific query results based on criteria. Note that the filter expression is applied after a Query finishes but before the results are returned. Thus, it has no impact on the RCUs (read capacity units) consumed by the query. Here is an example that filters out forum discussion threads that have less than a specific number of views: Go resp, err := client.Query(context.Background(), &dynamodb.QueryInput{ TableName: aws.String(tableName), KeyConditionExpression: aws.String("ForumName = :name"), FilterExpression: aws.String("#v >= :num"), ExpressionAttributeNames: map[string]string{ "#v": "Views", }, ExpressionAttributeValues: map[string]types.AttributeValue{ ":name": &types.AttributeValueMemberS{Value: forumName}, ":num": &types.AttributeValueMemberN{Value: numViews}, }, }) Recommended reading: Filter Expressions Day 9: Query API Posted: 18/Jan/2024 The Query API is used to model one-to-many relationships in DynamoDB. You can search for items based on (composite) primary key values using Key Condition Expressions. The value for the partition key attribute is mandatory - the query returns all items with that partition key value. Additionally, you can also provide a sort key attribute and use a comparison operator to refine the search results. With the Query API, you can also: Switch to strongly consistent read (eventual consistent being the default) Use a projection expression to return only some attributes Return the consumed Read Capacity Units (RCU) Here is an example that queries for a specific thread based on the forum name (partition key) and subject (sort key). It only returns the Message attribute: Go resp, err = client.Query(context.Background(), &dynamodb.QueryInput{ TableName: aws.String(tableName), KeyConditionExpression: aws.String("ForumName = :name and Subject = :sub"), ExpressionAttributeValues: map[string]types.AttributeValue{ ":name": &types.AttributeValueMemberS{Value: forumName}, ":sub": &types.AttributeValueMemberS{Value: subject}, }, ReturnConsumedCapacity: types.ReturnConsumedCapacityTotal, ConsistentRead: aws.Bool(true), ProjectionExpression: aws.String("Message"), }) Recommended reading: API Documentation Item Collections Key Condition Expressions Composite primary key Day 8: Conditional Delete Operation Posted: 17/Jan/2024 All the DynamoDB write APIs, including DeleteItem support criteria-based (conditional) execution. You can use DeleteItem operation with a condition expression — it must be evaluated to true in order for the operation to succeed. Here is an example that verifies the value of inactive_days attribute: Go resp, err := client.DeleteItem(context.Background(), &dynamodb.DeleteItemInput{ TableName: aws.String(tableName), Key: map[string]types.AttributeValue{ "email": &types.AttributeValueMemberS{Value: email}, }, ConditionExpression: aws.String("inactive_days >= :val"), ExpressionAttributeValues: map[string]types.AttributeValue{ ":val": &types.AttributeValueMemberN{Value: "20"}, }, }) if err != nil { if strings.Contains(err.Error(), "ConditionalCheckFailedException") { return } else { log.Fatal(err) } } Recommended reading: Conditional deletes documentation Day 7: DeleteItem API Posted: 16/Jan/2024 The DynamoDB DeleteItem API does what it says - delete an item. But it can also: Return the content of the old item (at no additional cost) Return the consumed Write Capacity Units (WCU) Return the item attributes for an operation that failed a condition check (again, no additional cost) Retrieve statistics about item collections, if any, that were affected during the operation Here is an example: Go resp, err := client.DeleteItem(context.Background(), &dynamodb.DeleteItemInput{ TableName: aws.String(tableName), Key: map[string]types.AttributeValue{ "email": &types.AttributeValueMemberS{Value: email}, }, ReturnValues: types.ReturnValueAllOld, ReturnConsumedCapacity: types.ReturnConsumedCapacityTotal, ReturnValuesOnConditionCheckFailure: types.ReturnValuesOnConditionCheckFailureAllOld, ReturnItemCollectionMetrics: types.ReturnItemCollectionMetricsSize, }) Recommended reading: DeleteItem API doc Day 6: Atomic Counters With UpdateItem Posted: 15/Jan/2024 Need to implement an atomic counter using DynamoDB? If you have a use case that can tolerate over-counting or under-counting (for example, visitor count), use the UpdateItem API. Here is an example that uses the SET operator in an update expression to increment num_logins attribute: Go resp, err := client.UpdateItem(context.Background(), &dynamodb.UpdateItemInput{ TableName: aws.String(tableName), Key: map[string]types.AttributeValue{ "email": &types.AttributeValueMemberS{Value: email}, }, UpdateExpression: aws.String("SET num_logins = num_logins + :num"), ExpressionAttributeValues: map[string]types.AttributeValue{ ":num": &types.AttributeValueMemberN{ Value: num, }, }, ReturnConsumedCapacity: types.ReturnConsumedCapacityTotal, }) Note that every invocation of UpdateItem will increment (or decrement) — hence, it is not idempotent. Recommended reading: Atomic Counters Day 5: Avoid Overwrites When Using DynamoDB UpdateItem API Posted: 12/Jan/2024 The UpdateItem API creates a new item or modifies an existing item's attributes. If you want to avoid overwriting an existing attribute, make sure to use the SET operation with if_not_exists function. Here is an example that sets the category of an item only if the item does not already have a category attribute: Go resp, err := client.UpdateItem(context.Background(), &dynamodb.UpdateItemInput{ TableName: aws.String(tableName), Key: map[string]types.AttributeValue{ "email": &types.AttributeValueMemberS{Value: email}, }, UpdateExpression: aws.String("SET category = if_not_exists(category, :category)"), ExpressionAttributeValues: map[string]types.AttributeValue{ ":category": &types.AttributeValueMemberS{ Value: category, }, }, }) Note that if_not_exists function can only be used in the SET action of an update expression. Recommended reading: DynamoDB documentation Day 4: Conditional UpdateItem Posted: 11/Jan/2024 Conditional operations are helpful in cases when you want a DynamoDB write operation (PutItem, UpdateItem or DeleteItem) to be executed based on certain criteria. To do so, use a condition expression - it must evaluate to true in order for the operation to succeed. Here is an example that demonstrates a conditional UpdateItem operation. It uses the attribute_not_exists function: Go resp, err := client.UpdateItem(context.Background(), &dynamodb.UpdateItemInput{ TableName: aws.String(tableName), Key: map[string]types.AttributeValue{ "email": &types.AttributeValueMemberS{Value: email}, }, UpdateExpression: aws.String("SET first_name = :fn"), ExpressionAttributeValues: map[string]types.AttributeValue{ ":fn": &types.AttributeValueMemberS{ Value: firstName, }, }, ConditionExpression: aws.String("attribute_not_exists(account_locked)"), ReturnConsumedCapacity: types.ReturnConsumedCapacityTotal, }) Recommended reading: ConditionExpressions Day 3: UpdateItem Add-On Benefits Posted: 10/Jan/2024 The DynamoDB UpdateItem operation is quite flexible. In addition to using many types of operations, you can: Use multiple update expressions in a single statement Get the item attributes as they appear before or after they are successfully updated Understand which item attributes failed the condition check (no additional cost) Retrieve the consumed Write Capacity Units (WCU) Here is an example (using AWS Go SDK v2): Go resp, err = client.UpdateItem(context.Background(), &dynamodb.UpdateItemInput{ TableName: aws.String(tableName), Key: map[string]types.AttributeValue{ "email": &types.AttributeValueMemberS{Value: email}, }, UpdateExpression: aws.String("SET last_name = :ln REMOVE category"), ExpressionAttributeValues: map[string]types.AttributeValue{ ":ln": &types.AttributeValueMemberS{ Value: lastName, }, }, ReturnValues: types.ReturnValueAllOld, ReturnValuesOnConditionCheckFailure: types.ReturnValuesOnConditionCheckFailureAllOld, ReturnConsumedCapacity: types.ReturnConsumedCapacityTotal, } Recommended reading: UpdateItem API Update Expressions Day 2: GetItem Add-On Benefits Posted: 9/Jan/2024 Did you know that the DynamoDB GetItem operation also gives you the ability to: Switch to strongly consistent read (eventually consistent being the default) Use a projection expression to return only some of the attributes Return the consumed Read Capacity Units (RCU) Here is an example (DynamoDB Go SDK): Go resp, err := client.GetItem(context.Background(), &dynamodb.GetItemInput{ TableName: aws.String(tableName), Key: map[string]types.AttributeValue{ //email - partition key "email": &types.AttributeValueMemberS{Value: email}, }, ConsistentRead: aws.Bool(true), ProjectionExpression: aws.String("first_name, last_name"), ReturnConsumedCapacity: types.ReturnConsumedCapacityTotal, }) Recommended reading: GetItem API doc link Projection expressions Day 1: Conditional PutItem Posted: 8/Jan/2024 The DynamoDB PutItem API overwrites the item in case an item with the same primary key already exists. To avoid (or work around) this behavior, use PutItem with an additional condition. Here is an example that uses the attribute_not_exists function: Go _, err := client.PutItem(context.Background(), &dynamodb.PutItemInput{ TableName: aws.String(tableName), Item: map[string]types.AttributeValue{ "email": &types.AttributeValueMemberS{Value: email}, }, ConditionExpression: aws.String("attribute_not_exists(email)"), ReturnConsumedCapacity: types.ReturnConsumedCapacityTotal, ReturnValues: types.ReturnValueAllOld, ReturnItemCollectionMetrics: types.ReturnItemCollectionMetricsSize, }) if err != nil { if strings.Contains(err.Error(), "ConditionalCheckFailedException") { log.Println("failed pre-condition check") return } else { log.Fatal(err) } } With the PutItem operation, you can also: Return the consumed Write Capacity Units (WCU) Get the item attributes as they appeared before (in case they were updated during the operation) Retrieve statistics about item collections, if any, that were modified during the operation Recommended reading: API Documentation Condition Expressions Comparison Functions

By Abhishek Gupta

CORE

Code Search Using Retrieval Augmented Generation

Retrieval Augmented Generation (RAG) is becoming a popular paradigm for bridging the knowledge gap between pre-trained Large Language models and other data sources. For developer productivity, several code copilots help with code completion. Code Search is an age-old problem that can be rethought in the age of RAG. Imagine you are trying to contribute to a new code base (a GitHub repository) for a beginner task. Knowing which file to change and where to make the change can be time-consuming. We've all been there. You're enthusiastic about contributing to a new GitHub repository but overwhelmed. Which file do you modify? Where do you start? For newcomers, the maze of a new codebase can be truly daunting. Retrieval Augmented Generation for Code Search The technical solution consists of 2 parts. 1. Build a vector index generating embedding for every file (eg. .py .java.) 2. Query the vector index and leverage the code interpreter to provide instructions by calling GPT-x. Building the Vector Index Once you have a local copy of the GitHub repo, akin to a crawler of web search index, Traverse every file matching a regex (*.py, *.sh, *.java) Read the content and generate an embedding. Using OpenAI’s Ada embedding or Sentence BERT embedding (or both.) Build a vector store using annoy. Instead of choosing a single embedding, if we build multiple vector stores based on different embeddings, it improves the quality of retrieval. (anecdotally) However, there is a cost of maintaining multiple indices. 1. Prepare Your Requirements.txt To Install Necessary Python Packages pip install -r requirements.txt Python annoy==1.17.3 langchain==0.0.279 sentence-transformers==2.2.2 openai==0.28.0 open-interpreter==0.1.6 2. Walk Through Every File Python ### Traverse through every file in the directory def get_files(path): files = [] for r, d, f in os.walk(path): for file in f: if ".py" in file or ".sh" in file or ".java" in file: files.append(os.path.join(r, file)) return files 3. Get OpenAI Ada Embeddings Python embeddings = OpenAIEmbeddings(openai_api_key=" <Insert your key>") # we are getting embeddings for the contents of the file def get_file_embeddings(path): try: text = get_file_contents(path) ret = embeddings.embed_query(text) return ret except: return None def get_file_contents(path): with open(path, 'r') as f: return f.read() files = get_files(LOCAL_REPO_GITHUB_PATH) embeddings_dict = {} s = set() for file in files: e = get_file_embeddings(file) if (e is None): print ("Error in generating an embedding for the contents of file: ") print (file) s.add(file) else: embeddings_dict[file] = e 4. Generate the Annoy Index In Annoy, the metric can be "angular," "euclidean," "manhattan," "hamming," or "dot." Python annoy_index_t = AnnoyIndex(1536, 'angular') index_map = {} i = 0 for file in embeddings_dict: annoy_index_t.add_item(i, embeddings_dict[file]) index_map[i] = file i+=1 annoy_index_t.build(len(files)) name = "CodeBase" + "_ada.ann" annoy_index_t.save(name) ### Maintains a forward map of id -> file name with open('index_map' + "CodeBase" + '.txt', 'w') as f: for idx, path in index_map.items(): f.write(f'{idx}\t{path}\n') We can see the size of indices is proportional to the number of files in the local repository. Size of annoy index generated for popular GitHub repositories. Repository File Count (approx as its growing) Size Langchain 1983+ 60 MB Llama Index 779 14 MB Apache Solr 5000+ 328 MB Local GPT 8 165 KB Generate Response With Open Interpreter (Calls GPT-4) Once the index is built, a simple command line python script can be implemented to ask questions right from the terminal about your codebase. We can leverage Open Interpreter. One of the reasons to use Open Interpreter instead of us calling GPT-4 or other LLMs directly is because Open-Interpreter allows us to make changes to your file and run commands. It handles interaction with GPT-4. Python embeddings = OpenAIEmbeddings(openai_api_key="Your OPEN AI KEY") query = sys.argv[1] ### Your question depth = int(sys.argv[2]) ## Number of documents to retrieve from Vector SEarch name = sys.argv[3] ## Name of your index ### Get Top K files based on nearest neighbor search def query_top_files(query, top_n=4): # Load annoy index and index map t = AnnoyIndex(EMBEDDING_DIM, 'angular') t.load(name+'_ada.ann') index_map = load_index_map() # Get embeddings for the query query_embedding = get_embeddings_for_text(query) # Search in the Annoy index indices, distances = t.get_nns_by_vector(query_embedding, top_n, include_distances=True) # Fetch file paths for these indices (forward index helps) files = [(index_map[idx], dist) for idx, dist in zip(indices, distances)] return files ### Use Open Interpreter to make the call to GPT-4 import interpreter results = query_top_files(query, depth) file_content = "" s = set() print ("Files you might want to read:") for path, dist in results: content = get_file_contents(path) file_content += "Path : " file_content += path if (path not in s): print (path) s.add(path) file_content += "\n" file_content += content print( "open interpreter's recommendation") message = "Take a deep breath. I have a task to complete. Please help with the task below and answer my question. Task : READ THE FILE content below and their paths and answer " + query + "\n" + file_content interpreter.chat(message) print ("interpreter's recommendation done. (Risk: LLMs are known to hallucinate)") Anecdotal Results Langchain Question: Where should I make changes to add a new summarization prompt? The recommended files to change are; refine_prompts.py stuff_prompt.py map_reduce_prompt.py entity_summarization.py All of these files are indeed related to the summarization prompt in langchain. Local GPT Question: Which files should I change, and how do I add support to the new model Falcon 80B? Open interpreter identifies the files to be changed and gives specific step-by-step instruction for adding Falcon 80 b model to the list of models in constants.py and adding support in the user interface of localGPT_UI.py. For specific prompt templates, it recommends to modify the method get_prompt_template in prompt_template_utils.py. The complete code can be found here. Conclusion The advantages of a simple RAG solution like this will help with: Accelerated Onboarding: New contributors can quickly get up to speed with the codebase, reducing the onboarding time. Reduced Errors: With specific guidance, newcomers are less likely to make mistakes or introduce bugs. Increased Engagement: A supportive tool can encourage more contributions from the community, especially those hesitant due to unfamiliarity with the codebase. Continuous Learning: Even for experienced developers, the tool can be a means to discover and learn about lesser-known parts of the codebase.

By Raghavan Muthuregunathan

Empowering ADHD Research With Generative AI: A Developer's Guide to Synthetic Data Generation

Attention Deficit Hyperactivity Disorder (ADHD) presents a complex challenge in the field of neurodevelopmental disorders, characterized by a wide range of symptoms such as inattention, hyperactivity, and impulsivity that significantly affect individuals' daily lives. In the era of digital healthcare transformation, the role of artificial intelligence (AI), and more specifically Generative AI, has become increasingly pivotal. For developers and researchers in the tech and healthcare sectors, this presents a unique opportunity to leverage the power of AI to foster advancements in understanding, diagnosing, and treating ADHD. From a developer's standpoint, the integration of Generative AI into ADHD research is not just about the end goal of improving patient outcomes but also about navigating the intricate process of designing, training, and implementing AI models that can accurately generate synthetic patient data. This data holds the key to unlocking new insights into ADHD without the ethical and privacy concerns associated with using real patient data. The challenge lies in how to effectively capture the complex, multidimensional nature of ADHD symptoms and treatment responses within these models, ensuring they can serve as a reliable foundation for further research and development. Methodology Generative AI refers to a subset of AI algorithms capable of generating new data instances similar but not identical to the training data. This article proposes utilizing Generative Adversarial Networks (GANs) to generate synthetic patient data, aiding in the research and understanding of ADHD without compromising patient privacy. Data Collection and Preprocessing Data will be synthetically generated to resemble real patient data, including symptoms, genetic information, and response to treatment. Preprocessing steps involve normalizing the data and ensuring it is suitable for training the GAN model. Application and Code Sample Model Training The GAN consists of two main components: the Generator, which generates new data instances, and the Discriminator, which evaluates them against real data. The training process involves teaching the Generator to produce increasingly accurate representations of ADHD patient data. Data Generation/Analysis Generated data can be used to identify patterns in ADHD symptoms and responses to treatment, contributing to more personalized and effective treatment strategies. Python from keras.models import Sequential from keras.layers import Dense import numpy as np # Define the generator def create_generator(): model = Sequential() model.add(Dense(units=100, input_dim=100)) model.add(Dense(units=100, activation='relu')) model.add(Dense(units=50, activation='relu')) model.add(Dense(units=5, activation='tanh')) return model # Example synthetic data generation (simplified) generator = create_generator() noise = np.random.normal(0, 1, [100, 100]) synthetic_data = generator.predict(noise) print("Generated Synthetic Data Shape:", synthetic_data.shape) Results The application of Generative AI in ADHD research could lead to significant advancements in personalized medicine, early diagnosis, and the development of new treatment modalities. However, the accuracy of the generated data and the ethical implications of synthetic data use are important considerations. Discussion This exploration opens up possibilities for using Generative AI to understand complex disorders like ADHD more deeply. Future research could focus on refining the models for greater accuracy and exploring other forms of AI to support healthcare professionals in diagnosis and treatment. Conclusion Generative AI has the potential to revolutionize the approach to ADHD by generating new insights and aiding in the development of more effective treatments. While there are challenges to overcome, the benefits to patient care and research could be substantial.

By venkataramaiah gude

Exploring the New Eclipse JNoSQL Version 1.1.0: A Dive Into Oracle NoSQL

In the ever-evolving world of software development, staying up-to-date with the latest tools and frameworks is crucial. One such framework that has been making waves in NoSQL databases is Eclipse JNoSQL. This article will deeply dive into the latest release, version 1.1.0, and explore its compatibility with Oracle NoSQL. Understanding Eclipse JNoSQL Eclipse JNoSQL is a Java-based framework that facilitates seamless integration between Java applications and NoSQL databases. It leverages Java enterprise standards, specifically Jakarta NoSQL and Jakarta Data, to simplify working with NoSQL databases. The primary objective of this framework is to reduce the cognitive load associated with using NoSQL databases while harnessing the full power of Jakarta EE and Eclipse MicroProfile. With Eclipse JNoSQL, developers can easily integrate NoSQL databases into their projects using Widfly, Payara, Quarkus, or other Java platforms. This framework bridges the Java application layer and various NoSQL databases, making it easier to work with these databases without diving deep into their intricacies. What’s New in Eclipse JNoSQL Version 1.1.0 The latest version of Eclipse JNoSQL, 1.1.0, comes with several enhancements and upgrades to make working with NoSQL databases smoother. Let’s explore some of the notable changes: Jakarta Data Version Upgrade In this release, one of the significant updates in Eclipse JNoSQL version 1.1.0 is upgrading the Jakarta Data version to M2. To better understand the importance of this upgrade, let’s delve into what Jakarta Data is and how it plays a crucial role in simplifying data access across various database types. Jakarta Data Jakarta Data is a specification that provides a unified API for simplified data access across different types of databases, including both relational and NoSQL databases. This specification is part of the broader Jakarta EE ecosystem, which aims to offer a standardized and consistent programming model for enterprise Java applications. Jakarta Data empowers Java developers to access diverse data repositories straightforwardly and consistently. It achieves this by introducing concepts like Repositories and custom query methods, making data retrieval and manipulation more intuitive and developer-friendly. One of the key features of Jakarta Data is its flexibility in allowing developers to compose custom query methods on Repository interfaces. This flexibility means that developers can craft specific queries tailored to their application’s needs without manually writing complex SQL or NoSQL queries. This abstraction simplifies the interaction with databases, reducing the development effort required to access and manipulate data. The Goal of Jakarta Data The primary goal of Jakarta Data is to provide a familiar and consistent programming model for data access while preserving the unique characteristics and strengths of the underlying data stores. In other words, Jakarta Data aims to abstract away the intricacies of interacting with different types of databases, allowing developers to focus on their application logic rather than the specifics of each database system. By upgrading Eclipse JNoSQL to use Jakarta Data version M2, the framework aligns itself with the latest Jakarta EE standards and leverages the newest features and improvements introduced by Jakarta Data. It ensures that developers using Eclipse JNoSQL can benefit from the enhanced capabilities and ease of data access that Jakarta Data brings. Enhanced Support for Inheritance One of the standout features of Eclipse JNoSQL is its support for Object-Document Mapping (ODM). It allows developers to work with NoSQL databases in an object-oriented manner, similar to how they work with traditional relational databases. In version 1.1.0, the framework has enhanced its support for inheritance, making it even more versatile when dealing with complex data models. Oracle NoSQL Database: A Brief Overview Before we conclude, let’s take a moment to understand the database we’ll work with – Oracle NoSQL Database. Oracle NoSQL Database is a distributed key-value and document database developed by Oracle Corporation. It offers robust transactional capabilities for data manipulation, horizontal scalability to handle large workloads, and simplified administration and monitoring. It is particularly well-suited for applications that require low-latency access to data, flexible data models, and elastic scaling to accommodate dynamic workloads. The Oracle NoSQL Database Cloud Service provides a managed cloud platform for deploying applications that require the capabilities of Oracle NoSQL Database, making it even more accessible and convenient for developers. Show Me the Code We’ll create a simple demo application to showcase the new features of Eclipse JNoSQL 1.1.0 and its compatibility with Oracle NoSQL. This demo will help you understand how to set up the environment, configure dependencies, and interact with Oracle NoSQL using Eclipse JNoSQL. Prerequisites Before we begin, ensure you have an Oracle NoSQL instance running. You can use either the “primes” or “cloud” flavor. For local development, you can run Oracle NoSQL in a Docker container with the following command: Shell docker run -d --name oracle-instance -p 8080:8080 ghcr.io/oracle/nosql:latest-ce Setting up the Project We’ll create a Java SE project using the Maven Quickstart Archetype to keep things simple. It will give us a basic project structure to work with. Project Dependencies For Eclipse JNoSQL to work with Oracle NoSQL, we must include specific dependencies. Additionally, we’ll need Jakarta CDI, Eclipse MicroProfile, Jakarta JSONP, and the Eclipse JNoSQL driver for Oracle NoSQL. We’ll also include “datafaker” for generating sample data. Here are the project dependencies: XML <dependencies> <dependency> <groupId>org.eclipse.jnosql.databases</groupId> <artifactId>jnosql-oracle-nosql</artifactId> <version>1.1.0</version> </dependency> <dependency> <groupId>net.datafaker</groupId> <artifactId>datafaker</artifactId> <version>2.0.2</version> </dependency> </dependencies> Configuration Eclipse JNoSQL relies on configuration properties to establish a connection to the database. This is where the flexibility of Eclipse MicroProfile Config shines. You can conveniently define these properties in your application.properties or application.yml file, allowing for easy customization of your database settings. Remarkably, Eclipse JNoSQL caters to key-value and document databases, a testament to its adaptability. Despite Oracle NoSQL's support for both data models, the seamless integration and configuration options provided by Eclipse JNoSQL ensure a smooth experience, empowering developers to effortlessly switch between these database paradigms to meet their specific application needs. Properties files # Oracle NoSQL Configuration jnosql.keyvalue.database=cars jnosql.document.database=cars jnosql.oracle.nosql.host=http://localhost:8080 Creating a Model for Car Data in Eclipse JNoSQL After setting up the database configuration, the next step is to define the data model to be stored in it. The process of defining a model is consistent across all databases in Eclipse JNoSQL. For instance, in this example, we will form a data model for cars using a basic Car class. Java @Entity public class Car { @Id private String id; @Column private String vin; @Column private String model; @Column private String make; @Column private String transmission; // Constructors, getters, and setters @Override public boolean equals(Object o) { if (this == o) { return true; } if (o == null || getClass() != o.getClass()) { return false; } Car car = (Car) o; return Objects.equals(id, car.id); } @Override public int hashCode() { return Objects.hashCode(id); } @Override public String toString() { return "Car{" + "id='" + id + '\'' + ", vin='" + vin + '\'' + ", model='" + model + '\'' + ", make='" + make + '\'' + ", transmission='" + transmission + '\'' + '}'; } // Factory method to create a Car instance public static Car of(Faker faker) { Vehicle vehicle = faker.vehicle(); Car car = new Car(); car.id = UUID.randomUUID().toString(); car.vin = vehicle.vin(); car.model = vehicle.model(); car.make = vehicle.make(); car.transmission = vehicle.transmission(); return car; } } In the Car class, we use annotations to define how the class and its fields should be persisted in the database: @Entity: This annotation marks the class as an entity to be stored in the database. @Id: Indicates that the id field will serve as the unique identifier for each Car entity. @Column: Annotations like @Column specify that a field should be persisted as a column in the database. In this case, we annotate each field we want to store in the database. Additionally, we provide methods for getters, setters, equals, hashCode, and toString for better encapsulation and compatibility with database operations. We also include a factory method Car.of(Faker faker), to generate random car data using the “datafaker” library. This data model encapsulates the structure of a car entity, making it easy to persist and retrieve car-related information in your Oracle NoSQL database using Eclipse JNoSQL. Simplifying Database Operations With Jakarta Data Annotations In Eclipse JNoSQL’s latest version, 1.1.0, developers can harness the power of Jakarta Data annotations to streamline and clarify database operations. These annotations allow you to express your intentions in a more business-centric way, making your code more expressive and closely aligned with the actions you want to perform on the database. Here are some of the Jakarta Data annotations introduced in this version: @Insert: Effortless Data Insertion The @Insert annotation signifies the intent to perform an insertion operation in the database. When applied to a method, it indicates that it aims to insert data into the database. This annotation provides clarity and conciseness, making it evident that the method is responsible for adding new records. @Update: Seamless Data Update The @Update annotation is used to signify an update operation. It is beneficial when you want to modify existing records in the database. Eclipse JNoSQL will check if the information to be updated is already present and proceed accordingly. This annotation simplifies the code by explicitly stating its purpose. @Delete: Hassle-Free Data Removal When you want to delete data from the database, the @Delete annotation comes into play. It communicates that the method’s primary function is to remove information. Like the other annotations, it enhances code readability by conveying the intended action. @Save: Combining Insert and Update The @Save annotation serves a dual purpose. It behaves like the save method in BasicRepository but with added intelligence. It checks if the information is already in the database, and if so, it updates it; otherwise, it inserts new data. This annotation provides a convenient way to handle insertion and updating without needing separate methods. With these Jakarta Data annotations, you can express database operations in a more intuitive and business-centric language. In the context of a car-centric application, such as managing a garage or car collection, you can utilize these annotations to define operations like parking a car and unparking it: Java @Repository public interface Garage extends DataRepository<Car, String> { @Save Car parking(Car car); @Delete void unpark(Car car); Page<Car> findByTransmission(String transmission, Pageable page); } In this example, the @Save annotation is used for parking a car, indicating that this method handles both inserting new cars into the “garage” and updating existing ones. The @Delete annotation is employed for unparking, making it clear that this method is responsible for removing cars from the “garage.” These annotations simplify database operations and enhance code clarity and maintainability by aligning your code with your business terminology and intentions. Executing Database Operations With Oracle NoSQL Now that our entity and repository are set up let’s create classes to execute the application. These classes will initialize a CDI container to inject the necessary template classes for interacting with the Oracle NoSQL database. Interacting With Document Database As a first step, we’ll interact with the document database. We’ll inject the DocumentTemplate interface to perform various operations: Java public static void main(String[] args) { Faker faker = new Faker(); try (SeContainer container = SeContainerInitializer.newInstance().initialize()) { DocumentTemplate template = container.select(DocumentTemplate.class).get(); // Insert 10 random cars into the database for (int index = 0; index < 10; index++) { Car car = Car.of(faker); template.insert(car); } // Retrieve and print all cars template.select(Car.class).stream().toList().forEach(System.out::println); // Retrieve and print cars with Automatic transmission, ordered by model (descending) template.select(Car.class).where("transmission").eq("Automatic").orderBy("model").desc() .stream().forEach(System.out::println); // Retrieve and print cars with CVT transmission, ordered by make (descending) template.select(Car.class).where("transmission").eq("CVT").orderBy("make").desc() .stream().forEach(System.out::println); } System.exit(0); } In this code, we use the DocumentTemplate to insert random cars into the document database, retrieve and print all cars, and execute specific queries based on transmission type and ordering. Interacting With Key-Value Database Oracle NoSQL also supports a key-value database, and we can interact with it as follows: Java public static void main(String[] args) { Faker faker = new Faker(); try (SeContainer container = SeContainerInitializer.newInstance().initialize()) { KeyValueTemplate template = container.select(KeyValueTemplate.class).get(); // Create a random car and put it in the key-value database Car car = Car.of(faker); template.put(car); // Retrieve and print the car based on its unique ID System.out.println("The query result: " + template.get(car.id(), Car.class)); // Delete the car from the key-value database template.delete(car.id()); // Attempt to retrieve the deleted car (will return null) System.out.println("The query result: " + template.get(car.id(), Car.class)); } System.exit(0); } In this code, we utilize the KeyValueTemplate to put a randomly generated car into the key-value database, retrieve it by its unique ID, delete it, and attempt to retrieve it again (resulting in null since it’s been deleted). These examples demonstrate how to execute database operations seamlessly with Oracle NoSQL, whether you’re working with a document-oriented or key-value database model, using Eclipse JNoSQL’s template classes. In this final sampling execution, we will demonstrate how to interact with the repository using our custom repository interface. This approach simplifies database operations and makes them more intuitive, allowing you to work with your custom-defined terminology and actions. Java public static void main(String[] args) { Faker faker = new Faker(); try (SeContainer container = SeContainerInitializer.newInstance().initialize()) { Garage repository = container.select(Garage.class, DatabaseQualifier.ofDocument()).get(); // Parking 10 random cars in the repository for (int index = 0; index < 10; index++) { Car car = Car.of(faker); repository.parking(car); } // Park a car and then unpark it Car car = Car.of(faker); repository.parking(car); repository.unpark(car); // Retrieve the first page of cars with CVT transmission, ordered by model (descending) Pageable page = Pageable.ofPage(1).size(3).sortBy(Sort.desc("model")); Page<Car> page1 = repository.findByTransmission("CVT", page); System.out.println("The first page"); page1.forEach(System.out::println); // Retrieve the second page of cars with CVT transmission System.out.println("The second page"); Pageable secondPage = page.next(); Page<Car> page2 = repository.findByTransmission("CVT", secondPage); page2.forEach(System.out::println); System.out.println("The query result: "); } System.exit(0); } We create a Garage instance through the custom repository interface in this code. We then demonstrate various operations such as parking and unparking cars and querying for cars with specific transmission types, sorted by model and paginated. You can express database operations in a business-centric language by utilizing the repository interface with custom annotations like @Save and @Delete. This approach enhances code clarity and aligns with your domain-specific terminology, providing a more intuitive and developer-friendly way to interact with the database. Conclusion Eclipse JNoSQL 1.1.0, with its support for Oracle NoSQL databases, simplifies and streamlines the interaction between Java applications and NoSQL data stores. With the introduction of Jakarta Data annotations and custom repositories, developers can express database operations in a more business-centric language, making code more intuitive and easier to maintain. This article has covered the critical aspects of Eclipse JNoSQL’s interaction with Oracle NoSQL, including setting up configurations, creating data models, and executing various database operations. Whether you are working with document-oriented or key-value databases, Eclipse JNoSQL provides the necessary tools and abstractions to make NoSQL data access a breeze. To dive deeper into the capabilities of Eclipse JNoSQL and explore more code samples, check out the official repository. There, you will find a wealth of information, examples, and resources to help you leverage the power of Eclipse JNoSQL in your Java applications. Eclipse JNoSQL empowers developers to harness the flexibility and scalability of NoSQL databases while adhering to Java enterprise standards, making it a valuable tool for modern application development.

By Otavio Santana

CORE

Norm of a One-Dimensional Tensor in Python Libraries

The calculation of the norm of vectors is essential in both artificial intelligence and quantum computing for tasks such as feature scaling, regularization, distance metrics, convergence criteria, representing quantum states, ensuring unitarity of operations, error correction, and designing quantum algorithms and circuits. You will learn how to calculate the Euclidean (norm/distance), also known as the L2 norm, of a single-dimensional (1D) tensor in Python libraries like NumPy, SciPy, Scikit-Learn, TensorFlow, and PyTorch. Understand Norm vs Distance Before we begin, let's understand the difference between Euclidean norm vs Euclidean distance. Norm is the distance/length/size of the vector from the origin (0,0). Distance is the distance/length/size between two vectors. Prerequisites Install Jupyter. Run the code below in a Jupyter Notebook to install the prerequisites. Python # Install the prerequisites for you to run the notebook !pip install numpy !pip install scipy %pip install torch !pip install tensorflow You will use Jupyter Notebook to run the Python code cells to calculate the L2 norm in different Python libraries. Let's Get Started Now that you have Jupyter set up on your machine and installed the required Python libraries, let's get started by defining a 1D tensor using NumPy. NumPy NumPy is a Python library used for scientific computing. NumPy provides a multidimensional array and other derived objects. Tensor ranks Python # Define a single dimensional (1D) tensor import numpy as np vector1 = np.array([3,7]) #np.random.randint(1,5,2) vector2 = np.array([5,2]) #np.random.randint(1,5,2) print("Vector 1:",vector1) print("Vector 2:",vector2) print(f"shape & size of Vector1 & Vector2:", vector1.shape, vector1.size) Print the vectors Plain Text Vector 1: [3 7] Vector 2: [5 2] shape & size of Vector1 & Vector2: (2,) 2 Matplotlib Matplotlib is a Python visualization library for creating static, animated, and interactive visualizations. You will use Matplotlib's quiver to plot the vectors. Python # Draw the vectors using MatplotLib import matplotlib.pyplot as plt %matplotlib inline origin = np.array([0,0]) plt.quiver(*origin, vector1[0],vector1[1], angles='xy', color='r', scale_units='xy', scale=1) plt.quiver(*origin, vector2[0],vector2[1], angles='xy', color='b', scale_units='xy', scale=1) plt.plot([vector1[0],vector2[0]], [vector1[1],vector2[1]], 'go', linestyle="--") plt.title('Vector Representation') plt.xlim([0,10]) plt.ylim([0,10]) plt.grid() plt.show() Vector representation using Matplolib Python # L2 (Euclidean) norm of a vector # NumPy norm1 = np.linalg.norm(vector1, ord=2) print("The magnitude / distance from the origin",norm1) norm2 = np.linalg.norm(vector2, ord=2) print("The magnitude / distance from the origin",norm2) The output once you run this in the Jupyter Notebook: Plain Text The magnitude / distance from the origin 7.615773105863909 The magnitude / distance from the origin 5.385164807134504 SciPy SciPy is built on NumPy and is used for mathematical computations. If you observe, SciPy uses the same linalg functions as NumPy. Python # SciPy import scipy norm_vector1 = scipy.linalg.norm(vector1, ord=2) print("L2 norm in scipy for vector1:", norm_vector1) norm_vector2 = scipy.linalg.norm(vector2, ord=2) print("L2 norm in scipy for vector2:", norm_vector2) Output: Plain Text L2 norm in scipy for vector1: 7.615773105863909 L2 norm in scipy for vector2: 5.385164807134504 Scikit-Learn As the Scikit-learn documentation says: Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities. We reshape the vector as Scikit-learn expects the vector to be 2-dimensional. Python # Sklearn from sklearn.metrics.pairwise import euclidean_distances vector1_reshape = vector1.reshape(1,-1) ## Scikit-learn expects the vector to be 2-Dimensional euclidean_distances(vector1_reshape, [[0, 0]])[0,0] Output Plain Text 7.615773105863909 TensorFlow TensorFlow is an end-to-end machine learning platform. Python # TensorFlow import os os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1' import tensorflow as tf print("TensorFlow version:", tf.__version__) ## Tensorflow expects Tensor of types float32, float64, complex64, complex128 vector1_tf = vector1.astype(np.float64) tf_norm = tf.norm(vector1_tf, ord=2) print("Euclidean(l2) norm in TensorFlow:",tf_norm.numpy()) Output The output prints the version of TensorFlow and the L2 norm: Plain Text TensorFlow version: 2.15.0 Euclidean(l2) norm in TensorFlow: 7.615773105863909 PyTorch PyTorch is an optimized tensor library for deep learning using GPUs and CPUs. Python # PyTorch import torch print("PyTorch version:", torch.__version__) norm_torch = torch.linalg.norm(torch.from_numpy(vector1_tf), ord=2) norm_torch.item() The output prints the PyTorch version and the norm: Plain Text PyTorch version: 2.1.2 7.615773105863909 Euclidean Distance Euclidean distance is calculated in the same way as a norm, except that you calculate the difference between the vectors before passing the difference - vector_diff, in this case, to the respective libraries. Python # Euclidean distance between the vectors import math vector_diff = vector1 - vector2 # Using norm euclidean_distance = np.linalg.norm(vector_diff, ord=2) print(euclidean_distance) # Using dot product norm_dot = math.sqrt(np.dot(vector_diff.T,vector_diff)) print(norm_dot) Output Output using the norm and dot functions of NumPy libraries: Plain Text 5.385164807134504 5.385164807134504 Python # SciPy from scipy.spatial import distance distance.euclidean(vector1,vector2) Output Using SciPy 5.385164807134504 The Jupyter Notebook with the outputs is available on the GitHub repository. You can run the Jupyter Notebook on Colab following the instructions on the GitHub repository.

By Vidyasagar (Sarath Chandra) Machupalli

CORE

The Curse of Simplicity: The Simplest Doesn’t Mean the Least Sophisticated

It is often said that software developers should create simple solutions to the problems that they are presented with. However, coming up with a simple solution is not always easy, as it requires time, experience, and a good approach. And to make matters worse, a simple solution in many ways will not impress your co-workers or give your resume a boost. Ironically, the quest for simplicity in software development is often a complex journey. A developer must navigate through a labyrinth of technical constraints, user requirements, and evolving technological landscapes. The catch-22 is palpable: while a simple solution is desirable, it is not easily attained nor universally appreciated. In the competitiveness of software development, where complexity often disguises itself as sophistication, simple solutions may not always resonate with the awe and admiration they deserve. They may go unnoticed in a culture that frequently equates complexity with competence. Furthermore, the pursuit of simplicity can sometimes be a thankless endeavor. In an environment where complex designs and elaborate architectures are often celebrated, a minimalist approach might not captivate colleagues or stand out in a portfolio. This dichotomy presents a unique challenge for software developers who want to balance the art of simplicity with the practicalities of career advancement and peer recognition. As we get closer to the point of this discussion, I will share my personal experiences in grappling with the "curse of simplicity." These experiences shed light on the nuanced realities of being a software developer committed to simplicity in a world that often rewards complexity. The Story Several years ago, I was part of a Brazilian startup confronted with a daunting issue. The accounting report crucial for tax payment to São Paulo's city administration had been rendered dysfunctional due to numerous changes in its data sources. These modifications stemmed from shifts in payment structures with the company's partners. The situation escalated when the sole analyst responsible for manually generating the report went on vacation, leaving the organization vulnerable to substantial fines from the city hall. To solve the problem, the company’s CFO called a small committee to forward a solution. In advocating for a resolution, I argued against revisiting the complex, defunct legacy solution and proposed a simpler approach. I was convinced that we needed "one big table" with all the columns necessary for the report and that each row should have the granularity of a transaction. This way, the report could be generated by simply flattening the data in a simple query. Loading the data into this table should be done by a simple, secure, and replicable process. My team concurred with my initial proposal and embarked on its implementation, following two fundamental principles: The solution had to be altruistic and crafted for others to utilize and maintain. It had to be code-centric, with automated deployment and code reviews through Pull Requests (PR). We selected Python as our programming language due to its familiarity with the data analysis team and its reputation for being easy to master. In our tool exploration, we came across Airflow, which had been gaining popularity even before its version 1.0 release. Airflow employs DAGs (Direct Acyclic Graphs) to construct workflows, where each step is executed via what is termed "operators." Our team developed two straightforward operators: one for transferring data between tables in different databases, and another for schema migration. This approach allowed for local testing of DAG changes, with the deployment process encompassing Pull Requests followed by a CI/CD pipeline that deployed changes to production. The schema migration operator bore a close resemblance to the implementation in Ruby on Rails migration. We hosted Airflow on AWS Elastic Beanstalk, and Jenkins was employed for the deployment pipeline. During this period, Metabase was already operational for querying databases. Within a span of two to three weeks, our solution was up and running. The so-called "one big table," effectively provided the accounting report. It was user-friendly and, most crucially, comprehensible to everyone involved. The data analysis team, thrilled by the architecture, began adopting this infrastructure for all their reporting needs. A year down the line, the landscape had transformed significantly, with dozens of DAGs in place, hundreds of reporting tables created, and thousands of schema migration files in existence. Synopsis of the Solution In essence, our simple solution might not have seemed fancy, but it was super effective. It allowed the data analysis team to generate reports more quickly and easily, and it saved the company money on fines. The concept of the "curse of simplicity" in software development is a paradoxical phenomenon. It suggests that solutions that appear simple on the surface are often undervalued, especially when compared to their more complex counterparts, which I like to refer to as "complex megazords." This journey of developing a straightforward yet effective solution was an eye-opener for me, and it altered my perspective on the nature of simplicity in problem-solving. There's a common misconception that simple equates to easy. However, the reality is quite the contrary. In reality, as demonstrated by the example I have provided, crafting a solution that is both simple and effective requires a deep understanding of the problem, a sophisticated level of knowledge, and a wealth of experience. It's about distilling complex ideas and processes into their most essential form without losing their effectiveness. What I've come to realize is that simple solutions, though they may seem less impressive at first glance, are often superior. Their simplicity makes them more accessible and easier to understand, maintain, and use. This accessibility is crucial in a world where technology is rapidly evolving and there is a need for user-friendly, maintainable solutions.

By Gustavo Ribeiro Amigo

The Noticeable Shift in SIEM Data Sources

SIEM solutions didn't work perfectly well when they were first introduced in the early 2000s, partly because of their architecture and functionality at the time but also due to the faults in the data and data sources that were fed into them. During this period, data inputs were often rudimentary, lacked scalability, and necessitated extensive manual intervention across operational phases. Three of those data sources stood out. 1. Hand-Coded Application Layer Security Coincidentally, application layer security became a thing when SIEM solutions were first introduced. Around that time, it became obvious that defending the perimeter, hosts, and endpoints was not sufficient security for applications. Some developers experimented with manually coding application security layers to bolster protection against functionality-specific attacks. While this approach provided an additional security layer, it failed to provide SIEM solutions with accurate data due to developers' focus on handling use cases rather than abuse cases. This was because the developers were accustomed to writing code to handle use cases, not abuse cases. So, they weren’t experienced and didn’t have the experience or knowledge to anticipate all likely attacks and write complex codes to collect or authorize access to data related to those attacks. Moreover, many sophisticated attacks necessitated correlating events across multiple applications and data sources, which was beyond the monitoring of individual applications and their coding capabilities. 2. SPAN and TAP Ports SPAN ports, also known as mirror ports or monitor ports, were configured on network switches or routers to copy and forward traffic from one or more source ports to a designated monitoring port. They operated within the network infrastructure and allowed admins to monitor network traffic without disrupting the flow of data to the intended destination. On the other hand, TAP ports were hardware devices that passively captured and transmitted network traffic from one network segment to another. TAP operated independently of network switches and routers but still provided complete visibility into network traffic regardless of network topology or configuration. Despite offering complete visibility into network traffic, these ports fell out of favor in SIEM integration due to their deficiency in contextual information. The raw packet data that SPAN and TAP ports collected lacked the necessary context for effective threat detection and analysis, alongside challenges such as limited network visibility, complex configuration, and inadequate capture of encrypted traffic. 3. The 2000s REST API As a successor to SOAP API, REST API revolutionized data exchange with its simplicity, speed, efficiency, and statelessness. Aligned with the rise of cloud solutions, REST API served as an ideal conduit between SIEM and cloud environments, offering standardized access to diverse data sources. However, it had downsides: one of which was its network efficiency issues. REST APIs sometimes over-fetched or under-fetched data, which resulted in inefficient data transfer between the API and the SIEM solution. There were also the issues of evolving schemas in REST APIs. Without a strongly typed schema, SIEM solutions found it difficult to accurately map incoming data fields to the predefined schema, leading to parsing errors or data mismatches. Then there was the issue of its complexity and learning curve. REST API implementation is known to be complex, especially in managing authentication, pagination, rate limiting, and error handling. Because of this complexity, security analysts and admins responsible for configuring SIEM data sources found it difficult or even required additional training to handle its integrations effectively. This also led to configuration errors, which then affected data collection and analysis. While some of the above data sources have not been completely scrapped out of use, their technologies have been greatly improved, and they now have seamless integrations. Most Recently Used SIEM Data Sources 1. Cloud Logs The cloud was introduced in 2006 when Amazon launched AWS EC2, followed shortly by Salesforce's service cloud solution in 2009. It offers unparalleled scalability, empowering organizations to manage vast volumes of log data effortlessly. Additionally, it provides centralized logging and monitoring capabilities, streamlining data collection and analysis for SIEM solutions. With built-in security features and compliance controls, cloud logs enable SIEM solutions to swiftly detect and respond to security threats. However, challenges accompany these advantages. According to Adam Praksch, a SIEM administrator at IBM, SIEM solutions often struggle to keep pace with the rapid evolution of cloud solutions, resulting in the accumulation of irrelevant events or inaccurate data. Furthermore, integrating SIEM solutions with both on-premises and cloud-based systems increases complexity and cost, as noted by Mohamed El Bagory, a SIEM Technical Instructor at LogRhythm. Notwithstanding, El Bagory acknowledged the vast potential of cloud data for SIEM solutions, emphasizing the need to explore beyond basic information from SSH logins and Chrome tabs to include data from command lines and process statistics. 2. IoT Device Logs As Praksch rightly said, any IT or OT technology that creates logs or reports about its operation is already used for security purposes. This is because IoT devices are known to generate a wealth of rich data about their operations, interactions, and environments. IoT devices, renowned for producing diverse data types such as logs, telemetry, and alerts, are considered a SIEM solutions’s favorite data source. This data diversity allows SIEM solutions to analyze different aspects of the network and identify anomalies or suspicious behavior. Conclusion In conclusion, as Praksch rightly said, "The more data a SIEM solution can work with, the higher its chances of successfully monitoring an organization's environment against cyber threats." So, while most SIEM data sources date back to the inception of the technology, they have gone through several evolution stages to make sure they are extracting accurate and meaningful data for threat detection.

By Diamaka Aniagolu

Data Engineering

Functions of Data Engineering

AI/ML

Big Data

Data

Databases

IoT

DZone's Featured Data Engineering Resources

The Latest Data Engineering Topics