Microsoft's Azure provides many services via a single cloud, which lets them offer one solution for multiple corporate infrastructures. Development teams often use Azure because they value the opportunity to run SQL databases in the cloud and complete simple operations via the Azure portal. But you'll need to have a way to back up your data, as it's crucial to ensuring the functionality of the production site and the stability of everyday workflows. So, creating Azure SQL backups can help you and your team avoid data loss emergencies and have the shortest possible downtime while maintaining control over the infrastructure. Another reason to have a current Azure database backup is Microsoft’s policy. Microsoft uses the shared responsibility model, which makes the user responsible for data integrity and recovery, while Microsoft only ensures the availability of its services. Microsoft directly recommends using third-party solutions to create database backups. In case you run a local SQL Server, you'll need to prepare for the possibility of hardware failures that may result in data loss and downtime. An SQL database on Azure helps mitigate that risk, although it's still prone to human errors or cloud-specific threats like malware. These and other threats make enabling Azure SQL database backups necessary for any organization using Microsoft’s service to manage and process data. In this tutorial, you'll learn about backing up Azure databases and restoring your data on demand with native instruments provided by Microsoft, including methods like: Built-in Azure database backup functionality Cloud archiving Secondary database and table management Linked server Stretch Database Why Backup Your SQL Azure Database? Although I covered this briefly in the intro, there are many reasons to back up your SQL Azure database data. Disaster Recovery Data centers can be damaged or destroyed by planned cyberattacks, random malware infiltration (check out this article to discover more on ransomware protection), and natural disasters like floods or hurricanes, among others. Backups can be used to recover data and restore operations after various disaster cases swiftly. Data Loss Prevention Data corruption, hardware failure, and accidental or malicious deletion lead to data loss and can threaten an organization. Backup workflows set up to run regularly mean you can quickly recover the data that was lost or corrupted. Compliance and Regulations Compliance requirements and legislative regulations can be severe regardless of your organization’s industry. Mostly, laws require you to keep up with security and perform regular backups for compliance. Testing and Development You can use backups to create Azure database copies for development, troubleshooting, or testing. Thus, you can fix, develop, or improve your organization’s workflows without involving the production environment. How To Back Up Your Azure SQL Database Backing up your Azure SQL database can be challenging if you go through the process without preparation. So that's why I wrote this guide – to help you be prepared. Here's what we'll cover in the following sections: Requirements for SQL Azure database backup. How to configure database backups in Azure with native tools. Cloud archiving Backup verification and data restoration SQL Azure Database Backup Requirements Before backing up your SQL Azure databases, you need to create and configure Azure storage. 
Before you do that, you'll need to go through the following steps:

1. Open the Azure management portal and find Create a Resource. Then go to Storage > Storage account.
2. Provide the required information, including the location and the names of the storage account and resource group, according to your preferences. After you enter the information, hit Next.
3. Go to the Advanced section for additional settings. The optimal choice is to set "Secure transfer required" to Enabled and "Allow access from" to All networks. For more resilience in case of human error, you can set "Blob soft delete" to Enabled; with that setting, you can quickly correct accidental deletions in the storage account.
4. Specify the tags you need to simplify navigating through your infrastructure.
5. Check the settings once more. If everything is configured correctly, hit Create. Your new storage account is now created.

Once the storage account is created, it's time to configure a container for the backup data. Go to the storage account, find Containers, then hit the + Container tab. Specify a name for the new container and set the Public access level to Private (no anonymous access). You can then use the container as backup storage (.bak files will be stored there).

Azure Database Backup Configuration

Now everything is set up for you to back up your SQL Azure database. Do the following to create a database backup:

1. Go to SQL Server Management Studio and establish a connection with the SQL server.
2. Right-click the database that should be backed up. In the context menu that appears, go to Tasks, then hit Back Up….
3. Find the Destination tab and set the "Back up to" option to URL. After that, hit New container.
4. Sign in to your Azure subscription when prompted, provide your credentials, pick the container you created before, and hit OK.
5. The configured backup destination URL is now listed. To start the workflow to back up your Azure data, hit OK once again.

When your SQL Azure database backup is completed, the following message shows up: "The backup of database 'your database name' completed successfully." The backup file in the target container should now be visible from the Azure portal.

Keep in mind that when uploading backups to any cloud storage, you may face issues if your network connection is not fast enough. If that's the case for you, you can reorganize your backup workflows: send backup data to a physical storage drive first and then send another copy to the cloud. That way, you can prevent operational challenges that might appear due to limited network bandwidth.

Cloud Archiving for Azure Database Backups

Databases tend to grow in volume as the organization grows, which means the storage space required for the data and its backups increases significantly. The larger data volume also prolongs full backup workflows, posing another challenge. Of course, the first way to get more storage space is to revise your data regularly and erase records that are irrelevant, outdated, or otherwise unnecessary. Still, it's sometimes difficult to determine whether data will become unnecessary or irrelevant, especially when dealing with compliance issues.
To keep your organization compliant in any case, data archiving can solve two problems at once: it keeps the data accessible on one hand and saves storage space on the other. To archive your SQL database in the cloud, first save the database copy to an Azure blob container. Then, to move the newly created blob to the archive tier in the Azure portal, do the following:

1. Go to the container where the SQL database copy is stored.
2. Choose the blob that you need to move.
3. Hit Change tier.
4. In the Access tier dropdown menu, choose Archive.
5. Hit Save.

Additionally, the Archive storage tier is the most affordable one in Azure, meaning that you can reduce your database data TCO with it.

Secondary Database and Table Management

Several workflows can help you set up Azure database backup archiving for your organization. When you need the data to stay in the initial database, for instance, creating a separate table and moving the data there can be your choice. However, the filegroup of that table should stay apart from the main database and be moved to a separate disk whenever possible. Most probably, you'll want to let users access the data you move to the separate table. To make that happen, you can create a view merging the relevant tables and redirect requests to that view instead of the original table. That way, you keep the data accessible while making maintenance faster.

SQL Server Linking

If you can't move the data to another database for internal reasons, such as special Azure backup policies, you can consider maintaining your primary database accordingly. The outcome here is similar to the previous case, but you need to link the SQL servers or configure your apps so they can send requests directly to the second server. The downside is that the SQL database, which was supposed to be a backup one, becomes a production database and gains corresponding importance for the organization. There are two ways to create linked servers via SQL Server Management Studio (SSMS): the sp_addlinkedserver (Transact-SQL) system stored procedure that creates a linked server, or the SSMS GUI. Once you've ensured that you have appropriate access rights on both server instances you need to link, that the network is configured so the instances can reach each other, and that SSMS is installed, go through the following steps:

1. Open SSMS and connect to the instance where you need to establish the linked server.
2. Find Object Explorer > Server Objects, then right-click Linked Servers and pick New Linked Server from the dropdown.
3. Configure the server properties, including name, server type, provider, and product name.
4. Complete the security configuration, set up the server options, and test the connection.

Original Data Deletion

When you don't need 24/7 data availability but must keep the data due to internal policies or compliance requirements, you can choose what's probably the simplest way to increase storage efficiency: back up the data that can stay unavailable and then delete the originals from the main database. Accessing any records you may need will still be possible via the backup.

Stretch Database

Aiming to simplify data management for organizations' databases, Microsoft introduced the Stretch Database feature in SQL Server 2016.
With this feature, you can get an SQL backup to Azure after you send the data from the hosted database to an Azure SQL database. The method enables you to increase overall infrastructure cost-efficiency by simplifying backup workflows. To enable this workflow in your environment, develop the policy specifying the data on a hosted server to send to Azure. You don’t need to introduce any changes in applications that use the production database: SQL Server can independently get the records from the Azure SQL Database. Azure Database Backup Verification and Restoration During an SQL Azure database backup, you can choose to create such backups WITH CHECKSUMS or without them. When the workflow is complete, I recommend you use the following command: RESTORE VERIFYONLY. This command enables you to check the recoverability of backup files. To access the data, you can restore records from a backup to a different database. With Azure Automation scripts on backups, you can accelerate the restoration process, thus minimizing downtime and increasing the overall resilience of your Azure infrastructure. You need to follow only a few steps to restore an Azure SQL database to a required recovery point from a backup. Still, keep in mind that your subscription can define the available retention period, which can vary from 7 to 35 days. A native tool for backup restoration to SQL servers is Server Management Studio. To Conclude The critical nature of Azure SQL database data makes Azure SQL backups obligatory for any organization that uses this Microsoft solution. In this guide, we reviewed the process of creating SQL Azure database backup using native Microsoft tools. These tools provide data backup, backup verification, and recovery functionality, along with some automation.
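For reference, the checksum and verification options described in the section above can be expressed as a pair of T-SQL statements. This is a hedged sketch; the database name and storage URL are placeholders, not values from this article:

SQL
-- Create the backup with page checksums, then verify that the backup file is readable
-- and complete without actually restoring it.
BACKUP DATABASE [YourDatabase]
TO URL = 'https://yourstorageaccount.blob.core.windows.net/yourcontainer/YourDatabase.bak'
WITH CHECKSUM;

RESTORE VERIFYONLY
FROM URL = 'https://yourstorageaccount.blob.core.windows.net/yourcontainer/YourDatabase.bak'
WITH CHECKSUM;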
Here, I am going to show the power of SQL*Loader combined with a Unix shell script, where multiple data files are loaded by SQL*Loader through an automated script. This is useful when dealing with large chunks of data that need to be moved from one system to another, for example in a migration project involving large volumes of historical data. In that situation, it is not practical to run SQL*Loader manually for each file and wait until it's loaded. The better option is to keep a Unix program containing the SQL*Loader command running all the time: as soon as any file appears in the folder location, it picks it up and starts processing immediately.

The Setup

I built the sample on a MacBook, and installing Oracle there differs from a Windows machine, so please go through the video that contains the detailed steps for installing Oracle on a Mac. Get SQL Developer with Java 8 compliance. Now let's walk through the example.

Loading Multiple Data Files Into an Oracle DB Table

Because it is a MacBook, I have to do all the work inside the Oracle virtual machine. The diagram below shows how SQL*Loader works.

Use Case

We need to load millions of students' records into the student table using shell script + SQL*Loader automation. The script runs continuously on the Unix server and polls for .dat files; once a DAT file is in place, it processes it. Any bad data also needs to be identified separately. This type of example is useful in a migration project where millions of historical records must be loaded: the old system periodically generates a live feed (a DAT file) and sends it to the new system's server, where the automation script picks the file up and loads it into the database.

Now let's run the script. To keep it running on the Unix server all the time, the whole logic is wrapped in the loop below:

Plain Text
while true
do
  [some logic]
done

The Process

1. I copied all the files and the folder structure into the folder below: /home/oracle/Desktop/example-SQLdr

2. Refer to the listing below (ls -lrth):

Shell
-rwxr-xr-x. 1 oracle oinstall  147 Jul 23  2022 student.ctl
-rwxr-xr-x. 1 oracle oinstall   53 Jul 23  2022 student_2.dat
-rwxr-xr-x. 1 oracle oinstall  278 Dec  9 12:42 student_1.dat
drwxr-xr-x. 2 oracle oinstall   48 Dec 24 09:46 BAD
-rwxr-xr-x. 1 oracle oinstall 1.1K Dec 24 10:10 TestSqlLoader.sh
drwxr-xr-x. 2 oracle oinstall   27 Dec 24 11:33 DISCARD
-rw-------. 1 oracle oinstall 3.5K Dec 24 11:33 nohup.out
drwxr-xr-x. 2 oracle oinstall 4.0K Dec 24 11:33 TASKLOG
-rwxr-xr-x. 1 oracle oinstall    0 Dec 24 12:25 all_data_file_list.unx
drwxr-xr-x. 2 oracle oinstall    6 Dec 24 12:29 ARCHIVE

3. As shown below, there is no data in the student table yet.

4. Now run the script with nohup ./TestSqlLoader.sh so it keeps running on the Unix server (its output goes to nohup.out).

5. The script runs and loads the two .dat files through SQL*Loader.

6. The table should now be loaded with the content of both files.

7. Now I delete the table data again. Just to prove the script is running all the time on the server, I will place the two DAT files from ARCHIVE back into the current directory.

8. Place the two data files in the current directory again:

Shell
-rwxr-xr-x. 1 oracle oinstall  147 Jul 23  2022 student.ctl
-rwxr-xr-x. 1 oracle oinstall   53 Jul 23  2022 student_2.dat
-rwxr-xr-x. 1 oracle oinstall  278 Dec  9 12:42 student_1.dat
drwxr-xr-x. 2 oracle oinstall   48 Dec 24 09:46 BAD
-rwxr-xr-x. 1 oracle oinstall 1.1K Dec 24 10:10 TestSqlLoader.sh
drwxr-xr-x. 2 oracle oinstall   27 Dec 24 12:53 DISCARD
-rw-------. 1 oracle oinstall 4.3K Dec 24 12:53 nohup.out
drwxr-xr-x. 2 oracle oinstall 4.0K Dec 24 12:53 TASKLOG
-rwxr-xr-x. 1 oracle oinstall    0 Dec 24 13:02 all_data_file_list.unx
drwxr-xr-x. 2 oracle oinstall    6 Dec 24 13:03 ARCHIVE

9. The student table is again loaded with all the data.

10. The script is running all the time on the server:

Shell
[oracle@localhost example-sqldr]$ ps -ef|grep Test
oracle   30203     1  0 12:53 ?        00:00:00 /bin/bash ./TestSqlLoader.sh
oracle   31284 31227  0 13:06 pts/1    00:00:00 grep --color=auto Test

Full Source Code for Reference

Shell
#!/bin/bash

bad_ext='.bad'
dis_ext='.dis'
data_ext='.dat'
log_ext='.log'
log_folder='TASKLOG'
arch_loc="ARCHIVE"
bad_loc="BAD"
discard_loc="DISCARD"
now=$(date +"%Y.%m.%d-%H.%M.%S")
log_file_name="$log_folder/TestSQLLoader_$now$log_ext"

while true; do
  ls -a *.dat 2>/dev/null > all_data_file_list.unx
  for i in `cat all_data_file_list.unx`
  do
    #echo "The data file name is :-- $i"
    data_file_name=`basename $i .dat`
    echo "Before executing the sql loader command ||Starting of the script" > $log_file_name
    sqlldr userid=hr/oracle@orcl control=student.ctl errors=15000 log=$i$log_ext bindsize=512000000 readsize=500000 DATA=$data_file_name$data_ext BAD=$data_file_name$bad_ext DISCARD=$data_file_name$dis_ext
    mv $data_file_name$data_ext $arch_loc 2>/dev/null
    mv $data_file_name$bad_ext $bad_loc 2>/dev/null
    mv $data_file_name$dis_ext $discard_loc 2>/dev/null
    mv $data_file_name$data_ext$log_ext $log_folder 2>/dev/null
    echo "After Executing the sql loader command||File moved successfully" >> $log_file_name
  done
  ## halt the processing for 1 minute
  sleep 1m
done

The CTL file is below:

SQL
OPTIONS (SKIP=1)
LOAD DATA
APPEND INTO TABLE student
FIELDS TERMINATED BY '|' OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
(
  id,
  name,
  dept_id
)

The SQL*Loader Specification

control - name of the .ctl file
errors=15000 - maximum number of errors SQL*Loader will allow
log=$i$log_ext - name of the log file
bindsize=512000000 - maximum size of the bind array
readsize=500000 - maximum size of the read buffer
DATA=$data_file_name$data_ext - name and location of the data file
BAD=$data_file_name$bad_ext - name and location of the bad file
DISCARD=$data_file_name$dis_ext - name and location of the discard file

In the way shown above, millions of records can be loaded through SQL*Loader plus a Unix script in an automated fashion, and the parameters above can be tuned according to your needs. Please let me know if you like this article.
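For reference, the control file above loads three columns, but the target table's DDL is not shown in the article. A minimal definition consistent with those columns might look like the sketch below; the data types and the sample data row are assumptions, not taken from the original setup:

SQL
-- Assumed target table for the student.ctl control file (column types are illustrative).
CREATE TABLE student (
    id      NUMBER PRIMARY KEY,
    name    VARCHAR2(100),
    dept_id NUMBER
);
-- A matching pipe-delimited .dat file would start with a header row (skipped via SKIP=1), e.g.:
-- ID|NAME|DEPT_ID
-- 101|John|20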
With the surge in large language model adoption in enterprises, generative AI has opened a new pathway to unlock various business potentials and use cases. One of the main architectural building blocks for generative AI is semantic search powered by a vector database. Semantic search, as the name suggests, essentially involves a nearest-neighbor (ANN or k-NN) search within a collection of vectors accompanied by metadata. A system that maintains an index for finding the nearest vector data in vector storage is called a vector database, and its query results are based on the relevancy of the query rather than exact matches. This technique is widely used in the popular RAG (Retrieval Augmented Generation) pattern, where a similarity search is performed in the vector database based on the user's input query, and the similar results or relevant information are added to the input of a large language model. This keeps the LLM from hallucinating on queries outside its knowledge boundary and lets it generate a response specific to the user. This popular generative AI pattern, RAG, can't be implemented without a vector database as one of the core components of the architecture. Because generative AI use cases keep increasing, it is extremely important for an engineer transitioning an LLM-based prototype to production to identify the right vector database during the early stage of development. During the proof-of-concept phase, the choice of database may not be a critical concern for the engineering team. However, the perspective changes significantly as the team progresses toward production. The volume of embedding/vector data can expand significantly, as can the requirement to integrate security and compliance into the app. This requires thoughtful consideration of aspects such as access control and data preservation in case of server failures. In this article, we explain a framework and the evaluation parameters that should be considered when selecting an enterprise-grade vector database for a generative AI use case, covering both the developer experience and the technology experience, which together form the enterprise experience. We also need to keep in mind that numerous vector database products are available on the market, with closed- and open-source offerings, each catering to specific use cases, and no single solution fits all of them. Therefore, it's essential to focus on the key aspects when deciding on the most suitable option for your generative AI-based application.

What Is a Vector Database, and Why Is It Needed?

First, we need to understand what a vector, or embedding, is. A vector or embedding is a mathematical representation of digital data such as text, images, or multimedia content like audio or video, expressed across numerous dimensions. These embeddings capture the semantic meaning of and proximity relationships among the represented data: the more dimensions there are, the more accurately the data conveys relatedness or similarity. When a search is done in the vector space, the input query in natural language is converted into a vector embedding, which is then used to identify related vectors using similarity algorithms that link the vectors with the shortest distance between them. This capability is really useful when building AI applications, recommendation engines, or classifiers.
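To make the nearest-neighbor idea concrete, here is a minimal sketch using the pgvector extension for Postgres as just one example of a vector store; the table, the tiny 3-dimensional vectors, and the values are purely illustrative (real embeddings typically have hundreds or thousands of dimensions):

SQL
-- Enable the pgvector extension and store a few toy embeddings.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id        bigserial PRIMARY KEY,
    content   text,
    embedding vector(3)
);

INSERT INTO documents (content, embedding) VALUES
    ('refund policy',  '[0.11, 0.92, 0.05]'),
    ('shipping times', '[0.83, 0.10, 0.44]');

-- Semantic search: order by distance to the query embedding and keep the closest matches.
SELECT id, content
FROM documents
ORDER BY embedding <-> '[0.12, 0.90, 0.07]'   -- <-> is pgvector's Euclidean (L2) distance operator
LIMIT 5;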
Similarity search with a vector database is very useful for storing an organization's vectorized private domain data, which can then be augmented into generative AI products or into generative AI features added to existing applications, such as content summarization, recommended-content generation, document question-and-answering systems, or data classification engines. Traditional RDBMS or NoSQL databases struggle with storing vector data that has a significant number of dimensions alongside other data types, and they face scalability challenges. Also, these databases only return results when the input query precisely matches the stored data, whereas the purpose of a vector database is to provide results based on similarity or relevancy, not just exact matches. A vector database uses algorithms like k-nearest neighbor (k-NN) or approximate nearest neighbor (ANN) to index and retrieve related data with the shortest distance between them. So we still need the traditional database for building a secure, ACID-compliant transactional system, but the vector database solves the puzzle when we need niche capabilities for generative AI-based applications: augmenting private business data, classifying similar data within a cluster of unrelated data, or creating a user-friendly recommendation engine.

Framework for Choosing the Right Vector Database for an Enterprise AI Application

Every project has different requirements, and it is difficult to find one solution that fits them all. As it's impossible to list every requirement, we'll use a framework to look at a few areas that matter not only for prototype building in the development phase but for production AI applications in general. Our framework, as depicted in the image below, consists of two categories: the technology experience, which focuses on the enterprise-grade technical aspects of the database that can support enterprise AI apps over a long period of time with strong security in place, and the development experience, which focuses on a seamless experience for the engineering team so they can rapidly build AI prototypes without managing infrastructure challenges.

What Are the Building Blocks of the Technology Experience?

The technology experience focuses mainly on the operational and reliability side of the database, involving the core infrastructure. We discuss the major building blocks below.

Performance

Performance is one of the core aspects of a vector database. To implement the RAG pattern, an AI app needs to convert the company's private data or knowledge into vectors or embeddings with the help of an embedding model (e.g., OpenAI's text-embedding-ada-002 model or Amazon's amazon.titan-embed-image-v1 foundation model) and then store the generated embeddings in the vector database. In a vector database, an index is required for efficient search over a large number of records. An index is a data structure that speeds up retrieval operations in a database; the principle is the same as with an index in a book. The performance of the database relies heavily on how quickly and efficiently it can build that index, so that searches yield a roughly constant retrieval time no matter how much data, with varying dimensions, is stored in the system.
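Continuing the illustrative pgvector sketch above, the two index families discussed next can be created as follows; this assumes pgvector 0.5 or newer for HNSW support, and the operator class must match the distance operator used in queries:

SQL
-- HNSW: higher-quality, faster queries, but a slower and more memory-hungry index build.
CREATE INDEX documents_embedding_hnsw
    ON documents USING hnsw (embedding vector_l2_ops);

-- IVFFlat: cheaper to build and lighter on memory; result quality depends on lists/probes tuning.
CREATE INDEX documents_embedding_ivfflat
    ON documents USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);

In practice you would pick one of the two for a given column rather than building both.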
If you expect a huge amount of knowledge or data to be stored in the vector database, or if you expect it to grow exponentially down the line, it is best to choose a vector database that can sustain constant retrieval time no matter how much data is stored. The most commonly used vector indexing algorithms are IVF (Inverted File Index) and HNSW (Hierarchical Navigable Small Worlds). HNSW provides better-quality search results with more speed, but it takes more time to build or rebuild its indexes, whereas IVF takes less memory. So in situations where memory is constrained and the indexes do not need to be rebuilt often, a vector database using the IVF algorithm is worth looking at; otherwise, a vector database using the HNSW indexing algorithm is the default choice for better-quality results and speed.

Scalable

Are you anticipating rapid growth of your database? If so, it's desirable to plan ahead and consider the scalability of the candidate database. Scalability refers to a system's capacity to expand. Determine whether there's a limit to the number of vector embeddings the database provider can accommodate and explore the scaling options. One common approach to scaling out is sharding, where the data is distributed across multiple database copies, alleviating the load and enhancing performance by adding multiple read replicas of the same data. While this may seem like a straightforward technique, practical implementation can be more complex: adding shards to an existing deployment is not always easy and often involves some form of downtime. When the AI application is growing rapidly, going with a managed service can be a better choice than managing your own infrastructure; there are many SaaS (Software as a Service) solutions available that can scale immediately.

Cost-Effective

Cost is another factor that needs to be considered carefully while building a generative AI-based solution. When you are dealing with an external vector database solution, estimate the number of embeddings you expect to scale to and work with the vendor on pricing based on quality, speed, and the total number of embeddings.

Highly Available

Ensuring the continuous operation of your applications, particularly customer-facing ones, is crucial. Verify whether the provider can offer a service-level agreement (SLA), a formal commitment between the service provider and the client specifying the guaranteed uptime. In certain scenarios, it might be necessary to design a distributed system with some type of non-volatile storage to enhance the availability of the existing database.

Enterprise Support

When opting for a SaaS-based vector database solution, it's essential to establish a comprehensive SLA to ensure robust enterprise-grade support from the vendor. This SLA should include pre-defined and mutually agreed RPO (Recovery Point Objective) and RTO (Recovery Time Objective) targets. Additionally, whether it's a vendor-managed database solution or a self-managed open-source one, proper real-time monitoring should be in place. A robust monitoring system is vital for efficient vector database management, enabling the tracking of performance, health, and overall status through a real-time dashboard. Monitoring plays a critical role in detecting issues, optimizing performance, and ensuring seamless operations.
Security In highly regulated sectors, there are instances where enterprise data must reside within the company's infrastructure. In such scenarios, a self-managed database solution could be considered an optimal choice. In the case of a vendor-managed solution, it's crucial to evaluate the product, ensuring that encryption in transit and encryption at rest is appropriately implemented to maintain the safety and security of the data. Additionally, it's important to confirm that the provider holds certifications such as SOC certification and is compliant with GDPR, as well as HIPAA-compliant for healthcare domains. What Are the Building Blocks of the Developer Experience? In this framework for choosing the right vector database for the Enterprise GENERATIVE AI solution, Developer experience holds equal significance alongside operational excellence. The goal should be to opt for a vector database that offers a user-friendly experience, allowing the engineering team to initiate experimentation with prototypes seamlessly, without the need for managing complexities. The various building blocks are: Ease of Use When implementing business logic into the GENERATIVE AI-based application, the engineering team greatly appreciates a database component that is user-friendly and easily portable to run in the development environment. It is truly a boon for an engineer to see if the vector database is available in docker-compose to run locally. Documentation The vector database solution needs to provide well-structured and straightforward documentation, accompanied by practical code examples in widely used programming languages like NodeJs, Java, etc. This ensures a smooth onboarding process for the engineering team. Tools and SDKs For the engineering team, it is vital that the database includes a library in their preferred programming language or a plugin compatible with their chosen framework. While Python packages are widespread in AI-based application development, for projects using other programming languages, it is advisable to explore options with existing support rather than investing valuable time in developing a custom client library. Integration Options In evaluating the developer experience, it is important to verify whether the vector database is accessible on the preferred cloud platform. The provision of multi-cloud support enhances the experience during prototype development and selecting the right deployment option. Community Support The vector database should be having a wide developer community to support ongoing features development and support. If this is a managed vendor product then ensure that the product should have a free tier offering or comprehensive free solution, enabling developers to run, view, modify, and experiment with working code across a majority of generic AI use cases. Conclusion In conclusion, I hope the high-level evaluation list presented here serves as a valuable guide, empowering you to make informed choices aligned with your unique organizational needs. In the next article, we will evaluate some of the popular vector database products to see how they fit within this framework depicted above.
PostgresML is an extension of the PostgreSQL ecosystem that allows the training, fine-tuning, and use of various machine learning and large language models within the database. This extension turns PostgreSQL into a complete MLOps platform, supporting various natural language processing tasks and expanding Postgres's capabilities as a vector database. The extension complements pgvector, another foundational extension for apps wishing to use Postgres as a vector database for AI use cases. With pgvector, applications can easily store and work with embeddings generated by large language models (LLMs). PostgresML takes it further by enabling the training and execution of models within the database. Let's look at the PostgresML extension in action by using PostgreSQL for language translation tasks and user sentiment analysis. Enable PostgresML The easiest way to start with PostgresML is by deploying a database instance with the pre-installed extension in Docker. Use the following command to launch PostgreSQL with PostgresML in a container and open a database session with the psql tool: Shell docker run \ -it \ -v postgresml_data:/var/lib/postgresql \ -p 5432:5432 \ -p 8000:8000 \ ghcr.io/postgresml/postgresml:2.7.12 \ sudo -u postgresml psql -d postgresml Once the container has started and the psql session is open, check that the pgml extension (short for PostgresML) is on the extensions list: SQL select * from pg_extension; oid | extname | extowner | extnamespace | extrelocatable | extversion | extconfig | extcondition -------+---------+----------+--------------+----------------+------------+-----------+-------------- 13540 | plpgsql | 10 | 11 | f | 1.0 | | 16388 | pgml | 16385 | 16387 | f | 2.7.12 | | (2 rows) Finally, if you run the \d command, you'll see a list of database objects used internally by PostgresML. SQL \d List of relations Schema | Name | Type | Owner --------+-----------------------+----------+------------ pgml | deployed_models | view | postgresml pgml | deployments | table | postgresml pgml | deployments_id_seq | sequence | postgresml pgml | files | table | postgresml pgml | files_id_seq | sequence | postgresml pgml | models | table | postgresml pgml | models_id_seq | sequence | postgresml ...truncated Text Translation With PostgresML PostgresML integrates with Hugging Face Transformers to enable the latest natural language processing (NLP) models in PostgreSQL. Hugging Face features thousands of pre-trained models that can be used for tasks like sentiment analysis, text classification, summarization, translation, question answering, and more. For instance, suppose you store a product catalog in PostgreSQL, with all the product descriptions in English. Now, you need to display these descriptions in French for customers visiting your e-commerce website from France. What if someone gets interested in Apple's AirTag? PostgresML can facilitate the translation from English to French using one of the translation transformers: SQL SELECT pgml.transform( 'translation_en_to_fr', inputs => ARRAY[ 'AirTag is a supereasy way to keep track of your stuff. Attach one to your keys, slip another in your backpack. And just like that, they’re on your radar in the Find My app, where you can also track down your Apple devices and keep up with friends and family.' 
] ) AS french; -[ RECORD 1 ]-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- french | [{"translation_text": "AirTag est un moyen super facile de suivre vos objets. Attachez-leur à vos clés, glissez-leur dans votre sac à dos. Et comme ça, ils sont sur votre radar dans l’app Find My, où vous pouvez aussi retrouver vos appareils Apple et suivre vos amis et votre famille."}] translation_en_to_fr - the name of a pre-configured transformer utilizing one of the models from Hugging Face. inputs - an array of text that needs translation. If the e-commerce website also caters to Spanish-speaking countries, then product descriptions can be translated into Spanish using a different model: SQL select pgml.transform( task => '{"task": "translation", "model": "Helsinki-NLP/opus-mt-en-es" }'::JSONB, inputs => ARRAY[ 'AirTag is a supereasy way to keep track of your stuff. Attach one to your keys, slip another in your backpack. And just like that, they’re on your radar in the Find My app, where you can also track down your Apple devices and keep up with friends and family.' ] ) as spanish; -[ RECORD 1 ]----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- spanish | [{"translation_text": "AirTag es una manera superfácil de hacer un seguimiento de tus cosas. Conecta una a tus llaves, desliza otra en tu mochila. Y así mismo, están en tu radar en la aplicación Find My, donde también puedes rastrear tus dispositivos Apple y mantenerte al día con tus amigos y familiares."}] task - a custom task for translation using one of Helsinki-NLP's models. You can choose from thousands of models available on the Hugging Face hub. Overall, PostgresML can improve user experience by returning text that has already been translated back to the application layer. Sentiment Analysis With PostgresML What about engaging in more sophisticated ML and AI-related tasks with PostgresML? One such task is the sentiment analysis of data being inserted or stored in the database. Imagine that customers of the e-commerce website can share their feedback on the products. PostgresML can assist in monitoring customer sentiment about specific products and proactively responding to various concerns and complaints. For example, a customer purchased a headset and shared feedback that PostgresML classified as negative: SQL SELECT pgml.transform( task => 'text-classification', inputs => ARRAY[ 'I regret buying this headset. It does not connect to my laptop over Bluetooth.' ] ) AS positivity; -[ RECORD 1 ]---------------------------------------------------- positivity | [{"label": "NEGATIVE", "score": 0.9996261596679688}] task - a pre-configured transformation for text classification tasks. inputs - the text for sentiment analysis. A company representative reached out to the customer promptly and helped to solve the problem. As a result, the customer shared follow-up feedback that was classified as positive. SQL SELECT pgml.transform( task => 'text-classification', inputs => ARRAY[ 'I regret buying this headset. 
It does not connect to my laptop over Bluetooth.', 'The problem is solved. Jane reached out to me and helped with the setup. Love the product!' ] ) AS positivity; -[ RECORD 1 ]------------------------------------------------------------------------------------------------------- positivity | [{"label": "NEGATIVE", "score": 0.9996261596679688}, {"label": "POSITIVE", "score": 0.999795138835907}] Just like with the translation tasks, you can utilize thousands of other models from Hugging Face for sentiment analysis and other text classification tasks. For instance, here's how you can switch to the RoBERTa model, which was trained on approximately 40,000 English posts on X (Twitter): SQL SELECT pgml.transform( task => '{"task": "text-classification", "model": "finiteautomata/bertweet-base-sentiment-analysis" }'::jsonb, inputs => ARRAY[ 'I regret buying this headset. It does not connect to my laptop over Bluetooth.', 'The problem is solved. Jane reached out to me and helped with the setup. Love the product!' ] ) AS positivity; -[ RECORD 1 ]---------------------------------------------------------------------------------------------- positivity | [{"label": "NEG", "score": 0.9769334197044371}, {"label": "POS", "score": 0.9884902238845824}] The RoBERTa model has also accurately classified the sentiment of the comments, allowing the e-commerce company to address user concerns and complaints promptly as soon as negative feedback gets into PostgreSQL. Summary As a vector database, Postgres isn't limited to storing and querying embeddings. With the PostgresML extension, Postgres can be transformed into a computational platform for various AI and ML tasks. Discover more about PostgresML and PostgreSQL as a vector database in the following hands-on practical guides:
Hibernate

Hibernate by itself does not have full-text search support; it has to rely on database engine support or third-party solutions. An extension called Hibernate Search integrates with Apache Lucene or Elasticsearch (there is also integration with OpenSearch).

Postgres

Postgres has had full-text search functionality since version 7.3. Although it cannot compete with search engines like Elasticsearch or Lucene, it still provides a flexible and robust solution that might be enough to meet application users' expectations, with features like stemming, ranking, and indexing. We will briefly explain how to do a full-text search in Postgres; for more, please visit the Postgres documentation.

For basic text matching, the most crucial part is the match operator @@. It returns true if the document (an object of type tsvector) matches the query (an object of type tsquery). The order is not important for this operator: it does not matter whether we put the document on the left side of the operator and the query on the right side, or the other way around.

For better demonstration, we use a database table called tweet:

SQL
create table tweet (
    id bigint not null,
    short_content varchar(255),
    title varchar(255),
    primary key (id)
)

With the following data:

SQL
INSERT INTO tweet (id, title, short_content) VALUES (1, 'Cats', 'Cats rules the world');
INSERT INTO tweet (id, title, short_content) VALUES (2, 'Rats', 'Rats rules in the sewers');
INSERT INTO tweet (id, title, short_content) VALUES (3, 'Rats vs Cats', 'Rats and Cats hates each other');
INSERT INTO tweet (id, title, short_content) VALUES (4, 'Feature', 'This project is design to wrap already existed functions of Postgres');
INSERT INTO tweet (id, title, short_content) VALUES (5, 'Postgres database', 'Postgres is one of the widly used database on the market');
INSERT INTO tweet (id, title, short_content) VALUES (6, 'Database', 'On the market there is a lot of database that have similar features like Oracle');

Now let's see what the tsvector object looks like for the short_content column of each record:

SQL
SELECT id, to_tsvector('english', short_content) FROM tweet;

The output shows how to_tsvector converts the text column into a tsvector object for the 'english' text search configuration.

Text Search Configuration

The first parameter passed to the to_tsvector function in the example above was the name of the text search configuration, in this case "english". According to the Postgres documentation, text search configuration works as follows: "... full text search functionality includes the ability to do many more things: skip indexing certain words (stop words), process synonyms, and use sophisticated parsing, e.g., parse based on more than just white space. This functionality is controlled by text search configurations."

So the configuration is a crucial part of the process and vital to our full-text search results: for different configurations, the Postgres engine can return different results. This is not only a matter of dictionaries for different languages. For example, you can have two configurations for the same language, where one ignores names containing digits (for example, some serial numbers). If our query contains the specific serial number we are looking for as a mandatory term, we won't find any record with a configuration that ignores words containing numbers, even if such records exist in the database. Please check the configuration documentation for more information.
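Before moving on to text queries, here is a brief sketch of the indexing and ranking features mentioned earlier, using the tweet table from above; the index name is arbitrary:

SQL
-- An expression index lets Postgres answer @@ queries without scanning the whole table.
CREATE INDEX tweet_short_content_fts_idx
    ON tweet USING GIN (to_tsvector('english', short_content));

-- ts_rank orders the matching rows by relevance to the query.
SELECT id, short_content,
       ts_rank(to_tsvector('english', short_content),
               to_tsquery('english', 'database')) AS rank
FROM tweet
WHERE to_tsvector('english', short_content) @@ to_tsquery('english', 'database')
ORDER BY rank DESC;

Note that the query has to use the same expression (including the configuration name) as the index definition for the planner to be able to use the index.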
Text Query

A text query supports operators such as & (AND), | (OR), ! (NOT), and <-> (FOLLOWED BY). The first three operators do not require a deeper explanation. The <-> operator checks that both words exist and that they appear in a specific order: for the query "rat <-> cat", we expect the word "rat" to occur, immediately followed by the word "cat".

Examples

Content that contains rat and cat:

SQL
SELECT t.id, t.short_content
FROM tweet t
WHERE to_tsvector('english', t.short_content) @@ to_tsquery('english', 'Rat & cat');

Content that contains database and market, where market is the third word after database:

SQL
SELECT t.id, t.short_content
FROM tweet t
WHERE to_tsvector('english', t.short_content) @@ to_tsquery('english', 'database <3> market');

Content that contains database but not Postgres:

SQL
SELECT t.id, t.short_content
FROM tweet t
WHERE to_tsvector('english', t.short_content) @@ to_tsquery('english', 'database & !Postgres');

Content that contains Postgres or Oracle:

SQL
SELECT t.id, t.short_content
FROM tweet t
WHERE to_tsvector('english', t.short_content) @@ to_tsquery('english', 'Postgres | Oracle');

Wrapper Functions

One of the wrapper functions that creates text queries, to_tsquery, was already mentioned in this article. There are more such functions:

plainto_tsquery
phraseto_tsquery
websearch_to_tsquery

plainto_tsquery

The plainto_tsquery function converts all passed words into a query where the words are combined with the & (AND) operator. For example, the equivalent of plainto_tsquery('english', 'Rat cat') is to_tsquery('english', 'Rat & cat'). For the following usage:

SQL
SELECT t.id, t.short_content
FROM tweet t
WHERE to_tsvector('english', t.short_content) @@ plainto_tsquery('english', 'Rat cat');

We get the result below:

phraseto_tsquery

The phraseto_tsquery function converts all passed words into a query where the words are combined with the <-> (FOLLOWED BY) operator. For example, the equivalent of phraseto_tsquery('english', 'cat rule') is to_tsquery('english', 'cat <-> rule'). For the following usage:

SQL
SELECT t.id, t.short_content
FROM tweet t
WHERE to_tsvector('english', t.short_content) @@ phraseto_tsquery('english', 'cat rule');

We get the result below:

websearch_to_tsquery

The websearch_to_tsquery function uses an alternative syntax to create a valid text query:

Unquoted text: converted the same way as in plainto_tsquery
Quoted text: converted the same way as in phraseto_tsquery
OR: converted to the | (OR) operator
"-": the same as the ! (NOT) operator

For example, the equivalent of websearch_to_tsquery('english', '"cat rule" or database -Postgres') is to_tsquery('english', 'cat <-> rule | database & !Postgres'). For the following usage:

SQL
SELECT t.id, t.short_content
FROM tweet t
WHERE to_tsvector('english', t.short_content) @@ websearch_to_tsquery('english', '"cat rule" or database -Postgres');

We get the result below:

Postgres and Hibernate Native Support

As mentioned earlier, Hibernate alone does not have full-text search support. It has to rely on database engine support.
This means that we are allowed to execute native SQL queries as shown in the examples below: plainto_tsquery Java public List<Tweet> findBySinglePlainQueryInDescriptionForConfigurationWithNativeSQL(String textQuery, String configuration) { return entityManager.createNativeQuery(String.format("select * from tweet t1_0 where to_tsvector('%1$s', t1_0.short_content) @@ plainto_tsquery('%1$s', :textQuery)", configuration), Tweet.class).setParameter("textQuery", textQuery).getResultList(); } websearch_to_tsquery Java public List<Tweet> findCorrectTweetsByWebSearchToTSQueryInDescriptionWithNativeSQL(String textQuery, String configuration) { return entityManager.createNativeQuery(String.format("select * from tweet t1_0 where to_tsvector('%1$s', t1_0.short_content) @@ websearch_to_tsquery('%1$s', :textQuery)", configuration), Tweet.class).setParameter("textQuery", textQuery).getResultList(); } Hibernate With posjsonhelper Library The posjsonhelper library is an open-source project that adds support for Hibernate queries for PostgreSQL JSON functions and full-text search. For the Maven project, we need to add the dependencies below: XML <dependency> <groupId>com.github.starnowski.posjsonhelper.text</groupId> <artifactId>hibernate6-text</artifactId> <version>0.3.0</version> </dependency> <dependency> <groupId>org.hibernate</groupId> <artifactId>hibernate-core</artifactId> <version>6.4.0.Final</version> </dependency> To use components that exist in the posjsonhelper library, we need to register them in the Hibernate context. This means that there must be a specified org.hibernate.boot.model.FunctionContributor implementation. The library has an implementation of this interface, that is com.github.starnowski.posjsonhelper.hibernate6.PosjsonhelperFunctionContributor. A file with the name "org.hibernate.boot.model.FunctionContributor" under the "resources/META-INF/services" directory is required to use this implementation. There is another way to register posjsonhelper's component, which can be done through programmability. To see how to do that, check this link. Now, we can use full-text search operators in Hibernate queries. PlainToTSQueryFunction This is a component that wraps the plainto_tsquery function. Java public List<Tweet> findBySinglePlainQueryInDescriptionForConfiguration(String textQuery, String configuration) { CriteriaBuilder cb = entityManager.getCriteriaBuilder(); CriteriaQuery<Tweet> query = cb.createQuery(Tweet.class); Root<Tweet> root = query.from(Tweet.class); query.select(root); query.where(new TextOperatorFunction((NodeBuilder) cb, new TSVectorFunction(root.get("shortContent"), configuration, (NodeBuilder) cb), new PlainToTSQueryFunction((NodeBuilder) cb, configuration, textQuery), hibernateContext)); return entityManager.createQuery(query).getResultList(); } For a configuration with the value 'english', the code is going to generate the statement below: Java select t1_0.id, t1_0.short_content, t1_0.title from tweet t1_0 where to_tsvector('english', t1_0.short_content) @@ plainto_tsquery('english', ?); PhraseToTSQueryFunction This component wraps the phraseto_tsquery function. 
Java public List<Tweet> findBySinglePhraseInDescriptionForConfiguration(String textQuery, String configuration) { CriteriaBuilder cb = entityManager.getCriteriaBuilder(); CriteriaQuery<Tweet> query = cb.createQuery(Tweet.class); Root<Tweet> root = query.from(Tweet.class); query.select(root); query.where(new TextOperatorFunction((NodeBuilder) cb, new TSVectorFunction(root.get("shortContent"), configuration, (NodeBuilder) cb), new PhraseToTSQueryFunction((NodeBuilder) cb, configuration, textQuery), hibernateContext)); return entityManager.createQuery(query).getResultList(); } For configuration with the value 'english', the code is going to generate the statement below: SQL select t1_0.id, t1_0.short_content, t1_0.title from tweet t1_0 where to_tsvector('english', t1_0.short_content) @@ phraseto_tsquery('english', ?) WebsearchToTSQueryFunction This component wraps the websearch_to_tsquery function. Java public List<Tweet> findCorrectTweetsByWebSearchToTSQueryInDescription(String phrase, String configuration) { CriteriaBuilder cb = entityManager.getCriteriaBuilder(); CriteriaQuery<Tweet> query = cb.createQuery(Tweet.class); Root<Tweet> root = query.from(Tweet.class); query.select(root); query.where(new TextOperatorFunction((NodeBuilder) cb, new TSVectorFunction(root.get("shortContent"), configuration, (NodeBuilder) cb), new WebsearchToTSQueryFunction((NodeBuilder) cb, configuration, phrase), hibernateContext)); return entityManager.createQuery(query).getResultList(); } For configuration with the value 'english', the code is going to generate the statement below: SQL select t1_0.id, t1_0.short_content, t1_0.title from tweet t1_0 where to_tsvector('english', t1_0.short_content) @@ websearch_to_tsquery('english', ?) HQL Queries All mentioned components can be used in HQL queries. To check how it can be done, please click this link. Why Use the posjsonhelper Library When We Can Use the Native Approach With Hibernate? Although dynamically concatenating a string that is supposed to be an HQL or SQL query might be easy, implementing predicates would be better practice, especially when you have to handle search criteria based on dynamic attributes from your API. Conclusion As mentioned in the previous article, Postgres full-text search support can be a good alternative for substantial search engines like Elasticsearch or Lucene, in some cases. This could save us from the decision to add third-party solutions to our technology stack, which could also add more complexity and additional costs.
The CAP Theorem is a foundational concept in the world of distributed computing, particularly for databases. Developed by Eric Brewer, it challenges us to understand that in any network of databases, we can’t have the perfect combination of three key features at the same time: Consistency (all nodes show the same data), Availability (every request receives a response), and Partition Tolerance (the system continues to operate despite network failures). Understanding Each Aspect of CAP 1. Consistency This is about having the same data across all system parts at any given time. Strong Consistency: The system ensures every change is made across all nodes before moving forward. Eventual Consistency: This is less rigid. Although all nodes will eventually show the same data, they don’t do it simultaneously. Different consistency models exist, ranging from very strict ones where data is always the same everywhere to more relaxed models where data can differ temporarily. 2. Availability Availability means ensuring that every user request gets some kind of response, no matter what. Systems focusing on high availability aim to be always operational, even if parts of the network fail. The downside here is that to stay always on; these systems might give out old or not completely updated data during network problems. 3. Partition Tolerance This refers to the system’s ability to keep running even when network issues cut off communication between different system parts. Since network issues are a reality in distributed systems, being partition-tolerant is considered non-negotiable for distributed databases. Database Types and the CAP Theorem Introduction The CAP Theorem is a fundamental principle in the field of distributed databases. It posits that a database system can only excel in two out of three characteristics: Consistency ‘C,’ Availability ‘A,’ and Partition Tolerance ‘P.’ This limitation arises from the inherent constraints of networked systems and data management, where achieving perfect synchronization across all nodes (for consistency) might conflict with keeping the system always available, particularly during network failures. If you are interested in a proof of that theorem, you can dive in here. CA (Consistency and Availability) Databases Nature: Predominantly relational databases. Example: PostgreSQL CA databases aim to provide both data consistency and availability under normal operating conditions. Every node in the system is both consistent with the others and available to handle requests. However, this balance is disrupted if a partition occurs. In such a scenario, a CA database cannot maintain its partition tolerance, leading to potential system failures or inaccessibility of certain nodes. Essentially, in the presence of a network partition, a CA database can no longer guarantee both consistency and availability across all nodes. CP (Consistency and Partition Tolerance) Databases Nature: Often non-relational or NoSQL databases configured for strong consistency. Example: MongoDB is configured for strong consistency. CP databases prioritize maintaining consistency and handling network partitions effectively. However, they need to improve on availability. In these systems, when a partition occurs between nodes, the inconsistent node (the one with possibly outdated or incorrect data) is made unavailable. This is done to maintain overall data integrity until the partition issue is resolved. 
Thus, CP databases ensure that the data remains accurate and synchronized across the network, but at the cost of potentially rendering parts of the system inaccessible during network disruptions. AP (Availability and Partition Tolerance) Databases Nature: Typically non-relational or NoSQL databases. Example: Cassandra AP databases focus on ensuring availability and handling network partitions. In the event of a partition, all nodes remain operational, but there’s a catch. Nodes on the ‘wrong’ side of the partition might serve older data. This means users can still access the database during a network partition but might not get the most recent data. Once the partition is resolved, AP databases initiate a synchronization process to rectify any inconsistencies across the system. These databases are designed to keep the system functional and accessible, even if it means temporarily compromising on having the latest data. Conclusion Understanding these nuances of CAP database types is crucial for architects and developers. It aids in selecting the right database type based on the specific operational needs and priorities of a system. While CP databases are ideal for scenarios where data accuracy is paramount, AP databases excel in environments where continuous operation is crucial. CA databases, meanwhile, offer a balance of consistency and availability but falter in the face of network partitions. CAP Theorem representation with database types and their examples. CAP Database Types: Aligning With Architecture and Requirements Understanding the Relationship Choosing a database type in line with the CAP Theorem is a decision deeply rooted in the system’s architecture and its specific requirements. The architecture dictates the operational environment of the database, while the application requirements determine the necessary features and capabilities. Architecture Considerations The system's architecture, such as a monolithic, microservices, or hybrid structure, has a significant impact on the database choice. In a distributed setup like microservices, Partition Tolerance becomes a crucial aspect due to the likelihood of network partitions. On the other hand, in more centralized or monolithic structures, the emphasis may shift towards Consistency and Availability. System Requirements Different applications have varying data handling needs. Applications involving critical transactional data, like financial or inventory systems, often need databases prioritizing Consistency and Availability. Conversely, user-centric applications like social media or e-commerce platforms, where user experience depends on uninterrupted service, might find Availability and Partition Tolerance more essential. CA Databases in Context These databases are typically ideal for environments where data integrity is critical and network partitions are a manageable risk. They are well-suited for backend systems where scheduled downtime is acceptable and does not severely impact operations. CP Databases in Context CP databases shine in scenarios where maintaining data accuracy is paramount, even at the cost of temporary unavailability. They are often the choice for systems that require synchronized data across services, such as in complex transaction processing. AP Databases in Context AP databases are the go-to for large-scale applications where service continuity is key. 
They are particularly beneficial in distributed networks where the reliability of network connections might be inconsistent, ensuring the application remains operational. Conclusion In the world of distributed databases, the CAP Theorem teaches us an important lesson: there's no perfect database that has it all. This means that any claim of a universal, one-size-fits-all database solution should be met with a healthy dose of skepticism. Every database, whether it's CA, CP, or AP, comes with its own set of strengths and weaknesses. The CAP Theorem shows us that we can't have total consistency, availability, and partition tolerance all at once. So, each type of database sacrifices one aspect to excel in the other two. The job of architects and developers then becomes crucial and challenging. They need to understand the technical aspects of different databases and, more importantly, know what the application really needs. Does it need to be always on (Availability)? Does it need to have the most up-to-date information all the time (Consistency)? Or does it need to keep running no matter what happens in the network (Partition Tolerance)? For example, a CA database might be great for systems where accurate data is crucial and network issues are rare. On the other hand, an AP database would be better for user-facing applications where it's more important to keep the service always running. CP databases fit well where up-to-date data is a must, even if it means some downtime. In summary, choosing the right database is about finding the best fit for your specific situation and considering what you can or cannot compromise on. It's about matching the database's features with your application's unique needs and challenges. Remember, in the world of databases guided by the CAP Theorem, being well-informed and realistic in your choices is key.
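To make the consistency-versus-availability trade-off concrete, here is a minimal sketch using the MongoDB Java driver, assuming a replica set deployment; the connection string is a placeholder, and how the cluster behaves during an actual partition still depends on its topology and configuration.

Java
import com.mongodb.ConnectionString;
import com.mongodb.MongoClientSettings;
import com.mongodb.ReadConcern;
import com.mongodb.WriteConcern;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;

public class ConsistencyTradeoffSketch {
    public static void main(String[] args) {
        // Placeholder connection string; point it at your own replica set.
        ConnectionString uri = new ConnectionString("mongodb://localhost:27017");

        // Leaning toward the "C" in CAP: acknowledge writes on a majority of nodes
        // and read only majority-committed data. Slower, and unavailable on the
        // minority side of a partition, but readers never see rolled-back data.
        MongoClient consistentClient = MongoClients.create(
                MongoClientSettings.builder()
                        .applyConnectionString(uri)
                        .writeConcern(WriteConcern.MAJORITY)
                        .readConcern(ReadConcern.MAJORITY)
                        .build());

        // Leaning toward availability: acknowledge writes on a single node and
        // accept possibly stale reads. Faster and more tolerant of node loss,
        // at the cost of eventual rather than strong consistency.
        MongoClient availableClient = MongoClients.create(
                MongoClientSettings.builder()
                        .applyConnectionString(uri)
                        .writeConcern(WriteConcern.W1)
                        .readConcern(ReadConcern.LOCAL)
                        .build());

        // Use whichever client matches the guarantees a given workload needs,
        // then close both when done.
        consistentClient.close();
        availableClient.close();
    }
}

The point of the sketch is that the same database can be pushed toward either side of the trade-off at the driver level, which is exactly the kind of decision the CAP discussion above is meant to inform.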
As AI services and the data they consume and create become more important and prevalent in various applications and processes, so do the platforms and architectures they are built upon. As usual, there is no "one size fits all." However, what is briefly presented here is an optimized approach to such data-driven AI application architectures. All of the source code mentioned (and more) can be found here, and a free "Develop with Oracle AI and Database Services: Gen, Vision, Speech, Language, and OML" workshop, where all of the use cases are based on the U.N.'s 17 Sustainable Development Goals, can be found here with many more examples. Often, multiple network calls must be made in a given AI app: calls to the AI services as well as calls to retrieve and persist the content (whether it be text, audio, images, video, etc.) that is the input or output. The persistent information is then often processed and analyzed further, and additional calls, AI or otherwise, are made in reaction. The Oracle Database provides the ability to make calls out to other services, again AI and otherwise, whether they are within the Oracle Cloud or external. When the calls are made from the database itself, it provides an optimized architecture with various benefits, including: Reduced network calls, thus reducing latency and increasing reliability. Transactional (ACID) operations on AI and other data (and even messaging when using TxEventQ), which avoid the need for idempotent/duplicate-processing logic and the wasted resources that go with it. Processing optimization due to data locality, whether that data is stored directly in the database or in an object store or other source; the Oracle Database provides a robust functional frontend to otherwise dumb object storage buckets and offers many options to either sync data or operate on it in place in the object store and other data sources. Enhanced security due to a common authentication mechanism and reuse of a widely acknowledged, robust database and cloud security infrastructure. Reduced overall configuration, as calls are made from a central location. The entry point to the database itself can be exposed as a REST endpoint (using ORDS) with a single click, and of course, drivers in various languages can be used to access the database as well. Vector database advantages. This topic is an article unto itself, and one I'll release as a follow-on, especially as Oracle has added, and is adding, several features in this area. Oracle Database Machine Learning. In addition to various AI services, the Oracle Database itself has had a machine learning engine for many years. Oracle Autonomous Database Select AI, which enables querying data using natural language and generating SQL that is specific to your database. Oracle Autonomous Database AI Vector Search, which includes a new vector data type, vector indexes, and vector search SQL operators that enable the Oracle Database to "store the semantic content of documents, images, and other unstructured data as vectors, and use these to run fast similarity queries." These new capabilities also support RAG (Retrieval Augmented Generation), which provides higher accuracy and avoids having to expose private data by including it in the LLM training data.
Again, there are many different AI application flows and requirements, but a basic comparison of the two approaches can be visualized in the following way:

The Code
It is possible to run several different languages in the database, making it possible to implement various application logic there. These include Java, JavaScript, and PL/SQL. PL/SQL examples are given here, and they can be executed from the Database Actions -> SQL page in the OCI console, from the SQLcl command line tool (which is pre-installed in the OCI Cloud Shell or can be downloaded), from SQL Developer, from VS Code (where Oracle has a plugin), etc. There are also a few ways to go about making the calls out to AI and other services. Standard REST calls, using the database's UTL_HTTP package or fetch from JavaScript, etc., are one approach. If the AI services run within OCI (the Oracle Cloud), then the OCI SDKs, which are written for all major languages, can also be used. I find the use of DBMS_CLOUD.send_request for all OCI service calls (rather than, for example, more specific OCI SDK calls such as DBMS_CLOUD_OCI_AIV_AI_SERVICE_VISION) to be the simplest and most dynamic approach. We start by creating a credential that can be referenced and reused for all of our cloud service calls; it simply includes the information from your OCI account/config.

PLSQL
BEGIN
  dbms_cloud.create_credential (
    credential_name => 'OCI_KEY_CRED',
    user_ocid       => 'ocid1.user.oc1..[youruserocid]',
    tenancy_ocid    => 'ocid1.tenancy.oc1..[yourtenancyocid]',
    private_key     => '[yourprivatekey - you can read this from file or put the contents of your pem without header, footer, and line wraps]',
    fingerprint     => '[7f:yourfingerprint]'
  );
END;
/

Next, before we look at the main program/function, let's quickly look at the table we'll save the AI results in. Notice that, in this case, the table has columns for both the JSON returned from an AI call and a text field that is created from key fields in the JSON for quick reference, searches, etc. Again, the table structures, the use of SQL/relational versus JSON, etc. may all vary, and this is a great example of the multi-purpose Oracle Database, where you can use various data models and data types. For example, the JSON Duality feature in the Oracle Database may be worth checking out, as it allows the same data to be accessed using SQL/relational as well as JSON and even MongoDB APIs.

PLSQL
CREATE TABLE aivision_results (
  id          RAW(16) NOT NULL,
  date_loaded TIMESTAMP WITH TIME ZONE,
  label       VARCHAR2(20),
  textfromai  VARCHAR2(32767),
  jsondata    CLOB CONSTRAINT ensure_aivision_results_json CHECK (jsondata IS JSON)
);
/

And now, the simple function that typifies the heart of the architecture. Here we see a call to DBMS_CLOUD.send_request with the credential we created and the URL of the (AI) service operation endpoint (the analyzeImage operation of the Oracle Vision AI service, in this case). The JSON payload of the body consists of the feature(s) of the service that we would like to use and any other config, as well as the arguments to the operation, which in this case include the object storage location of an image (another option would be to provide the image byte array directly/inlined as part of the payload). The JSON result is then retrieved from the response, certain elements of it are parsed out into a text field for convenience, and the JSON, text, etc. are persisted, as mentioned earlier.
PLSQL
CREATE OR REPLACE FUNCTION VISIONAI_TEXTDETECTION (
  p_endpoint         VARCHAR2,
  p_compartment_ocid VARCHAR2,
  p_namespaceName    VARCHAR2,
  p_bucketName       VARCHAR2,
  p_objectName       VARCHAR2,
  p_featureType      VARCHAR2,
  p_label            VARCHAR2
) RETURN VARCHAR2 IS
  resp          DBMS_CLOUD_TYPES.resp;
  json_response CLOB;
  v_textfromai  VARCHAR2(32767);
BEGIN
  resp := DBMS_CLOUD.send_request(
    credential_name => 'OCI_KEY_CRED',
    uri             => p_endpoint || '/20220125/actions/analyzeImage',
    method          => 'POST',
    body            => UTL_RAW.cast_to_raw(
      JSON_OBJECT(
        'features' VALUE JSON_ARRAY(
          JSON_OBJECT('featureType' VALUE p_featureType)
        ),
        'image' VALUE JSON_OBJECT(
          'source'        VALUE 'OBJECT_STORAGE',
          'namespaceName' VALUE p_namespaceName,
          'bucketName'    VALUE p_bucketName,
          'objectName'    VALUE p_objectName
        ),
        'compartmentId' VALUE p_compartment_ocid
      )
    )
  );
  json_response := DBMS_CLOUD.get_response_text(resp);
  SELECT LISTAGG(text, ', ') WITHIN GROUP (ORDER BY ROWNUM)
    INTO v_textfromai
    FROM JSON_TABLE(json_response, '$.imageText.words[*]'
           COLUMNS (text VARCHAR2(100) PATH '$.text'));
  INSERT INTO aivision_results (id, date_loaded, label, textfromai, jsondata)
  VALUES (SYS_GUID(), SYSTIMESTAMP, p_label, v_textfromai, json_response);
  RETURN v_textfromai;
EXCEPTION
  WHEN OTHERS THEN
    RAISE;
END VISIONAI_TEXTDETECTION;
/

We can also expose the function as a REST endpoint programmatically using the following:

PLSQL
BEGIN
  ORDS.ENABLE_OBJECT(
    P_ENABLED        => TRUE,
    P_SCHEMA         => 'AIUSER',
    P_OBJECT         => 'VISIONAI_TEXTDETECTION',
    P_OBJECT_TYPE    => 'FUNCTION',
    P_OBJECT_ALIAS   => 'VISIONAI_TEXTDETECTION',
    P_AUTO_REST_AUTH => FALSE
  );
  COMMIT;
END;
/

Analysis and Text Searching of AI Results
This architecture also makes analysis and text searching of all AI results more efficient. From here, more processing and analytics can take place. Let's take a look at three statements that will provide us with an easy-to-use text search of our AI results. First, we create an index for text searches on our aivision_results table. Then we create a function that searches for a given string using the powerful CONTAINS functionality (we could additionally/optionally use the DBMS_SEARCH package to search multiple tables) and returns a ref cursor of results. Finally, we expose the function as a REST endpoint. It's that simple.

PLSQL
create index aivisionresultsindex on aivision_results(textfromai) indextype is ctxsys.context;
/

CREATE OR REPLACE FUNCTION VISIONAI_RESULTS_TEXT_SEARCH(p_sql IN VARCHAR2) RETURN SYS_REFCURSOR AS
  refcursor SYS_REFCURSOR;
BEGIN
  OPEN refcursor FOR
    select textfromai from AIVISION_RESULTS where contains(textfromai, p_sql) > 0;
  RETURN refcursor;
END VISIONAI_RESULTS_TEXT_SEARCH;
/

BEGIN
  ORDS.ENABLE_OBJECT(
    P_ENABLED        => TRUE,
    P_SCHEMA         => 'AIUSER',
    P_OBJECT         => 'VISIONAI_RESULTS_TEXT_SEARCH',
    P_OBJECT_TYPE    => 'FUNCTION',
    P_OBJECT_ALIAS   => 'VISIONAI_RESULTS_TEXT_SEARCH',
    P_AUTO_REST_AUTH => FALSE
  );
  COMMIT;
END;
/

In Conclusion
This was a quick article showing an architectural pattern for developing data-driven AI apps by making calls to AI services directly from the database. Thanks so much for reading, and please let me know of any questions or feedback you may have.
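As a small complement to the database-centric flow above, once the search function is REST-enabled through ORDS, any HTTP client can call it. Below is a minimal Java sketch using the JDK's built-in HttpClient; the host, port, and URL path are assumptions based on ORDS's usual /ords/<schema>/<object-alias>/ pattern for auto-REST-enabled objects, so adjust them to your own instance.

Java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class VisionAiSearchClient {
    public static void main(String[] args) throws Exception {
        // Assumed ORDS endpoint for the auto-REST-enabled function; adjust
        // host, port, and schema alias to match your own deployment.
        String endpoint = "http://localhost:8080/ords/aiuser/visionai_results_text_search/";

        // ORDS-enabled functions take their parameters as a JSON body; here we
        // pass the Oracle Text query string that the CONTAINS clause expects.
        String body = "{\"p_sql\": \"recycling\"}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(endpoint))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // Print the status and the JSON payload returned for the matching rows.
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}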
Have you ever wondered how data warehouses are different from databases? And what are data lakes and data lake houses? Let's understand these with a hypothetical example. Bookster.biz is the new sensation in selling books worldwide. The business is flourishing, and they need to keep track of a lot of data: a large catalog of millions of books, and millions of customers worldwide placing billions of orders to buy books. How do they keep track of all this data? How do they ensure their website and apps don't grind to a halt because of all this load? Databases to the Rescue Databases are the workhorses of websites and mobile apps, handling all the data and millions of transactions. These databases come in many flavors (we will cover the different types of databases in a separate post), but the most popular ones are called Relational Databases (aka RDBMS), like MySQL, Postgres, Oracle, etc. Bookster would possibly have the following tables and schema (not exhaustive, for brevity): BookCatalog: book ID, ISBN, title, authors, description, publisher, … BookInventory: book ID, number of books available for sale, ... Users: user ID, user name, email, … Orders: order ID, book ID, user ID, payment information, order status, … When a user orders a book, Bookster will update two records together: reducing the book inventory and inserting a new order entry in the Orders table. RDBMSs support transactions that enable such atomic operations, where either all of the operations succeed or all of them fail. Imagine if two or more users could order the last copy of a popular book. Without transaction support, all of those orders would go through, leaving Bookster with one happy customer and many pissed-off ones. Similarly, if the database host crashes during the processing, the data may be left inconsistent without transactions. This database interaction type is called Online Transaction Processing (aka OLTP), where the read and write operations happen very fast on a small amount of data, i.e., precisely two rows in the previous example. This is great. The customers are now happy, and they can order books fast. But the management wants to know what's going on with the business. Which books are the best-sellers in different categories? Which authors are trending, and which are not selling much? How many orders are coming from which geographies or demographics? These kinds of questions are hard to answer with just the operational database. Data Warehouses Shine for Analytical Queries Data Warehouses (DWs) can handle large amounts of data, e.g., billions of orders, millions of book entries, etc. Bookster can load the data from the database into the DW to answer the management's questions. The analytical queries read a lot of data and summarise it in some form, like listing the total number of orders for a particular book broken down by geography and demographics. Examples of popular DWs are AWS Redshift, GCP BigQuery, etc. This database interaction type is called Online Analytical Processing (aka OLAP), where most reads happen on a large amount of data. The data is uploaded to the DW in batches or can be streamed. The loading process is also known as ETL (Extract, Transform, and Load), which is done regularly to keep the DW in sync with the database updates. DWs typically don't allow updating data in place; instead, a newer version is added. Like RDBMSs, DWs also have a notion of schema, where tables and schemas are well defined, and the ETL process converts the data into the appropriate schema for loading.
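To make the atomic order placement described above concrete, here is a minimal JDBC sketch. The table and column names follow the hypothetical Bookster schema, and the connection URL and credentials are placeholders; the point is simply that the inventory decrement and the order insert either both commit or both roll back.

Java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class PlaceOrder {
    public static void placeOrder(String bookId, String userId, String orderId) throws SQLException {
        // Placeholder JDBC URL and credentials for the hypothetical Bookster database.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/bookster", "app", "secret")) {
            conn.setAutoCommit(false); // start a transaction
            try (PreparedStatement dec = conn.prepareStatement(
                         "UPDATE BookInventory SET available = available - 1 " +
                         "WHERE book_id = ? AND available > 0");
                 PreparedStatement ins = conn.prepareStatement(
                         "INSERT INTO Orders (order_id, book_id, user_id, order_status) " +
                         "VALUES (?, ?, ?, 'PLACED')")) {

                // Step 1: reserve a copy; zero updated rows means the book sold out.
                dec.setString(1, bookId);
                if (dec.executeUpdate() == 0) {
                    conn.rollback();
                    throw new SQLException("Book " + bookId + " is out of stock");
                }

                // Step 2: record the order.
                ins.setString(1, orderId);
                ins.setString(2, bookId);
                ins.setString(3, userId);
                ins.executeUpdate();

                // Both statements succeed together or not at all.
                conn.commit();
            } catch (SQLException e) {
                conn.rollback();
                throw e;
            }
        }
    }
}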
Some data doesn't fit the schema easily but can still be used by Machine Learning (ML) processes. For example, customers review different books as a text or a video review, and some rockstar ML engineers want to generate popular books by training an LLM on all books. So, the data can't be structured as a strict schema anymore. Data Lakes help here by storing even larger amounts of data in different formats and allowing efficient processing. Data Lakes and Data Lake Houses Are the Relatively New Kids on the Block Data Lakes (DLs) remove the friction of converting data into a specific format up front, irrespective of if and when it will be used. Vast amounts of data in different native formats like JSON, text, binary, images, videos, etc., can be stored in a DL and converted to a specific schema at read time, only when there is a need to process the data. The processing is flexible and scalable, as DLs can support big data processing frameworks like Apache Spark. On the flip side, such flexibility can become a drawback if most of the data ingested is of low quality due to the lack of data quality checks or governance, turning the DL into a 'Data Swamp' instead. That's where the clever people at Databricks combined the goodness of DWs with DLs to create Data Lake Houses (DLHs). DLHs are more flexible than DWs, allowing schema either at write time or at read time, as needed, but with stricter mechanisms for data quality checks and metadata management, aka Data Governance. DLHs also allow flexible big data processing like DLs. The following table summarises the differences between these technologies:

Database | Key characteristics: fast, small queries, transaction support | Suitable for: online use cases (OLTP) | Drawbacks: not ideal for large analytical queries | Examples: RDBMS such as MySQL
Data Warehouse | Key characteristics: slow, large queries, no updates after write | Suitable for: analytics (OLAP) | Drawbacks: less flexible due to strict schema and lack of support for big data processing frameworks | Examples: AWS Redshift, Google BigQuery, *Snowflake
Data Lake | Key characteristics: unstructured data, schema on read, flexible, big data processing | Suitable for: analytics (OLAP) | Drawbacks: data quality issues due to lack of Data Governance | Examples: *Snowflake, **AWS Lake Formation, **Databricks Delta Lake
Data Lake House | Key characteristics: structured or unstructured data, flexible with better Data Governance, supports big data processing | Suitable for: analytics (OLAP) | Drawbacks: more complex, lower performance, and more expensive compared to a DW | Examples: *Snowflake, **AWS Lake Formation, **Databricks Delta Lake

*Snowflake can be configured as a Data Warehouse, Data Lake, or Data Lake House. **AWS Lake Formation and Databricks Delta Lake can be configured as either a Data Lake or a Data Lake House.
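As a rough illustration of the schema-on-read idea, here is a small Apache Spark sketch in Java. The lake path and field names are hypothetical review data landed as raw JSON; the schema is only inferred at the moment the data is read for processing, not enforced when it is written.

Java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReviewLakeJob {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("bookster-review-lake")
                .master("local[*]") // local run for the sketch; use a cluster in practice
                .getOrCreate();

        // Raw JSON review files were dropped into the lake as-is; the schema is
        // inferred here, at read time, rather than enforced at write time.
        Dataset<Row> reviews = spark.read().json("s3a://bookster-lake/reviews/*.json");
        reviews.printSchema();

        // A simple aggregation: how many reviews per book (field name is hypothetical).
        reviews.groupBy("book_id").count().show();

        spark.stop();
    }
}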
In the dynamic landscape of modern application development, efficient and seamless interaction with databases is paramount. HarperDB, with its NoSQL capabilities, provides a robust solution for developers. To streamline this interaction, the HarperDB SDK for Java offers a convenient interface for integrating Java applications with HarperDB. This article is a comprehensive guide to getting started with the HarperDB SDK for Java. Whether you're a seasoned developer or just diving into the world of databases, this SDK aims to simplify the complexities of database management, allowing you to focus on HarperDB's NoSQL features. Motivation for Using HarperDB SDK Before delving into the intricacies of the SDK, let's explore the motivations behind its usage. The SDK is designed to provide a straightforward pathway for Java applications to communicate with HarperDB via HTTP requests. By abstracting away the complexities of raw HTTP interactions, developers can concentrate on leveraging the NoSQL capabilities of HarperDB without dealing with the intricacies of manual HTTP requests. In the fast-paced realm of software development, time is a precious resource. The HarperDB SDK for Java is a time-saving solution designed to accelerate the integration of Java applications with HarperDB. Rather than reinventing the wheel by manually crafting HTTP requests and managing the intricacies of communication with HarperDB, the SDK provides a high-level interface that streamlines these operations. By abstracting away the complexities of low-level HTTP interactions, developers can focus their efforts on building robust applications and leveraging the powerful NoSQL capabilities of HarperDB. It expedites the development process and enhances code maintainability, allowing developers to allocate more time to core business logic and innovation. The motivation for utilizing HTTP as the communication protocol between Java applications and HarperDB is rooted in efficiency, security, and performance considerations. While SQL is a widely adopted language for querying and managing relational databases, the RESTful HTTP interface provided by HarperDB offers distinct advantages. The purpose of this guide is to shed light on the functionality of HarperDB in the context of supported SQL operations. It's essential to note that the SQL parser within HarperDB is an evolving feature, and not all SQL functionalities may be fully optimized or utilize indexes. As a result, the REST interface emerges as a more stable, secure, and performant option for interacting with data. The RESTful nature of HTTP communication aligns with modern development practices, providing a scalable and straightforward approach to data interaction. The stability and security inherent in the RESTful architecture make it an attractive choice for integrating Java applications with HarperDB. While the SQL functionality in HarperDB can benefit administrative ad-hoc querying and leveraging existing SQL statements, the guide emphasizes the advantages of the RESTful HTTP interface for day-to-day data operations. As features and functionality evolve, the guide will be updated to reflect the latest capabilities of HarperDB. The motivation for using the HarperDB SDK and opting for HTTP communication lies in the quest for efficiency, security, and a more streamlined development experience. 
This guide aims to empower developers to make informed choices and harness the full potential of HarperDB's NoSQL capabilities while navigating the evolving landscape of its SQL functionality. Understanding the motivation behind employing the HarperDB SDK for Java and choosing HTTP as the communication protocol lays a solid foundation for an efficient and streamlined development process. The SDK is a valuable tool to save time and simplify complex interactions with HarperDB, allowing developers to focus on innovation rather than the intricacies of low-level communication. In the hands-on session that follows, we will delve into practical examples and guide you through integrating the SDK into your Java project. Let's dive in to bring theory into practice and unlock the full potential of HarperDB for your Java applications.

Hands-On Session: Building a Simple Java SE Application with HarperDB
In this hands-on session, we'll guide you through creating a simple Java SE application that performs CRUD operations using the HarperDB SDK. Before we begin, ensure you have a running instance of HarperDB. For simplicity, we'll use a Docker instance with the following command:

Shell
docker run -d -e HDB_ADMIN_USERNAME=root -e HDB_ADMIN_PASSWORD=password -e HTTP_THREADS=4 -p 9925:9925 -p 9926:9926 harperdb/harperdb

This command sets up a HarperDB instance with a root username and password for administration. The instance will be accessible on ports 9925 and 9926. Now, let's proceed with building our Java application. We'll focus on CRUD operations for a straightforward entity: Beer. Throughout this session, we'll demonstrate the seamless integration of the HarperDB SDK into a Java project. To kickstart our project, we'll create a Maven project and include the necessary dependencies: the HarperDB SDK for Java and DataFaker for generating beer data.

Create a Maven Project
Open your preferred IDE or use the command line to create a new Maven project. If you're using an IDE, there is typically an option to create a new Maven project. If you're using the command line, you can use the following command:

Shell
mvn archetype:generate -DgroupId=com.example -DartifactId=harperdb-demo -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false

Replace com.example with your desired package name and harperdb-demo with the name of your project. Include dependencies in pom.xml: open the pom.xml file in your project and include the following dependencies:

XML
<dependencies>
  <dependency>
    <groupId>expert.os.harpderdb</groupId>
    <artifactId>harpderdb-core</artifactId>
    <version>0.0.1</version>
  </dependency>
  <dependency>
    <groupId>net.datafaker</groupId>
    <artifactId>datafaker</artifactId>
    <version>2.0.2</version>
  </dependency>
</dependencies>

Create the Beer Entity
In your src/main/java/com/example directory, create a new Java file named Beer.java. Define the Beer entity as a record, taking advantage of the immutability provided by records.
Additionally, include a static factory method to create a Beer instance using DataFaker:

Java
package com.example;

import net.datafaker.Faker;

public record Beer(String id, String name, String style, String brand) {

    static Beer of(Faker faker) {
        String id = faker.idNumber().valid();
        String name = faker.beer().name();
        String style = faker.beer().style();
        String brand = faker.beer().brand();
        return new Beer(id, name, style, brand);
    }
}

With these initial steps, you've set up a Maven project, included the required dependencies, and defined a simple immutable Beer entity using a record. The next phase involves leveraging the HarperDB SDK to perform CRUD operations with this entity, showcasing the seamless integration between Java and HarperDB. Let's proceed to implement the interaction with HarperDB in the subsequent steps of our hands-on session. The Server and Template classes are fundamental components of the HarperDB SDK for Java, providing a seamless interface for integrating Java applications with HarperDB's NoSQL database capabilities. Let's delve into the purpose and functionality of each class.

Server Class
The Server class is the entry point for connecting with a HarperDB instance. It encapsulates operations related to server configuration, database creation, schema definition, table creation, and more. Using the ServerBuilder, users can easily set up the connection details, including the host URL and authentication credentials. Key features of the Server class: Database management: create, delete, and manage databases. Schema definition: define schemas within databases. Table operations: create tables with specified attributes. Credential configuration: set up authentication credentials for secure access.

Template Class
The Template class is a high-level abstraction for performing CRUD (Create, Read, Update, Delete) operations on Java entities within HarperDB. It leverages Jackson's JSON serialization to convert Java objects to JSON, facilitating seamless communication with HarperDB via HTTP requests. Key features of the Template class: Entity operations: perform CRUD operations on Java entities. ID-based retrieval: retrieve entities by their unique identifiers. Integration with Server: utilize a configured Server instance for database interaction. Type-safe operations: benefit from type safety when working with Java entities.

Together, the Server and Template classes provide a robust foundation for developers to integrate their Java applications with HarperDB effortlessly. In the subsequent sections, we'll explore practical code examples to illustrate the usage of these classes in real-world scenarios, showcasing the simplicity and power of the HarperDB SDK for Java. Let's delve into the code and discover the capabilities these classes bring to your Java projects. In this session, we'll execute a comprehensive code example to demonstrate the functionality of the HarperDB SDK for Java. The code below showcases a practical scenario where we create a database, define a table, insert a beer entity, retrieve it by ID, delete it, and then confirm its absence.
Java
public static void main(String[] args) {
    // Create a Faker instance for generating test data
    Faker faker = new Faker();

    // Configure the HarperDB server with credentials
    Server server = ServerBuilder.of("http://localhost:9925")
            .withCredentials("root", "password");

    // Create a database and table
    server.createDatabase("beers");
    server.createTable("beer").id("id").database("beers");

    // Obtain a Template instance for the "beers" database
    Template template = server.template("beers");

    // Generate a random beer entity
    Beer beer = Beer.of(faker);

    // Insert the beer entity into the "beer" table
    template.insert(beer);

    // Retrieve the beer by its ID and print it
    template.findById(Beer.class, beer.id()).ifPresent(System.out::println);

    // Delete the beer entity by its ID
    template.delete(Beer.class, beer.id());

    // Attempt to retrieve the deleted beer and print a message
    template.findById(Beer.class, beer.id())
            .ifPresentOrElse(
                    System.out::println,
                    () -> System.out.println("Beer not found after deletion")
            );
}

Explanation of the code: Faker instance: We use the Faker library to generate random test data, including the details of a beer entity. Server configuration: The Server instance is configured with the HarperDB server's URL and authentication credentials (username: root, password: password). Database and table creation: We create a database named "beers" and define a table within it named "beer" with an "id" attribute. Template instance: The Template instance is obtained from the configured server, specifically for the "beers" database. Beer entity operations: Insertion: A randomly generated beer entity is inserted into the "beer" table. Retrieval: The inserted beer is retrieved by its ID and printed. Deletion: The beer entity is deleted by its ID. Confirmation of deletion: We attempt to retrieve the deleted beer entity and print a message confirming its absence. This code provides a hands-on exploration of the core CRUD operations supported by the HarperDB SDK for Java. By running this code, you'll witness the seamless integration of Java applications with HarperDB, making database interactions straightforward and efficient. Let's execute and observe the SDK in action! In this hands-on session, we executed a concise yet comprehensive code example that showcased the power and simplicity of the HarperDB SDK for Java. By creating a database, defining a table, and manipulating beer entities, we explored the SDK's capability to integrate Java applications with HarperDB's NoSQL features seamlessly. The demonstrated operations, including insertion, retrieval, and deletion, underscored the SDK's user-friendly approach to handling CRUD functionalities. This session offered a practical glimpse into the ease of use and effectiveness of the HarperDB SDK for Java developers, making database interactions a seamless part of application development. As we proceed, we'll delve deeper into more advanced features and scenarios, building on this foundation to empower developers in leveraging HarperDB's capabilities within their Java projects.

Conclusion
In conclusion, this article has thoroughly explored the HarperDB SDK for Java, showcasing its capabilities in simplifying the integration of Java applications with HarperDB's NoSQL database. From understanding the core classes like Server and Template to executing practical CRUD operations with a sample beer entity, we've witnessed the user-friendly nature of the SDK.
By choosing the HarperDB SDK, developers can streamline database interactions, focusing more on application logic and less on intricate database configurations. For those eager to dive deeper, the accompanying GitHub repository contains the complete source code used in the hands-on session. Explore, experiment, and adapt the code to your specific use cases. Additionally, the official HarperDB Documentation serves as an invaluable resource, offering in-depth insights into the NoSQL operations API, making it an excellent reference for further exploration. As you embark on your journey with HarperDB and Java, remember that this SDK empowers developers, providing a robust and efficient bridge between Java applications and HarperDB’s NoSQL capabilities. Whether you’re building a small-scale project or a large-scale enterprise application, the HarperDB SDK for Java stands ready to enhance your development experience.
Database migration is the process of transferring data from one database system to another, which can involve changing the database schema, the database management system, or both. It is done to improve performance, scalability, and reliability, or to take advantage of new features and capabilities in the target database system. When it comes to the availability of the system, database migration comes in two forms: downtime and live. Downtime The system is brought down for maintenance, and a splash page saying "Under maintenance" is put up by the developers. Behind the scenes, the data is copied over to the new database, and the connection strings are updated from the old to the new database. Once everything is validated to be working, the system is made live again. The advantage of this method is that there is no need to worry about system availability or the risk of data loss during the migration window. The disadvantage, of course, is that the system's SLA goes down, which may not be an option for all kinds of applications. Live In this method, the system is never brought down. Instead, the migration is designed so that it can be done without any downtime, i.e., the system stays live throughout the migration. Although this requires a more complex design, this approach guarantees no or minimal impact on service availability SLAs and might be the more attractive option for many scenarios. This article will take a look at a case study of a live database migration design. Phases The live database migration design is broken into a few phases: Phase 1: Data Hiding This is a code-heavy step in which the service is refactored so that the service layers sitting above the data layer are agnostic to the database being used by the service. This step includes: Using database-implementation-agnostic objects (and definitely not using model objects!) outside of the database layer. Using transformation functions to convert database model objects to and from the database-implementation-agnostic objects. By the completion of Phase 1, we can safely start coding the migration modules without impacting current functionality. Phase 2: New Data Layer This step introduces the new data layer alongside the existing data layer. Most of the time, this will be a new repository class, with each table/collection having a new copy of the CRUD functions mapped 1:1 from the existing data layer. This step also introduces the MigrationFlag (True/False) configuration and ensures the new data layer code only runs when the flag is True. This way, once the code is deployed, the config is initially deployed as False, and the code only writes to the old data layer. We shall see what happens when the configuration is turned to True in a later phase. Phase 3: Data Sync This step involves some data engineering: creating data pipelines that copy the data from the old database to the new database. This includes: Having a datetime marker column saved somewhere, which tracks recent changes to any record and helps identify the latest version of records that need to be synced to the new database. Also, sometimes the data is transformed on its way from the old database to the new database in case there are any schema/model changes as part of the migration. It is important to note that by running this data sync regularly, a live copy of the old database is being built up in the new database, which would ideally be able to take traffic when switched to. This brings us to the next phase.
Phase 4: Config Switch The MigrationFlag developed in Phase 2 is now switched from False to True. The code then starts to dual-write to the old and new databases and to read from the new database. This helps in two ways: If the data sync developed in Phase 3 is correct, the data dual-written to the old database will also be synced to the new database, where it already exists because of the dual write, so the sync is effectively a no-op and the switch causes no change in functionality. However, if something were to break, the dual-write setup ensures that the MigrationFlag can be set back to False at any time to stop the dual write and the reads from the new database, effectively returning to reading and writing only from the old database as before. If there are no issues, we move on to the last phase. Phase 5: Cut Off Old Database The code in this step is updated to stop the dual-writing into the old database and to only read and write from the new database. The data sync can also be safely stopped once the dual write is stopped and it is validated that no new data is being synced via the pipeline. At the end of Phase 5, there can be an optional cleanup phase to delete all the unused code and data sync pipelines. Conclusion In conclusion, live database migration is a complex but useful process that ensures the seamless transition of data from an old database to a new one without any service downtime. The phased approach outlined here keeps the process safe and reversible, with a quick rollback path if issues arise during the switch-over. By the end of the migration, the system is fully live on the new database, with all unused code and data sync pipelines cleaned up.
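As a rough sketch of the layering described in these phases (all class and method names here are hypothetical), the service talks only to a database-agnostic repository interface (Phase 1), both data layers implement it (Phase 2), and a routing wrapper dual-writes and switches reads based on the MigrationFlag (Phase 4). Flipping the flag back to False reverts to the old database without a redeploy.

Java
import java.util.Optional;
import java.util.function.Supplier;

// Phase 1: layers above the data layer only see this database-agnostic interface
// and a database-agnostic Order object (no ORM/model classes leak upward).
interface OrderRepository {
    void save(Order order);
    Optional<Order> findById(String orderId);
}

record Order(String orderId, String bookId, String userId, String status) {}

// Phase 2 + Phase 4: a routing repository that wraps the old and new data layers.
// The migration flag is read per call so a config change takes effect immediately.
class MigratingOrderRepository implements OrderRepository {
    private final OrderRepository oldDb;
    private final OrderRepository newDb;
    private final Supplier<Boolean> migrationFlag; // backed by config, e.g. a feature-flag service

    MigratingOrderRepository(OrderRepository oldDb, OrderRepository newDb,
                             Supplier<Boolean> migrationFlag) {
        this.oldDb = oldDb;
        this.newDb = newDb;
        this.migrationFlag = migrationFlag;
    }

    @Override
    public void save(Order order) {
        if (migrationFlag.get()) {
            // Flag is True: dual-write to both databases (Phase 4).
            oldDb.save(order);
            newDb.save(order);
        } else {
            // Flag is False: behave exactly as before the migration.
            oldDb.save(order);
        }
    }

    @Override
    public Optional<Order> findById(String orderId) {
        // Reads come from the new database only while the flag is True.
        return migrationFlag.get() ? newDb.findById(orderId) : oldDb.findById(orderId);
    }
}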
Abhishek Gupta, Principal Developer Advocate, AWS
Artem Ervits, Principal Sales Engineer, Snowflake