Azure Data Lake and Azure NoSQL

Neng Ilha
2022 November 14T12:37
Azure Big Data

What Is Azure Data Lake?

Azure Data Lake is a big data solution built on several cloud services in the Microsoft Azure ecosystem. It lets enterprises ingest data from many sources, including structured, semi-structured, and unstructured data, and then store, process, and analyze it.

You can process, query, and analyze data using Spark, MapReduce, SQL querying, NoSQL data models, and other analytics capabilities that Azure offers. We'll concentrate on four essential elements: core data lake infrastructure, Azure Data Lake Storage (ADLS), Azure Data Lake Analytics (ADLA), and HDInsight.

4 Data Lake Building Blocks on Azure

In Azure, a data lake solution generally consists of four components. Azure's basic infrastructure, which includes blob storage, Azure Data Factory, and Hadoop YARN, serves as the foundation for all data lakes.

In addition, businesses can optionally use Azure Data Lake Storage, a dedicated storage service for large-scale datasets, and Azure Data Lake Analytics, a compute service that analyzes massive datasets using U-SQL. Azure HDInsight, another optional component, lets you run distributed big data workloads using frameworks like Hadoop and Spark.

Core Infrastructure

Azure Blob Storage, an elastic object storage solution with low-cost tiered storage, high availability, and powerful disaster recovery features, is the foundation of Azure Data Lake.

The solution integrates Blob Storage with Azure Data Factory, a tool for building and running extract, transform, load (ETL) and extract, load, transform (ELT) processes. It also uses Apache Hadoop YARN as a cluster management framework to manage the scalability of SQL Server instances, Azure SQL Database instances, and Azure SQL Data Warehouse servers.

Azure Data Lake Storage (ADLS)

Azure Data Lake Storage is a repository for massive datasets. It offers two options for data storage:

WebHDFS storage: a highly secure hierarchical data store that is compatible with the Hadoop Distributed File System (HDFS).
Data lake blob storage: stores data as blobs, with the full capabilities of Azure Blob Storage, including encryption, data tiering, integration with Azure AD, and data lifecycle management.

Azure Data Lake Analytics (ADLA)

Azure Data Lake Analytics is a compute service that lets you connect to and process data from ADLS. It gives .NET developers a platform for efficiently processing up to petabytes of data. Users can run analytics jobs of any size by using U-SQL, which combines SQL with C#, to perform analytics operations.

Azure HDInsight

Azure HDInsight is a managed service for running distributed big data workloads on Azure infrastructure. It supports popular open source frameworks such as Apache Hadoop, Spark, and Kafka. It lets you take advantage of these open source projects on fully managed infrastructure, with cluster administration handled for you and no need to install or configure the frameworks yourself.

Creating Your Azure Data Lake: Associated Services

The primary Azure services you may utilize to create your data lake architecture are listed in the following table.

Service	Description	Role in a Data Lake
Azure Blob Storage	Managed object storage	Storing unstructured data
Azure Databricks	Serverless analytics based on Apache Spark	Batch processing of large datasets
Cosmos DB	Managed NoSQL data store that supports Cassandra and MongoDB APIs	Storing key-value pairs with no fixed schema
Azure SQL Database	Cloud-based managed SQL Server	Storing relational datasets and querying them with SQL
Azure SQL Data Warehouse	Cloud-based enterprise data warehouse (EDW)	Storing large volumes of structured data and enabling massively parallel processing (MPP)
Azure Analysis Services	Analytics engine based on SQL Server Analysis Services	Building ad hoc semantic models for tabular data
Azure Data Factory	Cloud-based ETL service	Integrating the data lake with more than 50 databases and storage systems, and transforming data

Azure Data Lake Best Practices

Here are a few best practices to follow so you can get the most out of your Azure data lake implementation.

Security

Azure Data Lake Storage Gen2 offers POSIX-style access control for users, groups, and service principals defined in Azure Active Directory (Azure AD). These access controls can be set on existing files and folders. You can also use access control to create default permissions that are applied automatically to new files and folders.
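To make the difference between permissions on existing items and default permissions concrete, here is a minimal sketch that only builds an ACL string in the comma-separated `[scope:]type:id:permissions` form used by ADLS Gen2. It does not call Azure. The `AclEntry` class, the `render_acl` helper, and the object ID are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AclEntry:
    """One POSIX-style ACL entry: who it applies to and what they may do."""
    kind: str              # "user", "group", "mask", or "other"
    permissions: str       # three characters, e.g. "r-x"
    principal: str = ""    # Azure AD object ID; empty means the owning user/group
    default: bool = False  # True for entries inherited by new children

    def render(self) -> str:
        valid = len(self.permissions) == 3 and all(
            char in allowed
            for char, allowed in zip(self.permissions, ("r-", "w-", "x-"))
        )
        if not valid:
            raise ValueError(f"invalid permissions: {self.permissions!r}")
        scope = "default:" if self.default else ""
        return f"{scope}{self.kind}:{self.principal}:{self.permissions}"


def render_acl(entries):
    """Join entries into a single comma-separated ACL string."""
    return ",".join(entry.render() for entry in entries)


# Hypothetical object ID standing in for an Azure AD group.
ANALYSTS = "00000000-0000-0000-0000-000000000001"

acl = render_acl([
    AclEntry("user", "rwx"),
    AclEntry("group", "r-x"),
    AclEntry("other", "---"),
    # Access entry: applies to the existing folder itself.
    AclEntry("group", "r-x", principal=ANALYSTS),
    # Default entry: copied onto files and folders created later.
    AclEntry("group", "r-x", principal=ANALYSTS, default=True),
])
print(acl)
```

The last two entries differ only in the `default:` scope prefix, which is what separates a permission on the folder as it exists now from one that new children will inherit.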

Resiliency

When designing a system that uses Data Lake Storage or any other cloud service, you must consider availability requirements and how to respond to potential service failures. Plan for outages that might affect a single compute instance, a zone, or an entire region.

Consider the recovery time objective (RTO) and recovery point objective (RPO) for the workload. Take advantage of the storage redundancy options Azure offers, including Locally Redundant Storage (LRS) and Read-Access Geo-Redundant Storage (RA-GRS).
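As a rough illustration of how the two objectives are measured, the sketch below checks a single outage against an RPO and an RTO. The function name and the timestamps are made up for the example, and real planning would also account for the redundancy option in use.

```python
from datetime import datetime, timedelta


def meets_objectives(last_replicated, failure_at, restored_at, rpo, rto):
    """Compare one outage against the workload's RPO and RTO.

    Data written after `last_replicated` is lost, so that gap is measured
    against the RPO. Time from failure to restoration is measured against
    the RTO.
    """
    data_loss_window = failure_at - last_replicated
    downtime = restored_at - failure_at
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_met": data_loss_window <= rpo,
        "rto_met": downtime <= rto,
    }


# Illustrative timestamps only.
failure = datetime(2022, 11, 14, 12, 0)
report = meets_objectives(
    last_replicated=failure - timedelta(minutes=10),
    failure_at=failure,
    restored_at=failure + timedelta(hours=3),
    rpo=timedelta(minutes=15),
    rto=timedelta(hours=2),
)
print(report["rpo_met"], report["rto_met"])  # True False
```

In this made-up outage, ten minutes of data loss is within the 15-minute RPO, but three hours of downtime misses the two-hour RTO.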

Directory Layout

You should design your data structure to provide security, effective processing, and partitioning when ingesting data into a data lake. Consider factors like organizational unit, data source, timescale, and processing needs while planning the directory structure.

In most cases, start the directory structure with the region and end it with the date. This lets you use POSIX permissions to restrict particular regions or areas of data to specific users. With the date at the end, you can also restrict or process particular date ranges without having to work through many unrelated subdirectories.
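A region-first, date-last layout can be sketched as a small path builder. The `{region}/{subject}/{yyyy}/{mm}/{dd}` pattern and the helper names below are assumptions chosen for illustration, not a required convention.

```python
from datetime import date
from pathlib import PurePosixPath


def lake_path(region: str, subject: str, day: date) -> PurePosixPath:
    """Build a {region}/{subject}/{yyyy}/{mm}/{dd} folder path."""
    return PurePosixPath(region, subject, f"{day:%Y}", f"{day:%m}", f"{day:%d}")


def acl_root(region: str, subject: str) -> PurePosixPath:
    """The single folder where a per-region, per-subject permission would be set."""
    return PurePosixPath(region, subject)


path = lake_path("uk", "telemetry", date(2022, 11, 14))
print(path)                                         # uk/telemetry/2022/11/14
print(acl_root("uk", "telemetry") in path.parents)  # True
```

Because every dated folder sits beneath one region and subject folder, a permission set at that level covers all dates, while a job can still target a single day's folder.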

Cloud Volumes ONTAP from NetApp and Azure Data Lake

NetApp Cloud Volumes ONTAP, the leading enterprise-grade storage management solution, delivers secure, proven storage management services on AWS, Azure, and Google Cloud. Cloud Volumes ONTAP supports up to 368TB of capacity and a wide range of use cases, such as file services, databases, DevOps, or any other enterprise workload, with a strong set of features including high availability, data protection, storage efficiencies, Kubernetes integration, and more.

Cloud Volumes ONTAP supports advanced features for managing SAN storage in the cloud and for hosting NoSQL database systems, as well as NFS shares that can be accessed directly from cloud big data analytics clusters.

Cloud Volumes ONTAP also provides storage efficiency features, including thin provisioning, data compression, and deduplication, which can reduce storage footprint and costs by up to 70%.

What is Azure NoSQL?

NoSQL databases are built on data models other than relational tables. They come in key-value, document, graph, and wide-column varieties. These databases are becoming more common as businesses produce larger volumes of more diverse unstructured data.

There are several NoSQL database solutions and numerous hosting or deployment options in Microsoft Azure. MongoDB, Gremlin, and Cassandra are some of the NoSQL big data options provided by Azure.

Different Azure NoSQL Database Types

Azure offers options for four types of NoSQL databases: key-value, document, columnar, and graph. The sections below explain how these types differ and which Azure services provide each one.

Key-Value

Key-value databases use hash tables to store paired keys and values. You assign data values to specific keys and later retrieve the data using the key. You can store any kind of value in a key-value database; the applications connected to the database supply and interpret the data schema.

Key-value databases suit applications that perform simple lookups. Keep in mind, however, that a key-value database is not designed for querying by value and is poorly suited to searching data. The main benefit of key-value databases is their scalability, which usually results from simple distribution of data across several nodes on separate machines.
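The ideas above (lookups by key, opaque values, and simple distribution across nodes) can be sketched with a toy in-memory store. This is an illustration only, not how any particular Azure service is implemented, and the class and key names are hypothetical.

```python
import hashlib
import json


class ShardedKeyValueStore:
    """Toy key-value store that spreads keys across several in-memory 'nodes'."""

    def __init__(self, node_count: int = 3):
        self.nodes = [{} for _ in range(node_count)]

    def _node_for(self, key: str) -> dict:
        # A stable hash of the key decides which node owns it.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return self.nodes[int.from_bytes(digest[:8], "big") % len(self.nodes)]

    def put(self, key: str, value: bytes) -> None:
        # The store treats the value as opaque bytes; it never inspects it.
        self._node_for(key)[key] = value

    def get(self, key: str, default=None):
        return self._node_for(key).get(key, default)


store = ShardedKeyValueStore()

# The application, not the store, defines and interprets the schema.
session = {"user_id": 42, "cart": ["sku-1", "sku-7"]}
store.put("session:abc123", json.dumps(session).encode("utf-8"))

loaded = json.loads(store.get("session:abc123"))
print(loaded["cart"])                # ['sku-1', 'sku-7']
print(store.get("session:missing"))  # None
```

Note that there is no way to ask this store for "all sessions whose cart contains sku-7" without scanning every value, which is the value-query limitation described above.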

Key-value databases have the following major use cases:

Session management
Data caching
Product recommendation
User preferences
Serving ads
Profile management

Key-value database option on Azure: Cassandra (through the Cosmos DB managed service)

Document

Similar to key-value databases, document databases hold whole documents organised in groups or collections as opposed to individual values. You can query these documents' key-value pairs to get information. JSON, YAML, and XML are just a few of the formats that may be used to store documents. Each document in these databases may have a unique structure.

Documents can contain many kinds of data and can be structured in many different ways. Typically, though, each document holds the information for a single entity, such as a customer or an order. In an RDBMS, the data held in a single document would often be spread across several relational tables.
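As a small sketch of these ideas, the example below keeps two differently shaped order documents in one collection and queries their fields. The documents, field names, and helper functions are invented for illustration; a real document database would provide its own query language.

```python
# Two documents in the same collection with different shapes.
orders = [
    {
        "id": "order-1001",
        "customer": {"name": "Ada", "email": "ada@example.com"},
        "items": [{"sku": "sku-1", "qty": 2}, {"sku": "sku-7", "qty": 1}],
        "status": "shipped",
    },
    {
        "id": "order-1002",
        "customer": {"name": "Lin"},
        "items": [{"sku": "sku-7", "qty": 5}],
        "status": "pending",
        "gift_note": "Happy birthday!",  # field only this document has
    },
]


def find(collection, **criteria):
    """Return documents whose top-level fields equal every given criterion."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


def orders_containing(collection, sku):
    """Query inside the nested `items` array."""
    return [
        doc["id"] for doc in collection
        if any(item["sku"] == sku for item in doc.get("items", []))
    ]


print([doc["id"] for doc in find(orders, status="pending")])  # ['order-1002']
print(orders_containing(orders, "sku-7"))  # ['order-1001', 'order-1002']
```

Each order carries its customer and line items inside one document, whereas a relational design would normally split them across separate orders, customers, and order-items tables.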

Key use cases for document databases include:

Content management
Product catalog
Inventory management

Azure offers MongoDB through the Cosmos DB managed service as a document database.

Columnar

A columnar database, also called a column-family or wide-column database, stores data in a tabular format. Unlike relational databases, these databases take a denormalized approach to sparse data. Because not all tables or rows need to contain the same data, you can store multiple tables within a single database.

Most column-family databases store data in key order, whereas key-value and document databases place data using a computed hash. Columnar implementations typically let you define indexes on particular columns within a column family, so you can retrieve data through those indexes instead of by row key.

Read and write operations for a row are usually atomic within a single column family, although some implementations provide atomicity across the entire row, spanning multiple column families.
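The toy table below sketches two of the properties described above: sparse rows grouped into column families, and storage in key order, which makes range scans cheap. It is a simplified illustration with hypothetical names and does not model any specific Azure service.

```python
import bisect


class ColumnFamilyTable:
    """Toy wide-column table: rows kept in key order, columns grouped into families."""

    def __init__(self):
        self._keys = []   # row keys, always sorted
        self._rows = {}   # row key -> {family: {column: value}}

    def put(self, row_key, family, columns):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        # Rows are sparse: each one stores only the columns it actually has.
        self._rows[row_key].setdefault(family, {}).update(columns)

    def scan(self, start, stop, family):
        """Yield (row_key, columns) for start <= row_key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key].get(family, {})


table = ColumnFamilyTable()
table.put("sensor-1#2022-11-14T10:00", "reading", {"temp_c": 21.5})
table.put("sensor-2#2022-11-14T10:00", "reading", {"temp_c": 19.0, "humidity": 0.4})
table.put("sensor-1#2022-11-14T09:00", "reading", {"temp_c": 20.1})
table.put("sensor-1#2022-11-14T09:00", "device", {"firmware": "1.2"})

# All readings for sensor-1, returned in time order because keys are sorted.
rows = list(table.scan("sensor-1#", "sensor-1#~", "reading"))
print(rows)
```

Putting the sensor ID and timestamp in the row key is a design choice made for this sketch: it keeps each sensor's readings adjacent, which suits the time-series and telemetry use cases.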

The following are important use cases for columnar databases:

Telemetry
Messaging
Recommendations
Activity monitoring
Personalization
Social media analytics
Sensor data
Weather and other time-series data
Web analytics

Columnar database offered on Azure: Azure Table Storage

Graph

Graph databases use nodes and edges to map the connections between data. Nodes represent data values, while edges represent the relationships between those nodes and can be directional. These databases are designed to represent hierarchical or intricately connected data structures.

Graph databases can run queries across a network of nodes and edges and analyze the relationships between them. They can carry out sophisticated analyses quickly, even on large networks with many entities and connections. Graph databases also typically include a query language that makes it easy to traverse networks of relationships.
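To show what traversing a network of relationships means in practice, here is a minimal sketch of a directed graph with labeled edges and a breadth-first "within N hops" query. The class and the sample data are hypothetical; a graph database would express the same query in its own language, such as Gremlin.

```python
from collections import defaultdict, deque


class Graph:
    """Toy directed graph: edges carry a label such as 'knows'."""

    def __init__(self):
        self._out = defaultdict(list)  # node -> [(label, neighbor), ...]

    def add_edge(self, source, label, target):
        self._out[source].append((label, target))

    def within_hops(self, start, label, max_hops):
        """Breadth-first traversal along edges with `label`, up to `max_hops` away."""
        seen = {start}
        queue = deque([(start, 0)])
        found = []
        while queue:
            node, depth = queue.popleft()
            if depth == max_hops:
                continue  # do not expand past the hop limit
            for edge_label, neighbor in self._out.get(node, []):
                if edge_label == label and neighbor not in seen:
                    seen.add(neighbor)
                    found.append((neighbor, depth + 1))
                    queue.append((neighbor, depth + 1))
        return found


g = Graph()
g.add_edge("ana", "knows", "ben")
g.add_edge("ben", "knows", "chen")
g.add_edge("chen", "knows", "dee")
g.add_edge("ana", "bought", "laptop")

print(g.within_hops("ana", "knows", 2))  # [('ben', 1), ('chen', 2)]
```

The same traversal pattern underlies use cases such as social graphs and recommendation engines, where the question is who or what is reachable through a given kind of relationship.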

The following are important use cases for graph databases:

Fraud detection
Social graphs
Recommendation engines
Organization charts

Graph database offered on Azure: Gremlin (through the Cosmos DB managed service)

Azure NoSQL Managed Database with Cosmos DB

Cosmos DB is the most full-featured NoSQL service that Microsoft Azure provides. This managed service offers APIs and engines for several database types. Its features include transparent multi-master replication, turnkey global distribution, intelligent scalability, and a choice of consistency levels. Besides the core Cosmos DB offering, variants are available through APIs, including the following options.

Gremlin API in Azure Cosmos DB

The Gremlin API is based on Apache TinkerPop, a graph computing framework. It provides automatic graph partitioning and automatic indexing, and lets you scale your database elastically. Because the API uses the Gremlin query language, you can run real-time queries without defining views, secondary indexes, or schema hints.

MongoDB API in Azure Cosmos DB

The MongoDB API implements the MongoDB wire protocol on top of Cosmos DB. This lets you use native MongoDB tools, drivers, and client SDKs. With this API, you can migrate existing applications with few changes while preserving their vendor independence.

Table API in Azure Cosmos DB

With the Table API, you can bring Cosmos DB capabilities, such as dedicated global throughput and single-digit millisecond latencies, to applications written for Azure Table Storage. You can migrate applications to this API without changing their code. The API comes with client SDKs for Java, .NET, Python, and Node.js.

Azure Cosmos DB Cassandra API

The Cassandra API lets existing Cassandra applications connect to Cosmos DB and work with data using the Cassandra Query Language (CQL). This API also offers multiple consistency levels, compliance certifications, and permanent change tracking.