Azure HDInsight Best Practices


Azure HDInsight Metastore Best Practices

The Apache Hive metastore is a crucial component of the Apache Hadoop architecture because it acts as the central schema repository for big data access tools such as Apache Spark, Interactive Query (LLAP), Presto, and Apache Pig. Note that the Hive metastore used by HDInsight is an Azure SQL Database.

HDInsight offers two metastore options: the default metastore and a custom metastore.

  • A default metastore is created at no additional cost for every cluster type, but it cannot be shared across clusters.
  • A custom metastore is recommended for production clusters because clusters can be created and deleted without losing the metadata. Use a custom metastore to separate compute from metadata, and back it up regularly.

A default Hive metastore is provisioned when an HDInsight cluster is deployed, but it is transient and is deleted when the cluster is removed. To keep the metadata, store the Hive metastore in your own Azure SQL Database so that it survives cluster deletion.

Monitor the performance of your metastore with tools such as the Azure portal and Azure Log Analytics, and make sure the HDInsight cluster and the metastore are located in the same region.
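As an illustrative sketch only, the following Azure CLI command creates a cluster that points at an existing Azure SQL Database as its custom Hive metastore. The resource names are placeholders, and the exact --hive-metastore-* flag spellings should be verified against az hdinsight create --help for your CLI version.

# Sketch: create a Spark cluster attached to an existing Azure SQL Database
# that serves as the custom Hive metastore (all names are placeholders).
az hdinsight create \
    --name myhdicluster \
    --resource-group myresourcegroup \
    --type spark \
    --http-user admin --http-password '<cluster-password>' \
    --ssh-user sshuser --ssh-password '<ssh-password>' \
    --storage-account mystorageaccount \
    --hive-metastore-server-name mysqlserver \
    --hive-metastore-database-name myhivemetastoredb \
    --hive-metastore-user-name sqladmin \
    --hive-metastore-password '<sql-password>'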

Azure HDInsight Scaling Best Practices

When scaling a cluster down, some services need to be stopped and restarted manually, because running jobs can fail if the cluster is scaled while they are in progress.

HDInsight supports elasticity, which lets you scale the number of worker nodes in a cluster up or down. For example, you can shrink a cluster over weekends and grow it during periods of heavy demand. Likewise, it makes sense to scale the cluster up before periodic batch processing so it has adequate resources, and to scale it back down to fewer worker nodes once processing finishes and utilization drops.
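As a minimal sketch, assuming a cluster named myhdicluster in resource group myresourcegroup, the worker node count can be changed with the Azure CLI (check az hdinsight resize --help for the exact flag names in your CLI version):

# Scale out to 10 worker nodes before the batch window...
az hdinsight resize --name myhdicluster --resource-group myresourcegroup --workernode-count 10
# ...and back down to 3 once processing has finished and utilization drops.
az hdinsight resize --name myhdicluster --resource-group myresourcegroup --workernode-count 3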

Azure HDInsight Architecture Best Practices

The recommended practices for Azure HDInsight Architecture are listed below:

  • When migrating an on-premises Hadoop cluster to Azure HDInsight, use multiple workload-optimized clusters rather than a single general-purpose cluster. Keep in mind, however, that running several clusters long term can unnecessarily increase your costs.
  • Use transient, on-demand clusters and delete them as soon as the workload is finished. Because HDInsight clusters may be used only infrequently, this helps save on resource costs. The associated metastores and storage accounts are not deleted with the cluster, so you can use them to re-create the cluster later if needed (a minimal sketch follows this list).
  • Keep data storage and data processing separate. On HDInsight they are not co-located: data can live in Azure Storage, Azure Data Lake Storage, or both. Separating storage from compute lets you scale each independently, use transient clusters, and reduce storage costs.
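As a minimal sketch of the transient-cluster pattern, assuming the same placeholder names as above, deleting the cluster removes only the compute; the storage account and any custom metastore are separate resources and remain available for a later re-creation:

# Tear down the compute once the workload is finished; the storage account
# and the external metastore are not deleted and can back a future cluster.
az hdinsight delete --name myhdicluster --resource-group myresourcegroup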

Azure HDInsight Infrastructure Best Practices

The following are recommended practices for Azure HDInsight infrastructure:

  • Capacity planning: The most important capacity-planning decisions for an HDInsight deployment are choosing the right region, storage location, VM size, VM type, and number of nodes.
  • Script actions: Use script actions to customize your HDInsight clusters. Make sure the scripts are stored at a URI that is reachable from the cluster, and verify the supported HDInsight versions and the Hadoop components available in HDInsight (a sample script follows this list).
  • Use bootstrap: Another good option is to change HDInsight configuration with bootstrap, which lets you modify configuration files such as core-site.xml, hive-site.xml, and oozie-env.xml at cluster creation time.
  • Edge nodes: Edge nodes can be used to access the cluster and to test and host client applications. Azure HDInsight gives you the flexibility to scale clusters up or down as needed, and you can reduce read and write latency by placing the cluster close to where the data resides.
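As an illustration of a script action, the script below is a hypothetical example that would be stored at an HTTPS URI reachable from the cluster; it simply installs an extra utility on each targeted node.

#!/usr/bin/env bash
# install-deps.sh -- hypothetical script action, hosted at a URI the cluster can reach
set -euo pipefail
sudo apt-get update -y
sudo apt-get install -y jq   # example: add a JSON utility to every targeted node

The script can then be applied to head and worker nodes, for example with az hdinsight script-action execute and its --script-uri, --roles, and --persist-on-success options (confirm the exact flags with --help for your CLI version).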

Azure HDInsight Migration Best Practices

The following are recommended practices for Azure HDInsight migration:

  • Migration using scripts: The Hive metastore can be migrated either with scripts or through DB replication. If you migrate it with scripts, generate the Hive DDLs from the on-premises metastore, edit the generated DDL to replace HDFS URLs with WASB/ADLS/ABFS URLs, and then run the amended DDL against the new metastore. The metastore version must be supported in both the on-premises and cloud environments. (A sketch of this flow appears after the DB replication example below.)
  • Migration using DB replication: If you migrate the Hive metastore through DB replication, use the Hive MetaTool to replace HDFS URLs with WASB/ADLS/ABFS URLs. For example:

./hive --service metatool -updateLocation \
    hdfs://nn1:8020/ \
    wasb://<container_name>@<storage_account_name>.blob.core.windows.net/
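
For the script-based option above, a minimal sketch under assumed database, table, and URL names could look like this:

# 1. Export the DDL for a table from the on-premises Hive metastore.
hive -e "SHOW CREATE TABLE mydb.mytable;" > mytable.hql
echo ";" >> mytable.hql   # SHOW CREATE TABLE output has no trailing semicolon

# 2. Rewrite the HDFS locations to the target cloud storage URL.
sed -i 's#hdfs://nn1:8020#wasb://<container_name>@<storage_account_name>.blob.core.windows.net#g' mytable.hql

# 3. Replay the amended DDL against the new HDInsight metastore.
hive -f mytable.hql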

When transferring on-premises data to Azure, you have two choices: offline transfer or transfer over the network via TLS. The amount of data you need to move will usually determine which option is best.

  • Migrating over TLS: Use Azure Storage Explorer, AzCopy, Azure PowerShell, or the Azure CLI to move data to Azure Storage over the network via TLS.
  • Migrating offline: Use Data Box, Data Box Disk, or Data Box Heavy devices to ship large volumes of data to Azure offline. Alternatively, use native tools such as Azure Data Factory, AzureCp, or Apache Hadoop DistCp to transfer the data over the network (a sketch of the network-based approaches follows this list).
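As an illustrative sketch of the network-based options, with the storage account, container, SAS token, and paths as placeholders:

# Copy a local dataset to Blob Storage over TLS with AzCopy (v10).
azcopy copy "/data/exports" "https://<storage_account_name>.blob.core.windows.net/<container_name>?<SAS_token>" --recursive

# Or copy straight from on-premises HDFS to the cluster storage with DistCp.
hadoop distcp hdfs://nn1:8020/data wasb://<container_name>@<storage_account_name>.blob.core.windows.net/data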

Azure HDInsight Performance Best Practices

The following are performance considerations for Azure HDInsight:

  • Increase parallelism: You can speed up data transfer by giving DistCp more mappers, which increases parallelism and shortens the transfer. Split the work into several DistCp jobs to limit the impact of failures; if one job fails, only that job needs to be restarted rather than all of them. If you have a small number of very large files, consider splitting them into 256-MB chunks, and increase the number of threads active at any one time (see the DistCp sketch after this list).
  • Monitor performance: Azure HDInsight gives you useful insight into the performance of your cluster, including CPU, memory, and network utilization. You can configure Azure Monitor alerts to fire when a metric value or query result crosses a predefined threshold, with actions such as email, SMS, push notification, voice call, an Azure Function, a webhook, or an ITSM integration.
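A minimal DistCp sketch for the parallelism tip above; the mapper count, bandwidth cap, and paths are assumptions to tune for your own environment:

# 64 mappers, dynamic strategy so idle mappers pick up remaining work,
# and a 100 MB/s per-mapper bandwidth cap.
hadoop distcp -m 64 -strategy dynamic -bandwidth 100 \
    hdfs://nn1:8020/data/part1 \
    wasb://<container_name>@<storage_account_name>.blob.core.windows.net/data/part1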

Azure HDInsight Storage Best Practices

Because each workload has different business requirements, choosing the right storage for your HDInsight clusters is crucial, and those requirements must be reflected at the storage layer. Several Azure storage options are available, including Azure Storage, Azure Blob Storage, and Azure Data Lake Store (ADLS).

The recommended practices for Azure HDInsight Storage are as follows:

  • Storage throttling: A cluster can hit performance bottlenecks when running jobs attempt more input/output (I/O) operations than the storage can sustain. The excess I/O requests are blocked and queued until the in-flight operations complete. This happens because of capacity throttling, a cap the storage service enforces in line with its service level agreement (SLA). To avoid throttling, reduce the size of your cluster, adjust the self-throttling settings, or increase the bandwidth allocated to your storage account.
  • Decoupled compute and storage: In HDInsight, storage is separate from compute, so the data remains intact and accessible even if you shut down the compute side of the cluster.
  • Choosing the right storage type: HDInsight uses Azure Storage by default. You can choose one or more Azure Blob Storage accounts to hold your data, but keep in mind that Standard LRS is supported while Premium LRS is not.
  • Azure Data Lake Store: ADLS is another option for data storage. It is a distributed file system designed for parallel processing workloads, with no limits on file size or total account storage.
  • Use multiple accounts: Don't limit your HDInsight cluster to a single storage account; it's recommended to use several storage accounts with one container per account. Each storage account provides additional network bandwidth, which lets the compute nodes complete their work as quickly as possible. As a guideline, a 48-node cluster should use 4–8 storage accounts.

Azure HDInsight Security and DevOps Best Practices

Use the Enterprise Security Package (ESP), which provides directory-based authentication, multi-user support, and role-based access control, to secure and manage the cluster. Note that ESP is supported on several cluster types, including Apache Hadoop, Apache Spark, Apache HBase, Apache Kafka, and Interactive Query (Hive LLAP).

It's crucial to take the following actions to protect your HDInsight deployment:

  • Azure Monitor: Utilize Azure Monitor's monitoring and alerting capabilities.
  • Stay on top of upgrades: Update to the latest HDInsight version, apply the latest OS patches, and reboot any nodes that require it.
  • Enforce end-to-end enterprise security: auditing, encryption, authentication, and authorization are all required, along with a private and secure data pipeline.
  • Protect your Azure Storage keys with encryption as well. A shared access signature (SAS) is an effective way to limit access to your Azure Storage resources (see the sketch below). Data written to Azure Storage is automatically encrypted with Storage Service Encryption (SSE) and replicated.
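As a minimal sketch, a read-only, time-limited SAS for a container can be generated with the Azure CLI; the account name, container, and expiry below are placeholders, and authentication (an account key or a logged-in identity) is also required:

# Issue a container-level SAS that permits read and list access only,
# over HTTPS, until the stated expiry time.
az storage container generate-sas \
    --account-name mystorageaccount \
    --name mycontainer \
    --permissions rl \
    --expiry 2030-01-01T00:00Z \
    --https-only \
    --output tsv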

Update HDInsight to the most recent version at regular intervals. You can do this with the following steps:

  1. Create a new test HDInsight cluster with the latest HDInsight version.
  2. Verify on the test cluster that your jobs and workloads work as expected.
  3. Modify jobs, applications, or workloads as needed.
  4. Back up any transient data stored locally on the cluster nodes.
  5. Delete the existing cluster.
  6. Create a new cluster with the latest HDInsight version in the same virtual network subnet, using the same default data and metastore as before.
  7. Import the backed-up transient data.
  8. Start new jobs or continue processing with the new cluster.
