Microsoft Azure Databricks - Train and Prepare Data in the Cloud

Azure Databricks is a workspace where we can carry out all of our data science development: data analysis, machine learning services, and Python and R scripts.


Databricks Environment

First, create Azure Databricks in the Azure environment: log into one of your Azure accounts and create an Azure Databricks resource.


To access Azure Databricks, click on Launch Workspace.


As you can see in the picture below, the Azure Databricks environment has different components. The main components are the Workspace and the Cluster. The first step is to create a cluster. Clusters in Databricks provide a unified platform for ETL (extract, transform, and load), stream analytics, and machine learning. There are two cluster types: Interactive and Job. Interactive clusters are used to analyze data collaboratively with notebooks, while Job clusters are used to run fast and robust automated workloads through an API.


The Clusters page may contain both cluster types, and each cluster can have a different number of nodes. To start, you need to create a cluster: click on the Create Cluster option. On the Create Cluster page, enter information such as the cluster name, the runtime version (the default is fine), the Python version, the minimum and maximum number of workers, and so forth.
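If you prefer automation over the portal UI, a cluster can also be created through the Databricks REST API (the clusters/create endpoint). Below is a minimal sketch using the httr package; the workspace URL, token, runtime version, and node type are placeholders that you would replace with values valid for your own workspace.

library(httr)

# Sketch: create a cluster via the Databricks REST API.
# workspace_url and token are placeholders, not real values.
workspace_url <- "https://<your-databricks-instance>.azuredatabricks.net"
token <- "<personal-access-token>"

response <- POST(
  paste0(workspace_url, "/api/2.0/clusters/create"),
  add_headers(Authorization = paste("Bearer", token)),
  body = list(
    cluster_name  = "demo-cluster",
    spark_version = "5.5.x-scala2.11",  # pick a runtime version listed in your workspace
    node_type_id  = "Standard_DS3_v2",  # an Azure VM type available in your region
    autoscale     = list(min_workers = 1, max_workers = 2)
  ),
  encode = "json"
)

content(response)  # returns the new cluster_id on success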


To use the cluster, you should wait until its status changes to Running (see the picture below). By creating an interactive cluster, we are able to create a notebook, write code there, and get results quickly.
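If you created the cluster through the API, the same Running check can be scripted: the clusters/get endpoint reports the cluster's state. A minimal sketch, reusing the placeholder workspace_url and token from the earlier example:

library(httr)

# Sketch: poll the cluster state via the Databricks REST API.
# The cluster_id below is a placeholder.
status <- GET(
  paste0(workspace_url, "/api/2.0/clusters/get"),
  add_headers(Authorization = paste("Bearer", token)),
  query = list(cluster_id = "<cluster-id>")
)

content(status)$state  # e.g. "PENDING" while starting, "RUNNING" when ready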


To create a notebook, click on the Workspace option and create a new notebook.


When creating a new notebook, you are able to specify which cluster the notebook belongs to and what its main language is (Python, Scala, R, or SQL). In this example, R has been selected as the default language. However, you are still able to write the other languages in the notebook by putting %scala, %python, %sql, or %r at the start of a cell, as shown in the sketch below.
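For instance, even though this notebook's default language is R, a single cell can be switched to another language with a magic command on its first line. The sketch below shows a SQL cell and a Python cell; the table name mpg_table is hypothetical and stands for any table registered in your workspace.

%sql
-- This cell runs as SQL even in an R notebook (mpg_table is a hypothetical table)
SELECT manufacturer, COUNT(*) AS n FROM mpg_table GROUP BY manufacturer

%python
# This cell runs as Python in the same R notebook
print("Hello from Python")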


In the notebook, by default, there is a place to write code. As you can see in the figure below, there is an editor cell named Cmd 1 where you write code and run it. In this example there is only one cell, and the primary language is R. We use the existing dataset named mpg from the ggplot2 package by writing the code below.


library(ggplot2)  # ggplot2 ships with the mpg fuel-economy dataset

display(mpg)      # Databricks' display() renders the data frame as a table


The display command shows the dataset in Databricks. To run the code, click on the arrow on the right side of the cell and choose Run Cell. After the code runs, the result appears at the end of the cell as a table.


To show the result as a chart instead, click on the chart icon at the bottom of the cell.
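Charts can also be produced in code rather than through the chart icon. As a minimal sketch over the same mpg dataset, a ggplot2 plot renders inline below the cell when it is the cell's final expression; the choice of aesthetics here is only illustrative.

library(ggplot2)

# Scatter plot of engine displacement against highway fuel economy,
# coloured by vehicle class
ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point()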