Azure Data Factory is Azure’s cloud ETL service for scale-out serverless data integration and data transformation. It offers a code-free UI for intuitive authoring and single-pane-of-glass monitoring and management. You can also lift and shift existing SSIS packages to Azure and run them with full compatibility in ADF.
In this tutorial, we'll build our very first ADF pipeline: it calls an Azure Function that reads data from a Google Sheet through the Google Sheets and Drive APIs, stores the results in Azure SQL Database, and then we visualize the data in Power BI.
Select SQL databases.
Click on Create SQL Database.
Select Resource group and create a new server. Once done click on OK. Then click on Configure database.
Select Basic and click on Apply.
Enter the Database name, select Locally-redundant backup storage and click on Review + create.
Click on Create.
Click on Go to resource.
The overview page shows the details of the Azure SQL database we just created. Click on Set server firewall.
Select Yes under Allow Azure services and resources to access this server and click on Save.
Download SQL Server Management Studio (SSMS) from https://docs.microsoft.com/en-us/sql/ssms/download-sql-server-management-studio-ssms?view=sql-server-ver15. Once installed, open SSMS and connect to the server.
Sign in to Microsoft Azure. Once done, click on OK.
Click on New Query and use this SQL statement to create a new table. Once done, click on Execute.
CREATE TABLE Users (
    Id int NOT NULL PRIMARY KEY IDENTITY(1,1),
    Name varchar(255),
    Phone varchar(255),
    Email varchar(255),
    LastContact varchar(255),
    Status varchar(255)
);
Open Google Sheets and create a simple sheet like the one below.
Click on Share. In the sharing dialog that appears, click on Copy link.
You will get a link in the following format. The long string between /d/ and /edit is your Google Sheet ID. (If you are reading data from someone else's sheet, that sheet must be shared with your Google ID.)
https://docs.google.com/spreadsheets/d/1qkls44BLAmuCvFOO0P9IhYIXvjo3BGlsF4fq-g30f48/edit?usp=sharing
Enable the Google Sheets API. Register first at https://cloud.google.com/. Once done, go to the console and make sure you create a new project under the resource manager before proceeding. Then go to https://console.cloud.google.com/apis/library/sheets.googleapis.com and click ENABLE.
Click on Credentials.
Click on Manage service accounts, then click on CREATE SERVICE ACCOUNT. Enter a Service account ID and click on CREATE.
Grant this service account the Owner role and click on Continue.
Click on Done.
Create a key for azure-data-factory@deepagi-e2f97.iam.gserviceaccount.com, select the JSON format, download the JSON file, and store it in the same folder where our code will be saved. Copy the service account email and share the Google Sheet with it.
Download and install Python 3.9.1 from https://www.python.org/downloads. Once completed, install Jupyter via the command line.
pip3 install jupyterlab
Install the following packages using pip; oauth2client and pyodbc are also needed by the code below.
pip install google_spreadsheet
pip install google-auth-oauthlib
pip install gspread
pip install pandas
pip install oauth2client
pip install pyodbc
Create a new workspace, change directory into it, and run this command to launch Jupyter Notebook.
cd new_workspace; jupyter notebook
Click on New and select Python 3. Import the libraries.
import gspread
import pandas as pd
from oauth2client.service_account import ServiceAccountCredentials
Define the scope of the application.
scope = ['https://spreadsheets.google.com/feeds', 'https://www.googleapis.com/auth/drive']
creds = ServiceAccountCredentials.from_json_keyfile_name('deepagi-e2f97-3de355cac46e.json', scope)
client = gspread.authorize(creds)
Create the Sheet Instance
sheet = client.open('Azure Data Factory Tutorial')
sheet_instance = sheet.get_worksheet(0)
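gspread can also open the spreadsheet by its ID (the string from the share link shown earlier) instead of by its title; a minimal sketch, using the same client as above:
# Alternative: open the spreadsheet by the ID from the share link instead of by title
sheet = client.open_by_key('1qkls44BLAmuCvFOO0P9IhYIXvjo3BGlsF4fq-g30f48')
sheet_instance = sheet.get_worksheet(0)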
Get all records
records_data = sheet_instance.get_all_records()
print(records_data)
Convert the records to a DataFrame.
records_df = pd.DataFrame.from_dict(records_data)
records_df.head(10)
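The Users table created earlier has a LastContact column. If your sheet header reads Date of last contact (as the Azure Function version of this code later assumes), rename it first so the insert in the next step maps cleanly:
# Align the sheet column name with the SQL schema (only needed if your header differs)
records_df.rename(columns={'Date of last contact': 'LastContact'}, inplace=True)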
Connect to Azure SQL from Python and use this code to insert the DataFrame into the database.
import pyodbc

df = records_df
server = 'deepagi.database.windows.net'
database = 'deepagi'
username = 'username'
password = 'password'

# Connect to the Azure SQL database
cnxn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER=' + server + ';DATABASE=' + database + ';UID=' + username + ';PWD=' + password)
cursor = cnxn.cursor()

# Insert the DataFrame into SQL Server row by row
for index, row in df.iterrows():
    cursor.execute("INSERT INTO Users (Name, Phone, Email, LastContact, Status) values(?,?,?,?,?)",
                   row.Name, row.Phone, row.Email, row.LastContact, row.Status)
cnxn.commit()
cursor.close()
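Looping row by row is fine for a small sheet; for larger data, pyodbc's executemany with fast_executemany enabled is usually quicker. A minimal sketch, assuming the same df and cnxn as above:
# Optional bulk insert instead of the row-by-row loop (assumes df and cnxn from above)
cursor = cnxn.cursor()
cursor.fast_executemany = True
rows = list(df[['Name', 'Phone', 'Email', 'LastContact', 'Status']].itertuples(index=False, name=None))
cursor.executemany("INSERT INTO Users (Name, Phone, Email, LastContact, Status) values(?,?,?,?,?)", rows)
cnxn.commit()
cursor.close()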
Run the notebook again and check the result in the Users table.
Next, we will create an Azure Function. Make sure you upgrade your subscription before using Azure Functions. For this service, we need to create a new resource group. Select Resource groups.
Click on Add.
Enter deepagi-linux in Resource group and select (Asia) Southeast Asia.
Click on Create.
Select Function App from the Azure portal.
Click on Create Function App.
On the Basics page, use the following function app settings.
Subscription: <your_subscription_account>
Resource Group: <your_resource_group>
Function App Name: <your_function_app_name>
Runtime stack: Python
Version: 3.8
Region: Southeast Asia
Once done click on Review + create.
Click on Create.
Once done, click on Go to resource.
Click on App keys.
Copy the _master key for later use.
Download and install Visual Studio Code from https://code.visualstudio.com, then install the Azure Functions extension.
After installation, select the Azure icon on the Activity bar. You should see an Azure Functions area in the Side Bar. From Azure: Functions, select the Create Function icon.
Click on Create new project.
Create a new folder (or pick an existing one) for your new project and click on Select.
Select Python.
Select python3.8.
Select HTTP trigger.
Enter extract.
Select Function.
Select Open in current window.
Copy the credentials JSON file downloaded from Google Cloud Platform into the new project folder.
Update __init__.py with this code.
import logging
import json

import azure.functions as func
import gspread
import pandas as pd
import pyodbc
from oauth2client.service_account import ServiceAccountCredentials


def main(req: func.HttpRequest) -> func.HttpResponse:
    logging.info('Python HTTP trigger function processed a request.')

    # Authorise against the Google Sheets/Drive APIs with the service account key
    scope = ['https://spreadsheets.google.com/feeds', 'https://www.googleapis.com/auth/drive']
    creds = ServiceAccountCredentials.from_json_keyfile_name('deepagi-e2f97-3de355cac46e.json', scope)
    client = gspread.authorize(creds)

    # Read the sheet into a DataFrame and align the column names with the SQL schema
    sheet = client.open('Azure Data Factory Tutorial')
    sheet_instance = sheet.get_worksheet(0)
    records_data = sheet_instance.get_all_records()
    records_df = pd.DataFrame.from_dict(records_data)
    records_df.rename(columns={'Date of last contact': 'LastContact'}, inplace=True)
    df = records_df

    # Connect to Azure SQL and reload the Users table
    server = 'deepagi.database.windows.net'
    database = 'deepagi'
    username = 'deepagi'
    password = 'your_password'
    cnxn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER=' + server + ';DATABASE=' + database + ';UID=' + username + ';PWD=' + password)
    cursor = cnxn.cursor()
    cursor.execute("TRUNCATE TABLE Users")
    for index, row in df.iterrows():
        cursor.execute("INSERT INTO Users (Name, Phone, Email, LastContact, Status) values(?,?,?,?,?)",
                       row.Name, row.Phone, row.Email, row.LastContact, row.Status)
    cnxn.commit()
    cursor.close()

    return func.HttpResponse("This HTTP triggered function executed successfully.")
Update requirements.txt file with these packages.
azure-functions
gspread
pandas
pyodbc
oauth2client
Click on Attach to Python.
Download and install Postman from https://www.postman.com/. Once the debugger is running, copy the local URL shown in the terminal.
http://localhost:7071/api/extract
Open Postman and send a GET request to that URL.
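If you prefer to hit the local endpoint from Python instead of Postman, a minimal sketch:
import requests

# Call the locally running Azure Functions host started by the debugger
response = requests.get('http://localhost:7071/api/extract')
print(response.status_code, response.text)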
Once successful, sign in to Azure.
Select your function app in Azure. Mine is Deepagi.
Click on Deploy.
Click on Functions > Functions.
You can see the new function after deployment. Click on the extract function.
Click on Code + Test.
Click on Test/Run and Run to test the API.
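You can also call the deployed function outside the portal by passing the _master key we copied earlier in the x-functions-key header. A minimal sketch with Python requests, assuming the function app is named deepagi:
import requests

# Call the deployed extract function, authenticating with the host (_master) key copied earlier
url = 'https://deepagi.azurewebsites.net/api/extract'  # assumes the function app is named deepagi
headers = {'x-functions-key': '<your_master_key>'}
response = requests.get(url, headers=headers)
print(response.status_code, response.text)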
Make sure we truncate the Users table and run the extract function one more time.
TRUNCATE TABLE dbo.Users
Click on Create a resource, search for Data Factory, then click on Create.
Provide a name for your data factory, select the resource group where you want to deploy it, choose the version, and set up the Git configuration.
Click on Review + create. Once validation passed, click on Create.
Click on Go to resource.
Click on Author & Monitor.
Click on Log into GitHub.
Click on Authorize AzureDataFactory.
The Azure Data Factory portal is completely different, as shown in the following figure. Click on the pencil icon.
Select + > Pipeline.
Drag and drop an Azure Function activity onto the canvas and name it extract.
Click on Settings then click on New.
Fill in all the details and click on Create once done.
Enter extract in Function name and GET under Method. Once done, click on Debug for testing.
Once successful, click on Add trigger > New/Edit.
Select New.
Enter all the details and click on OK.
Click on Save. Then click on Publish.
Click on Ok.
Download and install Power BI Desktop from https://powerbi.microsoft.com/en-us/downloads/. Select Get data > More.
Select Azure > Azure SQL Database.
Enter the Server and Database details.
Select Database. Enter your Username and Password.
Select table Users and click on Load.
Once you add a data source, it appears in the right sidebar. Since we can't publish a Power BI report with a personal email account, we skip that part.
That's all for this basic Azure Data Factory tutorial. Let me know if you have any feedback.