DEEPAGI

Build a LinkedIn Data Scraper

Mohammad Nurdin

Scraping LinkedIn profiles is very useful, especially for public relations and marketing tasks. With Python you can automate the tedious parts of this process and spend your time on the profiles that actually matter to you.

Before you start, it helps to know the basics of web scraping with BeautifulSoup and Selenium.
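
If you have never used BeautifulSoup before, here is a minimal sketch of how it fetches and parses a page. The URL is just a placeholder; any public page works.

import requests
from bs4 import BeautifulSoup

# Fetch a public page and parse its HTML
html = requests.get("https://example.com").text
soup = BeautifulSoup(html, "html.parser")

# Print the page title and every link on the page
print(soup.title.text)
for a in soup.find_all("a"):
    print(a.get("href"))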

Let’s get started

Download and install Python 3.8.7 here https://www.python.org/downloads/release/python-387/.

After that, download the ChromeDriver that matches your Chrome version here https://chromedriver.chromium.org/downloads.

Create a new folder for your workspace and place chromedriver.exe inside it.

Install all the packages via pip.

pip install selenium
pip install bs4
pip install requests
pip install pandas

Change into the workspace and run these commands to open the Jupyter Notebook app.

cd <your_new_workspace>
jupyter notebook

Import all the packages

import time
import re
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

import bs4
from bs4 import BeautifulSoup
import requests
import pandas
from pandas import DataFrame
import csv

Our targets are each connection's profile link, name, job title, location, email, and phone number. Email and phone number are only visible when the profile's contact info is public, so we guard against missing values. First, create empty lists to hold the scraped data.

contact_link = []
contact_name = []
contact_job_title = []
contact_location = []
contact_email = []
contact_phone_number = []

Now open the LinkedIn connections page with Selenium.

# chromedriver.exe sits in the same workspace folder
url = "https://www.linkedin.com/mynetwork/invite-connect/connections"
driver = webdriver.Chrome('chromedriver.exe')
driver.get(url)
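
One caveat: passing the driver path as a positional argument works on Selenium 3. If pip installed Selenium 4 for you, the path goes through a Service object instead; a minimal sketch, assuming chromedriver.exe is still in your workspace folder:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4 style: wrap the driver path in a Service object
service = Service('chromedriver.exe')
driver = webdriver.Chrome(service=service)
driver.get("https://www.linkedin.com/mynetwork/invite-connect/connections")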

You will notice LinkedIn redirects you to a sign-in prompt first, so we need to log in before scraping. Inspect this page before we write the code.

Try the code below.

time.sleep(1)
usr = ""  # your LinkedIn email
pwd = ""  # your LinkedIn password

time.sleep(1)
# The connections page shows a sign-in link when we are not logged in yet
if driver.find_elements_by_xpath('//*[@class="main__sign-in-container"]/a'):
    driver.find_element_by_xpath('//*[@class="main__sign-in-container"]/a').click()

# Fill in the credentials and submit the form
elem = driver.find_element_by_id("username")
elem.send_keys(usr)

elem = driver.find_element_by_id("password")
elem.send_keys(pwd)

elem.send_keys(Keys.RETURN)
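
Likewise, the find_element_by_* helpers were removed in Selenium 4. If that is your version, the same login step uses the By class; a sketch under that assumption:

from selenium.webdriver.common.by import By

# Click the sign-in link if it is shown, then fill in the login form
sign_in = driver.find_elements(By.XPATH, '//*[@class="main__sign-in-container"]/a')
if sign_in:
    sign_in[0].click()

driver.find_element(By.ID, "username").send_keys(usr)
driver.find_element(By.ID, "password").send_keys(pwd)
driver.find_element(By.ID, "password").send_keys(Keys.RETURN)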

Once signed in, we scroll down the page repeatedly until we reach the bottom, so that all connections are loaded.

hasLoadMore = True
count = 0
while hasLoadMore:
    time.sleep(1)
    try:
        # Scroll to the bottom so LinkedIn loads the next batch of connections
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        if count > 1:
            hasLoadMore = False
        else:
            count = count + 1
    except:
        hasLoadMore = False
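
This loop only scrolls a fixed number of times, which may not load every connection on a large network. A more patient variant, sketched below with my own variable names, keeps scrolling until the page height stops growing:

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give LinkedIn time to load the next batch
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # nothing new was loaded, we are at the bottom
    last_height = new_height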

Inspect the webpage again to find the attributes we need: the profile link, name, and job title.

# Parse the fully loaded connections page
soup = BeautifulSoup(driver.page_source,"html.parser")
soup_link = soup.findAll("div", attrs={"class":'mn-connection-card__details'})

for x in range(len(soup_link)):
    # Profile link
    ct_link = soup_link[x].find('a', { "class": "mn-connection-card__link ember-view" })
    contact_link.append(ct_link['href'])

    # Name
    soup_name = soup_link[x].find('span', { "class": "mn-connection-card__name t-16 t-black t-bold" })
    contact_name.append(soup_name.text.strip())
    print(soup_name.text.strip())

    # Job title
    soup_job_title = soup_link[x].find('span', { "class": "mn-connection-card__occupation t-14 t-black--light t-normal" })
    contact_job_title.append(soup_job_title.text.strip())
    print(soup_job_title.text.strip())

Next, we open each profile link to scrape the remaining details. The code below belongs inside the same for loop as above.

    # Open the individual profile page (still inside the for loop)
    detail_page = "https://www.linkedin.com" + ct_link['href']
    driver.get(detail_page)
    soup = BeautifulSoup(driver.page_source,"html.parser")

    # Location
    soup_location = soup.find('li', {"class":"t-16 t-black t-normal inline-block"})

    if soup_location==None:
        contact_location.append("N/A")
    else:
        contact_location.append(soup_location.text.strip())
        print(soup_location.text.strip())

On the same profile page, we also scrape the email address and phone number when they are available.

    # Phone number (still inside the for loop)
    soup_phone_number = soup.find('span', {"class":"t-14 t-black t-normal"})

    if soup_phone_number==None:
        contact_phone_number.append("N/A")
    else:
        contact_phone_number.append(soup_phone_number.text.strip())
        print(soup_phone_number.text.strip())

    # Email address (only present when the contact section is public)
    soup_email_1 = soup.find('section', {"class":"pv-contact-info__contact-type ci-email"})

    soup_email_2 = None
    if soup_email_1!=None:
        soup_email_2 = soup_email_1.find('a', {"class":"pv-contact-info__contact-link t-14 t-black t-normal"})

    if soup_email_2==None:
        contact_email.append("N/A")
    else:
        contact_email.append(soup_email_2.text.strip())
        print(soup_email_2.text.strip())

Once done, we put all the appended lists into a DataFrame so we can easily write them to a CSV file.

data = { 
    'link': contact_link,
    'name':contact_name, 
    'job_title':contact_job_title, 
    'location':contact_location, 
    'phone_number':contact_phone_number,
    'email':contact_email
}

print(data)

# Sanity check: every list should have the same length, or the DataFrame will fail
print([len(v) for v in data.values()])

df = DataFrame(data, columns = ['link', 
                  'name', 
                  'job_title', 
                  'location', 
                  'email',
                  'phone_number'])
df.to_csv('data.csv')
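
To confirm the export worked, you can load data.csv back with pandas and inspect the first few rows; a quick sketch:

import pandas

# Reload the exported file and preview it
check = pandas.read_csv('data.csv', index_col=0)
print(check.head())
print(len(check), "contacts exported")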

That’s all. Let me know if you have any questions.
