Scraping LinkedIn profiles is a very useful activity, especially for public relations and marketing tasks. With Python you can automate the process, freeing your time to focus on the profiles that matter most.
Before you start, it helps to know the basics of web scraping with BeautifulSoup and Selenium.
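If those libraries are new to you, a minimal fetch-and-parse example looks roughly like this (the URL and tag here are only placeholders for practice):

import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com")       # download the page HTML
soup = BeautifulSoup(response.text, "html.parser")   # parse it
print(soup.find("h1").text)                          # read the first <h1> heading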
Let’s get started
Download and install Python 3.8.7 from https://www.python.org/downloads/release/python-387/.
After that, download ChromeDriver from https://chromedriver.chromium.org/downloads.
Create a new folder for your workspace and place the ChromeDriver executable inside it.
Install all the packages via pip.
pip install selenium
pip install bs4
pip install requests
pip install pandas
Change into the workspace and run these commands to open the Jupyter Notebook app.
cd <your_new_workspace>
jupyter notebook
Import all the packages
import time
import re
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import bs4
from bs4 import BeautifulSoup
import requests
import pandas
from pandas import DataFrame
import csv
Our targets are each connection's profile link, name, job title, location, email, and phone number. Whether the email and phone number are available depends on whether the user keeps their contact info public or private, so we guard those lookups and fall back to "N/A" when they are missing.
contact_link = []
contact_name = []
contact_job_title = []
contact_location = []
contact_email = []
contact_phone_number = []
Now, try to open the LinkedIn connections page.
url = "https://www.linkedin.com/mynetwork/invite-connect/connections" driver = webdriver.Chrome('chromedriver.exe') driver.get(url)
You will notice that LinkedIn redirects us to a sign-in page first, so we need to log in before we can scrape. Inspect this webpage to locate the sign-in elements before we write the code.
Try the code below:
time.sleep(1)
usr = ""  # your own LinkedIn login email goes here
pwd = ""  # your own LinkedIn password goes here
time.sleep(1)
if driver.find_element_by_xpath('//*[@class="main__sign-in-container"]/a'):
    driver.find_element_by_xpath('//*[@class="main__sign-in-container"]/a').click()
elem = driver.find_element_by_id("username")
elem.send_keys(usr)
elem = driver.find_element_by_id("password")
elem.send_keys(pwd)
elem.send_keys(Keys.RETURN)
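One caveat: the snippet above uses the Selenium 3 find_element_by_* methods, which were removed in Selenium 4. If pip pulled in a newer Selenium for you, the equivalent calls would look roughly like this (a sketch of the Selenium 4 style, not part of the original post):

from selenium.webdriver.common.by import By

# Selenium 4 equivalents of the lookups above
driver.find_element(By.XPATH, '//*[@class="main__sign-in-container"]/a').click()
elem = driver.find_element(By.ID, "username")
elem.send_keys(usr)
elem = driver.find_element(By.ID, "password")
elem.send_keys(pwd)
elem.send_keys(Keys.RETURN)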
Once logged in successfully, we scroll down the page until we reach the bottom so that all connections are loaded.
hasLoadMore = True
count = 0
while hasLoadMore:
    time.sleep(1)
    try:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        if count > 1:
            hasLoadMore = False
        else:
            count = count + 1
    except:
        hasLoadMore = False
Inspect the webpage again to target the attributes we want from each connection card: the profile link, the name, and the job title.
soup = BeautifulSoup(driver.page_source, "html.parser")
soup_link = soup.findAll("div", attrs={"class": 'mn-connection-card__details'})
for x in range(len(soup_link)):
    ct_link = soup_link[x].find('a', {"class": "mn-connection-card__link ember-view"})
    contact_link.append(ct_link['href'])
    soup_name = soup_link[x].find('span', {"class": "mn-connection-card__name t-16 t-black t-bold"})
    contact_name.append(soup_name.text.strip())
    print(soup_name.text.strip())
    soup_job_title = soup_link[x].find('span', {"class": "mn-connection-card__occupation t-14 t-black--light t-normal"})
    contact_job_title.append(soup_job_title.text.strip())
    print(soup_job_title.text.strip())
Next, we open each collected profile link and scrape the detail page, starting with the location.
detail_page = "https://www.linkedin.com" + ct_link['href']
driver.get(detail_page)
soup = BeautifulSoup(driver.page_source, "html.parser")
soup_location = soup.find('li', {"class": "t-16 t-black t-normal inline-block"})
if soup_location == None:
    contact_location.append("N/A")
else:
    contact_location.append(soup_location.text.strip())
    print(soup_location.text.strip())
On the same profile page, we also try to scrape the email address and phone number.
soup_phone_number = soup.find('span', {"class": "t-14 t-black t-normal"})
if soup_phone_number == None:
    contact_phone_number.append("N/A")
else:
    contact_phone_number.append(soup_phone_number.text.strip())
    print(soup_phone_number.text.strip())

soup_email_1 = soup.find('section', {"class": "pv-contact-info__contact-type ci-email"})
if soup_email_1 != None:
    soup_email_2 = soup_email_1.find('a', {"class": "pv-contact-info__contact-link t-14 t-black t-normal"})
    if soup_email_2 == None:
        contact_email.append("N/A")
    else:
        contact_email.append(soup_email_2.text.strip())
        print(soup_email_2.text.strip())
else:
    contact_email.append("N/A")  # keep the list lengths equal when no email section exists
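Note that the two detail-page snippets above visit a single profile. To cover every connection you collected, you would wrap them in a loop over contact_link. A minimal sketch, assuming the detail-page code is moved into a helper I am calling scrape_profile (a hypothetical name, not from the original post):

def scrape_profile(link):
    # hypothetical helper wrapping the detail-page code shown above
    driver.get("https://www.linkedin.com" + link)
    soup = BeautifulSoup(driver.page_source, "html.parser")
    soup_location = soup.find('li', {"class": "t-16 t-black t-normal inline-block"})
    contact_location.append("N/A" if soup_location is None else soup_location.text.strip())
    # ...the phone number and email lookups from the previous snippet go here too

for link in contact_link:
    scrape_profile(link)  # visit each saved profile in turn
    time.sleep(1)         # small pause between page loads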
Once done, we put all the collected lists into a DataFrame so we can easily write them to a CSV file.
data = {
    'link': contact_link,
    'name': contact_name,
    'job_title': contact_job_title,
    'location': contact_location,
    'phone_number': contact_phone_number,
    'email': contact_email
}
print(data)
print([len(v) for v in data.values()])
df = DataFrame(data, columns=['link', 'name', 'job_title', 'location', 'email', 'phone_number'])
df.to_csv('data.csv')
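To sanity-check the result, you can read the CSV back with pandas and preview it (a quick verification step, not part of the original post):

import pandas as pd

check = pd.read_csv('data.csv')
print(check.head())          # preview the first few scraped connections
print(len(check), "rows")    # should match the number of connections scraped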
That’s all. Let me know if you have any questions.
Hi there
In the code where you have:
time.sleep(1)
usr = ""
pwd = ""
Seems like an obvious question, but would we need to put our own personal login details for LinkedIn into those fields?
Cheers
M
Yes, you are correct. Put your own LinkedIn email into usr and your password into pwd.
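If you would rather not hard-code your password in the notebook, one alternative (my suggestion, not part of the original post) is to read it at run time or from environment variables:

import os
from getpass import getpass

# fall back to an interactive prompt if the environment variables are not set
usr = os.environ.get("LINKEDIN_USER") or input("LinkedIn email: ")
pwd = os.environ.get("LINKEDIN_PASS") or getpass("LinkedIn password: ")  # keeps the password hidden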