Mining Job Vacancies on Craigslist

Source code available on my GitHub.

We will use three Python libraries to mine the data and write it out to a CSV file.

The first is BeautifulSoup.

What is BeautifulSoup?

BeautifulSoup is a third-party Python library designed for quick turnaround projects like screen scraping.


What can BeautifulSoup do?

BeautifulSoup can parse anything you give it. It can find all the links whose URLs match “whichwebsite.com”, pull table headings, titles, descriptions, and bold text out of a page, and, most importantly, find every “a” element that has an href attribute.

To use it:

from bs4 import BeautifulSoup
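
Together with the import above, here is a minimal sketch of those ideas; the HTML string is invented for illustration:

html_doc = '<a href="https://whichwebsite.com/jobs">Jobs</a> <a href="https://other.org">Other</a>'
soup = BeautifulSoup(html_doc, 'html.parser')

# Find every "a" element that has an href attribute
for a in soup.find_all('a', href=True):
    print(a['href'])

# Keep only the links whose URLs match "whichwebsite.com"
matches = [a for a in soup.find_all('a', href=True) if 'whichwebsite.com' in a['href']]
print(matches)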

Next, we’ll use requests.

What is Requests?

Requests is an Apache2-licensed HTTP library, written in Python, designed for human beings.

What can Requests do?

Requests lets you send HTTP/1.1 requests. You can add content like headers, form data, multipart files, and parameters using simple Python dictionaries, and access the response data the same way.

To use it:

import requests
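
Together with the import above, here is a minimal sketch; the User-Agent string and the query parameter are placeholders for illustration, not values the script requires:

response = requests.get(
    'https://boston.craigslist.org/search/npo',
    headers={'User-Agent': 'my-scraper/0.1'},  # placeholder header
    params={'query': 'volunteer'},             # placeholder query parameter
)

print(response.status_code)  # HTTP status code of the response
print(response.encoding)     # encoding Requests detected for the body
print(response.text[:200])   # first 200 characters of the body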

Finally, we’ll use Pandas.

What is Pandas?

Pandas is an open-source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for Python.

What can Pandas do?

Pandas enables you to carry out your entire data analysis workflow in Python without having to switch to a more domain-specific language.

To use it:

import pandas as pd
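
Together with the import above, here is a minimal sketch; the row values are made up for illustration:

df = pd.DataFrame({'Job Title': ['Volunteer Coordinator'], 'Location': ['Boston']})
print(df.head())         # preview the first few rows
df.to_csv('sample.csv')  # write the frame out to a CSV file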

We will mine job vacancies from Craigslist and write the data to a CSV file, starting from the Boston nonprofit jobs search page:

url = "https://boston.craigslist.org/search/npo"

Next, a quick refresher on how a Python dictionary is created and updated, since the scraper will collect its results in one:

d = {'key': 'value'}        # create a dictionary with one entry
print(d)
d['new key'] = 'new value'  # add an entry under a new key
print(d)
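
The scraper below collects its results the same way: a running job number as the key and a list of job fields as the value. A minimal sketch with made-up values:

npo_jobs = {}  # job number -> [title, location, date, link, attributes, description]
job_no = 0

job_no += 1
npo_jobs[job_no] = ['Example Title', 'Boston', 'Jan 1', 'https://example.com/job', 'full-time', 'An example description']
print(npo_jobs)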

Then initialize the job counter and the results dictionary, and create a while loop that requests each page of results:

job_no = 0     # running count of scraped jobs
npo_jobs = {}  # maps job number -> list of job fields

while True:

    response = requests.get(url)
    data = response.text
    soup = BeautifulSoup(data, 'html.parser')
    jobs = soup.find_all('p', {'class': 'result-info'})
    
    for job in jobs:
        
        title = job.find('a', {'class': 'result-title'}).text
        location_tag = job.find('span', {'class': 'result-hood'})
        location = location_tag.text[2:-1] if location_tag else "N/A"  # strip the surrounding " (" and ")"
        date = job.find('time', {'class': 'result-date'}).text
        link = job.find('a', {'class': 'result-title'}).get('href')
        
        # Fetch the full posting page for the description and attributes
        job_response = requests.get(link)
        job_data = job_response.text
        job_soup = BeautifulSoup(job_data, 'html.parser')
        job_description_tag = job_soup.find('section', {'id': 'postingbody'})
        job_description = job_description_tag.text if job_description_tag else "N/A"
        job_attributes_tag = job_soup.find('p', {'class': 'attrgroup'})
        job_attributes = job_attributes_tag.text if job_attributes_tag else "N/A"
        
        job_no += 1
        npo_jobs[job_no] = [title, location, date, link, job_attributes, job_description]
        
        
#       print('Job Title:', title, '\nLocation:', location, '\nDate:', date, '\nLink:', link,"\n", job_attributes, '\nJob Description:', job_description,'\n---')
        
    url_tag = soup.find('a', {'title': 'next page'})
    if url_tag and url_tag.get('href'):
        url = 'https://boston.craigslist.org' + url_tag.get('href')
        print(url)
    else:
        break

Finally, we write the mined data to a CSV file:

print("Total Jobs:", job_no)
npo_jobs_df = pd.DataFrame.from_dict(npo_jobs, orient = 'index', columns = ['Job Title','Location','Date', 'Link', 'Job Attributes', 'Job Description'])


npo_jobs_df.head()


npo_jobs_df.to_csv('npo_jobs.csv')
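
As a quick sanity check (this step is not part of the original script), the file can be read back with pandas:

check_df = pd.read_csv('npo_jobs.csv', index_col=0)
print(check_df.shape)  # (number of jobs, number of columns)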

You can download the first set of mined data here.

References:
https://www.pythonforbeginners.com/beautifulsoup/scraping-websites-with-beautifulsoup
https://www.pythonforbeginners.com/requests/using-requests-in-python
https://pandas.pydata.org/
