Source code available on my GitHub.
We will use three Python libraries to mine the data and write the results to a CSV file.
First is BeautifulSoup.
What is BeautifulSoup?
BeautifulSoup is a third-party Python library designed for quick turnaround projects like screen-scraping.
What can BeautifulSoup do?
BeautifulSoup can parse almost anything you give it. It can find all the links whose URLs match "whichwebsite.com", pull out table headings, titles, descriptions, and bold text, and, most importantly, find every "a" element that has an href attribute, and more.
To use it:
from bs4 import BeautifulSoup
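As a quick illustration, here is a minimal sketch of the lookups described above; the HTML snippet and the domain whichwebsite.com are made up for this example:

from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet purely for illustration.
html = '''
<html><body>
  <a href="https://whichwebsite.com/jobs">Jobs</a>
  <a href="https://example.org/about">About</a>
</body></html>
'''
soup = BeautifulSoup(html, 'html.parser')

# Every <a> element that has an href attribute:
for a in soup.find_all('a', href=True):
    print(a['href'])

# Only the links whose URLs match "whichwebsite.com":
matches = [a['href'] for a in soup.find_all('a', href=True)
           if 'whichwebsite.com' in a['href']]
print(matches)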
Next, we’ll use requests.
What is Requests?
Requests is an Apache2-licensed HTTP library, written in Python, designed for human beings.
What can Requests do?
Requests allows you to send HTTP/1.1 requests. You can add content like headers, form data, multipart files, and parameters via simple Python dictionaries, and access the response data the same way.
To use it:
import requests
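For example, here is a minimal sketch of a GET request with custom headers and parameters; the URL (httpbin.org, a public HTTP testing service) and the header value are placeholders for illustration:

import requests

# Send a GET request with a custom header and a query parameter.
response = requests.get('https://httpbin.org/get',
                        headers={'User-Agent': 'my-scraper/0.1'},
                        params={'query': 'jobs'})

print(response.status_code)   # e.g. 200
print(response.text[:200])    # first 200 characters of the response body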
Finally, we’ll use Pandas.
What is Pandas?
Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for Python.
What can Pandas do?
Pandas enables you to carry out your entire data analysis workflow in Python without having to switch to a more domain-specific language.
To use it:
import pandas as pd
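As a preview of how we will use it at the end, here is a minimal sketch that turns a dictionary into a DataFrame and writes it to CSV; the job entries are invented for illustration:

import pandas as pd

# A toy dictionary shaped like the one we will build while scraping.
toy_jobs = {1: ['Program Manager', 'Boston', 'Oct 30'],
            2: ['Volunteer Coordinator', 'Cambridge', 'Oct 29']}

df = pd.DataFrame.from_dict(toy_jobs, orient='index',
                            columns=['Job Title', 'Location', 'Date'])
print(df.head())
df.to_csv('toy_jobs.csv')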
We will be using Craigslist to mine job vacancies and write the data to a CSV file, starting from the first page of the nonprofit jobs search results:
url = "https://boston.craigslist.org/search/npo"
Before we start, here is a quick refresher on creating and updating a Python dictionary, since that is how we will accumulate the scraped jobs:
d = {'key': 'value'}
print(d)

d['new key'] = 'new value'
print(d)
Then create a while loop that requests each results page, parses the job listings, and follows the "next page" link until there are none left. We also initialize job_no and npo_jobs before the loop so the results can be counted and collected:
job_no = 0      # running count of jobs scraped
npo_jobs = {}   # job number -> [title, location, date, link, attributes, description]

while True:
    response = requests.get(url)
    data = response.text
    soup = BeautifulSoup(data, 'html.parser')
    jobs = soup.find_all('p', {'class': 'result-info'})
    for job in jobs:
        title = job.find('a', {'class': 'result-title'}).text
        location_tag = job.find('span', {'class': 'result-hood'})
        location = location_tag.text[2:-1] if location_tag else "N/A"
        date = job.find('time', {'class': 'result-date'}).text
        link = job.find('a', {'class': 'result-title'}).get('href')
        # Fetch the individual posting for its description and attributes.
        job_response = requests.get(link)
        job_data = job_response.text
        job_soup = BeautifulSoup(job_data, 'html.parser')
        job_description = job_soup.find('section', {'id': 'postingbody'}).text
        job_attributes_tag = job_soup.find('p', {'class': 'attrgroup'})
        job_attributes = job_attributes_tag.text if job_attributes_tag else "N/A"
        job_no += 1
        npo_jobs[job_no] = [title, location, date, link, job_attributes, job_description]
        # print('Job Title:', title, '\nLocation:', location, '\nDate:', date,
        #       '\nLink:', link, '\n', job_attributes,
        #       '\nJob Description:', job_description, '\n---')
    # Follow the "next page" link; stop when there isn't one.
    url_tag = soup.find('a', {'title': 'next page'})
    if url_tag and url_tag.get('href'):
        url = 'https://boston.craigslist.org' + url_tag.get('href')
        print(url)
    else:
        break
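If you run this against the live site, it is worth adding a short pause (for example, time.sleep(1) from the standard library) between requests so the scraper doesn't hammer the server. Craigslist can also change its markup over time, in which case the class names above would need updating.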
Finally, we write the mined data to a CSV file:
print("Total Jobs:", job_no) npo_jobs_df = pd.DataFrame.from_dict(npo_jobs, orient = 'index', columns = ['Job Title','Location','Date', 'Link', 'Job Attributes', 'Job Description']) npo_jobs_df.head() npo_jobs_df.to_csv('npo_jobs.csv')
Download the first mined data here.