Web Scraping Using Python

A couple of weeks ago, I was working on a project where my responsibility was to scrape several specific pieces of information from a website.

It was a dispensary listing website with around 2,600+ dispensaries on it. Information like the dispensary name, state, address, email, website, etc. was needed. I decided to use Python for scraping because of its huge library collection and the third-party packages available.

The input was a CSV file listing each directory name and the RSS feed of the corresponding directory.

The script reads the CSV file and takes the name of each directory and its link. Then, using feedparser, it counts the number of dispensaries in the RSS feed.
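Reading that file is straightforward with Python's csv module. Here is a minimal sketch of that step; the file name directories.csv and the column order are my assumptions, since the original input file isn't shown:

import csv

# Hypothetical input: each row holds a directory name and its RSS feed URL.
file_p = open("directories.csv", "rb")
for row in csv.reader(file_p):
    directory_name, feed_path = row[0], row[1]
    # feed_path is handed over to feedparser in the next step
file_p.close()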

feedparser is very handy for collecting specific info from an RSS feed.

import feedparser

feed_data = feedparser.parse(feed_path)     # feed_path comes from the CSV file
count = len(feed_data['entries'])           # number of dispensaries in the feed
link_url = feed_data['entries'][i]['link']  # link of the i-th entry

Then I collected the dispensary URLs and appended them to a list.
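In sketch form, building on the feed_data above, that list could be built like this (url_list is my own name for it; the post doesn't show the variable):

import feedparser

feed_data = feedparser.parse(feed_path)
url_list = []
for entry in feed_data['entries']:
    url_list.append(entry['link'])  # collect each dispensary's link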

Now the real part starts: scraping the data from those URLs. For that I used BeautifulSoup. Parsing data is very easy with it.

Before trying it, you have to install it. On Ubuntu, the following terminal command is enough:

 sudo apt-get install python-beautifulsoup

or using easy_install:

sudo easy_install BeautifulSoup

Parsing is as simple as below:

import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://mushfiq.me"
html = urllib2.urlopen(url).read()  # download the raw HTML
data = BeautifulSoup(html)          # parse it into a navigable soup
print data

It will print the whole parsed content of the URL. Then you can navigate the soup and collect the HTML tag values you need :).
Then I looped through my URL list and collected the different pieces of scraped data from the soup.
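As a rough sketch, that loop could look like the following. The tag names and the address class are hypothetical, since the actual markup of the dispensary pages isn't shown in this post:

import urllib2
from BeautifulSoup import BeautifulSoup

for link_url in url_list:
    html = urllib2.urlopen(link_url).read()
    soup = BeautifulSoup(html)
    # Hypothetical markup: name in an <h1>, address in a <span class="address">.
    name_tag = soup.find('h1')
    address_tag = soup.find('span', {'class': 'address'})
    name = name_tag.string if name_tag else ""
    address = address_tag.string if address_tag else ""

Each of those values then goes into a list for the CSV step below.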
The next part was to write the data into a CSV file. It's pretty simple in Python.

import csv

data_list = [name, title, address, email]  # values collected from the soup
file_path = "where_you_want_to_save_the_csv_file"
file_p = open(file_path, "ab")  # append mode, so each dispensary adds a new row
write_pattern = csv.writer(file_p, delimiter=",", quotechar='"')
write_pattern.writerow(data_list)
file_p.close()

It generates a CSV file with the data in this format:
name,title,address,email
You can use the script or contribute to it. As it was my first scraping project, the script is not bulletproof; you can also suggest a better coding approach.
The script is here.
Finally, I ran the script on my local machine; it took 5 hours to scrape the 2,660+ dispensaries' data.
