A couple of weeks ago, I was working on a project where my responsibility was to scrape some specific information from a website.
It was a dispensary listing website with around 2,600+ dispensaries on it. Information like dispensary name, state, address, email, website, etc. was needed. I decided to use Python for the scraping because of its huge library collection and the third-party packages available.
I started with a CSV file where each directory name and the RSS feed of the corresponding directory were listed.
The script reads the CSV file and takes the name of each directory and its feed link. Then, using feedparser, it counts the number of dispensaries in the RSS feed.
feedparser is very handy for collecting specific info from an RSS feed.
import feedparser

# parse the feed and count its entries
feed_data = feedparser.parse(feed_path)
count = len(feed_data['entries'])
# link of the i-th entry in the feed
link_url = feed_data['entries'][i]['link']
Then I collected the dispensary URLs and appended them to a list.
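Putting those pieces together, here is a minimal sketch of how this collection step can look. The file name directories.csv and its two-column layout (directory name, feed URL) are assumptions for illustration, not the exact ones from my script:

import csv
import feedparser

dispensary_urls = []

# assumed layout: each CSV row holds a directory name and its feed URL
csv_file = open("directories.csv", "rb")
for directory_name, feed_path in csv.reader(csv_file):
    feed_data = feedparser.parse(feed_path)
    # every entry in the feed is one dispensary; collect its link
    for entry in feed_data['entries']:
        dispensary_urls.append(entry['link'])
csv_file.close()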
Now the real part starts: scraping the data from those URLs. For that I used BeautifulSoup; parsing data is very easy with it.
Before trying it, you have to install it. On Ubuntu, the following terminal command is enough:
sudo apt-get install python-beautifulsoup
or using easy_install:
sudo easy_install BeautifulSoup
Parsing is as simple as below:
import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://mushfiq.me"
# fetch the page and parse it into a soup object
html = urllib2.urlopen(url).read()
data = BeautifulSoup(html)
print data
It will print the whole parsed tree of the URL. You can then navigate it and collect the HTML tag values you need from the soup.
I looped through my URL list and collected the different pieces of data from each page's soup.
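To give an idea of what that loop can look like, here is a minimal sketch using the BeautifulSoup 3 API. The <h1> and <div class="address"> selectors are hypothetical; the real ones depend on the site's markup:

import urllib2
from BeautifulSoup import BeautifulSoup

# dispensary_urls is the list collected in the earlier step
for dispensary_url in dispensary_urls:
    html = urllib2.urlopen(dispensary_url).read()
    soup = BeautifulSoup(html)
    # hypothetical markup: name in an <h1>, address in a <div class="address">
    name = soup.find('h1').string
    address = soup.find('div', {'class': 'address'}).string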
The next part was writing that data to a CSV file. It's pretty simple in Python.
import csv

data_list = [name, title, address, email]
file_path = "where_you_want_to_save_the_csv_file"
# open in append mode so each row is added below the previous one
file_p = open(file_path, "ab")
write_pattern = csv.writer(file_p, delimiter=",", quotechar='"')
write_pattern.writerow(data_list)
file_p.close()
It generates a CSV file with one comma-separated row per dispensary.
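For illustration, with the fields above a generated row might look like this (the values are made up, not real scraped data):

Green Leaf,Manager,"123 Main St, Denver, CO",info@example.com

Note how the writer quotes the address because it contains commas; that is why the quotechar matters.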
You can use the script or contribute to it. As this was my first scraping project, the script is not bulletproof, so feel free to suggest a better coding approach as well.
The script is here.
Finally, I ran the script on my local machine; it took 5 hours to scrape the 2,660+ dispensaries' data.