Build a Periodic Crawler with Celery

It's a very common use case: you build a crawler and it has to run periodically. Generally we set up a Unix cron job to run the crawler on a schedule.

But it's a real pain: every time you add a new task you have to log in to the server and add a new entry to the crontab. That's only feasible when you have just a few cron jobs to run.

I thought it would be great if I could handle this from my Python code and do some interesting things with it. I had heard about Celery a lot as a message queue system. Honestly, at first I couldn't understand how it works or how I could integrate it into my projects. After some googling I understood what Celery is, and I thought it would be really great if I could build crawlers around it and use Celery to schedule the periodic work.

Install Celery by following this link, then install and configure RabbitMQ from this link. By the way, don't forget to add the user and vhost as described in the Setting Up RabbitMQ section (you can use MongoDB or Redis as the broker too).

Then you can clone my git repo and change the celeryconfig.py file as per your configuration. Add a new task into tasks.py following the first method.
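For reference, the celeryconfig.py ends up looking roughly like the sketch below. This is not the exact file from the repo; the broker user, password, vhost and the task name are placeholders, so swap in whatever you set up in RabbitMQ.

# celeryconfig.py -- a minimal sketch only; the broker credentials, vhost and
# task name below are placeholders, not the values from the actual repo.
from datetime import timedelta

BROKER_URL = "amqp://myuser:mypassword@localhost:5672/myvhost"

# Make sure Celery imports the module that defines the tasks.
CELERY_IMPORTS = ("tasks",)

# Schedule the crawler task to run every 5 seconds.
CELERYBEAT_SCHEDULE = {
    "crawl-every-5-seconds": {
        "task": "tasks.check_site",
        "schedule": timedelta(seconds=5),
    },
}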

I have added a sample method which requests this site and prints the HTTP response status code.
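That sample method boils down to a few lines; here is a rough sketch of the idea, assuming the classic @task decorator API (the function name is illustrative, the real one lives in tasks.py in the repo):

# tasks.py -- a sketch of the sample method described above; the function name
# is illustrative and the decorator assumes the classic celery.task API.
import requests
from celery.task import task

@task
def check_site():
    # Request the site and print the HTTP response status code.
    response = requests.get("http://mushfiq.me")
    print "Response status code: %s" % response.status_code
    return response.status_code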

To run the project, run “celerybeat”; it will start the beat scheduler and begin sending tasks to the broker.

Run “celeryd” in another terminal window to check the task output.

It prints the response status every 5 seconds.

You can handle anything you want to do after the crawling, like parsing the DOM, saving text, submitting forms, etc.

By the way, don't forget to run


pip install -r requirements.txt

to install the necessary packages for the project. The full code is in my GitHub Periodic Crawling Project.


Python Movie Data Crawler

A couple of days ago, I was talking with a newly joined engineer on our team. I found out he is a very resourceful person when it comes to collecting movies: he has a personal movie archive of 3 TB!

So I was really happy to have such a person as a teammate. He has three different lists of movies in text files, and he shared one of them, which contained a couple of hundred movie names. I decided to write a script that scrapes movie ratings from IMDB.com. I googled whether there is any public REST API available and found http://www.imdbapi.com/, which returns JSON as a search result.

The main task is a couple of lines of code, like below:

import requests
import urllib
import json

BASE_URL = 'http://www.imdbapi.com/?'
movie_name = 'The Pianist'  # title to look up
query = {'i': '', 't': movie_name, 'tomatoes': 'true'}
# Build the query string and hit the API.
response = requests.get(BASE_URL + urllib.urlencode(query))
# Parse the JSON response into a dict.
output = json.loads(response.content)

The output is the movie information; to grab the specific fields I did some more formatting (a sketch of that step follows the sample output). You can check the whole script from here.
Right now the script prints output like this:

Getting Movie The Pianist...
{'Plot': 'A Polish Jewish musician struggles to survive the destruction of the Warsaw ghetto of World War II.', 'Rating': '8.5', 'Title': 'The Pianist', 'Director': 'Roman Polanski', 'tomatoRating': '8.2', 'IMDB Rating': '8.5'}
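The extra formatting is basically just picking the interesting keys out of the parsed response; a rough sketch of it, assuming the key names imdbapi.com used at the time (Title, Director, Plot, imdbRating, tomatoRating), looks like this:

# A rough sketch of the formatting step; the key names are assumptions about
# what imdbapi.com returned at the time, so adjust them if the response differs.
summary = {
    'Title': output.get('Title'),
    'Director': output.get('Director'),
    'Plot': output.get('Plot'),
    'Rating': output.get('imdbRating'),
    'IMDB Rating': output.get('imdbRating'),
    'tomatoRating': output.get('tomatoRating'),
}
print summary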

In this script I have used an amazing Python module called Requests.

You can add more functionality, like getting the movie poster or writing the movie data to a file.
If you want to run this script, add a text file named movies.txt, like below (a sketch of reading it follows the list):

The Pianist
The Avengers

Happy coding!

Web Scraping Using Python

A couple of weeks ago, I was working on a project where my responsibility was to scrape several specific pieces of information from a website.

It was a dispensary listing website with around 2600+ dispensaries listed. Information like dispensary name, state, address, email, website, etc. was needed. I decided to use Python for the scraping because of its huge library collection and the third-party packages available.

I used a csv file where the directory names and the RSS feed of each corresponding directory were listed.

The script reads the csv file and takes the name of each directory and its feed link. Then, using feedparser, it counts the number of dispensaries in the RSS feed.
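For what it's worth, reading that csv might look something like the sketch below; the file name and the column order (directory name first, then feed URL) are assumptions about how the input was laid out.

# A minimal sketch of reading the input csv; the file name and column order
# (directory name, feed URL) are assumptions, not the actual input format.
import csv

directories = []
with open('directories.csv', 'rb') as input_file:
    for row in csv.reader(input_file):
        directory_name, feed_path = row[0], row[1]
        directories.append((directory_name, feed_path))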

Feedparser is very handy for collecting specific info from an RSS feed:

feed_data = feedparser.parse(feed_path)
# Number of dispensaries listed in the feed.
count = len(feed_data['entries'])
# Collect the link of every entry in the feed.
link_urls = [entry['link'] for entry in feed_data['entries']]

Then I collected the dispensary URLs and appended them to a list.

Now the real part starts: scraping the data from those URLs. For that I used BeautifulSoup; parsing data is very easy with it.

Before trying it, you have to install it. On Ubuntu the following terminal command is enough:

 sudo apt-get install python-beautifulsoup

or using easy_install:

sudo easy_install BeautifulSoup

Parsing is as simple as below:

import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://mushfiq.me"
# Download the page and parse it into a soup object.
html = urllib2.urlopen(url).read()
data = BeautifulSoup(html)
print data

It will print the whole parsed content of the URL, and then you can navigate the soup and collect the HTML tag values you need :).
Then I looped through my URL list and collected the different pieces of data from each soup.
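As an illustration only, pulling a couple of fields out of the soup could look like the sketch below; the tag name and the class are made up, since the real ones depend entirely on the dispensary site's markup.

# Illustrative only -- the tag name and class below are invented, not the
# dispensary site's real markup.
name_tag = data.find('h1')
name = name_tag.string if name_tag else ''

address_tag = data.find('div', {'class': 'address'})
address = ''.join(address_tag.findAll(text=True)) if address_tag else ''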
The next part was to write that data into a csv file. It's pretty simple in Python:

import csv

# name, title, address and email come from the scraping step above.
data_list = [name, title, address, email]
file_path = "where_you_want_to_save_the_csv_file"
file_p = open(file_path, "ab")  # append mode: one row per dispensary
write_pattern = csv.writer(file_p, delimiter=",", quotechar='"')
write_pattern.writerow(data_list)
file_p.close()

It generates a csv file with the data in this format:
name,title,address,email
You can use the script or contribute to it. As it was my first scraping project, the script is not bulletproof, and you can suggest a better coding approach as well.
The script is here.
Finally, I ran the script on my local machine; it took 5 hours to scrape the 2660+ dispensaries' data.