Build a Periodic Crawler with Celery

It's a very common use case: you build a crawler and it has to run periodically. Generally we set up a Unix cron job to run the crawler on a schedule.

But it's a real pain: every time you add a new task, you have to log in to the server and add a new entry to the crontab. That is only feasible when you have just a few cron jobs to run.

I thought it would be great if I could handle this from my Python code and do some interesting things with it. I had heard a lot about Celery as a distributed task queue. Truly speaking, at first I couldn't understand how it works or how I could integrate it with my projects. After some googling I understood what Celery is, and then I thought it would be really great to run crawlers around it and use Celery to schedule the periodic work.

Install Celery by following this link, then install and configure RabbitMQ from this link. BTW, don't forget to add a user and a vhost; this is described in the "Setting up RabbitMQ" section (you can use MongoDB or Redis as the broker too).

Then you can clone my git repo and change the celeryconfig.py file to match your configuration. Add a new task to tasks.py, following the first method as a template.
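To give an idea of what goes into celeryconfig.py, here is a minimal sketch. It is not the file from my repo: the user, password and vhost names are made-up placeholders, and it assumes the older uppercase setting names that go with the celeryd/celerybeat commands (newer Celery releases prefer lowercase names).

```python
# celeryconfig.py: a hypothetical example, every value here is a placeholder.

# Broker connection, in the form amqp://<user>:<password>@<host>:<port>/<vhost>.
# "crawler", "secret" and "crawler_vhost" stand in for the RabbitMQ user,
# password and vhost you created while setting up RabbitMQ.
BROKER_URL = "amqp://crawler:secret@localhost:5672/crawler_vhost"

# Modules the worker imports at startup so that it can find your tasks.
CELERY_IMPORTS = ("tasks",)
```

If you use Redis or MongoDB as the broker instead, only the BROKER_URL line should need to change.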

I have added a sample method which requests this site and prints the HTTP response status code.
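The body of such a method can be very small. The sketch below is an assumption about its shape, not the code from the repo: it uses the standard library's urllib instead of whatever HTTP library the project installs, the name fetch_status is hypothetical, and the Celery task decorator is left out so the function runs on its own. In tasks.py you would register it as a task with Celery's decorator.

```python
import urllib.error
import urllib.request


def fetch_status(url, opener=urllib.request.urlopen, timeout=10):
    """Request `url`, print the HTTP status code and return it.

    `opener` is a hypothetical hook that makes the function easy to try
    without network access; by default it is urllib's own urlopen.
    """
    try:
        response = opener(url, timeout=timeout)
    except urllib.error.HTTPError as err:
        # urlopen raises for 4xx/5xx responses; the code is still useful.
        print(err.code)
        return err.code
    try:
        status = response.getcode()
    finally:
        response.close()
    print(status)
    return status
```

Returning the status as well as printing it means the worker can log it and any later step can act on it.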

To run the project, run “celerybeat”. It starts the scheduler, which then begins sending tasks to the broker.

Run “celeryd” in another terminal window to see the task output as the worker picks up each task.

It prints the response status every 5 seconds.
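The 5-second interval comes from the beat schedule in the configuration. A sketch of such an entry, assuming the older CELERYBEAT_SCHEDULE setting name; the entry name, the task path tasks.fetch_status and the URL are hypothetical placeholders:

```python
from datetime import timedelta

# Each entry tells celerybeat which task to send and how often to send it.
CELERYBEAT_SCHEDULE = {
    "check-site-every-5-seconds": {
        "task": "tasks.fetch_status",       # hypothetical task name
        "schedule": timedelta(seconds=5),   # run every 5 seconds
        "args": ("http://example.com/",),   # placeholder URL for the task
    },
}
```

Adding another periodic crawl is then just another entry in this dictionary, with no crontab editing on the server.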

You can do anything you want after the crawl, like parsing the DOM, saving text, submitting a form, etc.
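As one example of post-crawl processing, here is a rough sketch that pulls the links and the visible text out of a fetched page. It uses only the standard library's html.parser; the class and function names are hypothetical, and a real project might well prefer a dedicated HTML parsing library.

```python
from html.parser import HTMLParser


class LinkAndTextParser(HTMLParser):
    """Collect href values of <a> tags and visible text from a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # greater than 0 while inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a script or style block.
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def extract(html):
    """Return (links, text) for an HTML string."""
    parser = LinkAndTextParser()
    parser.feed(html)
    parser.close()
    return parser.links, " ".join(parser.text_parts)
```

A task could call extract() on the response body and then save the text or queue the links for another crawl.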

By the way, don't forget to run


pip install -r requirements.txt

to install the packages the project needs. The code is in my GitHub Periodic Crawling Project.
