It's a very common use case: you build a crawler and it has to run periodically. Generally we set up a Unix cron job to handle that.
But it's a real pain when adding a new task means logging in to the server and editing the crontab. That is only feasible when you have just a few cron jobs.
I thought it would be great if I could handle this from my Python code and do some interesting things with it. I had heard about Celery a lot as a message queue system. Honestly, at first I couldn't understand how it works or how I could integrate it with my projects. After some googling I understood what Celery is, and I realized it would be really great if I could build crawlers around it and use Celery to schedule the periodic work.
Install Celery by following this link, then install and configure RabbitMQ from this link. By the way, don't forget to add the user and vhosts as described in the Setting Up RabbitMQ section (you can use MongoDB or Redis as the broker too).
Then you can clone my Git repo and change the celeryconfig.py file to match your configuration. Add a new task to tasks.py following the first method.
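For reference, the periodic schedule in celeryconfig.py might look roughly like this. This is a sketch using the old-style Celery setting names that match the celeryd/celerybeat commands used here; the broker credentials, task name, and interval are assumptions, so adapt them to your own setup:

```python
# celeryconfig.py (sketch) -- broker URL, task name, and interval are placeholders.
from datetime import timedelta

# Points at the RabbitMQ user/vhost you created during setup.
BROKER_URL = "amqp://myuser:mypassword@localhost:5672/myvhost"

# Modules celeryd should import to find task definitions.
CELERY_IMPORTS = ("tasks",)

# celerybeat reads this schedule and dispatches the task to the broker.
CELERYBEAT_SCHEDULE = {
    "crawl-every-5-seconds": {
        "task": "tasks.crawl",
        "schedule": timedelta(seconds=5),
    },
}
```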
I have added a sample method which requests this site and prints the HTTP response status code.
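As a sketch, the sample task looks something like this. The function name and URL are placeholders, and I use the stdlib `urllib` here; in the actual tasks.py the function is registered as a Celery task (with the old-style `@task` decorator from `celery.task`):

```python
# tasks.py (sketch) -- function name and default URL are placeholders.
# In the real project this function is decorated as a Celery task so
# celerybeat can schedule it and celeryd can execute it.
import urllib.request


def crawl(url="http://example.com", fetch=urllib.request.urlopen):
    """Request the site and print the HTTP response status code.

    `fetch` is injectable so the function can be exercised without
    hitting the network.
    """
    response = fetch(url)
    status = response.getcode()
    print(status)
    return status
```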
To run the project, run “celerybeat”; it will start the beat scheduler and begin sending tasks to the broker, like below:
Run “celeryd” in another terminal window to check the task output; you will see something like below:
It prints the response status every 5 seconds.
You can handle anything you want to do after the crawl: parsing the DOM, saving text, submitting forms, etc.
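For example, a small DOM-parsing step could be added to the task using only the stdlib. This is just a sketch of the idea (the class and function names are mine, not from the repo); it collects every link target from the fetched HTML:

```python
# Sketch of post-crawl processing: extract link targets from fetched HTML.
# Names here are illustrative; in the project this would live inside the task.
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag in the document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all link targets found in the given HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links
```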
By the way, don't forget to run
pip install -r requirements.txt
to install the necessary packages for the project. My GitHub repo: Periodic Crawling Project