Web Scraper and Automation
Scraping static data from a website
To scrape particular information from a loaded web page, we will use the Python library BeautifulSoup.
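A minimal sketch of static scraping with requests plus BeautifulSoup; the URL and the CSS selector are hypothetical placeholders, not real endpoints:

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical URL and selector, purely for illustration
resp = requests.get("https://example.com/articles", timeout=15)
soup = BeautifulSoup(resp.text, "html.parser")

# Extract every article link on the page
for link in soup.select("a.article-title"):
    print(link.get_text(strip=True), link.get("href"))
```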
Scraping dynamic data from a website (after logging in to the site)
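A sketch of the login-then-scrape flow with Selenium, assuming a simple login form; the URL, the field names, and the post-login "dashboard" marker are all hypothetical:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com/login")                      # hypothetical login page
driver.find_element(By.NAME, "username").send_keys("user")   # hypothetical field names
driver.find_element(By.NAME, "password").send_keys("secret")
driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()

# Wait until an element that only appears after login is present,
# then grab the fully rendered HTML
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "dashboard")))
html = driver.page_source                                    # parse this with BeautifulSoup
driver.quit()
```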
Scraping data from a website with dynamic content and an iframe, which also requires entering input
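When the data sits inside an iframe, the driver has to switch into that frame before it can see or type into its elements. A sketch with hypothetical selectors:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/widget-page")        # hypothetical URL

# Switch into the iframe that hosts the dynamic content
frame = driver.find_element(By.CSS_SELECTOR, "iframe#data-widget")
driver.switch_to.frame(frame)

# Enter the required input and read the result inside the iframe
driver.find_element(By.NAME, "query").send_keys("example search")
driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()
results = driver.find_element(By.ID, "results").text

driver.switch_to.default_content()                   # back to the main page
driver.quit()
```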
How to avoid getting blocked or detected?
Use multiple IPs (proxies) and keep rotating them, so fewer requests come from any single IP (a sketch follows this list)
Use a third-party service (e.g. Luminati / Bright Data) to re-route the IP from which requests are made
Wait on every page to let the data load in the browser
Once done with browsing, close the browser
Imitate human-like behaviour while surfing the website:
Scroll through the page
Insert random, realistic pauses on the website
Use a CAPTCHA-solving service
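A minimal sketch combining proxy rotation and random pauses; the proxy addresses and the User-Agent string are placeholders, not working values:

```python
import random
import time
import requests

# Placeholder proxy pool; in practice these come from a provider such as Bright Data
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

def fetch(url):
    proxy = random.choice(PROXIES)                   # rotate: a different IP per request
    resp = requests.get(url, headers=HEADERS,
                        proxies={"http": proxy, "https": proxy},
                        timeout=15)
    time.sleep(random.uniform(2, 6))                 # random, human-like pause
    return resp.text
```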
Common signals that get a scraper detected:
Scraping pages faster than a human possibly can puts you straight into the "bots" category.
Following the same pattern while scraping, e.g. going through every page of the target domain just to collect images or links.
Making requests from the same IP for an extended period of time.
A missing User-Agent header, which is the default in some headless browsers (a sketch for setting one follows this list).
If a login and password are required, it is difficult to scrape a lot of data from a single account without getting noticed.
Example: LinkedIn withholds a lot of data from an account it flags as a robot once monthly requests exceed a certain limit (the limit differs between premium and other accounts).
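To avoid the missing User-Agent signal when running headless, an explicit UA string can be set on the browser options; the string below is an illustrative example, not a recommendation:

```python
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
# Illustrative desktop User-Agent; substitute a current, realistic one
options.add_argument(
    "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")
driver = webdriver.Chrome(options=options)
```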
Scraping may not be legal! Check the target site's terms of service before scraping.
How to speed up the scraping?
To speed up processing, distribute the work across multiple worker processes with Celery (a sketch follows this list)
Avoid loading images, ads and other data that is not required into the browser
Use the Flower UI to monitor scraping progress and task completion
To deploy on a browser-less server, use the Selenium Grid Docker image, which supports running multiple sessions and multiple browsers
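A sketch of distributing page fetches as Celery tasks, assuming a local Redis broker; it also shows the Chrome preference that skips image loading. The broker URL, module name, and task body are assumptions for illustration:

```python
# tasks.py
from celery import Celery
from selenium import webdriver

app = Celery("scraper", broker="redis://localhost:6379/0")   # assumed Redis broker

def make_driver():
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    # Skip images to cut bandwidth and speed up page loads
    options.add_experimental_option(
        "prefs", {"profile.managed_default_content_settings.images": 2})
    return webdriver.Chrome(options=options)

@app.task
def scrape_page(url):
    driver = make_driver()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()                                        # always close the browser
```

Pages are queued with scrape_page.delay(url), workers run via celery -A tasks worker --concurrency=4, and celery -A tasks flower starts the Flower dashboard. On a browser-less server, make_driver would instead return a webdriver.Remote pointed at the Selenium Grid container.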
Selenium drives a real browser rather than sending raw HTTP requests, so we cannot simply attach custom headers to each request to further reduce the chance of being detected as a robot scraping the website (one Chromium-specific workaround is sketched below). ...
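One hedged workaround for Chromium-based drivers: Selenium 4 exposes the Chrome DevTools Protocol, which can inject extra headers into every request the browser makes. The header values here are examples only:

```python
from selenium import webdriver

driver = webdriver.Chrome()
# Chromium-only: use the DevTools Protocol to attach extra headers to all requests
driver.execute_cdp_cmd("Network.enable", {})
driver.execute_cdp_cmd("Network.setExtraHTTPHeaders",
                       {"headers": {"Accept-Language": "en-US,en;q=0.9"}})
driver.get("https://example.com")     # hypothetical target
```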