Web Scraper and Automation

Scraping static data from a website

To scrape particular information from a loaded web page, we will use the Python library BeautifulSoup.
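
A minimal sketch of static scraping with BeautifulSoup; the HTML snippet below is a made-up stand-in for a downloaded page, and the tag/class names are illustrative only:

```python
from bs4 import BeautifulSoup

# A made-up HTML snippet standing in for a page fetched over HTTP
html = """
<html><body>
  <h1 class="title">Example Product</h1>
  <span class="price">$19.99</span>
  <a href="/next">Next page</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() / select_one() locate single elements; find_all() collects many
title = soup.find("h1", class_="title").get_text(strip=True)
price = soup.select_one("span.price").get_text(strip=True)
links = [a["href"] for a in soup.find_all("a", href=True)]

print(title, price, links)
```

In practice the `html` string would come from `requests.get(url).text` (static pages) or from a Selenium-driven browser (dynamic pages).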

Scraping dynamic data from a website (after logging in to the site)

Scraping data from a website that has dynamic content and iframes, and also requires entering input
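
A minimal Selenium sketch of that flow, assuming hypothetical names (`data-frame`, `query`, `.result`) for the iframe and its elements; imports are kept inside the function so the sketch stays self-contained:

```python
def scrape_widget_inside_iframe(url):
    """Sketch: load a page, switch into an iframe, type a query, read the result.

    The URL, frame id, and element names here are illustrative placeholders.
    """
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    browser = webdriver.Chrome()
    try:
        browser.get(url)
        # wait until the iframe exists, then move the driver's context into it
        WebDriverWait(browser, 10).until(
            EC.frame_to_be_available_and_switch_to_it((By.ID, "data-frame"))
        )
        # elements inside the iframe are only reachable after the switch
        search_box = browser.find_element(By.NAME, "query")
        search_box.send_keys("example input")
        search_box.submit()
        result = WebDriverWait(browser, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ".result"))
        ).text
        browser.switch_to.default_content()  # step back out of the iframe
        return result
    finally:
        browser.quit()  # once done with browsing, close the browser
```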

How to avoid getting blocked or detected?

  • Use multiple IPs (proxies) and keep rotating them (to make fewer requests from a single IP)

Use a third-party service (for example Luminati / Bright Data) to re-route the IP from which requests are made.

# when testing proxies, fetch the proxy's reported IP and location
browser.get('http://lumtest.com/myip.json')
print(browser.page_source)
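
A round-robin rotation sketch; the proxy addresses below are placeholders, and `fetch_via_proxy` assumes the `requests` library:

```python
import itertools

# Placeholder proxy addresses -- substitute real ones from your provider
PROXIES = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]

proxy_pool = itertools.cycle(PROXIES)

def next_proxy():
    """Return the next proxy in round-robin order, so no single IP makes every request."""
    return next(proxy_pool)

def fetch_via_proxy(url):
    """Sketch: make one request through the next proxy (requires `requests`)."""
    import requests
    proxy = next_proxy()
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
```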
  • Wait on every page to let the data load in the browser

  • Once done with browsing, close the browser

  • Imitate human-like behaviour while surfing the website

    • Scrolling through the page

    • Taking random, appropriately long pauses on the page

  • Use a captcha-solving service

What gets you detected?

    1. If you are scraping pages faster than a human possibly can, you will fall into the “bot” category.

    2. Following the same pattern while scraping, for example going through every page of the target domain just to collect images or links.

    3. Scraping from the same IP for an extended period of time.

    4. A missing User-Agent header, for example when using a headless browser that does not set one.
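
The human-like behaviour above (random pauses, incremental scrolling) can be sketched as two small helpers; the pause ranges and scroll step sizes are arbitrary choices, not recommended values:

```python
import random
import time

def human_pause(low=1.5, high=4.0):
    """Sleep for a random interval so request timing does not follow a fixed pattern."""
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay

def scroll_like_a_human(browser, steps=5):
    """Sketch: scroll a Selenium-driven page down in small, random increments."""
    for _ in range(steps):
        # scroll by a random pixel amount, then pause briefly, like a reader would
        browser.execute_script(
            "window.scrollBy(0, arguments[0]);", random.randint(200, 600)
        )
        human_pause(0.5, 1.5)
```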

How to speed up the scraping?

  1. To speed up processing, distribute the work across multiple worker processes using Celery

  2. Avoid loading images, ads and other data that is not required into the browser
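
One way to skip images in Chrome is the content-settings preference shown below (`2` means block); the function wrapper is just a convenience for this sketch:

```python
def image_blocking_prefs():
    """Chrome preference dict that disables image loading (2 = block)."""
    return {"profile.managed_default_content_settings.images": 2}

def make_lean_browser():
    """Sketch: build a Chrome driver that skips images to speed up page loads."""
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_experimental_option("prefs", image_blocking_prefs())
    return webdriver.Chrome(options=options)
```

Ad blocking would need a separate mechanism, such as loading an ad-blocking extension into the browser profile.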

Use the Flower UI for monitoring scraping tasks and their completion.
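
A minimal Celery sketch, assuming a local Redis broker and an illustrative task body (`celery` and `requests` would need to be installed; the module name `tasks` below is a placeholder):

```python
def make_celery_app(broker_url="redis://localhost:6379/0"):
    """Sketch: a Celery app whose workers scrape pages in parallel."""
    from celery import Celery

    app = Celery("scraper", broker=broker_url)

    @app.task
    def scrape_page(url):
        # each worker handles one URL at a time; calling
        # scrape_page.delay(url) queues it for any free worker
        import requests
        return requests.get(url, timeout=10).text

    return app
```

Workers are then started with `celery -A tasks worker --concurrency=4`, and `celery -A tasks flower` serves the Flower monitoring UI (by default at http://localhost:5555).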

To deploy on a browser-less server, use the Selenium Grid Docker images, which support running multiple sessions and multiple browsers.
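
A sketch of connecting to such a container with `webdriver.Remote`; the grid URL assumes the container is running locally on the default port:

```python
def make_remote_browser(grid_url="http://localhost:4444"):
    """Sketch: drive a browser inside a Selenium container instead of a local one.

    First start the container, for example:
        docker run -d -p 4444:4444 --shm-size=2g selenium/standalone-chrome
    """
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # no display on the server anyway
    return webdriver.Remote(command_executor=grid_url, options=options)
```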

