Web Scraper and Automation

Scraping static data from a website

To scrape particular information from a loaded web page, we will use the Python library BeautifulSoup.
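
A minimal sketch of static scraping with BeautifulSoup; the HTML snippet below is a made-up stand-in for a downloaded page, and the tag/class names are illustrative only:

```python
from bs4 import BeautifulSoup

# A made-up HTML snippet standing in for a page fetched over HTTP
html = """
<html><body>
  <h1 class="title">Example Product</h1>
  <span class="price">$19.99</span>
  <a href="/next">Next page</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() / select_one() locate single elements; find_all() collects many
title = soup.find("h1", class_="title").get_text(strip=True)
price = soup.select_one("span.price").get_text(strip=True)
links = [a["href"] for a in soup.find_all("a", href=True)]

print(title, price, links)
```

In practice the `html` string would come from `requests.get(url).text` (static pages) or from a Selenium-driven browser (dynamic pages).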

Scraping dynamic data from a website (after logging in to the site)

Scraping data from a website that has dynamic content and iframes, and also requires entering input
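
A minimal Selenium sketch of that flow, assuming hypothetical names (`data-frame`, `query`, `.result`) for the iframe and its elements; imports are kept inside the function so the sketch stays self-contained:

```python
def scrape_widget_inside_iframe(url):
    """Sketch: load a page, switch into an iframe, type a query, read the result.

    The URL, frame id, and element names here are illustrative placeholders.
    """
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    browser = webdriver.Chrome()
    try:
        browser.get(url)
        # wait until the iframe exists, then move the driver's context into it
        WebDriverWait(browser, 10).until(
            EC.frame_to_be_available_and_switch_to_it((By.ID, "data-frame"))
        )
        # elements inside the iframe are only reachable after the switch
        search_box = browser.find_element(By.NAME, "query")
        search_box.send_keys("example input")
        search_box.submit()
        result = WebDriverWait(browser, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ".result"))
        ).text
        browser.switch_to.default_content()  # step back out of the iframe
        return result
    finally:
        browser.quit()  # once done with browsing, close the browser
```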

How to avoid getting blocked or detected?

  • Use multiple IPs (proxies) and keep rotating them (to make fewer requests from a single IP)

Use a third-party service (for example Luminati / Bright Data) to re-route the IP from which requests are made.

# when testing proxies, fetch the proxy's reported IP and location
browser.get('http://lumtest.com/myip.json')
print(browser.page_source)
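
A round-robin rotation sketch; the proxy addresses below are placeholders, and `fetch_via_proxy` assumes the `requests` library:

```python
import itertools

# Placeholder proxy addresses -- substitute real ones from your provider
PROXIES = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]

proxy_pool = itertools.cycle(PROXIES)

def next_proxy():
    """Return the next proxy in round-robin order, so no single IP makes every request."""
    return next(proxy_pool)

def fetch_via_proxy(url):
    """Sketch: make one request through the next proxy (requires `requests`)."""
    import requests
    proxy = next_proxy()
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
```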
  • Wait on every page to let the data load in the browser

  • Once done with browsing, close the browser

  • Imitate human-like behaviour while surfing the website

    • Scrolling through the page

    • Taking random, appropriately long pauses on the page

  • Use a captcha-solving service

What gets you detected?

    1. If you are scraping pages faster than a human possibly can, you will fall into the “bot” category.

    2. Following the same pattern while scraping, for example going through every page of the target domain just to collect images or links.

    3. Scraping from the same IP for an extended period of time.

    4. A missing User-Agent header, for example when using a headless browser that does not set one.
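
The human-like behaviour above (random pauses, incremental scrolling) can be sketched as two small helpers; the pause ranges and scroll step sizes are arbitrary choices, not recommended values:

```python
import random
import time

def human_pause(low=1.5, high=4.0):
    """Sleep for a random interval so request timing does not follow a fixed pattern."""
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay

def scroll_like_a_human(browser, steps=5):
    """Sketch: scroll a Selenium-driven page down in small, random increments."""
    for _ in range(steps):
        # scroll by a random pixel amount, then pause briefly, like a reader would
        browser.execute_script(
            "window.scrollBy(0, arguments[0]);", random.randint(200, 600)
        )
        human_pause(0.5, 1.5)
```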

How to speed up the scraping?

  1. To speed up processing, distribute the work across multiple worker processes using Celery

  2. Avoid loading images, ads and other data that is not required into the browser
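
One way to skip images in Chrome is the content-settings preference shown below (`2` means block); the function wrapper is just a convenience for this sketch:

```python
def image_blocking_prefs():
    """Chrome preference dict that disables image loading (2 = block)."""
    return {"profile.managed_default_content_settings.images": 2}

def make_lean_browser():
    """Sketch: build a Chrome driver that skips images to speed up page loads."""
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_experimental_option("prefs", image_blocking_prefs())
    return webdriver.Chrome(options=options)
```

Ad blocking would need a separate mechanism, such as loading an ad-blocking extension into the browser profile.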

Use the Flower UI for monitoring scraping tasks and their completion.
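
A minimal Celery sketch, assuming a local Redis broker and an illustrative task body (`celery` and `requests` would need to be installed; the module name `tasks` below is a placeholder):

```python
def make_celery_app(broker_url="redis://localhost:6379/0"):
    """Sketch: a Celery app whose workers scrape pages in parallel."""
    from celery import Celery

    app = Celery("scraper", broker=broker_url)

    @app.task
    def scrape_page(url):
        # each worker handles one URL at a time; calling
        # scrape_page.delay(url) queues it for any free worker
        import requests
        return requests.get(url, timeout=10).text

    return app
```

Workers are then started with `celery -A tasks worker --concurrency=4`, and `celery -A tasks flower` serves the Flower monitoring UI (by default at http://localhost:5555).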

To deploy on a browser-less server, use the Selenium Grid Docker images, which support running multiple sessions and multiple browsers.
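
A sketch of connecting to such a container with `webdriver.Remote`; the grid URL assumes the container is running locally on the default port:

```python
def make_remote_browser(grid_url="http://localhost:4444"):
    """Sketch: drive a browser inside a Selenium container instead of a local one.

    First start the container, for example:
        docker run -d -p 4444:4444 --shm-size=2g selenium/standalone-chrome
    """
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # no display on the server anyway
    return webdriver.Remote(command_executor=grid_url, options=options)
```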

