Web Scraping using Scrapy in Python for Data Science

  • Category
    IT & Software
  • Created At
    7 months ago
Hello and welcome to my new course 'The Complete Beginners to Advanced Guide to Web Scraping using Scrapy'.

You already know that in this age of information technology, data is everything, and there is plenty of it everywhere. Huge volumes of data are generated every second. But, just like the saying 'Water, water everywhere, nor any drop to drink', usable, structured, tabular data is scarce compared to the vast amount of data spread across the internet. Modern data analytics and machine learning require structured data for model training and evaluation.

Web scraping is a method for automatically obtaining large amounts of data from websites. Most of this data is unstructured and comes in the form of HTML pages or tables. A web scraping tool can extract this data and convert it into a structured form, such as a spreadsheet or a database, so that it can be used in various applications.

There are many tools available to perform web scraping. Scrapy is a free and open-source web-crawling framework written in Python, currently maintained by Zyte (formerly Scrapinghub), a web-scraping development and services company. Originally designed for web scraping, Scrapy can also be used to extract data using APIs or as a general-purpose web crawler.

Here is an overview of the sessions included in this quick Scrapy course.

In the first session, we will have an introduction to web scraping: what web scraping is, why we need it, and an overview of the Scrapy library. We will also discuss the difference between a crawler, a spider and a scraper.

In the next session, we will set up the Scrapy library. We will start with the Python interpreter installation, then install the PyCharm IDE for coding, and finally install the Scrapy library. Then we will do a quick check with the Scrapy shell on the command line to see if everything is working as intended. We will fetch a page from the eBay website, open the locally saved copy of that page in a browser, and print its entire HTML in the command line.

Then in the next session, we will start with Scrapy selectors. We will discuss two types of selectors: CSS selectors and XPath selectors. We will scrape an h4 element from the eBay website using the get() method, and fetch all product categories with getall().

In the following session, we will work through more examples of CSS selectors. We will use the Books and Quotes sandbox websites from Zyte for our exercises. We will get the list of book categories by navigating down to the innermost HTML tags. We will learn the syntax and structure of both CSS and XPath selectors, and after the CSS exercises we will try some examples using XPath selectors.

In the next session, we will try the same scraping expressions inside a Python program. We will create a new PyCharm project, put the Scrapy code in a spider file, and then run the file.

We will also create a dedicated Scrapy project in PyCharm, with all the supporting files created automatically. We will then create a spider that can scrape all the quotes from the Quotes website. We will try different Scrapy expressions in the command line itself, using both XPath and CSS selectors, to extract specific data from the page we scraped.

In the spider, we will iterate through each quote item using a loop. After fine-tuning the expressions, we will include them in our Scrapy spider project. The scraped result will be saved in a JSON file.

We often deal with websites where the content is spread across multiple pages. We will create a spider in our Scrapy project that automatically follows each 'next page' link and scrapes the content just as it would for a single page. The data is then saved in a JSON file.

Instead of a JSON file, Scrapy can also store the scraped result directly in a SQLite database as tables and rows. We will see how to store the data in a SQLite database using a feature called item pipelines, in which each scraped item is passed through a series of processing steps, one after another. We will then check the data inside the SQLite database to verify it.
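A pipeline that writes items to SQLite can be sketched as a plain Python class. The method names open_spider, process_item and close_spider follow Scrapy's item pipeline interface; the class name, database file name, table name and columns are assumptions chosen for illustration.

```python
import sqlite3


class SQLitePipeline:
    """Store each scraped item as a row in a SQLite table."""

    def __init__(self, db_path="quotes.db"):  # hypothetical file name
        self.db_path = db_path

    def open_spider(self, spider):
        # Called once when the spider starts: open the database and
        # create the table if it does not exist yet.
        self.connection = sqlite3.connect(self.db_path)
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS quotes (text TEXT, author TEXT)"
        )

    def process_item(self, item, spider):
        # Called for every item the spider yields.
        self.connection.execute(
            "INSERT INTO quotes (text, author) VALUES (?, ?)",
            (item.get("text"), item.get("author")),
        )
        self.connection.commit()
        # Return the item so any later pipeline step still receives it.
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.connection.close()
```

The pipeline only runs once it is listed in the project's ITEM_PIPELINES setting, for example {"myproject.pipelines.SQLitePipeline": 300}, where myproject is a placeholder project name and the number sets the order among pipelines.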

At times we need to follow links like 'read more' or 'know more' on article items, and this has to be automated so that every such link is visited and scraped. We will check whether the link exists and, if it does, use a separate callback to handle the linked page.

A recent web development trend is the infinite scrolling page, where the user can keep scrolling, just like in a social media feed. As the user nears the end of the page, the next set of posts or data is loaded. We will see how to scrape data from these kinds of infinitely scrolling pages using Scrapy. They mostly use an API to load the data dynamically, and we will see how to fetch data from a REST API link, parse it, and save it.
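Parsing one page of such an API response can be sketched with the standard library alone. The endpoint URL and the JSON field names (quotes, has_next, page, author.name) are assumptions about what a scrolling sandbox page might return; inspect the browser's network tab to confirm the real shape before relying on it.

```python
import json

# Assumed endpoint pattern for the paged API behind the scrolling page.
API_URL = "https://quotes.toscrape.com/api/quotes?page={page}"


def parse_api_page(payload_text):
    """Return (items, next_url) for one JSON page of results."""
    data = json.loads(payload_text)
    items = [
        {"text": quote["text"], "author": quote["author"]["name"]}
        for quote in data.get("quotes", [])
    ]
    # Only build a next URL when the API says more pages exist.
    if data.get("has_next"):
        next_url = API_URL.format(page=data["page"] + 1)
    else:
        next_url = None
    return items, next_url


# Hypothetical payload used only to exercise the function.
sample = json.dumps(
    {"has_next": True, "page": 1,
     "quotes": [{"text": "Hi", "author": {"name": "A"}}]}
)
items, next_url = parse_api_page(sample)
print(items)
print(next_url)
```

In a spider callback, response.json() returns the already parsed data, and the next URL would be yielded as a new scrapy.Request until the API reports no further pages.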

There are also pages that do not serve the actual HTML content from the server. They send only JavaScript code to the browser, and the browser runs that code to generate the HTML page. Most scraping programs, Scrapy included, will get hold of only the JavaScript and not the rendered HTML. So in order to parse this HTML content, we have to use a JavaScript engine to simulate the content generation, and then parse the resulting HTML. It's a bit tricky, but it's easier than you think.

A similar advanced scraping scenario is when we need to automate form submissions or send POST requests to a server. For example, if a website shows a page with information only after you log in, Scrapy can first submit the login form; once the form is submitted, the data for logged-in users becomes available and can be scraped.

Once we have a spider set up, and if it's a long-running one, the best approach is to move it to a server and run it there rather than on your personal computer. We will see how to set up a Scrapy server that you can host with any of your favourite cloud providers, so that you have a Scrapy server in the cloud.

And those are all the topics currently included in this quick course. The sample projects and the code have been uploaded and shared in a folder. I will include the link to download them in the last session or the resource section of this course. You are free to use them, no questions asked.

Also, after completing this course, you will be provided with a course completion certificate which will add value to your portfolio.

So that's all for now. See you soon in my classroom.

Happy Learning !!