Beginner-friendly scraping with BeautifulSoup

Tabular scraping of COVID-19 data, using Worldometer as the source

Deepika Vijay
Analytics Vidhya


Web scraping is among the most interesting aspects of data science. Being able to conduct your own analysis and prediction on any website’s data makes the field so much more exciting! The scope for data collection from websites through web scraping is practically limitless. However, it isn’t as simple as copy-pasting into a Word document. Thanks to the geniuses in the Python community and the wealth of libraries out there like Scrapy, Selenium and BeautifulSoup, the process isn’t too tedious either. In this article, we are going to explore how to use BeautifulSoup to scrape COVID-19 data from Worldometer.

Source: FoodieFactor, via Pixabay (Pixabay license)

For the past year, COVID-19 has captured the world’s attention. There are still so many unanswered questions about this virus that has caused havoc around the world, so almost any kind of analysis can bring about a better understanding and useful insights for ourselves and potentially for society. Hence I’m choosing Worldometer for this guide.

So let’s dig in!

Let’s start with importing the necessary libraries.

import pandas as pd              # builds the final DataFrame
import numpy as np
from bs4 import BeautifulSoup    # parses the HTML
import requests                  # fetches the raw page

There are two alternatives for fetching the table, one of which is a lot easier and known to work for some people, though it didn’t for me. The read_html() method fetches tables directly from a website without much added work; it works easily for, say, Wikipedia tables. It’s worth trying if your goal is to move on to analysis instead of spending your energy on learning how to scrape. Scroll down to Alternative 2 if you want to skip ahead and get a jumpstart on fetching the data. Note that the goal of this article is to explore scraping the good old-fashioned way with BeautifulSoup, which I’m going to refer to as “Alternative 1”.

Let’s first save our website into a variable say, url.

url = 'https://www.worldometers.info/coronavirus/'

We are going to fetch the HTML content politely from Worldometer using requests, the library used to get raw data from websites.

r = requests.get(url)
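If the plain request comes back with a non-200 status (some sites block the default requests user agent), a slightly more defensive version can help. The User-Agent string below is only an example:

# Optional: identify yourself and fail loudly on a bad response.
# The User-Agent value here is just an illustration, not a requirement.
headers = {'User-Agent': 'Mozilla/5.0 (compatible; learning-scraper)'}
r = requests.get(url, headers=headers)
r.raise_for_status()   # raises an HTTPError if the status is 4xx/5xx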

Alternative 1

Here’s the soup! Using the following code, we are going to parse the content from our response saved in the variable “r”. BeautifulSoup, in a nutshell, helps separate useful information such as links, text and titles from the HTML tags, which can then be used for further analysis. lxml is the parser used to interpret the HTML.

soup = BeautifulSoup(r.text, 'lxml')

If you try to call the variable soup, you’ll see an overwhelming load of information. Our goal is to get only the relevant tabular information out of it by executing the following code.

lst_crucial = [str(td) for td in soup.find_all('table')[0].find_all('td')]

Now, what have we here? The above code narrows down our soup to only the information we need for the data frame by using a slightly complex list comprehension. You can execute soup.find_all('table'), then soup.find_all('table')[0], and then soup.find_all('table')[0].find_all('td') separately to get an understanding of exactly what we did here (there’s also a short sketch after the breakdown below).

soup.find_all('table'): gives us all the tables on the webpage. We are only concerned with the first one, which brings us to

soup.find_all('table')[0]: now that we have all the information relevant to the table we need, we can narrow it down further to just the data. We don’t need information on how the table is styled, so we do as follows

soup.find_all('table')[0].find_all('td'): “td” stands for table data, and that’s the only information we need from the entire soup. Notice that this eliminates the “tr” and “th” tags?
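If you’d like to inspect each of those steps yourself, here’s a quick sketch (the print statement is just for exploration):

tables = soup.find_all('table')      # every <table> element on the page
main_table = tables[0]               # the first one is the countries table
cells = main_table.find_all('td')    # just the <td> data cells
print(len(tables), len(cells))       # how many tables and cells we found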

The rest of the code saves each element of the relevant data, in string form, into a list that I called lst_crucial (because this list forms the basis for the rest of our code). This can of course also be done with a for loop, but I find the list comprehension more efficient and elegant.
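For comparison, here is the same thing written as a plain for loop:

# Equivalent to the list comprehension above
lst_crucial = []
for td in soup.find_all('table')[0].find_all('td'):
    lst_crucial.append(str(td))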

Alright! We have what we need; let’s convert the list into a table! To do this, let’s start by importing re and creating an empty DataFrame. re is Python’s regular-expression library, which lets us pull the necessary elements out of the strings.

import re               # regular expressions
df = pd.DataFrame()     # empty frame to fill column by column

Now let’s start creating columns and adding relevant data from our list into each of them.

df['Countries'] = [re.findall('>(.*?)<', lst_crucial[i-13:i+1][0])[1] for i, v in enumerate(lst_crucial) if 'world-population' in v]

Indeed, another list comprehension, and we’ll have 12 more of them, one per column, but they’re all very similar to this one. What does the above code do? If you look at lst_crucial, you’ll notice that the data appears row by row, and each row ends with the country’s total population in a cell whose href contains “world-population”. Our goal is to grab that item plus the 13 items before it (14 cells per row), which is exactly what the range i-13:i+1 does (i is the index of the last cell of the row, the one containing “world-population”).

For the Countries column we don’t want every element of the row, just the country names. With re.findall('>(.*?)<', lst_crucial[i-13:i+1][0])[1], we extract everything that sits between a “>” and a “<”, because that’s where the text content lives. We do this only on the first cell of each row, hence the [0], and out of the three matches the regex returns, we want the second one, the country name, hence the [1].
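To make that concrete, here is the regex applied to a single made-up cell (the exact HTML on Worldometer will differ, but the shape is the same):

# Illustrative only: a simplified version of a country cell
sample = '<td style="font-weight: bold;"><a href="/country/us/">USA</a></td>'
print(re.findall('>(.*?)<', sample))
# ['', 'USA', ''] -> the country name sits at index [1]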

For all other columns, the code follows a similar structure. The following code is for the next column.

df['Total_cases'] = [''.join(ch for ch in lst_crucial[i-13:i+1][1] if ch.isdigit()) for i, v in enumerate(lst_crucial) if 'world-population' in v]
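Before unpacking that, here’s the digit extraction on one illustrative cell (the HTML is made up for demonstration):

# Illustrative only: pulling digits out of one cell's HTML string
cell = '<td>1,234,567</td>'
print(''.join(ch for ch in cell if ch.isdigit()))   # '1234567'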

Back to the column code: we take the numeric characters from the second of the 14 items that make up a row, hence the [1] (recall we did the same for the Countries column but used [0], as the name sits in the first cell). The .join call then glues the digits back together into one string to form the required number. We continue the same way for all the other columns, as they are all numeric. Here’s the final code:

Final code for all the columns (embed by author)
The resulting DataFrame (GIF by author)
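The embed isn’t reproduced here, but the remaining columns all follow the same pattern. As a rough sketch only: the column names and cell positions below are my assumptions about the Worldometer layout, not the exact code from the embed:

# Sketch only: names and positions are assumptions, adjust to the live table
numeric_columns = {'New_cases': 2, 'Total_deaths': 3, 'New_deaths': 4,
                   'Total_recovered': 5, 'Active_cases': 7}
for name, pos in numeric_columns.items():
    df[name] = [''.join(ch for ch in lst_crucial[i-13:i+1][pos] if ch.isdigit())
                for i, v in enumerate(lst_crucial) if 'world-population' in v]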

Et voilà! We have today’s complete DataFrame from Worldometer, ready for further analysis.

Alternative 2

Instead of the above, we can try this shortcut, which may or may not work for you. We are going to call the pandas read_html() method on the content of the URL, which we previously saved in “r”. Let’s save the result as “dfs”, since read_html() always returns a list of data frames. Finally, calling dfs[0] gives us the first table, which is the relevant one for us.

r = requests.get(url)
dfs = pd.read_html(r.text, attrs={'id': 'main_table_countries_today'})  # keep only the main table
dfs[0]
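Since the attrs argument filters on a unique id, dfs should contain exactly one table; a quick sanity check:

# Sanity check: the id filter should match exactly one table
print(len(dfs))        # expected: 1
print(dfs[0].head())   # first rows of the scraped table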

The following is a snippet of the output.

Photo by Author

That’s all for now. Stay tuned for another article on the analysis of COVID-19 data!
