A Simple HTML Document
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
- The <!DOCTYPE html> declaration defines this document to be HTML5
- The <html> element is the root element of an HTML page
- The <head> element contains meta information about the document
- The <title> element specifies a title for the document
- The <body> element contains the visible page content
- The <h1> element defines a large heading
- The <p> element defines a paragraph
What is Web Scraping?
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
For more details, refer to Wikipedia.
Why Do We Need Web Scraping?
A large organization needs to keep up with information changes occurring across a multitude of websites. An intelligent web scraper will find new websites from which it needs to scrape data. Intelligent approaches identify the changed data, extract it without following the unnecessary links within it, and navigate between websites to monitor and extract information in real time, efficiently and effectively. You can easily monitor several websites simultaneously while keeping up with the frequency of updates.
As mentioned earlier, data across websites constantly changes. How will you know if a key change has been made by an organization? Let’s say there has been a personnel change in the organization: how will you find out about that? That’s where the alerts feature in web scraping comes into play. Intelligent web scraping techniques will alert you to the data changes that have occurred on a particular website, thus helping you keep an eye on opportunities and issues.
Web Scraping using Python and BeautifulSoup
First, I will demonstrate with a very basic HTML web page. Later on, I will show you how to do web scraping on real-world web pages.
The first thing we’ll need to do to scrape a web page is to download the page. We can download pages using the Python requests library. The requests library will make a GET request to a web server, which will download the HTML contents of a given web page for us. There are several different types of requests we can make using requests, of which GET is just one.
Let’s try downloading a simple sample website, http://dataquestio.github.io/web-scraping-pages/simple.html. We’ll need to first download it using the requests.get method.
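The download step described above might look like the sketch below. The URL is the one given in the text; the helper name download_page and the timeout value are illustrative assumptions, not part of the original tutorial.

```python
import requests

# Sample page named in the text above.
URL = "http://dataquestio.github.io/web-scraping-pages/simple.html"


def download_page(url, timeout=10):
    """Send a GET request for url and return the Response object.

    The function name and the timeout value are illustrative choices.
    """
    return requests.get(url, timeout=timeout)


# Usage (needs network access):
#     page = download_page(URL)
#     print(page)
```

Wrapping the call in a function keeps the network request out of import time, which makes the snippet easier to reuse in a notebook or script.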
After running our request, we get a Response object. This object has a status_code property, which indicates if the page was downloaded successfully.
We can also print out the HTML content of the page from this Response object.
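A minimal sketch of checking the status code and reading the page body follows. The helper name page_html is hypothetical, and the use of the content and text properties is an assumption on my part, since the property name was lost from the text above; both do exist on a requests Response object.

```python
import requests


def page_html(response):
    """Return the page body as text if the download succeeded, else None.

    page_html is a hypothetical helper used only for illustration.
    """
    # A status code of 200 means the request succeeded.
    if response.status_code != 200:
        return None
    # response.content holds the raw bytes; response.text is the decoded string.
    return response.text


# Usage (needs network access):
#     page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
#     print(page.status_code)
#     print(page.content)
```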
We can use the BeautifulSoup library to parse this document and extract the text from the p tag. We first have to import the library and create an instance of the BeautifulSoup class to parse our document.
We can now print out the HTML content of the page, formatted nicely, using the prettify method on the BeautifulSoup object.
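Parsing and pretty-printing could be sketched as follows. To keep the example runnable offline, it parses an inline string instead of the downloaded page; that markup is my assumption about what the sample page looks like, and in a real run you would pass the downloaded content instead.

```python
from bs4 import BeautifulSoup

# Stand-in for the downloaded page so the example runs offline.
# This markup is an assumption about what the sample page contains.
html = """<!DOCTYPE html>
<html>
<head><title>A simple example page</title></head>
<body><p>Here is some simple content for this page.</p></body>
</html>"""

# Parse the document with Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")

# prettify() returns the markup as an indented string.
print(soup.prettify())
```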
As all the tags are nested, we can move through the structure one level at a time. We can first select all the elements at the top level of the page using the children property of soup. Note that children returns an iterator rather than a list, so we need to call the list function on it.
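Here is a small sketch of listing the top-level elements. The inline markup is an assumed stand-in for the sample page, used so the snippet runs without network access.

```python
from bs4 import BeautifulSoup

# Assumed stand-in for the sample page so the example runs offline.
html = """<!DOCTYPE html>
<html>
<head><title>A simple example page</title></head>
<body><p>Here is some simple content for this page.</p></body>
</html>"""

soup = BeautifulSoup(html, "html.parser")

# children is lazy, so wrap it in list() to see the top-level items.
top_level = list(soup.children)

# Show what kind of object each top-level item is.
print([type(item).__name__ for item in top_level])
```

Not every item at this level is a tag: the doctype and the whitespace between elements show up as their own objects, which is why it helps to check the types before indexing into the list.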
As you can see above, there are two tags here, head and body. We want to extract the text inside the p tag, so we’ll dive into the body (refer to the output just above, under html.children).
Now, we can get the p tag by finding the children of the body tag. Once we have the tag, we can use the get_text method to extract all of the text inside the tag.
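The walk from the top of the document down to the p tag might look like this sketch. The helper child_tags is hypothetical; it filters out whitespace strings so we do not have to rely on hard-coded list positions. The inline markup is again an assumed stand-in for the sample page.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumed stand-in for the sample page so the example runs offline.
html = """<!DOCTYPE html>
<html>
<head><title>A simple example page</title></head>
<body><p>Here is some simple content for this page.</p></body>
</html>"""

soup = BeautifulSoup(html, "html.parser")


def child_tags(node):
    """Return only the Tag children of node, skipping strings and the doctype."""
    return [child for child in node.children if isinstance(child, Tag)]


# The only tag at the top level is html.
html_tag = child_tags(soup)[0]
tag_names = [tag.name for tag in child_tags(html_tag)]
print(tag_names)

# Dive into body, then into the p tag, and read its text.
body = child_tags(html_tag)[1]
p = child_tags(body)[0]
print(p.get_text())
```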
Finding all instances of a tag at once
What we did above was useful for figuring out how to navigate a page, but it took a lot of commands to do something fairly simple. If we want to extract a single tag, we can instead use the find_all method, which will find all the instances of a tag on a page.
If you instead only want to find the first instance of a tag, you can use the find method, which will return a single object rather than a list.
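A short sketch comparing the two methods, again on an assumed inline stand-in for the sample page:

```python
from bs4 import BeautifulSoup

# Assumed stand-in for the sample page so the example runs offline.
html = """<!DOCTYPE html>
<html>
<head><title>A simple example page</title></head>
<body><p>Here is some simple content for this page.</p></body>
</html>"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns a list of every matching tag, so we index into it.
paragraphs = soup.find_all("p")
print(paragraphs[0].get_text())

# find returns only the first match (or None if there is no match).
first_paragraph = soup.find("p")
print(first_paragraph.get_text())
```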
If you want to fork this notebook, go to Web Scraping Tutorial.
Now, I’ll show you how to perform web scraping on a real-world web page using Python and BeautifulSoup.
We now know enough to proceed with extracting information about the local weather from the National Weather Service website. The first step is to find the page we want to scrape. We’ll extract weather information about downtown San Francisco from this page.
Once you open this page, use CTRL+SHIFT+I to inspect the element. Here we are interested in this particular column (San Francisco CA).
So, by right-clicking on the page near where it says “Extended Forecast”, then clicking “Inspect”, we’ll open up the tag that contains the text “Extended Forecast” in the elements panel.
We can then scroll up in the elements panel to find the “outermost” element that contains all of the text that corresponds to the extended forecasts. In this case, it’s a div tag with the id seven-day-forecast.
If you explore the div, you’ll discover that each forecast item (like “Tonight”, “Thursday”, and “Thursday Night”) is contained in a div with a shared class.
We now know enough to download the page and start parsing it. The steps are:
- Download the web page containing the forecast.
- Create a BeautifulSoup object to parse the page.
- Find the div with the id seven-day-forecast, and assign it to seven_day.
- Inside seven_day, find each individual forecast item.
- Extract and print the first forecast item.
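The steps above can be sketched as follows. Since the forecast page's URL is not given in the text, this example parses a made-up inline stand-in instead of downloading anything. The id seven-day-forecast and the class period-name come from the text; the class names tombstone-container, short-desc and temp, and all the values, are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the forecast page so the sketch runs offline.
# In a real run this would be the content downloaded with requests.get.
# Class names other than "period-name", and all values, are assumptions.
FORECAST_HTML = """
<div id="seven-day-forecast">
  <div class="tombstone-container">
    <p class="period-name">Tonight</p>
    <p class="short-desc">Mostly Clear</p>
    <p class="temp">Low: 49 F</p>
  </div>
  <div class="tombstone-container">
    <p class="period-name">Thursday</p>
    <p class="short-desc">Sunny</p>
    <p class="temp">High: 63 F</p>
  </div>
</div>
"""

# Create a BeautifulSoup object to parse the page.
soup = BeautifulSoup(FORECAST_HTML, "html.parser")

# Find the div with the id seven-day-forecast.
seven_day = soup.find(id="seven-day-forecast")

# Inside seven_day, find each individual forecast item.
forecast_items = seven_day.find_all(class_="tombstone-container")

# Extract and print the first forecast item.
tonight = forecast_items[0]
print(tonight.prettify())
```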
As you can see, inside the forecast item tonight is all the information we want. There are 4 pieces of information we can extract:
- The name of the forecast item, such as “Tonight”.
- The description of the conditions.
- A short description of the conditions.
- The temperature low.
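Pulling those four pieces out of one forecast item might look like the sketch below. The markup is a made-up stand-in: only the id seven-day-forecast and the class period-name come from the text, while the other class names, the img title attribute as the home of the full description, and all values are assumptions.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for one forecast item; apart from "seven-day-forecast"
# and "period-name", the structure and values are assumptions.
FORECAST_HTML = """
<div id="seven-day-forecast">
  <div class="tombstone-container">
    <p class="period-name">Tonight</p>
    <p><img src="night.png" title="Tonight: Mostly clear, with a low around 49."></p>
    <p class="short-desc">Mostly Clear</p>
    <p class="temp">Low: 49 F</p>
  </div>
</div>
"""

soup = BeautifulSoup(FORECAST_HTML, "html.parser")
tonight = soup.find(id="seven-day-forecast").find(class_="tombstone-container")

# Name of the forecast item.
period = tonight.find(class_="period-name").get_text()

# Full description, read from an attribute with dictionary-style access.
description = tonight.find("img")["title"]

# Short description and temperature.
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()

print(period, short_desc, temp, description, sep="\n")
```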
Now that we know how to extract each individual piece of information, we can combine our knowledge with CSS selectors and list comprehensions to extract everything at once.
To do this, we:
- Select all items with the class period-name inside the forecast items in seven_day.
- Use a list comprehension to call the get_text method on each selected item.
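A sketch of the CSS selector plus list comprehension approach is shown here. The markup is a made-up stand-in, and the class name tombstone-container is an assumption; seven-day-forecast and period-name come from the text.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the forecast page so the sketch runs offline.
# "tombstone-container" is an assumed class name.
FORECAST_HTML = """
<div id="seven-day-forecast">
  <div class="tombstone-container"><p class="period-name">Tonight</p></div>
  <div class="tombstone-container"><p class="period-name">Thursday</p></div>
  <div class="tombstone-container"><p class="period-name">Thursday Night</p></div>
</div>
"""

soup = BeautifulSoup(FORECAST_HTML, "html.parser")
seven_day = soup.find(id="seven-day-forecast")

# The CSS selector matches period-name elements nested inside forecast items.
period_tags = seven_day.select(".tombstone-container .period-name")

# The list comprehension calls get_text on each matched tag.
periods = [tag.get_text() for tag in period_tags]
print(periods)
```

The same pattern works for the other pieces of information: change the second class in the selector and keep the list comprehension as it is.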
Combining our data into a Pandas DataFrame
We can use a regular expression and the Series.str.extract method to pull out the numeric temperature values.
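Building the DataFrame and extracting the numbers could be sketched like this. The lists hold made-up sample values standing in for the scraped data, and the column names are illustrative assumptions.

```python
import pandas as pd

# Made-up sample values standing in for the scraped lists.
weather = pd.DataFrame({
    "period": ["Tonight", "Thursday", "Thursday Night"],
    "short_desc": ["Mostly Clear", "Sunny", "Mostly Clear"],
    "temp": ["Low: 49 F", "High: 63 F", "Low: 50 F"],
})

# Capture the run of digits in each temperature string.
# expand=False returns a Series because there is only one capture group.
temp_nums = weather["temp"].str.extract(r"(\d+)", expand=False)

# Store the values as integers so we can do arithmetic on them.
weather["temp_num"] = temp_nums.astype(int)

print(weather)
print(weather["temp_num"].mean())
```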
I hope you now have a good understanding of how to scrape data from web pages. In the coming weeks, I’ll do web scraping on:
- News articles
- Sports scores
- Weather forecasts
- Stock prices
- Online retailer prices, etc.
Hope you like this article!! Don’t forget to like this blog and share with others.