Web Scraping with Python and BeautifulSoup

A Simple HTML Document

Example

<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>

<h1>My First Heading</h1>
<p>My first paragraph.</p>

</body>
</html>

Example Explained

  • The <!DOCTYPE html> declaration defines this document to be HTML5
  • The <html> element is the root element of an HTML page
  • The <head> element contains meta information about the document
  • The <title> element specifies a title for the document
  • The <body> element contains the visible page content
  • The <h1> element defines a large heading
  • The <p> element defines a paragraph

For more details, refer to the HTML Tutorials.

What is Web Scraping?

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites.[1] Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

For more details, refer to Wikipedia.

Why Do We Need Web Scraping?

A large organization needs to keep itself updated with the information changes occurring across a multitude of websites. An intelligent web scraper will find new websites from which it needs to scrape data. Intelligent approaches identify the changed data, extract it without pulling in the unnecessary links embedded within, and navigate between websites to monitor and extract information in real time, efficiently and effectively. You can easily monitor several websites simultaneously while keeping up with their frequency of updates.

You will observe, as mentioned earlier, that data across websites changes constantly. How will you know if a key change has been made by an organization? Let's say there has been a personnel change in the organization; how will you find out about it? That's where the alerts feature of web scraping comes into play. Intelligent web scraping techniques will alert you to the data changes that have occurred on a particular website, helping you keep an eye on opportunities and issues.

Web Scraping using Python and BeautifulSoup

First, I will demonstrate web scraping on a very basic HTML web page, and later on show you how to scrape real-world web pages.

The first thing we’ll need to do to scrape a web page is to download the page. We can download pages using the Python requests library. The requests library will make a GET request to a web server, which will download the HTML contents of a given web page for us. There are several different types of requests we can make using requests, of which GET is just one.

Let’s try downloading a simple sample website, http://dataquestio.github.io/web-scraping-pages/simple.html. We’ll need to first download it using the requests.get method.

After running our request, we get a Response object. This object has a status_code property, which indicates whether the page was downloaded successfully.

We can print out the HTML content of the page using the content property:
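A minimal sketch of these three steps (downloading the page, checking the status code, and printing the content) might look like this, assuming the requests library is installed:

```python
import requests

# Make a GET request to download the sample page
page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")

# A status code of 200 means the page was downloaded successfully
print(page.status_code)

# The raw HTML of the page, as bytes
print(page.content)
```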

BeautifulSoup

We can use the BeautifulSoup library to parse this document, and extract the text from the p tag. We first have to import the library, and create an instance of the BeautifulSoup class to parse our document:

We can now print out the HTML content of the page, formatted nicely, using the prettify method on the BeautifulSoup object:
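A sketch of the parsing step; the inline HTML string below is a stand-in for the page.content we downloaded above:

```python
from bs4 import BeautifulSoup

# Stand-in for page.content from the requests step
html = "<html><head><title>A simple example page</title></head><body><p>Here is some simple content for this page.</p></body></html>"

# Create a BeautifulSoup instance to parse the document
soup = BeautifulSoup(html, "html.parser")

# prettify re-indents the parse tree for readability
print(soup.prettify())
```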

As all the tags are nested, we can move through the structure one level at a time. We can first select all the elements at the top level of the page using the children property of soup. Note that children returns a generator, so we need to call the list function on it.

As you can see above, there are two tags here: head and body. We want to extract the text inside the p tag, so we'll dive into the body (refer to the output just above, under html.children).

Now, we can get the p tag by finding the children of the body tag.

Once we've isolated the tag, we can use the get_text method to extract all of the text inside it.
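Putting these navigation steps together in one sketch, again with an inline HTML string standing in for the downloaded page:

```python
from bs4 import BeautifulSoup

# Stand-in for the downloaded simple.html page
html = "<html><head><title>A simple example page</title></head><body><p>Here is some simple content for this page.</p></body></html>"

soup = BeautifulSoup(html, "html.parser")

# children is a generator, so list() materializes it
html_tag = list(soup.children)[0]   # the html element
body = list(html_tag.children)[1]   # head is first, body is second
p = list(body.children)[0]          # the p tag inside body

# get_text extracts all the text inside the tag
print(p.get_text())  # Here is some simple content for this page.
```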

Finding all instances of a tag at once

What we did above was useful for figuring out how to navigate a page, but it took a lot of commands to do something fairly simple. If we want to extract a single tag, we can instead use the find_all method, which will find all the instances of a tag on a page.

If you instead only want to find the first instance of a tag, you can use the find method, which will return a single BeautifulSoup object.
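Both methods in a short sketch, using an inline document with two paragraphs:

```python
from bs4 import BeautifulSoup

html = "<html><body><p>First paragraph.</p><p>Second paragraph.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# find_all returns a list of every matching tag on the page
print(soup.find_all("p"))

# Indexing into that list lets us call get_text on one match
print(soup.find_all("p")[0].get_text())  # First paragraph.

# find returns just the first match as a single BeautifulSoup object
print(soup.find("p").get_text())  # First paragraph.
```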

If you want to fork this notebook, go to Web Scraping Tutorial.

Now, I'll show you how to perform web scraping using Python 3 and the BeautifulSoup library. We'll be scraping weather forecasts from the National Weather Service, and then analyzing them using the Pandas library.

We now know enough to proceed with extracting information about the local weather from the National Weather Service website. The first step is to find the page we want to scrape. We’ll extract weather information about downtown San Francisco from this page.

Once you open this page, use CTRL+SHIFT+I to inspect the elements; here, we are interested in this particular location (San Francisco, CA).

So, by right-clicking on the page near where it says “Extended Forecast”, then clicking “Inspect”, we’ll open up the tag that contains the text “Extended Forecast” in the elements panel.

We can then scroll up in the elements panel to find the “outermost” element that contains all of the text that corresponds to the extended forecasts. In this case, it’s a div tag with the id seven-day-forecast.

If you explore the div, you'll discover that each forecast item (like "Tonight", "Thursday", and "Thursday Night") is contained in a div with the class tombstone-container.

We now know enough to download the page and start parsing it. In the below code, we:

  • Download the web page containing the forecast.
  • Create a BeautifulSoup class to parse the page.
  • Find the div with id seven-day-forecast, and assign it to seven_day.
  • Inside seven_day, find each individual forecast item.
  • Extract and print the first forecast item.
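The steps above can be sketched as follows; the URL and the CSS classes are assumptions based on the NWS site at the time of writing and may have changed:

```python
import requests
from bs4 import BeautifulSoup

# Forecast page for downtown San Francisco (URL assumed; check the NWS site)
url = "https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168"

# Download the web page containing the forecast
page = requests.get(url)

# Create a BeautifulSoup object to parse the page
soup = BeautifulSoup(page.content, "html.parser")

# Find the div with id seven-day-forecast, and assign it to seven_day
seven_day = soup.find(id="seven-day-forecast")

# Inside seven_day, find each individual forecast item
forecast_items = seven_day.find_all(class_="tombstone-container")

# Extract and print the first forecast item
tonight = forecast_items[0]
print(tonight.prettify())
```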

As you can see, the forecast item tonight contains all the information we want. There are four pieces of information we can extract:

  • The name of the forecast item (in this case, Today).
  • The description of the conditions (stored in the title property of img).
  • A short description of the conditions (in this case, Sunny).
  • The temperature low (in this case, 69°F).
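A sketch of extracting those four pieces; the HTML snippet is a hypothetical sample mimicking one NWS tombstone-container item, not the live page:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking a single forecast item
html = """
<div class="tombstone-container">
 <p class="period-name">Today</p>
 <p><img class="forecast-icon" src="icon.png"
        title="Today: Sunny, with a low around 69." alt=""/></p>
 <p class="short-desc">Sunny</p>
 <p class="temp temp-low">Low: 69 °F</p>
</div>
"""
tonight = BeautifulSoup(html, "html.parser")

period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()
desc = tonight.find("img")["title"]  # full description lives in the title attribute

print(period)      # Today
print(short_desc)  # Sunny
print(temp)        # Low: 69 °F
print(desc)
```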

Now that we know how to extract each individual piece of information, we can combine our knowledge with CSS selectors and list comprehensions to extract everything at once.

In the below code, we:

  • Select all items with the class period-name inside an item with the class tombstone-container in seven_day.
  • Use a list comprehension to call the get_text method on each BeautifulSoup object.
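Those two steps can be sketched like this, again over a hypothetical snippet standing in for the parsed seven-day forecast:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking the seven-day-forecast structure
html = """
<div id="seven-day-forecast">
 <div class="tombstone-container"><p class="period-name">Tonight</p></div>
 <div class="tombstone-container"><p class="period-name">Thursday</p></div>
 <div class="tombstone-container"><p class="period-name">Thursday Night</p></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
seven_day = soup.find(id="seven-day-forecast")

# CSS selector: period-name elements inside tombstone-container items
period_tags = seven_day.select(".tombstone-container .period-name")

# List comprehension calling get_text on each BeautifulSoup object
periods = [pt.get_text() for pt in period_tags]
print(periods)  # ['Tonight', 'Thursday', 'Thursday Night']
```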

Combining our data into a Pandas DataFrame

We can use a regular expression and the Series.str.extract method to pull out the numeric temperature values.
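A sketch of both steps; the period and temperature values are hypothetical stand-ins for what the scraper would return:

```python
import pandas as pd

# Hypothetical values as scraped from the forecast page
periods = ["Tonight", "Thursday", "Thursday Night"]
temps = ["Low: 49 °F", "High: 63 °F", "Low: 50 °F"]

# Combine the scraped columns into a DataFrame
weather = pd.DataFrame({"period": periods, "temp": temps})

# Series.str.extract pulls out the first run of digits from each string
weather["temp_num"] = weather["temp"].str.extract(r"(\d+)", expand=False).astype(int)

print(weather)
```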

If you want to fork this notebook, go to Web Scraping and GitHub.

I hope you now have a good understanding of how to scrape data from web pages. In the coming weeks, I'll do web scraping on:

  • News articles
  • Sports scores
  • Weather forecasts
  • Stock prices
  • Online retailer prices, etc.

I hope you liked this article! Don't forget to like this blog and share it with others.

Thank You

Go Subscribe

THEMENYOUWANTTOBE

Posted by News Monkey