Python Web Scraping Cookbook

How to do it...

This recipe, and most of the others in this chapter, is presented interactively in IPython, but all of the code for each recipe is also available in a script file. The code for this recipe is in 02/01_parsing_html_wtih_bs.py. You can type the following in, or copy and paste it from the script file.

Now let's walk through parsing HTML with Beautiful Soup. We start by fetching the page with requests.get and passing its HTML to the BeautifulSoup constructor. The resulting object is stored in a variable named soup.

In [1]: import requests
...: from bs4 import BeautifulSoup
...: html = requests.get("http://localhost:8080/planets.html").text
...: soup = BeautifulSoup(html, "lxml")
...:
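The session above assumes the book's sample page is being served at localhost:8080. As a minimal sketch for experimenting without that server, a similar object can be built from an HTML string. The markup below is a hypothetical, trimmed-down stand-in for planets.html, and the sketch uses Python's built-in "html.parser" rather than "lxml" so that nothing beyond bs4 is required:

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for planets.html (not the real file).
SAMPLE_HTML = """<html><head></head><body>
<div id="planets">
<h1>Planetary data</h1>
<table border="1" id="planetsTable">
<tr id="planetHeader"><th>Name</th><th>Mass (10^24kg)</th></tr>
<tr class="planet" id="planet1" name="Mercury"><td>Mercury</td><td>0.330</td></tr>
</table>
</div>
</body></html>"""

# "html.parser" ships with Python; the recipe itself uses "lxml".
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

print(soup.h1.text)
print(soup.table["id"])
```

The two parsers can differ in how they repair malformed markup, so output from this sketch may not match the recipe's output character for character.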

The HTML in the soup object can be retrieved by converting it to a string (most BeautifulSoup objects have this characteristic). The following shows the first 1000 characters of the HTML in the document:

In [2]: str(soup)[:1000]
Out[2]: '<html>\n<head>\n</head>\n<body>\n<div id="planets">\n<h1>Planetary data</h1>\n<div id="content">Here are some interesting facts about the planets in our solar system</div>\n<p></p>\n<table border="1" id="planetsTable">\n<tr id="planetHeader">\n<th>\n</th>\n<th>\r\n Name\r\n </th>\n<th>\r\n Mass (10^24kg)\r\n </th>\n<th>\r\n Diameter (km)\r\n </th>\n<th>\r\n How it got its Name\r\n </th>\n<th>\r\n More Info\r\n </th>\n</tr>\n<tr class="planet" id="planet1" name="Mercury">\n<td>\n<img src="img/mercury-150x150.png"/>\n</td>\n<td>\r\n Mercury\r\n </td>\n<td>\r\n 0.330\r\n </td>\n<td>\r\n 4879\r\n </td>\n<td>Named Mercurius by the Romans because it appears to move so swiftly.</td>\n<td>\n<a href="https://en.wikipedia.org/wiki/Mercury_(planet)">Wikipedia</a>\n</td>\n</tr>\n<tr class="p'

We can navigate the elements in the DOM using properties of soup. soup represents the overall document and we can drill into the document by chaining the tag names. The following navigates to the <table> containing the data:

In [3]: str(soup.html.body.div.table)[:200]
Out[3]: '<table border="1" id="planetsTable">\n<tr id="planetHeader">\n<th>\n</th>\n<th>\r\n Name\r\n </th>\n<th>\r\n Mass (10^24kg)\r\n </th>\n<th>\r\n '
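One thing worth knowing about chained tag names is what happens when a tag is absent. The sketch below, which uses a hypothetical cut-down version of the page and the built-in "html.parser", suggests that a missing tag yields None, so a longer chain can fail with an AttributeError unless it is guarded:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for planets.html, reduced to the structure we navigate.
html = (
    '<html><head></head><body><div id="planets">'
    '<h1>Planetary data</h1>'
    '<table border="1" id="planetsTable">'
    '<tr id="planetHeader"><th>Name</th></tr>'
    '</table>'
    '</div></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# Each attribute access returns the first matching tag below the current node.
table = soup.html.body.div.table
print(table["id"])

# A tag name that does not occur in the document yields None.
missing = soup.html.body.div.form
print(missing)

# Guard before chaining further; accessing a tag name on None raises AttributeError.
first_row_id = table.tr["id"] if table is not None else None
print(first_row_id)
```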

The following retrieves the first child <tr> of the table:

In [6]: soup.html.body.div.table.tr
Out[6]: <tr id="planetHeader">
<th>
</th>
<th>
Name
</th>
<th>
Mass (10^24kg)
</th>
<th>
Diameter (km)
</th>
<th>
How it got its Name
</th>
<th>
More Info
</th>
</tr>

Note that this notation retrieves only the first element with that tag name. Finding more requires iterating over all the children, which we will do next, or using the find methods (covered in the next recipe).
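To make the "first one only" behaviour concrete, here is a small sketch on a hypothetical three-row table. The markup is written without whitespace between rows so that every child of the table is a tag:

```python
from bs4 import BeautifulSoup

# Hypothetical table with three rows and no whitespace between them.
html = (
    '<table id="planetsTable">'
    '<tr id="planetHeader"><th>Name</th></tr>'
    '<tr class="planet" id="planet1" name="Mercury"><td>Mercury</td></tr>'
    '<tr class="planet" id="planet2" name="Venus"><td>Venus</td></tr>'
    '</table>'
)
soup = BeautifulSoup(html, "html.parser")
table = soup.table

# Dotted access stops at the first <tr>, even though there are three.
first_row = table.tr
print(first_row["id"])

# Iterating the children reaches every row.
row_ids = [child["id"] for child in table.children]
print(row_ids)
```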

Each node has both children and descendants. Descendants are all the nodes underneath a given node (even at levels deeper than the immediate children), while children are only the first-level descendants. The following retrieves the children of the table, which are returned as a list_iterator object rather than as a list:

In [4]: soup.html.body.div.table.children
Out[4]: <list_iterator at 0x10eb11cc0>
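The difference between children and descendants is easiest to see by counting them. The following is a minimal sketch on a hypothetical two-row table, parsed with the built-in "html.parser"; text nodes count as nodes too:

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table with no whitespace between tags.
html = "<table><tr><td>Mercury</td></tr><tr><td>Venus</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")
table = soup.table

# Children: only the two <tr> elements directly under <table>.
children = list(table.children)

# Descendants: each <tr>, its <td>, and the text inside the <td>.
descendants = list(table.descendants)

print(len(children))     # 2
print(len(descendants))  # 6
```

Both properties are iterators, so wrapping them in list() is a convenient way to inspect or count them.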

We can examine each child element in the iterator using a for loop or a list comprehension. The following uses a list comprehension to get all the children of the table and return the first few characters of their HTML as a list:

In [5]: [str(c)[:45] for c in soup.html.body.div.table.children]
Out[5]:
['\n',
'<tr id="planetHeader">\n<th>\n</th>\n<th>\r\n ',
'\n',
'<tr class="planet" id="planet1" name="Mercury',
'\n',
'<tr class="planet" id="planet2" name="Venus">',
'\n',
'<tr class="planet" id="planet3" name="Earth">',
'\n',
'<tr class="planet" id="planet4" name="Mars">\n',
'\n',
'<tr class="planet" id="planet5" name="Jupiter',
'\n',
'<tr class="planet" id="planet6" name="Saturn"',
'\n',
'<tr class="planet" id="planet7" name="Uranus"',
'\n',
'<tr class="planet" id="planet8" name="Neptune',
'\n',
'<tr class="planet" id="planet9" name="Pluto">',
'\n']
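The '\n' entries in the output above are text nodes for the line breaks between rows. One possible way to skip them, sketched here on a hypothetical shortened table, is to keep only the children that are Tag instances:

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical shortened table; the newlines between rows become text nodes.
html = """<table id="planetsTable">
<tr id="planetHeader"><th>Name</th></tr>
<tr class="planet" id="planet1" name="Mercury"><td>Mercury</td></tr>
<tr class="planet" id="planet2" name="Venus"><td>Venus</td></tr>
</table>"""
soup = BeautifulSoup(html, "html.parser")

# Keep element nodes only, dropping the whitespace strings.
rows = [child for child in soup.table.children if isinstance(child, Tag)]
print([row["id"] for row in rows])

# The header row has no class attribute, so supply a default when reading it.
planet_names = [row["name"] for row in rows if "planet" in row.get("class", [])]
print(planet_names)
```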

Last but not least, the parent of a node can be found using the .parent property:

In [7]: str(soup.html.body.div.table.tr.parent)[:200]
Out[7]: '<table border="1" id="planetsTable">\n<tr id="planetHeader">\n<th>\n</th>\n<th>\r\n Name\r\n </th>\n<th>\r\n Mass (10^24kg)\r\n </th>\n<th>\r\n '
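Besides .parent, Beautiful Soup also offers a .parents iterator that walks all the way up to the document root. The sketch below uses a hypothetical minimal page to show both:

```python
from bs4 import BeautifulSoup

# Hypothetical minimal page with the same nesting as the recipe's table.
html = (
    '<html><body><div id="planets">'
    '<table id="planetsTable"><tr id="planetHeader"><th>Name</th></tr></table>'
    '</div></body></html>'
)
soup = BeautifulSoup(html, "html.parser")
row = soup.table.tr

# .parent is the single node directly above the row.
print(row.parent["id"])

# .parents yields every ancestor, nearest first, ending at the soup object itself.
ancestor_names = [ancestor.name for ancestor in row.parents]
print(ancestor_names)
```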