Using BeautifulSoup with Python

Posted by in Python

In this post I will show you how to extract text (or data) from a website using Python. I will start with the installation of BeautifulSoup, and then I will solve a small exercise.

Installing BeautifulSoup

If you need to work with HTML code, BeautifulSoup is one of the most interesting options, so let’s install it.

[Screenshot: installing BeautifulSoup]

As you can see, I couldn’t install it using pip, but I finally got it installed by using easy_install. Now, let’s check that it works!

[Screenshot: BeautifulSoup]
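The same check can be done from a script. This is only a sketch and it assumes the current bs4 package (installed as beautifulsoup4), which may not be the release I installed above with easy_install.

```python
# Assumes the bs4 package (pip install beautifulsoup4); the install
# described in this post may have been an older BeautifulSoup release.
import bs4
from bs4 import BeautifulSoup

print(bs4.__version__)

# Parsing a tiny document is a simple smoke test.
soap = BeautifulSoup("<p>It works</p>", "html.parser")
print(soap.p.string)
```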

NOTE: I am not going to write a tutorial about BeautifulSoup; I am just going to show you how I use it to solve concrete problems I have!

Trying BeautifulSoup

To work with BeautifulSoup, I am going to use the getHTML function that I explained here (getting the HTML code from Python). For this example, I am going to use another example that I made when I explained Twitter Bootstrap for lists and tables.
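The helper itself lives in that earlier post and is not reproduced here. As a rough stand-in, something like the following would do the job in Python 3. The name gethtml matches the call used in the full code further down, but the body (including the timeout parameter) is only an assumption and may differ from the original.

```python
# Hypothetical stand-in for the gethtml helper (the original is in another post).
from urllib.request import urlopen


def gethtml(url, timeout=10):
    """Download a URL and return its body decoded as text."""
    with urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when no charset is declared in the headers.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling gethtml with the address of the page returns the HTML as a string, ready to be handed to BeautifulSoup.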

The main objective is to retrieve text from the table.

The target:

[Screenshot: tables with Twitter Bootstrap]

The full code is:

def probandoBS():
    codigoHTML = gethtml(url)  # Fetch the HTML code (url is defined elsewhere)
    soap = bs(codigoHTML)      # Pass the HTML code to BeautifulSoup (imported as bs)
    #print(soap.title)
    #print(soap.title.string) # findAll('title')
    tabla = soap.find('table')
    trs = tabla.findAll('tr')
    for tr in trs:
        tds = tr.findAll('td')
        #print(tds)
#        for td in tds: # Print each TD separately
#            print(td.string)
    lista = soap.findAll("ul", { "class" : "nav nav-pills" })
    print(len(lista))
    for li in lista[0]:  # Select the first element
        print(li.string)

Let’s explain the code line by line. After receiving the HTML code, we pass the string (with the HTML) to BeautifulSoup, and I store the result in a variable called soap (line 3). You can see the title of the page (line 4) just by writing:

soap.title

If you want just the text, use the .string attribute:

soap.title.string
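To make the difference concrete, here is a small self-contained sketch. The HTML string is made up for the illustration, and the code assumes the bs4 package with Python's built-in html.parser rather than the exact setup used in this post.

```python
from bs4 import BeautifulSoup

# Made-up HTML, standing in for the page downloaded with gethtml.
html = "<html><head><title>Tables demo</title></head><body></body></html>"
soap = BeautifulSoup(html, "html.parser")

print(soap.title)         # the whole tag: <title>Tables demo</title>
print(soap.title.string)  # only the text: Tables demo
```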

To retrieve the content of the table, let’s find the “table” tag:

tabla = soap.find('table')

I store the content in the variable tabla; you can verify that the content of this variable is the HTML code of the table.

The next step is to split the table into rows by finding the tr tags, just writing:

trs = tabla.findAll('tr')

The outcome is a list with the HTML code of each row (tr). Now let’s find the td tags to get the content of each cell:

for tr in trs:
    tds = tr.findAll('td')
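Put together, the table steps might look like the sketch below. The table content is invented, the rows variable is my own addition, and the code assumes the bs4 package, where find_all is the current spelling of findAll.

```python
from bs4 import BeautifulSoup

# Made-up table; the rows and columns of the real page will differ.
html = """
<table>
  <tr><th>Team</th><th>Points</th></tr>
  <tr><td>Alpha</td><td>10</td></tr>
  <tr><td>Beta</td><td>7</td></tr>
</table>
"""
soap = BeautifulSoup(html, "html.parser")
tabla = soap.find("table")

rows = []
for tr in tabla.find_all("tr"):
    tds = tr.find_all("td")
    if tds:  # the header row only has th cells, so it is skipped
        rows.append([td.get_text(strip=True) for td in tds])

print(rows)  # [['Alpha', '10'], ['Beta', '7']]
```

Collecting the text of each row into a list keeps the data easy to reuse later, for example to write it to a file.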

Here is the output of running the code:

[Screenshot: test BeautifulSoup]

And... that’s it! This is the way to solve our task!

Now, with the list

To get the content of a list (ul tag), I do almost the same as before. You only need to know how to select the “container” (in this case I need to find the class nav nav-pills) and iterate over each item.

lista = soap.findAll("ul", { "class" : "nav nav-pills" })
print(len(lista))
for li in lista[0]:
    print(li.string)
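Here is a self-contained variation on a made-up list, assuming the bs4 package. Iterating over the ul tag directly also yields the whitespace between the items, so this sketch asks for the li tags explicitly. The items variable is my own addition.

```python
from bs4 import BeautifulSoup

# Made-up navigation list; the real page has its own entries.
html = """
<ul class="nav nav-pills">
  <li><a href="#">Home</a></li>
  <li><a href="#">Tables</a></li>
  <li><a href="#">Lists</a></li>
</ul>
"""
soap = BeautifulSoup(html, "html.parser")
lista = soap.find_all("ul", {"class": "nav nav-pills"})

# Take the first matching ul and read the text of each li inside it.
items = [li.get_text(strip=True) for li in lista[0].find_all("li")]
print(len(lista))  # 1
print(items)       # ['Home', 'Tables', 'Lists']
```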

Here is the output when working with the list (ul tag):

[Screenshot: test BeautifulSoup]

As you can see, extracting data with Python using BeautifulSoup is very easy once you have studied the HTML code of the target URL a bit. A few searches and some work with lists are enough to solve the problem!

The full code is on GitHub.

The next post will be another useful example of extracting content from a URL that I am using on my Betting and Soccer Software!

I hope you like it and have a nice day!