In this post I will show you how to extract text (or data) from a website using Python. I will start by installing BeautifulSoup, and then I will solve a small exercise.
Installing BeautifulSoup
If you want or need to deal with HTML code, one of the most interesting options is BeautifulSoup, so let’s install it.
As you can see, I couldn’t install it using pip, but I finally got it installed using easy_install. Now, let’s check that it works!
NOTE: I am not going to write a tutorial about BeautifulSoup; I am just going to show you how I use it to solve concrete problems I have!
Trying BeautifulSoup
To work with BeautifulSoup, I am going to use the getHTML function that I explained here (getting the HTML code from Python). As the target page, I am going to use another example that I made when I explained Twitter Bootstrap for lists and tables.
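In case you don’t have that earlier post at hand, here is a minimal sketch of what such a helper might look like. It is not the original getHTML function: the name get_html, the timeout parameter and the UTF-8 fallback are my assumptions, and it targets Python 3’s standard library.

```python
from urllib.request import urlopen


def get_html(url, timeout=10):
    """Hypothetical stand-in for the getHTML helper: fetch a URL and return its text."""
    with urlopen(url, timeout=timeout) as response:
        # Use the charset announced by the server; fall back to UTF-8 (an assumption).
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

The returned string is what gets handed to BeautifulSoup in the rest of the post.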
The main objective is to retrieve text from the table.
The target:
The full code is:
def probandoBS():
    codigoHTML = gethtml(url)  # Retrieve the HTML code
    soap = bs(codigoHTML)  # Pass the HTML code to BeautifulSoup
    #print soap.title
    #print soap.title.string  # findAll('title')
    tabla = soap.find('table')
    trs = tabla.findAll('tr')
    for tr in trs:
        tds = tr.findAll('td')
        #print tds
        # for td in tds:  # Print each TD separately
        #     print td.string
    lista = soap.findAll("ul", { "class" : "nav nav-pills" })
    print len(lista)
    for li in lista[0]:  # Select the first element
        print li.string
Let’s explain the code line by line. After receiving the HTML code, we pass the string (with the HTML) to BeautifulSoup, and I store the result in a variable called soap (line 3). You can see the title of the page (line 4) just by writing:
soap.title
If you want just the text, add the .string attribute:
soap.title.string
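To make the difference between the two concrete, here is a small self-contained sketch. It assumes Python 3 and the modern bs4 package rather than the older setup used in this post, and the sample HTML is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, standing in for the HTML returned by the fetch helper.
SAMPLE_HTML = "<html><head><title>Lists and tables</title></head><body></body></html>"

# "html.parser" is the parser bundled with Python, so no extra install is needed.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

title_tag = soup.title          # the whole <title> element, tags included
title_text = soup.title.string  # only the text inside the element

print(title_tag)
print(title_text)
```

The first print shows the element with its tags, the second only the text inside it.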
To retrieve the content of the table, let’s find the table tag:
tabla = soap.find('table')
I store the content in the variable tabla, and we can verify that the content of this variable is the HTML code of the table.
The next step is to split the table into rows by finding the tr tags, just by writing:
trs = tabla.findAll('tr')
The outcome is a list with the HTML code of each row (tr). Now let’s find the td tags to get the content of each cell:
for tr in trs:
    tds = tr.findAll('td')
Here is the code running:
And... that’s it! This is the way to solve our task!
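The table steps above can be sketched as one small function. This is only an illustration under a few assumptions: Python 3 with bs4 (where find_all is the current spelling of findAll), a made-up sample table, and a hypothetical function name table_rows.

```python
from bs4 import BeautifulSoup

# Hypothetical sample table; the real page from the post is not reproduced here.
SAMPLE_HTML = """
<table>
  <tr><th>Team</th><th>Points</th></tr>
  <tr><td>Alpha</td><td>10</td></tr>
  <tr><td>Beta</td><td>7</td></tr>
</table>
"""


def table_rows(html):
    """Return the text of every <td> cell, grouped by table row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # header rows hold <th> cells only, so they produce an empty list
            rows.append(cells)
    return rows


rows = table_rows(SAMPLE_HTML)
print(rows)
```

Returning a list of lists keeps each row together, which is handy if you later want to write the data to a CSV file or a database.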
Now, with the list
To get the content of a list (ul tag), I do almost the same as before. You only need to know how to select the “container” (in this case I need to find the class nav nav-pills) and iterate over each item.
lista = soap.findAll("ul", { "class" : "nav nav-pills" })
print len(lista)
for li in lista[0]:
    print li.string
Here is the output when working with the list (ul tag):
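For reference, here is a self-contained sketch of the same list extraction, again assuming Python 3 with bs4 and a made-up sample list; pill_labels is a hypothetical name. Unlike the loop above, it searches for the li tags explicitly, so the whitespace between items is not visited.

```python
from bs4 import BeautifulSoup

# Hypothetical Bootstrap-style navigation list used only for this example.
SAMPLE_HTML = """
<ul class="nav nav-pills">
  <li><a href="#">Home</a></li>
  <li><a href="#">Profile</a></li>
  <li><a href="#">Messages</a></li>
</ul>
"""


def pill_labels(html):
    """Return the text of each <li> inside the first ul with class "nav nav-pills"."""
    soup = BeautifulSoup(html, "html.parser")
    lists = soup.find_all("ul", {"class": "nav nav-pills"})
    if not lists:
        return []  # no matching container on the page
    return [li.get_text(strip=True) for li in lists[0].find_all("li")]


labels = pill_labels(SAMPLE_HTML)
print(labels)
```

Checking for an empty result first avoids an IndexError when the page has no such list.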
As you can see, extracting data with Python using BeautifulSoup is very easy once you have studied the HTML code of the target URL a bit. A few searches and some work with lists are enough to solve the problem!
You have the full code on GitHub.
The next post will be another useful example of extracting content from a URL that I am using on my Betting and Soccer Software!
I hope you like it and have a nice day!