Python and BeautifulSoup: match scores from the Spanish League

A new chapter on web scraping: today my goal is to extract the results of the matches played in the Spanish League (first and second division). Let's start by preparing a function that asks the user to enter a division and a "day" (jornada, that is, a matchday):


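def pregunta(mensaje, opciones):
    """ Asks the user and returns the answer if it is a valid option, else '' """
    ask = raw_input(mensaje)   # Python 2: reads a line typed by the user
    try:
        return ask if int(ask) in opciones else ''
    except ValueError:
        # The answer was not a number
        return ''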
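
The function above targets Python 2 (`raw_input`). As a minimal sketch, assuming Python 3, the same check could look like the block below. The `leer` parameter is a hypothetical addition that is not in the original; it only exists so the prompt can be replaced when testing.

```python
def pregunta(mensaje, opciones, leer=input):
    """Return the answer as typed if it is one of `opciones`, else ''."""
    respuesta = leer(mensaje)
    try:
        return respuesta if int(respuesta) in opciones else ''
    except ValueError:
        # Non-numeric input is treated the same as an invalid option.
        return ''
```

Calling `pregunta("Division (1 or 2): ", [1, 2])` would then prompt on the console and return either the typed text or an empty string.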

The outcome is this:

Screenshot: asking for division and matchday

Division and matchday are the two pieces of data we need to continue.

I have prepared a list with the URLs (they are this and this), one for each division. I have already explained how to download the HTML code with #Python; the string we get is what we feed to the BeautifulSoup module.

Before continuing, let's define a match class:

class Marcadores():
    """ Scores extracted from MARCA.com """
    def __init__(self, eql, eqv, golL, golV):
        self.eql = eql.lower().decode('utf-8')   # home team
        self.eqv = eqv.lower().decode('utf-8')   # away team
        self.golL = golL                         # home goals, as a string
        self.golV = golV                         # away goals, as a string
    def __str__(self):
        return self.eql + " " + self.eqv + " --> " + self.golL + " - " + self.golV
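
The class above relies on Python 2 string handling (`.decode('utf-8')` on a `str`). As a rough sketch, assuming Python 3, an equivalent container could be written as a dataclass. The name `Marcador` is hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Marcador:
    """One match result; the goal fields stay strings, as in the class above."""
    eql: str   # home team
    eqv: str   # away team
    golL: str  # home goals, '' if not played
    golV: str  # away goals, '' if not played

    def __post_init__(self):
        # Python 3 strings are already Unicode, so only lower-casing is needed.
        self.eql = self.eql.lower()
        self.eqv = self.eqv.lower()

    def __str__(self):
        return f"{self.eql} {self.eqv} --> {self.golL} - {self.golV}"


print(Marcador("Real Madrid", "Sevilla", "2", "1"))
# real madrid sevilla --> 2 - 1
```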

We still need a few more functions. One of them removes accents (I found it here).

import unicodedata  # accent handling

# Removes accents: decompose the text (NFD), then drop the combining marks ('Mn')
def elimina_tildes(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')
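
To see what the accent removal does, here is a small worked example, assuming Python 3, where every string is already Unicode. The function is restated so that the snippet runs on its own, and the team names are just sample inputs. Note that "ñ" also decomposes into "n" plus a combining tilde, so it comes out as a plain "n".

```python
import unicodedata


def elimina_tildes(s):
    """Strip combining marks (accents) after NFD decomposition."""
    descompuesto = unicodedata.normalize('NFD', s)
    return ''.join(c for c in descompuesto if unicodedata.category(c) != 'Mn')


print(elimina_tildes("Atlético"))                # Atletico
print(elimina_tildes("Málaga"))                  # Malaga
print(elimina_tildes("Deportivo de La Coruña"))  # Deportivo de La Coruna
```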

I need another function to “fix” the score:

def ajustaGoles(marca):
    """ Normalizes the goals, returning each one as a string """
    try:
        marca[0] = str(int(marca[0]))
        marca[1] = str(int(marca[1]))
    except (ValueError, IndexError):
        # Non-numeric or missing score (e.g. match not played): return blanks
        marca = ['', '']
    return marca
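
The split and the "fix" can also be folded into a single step. The following is only a sketch, assuming Python 3 and assuming the score cell looks like "2-1"; the separator the post uses (`separaGoles`) is not shown, so "-" is a guess, and `parse_marcador` is a hypothetical name.

```python
def parse_marcador(texto, separador="-"):
    """Split a score cell such as '2-1' into two goal strings.

    Returns ['', ''] when the cell does not hold exactly two integers,
    for example when the match has not been played yet.
    """
    partes = texto.split(separador)
    try:
        local, visitante = (str(int(p)) for p in partes)
    except ValueError:
        # Raised both by int() on non-numeric text and by unpacking
        # a list that does not have exactly two parts.
        return ['', '']
    return [local, visitante]


print(parse_marcador("2-1"))    # ['2', '1']
print(parse_marcador("-"))      # ['', '']
print(parse_marcador("21:00"))  # ['', '']
```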

The function that does the job is:

def obtenerResultados(data, jorn):
    """ Extracts the match results from the downloaded HTML code """
    # bs and separaGoles are not defined in this post; presumably bs is the
    # BeautifulSoup class and separaGoles is the score separator.
    soup = bs(data)
    tjornadas = soup.findAll("div", { "class" : "jornada calendarioInternacional" })
    print len(tjornadas), " jornadas"
    # Cells of the requested matchday (jorn is 1-based)
    tlocales = tjornadas[jorn-1].findAll("td", { "class" : "local" })
    tvisits = tjornadas[jorn-1].findAll("td", { "class" : "visitante" })
    tresults = tjornadas[jorn-1].findAll("td", { "class" : "resultado" })
    lMarcadores = []
    for i in range(0, len(tlocales)):
        goles = tresults[i].text.split(separaGoles)
        goles = ajustaGoles(goles)
        partido = Marcadores(elimina_tildes(tlocales[i].text), elimina_tildes(tvisits[i].text), goles[0], goles[1])
        lMarcadores.append(partido)
    return lMarcadores
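
To make the cell-by-cell pairing easier to follow, here is a dependency-free sketch, assuming Python 3. It deliberately swaps BeautifulSoup for the standard library's `html.parser` so it runs without extra packages, and it works on a hand-written HTML fragment that only mimics the class names used above; the real MARCA.com markup may differ. It also skips the selection of a single matchday. `CeldasPartido` is a hypothetical name.

```python
from html.parser import HTMLParser

# Hand-written fragment (an assumption, not the real page).
HTML = """
<div class="jornada calendarioInternacional">
  <table>
    <tr><td class="local">Málaga</td><td class="resultado">2-1</td><td class="visitante">Getafe</td></tr>
    <tr><td class="local">Sevilla</td><td class="resultado">-</td><td class="visitante">Betis</td></tr>
  </table>
</div>
"""


class CeldasPartido(HTMLParser):
    """Collect the text of <td> cells whose class is local, visitante or resultado."""
    CLASES = ("local", "visitante", "resultado")

    def __init__(self):
        super().__init__()
        self.celdas = {clase: [] for clase in self.CLASES}
        self._actual = None  # class of the <td> currently being read, if any

    def handle_starttag(self, tag, attrs):
        clase = dict(attrs).get("class")
        if tag == "td" and clase in self.CLASES:
            self._actual = clase
            self.celdas[clase].append("")

    def handle_data(self, data):
        if self._actual is not None:
            self.celdas[self._actual][-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._actual = None


parser = CeldasPartido()
parser.feed(HTML)

# Pair the i-th home team, result and away team, as the loop above does.
partidos = list(zip(parser.celdas["local"],
                    parser.celdas["resultado"],
                    parser.celdas["visitante"]))
for local, resultado, visitante in partidos:
    print(local, resultado, visitante)
```

As in `obtenerResultados`, the pairing relies on the three lists having the same length and order, which holds only if every row contains all three cells.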

To understand what we need to extract, it is good practice to study the HTML code first. For this task I use Firefox Developer Tools:

Screenshot: Firefox Developer Tools

And … the code running:

Screenshot: extracting the scores

You can get the full source code on GitHub.

And … that's it! The version I have explained here is not the one I really use in my software: there I don't need to ask anything, because the data I need is stored in a config file (I already wrote about using config files here).

I hope you enjoyed it, and have fun! Have a nice day!
