A new chapter on web scraping: today my goal is to extract the results of the matches played in the Spanish League (first and second division). Let’s start by preparing a function that asks the user to enter a division and a “day” (the matchday):
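def pregunta(mensaje, opciones):
    # Returns the answer as typed if it is one of the valid options, '' otherwise
    ask = raw_input(mensaje)
    try:
        return ask if int(ask) in opciones else ''
    except ValueError:
        # The answer was not a number
        return ''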
The outcome is this:
The division and the day are the two pieces of data we need to carry on!
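If you are on Python 3, where raw_input became input, the same prompt could be sketched roughly like this. The leer parameter and the ranges for divisions (1-2) and days (1-42) are my assumptions for illustration; they are not values taken from the original script.

```python
def pregunta(mensaje, opciones, leer=input):
    """Ask for a number; return it as typed, or '' if it is not a valid option.

    `leer` is a hypothetical hook (it defaults to input) so the function
    can be tried out without typing anything on the keyboard.
    """
    ask = leer(mensaje)
    try:
        return ask if int(ask) in opciones else ''
    except ValueError:
        # The answer was not a number
        return ''


# Assumed ranges: two divisions and up to 42 days per season.
DIVISIONES = range(1, 3)
JORNADAS = range(1, 43)

# Simulate a user who types "2" and then "abc".
respuestas = iter(["2", "abc"])
falso = lambda mensaje: next(respuestas)

print(repr(pregunta("Division: ", DIVISIONES, leer=falso)))  # '2'
print(repr(pregunta("Day: ", JORNADAS, leer=falso)))         # ''
```

In a real run you would simply call pregunta("Division: ", DIVISIONES) and let input do its job.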
I have prepared a list with the URL for each division (they are this and this). I already explained how you can extract the HTML code with #Python, and the string that we get will feed the BeautifulSoup module.
Before continuing, let’s define a match class:
class Marcadores():
    """ Scores extracted from MARCA.com """

    def __init__(self, eql, eqv, golL, golV):
        self.eql = eql.lower().decode('utf-8')   # home team
        self.eqv = eqv.lower().decode('utf-8')   # away team
        self.golL = golL                         # home goals (string)
        self.golV = golV                         # away goals (string)

    def __str__(self):
        return self.eql + " " + self.eqv + " --> " + self.golL + " - " + self.golV
We still need a few more functions. One of them removes accents (I found it here).
import unicodedata

# Removes accents (tildes)
def elimina_tildes(s):
    return ''.join((c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn'))
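To see what the accent removal does, here is a small Python 3 sketch of the same idea. The team names are just sample inputs I picked. NFD normalization splits each accented letter into a base letter plus a combining mark (Unicode category 'Mn'), and the filter drops those marks.

```python
import unicodedata


def elimina_tildes(s):
    """Drop combining marks (category 'Mn') after NFD decomposition."""
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')


print(elimina_tildes("Atlético de Madrid"))      # Atletico de Madrid
print(elimina_tildes("Málaga"))                  # Malaga
print(elimina_tildes("Deportivo de La Coruña"))  # Deportivo de La Coruna
```

Note the side effect in the last line: the ñ also loses its tilde and becomes a plain n, because it decomposes in the same way as an accented vowel.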
I need another function to “fix” the score:
def ajustaGoles(marca):
    """ Adjusts the goals, returning each one as a string """
    try:
        marca[0] = str(int(marca[0]))
        marca[1] = str(int(marca[1]))
    except (ValueError, IndexError):
        # Match not played yet, or unexpected format: return empty scores
        marca = ['', '']
    return marca
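A quick Python 3 sketch shows what this “fix” is meant to do with different score texts. The name ajusta_goles, the "-" separator and the sample strings are assumptions of mine; the real separator is whatever separaGoles holds in the full source.

```python
SEPARA_GOLES = "-"  # assumption: the separator found in the scraped score text


def ajusta_goles(marca):
    """Return [home, away] goals as strings, or ['', ''] if the score is not usable."""
    try:
        return [str(int(marca[0])), str(int(marca[1]))]
    except (ValueError, IndexError):
        # Non-numeric text or a missing half of the score
        return ['', '']


print(ajusta_goles("2-1".split(SEPARA_GOLES)))      # ['2', '1']
print(ajusta_goles(" 0 - 3 ".split(SEPARA_GOLES)))  # ['0', '3']
print(ajusta_goles("-".split(SEPARA_GOLES)))        # ['', '']
print(ajusta_goles("3".split(SEPARA_GOLES)))        # ['', '']
```

The int() call tolerates surrounding spaces, so " 0 " still becomes '0'. A match that has not been played yet ends up with two empty strings.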
The function that does the job is:
# bs (the BeautifulSoup class) and separaGoles (the string that separates the
# two scores) are defined elsewhere in the full source.
def obtenerResultados(data, jorn):
    """ Extracts the information from the downloaded HTML code """
    soup = bs(data)
    # One <div> per day
    tjornadas = soup.findAll("div", { "class" : "jornada calendarioInternacional" })
    print len(tjornadas), " jornadas"
    #for jornada in tjornadas:
    #    print jornada.text
    #print tjornadas[jorn-1]
    # Home teams, away teams and scores of the requested day
    tlocales = tjornadas[jorn-1].findAll("td", { "class" : "local" })
    tvisits = tjornadas[jorn-1].findAll("td", { "class" : "visitante" })
    tresults = tjornadas[jorn-1].findAll("td", { "class" : "resultado" })
    #print len(tlocales), len(tvisits), len(tresults)
    lMarcadores = []
    for i in range(0,len(tlocales)):
        goles = tresults[i].text.split(separaGoles)
        goles = ajustaGoles(goles)
        partido = Marcadores(elimina_tildes(tlocales[i].text), elimina_tildes(tvisits[i].text), goles[0], goles[1])
        lMarcadores.append(partido)
    return lMarcadores # tjornadas[jorn-1].text
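To make the three lookups (local, visitante, resultado) more concrete, here is a small Python 3 sketch that runs against a made-up HTML fragment. Everything in the fragment is invented by me and only mimics the class names the scraper looks for. To keep the example free of dependencies it uses the standard library’s html.parser instead of BeautifulSoup, so it is not the method used in the real script, and it handles a single day only.

```python
from html.parser import HTMLParser

# Made-up fragment that mimics the class names the scraper looks for.
HTML = """
<div class="jornada calendarioInternacional">
  <table>
    <tr><td class="local">Málaga</td><td class="resultado">2-1</td><td class="visitante">Getafe</td></tr>
    <tr><td class="local">Sevilla</td><td class="resultado">-</td><td class="visitante">Betis</td></tr>
  </table>
</div>
"""


class CeldasPartido(HTMLParser):
    """Collect the text of <td> cells whose class is local, visitante or resultado."""

    CLASES = ("local", "visitante", "resultado")

    def __init__(self):
        super().__init__()
        self.celdas = {clase: [] for clase in self.CLASES}
        self._actual = None  # class of the <td> currently open, if any

    def handle_starttag(self, tag, attrs):
        clase = dict(attrs).get("class")
        if tag == "td" and clase in self.CLASES:
            self._actual = clase
            self.celdas[clase].append("")

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the open cell
        if self._actual is not None:
            self.celdas[self._actual][-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._actual = None


parser = CeldasPartido()
parser.feed(HTML)

# Pair up the i-th home team, score and away team, as the loop in the scraper does.
partidos = list(zip(parser.celdas["local"],
                    parser.celdas["resultado"],
                    parser.celdas["visitante"]))
for local, resultado, visitante in partidos:
    print(local, resultado, visitante)
```

The key point is the same as in the BeautifulSoup version: the three lists are matched by position, so the page must list the same number of home teams, away teams and scores for the pairing to make sense.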
To understand what we need to extract, it’s good practice to study the HTML code first. For this task I use the Firefox Developer Tools:
And … the code running:
You can get the full source code on GitHub.
And … that’s it! The version I have explained here is not the one I really use in my software: I don’t need to ask for anything, because the data I need is stored in a config file (I already wrote about using config files here).
I hope you enjoyed it, and have fun! Have a nice day!