Lessons learnt from a web scraping project

Today it is time for the lessons I learnt after (successfully) finishing a web scraping project to collect all the companies participating in #Fitur 2015, the International Tourism Fair (Feria Internacional del Turismo) held every year in Madrid, Spain: the big appointment for the tourism sector.

In short, I did it in about 3 days, using only 160 lines of code. Only 160 lines? Yes, more or less. The coding took around 12 hours, plus 10-14 hours of processing, spread over a week or so!

Please allow me not to publish the source code, mainly because the information is sensitive, even though it is public. If you are interested, please send me an email.

The challenge

To succeed in this challenge, I had to put my skills in databases, Python, HTML5 and CSS to work, because reaching the different pages by modifying the URL is … impossible! The task is an interesting one, because accessing the information is not as direct as you might think: only one company is shown at a time.

Lesson 1: use the right toolkit

The first lesson learnt is to use the right tools. Python is (in my opinion) the best choice for managing data. It’s true that I should have learnt Python earlier, but it is worth doing even late!

If you are going to do web scraping, you need to use BeautifulSoup and Selenium, because these two libraries simplify everything a lot.

For the database I used SQLite, a database I rarely use. The result was positive, although I spent part of the time developing a class to deal with SQLite databases (I will write about it in a future post).

Lesson 2: you need to know how to debug Python code

Without a doubt, even the simplest script needs some checking, so debugging is something you cannot forget. It’s very important for fixing bugs and errors, and you shouldn’t underestimate it. Invest time in learning how to debug. I already wrote about how to debug using PyScripter, and now I’m doing it with PyCharm.

Lesson 3: write good code

I didn’t notice the benefits of writing good code until I learnt how to do it. Now, following the Style Guide for Python Code allows me:

  • To make code more readable and easier to understand
  • To write less code

I already wrote about it and about PEP 8, which every developer should be aware of.

A few weeks ago I needed to modify some Visual Basic 6 code, and that was the moment I fell in love with the Python coding style guidelines. Some of the things I should avoid: more variables than needed, global variables, messy code, … I am proud of myself, I’m changing!!

Lesson 4: exercises where you think you are not learning (but it’s the opposite)

When learning to code, I always think that solving small and simple exercises (sometimes not so small or simple) is a good way to learn how to solve problems. You can find several small solved exercises here on this blog; they look easy, but each one is focused on learning a specific module or library. After solving small exercises you can move on to bigger problems (not the other way around).

Let’s see an example: comparing strings. In languages with accents, such as Spanish, “avión” is not the same as “avion” for a computer. For a database they are also two different strings. But can I remove the accents from a string? Yes, and the solution is very easy … if you use the right module:

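import unicodedata

def elimina_tildes(s):
    # NFD splits each accented letter into base letter + combining mark;
    # the combining marks (category 'Mn') are then dropped.
    return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')

test = u"avión"
print(test, elimina_tildes(test))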

Here you can see how it works:



This little piece of code can save you several problems when matching records.
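As a sketch of how such a helper might be used for matching, the snippet below builds a comparison key from each string. The name clave_comparacion is hypothetical (it is not from my script), and ignoring case and outer whitespace is an extra assumption on top of removing accents. Keep in mind that dropping the 'Mn' marks also turns “ñ” into “n”, so the key is meant for comparing, not for storing.

```python
import unicodedata


def clave_comparacion(s):
    """Return an accent-free, case-insensitive key for comparing strings."""
    # NFD splits "ó" into "o" plus a combining accent (category 'Mn'), which is dropped.
    sin_tildes = ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )
    # Assumption: case and surrounding whitespace should not matter when matching.
    return sin_tildes.casefold().strip()


print(clave_comparacion("Avión") == clave_comparacion("avion"))
```

The original strings stay untouched in the database; only the keys are compared.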

Lesson 5: you’ll get errors you don’t expect, but you need to solve them

Of course, because even an easy script can become a nightmare after an unexpected error, even with a well-designed algorithm. That happens because sometimes you design an algorithm by looking only at the first records you have to deal with, but you don’t know what you’ll get after 1,000 or 5,000 records. Here is a problem I needed to solve when trying to extract city, code and country. Lots of records looked like this:

Málaga 29001 Málaga (ESPAÑA)

But … other records did not look as expected, like this one:

 Via Palestro, 30 00185 Roma   (ITALIA)

If you only look at the first record, you might think of getting the information by splitting the string on whitespace. As you can see in the second record, extracting the information is more complex than expected, and not as direct as it seemed.
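One possible way to handle both records is a regular expression instead of a plain split. This is only a sketch, not the code I actually used: it assumes a 5-digit postal code and the country in parentheses at the end, and the names PATTERN, parse_location and the group names are made up for the example. Records that do not fit return None, so they can be logged and reviewed instead of breaking the run.

```python
import re

# Assumptions: a 5-digit postal code, and the country in parentheses at the end.
PATTERN = re.compile(
    r"^(?P<prefix>.*?)\s*"         # whatever comes before the postal code
    r"\b(?P<code>\d{5})\s+"        # postal code
    r"(?P<place>[^()]+?)\s*"       # text between the code and the country
    r"\((?P<country>[^()]+)\)$"    # country inside the final parentheses
)


def parse_location(text):
    """Return a dict with the parts found, or None if the record is unexpected."""
    match = PATTERN.match(text.strip())
    return match.groupdict() if match else None


for record in ["Málaga 29001 Málaga (ESPAÑA)", " Via Palestro, 30 00185 Roma   (ITALIA)"]:
    print(parse_location(record))
```

Note that the text before the code is a city in the first record and a street in the second, so deciding what "prefix" means is still up to you.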

Changing the algorithm to a new version will delay your initial plans.

Lesson 6: web scraping is not only INSERTs

In theory, the script should be executed only once. But reality is very different: solving an error will force you to start again, and if you haven’t thought about it, you may insert duplicate records.

Again, you need to modify your algorithm so that, before an insert, you verify the record is not already in the database. The initial task gets more and more complicated (although this time it is easy to solve).
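A minimal sketch of that check with the sqlite3 module from the standard library could look like this. The table companies, its columns and the function insert_if_new are hypothetical names for the example, and I assume the company name is enough to identify a record.

```python
import sqlite3


def insert_if_new(conn, name, country):
    """Insert the company only if it is not stored yet; return True if inserted."""
    # Check first: is there already a row with this name?
    row = conn.execute("SELECT 1 FROM companies WHERE name = ?", (name,)).fetchone()
    if row is not None:
        return False
    conn.execute("INSERT INTO companies (name, country) VALUES (?, ?)", (name, country))
    conn.commit()
    return True


# In-memory database just for the example; a real run would use a file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (name TEXT PRIMARY KEY, country TEXT)")

print(insert_if_new(conn, "Example Tours", "ESPAÑA"))  # first run of the script
print(insert_if_new(conn, "Example Tours", "ESPAÑA"))  # same record after a restart
```

Another option is to let the database do the work: with a PRIMARY KEY or UNIQUE column, SQLite's INSERT OR IGNORE skips the duplicates without a previous SELECT.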

Lesson 7: request speed

It may sound absurd, but to avoid bringing a server down with too many requests, one technique is to include sleeps. This way, your visits to the website also look less like a robot. Here is my solution: create a list of wait times and choose one at random. 10% of the time the wait is 3 seconds, 30% of the time it is 2 seconds and the rest of the time (60%) only one second.

tiempos1 = [1, 2, 3, 1, 1, 2, 1, 1, 1, 2]
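Using the list could look like the sketch below. The function name pausa_aleatoria and its sleep parameter are assumptions for the example; the parameter only exists so the real pause can be swapped out when testing.

```python
import random
import time

tiempos1 = [1, 2, 3, 1, 1, 2, 1, 1, 1, 2]


def pausa_aleatoria(sleep=time.sleep):
    """Wait a randomly chosen number of seconds and return that number."""
    # random.choice picks each position with the same probability, so the
    # repeated values in the list give 1 s most often and 3 s only rarely.
    segundos = random.choice(tiempos1)
    sleep(segundos)
    return segundos
```

Calling pausa_aleatoria() after each page request is enough to spread the requests over time.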

Lesson 8 and last: solving challenges is fun

I had a lot of fun solving this challenge, and I have also learnt a lot, apart from getting the database … ready to use! Mission accomplished!!

On the one hand, I learnt that good code increases my productivity and reduces errors. Also, to keep your requests from being blocked, use random pauses. On the other hand, using SQLite is also fun, although it’s totally different from MySQL and Microsoft Access, and now, thanks to this project, I have a class for handling SQLite databases.

I hope you can use my lessons when you code, and … have a nice day!