Introduction to Big Data, course by Microsoft

Posted in Big Data, Courses

It’s been a long time since I last took a course, but with the arrival of the New Year I decided to follow Microsoft’s Introduction to Big Data course, offered on edX (the platform Microsoft chose for its courses). The course is no longer available, but I will summarize what I learned and how it has been useful to me.

Data

Throughout the life of this blog, I have talked about Big Data on many occasions: what it is, who the main players are, and I have even detailed the content of some of the courses I have taken. However, I have been more focused on TypeScript for some time now, and I haven’t written about Big Data in a while. Add to this that Microsoft wants to expand Azure, and the course is almost perfect for beginners.

The first part is incredibly interesting, because it explains precisely that...

Data is present in many places, in different formats and with different encodings, and it can be structured (with a defined structure) or unstructured (with no fixed structure or pattern, so the shape can change with each record or element).

Encoding matters, because choosing one encoding or another determines whether content is preserved or lost. The right encoding also makes it possible to use characters from other writing systems, such as Chinese or Arabic, hence the importance of knowing which encoding to use.
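To make the point about losing content concrete, here is a minimal sketch in Python. The sample text and variable names are my own assumptions, not material from the course; it only shows how the same string survives one encoding and is damaged by another.

```python
# Sample text mixing ASCII, Chinese and an accented Spanish word (illustrative only).
text = "Big Data: 大数据, año"

# UTF-8 can represent every character, so the round trip is lossless.
utf8_bytes = text.encode("utf-8")
roundtrip = utf8_bytes.decode("utf-8")

# ASCII cannot represent the non-English characters; with errors="replace"
# each unsupported character becomes "?" and the original content is lost.
ascii_bytes = text.encode("ascii", errors="replace")
lossy = ascii_bytes.decode("ascii")

# Decoding UTF-8 bytes with the wrong encoding does not raise an error,
# but the result is garbled text rather than the original string.
garbled = utf8_bytes.decode("latin-1")

print(roundtrip == text)  # the UTF-8 round trip preserves the text
print(lossy)              # non-ASCII characters replaced by "?"
print(garbled == text)    # the wrongly decoded text no longer matches
```

The third case is the one that tends to bite in practice: nothing fails, the data just arrives wrong, which is why it helps to know the encoding of every source before loading it.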

Microsoft Azure

There is no doubt that one of the objectives of the course is to introduce you to Azure, the Microsoft cloud, and the different services it offers. One of them is Azure Storage, which lets you store files; if you need to store much larger files, there is Azure Data Lake Store.

The storage services let you upload files that are then kept in the cloud. There are different ways to access these files afterwards, and, by the way, they can also be organized into different directories.

Another group of services covers databases, with several options, both relational and NoSQL. Azure’s NoSQL database is Cosmos DB, which supports different types of databases such as MongoDB, DocumentDB, … or Table, which is of the key:value type. Of course, you can also work with data in JSON format using Document Data Stores. One feature I really liked is that you can use SQL syntax to run queries, even though you are working with NoSQL databases. Finally, it is possible to work with Graph Data Stores, GraphDB, which let you model relationships between data.
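As a rough illustration of the document and key:value ideas, here is a small sketch in plain Python. It does not use any Azure SDK. The documents, field names and the query shown in the comment are assumptions of mine, meant only to show what a SQL-style query over JSON documents expresses.

```python
import json

# Hypothetical JSON documents; note that they do not all share the same fields.
raw_documents = [
    '{"id": "1", "name": "Ana", "city": "Madrid", "tags": ["azure", "nosql"]}',
    '{"id": "2", "name": "Luis", "city": "Sevilla"}',
    '{"id": "3", "name": "Marta", "city": "Madrid", "age": 34}',
]
documents = [json.loads(doc) for doc in raw_documents]

# Roughly what a SQL-style query such as
#   SELECT c.name FROM c WHERE c.city = 'Madrid'
# expresses, written here as a plain Python filter over the parsed documents.
names_in_madrid = [doc["name"] for doc in documents if doc.get("city") == "Madrid"]

# A key:value view of the same data: one key, one value looked up directly.
key_value_store = {doc["id"]: doc for doc in documents}

print(names_in_madrid)
print(key_value_store["2"]["name"])
```

The appeal of the SQL syntax is that the question stays familiar even though the records underneath have no fixed schema.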

The last part is dedicated to processing Big Data. Because it involves large volumes of data, one of the principles to apply is parallel processing, with the aim of not moving the data from where it is stored, so that processing is faster. The processed data is then stored for use in reports, graphs, visualizations, ... The same applies to data that is processed in real time.
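The principle of processing each chunk of data where it lives and only moving the small results can be sketched as follows. This is a toy word count under my own assumptions: the partitions are invented, and threads on one machine stand in for the nodes of a real cluster, so it illustrates the pattern rather than real distributed performance.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions: in a real cluster each one would live on a different node.
partitions = [
    ["big data azure", "azure storage"],
    ["data lake", "big data"],
    ["azure data factory"],
]

def count_words(partition):
    """Map step: count words inside a single partition, without moving its data."""
    counts = Counter()
    for line in partition:
        counts.update(line.split())
    return counts

def merge_counts(partial_counts):
    """Reduce step: combine the small per-partition results into one total."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

# Each partition is processed independently; only the partial counts are gathered.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial = list(pool.map(count_words, partitions))

word_counts = merge_counts(partial)
print(word_counts.most_common(2))
```

Only the per-partition counters travel to the merge step, which is the whole point: the summaries are tiny compared with the data they describe.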

Among the solutions offered by Microsoft, you have classics such as Apache Hadoop or Spark, within a service called HDInsight, but there is also Azure Data Factory, which lets you set up pipelines and workflows to manage all the data, clean it, and aggregate it until it is ready for visualizing results, or even store it in other databases, spreadsheets, ... Of course, you can combine all of these solutions.
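To give an idea of what such a workflow does, here is a minimal clean-then-aggregate sketch in plain Python. It is not Azure Data Factory code; the records, field names and the two step functions are hypothetical and only illustrate the kind of steps a workflow chains together.

```python
from collections import defaultdict

# Hypothetical raw records, as they might arrive from different sources.
raw_records = [
    {"city": " Madrid ", "sales": "100"},
    {"city": "madrid", "sales": "50"},
    {"city": "Sevilla", "sales": None},
    {"city": "sevilla", "sales": "70"},
]

def clean(records):
    """Cleaning step: drop records without a sales value and normalize the rest."""
    for record in records:
        if record["sales"] is None:
            continue
        yield {
            "city": record["city"].strip().title(),
            "sales": float(record["sales"]),
        }

def aggregate(records):
    """Aggregation step: sum sales per city."""
    totals = defaultdict(float)
    for record in records:
        totals[record["city"]] += record["sales"]
    return dict(totals)

# The workflow is simply the steps chained in order.
sales_by_city = aggregate(clean(raw_records))
print(sales_by_city)
```

The result is small and tidy enough to feed a chart or a spreadsheet, which is where this kind of workflow usually ends.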

Azure includes many options for Big Data analysis, so you won’t lack anything; there is even Azure Machine Learning for you to apply.

What I have learnt from this course

While I recognize that working through the different parts of this course takes time and dedication, the best conclusion I can draw is that the technology is available and within everyone’s reach. And, of course, JSON is fundamental: it is a flexible data structure that is well suited to JavaScript and TypeScript, it is an open format, and there are increasingly powerful databases that make use of it.

We live in the age of Big Data, and if you intend to take your first steps in this world and are wondering where to position yourself and where to start, this course is a good first step. We must not forget that, although Big Data means working with the three V’s (volume, velocity and variety), it will still be necessary to do data adaptation, data cleaning, aggregations, etc., all as automatically as possible, knowing that the data sources will be many and very different, and that this applies not only to text or numbers, but also to images, audio, video, etc.

Behind every action, of course, there must be a goal to be achieved.

Let’s keep on learning!!