Data scientists are professionals in maths and statistics, as well as in programming, computer science and analytics. You can read more here.
As you can see, that is a lot of requirements, but these skills allow you to:
- Define questions to solve
- Establish the necessary data
- Establish the data available to solve the questions (the data available and the data needed can be different!!)
- Locate the data that is really available: in spreadsheets, in databases, on the web, …
- Clean the data, because not all data will be used
- Explain the data analysis, including a visual representation of the results
- Model and predict, trying to anticipate future behaviour based on current data, models and patterns
- Interpretation of results
- Challenge the results, because answering one question raises new questions related to the first one, and the process begins again
- Document the results, because solving the problem is important, but it is even more important to record how it was solved and who the audience for the solution is.
- Distribute the results to the scientific community and to anyone who may be interested in the question, because other scientists must be able to reproduce the work with the code used during the investigation
A data scientist can do a lot of things, but the difficulties to overcome are also high.
Notice that data is the key, whether it is raw or already processed/prepared. Raw data needs to be prepared before it is processed, by selecting only the data that is needed and discarding the rest. Of course, the nature of the problem to solve will determine whether the data is useful or not. Sometimes it is even necessary to create new data derived from the raw data.
This whole process can be more or less complicated depending on the type of data, its volume, the operations to apply to the raw data, … and of course on the need to get rid of incoherent data (such as character data where a number is expected, null values, …). There are several possibilities.
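The preparation steps described above (selecting only the fields that are needed, discarding incoherent values, and creating new data from the raw data) could look something like the following minimal sketch. It uses plain Python, and the records, the field names and the helper functions `to_number` and `clean` are all invented for illustration; a real project would adapt them to its own data and tools.

```python
# Hypothetical raw records: field names and values are invented for illustration.
raw_rows = [
    {"id": "1", "age": "34", "height_cm": "170", "comment": "ok"},
    {"id": "2", "age": "abc", "height_cm": "165", "comment": ""},    # text where a number is expected
    {"id": "3", "age": None, "height_cm": "180", "comment": "n/a"},  # null value
    {"id": "4", "age": "51", "height_cm": "158", "comment": "ok"},
]

# Assumption: only these numeric fields are needed to answer the question.
NUMERIC_FIELDS = ("age", "height_cm")


def to_number(value):
    """Return value as a float, or None if it is missing or not numeric."""
    if value is None:
        return None
    try:
        return float(value)
    except ValueError:
        return None


def clean(rows, numeric_fields=NUMERIC_FIELDS):
    """Keep only the needed fields and drop rows with incoherent values."""
    cleaned = []
    for row in rows:
        record = {"id": row["id"]}
        for name in numeric_fields:
            record[name] = to_number(row.get(name))
        if any(record[name] is None for name in numeric_fields):
            continue  # discard rows with nulls or non-numeric text
        # Create new data based on the raw data (height in metres).
        record["height_m"] = record["height_cm"] / 100
        cleaned.append(record)
    return cleaned


cleaned_rows = clean(raw_rows)
print(cleaned_rows)
```

Here the rows with a non-numeric or missing age are simply dropped. That is only one of the several possibilities; depending on the problem, it might be better to repair or fill in such values instead.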
Apart from that, data can be structured or unstructured. Structured data is the best option, but unstructured data is the most common. Thanks to #BigData tools, converting unstructured data into structured data takes less effort, and the amount of data that can be processed has grown.
This is the … long road … to becoming a data scientist!!