You mention techniques. Can you provide an overview of the basic concepts?
Sure. I talked about deduction. Another way to think about this is teaching a model to learn. That’s a counter-intuitive concept, I know. How do you teach a model? Basically, there is a logic behind all data: information comes from somewhere. The data scientist looks at categories or values that are already present and extrapolates what might be missing by refining and adjusting a mathematical model intended to represent this logic. There is supervised learning, where you are trying to predict either a number, through something called regression analysis, or a class, category, or type, through what we call classification. Then there is unsupervised learning, where maybe you don’t know the value you are looking for, so you take the values that you do have and try to cluster them. These are the two most popular approaches, but new methods are being developed all the time.
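To make those two approaches concrete, here is a minimal sketch in Python. Nothing below comes from the conversation itself: the data is synthetic, and the use of scikit-learn is my assumption, chosen only to show a regression, a classification, and a clustering in a few lines each.

```python
# A minimal sketch of the three basic tasks described above.
# Synthetic data, scikit-learn assumed; purely illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))  # 100 data points, 2 features each

# Supervised learning, regression: predict a number from labeled examples.
y_number = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=100)
reg = LinearRegression().fit(X, y_number)
print("regression predicts a number:", reg.predict(X[:1]))

# Supervised learning, classification: predict a class/category/type.
y_class = (X[:, 0] > 0).astype(int)  # two classes, labeled 0 and 1
clf = LogisticRegression().fit(X, y_class)
print("classification predicts a class:", clf.predict(X[:1]))

# Unsupervised learning, clustering: no labels at all; group similar points.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print("clustering groups the points:", labels[:10])
```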
Cluster? Are you referring to a scatter plot?
Not quite. Clustering is what it sounds like. It refers to groups or clusters of dots representing data, but it can take many forms. Maybe you assign different colors to data points. Maybe you have more or fewer clusters. It all rests on your choices and what your algorithm or model requires.
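A brief illustration of those choices, again a sketch under my own assumptions (synthetic points, scikit-learn's KMeans, matplotlib for the colors): the same data can be split into two clusters or four, with each point colored by the cluster it lands in. Nothing forces one value of k over another; it rests on your choices and what your model requires.

```python
# Illustrative only: color the same data by cluster assignment,
# once with k=2 and once with k=4, to show that the number of
# clusters is the analyst's choice, not a property of the data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
points = rng.normal(size=(200, 2))

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
for ax, k in zip(axes, (2, 4)):
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(points)
    ax.scatter(points[:, 0], points[:, 1], c=labels)  # one color per cluster
    ax.set_title(f"k = {k}")
plt.show()
```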
What is the point of clustering data?
People don’t recognize that data science is beautiful! Data science is all about finding patterns, and these often reveal themselves through visual representations. For example, imagine I plot my data on a graph, but all those numbers are noisy. At first glance, they look like a jagged or even random assortment of dots on a grid. Many of our models or algorithms, especially for supervised learning, seek to fit a line on top of the data – to apply a pattern to this information. As you train the model with more inputs, it gets closer and closer to the actual distribution of your data points, forming a shape that the eye can follow. But beware: the line can’t exactly match your data or you are in danger of over-fitting.
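Here is what fitting a line on top of noisy data can look like in code, as a sketch with made-up numbers: a least-squares line recovers the pattern underneath the jagged scatter.

```python
# Fit a line through noisy points: the model "applies a pattern"
# to data that at first glance looks like a jagged scatter of dots.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=2.0, size=x.size)  # true line plus noise

slope, intercept = np.polyfit(x, y, deg=1)  # least-squares straight line

plt.scatter(x, y, label="noisy data")
plt.plot(x, slope * x + intercept, color="red", label="fitted line")
plt.legend()
plt.show()
```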
Isn’t fitting a model to the data a good thing? How can you over-fit?
Over-fitting is one of your arch enemies as a data scientist. Here’s an analogy: if you’re given a sample exam and you study just that sample over and over and over, you may eventually score 100% on the sample, but will you pass the real exam? Probably not. You don’t understand the underlying concepts – or, in this case, the logic underlying how the data is organized. Instead, to extend the analogy, you are just parroting back the responses you memorized, even though the questions may have changed. The point is to learn why certain pieces of data are what they are – and then to apply these insights to answer new questions.
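The sample-exam analogy can be shown numerically. In this sketch (the synthetic data and the polynomial degrees are my own choices), a very flexible polynomial memorizes the training half almost perfectly but stumbles on the held-out half, while a plain straight line does reasonably on both.

```python
# The sample-exam analogy in code: a degree-12 polynomial "memorizes"
# the training sample (low training error) but does worse on held-out
# data, while a straight line generalizes.
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 40)
y = 2.0 * x + 1.0 + rng.normal(scale=0.2, size=x.size)

# Hold out every other point as the "real exam".
train = np.arange(x.size) % 2 == 0
x_tr, y_tr = x[train], y[train]
x_te, y_te = x[~train], y[~train]

for degree in (1, 12):
    coeffs = np.polyfit(x_tr, y_tr, deg=degree)
    mse_tr = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    mse_te = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
    print(f"degree {degree:2d}: train error {mse_tr:.4f}, test error {mse_te:.4f}")
```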