Motivation
I have always found data science fascinating but never bothered giving machine learning a shot. This is my effort to understand what it is.
What is machine learning?
A video on machine learning by TensorFlow (a lower-level library than Scikit-Learn)
Machine learning is an application of AI that gives systems the ability to learn and improve from experience automatically. Learning is driven by observing data: from the data it sees, the system works out trends and patterns. Take Spotify's music recommendations as an example: the suggestions improve as you play more songs.
A machine learning system takes inputs and produces outputs. The inputs are often called features, while the outputs are usually called labels.
Some definitions:
- Features: Measurable properties of the data being observed. The machine learning system tries to pick out relevant patterns among them, and those patterns are then used to generate the outputs.
- Labels: The outputs the system predicts. In classification they are called labels because each one names the category an example is assumed to belong to.
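To make the two definitions concrete, here is a minimal sketch in plain Python. The fruit data, the feature names, and the variable names are all made up for illustration; they do not come from any real dataset or library.

```python
# Toy, made-up data: each row describes one fruit using two features.
feature_names = ["weight_g", "surface"]  # surface: 0 = smooth, 1 = bumpy
features = [
    [150, 0],
    [170, 0],
    [130, 1],
    [140, 1],
]

# One label per row, in the same order as the rows above.
labels = ["apple", "apple", "orange", "orange"]

# Pair every row of features with its label.
for row, label in zip(features, labels):
    described = dict(zip(feature_names, row))
    print(described, "->", label)
```

Each row of features lines up with exactly one label, which is the shape most classification tools expect.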
Machine learning algorithms are typically categorized as supervised or unsupervised. I will be focusing on supervised learning here, and on classification in particular, which is one type of supervised learning.
Some definitions:
- Supervised learning: The input to a machine learning network has already been labeled, meaning each example comes tagged with the category it belongs to. Therefore:
- The machine learning network knows which parts of the input are important
- There is a ground truth that the network can check itself against
Model training is the step in which the system learns the patterns in a dataset. To figure out those patterns, it needs both the features and the labels of the training data, and in general more training data helps.
Testing is just as important, since it checks what the network has actually learned. Only the features are fed to it, and its predicted labels are compared against the true ones to measure accuracy. The data used for testing should not be the same as the data used for training. Just imagine taking the exact same test as the practice test your professor handed out the week before!
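The train-then-test idea above can be sketched with Scikit-Learn roughly as follows. This is only an illustration under my own assumptions: the built-in iris dataset, a 75/25 split, and a KNN classifier with 3 neighbors are arbitrary choices, not something this post prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# X holds the features, y holds the labels (iris is an assumed example dataset).
X, y = load_iris(return_X_y=True)

# Hold back 25% of the data so the model is tested on examples it never saw.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Training: the model sees both features and labels.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# Testing: the model sees only features; its guesses are compared to the truth.
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy on unseen data: {accuracy:.2f}")
```

Fixing `random_state` just makes the split repeatable; a different seed would give a slightly different accuracy.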
What do we need?
What better library to use for machine learning than the Python library Scikit-Learn? It makes machine learning look simple, thanks to the relatively straightforward syntax of Python and the wonderful functionality provided by the library itself.
Classification methods made available by Scikit-Learn
- K-Nearest Neighbors
- Support Vector Machines
- Decision Tree Classifiers/Random Forests
- Naive Bayes
- Linear Discriminant Analysis
- Logistic Regression
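One thing that makes the list above approachable is that these classifiers share the same `fit` / `score` interface in Scikit-Learn, so they can be swapped in and out of the same loop. The sketch below assumes the iris dataset and mostly default settings (I picked `GaussianNB` as the Naive Bayes variant and raised `max_iter` for logistic regression); it is meant to show the shared interface, not to rank the methods.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# One instance of each method from the list, mostly with default settings.
classifiers = {
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Support Vector Machine": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
    "Linear Discriminant Analysis": LinearDiscriminantAnalysis(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Every classifier is trained and scored through the same two calls.
scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)  # mean accuracy on the test set
    print(f"{name}: {scores[name]:.2f}")
```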
What's next?
I will dive into how to use the k-nearest neighbors (KNN) classifier in the next post. The idea behind it is proximity: it measures the distance from a test example to the training examples whose labels are known, then assigns the label that is most common among the k closest ones.
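As a rough preview of the proximity idea, here is a from-scratch sketch in plain Python. It assumes Euclidean distance and a simple majority vote; the function name `knn_predict` and the toy points are hypothetical, and Scikit-Learn's own implementation is more sophisticated than this.

```python
import math
from collections import Counter


def knn_predict(train_features, train_labels, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest points."""
    # Distance from the query to every training example, paired with its label.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_features, train_labels)
    ]
    # Keep the k closest training examples.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The most common label among those neighbors wins.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up training data: two well-separated clusters.
train_features = [
    (1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_features, train_labels, (1.1, 1.0)))  # prints "a"
print(knn_predict(train_features, train_labels, (5.1, 5.0)))  # prints "b"
```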
To know more about KNN: