What is Machine Learning?

Machine learning is a programming technique that uses statistics and probability to give computers the ability to “learn” from data without being explicitly programmed for each task. In essence, the aim is for computers to learn, and therefore act, more the way humans do, improving their knowledge over time with little human intervention. Machine learning programs adjust their behavior when exposed to new inputs.

The key to machine learning is feeding lots and lots of data into the learning computer. In order to learn, the machine needs Big Data.

A good example of machine learning is the self-driving car. A self-driving car uses GPS together with camera, radar, and lidar sensor systems to:

  • Determine its location.
  • Watch the road ahead.
  • Detect objects behind or to the side of the car.

All of this information is processed by a central computer. The computer constantly takes in and analyzes massive amounts of data, classifying the information with neural networks, in a way loosely modeled on how a human brain would. Then the computer makes decisions based on mathematical probabilities and observations, such as how to steer, when to brake, and when to accelerate, which guide the car through its environment.

Three Types of Machine Learning

Machine learning is not new. One of the first artificial neural networks (ANNs), the Perceptron, was introduced in 1958 by psychologist Frank Rosenblatt.

The Perceptron was initially intended to be a machine, not just an algorithm. It was implemented in the image recognition machine “Mark 1 Perceptron” in 1960. The Mark 1 Perceptron was one of the first computers to use an ANN to simulate human thought and learn by trial and error.

Machine learning is now far more widely used, thanks to open source libraries and frameworks and the trillion-fold increase in computer processing power from 1956 to 2015. It can be found everywhere from financial trading to malware prevention to marketing personalization. But no matter how basic or complex, machine learning fits into three general categories:

1. Supervised Machine Learning

Supervised machine learning is the most basic and most tightly guided form. The computer is presented with example inputs and their desired target outputs, and works out how to produce those outputs from those inputs. The goal is for the computer to learn the general rule that maps inputs to outputs.

Supervised machine learning can be used to make predictions about unseen or future data, which is called predictive modeling. The algorithm attempts to develop a function that accurately predicts the output from input variables, such as predicting the market value of a house (output) from the square footage (input) and other inputs (age, type of construction, etc.).
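To make the house example concrete, here is a minimal sketch of predictive modeling with a single input. It fits a straight line by ordinary least squares using only the Python standard library. The four houses and their prices are invented for illustration, and the names `fit_line` and `predict` are hypothetical helpers, not part of any library. A real model would also use the other inputs mentioned above.

```python
# Hypothetical training data: (square footage, sale price in dollars).
# These numbers are invented for illustration only.
houses = [(1000, 200_000), (1500, 290_000), (2000, 410_000), (2500, 500_000)]

def fit_line(points):
    """Fit price = slope * sqft + intercept by ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Slope is the covariance of x and y divided by the variance of x.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, sqft):
    """Apply the learned input-to-output rule to an unseen house."""
    slope, intercept = model
    return slope * sqft + intercept

model = fit_line(houses)          # learn from labeled examples
estimate = predict(model, 1800)   # predict for a house not in the data
print(f"Estimated value of an 1,800 sq ft house: ${estimate:,.0f}")
```

The learned function is the "general rule" that maps inputs to outputs: once fitted, it can be applied to any square footage, not only the ones it was trained on.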

Two types of supervised learning are:

  • Classification — The output variable is a category.
  • Regression — The output variable is a real value.

Supervised machine learning algorithms include: random forest, decision trees, k-Nearest Neighbor (kNN), linear regression, Naive Bayes, support vector machine (SVM), logistic regression, and gradient boosting.
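As an illustration of classification, the following is a small sketch of k-Nearest Neighbor (kNN), one of the algorithms listed above. The labeled points and the function name `knn_classify` are assumptions made up for this example. A new point gets the label held by the majority of its k closest labeled examples.

```python
import math
from collections import Counter

# Hypothetical labeled examples: (feature_1, feature_2) -> category.
training = [
    ((150, 50), "small"), ((155, 55), "small"), ((160, 58), "small"),
    ((180, 85), "large"), ((185, 90), "large"), ((190, 95), "large"),
]

def knn_classify(examples, query, k=3):
    """Return the majority label among the k examples nearest to query."""
    by_distance = sorted(examples, key=lambda ex: math.dist(ex[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

label = knn_classify(training, (158, 57))
print(label)  # small
```

Because the output is a category rather than a number, this is classification; swapping the vote for an average of the neighbors' numeric targets would turn the same idea into regression.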

2. Unsupervised Machine Learning

In unsupervised machine learning, the algorithm is left on its own to find structure in its input. No labels are given to the algorithm. Finding that structure can be a goal in itself (discovering hidden patterns in data) or a means to an end, such as producing better inputs for another algorithm. The latter is also known as “feature learning.”

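An example of unsupervised machine learning is grouping the faces that appear in a collection of photographs by similarity, without being told who anyone is. (Attaching names to those groups, as Facebook’s facial recognition does when it identifies people in photographs, additionally relies on labeled examples.)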

Two types of unsupervised learning are:

  • Clustering: The goal is to find inherent groupings in the data.
  • Association: The goal is to find rules that describe large portions of the data, such as items that tend to occur together.

Unsupervised machine learning algorithms include: K-Means, hierarchical clustering, and dimensionality reduction methods such as principal component analysis (PCA).
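Association can be illustrated with two standard measures, support and confidence, computed from unlabeled records. The sketch below assumes a handful of invented shopping baskets; the function names `support` and `confidence` are hypothetical helpers written for this example.

```python
# Hypothetical shopping baskets, invented for illustration. No labels are given.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated chance of the consequent given the antecedent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

# Candidate rule: "baskets that contain bread also contain butter".
rule_support = support({"bread", "butter"}, baskets)
rule_confidence = confidence({"bread"}, {"butter"}, baskets)
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

Here three of the five baskets contain both items and four contain bread, so the rule covers 60 percent of the data and holds in 75 percent of the baskets where it applies. An association algorithm searches for rules whose support and confidence are both high.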

3. Reinforcement Machine Learning

In reinforcement machine learning, a computer program interacts with a dynamic environment in which it must achieve a certain goal, such as driving a vehicle or playing a game against an opponent. The program is given feedback in terms of rewards and punishments as it navigates the problem space, and it learns to determine the best behavior in that context.

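In 2013, a reinforcement machine learning algorithm using a variant of Q-learning famously learned to play Atari 2600 video games directly from the screen pixels, with no game-specific rules supplied by a programmer, and outperformed earlier approaches on six of the seven games tested.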

Two types of reinforcement learning are:

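  • Monte Carlo: Value estimates are updated only once an episode reaches its end (“terminal”) state, using the total reward actually received.
  • Temporal Difference (TD) Learning: Value estimates are updated at each step, using the current estimate of future reward.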

Reinforcement machine learning algorithms include: Q-learning, Deep Q Network (DQN), and State-Action-Reward-State-Action (SARSA).
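The reward-driven loop described in this section can be sketched with tabular Q-learning on a toy problem. Everything about the environment is an assumption invented for illustration: a corridor of five cells in which the program starts at cell 0 and receives a reward of 1 only on reaching cell 4. The learning rate, discount factor, and exploration rate are arbitrary choices.

```python
import random

N_STATES = 5                 # corridor cells 0..4 (hypothetical toy environment)
GOAL = N_STATES - 1          # reaching the last cell ends the episode
ACTIONS = ("left", "right")

def step(state, action):
    """Move one cell; the only reward is 1.0 for reaching the goal."""
    nxt = max(0, state - 1) if action == "left" else min(GOAL, state + 1)
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = {s: {a: 0.0 for a in ACTIONS} for s in range(N_STATES)}
    for _ in range(episodes):
        state = 0
        for _ in range(200):  # step cap so an episode cannot run forever
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)          # explore
            else:
                best = max(q[state].values())         # exploit, random tie-break
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            nxt, reward, done = step(state, action)
            # Q-learning update: move the estimate toward the reward plus the
            # discounted value of the best action in the next state.
            target = reward if done else reward + gamma * max(q[nxt].values())
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
            if done:
                break
    return q

q_table = train()
policy = {s: max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(GOAL)}
print(policy)
```

Nobody tells the program that "right" is the correct move. It discovers this from the rewards alone, and because each update happens after a single step, this is an instance of the temporal difference approach.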

Machines Learn Through Probability

Much of machine learning rests on probability, and in particular on the Bayesian interpretation of probability, in which a probability expresses a degree of belief that is updated as new evidence arrives.

For example, here is how a machine would learn whether or not the sun comes up each day.

Day 1: The sun will rise or not rise. With no observations yet, the probability that the sun will rise is one out of two, or 0.5 (50 percent), as only two results are possible.

Day 2: The sun rose on Day 1, so the probability has changed. The machine now knows that the sun has risen once before, but it still might not rise again. The probability has changed to two out of three, or about 0.667.

Days 3 to 6: The sun keeps rising every day; probability goes up.

Day 7: The sun has risen six days in a row, so the probability that it rises today is seven out of eight, or 0.875. By the end of the week, after a seventh sunrise, the probability that the sun will rise the following day is eight out of nine, or about 0.889.

End of the Year: The sun rose every day; the probability it will rise again the next day is now about 0.997, or more than 99 percent.

It is important to note that the probability can never reach 1, or 100 percent. There is always a minuscule chance, ever smaller as time goes on, that the sun will not rise the next day.
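The one-out-of-two and two-out-of-three figures in this example are consistent with Laplace's rule of succession, which estimates the probability as (sunrises seen + 1) / (days observed + 2). Assuming that is the rule intended here, the update can be sketched in a few lines. The function name `prob_sunrise` is a hypothetical helper.

```python
from fractions import Fraction

def prob_sunrise(sunrises_seen, days_observed):
    """Rule of succession: (successes + 1) / (trials + 2)."""
    return Fraction(sunrises_seen + 1, days_observed + 2)

# The sun has risen on every day observed so far.
for days in (0, 1, 365):
    p = prob_sunrise(days, days)
    print(f"after {days} sunrises: {p} = {float(p):.3f}")
# after 0 sunrises: 1/2 = 0.500
# after 1 sunrises: 2/3 = 0.667
# after 365 sunrises: 366/367 = 0.997
```

The numerator is always one less than the denominator, so the estimate creeps toward 1 with each new sunrise but never reaches it.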

Three Types of Machine Learning Algorithms

An algorithm is a sequence of specified actions that solves a problem. Computers follow algorithms as the detailed steps needed to carry out an operation. There are many types of machine learning algorithms, in addition to the ones listed above.

Which algorithm is used depends on the complexity and type of problem that needs to be solved, such as clustering (finding how data groups together) or regression (predicting a real-value output). A few machine learning algorithms are:

Decision Tree Algorithms

Decision trees are one type of algorithm that can be applied in many settings: retail, finance, pharmaceuticals, etc. The machine builds a tree of decisions and their possible outcomes, follows each branch down to its conclusion, and works out the probability of each result.

For example, a bank uses decision tree algorithms to decide whether to finance a mortgage. Drug companies use these algorithms during drug trials to work out the probability of side effects and to calculate the expected average cost of treatment.
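As a rough sketch of how such a tree is learned, the code below finds a single split (a one-level tree, sometimes called a decision stump) for a mortgage-style decision. The six loan records are invented for illustration, and scoring candidate splits by Gini impurity is one common choice rather than the only one. A full decision tree would repeat the same search inside each branch.

```python
# Hypothetical loan records: (annual income in $1000s, loan was repaid?).
records = [(25, False), (32, False), (40, False),
           (48, True), (55, True), (63, True)]

def gini(labels):
    """Gini impurity of a list of booleans; 0.0 means all labels agree."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(rows):
    """Try each income as a threshold; keep the lowest weighted impurity."""
    best = None
    for threshold in sorted({income for income, _ in rows}):
        left = [repaid for income, repaid in rows if income < threshold]
        right = [repaid for income, repaid in rows if income >= threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if best is None or score < best[0]:
            best = (score, threshold)
    return best

score, threshold = best_split(records)

def approve(income):
    """Follow the learned branch: approve at or above the threshold."""
    return income >= threshold

print(threshold, approve(30), approve(60))  # 48 False True
```

With this toy data the split at an income of 48 separates the two outcomes perfectly, so the impurity score is zero. On real data the branches would stay mixed, and the share of each outcome in a branch serves as the estimated probability.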

Random Forest Algorithms

Random Forest is another commonly used algorithm. It builds multiple Classification and Regression Trees (CART), each from a different random sample of observations and a different random subset of variables. The randomness lies in how each tree is built; the underlying data itself is not altered. It is used for classification and regression predictive modeling.

For instance, say you have 1,000 observations in a population with 10 variables. The Random Forest algorithm will take a random sample of 100 observations and five randomly chosen variables to build a CART model. It repeats this process over and over again, and then makes a final prediction on each observation by combining the predictions of all the trees: typically a majority vote for classification and an average for regression.
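A scaled-down sketch of that procedure follows. To stay short it makes several assumptions that are not in the text: each "tree" is a single split rather than a full CART model, observations are sampled with replacement, the data set is synthetic (40 observations with 4 variables), and the trees are combined by majority vote. All function names are hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity for 0/1 labels; 0.0 means the group is pure."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def majority(labels, default):
    return Counter(labels).most_common(1)[0][0] if labels else default

def fit_stump(rows, features):
    """One-split tree: best (variable, threshold) by weighted Gini impurity."""
    fallback = majority([y for _, y in rows], 0)
    best = None
    for f in features:
        for t in sorted({x[f] for x, _ in rows}):
            left = [y for x, y in rows if x[f] < t]
            right = [y for x, y in rows if x[f] >= t]
            score = len(left) * gini(left) + len(right) * gini(right)
            if best is None or score < best[0]:
                best = (score, f, t, majority(left, fallback),
                        majority(right, fallback))
    return best[1:]  # (variable, threshold, label_if_below, label_if_not)

def fit_forest(rows, n_trees=25, sample_size=10, n_features=2, seed=0):
    rng = random.Random(seed)
    all_features = range(len(rows[0][0]))
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in range(sample_size)]  # random observations
        features = rng.sample(all_features, n_features)         # random variables
        forest.append(fit_stump(sample, features))
    return forest

def predict(forest, x):
    """Final prediction: majority vote over every tree's prediction."""
    votes = [(lo if x[f] < t else hi) for f, t, lo, hi in forest]
    return Counter(votes).most_common(1)[0][0]

# Synthetic data: class 0 has low values on every variable, class 1 high.
data_rng = random.Random(1)
rows = ([([data_rng.uniform(0, 4) for _ in range(4)], 0) for _ in range(20)] +
        [([data_rng.uniform(6, 10) for _ in range(4)], 1) for _ in range(20)])

forest = fit_forest(rows)
print(predict(forest, [1, 1, 1, 1]), predict(forest, [10, 10, 10, 10]))
```

No single tree sees all of the observations or all of the variables, so individual trees can be wrong. The vote across many differently built trees is what makes the combined prediction more stable.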

K-Means Algorithms

K-Means is an unsupervised machine learning algorithm used to solve clustering problems. It divides a set of unlabeled data points (points with no external classification) into a chosen number (k) of groups, called clusters. Each iteration of the algorithm assigns each point to the cluster whose center is nearest, then recomputes each center as the mean of its points, so that points with similar features end up together. Data points can be tracked over time to detect changes in the clusters.

K-Means algorithms can confirm assumptions about what types of groups exist in a specific data set, or be used to discover unknown clusters. Business use cases include grouping inventory by sales activity and detecting anomalies within data, such as bot activity.
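The iteration can be sketched as follows. The six two-dimensional points are invented for illustration, and starting the cluster centers at the first k points is a simplification chosen to keep the example deterministic; random starting points are more common in practice. The function name `k_means` is a hypothetical helper.

```python
import math

def k_means(points, k, max_iters=100):
    centroids = [tuple(p) for p in points[:k]]  # simplification: fixed start
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point joins the cluster with the nearest center.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the clusters are stable
        labels = new_labels
        # Update step: move each center to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centroids[c] = tuple(sum(axis) / len(members)
                                     for axis in zip(*members))
    return labels, centroids

# Hypothetical unlabeled data with two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centroids = k_means(points, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The algorithm is never told which points belong together. The grouping emerges from repeating the assign-and-update cycle until nothing changes, and a point that sits far from every center is a candidate anomaly.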

Apache Spark and Machine Learning

Apache Spark is an ultra-fast, distributed framework for large-scale processing of big data. It has built-in modules for machine learning, SQL, streaming analytics (Spark Streaming), and graph processing (GraphX).

The Spark ecosystem includes MLlib (machine learning library), which provides scalable implementations of common machine learning tasks such as classification, regression, clustering, and more. Spark can, for example, power intelligent data pipelines that connect real-time and batch data for real-time analytics and up-to-the-minute business intelligence.

Talend puts all that machine learning power at your fingertips.

Talend and Machine Learning

The Talend platform is the first big data integration system built on Hadoop and Apache Spark. Pre-built drag-and-drop developer components leverage Spark machine learning classifiers in a single tool. Graphical tools and wizards generate native code to get your organization set up with Hadoop and Apache Spark in minutes.

Talend can help your company bridge the gap between business, IT, and data scientists to seamlessly deploy critical machine learning models. Take a look at the blog post “How to Operationalize Machine Learning.”
