Cheat Sheet: Machine Learning with KNIME Analytics Platform, Study notes of Machine Learning

Typology: Study notes

2021/2022

Uploaded on 07/05/2022

lee_95


Cheat Sheet: Machine Learning with KNIME Analytics Platform

Resources

  • E-Books: Learn even more with the KNIME books. From basic concepts in “KNIME Beginner’s Luck”, to advanced concepts in “KNIME Advanced Luck”, through to examples of real-world case studies in “Practicing Data Science”. Available for purchase at knime.com/knimepress
  • KNIME Blog: Engaging topics, challenges, industry news, and knowledge nuggets at knime.com/blog
  • KNIME Hub: Search, share, and collaborate on KNIME workflows, nodes, and components with the entire KNIME community at hub.knime.com
  • KNIME Forum: Join our global community and engage in conversations at forum.knime.com
  • KNIME Server: The enterprise software for team-based collaboration, automation, management, and deployment of data science workflows as analytical applications and services. Visit knime.com/server for more information.

SUPERVISED LEARNING / UNSUPERVISED LEARNING

CLASSIFICATION

Logistic Regression: A statistical algorithm that models the relationship between the input features and the categorical output classes by maximizing a likelihood function. Originally developed for binary problems, it has been extended to problems with more than two classes (multinomial logistic regression).
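As a rough illustration of "maximizing a likelihood function", the sketch below fits a binary logistic regression on a single input feature by plain gradient ascent on the log-likelihood. The function names, learning rate, epoch count, and toy data are assumptions made for this example; they are not KNIME settings, and real solvers use more refined optimizers.

```python
import math

def sigmoid(z):
    # Maps a linear score to a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit weight w and bias b by gradient ascent on the mean log-likelihood."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the log-likelihood: sum of (label - predicted probability) terms.
        grad_w = sum((y - sigmoid(w * x + b)) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((y - sigmoid(w * x + b)) for x, y in zip(xs, ys)) / n
        w += lr * grad_w
        b += lr * grad_b
    return w, b

# Hypothetical toy data: small x values belong to class 0, large ones to class 1.
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(xs, ys)
```

The predicted probability of class 1 for a new value x is then sigmoid(w * x + b).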

Decision Tree: Follows the C4.5 decision tree algorithm. These algorithms generate a tree-like structure, creating data subsets, aka tree nodes. At each node, the data are split based on one of the input features, generating two or more branches as output. Further splits are made in subsequent nodes until a node is generated where all or almost all of the data belong to the same class.

Naive Bayes: Based on Bayes' theorem and assuming statistical independence between input features (thus "naive"), this algorithm estimates the conditional probability of each output class given the vector of input features. The class with the highest conditional probability is assigned to the input data.
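The class scoring described above can be sketched for categorical features as follows. This is a minimal illustration, not the KNIME node's implementation: the function names are hypothetical, and the add-one (Laplace) smoothing with a fixed count of two possible values per feature is an assumption added to avoid zero probabilities.

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(label, i)][value] += 1
    return class_counts, value_counts

def predict_nb(model, row):
    """Return the class with the highest (unnormalized) conditional probability."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(row):
            # Independence assumption: multiply per-feature likelihoods.
            # Smoothing assumption: +1 / +2, i.e. two possible values per feature.
            score *= (value_counts[(label, i)][value] + 1) / (count + 2)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy data with two categorical features.
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "hot")]
labels = ["no", "no", "yes", "yes"]
model = train_nb(rows, labels)
```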

Generalized Linear Model (GLM): A statistics-based flexible generalization of ordinary linear regression, valid also for non-normal distributions of the target variable. GLM uses the linear combination of the input features to model an arbitrary function of the target variable (the link function) rather than the target variable itself.

Deep Learning: Deep learning extends the family of ANNs with deeper architectures and additional paradigms, e.g. Recurrent Neural Networks (RNN). The training of such networks has been enabled by recent advances in hardware performance as well as parallel execution.

Support Vector Machine (SVM): A supervised algorithm constructing a set of discriminative hyperplanes in high-dimensional space. In addition to linear classification, SVMs can perform non-linear classification by implicitly mapping their inputs into high-dimensional feature spaces, where the two classes are linearly separable.

NUMERIC PREDICTION

TIME SERIES ANALYSIS

Auto-Regressive Integrated Moving Average (ARIMA): A linear Auto-Regressive (AR) model is constructed on a specified number p of past values; data are prepared by a degree of differencing d to correct non-stationarity; while a linear combination - named Moving Average (MA) - models the q past residual errors. All ARIMA model parameters are estimated concurrently by various algorithms, mostly following the Box–Jenkins approach.

ML-based TSA: A numeric prediction model trained on vectors of past values can predict the current numeric value of the time series.
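The "vectors of past values" can be built by sliding a window over the series. The helper below is a hypothetical sketch of that preparation step (the name make_lagged and the parameter n_lags are assumptions for this example); any numeric prediction model could then be trained on the resulting pairs.

```python
def make_lagged(series, n_lags):
    """Turn a series into (vector of past values, current value) training pairs."""
    X, y = [], []
    for t in range(n_lags, len(series)):
        X.append(series[t - n_lags:t])  # the n_lags values before time t
        y.append(series[t])             # the value to predict at time t
    return X, y

X, y = make_lagged([1, 2, 3, 4, 5], n_lags=2)
# X is [[1, 2], [2, 3], [3, 4]] and y is [3, 4, 5]
```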

Custom Ensemble Model: Combining different supervised models to form a custom ensemble model. The final prediction can be based on majority vote as well as on the average or other functions of the output results.

XGBoost: An optimized distributed library for machine learning models in the gradient boosting framework, designed to be highly efficient, flexible, and portable. It features regularization parameters to penalize complex models, effective handling of sparse data for better performance, parallel computation, and more efficient memory usage.

CLUSTERING

Self-Organizing Tree Algorithm (SOTA): A special Self-Organizing Map (SOM) neural network. Its cell structure is grown using a binary tree topology.

Fuzzy c-Means: One of the most widely used fuzzy clustering algorithms. It works similarly to the k-Means algorithm, but it allows for data points to belong to more than one cluster, with different degrees of membership.

RECOMMENDATION ENGINES

Association Rules: The node reveals regularities in co-occurrences of multiple products in large-scale transaction data recorded at points-of-sale. Based on the Apriori algorithm, the most frequent itemsets in the dataset are used to generate recommendation rules.

Collaborative Filtering: Based on the Alternating Least Squares (ALS) technique, it produces recommendations (filtering) about the interests of a user by comparing their current preferences with those of multiple users (collaborating).

k-Nearest Neighbor (kNN): A non-parametric method that assigns the majority class among the k most similar points in the training data, based on a pre-defined distance measure. Class attribution can be weighted by the distance to the k-th point and/or by the class probability.
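A minimal sketch of the unweighted variant, assuming Euclidean distance as the pre-defined distance measure and a simple majority vote among the k neighbors. The function name and toy data are hypothetical; the distance and probability weighting mentioned above is not shown.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Return the most frequent class among the k training points closest to query."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5), (5, 6)]
labels = ["a", "a", "a", "b", "b", "b"]
```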

Cross-Validation: A model validation technique for assessing how the results of a machine learning model will generalize to an independent dataset. A model is trained and validated N times on different pairs of training set and test set, both extracted from the original dataset. Some basic statistics on the resulting N error or accuracy measures give insight into overfitting and generalization.
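The N train/test pairs can be generated by partitioning the row indices into N folds and using each fold once as the test set. The sketch below is one way to do this, assuming no shuffling or stratification; the function name is hypothetical.

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train_indices, test_indices) pairs; each row is tested exactly once."""
    indices = list(range(n_rows))
    for fold in range(n_folds):
        test = indices[fold::n_folds]  # every n_folds-th row, starting at `fold`
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        yield train, test

folds = list(k_fold_indices(10, 5))
```

A model would be trained on each train list and scored on the matching test list, and the N scores summarized, for example by their mean and standard deviation.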

Numeric Error Measures: Evaluation metrics for numeric prediction models quantifying the error size and direction. Common metrics include RMSE, MAE, or R². Most of these metrics depend on the range of the target variable.
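The three metrics named above follow directly from their standard definitions. The function names are chosen for this sketch only.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot
```

RMSE and MAE are expressed in the units of the target variable, which is why they depend on its range; R² is unitless.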

NUMERIC PREDICTION & CLASSIFICATION

Artificial Neural Networks (ANN, NN): Inspired by biological nervous systems, Artificial Neural Networks are based on architectures of interconnected units called artificial neurons. Artificial neurons' parameters and connections are trained via dedicated algorithms, the most popular being the Back-Propagation algorithm.

Linear/Polynomial Regression: Linear Regression is a statistical algorithm to model a multivariate linear relationship between the numeric target variable and the input features. Polynomial Regression extends this concept to fitting a polynomial function of a pre-defined degree.

Regression Tree: Builds a decision tree to predict numeric values through a recursive, top-down, greedy approach known as recursive binary splitting. At each step, the algorithm splits the subsets represented by each node into two or more new branches using a greedy search for the best split. The average value of the points in a leaf produces the numerical prediction.

BAGGING

Random Forest of Decision/Regression Trees: Ensemble model of multiple decision/regression trees trained on different subsets of data. Data subsets with the same number of rows are bootstrapped from the original training set. At each node, the split is performed on a subset of sqrt(x) features from the original x input features. Final prediction is based on a hard vote (majority rule) or soft vote (averaging all probabilities or numerical predictions) on all involved trees.

Tree Ensemble of Decision/Regression Trees: Ensemble model of multiple decision/regression trees trained on different subsets of data. Data subsets with fewer or equal rows and fewer or equal columns are bootstrapped from the original training set. Final prediction is based on a hard vote (majority rule) or soft vote (averaging all probabilities or numeric predictions) on all involved trees.

BOOSTING

Gradient Boosted Regression Trees: Ensemble model combining multiple sequential simple regression trees into a stronger model. The algorithm builds the model stagewise. At each iteration, a simple regression tree is fitted to predict the residuals of the current model, following the gradient of the loss function. This leads to an increasingly accurate and complex overall model. The same regression trees can also be used for classification.
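The stagewise fitting of residuals can be sketched with the simplest possible regression tree, a single-split "stump" on one input feature. This is an illustration under stated assumptions: squared loss (so the negative gradient equals the residual), a fixed learning rate, and hypothetical function names; it does not reproduce the KNIME node or XGBoost.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split on one feature, minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the maximum so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    trees = []
    predictions = [base] * len(xs)
    for _ in range(n_trees):
        # For squared loss, the residuals are the negative gradient.
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        predictions = [p + learning_rate * tree(x) for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(tree(x) for tree in trees)

# Hypothetical toy data: a step from 1 to 5 between x = 3 and x = 4.
model = gradient_boost([1, 2, 3, 4, 5, 6], [1, 1, 1, 5, 5, 5])
```

Each added stump shrinks the remaining residuals, which is the sense in which the overall model becomes increasingly accurate and complex.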

EVALUATION

ROC Curve: A graphical representation of the performance of a binary classification model with false positive rates on the x-axis and true positive rates on the y-axis. Multiple points for the curve are obtained for different classification thresholds. The area under the curve (AUC) is the resulting metric value.
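The curve points and the area under them can be computed directly from class labels and model scores. In this sketch, labels are assumed to be 1 for the positive class and 0 for the negative class, each distinct score is used as a threshold, and the area is obtained with the trapezoidal rule; the function names are hypothetical.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical toy data: one negative example is scored above one positive example.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(labels, scores)
```

For this toy data the area is 8/9, which matches the fraction of positive/negative pairs that the scores rank correctly.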

ENSEMBLE LEARNING

Hierarchical Clustering: Builds a hierarchy of clusters by either collecting the most similar (agglomerative approach) or separating the most dissimilar (divisive approach) data points and clusters, according to a selected distance measure. The result is a dendrogram clustering the data together bottom-up (agglomerative) or separating the data in different clusters top-down (divisive).

k-Means: The n data points in the dataset are clustered into k clusters based on the shortest distance from the cluster prototypes. The cluster prototype is taken as the average data point in the cluster.
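The alternation between assigning points to the nearest prototype and recomputing each prototype as the cluster average can be sketched as below. The initialization with the first k points and the fixed iteration count are simplifying assumptions for this example; the function name is hypothetical.

```python
import math

def k_means(points, k, n_iterations=10):
    """Return (prototypes, clusters) after a fixed number of assign/update rounds."""
    centroids = [points[i] for i in range(k)]  # assumption: first k points as initial prototypes
    clusters = []
    for _ in range(n_iterations):
        # Assignment step: each point joins the cluster with the closest prototype.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: each prototype becomes the average of its members.
        for c, members in enumerate(clusters):
            if members:  # keep the old prototype if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, clusters

centroids, clusters = k_means([(0, 0), (0, 1), (5, 5), (5, 6)], k=2)
```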

DBSCAN: A density-based non-parametric clustering algorithm. Data points are classified as core, density-reachable, and outlier points. Core and density-reachable points in high-density regions are clustered together, while points with no close neighbors in low-density regions are labeled as outliers.

Long Short-Term Memory (LSTM) Units: LSTM units produce a hidden state by processing m × n tensors of input values, where m is the size of the input vector at any time and n the number of past vectors. The hidden state can then be transformed into the current vector of numeric values. LSTM units are suited for time series prediction as values from past vectors can be remembered or forgotten through a series of logic gates.

Accuracy Measures: Evaluation metrics for a classification model calculated from the values in the confusion matrix, such as sensitivity and specificity, precision and recall, or overall accuracy.
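Starting from the four confusion matrix counts, the listed measures are simple ratios. The sketch below uses the standard definitions; the function names and the choice of positive class are assumptions for this example.

```python
def confusion_counts(actual, predicted, positive):
    """Count true positives, false negatives, false positives, and true negatives."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    return tp, fn, fp, tn

def accuracy_measures(tp, fn, fp, tn):
    """Standard ratios derived from the confusion matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # also called recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }

# Hypothetical toy data with "yes" selected as the positive class.
actual = ["yes", "yes", "yes", "no", "no", "no", "no", "no"]
predicted = ["yes", "yes", "no", "yes", "no", "no", "no", "no"]
measures = accuracy_measures(*confusion_counts(actual, predicted, positive="yes"))
```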

Supervised Learning: A set of machine learning algorithms to predict the value of a target class or variable. They produce a mapping function (model) from the input features to the target class/variable. To estimate the model parameters during the training phase, labeled example data are needed in the training set. Generalization to unseen data is evaluated on the test set data via scoring metrics.

Classification: A type of supervised learning where the target is a class. The model learns to produce a class score and to assign each vector of input features to the class with the highest score. A cost can be introduced to penalize one of the classes during class assignment.

Numeric Prediction: A type of supervised learning for numeric target variables. The model learns to associate one or more numbers with the vector of input features. Note that numeric prediction models can also be trained to predict class scores and therefore can be used for classification problems too.

Time Series Analysis: A set of numeric prediction methods to analyze/predict time series data. Time series are time-ordered sequences of numeric values. In particular, time series forecasting aims at predicting future values based on previously observed values.

Evaluation: Various scoring metrics for assessing model quality - in particular, a model’s predictive ability or propensity to error.

Ensemble Learning: A combination of multiple models from supervised learning algorithms to obtain a more stable and accurate overall model. The most commonly used ensemble techniques are Bagging and Boosting.

Bagging: A method for training multiple classification/regression models on different randomly drawn subsets of the training data. The final prediction is based on the predictions provided by all the models, thus reducing the chance of overfitting.

Boosting: A method for training a set of classification/regression models iteratively. At each step, a new model is trained on the prediction errors and added to the ensemble to improve the results from the previous model state, leading to higher accuracy after each iteration.

Unsupervised Learning: A set of machine learning algorithms to discover patterns in the data. A labeled dataset is not required, since data are ultimately organized and/or transformed based on similarity or statistical measures.

Clustering: A branch of unsupervised learning algorithms that groups data together based on similarity measures, without the help of labels, classes, or categories.

Recommendation Engines: A set of algorithms that use known information about user preferences to predict items of interest.

Confusion Matrix: A representation of a classification task’s success through the count of matches and mismatches between the actual and predicted classes, aka true positives, false negatives, false positives, and true negatives. One class is arbitrarily selected as the positive class.

[Figure: workflow outline. TRAINING nodes: Read Data, Transform, Learner, Predictor, Scorer, Write Model. DEPLOYMENT nodes: Read Model, Data Input, Predictor, Data Output.]